Important HPC Checkpoint NOW works with distributed file-systems: /dfs1, /dfs2, /fast-scratch, etc.

Application checkpointing is a facility that saves the complete state of an executing program to disk so that the process can later be restored and resumed from the last checkpoint file (continuing the job).

The benefits of checkpointing in the HPC environment are immense. HPC jobs can run for days or weeks, and if your job stops due to a node crash, power failure, job error, NSA snooping :-), etc., you would normally have to start the job from the beginning. With automatic checkpoints, the job resumes from the last successful checkpoint instead. We take BLCR checkpoints every 6 hours on HPC, so if your job crashes you tend to lose at most 6 hours of work instead of days or weeks.

Another benefit of job checkpointing, in conjunction with the HPC FreeQ system, is that jobs submitted to the free64 queue (for example) will literally migrate (bounce) from node to node until they complete, instead of being suspended and staying suspended. As long as free nodes are available, your jobs will never sit suspended; they will automatically continue running by bouncing from node to node until they complete.

All free queues ( like free64 ) have wall-clock run-time limits. Using checkpoint ensures that your jobs continue to run instead of being killed when the run-time limit expires. When a job reaches the queue run-time limit, a checkpoint is taken and the job goes back into the queue with a fresh run-time limit, ready to continue running when a node is available for work.

Users who have access to private queues get the best of both worlds by using checkpoint and requesting both their private queue(s) and the free64 queue. For example, the BIO group can request:

#$ -q bio,abio,free64

And Grid Engine will look for nodes on all of the queues listed. With that many nodes available, your jobs have a much better chance of running.

The automatic checkpointing and migration of jobs is done by the Grid Engine scheduler together with some fancy home-grown scripts. Grid Engine functional FairShare ( which is enabled on HPC ) ensures that all jobs waiting to run receive equal and fair access to the available nodes.

Tip Grid Engine FairShare ensures that no one user will dominate, monopolize, or abuse the available HPC computing resources on private or public queues.
Note MPI Job checkpointing is NOT ready yet. OpenMPI has to be re-compiled with BLCR.

Quick Start Guide:

For those who want to get going with checkpointing ASAP, the easiest way is to use qsub with the following checkpoint option and you are good to go. Assuming your Grid Engine job script is job.sh:

qsub -ckpt blcr  job.sh

Your job ( job.sh ) will now be running with checkpoint and with no changes to the original job script.

If you want to add the Grid Engine checkpoint directive inside your job script ( recommended ) so that all you need to do is qsub job.sh, then edit your job script ( job.sh ) and add this Grid Engine directive:

#$ -ckpt blcr

That’s it. Here is a complete Grid Engine job script using CheckPoint for NAMD as an example:

#!/bin/bash
#$ -N My-Job
#$ -q free64
#$ -pe openmp 64
#$ -ckpt blcr
#$ -m beas

module load NAMD-multicore/2.9

charmrun +p $CORES  `which namd2` input > output

Congratulations, you are now running with HPC CheckPoint.

The Details for those who want to know more:

You do not need to read the rest of this manual in order to use HPC CheckPoint. The following is intended for those who want to know more about how the process works and/or how to review the CheckPoint log files to see how your jobs are CheckPointing (or not).

BLCR:

On HPC we are using the Berkeley Lab Checkpoint/Restart (BLCR) facility for the checkpoint process.

There is a good BLCR FAQ available here that describes what BLCR does and, more importantly, the limitations of BLCR, which you will want to review to make sure that your jobs are checkpointable!

There is also another checkpoint facility called Distributed MultiThreaded CheckPointing (DMTCP) that we are evaluating, however, DMTCP has not yet been setup on the HPC cluster.

To load the BLCR module:

module load blcr

and it will pick up the latest version. Three commands, along with their manual pages, will now become available to you:

  • cr_run

  • cr_checkpoint

  • cr_restart

The next section illustrates how to use BLCR on HPC. Instead of re-stating how to use the above commands here, we refer you to the BLCR User’s Guide, which has a lot of good information and explanation on their use.
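That said, here is a rough sketch of the standalone BLCR workflow outside of Grid Engine, just to give an idea of what these commands do ( my_app and my_app.context are placeholders, not part of the HPC setup ):

module load blcr

# Start a program under BLCR control so it can be checkpointed ( my_app is a placeholder )
cr_run ./my_app &

# Checkpoint the whole process tree to a named context file and terminate the process
cr_checkpoint --tree --term -f my_app.context $!

# Later, resume the program from where it left off using the saved context file
cr_restart my_app.context

On HPC you normally never run these commands by hand; the Grid Engine integration described next does it for you.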

How to Use BLCR on HPC:

The UCI HPC Campus Cluster uses Grid Engine as the resource manager for the scheduling of jobs. BLCR has been integrated into Grid Engine and can be used with your current HPC scripts by simply adding the following GE directive:

#$ -ckpt blcr

That’s it. You then submit your job as you normally would ( qsub job.sh ).

You may also add checkpoint functionality to your existing jobs without having to edit or change your current Grid Engine scripts by simply passing the checkpoint option on the qsub command line. Example:

qsub -ckpt blcr  job.sh
Grid Engine Option    What it does
#$ -ckpt blcr         Use the Grid Engine BLCR checkpoint facility.

Example:

The easiest way to see how the process works on HPC is by using the Grid Engine scripts and programs that are available in the demo ( ~demo/blcr ) directory.

From your HPC account, open two windows and in the first window do:

$ mkdir blcr-test
$ cd blcr-test
$ cp ~demo/blcr/* .
$ qsub job.sh
Your job-array 971115.1-5:1 ("Testing-BLCR") has been submitted
$ ./display.sh

In the first window, you will see output similar to this:

job-ID  prior   name    user     state submit       queue              slots task

 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  1
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  2
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  3
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  4
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  5

Job # 971115 has started running on HPC ( state has "r" for run ) with 5 tasks all running on compute-1-12.

Now to illustrate how the checkpoint & restart process works, we are going to suspend job array task # 2.

By sending the suspend signal to our job, Grid Engine will not leave the job suspended; it will only briefly suspend it, checkpoint the process, and then immediately terminate it. The job then goes back into the queue, and the Grid Engine FairShare process will resume it from the last checkpoint on another node when one becomes available. If no node is available, the job will sit in the queue, competing with the other waiting jobs, until Grid Engine FairShare starts it up again when a node becomes ready for work.

Ok, now from the second window force suspend job array task # 2:

$ qmod -sj 971115.2
jfarran - suspended job-array task 971115.2

The following happens rather quickly, so it’s easy to miss. Looking at your display in window one, you will notice the task # 2 state change from "r" to "s" (suspended) and the job will then be re-queued "Rq". If a node is ready for work, the job will start again (resume) from the last checkpoint file on a new node (or the same node - it depends). The job will now have an "R" preceding the job state "r" to indicate that this is a Restarted job (Rr):

job-ID  prior   name    user     state submit       queue              slots task

 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  1
 971115 0.94746 Testing jfarran    Rr  10/18/2013   free64@compute-2-16   1  2
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  3
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  4
 971115 0.94746 Testing jfarran    r   10/18/2013   free64@compute-1-12   1  5
Note In the example above task 971115.2 migrated from compute-1-12 to compute-2-16 on the free64 queue.
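If you prefer to watch these state changes yourself rather than using the demo’s display.sh script, the standard Grid Engine qstat command under watch will do ( the refresh interval is up to you ):

$ watch -n 5 qstat -u $USER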

The demo job array above will run for approximately 3 minutes, with each task counting from 1 to 180 and saving its output to output-task-id in the job submission directory ( $SGE_O_WORKDIR ), the blcr-test directory in our case.

Understanding the Files & Logs:

All CheckPoint data files have been moved to the /checkpoint/$USER file system.

CheckPoint files are now located at:

  • /checkpoint/$USER

As the job runs you will see files similar to these in the directory from which you submitted the job:

$ ls
count    display.sh  output-1    output-3   output-5
cleanup.sh    count.c  job.sh      output-2    output-4
File/Dir          What it contains
count             Our counting program.
count.c           The C source code for the count program.
job.sh            Our Grid Engine job script.
output-task-id*   The output from our count program.

The checkpoint data and logs from our Grid Engine job are located at:

/dfs1/checkpoint/$USER/ckpt.971115

Looking at output-2 we have:

$ more output-2
Counting demo starting with pid=(17276)
[compute-1-12]  Count = 1
[compute-1-12]  Count = 2
[compute-1-12]  Count = 3
[compute-1-12]  Count = 4
[compute-1-12]  Count = 5
[compute-1-12]  Count = 6
[compute-1-12]  Count = 7
[compute-2-16]  Count = 8
[compute-2-16]  Count = 9
[compute-2-16]  Count = 10
...
[compute-2-16]  Count = 180

Notice how after count # 7, the node changed from compute-1-12 to compute-2-16. This was the point when we issued the suspend command qmod -sj 971115.2 above. Also notice how the count continued correctly from 7 to 8 and onwards.

Note The running job was check-pointed on compute-1-12, killed and then restarted (resumed) on compute-2-16 from the last checkpoint context file.

The ckpt.971115 directory (/dfs1/checkpoint/$USER/ckpt.971115) is created by the checkpoint process and contains the files needed by checkpoint as well as log files recording what transpired during the job run.

Change directory to the checkpoint dir to see the logs:

$ cd /dfs1/checkpoint/$USER/ckpt.971115
$ ls
1/  2/  3/  4/  5/  log.1  log.2  log.3  log.4  log.5

The log.task-id files are the checkpoint logs of what happened as the job ran.

These log files are very important when you want to know what is happening as your jobs run. There will be one log file for each task.

Note If the checkpoint process is not working, you will want to look at the checkpoint logs to see what happened.
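For example, to follow the checkpoint log for task # 2 of the demo job ( job 971115 from above ) while it runs:

$ tail -f /dfs1/checkpoint/$USER/ckpt.971115/log.2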

The 1 through 5 are directories created by the checkpoint process, one for each task.

Here is a table of the contents of the checkpoint directory 2/ and what the files contain. All of the files in directory 2 pertain to task # 2:

File/Dir        What it contains
context         Checkpoint data file ( automatic checkpoints are done every 6 hours or when a job is migrated ).
context-first   The first checkpoint file, taken soon after the job starts.
context-prev    The previous checkpoint file. This is kept in case the current checkpoint file is bad, so that there is an earlier checkpoint to revert to.
job             A copy of our job script.
node-history    A list of the nodes the job ran on.
report          Output produced by the job ( if any ).
tmp             Directory for temp files ( checkpoint needs to have everything available, including temp files, when the job is restarted ).
pid             The process ID of the job that was check-pointed.
pstree          A pstree listing of our job.

Note For non-array jobs ( single jobs ), the task-id will be zero: log.0 and directory 0.

USER Job CheckPoint Flags & Options:

The following are CheckPoint options you can request in your job script and what they do.

export CHECKPOINT_CLEAN_ALL_ON_SUCCESS=1
export CHECKPOINT_ALWAYS_CLEAN_ALL=0
export CHECKPOINT_CLEAN_REPORT=0
export CHECKPOINT_SET_TMP=1
export CHECKPOINT_CR_OPTIONS=""
FLAG                              DEFAULT      WHAT IT DOES
CHECKPOINT_CLEAN_ALL_ON_SUCCESS   ON  (1)      When on, wipes everything in the ckpt.job-id directory IF the job completed successfully and there is no output in the "report" file.
CHECKPOINT_ALWAYS_CLEAN_ALL       OFF (0)      When on, wipes everything in the ckpt.job-id directory regardless of whether the job completed successfully or not.
CHECKPOINT_CLEAN_REPORT           OFF (0)      When on, wipes everything in the ckpt.job-id directory IF the job completed successfully, regardless of whether the "report" file has data or not.
CHECKPOINT_SET_TMP                ON  (1)      When on, TMPDIR, TMP, and TEMP are set to /dfs1/checkpoint/$USER/ckpt.job-id/task-id/tmp
CHECKPOINT_CR_OPTIONS             empty ("")   Options to pass to the cr_checkpoint command.

Note If a flag is NOT included in your job script, its default value will take effect.

CHECKPOINT_CLEAN_ALL_ON_SUCCESS is turned on by default, which is highly recommended. It is very helpful with large job arrays: if all tasks complete successfully, the ckpt.job-id directory ( ckpt.971115 ) will be empty when the job array finally ends, so you will know at a glance which jobs succeeded and which failed.

To turn off CHECKPOINT_CLEAN_ALL_ON_SUCCESS in order to preserve the checkpoint logs and files, include this line in your job script:

export CHECKPOINT_CLEAN_ALL_ON_SUCCESS=0

To turn CHECKPOINT_CLEAN_ALL_ON_SUCCESS to on, you don’t need to do anything as this is the default.

Note If a job crashes and does not complete successfully, the contents of the ckpt.job-id directory for that job will NOT be removed, in order to protect and preserve the job checkpoint data and logs ( unless the always-clean-all flag is set ).

The CHECKPOINT_CR_OPTIONS flag may be helpful if you are experiencing jobs not resuming correctly. For example, you can tell cr_checkpoint to save everything when doing a checkpoint with the "--save-all" option. Example:

export CHECKPOINT_CR_OPTIONS="--save-all"

Please make sure you include all of your options in "double quotes". Also, please consult the BLCR User’s Guide and/or the manual page for cr_checkpoint options you can use. Generally speaking, you don’t have to worry about sending any options to cr_checkpoint for normal jobs.

Warning Specifying --save-all will create very large context checkpoint files, since everything for the running job has to be saved to disk, so please use it with caution and NOT with large-memory jobs.
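Putting it together, here is a sketch of a job script that sets these flags ( my_program is just a placeholder for your own application; see the warning above before using --save-all ):

#!/bin/bash
#$ -N Ckpt-Flags-Demo
#$ -q free64
#$ -ckpt blcr

# Keep the ckpt.job-id directory and its logs even if the job completes successfully
export CHECKPOINT_CLEAN_ALL_ON_SUCCESS=0

# Ask cr_checkpoint to save everything ( creates large context files )
export CHECKPOINT_CR_OPTIONS="--save-all"

./my_program > output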

Node Memory size:

As jobs migrate from one node to another, the amount of memory each node has may become an issue if the new node does not have sufficient physical memory to resume the job. For example, if your job is consuming 300GB of memory, there is no way it can be restarted on a node with only 128GB of physical memory. Therefore, it’s up to you to make sure you request nodes with enough memory.

One quick way to ensure you get nodes having the same memory size is with the Grid Engine mem_size directive. To request nodes that have 512GB of memory, use:

#$ -l mem_size=512

This will make Grid Engine pick nodes that have 512GB of physical memory for your jobs. If you don’t need that much memory you can request a lesser amount:

#$ -l mem_size=256

and Grid Engine will pick nodes having 256GB of memory or more.

Do NOT mix CPU types:

Do NOT mix Intel / AMD CPU types in your queue selection. You never want to do something like this:

#$ -q free48,free64,free32i,free24i

The above queue selection mixes AMD CPU queues ( free48, free64 ) with Intel CPU queues ( free32i, free24i ). It also mixes CPU core counts ( 48- and 64-core nodes vs. 32- and 24-core nodes ).

Think about it: if a job is running with 64 cores on the free64 queue and is then migrated to a 48-core node, it won’t work ( the job will crash ).

Also, if the job is running on an AMD node and then jumps to an Intel node, the CPU instruction set, as well as many other things, will be very different and the job will most likely crash.

Stay with the same CPU type and core count. This will work:

#$ -q free64,pub64

When Things Go Wrong:

Checkpointing a running job and, more importantly, being able to resume it is a very challenging and complicated process. If any file is missing when the job is resumed, the job cannot continue. This is why it’s important to make sure that all temp files are kept in a common area on a shared file-system, and not, for example, in /tmp, which is local to the node.
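For example, since CHECKPOINT_SET_TMP is on by default, you can verify from inside a job running with -ckpt blcr that temporary files will go to the shared checkpoint area rather than node-local /tmp:

# Inside your job script; TMPDIR should point at the shared checkpoint area
echo "TMPDIR = $TMPDIR"    # expected: /dfs1/checkpoint/$USER/ckpt.job-id/task-id/tmp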

Even when you do everything right, resuming a job may fail if the file-system is misbehaving (if there is no checkpoint file the job cannot resume), or the node is having issues.

The list of possible failures is endless and if you run job arrays consisting of tens of thousands of jobs, your chances of having one or more failed tasks increases dramatically.

HPC Checkpoint Safety / Sanity checks:

The HPC checkpoint scripts have built-in smarts for dealing with unexpected failures. For example, a checkpoint may have been taken successfully, yet the context file fails when cr_restart tries to use it.

The HPC checkpoint process will detect corrupted context checkpoint files and try the next context file, exhausting all available checkpoint files until there are no more, at which point it has no option but to abort the restart.

A lot of logging is produced in the log.task-id file for each task, so this is the first place you will want to check to see what transpired.

Tip Check the logs when there are issues with your job.

If the Grid Engine automatic restart process fails ( job died ), not all hope has been lost!

For example, the job may have been restarted on a node with insufficient memory, or another user may have been consuming resources on the restart node, preventing your job from re-starting. When this happens, you will need to try to restart the job manually.

Try to resume the job on the exact same compute node the job was running on before it migrated. If that node is not available, try a similar node or wait until the same node becomes available.

To see a history of the nodes your job ran on, consult the /dfs1/checkpoint/$USER/ckpt.job-id/task-id/node-history file.
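As a rough sketch only ( the exact recovery steps depend on how your job was launched, so please involve HPC staff for anything non-trivial ), a manual restart of task # 2 of the demo job from its latest context file, run on the node the job last ran on, would look something like this:

$ module load blcr
$ cr_restart /dfs1/checkpoint/$USER/ckpt.971115/2/context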

A full treatment of how to resume from failed checkpoint files is beyond the scope of this manual. For further help please consult the HPC staff.

Testing Your Code:

Assuming your jobs are checkpointable and migrate from node to node without issues ( that is, checkpointing is working ), you will still want to run a few test cases to make sure all is working as expected, especially before running large job arrays.

In short, you will want to test the process and check for expected results since job checkpointing is new on HPC.

List of Programs NOT Checkpointing on HPC:

Important This section will be used to keep track of programs that have been tried on HPC with checkpoint and have failed.
  • MPI jobs - OpenMPI has to be recompiled and the Grid Engine scripts updated. This will be done as time permits.

  • Jobs which span multiple nodes - Support for programs like MPI that run on multiple nodes is not ready. This will be done as time permits.

  • NAMD - Checkpoint actually works with NAMD, but the issue is that when the NAMD job continues running on a new node, it runs very slowly. If this is acceptable to you, then you can ignore this.

If you find a program that is failing checkpoint, please let us know so that we can add it to this list.

HELP:

We are extremely short-staffed on HPC, so please review this document and the BLCR manuals as much as possible if you are having issues or questions with checkpointing.

Tip If you ask for help with checkpointing, please make sure you followed and ran the demo checkpoint process listed above. Going through the demo exercise will answer a lot of questions.

If you are still having issues with checkpointing, you are welcome to call or email me, Joseph, at jfarran@uci.edu, or ask for help at hpc-support@uci.edu, and give as much detail as possible.

Happy CheckPointing

J. Farran