Two Grid Engine checkpoint formats are available on HPC. Either can be used in your batch job scripts so that your jobs never stay suspended when running on a suspend-able queue:
JOB CHECKPOINT & CONTINUE ( BLCR )
JOB RESTART
The reason to use either format is that if you are running on a suspend-able queue (like the free64 queue), your job will NOT suspend but rather RESTART on another node; otherwise the job may stay suspended for a very long time.
Of the two formats, JOB CHECKPOINT & CONTINUE ( BLCR ) is generally the more powerful, and it is explained here.
JOB RESTART, on the other hand, is the simpler format with very little overhead, which makes it perfect for large job arrays in which each task runs for a short time.
Which one should I use?
HPC Job restart does just that (nothing more, nothing less): it restarts your job from the beginning on a different node if the node your job is running on receives a suspend signal, or if the node crashes.
The job ID is preserved when a job is restarted: if your job number is 12345, the job ID stays 12345 on the new node. Jobs using job holds ( -hold_jid ) will work just fine because job IDs do not change.
HPC Checkpoint & Continue is significantly more complex and carries a large overhead: it must save the entire working state of the job to disk, which makes it more prone to failures if the file system is having issues, a PID is already in use, etc.
Generally speaking, you will want to use RESTART instead of CONTINUE when:
Running large job arrays in which each task runs for a short time ( ~under a day ).
Jobs that cannot use BLCR ( like MPI jobs ).
You need to run from /dfs1 file-system.
In order to be eligible to use RESTART, your jobs MUST:
Be able to be re-started from the beginning without issues.
Complete under the queue wall-clock time limit (otherwise your job will be stuck in an infinite restart loop).
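Because a restarted task reruns from the beginning, it should be idempotent. One common pattern, sketched here with hypothetical paths and a stand-in workload, is to write output to a temporary file and move it into place only on success, so a rerun cleanly overwrites any partial result:

```shell
#!/bin/bash
# Hypothetical restart-safe task: all names below are illustrative.

run_task() {
  task_id="$1"
  out="results/task_${task_id}.out"
  tmp="${out}.tmp"

  mkdir -p results

  # Stand-in for the real workload; a restarted run simply redoes this step.
  echo "result for task ${task_id}" > "$tmp"

  # Publish atomically only after the work completes, so a task killed
  # mid-run never leaves a half-written final output behind.
  mv "$tmp" "$out"
}

# Grid Engine sets SGE_TASK_ID for array tasks; default to 1 outside the scheduler.
run_task "${SGE_TASK_ID:-1}"
```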
How to use Restart:
To use restart, simply add the following line to your batch submit script:
#$ -ckpt restart
Or you can also specify it at the command line when you submit your job:
qsub -ckpt restart job.sh
That’s it. Your jobs will now restart on a different node instead of staying suspended in the event the node receives a suspend signal, or the node crashes.
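Put together, a minimal submit script might look like the sketch below (the job name and queue are placeholders; to the shell, the `#$` lines are just comments, so the script also runs normally outside the scheduler):

```shell
#!/bin/bash
#$ -N restart_demo      # hypothetical job name
#$ -q free64            # a suspend-able queue
#$ -ckpt restart        # restart on another node instead of suspending

# ... your actual workload goes here ...
```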
Why use Restart?
Imagine you submit a 20,000-task job array to the free64 queue with 5,000+ cores. Over time, several tasks will end up suspended when the node owners use their nodes. At the end of the 20,000-task array, you will have a huge mess of completed and suspended tasks, and/or tasks that failed because they were suspended and ran out of the queue wall-clock run-time limit.
With restart, each job/task will not suspend but rather restart on a different node. If the node crashes, those jobs/tasks that were running when the node crashed will automatically be restarted on another working node and continue to run.
Yes, there is waste because jobs/tasks are restarted from the beginning. However, the overall speed-up is significant: not everyone uses their nodes 24/7, so you will end up with many more completed jobs/tasks than restarted ones.
The restart function has a very useful option to notify you in the event of job failures. This option is really meant for job arrays and not for regular jobs.
With job arrays, especially large job arrays, you can request an email listing failed tasks. You will receive one email at the end of the job array, and only if one or more tasks failed, summarizing exactly which task(s) failed.
To enable the email summary report, add the following to your script:
Node Memory size:
If your jobs require nodes with a specific memory size, like nodes with 512GB of main memory, add the following Grid Engine option to your job script:
#$ -l mem_size=512
This will make Grid Engine pick nodes that have 512GB of physical memory. If you don’t need that much memory you can request a lesser amount:
#$ -l mem_size=256
and Grid Engine will pick nodes with at least 256GB of physical memory.
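The memory request combines naturally with restart in the same script header, as in this sketch (the workload is a placeholder):

```shell
#!/bin/bash
#$ -ckpt restart        # restart instead of staying suspended
#$ -l mem_size=256      # only nodes with at least 256GB of physical memory

# ... your actual workload goes here ...
```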
What jobs failed?
You can use the email notification explained above, or you can look in the checkpoint area /checkpoint/$USER.
Successful jobs are automatically cleaned (removed) from /checkpoint/$USER, but failed jobs are left behind so that you can see what failed.
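Since only failed jobs leave data behind, a quick check is to look for leftover entries in the checkpoint area. A small sketch (the directory argument is parameterized here for illustration; on the cluster it is /checkpoint/$USER):

```shell
#!/bin/bash
# Report whether any failed-job data remains in the checkpoint area.
check_ckpt() {
  dir="${1:-/checkpoint/$USER}"
  if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "failed jobs left data in $dir"
    ls -l "$dir"
  else
    echo "no leftover checkpoint data"
  fi
}

check_ckpt
```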