Checkpoint HPC jobs¶
Why checkpoint?¶
Running jobs can be interrupted for a number of reasons, including the program crashing or problems with the HPC cluster itself.
The objective of checkpointing is to avoid losing simulation/compute time: if a job stops or crashes for any reason, it can resume from the most recent checkpoint rather than restarting from the very beginning (i.e. the initial conditions).
Important
Here is a Wikipedia article on the subject: https://en.wikipedia.org/wiki/Application_checkpointing
A job with a checkpoint mechanism saves its state at regular intervals or in response to a termination signal. Regular checkpoints are especially important for long-running jobs, where a simulation may reach its maximum wall time or be cut short by an infrastructure failure.
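As an illustration, here is a minimal Python sketch of this pattern. It assumes a simulation whose state fits in a single Python object that can be serialised with pickle; the file name checkpoint.pkl, the step counts and the use of SIGTERM are illustrative and will differ between applications and schedulers.

```python
import pickle
import signal

# Flag set when the scheduler asks the job to stop (e.g. via SIGTERM).
stop_requested = False

def handle_term(signum, frame):
    """Record that a termination signal arrived so we can checkpoint before exiting."""
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_term)

def save_checkpoint(state, path="checkpoint.pkl"):
    """Write the current simulation state to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

state = {"step": 0, "data": []}   # hypothetical simulation state
CHECKPOINT_INTERVAL = 1000        # illustrative: steps between checkpoints
TOTAL_STEPS = 100_000

for step in range(state["step"], TOTAL_STEPS):
    state["step"] = step
    # ... advance the simulation by one step here ...

    # Save regularly, and immediately if a termination signal was received.
    if step % CHECKPOINT_INTERVAL == 0 or stop_requested:
        save_checkpoint(state)
        if stop_requested:
            break                 # exit cleanly; resume from the checkpoint later
```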
How to checkpoint¶
Checkpointing is considered best practice for all HPC jobs, so you should build it into your workflow.
Checkpointing can be achieved in many ways. The simplest example is to have the simulation write an output file at regular intervals (for example, once a day), which you can then use as "input" (i.e. new initial conditions) to restart and continue the job, as sketched below.
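To continue from such a file, the job checks at start-up whether a previous checkpoint exists and, if so, uses it as its initial conditions. A minimal sketch, again assuming a pickled state file named checkpoint.pkl (both hypothetical):

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"   # illustrative file name

def load_state():
    """Resume from the last checkpoint if one exists, otherwise start fresh."""
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH, "rb") as f:
            return pickle.load(f)    # the saved state becomes the new initial conditions
    return {"step": 0, "data": []}   # hypothetical fresh initial conditions

state = load_state()
print(f"Starting from step {state['step']}")
```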
How checkpointing is implemented depends on the software you are running. Refer to the documentation of the software or libraries you use to find out how to make your job checkpoint itself; many applications have a built-in mechanism for this rather than relying on manually managed input/output files, so read up and use it where possible.
As you gain familiarity and experience with your software and jobs, you can automate checkpointing by means of triggers, for example checkpointing after a fixed amount of wall-clock time has elapsed, or in response to a signal sent by the scheduler shortly before the wall-time limit. A sketch of a simple time-based trigger is shown below.
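As one illustration of such a trigger, a job could checkpoint whenever a fixed amount of wall-clock time has passed since the last save. A minimal sketch; the 30-minute interval and the save_checkpoint routine are assumptions, not a prescription:

```python
import time

CHECKPOINT_EVERY = 30 * 60        # seconds between checkpoints; illustrative value
last_checkpoint = time.monotonic()

def maybe_checkpoint(state, save_checkpoint):
    """Checkpoint if enough wall-clock time has elapsed since the last save."""
    global last_checkpoint
    if time.monotonic() - last_checkpoint >= CHECKPOINT_EVERY:
        save_checkpoint(state)    # reuse whatever save routine your job already has
        last_checkpoint = time.monotonic()
```

You would call maybe_checkpoint(state, save_checkpoint) once per iteration of your main loop, so that checkpoint frequency is governed by elapsed time rather than step count.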
Hint
Consider Available Support from an RSE if you would like some help with adding checkpoints to your job