Checkpointing on HPC¶
What is checkpointing?¶
Checkpointing is the practice of periodically saving the state of a running program to persistent storage so that it can be restarted from that point after an interruption. Rather than rerunning a job from the beginning, a checkpoint lets the computation resume from the last saved state, preserving progress, compute time, and results.
On High-Performance Computing (HPC) systems, checkpointing is an essential best practice and is strongly recommended for all medium- to long-running jobs.
Why checkpointing matters on HPC¶
HPC jobs rarely run in perfect, uninterrupted conditions. Jobs may stop due to:
⏱️ Wall-time limits - schedulers enforce time caps; jobs are terminated when they hit the limit.
💥 Application crashes - software bugs or memory errors can stop a job run.
🔄 Job eviction/re-queuing - jobs on the _risk partitions may be interrupted to free up resources for a higher-priority job.
⚙️ Hardware or filesystem failures - node faults or storage outages can interrupt I/O or compute.
🔧 Scheduled maintenance or system reboots - planned downtime can end jobs without warning.
Without checkpointing, any interruption means that all progress is lost and the job must restart from its initial conditions.
Note
Jobs run until their wall-time limit unless they are submitted to a _risk partition, where they can be interrupted at any time (see AISURREY preemption or Eureka2 preemption).
Benefits:¶
Using checkpointing properly provides several advantages:
Time efficiency – resume long simulations instead of starting again
Efficient use of resources – avoid wasting CPU or GPU hours
Scalability – enable simulations that cannot finish in a single job run
Resilience – recover from crashes or system interruptions
Workflow flexibility – split work across multiple job submissions
Checkpointing and wall-time limits¶
Most HPC clusters enforce a maximum wall time (maximum run time) for each job. These limits ensure fair sharing of resources, prevent individual jobs from occupying compute nodes indefinitely, and help the scheduler plan and balance workloads across all users.
When a job reaches its wall-time limit, the scheduler terminates it automatically, even if the computation has not finished.
With checkpointing in place:
Your application periodically saves its progress to persistent storage
The job can be safely stopped without losing completed work
You can submit a new job that restarts from the most recent checkpoint
Long-running workloads can be completed across multiple job submissions
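One way to stop safely before the limit is to track elapsed time inside the job itself. The sketch below is illustrative only: `run_within_budget`, `step`, `save`, and the budget and margin arguments are hypothetical names, not part of any scheduler or library API. It assumes the job knows its own wall-time limit and that writing a checkpoint takes less time than the safety margin.

```python
import time


def run_within_budget(state, total_steps, budget_s, margin_s, step, save,
                      clock=time.monotonic):
    """Advance state["step"] until done or until the time budget is nearly spent.

    Returns True if all steps finished, False if the run stopped early.
    """
    start = clock()
    while state["step"] < total_steps:
        # Stop while there is still margin_s left to write the checkpoint.
        if clock() - start >= budget_s - margin_s:
            save(state)   # final checkpoint before the scheduler's limit
            return False  # caller should submit a follow-up job
        step(state)       # one unit of real work (hypothetical callback)
        state["step"] += 1
    save(state)
    return True
```

A return value of `False` means the job stopped early, so a follow-up job should be submitted to continue from the saved state.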
How checkpointing works¶
Checkpointing typically involves:
Saving the application state (e.g. iteration counters, model parameters, simulation data)
Writing this state to persistent storage
Restarting the application using the saved state
The exact implementation depends on the application or library being used.
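The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a recommended implementation: the file name `state.json`, the helper names, and the toy running total are all assumptions made for the example. The state is written to a temporary file first and then renamed, so an interruption during the save should not leave a truncated checkpoint behind.

```python
import json
import os
import tempfile

CHECKPOINT_PATH = "state.json"  # hypothetical file name


def save_checkpoint(state, path=CHECKPOINT_PATH):
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # readers never see a half-written file


def load_checkpoint(path=CHECKPOINT_PATH):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "total": 0}


def run(n_iterations, checkpoint_every=10, path=CHECKPOINT_PATH):
    """Resume from the last checkpoint and continue up to n_iterations."""
    state = load_checkpoint(path)
    for i in range(state["iteration"], n_iterations):
        state["total"] += i         # stand-in for real work
        state["iteration"] = i + 1  # next iteration to run after a restart
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state
```

Calling `run` a second time with the same path continues from the stored iteration counter rather than from zero.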
Some software provides built-in checkpointing¶
Many modern scientific and data-intensive applications include native checkpointing support, which is the recommended approach when available. When built-in support is missing or does not meet the application's requirements, you may need to implement custom checkpointing.
Deep learning frameworks are a common example, offering automatic tools to save model weights, optimizer state, and training progress.
PyTorch¶
PyTorch provides built-in utilities to save and restore training state using torch.save and torch.load.
This approach is widely used for long-running GPU jobs on HPC systems.
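import torch

# save: bundle everything needed to resume training into one file
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pt",
)
# restore: load onto the CPU first, then copy the state into the live objects
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
epoch = checkpoint["epoch"]  # the epoch value stored when the checkpoint was saved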
TensorFlow / Keras¶
TensorFlow (Keras) provides high-level checkpointing APIs that automate much of the process.
Commonly used mechanisms include:
tf.train.Checkpoint
tf.keras.callbacks.ModelCheckpoint
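import tensorflow as tf

# save: save() numbers each checkpoint (ckpt-1, ckpt-2, ...) and updates the
# metadata that tf.train.latest_checkpoint reads; write() does neither
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
checkpoint.save("./ckpt")
# restore the most recent checkpoint found in the current directory
checkpoint.restore(tf.train.latest_checkpoint("."))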
MATLAB¶
MATLAB can save and restore model state using save and load on a MAT file.
% save
checkpoint.epoch = epoch;
checkpoint.net = net;
checkpoint.optimizer = optimizer;
save("checkpoint.mat", "-struct", "checkpoint");
% restore
checkpoint = load("checkpoint.mat");
net = checkpoint.net;
optimizer = checkpoint.optimizer;
epoch = checkpoint.epoch;
General guidance¶
Prefer built-in checkpointing mechanisms when available
Store checkpoints on persistent storage, not temporary node-local paths
Choose checkpoint frequency carefully to balance I/O and recovery cost
Always test that your application can restart successfully from a checkpoint
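As a starting point for choosing the checkpoint frequency, a common rule of thumb is Young's first-order approximation, which sets the interval to the square root of twice the checkpoint cost times the mean time between interruptions. The sketch below only evaluates that formula. The function name and the input values are hypothetical, and the result is a rough estimate to refine with measurements from your own jobs.

```python
import math


def suggested_interval_s(checkpoint_cost_s, mean_time_between_interruptions_s):
    """Young's first-order estimate of the checkpoint interval: sqrt(2 * C * M).

    C is the time to write one checkpoint and M is the mean time between
    interruptions, both in seconds.
    """
    if checkpoint_cost_s <= 0 or mean_time_between_interruptions_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mean_time_between_interruptions_s)


# Hypothetical inputs: a 60 s checkpoint and one interruption per day on average.
interval = suggested_interval_s(60, 24 * 3600)
print(f"checkpoint roughly every {interval / 60:.0f} minutes")
```

With these assumed inputs the estimate comes to a little under an hour. Cheaper checkpoints or more frequent interruptions both shorten the suggested interval.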
Next steps¶
For practical guidance on integrating checkpointing with batch jobs and schedulers, see:
➡️ Examples: Checkpointing HPC jobs
The examples include checkpointing CNN models in Conda/TensorFlow/PyTorch and MATLAB, custom checkpointing in R, and similar versions using Apptainer containers.
Important
Here is a Wikipedia article on the subject: https://en.wikipedia.org/wiki/Application_checkpointing
Hint
If you need help adding checkpointing to your workflow, consider contacting a Research Software Engineer (RSE) through Available Support for tailored advice.