Checkpointing on HPC

What is checkpointing?

Checkpointing is the practice of periodically saving the state of a running program to persistent storage so that it can be restarted from that point after an interruption. Rather than rerunning a job from the beginning, a checkpoint lets the computation resume from the last saved state, preserving progress, compute time, and results.

On High-Performance Computing (HPC) systems, checkpointing is an essential best practice and is strongly recommended for all medium- to long-running jobs.

Why checkpointing matters on HPC

HPC jobs rarely run in perfect, uninterrupted conditions. Jobs may stop due to:

  • ⏱️ Wall-time limits - schedulers enforce time caps; jobs are terminated when they hit the limit.

  • 💥 Application crashes - software bugs or memory errors can stop a job run.

  • 🔄 Job eviction/re-queuing - jobs on the _risk partitions may be interrupted to free up resources for a higher-priority job.

  • ⚙️ Hardware or filesystem failures - node faults or storage outages can interrupt I/O or compute.

  • 🔧 Scheduled maintenance or system reboots - planned downtime can end any job that is still running when it begins.

Without checkpointing, any interruption means that all progress is lost and the job must restart from its initial conditions.

Note

Jobs run until their wall-time limit unless they are submitted to a _risk partition, where they can be interrupted at any time (see AISURREY preemption or Eureka2 preemption).

Benefits:

Using checkpointing properly provides several advantages:

  • Time efficiency – resume long simulations instead of starting again

  • Efficient use of resources – avoid wasting CPU or GPU hours

  • Scalability – enable simulations that cannot finish in a single job run

  • Resilience – recover from crashes or system interruptions

  • Workflow flexibility – split work across multiple job submissions

Checkpointing and wall-time limits

Most HPC clusters enforce a wall-time limit, that is, a maximum run time for each job. These limits ensure fair sharing of resources, prevent individual jobs from occupying compute nodes indefinitely, and help the scheduler plan and balance workloads efficiently across all users.

When a job reaches its wall-time limit, the scheduler will automatically terminate it — even if the computation is not yet finished.

With checkpointing in place:

  • Your application periodically saves its progress to persistent storage

  • The job can be safely stopped without losing completed work

  • You can submit a new job that restarts from the most recent checkpoint

  • Long-running workloads can be completed across multiple job submissions
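
One way to make sure a checkpoint is written before the scheduler terminates the job is to track elapsed time inside the application and stop early. The following is a minimal sketch, not a scheduler feature: run_with_time_budget, save_checkpoint, wall_time_s, and margin_s are hypothetical names, and the sketch assumes the work can be split into independent items and that you pass in the same wall-time value you requested from the scheduler.

```python
import time


def run_with_time_budget(work_items, save_checkpoint, wall_time_s,
                         margin_s=300.0, start=0, clock=time.monotonic):
    """Process work_items[start:], stopping early to checkpoint before the limit.

    Returns the index of the next unprocessed item (len(work_items) when done).
    All names here are illustrative; adapt them to your own application.
    """
    t0 = clock()
    for i in range(start, len(work_items)):
        # Stop once the elapsed time eats into the safety margin.
        if clock() - t0 > wall_time_s - margin_s:
            save_checkpoint(i)      # record where the next job should resume
            return i
        work_items[i]()             # one unit of work
    save_checkpoint(len(work_items))
    return len(work_items)
```

A follow-up job would read the saved index and pass it back in as start. The safety margin should comfortably exceed the time it takes to write one checkpoint.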

How checkpointing works

Checkpointing typically involves:

  1. Saving the application state (e.g. iteration counters, model parameters, simulation data)

  2. Writing this state to persistent storage

  3. Restarting the application using the saved state

The exact implementation depends on the application or library being used.
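
The three steps above can be sketched in plain Python for an application without built-in support. This is an illustrative example only: save_checkpoint, load_checkpoint, run, and the stop_after parameter (which simulates an interruption) are hypothetical names, the state is a small JSON-serialisable dictionary, and the workload is a toy running sum.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint.

    A job killed mid-write leaves the previous checkpoint untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"iteration": 0, "total": 0}


def run(path, n_iterations, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_iterations-1, checkpointing as it goes.

    stop_after simulates an interruption after that many iterations in this run.
    """
    state = load_checkpoint(path)              # step 3: restart from saved state
    done_this_run = 0
    while state["iteration"] < n_iterations:
        if stop_after is not None and done_this_run >= stop_after:
            return None                        # simulated interruption
        state["total"] += state["iteration"]
        state["iteration"] += 1
        done_this_run += 1
        if state["iteration"] % checkpoint_every == 0:
            save_checkpoint(path, state)       # steps 1 and 2: save and persist
    save_checkpoint(path, state)
    return state["total"]
```

Calling run a second time with the same path picks up from the last saved iteration rather than from zero. Work done after the last checkpoint is repeated, which is why the checkpoint interval matters.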

Some software provides built-in checkpointing

Many modern scientific and data-intensive applications include native checkpointing support, which is the recommended approach when available. Where built-in support is missing or does not meet the application’s requirements, you may need to implement custom checkpointing.

Deep learning frameworks are a common example, offering automatic tools to save model weights, optimizer state, and training progress.

PyTorch

PyTorch provides built-in utilities to save and restore training state using torch.save and torch.load.

This approach is widely used for long-running GPU jobs on HPC systems.

import torch

# save: bundle everything needed to resume training into one file
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    "checkpoint.pt",
)

# restore: load onto the CPU first, then copy into the existing model and optimizer
checkpoint = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
epoch = checkpoint["epoch"]

TensorFlow / Keras

TensorFlow (Keras) provides high-level checkpointing APIs that automate much of the process.

Commonly used mechanisms include:

  • tf.train.Checkpoint

  • tf.keras.callbacks.ModelCheckpoint

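import tensorflow as tf

# save: save() numbers each checkpoint and updates the index file
# that tf.train.latest_checkpoint reads (write() does not)
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
checkpoint.save("./ckpt")

# restore the most recent checkpoint in the current directory
checkpoint.restore(tf.train.latest_checkpoint("."))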

MATLAB

MATLAB can save and restore model state using save and load on a MAT file.

% save
checkpoint.epoch = epoch;
checkpoint.net = net;
checkpoint.optimizer = optimizer;
save("checkpoint.mat", "-struct", "checkpoint");

% restore
checkpoint = load("checkpoint.mat");
net = checkpoint.net;
optimizer = checkpoint.optimizer;
epoch = checkpoint.epoch;

General guidance

  • Prefer built-in checkpointing mechanisms when available

  • Store checkpoints on persistent storage, not temporary node-local paths

  • Choose checkpoint frequency carefully to balance I/O and recovery cost

  • Always test that your application can restart successfully from a checkpoint
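
As a starting point for choosing a checkpoint frequency, a commonly used rule of thumb is Young's first-order approximation: the interval is roughly the square root of twice the checkpoint write time multiplied by the mean time between interruptions. The sketch below is illustrative only. The function name and the example numbers are hypothetical, and the result should be treated as an estimate to refine against your own measurements.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the checkpoint interval, in seconds.

    checkpoint_cost_s: time to write one checkpoint.
    mtbf_s: mean time between interruptions (failures, preemptions).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical numbers: a 60 s checkpoint write and one interruption per 24 h.
interval_s = young_interval(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {interval_s / 60:.0f} minutes")
```

Cheaper checkpoints or more frequent interruptions both push the estimate towards shorter intervals.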

Next steps

For practical guidance on integrating checkpointing with batch jobs and schedulers, see:

➡️ Examples: Checkpointing HPC jobs

The examples include checkpointing CNN models in Conda/TensorFlow/PyTorch and MATLAB, custom checkpointing in R, and equivalent versions using Apptainer containers.

Important

Here is a Wikipedia article on the subject: https://en.wikipedia.org/wiki/Application_checkpointing

Hint

If you need help adding checkpointing to your workflow, consider contacting an RSE through Available Support for tailored advice.