HPC job scheduler (Slurm)

To run your job on an HPC system you will need to submit it to a job scheduler.

An HPC system can have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? These decisions are handled by a special piece of software called the scheduler, which manages which jobs run where and when.

Slurm - job scheduler

To submit jobs and interact with our clusters we use the Slurm workload manager. Slurm is a compute resource manager and job scheduler. In essence, it is a queuing system to which users submit their HPC jobs; it then allocates compute resources and time on a cluster for each job to run. It is a very popular scheduler and is widely adopted across HPC facilities.

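There are 2 principal ways to submit jobs using Slurm:

- Batch jobs (a script submitted with sbatch, which runs unattended)
- Interactive jobs (a session started with srun or salloc)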

Also see Array jobs (submitting a batch of jobs) for running multiple batch jobs.
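As a rough illustration of what a batch submission contains, the sketch below generates a small batch script with `#SBATCH` directives describing the resources and time the job needs. The helper name `render_batch_script`, its default values, and the example command are hypothetical; only the `sbatch` options themselves (`--job-name`, `--time`, `--ntasks`, `--cpus-per-task`, `--mem`, `--partition`) are standard Slurm. Check your cluster's own pages for valid partition names and limits.

```python
def render_batch_script(job_name, command, time_limit="01:00:00",
                        ntasks=1, cpus_per_task=1, mem="4G", partition=None):
    """Return the text of a Slurm batch script (hypothetical helper).

    The defaults are placeholders for illustration, not site recommendations.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --time={time_limit}",          # wall time limit, HH:MM:SS
        f"#SBATCH --ntasks={ntasks}",            # number of tasks (processes)
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --mem={mem}",                  # memory per node
    ]
    if partition is not None:
        # Only request a partition when one is given explicitly.
        lines.append(f"#SBATCH --partition={partition}")
    # Blank line, then the command the job runs, then a trailing newline.
    lines += ["", command, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example with made-up values; prints the script instead of submitting it.
    print(render_batch_script("demo", "python my_script.py", time_limit="00:30:00"))
```

The generated text would be saved to a file (for example `job.sh`) and submitted with `sbatch job.sh`; Slurm reads the `#SBATCH` lines to decide which resources to allocate before the command runs.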

Slurm also offers a range of tools for managing your jobs, including checking the job queues and cancelling running or queued jobs. See Job Management for more information.
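A minimal sketch of how these management tools might be driven from a script is shown below. It only builds the command lines for `squeue` and `scancel` and extracts the job ID from the message `sbatch` prints on submission; it does not run anything itself. The function names are hypothetical, and the example user name and job ID are made up.

```python
import re


def squeue_command(user):
    """Command line that lists the queued and running jobs of one user."""
    return ["squeue", f"--user={user}"]


def scancel_command(job_id):
    """Command line that cancels a queued or running job by its ID."""
    return ["scancel", str(job_id)]


def parse_submitted_job_id(sbatch_output):
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return int(match.group(1))
```

On a login node, these lists could be passed to `subprocess.run`, for example `subprocess.run(squeue_command("alice"), check=True)`.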

Note

Jobs run until they finish or reach their wall time limit, unless they are submitted to a _risk partition, where they can be interrupted at any time (see AISURREY preemption or Eureka2 preemption). Therefore, use checkpointing so that your code can resume from where it was interrupted.
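One simple way to checkpoint is to save the program state to a file at regular intervals and reload it at start-up. The sketch below is a generic illustration, not a method specific to our clusters: the function names, the JSON state layout, and the `stop_after` parameter (which simulates an interruption) are all assumptions made for the example. Real workloads would normally use the checkpoint facilities of their own framework.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then atomically rename it into place.

    This way an interruption during the write cannot leave a half-written
    checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run(path, n_steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..n_steps-1, resuming from the last checkpoint.

    stop_after simulates an interruption after that many steps in this run;
    progress made since the last checkpoint is then lost, as it would be
    if the job were killed.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption, nothing saved here
        state["total"] += state["step"]  # the "work" of one step
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final state once all steps are complete
    return state
```

If the same job script is resubmitted after an interruption, `run` picks up from the most recent checkpoint rather than from step 0. The checkpoint interval is a trade-off: saving more often loses less work on interruption but spends more time writing files.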