HPC job scheduler (Slurm)
To run your job on an HPC system you will need to submit it to a job scheduler.
An HPC system can have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? These decisions are handled by a special piece of software called the scheduler, which manages which jobs run where and when.
Slurm - job scheduler
To submit jobs and interact with our clusters we use the Slurm workload manager.
Slurm is a compute resource manager and job scheduler. In essence, it is a queuing system to which users submit their HPC jobs;
it then allocates compute resources and time on a cluster for each job to run. It is widely adopted
across HPC facilities.
There are two principal ways to submit jobs using Slurm:
Batch jobs (sbatch) using sbatch
Interactive jobs (srun) using srun.
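As a rough illustration of what a batch job looks like, here is a minimal script sketch. The job name, resource values, time limit, and output filename are all hypothetical placeholders, not settings taken from our clusters; check the cluster documentation for valid partitions and limits before using them.

```shell
#!/bin/bash
# Minimal batch script sketch. All values below are hypothetical;
# check your cluster's documentation for valid partitions and limits.
#SBATCH --job-name=example_job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:10:00
#SBATCH --output=example_%j.out

# SLURM_JOB_ID is set by Slurm inside a job; fall back to "local"
# so the script can also be tried outside the scheduler.
job_id="${SLURM_JOB_ID:-local}"
echo "Job $job_id running on $(uname -n)"
```

Saved under a hypothetical name such as example_job.sh, the script would be submitted with `sbatch example_job.sh`. An interactive shell with comparable resources could be requested with something like `srun --ntasks=1 --time=00:10:00 --pty bash`, again assuming those values are allowed on your cluster.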
Also see Array jobs (submitting a batch of jobs) for running multiple batch jobs.
Slurm also offers a range of tools for managing your jobs, including checking the job queues and cancelling running or queued jobs. See Job Management for more information.
Note
Jobs can run up to their wall time limit unless they are submitted to a _risk partition, where they can be interrupted at any time
(see AISURREY preemption
or Eureka2 preemption).
Therefore, use checkpointing so that your code can resume from where it was interrupted.
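One possible shape for checkpointing is sketched below: the script records the last completed step in a file and, when restarted, continues from the step after it. The checkpoint filename, the number of steps, and the job settings are hypothetical, and whether a preempted job is automatically requeued depends on how the cluster is configured, so treat this as an illustration under those assumptions rather than a recommended setup.

```shell
#!/bin/bash
# Checkpointing sketch. All values below are hypothetical.
#SBATCH --job-name=checkpoint_demo
#SBATCH --time=00:10:00
# Ask Slurm to put the job back in the queue if it is preempted
# (whether this is honoured depends on the cluster configuration).
#SBATCH --requeue

# Hypothetical checkpoint file and step count for this illustration.
checkpoint_file="checkpoint.txt"
total_steps=5

# Resume from the last completed step if a checkpoint exists.
if [ -f "$checkpoint_file" ]; then
    last_done=$(cat "$checkpoint_file")
else
    last_done=0
fi

step=$((last_done + 1))
while [ "$step" -le "$total_steps" ]; do
    echo "Running step $step of $total_steps"
    # ... the real work for this step would go here ...

    # Record progress only after the step has finished. Writing to a
    # temporary file and renaming it avoids a half-written checkpoint
    # if the job is interrupted during the write.
    echo "$step" > "$checkpoint_file.tmp"
    mv "$checkpoint_file.tmp" "$checkpoint_file"
    step=$((step + 1))
done
echo "All $total_steps steps complete"
```

If the job is interrupted after step 3, the next run reads 3 from the checkpoint file and starts at step 4. Real applications would usually save their own state (model weights, simulation data) alongside the step counter.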