5. Running HPC jobs
To run a job on an HPC system you will need to submit it to a job scheduler.
An HPC system can have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
5.1. Slurm - job scheduler
To submit jobs and interact with our clusters we use the Slurm workload manager. Slurm is a compute resource manager and job scheduler. In essence, it is a queuing system: users submit their HPC jobs to it, and it allocates compute resources and time on the cluster for each job to run. It is a very popular scheduler and is widely adopted across many HPC facilities.
There are two principal ways to use Slurm: batch jobs using job scripts, or interactive jobs. Also see Array jobs (submitting a batch of jobs) for running multiple batch jobs.
5.1.1. Job scripts
These are plain text files in which you specify and request cluster resources and list, in sequence,
the commands that you want to execute (such as applications), just as you would on the command prompt.
Below is an example of a Slurm job script: a text file called slurm_test.sub
that runs an MPI helloworld program.
[abc123@login7(eureka) ~]$ vim slurm_test.sub
#!/bin/bash
#SBATCH --partition=shared #Selecting "shared" queue
#SBATCH --job-name="hello" #Name of job (displayed in squeue)
#SBATCH --nodes=2 #No of nodes to run job
#SBATCH --ntasks-per-node=10 #No of cores to use per node
#SBATCH --time=00:05:00 #Maximum time for job to run
#SBATCH --mem=2G #Amount of memory per node
#SBATCH --output=helloworld.out #Output file for stdout (optional)
cd $SLURM_SUBMIT_DIR #Change to submission directory
module load helloworld/1.1 #Load up hello module for program to run
mpirun -np 20 helloworld #Run helloworld on 20 cores with MPI (20 = nodes * ntasks-per-node)
echo $SLURM_NODELIST > nodes #Record the nodes the code runs on to file nodes
An identical job submit script is available for you to use in gitlab: https://gitlab.surrey.ac.uk/rcs/eureka-examples/-/blob/master/Example/Example.sub
The #SBATCH
directives define the resources requested for compute jobs (use them to describe the resources you require).
The general format of these is as follows: #SBATCH --<option>=<value>
Not all of these directives need to be specified; if one is missed, a default will be applied upon submission.
The directives must always be at the top of the file, directly after the #!/bin/bash line, with no gaps between them.
5.1.1.1. SBATCH options
Some common options you might want to use in your job submit file:
- --nodes=<number>
Number of nodes requested
- --ntasks-per-node=<number>
Number of processes to run per node
- --ntasks=<number>
Total number of processes
- --mem=<number>
Total memory per node
- --mem-per-cpu=<number>
Total memory per core
- --constraint=<attribute>
Node property to request (e.g. avx, IB, OP)
- --partition=<partition_name>
Request specified partition or queue
- --job-name=<myjobname>
Name of Job
- --error=<slurm.err>
File for Slurm error output (stderr)
- --output=<example.out>
Specify output file for stdout
- --time=<hh:mm:ss>
Maximum wall time for the job
- --exclusive
Exclusive access to node
- --gpus-per-node=<number>
Request GPU resources per node
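To illustrate, several of the options above can be combined in a single job script header. The sketch below is a hedged example only: the partition name, file names, memory value and the avx feature are placeholders that must match what is valid on your cluster.

```shell
#!/bin/bash
#SBATCH --partition=shared        #Queue to submit to
#SBATCH --job-name=myjob          #Name shown in squeue
#SBATCH --ntasks=4                #Total number of processes
#SBATCH --mem-per-cpu=2G          #Memory per core rather than per node
#SBATCH --constraint=avx          #Only run on nodes with the avx feature
#SBATCH --time=01:00:00           #Maximum wall time
#SBATCH --output=myjob.out        #File for stdout
#SBATCH --error=myjob.err         #File for stderr
echo "Job running"
```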
5.1.1.2. Slurm environment variables
$SLURM_XXXX
variables are useful built-in environment variables from Slurm that you can put into your scripts to make
them more automated and transferable. In the example script above, the Slurm environment variable $SLURM_SUBMIT_DIR
is used so that when the job runs, it changes to the directory from which it was submitted before running anything.
Environment Variables:
- $SLURM_JOB_ID:
ID of job allocation
- $SLURM_SUBMIT_DIR:
Directory job where was submitted
- $SLURM_JOB_NODELIST:
List of nodes allocated to the job
- $SLURM_NTASKS:
Total number of cores for job
- $SLURM_ARRAY_TASK_ID:
Index for array task
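Because these variables only exist inside a running job, one transferable pattern is to give them shell fallbacks so the same script also works outside Slurm. A minimal sketch; the fallback values here are our own illustrative choice, not something Slurm provides:

```shell
#!/bin/bash
# Use the Slurm-provided values when present, or a fallback otherwise.
NPROCS=${SLURM_NTASKS:-1}          # total cores for the job, default 1
WORKDIR=${SLURM_SUBMIT_DIR:-$PWD}  # submission directory, default here
cd "$WORKDIR"
echo "Running with $NPROCS process(es) in $WORKDIR"
```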
5.1.1.3. Submit a job
Once you have made your job script you need to submit it,
using the command sbatch <job_script>.
For example, to submit the job script we made earlier, slurm_test.sub:
[abc123@login7(eureka) ~]$ sbatch slurm_test.sub
Submitted batch job 40145
Once submitted, your job is allocated a job id number, which is used to reference a job and interact with it after it has been submitted.
We can then see our job in the queue using the command squeue,
or use squeue -u <username>
to see only the jobs you have submitted.
[abc123@login7(eureka) ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40143 shared castepjo abc123 R 8:01 1 node11
40072 shared raccoon abc123 R 1-20:07:59 1 node40
40145 shared hello abc123 R 0:01 2 node[102-103] <------ Here is my Job!
40116 shared es254.sh abc126 R 20:49:02 1 node108
40125 shared bash abc127 R 2:06:20 1 node07
34114 shared halo_332 abc131 R 4-11:08:00 4 node[30-33]
40090 gpu rbd_EPR_ abc140 R 1-03:18:31 1 node124
[abc123@login7(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40145 shared hello abc123 R 0:11 2 node[102-103]
If you have followed this example you should have the following outputs in your directory.
[abc123@login7(eureka) ~]$ ls -ltr
-rw-r--r-- 1 abc123 itsstaff 18084 Apr 17 12:48 helloworld.out
-rw-r--r-- 1 abc123 itsstaff 14 Apr 17 12:48 nodes
You can also query a job to get its full information using its job id number
with the command scontrol show job <Job Id Number>,
as shown below:
[abc123@login7(eureka) ~]$ scontrol show job 40145
JobId=40145 JobName=hello
UserId=abc123(282122) GroupId=itsstaff(40000) MCS_label=N/A
Priority=12940 Nice=0 Account=it QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2019-04-17T13:06:21 EligibleTime=2019-04-17T13:06:21
StartTime=2019-04-17T13:06:21 EndTime=2019-04-17T13:06:22 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login7:28627
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[102-103]
BatchHost=node102
NumNodes=2 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=4G,node=2
Socks/Node=* NtasksPerN:B:S:C=10:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=2G MinTmpDiskNode=0
Features=ib DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/users/abc123/slurm_test.sub
WorkDir=/users/abc123
StdErr=/users/abc123/helloworld.out
StdIn=/dev/null
StdOut=/users/abc123/helloworld.out
Power=
5.1.1.4. Stopping or cancelling a job
If you have submitted a job and want to delete or cancel it, you can use the command
scancel <Job ID Number>.
5.1.2. Interactive jobs
An interactive job puts you in an interactive shell on one or more compute nodes. This can be helpful as a debugging tool when creating job scripts for batch submission in a test scenario. It allows you to experiment on compute nodes with command options and environment variables, providing immediate feedback (helpful in determining your workflow!).
The way to allocate a node is with the command srun.
This is used to allocate resources, after which you can work interactively on the allocated node(s).
Resources for interactive sessions can be requested using the Slurm options shown above,
by adding them as arguments to the command srun --<option>=<value> --<option>=<value> --pty bash.
[abc123@login7(eureka) ~]$ srun -N 1 --exclusive --constraint=avx2 --time=02:00:00 --pty bash
[abc123@node14(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
108098 shared bash abc123 R 0:05 1 node14
[abc123@node14(eureka) ~]$ exit
exit
[abc123@login7(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[abc123@login7(eureka) ~]$
srun
can also be used to run a command interactively and then immediately close the allocation;
this is done by appending the command to the end of the srun invocation:
srun --<option>=<value> --<option>=<value> <command to run>.
[abc123@login7(eureka) ~]$ module load helloworld/1.1
[abc123@login7(eureka) ~]$ srun -N 1 --constraint=avx2 --time=02:00:00 mpirun -np 28 helloworld
Hello world from processor node19.swmgmt.eureka, rank 1 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 4 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 5 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 9 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 12 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 16 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 17 out of 28 processors
5.1.3. Array jobs (submitting a batch of jobs)
Often you may need to submit hundreds of jobs over a list or an index. In these cases you should avoid creating and submitting hundreds of separate job scripts. Instead, you can submit them all from a single job script using array jobs. Array jobs are a quick and compact way to submit and collectively manage sets of similar jobs.
The example below demonstrates an array job, where the job is submitted for the indices 1 to 16,
specified by the #SBATCH --array=1-16
directive. The index is then referred to in the script
through the Slurm environment variable $SLURM_ARRAY_TASK_ID.
[abc123@login7(eureka) ~]$ vim slurm_array.sub
#!/bin/bash
#SBATCH --job-name=array
#SBATCH --array=1-16 #Array job indices/range for $SLURM_ARRAY_TASK_ID (can be incremented if desired)
#SBATCH --time=01:00:00
#SBATCH --partition=shared
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --error=array_%A_%a.err #Error file labelled by job number and array index.
# Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID > test_"$SLURM_ARRAY_TASK_ID"
In the example script above, the %A_%a
notation is filled in with the master job id (%A) and the array task id (%a).
This is a simple way to create output files in which the file name is different for each job in the array.
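A common use of $SLURM_ARRAY_TASK_ID is to map each array task onto one line of an input list, so that task 1 processes the first input, task 2 the second, and so on. A minimal sketch, where inputs.txt is a hypothetical file with one input name per line:

```shell
#!/bin/bash
# Pick the line of inputs.txt whose line number equals the array task id.
# inputs.txt is a hypothetical list file; replace it with your own.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"
```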
There are different ways of specifying the array indices, depending on the jobs you have; examples are shown below:
# A job array with array tasks numbered from 0 to 31.
#SBATCH --array=0-31
# A job array with array tasks numbered 1, 2, 5, 19, 27.
#SBATCH --array=1,2,5,19,27
# A job array with array tasks numbered 1, 3, 5 and 7.
#SBATCH --array=1-7:2
To cancel an entire array job:
scancel <Job ID Number>
To cancel a specific task in an array job:
scancel <jobid>_<taskid>
To cancel a range of specific tasks in an array job:
scancel <jobid>_[<taskid>-<taskid>]
5.2. Slurm - Advanced topics
5.2.1. Handling Job dependencies
sbatch can also be used in workflows that involve multiple steps or checkpoints:
the --dependency
option allows a job to be launched on condition of the completion (or successful completion) of another job.
To submit a job that starts after the completion of a specified job, use: sbatch --dependency=afterok:<Job Number> <Job script>.
For example, the following submits the job script dependant_job.sub
so that it will only start upon the completion of job number 106178.
[abc123@login7(eureka) ~]$ sbatch --dependency=afterok:106178 dependant_job.sub
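When chaining several such steps, retyping job numbers by hand is error-prone. One common pattern captures the job id with sbatch's --parsable flag, which makes sbatch print only the id; a sketch, where first_step.sub and second_step.sub are hypothetical script names:

```shell
#!/bin/bash
# --parsable makes sbatch print only the job id, so it can be captured.
jid=$(sbatch --parsable first_step.sub)
# The second job stays pending until job $jid completes successfully.
sbatch --dependency=afterok:${jid} second_step.sub
```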
5.2.2. Requesting resources for MPI or OpenMP jobs
Up until now we have used either the combination of --nodes
and --ntasks-per-node
or --ntasks
to specify the number of cores we would like to use. Another option which can be used is --cpus-per-task.
With these multiple ways of allocating cores, there are several ways to obtain the same or a similar allocation.
For example, the following: --nodes=3 --ntasks=3 --cpus-per-task=3
is equivalent in terms of resource allocation to
--ntasks=9 --ntasks-per-node=3,
but is seen differently by Slurm and MPI: in the first case
3 processes are launched, and in the second case 9 processes are launched.
With that in mind, consider the following examples; there are a variety of ways to request cores depending on whether you are running MPI (distributed) jobs or OpenMP (single-node parallel) jobs:
5.2.2.1. MPI (or distributed)
You use MPI and do not care where those cores are distributed:
--ntasks=16
You want to launch 16 independent processes (no communication):
--ntasks=16
You want those cores spread across distinct nodes:
--ntasks=16 --ntasks-per-node=1 or --ntasks=16 --nodes=16
You want 10 processes spread across 5 nodes, with two processes per node:
--ntasks=10 --ntasks-per-node=2
You want 10 processes to stay on the same node:
--ntasks=10 --ntasks-per-node=10
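For the OpenMP (single-node parallel) case, the usual pattern is a single task with several cores per task. A sketch, where ./omp_program stands in for your own threaded executable:

```shell
#!/bin/bash
#SBATCH --ntasks=1           #A single process...
#SBATCH --cpus-per-task=8    #...with 8 cores for its threads
# Match the OpenMP thread count to the cores Slurm granted.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program
```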
5.2.3. Checkpointing HPC jobs
5.2.3.1. Why checkpoint?
Running jobs might be interrupted for a number of reasons. The objective of checkpointing is to avoid losing simulation time by having to restart from the very beginning, i.e. from the initial conditions.
Important
Here is a Wikipedia article on the subject: https://en.wikipedia.org/wiki/Application_checkpointing
When a checkpointing mechanism is implemented, the job saves checkpoints regularly or in response to a termination signal. Regular checkpoints are helpful in the event your simulation reaches its maximum wall time, or in the event of infrastructure failures.
5.2.3.2. How to checkpoint
Checkpointing jobs is considered a best practice approach for all HPC jobs so you should have this built into your workflow. Checkpointing can be achieved in many ways - the simplest example would be to have the simulation generate an output file once a day, which you can then use as “input” (i.e. initial conditions) to restart and continue a job.
How checkpointing is implemented for each job depends on the software used. Please refer to the documentation of the software or libraries used to determine how you can make your job checkpoint itself; many applications have a built-in mechanism to achieve this instead of using input/output files, so please read up and use this if possible.
On gaining more familiarity and experience with the nature of your software and jobs, you could eventually automate this checkpointing process by means of triggers.
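The restart logic itself is often just a few lines of shell at the top of the job script. A sketch; the file names checkpoint.dat and input.dat and the --restart flag are hypothetical and depend entirely on your application:

```shell
#!/bin/bash
# Restart from the latest checkpoint if one exists, else start fresh.
if [ -f checkpoint.dat ]; then
    echo "restarting from checkpoint"
    # ./my_sim --restart checkpoint.dat    (application-specific)
else
    echo "starting from initial conditions"
    # ./my_sim input.dat                   (application-specific)
fi
```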
5.2.4. Benchmarking and scaling
Benchmarking and scaling are very important aspects of running simulations on HPC clusters; they can help maximise your output and minimise the waste of resources.
One of the most important things to do before running production workloads is to benchmark your problem against the number of cores you might use, to ensure you are asking for the correct amount of resources.
Below are plots of a benchmarking exercise of a DFT B3LYP energy calculation of 2 molecules:
Both plots show the time of the calculation as a function of the number of cores, and it can be seen that it scales quite well, with significant reductions in run time up to 20 cores in both cases.
However, in these examples there is no improvement beyond 20 cores, and therefore no benefit in running the calculation on any more cores.
In fact, requesting more cores in your Slurm job script would waste resources which could be allocated to another job.
Although not shown here, asking for too many cores can also result in a significant slowdown of a calculation due to the parallel overhead, turning the above plots into U-shaped plots!
Tip
Having the scaling of your problem for every simulation is not necessary, but using model problems to inform yourself is a very good idea, so you can gauge the correct amount of resources to request when running day to day jobs.
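One simple way to run such a scaling study is to submit the same job script several times with different core counts from the command line, since options passed to sbatch override the #SBATCH directives inside the script. A sketch, where bench.sub is a hypothetical job script:

```shell
#!/bin/bash
# Submit the same script at a range of core counts to measure scaling;
# compare the resulting run times to find the sweet spot.
for n in 1 2 4 8 16 32; do
    sbatch --ntasks=${n} --job-name=bench_${n} bench.sub
done
```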
5.3. Slurm - useful commands
sinfo
View information about status of cluster, partitions and node usage
[abc123@login7(eureka) ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shared* up 7-00:00:00 2 down* node[101,136]
shared* up 7-00:00:00 3 resv node[01,43-44]
shared* up 7-00:00:00 32 mix node[02,04,06-08,15-21,32,34,36-40,45,50,109-118,140]
shared* up 7-00:00:00 29 alloc galaxy[1-13],node[09-14,22-25,30-31,35,42,46,49]
shared* up 7-00:00:00 22 idle node[03,05,26-29,33,102-108,120-121,132-135,139,141]
debug_latest up 1:00:00 1 idle node41
debug_all up 1:00:00 1 idle node100
high_mem up 7-00:00:00 8 idle node[119,125-131]
gpu up 7-00:00:00 3 idle node[122-124]
sacct
Retrieve job information from Slurm after the job has finished. Retrieve information based on job number:
sacct -j <Job Number> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
[abc123@login7(eureka) ~]$ sacct -j 56814 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
   abc123 56814             Hello     shared  COMPLETED   00:05:00 2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01                              4        112 node[45-46,49-+
          56814.batch       batch             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01      1160K    147756K        1         28          node45
          56814.0       pmi_proxy             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01       828K    213820K        4          4 node[45-46,49-+
Get information by user and time:
sacct --starttime YYYY-MM-DD -u <username> --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
[abc123@login7(eureka) ~]$ sacct --starttime 2020-01-17 -u abc123 --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
       JobID      State  Timelimit    JobName     MaxRSS    Elapsed        NodeList
------------ ---------- ---------- ---------- ---------- ---------- ---------------
106178        COMPLETED   00:10:00      hello              00:02:00          node45
106178.batch  COMPLETED                 batch      1956K   00:02:00          node45
106179        COMPLETED   00:10:00      hello              00:02:00          node45
106179.batch  COMPLETED                 batch      1908K   00:02:00          node45
scancel
The command scancel can be used in multiple ways to delete several jobs rather than one at a time. Cancel a specific job:
scancel <job-id>
cancel all your jobs:
scancel -u <username>
cancel all your jobs in the PENDING status:
scancel -t PENDING -u <username>
cancel all your jobs in the RUNNING status:
scancel -t RUNNING -u <username>
slurmtop
Like the Linux command top, but displays the layout of jobs running on the compute nodes.
sview
Display a graphical overview of cluster usage.