5. Running HPC jobs
To run a job on an HPC system you will need to submit it to a job scheduler.
An HPC system can have thousands of nodes and thousands of users. How do we decide who gets what and when? How do we ensure that a task is run with the resources it needs? This is handled by a special piece of software called the scheduler. On an HPC system, the scheduler manages which jobs run where and when.
5.1. Slurm - job scheduler
To submit jobs and interact with our clusters we use the Slurm workload manager. Slurm is a compute resource manager and job scheduler. In essence, it is a queuing system: users submit their HPC jobs to it, and it allocates compute resources and time on the cluster for each job to run. It is a very popular scheduler and is widely adopted across many HPC facilities.
There are two principal ways to use Slurm: batch jobs using job scripts, or interactive jobs. Also see Array jobs (submitting a batch of jobs) for running multiple batch jobs.
5.1.1. Job scripts
These are plain text files in which you specify and request cluster resources and list, in sequence,
the commands that you want to execute (such as applications), just as you would on the command prompt.
Below is an example of a Slurm job script: a text file called slurm_test.sub
that runs an MPI helloworld program.
[abc123@login7(eureka) ~]$ vim slurm_test.sub
#!/bin/bash
#SBATCH --partition=shared #Selecting "shared" queue
#SBATCH --job-name="hello" #Name of job (displayed in squeue)
#SBATCH --nodes=2 #No of nodes to run job
#SBATCH --ntasks-per-node=10 #No of cores to use per node
#SBATCH --time=00:05:00 #Maximum time for job to run
#SBATCH --mem=2G #Amount of memory per node
#SBATCH --output=helloworld.out #Output file for stdout (optional)
cd $SLURM_SUBMIT_DIR #Change to submission directory
module load helloworld/1.1 #Load up hello module for program to run
mpirun -np 20 helloworld #Run helloworld on 20 cores with MPI (20 = nodes * ntasks-per-node)
echo $SLURM_NODELIST > nodes #Record the nodes the code runs on to file nodes
An identical job submit script is available for you to use in gitlab: https://gitlab.surrey.ac.uk/rcs/eureka-examples/-/blob/master/Example/Example.sub
The #SBATCH
directives define the resources requested for compute jobs (use them to describe the resources you require).
The general format of these is as follows: #SBATCH --<option>=<value>
Not all of these directives need to be specified; if one is missed, a default will be applied upon submission.
The directives must always be at the top of the file, directly after the #!/bin/bash line, with no gaps between them.
5.1.1.1. SBATCH options
Some common options you might want to use in your job submit file:
- --nodes=<number>
Number of nodes requested
- --ntasks-per-node=<number>
Number of processes to run per node
- --ntasks=<number>
Total number of processes
- --mem=<number>
Total memory per node
- --mem-per-cpu=<number>
Total memory per core
- --constraint=<attribute>
Node property to request (e.g. avx, IB, OP)
- --partition=<partition_name>
Request specified partition or queue
- --job-name=<myjobname>
Name of Job
- --error=<slurm.err>
File for Slurm error output (stderr)
- --output=<example.out>
Specify output file for stdout
- --time=<hh:mm:ss>
Maximum wall time for the job
- --exclusive
Exclusive access to node
- --gpus-per-node=<number>
Request GPU resources per node
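To illustrate, several of the options above can be combined in a single job script header. The sketch below is a hedged example only: the partition name, file names, memory value and the avx feature are placeholders that must match what is valid on your cluster.

```shell
#!/bin/bash
#SBATCH --partition=shared        #Queue to submit to
#SBATCH --job-name=myjob          #Name shown in squeue
#SBATCH --ntasks=4                #Total number of processes
#SBATCH --mem-per-cpu=2G          #Memory per core rather than per node
#SBATCH --constraint=avx          #Only run on nodes with the avx feature
#SBATCH --time=01:00:00           #Maximum wall time
#SBATCH --output=myjob.out        #File for stdout
#SBATCH --error=myjob.err         #File for stderr
echo "Job running"
```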
5.1.1.2. Slurm environment variables
$SLURM_XXXX
variables are useful built-in environment variables from Slurm that you can put into your scripts to make
them more automated and transferable. In the example script above, the Slurm environment variable $SLURM_SUBMIT_DIR
is used so that when the job runs, it changes to the directory from which it was submitted before running anything.
Environment Variables:
- $SLURM_JOB_ID:
ID of job allocation
- $SLURM_SUBMIT_DIR:
Directory job where was submitted
- $SLURM_JOB_NODELIST:
List of nodes allocated to the job
- $SLURM_NTASKS:
Total number of cores for job
- $SLURM_ARRAY_TASK_ID:
Index for array task
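Because these variables only exist inside a running job, one transferable pattern is to give them shell fallbacks so the same script also works outside Slurm. A minimal sketch; the fallback values here are our own illustrative choice, not something Slurm provides:

```shell
#!/bin/bash
# Use the Slurm-provided values when present, or a fallback otherwise.
NPROCS=${SLURM_NTASKS:-1}          # total cores for the job, default 1
WORKDIR=${SLURM_SUBMIT_DIR:-$PWD}  # submission directory, default here
cd "$WORKDIR"
echo "Running with $NPROCS process(es) in $WORKDIR"
```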
5.1.1.3. Submit a job
Once you have made your job script you need to submit it,
using the command sbatch <job_script>.
For example, to submit the job script we made earlier, slurm_test.sub:
[abc123@login7(eureka) ~]$ sbatch slurm_test.sub
Submitted batch job 40145
Once submitted, your job is allocated a job id number, which is used to reference a job and interact with it after it has been submitted.
We can then see our job in the queue using the command squeue,
or use squeue -u <username>
to see only the jobs you have submitted.
[abc123@login7(eureka) ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40143 shared castepjo abc123 R 8:01 1 node11
40072 shared raccoon abc123 R 1-20:07:59 1 node40
40145 shared hello abc123 R 0:01 2 node[102-103] <------ Here is my Job!
40116 shared es254.sh abc126 R 20:49:02 1 node108
40125 shared bash abc127 R 2:06:20 1 node07
34114 shared halo_332 abc131 R 4-11:08:00 4 node[30-33]
40090 gpu rbd_EPR_ abc140 R 1-03:18:31 1 node124
[abc123@login7(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40145 shared hello abc123 R 0:11 2 node[102-103]
If you have followed this example you should have the following outputs in your directory.
[abc123@login7(eureka) ~]$ ls -ltr
-rw-r--r-- 1 abc123 itsstaff 18084 Apr 17 12:48 helloworld.out
-rw-r--r-- 1 abc123 itsstaff 14 Apr 17 12:48 nodes
You can also query a job to get its full information using its job id number
with the command scontrol show job <Job Id Number>,
as shown below:
[abc123@login7(eureka) ~]$ scontrol show job 40145
JobId=40145 JobName=hello
UserId=abc123(282122) GroupId=itsstaff(40000) MCS_label=N/A
Priority=12940 Nice=0 Account=it QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2019-04-17T13:06:21 EligibleTime=2019-04-17T13:06:21
StartTime=2019-04-17T13:06:21 EndTime=2019-04-17T13:06:22 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login7:28627
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[102-103]
BatchHost=node102
NumNodes=2 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=4G,node=2
Socks/Node=* NtasksPerN:B:S:C=10:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=2G MinTmpDiskNode=0
Features=ib DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/users/abc123/slurm_test.sub
WorkDir=/users/abc123
StdErr=/users/abc123/helloworld.out
StdIn=/dev/null
StdOut=/users/abc123/helloworld.out
Power=
5.1.1.4. Stopping or cancelling a job
If you have submitted a job and want to delete or cancel it, you can use the command
scancel <Job ID Number>.
5.1.2. Interactive jobs
An interactive job puts you in an interactive shell on one or more compute nodes. This can be helpful as a debugging tool when creating job scripts for batch submission in a test scenario. It allows you to experiment on compute nodes with command options and environment variables, providing immediate feedback (helpful in determining your workflow!).
The way to allocate a node is with the command srun.
This is used to allocate resources, after which you can work interactively on the allocated node(s).
Resources for interactive sessions can be requested using the Slurm options shown above,
by adding them as arguments to the command srun --<option>=<value> --<option>=<value> --pty bash.
[abc123@login7(eureka) ~]$ srun -N 1 --exclusive --constraint=avx2 --time=02:00:00 --pty bash
[abc123@node14(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
108098 shared bash abc123 R 0:05 1 node14
[abc123@node14(eureka) ~]$ exit
exit
[abc123@login7(eureka) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[abc123@login7(eureka) ~]$
srun
can also be used to run a command interactively and then immediately close the allocation;
this is done by appending the command to the end of the srun invocation:
srun --<option>=<value> --<option>=<value> <command to run>.
[abc123@login7(eureka) ~]$ module load helloworld/1.1
[abc123@login7(eureka) ~]$ srun -N 1 --constraint=avx2 --time=02:00:00 mpirun -np 28 helloworld
Hello world from processor node19.swmgmt.eureka, rank 1 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 4 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 5 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 9 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 12 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 16 out of 28 processors
Hello world from processor node19.swmgmt.eureka, rank 17 out of 28 processors
5.1.3. Array jobs (submitting a batch of jobs)
Often you may need to submit hundreds of jobs over a list or an index. In these cases you should avoid creating and submitting hundreds of separate job scripts. Instead, you can submit them all from a single job script using array jobs. Array jobs are a quick and compact way to submit and collectively manage sets of similar jobs.
The example below demonstrates an array job, where the job is submitted for the indices 1 to 16,
specified by the #SBATCH --array=1-16
directive. The index is then referred to in the script
through the Slurm environment variable $SLURM_ARRAY_TASK_ID.
[abc123@login7(eureka) ~]$ vim slurm_array.sub
#!/bin/bash
#SBATCH --job-name=array
#SBATCH --array=1-16 #Array job indices/range for $SLURM_ARRAY_TASK_ID (can be incremented if desired)
#SBATCH --time=01:00:00
#SBATCH --partition=shared
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --error=array_%A_%a.err #Error file labelled by job number and array index.
# Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID > test_"$SLURM_ARRAY_TASK_ID"
In the example script above, the %A_%a
notation is filled in with the master job id (%A) and the array task id (%a).
This is a simple way to create output files in which the file name is different for each job in the array.
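A common use of $SLURM_ARRAY_TASK_ID is to map each array task onto one line of an input list, so that task 1 processes the first input, task 2 the second, and so on. A minimal sketch, where inputs.txt is a hypothetical file with one input name per line:

```shell
#!/bin/bash
# Pick the line of inputs.txt whose line number equals the array task id.
# inputs.txt is a hypothetical list file; replace it with your own.
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "Task $SLURM_ARRAY_TASK_ID processing $INPUT"
```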
There are different ways of specifying the array indices, depending on the jobs you have; examples are shown below:
# A job array with array tasks numbered from 0 to 31.
#SBATCH --array=0-31
# A job array with array tasks numbered 1, 2, 5, 19, 27.
#SBATCH --array=1,2,5,19,27
# A job array with array tasks numbered 1, 3, 5 and 7.
#SBATCH --array=1-7:2
To cancel an entire array job:
scancel <Job ID Number>
To cancel a specific task in an array job:
scancel <jobid>_<taskid>
To cancel a range of specific tasks in an array job:
scancel <jobid>_[<taskid>-<taskid>]
5.2. Slurm - Advanced topics
5.2.1. Handling Job dependencies
sbatch can also be used in workflows that involve multiple steps or checkpoints:
the --dependency
option allows a job to be launched on condition of the completion (or successful completion) of another job.
To submit a job that starts after the completion of a specified job, use: sbatch --dependency=afterok:<Job Number> <Job script>.
For example, the following submits the job script dependant_job.sub
so that it will only start upon the completion of job number 106178.
[abc123@login7(eureka) ~]$ sbatch --dependency=afterok:106178 dependant_job.sub
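When chaining several such steps, retyping job numbers by hand is error-prone. One common pattern captures the job id with sbatch's --parsable flag, which makes sbatch print only the id; a sketch, where first_step.sub and second_step.sub are hypothetical script names:

```shell
#!/bin/bash
# --parsable makes sbatch print only the job id, so it can be captured.
jid=$(sbatch --parsable first_step.sub)
# The second job stays pending until job $jid completes successfully.
sbatch --dependency=afterok:${jid} second_step.sub
```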
5.2.2. Requesting resources for MPI or OpenMP jobs
Up until now we have used either the combination of --nodes
and --ntasks-per-node
or --ntasks
to specify the number of cores we would like to use. Another option which can be used is --cpus-per-task.
With these multiple ways of allocating cores, there are several ways to obtain the same or a similar allocation.
For example, the following: --nodes=3 --ntasks=3 --cpus-per-task=3
is equivalent in terms of resource allocation to
--ntasks=9 --ntasks-per-node=3,
but is seen differently by Slurm and MPI: in the first case
3 processes are launched, and in the second case 9 processes are launched.
With that in mind, consider the following examples; there are a variety of ways to request cores depending on whether you are running MPI (distributed) jobs or OpenMP (single-node parallel) jobs:
5.2.2.1. MPI (or distributed)
You use MPI and do not care where those cores are distributed:
--ntasks=16
You want to launch 16 independent processes (no communication):
--ntasks=16
You want those cores spread across distinct nodes:
--ntasks=16 --ntasks-per-node=1 or --ntasks=16 --nodes=16
You want 10 processes spread across 5 nodes, with two processes per node:
--ntasks=10 --ntasks-per-node=2
You want 10 processes to stay on the same node:
--ntasks=10 --ntasks-per-node=10
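For the OpenMP (single-node parallel) case, the usual pattern is a single task with several cores per task. A sketch, where ./omp_program stands in for your own threaded executable:

```shell
#!/bin/bash
#SBATCH --ntasks=1           #A single process...
#SBATCH --cpus-per-task=8    #...with 8 cores for its threads
# Match the OpenMP thread count to the cores Slurm granted.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_program
```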
5.2.3. Checkpointing HPC jobs
5.2.3.1. Why checkpoint?
Running jobs might be interrupted for a number of reasons. The objective of checkpointing is to avoid losing simulation time by having to restart from the very beginning, i.e. from the initial conditions.
Important
Here is a Wikipedia article on the subject: https://en.wikipedia.org/wiki/Application_checkpointing
When a checkpointing mechanism is implemented, the job saves checkpoints regularly or in response to a termination signal. Regular checkpoints are helpful in the event your simulation reaches its maximum wall time, or in the event of infrastructure failures.
5.2.3.2. How to checkpoint
Checkpointing jobs is considered a best practice approach for all HPC jobs so you should have this built into your workflow. Checkpointing can be achieved in many ways - the simplest example would be to have the simulation generate an output file once a day, which you can then use as “input” (i.e. initial conditions) to restart and continue a job.
How checkpointing is implemented for each job depends on the software used. Please refer to the documentation of the software or libraries used to determine how you can make your job checkpoint itself; many applications have a built-in mechanism to achieve this instead of using input/output files, so please read up and use this if possible.
On gaining more familiarity and experience with the nature of your software and jobs, you could eventually automate this checkpointing process by means of triggers.
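The restart logic itself is often just a few lines of shell at the top of the job script. A sketch; the file names checkpoint.dat and input.dat and the --restart flag are hypothetical and depend entirely on your application:

```shell
#!/bin/bash
# Restart from the latest checkpoint if one exists, else start fresh.
if [ -f checkpoint.dat ]; then
    echo "restarting from checkpoint"
    # ./my_sim --restart checkpoint.dat    (application-specific)
else
    echo "starting from initial conditions"
    # ./my_sim input.dat                   (application-specific)
fi
```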
5.2.4. Benchmarking and scaling
Benchmarking and scaling are very important aspects of running simulations on HPC clusters; they can help maximise your output and minimise the waste of resources.
One of the most important things to do before running production workloads is to benchmark your problem against the number of cores you might use, to ensure you are asking for the correct amount of resources.
Below are plots of a benchmarking exercise of a DFT B3LYP energy calculation of 2 molecules:
Both plots show the time of the calculation as a function of the number of cores, and it can be seen that it scales quite well, with significant reductions in run time up to 20 cores in both cases.
However, in these examples there is no improvement beyond 20 cores, and therefore no benefit in running the calculation on any more cores.
In fact, requesting more cores in your Slurm job script would waste resources which could be allocated to another job.
Although not shown here, asking for too many cores can also result in a significant slowdown of a calculation due to the parallel overhead, turning the above plots into U-shaped plots!
Tip
Having the scaling of your problem for every simulation is not necessary, but using model problems to inform yourself is a very good idea, so you can gauge the correct amount of resources to request when running day to day jobs.
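One simple way to run such a scaling study is to submit the same job script several times with different core counts from the command line, since options passed to sbatch override the #SBATCH directives inside the script. A sketch, where bench.sub is a hypothetical job script:

```shell
#!/bin/bash
# Submit the same script at a range of core counts to measure scaling;
# compare the resulting run times to find the sweet spot.
for n in 1 2 4 8 16 32; do
    sbatch --ntasks=${n} --job-name=bench_${n} bench.sub
done
```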
5.3. Slurm - useful commands
sinfo
View information about status of cluster, partitions and node usage
[abc123@login7(eureka) ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shared* up 7-00:00:00 2 down* node[101,136]
shared* up 7-00:00:00 3 resv node[01,43-44]
shared* up 7-00:00:00 32 mix node[02,04,06-08,15-21,32,34,36-40,45,50,109-118,140]
shared* up 7-00:00:00 29 alloc galaxy[1-13],node[09-14,22-25,30-31,35,42,46,49]
shared* up 7-00:00:00 22 idle node[03,05,26-29,33,102-108,120-121,132-135,139,141]
debug_latest up 1:00:00 1 idle node41
debug_all up 1:00:00 1 idle node100
high_mem up 7-00:00:00 8 idle node[119,125-131]
gpu up 7-00:00:00 3 idle node[122-124]
sacct
Retrieve job information from Slurm after the job has finished. Retrieve information based on job number:
sacct -j <Job Number> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
[abc123@login7(eureka) ~]$ sacct -j 56814 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
   abc123 56814             Hello     shared  COMPLETED   00:05:00 2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01                              4        112 node[45-46,49-+
          56814.batch       batch             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01      1160K    147756K        1         28          node45
          56814.0       pmi_proxy             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01       828K    213820K        4          4 node[45-46,49-+
Get information by user and time:
sacct --starttime YYYY-MM-DD -u <username> --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
[abc123@login7(eureka) ~]$ sacct --starttime 2020-01-17 -u abc123 --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
       JobID      State  Timelimit    JobName     MaxRSS    Elapsed        NodeList
------------ ---------- ---------- ---------- ---------- ---------- ---------------
106178        COMPLETED   00:10:00      hello              00:02:00          node45
106178.batch  COMPLETED                 batch      1956K   00:02:00          node45
106179        COMPLETED   00:10:00      hello              00:02:00          node45
106179.batch  COMPLETED                 batch      1908K   00:02:00          node45
scancel
The command scancel can be used in multiple ways to delete several jobs rather than one at a time. Cancel a specific job:
scancel <job-id>
cancel all your jobs:
scancel -u <username>
cancel all your jobs in the PENDING status:
scancel -t PENDING -u <username>
cancel all your jobs in the RUNNING status:
scancel -t RUNNING -u <username>
slurmtop
Like the Linux command top, but displays the layout of jobs running on the compute nodes.
sview
Display a graphical overview of cluster usage.