Job Management

Slurm offers several tools for checking the queues and interacting with your jobs on the cluster.

Checking the job queue (squeue)

You can view all jobs in the queue with the command squeue, or use squeue -u <username> to see only the jobs you have submitted.

checking our job with squeue
[abc123@login1(eureka2) ~]$ squeue
            JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
            40143    shared castepjo   abc123  R       8:01      1 node11
            40072    shared  raccoon   abc123  R 1-20:07:59      1 node40
            40145    shared    hello   abc123  R       0:01      2 node[102-103]   <------ Here is my Job!
            40116    shared es254.sh   abc126  R   20:49:02      1 node108
            40125    shared     bash   abc127  R    2:06:20      1 node07
            34114    shared halo_332   abc131  R 4-11:08:00      4 node[30-33]
            40090       gpu rbd_EPR_   abc140  R 1-03:18:31      1 node124

[abc123@login1(eureka2) ~]$ squeue -u abc123
            JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            40145    shared    hello   abc123  R       0:11      2 node[102-103]
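
When the queue is long, a per-state count can be easier to read than the full listing. The following is a minimal sketch, assuming the default squeue column layout shown above (state in the fifth column) and job names that contain no spaces. count_job_states is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: count jobs per state from default squeue output on stdin.
# Assumes the default column layout, where field 5 is the ST (state) column.
count_job_states() {
    # NR > 1 skips the header line; NF >= 5 skips blank or short lines.
    awk 'NR > 1 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}
```

It could be used as: squeue -u <username> | count_job_states. A job name containing spaces would shift the columns and give wrong counts.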

If you have followed this example, you should have the following output files in your directory.

checking output from the job
[abc123@login1(eureka2) ~]$ ls -ltr
-rw-r--r--  1 abc123 itsstaff      18084 Apr 17 12:48 helloworld.out
-rw-r--r--  1 abc123 itsstaff         14 Apr 17 12:48 nodes

You can also query the full details of a job by its job ID number using the command scontrol show job <Job Id Number>, as shown below:

using scontrol to view job information
[abc123@login1(eureka2) ~]$ scontrol show job 40145
JobId=40145 JobName=hello
UserId=abc123(282122) GroupId=itsstaff(40000) MCS_label=N/A
Priority=12940 Nice=0 Account=it QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2019-04-17T13:06:21 EligibleTime=2019-04-17T13:06:21
StartTime=2019-04-17T13:06:21 EndTime=2019-04-17T13:06:22 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login7:28627
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[102-103]
BatchHost=node102
NumNodes=2 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=4G,node=2
Socks/Node=* NtasksPerN:B:S:C=10:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=2G MinTmpDiskNode=0
Features=ib DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/users/abc123/slurm_test.sh
WorkDir=/users/abc123
StdErr=/users/abc123/hello.out
StdIn=/dev/null
StdOut=/users/abc123/hello.out
Power=
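
Because scontrol prints whitespace-separated Key=Value pairs, a single field can be pulled out with standard text tools. The following is a minimal sketch under that assumption. job_field is a hypothetical helper name, and values that themselves contain spaces (for example a Command with arguments) would be cut at the first space.

```shell
# Hypothetical helper: print the value of one Key=Value field from
# "scontrol show job" output read on stdin.
job_field() {
    # Put each Key=Value pair on its own line, then match on the key.
    tr -s ' \t' '\n' |
        awk -F= -v key="$1" '
            !found && $1 == key {
                sub(/^[^=]*=/, "")   # drop "Key=" and keep the rest, even if it contains "="
                print
                found = 1
            }'
}
```

It could be used as: scontrol show job <Job Id Number> | job_field JobState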

Cancel a job (scancel)

If you have submitted a job and want to cancel it, use the command scancel <Job ID Number>.

scancel <job-id>:

cancel a specific job

scancel -u <username>:

cancel all your jobs

scancel -t PENDING -u <username>:

cancel all your jobs in the PENDING state

scancel -t RUNNING -u <username>:

cancel all your jobs in the RUNNING state

scancel <jobid>_<taskid>:

cancel a specific task in an array job

scancel <jobid>_[<taskid>-<taskid>]:

cancel a range of tasks in an array job
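
The array-range form uses square brackets, which the shell may try to expand as a filename pattern, so it is safest to quote the argument. The following is a minimal sketch that builds the quoted target string. array_target is a hypothetical helper name, not part of Slurm.

```shell
# Hypothetical helper: build the scancel target for a range of array tasks,
# in the form <jobid>_[<first>-<last>].
array_target() {
    # Require exactly three non-empty, all-digit arguments.
    if [ "$#" -ne 3 ]; then
        echo "usage: array_target <jobid> <first-task> <last-task>" >&2
        return 1
    fi
    for v in "$1" "$2" "$3"; do
        case "$v" in
            ''|*[!0-9]*)
                echo "array_target: not a number: '$v'" >&2
                return 1
                ;;
        esac
    done
    printf '%s_[%s-%s]\n' "$1" "$2" "$3"
}
```

It could be used as: scancel "$(array_target <jobid> 1 3)". The double quotes stop the shell from treating the brackets as a glob.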

View cluster information (sinfo)

sinfo shows the status of the cluster, its partitions, and node usage.

[abc123@login1(eureka2) ~]$ sinfo
PARTITION    AVAIL  TIMELIMIT  NODES  STATE NODELIST
shared*         up 7-00:00:00      2  down* node[101,136]
shared*         up 7-00:00:00      3   resv node[01,43-44]
shared*         up 7-00:00:00     32    mix node[02,04,06-08,15-21,32,34,36-40,45,50,109-118,140]
shared*         up 7-00:00:00     29  alloc galaxy[1-13],node[09-14,22-25,30-31,35,42,46,49]
shared*         up 7-00:00:00     22   idle node[03,05,26-29,33,102-108,120-121,132-135,139,141]
debug_latest    up    1:00:00      1   idle node41
debug_all       up    1:00:00      1   idle node100
high_mem        up 7-00:00:00      8   idle node[119,125-131]
gpu             up 7-00:00:00      3   idle node[122-124]
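
To see at a glance where free nodes are, the sinfo listing can be reduced to an idle-node count per partition. The following is a minimal sketch, assuming the default column layout shown above (node count in the fourth column, state in the fifth). idle_nodes is a hypothetical helper name, and states with a suffix such as idle* are deliberately not counted.

```shell
# Hypothetical helper: sum idle nodes per partition from default sinfo output on stdin.
# Assumes field 1 = PARTITION, field 4 = NODES, field 5 = STATE.
idle_nodes() {
    awk 'NR > 1 && $5 == "idle" { n[$1] += $4 }
         END { for (p in n) print p, n[p] }' | LC_ALL=C sort
}
```

It could be used as: sinfo | idle_nodes. A node that belongs to more than one partition is counted once in each.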

View job stats after job completion (sacct)

sacct retrieves useful job statistics from Slurm after the job has finished.

Retrieve information based on job number:

sacct -j <Job Number> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

sacct example by job id number
[abc123@login1(eureka2) ~]$ sacct -j 56814 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
   abc123 56814             Hello     shared  COMPLETED   00:05:00 2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01                              4        112 node[45-46,49-+
          56814.batch       batch             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01      1160K    147756K        1         28          node45
          56814.0       pmi_proxy             COMPLETED            2019-08-13T14:45:59 2019-08-13T14:46:00   00:00:01       828K    213820K        4          4 node[45-46,49-+
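
MaxRSS is reported per job step, so the peak memory of a job is the largest value across its steps. The following is a minimal sketch that finds that peak from pipe-delimited sacct output, as produced with the -n (no header) and -P (parsable) options and --format=JobID,MaxRSS. peak_rss_kib is a hypothetical helper name. It assumes the K, M, and G suffixes are binary multiples and that a value without a suffix is in bytes.

```shell
# Hypothetical helper: print the largest MaxRSS, in KiB, from lines of the
# form "JobID|MaxRSS" read on stdin.
peak_rss_kib() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                      # numeric prefix, e.g. 1956 from "1956K"
            unit  = substr($2, length($2), 1)   # trailing unit letter, if any
            if      (unit == "K") kib = value
            else if (unit == "M") kib = value * 1024
            else if (unit == "G") kib = value * 1024 * 1024
            else                  kib = value / 1024   # assumption: plain bytes
            if (kib > peak) peak = kib
        }
        END { printf "%d\n", peak }'
}
```

It could be used as: sacct -j <Job Number> -n -P --format=JobID,MaxRSS | peak_rss_kib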

Get information by user and time:

sacct --starttime YYYY-MM-DD -u <username> --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist

sacct example by job user and time
[abc123@login1(eureka2) ~]$ sacct --starttime 2020-01-17 -u abc123 --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
       JobID      State  Timelimit    JobName     MaxRSS    Elapsed        NodeList
------------ ---------- ---------- ---------- ---------- ---------- ---------------
106178        COMPLETED   00:10:00      hello              00:02:00          node45
106178.batch  COMPLETED                 batch      1956K   00:02:00          node45
106179        COMPLETED   00:10:00      hello              00:02:00          node45
106179.batch  COMPLETED                 batch      1908K   00:02:00          node45

Check current cluster utilisation (sview & slurmtop)

slurmtop is like the Linux command top, but displays the layout of jobs running on the compute nodes.

sview displays a graphical overview of the live (current) cluster utilisation.

[Image: screenshot of sview (sview.png)]