Job Management¶
Slurm offers several tools for checking the queues and interacting with your jobs on the cluster.
Checking the job queue (squeue)¶
To view every job in the queue, use the command squeue.
To see only the jobs you have submitted, use squeue -u <username>.
[abc123@login1(eureka2) ~]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40143 shared castepjo abc123 R 8:01 1 node11
40072 shared raccoon abc123 R 1-20:07:59 1 node40
40145 shared hello abc123 R 0:01 2 node[102-103] <------ Here is my Job!
40116 shared es254.sh abc126 R 20:49:02 1 node108
40125 shared bash abc127 R 2:06:20 1 node07
34114 shared halo_332 abc131 R 4-11:08:00 4 node[30-33]
40090 gpu rbd_EPR_ abc140 R 1-03:18:31 1 node124
[abc123@login1(eureka2) ~]$ squeue -u abc123
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
40145 shared hello abc123 R 0:11 2 node[102-103]
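When the queue is long, a per-state count can be easier to read than the full listing. The following is a minimal sketch, assuming the default squeue column layout shown above (ST is the fifth column) and job names without spaces; count_jobs_by_state is a hypothetical helper name, not a Slurm command.

```shell
# Summarise squeue output read from stdin as one "STATE COUNT" line per state.
# Assumption: default squeue columns, so the state code (ST) is field 5.
# Job names containing spaces would shift the columns and break this.
count_jobs_by_state() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Typical use on the cluster (requires Slurm on the PATH):
#   squeue -u "$USER" | count_jobs_by_state
```

The helper only reads text from standard input, so it can be tried on a saved copy of squeue output before being used on the live queue.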
If you have followed this example, you should have the following output files in your directory.
[abc123@login1(eureka2) ~]$ ls -ltr
-rw-r--r-- 1 abc123 itsstaff 18084 Apr 17 12:48 helloworld.out
-rw-r--r-- 1 abc123 itsstaff 14 Apr 17 12:48 nodes
You can also query the full details of a job by its job ID number
using the command scontrol show job <Job Id Number>, as shown below:
[abc123@login1(eureka2) ~]$ scontrol show job 40145
JobId=40145 JobName=hello
UserId=abc123(282122) GroupId=itsstaff(40000) MCS_label=N/A
Priority=12940 Nice=0 Account=it QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:01 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2019-04-17T13:06:21 EligibleTime=2019-04-17T13:06:21
StartTime=2019-04-17T13:06:21 EndTime=2019-04-17T13:06:22 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=shared AllocNode:Sid=login7:28627
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node[102-103]
BatchHost=node102
NumNodes=2 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=4G,node=2
Socks/Node=* NtasksPerN:B:S:C=10:0:*:* CoreSpec=*
MinCPUsNode=10 MinMemoryNode=2G MinTmpDiskNode=0
Features=ib DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/users/abc123/slurm_test.sh
WorkDir=/users/abc123
StdErr=/users/abc123/hello.out
StdIn=/dev/null
StdOut=/users/abc123/hello.out
Power=
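The scontrol output is a series of Key=Value pairs, so a single field can be pulled out with standard text tools. This is a minimal sketch rather than an official method; job_field is a hypothetical helper name, and it assumes the value contains no spaces, which holds for fields such as JobState, TimeLimit and NodeList in the output above.

```shell
# Print the value of one Key=Value field from "scontrol show job" output on stdin.
# Assumption: the value contains no spaces (true for JobState, TimeLimit, NodeList).
job_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '
        !found && $1 == key {
            sub(/^[^=]*=/, "")   # drop "Key=" and keep the rest of the pair
            print
            found = 1
        }'
}

# Typical use on the cluster (requires Slurm on the PATH):
#   scontrol show job 40145 | job_field JobState
```

If the field is not present, the helper prints nothing, which makes it easy to test for in a script.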
Cancel a job (scancel)¶
If you have submitted a job and want to cancel it, use the command
scancel <Job ID Number>. Other common forms are:
- scancel <jobid>:
cancel a specific job
- scancel -u <username>:
cancel all your jobs
- scancel -t PENDING -u <username>:
cancel all your jobs in the PENDING state
- scancel -t RUNNING -u <username>:
cancel all your jobs in the RUNNING state
- scancel <jobid>_<taskid>:
cancel a specific task in an array job
- scancel <jobid>_[<taskid>-<taskid>]:
cancel a range of tasks in an array job
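The array-range form contains square brackets, which the shell may try to expand as a filename pattern if the argument is left unquoted. The sketch below builds and validates that argument so it can be passed to scancel in quotes; array_range_spec and is_uint are hypothetical helper names, and the job ID 40200 in the usage comment is made up for illustration.

```shell
# Return success only if the argument is a non-empty string of digits.
is_uint() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the job specifier for a range of array tasks, e.g. 40200_[3-7].
array_range_spec() {
    if [ "$#" -ne 3 ] || ! is_uint "$1" || ! is_uint "$2" || ! is_uint "$3"; then
        echo "usage: array_range_spec <jobid> <first-task> <last-task>" >&2
        return 1
    fi
    if [ "$2" -gt "$3" ]; then
        echo "array_range_spec: first task must not exceed last task" >&2
        return 1
    fi
    printf '%s_[%s-%s]\n' "$1" "$2" "$3"
}

# Typical use on the cluster (hypothetical job ID, requires Slurm on the PATH):
#   scancel "$(array_range_spec 40200 3 7)"
```

Quoting the command substitution keeps the brackets intact, so scancel receives the range exactly as written.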
View cluster information (sinfo)¶
sinfo shows the status of the cluster, its partitions, and node usage.
[abc123@login1(eureka2) ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
shared* up 7-00:00:00 2 down* node[101,136]
shared* up 7-00:00:00 3 resv node[01,43-44]
shared* up 7-00:00:00 32 mix node[02,04,06-08,15-21,32,34,36-40,45,50,109-118,140]
shared* up 7-00:00:00 29 alloc galaxy[1-13],node[09-14,22-25,30-31,35,42,46,49]
shared* up 7-00:00:00 22 idle node[03,05,26-29,33,102-108,120-121,132-135,139,141]
debug_latest up 1:00:00 1 idle node41
debug_all up 1:00:00 1 idle node100
high_mem up 7-00:00:00 8 idle node[119,125-131]
gpu up 7-00:00:00 3 idle node[122-124]
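Before submitting, it can help to know how many nodes are idle in each partition. The following is a minimal sketch, assuming the default sinfo column layout shown above (NODES is the fourth column and STATE the fifth); idle_nodes_by_partition is a hypothetical helper name.

```shell
# Sum the idle node counts per partition from sinfo output read on stdin.
# Assumption: default sinfo columns (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST).
# Only rows whose state is exactly "idle" are counted; flagged variants such as
# "idle*" are ignored here.
idle_nodes_by_partition() {
    awk 'NR > 1 && $5 == "idle" {
             sub(/\*$/, "", $1)      # strip the "*" that marks the default partition
             idle[$1] += $4
         }
         END { for (p in idle) print p, idle[p] }' | LC_ALL=C sort
}

# Typical use on the cluster (requires Slurm on the PATH):
#   sinfo | idle_nodes_by_partition
```

Partitions with no idle nodes are simply absent from the result.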
View job stats after job completion (sacct)¶
sacct retrieves useful job statistics from Slurm after a job has finished.
Retrieve information based on job number:
sacct -j <Job Number> --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
[abc123@login1(eureka2) ~]$ sacct -j 56814 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
User      JobID        JobName    Partition  State      Timelimit  Start               End                 Elapsed    MaxRSS     MaxVMSize  NNodes   NCPUS      NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
abc123    56814        Hello      shared     COMPLETED  00:05:00   2019-08-13T14:45:59 2019-08-13T14:46:00 00:00:01                         4        112        node[45-46,49-+
          56814.batch  batch                 COMPLETED             2019-08-13T14:45:59 2019-08-13T14:46:00 00:00:01   1160K      147756K    1        28         node45
          56814.0      pmi_proxy             COMPLETED             2019-08-13T14:45:59 2019-08-13T14:46:00 00:00:01   828K       213820K    4        4          node[45-46,49-+
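Memory figures are reported per job step, as in the 56814.batch and 56814.0 rows above, so finding the peak means comparing the steps. The sketch below reads pipe-delimited output from sacct's --parsable2 and --noheader options and reports the step with the largest MaxRSS in KiB. peak_rss_kib is a hypothetical helper name, and the sketch assumes every value carries a K, M or G suffix as in the output above; values in any other form are skipped.

```shell
# Read "JobID|MaxRSS" lines on stdin and print the step with the largest MaxRSS,
# converted to KiB. Assumption: values end in K, M or G; anything else is skipped.
peak_rss_kib() {
    awk -F'|' '
        $2 != "" {
            value = $2 + 0                      # numeric part, e.g. 1160 from "1160K"
            unit  = substr($2, length($2))      # last character is the unit suffix
            if (unit == "K")      scale = 1
            else if (unit == "M") scale = 1024
            else if (unit == "G") scale = 1024 * 1024
            else next
            value *= scale
            if (value > peak) { peak = value; step = $1 }
        }
        END { if (step != "") printf "%s %d\n", step, peak }'
}

# Typical use on the cluster (requires Slurm on the PATH):
#   sacct -j <Job Number> --format=JobID,MaxRSS --parsable2 --noheader | peak_rss_kib
```

Comparing this figure with the memory requested for the job is one way to judge whether a future request could be made smaller.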
Get information by user and time:
sacct --starttime YYYY-MM-DD -u <username> --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
[abc123@login1(eureka2) ~]$ sacct --starttime 2020-01-17 -u abc123 --format=JobID,state,time,JobName,MaxRSS,Elapsed,nodelist
JobID        State      Timelimit  JobName    MaxRSS     Elapsed    NodeList
------------ ---------- ---------- ---------- ---------- ---------- ---------------
106178       COMPLETED  00:10:00   hello                 00:02:00   node45
106178.batch COMPLETED             batch      1956K      00:02:00   node45
106179       COMPLETED  00:10:00   hello                 00:02:00   node45
106179.batch COMPLETED             batch      1908K      00:02:00   node45
Check current cluster utilisation (sview & slurmtop)¶
slurmtop is like the Linux command top, but it displays the layout of jobs running on the compute nodes.
sview displays a graphical overview of the live (current) cluster utilisation.