5. Condor user guide

5.1. What is a job?

HTCondor refers to any single computing task as a “job”. A job is accompanied by several attributes, such as its executable, its input and output files, and so on.

A Submit file gathers all these attributes in a single place to be communicated to HTCondor.

|-- Executable
|-- Input          -->   Submit File   -->    HTCondor Q  -->   HTCondor Execute   -->    Results
|-- Output
|-- Requirements

Jobs have their own life cycle starting from a submit file, being submitted into the queue, being allocated an execution slot, executing, and completing.

Running a Job: the Steps to take

5.2. Submitting jobs

5.2.1. Job submit file

A condor submit file is just a text file whose purpose is to hold a description of the job to be run. As condor provides default values for most attributes, it can be as simple as a few lines, but the more descriptive the better.

an example condor submit file
abc123@orca:~ $ cat submit_file
####################
#
# Example Job for HTCondor
#
####################

# --------------------------------------------
# Executable and its arguments
executable    = myexe
arguments     = -a1 one -a2 two

# --------------------------------------------
# Input, Output and Log files
log    = $(cluster).$(process).myexe.log
output = $(cluster).$(process).myexe.out
error  = $(cluster).$(process).myexe.error

# -------------------------------------
# Requirements for the Job
requirements  = ( has_avx2 == true ) && ( CUDACapability >= 5 )

# --------------------------------------
# Resource requirements
request_GPUs     = 1
request_CPUs     = 2
request_memory = 4096

# -----------------------------------
# Queue commands
queue 1

There is a long list of parameters that can go into the submit file. For a full reference, have a look at HTCondor Manual: 2.5 Submitting a Job

5.2.2. Submit a job

To submit a job use the condor_submit command:

abc123@orca:~ $ condor_submit submit_file
Submitting job(s).
1 job(s) submitted to cluster 270.

Submitted jobs are assigned a Job ID. Because multiple jobs can be submitted with a single submit file, there is the notion of a cluster of processes: JobID = ClusterId.ProcessId
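For example, a submit file ending with the command queue 3 creates a single cluster containing three processes. The cluster number below is hypothetical and the output is indicative:

abc123@orca:~ $ condor_submit submit_file
Submitting job(s)...
3 job(s) submitted to cluster 271.

condor_q -nobatch would then list the three jobs as 271.0, 271.1 and 271.2.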

If the description file has some syntax error or is otherwise wrong by policy, the submission will fail.

Remember to use condor_submit -i for interactive jobs.

5.2.3. Check job status

To monitor submitted jobs use the condor_q command:

abc123@orca:~ $ condor_q

-- Schedd: orca.eps.surrey.ac.uk : <131.227.81.42:1870?... @ 05/07/18 11:31:12
OWNER  BATCH_NAME             SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
abc123 CMD:                  4/30 12:11      _      _      _      2      2 249.0 ... 250.0
abc123 CMD: /usr/bin/python  5/1  17:45      _      4      _      1      5 269.0 ... 291.0

7 jobs; 0 completed, 0 removed, 0 idle, 4 running, 3 held, 0 suspended

condor_q is the command to use to get information about jobs that are still in the pipeline (somewhere between submission and completion). After completion you need to use the condor_history command to get information about a job’s run.

A Job can be in a number of states, in natural order of occurrence:

  • Idle which means it is waiting for a slot to be allocated for its execution

  • Running which means it is running somewhere in the pool

  • Completed which means that it has completed execution. Other processes of the same cluster might still be executing though.

  • Suspended which means that the job has paused execution

  • Held which means that the job has encountered some problem and is put on hold for inspection.

By default, condor_q will show only the user’s jobs summarised in batches. Use the -nobatch flag to see individual job information with extra details. To get a full listing of all the information condor holds about a running job, use condor_q -l.
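For example, to list individual jobs and then dump the full ClassAd of one of them (the job ID here is hypothetical; the grep simply narrows the long listing down to the memory-related attributes):

abc123@orca:~ $ condor_q -nobatch
abc123@orca:~ $ condor_q -l 270.0 | grep -i memory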

Tip

If you would like to watch the condor queue you can run condor_watch_q, which is more efficient than running condor_q again and again. https://htcondor.readthedocs.io/en/latest/man-pages/condor_watch_q.html?highlight=condor_watch_q

Supported on Condor Pools running a new enough version of Condor

5.3. Job status

5.3.1. Idle jobs

Jobs are idle while waiting for a suitable execution slot to be allocated to them.

The most interesting thing to see while a job is idle, and the answer to the question “Why is my job still idle?”, is the progress of the matchmaking process carried out by the negotiator.

To get an analysis of the matchmaking progress for a job, use:

abc123@condor:~  > condor_q 32648.0 -better-analyze


-- Schedd: condor.eps.surrey.ac.uk : <131.227.81.42:24662?...
The Requirements expression for job 32648.000 is

    ( ( HasStornext == true ) && ( CUDACapability >= 5 ) ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) &&
    ( TARGET.GPUs >= RequestGPUs ) && ( TARGET.FileSystemDomain == MY.FileSystemDomain )

Job 32648.000 defines the following attributes:

    DiskUsage = 1
    FileSystemDomain = "eps.surrey.ac.uk"
    RequestCpus = 4
    RequestDisk = DiskUsage
    RequestGPUs = 1
    RequestMemory = 12000

The Requirements expression for job 32648.000 reduces to these conditions:

        Slots
Step    Matched  Condition
-----  --------  ---------
[1]          68  CUDACapability >= 5
[9]          77  TARGET.Memory >= RequestMemory
[10]         54  [1] && [9]
[11]         52  TARGET.Cpus >= RequestCpus
[12]         26  [10] && [11]
[13]         61  TARGET.GPUs >= RequestGPUs
[14]         16  [12] && [13]


32648.000:  Run analysis summary ignoring user priority.  Of 96 machines,
    52 are rejected by your job's requirements
    0 reject your job because of their own requirements
    28 are exhausted partitionable slots
    0 match and are already running your jobs
    16 match but are serving other users
    0 are available to run your job

If you think your job is supposed to run on a specific machine but it is instead sitting idle, perform a reverse machine lookup using the -reverse and -machine flags:

abc123@condor:~  > condor_q 32648.0 -better-analyze -reverse -machine cvppe03.eps.surrey.ac.uk


-- Schedd: condor.eps.surrey.ac.uk : <131.227.81.42:24662?...

-- Slot: slot1@cvppe03.eps.surrey.ac.uk : Analyzing matches for 1 Jobs in 1 autoclusters

The Requirements expression for this slot is

    ( START ) && ( IsValidCheckpointPlatform ) &&
            ( WithinResourceLimits )

START is
    ifThenElse(DetectedGPUs >= 1,( TARGET.RequestGPUs >= 1 &&
                        TARGET.RequestCpus <= ( DetectedCpus / DetectedGPUs ) ),true)

IsValidCheckpointPlatform is
    ( TARGET.JobUniverse isnt 1 ||
            ( ( MY.CheckpointPlatform isnt undefined ) &&
                ( ( TARGET.LastCheckpointPlatform is MY.CheckpointPlatform ) ||
                    ( TARGET.NumCkpts == 0 ) ) ) )

WithinResourceLimits is
    ( ifThenElse(TARGET._cp_orig_RequestCpus isnt undefined,TARGET.RequestCpus <= MY.Cpus,MY.ConsumptionCpus <= MY.Cpus) &&
    ifThenElse(TARGET._cp_orig_RequestDisk isnt undefined,TARGET.RequestDisk <= MY.Disk,MY.ConsumptionDisk <= MY.Disk) &&
    ifThenElse(TARGET._cp_orig_RequestGPUs isnt undefined,TARGET.RequestGPUs <= MY.GPUs,MY.ConsumptionGPUs <= MY.GPUs) &&
    ifThenElse(TARGET._cp_orig_RequestMemory isnt undefined,TARGET.RequestMemory <= MY.Memory,MY.ConsumptionMemory <= MY.Memory) )

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 4.4.0-141-generic normal N/A none"
    ConsumptionCpus = quantize(target.RequestCpus,{ 1 })
    ConsumptionDisk = quantize(target.RequestDisk,{ 1024 })
    ConsumptionGPUs = ifthenelse(target.RequestGPUs =?= undefined,0,target.RequestGPUs)
    ConsumptionMemory = quantize(target.RequestMemory,{ 128 })
    Cpus = 16
    DetectedCpus = 16
    DetectedGPUs = 0
    Disk = 119524384
    GPUs = 0
    Memory = 32169

Job 32648.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0
    TARGET.RequestCpus = 4
    TARGET.RequestDisk = 1
    TARGET.RequestGPUs = 1
    TARGET.RequestMemory = 12000

The Requirements expression for this slot reduces to these conditions:

    Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           1  START
[1]           1  IsValidCheckpointPlatform
[3]           0  WithinResourceLimits

slot1@cvppe03.eps.surrey.ac.uk: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    0 have job requirements that match this slot.

5.3.2. Running jobs

When a job is running, you can check its status with condor_q. Several attributes are updated at regular intervals by the job starter, which monitors the job's progress on behalf of condor. You can see all the attributes using condor_q -l.

You can also inspect it in real time by use of condor_ssh_to_job.

abc123@orca:~/trials/condor/condor-examples/Docker/Example09  > condor_ssh_to_job 184
Welcome to slot1@whale01.eps.surrey.ac.uk!
Your condor job is running with pid(s) 1006.
I have no name!@abc123-184:/var/lib/condor/execute/dir_989  > ls -lha
total 60K
drwx------ 6 247166 40000 4.0K Mar 13 13:36 .
drwxr-xr-x 3 root   root  4.0K Mar 13 13:36 ..
-rwx------ 1 247166 40000   49 Mar 13 13:36 .chirp.config
drwxr-xr-x 2 247166 40000 4.0K Mar 13 13:36 .condor_ssh_to_job_1
srwxr-xr-x 1 247166 40000    0 Mar 13 13:36 .docker_sock
-rw-r--r-- 1    130   143 3.8K Mar 13 13:36 .job.ad
-rw-r--r-- 1    130   143 5.8K Mar 13 13:36 .machine.ad
-rw-r--r-- 1 247166 40000 5.6K Mar 13 13:36 .update.ad
-rw-r--r-- 1 247166 40000 2.6K Mar 13 13:36 _condor_stderr
-rw-r--r-- 1 247166 40000    0 Mar 13 13:36 _condor_stdout
-rwxr-xr-x 1 247166 40000    0 Mar 13 13:36 docker_stderror
drwxr-xr-x 2 247166 40000 4.0K Mar 13 13:36 mnist
-rwxr-xr-x 1 247166 40000 7.8K Mar 13 13:36 mnist_deep.py
drwx------ 2 247166 40000 4.0K Mar 13 13:36 tmp
drwx------ 3 247166 40000 4.0K Mar 13 13:36 var
I have no name!@abc123-184:/var/lib/condor/execute/dir_989  > exit
read returned, exiting
Connection to condor-job.whale01.eps.surrey.ac.uk closed.
abc123@orca:~/trials/condor/condor-examples/Docker/Example09  >

Notice how it takes you to the scratch execution directory created for your job. It contains a copy of the job and machine ClassAds, the error and output files, and anything else produced by the job.

5.3.3. Held jobs

HTCondor puts a job in the Hold (“H”) state if there is something that needs fixing by the user. Common hold reasons include:

  • Incorrect path to files that are needed

  • Badly formatted scripts

  • Disk quotas

  • Admin

Check the hold reason; it is an attribute in the job ClassAd:

abc123@condor:~ $ condor_q -hold

-- Schedd: condor.eps.surrey.ac.uk : <131.227.81.42:1870?... @ 05/07/18 20:06:04
ID      OWNER          HELD_SINCE  HOLD_REASON
249.0   abc123          4/30 17:14 via condor_hold (by user abc123)
250.0   abc123          4/30 17:14 via condor_hold (by user abc123)
273.0   abc123          5/3  13:07 Error from slot1_3@bilbo.eps.surrey.ac.uk: Error running docker job: error while cre

abc123@cvssp-condor-master:~ $ condor_q 250 -hold -af HoldReason
via condor_hold (by user abc123)

abc123@cvssp-condor-master:~ $ condor_q 273 -hold -af HoldReason
Error from slot1_3@bilbo.eps.surrey.ac.uk: Error running docker job: error while creating mount source path '/user/HS203/abc123/trials/condor/nvidia': mkdir /user/HS203/abc123/trials: permission denied
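Once the cause of the hold has been addressed, the job can be released back into the queue with condor_release (described in Job handling below), for example:

abc123@cvssp-condor-master:~ $ condor_release 273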

5.3.4. Completed jobs

Jobs that have been submitted to the queue and are no longer running are history: one way or another, they have completed.

To inspect their completion use the condor_history command.

abc123@condor:~  > condor_history
ID     OWNER          SUBMITTED   RUN_TIME     ST COMPLETED   CMD
32529.0   bobby         3/13 10:33   0+00:00:44 C   3/13 13:41 /vol/vssp/run_train.sh
32530.0   bobby         3/13 10:33   0+00:00:40 C   3/13 13:41 /vol/vssp/qrun_train.sh
31143.933 tony         3/7  16:22   0+00:08:09 C   3/13 13:41 deep_train.sh
32527.0   bobby         3/13 10:33   0+00:00:47 C   3/13 13:40 /vol/vssp/run_train.sh
32528.0   bobby         3/13 10:33   0+00:00:44 C   3/13 13:40 /vol/vssp/run_train.sh

For more information on how a job completed and what it achieved, you should have a look at the job's log, output and error files. Condor only knows what happened to the job from a batch scheduling point of view: a completed job does not always mean successful completion.
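As a quick first check, you can query the exit status that condor recorded for a completed job from the history; for example, for one of the jobs listed above (ExitCode and ExitBySignal are standard job ClassAd attributes):

abc123@condor:~  > condor_history 32529.0 -af ExitCode ExitBySignal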

A job can go wrong in many places, either through errors internal to the process or through errors from HTCondor's point of view. The tips in the Troubleshooting jobs section might help.

5.3.5. Retiring jobs

When a server needs to be rebooted for maintenance the IT team will ‘drain the node’.

A node in the Draining state will accept no new jobs, while allowing running jobs to continue for the amount of time they were promised. If this is happening, the condor_status and condor_gstatus commands will show the affected jobs as Retiring.
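If you want to see which slots are currently retiring, you can filter condor_status on the slot's Activity attribute; a minimal sketch:

abc123@condor:~  > condor_status -constraint 'Activity == "Retiring"'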

If your job can be checkpointed and is the last retiring job on a server, then killing and restarting it will allow the machine to come back into service faster.

Tip

See Job checkpoints for more information on how to add “Checkpointing” to your jobs.

5.4. Job handling

While a job is in any of the above states, there are commands you can run to alter its status; a short usage sketch follows the list below.

condor_qedit:

edit job attributes while it is held.

condor_hold:

put a job on hold manually.

condor_release:

release a job from hold back into the queue.

condor_rm:

remove a job from the queue.

condor_vacate_job:

vacate your job from a node back into the queue.
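A brief usage sketch (the job ID 270.0 is hypothetical):

abc123@orca:~ $ condor_hold 270.0
abc123@orca:~ $ condor_qedit 270.0 RequestMemory 8192
abc123@orca:~ $ condor_release 270.0
abc123@orca:~ $ condor_rm 270.0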

5.5. Interactive jobs

5.5.1. Why use an interactive job?

An interactive job allows you to interact with your job once it starts.

Sometimes your job might be failing to start, failing to execute properly, or even executing but doing something other than what you expected.

In these cases it is helpful to be able to inspect what is going on with your job in order to debug and troubleshoot. It is worth investigating things such as the running environment, the job itself, the command arguments, or a machine's specific hardware.

5.5.2. How to run an interactive job?

Start an interactive job just like any regular job, adding the -interactive (or just -i) flag to the condor_submit command.

Interactive jobs proceed like normal batch jobs, but instead of running the executable of the job, HTCondor opens a bash shell and connects you to it. This way you can try launching the process manually, or inspect the running environment and fix any issues.

Example of submitting an interactive job
abc123@orca:~/trials/condor/nvidia $ condor_submit -i submit_file
Submitting job(s).
1 job(s) submitted to cluster 299.
Welcome to slot1@willow.eps.surrey.ac.uk!
abc123@willow:/var/lib/condor/execute/dir_1403382 $ ls
_condor_stderr  _condor_stdout

Condor still needs to take the job through the matchmaking process like any other job, until it finds a suitable slot for you to move into. Once troubleshooting is finished, exiting the shell will return you to the submit node.

Note

For pools running condor version 8.8 and upwards, interactive jobs and condor_ssh_to_job work natively for docker universe jobs.

5.6. Job checkpoints

5.6.1. Why use checkpoints?

Running jobs might be interrupted for a number of reasons. It might be the machine's policy to allow jobs to run for a specific period of time (wall time). It might be that the machine prefers jobs from its owners, in which case your job will be evicted if the owner wants to run a job. Or it might be that the machine is being used by its owner and it is no longer available to run your job.

In any of the above cases the running job will be evicted from its allocated slot. The job will return to the queue, since it has not completed running, where it will go through the allocation process all over again, until it finds a new slot to run in.

The above is not as bad as it sounds if a checkpointing mechanism is implemented by the job. The job saves checkpoints regularly or in response to a termination signal (SIGTERM). When it starts running again in its new slot, it can resume execution from its last checkpoint.

Termination (SIGTERM) is the signal that Condor sends when it wants to evict a job from its allocated slot. The job can catch this and by default has 10 minutes of vacating time to act on it. This is considered a graceful termination. After the vacating time comes a hard kill (SIGKILL), which cannot be caught or acted upon.
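As a purely illustrative sketch (the executable name, checkpoint location and exit code are hypothetical, and real jobs should follow the guidance in How to checkpoint below), a shell wrapper could catch the signal and save state before exiting:

#!/bin/bash
# Illustrative only: names and paths below are hypothetical.
save_state() { cp -r ./state /vol/research/myproject/ckpt/; }

# HTCondor sends SIGTERM when it wants to vacate the slot.
trap 'kill "$child" 2>/dev/null; wait "$child"; save_state; exit 143' TERM

# Run the workload in the background so the shell can handle the trap promptly.
./myexe --resume-from ./state &
child=$!
wait "$child"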

Regular checkpoints are helpful in the event of power cuts or other infrastructure failures. In these cases the jobs do not receive a graceful termination signal, but are rather interrupted suddenly.

Important

See Condor pools for information on the policies in effect and to help you understand when and why your job might get evicted.

5.6.2. How to checkpoint

How checkpointing is implemented for each job depends on the software used.

Please refer to the documentation of the software or libraries used to determine how you can make your job checkpoint itself.

Example04 in the Condor examples repository demonstrates a method of implementing checkpoints with a python wrapper. It is also important to make sure that the checkpoint files are saved in a location that will be accessible from the next execution host: either a network storage location, or via Condor's file transfer mechanism.
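If you rely on Condor's file transfer mechanism rather than network storage, a submit file sketch along these lines would keep checkpoint files across an eviction (the file name is hypothetical):

should_transfer_files    = YES
when_to_transfer_output  = ON_EXIT_OR_EVICT
transfer_output_files    = checkpoint.pt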

More information on creating self-checkpointing jobs: https://htcondor.readthedocs.io/en/latest/users-manual/self-checkpointing-applications.html

5.7. Files and Condor

Jobs will have to handle your data in some capacity. Therefore it is important to understand how condor works with files.

The job's executable is itself a file, and it is needed if the job is to execute.

There are two ways of handling files so that your job is able to move around the pool:

  • Transferring the files onto the execution host’s local storage with condor’s file transfer mechanism.

  • Directly accessing your files from shared network accessible file systems.

5.7.1. File transfer mechanism

Condor will transfer files between the submit node and the execution host with its transfer mechanism.

When a job starts, the executable and any files specified with the input attributes are transferred over to the execution host automatically. A scratch execution directory is created for the job, where the files are put. Upon job completion, any files created in that execution directory are brought back to the submit node.

The two main attributes that control the file transfer mechanism in your job submit file are the following (a submit file sketch follows the list):

(Default options listed first)

should_transfer_files:

IF_NEEDED / YES / NO

Controls whether to transfer files or not. By default, files will be transferred if the submit and execute nodes are in different filesystem domains.

when_to_transfer_output:

ON_EXIT / ON_EXIT_OR_EVICT

Specifies when to transfer the output of the process back to the submit host; by default, when the process finishes. Use ON_EXIT_OR_EVICT when you want to transfer files such as checkpoints between execution hosts in case the job gets evicted.
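A minimal submit file sketch combining these attributes with explicit input transfers (the file names are hypothetical):

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = myexe.conf, data.tar.gz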

More information can be found here: https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html

5.7.2. Shared file system

The Surrey condor pools have access to some shared file systems. Which file systems are accessible will depend on the condor pool you are using. Please see Condor-pool-list for full details on which file systems are accessible on the condor pool you are using.

The following Shared file systems are available:

home directories:

available everywhere

This is your Linux user home directory. The same one you get when you log into any University Managed Linux machine.

project spaces:

available on some machines

This is the project spaces you would usually access at /vol/vssp/XYZ for example.

Weka High performance scratch space:

exclusively available on the AI Surrey Condor pool

This is a high-performance, NVME based, shared filesystem dedicated to serving the AI SURREY condor pool. For more information about this file system and how to use it, please see WEKA.

This filesystem can also be accessed via ssh using the datamove1 & datamove2 servers.

Subject to the user access lists applied to each storage space, jobs can access the same files at the same locations throughout the pool.

When you access your files directly at the shared filesystem locations, HTCondor does not need to transfer any files, which is why the file transfer mechanism effectively defaults to “NO” within the same filesystem domain.

Despite having access to the same storage locations, if you need to use files from a location that is not shared, you can turn file transfer on by inserting the line should_transfer_files = YES in your job's submit file. This will make condor copy the files directly onto the local storage of the execute machine, as described in File transfer mechanism.

More information can be found here: Submitting Jobs Using a Shared File System

5.8. Local scratch storage

5.8.1. What is scratch storage?

Scratch space refers to disk storage areas that are not backed up and can be used temporarily. They’re non-persistent storage areas. One would use these spaces to store temporary files produced while a job is running, temporarily store output files until they are moved elsewhere, or stage input files that are needed for a job to run.

Note

While there are several flavours of scratch space, the focus of this document is on the local scratch space available on the execute machines in the condor pools.

5.8.2. Job scratch area and the HTCondor file-transfer mechanism

All jobs get a minimum allocation in this area as it is needed for condor to set up and launch a job. Users can request a larger allocation through the request_disk submit file parameter, have condor stage input files into the area with transfer_input_files, have condor transfer output files with transfer_output_files, and further control the surrounding behaviour of the mechanism.

This area is cleaned up when the job completes, so any output files not transferred are lost and any input files required for subsequent jobs will have to be transferred afresh.
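For example, a job that stages a dataset into its scratch area and brings a results directory back might include submit file lines like these (the size and file names are hypothetical; request_disk is interpreted in KiB unless a unit is given):

request_disk          = 20 GB
transfer_input_files  = dataset.tar.gz
transfer_output_files = results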

More info on how this works here: https://htcondor.readthedocs.io/en/latest/users-manual/file-transfer.html

5.8.3. PScratch local storage

Note

pscratch (persistent-scratch) is a locally developed condor plugin. This is specific to using condor here at Surrey and is currently exclusive to the AISURREY condor pool.

The time overhead of re-transferring files to local scratch at the start of each new job led us to investigate the idea of a more persistent scratch space. (Faster network storage solutions coming to Surrey in the future should reduce our need for this.)

Each condor execute machine has a limited amount of local disk. PScratch capability is enabled on machines with enough disk to support it, such as AI @ Surrey machines which have about 3TB of local disk. On machines with PScratch capability, the scratch space is shared between:

  • Job scratch area, which is allocatable through the scheduler and is attached to the lifespan of a job.

  • PScratch (Persistent Scratch) area, which is allocatable on a per user/area basis, through the scheduler, but is not attached to the lifespan of a job, and as such persists to be utilised by subsequent jobs.

Data transferred into the pscratch area will persist over multiple Condor jobs run on that machine. This allows use of local, fast disk on the compute node without the overhead of data transfer prior to each job run.

PScratch space is good for moderately sized, read-only data that is unchanged between multiple runs of the same or similar jobs.

See the PScratch documentation in the Advanced topics section for further information, including how to use it.

5.9. Pool status

There are several different commands you can use to check the status of different elements of the Condor pool, such as the available machines, the GPUs, and the jobs in the pool's job queue.

5.9.1. Machine status

You can check the status of the available machines and the slots they advertise with the condor_status command:

Example output from the condor_status command
abc123@orca:~  > condor_status
Name                             OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime
slot1@whale01.eps.surrey.ac.uk   LINUX      X86_64 Unclaimed Idle      0.000  7168  0+00:29:48
slot2@whale01.eps.surrey.ac.uk   LINUX      X86_64 Unclaimed Idle      0.000  8784  1+01:29:11
slot1@whale02.eps.surrey.ac.uk   LINUX      X86_64 Owner     Idle      0.000  7168  0+06:06:27
slot2@whale02.eps.surrey.ac.uk   LINUX      X86_64 Owner     Idle      0.000  8784  0+06:06:27
slot1@whale03.eps.surrey.ac.uk   LINUX      X86_64 Owner     Idle      0.000  7168  0+06:04:00
...
...
                Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

X86_64/LINUX   382   380       0         2       0          0        0      0
        Total   382   380       0         2       0          0        0      0

You can get a more compact view where all the slots for each machine are rolled up into a single line with -compact.

Example output from the condor_status command with -compact
abc123@orca:~  > condor_status -compact
Machine                    Platform   Slots Cpus Gpus  TotalGb FreCpu  FreeGb  CpuLoad ST Jobs/Min MaxSlotGb

duck01.eps.surrey.ac.uk    x64/LINUX0 _        4    1    15.58      2     7.00    0.00 Oi     0.00 *
duck02.eps.surrey.ac.uk    x64/LINUX0 _        4    1    15.58      2     7.00    0.00 Oi     0.00 *
otter01.eps.surrey.ac.uk   x64/LINUX0 _        4    1    15.54      2     7.00    0.09 Oi     0.00 *
otter02.eps.surrey.ac.uk   x64/LINUX0 _        8    1    15.56      2     7.00    0.05 Oi     0.00 *
otter03.eps.surrey.ac.uk   x64/LINUX0 _        8    1    15.55      2     7.00    0.01 Oi     0.00 *
otter04.eps.surrey.ac.uk   x64/LINUX0 _        8    1    15.55      2     7.00    3.03 Oi     0.00 *

...
...
                Total Owner Claimed Unclaimed Matched Preempting Backfill  Drain

   X86_64/LINUX   382   380       0         2       0          0        0      0

          Total   382   380       0         2       0          0        0      0

5.9.2. GPU status

You can get a listing of all machines that have GPUs, whether they are claimed, and by whom, with condor_gstatus (a custom in-house script).

Example output from the condor_gstatus command
abc123@condor:~  > condor_gstatus
Host                    State  Activity     Owner       Job     Sched      GPUs  Assigned
=================== ========= ========= ========= ========= ========= ========= =========
balin.eps.surrey.ac                                                           7
slot1_1               Claimed      Busy   bobby    5531.4    condor         1         0
slot1_2               Claimed      Busy   tony    5477.0    condor         1         1
slot1_3               Claimed      Busy   bobby    5531.6    condor         1         2
slot1_4               Claimed      Busy   bobby   5501.15    condor         1         3
slot1_5               Claimed      Busy   bobby    5503.2    condor         1         4
slot1_6               Claimed      Busy   bobby    5503.1    condor         1         5
slot1_7               Claimed      Busy   howard    5553.6    condor         1         6
bifur.eps.surrey.ac                                                           3
slot1_1               Claimed      Busy   howard    5553.4    condor         1         0
slot1_2               Claimed      Busy   bruce      57.1  condor01         1         2
slot1_3               Claimed      Busy   bruce    5178.2    condor         1         1
bofur.eps.surrey.ac  DRAINING             PROJECT                             4
slot2                 Claimed  Retiring   bobby   88594.0 cvssp-con         1         1
cogvis1.eps.surrey.                       PROJECT                             4
slot1                 Claimed      Busy   bobby    5531.0   condor.         1         0
slot2                 Claimed      Busy   bobby    5531.9   condor.         1         1
slot3                 Claimed      Busy   bruce    5549.0   condor.         1         2
slot4                 Claimed      Busy   bobby    5531.1   condor.         1         3
cogvis2.eps.surrey.                       PROJECT                             4
slot1                 Claimed      Busy    tasha    5529.1   condor.         1         0
slot2                 Claimed      Busy    tasha    5529.0   condor.         1         1
slot3                 Claimed      Busy    tasha    5529.2   condor.         1         2
slot4                 Claimed      Busy    tasha    5529.3   condor.         1         3

...

Tip

You can use watch to query these lists periodically, but if doing so please avoid running at a very high frequency, i.e. more than once every 1-2 seconds.
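For example, refreshing every 30 seconds:

abc123@condor:~  > watch -n 30 condor_gstatus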

5.9.3. Queue status

You can get information about the state of a queue with the condor_q command:

Example output from the condor_q command
abc123@condor:~  > condor_q

-- Schedd: cvssp-condor-master.eps.surrey.ac.uk : <131.227.81.42:24662?... @ 03/13/19 14:24:25
OWNER   BATCH_NAME                                   SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
tony CMD: python                                 1/20 16:55      _      _      1      _      1 22419.0
user1 CMD: EncoderApp                             2/7  22:32      _      4      _      _      4 25176.0 ... 25180.0
dave  CMD: /src/start.sh                          2/28 16:28      _      3      _      _      3 28826.0 ... 32718.0
bobby  CMD: venv_spinn_spm12Ca.sh                  3/2  00:21      _      1      _      _      1 29790.0
jenny CMD: condor_speaker0_4stacks.sh             3/4  16:22      _      1      _      _      1 30624.0
tasha CMD: python                                 3/4  18:07      _      _      _      1      1 30641.0
tasha CMD: /bin/sleep                             3/4  18:15      _      _      _      1      1 30643.0
howard CMD: deep_adversarial.sh                    3/7  16:16   5937      1  14062      _  20000 31139.2017 ... 31145.499
bruce CMD: python                                 3/8  13:23      2      7      4      _     13 31358.1 ... 32727.4
bobby  CMD: joint_direct.sh                        3/9  15:14      _      _      _      1      1 31479.0
ash  CMD: MDLoss5_WeightedMD.sh                  3/11 13:12      _      1      _      _      1 31612.0
...
...
14883 jobs; 0 completed, 0 removed, 14803 idle, 71 running, 9 held, 0 suspended

5.9.4. Status dashboards

You can view a number of Grafana based dashboards relating to current and historic statistics for Condor pools, machines and GPUs here: https://prometheus.surrey.ac.uk/grafana

5.10. Troubleshooting jobs

It is important that you check the following things before you open a support ticket. They will often reveal the cause of your problem.

  • Logs

    Logs are incredibly useful and will be able to give you the information on all of the following:

    • When jobs were submitted, started and stopped

    • Resources used

    • Exit status

    • Where the job ran

    • Interruption reasons

  • Output

    The output from the program you are running will often give you vital clues or simply tell you what your problem is. Be sure to check:

    • stdout of your program (this is the standard output for the program)

    • stderr of your program (this is where the program prints error messages)

  • condor_history will give you a list of jobs that ran. You can use the -long option to check the ClassAd of the job:

    abc123@condor:~ $ condor_history abc123
    210.0   abc123          4/26 10:42   0+00:00:07 C   4/26 10:42 /user/HS203/abc123/NVIDIA_CUDA-8.0_Samples/samples/1_Uti
    209.0   abc123          4/26 10:40   0+00:00:40 C   4/26 10:41 /user/HS203/abc123/NVIDIA_CUDA-9.0_Samples/samples/1_Uti
    208.0   abc123          4/26 10:37   0+00:00:07 C   4/26 10:37 /user/HS203/abc123/NVIDIA_CUDA-9.0_Samples/samples/1_Uti
    207.0   abc123          4/26 10:06   0+00:00:05 X         ???  /user/HS203/abc123/NVIDIA_CUDA-9.0_Samples/samples/1_Uti
    
  • condor_ssh_to_job will take you to where the job is running so you can troubleshoot “Live”.

    Unfortunately this cannot take you into Docker containers, yet. It will, however, take you to the host where your container is running, into the scratch directory of the job, the same one that is mounted inside the container. So if your process is producing any output you will be able to see it.

  • Ensure you have read and understood the Advanced topics section of this user guide. The answer to why your job isn't running may be contained within.

Tip

Try running Interactive jobs to aid with troubleshooting your jobs.