8. Tips & tricks

Tips and tricks for Condor users:

8.1. Assigned GPUs

Condor automatically sets the CUDA_VISIBLE_DEVICES environment variable to the GPU resources you have been assigned.

If you are setting this variable manually in your code and using the Vanilla universe, you should remove that line from your code to avoid using resources assigned to other users. Alternatively, you can pick up your assigned GPUs from the _CONDOR_AssignedGPUs environment variable.
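For example, a job wrapper script could read that variable and hand it to your program. This is only a minimal sketch: the exact format of the variable (e.g. "CUDA0,CUDA1") can vary between pools, and train.py with its --devices flag are hypothetical placeholders for your own program.

#!/bin/bash
# Minimal sketch: use the GPUs HTCondor assigned to this slot instead of hard-coding device IDs.
echo "Assigned GPUs: ${_CONDOR_AssignedGPUs}"
# 'train.py' and '--devices' are hypothetical placeholders for your own program and flag.
python train.py --devices "${_CONDOR_AssignedGPUs}"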

Note

In the Docker universe, you can only see the GPUs you have been assigned.

8.2. Code compilation

Condor has a heterogeneous pool of resources, which means your job can be allocated to nodes with different resources and architectures unless you specify otherwise. If your job requires compilation, you can compile it on a specific node class and list those nodes as job requirements in your submit file (see the sketch below). Alternatively, you can compile multiple versions of your code on various node classes and call the appropriate pre-compiled binary at run time, depending on the node you are assigned.
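A minimal sketch of the first approach; the host names are only examples taken from the pool listing further down, so substitute the node class you actually compiled on:

# Run only on the nodes the binary was compiled on (example host names).
requirements = ( machine == "bofur.eps.surrey.ac.uk" ) || ( machine == "oin.eps.surrey.ac.uk" )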

8.3. Pool status & metrics

You can see the status of the cluster using this monitoring dashboard.

Note

A VPN connection is required for access.

8.4. CUDA requirements

The GPUs we run have the following CUDACapability values - be sure to request the right level for your CUDA version!

You can check which version you need on the CUDA Wiki Page.

For example, TensorFlow needs “CUDACapability >= 3”, and you’ll need to use CUDA 10+ to use anything with a CUDACapability above 7.2 (see the requirements sketch after this list):

  • Quadro RTX 8000 = 7.5

  • Quadro RTX 5000 = 7.5

  • GeForce RTX 2080 Ti = 7.5

  • GeForce GTX 1080 Ti = 6.1

  • GeForce GTX 1050 Ti = 6.1

  • GeForce GTX TITAN Xp = 6.1 (These show as undefined in the list using the tip below)

  • GeForce GTX TITAN X = 5.2

  • Tesla K40c = 3.5

  • Tesla M2090 = 2.0
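A minimal sketch of requesting a GPU with a minimum capability in your submit file; the 6.1 threshold is only an example, so pick the level your CUDA version needs:

# Ask for one GPU and only match machines whose GPUs are capability 6.1 or newer.
request_GPUs = 1
requirements = (CUDACapability >= 6.1)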

8.5. What GPUs are there in the pool?

Use the following command to get a nice list.

condor_status -compact -constraint 'TotalGPUs>0' -af:h machine TotalGPUs CUDADeviceName CUDACapability CUDADriverVersion CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver

Remember to use these attributes to tell HTCondor what hardware is suitable to run your job.
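For example, a minimal sketch combining some of these attributes; the memory threshold and device-name pattern are purely illustrative, not recommendations:

# Only match machines whose GPUs have at least ~11 GB of memory and report an RTX-class device name.
request_GPUs = 1
requirements = (CUDAGlobalMemoryMb >= 11000) && regexp("RTX", CUDADeviceName)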

example output from running the above condor_status command
abc123@condor02:~  > condor_status -compact -constraint 'TotalGPUs>0' -af:h machine TotalGPUs CUDADeviceName CUDACapability CUDADriverVersion CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver
machine                     TotalGPUs CUDADeviceName              CUDACapability        CUDADriverVersion     CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver
balin.eps.surrey.ac.uk      7         GeForce GTX TITAN X         5.2                   10.2                  12213              24               128            440.82
bifur.eps.surrey.ac.uk      3         undefined                   6.1                   10.2                  12196              30               128            440.82
bofur.eps.surrey.ac.uk      4         GeForce GTX 1080 Ti         6.1                   10.2                  11178              28               128            440.82
cogvis1.eps.surrey.ac.uk    4         GeForce GTX 1080 Ti         6.1                   10.2                  11178              28               128            440.82
cogvis2.eps.surrey.ac.uk    4         GeForce GTX 1080 Ti         6.1                   10.2                  11178              28               128            440.82
creative01.eps.surrey.ac.uk 4         Quadro RTX 5000             7.5                   10.2                  16125              48               undefined      440.82
cvsspgpu01.eps.surrey.ac.uk 4         Quadro RTX 5000             7.5                   10.2                  16125              48               undefined      440.82
cvsspgpu02.eps.surrey.ac.uk 4         Quadro RTX 5000             7.5                   10.2                  16125              48               undefined      440.82
cvsspgpu03.eps.surrey.ac.uk 4         Quadro RTX 5000             7.5                   10.2                  16125              48               undefined      440.82
dori.eps.surrey.ac.uk       4         undefined                   6.1                   10.2                  12196              30               128            440.82
dwalin.eps.surrey.ac.uk     7         GeForce GTX TITAN X         5.2                   10.2                  12213              24               128            440.82
elenwe.eps.surrey.ac.uk     2         TITAN X (Pascal)            6.1                   10.2                  12196              28               128            440.82
fili.eps.surrey.ac.uk       7         GeForce GTX TITAN X         5.2                   10.2                  12213              24               128            440.82
gloin.eps.surrey.ac.uk      7         GeForce GTX TITAN X         5.2                   10.2                  12213              24               128            440.82
kili.eps.surrey.ac.uk       7         GeForce GTX TITAN X         5.2                   10.2                  12213              24               128            440.82
nain.eps.surrey.ac.uk       4         Quadro RTX 8000             7.5                   10.2                  48601              72               undefined      440.82
nellas.eps.surrey.ac.uk     2         Tesla M2090                 2.0                   9.1                   5302               16               32             390.132
nimloth.eps.surrey.ac.uk    1         Tesla K40c                  3.5                   10.2                  11441              15               192            440.82
nimrodel.eps.surrey.ac.uk   2         TITAN Xp COLLECTORS EDITION 6.1                   10.2                  12196              30               128            440.82
oin.eps.surrey.ac.uk        4         GeForce GTX 1080 Ti         6.1                   10.2                  11178              28               128            440.82
ori.eps.surrey.ac.uk        4         undefined                   6.1                   10.2                  12196              30               128            440.82
rollo.eps.surrey.ac.uk      1         Tesla M2090                 2.0                   9.1                   5302               16               32             390.116
sounds01.eps.surrey.ac.uk   3         GeForce RTX 2080 Ti         7.5                   10.2                  11019              68               undefined      440.82
tauriel.eps.surrey.ac.uk    1         GeForce RTX 2080 Ti         7.5                   10.2                  11019              68               undefined      440.82
tefnut.eps.surrey.ac.uk     4         GeForce GTX 1080 Ti         6.1                   10.2                  11178              28               128            440.82

8.6. Avoid priority machines

Your job might get evicted at any time while running on a priority machine. If you are not comfortable with this, you can avoid the priority machines. Note that the negotiator already avoids priority machines for you, unless there is no other machine available. If this is not enough and you definitely want to avoid those machines, you can append && NotProjectOwned to your job requirements, as shown below.
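A minimal sketch, assuming your submit file already defines a requirements line earlier (the same $(requirements) pattern used in the black hole example further down):

# Extend the existing requirements so the job never matches project-owned (priority) machines.
requirements = $(requirements) && NotProjectOwned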

8.7. Tell HTCondor your job can checkpoint!

If your job can checkpoint, it is worth setting MaxJobRetirementTime = 0 in your submit file. This essentially tells the system that it does not need to wait before evicting your job, should it need to do so. This saves time for all of us in situations where we want to defrag or drain a machine for maintenance.
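A minimal sketch of a submit file for a checkpointing job; the executable name is a hypothetical placeholder and the transfer setting is just one way of handling eviction:

# Hypothetical checkpointing job: train.sh writes its own checkpoints as it runs.
executable              = train.sh
# Transfer outputs on eviction as well as on exit, so checkpoints are not lost.
when_to_transfer_output = ON_EXIT_OR_EVICT
# Tell HTCondor it can evict this job immediately, with no retirement grace period.
MaxJobRetirementTime    = 0
queue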

8.8. Reduce data transfer times on AISURREY with pscratch

If you are utilising local scratch space for your jobs and are spending a lot of time waiting for Condor to re-copy data over to the execute node at the start of your jobs, consider using the persistent-scratch (PScratch) feature available on the AISURREY condor pool.

It prevents Condor from cleaning up all the data in local scratch at the end of the job, so you can reuse it in your next job.

8.9. Multiple job submission (Looping)

This is for submitting a large volume of jobs whose parameter changes can be represented in the form of a DO/FOR loop.

2 level nested loop:

MAX_I = 20
MAX_J = 15

N = $(MAX_I) * $(MAX_J)

I = ($(Process) / $(MAX_J))
J = ($(Process) % $(MAX_J))

executable = bashscript
arguments  = $INT(I) $INT(J)

queue $(N)

Note

% is the modulo operator: it wraps the value back to 0 once the loop maximum (here $(MAX_J); $(MAX_K) in the three-level example below) is exceeded.

3 level nested loop:

MAX_I = 20
MAX_J = 15
MAX_K = 35

N = $(MAX_I) * $(MAX_J) * $(MAX_K)

I = ( $(Process) / ($(MAX_K)  * $(MAX_J)))
J = (($(Process) /  $(MAX_K)) % $(MAX_J))
K = ( $(Process) %  $(MAX_K))

executable = bashscript
arguments  = $INT(I) $INT(J) $INT(K)

queue $(N)
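For example, with the values above this queues N = 20 * 15 * 35 = 10500 jobs: process 0 runs bashscript 0 0 0, process 1 runs bashscript 0 0 1, process 35 runs bashscript 0 1 0, and process 525 runs bashscript 1 0 0.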

8.10. Black hole solution (network fault with resources)

Occasionally, a resource in the condor queue will experience a network fault. This results in any job allocated to that resource instantly starting and failing. This is especially relevant for the ORCA pool.

This code updates the requirements of the job to avoid machines that cause a job to fail due to a network fault. (Note: 127 is the return value I get on the ORCA pool from a bash script; this may change for you.)

# Insert your normal requirements here - note: this line must contain something;
# the machine name below does not exist on purpose.
requirements  = ( machine != "DoesNotExist.eps.surrey.ac.uk" )

# 127 is the network-fault return value; keep failed jobs in the queue so they can retry.
on_exit_remove = (ExitCode != 127)

# Record the last machines the job ran on and update the requirements so it avoids them.
job_machine_attrs = Machine
job_machine_attrs_history_length = 10
requirements = $(requirements) && (target.machine =!= MachineAttrMachine1) && (target.machine =!= MachineAttrMachine2)