8. Tips & tricks
Tips and tricks for Condor users:
8.1. Assigned GPUs
Condor automatically sets the CUDA_VISIBLE_DEVICES
environment variable to the GPU resources you were assigned.
If you are setting this variable manually in your code and using Vanilla as your universe,
you should remove that line from your code to avoid using resources assigned to other users.
Alternatively, you can pick up your assigned GPUs from the _CONDOR_AssignedGPUs
environment variable.
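If you do want to pick up the assignment in a wrapper script, a minimal sketch follows. It assumes the variable holds values like "CUDA0,CUDA1" (the exact format may differ on your pool), and the fallback value is for illustration only; Condor sets the real one.

```shell
# Fallback for illustration only - Condor sets the real value on the execute node.
_CONDOR_AssignedGPUs="${_CONDOR_AssignedGPUs:-CUDA0,CUDA1}"
# Strip the "CUDA" prefixes (and any spaces) to get the indices CUDA expects.
CUDA_VISIBLE_DEVICES="$(echo "$_CONDOR_AssignedGPUs" | sed 's/CUDA//g; s/ //g')"
export CUDA_VISIBLE_DEVICES
echo "$CUDA_VISIBLE_DEVICES"
```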
Note
In the Docker universe, you can only see your assigned GPUs.
8.2. Code compilation
Condor has a heterogeneous pool of resources, which means your job can be allocated to nodes with different resources and architectures unless you specify otherwise. If your job requires compilation, you can compile it on a specific node class and set those nodes as job requirements in your submit file. Alternatively, you can compile multiple versions of your code on the various node classes and call the appropriate pre-compiled binary at run time, depending on the node you are assigned.
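As a sketch, a submit file could pin a job to the node class you compiled on like this (the machine names below are placeholders; pick real ones from condor_status):

```
# Placeholder machine names - substitute the nodes you compiled on.
requirements = ( machine == "node01.example.ac.uk" ) || ( machine == "node02.example.ac.uk" )
```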
8.3. Pool status & metrics
You can see the status of the cluster using this monitoring dashboard
Note
VPN required to access
8.4. CUDA requirements
The GPUs we run have the following CUDACapability values - be sure to request the right level for your CUDA Version!
You can check which version you need on the CUDA Wiki Page.
For example, TensorFlow needs “CUDACapability >= 3” and you’ll need to use CUDA 10+ to use anything with a CUDACapability > 7.2:
Quadro RTX 8000 = 7.5
Quadro RTX 5000 = 7.5
GeForce RTX 2080 Ti = 7.5
GeForce GTX 1080 Ti = 6.1
GeForce GTX 1050 Ti = 6.1
GeForce GTX TITAN Xp = 6.1 (These show as undefined in the list using the tip below)
GeForce GTX TITAN X = 5.2
Tesla K40c = 3.5
Tesla M2090 = 2.0
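As a sketch, you could express the TensorFlow example above in your submit file like this (adjust the threshold to whatever your framework actually needs):

```
request_gpus = 1
requirements = (CUDACapability >= 3.0)
```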
8.5. What GPUs are there in the pool?
Use the following command to get a nice list.
condor_status -compact -constraint 'TotalGPUs>0' -af:h machine TotalGPUs CUDADeviceName CUDACapability CUDADriverVersion CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver
Remember to use these attributes to tell HTCondor what hardware is suitable to run your job.
abc123@condor02:~ > condor_status -compact -constraint 'TotalGPUs>0' -af:h machine TotalGPUs CUDADeviceName CUDACapability CUDADriverVersion CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver
machine TotalGPUs CUDADeviceName CUDACapability CUDADriverVersion CUDAGlobalMemoryMb CUDAComputeUnits CUDACoresPerCU HasNvidiaDriver
balin.eps.surrey.ac.uk 7 GeForce GTX TITAN X 5.2 10.2 12213 24 128 440.82
bifur.eps.surrey.ac.uk 3 undefined 6.1 10.2 12196 30 128 440.82
bofur.eps.surrey.ac.uk 4 GeForce GTX 1080 Ti 6.1 10.2 11178 28 128 440.82
cogvis1.eps.surrey.ac.uk 4 GeForce GTX 1080 Ti 6.1 10.2 11178 28 128 440.82
cogvis2.eps.surrey.ac.uk 4 GeForce GTX 1080 Ti 6.1 10.2 11178 28 128 440.82
creative01.eps.surrey.ac.uk 4 Quadro RTX 5000 7.5 10.2 16125 48 undefined 440.82
cvsspgpu01.eps.surrey.ac.uk 4 Quadro RTX 5000 7.5 10.2 16125 48 undefined 440.82
cvsspgpu02.eps.surrey.ac.uk 4 Quadro RTX 5000 7.5 10.2 16125 48 undefined 440.82
cvsspgpu03.eps.surrey.ac.uk 4 Quadro RTX 5000 7.5 10.2 16125 48 undefined 440.82
dori.eps.surrey.ac.uk 4 undefined 6.1 10.2 12196 30 128 440.82
dwalin.eps.surrey.ac.uk 7 GeForce GTX TITAN X 5.2 10.2 12213 24 128 440.82
elenwe.eps.surrey.ac.uk 2 TITAN X (Pascal) 6.1 10.2 12196 28 128 440.82
fili.eps.surrey.ac.uk 7 GeForce GTX TITAN X 5.2 10.2 12213 24 128 440.82
gloin.eps.surrey.ac.uk 7 GeForce GTX TITAN X 5.2 10.2 12213 24 128 440.82
kili.eps.surrey.ac.uk 7 GeForce GTX TITAN X 5.2 10.2 12213 24 128 440.82
nain.eps.surrey.ac.uk 4 Quadro RTX 8000 7.5 10.2 48601 72 undefined 440.82
nellas.eps.surrey.ac.uk 2 Tesla M2090 2.0 9.1 5302 16 32 390.132
nimloth.eps.surrey.ac.uk 1 Tesla K40c 3.5 10.2 11441 15 192 440.82
nimrodel.eps.surrey.ac.uk 2 TITAN Xp COLLECTORS EDITION 6.1 10.2 12196 30 128 440.82
oin.eps.surrey.ac.uk 4 GeForce GTX 1080 Ti 6.1 10.2 11178 28 128 440.82
ori.eps.surrey.ac.uk 4 undefined 6.1 10.2 12196 30 128 440.82
rollo.eps.surrey.ac.uk 1 Tesla M2090 2.0 9.1 5302 16 32 390.116
sounds01.eps.surrey.ac.uk 3 GeForce RTX 2080 Ti 7.5 10.2 11019 68 undefined 440.82
tauriel.eps.surrey.ac.uk 1 GeForce RTX 2080 Ti 7.5 10.2 11019 68 undefined 440.82
tefnut.eps.surrey.ac.uk 4 GeForce GTX 1080 Ti 6.1 10.2 11178 28 128 440.82
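For example, a sketch of pinning a job to one of the models listed above (where CUDADeviceName shows as undefined, matching on CUDACapability instead is likely more reliable):

```
request_gpus = 1
requirements = (CUDADeviceName == "GeForce GTX 1080 Ti")
```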
8.6. Avoid priority machines
Your job might get evicted while running on a priority machine at any time.
If you don’t feel comfortable with this, you can avoid the priority machines.
You should know that the negotiator will avoid priority machines for you already, unless there is no other machine available.
If this is not enough and you definitely want to avoid those machines, you can append && NotProjectOwned
to your job requirements.
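For example (the capability clause here is just a stand-in for whatever requirements you already have):

```
requirements = (CUDACapability >= 6.1) && NotProjectOwned
```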
8.7. Tell HTCondor your job can checkpoint!
If your job can checkpoint, it is worth setting MaxJobRetirementTime = 0
in your submit file.
This essentially tells the system that it doesn't need to wait to evict your job, should it need to do so.
This saves time for all of us in situations where we want to defrag or drain to maintain a machine.
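A minimal submit-file sketch (the executable name is a placeholder; `max_job_retirement_time` is the submit-file spelling that sets the MaxJobRetirementTime attribute):

```
# Placeholder executable - a job that writes its own checkpoints.
executable = train.sh
max_job_retirement_time = 0
queue
```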
8.8. Reduce data transfer times on AISURREY with pscratch
If you are utilising local scratch space for your jobs and spending a lot of time waiting for Condor to re-copy data to the execute node at the start of each job, consider using the persistent-scratch (PScratch) feature available on the AISURREY Condor pool.
It stops Condor cleaning up all the data from local scratch at the end of the job, so you can re-use it in your next job.
8.9. Multiple job submission (Looping)
This is for submitting a larger volume of jobs with parameter changes that can be represented in the form of a DO/FOR loop.
2 level nested loop:
MAX_I = 20
MAX_J = 15
N = $(MAX_I) * $(MAX_J)
I = ($(Process) / $(MAX_J))
J = ($(Process) % $(MAX_J))
executable = bashscript
arguments = $INT(I) $INT(J)
queue $(N)
Note
The % (modulus) operator wraps the value back to 0 once it reaches the loop maximum (e.g. MAX_J or MAX_K).
3 level nested loop:
MAX_I = 20
MAX_J = 15
MAX_K = 35
N = $(MAX_I) * $(MAX_J) * $(MAX_K)
I = ( $(Process) / ($(MAX_K) * $(MAX_J)))
J = (($(Process) / $(MAX_K)) % $(MAX_J))
K = ( $(Process) % $(MAX_K))
executable = bashscript
arguments = $INT(I) $INT(J) $INT(K)
queue $(N)
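To sanity-check the arithmetic, the same integer expressions can be evaluated in bash; the comments below work through one example value of $(Process) by hand:

```shell
# The index arithmetic from the 3-level submit file, for one example value.
MAX_J=15
MAX_K=35
PROCESS=560
I=$(( PROCESS / (MAX_K * MAX_J) ))   # 560 / 525 = 1
J=$(( (PROCESS / MAX_K) % MAX_J ))   # (560 / 35) % 15 = 16 % 15 = 1
K=$(( PROCESS % MAX_K ))             # 560 % 35 = 0
echo "$I $J $K"
```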
8.10. Black hole solution (network fault with resources)
Occasionally, a resource in the Condor queue will experience a network fault. This causes any job allocated to that resource to start and fail instantly. This is especially relevant for the ORCA pool.
The following code updates the job's requirements to avoid machines where the job has failed due to a network fault. (Note: 127 is the return value I get on the ORCA pool from a bash script; this may change for you.)
# Insert your normal requirements here - note: this field must contain something.
# The machine name below does not exist on purpose.
requirements = ( machine != "DoesNotExist.eps.surrey.ac.uk" )
# 127 is the network-fault return value.
on_exit_remove = (ExitCode != 127)
# Updates the requirements if the job fails on a given machine.
job_machine_attrs = Machine
job_machine_attrs_history_length = 10
requirements = $(requirements) && (target.machine =!= MachineAttrMachine1) && (target.machine =!= MachineAttrMachine2)