3. Condor pools
3.1. Summary of Condor pools
In Condor terms, groups of servers are called pools. Pools allow us to manage separate sets of compute resources for different groups of users.
Several Condor pools are currently in operation at Surrey, each with different access requirements, as detailed in the table below:
Pool Name | Submit Nodes | Access requirement |
---|---|---|
AISURREY | aisurrey-condor, aisurrey-condor01 | User must pass the AI SURREY SurreyLearn module |
CVSSP | condor, condor01, condor02 | User must belong to the CVSSP department |
ICS (in testing) | ics-condor | User must belong to the ICS department |
CS | cscondor | User must belong to the Computing department |
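As a convenience, the mapping from pool to submit node can be written down in a small helper. The following is only a sketch: the fully qualified host names are the ones listed per pool in section 3.3, while the function name `ssh_command` and the username `ab1234` are hypothetical and used purely for illustration.

```python
# Submit nodes per pool, copied from the pool list in section 3.3.
SUBMIT_NODES = {
    "AISURREY": [
        "aisurrey-condor.surrey.ac.uk",
        "aisurrey-condor01.surrey.ac.uk",
    ],
    "CVSSP": [
        "condor.eps.surrey.ac.uk",
        "condor01.eps.surrey.ac.uk",
        "condor02.eps.surrey.ac.uk",
    ],
    "ICS": ["ics-condor.eps.surrey.ac.uk"],
    "CS": ["cscondor.eps.surrey.ac.uk"],
}


def ssh_command(pool, username, node_index=0):
    """Build the ssh argument list for one submit node of the given pool.

    `username` is whatever account you normally log in with; the value
    used in the examples is a made-up placeholder.
    """
    nodes = SUBMIT_NODES[pool.upper()]
    return ["ssh", f"{username}@{nodes[node_index]}"]
```

For example, `ssh_command("CVSSP", "ab1234", node_index=2)` builds the command for `condor02.eps.surrey.ac.uk`. The list can be passed to `subprocess.run` or joined with spaces and pasted into a terminal.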
3.2. Condor pool policies
Condor pools have policies applied to them that affect things such as user and job priority, how long jobs are allowed to run, and how much grace period jobs are given when evicted from worker nodes.
There are two main policies that it is important to understand, as they apply to all pools:
- Job run time limit:
Once a job reaches this limit it receives a SIGTERM and is given a vacating time of 10 minutes to shut down gracefully. It is then evicted back into the queue to continue running later; see Checkpoints for more details (LINK HERE). This time limit was chosen because, while many people run short jobs, some users need a longer run time to reach a suitable checkpoint. At the same time, it prevents users from holding on to resources forever.
- Interactive job run time limit:
Interactive jobs are killed once they reach this limit and are not put back into the queue. It is up to the user to restart the session if they need to. This time limit was chosen because people were leaving sessions open but idle, which blocked GPUs on the system. Interactive sessions are designed for debugging where large GPU memory or multiple GPUs are required.
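The SIGTERM and 10 minute vacating time described for batch jobs give a job the chance to save its progress before it is evicted. The following is a minimal sketch of one way a Python job might do this, not a description of how any particular job on the pools is written. The file name, the `{"step": ...}` state, and all function names are hypothetical, and the sketch assumes the checkpoint is written somewhere that is still visible when the job restarts, such as one of the shared file systems listed for each pool.

```python
import json
import os
import signal
import tempfile

# Set by the handler when the job is asked to vacate.
stop_requested = False


def handle_sigterm(signum, frame):
    """Record the eviction request; the main loop does the actual saving."""
    global stop_requested
    stop_requested = True


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so a half-written file never replaces a good one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0}


def run(path, total_steps):
    """Resume from the checkpoint and work until done or asked to stop."""
    signal.signal(signal.SIGTERM, handle_sigterm)
    state = load_checkpoint(path)
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1  # placeholder for one unit of real work
    save_checkpoint(path, state)
    return state
```

A job script would call something like `run("checkpoint.json", total_steps=1000)`. The handler only sets a flag, so the save happens at a step boundary rather than in the middle of one. Each unit of work therefore needs to be short compared with the 10 minute vacating time.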
3.3. Condor pool list
3.3.1. AI @ Surrey
Machines in the AI @ Surrey pool are dedicated servers with high compute power, GPUs, and fast storage.
The pool is funded by the Surrey Institute for People-Centred AI and is intended for use in AI-centred research.
Access Requirements
In order to gain access to this pool you must first complete a Surrey Learn course.
You can sign up here: Register for SurreyLearn Module (If it takes you to the Surrey Learn home page then try clicking it again)
Once you have registered, the course is at https://surreylearn.surrey.ac.uk/d2l/home/221123
Submit nodes
Log in to these nodes via SSH to submit your jobs to the pool:
aisurrey-condor.surrey.ac.uk
aisurrey-condor01.surrey.ac.uk
Pool policies
- Job run time limit:
3 days
- Interactive job run time limit:
4 hours
Shared file systems available
Home directories
/mnt/fast/ WEKA Filesystems
/vol Project spaces (on some nodes)
Tip
See WEKA for more information on how to use the WEKA storage with your Condor jobs on the AI @ Surrey Condor pool.
3.3.2. CVSSP
Machines in the CVSSP pool are dedicated servers with high compute power, GPUs, and fast storage. This pool is funded by, and belongs to, the CVSSP department.
The ownership of these machines is distributed and so higher privileges are granted to their owners.
Access Requirements
Members of CVSSP can use this pool.
Submit nodes
Log in to these nodes via SSH to submit your jobs to the pool:
condor.eps.surrey.ac.uk
condor01.eps.surrey.ac.uk
condor02.eps.surrey.ac.uk
Pool policies
- Job run time limit:
3 days
- Interactive job run time limit:
4 hours
Priority machines prefer their owners' jobs
Some machines in the pool were bought by specific research groups for specific research purposes. For the mutual benefit of users in those research groups and everyone else, the machines are included in the pool. They do, however, prefer to run their owners' jobs and have preemptive powers. This means that they will happily run any job while idle, but will always evict other jobs in order to accommodate their owners' jobs.
- Non-priority users' jobs will only be scheduled on these machines if no other suitable resource is available.
- Priority users' jobs will be allocated to these machines preferentially.
- Priority users' jobs will evict non-priority users' jobs (usually with a 10 minute warning via SIGTERM).
Shared file systems available
Home directories
/vol Project spaces
3.3.3. ICS
Machines in the ICS pool are dedicated servers which have been set up so that people in ICS can test out Condor.
Access Requirements
Members of the ICS department can use this pool.
Submit nodes
Log in to these nodes via SSH to submit your jobs to the pool:
ics-condor.eps.surrey.ac.uk
Pool policies
- Job run time limit:
3 days
- Interactive job run time limit:
4 hours
Shared file systems available
Home directories
/vol Project spaces (on some nodes)
3.3.4. CS
Machines in the CS pool are dedicated Computing department servers with high compute power and GPUs.
Access Requirements
Members of the Computing department can use this pool.
Submit nodes
Log in to these nodes via SSH to submit your jobs to the pool:
cscondor.eps.surrey.ac.uk
Pool policies
- Job run time limit:
3 days
- Interactive job run time limit:
4 hours
Shared file systems available
Home directories
/vol Project spaces