3. Condor pools

3.1. Summary of Condor pools

In Condor terms, groups of servers are called pools. Creating pools lets us manage separate sets of compute resources for different groups of users.

Several Condor pools are currently in operation at Surrey, each with different access requirements, as detailed in the table below:

Pool Name          Submit Nodes                          Access requirement
-----------------  ------------------------------------  ------------------------------------------------
AISURREY           aisurrey-condor, aisurrey-condor01    User must pass the AI SURREY SurreyLearn module
CVSSP              condor, condor01, condor02            User must belong to the CVSSP department
ICS (in testing)   ics-condor                            User must belong to the ICS department
CS                 cscondor                              User must belong to the Computing department
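For scripts that need to pick a submit node, the table above can be captured as a small lookup. This is only an illustrative sketch: the names `SUBMIT_NODES` and `submit_nodes_for` are hypothetical, and the fully qualified host names are those listed in the per-pool sections of this page.

```python
# Hypothetical lookup of pool name -> submit nodes, copied from this page.
SUBMIT_NODES = {
    "AISURREY": [
        "aisurrey-condor.surrey.ac.uk",
        "aisurrey-condor01.surrey.ac.uk",
    ],
    "CVSSP": [
        "condor.eps.surrey.ac.uk",
        "condor01.eps.surrey.ac.uk",
        "condor02.eps.surrey.ac.uk",
    ],
    "ICS": ["ics-condor.eps.surrey.ac.uk"],
    "CS": ["cscondor.eps.surrey.ac.uk"],
}


def submit_nodes_for(pool: str) -> list[str]:
    """Return the submit nodes for a pool name (case-insensitive)."""
    key = pool.strip().upper()
    if key not in SUBMIT_NODES:
        raise KeyError(f"unknown pool {pool!r}; expected one of {sorted(SUBMIT_NODES)}")
    # Return a copy so callers cannot modify the table by accident.
    return list(SUBMIT_NODES[key])


if __name__ == "__main__":
    # Print the SSH targets for one pool.
    for node in submit_nodes_for("cvssp"):
        print(f"ssh {node}")
```

Knowing the host name does not grant access; the access requirement for the pool still applies.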

3.2. Condor pool policies

Condor pools have policies applied to them that affect things such as user and job priority, how long jobs are allowed to run, and how much grace period jobs are given when evicted from worker nodes.

There are two main policies it is important to understand, as they apply to all pools:

Job run time limit:

After this time a job receives a SIGTERM and is given a vacating time of 10 minutes to shut down gracefully. It is then evicted back into the queue to continue running later; see Checkpoints for more details. This time limit was chosen because, while many people run short jobs, some users need a longer run time to reach a suitable checkpoint. At the same time, it prevents users from holding on to resources forever.
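A job can make use of the vacating time by catching SIGTERM and saving its state before it exits. The following is a minimal sketch, not a recommended implementation: the checkpoint file format, the function names (`save_checkpoint`, `load_checkpoint`, `run`) and the "work" loop are all hypothetical stand-ins for your own code.

```python
import os
import signal
import threading
import time

# Set once the job has been asked to vacate (SIGTERM received).
stop = threading.Event()


def handle_sigterm(signum, frame):
    """Only record the request; the main loop decides when to stop."""
    stop.set()


def save_checkpoint(path, step):
    """Hypothetical checkpoint: write to a temp file, then rename atomically."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        f.write(str(step))
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0


def run(path="checkpoint.txt", total_steps=1000):
    signal.signal(signal.SIGTERM, handle_sigterm)
    step = load_checkpoint(path)  # resume where the previous run stopped
    while step < total_steps and not stop.is_set():
        time.sleep(0.001)  # stand-in for one unit of real work
        step += 1
    save_checkpoint(path, step)  # must finish well within the vacating time
    return step
```

When the job is later restarted from the queue, `run` picks up from the saved step instead of starting from zero. Saving must complete within the 10 minute vacating time, so long-running work should also checkpoint periodically rather than only on SIGTERM.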

Interactive job run time limit:

Interactive jobs are killed after this time and are not put back into the queue. It is up to the user to restart the session if needed. This time limit was chosen because people were leaving sessions open but idle, which blocked GPUs on the system. Interactive sessions are designed for debugging where large GPU memory or multiple GPUs are required.

3.3. Condor pool list

3.3.1. AI @ Surrey

Machines in the AI @ Surrey pool are dedicated servers with high compute power, GPUs, and fast storage.

The pool is funded by the Surrey Institute for People-Centred AI and is intended for AI-centred research.

Access Requirements

To gain access to this pool you must first complete a SurreyLearn course.

Submit nodes

Log in to these nodes via SSH to submit your jobs to the pool:

  • aisurrey-condor.surrey.ac.uk

  • aisurrey-condor01.surrey.ac.uk

Pool policies

Job run time limit:

3 days

Interactive job run time limit:

4 hours
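To see what these limits mean for a particular job, the timeline can be worked out with simple date arithmetic. This is a rough sketch under stated assumptions: the limit is assumed to count from when the job starts running, interactive jobs are assumed to get no grace period (the policy only says they are killed), and `eviction_timeline` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

JOB_RUN_LIMIT = timedelta(days=3)       # job run time limit
INTERACTIVE_LIMIT = timedelta(hours=4)  # interactive job run time limit
VACATE_GRACE = timedelta(minutes=10)    # vacating time after SIGTERM


def eviction_timeline(started, interactive=False):
    """Return (limit_reached, latest_end) for a job that started at `started`."""
    if interactive:
        # Assumption: interactive jobs are killed outright at the limit.
        end = started + INTERACTIVE_LIMIT
        return end, end
    sigterm_at = started + JOB_RUN_LIMIT
    return sigterm_at, sigterm_at + VACATE_GRACE


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    sigterm_at, evicted_by = eviction_timeline(start)
    print("SIGTERM at:", sigterm_at)
    print("Evicted by:", evicted_by)
```

A batch job should aim to have written a usable checkpoint before the first timestamp, leaving the vacating time for a final save only.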

Shared file systems available

  • Home directories

  • /mnt/fast/ WEKA Filesystems

  • /vol Project spaces (on some nodes)

Tip

See WEKA for more information on how to use the WEKA storage with your Condor jobs on the AI @ Surrey Condor pool.

3.3.2. CVSSP

Machines in the CVSSP pool are dedicated servers with high compute power, GPUs, and fast storage. This pool is funded by, and belongs to, the CVSSP department.

Ownership of these machines is distributed among research groups, so owners are granted higher privileges on their own machines.

Access Requirements

Members of CVSSP can use this pool.

Submit nodes

Log in to these nodes via SSH to submit your jobs to the pool:

  • condor.eps.surrey.ac.uk

  • condor01.eps.surrey.ac.uk

  • condor02.eps.surrey.ac.uk

Pool policies

Job run time limit:

3 days

Interactive job run time limit:

4 hours

  • Priority machines prefer their owners’ jobs

    Some machines in the pool were bought by specific research groups for specific research purposes. For the mutual benefit of those research groups and everyone else, the machines are included in the pool. They do, however, prefer to run their owners’ jobs and have preemptive powers: they will happily run any job while idle, but will always evict other jobs to accommodate their owners’ jobs.

    • Non-priority users’ jobs are only scheduled on these machines if no other suitable resource is available.

    • Priority users’ jobs are allocated to these machines preferentially.

    • Priority users’ jobs evict non-priority users’ jobs (usually with a 10 minute warning via SIGTERM).
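The priority rules above can be summarised as a small decision function. This is a toy model for illustration only, not how HTCondor's matchmaking is implemented or configured; the function name `placement` and its return labels are hypothetical, and the case where an owner's job arrives while another owner's job is running is an assumption, since the policy does not describe it.

```python
from typing import Optional


def placement(job_is_owner: bool,
              running_is_owner: Optional[bool],
              other_resource_free: bool) -> str:
    """Decide what a priority machine does with an incoming job.

    running_is_owner is None when the machine is idle, True when it is
    running an owner's job, and False when it is running anyone else's job.
    """
    if job_is_owner:
        if running_is_owner is None:
            return "run"             # idle machine, owner's job starts
        if not running_is_owner:
            return "evict-then-run"  # non-priority job gets SIGTERM first
        return "wait"                # assumption: owners do not evict each other
    # Non-priority job: only use this machine as a last resort.
    if other_resource_free:
        return "run-elsewhere"
    if running_is_owner is None:
        return "run"                 # idle machines happily run any job
    return "wait"


if __name__ == "__main__":
    print(placement(True, False, True))   # owner's job arrives, guest job running
    print(placement(False, None, False))  # guest job, machine idle, nothing else free
```

In practice this means a non-priority job running on a priority machine can be evicted at any time, so checkpointing matters even for jobs well under the run time limit.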

Shared file systems available

  • Home directories

  • /vol Project spaces

3.3.3. ICS

Machines in the ICS pool are dedicated servers which have been set up so that people in ICS can test out Condor.

Access Requirements

Members of the ICS department can use this pool.

Submit nodes

Log in to this node via SSH to submit your jobs to the pool:

  • ics-condor.eps.surrey.ac.uk

Pool policies

Job run time limit:

3 days

Interactive job run time limit:

4 hours

Shared file systems available

  • Home directories

  • /vol Project spaces (on some nodes)

3.3.4. CS

Machines in the CS pool are dedicated Computing department servers with high compute power and GPUs.

Access Requirements

Members of the Computing department can use this pool.

Submit nodes

Log in to this node via SSH to submit your jobs to the pool:

  • cscondor.eps.surrey.ac.uk

Pool policies

Job run time limit:

3 days

Interactive job run time limit:

4 hours

Shared file systems available

  • Home directories

  • /vol Project spaces