AISURREY

AISURREY is the GPU compute facility at the heart of Surrey’s Institute for People-Centred AI.

It is a heterogeneous compute pool focused on GPUs for processing AI and ML workloads.

It is coupled with a high-performance WEKA storage system designed to keep up with the I/O demands of the GPUs.

Note

AISurrey does not support MPI. If you are looking for an MPI-capable cluster, use Eureka2.

AISurrey cluster overview

Quick Getting Started Guide

  1. Access the cluster

    Connect through Open OnDemand or log in to aisurrey-submit01.surrey.ac.uk via Secure shell (SSH) access.

  2. Upload and organise your code and data

    Copy code and any input data to the cluster before you run jobs. AISURREY provides three storage spaces for different data and workload types; see WEKA file systems.

  3. Prepare your software environment

    There are two main ways to provide your software on AISURREY:

    • Create a Conda environment to manage your own Python version and packages in an isolated user-space environment.

    • Use a container. For most AISURREY production workflows, this is the preferred approach. Build your image and publish it to the registry before submitting production jobs; see Building Container Images.

  4. Submit your job

    Once your data and software environment are ready, either submit a batch job (see Batch jobs (sbatch)) or start an interactive job (see Interactive jobs (srun)).

  5. Monitor and manage your jobs

    After your job enters the queue, use the commands described in Job Management to check status, inspect resource use, or cancel it.

Cluster specification

Compute node name

Usable CPU cores

RAM (GBs)

GPU’s

GPU memory (GBs)

GPU GRES name

Data storage connection

aisurrey-debug01

18

128

2 x RTX a5000

24

nvidia_rtx_a5000

fs_nfs

aisurrey-debug02

18

128

2 x RTX a5000

24

nvidia_rtx_a5000

fs_nfs

aisurrey-debug03

18

128

2 x RTX a5000

24

nvidia_rtx_a5000

fs_nfs

aisurrey-debug04

18

128

2 x RTX a5000

24

nvidia_rtx_a5000

fs_nfs

aisurrey01

60

256

6 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_weka

aisurrey02

60

256

6 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_weka

aisurrey03

60

256

6 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_weka

aisurrey04

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey05

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey07

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey08

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey09

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey10

60

256

7 x GeForce RTX 2080

11

nvidia_geforce_rtx_2080_ti

fs_nfs

aisurrey11

60

512

7 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey12

60

512

7 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey13

60

512

7 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey14

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey15

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey16

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey17

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey18

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey19

60

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey20

40

192

4 x Quadro RTX 8000

48

quadro_rtx_8000

fs_nfs

aisurrey21

60

512

4 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey22

60

512

4 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey23

60

512

4 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey24

92

1024

8 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey25

92

1024

8 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey26

92

1024

8 x A100 SXM 80GB

80

nvidia_a100-sxm4-80gb

fs_weka

aisurrey27

124

512

8 x GeForce RTX 3090

24

nvidia_geforce_rtx_3090

fs_weka

aisurrey28

36

192

2 x RTX A6000

48

nvidia_rtx_a6000

fs_nfs

aisurrey29

36

192

2 x RTX A6000

48

nvidia_rtx_a6000

fs_nfs

aisurrey30

36

190

4 x RTX 5000

16

quadro_rtx_5000

fs_nfs

aisurrey31

36

190

4 x RTX 5000

16

quadro_rtx_5000

fs_nfs

aisurrey32

36

190

4 x RTX 5000

16

quadro_rtx_5000

fs_nfs

aisurrey35

16

128

2 x L40s

48

nvidia_l40s

fs_weka

Note

For an explanation of the data storage connection column see: WEKA data storage network connection types

Slurm on AISURREY

This section covers some of the specifics of Slurm on the AISURREY cluster.

For general information about the Slurm scheduler, please see HPC job scheduler (Slurm)

AISURREY example job scripts

Tip

Here are some example job scripts for the AISURREY cluster to help get you started.

Condor to Slurm command cheat sheet

| Condor command | Purpose |
|---|---|
| condor_status | Cluster resources state |
| condor_gstatus | Cluster GPU resources state |
| condor_q [-u <user>] | Job queues state |
| condor_q -l <jobID> | Current job information |
| condor_rm <jobID> | Cancel job |
| condor_submit <job_script> | Submit batch job |
| condor_submit -i <job_script> | Submit interactive job |
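The cheat sheet above lists the Condor commands and their purpose. As a rough guide, the sketch below maps each one to its usual counterpart in a stock Slurm installation. The helper name `slurm_equivalent` is hypothetical, and the commands recommended for AISURREY may differ (for example, an interactive `srun` will normally also need a partition and GPU request).

```shell
#!/bin/bash
# Hypothetical helper: print the usual Slurm counterpart of a Condor command.
# The mappings assume a stock Slurm installation; site wrappers may differ.
slurm_equivalent() {
    case "$1" in
        condor_status)      echo "sinfo" ;;
        # -N lists one node per line; %N = node name, %G = GRES (GPUs), %t = state
        condor_gstatus)     echo "sinfo -N -o '%N %G %t'" ;;
        condor_q)           echo "squeue [-u <user>]" ;;
        "condor_q -l")      echo "scontrol show job <jobID>" ;;
        condor_rm)          echo "scancel <jobID>" ;;
        condor_submit)      echo "sbatch <job_script>" ;;
        "condor_submit -i") echo "srun --pty bash" ;;
        *)                  echo "unknown Condor command: $1" >&2; return 1 ;;
    esac
}

slurm_equivalent "condor_q -l"   # prints: scontrol show job <jobID>
```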

AISURREY partitions (queues)

The table below lists all the available partitions on AISurrey.

| Name | Total node count | Total GPU count (GPU type) | Time limit | Purpose | Preemption enabled |
|---|---|---|---|---|---|
| debug | 4 | 8 (NVIDIA RTX A5000) | 4 hours | general access code testing and debugging | no |
| 2080ti | 6 | 42 (NVIDIA GeForce RTX 2080 Ti) | 3 days | general access production jobs | no |
| 3090 | 8 | 64 (NVIDIA GeForce RTX 3090) | 3 days | general access production jobs | no |
| 3090_risk | 10 | 80 (NVIDIA GeForce RTX 3090) | 3 days | general access production jobs with checkpointing | yes (30min grace) |
| a100 | 6 | 36 (NVIDIA A100) | 3 days | general access production jobs | no |
| rtx8000 | 1 | 4 (NVIDIA Quadro RTX 8000) | 3 days | general access production jobs | no |
| rtx5000 | 3 | 12 (NVIDIA Quadro RTX 5000) | 3 days | general access production jobs | no |
| pair-project | 2 | 16 (NVIDIA GeForce RTX 3090) | 3 days | exclusively for use of pair project group | no |
| cogvis-project | 3 | 12 (NVIDIA 3090 and A6000) | 3 days | exclusively for use by cogvis-project | no |
| rtx_a6000_risk | 2 | 4 (NVIDIA RTX A6000) | 3 days | general access production jobs with checkpointing | yes (30min grace) |
| narrative-project | 1 | 2 (NVIDIA RTX 5000 ASA) | 3 days | exclusively for use & debugging by narrative project | no |
| nice-project | 1 | 2 (NVIDIA L40S) | 3 days | exclusively for use of nice project group | no |
| l40s_risk | 1 | 2 (NVIDIA L40S) | 3 days | general access production jobs with checkpointing | yes (30min grace) |

Preemption - risk queues

Any partition with _risk in its name has preemption enabled. Jobs running in such a partition may be interrupted and re-queued (stopped and sent back to the queue). We strongly advise using the _risk queues only if your job has checkpointing enabled.

Tip

Enabling checkpointing in your job and using the _risk queues is a great way to reduce your job’s queuing time: these queues always contain more resources, and they allow use of the exclusive project-owned nodes when the project users are not using them.

The diagram below illustrates the different partitions available on AISURREY:

  • The green partitions are open to all and have no risk of jobs being preempted by other jobs.

  • The orange partitions are the group partitions. They can only be used by members of the relevant groups, who have priority access to their compute nodes.

  • The red partitions are the _risk partitions, which are open to all. Jobs submitted to these might land on a compute node owned by one of the group partitions, and are therefore at risk of being interrupted and re-queued by a competing job from that group partition. Use them only if your job can checkpoint. They offer access to all GPUs of a given type, so they may give shorter queue times and let people use group-owned nodes when the owning group is not using them.

../../_images/aisurrey_slurm_partitions.png

Note

See checkpointing for descriptions of how to enable checkpointing. Step-by-step examples for tools like Python, MATLAB, and R are available in the checkpointing examples GitLab repository.
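As a minimal sketch of the idea (not the method from the checkpointing guide), the job script below records its progress after every step and resumes from the last recorded step when it is started again after a re-queue. The job name, the `CKPT_FILE` location and the `TOTAL_STEPS` loop are hypothetical placeholders for your own workload, and `--requeue` is the standard sbatch flag that marks a job as eligible for re-queuing.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=ckpt-demo
# A preemptible partition from the partitions table.
#SBATCH --partition=3090_risk
#SBATCH --gres=gpu:1
# Allow Slurm to put the job back in the queue after an interruption.
#SBATCH --requeue

CKPT_FILE="${CKPT_FILE:-checkpoint.txt}"   # hypothetical checkpoint location
TOTAL_STEPS="${TOTAL_STEPS:-5}"            # hypothetical amount of work

# Resume from the last completed step if a checkpoint exists, else start at 0.
start=0
if [ -f "$CKPT_FILE" ]; then
    start=$(cat "$CKPT_FILE")
fi

for ((step = start; step < TOTAL_STEPS; step++)); do
    # ... one unit of real work would go here ...
    echo "$((step + 1))" > "$CKPT_FILE"    # record progress after each step
done

echo "completed ${TOTAL_STEPS} steps (this run started at step ${start})"
```

Writing the checkpoint after each completed step means an interrupted run loses at most the step that was in progress.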

AISURREY accounting and usage

When you submit a job on our HPC cluster, Slurm calculates the cost of your job based on the resources you request. This cost is used to manage fair access to the system and ensure that all users have a balanced opportunity to run their workloads.

Each job is assigned a billing weight based on the resources it requests. These weights are set per Slurm trackable resource (TRES), such as CPUs, memory, and GPUs. The total cost of a job is calculated as:

\(\text{Job Cost} = \sum (\text{TRES Count} \times \text{TRES Weight}) \times \text{Job Duration}\)

For example, if a job requests 4 CPUs for 2 hours, and each CPU has a weight of 1:

\(4 \times 1 \times 2 = 8\) billing units

Below are the current weightings for the AI@Surrey cluster:

| Partition | CPU | Memory | GPU |
|---|---|---|---|
| debug | 0.5 | 0.08 | 124.40 |
| 2080ti | 0.9 | 0.22 | 150.10 |
| 3090 | 0.9 | 0.11 | 198.50 |
| a100 | 1.0 | 0.13 | 435.10 |
| rtx8000 | 1.7 | 0.36 | 182.00 |

Note

These weights will change over time as we monitor usage and adjust the system to ensure fair access for all users.
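To make the formula concrete, the sketch below applies it to a job that requests several resource types. The helper `job_cost` is hypothetical, and it assumes memory is counted per GB and duration in hours (matching the 2-hour example above); the units Slurm uses internally on AISURREY may differ.

```shell
#!/bin/bash
# job_cost CPUS CPU_WEIGHT MEM_GB MEM_WEIGHT GPUS GPU_WEIGHT HOURS
# Hypothetical helper applying the job cost formula:
#   cost = (sum of TRES count x TRES weight) x duration
# Assumes memory is counted per GB and duration in hours.
job_cost() {
    awk -v c="$1" -v cw="$2" -v m="$3" -v mw="$4" -v g="$5" -v gw="$6" -v h="$7" \
        'BEGIN { printf "%.2f\n", (c * cw + m * mw + g * gw) * h }'
}

# CPU-only example from the text: 4 CPUs, weight 1, 2 hours.
job_cost 4 1 0 0 0 0 2             # prints 8.00

# a100 partition weights: 8 CPUs, 64 GB RAM, 1 GPU, 2 hours.
job_cost 8 1.0 64 0.13 1 435.10 2  # prints 902.84
```

In the second example the GPU term (435.10) dominates the CPU (8.00) and memory (8.32) terms, so the number of GPUs requested drives the cost of most GPU jobs.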

All users start with the same fairshare value, meaning jobs are initially given equal priority. Over time, fairshare is adjusted based on historical usage:

  • Users who consume more resources will see their priority decrease.

  • Users who use fewer resources will have higher priority when submitting jobs.

This system ensures fair access to compute resources for all users.

AISurrey data storage - WEKA

The AI@Surrey cluster has a bespoke fast non-backed up storage area designed to be able to serve data to the GPU servers at speed. This scratch area is a WEKA file system built on servers full of NVMe drives interconnected with a dedicated 100GbE network.

WEKA file systems

AI@Surrey nodes have access to 3 different WEKA file systems.

/mnt/fast/nobackup/users:

Contains user-owned directories. By default, each user gets a 200GB hard quota on this directory. Your directory will be deleted when you leave the university and your Surrey account is disabled. Do not keep precious research data here for long-term storage.

/mnt/fast/nobackup/scratch4weeks:

A first-come-first-served temporary scratch space. Files stored here are deleted once they have not been accessed for 4 weeks; see scratch4weeks cleanup script.

/mnt/fast/datasets:

Collection of popular READ ONLY datasets.

If you would like a dataset to be copied into the /mnt/fast/datasets directory please open a Support Ticket.

WEKA data storage network connection types

Nodes have 2 different types of storage connection accessing the WEKA Storage (i.e. /mnt/fast):

  1. fs_weka refers to the faster and direct 100GbE connection to WEKA storage.

  2. fs_nfs refers to the slower and remote NFS connection to WEKA storage.

Both types of connection give you access to the same data storage locations; however, nodes with fs_nfs will not get the same level of storage performance as those with fs_weka.

These strings can be used as features (constraints) in a Slurm job script. For example, to select only nodes with the fast connection, add: #SBATCH --constraint=fs_weka
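For context, a minimal job script using this constraint might look like the sketch below. The job name, partition choice, GPU count and time limit are illustrative assumptions; only the `--constraint=fs_weka` line comes from the text above.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=weka-demo
# Illustrative partition and GPU request; adjust to your workload.
#SBATCH --partition=3090
#SBATCH --gres=gpu:1
# Only run on nodes with the direct 100GbE WEKA connection.
#SBATCH --constraint=fs_weka
#SBATCH --time=01:00:00

# Report where the job landed and confirm the fast storage is mounted.
echo "Running on $(hostname)"
df -h /mnt/fast 2>/dev/null || echo "/mnt/fast is not mounted on this host"
```

Replacing `fs_weka` with `fs_nfs` would instead restrict the job to nodes with the NFS connection, which is rarely what you want for I/O-heavy work.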

Checking quota for your user directory

Your /mnt/fast/nobackup/users directory has a 200GB quota by default.

There are a couple of ways to check the quota for your directory in /mnt/fast/nobackup/users:

  • Grafana dashboard

    You can check your quota from the Grafana dashboard at the link below (Global Protect VPN required). Log in with your Surrey credentials.

    WEKA user quota dashboard.

    ../../_images/grafana_quota.png

    WEKA User Quota Dashboard

  • df command

    You can use the df command to see the available quota left on your user directory.

    df -h /mnt/fast/nobackup/users/<your_username_here>
    

scratch4weeks cleanup script

The /mnt/fast/nobackup/scratch4weeks filesystem is a temporary scratch space. It is not backed up and is not intended for the long-term storage of your data.

A clean-up script runs in this area daily and deletes data that has not been accessed in 4 weeks.

  • The script checks all the data in the directory for anything that hasn’t been accessed in the last 23 days and marks it for deletion.

  • It then e-mails the owners of these files, giving 5 days’ warning before the data is deleted.

  • Owners who still need the data have 5 days to touch or access the files, which updates each file’s access timestamp.

  • After 5 days, any files that have still not been accessed for 28 days are removed from the area by the script.

Please ensure you are regularly copying data you wish to keep long term back to your project space(s).
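If you want to see which of your files are approaching deletion before the warning e-mail arrives, a sketch along the following lines may help. The helper `list_stale` and the `SCRATCH_DIR` placeholder are hypothetical; the sketch relies on GNU `find` and `touch`, and it is not the cleanup script itself.

```shell
#!/bin/bash
# list_stale DIR [DAYS]
# Hypothetical helper: list regular files under DIR whose last access time is
# at least DAYS days old (default 23, the warning threshold described above).
list_stale() {
    local dir="$1" days="${2:-23}"
    # find's "-atime +N" matches files accessed more than N whole days ago,
    # so N = DAYS - 1 selects files that are DAYS or more days old.
    find "$dir" -type f -atime "+$((days - 1))"
}

# Example: review your files, then refresh the access time of those to keep.
# SCRATCH_DIR is a placeholder for your own directory under scratch4weeks.
# list_stale "$SCRATCH_DIR"
# list_stale "$SCRATCH_DIR" | xargs -r -d '\n' touch -a
```

Refreshing access times only postpones deletion, so copying anything you need to your project space remains the safer option.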

Warning

Data deleted by the script cannot be recovered. It is necessary for us to run this tidy-up script to ensure that space isn’t being wasted on this system. It’s a high performance system and has a very high cost vs capacity, so to ensure all users can get the most out of it we need to ensure that the space is utilised appropriately.

CEEMS: Job Monitoring & Profiling

The AI@Surrey cluster utilises CEEMS (Cluster Energy and Emissions Monitoring System) to provide users with deep visibility into their SLURM jobs. CEEMS is a Prometheus-based monitoring stack that tracks resource utilisation, energy consumption, and performance metrics at the job level.

Note

The CEEMS integration is currently in a testing phase. Features may change, and historical data prior to the rollout may be partial or missing.

Accessing the Dashboard

Monitoring data is visualized through our internal Grafana instance.

  1. Navigate to the Grafana Server.

  2. Go to the AI@Surrey folder.

  3. Select the User SLURM Job Summary dashboard.

Features

Job Summary

The User SLURM Job Summary dashboard provides a high-level overview of your usage:

  • A summary of your SLURM jobs from the past 7 days.

  • Aggregate statistics for your usage across the whole cluster.

  • Job Drill-down: The list at the bottom shows individual jobs. Clicking on a Job ID will take you to a detailed metrics page for that specific execution.

Detailed Job Metrics

Clicking on a specific job in the list will navigate you to a detailed dashboard for that job. This view includes in-depth metrics such as:

  • Compute: CPU usage and cache hits/misses.

  • Accelerators: GPU utilisation and GPU memory usage.

  • Data: I/O operations and network activity.

  • Profiling: Memory and CPU profiling flame graphs for code optimization.

Enabling Continuous Profiling

CEEMS supports zero-instrumentation continuous profiling using eBPF and Grafana Pyroscope. This allows you to see exactly which functions in your code are consuming the most resources without modifying your source code.

To access the memory and CPU profiling flame graphs, you must opt-in by setting a specific environment variable in your SLURM job script.

Add the following line to your submission script:

export ENABLE_CONTINUOUS_PROFILING=1
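In context, the opt-in might sit in a job script as sketched below. The job name, partition and GPU request are illustrative assumptions, and placing the export before the workload is a cautious choice rather than a documented requirement.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=profiled-job
# Illustrative partition and GPU request.
#SBATCH --partition=a100
#SBATCH --gres=gpu:1

# Opt in to continuous profiling before the workload starts.
export ENABLE_CONTINUOUS_PROFILING=1

# Placeholder workload; replace this line with your own command.
echo "Continuous profiling opt-in: ${ENABLE_CONTINUOUS_PROFILING}"
```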

Caveats & Limitations

Please be aware of the following hardware and software constraints regarding metric availability:

  • GPU Profiling: Only available on A100 & L40s nodes. These enterprise GPUs export the necessary statistics, whereas consumer-grade GPUs do not.

  • I/O & Network Stats: Only collected on nodes running Rocky Linux 9.

  • Continuous Profiling: Only available on Rocky Linux 9 nodes and requires the ENABLE_CONTINUOUS_PROFILING environment variable described above.

We are currently in the process of upgrading all nodes to Rocky 9, which will expand the availability of these metrics.

Further Reading