AISURREY¶
AISURREY is the GPU compute facility at the heart of Surrey’s Institute for People-Centred AI.
It is a heterogeneous compute pool focused on GPUs for AI and ML workloads.
It is coupled with a high-performance WEKA storage system designed to keep up with the I/O demands of the GPUs.
Note
AISURREY does not support MPI. If you need an MPI-capable cluster, use Eureka2.
AISurrey cluster overview¶
Open OnDemand URL: https://aisurrey-ondemand.surrey.ac.uk
Login/submit node: aisurrey-submit01.surrey.ac.uk, accessible via Secure shell (SSH) access
Quick Getting Started Guide¶
Access the cluster
Connect through Open OnDemand or log in to aisurrey-submit01.surrey.ac.uk via Secure shell (SSH) access.
Upload and organise your code and data
Copy code and any input data to the cluster before you run jobs. AISURREY provides three storage spaces for different data and workload types; see WEKA file systems.
Prepare your software environment
There are two main ways to provide your software on AISURREY:
Create a Conda environment to manage your own Python version and packages in an isolated user-space environment.
Use a container. For most AISURREY production workflows, this is the preferred approach. Build your image and publish it to the registry before submitting production jobs; see Building Container Images.
Submit your job
Once your data and software environment are ready, either submit a batch job (see Batch jobs (sbatch)) or start an interactive job (see Interactive jobs (srun)).
Monitor and manage your jobs
After your job enters the queue, use the commands described in Job Management to check status, inspect resource use, or cancel it.
Cluster specification¶
| Compute node name | Usable CPU cores | RAM (GB) | GPUs | GPU memory (GB per GPU) | GPU GRES name | Data storage connection |
|---|---|---|---|---|---|---|
| aisurrey-debug01 | 18 | 128 | 2 x RTX A5000 | 24 | nvidia_rtx_a5000 | fs_nfs |
| aisurrey-debug02 | 18 | 128 | 2 x RTX A5000 | 24 | nvidia_rtx_a5000 | fs_nfs |
| aisurrey-debug03 | 18 | 128 | 2 x RTX A5000 | 24 | nvidia_rtx_a5000 | fs_nfs |
| aisurrey-debug04 | 18 | 128 | 2 x RTX A5000 | 24 | nvidia_rtx_a5000 | fs_nfs |
| aisurrey01 | 60 | 256 | 6 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_weka |
| aisurrey02 | 60 | 256 | 6 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_weka |
| aisurrey03 | 60 | 256 | 6 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_weka |
| aisurrey04 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey05 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey07 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey08 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey09 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey10 | 60 | 256 | 7 x GeForce RTX 2080 Ti | 11 | nvidia_geforce_rtx_2080_ti | fs_nfs |
| aisurrey11 | 60 | 512 | 7 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey12 | 60 | 512 | 7 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey13 | 60 | 512 | 7 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey14 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey15 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey16 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey17 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey18 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey19 | 60 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey20 | 40 | 192 | 4 x Quadro RTX 8000 | 48 | quadro_rtx_8000 | fs_nfs |
| aisurrey21 | 60 | 512 | 4 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey22 | 60 | 512 | 4 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey23 | 60 | 512 | 4 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey24 | 92 | 1024 | 8 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey25 | 92 | 1024 | 8 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey26 | 92 | 1024 | 8 x A100 SXM 80GB | 80 | nvidia_a100-sxm4-80gb | fs_weka |
| aisurrey27 | 124 | 512 | 8 x GeForce RTX 3090 | 24 | nvidia_geforce_rtx_3090 | fs_weka |
| aisurrey28 | 36 | 192 | 2 x RTX A6000 | 48 | nvidia_rtx_a6000 | fs_nfs |
| aisurrey29 | 36 | 192 | 2 x RTX A6000 | 48 | nvidia_rtx_a6000 | fs_nfs |
| aisurrey30 | 36 | 190 | 4 x Quadro RTX 5000 | 16 | quadro_rtx_5000 | fs_nfs |
| aisurrey31 | 36 | 190 | 4 x Quadro RTX 5000 | 16 | quadro_rtx_5000 | fs_nfs |
| aisurrey32 | 36 | 190 | 4 x Quadro RTX 5000 | 16 | quadro_rtx_5000 | fs_nfs |
| aisurrey35 | 16 | 128 | 2 x L40S | 48 | nvidia_l40s | fs_weka |
Note
For an explanation of the data storage connection column see: WEKA data storage network connection types
Slurm on AISURREY¶
This section covers some of the specifics of Slurm on the AISURREY cluster.
For general information about the Slurm scheduler, please see HPC job scheduler (Slurm)
AISURREY example job scripts¶
Tip
Here are some example job scripts for the AISURREY cluster to help get you started.
Condor to Slurm command cheat sheet¶
| Task | Condor command | Slurm command |
|---|---|---|
| Cluster resources state | condor_status | sinfo |
| Cluster GPU resources state | condor_gstatus | slurm_gpustat |
| Job queues state | condor_q [-u <user>] | squeue [-u <user>] |
| Current job information | condor_q -l <jobID> | scontrol show job <jobID> |
| Cancel job | condor_rm <jobID> | scancel <jobID> |
| Submit batch job | condor_submit <job_script> | sbatch <job_script> |
| Submit interactive job | condor_submit -i <job_script> | srun -p <partition> -N <no. of nodes> --pty bash |
| Get job pending reason | - | squeue -u <user> |
| Entire cluster resources info | - | showcluster (Surrey custom command) |
AISURREY partitions (queues)¶
The table below lists all available partitions on AISURREY.

| Name | Total node count | Total GPU count (GPU type) | Time limit | Purpose | Preemption enabled |
|---|---|---|---|---|---|
| debug | 4 | 8 (NVIDIA RTX A5000) | 4 hours | general access code testing and debugging | no |
| 2080ti | 6 | 42 (NVIDIA GeForce RTX 2080 Ti) | 3 days | general access production jobs | no |
| 3090 | 8 | 64 (NVIDIA GeForce RTX 3090) | 3 days | general access production jobs | no |
| 3090_risk | 10 | 80 (NVIDIA GeForce RTX 3090) | 3 days | general access production jobs with checkpointing | yes (30 min grace) |
| a100 | 6 | 36 (NVIDIA A100) | 3 days | general access production jobs | no |
| rtx8000 | 1 | 4 (NVIDIA Quadro RTX 8000) | 3 days | general access production jobs | no |
| rtx5000 | 3 | 12 (NVIDIA Quadro RTX 5000) | 3 days | general access production jobs | no |
| pair-project | 2 | 16 (NVIDIA GeForce RTX 3090) | 3 days | exclusively for use by the pair project group | no |
| cogvis-project | 3 | 12 (NVIDIA 3090 and A6000) | 3 days | exclusively for use by the cogvis project group | no |
| rtx_a6000_risk | 2 | 4 (NVIDIA RTX A6000) | 3 days | general access production jobs with checkpointing | yes (30 min grace) |
| narrative-project | 1 | 2 (NVIDIA RTX 5000 ASA) | 3 days | exclusively for use and debugging by the narrative project group | no |
| nice-project | 1 | 2 (NVIDIA L40S) | 3 days | exclusively for use by the nice project group | no |
| l40s_risk | 1 | 2 (NVIDIA L40S) | 3 days | general access production jobs with checkpointing | yes (30 min grace) |
Preemption - risk queues¶
Any partition with _risk in its name has preemption enabled. Jobs running in such a partition may be interrupted and re-queued (stopped and sent back to the queue). We strongly advise using the _risk queues only if you have enabled checkpointing in your job.
Tip
Enabling checkpointing in your job and using the _risk queues is a good way to reduce your job’s queuing time: these queues include the exclusive project-owned nodes, so they offer more resources whenever the project users are not using those nodes.
The diagram below illustrates the different partitions available on AISURREY:
- The green partitions are open to all and have no risk of jobs being preempted by other jobs.
- The orange partitions are the group partitions. They can only be used by members of the relevant groups, who have priority access to their compute nodes.
- The red partitions are the _risk partitions, which are open to all. Jobs submitted to them may land on a compute node owned by one of the group partitions and are therefore at risk of being interrupted and re-queued by a competing job from that group partition. Use them only if your job can checkpoint. They offer access to all GPUs of a given type, so they may give shorter queue times and let people use group-owned nodes when the owning group is not using them.
Note
See checkpointing for descriptions of how to enable checkpointing. Step-by-step examples for tools like Python, MATLAB, and R are available in the checkpointing examples GitLab repository.
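As a rough illustration of what "can checkpoint" means for a job on a _risk partition, the sketch below saves its progress after every unit of work and resumes from the last saved step when the job starts again after being re-queued. It is a minimal sketch, not the cluster's recommended method: the checkpoint file name, the step count, and the `do_one_step` placeholder are hypothetical. `--requeue` and `--gres=gpu:1` are standard sbatch options, but their behaviour on AISURREY should be confirmed against the checkpointing documentation.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=ckpt-demo
# A preemptible partition from the partitions table.
#SBATCH --partition=3090_risk
# Assumption: one GPU, requested with generic Slurm syntax.
#SBATCH --gres=gpu:1
# One day, within the 3 day partition limit.
#SBATCH --time=1-00:00:00
# Allow Slurm to put the job back in the queue after preemption.
#SBATCH --requeue

# Hypothetical checkpoint location and workload size; override via environment.
CKPT_FILE="${CKPT_FILE:-checkpoint.step}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Placeholder for one unit of real work (e.g. one training epoch).
do_one_step() {
    echo "running step $1"
}

# Run from the last saved step up to TOTAL_STEPS, saving after each step.
run_with_checkpoints() {
    local step=0
    # Resume: if a checkpoint exists, continue from the step it records.
    if [ -f "$CKPT_FILE" ]; then
        step=$(cat "$CKPT_FILE")
    fi
    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        do_one_step "$((step + 1))"
        step=$((step + 1))
        # Write to a temporary file first, then rename, so an interruption
        # cannot leave a half-written checkpoint behind.
        echo "$step" > "$CKPT_FILE.tmp" && mv "$CKPT_FILE.tmp" "$CKPT_FILE"
    done
    echo "completed $step of $TOTAL_STEPS steps"
}

run_with_checkpoints
```

Because progress is saved after every step rather than in response to a signal, a job interrupted at any point loses at most one step of work when it restarts.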
AISURREY accounting and usage¶
When you submit a job on our HPC cluster, Slurm calculates the cost of your job based on the resources you request. This cost is used to manage fair access to the system and ensure that all users have a balanced opportunity to run their workloads.
Each job is assigned a billing weight based on the resources it requests. These weights are applied to Slurm’s trackable resources (TRES), such as CPUs, memory, and GPUs. The total cost of a job is calculated as:
\(\text{Job Cost} = \sum (\text{TRES Count} \times \text{TRES Weight}) \times \text{Job Duration}\)
For example, if a job requests 4 CPUs for 2 hours, and each CPU has a weight of 1:
\(4×1×2=8\) billing units
Below are the current weightings for the AI@Surrey cluster:

| Name | CPU | Memory | GPU |
|---|---|---|---|
| debug | 0.5 | 0.08 | 124.40 |
| 2080ti | 0.9 | 0.22 | 150.10 |
| 3090 | 0.9 | 0.11 | 198.50 |
| a100 | 1.0 | 0.13 | 435.10 |
| rtx8000 | 1.7 | 0.36 | 182.00 |
Note
These weights will change over time as we monitor usage and adjust the system to ensure fair access for all users.
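To make the formula concrete with the weights above, the sketch below sums count × weight over CPU, memory, and GPU and multiplies by the duration. It assumes that the duration is in hours (as in the 4 CPU example) and that the memory weight applies per GB; neither unit is stated in the table, so treat the result as indicative only. `job_cost` is a hypothetical helper, not a cluster command.

```shell
#!/bin/bash
# job_cost CPUS CPU_WEIGHT MEM_GB MEM_WEIGHT GPUS GPU_WEIGHT HOURS
# Prints sum(count x weight) x duration, rounded to two decimal places.
# Assumptions: memory is counted in GB and duration in hours.
job_cost() {
    awk -v c="$1" -v cw="$2" -v m="$3" -v mw="$4" -v g="$5" -v gw="$6" -v h="$7" \
        'BEGIN { printf "%.2f\n", (c * cw + m * mw + g * gw) * h }'
}

# The example from the text: 4 CPUs with weight 1 for 2 hours.
job_cost 4 1 0 0 0 0 2

# A hypothetical a100 job: 8 CPUs, 64 GB of memory, and 1 GPU for 2 hours.
job_cost 8 1.0 64 0.13 1 435.10 2
```

The first call prints 8.00, matching the worked example. The second prints 902.84 under the stated unit assumptions, which shows how strongly the GPU weight dominates the cost of a typical job.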
All users start with the same fairshare value, meaning jobs are initially given equal priority. Over time, fairshare is adjusted based on historical usage:
Users who consume more resources will see their priority decrease.
Users who use fewer resources will have higher priority when submitting jobs.
This system ensures fair access to compute resources for all users.
AISurrey data storage - WEKA¶
The AI@Surrey cluster has a bespoke, fast storage area that is not backed up, designed to serve data to the GPU servers at speed. This scratch area is a WEKA file system built on servers full of NVMe drives interconnected with a dedicated 100GbE network.
WEKA file systems¶
AI@Surrey nodes have access to 3 different WEKA file systems.
- /mnt/fast/nobackup/users:
Contains user-owned directories. By default, each user gets a 200GB hard quota on this directory. The directory is deleted when you leave the University and your Surrey account is disabled. Do not use it for long-term storage of precious research data.
- /mnt/fast/nobackup/scratch4weeks:
A first-come, first-served temporary scratch space. Files here that have not been accessed for 4 weeks are deleted; see scratch4weeks cleanup script.
- /mnt/fast/datasets:
A collection of popular READ ONLY datasets.
If you would like a dataset to be copied into the /mnt/fast/datasets directory please open a Support Ticket.
WEKA data storage network connection types¶
Nodes access the WEKA storage (i.e. /mnt/fast) through one of two types of storage connection:
- fs_weka refers to the faster, direct 100GbE connection to WEKA storage.
- fs_nfs refers to the slower, remote NFS connection to WEKA storage.
Both connection types give access to the same data storage locations; however, nodes with fs_nfs will not get the same level of storage performance as those with fs_weka.
These strings can be used as features (filters) in a Slurm job script. For example, use the following line to select only nodes with the fast connection:
#SBATCH --constraint=fs_weka
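For context, a complete batch script using this constraint might look like the following. This is a sketch under assumptions: the partition, CPU, memory, and time values are illustrative, `--gres=gpu:1` is the generic Slurm form rather than a confirmed AISURREY convention, and the final command is a placeholder for your own workload.

```shell
#!/bin/bash
# Hypothetical job name.
#SBATCH --job-name=weka-demo
# The 2080 Ti nodes in the specification table include both connection types.
#SBATCH --partition=2080ti
# Only run on nodes with the direct 100GbE WEKA connection.
#SBATCH --constraint=fs_weka
# Assumption: one GPU, requested with generic Slurm syntax.
#SBATCH --gres=gpu:1
# Illustrative resource values.
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=01:00:00
# %x expands to the job name and %j to the job ID.
#SBATCH --output=%x-%j.out

# Slurm sets these variables inside a job; the fallbacks keep the script
# usable when it is run outside the scheduler.
JOB_ID="${SLURM_JOB_ID:-none}"
NODE="${SLURMD_NODENAME:-$(uname -n)}"

summary="job ${JOB_ID} on node ${NODE}"
echo "$summary"

# Placeholder: replace with your own workload.
echo "workload would start here"
```

Note that a constraint narrows the set of eligible nodes, so the job may wait longer in the queue in exchange for faster storage access.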
Checking quota for your user directory¶
Your directory in /mnt/fast/nobackup/users has a 200GB quota by default.
There are two ways to check how much of that quota you have used:
Grafana dashboard
You can check your quota from the Grafana dashboard at the link below (GlobalProtect VPN required). Log in with your Surrey credentials.
WEKA User Quota Dashboard¶
df command
You can use the df command to see the available quota left on your user directory.
df -h /mnt/fast/nobackup/users/<your_username_here>
scratch4weeks cleanup script¶
The /mnt/fast/nobackup/scratch4weeks filesystem is a temporary scratch space. It is not backed up and is not intended for long-term storage of your data.
A clean-up script runs in this area daily and deletes data that has not been accessed in 4 weeks.
The script will check all the data in the directory for anything that hasn’t been accessed in the last 23 days and mark these for deletion.
It will then e-mail the owners of these files and issue a 5 day warning of the deletion of the data.
If the owner still needs to keep the data they will have 5 days to touch or access the files, which will update the access timestamp of the file.
After 5 days, any files that have still not been accessed for 28 days will be removed from the area by the script.
Please ensure you are regularly copying data you wish to keep long term back to your project space(s).
Warning
Data deleted by the script cannot be recovered. We need to run this tidy-up script to ensure that space isn’t wasted on this system. It is a high-performance system with a very high cost per unit of capacity, so the space must be used appropriately for all users to get the most out of it.
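To see which of your files are approaching deletion before the warning e-mail arrives, something like the sketch below can help. It uses `find` with `-atime` to list files whose last access is at least a given number of days old, mirroring the 23 day threshold described above. `list_stale_files` is a hypothetical helper, and it assumes the clean-up script judges age by access time in the same way `find` does.

```shell
#!/bin/bash
# list_stale_files DIR [DAYS]
# Lists regular files under DIR last accessed DAYS or more days ago
# (default 23, the marking threshold described in the text).
list_stale_files() {
    local dir="$1"
    local days="${2:-23}"
    # find counts age in whole 24 hour periods, so "-atime +N" means
    # "more than N full days"; DAYS - 1 therefore includes files DAYS old.
    find "$dir" -type f -atime "+$((days - 1))" -print
}
```

For example, `list_stale_files /mnt/fast/nobackup/scratch4weeks/<your_directory>` lists the files that would currently be marked, where `<your_directory>` stands for whatever path you use in that area. Copy anything you need to keep back to your project space.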
CEEMS: Job Monitoring & Profiling¶
The AI@Surrey cluster utilises CEEMS (Cluster Energy and Emissions Monitoring System) to provide users with deep visibility into their SLURM jobs. CEEMS is a Prometheus-based monitoring stack that tracks resource utilisation, energy consumption, and performance metrics at the job level.
Note
The CEEMS integration is currently in a testing phase. Features may change, and historical data prior to the rollout may be partial or missing.
Accessing the Dashboard¶
Monitoring data is visualized through our internal Grafana instance.
Navigate to the Grafana Server.
Go to the AI@Surrey folder.
Select the User SLURM Job Summary dashboard.
Features¶
Job Summary¶
The User SLURM Job Summary dashboard provides a high-level overview of your usage:
A summary of your SLURM jobs from the past 7 days.
Aggregate statistics for your usage across the whole cluster.
Job Drill-down: The list at the bottom shows individual jobs. Clicking on a Job ID will take you to a detailed metrics page for that specific execution.
Detailed Job Metrics¶
Clicking on a specific job in the list will navigate you to a detailed dashboard for that job. This view includes in-depth metrics such as:
Compute: CPU usage and cache hits/misses.
Accelerators: GPU utilisation and GPU memory usage.
Data: I/O operations and network activity.
Profiling: Memory and CPU profiling flame graphs for code optimization.
Enabling Continuous Profiling¶
CEEMS supports zero-instrumentation continuous profiling using eBPF and Grafana Pyroscope. This allows you to see exactly which functions in your code are consuming the most resources without modifying your source code.
To access the memory and CPU profiling flame graphs, you must opt in by setting a specific environment variable in your SLURM job script.
Add the following line to your submission script:
export ENABLE_CONTINUOUS_PROFILING=1
Caveats & Limitations¶
Please be aware of the following hardware and software constraints regarding metric availability:
GPU Profiling: Only available on A100 and L40S nodes. These enterprise GPUs export the necessary statistics, whereas consumer-grade GPUs do not.
I/O & Network Stats: Only collected on nodes running Rocky Linux 9.
Continuous Profiling: Only available on Rocky Linux 9 nodes and requires the ENABLE_CONTINUOUS_PROFILING environment variable described above.
We are currently in the process of upgrading all nodes to Rocky 9, which will expand the availability of these metrics.