7. AI Surrey documentation

A collection of topics specific to working with the AI@Surrey Condor pool.

7.1. AI@Surrey Pool overview

The execute nodes in the pool and the GPUs they provide are listed below. Nodes also advertise the HasWeka, HasStornext and TransferNode Machine ClassAd attributes where applicable; see ClassAds (section 7.2.2.1) for how to target these in your jobs.

Execute node name    GPUs
aisurrey01           6 x GeForce RTX 2080
aisurrey02           6 x GeForce RTX 2080
aisurrey03           6 x GeForce RTX 2080
aisurrey04           7 x GeForce RTX 2080
aisurrey05           7 x GeForce RTX 2080
aisurrey06           7 x GeForce RTX 2080
aisurrey07           7 x GeForce RTX 2080
aisurrey08           7 x GeForce RTX 2080
aisurrey10           7 x GeForce RTX 2080
aisurrey11           7 x GeForce RTX 3090
aisurrey12           7 x GeForce RTX 3090
aisurrey13           7 x GeForce RTX 3090
aisurrey14           8 x GeForce RTX 3090
aisurrey15           8 x GeForce RTX 3090
aisurrey16           8 x GeForce RTX 3090
aisurrey17           8 x GeForce RTX 3090
aisurrey18           8 x GeForce RTX 3090
aisurrey19           8 x GeForce RTX 3090
aisurrey21           4 x A100 SXM 80GB
aisurrey22           4 x A100 SXM 80GB
aisurrey23           4 x A100 SXM 80GB
aisurrey24           8 x A100 SXM 80GB
aisurrey25           8 x A100 SXM 80GB
aisurrey26           8 x A100 SXM 80GB
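
If you want to query the pool yourself, condor_status run from one of the submit nodes will list the machines. A minimal sketch is shown below; TotalGpus is the usual HTCondor attribute name for the GPU count, but the exact attribute advertised on this pool may differ.

# Run from a submit node: list each machine and the number of GPUs it advertises
# (TotalGpus is an assumption and may be named differently on this pool)
condor_status -af Machine TotalGpus | sort -u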

7.2. WEKA

The AI@Surrey cluster has a bespoke, non-backed-up fast storage area designed to serve data to the GPU servers at speed. This scratch area is a WEKA file system built on servers full of NVMe drives, interconnected with a dedicated 100GbE network.

7.2.1. WEKA file systems

AI@Surrey nodes have access to 3 different WEKA file systems.

/mnt/fast/nobackup/users:

Contains user-owned directories. By default each user gets a 200GB hard quota on their directory. This directory will be deleted when you leave the university and your Surrey account is disabled, so do not keep precious research data here for long-term storage.

/mnt/fast/nobackup/scratch4weeks:

A first-come, first-served temporary scratch space. Files stored here are deleted after 4 weeks.

/mnt/fast/datasets:

A collection of popular read-only datasets.

If you would like a dataset to be copied into the /mnt/fast/datasets directory, please open a support ticket.
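
To see which datasets are already available, you can simply list the directory from any node that mounts it, for example:

# List the read-only datasets currently provided on the WEKA storage
ls /mnt/fast/datasets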

7.2.1.1. Checking Quota for your user directory

Your directory in /mnt/fast/nobackup/users has a 200GB quota by default.

There are a couple of ways to check the quota usage for your directory in /mnt/fast/nobackup/users:

  • Grafana Dashboard

    You can check your quota from the Grafana dashboard at the link below (Global Protect VPN required). Log in with your Surrey credentials.

    https://prometheus.surrey.ac.uk/grafana/d/hKSSTp97z/weka-user-quota?orgId=1

    WEKA User Quota Dashboard (screenshot)

  • df command

    You can use the df command to see the available quota left on your user directory.

    df -h /mnt/fast/nobackup/users/<your_username_here>
    

    Note

    This currently only works from the execute nodes (the simplest way is to submit an interactive Condor job and then run the command there). It does not currently work on the submit nodes (aisurrey-condor & aisurrey-condor01), but support will come in a future WEKA software upgrade.
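
    For example, a quick way to run this check (the submit file name below is a placeholder for any interactive submit file that requests a WEKA-connected node) is:

    # On the submit node: request an interactive job on an execute node with HasWeka
    condor_submit -i interactive_submit_file

    # Once the interactive shell opens on the execute node, check your remaining quota
    df -h /mnt/fast/nobackup/users/<your_username_here>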

7.2.1.2. scratch4weeks cleanup script

The /mnt/fast/nobackup/scratch4weeks filesystem is a temporary scratch space; it is not backed up and is not intended for long-term storage of your data. A cleanup script runs in this area daily and deletes data that has not been accessed in 4 weeks.

  • The script will check all the data in the directory for anything that hasn’t been accessed in the last 23 days and mark it for deletion.

  • It will then e-mail the owners of these files, giving a 5-day warning that the data will be deleted.

  • If the owner still needs to keep the data, they will have 5 days to touch or access the files, which updates the access timestamp (see the example after this list).

  • After 5 days, any files that have still not been accessed for 28 days will be removed from the area by the script.
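
A minimal way to refresh the access timestamps on data you still need (the directory below is a placeholder) is:

# Update the access time of every file under a directory you want to keep a while longer
find /mnt/fast/nobackup/scratch4weeks/<your_username>/keep_me -type f -exec touch -a {} +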

Note

Data deleted by the script cannot be recovered. It is necessary for us to run this tidy-up script to ensure that space isn’t being wasted on this system. It’s a high-performance system with a very high cost per unit of capacity, so to make sure all users can get the most out of it we need to ensure that the space is utilised appropriately.

Please ensure you are regularly copying data you wish to keep long term back to your project spaces or OneDrive.

7.2.2. How to use the WEKA storage in your Condor jobs

Your job will have access to the file systems at the paths listed below by default.

  • /mnt/fast/datasets

  • /mnt/fast/nobackup/users

  • /mnt/fast/nobackup/scratch4weeks

Note

You do not need to specify any of the WEKA storage in the environment line of your Condor submit file. The WEKA file systems will be available and mounted to your job by default.

For more information on job submission scripts, see: Submitting jobs

Your user directory will be at /mnt/fast/nobackup/users/<your-username>

7.2.2.1. ClassAds

The execute nodes with access to the WEKA file systems have HasWeka in their Machine ClassAds, so you can target these machines in the requirements section of your job submission scripts like so:

requirements = HasWeka

If you want to transfer data from StorNext (project spaces at /vol) then you will need the following:

requirements = HasWeka && HasStornext

or

requirements = TransferNode

Note

All machines with both WEKA and StorNext will have the TransferNode Classad attribute.
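
For example, to see which machines currently advertise these attributes, you could run the following from a submit node:

# Machines that can see both WEKA and StorNext (the transfer nodes)
condor_status -constraint 'TransferNode' -af Machine | sort -u

# Machines with the WEKA file systems mounted
condor_status -constraint 'HasWeka' -af Machine | sort -u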

7.2.2.2. Environment Variables

The following environment variables have been created to make it easier to reference these file system paths in your code.

  • FASTSCRATCHUSER=/mnt/fast/nobackup/users/<username>

  • FASTSCRATCH4WEEKS=/mnt/fast/nobackup/scratch4weeks

  • FASTDATASETS=/mnt/fast/datasets
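
As an illustration, a job's wrapper script might use these variables like this (the dataset name, training script and output directory below are hypothetical):

#!/bin/bash
# Hypothetical wrapper script used as the job executable
DATA_DIR="$FASTDATASETS/my_dataset"        # hypothetical dataset directory
OUT_DIR="$FASTSCRATCHUSER/experiment_01"   # results written to your WEKA user directory
mkdir -p "$OUT_DIR"
python train.py --data "$DATA_DIR" --output "$OUT_DIR"   # hypothetical training script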

7.2.3. Transferring data to and from the WEKA file systems

The best way to transfer data is to use a Condor job to copy data to and from the WEKA file systems on one of the TransferNodes (an execute node with both a WEKA and a StorNext connection). However, you can also access the WEKA file systems from the AI@Surrey submit nodes for a quicker, more flexible way to carry out light-weight operations on your data (e.g. checking files, reading logs, small data copies, deleting old data).

See below for more detail on each method:

7.2.3.1. Condor job

You can schedule a Condor job to run a copy program such as rsync or cp to transfer data on or off the WEKA file systems.

  • ✅ Best way to quickly copy large amounts of data as the Condor execute nodes have high-speed connections to the project spaces and WEKA.

  • Your requirements section will need to specify the HasWeka and HasStornext ClassAd attributes, or simply TransferNode.

An example Condor submit file to copy data onto the WEKA file systems:

abc123@aisurrey-Condor:~ $ cat submit_file
####################
#
# Example data copy job for HTCondor
#
####################

# --------------------------------------------
# Executable and its arguments
executable    = /usr/bin/rsync
arguments     = -a -v /vol/research/path/to/data_source /mnt/fast/nobackup/path/to/data_destination
environment   = "mount=/vol/research/path/to/data_source"

# --------------------------------------------
# Input, Output and Log files
log    = $(cluster).$(process).log
output = $(cluster).$(process).out
error  = $(cluster).$(process).error

# -------------------------------------
# Requirements for the Job
requirements = HasWeka && HasStornext

# --------------------------------------
# Resource requirements
request_CPUs     = 1
request_memory = 1028

+CanCheckpoint = true
+JobRunTime = 12

# -----------------------------------
# Queue commands
queue 1
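
You would then submit the job from a submit node and monitor it in the usual way, for example:

abc123@aisurrey-Condor:~ $ condor_submit submit_file
abc123@aisurrey-Condor:~ $ condor_q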

Note

You do not need to specify any of the WEKA storage in the environment line of your Condor submit file. The WEKA file systems will be available and mounted to your job by default.

7.2.3.2. Interactive Condor job

You can also submit an interactive Condor job. This will allow you to manually run commands and monitor your data copy in real-time.

Simply add the -i flag to your Condor submit command: condor_submit -i submit_file. For more information see How to run an interactive job?

  • ✅ Use this to quickly inspect the two different storage locations.

  • ✅ Can be used to manually initiate copies and transfers. Can do many different copy operations ad-hoc.

  • ✅ Good for quickly copying large amounts of data as the Condor execute nodes have high speed connections to the project spaces and Weka.

  • 🚨 Interactive jobs only have a 4hr runtime limit, so for very large data transfers it’s best not to use an interactive job.
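
As a sketch, an ad-hoc copy in an interactive session might look like this (the submit file name and paths below are placeholders):

# On the submit node: request an interactive shell on a TransferNode
# (the submit file should set requirements = TransferNode)
condor_submit -i submit_file

# Once the shell opens on the execute node, run the copy by hand
rsync -av --progress /vol/research/myproject/data/ /mnt/fast/nobackup/users/abc123/data/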

Note

We are working on giving you more ways to access the data on the WEKA file systems. We will update the information when appropriate.

7.2.3.3. datamove servers

These servers are user accessible machines with high performance connections to both the Research project spaces on /vol/research and the WEKA filesystems on /mnt/fast.

Currently there is one datamove server; we may add more in the future.

datamove1.surrey.ac.uk

You can log in to these servers via SSH (using the Global Protect VPN, or the SSH gateway access.eps.surrey.ac.uk, if you are not on campus) and use tools such as rsync to transfer or copy data between /mnt/fast, /vol and your Linux home directory.

  • ✅ Use these servers to shift data around between storage locations without needing a Condor job.

  • ✅ Can be used to manually initiate copies and transfers. Can do many different copy operations ad-hoc.

  • ✅ Good for quickly copying large amounts of data as the system has high speed connections to the Project spaces and WEKA.

  • 🚨 Bandwidth will be shared with other users’ data transfer processes on this machine (Condor jobs are still the best way to get dedicated bandwidth).

Note

You can also use SFTP to transfer data on or off the filesystems via this server.
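
For example, once logged in to datamove1 you might copy a dataset from your project space onto your WEKA user directory like this (the paths below are placeholders):

# Run on datamove1.surrey.ac.uk: copy from a project space (/vol) onto WEKA (/mnt/fast)
rsync -av --progress /vol/research/myproject/dataset/ /mnt/fast/nobackup/users/abc123/dataset/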

7.2.3.4. Condor submit nodes

The WEKA file systems are mounted on both of the AI@Surrey submit nodes at the same paths as on the execute nodes, e.g. /mnt/fast/nobackup/scratch4weeks.

Log in with SSH to either:

  • aisurrey-condor

  • aisurrey-condor01

Use the shell to navigate the file systems and access your data. Read and write access is enabled which means you can also copy or delete data if you need to without the need for a Condor job.

  • ✅ Use this to quickly inspect data stored on the file systems, e.g. reading logs, locating a file, deleting old data.

  • 🚨 Not recommended for copying large amounts of data. The submit nodes do not have a high-performance connection to WEKA. You should always use a Condor job to transfer large amounts of data.

Warning

WEKA NFS doesn’t support file locking. This means that applications will not be able to lock files when working with them from the submit nodes, so just be aware that files will not be locked when you access them this way.

More info on file locking: https://docstore.mik.ua/orelly/networking_2ndEd/nfs/ch11_02.htm

7.2.3.5. Samba - Accessing WEKA from anywhere

Samba access is available for the “nobackup” WEKA file systems. This means you can access these file systems from anywhere. All you need are:

  • Surrey user credentials

  • Global Protect VPN

  • User access to AI Surrey Condor pool

Samba use cases:

  • ✅ Samba access should only be used for general file browsing, and data transfer operations such as copying files to or from the WEKA storage. It could also be useful for tidying up and deleting old data, or just viewing your files.

  • 🚨 It should not be used for processing data with programs or jobs, and it should not be used to transfer large amounts of data; Condor jobs are by far the best way to do this. Samba will not be able to deliver anywhere near the speed and performance that the Condor execute nodes achieve as full-fledged WEKA clients: it is limited by the network between your device and the WEKA servers and will only be as good as the weakest link in that chain.

For full instructions on how to access the WEKA SMB shares, please see the Help guide below.

Note

This document is hosted on SharePoint and a University of Surrey account is required to view it.
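
For those mounting the share from a Linux client, the command will typically look something like the sketch below; the server and share names here are placeholders, and the real details are given in the Help guide.

# Placeholders only - the actual SMB server and share names are in the Help guide
sudo mkdir -p /mnt/weka-smb
sudo mount -t cifs //<smb-server>/<share-name> /mnt/weka-smb -o username=<your_surrey_username>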