FAQ
General Information
What types of clusters exist at the University of Surrey?
How do I get an account on the HPC clusters?
Access depends on your affiliation. Eureka2 is free for all University of Surrey users. Other clusters may have prerequisites or be reserved for specific research groups. Please visit the Getting Access to HPC page (technical docs) for the step-by-step account creation guide.
How do I run code on the clusters?
To run code on the clusters, log in and submit a Slurm job script from a login node. Your script should define the resources your job needs (for example CPUs, memory, time) and the commands to execute your program code. See the Slurm job submission guide.
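As a rough illustration of the paragraph above, a minimal job script might look like the sketch below. The job name and all resource values are placeholders rather than recommended settings, and a real script may also need a partition (see the Slurm job submission guide for the options that apply to each cluster).

```shell
#!/bin/bash
# Placeholder job name and resource requests; adjust to your workload.
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=00:30:00
# %x expands to the job name and %j to the job ID.
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

# Slurm sets SLURM_* variables inside a job. The defaults below let the
# script also run outside Slurm for a quick local check.
job_id="${SLURM_JOB_ID:-local}"
cores="${SLURM_CPUS_PER_TASK:-1}"

echo "Job $job_id starting on $(uname -n) with $cores core(s)"

# Replace this line with the command that runs your program.
echo "Hello from the compute node"
```

Saved as job.sh (a hypothetical filename), this could be submitted from a login node with sbatch job.sh.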
As a newcomer to Surrey HPC, which cluster should I use?
Where can I get training opportunities?
Surrey Research Computing runs regular training sessions and tutorials. See past and upcoming events on their SharePoint site.
You can also use the University’s LinkedIn Learning subscription. For suggested courses, see here.
I am new to research software engineering. Do you provide guidance on best practices?
Yes. The RSE team maintains a best-practice guide that is updated regularly: RSE best-practice guide.
Who should I contact for urgent issues or outages?
Submit a Support Ticket for all technical issues so they can be tracked and prioritised. For urgent issues or outages, also post in the SRC Teams community channels for faster visibility.
Slurm Scheduler
How can I attach to a running Slurm job?
You can attach to a running Slurm job with:
srun --pty --overlap --jobid YOUR-JOBID bash
Replace YOUR-JOBID with your Slurm job ID.
This opens an interactive bash shell on the first node allocated to that job.
How can I find out why my job has been pending for a long period on Slurm, despite appearing to have enough resources available?
Use squeue -u <username> to inspect jobs with status PD (pending).
The NODELIST(REASON) column explains why the job is waiting.
Priority means other jobs currently have higher scheduling priority.
Resources means no available node currently matches your requested resources
(for example partition, memory, CPUs, GPUs, or constraints).
For more detail on a specific job, run scontrol show job <jobid>.
See Job Management (squeue/scontrol)
and Slurm job priority and fairshare.
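If you have many pending jobs, it can help to count how often each reason appears. The sketch below is one way to do that. The helper name summarise_reasons is made up for this example, and the squeue call is skipped when Slurm is not installed.

```shell
#!/bin/bash
# summarise_reasons: reads one pending reason per line on stdin and
# prints "<count> <reason>", most frequent first.
summarise_reasons() {
    sort | uniq -c | sort -rn | sed 's/^ *//'
}

# On a cluster, feed it the reason column for your own pending jobs:
#   -h      drop the header line
#   -t PD   select pending jobs only
#   -o "%r" print only the reason
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -t PD -o "%r" | summarise_reasons
fi
```

For a single job, scontrol show job <jobid> remains the more detailed view.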
How can I ensure my job is submitted to the right node within the cluster?
Specify the target partition and node features in your job script, for example:
#SBATCH --partition=<partition_name> and #SBATCH --constraint=<feature>.
Use sinfo (and showcluster where available) to inspect available partitions,
node types, and features before submission.
See Submitting Slurm jobs (sbatch options)
and View cluster information (sinfo).
Software
Do you support commercial software (MATLAB, ANSYS, etc.)?
Yes, subject to license availability and cluster compatibility. Visit our Software Catalog to check whether your required application is available for research use. If it is not listed, submit a Support Ticket to discuss options with the SRC team.
The software I need isn’t installed. What do I do?
Try the following:
Check whether it is already available as an environment module.
If not, install it in user space using Conda, virtual environments, or an Apptainer container.
If you still need help, submit a Support Ticket to discuss central installation, licensing, and support options.
What should I do if I want to use Anaconda on the clusters?
On clusters that use Lmod (for example Eureka2), load Anaconda with module load (for example module load anaconda3), then create and manage environments with Conda.
On clusters that do not support Lmod (for example AI@Surrey), install Miniconda
in your user space or use an Apptainer container.
See the Conda guide
for setup and usage examples.
How do I create a custom container image for use on the clusters?
Build Docker images or Apptainer images (via ORAS) in Surrey GitLab and push them to the University container registry. You can also publish images from your local machine to Docker Hub for use on our clusters.
On University HPC clusters, only Apptainer is supported at runtime. Docker images are automatically converted and can be run through Apptainer.
Data Storage
How do I request extra storage for my research data?
Project space provides managed storage for your research data beyond the available storage in your home directory. The first 500 GB of project space is free; additional storage is chargeable. Request project space.
How do I access a project space, and how long does setup take?
For requests of up to 500 GB, project space is usually available within about 2 hours of submission.
On Linux, access is typically via /vol/research/<project_space_name>.
On Windows, access is via S:\Research\<project_space_name>.
See Accessing a storage volume.
Is project space backed up, and can I store sensitive data there?
Project space is managed and backed up, but it should not be used for sensitive, personal, or special category data. For data with additional compliance requirements, consider using a Trusted Research Environment (TRE) and contact support to discuss suitable options. See managed storage guidance and open a support ticket if you are unsure.
Job Performance
How do I know how many CPU cores and how much RAM I need for my job submission script?
Start by identifying whether your code runs only in serial (single core) or can be parallelised (multiple cores). For parallel workloads, use scaling tests and Amdahl’s law to estimate a sensible CPU core count. Then benchmark with different CPU and memory requests and select the smallest values that deliver stable performance. See benchmarking and scaling guidance.
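To get a feel for Amdahl’s law before running real scaling tests, the sketch below prints the theoretical speedup 1 / ((1 - p) + p / n) for several core counts. The helper name amdahl_speedup and the parallel fraction of 0.9 are assumptions for illustration. Measure the parallel fraction of your own code rather than relying on this value.

```shell
#!/bin/bash
# amdahl_speedup P N: theoretical speedup on N cores when a fraction P
# (between 0 and 1) of the runtime can be parallelised.
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

# Assumed parallel fraction of 0.9, for illustration only.
for cores in 1 2 4 8 16 32; do
    echo "$cores cores: $(amdahl_speedup 0.9 "$cores")x"
done
```

With 90% of the runtime parallelisable, the predicted speedup on 32 cores stays below 8x, which is why requesting more cores often gives diminishing returns.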
How can I check output from my HPC job?
Check the output files defined in your job script using #SBATCH --output and #SBATCH --error.
If you do not set these options, Slurm writes output to a default file such as slurm-<jobid>.out.
During execution, you can monitor output with tail -f <output_file>.
My code is taking too long to run. What can I do?
Start with profiling and benchmarking to identify the real bottleneck (CPU, memory, I/O, or communication). Then optimise in this order:
Request appropriate resources and test scaling behaviour.
Improve code hotspots (algorithms, vectorisation, parallelisation, and I/O patterns).
Use optimised libraries or containerised software stacks where appropriate.
See benchmarking and scaling guidance and checkpointing guidance for long-running jobs.
How do I implement checkpointing in my HPC job?
Checkpointing means saving your job state periodically so it can restart after interruption or timeout. This is especially important for long-running jobs and preemptible queues.
Common approaches are:
Use framework-native checkpoint features (for example TensorFlow/PyTorch model checkpoints).
Save application state at regular intervals (for example every N iterations or every N minutes).
Restart from the latest checkpoint in your job script.
See the checkpointing guide for implementation patterns and examples.
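The interval-based approach can be sketched in plain shell as follows. This is a minimal illustration, not the pattern from the checkpointing guide: the file name checkpoint.txt and the function run_steps are hypothetical, and the "work" is a placeholder echo.

```shell
#!/bin/bash
# Hypothetical names throughout: checkpoint.txt and run_steps are
# illustrative only.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"

# run_steps LAST: run steps up to LAST, resuming after the step recorded
# in the checkpoint file if one exists.
run_steps() {
    last_step=$1
    step=1
    if [ -f "$CHECKPOINT_FILE" ]; then
        step=$(( $(cat "$CHECKPOINT_FILE") + 1 ))
    fi
    while [ "$step" -le "$last_step" ]; do
        # Replace this line with one unit of real work.
        echo "step $step"
        # Write to a temporary file and rename it, so an interruption
        # cannot leave a half-written checkpoint behind.
        echo "$step" > "$CHECKPOINT_FILE.tmp"
        mv "$CHECKPOINT_FILE.tmp" "$CHECKPOINT_FILE"
        step=$(( step + 1 ))
    done
}
```

In a job script, a call such as run_steps 1000 would then continue from the last completed step each time the job is resubmitted after a timeout or interruption.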