Eureka2¶
The Eureka2 HPC cluster is our most modern shared HPC facility. This means it is open for use to anyone at the University, whereas some other clusters at Surrey are owned by and reserved for the private use of certain research groups at the University.
If you would like to request access to the Eureka clusters, please see Getting access to HPC.
If you are looking to purchase HPC compute resources for your research, we encourage you to invest in Eureka2. Match funding is available, and investing in additional capacity can increase your group's job priority and “Fairshare” on the cluster.
See Purchasing HPC Servers for more information.
Eureka2 cluster overview¶
Eureka2 is our most modern shared HPC cluster. It is open to anyone at the University of Surrey.
Eureka2 is currently a homogeneous cluster; however, this is unlikely to remain the case as we add new nodes and partitions in the future.
Open OnDemand URL: https://eureka2-ondemand.surrey.ac.uk
Login Node/Submit node: eureka2.surrey.ac.uk (Accessible via Secure shell (SSH) access or RemoteLabs web portal)
Quick Getting Started Guide¶
Access the cluster
Connect through Open OnDemand, log in to eureka2.surrey.ac.uk via Secure shell (SSH) access, or use the RemoteLabs web portal.
Upload and organise your code and data
Copy your code, input files, and datasets to the cluster before running jobs. Store important files in your home directory and use the BeeGFS parallel scratch space for larger temporary job data; see HPC local storage and BeeGFS parallel scratch storage.
Prepare your software environment
There are three main ways to provide your software on Eureka2:
Use Environment Modules (Lmod) to load centrally installed compilers, libraries, and applications.
Create a Conda environment to manage your own Python packages in an isolated user-space environment.
Use Apptainer containers when you need a portable and reproducible software stack.
Submit your job
Once your data and software environment are ready, submit either a batch job with Batch jobs (sbatch) or start an interactive job with Interactive jobs (srun).
Monitor and manage your jobs
After your job enters the queue, use the commands described in Job Management to check status, inspect resource use, or cancel it.
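The steps above can be sketched as a short shell walk-through. This is a minimal illustration rather than an official template: `your_username`, `my_project`, the job name, the ten-minute time limit and the file name `quickstart_job.sh` are all placeholders, and the module line is left commented out because module names differ between clusters.

```shell
#!/bin/bash
# Sketch of the quick start steps. Names marked "placeholder" are assumptions.

CLUSTER_USER="${CLUSTER_USER:-your_username}"   # placeholder: your University username
LOGIN_NODE="eureka2.surrey.ac.uk"               # login/submit node named on this page

# Steps 1-2 run on your own machine. They are printed rather than executed,
# so this sketch can be run anywhere without opening a connection.
echo "ssh ${CLUSTER_USER}@${LOGIN_NODE}"
echo "rsync -av ./my_project/ ${CLUSTER_USER}@${LOGIN_NODE}:my_project/"

# Steps 3-4: write a minimal batch script to quickstart_job.sh.
cat > quickstart_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=quickstart
#SBATCH --partition=shared
#SBATCH --ntasks=1
#SBATCH --time=00:10:00
#SBATCH --output=quickstart_%j.out

# Load centrally installed software with Lmod. The module name is a
# placeholder; run "module avail" on the cluster to see what is installed.
# module load <module_name>

echo "Running on $(hostname)"
EOF

# Steps 4-5: submit and monitor, but only where Slurm is actually installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch quickstart_job.sh
    squeue -u "$USER"
else
    echo "sbatch not found; copy quickstart_job.sh to the cluster and run: sbatch quickstart_job.sh"
fi
```

Run on the login node, the final block submits the job and lists your queued jobs; run anywhere else, it only writes the script and prints a reminder.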
Cluster specification¶
| Eureka2 cluster specification | | | |
|---|---|---|---|
| OS | Rocky Linux 8 | | |
| Fabric | Mellanox Infiniband EDR 100Gb/s | | |
| Parallel storage | BeeGFS 105TB | | |
| Standard Storage | NFS 30GB (user quota) | | |
| Scheduler/Queue | Slurm | | |
| Open OnDemand | https://eureka2-ondemand.surrey.ac.uk | | |
| Login Node | eureka2.surrey.ac.uk | | |
| 32 x CPU node | 2x AMD EPYC 7452 @ 3.3 GHz | 512GB RAM | |
| 4 x High memory node | 2x AMD EPYC 7452 @ 3.3 GHz | 2TB RAM | |
| 2 x GPU node | 2x AMD EPYC 7513 @ 3.7 GHz | 512GB RAM | 3x A100 80GB |
| 1 x GPU node | 2x Intel Gold 6548N @ 2.8 GHz | 512GB RAM | 4x L40S 48GB |
| Total CPU Cores | 2800+ | | |
| Total Memory | 30TB | | |
Software¶
The Eureka clusters host a wide variety of software and development tools relevant to science, engineering and statistical computing workloads. These include compilers such as GCC and Intel, scripting languages such as Julia, R and Python, and a wide range of standard software such as MATLAB, Mathematica, LAMMPS, CASTEP and GROMACS, plus many more.
If there is an application you need installed on the cluster, please submit a request; see Requesting New Software.
You are also free to build and use your own versions of software in your HPC storage areas.
BeeGFS parallel scratch storage¶
The BeeGFS filesystem has been tuned to ensure we are getting the best performance possible from the system.
The optimum configuration for performance yielded the following results in benchmark tests:
| Peak write | Peak read | Agg write @128 threads | Agg read @128 threads | Single thread read | Single thread write |
|---|---|---|---|---|---|
| 47 GB/s | 48 GB/s | 46.3 GB/s | 48 GB/s | 3.6 GB/s | 3.6 GB/s |
These results are based on sequential read/writes and an “N to N” files to thread ratio (a file per thread).
Sequential IO showed no significant performance improvement when increasing the number of threads beyond 128.¶
We conducted similar benchmarks using random (rather than sequential) reads and writes, and these showed some continued performance gains beyond 128 threads.
Random IO showed some continued performance improvement beyond 128 threads.¶
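To make the “N to N” pattern concrete, the sketch below starts several writers in parallel, each writing sequentially to its own file. It only illustrates the access pattern: it is not the benchmark tool behind the figures above, the sizes are deliberately tiny, and the directory name `n_to_n_demo` and the variable names are placeholders.

```shell
#!/bin/bash
# Toy "N to N" write pattern: N writers, each writing sequentially to its own file.
# Illustrative only; far too small to measure filesystem performance.

N_WRITERS="${N_WRITERS:-4}"                  # number of parallel writers
MB_PER_FILE="${MB_PER_FILE:-2}"              # size of each file in MiB (tiny on purpose)
TARGET_DIR="${TARGET_DIR:-./n_to_n_demo}"    # placeholder: point this at a scratch directory

mkdir -p "$TARGET_DIR"

for i in $(seq 1 "$N_WRITERS"); do
    # One file per writer, written in 1 MiB blocks, running in the background.
    dd if=/dev/zero of="$TARGET_DIR/file_$i.dat" bs=1048576 count="$MB_PER_FILE" 2>/dev/null &
done
wait   # block until every writer has finished

ls -l "$TARGET_DIR"
```

With the defaults this leaves four 2 MiB files in `n_to_n_demo`; an “N to 1” pattern, by contrast, would have every writer target the same file.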
The Eureka2 BeeGFS storage summary:
2 storage servers
48 NVMe drives
6 storage targets per server
Total usable capacity of 70 TB
Slurm on Eureka2¶
This section covers some of the specifics of Slurm on the Eureka2 cluster.
For general information about the Slurm scheduler please see HPC job scheduler (Slurm)
Eureka2 partitions (queues)¶
On Eureka2 there are different partitions (queues) to which you can submit your jobs.
The configurations of the partitions are summarised in the table below:
| Name | Node count | Limitations | Purpose |
|---|---|---|---|
| debug | 2 | 4 hrs default, 4 hrs maximum; 32 cores per job; 2 jobs queued per user | Debugging jobs that can eventually run across all nodes. |
| shared | 29 | 1 day default, 1 week maximum | Day to Day production jobs. |
| shared_risk | 32 | 1 day default, 1 week maximum | Day to Day production jobs that can checkpoint. |
| high_mem | 4 | 1 day default, 1 week maximum | Jobs that require a large amount of memory. |
| high_mem_risk | 6 | 1 day default, 1 week maximum | Jobs that require a large amount of memory that can checkpoint. |
| gpu | 2 | 1 day default, 1 week maximum | Jobs that require GPU compute. |
| gpu_risk | 3 | 1 day default, 1 week maximum | Jobs that require GPU compute that can checkpoint. |
| astro_edge_project | 1 | 1 day default, 1 week maximum | Exclusively for use by Astro Edge project members. |
| astro_black_project | 2 | 1 day default, 1 week maximum | Exclusively for use by Astro Black project members. |
| astro_dd_project | 1 | 1 day default, 1 week maximum | Exclusively for use by Astro DD project members. |
| applied_micro_project | 2 | 1 day default, 1 week maximum | Exclusively for use by Applied Micro project members. |
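As a rough illustration of how the general-purpose partitions in the table relate to each other, the helper below maps a job's needs onto a partition name. The function name `choose_partition` and its arguments are hypothetical, and it ignores the debug and project partitions; treat it as a reading aid rather than a site-provided tool.

```shell
#!/bin/bash
# Map a job's needs onto a partition name from the table above.
# usage: choose_partition <cpu|high_mem|gpu> <yes|no>
#        (second argument: can the job checkpoint and restart?)
choose_partition() {
    local resource="$1"
    local can_checkpoint="$2"
    local base
    case "$resource" in
        cpu)      base="shared" ;;
        high_mem) base="high_mem" ;;
        gpu)      base="gpu" ;;
        *)        echo "unknown resource type: $resource" >&2; return 1 ;;
    esac
    # Jobs that can checkpoint may use the larger "_risk" variant.
    if [ "$can_checkpoint" = "yes" ]; then
        echo "${base}_risk"
    else
        echo "$base"
    fi
}

choose_partition cpu no          # prints: shared
choose_partition high_mem yes    # prints: high_mem_risk

# On the cluster itself, Slurm can list the partitions that really exist.
if command -v sinfo >/dev/null 2>&1; then
    sinfo --summarize
fi
```

The result can be passed straight to Slurm, for example `sbatch --partition="$(choose_partition cpu yes)" my_job.sh`, where `my_job.sh` stands for your own batch script.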
Preemption Risk Partitions¶
Some partitions on Eureka2 are marked as “risk” partitions. Jobs running in these partitions may be preempted if the cluster is busy and there are jobs waiting in the priority project partitions. By submitting your job to a “risk” partition, you are agreeing that your job may be preempted at any time. If your job is preempted, it will be re-queued and will run again when resources become available.
What is the benefit of submitting to a “risk” partition? 💡
Making your job able to checkpoint and using the _risk queues is a great way to reduce your job's queuing time: these queues contain more resources and allow use of the project-owned nodes when they are not in use by the project users.
Who should consider using the “risk” partitions? ✅
Anyone whose jobs can checkpoint and restart from a saved state should consider using the “risk” partitions to reduce their queuing time, as should anyone whose jobs have a very short run time and can easily be re-queued if preempted.
Who should avoid using the “risk” partitions? ❌
Anyone whose jobs cannot checkpoint and restart from a saved state should not use the “risk” partitions, particularly for long-running jobs that would be significantly impacted if they were preempted.
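One simple way to make a job safe for a “risk” partition is to record progress after every unit of work and resume from that record when the job starts again. The sketch below writes such a batch script and then runs it locally to show the resume behaviour. The file names `ckpt_job.sh` and `ckpt_demo.state`, the step count and the time limit are placeholders, and adding `--requeue` is an assumption about how you may want to mark the job; check the local Slurm guidance for what Eureka2 actually requires.

```shell
#!/bin/bash
# Write a checkpoint-aware batch script to ckpt_job.sh.
cat > ckpt_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=ckpt_demo
#SBATCH --partition=shared_risk
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --requeue

CHECKPOINT_FILE="${CHECKPOINT_FILE:-ckpt_demo.state}"   # placeholder file name
TOTAL_STEPS="${TOTAL_STEPS:-5}"                         # placeholder amount of work

# Resume after the last completed step if a checkpoint exists, else start at 0.
step=0
if [ -f "$CHECKPOINT_FILE" ]; then
    step="$(cat "$CHECKPOINT_FILE")"
    echo "Resuming after step $step"
fi

while [ "$step" -lt "$TOTAL_STEPS" ]; do
    step=$((step + 1))
    echo "Working on step $step"    # stand-in for the real computation
    # Record progress only once the step is done. Writing to a temporary file
    # and renaming it avoids a half-written checkpoint if the job is stopped.
    echo "$step" > "$CHECKPOINT_FILE.tmp"
    mv "$CHECKPOINT_FILE.tmp" "$CHECKPOINT_FILE"
done
echo "All $TOTAL_STEPS steps complete"
EOF

# Local dry run: pretend an earlier run was preempted after step 3.
echo 3 > ckpt_demo.state
bash ckpt_job.sh
```

In the dry run the script skips steps 1 to 3 and only performs steps 4 and 5. A real application would save its own state (for example a restart file) in place of the step counter.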
Eureka2 accounting and usage¶
XDMoD: https://eureka2-xdmod.surrey.ac.uk/
A graphical user interface with extensive graphic and analytical capability.
Detailed utilization metrics including number of jobs, CPU hours, wait times, job size, etc.
Customizable Metric Explorer where users can generate custom plots comparing multiple metrics
A custom report builder for the automatic generation of detailed periodic reports.
Note
XDMoD requires you to connect from on campus or via the University’s VPN - Global Protect
Eureka2 quick start¶
To help users get started quickly, we recommend using the eureka2-ondemand web interface: https://eureka2-ondemand.surrey.ac.uk
Note
Open OnDemand requires you to connect from on campus or via the University’s VPN - Global Protect
Eureka2 GPUs¶
Eureka2 currently has 6x NVIDIA A100 80GB GPUs. A number of these GPUs are partitioned into smaller GPU instances (Multi-Instance GPU, or MIG), allowing us to run more GPU jobs simultaneously. For more information on MIG, please see NVIDIA’s documentation.
The table below details the current type of GPUs available on the cluster:
| Type | Total Count | Node | Description |
|---|---|---|---|
| 1g.10gb | 7 | gpu-node01 | 1 compute instance & 10 GB memory |
| 2g.20gb | 3 | gpu-node01 | 2 compute instances & 20 GB memory |
| 3g.40gb | 4 | 2x gpu-node01, 2x gpu-node02 | 3 compute instances & 40 GB memory |
| a100 | 2 | gpu-node02 | A non MIG’d A100 with 80 GB memory |
| l40s | 4 | gpu-node03 | An L40S with 48 GB memory |
Use the following options to submit a job to the gpu partition using the default job QoS:
#SBATCH --partition=gpu
#SBATCH --gres=gpu:<type>:<number_of_gpus>
For example, to request 2x 2g.20gb GPUs for your job, add #SBATCH --gres=gpu:2g.20gb:2 to your submission script; to request a single full A100 GPU, add #SBATCH --gres=gpu:a100:1.
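The same options can be assembled into a complete submission script. In the sketch below, the helper name `gres_request`, the file name `gpu_job.sh` and the ten-minute time limit are placeholders; the accepted type names simply mirror the table above and would need updating if the GPU layout changes.

```shell
#!/bin/bash
# Build a --gres request for one of the GPU types listed in the table above.
# usage: gres_request <type> <count>
gres_request() {
    local gpu_type="$1"
    local count="$2"
    case "$gpu_type" in
        1g.10gb|2g.20gb|3g.40gb|a100|l40s) ;;
        *) echo "unknown GPU type: $gpu_type" >&2; return 1 ;;
    esac
    echo "--gres=gpu:${gpu_type}:${count}"
}

# Write a small GPU batch script; the heredoc expands gres_request as it is written.
cat > gpu_job.sh <<EOF
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH $(gres_request 2g.20gb 2)
#SBATCH --time=00:10:00

# List the GPU devices visible to the job.
nvidia-smi -L
EOF

cat gpu_job.sh
```

Submitting the generated file with `sbatch gpu_job.sh` on the cluster requests two 2g.20gb MIG instances in the gpu partition.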
The number and type of MIG GPUs is subject to change in the future as we work out what is the best layout for users’ needs. Any changes will be announced on the Eureka HPC teams channel in the Research Computing Community Team.