6. Eureka2 documentation
A collection of Topics about working with the Eureka2 HPC Cluster.
The Eureka HPC service consists of our main shared HPC clusters. This means they are open for use by anyone at the University, whereas other clusters at Surrey are owned by, and reserved for the private use of, certain research groups.
If you would like to request access to the Eureka clusters, please see Requesting access to HPC.
If you are looking to purchase HPC compute resources for your research, we would always encourage you to invest in Eureka; see Purchasing HPC servers for more information.
6.1. Eureka2 cluster overview
Eureka2 is our most modern shared HPC cluster. It's open for use by anyone at the University of Surrey.
Eureka2 is currently a homogeneous cluster, although this is unlikely to remain the case as we add new nodes and partitions in the future.
6.1.1. Cluster specification
| Eureka2 cluster specification | |
|---|---|
| OS | Rocky Linux 8 |
| Fabric | Mellanox InfiniBand EDR 100 Gb/s |
| Parallel storage | BeeGFS 70 TB |
| Standard storage | NFS 30 GB (personal quota) |
| Scheduler/Queue | Slurm |
| Web interface | Open OnDemand |
| Login node | eureka2.surrey.ac.uk |
| 32x CPU nodes | 2x AMD EPYC 7452 @ 3.3 GHz, 512 GB RAM |
| 4x high memory nodes | 2x AMD EPYC 7452 @ 3.3 GHz, 2 TB RAM |
| 2x GPU nodes | 2x AMD EPYC 7513 @ 3.7 GHz, 512 GB RAM, 3x A100 80 GB |
| Total CPU cores | 2400+ |
| Total memory | 25 TB |
6.1.2. Software
The Eureka clusters host a wide variety of software and development tools relevant to science, engineering and statistical computing workloads: compilers such as GCC and Intel, scripting languages such as Julia, R and Python, as well as a wide range of standard software such as MATLAB, Mathematica, LAMMPS, CASTEP and GROMACS, plus many more…
If there is an application you need installed on the cluster, please submit a request; see Request new software.
If you have your own builds of software, you are free to build and use your own versions in your HPC storage areas.
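Installed software is typically made available through an environment modules system on HPC clusters. As a sketch (assuming Eureka2 follows this convention; the module names below are illustrative, so check `module avail` for the exact names and versions on the cluster):

```
# List the software available on the cluster
module avail

# Load a compiler and an application into your environment
# (names and versions are illustrative)
module load gcc
module load matlab

# Show what is currently loaded in this session
module list
```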
6.1.3. BeeGFS Parallel scratch storage
The BeeGFS filesystem has been tuned to ensure we are getting the best performance possible from the system.
The optimum configuration for performance yielded the following results in benchmark tests:
| Peak write | Peak read | Agg. write @ 128 threads | Agg. read @ 128 threads | Single-thread read | Single-thread write |
|---|---|---|---|---|---|
| 47 GB/s | 48 GB/s | 46.3 GB/s | 48 GB/s | 3.6 GB/s | 3.6 GB/s |
These results are based on sequential reads/writes with an "N to N" file-to-thread ratio (one file per thread).
We conducted similar benchmarks using random reads/writes (rather than sequential), and these yielded interesting results, including continued performance gains beyond 128 threads.
The Eureka2 BeeGFS storage summary:

- 2 storage servers
- 48 NVMe drives
- 6 storage targets per server
- Total usable capacity of 70 TB
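To check your own footprint on the parallel scratch storage, BeeGFS provides a client-side control tool. A minimal sketch, assuming the `beegfs-ctl` client tools are available on the login node and user quota tracking is enabled on Eureka2:

```
# Report BeeGFS usage and quota for the current user
beegfs-ctl --getquota --uid $(id -u)
```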
6.2. Using Slurm on Eureka2
This section covers some of the specifics of using Slurm on the Eureka clusters.
For general information about the Slurm scheduler please see Slurm - job scheduler
6.2.1. Eureka2 partitions (Queues)
On Eureka2 there are four partitions (queues) to which you can submit your jobs: `shared`, `debug`, `high_mem` and `gpu`.
The configurations of the partitions are summarised in the table below:
| Name | Node count | Time limit | Purpose |
|---|---|---|---|
| debug | 2 | 4 hrs default, 8 hrs maximum | Debugging jobs that can eventually run across all nodes. |
| shared | 30 | 1 day default, 1 week maximum | Day-to-day production jobs. |
| high_mem | 4 | 1 day default, 1 week maximum | Jobs that require a large amount of memory. |
| gpu | 2 | 1 day default, 1 week maximum | Jobs that require GPU compute. |
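As an illustration, a minimal submission script for the `shared` partition might look like the sketch below (the job name, task count, time limit and `my_program` binary are all illustrative, not Eureka2 defaults):

```
#!/bin/bash
#SBATCH --partition=shared
#SBATCH --job-name=example     # illustrative job name
#SBATCH --nodes=1
#SBATCH --ntasks=8             # illustrative core count
#SBATCH --time=02:00:00        # 2 hours; partition default is 1 day

srun ./my_program              # hypothetical application binary
```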
6.2.2. Eureka2 accounting and usage
Accounting and usage data for Eureka2 is available through XDMoD: https://eureka2-xdmod.surrey.ac.uk/

XDMoD provides:

- A graphical user interface with extensive graphing and analytical capability.
- Detailed utilization metrics including number of jobs, CPU hours, wait times, job size, etc.
- A customizable Metric Explorer where users can generate custom plots comparing multiple metrics.
- A custom report builder for the automatic generation of detailed periodic reports.
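For a quick command-line view of your own usage, Slurm's accounting tools can be queried directly (a sketch, assuming Slurm's accounting database is enabled, which XDMoD relies on; the start date is illustrative):

```
# Summarise your jobs since a given date
sacct --starttime=2024-01-01 \
      --format=JobID,JobName,Partition,Elapsed,State,AllocCPUS
```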
6.3. Eureka2 quick start
To help users get started quickly, we recommend using the eureka2-ondemand web interface: https://eureka2-ondemand.surrey.ac.uk
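Alternatively, you can connect to the login node directly over SSH, replacing `username` with your University username:

```
ssh username@eureka2.surrey.ac.uk
```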
6.4. Eureka2 GPUs
Eureka2 currently has 6x Nvidia A100 80 GB GPUs. A number of these are partitioned into smaller logical GPUs (Multi-Instance GPU, or MIG), allowing us to run more GPU jobs simultaneously. For more information on MIG, please see Nvidia's documentation: https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html
The table below details the current types of GPU available on the cluster:
| Type | Total count | Node | Description |
|---|---|---|---|
| 1g.10gb | 7 | gpu-node01 | 1 compute instance & 10 GB memory |
| 2g.20gb | 3 | gpu-node01 | 2 compute instances & 20 GB memory |
| 3g.40gb | 4 | 2x gpu-node01, 2x gpu-node02 | 3 compute instances & 40 GB memory |
| a100 | 2 | gpu-node02 | A non-MIG'd A100 with 80 GB memory |
Use the following options to submit a job to the `gpu` partition using the default job QoS:

```
#SBATCH --partition=gpu
#SBATCH --gres=gpu:<type>:<number_of_gpus>
```

For example, to request 2x 2g.20gb GPUs for your job you would add `#SBATCH --gres=gpu:2g.20gb:2` to your submission script, or to request a single full A100 GPU, `#SBATCH --gres=gpu:a100:1`.
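Putting this together, a complete GPU submission script might look like the sketch below (the CPU count, time limit and application binary are illustrative, not Eureka2 defaults):

```
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2g.20gb:2   # two 2g.20gb MIG instances
#SBATCH --cpus-per-task=8      # illustrative CPU request
#SBATCH --time=04:00:00        # illustrative; default is 1 day

# Print the GPU instances allocated to this job
nvidia-smi

srun ./my_gpu_program          # hypothetical application binary
```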
The number and type of MIG GPUs is subject to change in the future as we work out the best layout for users' needs. Any changes will be announced on the Eureka HPC Teams channel in the Research Computing Community Team.