Eureka2

The Eureka2 HPC cluster is our most modern shared HPC facility. This means it is open for use to anyone at the University, whereas some other clusters at Surrey are owned by and reserved for the private use of certain research groups at the University.

If you would like to request access to the Eureka clusters, please see Getting access to HPC.

If you are looking to purchase HPC compute resources for your research, we encourage you to invest in Eureka2. Match funding is available, and investing in additional capacity increases your group's job priority and "Fairshare" on the cluster.

See Purchasing HPC Servers for more information.

Eureka2 cluster overview

Eureka2 is our most modern shared HPC cluster. It’s open to use for anyone at the University of Surrey.

Eureka2 is currently a homogeneous cluster; however, this is unlikely to remain the case as we add new nodes and partitions in the future.

Quick Getting Started Guide

  1. Access the cluster

    Connect through Open OnDemand, log in to eureka2.surrey.ac.uk via secure shell (SSH), or use the RemoteLabs web portal.

  2. Upload and organise your code and data

    Copy your code, input files, and datasets to the cluster before running jobs. Store important files in your home directory and use the BeeGFS parallel scratch space for larger temporary job data; see HPC local storage and BeeGFS parallel scratch storage.

  3. Prepare your software environment

    There are three main ways to provide your software on Eureka2:

  4. Submit your job

    Once your data and software environment are ready, submit either a batch job with Batch jobs (sbatch) or start an interactive job with Interactive jobs (srun).

  5. Monitor and manage your jobs

    After your job enters the queue, use the commands described in Job Management to check status, inspect resource use, or cancel it.
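The steps above can be sketched as a handful of shell commands. This is a minimal illustration, not an official procedure: the username, project directory, batch script name and job ID are hypothetical placeholders, and the script only prints each command so that you can copy the ones you need. Steps 1 and 2 are typed on your own machine; steps 4 and 5 are typed on the login node.

```shell
#!/bin/bash
# Quick-start sketch: prints the commands for each step instead of running them.

CLUSTER_HOST="eureka2.surrey.ac.uk"          # login node from the cluster specification
REMOTE_USER="${REMOTE_USER:-your_username}"  # placeholder: your University username
PROJECT_DIR="${PROJECT_DIR:-my_project}"     # hypothetical local project directory
JOB_SCRIPT="${JOB_SCRIPT:-job.sh}"           # hypothetical batch script name
JOB_ID="${JOB_ID:-12345}"                    # hypothetical job ID reported by sbatch

# Print a command line exactly as it would be typed.
show() {
    printf '%s\n' "$*"
}

# 1. Access the cluster (on your own machine).
show ssh "${REMOTE_USER}@${CLUSTER_HOST}"

# 2. Copy code and data into your home directory on the cluster (on your own machine).
show rsync -av "${PROJECT_DIR}/" "${REMOTE_USER}@${CLUSTER_HOST}:${PROJECT_DIR}/"

# 4. Submit a batch job (on the login node).
show sbatch "${JOB_SCRIPT}"

# 5. Monitor and manage the job (on the login node).
show squeue -u "${REMOTE_USER}"   # jobs you have queued or running
show sacct -j "${JOB_ID}"         # accounting record for one job
show scancel "${JOB_ID}"          # cancel the job
```

Step 3 is left out because the software options depend on what your job needs.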

Cluster specification

Eureka2 cluster specification

OS                      Rocky Linux 8
Fabric                  Mellanox InfiniBand EDR 100Gb/s
Parallel storage        BeeGFS 105TB
Standard storage        NFS 30GB (user quota)
Scheduler/Queue         Slurm
Open OnDemand           https://eureka2-ondemand.surrey.ac.uk
Login node              eureka2.surrey.ac.uk
32 x CPU node           2x AMD EPYC 7452 @ 3.3 GHz, 512GB RAM
4 x High memory node    2x AMD EPYC 7452 @ 3.3 GHz, 2TB RAM
2 x GPU node            2x AMD EPYC 7513 @ 3.7 GHz, 512GB RAM, 3x A100 80GB
1 x GPU node            2x Intel Gold 6548N @ 2.8 GHz, 512GB RAM, 4x L40S 48GB
Total CPU cores         2800+
Total memory            30TB

Software

The Eureka clusters host a wide variety of software and development tools for science, engineering and statistical computing workloads. These include compilers such as GCC and Intel, scripting languages such as Julia, R and Python, and a wide range of standard software such as MATLAB, Mathematica, LAMMPS, CASTEP and GROMACS, plus many more.

If there is an application you need installed on the cluster, please submit a request. See Requesting New Software.

If you maintain your own builds of software, you are free to build and use them in your HPC storage areas.

BeeGFS parallel scratch storage

The BeeGFS filesystem has been tuned to ensure we are getting the best performance possible from the system.

The optimum configuration for performance yielded the following results in benchmark tests:

Metric                         Result
Peak write                     47 GB/s
Peak read                      48 GB/s
Agg write @128 threads         46.3 GB/s
Agg read @128 threads          48 GB/s
Single thread read             3.6 GB/s
Single thread write            3.6 GB/s

These results are based on sequential read/writes and an “N to N” files to thread ratio (a file per thread).
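To make the "N to N" pattern concrete, the sketch below starts N writers in parallel, each writing sequentially to its own file. It only illustrates the access pattern and is not the tool or configuration used to produce the figures above; the writer count, file size and target directory are assumptions, and the target defaults to a temporary directory rather than a real scratch path. Dedicated tools such as IOR or fio are better suited to real measurements.

```shell
#!/bin/bash
# N-to-N sequential write sketch: N writers, one file per writer.

N_WRITERS="${N_WRITERS:-4}"               # assumed number of parallel writers
FILE_MB="${FILE_MB:-4}"                   # assumed size of each file in MiB
TARGET_DIR="${TARGET_DIR:-$(mktemp -d)}"  # hypothetical target; point at scratch space to test it

i=0
while [ "$i" -lt "$N_WRITERS" ]; do
    # Each writer streams zeros sequentially into its own file, in the background.
    dd if=/dev/zero of="${TARGET_DIR}/writer_${i}.dat" bs=1048576 count="$FILE_MB" 2>/dev/null &
    i=$((i + 1))
done
wait   # block until every writer has finished

# Count the files produced: one per writer is expected.
FILES_WRITTEN=0
for f in "${TARGET_DIR}"/writer_*.dat; do
    if [ -f "$f" ]; then
        FILES_WRITTEN=$((FILES_WRITTEN + 1))
    fi
done
echo "Wrote ${FILES_WRITTEN} files of ${FILE_MB} MiB each to ${TARGET_DIR}"
```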

Figure (beegfs_sequentialIO.png): Sequential IO showed no significant performance improvement when increasing the number of threads beyond 128.

We conducted similar benchmarks using random (rather than sequential) reads and writes. These showed some continued performance gains beyond 128 threads.

Figure (beegfs_randomIO.png): Random IO showed some continued performance improvement beyond 128 threads.

The Eureka2 BeeGFS storage summary:

  • 2 storage servers

  • 48 NVMe drives

  • 6 storage targets per server

  • Total usable capacity of 70 TB

Slurm on Eureka2

This section covers some of the specifics of Slurm on the Eureka2 cluster.

For general information about the Slurm scheduler, please see HPC job scheduler (Slurm).

Eureka2 partitions (queues)

On Eureka2 there are different partitions (queues) to which you can submit your jobs.

The configurations of the partitions are summarised in the table below:

Name                   Node count  Limitations                     Purpose
debug                  2           4 hrs default, 4 hrs maximum    Debugging jobs that can eventually run across all nodes.
                                   32 cores per job
                                   2 jobs queued per user
shared                 29          1 day default, 1 week maximum   Day to Day production jobs.
shared_risk            32          1 day default, 1 week maximum   Day to Day production jobs that can checkpoint.
high_mem               4           1 day default, 1 week maximum   Jobs that require a large amount of memory.
high_mem_risk          6           1 day default, 1 week maximum   Jobs that require a large amount of memory that can checkpoint.
gpu                    2           1 day default, 1 week maximum   Jobs that require GPU compute.
gpu_risk               3           1 day default, 1 week maximum   Jobs that require GPU compute that can checkpoint.
astro_edge_project     1           1 day default, 1 week maximum   Exclusively for use by Astro Edge project members.
astro_black_project    2           1 day default, 1 week maximum   Exclusively for use by Astro Black project members.
astro_dd_project       1           1 day default, 1 week maximum   Exclusively for use by Astro DD project members.
applied_micro_project  2           1 day default, 1 week maximum   Exclusively for use by Applied Micro project members.
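As an illustration of how a partition from the table is selected, here is a minimal batch script sketch for the shared partition. The job name, resource amounts and time limit are arbitrary example values, not site recommendations. The script falls back to sensible defaults when the Slurm variables are unset, so it can also be run outside a job to check that it works.

```shell
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=shared        # partition name taken from the table
#SBATCH --time=02:00:00           # example limit; the default is 1 day, the maximum 1 week
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # example value
#SBATCH --mem=8G                  # example value
#SBATCH --output=%x_%j.out        # %x = job name, %j = job ID

# Use the CPU count Slurm allocated, or 1 when run outside a job.
THREADS="${SLURM_CPUS_PER_TASK:-1}"
export OMP_NUM_THREADS="$THREADS"

echo "Job ${SLURM_JOB_ID:-not-under-slurm} on $(uname -n) using ${THREADS} thread(s)"

# Replace this comment with your own program, for example ./my_program (hypothetical name).
```

Saved as job.sh (a hypothetical file name), the script is submitted with sbatch job.sh. Running sinfo -s on the login node prints a summary of the partitions currently configured.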

Preemption Risk Partitions

Some partitions on Eureka2 are marked as “risk” partitions. Jobs running in these partitions may be preempted if the cluster is busy and there are jobs waiting in the priority project partitions. By submitting your job to a “risk” partition, you are agreeing that your job may be preempted at any time. If your job is preempted, it will be re-queued and will run again when resources become available.

What is the benefit of submitting to a “risk” partition? 💡

Enabling your job to checkpoint and using the _risk queues is a great way to reduce your job's queuing time. These queues contain more resources and allow use of the project-owned nodes when they are not in use by the project users.

Who should consider using the “risk” partitions?

Anyone with jobs that can checkpoint and restart from a saved state should consider using the "risk" partitions to reduce their queuing time. The same applies to jobs with a very short run time that can easily be re-queued if preempted.

Who should avoid using the “risk” partitions?

Anyone with jobs that cannot checkpoint and restart from a saved state should not use the "risk" partitions. This applies especially to long-running jobs that would be significantly impacted if they were preempted.
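A minimal sketch of a job that can checkpoint, assuming the work can be split into numbered steps. The step count and checkpoint file name are hypothetical, and the --requeue directive is included on the assumption that it is acceptable on Eureka2. After each step the script records its progress; if the job is preempted and re-queued, the next run reads the file and resumes from the last completed step rather than from the beginning.

```shell
#!/bin/bash
#SBATCH --job-name=ckpt_demo
#SBATCH --partition=shared_risk   # a "risk" partition from the table
#SBATCH --time=01:00:00           # example value
#SBATCH --ntasks=1
#SBATCH --requeue                 # mark the job as eligible for requeuing

TOTAL_STEPS="${TOTAL_STEPS:-5}"             # hypothetical number of work steps
CKPT_FILE="${CKPT_FILE:-ckpt_demo.state}"   # hypothetical checkpoint file in the working directory

# Resume from the last completed step if a checkpoint exists.
start=0
if [ -f "$CKPT_FILE" ]; then
    start="$(cat "$CKPT_FILE")"
fi

step="$start"
while [ "$step" -lt "$TOTAL_STEPS" ]; do
    # Do one unit of real work here, then record that it is complete.
    step=$((step + 1))
    echo "$step" > "$CKPT_FILE"
done

echo "Finished at step ${step} (started from step ${start})"
```

Delete the checkpoint file before submitting a fresh run, otherwise the job will consider the work already done.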

Eureka2 accounting and usage

XDMoD: https://eureka2-xdmod.surrey.ac.uk/

  • A graphical user interface with extensive graphic and analytical capability.

  • Detailed utilization metrics including number of jobs, CPU hours, wait times, job size, etc.

  • Customizable Metric Explorer where users can generate custom plots comparing multiple metrics

  • A custom report builder for the automatic generation of detailed periodic reports.

Note

XDMoD requires you to connect from on campus or via the University’s VPN - Global Protect

Eureka2 quick start

To help users get started quickly, we recommend using the eureka2-ondemand web interface: https://eureka2-ondemand.surrey.ac.uk

Note

Open OnDemand requires you to connect from on campus or via the University’s VPN - Global Protect

Eureka2 GPUs

Eureka2 currently has 6x NVIDIA A100 80GB GPUs. A number of these GPUs are partitioned into smaller GPUs (Multi-Instance GPUs, or MIG), allowing us to run more GPU jobs simultaneously. For more information on MIG, please see NVIDIA's documentation.

The table below details the current type of GPUs available on the cluster:

Type      Total count  Node                          Description
1g.10gb   7            gpu-node01                    1 compute instance & 10 GB memory
2g.20gb   3            gpu-node01                    2 compute instances & 20 GB memory
3g.40gb   4            2x gpu-node01, 2x gpu-node02  3 compute instances & 40 GB memory
a100      2            gpu-node02                    A full (non-MIG) A100 with 80 GB memory
l40s      4            gpu-node03                    An L40S with 48 GB memory

Use the following options to submit a job to the gpu partition using the default job QoS:

#SBATCH --partition=gpu
#SBATCH --gres=gpu:<type>:<number_of_gpus>

For example, to request two 2g.20gb GPUs for your job, add #SBATCH --gres=gpu:2g.20gb:2 to your submission script. To request a single full A100 GPU, add #SBATCH --gres=gpu:a100:1.
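Put together, a complete GPU submission script might look like the sketch below. The job name, time limit and CPU count are example values, and the script assumes that Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs, which is its usual behaviour but has not been confirmed here for Eureka2. It requests one small MIG instance and then reports which GPU devices the job can see.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1g.10gb:1      # one 1g.10gb MIG instance (type taken from the table)
#SBATCH --time=01:00:00           # example value
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4         # example value

# Slurm normally sets CUDA_VISIBLE_DEVICES for the allocated GPUs (assumption).
VISIBLE="${CUDA_VISIBLE_DEVICES:-none}"
echo "CUDA_VISIBLE_DEVICES=${VISIBLE}"

# List the GPUs the job can see, if the NVIDIA tools are present on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || echo "nvidia-smi could not query any GPU"
else
    echo "nvidia-smi not found; this does not look like a GPU node"
fi

# Replace this comment with your own GPU program.
```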

The number and type of MIG GPUs is subject to change in the future as we work out what is the best layout for users’ needs. Any changes will be announced on the Eureka HPC teams channel in the Research Computing Community Team.