Eureka

Attention

The Eureka cluster was retired in December 2024 and is no longer available for use. These pages remain for information only.

A collection of topics about working with the Eureka HPC cluster.

The Eureka HPC service consists of our main shared HPC clusters. They are open to anyone at the University, whereas other clusters at Surrey are owned by, and reserved for the private use of, particular research groups.

If you would like to request access to the Eureka clusters, please see Getting access to HPC.

If you are looking to purchase HPC compute resources for your research, we would always encourage you to invest in Eureka. See Purchasing HPC Servers for more information.

Eureka cluster overview

Eureka is a heterogeneous cluster running CentOS 7 Linux.

The compute resources on Eureka can be used for a wide variety of workloads, including large parallel jobs and high-memory jobs, and the cluster also has a small amount of GPU capability.

Eureka Cluster General Specifications

OS                CentOS 7
Fabric            Intel Omni-Path (OP) and Infiniband (IB)
Parallel storage  BeeGFS 56TB
Standard Storage  NFS 7.5TB
Scheduler/Queue   Slurm
Login Node        eureka.surrey.ac.uk

Eureka Node Specifications

Node Type          CPU Specification                 RAM    Accelerator        Fabric
16 x CPU node      Intel Xeon E5-2660 v4 @ 2.0 GHz   128GB  -                  Omni-Path
38 x CPU node      Intel Xeon Gold 5120 @ 2.20 GHz   192GB  -                  Omni-Path
13 x CPU node      Intel Xeon E5-2680 v2 @ 2.80 GHz  128GB  -                  Omni-Path
2 x High Mem node  Intel Xeon Gold 5120 @ 2.20 GHz   375GB  -                  Omni-Path
8 x CPU node       Intel Xeon E5-2470-0 @ 2.30 GHz   64GB   -                  Infiniband
12 x CPU node      Intel Xeon E5-2670 @ 2.60 GHz     64GB   -                  Infiniband
2 x CPU node       Intel Xeon E5-2670-v2 @ 2.50 GHz  128GB  -                  Infiniband
6 x CPU node       Intel Xeon E5-2697-v2 @ 2.70 GHz  128GB  -                  Infiniband
7 x High Mem node  Intel Xeon E5-2670 v2 @ 2.50 GHz  256GB  -                  Infiniband
1 x High Mem node  Intel Xeon E5-2670 v1 @ 2.60 GHz  256GB  -                  Infiniband
3 x GPU node       Intel Xeon E5-2670 v2 @ 2.50 GHz  256GB  Nvidia Tesla K20m  Infiniband

../_images/eureka-diagram.png

Cluster topology diagram.

Software

The Eureka clusters host a wide variety of software and development tools relevant to science, engineering and statistical computing workloads. These include compilers such as GCC and Intel, scripting languages such as Julia, R and Python, and a wide range of standard software such as MATLAB, Mathematica, LAMMPS, CASTEP and GROMACS, plus many more.

If there is an application you need installed on the cluster, please submit a request. See Requesting New Software.

You are also free to build and use your own versions of software in your HPC storage areas.

Using Slurm on Eureka

This section covers some of the specifics of using Slurm on the Eureka clusters.

For general information about the Slurm scheduler, please see HPC job scheduler (Slurm).

Eureka partitions (queues)

On Eureka there are five partitions (queues) to which you can submit your jobs: shared, high_mem, gpu, debug_all and debug_latest.

  • shared contains most of the nodes on Eureka

  • high_mem contains a few nodes with a large memory-per-core ratio

  • gpu contains a few nodes with GPU cards

  • debug_latest contains a node with the newest CPU features, such as avx2 and avx512

  • debug_all contains a node for running code that has been built without avx2 (or newer) support

The configurations of the partitions are summarised in the table below:

Name          Node count  Time limit                     Purpose
debug_all     1           60 min maximum                 Debugging jobs that can eventually run across all nodes.
debug_latest  1           60 min maximum                 Debugging jobs that can eventually run on the latest nodes and so take advantage of the newest features.
gpu           3           1 day default, 1 week maximum  Run jobs that require GPU cards.
high_mem      10          1 day default, 1 week maximum  Run jobs with high memory requirements. Current threshold is nodes with >=12 GB/core.
shared        90          1 day default, 1 week maximum  Day-to-day production jobs.
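As a rough check of the >=12 GB/core threshold, the sketch below divides the memory reported for each type of high-memory node by its core count. The CPU and memory figures are copied from the showcluster output shown further down this page; the function name mem_per_core is illustrative only and is not a command provided on Eureka. It assumes the memory column is in megabytes (as sinfo reports it) and that 1 GB = 1024 MB.

```shell
# mem_per_core: read "CPUS MEMORY_MB" pairs from stdin and print the
# memory per core in GB for each pair (assumes memory is given in MB).
mem_per_core() {
    awk '{ printf "%d cores: %.1f GB/core\n", $1, $2 / $1 / 1024 }'
}

# CPUS and MEMORY values of the three high-memory node types.
printf '%s\n' "28 385204" "20 257695" "16 257695" | mem_per_core
# prints:
#   28 cores: 13.4 GB/core
#   20 cores: 12.6 GB/core
#   16 cores: 15.7 GB/core
```

All three node types come out above 12 GB per core, whereas the 28-core nodes in the shared partition (191678 MB) have roughly half that.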

Eureka is a heterogeneous cluster with different types of nodes and two different low-latency network fabrics. This influences how you use the cluster and the resources you ask for in your Slurm job scripts.

The two low-latency network fabrics are Intel Omni-Path (op) and Infiniband (ib). Most nodes on the op fabric are newer nodes that support at least the avx2 instruction set. Nodes on the ib fabric do not support avx2 instructions.

If a piece of software can only run on avx2-enabled nodes, this is usually indicated in the module's name, so you should know when you load the module. In many cases this is not an issue, because programs will simply run on any node regardless of instruction set. Furthermore, nodes have different numbers of cores: 28, 24, 20 or 16.

Note

If you are running multi-node parallel jobs, you will need to consider which fabric you are using: op or ib. A job cannot run across two different fabrics, i.e. you cannot use a mixture of nodes from the ib and op fabrics.

How to submit jobs to the right nodes

To allow users to submit jobs to the correct type of nodes, we have enabled the #SBATCH --constraint directive in Slurm. This allows you to submit your jobs to specific types of nodes, selected by their features.

To see the available features, use the command sinfo -o "%R %.6D %.4c  %.6m %.30f" | column -t, or showcluster. Both give a summary of all nodes: the number of cores they have, their memory, the partitions they belong to and the features you can use to select them via the constraint.

[abc123@login7(eureka) Python_example]$ showcluster
PARTITION     NODES  CPUS  MEMORY  AVAIL_FEATURES
shared        8      16    64216   e5-2470-0,v1,ib
shared        33     28    191678  gold-5120,avx2,avx512,op
shared        2      20    128679  e5-2670-v2,v2,ib
shared        13     20    128706  e5-2680-v2,galaxy,op
shared        16     28    128658  e5-2660-v4,avx2,v4,op
shared        6      24    128711  e5-2697-v2,v2,ib
shared        12     16    64171+  e5-2670,v1,ib
debug_latest  1      28    191908  gold-5120,avx2,avx512,op
debug_all     1      16    64216   e5-2470-0,v1,ib
high_mem      2      28    385204  gold-5120,avx2,avx512,op
high_mem      7      20    257695  e5-2670-v2,v2,ib
high_mem      1      16    257695  e5-2670,v1,ib
gpu           3      20    257695  e5-2670-v2,ib
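As a sketch of how a listing like the one above can be narrowed down, the snippet below prints only the node groups that advertise a given feature. It works on a saved copy of the output (the shared rows are reproduced in a hypothetical file, cluster.txt); the function name nodes_with_feature is illustrative and is not a command provided on Eureka.

```shell
# Save the shared-partition rows of the listing to a file (hypothetical name).
cat > cluster.txt <<'EOF'
PARTITION     NODES  CPUS  MEMORY  AVAIL_FEATURES
shared        8      16    64216   e5-2470-0,v1,ib
shared        33     28    191678  gold-5120,avx2,avx512,op
shared        2      20    128679  e5-2670-v2,v2,ib
shared        13     20    128706  e5-2680-v2,galaxy,op
shared        16     28    128658  e5-2660-v4,avx2,v4,op
shared        6      24    128711  e5-2697-v2,v2,ib
shared        12     16    64171+  e5-2670,v1,ib
EOF

# nodes_with_feature FEATURE FILE
# Print partition, node count and CPUs per node for every row whose
# feature list contains FEATURE as a whole comma-separated item.
nodes_with_feature() {
    awk -v want="$1" 'NR > 1 {
        n = split($5, feats, ",")
        for (i = 1; i <= n; i++)
            if (feats[i] == want) { print $1, $2, $3; break }
    }' "$2"
}

nodes_with_feature avx2 cluster.txt
# prints:
#   shared 33 28
#   shared 16 28
```

Matching whole items rather than substrings keeps avx2 from also matching avx512, and v2 from matching e5-2670-v2.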

Finding the right options for #SBATCH --constraint in the output above can be daunting at first; however, most users fall into one of a few scenarios when submitting jobs to Eureka. Note that nodes have more than one feature, so if you ask for avx2 nodes this will include any node with that feature. The number of CPU cores a node has is also important.

  • Case 1 #SBATCH --constraint=[ib|op]

Request nodes exclusively on the Omni-Path fabric OR exclusively on the Infiniband fabric. This is for when you want to run a parallel multi-node job and you don’t care about the CPU instruction set.

  • Case 2 #SBATCH --constraint=avx512

Request only nodes with avx512. This is for when you want to run a single-node or parallel multi-node job on nodes with the avx512 instruction set.

  • Case 3 #SBATCH --constraint=avx2

Request nodes with at least the avx2 feature. This is for when you want to run a single-node or parallel multi-node job on nodes with at least the avx2 instruction set.

  • Case 4 #SBATCH --constraint=ib

Request nodes only on the Infiniband network fabric. This is for when you want to run a parallel multi-node job only on nodes on the Infiniband fabric.

  • Case 5 #SBATCH --constraint=op

Request nodes only on the Omni-Path network fabric. This is for when you want to run a parallel multi-node job only on nodes on the Omni-Path fabric.

  • Case 6 #SBATCH --constraint=e5-2660-v4

Request only nodes with the e5-2660-v4 CPU model. This is for when you want to run a single-node or parallel multi-node job on these specific nodes.

  • Case 7 #SBATCH --constraint="e5-2660-v4|gold-5120"

Request nodes with the e5-2660-v4 OR gold-5120 CPU model. This is for when you want to run a single-node or parallel multi-node job on a mixture of nodes that have either the e5-2660-v4 or the gold-5120 CPU.
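To make the cases above concrete, here is a minimal sketch that writes a single-node job script for a chosen constraint. The helper make_job and the file names are illustrative assumptions; the partition, the helloworld/1.1 module and the other directives are borrowed from the example scripts further down this page. The 16 tasks are an arbitrary choice that fits on the 28-core nodes selected here.

```shell
# make_job CONSTRAINT
# Print a single-node job script that requests nodes matching CONSTRAINT.
# (Hypothetical helper for illustration; not a command provided on Eureka.)
make_job() {
    cat <<EOF
#!/bin/bash
#SBATCH --partition=shared
#SBATCH --job-name="hello"
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=00:05:00
#SBATCH --constraint=$1
#SBATCH --mem=2G
#SBATCH --output=helloworld.out

cd "\$SLURM_SUBMIT_DIR"
module load helloworld/1.1
mpirun -np "\$SLURM_NTASKS" helloworld
EOF
}

make_job avx2 > job_avx2.sh                        # Case 3
make_job '"e5-2660-v4|gold-5120"' > job_mixed.sh   # Case 7, quotes kept in the script
# Submit on the cluster with: sbatch job_avx2.sh
```

If you pass a constraint on the sbatch command line instead of in the script, quote expressions such as "[ib|op]" so that the shell does not interpret the brackets and the pipe character.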

Consider the number of cores on the nodes

When requesting different sets of nodes, it is important to take into account the number of cores on the nodes you have requested.

For example, if you request nodes only on ib and you are running a large parallel job, then, depending on your job's requirements, you could ask for a maximum of 16 cores per node so that you can use any node on the ib fabric. If you were to ask for 24 cores per node, you would restrict yourself to only 6 nodes on the ib fabric.
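The arithmetic behind this example can be sketched with a small helper that sums the node counts of every row of a showcluster-style listing that is on a given fabric and has enough cores. The function name eligible_nodes is illustrative only; the rows are the shared-partition rows from the output above, and the result counts nodes that exist, not nodes that are currently free.

```shell
# eligible_nodes FABRIC CORES
# Read a showcluster-style listing from stdin and sum the NODES column
# over rows that carry the FABRIC feature and have at least CORES CPUs.
eligible_nodes() {
    awk -v fabric="$1" -v cores="$2" '
        NR > 1 && $3 >= cores && $5 ~ ("(^|,)" fabric "(,|$)") { total += $2 }
        END { print total + 0 }'
}

# Rows of the shared partition, copied from the showcluster output.
listing='PARTITION NODES CPUS MEMORY AVAIL_FEATURES
shared 8 16 64216 e5-2470-0,v1,ib
shared 33 28 191678 gold-5120,avx2,avx512,op
shared 2 20 128679 e5-2670-v2,v2,ib
shared 13 20 128706 e5-2680-v2,galaxy,op
shared 16 28 128658 e5-2660-v4,avx2,v4,op
shared 6 24 128711 e5-2697-v2,v2,ib
shared 12 16 64171+ e5-2670,v1,ib'

printf '%s\n' "$listing" | eligible_nodes ib 16   # prints 28
printf '%s\n' "$listing" | eligible_nodes ib 24   # prints 6
```

Asking for 16 cores per node keeps all 28 ib nodes in the shared partition eligible, while asking for 24 leaves only the 6 nodes with 24 cores.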

Alternatively, you can use #SBATCH --ntasks= to specify the total number of cores, rather than specifying #SBATCH --nodes=2 and #SBATCH --ntasks-per-node=10. This lets you maximise the number of cores you can use, since Slurm will allocate cores on any suitable node. Whether this is appropriate depends on the type of job you are running and whether balancing the workload per node is important for your simulations.

For example, instead of the script below:

#!/bin/bash

#SBATCH --partition=shared
#SBATCH --job-name="hello"
#SBATCH --nodes=2            #<----- Request 2 nodes
#SBATCH --ntasks-per-node=10 #<----- Request 10 cores per node, so 2 nodes must each have 10 cores available
#SBATCH --time=00:05:00
#SBATCH --constraint=[ib|op]
#SBATCH --mem=2G
#SBATCH --output=helloworld.out

cd $SLURM_SUBMIT_DIR

module load helloworld/1.1

mpirun -np 20 helloworld

echo $SLURM_NODELIST > nodes

You could use:

#!/bin/bash

#SBATCH --partition=shared
#SBATCH --job-name="hello"
#SBATCH --ntasks=20     #<--------  20 cores anywhere they can be found, no "per node" restriction
#SBATCH --time=00:05:00
#SBATCH --constraint=[ib|op]
#SBATCH --mem=2G
#SBATCH --output=helloworld.out


cd $SLURM_SUBMIT_DIR

module load helloworld/1.1

mpirun -np 20 helloworld

echo $SLURM_NODELIST > nodes

Slurm also allows you to specify a range for the number of nodes, e.g. --nodes=2-15. Your job will start as soon as at least two nodes are available; if 10 nodes are available when it starts, you will be allocated 10 nodes.

#!/bin/bash

#SBATCH --partition=shared
#SBATCH --job-name="hello"
#SBATCH --nodes=2-15
#SBATCH --ntasks-per-node=18
#SBATCH --time=00:01:00
#SBATCH --constraint=op
#SBATCH --mem=2G
#SBATCH --output=helloworld.out

cd $SLURM_SUBMIT_DIR

module load helloworld/1.1

NTASKS=$((SLURM_NTASKS_PER_NODE * SLURM_JOB_NUM_NODES))  # total tasks across the allocated nodes
mpirun -np "$NTASKS" helloworld

Eureka accounting and Usage

Job accounting and usage statistics (XDMoD): http://eureka-monitor.eps.surrey.ac.uk/xdmod/

Cluster monitoring (Ganglia): http://eureka-monitor.eps.surrey.ac.uk/ganglia

Eureka quick start

To help users get started quickly, we have created a repository of working submission script examples for a variety of programs currently on the cluster, and hosted them on GitLab.

If you would like to contribute any example scripts yourself, please let us know.

The repository can be accessed at: https://gitlab.surrey.ac.uk/rcs/eureka-examples

If you are logged into Eureka and have set up your GitLab account and configured SSH access to it, you can clone the repository into your space with the command below:

$ git clone git@gitlab.surrey.ac.uk:rcs/eureka-examples.git

The repository contains a variety of scripts:

  1. Example starter scripts (ready to be customised):

    • Interactive jobs

    • Raw Submission scripts

    • Raw Array job scripts

  2. A few ready-made submission scripts and example inputs for specific software:

    • LAMMPS

    • MATLAB

    • CASTEP

    • and more…

  3. An introductory “hello world” exercise to get started with submitting jobs to Eureka.

MATLAB on Eureka

Some notes on using MATLAB specifically in the context of the Eureka cluster.

Note

Unlimited workers for MATLAB Parallel Server are currently set up for MATLAB R2019a, R2019b and R2020a.

Tip

Setting up MATLAB to submit to the cluster via its GUI can be tricky; please contact hpc-support if you need assistance or advice.

Submitting MATLAB jobs

Slurm submission script

MATLAB code can be run directly from the command line, so it can be executed on the cluster from a Slurm submission script. This allows you to bypass the GUI and simply run the code you have developed on the cluster.

Examples can be found in our git repo: https://gitlab.surrey.ac.uk/rcs/eureka-examples
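For orientation, a minimal sketch of such a submission script is written out below. It is not taken from the examples repository: the script name my_analysis, the file name matlab_job.sh and the resource requests are placeholders, while the shared partition and the matlab/R2019b module are the ones named elsewhere on this page. It uses MATLAB's -nodisplay, -nosplash and -r start-up options to run the script without the desktop.

```shell
# Write a single-core MATLAB job script to matlab_job.sh (hypothetical name).
cat > matlab_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=shared
#SBATCH --job-name="matlab"
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --output=matlab.out

cd "$SLURM_SUBMIT_DIR"
module load matlab/R2019b

# Run my_analysis.m (placeholder script name) without the desktop, then exit.
matlab -nodisplay -nosplash -r "my_analysis; exit"
EOF
# Submit on the cluster with: sbatch matlab_job.sh
```

The trailing exit makes MATLAB quit once the script finishes, so the job ends instead of waiting at the MATLAB prompt until the time limit.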

MATLAB GUI

If you access the Eureka cluster via our remotelabs web portal you can access the MATLAB graphical user interface.

Note

See RemoteLabs web portal for more information on how to access the web portal.

When using the MATLAB GUI with remotelabs, you can add a Eureka cluster profile, which lets you run jobs on the cluster and use parallel pools on the cluster for parallel for loops etc. To use this functionality, you need to set up the cluster profile as shown below:

  1. Launch MATLAB by opening a terminal in the Remote Desktop session, loading the MATLAB R2019b module, and then running the matlab command.

    [abc123@vis1(eureka) ~]$ module load matlab/R2019b
    [abc123@vis1(eureka) ~]$ matlab
    MATLAB is selecting SOFTWARE OPENGL rendering.
    
  2. Create a Slurm profile for the cluster:

    From MATLAB’s top task bar: Environment > Parallel > Create and Manage Clusters > Add Cluster Profile > Slurm. This will create a blank Slurm profile called slurmprofile1.

    ../_images/matlab-clustermenu.png

    ../_images/matlab-clusterprofile.png

  3. Select slurmprofile1 on the left, right-click it and rename it to “eureka”. Then select Edit, apply the settings shown below and leave the rest unchanged. When finished, click Done.

    ../_images/cmr.png

    ../_images/nw.png

    ../_images/rt.png

    ../_images/nwr.png

  4. Once the profile is set up, you can validate it by clicking the validation button. If everything is set correctly, all validation tests should pass, as shown below:

    ../_images/con-test.png

Remote job submission

Caution

This method is experimental and not yet fully tested.

It is possible to submit MATLAB jobs directly through MATLAB to run on Eureka. For more information please see the following: https://www.mathworks.com/help/parallel-computing/batch-processing.html.

In order to submit jobs to Eureka you must first do the following:

  1. At the MATLAB prompt, straight after MATLAB has opened, set your fully qualified hostname:

    >> pctconfig('hostname','myhostname.eps.surrey.ac.uk')

  2. Some example submit code:

    • Create a parallel code example to work with:

      >> edit parallel_example
      tic ;
      parfor (i=1:5000) ;
      c(:,i) = eig(rand(1000));
      end ;
      toc ;
      delete(gcp)
      
    • Submit the code to run on Eureka

      >> c = parcluster('Eureka');                 % Select the cluster profile to use
      >> c.AdditionalProperties.time = '24:00:00'; % Set the time limit for jobs
      >> c.AdditionalProperties.constraints = "[ib|op]"; % Set the node constraint for the job
      >> j = c.batch('parallel_example','Pool',10); % Submit the job to Eureka with a pool of 10 workers (cores)
      >> diary(j)                                   % Check the output from the job
      
  3. A second example:

    • Open an interactive parallel pool, run your code, then delete the pool.

      >> parpool('Eureka',10)                     % Start a parallel pool on Eureka with 10 workers (cores)
      >> tic ; parfor (i=1:5000) ; c(:,i) = eig(rand(1000)); end ; toc
      >> delete(gcp)                              % Shut down the current parallel pool
      

Monitoring a MATLAB job

Jobs submitted to the cluster can also be monitored and managed through MATLAB. The monitoring window can be accessed via the top task bar: Environment > Parallel > Monitor Jobs.

../_images/monitor-jobs.png