Announcements and key dates

Key dates

7th May 2024:

Maintenance Tuesday

Eureka2 - Expansion of the parallel scratch filesystem storage capacity. We will be bringing the new storage node online and adding it to the BeeGFS cluster, and we will also be making some Slurm configuration changes. Because of the nature of these changes, we will need to stop currently running jobs and take the cluster offline for the day.

Kara02 - Risk of a short interruption while we carry out some maintenance on the data backup systems.

4th June 2024:

Maintenance Tuesday - Exact maintenance TBC.

Note

If you would like to talk to us about any of the announcements or dates on this page please e-mail itservicedesk@surrey.ac.uk.

Announcements

Surrey Research Compute - Site launch & RSE 1-to-1 consultations

May 2024

SRC Site launch

Surrey Research Compute (SRC) is an interdisciplinary platform created to support Surrey researchers at all career stages in all things computing and data management.

We have just launched a new site containing lots of information on how SRC can help you and your research.

https://surreyac.sharepoint.com/sites/SurreyResearchCompute

Please see the site for more information on available SRC services, training, support and platforms.

RSE Appointments available via bookings site

If you have a research software development issue or require additional support with a software development problem, you can now book a 1-to-1 consultation with a Research Software Engineer via our bookings site.

Eureka2 GPU partition: Now MIG enabled

February 2024

Eureka2 now supports NVIDIA Multi-Instance GPU (MIG). This allows us to split large GPUs into smaller instances, so that smaller jobs can run without holding a whole 80 GB GPU. We now provide a range of different GPU sizes.

MIG

See the Eureka2 documentation for more information on the specifics of using MIG.
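As a rough illustration, a Slurm batch script can request one of the smaller MIG instances rather than a full A100. The partition name (“gpu”) and the MIG profile (“1g.10gb”) used below are assumptions for this sketch; the Eureka2 documentation lists the GPU sizes and GRES names actually configured on the cluster.

```bash
#!/bin/bash
# Minimal sketch of requesting a MIG slice via Slurm.
# The partition name "gpu" and the GRES profile "1g.10gb" are assumptions;
# check the Eureka2 documentation for the names configured on the cluster.
#SBATCH --job-name=mig-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1g.10gb:1   # one small MIG instance instead of a full 80 GB A100
#SBATCH --time=01:00:00

nvidia-smi -L   # list the MIG device allocated to this job
```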

Eureka2 GPU and High Memory Partition

November 2023

Eureka2 GPU partition: early access

Eureka2 now has 6 NVIDIA A100 (80 GB) GPUs deployed in the cluster as part of the new GPU partition. We have conducted testing to ensure a basic level of functionality and have made the partition available for use. We have currently reserved one of the nodes for our own internal testing, but there are now GPU nodes in the partition available for you to use.

Please keep in mind that at this early stage you are essentially “testing” this partition, so you might run into issues. We’d be grateful if you could report any you find to us.

We will be updating these documentation pages with support guides for using the GPU partition over the coming weeks; this content is currently in development.
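In the meantime, a batch script along the following lines illustrates the idea of requesting a full A100 from the new partition. The partition name “gpu” is an assumption for this sketch; please check the actual name (e.g. with sinfo) or wait for the forthcoming guides.

```bash
#!/bin/bash
# Minimal sketch of a job requesting one full A100 in the new GPU partition.
# The partition name "gpu" is an assumption; confirm it with `sinfo` or the
# Eureka2 documentation once the support guides are published.
#SBATCH --job-name=a100-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00

nvidia-smi   # report the GPU visible to the job
```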

Eureka2 High Memory partition

The first high memory node (2 TB RAM) has been added to the Eureka2 cluster, establishing the new High Memory partition. The partition is now available through Slurm and is accepting jobs.
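As a rough sketch, a job targeting the new partition might look like the following. The partition name “highmem” and the memory request are assumptions for illustration; check sinfo or the Eureka2 documentation for the actual partition name.

```bash
#!/bin/bash
# Minimal sketch of a job targeting the High Memory partition.
# The partition name "highmem" is an assumption; confirm the actual
# name with `sinfo` or the Eureka2 documentation.
#SBATCH --job-name=highmem-test
#SBATCH --partition=highmem
#SBATCH --mem=500G        # request 500 GB of the node's 2 TB of RAM
#SBATCH --time=02:00:00

free -h   # report the memory available on the allocated node
```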

As we communicated earlier in the year, the “Bigdata” cluster is being merged into the Eureka2 HPC cluster to serve as its “high memory” partition. This merge will bring 6 new nodes to the cluster with 2 TB of memory each, adding 12 TB of memory to Eureka2. It also brings many other benefits, including consolidation onto a single job scheduler, more efficient support, wider user accessibility, and access to all the benefits of the Eureka2 platform, such as a higher CPU core count (1500+) in the standard memory “shared” partition and the Open OnDemand web interface (https://eureka2-ondemand.surrey.ac.uk).

This merge will happen gradually, i.e. a few nodes at a time, over the coming months, and we will support existing Bigdata users in transitioning their workloads to Eureka2. We aim to complete the merge by the end of 2023 (dependent on research deadlines - papers, conferences, etc. - as we want to cause as little disruption to existing Bigdata users as possible).

Maintenance Tuesdays

July 2023

We are implementing a monthly scheduled maintenance period for our HPC/HTC facilities on the first Tuesday of each month. We are calling this “Maintenance Tuesday”. This will apply to all of our managed clusters/compute pools, such as Eureka2, Eureka, AISURREY, CVSSP Condor, March and Kara.

Regular maintenance windows will allow us to deploy new improvements, features, patches and security updates on a more regular cadence. It will also establish a regular schedule so maintenance periods are easier to anticipate for our users.

It will work as follows:

  • The first Tuesday of every month will be “Maintenance Tuesday”

  • We will always communicate in advance of “Maintenance Tuesday” which of the clusters will be impacted by this month’s maintenance work, and the extent of interruption to service.

    • We will not be making changes to every cluster every month; some months there may be no scheduled maintenance on any cluster.

  • During “Maintenance Tuesday” impacted clusters may be temporarily unavailable for use. We will always take the minimum-disruption approach: if the work can be carried out without interrupting user access or currently running jobs, we will endeavour to do so.

Details of the upcoming Maintenance Tuesday can be found in key dates.

NVIDIA A100 GPUs now available on AISURREY

November 2022

3 x HPE Apollo servers and 3 x Dell PowerEdge XE8545 servers have been added to the AI Surrey Condor Pool.

This adds an additional 32 NVIDIA A100 GPUs to the pool, each with 80 GB of memory and very high memory bandwidth, which is useful for processing larger AI and ML models.

See the AI Surrey documentation for more information on the specifics of the AI@Surrey condor pool.
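As a rough sketch, an HTCondor submit description requesting one of the A100s might look like the following, wrapped here in a small shell snippet. The requirements expression, the CUDADeviceName attribute, and the file names are assumptions for illustration; the AI Surrey documentation describes the pool’s actual conventions.

```bash
# Minimal sketch of requesting an A100 in the AI@Surrey HTCondor pool.
# The requirements expression, attribute name and file names below are
# assumptions for illustration; see the AI Surrey documentation for the
# pool's actual conventions.
cat > a100_job.submit <<'EOF'
universe        = vanilla
executable      = run.sh                                      # placeholder job script
request_gpus    = 1
requirements    = regexp("A100", TARGET.CUDADeviceName)        # assumed attribute name
request_memory  = 32 GB
output          = job.out
error           = job.err
log             = job.log
queue
EOF

condor_submit a100_job.submit
```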

The following article has some interesting benchmarks comparing the performance of the A100 to some other cards, including the NVIDIA RTX 3090.

https://bizon-tech.com/blog/best-gpu-for-deep-learning-rtx-3090-vs-rtx-3080-vs-titan-rtx-vs-rtx-2080-ti

Weka available on AI @ Surrey

May 2022

The Surrey Institute for People-Centred Artificial Intelligence has recently invested in procuring high-performance data storage to complement the ever-growing pool of GPU compute.

The new storage is now available for use on the AI @ Surrey condor nodes.

For information on how you can start using the new storage in your condor jobs please see WEKA.

WekaFS is built specifically for NVMe and is designed to get the very best performance from the NVMe drives. Our early benchmarks have demonstrated that the system is capable of read and write speeds of ~10 GB/s from a single client.

You can learn more about Weka here: https://www.weka.io/