News

Note

If you would like to talk to us about any of the announcements or dates on this page, please e-mail itservicedesk@surrey.ac.uk.

Key dates

Maintenance Tuesdays

7th October:

datamove1 and datamove2 are being rebooted to apply WEKA client updates. This will disconnect any sessions currently running on these nodes.

4th November:

Exact maintenance work to be confirmed.

Announcements

Eureka has been retired

January 2025

The Eureka cluster has now been retired as planned and is no longer available for use. You should now use Eureka2 or AISURREY for your HPC workloads.

Eureka cluster is being de-commissioned 20/12/2024

September 2024

At the end of the calendar year we will be de-commissioning the Eureka cluster (not Eureka2). The cluster hardware has reached the end of its service life, and the contract with our third-party data centre hosting provider is coming to an end.

Eureka remains available for you to use until the end of the year (it will be shut down just before the University closes for the Christmas break, on 20/12/2024). Eureka2 continues to be available, and we intend to keep investing in and growing the capacity of the Eureka2 cluster over the coming years (despite ongoing challenges with data centre hosting capacity at the University).

We can see from the cluster’s utilisation data that many of you have already migrated your workloads to Eureka2. Anyone who has not yet done so will need to migrate by the end of the year. If you need support with migrating your workload to Eureka2, it is available here in our online documentation, or you can raise a Support Ticket or book a Research Computing Virtual Appointment with a member of the team.

Removal of data from Eureka

When the cluster is de-commissioned at the end of the year, the cluster’s dedicated data storage will also be removed.

This means it is vital you take a copy of any data stored on Eureka filesystems that you need to keep long term. This includes any data you may have stored in the temporary parallel_scratch area ( /users/<username>/parallel_scratch ) or under your user home directory ( /users/<username> ).

You should move this data to another storage location, such as a project space on the network file store, SharePoint, or another backed-up storage location. (As a reminder, you should not store data that must persist beyond your time at the University in personal storage areas such as home drives or Microsoft OneDrive, as these are removed after your Surrey account expires.)
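One way to take such a copy is rsync over SSH from Eureka; the hostname and destination path below are illustrative assumptions, so substitute your own:

```shell
# Sketch: copy a directory off Eureka before decommissioning.
# The hostname and destination path are illustrative assumptions.
rsync -avh --progress \
  "<username>@eureka.surrey.ac.uk:/users/<username>/parallel_scratch/myproject/" \
  "/path/to/project-space/myproject/"

# Re-run with --dry-run afterwards; a clean second pass that lists no
# files confirms the copy is complete before Eureka is switched off.
rsync -avh --dry-run \
  "<username>@eureka.surrey.ac.uk:/users/<username>/parallel_scratch/myproject/" \
  "/path/to/project-space/myproject/"
```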

Eureka’s legacy

Eureka was Surrey’s first “free at the point of use” shared cluster; before it, a department without its own compute facilities would have had no access to any HPC at Surrey. It was a successful initiative that has spawned a successor in Eureka2, done much to elevate HPC’s profile across FEPS and the wider University, and provided access to HPC facilities for many research projects that might not otherwise have had the privilege.

Since July 2017 Eureka has:

  • Processed in excess of 1.2 million HPC jobs (we think array batch jobs are being counted as a single job).

  • Provided over 77 million CPU Core hours of compute time to HPC jobs.

  • Served over 300 users across more than 30 different research groups/departments.

Version 2.0 of docs site launches

June 2024

We have launched the revamped version of the docs pages. This new version includes a whole new look and feel (including a dark mode), a more intuitive content structure re-organised to be more beginner-friendly, and new content.

This is just the beginning and lays the groundwork for more content coming soon, particularly focusing on containers on HPC, Open OnDemand, and the AI@Surrey Condor pool’s transition to the SLURM scheduler and its associated software stack.

Surrey Research Compute - Site launch & RSE 1to1 consultations

May 2024

SRC Site launch

Surrey Research Compute (SRC) is an interdisciplinary platform created to support Surrey researchers at all career stages in all things computing and data management.

We have just launched a new site containing lots of information on how SRC can help you and your research.

https://surreyac.sharepoint.com/sites/SurreyResearchComputing

Please see the site for more info on available SRC services, training, support and platforms.

RSE appointments available via bookings site

If you have a research software development issue or require additional support with a software development problem, you can now book a one-to-one consultation with a Research Software Engineer via our bookings site.

Eureka2 GPU partition: Now MIG enabled

February 2024

Eureka2 now utilises Nvidia Multi-Instance GPU (MIG). This allows us to split large GPUs into smaller instances, so smaller jobs can run without holding a whole 80GB GPU. We now provide a range of different GPU sizes.

MIG

See eureka2-documentation for more information on the specifics of using MIG.
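As a rough sketch, requesting a MIG slice from SLURM typically means naming a GPU partition and a MIG profile in the gres string; the partition name and profile below are assumptions, so check the eureka2 documentation for the real values:

```shell
#!/bin/bash
# Sketch of a SLURM batch script requesting one small MIG slice.
# The partition name and MIG profile (1g.10gb) are illustrative assumptions.
#SBATCH --job-name=mig-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1g.10gb:1
#SBATCH --time=00:10:00

# List the GPU (MIG) device(s) SLURM has assigned to this job.
nvidia-smi -L
```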

Eureka2 GPU and High Memory Partition

November 2023

Eureka2 GPU partition: early access

Eureka2 now has 6 Nvidia A100 (80GB) GPUs deployed in the cluster as part of the new GPU partition. We have conducted testing to ensure a basic level of functionality and have made the partition available for use. We have reserved one of the nodes for our own internal testing; the remaining GPU nodes in the partition are available for you to use.

Please keep in mind at this early stage you are essentially “testing” this partition, so you might run into issues. We’d be grateful if you could report any you find to us.

We will be updating these documentation pages with support guides for using the GPU partition in the coming weeks; we are currently developing this content.

Eureka2 High Memory partition

The first high-memory node (2TB RAM) has been added to the Eureka2 cluster, establishing the new High Memory partition. This partition is now available via SLURM and is accepting jobs.

As we communicated earlier in the year, the “Bigdata” cluster is being merged into the Eureka2 HPC cluster to serve as its “high memory” partition. The merge will bring 6 new nodes to the cluster, each with 2TB of memory: 12TB of additional memory for Eureka2. It also brings many other benefits, including consolidation onto a single job scheduler, more efficient support, wider user accessibility, and access to all the advantages of the Eureka2 platform, such as a higher CPU core count (1500+) including the standard-memory “shared” partition, as well as the Open OnDemand web interface (https://eureka2-ondemand.surrey.ac.uk).

This merge will happen gradually, a few nodes at a time, over the coming months, and we will support existing Bigdata users in transitioning their workloads to Eureka2. We aim to complete the merge by the end of 2023 (dependent on research deadlines such as papers and conferences, as we want to cause as little disruption to existing Bigdata users as possible).
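Submitting to the new partition follows the usual SLURM pattern of naming the partition and sizing the memory request; the partition name below is an illustrative assumption, so check `sinfo` or the Eureka2 documentation for the real name:

```shell
#!/bin/bash
# Sketch of a SLURM batch script targeting the high-memory partition.
# The partition name "highmem" is an illustrative assumption.
#SBATCH --job-name=bigmem-example
#SBATCH --partition=highmem
#SBATCH --mem=500G          # nodes have 2TB, so large requests fit
#SBATCH --time=02:00:00

# Replace with your own memory-intensive application.
./my_analysis --input data.h5
```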

Maintenance Tuesdays

July 2023

We are implementing a monthly scheduled maintenance period for our HPC/HTC facilities on the first Tuesday of each month. We are calling this “Maintenance Tuesday”. It will apply to all of our managed clusters/compute pools, such as Eureka2, Eureka, AISURREY, CVSSP Condor, March & Kara.

Regular maintenance windows will allow us to deploy new improvements, features, patches and security updates on a more regular cadence. It will also establish a regular schedule so maintenance periods are easier to anticipate for our users.

It will work as follows:

  • The first Tuesday of every month will be “Maintenance Tuesday”

  • We will always communicate in advance of “Maintenance Tuesday” which of the clusters will be impacted by this month’s maintenance work, and the extent of interruption to service.

    • We will not be making changes to every cluster every month, and some months there may be no scheduled maintenance on any cluster.

  • During “Maintenance Tuesday” impacted clusters may be temporarily unavailable for use. We will always take the minimum-disruption approach: if the work can be carried out without interrupting user access or currently running jobs, we will endeavour to do so.

Details of the upcoming Maintenance Tuesday can be found in key dates.

NVIDIA A100 GPUs now available on AISURREY

November 2022

3 x HPE Apollo servers and 3 x Dell PowerEdge XE8545 servers have been added to the AI Surrey Condor pool.

This adds an additional 32 NVIDIA A100 GPUs to the pool, each with 80GB of memory and very high memory bandwidth, which is useful for processing larger AI and ML models.

See AI Surrey documentation for more information on the specifics of the AI@Surrey condor pool.
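For reference, a minimal HTCondor submit description requesting one of these GPUs might look like the sketch below; the requirements expression and resource values are assumptions, so check the AI@Surrey documentation for the pool’s actual conventions:

```shell
# Sketch of an HTCondor submit file requesting a single A100.
# The requirements expression and resource values are illustrative assumptions.
universe       = vanilla
executable     = train.sh
request_gpus   = 1
request_cpus   = 4
request_memory = 32G
requirements   = (CUDAGlobalMemoryMb >= 70000)   # target the 80GB cards
log            = job.log
output         = job.out
error          = job.err
queue
```

The job is then submitted with `condor_submit` in the usual way.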

The following article has some interesting benchmarks comparing the performance of the A100 to some other cards, including the Nvidia RTX 3090.

https://bizon-tech.com/blog/best-gpu-for-deep-learning-rtx-3090-vs-rtx-3080-vs-titan-rtx-vs-rtx-2080-ti

WEKA available on AI @ Surrey

May 2022

The Surrey Institute for People-Centred Artificial Intelligence has recently invested in procuring high-performance data storage to complement the ever-growing pool of GPU compute.

The new storage is now available for use on the AI @ Surrey condor nodes.

For information on how you can start using the new storage in your condor jobs please see AISurrey data storage - WEKA.

WekaFS is built specifically for NVMe and is designed to get the very best performance from NVMe drives. Our early benchmarks have demonstrated read and write speeds of ~10GB/s from a single client.

You can learn more about WEKA here: https://www.weka.io/