HPC data storage

HPC local storage

HPC clusters at Surrey each have dedicated high performance filesystems. These filesystems are specific to each cluster and accessible to each node of the cluster via the cluster’s private storage network.

Each has a specific function and purpose, which affects how you work with and manage your data.

User Home Directory: /users/<username>

User Home Directory Quota: 30 GB (personal quota, fixed)

User Home Directory backed up?: yes

High performance Parallel Scratch: /parallel_scratch/<username>

High performance Parallel Scratch capacity: 105 TB (filesystem total)

Parallel Scratch filesystem: BeeGFS


Tip

More specific detail on each cluster’s individual local high performance storage can be found on the HPC clusters pages.

User home directory

Personal HPC storage space

Your home directory is your personal dedicated storage area on the HPC cluster. This filesystem is local to the cluster and is separate from the standard university home directory, e.g. /user/HS100/<username>.

This space is where you should store data you want to keep, such as input files and outputs from your simulations, and code you are developing (which should also be pushed to GitLab) together with the executables built from it.

The home filesystem is shared by all users, and on some clusters it is backed up. A quota is usually applied to user home directories to limit the amount of data each user can store in this area. For the specifics of the data storage areas on each cluster, please see the tabs above.

Warning

This space is not for permanent research data storage. Research data requiring long-term storage and protection should be transferred to a project space on the network file store or SharePoint.

HPC high performance scratch

High performance storage for temporary scratch data

Each cluster has a scratch storage space for temporary storage of data generated by your jobs or data that will be processed by your jobs.

The path to these areas can be seen in the tabs at the top of the page. This area is for your scratch data, that is, for heavy parallel input/output (read/write) workloads during simulations. This includes temporary files written by your code during a simulation, large data sets that are read or written before, during or after a simulation, and any excessive output such as writing thousands of lines of data.

This space is a communal storage area and data in this space is NOT BACKED UP! There are currently no limits on how much each user can store in this space, but it is only for temporary data storage when running simulations. Any important data you write to this area and want to keep should be copied back to your home directory /users/<username> or off the cluster to a project space on the network file store or SharePoint.
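As a minimal sketch of the copy-back step described above, the shell function below copies a results directory from scratch into a directory under your home and then checks the copy with diff -r. The function name save_results and the paths in the usage comment are illustrative assumptions, not part of the cluster's tooling.

```shell
# save_results SRC_DIR DEST_PARENT
# Copy SRC_DIR into DEST_PARENT (created if missing) and verify the copy.
# Returns 0 only if the copied tree matches the source.
save_results() {
    src=${1%/}                    # drop one trailing slash so cp copies the directory itself
    dest_parent=$2
    name=$(basename "$src")

    mkdir -p "$dest_parent" || return 1
    cp -Rp "$src" "$dest_parent/" || return 1

    # diff -r exits non-zero if any file differs or exists on one side only.
    diff -r "$src" "$dest_parent/$name" >/dev/null
}

# Usage (abc123 and run_042 are placeholders):
#   save_results /parallel_scratch/abc123/run_042 /users/abc123/results \
#       && echo "copy verified"
```

Only remove the scratch copy once the function reports success. If the destination already contains a directory of the same name, the copy merges into it and the comparison may fail because of the extra files.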

Note

Please practise good citizenship in this space and clean up any temporary files written during simulations that you no longer need once they have finished. Abandoned data in this space may be deleted.
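One way to find candidates for clean-up is to list files that have not been modified for some time. The sketch below uses find with -mtime. The function name list_stale_files and the 30-day threshold in the usage comment are assumptions chosen for illustration, not a cluster policy.

```shell
# list_stale_files DIR DAYS
# Print regular files under DIR that were last modified more than DAYS days ago.
# Nothing is deleted; the list is for review only.
list_stale_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Usage (abc123 is a placeholder username):
#   list_stale_files /parallel_scratch/abc123 30
```

Review the list before removing anything, since a file that has not been modified recently may still be an input that a queued job needs.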

Checking your HPC local storage usage

  • To check how much space you are using in your user home directory, you can use the command:

du -hs </path/to/homedir>

or

ncdu </path/to/homedir>
For example:
[abc123@login1(eureka2) ~]$ du -hs /users/abc123
399M    /users/abc123
  • To check how much space you are using in the scratch directory, you can use the du or ncdu command and provide the full path to your directory:

du -hs </path/to/directory>

or

ncdu </path/to/directory>
For example:
[abc123@login1(eureka2) ~]$ du -hs /parallel_scratch/abc123
30G /parallel_scratch/abc123
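To relate the du figure to a quota, the following sketch reports usage as a percentage. It assumes the 30 GB home quota quoted on this page as its default. The function names are illustrative, and the cluster's own quota accounting may count space differently from du.

```shell
# usage_kb DIR: print the disk usage of DIR in kilobytes (1024-byte units).
usage_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# percent_of_quota USED_KB QUOTA_GB: print the integer percentage of the quota used.
percent_of_quota() {
    awk -v used="$1" -v quota="$2" \
        'BEGIN { printf "%d\n", (used / (quota * 1024 * 1024)) * 100 }'
}

# report_usage DIR [QUOTA_GB]: one-line summary; QUOTA_GB defaults to 30.
report_usage() {
    dir=$1
    quota_gb=${2:-30}
    used=$(usage_kb "$dir")
    pct=$(percent_of_quota "$used" "$quota_gb")
    echo "$dir: ${used} KB used, ${pct}% of ${quota_gb} GB quota"
}

# Usage (abc123 is a placeholder username):
#   report_usage /users/abc123 30
```

Running du over a large directory tree can take a while on a parallel filesystem, so it is best used occasionally rather than inside job scripts.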

Tip

If your simulations are deterministic, you probably only need to keep the input files once you have finished with the generated data or with your project.

Transferring data to/from HPC storage

There are multiple ways to transfer data to and from the cluster you use, using either the command line or GUI tools. The main ones are SCP and rsync or, for Windows users, an SFTP client. These are detailed below.

Note

The hostnames to use when transferring data to or from the respective HPC clusters are:

  • Eureka: eureka.surrey.ac.uk

  • Eureka2: eureka2.surrey.ac.uk

  • Kara02: kara02.eps.surrey.ac.uk

  • AISURREY: datamove1.surrey.ac.uk, datamove2.surrey.ac.uk or datamove3.surrey.ac.uk
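If you transfer data often, an OpenSSH client configuration entry lets you use a short alias in place of the full hostname with ssh, scp and rsync. The sketch below prints such an entry. The function name print_ssh_alias, the alias and the username abc123 are placeholders, and the hostname is taken from the list above.

```shell
# print_ssh_alias ALIAS HOSTNAME USERNAME
# Print an OpenSSH client configuration stanza for one host.
print_ssh_alias() {
    printf 'Host %s\n    HostName %s\n    User %s\n' "$1" "$2" "$3"
}

# Usage: inspect the output first, then append it to your own config file.
#   print_ssh_alias eureka2 eureka2.surrey.ac.uk abc123
#   print_ssh_alias eureka2 eureka2.surrey.ac.uk abc123 >> ~/.ssh/config
```

With such an entry in ~/.ssh/config, a command like scp -r IMPORTANT_INPUT_FILES eureka2:~ resolves the alias to the full hostname and username.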

Open OnDemand

If the cluster has Open OnDemand you can log in and use the “Files” feature to manage your files on the cluster via a web interface.

SCP (Linux/macOS)

To securely copy data to a remote host:

$ scp -r <Directory_to_copy> username@remotehost:/path/to/remotedir/

Examples:

Use scp to copy the IMPORTANT_DATA directory on Eureka to your home directory on a remote machine called “myhost”.
[abc123@login7(eureka) ~]$ scp -r IMPORTANT_DATA abc123@myhost:~
abc123@myhost's password:
DATA_FILE_4.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_3.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_1.txt                               100%    0     0.0KB/s   00:00
Use scp to copy the IMPORTANT_INPUT_FILES directory on “myhost” to your home directory on Eureka.
[abc123@myhost ~]$ scp -r IMPORTANT_INPUT_FILES abc123@eureka:~
abc123@eureka's password:
DATA_FILE_4.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_3.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_1.txt                               100%

rsync (Linux/macOS)

To synchronise a directory from a local machine to a remote machine (or vice versa):

$ rsync -avz <Directory> user@remotehost:/path/to/remotedir/

Examples:

Synchronise the IMPORTANT_INPUT_FILES directory on “myhost” to your home directory on Eureka.
[abc123@myhost ~]$ rsync -avz IMPORTANT_INPUT_FILES abc123@eureka:/users/abc123/
abc123@eureka's password:
sending incremental file list
IMPORTANT_INPUT_FILES/
IMPORTANT_INPUT_FILES/INPUT_FILE_1.in
IMPORTANT_INPUT_FILES/INPUT_FILE_2.in
IMPORTANT_INPUT_FILES/INPUT_FILE_3.in
IMPORTANT_INPUT_FILES/INPUT_FILE_4.in
sent 306 bytes  received 99 bytes  810.00 bytes/sec
total size is 0  speedup is 0.00
Synchronise the IMPORTANT_DATA directory on Eureka to a directory on the host “myhost”.
[abc123@login7(eureka) ~]$ rsync -avz IMPORTANT_DATA abc123@myhost:/user/HS204/abc123/
abc123@myhost's password:
sending incremental file list
IMPORTANT_DATA/
IMPORTANT_DATA/DATA_FILE_1.txt
IMPORTANT_DATA/DATA_FILE_2.txt
IMPORTANT_DATA/DATA_FILE_3.txt
IMPORTANT_DATA/DATA_FILE_4.txt

sent 290 bytes  received 92 bytes  69.45 bytes/sec
total size is 0  speedup is 0.00

Caution

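PLEASE ENSURE you use the trailing “/” exactly as shown above to avoid overwriting any folders/data, as the rsync command is sensitive to the way in which trailing slashes are used. A source path without a trailing “/” copies the directory itself into the destination, whereas a source path ending in “/” copies only the contents of the directory.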

Danger

Rsync is very useful for copying, moving and backing up/synchronising data, but it is easy to make a mistake in a command, and slashes in paths make a big difference.

Read a guide and test your commands to ensure you’re doing exactly what you want:

https://www.thegeekstuff.com/2010/09/rsync-command-examples
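The effect of the trailing slash can be tried safely on throwaway local directories before touching real data. The sketch below shows both forms, plus a preview helper that uses rsync's -n (dry run) option to list what would be transferred without copying anything. The function and directory names are illustrative assumptions.

```shell
# demo_trailing_slash WORKDIR
# Create a small source directory and copy it twice to show the difference.
demo_trailing_slash() {
    work=$1
    mkdir -p "$work/src" "$work/with_dir" "$work/contents_only"
    echo "sample" > "$work/src/a.txt"

    # No trailing slash on the source: the directory itself is copied,
    # giving with_dir/src/a.txt
    rsync -a "$work/src" "$work/with_dir/"

    # Trailing slash on the source: only the contents are copied,
    # giving contents_only/a.txt
    rsync -a "$work/src/" "$work/contents_only/"
}

# preview_sync SRC DEST
# List what rsync would transfer; -n means nothing is actually copied.
preview_sync() {
    rsync -avn "$1" "$2"
}

# Usage (a temporary directory keeps the experiment away from real data):
#   demo_trailing_slash "$(mktemp -d)"
#   preview_sync IMPORTANT_DATA abc123@myhost:/user/HS204/abc123/
```

Running the preview first, and only then repeating the command without -n, is a cheap way to catch a misplaced slash.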

Windows data transfer methods

MobaXterm allows you to transfer files to/from a cluster using a pseudo-terminal, so you can use all the previously mentioned rsync and SCP commands.

MobaXterm also has an SFTP (Secure File Transfer Protocol) function, which allows drag-and-drop transfer of data:

[Image: MobaXterm SFTP panel (../_images/moba_sftp.png)]

Note

Windows-based editors (e.g. Notepad++) may put an extra “carriage return” (^M) character at the end of each line of text.

This will cause problems for most Linux-based applications. To correct this, run the dos2unix utility on each text file you have transferred from Windows to Eureka. An example is shown below:

[abc123@login7(eureka) ~]$ dos2unix example.txt
dos2unix: converting file example.txt to Unix format ...
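To find out which files actually need converting, you can test for carriage returns at line ends first. This is a minimal sketch using grep. The function name has_crlf and the *.inp pattern in the usage comment are assumptions for illustration.

```shell
# has_crlf FILE
# Succeed (exit status 0) if FILE has at least one line ending in a carriage return.
has_crlf() {
    cr=$(printf '\r')             # a literal carriage-return character
    grep -q "${cr}\$" "$1"
}

# Usage: convert only the input files that have Windows line endings.
#   for f in *.inp; do
#       has_crlf "$f" && dos2unix "$f"
#   done
```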

Working with data on HPC local storage

If you need to work with files stored on the HPC local storage there are a number of ways you can do this.

  • If the cluster has Open OnDemand you can log in and use the web interface to work with your files on the cluster’s local storage.

  • If you’re comfortable working in the terminal, you can use an SSH connection and use all the CLI tools you are used to, such as Vim, Emacs, Nano etc.

  • If you prefer a graphical user interface, you can get a desktop session via the RemoteLabs web portal.

See the connecting-hpc page for more information.

Remote development with Visual Studio Code

Microsoft Visual Studio Code has a feature that allows you to connect to a remote filesystem via SSH and work with your files remotely.

https://code.visualstudio.com/docs/remote/ssh

To use this, you will need to have Microsoft Visual Studio Code installed on your workstation.

  • If you have a university managed machine, you can raise an HPC support ticket to request an installation.

  • If you are using a personal or self-managed machine, you can install this yourself.

Note

Visual Studio Code is not installed on the cluster’s login nodes as the application uses a lot of system resources, particularly with multiple instances of the program running inside multiple user sessions.

We encourage you to use this feature, which lets you work with files in the HPC local storage as if they were stored locally on your workstation. This helps to keep development workloads off the clusters and improves usability for all.