3. HPC data storage

3.1. HPC Cluster local storage

Clusters at Surrey usually have two local filesystems. Each has a specific function and purpose, and each will affect how you work with and manage your data.

Cluster Name   Home Directory (NFS)             Parallel Scratch (BeeGFS)   Parallel Scratch Path
Eureka2        30 GB (personal quota - fixed)   70 TB (filesystem total)    /parallel_scratch/<username>
Eureka         7.5 TB (filesystem total)        56 TB (filesystem total)    /users/<username>/parallel_scratch
                                                                            (a symbolic link to /mnt/beegfs/users/<username>)

3.1.1. HPC home directory

NFS Standard Storage

Your home directory /users/<username> is the directory you are taken to when you log in to a cluster. This filesystem is local to the cluster and is separate from the standard university home directory, e.g. /user/HS100/<username>.

This space is where you should store data you want to keep, such as input files and outputs from your simulations, or code you are developing/working on (this should also be pushed to GitLab) and its subsequent executables.

This space is a communal storage area for all users, and all data in this space is backed up. There are currently no limits on how much each user can store in this space. However, this space is not for permanent research data storage; data needs to be taken elsewhere for long-term storage once you have finished with your work/project.
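For example, a minimal sketch of moving a finished project off the cluster to long-term storage (my_project, myhost and the destination path are placeholders; see section 3.2 for transfer methods):

# Copy a finished project directory from your cluster home directory to long-term storage elsewhere
rsync -avz ~/my_project abc123@myhost:/path/to/long_term_storage/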

3.1.2. HPC parallel scratch

BeeGFS Parallel Storage

Your parallel scratch path, as shown in the table above, is the space for your scratch data, i.e. heavy parallel input/output and read/write workloads during simulations. This includes any temporary files that may get written by your code during a simulation, large data sets that need to be read/written before, during or after a simulation, or any excessive output such as writing thousands of lines of data.

This space is a communal storage area and NO data in this space is backed up. There are currently no limits on how much each user can store in this space, but it is only for temporary data storage while running simulations. Any important data you write to this area that you want to keep should be copied back to your home directory /users/<username>.
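For example, a minimal sketch of this workflow inside a job (the my_run, my_simulation and results names are illustrative, and the scratch path shown is the Eureka2 one; use your cluster's path from the table above):

# Run the simulation in your parallel scratch area, where heavy I/O belongs
cd /parallel_scratch/$USER/my_run
./my_simulation > output.log

# Copy the results you want to keep back to your backed-up home directory
cp -r results /users/$USER/my_run_results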

Note

Please practise good citizenship in this space: once your simulations have finished, clean up any temporary files written during them that you no longer need.
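For example, an illustrative clean-up (the *.tmp pattern and 30-day age are assumptions about your own temporary files; review the list printed by the first command before running the second):

# List temporary files older than 30 days in your scratch area
find /parallel_scratch/$USER -name "*.tmp" -mtime +30 -print

# Once you are happy with the list, delete them
find /parallel_scratch/$USER -name "*.tmp" -mtime +30 -delete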

3.1.3. Checking your HPC Local storage usage

  • To check how much space you are using in your home directory /users/<username> you can use the command:

du -hs /users/<username>

or

ncdu /users/<username>
For example:
[abc123@login7(eureka) ~]$ du -hs /users/abc123
399M    /users/abc123/
[abc123@login7(eureka) ~]$
  • To check how much space you are using in your parallel scratch directory (see the paths in the table above) you can use the command:

du -hs /users/<username>/parallel_scratch   # On Eureka
du -hs /parallel_scratch/users/<username>   # On Eureka2

or

ncdu /users/<username>/parallel_scratch   # On Eureka
ncdu /parallel_scratch/users/<username>   # On Eureka2
For example:
[abc123@login7(eureka) ~]$ du -hs /users/abc123/parallel_scratch/
30G /users/abc123/parallel_scratch/
[abc123@login7(eureka) ~]$
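You can also check how full each filesystem is overall (rather than just your own usage) with df, for example:

df -h /users                 # home filesystem size, used space and free space
df -h /parallel_scratch      # parallel scratch filesystem (path varies by cluster, see the table above)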

Tip

If your simulations are deterministic, you can get away with keeping just the input files once you're finished with the generated data/your project.
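For example, an illustrative way of keeping just the inputs (assuming they live in a directory called IMPORTANT_INPUT_FILES, as in the transfer examples below):

# Archive just the input files before cleaning up the generated data
tar -czf inputs_backup.tar.gz IMPORTANT_INPUT_FILES/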

Tip

If you are familiar enough with the home and parallel storage areas, you could create a symbolic link from your home directory to the parallel scratch area for convenience.

[abc123@login(eureka2) ~]$ ln -s /parallel_scratch/$USER ~/parallel_scratch

3.2. Transferring data onto HPC

There are multiple ways to transfer data to and from the cluster you use. The main ways are scp and rsync or, for Windows users, an SFTP client.

Below are the hostnames to use for the respective HPC clusters:

  • Eureka: eureka.surrey.ac.uk

  • Eureka2: eureka2.surrey.ac.uk

  • Kara: kara.ati.surrey.ac.uk

  • Kara02: kara02.eps.surrey.ac.uk
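If you transfer data regularly, an optional convenience is an entry in ~/.ssh/config on your own machine so that you can refer to a cluster by a short name (a sketch, assuming your cluster username is abc123):

# ~/.ssh/config on your workstation
Host eureka2
    HostName eureka2.surrey.ac.uk
    User abc123

With this in place, ssh, scp and rsync can all use eureka2 in place of abc123@eureka2.surrey.ac.uk.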

3.2.1. scp (Linux/Mac)

To securely copy data to a remote host:

$ scp -r <Directory> username@remotehost:/path/to/remotedir/

Examples:

Use scp to copy the IMPORTANT_DATA directory on Eureka to the home directory on a remote machine called "myhost":
[abc123@login7(eureka) ~]$ scp -r IMPORTANT_DATA abc123@myhost:~
abc123@myhost's password:
DATA_FILE_4.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_3.txt                               100%    0     0.0KB/s   00:00
DATA_FILE_1.txt                               100%    0     0.0KB/s   00:00
Use scp to copy the IMPORTANT_INPUT_FILES directory on "myhost" to the home directory on Eureka:
[abc123@myhost ~]$ scp -r IMPORTANT_INPUT_FILES abc123@eureka:~
abc123@eureka's password:
INPUT_FILE_1.in                               100%    0     0.0KB/s   00:00
INPUT_FILE_2.in                               100%    0     0.0KB/s   00:00
INPUT_FILE_3.in                               100%    0     0.0KB/s   00:00
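As an illustrative variation (not one of the guide's examples), you can also copy data straight into your parallel scratch area by giving its full path on the command line:

# LARGE_DATASET is a placeholder; use the scratch path for your cluster from the table in section 3.1
scp -r LARGE_DATASET abc123@eureka2.surrey.ac.uk:/parallel_scratch/abc123/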

3.2.2. rsync (Linux/Mac)

To synchronise a directory from a local machine to a remote machine (or vice versa):

$ rsync -avz <Directory> user@remotehost:/path/to/remotedir/

Examples:

Synchronise the IMPORTANT_INPUT_FILES directory on "myhost" to a directory on Eureka:
[abc123@myhost ~]$ rsync -avz IMPORTANT_INPUT_FILES abc123@eureka:/users/abc123/
abc123@eureka's password:
sending incremental file list
IMPORTANT_INPUT_FILES/
IMPORTANT_INPUT_FILES/INPUT_FILE_1.in
IMPORTANT_INPUT_FILES/INPUT_FILE_2.in
IMPORTANT_INPUT_FILES/INPUT_FILE_3.in
IMPORTANT_INPUT_FILES/INPUT_FILE_4.in
sent 306 bytes  received 99 bytes  810.00 bytes/sec
total size is 0  speedup is 0.00
Synchronise the IMPORTANT_DATA directory on Eureka to a directory on the host "myhost":
[abc123@login7(eureka) ~]$ rsync -avz IMPORTANT_DATA abc123@myhost:/user/HS204/abc123/
abc123@myhost's password:
sending incremental file list
IMPORTANT_DATA/
IMPORTANT_DATA/DATA_FILE_1.txt
IMPORTANT_DATA/DATA_FILE_2.txt
IMPORTANT_DATA/DATA_FILE_3.txt
IMPORTANT_DATA/DATA_FILE_4.txt

sent 290 bytes  received 92 bytes  69.45 bytes/sec
total size is 0  speedup is 0.00

Caution

PLEASE ENSURE you use the trailing "/" exactly as shown above to avoid overwriting any folders/data; the rsync command is sensitive to the way in which trailing slashes are used.
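To illustrate the difference (these commands are illustrative, not from the examples above):

# Without a trailing slash: copies the directory itself, creating /users/abc123/IMPORTANT_DATA
rsync -avz IMPORTANT_DATA abc123@eureka:/users/abc123/

# With a trailing slash: copies only the *contents* of IMPORTANT_DATA straight into /users/abc123/
rsync -avz IMPORTANT_DATA/ abc123@eureka:/users/abc123/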

Note:

rsync is very useful for copying, moving and backing up/synchronising data, but it is very easy to make a mistake in a command; slashes in paths make a big difference.

Read a guide to make sure a command does exactly what you want, and test commands out first:

https://www.thegeekstuff.com/2010/09/rsync-command-examples
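One safe way to test an rsync command is its dry-run flag, which reports what would be transferred without actually copying anything, for example:

# -n (--dry-run) shows what rsync would do without making any changes
rsync -avzn IMPORTANT_DATA abc123@eureka:/users/abc123/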

3.2.3. Windows data transfer methods

MobaXterm allows you to transfer files to/from a cluster using a pseudo-terminal, so you can use all of the previously mentioned rsync and scp commands.

MobaXterm also has an SFTP (Secure File Transfer Protocol) function, which allows for drag-and-drop style transfer of data:

[Screenshot: the MobaXterm SFTP file transfer panel]

Note:

Windows-based editors (e.g. Notepad++) may put an extra "carriage return" (^M) character at the end of each line of text.

This will cause problems for most Linux-based applications. To correct this problem, run the built-in dos2unix utility on each ASCII file you transfer to Eureka from Windows. An example is shown below:

[abc123@login7(eureka) ~]$ dos2unix example.txt
dos2unix: converting file example.txt to Unix format ...
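If you are unsure whether a file needs converting, the file command will report CRLF line endings (the output shown is illustrative):

[abc123@login7(eureka) ~]$ file example.txt
example.txt: ASCII text, with CRLF line terminators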

3.3. Working with data on HPC local storage

If you need to work with files stored on the HPC local storage there are a number of ways you can do this.

  • If you’re comfortable working in the terminal, you can open an SSH connection and use all of the CLI tools you are used to, such as Vim, Emacs, Nano etc.

  • If you prefer a graphical user interface you can get a desktop session via the RemoteLabs web portal.

See connecting-hpc for more information.

3.3.1. Remote Development with Visual Studio Code

Microsoft Visual Studio Code has a feature that allows you to connect to a remote filesystem via SSH and work with your files remotely.

https://code.visualstudio.com/docs/remote/ssh

To use this you will need to have Microsoft Visual Studio Code installed on your workstation.

Note

Visual Studio Code is not installed on the clusters' login nodes, as the application uses a lot of system resources, particularly with multiple instances of the program running inside multiple user sessions.

We encourage you to use this feature, which enables you to work with files in the HPC local storage as if they were stored locally on your workstation. This helps to keep development workloads off the clusters and improves usability for all.