3. Data Management

Research data storage capacity is not infinite, so it is important to manage the data lifecycle and practise good data housekeeping. It is recommended that you regularly review the data you are storing and delete what you no longer need. This will help you manage your data storage costs, avoid hitting your quota limits (which would prevent you from storing any new data until you free up some space), and, on shared storage areas, ensure that you are using space efficiently and not needlessly occupying space needed by others.

This page contains tips and guides to help you manage data stored on University data storage locations such as the HPC clusters or the Network File Store.

3.1. Analysing project space data use

Research project spaces on the Network File Store (/vol/research/…. on Linux) are subject to quotas and have a finite amount of space.

There are programs you can use to gain some insight into what data you have in your project spaces and what is taking up the most space. We have created a simple script, ps_analysis (project space analysis), which uses the ncdu program to scan the files in your project space and provide a simple text-based user interface for analysing your space consumption.

To see what is taking up all the space in your project space:

  • SSH to the datamove1.surrey.ac.uk server

    ssh username@datamove1.surrey.ac.uk
    

    If you are not on campus or connected to the University VPN, you will first need to either connect to the VPN - Global Protect or connect to the SSH Gateways - access.eps.surrey.ac.uk.

  • Navigate to the top level of your project space

    cd /vol/research/[name of the project space]
    
  • Run the ps_analysis (project space analysis) script and follow the instructions on screen.

    ps_analysis
    

    The script will ask you to verify that you are in the correct directory, and it will search for the file ncdu_usage.gz within the top level of your project space. If the file exists, you will be taken to an interactive, text-based user interface (TUI), where you can browse your project space directory structure and see the size of the files within.

    File and directory sizes are shown on the left. You can move through the listing with the up and down arrow keys and press Enter to descend into a directory, allowing you to explore all the way down the directory structure.

    Full ncdu instructions can be found on the manpage https://linux.die.net/man/1/ncdu

    [Screenshot: the ncdu text-based interface]

    If the ncdu_usage.gz file does not exist, the script will ask whether you want to scan your project space and will then generate the ncdu_usage.gz file. (This file has been pre-generated on some of the larger project spaces to save you time.) The scan can take a while, depending on the number of files in your project space. It runs in the background and the script will inform you when it has completed. During that time please do not terminate your terminal session. An equivalent manual ncdu workflow is sketched below.
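    If you prefer to drive ncdu yourself, the following is a minimal sketch of the equivalent export/import workflow, run from the top level of your project space. This is not the ps_analysis script itself; the file name ncdu_usage.gz is reused here purely for illustration.

      # scan the current directory (staying on this filesystem) and save a gzipped export
      ncdu -x -o- . | gzip > ncdu_usage.gz

      # browse a previously saved export without rescanning
      zcat ncdu_usage.gz | ncdu -f-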

Hint

Consider running your terminal sessions in a terminal multiplexer such as screen or tmux, so that your session can persist even if your connection drops.
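For example, a minimal tmux workflow (assuming tmux is available on the server you are logged in to) might look like this:

    tmux new -s scan       # start a named session, then run ps_analysis inside it
    # detach with Ctrl-b then d; the session keeps running on the server
    tmux attach -t scan    # reattach later to check on progress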

Now you can use this information to identify which data you might want to delete, and then, the most important part, actually delete the data. …
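Beyond the ncdu view, a couple of command line sketches can help you spot deletion candidates. These assume GNU find, as provided on a typical Linux server such as datamove1; adjust the thresholds to suit your data.

    # list the 20 largest files under the current directory
    find . -type f -printf '%s\t%p\n' | sort -rn | head -20

    # list files not modified in roughly the last two years
    find . -type f -mtime +730 -print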

3.2. Moving Data between storage locations

There are many different data storage locations. Each compute cluster/facility has its own local high performance storage area (scratch space), which is usually not backed up. There is also the general Network File Store, the Linux Home areas and of course OneDrive, all of which are backed up.

Sometimes you will need to transfer data between these different locations and the method for doing this will vary depending on the source location you are copying/moving from and the destination you want to copy/move to.

There are many possible combinations, so the methods below are not an exhaustive list. However, we will try to document some of the recommended options here and will add to them over time.

3.2.1. datamove1 & datamove2

These servers act as a bridge between the Network File Store (Research Project spaces), Linux Home Directories and AISurrey’s High performance scratch storage (WEKA, /mnt/fast/…..).

These are Linux servers offering SSH login for simple terminal access, so you can use tools such as rsync to move your data around (see the sketch at the end of this subsection).

They offer high speed connections to each storage location, so they are a good choice for moving larger datasets around.

They do not require you to use a Condor job.

access method: ssh

GUI: no

how to access: ssh datamove1.surrey.ac.uk or ssh datamove2.surrey.ac.uk

accessible storage areas:

  • AISurrey high performance scratch (WEKA) - /mnt/fast/datasets, /mnt/fast/nobackup/users, /mnt/fast/nobackup/scratch4weeks

  • Network file store (Project spaces) - /vol/research/.....

  • Linux Home Directories
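For example, a minimal rsync sketch for copying a dataset from a project space to the WEKA scratch area, run on datamove1 or datamove2 after logging in (the project space name and username are illustrative placeholders):

    rsync -avh --progress /vol/research/MyProject/dataset/ /mnt/fast/nobackup/users/username/dataset/

The trailing slashes tell rsync to copy the contents of the source directory into the destination directory rather than creating an extra nested directory inside it.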

3.2.2. Eureka2-ondemand

Eureka2's web interface provides a convenient way to download or upload data to/from the cluster's storage areas via your web browser.

access method: web portal

GUI: yes

how to access: login at https://Eureka2-ondemand.surrey.ac.uk

accessible storage areas:

  • Eureka2 Home directory - /users/....

  • Eureka2 high performance scratch area - /parallel_scratch/....

3.2.3. Transferring data to HPC using command line tools

You can copy data to and from the different HPC clusters' storage locations using standard Linux and Windows tools such as scp, rsync and sftp. More information on how to do this can be found at hpc-data-transfer.
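As a rough sketch (the login hostname below is a placeholder; see hpc-data-transfer for the correct hostnames and any gateway requirements):

    # copy a local results directory to your home directory on the cluster
    scp -r ./results username@<cluster-login-node>:/users/username/

    # rsync can resume interrupted transfers and only copies files that have changed
    rsync -avh --partial ./results/ username@<cluster-login-node>:/users/username/results/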

3.2.4. Condor Jobs

You can use a Condor job to transfer data to and from the storage locations available on the Condor pools. The storage locations available will vary depending on the Condor pool you are using.
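As a minimal, hedged sketch, a submit description that asks Condor to transfer files to and from the execute machine might look like the following (the executable and file names are illustrative; see Files and Condor for the options supported on each pool):

    # write an illustrative submit description and submit it
    cat > transfer_job.submit <<'EOF'
    executable              = process.sh
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input_data.tar.gz
    transfer_output_files   = results.tar.gz
    queue
    EOF
    condor_submit transfer_job.submit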

More info on Condor and file transfers can be found at Files and Condor.

Information about which shared filesystems are available on which Condor pools can be found at Condor pools.