Data management¶
Research data storage capacity is not infinite and it’s important to manage the data lifecycle and practice good data housekeeping. It’s recommended you regularly review the data you’re storing and delete what you no longer need. This will help you to manage your data storage costs, avoid hitting your quota limits (which will prevent you from being able to store any new data until you free up some space), and on shared storage areas, ensure that you are using space efficiently and not needlessly occupying space needed by others.
This page contains tips and guides to help you manage data stored on university storage locations, such as HPC clusters or the network file store.
Analysing project space data use¶
Research project spaces on the network file store (/vol/research/…. on Linux) are subject to quotas and have a finite amount of space.
There are programs you can use to get some insight into what data you have in your project spaces and what is
taking up the most space. We have created a simple script, ps_analysis (project space analysis), which uses the ncdu program
to scan the files in your project space and provide a simple text-based user interface for analysing your space consumption.
To see what is taking up all the space in your project space:
SSH to the datamove1.surrey.ac.uk server:
ssh username@datamove1.surrey.ac.uk
If you are not on campus or on the university VPN, you will first need to connect either to the VPN (Global Protect) or to the SSH gateways (access.eps.surrey.ac.uk).
Navigate to the top level of your project space
cd /vol/research/[name of the project space]
Run the ps_analysis (project space analysis) script and follow the instructions on screen.
ps_analysis
The script will ask you to verify that you are in the correct directory, and it will search for the file ncdu_usage.gz within the top level of your project space. If the file exists, you will be taken to an interactive, text-based user interface (TUI), where you can browse your project space directory structure and see the size of the files within.
File and directory sizes are shown on the left. You can navigate with the up and down keys and press Enter to navigate into directories and explore all the way down the directory structure.
Full ncdu instructions can be found on the man page: https://linux.die.net/man/1/ncdu
If the ncdu_usage.gz file does not exist, the script will ask whether you want to scan your project space and will then generate the ncdu_usage.gz file. (This file has been pre-generated on some of the larger project spaces in an attempt to save you time.) The scan can take some time, depending on the number of files in your project space. It runs in the background, and the script will inform you when it has completed. During that time, please do not terminate your terminal session.
Hint
Consider running your terminal sessions in a terminal multiplexer such as screen or tmux. This allows your session to continue if you disconnect.
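As a minimal sketch of the hint above, the helper below attaches to a named tmux session if one already exists and creates it otherwise. The function name `tmux_session` and the default session name `ps_analysis` are illustrative choices, not something provided on the datamove servers, and the sketch assumes tmux is installed there.

```shell
# Attach to the named tmux session if it exists, otherwise create it.
# The default session name is a hypothetical example.
tmux_session() {
    name=${1:-ps_analysis}
    if tmux has-session -t "$name" 2>/dev/null; then
        # Session already running: reconnect to it.
        tmux attach-session -t "$name"
    else
        # No such session yet: start a new one with this name.
        tmux new-session -s "$name"
    fi
}
```

After running `tmux_session`, start the scan inside the session. Press Ctrl-b then d to detach and leave it running, and call `tmux_session` again later to reconnect.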
You can now use this information to identify data you no longer need and then, most importantly, delete it. …
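If you only want a quick, non-interactive summary, a plain `du` listing can give a rough picture of which top-level directories are largest. This is a sketch using standard tools, not what ps_analysis does internally, and the function name `top_dirs` is a hypothetical helper. On a large project space it may take a long time because it walks every file.

```shell
# Print the total size of DIR followed by its COUNT largest direct children.
# Sizes are in KiB. Usage: top_dirs [DIR] [COUNT]
top_dirs() {
    dir=${1:-.}
    count=${2:-10}
    # -k: report sizes in KiB; -d 1: summarise direct children only.
    # The first output line is the total for DIR itself, hence count + 1.
    du -k -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$((count + 1))"
}
```

For example, running `top_dirs . 5` from the top level of your project space lists the overall total and the five largest entries directly beneath it.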
Moving data between storage locations¶
There are many different data storage locations. Each compute cluster has its own local high performance storage area (scratch space) which is usually not backed up. There is also the general network file store and Linux home areas, which are both backed up.
Sometimes you will need to transfer data between these different locations and the method for doing this will vary depending on the source location you are copying/moving from and the destination you want to copy/move to.
There are many possible combinations, so the methods below are not an exhaustive list. However, we will try to document some of the recommended options. (We will add to these over time.)
datamove servers¶
These servers act as a bridge between the network file store (Research project spaces), Linux Home Directories and AI@Surrey’s high performance scratch storage (WEKA, /mnt/fast/…..).
These are Linux Servers offering SSH login for simple terminal access, so you can use tools such as rsync to move your data around.
They offer high speed connections to each storage location, so are a good choice to move larger datasets around.
They do not require you to use a Condor job.
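As a rough sketch of moving data with rsync once you are logged in to a datamove server, the wrapper below copies the contents of one directory into another. The function name `copy_tree` is hypothetical, and the flags shown are one reasonable choice rather than a recommended site standard.

```shell
# Copy the contents of SRC into DEST with rsync.
# Usage: copy_tree SRC DEST [extra rsync options, e.g. --dry-run]
copy_tree() {
    src=$1
    dest=$2
    shift 2
    # -a: archive mode (recursive, preserves permissions and timestamps)
    # -h: human-readable sizes in any output
    # --partial: keep partially transferred files so a rerun can resume
    # The trailing slash on the source copies its contents, not the directory itself.
    rsync -ah --partial "$@" "${src%/}/" "${dest%/}/"
}
```

Passing `--dry-run` as a third argument makes rsync report what it would do without copying anything, which is a cheap check before a large transfer. Typical use would be with a project space path under /vol/research/ as the source and a directory under /mnt/fast/nobackup/users as the destination; substitute your own paths.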
- access method: ssh
- GUI: no
- how to access: the datamove servers are accessible via SSH. If SSH has a problem connecting to a datamove server, it might be down for maintenance and you should try connecting to a different datamove server. There are currently 4 datamove servers; connect to a server using one of the commands below.
ssh datamove1.surrey.ac.uk
ssh datamove2.surrey.ac.uk
ssh datamove3.surrey.ac.uk
- accessible storage areas:
AISurrey high performance scratch (WEKA) - /mnt/fast/datasets, /mnt/fast/nobackup/users, /mnt/fast/nobackup/scratch4weeks
Network file store (Project spaces) - /vol/research/.....
Linux Home Directories
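The advice to try a different datamove server when one is unavailable can be sketched as a small loop. This is an illustration only: the function names are hypothetical, and the probe assumes `ssh-keyscan` (part of OpenSSH) is available on your machine. It checks whether each server's SSH service answers within 5 seconds, without logging in.

```shell
# Hosts listed in this section.
DATAMOVE_HOSTS="datamove1.surrey.ac.uk datamove2.surrey.ac.uk datamove3.surrey.ac.uk"

# probe_host HOST: succeed if HOST's SSH service returns a host key within 5 seconds.
# No authentication is attempted, so this works with password or key logins.
probe_host() {
    [ -n "$(ssh-keyscan -T 5 "$1" 2>/dev/null)" ]
}

# Print the first host for which probe_host succeeds; return 1 if none does.
first_reachable_datamove() {
    for host in $DATAMOVE_HOSTS; do
        if probe_host "$host"; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    return 1
}
```

You could then connect with `ssh "username@$(first_reachable_datamove)"`. A server that answers the probe can still refuse logins during maintenance, so treat the result as a hint rather than a guarantee.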
Eureka2 OnDemand¶
Eureka2’s web interface provides a convenient way to upload data to, or download data from, the cluster’s storage areas via your web browser.
- access method: web portal
- GUI: yes
- how to access:
- accessible storage areas:
Eureka2 Home directory - /users/....
Eureka2 high performance scratch area - /parallel_scratch/....
Transferring data to HPC using command line tools¶
You can copy data to and from the HPC clusters’ storage locations using standard Linux and Windows tools such as SCP, rsync and SFTP. More information on how to do this can be found at hpc-data-transfer.
Condor jobs¶
You can use a Condor job to transfer data to and from the storage locations available on the Condor pools. The storage locations available will vary depending on the Condor pool you are using.
More information on Condor and file transfers can be found at Files and Condor.
Information about which shared filesystems are available on which Condor pools can be found at Condor pools.
FTP¶
File Transfer Protocol (FTP) allows you to copy or move files and folders between your personal machine and University resources such as the HPC clusters. FTP/SFTP access to the clusters requires the Global Protect VPN to be enabled. Once the VPN is enabled, you can use client software such as the examples below; download whichever suits the operating system of your personal device. Please contact IT Services if you have any issues connecting with these clients.
MobaXterm: https://mobaxterm.mobatek.net/
FileZilla: https://filezilla-project.org/
Cyberduck: https://cyberduck.io/