Mistakes to Avoid
1. Avoid running jobs on the login nodes
When you access Surrey's HPC clusters through SSH, you initially land on the login node, which is shared among all users. This node is intended for job submission and for tasks that use minimal system resources and finish quickly.
Perform any software installation, compilation, or experimentation on one of the compute nodes. For interactive debugging, use the debug partition on Eureka2 or AISurrey, either by submitting a script or by requesting an interactive session via the Slurm scheduler.
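One way to avoid accidentally starting heavy work on the login node is to have the script check that it is running inside a Slurm job. The sketch below relies on the `SLURM_JOB_ID` environment variable, which Slurm exports to the processes it launches. The function names `inside_slurm_job` and `require_compute_node` are hypothetical and not part of any cluster tooling.

```python
import os
import sys


def inside_slurm_job(env=None):
    """Return True if the process appears to run inside a Slurm allocation."""
    env = os.environ if env is None else env
    # Slurm exports SLURM_JOB_ID to batch and interactive job processes.
    return bool(env.get("SLURM_JOB_ID"))


def require_compute_node(env=None):
    """Exit early with a hint when started outside a Slurm job (hypothetical helper)."""
    if not inside_slurm_job(env):
        sys.exit("No SLURM_JOB_ID found: submit this script through Slurm "
                 "instead of running it on the login node.")
```

Calling `require_compute_node()` at the top of a long-running script makes it stop immediately when it is launched directly from a login shell.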
You can learn more about the scheduler here: HPC job scheduler (Slurm).
2. Avoid exceeding your storage limit
Exceeding your allocated storage quota can cause many issues, such as jobs failing to complete or your SSH session slowing down. Monitor your storage regularly, as explained here: Checking your HPC local storage usage.
Should you require additional space, consider deleting unnecessary files or applying for an increase in your home directory or project space storage.
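When deciding which files to delete, it helps to know which directories take up the most space. The following is a minimal sketch, not the clusters' official quota tool: it sums apparent file sizes, which may differ from how the quota is actually accounted. The function names `tree_size` and `largest_entries` are hypothetical.

```python
import os


def tree_size(path):
    """Total apparent size in bytes of files under path (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            total += st.st_size
    return total


def largest_entries(root, top=10):
    """Return (name, bytes) pairs for the largest top-level entries of root."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, tree_size(entry.path)))
            else:
                sizes.append((entry.name, entry.stat(follow_symlinks=False).st_size))
    # Largest first, so the best candidates for clean-up appear at the top.
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

For example, `largest_entries(os.path.expanduser("~"))` lists the ten largest items in your home directory. Walking a large directory tree generates file-system load, so run it sparingly.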
3. Use checkpointing during simulations or model training
Simulation or machine learning training jobs may be disrupted for several reasons, including technical issues with the clusters or required maintenance. Implementing checkpointing for your jobs is therefore good practice: it allows a job to resume from the last saved state after an interruption instead of restarting from scratch.
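The idea can be illustrated with a toy loop that periodically saves its state and resumes from the last saved step. This is a minimal sketch under simple assumptions: the state is small and JSON-serialisable, and the names `save_checkpoint`, `load_checkpoint`, and `run` are hypothetical. Real simulation or training frameworks usually provide their own checkpoint mechanisms, which should be preferred where available.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"step": 0, "total": 0.0}


def run(path, n_steps, checkpoint_every=100):
    """Toy simulation loop: accumulates a sum and checkpoints periodically."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step * 0.5  # placeholder for the real work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always save the final state
    return state
```

If the job is interrupted and resubmitted with the same checkpoint path, `run` continues from the last saved step. Writing to a temporary file and renaming it ensures that an interruption during the save does not corrupt the previous checkpoint.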
4. Do not ask for more resources than you need
An essential preliminary step before submitting your HPC jobs is to benchmark your specific tasks against the processing cores, RAM, and GPUs you intend to use. This process ensures that the resources you request align precisely with your requirements.
Requesting excessive resources can result in longer waiting times for everyone and wastes valuable resources. Consistently underusing requested resources may negatively impact your Fairshare, leading to longer queue times in the future; see Job priority and “Fairshare”.
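As a rough starting point for benchmarking, you can time a small, representative slice of your workload and record its peak memory use before choosing what to request. The sketch below assumes a pure-Python workload. `tracemalloc` only tracks memory allocated through Python's allocator, so native libraries may not be fully counted. The names `profile` and `workload` are hypothetical.

```python
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Run func once; return (result, wall-clock seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def workload(n):
    """Hypothetical stand-in for a small, representative slice of a real job."""
    data = [i * i for i in range(n)]
    return sum(data)
```

Running `profile(workload, 10_000)` at a few input sizes shows how time and memory scale, which helps you estimate the requirements of the full job and add a modest safety margin rather than guessing.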
5. Do not manipulate the Job Scheduler
If your job remains idle or experiences long delays, first check whether your resource requests, partition choice, or job dependencies might be causing the issue. Understanding the job scheduling policies and priority system can help diagnose common problems; see HPC job scheduler (Slurm).
However, avoid attempting to bypass scheduling policies by creating unnecessary processes, exploiting loopholes, or otherwise manipulating the systems. If you are unsure why your job is not starting, contact the Research Computing team for assistance.
6. Do not run serial jobs on more than one CPU core
Single-threaded programs do not benefit from multiple CPU cores, so requesting more cores will not improve performance. Instead, this wastes resources and may lower the priority of future jobs. If you need to run multiple independent serial jobs, consider using an array job instead.
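In an array job, each task typically selects its own piece of work from the `SLURM_ARRAY_TASK_ID` environment variable, which Slurm sets for every array task. The following is a minimal sketch of that pattern. The input file names and the `select_task` helper are hypothetical, and the array indices are assumed to start at 0.

```python
import os


def select_task(items, env=None):
    """Pick the work item for this array task from SLURM_ARRAY_TASK_ID."""
    env = os.environ if env is None else env
    raw = env.get("SLURM_ARRAY_TASK_ID")
    if raw is None:
        raise RuntimeError("SLURM_ARRAY_TASK_ID is not set; run as a Slurm array job")
    index = int(raw)
    if not 0 <= index < len(items):
        raise IndexError(f"array index {index} is outside 0..{len(items) - 1}")
    return items[index]


# Hypothetical input list: one file per independent serial run.
INPUTS = ["case_a.dat", "case_b.dat", "case_c.dat"]
```

Submitted with an array range matching the list (here `--array=0-2`) and a single core per task, each task calls `select_task(INPUTS)` and processes only its own input.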
7. Do not assume identical Docker permissions locally and on the clusters
Running a Docker container on a cluster may differ from running it on a local machine due to permission restrictions. Locally, Docker typically runs with root privileges, allowing it to create directories and files and to perform other administrative tasks.
On a cluster, stricter permission constraints often prevent such operations, which may affect the container’s functionality.
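Code that runs inside a container can be made more robust by checking where it is allowed to write instead of assuming root access. The sketch below illustrates one possible approach and does not describe how the clusters configure containers. The helper `first_writable` and the example path are hypothetical.

```python
import os
import tempfile


def first_writable(candidates):
    """Return the first existing candidate directory the current user can
    write to, falling back to the system temporary directory."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return tempfile.gettempdir()
```

For example, `first_writable(["/opt/app/cache", os.getcwd()])` prefers a hypothetical application cache directory but uses the working directory when the container user cannot write to it.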
8. Do not connect VS Code to the clusters’ login servers
Connecting VS Code to a cluster login server is a bad idea because its server processes consume significant resources, potentially overloading the login nodes and slowing down performance for all users.
Additionally, login nodes are meant for light, interactive tasks. Running persistent development environments on them can lead to system instability.
See VS Code and HPC for different solutions to the problem.