Use a Supercomputer Easily (slurm-emission)
I’ve created the slurm-emission package to make my life easier when I need to run experiments on a supercomputer. It allows me to focus on the Python code and not worry about the SLURM details. In this post, I will explain how to use it.
Intro
Most clusters are managed with SLURM, a workload manager that organizes servers with many nodes and schedules the jobs that users submit at unpredictable times.
To be able to focus on the SLURM details, I’m going to assume i) you know how to create a GitHub repository for your project, ii) your main script is called main.py, and iii) you know how to use conda to create an environment, which we will refer to as my_env. You can use the repo https://github.com/LuCeHe/nice_repo_name if you want to run some quick tests.
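As a concrete sketch (the Python version and the requirements file are just assumptions, adapt them to your project), the setup on the cluster could look like this:
git clone https://github.com/LuCeHe/nice_repo_name
cd nice_repo_name
conda create -n my_env python=3.10   # any recent Python should do
conda activate my_env
pip install -r requirements.txt      # only if the repo provides one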
Entry, Interactive and Submission
In a cluster we usually have 3 modes of access: the entry node, interactive mode and job submission. As soon as we log into the server we land on the entry node, where resources are restricted, so we can only do basic things there. If you want to check that your code runs properly on a GPU with a small model and a few data samples, you want the interactive mode. To access interactive mode use the salloc command
salloc --partition only-one-gpu --account e.johnsson --gres=gpu:1
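Once the allocation is granted you get a shell with the requested resources (on some clusters an extra srun step is needed to actually land on the compute node); from there you can quickly verify the allocation and do a short test run:
hostname        # confirm you are on a compute node, not the entry node
nvidia-smi      # check that a GPU was actually allocated
python main.py  # a quick test run with a small model and few samples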
Once your code runs without errors, you are ready to launch a job. For that you need to create a .sh file and submit it with the sbatch command. In the .sh file you specify the resources you need, the commands that have to run before your code, and the command that runs your code. It will look as follows:
#!/bin/bash
#SBATCH --time=23:00:00
#SBATCH --account=e.johnsson
#SBATCH --partition=only-one-gpu
#SBATCH --gres=gpu:1
# load the modules your cluster provides for SLURM and conda (names are cluster-specific)
module load amd/slurm; module load amd/gcc-8.5.0/miniforge3
# activate the conda environment
source activate base; conda activate my_env
# move to the repository and run the script
cd /home/e.johnsson/work/nice_repo_name
python main.py
If you save it as my_sbatch.sh, then you submit the job with the command
sbatch my_sbatch.sh
That will put your job in the queue, and it will run as soon as the requested resources become available.
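While the job waits or runs, you can keep an eye on it with the standard SLURM commands (the job id below is just an example):
squeue -u $USER        # list your pending and running jobs
scancel 123456         # cancel a job by its id
sacct -j 123456        # check the state of a running or finished job
cat slurm-123456.out   # by default sbatch writes stdout/stderr here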
Automate .sh creation
As you can see, creating a .sh file for each submission can be tedious, so some automation that lets you work only in Python comes in handy. To set it up, install
pip install slurm-emission
and write the following inside a submit.py file:
from slurm_emission import run_experiments

# where the repository lives on the cluster and which script to run
script_path = '/home/e.johnsson/work/nice_repo_name'
script_name = 'main.py'

# resources requested for every job, same flags as in the manual .sh
sbatch_args = {
    'partition': 'only-one-gpu',
    'gres': 'gpu:1',
    'account': 'e.johnsson',
    'time': '23:00:00',
}

experiment_id = 'llms'

# every combination of these values becomes one job
experiments = [{
    'seed': list(range(4)), 'epochs': [300],
    'model': ['transformer', 'lstm'], 'dataset': ['cifar', 'mnist']
}]

# commands to run before the python call, as in the manual .sh
load_modules = 'module load amd/slurm; module load amd/gcc-8.5.0/miniforge3'
activate_env = 'source activate base; conda activate my_env'
py_location = f'cd {script_path}'
bash_prelines = f'{load_modules}\n{activate_env}\n{py_location}'

run_experiments(
    experiments,
    init_command=f'python {script_name} ',
    sbatch_args=sbatch_args,
    bash_prelines=bash_prelines,
    id=experiment_id,
)
This Python script generates the same kind of .sh file we wrote manually, but on its own, and it submits one job for each of the 16 possible combinations of seeds, models and datasets.
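I have not checked the exact command format that slurm-emission generates, but assuming each combination is passed as command-line flags, one of the 16 submitted commands would look roughly like
python main.py --seed 0 --epochs 300 --model transformer --dataset cifar
so make sure main.py parses these arguments, for example with argparse.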
Now it’s your turn, give it a try ;)