
High Performance Computing: RUCC SLURM


Overview

The Simple Linux Utility for Resource Management, or SLURM, is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.

SLURM has three key functions. First, it allocates access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

More information is available on the official SLURM website on the Documentation and FAQ pages.

From the user's perspective, SLURM manages 4 different entities (see also Figure 1): nodes (the compute resources), partitions (logical groups of nodes), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job).
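
To get a quick overview of these entities on the cluster, you can use the standard SLURM query commands. A minimal sketch (run from the headnode):

sinfo                # list partitions and the state of their nodes
squeue               # list all jobs currently queued or running
squeue -u $USER      # list only your own jobs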

2. Submitting jobs
 

SLURM has a lot of flexibility when it comes to resource management and launching processes, but about 90 percent of all use cases for launching jobs can be covered by different combinations of a few basic concepts: the job allocation, the job step, and the task, all of which are illustrated in the examples below.
 
There are three command line tools for allocating resources from the scheduler and carrying out your computations: sbatch, srun, and salloc. All of these tools have different use cases, so please refer to the Examples section before you start using them.
The main purpose of these tools is to let you create a job allocation and manage your tasks within this allocation.
An sbatch script itself is always executed in a single instance on only one of the allocated compute nodes, but you can launch parallel executables from inside it (e.g. with srun).
 
In almost all cases we recommend that you use the sbatch command for submitting your jobs. This way you can specify all the required resources for your job, as well as all the commands and steps that are actually run, in a clear and transparent way. This makes debugging jobs much easier and also leaves behind a record you can refer to when, for example, you want to run a similar job a few months later.
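
For completeness: the third tool, salloc, is mainly useful for short interactive sessions. A minimal sketch (the resource flags here are chosen purely for illustration):

# Request an interactive allocation of 1 node for 30 minutes
salloc -N 1 -t 00:30:00

# Inside the allocation, launch tasks on the allocated node(s) with srun
srun hostname

# Release the allocation when you are done
exit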

2.1 Examples

All of the examples here are run from the headnode of our RUCC cluster (rucc.rowan.edu).

2.1.1 Run a Few Test Commands on a Compute Node

Here we are running a very simple sbatch script that prints out some information about the compute node it is executed on:

 

[marzin@rucc-headnode demo]$ pwd
/csm_data/demo
[marzin@rucc-headnode demo]$ ls -la
total 28
drwxr-xr-x 2 root root  131 Oct 11 11:17 .
drwxr-xr-x 8 root root  122 Oct 11 11:17 ..
-rw-r--r-- 1 root root  866 Oct 11 11:17 hello_world.sh
-rwxr-xr-x 1 root root 9026 Oct 11 11:17 hybridhello
-rw-r--r-- 1 root root  737 Oct 11 11:17 hybridhello.c
-rw-r--r-- 1 root root  351 Oct 11 11:17 nodeinfo.sh
-rw-r--r-- 1 root root  367 Oct 11 11:17 parallel_uname.sh
 
[marzin@rucc-headnode demo]$ cat nodeinfo.sh
#!/bin/bash
#The name of the job is test_job
#SBATCH -J test_job
 
#The job requires 1 compute node
#SBATCH -N 1
 
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
 
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
 
#These commands are run on one of the nodes allocated to the job (batch node)
 
uname -a
pwd
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch nodeinfo.sh
Submitted batch job 111
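
While the job is pending or running, you can check its state with squeue, for example (111 is the job id from the submission above):

squeue -j 111        # show only this job
squeue -u $USER      # show all of your own jobs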

 

After the script has been executed on a compute node, its output (both stdout and stderr) is written to the current working directory, in a file that is named after the job id:

 

[marzin@rucc-headnode demo]$ ls -la
total 20
drwxr-xr-x  2 marzin domain users 4096 Mar  3 22:46 .
drwx------ 20 marzin domain users 4096 Mar  3 22:29 ..
-rw-r--r--  1 marzin domain users  351 Mar  3 22:24 nodeinfo.sh
-rw-r--r--  1 marzin domain users  124 Mar  3 22:46 slurm-111.out
  
[marzin@rucc-headnode demo]$ cat slurm-111.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
/home/marzin/demo

 

From the output of the script we can see that it executed on the compute node csm-com-001 and that its working directory was exactly the same as the current working directory on the headnode at the time the script was submitted.
 
To change the working directory of the script, you could for example use the #SBATCH -D <other directory> directive or the -D/--chdir command-line flags. Another option is to simply cd to some other directory at the beginning of your script.
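
As a quick sketch of the first option (the directory name below is made up for illustration), the directive goes alongside the other #SBATCH lines at the top of the script:

#!/bin/bash
#SBATCH -J test_job
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:30:00

#Run the script with this (illustrative) directory as the working directory
#SBATCH -D /csm_data/demo/results

pwd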
 

2.1.2 Running Jobs in Parallel

Let's say that we would like to run the "uname" command on 4 different compute nodes. Naturally, we write an sbatch script:

 

[marzin@rucc-headnode demo]$ cat parallel_uname.sh
#!/bin/bash
#The name of the job is parallel_uname
#SBATCH -J parallel_uname
  
#The job requires 4 compute nodes
#SBATCH -N 4
  
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
  
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
  
#These commands are run on one of the nodes allocated to the job (batch node)
uname -a
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch parallel_uname.sh
Submitted batch job 112
  
[marzin@rucc-headnode demo]$ cat slurm-112.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

Sadly, when looking at the output, we see that our script executed on only one compute node and in only one instance. This is always the case with sbatch scripts! But we did specify that we required 4 nodes and each node should run one task - why was the "uname" command run only once?
 
You can think of an sbatch script as an outline of your compute job that describes the required resources and the actual steps of your computation, e.g. setting up input data, running the compute tasks, collecting output, etc. The script above actually did allocate a CPU on 4 different compute nodes (4 CPUs total) that we could have used to run 4 instances of uname.
 
In order to run our 4 instances (or tasks in SLURM lingo) we have to use the "srun" tool. After modifying our script (just added "srun" before "uname"):

 

[marzin@rucc-headnode demo]$ cat parallel_uname.sh
#!/bin/bash
#The name of the job is parallel_uname
#SBATCH -J parallel_uname
  
#The job requires 4 compute nodes
#SBATCH -N 4
  
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
  
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
  
#Here we call srun to launch the uname command in parallel
srun uname -a
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch parallel_uname.sh
Submitted batch job 113
  
[marzin@rucc-headnode demo]$ cat slurm-113.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-003 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-002 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-004 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

Now we get the output we were expecting: "srun" ran 4 different instances of uname on 4 different compute nodes. An invocation of srun is called a "job step" in SLURM lingo.
By default, when "srun" is used within a job allocation, it inherits the configuration of the entire job (e.g. the settings specified in the sbatch script).
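
You can also narrow these inherited settings for individual job steps by passing flags to srun. A sketch, assuming the same 4-node allocation as above:

#Run a single instance on just one of the allocated nodes
srun -N 1 -n 1 uname -a

#Run one instance per allocated node (the inherited default)
srun uname -a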
 
The "srun" tool can also be used outside an already existing job allocation to quickly run some executable on the cluster. In this case it creates a job allocation automatically.

 

[marzin@rucc-headnode demo]$ srun -N 4 --ntasks-per-node=1 uname -a
Linux csm-com-002 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-004 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-003 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

In many cases it's actually not important to specify the number of nodes; you might just want to run, say, 50 instances of your executable and not really care where they end up:

 

[marzin@rucc-headnode demo]$ srun --ntasks=50 hostname
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-