
High Performance Computing: RUCC SLURM


Overview

The Simple Linux Utility for Resource Management, or SLURM, is an open source, fault-tolerant and highly scalable cluster management and job scheduling system for large and small Linux clusters.

SLURM has three key functions. First, it allocates access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

More information is available on the official SLURM website on the Documentation and FAQ pages.

From the user's perspective, SLURM manages 4 different entities (see also Figure 1): nodes (the compute resources), partitions (logical groups of nodes), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job).
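
To get a quick overview of these entities on the cluster, you can use the standard SLURM query commands. A minimal sketch (run from the headnode):

sinfo                # list partitions and the state of their nodes
squeue               # list all jobs currently queued or running
squeue -u $USER      # list only your own jobs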

2. Submitting jobs
 

SLURM has a lot of flexibility when it comes to resource management and launching processes, but about 90 percent of all use cases for launching jobs can be covered by different combinations of a few basic concepts: the job allocation, the job step, and the task, all of which are illustrated in the examples below.
 
There are three command line tools for allocating resources from the scheduler and carrying out your computations: sbatch, srun, and salloc. All of these tools have different use cases, so please refer to the Examples section before you start using them.
The main purpose of these tools is to let you create a job allocation and manage your tasks within this allocation.
An sbatch script itself is always executed in a single instance on only one of the allocated compute nodes, but you can launch parallel executables from inside it (e.g. with srun).
 
In almost all cases we recommend that you use the sbatch command for submitting your jobs. This way you can specify all the required resources for your job, as well as all the commands and steps that are actually run, in a clear and transparent way. This makes debugging jobs much easier and also leaves behind a record you can refer to when, for example, you want to run a similar job a few months later.
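
For completeness: the third tool, salloc, is mainly useful for short interactive sessions. A minimal sketch (the resource flags here are chosen purely for illustration):

# Request an interactive allocation of 1 node for 30 minutes
salloc -N 1 -t 00:30:00

# Inside the allocation, launch tasks on the allocated node(s) with srun
srun hostname

# Release the allocation when you are done
exit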

2.1 Examples

All of the examples here are run from the headnode of our RUCC cluster (rucc.rowan.edu).

2.1.1 Run a Few Test Commands on a Compute Node

Here we are running a very simple sbatch script that prints out some information about the compute node it is executed on:

 

[marzin@rucc-headnode demo]$ pwd
/csm_data/demo
[marzin@rucc-headnode demo]$ ls -la
total 28
drwxr-xr-x 2 root root  131 Oct 11 11:17 .
drwxr-xr-x 8 root root  122 Oct 11 11:17 ..
-rw-r--r-- 1 root root  866 Oct 11 11:17 hello_world.sh
-rwxr-xr-x 1 root root 9026 Oct 11 11:17 hybridhello
-rw-r--r-- 1 root root  737 Oct 11 11:17 hybridhello.c
-rw-r--r-- 1 root root  351 Oct 11 11:17 nodeinfo.sh
-rw-r--r-- 1 root root  367 Oct 11 11:17 parallel_uname.sh
 
[marzin@rucc-headnode demo]$ cat nodeinfo.sh
#!/bin/bash
#The name of the job is test_job
#SBATCH -J test_job
 
#The job requires 1 compute node
#SBATCH -N 1
 
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
 
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
 
#These commands are run on one of the nodes allocated to the job (batch node)
 
uname -a
pwd
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch nodeinfo.sh
Submitted batch job 111
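
While the job is pending or running, you can check its state with squeue, for example (111 is the job id from the submission above):

squeue -j 111        # show only this job
squeue -u $USER      # show all of your own jobs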

 

After the script has been executed on a compute node, its output (both stdout and stderr) is written to the current working directory, in a file that is named after the job id:

 

[marzin@rucc-headnode demo]$ ls -la
total 20
drwxr-xr-x  2 marzin domain users 4096 Mar  3 22:46 .
drwx------ 20 marzin domain users 4096 Mar  3 22:29 ..
-rw-r--r--  1 marzin domain users  351 Mar  3 22:24 nodeinfo.sh
-rw-r--r--  1 marzin domain users  124 Mar  3 22:46 slurm-111.out
  
[marzin@rucc-headnode demo]$ cat slurm-111.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
/home/marzin/demo

 

From the output of the script we can see that it executed on the compute node csm-com-001 and that its working directory was exactly the same as the current working directory on the headnode at the time the script was submitted.
 
To change the working directory of the script, you could for example use the #SBATCH -D <other directory> directive or the -D/--chdir command-line flags. Another option is to simply cd to some other directory at the beginning of your script.
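
As a quick sketch of the first option (the directory name below is made up for illustration), the directive goes alongside the other #SBATCH lines at the top of the script:

#!/bin/bash
#SBATCH -J test_job
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH -t 00:30:00

#Run the script with this (illustrative) directory as the working directory
#SBATCH -D /csm_data/demo/results

pwd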
 

2.1.2 Running Jobs in Parallel

Let's say that we would like to run the "uname" command on 4 different compute nodes. Naturally, we write an sbatch script:

 

[marzin@rucc-headnode demo]$ cat parallel_uname.sh
#!/bin/bash
#The name of the job is parallel_uname
#SBATCH -J parallel_uname
  
#The job requires 4 compute nodes
#SBATCH -N 4
  
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
  
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
  
#These commands are run on one of the nodes allocated to the job (batch node)
uname -a
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch parallel_uname.sh
Submitted batch job 112
  
[marzin@rucc-headnode demo]$ cat slurm-112.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

Sadly, when looking at the output, we see that our script executed on only one compute node and in only one instance. This is always the case with sbatch scripts! But we did specify that we required 4 nodes and each node should run one task - why was the "uname" command run only once?
 
You can think of an sbatch script as an outline of your compute job that describes the required resources and the actual steps of your computation, e.g. setting up input data, running the compute tasks, collecting output, etc. The script above actually did allocate a CPU on 4 different compute nodes (4 CPUs total) that we could have used to run 4 instances of uname.
 
In order to run our 4 instances (or tasks in SLURM lingo) we have to use the "srun" tool. After modifying our script (just added "srun" before "uname"):

 

[marzin@rucc-headnode demo]$ cat parallel_uname.sh
#!/bin/bash
#The name of the job is parallel_uname
#SBATCH -J parallel_uname
  
#The job requires 4 compute nodes
#SBATCH -N 4
  
#The job requires 1 task per node
#SBATCH --ntasks-per-node=1
  
#The maximum walltime of the job is a half hour
#SBATCH -t 00:30:00
  
#Here we call srun to launch the uname command in parallel
srun uname -a
sleep 30
  
[marzin@rucc-headnode demo]$ sbatch parallel_uname.sh
Submitted batch job 113
  
[marzin@rucc-headnode demo]$ cat slurm-113.out
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-003 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-002 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-004 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

Now we get the output we were expecting: "srun" ran 4 different instances of uname on 4 different compute nodes. An invocation of srun is called a "job step" in SLURM lingo.
By default, when "srun" is used within a job allocation, it inherits the configuration of the entire job (e.g. the settings specified in the sbatch script).
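
You can also narrow these inherited settings for individual job steps by passing flags to srun. A sketch, assuming the same 4-node allocation as above:

#Run a single instance on just one of the allocated nodes
srun -N 1 -n 1 uname -a

#Run one instance per allocated node (the inherited default)
srun uname -a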
 
The "srun" tool can also be used outside an already existing job allocation to quickly run some executable on the cluster. In this case it creates a job allocation automatically.

 

[marzin@rucc-headnode demo]$ srun -N 4 --ntasks-per-node=1 uname -a
Linux csm-com-002 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-004 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-003 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Linux csm-com-001 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

 

In many cases it's actually not important to specify the number of nodes; you might just want to run, say, 50 instances of your executable and not really care where they end up:

 

[marzin@rucc-headnode demo]$ srun --ntasks=50 hostname
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-001
csm-com-