High Performance Computing Service

SLURM Quick Start

Overview

The HPCS is committed to moving from Torque/Maui to SLURM at the beginning of the Cambridge quarter on 1st February 2014.

This is a major change which will affect all users of Darwin and Wilkes (note that Wilkes has already been running SLURM since December 2013 but this instantiation will be rebuilt).

This page is intended as a quick and dirty guide to migrating your PBS-style batch workload to SLURM. Please note that any workload still outstanding in Torque/Maui ("PBS") on 31st January will need to be resubmitted to SLURM.

Each Darwin project in PBS will be replaced on 31st January by a new project in SLURM with the same name, with its current usage transferred. The PBS queues will then close, and all new jobs must be submitted to SLURM. Jobs that are still running under PBS at the changeover will be allowed to complete.

Resources for use on Wilkes will continue to be held in separate projects which have the string "-GPU" appended to the usual name. Note that project names in SLURM are stored in a case-insensitive way.

In the majority of cases migrating a job submission script from PBS to SLURM should be straightforward - the default MPI implementation (Intel MPI) has internal support for both schedulers and has already been tested on Wilkes. However, the OSC mpiexec (useful for starting multiple small tasks within a larger batch job) does not work with SLURM, so scripts that rely on it will need to be reimplemented using srun instead, as sketched below. Support can assist with this on a case-by-case basis. A small SLURM test bed continues to be available on which submissions can be tested.
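
As a purely illustrative sketch of the srun-based replacement, a pattern along the following lines runs several small tasks concurrently within a single job allocation (task_a and task_b are hypothetical executables):

#! Give each job step its own CPUs so the steps run side by side:
srun --exclusive -n 1 ./task_a &
srun --exclusive -n 1 ./task_b &
wait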

Environment

In order to access SLURM commands and man pages, do

module load slurm

Note that this will be loaded by default at login after the production service switches to SLURM.

Submission of jobs

Sample submission scripts

As in PBS, a SLURM batch job is a shell script containing the set of commands to run, together with the job's resource requirements coded as specially formatted shell comments at the top of the script. The script is then submitted to SLURM using the sbatch command instead of qsub. The most obvious difference between the two types of submission script is that SLURM directives begin with '#SBATCH' whereas PBS directives begin with '#PBS'.
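
As a rough (and not exhaustive) guide, common PBS directives map to SLURM directives as follows:

#PBS -A myproject          ->  #SBATCH -A myproject
#PBS -l walltime=02:00:00  ->  #SBATCH --time=02:00:00
#PBS -l nodes=2:ppn=16     ->  #SBATCH --nodes=2 plus #SBATCH --ntasks=32
#PBS -N myjobname          ->  #SBATCH -J myjobname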

Please copy and edit the sample submission scripts that can be found under

/usr/local/Cluster-Docs/SLURM

where slurm_submit.darwin is the appropriate choice for Darwin, and slurm_submit.wilkes is suitable for Wilkes. The main directives to modify are

#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00

In particular, the name of the project is required for the job to be scheduled (use the command mybalance to check what this is for you if in doubt). Allocation occurs in whole nodes, and charging is reported in units of core hours according to

number of nodes x 16 x total walltime in hours

for Darwin, and

number of nodes x 12 x total walltime in hours

for Wilkes (1 GPU hour is equivalent to 6 core hours since each node has 2 GPUs and 12 CPU cores).
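
For example, a 2-node Darwin job that runs for 3 hours is charged 2 x 16 x 3 = 96 core hours, while a 2-node Wilkes job of the same length is charged 2 x 12 x 3 = 72 core hours.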

The --ntasks setting does not affect charging but is taken by the MPI launch logic to be the total number of MPI tasks that should be started. Usually this should be set to the number of nodes times the number of cores in the node (which is 16 for Darwin and 12 for Wilkes). If for memory or other reasons this should be reduced, make that change here.
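
Purely as an illustrative sketch (the full sample scripts above contain additional site-specific settings and should be preferred), a stripped-down Darwin submission script combining these directives might look as follows, where ./my_mpi_app is a placeholder for your own executable:

#!/bin/bash
#SBATCH -A CHANGEME
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

#! Intel MPI detects the SLURM allocation automatically;
#! SLURM_NTASKS is set by sbatch to the --ntasks value above.
mpirun -np $SLURM_NTASKS ./my_mpi_app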

Submitting the job to the queuing system

The command sbatch is used to submit jobs, e.g.

sbatch submission_script

The command will return a unique job identifier, which is used to query and control the job and to identify output. Please see man sbatch on Darwin for all the available options.

The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2:

sbatch --array=1-7:2 -A STARS-SL2 submission_script
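
Inside the submission script the index of the current array element is available as the environment variable SLURM_ARRAY_TASK_ID, which can be used, for example, to select an input file (my_program and input_1.dat etc. are hypothetical names):

./my_program input_${SLURM_ARRAY_TASK_ID}.dat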

Monitoring jobs

Please see the man pages for more information on these commands.

List all jobs in SLURM

squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           18974_7 sandybrid novoalig    abc123  R    1:47:23      2 sand-1-[1-2]
           18970_2 sandybrid novoalig    abc123  R    1:47:53      2 sand-1-[22-23]
           18970_3 sandybrid novoalig    abc123  R    1:47:53      2 sand-1-[8,18]
           18974_6 sandybrid novoalig    abc123  R    1:47:53      2 sand-1-[31,33]
...
             18821     tesla     prop    spqr45 PD       0:00      4 (QOSResourceLimit)

In the above, the partition names sandybridge and tesla indicate that the jobs are destined for Darwin and Wilkes respectively. The state R means that the job is running, while PD indicates that the job is pending (still queued), in this case because of the restrictions of the QOS (determined by the service level).

The jobids reported as mmmm_n are elements of an array job, where mmmm is the SLURM_ARRAY_JOB_ID common to all jobs in the array, and n is the array index (SLURM_ARRAY_TASK_ID).

List all jobs owned by abc123

squeue -u abc123

List all jobs owned by project projectname

squeue -A projectname

See man squeue for other options and combinations of options.

Examine job jobid in detail

scontrol show job <jobid>

Deleting jobs

Cancel job jobid

scancel <jobid>
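
scancel also accepts other selectors, for example to cancel all of your own jobs, or a single element of an array job:

scancel -u abc123
scancel 18970_3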

Interactive Jobs

The following command will request two Darwin nodes interactively for 2 hours, charged to the project MYPROJECT:

sintr -p sandybridge -A MYPROJECT -N2 -t 2:0:0

This command will pause until the job starts, then create a screen terminal running on the first node allocated (cf man screen). X windows applications started inside this terminal should display properly. Within the screen session, new terminals can be started with control-a c, with navigation between the different terminals being accomplished with control-a n and control-a p. Also srun can be used to start processes on any of the nodes in the job allocation, and SLURM-aware MPI implementations will use this to launch remote processes on the allocated nodes without the need to give them explicit host lists.
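
As a quick (purely illustrative) check, the following, run from within the screen session, starts one task on each of the two allocated nodes and prints their names:

srun -N2 -n2 hostname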

Global view

Show activity across all nodes

sinfo

or

sview

Accounting Commands

The following commands are intended to behave similarly to the commands of the same names from Gold. They are wrappers around the underlying SLURM commands sacct and sreport which are much more powerful.
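
If the wrappers do not provide what you need, the underlying commands can be used directly; for example, to show standard accounting fields for a single job:

sacct -j <jobid> --format=JobID,JobName,Account,Partition,Elapsed,NCPUS,State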

How many core hours available do I have?

mybalance
User            Usage |        Account     Usage | Account Limit Available (CPU hrs)
----------  --------- + -------------- --------- + ------------- ---------
abc123             18 |          STARS       171 |       100,000    99,829
abc123             18 |      STARS-SL2        35 |       101,000   100,965
abc123            925 |         BLACKH    10,634 |       166,667   156,033

This shows, for each project of which the user is a member, how many core hours have been used, how many were awarded, and how many remain available.

How many core hours does some other project or user have?

gbalance -p HALOS
User           Usage |   Account     Usage | Account Limit Available (CPU hrs)
---------- --------- + --------- --------- + ------------- ---------

pq345              0 |     HALOS   317,656 |       600,000   282,344
xyz10         11,880 |     HALOS   317,656 |       600,000   282,344
...

This outputs the total usage in core hours accumulated to date for the project, the total awarded and total remaining available (i.e. to all members). It also prints the component of the total usage due to each member.

I would like a listing of all jobs I have submitted through a certain project and between certain times

gstatement -p HALOS  -u xyz10 -s "2014-01-01-00:00:00" -e "2014-01-20-00:00:00" 
       JobID      User   Account  JobName  Partition                 End      NCPUS CPUTimeRAW ExitCode      State 
------------ --------- ---------- -------- ---------- ------------------- ---------- ---------- -------- ---------- 
14505            xyz10    halos       help sandybrid+ 2014-01-07T12:59:40         16         32      0:9  COMPLETED 
14506            xyz10    halos       help sandybrid+ 2014-01-07T13:00:11         16         48      2:0     FAILED 
14507            xyz10    halos       bash sandybrid+ 2014-01-07T13:05:20         16       4128      0:0 CANCELLED+ 
14541            xyz10    halos       bash sandybrid+ 2014-01-07T15:31:44         32      85216      0:9  COMPLETED 
14560            xyz10    halos       bash sandybrid+ 2014-01-07T16:19:36         32      89824      0:0  COMPLETED 
14569            xyz10    halos       bash sandybrid+ 2014-01-07T18:19:47         80     576240      0:1    TIMEOUT 
14598            xyz10    halos       bash sandybrid+ 2014-01-07T19:27:54         80     324080      0:0  COMPLETED 
15619            xyz10    halos    test.sh sandybrid+ 2014-01-09T16:10:35         16         64      0:0  COMPLETED 
...

This lists the charge for each job in the CPUTimeRAW column in core seconds.
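
For example, the job charged 85216 core seconds above used 85216 / 3600 ≈ 23.7 core hours.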

I would like to add core hours to a particular member of my group

gdeposit -z 10000 -p halos-spqr1

This adds 10000 core hours to the HALOS-SPQR1 project assigned to the user spqr1. Note that if a core hour limit applies to the parent of the project in the project hierarchy - i.e. if the parent project HALOS has an overall core hour limit (which it almost certainly does) - then the global limit will still apply across all per-user projects.

Core hours may be added to a project by a designated project coordinator user. Reducing the core hours available to a project currently can only be done through the system administrators.