SLURM Quick Start
The HPCS is committed to moving from Torque/Maui to SLURM at the beginning of the Cambridge quarter on 1st February 2014.
This page is intended as a quick and dirty guide to migrating your PBS-style batch workload to SLURM. Note that any workload still outstanding in Torque/Maui ("PBS") on 31st January will need to be resubmitted to SLURM.
Each Darwin project in PBS will be replaced on 31st January by a SLURM project with the same name, and its current usage will be transferred. The PBS queues will then close, and all new jobs must be submitted to SLURM. Jobs still running under PBS at the changeover will be allowed to complete.
Resources for use on Wilkes will continue to be held in separate projects which have the string "-GPU" appended to the usual name. Note that project names in SLURM are stored in a case-insensitive way.
In the majority of cases, migrating a job submission script from PBS to SLURM should be straightforward: the default MPI implementation (Intel MPI) has internal support for both schedulers, and this has already been tested on Wilkes. However, the OSC mpiexec (useful for starting multiple small jobs within a larger batch job) does not work with SLURM, so scripts which rely on it will need to be reimplemented using srun instead. Support can assist with this on a case-by-case basis. A small SLURM test bed remains available on which submissions can be tested.
In order to access SLURM commands and man pages, do
module load slurm
Note that this will be loaded by default at login after the production service switches to SLURM.
Sample submission scripts
As in PBS, in SLURM one creates a batch job which is a shell script containing the set of commands to run, plus the resource requirements for the job which are coded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM using the sbatch command instead of qsub. The most obvious difference between the two types of submission script is that SLURM uses directives beginning with '#SBATCH' whereas PBS directives begin with '#PBS'.
Please copy and edit the sample submission scripts that can be found under
where slurm_submit.darwin is the appropriate choice for Darwin, and slurm_submit.wilkes is suitable for Wilkes. The main directives to modify are
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
In particular, the name of the project is required for the job to be scheduled (if in doubt, use the command mybalance to check which projects you can charge). Allocation occurs in whole nodes, and charging is reported in units of core hours according to
number of nodes x 16 x total walltime in hours
for Darwin, and
number of nodes x 12 x total walltime in hours
for Wilkes (1 GPU hour is equivalent to 6 core hours since each node has 2 GPUs and 12 CPU cores).
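As a worked example of the formulas above (the job sizes here are hypothetical), a 2-node job with a 2-hour wallclock limit would be charged as follows:

```shell
# Worked example of the charging formulas (hypothetical job sizes).
nodes=2
walltime_hours=2

# Darwin: 16 cores per node
darwin_charge=$((nodes * 16 * walltime_hours))
# Wilkes: 12 CPU cores per node
wilkes_charge=$((nodes * 12 * walltime_hours))

echo "Darwin charge: $darwin_charge core hours"   # 64
echo "Wilkes charge: $wilkes_charge core hours"   # 48
```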
The --ntasks setting does not affect charging, but is taken by the MPI launch logic as the total number of MPI tasks to start. Usually this should be set to the number of nodes multiplied by the number of cores per node (16 for Darwin, 12 for Wilkes). If it needs to be reduced for memory or other reasons, change it here.
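Putting the directives above together, a minimal complete submission script might look like the following. This is a sketch only: the application name ./myprog and the module line are placeholders, and the sample scripts mentioned above should be preferred as a starting point.

```shell
#!/bin/bash
#! Minimal sketch of a SLURM submission script for Darwin.
#! CHANGEME and ./myprog are placeholders - adapt before use.
#SBATCH -A CHANGEME
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

#! Set up the module environment (module names here are illustrative -
#! use the defaults from the sample scripts).
. /etc/profile.d/modules.sh
module load default-impi

#! Intel MPI is SLURM-aware, so no explicit host list is needed.
mpirun -np $SLURM_NTASKS ./myprog
```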
Submitting the job to the queuing system
The command sbatch is used to submit jobs, e.g.

sbatch submission_script
The command will return a unique job identifier, which is used to query and control the job and to identify output. Please see the above link or do man sbatch on Darwin to see all the available options.
The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2:
sbatch --array=1-7:2 -A STARS-SL2 submission_script
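The index set that a start-end:step specification expands to can be checked with ordinary shell tools; this sketch merely reproduces the arithmetic of --array=1-7:2 and does not talk to SLURM:

```shell
# Reproduce the index set generated by --array=1-7:2 (start-end:step).
indices=$(seq 1 2 7 | xargs)
echo "Array task IDs: $indices"   # 1 3 5 7
```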
Please see the man pages or the documentation here for more information on these commands.
List all jobs in SLURM
squeue

  JOBID PARTITION     NAME   USER ST    TIME NODES NODELIST(REASON)
18974_7 sandybrid novoalig abc123  R 1:47:23     2 sand-1-[1-2]
18970_2 sandybrid novoalig abc123  R 1:47:53     2 sand-1-[22-23]
18970_3 sandybrid novoalig abc123  R 1:47:53     2 sand-1-[8,18]
18974_6 sandybrid novoalig abc123  R 1:47:53     2 sand-1-[31,33]
...
  18821     tesla     prop spqr45 PD    0:00     4 (QOSResourceLimit)
In the above, the partitions sandybridge and tesla indicate that the jobs are destined for Darwin and Wilkes respectively. The state R means that a job is running, while PD indicates that it is pending (still queued), in this case because of the restrictions of the QOS (determined by the service level).
The jobids reported as mmmm_n are elements of an array job, where mmmm is the SLURM_ARRAY_JOB_ID common to all jobs in the array, and n is the array index (SLURM_ARRAY_TASK_ID).
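Inside an array job these two components are available directly as the environment variables named above. For illustration, the mmmm_n form can also be split with shell parameter expansion (the job id below is hypothetical):

```shell
# Split a hypothetical array jobid of the form mmmm_n into its components.
jobid="18974_7"
array_job_id=${jobid%_*}    # SLURM_ARRAY_JOB_ID part
array_task_id=${jobid#*_}   # SLURM_ARRAY_TASK_ID part
echo "$array_job_id $array_task_id"   # 18974 7
```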
List all jobs owned by abc123
squeue -u abc123
List all jobs owned by project projectname
squeue -A projectname
Many other options and combinations of options are possible; see man squeue.
Examine job jobid in detail
scontrol show job <jobid>
Cancel job jobid

scancel <jobid>
Request an interactive session

The following command will request two Darwin nodes interactively for 2 hours, charged to the project MYPROJECT:
sintr -p sandybridge -A MYPROJECT -N2 -t 2:0:0
This command will pause until the job starts, then create a screen terminal running on the first node allocated (cf man screen). X windows applications started inside this terminal should display properly. Within the screen session, new terminals can be started with control-a c, with navigation between the different terminals being accomplished with control-a n and control-a p. Also srun can be used to start processes on any of the nodes in the job allocation, and SLURM-aware MPI implementations will use this to launch remote processes on the allocated nodes without the need to give them explicit host lists.
Show activity across all nodes
Accounting commands

The following commands are intended to behave similarly to the commands of the same names from Gold. They are wrappers around the underlying SLURM commands sacct and sreport, which are much more powerful.
How many core hours do I have available?
mybalance

User          Usage |    Account    Usage | Account Limit  Available (CPU hrs)
---------- -------- + ---------- ------- + ------------- ----------
abc123           18 |      STARS     171 |       100,000     99,829
abc123           18 |  STARS-SL2      35 |       101,000    100,965
abc123          925 |     BLACKH  10,634 |       166,667    156,033
This shows, for each project of which the user is a member, how many core hours have been used, how many were awarded, and how many remain available.
How many core hours does some other project or user have?
gbalance -p HALOS

User          Usage | Account     Usage | Account Limit  Available (CPU hrs)
---------- -------- + ------- --------- + ------------- ----------
pq345             0 |   HALOS   317,656 |       600,000    282,344
xyz10        11,880 |   HALOS   317,656 |       600,000    282,344
...
This outputs the total usage in core hours accumulated to date for the project, the total awarded and total remaining available (i.e. to all members). It also prints the component of the total usage due to each member.
I would like a listing of all jobs I have submitted through a certain project and between certain times
gstatement -p HALOS -u xyz10 -s "2014-01-01-00:00:00" -e "2014-01-20-00:00:00"

JobID        User      Account    JobName  Partition  End                 NCPUS      CPUTimeRAW ExitCode State
------------ --------- ---------- -------- ---------- ------------------- ---------- ---------- -------- ----------
14505        xyz10     halos      help     sandybrid+ 2014-01-07T12:59:40         16         32      0:9 COMPLETED
14506        xyz10     halos      help     sandybrid+ 2014-01-07T13:00:11         16         48      2:0 FAILED
14507        xyz10     halos      bash     sandybrid+ 2014-01-07T13:05:20         16       4128      0:0 CANCELLED+
14541        xyz10     halos      bash     sandybrid+ 2014-01-07T15:31:44         32      85216      0:9 COMPLETED
14560        xyz10     halos      bash     sandybrid+ 2014-01-07T16:19:36         32      89824      0:0 COMPLETED
14569        xyz10     halos      bash     sandybrid+ 2014-01-07T18:19:47         80     576240      0:1 TIMEOUT
14598        xyz10     halos      bash     sandybrid+ 2014-01-07T19:27:54         80     324080      0:0 COMPLETED
15619        xyz10     halos      test.sh  sandybrid+ 2014-01-09T16:10:35         16         64      0:0 COMPLETED
...
This lists the charge for each job in the CPUTimeRAW column in core seconds.
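To relate CPUTimeRAW to the core hour figures reported by mybalance, divide by 3600. For instance, taking the 85216 core second job from the listing above:

```shell
# Convert CPUTimeRAW (core seconds) to core hours.
cputime_raw=85216
core_hours=$((cputime_raw / 3600))                 # whole core hours
remainder_minutes=$(( (cputime_raw % 3600) / 60 )) # leftover minutes
echo "$cputime_raw core seconds = $core_hours core hours $remainder_minutes minutes"
```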
I would like to add core hours to a particular member of my group
gdeposit -z 10000 -p halos-spqr1
This adds 10000 core hours to the HALOS-SPQR1 project, which is assigned to the user spqr1. Note that if a core hour limit applies to the parent of the project in the project hierarchy - i.e. if the parent project HALOS has an overall core hour limit (which it almost certainly does) - then that global limit still applies across all per-user projects.
Core hours may be added to a project by a designated project coordinator. Reducing the core hours available to a project can currently only be done by the system administrators.