High Performance Computing Service

Running jobs

SLURM

SLURM is an open source workload management and job scheduling system. The HPCS adopted SLURM in February 2014, but previously used Torque, Maui/Moab and Gold (referred to in the following simply as "PBS") for the same purpose. Please note that there are several commands available in SLURM with the same names as in PBS (e.g. showq, qstat, qsub, qdel, qrerun) intended for backwards compatibility, but in general we would recommend using the native SLURM commands, as described below.
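
For reference, the native equivalents of the most commonly used PBS commands are sbatch (in place of qsub), squeue (in place of qstat or showq) and scancel (in place of qdel); each is covered in the sections below.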

Submission of jobs

Sample submission scripts

To use SLURM (as with PBS), one creates a batch job, which is a shell script containing the set of commands to run, plus the resource requirements for the job, coded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM. A job script can be resubmitted with different parameters (e.g. different sets of data or variables).

Please copy and edit the sample submission scripts that can be found under

/usr/local/Cluster-Docs/SLURM

where slurm_submit.darwin is the appropriate choice for Darwin, and slurm_submit.wilkes is suitable for Wilkes. Lines beginning #SBATCH are directives to the batch system. The rest of each directive specifies arguments to the sbatch command. SLURM stops reading directives at the first executable line (i.e. the first line that is non-blank and does not begin with #).

The main directives to modify are

#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00

In particular, the name of the project is required for the job to be scheduled (if in doubt, use the mybalance command to check which projects you can charge). Allocation occurs in entire nodes, and charging is reported in units of core hours according to

number of nodes x 16 x total walltime in hours

for Darwin, and

number of nodes x 12 x total walltime in hours

for Wilkes (1 GPU hour is equivalent to 6 core hours since each node has 2 GPUs and 12 CPU cores).
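
For example, a job requesting 2 nodes for 2 hours of walltime is charged 2 x 16 x 2 = 64 core hours on Darwin, or 2 x 12 x 2 = 48 core hours (equivalently 2 x 2 x 2 = 8 GPU hours) on Wilkes.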

The --ntasks setting does not affect charging, but is taken by the MPI launch logic to be the total number of MPI tasks to start. Usually this should be set to the number of nodes multiplied by the number of cores per node (16 for Darwin, 12 for Wilkes). If it needs to be reduced, for memory or other reasons, make that change here.
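
For example (a sketch assuming a hypothetical memory-hungry code on Darwin), a 2-node job launching only 8 tasks per node would keep the other directives above but set:

#! How many (MPI) tasks will there be in total? (reduced from nodes*16 for memory reasons)
#SBATCH --ntasks=16

The charge for the job is unchanged, since whole nodes are still allocated.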

Sample MPI submission script

Note this script is designed for the Sandy Bridge compute nodes on Darwin. The latest version can also be found on Darwin itself at /usr/local/Cluster-Docs/SLURM/slurm_submit.sandybridge, and there is a similar template script in the same directory tuned for Wilkes and GPU (Tesla) jobs.

#!/bin/bash
#!
#! Example SLURM job script for Darwin (Sandy Bridge, ConnectX3)
#! Last updated: Fri Jan 24 12:12:18 GMT 2014
#!

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J darwinjob
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=FAIL
#! Uncomment this to prevent the job from being requeued (e.g. if
#! interrupted by node failure or system downtime):
##SBATCH --no-requeue

#! Do not change:
#SBATCH -p sandybridge

#! sbatch directives end here (put any additional directives above this line)

#! Notes:
#! Charging is determined by node number*walltime. Allocation is in entire nodes.
#! The --ntasks value refers to the number of tasks to be launched by SLURM only. This
#! usually equates to the number of MPI tasks launched. Reduce this from nodes*16 if
#! demanded by memory requirements, or if OMP_NUM_THREADS>1.

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e  's/^\([0-9][0-9]*\).*$/\1/')
#! ############################################################
#! Modify the settings below to specify the application's environment, location 
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line (enables the module command)
module load default-impi                   # REQUIRED - loads the basic environment

#! Insert additional module load commands after this line if needed:

#! Full path to application executable: 
application=""

#! Run options for the application:
options=

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR"  # The value of SLURM_SUBMIT_DIR sets workdir to the directory
                             # in which sbatch is run.

#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so increase this
#! safe value to no more than 16:
export OMP_NUM_THREADS=1

#! Number of MPI tasks to be started by the application per node and in total (do not change):
np=$[${numnodes}*${mpi_tasks_per_node}]

#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.


#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):

#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI.
CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options"

#! Choose this for a pure shared-memory OpenMP parallel program on a single node:
#! (OMP_NUM_THREADS threads will be created):
#CMD="$application $options"

#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI:
#CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options"


###############################################################
### You should not have to change anything below this line ####
###############################################################

cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD 

As can be seen from the above script, a machine file is built in the PBS style from the SLURM_JOB_NODELIST variable. This variable is supplied by the queueing system and contains the names of the nodes reserved for the particular job. However, the version of the mpirun command supplied with Intel MPI and used above is SLURM-aware and extracts the node information directly, without requiring a machine file.
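
If a launcher does require an explicit host list, the compressed node range syntax held in SLURM_JOB_NODELIST (e.g. sand-1-[1-2]) can also be expanded with scontrol; a minimal sketch, writing one hostname per line:

scontrol show hostnames "$SLURM_JOB_NODELIST" > machine.file.$SLURM_JOB_ID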

Submitting the job to the queuing system

The command sbatch is used to submit jobs, e.g.

sbatch submission_script

The command will return a unique job identifier, which is used to query and control the job and to identify output. See the man page (man sbatch) for more options.
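
For example, a successful submission prints a single line of the form

Submitted batch job 19042

where 19042 is the job identifier (the number shown here is purely illustrative).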

The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2:

sbatch --array=1-7:2 -A STARS-SL2 submission_script
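
Within each element of the array the index is available in the environment variable SLURM_ARRAY_TASK_ID, which can be used to select different inputs per element. As a sketch (the input file naming is hypothetical), the options variable in the sample script above could be set as follows:

#! Pick an input file based on the array index (1, 3, 5 or 7 in the example above):
options="input_${SLURM_ARRAY_TASK_ID}.dat"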

Monitoring jobs

In SLURM, the command squeue shows which jobs are currently submitted to the queueing system, and squeue -u spqr1 shows only those jobs belonging to the user spqr1 (other selections are possible, e.g. use -A to select on a particular project). An example of output from Darwin is shown below:

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             10119 sandybrid MILCtest dc-klmn13 PD       0:00      1 (QOSResourceLimit)
           18974_7 sandybrid novoalig    abc123  R    1:47:23      2 sand-1-[1-2]
           18970_2 sandybrid novoalig    abc123  R    1:47:53      2 sand-1-[22-23]
             11206 sandybrid sf1l1000    abc123  R      50:31      8 sand-3-[35-37],sand-5-[55-59]
              7819 sandybrid    LCDMb     spqr1  R    9:13:48      4 sand-6-[39-42]
             11230 sandybrid alaR_iso     xyz12  R       8:11      4 sand-3-[22-25]
...
             18821     tesla     prop    spqr45 PD       0:00      4 (QOSResourceLimit)
...

In the above, sandybridge and tesla indicate that the jobs are destined for Darwin and Wilkes respectively. If the state is PENDING (PD), i.e. the job is still waiting in the queue and not yet running, the final column lists the reason - in the case of job 10119 above it is because the user is already using the maximum resources permitted at any one time by their quality of service (QOS), which is determined by the service level. If the state is RUNNING (R), the same column lists which nodes have been allocated to the job.

The jobids reported as mmmm_n are elements of an array job, where mmmm is the SLURM_ARRAY_JOB_ID common to all jobs in the array, and n is the array index (SLURM_ARRAY_TASK_ID).

The scontrol command is more powerful and allows more detailed queries. E.g. to examine a particular job with id <jobid> in detail:

scontrol show job <jobid>

or

scontrol show node <nodename>

to see information regarding the node <nodename>.

Schematic representations of activity across the entire system can be obtained from sinfo and sview.
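
For example, to summarise only the nodes in a particular partition (here the Darwin partition named above), one can run

sinfo -p sandybridge

and similarly sinfo -p tesla for the Wilkes nodes; sview presents comparable information graphically.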

Further details can be found on the manual pages.

Deleting jobs

To cancel a job (either running or still queuing) use scancel:

scancel <jobid>

The <jobid> is printed when the job is submitted; alternatively, use the squeue, qstat or showq commands to obtain the job ID.
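
Two further illustrative uses (the user name and job ID here are hypothetical):

scancel -u spqr1
scancel 18974_3

The first cancels every job belonging to the user spqr1; the second cancels only element 3 of the array job 18974.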

Accounting Commands

The following commands are intended to behave similarly to the commands of the same names from Gold. They are wrappers around the underlying SLURM commands sacct and sreport, which are much more powerful.

How many core hours do I have available?

mybalance
User            Usage |        Account     Usage | Account Limit Available (CPU hrs)
----------  --------- + -------------- --------- + ------------- ---------
abc123             18 |          STARS       171 |       100,000    99,829
abc123             18 |      STARS-SL2        35 |       101,000   100,965
abc123            925 |         BLACKH    10,634 |       166,667   156,033

This shows, for each project of which the user is a member, how many core hours have been used, how many were awarded, and how many remain available.

How many core hours does some other project or user have?

gbalance -p HALOS
User           Usage |   Account     Usage | Account Limit Available (CPU hrs)
---------- --------- + --------- --------- + ------------- ---------
pq345              0 |     HALOS   317,656 |       600,000   282,344
xyz10         11,880 |     HALOS   317,656 |       600,000   282,344
...

This outputs the total usage in core hours accumulated to date for the project, the total awarded and total remaining available (i.e. to all members). It also prints the component of the total usage due to each member.

I would like a listing of all jobs I have submitted through a certain project and between certain times

gstatement -p HALOS  -u xyz10 -s "2014-01-01-00:00:00" -e "2014-01-20-00:00:00" 
       JobID      User   Account  JobName  Partition                 End      NCPUS CPUTimeRAW ExitCode      State 
------------ --------- ---------- -------- ---------- ------------------- ---------- ---------- -------- ---------- 
14505            xyz10    halos       help sandybrid+ 2014-01-07T12:59:40         16         32      0:9  COMPLETED 
14506            xyz10    halos       help sandybrid+ 2014-01-07T13:00:11         16         48      2:0     FAILED 
14507            xyz10    halos       bash sandybrid+ 2014-01-07T13:05:20         16       4128      0:0 CANCELLED+ 
14541            xyz10    halos       bash sandybrid+ 2014-01-07T15:31:44         32      85216      0:9  COMPLETED 
14560            xyz10    halos       bash sandybrid+ 2014-01-07T16:19:36         32      89824      0:0  COMPLETED 
14569            xyz10    halos       bash sandybrid+ 2014-01-07T18:19:47         80     576240      0:1    TIMEOUT 
14598            xyz10    halos       bash sandybrid+ 2014-01-07T19:27:54         80     324080      0:0  COMPLETED 
15619            xyz10    halos    test.sh sandybrid+ 2014-01-09T16:10:35         16         64      0:0  COMPLETED 
...

This lists, in the CPUTimeRAW column, the charge for each job in core seconds.
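
Dividing by 3600 converts this to core hours; for example, the TIMEOUT job above with CPUTimeRAW=576240 was charged 576240 / 3600, i.e. approximately 160 core hours.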

I would like to add core hours to a particular member of my group

gdeposit -z 10000 -p halos-spqr1

This adds 10000 core hours to the HALOS-SPQR1 project assigned to the user spqr1. Note that if a core hour limit applies to the parent of this project in the project hierarchy - i.e. if the parent project HALOS has an overall core hour limit (which it almost certainly does) - then that global limit still applies across all per-user projects.

Core hours may be added to a project by a designated project coordinator user. Reducing the core hours available to a project can currently only be done by the system administrators.