
Submission of jobs

Sample submission scripts

To use SLURM (as with PBS), one creates a batch job: a shell script containing the set of commands to run, together with the resource requirements for the job, which are coded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM. A job script can be resubmitted with different parameters (e.g. different sets of data or variables).

Please copy and edit the sample submission scripts that can be found under

/usr/local/Cluster-Docs/SLURM

where slurm_submit.darwin is the appropriate choice for Darwin, and slurm_submit.wilkes is suitable for Wilkes. Lines beginning #SBATCH are directives to the batch system. The rest of each directive specifies arguments to the sbatch command. SLURM stops reading directives at the first executable line (i.e. the first line that is non-blank and does not begin with #).
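
For example, to start from the Darwin template (the destination file name here is purely illustrative):

cp /usr/local/Cluster-Docs/SLURM/slurm_submit.darwin my_job

and then edit the #SBATCH directives near the top of my_job before submitting it.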

The main directives to modify are:

#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much memory in MB is required _per node_? Not setting this
#! will lead to a default of (1/16)*total memory per task.
#! Setting a larger amount per task increases the number of cores.
##SBATCH --mem=   # 63900 is the maximum value allowed per node.
#! How much wallclock time will be required?
#SBATCH --time=02:00:00

In particular, the name of the project is required for the job to be scheduled (in case of doubt, use the command mybalance to check what this is for you). Charging is reported in units of core hours according to

number of CPU cores allocated x total walltime in hours

for Darwin, and similarly for Wilkes (note that each GPU hour consumed or available on Wilkes is represented in SLURM as 6 core hours, since each node has 2 GPUs and 12 CPU cores).
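
For example (the job sizes here are purely illustrative), a 2-node Darwin job using all 32 cores for 3 hours of walltime is charged 32 x 3 = 96 core hours, while a 2-node Wilkes job (4 GPUs) running for 3 hours is charged 4 x 6 x 3 = 72 core hours (equivalently, 2 x 12 CPU cores x 3 hours).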

The --ntasks setting is taken by the MPI launch logic to be the total number of MPI tasks that should be started. Usually this should be set to the number of nodes times the number of cores per node (which is 16 for Darwin). If this needs to be reduced for memory or other reasons, make that change here, but don't forget to explicitly request the required memory per node (up to 63900MB for the entire node); otherwise SLURM will scale the allocated memory down with the number of tasks.
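
As a sketch (the figures below are illustrative, not a recommendation), a job that places only 8 tasks on each of 2 Darwin nodes, but still requires each node's full memory, would combine:

#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --mem=63900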

Sample MPI submission script

Note this script is designed for the Sandy Bridge compute nodes on Darwin. The latest version can also be found on Darwin itself at /usr/local/Cluster-Docs/SLURM/slurm_submit.sandybridge, and there is a similar template script in the same directory tuned for Wilkes and GPU (Tesla) jobs.

#!/bin/bash
#!
#! Example SLURM job script for Darwin (Sandy Bridge, ConnectX3)
#! Last updated: Fri Jan 24 12:12:18 GMT 2014
#!

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J darwinjob
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*16)
#SBATCH --ntasks=32
#! How much memory in MB is required _per node_? Not setting this
#! will lead to a default value of (1/16)*total memory per task.
#! Setting a larger amount per task increases the number of cores.
##SBATCH --mem=   # 63900 is the maximum value allowed per node.
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=FAIL
#! Uncomment this to prevent the job from being requeued (e.g. if
#! interrupted by node failure or system downtime):
##SBATCH --no-requeue

#! Do not change:
#SBATCH -p sandybridge

#! sbatch directives end here (put any additional directives above this line)

#! Notes:
#! Charging is determined by core number*walltime.
#! The --ntasks value refers to the number of tasks to be launched by SLURM only. This
#! usually equates to the number of MPI tasks launched. Reduce this from nodes*16 if
#! demanded by memory requirements, or if OMP_NUM_THREADS>1.

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e  's/^\([0-9][0-9]*\).*$/\1/')
#! ############################################################
#! Modify the settings below to specify the application's environment, location 
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line (enables the module command)
module load default-impi                   # REQUIRED - loads the basic environment

#! Insert additional module load commands after this line if needed:

#! Full path to application executable: 
application=""

#! Run options for the application:
options=

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR"  # The value of SLURM_SUBMIT_DIR sets workdir to the directory
                             # in which sbatch is run.

#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so, increase this
#! safe value to no more than 16:
export OMP_NUM_THREADS=1

#! Number of MPI tasks to be started by the application per node and in total (do not change):
np=$[${numnodes}*${mpi_tasks_per_node}]

#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.


#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):

#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI.
CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options"

#! Choose this for a pure shared-memory OpenMP parallel program on a single node:
#! (OMP_NUM_THREADS threads will be created):
#CMD="$application $options"

#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI:
#CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options"


###############################################################
### You should not have to change anything below this line ####
###############################################################

cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD 

As can be seen from the above script, a machine file is built in the PBS style using the SLURM_JOB_NODELIST variable. This variable is supplied by the queueing system and contains the names of the nodes reserved for the particular job. However, the version of the mpirun command that is supplied with Intel MPI and used above is SLURM-aware and extracts the node information directly, without requiring a machine file.
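
Should a launcher require one hostname per line, the compressed node list in SLURM_JOB_NODELIST (e.g. of the form node-sand[101-102]; the names here are invented) can also be expanded with the standard SLURM utility scontrol. A minimal sketch:

scontrol show hostnames "$SLURM_JOB_NODELIST" > machine.file.$SLURM_JOB_ID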

Submitting the job to the queueing system

The command sbatch is used to submit jobs, e.g.

sbatch submission_script

The command will return a unique job identifier, which is used to query and control the job and to identify output. See the man page (man sbatch) for more options.
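
For example (the job ID below is purely illustrative):

sbatch submission_script   # typically prints: Submitted batch job 123456
squeue -j 123456           # query the state of the job
scancel 123456             # cancel the job if necessary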

The following more complex example submits a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7) to the project STARS-SL2:

sbatch --array=1-7:2 -A STARS-SL2 submission_script
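
Within the submission script, each array element can read its own index from the SLURM_ARRAY_TASK_ID environment variable, for example to select a per-element input file. A minimal sketch (the file naming and option are hypothetical):

options="--input data_${SLURM_ARRAY_TASK_ID}.dat"

so that elements 1, 3, 5 and 7 each process a different data file.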