High Performance Computing Service

Running jobs

Submission of jobs

Portable Batch System (PBS)

The Portable Batch System is a workload management and job scheduling system first developed to manage computing resources at NASA. It exists in various forms, going by names such as PBSPro, OpenPBS and Torque. Darwin uses Torque 2.5 (in conjunction with the Maui scheduler), but we shall refer to it here simply as PBS.

To use PBS, one creates a batch job which is a shell script containing the set of commands to run, plus the resource requirements for the job. The batch job script is then submitted to PBS. A job script can be resubmitted with different parameters (e.g. different sets of data or variables).

Sample non-MPI submission script

This is a minimal example script, which could be used to submit non-MPI jobs to PBS. (A more useful and Darwin-specific example script for MPI jobs can be found in the next section.)

#!/bin/bash

#PBS -l nodes=1:ppn=16,walltime=1:00:00
#PBS -l mem=64000mb
#PBS -m abe

cd ${HOME}/myprogs
myprog a b c

The lines beginning #PBS are directives to the batch system. The directives with -l are resource directives, which specify arguments to the -l option of qsub. In this case, one node with 16 cores, a wall-clock time of one hour and 64000 MB (roughly 64 GB) of memory are requested. The -m directive instructs PBS to send an email notification when the job aborts, begins and ends. PBS stops reading directives at the first executable line (i.e. the first line that is neither blank nor a comment beginning with #). The last two lines change to the directory ~/myprogs and then run the executable myprog with arguments a b c.
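As an illustration of how one job script can be resubmitted with different parameters, the following sketch reads the name of its input file from an environment variable. The variable name DATASET, the file names run1.dat and run2.dat, and the helper function choose_dataset are hypothetical; the -v option of qsub is the usual way to pass a variable into the job environment.

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=16,walltime=1:00:00

# DATASET is a hypothetical variable name. It could be supplied at
# submission time with, for example:
#   qsub -v DATASET=run2.dat myscript
# If it is not supplied, a (hypothetical) default file name is used.
choose_dataset() {
    # $1: the value passed in (may be empty)
    echo "${1:-run1.dat}"
}

dataset=$(choose_dataset "${DATASET:-}")

# Only print the command here; a real job would execute it instead.
echo "Would run: myprog $dataset"
```

Submitting the same script several times with different -v settings then gives one job per data set without editing the script.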

Sample MPI submission script

Note this script is designed for the Sandy Bridge compute nodes. The latest version can also be found on Darwin itself at /usr/local/Cluster-Docs/Torque/mpi_submit.sandybridge, and there are similar template scripts in the same directory tuned for Westmere and GPU (Tesla) jobs.

#!/bin/bash 
#!
#! Example batch job script for Darwin (Sandy Bridge nodes)
#! Last updated: Tue Jun  5 19:27:59 BST 2012
#!

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! PBS directives begin here ##################################
#! Name of the job ('MPI' in this example):
#PBS -N MPI

#! Select queue (note: sandybridge-int for interactive jobs):
#PBS -q sandybridge

#! Numbers of 16-core nodes (nodes) and processor cores per node (ppn) required.
#! The total number of cores available to the job will be nodes*ppn.
#! The second entry is the total memory required - this should usually be
#! less than nodes*64000mb.
#! The third entry is the total amount of wall-clock time (true time) 
#! requested - 02:00:00 indicates 2 hours:
#PBS -l nodes=8:ppn=16,mem=512000mb,walltime=02:00:00

#! Only run the job after a certain time, e.g. the following specifies that
#! the job is eligible to run only after 15:00 on 29th of the month (remove one
#! # from the line below to enable it):
##PBS -a 291500

#! Send mail to the user when the job aborts or ends (add 'b' to receive mail at the
#! beginning of execution; use 'n' on its own to turn off messages completely):
#PBS -m ae

#! PBS directives end here (put any additional directives above this line)
#! ############################################################
#! Modify the settings below to specify the application's location, environment, 
#! and launch method:  

#! Full path to application executable: 
application=""

#! Run options for the application:
options=""

#! Work directory (i.e. where the job will run):
workdir=""

#! Optionally modify the environment seen by the application
#! (the environment without these module settings will come from ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line
module purge                               # Removes all modules loaded by ~/.bashrc
module load default-impi                   # REQUIRED - loads the basic environment
                                           # for Intel MPI/Intel compilers;
# NB The options below are not yet ready for Sandy Bridge:
#! alternatively, comment out the above line and uncomment one of these:
# module load default-mva2     # MVAPICH2/Intel compilers 
# module load default-ompi     # OpenMPI/Intel compilers
#! Insert additional module load commands after this line if needed:

#! Are you using OpenMP? If so increase this safe value to no more than 12:
export OMP_NUM_THREADS=1

#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.

#! Do not change these:
np=$(cat "$PBS_NODEFILE" | wc -l)
ppn=$(uniq -c "$PBS_NODEFILE" | head --lines=1 | sed -e 's/^ *\([0-9]\+\) .*$/\1/g') 

#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):

#! Choose this for a pure MPI code on all allocated cores using Intel MPI:
CMD="mpirun -tune -ppn $ppn -np $np $application $options"

#! Choose this for starting a pure OpenMP parallel program on a single node
#! (OMP_NUM_THREADS threads will be created):
#CMD="$application $options"

#! Uncomment these three lines for hybrid OpenMP/Intel MPI codes -
#! $mpi_procs_per_node * OMP_NUM_THREADS worker threads will be created per node:
#!
#mpi_procs_per_node=$[$ppn/$OMP_NUM_THREADS]  # number of MPI processes per node
#PRECMD='mpirun -tune -ppn $mpi_procs_per_node -np $[$numnodes*$mpi_procs_per_node]'
#CMD="$PRECMD $application $options"


#! Choose this for a pure MPI code on all allocated cores using OpenMPI:
#CMD="mpirun -np $np $application $options"


#! The remaining options for CMD assume we are using the OSC mpiexec
#! (use these for MVAPICH2 or non-MPI jobs):

#! Uncomment this first to load OSC mpiexec:
#module load mpiexec

#! This is the usual case (MPI processes on all allocated cores):
#CMD="mpiexec -kill -comm pmi $application $options"

#! This is for starting identical copies of a non-MPI program on all allocated cores:
#CMD="mpiexec -comm none $application $options"

#! For hybrid OpenMP/MPI codes uncomment the two lines at the end of this section - 
#! $mpi_procs_per_node * OMP_NUM_THREADS worker threads will be created per node:
#!
#mpi_procs_per_node=$[$ppn/$OMP_NUM_THREADS]  # number of MPI processes per node
#CMD="mpiexec -kill -comm pmi -npernode $mpi_procs_per_node $application $options"


###############################################################
### You should not have to change anything below this line ####
###############################################################

cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=`echo $PBS_JOBID | sed -e 's/\..*$//'`

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

numprocs=0
numnodes=0
if [ -r "$PBS_NODEFILE" ]; then
        #! Create a machine file as for InfiniPath MPI
        cat $PBS_NODEFILE | uniq > machine.file.$JOBID
        numprocs=$[`cat $PBS_NODEFILE | wc -l`]
        numnodes=$[`cat machine.file.$JOBID | wc -l`]
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

echo -e "\nnumprocs=$numprocs, numnodes=$numnodes, ppn=$ppn"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD 

For guidance on choosing values for nodes, ppn and mem, please see the FAQ.

As can be seen from the above script, a machine file is built using the $PBS_NODEFILE variable. This variable is supplied by the queueing system and holds the path of a file listing the nodes that are reserved by the queueing system for the particular job (one line per allocated core). However, the version of the mpirun command that is supplied with Intel MPI and used above is PBS-aware and extracts the node information directly without requiring a machine file; the same is true of the OSC mpiexec, which can optionally be loaded via a module and has some useful features.
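The way the template derives its counts from the node file can be sketched in isolation as follows. This is a simplified illustration rather than the template's own code: the helper names count_tasks and count_nodes and the host names in the fake node file are hypothetical, and ppn is computed on the assumption that every node has the same number of cores allocated.

```shell
#!/bin/bash
# Sketch: derive task and node counts from a PBS node file.

count_tasks() { wc -l < "$1" | tr -d ' '; }          # one line per allocated core
count_nodes() { sort -u "$1" | wc -l | tr -d ' '; }  # number of distinct hosts

if [ -r "${PBS_NODEFILE:-}" ]; then
    # Inside a job: use the file supplied by the queueing system.
    nodefile="$PBS_NODEFILE"
else
    # Outside a job: build a small fake node file (hypothetical host names)
    # so that the sketch can be tried interactively.
    nodefile=$(mktemp)
    printf '%s\n' sand-1-1 sand-1-1 sand-1-2 sand-1-2 > "$nodefile"
fi

np=$(count_tasks "$nodefile")
numnodes=$(count_nodes "$nodefile")
ppn=$(( np / numnodes ))   # assumes the same core count on every node

echo "np=$np numnodes=$numnodes ppn=$ppn"
```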

Submitting the job to the queuing system

The command qsub is used to submit jobs. The command will return a unique job identifier, which is used to query and control the job and to identify output. See the respective man page (man qsub) for more options.

qsub <options> <script>
-l   list               request certain resource(s)
-q   queue              job is run in this queue (usually not needed)
-N   name               name of job
-A   project            project to run the job under
                        (required if you are in multiple projects)
-m   a|b|e|n            controls when email messages will be sent
-a   datetime           make the job eligible to run only after a certain time
-W   depend=            sets up inter-job dependencies
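Because qsub prints the job identifier on standard output, it can be captured in a shell variable and reused, for instance to build a dependency for a later job. The sketch below uses a sample identifier in place of a real qsub call so that it runs anywhere; the function name short_jobid and the script name postprocess_script are hypothetical.

```shell
#!/bin/bash
# Sketch: work with the job identifier returned by qsub.

short_jobid() {
    # Strip everything from the first dot onwards,
    # e.g. 14848.master01.xcat.cluster -> 14848
    echo "${1%%.*}"
}

# In a real session the identifier would come from qsub itself, e.g.
#   full_id=$(qsub myscript)
# A sample value is used here instead.
full_id="220531.master01.xcat.cluster"

echo "Short job ID: $(short_jobid "$full_id")"

# A second job could be held until the first finishes successfully.
# The command is only printed here, not executed:
echo "qsub -W depend=afterok:$full_id postprocess_script"
```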

Monitoring jobs

PBS/Torque

In PBS/Torque, the command qstat -an shows which jobs are currently submitted to the queueing system, and the command qstat -q shows which queues are available. Example output from Darwin is shown below:

qstat -an:
master01.xcat.cluster: 
                                                                         Req'd  Req'd   Elap
Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory Time  S Time
-------------------- ----------- -------- ---------------- ------ ----- ------ ------ ----- - -----
14848.master01.x     abc123      sandybri bl_sub_dt         63084    32    512    --  36:00 R 27:17
   sand-9-35/15+sand-9-35/14+sand-9-35/13+sand-9-35/12+sand-9-35/11
...

Maui scheduler

Maui is an advanced scheduler capable of complex scheduling and node allocation decisions. It provides extensive control over which jobs are considered eligible for scheduling, how the jobs are prioritized, and where these jobs are run. As a consequence the job ordering shown by the resource manager (PBS) queue listing command (i.e. qstat) will in general not reflect the actual order in which jobs will be started.

Maui logically sits on top of PBS - Maui decides when a job should run, and PBS acts on this decision, and reports back to Maui. Therefore for a complete picture of the job scheduling system, one should query Maui, using one of the Maui user commands described in the next subsection.

Maui user commands

The most useful commands are

showq -u abc123

to see jobs belonging to abc123;

showq -r

to see all running jobs (use -i instead to see all jobs eligible to run, in order of priority, or -b to see all blocked jobs, i.e. jobs which for some reason are not eligible to run at present);

checkjob -v <jobid>

to see information regarding job <jobid>, including any reason it may be in the blocked (non-eligible) state;

canceljob <jobid>

to cancel or delete job <jobid> (qdel also works);

showstate

to see a schematic representation of activity across the entire system.

Further details can be found in the manual pages.

Deleting jobs

To delete an already submitted job, use either the PBS command qdel, or the Maui command canceljob. The syntax is:

qdel <jobid>

or

canceljob <jobid>

The <jobid> is printed when the job is submitted. Alternatively, use the commands qstat or showq to obtain the job ID.

Job arrays

It is sometimes desirable to submit a large number of similar jobs to the queueing system. One obvious, but inefficient, way to do this would be to prepare a prototype job script, and then to use shell scripting to call qsub on this (possibly modified) job script the required number of times.

Alternatively, the current version of Torque provides a feature known as job arrays which allows the creation of multiple, similar jobs with one qsub command. This feature introduces a new job naming convention that allows users either to reference the entire set of jobs as a unit or to reference one particular job from the set.

To submit a job array use either

qsub -t <array_request> <jobscript>

or equivalently insert a directive

#PBS -t <array_request>

in your batch script. In each case an array is created consisting of a set of jobs, each of which is assigned a unique index number (available to each job as the value of the environment variable PBS_ARRAYID), with each job using the same jobscript and running in a nearly identical environment. Here array_request is a comma-separated list consisting of either single numbers (specifying particular index values), or a pair of numbers separated by a dash (representing a range of index values).
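To make the array_request syntax concrete, the following sketch expands a request into the list of index values it denotes. It is only an illustration of the notation described above, not part of Torque; the function name expand_array_request is hypothetical.

```shell
#!/bin/bash
# Sketch: expand an array request such as "0,1,4-6" into its index values.

expand_array_request() {
    local request=$1 item i out=""
    local IFS=','                      # split the request on commas
    for item in $request; do
        case $item in
            *-*)                       # a range "first-last"
                i=${item%-*}
                while [ "$i" -le "${item#*-}" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done ;;
            *)                         # a single index value
                out="$out $item" ;;
        esac
    done
    echo "${out# }"                    # drop the leading space
}

expand_array_request "0,1,4-6"
```

For the request 0,1,4-6 this prints the five index values 0 1 4 5 6, one for each job in the array.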

Example

qsub -t 0,1,4-6 jobscript
220531.master01.xcat.cluster

qstat
Job id                      Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
220531-0.master01           MPI-0            sjr20                  0 Q sandybridge        
220531-1.master01           MPI-1            sjr20                  0 Q sandybridge        
220531-4.master01           MPI-4            sjr20                  0 Q sandybridge        
220531-5.master01           MPI-5            sjr20                  0 Q sandybridge        
220531-6.master01           MPI-6            sjr20                  0 Q sandybridge        

Note that each job is assigned a composite job id of the form 220531-x (stored as usual in the environment variable PBS_JOBID). Each job is distinguished by a unique array index x (stored in the environment variable PBS_ARRAYID), which is also added to the value of the PBS_JOBNAME variable. Thus each job can perform slightly different actions based on the value of PBS_ARRAYID (e.g. using different input or output files, or different options). The values of x used are those specified in the argument to the -t option.
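A job script might use the array index along the following lines to select its own input and output files. This is a sketch only: the file naming scheme (input_x.dat, output_x.log) and the function name files_for_index are hypothetical, and the index defaults to 0 when PBS_ARRAYID is not set so that the script can also be tried outside the queueing system.

```shell
#!/bin/bash
# Sketch: choose per-job input and output files from the array index.

files_for_index() {
    # $1: array index; prints "<input file> <output file>"
    echo "input_$1.dat output_$1.log"
}

# PBS_ARRAYID is set inside an array job; fall back to 0 elsewhere.
index="${PBS_ARRAYID:-0}"

read -r infile outfile <<< "$(files_for_index "$index")"
echo "Array index $index reads $infile and writes $outfile"
```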

Short jobs

Following user feedback we have made a scheduling adjustment to improve waiting times for shorter jobs (defined as those requesting 2 hours of wallclock time or less).

A certain number of nodes of each type will now keep themselves available for QOS1 and QOS2 short jobs during office hours (i.e. 09:00-21:00 weekdays). This will reduce the average wait for short jobs and is part of a larger plan to make interactive/development work on the system easier.

This can only work if the scheduler knows that a short job is short, i.e. it is essential that the wallclock time specified in the submission script (or on the qsub command line) for a "short" job is 2 hours or less. More generally, walltimes should be stated accurately and not simply left at the maximum values of 12 or 36 hours; otherwise wait times become longer than necessary (because jobs look larger, and a larger job is more difficult to schedule). To alter the requested walltime for a job already submitted, do

qalter -l walltime=XX:YY:ZZ jobid

where jobid is the ID of the job, and XX, YY and ZZ are hours, minutes and seconds respectively.
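As a small aid, the sketch below formats a duration given in minutes as a walltime string and checks it against the 2 hour limit for short jobs. The function names minutes_to_walltime and is_short_job are hypothetical, and the qalter command is only printed, not executed.

```shell
#!/bin/bash
# Sketch: build a walltime string and test it against the short-job limit.

minutes_to_walltime() {
    # $1: duration in whole minutes; prints HH:MM:SS
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

is_short_job() {
    # Succeeds if the duration is 2 hours (120 minutes) or less.
    [ "$1" -le 120 ]
}

mins=90
wt=$(minutes_to_walltime "$mins")

if is_short_job "$mins"; then
    echo "walltime=$wt qualifies as a short job"
else
    echo "walltime=$wt does not qualify as a short job"
fi

# The new value would then be applied with:
echo "qalter -l walltime=$wt <jobid>"
```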

Further adjustments may be made as we observe the effect on overall throughput.