The scheduling policies governing Darwin, and later Wilkes, have continued largely unchanged since 2009 (barring the replacement of Maui by Moab, and later of Moab by SLURM). These policies were well suited to established MPI applications, which typically assume dedicated access to entire compute nodes and checkpoint their own progress.
More recently, the mixture of jobs seen on the HPCS has contained an increasing fraction of varied, embarrassingly parallel, smaller-scale applications from the biosciences, which find a scheduling system that insists on allocating entire 16-core nodes difficult to exploit. To facilitate effective use of the HPCS by this latter class of workload, without inconveniencing the traditional HPC user community, we plan to introduce a number of important improvements to the SLURM configuration during the next planned maintenance slot on Tuesday 4th August.
These new features and their significance are described below and in the linked pages. It is possible some details may change before they are introduced, or as a result of monitoring and improvement afterwards.
Summary of new features
The first two new features are available now. The third feature constitutes the most significant change and will require a restart of all scheduling daemons during a maintenance period.
- Short jobs (available now)
- Long and very long jobs (available now)
- Node sharing (available Tuesday 4th August)
This involves changing the configuration of SLURM so that cores and memory are explicitly allocated to jobs, instead of entire nodes, thus allowing multiple jobs to share a node.
If you have no time to understand this and just want to carry on running jobs the same way as before, jump here.
What will node sharing change?
- In SLURM language, the linear select plugin will be replaced by the cons_res select plugin which allows individual cores, and segments of memory, to be allocated to jobs.
- This means that the submission script complexities previously required to keep all 16 cores of a node fully occupied are no longer necessary, and job submission becomes more straightforward.
- Since multiple jobs will then co-exist on each node, we need to enable the cgroup SLURM plugins which confine jobs to the particular cores and amounts of memory which they have been allocated.
- This means that memory allocations are now significant, as the total node memory must be divided between multiple jobs. Submission scripts may therefore need to include explicit memory requirements, where previously this was redundant.
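In SLURM configuration terms, the change corresponds roughly to settings of the following form (a sketch using standard SLURM option names; the exact values deployed on the HPCS may differ):

```
# slurm.conf (illustrative only)
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # allocate individual cores and memory
TaskPlugin=task/cgroup                # confine jobs to their allocated cores/memory
```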
What does this mean for me?
- If in doubt, add the flag --exclusive to your sbatch command line, or insert the equivalent directive in the top part of your submission script:
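The directive form, placed with the other #SBATCH lines near the top of the script, is:

```shell
#SBATCH --exclusive
```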
This will restore the previous behaviour and prevent your job from sharing a node with any other. However, if your job is not actually using the cores or memory of an entire node, this also restores the previous problem: you are charged for the whole node even though you don't need all of it. If this applies to you, please keep reading.
- For Wilkes jobs, the exclusive setting applies globally and automatically to the entire tesla partition, therefore GPU jobs should work in the same way as before without any script changes.
- Jobs should usually specify the number of nodes (--nodes), the number of tasks (--ntasks; essentially the number of separate programs to be launched across the set of nodes), and the memory required per node in MB (--mem; note that this memory is per node, not per task). The --mem setting is the new directive: previously one could get away without it, because SLURM always delivered the entire node. The other directives for project, wallclock time, partition etc. apply as before.
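As a sketch, a submission script header under the new scheme might look like the following (the project name, partition and resource figures are placeholders, not recommendations):

```shell
#!/bin/bash
#SBATCH -A MYPROJECT        # project to charge (placeholder)
#SBATCH -p sandybridge      # partition (placeholder)
#SBATCH --time=02:00:00     # wallclock limit
#SBATCH --nodes=1
#SBATCH --ntasks=4          # four tasks, so four cores allocated
#SBATCH --mem=16000         # MB required per node (NB: per node, not per task)

srun ./my_program           # my_program is a placeholder
```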
- How should I set --mem?
- If your memory requirements are low but you are not sure exactly what they are, it is safe not to set --mem. SLURM will allocate one core to each task and a reasonable default amount of memory to each core (3994 MB). This default is roughly equivalent to sharing the total node memory equally over all cores.
- If you are already asking for 16 tasks on each node, then it is not necessary to set --mem as you already have the full node.
- If you need the entire node, but are asking for fewer than 16 tasks per node (e.g. because a single task requires more than 1/16 of the total memory, or because each task will use more than one CPU core), then you should set --mem=63900, the entire per-node memory. Equivalently, just specify --exclusive. An MPI job requiring fewer than 16 MPI tasks per node should do one of these to ensure that it doesn't share its nodes.
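For example, a two-node MPI job placing 8 tasks on each node might reserve whole nodes as follows (a sketch; the task counts are illustrative):

```shell
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8
#SBATCH --exclusive         # or equivalently: #SBATCH --mem=63900
```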
- If you are requesting fewer than 16 tasks per node, and have a reasonable idea of how many MB of memory your job requires on each node, then give that to --mem. Otherwise either don't set it (and receive the default MB per core) or request the whole node as above.
- If, through either the number of tasks or the memory per node, your job requests a certain fraction of a node, the number of cores allocated will be scaled proportionately, and this number of cores determines the charge per node. E.g. if your job requests 3/4 of the memory in a node, it will be charged as if it were using 3/4 of the cores in a node. Thus the charge should always reflect the effective proportion of the node occupied.
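The proportional charging rule can be sketched as a small calculation, using the per-node figures quoted above (63900 MB, 16 cores). The round-up behaviour here is an assumption for illustration, not a statement of SLURM's exact algorithm:

```shell
# Sketch of the proportional charging rule described above.
# Assumption: fractional core counts round up to a whole core.
NODE_MEM=63900    # MB per node (from above)
NODE_CORES=16

charged_cores() {
    req_mem=$1    # requested --mem in MB
    # Scale cores in proportion to the memory fraction, rounding up.
    echo $(( (req_mem * NODE_CORES + NODE_MEM - 1) / NODE_MEM ))
}

charged_cores 47925   # 3/4 of the node memory: charged as 12 cores
```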