Definition of long and very long jobs
We define long and very long jobs as jobs requiring wall times (i.e. real execution times) of up to 7 days and 30 days respectively. Continuous execution times of these lengths are normally disallowed by both non-paying and paying qualities of service (QoS), in order to achieve reasonable overall throughput.
In general, any application that runs for extended periods should be able to save its progress (i.e. to checkpoint) as insurance against unexpected failures that could otherwise waste significant resources. Applications that can checkpoint are largely immune from per-job runtime limits, as they can simply resume from the most recent checkpoint in the guise of a new job. Applications for which checkpointing is not feasible may find the scheduling features described below useful.
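The checkpoint-and-resume pattern described above can be sketched as follows. This is a minimal illustration only, assuming an application whose progress can be summarised by a single step counter. The function name run_with_checkpoint, the state file, and the per-run step limit are all hypothetical and stand in for whatever state and runtime limit a real application has.

```shell
#!/bin/bash
# Hypothetical sketch: resume a long computation from a checkpoint file.
# Usage: run_with_checkpoint STATE_FILE TOTAL_STEPS MAX_STEPS_THIS_RUN
run_with_checkpoint() {
    local state_file="$1"
    local total_steps="$2"
    local max_steps="$3"
    local step=0
    local done_this_run=0

    # Resume from the last recorded step if a checkpoint exists.
    if [ -f "$state_file" ]; then
        step=$(cat "$state_file")
    fi

    # Stop when all work is done or this run's step budget is used up
    # (the budget stands in for the per-job runtime limit).
    while [ "$step" -lt "$total_steps" ] && [ "$done_this_run" -lt "$max_steps" ]; do
        # ... one unit of real work would go here ...
        step=$((step + 1))
        done_this_run=$((done_this_run + 1))

        # Write to a temporary file and rename it, so that an interruption
        # mid-write does not leave a truncated checkpoint behind.
        printf '%s\n' "$step" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
    done

    # Report the last completed step.
    echo "$step"
}
```

Calling the function repeatedly with the same state file continues where the previous call stopped, which is how a series of shorter jobs can stand in for one long job.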
Note on checkpointing
Darwin and Wilkes nodes have Berkeley Lab Checkpoint/Restart (BLCR) enabled by default. This may allow checkpointing through SLURM for some applications that lack their own checkpoint support. However, not all jobs work successfully with BLCR; in particular, we do not recommend attempting SLURM/BLCR checkpointing with MPI jobs. Nevertheless, some non-parallel jobs may be able to use BLCR to accumulate extended run times without requesting one of the special QoS described below.
A page on using BLCR checkpointing with SLURM is in preparation.
The QOSL and QOSXL QoS
Paying users with suitable applications may be granted access to one of the QOSL and QOSXL qualities of service, which permit using up to 128 cores for up to 7 days and 30 days respectively. At the time of writing, only the first 64 nodes of Darwin may run jobs associated with these special QoS. Users wishing to run for a long time are expected to accept that other users will do the same, and that the wait for a start time may therefore be long.
In order to apply for access to QOSL or QOSXL, please email support'at'hpc.cam.ac.uk detailing why this mode of usage is necessary, and explaining why checkpointing is not a practical option for your application.
Submitting (very) long jobs
Use of QOSL/QOSXL is tied to the long partition. Once you have been granted access to one of these QoS, it is therefore only necessary to specify this partition, e.g.
sbatch -t 7-0:0:0 -p long -A YOUR_PROJECT ...
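The options in the command above can also be given as #SBATCH directives at the top of a batch script. The following is a minimal sketch, assuming a bash batch script. YOUR_PROJECT remains a placeholder, the application name in the final comment is hypothetical, and the fallback values after ":-" are used only when the script is run outside SLURM.

```shell
#!/bin/bash
# Wall time limit in days-hours:minutes:seconds (here 7 days).
#SBATCH --time=7-00:00:00
# QOSL/QOSXL jobs must go to the long partition.
#SBATCH --partition=long
# Placeholder: replace with your own project account.
#SBATCH --account=YOUR_PROJECT

# SLURM sets these variables inside a running job.
job_label="job ${SLURM_JOB_ID:-unsubmitted} on partition ${SLURM_JOB_PARTITION:-none}"
echo "Starting $job_label"

# Launch the application here, e.g. (hypothetical name):
# ./my_long_running_application
```

A script written this way is submitted with sbatch followed by the script name, with no further options needed on the command line.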