High Performance Computing Service

Usage policies

General conditions of use

Use of HPCS resources is subject to the Rules of the Information Strategy and Services Syndicate, the Usage Rules of the Cambridge University Data Network (CUDN) and the Privacy Policies provided by the University of Cambridge Computing Service. All users agree to abide by these rules (as stated on the application form).

By logging into the HPCS systems all users also indicate awareness and acceptance of the HPCS Privacy Policy.

Background

The Cambridge High Performance Computing Service (HPC Service) is responsible for hosting, system support, scientific support and service delivery of a large supercomputing resource for the University of Cambridge Research Community.

The University-wide service is to be run as a self-sustaining cost centre within the School of the Physical Sciences and must therefore recover all costs incurred by the capital depreciation and running costs of the computer equipment, plus the additional scientific support costs incurred to help increase the useful scientific output achieved from the machine. To this end, the computational equipment within the Service is run as a Major Research Facility under the Full Economic Costing (fEC) funding model. Under this model, units of use are priced; research staff should determine how many units they require for a particular project, explicitly include these costs within a grant application for the project, and then pass this funding back to the HPC Service as a direct cost.

As a result of this funding requirement, the HPC Service must have a clearly stated and controlled usage policy which results in well-defined and guaranteed service level agreements (SLAs). This will be achieved by use of the SLURM resource allocation software and the implementation of a detailed resource allocation policy.

Costs overview

  1. The unit of use on the Darwin supercomputer is the CPU core hour, and its cost is now incorporated into the pFACT system (internal University of Cambridge users please see here). External users please contact us directly.

    The cost is calculated on the assumption that, at 95% paid consumption of the whole Darwin system, all costs are covered, including capital depreciation of the machine and all running costs such as power, management staff and support staff (a worked sketch of this calculation appears after this list). The aim is for the HPC Service to develop a core in-house knowledge base in HPC systems and scientific support which will be available to all research staff within the University.

  2. The service offers a range of service levels with different features. These are associated with different quality of service (QOS) definitions within the scheduler, enforcing different job priorities, maximum job sizes and maximum run times.

  3. A service level is attached to a project, which is a group of users led by a principal investigator (PI) who controls (or who can apply for) a line of funding for HPC.
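
As an illustration of how the core hour price is derived, the sketch below divides the total annual cost of the facility by the number of core hours expected to be sold at 95% utilisation. All figures shown are purely hypothetical placeholders, not the actual Darwin costs or core count.

# Minimal sketch of the unit-cost calculation; all figures are hypothetical.
annual_cost=1500000        # total annual cost in GBP: depreciation, power, staff
cores=10000                # total CPU cores in the system (placeholder)
hours_per_year=8760        # 24 hours x 365 days
utilisation=0.95           # fraction of core hours assumed to be paid for

# Cost per core hour in GBP (roughly 0.018 with these placeholder figures)
echo "$annual_cost / ($cores * $hours_per_year * $utilisation)" | bc -l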

Service levels

For the first two years following the start of production service in February 2007, we employed a mixed mode of operation with largely non-paying users and a small but growing paying user base. This involved two simple service levels, namely paying and non-paying, associated with two different qualities of service (QOSs). These service levels allowed the system to differentiate between paying and non-paying users and provide the paying users with an improved, guaranteed throughput relative to non-payers.

The balance of usage has since shifted significantly towards paying users. In response, the service levels were redefined to ensure both that paying users retain the high level of throughput they pay for and that the reduced amount of free time available is shared fairly among non-paying users.

The new Service Levels (SLs), which came into operation on 1st February 2009, are described below.

Funding units are in the form of usage credits. In all cases, 1 credit = 1 CPU core for 1 hour. Please note that in order to provide guaranteed resources, the minimum allocation to a job in all cases is a single node (i.e. 12 cores for Westmere and 16 cores for Sandy Bridge), and allocations consist of whole numbers of nodes.
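
As an illustration of whole-node charging, the sketch below shows how many credits a hypothetical job would consume. The cores-per-node figures are those quoted above; the job size and run time are arbitrary examples.

# Hypothetical job: 20 cores requested on Sandy Bridge nodes for 10 hours.
cores_requested=20
cores_per_node=16          # Sandy Bridge; use 12 for Westmere
walltime_hours=10

# The allocation is rounded up to whole nodes, and charging follows the allocation.
nodes=$(( (cores_requested + cores_per_node - 1) / cores_per_node ))
credits=$(( nodes * cores_per_node * walltime_hours ))
echo "$nodes nodes allocated, $credits credits consumed"    # 2 nodes, 320 credits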

Accounting periods, or quarters, are the three-month periods of the year running 1st February - 30th April, 1st May - 31st July, 1st August - 31st October and 1st November - 31st January.

Service Level 1 – Guaranteed Usage

Service level 1 (SL1) operates with the highest quality of service, QOS1, and is designed for groups which require large amounts of computer time.

Funds paid will be converted into core hour credits. These credits will be divided over an agreed time period and allocated to quarterly (three month) accounting periods; thus the number of core hours available in each quarter is defined at the start of the usage agreement.

Furthermore, the quarterly allocation of core hours is set as a minimum usage guarantee: SL1 users running a consistent workload throughout a quarter are guaranteed to be able to use their quarterly allocation.

Usage credits left unused at the end of an accounting quarter will expire and be transferred to an expired credit account.

SL1 users are able to use more than their allotted allocation within a quarter on a best efforts basis, by either using expired credits or transferring credits from a future allocation quarter.

Expired credits are (usually) made available automatically to a project once it has exhausted its credits within the current quarter, in the manner of credits assigned under SL2 (no guaranteed rate of usage but also no further time limit on use within the lifetime of the service level agreement).

The transfer of credits from a future quarter should be arranged directly with the support personnel.

Once both normal credits and expired credits have been exhausted, further jobs submitted will be handled under the terms of SL4 (Residual Usage). It should be noted that SL4 is the lowest service level on the system and SL4 jobs will only run when there are no other eligible jobs in the queue, i.e. when the system is not fully occupied. It is not possible for SL1 users to choose SL4 while they still have usable credits. SL4 is designed to help keep the system fully occupied at times of low usage, not as a free way for paying users to submit jobs.

Service Level 2 – Ad Hoc Usage

Service level 2 (SL2) is the same as SL1 except that there is no preallocation of credits into specific quarters, no predefined minimum quarterly usage level, and usage credits do not expire at the end of the quarter. Instead, credits are created at the same rate as under SL1 and remain available for use until exhausted. This service level has the highest quality of service, QOS1, and is designed for groups which require smaller amounts of computer time.

When SL2 users exhaust their credits they move down to their next eligible service level, unless more credits are purchased.

Please contact support'at'hpc.cam.ac.uk for enquiries about the cost of core hours under SL2.

Purchase orders generated before 1st August 2014 should be sent to the Office of the School of Physical Sciences at 17 Mill Lane (for the attention of Ms Kamila Lembrych-Turek).

Purchase orders generated on or after 1st August 2014 should be sent to the UIS for the attention of Fay Hider at the following address:

University Information Services
Roger Needham Building
7 JJ Thomson Avenue
Cambridge CB3 0RB

Service Level 3 – Free Usage

Service Level 3 (SL3) operates with the medium quality of service, QOS2, which is lower than the QOS1 used in SL1 and SL2. This service level is designed for groups with medium usage requirements who currently do not have funding to pay for their usage, so an immediate conversion of funds into credits is not required.

SL3 is capped at a maximum usage of 200,000 core hours per quarter. This cap has been introduced to promote more even usage of the free time on the system.

There is no guaranteed minimum usage level for SL3 and there is no concept of expiry and recycling of core hours across quarters.

Once a group in SL3 has consumed all its allowed core hours in a quarter, users will default to Service Level 4.

Service Level 4 – Residual Usage

Service Level 4 (SL4) operates with the lowest quality of service, QOS3. QOS3 is the lowest quality of service in operation on the cluster. This is the default service level for users who are not eligible for higher service levels.

SL4 jobs run only when there are no other eligible jobs in the queue, and only jobs with small core counts can run. Users relying on SL4 to run a job can expect very long wait times.

SL4 is designed to make use of otherwise unutilised compute cycles by allowing SL1, SL2 and SL3 users who have reached their core hour limits to continue to run jobs.

QOS descriptions

QOS1 – highest quality of service

  1. QOS1 jobs have the highest priority and will move through the queue fastest.
  2. QOS1 jobs have a maximum job run time of 36 hours.
  3. QOS1 jobs have a 1024 core maximum limit on the number of cores that a single job can take.
  4. QOS1 jobs similarly have a 1024 core limit on the number of cores in use at one time.

QOS2 – medium quality of service

  1. QOS2 jobs have a lower priority than QOS1 jobs and will move through the queue more slowly.
  2. QOS2 jobs have a maximum job run time of 12 hours.
  3. QOS2 jobs have a 256 core maximum limit on the number of cores that a single job can take.
  4. QOS2 jobs have a 256 core limit on the number of cores in use at one time.

QOS3 – lowest quality of service

  1. QOS3 jobs have the lowest priority and will only run when there are no eligible QOS1 or QOS2 jobs in the queue.
  2. QOS3 jobs have a maximum job run time of 12 hours.
  3. QOS3 jobs have a 72 core maximum limit on the number of cores that a single job can take.
  4. QOS3 jobs have a 72 core limit on the number of cores in use at one time.
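
These QOS limits are enforced by the SLURM scheduler at job submission time. The batch script below is a minimal sketch of a job request that fits within the QOS1 limits; the account, partition and QOS names are placeholders rather than the actual names configured on the system, and the executable is purely illustrative.

#!/bin/bash
# Sketch of an SL1/SL2 (QOS1) job request. The account, partition and QOS
# names are placeholders, not the values actually configured on Darwin.
#SBATCH --job-name=example
#SBATCH --account=MYPROJECT        # project whose credits will be charged (placeholder)
#SBATCH --partition=sandybridge    # placeholder partition name
#SBATCH --qos=qos1                 # placeholder QOS name
#SBATCH --nodes=4                  # 4 x 16 cores = 64 cores, well within the 1024 core cap
#SBATCH --ntasks-per-node=16       # whole nodes are allocated in any case
#SBATCH --time=36:00:00            # must not exceed the 36 hour QOS1 limit

mpirun ./my_application            # placeholder executable

The QOS limits actually configured on the system can be inspected with the standard SLURM command sacctmgr show qos, or by contacting the support address above.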

Storage policies

User space on the home directory filesystem (/home) is limited to 40 GB per user. The contents of this filesystem are backed up daily, meaning that damaged or accidentally deleted files can be restored (with certain restrictions and for a limited amount of time).

The amount of space currently being used can be seen via the command quota:

[cvsupport@bindloe04 ~]$ quota
====================================================================================
Usage on /home (lfs quota -u cvsupport /home):
====================================================================================
Disk quotas for user cvsupport (uid 589):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /home 7191072  41943040 47185920           40843       0       0        
====================================================================================
Usage on /scratch (lfs quota -u cvsupport /scratch):
====================================================================================
Disk quotas for user cvsupport (uid 589):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
       /scratch     220       0       0               3       0       0        
====================================================================================

Where:

kbytes      ->  Space currently in use (in KB)
quota       ->  Soft limit (in KB)
limit       ->  Hard limit (in KB)

Note that this command will only provide information about the user who invokes it.
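
The quota wrapper appears to invoke the Lustre lfs quota command shown in its output headers; if preferred, the underlying command can be run directly against either filesystem, for example:

lfs quota -u $USER /scratch     # report the invoking user's usage and limits on /scratch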

Availability and planned maintenance

General availability

Every reasonable effort will be made to keep our HPC resources available and operational 24 hours per day and 7 days per week.

Please note however that although the support personnel will do their best to keep the facility running at all times, we cannot guarantee to resolve problems promptly outside UK office hours, or during weekends and public holidays. Nevertheless, please notify support'at'hpc.cam.ac.uk of issues whenever they arise.

Planned maintenance

Occasionally it is necessary as part of maintaining a reliable service to update system software and replace faulty hardware. Sometimes it will be possible to perform these tasks transparently by means of queue reconfiguration in a way that will not disrupt running jobs or interactive use, or significantly inconvenience users. Some tasks however, particularly those affecting storage or login nodes, may require temporary interruption of service.

Where possible, maintenance activities involving a level of disruption to service will be scheduled on:

Tuesdays, 10:00-18:00 (local UK time).

Please note that this does not mean that there will be disruption at this time every week, merely that if potentially disruptive maintenance is necessary we will do our best to ensure it takes place during this period, in which case there will be advance notification.

Establishing a predictable time slot for planned maintenance has the advantage that users may be confident that 'dangerous' changes will not intentionally be undertaken at other times. Unfortunately the potential for unplanned periods of disruption is a fact of life - please see the next section.

Exceptional maintenance and unplanned disruptions

It may happen that, despite best efforts, it becomes necessary to reduce or withdraw service at short notice and/or outside the planned maintenance time slot. This may happen for environmental reasons, such as an air conditioning or power failure, or in an emergency where immediate shutdown is required to save equipment or data.

It is hoped that these situations will arise rarely, although it should be noted that power cuts in this part of Cambridge are unfortunately not as rare as one would like. Obviously in such cases service will be restored as rapidly as possible.

Additional policies

  1. It should be noted that jobs which require a small number of cores will turn around more quickly, as will jobs with shorter wall times. However, it is most efficient to amalgamate many small jobs into fewer, larger or longer batch jobs.
  2. The usage policy described here will be reviewed periodically and is subject to change.

Final comments

These usage rules have been constructed in a way that allows for flexibility. For the HPC Service to survive we need to recover costs, but at the same time we need to allow users who have limited funding to access the machine on a pump-priming basis.

Overall we want to provide the best quality of service to all users so please take time to think how we may improve the service you receive and discuss your thoughts and ideas with the Service Director, Dr. Paul Calleja (director'at'hpc.cam.ac.uk).