Frequently Asked Questions
You should not have to do this if you are logging in from (or via) a machine connected to the CUDN (the Cambridge University Data Network), which is likely to be the case if the machine is in a University Department or College. If the machine is elsewhere, and you have SSH access to a system within the CUDN, then follow the procedure described below. If neither of the above is true, it is necessary to register an additional IP address to allow direct SSH connection to the HPCS systems. Ideally, the machine you are sitting at has a static, public IP address - please ask local IT support whether it does. In any event, the machine requires internet access, and browsing to the site
should report the public IP address from which traffic from your machine will appear to originate, and which would have to be registered in order for SSH connections to us to be accepted. Please note however that registering a dynamic or gateway address, which is implied if your machine does not have an address which is both static and public, is strongly disfavoured for reasons of security.
The first time a SSH connection (whether via the ssh UNIX command or via putty) is made to Darwin, you will be asked to accept the current Darwin SSH public host key. A "fingerprint" will be presented - please only accept the key (by typing the complete word yes on UNIX) if the fingerprint matches the following value:
If you are presented with a different fingerprint, something is wrong (the machine to which you are connecting is NOT the intended machine). Please do NOT accept the key under these circumstances (the connection will then fail) but contact support'at'hpc.cam.ac.uk as soon as possible.
Yes, using the usual UNIX command passwd. However, the new password must be of equivalent strength (or better!) to the randomly generated password you were originally issued, i.e. at least 8 characters, using mixed case letters and some non-alphanumeric characters, plus the usual traps to avoid (please see here).
A weak password, for all our other precautions, can allow a compromise that would not only endanger the victim's own data, but that of others, and also put the service itself at risk. This is because it's generally much easier to attack other accounts when one account is already controlled. It's also much easier to compromise an account on a second system after an initial intrusion (a strong password on the second machine doesn't help if an intruder is watching you type it in). All passwords should be treated with the same care as other personal data such as bank and credit card numbers.
Please note that we will from time to time run open source software designed to identify weak passwords.
The short answer is yes - the support staff do this routinely. Note that in general it is not necessary to register your home IP address to do this (in fact this is undesirable since it is probably dynamic), provided you have SSH access from home to a registered machine (e.g. your work system attached to the CUDN). If this is not the case, you may wish to investigate the VPDN service provided by the UCS which will in effect join your local machine to the University network via an encrypted tunnel (at which point SSH to Darwin will work directly). If on the other hand there is a convenient machine to which you can SSH directly from outside, and from which you can already SSH to Darwin, then please keep reading.
The simplest approach is to ssh (or putty) first to the registered machine, then ssh again from within the session to Darwin, as you would normally. In order for X applications to work transparently, both ssh connections would need to have X11 forwarding enabled (with OpenSSH this may require the -X option or, if that produces strange results, -Y).
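As a minimal sketch (userid is a placeholder, and mycomp.mydept.cam.ac.uk is the example registered host used later in this FAQ), with X11 forwarding enabled on both hops:

ssh -X userid@mycomp.mydept.cam.ac.uk
ssh -X userid@login.hpc.cam.ac.uk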
The first connection to the registered host must be encrypted! I.e., even if the registered host for some reason accepts telnet from outside (it shouldn't!), use ssh. Otherwise your Darwin password will travel the first leg of its journey unencrypted, as indeed will the password to your work machine. This is not safe.
Please don't try this from an untrustworthy computer (e.g. in an internet cafe). Keylogging software designed to harvest passwords and bank account details as they are typed poses a real threat. Similarly you should always keep system, anti-virus and anti-spyware software up to date on all your personal computers to maintain their trustworthiness.
Finally, on machines with X servers, always ensure X11 security is turned on. In particular, never use xhost if this is recommended to you as a remedy for X application problems, as this can easily allow anyone on the internet who can talk to your machine to take complete control of your display, which implies the ability to read what you are typing. Since remote X applications should work transparently through the offices of SSH, in a secure way, the ancient and highly dangerous xhost command is not the solution to the problem you are having.
The simple method above of logging in via ssh twice to get to Darwin can be inconvenient if you wish to do it several times, or need to transfer files via scp, sftp or rsync. A more sophisticated method uses tunnelling through the initial SSH connection to the registered host, to allow subsequent SSH connections (as used by ssh, scp, sftp or rsync) to take place directly from the local machine to Darwin. The following method is known to work with OpenSSH on Fedora Linux and has been reported to work on MacOSX, but note that the simplest way to transfer files if you are using Windows is to use WinSCP which has an advanced option for automatically setting up a tunnel to a registered host.
- Create a localhost alias called login-hpc by editing the localhost line of /etc/hosts (this is the only step which needs to be performed as root):

127.0.0.1 localhost other_names login-hpc
- Add two sections to the ~/.ssh/config file:

Host login-hpc
    Port 22000
    ForwardX11 yes
    ForwardX11Trusted yes
describing the above alias, and another section referring to the registered system, mycomp.mydept.cam.ac.uk:
Host mycomp.mydept.cam.ac.uk
    GatewayPorts no
    # Darwin SSH
    LocalForward 22000 login.hpc.cam.ac.uk:22
Note that if you need to create ~/.ssh/config, ensure that it has sufficiently strict permissions by doing
chmod go-w ~/.ssh/config
The end result is that after logging in via SSH to mycomp.mydept.cam.ac.uk in the usual way, it becomes possible to contact login.hpc.cam.ac.uk directly via SSH, simply by doing:

ssh login-hpc
from your local machine (and similarly scp, sftp, rsync). The initial connection to mycomp established a tunnel from the local machine (port 22000) to login.hpc.cam.ac.uk (port 22); subsequent SSH connections made to login-hpc are actually made to the local end of this tunnel. Information sent through these is encrypted (again), emerges from the tunnel at mycomp and then travels normally to Darwin. Since the traffic appears to originate from a registered host, access is permitted. Other services restricted to .cam can be accessed (securely) from home or elsewhere by similar methods.
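For example, once the tunnel is up, file transfers can use the same alias (the file and directory names below are purely illustrative):

scp results.dat login-hpc:
rsync -av mydata/ login-hpc:mydata/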
The login nodes are intended primarily for code compilation and development, job submission and monitoring, and data post-processing and management. In terms of hardware they are identical to one of the types of compute node (e.g. login-sand nodes match the sand(y bridge) compute nodes) but without functioning Infiniband. Although they can in principle run codes using 1-4 cores in shared memory, please note that they are not intended to run production code; that is why the compute nodes and batch queue system exist.
Small scale testing is permissible, but please be aware at all times that these nodes are shared resources, and use the command nice -n 19 to launch programs on the command line, being careful to monitor system load and available memory with the top and free commands respectively (see their man pages for further details).
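For example (the program name is purely illustrative):

nice -n 19 ./mycode input.dat &
top       # monitor the system load
free -m   # check available memory (in MB)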
However it is preferable, especially if interactivity is not essential, to package short jobs as batch jobs and submit them to the queues. Note that there are often one or two compute nodes free which are waiting to be used by large jobs still gathering sufficient nodes. These waiting nodes are available to run short jobs, provided the queueing system knows that any such jobs will have finished before the nodes are needed (i.e. set the walltime accurately, using default values won't work). This is referred to as backfilling, and is a feature provided by the Maui scheduler (amongst others).
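For instance, a job known to finish within half an hour should declare this explicitly rather than inherit the default walltime:

#PBS -l walltime=00:30:00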
The batch system on Darwin is a combination of Torque, which handles the submission of jobs and their execution on the compute nodes, and Maui, which performs scheduling, i.e. decides what job runs when on what. Each piece of software has its own set of commands, e.g. qsub, qstat, qdel, qalter, qhold, qsig,... belong to Torque, whereas showq, showstart, checkjob, canceljob,... are part of Maui.
Torque is a species of PBS (Portable Batch System), to be exact a fork of OpenPBS. Like OpenPBS, it is open source. It is not PBS Pro, which has a similar ancestry but is a commercial product. The PBS heritage of Torque shows up in the names of the commands and the appearance of '#PBS' at the beginning of directive lines in submission scripts.
Maui replaces Torque's simple native scheduler and is an open source relative of the commercial product Moab. It allows us to schedule resources using mechanisms such as fair shares, credits, reservations and quality of service.
I.e. what values of n, p and M should be used in the batch script directive:
#PBS -l nodes=n:ppn=p,mem=Mmb    [*]

or, equivalently, on the command line as
qsub -l nodes=n:ppn=p,mem=Mmb jobscript ?
Firstly, work out how many processes (N) the job will consist of (this number will be the same as the total number of processor cores allocated). For example, a simple MPI code creating 32 MPI processes will be a 32 task job. However, a run of cosmomc with 8 chains, using OMP_NUM_THREADS=2, will spawn 8 MPI processes, each of which will split via OpenMP into 2 working threads, so in this case there will be 16 processes (and processor cores) associated with the job (but this case has additional features, please see Hybrid OpenMP/MPI codes below).
Secondly, decide how much memory per process will be required (call this m megabytes). Note that each Sandy Bridge compute node in practice has less than 64000mb (the memory visible to Maui) usable by job processes, running on a maximum of 16 cores. (Note in what follows that the Westmere nodes have 36000mb and 12 cores.)
With this in mind, use M = N * m, and choose p and n such that:
p is as large as possible, subject to p < 64000/m and p ≤ 16;
n is as small as possible, subject to n * p ≥ N.
Most frequently, p = 16, n=N/16 (but remember to check that m is not too large, i.e. greater than 4000mb, before assuming this is true). Note that Maui may adjust n and p automatically if it appears p can be increased consistently with the memory requirements, in the interests of efficiency. Note also that Maui allocates (and charges for) entire nodes to jobs, so cores left idle for reasons of memory cannot be used by other jobs, and will incur a core hour charge. Finally, there will actually be n * p cores allocated to the job - if this is larger than N, or if not all tasks are MPI processes (e.g. see next paragraph), some extra care may be needed in the job submission script where it launches the code.
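As a worked example (the numbers are purely illustrative): suppose the job consists of N = 32 MPI processes, each needing m = 3000mb, on Sandy Bridge nodes. Then 64000/m ≈ 21, so p = 16; n = 2 (the smallest n with n * p ≥ 32); and M = N * m = 96000. The resulting directive would be:

#PBS -l nodes=2:ppn=16,mem=96000mb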
Hybrid OpenMP/MPI codes (e.g. cosmomc): Hybrid OpenMP/MPI codes (of which cosmomc is one example in use on Darwin) introduce a further issue, namely the need to accommodate both MPI processes and OpenMP threads.
Generally, mpirun understands only MPI processes and will, left to itself, start MPI processes on all allocated cores at run time. In the case of a hybrid code, each MPI process will then split into multiple OpenMP threads, potentially leading to massively overloaded nodes. Thus for such codes the mpirun command line needs to specify explicitly how many MPI processes (chains in the cosmomc context) to start per node. It is simplest to give it one number for this (call it c), which must therefore divide the total number of MPI processes exactly (not a problem since 1 is an allowed value). mpirun should then launch the job at the end of the submission script as follows (here referring to the chosen value of OMP_NUM_THREADS as y):
export OMP_NUM_THREADS=y
mpirun -ppn c progname options ...    [**]

The -ppn option specifies how many MPI processes to start on each node; in fact ($OMP_NUM_THREADS * c) threads will be created per node, and the load on any one node will be ($OMP_NUM_THREADS * c), hence this number should be no greater than 16 on Sandy Bridge nodes (and 12 on Westmere nodes, and 8 on Tesla nodes). Note that mpirun propagates the environment it receives to the processes it starts, so it should be possible to set the value of OMP_NUM_THREADS here in the submission script, rather than in ~/.bashrc.
Finally, Maui must be forced to allocate exactly ($OMP_NUM_THREADS * c) cores per node. As mentioned previously, by default, it may for reasons of efficiency allocate more than this per node if m is small enough. This would be disastrous because the mpirun command above is going to start precisely ($OMP_NUM_THREADS * c) threads on each node whether this matches the number of cores allocated there or not. It clearly makes no sense to choose values of c and OMP_NUM_THREADS such that the actual memory used per node would be greater than the maximum memory available, so m must already be less than or equal to total_memory_per_node/($OMP_NUM_THREADS * c); adjust m upwards to equal this maximum value (and modify the total memory request M accordingly). This will ensure that the allocation consists of a whole number of nodes, with ($OMP_NUM_THREADS * c) cores on each, as required by the mpirun command above.
Putting all this together, the prescription is slightly different for the parameters in the PBS directive [*] in the case of hybrid OpenMP/MPI jobs. It's most natural to begin with the number of MPI processes per node, c, the number of OpenMP threads per MPI process, OMP_NUM_THREADS=y, and the overall number of nodes n. These are constrained by (assuming Sandy Bridge nodes):
c * y ≤ 16
memory required by c * y threads on a node < 64000mb
The total number of MPI processes will be n * c, but the total number of working threads (and cpu cores required) will be n * c * y. Then in the PBS directive [*] take p = c * y, and M = n*64000. At the bottom of the submission script, set OMP_NUM_THREADS=y and add a -ppn c option to mpirun, as in [**].
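As a concrete sketch (all values are illustrative): take c = 4 MPI processes per node, y = 4 OpenMP threads per process, and n = 2 Sandy Bridge nodes, so that c * y = 16, the job uses n * c = 8 MPI processes, and n * c * y = 32 cores in total. The relevant parts of the submission script would then be:

#PBS -l nodes=2:ppn=16,mem=128000mb
# ... rest of the submission script ...
export OMP_NUM_THREADS=4
mpirun -ppn 4 progname options ...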
The Maui scheduler uses a sophisticated algorithm to determine which job should run when, depending on factors such as what quality of service the job has requested, how long the job has been waiting, whether the project to which the user belongs is achieving its "fair share" of resources, and whether the job can "backfill" resources reserved by a larger job already scheduled to start.
More technically, priority on Darwin is based on QOS factor, fair share factor and queue time factor, where fair share factor is based on group level. For more information about what this means, please see these vendor links:
The answer to the practical question, "When will my job start?", can be found using the command:

showstart <jobid>
but the time reported is only an estimate and may change, if, for example, other jobs finish earlier than expected, or if higher priority jobs are submitted (see this answer).
Please bear in mind that for a busy system, on which jobs typically have a running time of 12 or 36 hours, it should not be considered unreasonable to have a large job wait 12 hours in the queue before it receives an opportunity to run. Also the scheduler can only operate on the basis of the information provided at submission time - if you have a 30 minute job but just use the default of 12 hours' walltime when submitting, expect to wait longer to run than if you'd stated 30 minutes correctly. This is obvious, because it is easier to find room for smaller blocks of resources than larger ones (at least on a finite cluster) and the scheduler here would assume you needed a block of core hours 24 times too big.
showstart provides only a first order estimate, and can only base this on the current information possessed by Maui. This information can be superseded as a result of a number of events:
- Some running job ends earlier than its requested walltime, or jobs higher up in the queue are removed, or modified so as not to be eligible;
- the scheduler finds an opportunity to backfill the reservation of a large job;
- we make more nodes available for user jobs.
(These will tend to bring showstart estimates forward; the following events however will have the reverse effect.)
- if a user with higher priority than you (e.g. a payer who hasn't used the system for a while, or who simply needs to receive a larger share of current resources to achieve their guaranteed service level) submits a new job, or has a previously ineligible job made eligible, then their job will be promoted past yours in the queue;
- we take nodes offline either for administrative use, or to diagnose a problem;
- a node allocated to your job appears to be still busy when the time comes to start, perhaps because the previous job caused a very high load or worse problems before exiting, e.g. lustre evictions, out of memory conditions etc. In this case the scheduler will try to start the job for about 10 minutes before deferring the job to try again later (a new node may need to be found).
Because of the overhead incurred by the scheduler processing each job submitted, which is particularly serious when the jobs are small and/or short in duration, it is generally not a good idea to simply run a loop over qsub and inject a large set of small/short jobs into the queue. It is often far more efficient from this point of view to package such little jobs up into larger 'super-jobs', provided that each small job in the larger job is expected to finish at about the same time (so that cores aren't left allocated but idle). Assuming this condition is met, what follows is a recommended method of aggregating a number of smaller jobs into a single Torque/PBS job which can be scheduled as a unit.
There exists a feature of Torque/PBS called job arrays which is designed to allow easy submission of many similar PBS jobs, but note that since each job on Darwin will be allocated at least one entire node, it may still be necessary to aggregate multiple small jobs in order to create PBS jobs which make full use of a 16- or 12-core node.
For simplicity this example assumes the small jobs are serial (one-core) jobs. Firstly, group the small jobs into sets of similar runtime (ideally each set should contain a multiple of the cores-per-node, with all members expected to finish close together), and package each set of N as a single N-core job as follows (a worked example is given after the notes below). The PBS directives at the top of the submission script should specify:
#PBS -l nodes=<N/p>:ppn=p,mem=<(N/p)*M>mb
where p=12, M=36000 for Westmere and p=16, M=64000 for Sandy Bridge, and the <>s above should be replaced by the results of the trivial calculations enclosed. (Note that if p single core tasks require more memory than is available on one node, i.e. Mmb, the job should be spread over more than N/p nodes with a smaller value for ppn.) Then instead of the usual

$application options > output 2> error
at the end of the submission script to launch one serial job, launch N via something like the following. Note that this makes use of the Ohio SC version of mpiexec, which can be used even though the serial job uses no MPI. Unfortunately this version of mpiexec won't work with codes using OpenMPI, or Intel MPI (which is the default on the Sandy Bridge/Westmere/GPU nodes), but should work with MVAPICH2. For an alternative method which will perform a similar aggregation for Intel MPI "joblets" smaller than one node each, please see the joblets script.
module load mpiexec

cd directory_for_job1
mpiexec -comm none -n 1 $application options_for_job1 > output 2> error &
cd directory_for_job2
mpiexec -comm none -n 1 $application options_for_job2 > output 2> error &
...
cd directory_for_jobN
mpiexec -comm none -n 1 $application options_for_jobN > output 2> error &
wait
Note in the above:
- the use of mpiexec (not mpirun) with the options -comm none -n 1, which mean that in this case, the application isn't using MPI and just needs 1 core (being serial). We are simply using the job launch functionality of mpiexec in this example, but we could alter the arguments to launch parallel MPI `small' jobs instead of serial ones (in the case of MVAPICH2, add the -comm pmi option to select the correct parallel launch protocol);
- the > output 2> error which direct the stdout and stderr to files called output and error respectively in each directory for the corresponding job (obviously you can change the names of these, and even have the jobs running in the same directory if they are really independent);
- the & at the end of each mpiexec line which allows them to run simultaneously (the mpiexecs will cooperate and take different cores out of the set allocated by Torque/PBS);
- the wait command at the end, which prevents the job script from finishing before the mpiexecs.
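As a worked instance of the directive above (numbers purely illustrative): to pack N = 32 serial jobs of similar runtime onto Sandy Bridge nodes (p = 16, M = 64000), the directive would be:

#PBS -l nodes=2:ppn=16,mem=128000mb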
No, in fact it's good for you.
There is nothing wrong with submitting large numbers of single core jobs; this is a mode of use that some people want and which we allow. There is very little difference between 500 single core jobs and a single job of 500 cores, which we also allow. They do not in any sense clog up the queue, because the queue is not simply "first in, first out", and it is not unfair. It may well be true that their waiting times are shorter, but this does not mean that they have an unfair advantage; it just means that small jobs are easier to schedule, and this does not occur at the expense of parallel jobs.
The presence of a large set of single core jobs in the queue adversely impacts other users rather less, in general, than the equivalent single multicore job. The system adjusts job priorities so that each user has a fair share of resources. Thus if factors such as the length of time you have been waiting, your quality of service and the amount of resources recently received by your project determine that you should have a higher priority at the moment than someone else, your job will have nodes reserved for it before theirs. However, if your job is multinode, extra time is usually required to gather sufficient nodes because these may not all become available at the same time. Thus nodes already reserved are in danger of staying idle until the rest are available, which is inefficient. Under these circumstances the scheduler may assign some of those waiting nodes to other jobs, provided those jobs will finish before your multicore job expects to start. This is called backfilling, and explicitly does not delay the expected start time of your job.
A virtue of small jobs, i.e. jobs using small numbers of cores and/or requesting short amounts of time, is that they can often be used in backfilling (provided that PBS is given correct values at submission time). This is simply a consequence of smaller jobs being able to fit in a wider range of holes, and as a general rule such jobs will probably wait for less time in the queue. This does not mean that over time users exclusively submitting small jobs gain more resources, because as they use up their fair share, so their priority diminishes and others are looked at in preference. Nor does it mean that submitters of large jobs with higher priority have to wait longer, because their nodes are allocated first. In fact overall, small jobs are good because they allow the scheduler to keep more nodes busy through backfilling, thereby increasing overall throughput. On the other hand, larger jobs tend to have the opposite effect: if 500 single core jobs were repackaged as a single 500-core job, backfilling would be eliminated and 500 cores would need to become idle before the job could start - waiting times for everyone would increase, because overall the queue would process the workload more slowly.
Add a nodeset clause like the following to your #PBS -l directive:
#PBS -l nodeset=ONEOF:FEATURE:CU1,nodes=32:ppn=4,walltime=...
where this will select nodes from computational unit A (node-a01,...,node-a65). Similarly CU2-9 for units B-I.
Maui is currently configured to only ever run one job at one time on any one node, thus you are guaranteed exclusive access. Please note that you will be charged for the entire node and should therefore seek to use either all cores or most of the memory.
In your bash initialization file ~/.bashrc, add a line like:
ulimit -Ss 102400
where this raises the (soft) stack limit to 100MB (the value is in KB). This can also be issued on the command line to affect the current shell; however, further steps are required in order to effect this change for batch jobs (in particular, modifying ~/.bashrc is insufficient). Instead, replace the MPI binary with a short shell script which performs the above ulimit, then execs the real binary.
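A minimal sketch of such a wrapper (the path to the real binary is a placeholder); make the script executable and point mpirun at it instead of the binary:

#!/bin/bash
# Raise the soft stack limit for this process, then replace
# ourselves with the real executable, passing all arguments through.
ulimit -Ss 102400
exec /path/to/real_binary "$@"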
Although we currently allow users to set up cron jobs on the login nodes to perform regular tasks (e.g. submit jobs), these will disappear following a reboot of the login node. This is because the cluster integration software on Darwin is image-based, and each node is returned to a pristine function-specific software image on every reboot.
Please note that any cron job should take care to fail safely - for example, a cron job which is found to be flooding the batch queues with malformed jobs is very likely to be deleted by the support staff.
Some users have reported that their codes sometimes die with errors such as:
forrtl: severe (10): cannot overwrite existing file, unit N, file filename
where N is some integer and filename is the name of an output file. This is usually seen during parallel (but not serial) runs, and the error is seen multiple times (once for each of a subset of MPI tasks which die). Codes which have exhibited this problem include siesta and vasp.
A possible explanation for this is the following. When multiple tasks of a parallel job attempt to open the same file for writing, and that file does not already exist, then they each attempt to create the file. A race then takes place among the tasks which have noticed that the file needs to be created: one task wins (and successfully creates the file), while the rest receive "cannot overwrite existing file" errors and die, the job then dying also. The set of tasks which die from this error may be smaller than the set of all remaining tasks, either because not all attempt to write to the file, or because some see that the file has been created when they check for it.
An effective workaround has been found to be to ensure that the output files already exist before the job starts. It is sufficient to create them as empty files using the command touch. For example, if you have seen the error
forrtl: severe (10): cannot overwrite existing file, unit 99, file /scratch/abc123/work/AECCAR0
then you would perform the command

touch /scratch/abc123/work/AECCAR0
before retrying. In fact there may be several files that need to be created and which may cause this error, so ideally one would create all of them as empty files before attempting to run the job again.
Old binaries compiled against Intel MKL 9.0.018 may fail with messages similar to libmkl.so: file too short due to changes in MKL 10. If this affects you, please recompile your binary and advise support'at'hpc.cam.ac.uk if you continue to have problems.
Intel MKL versions 10.x prior to 10.2.2.025 are deprecated. Unfortunately Intel have made linking somewhat more complicated in recent versions.
For example, a common dynamic link line for LAPACK and BLAS previously contained:
-lmkl_lapack -lmkl -lguide -lpthread
however for 10.3.4.191 this should be replaced by
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
and the analogous static link line would contain
MKLROOT=/usr/local/Cluster-Apps/intel/mkl/10.3.4.191/composerxe-2011.4.191/mkl

$(MKLROOT)/lib/em64t/libmkl_lapack95_lp64.a -Wl,--start-group $(MKLROOT)/lib/em64t/libmkl_intel_lp64.a $(MKLROOT)/lib/em64t/libmkl_intel_thread.a $(MKLROOT)/lib/em64t/libmkl_core.a -Wl,--end-group -L$(MKLROOT)/lib/em64t -liomp5 -lpthread
where MKLROOT is set and used here as in a Makefile. Note that MKLROOT is also an environment variable set by the intel/mkl module.
The Intel MKL user documentation can be found on darwin at
The Intel® Math Kernel Library Link Line Advisor is a useful online tool for constructing MKL linker lines. The relevant architecture for Darwin is Intel64. Please note that the INCLUDE, MKLROOT, LD_LIBRARY_PATH, LIBRARY_PATH, CPATH, FPATH and NLSPATH environment variables are already set correctly by the mkl module, so please don't change these if tempted by the advice on the link advisor page.
Please advise support'at'hpc.cam.ac.uk if you continue to have problems.
There are two common cases. Firstly, when attempting to start an MPI job using mpirun on a login node, you see an error:
InfiniPath interconnect not detected.
This is simply because the login nodes do not contain QLogic (previously known as InfiniPath) hardware. The correct procedure here is to add the option
at the front of the mpirun command line, in order to skip the check for QLogic/InfiniPath hardware and force the QLogic MPI to operate in shared memory mode.
The second, less common case is when jobs run on the compute nodes die immediately with an error stating that no InfiniPath HCA could be found. This has been caused by old software explicitly loading the old v2.2 InfiniPath libraries - these fail to recognise the v2.3 drivers now controlling the node hardware. Centrally installed software known to be affected by this issue has now been fixed, but if you have created your own modules, please check that they do not contain explicit references to infinipath/core/2.2 or infinipath/mpi/2.2 (omitting the '/2.2' should be sufficient to load the latest version). It is also possible to compile software with the obsolete library locations hardwired into the DT_RPATH attribute of the binary, in which case the software should be recompiled in the current user environment.
The HPCS uses the Gold allocation manager to manage the usage of computational resources. Gold is an open source accounting system built upon the model of a bank in which notional resource credits are deposited into accounts attached to each project. Credits are withdrawn from these Gold accounts as resources are consumed by the project. Depending on how the Principal Investigator (PI) of the project has chosen to organise it, there may be either a single Gold account shared between all user accounts associated with the project, or an individual Gold account assigned to each user account, with a project coordinator responsible for distributing credits between accounts.
Since 1st February 2009, most projects have at least two Gold accounts associated with them. Typically, there is an account with the same name as the project which holds time-limited credits (SL1 and SL3) and a second account with name ending in -extra holding credits without an expiry date (e.g. expired credits for a project at SL1 or ad hoc credits purchased at SL2). Projects with per-user Gold accounts may depart from this scheme.
Jobs using all qualities of service other than QOS3 will result in a charge to the appropriate Gold account, usually at a rate of 1 credit per core hour of work. Higher qualities of service may be introduced over time which charge at a higher rate.
However, it should be understood that, despite the banking language employed by Gold, in most service levels there is no direct relationship between credits and real money (SL2 is arguably the exception). Gold credits are simply a bookkeeping artifice, used to keep track of project resource usage and to ensure that each project receives the correct number of (pre-allocated) machine core hours during the current accounting quarter.
Accounting quarters are three calendar months in length; in order to match the University's quarters, they run from 1st February - 30th April, 1st May - 31st July, 1st August - 31st October and 1st November - 31st January. The reasons for adopting quarters as the basis for allocation are:
- The HPC is completely self-financing and must recover its capital and running costs, therefore we need to realize revenue at a steady rate in order to continue;
- Quarters provide a structure in which we can control resource allocation and seek to provide guaranteed levels of service.
Projects requiring medium amounts of computer time may make a one-off or a series of one-off purchases of core hours under SL2 ("Ad Hoc Usage") which entitles the project to run jobs under QOS1 until the purchased core hours are exhausted.
Projects wishing to obtain large numbers of core hours over several quarters with a minimum usage guarantee may do so by contacting the Director (director'at'hpc.cam.ac.uk) with a view to running under SL1. Under this service level, the core hours themselves are allocated to each paying project at the beginning of each accounting quarter, according to the funds received by prearrangement with the PI, and are understood to be processor core hours of paid-usage machine work to be performed some time within that three month period. These paid-usage core hours are then given a representation in terms of Gold credits.
In addition to the resource usage checking provided by Gold, the Maui scheduler aims to ensure that SL1 projects receive core hours at a rate consistent with their overall allocation. (E.g. if your project has purchased half of the machine core hours available in the current quarter, then users in your project should expect, on average, to be running on half of the machine.) Thus at the end of the quarter, all paid core hours can be expected to have been used by the correct people, and all credits spent. Note, however, that this assumes a steady demand for resources by each project. Clearly, if a project purchasing half the quarter's core hours submits no work until the third month of the quarter, the full number of core hours can no longer be claimed in that quarter, even with exclusive and unrestricted access to the machine.
At the end of the quarter, unused SL1 credits expire and are transferred to an expired credits account. These accumulate and may be used, on a best efforts basis and without minimum usage guarantees, after the current quarter's active credits have been exhausted.
Use the command
This reports the current balance of credits, which equates to the number of core hours still available in the current accounting quarter (including expired credits if present). When the balance reaches zero, the system will automatically fall back to SL4 ("Residual Usage"). Note also that the much larger numbers seen when omitting the -h option refer to core seconds.
Projects seeking to buy large amounts of time over several three-month quarters, with guaranteed minimum usage levels, should contact the Director (director'at'hpc.cam.ac.uk) and discuss running under Service Level 1. Once this is set up, core hours are allocated automatically at the beginning of each quarter, until the end of the service level agreement.
Projects interested in making ad hoc purchases of core hours without guaranteed minimum usage levels under Service Level 2 should ask their department to raise a purchase order for the desired number of HPC core hours to our controlling institution (which at the time of writing is the School of Physical Sciences). Note that for orders within the University, VAT is not applicable. Please inform us at support'at'hpc.cam.ac.uk that you have done this. Ad hoc purchases may be made more than once, as desired.
This is caused by the application gpk-update-icon which is started by default by the Gnome desktop. To get rid of this, open the menu to System/Preferences/Startup Applications and untick PackageKit Update Applet. Then kill the currently running instance using the command killall gpk-update-icon.
Our current suggestion for use in academic publications etc is:
This work was performed using the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/), provided by Dell Inc. using Strategic Research Infrastructure Funding from the Higher Education Funding Council for England and funding from the Science and Technology Facilities Council.
Alternatively for resources awarded by DIRAC, the following form is preferred:
This work used the Darwin Data Analytic system at the University of Cambridge, operated by the University of Cambridge High Performance Computing Service on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). This equipment was funded by a BIS National E-infrastructure capital grant (ST/K001590/1), STFC capital grants ST/H008861/1 and ST/H00887X/1, and DiRAC Operations grant ST/K00333X/1. DiRAC is part of the National E-Infrastructure.
The Sandy Bridge nodes are the sixteen core, 64GB nodes with Mellanox FDR Infiniband interconnect which were introduced in June 2012. The name refers to the type of Intel Xeon CPU, which is eight-core and supports Intel AVX instructions. Binaries to be executed on such nodes are best compiled with Intel compilers and the flag -xAVX (alternatively, use -xSSE4.2 -axAVX for use on all current Darwin CPUs). We recommend that MPI codes use Intel MPI on Sandy Bridge nodes.
Nehalem refers to the previous generation of Intel CPU microarchitecture. We currently have two types of CPU, and node, in this class: firstly, the quad core 45nm CPUs in the eight-core, 24GB nodes attached to the Tesla GPUs (these make up the GPU cluster); secondly, the six core 32nm CPUs inside the twelve-core, 36GB nodes making up the west-m-n nodes. The latter type, a shrunken reimplementation of the original Nehalem microarchitecture, is termed Westmere. When we refer to Westmere nodes, we mean specifically the twelve core, 36GB blade nodes. From the point of view of compiling optimised binaries, the flag -xSSE4.2 should be used for both eight core Nehalem and twelve core Westmere nodes (alternatively, use -xSSE4.2 -axAVX for use on all current Darwin CPUs). The Infiniband hardware on both variants of Nehalem node is Mellanox ConnectX2 and we recommend that MPI codes should use Intel MPI, or MVAPICH2 or OpenMPI on these nodes.
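For example, to compile optimised binaries with the Intel compilers (source and output file names here are placeholders):

icc -O3 -xAVX -o mycode mycode.c                  # Sandy Bridge nodes only
ifort -O3 -xSSE4.2 -axAVX -o mycode mycode.f90    # runs on all current Darwin CPUs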
You are probably listing /scratch directly, which since October 2012 is managed by the automounter.
In order to improve the resiliency and flexibility of the scratch storage, user scratch space is distributed across several different filesystems. To make this easier to deal with, we are now using an automounter (automount and autofs) to present scratch directories consistently as paths of the form /scratch/userid, irrespective of what physical filesystem the directory actually resides on. Thus no-one need remember the low-level details of where their files are currently stored (this after all may change with time).
In practice, this should be transparent - each access (via cd, or ls, or simply by reading or writing to a specific location under /scratch/userid) should work exactly as expected, and the presence of the automounter and of the multiple filesystems behind it should be invisible (unless you use the quota command, in which case you will see usages according to physical filesystems).
However it is possible to catch the automounter in the act, and be confused by it. In particular, doing

ls /scratch
may show only a few, or perhaps no, user directories. This does not mean that the absent directories no longer exist, but that the automounter has simply not been asked to automount them. After any explicit reference to /scratch/userid on a particular node, the directory should be automounted under the /scratch parent directory and be available as expected. However, if the directory is inactive for long enough, it will disappear again (from under /scratch, but not from existence).
The bottom line is that if users refer to their scratch directory as /scratch/userid, or equivalently through the convenient symbolic link ~/scratch pointing to it which is created for newer accounts, then the directory will be found as required. Only when looking for it under /scratch without explicitly referring to it might it not be observed (as when directly listing the directory /scratch). The real mount points of each lustre filesystem (/lustre1, /lustre2 etc) are listable in the normal way but paths containing these mount points should not be used in scripts - instead always use paths containing /scratch and let the automounter supply the required directory when it is needed.