On 1st February 2014 the HPCS moved from Maui/Torque (PBS) as its job scheduler and from Gold as its accounting software to SLURM. Please see the SLURM Quick Start Guide.
You should not have to do this if you are logging in from (or via) a machine connected to the CUDN (the Cambridge University Data Network), which is likely to be the case if the machine is in a University Department or College. If the machine is elsewhere, and you have SSH access to a system within the CUDN, then follow the procedure described below. If neither of the above is true, it is necessary to register an additional IP address to allow direct SSH connection to the HPCS systems. Hopefully, the machine you are sitting at has a static, public IP address; please try to find out from local IT support whether it does. In any event, it requires internet access, and browsing to the site http://www.cam.ac.uk/cs/myip/ should report the public IP address from which traffic from your machine will appear to originate, and which would have to be registered in order for SSH connections to us to be accepted. Please note however that registering a dynamic or gateway address, which is implied if your machine does not have an address which is both static and public, is strongly disfavoured for reasons of security.
The first time an SSH connection (whether via the ssh UNIX command or via putty) is made to Darwin, you will be asked to accept the current Darwin SSH public host key. A "fingerprint" will be presented - please only accept the key (by typing the complete word yes on UNIX) if the fingerprint matches one of the following values:
If you are presented with a different fingerprint, something is wrong (the machine to which you are connecting is NOT the intended machine). Please do NOT accept the key under these circumstances (the connection will then fail) but contact email@example.com as soon as possible.
Yes, using the usual UNIX command passwd. However, the new password must be of equivalent strength (or better!) to the randomly generated password you were originally issued, i.e. at least 10 characters, using mixed case letters and some non-alphanumeric characters, plus the usual traps to avoid (please see here).
A weak password, for all our other precautions, can allow a compromise that would not only endanger the victim's own data, but that of others, and also put the service itself at risk. This is because it's generally much easier to attack other accounts when one account is already controlled. It's also much easier to compromise an account on a second system after an initial intrusion (a strong password on the second machine doesn't help if an intruder is watching you type it in). All passwords should be treated with the same care as other personal data such as bank and credit card numbers.
Please note that we will from time to time run open source software designed to identify weak passwords.
The short answer is yes - the support staff do this routinely. Note that in general it is not necessary to register your home IP address to do this (in fact this is undesirable since it is probably dynamic), provided you have SSH access from home to a registered machine (e.g. your work system attached to the CUDN). If this is not the case, you may wish to investigate the VPN service provided by the UIS which will in effect join your local machine to the University network via an encrypted tunnel (at which point SSH to Darwin will work directly). If on the other hand there is a convenient machine to which you can SSH directly from outside, and from which you can already SSH to Darwin, then please keep reading.
The simplest approach is to ssh (or putty) first to the registered machine, then ssh again from within the session to Darwin, as you would normally. In order for X applications to work transparently, both ssh connections would need to have X11 forwarding enabled (with OpenSSH this may require the -X option or, if that produces strange results, -Y).
The first connection to the registered host must be encrypted! I.e., even if the registered host for some reason accepts telnet from outside (it shouldn't!), use ssh. Otherwise your Darwin password will travel the first leg of its journey unencrypted, as indeed will the password to your work machine. This is not safe.
Please don't try this from an untrustworthy computer (e.g. in an internet cafe). Keylogging software designed to harvest passwords and bank account details as they are typed poses a real threat. Similarly you should always keep system, anti-virus and anti-spyware software up to date on all your personal computers to maintain their trustworthiness.
Finally, on machines with X servers, always ensure X11 security is turned on. In particular, never use xhost if this is recommended to you as a remedy for X application problems, as this can easily allow anyone on the internet who can talk to your machine to take complete control of your display, which implies the ability to read what you are typing. Since remote X applications should work transparently through the offices of SSH, in a secure way, the ancient and highly dangerous xhost command is not the solution to the problem you are having.
The simple method above of logging in via ssh twice to get to Darwin can be inconvenient if you wish to do it several times, or need to transfer files via scp, sftp or rsync. A more sophisticated method uses tunnelling through the initial SSH connection to the registered host, to allow subsequent SSH connections (as used by ssh, scp, sftp or rsync) to take place directly from the local machine to Darwin. The following method is known to work with OpenSSH on Fedora Linux and has been reported to work on MacOSX, but note that the simplest way to transfer files if you are using Windows is to use WinSCP which has an advanced option for automatically setting up a tunnel to a registered host.
Add two sections to the ~/.ssh/config file:
Host login-hpc
    Port 22000
    HostName localhost
    User your_username_on_hpc
    ForwardX11 yes
    ForwardX11Trusted yes
and another section referring to the registered gateway system mycomp.mydept.cam.ac.uk:
Host mycomp.mydept.cam.ac.uk
    GatewayPorts no
    # Darwin SSH
    LocalForward 22000 login.hpc.cam.ac.uk:22
Note that if you need to create ~/.ssh/config, ensure that it has sufficiently strict permissions by doing
chmod go-w ~/.ssh/config
The end result is that after logging in via SSH to mycomp.mydept.cam.ac.uk in the usual way, it becomes possible to contact login.hpc.cam.ac.uk directly via SSH, simply by doing:
ssh login-hpc
from your local machine (and similarly scp, sftp, rsync). The initial connection to mycomp established a tunnel from the local machine (port 22000) to login.hpc.cam.ac.uk (port 22); subsequent SSH connections made to login-hpc are actually made to the local end of this tunnel. Information sent through these is encrypted (again), emerges from the tunnel at mycomp and then travels normally to Darwin. Since the traffic appears to originate from a registered host, access is permitted. Other services restricted to .cam can be accessed (securely) from home or elsewhere by similar methods.
The login nodes are intended primarily for code compilation and development, job submission and monitoring, and data post-processing and management. In terms of hardware they are identical to the Sandy Bridge compute nodes but their Infiniband connections are not available for MPI. Although they can in principle run codes in shared memory, please note that they are not intended to run production code; that is why the compute nodes and batch queue system exist.
Small scale testing is permissible, but please be aware at all times that these nodes are shared resources, and use the command nice -n 19 to launch programs on the command line, being careful to monitor system load and available memory with the top and free commands respectively (see their man pages for further details).
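As a minimal sketch of the advice above, a small wrapper function can make sure that anything started by hand on a login node runs at the lowest priority. The function name run_politely is hypothetical; only nice -n 19 comes from the text.

```shell
#!/bin/bash
# run_politely COMMAND [ARGS...]
# Launch a command at the lowest scheduling priority (niceness 19),
# as requested for small tests on the shared login nodes.
run_politely() {
    nice -n 19 "$@"
}
```

For example, run_politely ./my_test_program would start a (hypothetical) test binary at niceness 19, after which top and free can be used to keep an eye on system load and available memory.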
However it is preferred, especially if interactivity is not essential, to package short jobs as batch jobs and submit them to the queues. Note that there are often one or two compute nodes free which are waiting to be used by large jobs still gathering sufficient nodes. These waiting nodes are available to run short jobs, provided the queueing system knows that any such jobs will have finished before the nodes are needed (i.e. set the walltime accurately; using default values won't work). This is referred to as backfilling, and is a feature provided by the scheduler.
I.e. what values of N and n should be used in the batch script directives:
#SBATCH --nodes=N
#SBATCH --ntasks=n
or, equivalently, on the command line as
sbatch --nodes=N --ntasks=n jobscript ?
Since the HPCS allocates entire nodes to jobs, the most important parameter is N, as this will determine the resources made available to the job and also the rate of charge. By default SLURM will assume that all the processor cores and memory in each allocated node are available to the job. The number of tasks n is passed into the job environment for the benefit of SLURM-aware software, e.g. to influence the number of MPI tasks started.
The important pieces of information are:
- Sandy Bridge (Darwin) nodes each have 16 CPU cores and 63900 MB of usable memory;
- Westmere (Darwin) nodes each have 12 CPU cores and 35700 MB of usable memory;
- Tesla (Wilkes) nodes each have 12 CPU cores and 63900 MB of usable memory.
The simplest case is that in which either mpirun/mpiexec or srun is going to launch a task on every CPU core in the allocated nodes. Here one merely has to decide what multiple of 16 (for Sandy Bridge) or 12 (for Westmere or Tesla) will be the total number of tasks, and this immediately determines n and N. E.g. 32 tasks on Sandy Bridge would obviously imply n=32 and N=2.
The situation becomes more complex if, for example, 16 MPI tasks would require more than 63900 MB of memory on a Sandy Bridge node, or if each task was designed to spawn additional threads thus requiring additional CPU cores per task. In both of these scenarios, it would be necessary to tell MPI to launch fewer than 16 tasks per node. The template job submission scripts will take care of this based on the supplied values of N and n. Assume that only p such tasks will fit in a single node. Now one needs to decide what multiple of p will be the total number of tasks, and this immediately determines n and N. E.g. if p=8, then 32 tasks on Sandy Bridge would imply n=32 and N=4.
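The arithmetic above can be sketched as two small shell functions. The names tasks_per_node and nodes_needed, and the idea of supplying a per-task memory estimate in MB, are assumptions made for illustration; the node sizes are those listed above.

```shell
#!/bin/bash
# tasks_per_node CORES NODE_MEM_MB TASK_MEM_MB [CPUS_PER_TASK]
# Largest number of tasks p that fits in one node, limited by both
# memory and cores (CPUS_PER_TASK defaults to 1, i.e. no extra threads).
tasks_per_node() {
    local cores=$1 node_mem_mb=$2 task_mem_mb=$3 cpus_per_task=${4:-1}
    local by_mem=$(( node_mem_mb / task_mem_mb ))
    local by_cpu=$(( cores / cpus_per_task ))
    echo $(( by_mem < by_cpu ? by_mem : by_cpu ))
}

# nodes_needed N_TASKS P
# Smallest whole number of nodes N with N * p >= n (ceiling division).
nodes_needed() {
    local n=$1 p=$2
    echo $(( (n + p - 1) / p ))
}

# Sandy Bridge node, tasks assumed to need about 7500 MB each: p = 8
p=$(tasks_per_node 16 63900 7500)
# 32 such tasks then need N = 4 nodes
N=$(nodes_needed 32 "$p")
echo "p=$p n=32 N=$N"
```

With these assumed inputs the script reproduces the example in the text (p=8, n=32, N=4); the resulting values would then be passed to sbatch as --nodes=4 --ntasks=32.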
The scheduler uses a sophisticated algorithm to determine which job should run when, depending on factors such as what quality of service the job has requested, how long the job has been waiting, whether the project to which the user belongs is achieving its "fair share" of resources, and whether the job can "backfill" resources reserved by a larger job already scheduled to start.
More technically, priority is based on QOS factor, fair share factor and queue time factor. For more information about what this means, please see the SLURM documentation.
Please bear in mind that for a busy system, on which jobs typically have a running time of 12 or 36 hours, it should not be considered unreasonable to have a large job wait 12 hours in the queue before it receives an opportunity to run. Also the scheduler can only operate on the basis of the information provided at submission time - if you have a 30 minute job but just use the default of 12 hours' walltime when submitting, expect to wait longer to run than if you'd stated 30 minutes correctly. This is obvious, because it is easier to find room for smaller blocks of resources than larger ones (at least on a finite cluster) and the scheduler here would assume you needed a block of core hours 24 times too big.
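To illustrate stating the walltime accurately, the sketch below converts a run time in minutes into the hours:minutes:seconds form accepted by sbatch's --time option. The function name minutes_to_walltime is hypothetical, and the estimate of the run time itself has to come from you.

```shell
#!/bin/bash
# minutes_to_walltime MINUTES
# Format a run time given in whole minutes as HH:MM:SS.
minutes_to_walltime() {
    local m=$1
    printf '%02d:%02d:00\n' $(( m / 60 )) $(( m % 60 ))
}

# A 30 minute job versus the 12 hour default mentioned above:
echo "requested: $(minutes_to_walltime 30)"
echo "default:   $(minutes_to_walltime 720)"
echo "ratio:     $(( 720 / 30 ))"
```

The job could then be submitted with something like sbatch --time=$(minutes_to_walltime 30) jobscript, giving the scheduler a block 24 times smaller to place than the default would.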
Because of the overhead incurred by the scheduler processing each job submitted, which is particularly serious when the jobs are small and/or short in duration, it is generally not a good idea to simply run a loop over sbatch and inject a large set of small/short jobs into the queue. It is often far more efficient from this point of view to package such little jobs up into larger 'super-jobs', provided that each small job in the larger job is expected to finish at about the same time (so that cores aren't left allocated but idle). Assuming this condition is met, what follows is a recommended method of aggregating a number of smaller jobs into a single job which can be scheduled as a unit.
Note that since each job will be allocated (and charged for) at least one entire node, it may be necessary to aggregate multiple small jobs in order to create batch jobs which make full use of a 16- or 12-core node.
For simplicity this example assumes the small jobs are serial (one-core) jobs. Firstly, group the small jobs into sets of similar runtime (choose the largest multiple of cores-per-node which will end close together), and package each set of N as a single N-core job as follows. The SBATCH directives at the top of the submission script should specify (in addition to project and walltime) just the number of nodes:
where p=12 for Westmere and p=16 for Sandy Bridge, and the <>s above should be replaced by the results of the trivial calculations enclosed. Basically, the number of nodes is the smallest whole number containing sufficient cores. If p single core tasks require more memory than is available on one node, the job should be spread over more nodes than this. Then instead of the usual
at the end of the submission script to launch one serial job, launch N via something like the following. Note that this makes use of SLURM's srun command. For an alternative method which will perform a similar aggregation for Intel MPI "joblets" smaller than one node each, please see the joblets script (currently still adapted to PBS).
cd directory_for_job1
srun --exclusive -n 1 $application options_for_job1 > output 2> error &
cd directory_for_job2
srun --exclusive -n 1 $application options_for_job2 > output 2> error &
...
cd directory_for_jobN
srun --exclusive -n 1 $application options_for_jobN > output 2> error &
wait
Note in the above:
- the use of --exclusive to ensure that distinct tasks are assigned distinct cores;
- the > output 2> error which direct the stdout and stderr to files called output and error respectively in each directory for the corresponding job (obviously you can change the names of these, and even have the jobs running in the same directory if they are really independent);
- the & at the end of each line which allows them to run simultaneously;
- the wait command at the end, which prevents the job script from finishing before the individual tasks.
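When N is large, the repeated lines above can be generated by a loop. The following is a sketch only: the function launch_joblets and the LAUNCHER variable are hypothetical names, and it assumes that every small job runs the same application without per-job options and picks up its input from its own directory.

```shell
#!/bin/bash
# LAUNCHER defaults to the srun invocation used in the text. It is a
# variable only so that the loop can be tried outside a SLURM allocation.
LAUNCHER=${LAUNCHER:-"srun --exclusive -n 1"}

# launch_joblets APPLICATION DIR1 [DIR2 ...]
# Start one single-core task per directory in the background, then wait
# until all of them have finished.
launch_joblets() {
    local application=$1; shift
    local dir
    for dir in "$@"; do
        (
            # Each task runs in a subshell, so relative directories work
            # and one task's cd does not affect the next.
            cd "$dir" || exit 1
            # $LAUNCHER is deliberately unquoted so that it splits into words.
            $LAUNCHER "$application" > output 2> error
        ) &
    done
    wait
}
```

Inside a submission script this might be called as launch_joblets $application directory_for_job1 directory_for_job2 and so on, with as many directories as there are cores requested.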
No, in fact it's good for you.
Firstly, the minimum allocation is always one entire node, so serial jobs should always be packaged in groups of 16 (if permitted by available memory) before being submitted to the scheduler. There is nothing wrong with submitting large numbers of single-core tasks packaged as single or multi-node jobs in this way; this is a mode some people want and which we allow. There is very little difference between 64 single node jobs and a single job of 1024 cores, which we also allow. The former don't in any sense clog up the queue, because the queue is not simply "first in, first out", and it isn't unfair. It may well be true that their waiting times are shorter, but this doesn't mean that they have an unfair advantage; it just means that smaller jobs are easier to schedule, and this doesn't occur at the expense of larger jobs.
The presence of a large set of single node jobs in the queue adversely impacts other users rather less, in general, than the equivalent single multi-node job. The system adjusts job priorities so that each user has a fair share of resources. Thus if factors such as the length of time you have been waiting, your quality of service and the amount of resources recently received by your project determine that you should have a higher priority at the moment than someone else, your job will have nodes reserved for it before theirs. However, if your job is multinode, extra time is usually required to gather sufficient nodes because these may not all become available at the same time. Thus nodes already reserved are in danger of staying idle until the rest are available, which is inefficient. Under these circumstances the scheduler may assign some of those waiting nodes to other jobs, provided those jobs will finish before your multi-node job expects to start. This is called backfilling, and explicitly does not delay the expected start time of your job.
A virtue of small jobs, i.e. jobs using small numbers of nodes and/or requesting short amounts of time, is that they can often be used in backfilling (provided that the scheduler is given correct values at submission time). This is simply a consequence of smaller jobs being able to fit in a wider range of holes, and as a general rule such jobs will probably wait for less time in the queue. This does not mean that over time users exclusively submitting small jobs gain more resources, because as they use up their fair share, so their priority diminishes and others are looked at in preference. Nor does it mean that submitters of large jobs with higher priority have to wait longer, because their nodes are allocated first. In fact overall, small jobs are good because they allow the scheduler to keep more nodes busy through backfilling, thereby increasing overall throughput. On the other hand, larger jobs tend to have the opposite effect: if 64 single node jobs were repackaged as a single 1024-core job, backfilling could not occur and 1024 cores would need to become idle before the job could start - waiting times for everyone would increase, because overall the queue would process the workload more slowly.
In your bash initialization file ~/.bashrc, add a line like:
ulimit -Ss 102400
where this raises the (soft) stack limit to 100MB (the value is in KB). This can also be issued on the command line to affect the current shell, however further steps are required in order to effect this change for batch jobs (in particular, modifying ~/.bashrc is insufficient). Instead replace the MPI binary with a short shell script which performs the above ulimit, then execs the real binary.
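The wrapper idea described above can be sketched as a shell function that sets the soft stack limit and then execs the real program. The name run_with_stack_limit is hypothetical, and in practice the same two steps would live in a small script file launched in place of the MPI binary.

```shell
#!/bin/bash
# run_with_stack_limit LIMIT_KB COMMAND [ARGS...]
# Set the soft stack limit (in KB) in a subshell, then replace that
# subshell with the real program so the limit applies to it.
run_with_stack_limit() {
    local limit_kb=$1; shift
    (
        ulimit -Ss "$limit_kb" || exit 1
        exec "$@"
    )
}
```

For example, run_with_stack_limit 102400 ./real_binary (where real_binary stands for your own executable) would start it with the 100MB soft stack limit used above, provided the hard limit on the node permits that value.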
Although we currently allow users to set up cron jobs on the login nodes to perform regular tasks (e.g. submit jobs), these will disappear following a reboot of the login node. This is because the cluster integration software on Darwin is image-based, and each node is returned to a pristine function-specific software image on every reboot.
Please note that any cron job should take care to fail safely - for example, a cron job which is found to be flooding the batch queues with malformed jobs is very likely to be deleted by the support staff.
Some users have reported that their codes sometimes die with errors such as:
forrtl: severe (10): cannot overwrite existing file, unit N, file filename
where N is some integer and filename is the name of an output file. This is usually seen during parallel (but not serial) runs, and the error is seen multiple times (once for each of a subset of MPI tasks which die). Codes which have exhibited this problem include siesta and vasp.
A possible explanation for this is the following. When multiple tasks of a parallel job attempt to open the same file for writing, and that file does not already exist, then they each attempt to create the file. A race then takes place between the tasks which have noticed that the file needs to be created to perform the creation. One task wins (and successfully creates the file), and the rest receive "cannot overwrite existing file" errors and die, the job then dying also. The set of tasks which die from this error may be smaller than the set of all remaining tasks, either because not all attempt to write to the file, or because some see that the file has been created when they check for it.
An effective workaround has been found to be to ensure that the output files already exist before the job starts. It is sufficient to create them as empty files using the command touch. For example, if you have seen the error
forrtl: severe (10): cannot overwrite existing file, unit 99, file /scratch/abc123/work/AECCAR0
then you would perform the command
touch /scratch/abc123/work/AECCAR0
before retrying. In fact there may be several files that need to be created and which may cause this error, so ideally one would create all of them as empty files before attempting to run the job again.
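Where several output files are involved, the workaround can be wrapped in a small helper that is called near the top of the job script. This is a sketch under the assumption that you already know which files the code will try to create; the function name precreate_outputs and the second file name in the usage note are hypothetical.

```shell
#!/bin/bash
# precreate_outputs FILE [FILE ...]
# Make sure every listed output file exists before the parallel run
# starts. touch creates missing files as empty files and leaves the
# contents of existing files alone.
precreate_outputs() {
    local f
    for f in "$@"; do
        touch "$f" || return 1
    done
}
```

For example, precreate_outputs /scratch/abc123/work/AECCAR0 /scratch/abc123/work/another_output would create both files if they are missing.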
Old binaries compiled against Intel MKL 9.0.018 may fail with messages similar to libmkl.so: file too short due to changes in MKL 10. If this affects you, please recompile your binary and advise support'at'hpc.cam.ac.uk if you continue to have problems.
Intel MKL versions 10.x prior to 10.2.2.025 are deprecated. Unfortunately Intel have made linking somewhat more complicated in recent versions.
For example, a common dynamic link line for LAPACK and BLAS previously contained:
-lmkl_lapack -lmkl -lguide -lpthread
however for 10.3.4.191 this should be replaced by
-lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread
and the analogous static link line would contain
MKLROOT=/usr/local/Cluster-Apps/intel/mkl/10.3.4.191/composerxe-2011.4.191/mkl
$(MKLROOT)/lib/em64t/libmkl_lapack95_lp64.a -Wl,--start-group $(MKLROOT)/lib/em64t/libmkl_intel_lp64.a $(MKLROOT)/lib/em64t/libmkl_intel_thread.a $(MKLROOT)/lib/em64t/libmkl_core.a -Wl,--end-group -L$(MKLROOT)/lib/em64t -liomp5 -lpthread
where MKLROOT is set and used here as in a Makefile. Note that MKLROOT is also an environment variable set by the intel/mkl module.
The Intel MKL user documentation can be found on darwin at
The Intel® Math Kernel Library Link Line Advisor is a useful online tool for constructing MKL linker lines. The relevant architecture for Darwin is Intel64. Please note that the INCLUDE, MKLROOT, LD_LIBRARY_PATH, LIBRARY_PATH, CPATH, FPATH and NLSPATH environment variables are already set correctly by the mkl module, so please don't change these if tempted by the advice on the link advisor page.
Please advise firstname.lastname@example.org if you continue to have problems.
Projects seeking to buy large amounts of time over several three-month quarters, with guaranteed minimum usage levels, should contact the Director (director'at'hpc.cam.ac.uk) and discuss running under Service Level 1. Once this is set up, core hours are allocated automatically at the beginning of each quarter, until the end of the service level agreement.
Projects interested in making ad hoc purchases of core hours without guaranteed minimum usage levels under Service Level 2 should ask their department to raise a purchase order for the desired number of HPC core hours to our controlling institution (which at the time of writing is University Information Services). Please email the PO to Fay Hider (fay.hider'at'uis.cam.ac.uk) copied to Stuart Rankin (sjr20'at'cam.ac.uk).
At the time of writing the internal rates within the University are 0.012 GBP per core hour on Darwin, and 0.20 GBP per GPU hour on Wilkes.
Note that for orders within the University, VAT is not applicable. Once the purchase order has been sent, please let us know at support'at'hpc.cam.ac.uk. Ad hoc purchases may be made more than once, as desired.
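As a rough guide to sizing an ad hoc purchase, the cost of a job can be estimated from the internal Darwin rate quoted above. The sketch below assumes that the charge is simply nodes x cores per node x hours x rate, since whole nodes are allocated; the function name job_cost_gbp is hypothetical.

```shell
#!/bin/bash
# job_cost_gbp NODES CORES_PER_NODE HOURS [RATE_GBP_PER_CORE_HOUR]
# Estimated cost in GBP of a job occupying whole nodes. The rate
# defaults to the internal Darwin rate of 0.012 GBP per core hour.
job_cost_gbp() {
    local nodes=$1 cores_per_node=$2 hours=$3 rate=${4:-0.012}
    LC_ALL=C awk -v n="$nodes" -v c="$cores_per_node" -v h="$hours" -v r="$rate" \
        'BEGIN { printf "%.2f\n", n * c * h * r }'
}

# Two Sandy Bridge nodes for 12 hours: 2 x 16 x 12 = 384 core hours
job_cost_gbp 2 16 12    # prints 4.61
```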
The HPCS uses the accounting features of SLURM and the services of slurmdbd to manage the usage of computational resources. Notional resource credits are deposited into accounts attached to each project. Credits are withdrawn from these accounts as resources are consumed by the project. Depending on how the Principal Investigator (PI) of the project has chosen to organise it, there may be either a single account shared between all users associated with the project, or an individual account assigned to each user, with a project coordinator responsible for distributing credits between accounts.
Accounting quarters are three calendar months in length; in order to match the University's quarters, they run from 1st February - 30th April, 1st May - 31st July, 1st August - 31st October and 1st November - 31st January. The reasons for adopting quarters as the basis for allocation are:
- The HPC is completely self-financing and must recover its capital and running costs, therefore we need to realize revenue at a steady rate in order to continue;
- Quarters provide a structure in which we can control resource allocation and seek to provide guaranteed levels of service.
Projects requiring medium amounts of computer time may make a one-off or a series of one-off purchases of core hours under SL2 ("Ad Hoc Usage") which entitles the project to run jobs under QOS1 until the purchased core hours are exhausted.
Projects wishing to obtain large numbers of core hours over several quarters with a minimum usage guarantee may do so by contacting the Director (director'at'hpc.cam.ac.uk) with a view to running under SL1. Under this service level, the core hours themselves are allocated to each paying project at the beginning of each accounting quarter, according to the funds received by prearrangement with the PI, and are understood to be processor core hours or GPU hours of paid-usage machine work to be performed some time within that three month period. These paid-usage resources are then given a representation in terms of available core hours.
In addition to the resource usage checking, the scheduler aims to ensure that SL1 projects receive core hours at a rate consistent with their overall allocation. (E.g. if your project has purchased half of the machine core hours available in the current quarter, then users in your project should expect, on average, to be running on half of the machine.) Thus at the end of the quarter, all paid core hours can be expected to have been used by the correct people, and all credits spent. Note, however, that this assumes a steady demand for resources by each project. Clearly, if a project purchasing half the quarter's core hours submits no work until the third month of the quarter, the full number of core hours can no longer be claimed in that quarter, even with exclusive and unrestricted access to the machine.
At the end of the quarter, unused SL1 credits expire but are retained as expired credits. These accumulate and may be used, on a best efforts basis and without minimum usage guarantees, after the current quarter's active credits have been exhausted.
Use the command
This reports the current balance of credits, which equates to the number of core hours still available in the current accounting quarter (including expired credits if present). When the balance reaches zero, the -SL4 project can be drawn upon, if present - this is SL4 ("Residual Usage").
This is caused by the application gpk-update-icon which is started by default by the Gnome desktop. To get rid of this, open the menu to System/Preferences/Startup Applications and untick PackageKit Update Applet. Then kill the currently running instance using the command killall gpk-update-icon.
These are the commands available at the login prompts of Linux machines (like Darwin or Wilkes), or of computers running other UNIX-like operating systems (e.g. Solaris workstations). However if you own a Mac you will also be able to use these inside the Terminal program of MacOSX, or even on a Windows machine if you have installed Cygwin.
There are many, many introductions to UNIX commands on the web.
One local reference is Thirty Useful Unix Commands.
Our current suggestion for use in academic publications etc is:
This work was performed using the Darwin Supercomputer of the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/), provided by Dell Inc. using Strategic Research Infrastructure Funding from the Higher Education Funding Council for England and funding from the Science and Technology Facilities Council.
Alternatively for resources awarded by DIRAC, the following form is preferred:
This work used the Darwin Data Analytic system at the University of Cambridge, operated by the University of Cambridge High Performance Computing Service on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). This equipment was funded by a BIS National E-infrastructure capital grant (ST/K001590/1), STFC capital grants ST/H008861/1 and ST/H00887X/1, and DiRAC Operations grant ST/K00333X/1. DiRAC is part of the National E-Infrastructure.
Finally, for the Wilkes GPU cluster, a different form is appropriate:
This work used the Wilkes GPU cluster at the University of Cambridge High Performance Computing Service (http://www.hpc.cam.ac.uk/), provided by Dell Inc., NVIDIA and Mellanox, and part funded by STFC with industrial sponsorship from Rolls Royce and Mitsubishi Heavy Industries.
The Sandy Bridge nodes are the sixteen core, 64GB nodes with Mellanox FDR InfiniBand interconnect which were introduced in June 2012. The name refers to the type of Intel Xeon CPU, which is eight-core and supports Intel AVX instructions. Binaries to be executed on such nodes are best compiled with Intel compilers and the flag -xAVX. We recommend that MPI codes use Intel MPI on Sandy Bridge nodes.
The Tesla nodes are the nodes comprising the Wilkes GPU cluster. Their CPUs are compatible for compilation purposes with Sandy Bridge, but the computational power in these nodes derives from the NVIDIA Kepler GPUs, rather than the CPUs. The InfiniBand hardware on all nodes is Mellanox ConnectX2 and we recommend that MPI codes should use Intel MPI, or MVAPICH2 or OpenMPI on these nodes.
You are probably listing /scratch directly, which since October 2012 is managed by the automounter.
In order to improve the resiliency and flexibility of the scratch storage, user scratch space is distributed across several different filesystems. To make this easier to deal with, we are now using an automounter (automount and autofs) to present scratch directories consistently as paths of the form /scratch/userid, irrespective of what physical filesystem the directory actually resides on. Thus no-one need remember the low-level details of where their files are currently stored (this after all may change with time).
In practice, this should be transparent - each access (via cd, or ls, or simply by reading or writing to a specific location under /scratch/userid) should work exactly as expected, and the presence of the automounter and of the multiple filesystems behind it should be invisible (unless you use the quota command, in which case you will see usages according to physical filesystems).
However it is possible to catch the automounter in the act, and be confused by it. In particular, doing
ls /scratch
may show only a few, or perhaps no, user directories. This does not mean that the absent directories no longer exist, but that the automounter has simply not been asked to automount them. After any explicit reference to /scratch/userid on a particular node, the directory should be automounted under the /scratch parent directory and be available as expected. However, if the directory is inactive for long enough, it will disappear again (from under /scratch, but not from existence).
The bottom line is that if users refer to their scratch directory as /scratch/userid, or equivalently through the convenient symbolic link ~/scratch pointing to it which is created for newer accounts, then the directory will be found as required. Only when looking for it under /scratch without explicitly referring to it might it not be observed (as when directly listing the directory /scratch). The real mount points of each lustre filesystem (/lustre1, /lustre2 etc) are listable in the normal way but paths containing these mount points should not be used in scripts - instead always use paths containing /scratch and let the automounter supply the required directory when it is needed.
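In scripts, the advice above amounts to always naming the directory explicitly rather than discovering it by listing /scratch. A minimal sketch follows; the function name scratch_dir is hypothetical, and the optional second argument exists only so the function can be tried against a different parent directory.

```shell
#!/bin/bash
# scratch_dir [USERID] [BASE]
# Print the path BASE/USERID (default /scratch/$USER) after referring to
# it explicitly, which is what prompts the automounter to mount it.
scratch_dir() {
    local user=${1:-$USER} base=${2:-/scratch}
    local dir="$base/$user"
    if ls -d "$dir/." > /dev/null 2>&1; then
        echo "$dir"
    else
        echo "scratch directory $dir is not accessible" >&2
        return 1
    fi
}
```

A job script might then start with something like cd "$(scratch_dir)" before reading or writing any data.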