The Darwin cluster
The Cambridge High Performance Computing Cluster Darwin was the largest academic supercomputer in the UK at its installation in November 2006, providing 50% more performance than any other academic machine in the UK, and was ranked the 20th fastest machine in the world on the November 2006 Top500 list. The system originally had 2340 3.0 GHz Intel Woodcrest cores and 4.6 TB of total memory, with a peak Linpack performance of 28.08 TFlop/s and a sustained value of 18.27 TFlop/s. The SDR Infiniband interconnect provided 900 MB/s bandwidth with a 1.9 microsecond latency. This first Darwin system entered production service in February 2007.
In October 2010 the system was upgraded: 256 Woodcrest cores and 512 GB of memory were replaced by 1536 Westmere cores and 4608 GB of memory, connected via QDR Infiniband (3200 MB/s bandwidth and less than one microsecond latency). A GPU cluster with 256 Nehalem cores and 128 NVIDIA Tesla GPUs was also introduced. This second Darwin system provided an equivalent overall sustained performance of around 30 TFlop/s.
Over the period March-June 2012, a major upgrade was undertaken in which the old Woodcrest nodes were decommissioned and replaced by 9600 2.60 GHz Intel Sandy Bridge cores (600 nodes with 64 GB of RAM each, connected by Mellanox FDR Infiniband). This new system achieved a sustained Linpack performance of 183.379 TFlop/s (90.6% of peak), earning position 93 on the June 2012 Top500 list; by itself it was at that time the fastest (publicly disclosed) x86_64 cluster in the UK. Combined with the legacy 15 TFlop/s Westmere QDR system, this produced a third-generation Darwin cluster with an overall processing power of just under 200 TFlop/s. The system entered full production in June 2012 and currently provides high performance computing both internally for the University of Cambridge and externally as part of the STFC DiRAC facility.
There are currently (June 2012) three distinct flavours of compute node making up Darwin, spanning two distinct CPU microarchitectures and Infiniband interconnects, and including a 128-GPU Tesla subcluster.
9600 Sandy Bridge cores are provided by 600 nodes housed in 150 quad-server Dell C6220 chassis. Each node consists of two 2.60 GHz eight-core Intel Sandy Bridge E5-2670 processors, giving sixteen cores in total, forming a single NUMA (Non-Uniform Memory Access) server with 64 GB of RAM (4 GB per core), 376 GB of local storage and a Mellanox FDR ConnectX-3 interconnect. Code run on these nodes should be optimised for CPUs supporting AVX instructions and should use Intel MPI (normally satisfied by compiling with mpicc/mpifort on a login-sand node with the -xHOST optimisation flag). Modules with sandybridge in their name are the best choices where available on these nodes.
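As a concrete illustration, a build session on a login-sand node might look like the following sketch. The module name is an assumption for whichever Intel MPI environment module is current, not a guaranteed name:

```shell
# Illustrative build on a login-sand node -- the module name below is
# an assumption; check `module avail` for the actual sandybridge-
# flavoured modules on the system.
module load default-impi

# -xHOST makes the Intel compiler target the login node's CPU,
# which on a login-sand node means AVX instructions.
mpicc   -O3 -xHOST -o mycode   mycode.c
mpifort -O3 -xHOST -o mysolver mysolver.f90
```

Because -xHOST targets the machine doing the compiling, binaries intended for the Sandy Bridge nodes must be built on a login node of the same flavour.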
1536 Westmere cores are provided by 128 dual-socket Dell PowerEdge M610 half-height blade servers. Each node consists of two 2.67 GHz six-core Intel Westmere processors, giving twelve cores in total, forming a single NUMA (Non-Uniform Memory Access) unit with 36 GB of RAM (3 GB per core), 114 GB of local storage and a Mellanox ConnectX-2 interconnect. Code run on these nodes should be optimised for CPUs supporting SSE4.2 instructions and should use Intel MPI (normally satisfied by compiling with mpicc/mpifort on a login-west node with the -xHOST optimisation flag). Modules with nehalem in their name are the best choices where available on these nodes.
Tesla (GPU subcluster)
128 GPUs and 256 Nehalem cores are provided by 32 dual-socket Dell T5500 servers. Each server is connected to one Tesla S1070 unit containing 4 GPUs (CUDA compute capability 1.3). Each node consists of two 2.67 GHz four-core Intel Nehalem processors, giving eight cores in total, forming a single NUMA (Non-Uniform Memory Access) unit with 24 GB of RAM (3 GB per core), 114 GB of local storage and a Mellanox ConnectX-2 interconnect. Code run on these nodes should be optimised for CPUs supporting SSE4.2 instructions and should use Intel MPI (normally satisfied by compiling with mpicc/mpifort using the -xHOST optimisation flag, and nvcc, on a login-west or login-gfx node). Modules with nehalem in their name are the best choices where available on these nodes.
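For a mixed CUDA/MPI code, a build on a login-gfx node might be sketched as follows; file names are illustrative:

```shell
# Illustrative mixed CUDA/MPI build on a login-gfx node.
# Compute capability 1.3 corresponds to the sm_13 target for nvcc.
nvcc  -arch=sm_13 -c kernel.cu -o kernel.o

# -xHOST on a login-west/login-gfx node targets SSE4.2 (Nehalem).
mpicc -O3 -xHOST  -c driver.c  -o driver.o
mpicc driver.o kernel.o -lcudart -o gpuapp
```

Linking with mpicc and pulling in the CUDA runtime library (-lcudart) keeps the MPI wrapper in charge of the final link, which is usually the least error-prone arrangement for hybrid codes.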
The cluster is arranged into 22 compute racks: 10 racks form the Sandy Bridge sector of the system, 8 `racks' (really, blade chassis) make up the Westmere sector, and 4 racks form the Nehalem/Tesla (GPU) sector. All nodes within the Sandy Bridge sector, and all nodes within individual racks from the Westmere/Tesla sectors, are connected to a full bisectional bandwidth Infiniband network. Each Sandy Bridge rack consists of 15 chassis, each containing 4 servers, and each server containing 16 cores. Each Westmere `rack' consists of 16 12-core nodes (1 chassis) providing 192 cores. The GPU subcluster consists of 32 8-core nodes (4 racks) providing 256 Nehalem cores and 128 Tesla GPUs. Between Sandy Bridge nodes there is a full bisectional bandwidth FDR Infiniband network (6000 MB/s), and between Westmere racks a 75% bisectional bandwidth QDR Infiniband network (3200 MB/s).
In addition to the Infiniband networks, each compute rack has a full bisectional bandwidth gigabit ethernet network for data and administration; also each node is connected to a power/management network through which console functions for each box can be accessed remotely.
At the time of writing, main storage consists of three shared volumes: the home volume (11 TB) and two scratch volumes (172 TB each). These are all Lustre filesystems. Home directories have a 40 GB quota enforced per user, and the scratch storage is partly owned by contributing research projects.
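Since these are Lustre filesystems, usage against the 40 GB home quota can be checked with the standard Lustre client tool; the mount points below are assumptions, not documented paths:

```shell
# Check the current user's usage and quota on the home volume.
# /home and /scratch are illustrative mount points -- substitute
# the actual paths in use on the system.
lfs quota -u $USER /home

# Overall free space on a scratch volume.
df -h /scratch
```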
In addition each compute node has a modest amount of temporary storage available during the lifetime of each job: on local disk under /local, and in virtual memory under /ramdisks. The contents of these directories are destroyed after job completion.
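A job script can exploit this temporary storage by staging data to a node-local directory, running there, and copying results back before the job ends (after which /local and /ramdisks are cleaned). This is a sketch only; the directory naming scheme and the destination variable are illustrative, and the exact scheduler environment variables depend on the batch system in use:

```shell
#!/bin/bash
# Sketch of using per-node temporary storage inside a job.
# /local (disk) and /ramdisks (tmpfs) exist only for the job's lifetime.
TMP=/local/$USER.$$          # naming scheme is an assumption
RESULTS=$HOME/results        # illustrative destination for output
mkdir -p "$TMP" "$RESULTS"

cp input.dat "$TMP"/                 # stage input to fast local disk
( cd "$TMP" && ./mycode input.dat )  # run against the local copy
cp "$TMP"/output.dat "$RESULTS"/     # copy results back BEFORE the job ends
```

The crucial point is the final copy: anything left under /local or /ramdisks after job completion is destroyed.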