News

Tuesday 21st October - system maintenance affecting batch jobs and login access.

17th October 2014

There will be full maintenance commencing 10:00 on Tuesday 21st October in order to allow kernel and Mellanox software upgrades. Please save all files, quit all applications and log off all Darwin nodes before maintenance commences. Login nodes should return in approximately 30 minutes.

Login node reboots

15th September 2014

The Darwin and Wilkes login nodes (login-sand and login-gfx) will reboot for security reasons at 10:00 on Tuesday 16th September. Please save all files, quit all applications and desktops and log off before this time. Login access will be restored immediately after the reboots complete.

Login node reboots

7th August 2014

The Darwin and Wilkes login nodes (login-sand and login-gfx) will reboot for security reasons at 10:00 on Friday 8th August. Please save all files, quit all applications and desktops and log off before this time. Login access will be restored immediately after the reboots complete.

HPCS move into UIS

29th July 2014

The HPCS will move (virtually, not physically) from the School of Physical Sciences into the UIS on 1st August 2014. This will have few immediate consequences; however, please note that any purchase order intended for the HPCS which is generated on or after 1st August 2014 should be sent to a new address:

A/O Fay Hider
University Information Services
Roger Needham Building
7 JJ Thomson Avenue
Cambridge CB3 0RB

Wilkes QoS changes

24th July 2014

We have increased the maximum number of nodes that a single user can use at once on Wilkes. The new limits are:

  • Paying (GPU1 QoS): 48 nodes (up from 24)
  • Non-paying (GPU2 QoS): 24 nodes (up from 6)

Job time limits are unchanged.

Tuesday 27th May - system maintenance affecting batch jobs and possibly login access.

21st May 2014

There will be full maintenance commencing 10:00 on Tuesday 27th May in order to allow an upgrade of SLURM and possibly of firmware on the InfiniBand switches.

Thursday 1st May - DiRAC3 benchmarking affecting Darwin and Wilkes

27th April 2014

On Thursday 1st May (please note the unusual day) Darwin and Wilkes will be devoted to performing DiRAC3 benchmarks. Login and compute nodes will also reboot at 10:00. Login access may be restored subsequently but the majority of jobs will be held until the benchmarking is complete.

Tuesday 11th March - Wilkes Benchmarking.

3rd March 2014

From 00:00 on Tuesday 11th March, Wilkes will be dedicated to large scale benchmark runs. Other jobs will remain pending in the queue until the benchmarking is complete, either later the same day or early on the next.

SLURM tuning on Tuesday 11th February 10:00-13:00.

10th February 2014

Debugging and tuning work on SLURM will be performed between 10:00 and 13:00 on Tuesday 11th February. This may produce some temporary loss of contact with the scheduler (e.g. timeouts from commands such as sbatch or squeue). If you encounter these, please wait a moment then retry. No running jobs should be adversely affected by this activity.
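
For scripts that call scheduler commands, the "wait a moment then retry" advice above can be sketched in Python as follows. This is only an illustration, not an HPCS-provided tool: the function name run_with_retry and its default attempt count, delay and timeout are assumptions chosen for the example.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=5, delay=30.0, timeout=60.0, sleep=time.sleep):
    """Run cmd (a list of arguments), retrying if it fails or times out.

    Returns the CompletedProcess of the first successful run, or raises
    RuntimeError once all attempts have been used.
    All defaults are illustrative assumptions, not HPCS recommendations.
    """
    reason = "no attempts made"
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            if result.returncode == 0:
                return result
            # Non-zero exit: remember why, then fall through to the retry.
            reason = result.stderr.strip() or f"exit code {result.returncode}"
        except subprocess.TimeoutExpired:
            reason = f"no response within {timeout} s"
        if attempt < attempts:
            sleep(delay)  # wait a moment before trying again
    raise RuntimeError(
        f"{cmd[0]} still failing after {attempts} attempts: {reason}"
    )
```

A call such as run_with_retry(["squeue"]) would then tolerate a brief loss of contact with the scheduler rather than failing at the first timeout. Avoid wrapping sbatch in this way without care: a submission that timed out on the client side may still have been accepted, so a blind retry could submit the same job twice.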

Transition from Maui/PBS to SLURM on Saturday 1st February.

27th January 2014

The much-trailed transition of the production service from using Maui/PBS for scheduling to SLURM will take place at the end of the current Cambridge quarter on Saturday 1st February. Please see here for more information.

Special maintenance affecting all services commencing Friday 17th January at 10:00 and continuing through the weekend.

9th January 2014

A full maintenance affecting all systems in the HPCS data centre will take place on Friday 17th January (please note the unusual day), commencing at 10:00. Services will not be restarted until the following Monday 20th. The purpose of this special maintenance is to perform essential repairs to the data centre cooling circuit. Systems will return to service as soon as possible on Monday morning.

Full system maintenance is scheduled for Tuesday 17th December at 10:00.

11th December 2013

A full maintenance affecting all systems in the HPCS data centre will take place on Tuesday 17th, commencing at 10:00. The purpose of this is to adjust the power distribution and test emergency shutdown safeguards before the Christmas break. Systems will return to service as soon as possible.

The new GPU cluster has achieved No.2 in the November 2013 Green500 list.

21st November 2013

The new GPU cluster (Wilkes) has achieved second place in the November 2013 Green500 list with an efficiency of 3,631.86 MFlop/s per watt. There is more information about Wilkes here.

The new GPU cluster has achieved No.166 in the November 2013 Top500 list.

19th November 2013

The new GPU cluster (Wilkes) has achieved number 166 in the November 2013 Top500 list with a performance of 239.9 TFlop/s. There is more information about Wilkes here.

Full system maintenance is scheduled for Tuesday 19th November at 08:00.

11th November 2013

This is required in order to introduce glycol into the room cooling system. A total shutdown of all compute nodes will take place at 08:00 but we will maintain login access.

The Tesla nodes will be permanently decommissioned on Friday 25th October at 5pm.

21st October 2013

The HPCS will shortly install a large (300TFlop/s peak) GPU cluster to replace the existing Tesla S1070 nodes. The new hardware will consist of 128 nodes each with 12x 2.6GHz Intel Ivy Bridge cores, 64GB of memory, 2 NVIDIA K20 GPU cards and 2 FDR IB cards.

The upgrade process will commence at 5pm on Friday 25th October, at which time the current Tesla nodes will be withdrawn from service permanently. Production service will continue on the Sandy Bridge and Westmere nodes normally, although the service should be considered at risk while the physical installation and testing of the GPU hardware takes place - we expect this period to last until 10th November, after which use of the new GPU cluster will be phased in.

The intention is to submit a Linpack score for the new hardware, plus an updated score for the Sandy Bridge cluster, into the November Top500 list. This will entail a period of no service somewhere in the 1st-6th November window (to be confirmed).

No service during the weekend of 31st August to 1st September 2013.

22nd August 2013

Please note that due to essential electrical work affecting the new machine room, all Darwin services will be shut down at 17:00 on Friday 30th August and restored on Monday morning (2nd September). Consequently we will be completely unavailable (including logins, filesystem access and the web site) for the entire weekend of 31st August to 1st September.

This disruption is unavoidable and will remove power from the entire site on which we are now hosted. We regret any inconvenience this will cause.

Relocation to the new data centre is complete.

13th August 2013

The relocation of Darwin and all services to the new temporary data centre is complete. Normal production service was restored at midnight.

System maintenance on Tuesday 11th June 2013

4th June 2013

There will be a full system maintenance affecting all services on Tuesday 11th June 2013 commencing at 10:00. Please save all files, quit all applications and log off before 10am.

Power outage morning of Friday 17th May

14th May 2013

Estates have informed us of a partial power outage that will affect our machine room on Friday morning (17th May) between 09:30 and 12:00. Batch jobs will need to be stopped during the blackout but we will attempt to maintain login access.

Update: there will now also be a reboot of all login nodes at 09:30. The login nodes will return immediately.

Web site unavailable from Friday evening 10th May until Monday morning 13th May

7th May 2013

Due to a planned power outage in Mill Lane affecting the HPCS office, the web server (only) will be unavailable during the weekend of 11th-12th (commencing on Friday evening and ending on Monday morning). Darwin service will not be affected.

System maintenance on Tuesday 30th April 2013

24th April 2013

There will be a full system maintenance affecting all services on Tuesday 30th April 2013 commencing at 10:00. Please save all files, quit all applications and log off before 10am.

Login node reboots Friday 1st March 2013

28th February 2013

It is necessary to reboot the login-sand and login-gfx nodes at 11:00 on Friday 1st March. Please move to login-sand5 to avoid this; otherwise please log off the other nodes before 11:00. The compute nodes will continue to process jobs and the login nodes should be back by roughly 11:30.

System maintenance on Tuesday 19th February 2013

12th February 2013

There will be a full system maintenance affecting all services on Tuesday 19th February 2013 commencing at 10:00. Please save all files, quit all applications and log off before 10am.

System maintenance on Thursday 6th December 2012

1st December 2012

There will be a full system maintenance affecting all services on Thursday 6th December 2012 commencing at 10:00 (note the unusual day). Please save all files, quit all applications and log off before 10am.

System maintenance on Wednesday 21st November 2012

13th November 2012

There will be a full system maintenance affecting all services on Wednesday 21st November 2012 commencing at 10:00 (note the unusual day). Please save all files, quit all applications and log off before 10am.

Login nodes will reboot at 18:00 on Wednesday 7th November 2012

7th November 2012

The login nodes (only) will reboot at 18:00. Please save all files, quit all applications and log off before 6pm. Service will be otherwise unaffected.

Login nodes will reboot at 18:00 on Tuesday 16th October 2012

15th October 2012

The login nodes (only) will reboot at 18:00. Please save all files, quit all applications and log off before 6pm. Service will be otherwise unaffected.

System maintenance on Tuesday 2nd October 2012

26th September 2012 (updated 1st October)

There will be a full system maintenance affecting all services on Tuesday 2nd October commencing at 18:00 (note the unusual time). Please save all files, quit all applications and log off before 6pm. Service will be restored on Wednesday.

Service has been restored, system is operating normally

4th September 2012

Service is now operating normally. Please contact us at support@hpc.cam.ac.uk regarding any problems using your account.

Critical issue extending maintenance

28th August 2012

Maintenance has uncovered a critical issue that requires outside support. We are in the process of restoring the service by using a secondary filesystem. We apologise for the inconvenience caused.

Critical issue extending maintenance

22nd August 2012

Maintenance has uncovered a critical issue that requires outside support. The system will remain down until we have resolved the problem. We apologise for the inconvenience caused.

System maintenance on Tuesday 21st August 2012

9th August 2012

There will be a full system maintenance affecting all services on Tuesday 21st August commencing at 10:00. Please save all files, quit all applications and log off before 10am.

System maintenance on Tuesday 31st July 2012

27th July 2012

There will be a full system maintenance affecting all services on Tuesday 31st July commencing at 10:00. Please save all files, quit all applications and log off before 10am.

System maintenance on Tuesday 17th July 10:00

12th July 2012

There will be a full system maintenance affecting all services on Tuesday 17th July commencing at 10:00. This is not expected to take the entire day.

Darwin3 enters service

26th June 2012

The Darwin3 cluster entered production service today. Some user data is still being copied and those users will be unable to log in until this is completed. Please report all issues to support.

System maintenance on Monday 25th June 10:00

21st June 2012

There will be a full system maintenance affecting all services on Monday 25th June commencing at 10:00 (please note the unusual day). This will be the maintenance to merge the old and new clusters and make the new system generally available.

Position 93 on the June 2012 Top500 list

18th June 2012

The Darwin3 cluster has attained position 93 on the June 2012 Top500 list. This makes it currently the fastest (known) x86_64 cluster in the UK.

System maintenance on Thursday 3rd May 09:00

30th April 2012

There will be a full system maintenance affecting all services on Thursday 3rd May commencing at 09:00 (please note the unusual day). This is to allow essential work towards the ongoing system upgrade. During this maintenance the last Woodcrest nodes will be decommissioned.

System maintenance on Wednesday 25th April 09:00

22nd April 2012

There will be a full system maintenance affecting all services on Wednesday 25th April commencing at 09:00 (please note the unusual day). This is to allow essential work towards the ongoing system upgrade.

Pre-upgrade system maintenance on Thursday 22nd March 10:00

19th March 2012

There will be a full system maintenance affecting all services on Thursday 22nd March commencing at 10:00 (please note the unusual day). This will be in order to perform necessary work in preparation for the upgrade which commences next week. We expect to release most of the current system by the end of the day (some westmeres may be retained to clear outstanding benchmark requests).

Please note that this week is the last full week of full Woodcrest service.

Tuesday 13th March: Core switch reboot

12th March 2012

The ethernet core switch will reboot shortly after 10am on Tuesday 13th March. This will create a brief period during which jobs may be interrupted, otherwise service will continue normally.

No service December 17-18

28th November 2011

There will be no service due to urgent work on our building electricity supply during the weekend of December 17-18. All systems will be shut down on the preceding Friday at 17:30, and restored on the following Monday.

System maintenance on Tuesday 22nd November

14th November 2011

There will be full system maintenance affecting all services commencing 10:00 on Tuesday 22nd November.

/scratch2 maintenance on Friday 7th October

30th September 2011

There will be a special maintenance affecting the /scratch2 filesystem only commencing 11:00 on Friday 7th October.

No service September 17-18

2nd September 2011

There will be no service due to urgent work on our building electricity supply during the weekend of September 17-18. All systems will be shut down on the preceding Friday at 17:30, and restored on the following Monday after a period of maintenance.

Limited maintenance Tuesday 23rd August 2011

22nd August 2011

The /scratch filesystem will undergo a corrective action during the maintenance period beginning 10:00 on Tuesday 23rd August, designed to restore full performance following a previous hardware failure. It may be possible to perform this transparently to jobs and interactive sessions, otherwise there may be a brief hiatus affecting access to /scratch. If commands attempting to write to /scratch block, please wait for the filesystem to return.

Some Westmere nodes may also be removed temporarily from service to allow benchmarking work.

Storage maintenance Tuesday 2nd August 2011

1st August 2011

The /scratch2 filesystem will be taken offline for approximately 45 minutes at 10:00 on Tuesday 2nd August for maintenance. Jobs and commands attempting to write to /scratch2 will block until the filesystem returns; please wait for this to occur.

Compute node maintenance Tuesday 19th July

15th July 2011

All compute nodes will reboot commencing 10:00 on Tuesday 19th July. Running jobs will be requeued.

Login node reboots at 18:00 - Friday 15th July 2011

15th July 2011

All login nodes will reboot at 18:00.

Storage maintenance Tuesday 5th July 2011

4th July 2011

The /scratch2 filesystem will be taken offline for approximately 45 minutes at 10:00 on Tuesday 5th July for maintenance. Jobs and commands attempting to write to /scratch2 will block until the filesystem returns; please wait for this to occur.

Storage maintenance Tuesday 28th June 2011

27th June 2011

The /scratch2 filesystem will be taken offline for approximately 30 minutes at 10:00 on Tuesday 28th June for maintenance. Jobs and commands attempting to write to /scratch2 will block until the filesystem returns; please wait for this to occur.

Storage and Westmere maintenance Tuesday 14th June 2011

11th June 2011

The /scratch2 filesystem will be taken offline for approximately 45 minutes at 10:00 on Tuesday 14th June for maintenance. Jobs and commands attempting to write to /scratch2 will block until the filesystem returns; please wait for this to occur. In addition, all Westmere nodes will be dedicated to storage benchmarks from 10:00. This will involve temporary suspension of job processing on these nodes but login access and other compute nodes will continue to operate as normal.

Reduced Westmere maintenance Tuesday 7th June 2011

7th June 2011

There will be a reduced maintenance affecting two computational units of westmere nodes only during the afternoon of Tuesday 7th June. This will involve temporary suspension of job processing on these nodes but login access and other compute nodes will continue to operate as normal.

Storage maintenance Tuesday 24th May 2011

20th May 2011

There will be system maintenance commencing 10:00 on Tuesday 24th May in order to perform an essential change to one storage unit. Login access will be continued throughout, however there will be a temporary suspension of access to /scratch2 and possibly a reboot of all westmere nodes. Commands and jobs attempting to access /scratch2 may hang and the batch queues will be suspended while the change is performed.

Storage maintenance Tuesday 10th May 2011

4th May 2011

There will be system maintenance relating to an upgrade of the storage commencing 10:00 on Tuesday 10th May. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued. This maintenance will allow essential filesystem hardware changes.

Westmere maintenance Tuesday 22nd March 2011

21st March 2011

There will be maintenance affecting the westmere nodes only commencing 10:00 on Tuesday 22nd March. This will involve temporary suspension of job processing on these nodes but login access and other compute nodes will continue as normal.

System maintenance Tuesday 22nd February 2011

17th February 2011

There will be a system maintenance commencing 10:00 on Tuesday 22nd February. This will involve temporary removal of some compute nodes from service but login access and other compute nodes will continue as normal.

System maintenance Tuesday 8th February 2011

3rd February 2011

There will be a full system maintenance commencing 10:00 on Tuesday 8th February. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued. This maintenance will allow necessary filesystem checks, hardware changes, and firmware updates.

Login node reboots to take place at 18:00 on Tuesday 21st December 2010

20th December 2010

All login nodes will reboot commencing 6pm. This will be done in a rolling way without interrupting running jobs:

18:00 bindloe03, bindloe04, pinta02, all mostro and all planck nodes will reboot
18:30 (approx) bindloe01, bindloe02, pinta01 will reboot.

The timing of the second wave of reboots depends slightly on how long the first set take to come back (probably about 20 minutes). These reboots will implement some security related updates. There will also be a rolling reboot of the compute nodes. This will occur as jobs finish releasing nodes and should be transparent.

[CANCELLED] System maintenance Thursday 9th December 2010

5th December 2010

There will be a full system maintenance commencing 10:00 on Thursday 9th December (please note the day). This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued. This maintenance has been postponed from Tuesday following the disruptive power cut over the weekend.

System maintenance Tuesday 19th October 2010

15th October 2010

There will be a full system maintenance commencing 10:00 on Tuesday 19th October. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued. It is hoped that the new Westmere-based nodes (an upgrade of 1500 cores) will be available for general use by payers at the end of the maintenance.

Reboot of all login nodes Tuesday 21st September 2010

21st September 2010

There will be a reboot of all login nodes at 18:00 Tuesday 21st September. Queues will continue to operate and login access should be restored after approximately 30 minutes.

System maintenance Tuesday 14th September 2010

10th September 2010

There will be a full system maintenance commencing 10:00 on Tuesday 14th September. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued.

System maintenance and power shutdown Tuesday 31st August

25th August 2010

There will be a full system maintenance commencing 10:00 on Tuesday 31st August. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued. In addition, due to more essential maintenance by Estates Management on our building power supply, there will be a total power shutdown of our machine room at 17:30 extending into Tuesday evening.

Power shutdown Saturday 24th July

24th June 2010

Due to essential maintenance by Estates Management on our building power supply, there will be a total shutdown of Darwin on the evening of Saturday 24th July commencing at 16:30.

System maintenance Tuesday 1st June

27th May 2010

There will be a full system maintenance commencing 10:00 on Tuesday 1st June. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued.

System maintenance Monday 17th May

10th May 2010

There will be a full system maintenance commencing 10:00 on Monday 17th May. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued.

This has been postponed from Tuesday 11th May due to other events.

System maintenance Tuesday 23rd February

19th February 2010

There will be system maintenance commencing 10:00 on Tuesday 23rd February, followed by a series of special 2000 core job runs on behalf of two projects. This will involve temporary suspension of login access and batch queue processing, and jobs running when maintenance commences will be requeued.

System maintenance Tuesday 2nd February

28th January 2010

There will be system maintenance commencing 10:00 on Tuesday 2nd February. This will involve temporary suspension of login access and batch queue processing. Jobs running when maintenance commences will be requeued.

System maintenance Tuesday 15th December

10th December 2009

There will be system maintenance commencing 10:00 on Tuesday 15th December. This will involve temporary suspension of login access and batch queue processing. Jobs running when maintenance commences will be requeued.

Short network interruption Tuesday 24th November

22nd November 2009

There will be an interruption to remote network connectivity at 10:00 on Tuesday 24th November, for approximately 30 minutes, while the Computing Service upgrades our external network connection. Darwin will continue to function normally but new logins will not be possible and existing logins will appear to hang during the interruption.

System maintenance Tuesday 17th November

13th November 2009

There will be system maintenance commencing 10:00 on Tuesday 17th November in order to upgrade all login and compute nodes to Scientific Linux 5.4. This will involve temporary suspension of login access and batch queue processing. Jobs running when maintenance commences will be requeued.

Login node reboots starting 18:00 Thursday 5th November

5th November 2009

The Darwin login nodes will be rebooting starting at 18:00 tonight (5th November).

  • bindloe01 and bindloe02 will reboot at 18:00
  • bindloe03 and bindloe04 will reboot at 18:30

The reboots should take 15-20 minutes each.

New accounting quarter begins Sunday 1st November

30th October 2009

The current August - October accounting quarter will end at midnight on Saturday 31st October. The next quarter will run from 1st November until 31st January. All SL1 and SL3 projects will receive new core hour allocations at the transition (please see here for full details).

System maintenance on Tuesday 27th October

23rd October 2009

There will be a brief system maintenance commencing 10:00 on Tuesday 27th October involving temporary suspension of login access and batch queue processing. Jobs running when maintenance commences will be requeued.

System maintenance on Monday 12th October

9th October 2009

There will be system maintenance commencing at 08:00 on Monday 12th October (please note the unusual day and time). Login access and batch job processing will be suspended while this work takes place.

System maintenance on Monday 5th October

5th October 2009

There will be system maintenance commencing 11:00 on Monday 5th October (please note the day, this replaces the maintenance previously announced for Thursday) in order to perform essential filesystem work. Login access and batch job processing will be suspended while this work takes place.

SL5 upgrades Tuesday 29th September

28th September 2009

During the morning of Tuesday 29th September, bindloe03 and bindloe04 will be upgraded to Scientific Linux 5. This will temporarily interrupt access to these two login nodes (only). No other services will be affected.

System maintenance Tuesday 1st September

28th August 2009

There will be system maintenance commencing 10:00 on Tuesday 1st September to perform filesystem work.

System maintenance Tuesday 18th August

5th August 2009

There will be system maintenance commencing 10:00 on Tuesday 18th August to allow further benchmarking and upgrade preparation work.

System maintenance Tuesday 11th August

5th August 2009

There will be system maintenance commencing 10:00 on Tuesday 11th August to allow benchmarking and upgrade preparation work.

System maintenance Wednesday 22nd July

20th July 2009

There will be system maintenance commencing 10:00 on Wednesday 22nd July to allow filesystem work. Please note that this has been moved from Tuesday.

Web pages update

9th July 2009

The HPCS web pages have received some much needed attention this week. Please send feedback to support.