April 4, 2025 11:42AM EDT
[Monitoring] All clusters are operational. We will continue to monitor.
April 4, 2025 11:36AM EDT
[Monitoring] Phoenix and Buzzard clusters released for user workloads.
April 4, 2025 10:59AM EDT
[Monitoring] Firebird cluster released for user workloads.
April 4, 2025 10:33AM EDT
[Monitoring] Hive cluster released for user workloads.
April 4, 2025 9:12AM EDT
[Monitoring] ICE cluster released for user workloads.
April 4, 2025 8:08AM EDT
[Monitoring] The clusters are being powered up and tested. They will be returned to service as soon as they are ready. Updating the status back to "service disruption".
April 3, 2025 10:26PM EDT
[Monitoring] The controller for the system providing cooling to nodes in the Coda Research Hall has been restored. We have returned to the HTCP lineup and are in normal operation.
April 3, 2025 11:07AM EDT
[Identified] Some compute nodes on ICE were accidentally powered off last night, which may have impacted some running jobs. We have restored a subset of those nodes to service so that all hardware types are available.
There was a brief pause in the scheduler this morning from 9:17AM to 9:41AM, which may have prevented jobs from starting during that time. Most ICE compute nodes are currently available for course usage.
April 3, 2025 9:56AM EDT
[Identified] Our vendors are working to restore cooling capabilities to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00pm ET.
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and the systems pass stability testing following the shutdown. Clusters will be released as testing is completed for each system.
April 2, 2025 9:47PM EDT
[Identified] It has been determined that our water pump controller will need to be replaced, and we are currently coordinating with the support vendor on this replacement process.
April 2, 2025 6:25PM EDT
[Identified] The water pump controller has failed, affecting the cooling of the research hall. The support vendor has been engaged and is assessing the situation.
April 2, 2025 5:51PM EDT
[Investigating] Due to continued high temperatures, all Phoenix compute nodes have been turned off, and all running jobs have been cancelled. Impacted jobs will be refunded at the end of April.
April 2, 2025 5:25PM EDT
[Investigating] The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.
April 2, 2025 5:12PM EDT
[Investigating] All Hive nodes are powered off. All jobs failed.
All Buzzard nodes are powered off. All jobs failed (though presumably requeued).
All new jobs on Phoenix are held.
All idle nodes on Phoenix are being turned off.
All Firebird nodes are powered off. All jobs failed.
April 2, 2025 5:08PM EDT
[Investigating] A cooling controller failed at the data center. Shutting down PACE clusters.