XHQ.E1.092
SW-HH-AL-271 was available
SW-HH-AL-272 came back once the DL switch was back (see below, ca. 09:33)
SW-HH-AL-273 had to be power cycled and rebooted by NOC through console to come back fully (ca. 10:52).
SASE1 balcony room 38
restarted at 4am:
DELL servers:
exflprx03n0
exflprx04n0
exflprx112
exflpcl01n0 to 08n0
exflpcl18n0
SuperMicro servers:
exflonc09 to 12 (4-in-1 chassis)
exflong06
IBM ESS systems stayed off, as well as storage JBODs:
exfl-ons-gs111 and 112
exfl-ons-gl111 and 112
exfl-ons-gl113 and 114
IBM server (hosting ems servers for OS installations) was off as well
exflkvm01
SuperMicro servers stayed off:
exflong03 to 05
Juniper (management network) switch was stuck at booting and was power cycled. It finally came back after a long time, but we were lacking the management network across the whole site.
SW-HH-AL-DL-07 (office network distribution layer) switch was stuck at booting and came back after power cycling. Some (not all) office switches were unavailable because of this; seemingly the failover between AL and DL didn't work. Both this switch and the (permanently available) ...DL-08 were correctly cabled to separate PDUs.
The second fuse in BR38 rack 7 was out and had to be pushed back in manually (which made power redundant again).
SA1 tunnel:
exflcon08n0 restarted at 4am, but apparently the switch in the tunnel was off until 10:30; the server has been accessible since then.
E0.604
All systems were affected. We need a clear statement from YS on why the second line did not work.
DELL servers rebooted at 4am:
exflprx01n0
exflprx02n0
exflpcl51n0
exflpcl52n0
exflpcl53n0
exflcon03n0
exflcon04n0
exflcon05n0
exflcon150
exflcon153
exflcon156
exflcon157
exflons01 and 02
exfldaq01 and 02 (GitLab CIs)
exflst105 (Karabo project DB - development and labs)
IBM servers stayed off:
exflpcl54n0
exflpcl55n0
exflpcl56n0
exflpcl57n0
exflonc41
IBM ESS system stayed off:
exfl-ons-gs901 and 902
There are also several servers (Proxmox, serving the OnlyOffice VM) that all rebooted at 4am.
List of h/w failures:
exflpcl05 PSU dead
exflong04 and 05 PSU dead (in exflong04 it had actually been dead since yesterday, after the reboot for OS updates)
may not be related to the power glitch
exflprx106 RAM DIMM error
Intel server:
exflonc21 to 24 (4-in-1 chassis) PSU dead