Technical Infrastructure Status

We believe in full transparency: everything you see here is 100% live.
RESOLVED
This announcement has been resolved; no further updates are expected.
Network switch outage
We are investigating a network switch outage that is causing some servers to become unavailable and we hope to have further information shortly.
Updated by Carl G-M. on 9th Sep 2022 @ 07:46am
Network switch outage
The switch is back online. We are continuing to investigate the cause of the outage and will provide further information.
Updated by Carl G-M. on 9th Sep 2022 @ 08:44am
Network switch outage
The affected switch is now back online with all connectivity restored to affected cloud servers. We are currently investigating why the switch went offline and also why a few nodes within the cluster did not fail over to the other switch in the redundant pair. This issue affected approximately 15% of cloud servers hosted within this cluster.

Further updates will be sent as our investigation progresses.
Updated by Chris James on 9th Sep 2022 @ 08:49am
Network switch outage
Throughout today we have been investigating the switch problem that caused some cloud servers to go offline this morning. To summarise the three factors involved:

1. One of the "public" switches failed:

Extreme (the switch vendor) have gathered up all logs and are investigating what happened on the switch at the time of the problem. We are awaiting the outcome of their investigation.


2. Some nodes within the cluster didn't fail over to the other switch in the redundant pair:

This is a concern as we have invested heavily in redundant pairs of network equipment to safeguard against this scenario. Upon investigating why some nodes did not fail over to the other switch, we found that the network ports on the affected switch remained in an "up" state, so the nodes continued to send traffic through the links as configured.

As this element of the issue was caused by the switch not passing any traffic, despite its ports being in an "up" state, this too is being investigated by Extreme.
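
To illustrate why a port that stays "up" can defeat failover, here is a minimal Python sketch comparing a selection rule based only on link state with one that also requires an active reachability probe. It is an illustration under stated assumptions, not a description of how our nodes or the Extreme switches are configured: the `Uplink` type, the `probe_ok` flag and the names `switch-a` and `switch-b` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str       # hypothetical label for the switch this uplink connects to
    link_up: bool   # port state as reported on the link
    probe_ok: bool  # assumed result of an active reachability probe over this uplink


def choose_by_link_state(uplinks):
    """Return the first uplink whose port is 'up' (link-state-only monitoring)."""
    for uplink in uplinks:
        if uplink.link_up:
            return uplink.name
    return None


def choose_by_probe(uplinks):
    """Return the first uplink whose port is 'up' and which also passes the probe."""
    for uplink in uplinks:
        if uplink.link_up and uplink.probe_ok:
            return uplink.name
    return None


# Scenario similar to the one described above: the first switch keeps its
# ports "up" but passes no traffic, while the second switch is healthy.
uplinks = [
    Uplink("switch-a", link_up=True, probe_ok=False),
    Uplink("switch-b", link_up=True, probe_ok=True),
]

link_state_choice = choose_by_link_state(uplinks)
probe_choice = choose_by_probe(uplinks)
print(link_state_choice, probe_choice)
```

In this sketch the link-state rule keeps selecting the failed switch because its port still reports "up", whereas the probe-based rule moves to the healthy switch. Whether active probing is appropriate for a given cluster depends on the equipment and its configuration.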


3. The reboot of the switch was delayed due to remote reboot capabilities being limited:

When it became clear that the switch needed to be power-cycled we were unable to access our full remote reboot facility due to part of it being connected to the faulty switch. Due to this, the reboot was delayed by approx 15-20 mins while the task was carried out on site.

This has been rectified today by connecting remote reboot facilities to an "out-of-band" connection completely separate from our network and equipment. While not a solution to the underlying problem, we can now be sure that remote power cycles of network equipment are possible even if our network is having problems.
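
The dependency described above can be expressed as a simple check: a remote reboot path should not cross any of the production equipment it may be needed to recover. The following Python sketch is illustrative only. The device names, the `oob-link` label and the shape of the path data are hypothetical and do not reflect our actual topology.

```python
def unsafe_reboot_paths(reboot_paths, production_devices):
    """Return the reboot facilities whose access path crosses production equipment."""
    unsafe = []
    for facility, hops in reboot_paths.items():
        if any(hop in production_devices for hop in hops):
            unsafe.append(facility)
    return sorted(unsafe)


# Hypothetical production network equipment.
production = {"public-switch-1", "public-switch-2"}

# Hypothetical "before" layout: one reboot facility is reached through a
# production switch, so it becomes unreachable if that switch fails.
before = {
    "power-controller-a": ["public-switch-1"],
    "power-controller-b": ["oob-link"],
}

# Hypothetical "after" layout: every reboot facility is reached out-of-band.
after = {
    "power-controller-a": ["oob-link"],
    "power-controller-b": ["oob-link"],
}

unsafe_before = unsafe_reboot_paths(before, production)
unsafe_after = unsafe_reboot_paths(after, production)
print(unsafe_before, unsafe_after)
```

An empty result for the "after" layout indicates that, in this model, no reboot facility relies on the equipment it is meant to power-cycle.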

Another update will be provided to this alert when Extreme have completed their investigation.
Updated by Chris James on 9th Sep 2022 @ 10:05pm