Clook Status - Cloud Infrastructure Outage

RESOLVED
This announcement has been resolved; no further updates are expected.

Cloud Infrastructure Outage | Clook Internet

This email concerns cloud customers. If you are not a cloud customer, you can disregard it.

Today we have experienced a major issue affecting some cloud server customers. We currently have staff working on this issue at the data centre, and once it is resolved we will conduct a full investigation and provide full details. We appreciate your ongoing patience whilst we resolve this.

Unfortunately, due to the nature of the issue we cannot provide an ETA. We are working to get your services restored to working order ASAP.

Updated by James Scott on 17th Feb 2019 @ 16:15pm

Cloud Infrastructure Outage | Clook Internet

The root cause of the issue has been identified and a mitigation has been put in place, allowing us to begin bringing VMs back up. If your VM is not yet back online, we are in the process of booting it now and expect it to be online within the next hour.

I will follow up with further updates as more information becomes available.

Updated by Sam Pizzey on 17th Feb 2019 @ 19:07pm

Cloud Infrastructure Outage | Clook Internet

After multiple hardware changes we are confident that the hardware issues are resolved and the network is stable. We are working our way through rebooting cloud servers and will continue through the night until complete.

This has been a long and complex issue. Priority is on getting all remaining cloud servers back online after which we will work closely with our hardware and software vendors to fully investigate and analyse the events that have happened.

Updated by Chris James on 18th Feb 2019 @ 00:56am

Cloud Infrastructure Outage | Clook Internet

RFO (Reason For Outage) Report
Date: Sunday 17th February, 2019
Affected service: Cloud network, some cloud servers
Problem description: Network switch intermittent failure, causing various issues on cloud storage network
Impact: Some cloud servers unavailable.

At approximately 07:55 on Sunday, our monitoring system showed that some of the cloud system hypervisors (HVs) were not responding. Efforts were made to reboot these hypervisors, and they were brought back into service by 08:30. As cloud servers were being booted we continued to see random failures, and it was apparent the problem was not fixed.

The vendor of our cloud platform software continued their investigation and at 11:24 advised us of a networking issue on the storage (back-end) network - which is what the cloud servers use to share data so servers can migrate from hypervisor to hypervisor.

Our upstream network vendor investigated the issue and at 12:21 was able to advise us of an intermittent fault in the link between two of the switches on the back-end network. This was causing instability in the HVs and affecting communication across the storage network, so drives were becoming unavailable. Further investigation and attempts to recover the switch took place.

At 16:19 one of the switches was replaced. This seemed to cure the instability and the HVs and virtual machines (VMs) were brought back online.

More issues were detected at 17:00, and further investigation with the network vendor took place. An engineer from the network switch vendor was dispatched to the data centre.

The network switch vendor replaced a faulty module in another of the switches at 21:45. This finally resolved the instability.

By 22:00 the failing HVs were brought online once more, and around 22:30 all the VMs were being started.

At 23:00 the network vendor declared the network switches stable, and then throughout the night our support team assisted in bringing up servers which required filesystem checks, and also solved any outstanding cloud system errors.

Measures have been taken with our network vendor to provide us with better monitoring of the storage network in future, and to ensure better troubleshooting in the unlikely event that this issue recurs. We are also liaising with the switch vendor to establish whether there is an ongoing concern with the hardware in place and whether any additional redundancy can be introduced to this segment of the network.

We'd like to reiterate that for the 38% of cloud servers affected by this outage, no data was lost in the incident and the backups remained completely intact. No server had to be restored from a backup.

Updated by Arran Short on 19th Feb 2019 @ 15:07pm

03300 885 250

Technical Infrastructure Status

We believe in full transparency: everything you see here is 100% live.