Clook Status - Cloud platform disruption

RESOLVED
This announcement has been resolved; no further updates are expected.

Cloud platform disruption

We are aware of a major issue affecting the storage systems on the majority of our Cloud v2 servers. The storage cluster has been recovered and is now serving data again, but a small number of Cloud servers are not yet back online. Our monitoring is alerting us to these and we are working on the affected servers at the moment.

An update will be sent when every server is back online. We will also be carrying out a full investigation overnight into what happened and will send full details when we have them.

Updated by Chris James on 14th Nov 2020 @ 21:49

Cloud platform disruption
All servers affected by this issue should now be back online and functioning normally. We are continuing to monitor, but feel free to contact us if you see any issues on your server.
Updated by Chris James on 14th Nov 2020 @ 23:04

Cloud platform disruption

To update this announcement: since all servers were brought back online on Saturday evening, everything has been stable and we have been working to find the cause of the problem. This work is still ongoing, as it involves very large log files and conversations with third-party software vendors, but we hope to have something more detailed soon.

In summary, the storage cluster lost connectivity with a large number of physical hard drives, resulting in them being treated as failed and removed from the cluster. Because of the number of drives involved, the cluster did not have enough drives left available to rebalance data and continue serving disks. At the time of the issue we ruled out physical hardware problems with the drives and manually added them back to the storage cluster. Once this was complete and data had rebalanced, all storage came back online.
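
As a rough illustration of the capacity arithmetic described above, the sketch below checks whether the surviving drives could hold all stored data after a number of drives are removed. Every figure and name in it (drive counts, drive size, fill limit, `can_rebalance`) is a made-up assumption for illustration and is not taken from this incident or from any particular storage software.

```python
def can_rebalance(total_drives, failed_drives, drive_capacity_tb,
                  data_stored_tb, max_fill_ratio=0.85):
    """Return True if the surviving drives can hold all stored data.

    max_fill_ratio is a hypothetical safety limit: the fraction of raw
    capacity the cluster is allowed to fill before it stops accepting data.
    """
    surviving = total_drives - failed_drives
    usable_tb = surviving * drive_capacity_tb * max_fill_ratio
    return data_stored_tb <= usable_tb


# Hypothetical cluster: 100 drives of 4 TB each holding 200 TB (replicas included).
few_lost = can_rebalance(100, 5, 4.0, 200.0)    # 95 drives -> 323 TB usable
many_lost = can_rebalance(100, 45, 4.0, 200.0)  # 55 drives -> 187 TB usable
print(few_lost, many_lost)
```

With these invented numbers, losing a handful of drives leaves enough room to rebalance, while losing a large share of them at once does not, which is the kind of situation the summary above describes.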

We do not believe there is a high chance of this happening again in the immediate future, and we're continuing to focus resources on finding the exact cause so that a fix can be implemented.

Updated by Chris James on 17th Nov 2020 @ 08:40

Cloud platform disruption

We have been making progress while working on this issue. To summarise, the problem on 14th November stemmed from our storage cluster not being able to send and receive enough traffic over the storage network to rebalance data across all drives in the cluster.

While this could be caused by software on each node or by limitations at the node level, our tests in this area have shown no reason for network speed on the servers to be reduced so severely. However, our log data shows that a 40G network switch on this network may have hit a capacity ceiling far below the level it is capable of, which then caused the snowball effect that resulted in the evening's problems. For this reason we have concentrated our efforts on the network switches.
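
To show why a throughput ceiling matters so much for rebalancing, the sketch below estimates how long it takes to move a given amount of data at different effective link speeds. The data volume and the reduced speed are hypothetical values chosen for illustration; only the 40G nominal switch speed comes from the text above, and `rebalance_hours` is an invented helper name.

```python
def rebalance_hours(data_to_move_tb, link_gbit_per_s):
    """Estimate hours needed to move data over a link at a sustained rate.

    Uses decimal units (1 TB = 8e12 bits, 1 Gbit = 1e9 bits) and ignores
    protocol overhead, so the result is a lower bound.
    """
    bits_to_move = data_to_move_tb * 8e12
    seconds = bits_to_move / (link_gbit_per_s * 1e9)
    return seconds / 3600


# Hypothetical: 10 TB to rebalance at the nominal 40 Gbit/s versus a
# switch that is (for illustration) capped at a tenth of that.
for speed in (40, 4):
    print(f"{speed} Gbit/s: {rebalance_hours(10, speed):.2f} hours")
```

In this invented case the same rebalance takes ten times longer on the capped link, which gives a sense of how a reduced ceiling on one switch could let a backlog build up.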

Our network administrators, along with engineers from the switch vendors, identified a possible corruption in the firmware of the switch that was acting as the primary on the 14th, which may have been part of the problem. Because of this, both switches in the redundant pair have been updated. We were able to do this without any impact on service thanks to the redundant nature of the pair.

Early tests are good, so we're hopeful this has fixed the problem and removed any chance of it happening again. However, to make sure, we want to run a full test that replicates some of the circumstances of 14th November and generates more traffic than is normal day to day, to ensure the switches handle it correctly.

All tests are planned to stay well within the capability of the switches, so we're not expecting any issues. However, in the unlikely event that the throughput problem is not resolved, the tests may cause some disruption to services. Our sys-admin team will be present during the tests, so in the worst-case scenario of noticeable issues across the cluster they will act within seconds, which should result in a far lower impact than we saw on the 14th (where the problem snowballed for around 30 minutes after starting).

Due to the above, we are scheduling an AT-RISK window between 7pm and 9pm tonight (26/11/20), during which we will be running our tests. There is a small chance that this will be noticeable to clients. This announcement will be updated when complete.

Updated by Chris James on 26th Nov 2020 @ 12:11

Cloud platform disruption
We are now beginning the work. Further updates will follow.
Updated by James Scott on 26th Nov 2020 @ 19:00

Cloud platform disruption
All tests were completed within the 7pm-9pm window, so the AT-RISK window is now closed. Early signs are good, but we will be analysing all test data tomorrow. An update will be provided when we have completed the analysis.
Updated by Chris James on 26th Nov 2020 @ 21:28

Cloud platform disruption

The tests we ran last week went well and showed the expected results, with no interruption to service throughout. We plan to run the same tests on the second switch in the redundant pair to confirm that the switch updates have fixed the issue we saw on 14th November. Again, we don't expect any problems, but to make sure we're best prepared in the event that something does happen, our sys-admin staff will be on site during the tests.

We plan to test the switch throughput this evening (30/11/20) between 7.30pm and 9.30pm. As with the last test, we expect no problems but are categorising the testing period as an AT-RISK window.

This announcement will be updated when complete.

Updated by Chris James on 30th Nov 2020 @ 11:16

Cloud platform disruption
This AT-RISK window is now closed, with all tests completed successfully. Everything went smoothly and the tests caused no service disruption. A more detailed update to this announcement will be provided tomorrow, after we have analysed all test data.
Updated by Chris James on 30th Nov 2020 @ 20:48

Cloud platform disruption

Following on from the work we have been doing to fix this problem and test the fix, I'm happy to report that all tests have been successful and the issue is now being marked as resolved.

After firmware updates and complete resets of both storage switches in the redundant pair, we tested both by replicating the levels of traffic seen on 14th November (when one switch was severely limited in throughput), and both performed exactly as expected. Following the successful tests, we carried out some maintenance (not service impacting) that involved rebalancing storage data across the network. This also went exactly as it should, with full throughput shown on each switch and no impact to service.

As such, we have determined that the problem was caused by the previously identified firmware corruption on one switch, and that the update and reset process resolved it.

The events of 14th November also exposed some weaknesses in our emergency procedures: these simply weren't followed, resulting in phone calls not being answered, tickets not getting replies and announcements not being sent in a timely manner.

The problem happened on a Saturday evening, when normal support activity is very low, so we were staffed accordingly. Throughout the problems, ticket and phone call numbers were approximately 10 times what they normally are, which overloaded those on shift. In addition, with critical issues like this being so infrequent, our emergency procedures were not at the front of mind and were not followed in time. Following them would have got information out to clients to let them know that we were aware of a problem and working on it.

All procedures have been fully reviewed since the evening of this problem, and all staff have been refamiliarised with them. We have always been proud of our communication and transparency, even in the midst of a critical problem, and we aim to continue this into the future. While we aim to keep problems to an absolute minimum, all clients can be assured that if something does happen, we will get any information we have out as soon as possible and keep information flowing throughout.

Updated by Chris James on 4th Dec 2020 @ 10:02

03300 885 250

Technical Infrastructure Status

We believe in full transparency: everything you see here is 100% live.
