Cloud storage slowness
---
We are aware of an issue currently affecting servers on our Cloud (cluster 1). It appears to be related to the centralised storage that these servers use for their primary drives, which we are investigating at this time. This alert will be updated when we have more information on this issue.
Updated by Chris James on 4th Aug 2022 @ 04:24am
Cloud storage slowness
---
As of approximately 5.15am all cloud servers within this cluster should be back online and functioning as they should. We will be carrying out a full investigation throughout today to establish what happened and why. Once this is complete the RFO (Reason For Outage) will be sent out. At this point we know that the problem was caused by the primary storage cluster running extremely slowly. This appeared to be related to a few drives within the cluster that were showing performance issues; once these were removed, the cluster appeared to return to normal health. It is unclear at this stage whether this is a hardware issue (i.e. the drives have failed) or a software issue. The clustered storage should have been able to handle a few under-performing drives and deal with them automatically, so we intend to find out why this didn't happen. Apologies to all affected by this issue.
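For illustration only, here is a minimal sketch of how under-performing drives might be spotted on a Linux storage node by sampling `/proc/diskstats`. The update does not name the clustered storage software or the tooling actually used, so the sampling window and latency threshold below are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch: flag drives with unusually high average I/O latency
by sampling /proc/diskstats over a short window. The threshold and window
are assumed values, not figures from the incident."""
import time

SAMPLE_SECONDS = 10       # how long to sample the I/O counters (assumed)
LATENCY_WARN_MS = 20.0    # per-I/O latency above this gets flagged (assumed)

def read_diskstats():
    """Return {device: (total I/Os completed, total ms spent on I/O)}."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # Skip loop/ram/device-mapper devices and sdX partitions.
            if name.startswith(("loop", "ram", "dm-")):
                continue
            if name.startswith("sd") and name[-1].isdigit():
                continue
            reads, writes = int(fields[3]), int(fields[7])
            ms_reading, ms_writing = int(fields[6]), int(fields[10])
            stats[name] = (reads + writes, ms_reading + ms_writing)
    return stats

before = read_diskstats()
time.sleep(SAMPLE_SECONDS)
after = read_diskstats()

for dev, (ios_before, ms_before) in sorted(before.items()):
    ios_after, ms_after = after.get(dev, (ios_before, ms_before))
    delta_ios, delta_ms = ios_after - ios_before, ms_after - ms_before
    if delta_ios == 0:
        continue  # device was idle during the sample window
    avg_ms = delta_ms / delta_ios
    flag = "  <-- under-performing?" if avg_ms > LATENCY_WARN_MS else ""
    print(f"{dev}: {delta_ios} I/Os, avg {avg_ms:.1f} ms per I/O{flag}")
```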
Updated by Chris James on 4th Aug 2022 @ 05:43am
Cloud storage slowness
---
The RFO for this outage is delayed while we continue to investigate. We are working closely with our software vendors to establish what happened and will provide the RFO when we have some conclusive information. At this point it continues to look like a software issue or bug in our clustered storage software that started the chain of events. With many hundreds of MB of log files to go through, the process of tracking down the specific cause is taking longer than anticipated.
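As a rough illustration of the kind of log triage involved, the sketch below narrows large log files down to lines in the outage window that match a few error-like keywords. The real log locations, timestamp format and keywords are not given in the update, so everything below is assumed.

```python
#!/usr/bin/env python3
"""Illustrative sketch: pull lines from large log files that fall inside
the outage window and match error-like keywords. Paths, timestamp format
and keywords are assumptions, not details from the incident."""
import glob
import re

LOG_GLOB = "/var/log/storage-cluster/*.log"             # assumed location
WINDOW_START, WINDOW_END = "2022-08-04 03:30", "2022-08-04 05:30"
KEYWORDS = re.compile(r"error|timeout|slow|fail", re.IGNORECASE)

for path in sorted(glob.glob(LOG_GLOB)):
    with open(path, errors="replace") as f:
        for line in f:
            timestamp = line[:16]  # assumes lines start "YYYY-MM-DD HH:MM"
            if WINDOW_START <= timestamp <= WINDOW_END and KEYWORDS.search(line):
                print(f"{path}: {line.rstrip()}")
```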
Updated by Chris James on 5th Aug 2022 @ 11:45am
Cloud storage slowness
---
We are continuing to work with our software vendors to determine the cause of last week's problem and, importantly, ensure that it does not happen again. On the advice of the software vendors we will be carrying out some work this evening starting at 8pm. Apologies for the short notice, but the work is not expected to be service-impacting and will hopefully resolve this incident. Firstly, we will be updating the firmware on the network cards attached to the storage network. This involves rebooting the physical machines, but no downtime is expected for Cloud servers because we will move everything off each node before applying the firmware updates. Following the firmware updates we will be changing some settings on the clustered storage software to make the platform replicate data more efficiently. This will cause a period of high network usage on the cluster as data re-balances, but it is not expected to cause any downtime. A further update will be provided when this work is complete.
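For illustration of the drain step described above, here is a minimal sketch assuming a KVM/libvirt-based node; the update does not name the virtualisation platform, and the connection URIs below are placeholders.

```python
#!/usr/bin/env python3
"""Illustrative sketch: drain a hypervisor node before a firmware reboot by
live-migrating its running guests to another node. The KVM/libvirt platform
and the node URIs are assumptions, not details from the update."""
import libvirt  # provided by the libvirt-python package

SOURCE_URI = "qemu+ssh://node01/system"  # node about to be rebooted (placeholder)
DEST_URI = "qemu+ssh://node02/system"    # node receiving the guests (placeholder)

src = libvirt.open(SOURCE_URI)
dst = libvirt.open(DEST_URI)

for dom in src.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    print(f"live-migrating {dom.name()} ...")
    # VIR_MIGRATE_LIVE keeps the guest running while its memory is copied,
    # which is what makes the firmware work non-service-impacting.
    dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)
    print(f"{dom.name()} is now running on the destination node")

src.close()
dst.close()
```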
Updated by Chris James on 8th Aug 2022 @ 15:50
Cloud storage slowness
---
Last night's work was completed with everything going to plan and all Cloud servers remaining online throughout. The data re-balance on the storage cluster is still ongoing and should take 24-48 hours to complete. During this time blocks of data are being "reorganised" across the cluster, which has no impact on client Cloud servers due to the low priority of the re-balance. We are continuing to work closely with our software vendors to get further data about last week's issue on the storage cluster and hope to have the RFO within the next few days.
Updated by Chris James on 9th Aug 2022 @ 09:11am
Cloud storage slowness
---
I am pleased to report that we have found the cause of the issue that affected the storage cluster last week. It would appear that there is a bug in the kernel of the node operating systems that affects some packets transferred across our storage LAN in very specific circumstances (when using the particular hardware and firmware that we use). Overnight last night we reverted to a slightly older kernel (not service-impacting), which has resolved the problem. The full RFO (Reason For Outage) document can be seen here: https://my.clook.net/docs/RFO-20220810.pdf Apologies to all clients affected by the incident last week.
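For illustration, a minimal sketch of verifying that every node is running the expected kernel after a rollback like this; the hostnames and version string are placeholders, since the update does not give them.

```python
#!/usr/bin/env python3
"""Illustrative sketch: confirm each storage node reports the expected
kernel after a rollback. Hostnames and the target version string are
placeholders, not values from the incident report."""
import subprocess

EXPECTED_KERNEL = "5.4.0-120-generic"      # assumed "known good" kernel
NODES = ["node01", "node02", "node03"]     # hypothetical node names

for node in NODES:
    result = subprocess.run(
        ["ssh", node, "uname", "-r"],
        capture_output=True, text=True, check=True,
    )
    running = result.stdout.strip()
    status = "OK" if running == EXPECTED_KERNEL else "MISMATCH"
    print(f"{node}: {running} [{status}]")
```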
Updated by Chris James on 10th Aug 2022 @ 16:35