Technical Infrastructure Status

We believe in full transparency; everything you see here is 100% live.
RESOLVED
This announcement has been resolved; no further updates are expected.
Maintenance follow up and SAN replacement
This email details the issues suffered during last week's scheduled maintenance and what followed from them.

** What happened? **

As previously announced, we had scheduled maintenance to carry out updates to the underlying software that powers our cloud platform. Because this was a major upgrade, and the first one required since we began using the software, we paid for the work to be carried out by the software makers.

Aside from a small delay shutting down some servers, all was going to plan until it came time to reboot the physical machines. Some didn't boot up within a normal timeframe, and one did not return at all (which was later found to be a RAID card issue).

All cloud servers that had data stored on the failed machine then failed to boot back up. This didn't result in data loss, thanks to the mirrored storage throughout the platform, but manual intervention was required to migrate the servers and boot them from other machines. This was also done by the software makers and took some time, as each server had to be handled individually. Unfortunately, this process added 1-4 hours of downtime for those affected.

Since the update, we have experienced some periods of instability in which some physical machines crashed (resulting in additional downtime for cloud servers hosted on those machines). Along with the makers of the underlying software, we have spent a lot of time troubleshooting and monitoring the situation, which initially looked to have been caused by resources being stretched after we lost some physical hardware during the maintenance.


** The cause and solution **

After a lot of troubleshooting, we found the cause of the instability to be a problem on one of our SANs, the one responsible for storing the secondary drives of cloud servers. This was causing the crashes on some hypervisors, along with other symptoms such as delayed boots and slow data reads/writes in some parts of the cloud.

In response to this, we have reduced the work required of this SAN so that things are now in a stable state throughout the cloud. We still need to replace the unit, which will be done over the next 24 hours.

During this time, we will be removing the secondary drives from all servers. We provision these on all managed servers to store cPanel backups. Because this data is not mission critical, we will not be migrating it to the new SAN. Instead, we will create empty disks, and the backups will be regenerated during the scheduled backup runs.

This work will not cause any downtime to servers. We will be removing the secondary drives and adding the new empty ones from the new SAN while everything remains online. It does, however, mean that the cPanel backup processes will not run until the new drive is in place, so please be aware of this (the cloud server filesystem backups are still active).


** What will happen on future updates? **

From this point forward, no matter how routine we are told these updates will be (they are usually yearly or less often), we'll be putting staff on site at the datacenter to deal with any machine booting issues right away, so as not to cause problems like this again.

In addition, for future work we will be looking to migrate servers off physical machines before doing any reboots, to avoid downtime to clients wherever we can. In practice this is sometimes not possible (such as last week, when the storage systems needed taking offline), but service-affecting maintenance will always be a last resort.


** Will refunds/credits be issued? **

Although we do not have an SLA with uptime guarantees, we have decided to issue credits to clients as a gesture of goodwill, because this situation isn't something that should be considered normal and isn't something we're happy with.

We will be issuing a credit of 10% of the monthly fee for each hour of downtime suffered since the maintenance (outside of the 30 minutes we had scheduled).
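
As a rough illustration of how this works out, the Python sketch below computes a credit from a monthly fee and a total downtime figure. It is not our billing system's actual logic. The function name is hypothetical, and it assumes that partial hours are prorated and that the credit is capped at the full monthly fee; neither point is specified above.

```python
from decimal import Decimal

CREDIT_RATE_PER_HOUR = Decimal("0.10")  # 10% of the monthly fee per hour of downtime
SCHEDULED_MINUTES = 30                  # scheduled maintenance window, not credited


def downtime_credit(monthly_fee, downtime_minutes):
    """Estimate the goodwill credit for one cloud server (hypothetical helper)."""
    fee = Decimal(str(monthly_fee))
    # Only downtime beyond the scheduled 30 minutes counts towards the credit.
    billable_minutes = max(0, downtime_minutes - SCHEDULED_MINUTES)
    # Assumption: partial hours are prorated rather than rounded up or down.
    billable_hours = Decimal(billable_minutes) / Decimal(60)
    credit = fee * CREDIT_RATE_PER_HOUR * billable_hours
    # Assumption: the credit never exceeds the monthly fee itself.
    return min(credit, fee).quantize(Decimal("0.01"))
```

For example, under these assumptions a server on a monthly fee of 50 that was down for 150 minutes in total has 120 creditable minutes (2 hours), giving a credit of 10.00.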

To claim this credit, please email billing@clook.info with your cloud server hostname(s) within the next 30 days. We will then refer to our monitoring systems to get the downtime figures and apply the credit in the billing system.

Thank you for your patience during this time and apologies again for any inconvenience.
Updated by Chris James on 22nd Dec 2014 @ 12:33pm
Maintenance follow up and SAN replacement
As an update to this, the SAN used for secondary drives was replaced earlier today with no problems. We're now seeing much improved performance from the storage unit, with no sign of the problems experienced last week.

We will continue to monitor the systems closely, but all should be stable again across the cloud platform.

Thanks again for your patience.
Updated by Chris James on 23rd Dec 2014 @ 15:12pm