
Technical Infrastructure Status

We believe in full transparency; everything you see here is 100% live.
RESOLVED
This announcement has been resolved; no further updates are expected.
Cloud Network Issues
We're currently seeing poor connectivity on the network serving our Cloud Server infrastructure. This is being investigated and we hope to have things back to normal as soon as possible.
Updated by Ben H. on 6th May 2021 @ 16:46
Cloud Network Issues
Connectivity has been restored. Our network engineers have continued to monitor the situation and the network has remained stable throughout their observation.

This evening we will continue emergency testing onsite at our datacentre. This testing is coordinated by our lead network engineer and lead systems administrator. We do not anticipate further networking issues; however, we will update this notification once the onsite testing has been concluded and the network confirmed stable.
Updated by Anthony German on 6th May 2021 @ 18:59
Cloud Network Issues
We are aware that some clients are seeing intermittent connectivity issues relating to the ongoing work this evening - we are actively working on resolving this.
Updated by Sam Pizzey on 6th May 2021 @ 20:53
Cloud Network Issues
All network issues were resolved earlier after tonight's work. While these issues were not expected, our lead systems administrator and network administrator worked on them right away to restore full network connectivity.

A full RFO will follow, but we are confident that the cause of the issue will be found. We are planning a full and thorough review of all networking device configs with our lead technical team and would like to run further tests at a later date to confirm the issue is fully fixed. A separate announcement will be sent about these tests; they are not expected to cause problems, but after tonight they will be scheduled as far outside of peak hours as possible.
Updated by Chris James on 6th May 2021 @ 22:17
Cloud Network Issues
The investigation into this issue has been continuing as a matter of urgency and with the aid of engineers from Extreme (our network equipment vendor). Following discussions with Extreme last night, we appear to have a breakthrough.

I can confirm that the issue seen on Thursday was the same issue experienced last week and is related to running two Cloud platforms using the same blocks of IP addresses. When it happened last week, we fast-tracked the final few migrations from our old platform and completely turned off what was left, which we believed had resolved the problem. When it happened again this week, it became clear that the presence of the old platform wasn't the problem and that it instead appeared to be related to the network equipment.

Network devices maintain what is called an ARP table, which maps IP addresses to MAC addresses for devices on the network. This problem stems from some of our network equipment storing MAC addresses from our old Cloud platform against IP addresses that are now on our new platform (with different MAC addresses). This results in traffic not flowing to those IP addresses, which then snowballs into a scenario where approximately 70% of Cloud servers lose network connectivity, as we saw again on Thursday.
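
As a rough illustration only (hypothetical addresses, not our actual equipment or configuration), an ARP table can be thought of as a simple mapping from IP address to MAC address; if an entry still points at a MAC address from the retired platform, traffic for that IP is sent towards a device that no longer exists:

    # Hypothetical sketch: an ARP table as a plain IP -> MAC mapping.
    arp_table = {
        "192.0.2.10": "aa:aa:aa:aa:aa:01",  # stale entry: MAC from the old, retired platform
        "192.0.2.11": "bb:bb:bb:bb:bb:02",  # correct entry: MAC on the new platform
    }

    # MAC addresses that actually exist on the new platform.
    live_macs = {"bb:bb:bb:bb:bb:01", "bb:bb:bb:bb:bb:02"}

    def deliver(ip: str) -> bool:
        # Traffic for an IP is sent to the MAC listed in the ARP table;
        # if that MAC is no longer on the network, the traffic goes nowhere.
        return arp_table.get(ip) in live_macs

    for ip in arp_table:
        print(ip, "reachable" if deliver(ip) else "unreachable (stale ARP entry)")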

It has been confirmed by Extreme that there is a bug in the router operating system code that triggers in very specific circumstances, whereby ARP tables get stuck and outdated. This can snowball when the network devices share their ARP table (normal behaviour) and that data happens to be outdated/wrong. In our setup, where we had two platforms online and configured in the network, this very specific circumstance was met and the bug showed itself.
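
To give a feel for how this snowballs (again a simplified, hypothetical sketch rather than how the Extreme devices actually behave internally), if devices exchange their ARP tables and one of them holds a stuck, outdated entry, the bad mapping spreads to its peers:

    # Simplified sketch of stale ARP data spreading between devices that share tables.
    # Real switches age out and re-learn entries; the bug described above is reported
    # to leave entries stuck, which is what this models.
    device_a = {"192.0.2.10": "aa:aa:aa:aa:aa:01"}  # stuck, outdated entry
    device_b = {"192.0.2.10": "bb:bb:bb:bb:bb:01"}  # correct entry for the new platform

    def share(source: dict, target: dict) -> None:
        # Copy entries from one device's table into another's (normal behaviour);
        # with a stuck entry, the outdated data overwrites the correct one.
        target.update(source)

    share(device_a, device_b)
    print(device_b)  # {'192.0.2.10': 'aa:aa:aa:aa:aa:01'} - both devices are now wrong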

Extreme will be releasing a firmware update to address this bug. However, as we no longer have our old Cloud platform online, we have removed all related configuration from our network equipment that Extreme have confirmed triggers this bug, so we can say with some certainty that it shouldn't happen again.

In order to take our 99% certainty that this is the cause to 100%, we would like to run a redundancy check on our switches whereby one of the primary switches is reloaded (to take it out of the redundant pair) to ensure that the other switch takes over seamlessly. We have carried out this test on numerous occasions without issue, but as the same test didn't go well on Thursday night, we will be running this one with maximum caution, with all lead staff in attendance along with Extreme engineers. In addition, despite being confident that it will not be service impacting, the timing of the test will be as far outside of business hours as possible.
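
For anyone curious what "seamlessly" means in practice: the check is essentially that reachability is maintained while one switch reloads. A very rough, client-side sketch of how that could be observed (the host and port below are placeholders, not a tool we necessarily use) is to probe a service continuously during the test window and log any gap:

    # Rough sketch: probe a host repeatedly during the switch reload and log any
    # window where connections fail. HOST/PORT are hypothetical placeholders.
    import socket
    import time

    HOST, PORT = "example.com", 443  # a service reached through the redundant pair
    TIMEOUT, INTERVAL = 2.0, 1.0

    def reachable() -> bool:
        try:
            with socket.create_connection((HOST, PORT), timeout=TIMEOUT):
                return True
        except OSError:
            return False

    for _ in range(600):  # roughly ten minutes of observation
        print(time.strftime("%H:%M:%S"), "OK" if reachable() else "FAILED")
        time.sleep(INTERVAL)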

A separate notification will be sent when we can confirm the date/time of the test which will most likely be in 7-14 days.

Apologies to those affected by this issue. When we designed our new Cloud platform we put reliability and redundancy at the top of our priorities, not only to benefit all clients hosted on it, but also to benefit us, so we didn't have to deal with the reliability issues that were starting to show in the old platform. What we saw this week (and back in November) was certainly not in the plan back then and should not be considered "the norm" going forward. I have every confidence that we have seen the back of this particular issue, and with us now fully migrated from our old Cloud platform, things will be much more stable going forwards.
Updated by Chris James on 8th May 2021 @ 11:30am