Microsoft Azure recovers from outage

Clare Hopping

11 Jan, 2019

Microsoft Azure’s UK South storage region suffered an outage yesterday, just a day after the company debuted its Azure Data Box Disk.

Just after lunchtime, customers started reporting their services were down. Some said their Azure accounts were unavailable, while others said they could only see a spinning wheel when trying to access the cloud service.

The problems began on Azure Storage, but spread to other services, including App and Virtual Machines, with the company’s status page showing a blanket outage for all services after the issue was first reported. Its Azure UK West storage had not been affected at the time of writing. 

“Starting at 13:19 UTC on 10 Jan 2019, a subset of customers leveraging Storage in UK South may experience service availability issues. In addition, resources with dependencies on Storage, may also experience downstream impact in the form of availability issues. Engineers have been engaged and are actively investigating. The next update will be provided in 60 minutes, or as events warrant,” the company’s service status page reported yesterday. An update later confirmed the issue continued until “approximately 05:30 UTC on 11 Jan 2019.”

The Azure Support team used Twitter to confirm that the issue had been resolved, saying: “Mitigated: Engineers have confirmed that the Storage availability issue in UK South is resolved. Any customers experiencing residual impact will receive communications to their portal. A full Root Cause Analysis will be provided in approximately 72 hours.”

However, some customers were unhappy that, while services were offline and engineers scrambled to fix the issues, the company had communicated little beyond its original message.

The support team then followed up with another response two hours later, saying: “Hi there, as [we] continue working to resolve this issue, we are wondering if you have seen any signs of recovery yet?”

In terms of what caused the outage, Microsoft said: “Engineers determined that a number of factors, initially related to a software error, caused several nodes on a single storage scale unit to become temporarily unreachable. This, along with the increase in load on the scale unit caused by the initial issue, resulted in impact to customers with Storage resources located on this scale unit.”