Microsoft digs down on Azure outage, explores data loss and failover question

Microsoft has put together a post-mortem on what it described as an 'unprecedented' Azure outage – exploring an interesting question of data loss and failover capability.

The outage, which affected customers of the VSTS – or Azure DevOps – service in the South Central US region, took more than 21 hours to recover from in full. A further incident, in which a database went offline, took another two hours to resolve.

As the status page – which originally went down with the rest of the service – noted at the time, the outage was blamed on severe storms in the Texas area. During the power swells that resulted, the data centres were able to maintain temperature through a thermal buffer – but once that was depleted, temperatures exceeded safe levels and an automated shutdown took place.

At the time, users queried Microsoft's claims that South Central US was the only region affected – but as the company explained, customers globally were affected due to cross-service dependencies.

Writing in a blog post, Buck Hodges, director of engineering for Azure DevOps, apologised to customers and said the company was exploring the feasibility of asynchronous replication. With asynchronous replication, any data which has not yet been copied across the network to the second server is lost if the first server fails. As Hodges explained: "If the asynchronous copy is fast, then under normal conditions, the effect is essentially the same as synchronous replication." Synchronous replication, where data loss is less of an issue, has problems of its own, particularly across regions, Hodges added: the time spent waiting for each write to be confirmed at a distant site comes at the expense of performance, particularly for mission-critical applications.
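The trade-off between the two modes can be illustrated with a toy model. The sketch below is not Microsoft's implementation; the class and method names (`ReplicatedStore`, `ship`, `fail_primary`) are hypothetical and exist only to show which writes are lost when the primary fails before the secondary has caught up.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A toy replica that stores committed writes in order."""
    log: list = field(default_factory=list)


@dataclass
class ReplicatedStore:
    """Hypothetical primary/secondary pair; `synchronous` picks the mode."""
    synchronous: bool
    primary: Replica = field(default_factory=Replica)
    secondary: Replica = field(default_factory=Replica)
    pending: list = field(default_factory=list)  # writes not yet shipped

    def write(self, value):
        self.primary.log.append(value)
        if self.synchronous:
            # Synchronous: the write is only acknowledged once the
            # secondary also holds it (slower, but nothing is pending).
            self.secondary.log.append(value)
        else:
            # Asynchronous: acknowledge immediately, ship the write later.
            self.pending.append(value)

    def ship(self, count=None):
        """Copy up to `count` pending writes to the secondary (all if None)."""
        n = len(self.pending) if count is None else min(count, len(self.pending))
        self.secondary.log.extend(self.pending[:n])
        del self.pending[:n]

    def fail_primary(self):
        """Simulate losing the primary; return the writes that are lost."""
        lost = list(self.pending)
        self.pending.clear()
        return lost


if __name__ == "__main__":
    store = ReplicatedStore(synchronous=False)
    for i in range(5):
        store.write(i)
    store.ship(3)                # only three writes reach the secondary
    print(store.fail_primary())  # prints [3, 4]
```

In this model, a fast asynchronous copy simply means `pending` is almost always empty, which is the situation Hodges describes as "essentially the same as synchronous replication".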

For the customers themselves, it's not an either-or question. Hodges said that some customers would be happy to take a certain loss of data if it meant getting a large team up and running again, while others would prefer to wait for a full recovery however long it took.

"The only way to satisfy both is to provide customers the ability to choose to fail over their organisations in the event of a region being unavailable," Hodges wrote. "We've started to explore how we might be able to give customers that choice, including an indication of whether the secondary is up to date and possibly provide manual reconciliation once the primary data centre recovers.

"This is really the key to whether or not we should implement asynchronous cross-region fail over," Hodges added. "Since it's something we've only begun to look into, it's too early to know if it will be feasible."
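As a rough illustration of the choice Hodges describes, the sketch below models a per-organisation failover decision. It is purely hypothetical: the names `Preference`, `SecondaryStatus` and `should_fail_over`, and the idea of counting unreplicated writes by sequence number, are assumptions made for the example and not anything Microsoft has announced.

```python
from dataclasses import dataclass
from enum import Enum


class Preference(Enum):
    AVAILABILITY = "availability"  # accept some data loss to get back online
    CONSISTENCY = "consistency"    # wait for the primary and lose nothing


@dataclass
class SecondaryStatus:
    """Hypothetical indication of how current the secondary is."""
    last_replicated_seq: int  # last write sequence number on the secondary
    primary_last_seq: int     # last sequence number known from the primary

    @property
    def unreplicated_writes(self) -> int:
        return max(0, self.primary_last_seq - self.last_replicated_seq)

    @property
    def up_to_date(self) -> bool:
        return self.unreplicated_writes == 0


def should_fail_over(pref, status, max_lost_writes=0):
    """Decide whether an organisation fails over to the secondary region."""
    if status.up_to_date:
        # Nothing would be lost, so failing over is safe for everyone.
        return True
    if pref is Preference.AVAILABILITY:
        # Fail over only if the loss stays within what the customer accepts.
        return status.unreplicated_writes <= max_lost_writes
    # Consistency-first customers wait for the primary to recover.
    return False
```

A team that values uptime might call `should_fail_over(Preference.AVAILABILITY, status, max_lost_writes=5)`, while a team that cannot tolerate loss would pass `Preference.CONSISTENCY` and fail over only when the secondary is fully up to date.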

Regardless of the problems and frustration outages cause for users, whether they be down to natural causes or otherwise, it is interesting to see an introspective exploration from Microsoft here.