Google Details Cause of Wednesday’s Widespread Apps Outage

Google has issued an incident report on Wednesday’s outage, which affected less than one percent of Gmail users but hit other services much harder: at the peak, 50% of Admin panel and 60% of Google Sync login requests were affected. As in past incidents, the cause was a configuration error in a central system, in this case Google Services Login. The misconfiguration routed too many requests to too few servers, which buckled under the load:

From 5:00 a.m. to 8:00 a.m. PT, some users received errors when trying to access Gmail, Drive, Talk, Google Sync, the Admin panel, and the Cloud Console, and to a lesser extent Groups, Sites, and Contacts. At the peak of the outage, this issue affected 50% of the Admin panel and 60% of Google Sync login requests. The percentages of affected users for other services were lower such as 0.18% users for Gmail. The root cause was an issue in the system that manages login requests for Google services.

At 5:00 a.m. as login traffic increased, the misconfigured servers were unable to process the load. This began to cause errors for some users logging in to Google services. The request load, exacerbated by retry requests from users and automated systems such as IMAP clients, initially appeared as the cause of the login errors. At 5:48 a.m., the Engineering team determined that the root cause was not excess traffic but insufficient capacity.
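To see why retry traffic can disguise a capacity shortfall as a traffic surge, here is a minimal toy model in Python. It is not taken from Google's report: the function name `simulate_login_load`, the `retry_fraction` parameter, and all the numbers are hypothetical and chosen only for illustration.

```python
def simulate_login_load(new_requests, capacity, retry_fraction, steps):
    """Toy model of login load with retries (illustrative assumptions only).

    Each step, `new_requests` genuine logins arrive. The servers can handle
    at most `capacity` requests; a `retry_fraction` share of the failures
    comes back as extra requests in the next step.
    """
    retries = 0.0
    history = []
    for _ in range(steps):
        offered = new_requests + retries    # everything arriving this step
        served = min(offered, capacity)     # servers are capped at capacity
        failed = offered - served           # the rest receive errors
        retries = failed * retry_fraction   # some failures are retried later
        history.append((offered, served, failed))
    return history


# Hypothetical numbers: demand of 100 per step against capacity for only 80.
for offered, served, failed in simulate_login_load(100, 80, 0.5, 5):
    print(f"offered={offered:.1f} served={served:.1f} failed={failed:.1f}")
```

With these made-up inputs the offered load climbs from 100 to 110 to 115 and drifts toward 120, even though genuine demand never changes. An observer looking only at the request rate would see rising traffic, while the underlying problem is that capacity sits below normal demand. With capacity set to 100 or more, the same model produces no failures and no retries.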

The full report runs to less than two pages and clearly outlines what happened and how Google hopes to prevent a recurrence.