Category Archives: Outage Alert

502 Errors, Latency Accessing Gmail

Google reported a problem with Gmail today and, not long after, said it was resolved:

3:02 AM: We’re investigating reports of an issue with Google Mail. We will provide more information shortly.

3:43 AM: The problem with Google Mail should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.

Users experienced 502 errors and latency when accessing email.

Google Details Cause of Wednesday’s Widespread Apps Outage

Google issued an incident report on Wednesday’s outage, which affected less than one per cent of Gmail users but was significant for other services, including 50% of Admin panel and 60% of Google Sync login requests. As has happened in the past, the cause was a configuration error in a central system, in this case the one that manages login requests for Google services. The misconfiguration routed too many requests to too few servers, which buckled under the load:

From 5:00 a.m. to 8:00 a.m. PT, some users received errors when trying to access Gmail, Drive, Talk, Google Sync, the Admin panel, and the Cloud Console, and to a lesser extent Groups, Sites, and Contacts. At the peak of the outage, this issue affected 50% of the Admin panel and 60% of Google Sync login requests. The percentages of affected users for other services were lower such as 0.18% users for Gmail. The root cause was an issue in the system that manages login requests for Google services.

At 5:00 a.m. as login traffic increased, the misconfigured servers were unable to process the load. This began to cause errors for some users logging in to Google services. The request load, exacerbated by retry requests from users and automated systems such as IMAP clients, initially appeared as the cause of the login errors. At 5:48 a.m., the Engineering team determined that the root cause was not excess traffic but insufficient capacity.

The full report is less than two pages long, and it clearly outlines what happened and how Google hopes to prevent a repeat in the future.
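
The report notes that retry requests from users and automated clients added to the load on the login servers. One common way for a client to avoid piling on during an outage is capped exponential backoff with random jitter. The sketch below is only an illustration of that general technique; it is not taken from Google's report and does not describe how Google or any IMAP client actually behaves. The names `backoff_delays` and `call_with_retries`, the default limits, and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, cap=60.0, rng=None):
    """Yield one wait time in seconds per retry.

    The ceiling doubles on each retry (base, 2*base, 4*base, ...) up to
    `cap`, and the actual wait is drawn uniformly between 0 and that
    ceiling ("full jitter") so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_retries=5, base=1.0, cap=60.0,
                      retry_on=(ConnectionError,), sleep=time.sleep, rng=None):
    """Call `operation`, retrying with jittered backoff on `retry_on` errors.

    `operation` is a hypothetical zero-argument callable, for example a
    function that attempts a login. The last error is re-raised once all
    retries are used up.
    """
    try:
        return operation()
    except retry_on as error:
        last_error = error
    for delay in backoff_delays(max_retries, base, cap, rng):
        sleep(delay)  # wait before trying again instead of retrying at once
        try:
            return operation()
        except retry_on as error:
            last_error = error
    raise last_error
```

A client would wrap its own request function, for instance `call_with_retries(attempt_login)`, where `attempt_login` is a placeholder for whatever call the client makes. Passing `sleep` and `rng` as parameters keeps the helper easy to test without real waiting.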


Survey Says 40 Per Cent of IT Managers Have Suffered a Cloud Outage

According to a survey conducted by Kelton Research for TeamQuest, nearly four in ten respondents reported having suffered a cloud outage:

Many survey respondents believe the reported outages could have been prevented. Capacity management is cited as one way to minimize the risks associated with cloud computing, according to respondents in a survey from Kelton Research, commissioned by TeamQuest Corporation.

Google Outages: Did the Latest Hit You?

This time it was Postini:

March 25, 2013 1:38:00 PM PDT

We’re investigating reports of an issue with Postini Services.

March 25, 2013 2:38:00 PM PDT

Postini Services service has already been restored for some users, and we expect a resolution for all users within the next 1 hours. Please note this time frame is an estimate and may change. (Editor’s note: resolution took more than six additional hours.)

March 25, 2013 9:05:00 PM PDT

The problem with Postini Services should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.


Google Says Drive Problem Resolved, Wants to Hear From You if You Still Have a Problem

According to Google, the outage for some Google Drive users should be completely resolved.

Still having a problem? Then Google wants to hear about it:

The problem with Google Drive should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better. If you are still experiencing an issue, please contact us via the Google Help Center.

Google Drive Outage Updates

From the Google App Status Dashboard:

March 18, 2013 7:17:00 AM PDT

We’re investigating reports of an issue with Google Drive. We will provide more information shortly.

March 18, 2013 8:10:00 AM PDT

We’re aware of a problem with Google Drive affecting a significant subset of users. The affected users are unable to access Google Drive. We will provide an update by March 18, 2013 9:10:00 AM PDT detailing when we expect to resolve the problem. Please note that this resolution time is an estimate and may change.

March 18, 2013 8:55:00 AM PDT

Google Drive service has already been restored for some users, and we expect a resolution for all users within the next 1 hours. Please note this time frame is an estimate and may change.

Hurricane Sandy and NYC Data Centers: How They Prepped, What Happened

Water and servers don’t mix. Storms can do more than cut the power to a data center; they can also breach walls, cause flooding, or otherwise damage a facility. A natural disaster like Hurricane Sandy can also make it difficult for staff to even be there to do their jobs, and can delay the arrival of replacement parts, fuel for generators, and so on.

Two posts at Data Center Knowledge do a good job of outlining how New York City data centers prepared, and what actually happened.