Google confirms network congestion as contributor to four-hour cloud outage

Google has confirmed that a ‘network congestion’ issue, which affected various services for more than four hours on Sunday, has been resolved.

A status update at 1225 PT noted the company was investigating an issue with Google Compute Engine, later diagnosed as high levels of network congestion across eastern USA sites. A further update arrived at 1458 to confirm engineering teams were working on the issue before the all-clear was sounded at 1709.

“We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimise future recurrence,” the company wrote in a statement. “We will provide a detailed report of this incident once we have completed our internal investigation.”

The outage predominantly affected users in the US, with some European users also seeing issues. Various Google services, including Google Cloud, YouTube, and G Suite, were affected, and many companies that run on Google’s cloud also experienced problems. Snapchat – a long-serving Google Cloud customer and considered a flagship client before the company’s major enterprise push – saw downtime, as did gaming messaging service Discord.

According to security provider ThousandEyes, network congestion is a ‘likely root cause’ of the outage. The company spotted services behaving out of sync as early as 1200 PT at sites including Ashburn, Atlanta and Chicago, with services only beginning to come back at approximately 1530. “For the majority of the duration of the 4+ hour outage, ThousandEyes detected 100% packet loss for certain Google services from 249 of our global vantage points in 170 cities around the world,” said Angelique Medina, product marketing director at ThousandEyes.

Previous Google cloud snafus have shown the company can learn lessons. In November 2015 Google Compute Engine went down for approximately 70 minutes, which resulted in the removal of manual link activation for safety checks. The following April, services went down for 18 minutes following a bug in Google Cloud’s network configuration management software.

According to research from Gartner and Krystallize Technologies published last month, Microsoft is the poor relation among the biggest three cloud providers when it comes to reliability. As reported by GeekWire, 2018 saw Amazon and Google achieve almost identical uptime statistics, at 99.9987% and 99.9982% respectively. Microsoft, meanwhile, trailed with 99.9792% – a ‘small but significant’ amount.
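To put those percentages in perspective, the short sketch below converts each quoted uptime figure into approximate downtime per year. It is only an illustration: it assumes a 365-day year and that each percentage applies evenly across the whole of 2018, and the function and variable names are hypothetical rather than anything taken from the research.

```python
# Convert the quoted 2018 uptime percentages into approximate annual downtime.
# Assumption: a 365-day year; the research may have used a different window.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(uptime_percent: float) -> float:
    """Return the minutes of downtime per year implied by an uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


# Figures as quoted in the article.
uptime_2018 = {
    "Amazon": 99.9987,
    "Google": 99.9982,
    "Microsoft": 99.9792,
}

for provider, percent in uptime_2018.items():
    minutes = annual_downtime_minutes(percent)
    print(f"{provider}: about {minutes:.1f} minutes of downtime per year")
```

Under those assumptions, the figures work out to roughly 7 minutes for Amazon, 9 minutes for Google and 109 minutes for Microsoft, which shows how a gap of a few hundredths of a percentage point becomes a difference of well over an hour.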
