Google cloud falls over after routing error, strives to remove manual link activation

(c) Nikolic

Google Compute Engine went down for approximately 70 minutes last week, the company has confirmed, making certain Internet destinations unreachable from the europe-west1 region during that time.

The issue first came to light at 1326 PST on November 23 with a status update; a further missive at 1432 confirmed that the problems should have been resolved.

Four days later, Google explained exactly what went wrong. At 1151 PST on November 23, Google engineers activated a new peering link – with an unnamed provider that Google says it works with extensively – but during the activation, the provider’s estimate of how much capacity the link could take differed wildly from its actual performance. As a result, traffic was dropped, with the majority of affected destination addresses located in eastern Europe and the Middle East.

The oversight, Google notes, was caused by an ‘unrelated failure’ which meant the safety checks that form part of the automation process for peering links were not performed. Following the downtime, the search giant says it will change procedure to prevent manual link activation.

“The automated checks were expected to protect the network for approximately one hour after link activation, and normal congestion monitoring began at the end of that period,” a post from the Google Compute Engine team noted. “As the post-activation checks were missing, this allowed a 61 minute delay before the normal monitoring started, detected the congestion, and alerted Google network engineers.

“If your service or application was affected, we apologise,” the team update added. “This is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.”

This is not the first time Google’s cloud has hit the skids; in February, Compute Engine went down for two hours with network issues in ‘multiple zones’. Figures released by CloudHarmony at the start of this year showed Google to be one of the more reliable cloud providers, although Compute Engine scored three-nines availability with 66 outages throughout 2014.
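
For readers unfamiliar with the 'nines' shorthand, the arithmetic can be sketched roughly as follows. This is an illustrative calculation only: the function names are hypothetical, and the one-year measurement window is an assumption, since the article does not say how CloudHarmony or Google define the period.

```python
# Illustrative only: converts "N nines" of availability into a downtime budget.
# The one-year window is an assumption, not something stated by Google or CloudHarmony.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Maximum downtime in the period that still meets `nines` nines of availability."""
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    # Three nines means 99.9% up, i.e. at most 1/1000 of the period down.
    return period_minutes / 10 ** nines


def availability_percent(downtime_minutes: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Availability, as a percentage, given total downtime within the period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    for n in (3, 4):
        print(f"{n} nines: {downtime_budget_minutes(n):.1f} minutes of downtime per year")
    print(f"One 70-minute outage in a year: {availability_percent(70):.3f}% availability")
```

On these assumptions, three nines allows about 8.8 hours of downtime a year, while four nines allows under 53 minutes, so a single 70-minute outage would on its own exceed a four-nines annual budget.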