Category Archives: Outage

BT outage impacts 10% of customers in capital

BT has confirmed around 10% of its customers experienced an outage this morning, which has reportedly been linked to a power incident at the former Telecity LD8 site in London, now owned by Equinix, reports Telecoms.com.

BT first acknowledged the outage this morning on Twitter, which took down broadband services for a number of customers in the London area.

The LD8 data centre in London’s Docklands houses the London Internet Exchange (LINX), one of the world’s largest Internet Exchanges with more than 700 members, including ISPs such as BT and Virgin Media as well as content providers.

“We’re sorry that some BT and Plusnet customers experienced problems accessing some internet services this morning,” said a BT spokesperson. “Around 10% of customers’ internet usage was affected following power issues at one of our internet connection partners’ sites in London. The issue has now been fixed and services have been restored.”

While the statement says the problem was limited to London, BT’s service status page indicates dozens of cities and towns across the UK experienced issues. These have not yet been directly linked to the same incident.

The LD8 data centre has only been under Equinix’s control for a matter of months, since the US company acquired Telecity for $3.8 billion. Equinix claims it is now the largest retail colocation provider in Europe and globally, after the deal added 34 data centres to its portfolio, though eight assets had to be off-loaded to satisfy the European Commission’s competition authorities.

“Equinix can confirm that we experienced a brief outage at the former Telecity LD8 site in London earlier this morning,” said an Equinix spokesperson. “This impacted a limited number of customers; however, service was restored within minutes. Equinix engineers are on site and actively working with customers to minimise the impact.”

During email exchanges with Telecoms.com, neither BT nor Equinix named the other party directly, which is understandable given the sensitivity of the issue. Despite BT stating all services have been recovered, at the time of writing the service status page lists dozens of towns and cities that are still experiencing problems. Although these have not been directly linked, as long as service problems continue BT is likely to face a mounting customer service challenge.

AWS releases statement to explain Aussie outage

AWS has cited a power failure caused by adverse weather conditions as the primary cause of the outage Australian customers experienced this weekend.

A statement on the company’s website said its utility provider suffered a failure at a regional substation, which resulted in the total loss of utility power to multiple AWS facilities. At one of these facilities the power redundancy didn’t work as designed, and the company lost power to a large number of instances in the affected Availability Zone.

The storm this weekend was one of the worst experienced by Sydney in recent years, recording 150mm of rain over the period, with 93mm falling on Sunday 5th alone, and wind speeds reaching as high as 96km/h. The storm resulted in AWS customers losing services for up to six hours, between 11.30pm and 4.30am (PST) on June 4/5. The company claims over 80% of the impacted customer instances and volumes were back online and operational by 1am, though a latent bug in the instance management software led to a slower than expected recovery for some of the services.

While adverse weather conditions cannot be avoided, the outage is unlikely to ease concerns over public cloud propositions. Although the concept of cloud may now be considered mainstream, there are still numerous decision makers who are hesitant to place mission-critical workloads in such an environment, as it can be seen as handing control of a company’s assets to another organisation. Such outages will do little to bolster the confidence of those who are already sceptical.

“Normally, when utility power fails, electrical load is maintained by multiple layers of power redundancy,” the statement said. “Every instance is served by two independent power delivery line-ups, each providing access to utility power, uninterruptible power supplies (UPSs), and back-up power from generators. If either of these independent power line-ups provides power, the instance will maintain availability. During this weekend’s event, the instances that lost power lost access to both their primary and secondary power as several of our power delivery line-ups failed to transfer load to their generators.”
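To put the quoted redundancy model in concrete terms, the sketch below simulates the behaviour AWS describes: each instance draws on two independent power line-ups and stays up as long as either one successfully transfers load to its generator. The success probability and trial count are purely illustrative assumptions, not figures published by AWS, and the model deliberately ignores the correlated transfer failures that actually occurred.

```python
import random

# Illustrative model of the dual power line-up described in the AWS statement.
# Each instance is fed by two independent line-ups; it stays up if at least
# one of them still delivers power after a utility failure.
# p_transfer_ok is a hypothetical probability that a line-up successfully
# transfers load to its generator -- not a figure published by AWS.

def instance_survives(p_transfer_ok: float) -> bool:
    """Return True if at least one of two independent line-ups keeps power."""
    lineup_a = random.random() < p_transfer_ok
    lineup_b = random.random() < p_transfer_ok
    return lineup_a or lineup_b

def estimated_availability(p_transfer_ok: float, trials: int = 100_000) -> float:
    """Monte Carlo estimate of the fraction of instances that keep power."""
    survived = sum(instance_survives(p_transfer_ok) for _ in range(trials))
    return survived / trials

if __name__ == "__main__":
    # With truly independent line-ups, even a mediocre 90% transfer success
    # rate leaves only ~1% of instances dark (0.1 * 0.1 = 0.01)...
    print(f"independent line-ups: {estimated_availability(0.90):.3f}")
    # ...but the weekend's failure was correlated: several line-ups failed to
    # transfer at once, which this independence assumption does not capture.
```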

In an effort to avoid similar episodes in the future, the team said additional breakers will be added to ensure connections to degraded utility power are broken more quickly, allowing generators to activate before the uninterruptible power supply systems are depleted. The team has also prioritised reviewing and redesigning the power configuration process in its facilities to prevent similar power sags from affecting performance in the future.
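The reasoning behind the extra breakers can be expressed as a simple timing budget: disconnecting from degraded utility power plus starting the generator must complete before the UPS batteries run out. The figures in the sketch below are hypothetical placeholders for illustration, not values published by AWS.

```python
# Hypothetical timing budget for the fix described above: the switch to
# generator power must complete before the UPS batteries are exhausted.
# None of these figures come from AWS; they are placeholders for illustration.

BREAKER_TRIP_SECONDS = 2        # time to disconnect degraded utility power
GENERATOR_START_SECONDS = 30    # time for the generator to start and take load
UPS_RUNTIME_SECONDS = 300       # how long the UPS can carry the load alone

def transfer_within_ups_budget(breaker_s: float, generator_s: float,
                               ups_runtime_s: float) -> bool:
    """True if the generator picks up the load before the UPS is depleted."""
    return (breaker_s + generator_s) <= ups_runtime_s

if __name__ == "__main__":
    ok = transfer_within_ups_budget(BREAKER_TRIP_SECONDS,
                                    GENERATOR_START_SECONDS,
                                    UPS_RUNTIME_SECONDS)
    print("transfer completes before UPS depletion:", ok)
```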

“We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services,” the company said.

Google cloud team launches damage control mission

Google will offer service credits to all customers affected by the Google Compute Engine outage, in what would appear to be a damage control exercise as the company looks to gain ground on AWS and Microsoft Azure in the public cloud market segment.

On Monday, 11 April, Google Compute Engine instances in all regions lost external connectivity for a total of 18 minutes. The outage has been blamed on two separate bugs, neither of which would have caused any major problems on its own, though in combination they took the service down. Although the incident has seemingly caused embarrassment for the company, it did not impact more visible consumer services such as Google Maps or Gmail.

“We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur,” said Benjamin Treynor Sloss, VP of Engineering at Google, in a statement on the company’s blog. “As of this writing, the root cause of the outage is fully understood and GCE is not at risk of a recurrence. Additionally, our engineering teams will be working over the next several weeks on a broad array of prevention, detection and mitigation systems intended to add additional defence in depth to our existing production safeguards.

“We take all outages seriously, but we are particularly concerned with outages which affect multiple zones simultaneously because it is difficult for our customers to mitigate the effect of such outages. It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.”

While the outage would not appear to have caused any major damage for the company, competitors in the space may secretly be pleased with the level of publicity the incident has received. Google has been ramping up its cloud computing efforts in recent months to tackle the public cloud market segment, hiring industry hard-hitters such as Diane Greene, being linked with acquisitions, and announcing plans to open 12 new data centres by the end of 2017.

The company currently sits in third place in the public cloud market segment, behind AWS and Microsoft Azure, though it had been demonstrating healthy growth in the months prior to the outage.

Google Outages: Did the Latest Hit You?

This time it was Postini:

March 25, 2013 1:38:00 PM PDT

We’re investigating reports of an issue with Postini Services.

March 25, 2013 2:38:00 PM PDT

Postini Services service has already been restored for some users, and we expect a resolution for all users within the next 1 hours. Please note this time frame is an estimate and may change. (editor’s note: resolution took over six more hours).

March 25, 2013 9:05:00 PM PDT

The problem with Postini Services should be resolved. We apologize for the inconvenience and thank you for your patience and continued support. Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better.