AWS has blamed a loss of power caused by adverse weather conditions for the outage Australian customers experienced this weekend.
According to a statement on the company’s website, its utility provider suffered a failure at the regional substation, which resulted in the total loss of utility power to multiple AWS facilities. At one of these facilities, the power redundancy did not work as designed, and the company lost power to a large number of instances in the availability zone.
The storm this weekend was one of the worst Sydney has experienced in recent years, bringing 150 mm of rain over the period, with 93 mm falling on Sunday, June 5 alone, and wind speeds reaching as high as 96 km/h. The storm resulted in AWS customers losing services for up to six hours, between 11.30pm and 4.30am (PST) on June 4/5. The company claims over 80% of the impacted customer instances and volumes were back online and operational by 1am, though a latent bug in the instance management software led to a slower than expected recovery for some of the services.
While adverse weather conditions cannot be avoided, the outage is unlikely to ease concerns about public cloud propositions. Although the concept of cloud may now be considered mainstream, numerous decision makers remain hesitant to place mission-critical workloads in such an environment, as doing so is seen as handing control of a company’s assets to another organization. Such outages will not bolster confidence among those who are already pessimistic.
“Normally, when utility power fails, electrical load is maintained by multiple layers of power redundancy,” the statement said. “Every instance is served by two independent power delivery line-ups, each providing access to utility power, uninterruptable power supplies (UPSs), and back-up power from generators. If either of these independent power line-ups provides power, the instance will maintain availability. During this weekend’s event, the instances that lost power lost access to both their primary and secondary power as several of our power delivery line-ups failed to transfer load to their generators.”
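The redundancy rule described in the statement can be sketched in a few lines of Python. This is only an illustrative model, not AWS's actual tooling: the `PowerLineup` class, its field names, and the `instance_available` function are hypothetical, and the model treats each power source as a simple on/off flag rather than modeling how long a UPS can carry load.

```python
from dataclasses import dataclass


@dataclass
class PowerLineup:
    """Hypothetical model of one independent power delivery line-up."""
    utility_ok: bool    # utility power is reaching the line-up
    ups_ok: bool        # UPS still has charge to carry the load
    generator_ok: bool  # load was successfully transferred to the generator

    def delivers_power(self) -> bool:
        # A line-up delivers power if any one of its sources is working.
        return self.utility_ok or self.ups_ok or self.generator_ok


def instance_available(primary: PowerLineup, secondary: PowerLineup) -> bool:
    # Per the statement, an instance stays up if either line-up provides power.
    return primary.delivers_power() or secondary.delivers_power()


# Assumed scenario: utility power is lost, but one generator takes the load.
normal_failover = instance_available(
    PowerLineup(utility_ok=False, ups_ok=False, generator_ok=True),
    PowerLineup(utility_ok=False, ups_ok=False, generator_ok=False),
)

# Assumed scenario resembling the event: utility power is lost, the UPSs are
# depleted, and neither line-up transfers load to its generator.
weekend_event = instance_available(
    PowerLineup(utility_ok=False, ups_ok=False, generator_ok=False),
    PowerLineup(utility_ok=False, ups_ok=False, generator_ok=False),
)
```

Under these assumptions, the first scenario keeps the instance available and the second does not, which matches the statement's account that affected instances lost both their primary and secondary power.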
In an effort to avoid similar episodes in the future, the team have stated that additional breakers will be added so that connections to degraded utility power are broken more quickly, allowing generators to activate before the uninterruptible power supply systems are depleted. The team have also prioritized reviewing and redesigning the power configuration process in their facilities to prevent similar power sags from affecting performance in the future.
“We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and use it to drive improvement across our services,” the company said.