Earlier this week, Amazon Web Services’ (AWS) S3 (Simple Storage Service) suffered an extended service disruption, knocking a multitude of sites and businesses offline – and the fault came down to good old-fashioned human error, according to the company.
According to a note published to customers, the fault occurred during a debugging session. “At 9:37AM PST, an authorised S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” the note reads. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
As a result, this greater-than-expected removal prompted a full restart of S3 systems in the US-EAST-1 region, which meant that other AWS services dependent on S3 – such as new instance launches of Amazon Elastic Compute Cloud (EC2), Elastic Block Store (EBS), and Lambda – were also affected.
The resulting casualty list was vast, including Quora, Slack, and Medium. Some users reported that their Internet of Things (IoT) services, such as connected lightbulbs and thermostats, had stopped responding because they relied on the Amazon backend, while AWS itself could not update its own status dashboard, meaning green lights erroneously signalled healthy services while the chaos unfolded.
AWS, as one would expect in such a situation, said it would make several changes to ensure the issue does not happen again. The first step, which has already been carried out, was to modify its capacity removal tool so that it takes capacity out more slowly, and to add safeguards that prevent capacity being removed when doing so would take a subsystem below its minimum required level. The company also said it will change the admin console of its status dashboard to run across multiple regions and reduce its dependence on S3, adding that while the AWS Twitter feed tried to keep users updated, it understood the dashboard provided ‘important visibility’ to customers.
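To picture the kind of safeguard AWS describes, the sketch below is a hypothetical Python illustration – not AWS’s internal tooling, and all names and thresholds are assumptions – of a capacity-removal routine that refuses any request that would drop a fleet below its minimum required size and that removes hosts in small, throttled batches.

```python
# Hypothetical illustration only: none of these names correspond to AWS's
# internal capacity tooling. The idea is the safeguard described above:
# refuse a removal request that would breach a subsystem's minimum capacity,
# and take hosts out slowly rather than all at once.

import time


def remove_capacity(active_hosts, hosts_to_remove, min_required,
                    batch_size=2, pause_seconds=30):
    """Remove hosts in throttled batches, never dropping below min_required."""
    remaining = list(active_hosts)
    pending = [h for h in hosts_to_remove if h in remaining]

    # Safeguard: reject the whole request up front if it would leave the
    # subsystem with fewer hosts than it needs to serve traffic.
    if len(remaining) - len(pending) < min_required:
        raise ValueError(
            f"refusing to remove {len(pending)} of {len(remaining)} hosts: "
            f"minimum required capacity is {min_required}"
        )

    # Slower process: remove capacity in small batches with a pause between
    # them, so an operator (or an automated check) can halt a bad run early.
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            remaining.remove(host)
        print(f"removed {batch}; {len(remaining)} hosts still active")
        if start + batch_size < len(pending):
            time.sleep(pause_seconds)

    return remaining
```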
So what happens from here? Naturally, the ensuing conversation turned to best practice: don’t put ‘all your eggs in one cloud’, as Chuck Dubuque, VP of product and solution marketing at Tintri, put it. “This is a wakeup call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up and emphasises the need for redundancy,” said Shawn Moore, CTO at Solodev. “If nothing else, the S3 outages will cause some businesses to reconsider a diversified environment – that includes enterprise cloud – to reduce their risks,” Dubuque added.
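To make that ‘more than one basket’ argument concrete, here is a minimal sketch – plain Python with placeholder endpoint URLs, not any vendor’s actual API – of a client-side read path that falls back to a replica in another region or provider when the primary is unreachable.

```python
# A deliberately simple sketch of cross-region (or cross-provider) redundancy
# on the read path. The endpoint URLs are placeholders, not real services,
# and this is not any vendor's actual client library.

import urllib.error
import urllib.request

ENDPOINTS = [
    "https://objects.primary-region.example.com",    # primary copy
    "https://objects.secondary-region.example.com",  # replica kept elsewhere
]


def fetch_object(key, timeout=5):
    """Return the object body from the first endpoint that answers."""
    last_error = None
    for base_url in ENDPOINTS:
        try:
            with urllib.request.urlopen(f"{base_url}/{key}", timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # URLError and timeouts both derive from OSError
            last_error = exc    # this endpoint is down; try the next one
    raise RuntimeError(f"all storage endpoints failed for {key!r}") from last_error
```

A real deployment would also need the data itself replicated to that second location; the point is simply that no single region’s outage should decide whether a service stays up.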
“We want to apologise for the impact this event caused for our customers,” AWS added. “While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses.
“We will do everything we can to learn from this event and use it to improve our availability even further.”
You can read the full statement here.