The cloud is still fallible: Joyent’s entire US-East-1 data centre went down yesterday, and the blame has been laid on a fat-fingered command that amounted to a monumental operator error.
The problem has since been resolved, with all virtual machines back up and running. According to a post-mortem blog post from the Joyent team, however, customers were down for a minimum of 20 minutes and a maximum of 149 minutes, with over 90% of instances recovered within an hour.
The cause of the outage? A typo, put simply. Joyent picks up the story: “The command to reboot the select set of new systems that needed to be updated was mistyped, and instead specified all servers in the data centre.
“Unfortunately the tool in question does not have enough input validation to prevent this from happening without extra steps/confirmation, and went ahead and issued a reboot command to every server …
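Joyent has not published the tool in question, so the following is only a minimal sketch of the kind of input validation the post-mortem describes as missing. Every name in it (`check_reboot_targets`, `ConfirmationRequired`, the `max_fraction` threshold) is hypothetical and not taken from Joyent’s tooling; the assumption is simply that a command touching more than a small share of the fleet should be refused unless the operator has explicitly confirmed it.

```python
from typing import Iterable, Set


class ConfirmationRequired(Exception):
    """Raised when a command targets too much of the fleet without confirmation."""


def check_reboot_targets(
    targets: Iterable[str],
    fleet: Iterable[str],
    max_fraction: float = 0.1,   # hypothetical threshold: 10% of the fleet
    confirmed: bool = False,
) -> Set[str]:
    """Return the validated set of servers to reboot, or raise.

    Illustrative guard only: it does not reboot anything.
    """
    fleet_set = set(fleet)
    target_set = set(targets)

    # Reject names that are not in the known fleet (likely typos).
    unknown = target_set - fleet_set
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    # An empty selection is almost certainly a mistake too.
    if not target_set:
        raise ValueError("no servers selected")

    # Require an explicit extra step before touching a large share of the fleet.
    fraction = len(target_set) / len(fleet_set)
    if fraction > max_fraction and not confirmed:
        raise ConfirmationRequired(
            f"{len(target_set)} of {len(fleet_set)} servers selected; "
            "re-run with explicit confirmation"
        )
    return target_set
```

Under these assumptions, a request naming two servers out of twenty passes straight through, while a request naming all twenty is refused until `confirmed=True` is supplied, which is the “extra steps/confirmation” the post-mortem says the real tool lacked.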