After CRM and business IT provider Autotask suffered a severe outage yesterday, CloudTech caught up with VP engineering Adam Stewart to get the lowdown on what happened, and what happens from here
Exclusive: Late on Monday and into part of Tuesday, cloudy IT management software provider Autotask went down for English-speaking customers in Europe, Asia, the Middle East and Africa, as CloudTech reported.
The root cause, according to Autotask VP engineering Adam Stewart, was a capacity spike that brought the system down and caused two separate failures. After the first downtime, engineers added more virtual cores, or more ‘CPU horsepower’, to the web tier, only for the system to go down again at approximately 0651 GMT on July 2.
“We spent a lot of time analysing what could be causing so much load on these systems,” says Stewart, who admitted that he was ‘a bit relieved’. “Basically, we went through all the things you would go through to try and discover what the cause is.”
This included the possibility of a malicious attack – a macabre, if not unexpected, thought given what happened to Code Spaces last week – as well as a look at various processes in the tier. Yet the issue turned out to stem from the last change made: the extra virtual CPU cores.
“After about two hours of trying many fruitless theories, we thought – although this is quite counter-intuitive – the last change we made to the system was that we added this other CPU,” Stewart explains. “It shouldn’t because we don’t believe that’s a problem, but let’s try it because it makes sense.
“So we reverted that change and, lo and behold, those servers stabilised almost immediately. We waited 10 minutes and there was no peak, no spike, no anything else, so we gradually went through and reverted the change on all the other servers, and that’s what’s stabilised the issue.
“In hindsight we now know, empirically, that change caused it,” Stewart adds. “We’re not sure why something that we’ve done dozens of times before didn’t actually stick this time.
“It was almost as if the virtualisation software was reporting to the operating system that there were eight cores in each processor, however it was only providing four cores – something like that.”
It’s a slightly worrying admission that the company doesn’t know exactly why this happened, but Stewart assures us that there is contact between Autotask and its virtualisation provider. Additional servers were added last night to expand the capacity and mitigate the risk of further problems – and all seems well for now.
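The recovery Stewart describes (revert the change on one set of servers, wait about 10 minutes, then roll the revert through the rest) follows a common staged-rollback pattern. A minimal sketch of that pattern is below; it is not Autotask's tooling, and the names `staged_revert`, `revert`, `is_stable` and the `web1`-style hostnames are hypothetical placeholders.

```python
import time
from typing import Callable, List, Sequence


def staged_revert(
    servers: Sequence[str],
    revert: Callable[[str], None],      # hypothetical: undoes the change on one server
    is_stable: Callable[[str], bool],   # hypothetical: health check for one server
    soak_seconds: float = 600,          # the article mentions a 10-minute wait
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Revert one server first, watch it, then revert the remainder.

    Returns the servers that were reverted. If the first server is not
    stable after the soak period, the rollout stops there.
    """
    if not servers:
        return []

    first, rest = servers[0], servers[1:]
    revert(first)
    sleep(soak_seconds)                 # give load time to spike again, if it will
    if not is_stable(first):
        return [first]                  # the revert did not help; do not continue

    reverted = [first]
    for server in rest:                 # gradually apply the same revert elsewhere
        revert(server)
        reverted.append(server)
    return reverted


if __name__ == "__main__":
    # Demo with stand-in callables and no real waiting.
    done = staged_revert(
        ["web1", "web2", "web3"],
        revert=lambda s: print(f"reverting {s}"),
        is_stable=lambda s: True,
        soak_seconds=0,
    )
    print(done)
```

Testing the revert on a single server first keeps the blast radius small if the theory is wrong, which matters when, as here, the change was not expected to be the cause.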
The EMEA service disruption was resolved at 10:25 GMT. We’re working to ensure the issue doesn’t reoccur. Contact Autotask support with q’s.
— Autotask (@Autotask) July 2, 2014
Stewart is keen to point out that Autotask was “constantly” on the phones and on emails to customers during the outage, with some customers taking to Twitter to praise this approach:
Good call from @Autotask; checking how we managed the unexpected system outage yesterday #professional #caring
— Sytec (@Sytec) July 3, 2014
Yet Autotask’s Twitter page throughout Monday evening and Tuesday morning was a ghost town, leading to frustrated comments from customers.
“We were a little more silent there than maybe we should have been,” Stewart admits, adding: “I work out of the New York office – we were in constant contact with our UK office where most of the support was taking place for that.”
Stewart also concedes that yesterday’s downtime puts Autotask “a little bit below” the four nines SLA for the affected zones, yet adds it’s not a contractual stipulation. “It’s just the performance our customers have grown accustomed to, and rightly so,” he explains.
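For context, ‘four nines’ means 99.99% availability, which leaves roughly 52.6 minutes of downtime per year. The sketch below works through that arithmetic. It assumes a 365-day measurement window and, purely for illustration, treats the service as fully unavailable from the second failure at 0651 GMT to the stated resolution at 10:25 GMT (214 minutes); the article does not give Autotask's actual measurement window or total downtime, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(target: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` at availability `target` (0.9999 = four nines)."""
    return (1.0 - target) * days * MINUTES_PER_DAY


def availability(downtime_minutes: float, days: int = 365) -> float:
    """Fraction of the window the service was up, given total downtime in minutes."""
    return 1.0 - downtime_minutes / (days * MINUTES_PER_DAY)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.9999)
    print(f"Four-nines budget per year: {budget:.2f} minutes")

    # Assumption: 06:51 to 10:25 GMT counted as one continuous outage (214 minutes).
    assumed_outage = 3 * 60 + 34
    print(f"Availability with {assumed_outage} min down: {availability(assumed_outage):.4%}")
```

Under those assumptions a single outage of this length exceeds the yearly four-nines budget several times over, while still leaving availability above 99.9%.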
When outages occur, companies are quick to reassure customers it won’t happen again, and to put appropriate policies in place. Autotask is no different: it is collaborating with its virtualisation partner, examining additional testing procedures and setting up a test environment to reproduce the problem.
It’s all good development practice, yet prevention is still better than cure in these instances.
“Autotask is like oxygen for our customers, and uptime is critically important,” Stewart says.
“This is by far the worst outage we’ve had in probably five or six years,” he adds. “Overall, I’d say we still have a great record, but I’m sure in [the media’s] eyes…it’s called into question and we have to earn everybody’s trust again – which we will do.”