Why human error is still the biggest risk to your cloud system going down

The number one risk to system availability remains human error, according to the latest disaster recovery industry report from CloudEndure.

The research examines the protocols businesses have in place for downtime if – or when – it occurs. On a risk scale of one to 10, human error – including application bugs – scored 8.1, compared to network failures (7.2), cloud provider downtime (6.9) and external threats (6.7).

Even though the majority (83%) of organisations have an SLA goal of 99.9% or better, this often doesn’t translate into actual results. 44% of firms said they had at least one outage in the past three months, with 27% admitting their systems had gone down within the past month. 9% of respondents said their systems had never gone down.
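
To put the 99.9% target in perspective, the downtime it permits can be worked out directly. The snippet below is a minimal sketch; the function name `allowed_downtime_minutes` and the choice of a 30-day month are illustrative assumptions, not figures from the report.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Return the minutes of downtime an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Period lengths are assumptions chosen for illustration.
for label, days in [("30-day month", 30), ("365-day year", 365)]:
    print(f"{label}: {allowed_downtime_minutes(99.9, days):.1f} minutes")
```

On this arithmetic, a 99.9% goal leaves roughly 43 minutes of downtime in a 30-day month, or just under nine hours across a year.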

Most intriguingly, more than a quarter of firms surveyed (28%) don’t measure service availability at all, and 15% said they do not share system availability numbers with customers. 37% said they meet their availability goals consistently, with 50% saying they hit their goals “most of the time.”

It’s worth asking what counts as ‘downtime’, as the report does not give a clear definition and respondents themselves disagree. Half of respondents say downtime is simply when the system is not accessible, while roughly a quarter each say it means the system is accessible but performance is highly degraded (26%) or some functions are not operational (24%).
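
The choice of definition matters because it changes the availability figure a firm ends up reporting. The sketch below illustrates this with an invented incident log; the incident kinds, durations and the `availability` helper are all hypothetical and do not come from the report.

```python
# Hypothetical incident log for one 30-day month: (kind, minutes).
# These values are invented for illustration only.
INCIDENTS = [
    ("inaccessible", 20),
    ("degraded", 45),
    ("partial", 30),
]

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month


def availability(incidents, counted_kinds, period_minutes=MONTH_MINUTES):
    """Percentage availability when only incidents in counted_kinds count as downtime."""
    down = sum(minutes for kind, minutes in incidents if kind in counted_kinds)
    return 100 * (1 - down / period_minutes)


# Narrow definition: only full inaccessibility counts as downtime.
narrow = availability(INCIDENTS, {"inaccessible"})
# Broad definition: degraded performance and partial outages count too.
broad = availability(INCIDENTS, {"inaccessible", "degraded", "partial"})

print(f"narrow definition: {narrow:.3f}%")
print(f"broad definition:  {broad:.3f}%")
```

With these made-up numbers, the same month stays above a 99.9% goal under the narrow definition and falls below it under the broad one.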

Overwhelmingly, the respondents’ cloud provider of choice was Amazon Web Services (AWS). 59% of those polled said they used public cloud, with three quarters (74%) of that number opting for Amazon, ahead of Microsoft (7%), Google (6%) and Rackspace (4%). Not surprisingly, service availability was considered most critical to the customers of 33% of firms.

The report’s main claim is a “strong correlation” between the cost of downtime and the average hours per week invested in disaster recovery. 49% of respondents said they used their own measurement tools, with a quarter (24%) using some sort of third-party tool. According to respondents, remote storage backup (57%) is the most frequently used strategy to ensure system availability, ahead of storage replication (46%).

Previous reports from CloudEndure examined AWS and Microsoft Azure uptime figures for 2014: AWS showed a 41% reduction in performance issues quarter to quarter last year, while there were significantly more service interruptions in the last three quarters for Azure.