Test, test, then test again: Analysing the latest cloud DR strategies


In September, we hosted a roundtable with fifteen business leaders to discuss and debate the findings from our survey, The State of IT Disaster Recovery Amongst UK Businesses. The debate was chaired by Chris Francis, techUK board member. Customers Wavex and Bluestone also took part in the discussion, as did our partner Zerto and industry influencers Ray Bricknell from Behind Every Cloud and analyst Peter Roe from TechMarketView. The event was lively and thought-provoking.

Outages happen more frequently than we might think. We ran through the scale of outages reported in the press in just the last month, involving organisations such as British Airways, ING Bank and Glasgow City Council. British Airways lost its check-in facility due to a largely unexplained ‘IT glitch’, ING Bank’s regional data centre went offline after a fire drill gone wrong (reports suggest that more than one million customers were affected by the downtime), and Glasgow City Council lost its email for three days after a fire system blew in the Council’s data centre.

Our survey backed up the high frequency of outages, showing that 95% of companies surveyed had faced an IT outage in the past 12 months. Interestingly, 87% of those who suffered outages considered them severe enough to trigger a failover. We looked at some of the reasons for those outages, and top of the list were system failure and human error. So it is often not the big headlines we see, such as environmental threats, storms or even a terrorist threat, that bring our systems down, but more mundane, day-to-day issues. The group also suggested that issues often occur at the application level rather than the entire infrastructure being taken down.

We also discussed the importance of managing expectations and how disaster recovery should be baked in rather than seen as an add-on. Most businesses have a complex environment with legacy systems, so they cannot realistically expect 100% availability all of the time. That said, the term disaster recovery can scare people, so those around the table felt that we should really talk more about maintaining ‘Business as Usual’ and resilience. DR is no longer about failing over an entire site; it is about pre-empting issues, for example by testing and making sure that everything is going to work before you make changes to a system.

The discussion moved on to the impact of downtime. The survey found that every second really does count. When we asked respondents how catastrophic the impact of downtime would be, 42% said that just seconds would have a big impact, and this figure rose to nearly 70% when it came to minutes. The group’s advice was that businesses really need to focus on recovery times when looking at a DR solution. We also talked about how much budget is spent on meeting recovery goals. The reality is that no amount of spending fully compensates for downtime, but for most businesses there will always be some kind of trade-off between budget and downtime.
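To make that trade-off concrete, here is a rough back-of-the-envelope sketch. Every figure in it is a hypothetical assumption for illustration, not a number from the survey.

```python
# Illustrative only: all figures are hypothetical assumptions, not survey data.
# A rough framing of the downtime-versus-budget trade-off discussed above.

revenue_per_minute = 500.0        # assumed revenue exposed per minute of downtime (GBP)
outages_per_year = 4              # assumed number of outages causing downtime each year
minutes_per_outage_no_dr = 240    # assumed recovery time without a tested DR solution
minutes_per_outage_with_dr = 15   # assumed recovery time with a regularly tested failover
annual_dr_cost = 60_000.0         # assumed yearly cost of the DR solution plus testing effort

cost_without_dr = outages_per_year * minutes_per_outage_no_dr * revenue_per_minute
cost_with_dr = outages_per_year * minutes_per_outage_with_dr * revenue_per_minute + annual_dr_cost

print(f"Expected annual downtime cost without DR: £{cost_without_dr:,.0f}")
print(f"Expected annual downtime cost with DR:    £{cost_with_dr:,.0f}")
print(f"Net difference:                           £{cost_without_dr - cost_with_dr:,.0f}")
```

With these assumed numbers the DR investment pays for itself; swapping in your own figures may show a different balance, which is exactly the trade-off the group discussed.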

The group discussed whether business decision makers really understand the financial impact of downtime. Is more education needed about recovery times, what can be recovered, and how to prioritise different systems, so that the business understands what will happen when outages take place?

We then moved on to look at overconfidence in DR solutions. The survey found that 58% of respondents had issues when failing over, despite 40% being confident that their disaster recovery plans would work. Only 32% executed a failover, were confident, and had it all work well; 10% did not fail over but were confident that it would have worked. We talked to the group about this misplaced confidence: while IT leaders know the importance of having a DR solution and take measures to implement one, there appears to be a gap between believing the business is protected in a disaster and having that translate into a successful failover.

The bottom line is that DR strategies are prone to failure unless failover systems are thoroughly and robustly tested. Confidence in failover comes down to how frequently IT teams actually perform testing, and whether they are testing the aspects that really matter, such as the application level. Equally, are they testing network access, performance, security and so on? We certainly believe that testing needs to be done frequently to build evidence and a proven strategy. If testing only takes place once a year, or once every few years, how confident can organisations really be?
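As an illustration of what testing at the application level can look like in practice, the sketch below runs a few post-failover health checks against services at a recovery site. The endpoints, service names and thresholds are hypothetical assumptions, not tied to any particular product; the point is that a test failover should be judged on whether applications actually respond, and how quickly.

```python
# A minimal sketch of an automated post-failover health check.
# All endpoints and thresholds below are hypothetical assumptions.
import time
import urllib.request

CHECKS = [
    # (service name, health URL at the recovery site, max acceptable response time in seconds)
    ("web front end", "https://dr.example.com/health",      2.0),
    ("API layer",     "https://dr-api.example.com/health",  2.0),
    ("auth service",  "https://dr-auth.example.com/health", 3.0),
]

def check(name: str, url: str, max_seconds: float) -> bool:
    """Return True if the endpoint answers with HTTP 200 within its time budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=max_seconds) as resp:
            elapsed = time.monotonic() - start
            ok = resp.status == 200 and elapsed <= max_seconds
            print(f"[{'PASS' if ok else 'FAIL'}] {name}: HTTP {resp.status} in {elapsed:.2f}s")
            return ok
    except Exception as exc:
        print(f"[FAIL] {name}: {exc}")
        return False

if __name__ == "__main__":
    results = [check(*c) for c in CHECKS]
    print(f"{sum(results)}/{len(results)} application-level checks passed")
```

Network access, performance and security checks can be layered on in the same way; what matters is that the checks run automatically every time a test failover is performed.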

The group agreed that the complex web of interlocking IT systems is one of the biggest inhibitors to successful testing. While testing may be conducted on one part of a system in isolation, a failure there can often trigger a chain of events in other systems that the organisation is unable to control.
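One way to get a handle on those interlocking dependencies is simply to write them down and derive a safe order in which to recover and verify systems during a test. The sketch below uses a hypothetical dependency map; the specific systems are assumptions for illustration.

```python
# A minimal sketch: capture inter-system dependencies as a graph and derive a
# recovery/verification order. The systems listed are hypothetical examples.
from graphlib import TopologicalSorter

# Each system maps to the set of systems it depends on.
dependencies = {
    "web front end": {"API layer"},
    "API layer":     {"database", "auth service"},
    "auth service":  {"database"},
    "database":      set(),
}

# Topological order brings dependencies up first, so a failure in one component
# is caught before it cascades into the systems built on top of it.
order = list(TopologicalSorter(dependencies).static_order())
print("Recover and verify in this order:", " -> ".join(order))
```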

The group agreed that there is an intrinsic disconnect between what management wants to hear in terms of DR recovery times and what management wants to spend.

In conclusion, we discussed the need to balance downtime against cost, as no one has an unlimited budget. Many of the issues raised in the survey can be traced directly back to simply not testing enough, or not doing enough high-quality testing. The overall advice from iland, based on the survey, is to test, test and test again. Just as importantly, make sure that DR testing can be performed non-intrusively so that production applications are not affected, that it is cost-effective, and that it does not place a large administrative burden on IT teams.

Editor’s note: You can download a copy of the survey results here.