A disaster recovery plan: What is your IT team keeping from you?

Your disaster recovery programme is like a parachute – you don’t want to find yourself in freefall before you discover it won’t open. But amid accelerating development cycles and pressure on cost, resources and time, many CIOs are failing to give DR planning and testing adequate priority.

While IT teams are running to stand still with day-to-day responsibilities, DR efforts tend to be focused solely on infrastructure, hardware and software, neglecting the people and processes needed to execute the plan. At best, this runs the risk of failed recovery testing. At worst, a business may be brought to its knees at a time of actual disaster without any chance of a swift recovery.

Your team may be reluctant to flag areas of concern, or admit that they aren’t confident your DR plan will work in practice. Perhaps they’re relying on the belief that “disaster” is a statistically unlikely freak of nature (we all know hurricanes hardly ever happen in Hertford, Hereford and Hampshire) rather than a mundane but eminently more probable hardware failure or human error. It’s possible that at least one of these admissions may be left unspoken in your own organisation:

“We’re not confident of meeting our RTOs/RPOs”

Even if you passed your last annual DR test, it’s only a predictor of recovery, not a guarantee. Most testing takes place under managed conditions and takes months to plan, whereas in real life, outages strike without notice. Mission-critical applications have multiple dependencies that change frequently, so without ongoing tests, a recovery plan that worked only a few months ago might now fail to restore availability to a critical business application.
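One way to keep that admission from going unspoken is to compare what each test actually measured against the stated targets. The following is a minimal sketch, not a definitive tool: the class names, field names, application names and figures are all hypothetical, and it assumes you record a time-to-restore and a data-loss window for every application you test.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjective:
    """Targets for one application (hypothetical structure)."""
    app: str
    rto: timedelta  # recovery time objective: longest tolerable outage
    rpo: timedelta  # recovery point objective: longest tolerable data-loss window


@dataclass
class DrillResult:
    """What a recovery test actually measured for one application."""
    app: str
    time_to_restore: timedelta
    data_loss_window: timedelta


def missed_objectives(objectives, results):
    """Return (app, reason) pairs for every missed or untested target."""
    by_app = {r.app: r for r in results}
    misses = []
    for obj in objectives:
        result = by_app.get(obj.app)
        if result is None:
            # An untested application is treated as a miss, not a pass.
            misses.append((obj.app, "not tested"))
            continue
        if result.time_to_restore > obj.rto:
            misses.append((obj.app, "RTO missed"))
        if result.data_loss_window > obj.rpo:
            misses.append((obj.app, "RPO missed"))
    return misses


# Illustrative figures only.
objectives = [
    RecoveryObjective("billing", rto=timedelta(hours=4), rpo=timedelta(minutes=15)),
    RecoveryObjective("intranet", rto=timedelta(hours=24), rpo=timedelta(hours=12)),
]
results = [
    DrillResult("billing", time_to_restore=timedelta(hours=6),
                data_loss_window=timedelta(minutes=10)),
]

print(missed_objectives(objectives, results))
```

Counting an untested application as a miss is a deliberate choice in this sketch: it reflects the point that a pass in the last test only predicts recovery for what was actually tested.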

“Our DR plan only scratches the surface”

Many organisations overlook the impact of disruption on staff and the long-term availability of their data centres. How long you can support an outage at your recovery centre – whether that’s days or weeks – will determine your DR approach. Can you anticipate what you would do in a major disaster if you lost power, buildings or communication links? What if you can’t get the right people to the right places? How well is everyone informed of procedures and chains of command? People and processes are as relevant as technology when it comes to rigorous DR planning.

“We know how to fail over… just not how to fail back”

Failback – reinstating your production environment – can be the most disruptive element of a DR execution, because most processes have to be performed in reverse. Yet organisations often omit the process of testing their capabilities to recover back to the primary environment. When push comes to shove, failure to document and test this component of the DR plan could force a business to rely on its secondary site for longer than anticipated, adding significant costs and putting a strain on staff.

“Our runbooks are a little dusty”

How often do you evaluate and update your runbooks? Almost certainly not frequently enough. They should contain all the information your team needs to perform day-to-day operations and respond to emergency situations, including resource information about your primary data centre and its hardware and software, and step-by-step recovery procedures for operational processes. If this “bible” isn’t kept up to date and thoroughly scrutinised by key stakeholders, your recovery process is likely to stall, if not grind to a halt.

“Change management hasn’t changed”

Change is a constant of today’s highly dynamic production environments, in which applications can be deployed, storage provisioned and new systems set up with unprecedented speed. But the ease and frequency with which these changes are introduced means they’re not always reflected in your recovery site. The deciding factor in a successful recovery is whether you’ve stayed on top of formal day-to-day change management so that your secondary environment is in perfect sync with your live production environment.
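A simple drift check illustrates what staying in sync means in practice. The sketch below assumes, hypothetically, that you can export an inventory of component names and versions from each site as a dictionary; the function name and the sample inventories are illustrative, not taken from any particular tool.

```python
def find_drift(production, recovery):
    """Compare two {component: version} inventories.

    Returns {component: (production_version, recovery_version)} for every
    component whose versions differ. A missing component shows up as None.
    """
    drift = {}
    for name in sorted(set(production) | set(recovery)):
        prod_version = production.get(name)
        rec_version = recovery.get(name)
        if prod_version != rec_version:
            drift[name] = (prod_version, rec_version)
    return drift


# Illustrative inventories only.
production = {"app-server": "2.4", "billing-db": "9.6", "reporting-db": "11.2"}
recovery = {"app-server": "2.3", "billing-db": "9.6"}

for component, (prod, rec) in find_drift(production, recovery).items():
    print(f"{component}: production={prod} recovery={rec}")
```

Run as part of routine change management, a check like this would flag both the out-of-date application server and the database that was never provisioned at the recovery site.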

“Our backup is one size fits all”

In today’s increasingly complex IT environments, not all applications and data are created equal. Many organisations default to backing up all their systems and both transactional and supportive records en masse, using the same method and frequency. Instead, applications and data should be prioritised according to business value: this allows each tier to be backed up on a different schedule to maximise efficiency and, during recovery, ensures that the most critical applications are restored soonest.
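The tiering idea can be sketched in a few lines. Everything here is an assumption made for illustration: the three tiers, the backup intervals attached to them and the application names are hypothetical, and real values would come from your own assessment of business value.

```python
from datetime import timedelta

# Hypothetical tiers: 1 is most business-critical, 3 is least.
BACKUP_INTERVAL = {
    1: timedelta(minutes=15),
    2: timedelta(hours=4),
    3: timedelta(days=1),
}


def backup_schedule(app_tiers):
    """Map each application to the backup interval of its tier."""
    return {app: BACKUP_INTERVAL[tier] for app, tier in app_tiers.items()}


def restore_order(app_tiers):
    """List applications most critical tier first, alphabetically within a tier."""
    return sorted(app_tiers, key=lambda app: (app_tiers[app], app))


# Illustrative assignments only.
app_tiers = {
    "intranet wiki": 3,
    "order processing": 1,
    "email": 2,
    "payments": 1,
}

print(restore_order(app_tiers))
print(backup_schedule(app_tiers)["payments"])
```

The same tier assignment drives both the backup frequency and the restore sequence, so a change in an application's business value only has to be recorded in one place.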

“Backing up isn’t moving us forward”

Backups are not, in isolation, a complete DR solution, but data management is a critical element of a successful recovery management plan. Whether you’re replicating to disk, tape or a blend of both, shuttling data between storage media is achingly slow. And if it takes forever to move and restore data, then regular testing becomes even less appealing. But forgoing a regular test restoration process simply because of time-to-restore concerns is a recipe for data loss in the event of an outage.

“We don’t have the bandwidth for testing”

Testing the recovery procedures for individual applications is a very different exercise from recreating a data centre from scratch. Trying to squeeze the whole exercise into a 72-hour testing window won’t do – that’s barely enough time to marshal the right employees and ask them to take part in a test that isn’t part of their core function. So companies often end up winging it with whatever resources they have on hand, rather than mapping out the people they need to conduct and validate a truly indicative test.

“We don’t want to do it… but we’re not keen on someone else doing it”

Trying to persuade employees that an outsource option for recovery is in their best interests can be like selling Christmas to turkeys.

But in fact, partnering with a recovery service provider complements in-house skill sets by allowing your people to focus on projects that move your business forward rather than on operational tasks. It can also improve overall recoverability. Nor does managed recovery have to be an all-or-nothing proposition: it can be a considered, sensible division of responsibilities.

With always-on availability becoming a competitive differentiator, as well as an operational must-have, you don’t have the luxury of trusting to luck that your DR plans will truly hold up in the event of a disaster.

The first step to recovery is admitting you have a problem and asking for help.
