The perils of not having disaster recovery – or, why we love a good reserve parachute

One of the most important but often overlooked steps in building reliable infrastructure is disaster recovery (DR). Surprisingly, most companies either decide not to implement DR or implement it only halfway. Here, I intend to explore the common terms and concepts in disaster recovery: how to leverage the cloud, the different types of DR, the DR plan and its important considerations, and the economic impact.

Regional vs. zone/domain DR

DR can be implemented at the regional or the zone/domain level, depending on needs. I advocate having high availability (HA) at the zone/domain level and DR at the regional level; the cloud presents itself as a good alternative in cost-value terms for both HA and DR – even more so given the plethora of providers that exist nowadays.

Levelling the field

First, some widely used terms:

RTO – recovery time objective. Essentially, how long it will take to have the DR site operational and ready to accept traffic.

RPO – recovery point objective. Essentially, the point in time in the primary site's past to which the secondary site will return. It is also an indicator of data loss: if data is synced every hour and site A crashes at 11:59am, site B has data only up to 11am, so in the worst case about an hour of data is lost and the secondary site comes up in the state the primary was in at 11am. That is an RPO of 1h (the arithmetic is sketched in code after these definitions). The smaller the better – alas, the more costly the implementation will be.

Regional – how far is too far, and how close is too close? With the primary in the London region and the secondary in the Dublin region, an asteroid the size of Kent falling on Wales could make the solution unviable, but the likelihood of that happening is negligible.

Cost – it is always a factor, and in this case it can make a real difference, since regions such as Ashburn (USA) are usually (significantly) cheaper than regions in Europe. That said, having the secondary site close to the primary has its own value. Can it be too far? It depends. If the nature of the business depends on millisecond transactions, then analysts and customers in Bangalore cannot use a site in Phoenix. If it does not, the savings of having the secondary site (temporarily) in a distant region are worth it. Remember, it is not permanent – while running on the secondary, the system is in a degraded state.
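As a quick illustration of the RPO arithmetic above, here is a minimal sketch in Python; the timestamps are the illustrative values from the RPO definition (with an arbitrary date), not recommendations:

    from datetime import datetime, timedelta

    sync_interval = timedelta(hours=1)           # data is synced every hour
    last_sync = datetime(2023, 1, 1, 11, 0)      # site B last synced at 11:00
    crash_time = datetime(2023, 1, 1, 11, 59)    # site A crashes at 11:59

    # Worst-case data loss is the time elapsed since the last sync, so the
    # RPO must be at least one full sync interval.
    data_loss = crash_time - last_sync
    print(f"Worst-case data loss: {data_loss}")  # 0:59:00, i.e. an RPO of ~1h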

An alternative approach is having three sites in total – the primary, a secondary DR site with a given RTO/RPO in case it is needed, and a tertiary DR site in the form of a pilot light only (more on pilot light later).

Hot, cold, warm standby

In some circles, DR is described in hot/warm/cold terms. I usually prefer these terms for high availability architectures, although I have seen DR sites referred to as warm. A hot site is one that is up and running and to which I can fail over immediately – something I would relate to HA, as mentioned. A warm site, by contrast, has the resources provisioned but only the critical part running or ready to run; it may take a few minutes until things are in order and traffic can fail over to that DR site.

A cold standby is one that receives updates, but not necessarily frequent ones, so failing over may leave the RPO much larger than desired. Of course, the RTO and RPO are usually numbers bound to the SLA, so they need to be well thought out and taken care of.

Domain/zone DR – worth it or not?

DR at the zone/domain level is a difficult decision for several reasons. Availability zones are sites within a region with independent networking, power, cooling, and so on, i.e. isolated from each other. One or more data centres make up a zone, and one or more zones (usually three) make up a region. Zones are frequently used for high availability: network connectivity between zones usually has very low latency – on the order of a few hundred microseconds – and transfer rates are of such magnitude that RPOs become almost negligible, since data is replicated everywhere nearly instantly.

Since HA across zones is sometimes a luxury, a DR solution within the zones can be necessary instead. In this case, it is usually an active/passive configuration, meaning the secondary site is stopped.

Economic impact

It is a given that the economic impact is a big factor in RTO, RPO, compliance, security, and GDPR as well. It is not necessarily true that the more responsive the secondary site, the more expensive it is; that will depend on the architecture, how it is implemented, and how it is carried out when needed. Basically, the economic impact is driven by the amount of information kept in the different sites, not so much by the size of the infrastructure or by the replication of that information; replication can be automated, and nowadays done frequently enough to have almost the same data in two or more regions at any given moment.

Also, while the secondary infrastructure is stopped, it is possible to resume operations in minutes without a large impact on cost. Of course, this will depend on the cloud provider. Some providers will charge even for stopped VMs or BMs, depending on the shape/family – for instance, Oracle Cloud will continue billing a stopped instance that uses NVMe SSDs, meaning any Dense/HighIO machine – so beware of these details.

Automation also falls within the economic impact. Failover can be fully automated, semi-automated, or not automated at all. For the most part, in DR cases I prefer semi-automation in a two-man rule fashion. What this means is that even when everything indicates there is a massive outage that requires DR, it takes more than one person to say ‘go’ on the failover, and more than one person to actually activate the processes involved. The reason: once the DR process is started, going back before completion can be a nightmare.
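As a minimal sketch of that two-man rule, assuming a hypothetical trigger_failover() callback standing in for whatever actually starts the DR runbook, the gate only fires once two distinct operators have said ‘go’:

    def trigger_failover() -> None:
        # Placeholder for the real DR runbook kick-off.
        print("DR failover started -- no going back before completion")

    class TwoManFailover:
        REQUIRED_APPROVALS = 2  # the 'two-man rule'

        def __init__(self) -> None:
            self.approvers: set[str] = set()
            self.fired = False

        def approve(self, operator: str) -> None:
            self.approvers.add(operator)  # repeat approvals by one person collapse
            if not self.fired and len(self.approvers) >= self.REQUIRED_APPROVALS:
                self.fired = True
                trigger_failover()

    gate = TwoManFailover()
    gate.approve("alice")  # first 'go' -- nothing happens yet
    gate.approve("alice")  # same operator again -- still nothing
    gate.approve("bob")    # second distinct 'go' -- failover starts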

Pilot light

It may seem strange to see pilot light under economic impact, but there is a reason for it: pilot light allows a DR site with a minimum of infrastructure. Although a data replica must exist, the DR site needs only one or two VMs, and those VMs, when needed, take care of spawning the necessary resources. As an engineer, I sometimes steer towards pilot light with an orchestration tool, such as Terraform.

Having a virtual machine online that contains all the IaC (infrastructure as code) files necessary to spin up the entire infrastructure is convenient; usually it is a matter of a few minutes until the latest version of the infrastructure is back up and running, connected to all the necessary block devices. Remember, nowadays it is possible to handle even load balancers with IaC, so there are no boundaries to this.
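As a minimal sketch of what that pilot-light VM might run, assuming a hypothetical /opt/dr/terraform directory holding the IaC files, the whole DR site can be replayed with the standard Terraform CLI:

    import subprocess

    IAC_DIR = "/opt/dr/terraform"  # hypothetical checkout of the IaC repository

    def spin_up_dr_site() -> None:
        # Fetch providers and modules, then build the entire DR
        # infrastructure (VMs, block devices, load balancers) from code.
        subprocess.run(["terraform", "init"], cwd=IAC_DIR, check=True)
        subprocess.run(["terraform", "apply", "-auto-approve"], cwd=IAC_DIR, check=True)

    if __name__ == "__main__":
        spin_up_dr_site()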

The DR plan

This is a critical part, not only because it describes the processes that come into effect when failover becomes a reality, but also because all the stakeholders have a part in the plan and all of them must know what to do when it is time to execute it. The plan must, of course, not be written and forgotten; it must be tested, and not just once but in a continuous-improvement manner.

Do anything and everything necessary to measure its efficacy and efficiency. It is worth testing the plan both with and without the stakeholders being aware, in order to see how they will behave in a real situation – and it is also advisable to repeat the test every six months, since infrastructure and processes change.

Leaving the degraded state

Sometimes the plan does not cover going back to the primary site, and this is important: while running on the secondary, the infrastructure is in a degraded state, and it is necessary to bring the systems back to normal so as to have DR again. Since going back to normal takes time as well, and all the data needs to be replicated back, this is something that needs to be done under a maintenance window – and surely all the customers will understand the need for one; but just in case, bear in mind when setting up SLOs and SLAs that this maintenance window may be necessary. It is possible to add it as fine print, of which I am not a fan, or to account for it within the calculations.

Conclusion

There are some considerations with regard to DR across regions, specifically but not only for Europe, and these come in the form of data, security, compliance, and GDPR. GDPR requires companies to keep personal data available in the event of any technical or physical incident, so DR is no longer a wish-list item – it is required. What this basically means is that under GDPR legislation, data held on a person must remain available for deletion or for transfer upon request. For those legally inclined, more information can be found in Article 32 of the GDPR. And if DR is found daunting, there are nowadays multiple vendors that offer DR as a service (DRaaS) as well.