How to leverage cloud architectures for high availability

Cloud architectures are in broad use nowadays, and the cloud offers a plethora of amazing alternatives in terms of services and solutions. However, with great power comes great responsibility, and the cloud presents itself as a place where failure can, and eventually will, occur. When it does, it can spread across the entire architecture fast, possibly causing massive outages that can bring a business to its knees.

Okay, that is not an overly optimistic scenario (more likely the opposite), but there is no need to fear. This is the nature of almost any architecture, so why should the cloud be any different?

To prepare for the worst, cloud architects face two different problems at scale at any given time: first, if something unexpected and undesired happens, how can business operations continue as if nothing had happened? And second, if something unexpected and undesired happens and operations cannot continue as usual, how can the architecture be brought up somewhere else, within a reasonable window of time, so that operations resume as usual?

In these terms we can discuss:
– Continuing business as usual in the face of an outage
– Resuming business as usual in the shortest time possible in the face of an unrecoverable outage

The first is covered by high availability, and the second is covered by disaster recovery. Here, we will look at high availability.

The alternatives currently on the table

The cloud offers more than enough to face both scenarios. Most clouds are distributed geographically and technically so as to avoid massive outage scenarios on their own; at a small scale, clouds have what are known as Availability Zones (AZs) or Availability Domains (ADs). These are usually different buildings, or different clusters of buildings, in the same geographic area, interconnected but highly redundant, especially when it comes to power, cooling and storage.

At a large scale, clouds are divided into regions: global regions, that is, with 10 or 15 regions if we look at giants such as Google Cloud and Amazon Web Services. These regions are spread geographically across the globe and serve two purposes: isolation in case of disaster, and performance. Customers in different countries and continents are served by the nearest point of service, not rerouted to the main one, which keeps latency low and responsiveness high.

Taking all this into consideration, it is the task of the architect to design the service with availability zones and regions in mind, in order to serve customers properly and take advantage of the technologies at hand. Cloud providers do not replicate architectures across regions; that is something architects and engineering teams need to consider and tackle. The same goes for availability domains, unless the discussion is about storage: core services such as compute instances and virtual networks are, for the most part, not replicated across ADs or AZs.
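
As a minimal sketch of what designing across zones can look like in practice, the snippet below uses the AWS SDK for Python (boto3) to enumerate the Availability Zones in a region and place one identical compute instance in each of them. The region, AMI ID and instance type are placeholders, not values from this article; a real deployment would add networking, tagging, load balancing and error handling on top.

    # Spread identical instances across every Availability Zone in a region.
    # Assumptions: AWS credentials are already configured, and REGION, AMI_ID
    # and INSTANCE_TYPE are placeholders to be replaced with real values.
    import boto3

    REGION = "eu-west-1"                   # placeholder region
    AMI_ID = "ami-0123456789abcdef0"       # placeholder image
    INSTANCE_TYPE = "t3.micro"             # placeholder instance type

    ec2 = boto3.client("ec2", region_name=REGION)

    # Discover the zones the region actually offers instead of hard-coding them.
    zones = [z["ZoneName"]
             for z in ec2.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"]

    for zone in zones:
        # One instance per zone: losing a single AZ never takes the whole fleet down.
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )

In a real architecture the same idea would sit behind a load balancer and an auto-scaling mechanism, but the principle holds: let the SDK tell you which zones exist and never concentrate the fleet in one of them.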

The alternatives for high availability involve avoiding single points of failure, testing the resilience of the architecture before deploying to production, and either constructing master/master, master/slave or active/passive solutions in order to be always available, or having automation that reduces unavailability time to a minimum.

What are considered best practices?

The following is a list of best practices for providing HA in the cloud. It is not completely comprehensive, and much of it also applies, to a lesser degree, to data centre architectures.

  • Distributing load balancers across ADs, and beware of single points of failure (SPOFs) in the architecture: two is one and one is none
  • If the cloud provider does not automatically provide redundancy across ADs and keep at least three copies of the same data, it may be a good idea to re-evaluate the choice of provider, or to consider a service that does
  • Easy to get in, easy to get out: it is necessary to be certain that, should it become essential to move or redirect services, it can be done with minimum effort
  • Implementing extra monitoring and metrics systems where possible, with good integration: ideally off-the-shelf, through third parties that can provide timely alerts and rich diagnostic information. Platforms such as New Relic, or incident tools such as PagerDuty, can be extremely valuable
  • Keeping the architecture versioned, and in IaC (infrastructure as code) form: if an entire region goes away, it will be possible to spawn the entire service in a different region, or even a different cloud, provided data has been replicated and DNS services are elastic
  • Keeping DNS services elastic: this goes without saying, especially after the previous point; flexibility is key when it comes to pointing records in one direction or another (see the failover sketch after this list)
  • Some clouds do not charge for instances in a stopped state, especially VMs; Oracle, for example, only charges for stopped instances if they are Dense or HighIO. It is easy to leverage this and keep a duplicate architecture in two regions; with IaC, this is realistic and also easy to maintain
  • Synchronising necessary and critical data across ADs constantly, in the form of block storage that is ready to use and often unattached; avoid NVMe if that implies being billed for unused compute resources to which those NVMe drives are attached
  • Leveraging object storage in order to have replicated data in two or more regions (see the replication sketch after this list)
  • Leveraging cold storage (archive, such as Glacier) to retain critical data in several dispersed regions; sometimes the price of breaking the minimum retention policy and requesting a restore is worth paying in order to bring a production environment back up
  • Using the APIs and SDKs for automation: by creating HA and failover tools, systems can become autonomous and take care of failovers by themselves, and mixing this with anomaly detection can be a game changer (see the failover sketch after this list). Do not rely too heavily on dashboards: most things can be, and some must be, done behind the curtain
  • Nobody says it is necessary to stick to one cloud: with the power of orchestration, it is simple enough to have infrastructure in more than one cloud at the same time, running comparisons and, if necessary, switching providers
  • Using tools to test the resilience of the infrastructure and the readiness of the engineering team: simulating important failures in the architecture can yield massive learning
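
For the object storage point above, a minimal sketch with boto3: it enables cross-region replication from a source bucket to a destination bucket in another region. The bucket names and the IAM role ARN are placeholders of my own, not values from this article; both buckets must already exist, and replication requires versioning, which the sketch turns on explicitly.

    # Replicate objects from a bucket in one region to a bucket in another region.
    # Assumptions: both buckets already exist, and REPLICATION_ROLE_ARN points to
    # an IAM role that allows S3 to replicate on your behalf (placeholder values).
    import boto3

    SOURCE_BUCKET = "my-app-data-eu"        # placeholder
    DEST_BUCKET = "my-app-data-us"          # placeholder
    REPLICATION_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-replication"  # placeholder

    s3 = boto3.client("s3")

    # Replication requires versioning on both the source and destination buckets.
    for bucket in (SOURCE_BUCKET, DEST_BUCKET):
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    # Replicate every new object to the destination bucket in the other region.
    s3.put_bucket_replication(
        Bucket=SOURCE_BUCKET,
        ReplicationConfiguration={
            "Role": REPLICATION_ROLE_ARN,
            "Rules": [{
                "ID": "replicate-everything",
                "Prefix": "",
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{DEST_BUCKET}"},
            }],
        },
    )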
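
For the DNS elasticity and automation points, a hedged sketch of a self-driving failover: it polls a health endpoint on the primary site and, once enough checks fail in a row, repoints a DNS record at a standby address through Amazon Route 53 via boto3. The hosted zone ID, record name, IP addresses and thresholds are all placeholders; a production tool would add alerting, backoff and a way to fail back.

    # Minimal failover loop: watch the primary, flip DNS to the standby when it dies.
    # Assumptions: HOSTED_ZONE_ID, RECORD_NAME and both IPs are placeholder values.
    import time
    import urllib.request

    import boto3

    HOSTED_ZONE_ID = "Z0123456789ABCDEF"   # placeholder
    RECORD_NAME = "app.example.com."       # placeholder
    PRIMARY_IP = "203.0.113.10"            # placeholder
    STANDBY_IP = "198.51.100.10"           # placeholder
    HEALTH_URL = f"http://{PRIMARY_IP}/healthz"
    FAILURES_BEFORE_FAILOVER = 3

    route53 = boto3.client("route53")

    def primary_is_healthy() -> bool:
        """Return True if the primary answers its health check."""
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    def point_record_at(ip_address: str) -> None:
        """Upsert the A record so traffic goes to the given address."""
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={
                "Comment": "automated failover",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": RECORD_NAME,
                        "Type": "A",
                        "TTL": 60,  # keep the TTL low so the switch propagates quickly
                        "ResourceRecords": [{"Value": ip_address}],
                    },
                }],
            },
        )

    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            point_record_at(STANDBY_IP)
            break  # a real tool would alert here and later fail back to PRIMARY_IP
        time.sleep(30)

The same pattern works with any provider that exposes DNS through an API; the essential ingredients are a low TTL, an automated health check, and a record update that no human has to perform at three in the morning.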

Conclusion

Best practices are only best practices if applied, but not all of them can be applied to the same architecture or at the same time, so the judgement of experienced architects and engineering teams is always necessary.

That said, most of these points can be applied without significant effort. It only takes some hard work and willingness, and the results will be worth their weight in copper.

Happy architecting and keep up the good work.