A guide to planning for application resiliency in cloud environments


As businesses look to clouds for faster, more flexible growth, they confront significant challenges from a legacy application base that has varying levels of cloud suitability. Here, we examine how requirements around fault tolerance and disaster recovery can affect the choice of cloud, or of architecture strategy within a cloud.

Planning for application resiliency in cloud environments can present special challenges. Strategies can be similar to those used in traditional data centres, but implementations often differ.

At the base of the implementation differences is the architecture typically chosen for cloud applications.  Clouds tend to favour scaling “out” to more nodes rather than “up” to a bigger node.  This choice enables more graceful degradation in the event of node failure.  It also allows developers to add capacity in smaller units that can be finely tuned to immediate requirements, avoiding larger buys and attendant unused capacity.  Scaling out does, though, present different requirements for high availability.
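
As a rough illustration of the trade-off above, the following sketch compares how much capacity survives a single node failure, and how much capacity sits unused, when the same demand is met with a few large nodes versus many small ones. The function names and the figures in the usage note are hypothetical, and the model assumes identical nodes with evenly spread load, which real systems only approximate.

```python
def remaining_capacity_fraction(node_count: int, failed_nodes: int = 1) -> float:
    """Fraction of total capacity still available after some nodes fail.

    Assumes identical nodes and evenly spread load (an illustrative simplification).
    """
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    if not 0 <= failed_nodes <= node_count:
        raise ValueError("failed_nodes must be between 0 and node_count")
    return (node_count - failed_nodes) / node_count


def nodes_needed(required_units: int, units_per_node: int) -> int:
    """Smallest number of nodes whose combined capacity covers the requirement."""
    if units_per_node <= 0:
        raise ValueError("units_per_node must be positive")
    return -(-required_units // units_per_node)  # ceiling division


def unused_capacity(required_units: int, units_per_node: int) -> int:
    """Capacity bought but not needed, given a fixed purchase unit size."""
    purchased = nodes_needed(required_units, units_per_node) * units_per_node
    return purchased - required_units
```

For a hypothetical demand of 130 units, buying 100-unit nodes means two nodes and 70 units of idle capacity, while buying 10-unit nodes means thirteen nodes and none. Losing one node of ten leaves nine tenths of capacity in service; losing the only node leaves nothing.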

In cases where services are housed on equipment that is physically proximate, as in a traditional data centre, strategies like virtual IPs and load balancers often suffice to manage even scale-out infrastructures.  Planning for availability and resilience across multiple geographies, though, can require detailed consideration and engineering around managing DNS services and sessions, request routing, and persistent storage management.  Cloud providers and implementations vary in the services they offer to support these requirements.
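
One simple form of the request routing mentioned above can be sketched as follows: prefer a healthy endpoint in the caller's own region, and fail over to a healthy endpoint elsewhere only when none is available locally. This is a minimal sketch, not a description of any provider's service; the `Endpoint` type, the region names, and the `healthy` flag (which a real system would set from health checks) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Endpoint:
    name: str
    region: str
    healthy: bool = True  # in practice, updated by periodic health checks


def pick_endpoint(endpoints: List[Endpoint], preferred_region: str) -> Endpoint:
    """Return a healthy endpoint, preferring the caller's region.

    Falls back to the first healthy endpoint in any other region, and raises
    if every endpoint is unhealthy.
    """
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")
    local = [e for e in healthy if e.region == preferred_region]
    return (local or healthy)[0]
```

A production design would also have to address the DNS and session concerns noted above, for example how quickly clients notice a change of endpoint and what happens to sessions held by the failed region.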

Typical tiered applications or services (or microservices) rely on a core of persistent data stores, layers of business and application logic to manipulate or communicate that data, and presentation layers that give users or other applications an interface for invoking the business logic. Distributing these layers across multiple pieces of hardware typically involves detailed planning around state management, load balancing, and latencies. Caching layers are often intermingled with the core functional layers to drive more responsiveness out of the system, and these caches have their own requirements for distributed consistency and state management.

The core persistent data stores are particularly challenging with respect to resiliency and high availability. While databases implemented on physically proximate equipment have well-understood clustering solutions that retain transactional integrity by synchronising duplicate data stores, distributed large-scale databases often require more thoughtful design.  This can range from asynchronous replication of data to avoid latency in the transaction flow to data partitioning and the adoption of an “eventually consistent” paradigm for the underlying data. The specifics of the solution will depend on the application design and any requirements to limit data loss (Recovery Point Objective or RPO), but there are well-understood engineering patterns that accommodate common needs.
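
The relationship between asynchronous replication and RPO can be made concrete with a toy model. In the sketch below, writes are acknowledged as soon as they reach the primary and are copied to the replica later; the age of the oldest write not yet copied is the data that would be lost if the primary failed at that moment. The class and method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and real databases implement replication very differently.

```python
from collections import deque


class AsyncReplicatedStore:
    """Toy key-value store with one primary and one asynchronously updated replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # writes acknowledged but not yet replicated

    def write(self, key, value, timestamp):
        """Acknowledge the write on the primary without waiting for the replica."""
        self.primary[key] = value
        self._pending.append((timestamp, key, value))

    def replicate(self, max_items=None):
        """Apply pending writes to the replica in order; return how many were applied."""
        applied = 0
        while self._pending and (max_items is None or applied < max_items):
            _, key, value = self._pending.popleft()
            self.replica[key] = value
            applied += 1
        return applied

    def exposure_seconds(self, now):
        """Age of the oldest unreplicated write, i.e. the current data-loss window."""
        if not self._pending:
            return 0.0
        return now - self._pending[0][0]

    def meets_rpo(self, now, rpo_seconds):
        """True if losing the primary right now would stay within the RPO."""
        return self.exposure_seconds(now) <= rpo_seconds
```

Synchronous replication would correspond to calling `replicate()` inside `write()` before acknowledging: the exposure drops to zero, at the cost of adding the replica's latency to every transaction.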

A larger concern with distributed systems resiliency is organisational.  All infrastructure environments manage multi-layered resilience complexity with a mix of vendor and in-house engineering.  A typical non-cloud environment can leverage a more mature marketplace for vendor products and services facilitating the various layers. Resiliency in cloud environments may require more in-house engineering and less mature technologies to meet performance and availability goals for the applications or services. This often entails additional risk or organisational change to support the application.  The trend toward “DevOps,” which creates a closer working relationship between application engineers and systems administrators, is one key indicator of how these changes are playing out in the enterprise.

While moving applications into a private or public cloud environment may present an opportunity to save costs or improve operations, applications vary in their suitability for cloud infrastructures.  Some architectures (web farms, application server clusters, and the like) are similar to cloud-native best practice, and require little retooling to allow for resiliency.  More complex patterns are also manageable with proper planning, design, and execution.  Evaluating applications explicitly for resiliency requirements and fit against cloud-native architectural principles allows firms to take best advantage of cloud economics and efficiencies in the enterprise.