How to make SQL Server highly available on Amazon Web Services

Mission-critical database applications are among the most complex workloads to run in the public cloud. They need to keep running 24×7 under all possible failure scenarios, which requires full redundancy: provisioning standby server instances and continuously replicating the data. Configurations that work well in a private cloud may not be possible in the public cloud, and providing high availability can mean considerably higher costs to license more advanced software.

There are, of course, ways to give SQL Server mission-critical high availability (HA) and disaster recovery (DR) protection on Amazon Web Services. But it is also possible (and all too common) to choose configurations whose failover provisions fail at exactly the moment they are needed.

AWS offers two basic choices for running SQL Server applications: the Relational Database Service (RDS) and the Elastic Compute Cloud (EC2). RDS is a managed service that is often suitable for basic applications. While RDS offers a choice of six different database engines, its support for SQL Server requires the more expensive Enterprise Edition to overcome some inherent limitations, such as an inability to detect failures caused by the application software.
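
For basic workloads, provisioning a Multi-AZ SQL Server instance on RDS can be scripted in a few calls. The sketch below uses boto3, the AWS SDK for Python; the instance identifier, size, region and credentials are hypothetical placeholders, not a recommendation.

import boto3

# Hypothetical sketch: provision a Multi-AZ SQL Server Enterprise Edition
# instance on RDS. Names, sizes and credentials are placeholders.
rds = boto3.client("rds", region_name="us-east-1")

response = rds.create_db_instance(
    DBInstanceIdentifier="sqlserver-demo",       # placeholder identifier
    Engine="sqlserver-ee",                       # Enterprise Edition engine
    DBInstanceClass="db.m5.xlarge",
    AllocatedStorage=200,                        # GiB
    MasterUsername="admin",
    MasterUserPassword="REPLACE_WITH_SECRET",    # use Secrets Manager in practice
    LicenseModel="license-included",
    MultiAZ=True,                                # RDS-managed standby in another AZ
)
print(response["DBInstance"]["DBInstanceStatus"])

Even with MultiAZ enabled, the failover logic remains under RDS control, which is precisely the limitation described above for application-level failures.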

For mission-critical SQL Server applications, the substantially greater capabilities available with EC2 make it the preferred choice when HA and DR are of paramount importance. But EC2 also has a few limitations, most notably the lack of the shared storage used in traditional HA configurations. And as with RDS, Always On availability groups in the Enterprise Edition might be needed to achieve the desired level of protection.
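
As an illustration of what the Enterprise Edition route involves, the sketch below issues the T-SQL to define a two-replica Always On availability group through pyodbc. The node names, endpoint URLs and database name are hypothetical, and the prerequisites (enabling the Always On feature, creating mirroring endpoints, restoring the database on the secondary) are assumed to be in place.

import pyodbc

# Hypothetical sketch: define a two-replica Always On availability group.
# Node names, endpoints and the database are placeholders.
AG_DDL = """
CREATE AVAILABILITY GROUP [AG1]
FOR DATABASE [SalesDB]
REPLICA ON
  N'SQLNODE1' WITH (
      ENDPOINT_URL = N'TCP://sqlnode1.corp.local:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
      FAILOVER_MODE = AUTOMATIC),
  N'SQLNODE2' WITH (
      ENDPOINT_URL = N'TCP://sqlnode2.corp.local:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
      FAILOVER_MODE = AUTOMATIC);
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlnode1.corp.local;DATABASE=master;Trusted_Connection=yes;",
    autocommit=True,
)
conn.cursor().execute(AG_DDL)
conn.close()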

AWS also offers a choice of running SQL Server on either Windows or Linux. Windows Server Failover Clustering (WSFC) is a powerful and proven capability that is integral to Windows. But because WSFC expects shared storage, which EC2 does not provide, HA/DR protection depends on separate commercial or custom-developed replication software to simulate shared storage across server instances.
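
Because WSFC is administered through PowerShell, forming the cluster itself is straightforward to automate. The hedged sketch below simply shells out to the Test-Cluster and New-Cluster cmdlets with the -NoStorage switch, since there is no shared disk to bring into the cluster; node names and the static cluster IP are hypothetical, and the replication layer still has to be added separately.

import subprocess

# Hypothetical sketch: create a two-node WSFC without shared storage.
# Node names and the cluster IP are placeholders; run on a node that
# already has the Failover Clustering feature installed.
nodes = ",".join(["SQLNODE1", "SQLNODE2"])

ps_command = (
    f"Test-Cluster -Node {nodes}; "
    f"New-Cluster -Name SQLCLUSTER -Node {nodes} "
    "-StaticAddress 10.0.1.50 -NoStorage"
)

subprocess.run(
    ["powershell.exe", "-NoProfile", "-Command", ps_command],
    check=True,
)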

For Linux, which lacks a feature like WSFC, the need for additional HA/DR provisions is even greater. Using open source software means integrating multiple capabilities that, at a minimum, must include data replication, server clustering, and heartbeat monitoring with failover/failback provisions. But because getting the full HA stack to work well under all possible failure scenarios can be extraordinarily difficult, only very large organizations have the wherewithal to even consider taking on the task.
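
To give a sense of just one piece of that stack, the toy sketch below polls the primary node's SQL Server port and invokes a failover handler after repeated missed heartbeats. The address and thresholds are placeholders, and it is purely illustrative: a production stack (for example Pacemaker/Corosync with block-level replication) must also handle quorum, fencing, split-brain avoidance and failback, which is exactly why assembling one is so difficult.

import socket
import time

# Toy illustration only: detect a dead primary and hand off to a standby.
# Real clustering stacks must also handle quorum, fencing and failback.
PRIMARY = ("10.0.1.10", 1433)   # placeholder address and SQL Server port
INTERVAL_S = 5
MAX_MISSES = 3

def primary_alive() -> bool:
    try:
        with socket.create_connection(PRIMARY, timeout=2):
            return True
    except OSError:
        return False

def promote_standby() -> None:
    # Placeholder: promote the replicated standby and move the virtual IP.
    print("Primary unreachable: promoting standby...")

misses = 0
while True:
    misses = 0 if primary_alive() else misses + 1
    if misses >= MAX_MISSES:
        promote_standby()
        break
    time.sleep(INTERVAL_S)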

Failover clustering – purpose-built for the cloud

The growing popularity of private, public and hybrid clouds has been accompanied by increased use of failover clustering solutions designed specifically for a cloud environment. These HA solutions are implemented entirely in software that creates, as the name implies, a cluster of servers and storage with automatic failover to ensure high availability at the application level.

Most of these solutions provide complete HA/DR protection that combines real-time block-level data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more sophisticated offerings add advanced capabilities such as support for Always On Failover Cluster Instances in the less expensive Standard Edition of SQL Server on both Windows and Linux, WAN optimization to maximize multi-region performance, and manual switchover of primary and secondary server assignments to facilitate planned maintenance, including the ability to perform regular backups without disrupting the application.

Although these purpose-built HA/DR solutions are generally storage-agnostic, enabling them to work with shared storage area networks (SANs), shared-nothing SANless failover clustering is usually preferred for its ability to eliminate potential single points of failure. Most SANless failover clusters are also application-agnostic, enabling organizations to standardize on a single, universal HA/DR solution. That same quality affords protection for the entire SQL Server environment, including the database, logins, agent jobs and so on, all in an integrated fashion.

The example EC2 configuration in the diagram shows a typical two-node SANless failover cluster that works with either Windows or Linux. The cluster is configured within a Virtual Private Cloud (VPC), with the two SQL Server nodes in different availability zones. Synchronous block-level replication across the two availability zones assures both high availability and high performance. The file share witness, which is needed to achieve a quorum, is hosted on the domain controller in a third availability zone. Keeping each quorum vote in a different zone means the loss of any single zone can never cost more than one vote.
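
The network layout behind that diagram can be expressed in a few boto3 calls: one VPC with a subnet in each of three availability zones, one per quorum vote. The region, AZ names and CIDR blocks below are hypothetical placeholders, and routing, security groups and the instances themselves are omitted.

import boto3

# Hypothetical sketch of the three-AZ layout: two SQL Server nodes plus a
# domain controller hosting the file share witness, each in its own zone.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

layout = {
    "sql-node-1": ("10.0.1.0/24", "us-east-1a"),
    "sql-node-2": ("10.0.2.0/24", "us-east-1b"),
    "witness-dc": ("10.0.3.0/24", "us-east-1c"),
}

subnet_ids = {}
for name, (cidr, az) in layout.items():
    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr, AvailabilityZone=az)
    subnet_ids[name] = subnet["Subnet"]["SubnetId"]
    print(f"{name}: {subnet_ids[name]} in {az}")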

Above: SANless failover clustering supports multi-zone and multi-region EC2 configurations with either multiple standby server instances or a single standby server instance, as shown here.

HA and DR configurations involving three or more server instances are also possible with most SANless failover clustering solutions. The server instances can be located entirely within the AWS cloud or in a hybrid cloud. One such three-node configuration is a two-node HA cluster located in an enterprise data center with asynchronous data replication to AWS or another cloud service for DR purposes—or vice versa.

In both two- and three-node clusters, failovers are normally configured to occur automatically, and both failovers and failbacks can also be triggered manually (with appropriate authorization, of course). Three-node clusters can also facilitate planned hardware and software maintenance on all three servers while providing continuous high availability for the application and its data.

With 44 availability zones spread across 16 geographical regions, the AWS global infrastructure affords tremendous opportunity to maximize availability by configuring SANless failover clusters with multiple, geographically dispersed redundancies. Such a global footprint also enables SQL Server applications and data to be deployed close to end users to deliver satisfactory performance.