SQL Server high availability and disaster recovery for AWS, Azure and GCP: A guide

The public cloud offers myriad options for providing high availability (HA) and disaster recovery (DR) protections for SQL Server database applications. At the same time, some options available in a private cloud are not available in the public cloud. Given the many choices and limitations, the challenge facing system and database administrators is determining the best available options for each application running in hybrid and purely public clouds.

All cloud service providers (CSPs) have service level agreements (SLAs) with money-back guarantees for when uptime falls below specified levels, usually ranging from 95.00% to 99.99%. Four nines of uptime (99.99%) is generally accepted as constituting HA, and to be eligible for these 99.99% SLAs, configurations need to meet certain requirements.
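
To make those numbers concrete, this short sketch converts an SLA percentage into an allowable downtime budget (illustrative arithmetic only; each CSP defines its own measurement windows and exclusions):

```python
# Convert an SLA uptime percentage into an allowable downtime budget.
# Illustrative arithmetic only; each CSP defines its own measurement window.

def downtime_budget_minutes(sla_percent: float, period_hours: float) -> float:
    """Allowable downtime, in minutes, for a given SLA over a given period."""
    return period_hours * 60 * (1 - sla_percent / 100)

for sla in (95.0, 99.0, 99.9, 99.99):
    monthly = downtime_budget_minutes(sla, period_hours=730)  # ~1 month
    print(f"{sla:6.2f}% uptime allows ~{monthly:7.1f} minutes of downtime per month")

# Four nines (99.99%) allows only ~4.4 minutes per month (~53 minutes per year),
# which is why it is generally accepted as the threshold for HA.
```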

But be forewarned: The SLAs only guarantee “dial tone” at the server level, and explicitly exclude many causes of downtime at the database and application levels. The exclusions typically include natural disasters, the customer’s actions (or inactions), and the customer’s system or application software. There may also be a separate SLA for storage that is lower than the one for servers. So while it is advantageous to leverage various aspects of a CSP’s infrastructure, additional provisions are needed to ensure adequate uptime for mission-critical SQL Server databases.

Differences between HA and DR

Properly leveraging the cloud’s resilient infrastructure requires understanding key differences between “failures” and “disasters” because those differences affect the choice of provisions used for HA and DR protections. Failures are small in scale and short in duration, affecting a server, rack, or the power or cooling in a single datacenter. Disasters have more widespread and enduring impacts, and can affect multiple datacenters in ways that preclude rapid recovery.

The most consequential difference involves the location of the redundant resources (systems, software and data). For recovering from a localized failure, those resources can be local, on a local area network (LAN). By contrast, the redundant resources required to recover from a widespread disaster must span a wide area network (WAN).

For database applications that require high transactional throughput, the ability to replicate the active instance’s data synchronously across the LAN enables the standby instance to be “hot” and ready to take over immediately in the event of a failure. Such rapid recovery should be the goal of all HA provisions.

In DR configurations, data must be replicated asynchronously to prevent the latency inherent in the WAN from adversely impacting the throughput of the active instance. This means that updates to the standby instance always lag behind updates to the active instance, making the standby “warm” and resulting in an unavoidable delay during the manual recovery process.
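
The distinction can be modeled in a few lines of Python. In the synchronous case, each commit waits for the standby’s acknowledgment; in the asynchronous case, commits return immediately and the standby drains a backlog. This is a conceptual toy, not any vendor’s replication engine, and the delay constants are illustrative:

```python
import time
from collections import deque

# Toy contrast between synchronous and asynchronous replication using
# in-memory lists. The delay constants are illustrative only.

LAN_DELAY = 0.001   # ~1 ms round trip between availability zones
WAN_DELAY = 0.040   # tens of ms round trip between distant regions

def synchronous_writes(blocks):
    primary, replica = [], []
    for b in blocks:
        primary.append(b)
        time.sleep(LAN_DELAY)   # each commit waits for the standby's acknowledgment,
        replica.append(b)       # so the standby is never behind ("hot")
    return primary, replica

def asynchronous_writes(blocks):
    primary, backlog = [], deque()
    for b in blocks:
        primary.append(b)       # commit returns immediately; WAN latency is hidden,
        backlog.append(b)       # but the standby lags until the backlog drains ("warm")
    return primary, backlog

p, r = synchronous_writes(range(100))
print(f"synchronous: standby behind by {len(p) - len(r)} blocks at commit time")

p, backlog = asynchronous_writes(range(100))
print(f"asynchronous: standby behind by {len(backlog)} blocks at commit time")
```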

All three major CSPs accommodate these differences with redundancies both within and across datacenters. Of particular interest is the variously named “availability zone,” which makes it possible to combine the synchronous replication available on a LAN with a degree of geographical separation. These zones connect two or more datacenters within a region via a low-latency, high-throughput network to facilitate synchronous data replication. With latencies around one millisecond, the use of multi-zone configurations has become a best practice for HA.

For DR, all CSPs have offerings that span multiple regions to afford additional protection against major disasters that could affect multiple zones. For example, Google has what could be called DIY (Do-It-Yourself) DR guided by templates, cookbooks and other tools. Microsoft and Amazon have managed DR-as-a-Service (DRaaS) offerings: Azure Site Recovery and CloudEndure Disaster Recovery, respectively.

For all three CSPs it is important to note that data replication across regions must be asynchronous, so the recovery will need to be performed manually to ensure minimal or no data loss. The resulting delay in recoveries is tolerable, however, because region-wide disasters are rare.

Making SQL Server “always on”

SQL Server offers two of its own HA/DR features: Always On Failover Cluster Instances (FCIs) and Always On Availability Groups. FCIs afford three notable advantages: inclusion in the less expensive Standard Edition; protection of the entire SQL Server instance; and support in all versions since SQL Server 7. A significant disadvantage is the need for a storage area network (SAN) or other form of shared storage, which is unavailable in the cloud. The lack of shared storage was addressed in Windows Server 2016 Datacenter Edition with the introduction of Storage Spaces Direct (S2D). But S2D also has limitations, most notably its inability to span availability zones.

SQL Server’s other HA/DR feature, Always On Availability Groups, is a more robust solution capable of providing rapid recoveries with no data loss. Among its other advantages are inclusion in SQL Server 2017 for Linux, no need for shared storage, and readable secondaries for queries (with appropriate licensing). But for Windows, it requires licensing the substantially more expensive Enterprise Edition, and it lacks protection for the entire SQL Server instance.

It is worth noting that SQL Server also offers a Basic Availability Groups feature, but it supports only a single database per Availability Group, making it suitable for only the smallest of environments.

The limitations associated with both options have created a need for third-party failover clustering solutions purpose-built to provide HA/DR protections for virtually all Windows and Linux applications in private, public and hybrid cloud environments. These software-only solutions facilitate, at a minimum, real-time data replication, continuous monitoring able to detect failures at the application level, and configurable policies for failover and failback. Most also offer a variety of value-added capabilities, including some specific to popular applications like SQL Server.
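
As a sketch of what application-level monitoring means in practice, the probe below runs an actual query against SQL Server rather than merely pinging the host, and invokes a failover policy after consecutive failures. It is a minimal illustration, not any vendor’s monitoring agent; the connection string, interval, threshold and trigger_failover() are all hypothetical:

```python
import time
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Minimal sketch of application-level monitoring: instead of pinging the
# server ("dial tone"), run an actual query against the database engine and
# invoke a failover policy after consecutive probe failures. The connection
# string, interval, threshold and trigger_failover() are all hypothetical.

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=sqlprimary;DATABASE=master;Trusted_Connection=yes")
FAILURE_THRESHOLD = 3   # consecutive failed probes before failing over
PROBE_INTERVAL = 10     # seconds between probes

def probe() -> bool:
    try:
        conn = pyodbc.connect(CONN_STR, timeout=5)  # raises if unreachable
    except pyodbc.Error:
        return False
    try:
        conn.cursor().execute("SELECT 1").fetchone()  # exercises the engine itself
        return True
    except pyodbc.Error:
        return False
    finally:
        conn.close()

def trigger_failover():
    print("failover policy invoked")  # a real cluster would promote the standby here

failures = 0
while failures < FAILURE_THRESHOLD:
    failures = 0 if probe() else failures + 1
    time.sleep(PROBE_INTERVAL)
trigger_failover()
```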

Failover clustering offerings afford two major advantages: SANless operation that overcomes the lack of shared storage in the cloud and application-agnosticism that eliminates the need to have different HA/DR provisions for different applications.

Editor’s note: More detailed information about the operation and benefits of SANless failover clustering is available in How to make Amazon Web Services highly available for SQL Server.

Time is running out for SQL Server 2008/R2 support – here’s what to do about it

Extended support for SQL Server 2008 and 2008 R2 will end in July 2019, giving database and system administrators precious little time to make some necessary changes. Upgrading the software to the latest version is always an option, of course, but for a variety of reasons, that may not be viable for some applications. So Microsoft is providing an alternative: Get three more years of free Extended Security Updates by migrating to the Azure cloud.

While their 2008 vintage may designate these as “legacy” applications, many may still be mission-critical and require some form of high availability (HA) and/or disaster recovery (DR) protections. This article provides an overview of the options available within and for the Azure cloud, and highlights two common HA/DR configurations.

Availability options within the Azure cloud

The Azure cloud offers redundancy within datacenters, within regions and across multiple regions. Redundancy within datacenters is provided by Availability Sets, which distribute servers across different Fault Domains residing in different racks to protect against failures at the server and rack levels. Within regions, Azure is rolling out Availability Zones (AZs), which consist of at least three datacenters interconnected via high-bandwidth, low-latency networks capable of supporting synchronous data replication. For even greater resiliency, Azure offers Region Pairs, where a region is paired with another within the same geography (e.g. US or Europe) to protect against widespread power or network outages and major natural disasters.

Administrators should be fully aware, however, that even with the 99.99% uptime assurances afforded by AZs, the SLA’s definition of downtime excludes many common causes of failure at the application level. Two quite common causes of failure explicitly excluded from the Azure Service Level Agreement are the use of software not provided by Microsoft and what could be called “operator error”: the mistakes mere mortals inevitably make. In effect, the SLA only guarantees “dial tone” for the servers, leaving it up to the customer to ensure uptime for the applications.

Achieving satisfactory HA protection for mission-critical applications is problematic in the Azure cloud, however, owing to the lack of a storage area network (SAN) or other shared storage needed for traditional failover clustering. Microsoft addressed this limitation with Storage Spaces Direct (S2D), a virtual shared storage solution. But S2D support began with Windows Server 2016, and it only supports SQL Server 2016 and later. SQL Server’s more robust Always On Availability Groups feature, introduced in SQL Server 2012, is likewise not an option for the 2008 versions.

Satisfactory DR protection is possible for some applications using Azure Site Recovery (ASR), Microsoft’s DR as a service (DRaaS) offering. While ASR automatically replicates entire VM images from the active instance to a standby instance in another datacenter, it requires manual outage detection and failover. The service is usually able to accommodate Recovery Point Objectives (RPOs) ranging from a few minutes to a few seconds, and Recovery Time Objectives (RTOs) of under one hour.
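
To make the two metrics concrete, here is an illustrative derivation of RPO and RTO from event timestamps (the times are hypothetical, not measured ASR behavior):

```python
from datetime import datetime

# Illustrative derivation of RPO and RTO from event timestamps.
# The times below are hypothetical, not measured ASR behavior.

last_replicated = datetime(2019, 3, 1, 12, 0, 0)   # last data shipped to the standby
failure         = datetime(2019, 3, 1, 12, 0, 45)  # the primary goes down
back_online     = datetime(2019, 3, 1, 12, 40, 0)  # the standby is brought into service

rpo = failure - last_replicated   # data written in this window is lost
rto = back_online - failure       # the application is down for this long

print(f"RPO: {rpo.total_seconds():.0f} seconds of potential data loss")
print(f"RTO: {rto.total_seconds() / 60:.0f} minutes of downtime")
```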

Third-party failover clustering solutions

With SQL Server’s Failover Cluster Instances (FCIs) requiring shared storage, and with no shared storage available in the Azure cloud, a third-party cluster storage solution is needed. Microsoft recognizes this need and includes instructions for configuring one such solution in its documentation: High Availability for a file share using WSFC, ILB and 3rd-party Software SIOS DataKeeper.

Third-party cluster storage solutions include, at a minimum, real-time data replication and seamless integration with Windows Server Failover Clustering. Their design overcomes the lack of shared storage by making locally-attached drives appear as clustered storage resources that can be shared by SQL Server’s FCIs. The block-level data replication occurs synchronously between or among instances in the same Azure region and asynchronously across regions.

The cluster is capable of immediately detecting failures at the application level regardless of the cause and without the exceptions cited in the Azure SLA. As a result, this option is able to ensure not only server dial tone, but also the application’s availability, making it suitable for even the most mission-critical of applications.

Two common configurations

With HA provisions for legacy SQL Server 2008/R2 applications being problematic in the Azure cloud, the only viable option is a third-party storage clustering solution. For DR, by contrast, administrators have a choice of using Azure Site Recovery or the failover cluster for both HA and DR. Here is an overview of both configurations.

Combining failover clustering for HA with ASR for DR affords a cost-effective solution for many SQL Server applications. The shared storage required by FCIs is provided by third-party clustered storage resources in the SANless HA failover cluster, and ASR replicates the cluster’s VM images to another region in a Region Pair to protect against widespread disasters. But like all DRaaS offerings, ASR has some limitations. For example, WAN bandwidth consumption cannot exceed 10 megabytes per second, which might be too low for high-demand applications.
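
A quick back-of-envelope check shows why the cap matters (the churn figure is hypothetical, not measured ASR behavior):

```python
# Back-of-envelope check of whether an application's rate of data change
# fits under a 10 MB/s replication cap. The churn figure is hypothetical.

CAP_MB_PER_S = 10
daily_churn_gb = 500                          # hypothetical: 500 GB changed per day
avg_mb_per_s = daily_churn_gb * 1024 / 86400  # averaged over 24 hours

print(f"average churn: {avg_mb_per_s:.1f} MB/s vs cap of {CAP_MB_PER_S} MB/s")
# ~5.9 MB/s on average leaves little headroom, so bursty or high-demand
# applications could easily exceed the cap during peak periods.
```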

More robust DR protection is possible by using the failover clustering solution in a three-node HA/DR configuration, as shown in the diagram. Two of the nodes provide HA protection with rapid, automatic failover, while the third node, located in a different Azure region in a Region Pair, adds DR protection.

This configuration uses a third-party cluster storage solution to provide both HA and DR protections across Azure Availability Zones and a Region Pair, respectively.

The main advantage of using the failover cluster for both HA and DR is the ability to accommodate even the most demanding RPOs. Another advantage is that administrators have a single, combined HA/DR solution to manage rather than two separate solutions. The main disadvantage is the slight increase in cost for licensing the third node.

With two cost-effective solutions for HA/DR protection in the Azure cloud, your organization will now be able to get three more years of dependable service from those legacy SQL Server 2008/R2 applications.

How to make Amazon Web Services highly available for SQL Server

Mission-critical database applications are often the most complex use case in the public cloud for a variety of reasons. They need to keep running 24×7 under all possible failure scenarios. As a result, they require full redundancy, which involves provisioning standby server instances and continuously replicating the data. Configurations that work well in a private cloud may not be possible in the public cloud. And providing high availability can incur considerably higher costs to license more advanced software.

There are, of course, ways to give SQL Server mission-critical high availability and disaster recovery protections on Amazon Web Services. But it is also possible (and all too common) to choose configurations that result in failover provisions failing when needed.

AWS offers two basic choices for running SQL Server applications: the Relational Database Service (RDS) and the Elastic Compute Cloud (EC2). RDS is a managed service that is often suitable for basic applications. While RDS offers a choice of six different database engines, its support for SQL Server requires the more expensive Enterprise Edition to overcome some inherent limitations, such as an inability to detect failures caused by the application software.

For mission-critical SQL Server applications, the substantially greater capabilities available with EC2 make it the preferred choice when HA and DR are of paramount importance. But EC2 also has a few limitations, especially the lack of shared storage used in traditional HA configurations. And as with RDS, Always On Availability Groups in the Enterprise Edition might be needed to achieve the desired level of protection.

AWS also offers a choice of running SQL Server on either Windows or Linux. Windows Server Failover Clustering is a powerful and proven capability that is integral to Windows. But because WSFC requires shared storage, the data replication needed for HA/DR protection requires the use of separate commercial or custom-developed software to simulate the sharing of storage across server instances.

For Linux, which lacks a feature like WSFC, the need for additional HA/DR provisions is even greater. Using open source software requires integrating multiple capabilities that, at a minimum, must include data replication, server clustering and heartbeat monitoring with failover/failback provisions. But because getting the full HA stack to work well under all possible failure scenarios can be extraordinarily difficult, only very large organizations have the wherewithal needed to even consider taking on the task.
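
As a rough sketch of what the heartbeat layer of such a stack does (a toy model, not Pacemaker, Corosync or any specific product; the host name and promotion step are hypothetical):

```python
import socket
import time

# Toy heartbeat monitor: a standby node checks that the primary's service
# port answers, and initiates failover after consecutive missed heartbeats.
# The host name and promote_standby() are placeholders; a production stack
# must also fence the failed primary to avoid split-brain.

PRIMARY = ("sqlprimary", 1433)  # hypothetical host; 1433 is SQL Server's default port
MISS_LIMIT = 3                  # missed heartbeats tolerated before failover
INTERVAL = 5                    # seconds between heartbeats

def heartbeat(addr, timeout=2.0) -> bool:
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby():
    print("promoting standby to primary")  # a real stack would mount the replicated
                                           # volumes, start SQL Server and move a virtual IP

misses = 0
while misses < MISS_LIMIT:
    misses = 0 if heartbeat(PRIMARY) else misses + 1
    time.sleep(INTERVAL)
promote_standby()
```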

Failover clustering – purpose-built for the cloud

The growing popularity of private, public and hybrid clouds has been accompanied by increased use of failover clustering solutions designed specifically for a cloud environment. These HA solutions are implemented entirely in software that creates, as their designation implies, a cluster of servers and storage with automatic failover to assure high availability at the application level.

Most of these solutions provide a complete HA/DR solution that includes a combination of real-time block-level data replication, continuous application monitoring and configurable failover/failback recovery policies. Some of the more sophisticated solutions also offer advanced capabilities like support for Always On Failover Clustering in the less expensive Standard Edition of SQL Server for both Windows and Linux, WAN optimization to maximize multi-region performance, and manual switchover of primary and secondary server assignments to facilitate planned maintenance, including the ability to perform regular backups without disruption to the application.

Although these purpose-built HA/DR solutions are generally storage-agnostic, enabling them to work with shared storage area networks, shared-nothing SANless failover clustering is usually preferred for its ability to eliminate potential single points of failure. Most SANless failover clusters are also application-agnostic, enabling organizations to have a single, universal HA/DR solution. This same capability also affords protection for the entire SQL Server application, including the database, logons, agent jobs, etc., all in an integrated fashion.

The example EC2 configuration in the diagram shows a typical two-node SANless failover cluster that works with either Windows or Linux. The cluster is configured within a Virtual Private Cloud (VPC), with the two SQL Server nodes in different availability zones. The use of synchronous block-level replication across the two availability zones assures both high availability and high performance. The file share witness, which is needed to achieve quorum, runs on the domain controller in a third availability zone. Keeping each member of the quorum in a different zone ensures that no single zone outage can eliminate more than one vote.

Above: SANless failover clustering supports multi-zone and multi-region EC2 configurations with either multiple standby server instances or a single standby server instance, as shown here.
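
The quorum arithmetic behind that design can be shown in a few lines (a simplified majority-vote model, not the actual WSFC implementation; the node and zone names are placeholders):

```python
# Simplified majority-vote quorum, as used conceptually by WSFC.
# Each voter (two SQL Server nodes plus a file share witness) sits in its
# own availability zone, so a zone outage removes at most one vote.

voters = {"node1": "zone-a", "node2": "zone-b", "witness": "zone-c"}

def has_quorum(online_voters):
    return len(online_voters) > len(voters) / 2  # strict majority required

# Any single-zone failure leaves two of three votes, so the cluster survives:
for lost_zone in ("zone-a", "zone-b", "zone-c"):
    online = {v for v, zone in voters.items() if zone != lost_zone}
    print(f"lose {lost_zone}: quorum = {has_quorum(online)}")  # True in each case
```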

HA and DR configurations involving three or more server instances are also possible with most SANless failover clustering solutions. The server instances can be located entirely within the AWS cloud or in a hybrid cloud. One such three-node configuration is a two-node HA cluster located in an enterprise data center with asynchronous data replication to AWS or another cloud service for DR purposes—or vice versa.

In both two- and three-node clusters, failovers are normally configured to occur automatically, and both failovers and failbacks can be controlled manually (with appropriate authorization, of course). Three-node clusters can also facilitate planned hardware and software maintenance for all three servers while providing continuous high availability for the application and its data.

With 44 availability zones spread across 16 geographical regions, the AWS global infrastructure affords tremendous opportunity to maximize availability by configuring SANless failover clusters with multiple, geographically-dispersed redundancies. Such a global footprint also enables SQL Server applications and data to be deployed near end-users to deliver satisfactory performance.

Azure post-mortems, RTOs and RPOs – and what to do with Hurricane Florence on the horizon

The first official post-mortems are starting to come out of Microsoft regarding the Azure outage that happened last week. While this first post-mortem addresses the Azure DevOps outage specifically (previously known as Visual Studio Team Services, or VSTS), it gives us some additional insight into the breadth and depth of the outage, confirms its cause, and sheds light on the challenges Microsoft faced in getting things back online quickly. It also hints at some features/functionality Microsoft may consider pursuing to handle this situation better in the future.

As I mentioned in my previous article, features such as the new Availability Zones being rolled out in Azure might have minimized the impact of this outage. In the post-mortem, Microsoft confirms what I previously said.

The primary solution we are pursuing to improve handling datacenter failures is Availability Zones, and we are exploring the feasibility of asynchronous replication.

Until Availability Zones are rolled out across more regions, the only disaster recovery options you have are cross-region, hybrid-cloud or even cross-cloud asynchronous replication. Software-based #SANless clustering solutions available today will enable such configurations, providing a very robust RTO and RPO even when replicating over great distances.

When you use SaaS/PaaS solutions, you are really depending on the cloud service provider (CSP) to have an ironclad HA/DR solution in place. In this case, it seems as if a pretty significant deficiency was exposed, and we can only hope it leads all CSPs to take a hard look at their SaaS/PaaS offerings and address any HA/DR gaps that might exist. Until then, it is incumbent upon the consumer to understand the risks and do what they can to mitigate the risks of extended outages, or simply choose not to use PaaS/SaaS until the risks are addressed.

The post-mortem really gets to the root of the issue: what do you value more, RTO or RPO?

I fundamentally do not want to decide for customers whether or not to accept data loss. I’ve had customers tell me they would take data loss to get a large team productive again quickly, and other customers have told me they do not want any data loss and would wait on recovery for however long that took.

It will be impossible for a CSP to make that decision for a customer. I can’t see a CSP ever deciding to lose customer data, unless the original data is just completely lost and unrecoverable. In that case, a near real-time async replica is about as good as you are going to get in terms of RPO in an unexpected failure.

However, was this outage really unexpected and without warning? Modern satellite imagery and improvements in weather forecasting probably gave fair warning that there would be significant weather-related events in the area.

With Hurricane Florence bearing down on the southeastern US as I write this post, I certainly hope that if your data center is in the path of the hurricane, you are taking proactive measures to gracefully move your workloads out of the impacted region. The benefits of proactive disaster recovery versus reactive disaster recovery are numerous, including no data loss, ample time to address unexpected issues, and managing human resources such that employees can worry about taking care of their families rather than spending the night at a keyboard trying to put the pieces back together again.

Again, enacting proactive disaster recovery would be a hard decision for a CSP to make on behalf of all its customers, as planned migrations across regions will incur some amount of downtime. This decision will have to be put in the hands of the customer.

Hurricane Florence Satellite Image taken from the new GOES-16 Satellite, courtesy of Tropical Tidbits

So what can you do to protect your business-critical applications and data? As I discussed in my previous article, cross-region, cross-cloud or hybrid-cloud models with software-based #SANless cluster solutions will go a long way toward addressing your HA/DR concerns, with an excellent RTO and RPO for cloud-based IaaS deployments. Instead of application-specific solutions, software-based, block-level volume replication solutions such as SIOS DataKeeper and SIOS Protection Suite replicate all data, providing a data protection solution for both Linux and Windows platforms.

My oldest son just started his undergrad degree in Meteorology at Rutgers University. Can you imagine a day when artificial intelligence (AI) and machine learning (ML) are used to consume weather data from NOAA to trigger a planned disaster recovery migration two days before the storm strikes? I think I just found a perfect topic for his Master’s thesis. Or better yet, have him and his smart friends at WeatherWatcher LLC get funding for a tech startup that applies AI and ML to weather data to control proactive disaster recovery events.

I think we are just on the cusp of IT analytics solutions that apply advanced machine-learning technology to cut the time and effort needed to ensure delivery of your critical application services. SIOS iQ is one of the solutions leading the way in that field.

Batten down the hatches and get ready: hurricane season is just starting, and we are already in for a wild ride.