Questions Around Uptime Guarantees

Some manufacturers recently have made an impact with a “five nines” uptime guarantee, so I thought I’d provide some perspective. Most recently, I’ve come in contact with Hitachi’s guarantee. I quickly checked with a few other manufacturers (e.g. Dell EqualLogic) to see if they offer that guarantee for their storage arrays, and many do…but realistically, no one can guarantee uptime because “uptime” really needs to be measured from the host or application perspective. Read below for additional factors that impact storage uptime.

Five Nines is 5.26 minutes of downtime per year, or 25.9 seconds a month.

Four Nines is 52.6 minutes/year, which is one hour of maintenance, roughly.

Array controller failover in EQL and other dual controller, modular arrays (EMC, HDS, etc.) is automated to eliminate downtime. That is really just the beginning of the story. The discussion with my clients often comes down to a clarification of what uptime means – and besides uninterrupted connectivity to storage, data loss (due to corruption, user error, drive failure, etc.) is often closely linked in people’s minds, but is really a completely separate issue.

What are the teeth in the uptime guarantee? If the array does go down, does the manufacturer pay the customer money to make up for downtime and lost data?

{Register for our upcoming webinar on June 12th ”What’s Missing in Hybrid Cloud Management- Leveraging Cloud Brokerage“ featuring guest speakers from Forrester and Gravitant}

There are other array considerations that impact “uptime” besides upgrade or failover.

  • Multiple drive failures, since most are purchased in batches, are a real possibility. How does the guarantee cover this?
  • Very large drives must be in a suitable RAID configuration to improve the chances that a RAID rebuild will be completed before another URE (unrecoverable read error) occurs. How does the guarantee cover this?
  • Dual controller failures do happen to all the array makers, although I don’t recall this happening with EQL. Even a VMAX went down in Virginia once, in the last couple of years. How does the guarantee cover this?

 

The uptime “promise” doesn’t include all the connected components. Nearly every environment has something with a single path or SPOF or other configuration issue that must be addressed to insure uninterrupted storage connectivity.

  • Are applications, hosts, network and storage all capable of automated failover at sub-10 ms speeds? For a heavily loaded Oracle database server to continue working in a dual array controller “failure” (which is what an upgrade resembles), it must be connected via multiple paths to an array, using all available paths.
  • Some operating systems don’t support an automatic retry of paths (Windows), nor do all applications resume processing automatically without IO errors, outright failures or reboots.
  • You often need to make temporary changes in OS & iSCSI initiator configurations to support an upgrade – e.g. change timeout value.
  • Also, the MPIO software makes a difference. Dell EQL MEM helps a great deal in a VMware cluster to insure proper path failover, as do EMC PowerPath and Hitachi Dynamic Link Manager. Dell offers a MS MPIO extension and DSM plugin to help Windows recover from a path loss in a more resilient fashion
  • Network considerations are paramount, too.
    • Network switches often take 30 seconds to a few minutes to reboot after a power cycle or reboot.
    • Also in the network, if non-stacked switches are used, RSTP must be enabled. If not, and anything else isn’t configured correctly, connectivity to storage will be lost.
    • Flow Control must be enabled, among other considerations (disable unicast storm control, for example), to insure that the network is resilient enough.
    • Link aggregation, if not using stacked switches, must be dynamic or the iSCSI network might not support failover redundancy

 

Nearly every array manufacturer will say that upgrades are non-disruptive, but that is at the most simplistic level. Upgrades to a unified storage array, for example, will involve disruption to file system presentation, almost always. Clustered or multi-engine frame arrays (HP 3PAR, EMC VMAX, NetApp, Hitachi VSP) can offer the best hope of achieving 5 nines, or even greater. We have customers with VMAX and Symmetrix that have had 100% uptime for a few years, but the arrays are multi-million dollar investments. Dual controller modular arrays, like EMC and HDS, can’t really offer that level of redundancy, and that includes EQL.

If the environment is very carefully and correctly set up for automated failover, as noted above, then those 5 nines can be achieved, but not really guaranteed.