For cloud providers and their many customers, a robust and continuously available power supply is among the most important reasons for placing IT equipment in a data centre. It is puzzling, therefore, that so many data centres repeatedly fail to measure up to such a mission-critical requirement.
Only last month, for example, cloud service providers and communications companies were hit by yet another protracted power outage affecting a data centre in London. It took time for National Grid engineers to restore power, and in the meantime many thousands of end users were affected.
Let’s face it – from time to time there will be Grid interruptions. But they shouldn’t be allowed to escalate into noticeable service interruptions for customers. Inevitably, such incidents create shockwaves among users and cloud service providers, their shareholders, suppliers, and anyone else touched by the inconvenience.
The buck stops here
While it’s clear that something or someone (or both) is at fault, the buck eventually has to stop at the door of the data centre provider.
Outages are generally caused by a loss of power in the distribution network. This can be triggered by a range of factors, from construction workers accidentally cutting through cables (very common in metro areas) to power equipment failure, adverse weather conditions and, not least, human error.
Mitigating some of these risks should be ‘easy’: don’t locate a data centre on or near a flood plain, and ideally choose a site where power delivery from the utilities will not be impaired. This is a critical point. Cloud providers and their customers need to fully appreciate how power is routed to their chosen data centre through the electricity distribution network; in some cases the route is pretty tortuous.
Finding the ideal data centre location that ticks all the right boxes is often easier said than done, especially in the traditional data centre heartlands. Certainly, having an N+1 redundancy infrastructure in place is critical to mitigating outages due to equipment failure.
Simply put, N+1 means more equipment is deployed than is needed, allowing for a single component failure. The ‘N’ stands for the number of components necessary to run the system, and the ‘+1’ means there is additional capacity should a single component fail. A handful of facilities go further. NGD, for example, has more than double the equipment needed to supply contracted power to customers, split into two power trains on either side of the building, each of which is N+1. Both are completely separated, with no common points of failure.
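To make the arithmetic concrete, here is a minimal sketch (in Python) of how redundancy changes the odds of meeting full load. The module count and per-module failure probability are assumed illustrative figures, not NGD’s actual numbers.

```python
# Illustrative sketch: probability that enough power modules survive to carry
# the full load, assuming independent module failures. All figures are
# assumed examples, not data from NGD or any specific facility.
from math import comb

def prob_capacity_met(n_required: int, n_installed: int, p_fail: float) -> float:
    """Probability that at least n_required of n_installed modules are available."""
    return sum(
        comb(n_installed, k) * (1 - p_fail) ** k * p_fail ** (n_installed - k)
        for k in range(n_required, n_installed + 1)
    )

N = 4          # modules needed to carry the contracted load (assumed)
P_FAIL = 0.02  # chance any one module is unavailable at a given moment (assumed)

n_only = prob_capacity_met(N, N, P_FAIL)          # no redundancy
n_plus_1 = prob_capacity_met(N, N + 1, P_FAIL)    # N+1: one spare module
two_trains = 1 - (1 - n_plus_1) ** 2              # two independent N+1 trains

print(f"N alone : {n_only:.4%}")
print(f"N+1     : {n_plus_1:.4%}")
print(f"2(N+1)  : {two_trains:.4%}")
```

On these assumed numbers, N+1 lifts the chance of meeting load from roughly 92 percent to over 99.6 percent, and two fully separated N+1 trains push it beyond 99.99 percent, which is why the separation with no common points of failure matters.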
But even with all these precautions a facility still isn’t necessarily 100 percent ‘outage proof’. All data centre equipment has an inherent possibility of failure, and while N+1 massively reduces the risks, one cannot be complacent. After all, studies show that a proportion of failures are caused by human mismanagement of functioning equipment. This puts a huge emphasis on engineers being well trained and, critically, having the confidence and experience to know when to intervene and when to let the automated systems do their job. They must also be skilled in performing concurrent maintenance and minimising the time during which systems are running with limited resilience.
Rigorous testing
Prevention is always better than cure. Far greater emphasis should be placed on engineers reacting quickly when a component failure occurs rather than assuming that inbuilt resilience will solve all problems. This demands high quality training for engineering staff, predictive diagnostics, watertight support contracts and sufficient on-site spares.
However, to be totally confident in data centre critical infrastructure, come hell or high water, it should be rigorously tested. Not all data centres do this regularly. Some have procedures to test their installations but rely on simulating total loss of incoming power. This isn’t completely foolproof, because the generators remain on standby and the equipment in front of the UPS systems stays on, meaning the cooling system and the lighting continue to function during testing.
Absolute proof comes with ‘Black Testing’. It’s not for the faint-hearted and many data centres simply don’t do it. Every six months NGD isolates incoming mains grid power and, for up to sixteen seconds, the UPS takes the full load while the emergency backup generators kick in. Clearly, we only cut the power to one side of a 2N+2 infrastructure, and it’s done under strictly controlled conditions.
When it comes to data centre critical power infrastructure regular full-scale black testing is the only way to be sure the systems will function correctly in the event of a real problem. Hoping for the best in the event of real-life loss of mains power simply isn’t an option.
Uptime check list
- Ensure N+1 redundancy at a minimum, but ideally 2N+x redundancy of critical systems to support separation, testing and concurrent maintenance
- Improving MTTF and shortening repair times will deliver significant returns on backup system availability and reliability, and overall facility uptime (see the availability sketch after this list)
- Utilise predictive diagnostics, ensure fit for purpose support contracts, and hold appropriate spares stock on-site
- Regularly Black Test UPS and generator backup systems
- Drive a culture of continuous training and practise regularly so that staff are clear on spotting incipient problems and responding to problems in real time: what to do, and when (and when not) to intervene
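As a rough way of quantifying the MTTF point above, steady-state availability is commonly approximated as MTTF / (MTTF + MTTR). The sketch below uses assumed example figures to show how much faster fault rectification (trained staff, predictive diagnostics, on-site spares) reduces expected downtime.

```python
# Rough illustration of how MTTF and repair time drive availability.
# Availability ~= MTTF / (MTTF + MTTR); the hours below are assumed examples.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

# Same equipment, but faster diagnosis and on-site spares cut the repair time:
scenarios = {"24h repair": 24.0, "4h repair": 4.0}

for label, mttr in scenarios.items():
    a = availability(mttf_hours=50_000, mttr_hours=mttr)
    downtime_per_year = (1 - a) * 8_760  # expected hours of downtime per year
    print(f"{label}: availability {a:.4%}, ~{downtime_per_year:.1f} h downtime/yr")
```

On these assumptions, cutting repair time from 24 hours to 4 hours takes expected downtime from roughly four hours a year to well under one.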