The Likelihood Theorem

When deciding where and how to spend your IT dollars, one question comes up consistently: how far down the path of redundancy and resiliency should you build your solution, and where does it cross the threshold from a necessity to a nice-to-have because-it’s-cool?  Defining your relative position on this path affects all areas of IT, including technology selection, implementation design, the definition of policies and procedures, and management requirements.  Therefore, I’ve developed the Likelihood (LH) Theorem to help identify where that position is for your specific situation.  The LH is not a financial criterion, nor is it directly an ROI metric.  However, it can be used to help determine the impact of certain decisions in the design process.

Before establishing the components that make up your LH ratio, consider that at the start, with a completely blank slate, we all have the same LH.  True, you could argue that someone establishing a system in Kansas doesn’t have to worry about a tsunami, but they do have to consider tornadoes.  Besides, the preparation for that level of regional, long-term impact would be very similar regardless of the root cause.

The Likelihood Theorem starts with the concept of an Event (E).  Each E has its own unique LH.  So initially:

LH=E

Next, apply any minimum standards that you define for the systems in your environment.  Call this the Foundation Factor (FF).  If you define an FF, you can reduce LH by that factor, eliminating certain events from consideration.  For example, your FF for server hardware may be redundant power supplies, redundant NICs, and RAID.  For network connectivity, it may be redundant paths.  If you use SaaS for business-critical functions, it may be ISP redundancy via multi-homing and link load balancing.  Therefore:

LH=E-FF

Anyone who has been in this industry (or been a consumer of IT) for more than 5 minutes knows that even with a baseline established, things happen.  This is known as the Wild Card Effect (WCE).  One key note here is that all WCEs are, in some form, potentially controllable by the business.  For hardware, this may be the difference between purchasing from Tier 1 and Tier 2 vendors (i.e., lower-quality components or a shorter mean time to failure).  Another WCE may be the available budget for the solution.  There may be multiple WCEs in any scenario, and all WCEs add back to the LH ratio:

WCE1 + WCE2 + WCE3 + … = WCEn

And so:

LH=E-FF+WCEn
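
As a rough illustration, the running formula can be sketched in Python. The function name `base_likelihood` and every number below are assumptions made up for this example; they are relative scores for comparing scenarios, not measured probabilities.

```python
def base_likelihood(event, foundation, wild_cards):
    """Return LH = E - FF + WCEn, where WCEn is the sum of all wild cards.

    All inputs are hypothetical relative scores, not measurements.
    """
    wce_n = sum(wild_cards)  # WCE1 + WCE2 + WCE3 + ... = WCEn
    return event - foundation + wce_n


# Hypothetical scores: the event scores 10, minimum standards remove 6,
# and two wild cards (Tier 2 hardware, reduced budget) add 2 and 1 back.
with_wild_cards = base_likelihood(10, 6, [2, 1])
no_wild_cards = base_likelihood(10, 6, [])
print(with_wild_cards, no_wild_cards)  # prints "7 4"
```

The only point of the comparison is direction: every wild card the business chooses not to address pushes LH back up.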

At this point, we have accounted for the event in question, reduced our risk profile by our minimum standards, and adjusted for wild cards that are beyond our minimum standards but that we could address should we have the authority to make certain decisions.  Now, we need to begin considering the impacts associated with the event in question.  Is the event we are considering singular in nature, or is it potentially repetitive?  LH related to a regional disaster would be singular; if we are considering telecommunication outages, however, then repetitive is more reasonable.  So, we need to take the equation and multiply it by the potential frequency (FQ):

LH=(E-FF+WCEn)*FQ

The last factor in LH is determining the length of time that the event in question could impact the environment.  This may come into play if the system in question is transitory, an interim step to a new solution, or has an expected limited lifecycle.  The length of time that the event is possible can impact our thoughts around how much we invest in preventing it:

LH=((E-FF+WCEn)*FQ)/Time
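
The complete formula might be sketched in Python as follows. The function name `likelihood`, its parameter names, and all of the scores are hypothetical and chosen only to show how FQ and Time move the result; the guard against a non-positive Time is an added assumption, since the formula divides by it.

```python
def likelihood(event, foundation, wild_cards, frequency, time):
    """Return LH = ((E - FF + WCEn) * FQ) / Time.

    Inputs are illustrative relative scores, not measured values.
    """
    if time <= 0:
        raise ValueError("time must be positive")  # avoid dividing by zero
    wce_n = sum(wild_cards)
    return ((event - foundation + wce_n) * frequency) / time


# Hypothetical scores. A regional disaster is treated as singular (FQ = 1),
# telecommunication outages as repetitive (FQ = 4). Both use Time = 2.
regional_disaster = likelihood(10, 6, [2], frequency=1, time=2)
telecom_outage = likelihood(10, 6, [2], frequency=4, time=2)
print(regional_disaster, telecom_outage)  # prints "3.0 12.0"
```

With everything else held equal, the repetitive event scores four times higher than the singular one, which is the adjustment FQ is meant to capture.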

So, in thinking about how to approach your design, consider these factors:  What event are you trying to avoid?  Do your minimum specifications eliminate the possibility of the event occurring (E = FF)?  What if you had to reduce your specifications to meet a lower budget (WCE1) or use a solution with an inherently higher failure rate or lackluster support (WCE2 and WCE3)?  If the Event is not fully covered by your minimum standards, can you reduce those wild cards (lower total WCEn)?  Will the event be a one-time thing, or could it happen repeatedly over the lifecycle of the solution?
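
One way to work through these questions is to score a few design options side by side and rank them. The sketch below is only an illustration: the `Scenario` class, the option names, and the ordinal ratings are all invented for this example, and only the ordering of the results is meant to carry any meaning.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One design option, rated with hypothetical ordinal scores."""
    name: str
    event: int
    foundation: int
    wild_cards: list = field(default_factory=list)
    frequency: int = 1
    time: int = 1

    def likelihood(self):
        # LH = ((E - FF + WCEn) * FQ) / Time
        wce_n = sum(self.wild_cards)
        return ((self.event - self.foundation + wce_n) * self.frequency) / self.time


options = [
    # Minimum standards fully cover the event (E = FF), no wild cards.
    Scenario("full spec", event=3, foundation=3),
    # Specifications reduced to meet a lower budget (WCE1).
    Scenario("reduced budget", event=3, foundation=2, wild_cards=[1]),
    # Lower budget plus higher failure rate and weak support (WCE1..WCE3),
    # with the event expected to repeat over the lifecycle.
    Scenario("reduced budget, weak vendor", event=3, foundation=2,
             wild_cards=[1, 1, 1], frequency=2),
]

# Rank options from lowest to highest relative likelihood.
ranked = sorted(options, key=Scenario.likelihood)
for option in ranked:
    print(option.name, option.likelihood())
```

Ranking rather than reading the raw scores keeps the exercise qualitative: the output shows which option leaves the event most exposed, not how probable the event is.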

I’m not suggesting that you can assign specific numerical values to these factors, but in terms of raising or lowering the likelihood of an event happening, these criteria are key indicators.  Using this formula is a way to ensure that, working within the known constraints placed on us by the business, we have maximized our ability to avoid specific events and reduced the likelihood of those we can realistically address.