The cloud complexity gap: Making software more intelligent to address complex infrastructure

Over the past 15 years, we have seen a unique trend emerge in managing infrastructure: increasing complexity. Oddly enough, it began with the mainstream adoption of virtualisation and has rapidly accelerated since the introduction of cloud computing. To make matters worse, the software meant to manage that complexity, which analysts such as Gartner call IT Operations Management (ITOM), has been unable to keep pace with its growth.

When an organisation struggles with cloud computing, whether because of stability, cost, performance, security or any of the other reasons that account for failed cloud initiatives, the root cause is often an inability to manage the complexity of its newfound infrastructure. The chart below is an attempt to visualise the complexity challenge in its historical context. The red line plots the growth of infrastructure complexity since 2000; the blue line plots the ability of software to manage this complexity. The gap between the red and blue lines is the Complexity Gap, where chaos can reign and cloud initiatives fail.

What are the driving reasons behind this growing complexity gap?

Dynamic infrastructure

A critical source of the increased complexity comes from the primary benefits of cloud computing: on-demand infrastructure and pricing. The ease with which we can provision and deprovision infrastructure has fundamentally changed the way we develop applications.

While early virtualisation gave us a faster way to provision and deprovision virtual machines, the infrastructure often had lifecycles not so dissimilar from those of its physical ancestors. But when on-demand infrastructure was coupled to consumption-based pricing in the public cloud, it socially engineered new behaviour for the design and operation of cloud infrastructure. The long-lived virtual machines of the early cloud were replaced with autoscaling, service-oriented architectures, auction-based compute, and innovative new platform services. These new architectures have given us the ability to compose more fault-tolerant, cost-effective, high-performance and feature-rich solutions than we could in the past. But they brought with them a downside: complexity.

The pace of innovation

It took four years for the industry to standardise on a de facto functional specification for Infrastructure as a Service (IaaS). Just as enterprises were getting their hands around managing an IaaS cloud, vendors such as Amazon unleashed a torrent of new infrastructure and platform innovations. The new services provide innovations in all aspects of infrastructure: compute, storage, databases, deployment, networking, mobile, analytics, and application development. They also include mind-bending new services (e.g. AWS Lambda) whose existence has the potential to create new types of applications.

When disruptive innovations occur, it is common for users to want to use them in a similar way to the technology they are supplanting. The early digital cameras, for all their innovations, were used much like film-based cameras. But as a disruptive technology matures, its applications tend to diverge sharply from those of its predecessor (e.g. cameras in mobile phones, on headsets, used as an interface to the physical world). As cloud computing matures as a disruptive technology, it is revealing to us new ways to develop, deploy, and operate applications that were never before possible. But with this incredible innovation comes one obvious consequence: complexity.

Lack of integrated management

The explosive growth of data centres in the 1990s brought with it an increase in the complexity of infrastructure. This complexity gap fostered enormous innovation in the software industry that eventually resulted in the $20B+ IT Operations Management (ITOM) market we have today.

For over a decade, this market was dominated by five providers (IBM, HP, CA, BMC, and Microsoft) and their broad management suites. For years, these companies, along with a large assortment of SMB players, managed to contain much of the complexity of our rapidly growing infrastructure. Unfortunately, these products were designed for a different generation of infrastructure, and can no longer contain the complexity.

This has given rise to a new generation of cloud management solutions (e.g. Chef, New Relic, Ansible, Docker, Stackdriver), each focused on managing the complexity of a single vertical slice of the overall ITOM stack. While the vertical focus of these products allows customers to assemble best-of-breed suites for their needs, the resulting solutions require the use of multiple products and consoles to manage the infrastructure. Using multiple disconnected products can often feel like looking through “keyholes” to manage your infrastructure, with each product providing only partial insight into the whole. To compensate for this lack of integration, many companies are building their own, using custom software and spreadsheets.

New distribution of ownership

Gone are the days in which IT had full control over the provisioning, deprovisioning and operations of infrastructure in support of lines of business. This centralised control started to erode in the mid-2000s and has accelerated over the last decade, with the cloud adding fuel to the fire. It is increasingly common for lines of business to “go rogue” to achieve their business goals, leveraging external cloud services and even managing their own infrastructure. Their experiences have shown them the pace of innovation and agility that can come from outside of IT, and now there is no going back.

This change in ownership has made it more complex for IT to provide the governance, compliance and risk management required to protect the business. IT needs to find new ways to exert soft controls that protect the business without inhibiting the agility its internal customers now expect from the cloud. Unfortunately, all the ITOM tools available today are built for a centralised model only.

Specialised knowledge

The cloud has created a technology rift that requires the adoption of new technologies and approaches to managing infrastructure. Traditional operations engineers, for example, are being challenged with concepts such as DevOps and infrastructure as code that require them to acquire new skills and adapt to different mindsets; software engineers are being challenged with new IaaS/PaaS services that fundamentally change the approach to software architecture.

Unfortunately, only a portion of our existing talent pool has proven able and willing to make this shift, resulting in a talent crunch for the remaining resources. Even for those willing to make the transition, becoming an expert in the emerging technologies takes time and hands-on experience, which can be hard to find in many environments. Managing talent acquisition, retention and training is essential to a successful cloud strategy, but it is also more complex and resource-intensive than it was in pre-cloud days.

Managing TCO / ROI

Managing Total Cost of Ownership (TCO) and Return on Investment (ROI) pre-cloud was complex. Managing them in the cloud is turning out to be incredibly complex. With dozens of different services in use, per-minute billing by some providers, and a constant flow of changes occurring within your infrastructure, quantifying TCO and ROI requires smart software instead of analysts with spreadsheets.

With a few commands, in a matter of minutes, a DevOps engineer can fundamentally alter the TCO of a project or application, or shift the profitability of a business initiative. Managing TCO and ROI requires both smart software and constant vigilance, as “cloud drift” puts your business at constant risk.

Conclusion

We all know the incredible benefits of cloud computing: agility, flexibility, elasticity, consumption-based pricing, cost, quality of service, and resilience. These benefits have been sufficiently powerful that cloud computing is in the early phases of reshaping the landscape of computing, forever changing how we engage with infrastructure. But these benefits have come at a cost: complexity. The success of your cloud strategy will be directly affected by your willingness and ability to confront and manage this complexity.