Is your cloud wasting money? How to rein things in

Cloud costs are notoriously difficult to contain. With 50+ AWS services, complex usage pricing, and bills that list “EC2 costs” as a single line item, it is often hard to know who spent what, and harder still to tie those numbers to project budgets and ROI.

It is no wonder that cost monitoring tools have proliferated in the last two years. And while these tools help, they are only half the battle.

Recently, our team helped perform an audit of a large media organisation’s three-year-old AWS environment. While they employed cloud engineers on staff, those engineers were busy supporting code releases and putting out fires. This left them little time to update their environment with new AWS features or best-practice configurations.

This is a challenge for every IT department. How does a single engineering team fight fires, support major code pushes, and still find time to do real maintenance on its cloud infrastructure?

This is what happens when a “DevOps” engineering team is tasked with doing everything quickly — but not given the tools to reduce manual maintenance and deployment work. This is what happens when you try to create a DevOps team without automating the infrastructure to support continuous integration and continuous delivery.

It should come as no surprise that our audit on this team’s AWS environment revealed that about 20% of their compute resources were being wasted. Another 15% of the instances could not be linked back to an active project. They had over-engineered VPCs and were still manually launching and updating instances, which meant each instance had different configurations based on which engineer launched the instance.
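An audit of this kind can be approximated with a short script. The sketch below is only an illustration: the `Instance` record, the 5% CPU threshold, the `Project` tag name, and the inventory itself are all assumptions, not details from the audit described above. It flags instances that look idle and instances that cannot be tied to a project.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Minimal stand-in for an inventory record; all fields are hypothetical."""
    instance_id: str
    avg_cpu_percent: float  # e.g. a multi-day average exported from a monitoring tool
    tags: dict = field(default_factory=dict)


def audit(instances, idle_cpu_threshold=5.0, project_tag="Project"):
    """Return (idle, unowned) lists of instance IDs.

    idle:    average CPU below the threshold, a rough proxy for wasted compute
    unowned: no non-empty project tag, so the spend cannot be tied to a budget
    """
    idle = [i.instance_id for i in instances
            if i.avg_cpu_percent < idle_cpu_threshold]
    unowned = [i.instance_id for i in instances
               if not i.tags.get(project_tag)]
    return idle, unowned


# Made-up inventory, for illustration only.
inventory = [
    Instance("i-001", 42.0, {"Project": "video-cms"}),
    Instance("i-002", 1.3, {"Project": "video-cms"}),
    Instance("i-003", 18.5, {}),
    Instance("i-004", 0.4, {"Owner": "unknown"}),
    Instance("i-005", 63.1, {"Project": "ad-server"}),
]

idle, unowned = audit(inventory)
print(f"Idle: {idle} ({len(idle) / len(inventory):.0%} of fleet)")
print(f"No project tag: {unowned} ({len(unowned) / len(inventory):.0%} of fleet)")
```

Low CPU is a crude signal on its own; a real audit would likely also look at memory, network, and scheduling patterns before calling an instance wasted.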

Automation solves this team’s problems in a number of ways. First, and most importantly, it reduces the amount of time that the team spends configuring instances and deploying new code. When you create a custom template for your environment, bootstrap it, and then set up an integration between auto scaling and your automated deployment process, you have an environment that can spin up new instances in minutes and deploy them with the latest version of code with little or no human intervention.
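As a rough sketch of what such a template might contain, the Python below builds a small CloudFormation document with a launch template and an Auto Scaling group. The resource names, instance type, group sizes, and the bootstrap script path are placeholders chosen for illustration; the deployment hook in particular is hypothetical and would be replaced by whatever your deployment tooling provides.

```python
import json


def build_template(bootstrap_script: str) -> dict:
    """Build a minimal CloudFormation template as a plain dict.

    The launch template carries the bootstrap script as user data, so every
    instance the Auto Scaling group starts configures itself the same way.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative sketch: self-bootstrapping Auto Scaling group",
        "Parameters": {
            # Supplied at deploy time, so no AMI or subnet IDs are hard-coded.
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "AppLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "ImageId": {"Ref": "ImageId"},
                        "InstanceType": "t3.small",  # placeholder size
                        "UserData": {"Fn::Base64": bootstrap_script},
                    }
                },
            },
            "AppAutoScalingGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": "2",  # placeholder capacity limits
                    "MaxSize": "6",
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "AppLaunchTemplate"},
                        "Version": {
                            "Fn::GetAtt": ["AppLaunchTemplate", "LatestVersionNumber"]
                        },
                    },
                },
            },
        },
    }


# Hypothetical bootstrap script: the path below is an assumption, standing in
# for whatever pulls and starts the latest release in your environment.
BOOTSTRAP = "#!/bin/bash\n/opt/deploy/fetch-latest-release.sh\n"

template = build_template(BOOTSTRAP)
print(json.dumps(template, indent=2))
```

Because the template is generated from code and takes the image and subnets as parameters, the same definition can be reused across environments instead of being rebuilt by hand in the console.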

Not coincidentally, the system described above is also far less prone to deployment errors or the failure of a server instance. According to a survey of 20,000 engineers by Puppet, deployment automation and integrated testing reduce deployment failures by 60%. Teams that use deployment automation deploy 200 times faster. And among Logicworks clients, downtime in production environments is zero, compared to an industry average of 10.6 downtime incidents per year among clients not on AWS, according to IDC.

Automation also has the potential to save your team money in a more subtle way: by reducing custom configurations and therefore the time it takes to fix something. When you need to make a small change to your environment and your engineers manually boot and configure instances, they have to make this change in the console or in the CLI, then create a new AMI. What happens if this new AMI causes a failure? Your engineers will probably comb through Bash logs. And then they will go through the console. And then they might just try to rebuild the AMI.

If your engineers want to make a small change to your environment and instead modify AWS CloudFormation templates or a configuration management module, it might actually take longer. But the value to your organisation is enormous. You will have a single, living source of documentation for every change you make, versioned and timestamped. This will not only save your engineers a great deal of time in troubleshooting, but over time it will reduce the technical complexity of your environment and encourage modular template design, which further reduces the scope of potential errors.
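To make the troubleshooting benefit concrete, here is a minimal sketch of comparing two versions of a template. The two template fragments and the version labels are invented for illustration; in practice the versions would come from your source control history rather than from literals in a script.

```python
import difflib
import json


def template_diff(old: dict, new: dict) -> str:
    """Return a unified diff between two template versions.

    Keys are sorted before comparing so that only real changes show up,
    not differences in key order.
    """
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile="template@v1",  # hypothetical version labels
            tofile="template@v2",
            lineterm="",
        )
    )


# Two invented versions of the same fragment: only the instance type changed.
v1 = {"Resources": {"App": {"Properties": {"InstanceType": "t3.small", "MinSize": "2"}}}}
v2 = {"Resources": {"App": {"Properties": {"InstanceType": "t3.large", "MinSize": "2"}}}}

print(template_diff(v1, v2))
```

When a deployment fails, a diff like this narrows the search to the lines that actually changed, which is the kind of answer that is hard to get from a hand-built AMI.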

This emphasis on documentation and templatization over manual CLI work will also save you when your cloud engineers leave — as they inevitably will, someday. (The average engineer tenure these days is about three years.)

To truly cut maintenance costs, and not just monitor them, automation is the key. The initial investment an engineering team makes in building and maintaining templates and automation scripts leads to fewer errors, less firefighting, and lower risk, and it makes your engineers happier, too.

The post Is Your Cloud Wasting Money? first appeared on Gathering Clouds.