Incident response on the AWS cloud and the case for outsourcing

(c)iStock.com/nzphotonz

By Jason Deck, VP strategic development, Logicworks

It is 4pm on a Friday before a holiday, right before your team leaves for a long weekend. An engineer on your team suddenly cannot connect to certain instances in your AWS environment. The error is affecting the largest projects and biggest customers — across hundreds of instances — including DR.

What happens now?

The answer to this question depends on your support model. In most companies, an incident like this means that no one is going home for the holiday weekend; they will spend 15+ hours diagnosing the problem, then 200+ hours fixing it manually, instance by instance. If they do not get it fixed in time, they will lose data and have to tell their customers — a potentially damaging situation.  

This is a true story, but what actually happened was very different. The company instead called their managed service provider (MSP) – us – who diagnosed the problem and fixed it over the holiday weekend.

Every system has weak points. Every system can fail. It is how you deal with catastrophe — and who you trust to help you during failure — that makes the difference. Every enterprise team needs an insurance policy against mistakes, large and small.

It turns out that one of the company’s internal engineers had caused the problem, inadvertently changing permissions for their entire environment. Logicworks was able to diagnose the problem in less than an hour, determine blast radius, get our smartest engineers in a room to develop a complex remediation strategy, and implement that fix before business resumed after the holiday. This involved writing custom scripts (in Python, BASH, and Puppet) to investigate the scope of the failure and another more complex script to partially automate the fix, so that each instance could be repaired in 3-5 minutes, rather than 2-3 hours. Ultimately it took 170+ hours of engineering effort, but the company readily admitted that it would have taken them two weeks to fix on their own.

Managed infrastructure service providers were born in an age when implementing a fix meant going to a data centre, swapping out hardware, and doing manual configurations. The value of an MSP to enterprises was not having to manage hardware and systems staff.

In the cloud, MSPs must do more. They must be programmers; instead of replacing hardware, they need to write custom scripts to repair virtual cloud instances. MSPs need to think and act like a software company: infrastructure problems are bugs, the solution is code, and speed is paramount.

Not all MSPs operate this way. Many MSPs would have looked at this company’s issue and applied the traditional incident response model: just reboot everything manually, one at a time. (Many also would have said “you caused the problem, you fix it.”) This is the traditional MSP line of thinking, and it would have meant that the company would have lost three to five days of data and customer trust.

MSPs need to think and act like a software company: infrastructure problems are bugs, the solution is code, and speed is paramount.

Running on cloud infrastructure comes with unique risks. It is often easier for your engineers to make a career-limiting mistake when a single wrong click of a button in an automated script can change permissions across an entire system. These new challenges require new answers and a new line of defence.

Importantly, this means that MSPs no longer replace internal IT teams; they provide additional expertise that the enterprise currently lacks (or is in the process of building) in fields like cloud security and automation. They provide an additional layer of defense. In the example above, the internal and MSP team collaborated to fix the problem, since there is shared control of the infrastructure.

In the cloud, the conversation no longer has to be insourcing vs. outsourcing. In fact, you will get the most out of an MSP if are also setting up internal DevOps teams or implementing software development best practices. As an MSP, companies with an existing or growing DevOps team are the most exciting to work beside. As an example, an MSP cannot automate your entire deployment pipeline alone; most only operate below the application level and can only automate instance spin-up and testing. But if the two teams are working together, they can balance application-specific needs with advanced scaling and network options and create a very mature pipeline very quickly.

An MSP can accelerate your DevOps team building strategies, not substitute them.

In other words, an MSP can accelerate your DevOps team building strategies, not substitute them. This is an incredibly powerful model that we have watched transform and mature entire cloud projects in a matter of months. Plus, they can subtract all the crucial compliance work your DevOps team dreads, like setting up backups and logging — and even improve the quality of that compliance work by creating automated tests to ensure logs and backups are always kept.

It is true that internal IT teams sacrifice some control by using an MSP. The key is that you are sacrificing control to a group of people who are held responsible for making your environment secure, available, etc. You control how and when the MSP touches your environment.

Cloud projects are complex, and cloud problems can be equally so. Just make sure that when they happen, you have the right team on the bench.

The post Incident Response on AWS Cloud: The Case for Outsourcing appeared first on Gathering Clouds.