You don’t have to be a pre-cog to find and deal with infrastructure and application problems; you just need good monitoring. We had quite a day Monday during the EC2 EBS availability incident. Thanks to some early alerts—which started coming in about 2.5 hours before AWS started reporting problems—our ops team was able to intervene and make sure that our customers’ data was safe and sound. I’ll start with screenshots of what we saw and experienced, then get into what metrics to watch and alert on in your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup

Around 10am, we started to notice that writes weren’t moving through our pipeline as smoothly as before. Sure enough, pretty soon we started seeing alerts about elevated DB load and disk latency. Here’s what it looked like:
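(Before the screenshots, a quick aside on what a disk-latency check like the one that paged us boils down to. The sketch below is not TraceView’s alerting pipeline; it’s a minimal Python example that computes the same average I/O latency figure iostat reports as "await" from /proc/diskstats, and complains when it crosses a threshold. The device name xvdf, the 50 ms threshold, and the 60-second interval are all illustrative assumptions, not values from our setup.)

```python
#!/usr/bin/env python3
"""Rough disk-latency watchdog sketch. Device, threshold, and interval
are illustrative; this is not TraceView's actual alerting code."""
import time

DEVICE = "xvdf"        # hypothetical EBS volume device name
THRESHOLD_MS = 50.0    # illustrative latency threshold
INTERVAL_S = 60        # sampling interval in seconds


def read_diskstats(device):
    """Return (total I/Os completed, total ms spent on I/O) for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, writes = int(fields[3]), int(fields[7])
                read_ms, write_ms = int(fields[6]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise ValueError(f"device {device!r} not found in /proc/diskstats")


def main():
    prev_ops, prev_ms = read_diskstats(DEVICE)
    while True:
        time.sleep(INTERVAL_S)
        ops, ms = read_diskstats(DEVICE)
        d_ops, d_ms = ops - prev_ops, ms - prev_ms
        prev_ops, prev_ms = ops, ms
        if d_ops == 0:
            continue  # no I/O this interval, nothing to compute
        avg_latency_ms = d_ms / d_ops  # same figure iostat reports as "await"
        if avg_latency_ms > THRESHOLD_MS:
            # In practice this would page on-call or hit an alerting API.
            print(f"ALERT: {DEVICE} avg I/O latency {avg_latency_ms:.1f} ms "
                  f"over last {INTERVAL_S}s (threshold {THRESHOLD_MS} ms)")


if __name__ == "__main__":
    main()
```

In production you would wire that print into whatever pages your on-call, and you would probably track read and write latency as separate series, but the underlying arithmetic is the same as what you see in the graphs below.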