
#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards

I've noticed that many engineers use monitoring data for only two purposes: alerts and dashboards. There are a few more things engineers can do with that data, and we'll explore them in this post.

The three use cases we'll cover are:

  1. Analyzing long-term trends

  2. Comparing over time or experiment groups

  3. Conducting ad hoc retrospective analysis

Analyzing long-term trends

You can ask yourself a couple of simple questions as a starting point:

  • How big is my database?

  • How fast is the database growing?

  • How quickly is my user count growing?

As you get comfortable answering the simpler questions, you can start analyzing trends for less straightforward ones like:

  • How is the database performance evolving? Are there signs of degradation?

  • Is there consistent growth in data volume that may require future infrastructure adjustments?

  • How is overall resource utilization trending over time across different services?

  • How is the cost of cloud resources evolving, and what does that mean for budget forecasting?

  • Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?

Sebastian mentioned that this is a part of observability work he enjoys. I can understand why. It’s exciting to see how components change over time and to work out solutions before you end up in an incident response nightmare.

Analyzing trends effectively requires the right data retention settings. If you're throwing out your logs, traces, and metrics too early, you won't have enough historical data to do this kind of work.

Doing this right means keeping enough data on hand to analyze trends over your desired period, whatever that may be.
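
To make this concrete, here is a minimal sketch of a trend analysis in Python. It assumes you have already exported a handful of (date, database size) samples from your monitoring backend; every number below is made up for illustration.

```python
# Minimal sketch: estimate database growth from periodic size samples
# exported from a monitoring backend. All figures are illustrative.
from datetime import date

# (day, size in GB) samples -- stand-ins for whatever your metrics store returns
samples = [
    (date(2024, 1, 1), 410.0),
    (date(2024, 2, 1), 438.5),
    (date(2024, 3, 1), 467.0),
    (date(2024, 4, 1), 499.2),
]

# Least-squares slope: GB added per day
xs = [(d - samples[0][0]).days for d, _ in samples]
ys = [size for _, size in samples]
n = len(samples)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)

capacity_gb = 1024  # assumed disk capacity for this example
days_left = (capacity_gb - ys[-1]) / slope
print(f"Growing ~{slope:.2f} GB/day; ~{days_left:.0f} days until {capacity_gb} GB")
```

In practice you would pull these samples from your metrics backend's query API and look at far more than four points, but the shape of the analysis stays the same.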

Comparing over time or experiment groups

Google’s definition

Here, you compare and contrast data across different groups. A few examples from Google’s Site Reliability Engineering (2016) book:

  • Are your queries faster in this version of this database or this version of that database?

  • How much better is my memcache hit rate with an extra node?

  • Is my site slower than it was last week?

In other words, you're comparing across different buckets of time or different products.
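
As a quick illustration of comparing time buckets, here is a minimal Python sketch of a week-over-week latency check. The sample values are hypothetical stand-ins for whatever your metrics backend returns for the two periods.

```python
# Minimal sketch: compare request latency between last week and this week.
# The sample lists are hypothetical placeholders.
from statistics import quantiles

last_week_ms = [112, 98, 130, 105, 121, 95, 140, 118, 102, 110]
this_week_ms = [125, 133, 141, 119, 152, 128, 137, 146, 122, 131]

def p95(samples):
    # 95th percentile: n=100 gives 99 cut points, index 94 is the 95th;
    # 'inclusive' keeps the result within the sample range
    return quantiles(samples, n=100, method="inclusive")[94]

delta = p95(this_week_ms) - p95(last_week_ms)
print(f"p95 last week: {p95(last_week_ms):.1f} ms")
print(f"p95 this week: {p95(this_week_ms):.1f} ms ({delta:+.1f} ms)")
```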

A proper use case for comparing groups

Sebastian worked through this use case recently when he had to compare two technologies for deploying code: AWS Lambda vs. AWS Fargate on ECS.

He took those two services and experimented with different memory sizes and virtual CPU counts. Then he ran varying volumes of requests against each configuration to figure out which service was the more cost-effective option.

His need for this went beyond engineering work: it was about enabling product teams with the right decision-making data. He wrote a knowledge base article to guide them toward a more educated decision on the right AWS service.

Having the data to compare the two services allowed him to answer questions like:

  • When should you be using either of these technologies?

  • What use cases would either technology be more suitable for?

This decision support rests mainly on monitoring and observability data. Using that data to compare tools and technologies, and to guide product teams, is something I think reliability folks can gain a lot of value from doing.
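
Here is a rough sketch of what that comparison can boil down to once the measurements are in: pick the cheapest configuration that still meets a latency target. The service names mirror the Lambda vs. Fargate example, but every number below is a made-up placeholder, not real pricing or benchmark data.

```python
# Minimal sketch: choose the cheapest configuration that meets a latency target.
# All figures are placeholders; real ones would come from your own load tests
# and billing data for each memory/CPU configuration.
from dataclasses import dataclass

@dataclass
class ConfigResult:
    service: str         # e.g. "lambda" or "fargate"
    memory_mb: int       # memory allocated to the configuration
    p95_ms: float        # measured p95 latency under the test load
    cost_per_1m: float   # measured cost (USD) per one million requests

results = [
    ConfigResult("lambda", 512, 180.0, 4.10),
    ConfigResult("lambda", 1024, 95.0, 6.30),
    ConfigResult("fargate", 1024, 110.0, 5.20),
    ConfigResult("fargate", 2048, 70.0, 9.80),
]

LATENCY_TARGET_MS = 120.0

eligible = [r for r in results if r.p95_ms <= LATENCY_TARGET_MS]
best = min(eligible, key=lambda r: r.cost_per_1m)
print(f"Cheapest option meeting {LATENCY_TARGET_MS:.0f} ms p95: "
      f"{best.service} @ {best.memory_mb} MB, ${best.cost_per_1m:.2f} per 1M requests")
```

A writeup built on numbers like these is what turns the comparison into the kind of knowledge base article product teams can actually act on.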

Conducting ad hoc retrospective analysis (debugging)

Debugging is a bread-and-butter responsibility for software engineers of any level.

It’s something everybody should know more about than most other tasks, because there are very effective and very ineffective ways of going about it.

Monitoring data can help push your debugging process toward the effective side.

There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you’ll get another fragment. And so on for all the different systems.

And then you have to correlate these pieces of information in your head and, hopefully, get some clarity out of the fragments to form some kind of insight.

Monitoring data that is brought together into one data stream can help correlate and combine all these pieces of information. With it, you can:

  1. Pinpoint slow-running queries or functions by analyzing execution times and resource usage, helping you identify inefficiencies in your code

  2. Correlate application logs with infrastructure metrics to determine whether a performance issue is due to code errors or underlying infrastructure problems (see the sketch after this list)

  3. Track memory leaks or CPU spikes by monitoring resource usage trends, which can help you identify faulty code or services

  4. Set up detailed error tracking that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster

  5. Monitor system load alongside application performance to see if scaling issues are related to traffic spikes or inefficient code paths
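
Here is a minimal sketch of point 2 above: lining up application error logs with an infrastructure metric on a shared time axis. The log entries, CPU samples, and threshold are all hypothetical.

```python
# Minimal sketch: correlate error logs with CPU samples from the same service.
# All records and thresholds below are hypothetical.
from datetime import datetime, timedelta

error_logs = [
    {"ts": datetime(2024, 5, 1, 14, 3, 12), "msg": "TimeoutError in checkout"},
    {"ts": datetime(2024, 5, 1, 14, 3, 45), "msg": "TimeoutError in checkout"},
]

cpu_samples = [  # (timestamp, CPU utilization %) for the same host/service
    (datetime(2024, 5, 1, 14, 2, 0), 41.0),
    (datetime(2024, 5, 1, 14, 3, 0), 97.5),
    (datetime(2024, 5, 1, 14, 4, 0), 44.0),
]

WINDOW = timedelta(minutes=1)  # how close events must be to count as related

for log in error_logs:
    nearby = [cpu for ts, cpu in cpu_samples if abs(ts - log["ts"]) <= WINDOW]
    if nearby and max(nearby) > 90:
        print(f'{log["ts"]} {log["msg"]} -> CPU peaked at {max(nearby)}%: '
              "likely an infrastructure problem, not a code bug")
    else:
        print(f'{log["ts"]} {log["msg"]} -> no CPU saturation nearby: '
              "look at the code path first")
```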

Being able to do all this makes the insight part easier for you. And so your debugging approach becomes very different. It becomes much more effective. It becomes much less time-consuming. It potentially makes the debugging task fun.

You get to the root cause of whatever isn't working much faster. Your monitoring and observability setup can make debugging reasonably pleasant, or it can make it downright miserable.

If it's done well, it's just one of those things you don't even have to think about. It's just part of your job. You do it. It's very effective and you move on.

Wrapping up

So we've covered three more use cases for monitoring data, other than the usual alerts and dashboards.

They are once again:

  1. analyzing long-term trends

  2. comparing over time or experiment groups and

  3. conducting ad hoc retrospective analysis, aka debugging

Next time your boss asks you what all these systems do, you'll have three more reasons to focus on your monitoring and use it more effectively.

Until next time, happy monitoring.
