We’ll explore 3 use cases for monitoring data. They are:
Analyzing long-term trends
Comparing over time or experiment groups
Conducting ad hoc retrospective analysis
Analyzing long-term trends
You can ask yourself a few simple questions as a starting point:
How big is my database?
How fast is the database growing?
How quickly is my user count growing?
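A question like "how fast is the database growing?" can be answered with a straight-line fit over periodic size samples. Here is a minimal sketch, assuming you already export (date, size) pairs from your monitoring system. The function names `daily_growth` and `days_until` are made up for illustration, and a linear trend is an assumption that won't suit every workload.

```python
from datetime import date


def daily_growth(samples):
    """Least-squares slope of size over time, in size units per day.

    samples: list of (date, size) pairs covering at least two distinct days.
    """
    first_day = samples[0][0]
    days = [(d - first_day).days for d, _ in samples]
    sizes = [s for _, s in samples]
    n = len(samples)
    mean_x = sum(days) / n
    mean_y = sum(sizes) / n
    var_x = sum((x - mean_x) ** 2 for x in days)
    if var_x == 0:
        raise ValueError("need samples from at least two distinct days")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sizes))
    return cov / var_x


def days_until(samples, capacity):
    """Days from the last sample until the fitted growth rate reaches capacity.

    Extrapolates from the last observed size. Returns None if the trend is
    flat or shrinking.
    """
    slope = daily_growth(samples)
    if slope <= 0:
        return None
    last_size = samples[-1][1]
    return max(0.0, (capacity - last_size) / slope)
```

The same two functions work for user counts or any other quantity you sample on a schedule.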
As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:
How is the database performance evolving? Are there signs of degradation?
Is there consistent growth in data volume that may require future infrastructure adjustments?
How is overall resource utilization trending over time across different services?
How is the cost of cloud resources evolving, and what does that mean for budget forecasting?
Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?
Sebastian mentioned that it's a part of observability he enjoys doing. I can understand why. It’s exciting to see how components change over time and to work out solutions before you end up in an incident response nightmare.
Analyzing trends effectively requires the right data retention settings. If you throw out your logs, traces, and metrics too early, you will not have enough historical data to do this kind of work.
Doing this right means keeping enough data in place to analyze those trends over time, and how much that is will of course depend on the period you want to cover.
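One quick sanity check is to compare the retention you have configured against the period you want to analyze. The sketch below assumes you can list retention per signal in days. The function name `retention_gaps` and the input shape are hypothetical, not taken from any particular tool.

```python
def retention_gaps(retention_days, window_days):
    """Return how many days short each signal is for the desired analysis window.

    retention_days: mapping of signal name to configured retention in days.
    Signals that already cover the window are left out of the result.
    """
    return {
        signal: window_days - days
        for signal, days in retention_days.items()
        if days < window_days
    }
```

An empty result means every signal is kept long enough for the window you asked about.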
Comparing over time or experiment groups
Google’s definition
You're comparing the results for different groups that you want to contrast. A few examples from the SRE (2016) book:
Are your queries faster in this version of this database or this version of that database?
How much better is my memcache hit rate with an extra node?
Is my site slower than it was last week?
You're comparing across different buckets of time and different types of products.
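A comparison like "is this version faster than that one?" usually comes down to summarizing each group the same way and looking at the relative change. Here is a minimal sketch using latency samples in milliseconds. The names `summarize` and `compare` are illustrative, and median plus 95th percentile is just one reasonable choice of summary.

```python
from statistics import median, quantiles


def summarize(latencies_ms):
    """Median and 95th percentile of latency samples (needs at least 2 samples)."""
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = quantiles(latencies_ms, n=100, method="inclusive")[94]
    return {"median": median(latencies_ms), "p95": p95}


def compare(baseline_ms, candidate_ms):
    """Relative change of candidate vs baseline per statistic (negative = faster)."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    return {name: (cand[name] - base[name]) / base[name] for name in base}
```

The baseline can be last week's samples and the candidate this week's, or the two groups can be two database versions measured over the same period.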
A proper use case for comparing groups
Sebastian worked through this particular use case recently because he had to compare two different technologies for deploying code: AWS Lambda vs AWS Fargate ECS.
He took those two services and tried out different memory sizes and different numbers of virtual CPUs. Then he ran different volumes of requests against those settings to figure out which technology was the more cost-effective option.
His need for this went beyond engineering work: it was about giving product teams the right decision-making data. He wrote a knowledge base article to guide them toward a more educated decision on the right AWS service.
Having the data to compare the two services allowed him to answer questions like:
When should you be using either of these technologies?
What use cases would either technology be more suitable for?
This kind of decision support rests mainly on monitoring or observability data. The idea of using monitoring data to compare tools and technologies for guiding product teams is something I think reliability folk can gain a lot of value from doing.
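To make a comparison like this concrete, you can normalize each benchmark run to a cost per million requests and pick the cheapest option at each load level. This is only a sketch under assumptions: the `BenchmarkRun` fields, the option labels, and the idea that you can attribute a dollar cost to each run from your billing data are all hypothetical, and it is not a description of how Sebastian ran his comparison.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    option: str      # hypothetical label, e.g. "lambda-1024mb" or "fargate-1vcpu"
    load: int        # requests per second applied during the run
    requests: int    # total requests served in the run
    cost_usd: float  # cost attributed to the run, taken from your own billing data


def cost_per_million(run):
    """Cost in USD of serving one million requests at this run's efficiency."""
    return run.cost_usd / run.requests * 1_000_000


def cheapest_by_load(runs):
    """For each load level, return the option with the lowest cost per million requests."""
    best = {}
    for run in runs:
        current = best.get(run.load)
        if current is None or cost_per_million(run) < cost_per_million(current):
            best[run.load] = run
    return {load: run.option for load, run in best.items()}
```

Cost is only one axis. The same structure works if you swap in latency or error rate as the thing you rank by.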
Conducting ad hoc retrospective analysis (debugging)
Debugging is a bread-and-butter responsibility for software engineers at every level.
It’s something everybody should know a little more about than other tasks, because there are very effective and also very ineffective ways of going about it.
Monitoring data can help your debugging process land on the effective side.
There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you’ll get another fragment. And so on for all the different systems.
And then you have to correlate these pieces of information in your head and hope the fragments give you enough clarity to form some kind of insight.
Monitoring data brought together into one data stream can help correlate and combine all these pieces of information. With it, you can:
Pinpoint slow-running queries or functions by analyzing execution times and resource usage, helping you identify inefficiencies in your code
Correlate application logs with infrastructure metrics to determine if a performance issue is due to code errors or underlying infrastructure problems
Track memory leaks or CPU spikes by monitoring resource usage trends, which can help you identify faulty code or services
Set up detailed error tracking that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster
Monitor system load alongside application performance to see if scaling issues are related to traffic spikes or inefficient code paths
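The correlation step in several of these points can be as simple as lining up events by timestamp. Here is a minimal sketch that attaches the nearest infrastructure metric sample to each application error. The names `nearest_sample` and `correlate`, the tuple shapes, and the 30-second tolerance are assumptions for illustration, and most observability platforms do this join for you.

```python
from bisect import bisect_left


def nearest_sample(metric_times, t, tolerance):
    """Index of the metric sample closest to time t, or None if none is within tolerance.

    metric_times must be sorted in ascending order.
    """
    i = bisect_left(metric_times, t)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(metric_times)]
    if not candidates:
        return None
    best = min(candidates, key=lambda j: abs(metric_times[j] - t))
    return best if abs(metric_times[best] - t) <= tolerance else None


def correlate(errors, metrics, tolerance=30):
    """Attach the nearest metric value to each error event.

    errors: list of (timestamp, message).
    metrics: list of (timestamp, value), sorted by timestamp.
    Timestamps are in seconds. Errors with no sample within tolerance get None.
    """
    times = [t for t, _ in metrics]
    joined = []
    for t, message in errors:
        idx = nearest_sample(times, t, tolerance)
        value = metrics[idx][1] if idx is not None else None
        joined.append((t, message, value))
    return joined
```

If the errors cluster around samples where CPU or memory is high, that points toward infrastructure. If the metric looks normal at every error, the code is the more likely suspect.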
Being able to do all this makes the insight part easier for you. And so your debugging approach becomes very different. It becomes much more effective and much less time-consuming, because you get to the root cause of the thing that is not working much faster. It potentially even makes the debugging task fun.
Your monitoring/observability data setup can make debugging nice and fun to a certain degree, or it can make it downright miserable.
If it's done well, it's just one of those things you don't even have to think about. It's just part of your job. You do it. It's very effective and you move on.
Wrapping up
So we've covered three more use cases for monitoring data, other than the usual alerts and dashboards.
They are once again:
analyzing long-term trends
comparing over time or experiment groups and
conducting ad hoc retrospective analysis, aka debugging
Next time your boss asks you what all these systems do, you now have three more reasons to focus on your monitoring and to use it more effectively.
Until next time, happy monitoring.
#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards