How to Resolve Bad Observability Data Quality
Observability, data, and quality. If these 3 words don’t mean much to you, stop reading. Really. It means this guide is not for you. You can enjoy some cat videos on YouTube.
We previously covered 3 observability data flow issues that can plague incident response and other SRE activities.
There’s another issue — data quality. Imagine going into an incident and being unsure about the reliability of the data you’re using.
That’s what can happen when you have observability data quality issues.
The scary thing is — it’s really easy to end up with low-quality data.
We’ll first explore how data quality issues manifest in observability.
We’ll then look at the specific problems in data quality.
And finally, we’ll work through a few potential solutions.
Sound good? Let’s get started.
Unpacking poor data quality
In plain English, you risk getting dodgy or unreliable data.
It’s like trying to fill a swimming pool from a ground well you know very little about.
In this scenario:
the swimming pool is your observability data lake and
the water is observability data
You might be able to fill your observability data lake or warehouse from that unverified water source, but the pipeline is filling it with swirls of dirt and sludge.
The data is tainted with errors or inaccuracies.
End result and risk: your observability data is filled with dirty or useless data, making it hard to trust the insights you gain from it.
It can manifest because of issues like high data cardinality, noisy data, and a weak sampling strategy.
Let’s now explore a few of these data quality problems…
High data cardinality
High cardinality is when you have a large number of distinct values in your dataset.
It can happen if you pick a field with near-unique values, like user IDs. That’s never recommended. It can also happen with more strategic fields like instance ID.
High cardinality can make your data quality issues worse by making duplications and inconsistencies harder to handle. The resulting bottlenecks can weaken your data output, which will give you a lo-fi picture of a hi-fi software system.
The key here is to avoid excessively high-cardinality data.
I explore this idea in more detail in the article The Cardinality Conundrum in Observability.
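To make this concrete, here’s a minimal sketch in Python of one way to keep cardinality in check before metrics are emitted: an allow-list of bounded label names plus a cap on distinct values. The label names, the cap, and the sanitize_labels helper are all illustrative, not taken from any particular metrics library.

```python
# Minimal sketch: cap label cardinality before emitting metrics.
# ALLOWED_LABELS and MAX_DISTINCT_VALUES are illustrative choices.

ALLOWED_LABELS = {"service", "region", "instance_id"}  # strategic, bounded fields
MAX_DISTINCT_VALUES = 1000                              # illustrative cap

_seen_values = {}  # label name -> set of values seen so far

def sanitize_labels(labels):
    """Drop unapproved labels and bucket values once a label gets too diverse."""
    clean = {}
    for name, value in labels.items():
        if name not in ALLOWED_LABELS:
            continue  # e.g. user_id never makes it onto a metric
        seen = _seen_values.setdefault(name, set())
        if value not in seen and len(seen) >= MAX_DISTINCT_VALUES:
            clean[name] = "other"  # collapse the long tail instead of exploding cardinality
        else:
            seen.add(value)
            clean[name] = value
    return clean
```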
Noisy Data
A system running at scale can generate a gargantuan amount of data.
But it’s almost certain that you don’t need all of this data to understand your system’s behaviors.
Noisy data is what happens when an excessive amount of data drowns out the meaningful signals you need to solve problems.
For example, a logging system might track all events, including routine informational messages and a myriad of overcautious warnings.
You’re essentially wading through a flood of logs to try and find the important error messages.
You can try to resolve this by implementing filters to reduce the noise at collection or analysis time.
You can also set up better thresholds and apply better sampling.
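As a rough sketch of what noise reduction at collection time can look like, here’s a filter built on Python’s standard logging module that drops routine informational events and a couple of known-noisy warning patterns. The patterns are placeholders; yours will differ.

```python
import logging

# Illustrative patterns for warnings we already know are harmless.
NOISY_PATTERNS = ("heartbeat ok", "cache refresh complete")

class SignalFilter(logging.Filter):
    def filter(self, record):
        # Drop routine informational chatter entirely.
        if record.levelno < logging.WARNING:
            return False
        # Drop overcautious warnings that match known-noisy patterns.
        message = record.getMessage().lower()
        return not any(pattern in message for pattern in NOISY_PATTERNS)

handler = logging.StreamHandler()
handler.addFilter(SignalFilter())
logging.getLogger().addHandler(handler)
```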
Weak sampling strategy
Sampling is a common practice in observability. You are essentially picking out small slivers of your full observability data to analyze and query.
This is useful when using all the data would be too much to handle.
You can try to complete the puzzle of your full data through strategic sampling.
But a bad sampling strategy can hurt your observability data quality.
Let’s explore a few of the characteristics of a bad sampling strategy:
Non-random sampling
It’s surprisingly easy to do this.
You might be used to picking out data samples from certain time periods such as business hours.
You might also have a set of criteria to do this.
Either way, you are not allowing random sampling to occur.
End result: skewed insights that don’t accurately represent your system's behavior.
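By contrast, here’s a minimal sketch of uniform random sampling over a list of events, so every time period has an equal chance of being picked. The event structure, the fraction, and the fixed seed are assumptions for illustration.

```python
import random

# Minimal sketch: sample uniformly across all events instead of
# cherry-picking business hours. `events` is any list of records.

def sample_uniform(events, fraction=0.05, seed=42):
    """Return an unbiased random sample covering the whole time range."""
    if not events:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(events) * fraction))
    return rng.sample(events, k)
```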
Sampling too small
You need to consider the size of the dataset you’re working with to figure out how much of it to sample.
Taking 10 entries out of a 1,000-entry database, for example, will give you very skewed results.
For small datasets of under 100 entries, you could viably sample 20-50% of the data.
For larger datasets, you can reduce the sampling size to 5-10% of the data.
For mega datasets with 100,000+ entries — like the ones you typically see in systems at scale — the sampling size can be as low as 1%.
Keep in mind, these numbers are guidelines. Follow engineering judgment for your context.
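If it helps, here are those rough bands as a tiny Python sketch. The thresholds and fractions simply encode the guidelines above; treat them as a starting point, not hard rules.

```python
# Illustrative only: encode the rough sampling bands from the guidelines above.

def suggested_sample_fraction(dataset_size):
    if dataset_size < 100:
        return 0.30      # small datasets: somewhere in the 20-50% band
    if dataset_size < 100_000:
        return 0.10      # larger datasets: roughly 5-10%
    return 0.01          # mega datasets at scale: as low as 1%

print(suggested_sample_fraction(1_000))    # 0.1, i.e. ~100 entries rather than 10
print(suggested_sample_fraction(250_000))  # 0.01
```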
Ignoring dependencies
It’s easy to overlook the fact that your system is many different parts connected together.
Ignore these connections at your peril. Or at least your sampling’s peril.
Having an idea of the relationships between parts lets you sample effectively.
Imagine you're responsible for diagnosing latency issues in a web application with components like web servers, databases, and caching systems.
You already know that these components are dependent on each other.
Their interactions can affect the overall performance of the application.
Here’s what can happen without considering these dependencies:
You randomly select data points to analyze without considering how the components interact.
You notice spikes in latency but struggle to understand why they occur.
Hard to solve this problem.
Here’s what can happen if you consider dependencies:
Latency in your web application is highly affected by interactions between the web server and the database.
So you focus your sampling on periods when there are simultaneous spikes in web server usage and database queries.
A few telltale signs come up from sampling the moments when all 3 components are under higher load.
You discover that certain database queries are causing delays in the web server response time.
This is something you can work on resolving.
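Here’s a hypothetical sketch of that dependency-aware approach in Python: keep only the time windows where web server load and database query load spike together, then sample within those windows. The metric names and thresholds are invented for illustration.

```python
# Hypothetical sketch: dependency-aware window selection. The field names
# "web_server_load" and "db_query_load" and the 0.8 thresholds are made up.

def correlated_spike_windows(windows, web_threshold=0.8, db_threshold=0.8):
    """Pick sampling windows where dependent components spike together."""
    return [
        w for w in windows
        if w["web_server_load"] >= web_threshold
        and w["db_query_load"] >= db_threshold
    ]

# Feed the returned windows into your trace or log sampling instead of
# sampling the whole timeline at random.
```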
Wrapping up
As you now know, data quality in observability is an ongoing issue.
It requires handling challenges like high cardinality, noisy data, and weak sampling strategies.
Some of the highlights of what I suggested include avoiding excessive cardinality, implementing effective noise reduction, and considering dependencies in sampling.
Maybe then you can get more accurate insights for solving problems in your complex system.