Introduction
Cardinality is a term you’ll hear over and over again if you’re looking into how to do observability.
And especially if you are talking with vendors! They love this topic!
A lot of people have been thinking about high cardinality for a while and for a good reason.
Because high cardinality can cost you a LOT of money and time if you go about it wrong… but really, what we want to do is cut down excessive cardinality.
The kind that doesn’t add value to your querying and intelligence.
That’s why there’s no straightforward fix for cardinality. Cutting labels willy-nilly is not the answer. We’ll get onto ways to deal with this later on.
But we should first talk about what cardinality actually means.
What is cardinality?
Because it’s not something you think about every day… unless you’re an observability engineer or vendor.
The last time I heard “cardinality” being used this often was in my SQL classes, which was a while back.
So I did a refresher and went down a rabbit hole of math, logic, and all that fun stuff.
Cardinality refers to how many unique values there are in a data set.
You’re essentially looking for how diverse — and ultimately complex — the data is.
How do you differentiate between low and high cardinality data?
Low cardinality means low complexity, with few dimensions to the data. This is fine for analyzing aggregate data but lacks the granularity engineers often need to solve system problems.
High cardinality contains more data dimensions. This lets you slice and dice data for more detailed analysis, but you then have to deal with complexity issues.
By dimensions, I mean attributes like method, status_code, instance_id, etc.
Let’s run through a simple example to cement the difference:
Let’s say you have the table fruits with only the fruit apple in your database. We want to add a color key to identify each apple in your table.
Now, the data on apples is low cardinality if you only find red or green apples.
But it becomes higher cardinality if you have apples with colors like orange, blue, red, pink, green, violet, black etc.
The more color options you have for apple, the higher the cardinality of the color data.
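To make that concrete, here is a minimal Python sketch (with made-up color values) showing that cardinality is just the count of distinct values in a dimension:

```python
# Cardinality = number of unique values in a column/dimension.
colors_low = ["red", "green", "red", "green", "red"]
colors_high = ["orange", "blue", "red", "pink", "green", "violet", "black"]

print(len(set(colors_low)))   # 2  -> low cardinality
print(len(set(colors_high)))  # 7  -> higher cardinality
```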
Examples of high cardinality
Here are some examples of high cardinality data:
email addresses (never append these to metrics!)
user IDs (“Observability engineer: why is this in my beautiful TSDB?!”)
IP addresses (sometimes appended for AppSec purposes)
instance name (fair use case for identifying instance issues)
pod IDs (like the ones you find within Kubernetes)
3 key benefits of high cardinality in observability data
High cardinality enables these traits within observability data:
1. Granularity
With granular data, you can slice and dice to deeper and deeper levels to precisely pinpoint where issues like performance degradation and outages are happening.
For example, with each 500 error, you may want to dig a little further.
With a high cardinality metric, you can dig into the 50x errors for GET, POST, and DELETE requests.
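As a rough illustration (the request records and field names here are hypothetical), having method and status attached as dimensions is exactly what lets you break 5xx errors down by HTTP method:

```python
from collections import Counter

# Hypothetical request records with method and status dimensions attached.
requests = [
    {"method": "GET", "status": 500},
    {"method": "GET", "status": 200},
    {"method": "POST", "status": 503},
    {"method": "DELETE", "status": 500},
]

# Because the dimensions exist, we can narrow 5xx errors down by method.
errors_by_method = Counter(
    r["method"] for r in requests if 500 <= r["status"] < 600
)
print(errors_by_method)  # Counter({'GET': 1, 'POST': 1, 'DELETE': 1})
```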
2. Segmentation
Segmentation helps you collect disparate data and organize it into chunks that are easier for humans to process.
Examples include:
payments by age group e.g. 18-25, 25-34, 35-49, etc.
payments by market region e.g. North America, South East Asia, LatAm, EU, etc.
3. Performance evaluation
High cardinality means you can go beyond “success” and “fail” for responses.
You can classify responses into performance levels to improve reliability across different instances and groups.
Imagine what you can work out by putting responses into buckets of 0-10 scoring.
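As a sketch of that idea, a hypothetical scoring function could bucket response latency into a 0-10 scale (the thresholds here are invented for illustration):

```python
def performance_score(latency_ms: float) -> int:
    # 0 = worst, 10 = best; each threshold is the upper bound for that score.
    thresholds = [10000, 5000, 2000, 1000, 750, 500, 300, 200, 100, 50]  # ms
    for score, limit in enumerate(thresholds):
        if latency_ms > limit:
            return score
    return 10

print(performance_score(80))    # 9
print(performance_score(4000))  # 2
```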
Before we continue, I will assume that you know the answers to the following questions:
What does observability data look like?
How does observability data flow?
Where does observability data get stored, and how?
How do observability data types fare in terms of cardinality?
Metrics are the main data type when we think about cardinality. But let’s still cover each of the 3 main observability data types to see where cardinality becomes an issue for each:
High cardinality in logs
Logs have historically been less affected by high cardinality than metrics and traces.
Some engineers still use unstructured logs for small-scale systems.
At the commercial scale, modern logs are structured, or at least shifting toward that, with JSON showing the most promise as the format of choice.
So why shift to a structured format? A few things come into play like:
Readability
Better ability to group or segment logs by keys for more refined analysis
Faster reads, since less string manipulation has to take place when processing the aggregated log data
You can derive metrics from logs now!
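Here is a minimal sketch of the difference using only the Python standard library (the event fields are hypothetical):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")

# Unstructured: everything is one string, hard to group or filter.
logging.info("payment failed for user in region eu-west after 3 retries")

# Structured: keys make the same event easy to segment and aggregate.
logging.info(json.dumps({
    "ts": time.time(),
    "event": "payment_failed",
    "region": "eu-west",
    "retries": 3,
}))
```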
Logs in a commercial setting can experience high cardinality, just like metrics and traces.
High cardinality in metrics
Among the trio, metrics are the most significantly affected by high cardinality.
Metrics get complicated by the fact that they can have many dimensions.
You can have a multiplier effect when you add a new dimension.
This multiplication is what defines the high cardinality we are talking about.
You benefit in one way from this through more granularity that helps deeper analysis.
But then you also impact storage, querying, and visualization performance.
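As a sketch using the prometheus_client Python library, each label you add multiplies the number of possible label combinations, and every unique combination is stored as its own time series (the metric and label names below are made up):

```python
from prometheus_client import Counter

# Hypothetical metric: every unique combination of label values
# becomes its own time series in the TSDB.
http_requests = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "status_code", "instance"],
)

http_requests.labels(method="GET", status_code="200", instance="i-001").inc()
http_requests.labels(method="POST", status_code="500", instance="i-002").inc()
# A handful of methods * dozens of status codes * N instances:
# the series count multiplies quickly as you add dimensions.
```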
High cardinality in traces
Traces can also be affected by high cardinality.
Picture a trace to be the bus route from New York to Los Angeles and back.
This route can be divided into sections — when you plan a trip along this route you may want to stop for a breather, food, or a bathroom break.
Each section of this route or round trip is equivalent to what in tracing we would call a span.
A common dimension for spans is the time or duration of that specific section of work.
We can add more detail or context for each span by appending metadata to the span.
You get high cardinality within tracing when you attach span attributes like user IDs and highly specific names of what happened.
All this once again impacts storage and querying performance.
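Here is a minimal sketch using the OpenTelemetry Python API, assuming that is your tracing library; the span and attribute names are hypothetical. Attaching something like a user ID as a span attribute is exactly where trace cardinality creeps in:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("charge_payment") as span:
    # Low cardinality attribute: a handful of possible values.
    span.set_attribute("http.method", "POST")
    # High cardinality attribute: one value per user.
    span.set_attribute("user.id", "user-48291")
```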
Calculating the impact of cardinality within a metric
Adding a dimension to a metric does not cause high cardinality.
It’s what that key-value pair stands for that determines this.
So adding a 6th dimension barely increases cardinality if it’s a boolean or a “success”/“fail” type with only two possible values.
But if it were something like instance ID and you have 100 or 1000 instances, that would.
We do not need to calculate cardinality unless we are sitting in a math class.
We are more interested in how many time series a metric will generate.
Calculating time series count for a metric
Let’s run through a simple example.
Say you have a metric called network_latency_distribution covering 100 instances with 10 buckets, 10 possible response codes, and 10 network paths.
**Calculating the series would look like this:**
= 100 instances * 10 buckets * 10 response codes * 10 paths
= 100,000 series
This is a reasonable size for a series, but things can get out of hand as you add more dimensions.
Adding a dimension like region can significantly increase cardinality.
**Say we have 6 regions to choose from. The calculation would become:**
= 100 instances * 10 buckets * 10 response codes * 10 paths * 6 regions
= 600,000 series (! 😕)
Okay. It doesn’t look excessive, but it can significantly increase querying time!
I’ll share some query time data with you in a minute.
The dimension of pods (within containers) can increase cardinality by a huge factor, too.
**Say we have a modest 1,000 pods in action. The calculation would become:**
= 100 instances * 10 buckets * 10 response codes * 10 paths * 1,000 pods
= 100,000,000 series (!! 😦)
Excessive cardinality comes into the picture when you add a dimension like user_id.
**Even with a modest user count of 10,000, the time series blows out with:**
= 100 instances * 10 buckets * 10 response codes * 10 paths * 10000 users
= 1,000,000,000 series (!!! 😱)
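If you want to reproduce those numbers yourself, the arithmetic is just a product of each dimension’s unique-value count. A quick Python sketch:

```python
from math import prod

def series_count(dimensions):
    """Series count = product of each dimension's number of unique values."""
    return prod(dimensions.values())

base = {"instance": 100, "bucket": 10, "response_code": 10, "path": 10}
print(series_count(base))                          # 100,000
print(series_count({**base, "region": 6}))         # 600,000
print(series_count({**base, "pod": 1_000}))        # 100,000,000
print(series_count({**base, "user_id": 10_000}))   # 1,000,000,000
```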
We’ll talk about excessive cardinality in more detail later on.
🚨 PSA: The more types of data you collect, the more fragmented view you have of the whole piece. And if you have a fragmented view, that reduces your ability to TAKE ACTION on the data. Remember that Quality of data > Quantity of data.
High cardinality data is rising because of system trends
Shift from monolith to microservices architecture
Where you had one humongous service emitting metrics, you now have 10, 20, 50, 100+ microservices with each emitting its metrics. ‘Nuf said.
Shift from VM-based to container-based infrastructure
Who would’ve thought that life was easier when we depended on VM infrastructure rather than containers? </sarcasm>
Containers have their benefits but generate a whole bunch of time-series data.
For every 1 VM in an older system, there are 10s of containers to match workloads.
The key culprit behind this is their high ephemerality.
The ephemeral nature of containers refers to how they start, shut down, and then get replaced by new containers.
All this stop-start emits a lot of data for metrics collectors to ingest and push to storage.
Serverless functions as a large part of the system
Every time you invoke a Lambda function, it emits metrics, each with its own time series. Depending on the particular service you are running on serverless, this time series number can be HUGE.
A whole bunch of factors come into play like:
how often you're invoking the Lambda function
the number of serverless functions in your system
concurrency of handling requests and scale
Your time series data incurs higher and higher cardinality as these 3 factors rise in frequency or occurrence (depending on the factor).
When to deal with high cardinality
Remember how observability data flows. There are 4 distinct stages: Instrumentation, Ingestion, Storage, and Usage.
Dealing with high cardinality at the Instrumentation stage is not ideal. You will not confidently know what dimensions to filter out at this stage.
It’s also not ideal at the Usage stage. It’s too late to deal with cardinality issues because you’ve incurred high ingestion and storage costs, so…
Deal with high cardinality at the Ingestion & Storage stages. This is when you can put in practices like in-flight aggregation, cardinality isolation, and cardinality limiters.
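As one illustration of a cardinality limiter at the ingestion stage (this is a toy sketch, not any specific vendor’s feature), you could cap the number of unique values allowed per label and collapse the long tail:

```python
from collections import defaultdict

LIMIT = 1000             # max unique values allowed per label
seen = defaultdict(set)  # label name -> unique values observed so far

def limit_labels(labels: dict) -> dict:
    """Replace values of runaway labels with 'other' before storage."""
    safe = {}
    for key, value in labels.items():
        seen[key].add(value)
        if len(seen[key]) > LIMIT:
            safe[key] = "other"  # collapse the long tail into one value
        else:
            safe[key] = value
    return safe

print(limit_labels({"method": "GET", "user_id": "u-9321"}))
```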
When to ditch high cardinality data
High cardinality data is a necessary evil in some situations. As I mentioned earlier, it helps with granularity, segmentation, and deeper performance evaluation.
If it does not add value to your querying, intelligence, or alerting, ditch it.
It’s just an ornament to give you pretty data.
What I mean by this is it looks like it could be important.
But in reality:
it gives you limited analytical value
it rarely contributes to usable insight, and
on top of either of the above, it’s resource intensive.
Because if resource intensiveness did not cost money or time, we wouldn’t care about it.
We’d just let high cardinality data sit there and do its thing.
But it does cost money in terms of storage and processing power.
And it costs time in terms of how long you have to wait before you can start working with the data.
And it’s definitely something you need to look into if it’s slowing things down to the point where you’re pushing past your MTTR target.
(MTTR = your mean time to recovery, repair, resolution, whatever you want to call it).
Now, you might be saying thank you, Captain Obvious.
But think about how often this is still a real problem in software systems.
You need to pose it as a challenge for people to think about and solve.
The problem with excessive cardinality
The working group behind the open-source Prometheus monitoring tool has warned about this for a while.
Excerpt from Prometheus.io documentation on Metric and label naming
A quick TLDR of their alert message on key-value pairs:
“Every unique combination of key-value pairs represents a new time-series. This significantly increases the data stored.”
An ever-increasing number of possible key-value pairs or dimensions does something sinister within time series databases (TSDBs).
The number of series for a single metric will explode 💥 to the point that your querying eventually slows down to a crash.
Query performance data by Chris Marchbanks (ex-Splunk, Grafana Labs) says it all.
This is how long it took to query quantities of time series:
100,000 series took 1.5 seconds (acceptable)
200,000 series took 5 seconds (slow-ish)
10,000,000 series took 15 minutes (!)
Keep this in mind: high or (better put) excessive cardinality is a data problem at its core.
Not only does it grind query times down, but it can also cause trouble like:
data being wasted on dimensions that are not needed for system improvements
need for engineers to constantly maintain performance of the observability system
higher storage and processing needs, which mean a higher cost to run
We will think through a few solution starters toward the end to prevent or at least reduce this risk.
What’s contributing to excessive cardinality in a software system?
Bad dimension selection
Remember my calculation example above?
Putting in user or request IDs can skyrocket your observability metrics’ cardinality.
Each unique identifier contributes a new time series, increasing the overall cardinality of the metric.
Improper sampling practices
Observability systems generate a ton of data all the time.
It’s not an easy task to query all of that data all the time.
This is where you want to bring sampling practices to select a portion of the data to analyze.
But if you don’t use the right sampling techniques, you will deal with high cardinality data and more series than your dashboard or query tool can handle.
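One common approach is deterministic head-based sampling, where the trace ID decides whether a trace is kept, so every span in the same trace gets the same decision. A rough Python sketch (the rate and hashing scheme are illustrative):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float = 0.1) -> bool:
    """Hash the trace ID so the keep/drop decision is stable per trace."""
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16)
    return (digest % 10_000) < sample_rate * 10_000

print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736"))
```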
Unbounded event types
Metrics are not the sole culprits for pushing out excessive cardinality data.
Putting weak boundaries around your event data can do the same to logs and traces.
A good system will have few event types while a system suffering from excessive cardinality will have numerous possible event types.
The more event types there are, the more logs and spans you’ll have to push to ingestion.
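A simple guardrail is an allowlist that keeps the event type dimension bounded; anything unexpected gets bucketed as “other”. A hypothetical sketch:

```python
ALLOWED_EVENT_TYPES = {"checkout", "login", "search", "payment_failed"}

def normalize_event_type(raw: str) -> str:
    # Anything outside the allowlist is bucketed as "other",
    # so the event_type dimension stays bounded.
    return raw if raw in ALLOWED_EVENT_TYPES else "other"

print(normalize_event_type("checkout"))          # "checkout"
print(normalize_event_type("checkout_v2_test"))  # "other"
```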
This is not an exhaustive list of ways excessive cardinality can happen. I want to illustrate the idea that there are several ways you can end up with it.
How to solve excessive cardinality
Be selective about dimensions
I feel like I’ve mentioned it several times already, but never, ever use dimensions like email address, user ID, transaction ID, or anything with overly unique data in your metrics.
Split the metric into smaller metrics
Ask yourself these two questions:
Do I need to have this single metric with all these dimensions?
Can I split it into two separate metrics that can still help me answer the questions I will pose at querying, and still give me the necessary alerting?
An example might better highlight why you’d want to do this:
Let’s return to our metric called network_latency_distribution, which covers 100 instances with 10 buckets, 10 possible response codes, and 10 network paths.
**Calculating the series would look like this:**
= 100 instances * 10 buckets * 10 response codes * 10 paths
= 100,000 series
Now what would happen if we were to split this metric into 2 individual metrics, one without paths and one without buckets?
**Here's the first metric without paths:**
= 100 instances * 10 buckets * 10 response codes
= 10,000 series
**Here's the second metric without buckets:**
= 100 instances * 10 response codes * 10 paths
= 10,000 series
This gives us a grand total of 20,000 series and a whopping 80,000 series reduction!
This split works perfectly if we don’t need to correlate paths with buckets to solve system issues.
How much faster would it be to query 20,000 vs 100,000 series?
Answer: enough to feel instantaneous vs. time spent hearing, “Please wait while we process your query.”
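Here is what that split could look like with the prometheus_client Python library; the metric names, labels, and bucket boundaries are hypothetical:

```python
from prometheus_client import Histogram, Counter

# First metric: latency buckets per instance and response code (no path label).
latency_by_code = Histogram(
    "network_latency_seconds",
    "Latency distribution per instance and response code",
    ["instance", "response_code"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
)

# Second metric: a plain counter per path (no latency buckets).
requests_by_path = Counter(
    "network_requests_total",
    "Requests per instance, response code and path",
    ["instance", "response_code", "path"],
)
```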
Allow high(er) cardinality for high-value metrics only
It can still make sense to generate a whole bunch of series for metrics that add business value.
You may have some metrics that need 100,000 series rather than being split into multiple mini-metrics and losing their strength.
How do we define a high-value metric? A metric qualifies if it supports:
critical decision-making processes
developing actionable insights or
enhancement of overall system performance
What hits one of these criteria depends on your software architecture, industry context, and the problems you’re looking to solve.
Here are a few considerations to make:
These kinds of metrics might not be suitable for time-of-essence needs like alerting
You can try and save the designation of high-value metrics to ad-hoc queries, low-usage dashboards for specialist or special interest groups, and periodic reports
You still need to weigh the impact vs querying & visualization time
Even a high-value metric can start looking too expensive with the way observability is priced these days, so work out the costings vs the value you attain and discuss with management
We must still remember to control cardinality levels to meet our cost and time-to-productivity constraints.
Wrapping up
If you can only remember one thing from this guide, I want it to be this 👇🏼
Cardinality is valuable, but excessive cardinality is expensive — in terms of time to query, cost to store, and resource consumption to process & analyze. Keep it in check.