Whitebox vs Blackbox Monitoring

Reliability Enablers

#51 Whitebox vs Blackbox Monitoring

Full writeup in email. Monitoring is not just a monolith. Google's SREs talk about 2 distinct forms of monitoring, one of which is essential to assuring reliability of external software.

Ash Patel and Sebastian Vietz

Jul 30, 2024

Sebastian and I took some time in July to talk about how we could make concepts from Google's SRE book more digestible and usable in practice.

We'll start on this by covering monitoring concepts, the first concept being what Google's SREs call white box versus black box monitoring.

I initially thought that we could just call it internal versus external monitoring to explain it to you, but it turns out that would not be correct. We’ll explore this further.

First of all…

What is monitoring?

If you think of monitoring, or observability as we often call it today, a simple working definition is:

Monitoring is a way to measure your system, to gain insight and knowledge about the system.

In our context, a system is usually software of some kind e.g. SaaS, platform, etc.

So what is whitebox monitoring then?

In monitoring terms, whitebox monitoring covers the system and components that are yours and that you have full control over.

It's the system you can instrument in whichever way is appropriate for your context and for the kind of insight you're trying to get out of it.

You have full control over what is happening. There's no limitation to what you could do.

Some characteristics of whitebox monitoring include:

You can and do get very granular with the data you’re capturing
You have full control over the end-to-end life cycle of your observability data
The focus is on the internals of your system that you control

It is a subset of internal monitoring, but it takes the data capture much deeper than high-level metrics like you’d get from traditional application performance monitoring (APM) tools.
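
As a rough illustration of what instrumenting your own code can look like, here is a minimal sketch that wraps a function and records call counts, error counts, and cumulative latency. The names `METRICS`, `instrument`, and `charge_card` are hypothetical, and the in-memory dict is only a stand-in for whatever metrics backend you actually use.

```python
import time
from collections import defaultdict
from functools import wraps

# In-process metric store; a hypothetical stand-in for a real metrics backend.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrument(name):
    """Record calls, errors, and cumulative latency for the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrument("charge_card")
def charge_card(amount_cents):
    # Hypothetical business logic that we own and can instrument freely.
    if amount_cents <= 0:
        raise ValueError("amount must be positive")
    return {"status": "ok", "amount_cents": amount_cents}
```

Because the code is yours, you decide what gets measured and at what granularity, which is exactly what you cannot do with a vendor's system.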

Now let's define black box monitoring.

Thank you for reading Reliability Enablers. Please share this post if you found it useful.

What is blackbox monitoring?

On the opposite end, there are systems that you have no control over.

That applies to systems like third-party API endpoints, or generally external pieces of software that you're using. A vendor provides you access to their system in some way, shape, or form, but you don't get to go inside that system and instrument that system to the degree that you might want to.

The way you then need to approach monitoring or observing these kinds of systems is somewhat different from your own systems.

What you can do is apply approaches that let you approximate what might happen within that system that is behind an API endpoint or within that piece of software that you're connected to.

There are things like heartbeats or health checks that you can implement on your side, and they can follow a simple pattern like, for example:

send a simple HTTP request to an endpoint
you get a response back (or not)
gauge the health of that system based on the response
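
The three steps above can be sketched with Python's standard library. This is an illustration rather than a production health check: the function name `probe` is hypothetical, and treating any 2xx status as healthy is an assumption you would adjust for the endpoint you depend on.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=3.0):
    """Send one GET request and return (healthy, status_code, latency_seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        status = err.code
    except (urllib.error.URLError, OSError):
        # No usable response: DNS failure, refused connection, timeout, etc.
        return False, None, time.perf_counter() - start
    latency = time.perf_counter() - start
    # Assumption: any 2xx response counts as healthy.
    return 200 <= status < 300, status, latency
```

A call such as `probe("https://api.example.com/health")` (a placeholder URL) returns the three values, so you can alert on the status as well as on latency creeping up.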

It's not a foolproof method. It's an approximation.

To a certain degree, that’s the best you can do with these types of systems that are a blackbox to you.

You can just go around the outside of that system and probe it with certain types of observability approaches, and then infer from the data that you receive back: “Is the system that I'm connected with healthy or not?”

“Will it respond appropriately and properly the way I need it to when I'm sending a barrage of requests to it?”
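
Since a single probe is only an approximation, one way to make the inference steadier is to look at a window of recent probe results rather than the latest one. The sketch below assumes a hypothetical `ProbeWindow` class, and its thresholds (90% and 50% success) are arbitrary values picked for illustration.

```python
from collections import deque


class ProbeWindow:
    """Keep the most recent probe outcomes and infer a coarse health state."""

    def __init__(self, size=10, degraded_below=0.9, down_below=0.5):
        self.results = deque(maxlen=size)  # Oldest results drop off automatically.
        self.degraded_below = degraded_below
        self.down_below = down_below

    def record(self, success):
        self.results.append(bool(success))

    def success_rate(self):
        if not self.results:
            return None  # No data yet, so no inference is possible.
        return sum(self.results) / len(self.results)

    def state(self):
        rate = self.success_rate()
        if rate is None:
            return "unknown"
        if rate < self.down_below:
            return "down"
        if rate < self.degraded_below:
            return "degraded"
        return "healthy"
```

Feeding each probe result into `record()` and alerting on `state()` means one dropped request does not page anyone, while a sustained run of failures still shows up.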

Blackbox monitoring is analogous to airplane blackboxes

You’d find that every airplane has a blackbox that records everything that's happening in the flight, but you cannot do anything with the data directly.

You cannot look at the data. If you're a pilot, you're not looking at that monitoring data at any point. That's only for investigation after the fact.

To sum it up, a black box is:

an aspect of your system that you don't have control over
focused on real-time data collection
a higher-level overview of whatever is observable
designed for situations where you can’t drill down into the internals
able to keep you on top of the health of a provider’s API or system integration

The rising importance of blackbox monitoring

With recent events where third parties have let prominent software vendors down, I think black box monitoring is likely to become a lot more important in the future.

The focus in the industry has been on white box monitoring, which makes sense. It's something you have direct control over.

But as we increase our risk surface area with more third-party services, and as incidents intensify, blackbox monitoring needs to be discussed more.

In the last 10 years, there hasn't been a whole lot of movement in the industry in terms of innovation or advancement, especially on the black box side of monitoring and observability.

We feel that's an area that could be in for a round of (visible) innovation.

It’s worthwhile improving your blackbox monitoring to infer more accurately what is happening with that third-party piece of software. The benefit might be for your sanity.

This lets you make more rational engineering decisions that can only come from a stronger picture of the overall health of your ecosystem.

How to get started in blackbox monitoring

As far as we are aware, there isn't a slew of open-source options around.

However, Prometheus has an open source blackbox exporter that can probe endpoints such as web servers, databases, or network devices. It can probe over protocols like HTTP, HTTPS, DNS, TCP, ICMP, and gRPC.
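
As a minimal sketch of how you might read a result from the blackbox exporter yourself, the snippet below builds a request to its `/probe` endpoint and looks for the `probe_success` metric in the response. It assumes an exporter running locally on its default port 9115 with the `http_2xx` module from the example configuration; the helper names `probe_url`, `parse_metrics`, and `target_is_up` are hypothetical.

```python
import urllib.parse
import urllib.request


def probe_url(exporter, target, module="http_2xx"):
    """Build the exporter's /probe URL for one target and module."""
    query = urllib.parse.urlencode({"target": target, "module": module})
    return f"{exporter.rstrip('/')}/probe?{query}"


def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition output."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue  # Skip blanks, comments, and labelled series.
        parts = line.split()
        if len(parts) >= 2:
            try:
                samples[parts[0]] = float(parts[1])
            except ValueError:
                pass  # Ignore lines whose value is not numeric.
    return samples


def target_is_up(exporter, target, module="http_2xx", timeout=10.0):
    """Ask the exporter to probe the target and read probe_success."""
    url = probe_url(exporter, target, module)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        metrics = parse_metrics(resp.read().decode("utf-8"))
    return metrics.get("probe_success") == 1.0
```

In practice you would usually let Prometheus scrape the exporter and alert on `probe_success` there; calling it directly like this is mainly useful for trying out a module against a target.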

Wrapping up

We've defined what Google's SREs think of whitebox and blackbox monitoring, but in particular, we’d like you to think about your black box efforts.

As an industry, we tend to put most of our energy into whitebox monitoring. But our systems are rife with third-party APIs and integrations.

How are you making sure that you don't get something like a global outage because a third-party vendor pushed bad code? Or at least, how would you minimize your blast radius?

The costs of third-party mishaps can be high, with the recent CrowdStrike-related outage incurring an estimated $5.4 billion in downtime costs.

It's important to set guardrails around your third-party systems, so that:

changes from them are deployed slowly to your own systems
those changes are watched by blackbox monitoring methods
if failure occurs, there's a failover ready to go
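
The failover guardrail can be sketched as a small wrapper that picks between a primary and a fallback provider. All names here (`call_with_failover`, `primary`, `fallback`, `is_healthy`) are hypothetical; `is_healthy` stands for whatever signal your blackbox checks produce, and catching every exception is a simplification for the sake of the example.

```python
def call_with_failover(primary, fallback, is_healthy):
    """Use the primary provider unless the health signal or the call itself fails.

    Returns a (provider_name, result) tuple so callers can see which path ran.
    """
    if is_healthy():
        try:
            return "primary", primary()
        except Exception:
            # Simplification: any primary failure sends us to the fallback.
            pass
    return "fallback", fallback()
```

Returning which provider answered makes it easy to count how often you are running on the fallback, which is itself a useful signal about the third party.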

I hope this updated version of our SRE book rundown has been more helpful to your work.

This is a concept from Chapter 6 (Monitoring Distributed Systems) of Google's Site Reliability Engineering book (2016). The chapter was written by Rob Ewaschuk and edited by Betsy Beyer.
