More telemetry makes reliability worse (until you fix the loop)

Dec 23, 2025

Every reliability engineer eventually learns the same painful truth: you can have a thousand dashboards showing you every metric imaginable and still miss the real signal.

This might feel like an insurmountable hurdle at first glance, one of those “it is what it is” situations. After all:

The more data we collect → the more noise we face → the less trust we have in our alerts → the slower we respond → the more incidents worsen → the more data we collect to compensate.

In MIT’s world of System Dynamics, this noise amplification problem is what we’d call a reinforcing loop.

Information overload spirals upward as the loop keeps reinforcing itself, snowballing with every pass. But here’s the thing…

Observability (o11y) isn’t just telemetry.

It’s also the people who interpret, triage, and learn from that telemetry. Together they form a balancing loop.

In a healthy system, every new signal that enters should trigger an equal and opposite stabilizing action, essentially a check-and-balance. That’s the balancing loop at work.

For example, when noise increases, teams should automatically slow alert creation or tighten signal thresholds until trust recovers.

When signal quality improves, they can loosen up again.
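
To make that tighten-then-loosen rule concrete, here is a minimal sketch of one turn of the balancing loop. Everything in it is an assumption for illustration: the `AlertPolicy` type, the idea of measuring trust as the share of recent alerts that led to action, and the 0.5 / 0.8 cut-offs are hypothetical and do not come from any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPolicy:
    """Hypothetical alerting policy, for illustration only."""
    threshold: float               # value a metric must cross before paging
    allow_new_alerts: bool = True  # whether the team may add new alert rules


def adjust_policy(policy, actionable, total,
                  low_trust=0.5, high_trust=0.8, step=0.1):
    """Apply one step of a balancing loop.

    Trust is approximated (an assumption) as the fraction of recent
    alerts that someone actually acted on.
    """
    if total == 0:
        return policy  # no evidence either way, so change nothing

    trust = actionable / total
    if trust < low_trust:
        # Noise is winning: tighten the threshold and pause new alerts.
        return AlertPolicy(policy.threshold * (1 + step), False)
    if trust > high_trust:
        # Signal quality has recovered: loosen up again.
        return AlertPolicy(policy.threshold * (1 - step), True)
    return policy  # in the healthy band, leave things alone


policy = AlertPolicy(threshold=5.0)
noisy = adjust_policy(policy, actionable=2, total=10)      # low trust
recovered = adjust_policy(noisy, actionable=9, total=10)   # high trust
```

The point of the sketch is the shape, not the numbers: the adjustment pushes against the direction the system is drifting, which is what separates a balancing loop from a reinforcing one.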

Without that feedback control, the system loses balance, and the painful reinforcing loop that I mentioned earlier takes over.

If your team doesn’t trust the data, or worse, doesn’t have time to interpret it, your observability system isn’t truly “seeing everything”.

That’s why engineers with Staff+ potential treat incident retros and observability reviews like process tuning. They ask:

Who sees which o11y signals, and when?
What incentives drive our attention to o11y signals?
Where does learning from outputs feed back into o11y design?

Small interventions like taking the time to prune unhelpful alerts can have an outsized impact in the long run because they restore the balancing loop between data and actionability.
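
As a rough illustration of what pruning could look like in practice, the sketch below flags alerts that fire often but rarely lead to action. The input format (pairs of alert name and whether anyone acted), the alert names, and the `min_fires` and `max_actionable_ratio` defaults are all hypothetical; a real review would pull this history from your own paging or incident tooling.

```python
from collections import Counter


def prune_candidates(history, min_fires=5, max_actionable_ratio=0.2):
    """Return alert names that fire often but are rarely acted on.

    history: iterable of (alert_name, was_actionable) pairs.
    The defaults are illustrative assumptions, not recommendations.
    """
    fires = Counter()
    acted = Counter()
    for name, was_actionable in history:
        fires[name] += 1
        if was_actionable:
            acted[name] += 1

    return sorted(
        name
        for name, count in fires.items()
        if count >= min_fires and acted[name] / count <= max_actionable_ratio
    )


# Made-up history: one chronically ignored alert, one useful one,
# and one that has not fired enough to judge.
history = (
    [("disk_80pct", False)] * 6
    + [("api_5xx", True)] * 4 + [("api_5xx", False)]
    + [("cpu_spike", False)] * 2
)
candidates = prune_candidates(history)
```

The output is a list of candidates for a human review, not an automatic delete list. Keeping people in that decision is part of the balancing loop.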

This should be your takeaway: reliability improves when observability helps people change how they act on the system, not just see the outputs of their services.

Reliability Enablers
