Reliability Enablers (SREpath)
#59 Who handles monitoring in your team and how?

Monitoring responsibilities vary between organizations, and how your team handles them might differ significantly from how others do, especially companies like Google.

Why many copy Google’s monitoring team setup

  1. Google’s Influence. Google played a key role in defining the concept of software reliability.

  2. Success in Reliability. Few can dispute Google’s ability to achieve high levels of reliability and to share useful ways of improving it in other settings.

    BUT there’s a problem:

  3. It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team.

What is Google’s monitoring approach within teams?

Here’s what Google does:

  • Google assigns one or two people per team to manage monitoring.

  • Even though the monitoring infrastructure is centralized, a dedicated person within the team still handles monitoring.

  • Many organizations use a separate observability team, unlike Google's integrated approach.

If your org is large enough and prioritizes reliability highly enough, you might find it feasible to follow Google’s model to a T.

Otherwise, a centralized team with occasional “embedded x engineer” secondments might be more effective.

Can your team mimic Google’s model?

Here are a few things you should factor in:

Size matters

Google's model works because of its scale and technical complexity. Many organizations don’t have the size, resources, or technology to replicate this.

What are the options for your team?

Dedicated monitoring team (very popular but $$$)

If you have the resources, you might create a dedicated observability team. This might call for a personnel budget of $500k or more, so it’s not something that a startup or SME can easily justify.

Dedicate SREs to monitoring work (effective but difficult to manage)

You might do this on rotation or make an SRE permanently “responsible for all monitoring matters”. A permanent assignment might lead to burnout if it doesn't suit the SRE's goals, while rotation work requires effective planning.

Internal monitoring experts (useful but a hard capability to build)

One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team’s needs. Ideally, this is how monitoring work would get done, but it’s hard to find volunteers across a majority of teams.

Transitioning monitoring from project work to maintenance

Two distinct phases

Initial Setup (the “project”)

SREs may help set up the monitoring/observability infrastructure.

Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.

Post-project phase (“keep the lights on”)

Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?

Who will maintain the monitoring system?

Answer: usually not the same team

After the project phase, maintenance typically falls to a different set of people from the team that did the initial setup.

Options to consider (once again)

  1. Spin up a monitoring/observability team. Create a dedicated team for observability infrastructure.

  2. Take a decentralized approach. Engineers across various teams take on observability roles as part of their regular duties.

  3. Appoint internal monitoring/observability experts. They can take responsibility for monitoring and ensure best practices are followed.

The key thing to remember here is…

Adapt to your organizational context

One size doesn’t fit all

Google's model may not work for everyone. Tailor your approach based on your organization’s specific needs.

The core principle to keep in mind

As long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.

Work according to engineer awareness

If engineers within product and other non-operations teams are aware of monitoring: you can attempt to **decentralize the effort** and involve more team members.

If awareness or interest is low: consider **dedicated observability roles** or an SRE team to ensure monitoring gets the attention it needs.

In conclusion

There’s no universal solution.

Whether you centralize or decentralize monitoring depends on your team’s structure, size, and expertise.

The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.


PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting, i.e. “executive shorthand”. Tell me what you think.

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.