<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Reliability Enablers]]></title><description><![CDATA[Soon, focus will shift to the emerging intersection of resilient and intelligent systems. Because “five nines” won’t survive AI ignorance. Newsletter and podcast inside.]]></description><link>https://read.srepath.com</link><image><url>https://substackcdn.com/image/fetch/$s_!hjhf!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ee1dc2-77bf-4ffa-b056-f66dac8ad0d0_128x128.png</url><title>Reliability Enablers</title><link>https://read.srepath.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 07:01:55 GMT</lastBuildDate><atom:link href="https://read.srepath.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ash P]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[srepath@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[srepath@substack.com]]></itunes:email><itunes:name><![CDATA[Ash Patel]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ash Patel]]></itunes:author><googleplay:owner><![CDATA[srepath@substack.com]]></googleplay:owner><googleplay:email><![CDATA[srepath@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ash Patel]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[More telemetry makes reliability worse (until you fix the loop)]]></title><description><![CDATA[Every reliability engineer eventually learns the same painful truth: you can have a thousand dashboards showing you xyz and still miss the real signal.]]></description><link>https://read.srepath.com/p/more-telemetry-makes-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/more-telemetry-makes-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 23 Dec 2025 15:05:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4dc47b11-b51a-4875-b488-48a59c97282f_1280x869.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every reliability engineer eventually learns the same painful truth: you can have a thousand dashboards showing you <em>xyz</em> and still miss the real signal.</p><p>This might feel like an insurmountable hurdle at first glance.</p><p>One of those &#8220;it is what it is&#8221; situations. 
After all:</p><p>The more data we collect &#8594; the more noise we face &#8594; the less trust we have in our alerts &#8594; the slower we respond &#8594; the more incidents worsen &#8594; the more data we collect to compensate.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZhJL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 424w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 848w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1272w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png" width="907" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:907,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72469,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://read.srepath.com/i/180281648?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZhJL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 424w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 848w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 1272w, https://substackcdn.com/image/fetch/$s_!ZhJL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ec914f-0d24-46fa-b047-6d8bbbcf1813_907x547.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In MIT&#8217;s world of System Dynamics, this noise amplification problem is what we&#8217;d call a <em>reinforcing loop</em>.</p><p>It&#8217;s a spiralling up of information overload as the loop continues to reinforce, or metaphorically snowball, onto itself. But here&#8217;s the thing&#8230;</p><p>Observability (o11y) isn&#8217;t just telemetry.</p><p>It&#8217;s also who interprets, triages, and learns from the telemetry: a <em>balancing loop</em>.</p><p>In a healthy system, every new signal that enters should trigger an equal and opposite stabilizing action, essentially a check-and-balance. That&#8217;s the balancing loop at work.</p><p>For example, when noise increases, teams should automatically slow alert creation or tighten signal thresholds until trust recovers.</p><p>When signal quality improves, they can loosen up again.</p><p>Without that feedback control, the system loses balance, and the painful reinforcing loop that I mentioned earlier takes over.</p><p>If your team doesn&#8217;t trust the data, or worse, doesn&#8217;t have time to translate it, your observability system isn&#8217;t truly &#8220;seeing everything&#8221;.</p><p>That&#8217;s why engineers with Staff+ potential treat incident retros and observability reviews like process tuning. 
They ask:</p><ul><li><p>Who sees which o11y signals, and when?</p></li><li><p>What incentives drive our attention to o11y signals?</p></li><li><p>Where does learning from outputs feed back into o11y design?</p></li></ul><p>Small interventions like taking the time to prune unhelpful alerts can have an outsized impact in the long run because they restore the balancing loop between data and actionability.</p><p>This should be your takeaway: reliability improves when observability helps people modify their impact from using the system, not just seeing the outputs of their services.</p>]]></content:encoded></item><item><title><![CDATA[Real DevOps happens between commits, not pipelines]]></title><description><![CDATA[Your CI/CD runs faster than your feedback.]]></description><link>https://read.srepath.com/p/real-devops-happens-between-commits</link><guid isPermaLink="false">https://read.srepath.com/p/real-devops-happens-between-commits</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 16 Dec 2025 15:01:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7492caa8-4242-48e0-916b-50fd4ea7ac3c_1280x848.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your CI/CD runs faster than your feedback. That&#8217;s not DevOps &#8212; that&#8217;s DevOops.</p><p>Most engineers still picture DevOps as a conveyor belt more than a (software development life)cycle. The ground-level reality of DevOps in most organizations is that code goes in, product comes out, and every new automation step makes the process run smoother and faster.</p><p>That&#8217;s the story most of us were sold and still work toward to this day.</p><p>&#8220;DevOps is efficiency.&#8221;</p><p>It&#8217;s understandable. That&#8217;s how DevOps got operationalised in most places &#8212; as tooling and velocity. But <strong>DevOps wasn&#8217;t conceived to </strong><em><strong>only</strong></em><strong> make delivery faster.</strong></p><p>Patrick Debois, the &#8220;godfather of DevOps&#8221;, never framed DevOps with a singular focus on delivery speed. He framed it through the lens of <em>feedback.</em></p><p>The DevOps Handbook, which he co-authored with Gene Kim et al., talks about DevOps as flow, feedback, and continuous learning.</p><p>That&#8217;s system-dynamics thinking, even if nobody used that term at the time&#8230; and well, even now.</p><p>Debois cared about how software engineering could take advantage of this.</p><p>His thinking focused on:</p><ol><li><p>How fast signals travelled</p></li><li><p>How quickly the drift was corrected, and</p></li><li><p>How learning could compound improvements across teams</p></li></ol><p>Coming back to today, we should not merely be optimizing for delivery speed. We should also optimize for feedback speed.</p><p>If you zoom into any software delivery flow today, for the most part, the commits themselves aren&#8217;t the problem. In fact, they are fast. Pipelines are flowing fast. 
Rollouts are fast.</p><p><strong>What&#8217;s slow is everything </strong><em><strong>between</strong></em><strong> commits:</strong></p><ul><li><p>How long it takes to interpret an alert</p></li><li><p>How long it takes for an insight to become a design change</p></li><li><p>How fast one team&#8217;s learning propagates to another</p></li></ul><p>This &#8220;between commits&#8221; zone is the <strong>rate-limiting step for truly smooth DevOps.</strong></p><p>Here&#8217;s a composite story of the kinds I have heard over the last 3 years alone:</p><blockquote><p>During a large-scale outage, the alert fired instantly, but it took 45 minutes before anyone realised the signal was noise-shaped by a mis-configured dashboard. The root cause was understood the same day, but the design change sat unscheduled for eight weeks. Worse, another team triggered the <em>same</em> failure pattern three months later because the learning never left the postmortem.</p></blockquote><p>Feedback is the layer that determines whether DevOps creates stability or chaos.</p><p>In system-dynamics language, DevOps is a dance between two loops. The:</p><ul><li><p>balancing loop of continuous improvement (detect &#8594; understand &#8594; act) and</p></li><li><p>reinforcing loop of (deploy &#8594; drift &#8594; more deploys).</p></li></ul><p>This zone is where the balancing loop either outruns (stability) or falls behind (chaos) the reinforcing loop. <em>Which is more common in your team or organization?</em></p><p>When your feedback travels slower than your commits, the reinforcing loop wins.</p><p>That&#8217;s when system reliability begins to decay quietly in the background.</p><p>But when your feedback travels <em>faster</em> than your commits, the balancing loop wins.</p><p>That&#8217;s when reliability earns its rightful place as a default system property. Really, it should not be seen as a sociocultural artifact like team effort or heroic effort.</p><p>Optimizing pipelines makes delivery <em>fast.</em> Optimizing feedback makes delivery <em>safe</em>. Doing both gives you real DevOps.</p><p>High-performing teams don&#8217;t ask, &#8220;How do we deploy more often?&#8221;</p><p>They ask, &#8220;How do we <em>learn</em> faster than we deploy?&#8221;</p>]]></content:encoded></item><item><title><![CDATA[Humans, the pesky side of system design]]></title><description><![CDATA[Most reliability engineers hear &#8216;system design&#8217; and picture diagrams wrangling load balancers, queues, and failover zones.]]></description><link>https://read.srepath.com/p/humans-the-pesky-side-of-system-design</link><guid isPermaLink="false">https://read.srepath.com/p/humans-the-pesky-side-of-system-design</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 09 Dec 2025 15:30:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8a42c75e-1817-4622-9132-3d0ced00bd71_640x421.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most reliability engineers hear &#8216;system design&#8217; and picture diagrams wrangling load balancers, queues, and failover zones.</p><p>Fair enough. That&#8217;s the world we were educated in.</p><p>But every design workshop I&#8217;ve sat in this year has ended with very human questions:</p><p><em>&#8220;Who&#8217;ll own this?&#8221;</em></p><p><em>&#8220;How do we know when to alert?&#8221;</em></p><p><em>&#8220;How do we ramp these 37 changes into our flow?&#8221;</em></p><p>Guess what? That&#8217;s system design, too. 
It&#8217;s just the part we never diagram.</p><p>You already know the cognitive load of infra churn&#8230;</p><p>AWS &#8594; Kubernetes &#8594; AI pipelines &#8594; whatever&#8217;s next.</p><p>Across every shift, the engineers who stay ahead are the ones who <strong>shape the system around the system,</strong> so its behaviors support sustainable reliability.</p><p>That&#8217;s not fluff &#8212; that&#8217;s the beginning of Staff+ thinking.</p><p>No one gets promoted for invisible work.</p><p>You advance by showing <em>evidence</em> of potential, and the human side of the system is where it shows early.</p><p>You already know the rituals that humans in your org get involved in. Most orgs run them &#8212; some lightly, some intensely, some religiously.</p><p>Some examples of rituals include postmortems/retrospectives, SLO writing sessions, and incident bridges with <em>all</em> involved parties.</p><p>But there&#8217;s a quiet crisis building up within these rituals.</p><p><strong>Reliability flatlines when rituals replace mechanisms.</strong></p><p>Rituals mean well with the intent to turn responses to events into regular action, so it makes sense that many teams try to improve reliability through rituals.</p><p>Over time, you end up with:</p><ul><li><p>more dashboards</p></li><li><p>more alerts</p></li><li><p>more retros</p></li><li><p>more action items</p></li><li><p>more processes</p></li><li><p>more OKRs</p></li><li><p>more monitoring &#8220;initiatives&#8221;</p></li></ul><p>These things <em>look</em> like progress. They feel responsible.</p><p>But they don&#8217;t change system behaviour.</p><p>That&#8217;s the ritual mindset: <strong>if we do more of the same things, reliability will improve.</strong> Except when it doesn&#8217;t. Now, let&#8217;s look at what happens to a lot of those rituals in the real world.</p><p>And in practice, they often falter:</p><ul><li><p>Retro held, but nothing changes despite learnings</p></li><li><p>Action items logged, never acted on</p></li><li><p>SLOs are written but don&#8217;t fully influence prioritization</p></li><li><p>Everyone &#8220;cares&#8221;, but nobody changes behaviour</p></li><li><p>&#8220;We added alerts!&#8221; (&#8230;and made things worse)</p></li></ul><p>&#129300; <strong>Why don&#8217;t rituals consistently improve reliability?</strong></p><p>Because they don&#8217;t change the parts of the system that <em>produce</em> reliability. Reliability is produced by what MIT&#8217;s System Dynamics group defines as <strong>feedback loops.</strong></p><p>These loops can include delays in judgment and action, handoff frequency, review cadence, and how fast learning propagates. Rituals don&#8217;t touch any of that. 
They create <em>activity</em>, not <em>loop correction</em>.</p><p>You can run a retro every sprint, but if:</p><ul><li><p>the process stays the same,</p></li><li><p>the information flows stay the same,</p></li><li><p>the handoffs stay the same,</p></li><li><p>and the delays stay the same&#8230;</p></li></ul><p><strong>the loop never changes.</strong></p><p>The ritual happens around the system, but nothing shifts inside the system.</p><p>Next time, I&#8217;ll map out the loops underneath all this &#8212; the mechanisms rituals never touch.</p>]]></content:encoded></item><item><title><![CDATA[You (and AI) can't automate reliability away]]></title><description><![CDATA[What if the hardest part of reliability has nothing to do with tooling or automation?]]></description><link>https://read.srepath.com/p/you-and-ai-cant-automate-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/you-and-ai-cant-automate-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 02 Dec 2025 13:03:47 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/179318513/1b715e971fdb5ab78a4c7fd545894f54.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.</em></p><p><strong>Everyone seems to think AI will automate reliability away.</strong> </p><p>I keep hearing the same story: </p><p><em>&#8220;Our tooling will catch it.&#8221;</em> </p><p><em>&#8220;Copilots will reduce operational load.&#8221;</em> </p><p><em>&#8220;Automation will mitigate incidents before they happen.&#8221;</em></p><p>But here&#8217;s a hard truth to swallow: AI only automates the mechanical parts of reliability &#8212; the machine in the machine.</p><p><strong>The hard parts haven&#8217;t changed at all.</strong></p><p>You still need teams with clarity on system boundaries.<br>You still need consistent approaches to resolution.<br>You still need postmortems that drive learning rather than blame.</p><p><strong>AI doesn&#8217;t fix any of that.</strong> If anything, it exposes every organizational gap we&#8217;ve been ignoring. And that&#8217;s exactly why I wanted today&#8217;s guest on.</p><p>Jennifer Petoff is Director of Program &#8202;Management for Google Cloud Platform and Technical Infrastructure education.  Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud Customer engagements.</p><p>Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google&#8217;s original Site Reliability Engineering book from 2016. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQLo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQLo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 424w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 848w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1272w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp" width="232" height="296.96" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:400,&quot;resizeWidth&quot;:232,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Site Reliability Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Site Reliability Engineering" title="Site Reliability Engineering" srcset="https://substackcdn.com/image/fetch/$s_!EQLo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 424w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 848w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1272w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" 
stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Yeah, that one!</p><p>It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives. Here are 5 highlights from our talk:</p><h2>3 issues stifling individual SREs&#8217; work</h2><p>To start, I wanted to know from Jennifer the kinds of <strong>challenges she has seen individual SREs face</strong> when attempting to introduce or reinforce reliability improvements within their teams or the broader organization.</p><p>She categorized these challenges into 3 main categories</p><ol><li><p>Cultural issues (with a look into Westrum&#8217;s typology of organizational culture)</p></li><li><p>Insufficient buy-in from stakeholders</p></li><li><p>Inability to communicate the value of reliability work</p></li></ol><div class="pullquote"><p>Organizations with generative cultures have 30% better organizational performance.</p></div><p>A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book <em>Accelerate</em> is based.</p><p>It showed that organizations with generative cultures have 30% better organizational performance. In other words, you can have the best technology, tools, and processes to get good results, but culture further raises the bar. </p><p>A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.</p><h2>Hands-on is the best kind of training</h2><p>We then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say &#8211; <em><strong>Culture eats strategy for breakfast!</strong></em></p><p>One key example Jennifer gave was the hands-on approach they take at Google. She believes that <strong>adults learn by doing. In other words, SREs</strong> <strong>gain confidence by</strong> <strong>doing hands-on work</strong>. </p><p>Where possible, training programs should move away from passive listening to lectures toward hands-on exercises that mimic real SRE work, especially troubleshooting.</p><p>One specific exercise that Google has built internally is <strong>Simulating Production Breakages.  </strong></p><p>Engineers undergoing that training have a chance to troubleshoot a real system built for this purpose in a safe environment. 
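</p><p>Google&#8217;s internal version of this exercise isn&#8217;t public, so treat the following as a minimal, illustrative sketch of the concept only, written in Python with invented endpoints and failure rates: a deliberately flaky service that trainees probe from the outside while the real production stack stays untouched.</p><pre><code>import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative sketch only, not Google's internal tooling.
# A tiny service that misbehaves on purpose: some requests stall,
# some fail. Trainees see the symptoms, not this file, and practice
# narrowing down the failure mode in a safe sandbox.
class FlakyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        mode = random.choice(["ok", "ok", "ok", "slow", "error"])
        if mode == "slow":
            time.sleep(5)              # simulated dependency stall
        if mode == "error":
            self.send_response(500)    # simulated internal failure
        else:
            self.send_response(200)
        self.end_headers()
        self.wfile.write(b"payload")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), FlakyHandler).serve_forever()</code></pre><p>A drill built in this spirit gives engineers repeated, low-stakes practice at forming and testing hypotheses about a misbehaving system.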
</p><p>The results have been profound, with Jennifer&#8217;s team seeing a tremendous amount of confidence in survey results. </p><blockquote><p>This confidence is focused on job-related behaviors, which when repeated over time reinforce that culture of reliability.</p></blockquote><h2>Reliability is mandatory for <em>everybody</em></h2><p>Another thing Jennifer told me Google did differently was making reliability a mandatory part of <em>every </em>engineer&#8217;s curriculum, not only SREs.</p><blockquote><p>When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that&#8217;s like preaching to the choir. SREs are usually bought into reliability. </p><p>A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google&#8217;s development teams, a challenge an order of magnitude greater than training SREs. </p></blockquote><p>How did they achieve this mandate?</p><ul><li><p>They developed a short and engaging (and mandatory) production safety training</p></li><li><p>That training has now been taken by tens of thousands of Googlers</p></li><li><p>Jennifer attributes this initiative&#8217;s success to how they &#8220;SRE&#8217;d the program&#8221;. </p><p><em>&#8220;We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.&#8221;</em></p></li></ul><p>The result of this massive effort? </p><p>A very respectable 80%+ net promoter score with open-text feedback: &#8220;best required training ever.&#8221;</p><pre><code>What made this program successful is that Jennifer and her team SRE&#8217;d its design and iterative improvement. 

You can learn more about &#8220;<strong>How to SRE anything</strong>&#8221; (from work to life) using her rubric: 
https://www.reliablepgm.com/how-to-sre-anything/</code></pre><h2>Reliability gets rewarded just like feature work</h2><p>Jennifer then talked about how Google mitigates a risk that I think every reliability engineer wishes could be solved at their organization. </p><p>That is, having great reliability work rewarded at the same level as great feature work.</p><p>For development and operations teams alike at Google, this means making sure &#8220;grungy work&#8221; like tech debt reduction, automation, and other activities that improve reliability are rewarded equally to shiny new product features. </p><p>Organizational reward programs that recognize outstanding work typically have committees. These committees not only look for excellent feature development work, but also reward and celebrate foundational activities that improve reliability. </p><p>This is explicitly built into the rubric for judging award submissions.</p><h2>Keep a scorecard of reliability performance</h2><p>Jennifer gave another example of how Google judges reliability performance, but more specifically for SRE teams this time. </p><div class="pullquote"><p>Google&#8217;s <strong>Production Excellence (ProdEx) program</strong> was created in 2015 to <strong>assess and improve production excellence (aka reliability improvements)</strong> across SRE teams.</p></div><p>ProdEx acts like a <strong>central scorecard</strong> <strong>to aggregate metrics from various production health domains</strong> to provide a comprehensive overview of an SRE team&#8217;s health and the reliability of the services they manage. </p><p>Here are some specifics from the program:</p><ul><li><p>Domains include SLOs, on-call workload, alerting quality, and postmortem discipline</p></li><li><p>Reviews are conducted live every few quarters by senior SREs (directors or principal engineers) who are not part of the team&#8217;s direct leadership</p></li><li><p>There is a focus on coaching and accountability without shame (to elicit psychological safety)</p></li></ul><p>ProdEx serves various levels of the SRE organization through:</p><ol><li><p>providing strategic situational awareness regarding organizational and system health to leadership <em>and</em></p></li><li><p>keeping forward momentum around reliability and surfacing team-level issues early to support engineers in addressing them</p></li></ol><h2>Wrapping up</h2><p>Having an inside view of reliability mechanisms within a few large organizations, I know that few are actively doing all &#8212; or sometimes any &#8212; of the reliability enhancers that Google uses and Jennifer has graciously shared with us. </p><p>It&#8217;s time to get the ball rolling. What will you do today to make it happen?</p>]]></content:encoded></item><item><title><![CDATA[Reliability engineers are the best SDLC problem solvers because...]]></title><description><![CDATA[they look at the system with fresh eyes and can spot patterns people too close to the problem have learned to ignore. 
I'll cover how you can do this later in the post.]]></description><link>https://read.srepath.com/p/reliability-engineers-are-the-best</link><guid isPermaLink="false">https://read.srepath.com/p/reliability-engineers-are-the-best</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 18 Nov 2025 12:20:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/13356861-c0b6-4a30-9add-91c738df2d0f_640x427.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One thing I&#8217;ve learned from watching reliability masters doing their magic is that <strong>SREs end up solving the SDLC problems no one else sees &#8212; at least not clearly.</strong></p><p>It&#8217;s not because SREs are better at debugging.<br>It&#8217;s not because SREs are the &#8220;process optimizers&#8221; (that one still surprises people when I say it).</p><p>It&#8217;s because SREs look at the system from a different altitude. </p><p>They can see the parts of the SDLC that quietly shape everything but rarely show up as activities/outcomes in Jira, Git, or dashboards.</p><p>Most engineering teams have been indoctrinated into an industrialized version of DevOps. The daily mantra: <strong>ship new features, open more branches, close more tickets.</strong></p><p>But a surprising amount of reliability pain comes from the parts of the SDLC that don&#8217;t fit neatly into that cycle. The things no one &#8220;owns,&#8221; because they&#8217;re not obviously tied to delivery velocity.</p><p>A few examples you&#8217;ll recognise instantly:</p><ul><li><p><strong>The release cadence frozen in 2022</strong> - everyone remembers <a href="https://www.cnbc.com/2022/11/09/tech-layoffs-2022.html">what kind of year </a><em><a href="https://www.cnbc.com/2022/11/09/tech-layoffs-2022.html">that</a></em><a href="https://www.cnbc.com/2022/11/09/tech-layoffs-2022.html"> was</a></p></li><li><p><strong>The ticket queue that magically moves</strong> because one engineer quietly triages it every morning</p></li><li><p><strong>The API contract that changed three times</strong>, but half the dependent services never updated their call parameters</p></li></ul><p>These things fade into the background because everyone&#8217;s racing to grow services, not slowing down to prune the system with bonsai-level care.</p><p>That&#8217;s where the <em>outsider advantage</em> kicks in. </p><p>Because SREs work across multiple teams and tech stacks, they spot patterns that day-to-day insiders simply stop seeing. </p><p>Not because those engineers lack skill, but because <strong>familiarity blinds you to the system&#8217;s oddities.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.srepath.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Reliability Enablers! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This clicked for me while recently watching a live lecture by William Dalrymple, the Scottish historian known for his work on India&#8217;s colonial period. </p><p>His biggest discoveries came from re-reading old sources differently. </p><p>He reviewed materials that other historians had access to but rarely prioritized, including Persian court chronicles, Maratha records, Portuguese archives, and private letters sent by English sailors.</p><p>He said his <strong>craft isn&#8217;t about hunting for hidden facts. It&#8217;s about noticing the things filtered out over time</strong> by local and foreign historians.</p><p>He compared it to walking through London as a tourist. Suddenly, every red postbox, royal crest, and double-decker bus jumps out at you.</p><p>Residents never see them that way. Their brains auto-classify those details as &#8220;background texture.&#8221;</p><p>Masterful SREs do something similar in the SDLC.</p><p>They notice the &#8220;background texture&#8221; that quietly shapes reliability but has become invisible to the people living inside the system every day.</p><p>Their edge isn&#8217;t extreme technical depth (though a baseline is obviously required).<br><strong>Their edge is that they</strong> <strong>refuse to go numb to the system&#8217;s defaults.</strong></p><p>My own soft landing into reliability looked exactly like this:</p><pre><code>In the early 2010s, I co-founded a startup in a high-stakes finance vertical. Reliability wasn&#8217;t optional. It was the credibility signal that clients judged us on. If the system shook, the business shook. That forces one to notice what others gloss over.

Years later, the healthcare organization I worked in went through a messy public&#8211;private partnership digital-transformation program. You can imagine how well <em>that</em> went. Azure was still new to most of the engineers, incidents piled up, and we kept hitting reliability problems nobody could quite trace.

Somehow I ended up being the one who had to spot what everyone else missed and push it through the chain every week. Not the flashy issues &#8212; the structural ones: workflow drift, mismatched expectations, fuzzy ownership, and architecture shaped by people with PhDs in bureaucracy and technical skills ending with Internet Explorer.</code></pre><p>Here&#8217;s the part engineers sometimes underestimate:<br><strong>Fresh eyes are a skill. You can train this skill.</strong></p><p>A few practices could make a massive difference for your own reliability work:</p><p><strong>1. Become a tourist every week</strong><br>Write yourself a weekly tourist pass that lets you pick a workflow or pipeline at random, and then walk through it like you&#8217;ve never seen it.<br><em>What steps only make sense because you already know the history or how your org works? Because the next person working on this workflow or pipeline might not.</em></p><p><strong>2. Audit the default settings</strong><br>Most regressions result from defaults no one has revisited on a regular cadence, so review the retry logic and thresholds, but also the ownership and handoffs.<br><em>If you say to yourself, &#8220;Why is this the way it is?&#8221;, that&#8217;s your sign to dig deeper.</em></p><p><strong>3. Seek out organizational complacency </strong><br>Identify 3 things your org treats as normal &#8212; the &#8220;that&#8217;s how we&#8217;ve always done it&#8221; &#8212; but really shouldn&#8217;t be. For example, tech debt is easy to blame, but coordination debt is harder to pin down. <em>How does it cause downstream issues?</em></p><div><hr></div><p>Reliability improves the moment you start paying attention to the parts of the SDLC that everyone else filters out. That&#8217;s the real SRE advantage, and it&#8217;s available to anyone who wants to develop it.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.srepath.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Reliability Enablers! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[What a Slovak test engineer taught me about SRE career growth in a post-ZIRP world]]></title><description><![CDATA[This post was inspired by my chance meeting with a Slovak test engineer who had corporates fighting to hire him and sponsor his high-salary migration in a tight job market.]]></description><link>https://read.srepath.com/p/what-a-slovak-test-engineer-taught</link><guid isPermaLink="false">https://read.srepath.com/p/what-a-slovak-test-engineer-taught</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 28 Oct 2025 12:39:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!hjhf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ee1dc2-77bf-4ffa-b056-f66dac8ad0d0_128x128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I can&#8217;t make up a story like this &#8212; and neither can GPT-5 (though it told me to use the em dash in this sentence).</em></p><p>I met a Slovak test engineer recently at a social event. When I go to these meetups, my goal is to unwind &#8212; not to talk about tech, or business, or the business of tech.</p><p>But within twenty minutes, we&#8217;d touched on all of it. He was simply happy to meet socially open people at our weekly pub gathering in my hometown.</p><p>And yet it all came out. </p><p>He&#8217;s a test engineer in the banking sector. I didn&#8217;t want to know more, but everyone else at the table did. So I got pulled into the vortex of systemizing his experience through the familiar frameworks and tropes of our industry.</p><p>Interestingly, he had no formal qualification in computing. No CS degree, minimal certifications, no fancy title. Yet every time a core-banking integration failed, everyone waited for <em>him</em> to show up.</p><p>He&#8217;d been seconded from his Slovak post at a multinational into a complex project serving one of Australia&#8217;s largest companies &#8212; think US$150 billion market cap. What was more interesting than his r&#233;sum&#233; was his <em>approach</em> to the problem. </p><blockquote><p>He did not understand why everyone was jumping up and down, trying to keep him in country for the long-run. I did&#8230;</p></blockquote><p>Projects like this, in the <a href="https://newsletter.pragmaticengineer.com/p/zirp-software-engineers">post-ZIRP world</a>, after years of zero-interest rates and cheap capital fueling endless &#8220;digital transformations&#8221;, now have to justify every line of spend with actual reliability.</p><p>The same tightening is visible in Australia&#8217;s internal tech sector. During the ZIRP decade, domestic headcount grew on the back of cheap money and &#8220;digital-transformation&#8221; budgets that rewarded expansion over efficiency.</p><p>Now, with the cost of capital normalized, those roles are being quietly unwound through redundancies and restructures. 
What remains are the engineers who can <em>stabilize</em> systems &#8212; like our Slovak friend.</p><p>He&#8217;s the kind of engineer who has quietly mastered<strong> context, not code</strong>. Whether at a domain (banking) or a system (multi-service dependencies) level, he seeks beyond the baseline and digs deeper.</p><p>Most of us in the industry, especially those shaped by the traditional IT career pipeline, are products of a linear system<strong>. </strong>We&#8217;re trained early in life to think in progressions<strong>:</strong> clear the exam, earn the certification, reach the next title.</p><p>The whole infrastructure rewards throughput and correctness &#8212; not exploration.</p><p>The Slovak engineer had come up through something very different.</p><p>His early training was hands-on, constraint-driven, the kind you find in vocational settings with:</p><ul><li><p>limited tooling</p></li><li><p>tight budgets</p></li><li><p>systems that had to keep running because no vendor contract could bail you out</p></li></ul><p>He learned by <em>fixing what broke</em>, not by <em>studying what should work and hoping that it would</em>.</p><p>That&#8217;s the real divide I saw that night &#8212; not East versus West, or degree versus no degree &#8212; but <strong>theoretical progression versus practical insight.</strong></p><p>He&#8217;d been forced to see systems for what they really are, and how they behave in reality, not how they&#8217;re diagrammed in a Visio document. </p><blockquote><p>Where many engineers measure growth in certifications, he measured it in <em>failure modes understood</em>.</p></blockquote><p>That&#8217;s why he looked puzzled when people praised him or tried to negotiate hard to keep him. In his world, competence wasn&#8217;t a r&#233;sum&#233; advantage. It was the minimum requirement for keeping the system alive.</p><p>At one point, he squirmed on his hardwood stool as I mentioned the likely negotiations between his seconding firm and the Australian companies trying to secure him permanently.</p><p>I expected pride. Instead, he shrugged. He cared more about what he could do with the system and the problems it threw at him. That was the real excitement for him.</p><p>So how does he <em>system up</em>?</p><p>In every outage call, he wasn&#8217;t asking <em>&#8220;What failed?&#8221;</em></p><p>He was asking, <em>&#8220;Why did the system allow this to fail?&#8221;</em></p><p>That question alone places him a decade ahead of most engineers in leadership capability.</p><p>It&#8217;s easy to romanticize stories like this, but a real shift is happening underneath:</p><div class="pullquote"><p>The industry is rediscovering that <strong>sensemaking of systems beats formal credentials</strong>.</p></div><p>He didn&#8217;t strategize ways to climb up fast; he <em>dug deep </em>and was rewarded with a faster climb up than any certification could deliver. </p><p>He tested through shaky pipelines until they whispered their secrets.</p><p>He understood the underlying system&#8217;s <em>personality </em>&#8212; its hidden assumptions and dependencies built up through years of tech debt. He&#8217;s the kind of engineer high-paying Australian banks are quietly fighting to keep long-term.</p><p>That encounter reminded me of something we often miss in our own careers, especially in the Indian tech ecosystem, which I&#8217;ve become more cognizant of in recent years. 
</p><p><strong>Many engineers are trained to think in</strong> <strong>levels </strong>&#8212; L4 to L5, Senior to Staff, Team Lead to Manager. Every promotion signals a &#8220;level up in capability.&#8221; But does it count in an era when systems literacy is critical in managing incidents?</p><p>Reliability work doesn&#8217;t reward hierarchy. It rewards <strong>systems literacy</strong>: the ability to see across components, people, and processes, and to hold that complexity in your head while you&#8217;re trying to fix things.</p><blockquote><p>&#8220;Leveling up&#8221; is about scope and title.<br>&#8220;Systeming up&#8221; is about building insight and ownership.</p></blockquote><p>It&#8217;s the difference between managing a dashboard and understanding the feedback loop behind it.</p><p>When I run workshops with senior engineers, I don&#8217;t care how many frameworks they&#8217;ve memorized. I care whether they can answer:</p><blockquote><p>&#8220;When the alert fires, what happens next in the real world &#8212; and why?&#8221;</p></blockquote><p>That question separates framework followers from system thinkers.</p><p>The best engineers &#8212; whether from India, the United States, or central Europe &#8212; are no longer measured by tenure, but by how reliably they can <em>stabilize the systems that keep the enterprise &#8212; and everyone&#8217;s salary &#8212; running.</em></p><p>So if you&#8217;re planning your growth in this career of reliability engineering, don&#8217;t just chase a higher level.</p><ol><li><p>Map the system you operate in <em>and</em> </p></li><li><p>Find the failure modes no one else has taken the time to understand.</p></li></ol><p>That&#8217;s how you truly <em>level up</em> in<em> y</em>our SRE career.</p>]]></content:encoded></item><item><title><![CDATA[AWS us-east-1 sneezed, the world felt its breeze]]></title><description><![CDATA[Race conditions 101, a global ripple, and looking at our hidden dependencies. We'll go through questions, like "Why did one region dropping cause global level issues?"]]></description><link>https://read.srepath.com/p/aws-us-east-1-sneezed-the-world-felt</link><guid isPermaLink="false">https://read.srepath.com/p/aws-us-east-1-sneezed-the-world-felt</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Fri, 24 Oct 2025 12:30:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c1c4ff42-4221-460d-9787-11888c79cd4e_640x426.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As you are already aware, on October 20, 2025, AWS experienced its <strong>longest outage in a decade</strong>. You&#8217;d be forgiven for assuming it was a data center meltdown, or one of the usual suspects: a cyberattack, catastrophic hardware failure, or misconfigured network deployment. </p><p><em>It was none of those.</em></p><p>It was something more ordinary, but at the same time more hidden. A <strong>race condition in DNS management </strong>brought the whole house down. It&#8217;s the kind of gremlin that lives quietly inside distributed systems for years, waiting for the perfect moment to strike.</p><p>When us-east-1 (Northern Virginia) sneezed, the internet &#8212; from Seattle to Melbourne &#8212; caught a cold. 
So what actually happened, and what can teams learn <em>without</em> grumbling through another tick-box incident retro?</p><div><hr></div><h3><strong>So what actually happened on October 20th?</strong></h3><pre><code>~14:10 UTC</code></pre><p>A change to Amazon&#8217;s internal DNS configuration triggered a <strong>race condition between replication processes</strong> responsible for updating authoritative name servers. For a few minutes, different replicas disagreed on which version of a DNS zone file was &#8220;true.&#8221; That mismatch rippled outward. It was cached by resolvers, retried by clients, and multiplied by automation loops expecting stable DNS.</p><pre><code>~14:20 UTC</code></pre><p>Services that depended on AWS&#8217;s control plane began timing out. Route 53, EC2 instance metadata, and IAM authentication all started wobbling hard and in the process, took out a wide slice of internet plumbing. Everything from CI/CD pipelines to food-delivery apps and payment gateways slowed or failed outright.</p><pre><code>by 15:00 UTC</code></pre><p>AWS engineers isolated the offending processes within the us-east-1 region and began rolling back by 15:00 UTC. But if you&#8217;ve ever made DNS changes, you&#8217;d know that DNS caches live in the wild, and so, recovery was protracted. For some users, the internet, their conduit to the outside world, vanished for 20-odd hours.</p><p>Now, let&#8217;s address some questions that arise from this event:</p><div><hr></div><h3><strong>First, what is a race condition in DNS management, and why does it matter?</strong></h3><p>Think of a race condition as two parts of a system trying to update the same record at the same time, and the system saying, &#8220;Both of you go ahead and update this record simultaneously. I&#8217;ll sort it out later.&#8221;</p><p>In DNS terms, that can mean two replication processes trying to modify or propagate zone data simultaneously, where <strong>timing rather than logic decides who wins. </strong>For a few milliseconds, caches see stale or inconsistent entries, and suddenly, half the internet can&#8217;t find the right IP for a perfectly healthy service.</p><p>Normally, DNS heals quickly from this. But when the affected zone is AWS&#8217;s<em> own internal</em> DNS infrastructure &#8212; the guidepost that every AWS service uses to find the other &#8212; you&#8217;re well past a rapid self-healing situation.</p><p>This is the <strong>digital equivalent of deleting your city&#8217;s street signs</strong> while heavy out-of-town traffic is streaming in.  That&#8217;s why this outage hurt: it wasn&#8217;t a &#8220;catastrophic&#8221; failure &#8212; it was a consistency failure. </p><div><hr></div><h3><strong>Okay, but why did a Virginia outage break things in faraway regions like Australia?</strong></h3><p>Because <strong>not all dependencies are local</strong>, even if your data is. Many AWS control-plane services &#8212; DNS, API endpoints, authentication flows &#8212; are partly centralized in us-east-1, where AWS runs out of 100 or so computer warehouses. </p><p>So it&#8217;s no surprise to learn that when that region sneezes, the rest of the world gets a cold. Your app mirrored in Sydney may never touch Virginia&#8217;s compute nodes, but if it needs to resolve an AWS-managed domain, for example, it&#8217;s now more susceptible to a single point of failure (SPoF). 
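</p><p>To make the race-condition mechanics described above concrete, here is a minimal, illustrative sketch in Python. The record name and addresses are invented, and the shared dictionary is a toy, not AWS&#8217;s actual replication machinery; it only shows how timing, rather than logic, decides which write wins:</p><pre><code>import random
import threading
import time

# Illustrative sketch only: a toy zone record updated by two
# replication workers with no coordination. The final value depends
# on scheduling, not on which update is "right".
zone = {"internal.endpoint.example": "10.0.0.1"}

def replicate(worker, new_ip):
    seen = zone["internal.endpoint.example"]    # read
    time.sleep(random.random() / 100)           # read-modify-write window
    zone["internal.endpoint.example"] = new_ip  # write; may clobber the other worker
    print(worker, "wrote", new_ip, "after seeing", seen)

a = threading.Thread(target=replicate, args=("replica-A", "10.0.0.2"))
b = threading.Thread(target=replicate, args=("replica-B", "10.0.0.3"))
a.start(); b.start(); a.join(); b.join()

print("record now points at:", zone["internal.endpoint.example"])  # differs run to run</code></pre><p>In a toy script, that ambiguity is a curiosity. In the internal DNS that every AWS service uses to find every other service, the same last-writer-wins behaviour becomes a logical single point of failure.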
</p><p>And that&#8217;s why popular banking tools like PayID in Australia could not function for some time.</p><div><hr></div><h3><strong>What kind of SPoF is this?</strong></h3><p>A logical one. The kind you can&#8217;t rack-and-stack. Engineering organizations spend fortunes removing physical SPOFs from their architectures. Things like multiple AZs, multi-region DR&#8212;but ignore the logical couplings baked into platform design. </p><p>DNS, IAM, and ancillary services are the invisible backbone of cloud-based software. They don&#8217;t look like dependencies until they break, and then you realize half your automation is pointing at one set of nameservers in a <em>down status</em> region.</p><div><hr></div><h3><strong>Could companies have avoided this?</strong></h3><p>Depends on what &#8220;avoid&#8221; means. Some engineers argued you could just use multiple providers so everyone doesn&#8217;t fail together. Others pointed out that testing, syncing data, and keeping a cross-provider setup functional costs more time and money than the occasional global outage. They&#8217;re both right. </p><p>The <strong>trade-off is between the</strong> <strong>cost of preparation</strong> <strong>and the</strong> <strong>cost of surprise</strong>. For many orgs, the latter still <em>feels</em> cheaper &#8212; until they&#8217;re on the wrong side of it.</p><div><hr></div><h3><strong>Why does everyone still rely on us-east-1?</strong></h3><p>Because that&#8217;s where AWS started, and inertia drives dependency. New services often launch there first. Docs default to it. SDKs assume it. It&#8217;s Conway&#8217;s Law of cloud regions: your architecture mirrors the provider&#8217;s organizational chart. As long as AWS itself treats us-east-1 as the &#8220;mothership,&#8221; customers will have to as well.</p><div><hr></div><h3><strong>What blind spots about our system of work does this reveal?</strong></h3><p>Resilience isn&#8217;t just about adding on extra regions for redundancy. It&#8217;s also about the conversations teams never have:</p><p><em>Who owns DNS? Who tests failover? Which global endpoints does our CI/CD depend on?</em></p><p>You can&#8217;t buy that awareness. You need to practice it &#8212; through game-days, dependency mapping, and post-incident reviews that ask &#8220;what assumptions failed?&#8221; instead of &#8220;who missed an alert?&#8221;. I know it sounds obvious, but how robust is your practice in these areas?</p><div><hr></div><h3><strong>Should teams actually do something about this?</strong></h3><p>Yes &#8212; but not as much as you&#8217;d think. You don&#8217;t need to make every service bulletproof. You just need to know which ones matter most and how they&#8217;ll behave when your cloud infrastructure starts shaking again. </p><div><hr></div><h2><strong>3 takeaways for practical engineers</strong></h2><ol><li><p><strong>Map hidden dependencies.</strong> Every cloud architecture has ghosts in the control plane &#8212; learn to know where yours live.</p></li><li><p><strong>Design for graceful degradation.</strong> Read-only mode beats &#8220;down for maintenance&#8221; pages, or worse, uncommunicated outages.</p></li><li><p><strong>Treat provider outages as mirrors.</strong> They reflect your own coupling, not just theirs.</p></li></ol><div><hr></div><h2><strong>Final thoughts</strong></h2><p>This wasn&#8217;t intended as a teardown of failure. It was designed to act as a reminder that complexity doesn&#8217;t collapse cleanly. 
For all our talk about multi-AZ, multi-region, and multi-cloud, most systems are still one race condition away from a very bad day. </p><p>Sometimes the most reliable fix isn&#8217;t more redundancy &#8212; it&#8217;s more insight.</p>]]></content:encoded></item><item><title><![CDATA[#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran]]></title><description><![CDATA[Listen now | Dave O'Connor has been an SRE leadership practitioner and coach for many years. Prior to this, he was an SRE and reliability leader at the director-level at Google for close to 16 years.]]></description><link>https://read.srepath.com/p/ex-googler-on-driving-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/ex-googler-on-driving-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 15 Jul 2025 13:05:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165789886/9c56332d942071a8f50da4da198173da.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>A new or growing SRE team. A copy of the book. A company that says it cares about reliability. <em>What happens next? </em>Usually&#8230; not much.</p><p>In this episode, I sit down with <strong>Dave O&#8217;Connor</strong>, <strong>a 16-year Google SRE veteran</strong>, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in.</p><p>You might know him for his self-deprecating wit and legendary USENIX blurb about being &#8220;complicit in the development of the SRE function.&#8221;</p><p>This one&#8217;s a treat &#8212; less &#8220;here&#8217;s a shiny new tool&#8221; and more &#8220;here&#8217;s what reliability <em>actually</em> looks like when you&#8217;ve seen it all.&#8221;</p><p>&#10024; <em>No vendor plugs from Dave at all, just a good old-fashioned chat about what works and what doesn&#8217;t.</em></p><p>Here&#8217;s what we dive into:</p><ul><li><p><strong>The adoption trap</strong>: Why SRE efforts often fail before they begin&#8212;especially when new hires care more about reliability than the org ever intended.</p></li><li><p><strong>The SRE book dilemma</strong>: Dave&#8217;s take on why following the SRE book chapter-by-chapter is a trap for most companies (and what to do instead).</p></li><li><p><strong>The cost of &#8220;caring too much&#8221;</strong>: How engineers burn out trying to force reliability into places it was never funded to live.</p></li><li><p><strong>You build it, you run it (but should you?)</strong>: Not everyone&#8217;s cut out for incident command&#8212;and why pretending otherwise sets teams up to fail.</p></li><li><p><strong>Buying vs. building</strong>: The real reason even conservative enterprises are turning into software shops &#8212; and the reliability nightmare that follows.</p></li></ul><p>We also discuss the evolving role of reliability in organizations today, from being mistaken for &#8220;just ops&#8221; to becoming a strategic investment (when done right).</p><p>Dave's seen the waves come and go in SRE &#8212; and he's still optimistic. That alone is worth a listen.</p>]]></content:encoded></item><item><title><![CDATA[#66 - Unpacking 2025 SRE Report’s Damning Findings ]]></title><description><![CDATA[This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings. 
Sebastian joined me for this episode so you know it'll have great insights.]]></description><link>https://read.srepath.com/p/unpacking-2025-sre-reports-damning</link><guid isPermaLink="false">https://read.srepath.com/p/unpacking-2025-sre-reports-damning</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 01 Jul 2025 01:10:26 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165791277/d94303c52fc563110b9a2ed26846b24f.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I know it&#8217;s already six months into 2025, but we recorded this almost three months ago. I&#8217;ve been busy with my foray into the world of tech consulting and training &#8212;and, well, editing these podcast episodes takes time and care.</p><p>This episode was prompted by the <strong>2025 Catchpoint SRE Report</strong>, which dropped some damning but all-too-familiar findings:</p><ul><li><p><strong>53% of orgs still define reliability as uptime only</strong>, ignoring degraded experience and hidden toil</p></li><li><p><strong>Manual effort is creeping back in</strong>, reversing five years of automation gains</p></li><li><p><strong>41% of engineers feel pressure to ship fast</strong>, even when it undermines long-term stability</p></li></ul><p>To unpack what this actually means inside organizations, I sat down with <strong>Sebastian Vietz</strong>, Director of Reliability Engineering at Compass Digital and co-host of the Reliability Enablers podcast. </p><p>Sebastian doesn&#8217;t just talk about technical fixes &#8212; he focuses on the organizational frictions that stall change, burn out engineers, and leave &#8220;reliability&#8221; as a slide deck instead of a lived practice.</p><p>We dig into:</p><ul><li><p>How SREs get stuck as messengers of inconvenient truths</p></li><li><p>What it really takes to move from advocacy to adoption &#8212; without turning your whole org into a cost center</p></li><li><p>Why tech is more like <em>milk</em> than <em>wine</em> (Sebastian explains)</p></li><li><p>And how SREs can strengthen&#8212;not compete with&#8212;security, risk, and compliance teams</p></li></ul><p>This one&#8217;s for anyone tired of reliability theatrics. No kumbaya around K8s here. Just an exploration of the messy, human work behind making systems and teams more resilient.</p>]]></content:encoded></item><item><title><![CDATA[#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability]]></title><description><![CDATA[In most SaaS, 99.9% uptime gets you promoted. In critical infrastructure like the energy sector, it gets flagged as a failure mode.]]></description><link>https://read.srepath.com/p/65-in-critical-systems-999-isnt-reliable</link><guid isPermaLink="false">https://read.srepath.com/p/65-in-critical-systems-999-isnt-reliable</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 17 Jun 2025 13:05:09 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165835533/0767760aa43e11aa1fa534b9bf7af421.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Most teams talk about reliability with a margin for error. &#8220;What&#8217;s our SLO? What&#8217;s our budget for failure?&#8221; </p><p>But in the energy sector? <strong>There is no acceptable downtime. 
Not even a little.</strong></p><p>In this episode, I talk with <strong>Wade Harris, Director of FAST Engineering in Australia</strong>, who&#8217;s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.</p><p>What makes this episode different is that Wade isn&#8217;t a reliability engineer by title, but it&#8217;s baked into everything his team touches. And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.</p><p>We cover:</p><ul><li><p>Why <strong>100% uptime is the minimum bar</strong>, not a stretch goal</p></li><li><p>How the rise of renewables has <strong>increased system complexity</strong> &#8212; and what that means for monitoring</p></li><li><p>Why <strong>bespoke integration and SCADA spaghetti</strong> are still normal (and here to stay)</p></li><li><p>The <strong>reality of cloud risk</strong> in critical infrastructure (&#8220;the cloud is just someone else&#8217;s computer&#8221;)</p></li><li><p>What software engineers need to understand if they want their products used in serious environments</p></li></ul><p>This isn&#8217;t about observability dashboards or DevOps rituals. This is reliability <strong>when the lights go out and people risk getting hurt</strong> if you get it wrong.</p><p>And it&#8217;s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what.</p>]]></content:encoded></item><item><title><![CDATA[A video on why reliability doesn't scale like it should]]></title><description><![CDATA["Why reliability work gets sidelined &#8212; even when everyone says it matters" (A takeaway from a former Google SRE Director featured in this video)]]></description><link>https://read.srepath.com/p/a-video-on-why-reliability-doesnt</link><guid isPermaLink="false">https://read.srepath.com/p/a-video-on-why-reliability-doesnt</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 01 Apr 2025 15:21:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/nmW-IrzAKas" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hey SRE friend,</p><p>There&#8217;s a good chance I don&#8217;t know much about you &#8212; maybe not even your first name &#8212; thanks to how newsletters work these days.</p><p>But I do know you&#8217;re interested in reliability engineering. And chances are, you&#8217;re doing that work somewhere that isn&#8217;t exactly MAANG or a VC-fueled startup.</p><p>That&#8217;s why I made this research-backed video. </p><p>It looks at why reliability efforts stall in most orgs &#8212; with support from interviews, journal references, and real-world examples.</p><div id="youtube2-nmW-IrzAKas" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;nmW-IrzAKas&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/nmW-IrzAKas?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>I&#8217;d love to know if it resonates with your experience.</p><p>&#8212; Ash</p><p><em>P.S. This is still in preview mode &#8212; unlisted for now and shared only with subscribers and collaborators. 
Feel free to pass it on to teammates if it sparks something useful.</em></p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://read.srepath.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Reliability Enablers. Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[#64 - Using AI to Reduce Observability Costs]]></title><description><![CDATA[Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.]]></description><link>https://read.srepath.com/p/64-using-ai-to-reduce-observability</link><guid isPermaLink="false">https://read.srepath.com/p/64-using-ai-to-reduce-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 28 Jan 2025 14:03:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/155579851/21d55b415ca03894981571709c73bff5.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.</em></p><div><hr></div><p>It's been a hot minute since the last episode of the Reliability Enablers podcast.</p><p>Sebastian and I have been working on a few things in our realms. On a personal <em>and </em>work front, I&#8217;ve been to over 25 cities in the last 3 months and need a breather.</p><p>Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, working on the cutting edge of o11y to help reduce costs from spiraling out of control. </p><p>(To the skeptics, he did not pay me for this episode)</p><p>Here&#8217;s an AI-generated summary of what you can expect in our conversation:</p><p>In this conversation, we explore cutting-edge approaches to FinOps i.e. cost optimization for observability. </p><p>You'll hear about three pressing topics:</p><ol><li><p><strong>Managing Tool Sprawl</strong>: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value.</p></li><li><p><strong>Reducing Observability Costs</strong>: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics.</p></li><li><p><strong>AI for Observability Decisions</strong>: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions.</p></li></ol><p>We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools. 
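</p><p><em>Not from the episode, but as a rough illustration of what &#8220;uncovering cost hotspots&#8221; can look like in practice: the sketch below ranks metric series by ingest volume and flags the ones nothing references. The metric names, the volumes, and the assumption that you already have a list of series referenced by dashboards or alerts are all invented for the example.</em></p><pre><code># Toy hotspot hunt: sort metric series by ingest volume and flag the ones
# that no dashboard or alert references. Those are candidates for review.
ingest_bytes = {                                   # made-up series and volumes
    "http_request_duration_bucket": 9_200_000_000,
    "debug_cache_shard_age": 4_100_000_000,
    "jvm_gc_pause_seconds": 310_000_000,
}
referenced = {"http_request_duration_bucket", "jvm_gc_pause_seconds"}

for name, size in sorted(ingest_bytes.items(), key=lambda kv: kv[1], reverse=True):
    status = "keep" if name in referenced else "REVIEW: unreferenced"
    print(f"{size / 1e9:7.1f} GB  {name:32} {status}")</code></pre><p>Crude as it is, that kind of ranking is often where the cost conversation starts.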
</p><p>Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability.</p>]]></content:encoded></item><item><title><![CDATA[#63 - Does "Big Observability" Neglect Mobile?]]></title><description><![CDATA[Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability.]]></description><link>https://read.srepath.com/p/63-mobile-apps-and-how-observability</link><guid isPermaLink="false">https://read.srepath.com/p/63-mobile-apps-and-how-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 12 Nov 2024 13:03:12 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151260045/334ab290bb5ba3269054b9dba051f92d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he&#8217;s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. </p><p></p><ol><li><p><strong>Career Journey and Current Role</strong>: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.</p></li><li><p><strong>Specialization in Mobile Observability</strong>: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.</p></li><li><p><strong>Gap in Mobile Observability</strong>: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.</p></li><li><p><strong>Motivation for User-Centric Tools</strong>: Leaving &#8220;big observability&#8221; to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.</p></li><li><p><strong>Mobile's Role as a Brand Touchpoint</strong>: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.</p></li><li><p><strong>Challenges in Measuring Mobile Reliability</strong>: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.</p></li><li><p><strong>Observability Over-Focused on Backend Systems</strong>: Andrew points out that &#8220;big observability&#8221; has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.</p></li><li><p><strong>Lack of Mobile Engineering Leadership in Observability</strong>: Reflecting on a former Meta product manager&#8217;s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. 
This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.</p></li><li><p><strong>OpenTelemetry&#8217;s Limitations in Mobile</strong>: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don&#8217;t align with traditional time-based observability.</p></li><li><p><strong>SREs as Connective Tissue</strong>: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences&#8212;a critical factor in retaining app users.</p></li><li><p><strong>Amazon&#8217;s Operational Readiness Review</strong>: Drawing from his experience at AWS, Andrew values Amazon&#8217;s practice of operational readiness reviews before launching new services. These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.</p></li><li><p><strong>Shifting Focus to &#8220;Answerability&#8221; in Observability</strong>: For Andrew, the goal of observability should evolve toward &#8220;answerability,&#8221; where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[#62 - Early Youtube SRE shares Modern Reliability Strategy]]></title><description><![CDATA[Hear about Andrew Fong's thoughts on modern SRE as an early employee at Youtube, VP of Infra @ Dropbox, co-founder of a cloud infra startup and Senior Director of Engineering at Databricks.]]></description><link>https://read.srepath.com/p/62-early-youtube-sre-shares-modern</link><guid isPermaLink="false">https://read.srepath.com/p/62-early-youtube-sre-shares-modern</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 05 Nov 2024 13:40:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151182017/df3b19cd2e88d6c3cefe0b329d5b3e86.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Andrew Fong&#8217;s take on engineering cuts through the usual role labels, urging teams to start with the problem they&#8217;re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. </p><p>In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It&#8217;s a values-first, practical approach to tackling tough challenges that engineers face every day.</p><p>Here&#8217;s a slightly deeper dive into the concepts we discussed:</p><ol><li><p><strong>Career and Evolution in Tech</strong>: Andrew shares his journey through various roles, from early SRE at Youtube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive experience in infrastructure through three distinct eras of the internet. 
He emphasized the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.</p></li><li><p><strong>Building Prodvana and the Future of SRE</strong>: As CEO of startup, Prodvana, they're focused on an "intelligent delivery system" designed to simplify production management for engineers, addressing cognitive overload. They highlight SRE as a field facing new demands due to AI, discussing insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishing it from "web three" hype, and affirming that while AI will transform SRE, it will not eliminate it.</p></li><li><p><strong>Challenges of Migration and Integration</strong>: Reflecting on experiences at YouTube post-acquisition by Google, the speaker discusses the challenges of migrating YouTube&#8217;s infrastructure onto Google&#8217;s proprietary, non-thread-safe systems. This required extensive adaptation and &#8220;glue code,&#8221; offering insights into the intricacies and sometimes rigid culture of Google&#8217;s engineering approach at that time.</p></li><li><p><strong>SRE&#8217;s Shift Toward Reliability as a Core Feature</strong>: The speaker describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. They emphasize that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.</p></li><li><p><strong>Organizational Culture and Leadership Influence</strong>: Leadership&#8217;s role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. The speaker advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response.</p></li><li><p><strong>Outcome-Focused Work Over Titles</strong>: Emphasis on assembling the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from focusing on outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.</p></li><li><p><strong>Engineers as Problem Solvers</strong>: Engineers, especially natural ones, generally resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic engineers like Steve Jobs valued versatility over predefined roles.</p></li><li><p><strong>Culture as Core Values</strong>: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox's infrastructure culture emphasized being a &#8220;force multiplier&#8221; to sustain product velocity, an approach that ensured values were integrated into every decision.</p></li><li><p><strong>Balancing SRE and Platform Priorities</strong>: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs. 
Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.</p></li><li><p><strong>Strategic Trade-Offs in Smaller Orgs</strong>: In smaller companies with limited resources, leaders often face challenges balancing cost, reliability, and other objectives within single roles. It's advised to sequence these priorities rather than burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.</p></li><li><p><strong>DevOps as a Philosophy</strong>: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.</p></li><li><p><strong>Focus Investments for Long-Term Gains</strong>: Strategic technology investments, even if they might temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this as essential for future reliability.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[#61 Scott Moore on SRE, Performance Engineering, and More]]></title><description><![CDATA[Scott's got a few interesting things to say about these topics and software operations in general!]]></description><link>https://read.srepath.com/p/61-scott-moore-on-sre-performance</link><guid isPermaLink="false">https://read.srepath.com/p/61-scott-moore-on-sre-performance</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 22 Oct 2024 11:24:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150400932/37055ad2597cbf99d5955d8cca950aa0.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p></p>]]></content:encoded></item><item><title><![CDATA[#60 How to NOT fail in Platform Engineering]]></title><description><![CDATA[Ankit Wal from ThoughtWorks Asia Pacific gave me the inside word on this hot topic]]></description><link>https://read.srepath.com/p/60-how-to-not-fail-in-platform-engineering</link><guid isPermaLink="false">https://read.srepath.com/p/60-how-to-not-fail-in-platform-engineering</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 01 Oct 2024 13:01:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149602222/f1990dbfa464f6d47fb1626c9cdd62db.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Here&#8217;s what we covered:</p><h3>Defining Platform Engineering</h3><ul><li><p><strong>Platform engineering</strong>: Building compelling internal products to help teams reuse capabilities with less coordination.</p></li><li><p><strong>Cloud computing connection</strong>: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.</p></li></ul><h3>Ankit&#8217;s career journey</h3><ul><li><p>Didn't choose platform engineering; it found him.</p></li><li><p>Early start in programming (since age 11).</p></li><li><p>Transitioned from a product engineer mindset to building internal tools and platforms.</p></li><li><p>Key experience across startups, the public sector, unicorn companies, and private cloud projects.</p></li></ul><h3>Singapore Public Sector Experience</h3><ul><li><p><strong>Public sector</strong>: Highly advanced digital services (e.g., identity services for tax, housing).</p></li><li><p><strong>Exciting environment</strong>: 
Software development in Singapore&#8217;s public sector is fast-paced and digitally progressive.</p></li></ul><h3>Platform Engineering Turf Wars</h3><ul><li><p><strong>Turf wars</strong>: Debate among DevOps, SRE, and platform engineering.</p><ul><li><p><strong>DevOps</strong>: Collaboration between dev and ops to think systemically.</p></li><li><p><strong>SRE</strong>: Operations done the software engineering way.</p></li><li><p><strong>Platform engineering</strong>: Delivering operational services as internal, self-service products.</p></li></ul></li></ul><h3>Dysfunctional Team Interactions</h3><ul><li><p><strong>Issue</strong>: Requiring tickets to get work done creates bottlenecks.</p><ul><li><p><strong>Ideal state</strong>: Teams should be able to work autonomously without raising tickets.</p></li><li><p><strong>Spectrum of dysfunction</strong>: From one ticket for one service to multiple tickets across teams leading to delays and misconfigurations.</p></li></ul></li></ul><h3>Quadrant Model (Autonomy vs. Cognitive Load)</h3><ul><li><p><strong>Challenge</strong>: Balancing user autonomy with managing cognitive load.</p></li><li><p><strong>Goal</strong>: Enable product teams with autonomy while managing cognitive load.</p></li><li><p><strong>Solution</strong>: Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently.</p><h3>How it pans out</h3><ul><li><p><strong>Low autonomy, low cognitive load</strong>: Dependent on platform teams but a simple process.</p></li><li><p><strong>Low autonomy, high cognitive load</strong>: Requires interacting with multiple teams and understanding technical details (worst case).</p></li><li><p><strong>High autonomy, high cognitive load</strong>: Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation.</p></li><li><p><strong>High autonomy, low cognitive load</strong>: Ideal situation&#8212;teams get what they need quickly without detailed knowledge.</p></li></ul></li></ul><h3>Shift from Product Thinking to Cognitive Load</h3><ul><li><p><strong>Cognitive load focus</strong>: More important than just product thinking&#8212;consider the human experience when using the system.</p></li><li><p><strong>Team Topologies</strong>: Mentioned as a key reference on this concept of cognitive load management.</p></li></ul><h3>Platform as a Product Mindset</h3><ul><li><p><strong>Collaboration</strong>: Building the platform in close collaboration with initial users (pilot teams) is crucial for success.</p></li><li><p><strong>Product Management</strong>: Essential to have a product manager or team dedicated to communication, user journeys, and internal marketing.</p></li></ul><h3>Self-Service as a Platform Requirement</h3><ul><li><p><strong>Definition</strong>: Users should easily discover, understand, and use platform capabilities without human intervention.</p></li><li><p><strong>User Testing</strong>: Watch how users interact with the platform to understand stumbling points and improve the self-service experience.</p></li></ul><h3>Platform Team Cognitive Load</h3><ul><li><p><strong>Burnout Prevention</strong>: Platform engineers need low cognitive load as well. 
Moving from a reactive (ticket-based) model to a proactive, self-service approach can reduce the strain.</p></li><li><p><strong>Proactive Approach</strong>: Self-service models allow platform teams to prioritize development and avoid being overwhelmed by constant requests.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[#59 Who handles monitoring in your team and how?]]></title><description><![CDATA[Monitoring responsibilities vary between organizations, and how your team handles them might differ significantly from others, especially companies like Google.]]></description><link>https://read.srepath.com/p/59-who-handles-monitoring-in-your</link><guid isPermaLink="false">https://read.srepath.com/p/59-who-handles-monitoring-in-your</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 24 Sep 2024 11:35:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149286810/e9971afe5b1fd5e3e4822b8c8923c90a.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h2>Why many copy Google&#8217;s monitoring team setup</h2><ol><li><p><strong>Google&#8217;s Influence. </strong>Google played a key role in defining the concept of software reliability.</p></li><li><p><strong>Success in Reliability.</strong> Few can dispute Google&#8217;s ability to ensure high levels of reliability <em>and</em> its ability to share useful ways to improve it in other settings</p><p></p><p><em><strong>BUT there&#8217;s a problem:</strong></em></p></li><li><p><strong>It&#8217;s not always replicable.</strong> While Google's practices are admired, they may not be a perfect fit for every team.</p></li></ol><h2>What is Google&#8217;s monitoring approach within teams?</h2><p>Here&#8217;s the thing that Google does:</p><ul><li><p>Google assigns one or two people per team to manage monitoring.</p></li><li><p>Even with centralized infrastructure, a dedicated person handles monitoring.</p></li><li><p>Many organizations use a separate observability team, unlike Google's integrated approach</p></li></ul><p>If your org is large enough <em>and</em> prioritizes reliability highly enough, you might find it feasible to follow Google&#8217;s model to the tee. </p><p>Otherwise, a centralized team with occasional &#8220;embedded x engineer&#8221; secondments might be more effective.</p><h2>Can your team mimic Google&#8217;s model?</h2><p>Here are a few things you should factor in:</p><h3>Size matters</h3><p>Google's model works because of its scale and technical complexity. Many organizations don&#8217;t have the size, resources, or technology to replicate this.</p><h3>What are the options for your team?</h3><h4>Dedicated monitoring team (very popular but $$$)</h4><p>If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget so it&#8217;s not something that a startup or SME can easily justify. </p><h4>Dedicate SREs to monitoring work (effective but difficult to manage)</h4><p>You might do this on rotation or make an SRE permanently &#8220;responsible for all monitoring matters&#8221;. Putting SREs on permanent tasks might lead to burnout as it might not suit their goals, and rotation work requires effective planning.</p><h4>Internal monitoring experts (useful but hard capability)</h4><p>One or more engineers within teams could take on monitoring/observability responsibilities as needed and support the team&#8217;s needs. This should be how we get monitoring work done, but it&#8217;s hard to get volunteers across a majority of teams. 
</p><h2>Transitioning monitoring from project work to maintenance</h2><h3>2 distinct phases</h3><h4>Initial Setup (the &#8220;project&#8221;) </h4><p>SREs may help set up the monitoring/observability infrastructure. </p><p>Since they have breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.</p><h4>Post-project phase (&#8220;keep the lights on&#8221;)</h4><p>Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?</p><h3>Who will maintain the monitoring system?</h3><h4>Answer: usually not the same team</h4><p>After the project phase, a new set of people&#8212;often different from the original team&#8212;typically handles maintenance.</p><h4>Options to consider (once again)</h4><ol><li><p><strong>Spin up a monitoring/observability team.</strong> Create a dedicated team for observability infrastructure.</p></li><li><p><strong>Take a decentralized approach. </strong>Engineers across various teams take on observability roles as part of their regular duties.</p></li><li><p><strong>Internal monitoring/observability experts.</strong> They can take responsibility for monitoring and ensure best practices are followed.</p></li></ol><p>The key thing to remember here is&#8230;</p><h3>Adapt to Your Organizational Context</h3><h4>One size doesn&#8217;t fit all</h4><p>Google's model may not work for everyone. Tailor your approach based on your organization&#8217;s specific needs.</p><h4>The core principle to keep in mind</h4><p>As long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.</p><h4>Work according to engineer awareness</h4><p><em><strong>If engineers within product and other non-operations teams are aware of monitoring:</strong> </em>You can attempt to <strong>decentralize the effort</strong> and involve more team members.</p><p><em><strong>If awareness or interest is low:</strong></em> consider <strong>dedicated observability roles</strong> or an SRE team to ensure monitoring gets the attention it needs.</p><h2>In conclusion</h2><p>There&#8217;s no universal solution. </p><p>Whether you centralize or decentralize monitoring depends on your team&#8217;s structure, size, and expertise. </p><p>The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.</p><div><hr></div><p>PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting i.e. &#8220;executive short-hand&#8221;. Tell me what you think.</p>]]></content:encoded></item><item><title><![CDATA[#58 Fixing Monitoring's Bad Signal-to-Noise Ratio]]></title><description><![CDATA[Sebastian and I looked further into common pitfalls in monitoring. A major issue is the poor signal-to-noise ratio of data. This often results from having too much irrelevant... (read below)]]></description><link>https://read.srepath.com/p/58-fixing-monitorings-bad-signal</link><guid isPermaLink="false">https://read.srepath.com/p/58-fixing-monitorings-bad-signal</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 17 Sep 2024 12:13:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148997261/9271a7ba5a94bf6df96c07d83a9078a1.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. 
It&#8217;s a challenge that&#8217;s been around since the beginning of software development and will persist for years to come. </p><p>The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. </p><p>This interrupts workflows, affects personal time, and even disrupts sleep.</p><p>Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. </p><p>When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.</p><p>Sebastian proposes a fundamental fix for this data overload: <strong>be deliberate with the data you emit.</strong> </p><p>When instrumenting your systems, be intentional about what data you collect and transport. </p><p>Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.</p><p>To combat this, focus on:</p><ol><li><p><strong>Being Deliberate with Data</strong>. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.</p></li><li><p><strong>Filtering Data Effectively</strong>. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.</p></li><li><p><strong>Refining Alerts</strong>. Optimize alert rules such as creating tiered alerts to distinguish between critical issues and minor warnings.</p></li></ol><p>Dan Ravenstone, who leads platform at Top Hat, discussed &#8220;triaging alerts&#8221; recently.</p><p> He shared that managing millions of alerts, often filled with noise, is a significant issue. </p><p>His advice: <strong>scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don&#8217;t impact the user journey</strong>.</p><p>According to Dan, the anatomy of a good alert includes:</p><ul><li><p>A run book</p></li><li><p>A defined priority level</p></li><li><p>A corresponding dashboard</p></li><li><p>Consistent labels and tags</p></li><li><p>Clear escalation paths and ownership</p></li></ul><p>To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.</p><p>The learning point is simple: <strong>aim for quality over quantity.</strong> </p><p>By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.</p>]]></content:encoded></item><item><title><![CDATA[#57 How Technical Leads Support Software Reliability]]></title><description><![CDATA[You might be familiar with the term, &#8220;technical lead&#8221;. You might even be working with one or a few right now. But how well do you know their ability to support your reliability work?]]></description><link>https://read.srepath.com/p/57-how-technical-leads-support-software</link><guid isPermaLink="false">https://read.srepath.com/p/57-how-technical-leads-support-software</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 10 Sep 2024 12:10:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148676546/8510cd70601fee529f8e83dbc8905a1a.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>The question then condenses down to: <em>Can technical leads support reliability work?</em> </p><p>Yes, they can! 
</p><p>Anemari has been a technical lead for years &#8212; even spending a few years doing that at the coveted consultancy, Thoughtworks &#8212; and now coaches others. </p><p>She and I discussed the link between this role and software reliability.</p><h2>Technical lead &#8800; Engineering manager</h2><p>According to Anemari, a tech lead is a person with technical knowledge who is responsible for leading a technical team to align them with a common technical strategy. </p><p>For the most part, engineering managers are focused on the people aspect of the work. They coach engineers and other technical folk to improve their performance. </p><p><strong>In terms of activities, engineering managers</strong> coach, mentor, and support the development of their team members or direct reports. They&#8217;ll also go to bat for the team within the organization. </p><p><strong>Technical leads are more focused on</strong> guiding the technical work that these people do. Their focus includes work like architecture, design patterns, and implementing projects. They offer technical insights and mentorship to the team.</p><p>Anemari found in her work experience that the use of technical leads varies from company to company:</p><blockquote><p>I've worked with teams that don't have tech leads and then the engineering manager takes a more hands-on approach. Then you have teams where you have a tech lead only focused on tech, and then you have like a team lead or engineering manager, doing the people side.</p></blockquote><p>Interestingly, she found that it was very difficult to singularly focus on the technical side of the work:</p><blockquote><p>[Even] if you only want to focus on tech as a tech lead, you still end up having to deal with the people side because most tech problems are people problems in the end and so you kind of require both. </p></blockquote><h2>How can technical leads drive reliability principles?</h2><p>Anemari advised me that tech leads are often required to think about reliability principles as part of their technical strategy.</p><p>Sometimes, teams don&#8217;t have an SRE team supporting them and are operating in a <strong>&#8220;you build it, you run it&#8221; mode. In this situation, reliability becomes a 90+ percent responsibility of the team.</strong></p><p>In other words, the product team has had to develop the reliability-focused knowledge required to build and run their product reliably.</p><p>The technical lead can help the less experienced team members pick up reliability concepts and not just have that mindset of creating more features and making the burndown chart look pretty.</p><p>However, tech leads usually prefer working with a Site Reliability Engineering (SRE) team if one is available.</p><h2>Working with reliability teams like SREs </h2><p>Anemari believes it&#8217;s important to be aware of reliability principles. She put it as knowing:</p><div class="pullquote"><p>What does it mean for my product to be reliable?</p></div><p>So the first critical step is to align the definition of reliability you arrive at from this question with the known constraints of the product. These constraints can be:</p><ul><li><p>performance limits</p></li><li><p>operational environments</p></li><li><p>user expectations</p></li></ul><p>Anemari went on in our conversation to challenge the notion of simply stating reliability requirements to product teams. 
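</p><p><em>To make &#8220;what does it mean for my product to be reliable?&#8221; concrete, here&#8217;s a minimal sketch of one way a team might phrase and check an answer. The target, the 500 ms threshold, and the request data are invented for illustration; they aren&#8217;t from Anemari&#8217;s teams.</em></p><pre><code># One possible phrasing: "99.5% of checkout requests succeed in under 500 ms
# over the review window." Below, each request is (latency_ms, succeeded).
requests = [(112, True), (431, True), (902, True), (88, False), (240, True)]

target = 0.995
good = sum(1 for ms, ok in requests if ok and ms &lt; 500)
attainment = good / len(requests)

verdict = "meeting target" if attainment &gt;= target else "missing target"
print(f"attainment {attainment:.1%} vs target {target:.1%}: {verdict}")</code></pre><p>A number like that only means something if the product team understands why it was chosen, which is why simply handing over a target rarely works.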
</p><p>It may seem obvious to reliability folk, but it&#8217;s one thing in a very large pile of considerations for a product team:</p><blockquote><p>So I think it's important when we're talking about [reliability]&#8230; when we're talking about uptime, like, what does that mean? And why do we need 99 percent or whatever?</p></blockquote><p>The onus falls on reliability engineers to support this thinking.</p><p>Technical leads can be a good point of contact for that initial conversation of alignment. The reliability team can advise about the standards we mentioned before. </p><p>So <strong>the first step is co-creating a common version of reliability</strong>. </p><p>The second step is continuous communication between the product and reliability teams. Anemari believes there must be an easy way for either party to communicate with the other.</p><p>&#8220;Creating a ticket&#8221; to reach the reliability team or vice versa does not work well.</p><p>Anemari has seen having a Slack channel between the SRE team and each product team as something that has worked well. </p><blockquote><p>You can go in there, pop up a question and take it from there and solve problems.</p></blockquote><p>Anemari is also a fan of the embedded SRE model.</p><blockquote><p>If you have to build something new from scratch and you have to build a whole infrastructure, another strategy that I've seen is bringing a person with SRE knowledge [into] the team for a while, helping us define [technical work] and being hands on.</p></blockquote><p>A long-term embedded SRE, however, might be an antipattern.</p><blockquote><p>I&#8217;ve had some teams that had an SRE team member&#8230; all the time, but I&#8217;ve also seen them struggle with the fact that at some point there is not so much to do.</p></blockquote><p>An embedded SRE should be brought in to enable the teams to do more and take care of their products by training the team on reliability topics &#8212; even at a high level &#8212; like understanding all the different dashboards that they have.</p><h2>What can SREs do better to support product teams with reliability efforts?</h2><p>Anemari told me that it would be great if SREs first put more effort into helping the product team understand why reliability is so important. It&#8217;s obvious to us in this space, but as I mentioned earlier, people have n+100 other things on their minds.</p><p>Product teams need to be made more aware of what the reliability engineers are taking care of and how that can affect their products. </p><p>There needs to be more conversation than what Anemari has experienced in a lot of encounters with SRE teams:</p><blockquote><p>The SRE teams are like, &#8220;Just give it to us. We'll take care of it. You know, like, we, we know what to do. I don't have to explain all of these things to you.&#8221; </p></blockquote><p>She found it more effective to sit down with the reliability team and try to understand what might seem trivial, but are important questions like:</p><ul><li><p>Why is our service not processing enough requests and how can we change it?</p></li><li><p>Why can&#8217;t we just drop this?</p></li><li><p>Why do we need to have all of these services? </p></li></ul><p>Asking these kinds of questions helped the collaborators come up with a better solution than what the SRE team could come up with on their own.</p><p>Anemari recalled a time in the past when an SRE team came in and introduced themselves to the product team. 
They then covered how their services ran in the background. This led the product team to ask questions to brainstorm on what could be improved.</p><p>The intentionality is important: the SRE team did not come with horns blaring that they were going to change the systems. They came in to discuss and brainstorm ideas. </p><h2>But will the software engineers cooperate?</h2><p>A lot of  SREs I've spoken with are frustrated and think developer teams do not want to understand reliability.</p><p>But I'm sure at least <em>some</em> teams want to understand how reliability works. </p><p>Sometimes you have to just swallow that bitter pill and say, &#8220;Hey, look, I'm going to spell every single thing I'm doing out to you, just so that maybe in the future you can do it yourself.&#8221; </p><p>I think a lot of developers remember their early days and want to be able to run their product entirely on their own. </p><p>Anemari confirmed with her experience that developers want to have full power over what's happening with their product. Dependency is not necessarily something developers like very much.</p><p>There's of course the consideration of the learning journey.  </p><blockquote><p>There's that sweet spot that is hard to find early on, &#8220;How much do I need to know? Like, do I need to become an SRE expert now to run my service? Or how much do I need to know to just run  my service?&#8221; </p></blockquote><p>Anemari recalled one particular high-performing team that she led. The reason why it was high-performing was partly because they were able to run services fully with very little support from an external SRE team.</p><blockquote><p>We [brought] someone we knew with knowledge&#8230; to fully understand how does our AWS work and how to restart our services and everything so that we were able to fully run it and be on call for it.</p></blockquote><p>The sweet spot notion comes in once again where you have to work out, <em>&#8220;How much do I need to develop and spend time on developing this SRE knowledge, and how much do put into developing the product?&#8221;</em></p><p>It's complicated, but communication is the key to it.</p><h2>How can product teams handle conflicting priorities?  </h2><p>Anemari told me that the answer depends on the context: </p><ul><li><p>where you are</p></li><li><p>what situation your product is in </p></li><li><p>how the developer and SRE teams are laid out</p></li></ul><p>It&#8217;s very different to start a project from scratch versus having a whole monolith running for years and trying to make things better. </p><p>Anemari added:</p><blockquote><p>It might sound a little bit crazy but I still think that it's very important for you to like write down all of these different things that you think you need to solve&#8230; </p><p>A big part of the role of a tech lead is to make sure that all of these different parts agree and align on a strategy. So that means the product, that means the developer, that means the SREs, that means the customer support, etc.</p></blockquote><p>A tech lead would work to bring all of this together and then propose to all of these different stakeholders. You might have to put it as simply as, &#8220;This is what we have to focus on right now, given that as a product team. We have to keep delivering.&#8221;</p><p>Then come the compromises. 
For example, &#8220;20 percent of our time would go into improving our delivery pipeline.&#8221; </p><p>It is crucial to work out compromises before getting agreement from the various stakeholders in the software delivery organization. </p><div><hr></div><p>This write-up was just a preview of what Anemari and I talked about in this episode of the Reliability Enablers. Be sure to listen to get the other half of our conversation.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[#56 Resolving DORA Metrics Mistakes]]></title><description><![CDATA[I asked Nathen Harvey who is lead DORA advocate at Google about some of the mistakes people make. The major ones seem to come down to management misconceptions about what it can do.]]></description><link>https://read.srepath.com/p/56-resolving-dora-metrics-mistakes</link><guid isPermaLink="false">https://read.srepath.com/p/56-resolving-dora-metrics-mistakes</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Wed, 04 Sep 2024 12:02:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148441662/fc4197978ba67641e1eb7586a50ae120.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We're already well into 2024 and it&#8217;s sad that people still have enough fuel to complain about various aspects of their engineering life.  </p><p>DORA seems to be turning into one of those problem areas.</p><p>Not at every organization, but some places are turning it into a case of &#8220;hitting metrics&#8221; without caring for the underlying capabilities and conversations.</p><p>Nathen Harvey is no stranger to this problem.</p><p>He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. </p><p>His focus has been on questions like:</p><div class="pullquote"><p>How do we help teams get better at delivering and operating software? </p></div><p>You and I can agree that this is an important question to ask. </p><p>I&#8217;d listen to what he has to say about DORA because he&#8217;s got a wealth of experience behind him, having also run community engineering at Chef Software.</p><p>Before we continue, let&#8217;s explore <strong>What is DORA? </strong>in Nathen&#8217;s (paraphrased) words:</p><p>DORA is a software research program that's been running since 2015.</p><p>This research program looks to figure out:</p><div class="pullquote"><p>How do teams get good at delivering, operating, building, and running software? </p></div><p>The researchers were able to draw out the concept of the metrics based on <strong>correlating teams that have good technology practices with highly robust software delivery outcomes</strong>.</p><p>They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.</p><p>Essentially, all those things that matter to the business.</p><p>One of the challenges the researchers found over the last decade was working out: <em>how do you measure something like software delivery?</em> </p><p>It's not the same as a factory system where you can go and count the widgets that we're delivering necessarily.</p><p>The unfortunate problem is that the factory mindset I think still leaks in. I&#8217;ve personally noted some silly metrics over the years like lines of code.</p><p>Imagine being asked constantly: &#8220;How many lines of code did you write this week?&#8221;</p><p>You might not have to imagine. It might be a reality for you. 
</p><p>DORA&#8217;s researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. </p><p>They settled on and validated 4 key measures for software delivery performance.</p><p>Nathen elaborated that <strong>2 of these measures look at throughput</strong>:</p><blockquote><p>[Those] two [that] look at throughput really ask two questions:</p><ol><li><p>How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, a change to go from the developer's workstation. right through to production?</p></li></ol><p>And then the second question on throughput is:</p><ol start="2"><li><p>How frequently are you updating production?</p></li></ol></blockquote><p>In plain English, these 2 metrics are:</p><ol><li><p><strong>Deployment Frequency</strong>. How often code is deployed to production? This metric reflects the team's ability to deliver new features or updates quickly.</p></li><li><p><strong>Lead Time for Changes</strong>: Measures the time it takes from code being committed to being deployed to production. </p></li></ol><p>Nathen recounted his experience of working at organizations that differed in how often they update production from once every six months to multiple times a day.  </p><p>They're both very different types of organizations, so their perspective on throughput metrics will be wildly different. </p><p>This has some implications for the <em>speed</em><strong> </strong>of software delivery.</p><p>Of course, <strong>everyone wants to move faster, but there&#8217;s this other thing that comes in and that's stability</strong>.</p><p>And so, the other two stability-oriented metrics look at:</p><blockquote><p>What happens when you do update production and... something's gone horribly wrong. &#8220;Yeah, we need to roll that back quickly or push a hot fix.&#8221; </p></blockquote><p>In plain English, they are:</p><ol start="3"><li><p><strong>Change Failure Rate</strong>: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs). </p></li><li><p><strong>Failed Deployment Recovery Time</strong>: Measures how long it takes to recover from a failure in production. </p></li></ol><p>You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.</p><p>But keep in mind, it&#8217;s about balancing all 4 metrics. </p><p>Nathen believes it&#8217;s fair to say today that across <strong>many organizations, they look at these concepts of throughput and stability as tradeoffs of one another</strong>. </p><div class="pullquote"><p>We can either be fast or we can be stable. </p></div><p>But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't trade-offs of one another.</p><p>They tend to move together. They&#8217;ve seen organizations of every shape and size, in every industry, doing well across all four of those metrics. </p><p>They are the best performers. </p><p>The interesting thing is that the size of your organization doesn't matter the industry that you're in.</p><p>Whether you&#8217;re working in a highly regulated or unregulated industry, it doesn't matter.</p><p>The key insight that Nathen thinks we should be searching for is: <strong>how do you get there?</strong>  </p><p>To him, it's about shipping smaller changes. 
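</p><p><em>Before getting to that: for readers who want the four measures in runnable form, here&#8217;s a minimal sketch that derives them from a handful of deployment records. The record format, field names, and numbers are invented for illustration; this isn&#8217;t tooling from DORA or Nathen.</em></p><pre><code>from datetime import datetime

# Made-up records: (commit_time, deploy_time, caused_failure, minutes_to_restore)
deploys = [
    (datetime(2024, 8, 1, 9, 0),  datetime(2024, 8, 1, 15, 0), False, 0),
    (datetime(2024, 8, 2, 10, 0), datetime(2024, 8, 3, 11, 0), True, 45),
    (datetime(2024, 8, 5, 8, 30), datetime(2024, 8, 5, 9, 10), False, 0),
]
window_days = 7

lead_time_hours = sum((d - c).total_seconds() for c, d, _, _ in deploys) / len(deploys) / 3600
deploys_per_day = len(deploys) / window_days
restore_minutes = [m for _, _, failed, m in deploys if failed]
change_failure_rate = len(restore_minutes) / len(deploys)
avg_restore = sum(restore_minutes) / len(restore_minutes) if restore_minutes else 0.0

print(f"lead time for changes: {lead_time_hours:.1f} h (avg)")
print(f"deployment frequency:  {deploys_per_day:.2f} per day")
print(f"change failure rate:   {change_failure_rate:.0%}")
print(f"failed deployment recovery time: {avg_restore:.0f} min (avg)")</code></pre><p>None of that arithmetic is the interesting part, though. The interesting part is how you move the numbers, and for Nathen that starts with small changes.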
</p><p>When you ship small changes, they're easier to move through your pipeline. </p><p>They're easier to reason about. </p><p>And when something goes wrong, they're easier to recover from so you can restore service.</p><p>But along with those small changes, we need to think about those feedback cycles.</p><p>Every line of code that we write is in reality a little bit of an experiment. </p><p>We think it's going to do what we expect and it's going to help our users in some way, but we need to get feedback on that as quickly as possible.</p><p>Underlying all of this, both small changes and getting fast feedback, is a real climate for learning. Nathen drew out a few thinking points from this:</p><blockquote><p>So what is the learning culture like within our organization? </p><p>Is there a climate for learning? </p><p>And are we using things like failures as opportunities to learn, so that we can ever be improving?</p></blockquote><p>I don&#8217;t know if you&#8217;re thinking the same as me already, but we're already learning that DORA is a lot more than just metrics.</p><p>To Nathen (and me), <strong>the metrics should be one of the least interesting parts of DORA, because DORA digs into useful capabilities, like small changes and fast feedback</strong>. </p><p>That&#8217;s what truly helps determine how well you're going to do against those performance metrics.</p><p>It&#8217;s not about saying &#8220;We are a low to medium performer. Now go and improve the metrics!&#8221;</p><p>I think the issue is that a lot of organizations emphasize the metrics because they're something that can sit on an executive dashboard.</p><p>But the true reason we have metrics is to help drive conversations.</p><p>Through those conversations, we drive improvement.</p><p>That&#8217;s important because, according to Nathen, an unfortunately noticeable number of organizations are currently doing this:</p><blockquote><p>I've seen organizations [where it&#8217;s like]: &#8220;Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.&#8221; </p><p>That doesn't change anything. </p><p>We have to go a step further and put those metrics into action.</p></blockquote><p>We should be treating the metrics as a kind of compass on a map. </p><p>You can use those metrics to help orient yourself and understand &#8220;Where are we heading?&#8221; </p><p>But then you have to choose how you are going to make progress toward whatever your goal is.</p><p>The capabilities enabled by the DORA framework should help answer questions like:</p><ul><li><p>Where are our bottlenecks?</p></li><li><p>Where are our constraints?</p></li><li><p>Do we need to do some improvement work as a team?</p></li></ul><p>We also talked about the SPACE framework, which is a follow-on tool to the DORA metrics. </p><p>It is a framework for understanding developer productivity. </p><p>It encourages teams or organizations to <strong>look at five dimensions when trying to measure something from a productivity perspective</strong>.</p><p>It stands for:</p><ul><li><p><strong>S</strong> &#8212; satisfaction and well-being</p></li><li><p><strong>P</strong> &#8212; performance</p></li><li><p><strong>A</strong> &#8212; activity</p></li><li><p><strong>C</strong> &#8212; communication and collaboration</p></li><li><p><strong>E</strong> &#8212; efficiency and flow</p></li></ul><p>What the SPACE framework recommends is that you:</p><p>First, pick metrics from two to three of those five categories. 
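</p><p>To make that recommendation concrete, here is a small, entirely hypothetical sketch: a team tags each candidate metric with the SPACE dimension it informs, then checks that the shortlist spans two or three of the five dimensions. The metric names and the helper are invented for illustration; as Nathen stresses below, SPACE itself defines no metrics.</p><pre><code># Hypothetical example: SPACE names dimensions, not metrics, so the metric
# names and groupings below are a team's own choices, not part of the framework.
SPACE_DIMENSIONS = {
    "S": "satisfaction and well-being",
    "P": "performance",
    "A": "activity",
    "C": "communication and collaboration",
    "E": "efficiency and flow",
}

# A team's candidate shortlist, each metric tagged with the dimension it informs.
candidate_metrics = {
    "quarterly developer satisfaction survey": "S",
    "change failure rate": "P",
    "code review turnaround time": "C",
}


def within_recommended_spread(metrics):
    """True when the shortlist spans two or three of the five SPACE dimensions,
    per the recommendation, rather than chasing coverage of all five."""
    return len(set(metrics.values())) in (2, 3)


for metric, dimension in candidate_metrics.items():
    print(f"{metric}: {SPACE_DIMENSIONS[dimension]}")
print(within_recommended_spread(candidate_metrics))   # True: covers S, P, and C
</code></pre><p>The helper itself is beside the point; the idea is simply to keep the shortlist small.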
</p><p>(You don't need a metric from every one of those five; just find something that works well for your team.)</p><p>Then write down those metrics and start measuring them. </p><p>Here&#8217;s the interesting thing: <strong>DORA is an implementation of SPACE.</strong> </p><p>You can map each metric onto the SPACE acronym:</p><ul><li><p>Lead time for changes is a measure of <strong>E</strong>fficiency and flow</p></li><li><p>Deployment frequency is an <strong>A</strong>ctivity</p></li><li><p>Change fail rate is about <strong>P</strong>erformance</p></li><li><p>Failed deployment recovery time is about <strong>E</strong>fficiency and flow</p></li></ul><p>Keep in mind that SPACE itself has no metrics. </p><p>It is a framework for identifying metrics.</p><p>Nathen reiterated that you can't use &#8220;the SPACE metrics&#8221; because there is no such thing. </p><p>I mentioned earlier how DORA is a means of identifying the capabilities that can improve the metrics.</p><p>These can be technical practices like using continuous integration.</p><p>But they can also be capabilities like collaboration and communication. </p><p>As an example, you might look at what your change approval process looks like. </p><p>You might look at how collaboration and communication have failed when you&#8217;ve had to send changes off to an external approval board like a CAB (change approval board).</p><p>DORA&#8217;s research backs the above up:</p><blockquote><p>What our research has shown through collecting data over the years is that, while they do exist, on the whole <strong>an external change approval body will slow you down.</strong></p><p>That's no surprise. So <strong>your change lead time is going to increase, your deployment frequency will decrease</strong>. </p><p>But, <strong>at best, they have zero impact on your change fail rate.</strong> In most cases, they have a negative impact on your change fail rate. So you're failing more often.</p></blockquote><p>It goes back to the idea of smaller changes, faster feedback, and being able to validate that, building in audit controls and so forth.</p><p>This is something that reliability-focused engineers should be able to help with. One of the things Sebastian and I talk about a lot is embracing and managing risk effectively, rather than trying to mitigate it through stifling measures like CABs. </p><p>In short, DORA and software reliability are not mutually exclusive concepts.</p><p>They're certainly in the same universe.</p><p>Nathen went as far as to say that <strong>some SRE practices go a little deeper than the capability level DORA works at</strong> and provide even more specific guidance on how to do things.</p><p>He cleared up a doubt I had, because a lot of people have argued with me (mainly at conferences) that DORA is something developers do earlier in the SDLC.</p><p>And then SRE is completely different because it focuses on the production side. </p><div class="pullquote"><p>The worst possible situation could be turning to developers and saying, &#8220;These 2 throughput metrics, they&#8217;re yours. Make sure they go up no matter what,&#8221; and then turning to our SREs and saying &#8220;Those stability metrics, they're yours. 
Make sure they stay good&#8221; </p><p>All that does is put these false incentives in place and we're just fighting against each other.</p></div><p>We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.</p><p>Here are some useful links from Nathen for further research:</p><p><a href="https://dora.community/">DORA online community of practice</a></p><p><a href="https://dora.dev">DORA homepage</a></p><p><a href="https://queue.acm.org/detail.cfm?id=3454124">[Article] The SPACE of Developer Productivity</a></p><p><a href="https://linktr.ee/nathenharvey">Nathen Harvey's Linktree</a></p>]]></content:encoded></item></channel></rss>