<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Reliability Enablers: Hear the Podcast]]></title><description><![CDATA[Audio conversations around reliability, especially around enabling it across teams and organizations]]></description><link>https://read.srepath.com/s/podcast</link><image><url>https://substackcdn.com/image/fetch/$s_!hjhf!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ee1dc2-77bf-4ffa-b056-f66dac8ad0d0_128x128.png</url><title>Reliability Enablers: Hear the Podcast</title><link>https://read.srepath.com/s/podcast</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 09:38:39 GMT</lastBuildDate><atom:link href="https://read.srepath.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ash P]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[srepath@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[srepath@substack.com]]></itunes:email><itunes:name><![CDATA[Ash Patel]]></itunes:name></itunes:owner><itunes:author><![CDATA[Ash Patel]]></itunes:author><googleplay:owner><![CDATA[srepath@substack.com]]></googleplay:owner><googleplay:email><![CDATA[srepath@substack.com]]></googleplay:email><googleplay:author><![CDATA[Ash Patel]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[You (and AI) can't automate reliability away]]></title><description><![CDATA[What if the hardest part of reliability has nothing to do with tooling or automation?]]></description><link>https://read.srepath.com/p/you-and-ai-cant-automate-reliability</link><guid 
isPermaLink="false">https://read.srepath.com/p/you-and-ai-cant-automate-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 02 Dec 2025 13:03:47 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/179318513/1b715e971fdb5ab78a4c7fd545894f54.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>What if the hardest part of reliability has nothing to do with tooling or automation? Jennifer Petoff explains why real reliability comes from the human workflows wrapped around the engineering work.</em></p><p><strong>Everyone seems to think AI will automate reliability away.</strong> </p><p>I keep hearing the same story: </p><p><em>&#8220;Our tooling will catch it.&#8221;</em> </p><p><em>&#8220;Copilots will reduce operational load.&#8221;</em> </p><p><em>&#8220;Automation will mitigate incidents before they happen.&#8221;</em></p><p>But here&#8217;s a hard truth to swallow: AI only automates the mechanical parts of reliability &#8212; the machine in the machine.</p><p><strong>The hard parts haven&#8217;t changed at all.</strong></p><p>You still need teams with clarity on system boundaries.<br>You still need consistent approaches to resolution.<br>You still need postmortems that drive learning rather than blame.</p><p><strong>AI doesn&#8217;t fix any of that.</strong> If anything, it exposes every organizational gap we&#8217;ve been ignoring. And that&#8217;s exactly why I wanted today&#8217;s guest on.</p><p>Jennifer Petoff is Director of Program Management for Google Cloud Platform and Technical Infrastructure education. Every day, she works with SREs at Google, as well as with SREs at other companies through her public speaking and Google Cloud customer engagements.</p><p>Even if you have never touched GCP, you have still been influenced by her work at some point in your SRE career. She is co-editor of Google&#8217;s original Site Reliability Engineering book from 2016. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EQLo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EQLo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 424w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 848w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1272w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp" width="232" height="296.96" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:400,&quot;resizeWidth&quot;:232,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Site Reliability 
Engineering&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Site Reliability Engineering" title="Site Reliability Engineering" srcset="https://substackcdn.com/image/fetch/$s_!EQLo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 424w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 848w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1272w, https://substackcdn.com/image/fetch/$s_!EQLo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f293e8-14b4-4dc2-8a3e-e88582286cb9_400x512.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yeah, that one!</p><p>It was my immense pleasure to have her join me to discuss the internal dynamics behind successful reliability initiatives. Here are 5 highlights from our talk:</p><h2>3 issues stifling individual SREs&#8217; work</h2><p>To start, I wanted to know from Jennifer the kinds of <strong>challenges she has seen individual SREs face</strong> when attempting to introduce or reinforce reliability improvements within their teams or the broader organization.</p><p>She grouped these challenges into three main categories:</p><ol><li><p>Cultural issues (with a look into Westrum&#8217;s typology of organizational culture)</p></li><li><p>Insufficient buy-in from stakeholders</p></li><li><p>Inability to communicate the value of reliability work</p></li></ol><div class="pullquote"><p>Organizations with generative cultures have 30% better organizational performance.</p></div><p>A key highlight from this topic came from her look at DORA research, an annual survey of thousands of tech professionals and the research upon which the book <em>Accelerate</em> is based.</p><p>It showed that organizations with generative cultures have 30% better organizational performance. In other words, you can have the best technology, tools, and processes to get good results, but culture further raises the bar. 
</p><p>A generative culture also makes it easier to implement the more technical aspects of DevOps or SRE that are associated with improved organizational performance.</p><h2>Hands-on is the best kind of training</h2><p>We then explored structured approaches that ensure consistency, build capability, and deliberately shape reliability culture. As they say &#8211; <em><strong>Culture eats strategy for breakfast!</strong></em></p><p>One key example Jennifer gave was the hands-on approach they take at Google. She believes that <strong>adults learn by doing</strong>. In other words, <strong>SREs gain confidence through hands-on work</strong>. </p><p>Where possible, training programs should move away from passive listening to lectures toward hands-on exercises that mimic real SRE work, especially troubleshooting.</p><p>One specific exercise that Google has built internally is <strong>Simulating Production Breakages</strong>.</p><p>Engineers undergoing that training have a chance to troubleshoot a real system built for this purpose in a safe environment. </p><p>The results have been profound: Jennifer&#8217;s team saw a tremendous rise in confidence reflected in survey results. </p><blockquote><p>This confidence is focused on job-related behaviors, which, when repeated over time, reinforce that culture of reliability.</p></blockquote><h2>Reliability is mandatory for <em>everybody</em></h2><p>Another thing Jennifer told me Google did differently was making reliability a mandatory part of <em>every </em>engineer&#8217;s curriculum, not only SREs.</p><blockquote><p>When we first spun up the SRE Education team, our focus was squarely on our SREs. However, that&#8217;s like preaching to the choir. SREs are usually bought into reliability. </p><p>A few years in, our leadership was interested in propagating the reliability-focused culture of SRE to all of Google&#8217;s development teams, a challenge an order of magnitude greater than training SREs. 
</p></blockquote><p>How did they achieve this mandate?</p><ul><li><p>They developed a short and engaging (and mandatory) production safety training</p></li><li><p>That training has now been taken by tens of thousands of Googlers</p></li><li><p>Jennifer attributes this initiative&#8217;s success to how they &#8220;SRE&#8217;d the program&#8221;. </p><p><em>&#8220;We ran a canary followed by a progressive roll-out. We instituted monitoring and set up feedback loops so that we could learn and drive continuous improvement.&#8221;</em></p></li></ul><p>The result of this massive effort? </p><p>A very respectable 80%+ net promoter score with open-text feedback: &#8220;best required training ever.&#8221;</p><blockquote><p>What made this program successful is that Jennifer and her team SRE&#8217;d its design and iterative improvement.</p><p>You can learn more about &#8220;<strong>How to SRE anything</strong>&#8221; (from work to life) using her rubric: <a href="https://www.reliablepgm.com/how-to-sre-anything/">https://www.reliablepgm.com/how-to-sre-anything/</a></p></blockquote><h2>Reliability gets rewarded just like feature work</h2><p>Jennifer then talked about how Google mitigates a risk that I think every reliability engineer wishes could be solved at their organization. </p><p>That is, having great reliability work rewarded at the same level as great feature work.</p><p>For development and operations teams alike at Google, this means making sure &#8220;grungy work&#8221; like tech debt reduction, automation, and other activities that improve reliability are rewarded equally to shiny new product features. </p><p>Organizational reward programs that recognize outstanding work typically have committees. These committees not only look for excellent feature development work, but also reward and celebrate foundational activities that improve reliability. </p><p>This is explicitly built into the rubric for judging award submissions.</p><h2>Keep a scorecard of reliability performance</h2><p>Jennifer gave another example of how Google judges reliability performance, but more specifically for SRE teams this time. </p><div class="pullquote"><p>Google&#8217;s <strong>Production Excellence (ProdEx) program</strong> was created in 2015 to <strong>assess and improve production excellence (aka reliability improvements)</strong> across SRE teams.</p></div><p>ProdEx acts like a <strong>central scorecard to aggregate metrics from various production health domains</strong> to provide a comprehensive overview of an SRE team&#8217;s health and the reliability of the services they manage. 
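To make the central-scorecard idea concrete, here is a small, purely hypothetical Python sketch. The domain names follow what is described in this episode, but the 1-to-5 scoring scheme and the aggregation rule are my own invention for illustration, not Google's actual ProdEx rubric.

```python
# Hypothetical ProdEx-style scorecard: roll per-domain scores up into
# one team-level view. Scoring scheme invented for illustration.
from dataclasses import dataclass


@dataclass
class DomainScore:
    domain: str  # e.g. "SLOs", "on-call workload", "alerting quality"
    score: int   # 1 (needs attention) .. 5 (excellent)


def team_scorecard(scores: list[DomainScore]) -> dict:
    """Aggregate per-domain scores and flag the weakest domain, if any."""
    worst = min(scores, key=lambda s: s.score)
    return {
        "overall": sum(s.score for s in scores) / len(scores),
        "needs_attention": worst.domain if worst.score <= 2 else None,
    }


card = team_scorecard([
    DomainScore("SLOs", 4),
    DomainScore("on-call workload", 2),
    DomainScore("alerting quality", 5),
    DomainScore("postmortem discipline", 4),
])
print(card)
```

The point of a structure like this is the one Jennifer makes: the scorecard surfaces team-level issues early (here, on-call workload) so they can be coached without shame.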
</p><p>Here are some specifics from the program:</p><ul><li><p>Domains include SLOs, on-call workload, alerting quality, and postmortem discipline</p></li><li><p>Reviews are conducted live every few quarters by senior SREs (directors or principal engineers) who are not part of the team&#8217;s direct leadership</p></li><li><p>There is a focus on coaching and accountability without shame (to elicit psychological safety)</p></li></ul><p>ProdEx serves various levels of the SRE organization through:</p><ol><li><p>providing strategic situational awareness regarding organizational and system health to leadership <em>and</em></p></li><li><p>keeping forward momentum around reliability and surfacing team-level issues early to support engineers in addressing them</p></li></ol><h2>Wrapping up</h2><p>Having an inside view of reliability mechanisms within a few large organizations, I know that few are actively doing all &#8212; or sometimes any &#8212; of the reliability enhancers that Google uses and Jennifer has graciously shared with us. </p><p>It&#8217;s time to get the ball rolling. What will you do today to make it happen?</p>]]></content:encoded></item><item><title><![CDATA[#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran]]></title><description><![CDATA[Listen now | Dave O'Connor has been an SRE leadership practitioner and coach for many years. Prior to this, he was an SRE and reliability leader at the director-level at Google for close to 16 years.]]></description><link>https://read.srepath.com/p/ex-googler-on-driving-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/ex-googler-on-driving-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 15 Jul 2025 13:05:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165789886/9c56332d942071a8f50da4da198173da.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>A new or growing SRE team. A copy of the book. 
A company that says it cares about reliability. <em>What happens next? </em>Usually&#8230; not much.</p><p>In this episode, I sit down with <strong>Dave O&#8217;Connor</strong>, <strong>a 16-year Google SRE veteran</strong>, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in.</p><p>You might know him for his self-deprecating wit and legendary USENIX blurb about being &#8220;complicit in the development of the SRE function.&#8221;</p><p>This one&#8217;s a treat &#8212; less &#8220;here&#8217;s a shiny new tool&#8221; and more &#8220;here&#8217;s what reliability <em>actually</em> looks like when you&#8217;ve seen it all.&#8221;</p><p>&#10024; <em>No vendor plugs from Dave at all, just a good old-fashioned chat about what works and what doesn&#8217;t.</em></p><p>Here&#8217;s what we dive into:</p><ul><li><p><strong>The adoption trap</strong>: Why SRE efforts often fail before they begin&#8212;especially when new hires care more about reliability than the org ever intended.</p></li><li><p><strong>The SRE book dilemma</strong>: Dave&#8217;s take on why following the SRE book chapter-by-chapter is a trap for most companies (and what to do instead).</p></li><li><p><strong>The cost of &#8220;caring too much&#8221;</strong>: How engineers burn out trying to force reliability into places it was never funded to live.</p></li><li><p><strong>You build it, you run it (but should you?)</strong>: Not everyone&#8217;s cut out for incident command&#8212;and why pretending otherwise sets teams up to fail.</p></li><li><p><strong>Buying vs. 
building</strong>: The real reason even conservative enterprises are turning into software shops &#8212; and the reliability nightmare that follows.</p></li></ul><p>We also discuss the evolving role of reliability in organizations today, from being mistaken for &#8220;just ops&#8221; to becoming a strategic investment (when done right).</p><p>Dave's seen the waves come and go in SRE &#8212; and he's still optimistic. That alone is worth a listen.</p>]]></content:encoded></item><item><title><![CDATA[#66 - Unpacking 2025 SRE Report’s Damning Findings ]]></title><description><![CDATA[This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings. Sebastian joined me for this episode so you know it'll have great insights.]]></description><link>https://read.srepath.com/p/unpacking-2025-sre-reports-damning</link><guid isPermaLink="false">https://read.srepath.com/p/unpacking-2025-sre-reports-damning</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 01 Jul 2025 01:10:26 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165791277/d94303c52fc563110b9a2ed26846b24f.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I know it&#8217;s already six months into 2025, but we recorded this almost three months ago. 
I&#8217;ve been busy with my foray into the world of tech consulting and training &#8212;and, well, editing these podcast episodes takes time and care.</p><p>This episode was prompted by the <strong>2025 Catchpoint SRE Report</strong>, which dropped some damning but all-too-familiar findings:</p><ul><li><p><strong>53% of orgs still define reliability as uptime only</strong>, ignoring degraded experience and hidden toil</p></li><li><p><strong>Manual effort is creeping back in</strong>, reversing five years of automation gains</p></li><li><p><strong>41% of engineers feel pressure to ship fast</strong>, even when it undermines long-term stability</p></li></ul><p>To unpack what this actually means inside organizations, I sat down with <strong>Sebastian Vietz</strong>, Director of Reliability Engineering at Compass Digital and co-host of the Reliability Enablers podcast. </p><p>Sebastian doesn&#8217;t just talk about technical fixes &#8212; he focuses on the organizational frictions that stall change, burn out engineers, and leave &#8220;reliability&#8221; as a slide deck instead of a lived practice.</p><p>We dig into:</p><ul><li><p>How SREs get stuck as messengers of inconvenient truths</p></li><li><p>What it really takes to move from advocacy to adoption &#8212; without turning your whole org into a cost center</p></li><li><p>Why tech is more like <em>milk</em> than <em>wine</em> (Sebastian explains)</p></li><li><p>And how SREs can strengthen&#8212;not compete with&#8212;security, risk, and compliance teams</p></li></ul><p>This one&#8217;s for anyone tired of reliability theatrics. No kumbaya around K8s here. Just an exploration of the messy, human work behind making systems and teams more resilient.</p>]]></content:encoded></item><item><title><![CDATA[#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability]]></title><description><![CDATA[In most SaaS, 99.9% uptime gets you promoted. 
In critical infrastructure like the energy sector, it gets flagged as a failure mode.]]></description><link>https://read.srepath.com/p/65-in-critical-systems-999-isnt-reliable</link><guid isPermaLink="false">https://read.srepath.com/p/65-in-critical-systems-999-isnt-reliable</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 17 Jun 2025 13:05:09 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165835533/0767760aa43e11aa1fa534b9bf7af421.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Most teams talk about reliability with a margin for error. &#8220;What&#8217;s our SLO? What&#8217;s our budget for failure?&#8221; </p><p>But in the energy sector? <strong>There is no acceptable downtime. Not even a little.</strong></p><p>In this episode, I talk with <strong>Wade Harris, Director of FAST Engineering in Australia</strong>, who&#8217;s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.</p><p>What makes this episode different is that Wade isn&#8217;t a reliability engineer by title, but it&#8217;s baked into everything his team touches. 
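As a quick aside on that "margin for error": the arithmetic behind an availability target is worth seeing once. The snippet below is my own illustrative sketch, not something from the episode.

```python
# Allowed downtime per year implied by an availability target.
# Illustrative arithmetic only.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 (ignoring leap years)


def downtime_budget_minutes(slo_percent: float) -> float:
    """Minutes of downtime per year permitted by a given availability SLO."""
    return MINUTES_PER_YEAR * (1 - slo_percent / 100)


for slo in (99.0, 99.9, 99.99, 99.999):
    print(f"{slo}% -> {downtime_budget_minutes(slo):.1f} min/year allowed")
```

At 99.9%, that works out to roughly 8.8 hours of acceptable outage per year. In most SaaS that budget is generous; in a power station, as Wade explains, there is no budget at all.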
And that matters more than ever as software creeps deeper into operational technology (OT), and the cloud tries to stake its claim in critical systems.</p><p>We cover:</p><ul><li><p>Why <strong>100% uptime is the minimum bar</strong>, not a stretch goal</p></li><li><p>How the rise of renewables has <strong>increased system complexity</strong> &#8212; and what that means for monitoring</p></li><li><p>Why <strong>bespoke integration and SCADA spaghetti</strong> are still normal (and here to stay)</p></li><li><p>The <strong>reality of cloud risk</strong> in critical infrastructure (&#8220;the cloud is just someone else&#8217;s computer&#8221;)</p></li><li><p>What software engineers need to understand if they want their products used in serious environments</p></li></ul><p>This isn&#8217;t about observability dashboards or DevOps rituals. This is reliability <strong>when the lights go out and people risk getting hurt</strong> if you get it wrong.</p><p>And it&#8217;s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. 
No matter what.</p>]]></content:encoded></item><item><title><![CDATA[#64 - Using AI to Reduce Observability Costs]]></title><description><![CDATA[Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.]]></description><link>https://read.srepath.com/p/64-using-ai-to-reduce-observability</link><guid isPermaLink="false">https://read.srepath.com/p/64-using-ai-to-reduce-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 28 Jan 2025 14:03:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/155579851/21d55b415ca03894981571709c73bff5.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><em>Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions.</em></p><div><hr></div><p>It's been a hot minute since the last episode of the Reliability Enablers podcast.</p><p>Sebastian and I have been working on a few things in our realms. On a personal <em>and </em>work front, I&#8217;ve been to over 25 cities in the last 3 months and need a breather.</p><p>Meanwhile, listen to this interesting conversation with Ruchir Jha from Cardinal, a vendor working on the cutting edge of o11y to help keep costs from spiraling out of control. </p><p>(To the skeptics: he did not pay me for this episode.)</p><p>Here&#8217;s an AI-generated summary of what you can expect in our conversation:</p><p>In this conversation, we explore cutting-edge approaches to FinOps, i.e. cost optimization for observability. 
</p><p>You'll hear about three pressing topics:</p><ol><li><p><strong>Managing Tool Sprawl</strong>: Insights into the common challenge of juggling 5-15 tools and how to identify which ones deliver real value.</p></li><li><p><strong>Reducing Observability Costs</strong>: Techniques to track and trim waste, including how to uncover cost hotspots like overused or redundant metrics.</p></li><li><p><strong>AI for Observability Decisions</strong>: Practical ways AI can simplify complex data, empowering non-technical stakeholders to make informed decisions.</p></li></ol><p>We also touch on the balance between open-source solutions like OpenTelemetry and commercial observability tools. </p><p>Learn how these strategies, informed by Ruchir's experience at Netflix, can help streamline observability operations and cut costs without sacrificing reliability.</p>]]></content:encoded></item><item><title><![CDATA[#63 - Does "Big Observability" Neglect Mobile?]]></title><description><![CDATA[Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability.]]></description><link>https://read.srepath.com/p/63-mobile-apps-and-how-observability</link><guid isPermaLink="false">https://read.srepath.com/p/63-mobile-apps-and-how-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 12 Nov 2024 13:03:12 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151260045/334ab290bb5ba3269054b9dba051f92d.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Andrew Tunall is a product engineering leader focused on pushing the boundaries of reliability with a current focus on mobile observability. Using his experience from AWS and New Relic, he&#8217;s vocal about the need for a more user-focused observability, especially in mobile, where traditional practices fall short. 
</p><ol><li><p><strong>Career Journey and Current Role</strong>: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller Series B company to learn beyond what corporate America offered.</p></li><li><p><strong>Specialization in Mobile Observability</strong>: At Embrace, Andrew and his colleagues build tools for consumer mobile apps, helping engineers, SREs, and DevOps teams integrate observability directly into their workflows.</p></li><li><p><strong>Gap in Mobile Observability</strong>: Observability for mobile apps is still developing, with early tools like Crashlytics only covering basic crash reporting. Andrew highlights that more nuanced data on app performance, crucial to user experience, is often missed.</p></li><li><p><strong>Motivation for User-Centric Tools</strong>: Leaving &#8220;big observability&#8221; to focus on mobile, Andrew prioritizes tools that directly enhance user experience rather than backend metrics, aiming to be closer to end-users.</p></li><li><p><strong>Mobile's Role as a Brand Touchpoint</strong>: He emphasizes that for many brands, the primary consumer interaction happens on mobile. Observability needs to account for this by focusing on user experience in the app, not just backend performance.</p></li><li><p><strong>Challenges in Measuring Mobile Reliability</strong>: Traditional observability emphasizes backend uptime, but Andrew sees a gap in capturing issues that affect user experience on mobile, underscoring the need for end-to-end observability.</p></li><li><p><strong>Observability Over-Focused on Backend Systems</strong>: Andrew points out that &#8220;big observability&#8221; has largely catered to backend engineers due to the immense complexity of backend systems with microservices and Kubernetes. 
Despite mobile being a primary interface for apps like Facebook and Instagram, observability tools for mobile lag behind backend-focused solutions.</p></li><li><p><strong>Lack of Mobile Engineering Leadership in Observability</strong>: Reflecting on a former Meta product manager&#8217;s observations, Andrew highlights the lack of VPs from mobile backgrounds, which has left a gap in observability practices for mobile-specific challenges. This gap stems partly from frontend engineers often seeing themselves as creators rather than operators, unlike backend teams.</p></li><li><p><strong>OpenTelemetry&#8217;s Limitations in Mobile</strong>: While OpenTelemetry provides basic instrumentation, it falls short in mobile due to limited SDK support for languages like Kotlin and frameworks like Unity, React Native, and Flutter. Andrew emphasizes the challenges of adapting OpenTelemetry to mobile, where app-specific factors like memory consumption don&#8217;t align with traditional time-based observability.</p></li><li><p><strong>SREs as Connective Tissue</strong>: Andrew views Site Reliability Engineers (SREs) as essential in bridging backend observability practices with frontend user experience needs. Whether through service level objectives (SLOs) or similar metrics, SREs help ensure that backend metrics translate into positive end-user experiences&#8212;a critical factor in retaining app users.</p></li><li><p><strong>Amazon&#8217;s Operational Readiness Review</strong>: Drawing from his experience at AWS, Andrew values Amazon&#8217;s practice of operational readiness reviews before launching new services. 
These reviews encourage teams to anticipate possible failures or user experience issues, weighing risks carefully to maintain reliability while allowing innovation.</p></li><li><p><strong>Shifting Focus to &#8220;Answerability&#8221; in Observability</strong>: For Andrew, the goal of observability should evolve toward &#8220;answerability,&#8221; where systems provide engineers with actionable answers rather than mere data. He envisions a future where automation or AI could handle repetitive tasks, allowing engineers to focus on enhancing user experiences instead of troubleshooting.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[#62 - Early YouTube SRE shares Modern Reliability Strategy]]></title><description><![CDATA[Hear about Andrew Fong's thoughts on modern SRE as an early employee at YouTube, VP of Infra @ Dropbox, co-founder of a cloud infra startup and Senior Director of Engineering at Databricks.]]></description><link>https://read.srepath.com/p/62-early-youtube-sre-shares-modern</link><guid isPermaLink="false">https://read.srepath.com/p/62-early-youtube-sre-shares-modern</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 05 Nov 2024 13:40:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151182017/df3b19cd2e88d6c3cefe0b329d5b3e86.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Andrew Fong&#8217;s take on engineering cuts through the usual role labels, urging teams to start with the problem they&#8217;re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. </p><p>In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. 
It&#8217;s a values-first, practical approach to tackling tough challenges that engineers face every day.</p><p>Here&#8217;s a slightly deeper dive into the concepts we discussed:</p><ol><li><p><strong>Career and Evolution in Tech</strong>: Andrew shares his journey through various roles, from early SRE at YouTube to VP of Infrastructure at Dropbox to Director of Engineering at Databricks, with extensive experience in infrastructure through three distinct eras of the internet. He emphasizes the transition from early infrastructure roles into specialized SRE functions, noting the rise of SRE as a formalized role and the evolution of responsibilities within it.</p></li><li><p><strong>Building Prodvana and the Future of SRE</strong>: As CEO of the startup Prodvana, Andrew is focused on an "intelligent delivery system" designed to simplify production management for engineers, addressing cognitive overload. He highlights SRE as a field facing new demands due to AI, discussing insights shared with Niall Murphy and Corey Bertram around AI's potential in the space, distinguishing it from "Web3" hype, and affirming that while AI will transform SRE, it will not eliminate it.</p></li><li><p><strong>Challenges of Migration and Integration</strong>: Reflecting on experiences at YouTube post-acquisition by Google, Andrew discusses the challenges of migrating YouTube&#8217;s infrastructure onto Google&#8217;s proprietary, non-thread-safe systems. This required extensive adaptation and &#8220;glue code,&#8221; offering insights into the intricacies and sometimes rigid culture of Google&#8217;s engineering approach at that time.</p></li><li><p><strong>SRE&#8217;s Shift Toward Reliability as a Core Feature</strong>: Andrew describes how SRE has shifted from system-level automation to application reliability, with growing recognition that reliability is a user-facing feature. 
He emphasizes that leadership buy-in and cultural support are essential for organizations to evolve beyond reactive incident response to proactive, reliability-focused SRE practices.</p></li><li><p><strong>Organizational Culture and Leadership Influence</strong>: Leadership&#8217;s role in SRE success is highlighted as crucial, with examples from Dropbox and Google emphasizing that strong, supportive leadership can shape positive, reliability-centered cultures. Andrew advises engineers to gauge leadership attitudes towards SRE during job interviews to find environments where reliability is valued over mere incident response.</p></li><li><p><strong>Outcome-Focused Work Over Titles</strong>: Emphasis on assembling the right team based on skills, not titles, to solve technical problems effectively. Titles often distract from focusing on outcomes, and fostering a problem-solving culture over role-based thinking accelerates teamwork and results.</p></li><li><p><strong>Engineers as Problem Solvers</strong>: Engineers, especially natural problem solvers, generally resist job boundaries and focus on solving problems rather than sticking rigidly to job descriptions. This echoes how iconic tech figures like Steve Jobs valued versatility over predefined roles.</p></li><li><p><strong>Culture as Core Values</strong>: Organizational culture should be driven by core values like reliability, efficiency, and inclusivity rather than rigid processes or roles. For instance, Dropbox's infrastructure culture emphasized being a &#8220;force multiplier&#8221; to sustain product velocity, an approach that ensured values were integrated into every decision.</p></li><li><p><strong>Balancing SRE and Platform Priorities</strong>: The fundamental difference between SRE (Site Reliability Engineering) and platform engineering is their focus: SRE prioritizes reliability, while platform engineering is geared toward increasing velocity or reducing costs. 
Leaders must be cautious when assigning both roles simultaneously, as each requires a distinct focus and expertise.</p></li><li><p><strong>Strategic Trade-Offs in Smaller Orgs</strong>: In smaller companies with limited resources, leaders often face challenges balancing cost, reliability, and other objectives within single roles. It's advised to sequence these priorities rather than burden one individual with conflicting objectives. Prioritizing platform stability, for example, can help improve reliability in the long term.</p></li><li><p><strong>DevOps as a Philosophy</strong>: DevOps is viewed here as an operational philosophy rather than a separate role. The approach enhances both reliability and platform functions by fostering a collaborative, efficient work culture.</p></li><li><p><strong>Focus Investments for Long-Term Gains</strong>: Strategic technology investments, even if they might temporarily hinder short-term metrics (like reliability), can drive long-term efficiency and reliability improvements. 
For instance, Dropbox invested in a shared metadata system to enable active-active disaster recovery, viewing this as essential for future reliability.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[#61 Scott Moore on SRE, Performance Engineering, and More]]></title><description><![CDATA[Scott's got a few interesting things to say about these topics and software operations in general!]]></description><link>https://read.srepath.com/p/61-scott-moore-on-sre-performance</link><guid isPermaLink="false">https://read.srepath.com/p/61-scott-moore-on-sre-performance</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 22 Oct 2024 11:24:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150400932/37055ad2597cbf99d5955d8cca950aa0.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p></p>]]></content:encoded></item><item><title><![CDATA[#60 How to NOT fail in Platform Engineering]]></title><description><![CDATA[Ankit Wal from ThoughtWorks Asia Pacific gave me the inside word on this hot topic]]></description><link>https://read.srepath.com/p/60-how-to-not-fail-in-platform-engineering</link><guid isPermaLink="false">https://read.srepath.com/p/60-how-to-not-fail-in-platform-engineering</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 01 Oct 2024 13:01:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149602222/f1990dbfa464f6d47fb1626c9cdd62db.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Here&#8217;s what we covered:</p><h3>Defining Platform Engineering</h3><ul><li><p><strong>Platform engineering</strong>: Building compelling internal products to help teams reuse capabilities with less coordination.</p></li><li><p><strong>Cloud computing connection</strong>: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.</p></li></ul><h3>Ankit&#8217;s career journey</h3><ul><li><p>Didn't choose platform 
engineering; it found him.</p></li><li><p>Early start in programming (since age 11).</p></li><li><p>Transitioned from a product engineer mindset to building internal tools and platforms.</p></li><li><p>Key experience across startups, the public sector, unicorn companies, and private cloud projects.</p></li></ul><h3>Singapore Public Sector Experience</h3><ul><li><p><strong>Public sector</strong>: Highly advanced digital services (e.g., identity services for tax, housing).</p></li><li><p><strong>Exciting environment</strong>: Software development in Singapore&#8217;s public sector is fast-paced and digitally progressive.</p></li></ul><h3>Platform Engineering Turf Wars</h3><ul><li><p><strong>Turf wars</strong>: Debate among DevOps, SRE, and platform engineering.</p><ul><li><p><strong>DevOps</strong>: Collaboration between dev and ops to think systemically.</p></li><li><p><strong>SRE</strong>: Operations done the software engineering way.</p></li><li><p><strong>Platform engineering</strong>: Delivering operational services as internal, self-service products.</p></li></ul></li></ul><h3>Dysfunctional Team Interactions</h3><ul><li><p><strong>Issue</strong>: Requiring tickets to get work done creates bottlenecks.</p><ul><li><p><strong>Ideal state</strong>: Teams should be able to work autonomously without raising tickets.</p></li><li><p><strong>Spectrum of dysfunction</strong>: From one ticket for one service to multiple tickets across teams leading to delays and misconfigurations.</p></li></ul></li></ul><h3>Quadrant Model (Autonomy vs. 
Cognitive Load)</h3><ul><li><p><strong>Challenge</strong>: Balancing user autonomy with managing cognitive load.</p></li><li><p><strong>Goal</strong>: Enable product teams with autonomy while managing cognitive load.</p></li><li><p><strong>Solution</strong>: Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently.</p><h3>How it pans out</h3><ul><li><p><strong>Low autonomy, low cognitive load</strong>: Dependent on platform teams but a simple process.</p></li><li><p><strong>Low autonomy, high cognitive load</strong>: Requires interacting with multiple teams and understanding technical details (worst case).</p></li><li><p><strong>High autonomy, high cognitive load</strong>: Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation.</p></li><li><p><strong>High autonomy, low cognitive load</strong>: Ideal situation&#8212;teams get what they need quickly without detailed knowledge.</p></li></ul></li></ul><h3>Shift from Product Thinking to Cognitive Load</h3><ul><li><p><strong>Cognitive load focus</strong>: More important than just product thinking&#8212;consider the human experience when using the system.</p></li><li><p><strong>Team Topologies</strong>: Mentioned as a key reference on this concept of cognitive load management.</p></li></ul><h3>Platform as a Product Mindset</h3><ul><li><p><strong>Collaboration</strong>: Building the platform in close collaboration with initial users (pilot teams) is crucial for success.</p></li><li><p><strong>Product Management</strong>: Essential to have a product manager or team dedicated to communication, user journeys, and internal marketing.</p></li></ul><h3>Self-Service as a Platform Requirement</h3><ul><li><p><strong>Definition</strong>: Users should easily discover, understand, and use platform capabilities without human intervention.</p></li><li><p><strong>User Testing</strong>: Watch how users interact with the platform to understand 
stumbling points and improve the self-service experience.</p></li></ul><h3>Platform Team Cognitive Load</h3><ul><li><p><strong>Burnout Prevention</strong>: Platform engineers need low cognitive load as well. Moving from a reactive (ticket-based) model to a proactive, self-service approach can reduce the strain.</p></li><li><p><strong>Proactive Approach</strong>: Self-service models allow platform teams to prioritize development and avoid being overwhelmed by constant requests.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[#59 Who handles monitoring in your team and how?]]></title><description><![CDATA[Monitoring responsibilities vary between organizations, and how your team handles them might differ significantly from others, especially companies like Google.]]></description><link>https://read.srepath.com/p/59-who-handles-monitoring-in-your</link><guid isPermaLink="false">https://read.srepath.com/p/59-who-handles-monitoring-in-your</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 24 Sep 2024 11:35:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149286810/e9971afe5b1fd5e3e4822b8c8923c90a.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<h2>Why many copy Google&#8217;s monitoring team setup</h2><ol><li><p><strong>Google&#8217;s Influence. 
</strong>Google played a key role in defining the concept of software reliability.</p></li><li><p><strong>Success in Reliability.</strong> Few can dispute Google&#8217;s ability to ensure high levels of reliability <em>and</em> its ability to share useful ways to improve it in other settings.</p><p></p><p><em><strong>BUT there&#8217;s a problem:</strong></em></p></li><li><p><strong>It&#8217;s not always replicable.</strong> While Google's practices are admired, they may not be a perfect fit for every team.</p></li></ol><h2>What is Google&#8217;s monitoring approach within teams?</h2><p>Here&#8217;s what Google does:</p><ul><li><p>Google assigns one or two people per team to manage monitoring.</p></li><li><p>Even with centralized infrastructure, a dedicated person handles monitoring.</p></li><li><p>Many organizations use a separate observability team, unlike Google's integrated approach.</p></li></ul><p>If your org is large enough <em>and</em> prioritizes reliability highly enough, you might find it feasible to follow Google&#8217;s model to a tee. </p><p>Otherwise, a centralized team with occasional &#8220;embedded x engineer&#8221; secondments might be more effective.</p><h2>Can your team mimic Google&#8217;s model?</h2><p>Here are a few things you should factor in:</p><h3>Size matters</h3><p>Google's model works because of its scale and technical complexity. Many organizations don&#8217;t have the size, resources, or technology to replicate this.</p><h3>What are the options for your team?</h3><h4>Dedicated monitoring team (very popular but $$$)</h4><p>If you have the resources, you might create a dedicated observability team. This might call for a ~$500k+ personnel budget, so it&#8217;s not something that a startup or SME can easily justify. </p><h4>Dedicate SREs to monitoring work (effective but difficult to manage)</h4><p>You might do this on rotation or make an SRE permanently &#8220;responsible for all monitoring matters&#8221;. 
Putting SREs on permanent tasks can lead to burnout if the work doesn&#8217;t suit their goals, and rotation work requires effective planning.</p><h4>Internal monitoring experts (useful but hard to build)</h4><p>One or more engineers within teams could take on monitoring/observability responsibilities and support the team&#8217;s needs. This should be how we get monitoring work done, but it&#8217;s hard to get volunteers across a majority of teams. </p><h2>Transitioning monitoring from project work to maintenance</h2><h3>2 distinct phases</h3><h4>Initial Setup (the &#8220;project&#8221;) </h4><p>SREs may help set up the monitoring/observability infrastructure. </p><p>Since they have a breadth of knowledge across systems, they can help connect disparate services and instrument applications effectively.</p><h4>Post-project phase (&#8220;keep the lights on&#8221;)</h4><p>Once the system is up, the focus shifts from project mode to ongoing operational tasks. But who will do that?</p><h3>Who will maintain the monitoring system?</h3><h4>Answer: usually not the same team</h4><p>After the project phase, a new set of people&#8212;often different from the original team&#8212;typically handles maintenance.</p><h4>Options to consider (once again)</h4><ol><li><p><strong>Spin up a monitoring/observability team.</strong> Create a dedicated team for observability infrastructure.</p></li><li><p><strong>Take a decentralized approach. </strong>Engineers across various teams take on observability roles as part of their regular duties.</p></li><li><p><strong>Internal monitoring/observability experts.</strong> They can take responsibility for monitoring and ensure best practices are followed.</p></li></ol><p>The key thing to remember here is&#8230;</p><h3>Adapt to Your Organizational Context</h3><h4>One size doesn&#8217;t fit all</h4><p>Google's model may not work for everyone. 
Tailor your approach based on your organization&#8217;s specific needs.</p><h4>The core principle to keep in mind</h4><p>As long as people understand why monitoring/observability matters and pay attention to it, you're on the right track.</p><h4>Work according to engineer awareness</h4><p><em><strong>If engineers within product and other non-operations teams are aware of monitoring:</strong> </em>You can attempt to <strong>decentralize the effort</strong> and involve more team members.</p><p><em><strong>If awareness or interest is low:</strong></em> Consider <strong>dedicated observability roles</strong> or an SRE team to ensure monitoring gets the attention it needs.</p><h2>In conclusion</h2><p>There&#8217;s no universal solution. </p><p>Whether you centralize or decentralize monitoring depends on your team&#8217;s structure, size, and expertise. </p><p>The important part is ensuring that observability practices are understood and implemented in a way that works best for your organization.</p><div><hr></div><p>PS. Rather than spend an hour on writing, I decided to write in the style I normally use in a work setting, i.e., &#8220;executive shorthand&#8221;. Tell me what you think.</p>]]></content:encoded></item><item><title><![CDATA[#58 Fixing Monitoring's Bad Signal-to-Noise Ratio]]></title><description><![CDATA[Sebastian and I looked further into common pitfalls in monitoring. A major issue is the poor signal-to-noise ratio of data. This often results from having too much irrelevant... 
(read below)]]></description><link>https://read.srepath.com/p/58-fixing-monitorings-bad-signal</link><guid isPermaLink="false">https://read.srepath.com/p/58-fixing-monitorings-bad-signal</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 17 Sep 2024 12:13:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148997261/9271a7ba5a94bf6df96c07d83a9078a1.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It&#8217;s a challenge that&#8217;s been around since the beginning of software development and will persist for years to come. </p><p>The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. </p><p>This interrupts workflows, affects personal time, and even disrupts sleep.</p><p>Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. </p><p>When legitimate alerts get lost in a sea of irrelevant data, pinpointing the root cause becomes exceptionally hard.</p><p>Sebastian proposes a fundamental fix for this data overload: <strong>be deliberate with the data you emit.</strong> </p><p>When instrumenting your systems, be intentional about what data you collect and transport. </p><p>Overloading with irrelevant information makes it tough to isolate critical alerts and find the one piece of data that indicates a problem.</p><p>To combat this, focus on:</p><ol><li><p><strong>Being Deliberate with Data</strong>. Make sure that every piece of telemetry data serves a clear purpose and aligns with your observability goals.</p></li><li><p><strong>Filtering Data Effectively</strong>. Improve how you filter incoming data to eliminate less relevant information and retain what's crucial.</p></li><li><p><strong>Refining Alerts</strong>. 
Optimize alert rules, for example by creating tiered alerts that distinguish critical issues from minor warnings.</p></li></ol><p>Dan Ravenstone, who leads the platform team at Top Hat, discussed &#8220;triaging alerts&#8221; recently.</p><p>He shared that managing millions of alerts, often filled with noise, is a significant issue. </p><p>His advice: <strong>scrutinize alerts for value, ensuring they meet the criteria of a good alert, and discard those that don&#8217;t impact the user journey</strong>.</p><p>According to Dan, the anatomy of a good alert includes:</p><ul><li><p>A run book</p></li><li><p>A defined priority level</p></li><li><p>A corresponding dashboard</p></li><li><p>Consistent labels and tags</p></li><li><p>Clear escalation paths and ownership</p></li></ul><p>To elevate your approach, consider using aggregation and correlation techniques to link otherwise disconnected data, making it easier to uncover patterns and root causes.</p><p>The learning point is simple: <strong>aim for quality over quantity.</strong> </p><p>By refining your data practices and focusing on what's truly valuable, you can enhance the signal-to-noise ratio, ultimately allowing more time for deep work rather than constantly managing incidents.</p>]]></content:encoded></item><item><title><![CDATA[#57 How Technical Leads Support Software Reliability]]></title><description><![CDATA[You might be familiar with the term &#8220;technical lead&#8221;. You might even be working with one or a few right now. 
But how well do you know their ability to support your reliability work?]]></description><link>https://read.srepath.com/p/57-how-technical-leads-support-software</link><guid isPermaLink="false">https://read.srepath.com/p/57-how-technical-leads-support-software</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 10 Sep 2024 12:10:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148676546/8510cd70601fee529f8e83dbc8905a1a.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>The question then condenses to: <em>Can technical leads support reliability work?</em> </p><p>Yes, they can! </p><p>Anemari has been a technical lead for years &#8212; even spending a few years doing that at the coveted consultancy, Thoughtworks &#8212; and now coaches others. </p><p>She and I discussed the link between this role and software reliability.</p><h2>Technical lead &#8800; Engineering manager</h2><p>According to Anemari, a tech lead is a person with technical knowledge who is responsible for leading a technical team to align them with a common technical strategy. </p><p>For the most part, engineering managers are focused on the people aspect of the work. They coach engineers and other technical folk to improve their performance. </p><p><strong>In terms of activities, engineering managers</strong> coach, mentor, and support the development of their team members or direct reports. They&#8217;ll also go to bat for the team within the organization. </p><p><strong>Technical leads are more focused on</strong> guiding the technical work that these people do. Their focus includes work like architecture, design patterns, and implementing projects. They offer technical insights and mentorship to the team.</p><p>Anemari found in her work experience that the use of technical leads varies from company to company:</p><blockquote><p>I've worked with teams that don't have tech leads and then the engineering manager takes a more hands-on approach. 
Then you have teams where you have a tech lead only focused on tech, and then you have like a team lead or engineering manager, doing the people side.</p></blockquote><p>Interestingly, she found that it was very difficult to singularly focus on the technical side of the work:</p><blockquote><p>[Even] if you only want to focus on tech as a tech lead, you still end up having to deal with the people side because most tech problems are people problems in the end and so you kind of require both.  </p></blockquote><h2>How can technical leads drive reliability principles?</h2><p>Anemari advised me that tech leads are often required to think about reliability principles as part of their technical strategy.</p><p>Sometimes, teams don&#8217;t have an SRE team supporting them and are operating in a <strong>&#8220;you build it, you run it&#8221; mode. In this situation, reliability becomes a 90+ percent responsibility of the team.</strong></p><p>In other words, the product team has had to develop the reliability-focused knowledge required to build and run their product reliably.</p><p>The technical lead can help the less experienced team members pick up reliability concepts and not just have that mindset of creating more features and making the burndown chart look pretty.</p><p>However, tech leads usually prefer working with a Site Reliability Engineering (SRE) team where possible.</p><h2>Working with reliability teams like SREs </h2><p>Anemari believes it&#8217;s important to be aware of reliability principles. She put it as knowing:</p><div class="pullquote"><p>What does it mean for my product to be reliable?</p></div><p>So, the first critical step is to align the definition of reliability that comes from pondering this question with the known constraints of the product. 
These constraints can be:</p><ul><li><p>performance limits</p></li><li><p>operational environments</p></li><li><p>user expectations</p></li></ul><p>Anemari went on in our conversation to challenge the notion of simply stating reliability requirements to product teams. </p><p>It may seem obvious to reliability folk, but it&#8217;s one thing in a very large pile of considerations for a product team:</p><blockquote><p>So I think it's important when we're talking about [reliability]&#8230; when we're talking about uptime, like, what does that mean? And why do we need 99 percent or whatever?</p></blockquote><p>The onus falls on reliability engineers to support this thinking.</p><p>Technical leads can be a good point of contact for that initial alignment conversation. The reliability team can advise about the standards we mentioned before. </p><p>So <strong>the first step is co-creating a common definition of reliability</strong>. </p><p>The second step is continuous communication between the product and reliability teams. Anemari believes there must be an easy way for either party to communicate with the other.</p><p>&#8220;Creating a ticket&#8221; to reach the reliability team or vice versa does not work well.</p><p>Anemari has seen a Slack channel between the SRE team and each product team work well. 
</p><blockquote><p>You can go in there, pop up a question and take it from there and solve problems.</p></blockquote><p>Anemari is also a fan of the embedded SRE model.</p><blockquote><p>If you have to build something new from scratch and you have to build a whole infrastructure, another strategy that I've seen is bringing a person with SRE knowledge [into] the team for a while, helping us define [technical work] and being hands on.</p></blockquote><p>A long-term embedded SRE, however, might be an antipattern.</p><blockquote><p>I&#8217;ve had some teams that had an SRE team member&#8230; all the time, but I&#8217;ve also seen them struggle with the fact that at some point there is not so much to do.</p></blockquote><p>An embedded SRE should be brought in to enable the teams to do more and take care of their products themselves, by training the team on reliability topics &#8212; even at a high level &#8212; like understanding all the different dashboards that they have.</p><h2>What can SREs do better to support product teams with reliability efforts?</h2><p>Anemari told me that it would be great if SREs first put more effort into helping the product team understand why reliability is so important. It&#8217;s obvious to us in this space, but as I mentioned earlier, people have n+100 other things on their minds.</p><p>Product teams need to be made more aware of what the reliability engineers are taking care of and how that can affect their products. </p><p>There needs to be more conversation than what Anemari has experienced in a lot of encounters with SRE teams:</p><blockquote><p>The SRE teams are like, &#8220;Just give it to us. We'll take care of it. You know, like, we know what to do. 
I don't have to explain all of these things to you.&#8221; </p></blockquote><p>She found it more effective to sit down with the reliability team and try to understand questions that might seem trivial but are important, like:</p><ul><li><p>Why is our service not processing enough requests and how can we change it?</p></li><li><p>Why can&#8217;t we just drop this?</p></li><li><p>Why do we need to have all of these services? </p></li></ul><p>Asking these kinds of questions helped the collaborators come up with a better solution than what the SRE team could come up with on their own.</p><p>Anemari recalled a time when an SRE team came in and introduced themselves to the product team. They then covered how their services ran in the background. This led the product team to ask questions and brainstorm what could be improved.</p><p>The intentionality is important: the SRE team did not come with horns blaring that they were going to change the systems. They came in to discuss and brainstorm ideas. </p><h2>But will the software engineers cooperate?</h2><p>A lot of SREs I've spoken with are frustrated and think developer teams do not want to understand reliability.</p><p>But I'm sure at least <em>some</em> teams want to understand how reliability works. </p><p>Sometimes you have to just swallow that bitter pill and say, &#8220;Hey, look, I'm going to spell out every single thing I'm doing to you, just so that maybe in the future you can do it yourself.&#8221; </p><p>I think a lot of developers remember their early days and want to be able to run their product entirely on their own. </p><p>Anemari confirmed from her experience that developers want to have full power over what's happening with their product. Dependency is not necessarily something developers like very much.</p><p>There's of course the consideration of the learning journey. </p><blockquote><p>There's that sweet spot that is hard to find early on, &#8220;How much do I need to know? 
Like, do I need to become an SRE expert now to run my service? Or how much do I need to know to just run my service?&#8221; </p></blockquote><p>Anemari recalled one particular high-performing team that she led. It was high-performing partly because the team was able to run services fully with very little support from an external SRE team.</p><blockquote><p>We [brought] someone we knew with knowledge&#8230; to fully understand how does our AWS work and how to restart our services and everything so that we were able to fully run it and be on call for it.</p></blockquote><p>The sweet spot notion comes in once again where you have to work out, <em>&#8220;How much do I need to develop and spend time on developing this SRE knowledge, and how much do I put into developing the product?&#8221;</em></p><p>It's complicated, but communication is the key to it.</p><h2>How can product teams handle conflicting priorities?  </h2><p>Anemari told me that the answer depends on the context: </p><ul><li><p>where you are</p></li><li><p>what situation your product is in </p></li><li><p>how the developer and SRE teams are laid out</p></li></ul><p>It&#8217;s very different to start a project from scratch versus having a whole monolith running for years and trying to make things better. </p><p>Anemari added:</p><blockquote><p>It might sound a little bit crazy but I still think that it's very important for you to like write down all of these different things that you think you need to solve&#8230; </p><p>A big part of the role of a tech lead is to make sure that all of these different parts agree and align on a strategy. So that means the product, that means the developer, that means the SREs, that means the customer support, etc.</p></blockquote><p>A tech lead would work to bring all of this together and then present a proposal to all of these different stakeholders. You might have to put it as simply as, &#8220;This is what we have to focus on right now, given that as a product team. 
We have to keep delivering.&#8221;</p><p>Then come the compromises. For example, &#8220;20 percent of our time would go into improving our delivery pipeline.&#8221; </p><p>It is crucial to work out compromises before getting agreement from the various stakeholders in the software delivery organization. </p><div><hr></div><p>This write-up was just a preview of what Anemari and I talked about in this episode of the Reliability Enablers. Be sure to listen to get the other half of our conversation.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[#56 Resolving DORA Metrics Mistakes]]></title><description><![CDATA[I asked Nathen Harvey who is lead DORA advocate at Google about some of the mistakes people make. The major ones seem to come down to management misconceptions about what it can do.]]></description><link>https://read.srepath.com/p/56-resolving-dora-metrics-mistakes</link><guid isPermaLink="false">https://read.srepath.com/p/56-resolving-dora-metrics-mistakes</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Wed, 04 Sep 2024 12:02:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148441662/fc4197978ba67641e1eb7586a50ae120.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We're already well into 2024 and it&#8217;s sad that people still have enough fuel to complain about various aspects of their engineering life.  </p><p>DORA seems to be turning into one of those problem areas.</p><p>Not at every organization, but some places are turning it into a case of &#8220;hitting metrics&#8221; without caring for the underlying capabilities and conversations.</p><p>Nathen Harvey is no stranger to this problem.</p><p>He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018. </p><p>His focus has been on questions like:</p><div class="pullquote"><p>How do we help teams get better at delivering and operating software? 
</p></div><p>You and I can agree that this is an important question to ask. </p><p>I&#8217;d listen to what he has to say about DORA because he&#8217;s got a wealth of experience behind him, having also run community engineering at Chef Software.</p><p>Before we continue, let&#8217;s explore <strong>What is DORA? </strong>in Nathen&#8217;s (paraphrased) words:</p><p>DORA is a software research program that's been running since 2015.</p><p>This research program looks to figure out:</p><div class="pullquote"><p>How do teams get good at delivering, operating, building, and running software? </p></div><p>The researchers were able to draw out the concept of the metrics by <strong>correlating good technology practices with highly robust software delivery outcomes</strong>.</p><p>They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.</p><p>Essentially, all those things that matter to the business.</p><p>One of the challenges the researchers found over the last decade was working out: <em>how do you measure something like software delivery?</em> </p><p>It's not the same as a factory system, where you can simply go and count the widgets being delivered.</p><p>The unfortunate problem is that the factory mindset still leaks in. I&#8217;ve personally noted some silly metrics over the years, like lines of code.</p><p>Imagine being asked constantly: &#8220;How many lines of code did you write this week?&#8221;</p><p>You might not have to imagine. It might be a reality for you. </p><p>DORA&#8217;s researchers agreed that the factory mode of metrics cannot determine whether or not you are a productive engineer. 
</p><p>They settled on and validated 4 key measures for software delivery performance.</p><p>Nathen elaborated that <strong>2 of these measures look at throughput</strong>:</p><blockquote><p>[Those] two [that] look at throughput really ask two questions:</p><ol><li><p>How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, to go from the developer's workstation right through to production?</p></li></ol><p>And then the second question on throughput is:</p><ol start="2"><li><p>How frequently are you updating production?</p></li></ol></blockquote><p>In plain English, these 2 metrics are:</p><ol><li><p><strong>Deployment Frequency</strong>: Measures how often code is deployed to production. This metric reflects the team's ability to deliver new features or updates quickly.</p></li><li><p><strong>Lead Time for Changes</strong>: Measures the time it takes from code being committed to being deployed to production. </p></li></ol><p>Nathen recounted his experience of working at organizations that differed in how often they updated production, from once every six months to multiple times a day. </p><p>Those are very different types of organizations, so their perspectives on throughput metrics will be wildly different. </p><p>This has some implications for the <em>speed</em> of software delivery.</p><p>Of course, <strong>everyone wants to move faster, but there&#8217;s this other thing that comes in and that's stability</strong>.</p><p>And so, the other two stability-oriented metrics look at:</p><blockquote><p>What happens when you do update production and... something's gone horribly wrong. &#8220;Yeah, we need to roll that back quickly or push a hot fix.&#8221; </p></blockquote><p>In plain English, they are:</p><ol start="3"><li><p><strong>Change Failure Rate</strong>: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs). 
</p></li><li><p><strong>Failed Deployment Recovery Time</strong>: Measures how long it takes to recover from a failure in production. </p></li></ol><p>You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.</p><p>But keep in mind, it&#8217;s about balancing all 4 metrics. </p><p>Nathen believes it&#8217;s fair to say that today, <strong>many organizations look at throughput and stability as tradeoffs of one another</strong>. </p><div class="pullquote"><p>We can either be fast or we can be stable. </p></div><p>But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't trade-offs of one another.</p><p>They tend to move together. They&#8217;ve seen organizations of every shape and size, in every industry, doing well across all four of those metrics. </p><p>They are the best performers. </p><p>The interesting thing is that the size of your organization doesn't matter, and neither does your industry.</p><p>Whether you&#8217;re working in a highly regulated or unregulated industry, it doesn't matter.</p><p>The key insight that Nathen thinks we should be searching for is: <strong>how do you get there?</strong>  </p><p>To him, it's about shipping smaller changes. </p><p>When you ship small changes, they're easier to move through your pipeline. </p><p>They're easier to reason about. </p><p>And when something goes wrong, they're easier to recover from and restore service.</p><p>But along with those small changes, we need to think about those feedback cycles.</p><p>Every line of code that we write is in reality a little bit of an experiment. 
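</p>

<p>As a concrete illustration (not something from the episode), the four measures can be computed from a simple deployment log. Here is a minimal sketch; the record format and numbers are invented:</p>

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: when a change was committed, when it
# reached production, whether it failed, and when service was restored.
deploys = [
    {"committed": datetime(2024, 9, 1, 9, 0), "deployed": datetime(2024, 9, 1, 15, 0),
     "failed": False, "restored": None},
    {"committed": datetime(2024, 9, 2, 10, 0), "deployed": datetime(2024, 9, 2, 11, 30),
     "failed": True, "restored": datetime(2024, 9, 2, 12, 0)},
    {"committed": datetime(2024, 9, 3, 8, 0), "deployed": datetime(2024, 9, 3, 9, 0),
     "failed": False, "restored": None},
]
days_observed = 3

# Throughput: deployment frequency and lead time for changes.
deployment_frequency = len(deploys) / days_observed          # deploys per day
median_lead_time = median(d["deployed"] - d["committed"] for d in deploys)

# Stability: change failure rate and failed deployment recovery time.
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)
recovery_times = [d["restored"] - d["deployed"] for d in failures]

print(deployment_frequency)   # 1.0 (one deploy per day)
print(median_lead_time)       # 1:30:00
```

<p>In practice, the records would come from your CI/CD system, and you would bucket results per service and per week or month rather than over three deploys.</p>

<p>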
</p><p>We think it's going to do what we expect and it's going to help our users in some way, but we need to get feedback on that as quickly as possible.</p><p>Underlying all of this, both small changes and getting fast feedback, is a real climate for learning. Nathen drew up a few thinking points from this:</p><blockquote><p>So what is the learning culture like within our organization? </p><p>Is there a climate for learning? </p><p>And are we using things like failures as opportunities to learn, so that we can ever be improving?</p></blockquote><p>I don&#8217;t know if you&#8217;re thinking the same as me, but we're already learning that DORA is a lot more than just metrics.</p><p>To Nathen (and me), <strong>the metrics should be one of the least interesting parts of DORA, because DORA digs into useful capabilities, like small changes and fast feedback</strong>. </p><p>That&#8217;s what truly helps determine how well you're going to do against those performance metrics.</p><p>Not by saying &#8220;We are a low to medium performer. Now go and improve the metrics!&#8221;</p><p>I think the issue is that a lot of organizations emphasize the metrics because they are something that can sit on an executive dashboard.</p><p>But the true reason we have metrics is to help drive conversations.</p><p>Through those conversations, we drive improvement.</p><p>That&#8217;s important because an unfortunately noticeable number of organizations are currently doing this, according to Nathen:</p><blockquote><p>I've seen organizations [where it&#8217;s like]: &#8220;Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.&#8221;  </p><p>That doesn't change anything. </p><p>We have to go the step further and put those metrics into action.</p></blockquote><p>We should be treating the metrics as a kind of compass on a map. 
</p><p>You can use those metrics to help orient yourself and understand, &#8220;Where are we heading?&#8221;</p><p>But then you have to choose how you are going to make progress toward whatever your goal is.</p><p>The capabilities enabled by the DORA framework should help answer questions like:</p><ul><li><p>Where are our bottlenecks?</p></li><li><p>Where are our constraints?</p></li><li><p>Do we need to do some improvement work as a team?</p></li></ul><p>We also talked about the SPACE framework, which is a follow-on tool from DORA metrics. </p><p>It is a framework for understanding developer productivity. </p><p>It encourages teams or organizations to <strong>look at five dimensions when trying to measure something from a productivity perspective</strong>.</p><p>It stands for:</p><ul><li><p><strong>S</strong> &#8212; satisfaction and well-being</p></li><li><p><strong>P</strong> &#8212; performance</p></li><li><p><strong>A</strong> &#8212; activity</p></li><li><p><strong>C</strong> &#8212; communication and collaboration</p></li><li><p><strong>E</strong> &#8212; efficiency and flow</p></li></ul><p>What the SPACE framework recommends is that you first pick metrics from two or three of those five categories. </p><p>(You don't need a metric from every one of the five; find something that works well for your team.)</p><p>Then write down those metrics and start measuring them. </p><p>Here&#8217;s the interesting thing: <strong>DORA is an implementation of SPACE.</strong> </p><p>You can correlate each metric with the SPACE acronym!</p><ul><li><p>Lead time for changes is a measure of <strong>E</strong>fficiency and flow</p></li><li><p>Deployment frequency is an <strong>A</strong>ctivity</p></li><li><p>Change fail rate is about <strong>P</strong>erformance</p></li><li><p>Failed deployment recovery time is about <strong>E</strong>fficiency and flow</p></li></ul><p>Keep in mind that SPACE itself has no metrics. 
</p><p>It is a framework for identifying metrics.</p><p>Nathen reiterated that you can't use &#8220;the SPACE metrics&#8221; because there is no such thing. </p><p>I mentioned earlier how DORA is a means of identifying the capabilities that can improve the metrics.</p><p>These can be technical practices like using continuous integration.</p><p>But they can also be capabilities like collaboration and communication. </p><p>As an example, you might look at what your change approval process looks like. </p><p>You might look at how collaboration and communication have failed when you&#8217;ve had to send changes off to an external approval board like a CAB (change approval board).</p><p>DORA&#8217;s research backs the above up:</p><blockquote><p>What our research has shown through collecting data over the years is that, while they do exist, on the whole <strong>an external change approval body will slow you down.</strong></p><p>That's no surprise. So <strong>your change lead time is going to increase, your deployment frequency will decrease</strong>.  </p><p>But, <strong>at best, they have zero impact on your change fail rate.</strong> In most cases, they have a negative impact on your change fail rate. So you're failing more often.</p></blockquote><p>It goes back to the idea of smaller changes, faster feedback, and being able to validate that. Building in audit controls and so forth.</p><p>This is something that reliability-focused engineers should be able to help with. One of the things Sebastian and I talk about a lot is embracing and managing risk effectively, not trying to mitigate it through stifling measures like CABs. 
</p><p>In short, DORA and software reliability are not mutually exclusive concepts.</p><p>They're certainly in the same universe.</p><p>Nathen went as far as to say that <strong>some SRE practices go a little bit deeper than the capability level that DORA covers</strong> and provide even more specific guidance on how to do things.</p><p>He clarified a doubt I had, because a lot of people have argued with me (mainly at conferences) that DORA is something that developers do, earlier in the SDLC.</p><p>And that SRE is completely different because it focuses on the production side. </p><div class="pullquote"><p>The worst possible situation could be turning to developers and saying, &#8220;These 2 throughput metrics, they&#8217;re yours. Make sure they go up no matter what,&#8221; and then turning to our SREs and saying &#8220;Those stability metrics, they're yours. Make sure they stay good&#8221; </p><p>All that does is put these false incentives in place and we're just fighting against each other.</p></div><p>We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.</p><p>Here are some useful links from Nathen for further research:</p><p><a href="https://dora.community/">DORA online community of practice</a></p><p><a href="https://dora.dev">DORA homepage</a></p><p><a href="https://queue.acm.org/detail.cfm?id=3454124">[Article] The SPACE of Developer Productivity</a></p><p><a href="https://linktr.ee/nathenharvey">Nathen Harvey's Linktree</a></p>]]></content:encoded></item><item><title><![CDATA[#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards]]></title><description><![CDATA[I've noticed that many engineers use monitoring data for only 2 purposes: alerts and dashboards. There are a few more things engineers can do with that data. 
We'll explore them in this post.]]></description><link>https://read.srepath.com/p/54-3-uses-for-monitoring-data-other</link><guid isPermaLink="false">https://read.srepath.com/p/54-3-uses-for-monitoring-data-other</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 27 Aug 2024 12:05:04 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147706456/e1a10448938530d00a84be37c4b6ad4e.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We&#8217;ll explore 3 use cases for monitoring data. They are:</p><ol><li><p>Analyzing long-term trends</p></li><li><p>Comparing over time or experiment groups</p></li><li><p>Conducting ad hoc retrospective analysis </p></li></ol><h2>Analyzing long-term trends  </h2><p>You can ask yourself a few simple questions as a starting point:</p><ul><li><p>How big is my database?</p></li><li><p>How fast is the database growing? </p></li><li><p>How quickly is my user count growing?</p></li></ul><p>As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:</p><ul><li><p>How is the database performance evolving? Are there signs of degradation?</p></li><li><p>Is there consistent growth in data volume that may require future infrastructure adjustments?</p></li><li><p>How is overall resource utilization trending over time across different services?</p></li><li><p>How is the cost of cloud resources evolving, and what does that mean for budget forecasting?</p></li><li><p>Are there recurring patterns in downtime or service degradation, and what can be done to mitigate them?</p></li></ul><p>Sebastian mentioned that it's a part of observability he enjoys doing. I can understand why. 
It&#8217;s exciting to see how components are changing over a period and working out solutions before you end up in an incident response nightmare.</p><p><strong>Effectively analyzing trends requires the right data retention settings. </strong>If you're throwing out your logs, traces, and metrics too early, you will not have enough historical data to do this kind of work.</p><p>Doing this right means having enough data in place to analyze those trends over time, and that will of course depend on your desired period. </p><h2>Comparing over time or experiment groups</h2><h3>Google&#8217;s definition</h3><p>You're comparing data results across the groups you want to contrast. Using a few examples from the SRE (2016) book:</p><ul><li><p>Are your queries faster in this version of the database than in that one? </p></li><li><p>How much better is my memcache hit rate with an extra node? Is my site slower than it was last week? </p></li></ul><p>You're comparing different buckets of time and different types of products.</p><h3>A proper use case for comparing groups</h3><p>Sebastian did this particular use case recently because he had to compare two different technologies for deploying code: AWS Lambda vs AWS Fargate ECS. </p><p>He took those two services and played around with different memory sizes and virtual CPU counts. Then he ran different volumes of requests against those settings to figure out which technology was the more cost-effective option.</p><p>His need for this went beyond engineering work to <strong>enabling product teams with the right decision-making data</strong>. 
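</p>

<p>Sebastian&#8217;s exact methodology wasn&#8217;t something we walked through line by line, but a back-of-the-envelope version of such a comparison can be sketched as below. Every price and workload figure here is a made-up placeholder, not real AWS pricing:</p>

```python
# Hypothetical monthly cost model comparing a per-request service
# (Lambda-style) with an always-on container (Fargate-style).
# All prices are illustrative placeholders, not real AWS rates.

PER_REQUEST_PRICE = 0.0000002   # $ per invocation (made up)
PER_GB_SECOND = 0.0000166667    # $ per GB-second of execution (made up)
FARGATE_HOURLY = 0.05           # $ per hour for one running task (made up)

def lambda_cost(requests_per_month, avg_duration_s, memory_gb):
    """Cost scales with traffic: pay per request and per GB-second."""
    compute = requests_per_month * avg_duration_s * memory_gb * PER_GB_SECOND
    return requests_per_month * PER_REQUEST_PRICE + compute

def fargate_cost(task_count, hours_per_month=730):
    """Cost is flat: pay for running tasks regardless of traffic."""
    return task_count * hours_per_month * FARGATE_HOURLY

# Compare across traffic levels to find the crossover point.
for monthly_requests in (100_000, 1_000_000, 10_000_000, 100_000_000):
    lam = lambda_cost(monthly_requests, avg_duration_s=0.2, memory_gb=0.5)
    far = fargate_cost(task_count=2)
    winner = "lambda" if lam < far else "fargate"
    print(f"{monthly_requests:>11,} req/mo: lambda=${lam:,.2f} fargate=${far:,.2f} -> {winner}")
```

<p>The point of a sketch like this is the crossover: per-request pricing wins at low traffic, flat always-on pricing wins at sustained high traffic. Real figures come from your provider&#8217;s pricing pages and your own load tests.</p>

<p>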
He wrote out a knowledge base article to give them guidance for a more educated decision on the right AWS service.</p><p>Having the data to compare the two services allowed him to answer questions like:</p><ul><li><p>When should you be using either of these technologies? </p></li><li><p>What use cases would either technology be more suitable for?</p></li></ul><p>This decision support draws mainly on monitoring or observability data. Using monitoring data to compare tools and technologies for guiding product teams is something I think reliability folks can gain a lot of value from. </p><h2>Conducting ad hoc retrospective analysis (debugging)</h2><p>Debugging is a bread-and-butter responsibility for software engineers of any level. </p><p>It&#8217;s something that everybody should know well, because there are very effective and also very ineffective ways of going about debugging. </p><p><strong>Monitoring data can help make the debugging process fall into the effective side.</strong></p><p>There are organizations where you have 10 different systems. In one system, you might get one fragmented piece of information. In another, you&#8217;ll get another fragment. And so on for all the different systems. </p><p>Then you have to correlate these pieces of information in your head and, hopefully, get some clarity out of the fragments to form some kind of insight. </p><p>Monitoring data brought together into one datastream can help correlate and combine all these pieces of information. 
With it, you can:</p><ol><li><p><strong>Pinpoint slow-running queries or functions</strong> by analyzing execution times and resource usage, helping you identify inefficiencies in your code</p></li><li><p><strong>Correlate application logs with infrastructure metrics</strong> to determine if a performance issue is due to code errors or underlying infrastructure problems</p></li><li><p><strong>Track memory leaks or CPU spikes</strong> by monitoring resource usage trends, which can help you identify faulty code or services</p></li><li><p><strong>Set up detailed error tracking</strong> that automatically flags code exceptions and matches them with infrastructure events, to get to the root cause faster</p></li><li><p><strong>Monitor system load alongside application performance</strong> to see if scaling issues are related to traffic spikes or inefficient code paths</p></li></ol><p>Being able to do all this makes the insight part easier for you, and your debugging approach becomes very different: much more effective and much less time-consuming. It can even make debugging fun.</p><p>You get to the root cause of what is not working much faster. Your monitoring and observability setup can make debugging enjoyable to a certain degree, or it can make it downright miserable. </p><p>If it's done well, it's just one of those things you don't even have to think about. It's just part of your job. You do it, it's very effective, and you move on. 
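</p>

<p>Use number 2 above, correlating application logs with infrastructure metrics, can be sketched in a few lines. The event shapes, timestamps, and thresholds are invented for illustration; in practice both streams would come from your logging and monitoring backends:</p>

```python
from datetime import datetime, timedelta

# Invented sample data: application error logs and CPU samples.
error_logs = [
    {"ts": datetime(2024, 9, 1, 10, 0, 5), "msg": "timeout calling payment-api"},
    {"ts": datetime(2024, 9, 1, 10, 42, 0), "msg": "NullPointerException in checkout"},
]
cpu_samples = [
    {"ts": datetime(2024, 9, 1, 10, 0, 0), "cpu_pct": 97},
    {"ts": datetime(2024, 9, 1, 10, 42, 0), "cpu_pct": 31},
]

def correlate(logs, metrics, window=timedelta(seconds=30), cpu_threshold=90):
    """Label each error 'infrastructure' if a CPU spike occurred within
    the window around it, else 'code' -- a crude first cut at separating
    resource problems from application bugs."""
    results = []
    for log in logs:
        spike = any(
            abs(m["ts"] - log["ts"]) <= window and m["cpu_pct"] >= cpu_threshold
            for m in metrics
        )
        results.append((log["msg"], "infrastructure" if spike else "code"))
    return results

for msg, cause in correlate(error_logs, cpu_samples):
    print(f"{cause}: {msg}")
# prints:
# infrastructure: timeout calling payment-api
# code: NullPointerException in checkout
```

<p>A real pipeline would join on trace IDs or host labels rather than raw timestamps, but the shape of the work, one unified stream and a join condition, is the same.</p>

<p>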
</p><h2>Wrapping up</h2><p>So we've covered three more use cases for monitoring data, other than the usual alerts and dashboards.</p><p>They are once again:</p><ol><li><p>analyzing long-term trends</p></li><li><p>comparing over time or experiment groups and</p></li><li><p>conducting ad hoc retrospective analysis, aka debugging</p></li></ol><p>Next time your boss asks you what all these systems do, you now have three more reasons to focus on your monitoring and use it more effectively. </p><p>Until next time, happy monitoring.</p>]]></content:encoded></item><item><title><![CDATA[#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity]]></title><description><![CDATA[I heard Shlomo Bielak, a VP of Engineering, talk last year in downtown Toronto about the troubles that engineers face in trying to move linearly upward in their careers.]]></description><link>https://read.srepath.com/p/54-becoming-a-valuable-engineer-without</link><guid isPermaLink="false">https://read.srepath.com/p/54-becoming-a-valuable-engineer-without</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 20 Aug 2024 12:09:12 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147695242/2266943f239c9fde2422803f6e8b8b0c.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. </p><p>He&#8217;s dedicated much of his speaking time at DevOps events to a topic less covered at such technical gatherings. 
A lot of what he said alluded to ways to become a more valuable engineer.</p><p>I&#8217;ve broken them down into the following areas:</p><ol><li><p>Avoid the heroic efforts</p></li><li><p>Mind + heart &gt; Mind alone </p></li><li><p>Curiosity &gt; Credentials</p></li><li><p>Experience &gt; Certifications </p></li><li><p>Thinking for complexity</p></li></ol><p>When I saw him in Toronto, I thought he would talk about pre-production observability. That seemed the natural follow-up after watching the previous presenter do a deep dive into Kubernetes tooling.</p><p>But surprisingly, he opened with culture and the need to prevent burnout among engineers &#8212; a topic that is as important today as it was 2 years ago when he did the talk. </p><p>Here&#8217;s a look into Shlomo&#8217;s philosophy and the practices he champions.</p><h2>Avoid the heroic efforts</h2><p>Shlomo's perspective on heroics in engineering and operations challenges a traditional mindset that often glorifies <em>excessive </em>individual efforts at the cost of long-term sustainability. </p><p>He emphasizes that relying on heroics &#8212; where individuals consistently go above and beyond to save the day &#8212; creates an unhealthy work environment. </p><blockquote><p>"We shouldn't be rewarding people for pulling all-nighters to save a project; we should be asking why those all-nighters were necessary in the first place."</p></blockquote><p>This approach not only burns out engineers but also masks underlying systemic issues that need to be addressed. So, instead of celebrating these heroic efforts, Shlomo advocates for creating processes and metrics that ensure smooth operations without the need for constant intervention. </p><h2>Mind + Heart &gt; Mind alone</h2><p>One of the challenges Shlomo has faced recently is scaling his engineering organization amidst rapid growth. 
His approach to hiring is unique; <strong>he doesn&#8217;t just look for technical skills but prioritizes self-awareness and kindness.</strong> </p><blockquote><p>"Hiring with heart means looking for individuals who bring empathy and integrity to the team, not just expertise."</p></blockquote><p>When he joined The Score, a subsidiary of Penn Interactive, Shlomo immediately revamped the hiring practices by integrating the values above into the process. </p><p>He favors role-playing scenarios over solely using behavioral interviews to evaluate candidates, as this method reveals how individuals might <em>react</em> in real production situations. </p><p>I tend to agree with this approach: seeing how people do the work is more enlightening than only asking how they behaved in a past situation. </p><h2>Curiosity &gt; credentials</h2><h3>How it plays into career progression</h3><p>When it comes to career progression, Shlomo places little value on traditional markers like education or years of experience. Instead, he values adaptability, resilience, and curiosity. This last trait is the one he doubles down on.</p><p>According to Shlomo, curiosity is the cornerstone of continuous growth and innovation. It&#8217;s not just about asking questions. It&#8217;s about fostering a mindset that constantly seeks to understand the 'why' behind everything. </p><p>Shlomo advocates for a deep, insatiable curiosity that drives engineers to explore beyond the surface of problems, looking for underlying causes and potential improvements. </p><p>He believes that <strong>this kind of curiosity is what separates good engineers from great ones</strong>, as it leads to discovering solutions that aren&#8217;t immediately obvious and pushes the boundaries of what&#8217;s possible.</p><h3>How it plays into teamwork</h3><p>For Shlomo, curiosity also plays a crucial role in building a cohesive and forward-thinking team. 
He encourages leaders to cultivate an environment where questions are welcomed, and no stone is left unturned. </p><p>This approach not only sparks creativity but also ensures that everyone is engaged in a continuous learning process, which is vital in a field that evolves as rapidly as DevOps and SRE.</p><p>By nurturing curiosity, teams can stay ahead of the curve. They can anticipate challenges before they arise and develop right-fit solutions that keep their work relevant and impactful.</p><p>Shlomo advises engineers not to let their current organization limit them and to always seek out new challenges and learning opportunities. This mindset will make them valuable to any organization they may work with.</p><h2>Experience &gt; Certifications </h2><p>Shlomo&#8217;s stance on certifications is clear: they don&#8217;t necessarily lead to career advancement. He argues that the best engineers are those who are too busy doing the work to focus on accumulating certifications. Instead, he encourages engineers to network with industry leaders, demonstrate their skills, and seek mentorship opportunities. Experience and mentorship, he believes, are far more critical to growth than any piece of paper.</p><h2>Thinking for complexity</h2><p>It&#8217;s a well-trodden saying now, almost a clich&#233;, but still very relevant to standing out in a crowded engineering talent market. </p><p>Shlomo and I talked about the issue of many engineers being trained to think in terms of best practices. I feel like over time, this emphasis will diminish, especially for more senior roles. Best practices are not directly applicable to today&#8217;s increasingly complex problems. </p><p>Shlomo tests potential hires to see whether they can handle complexity. During interviews, he presents candidates with unreasonable scenarios to test their ability to think outside the box. 
</p><p>This approach not only assesses their problem-solving skills but also helps them understand the interconnectedness of the challenges they will face.</p><h2>Wrapping up</h2><p>The insights Shlomo shared with me underscore a crucial point:</p><p><strong>The most successful engineers are those who combine technical prowess with a strong sense of curiosity, a commitment to continuous improvement, and a genuine understanding of their role within the team</strong>. </p><p>By embracing these qualities, you not only enhance your current contributions but also set yourself on a path for long-term growth and success. </p><p>The takeaway is clear: to truly stand out and advance in your career, it's not just about doing your job well &#8212; it's about constantly seeking to learn more, improve processes, and connect with your team on a deeper level.</p><p>These are the traits that make you not just a good engineer, but a valuable one.</p>]]></content:encoded></item><item><title><![CDATA[#53 What's Missing in Incident Response Processes?]]></title><description><![CDATA[Bonus clip from my chat with Dr Vladislav Ukis about the issues that beginners to reliability work fall into during incident response]]></description><link>https://read.srepath.com/p/53-whats-missing-in-incident-response</link><guid isPermaLink="false">https://read.srepath.com/p/53-whats-missing-in-incident-response</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Thu, 15 Aug 2024 12:15:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147694605/cc75bfed7ccaf821394e39ae5ff82936.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. 
However, issues remain because the processes supporting incident response are not robust.</p><div class="pullquote"><p>Incident response software alone isn't going to fix bad incident processes. </p></div><p>It&#8217;s going to help, for sure. You need these incident management tools to manage the data and communications within the incident. </p><p>But you also need to have effective processes and human-technology integration. Dr Ukis wrote in his <a href="https://www.amazon.com/Establishing-Foundations-Step-Step-Organizations-ebook/dp/B09RPM845X">Establishing SRE Foundations</a> book about complex incident coordination and priority setting. </p><p>According to Vladislav, at the beginning of your SRE journey, the focus is not going to be on setting up an incident response process, but on core SRE artifacts like SLIs, availability measurement, SLOs, etc. </p><p>Those artifacts let the organization say, &#8220;We are now safely investing more into customer-facing features.&#8221; Then, once these core SRE concepts are more or less established in the organization, you can move on to formalizing incident response. </p><h2>Understanding and Leveraging SLOs</h2><p>Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they&#8217;ve been validated through production. </p><p>Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.</p><h2>Implementing a Formal Incident Response</h2><p>Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place. </p><p>Without this, the process might not be as effective. 
When the foundational SLOs and organizational culture are strong, a well-structured incident response process becomes significantly more effective.</p><h2>Coordinating During Major Incidents</h2><p>When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams. </p><p>Consider appointing incident commanders and coordinators, <a href="https://response.pagerduty.com/training/incident_commander/">as recommended in PagerDuty&#8217;s documentation</a>, to manage this coordination. Develop a lightweight process to guide how incidents are handled.</p><h2>Classifying Incidents</h2><p>Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three. </p><p>Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.</p><h2>Deriving Actions from Incident Classification</h2><p>Based on the incident classification, outline specific actions. For example, Priority One incidents might require immediate involvement from an incident commander. </p><p>They might take the following actions:</p><ol><li><p>Create a communication channel, assemble relevant teams, and start coordination. </p></li><li><p>Simultaneously inform stakeholders according to their priority group. </p></li><li><p>Define stakeholder groups and establish protocols for notifying them as the situation evolves.</p></li></ol><h2>Keep Incident Response Processes Simple and Accessible</h2><p>Ensure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper. 
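</p>

<p>A classification scheme like this can be small enough to encode directly in your tooling. The priorities, input signals, and actions below are illustrative examples, not the scheme from Dr Ukis&#8217; book:</p>

```python
# Illustrative incident classifier: maps impact signals to a priority
# and, per the guideline above, rounds ambiguous cases up.

ACTIONS = {
    "P1": ["page incident commander", "open comms channel", "notify all stakeholder groups"],
    "P2": ["page on-call team", "notify affected stakeholder group"],
    "P3": ["create ticket for business hours"],
}

def classify(customer_facing, slo_breached, users_affected_pct):
    """Return a priority label. When signals conflict (the fuzzy cases),
    default to the higher priority rather than the lower one."""
    if customer_facing and slo_breached:
        return "P1"
    if customer_facing or slo_breached or users_affected_pct >= 10:
        # Ambiguous zone between P1 and P2: round up on broad impact.
        return "P1" if users_affected_pct >= 50 else "P2"
    return "P3"

priority = classify(customer_facing=True, slo_breached=False, users_affected_pct=12)
print(priority, "->", ACTIONS[priority])
# prints: P2 -> ['page on-call team', 'notify affected stakeholder group']
```

<p>The deliberate asymmetry, rounding ambiguous cases up, mirrors the guideline above: if uncertain whether an incident is Priority One or Two, default to Priority One.</p>

<p>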
Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram. </p><p>This approach ensures that the process is practical and can be followed effectively during an incident.</p><h2>Preparing Your Organization</h2><p>An effective incident response process relies on an organization&#8217;s readiness for such rigor. Attempting to implement this process in an organization not yet mature enough may result in poor adherence during critical times. </p><p><strong>Make sure your organization is prepared to follow the established procedures.</strong></p><p>For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn.</p>]]></content:encoded></item><item><title><![CDATA[Can ITIL Benefit from Site Reliability Engineering?]]></title><description><![CDATA[I asked Dr Vladislav Ukis this question after noticing that ITIL people are checking reliability engineering out. Here's his answer...]]></description><link>https://read.srepath.com/p/can-itil-benefit-from-site-reliability</link><guid isPermaLink="false">https://read.srepath.com/p/can-itil-benefit-from-site-reliability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 13 Aug 2024 11:45:12 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147657280/c3a53947f59026a7af80422173f76590.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. </p><blockquote><p>SRE is not for setting up the IT function. 
It is for enabling the product organization to operate online services reliably at scale.</p></blockquote><p>However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function.</p><p>Dr. Vladislav Ukis is well qualified to talk about reliability: at Siemens Healthineers, he leads 250 people globally to deliver the company&#8217;s cloud platform, which runs on Microsoft Azure.</p><p>We discussed key concepts from his book, <a href="https://www.amazon.com.au/Establishing-Foundations-Step-Step-Organizations/dp/0137424604">Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations</a>.</p><p>Unlike other technical books in this field, <strong>Dr Ukis&#8217; book is aimed at technology professionals who are beginners to the reliability journey</strong>. </p><p>This is different from the <em>Site Reliability Engineering </em>(2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book assumes a degree of prior knowledge and experience in the field. </p><p>Vlad wanted to make it more accessible:</p><blockquote><p>What I did with my book is to say, &#8216;Okay, so now you've never done operations,  but you now are thrown in the world of online services where you have to operate them. How do you get started?&#8217; So this is what the book is for. So for people who want to learn how to get started in the world of operating online services.</p></blockquote><p>ITIL was originally developed by the UK government in the 80s to improve IT governance. It is best related to SRE through its service management and incident management components. But it&#8217;s for managing systems that are more predictable and can be handled through strict process control.</p><p>Modern product delivery doesn&#8217;t have the luxury of bureaucratic levels of predictability that older IT services have. 
It requires a more engineer-oriented approach to solving problems/incidents and providing services. </p><p>So how was Vlad&#8217;s experience bringing SRE into an organization that previously had run solely on the ITIL model?</p><p>Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. </p><p>The company would ship the physical software product to its hospital customers and then those hospitals would have the software operated and supported by their IT departments. </p><p>The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software on discs to customers, but provide online services in the cloud centrally for the customers to use.</p><p>The early days were haphazard: software was deployed to the cloud without major issues. Not many customers were on the cloud platform, so the team could get away with &#8220;handcrafted operating procedures&#8221;.</p><p>But as traffic and service count started to rise rapidly, the Healthineers team learned that they needed a more professional approach. They began to understand that their initial approach to operations could not continue as-is.</p><p>This is when Vladislav began to drive SRE practices in the organization. </p><p>This sub-30-minute conversation covered a lot of ground relevant to organizations looking to transition to delivering online services at scale. </p><p>Have a listen.</p>]]></content:encoded></item><item><title><![CDATA[#52 Navigating Complexity within Incidents]]></title><description><![CDATA[I talked with Sonja Blignaut about the pressing issue of increasing complexity within incidents. 
She's an expert in complexity thinking with an early career background as a Fortran and C programmer.]]></description><link>https://read.srepath.com/p/navigating-complexity-incidents</link><guid isPermaLink="false">https://read.srepath.com/p/navigating-complexity-incidents</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 06 Aug 2024 12:03:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147357673/5c303f5c0d37ee6080f619f5a0fe05ef.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>As I mentioned in the intro, <a href="https://www.linkedin.com/in/sonjablignaut/">Sonja Blignaut </a>is a complexity expert. That might not sound relevant to incident response in reliability engineering.</p><p><em>But it is!</em></p><p><strong>Our systems are becoming more complex and so are the resulting incidents</strong>.</p><div class="poll-embed" data-attrs="{&quot;id&quot;:200096}" data-component-name="PollToDOM"></div><p>Learning about complexity can help reliability folk go into an incident with less anxiety, which we&#8217;ll explore in this post.</p><h2>The complexity of incidents</h2><p>You might already know that incident response is a huge part of reliability work in software systems. </p><p>What you might not know is that incident response as a whole is getting harder because of more unpredictable situations. 
</p><p>Here are some modern computing examples that contribute to this complexity:</p><ul><li><p>cloud computing shifting us from 1 &#8594; 100+ pieces of infrastructure</p></li><li><p>code as a serverless function, making incident response harder through statelessness</p></li><li><p>containerization increasing configuration complexity, meaning a greater failure surface area</p></li></ul><p>So how can a mindset geared towards complexity aid in adapting to unexpected challenges during an incident?</p><h3>Let&#8217;s first understand the levels of incidents</h3><p>You might be used to terminology like Sev1/2/3 and P0/1/2/3/4 events, but let&#8217;s try to distinguish them in terms of a decision-making framework.</p><p><strong>The Cynefin framework is a way to decipher the complexity of incidents. </strong>It helps us distinguish among four different kinds of incidents. </p><p>The kinds of incidents can be:</p><ol><li><p><strong>Simple.</strong> A straightforward issue that can be resolved by following a series of predetermined steps e.g. <em><strong>a runbook for a server reboot procedure </strong></em></p></li><li><p><strong>Complicated.</strong> Not straightforward, but we can still plan for them. We know where things might fail and plan for that e.g. <em><strong>load balancer configuration update</strong></em></p></li><li><p><strong>Complex.</strong> This is where unexpected things can happen. Typically, they emerge, and we need to find our way through e.g. <em><strong>application performance degradation</strong></em></p></li><li><p><strong>Chaotic.</strong> Things that you can prepare for but can't plan for, because you've got no idea exactly what's going to happen e.g. <em><strong>data center power outage</strong></em></p></li></ol><h3>How incidents develop in this framework</h3><p>Sonja has worked extensively with power utility companies and software teams in other industries. 
She found that the initial moments of an incident would often be chaotic.</p><p>For example, she faced a situation where several South African banks' payment systems went down simultaneously, causing chaos for her clients.</p><p><strong>How do we prepare for something chaotic?</strong> It's a crisis. We almost need to drop everything and just fix this. </p><p>After the initial chaos settles, we try to wrangle control of the incident through our tried and tested methods. But at times, we struggle to take control.</p><div class="pullquote"><p>In many organizations, we've come to equate control with competence. If you're a leader or if you are an expert software engineer, you're supposed to be in control.</p><p>If you're dealing with something complex, you can't ever fully be in control. That creates anxiety because then we start questioning our own competence. </p><p>In many organizations, we experience others questioning our competence as well because we're trying to control something that essentially can't be controlled.</p></div><p>What we&#8217;re trying to control is something that cannot be controlled, which is the underlying complexity within our incidents.</p><h2>A deeper dive into complexity</h2><p>In plain English, complexity is what occurs in interconnected systems where changes in one part can affect the whole system in unexpected ways.</p><p><em>Sound familiar?</em></p><p>Sonja came across the idea of complexity while working as a consultant at IBM. 
One of her colleagues, Dave Snowden, went on to develop the Cynefin framework.</p><p>This framework explores the decision-making domains that govern how we perceive problems and go about resolving them.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!E81k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd93e854b-dafa-4884-97c8-10dd22ad609c_970x947.jpeg" width="464" height="453" alt="The Cynefin framework" loading="lazy"><figcaption class="image-caption">by Dave Snowden via Wikimedia under CC BY 3.0</figcaption></figure></div><blockquote><p>I first came across this idea of complexity when I was in IBM.  I met Dave Snowden there.  At the time, I was a very unhappy consultant because at many of the big consultancies &#8212; I think even still today &#8212; their business model is focused on selling <em>best practices</em> or selling you what I like to refer to as <em>recipes</em> and disregarding context. </p><p>&#8212; Sonja on why complexity piqued her interest</p></blockquote><p>That meeting with Dave made Sonja realize the idea of complexity and how <em>context</em> matters. The word <em>context</em> highlights the fact that <strong>every complex system is unique.</strong> </p><p>She went on to explore Dave Snowden&#8217;s framework in greater detail after leaving IBM to start her consultancy. 
After a while, she realized something interesting:</p><blockquote><p>I think that every decision maker, whether it's senior levels or lower levels, or even if it's decision making in the household&#8230; can benefit from understanding complexity.</p></blockquote><h2>Why care about complexity?</h2><p>Sonja told me about the key benefit of understanding complexity:</p><blockquote><p>One of my early clients, who also became a friend, was quite a senior leader in an organization.  She said when she fully understood complexity and the implications thereof for the first time, <strong>it was as if a weight was lifted off of her shoulders</strong>.</p></blockquote><p>Being a senior leader, Sonja&#8217;s friend was grappling with many priorities at once and felt overwhelmed. But after understanding complexity, she realized that it wasn't because of a lack of competence that she couldn't understand issues.</p><p>It was because of complexity, which nobody could fully know. </p><p>For some, that sense of not being able to know can create anxiety. For others, it creates a sense of freedom because essentially what it means is we are all wayfinding. We are finding our way through these messy tangles. </p><p>We will fail and we will make mistakes, but we will eventually find a way through.</p><p>But when we treat something complex as if it's complicated, very often we just waste a lot of time. We get ourselves even more stuck. We create unintended consequences. </p><h2>Complex &#8800; Complicated</h2><p>I&#8217;ve noticed over the years that people mix up the meanings behind complex and complicated, so I felt it was best to clarify with the expert.</p><p>Sonja told me it&#8217;s good to get down to the root meanings of the two words.</p><h3>The meaning behind &#8220;complicated&#8221;</h3><p>The &#8220;plic&#8221; in complicated draws from the Latin word, plica aka <em>plik</em>. </p><p><em>plik</em> means folded together. 
With something complicated, I can unfold it, analyze it, then understand it, and finally, replicate it. </p><p><strong>A car is an example of a complicated system.</strong> </p><p>All of the different parts connect and come together in linear, predictable ways to create a certain functionality. There's no functionality or behavior in that car that you can't understand by understanding the parts. </p><h3>The meaning behind &#8220;complex&#8221;</h3><p>The &#8220;plex&#8221; in complex is a Latin root word.</p><p>It means braided together, or in Sonja&#8217;s words, &#8220;it&#8217;s tangled together&#8221;. </p><p>While there are many aspects to <strong>something being complex, the first and foremost aspect is that it's entangled in ways that we can't fully understand.</strong></p><p>So things are connected in ways that are not linear. </p><p>In comparison, complicated systems have linear, predictable traits. </p><p>Because of this, complex systems show unique traits such as:</p><ul><li><p>being dynamic</p></li><li><p>continuously shifting and changing</p></li><li><p>rife with interconnectedness that we can't fully understand</p></li></ul><p>It's almost difficult to draw a boundary line around where this system starts and ends.</p><h3>Example of complex systems</h3><p>You don&#8217;t have to wrack your brain to think of systems that fit the above criteria. Remember, complex means things can happen that you could not predict ahead of time.</p><p><strong>Families meet the criteria and are complex systems. The human body is complex. Our gut biomes are complex.  Even consciousness</strong> &#8212; scientists are still trying to figure out exactly what consciousness is and where it emerges from.</p><p>So when two or more human beings come together, the situation is immediately complex. The unpredictability continues to rise as you add more humans. </p><p>Any living system is complex, but that can also create some anxiety for us, because you can't control a complex system the way you could potentially control a complicated one.</p><h2>Humans compound technical complexity</h2><p>What I&#8217;m alluding to here is that while the complexity of incidents might be greater because of things like cloud, serverless, and containers, that&#8217;s not the end of it.</p><p>Complexity emerges in incidents because <strong>we don&#8217;t work in technical systems but in sociotechnical systems</strong>. Humans are managing the underlying components of our technical systems. 
They interpret these components in their unique way.</p><p>We as computing folk have become accustomed to running predictable systems that can tolerate human error, but that&#8217;s changed.</p><p>Now, these very systems are shifting into a world of complexity, and we are struggling to handle the lack of predictable outcomes. </p><p>You have to understand all those tangled bits, and trying to do that can be very difficult because there are so many different things happening at the same time. </p><h2>Why is it difficult to make sense of complex systems?</h2><h3>Our schooling plays a role</h3><p>Sonja thinks one of the issues is that <strong>we were for the most part taught in our educational systems to think in very reductionist ways</strong>. </p><p>This might be a familiar experience: if you're facing what would be classified as a complicated problem, you break it down into smaller pieces, you solve the pieces and then, in the end, you can solve the complicated problem.</p><p>In complexity, it doesn't quite work like that. </p><h3>Why complex systems can&#8217;t be codified</h3><p>Using the family system as an example, you can't break it into smaller pieces to try and understand it. You almost need to work with the whole. </p><p>There are &#8220;emergent qualities&#8221; within the family that don't exist within the individual members of the family.  As members of the family individually interact with each other, they come together in unique ways.</p><p>The same applies to organizational culture.  Culture emerges from the thousands of unique interactions between the humans in that system. The more humans you add to the mix, the more interactions you create. </p><p>This increases complexity at a parabolic rate.</p><p>Sonja puts it well:</p><blockquote><p>All of the conversations, all of the ways that people even just look at each other, the interactions between people and the technology they work in,  with the office space they're in. 
All of these things create this emergent identity or culture of this organization. And that culture does not exist within the individual people or the individual parts.</p></blockquote><p>In other words, the whole is greater than the sum of its parts.</p><p>The critical learning point at this stage is that because something complex has interrelated components, <strong>the relationships between the components become more important than the components themselves.</strong></p><h2>How to get a better grasp of complex systems</h2><p>Sonja suggests that the first step to embracing complexity is unlearning the patterns of thought associated with complicated systems.</p><p>That means reducing or eliminating the tendency to break things into parts, because when we're dealing with something complex, we also need to look at the whole. </p><div class="pullquote"><p>We need to look at the whole as well as the parts and how things are connected.</p></div><p>That also means we need to shift from a linear problem-solution way of thinking. </p><p>In complexity, Sonja believes it&#8217;s better (and in her exact words, &#8220;more generative&#8221;) to <strong>think in terms of &#8220;emergent patterns&#8221; and not &#8220;problems&#8221;.</strong> </p><p>It's not impossible to understand complexity, but if we look at it through a linear reductionist lens, then we can get ourselves stuck. 
The thing to get comfortable with &#8212; and it will initially be hard &#8212; is that <strong>a pattern cannot necessarily be solved, but you can shift it.</strong> </p><blockquote><p>If you think of something extremely complex like, for example, poverty or social inequality, if you see them solely as a &#8220;problem to solve&#8221;, you&#8217;ll almost immediately get stuck because the system is connected to so many other contributing parts.</p></blockquote><p>You won&#8217;t even know where to start, and solving one part causes an unpredictable outcome in another part of the system. But if you see the issue as a pattern, then all of a sudden, you can start interacting with it more effectively. </p><p>You can try different things and see what works. </p><p>You've got multiple entry points because so many things are connected.</p><h2>Wrapping up</h2><p>By now, I hope you have a better understanding of (1) complexity and (2) how it impacts your ability to respond to incidents.</p><p>Sonja and I also discussed:</p><ol><li><p>the problem of achieving psychological safety in complex environments</p></li><li><p>a deeper dive into the concept of emergence and how it contributes to our understanding of incidents as they develop</p></li><li><p>a need for resilience in working professionals as we become more attention-poor</p></li></ol><p>For these 3 ideas, you&#8217;ll have to listen to the podcast episode.</p><div><hr></div><h2>About Sonja</h2><p>I found Sonja&#8217;s work in 2019 while looking for ways to deal with the increasing VUCA (volatility, uncertainty, complexity, and ambiguity) at my work. 
</p><p>Sonja is a co-founder of <a href="https://complexityfit.com/">Complexity Fit</a> and founder of <a href="https://www.morebeyond.co.za/">More Beyond</a> focusing on helping teams build capacity for sensemaking, collaboration, and wayfinding.</p><p>She has a background in programming from her early career as a meteorologist, having worked in C and Fortran, and then progressing to working as a web developer. </p><p>You can connect with Sonja to learn more about complexity <a href="https://www.linkedin.com/in/sonjablignaut/">via LinkedIn</a>.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[#51 Whitebox vs Blackbox Monitoring]]></title><description><![CDATA[Full writeup in email. Monitoring is not just a monolith. Google's SREs talk about 2 distinct forms of monitoring, one of which is essential to assuring reliability of external software.]]></description><link>https://read.srepath.com/p/whitebox-vs-blackbox-monitoring</link><guid isPermaLink="false">https://read.srepath.com/p/whitebox-vs-blackbox-monitoring</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 30 Jul 2024 12:15:59 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147123300/5226eecf9e1db5afbfbd81c30ca6e776.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Sebastian and I took some time in July to talk about how we could make concepts from Google's SRE book more digestible and usable in practice. </p><p>We'll start on this by covering monitoring concepts, the first concept being what Google's SREs call white box versus black box monitoring. </p><p>I initially thought that we could just call it internal versus external monitoring to explain it to you, but it turns out that would not be correct. 
We&#8217;ll explore this further.</p><p>First of all&#8230;</p><h2>What is monitoring?</h2><p>If you think of monitoring, or observability as we call it today, you&#8217;d think about it as:</p><div class="pullquote"><p>Monitoring is a way to measure your system, to gain insight and knowledge about the system.</p></div><p>In our context, a system is usually software of some kind e.g. SaaS, platform, etc.</p><h2>So what is whitebox monitoring then?</h2><p>In monitoring terms, <em>whitebox monitoring</em> covers the system and components that you have full control over: the system that is yours. </p><p>It's the one that you can instrument whichever way you feel is appropriate for your context and for the kind of insight that you're trying to get out of it. </p><p>You have full control over what is happening. There's no limitation to what you could do. </p><h3>Some characteristics of whitebox monitoring include:</h3><ul><li><p>You can and do get very granular with the data you&#8217;re capturing</p></li><li><p>You have full control over the end-to-end life cycle of your observability data </p></li><li><p>The focus is on the internals of your system that you control</p></li></ul><p>It is a subset of <em>internal monitoring</em>, but it takes the data capture much deeper than high-level metrics like you&#8217;d get from traditional application performance monitoring (APM) tools. </p><p>Now let's define black box monitoring.</p><h3>What is blackbox monitoring?</h3><p>On the opposite end, there are systems that you have no control over. </p><p>That applies to systems like third-party API endpoints, or generally external pieces of software that you're using. A vendor provides you access to their system in some way, shape, or form, but you don't get to go inside that system and instrument it to the degree that you might want to. </p><p>The way you then need to approach monitoring or observing these kinds of systems is somewhat different from your own systems. </p><p>What you can do is apply approaches that let you approximate what might happen within that system that is behind an API endpoint or within that piece of software that you're connected to.</p><p>There are things like heartbeats or health checks that you can implement on your side that can follow a simple pattern like, for example:</p><ol><li><p>send a simple HTTP request to an endpoint</p></li><li><p>you get a response back (or not)</p></li><li><p>gauge the health of that system based on the response </p></li></ol><p>It's not a foolproof method. It's an approximation. 
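</p><p>The three-step pattern above can be sketched in a few lines using only the standard library. A minimal sketch; the endpoint URL is a placeholder, not a real dependency:</p>

```python
# Blackbox probe: send a simple HTTP request and infer health from the
# response (or the lack of one). The target URL below is hypothetical.
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError, HTTPError, timeouts, and DNS failures: from the
        # outside, all we can infer is "unhealthy".
        return False

# e.g. healthy = probe("https://api.vendor.example/healthz")
```

<p>This is exactly the approximation described above: a 200 tells you the endpoint answered, not that the system behind it is fully healthy.</p><p>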
</p><p>To a certain degree, that&#8217;s the best you can do with these types of systems that are a blackbox to you.</p><p>You can just go around the outside of that system and try to probe it with certain types of observability approaches, and then infer from the data that you receive back, <em>&#8220;Is the system that I'm connected with healthy or not?&#8221;.</em></p><p><em>&#8220;Will it respond appropriately and properly the way I need it to when I'm sending a barrage of requests to it?&#8221;</em></p><h3>Blackbox monitoring is analogous to airplane blackboxes</h3><p>You&#8217;d find that every airplane has a blackbox that records everything that's happening in the flight, but you cannot do anything with the data directly.</p><p>You cannot look at the data. If you're a pilot, you're not looking at that monitoring data at any point. That's only for investigation after the fact.</p><p>To sum it up, a black box is:</p><ul><li><p>an aspect of your system that you don't have control over</p></li><li><p>focused on real time data collection</p></li><li><p>a higher-level overview of whatever is observable </p></li><li><p>designed for situations where you can&#8217;t drill down into the internals </p></li><li><p>able to keep you on top of the health of a provider&#8217;s API or system integration </p></li></ul><h2>The rising importance of blackbox monitoring</h2><p>With recent events where third parties have let prominent software vendors down, I think black box monitoring is likely to become a lot more important in the future. </p><p>The focus in the industry has been on white box monitoring, which makes sense. It's something you have direct control over. </p><p>But <strong>as we increase our risk surface area with more third-party services, and as incidents intensify, blackbox monitoring needs to be discussed more</strong>.  
</p><p>In the last 10 years, the industry hasn't seen much movement in terms of innovation or advancement, especially in the blackbox monitoring and observability space.</p><p>We feel that's an area due for a round of (visible) innovation.</p><p>It&#8217;s worth improving your blackbox monitoring so you can infer more accurately what is happening with that third-party piece of software. The benefit might simply be your sanity.</p><p>It also lets you make more rational engineering decisions, the kind that can only come from a stronger picture of the overall health of your ecosystem.</p><h2>How to get started in blackbox monitoring</h2><p>As far as we are aware, there aren't many open-source options around.</p><p>However, Prometheus has an <a href="https://github.com/prometheus/blackbox_exporter">open source blackbox exporter</a> that can probe endpoints such as web servers, databases, or network devices. It can probe over protocols like HTTP, HTTPS, DNS, TCP, ICMP, and gRPC.</p><h2>Wrapping up</h2><p>We've defined what Google's SREs think of whitebox and blackbox monitoring, but in particular, we&#8217;d like you to think about your blackbox efforts.</p><p>As an industry, we tend to put most of our energy into whitebox monitoring. But our systems are rife with third-party APIs and integrations.</p><p>How are you making sure that you don't get something like a global outage because a third-party vendor pushed bad code? Or at least, how would you minimize your blast radius?</p><p>The costs of third-party mishaps can be high, with the recent CrowdStrike-related outage incurring $5.4 billion in downtime costs. 
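</p><p><em>For a concrete starting point with the Prometheus blackbox exporter mentioned earlier, a minimal setup might look like the sketch below. The probed URL and exporter address are placeholders, and this is illustrative rather than a production config:</em></p>

```yaml
# blackbox.yml -- defines a probe module for the blackbox exporter
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET

# prometheus.yml (excerpt) -- route scrapes through the exporter's /probe path
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://api.example.com/health  # hypothetical third-party endpoint
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target        # pass the target URL as ?target=
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115         # the blackbox exporter itself
```

<p>Each probe then exposes a <code>probe_success</code> metric (1 or 0) that you can graph and alert on.</p><p>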
</p><p>It's important to set guardrails around your third-party systems, so that:</p><ol><li><p>changes from them are deployed gradually to your own systems</p></li><li><p>those changes are monitored by blackbox monitoring methods</p></li><li><p>if failure occurs, there's a failover ready to go.</p></li></ol><p>I hope this updated version of our SRE book rundown has been more helpful to your work. </p><p><em>This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) book. Chapter written by Rob Ewaschuk and edited by Betsy Beyer.</em></p>]]></content:encoded></item><item><title><![CDATA[#50 Making Better Sense of Observability Data]]></title><description><![CDATA[In this email, we explore ideas to push our thinking of observability data like the 5th signal, adding &#960; to the observability mix, and more.]]></description><link>https://read.srepath.com/p/50-making-better-sense-of-observability</link><guid isPermaLink="false">https://read.srepath.com/p/50-making-better-sense-of-observability</guid><dc:creator><![CDATA[Ash Patel]]></dc:creator><pubDate>Tue, 09 Jul 
2024 12:12:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/146273994/83e76d48cdb82c5eee863c4ed5791303.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data.</p><p>We crammed ideas like these 7 takeaways into just under 25 minutes:</p><ol><li><p><strong>Reasserting the Need to Monitor Four Golden Signals</strong>: Focus on latency, traffic, errors, and saturation for effective system monitoring and management.</p></li><li><p><strong>Prioritize Customer Health</strong>: in Jack&#8217;s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact.</p></li><li><p><strong>Apply Mathematical Techniques</strong>: Incorporate advanced mathematical concepts, like the Nyquist-Shannon sampling theorem and the t-digest algorithm, to enhance data accuracy and observability metrics.</p></li><li><p><strong>Build Accurate Percentiles</strong>: Implement techniques to accurately reproduce percentiles from raw data to ensure reliable performance metrics.</p></li><li><p><strong>Manage High-Cardinality Data</strong>: Develop strategies to handle high-cardinality data without overwhelming your resources, ensuring you extract valuable insights.</p></li><li><p><strong>Standardize Log Records</strong>: Using readily available frameworks to emit standardized log records makes data easier to process and visualize.</p></li><li><p><strong>Handle High-Velocity Data Efficiently</strong>: Develop methods for collecting and processing high-velocity data without incurring prohibitive costs.</p></li></ol><p>Watch Jack&#8217;s Monitorama talk here:</p><div id="vimeo-843996971" class="vimeo-wrap" data-attrs="{&quot;videoId&quot;:&quot;843996971&quot;,&quot;videoKey&quot;:&quot;&quot;,&quot;belowTheFold&quot;:false}" data-component-name="VimeoToDOM"><div 
class="vimeo-inner"><iframe src="https://player.vimeo.com/video/843996971?autoplay=0" frameborder="0" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true"></iframe></div></div>]]></content:encoded></item></channel></rss>