
#56 Resolving DORA Metrics Mistakes

I asked Nathen Harvey, the lead DORA advocate at Google, about some of the mistakes people make with DORA metrics. The major ones seem to come down to management misconceptions about what DORA can do.

We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life.

DORA seems to be turning into one of those problem areas.

Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations.

Nathen Harvey is no stranger to this problem.

He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in 2018.

His focus has been on questions like:

How do we help teams get better at delivering and operating software?

You and I can agree that this is an important question to ask.

I’d listen to what he has to say about DORA because he’s got a wealth of experience behind him, having also run community engineering at Chef Software.

Before we continue, let's explore "What is DORA?" in Nathen's (paraphrased) words:

DORA is a software research program that's been running since 2015.

This research program looks to figure out:

How do teams get good at delivering, operating, building, and running software?

The researchers were able to draw out the metrics by finding that teams with good technology practices also tend to have highly robust software delivery outcomes.

They found that this positively impacted organizational outcomes like profitability, revenue, and customer satisfaction.

Essentially, all those things that matter to the business.

One of the challenges the researchers found over the last decade was working out: how do you measure something like software delivery?

It's not the same as a factory system, where you can simply go and count the widgets being delivered.

The unfortunate problem is that the factory mindset still leaks in. I've personally seen some silly metrics over the years, like lines of code.

Imagine being asked constantly: “How many lines of code did you write this week?”

You might not have to imagine. It might be a reality for you.

DORA's researchers agreed that factory-style metrics cannot determine whether or not you are a productive engineer.

They settled on and validated 4 key measures for software delivery performance.

Nathen elaborated that 2 of these measures look at throughput:

[Those] two [that] look at throughput really ask two questions:

  1. How long does it take for a change of any kind, whether it's a code change, configuration change, whatever, to go from the developer's workstation right through to production?

And then the second question on throughput is:

  2. How frequently are you updating production?

In plain English, these 2 metrics are:

  1. Lead Time for Changes: Measures the time it takes from code being committed to being deployed to production.

  2. Deployment Frequency: How often code is deployed to production. This metric reflects the team's ability to deliver new features or updates quickly.
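
To make those two definitions concrete, here's a minimal sketch of how you might compute both throughput metrics from your own deployment records. The field names and sample data are assumptions for illustration only; DORA doesn't prescribe a specific data model or tooling.

```python
from datetime import datetime

# Hypothetical deployment records pairing each change's commit time with the
# moment it reached production (illustrative field names and values).
deployments = [
    {"commit_time": datetime(2024, 5, 1, 9, 0), "deploy_time": datetime(2024, 5, 1, 15, 0)},
    {"commit_time": datetime(2024, 5, 2, 11, 0), "deploy_time": datetime(2024, 5, 3, 10, 0)},
    {"commit_time": datetime(2024, 5, 6, 14, 0), "deploy_time": datetime(2024, 5, 6, 16, 30)},
]

window_days = 7  # length of the observation window

# Deployment frequency: how often production is updated.
deployment_frequency = len(deployments) / window_days  # deploys per day

# Lead time for changes: commit-to-production duration, summarized by the median.
lead_times = sorted(d["deploy_time"] - d["commit_time"] for d in deployments)
median_lead_time = lead_times[len(lead_times) // 2]

print(f"Deployment frequency: {deployment_frequency:.2f} deploys/day")
print(f"Median lead time for changes: {median_lead_time}")
```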

Nathen recounted his experience of working at organizations that differed in how often they updated production, from once every six months to multiple times a day.

They're both very different types of organizations, so their perspective on throughput metrics will be wildly different.

This has some implications for the speed of software delivery.

Of course, everyone wants to move faster, but there’s this other thing that comes in and that's stability.

And so, the other two stability-oriented metrics look at:

What happens when you do update production and... something's gone horribly wrong. “Yeah, we need to roll that back quickly or push a hot fix.”

In plain English, they are:

  1. Change Failure Rate: Measures the percentage of deployments that cause a failure in production (e.g., outages, bugs).

  2. Failed Deployment Recovery Time: Measures how long it takes to recover from a failure in production.
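
In the same spirit, here's a minimal sketch of the two stability metrics. Again, the record fields (a failure flag plus failure and recovery timestamps) are illustrative assumptions about how you might log deployments and incidents.

```python
from datetime import datetime, timedelta

# Hypothetical deployment log flagging which deployments caused a production
# failure and when service was restored (illustrative field names).
deployments = [
    {"failed": False},
    {"failed": True,
     "failure_time": datetime(2024, 5, 3, 10, 15),
     "recovery_time": datetime(2024, 5, 3, 11, 0)},
    {"failed": False},
    {"failed": True,
     "failure_time": datetime(2024, 5, 8, 14, 0),
     "recovery_time": datetime(2024, 5, 8, 14, 40)},
]

failures = [d for d in deployments if d["failed"]]

# Change failure rate: share of deployments that caused a failure in production.
change_failure_rate = len(failures) / len(deployments)

# Failed deployment recovery time: how long it took to restore service.
recovery_times = [f["recovery_time"] - f["failure_time"] for f in failures]
mean_recovery_time = sum(recovery_times, timedelta()) / len(recovery_times)

print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Mean failed deployment recovery time: {mean_recovery_time}")
```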

You might be thinking the same thing as me. These stability metrics might be a lot more interesting to reliability folks than the first 2 throughput metrics.

But keep in mind, it’s about balancing all 4 metrics.

Nathen believes it's fair to say that today, across many organizations, throughput and stability are viewed as tradeoffs of one another.

We can either be fast or we can be stable.

But the interesting thing that the DORA researchers have learned from their decade of collecting data is that throughput and stability aren't trade-offs of one another.

They tend to move together. They’ve seen organizations of every shape and size, in every industry, doing well across all four of those metrics.

They are the best performers.

The interesting thing is that neither the size of your organization nor the industry you're in matters.

Whether you’re working in a highly regulated or unregulated industry, it doesn't matter.

The key insight that Nathen thinks we should be searching for is: how do you get there?

To him, it's about shipping smaller changes.

When you ship small changes, they're easier to move through your pipeline.

They're easier to reason about.

And when something goes wrong, they're easier to recover from and restore service.

But along with those small changes, we need to think about those feedback cycles.

Every line of code that we write is in reality a little bit of an experiment.

We think it's going to do what we expect and it's going to help our users in some way, but we need to get feedback on that as quickly as possible.

Underlying all of this, both small changes and getting fast feedback, is a real climate for learning. Nathen drew up a few thinking points from this:

So what is the learning culture like within our organization?

Is there a climate for learning?

And are we using things like failures as opportunities to learn, so that we can ever be improving?

I don't know if you're thinking the same as me, but we're already learning that DORA is a lot more than just metrics.

To Nathen (and me), the metrics should be one of the least interesting parts of DORA, because DORA digs into the underlying capabilities, like small changes and fast feedback.

That’s what truly helps determine how well you're going to do against those performance metrics.

Not saying “We are a low to medium performer. Now go and improve the metrics!”

I think the issue is that a lot of organizations emphasize the metrics because it's something that can sit on an executive dashboard.

But the true reason we have metrics is to help drive conversations.

Through those conversations, we drive improvement.

That's important because, according to Nathen, an unfortunately noticeable number of organizations are currently doing this:

I've seen organizations [where it’s like]: “Oh, we're going to do DORA. Here's my dashboard. Okay, we're done. We've done DORA. I can look at these metrics on a dashboard.”

That doesn't change anything.

We have to go the step further and put those metrics into action.

We should be treating the metrics as a kind of compass on a map.

You can use those metrics to help orient yourself and understand, “Where are we heading?”.

But then you have to choose how you are going to make progress toward whatever your goal is.

The capabilities enabled by the DORA framework should help answer questions like:

  • Where are our bottlenecks?

  • Where are our constraints?

  • Do we need to do some improvement work as a team?

We also talked about the SPACE framework, which is a follow-on to the DORA metrics.

It is a framework for understanding developer productivity.

It encourages teams or organizations to look at five dimensions when trying to measure something from a productivity perspective.

It stands for:

  • S — satisfaction and well-being

  • P — performance

  • A — activity

  • C — communication and collaboration

  • E — efficiency and flow

What the SPACE framework recommends is that you first pick metrics from two or three of those five categories.

(You don't need a metric from every one of the five; just find something that works well for your team.)

Then write down those metrics and start measuring them.

Here’s the interesting thing: DORA is an implementation of SPACE.

You can correlate each metric with a dimension of the SPACE acronym:

  • Lead time for changes is a measure of Efficiency and flow

  • Deployment frequency is an Activity

  • Change fail rate is about Performance

  • Failed deployment recovery time is about Efficiency and flow
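
To see that mapping at a glance, here's a tiny sketch that simply restates the list above as a lookup table; nothing in it comes from an official SPACE or DORA artifact.

```python
# The four DORA metrics mapped onto the SPACE dimensions they instantiate,
# restating the list above as data.
DORA_TO_SPACE = {
    "Lead time for changes": "Efficiency and flow",
    "Deployment frequency": "Activity",
    "Change fail rate": "Performance",
    "Failed deployment recovery time": "Efficiency and flow",
}

for metric, dimension in DORA_TO_SPACE.items():
    print(f"{metric:35} -> {dimension}")
```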

Keep in mind that SPACE itself has no metrics.

It is a framework for identifying metrics.

Nathen reiterated that you can't use "the SPACE metrics" because there is no such thing.

I mentioned earlier how DORA is a means of identifying the capabilities that can improve the metrics.

These can be technical practices like using continuous integration.

But they can also be capabilities like collaboration and communication.

As an example, you might examine what your change approval process looks like.

You might look at how collaboration and communication have failed when you’ve had to send changes off to an external approval board like a CAB (change approval board).

DORA’s research backs the above up:

What our research has shown through collecting data over the years is that, while they do still exist, on the whole an external change approval body will slow you down.

That's no surprise. So your change lead time is going to increase, your deployment frequency will decrease.

But, at best, they have zero impact on your change fail rate. In most cases, they have a negative impact on your change fail rate. So you're failing more often.

It goes back to the idea of smaller changes, faster feedback, and being able to validate that. Building in audit controls and so forth.

This is something that reliability-focused engineers should be able to help with, because one of the things Sebastian and I talk about a lot is embracing and managing risk effectively, rather than trying to mitigate it through stifling measures like CABs.

In short, DORA and software reliability are not mutually exclusive concepts.

They're certainly in the same universe.

Nathen went as far as to say that some SRE practices necessarily go a bit deeper than the capability level DORA operates at, and provide even more specific guidance on how to do things.

He clarified a doubt I had, because a lot of people have argued with me (mainly at conferences) that DORA is something developers do earlier in the SDLC.

And then SRE is completely different because it focuses on the production side.

The worst possible situation would be turning to developers and saying, "These 2 throughput metrics, they're yours. Make sure they go up no matter what," and then turning to our SREs and saying, "Those stability metrics, they're yours. Make sure they stay good."

All that does is put these false incentives in place and we're just fighting against each other.

We talked a little more about the future of DORA in our podcast episode (player/link right at the top of this post) if you want to hear about that.

Here are some useful links from Nathen for further research:

  • DORA online community of practice

  • DORA homepage

  • [Article] The SPACE of Developer Productivity

  • Nathen Harvey's Linktree
