#57 How Technical Leads Support Software Reliability

Reliability Enablers

You might be familiar with the term, “technical lead”. You might even be working with one or a few right now. But how well do you know their ability to support your reliability work?

Ash Patel and Anemari Fiser

Sep 10, 2024

The question then condenses down to: Can technical leads support reliability work?

Yes, they can!

Anemari has been a technical lead for years — even spending a few years doing that at the coveted consultancy, Thoughtworks — and now coaches others.

She and I discussed the link between this role and software reliability.

Technical lead ≠ Engineering manager

According to Anemari, a tech lead is a person with technical knowledge who is responsible for leading a technical team to align them with a common technical strategy.

For the most part, engineering managers are focused on the people aspect of the work. They coach engineers and other technical folk to improve their performance.

In terms of activities, engineering managers coach, mentor, and support the development of their team members or direct reports. They’ll also go to bat for the team within the organization.

Technical leads are more focused on guiding the technical work that these people do. Their focus includes work like architecture, design patterns, and implementing projects. They offer technical insights and mentorship to the team.

Anemari found in her work experience that the use of technical leads varies from company to company:

I've worked with teams that don't have tech leads and then the engineering manager takes a more hands-on approach. Then you have teams where you have a tech lead only focused on tech, and then you have like a team lead or engineering manager, doing the people side.

Interestingly, she found that it was very difficult to singularly focus on the technical side of the work:

[Even] if you only want to focus on tech as a tech lead, you still end up having to deal with the people side because most tech problems are people problems in the end and so you kind of require both.

How can technical leads drive reliability principles?

Anemari advised me that tech leads are often required to think about reliability principles as part of their technical strategy.

Sometimes, teams don’t have an SRE team supporting them and operate in a “you build it, you run it” mode. In this situation, more than 90 percent of the responsibility for reliability falls on the team.

In other words, the product team has to develop the reliability-focused knowledge required to build and run their product reliably.

The technical lead can help the less experienced team members pick up reliability concepts and not just have that mindset of creating more features and making the burndown chart look pretty.

However, tech leads usually prefer to work with a Site Reliability Engineering (SRE) team when one is available.

Working with reliability teams like SREs

Anemari believes it’s important to be aware of reliability principles. She put it as knowing:

What does it mean for my product to be reliable?

So, the first critical step is to take the definition of reliability you arrive at from this question and align it with the known constraints of the product. These constraints can be:

performance limits
operational environments
user expectations

Anemari went on in our conversation to challenge the notion of simply stating reliability requirements to product teams.

It may seem obvious to reliability folk, but it’s one thing in a very large pile of considerations for a product team:

So I think it's important when we're talking about [reliability]… when we're talking about uptime, like, what does that mean? And why do we need 99 percent or whatever?

The onus falls on reliability engineers to support this thinking.
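One way to make a figure like 99 percent concrete for a product team is to translate it into the downtime it permits. The sketch below is only an illustration and not something discussed in the episode: the function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for the example.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    Both the function name and the default 30-day window are illustrative
    assumptions; use whatever window your team actually reports on.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The share of the period that may be spent "down" is (100 - target) percent.
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Over a 30-day window, 99 percent leaves roughly 7.2 hours of downtime, while 99.9 percent leaves about 43 minutes. Putting the target in those terms gives the product team something tangible to weigh against user expectations.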

Technical leads can be a good point of contact for that initial conversation of alignment. The reliability team can advise about the standards we mentioned before.

So the first step is co-creating a common version of reliability.

The second step is continuous communication between the product and reliability teams. Anemari believes there must be an easy way for either party to communicate with the other.

“Creating a ticket” to reach the reliability team or vice versa does not work well.

Anemari has seen a shared Slack channel between the SRE team and each product team work well.

You can go in there, pop up a question and take it from there and solve problems.

Anemari is also a fan of the embedded SRE model.

If you have to build something new from scratch and you have to build a whole infrastructure, another strategy that I've seen is bringing a person with SRE knowledge [into] the team for a while, helping us define [technical work] and being hands on.

A long-term embedded SRE, however, might be an antipattern.

I’ve had some teams that had an SRE team member… all the time, but I’ve also seen them struggle with the fact that at some point there is not so much to do.

An embedded SRE should be brought in to enable the team to do more and take care of its product by training the team on reliability topics, even at a high level, such as understanding all the different dashboards that they have.

What can SREs do better to support product teams with reliability efforts?

Anemari told me that it would be great if SREs first put more effort into helping the product team understand why reliability is so important. It’s obvious to us in this space, but as I mentioned earlier, people have n+100 other things on their minds.

Product teams need to be made more aware of what the reliability engineers are taking care of and how that can affect their products.

There needs to be more conversation than what Anemari has experienced in a lot of encounters with SRE teams:

The SRE teams are like, “Just give it to us. We'll take care of it. You know, like, we, we know what to do. I don't have to explain all of these things to you.”

She found it more effective to sit down with the reliability team and work through questions that might seem trivial but are important, like:

Why is our service not processing enough requests and how can we change it?
Why can we just drop this?
Why do we need to have all of these services?

Asking these kinds of questions helped the collaborators come up with a better solution than what the SRE team could come up with on their own.

Anemari has seen this work in the past, when an SRE team came in and introduced itself to the product team. They then covered how their services ran in the background. This led the product team to ask questions and brainstorm what could be improved.

The intentionality is important: the SRE team did not come with horns blaring that they were going to change the systems. They came in to discuss and brainstorm ideas.

But will the software engineers cooperate?

A lot of SREs I've spoken with are frustrated and think developer teams do not want to understand reliability.

But I'm sure at least some teams want to understand how reliability works.

Sometimes you have to just swallow that bitter pill and say, “Hey, look, I'm going to spell every single thing I'm doing out to you, just so that maybe in the future you can do it yourself.”

I think a lot of developers remember their early days and want to be able to run their product entirely on their own.

Anemari confirmed from her experience that developers want to have full power over what's happening with their product. Dependency is not necessarily something developers like very much.

There's of course the consideration of the learning journey.

There's that sweet spot that is hard to find early on, “How much do I need to know? Like, do I need to become an SRE expert now to run my service? Or how much do I need to know to just run my service?”

Anemari recalled one particular high-performing team that she led. It was high-performing partly because the team was able to run its services fully with very little support from an external SRE team.

We [brought] someone we knew with knowledge… to fully understand how does our AWS work and how to restart our services and everything so that we were able to fully run it and be on call for it.

The sweet spot notion comes in once again where you have to work out, “How much do I need to develop and spend time on developing this SRE knowledge, and how much do I put into developing the product?”

It's complicated, but communication is the key to it.

How can product teams handle conflicting priorities?

Anemari told me that the answer depends on the context:

where you are
what situation your product is in
how the developer and SRE teams are laid out

It’s very different to start a project from scratch versus having a whole monolith running for years and trying to make things better.

Anemari added:

It might sound a little bit crazy but I still think that it's very important for you to like write down all of these different things that you think you need to solve…
A big part of the role of a tech lead is to make sure that all of these different parts agree and align on a strategy. So that means the product, that means the developer, that means the SREs, that means the customer support, etc.

A tech lead would work to bring all of this together and then propose a way forward to all of these different stakeholders. You might have to put it as simply as, “This is what we have to focus on right now, given that, as a product team, we have to keep delivering.”

Then come the compromises. For example, “20 percent of our time would go into improving our delivery pipeline.”

It is crucial to work out compromises before getting agreement from the various stakeholders in the software delivery organization.

This write-up was just a preview of what Anemari and I talked about in this episode of the Reliability Enablers. Be sure to listen to get the other half of our conversation.