Reliability Enablers (SREpath)
#48 Cutting Down "Toil" aka Manual Work in Software

Reliability-focused engineering is famous across other disciplines for one thing in particular: reducing toil. In this episode, we look into what toil is, how to reduce it, and more.

Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil.

We hit the jackpot with concepts like:

  1. what toil is, according to a set of 5 criteria

  2. why even care about toil?

  3. where you can find toil in your software system

  4. Google’s goal for how much work (%) should be toil

  5. the fact that toil isn’t always all that bad

Don’t have time to listen to what we learned or added to the concepts?

Check out the takeaways toward the end of this email.

But first…


Before we jump into the takeaways, here’s a new segment I’m trying out for newsletters. I’ll highlight a new reliability tool that I think could help you.

Do you struggle to visualize your Kubernetes workloads?

In that case, have you heard of kube-ops-view?

It helps you visualize your complex K8s clusters and everything inside them.

How cool is that?! A comprehensive view. Via Henning Jacobs

For a deeper rundown, see my LinkedIn post about kube-ops-view, which shares a few more details.


Back to our original programming…

Here are key takeaways from our chat

  1. Define and Identify Toil

    Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.

  2. Prioritize Automation

    Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.

  3. Embrace the Role of an SRE

    Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.

  4. Address Common Sources of Toil

    Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.

  5. Adopt a Toil Elimination Mindset

    Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.

  6. Develop a Culture of Continuous Improvement

    Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.
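To make the first takeaway a bit more concrete, here is a minimal sketch of how a team might tag a week of work and estimate how much of it was toil. It uses only the three traits named in takeaway 1 (manual, repetitive, potentially automatable). The `Task` type, its fields, and the sample week are hypothetical and only there for illustration; your own criteria and data will differ.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical record for one piece of work logged during the week.
    name: str
    hours: float
    manual: bool
    repetitive: bool
    automatable: bool


def is_toil(task: Task) -> bool:
    # Assumption: a task counts as toil only if it has all three traits
    # from takeaway 1 (manual, repetitive, potentially automatable).
    return task.manual and task.repetitive and task.automatable


def toil_share(tasks: list[Task]) -> float:
    # Fraction of logged hours spent on toil (0.0 when nothing was logged).
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.hours for t in tasks if is_toil(t)) / total


# Made-up sample data, not taken from the episode.
week = [
    Task("restart stuck worker by hand", 3.0, True, True, True),
    Task("design new alerting rules", 5.0, True, False, False),
    Task("copy release notes into wiki", 2.0, True, True, True),
]

print(f"Toil share this week: {toil_share(week):.0%}")
```

Tracking this number week over week is one simple way to see whether automation work is actually paying off, and which tasks are the best candidates to automate next.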

Until next time, happy toil hunting!

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.