Runbooks for better incident response
Introduction
I can confidently tell you that runbooks form a critical part of the incident response toolkit. I will also tell you that SREs are well-placed to start and oversee the development of runbooks.
If you don't have a runbook yet, let me entice you with the thought of checklist-type documentation to follow when you're woken up to deal with a 3am production meltdown.
You won't be the only one using the runbook. Its simplicity makes it easier to bring product teams into the incident response action. It gives clarity to those who may not be as experienced as you when investigating faults with their work-in-production.
Runbooks are most useful when your incident response has become a case of "putting out the same fires over and over again". They remove unnecessary thinking from incident response and help you focus on the task at hand.
Or at least carry out the work without an overwhelmed 🤯 feeling.
Why runbooks are useful in SRE incident response
Here are 3 reasons why runbooks are superior to "I'll figure it out as it comes" as a strategy:
Automated processes don’t always protect against all possible issues — so software operations needs 10s to 100s of different activities actioned by skilled humans to keep the system rolling
“30-40% of procedures require human judgment to resolve safely, so that’s still a bunch of run books that won’t go away – even if large parts of deployment are push-button / automated processes.”
Prevents annoying experiences like this: “I recently ran into a situation where I spent 6 hours understanding how something works that would have taken 20 minutes if the relevant information was stored somewhere.”
Ways that teams have set up their runbooks
Confluence – not designed specifically for managing runbooks, but an open-ended tool that works well if you have a solid idea of how to design a runbook effectively
Jupyter Notebooks – an open-source tool that combines text, images, and live code snippets, so a decent option if you are happy to install and maintain it
Markdown files hosted in a Git repo – maintenance might become an issue over time without strict guidelines within the team
Err… this ➝ “Sticky notes on someone’s desk. We’re thinking about getting a laminator to keep the coffee spills from being too serious of a problem.” 😅
Factors to consider when developing your own runbook
Make a standard runbook template – it makes information easier to process when you’re in a pinch, such as when resolving an urgent incident
Have a collaborative approach to building the runbooks — don’t palm the task off to technical writers – the people who design and build the systems should be the main authors or at least actively participate in the process
Explain to the runbook user why each component of the system was designed the way it was – ambiguity around intentions is a key reason people fail to come up with creative solutions to a tricky problem
Some runbooks have sub-processes – it’s important to clarify what these are and how they relate to the process that they are the children of
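To make the points above about a standard template, design rationale, and sub-processes a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the Runbook class, its field names, and the example database incident are hypothetical and not taken from any particular tool. It shows one way a shared template could hold the same fields for every runbook and render a checklist in which each sub-process names the process it is a child of.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Runbook:
    """Hypothetical standard template: every runbook fills in the same fields."""
    title: str
    trigger: str            # the alert or symptom that sends someone to this runbook
    design_rationale: str   # why the component was built the way it was
    steps: list[str] = field(default_factory=list)
    sub_runbooks: list["Runbook"] = field(default_factory=list)

    def to_markdown(self, level: int = 1, parent: Optional[str] = None) -> str:
        """Render the runbook as a Markdown checklist, nesting sub-processes."""
        lines = [f"{'#' * level} {self.title}"]
        if parent is not None:
            # State explicitly which process this sub-process belongs to.
            lines.append(f"Sub-process of: {parent}")
        lines.append(f"Trigger: {self.trigger}")
        lines.append(f"Why it is built this way: {self.design_rationale}")
        lines.extend(f"- [ ] {step}" for step in self.steps)
        for child in self.sub_runbooks:
            lines.append("")
            lines.append(child.to_markdown(level + 1, parent=self.title))
        return "\n".join(lines)


# Example data below is invented purely to exercise the template.
failover = Runbook(
    title="Fail over to the replica",
    trigger="Primary database is unreachable",
    design_rationale="Failover is manual so a human can check for data loss first",
    steps=["Confirm the replica is caught up", "Promote the replica"],
)
main = Runbook(
    title="Database not responding",
    trigger="Health check alert from the database",
    design_rationale="A single primary keeps writes simple to reason about",
    steps=["Check recent deploys", "Check whether the primary is reachable"],
    sub_runbooks=[failover],
)

print(main.to_markdown())
```

Because every runbook shares the same fields, a reader at 3am always finds the trigger, the reasoning, and the checklist in the same place, and the "Sub-process of" line keeps child procedures tied to their parent.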