Reliability Enablers (SREpath)
Reliability Enablers
#53 What's Missing in Incident Response Processes?


Bonus clip from my chat with Dr Vladislav Ukis about the traps that beginners to reliability work fall into during incident response

Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust.

Incident response software alone isn't going to fix bad incident processes.

It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident.

But you also need to have effective processes and human-technology integration. Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority setting.

According to Vladislav, at the beginning of your SRE journey, the focus is not on setting up an incident response process, but on core SRE artifacts like SLIs, availability measurement, SLOs, etc.

These core SRE concepts come first, including knowing when you can safely invest more in customer-facing features. At some point, once you've got these things more or less established in the organization, a formal incident response process becomes the next step.

Understanding and Leveraging SLOs

Once your Service Level Objectives (SLOs) are well-defined and refined over time, they should accurately reflect user and customer experiences. Your SLOs are no longer just initial metrics; they’ve been validated through production.

Product managers should now be able to use this data to make informed decisions about feature prioritization. This foundational work is crucial because it sets the stage for integrating a formal incident response process effectively.

Implementing a Formal Incident Response

Before you overlay a formal incident response process, ensure that you have the cultural and technical groundwork in place.

Without this groundwork, the process might not be as effective. When the foundational SLOs and organizational culture are strong, a well-structured incident response process becomes significantly more effective.

Coordinating During Major Incidents

When a significant incident occurs, detecting it through SLO breaches is just the beginning. You need a system in place to coordinate responses across multiple teams.
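As a rough illustration of detecting an incident through an SLO breach, here is a minimal Python sketch. It assumes a request-based availability SLI; the `Slo` class, the function names, and the 99.9% target are hypothetical and not taken from the episode or the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO definition: a name plus a target success ratio."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def availability(good_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of successful requests in the window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic is not a breach
    return good_requests / total_requests


def is_breached(slo: Slo, good_requests: int, total_requests: int) -> bool:
    """True when the measured SLI falls below the SLO target."""
    return availability(good_requests, total_requests) < slo.target


checkout = Slo(name="checkout-availability", target=0.999)
print(is_breached(checkout, good_requests=9980, total_requests=10000))
```

A breach like this only tells you that something is wrong. Who responds, and how the teams coordinate, is what the incident response process has to answer.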

Consider appointing incident commanders and coordinators, as recommended in PagerDuty’s documentation, to manage this coordination. Develop a lightweight process to guide how incidents are handled.

Classifying Incidents

Establish an incident classification scheme to differentiate between types of incidents. This scheme should include priorities such as Priority One, Priority Two, and Priority Three.

Due to the inherently fuzzy nature of incidents, your classification system should also include guidelines for handling ambiguous cases. For instance, if uncertain whether an incident is Priority One or Two, default to Priority One.
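The "default to the higher priority" rule could be sketched in Python as follows. This is one possible encoding, not a scheme from the book: the `Priority` enum and `classify` function are hypothetical, and treating an incident with no candidate priorities as Priority One is an added assumption.

```python
from enum import IntEnum
from typing import Iterable


class Priority(IntEnum):
    """Lower number means more severe."""
    P1 = 1
    P2 = 2
    P3 = 3


def classify(candidates: Iterable[Priority]) -> Priority:
    """Resolve an ambiguous classification to the most severe candidate."""
    options = list(candidates)
    if not options:
        # Assumption: with nothing known yet, treat the incident as worst case.
        return Priority.P1
    return min(options)


# A responder unsure between Priority One and Priority Two gets Priority One.
print(classify([Priority.P1, Priority.P2]).name)
```

Erring on the side of the more severe priority means an incident can be downgraded later, which is usually cheaper than discovering too late that it was under-classified.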

Deriving Actions from Incident Classification

Based on the incident classification, outline specific actions. For example, Priority One incidents might require immediate involvement from an incident commander.

They might take the following actions:

  1. Create a communication channel, assemble relevant teams, and start coordination.

  2. Simultaneously inform stakeholders according to their priority group.

  3. Define stakeholder groups and establish protocols for notifying them as the situation evolves.
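One way to keep the mapping from classification to actions explicit is a small lookup table. The Python sketch below is illustrative only: the Priority One steps paraphrase the list above, while the Priority Two and Priority Three entries, the `RUNBOOK` name, and the `actions_for` function are hypothetical placeholders.

```python
from typing import Dict, List

# P1 steps follow the list above; P2 and P3 entries are hypothetical placeholders.
RUNBOOK: Dict[str, List[str]] = {
    "P1": [
        "page incident commander",
        "create communication channel",
        "assemble relevant teams",
        "notify stakeholder groups by priority group",
    ],
    "P2": ["notify owning team", "create communication channel"],
    "P3": ["file ticket for owning team"],
}


def actions_for(priority: str) -> List[str]:
    """Return the steps for a priority; unknown values fall back to P1."""
    return list(RUNBOOK.get(priority, RUNBOOK["P1"]))


for step in actions_for("P1"):
    print(step)
```

A table this small is easy to review and keeps the process close to the single-sheet size that makes it usable during an incident.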

Keep Incident Response Processes Simple and Accessible

Ensure that your incident response process is concise and easily understandable. Ideally, it should fit on a single sheet of paper. Complexity can lead to confusion and inefficiencies, so aim for simplicity and clarity in your process diagram.

This approach ensures that the process is practical and can be followed effectively during an incident.

Preparing Your Organization

An effective incident response process relies on an organization’s readiness for such rigor. Attempting to implement this process in an organization not yet mature enough may result in poor adherence during critical times.

Make sure your organization is prepared to follow the established procedures.

For a deeper dive into these concepts, consider reading "Establishing SRE Foundations," available on Amazon and other book retailers. For further inquiries, you can also connect with the author, Vlad, on LinkedIn.

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.