Reliability Enablers (SREpath)
#52 Navigating Complexity within Incidents


I talked with Sonja Blignaut about the pressing issue of increasing complexity within incidents. She's an expert in complexity thinking with an early career background as a Fortran and C programmer.

As I mentioned in the intro, Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering.

But it is!

Our systems are becoming more complex and so are the resulting incidents.


Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this post.

The complexity of incidents

You might already know that incident response is a huge part of reliability work in software systems.

What you might not know is that incident response as a whole is getting harder because of more unpredictable situations.

Here are some modern computing examples that contribute to this complexity:

  • cloud computing, which has shifted us from 1 to 100+ pieces of infrastructure

  • serverless functions, whose statelessness makes incident response harder

  • containerization, which increases configuration complexity and so widens the failure surface area

So how can a mindset geared towards complexity aid in adapting to unexpected challenges during an incident?

Let’s first understand the levels of incidents

You might be used to terminology like Sev1/2/3 and P0/1/2/3/4 events, but let’s try to distinguish them in terms of a decision-making framework.

The Cynefin framework is a way to decipher the complexity of incidents. It helps us distinguish among four different kinds of incidents.

The kinds of incidents can be:

  1. Simple. A straightforward issue that can be resolved by following a series of predetermined steps, e.g. a runbook for a server reboot procedure.

  2. Complicated. Not straightforward, but we can still plan for it. We know where things might fail and plan for that, e.g. a load balancer configuration update.

  3. Complex. This is where unexpected things can happen. Typically, they emerge, and we need to find our way through, e.g. application performance degradation.

  4. Chaotic. Things that you can prepare for but can't plan for, because you have no idea exactly what's going to happen, e.g. a data center power outage.
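
As a rough illustration, the four kinds of incidents above could be encoded as a small lookup that pairs each kind with a response pattern. The decision sequences used here (sense, categorize, respond and so on) are the ones commonly associated with the Cynefin framework rather than something stated in this post, and the `Domain` enum, `RESPONSE_PATTERN` mapping, and `first_move` helper are hypothetical names chosen for this sketch.

```python
from enum import Enum


class Domain(Enum):
    """The four kinds of incidents described above."""
    SIMPLE = "simple"
    COMPLICATED = "complicated"
    COMPLEX = "complex"
    CHAOTIC = "chaotic"


# Decision sequences commonly associated with each Cynefin domain.
# Assumption: taken from general Cynefin descriptions, not from this post.
RESPONSE_PATTERN = {
    Domain.SIMPLE: ("sense", "categorize", "respond"),
    Domain.COMPLICATED: ("sense", "analyze", "respond"),
    Domain.COMPLEX: ("probe", "sense", "respond"),
    Domain.CHAOTIC: ("act", "sense", "respond"),
}


def first_move(domain: Domain) -> str:
    """Return the first step of the response pattern for a domain."""
    return RESPONSE_PATTERN[domain][0]


if __name__ == "__main__":
    # Print each kind of incident next to its response pattern.
    for domain in Domain:
        print(f"{domain.value:>11}: {' -> '.join(RESPONSE_PATTERN[domain])}")
```

The sketch only makes one point: a runbook-style first move ("sense") fits the first two kinds, whereas complex and chaotic incidents start with probing or acting before there is anything reliable to analyze.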

How incidents develop in this framework

Sonja has worked extensively with power utility companies and software teams in other industries. She found that the initial moments of an incident would often be chaotic.

For example, she faced a situation where several South African banks' payment systems went down simultaneously, causing chaos for her clients.

How do we prepare for something chaotic? It's a crisis. We almost need to drop everything and just fix this.

After the initial chaos settles, we try to wrangle control of the incident through our tried and tested methods. But at times, we struggle to take control.

In many organizations, we've come to equate control with competence. If you're a leader or if you are an expert software engineer, you're supposed to be in control.

If you're dealing with something complex, you can't ever fully be in control. That creates anxiety because then we start questioning our own competence.

In many organizations, we experience others questioning our competence as well because we're trying to control something that essentially can't be controlled.

That something is the underlying complexity within our incidents.

A deeper dive into complexity

In plain English, complexity is what occurs in interconnected systems where changes in one part can affect the whole system in unexpected ways.

Sound familiar?

Sonja came across the idea of complexity while working as a consultant at IBM. One of her colleagues, Dave Snowden, went on to develop the Cynefin framework.

This framework explores the decision-making domains that govern how we perceive problems and go about resolving them.

[Image: Cynefin framework diagram]
by Dave Snowden via Wikimedia under CC BY 3.0

I first came across this idea of complexity when I was in IBM. I met Dave Snowden there. At the time, I was a very unhappy consultant because at many of the big consultancies — I think even still today — their business model is focused on selling best practices or selling you what I like to refer to as recipes and disregarding context.

— Sonja on why complexity piqued her interest

That meeting with Dave introduced Sonja to the idea of complexity and to how much context matters. The word context highlights the fact that every complex system is unique.

She went on to explore Dave Snowden’s framework in greater detail after leaving IBM to start her consultancy. After a while, she realized something interesting:

I think that every decision maker, whether it's senior levels or lower levels, or even if it's decision making in the household… can benefit from understanding complexity.

Why care about complexity?

Sonja told me about the key benefit of understanding complexity:

One of my early clients, who also became a friend, was quite a senior leader in an organization. She said when she fully understood complexity and the implications thereof for the first time, it was as if a weight was lifted off of her shoulders.

Being a senior leader, Sonja’s friend was grappling with many priorities at once and felt overwhelmed. But after understanding complexity, she realized that it wasn't because of a lack of competence that she couldn't understand issues.

It was because the issues were complex, and nobody could fully know them.

For some, that sense of not being able to know can create anxiety. For others, it creates a sense of freedom because essentially what it means is we are all wayfinding. We are finding our way through these messy tangles.

We will fail and we will make mistakes, but we will eventually find a way through.

But when we treat something complex as if it's complicated, very often we just waste a lot of time. We get ourselves even more stuck. We create unintended consequences.

Complex ≠ Complicated

I’ve noticed over the years that people mix up the meanings behind complex and complicated, so I felt it was best to clarify with the expert.

Sonja told me it’s good to get down to the root meanings of the two words.

The meaning behind “complicated”

The “plic” in complicated draws from the Latin word plica, aka plik.

plik means folded together. With something complicated, I can unfold it, analyze it, then understand it, and finally, replicate it.

A car is an example of a complicated system.

All of the different parts connect and come together in linear, predictable ways to create a certain functionality. There's no functionality or behavior in that car that you can't understand by understanding its parts.

The meaning behind “complex”

The “plex” in complex is a Latin root word.

It means braided together, or in Sonja’s words, “it’s tangled together”.

While there are many aspects to something being complex, the first and foremost aspect is that it's entangled in ways that we can't fully understand.

So things are connected in ways that are not linear.

In comparison, complicated systems have linear, predictable traits.

Because of this, complex systems show unique traits such as:

  • being dynamic

  • continuously shifting and changing

  • rife with interconnectedness that we can't fully understand

It's almost difficult to draw a boundary line around where this system starts and ends.


Examples of complex systems

You don’t have to wrack your brain to think of systems that fit the above criteria. Remember, complex means things can happen that you could not predict ahead of time.

Families meet the criteria and are complex systems. The human body is complex. Our gut biomes are complex. Even consciousness — scientists are still trying to figure out exactly what consciousness is and where it emerges from.

So when two or more human beings come together, the situation is immediately complex. The unpredictability continues to rise as you add more humans.

Any living system is complex, and that can sometimes create anxiety for us because you can't control a complex system the way you potentially could a complicated one.

Humans compound technical complexity

What I’m alluding to here is that while the complexity of incidents might be greater because of things like cloud, serverless, and containers, that’s not the end of it.

Complexity emerges in incidents because we don’t work in technical systems but in sociotechnical systems. Humans are managing the underlying components of our technical systems. They interpret these components in their unique way.

We as computing folk have become accustomed to running predictable systems that can tolerate human faults, but that’s changed.

Now, these very systems are shifting into a world of complexity, and we are struggling to handle the lack of predictable outcomes.

You have to understand all those tangled bits, and trying to do that can be very difficult because there are so many different things happening at the same time.

Why is it difficult to make sense of complex systems?

Our schooling plays a role

Sonja thinks one of the issues is that we were for the most part taught in our educational systems to think in very reductionist ways.

This might be a familiar experience: if you're facing what would be classified as a complicated problem, you break it down into smaller pieces, you solve the pieces and then, in the end, you can solve the complicated problem.

In complexity, it doesn't quite work like that.

Why complex systems can’t be codified

Using the family system as an example, you can't break it into smaller pieces to try and understand it. You almost need to work with the whole.

There are “emergent qualities” within the family that don't exist within the individual members of the family. As members of the family individually interact with each other, they come together in unique ways.

The same applies to organizational culture. Culture emerges from the thousands of unique interactions between the humans in that system. The more humans you add to the mix, the more interactions you create.

This increases complexity at a parabolic rate.
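
To make that growth concrete, here is a small sketch that counts interactions under one simplifying assumption: every pair of people counts as exactly one potential interaction, which gives n(n-1)/2 pairs for n people. Real interactions are richer than pairs, so treat the numbers as a floor rather than a measurement. `pairwise_interactions` is a hypothetical helper name used only for this illustration.

```python
def pairwise_interactions(people: int) -> int:
    """Count distinct pairs among `people` individuals: n * (n - 1) / 2."""
    if people < 0:
        raise ValueError("people must be non-negative")
    return people * (people - 1) // 2


if __name__ == "__main__":
    # Show how the number of pairs grows as a team gets larger.
    for team_size in (2, 5, 10, 50):
        print(team_size, pairwise_interactions(team_size))
```

Under this assumption, going from 10 to 50 people multiplies headcount by 5 but multiplies the number of pairs by roughly 27 (45 to 1,225), which is the quadratic curve the "parabolic" description points at.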

Sonja puts it well:

All of the conversations, all of the ways that people even just look at each other, the interactions between people and the technology they work in, with the office space they're in. All of these things create this emergent identity or culture of this organization. And that culture does not exist within the individual people or the individual parts.

In other words, the whole is greater than the sum of its parts.

The critical learning point at this stage is that with something complex having interrelated components, the relationships between the components become more important than the components themselves.

How to get a better grasp of complex systems

Sonja suggests that the first step to embracing complexity means unlearning the patterns of thought associated with complicated systems.

That means curbing the tendency to reduce things into parts, because when we're dealing with something complex, we also need to look at the whole.

We need to look at the whole as well as the parts and how things are connected.

That also means we need to shift from a linear problem-solution way of thinking.

In complexity, Sonja believes it’s better (and in her exact words, “more generative”) to think in terms of “emergent patterns” and not “problems”.

It's not impossible to understand complexity, but if we look at it through a linear reductionist lens, then we can get ourselves stuck. The thing to get comfortable with — and it will initially be hard — is that a pattern cannot necessarily be solved, but you can shift it.

If you think of something extremely complex like, for example, poverty or social inequality, and you see it solely as a “problem to solve”, you’ll almost immediately get stuck because the system is connected to so many other contributing parts.

You won’t know where to start, or how solving one part might cause an unpredictable outcome in another part of the system. But if you see the issue as a pattern, then all of a sudden, you can start interacting with it more effectively.

You can try different things and see what works.

You've got multiple entry points because so many things are connected.

Wrapping up

By now, I hope you have a better understanding of (1) complexity and (2) how it impacts your ability to respond to incidents.

Sonja and I also discussed:

  1. the problem of achieving psychological safety in complex environments

  2. the concept of emergence in more depth, and how it contributes to our understanding of incidents as they develop

  3. a need for resilience in working professionals as we become more attention-poor

For these 3 ideas, you’ll have to listen to the podcast episode.


About Sonja

I found Sonja’s work in 2019 while looking for ways to deal with the increasing VUCA (volatility, uncertainty, complexity, and ambiguity) at my work.

Sonja is a co-founder of Complexity Fit and founder of More Beyond focusing on helping teams build capacity for sensemaking, collaboration, and wayfinding.

She has a background in programming from her early career as a meteorologist, having worked in C and Fortran, and then progressing to working as a web developer.

You can connect with Sonja to learn more about complexity via LinkedIn.


Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.