Humans, the pesky side of system design
Most reliability engineers hear ‘system design’ and picture diagrams of load balancers, queues, and failover zones.
Fair enough. That’s the world we were educated in.
But every design workshop I’ve sat in this year has ended with very human questions:
“Who’ll own this?”
“How do we know when to alert?”
“How do we ramp these 37 changes into our flow?”
Guess what? That’s system design, too. It’s just the part we never diagram.
You already know the cognitive load of infra churn…
AWS → Kubernetes → AI pipelines → whatever’s next.
Across every shift, the engineers who stay ahead are the ones who shape the system around the system, so its behaviours support sustainable reliability.
That’s not fluff — that’s the beginning of Staff+ thinking.
No one gets promoted for invisible work.
You advance by showing evidence of potential, and the human side of the system is where it shows early.
You already know the rituals humans in your org take part in. Most orgs run them: some lightly, some intensely, some religiously.
Familiar examples include postmortems and retrospectives, SLO-writing sessions, and incident bridges with all involved parties.
But there’s a quiet crisis building up within these rituals.
Reliability flatlines when rituals replace mechanisms.
Rituals are well intentioned: they turn one-off responses to events into regular practice, so it makes sense that many teams try to improve reliability by adding more of them.
Over time, you end up with:
more dashboards
more alerts
more retros
more action items
more processes
more OKRs
more monitoring “initiatives”
These things look like progress. They feel responsible.
But they don’t change system behaviour.
That’s the ritual mindset: if we do more of the same things, reliability will improve. Except it often doesn’t. Look at what happens to a lot of these rituals in the real world:
Retro held, but nothing changes despite learnings
Action items logged, never acted on
SLOs written, but not influencing prioritization
Everyone “cares”, but nobody changes behaviour
“We added alerts!” (…and made things worse)
🤔 Why don’t rituals consistently improve reliability?
Because they don’t change the parts of the system that produce reliability. Reliability is produced by what MIT’s System Dynamics group calls feedback loops.
These loops can include delays in judgment and action, handoff frequency, review cadence, and how fast learning propagates. Rituals don’t touch any of that. They create activity, not loop correction.
You can run a retro every sprint, but if:
the process stays the same,
the information flows stay the same,
the handoffs stay the same,
and the delays stay the same…
the loop never changes.
The ritual happens around the system, but nothing shifts inside the system.
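You can make this concrete with a toy simulation. The model below is my own illustration (the function name, parameters, and numbers are all assumptions, not anything from the System Dynamics literature): defects arrive at a steady rate, and the team repairs a fraction of the backlog it *observed* some number of steps ago. Shrinking that observation delay changes the loop; running the same loop more often with the same delay does not.

```python
# Toy reliability loop: defects arrive each step, and the team repairs
# a fraction of the backlog it observed `delay` steps ago.
# All names and numbers here are illustrative assumptions.
def peak_backlog(steps, inflow, fix_rate, delay):
    """Simulate the loop and return the worst backlog reached."""
    backlog, history, peak = 0.0, [], 0.0
    for t in range(steps):
        backlog += inflow            # new defects land
        history.append(backlog)      # what the team can observe this step
        if t >= delay:               # the team acts on stale information
            backlog = max(0.0, backlog - fix_rate * history[t - delay])
        peak = max(peak, backlog)
    return peak

# Same team, same effort, same rituals; only the learning delay differs.
slow = peak_backlog(steps=50, inflow=1.0, fix_rate=0.3, delay=3)
fast = peak_backlog(steps=50, inflow=1.0, fix_rate=0.3, delay=0)
assert fast < slow  # shorter feedback delay, smaller worst-case backlog
```

Adding more retros to this model would just mean running the same loop again with the same `delay`; the peak wouldn’t move. Only changing `delay` or `fix_rate` (the mechanisms) shifts the outcome.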
Next time, I’ll map out the loops underneath all this — the mechanisms rituals never touch.

