Reliability Enablers (SREpath)
#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability

In most SaaS companies, 99.9% uptime gets you promoted. In critical infrastructure like the energy sector, it gets flagged as a failure mode.

Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?”

But in the energy sector? There is no acceptable downtime. Not even a little.
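To put that margin in numbers, here is a minimal sketch of the standard error-budget arithmetic. The function name `allowed_downtime_minutes` and the 365-day year are assumptions made for illustration; they are not from the episode.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime an availability target permits over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Compare a typical SaaS target with stricter ones (illustrative targets only).
for target in (99.9, 99.99, 100.0):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

Under these assumptions, 99.9% still allows roughly 8.76 hours of outage per year, which is exactly the margin that critical infrastructure does not have.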

In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it.

What makes this episode different is that Wade isn’t a reliability engineer by title, yet reliability is baked into everything his team touches. That matters more than ever as software creeps deeper into operational technology (OT) and the cloud tries to stake its claim in critical systems.

We cover:

  • Why 100% uptime is the minimum bar, not a stretch goal

  • How the rise of renewables has increased system complexity — and what that means for monitoring

  • Why bespoke integration and SCADA spaghetti are still normal (and here to stay)

  • The reality of cloud risk in critical infrastructure (“the cloud is just someone else’s computer”)

  • What software engineers need to understand if they want their products used in serious environments

This isn’t about observability dashboards or DevOps rituals. This is reliability when the lights go out and people risk getting hurt if you get it wrong.

And it’s a reminder: not every system lives in a feature-driven world. Some systems just have to work. Always. No matter what.
