Reliability Enablers (SREpath)
Reliability Enablers
#60 How to NOT fail in Platform Engineering

Ankit Wal from ThoughtWorks Asia Pacific gave me the inside word on this hot topic.

Here’s what we covered:

Defining Platform Engineering

  • Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination.

  • Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.

Ankit’s career journey

  • Didn't choose platform engineering; it found him.

  • Early start in programming (since age 11).

  • Transitioned from a product engineer mindset to building internal tools and platforms.

  • Key experience across startups, the public sector, unicorn companies, and private cloud projects.

Singapore Public Sector Experience

  • Public sector: Highly advanced digital services (e.g., identity services for tax, housing).

  • Exciting environment: Software development in Singapore’s public sector is fast-paced and digitally progressive.

Platform Engineering Turf Wars

  • Turf wars: Debate among DevOps, SRE, and platform engineering.

    • DevOps: Collaboration between dev and ops to think systemically.

    • SRE: Operations done the software engineering way.

    • Platform engineering: Delivering operational services as internal, self-service products.

Dysfunctional Team Interactions

  • Issue: Requiring tickets to get work done creates bottlenecks.

    • Ideal state: Teams should be able to work autonomously without raising tickets.

    • Spectrum of dysfunction: Ranges from one ticket for one service to multiple tickets across teams, which leads to delays and misconfigurations.

Quadrant Model (Autonomy vs. Cognitive Load)

  • Challenge: Balancing user autonomy with managing cognitive load.

  • Goal: Give product teams autonomy without adding to their cognitive load.

  • Solution: Platforms should abstract unnecessary complexity while still giving teams the autonomy to operate independently.

    How it pans out

    • Low autonomy, low cognitive load: Dependent on platform teams but a simple process.

    • Low autonomy, high cognitive load: Requires interacting with multiple teams and understanding technical details (worst case).

    • High autonomy, high cognitive load: Teams have full access (e.g., AWS accounts) but face infrastructure burden and fragmentation.

    • High autonomy, low cognitive load: Ideal situation, where teams get what they need quickly without needing detailed knowledge.
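
The four quadrants above can be sketched as a small lookup. This is only an illustration of the model as summarized in these notes, not something presented in the episode; the names `Quadrant` and `classify` and the two boolean inputs are assumptions made for the example.

```python
from enum import Enum


class Quadrant(Enum):
    """The four combinations of autonomy and cognitive load (names are hypothetical)."""

    SIMPLE_BUT_DEPENDENT = "low autonomy, low cognitive load"
    WORST_CASE = "low autonomy, high cognitive load"
    FRAGMENTED = "high autonomy, high cognitive load"
    IDEAL = "high autonomy, low cognitive load"


def classify(high_autonomy: bool, high_cognitive_load: bool) -> Quadrant:
    """Map a team's situation onto one of the four quadrants."""
    if high_autonomy:
        # Full access (e.g., own AWS accounts) is only ideal if the load stays low.
        return Quadrant.FRAGMENTED if high_cognitive_load else Quadrant.IDEAL
    # Without autonomy, the team depends on others; high load makes it the worst case.
    return Quadrant.WORST_CASE if high_cognitive_load else Quadrant.SIMPLE_BUT_DEPENDENT


if __name__ == "__main__":
    # A team that raises tickets with several teams and must know the technical details.
    print(classify(high_autonomy=False, high_cognitive_load=True).value)
```

Treating each axis as a yes/no answer is a simplification; in practice both autonomy and cognitive load sit on a spectrum.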

Shift from Product Thinking to Cognitive Load

  • Cognitive load focus: More important than product thinking alone; consider the human experience of using the system.

  • Team Topologies: Mentioned as a key reference on this concept of cognitive load management.

Platform as a Product Mindset

  • Collaboration: Building the platform in close collaboration with initial users (pilot teams) is crucial for success.

  • Product Management: Essential to have a product manager or team dedicated to communication, user journeys, and internal marketing.

Self-Service as a Platform Requirement

  • Definition: Users should easily discover, understand, and use platform capabilities without human intervention.

  • User Testing: Watch how users interact with the platform to understand stumbling points and improve the self-service experience.

Platform Team Cognitive Load

  • Burnout Prevention: Platform engineers need low cognitive load as well. Moving from a reactive (ticket-based) model to a proactive, self-service approach can reduce the strain.

  • Proactive Approach: Self-service models allow platform teams to prioritize development and avoid being overwhelmed by constant requests.

Software reliability is a tough topic for engineers in many organizations. The Reliability Enablers (Ash Patel and Sebastian Vietz) know this from experience. Join us as we demystify reliability jargon like SRE, DevOps, and more. We interview experts and share practical insights. Our mission is to help you boost your success in reliability-enabling areas like observability, incident response, release engineering, and more.