Analysis of SRE and platform setup at 10+ tech companies
In this article, you will see a breakdown of the platform setup and SRE practices within 12 non-FAANG technology companies.
This is based on the case studies by Andrios Robert.
“There is a lot of content available on how Google did [Site Reliability Engineering]; let’s uncover what happens with the rest of the world.”
— Andrios Robert, founder of Runops.io
You might be thinking, “Why don’t I read Andrios’ writeups instead of this?” You should at some point, but chances are that you are pressed for time.
I have trawled through Andrios’ case studies and pulled together common threads, which I’ll explain in my analysis toward the end of this article.
Let’s first cover a few specifics about the companies being covered:
growth or late-stage startups and mid-sized companies — no FAANG or early-stage startups are represented below
they are all Brazilian companies — founded there or having a branch there — but their experiences should still translate across to other regions
Expanding on the last part:
Yes, these companies are based in a faraway region. However, I believe their experiences can highlight what companies in similar industries would be doing elsewhere in the world.
I doubt there is much appetite in Latin America to reinvent the wheel when serving cloud applications similar to those we see elsewhere.
Let’s have a business-level picture of the companies in a table format:
As you can see, these are not early-stage startups or small businesses.
Now let’s collate the common DevOps and SRE elements across these organizations:
Platform setup
Platform
K8s (Hash, Dafiti, Creditas, TOTVS, Empiricus, Dock, Loggi, Delivery Center, Quintoandar, Natura, PicPay)
AWS Lambda functions (Empiricus, Dock)
Istio (Hash, Dafiti, Empiricus, Loggi)
Terraform for IaC (Dafiti, Creditas, TOTVS, Natura)
Monitoring
Prometheus + Grafana (Hash, Creditas, TOTVS, Empiricus, Quintoandar, Natura)
NewRelic (Dafiti, Creditas, Delivery Center)
Datadog (Empiricus, Dock)
PagerDuty (Empiricus, Natura)
Codebase
Pre-microservices monolith (Leroy Merlin)
Diverse multi-language codebase (Dafiti, Empiricus, Delivery Center, Quintoandar, PicPay)
Single language codebase across org (Creditas, Leroy Merlin, Loggi)
SRE practices
Developers can autonomously deploy to production (Hash, Creditas, Empiricus, Dock, Natura, PicPay)
Product teams are on-call to support their product (TOTVS, Leroy Merlin, Dock, Delivery Center, Quintoandar, PicPay)
Product teams participate directly in platform/tooling matters (Hash, Loggi, Delivery Center, Quintoandar)
Feature flags enable safer rollouts (Leroy Merlin, Loggi)
Emulate Google SRE principles to an extent (Hash)
Deployments must pass the readiness checklist (Hash)
SRE teams set alerting based on the product team’s specifications (PicPay)
SRE teams responsible for after-hours incidents (PicPay)
Break SRE work into multiple streams, e.g., observability (PicPay)
My brief analysis
Platform
Kubernetes - It shouldn’t be surprising that Kubernetes (aka K8s) is the platform of choice for 11 of the 12 organizations studied. Leroy Merlin was the only exception, as it was still migrating from a legacy monolith to a microservices architecture.
Lambda - A small subset of companies (Empiricus and Dock) uses Lambda functions to execute code. The lower uptake makes sense: Lambda ties you to AWS, means learning another AWS service, and can drive up cloud costs without persistent oversight.
Istio - Service mesh seems to be getting visible use in production systems now. I wasn’t sure it would get there, given how K8s practitioners viewed it in 2020.
Terraform - It is no surprise that four of these organizations use Terraform for its infrastructure provisioning capability.
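The counts behind these platform observations can be reproduced from the lists earlier in this article. Here is a minimal Python sketch that does so. The company lists are transcribed from the article, while the names `PLATFORM`, `TOTAL_COMPANIES`, and `adoption` are my own and purely illustrative.

```python
# Platform tooling per company, transcribed from the lists in this article.
PLATFORM = {
    "Kubernetes": ["Hash", "Dafiti", "Creditas", "TOTVS", "Empiricus", "Dock",
                   "Loggi", "Delivery Center", "Quintoandar", "Natura", "PicPay"],
    "AWS Lambda": ["Empiricus", "Dock"],
    "Istio": ["Hash", "Dafiti", "Empiricus", "Loggi"],
    "Terraform": ["Dafiti", "Creditas", "TOTVS", "Natura"],
}
TOTAL_COMPANIES = 12  # number of companies stated in the introduction


def adoption(tools, total=TOTAL_COMPANIES):
    """Return {tool: (count, share)} ordered by count, highest first."""
    # set() guards against a company being listed twice for the same tool.
    counts = {tool: len(set(companies)) for tool, companies in tools.items()}
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {tool: (n, n / total) for tool, n in ranked}


if __name__ == "__main__":
    for tool, (n, share) in adoption(PLATFORM).items():
        print(f"{tool}: {n}/{TOTAL_COMPANIES} ({share:.0%})")
```

The same function works for the monitoring lists if you swap in a different dictionary.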
Monitoring
As expected, there is a fair spread of tooling used for monitoring. Nonetheless, the open-source Prometheus and Grafana combination holds the largest share, reflecting the wider CNCF community uptake.
I am curious whether any of these use commercial implementations of Grafana.
The rest of the market is divided among commercial tools (NewRelic, Datadog, PagerDuty), with two or three companies taking up each offering.
Codebase
The organizations split between single-language codebases (three) and multi-language codebases (five).
Organizations with a single language may have promoted their hiring brand as a single-language shop or hired from within a network of engineers who all worked in that one language.
There are two ways to look at language diversity in multi-language organizations:
Language popularity
Different service types
Regarding language popularity - some organizations scaled up as certain languages became popular and others faded; their engineering workforce would reflect this transition.
Regarding different service types - some services may function better in one language than in another, e.g., Golang for infrastructure services, Python for data scripting, C for Linux services, JavaScript for frontend, etc.
Feature flags seem to have a way to go before they are widely discussed and adopted. That may be because they add another layer of complexity to deploys.
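To make that extra layer of complexity concrete, here is a minimal sketch of a percentage-based feature flag in Python. It is not taken from any of the case studies. The flag name `new-checkout`, the function names, and the hashing scheme are all assumptions made for illustration, and real teams often use a dedicated flag service instead.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True when the user falls inside the flag's rollout percentage."""
    return bucket(flag, user_id) < rollout_percent


def checkout(user_id: str, rollout_percent: int) -> str:
    # "new-checkout" is a made-up flag name for this illustration.
    if is_enabled("new-checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "old checkout flow"
```

Both code paths are deployed at once, and the rollout percentage decides who sees which. That makes a rollout safer, but it also means two branches to test and a flag to clean up later, which is the complexity mentioned above.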
SRE practices
I found a few common threads but many unique practices too. It seems that how an organization practices software reliability remains unique to its environment.
I’m sure a lot more common practices exist, but they are not as obvious to discuss as, say, “We run EKS with Redis and Kafka monitored by NewRelic”.
Many organizations were open to letting developers push to production autonomously, while also making them responsible for any resulting incidents.
Some organizations, like Quintoandar, additionally incentivize their developers to sit on incident call rotations.
It was also interesting to note that many organizations have their developers actively participate in platform and tooling matters.
Here are interesting examples of how SREs are utilized:
as consultants to support incident capability - Delivery Center’s SREs are responsible for structuring the product team’s on-call schedules
as last-resort support, called upon to solve critical or tricky problems
as a center of excellence for highly secure, reliable, and performant software practices, as at Natura
Only one organization, PicPay, made SREs responsible for after-hours incidents.
Concluding words
I trust that this brief analysis of Andrios’ SRE team case studies has given you a new or clearer perspective on how organizations can set up their platform and SRE service culture.