Rundown of Uber's SRE practice
Introduction
Every time you push a button like the one below to request an Uber ride, you activate a sequence of (micro)service requests.
You'd never know unless you look under the hood because most of these services run solely in the background. Yet every service contributes to the start and completion of the Uber ride experience.
We'll explore how Uber's engineers ensure this goes on without trouble. The effort ties into Uber's SRE practice, which has been part of the broader infrastructure practice since 2014.
📊 Performance statistics for Uber
Uber has grown significantly since 2010 when it was a car-for-hire service in San Francisco.
Site Reliability Engineers (SREs) ensure rides, microservices, and writes work as reliably as possible.
What is Uber's contribution to Site Reliability Engineering?
Like Netflix is known for Chaos Engineering, Uber is best known for contributing to distributed tracing through its open-source Jaeger tool.
How is it that Netflix developed chaos engineering and Uber developed tracing?
It comes down to the internal problems each organization needed to solve. Netflix is focused on the playback of media files so the start of playback is most important.
Hence Netflix's focus was on assuring a high starts-per-second through chaos engineering. They developed the "Simian Army" tool suite to achieve this.
For Uber, starts-per-second is not the key metric. They have a resilience suite similar to the Simian Army, but tracing is their holy grail.
That's because Uber's golden metric is requests handled i.e. rider requests connecting with driver availability and geolocation. This kind of request effort requires making a lot of API calls on many services.
What kind of services are called for every ride? Here are a handful of examples:
Logistics and routing
Maps and navigation
Supply & demand market
Fraud detection
Payments
Surge calculation
SMS, calling, messaging
Data warehouse + modeling
Completed requests across the entire service chain are the bread and butter of Uber. When things go wrong, it’s about tracing the many API calls to find faults in a service or several services.
Yuri Shkuro from Uber developed a distributed tracing tool to gather request data from its complex service network, heavy service load, and high traffic. This tool is called Jaeger.
Check out my write-up on how Jaeger fits into observability.
🎯 Uber SRE goal parameters
Mission - Highly performant, highly available systems
Uptime goal - 99.99% uptime
Performance target - subsecond end-user latency (<500ms)
Business goal - minimize operational cost per ride
🤝 How SRE fits into the Uber engineering culture
Team formation
SRE teams at Uber can be found in 2 forms - embedded within larger teams or as handlers of infrastructure-at-large. Let's unpack both forms:
Embedded SRE teams work alongside or within engineering teams responsible for services like data engineering, front/backend, and real-time matching.
"Infrastructure-at-large" SRE teams take ownership of infrastructure across the Uber organization. Examples of aspects include security, storage, compute, and observability.
A dedicated SRE team owns observability across the entire organization and instruments it into all services.
According to Uber's first SRE hire, Rick Boone, observability is too critical to Uber's 99.99% uptime goal for implementation responsibility to be distributed.
It's much easier to have one focused team do it well than ask and pray for hundreds of disparate teams to do it well.
History of SRE at Uber
The SRE practice in Uber was started in October 2014, a very tumultuous time for the company. Will Larson, founder of SRE at Uber wrote at great length about this.
Before the existence of SRE in Uber, the core cloud infrastructure team had 15 engineers reacting to a myriad of operational issues including incident response, platform work, and more.
Until this point, Uber's 500 or so services were the responsibility of the overloaded "core infra" team.
There was little to no time to improve the systems and prevent recurring issues consuming a lot of engineering time. A subsequent systems meltdown prompted a wake-up call to shift toward a reliability-focused culture.
Culture for Uber SRE
The SRE culture at Uber can be summed up in 4 words: small team, outsized impact. Every. effort. must. scale.
Give me a lever and a fulcrum on which to place it, and I shall move the world.
- Archimedes quote that aptly describes SRE work
In 2016, for every SRE at Uber, there were:
14 services
30 engineers
30 engineers
640 servers
Compared to other tech companies, Uber's SREs have unique challenges like persistent trip tracking.
Persistent trip tracking in terms of what you can see is when you see the driver's car icon move on the map toward you and after you've been picked up, toward your destination.
Uber is keeping real-time data on 100s of 1000s of trips across the world at any given time.
It is a real challenge to ensure all of this data continues to stream smoothly. To achieve this, SREs are given a high degree of autonomy to work on solutions they feel the system needs.
We'll now cover Uber SRE's 3 cultural traits that support this.
3 cultural traits driving Uber's SRE work
🚀 Automate to support developer velocity
Uber's services call for high reliability but at the same time, in their early days, the SRE team faced a unique challenge that now forms a large part of the wider SRE ethos.
They had to ensure highly available, performant services without slowing developers down as would happen when more traditional ops workflows are in place.
They achieved this through having SREs with the mindset of serving the needs of service teams with leverage in mind. Their goal was to automate critical workflows to scale their work.
🙏🏽 Respect engineer autonomy
I've reviewed blog posts from senior Uber engineering leaders down to SRE interns. There is a clear pattern of respecting an engineer's autonomy.
The thinking behind this is "make what you think will make things better and we will try to make it work with the wider system".
This kind of thinking doesn't come without its challenges. That's exactly what happened as Uber grew quickly into a very large company. Quoting Will Larson:
"Uber’s wider infrastructure organization had an organic approach to problem-solving. It was fundamentally a bottoms-up organization that had local strategies without a common overarching strategy, which created quite a few points of friction." (source)
Give engineers autonomy, but have working groups ensure all these autonomous works are aligned properly with the overall direction.
📣 Hit a combination of influence and impact
3 factors play into the influence and impact an engineer can have. These are:
high internal impact
high ability to influence decisions and actions
high external impact
In a small startup, engineers experience #1 and #2. Every action they take affects the internal mechanics of the startup. Plus they get to influence a lot of decisions.
In a large tech company, engineers experience #1 and #3 but not a lot of #2. Their work touches many people inside and outside the organization, but they rarely influence decision-making.
During Uber's growth phase, engineers experienced all 3 points. Their work touches many people inside the organization plus millions outside it, but they also get to influence decisions.
Conclusion
Uber SREs have been given a lot of autonomy to fix the unique problems that the application can face. After all, with 3000+ services as of June 2022, there's a huge surface area for things to go wrong. It is amazing what goes on in the background so you can safely ride from A to B.