25+ Site Reliability Engineering OKRs
Read me before reviewing the Site Reliability OKRs below
Please review these guidelines before you consider adapting the OKRs:
Many of the OKRs are ambitious examples, certainly more than most junior SREs should be given or could handle
Most OKRs would be the culmination of efforts by an entire SRE team, not a single engineer
Numbers in the OKRs, e.g. 0.75%, are for illustrative purposes only; substitute numbers based on your own metrics and goals
Incident Response OKRs
Reduce MTTR for on-call engineers by 5%
Develop buffers to ensure incidents consume < 75% of the error budget
Reduce false-positive system alerts to cut on-call staffing costs
Speed up the resolution of critical incidents by 5%
Increase the coverage of 4-point SLIs from 90% of services to 100%
Reduce manual toil from 25% of responder time to 20%
Reduce incident recurrence from 8 out of 10 to 6 out of 10 incidents
Ensure SLA targets are realistic and in line with current SLIs for > 97.5% of accounts
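The MTTR and error-budget objectives above can be made concrete with a little arithmetic. The following is a minimal sketch, not a prescribed method: the 99.9% availability SLO, the 30-day window, the sample durations, and all function names are assumptions chosen for illustration.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(downtime_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the error budget used up by the observed downtime."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def mttr_minutes(resolution_minutes: list[float]) -> float:
    """Mean time to resolve: average of the per-incident resolution times."""
    return sum(resolution_minutes) / len(resolution_minutes)


if __name__ == "__main__":
    slo = 0.999                      # hypothetical availability target
    incidents = [12.0, 8.0, 10.0]    # hypothetical resolution times in minutes

    used = budget_consumed(sum(incidents), slo)
    print(f"Error budget: {error_budget_minutes(slo):.1f} min")
    print(f"Consumed: {used:.1%} (objective: stay under 75%)")
    print(f"MTTR: {mttr_minutes(incidents):.1f} min")
```

With these sample values the three incidents total 30 minutes of downtime against a budget of roughly 43.2 minutes, so the team would sit just under the 75% threshold.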
System performance and resilience OKRs
Reduce 5xx errors from 1% down to 0.75%
Increase the share of microservices designed for failover from the current 60% to 65%
Reduce network latency among the top 5 services by 2.5%
Increase the application's average load speed by 0.25%
Reduce open-source-software-related errors by 10%
Increase black swan event awareness among developers to 90%
Plan capacity to handle unexpected demand bursts of up to 25%
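Percentage targets such as the server-error and burst-capacity objectives above are easy to misread, because a move from 1% to 0.75% is a 0.25 percentage-point drop but a 25% relative reduction. The sketch below shows one way to compute these figures; the function names and sample numbers are hypothetical, and server errors are assumed to mean HTTP status codes 500 to 599.

```python
def server_error_rate(status_codes: list[int]) -> float:
    """Share of responses whose HTTP status code falls in the 500-599 range."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def relative_reduction(baseline: float, target: float) -> float:
    """Relative improvement needed to move a metric from baseline to target."""
    return (baseline - target) / baseline


def burst_capacity(peak_load: float, burst_fraction: float) -> float:
    """Capacity needed to absorb a burst above the normal peak load."""
    return peak_load * (1.0 + burst_fraction)


if __name__ == "__main__":
    # Hypothetical sample: 200 responses, two of them server errors.
    codes = [200] * 197 + [500, 503, 404]
    print(f"Server error rate: {server_error_rate(codes):.2%}")
    print(f"Relative reduction 1% -> 0.75%: {relative_reduction(0.01, 0.0075):.0%}")
    print(f"Capacity for a 25% burst over 4000 req/s: {burst_capacity(4000, 0.25):.0f} req/s")
```

Stating whether a key result is measured in percentage points or as a relative change avoids disputes when the OKR is graded.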
Developer support OKRs
Drive rail-guided services from 40% to 50% of all new launches
Speed up time to production for images by 20%
Improve developer speed-to-publish by 10%
Increase tool efficiency to < 2 same-purpose tools per category across teams
DevSecOps OKRs
Reduce build security issues by 25%
Drive DevSecOps awareness among developers to 75% of the headcount
Drive security of database architecture with < 1 major incident per year
FinOps (Cloud Cost Control) OKRs
Reduce the cost of stateful storage capacity by 10%
Reduce total cloud billing by 1%
Reduce vendor-based tool costs by 10%
Reduce routine downtime maintenance costs by 3%
Work practices OKRs
Increase increment velocity in SRE project work, delivering increments one sprint sooner
Reduce operational work from 65% of total work time to 55%
Feel free to reach out if you have any questions about the above OKRs or want us to add a new OKR.