The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between SREs and the product developers when deciding how much risk to allow.
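An SLO (service level objective) has three parts:
- Objective: the desired level of success, expressed as a percentage
- SLI: a measurement used to distinguish failed events from successful ones
- Timeframe: the window over which the SLI is evaluated, which enforces a recency bias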
For example:
Objective: 99.95%
SLI: 95th percentile latency of API requests over 5 minutes is < 100 ms
Timeframe: previous 28 days
Read as a single statement: 99.95% of the 5-minute 95th percentile latency measurements for API requests are less than 100 ms over the previous 28 days.
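A minimal sketch of how such a definition could be checked in code. It assumes the SLI yields one 95th percentile latency value per 5-minute window and that a window counts as good when that value is below the threshold. The names `SLO`, `compliance`, and `is_met` are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SLO:
    """Hypothetical container for the three parts of the example SLO."""
    objective: float       # desired success ratio, e.g. 0.9995 for 99.95%
    threshold_ms: float    # SLI threshold: p95 latency must be below this
    timeframe_days: int    # evaluation window, e.g. the previous 28 days


def compliance(slo: SLO, p95_by_window_ms: Sequence[float]) -> float:
    """Return the fraction of windows whose p95 latency is below the threshold.

    Assumption: each element is the p95 latency (in ms) of one 5-minute
    window that falls inside the SLO timeframe.
    """
    if not p95_by_window_ms:
        raise ValueError("at least one measurement window is required")
    good = sum(1 for value in p95_by_window_ms if value < slo.threshold_ms)
    return good / len(p95_by_window_ms)


def is_met(slo: SLO, p95_by_window_ms: Sequence[float]) -> bool:
    """True when the measured compliance reaches the objective."""
    return compliance(slo, p95_by_window_ms) >= slo.objective


example_slo = SLO(objective=0.9995, threshold_ms=100.0, timeframe_days=28)
```

With this sketch, `is_met(example_slo, measurements)` answers whether the objective held for whatever list of per-window p95 values is passed in. How those values are collected is outside its scope.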
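The error budget is the complement of the objective:
Error Budget = 1 - Objective
Multiplying that fraction by the length of the timeframe expresses the budget as time. For a 99.95% objective over 28 days:
(1 - 0.9995) * (28 * 24 * 60) = 20.16 minutes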
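The same arithmetic can be sketched as a small helper. The function names below are illustrative assumptions, not part of any library. The example assumes the budget is expressed in minutes of the timeframe.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_fraction(objective: float) -> float:
    """Error Budget = 1 - Objective, with the objective given as a ratio."""
    if not 0.0 <= objective <= 1.0:
        raise ValueError("objective must be a ratio between 0 and 1")
    return 1.0 - objective


def error_budget_minutes(objective: float, timeframe_days: int) -> float:
    """Express the error budget as minutes within the timeframe."""
    return error_budget_fraction(objective) * timeframe_days * MINUTES_PER_DAY


# 99.95% objective over the previous 28 days
budget = error_budget_minutes(0.9995, 28)
print(f"{budget:.2f} minutes")
```

Formatting to two decimals hides the small floating-point error that comes from computing `1 - 0.9995` in binary arithmetic.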