Site Reliability Engineering Principles

Plan for failure

Prepare every system to handle unexpected inputs. Record all system failures and report on them. Fail in a way that halts interaction with the system instead of perpetuating a problem and producing defective results.
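
As a minimal sketch of failing in a way that halts interaction, the function below rejects unexpected input, records the failure, and raises instead of guessing at a value. The names `parse_quantity` and `InvalidInput` and the logger name `orders` are hypothetical, chosen only for illustration.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


class InvalidInput(ValueError):
    """Raised when input cannot be processed safely."""


def parse_quantity(raw):
    """Parse a positive integer quantity from a string, or fail loudly."""
    # Accept only strings so that floats or booleans are never silently coerced.
    if not isinstance(raw, str):
        logger.error("rejecting non-string quantity: %r", raw)
        raise InvalidInput(f"quantity must be a string, got {type(raw).__name__}")
    try:
        quantity = int(raw)
    except ValueError:
        logger.error("rejecting non-integer quantity: %r", raw)
        raise InvalidInput(f"quantity must be an integer, got {raw!r}") from None
    if quantity <= 0:
        logger.error("rejecting non-positive quantity: %r", raw)
        raise InvalidInput(f"quantity must be positive, got {quantity}")
    return quantity
```

Every rejection is logged before the exception is raised, so the failure is recorded even if a caller swallows the error.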

Approach change mindfully

Modifying a system can have unintended consequences, so every change is a potential risk. Roll changes out progressively, monitor them closely and communicate them. Always have a backup plan.
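
One way to picture a progressive, monitored rollout is a loop that widens exposure stage by stage and stops as soon as the observed error rate crosses a threshold. This is a sketch under stated assumptions: `progressive_rollout`, the `error_rate_at` callback and the 1% default threshold are hypothetical, and a real rollout would read the error rate from live monitoring.

```python
def progressive_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Advance through rollout stages, halting at the first unhealthy one.

    stages: increasing percentages of traffic, e.g. [1, 10, 50, 100].
    error_rate_at: callable returning the observed error rate at a stage.
    Returns (last_healthy_percent, status).
    """
    last_healthy = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # The backup plan: stop widening and fall back to the last
            # stage that was known to be healthy.
            return last_healthy, "halted"
        last_healthy = percent
    return last_healthy, "complete"
```

For example, if errors only appear once half of the traffic is exposed, the rollout halts with 10% as the last healthy stage instead of reaching every user.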

Measure with purpose

The telemetry a system produces should contribute to the understanding of its overall function. Describe performance measurements in ways that are meaningful to an end-user.
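
As an illustration of a measurement that is meaningful to an end user, the sketch below reports the share of requests answered within a latency threshold rather than a raw internal counter. The function name `fraction_within` and the sample values are assumptions made for the example.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Return the fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    fast = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return fast / len(latencies_ms)


# Three of these four hypothetical requests finished within 300 ms.
share = fraction_within([100, 200, 300, 900], threshold_ms=300)
```

A statement such as "75% of requests finished within 300 ms" describes what users experienced, which an average CPU load figure does not.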

Avoid alerting humans

A system that needs human action to resolve an outage can recover no faster than a person can respond, which lowers the upper bound on its availability. Instead of paging, log events for forensic purposes and create issues for emerging problems.
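
The upper bound on availability can be made concrete with simple arithmetic. The sketch below assumes a fixed number of outages per year and a fixed time to resolve each one; the figures used (12 outages, 30 minutes for a human, 1 minute for automation) are illustrative assumptions, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_ceiling(outages_per_year, minutes_to_resolve):
    """Best achievable availability given how long each outage lasts."""
    downtime = outages_per_year * minutes_to_resolve
    return 1 - downtime / MINUTES_PER_YEAR


# Hypothetical comparison: a human responder versus automated recovery.
human = availability_ceiling(12, 30)      # roughly 99.93%
automated = availability_ceiling(12, 1)   # roughly 99.998%
```

With the same number of outages, the only thing that changed is how long each one waits for a response, and that alone moves the ceiling.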

Expect capacity changes

Systems are resized in response to external requirements and constraints. Anticipate this and build for scalability. Expose the system’s current utilization to its users.
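
A small sketch of exposing current utilization: a resource pool that can be resized and that reports how much of its capacity is in use. The `WorkerPool` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    capacity: int
    in_use: int = 0

    def resize(self, new_capacity):
        """Change capacity, refusing to shrink below current usage."""
        if new_capacity < self.in_use:
            raise ValueError("cannot shrink below current usage")
        self.capacity = new_capacity

    def utilization(self):
        """Report usage in a form that users of the pool can read."""
        ratio = self.in_use / self.capacity if self.capacity else 0.0
        return {"in_use": self.in_use, "capacity": self.capacity, "ratio": ratio}
```

Publishing the ratio alongside the raw numbers lets users see how close the system is to its limit before they hit it.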

Degrade gracefully

Partial failure is better than complete failure. Preserve the utility of a system even under adverse conditions. The impact of a failure within the system should be proportional to the severity of the failure.
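
Preserving utility under adverse conditions often means falling back to a reduced answer instead of returning an error. In the sketch below, `fetch_live` stands in for a dependency that may fail and `cached` for a local copy of earlier results; both names, and the function itself, are hypothetical.

```python
def recommendations_with_fallback(fetch_live, cached, user_id):
    """Return (items, degraded), preferring live data over cached data."""
    try:
        return fetch_live(user_id), False
    except Exception:
        # The dependency failed: serve the last known result, or an empty
        # list, and flag the response as degraded so callers can tell.
        return cached.get(user_id, []), True
```

The caller still receives a usable response, and the `degraded` flag keeps the partial failure visible rather than hiding it.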

Prefer simple over easy

Consider all potential users when building software, especially those not on your team. Simple tools that can be reasoned about are preferable to tools that attempt to make hard things easy via magic.

Define system boundaries

Every system has its limits in scale, performance and functionality. Design systems with clear boundaries, interfaces and expectations.