Plan for failure
Prepare every system to handle unexpected inputs. Record all system failures and report on them. When a failure occurs, halt interaction with the system rather than perpetuating the problem and producing defective results.
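One way to read this principle is as input validation that logs the failure and then stops, rather than guessing a value and carrying on. The sketch below is illustrative only: `parse_percentage` and `InvalidInputError` are hypothetical names, not part of any system described here.

```python
import logging

logger = logging.getLogger("failfast")


class InvalidInputError(ValueError):
    """Hypothetical error type: the caller should stop, not continue."""


def parse_percentage(raw):
    """Parse a percentage in [0, 100]; record and raise on anything else."""
    try:
        value = float(raw)
    except (TypeError, ValueError) as exc:
        logger.error("rejecting non-numeric input %r", raw)
        raise InvalidInputError(f"not a number: {raw!r}") from exc
    # NaN fails this comparison as well, so it is rejected here too.
    if not 0.0 <= value <= 100.0:
        logger.error("rejecting out-of-range input %r", raw)
        raise InvalidInputError(f"out of range: {raw!r}")
    return value
```

The function never returns a substitute value for bad input, so a defective result cannot flow downstream unnoticed.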
Approach change mindfully
Modifying a system can have unintended consequences, so every change is a potential risk. Roll changes out progressively, monitor them closely and communicate them. Always have a backup plan.
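A progressive rollout can be sketched as a deterministic percentage gate: each user hashes into one of 100 buckets, and the change is enabled only for buckets below the current percentage. The function name, the feature label and the bucket count are assumptions made for illustration.

```python
import hashlib


def in_rollout(user_id, feature, percent):
    """Return True if this user falls inside the first `percent` buckets.

    Hypothetical helper: hashing feature and user together keeps the
    assignment stable between calls and independent across features.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent
```

Because a user's bucket never changes, raising the percentage only adds users, and setting it back to 0 acts as the backup plan.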
Measure with purpose
The telemetry a system produces should contribute to the understanding of its overall function. Describe performance measurements in ways that are meaningful to an end-user.
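As an illustration of a user-facing measurement, the sketch below reports the share of requests served within a latency threshold instead of a raw resource metric. The function name and the 300 ms threshold in the usage note are assumptions, not figures from this document.

```python
def share_within(latencies_ms, threshold_ms):
    """Fraction of requests that completed within `threshold_ms`."""
    if not latencies_ms:
        raise ValueError("no measurements to summarize")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)
```

For example, `share_within([100, 200, 400, 250], 300)` returns `0.75`, which can be stated as "three in four requests finished within 300 ms".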
Avoid alerting humans
A system that requires human action to resolve an outage lowers its own upper bound on availability, because recovery can be no faster than a person can respond. Log events for forensic purposes and create issues for emerging problems.
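The effect on availability can be made concrete with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The figures below (one outage per 720 hours of operation, 30 minutes for a human to respond and repair, 1 minute for automated recovery) are assumptions chosen for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Assumed figures, for illustration only.
manual = availability(720.0, 0.5)         # a person is paged and fixes it
automated = availability(720.0, 1 / 60)   # the system recovers by itself
```

Under these assumptions the manual case comes out at roughly 99.93% and the automated case at roughly 99.998%; the failure rate is identical and only the recovery time differs.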
Expect capacity changes
Systems are resized in response to external requirements and constraints. Anticipate this and build for scalability. Expose the system’s current utilization to its users.
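A minimal sketch of a resizable resource that reports its own utilization might look like the following. `WorkerPool` and its methods are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Hypothetical pool whose capacity can change at runtime."""

    capacity: int
    in_use: int = 0

    def __post_init__(self):
        if self.capacity < 1:
            raise ValueError("capacity must be at least 1")

    def acquire(self):
        """Claim one slot; return False instead of exceeding capacity."""
        if self.in_use >= self.capacity:
            return False
        self.in_use += 1
        return True

    def release(self):
        """Return one slot to the pool."""
        if self.in_use > 0:
            self.in_use -= 1

    def resize(self, new_capacity):
        """Change capacity in response to external requirements."""
        if new_capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = new_capacity

    def utilization(self):
        """Current utilization; may exceed 1.0 right after a shrink."""
        return self.in_use / self.capacity
```

Publishing `utilization()` lets users of the pool see how close it is to its limit before they hit it.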
Degrade gracefully
Partial failure is better than complete failure. Preserve the utility of a system even under adverse conditions. The impact of a failure within the system should be proportional to the severity of the failure.
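Graceful degradation is often implemented as a chain of fallbacks. In the sketch below, a hypothetical recommendation lookup serves fresh results when the backend is healthy, stale cached results when it is not, and a generic default as the last resort. All names and the default list are assumptions.

```python
# Hypothetical generic answer used only when nothing better is available.
DEFAULT_RECOMMENDATIONS = ["popular-1", "popular-2"]


def recommendations(user_id, fetch_personalized, cache):
    """Return (items, quality), where quality is 'fresh', 'stale' or 'default'."""
    try:
        items = fetch_personalized(user_id)
    except (ConnectionError, TimeoutError):
        # The backend is unavailable: prefer the last known answer for this user.
        if user_id in cache:
            return cache[user_id], "stale"
        return list(DEFAULT_RECOMMENDATIONS), "default"
    cache[user_id] = items
    return items, "fresh"
```

Returning the quality label alongside the data keeps the degradation visible to callers instead of hiding it.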
Prefer simple over easy
Consider all potential users when building software, especially those not on your team. Simple tools that can be reasoned about are preferable to tools that attempt to make hard things easy via magic.
Define system boundaries
Every system has its limits in scale, performance and functionality. Design systems with clear boundaries, interfaces and expectations.
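One way to make a limit explicit is to encode it in the interface itself. The sketch below wraps the standard library's `queue.Queue` so that callers are told immediately when the stated limit is reached; `BoundedInbox` and `max_pending` are hypothetical names.

```python
import queue


class BoundedInbox:
    """Hypothetical job inbox with a published, enforced size limit."""

    def __init__(self, max_pending):
        if max_pending < 1:
            # queue.Queue treats maxsize <= 0 as unbounded, so reject it here.
            raise ValueError("max_pending must be at least 1")
        self.max_pending = max_pending
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, job):
        """Accept the job, or return False once the limit is reached."""
        try:
            self._queue.put_nowait(job)
        except queue.Full:
            return False
        return True

    def pending(self):
        """Number of jobs currently waiting."""
        return self._queue.qsize()
```

A caller that receives `False` knows it has hit the boundary and can back off, rather than discovering the limit through an outage.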