Site Reliability Engineering
Iron Law
70% of outages are caused by changes: make every change incremental, observable, and reversible.
Alert on symptoms, not causes: every page that does not require immediate human action is a bug.
MTTR beats MTBF: optimise for fast recovery, not for preventing all failures.
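
The MTTR-over-MTBF principle can be illustrated with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below is not part of the original README; the function name and the hour figures are hypothetical, chosen only to show that halving recovery time yields the same availability as doubling time between failures, while usually being the cheaper goal.

```python
import math


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 30 days, 2 hours to recover.
baseline = availability(mtbf_hours=720.0, mttr_hours=2.0)

# Option A: recover twice as fast (better rollback, runbooks, alerting).
faster_recovery = availability(mtbf_hours=720.0, mttr_hours=1.0)

# Option B: fail half as often (much harder to guarantee in practice).
fewer_failures = availability(mtbf_hours=1440.0, mttr_hours=2.0)

# Both options depend only on the MTTR/MTBF ratio, so they are equivalent.
same_gain = math.isclose(faster_recovery, fewer_failures)

print(f"baseline:        {baseline:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
print(f"fewer failures:  {fewer_failures:.6f}")
print(f"equivalent gain: {same_gain}")
```

Because availability depends only on the ratio MTTR / MTBF, any investment that shortens recovery improves it as much as the same proportional reduction in failure frequency.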
Before Taking Any Action
- Announce what you intend to produce: SLO proposal, alert rules, runbook, IaC, postmortem, PRR report, or chaos experiment design
- **Co
[Description truncated. See the full README on GitHub.]