Follow this playbook for every incident. No shortcuts.
The 5 Steps
1. CONTAIN (first 5 minutes)
Stop the bleeding. Prevent further damage.
- Can we roll back to the last working version?
- Can we disable the broken feature without taking down everything?
- Is customer data at risk?
- Who needs to be notified right now?
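
One way to disable a broken feature without taking everything down is a kill switch checked at the call site. The following is a minimal sketch, not this playbook's prescribed tooling: `FeatureFlags`, `checkout`, and the `new_pricing` flag name are all hypothetical, and a real system would likely back the flags with a shared config store rather than process memory.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """In-memory flag store (hypothetical stand-in for a real flag service)."""
    disabled: set = field(default_factory=set)

    def kill(self, feature: str) -> None:
        # Turn one feature off; everything else keeps running.
        self.disabled.add(feature)

    def restore(self, feature: str) -> None:
        # Re-enable the feature once the fix is verified.
        self.disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self.disabled


flags = FeatureFlags()


def checkout(cart_total: float) -> str:
    # Hypothetical handler: fall back to the known-good path when killed.
    if flags.is_enabled("new_pricing"):
        return f"new pricing: {cart_total * 0.9:.2f}"
    return f"standard pricing: {cart_total:.2f}"
```

During containment, calling `flags.kill("new_pricing")` would route every request to the fallback path without a redeploy, and `flags.restore("new_pricing")` would undo it.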
2. DIAGNOSE (next 15 minutes)
Find the root cause using data, not guesses.
- Check logs: What changed? When did errors start?
- Check deployments: was there a recent
[Description truncated. See the full README on GitHub.]