When this skill is activated, always start your first response with the 🧢 emoji.
Incident Management
Incident management is the structured practice of detecting, responding to, resolving, and learning from production failures. It spans the full incident lifecycle - from the moment an alert fires through war room coordination, customer communication via status pages, and the post-mortem that prevents recurrence. This skill provides actionable frameworks for each phase: on-call rotation design, runbook authoring, severity classification, war room protocols, status page communication, and blameless post-mortems. Built for engineering teams that want to move from chaotic firefighting to repeatable, calm incident response.
When to use this skill
Trigger this skill when the user:
- Needs to design or improve an on-call rotation or escalation policy
- Wants to write, review, or templatize a runbook for an alert or service
- Is conducting, writing, or facilitating a post-mortem / post-incident review
- Needs to set up or improve a status page and customer communication strategy
- Is running or setting up a war room for an active incident
- Wants to define severity levels or incident classification criteria
- Needs an incident commander playbook or role definitions
- Is building incident response tooling or automation
Do NOT trigger this skill for:
- Defining SLOs, SLIs, or error budgets without an incident context (use site-reliability skill)
- Infrastructure provisioning or deployment pipeline design (use CI/CD or cloud skills)
Key principles
- Incidents are system failures, not people failures - Every incident reflects a gap in the system: missing automation, insufficient monitoring, unclear runbooks, or architectural fragility. Blaming individuals guarantees that problems get hidden instead of fixed. Design every process around surfacing systemic issues.
- Preparation beats reaction - The quality of incident response is determined before the incident starts. Well-written runbooks, practiced war room protocols, pre-drafted status page templates, and clearly defined roles reduce mean-time-to-resolve far more than heroic debugging during the incident.
- Communication is a first-class concern - Customers, stakeholders, and other engineering teams need timely, honest updates. A status page update every 30 minutes during an outage builds trust. Silence destroys it. Assign a dedicated communications role in every major incident.
- Every incident must produce learning - An incident without a post-mortem is a wasted failure. The post-mortem is not paperwork - it is the mechanism that converts a bad experience into a durable improvement. Action items without owners and deadlines are wishes, not commitments.
- On-call must be sustainable - Unsustainable on-call leads to burnout, attrition, and slower incident response. Track on-call load metrics, enforce rest periods, and treat excessive paging as a reliability problem to fix, not a cost of doing business.
Core concepts
Incident lifecycle
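Detection -> Triage -> Response -> Resolution -> Post-mortem -> Prevention
| Phase | What happens |
|---|---|
| Detection | Alerts fire |
| Triage | Severity assigned |
| Response | War room stands up |
| Resolution | Fix/rollback deployed |
| Post-mortem | Review + learn |
| Prevention | Action items tracked |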
Every phase has a defined owner, a set of artifacts, and a handoff to the next phase. Gaps between phases - especially between resolution and post-mortem - are where learning gets lost.
Incident roles
| Role | Responsibility | When assigned |
|---|---|---|
| Incident Commander (IC) | Owns the response, delegates work, makes decisions | SEV1/SEV2 immediately |
| Communications Lead | Updates status page, stakeholders, and support teams | SEV1/SEV2 immediately |
| Technical Lead | Drives root cause investigation and fix implementation | All severities |
| Scribe | Maintains the incident timeline in real-time | SEV1; optional for SEV2 |
Role assignment rule: For SEV1, all four roles must be filled within 15 minutes. For SEV2, IC and Technical Lead are mandatory. For SEV3 and SEV4, the on-call engineer handles all roles.
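The role assignment rule can be checked mechanically. The following is a minimal sketch, not part of any real incident tool: the function names `mandatory_roles` and `unfilled_roles` and the integer severity encoding (1 to 4) are assumptions made for illustration.

```python
from __future__ import annotations

# Role names taken from the table above.
ALL_ROLES = (
    "Incident Commander",
    "Communications Lead",
    "Technical Lead",
    "Scribe",
)


def mandatory_roles(severity: int) -> tuple[str, ...]:
    """Return the roles that must be explicitly staffed for a severity (1-4)."""
    if severity == 1:
        return ALL_ROLES
    if severity == 2:
        return ("Incident Commander", "Technical Lead")
    if severity in (3, 4):
        # The on-call engineer covers every role, so nothing extra is mandatory.
        return ()
    raise ValueError(f"unknown severity: {severity}")


def unfilled_roles(severity: int, assignments: dict[str, str]) -> list[str]:
    """List mandatory roles that have no person assigned yet."""
    return [role for role in mandatory_roles(severity) if not assignments.get(role)]


if __name__ == "__main__":
    # Hypothetical SEV1 where only the IC has been named so far.
    print(unfilled_roles(1, {"Incident Commander": "ana"}))
```

A check like this could run at the 15-minute mark of a SEV1 to remind the IC which roles are still open.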
Severity classification
| Severity | Customer impact | Response time | War room | Status page |
|---|---|---|---|---|
| SEV1 | Complete outage or data loss | Page immediately, 5-min ack | Required | Required |
| SEV2 | Degraded core functionality | Page on-call, 15-min ack | Recommended | Required |
| SEV3 | Minor degradation, workaround exists | Next business day | No | Optional |
| SEV4 | Cosmetic or internal-only | Backlog | No | No |
Escalation rule: If a SEV2 is not mitigated within 60 minutes, escalate to SEV1 procedures. If the on-call engineer cannot classify severity within 10 minutes, default to SEV2 until more information is available.
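The escalation rule above can be expressed as a small function. This is a sketch under stated assumptions: both timers are measured from the moment the incident was opened, "within N minutes" is treated as expiring at exactly N minutes, and the name `effective_severity` is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

SEV2_MITIGATION_LIMIT = timedelta(minutes=60)  # from the escalation rule
CLASSIFICATION_LIMIT = timedelta(minutes=10)   # from the escalation rule


def effective_severity(
    declared: Optional[int],
    opened_at: datetime,
    now: datetime,
    mitigated: bool,
) -> Optional[int]:
    """Apply the escalation rule; `declared` is None while unclassified."""
    elapsed = now - opened_at
    if declared is None:
        if elapsed < CLASSIFICATION_LIMIT:
            return None  # still inside the classification window
        declared = 2  # default to SEV2 until more information is available
    if declared == 2 and not mitigated and elapsed >= SEV2_MITIGATION_LIMIT:
        return 1  # unmitigated SEV2 is handled with SEV1 procedures
    return declared


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0)
    print(effective_severity(2, opened, opened + timedelta(minutes=75), mitigated=False))
```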
Common tasks
Design an on-call rotation
Rotation structure:
- Primary on-call: First responder. Acks within 5 min (SEV1) or 15 min (SEV2).
- Secondary on-call: Backup if primary misses ack window. Auto-escalated by pager.
- Manager escalation: If both primary and secondary miss ack. Also for SEV1 war rooms.
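The escalation chain above can be sketched as a lookup on how long a page has gone unacknowledged. The assumptions here are that the secondary gets the same ack window as the primary and that a window counts as missed at exactly its limit; `current_page_target` is a hypothetical name, not a pager vendor API.

```python
# Ack windows in minutes, from the rotation structure above.
ACK_WINDOW_MIN = {1: 5, 2: 15}
CHAIN = ("primary", "secondary", "manager")


def current_page_target(severity: int, minutes_unacked: float) -> str:
    """Return who should be holding the page after `minutes_unacked` minutes."""
    if severity not in ACK_WINDOW_MIN:
        raise ValueError("only SEV1 and SEV2 are paged")
    if minutes_unacked < 0:
        raise ValueError("minutes_unacked must be non-negative")
    # Each missed ack window moves the page one step down the chain.
    missed_windows = int(minutes_unacked // ACK_WINDOW_MIN[severity])
    return CHAIN[min(missed_windows, len(CHAIN) - 1)]


if __name__ == "__main__":
    for minutes in (3, 5, 12):
        print(minutes, current_page_target(1, minutes))
```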
Scheduling guidelines:
- Rotate weekly. Never assign the same person to two consecutive weeks.
- Minimum team size for sustainable on-call: 5 engineers (allows 1-in-5 rotation).
- Follow-the-sun for distributed teams: hand off to the next timezone instead of paging at 3am. Each region covers business hours + 2 hours buffer.
- Provide comp time or additional pay for after-hours pages. Track and review quarterly.
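The weekly rotation and minimum team size guidelines can be illustrated with a simple round-robin scheduler. This is a minimal sketch that ignores time zones, vacations, and swaps; `build_rotation` and `has_back_to_back` are hypothetical helper names.

```python
from __future__ import annotations

from datetime import date, timedelta

MIN_TEAM_SIZE = 5  # allows a 1-in-5 rotation, per the guideline above


def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Assign one engineer per week, cycling through the team in order."""
    if len(set(engineers)) != len(engineers):
        raise ValueError("engineer names must be unique")
    if len(engineers) < MIN_TEAM_SIZE:
        raise ValueError(f"need at least {MIN_TEAM_SIZE} engineers")
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


def has_back_to_back(schedule: list[tuple[date, str]]) -> bool:
    """True if anyone is on call for two consecutive weeks."""
    return any(a[1] == b[1] for a, b in zip(schedule, schedule[1:]))


if __name__ == "__main__":
    team = ["ana", "bo", "chen", "dee", "eli"]  # hypothetical names
    for week_start, engineer in build_rotation(team, date(2024, 1, 1), 7):
        print(week_start.isoformat(), engineer)
```

`has_back_to_back` is also useful as a guard after manual swaps, since those are the usual way consecutive weeks slip in.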
On-call health metrics:
| Metric | Healthy | Unhealthy |
|---|---|---|
| Pages per on-call week | < 5 | > 10 |
| After-hours pages per week | < 2 | > 5 |
| Mean time-to-ack (SEV1) | < 5 min | > 15 min |
| Mean time-to-ack (SEV2) | < 15 min | > 30 min |
| Percentage of pages with runbooks | > 80% | < 50% |
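The thresholds in the table can be encoded so a quarterly review script can rate each metric. This is a sketch, assuming a third "watch" label for values that fall between the healthy and unhealthy bounds (the table does not name that zone); the metric keys are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    healthy: float
    unhealthy: float
    higher_is_better: bool = False


# Values copied from the on-call health metrics table.
THRESHOLDS = {
    "pages_per_week": Threshold(5, 10),
    "after_hours_pages_per_week": Threshold(2, 5),
    "mean_ack_min_sev1": Threshold(5, 15),
    "mean_ack_min_sev2": Threshold(15, 30),
    "pct_pages_with_runbook": Threshold(80, 50, higher_is_better=True),
}


def rate(metric: str, value: float) -> str:
    """Return 'healthy', 'unhealthy', or 'watch' (assumed in-between label)."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value > t.healthy:
            return "healthy"
        if value < t.unhealthy:
            return "unhealthy"
    else:
        if value < t.healthy:
            return "healthy"
        if value > t.unhealthy:
            return "unhealthy"
    return "watch"


if __name__ == "__main__":
    print(rate("pages_per_week", 12))
    print(rate("pct_pages_with_runbook", 65))
```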
Write a runbook
Every runbook must contain these sections:
Title: [Alert name] - [Service name] Runbook
Last updated: [date]
Owner: [team or individual]
1. SYMPTOM
   What the alert tells you. Quote the alert condition verbatim.
2. IMPACT
   Who is affected. Severity level. Business impact in plain language.
3. INVESTIGATION STEPS
   Numbered steps. Each step has:
   - What to check (command, dashboard link, or query)
   - What a normal result looks like
   - What an abnormal result means and what to do next
4. MITIGATION STEPS
   Numbered steps to stop the bleeding. Prioritize speed over elegance.
   Include rollback commands, feature flag toggles, and traffic shift procedures.
5. ESCALATION
   Who to contact if steps 3-4 do not resolve the issue within [N] minutes.
   Include name, team, and pager handle.
6. CONTEXT
   Links to: service architecture doc, relevant dashboards, past incidents, and the service's on-call schedule.
Runbook quality test: A new team member who has never seen this service should be able to follow the runbook and either resolve the issue or escalate correctly within 30 minutes.
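The required sections lend themselves to a simple lint check, for example in CI for a runbook repository. This is a sketch that assumes runbooks are plain text using the exact header and section names from the template above; `missing_parts` is a hypothetical helper, and it checks structure only, not whether the content passes the quality test.

```python
from __future__ import annotations

REQUIRED_HEADERS = ("Title:", "Last updated:", "Owner:")
REQUIRED_SECTIONS = (
    "SYMPTOM",
    "IMPACT",
    "INVESTIGATION STEPS",
    "MITIGATION STEPS",
    "ESCALATION",
    "CONTEXT",
)


def missing_parts(runbook_text: str) -> list[str]:
    """Return the header fields and numbered sections absent from a runbook."""
    # Collapse runs of whitespace so indentation does not affect matching.
    lines = [" ".join(line.split()) for line in runbook_text.splitlines()]
    missing = [
        header
        for header in REQUIRED_HEADERS
        if not any(line.startswith(header) for line in lines)
    ]
    for number, name in enumerate(REQUIRED_SECTIONS, start=1):
        heading = f"{number}. {name}"
        if heading not in lines:
            missing.append(heading)
    return missing


if __name__ == "__main__":
    draft = "Title: HighLatency - checkout Runbook\n1. SYMPTOM\n2. IMPACT\n"
    print(missing_parts(draft))
```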
Conduct a post-mortem
When to hold one: Every SEV1. Every SEV2 with customer impact. Any incident consuming more than 4 hours of engineering time. Recurring SEV3s from the same cause.
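These criteria reduce to a short predicate. The sketch below assumes "recurring" means at least one earlier SEV3 with the same cause, and that engineering time is tracked in hours; the function and parameter names are hypothetical.

```python
def needs_post_mortem(
    severity: int,
    customer_impact: bool,
    engineering_hours: float,
    prior_same_cause_sev3s: int = 0,
) -> bool:
    """Apply the post-mortem trigger criteria to one incident."""
    if severity == 1:
        return True
    if severity == 2 and customer_impact:
        return True
    if engineering_hours > 4:
        return True
    # Assumption: one earlier SEV3 with the same cause counts as "recurring".
    if severity == 3 and prior_same_cause_sev3s >= 1:
        return True
    return False


if __name__ == "__main__":
    print(needs_post_mortem(2, customer_impact=False, engineering_hours=6))
```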
Timeline:
Hour 0: Incident resolved. IC assigns post-mortem owner.
Day 1: Owner drafts timeline and initial analysis.
Day 2-3: Facilitated post-mortem meeting (60-90 minutes).
Day 3-4: Draft published for 24-hour review period.
Day 5: Final version published. Action items entered in tracker.
Day 30: Action item review - are they done?
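The timeline can be turned into concrete due dates when the incident is resolved. This is a sketch, assuming calendar days rather than business days and taking the later day of each range as the deadline; `post_mortem_deadlines` is a hypothetical name.

```python
from __future__ import annotations

from datetime import date, timedelta

# (day offset from resolution, milestone), using the latest day of each range.
MILESTONES = (
    (1, "Owner drafts timeline and initial analysis"),
    (3, "Facilitated post-mortem meeting held"),
    (4, "Draft published for 24-hour review"),
    (5, "Final version published, action items in tracker"),
    (30, "Action item review"),
)


def post_mortem_deadlines(resolved_on: date) -> list[tuple[date, str]]:
    """Return (due date, milestone) pairs, counting calendar days."""
    return [(resolved_on + timedelta(days=offset), label) for offset, label in MILESTONES]


if __name__ == "__main__":
    for due, label in post_mortem_deadlines(date(2024, 3, 1)):
        print(due.isoformat(), label)
```

Teams that prefer business days would need to replace the `timedelta` arithmetic with a calendar-aware calculation.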
The five post-mortem questions:
- What happened? (factual timeline with timestamps)
- Why did it happen? (root cause analysis - use the "five whys" techniqu