Pre-mortem — fail-it-before-you-build-it

A retrospective for a project that hasn't happened yet. Surfaces real risks that "list every risk" prompts miss.

Originated in Gary Klein's research at MIT Sloan, now standard at AWS and other ops-mature orgs.

The 5-step pre-mortem

Step 1. Imagine you're 6 months in the future

The project shipped. It is a clear, public failure. There's a Reddit thread about it. The CEO is asking what went wrong.

Step 2. Write the post-mortem newspaper headline

One sentence. Concrete. Specific. Examples:

❌ Bad: "We had some quality issues."
✅ Good: "On 2026-09-12, the Stripe webhook handler deduplicated by raw body hash, so 30K customers were double-charged after Stripe retried delivery during a network blip."

The headline forces you to name the failure mode SPECIFICALLY.

Step 3. List every individual reason this exact failure happened

Brainstorm 10-15 reasons. Be specific. Each item should reference:

A real component / file
A real failure mode (race condition, schema mismatch, expired credential)
A real human factor (oncall didn't see alert, runbook was outdated)

Reject hand-waves like "testing was insufficient." Replace with "we didn't write a property-based test for the dedup-key collision case."

Step 4. Rank by likelihood × severity

For each cause, score:

Likelihood: 1-5 (1=once-in-a-decade, 5=monthly)
Severity: 1-5 (1=cosmetic, 5=data loss / regulatory breach)
Risk score: likelihood × severity

Top 3 by risk score → these are your highest-priority mitigations.

Step 4b. Classify risks — Tigers / Paper Tigers / Elephants

After scoring, classify each risk into one of three types:

🐯 Tigers — Real problems you personally believe could derail the project

Based on evidence, past experience, or clear logic
Should keep you awake at night
Require concrete action
Classify each Tiger by urgency:
- Launch-Blocking: Must be resolved before shipping (broken core feature, regulatory blocker, data integrity risk)
- Fast-Follow: Must be resolved within 30 days post-launch (performance issues, secondary features)
- Track: Monitor post-launch, fix if it becomes an issue (edge cases, nice-to-haves)

📄 Paper Tigers — Concerns others might raise that you don't believe are real risks

Valid-sounding on the surface but unlikely or overblown
Not worth significant resource investment
Worth documenting to align stakeholders and avoid repeated debates
For each: explain WHY you don't believe it's a real risk

🐘 Elephants — Things the team knows about but isn't discussing openly

Uncomfortable concerns: technical debt, team tension, unrealistic timeline, design that nobody likes
Uncertain — you're not sure if it's a problem, but nobody is investigating
Deserve explicit surfacing before launch — silent elephants become Tigers post-launch

Step 5. For each top-3 cause, write a guardrail in the plan

Each guardrail is a concrete change to the plan:

A test that would have caught it
A circuit breaker / feature flag
A runbook entry
A monitoring alert with specific SLO

If a top-3 cause CANNOT be mitigated within the time/budget, escalate to the user: "This plan accepts the risk of X with no mitigation."

Template — add to PLAN-*.md

## Pre-mortem

Six months from now, this project failed. Headline:

> <one-sentence failure headline>

### Top reasons (likelihood × severity)

| Cause | L | S | Risk | Mitigation in plan |
|---|---|---|---|---|
| <specific cause> | 4 | 5 | 20 | <Task #N: write idempotency test> |
| ... | | | | |

### 🐯 Tigers (real risks — require action)

| Tiger | Classification | Mitigation | Owner | Due |
|-------|---------------|-----------|-------|-----|
| <risk> | Launch-Blocking | <concrete action> | <team/person> | <date> |
| <risk> | Fast-Follow | <concrete action> | <team/person> | <date> |
| <risk> | Track | <monitoring approach> | <owner> | post-launch |

### 📄 Paper Tigers (overblown — document to align stakeholders)

- **<concern>**: Not a real risk because <reason>. If <condition> changes, revisit.

### 🐘 Elephants (unspoken — needs open discussion)

- **<concern>**: Nobody is talking about this. Suggested conversation: "<how to raise it>".

### Accepted risks (no mitigation)

- <risk> — accepted because <budget/scope reason>. Owner: <name>.

Common failure modes by archetype

Quick start — most-common pre-mortem causes per archetype:

Archetype	Common failure
fintech / commerce	Idempotency-key collision; double-charge during retry storm
healthcare	PHI leak via debug log; BAA not signed with vendor
web3	Oracle staleness; flash-loan exploit on bonding curve
mlops	Training/serving skew; model drift undetected
iot-embedded	OTA bricks devices in a region with no recovery path
data-platform	Late-arriving data overwrites correct values
ai-system / agent-product	Prompt injection exfiltrates other users' data
enterprise-saas	Cross-tenant data leak via RLS gap
cli-tool	Destructive flag with no confirmation (rm -rf equivalent)
library	Breaking change in minor version bump

Anti-patterns in pre-mortems

❌ Vague risks. "Performance might be a problem." Be specific: which operation, at what load, what's the SLO.

❌ Cosmic risks. "AWS could go down." Yes, but that's not actionable. Focus on what you can mitigate.

❌ Defensive list. Listing risks you've already mitigated to look thorough. Only list risks the current plan does NOT yet address.

❌ Skip the headline. Without the headline, the team won't believe the failure scenario is real.

When to skip

nano project_size — pre-mortem is overhead.
Pure refactor with full test coverage — guardrails already exist.
Bug-fix with one-line repro — risk is well-bounded.

pre-mortem

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

webapp-testing

brand-guidelines

frontend-design

web-artifacts-builder

Recibe nuevas skills de Design e Frontend todos los lunes