Create or Update Runbook
Operating Principles
- YAGNI applies to runbooks themselves. Apply the evidence-based YAGNI rule from ../../references/yagni-rule.md. A runbook is worth writing only when the scenario is grounded in something real: an alert that has actually fired, a documented incident, a recurring task that exists, or a known failure mode on a service that receives production traffic. Runbooks for hypothetical alerts, "best practice says we should have one," or "we'll need this someday" are YAGNI candidates and the runbook should be deferred until the scenario actually occurs. The canonical anti-pattern from project history: Sentry runbooks for staging-only Sentry where data isn't reaching production — alerts that will never fire because no signal flows. The user always wins; the rule's job is to make the cost of speculative runbooks visible.
- The companion evidence rule applies to the runbook's supporting evidence. Apply the evidence rule from ../../references/evidence-rule.md to the citations that ground the scenario: name the trust class of each piece of evidence (alert history, incident report, on-call rotation pattern); cite the actual artifact (dashboard URL, ticket ID, log query) rather than paraphrased recollection; and surface single-source claims as such rather than presenting them as settled.
- One runbook per invocation. The skill produces a single runbook file. Multi-runbook batches conflate scope; rerun the skill per scenario.
- Imperative commands with expected output. The template requires every step to show the exact command and what success looks like. Prose paragraphs in place of commands are an authoring failure the skill prompts against.
- Staleness is the failure mode. The template requires owner, last-validated, last-edited, and a change-history entry so decay is visible rather than hidden. The skill does not enforce a review cadence — that is a team-level workflow concern — but the metadata fields make the cadence auditable.
Project Context
- Git user: !`git config user.name` (!`git config user.email`)
- OS username: !`whoami`
- Today's date: !`date +%Y-%m-%d`
- CLAUDE.md: !`find . -maxdepth 1 -name "CLAUDE.md" -type f`
- project-discovery.md: !`find . -maxdepth 3 -name "project-discovery.md" -type f`
Step 1: Determine Mode
Determine which mode to operate in based on the user's request:
| Mode | When | Then |
|---|---|---|
| Creating new | Drafting a runbook for a scenario the project does not yet have one for | → Step 2 |
| Updating existing | Modifying an existing runbook (new step, validation date refresh, escalation change) | Read the existing runbook → Step 4 |
| Validating existing | User says they ran the procedure end-to-end and wants to refresh Last validated and add a change-history entry | Read the existing runbook → Step 4 (update mode, validation entry only) |
Step 2: Apply the YAGNI Preflight
Before discovering structure or gathering context, gate the work. Ask the user (or confirm from their request) which of the following describes the scenario:
- An alert that has actually fired — name the alert, link the firing incident or alert manager record.
- A documented incident or post-mortem — link it.
- A recurring scheduled task that the team performs (weekly index rebuild, monthly cert rotation, etc.) — name the cadence and where the schedule lives.
- A live failure mode on a service that receives production traffic, where the failure has occurred or is expected to occur with current measured pressure — name the service and the failure mode.
- Customer report or stakeholder commitment requiring this procedure to be documented now — link it.
If none of these applies, recommend deferring the runbook. Surface the recommendation to the user with the trigger that would justify revisiting:
"I don't see a current trigger forcing this runbook. Per the project's YAGNI rule, runbooks for alerts that have never fired are an anti-pattern. Recommend deferring until {trigger — first alert fires, first occurrence of the failure mode, first run of the recurring task, customer commitment lands}. Override and proceed anyway?"
The user always wins. If they override, record the override in the runbook's Origin field as "override: written preventively at user request on {date} — {reason}" so future readers can see the runbook was written without standard evidence.
If the scenario does pass the preflight, capture the evidence — the user will be asked again at Step 4 to drop the link or reference into the runbook's Origin metadata field.
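The gate in this step can be sketched as two small shell functions. This is a minimal illustration and not part of the skill itself: the function names and the trigger keywords (`fired-alert`, `incident`, `recurring-task`, `live-failure-mode`, `commitment`) are hypothetical labels for the five evidence categories, and the reason passed at the end is a placeholder.

```shell
# Hypothetical helper: map an evidence category to a preflight decision.
# The keywords are illustrative labels for the five categories listed above.
preflight_decision() {
  case "$1" in
    fired-alert|incident|recurring-task|live-failure-mode|commitment)
      echo "proceed" ;;
    *)
      # No grounded trigger: recommend deferring (the user may still override).
      echo "defer" ;;
  esac
}

# Hypothetical helper: build the Origin value recorded on a user override.
# $1 = date (YYYY-MM-DD), $2 = reason given by the user.
override_origin() {
  printf 'override: written preventively at user request on %s — %s\n' "$1" "$2"
}

preflight_decision "fired-alert"       # prints "proceed"
preflight_decision "someday-maybe"     # prints "defer"
override_origin "$(date +%Y-%m-%d)" "example reason supplied by the user"
```

A "defer" result is only a recommendation. The override path still produces a runbook, with the Origin string making the missing evidence visible to later readers.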
Step 3: Discover Project Structure
- Resolve project config. Read CLAUDE.md's `## Project Discovery` section for documented runbook and docs directories. Fall back to `project-discovery.md`. Fall back to Glob defaults (`docs/runbooks/`, `runbooks/`, `docs/`). Continue without any keys that remain unfound.
- Determine the runbooks directory. Use the runbooks directory if found; otherwise use `{docs-dir}/runbooks/` if a docs directory was found; otherwise default to `docs/runbooks/`. Run `mkdir -p` on the resolved directory to ensure it exists.
- Enumerate existing runbooks. Use Glob to find existing `.md` files in the runbooks directory and any service subdirectories. Read filenames to detect whether the project organizes runbooks flat (`docs/runbooks/{scenario}.md`), per-service (`docs/runbooks/{service}/{scenario}.md`), or alert-keyed (`docs/runbooks/alerts/{AlertName}.md`).
- Resolve author information. If git user or email is empty in the project context above, ask the user for their name and email.
- Check existing runbook format. If existing runbooks were found, read one to understand the project's format. If it differs from runbook-template.md, ask the user whether to match the existing format or use this skill's template. Default to matching the existing format when the project already has more than two runbooks, since consistency is the larger value.
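The directory fallback and the layout detection described in this step can be sketched in shell. The function names and the example paths are hypothetical. The sketch assumes the configured directories arrive as plain strings (possibly empty) and that runbook paths are given relative to the runbooks directory.

```shell
# Hypothetical helper: choose the runbooks directory using the fallback order above.
# $1 = runbooks dir from project config (may be empty)
# $2 = docs dir from project config (may be empty)
resolve_runbooks_dir() {
  if [ -n "$1" ]; then
    printf '%s\n' "${1%/}"            # configured runbooks directory wins
  elif [ -n "$2" ]; then
    printf '%s\n' "${2%/}/runbooks"   # else {docs-dir}/runbooks
  else
    printf '%s\n' "docs/runbooks"     # else the default
  fi
}

# Hypothetical helper: classify one runbook path, relative to the runbooks directory.
# Order matters: in a case pattern, "*" also matches "/", so the most specific
# patterns come first.
classify_layout() {
  case "$1" in
    alerts/*.md) echo "alert-keyed" ;;
    */*.md)      echo "per-service" ;;
    *.md)        echo "flat" ;;
    *)           echo "unknown" ;;
  esac
}

dir=$(resolve_runbooks_dir "" "docs")
echo "$dir"                                 # prints "docs/runbooks"
classify_layout "alerts/HighErrorRate.md"   # prints "alert-keyed"
classify_layout "payments/db-failover.md"   # prints "per-service"
# mkdir -p "$dir"   # then ensure the directory exists, as the step describes
```

A service that happens to be named `alerts` would be misclassified by this sketch, so a real check would likely look at more than one file before deciding.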
Step 4: Gather Context
From the arguments, conversation, and YAGNI preflight in Step 2, capture:
- Title — the symptom-first title per the template's title rule. Lead with the observable failure or operation, not the system name. Good: `Postgres primary unreachable: connections time out`. Bad: `Database failover`.
- Severity — the org's severity scheme. If the alert uses a different name (P1/P2), record both.
- Triggers — the alert name (with link to alert definition or monitoring), the schedule, the upstream runbook, or "manual".
- Reversibility — yes, partial, no — wait it out, no — data loss possible. This sets the front-door signal so the engineer knows before they commit whether they can back out.
- Origin — the link or reference captured in Step 2. Required.
- Owner — team or person paged at 2am for this runbook's freshness.
- Prerequisites — access groups, VPN, kubectl context, CLI tools with minimum versions, on-call privileges. "None — workstation only" is a valid answer; blank is not.
- Symptoms — what the engineer sees that brings them to this runbook.
- The procedure — for each step, the exact command (or non-command action), what success looks like, and what to do if the output differs. Use imperative voice.
- Verification — how to confirm the original symptom is gone (separate from per-step expected output).
- Escalation — for each escalation step, the condition (time-box or specific failure), the recipient, and the channel (PagerDuty service, Slack room, phone).
- Rollback — how to undo the fix, or the explicit alternative if rollback is not possible.
If any of these are unclear, use AskUserQuestion to clarify before writing. Ask only for what is genuinely missing; do not re-ask for values present in the user's request.
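One way to decide what is "genuinely missing" is to list the captured fields and report only those that are still blank. The sketch below is an assumption about how that check might look, not the skill's mechanism. The function name and the `name=value` argument convention are hypothetical, and the field values are placeholders.

```shell
# Hypothetical helper: print the names of fields whose value is blank.
# Each argument is a "name=value" pair; whitespace-only values count as blank.
missing_fields() {
  for pair in "$@"; do
    name=${pair%%=*}    # text before the first "="
    value=${pair#*=}    # text after the first "="
    if [ -z "$(printf '%s' "$value" | tr -d '[:space:]')" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Placeholder values: Origin and Rollback are still blank here.
missing_fields \
  "Title=Postgres primary unreachable: connections time out" \
  "Origin=" \
  "Prerequisites=None — workstation only" \
  "Rollback=   "
# prints:
# Origin
# Rollback
```

Only the names printed would need a follow-up question. Anything already supplied in the user's request is left alone.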
When the user gives you a