knowledge-ops
Company SOP + internal runbook authoring, 5W2H completeness validation, and KB hygiene reporting for Head-of-Ops / Knowledge-Manager / TPM-Internal personas.
Purpose
An ops organization three years in accumulates a sprawl: 600 Notion pages, 200 Confluence runbooks, three Obsidian vaults, a Drive/SOPs/ folder, and a Slack #ops-questions channel that exists because nobody can find the canonical doc. Predictable failure modes:
- No owner — 40% of SOPs name "the team" instead of a person. When the doc rots, nobody is accountable.
- No last-reviewed date — a 2023 vendor-offboarding SOP still references a procurement tool sunset in 2024.
- Vague success signals — runbook step 4 says "verify the service is up". A new operator can't tell what that means.
- No rollback path — incident-comms cascade runbook tells you how to send the alert. It doesn't tell you how to retract it when the alert was wrong.
- Orphan pages — half the KB has no inbound links. Nobody finds them via navigation; they only exist because somebody knew the URL.
- Glossary drift — "CSM" means Customer Success Manager in three docs and Customer Solutions Manager in five. New hires guess wrong for six months.
- Happy-path-only SOPs — the doc covers what happens when everything works. It doesn't cover the 30% case where it doesn't.
This skill answers the operator's actual question: "Which 20 docs do I fix first, and what specifically is wrong with each?" — with deterministic logic, not intuition.
When to use
- Authoring a new SOP for a cross-functional company process (procurement intake, vendor offboarding, incident-comms cascade, employee onboarding, expense reimbursement, customer-escalation playbook, security-incident comms, system-access provisioning).
- Validating an existing internal runbook before it goes into rotation (every step must have a named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact).
- Ingesting a multi-document KB export (Notion zip, Confluence space export, Obsidian vault, Drive/SOPs/ directory) and surfacing what's broken: orphan pages, stale pages (no edit > 12 months), glossary drift, missing-owner pages, cross-link map.
- Onboarding a new ops hire by generating the SOPs and ops-handbook pages they need to read in week 1.
- Wiki cleanup sprints — quarterly hygiene work where the org decides which 30 docs to archive, rewrite, or merge.
Workflow
Four-step deterministic flow (matches the ops org's actual workflow, not an abstract process):
- Ingest KB. Run kb_ingester.py --input <vault-dir> on the existing wiki export. Output is a markdown health report: orphan pages, stale pages, glossary drift, missing-owner pages, cross-link map, prioritized cleanup list. The report ranks the top-20 docs to fix first, usually a mix of high-traffic stale docs and compliance-relevant missing-owner docs. Take this list to the cleanup sprint.
- Validate existing runbooks. For each runbook in the cleanup list (or any new runbook before it goes into rotation), run runbook_validator.py --input <runbook.md>. The validator scores each step against six checks (named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact) and produces a per-step traffic-light + overall validity score 0-100 + MUST-FIX issue list. A runbook scoring < 60 is not safe to use in an incident.
- Generate missing SOPs. For SOPs that need to be written from scratch (or rewritten because the existing one is unsalvageable), run sop_generator.py --input <metadata.json> --profile <ops|support|finance|hr|it|regulated>. Output is a 5W2H-structured SOP scaffold: Who (RACI), What (process steps), When (triggers + frequency), Where (system + tool), Why (purpose + regulatory basis), How (step-by-step), How-much (cost + time per execution). The regulated profile adds version control, signoff, and audit-trail sections (ISO 9001 / FDA 21 CFR Part 211 / SOC 2 / HIPAA).
- Cross-link + close the loop. Re-run kb_ingester.py after the cleanup sprint to verify orphan-page count is down and glossary drift is resolved. The metrics that matter are "unfindable docs" (orphans) and "unsafe runbooks" (validity score < 60), not page count.
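The four steps above can be sketched as an ordered list of command lines. This is only an illustration: the `--input` and `--profile` flags come from the workflow description, while the `workflow_commands` helper, the `python scripts/...` invocation style, and the example paths are assumptions made for this sketch.

```python
import shlex

PROFILES = {"ops", "support", "finance", "hr", "it", "regulated"}

def workflow_commands(vault_dir, runbooks, sop_metadata, profile="ops"):
    """Return the argv lists for one pass of the four-step flow, in order."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    ingest = ["python", "scripts/kb_ingester.py", "--input", vault_dir]
    commands = [ingest]                                    # step 1: ingest KB
    commands += [["python", "scripts/runbook_validator.py", "--input", rb]
                 for rb in runbooks]                       # step 2: validate
    commands += [["python", "scripts/sop_generator.py", "--input", meta,
                  "--profile", profile]
                 for meta in sop_metadata]                 # step 3: generate
    commands.append(list(ingest))                          # step 4: re-ingest
    return commands

if __name__ == "__main__":
    # Hypothetical paths, used only to show the shape of the output.
    for argv in workflow_commands("vault/", ["runbooks/incident-comms.md"],
                                  ["metadata/vendor-offboarding.json"]):
        print(shlex.join(argv))
```

The helper only builds the commands and does not execute them, so the list can be reviewed or logged before anything runs against the vault.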
Scripts
scripts/sop_generator.py — Reads a JSON metadata file describing an SOP (process owner, triggering event, audience role, frequency, regulatory overlay, inputs, outputs, steps outline) and emits a full 5W2H-structured SOP in markdown (or normalized JSON). The --profile flag tunes the output: ops (general internal ops), support (customer-support runbook style), finance (controls + reconciliation focus), hr (sensitive-data flagging), it (system + access focus), regulated (adds version control, signoff matrix, audit-trail). Regulatory overlays (SOC2, HIPAA, ISO13485, GDPR, SOX) attach the appropriate compliance preamble. --sample prints a complete vendor-offboarding SOP example. Stdlib only.
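As a rough illustration of the metadata file described above, the sketch below builds one record and checks it for completeness before writing it out. The field list follows the description, but the exact key spellings, the `missing_fields` helper, and the example values are assumptions; the real schema expected by sop_generator.py may differ.

```python
import json

# Hypothetical key names: the description lists these fields but not
# how they are spelled in the JSON file.
REQUIRED_FIELDS = (
    "process_owner", "triggering_event", "audience_role", "frequency",
    "regulatory_overlay", "inputs", "outputs", "steps_outline",
)

def missing_fields(metadata):
    """Return required fields that are absent or empty, in declaration order."""
    return [f for f in REQUIRED_FIELDS if not metadata.get(f)]

# Illustrative values for a vendor-offboarding SOP.
metadata = {
    "process_owner": "Jane Doe",   # a named person, not "the team"
    "triggering_event": "Vendor contract terminated",
    "audience_role": "Ops analyst",
    "frequency": "per event",
    "regulatory_overlay": "SOC2",
    "inputs": ["vendor record", "access inventory"],
    "outputs": ["access revoked", "offboarding ticket closed"],
    "steps_outline": ["Revoke access", "Confirm final invoice",
                      "Archive contract"],
}

if __name__ == "__main__":
    gaps = missing_fields(metadata)
    if gaps:
        raise SystemExit(f"metadata incomplete: {', '.join(gaps)}")
    print(json.dumps(metadata, indent=2))
```

Checking for gaps before generation keeps the "no owner" failure mode from being carried into a freshly written SOP.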
scripts/runbook_validator.py — Reads a runbook (markdown file or JSON) and validates each step against six required attributes: (1) named owner (not "the team", not "ops"), (2) expected duration (concrete number + unit), (3) observable success signal (e.g., "HTTP 200 from /healthz" — not "service is up"), (4) observable failure signal, (5) rollback path (or explicit "this step cannot be rolled back, escalate to X"), (6) escalation contact (named person or named on-call rotation). Output is a per-step traffic-light (GREEN/AMBER/RED), an overall validity score 0-100, and a MUST-FIX issue list. Verdict: ≥ 80 = SAFE-TO-USE, 60-79 = USE-WITH-CAUTION, < 60 = NOT-SAFE. --sample prints a deliberately-broken incident-comms runbook to demonstrate failure detection. Stdlib only.
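The six-attribute check and the verdict thresholds can be illustrated with a minimal sketch. The verdict cut-offs (80 and 60) and the rejected owner values come from the description above. Everything else is an assumption for illustration: the `Step` structure, the equal weighting of checks, the traffic-light cut-offs, and the simple presence tests, which do not judge whether a signal is truly observable.

```python
import re
from dataclasses import dataclass

CHECKS = ("owner", "duration", "success_signal", "failure_signal",
          "rollback", "escalation")
VAGUE_OWNERS = {"the team", "ops"}                     # rejected by check (1)
DURATION_RE = re.compile(r"\d+(\.\d+)?\s*[A-Za-z]+")   # number + unit

@dataclass
class Step:
    owner: str = ""
    duration: str = ""
    success_signal: str = ""
    failure_signal: str = ""
    rollback: str = ""
    escalation: str = ""

def passed_checks(step):
    """Count how many of the six attributes a step satisfies."""
    passed = 0
    for name in CHECKS:
        value = getattr(step, name).strip()
        if not value:
            continue
        if name == "owner" and value.lower() in VAGUE_OWNERS:
            continue
        if name == "duration" and not DURATION_RE.fullmatch(value):
            continue
        passed += 1
    return passed

def traffic_light(step):
    # Assumed cut-offs: all six = GREEN, four or five = AMBER, fewer = RED.
    n = passed_checks(step)
    return "GREEN" if n == 6 else "AMBER" if n >= 4 else "RED"

def validity_score(steps):
    """Assumed scoring: percentage of checks passed across all steps."""
    if not steps:
        return 0
    total = sum(passed_checks(s) for s in steps)
    return round(100 * total / (len(CHECKS) * len(steps)))

def verdict(score):
    # Thresholds as stated in the description above.
    if score >= 80:
        return "SAFE-TO-USE"
    return "USE-WITH-CAUTION" if score >= 60 else "NOT-SAFE"

good = Step("Jane Doe", "5 min", "HTTP 200 from /healthz",
            "HTTP 5xx from /healthz", "Redeploy previous build",
            "Primary on-call rotation")
bad = Step(owner="the team", success_signal="service is up")

if __name__ == "__main__":
    score = validity_score([good, bad])
    print(traffic_light(good), traffic_light(bad), score, verdict(score))
```

Under these assumptions one complete step plus one nearly empty step scores 58, which lands below the 60 threshold and yields NOT-SAFE.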
scripts/kb_ingester.py — Walks a directory of markdown files (Notion export, Confluence space export, Obsidian vault, Drive/SOPs/ directory). Extracts: (a) cross-link map (which page references which, via markdown [link](path) syntax), (b) glossary candidates (frequently used proper nouns and acronyms that recur in 3+ docs without a single canonical definition page), (c) orphan pages (no inbound links from anywhere in the vault), (d) glossary drift (the same term defined or used inconsistently across docs — e.g., "CSM" expanded differently in two places), (e) stale pages (no edit in > 12 months, detected via filesystem mtime or YAML last_reviewed frontmatter), (f) missing-owner pages (no owner: field in frontmatter). Emits a KB health report markdown with a prioritized top-20 cleanup list ranked by staleness × inbound-link-count (high-traffic stale docs first). --sample builds a tiny synthetic 8-page vault in a tmpdir and runs the full pipeline against it. Stdlib only.
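Orphan detection and the staleness × inbound-link-count ranking can be sketched as follows. This is a simplified illustration, not the script's actual implementation: it works on an in-memory mapping of vault-relative paths to page text, takes page age in days as a given input, and the function names and the 365-day stale cut-off are assumptions.

```python
import posixpath
import re
from collections import defaultdict

# Captures the target of a markdown [text](target) link, minus any #anchor.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s#]+)")

def inbound_links(pages):
    """Map each page path to the set of other pages that link to it."""
    inbound = defaultdict(set)
    for src, text in pages.items():
        base = posixpath.dirname(src)
        for target in LINK_RE.findall(text):
            if "://" in target:                  # skip external URLs
                continue
            resolved = posixpath.normpath(posixpath.join(base, target))
            if resolved in pages and resolved != src:
                inbound[resolved].add(src)
    return inbound

def orphans(pages):
    """Pages with no inbound links from anywhere in the vault."""
    inbound = inbound_links(pages)
    return sorted(p for p in pages if not inbound.get(p))

def cleanup_ranking(pages, age_days, top_n=20, stale_after_days=365):
    """Rank stale pages by age in days times inbound-link count."""
    inbound = inbound_links(pages)
    scored = [(age_days[p] * len(inbound.get(p, ())), p)
              for p in pages if age_days[p] > stale_after_days]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [p for _, p in scored[:top_n]]

# Tiny synthetic vault, for illustration only.
pages = {
    "index.md": "[Offboarding](sops/offboarding.md) [Onboarding](sops/onboarding.md)",
    "sops/offboarding.md": "See [onboarding](onboarding.md) and [site](https://example.com).",
    "sops/onboarding.md": "Back to [index](../index.md#top).",
    "sops/legacy.md": "No links here.",
}
age_days = {"index.md": 30, "sops/offboarding.md": 500,
            "sops/onboarding.md": 400, "sops/legacy.md": 900}

if __name__ == "__main__":
    print(orphans(pages))
    print(cleanup_ranking(pages, age_days))
```

Because the ranking multiplies by inbound-link count, a stale orphan scores zero and sinks to the bottom of the cleanup list, so in this sketch orphans are reported through the separate `orphans` function.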
References
references/5w2h_sop_canon.md: Kaoru Ishikawa's 5W2H method, Toyota standard-work discipline, Atul Gawande's checklist manifesto, Atlassian Confluence SOP guidance, ISO 9001 SOP requirements, ITIL v4 Service Operation, FDA 21 CFR Part 211. Eight cited sources covering SOP authoring canon.
references/runbook_canon.md: Google SRE Workbook (runbook chapter), Atlassian incident-management runbooks, PagerDuty Incident Response taxonomy, AWS Well-Architected operational excellence pillar, Charity Majors on observability-runbook integration, Susan Fowler on production-ready microservices, ITIL v4 Operations. Seven cited sources covering runbook design canon.
references/kb_hygiene_anti_patterns.md: Eight anti-patterns drawn from Notion/Confluence wiki industry research, Mozilla SUMO knowledge-base lessons, Stack Overflow community-management research, the Atlassian Team Playbook, MIT TIK org-wiki studies, Cynthia Lee on glossary drift, and Adam Wiggins on "documentation rot".
Assumptions
- The KB is in markdown (or can be exported to markdown — Notion, Confl