knowledge-ops

Company SOP + internal runbook authoring, 5W2H completeness validation, and KB hygiene reporting for Head-of-Ops / Knowledge-Manager / TPM-Internal personas.

Purpose

An ops organization three years in accumulates a sprawl: 600 Notion pages, 200 Confluence runbooks, three Obsidian vaults, a Drive/SOPs/ folder, and a Slack #ops-questions channel that exists because nobody can find the canonical doc. Predictable failure modes:

No owner — 40% of SOPs name "the team" instead of a person. When the doc rots, nobody is accountable.
No last-reviewed date — a 2023 vendor-offboarding SOP still references a procurement tool sunset in 2024.
Vague success signals — runbook step 4 says "verify the service is up". A new operator can't tell what that means.
No rollback path — incident-comms cascade runbook tells you how to send the alert. It doesn't tell you how to retract it when the alert was wrong.
Orphan pages — half the KB has no inbound links. Nobody finds them via navigation; they only exist because somebody knew the URL.
Glossary drift — "CSM" means Customer Success Manager in three docs and Customer Solutions Manager in five. New hires guess wrong for six months.
Happy-path-only SOPs — the doc covers what happens when everything works. It doesn't cover the 30% case where it doesn't.

This skill answers the operator's actual question: "Which 20 docs do I fix first, and what specifically is wrong with each?" — with deterministic logic, not intuition.

When to use

Authoring a new SOP for a cross-functional company process (procurement intake, vendor offboarding, incident-comms cascade, employee onboarding, expense reimbursement, customer-escalation playbook, security-incident comms, system-access provisioning).
Validating an existing internal runbook before it goes into rotation (every step must have a named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact).
Ingesting a multi-document KB export (Notion zip, Confluence space export, Obsidian vault, Drive/SOPs/ directory) and surfacing what's broken: orphan pages, stale pages (no edit > 12 months), glossary drift, missing-owner pages, cross-link map.
Onboarding a new ops hire by generating the SOPs and ops-handbook pages they need to read in week 1.
Wiki cleanup sprints — quarterly hygiene work where the org decides which 30 docs to archive, rewrite, or merge.

Workflow

Four-step deterministic flow (matches the ops org's actual workflow, not an abstract process):

Ingest KB. Run kb_ingester.py --input <vault-dir> on the existing wiki export. Output is a markdown health report: orphan pages, stale pages, glossary drift, missing-owner pages, cross-link map, prioritized cleanup list. The report ranks the top-20 docs to fix first — usually a mix of high-traffic stale docs and compliance-relevant missing-owner docs. Take this list to the cleanup sprint.
Validate existing runbooks. For each runbook in the cleanup list (or any new runbook before it goes into rotation), run runbook_validator.py --input <runbook.md>. The validator scores each step against six checks (named owner, expected duration, observable success signal, observable failure signal, rollback path, escalation contact) and produces a per-step traffic-light + overall validity score 0-100 + MUST-FIX issue list. A runbook scoring < 60 is not safe to use in an incident.
Generate missing SOPs. For SOPs that need to be written from scratch (or rewritten because the existing one is unsalvageable), run sop_generator.py --input <metadata.json> --profile <ops|support|finance|hr|it|regulated>. Output is a 5W2H-structured SOP scaffold: Who (RACI), What (process steps), When (triggers + frequency), Where (system + tool), Why (purpose + regulatory basis), How (step-by-step), How-much (cost + time per execution). The regulated profile adds version control, signoff, and audit-trail sections (ISO 9001 / FDA 21 CFR Part 211 / SOC 2 / HIPAA).
Cross-link + close the loop. Re-run kb_ingester.py after the cleanup sprint to verify orphan-page count is down and glossary drift is resolved. The metric that matters is "unfindable docs" (orphans) and "unsafe runbooks" (validity score < 60) — not page count.

Scripts

scripts/sop_generator.py — Reads a JSON metadata file describing an SOP (process owner, triggering event, audience role, frequency, regulatory overlay, inputs, outputs, steps outline) and emits a full 5W2H-structured SOP in markdown (or normalized JSON). The --profile flag tunes the output: ops (general internal ops), support (customer-support runbook style), finance (controls + reconciliation focus), hr (sensitive-data flagging), it (system + access focus), regulated (adds version control, signoff matrix, audit-trail). Regulatory overlays (SOC2, HIPAA, ISO13485, GDPR, SOX) attach the appropriate compliance preamble. --sample prints a complete vendor-offboarding SOP example. Stdlib only.

scripts/runbook_validator.py — Reads a runbook (markdown file or JSON) and validates each step against six required attributes: (1) named owner (not "the team", not "ops"), (2) expected duration (concrete number + unit), (3) observable success signal (e.g., "HTTP 200 from /healthz" — not "service is up"), (4) observable failure signal, (5) rollback path (or explicit "this step cannot be rolled back, escalate to X"), (6) escalation contact (named person or named on-call rotation). Output is a per-step traffic-light (GREEN/AMBER/RED), an overall validity score 0-100, and a MUST-FIX issue list. Verdict: ≥ 80 = SAFE-TO-USE, 60-79 = USE-WITH-CAUTION, < 60 = NOT-SAFE. --sample prints a deliberately-broken incident-comms runbook to demonstrate failure detection. Stdlib only.

scripts/kb_ingester.py — Walks a directory of markdown files (Notion export, Confluence space export, Obsidian vault, Drive/SOPs/ directory). Extracts: (a) cross-link map (which page references which, via markdown [link](path) syntax), (b) glossary candidates (frequently used proper nouns and acronyms that recur in 3+ docs without a single canonical definition page), (c) orphan pages (no inbound links from anywhere in the vault), (d) glossary drift (the same term defined or used inconsistently across docs — e.g., "CSM" expanded differently in two places), (e) stale pages (no edit in > 12 months, detected via filesystem mtime or YAML last_reviewed frontmatter), (f) missing-owner pages (no owner: field in frontmatter). Emits a KB health report markdown with a prioritized top-20 cleanup list ranked by staleness × inbound-link-count (high-traffic stale docs first). --sample builds a tiny synthetic 8-page vault in a tmpdir and runs the full pipeline against it. Stdlib only.

References

references/5w2h_sop_canon.md — Kaoru Ishikawa's 5W2H method, Toyota standard-work discipline, Atul Gawande's checklist manifesto, Atlassian Confluence SOP guidance, ISO 9001 SOP requirements, ITIL v4 Service Operation, FDA 21 CFR Part 211. Eight cited sources covering SOP authoring canon.
references/runbook_canon.md — Google SRE Workbook (runbook chapter), Atlassian incident-management runbooks, PagerDuty Incident Response taxonomy, AWS Well-Architected operational excellence pillar, Charity Majors on observability-runbook integration, Susan Fowler on production-ready microservices, ITIL v4 Operations. Seven cited sources covering runbook design canon.
references/kb_hygiene_anti_patterns.md — Eight anti-patterns drawn from Notion/Confluence wiki industry research, Mozilla SUMO knowledge-base lessons, Stack Overflow community-management research, the Atlassian Team Playbook, MIT TIK org-wiki studies, Cynthia Lee on glossary drift, and Adam Wiggins on "documentation rot".

Assumptions

The KB is in markdown (or can be exported to markdown — Notion, Confl

knowledge-ops

How to add

Drop this on your repo README

Related skills

learn-codebase

remove-deadcode

apify-competitor-intelligence

ad-creative

Get new Marketing skills every Monday

knowledge-ops

Purpose

When to use

Workflow

Scripts

References

Assumptions

Comments · No comments