Think-Diagnose - Abductive Reasoning About Causes
Takes a phenomenon — something that was observed and that the user wants to understand — and produces a ranked set of candidate causes with evidence-based confidence calibration. Uses abductive reasoning: inference to the best explanation. Distinct from /bug-fix (which handles code-specific diagnosis with artifact output and execution tooling); /think-diagnose is pure reasoning about causes, applicable to non-code phenomena as readily as code ones.
This skill produces no tangible artifacts. It is a consultant, not an implementer. No code, no tickets, no commits. The output is a structured diagnosis report that the user can act on by gathering more evidence, adopting a leading cause, or piping to /think-brainstorm for remediation.
Roles
Judge (you, running this skill):
- Capture the phenomenon in a written brief
- Elicit evidence, rigorously separating observation from interpretation
- Choose appropriate reasoning lenses
- Spawn diagnosticians in isolation
- Evaluate candidate causes against evidence (this skill has a real evaluative phase, unlike purely-divergent think-* skills)
- Calibrate confidence honestly and report
Diagnosticians: Each receives a specific reasoning lens and generates candidate causes (with mechanisms, predictions, refuters, and plausibility) in isolation from other diagnosticians.
Workflow
1. Receive the Phenomenon
The phenomenon may arrive as:
- Conversation context — summarize it back, confirm
- A document — read the file (incident report, data summary, observation log)
- Fresh user input — capture verbatim
Produce a written brief of the phenomenon. Precisely what is the thing to explain? Vague phenomena produce vague diagnoses.
2. Gather Evidence — Separate Observation from Interpretation
This is the most failure-prone step in the workflow, so its structure is enforced. Most bad diagnoses start by accepting interpretations as observations.
Elicit from the user, in three distinct buckets:
- Observations — concrete things that were measured, seen, or experienced. "The metric dropped 30% on March 14th." "Three customers mentioned X in surveys." "The build broke at commit abc123."
- Interpretations already held — what the user or others have already inferred from the observations. "The team thinks it's because of the migration." "We believe the drop is due to seasonality." Flag these explicitly so diagnosticians know not to accept them as given.
- Unavailable / unknown evidence — what's unknown, wasn't measured, or can't be retrieved. "We don't have per-user data before April." "We didn't log the old config."
Push back on smuggled interpretations. If the user says "the metric dropped because of the migration," that's two claims: (a) the metric dropped (observation) and (b) the migration caused it (interpretation). Separate them before proceeding.
Three to six clarifying questions are typically enough to establish this split. Stop once you have enough material for the diagnosticians to work with.
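The three buckets and the "smuggled interpretation" split can be sketched as a small data structure. This is a minimal illustration, not part of the skill itself: `EvidenceBrief`, `split_smuggled`, and the `CAUSAL_MARKERS` list are hypothetical names, and the string heuristic only flags candidates for the Judge to confirm with the user.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvidenceBrief:
    """Evidence for one phenomenon, kept in three separate buckets."""
    phenomenon: str
    observations: list[str] = field(default_factory=list)
    interpretations: list[str] = field(default_factory=list)  # flagged, never accepted as given
    unavailable: list[str] = field(default_factory=list)


# Assumed marker phrases (illustrative only). Longer phrases come first so
# "because of" is matched before the bare "because".
CAUSAL_MARKERS = ("because of", "caused by", "due to", "because")


def split_smuggled(statement: str) -> tuple[str, str | None]:
    """Split 'X because of Y' into (observation, interpretation).

    Returns (statement, None) when no causal marker is found. This is a crude
    heuristic; it does not replace asking the user to confirm the split.
    """
    lowered = statement.lower()
    for marker in CAUSAL_MARKERS:
        idx = lowered.find(f" {marker} ")
        if idx != -1:
            observation = statement[:idx].strip()
            cause = statement[idx + len(marker) + 2:].strip().rstrip(".")
            return observation, f"{cause} caused it"
    return statement.strip(), None


brief = EvidenceBrief(phenomenon="metric drop")
observation, interpretation = split_smuggled("the metric dropped because of the migration")
brief.observations.append(observation)        # "the metric dropped"
if interpretation is not None:
    brief.interpretations.append(interpretation)  # "the migration caused it"
```

Keeping interpretations in their own list makes it hard to pass them to diagnosticians as if they were observations.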
3. Choose Reasoning Lenses
Select 3-6 lenses from the palette based on the phenomenon's shape.
Available lenses:
- technical — engineering-level causes (code, infra, config, capacity, dependencies)
- human-factors — people, skills, fatigue, turnover, miscommunication, team dynamics
- process — broken or missing process, handoffs, approvals, ownership, rituals
- incentive-structure — the system rewards the behavior we're diagnosing (Goodhart territory)
- environmental — external factors (market, regulation, customer mix, vendor, upstream)
- temporal — something changed in time that correlates with the phenomenon
- measurement-artifact — the phenomenon isn't real, it's a metric/instrumentation issue
- statistical — base rates, regression to mean, Simpson's paradox, confounders, selection
Selection heuristics:
- Phenomenon is metric-based? Always include measurement-artifact. Underrated; catches a large share of false phenomena.
- Phenomenon has a clear onset date? Include temporal.
- Phenomenon involves aggregate data (averages, ratios)? Include statistical.
- Phenomenon is in a team/org context? Include human-factors, process, incentive-structure.
- Phenomenon is in a codebase or system? Include technical.
- Phenomenon occurs in a context with external inputs (customers, markets, vendors)? Include environmental.
Drop lenses that don't fit. A phenomenon in a closed system without external dependencies probably doesn't need environmental. A phenomenon observed directly (not through metrics) probably doesn't need measurement-artifact.
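The selection heuristics above amount to a short rule table. The sketch below encodes them directly; `PhenomenonShape` and its field names are assumptions made for illustration, and the returned list is a starting point that the Judge still trims or extends to land in the 3-6 range.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PhenomenonShape:
    """Yes/no answers to the selection questions (hypothetical field names)."""
    metric_based: bool = False
    clear_onset_date: bool = False
    aggregate_data: bool = False
    team_or_org_context: bool = False
    codebase_or_system: bool = False
    external_inputs: bool = False


def select_lenses(shape: PhenomenonShape) -> list[str]:
    """Apply each heuristic in turn; lenses whose condition is false are dropped."""
    lenses: list[str] = []
    if shape.metric_based:
        lenses.append("measurement-artifact")
    if shape.clear_onset_date:
        lenses.append("temporal")
    if shape.aggregate_data:
        lenses.append("statistical")
    if shape.team_or_org_context:
        lenses += ["human-factors", "process", "incentive-structure"]
    if shape.codebase_or_system:
        lenses.append("technical")
    if shape.external_inputs:
        lenses.append("environmental")
    return lenses


# A metric with a clear onset date, observed in a running system.
print(select_lenses(PhenomenonShape(metric_based=True,
                                    clear_onset_date=True,
                                    codebase_or_system=True)))
# → ['measurement-artifact', 'temporal', 'technical']
```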
4. Spawn Diagnosticians (Parallel, Isolated)
Spawn one THK - Diagnostician agent per chosen lens, in parallel. Each receives:
- The phenomenon brief
- The observations
- The interpretations already held (flagged — not to be accepted as given)
- The unavailable evidence
- Its assigned lens
- Instruction to generate 3-8 candidate causes, each with mechanism / predictions / refuters / plausibility
No cross-talk between diagnosticians. This follows the Nominal Group Technique (NGT) principle: independent reasoning first, evaluation second. Isolated diagnosticians produce more distinct candidate causes; coordinated ones anchor on the first compelling story.
Collect all candidate causes.
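The isolation requirement can be illustrated with a generic fan-out. This is only a sketch under stated assumptions: `run_one` is a hypothetical stand-in for however a diagnostician agent is actually invoked (the real spawning mechanism is not specified here), and the brief is modeled as a plain dictionary. Each diagnostician receives its own deep copy of the brief plus its lens, and nothing from any other diagnostician.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy
from typing import Callable


def spawn_diagnosticians(
    brief: dict,
    lenses: list[str],
    run_one: Callable[[dict], list[dict]],  # hypothetical agent call
) -> dict[str, list[dict]]:
    """Run one diagnostician per lens in parallel; none sees another's output."""
    packets = []
    for lens in lenses:
        packet = deepcopy(brief)  # no shared mutable state between diagnosticians
        packet["lens"] = lens
        packets.append(packet)

    # pool.map returns results in input order, so they zip back onto the lenses.
    with ThreadPoolExecutor(max_workers=max(1, len(lenses))) as pool:
        results = list(pool.map(run_one, packets))
    return dict(zip(lenses, results))
```

Results are collected only after every diagnostician has finished, which keeps evaluation strictly after generation.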
5. Evaluate Fit — Orchestrator's Work
This phase is new territory for /think-* skills. The prior skills (brainstorm, reframe, scrutinize, deliberate) are purely divergent or choose among pre-stated options; this skill requires the orchestrator (the Judge role above) to evaluate candidate causes against the evidence.
For each candidate cause from step 4, evaluate:
- Explanatory fit — does this cause explain the observed phenomenon? Does it explain all the observations, or only some?
- Prediction check — the diagnostician stated what we'd expect to see if this cause were true. Do we observe those things? (Some predictions may require the user to check; note them.)
- Refuter check — the diagnostician stated what would disprove this cause. Do we observe any of those refuters?
- Parsimony — is there a simpler cause that fits equally well? Prefer the simpler one if fit is comparable.
- Domain plausibility — given what's known about the domain, how plausible is this cause? This uses general reasoning, not just evidence fit.
Cluster causes across lenses. Some causes from different lenses are the same underlying mechanism viewed from different angles (e.g., "engineers ship half-finished features" seen through human-factors and incentive-structure may converge on the same root cause). Merge and preserve lens attribution.
Resist compelling-narrative bias. Causes with clean stories are dangerous; they feel explanatory even when they don't fit the evidence. Weight evidence fit over story quality. When in doubt, flag "compelling story, weak fit" explicitly.
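The clustering step can be sketched as a merge that keeps lens attribution. The `CandidateCause` fields mirror what each diagnostician is asked to supply, but the class and the `merge_cluster` helper are hypothetical names; deciding which causes belong in the same cluster remains the Judge's call, not something the code infers.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CandidateCause:
    summary: str
    lenses: set[str] = field(default_factory=set)
    predictions: list[str] = field(default_factory=list)
    refuters: list[str] = field(default_factory=list)


def merge_cluster(summary: str, causes: list[CandidateCause]) -> CandidateCause:
    """Merge causes judged to share one mechanism, preserving every source lens."""
    merged = CandidateCause(summary=summary)
    for cause in causes:
        merged.lenses |= cause.lenses
        # Keep each distinct prediction and refuter once, in first-seen order.
        for prediction in cause.predictions:
            if prediction not in merged.predictions:
                merged.predictions.append(prediction)
        for refuter in cause.refuters:
            if refuter not in merged.refuters:
                merged.refuters.append(refuter)
    return merged
```

Carrying over all predictions and refuters means the merged cause is checked against everything either lens said would confirm or disprove it.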
6. Calibrate Confidence
No fabricated percentages. Use qualitative categories with clear meaning:
- Strong fit — cause explains all observations, predictions confirmed (or testable), no refuters observed, plausible. This is a leading candidate.
- Moderate fit — cause explains most observations, some predictions unconfirmed but not contradicted, plausible. Secondary candidate.
- Weak fit — cause explains some observations, significant predictions unconfirmed, possibly plausible. Long-shot candidate.
- Unable to distinguish — two or more causes fit the evidence equally well. Cannot converge without more evidence.
Honest uncertainty is valuable. "Cause A looks most likely but evidence is sparse; disambiguating observation X would shift the picture" is a better output than fake precision.
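One way to keep the categories consistent across causes is to derive them from the step 5 checks rather than from overall impression. The sketch below is an assumption-laden illustration, not the skill's definition: the `Evaluation` fields are hypothetical, "most observations" is taken to mean more than half, and any cause with an observed refuter or contradicted prediction is assigned weak fit. It does not cover the "unable to distinguish" case, which requires comparing causes against each other.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Fit(Enum):
    STRONG = "strong fit"
    MODERATE = "moderate fit"
    WEAK = "weak fit"


@dataclass
class Evaluation:
    cause: str
    observations_explained: int
    observations_total: int
    predictions_contradicted: int
    refuters_observed: int
    plausible: bool


def calibrate(ev: Evaluation) -> Fit:
    """Map the evaluation checks onto a qualitative category (no percentages)."""
    explains_all = ev.observations_explained == ev.observations_total
    # Assumption: "most" means strictly more than half of the observations.
    explains_most = ev.observations_explained * 2 > ev.observations_total
    clean = ev.predictions_contradicted == 0 and ev.refuters_observed == 0

    if explains_all and clean and ev.plausible:
        return Fit.STRONG
    if explains_most and clean and ev.plausible:
        return Fit.MODERATE
    # Assumption: everything else, including refuted causes, is a long shot.
    return Fit.WEAK
```

The thresholds are deliberately coarse; the point is that two causes with the same check results receive the same label.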
7. Report
Final report format:
## Diagnosis Report
**Phenomenon:** [one-line summary]
**Lenses applied:** [list]
### Observations
[Concrete ground-truth observations, as elicite