Think-Diagnose - Abductive Reasoning About Causes

Takes a phenomenon — something that was observed and that the user wants to understand — and produces a ranked set of candidate causes with evidence-based confidence calibration. Uses abductive reasoning: inference to the best explanation. Distinct from /bug-fix (which handles code-specific diagnosis with artifact output and execution tooling); /think-diagnose is pure reasoning about causes, applicable to non-code phenomena as readily as code ones.

This skill produces no tangible artifacts. It is a consultant, not an implementer. No code, no tickets, no commits. The output is a structured diagnosis report that the user can act on by gathering more evidence, adopting a leading cause, or piping to /think-brainstorm for remediation.

Roles

Judge (you, running this skill):

Capture the phenomenon in a written brief
Elicit evidence, rigorously separating observation from interpretation
Choose appropriate reasoning lenses
Spawn diagnosticians in isolation
Evaluate candidate causes against evidence (this skill has a real evaluative phase, unlike purely-divergent think-* skills)
Calibrate confidence honestly and report

Diagnosticians: Each receives a specific reasoning lens and generates candidate causes (with mechanisms, predictions, refuters, and plausibility) in isolation from other diagnosticians.

Workflow

1. Receive the Phenomenon

The phenomenon may arrive as:

Conversation context — summarize it back, confirm
A document — read the file (incident report, data summary, observation log)
Fresh user input — capture verbatim

Produce a written brief of the phenomenon. Precisely what is the thing to explain? Vague phenomena produce vague diagnoses.

2. Gather Evidence — Separate Observation from Interpretation

This is the most failure-prone step in the workflow, so its structure is enforced. Most bad diagnoses start by accepting interpretations as observations.

Elicit from the user, in three distinct buckets:

Observations — concrete things that were measured, seen, or experienced. "The metric dropped 30% on March 14th." "Three customers mentioned X in surveys." "The build broke at commit abc123."
Interpretations already held — what the user or others have already inferred from the observations. "The team thinks it's because of the migration." "We believe the drop is due to seasonality." Flag these explicitly so diagnosticians know not to accept them as given.
Unavailable / unknown evidence — what's unknown, wasn't measured, or can't be retrieved. "We don't have per-user data before April." "We didn't log the old config."

Push back on smuggled interpretations. If the user says "the metric dropped because of the migration," that's two claims: (a) the metric dropped (observation) and (b) the migration caused it (interpretation). Separate them before proceeding.
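As a minimal sketch of the three-bucket split, the evidence can be held in a small record with one list per bucket. The names `EvidenceBrief` and `split_causal_claim` are hypothetical, introduced only for illustration. The helper does a plain string split on "because of", which stands in for the judgment the Judge actually applies in conversation.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class EvidenceBrief:
    """Evidence for one phenomenon, kept in three separate buckets."""
    observations: list = field(default_factory=list)
    interpretations: list = field(default_factory=list)  # flagged, never accepted as given
    unavailable: list = field(default_factory=list)


def split_causal_claim(statement: str, marker: str = " because of ") -> Tuple[str, Optional[str]]:
    """Split 'X because of Y' into (observation, interpretation).

    Hypothetical helper: a naive string split, not real language understanding.
    Returns (statement, None) when no causal marker is present.
    """
    observed, sep, cause = statement.partition(marker)
    if not sep:
        return statement.strip(), None
    return observed.strip(), f"{cause.strip()} caused it"


brief = EvidenceBrief()
observation, interpretation = split_causal_claim(
    "the metric dropped because of the migration"
)
brief.observations.append(observation)
if interpretation is not None:
    brief.interpretations.append(interpretation)
brief.unavailable.append("per-user data before April")
```

Keeping the buckets as separate fields means an interpretation can only reach a diagnostician through the flagged list.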

Three to six clarifying questions are typically enough to establish this split. Stop when you have enough to pass diagnosticians material they can work with.

3. Choose Reasoning Lenses

Select 3-6 lenses from the palette based on the phenomenon's shape.

Available lenses:

technical — engineering-level causes (code, infra, config, capacity, dependencies)
human-factors — people, skills, fatigue, turnover, miscommunication, team dynamics
process — broken or missing process, handoffs, approvals, ownership, rituals
incentive-structure — the system rewards the behavior we're diagnosing (Goodhart territory)
environmental — external factors (market, regulation, customer mix, vendor, upstream)
temporal — something changed in time that correlates with the phenomenon
measurement-artifact — the phenomenon isn't real, it's a metric/instrumentation issue
statistical — base rates, regression to mean, Simpson's paradox, confounders, selection

Selection heuristics:

Phenomenon is metric-based? Always include measurement-artifact. Underrated; catches a large share of false phenomena.
Phenomenon has a clear onset date? Include temporal.
Phenomenon involves aggregate data (averages, ratios)? Include statistical.
Phenomenon is in a team/org context? Include human-factors, process, incentive-structure.
Phenomenon is in a codebase or system? Include technical.
Phenomenon occurs in a context with external inputs (customers, markets, vendors)? Include environmental.

Drop lenses that don't fit. A phenomenon in a closed system without external dependencies probably doesn't need environmental. A phenomenon observed directly (not through metrics) probably doesn't need measurement-artifact.
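The selection heuristics above can be sketched as a lookup from the phenomenon's shape to lens names. The `PhenomenonShape` record and `choose_lenses` function are hypothetical names, and the boolean flags are an assumed encoding of the questions in the list. The sketch does not enforce the 3-6 range; trimming or padding the result stays a judgment call.

```python
from dataclasses import dataclass


@dataclass
class PhenomenonShape:
    """Yes/no answers to the selection questions (assumed encoding)."""
    metric_based: bool = False
    clear_onset_date: bool = False
    aggregate_data: bool = False
    team_or_org_context: bool = False
    codebase_or_system: bool = False
    external_inputs: bool = False


def choose_lenses(shape: PhenomenonShape) -> list:
    """Return the lenses suggested by the heuristics, in heuristic order."""
    lenses = []
    if shape.metric_based:
        lenses.append("measurement-artifact")
    if shape.clear_onset_date:
        lenses.append("temporal")
    if shape.aggregate_data:
        lenses.append("statistical")
    if shape.team_or_org_context:
        lenses.extend(["human-factors", "process", "incentive-structure"])
    if shape.codebase_or_system:
        lenses.append("technical")
    if shape.external_inputs:
        lenses.append("environmental")
    return lenses


# A metric that dropped on a known date in a production system.
selected = choose_lenses(
    PhenomenonShape(metric_based=True, clear_onset_date=True, codebase_or_system=True)
)
```

Because lenses are only added when a question is answered yes, a lens that does not fit is dropped by default rather than removed afterwards.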

4. Spawn Diagnosticians (Parallel, Isolated)

Spawn one THK - Diagnostician agent per chosen lens, in parallel. Each receives:

The phenomenon brief
The observations
The interpretations already held (flagged — not to be accepted as given)
The unavailable evidence
Its assigned lens
Instruction to generate 3-8 candidate causes, each with mechanism / predictions / refuters / plausibility

No cross-talk between diagnosticians. This follows the Nominal Group Technique (NGT) principle: independent reasoning first, evaluation second. Isolated diagnosticians produce more distinct candidate causes; coordinated ones anchor on the first compelling story.

Collect all candidate causes.
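One way to picture the isolation requirement is a fan-out in which every diagnostician gets its own copy of the input packet and results are merged only after all have finished. This is an illustrative sketch, not how the skill's agent runtime works: `run_diagnostician` is a hypothetical stub standing in for a real THK - Diagnostician agent, and the thread pool is an assumed stand-in for whatever spawning mechanism is available.

```python
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy


def run_diagnostician(lens: str, packet: dict) -> list:
    """Hypothetical stub for one diagnostician; a real agent would reason here."""
    return [{"lens": lens, "cause": f"placeholder cause from the {lens} lens"}]


def spawn_diagnosticians(lenses: list, packet: dict) -> list:
    """Run one diagnostician per lens in parallel, then collect all causes."""
    with ThreadPoolExecutor(max_workers=max(1, len(lenses))) as pool:
        # Each worker receives a private copy, so no diagnostician can see
        # or alter what another one was given.
        futures = [
            pool.submit(run_diagnostician, lens, deepcopy(packet)) for lens in lenses
        ]
        batches = [future.result() for future in futures]
    # Merging happens only after every diagnostician has finished.
    return [cause for batch in batches for cause in batch]


packet = {
    "phenomenon": "the metric dropped 30% on March 14th",
    "observations": ["the metric dropped 30% on March 14th"],
    "interpretations": ["the team thinks it's because of the migration"],
    "unavailable": ["per-user data before April"],
}
candidates = spawn_diagnosticians(["technical", "temporal"], packet)
```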

5. Evaluate Fit — Orchestrator's Work

This phase is new territory for /think-* skills. The prior skills (brainstorm, reframe, scrutinize, deliberate) are purely divergent or choose among pre-stated options; this skill requires the orchestrator (the Judge role above) to evaluate candidate causes against evidence.

For each candidate cause from step 4, evaluate:

Explanatory fit — does this cause explain the observed phenomenon? Does it explain all the observations, or only some?
Prediction check — the diagnostician stated what we'd expect to see if this cause were true. Do we observe those things? (Some predictions may require the user to check; note them.)
Refuter check — the diagnostician stated what would disprove this cause. Do we observe any of those refuters?
Parsimony — is there a simpler cause that fits equally well? Prefer the simpler one if fit is comparable.
Domain plausibility — given what's known about the domain, how plausible is this cause? This uses general reasoning, not just evidence fit.

Cluster causes across lenses. Some causes from different lenses are the same underlying mechanism viewed from different angles (e.g., "engineers ship half-finished features" seen through human-factors and incentive-structure may converge on the same root cause). Merge and preserve lens attribution.
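A minimal sketch of merging while preserving lens attribution follows. It assumes each candidate is a dict with `mechanism` and `lens` keys (a hypothetical shape) and matches mechanisms by normalized text only. Real clustering needs the Judge's judgment about whether two differently worded mechanisms are the same.

```python
from collections import defaultdict


def cluster_by_mechanism(candidates: list) -> dict:
    """Group candidates by mechanism, keeping the lenses that proposed each one."""
    clusters = defaultdict(list)
    for candidate in candidates:
        # Naive matching on lowercased text; an assumption made for illustration.
        key = candidate["mechanism"].strip().lower()
        if candidate["lens"] not in clusters[key]:
            clusters[key].append(candidate["lens"])
    return dict(clusters)


merged = cluster_by_mechanism([
    {"lens": "human-factors", "mechanism": "Engineers ship half-finished features"},
    {"lens": "incentive-structure", "mechanism": "engineers ship half-finished features"},
    {"lens": "technical", "mechanism": "config drift after deploy"},
])
```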

Resist compelling-narrative bias. Causes with clean stories are dangerous; they feel explanatory even when they don't fit the evidence. Weight evidence fit over story quality. When in doubt, flag "compelling story, weak fit" explicitly.

6. Calibrate Confidence

No fabricated percentages. Use qualitative categories with clear meaning:

Strong fit — cause explains all observations, predictions confirmed (or testable), no refuters observed, plausible. This is a leading candidate.
Moderate fit — cause explains most observations, some predictions unconfirmed but not contradicted, plausible. Secondary candidate.
Weak fit — cause explains some observations, significant predictions unconfirmed, possibly plausible. Long-shot candidate.
Unable to distinguish — two or more causes fit the evidence equally well. Cannot converge without more evidence.

Honest uncertainty is valuable. "Cause A looks most likely but evidence is sparse; disambiguating observation X would shift the picture" is a better output than fake precision.
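The categories can be treated as an ordered scale rather than as numbers. The sketch below takes fit ratings the Judge has already assigned and reports either a single leading candidate or an explicit tie. The names `Fit` and `summarize` are hypothetical, and treating any tie at the top level as "unable to distinguish" is a simplifying assumption.

```python
from enum import IntEnum


class Fit(IntEnum):
    """Qualitative fit levels, ordered from weakest to strongest."""
    WEAK = 1
    MODERATE = 2
    STRONG = 3


def summarize(ratings: dict) -> str:
    """Rank causes by fit and say whether a single leader emerges."""
    if not ratings:
        return "No candidate causes to rank."
    ranked = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
    top_fit = ranked[0][1]
    leaders = [cause for cause, fit in ranked if fit == top_fit]
    if len(leaders) > 1:
        # Assumption: a tie at the top level means more evidence is needed.
        return "Unable to distinguish: " + ", ".join(leaders)
    return f"Leading candidate ({top_fit.name.lower()} fit): {leaders[0]}"


clear = summarize({"config change": Fit.STRONG, "seasonality": Fit.WEAK})
tied = summarize({"migration": Fit.MODERATE, "seasonality": Fit.MODERATE, "outage": Fit.WEAK})
```

The output names a category and never a percentage, which keeps the report in line with the rule against fabricated precision.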

7. Report

Final report format:

## Diagnosis Report

**Phenomenon:** [one-line summary]
**Lenses applied:** [list]

### Observations

[Concrete ground-truth observations, as elicite
