Thorough Code Review
Every finding must be backed by evidence: a command that was run, a quoted doc, or a file:line citation. Claims that depend on adjacent behavior — a flag's semantics, a library's API, a shell idiom's expansion, what a parallel or older code path does — are hypotheses until verified. Hypotheses do not appear in the final report.
When to use vs. skip
Use when:
- Change touches infra, security, payments, data migrations, auth, or a shared library
- Author will act on the review (fixes applied), not just collecting opinions
- A wrong finding would waste author time on a phantom bug
- A missed bug would ship
Skip when:
- User asked for a quick gut-check or exploratory thoughts
- Change is style-only, generated, or mechanical
- Use the lighter
/reviewslash command instead
Workflow
Five phases. Do not skip or interleave.
Phase 1 — Scope & context
- Fetch metadata and diff:
- GitHub:
gh pr view <n>+gh pr diff <n> - GitLab:
glab mr view <n>+glab mr diff <n> - Local branch:
git log <base>..HEAD+git diff <base>...HEAD
- GitHub:
- Read the PR description in full. Note goals, test plan, explicit non-goals.
- List changed files; tag each as core logic, test, or docs/config.
- Set review depth: core logic → line-by-line; tests → correctness + coverage; docs/config → sanity pass.
Exit: state the PR's intent in one sentence; order files by review priority.
Phase 2 — Read in context
Read each changed file with at least 20 lines of surrounding context, or the full file if under ~200 lines. Diff-only review is forbidden past this phase — diffs hide surrounding code that may interact with the change.
For each file, jot down (scratch notes, not the report):
- What the file does
- What this PR changes about it
- Who imports it (Grep on the export or path)
If you cannot articulate the change per file, you have not read enough.
Phase 3 — Findings (with evidence tags)
Tag each candidate finding:
[verified]— issue is visible in the changed code; evidence is file:line + quoted code[hypothesis]— depends on adjacent behavior not yet checked
Hypotheses are not findings. Do not write them into the final report.
Common hypothesis categories and how to verify:
| Category | Verification |
|---|---|
| What a CLI flag does | <tool> --help, official docs (use a docs-lookup skill if available), or run it |
| What a library function returns/throws | Docs lookup, source read, or a small repro |
Shell idiom expansion (xargs, glob, IFS, $(...) failure) | Run on sample input |
| What an old/parallel code path did | git show <base>:<path>, git log -p |
| Whether a test exists for X | grep / glob |
| Platform behavior (BSD vs GNU, macOS vs Linux) | Run it; cite the man page |
| Runtime behavior under edge case | Focused test, or downgrade severity |
Phase 4 — Verification (the gate)
Promote every [hypothesis] to [verified] or drop it. For each, produce evidence: command + output, doc quote with link, or source-of-truth file:line. Evidence goes in the finding's Evidence: field.
Code suggestions count as hypotheses. If you propose a fix snippet (shell one-liner, regex, config tweak), run it on representative input before posting. A "fix" that doesn't work as advertised is the most common review failure.
Parallel verification: for 3+ independent hypotheses, dispatch them to subagents in parallel via Agent. Each subagent gets one narrow task: "verify this claim and report yes/no with evidence." Subagents force evidence over vibes and keep the main context clean.
If a hypothesis cannot be verified within budget:
- Mark
[unverified], downgrade to NOTE (never BLOCKER or MAJOR), or - Drop it
Phase 5 — Self-validation (adversarial)
Re-read each finding as if you were the PR author defending it:
- Is the Evidence line actually evidence, or a restatement of the claim?
- Could this be wrong in a way that would embarrass me on second pass?
- Is this scoped to the PR's diff, or am I prescribing broader cleanup?
- Is severity hedged ("borderline", "mild", "technically", "might")? Downgrade.
- Did I praise something for behavior it doesn't actually have?
Drop or downgrade anything that fails. You are looking for your own mistakes.
Worked example
Illustrates how a hypothesis is formed, verified, and either dropped or promoted.
Scenario: PR adds a new --strict flag to a CLI's config loader.
Phase 3 — candidate finding (raw):
[hypothesis] The new --strict flag may break backwards compatibility for
existing config files.
Phase 4 — verification:
- Read PR description: claims
--strictis opt-in, defaultfalse. - Run
<tool> --help | grep strictagainst the changed branch: confirms defaultfalse. grep -rn "loadConfig(" src/: 4 callers; none setstrict: true.- Verdict: backwards compatibility preserved by default. Drop.
Verification Log entry:
- "--strict may break existing configs" → dropped: flag default is `false`
(verified via --help output); 4 callers grepped, none opt in.
A second hypothesis that survives verification:
Phase 3:
[hypothesis] --strict combined with the legacy --legacy-mode flag may
produce undefined behavior.
Phase 4:
- Read
src/config.ts:142-160(full function, not just diff): when both flags are set, the loader returns early without applying either. rg "strict.*legacy|legacy.*strict" tests/: zero matches → no test covers the combination.
Promoted to MAJOR finding:
**MAJOR — [src/config.ts:142] Conflicting flags silently no-op**
When both --strict and --legacy-mode are set, the loader returns early
without applying either; users get neither behavior with no warning.
Evidence: src/config.ts:142-160 (read in Phase 2);
`rg "strict.*legacy|legacy.*strict" tests/` returned 0 matches.
Suggested fix: either error on the combination or document the
precedence explicitly. (No code snippet posted — the fix is a design
choice for the author.)
Two properties to notice:
- The dropped hypothesis still appears in the Verification Log. Future re-readers can tell it was considered, not overlooked.
- The surviving finding's Evidence is reproducible: anyone with the repo can re-run the grep and re-read the lines.
Severity rubric
| Level | Meaning |
|---|---|
| BLOCKER | Must fix before merge. Concrete bug, security issue, or breaking change. Evidence unambiguous. |
| MAJOR | Should fix before merge. Correctness or design issue with clear reasoning. |
| MINOR | Nice-to-have inside the PR's scope. |
| NOTE | Observation, not a request. Unverified hypotheses and out-of-scope context land here. |
| PRAISE | Worth calling out: clean handling, smart trade-off, good test. Subject to the same evidence rule. |
Discipline:
- BLOCKER with hedging → MAJOR
- MAJOR without verified evidence → NOTE or drop
- Finding outside the diff → drop or move to "Out of scope"
Scope discipline
Stay inside the PR's diff:
- Ask the author to fix what they introduced, not what was already there
- Do not offer multi-option menus — pick one and recommend it; the author can push back
- Pre-existing issues → NOTE if the PR exposes them; otherwise stay silent
If a pre-existing bug makes the diff unsafe, frame it as: "this PR exposes a pre-existing X; the safest path here is Y" — not "while you're here, also fix X."
Required verifications by change type
Apply the relevant block when the diff touches that area.
Shell scripts / hooks
- Run any new shell idiom on sample input (xargs with whitespace, glob expansion,
$(...)failure modes,set -einteractions) - Check POSIX vs GNU vs BSD portability when shebang is
/usr/bin/env sh - Verify exit codes match the caller's contract (e.g., Claude Code Stop hooks: 0 = ok, 2 = block)
Library / framework usage
- Verify any flag or API claim against cu