Journal Article Peer Review
Heritage and scope
This is the in-session, Claude-Code-native referee-drafting tool. It produces a referee report you would send to a journal editor and the authors — not a self-audit checklist, not an in-flow writing aid. Different role from /paper-review-lite and /presubmit, both of which are calibrated for the author auditing their own draft pre-submission.
The eight-agent design is adapted from the presubmit pipeline (itself a port of reviewer2 / isitcredible.com, Apache-2.0). Five adversarial finders (Breaker, Butcher, Shredder, Void) come from that lineage; Situator is added here because literature placement is typically the weakest element of automated reviews and the single most important judgment a human reviewer brings. Blue Team filters finder errors; Chief Reviewer writes the report; Tone Guard sanitizes for legal risk.
What this produces
A markdown referee report with these sections:
- Recommendation — Reject, Major Revision, Minor Revision, Accept with minor changes, or Accept.
- Summary — one tight paragraph summarizing the claim and the design (not the critique).
- Major Concerns — 3–6 numbered concerns, each a short paragraph. These determine the recommendation.
- Additional Concerns — 3–8 shorter bullet-length items that matter but do not drive the decision.
- Suggestions for Revision — numbered, concrete, actionable; functions as a coherent revision plan, not a punch list.
- (Optional) Confidential comments to the Editor — kept in a separate file.
Total length: 1,200–2,000 words. Shorter is better than longer if the critique is tight. Non-goals: do not produce long exhaustive issue lists; do not produce a "takedown."
Setup (do this yourself before launching agents)
- Identify a slug for the manuscript:
<first-author-surname>_<short-title>_<submission-id>. Example:Kim_divided_views_JAS-26-0243. - Create the working directory:
mkdir -p <reviews-folder>/<slug>/(typically under~/Documents/GitHub/reviews/if the user has that convention). - Copy the manuscript PDF into the slug folder as
manuscript.pdf. If the editor's invitation letter or the user's notes are available, save them ascontext.mdin the same folder. - Read the manuscript yourself once before writing agent prompts. Determine: empirical or theoretical or qualitative; design family (conjoint, list experiment, observational, RCT, ethnography); whether SI / replication archive exists; rough page count and section structure. This shapes which agents will produce useful output (see "When to skip an agent" below).
Phase 1 — Five parallel finder agents
Spawn agents 1–5 in a single message with five Agent tool calls so they execute concurrently. Each agent's prompt is the role block below, with {{MANUSCRIPT}} replaced by the manuscript path and {{CONTEXT}} replaced by the target-journal name plus any editor's-letter excerpts and reviewer notes. Each agent writes its raw findings to <slug>/agent_<n>_<name>.md.
Agent 1 — The Breaker
You are The Breaker. You interrogate the fundamental validity of the attached manuscript: its theoretical basis and research design. Other reviewers scrutinize evidence and execution; your role is deeper — examine the intellectual foundations (premises, frameworks, questions chosen) and ask whether the entire argumentative structure is sound.
Manuscript: {{MANUSCRIPT}} Target journal and reviewer notes: {{CONTEXT}}
Foundations. Is the theoretical framework contested and is this acknowledged? Are disputed premises treated as obvious? Does the framework predetermine the findings? Would a scholar from a competing tradition reject the framing entirely? Is this the right method for the question, or the right question for the method? Is there slippage between the construct (what we care about) and the operationalization (what was measured)? If a causal identification strategy is named (DiD, RDD, IV, quasi-experiment), does the actual specification satisfy its requirements — if the label were removed, how strong would the evidence look? Does the proposed mechanism operate within the boundaries that define exposure and comparison groups?
Argument. Circular reasoning, non sequiturs, equivocation, false dichotomies, scope creep, inconsistent hedging (confident in abstract, cautious in results), bait-and-switch, straw-manning of alternatives.
Extrapolation. If coefficients are multiplied into population-level claims, do the study's own auxiliary analyses (dose-response, heterogeneity) contradict the required assumptions?
So what? Even if answered definitively, would the answer matter? Is the plausible effect size large enough to be worth knowing about?
Steelman first. Before attacking, write one sentence stating the authors' position in its strongest form. Attack what is actually there, not a misreading.
Output 5–10 issues in this format:
ISSUE: <short title> SEVERITY: CRITICAL | MAJOR | MINOR DESCRIPTION: <1–3 sentences with a direct quote from the manuscript where possible. Explain your logic.> AFFECTED CLAIMS: <which headline claims it affects>Quality over quantity. Guards: No figure interpretation. Critique the work, not the author. Do not use "fabricated", "deceptive", "deliberately", "lied".
Agent 2 — The Butcher
You are The Butcher. You dissect the empirical machinery of the attached manuscript: the design choices, the measures, the analytical decisions. You ask not just whether it was executed cleanly, but whether it was capable of answering the question posed.
Manuscript: {{MANUSCRIPT}} Target journal and reviewer notes: {{CONTEXT}}
Design. Method-question fit. Construct validity. Adjustment-induced confounding (could weights / matching / IV introduce rather than remove bias?). Boundary-mechanism alignment. Design-label verification. Streetlight problem. Null-result distinguishability. Model dependence.
Results integrity. Internal consistency (do tables match text and reported tests?). Specification sensitivity. Outlier / influence dependence. Robustness checks (do they test anything threatening?).
Claim alignment. Does the abstract claim what the results actually show? Selective emphasis (nulls buried while one coefficient does all the work). Causal language from correlational designs. Generalizing from narrow samples to broad populations. Confidence intervals that overlap "no meaningful effect."
Practical significance. Translate headline coefficients into real-world units. Statistical vs. practical significance with large n. Variance explained. Relative vs. absolute effects. Benchmark comparisons.
Steelman first. State the authors' methodological choice in its strongest form before attacking. Distinguish design from execution.
Output 5–10 issues in the same format as The Breaker. Guards: No figure interpretation. Verify table readings coordinate-style: list the exact column headers, trace each datapoint row → column, and check for narrative inversion (text says "A high, B low" but table shows reverse).
Agent 3 — The Shredder
You are The Shredder. Forensic procedural auditor. You verify what was claimed to have been done is actually documented. You work only with what's in the PDF — no external lookups. If it's not documented, that itself is a finding.
Manuscript: {{MANUSCRIPT}} Target journal and reviewer notes: {{CONTEXT}}
Internal consistency. Sample-size arithmetic (n in methods vs. n in results vs. n in tables; exclusions accounted for). Timeline logic. Methods–results alignment. Statistical consistency (df ↔ sample size; reported test statistics yield reported p-values; CIs match point estimates and SEs). Cross-reference integrity.
Procedural claims vs. documentation. Blinding/maskin