Self-Review Skill
You are helping a medical researcher check their own manuscript before journal submission. The goal is to anticipate reviewer comments by applying the same critical lens used in peer review across medical journals.
This is NOT about writing a review. It's about producing an actionable list of anticipated reviewer comments with specific fix suggestions, so the manuscript can be strengthened before reviewers ever see it.
Optional Flags
- --fix: After generating the review report, automatically apply fixes for all issues where fixable_by_ai is true. Edits the manuscript in place, then reports a diff summary. Does NOT fix issues marked fixable_by_ai: false (e.g., missing data, design flaws). Maximum 2 fix-and-re-review iterations.
- --json: Output the structured JSON block (see Phase 3c below) in addition to the markdown report. Default when called from /write-paper Phase 7.
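
The --fix behavior described above can be sketched as a bounded loop. This is only an illustration: `fix_loop`, `review`, and `apply_fix` are hypothetical names, and the issue dictionaries are assumed to carry a `fixable_by_ai` key as in the flag description. The real report structure is defined in Phase 3c.

```python
from typing import Callable, Dict, List, Tuple

MAX_FIX_ITERATIONS = 2  # cap taken from the --fix description


def fix_loop(
    manuscript: str,
    review: Callable[[str], List[Dict]],      # hypothetical: returns a list of issue dicts
    apply_fix: Callable[[str, Dict], str],    # hypothetical: returns the edited manuscript
) -> Tuple[str, List[Dict]]:
    """Apply AI-fixable issues, re-review, and stop after at most two rounds."""
    issues = review(manuscript)
    for _ in range(MAX_FIX_ITERATIONS):
        # Only issues explicitly marked fixable_by_ai: true are touched.
        fixable = [issue for issue in issues if issue.get("fixable_by_ai") is True]
        if not fixable:
            break
        for issue in fixable:
            manuscript = apply_fix(manuscript, issue)
        issues = review(manuscript)
    # Whatever remains (e.g., missing data, design flaws) is reported, not edited.
    return manuscript, issues
```

Issues with fixable_by_ai set to false pass through the loop untouched and are returned for the author to address.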
Severity Framing
When flagging issues, classify severity:
- Fatal: Fundamental design flaw that cannot be fixed with existing data (e.g., data leakage that invalidates all results, absence of any reference standard, label-feature circularity). The manuscript likely needs redesign. Submission would likely result in Reject.
- Fixable: Significant but addressable with existing data (e.g., missing calibration analysis, unclear exclusion criteria, absent CIs, incomplete reporting). These are the most actionable findings.
Most issues are Fixable. Reserve Fatal for true design-level problems.
Workflow
Phase 1: Intake
- Get the manuscript -- PDF, Word doc, or pasted text.
- Ask the user:
  - Target journal? (affects reporting standards and scope expectations)
  - Manuscript type? (original research / review / technical note / letter / meta-analysis / case report)
  - Anything they're already worried about?
- Read the full manuscript.
Phase 2: Systematic Check
Run the manuscript through each applicable category below. For each item, assess whether a reviewer would raise it as a Major or Minor comment.
Use the Research-Type Adaptation table (below) to determine which categories apply fully, partially, or not at all for the given manuscript type.
A. Study Design & Data Integrity
| Check | What to look for |
|---|---|
| Patient-level splitting | Are train/val/test splits at the patient level? Is this explicitly stated? |
| Leakage risk | Any postoperative variable used in a preoperative model? Cohort-wide preprocessing before split? |
| Temporal independence | Random split within same institution = no temporal independence. Acknowledged? |
| Analysis unit clarity | Patient vs exam vs lesion vs image -- is the unit consistent throughout? |
| Sample size per class | For the test set specifically -- are there enough cases per class for stable metrics? |
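
To make the patient-level splitting check concrete, here is a minimal sketch of a split that keeps each patient on one side only, plus a helper that detects overlap. The record layout (a dict with a `patient_id` key) and both function names are assumptions for illustration, not the method of any particular manuscript.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split records so that all records of a given patient land on one side only."""
    by_patient = defaultdict(list)
    for rec in records:
        by_patient[rec["patient_id"]].append(rec)

    # Shuffle patients (not images or exams), then cut the patient list.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))

    test = [r for pid in patient_ids[:n_test] for r in by_patient[pid]]
    train = [r for pid in patient_ids[n_test:] for r in by_patient[pid]]
    return train, test


def overlapping_patients(train, test):
    """Return patient IDs found in both sets; a non-empty result signals leakage."""
    return {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
```

If a manuscript describes a random split at the image or exam level, a check like `overlapping_patients` is what a reviewer would implicitly be asking the authors to have run.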
B. Reference Standard & Ground Truth
| Check | What to look for |
|---|---|
| Definition specificity | Is the reference standard precisely defined? (e.g., "pathological T stage" vs vague "staging") |
| Timing | Interval between index test and reference standard reported? |
| Independence | Were ground truth annotators independent from the comparator readers? |
| Annotation protocol | Number of readers, consensus method, blinding, inter-reader agreement reported? |
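
For the inter-reader agreement item, one commonly reported statistic for two readers assigning categorical labels is Cohen's kappa. The sketch below is an illustrative implementation under that assumption. Studies with more than two readers or with continuous measurements would need a different statistic (e.g., Fleiss' kappa or ICC).

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa for two readers labelling the same cases."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("readers must label the same non-empty set of cases")
    n = len(reader_a)

    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n

    # Chance agreement from each reader's marginal label frequencies.
    count_a, count_b = Counter(reader_a), Counter(reader_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both readers used one identical label for every case
    return (observed - expected) / (1 - expected)
```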
C. Validation & Statistical Reporting
| Check | What to look for |
|---|---|
| Confidence intervals | All primary metrics have 95% CIs? |
| Calibration [CRITICAL] | Prediction models: calibration plot + Brier score or slope/intercept MUST be present. AUC alone is insufficient -- mark as Major if absent |
| Clinical comparator | Is there a clinical-only baseline to show incremental value? |
| DCA / net benefit | For clinical decision tools: decision curve analysis present? |
| Multiple comparisons | If many tests: acknowledged as exploratory, or correction applied? |
| Paired statistics | If same patients compared across modalities: paired tests used (McNemar, DeLong)? |
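
Several of the checks above correspond to short computations. The following sketch shows one possible version of three of them: a Wilson 95% CI for a proportion such as sensitivity, the Brier score, and an exact two-sided McNemar test on discordant pairs. Function names are illustrative, the Wilson interval is only one of several acceptable CI methods, and the DeLong test and calibration slope/intercept are not covered here.

```python
import math


def wilson_ci(successes, n, z=1.959964):
    """Wilson score interval for a proportion (z defaults to the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value from the two discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, `wilson_ci(45, 50)` gives the interval for 45 correct calls out of 50, and `mcnemar_exact(2, 10)` compares two modalities read on the same patients where they disagreed in 12 cases.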
D. Clinical Framing & Importance
| Check | What to look for |
|---|---|
| Intended use | Is the clinical decision point clearly stated? (triage vs diagnosis vs prognosis vs monitoring) |
| Overclaiming | Does language match evidence? ("will improve" -> "may potentially"; "superior" with overlapping CIs?) |
| Terminology precision | Key terms defined? (e.g., "perioperative" = when exactly?) |
| Title-content alignment | Does the title accurately reflect what was actually done? |
| Novelty statement | What does this study add beyond existing literature? Is this explicitly stated? |
| Clinical importance | Would the findings change clinical practice or research direction? Is this articulated? |
E. Reproducibility
| Check | What to look for |
|---|---|
| Preprocessing details | All steps listed in order? Normalization, augmentation, resampling specified? |
| Model details | Architecture, optimizer, LR, batch size, epochs, early stopping reported? |
| Segmentation protocol | ROI definition, reader experience, blinding, tool used? |
| Hardware/software | Inference environment, software versions, code availability? |
| Scanner/protocol info | For imaging studies: scanner model, sequence parameters, contrast protocol? |
| Data/code availability | Is a data availability statement included? Code shared or reason for not sharing stated? |
F. Reporting Completeness
| Check | What to look for |
|---|---|
| Abstract-body consistency | Numbers in Abstract match Tables/Results? |
| Table/Figure accuracy | Cross-check key values between tables, figures, and text |
| Follow-up duration | For survival/prognosis: median follow-up with IQR reported? |
| Ethics | All participating institutions' IRB approval documented? Patient consent described? |
| Missing data | Handling of incomplete cases described? |
| CONSORT/STARD/TRIPOD flow | Appropriate flow diagram present with patient counts at each step? |
| Funding & COI | Funding sources and competing interests disclosed? |
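
The abstract-body consistency check can be partly mechanized. As a rough sketch, the hypothetical helper below lists numbers that appear in the abstract but nowhere in the body text. It compares number strings literally, so rounding differences (0.87 vs 0.870) are flagged as mismatches and still need a human look.

```python
import re

# Integers or decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def unmatched_abstract_numbers(abstract, body):
    """Return numbers quoted in the abstract that never appear in the body."""
    body_numbers = set(NUMBER.findall(body))
    return [n for n in NUMBER.findall(abstract) if n not in body_numbers]
```

Usage: `unmatched_abstract_numbers("AUC was 0.87 in 312 patients.", "We included 312 patients. The AUC was 0.86.")` returns `["0.87"]`, which would be raised as an abstract-results discrepancy.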
G. Reporting Guideline Compliance
Match the manuscript type to the appropriate checklist and verify key items:
| Manuscript type | Checklist | Critical items to verify |
|---|---|---|
| Diagnostic accuracy | STARD / STARD-AI | Flow diagram, reference standard, spectrum |
| Prediction model (non-AI) | TRIPOD 2015 | Model development vs validation, calibration, missing data |
| Prediction model (AI/ML) | TRIPOD+AI 2024 | Model development vs validation, calibration, leakage, fairness |
| AI / Radiomics | CLAIM 2024 / CLEAR | Feature selection transparency, external validation |
| RCT | CONSORT / CONSORT-AI | Randomization, blinding, ITT |
| Systematic review (interventions) | PRISMA 2020 | Search strategy, screening, risk of bias |
| Meta-analysis (observational) | MOOSE + PRISMA 2020 | Confounding assessment, heterogeneity, publication bias |
| Observational | STROBE | Confounding, selection bias, missing data |
| Reliability / agreement | GRRAS | ICC model/type, rater description, measurement protocol |
| Educational | SQUIRE 2.0 | Intervention description, outcome measures, context |
| Case report | CARE | Timeline, diagnostic reasoning, informed consent |
| Surgical | STROBE-Surgery | Surgeon experience, technique details, complications |
For a full item-by-item audit, run /check-reporting on this manuscript. If it has already
been run, reference its results and flag any MISSING items as Anticipated Major/Minor Comments.
If not yet run, flag: "Full reporting guideline compliance not yet audited -- run /check-reporting
before submission for item-level assessment."
H. Circularity
| Check | What to look for |
|---|---|
| Label-feature overlap | Is the prediction label derived from the same data source as any input features? (e.g., NLP-extracted label + text-derived fea |