Content Refinement Agent (Step 5)
Faithful implementation of the Content Refinement Agent from PaperOrchestra (Song et al., 2026, arXiv:2604.05018, §4 Step 5, App. F.1 pp. 49–51).
Cost: ~5–7 LLM calls (App. B), typically ~3 refinement iterations, each consisting of one reviewer call and one revision call.
The paper highlights this step as one of the largest contributors to overall quality: refinement alone accounts for +19% (CVPR) and +22% (ICLR) absolute acceptance-rate improvement (Fig. 4). Get this step right.
Inputs
- workspace/drafts/paper.tex: output of Step 4
- workspace/inputs/conference_guidelines.md
- workspace/inputs/experimental_log.md: used as ground truth for the hallucination check
- workspace/citation_pool.json / workspace/refs.bib: the allowed bibliography
Outputs
- workspace/refinement/iter1/, iter2/, iter3/: per-iteration snapshots containing paper.tex, paper.pdf, review.json, score.json
- workspace/refinement/worklog.json: append-only history of decisions
- workspace/final/paper.tex and workspace/final/paper.pdf: copy of the best accepted snapshot
The refinement loop
prev_score = score(paper.tex)   # baseline from initial draft
snapshot iter0/
for iter in 1..ITER_CAP (default 3):
    1. simulate_review(paper.tex) → review.json
       (uses `references/reviewer-rubric.md` rubric)
    2. apply_revision(paper.tex, review.json) → new_paper.tex
       (uses verbatim Refinement Agent prompt at `references/prompt.md`)
    3. snapshot iter<N>/ with new_paper.tex, review.json
       latexmk -pdf new_paper.tex → iter<N>/paper.pdf
    4. score(new_paper.tex) → curr_score
    5. decide via score_delta.py:
       - if curr.overall > prev.overall: ACCEPT
       - elif curr.overall == prev.overall and net_subaxis ≥ 0: ACCEPT
       - else: REVERT
    6. apply_worklog.py to append the decision
    7. if REVERT or no actionable weaknesses or iter == ITER_CAP: HALT
       paper.tex ← new_paper.tex   (only on ACCEPT)
       prev_score ← curr_score
cp <best iter>/paper.tex → workspace/final/paper.tex
The "best" snapshot at HALT is the one with the highest accepted overall score. On a REVERT halt, the best is the iteration immediately before the revert.
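The "best snapshot" rule can be illustrated with a minimal Python sketch. It is not part of the shipped scripts: the `IterationResult` record and the `best_snapshot` helper are hypothetical names, and the tie-break toward the later iteration is an assumption, motivated by the fact that a tied iteration is only accepted when its net sub-axis change is non-negative.

```python
from dataclasses import dataclass


@dataclass
class IterationResult:
    """Hypothetical record of one snapshot; not a schema from the scripts."""
    iteration: int   # 0 is the initial draft
    overall: float   # overall score for this snapshot
    accepted: bool   # True for iter0 and for every ACCEPT decision


def best_snapshot(history):
    """Return the accepted snapshot with the highest overall score.

    Ties go to the later iteration (an assumption, see the lead-in).
    """
    accepted = [r for r in history if r.accepted]
    if not accepted:
        raise ValueError("no accepted snapshot in history")
    return max(accepted, key=lambda r: (r.overall, r.iteration))


history = [
    IterationResult(0, 64.5, True),    # baseline
    IterationResult(1, 68.0, True),    # ACCEPT
    IterationResult(2, 66.0, False),   # REVERT, so iter1 stays the best
]
print(best_snapshot(history).iteration)
```

On a REVERT halt this selects the iteration immediately before the revert, matching the rule above.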
Step-by-step
0. Pre-refinement integrity gate
Before snapshotting or scoring the initial draft, run the AI failure modes gate:
Load references/ai-failure-modes.md (which points to skills/shared/ai_failure_modes.md).
Run all 7 checks against the draft and the inputs. This gate runs once only,
before the iter0 snapshot and before iteration 1 begins.
- CONFIRMED failure → write HALT entry to worklog.json, report to user, stop.
- SUSPECTED failure → add WARNING comment to paper.tex, log in worklog.json, continue.
- No failures → proceed.
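The three gate outcomes can be sketched as a small dispatch function. This is an illustrative sketch only: the `Finding` enum, the `gate_action` helper, and the check names in the example are hypothetical, and the real check list lives in skills/shared/ai_failure_modes.md.

```python
from enum import Enum


class Finding(Enum):
    """Hypothetical per-check outcome labels."""
    NONE = "none"
    SUSPECTED = "suspected"
    CONFIRMED = "confirmed"


def gate_action(findings):
    """Map per-check findings to a gate action.

    findings: dict of check name -> Finding.
    Returns (action, check_names), where action is HALT, WARN, or PROCEED.
    """
    confirmed = sorted(n for n, f in findings.items() if f is Finding.CONFIRMED)
    suspected = sorted(n for n, f in findings.items() if f is Finding.SUSPECTED)
    if confirmed:
        # Any confirmed failure stops the run before snapshotting or scoring.
        return "HALT", confirmed
    if suspected:
        # Suspected failures are annotated and logged, then the run continues.
        return "WARN", suspected
    return "PROCEED", []


# Check names below are placeholders, not the real list of 7 checks.
print(gate_action({"check_a": Finding.NONE, "check_b": Finding.SUSPECTED}))
```

A confirmed failure takes precedence over any number of suspected ones, so a mixed result still halts.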
0b. Snapshot the initial draft
python skills/content-refinement-agent/scripts/snapshot.py \
--src workspace/drafts/paper.tex \
--dst workspace/refinement/iter0/
This creates iter0/paper.tex. Then compile to iter0/paper.pdf:
cd workspace/refinement/iter0/ && latexmk -pdf -interaction=nonstopmode paper.tex
Score it (see Steps 1 and 2 below) → iter0/score.json.
1. Simulate peer review
For each iteration N starting from 1:
Writing quality pre-check (start of every iteration): Load
references/writing-quality-check.md and run the 5-category checklist
(Categories A–E) against the current draft. Note violations and add them to
the revision agenda.
Load references/reviewer-rubric.md as the system prompt for the simulated
reviewer call. The reviewer reads iter<N-1>/paper.pdf (or paper.tex if
your host LLM lacks PDF input) and produces a JSON of strengths,
weaknesses, questions, and per-axis scores.
The rubric is structured to mimic AgentReview (Jin et al., 2024) — the paper's chosen evaluator. We ship a faithful rubric in the references directory; the host agent's LLM does the actual reviewing.
Devil's Advocate reviewer: One simulated reviewer must be designated the DA
following references/da-reviewer.md. The DA challenges core claims from first
principles (causal overclaiming, ablation coverage, baseline fairness,
generalization claims, novelty inflation) rather than surface polish. If the DA
issues a CRITICAL finding that remains unaddressed after all reviewers weigh in,
that finding blocks the "refinement accepted" decision regardless of rubric scores.
Log DA CRITICAL findings in worklog.json: {"da_critical": true, "finding": "..."}.
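The blocking rule can be sketched as a guard on top of the score-based decision. This is a minimal illustration, assuming each logged finding carries the `da_critical` and `finding` keys shown above; the `addressed` flag and the `refinement_accepted` helper are hypothetical additions for the example.

```python
def refinement_accepted(score_accepts, da_findings):
    """Return True only if scores accept AND no DA CRITICAL finding is open.

    score_accepts: result of the score-based accept/revert decision.
    da_findings: list of dicts with 'da_critical', 'finding', and a
                 hypothetical 'addressed' flag (defaults to False).
    """
    blocking = [
        f for f in da_findings
        if f.get("da_critical") and not f.get("addressed", False)
    ]
    return bool(score_accepts) and not blocking


findings = [{"da_critical": True, "finding": "causal claim lacks ablation"}]
print(refinement_accepted(True, findings))   # blocked despite passing scores

findings[0]["addressed"] = True
print(refinement_accepted(True, findings))   # unblocked once addressed
```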
Save to workspace/refinement/iter<N>/review.json.
2. Score the draft
The reviewer call produces both qualitative feedback and a per-axis score:
{
  "axis_scores": {
    "scientific_depth":      {"score": 65, "justification": "..."},
    "technical_execution":   {"score": 70, "justification": "..."},
    "logical_flow":          {"score": 60, "justification": "..."},
    "writing_clarity":       {"score": 55, "justification": "..."},
    "evidence_presentation": {"score": 72, "justification": "..."},
    "academic_style":        {"score": 68, "justification": "..."}
  },
  "overall_score": 64.5,
  "strengths": [...],
  "weaknesses": [...],
  "questions": [...]
}
Save to iter<N>/score.json. (Combined with review.json if your host
emits one document; the schemas overlap.)
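Because the accept/revert decision depends on these fields, it may help to validate score.json before passing it on. The sketch below checks only what the example schema above shows (six named axes with numeric scores, plus a numeric overall score); `validate_score` is a hypothetical helper, not one of the shipped scripts, and it makes no assumption about the score range.

```python
AXES = (
    "scientific_depth",
    "technical_execution",
    "logical_flow",
    "writing_clarity",
    "evidence_presentation",
    "academic_style",
)


def validate_score(doc):
    """Return a list of problems found in a parsed score.json (empty if OK)."""
    errors = []
    axis_scores = doc.get("axis_scores")
    if not isinstance(axis_scores, dict):
        axis_scores = {}
    for axis in AXES:
        entry = axis_scores.get(axis)
        if not isinstance(entry, dict) or not isinstance(entry.get("score"), (int, float)):
            errors.append(f"missing or non-numeric score for {axis}")
    if not isinstance(doc.get("overall_score"), (int, float)):
        errors.append("missing or non-numeric overall_score")
    return errors


doc = {
    "axis_scores": {axis: {"score": 65, "justification": "..."} for axis in AXES},
    "overall_score": 64.5,
}
print(validate_score(doc))
```

Rejecting a malformed document here keeps a truncated or partial reviewer response from being compared against the previous iteration.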
3. Apply revision
Load the verbatim Content Refinement Agent prompt at references/prompt.md.
Prepend the Anti-Leakage Prompt. Inputs:
- paper.tex: current draft
- paper.pdf: compiled PDF (multimodal context if available)
- conference_guidelines.md
- experimental_log.md: ground truth for numeric claims
- worklog.json: history of previous changes
- citation_pool.json: the allowed bibliography
- reviewer_feedback: the JSON from Step 1
The prompt instructs the model to address weaknesses, integrate question answers, and emit two output blocks:
- A worklog JSON {addressed_weaknesses[], integrated_answers[], actions_taken[]}
- The full revised LaTeX code
Save the revised LaTeX as iter<N>/paper.tex. Append the worklog JSON to
workspace/refinement/worklog.json via apply_worklog.py.
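As an illustration of what an append-only worklog update might look like, here is a minimal sketch. It does not describe the actual behavior of apply_worklog.py: the `append_worklog` helper is hypothetical, as is the assumption that worklog.json holds a top-level JSON array of entries.

```python
import json
from pathlib import Path


def append_worklog(path, entry):
    """Append one entry to a JSON-array worklog and return the new length.

    Existing entries are never modified. The file is rewritten through a
    temporary file so an interrupted write does not leave partial JSON.
    """
    path = Path(path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    if not isinstance(entries, list):
        raise ValueError("worklog must be a JSON array")
    entries.append(entry)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    tmp.replace(path)  # replace the old file in a single step
    return len(entries)
```

A call such as `append_worklog("workspace/refinement/worklog.json", {"iteration": 1, "decision": "ACCEPT"})` would add one record per iteration; the entry fields shown are placeholders.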
4. Compile and re-score
cd workspace/refinement/iter<N>/ && latexmk -pdf -interaction=nonstopmode paper.tex
Then re-run the simulated review on the new draft → updated score.json
for the new iteration. (This is the "re-score after revision" call.)
5. Apply the accept/revert decision
The calling loop must track CONSECUTIVE_SMALL (starts at 0) and pass it
on each call so score_delta.py can detect the plateau:
python skills/content-refinement-agent/scripts/score_delta.py \
  --prev workspace/refinement/iter<N-1>/score.json \
  --curr workspace/refinement/iter<N>/score.json \
  --plateau-threshold 1.0 \
  --plateau-streak 3 \
  --consecutive-small $CONSECUTIVE_SMALL \
  > workspace/refinement/iter<N>/delta.json
EXIT=$?
# Update streak for next iteration:
CONSECUTIVE_SMALL=$(python3 -c "
import json
d = json.load(open('workspace/refinement/iter<N>/delta.json'))
print(d['consecutive_small'])
")
Exit codes:
- 0: ACCEPT (overall improved, or tied with non-negative net sub-axis change; no plateau)
- 1: REVERT (overall decreased)
- 2: REVERT (tied overall, but net sub-axis change negative)
- 4: HALT_PLATEAU (accepted, but N consecutive iterations below threshold; stop early)
Behavior:
- ACCEPT (exit 0): keep iter<N>/paper.tex as the new best. Continue to iter N+1.
- REVERT (exit 1 or 2): copy iter<N-1>/paper.tex back as canonical, halt.
- HALT_PLATEAU (exit 4): keep current (it was accepted), but stop, since further iterations are unlikely to yield meaningful gains. In practice ~85% of refinement gain comes in iteration 1; the plateau fires when subsequent iterations improve by less than 1 point for 3 consecutive rounds.
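The exit codes and plateau rule above can be sketched in Python as follows. This is not the source of score_delta.py; it is a reading of the documented behavior under stated assumptions: net sub-axis change is taken as the sum of per-axis score differences, an accepted iteration counts as "small" when its overall gain is below the threshold, and the streak resets to 0 on any larger gain or on a revert. The `decide` and `net_subaxis` names are hypothetical.

```python
ACCEPT, REVERT_LOWER, REVERT_SUBAXIS, HALT_PLATEAU = 0, 1, 2, 4


def net_subaxis(prev, curr):
    """Sum of per-axis score changes (assumed definition of net sub-axis)."""
    return sum(
        curr["axis_scores"][axis]["score"] - prev["axis_scores"][axis]["score"]
        for axis in prev["axis_scores"]
    )


def decide(prev, curr, consecutive_small=0, plateau_threshold=1.0, plateau_streak=3):
    """Return (exit_code, updated_consecutive_small) for one iteration."""
    delta = curr["overall_score"] - prev["overall_score"]
    if delta < 0:
        return REVERT_LOWER, 0
    if delta == 0 and net_subaxis(prev, curr) < 0:
        return REVERT_SUBAXIS, 0
    # Accepted: extend the streak on a small gain, otherwise reset it.
    streak = consecutive_small + 1 if delta < plateau_threshold else 0
    if streak >= plateau_streak:
        return HALT_PLATEAU, streak
    return ACCEPT, streak


def make_score(overall, clarity):
    # Tiny single-axis document, just enough to exercise the rule.
    return {"overall_score": overall, "axis_scores": {"writing_clarity": {"score": clarity}}}


prev = make_score(64.5, 55)
print(decide(prev, make_score(68.0, 60)))                        # clear gain
print(decide(prev, make_score(64.9, 56), consecutive_small=2))   # third small gain
```

The caller would carry the second element of the result into the next iteration, in the same way the shell loop carries CONSECUTIVE_SMALL.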
Always log the decision via `apply_wor