Challenging — Adversarial Review
Adversarial review before shipping. Three paths, each with distinct trade-offs:
- Subagent path (Claude Code, primary). Native sub-Claude via the Task tool — zero API keys, fresh context window, same model. Best when available.
- External API path (claude.ai, Codex, headless). Gemini (cross-model + cross-context) or Anthropic API (cross-context). Costs an incremental API call but gives genuine outside perspective.
- Self path (any environment). The caller assistant inhabits the adversary persona in a dedicated response. Zero cost, retains full subject-matter context from the conversation. Weaker at catching same-session confabulations than fresh-context adversaries, stronger at catching local-convention and factual errors the artifact glosses over. Not a strict downgrade — a different failure-mode profile.
The adversary='auto' resolution (default) picks gemini → claude → self based on available credentials. Callers in Claude Code still use prepare() + Task tool explicitly (subagent is strictly better than self in that environment, and auto-detection of Claude Code is brittle).
Inspired by VDD (dollspace.gay) and Grainulation's anti-rationalization patterns. The drill helper adopts the 5 Whys pattern from Tim Kellogg's open-strix writeup. The self-path persona-inhabitation move is kin to generative-thinking's inversion — commit to the mode before evaluating.
Profiles
Pick the profile matching your artifact. Read only the profile you need — each is self-contained with persona, anti-rationalization table, evaluation criteria, and adversary system prompt.
| Profile | Use For | Iteration strategy | File |
|---|---|---|---|
prose | Blog posts, essays, articles — generic prose competence | parallel replay | references/prose.md |
prose-register | Prose with a named voice signature — fidelity check | parallel replay | references/prose-register.md |
analysis | Research briefs, comparisons, synthesis | parallel replay | references/analysis.md |
code | Scripts, implementations, PRs | parallel replay | references/code.md |
recommendation | Technical decisions, architecture choices | parallel replay | references/recommendation.md |
drill | 5 Whys on one finding from a review | sequential deepen | references/drill.md |
One engine, one surface, two iteration strategies. Review profiles iterate in parallel replay — each pass independent, novelty tracked for confabulation. Drill iterates in sequential deepen — each pass takes the chain so far and produces one more why-level until bedrock or max depth, followed by a synthesis pass that extracts root causes.
prose and prose-register are siblings, not redundant. prose evaluates generic prose competence (claims, logic, structure, performed insight) and is explicitly told not to comment on style. prose-register evaluates fidelity to a named voice signature passed via voice=... (positive markers + anti-patterns) and ignores generic competence. Run both passes on voiced prose — they catch different failure modes.
Usage — Claude Code (subagent path, primary)
Two-step protocol: a Python helper builds the prompt, you spawn a subagent via the Task tool, then a parser turns its response into structured findings.
import sys
sys.path.insert(0, '/mnt/skills/user/challenging/scripts')
from challenger import prepare, parse_response
job = prepare(
artifact=open('/home/claude/draft.md').read(),
profile='prose',
context='Blog post about RAG scaling laws',
)
Then invoke the Task tool — subagent_type='general-purpose', prompt=job['prompt'], description='Adversarial review (prose)'. The subagent runs in a fresh context, applies the persona, and returns a JSON message. Pass that message text to the parser:
result = parse_response(subagent_text)
print(result['verdict']) # SHIP | REVISE | RETHINK
print(result['findings']) # List of specific issues
print(result['strengths']) # What to preserve
Why subagents (not the API): no key, no network dependency, fresh context, and the same Claude that's reviewing your work is reviewing it again with no prior bias — but in a clean window. For cross-model diversity (genuinely different blind spots), use Gemini below.
Voiced prose — prose-register (subagent path)
For prose written in a named voice, prose will miss register-specific failure modes by design (its persona is instructed not to comment on style). Use the prose-register profile with a voice=... signature instead — or in addition, since the two profiles target orthogonal failure modes.
voice = """
Corvid voice signature.
Positive markers: short sentences when certainty is high, dry observations,
lead with the answer then context, no throat-clearing.
Anti-patterns: heroic-narrator framing, drama-line-breaks ("Then the part that
almost killed it." floating alone), cliche tells ("shot itself in the foot",
"footgun"), time-scale inflation ("a month ago" when it was yesterday),
performed-significance setups ("What's interesting about this is..."),
RTFM-performed-as-revelation, precious sentimentality in closings.
"""
job = prepare(
artifact=open('/home/claude/draft.md').read(),
profile='prose-register',
context='Blog post for muninn.austegard.com',
voice=voice,
)
# Task tool → parse_response() as usual.
voice is required for prose-register and rejected for every other profile — pass the signature once, fail loud if it leaks to the wrong call. The richer the signature (with both positive markers and explicit anti-patterns), the more precise the review. A thin signature returns verdict: RETHINK with a finding describing what to sharpen rather than a confident review against an ambiguous spec.
Drill — 5 Whys on a systemic finding (subagent path)
Drill uses the same prepare() / Task / parse_response() protocol as review profiles, but you run the loop yourself — each pass takes the chain so far and produces exactly ONE new {why, because}. When the adversary sets bedrock=true (or you hit max depth), run a final synthesis pass.
from challenger import prepare, parse_response
suspect = next(f for f in result['findings'] if f['severity'] in ('high', 'critical'))
artifact = open('/home/claude/draft.md').read()
ctx = 'Blog post about RAG scaling laws'
chain = []
for depth in range(1, 6):
job = prepare(artifact, 'drill', context=ctx, finding=suspect, chain=chain)
# Task tool: subagent_type='general-purpose', prompt=job['prompt']
step = parse_response(subagent_text) # {why, because, bedrock, reasoning}
chain.append({'why': step['why'], 'because': step['because']})
if step.get('bedrock'):
break
# Final synthesis pass over the completed chain
job = prepare(artifact, 'drill', context=ctx, finding=suspect, chain=chain, synthesize=True)
# Task tool again, then:
diagnosis = parse_response(subagent_text)
print(diagnosis['chain']) # [{why, because}, ...]
print(diagnosis['root_causes']) # usually 3-4 distinct systemic issues
print(diagnosis['direction']) # compass heading for the process fix
finding accepts either a dict from parse_response() or a free-text description. Patches fix the instance; drills fix the class. Why sequential, not parallel? A single-shot drill lets the model shortcut the whole tree and produces renames instead of explanations; one level per pass forces each "because" to earn its depth. See references/drill.md for when to drill and the anti-patterns to reject.
Usage — claude.ai, Codex, headless scripts (API path)
Where subagents aren't available, call an external model directly.
from challenger import challenge
result = challenge(
artifact=open('draft.md').read(),
profile='prose',
context='Blog post about RAG scaling laws',