Experiment Audit: Cross-Model Integrity Verification

🔒 Do not wrap this skill in /loop, /schedule, or CronCreate. It is verdict-bearing: it judges experiment integrity. Re-running that verdict on a timer adds no new signal, and a loop that uses its own output to decide when to stop crosses into self-acquittal (see acceptance-gate.md). Schedule the external wait that precedes it instead: wait for the experiments to finish, then audit once. See shared-references/external-cadence.md.

Audit experiment integrity for: $ARGUMENTS

Why This Exists

LLM agents can produce fraudulent experimental results through:

Fake ground truth: creating a synthetic "reference" from model outputs, then reporting high agreement with it as performance
Score normalization: dividing a metric by the model's own maximum so that scores reach 0.99+
Phantom results: claiming numbers from files that don't exist or from functions that are never called
Insufficient scope: reporting 2-scene pilots as "comprehensive evaluation"

These are NOT intentional deception — they are failure modes of optimizing agents that lack integrity constraints. This skill adds that constraint.
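
To make the first two failure modes concrete, here is a minimal Python sketch. All names and numbers in it (`agreement`, `self_normalized`, the label and score lists) are hypothetical illustrations, not taken from any audited project.

```python
def agreement(preds, reference):
    """Fraction of predictions that equal the reference labels."""
    return sum(p == r for p, r in zip(preds, reference)) / len(reference)


def self_normalized(scores):
    """ANTI-PATTERN: divide every score by the model's own best score."""
    top = max(scores)
    return [s / top for s in scores]


preds = [1, 0, 1, 1]            # hypothetical model outputs
dataset_labels = [1, 1, 0, 1]   # hypothetical dataset-provided ground truth

# Fake ground truth: the "reference" is just a copy of the model outputs,
# so agreement is perfect regardless of real quality.
fake_reference = list(preds)
fake_score = agreement(preds, fake_reference)
real_score = agreement(preds, dataset_labels)

# Score normalization: weak raw scores turn into values at or near 1.0.
raw_scores = [0.31, 0.28, 0.33]  # hypothetical raw metric values
normalized = self_normalized(raw_scores)
```

In this toy case `fake_score` is 1.0 while `real_score` is 0.5, and the best normalized score is 1.0 although no raw score exceeds 0.33. Neither inflated number says anything about the model.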

Core Principle

The executor collects file paths. The external reviewer backend reads code and judges integrity. The executor does NOT participate in integrity judgment.

This follows shared-references/reviewer-independence.md and shared-references/experiment-integrity.md.

Constants

REVIEWER_BACKEND = codex — Default: Codex MCP (xhigh). Override with — reviewer: oracle-pro for Oracle MCP, or — reviewer: manual for Manual Review MCP. If manual-review MCP is unavailable, stop and print the install command; do not fall back to Codex. See shared-references/reviewer-routing.md.

Reviewer Calling Convention

When calling the reviewer, branch on REVIEWER_BACKEND:

If REVIEWER_BACKEND = codex: Use mcp__codex__codex for new review threads. Use mcp__codex__codex-reply for follow-up rounds (reuse threadId).

If REVIEWER_BACKEND = manual: Use mcp__manual_review__review for new review threads with:

prompt: [exact same prompt that would go to Codex]
config: {"model_reasoning_effort": "xhigh"}

Save the returned threadId. Use mcp__manual_review__review_reply for follow-up rounds with:

threadId: [saved manual-review threadId]
prompt: [follow-up prompt]
config: {"model_reasoning_effort": "xhigh"}

Prompt fidelity: the manual prompt must be exactly the same text that Codex would receive. Review tracing applies equally to both backends.
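
The branching above can be sketched as a small dispatch helper. This is an illustration under assumptions: `reviewer_call` is a hypothetical name, it only builds the tool name and arguments rather than invoking MCP, the Codex arguments shown are a partial set, and the Codex reply arguments (`threadId`, `prompt`) are assumed from "reuse threadId". The oracle-pro backend is left out because its tool names are not given here.

```python
XHIGH = {"model_reasoning_effort": "xhigh"}

# (new-thread tool, follow-up tool) per backend, as named in this section.
TOOLS = {
    "codex": ("mcp__codex__codex", "mcp__codex__codex-reply"),
    "manual": ("mcp__manual_review__review", "mcp__manual_review__review_reply"),
}


def reviewer_call(backend, prompt, thread_id=None):
    """Return (tool_name, arguments) for one review round.

    Hypothetical helper: the caller is responsible for the actual MCP call.
    """
    if backend not in TOOLS:
        # No silent fallback to Codex: unknown backends stop here.
        raise ValueError(f"unsupported reviewer backend: {backend!r}")
    new_tool, reply_tool = TOOLS[backend]
    if thread_id is None:
        # New thread: both backends receive the identical prompt text.
        return new_tool, {"prompt": prompt, "config": dict(XHIGH)}
    args = {"threadId": thread_id, "prompt": prompt}
    if backend == "manual":
        # The manual follow-up call lists config explicitly.
        args["config"] = dict(XHIGH)
    return reply_tool, args
```

Because the prompt is a plain argument, the same string object can be passed for either backend, which keeps prompt fidelity easy to verify.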

Workflow

Step 1: Collect Artifacts (Executor — Claude)

Locate and list these files WITHOUT reading or summarizing their content:

Scan project directory for:
1. Evaluation scripts:    *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files:          *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths:    look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker:    EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims:          NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files:          *.yaml, *.toml, *.json configs with metric definitions

DO NOT summarize, interpret, or explain any file content. Only collect paths.
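
A path-only scan of this kind could look like the following Python sketch. It is an illustration under assumptions: `collect_artifact_paths` is a hypothetical helper, the tracker and paper files are assumed to sit at the project root, and item 3 (ground truth paths) and JSON config files are left out because identifying them depends on file content. The function never opens a file.

```python
from pathlib import Path

# Patterns copied from the list above.
SCRIPT_PATTERNS = ("*eval*.py", "*metric*.py", "*test*.py", "*benchmark*.py")
RESULT_DIRS = ("results", "outputs", "logs")
RESULT_PATTERNS = ("*.json", "*.csv")
TRACKER_NAMES = ("EXPERIMENT_TRACKER.md", "EXPERIMENT_LOG.md")
CLAIM_PATTERNS = ("NARRATIVE_REPORT.md", "paper/sections/*.tex", "PAPER_PLAN.md")
CONFIG_PATTERNS = ("*.yaml", "*.toml")


def _paths(root, base, patterns, recursive):
    """Return sorted root-relative paths under base that match any pattern."""
    if not base.is_dir():
        return []
    walk = base.rglob if recursive else base.glob
    hits = {
        p.relative_to(root).as_posix()
        for pattern in patterns
        for p in walk(pattern)
        if p.is_file()
    }
    return sorted(hits)


def collect_artifact_paths(project_dir):
    """List artifact paths by category without reading any file content."""
    root = Path(project_dir)
    results = []
    for name in RESULT_DIRS:
        results += _paths(root, root / name, RESULT_PATTERNS, recursive=True)
    return {
        "evaluation_scripts": _paths(root, root, SCRIPT_PATTERNS, True),
        "result_files": results,
        "experiment_tracker": _paths(root, root, TRACKER_NAMES, False),
        "paper_claims": _paths(root, root, CLAIM_PATTERNS, False),
        "config_files": _paths(root, root, CONFIG_PATTERNS, True),
    }
```

The returned dictionary maps directly onto the "Files to read" placeholders in the reviewer prompt, so the executor can fill them in without interpreting anything.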

Step 2: Send to Reviewer

Based on the selected reviewer backend (see Reviewer Calling Convention), pass ONLY file paths and the audit checklist to the reviewer. The reviewer reads everything directly.

For codex, call mcp__codex__codex with:

model: gpt-5.5
config: {"model_reasoning_effort": "xhigh"}
sandbox: read-only
cwd: [project directory]
prompt: [the exact full prompt below]

For manual, call mcp__manual_review__review with:

config: {"model_reasoning_effort": "xhigh"}
prompt: [the exact full prompt below]

Manual review does not accept the Codex-only model, sandbox, or cwd parameters, so include the same file paths in the prompt itself for the user to inspect.

Use this exact prompt for both backends:

    You are an experiment integrity auditor. Read ALL files listed below
    and check for the following fraud patterns.

    Files to read:
    - Evaluation scripts: [list paths]
    - Result files: [list paths]
    - Experiment tracker: [list paths]
    - Paper claims: [list paths]
    - Config files: [list paths]

    ## Audit Checklist

    ### A. Ground Truth Provenance
    For each evaluation script:
    1. Where does "ground truth" / "reference" / "target" come from?
    2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
    3. If derived: is it explicitly labeled as proxy evaluation?
    4. Are official eval scripts used when available for this benchmark?
    FAIL if: GT is derived from model outputs without explicit proxy labeling.

    ### B. Score Normalization
    For each metric computation:
    1. Is any metric divided by max/min/mean of the model's OWN output?
    2. Are raw scores reported alongside any normalized scores?
    3. Are any scores suspiciously close to 1.0 or 100%?
    FAIL if: Normalization denominator comes from prediction statistics.

    ### C. Result File Existence
    For each claim in the paper/narrative:
    1. Does the referenced result file actually exist?
    2. Does the claimed metric key exist in that file?
    3. Does the claimed NUMBER match what's in the file?
    4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
    FAIL if: Claimed results reference nonexistent files or mismatched numbers.

    ### D. Dead Code Detection
    For each metric function defined in eval scripts:
    1. Is it actually CALLED in any evaluation pipeline?
    2. Does its output appear in any result file?
    WARN if: Metric functions exist but are never called.

    ### E. Scope Assessment
    1. How many scenes/datasets/configurations were actually tested?
    2. How many seeds/runs per configuration?
    3. Does the paper use words like "comprehensive", "extensive", "robust"?
    4. Is the actual scope sufficient for those claims?
    WARN if: Scope language exceeds actual evidence.

    ### F. Evaluation Type Classification
    Classify each evaluation as:
    - real_gt: uses dataset-provided ground truth
    - synthetic_proxy: uses model-generated reference
    - self_supervised_proxy: no GT by design
    - simulation_only: simulated environment
    - human_eval: human judges

    ## Output Format

    For each check (A-F), report:
    - Status: PASS | WARN | FAIL
    - Evidence: exact file:line references
    - Details: what specifically was found

    Overall verdict: PASS | WARN | FAIL
    
    Be thorough. Read every eval script line by line.
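
Outside the prompt text, it may help to see what check C amounts to for a single claim. The Python sketch below is illustrative only: `check_claim` is a hypothetical function, it assumes a flat JSON result file with top-level metric keys, and the tolerance is an arbitrary choice. The judgment itself stays with the reviewer; this is not something the executor runs to reach a verdict.

```python
import json
from pathlib import Path


def check_claim(result_file, metric_key, claimed_value, tol=1e-6):
    """Compare one claimed number against one result file.

    Returns (status, detail). Assumes a flat JSON object of metric values.
    """
    path = Path(result_file)
    if not path.is_file():
        return "FAIL", f"{result_file}: file does not exist"
    data = json.loads(path.read_text())
    if metric_key not in data:
        return "FAIL", f"{result_file}: no key {metric_key!r}"
    actual = float(data[metric_key])
    if abs(actual - claimed_value) > tol:
        return "FAIL", f"{result_file}: claimed {claimed_value}, file has {actual}"
    return "PASS", f"{result_file}: {metric_key} = {actual}"
```

The three FAIL branches correspond to questions 1 to 3 of check C: missing file, missing key, mismatched number.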

Step 3: Parse and Write Report (Executor — Claude)

Parse the reviewer's response and write EXPERIMENT_AUDIT.md:

# Experiment Audit Report

**Date**: [today]
**Auditor**: External reviewer backend, xhigh reasoning (cross-model, read-only)
**Project**: [project name]

## Overall Verdict: [PASS | WARN | FAIL]

## Integrity Status: [pass | warn | fail]

## Checks

### A. Ground Truth Provenance: [PASS|WARN|FAIL]
[details + file:line evidence]

### B. Score Normalization: [PASS|WARN|FAIL]
[details]

### C. Result File Existence: [PASS|WARN|FAIL]
[details]

### D. Dead Code Detection: [PASS|WARN|FAIL]
[details]

### E. Scope Assessment: [PASS|WARN|FAIL]
[details]

### F. Evaluation Type: [real_gt | synthetic_proxy | ...]
[classification + evidence]

## Action Items
- [specific fixes if WARN or FAIL]

## Claim Impact
- Claim 1: [supported | needs qualifier | unsupported]
- Claim 2: ...
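
The parsing half of this step can be sketched as follows. This is a sketch under assumptions: the reviewer's reply is assumed to follow the requested Output Format closely (a lettered heading per check, then a "Status:" line, plus one "Overall verdict:" line), and `parse_review` is a hypothetical helper. It copies verdicts and never derives one, so the executor stays out of the integrity judgment.

```python
import re

CHECK_HEADING = re.compile(r"^\W*([A-F])\.\s")
STATUS = re.compile(r"Status\W*(PASS|WARN|FAIL)\b")
OVERALL = re.compile(r"Overall verdict\W*(PASS|WARN|FAIL)\b", re.IGNORECASE)


def parse_review(text):
    """Extract the reviewer's own verdicts from its response text."""
    statuses, current, overall = {}, None, None
    for line in text.splitlines():
        match = OVERALL.search(line)
        if match:
            overall = match.group(1).lower()
            continue
        match = CHECK_HEADING.match(line)
        if match:
            current = match.group(1)  # remember which check we are inside
            continue
        match = STATUS.search(line)
        if match and current is not None:
            # Keep the first status reported under each check heading.
            statuses.setdefault(current, match.group(1).lower())
    if overall is None:
        # Refuse to guess: a missing verdict is an error, not a PASS.
        raise ValueError("reviewer response contains no overall verdict")
    return {"overall_verdict": overall, "checks": statuses}
```

Raising on a missing verdict is deliberate: filling in a default would amount to the executor issuing its own judgment.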

Also write EXPERIMENT_AUDIT.json for machine consumption:

{
  "date": "2026-04-10",
  "auditor": "external-reviewer-xhigh",
  "overall_verdict": "warn",
  "integrity_status": "warn",
  "checks": {
    "gt_provenance": {"status": "pass", "details": "..."},
    "score_normalization": {"status": "warn", "deta
