Orchestration Log: When this skill is activated, append a log entry to outputs/orchestration_log.md:
### Skill Activation: Audit Engine
**Timestamp:** [current date/time]
**Actor:** AI Agent (audit-engine)
**Input:** [paper + repo being audited]
**Output:** [brief summary, e.g., "Audited 18 claims: 12 CONFIRMED, 3 PARTIAL, 2 MISSING, 1 MISMATCH"]
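The log entry can be produced mechanically. The following is a minimal sketch, assuming a plain-text append is acceptable; the function names `format_log_entry` and `append_log_entry` and the timestamp format are illustrative choices, not part of the skill definition.

```python
from datetime import datetime
from pathlib import Path


def format_log_entry(input_desc: str, output_summary: str, timestamp: str) -> str:
    """Render one activation entry using the field names from the template."""
    return (
        "### Skill Activation: Audit Engine\n"
        f"**Timestamp:** {timestamp}\n"
        "**Actor:** AI Agent (audit-engine)\n"
        f"**Input:** {input_desc}\n"
        f"**Output:** {output_summary}\n"
    )


def append_log_entry(log_path: str, input_desc: str, output_summary: str) -> None:
    """Append (never overwrite) an entry, creating the outputs/ folder if needed."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumed timestamp format; the skill only asks for "current date/time".
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    with path.open("a", encoding="utf-8") as handle:
        handle.write("\n" + format_log_entry(input_desc, output_summary, stamp))
```

A call such as `append_log_entry("outputs/orchestration_log.md", "paper.tex + ./experiments/", "Audited 18 claims: ...")` would add one entry per activation.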
Audit Engine
Core Principle
Papers make claims. Code embodies what was actually done. This engine systematically checks whether the two agree. For every empirical or technical claim in the paper — datasets used, models trained, metrics reported, hyperparameters set, ablations run — the engine locates supporting evidence in the linked repository and classifies the match.
This is the complement to verification-engine, which checks citations against
external sources. Audit-engine checks the paper's own claims against the paper's
own code. Together they cover both failure modes of LLM-assisted writing: mis-cited
prior work and unsupported own-work claims.
Inspired by the /audit command in the Feynman research agent (Companion AI, 2026),
adapted to the IS/CS methodological style of this plugin.
When to Activate
- User says "audit the paper", "check paper vs. code", "verify my experiments", "reproducibility audit", "does the code match what I wrote"
- Before submitting a paper with an accompanying code release
- Before open-sourcing the repo of a published paper
- When reviewing someone else's paper + artifact
- As optional Phase 7.5 of the paper machine pipeline (after verify-citations, before prepare-submission)
When NOT to Activate
- The paper has no code artifact (pure theory, position paper, qualitative study without computational analysis) → say so and exit
- The user wants to verify citations → activate verification-engine instead
- The user wants to check writing quality → activate the writing-engine /analyze-writing command
Inputs
Required:
- Paper source: paper.tex, draft.md, or explicit $ARGUMENTS path
- Code repository, one of:
  - Local path (./experiments/, ~/repos/myproject)
  - GitHub URL (clone or use gh repo view / WebFetch on raw files)
  - Archive link (Zenodo, OSF): ask user to download locally first
If the repo location is not supplied, scan the paper for common signals:
- "Code available at [URL]" / "Our implementation is at [URL]"
- GitHub URLs in footnotes or acknowledgements
- A code_availability section
- A REPRODUCIBILITY.md, ARTIFACT.md, or similar file sibling to the paper
If still not found: ask the user once, then exit.
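The signal scan above can be sketched as a small line-by-line search. This is an illustration only: the helper name `find_repo_candidates`, the host list, and the exact phrase patterns are assumptions, and a real paper may phrase its code-availability statement differently.

```python
import re
from typing import List, Tuple

# Any http(s) URL, stopping at whitespace or a closing bracket/brace
# so that LaTeX forms like \url{...} are handled.
URL_RE = re.compile(r"https?://[^\s)}\]>]+")
# Phrases listed in the skill as common signals (assumed wording variants).
SIGNAL_RE = re.compile(r"code (?:is )?available at|our implementation is at", re.IGNORECASE)
# Hosts named in the Inputs section.
REPO_HOSTS = ("github.com", "zenodo.org", "osf.io")


def find_repo_candidates(paper_text: str) -> List[Tuple[int, str]]:
    """Return (line number, URL) pairs that look like a code-repository link."""
    candidates: List[Tuple[int, str]] = []
    seen = set()
    for line_no, line in enumerate(paper_text.splitlines(), start=1):
        for match in URL_RE.finditer(line):
            url = match.group(0).rstrip(".,;")  # drop sentence punctuation
            is_signal = SIGNAL_RE.search(line) or any(host in url for host in REPO_HOSTS)
            if is_signal and url not in seen:
                seen.add(url)
                candidates.append((line_no, url))
    return candidates
```

If the returned list is empty, the rule above still applies: ask the user once, then exit.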
Step 1: Extract Auditable Claims
Scan the paper for claims that can be checked against code. Ignore claims that are purely conceptual, historical, or theoretical.
Claim Categories (check in order)
| Category | What to look for | Priority |
|---|---|---|
| Dataset | Named datasets, split sizes, sample counts, data sources | HIGH |
| Model | Model names, architectures, parameter counts, checkpoints | HIGH |
| Training | Epochs, batch size, learning rate, optimizer, hardware | HIGH |
| Metrics | Reported numbers (accuracy, F1, BLEU, loss values, percentages) | HIGH |
| Experiments | Named experimental conditions, ablations, baselines | HIGH |
| Hyperparameters | Specific values in tables or "Training Details" | MEDIUM |
| Preprocessing | Tokenization, normalization, filtering steps | MEDIUM |
| Evaluation | Test protocol, prompt templates, judge models, seeds | MEDIUM |
| Infrastructure | GPUs, training time, framework versions | LOW |
| Figures | Plots claimed to come from "our experiments" | MEDIUM |
Extraction Pattern
For each claim, record:
{
id: "C01",
category: "Model",
section: "4.2 Model Training",
claim_text: "We fine-tune LLaMA-3-8B for 3 epochs with a learning rate of 2e-5.",
testable_facts: [
"model == LLaMA-3-8B",
"epochs == 3",
"learning_rate == 2e-5"
],
priority: "HIGH"
}
Claims with concrete numbers, names, or identifiers are testable. Vague claims
("we use a standard transformer") are not auditable — mark them as NOT_AUDITABLE
and skip.
Output: outputs/audit_claims.md — numbered list of all testable claims.
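One possible way to hold these records in code is a small dataclass. The sketch below is an assumption about representation, not a required format: the `Claim` class, the `auditable` property, and `render_claims` are hypothetical names, and the priority map simply mirrors the category table.

```python
from dataclasses import dataclass, field
from typing import List

# Priorities copied from the claim-category table.
CATEGORY_PRIORITY = {
    "Dataset": "HIGH", "Model": "HIGH", "Training": "HIGH",
    "Metrics": "HIGH", "Experiments": "HIGH",
    "Hyperparameters": "MEDIUM", "Preprocessing": "MEDIUM",
    "Evaluation": "MEDIUM", "Figures": "MEDIUM",
    "Infrastructure": "LOW",
}


@dataclass
class Claim:
    id: str
    category: str
    section: str
    claim_text: str
    testable_facts: List[str] = field(default_factory=list)

    @property
    def priority(self) -> str:
        return CATEGORY_PRIORITY[self.category]

    @property
    def auditable(self) -> bool:
        # A claim with no concrete, checkable fact is NOT_AUDITABLE and is skipped.
        return bool(self.testable_facts)


def render_claims(claims: List[Claim]) -> str:
    """Render the numbered list of testable claims for outputs/audit_claims.md."""
    lines: List[str] = []
    testable = [claim for claim in claims if claim.auditable]
    for number, claim in enumerate(testable, start=1):
        lines.append(f"{number}. [{claim.id}] ({claim.category}, {claim.priority}) {claim.claim_text}")
        lines.extend(f"   - {fact}" for fact in claim.testable_facts)
    return "\n".join(lines) + "\n"
```

Keeping `testable_facts` as plain strings matches the record shown above; a stricter design could parse each fact into a key, operator, and value.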
Step 2: Map the Repository
Before searching, build a lightweight mental map of the repo. Do not read every file.
- Top-level listing: Glob on **/*.{py,ipynb,yaml,yml,json,toml,sh,md} at depth 2-3
- Identify key files by name convention:
  - train.py, main.py, run_experiments.py, eval.py → entry points
  - config.yaml, hparams.json, sweep.yaml, *.toml → configuration
  - requirements.txt, pyproject.toml, environment.yml → dependencies
  - README.md, REPRODUCE.md, docs/ → documentation
  - results/, outputs/, logs/, wandb/ → experiment artifacts
  - datasets/, data/, load_data.py → data loaders
- Detect framework — PyTorch, JAX, TensorFlow, HuggingFace, scikit-learn — this guides search patterns
- Detect experiment tracking — wandb, mlflow, tensorboard, plain CSV logs
Record this as an internal map; do not output it unless the user asks.
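The name-convention step can be approximated with a lookup over relative paths. The sketch below is illustrative: `classify_path` and `map_repo` are hypothetical helpers, paths are assumed to be non-empty POSIX-style relative paths, and checking "dependencies" before "configuration" is an ordering assumption so that pyproject.toml is not captured by the broader *.toml pattern.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Dict, Iterable, List, Optional

# (role, file-name patterns, directory names), taken from the conventions above.
ROLE_RULES = [
    ("entry point", ["train.py", "main.py", "run_experiments.py", "eval.py"], []),
    ("dependencies", ["requirements.txt", "pyproject.toml", "environment.yml"], []),
    ("configuration", ["config.yaml", "hparams.json", "sweep.yaml", "*.toml"], []),
    ("documentation", ["README.md", "REPRODUCE.md"], ["docs"]),
    ("experiment artifacts", [], ["results", "outputs", "logs", "wandb"]),
    ("data loaders", ["load_data.py"], ["datasets", "data"]),
]


def classify_path(path: str) -> Optional[str]:
    """Return the role of a repo-relative path, or None if no convention matches."""
    parts = PurePosixPath(path).parts
    name, parents = parts[-1], parts[:-1]
    # File-name conventions take precedence over directory conventions.
    for role, file_patterns, _ in ROLE_RULES:
        if any(fnmatchcase(name, pattern) for pattern in file_patterns):
            return role
    for role, _, dir_names in ROLE_RULES:
        if any(parent in dir_names for parent in parents):
            return role
    return None


def map_repo(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Group paths by role; unmatched files are left out of the map."""
    repo_map: Dict[str, List[str]] = {}
    for path in paths:
        role = classify_path(path)
        if role is not None:
            repo_map.setdefault(role, []).append(path)
    return repo_map
```

The input list would come from the top-level Glob listing, so the map stays lightweight and no file contents are read at this stage.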
Step 3: Search for Evidence (per claim)
For each testable claim, systematically search for supporting code evidence.
Search Strategy
Use Grep and Read — NOT an agent — for transparency. Each lookup should produce
a file path and line number that can be cited in the report.
Example — Model claim "LLaMA-3-8B, 3 epochs, lr=2e-5":
- Search for the model name: Grep "llama-?3-?8b|Llama-3-8B" --type py
- Search for learning rate: Grep "2e-?5|0.00002|learning_rate.*2e-5"
- Search for epochs: Grep "epochs\s*[:=]\s*3|num_epochs.*3"
- Check config files: Read config/*.yaml for matching values
- If wandb/mlflow logs exist, grep those too
Example — Metric claim "we report an F1 of 0.87":
- Search results files: Grep "0\.87" --type json --type csv --type md
- Search eval scripts: Grep -l "f1_score|F1" eval*.py
- Check if the number appears in a logged output
Example — Dataset claim "trained on 12,000 examples from OpenReview":
- Search for dataset loader: Grep "openreview" -i
- Check size assertions: Grep "12000|12_000|len\(.*\).*12"
- Read the data loading function to confirm source
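For readers who want to see what a lookup yields, the searches above can be imitated in plain Python. This is a sketch of the idea only, not the Grep tool itself: `search_text` and `search_repo` are hypothetical helpers, the suffix list is an assumption, and matching is case-insensitive by default.

```python
import re
from pathlib import Path
from typing import Dict, List, Tuple, Union

Hit = Dict[str, Union[str, int]]


def search_text(file_name: str, text: str, pattern: str, flags: int = re.IGNORECASE) -> List[Hit]:
    """Return one hit per matching line, each with a citable file and line number."""
    regex = re.compile(pattern, flags)
    return [
        {"file": file_name, "line": line_no, "snippet": line.strip()}
        for line_no, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def search_repo(root: str, pattern: str,
                suffixes: Tuple[str, ...] = (".py", ".yaml", ".yml", ".json")) -> List[Hit]:
    """Apply search_text to every matching file under root (assumed local checkout)."""
    hits: List[Hit] = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits.extend(search_text(path.relative_to(root).as_posix(), text, pattern))
    return hits
```

Every hit carries a file path and a line number, which is what the report needs in order to cite evidence. A search that matches nothing returns an empty list rather than a guess.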
Record Evidence
For each claim, record:
{
id: "C01",
searches: ["llama-3-8b", "lr=2e-5", "epochs=3"],
hits: [
{file: "train.py", line: 42, snippet: "model_name = 'meta-llama/Llama-3-8B'"},
{file: "config/train.yaml", line: 7, snippet: "learning_rate: 2e-5"},
{file: "config/train.yaml", line: 8, snippet: "epochs: 5"} // NOTE mismatch
]
}
Do not hallucinate hits. If Grep returns nothing, record an empty hits list.
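The record can be assembled so that empty searches stay visible. In the sketch below, `build_evidence` and the extra `empty_searches` field are assumptions added for illustration; `search_fn` stands for whatever lookup is actually used and must return a list of hits (possibly empty) for a pattern.

```python
import json
from typing import Callable, Dict, List

Hit = Dict[str, object]


def build_evidence(claim_id: str, patterns: List[str],
                   search_fn: Callable[[str], List[Hit]]) -> Dict[str, object]:
    """Collect hits per search; searches with no result are recorded, not filled in."""
    hits: List[Hit] = []
    empty: List[str] = []
    for pattern in patterns:
        found = search_fn(pattern)
        if found:
            hits.extend(found)
        else:
            empty.append(pattern)
    return {
        "id": claim_id,
        "searches": list(patterns),
        "hits": hits,                # stays [] when nothing was found
        "empty_searches": empty,     # hypothetical field, useful for MISSING reports
    }


def to_json(record: Dict[str, object]) -> str:
    """Serialize a record for storage alongside the audit outputs."""
    return json.dumps(record, indent=2)
```

Tracking which searches returned nothing makes it straightforward to report them later for claims that end up without evidence.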
Step 4: Classify Each Claim
Classification Rubric
| Status | Criteria | Evidence |
|---|---|---|
| CONFIRMED | Every testable fact in the claim has matching code evidence | File + line for each fact |
| PARTIAL | Some facts confirmed, others missing or unchecked | Confirmed facts listed; gaps called out |
| MISSING | No code evidence found for any fact in the claim | Which searches returned empty |
| MISMATCH | Code evidence exists but contradicts the claim | Side-by-side: paper says X, code says Y |
| NOT_AUDITABLE | Claim is too vague to check, or code is not available | Brief reason |
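
The rubric can be read as a small decision function. The sketch below is one interpretation under stated assumptions: each fact is assumed to carry a verdict of "confirmed", "contradicted", or "missing" plus an optional file:line `ref`; a single contradiction is assumed to outrank partial confirmation; and a confirmed fact without a reference is downgraded, in line with the conservative stance of this skill. The name `classify_claim` is hypothetical.

```python
from typing import Dict, List, Optional


def classify_claim(fact_results: List[Dict[str, Optional[str]]],
                   code_available: bool = True) -> str:
    """Map per-fact verdicts onto the five statuses of the rubric."""
    if not code_available or not fact_results:
        return "NOT_AUDITABLE"
    verdicts = []
    for result in fact_results:
        verdict = result["verdict"]
        # Conservative rule: "confirmed" without a file:line reference does not count.
        if verdict == "confirmed" and not result.get("ref"):
            verdict = "missing"
        verdicts.append(verdict)
    if "contradicted" in verdicts:
        return "MISMATCH"      # assumed to take precedence over PARTIAL
    if all(v == "confirmed" for v in verdicts):
        return "CONFIRMED"
    if any(v == "confirmed" for v in verdicts):
        return "PARTIAL"
    return "MISSING"


# Usage with the C01 example: model and learning rate match, epochs do not.
c01 = [
    {"fact": "model == LLaMA-3-8B", "verdict": "confirmed", "ref": "train.py:42"},
    {"fact": "learning_rate == 2e-5", "verdict": "confirmed", "ref": "config/train.yaml:7"},
    {"fact": "epochs == 3", "verdict": "contradicted", "ref": "config/train.yaml:8"},
]
status = classify_claim(c01)
```

Deciding the per-fact verdict still requires reading the hit in context; the function only aggregates verdicts that have already been judged.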
Rules of Engagement
- Be conservative. If you're not sure a search hit actually supports the claim, mark PARTIAL and explain what's missing.
- Never mark CONFIRMED without a file:line reference. "I think it's probably in the training script" is not evidence.
- Treat MISMATCH as load-bearing. Even a single MISMATCH is worth flagging prominently — these are the findings the user most needs to know.
- **Distinguish MISSING from NOT_AUDITAB