Orchestration Log: When this skill is activated, append a log entry to outputs/orchestration_log.md:

### Skill Activation: Audit Engine
**Timestamp:** [current date/time]
**Actor:** AI Agent (audit-engine)
**Input:** [paper + repo being audited]
**Output:** [brief summary — e.g., "Audited 18 claims: 12 CONFIRMED, 3 PARTIAL, 2 MISSING, 1 MISMATCH"]
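
A minimal sketch of how the log entry above could be appended. The helper names `format_log_entry` and `append_log_entry` are hypothetical, and the ISO timestamp format is an assumption; the skill only specifies the fields and the target file.

```python
from datetime import datetime
from pathlib import Path


def format_log_entry(input_desc: str, output_summary: str, timestamp: str) -> str:
    """Render one log entry using the field layout of the template above."""
    return (
        "\n### Skill Activation: Audit Engine\n"
        f"**Timestamp:** {timestamp}\n"
        "**Actor:** AI Agent (audit-engine)\n"
        f"**Input:** {input_desc}\n"
        f"**Output:** {output_summary}\n"
    )


def append_log_entry(input_desc: str, output_summary: str,
                     log_path: str = "outputs/orchestration_log.md") -> None:
    """Append (never overwrite) an entry, creating outputs/ if needed."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: second-resolution local time in ISO format.
    timestamp = datetime.now().isoformat(timespec="seconds")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(format_log_entry(input_desc, output_summary, timestamp))
```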

Audit Engine

Core Principle

Papers make claims. Code embodies what was actually done. This engine systematically checks whether the two agree. For every empirical or technical claim in the paper — datasets used, models trained, metrics reported, hyperparameters set, ablations run — the engine locates supporting evidence in the linked repository and classifies the match.

This is the complement to verification-engine, which checks citations against external sources. Audit-engine checks the paper's own claims against the paper's own code. Together they cover both failure modes of LLM-assisted writing: mis-cited prior work and unsupported own-work claims.

Inspired by the /audit command in the Feynman research agent (Companion AI, 2026), adapted to the IS/CS methodological style of this plugin.

When to Activate

User says "audit the paper", "check paper vs. code", "verify my experiments", "reproducibility audit", "does the code match what I wrote"
Before submitting a paper with an accompanying code release
Before open-sourcing the repo of a published paper
When reviewing someone else's paper + artifact
As optional Phase 7.5 of the paper machine pipeline (after verify-citations, before prepare-submission)

When NOT to Activate

The paper has no code artifact (pure theory, position paper, qualitative study without computational analysis) → say so and exit
The user wants to verify citations → activate verification-engine instead
The user wants to check writing quality → activate the writing-engine /analyze-writing command

Inputs

Required:

Paper source — paper.tex, draft.md, or explicit $ARGUMENTS path
Code repository — one of:
- Local path (./experiments/, ~/repos/myproject)
- GitHub URL (clone or use gh repo view / WebFetch on raw files)
- Archive link (Zenodo, OSF) — ask user to download locally first

If the repo location is not supplied, scan the paper for common signals:

"Code available at [URL]" / "Our implementation is at [URL]"
GitHub URLs in footnotes or acknowledgements
A code_availability section
A REPRODUCIBILITY.md, ARTIFACT.md, or similar file sibling to the paper

If the repository still cannot be located, ask the user once; if no location is provided, exit.
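
The URL scan described above could look roughly like this. It is a sketch under stated assumptions: `find_repo_urls` is a hypothetical helper, and the host list (GitHub, Zenodo, OSF) is simply taken from the repository sources named in this section.

```python
import re

# Assumption: only the hosts mentioned in this skill are treated as repo signals.
REPO_URL = re.compile(
    r"https?://(?:www\.)?(?:github\.com|zenodo\.org|osf\.io)/[^\s{}()<>\[\]]+",
    re.IGNORECASE,
)


def find_repo_urls(paper_text: str) -> list[str]:
    """Return unique candidate repository URLs in order of first appearance."""
    found: list[str] = []
    for match in REPO_URL.finditer(paper_text):
        # Drop sentence punctuation that trails the URL in running text.
        url = match.group(0).rstrip(".,;:")
        if url not in found:
            found.append(url)
    return found
```

An empty result means none of these signals fired and the question goes to the user.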

Step 1: Extract Auditable Claims

Scan the paper for claims that can be checked against code. Ignore claims that are purely conceptual, historical, or theoretical.

Claim Categories (check in order)

Category	What to look for	Priority
Dataset	Named datasets, split sizes, sample counts, data sources	HIGH
Model	Model names, architectures, parameter counts, checkpoints	HIGH
Training	Epochs, batch size, learning rate, optimizer, hardware	HIGH
Metrics	Reported numbers (accuracy, F1, BLEU, loss values, percentages)	HIGH
Experiments	Named experimental conditions, ablations, baselines	HIGH
Hyperparameters	Specific values in tables or "Training Details"	MEDIUM
Preprocessing	Tokenization, normalization, filtering steps	MEDIUM
Evaluation	Test protocol, prompt templates, judge models, seeds	MEDIUM
Infrastructure	GPUs, training time, framework versions	LOW
Figures	Plots claimed to come from "our experiments"	MEDIUM

Extraction Pattern

For each claim, record:

{
  id: "C01",
  category: "Model",
  section: "4.2 Model Training",
  claim_text: "We fine-tune LLaMA-3-8B for 3 epochs with a learning rate of 2e-5.",
  testable_facts: [
    "model == LLaMA-3-8B",
    "epochs == 3",
    "learning_rate == 2e-5"
  ],
  priority: "HIGH"
}

Claims with concrete numbers, names, or identifiers are testable. Vague claims ("we use a standard transformer") are not auditable — mark them as NOT_AUDITABLE and skip.

Output: outputs/audit_claims.md — numbered list of all testable claims.
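
One way to hold the claim record in code is a small dataclass. This is a sketch, not a prescribed schema: the field names follow the extraction pattern above, while `Claim.auditable`, `render_claims`, and the rendered line format are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    id: str
    category: str
    section: str
    claim_text: str
    testable_facts: list[str] = field(default_factory=list)
    priority: str = "MEDIUM"

    @property
    def auditable(self) -> bool:
        # A claim with no concrete, checkable facts is NOT_AUDITABLE.
        return bool(self.testable_facts)


def render_claims(claims: list[Claim]) -> str:
    """Render claims as a numbered list (hypothetical layout for audit_claims.md)."""
    lines: list[str] = []
    for claim in claims:
        suffix = "" if claim.auditable else " (NOT_AUDITABLE)"
        lines.append(f"{claim.id}. [{claim.category}/{claim.priority}] {claim.claim_text}{suffix}")
        for fact in claim.testable_facts:
            lines.append(f"   - {fact}")
    return "\n".join(lines)
```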

Step 2: Map the Repository

Before searching, build a lightweight mental map of the repo. Do not read every file.

Top-level listing — Glob on **/*.{py,ipynb,yaml,yml,json,toml,sh,md} at depth 2-3
Identify key files by name convention:
- train.py, main.py, run_experiments.py, eval.py → entry points
- config.yaml, hparams.json, sweep.yaml, *.toml → configuration
- requirements.txt, pyproject.toml, environment.yml → dependencies
- README.md, REPRODUCE.md, docs/ → documentation
- results/, outputs/, logs/, wandb/ → experiment artifacts
- datasets/, data/, load_data.py → data loaders
Detect framework — PyTorch, JAX, TensorFlow, HuggingFace, scikit-learn — this guides search patterns
Detect experiment tracking — wandb, mlflow, tensorboard, plain CSV logs

Record this as an internal map; do not output it unless the user asks.
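
As an illustration of the name-convention pass, the sketch below sorts files into roles using the file names listed above. `classify_file`, `map_repo`, and the role labels are hypothetical; directory conventions such as results/ or wandb/ are left out to keep it short.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Optional

# Checked in order; dependencies come before configuration so that
# pyproject.toml is not swallowed by the generic *.toml pattern.
ROLE_PATTERNS = {
    "entry_point": ["train.py", "main.py", "run_experiments.py", "eval.py"],
    "dependencies": ["requirements.txt", "pyproject.toml", "environment.yml"],
    "configuration": ["config.yaml", "hparams.json", "sweep.yaml", "*.toml"],
    "documentation": ["README.md", "REPRODUCE.md"],
    "data_loader": ["load_data.py"],
}


def classify_file(name: str) -> Optional[str]:
    """Return the first role whose pattern matches the file name, else None."""
    for role, patterns in ROLE_PATTERNS.items():
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            return role
    return None


def map_repo(root: str, max_depth: int = 3) -> dict[str, list[str]]:
    """Build a role -> relative paths map without reading file contents."""
    base = Path(root)
    repo_map: dict[str, list[str]] = {}
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        if not path.is_file() or len(rel.parts) > max_depth:
            continue
        role = classify_file(path.name)
        if role is not None:
            repo_map.setdefault(role, []).append(rel.as_posix())
    return repo_map
```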

Step 3: Search for Evidence (per claim)

For each testable claim, systematically search for supporting code evidence.

Search Strategy

Use Grep and Read — NOT an agent — for transparency. Each lookup should produce a file path and line number that can be cited in the report.

Example — Model claim "LLaMA-3-8B, 3 epochs, lr=2e-5":

Search for the model name: Grep "llama-?3-?8b|Llama-3-8B" --type py
Search for learning rate: Grep "2e-?5|0\.00002|learning_rate.*2e-5"
Search for epochs: Grep "epochs\s*[:=]\s*3|num_epochs.*3"
Check config files: Read config/*.yaml for matching values
If wandb/mlflow logs exist, grep those too

Example — Metric claim "we report an F1 of 0.87":

Search results files: Grep "0\.87" --type json --type csv --type md
Search eval scripts: Grep -l "f1_score|F1" eval*.py
Check if the number appears in a logged output

Example — Dataset claim "trained on 12,000 examples from OpenReview":

Search for dataset loader: Grep "openreview" -i
Check size assertions: Grep "12000|12_000|len\(.*\).*12"
Read the data loading function to confirm source
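
The examples above search for several spellings of the same number (2e-5 vs. 0.00002, 12000 vs. 12_000). A small helper can generate those spellings so none is forgotten. This is a sketch: `number_variants` and `variants_pattern` are hypothetical names, and the set of spellings covered is an assumption, not an exhaustive list.

```python
import re
from decimal import Decimal


def number_variants(literal: str) -> list[str]:
    """Return common source-code spellings of a number quoted in the paper."""
    value = Decimal(literal.replace(",", "").replace("_", ""))
    variants = {literal}
    if value == value.to_integral_value():
        n = int(value)
        # 12000, 12_000, 12,000
        variants.update({str(n), f"{n:_}", f"{n:,}"})
    else:
        variants.add(format(value, "f"))     # plain decimal, e.g. 0.00002
        variants.add(repr(float(value)))     # Python's own float repr, e.g. 2e-05
    return sorted(variants)


def variants_pattern(literal: str) -> str:
    """Join the variants into one regex, escaping dots and other metacharacters."""
    return "|".join(re.escape(v) for v in number_variants(literal))
```

Escaping matters here: an unescaped `0.00002` would also match strings such as `0x00002`.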

Record Evidence

For each claim, record:

{
  id: "C01",
  searches: ["llama-3-8b", "lr=2e-5", "epochs=3"],
  hits: [
    {file: "train.py", line: 42, snippet: "model_name = 'meta-llama/Llama-3-8B'"},
    {file: "config/train.yaml", line: 7, snippet: "learning_rate: 2e-5"},
    {file: "config/train.yaml", line: 8, snippet: "epochs: 5"}  // NOTE: mismatch, the paper claims 3 epochs
  ]
}

Do not hallucinate hits. If Grep returns nothing, record an empty hits list.
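
For readers who want to reproduce a lookup outside the Grep tool, the sketch below shows the same idea in plain Python: every hit carries a file path, a line number, and a snippet, and a search with no matches returns an empty list. `search_text`, `grep_repo`, and the default suffix list are assumptions for illustration.

```python
import re
from pathlib import Path

DEFAULT_SUFFIXES = (".py", ".yaml", ".yml", ".json", ".toml", ".md", ".csv")


def search_text(file: str, text: str, regex: "re.Pattern[str]") -> list[dict]:
    """Return one hit per matching line, with a 1-based line number."""
    return [
        {"file": file, "line": lineno, "snippet": line.strip()}
        for lineno, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def grep_repo(root: str, pattern: str, suffixes=DEFAULT_SUFFIXES) -> list[dict]:
    """Search text files under root; an empty list means no evidence was found."""
    base = Path(root)
    regex = re.compile(pattern, re.IGNORECASE)
    hits: list[dict] = []
    for path in sorted(base.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files rather than guessing
        hits.extend(search_text(path.relative_to(base).as_posix(), text, regex))
    return hits
```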

Step 4: Classify Each Claim

Classification Rubric

Status	Criteria	Evidence
CONFIRMED	Every testable fact in the claim has matching code evidence	File + line for each fact
PARTIAL	Some facts confirmed, others missing or unchecked	Confirmed facts listed; gaps called out
MISSING	No code evidence found for any fact in the claim	Which searches returned empty
MISMATCH	Code evidence exists but contradicts the claim	Side-by-side: paper says X, code says Y
NOT_AUDITABLE	Claim is too vague to check, or code is not available	Brief reason
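
The rubric can be read as a decision function over per-fact outcomes. The sketch below is one possible encoding: the outcome labels "match", "contradiction", and "none" are hypothetical, and letting a single contradiction override everything else is an assumption about precedence that the table itself does not spell out.

```python
def classify(fact_results: dict[str, str]) -> str:
    """Map per-fact outcomes to a claim status.

    fact_results maps each testable fact to "match", "contradiction", or "none".
    """
    if not fact_results:
        return "NOT_AUDITABLE"  # nothing concrete to check
    outcomes = set(fact_results.values())
    if "contradiction" in outcomes:
        return "MISMATCH"       # assumed to take precedence over partial support
    if outcomes == {"match"}:
        return "CONFIRMED"
    if outcomes == {"none"}:
        return "MISSING"
    return "PARTIAL"            # some facts matched, others had no evidence
```

Applied to the C01 evidence record, where the model name and learning rate match but the config says 5 epochs, this returns MISMATCH.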

Rules of Engagement

Be conservative. If you're not sure a search hit actually supports the claim, mark PARTIAL and explain what's missing.
Never mark CONFIRMED without a file:line reference. "I think it's probably in the training script" is not evidence.
Treat MISMATCH as load-bearing. Even a single MISMATCH is worth flagging prominently — these are the findings the user most needs to know.
**Distinguish MISSING from NOT_AUDITAB
