Paths: File paths (
references/) are relative to this skill directory.
Benchmark Compare
Type: L3 Worker Category: 8XX Optimization -> 840 Benchmark
Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.
Input / Output
| Direction | Content |
|---|---|
| Input | Repo checkout containing mcp/hex-line-mcp/, optional references/goals.md, optional references/expectations.json |
| Output | Comparison report in plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md plus machine-readable benchmark summary artifact |
Prerequisites
claude --versionsucceedsgitsucceedsmcp/hex-line-mcp/server.mjsexistsmcp/hex-line-mcp/hook.mjsexistsplugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.mdexistsplugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.jsonexistsplugins/optimization-suite/skills/ln-840-benchmark-compare/references/mcp-bench.jsonexists
Quick Run
bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh \
[plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md] \
[plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json]
Optional extra session profile:
EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh
Monitor Integration (Claude Code 2.1.98+)
MANDATORY READ: Load references/monitor_integration_pattern.md
Stream benchmark progress:
Monitor(command="bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh 2>&1 | grep --line-buffered -E 'scenario|PASS|FAIL|error|session'", timeout_ms=3600000, description="benchmark run")
Fallback: Bash(run_in_background=true).
The runner handles:
- syntax preflight
- SessionStart preflight
- scenario extraction from
goals.md - isolated worktrees per scenario/session
- per-scenario diffs
- final comparison report
Current scope:
- built-in Claude session
- Claude plus
hex-line - optional third Claude-compatible session profile through
EXTRA_SESSION_*environment variables
External baseline note:
- use the same
goals.mdandexpectations.json - do not rewrite scenarios to fit the external tool
- do not make "top tool" claims from the internal A/B alone
- the optional third session profile is only valid when it can emit the same
stream-jsonlog shape and diff artifacts
Workflow
Phase 1: Define The Canonical Suite
Use one canonical pair owned by this skill:
plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.mdplugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json
Rules:
- The suite must be a balanced mix of common engineering scenarios.
- Do not design the suite to favor
hex-line. - Every scenario in
goals.mdmust have a matching entry inexpectations.json. expectations.jsonis the source of truth for correctness.- The same pair must be reused unchanged for any future external baseline.
Supported expectation fields per scenario:
| Field | Meaning |
|---|---|
id | Scenario identifier used in result filenames |
expectedChangedFiles | Files that must change |
forbiddenChangedFiles | Files that must not change |
requiredDiffPatterns | Regex patterns required in the saved diff |
forbiddenDiffPatterns | Regex patterns that must not appear in the diff |
requiredResultPatterns | Regex patterns required in the final assistant result text |
requiredCommands | Regex patterns that must match at least one Bash command |
exactChangedFiles | If true, no extra changed files are allowed |
Phase 2: Preflight
The runner must pass:
node --check server.mjsnode --check hook.mjsnode --check extract-scenarios.mjsnode --check parse-results.mjs- SessionStart smoke check from
hook.mjs
If preflight fails, the benchmark is invalid and must stop before scenarios run.
Phase 3: Execute Per Scenario
For each ## scenario in goals.md:
- generate a standalone prompt file
- create two clean worktrees from the same commit
- run built-in Claude session
- run hex-line Claude session
- save
.jsonllogs and.diff.txtartifacts - remove both worktrees
Built-in session:
- no MCP
- hooks disabled
Hex-line session:
- resolved MCP config pointing to
server.mjs outputStyle: "hex-line"PreToolUsehook throughhook.mjs
Phase 4: Parse Results
parse-results.mjs evaluates each scenario for both sessions.
Scenario pass requires:
- valid run
- successful session completion
- changed files match expectations
- diff patterns match expectations
- result text patterns match expectations
- required commands were actually executed
Phase 5: Read The Report
The final report has these sections:
- Scenario Outcomes
- Activation
- Time
- Cost
- Tokens
- Tool Totals
- Validity
Interpretation rules:
invalid runmeans setup/adoption failure, not product performance- scenario
FAILmeans correctness contract was not met - activation is part of product quality for
hex-line, not external noise - this report is necessary for internal A/B evaluation, but not sufficient for best-alternative claims
Report Contract
plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md must answer:
- Did each scenario complete correctly?
- Did
hex-lineactivate cleanly without discovery drift? - What changed in wall time, API time, cost, output tokens, and total tool calls?
- Was the run valid?
Do not treat raw time/cost as sufficient without scenario correctness.
External Baseline Policy
- This skill owns the canonical suite, not a universal leaderboard.
- If maintainers compare
hex-lineagainst external alternatives, they must reuse the samegoals.md,expectations.json, and diff-based evaluation rules. - External runs may use different harnesses, but they must preserve the same task text, starting commit, and correctness contract.
- If an external tool cannot satisfy the contract format, record that as a harness limitation instead of rewriting the suite to accommodate it.
- A report that only covers built-in Claude vs
hex-linemust say so explicitly.
Runtime Contract
MANDATORY READ: Load references/benchmark_worker_runtime_contract.md, references/coordinator_summary_contract.md
Runtime CLI:
node references/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node references/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default
Required state fields:
report_readysummary_recordedfinal_resultself_check_passed
Domain checkpoints:
PHASE_0_CONFIGPHASE_1_PREFLIGHTPHASE_2_LOAD_SUITEPHASE_3_RUN_SCENARIOSPHASE_4_PARSE_RESULTSPHASE_5_WRITE_REPORTPHASE_6_WRITE_SUMMARYPHASE_7_SELF_CHECK
Guard rules:
- do not advance withou