Paths: File paths (references/) are relative to this skill directory.

Benchmark Compare

Type: L3 Worker Category: 8XX Optimization -> 840 Benchmark

Run a clean A/B benchmark in Claude Code: one session with built-in tools only, one with hex-line. The benchmark is scenario-based, diff-validated, manifest-driven, and runtime-backed. It measures activation, correctness, time, cost, and tokens. The current runner is intentionally scoped to this internal A/B. It does not, by itself, prove best-in-class against external alternatives.

Input / Output

Direction	Content
Input	Repo checkout containing `mcp/hex-line-mcp/`, optional `references/goals.md`, optional `references/expectations.json`
Output	Comparison report in `plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md` plus machine-readable benchmark summary artifact

Prerequisites

claude --version succeeds
git succeeds
mcp/hex-line-mcp/server.mjs exists
mcp/hex-line-mcp/hook.mjs exists
plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md exists
plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json exists
plugins/optimization-suite/skills/ln-840-benchmark-compare/references/mcp-bench.json exists

Quick Run

bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md] \
  [plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json]

Optional extra session profile:

EXTRA_SESSION_ID=other-mcp \
EXTRA_SESSION_LABEL="Other MCP" \
EXTRA_MCP_CONFIG=/abs/path/to/other-mcp.json \
EXTRA_SETTINGS='{"disableAllHooks":true}' \
bash plugins/optimization-suite/skills/ln-840-benchmark-compare/scripts/run-benchmark.sh

Monitor Integration (Claude Code 2.1.98+)

MANDATORY READ: Load references/monitor_integration_pattern.md

Fallback: Bash(run_in_background=true).

The runner handles:

syntax preflight
SessionStart preflight
scenario extraction from goals.md
isolated worktrees per scenario/session
per-scenario diffs
final comparison report

Current scope:

built-in Claude session
Claude plus hex-line
optional third Claude-compatible session profile through EXTRA_SESSION_* environment variables

External baseline note:

use the same goals.md and expectations.json
do not rewrite scenarios to fit the external tool
do not make "top tool" claims from the internal A/B alone
the optional third session profile is only valid when it can emit the same stream-json log shape and diff artifacts

Workflow

Phase 1: Define The Canonical Suite

Use one canonical pair owned by this skill:

plugins/optimization-suite/skills/ln-840-benchmark-compare/references/goals.md
plugins/optimization-suite/skills/ln-840-benchmark-compare/references/expectations.json

Rules:

The suite must be a balanced mix of common engineering scenarios.
Do not design the suite to favor hex-line.
Every scenario in goals.md must have a matching entry in expectations.json.
expectations.json is the source of truth for correctness.
The same pair must be reused unchanged for any future external baseline.

Supported expectation fields per scenario:

Field	Meaning
`id`	Scenario identifier used in result filenames
`expectedChangedFiles`	Files that must change
`forbiddenChangedFiles`	Files that must not change
`requiredDiffPatterns`	Regex patterns required in the saved diff
`forbiddenDiffPatterns`	Regex patterns that must not appear in the diff
`requiredResultPatterns`	Regex patterns required in the final assistant result text
`requiredCommands`	Regex patterns that must match at least one Bash command
`exactChangedFiles`	If `true`, no extra changed files are allowed

Phase 2: Preflight

The runner must pass:

node --check server.mjs
node --check hook.mjs
node --check extract-scenarios.mjs
node --check parse-results.mjs
SessionStart smoke check from hook.mjs

If preflight fails, the benchmark is invalid and must stop before scenarios run.

Phase 3: Execute Per Scenario

For each ## scenario in goals.md:

generate a standalone prompt file
create two clean worktrees from the same commit
run built-in Claude session
run hex-line Claude session
save .jsonl logs and .diff.txt artifacts
remove both worktrees

Built-in session:

no MCP
hooks disabled

Hex-line session:

resolved MCP config pointing to server.mjs
outputStyle: "hex-line"
PreToolUse hook through hook.mjs

Phase 4: Parse Results

parse-results.mjs evaluates each scenario for both sessions.

Scenario pass requires:

valid run
successful session completion
changed files match expectations
diff patterns match expectations
result text patterns match expectations
required commands were actually executed

Phase 5: Read The Report

The final report has these sections:

Scenario Outcomes
Activation
Time
Cost
Tokens
Tool Totals
Validity

Interpretation rules:

invalid run means setup/adoption failure, not product performance
scenario FAIL means correctness contract was not met
activation is part of product quality for hex-line, not external noise
this report is necessary for internal A/B evaluation, but not sufficient for best-alternative claims

Report Contract

plugins/optimization-suite/skills/ln-840-benchmark-compare/results/{date}-comparison.md must answer:

Did each scenario complete correctly?
Did hex-line activate cleanly without discovery drift?
What changed in wall time, API time, cost, output tokens, and total tool calls?
Was the run valid?

Do not treat raw time/cost as sufficient without scenario correctness.

External Baseline Policy

This skill owns the canonical suite, not a universal leaderboard.
If maintainers compare hex-line against external alternatives, they must reuse the same goals.md, expectations.json, and diff-based evaluation rules.
External runs may use different harnesses, but they must preserve the same task text, starting commit, and correctness contract.
If an external tool cannot satisfy the contract format, record that as a harness limitation instead of rewriting the suite to accommodate it.
A report that only covers built-in Claude vs hex-line must say so explicitly.

Runtime Contract

MANDATORY READ: Load references/benchmark_worker_runtime_contract.md, references/coordinator_summary_contract.md

Runtime CLI:

node references/scripts/benchmark-worker-runtime/cli.mjs start --skill ln-840-benchmark-compare --identifier suite-default --manifest-file <file>
node references/scripts/benchmark-worker-runtime/cli.mjs checkpoint --skill ln-840-benchmark-compare --identifier suite-default --phase PHASE_0_CONFIG --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs record-summary --skill ln-840-benchmark-compare --identifier suite-default --payload '{...}'
node references/scripts/benchmark-worker-runtime/cli.mjs complete --skill ln-840-benchmark-compare --identifier suite-default

Required state fields:

report_ready
summary_recorded
final_result
self_check_passed

Domain checkpoints:

PHASE_0_CONFIG
PHASE_1_PREFLIGHT
PHASE_2_LOAD_SUITE
PHASE_3_RUN_SCENARIOS
PHASE_4_PARSE_RESULTS
PHASE_5_WRITE_REPORT
PHASE_6_WRITE_SUMMARY
PHASE_7_SELF_CHECK

Guard rules:

do not advance withou

ln-840-benchmark-compare

Cómo agregar

Pega en el README de tu repo

Skills relacionadas

pdf

pptx

canvas-design

theme-factory

Recibe nuevas skills de Documentos todos los lunes