NASDE Benchmark Runner
Run coding agent benchmarks with nasde and verify the results. The pipeline has two stages: Harbor runs agents in Docker containers (functional test → reward 0/1), then an LLM-as-a-Judge scores architecture quality across multiple dimensions.
Authentication setup
Before running any benchmark, set up authentication tokens for the agents you plan to run. Both the OS and the auth method matter: in each table below, use the row that matches the user's OS and shell and the column that matches their auth method.
Step 1 — Ask the user which auth they prefer
Always ask the user before running anything; never assume. Ask two questions:
- Which agents will you run? (Claude / Codex / Gemini, any combination)
- For each agent, OAuth (subscription) or API key (per-token billing)? Default recommendation: OAuth where available — no per-token cost, no env vars to manage.
Then detect their OS and pick the matching script row from the table below. On Windows, also ask whether they're in PowerShell or WSL (cmd.exe is not directly supported — see "Windows: cmd.exe" below).
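The OS detection step can be sketched in shell as follows. This is a minimal sketch, not part of nasde: the function name `detect_platform` and its output labels are hypothetical, and it relies only on `uname` and on the fact that WSL kernel release strings contain "microsoft". It cannot tell PowerShell from cmd.exe, so on Windows you still need to ask the user.

```shell
# Hypothetical helper: map `uname` output to a row in the auth tables.
# Arguments are optional and exist only so the function can be exercised
# without depending on the host OS.
detect_platform() {
  local kernel release
  kernel="${1:-$(uname -s)}"
  release="${2:-$(uname -r)}"
  case "$kernel" in
    Darwin) echo macos ;;
    Linux)
      # WSL1 and WSL2 kernels both carry "Microsoft"/"microsoft" in the release string.
      case "$release" in
        *[Mm]icrosoft*) echo wsl ;;
        *) echo linux ;;
      esac ;;
    MINGW*|MSYS*|CYGWIN*) echo windows ;;  # POSIX-like shell on a Windows host
    *) echo unknown ;;
  esac
}
```

Calling `detect_platform` with no arguments inspects the current machine; any result other than `macos`, `linux`, or `wsl` means the `.sh` scripts are probably not the right choice.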
Where the auth scripts live
The OAuth scripts ship inside this skill. After nasde install-skills they are at:
- User scope (default): ~/.claude/skills/nasde-benchmark-runner/scripts/ (macOS/Linux/WSL) or %USERPROFILE%\.claude\skills\nasde-benchmark-runner\scripts\ (Windows PowerShell)
- Project scope: <project>/.claude/skills/nasde-benchmark-runner/scripts/ (if installed with nasde install-skills --scope project)
- Editable nasde checkout (devs only): <repo>/scripts/ (same files, mirrored from the skill bundle)
Below, <SKILL_SCRIPTS> is shorthand for whichever absolute path applies. Resolve it once, then substitute it in every command. Verify the path with ls <SKILL_SCRIPTS> before telling the user to source anything — if the directory is missing, they need to run nasde install-skills first.
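Resolving the path once can be sketched as a small shell function. This is an illustrative sketch only: `resolve_skill_scripts` is a hypothetical name, and checking project scope before user scope is an assumption made here, not documented nasde behavior. It covers the macOS/Linux/WSL locations listed above and does not handle the editable-checkout case.

```shell
# Hypothetical helper: print the first scripts directory that exists.
# Assumption: a project-scope install should win over the user-scope one.
resolve_skill_scripts() {
  local project_root dir
  project_root="${1:-$PWD}"
  for dir in \
    "$project_root/.claude/skills/nasde-benchmark-runner/scripts" \
    "$HOME/.claude/skills/nasde-benchmark-runner/scripts"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  echo "skill scripts not found; run: nasde install-skills" >&2
  return 1
}
```

A typical use would be `SKILL_SCRIPTS="$(resolve_skill_scripts)" && ls "$SKILL_SCRIPTS"`, which both resolves the path and performs the verification step described above.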
Step 2 — Run the right script per agent × OS
Priority order: Claude → Codex → Gemini. Claude authentication is required even for non-Claude variants when [evaluation] backend = "claude" (the default), because the assessment evaluator spawns the claude CLI as a subprocess.
Claude Code
| OS / shell | OAuth (subscription) | API key |
|---|---|---|
| macOS | source <SKILL_SCRIPTS>/export_oauth_token.sh (reads Keychain entry "Claude Code-credentials") | export ANTHROPIC_API_KEY=sk-ant-... |
| Linux | source <SKILL_SCRIPTS>/export_oauth_token.sh (reads ~/.claude/.credentials.json) | export ANTHROPIC_API_KEY=sk-ant-... |
| Windows PowerShell | . <SKILL_SCRIPTS>\export_oauth_token.ps1 (reads %USERPROFILE%\.claude\.credentials.json) | $env:ANTHROPIC_API_KEY = 'sk-ant-...' |
| Windows WSL (Ubuntu) | source <SKILL_SCRIPTS>/export_oauth_token.sh (Linux path; resolve <SKILL_SCRIPTS> from your WSL home, not the Windows host's) | export ANTHROPIC_API_KEY=sk-ant-... |
Prerequisite for OAuth: the claude CLI is installed and has been run once to log in.
The script exports CLAUDE_CODE_OAUTH_TOKEN. This is required for both Claude variant runs AND assessment evaluation (when [evaluation] backend = "claude" — the default).
Codex
| OS / shell | OAuth (ChatGPT subscription) | API key |
|---|---|---|
| macOS | codex login once, then source <SKILL_SCRIPTS>/export_codex_oauth_token.sh | export CODEX_API_KEY=sk-proj-... (or OPENAI_API_KEY) |
| Linux | codex login once, then source <SKILL_SCRIPTS>/export_codex_oauth_token.sh | export CODEX_API_KEY=sk-proj-... |
| Windows PowerShell | codex login once, then . <SKILL_SCRIPTS>\export_codex_oauth_token.ps1 | $env:CODEX_API_KEY = 'sk-proj-...' |
| Windows WSL (Ubuntu) | codex login once, then source <SKILL_SCRIPTS>/export_codex_oauth_token.sh | export CODEX_API_KEY=sk-proj-... |
The OAuth scripts only validate ~/.codex/auth.json (or %USERPROFILE%\.codex\auth.json) — Harbor injects the file into the sandbox automatically. API key always takes priority over OAuth when both are present.
Gemini CLI
| OS / shell | OAuth (Google account) | API key |
|---|---|---|
| macOS | gemini login once, then source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh | export GEMINI_API_KEY=... |
| Linux | gemini login once, then source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh | export GEMINI_API_KEY=... |
| Windows PowerShell | gemini login once, then . <SKILL_SCRIPTS>\export_gemini_oauth_token.ps1 | $env:GEMINI_API_KEY = '...' |
| Windows WSL (Ubuntu) | gemini login once, then source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh | export GEMINI_API_KEY=... |
The OAuth scripts export GEMINI_OAUTH_CREDS (the raw JSON) — ConfigurableGemini reads that env var and injects credentials into the sandbox. API key always takes priority over OAuth.
Combined setup for cross-agent runs
Resolve <SKILL_SCRIPTS> first, then run all three.
macOS / Linux / Windows WSL:
SKILL_SCRIPTS=~/.claude/skills/nasde-benchmark-runner/scripts  # adjust if installed with --scope project
source "$SKILL_SCRIPTS/export_oauth_token.sh"         # Claude (subscription)
source "$SKILL_SCRIPTS/export_codex_oauth_token.sh"   # Codex (subscription), or: export CODEX_API_KEY=...
source "$SKILL_SCRIPTS/export_gemini_oauth_token.sh"  # Gemini (Google account), or: export GEMINI_API_KEY=...
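After sourcing the scripts, a quick preflight check can confirm that each agent has some credential visible in the current shell. The sketch below is hypothetical (`check_agent_auth` is not a nasde command) and uses only the variable and file names mentioned in this section. It tests for presence, not validity: an expired token will still pass.

```shell
# Hypothetical preflight: succeed only if every named agent has a credential.
# Usage: check_agent_auth claude codex gemini
check_agent_auth() {
  local agent missing
  missing=0
  for agent in "$@"; do
    case "$agent" in
      claude) [ -n "${CLAUDE_CODE_OAUTH_TOKEN:-}${ANTHROPIC_API_KEY:-}" ] ;;
      # Codex OAuth exports nothing; the login file itself is the credential.
      codex)  [ -n "${CODEX_API_KEY:-}${OPENAI_API_KEY:-}" ] || [ -f "$HOME/.codex/auth.json" ] ;;
      gemini) [ -n "${GEMINI_API_KEY:-}${GEMINI_OAUTH_CREDS:-}" ] ;;
      *) echo "unknown agent: $agent" >&2; false ;;
    esac || { echo "missing credentials for $agent" >&2; missing=1; }
  done
  return "$missing"
}
```

Running `check_agent_auth claude codex gemini` before `nasde run` surfaces a missing login early instead of partway through a benchmark.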
Windows PowerShell:
$SkillScripts = "$env:USERPROFILE\.claude\skills\nasde-benchmark-runner\scripts"
. "$SkillScripts\export_oauth_token.ps1"
. "$SkillScripts\export_codex_oauth_token.ps1"
. "$SkillScripts\export_gemini_oauth_token.ps1"
Windows: cmd.exe
cmd.exe is not supported directly — .ps1 requires PowerShell, .sh requires bash. Two workarounds:
- Open PowerShell (powershell.exe) and dot-source the .ps1 script. This is the simplest path on a vanilla Windows install.
- Use WSL (wsl -d Ubuntu) and source the .sh script. This is the recommended path if you also want Docker Desktop with the WSL2 backend, which is the most common dev setup.
If a user is in cmd.exe, point them to one of these two — don't try to extract the token manually.
Running benchmarks
All commands assume -C points to the benchmark project directory.
Basic run (all tasks, default variant)
nasde run -C path/to/benchmark
Assessment evaluation runs by default. This is the standard workflow.
Specific variant and tasks
# Single task, specific variant
nasde run --variant guided --tasks my-task -C path/to/benchmark
# Multiple tasks
nasde run --variant baseline --tasks task-a,task-b -C path/to/benchmark
With Opik tracing
nasde run --variant baseline --tasks my-task -C path/to/benchmark --with-opik
After this completes, ALWAYS verify Opik results (see Opik verification below).
Harbor only (skip assessment)
nasde run --variant baseline -C path/to/benchmark --without-eval
Parallel runs (multiple variants)
Do not use --all-variants when you want parallelism. --all-variants runs variants sequentially in a single process (one variant after another). To run two or more variants in parallel, launch separate nasde run processes with & and wait — each job directory gets a unique random suffix, so concurrent runs are collision-safe:
nasde run --variant vanilla --tasks my-task -C path/to/benchmark &
nasde run --variant guided --tasks my-task -C path/to/benchmark &
wait
Use --all-variants only when you want one variant after another (e.g. to limit total resource use, or when running Claude variants where parallel runs risk Docker OOM — see warning below).
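One caveat with a bare `wait` is that it returns success even if a background job failed. A possible refinement, sketched below, waits on each PID separately so a failed variant is not silently ignored. The function name `run_variants_parallel` and the `NASDE_CMD` override are hypothetical; the override exists only so the sketch can be dry-run without nasde installed.

```shell
# Hypothetical wrapper: one background `nasde run` per variant, then collect
# each exit status. Usage: run_variants_parallel <task> <project-dir> <variant>...
run_variants_parallel() {
  local task project cmd pids failed variant pid
  task="$1"
  project="$2"
  shift 2
  cmd="${NASDE_CMD:-nasde}"   # assumption: override used for dry runs only
  pids=""
  failed=0
  for variant in "$@"; do
    "$cmd" run --variant "$variant" --tasks "$task" -C "$project" &
    pids="$pids $!"
  done
  # `wait <pid>` returns that job's exit status, unlike a bare `wait`.
  for pid in $pids; do
    wait "$pid" || failed=1
  done
  return "$failed"
}
```

For example, `run_variants_parallel my-task path/to/benchmark vanilla guided` launches both variants concurrently and returns non-zero if either run fails.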
For deterministic job names, use --job-suffix:
nasde run --variant vanilla --job-suffix run1 -C path/to/benchmark
Running Codex variants
Codex variants use AGENTS.md (instead of CLAUDE.md) and require either codex login (ChatGPT subscription) or CODEX_API_KEY/OPENAI_API_KEY (API billing).
CRITICAL: Codex model must be set explicitly. The nasde.toml default model (e.g.