NASDE Benchmark Runner

Run coding agent benchmarks with nasde and verify the results. The pipeline has two stages: Harbor runs agents in Docker containers (functional test → reward 0/1), then an LLM-as-a-Judge scores architecture quality across multiple dimensions.

Authentication setup

Before running any benchmark, set up authentication tokens for the agents you plan to run. Both OS and auth method matter — pick the right command per row.

Step 1 — Ask the user which auth they prefer

Always ask the user before running; never assume. Two questions:

Which agents will you run? (Claude / Codex / Gemini, any combination)
For each agent, OAuth (subscription) or API key (per-token billing)? Default recommendation: OAuth where available — no per-token cost, no env vars to manage.

Then detect their OS and pick the matching script row from the table below. On Windows, also ask whether they're in PowerShell or WSL (cmd.exe is not directly supported — see "Windows: cmd.exe" below).
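
The detection step above can be sketched in shell for the bash-capable rows (macOS, Linux, WSL). This is a minimal sketch: the function name `detect_env` and its output labels are illustrative, and the WSL check relies on the common convention that WSL kernels mention "microsoft" in /proc/version. PowerShell cannot be detected this way; ask the user.

```shell
# Classify the current environment so the matching table row can be chosen.
# The labels (macos, linux, wsl, unknown) are illustrative, not nasde terms.
detect_env() {
  local kernel
  kernel="$(uname -s 2>/dev/null || echo unknown)"
  case "$kernel" in
    Darwin) echo "macos" ;;
    Linux)
      # WSL kernels usually mention "microsoft" in /proc/version.
      if grep -qi microsoft /proc/version 2>/dev/null; then
        echo "wsl"
      else
        echo "linux"
      fi
      ;;
    *) echo "unknown" ;;
  esac
}

detect_env
```

An "unknown" result is a cue to ask the user directly rather than guess.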

Where the auth scripts live

The OAuth scripts ship inside this skill. After nasde install-skills they are at:

User scope (default): ~/.claude/skills/nasde-benchmark-runner/scripts/ (macOS/Linux/WSL) or %USERPROFILE%\.claude\skills\nasde-benchmark-runner\scripts\ (Windows PowerShell)
Project scope: <project>/.claude/skills/nasde-benchmark-runner/scripts/ (if installed with nasde install-skills --scope project)
Editable nasde checkout (devs only): <repo>/scripts/ — same files, mirrored from the skill bundle

Below, <SKILL_SCRIPTS> is shorthand for whichever absolute path applies. Resolve it once, then substitute it in every command. Verify the path with ls <SKILL_SCRIPTS> before telling the user to source anything — if the directory is missing, they need to run nasde install-skills first.
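
One way to resolve the path once and fail early is sketched below. The helper name `resolve_skill_scripts` and the search order (project scope first, then user scope) are assumptions made for illustration; only the two directory layouts come from the list above.

```shell
# Print the first existing scripts directory.
# Search order (project scope, then user scope) is an assumption.
resolve_skill_scripts() {
  local project_dir="${1:-$PWD}"
  local rel=".claude/skills/nasde-benchmark-runner/scripts"
  local candidate
  for candidate in "$project_dir/$rel" "$HOME/$rel"; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "scripts directory not found; run: nasde install-skills" >&2
  return 1
}

# Resolve once, then reuse the variable in every later command.
if SKILL_SCRIPTS="$(resolve_skill_scripts "$PWD")"; then
  ls "$SKILL_SCRIPTS"
fi
```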

Step 2 — Run the right script per agent × OS

Priority order: Claude → Codex → Gemini. Claude authentication is required even for non-Claude variants when [evaluation] backend = "claude" (the default), because the assessment evaluator spawns the claude CLI as a subprocess.

Claude Code

OS / shell	OAuth (subscription)	API key
macOS	`source <SKILL_SCRIPTS>/export_oauth_token.sh` (reads Keychain entry "Claude Code-credentials")	`export ANTHROPIC_API_KEY=sk-ant-...`
Linux	`source <SKILL_SCRIPTS>/export_oauth_token.sh` (reads `~/.claude/.credentials.json`)	`export ANTHROPIC_API_KEY=sk-ant-...`
Windows PowerShell	`. <SKILL_SCRIPTS>\export_oauth_token.ps1` (reads `%USERPROFILE%\.claude\.credentials.json`)	`$env:ANTHROPIC_API_KEY = 'sk-ant-...'`
Windows WSL (Ubuntu)	`source <SKILL_SCRIPTS>/export_oauth_token.sh` (Linux path; resolve `<SKILL_SCRIPTS>` from your WSL home, not the Windows host's)	`export ANTHROPIC_API_KEY=sk-ant-...`

Prerequisite for OAuth: the claude CLI is installed and claude has been run once to log in.

The script exports CLAUDE_CODE_OAUTH_TOKEN. This is required for both Claude variant runs AND assessment evaluation (when [evaluation] backend = "claude" — the default).
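
A quick preflight check before starting a run might look like the sketch below. The function name `claude_auth_present` is hypothetical. The two variable names come from the table above; whether ANTHROPIC_API_KEY alone is enough for the assessment evaluator is not stated here, so treat that branch as an assumption.

```shell
# Report which Claude credentials are present in the environment.
# Returns 0 if at least one is set, 1 otherwise. Token values are never printed.
claude_auth_present() {
  local status=1
  if [ -n "${CLAUDE_CODE_OAUTH_TOKEN:-}" ]; then
    echo "CLAUDE_CODE_OAUTH_TOKEN is set"
    status=0
  fi
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is set"
    status=0
  fi
  return "$status"
}

claude_auth_present || echo "no Claude credentials found in this shell" >&2
```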

Codex

OS / shell	OAuth (ChatGPT subscription)	API key
macOS	`codex login` once, then `source <SKILL_SCRIPTS>/export_codex_oauth_token.sh`	`export CODEX_API_KEY=sk-proj-...` (or `OPENAI_API_KEY`)
Linux	`codex login` once, then `source <SKILL_SCRIPTS>/export_codex_oauth_token.sh`	`export CODEX_API_KEY=sk-proj-...`
Windows PowerShell	`codex login` once, then `. <SKILL_SCRIPTS>\export_codex_oauth_token.ps1`	`$env:CODEX_API_KEY = 'sk-proj-...'`
Windows WSL (Ubuntu)	`codex login` once, then `source <SKILL_SCRIPTS>/export_codex_oauth_token.sh`	`export CODEX_API_KEY=sk-proj-...`

The OAuth scripts only validate ~/.codex/auth.json (or %USERPROFILE%\.codex\auth.json) — Harbor injects the file into the sandbox automatically. API key always takes priority over OAuth when both are present.

Gemini CLI

OS / shell	OAuth (Google account)	API key
macOS	`gemini login` once, then `source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh`	`export GEMINI_API_KEY=...`
Linux	`gemini login` once, then `source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh`	`export GEMINI_API_KEY=...`
Windows PowerShell	`gemini login` once, then `. <SKILL_SCRIPTS>\export_gemini_oauth_token.ps1`	`$env:GEMINI_API_KEY = '...'`
Windows WSL (Ubuntu)	`gemini login` once, then `source <SKILL_SCRIPTS>/export_gemini_oauth_token.sh`	`export GEMINI_API_KEY=...`

The OAuth scripts export GEMINI_OAUTH_CREDS (the raw JSON) — ConfigurableGemini reads that env var and injects credentials into the sandbox. API key always takes priority over OAuth.
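
The precedence rule stated for Codex and Gemini (API key wins over OAuth) can be made visible with a small check. This is a sketch under assumptions: `auth_mode` and its output labels are hypothetical, and it only inspects the environment variables and the auth file named above. It does not reproduce how nasde or Harbor actually select credentials.

```shell
# Report which credential would win for an agent, following the stated rule
# that an API key takes priority over OAuth. Labels are illustrative.
auth_mode() {
  case "$1" in
    codex)
      if [ -n "${CODEX_API_KEY:-}" ] || [ -n "${OPENAI_API_KEY:-}" ]; then
        echo "api-key"
      elif [ -s "$HOME/.codex/auth.json" ]; then
        echo "oauth"
      else
        echo "none"
      fi
      ;;
    gemini)
      if [ -n "${GEMINI_API_KEY:-}" ]; then
        echo "api-key"
      elif [ -n "${GEMINI_OAUTH_CREDS:-}" ]; then
        echo "oauth"
      else
        echo "none"
      fi
      ;;
    *)
      echo "unknown agent: $1" >&2
      return 2
      ;;
  esac
}

for agent in codex gemini; do
  printf '%s: %s\n' "$agent" "$(auth_mode "$agent")"
done
```

If a user expects subscription billing but this prints api-key, a stray key in the environment is the likely cause.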

Combined setup for cross-agent runs

Resolve <SKILL_SCRIPTS> first, then run all three.

macOS / Linux / Windows WSL:

SKILL_SCRIPTS=~/.claude/skills/nasde-benchmark-runner/scripts   # adjust if --scope project
source "$SKILL_SCRIPTS/export_oauth_token.sh"         # Claude (subscription)
source "$SKILL_SCRIPTS/export_codex_oauth_token.sh"   # Codex (subscription); or: export CODEX_API_KEY=...
source "$SKILL_SCRIPTS/export_gemini_oauth_token.sh"  # Gemini (Google account); or: export GEMINI_API_KEY=...

Windows PowerShell:

$SkillScripts = "$env:USERPROFILE\.claude\skills\nasde-benchmark-runner\scripts"
. "$SkillScripts\export_oauth_token.ps1"
. "$SkillScripts\export_codex_oauth_token.ps1"
. "$SkillScripts\export_gemini_oauth_token.ps1"

Windows: cmd.exe

cmd.exe is not supported directly — .ps1 requires PowerShell, .sh requires bash. Two workarounds:

Open PowerShell (powershell.exe) and dot-source the .ps1 script. This is the simplest path on a vanilla Windows install.
Use WSL (wsl -d Ubuntu) and source the .sh script. This is the recommended path if you also want Docker Desktop with the WSL2 backend, which is the most common dev setup.

If a user is in cmd.exe, point them to one of these two — don't try to extract the token manually.

Running benchmarks

All commands assume -C points to the benchmark project directory.

Basic run (all tasks, default variant)

nasde run -C path/to/benchmark

Assessment evaluation runs by default. This is the standard workflow.

Specific variant and tasks

# Single task, specific variant
nasde run --variant guided --tasks my-task -C path/to/benchmark

# Multiple tasks
nasde run --variant baseline --tasks task-a,task-b -C path/to/benchmark
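
When the task names come from a list rather than being typed by hand, the comma-separated --tasks value can be assembled with a small helper. This is only a sketch. `join_tasks` is a hypothetical name, and the final line prints the command instead of running it.

```shell
# Join all arguments with commas, for use as a --tasks value.
join_tasks() {
  local IFS=,
  printf '%s\n' "$*"
}

# Print the resulting command for inspection (drop "echo" to run it).
echo nasde run --variant baseline --tasks "$(join_tasks task-a task-b)" -C path/to/benchmark
```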

With Opik tracing

nasde run --variant baseline --tasks my-task -C path/to/benchmark --with-opik

After this completes, ALWAYS verify Opik results (see Opik verification below).

Harbor only (skip assessment)

nasde run --variant baseline -C path/to/benchmark --without-eval

Parallel runs (multiple variants)

Do not use --all-variants when you want parallelism. --all-variants runs variants sequentially in a single process (one variant after another). To run two or more variants in parallel, launch separate nasde run processes with & and wait — each job directory gets a unique random suffix, so concurrent runs are collision-safe:

nasde run --variant vanilla --tasks my-task -C path/to/benchmark &
nasde run --variant guided --tasks my-task -C path/to/benchmark &
wait

Use --all-variants only when you want one variant after another (e.g. to limit total resource use, or when running Claude variants where parallel runs risk Docker OOM — see warning below).
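
A bare wait returns zero even when a background job failed, so a script that needs to know which variant broke can wait on each PID instead. The sketch below assumes simple variant names without spaces. `run_variants_parallel` and the NASDE_BIN override are hypothetical, added so the function can be tried with a stub in place of nasde.

```shell
# Launch one background nasde run per variant and report each exit status.
# NASDE_BIN is a hypothetical override for testing; it defaults to nasde.
run_variants_parallel() {
  local task="$1" project="$2"
  shift 2
  local nasde_bin="${NASDE_BIN:-nasde}"
  local launched="" failed=0
  local variant entry pid name

  for variant in "$@"; do
    "$nasde_bin" run --variant "$variant" --tasks "$task" -C "$project" &
    launched="$launched $!:$variant"
  done

  # wait <pid> returns that job's exit status, unlike a bare wait.
  for entry in $launched; do
    pid="${entry%%:*}"
    name="${entry#*:}"
    if wait "$pid"; then
      echo "$name: ok"
    else
      echo "$name: failed"
      failed=1
    fi
  done
  return "$failed"
}

# Usage: run_variants_parallel my-task path/to/benchmark vanilla guided
```

The function returns non-zero if any variant failed, which makes it usable in CI-style scripts.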

For deterministic job names, use --job-suffix:

nasde run --variant vanilla --job-suffix run1 -C path/to/benchmark

Running Codex variants

Codex variants use AGENTS.md (instead of CLAUDE.md) and require either codex login (ChatGPT subscription) or CODEX_API_KEY/OPENAI_API_KEY (API billing).

CRITICAL: Codex model must be set explicitly. The nasde.toml default model (e.g.
