NASDE Benchmark Creator
Create and configure coding agent benchmarks for evaluation with nasde. A benchmark is a set of coding tasks that AI agents solve inside isolated Docker containers, scored both by functional tests (pass/fail) and by an LLM-as-a-Judge architecture assessment.
Critical: line endings on Windows (read this first)
Benchmark scripts execute inside Linux sandboxes (Docker, Daytona). If tests/test.sh, solution/solve.sh, or environment/Dockerfile are checked out with CRLF line endings (the Windows git default when core.autocrlf=true and there is no .gitattributes), every trial fails immediately with:
bash: line 1: /tests/test.sh: cannot execute: required file not found
…because the kernel reads the shebang as #!/bin/bash\r and tries to execute a non-existent /bin/bash\r. The agent finishes its work, but the verifier never runs and Harbor reports RewardFileNotFoundError.
Mitigation (always do this for a new benchmark — nasde init does it for you, but verify):
-
The benchmark repo MUST have a
.gitattributesfile enforcing LF for shell scripts and Dockerfiles. The minimum content:* text=auto eol=lf *.sh text eol=lf *.bash text eol=lf Dockerfile text eol=lf *.dockerfile text eol=lf docker-compose.yaml text eol=lf docker-compose.yml text eol=lf *.ps1 text eol=crlf *.bat text eol=crlf *.cmd text eol=crlfnasde initwrites this automatically. If you are adding a benchmark to an existing repo without.gitattributes, create one before adding any task. -
When writing
.shorDockerfilecontent programmatically on Windows, write with explicit LF — notpath.write_text(content)(which translates\n→\r\non Windows), butpath.write_text(content, encoding="utf-8", newline="")or open the file in binary mode. -
After committing on Windows for the first time, run:
git add --renormalize . git commit -m "normalize line endings"to fix any files that landed before
.gitattributeswas in place. -
Sanity check before pushing a new task:
file tasks/<task>/tests/test.sh # MUST say "with LF line terminators" or omit line-terminator info entirely. # If it says "with CRLF line terminators" — fix it (`sed -i 's/\r$//' file`).
This applies equally when you're adding tasks to a benchmark someone else created — if their repo has no .gitattributes and you're on Windows, your contribution will silently break for them on Linux CI and vice versa.
Step 1: Understand what to evaluate
Before creating files, clarify with the user:
- What programming language/framework? (determines Dockerfile base image)
- What kind of coding challenges? (feature implementation, refactoring, bug fixing, etc.)
- What source repository should the agent work on? (git URL cloned in Dockerfile)
- What quality dimensions should be assessed? (these are benchmark-specific, not hardcoded)
Step 2: Scaffold or create the project
For a new benchmark, run:
nasde init my-benchmark --name my-benchmark
This creates the base structure. Then customize the generated files.
For adding tasks to an existing benchmark, skip to Step 4.
Step 3: Define assessment dimensions
Edit assessment_dimensions.json. Each benchmark has its OWN dimensions — design them for what matters in this benchmark's domain.
Examples by domain:
- Refactoring:
code_clarity,test_preservation,api_compatibility,performance_impact - API integration:
error_handling,api_usage_correctness,test_coverage,documentation - Security:
vulnerability_detection,fix_correctness,regression_safety,explanation_quality - DDD:
domain_modeling,architecture_compliance,extensibility,test_quality
Rules:
- Pick whatever number of dimensions actually captures the quality you care about — there is no required minimum or maximum.
- Each dimension declares its own
max_score(any positive integer). Scales are independent — a coarse pass/fail-ish dimension can be 0–3 while a richly graded one can be 0–50 in the same rubric. There is no requirement for the total to sum to 100.normalized_scoreis computed automatically from the actual sum ofmax_scorevalues. See ADR-008. - Names in snake_case
- Each dimension has:
name,title,max_score,description
Step 4: Create task files
Each task lives in tasks/<task-name>/ and needs these files:
task.toml (required — single task config)
Single config file per task, shared with Harbor. nasde-specific fields live under [nasde.*].
version = "1.0"
[task]
name = "<benchmark-name>/<task-name>" # Harbor requires org/name format
description = "Brief description"
[metadata]
difficulty = "intermediate"
language = "C#"
framework = ".NET 8"
domain = "E-Commerce"
[agent]
timeout_sec = 1800 # Primary agent timeout. Rule of thumb: estimated_time_minutes × 60.
[environment]
memory_mb = 4096 # Container memory limit. Claude Code needs 4096+, default 2048 is too low.
[verifier]
timeout_sec = 300 # Timeout for tests/test.sh.
[nasde.source] # Only needed when task has no environment/Dockerfile (nasde auto-generates one).
git = "https://github.com/org/repo.git"
ref = "main"
Timeout priority: --timeout CLI flag > task.toml [agent] timeout_sec > Harbor default. Timeouts are per-task — there is no project-wide default in nasde.toml.
instruction.md (required)
Agent-facing task description. Structure it as:
# Task: <Name>
## Context
Working environment, codebase location (/app), technology stack.
## Requirement
What the agent must implement/fix/change. Concrete examples with inputs and expected outputs.
## Scope
What's in scope, what's not.
## Quality Expectations
Architecture and code quality expectations.
## Success Criteria
Numbered list matching what test.sh verifies.
## Constraints
What the agent must NOT do (e.g., don't modify existing tests).
environment/Dockerfile (required)
Reminder for Windows authors: the Dockerfile and any helper scripts it
COPYs in must have LF line endings — Docker tolerates CRLF in some commands but not inRUNshell snippets, and any shell script copied with CRLF will hit the same shebang failure astest.sh.
FROM <base-image>
RUN apt-get update && apt-get install -y git curl wget ca-certificates && rm -rf /var/lib/apt/lists/*
WORKDIR /app
RUN git clone <repository-url> .
# Pre-install dependencies so the agent doesn't waste time
RUN <dependency-install-command>
# Verify the environment works
RUN <build-or-compile-command>
CMD ["/bin/bash"]
The Dockerfile MUST be self-contained — the agent starts working immediately.
tests/test.sh (required — Harbor verifier)
Reminder for Windows authors: this file MUST be saved with LF line endings. See "Critical: line endings on Windows" at the top of this skill. CRLF here =
bash: required file not foundand a wasted trial.
#!/bin/bash
cd /app
echo "Step 1: Verifying build..."
if <build-command>; then
echo "✓ Build succeeded"
else
echo "✗ Build failed"
echo 0 > /logs/verifier/reward.txt
exit 1
fi
echo "Step 2: Running tests..."
if <test-command>; then
echo "✓ Tests pass"
else
echo "✗ Tests failed"
echo 0 > /logs/verifier/reward.txt
exit 1
fi
echo "EVALUATION PASSED ✓"
echo 1 > /logs/verifier/reward.txt
exit 0
Rules:
- Every failure:
echo 0 > /logs/verifier/reward.txt+exit 1 - Final success:
echo 1 > /logs/verifier/reward.txt+exit 0 - Order steps from fundamental (build) to specific (implementation checks)
assessment_criteria.md (required for LLM-as-a-Judge evaluation)
Per-task rubric. Structure:
# Assessment Criteria: <Task Name>
Evaluate across N dimensions. Each dimension uses its own scale (0–`max_scor