NASDE Benchmark from Git History
Generate NASDE benchmark tasks by mining git history. You analyze commits, diffs, and PR descriptions to identify self-contained changes that make good evaluation candidates, then generate task files with user approval.
Prerequisites
- A git repository with meaningful commit history (the repo you're currently in, or a path to another local repo)
- An existing NASDE benchmark project (run nasde init first, or use the nasde-benchmark-creator skill)
- If the benchmark project doesn't exist yet, create it first; this skill generates tasks, not the project scaffold
Critical: line endings on Windows (read this first)
When generating tests/test.sh, solution/solve.sh, or environment/Dockerfile on a Windows host, write them with LF line endings or every trial fails with bash: required file not found (the kernel reads #!/bin/bash\r as the shebang). See the full explanation and .gitattributes template in the nasde-benchmark-creator skill.
Quick rules:
- The benchmark project MUST have a .gitattributes enforcing *.sh text eol=lf and Dockerfile text eol=lf. nasde init creates this. If the existing project lacks it, create .gitattributes before generating any task files.
- When writing files programmatically, use path.write_text(content, encoding="utf-8", newline=""), never the bare default, which translates \n → \r\n on Windows. The newline parameter of write_text requires Python 3.10 or later.
- Sanity-check after generation: find tasks/<new-task> -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF should print nothing.
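The quick rules above can be sketched in Python as follows. This is a minimal illustration, not part of NASDE: the helper names write_lf and find_crlf_files are hypothetical, and the sketch writes bytes directly, which is an alternative to the newline="" approach that also works on Python versions before 3.10.

```python
from pathlib import Path


def write_lf(path: Path, content: str) -> None:
    """Write text with LF line endings regardless of the host OS."""
    # Normalize any CRLF already present, then write bytes so that no
    # platform newline translation can happen.
    normalized = content.replace("\r\n", "\n")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(normalized.encode("utf-8"))


def find_crlf_files(task_dir: Path) -> list[Path]:
    """Return shell scripts and Dockerfiles under task_dir that contain CRLF."""
    offenders = []
    for candidate in sorted(task_dir.rglob("*")):
        is_target = candidate.suffix == ".sh" or candidate.name == "Dockerfile"
        if candidate.is_file() and is_target:
            if b"\r\n" in candidate.read_bytes():
                offenders.append(candidate)
    return offenders
```

An empty list from find_crlf_files corresponds to the find/grep sanity check printing nothing.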
Step 1: Identify the source repository and commit range
Ask the user:
- Which repository? Default: the current working directory. Can also be a path to another local repo.
- What commit range? Options:
  - A branch name (analyze all commits on that branch)
  - A commit range (abc123..def456)
  - Last N commits (HEAD~20..HEAD)
  - Specific PR numbers (if the repo has a GitHub remote, use gh pr view)
  - "Just show me good candidates": scan the last 50 commits and filter
If the user says "just find good candidates," proceed to Step 2 with the last 50 commits.
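One way to turn the user's answer into something git can consume is sketched below. The function names and the kind labels ("branch", "range", "last", "auto") are hypothetical. PR lookup through gh is not covered. The sketch assumes git is on PATH and uses git rev-list with --no-merges.

```python
import subprocess

DEFAULT_SCAN_DEPTH = 50  # "just find good candidates" scans the last 50 commits


def commit_range_spec(kind: str, value: str = "") -> str:
    """Translate the user's answer into a git revision spec."""
    if kind == "branch":
        return value
    if kind == "range":
        if ".." not in value:
            raise ValueError(f"expected <from>..<to>, got {value!r}")
        return value
    if kind == "last":
        count = int(value)
        if count <= 0:
            raise ValueError("commit count must be positive")
        return f"HEAD~{count}..HEAD"
    if kind == "auto":
        return f"HEAD~{DEFAULT_SCAN_DEPTH}..HEAD"
    raise ValueError(f"unknown range kind: {kind!r}")


def list_commits(repo: str, spec: str) -> list[str]:
    """Return commit hashes for the spec, newest first, skipping merges."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--no-merges", spec],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()
```

Note that a spec such as HEAD~50..HEAD fails in a repository with fewer than 51 commits on the first-parent chain, so a real implementation would need a fallback for short histories.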
Step 2: Scan commits and identify candidates
For each commit in the range, read the diff and evaluate whether it's a good benchmark candidate.
Good candidates have:
- A self-contained change (clear before/after state — one commit or a squashed PR)
- A well-defined problem statement (readable from commit message, PR title, or linked issue)
- Existing tests that can serve as a verifier, OR a change that's testable by inspection
- Reasonable scope — not too trivial (typo fix) and not too large (multi-week refactor)
- A clean "before" state — the parent commit should build and run successfully
Bad candidates (skip these):
- Merge commits with no meaningful diff
- Dependency updates, lockfile changes, CI config tweaks
- Changes that span too many unrelated files (shotgun surgery)
- Changes that require external systems not reproducible in Docker (third-party API keys, specific databases with production data)
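Part of the skip list can be pre-screened mechanically before reading each diff. The sketch below is a rough heuristic under stated assumptions: the CommitInfo type, the noise file lists, and the MAX_TOP_LEVEL_DIRS threshold are all hypothetical, and it cannot detect reliance on external systems, which still needs a human or a closer read.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lists; extend them for the repository at hand.
NOISE_NAMES = {"package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock", "go.sum"}
NOISE_PREFIXES = (".github/", ".circleci/")
MAX_TOP_LEVEL_DIRS = 3  # hypothetical cutoff for "shotgun surgery"


@dataclass
class CommitInfo:
    sha: str
    parent_count: int
    files: list[str]


def is_noise(path: str) -> bool:
    """True for lockfiles and CI configuration files."""
    name = path.rsplit("/", 1)[-1]
    return name in NOISE_NAMES or path.startswith(NOISE_PREFIXES)


def skip_reason(commit: CommitInfo) -> Optional[str]:
    """Return why a commit should be skipped, or None to keep it."""
    if commit.parent_count > 1:
        return "merge commit"
    if not commit.files:
        return "empty diff"
    if all(is_noise(f) for f in commit.files):
        return "dependency, lockfile, or CI-only change"
    top_level = {f.split("/", 1)[0] for f in commit.files}
    if len(top_level) > MAX_TOP_LEVEL_DIRS:
        return "touches too many unrelated areas"
    return None
```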
For each candidate, extract:
- before_ref: the parent commit hash (the state the agent will start from)
- after_ref: the commit hash (the reference solution)
- description: what the change does (from commit message / PR description)
- files_changed: list of modified files
- has_tests: whether the commit includes test changes
- estimated_difficulty: easy / intermediate / hard (based on diff size and complexity)
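The extracted fields map naturally onto a small record type. In the sketch below the field names follow the list above, while the Candidate class, the difficulty thresholds, and the test-file heuristic are illustrative assumptions rather than anything NASDE prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    before_ref: str            # parent commit: the agent's starting state
    after_ref: str             # the commit itself: the reference solution
    description: str
    files_changed: list[str] = field(default_factory=list)
    has_tests: bool = False
    estimated_difficulty: str = "intermediate"


def estimate_difficulty(lines_changed: int, file_count: int) -> str:
    """Bucket a diff by size. The cutoffs are hypothetical starting points."""
    if lines_changed <= 30 and file_count <= 2:
        return "easy"
    if lines_changed <= 200 and file_count <= 8:
        return "intermediate"
    return "hard"


def includes_tests(files_changed: list[str]) -> bool:
    """Guess whether a commit touches tests, based on path names only."""
    return any("test" in path.lower() for path in files_changed)
```

Diff size is only a proxy for complexity, so the estimate is best treated as a default the user can override when candidates are presented.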
Step 3: Present candidates to the user
Present a numbered list of candidates. For each one, show:
[1] abc1234 — "Add discount calculation for threshold-based pricing"
Files: src/Pricing/ThresholdDiscount.cs, tests/Pricing/ThresholdDiscountTests.cs
Difficulty: intermediate | Has tests: yes
Before: abc1233 (parent commit)
[2] def5678 — "Fix race condition in order processing pipeline"
Files: src/Orders/OrderProcessor.cs, src/Orders/OrderLock.cs
Difficulty: hard | Has tests: yes
Before: def5677 (parent commit)
[3] ...
Ask the user to select which candidates to turn into tasks (comma-separated numbers, or "all").
For each selected candidate, proceed to Step 4.
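Parsing the user's reply (comma-separated numbers or "all") could look like the following sketch. The function name parse_selection is hypothetical. It rejects out-of-range numbers instead of silently ignoring them.

```python
def parse_selection(answer: str, candidate_count: int) -> list[int]:
    """Parse '1, 3' or 'all' into 1-based candidate numbers, in input order."""
    text = answer.strip().lower()
    if text == "all":
        return list(range(1, candidate_count + 1))
    chosen: list[int] = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        number = int(token)  # raises ValueError for non-numeric input
        if not 1 <= number <= candidate_count:
            raise ValueError(f"candidate {number} is out of range")
        if number not in chosen:  # drop duplicates, keep first occurrence
            chosen.append(number)
    return chosen
```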
Step 4: Generate task files for each selected candidate
For each approved candidate, generate the full task directory structure. Work through each file with the user — present it, get approval or edits, then write it.
4a: task.toml (single task config, shared with Harbor)
Generate from commit metadata. nasde-specific fields go under [nasde.*].
version = "1.0"
[task]
name = "<benchmark-name>/<slugified-commit-description>" # Harbor requires org/name format
description = "<commit message, cleaned up>"
[metadata]
difficulty = "<estimated_difficulty>"
language = "<detected-language>"
framework = "<detected-framework>"
source_commit = "<after_ref>"
[agent]
timeout_sec = 1800 # Rule of thumb: estimated_time_minutes × 60
[environment]
memory_mb = 4096 # Claude Code needs 4096+, default 2048 is too low.
[verifier]
timeout_sec = 300 # Timeout for tests/test.sh
[nasde.source] # Only needed when task has no environment/Dockerfile (auto-generation).
git = "<repo-url-or-local-path>"
ref = "<before_ref>"
For [nasde.source] git:
- If the repo has a public remote: use the HTTPS clone URL
- If the repo is local-only (no public remote): use the absolute local path
- Ask the user if unsure
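The task name and the [nasde.source] git value can be derived with a few small helpers. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the 60-character slug limit is arbitrary, and the code cannot tell whether a remote is actually public, so that question still goes to the user.

```python
import os
import re
from typing import Optional


def slugify(description: str, max_length: int = 60) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', and trim."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return slug[:max_length].rstrip("-")


def task_name(benchmark: str, description: str) -> str:
    """Build the org/name style identifier used in task.toml."""
    return f"{benchmark}/{slugify(description)}"


def ssh_to_https(remote: str) -> str:
    """Rewrite git@host:org/repo.git as an HTTPS clone URL."""
    match = re.fullmatch(r"git@([^:]+):(.+)", remote)
    if match:
        return f"https://{match.group(1)}/{match.group(2)}"
    return remote


def source_git(remote_url: Optional[str], repo_path: str) -> str:
    """Prefer the remote's HTTPS URL; fall back to the absolute local path."""
    if remote_url:
        return ssh_to_https(remote_url)
    return os.path.abspath(repo_path)
```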
4b: instruction.md
Generate from the commit message, PR description (if available via gh), and the diff:
# Task: <Human-readable task name>
## Context
You are working in a <language/framework> codebase located at `/app`.
<Brief description of the project and the area of code being modified.>
## Requirement
<What the agent must implement/fix/change. Derived from the commit message and diff.
Be specific — describe the expected behavior, not the implementation approach.
Include concrete examples where possible.>
## Scope
- Files likely to be modified: <list based on the actual commit diff>
- Do NOT modify: <files outside the commit's scope, especially tests if they exist>
## Quality Expectations
<Inferred from the codebase style — mention patterns visible in surrounding code.>
## Success Criteria
<Numbered list derived from what the commit actually changed and what tests verify.>
Important: The instruction must describe the problem to solve, not the solution. Don't leak implementation details from the actual commit diff into the instruction. The agent should arrive at a solution independently.
Present the generated instruction to the user for review. They may want to:
- Remove implementation hints that leak from the diff
- Add context only they know (business rules, team conventions)
- Adjust scope (widen or narrow what the agent should touch)
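A crude automated check can flag some implementation leaks before the user reviews the instruction. The sketch below assumes a unified diff as input and treats any identifier that appears only on added lines as "new". The function names and the minimum identifier length are hypothetical, and a clean result does not prove the instruction is leak-free.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _identifiers(lines: list[str]) -> set[str]:
    found: set[str] = set()
    for line in lines:
        found.update(IDENTIFIER.findall(line))
    return found


def possible_leaks(instruction: str, diff_text: str, min_length: int = 6) -> list[str]:
    """List identifiers introduced by the diff that also appear in the instruction."""
    added, existing = [], []
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file headers, not content
        if line.startswith("+"):
            added.append(line[1:])
        elif line.startswith("-") or line.startswith(" "):
            existing.append(line[1:])
    # Names that exist only on added lines were introduced by the commit.
    new_names = _identifiers(added) - _identifiers(existing)
    mentioned = set(IDENTIFIER.findall(instruction))
    return sorted(n for n in new_names & mentioned if len(n) >= min_length)
```

Any name it reports is a prompt to rephrase the requirement in terms of behavior rather than the reference solution's symbols.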
4c: environment/Dockerfile
Generate based on the repo's tech stack (detected from files like package.json, *.csproj, Cargo.toml, requirements.txt, go.mod):
FROM <base-image-for-detected-stack>
RUN apt-get update && apt-get install -y git curl wget ca-certificates && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Clone at the "before" state — the commit BEFORE the fix
RUN git clone <repo-url> . && git checkout <before_ref>
# Install dependencies
RUN <dependency-install-command>
# Verify the environment builds
RUN <build-command>
CMD ["/bin/bash"]
Base image selection:
- .csproj / .sln → mcr.microsoft.com/dotnet/sdk:8.0
- package.json → node:20
- requirements.txt / pyproject.toml → python:3.12
- Cargo.toml → rust:1.78
- go.mod → golang:1.22
- Other → ask the user
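The mapping above can be expressed as a lookup over marker files. This sketch makes two assumptions not stated in the text: markers are searched recursively, and when several stacks are present the first entry in the table wins. A None result corresponds to "ask the user".

```python
from pathlib import Path
from typing import Optional

# (marker file patterns, base image), checked in order.
BASE_IMAGES = [
    (("*.csproj", "*.sln"), "mcr.microsoft.com/dotnet/sdk:8.0"),
    (("package.json",), "node:20"),
    (("requirements.txt", "pyproject.toml"), "python:3.12"),
    (("Cargo.toml",), "rust:1.78"),
    (("go.mod",), "golang:1.22"),
]


def detect_base_image(repo_root: Path) -> Optional[str]:
    """Return the base image for the first matching marker, else None."""
    for patterns, image in BASE_IMAGES:
        for pattern in patterns:
            # rglob also finds markers in subdirectories (e.g. src/App/App.csproj).
            if next(repo_root.rglob(pattern), None) is not None:
                return image
    return None
```

In a polyglot repository the recursive search may pick up vendored files, so the detected image is better presented to the user as a suggestion than applied silently.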
4d: tests/test.sh
If the commit includes test files, generate a verifier that runs those tests:
#!/bin/bash
cd /app
echo "Step 1: Verifying build..."
if <build-command>; then
echo "✓ Build succeeded"
else
echo "✗ Build failed"
echo 0 > /logs/verifi