NASDE Benchmark from Git History

Generate NASDE benchmark tasks by mining git history. You analyze commits, diffs, and PR descriptions to identify self-contained changes that make good evaluation candidates, then generate task files with user approval.

Prerequisites

- A git repository with meaningful commit history (the repo you're currently in, or a path to another local repo)
- An existing NASDE benchmark project (run nasde init first, or use the nasde-benchmark-creator skill)

If the benchmark project doesn't exist yet, create it first. This skill generates tasks, not the project scaffold.

Critical: line endings on Windows (read this first)

When generating tests/test.sh, solution/solve.sh, or environment/Dockerfile on a Windows host, write them with LF line endings or every trial fails with bash: required file not found (the kernel reads #!/bin/bash\r as the shebang). See the full explanation and .gitattributes template in the nasde-benchmark-creator skill.

Quick rules:

- The benchmark project MUST have a .gitattributes enforcing *.sh text eol=lf and Dockerfile text eol=lf. nasde init creates this. If the existing project lacks it, create .gitattributes before generating any task files.
- When writing files programmatically, use path.write_text(content, encoding="utf-8", newline="") (the newline parameter requires Python 3.10+). Never rely on the bare default, which translates \n to \r\n on Windows.
- Sanity-check after generation: find tasks/<new-task> -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF should print nothing.
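
The write-and-check rules above can be sketched in Python. This is a minimal illustration, not part of NASDE: the helper names write_lf and find_crlf_files are hypothetical, and open(..., newline="\n") is used in place of path.write_text so the sketch also runs on Python versions older than 3.10.

```python
from pathlib import Path


def write_lf(path: Path, content: str) -> None:
    """Write text with LF line endings regardless of the host OS."""
    normalized = content.replace("\r\n", "\n")
    path.parent.mkdir(parents=True, exist_ok=True)
    # newline="\n" disables the platform-specific \n -> \r\n translation.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(normalized)


def find_crlf_files(task_dir: Path) -> list[Path]:
    """Return shell scripts and Dockerfiles under task_dir that contain CRLF."""
    offenders = []
    for candidate in sorted(task_dir.rglob("*")):
        if not candidate.is_file():
            continue
        if candidate.suffix == ".sh" or candidate.name == "Dockerfile":
            if b"\r\n" in candidate.read_bytes():
                offenders.append(candidate)
    return offenders
```

An empty list from find_crlf_files corresponds to the find/grep check printing nothing.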

Step 1: Identify the source repository and commit range

Ask the user:

Which repository? Default: the current working directory. Can also be a path to another local repo.
What commit range? Options:
- A branch name (analyze all commits on that branch)
- A commit range (abc123..def456)
- Last N commits (HEAD~20..HEAD)
- Specific PR numbers (if the repo has a GitHub remote, use gh pr view)
- "Just show me good candidates" — scan the last 50 commits and filter

If the user picks the last option ("Just show me good candidates"), proceed to Step 2 with the last 50 commits.
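
One way to turn the user's answer into a git log invocation is sketched below. The function name git_log_args and the tab-separated output format are assumptions made for illustration. The fallback uses --max-count rather than HEAD~50..HEAD so that it also works in repositories with fewer than 50 commits.

```python
from typing import Optional


def git_log_args(rev_range: Optional[str] = None, last_n: int = 50) -> list[str]:
    """Build a `git log` command line for the commit range chosen in Step 1."""
    # %H = commit hash, %P = parent hashes, %s = subject; %x09 emits a tab.
    args = ["git", "log", "--format=%H%x09%P%x09%s"]
    if rev_range:
        # A branch name or an explicit range such as "abc123..def456".
        args.append(rev_range)
    else:
        # Fallback for "just show me good candidates": cap by commit count.
        args.append(f"--max-count={last_n}")
    return args
```

The returned list can be passed to subprocess.run with cwd set to the source repository.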

Step 2: Scan commits and identify candidates

For each commit in the range, read the diff and evaluate whether it's a good benchmark candidate.

Good candidates have:

- A self-contained change (clear before/after state: one commit or a squashed PR)
- A well-defined problem statement (readable from commit message, PR title, or linked issue)
- Existing tests that can serve as a verifier, OR a change that's testable by inspection
- Reasonable scope: not too trivial (typo fix) and not too large (multi-week refactor)
- A clean "before" state: the parent commit should build and run successfully

Bad candidates (skip these):

- Merge commits with no meaningful diff
- Dependency updates, lockfile changes, CI config tweaks
- Changes that span too many unrelated files (shotgun surgery)
- Changes that require external systems not reproducible in Docker (third-party API keys, specific databases with production data)
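
The first three skip rules lend themselves to a mechanical pre-filter, sketched here. The file names, path prefixes, and the 15-file limit are hypothetical defaults to tune per repository. The fourth rule (external systems) is not covered and still needs a human read of the diff.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical heuristics; adjust the names and the limit for your repository.
NOISE_FILENAMES = {"package-lock.json", "yarn.lock", "Cargo.lock", "poetry.lock", "go.sum"}
NOISE_PREFIXES = (".github/", ".circleci/")
MAX_FILES = 15


def is_noise(path: str) -> bool:
    """True for lockfiles and CI configuration files."""
    return PurePosixPath(path).name in NOISE_FILENAMES or path.startswith(NOISE_PREFIXES)


def skip_reason(parent_count: int, files_changed: list[str]) -> Optional[str]:
    """Return why a commit should be skipped, or None if it deserves a closer look."""
    if parent_count > 1:
        return "merge commit"
    if not files_changed:
        return "empty diff"
    if all(is_noise(path) for path in files_changed):
        return "only lockfile or CI changes"
    if len(files_changed) > MAX_FILES:
        return "touches too many files"
    return None
```

A commit that passes this filter is only a candidate; the diff still has to be read against the "good candidate" criteria.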

For each candidate, extract:

- before_ref: the parent commit hash (the state the agent will start from)
- after_ref: the commit hash (the reference solution)
- description: what the change does (from commit message / PR description)
- files_changed: list of modified files
- has_tests: whether the commit includes test changes
- estimated_difficulty: easy / intermediate / hard (based on diff size and complexity)
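
The extracted fields could be held in a small record like the one below. This is a sketch: the lines_changed input, the "test" path check, and the size thresholds are assumptions, and diff size alone does not capture complexity.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    before_ref: str
    after_ref: str
    description: str
    files_changed: list[str]
    lines_changed: int = 0  # helper input, not one of the fields listed above

    @property
    def has_tests(self) -> bool:
        # Assumption: test files have "test" somewhere in their path.
        return any("test" in path.lower() for path in self.files_changed)

    @property
    def estimated_difficulty(self) -> str:
        # Hypothetical thresholds on diff size; complexity needs a human look.
        if self.lines_changed <= 30 and len(self.files_changed) <= 2:
            return "easy"
        if self.lines_changed <= 200:
            return "intermediate"
        return "hard"
```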

Step 3: Present candidates to the user

Present a numbered list of candidates. For each one, show:

[1] abc1234 — "Add discount calculation for threshold-based pricing"
    Files: src/Pricing/ThresholdDiscount.cs, tests/Pricing/ThresholdDiscountTests.cs
    Difficulty: intermediate | Has tests: yes
    Before: abc1233 (parent commit)

[2] def5678 — "Fix race condition in order processing pipeline"
    Files: src/Orders/OrderProcessor.cs, src/Orders/OrderLock.cs
    Difficulty: hard | Has tests: yes
    Before: def5677 (parent commit)

[3] ...

Ask the user to select which candidates to turn into tasks (comma-separated numbers, or "all").

For each selected candidate, proceed to Step 4.
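
Parsing the user's reply might look like the following sketch. The function name parse_selection is hypothetical; it accepts "all" or comma-separated 1-based numbers and rejects anything outside the presented list.

```python
def parse_selection(reply: str, candidate_count: int) -> list[int]:
    """Turn a reply like "1, 3" or "all" into sorted 1-based candidate numbers."""
    text = reply.strip().lower()
    if text == "all":
        return list(range(1, candidate_count + 1))
    chosen = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        number = int(part)  # raises ValueError on non-numeric input
        if not 1 <= number <= candidate_count:
            raise ValueError(f"candidate {number} is out of range 1..{candidate_count}")
        chosen.add(number)
    return sorted(chosen)
```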

Step 4: Generate task files for each selected candidate

For each approved candidate, generate the full task directory structure. Work through each file with the user — present it, get approval or edits, then write it.

4a: task.toml (single task config, shared with Harbor)

Generate from commit metadata. nasde-specific fields go under [nasde.*].

version = "1.0"

[task]
name = "<benchmark-name>/<slugified-commit-description>"    # Harbor requires org/name format
description = "<commit message, cleaned up>"

[metadata]
difficulty = "<estimated_difficulty>"
language = "<detected-language>"
framework = "<detected-framework>"
source_commit = "<after_ref>"

[agent]
timeout_sec = 1800          # Rule of thumb: estimated_time_minutes × 60

[environment]
memory_mb = 4096            # Claude Code needs 4096+; the default of 2048 is too low.

[verifier]
timeout_sec = 300           # Timeout for tests/test.sh

[nasde.source]              # Only needed when task has no environment/Dockerfile (auto-generation).
git = "<repo-url-or-local-path>"
ref = "<before_ref>"
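
The name and timeout_sec values can be derived from commit metadata roughly as follows. This is a sketch: the slug character set (lowercase letters, digits, hyphens) and the 60-character cap are assumptions, not documented Harbor rules.

```python
import re


def slugify(text: str, max_length: int = 60) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_length].rstrip("-")


def task_name(benchmark_name: str, commit_subject: str) -> str:
    """Build the org/name style identifier used in [task] name."""
    return f"{benchmark_name}/{slugify(commit_subject)}"


def agent_timeout_sec(estimated_time_minutes: int) -> int:
    """Rule of thumb from the template: estimated minutes times 60."""
    return estimated_time_minutes * 60
```

For example, a 30-minute estimate gives the 1800-second timeout shown in the template.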

For [nasde.source] git:

- If the repo has a public remote: use the HTTPS clone URL
- If the repo is local-only (no public remote): use the absolute local path
- Ask the user if unsure
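
The choice of git value can be sketched as below. The helper names are hypothetical, and the code cannot tell whether a remote is actually public, so that part still falls under "ask the user if unsure". Only the scp-style SSH form (git@host:org/repo.git) is rewritten; other URL forms pass through unchanged.

```python
import re
from pathlib import Path
from typing import Optional


def to_https_url(remote_url: str) -> str:
    """Rewrite an scp-style SSH remote (git@host:org/repo.git) as HTTPS."""
    match = re.fullmatch(r"git@([^:]+):(.+)", remote_url)
    if match:
        return f"https://{match.group(1)}/{match.group(2)}"
    return remote_url  # already HTTPS, or a form this sketch does not handle


def nasde_source_git(remote_url: Optional[str], repo_path: str) -> str:
    """Pick the [nasde.source] git value: HTTPS remote if any, else absolute path."""
    if remote_url:
        return to_https_url(remote_url)
    return str(Path(repo_path).resolve())
```

The remote URL itself would typically come from git remote get-url origin.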

4b: instruction.md

Generate from the commit message, PR description (if available via gh), and the diff:

# Task: <Human-readable task name>

## Context
You are working in a <language/framework> codebase located at `/app`.
<Brief description of the project and the area of code being modified.>

## Requirement
<What the agent must implement/fix/change. Derived from the commit message and diff.
Be specific — describe the expected behavior, not the implementation approach.
Include concrete examples where possible.>

## Scope
- Files likely to be modified: <list based on the actual commit diff>
- Do NOT modify: <files outside the commit's scope, especially tests if they exist>

## Quality Expectations
<Inferred from the codebase style — mention patterns visible in surrounding code.>

## Success Criteria
<Numbered list derived from what the commit actually changed and what tests verify.>

Important: The instruction must describe the problem to solve, not the solution. Don't leak implementation details from the actual commit diff into the instruction. The agent should arrive at a solution independently.

Present the generated instruction to the user for review. They may want to:

- Remove implementation hints that leak from the diff
- Add context only they know (business rules, team conventions)
- Adjust scope (widen or narrow what the agent should touch)
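
Leaked implementation hints can be partly flagged before the user review. The sketch below is a rough heuristic, not a NASDE feature: it reports identifiers of six or more characters that the commit introduces and that also appear in the instruction text. Expect false positives (ordinary words used in new code) and treat the output as prompts for review.

```python
import re

# Assumption: names shorter than six characters are too generic to be useful.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{5,}")


def added_identifiers(diff_text: str) -> set[str]:
    """Identifiers found on added lines but not on removed or context lines."""
    added, existing = set(), set()
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---", "@@", "diff ", "index ")):
            continue  # unified-diff headers
        target = added if line.startswith("+") else existing
        target.update(IDENTIFIER.findall(line))
    return added - existing


def leaked_identifiers(instruction: str, diff_text: str) -> list[str]:
    """Identifiers introduced by the commit that the instruction mentions."""
    words = set(IDENTIFIER.findall(instruction))
    return sorted(words & added_identifiers(diff_text))
```

A non-empty result suggests rewording the instruction in terms of expected behavior rather than the names the reference solution happens to use.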

4c: environment/Dockerfile

Generate based on the repo's tech stack (detected from files like package.json, *.csproj, Cargo.toml, requirements.txt, go.mod):

FROM <base-image-for-detected-stack>

RUN apt-get update && apt-get install -y git curl wget ca-certificates && rm -rf /var/lib/apt/lists/*

WORKDIR /app
# Clone at the "before" state — the commit BEFORE the fix
RUN git clone <repo-url> . && git checkout <before_ref>

# Install dependencies
RUN <dependency-install-command>

# Verify the environment builds
RUN <build-command>

CMD ["/bin/bash"]

Base image selection:

- .csproj / .sln → mcr.microsoft.com/dotnet/sdk:8.0
- package.json → node:20
- requirements.txt / pyproject.toml → python:3.12
- Cargo.toml → rust:1.78
- go.mod → golang:1.22
- Other → ask the user
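
The mapping above can be expressed as a small lookup, sketched here. Two assumptions: only the repository root is inspected (project files in subdirectories are missed), and when several markers are present the first match in the listed order wins. A None result corresponds to "ask the user".

```python
from pathlib import Path
from typing import Optional

# Marker-file patterns and images copied from the list above, in the same order.
BASE_IMAGES = [
    (("*.csproj", "*.sln"), "mcr.microsoft.com/dotnet/sdk:8.0"),
    (("package.json",), "node:20"),
    (("requirements.txt", "pyproject.toml"), "python:3.12"),
    (("Cargo.toml",), "rust:1.78"),
    (("go.mod",), "golang:1.22"),
]


def detect_base_image(repo_root: Path) -> Optional[str]:
    """Return the base image for the first marker file found in repo_root."""
    for patterns, image in BASE_IMAGES:
        # glob() is non-recursive here, so only top-level files are matched.
        if any(next(repo_root.glob(pattern), None) for pattern in patterns):
            return image
    return None  # unknown stack: ask the user
```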

4d: tests/test.sh

If the commit includes test files, generate a verifier that runs those tests:

#!/bin/bash
cd /app

echo "Step 1: Verifying build..."
if <build-command>; then
    echo "✓ Build succeeded"
else
    echo "✗ Build failed"
    echo 0 > /logs/verifi
