NASDE Benchmark from Public Repos
Build a diverse NASDE benchmark by curating tasks from multiple public GitHub repositories. Designed for validating universal skills — skills that should work across different languages, frameworks, project sizes, and architectural styles.
Prerequisites
- An existing NASDE benchmark project (run `nasde init` first, or use the `nasde-benchmark-creator` skill)
- A clear description of the skill being evaluated (what it does, what kinds of tasks it helps with)
- Internet access (to browse and clone public repositories)
Critical: line endings on Windows (read this first)
When generating tests/test.sh, solution/solve.sh, or environment/Dockerfile on a Windows host, write them with LF line endings or every trial fails with bash: required file not found (the kernel reads #!/bin/bash\r as the shebang). See the full explanation and .gitattributes template in the nasde-benchmark-creator skill.
Quick rules:
- The benchmark project MUST have a `.gitattributes` enforcing `*.sh text eol=lf` and `Dockerfile text eol=lf`. `nasde init` creates this. If the existing project lacks it, create `.gitattributes` before generating any task files.
- When writing files programmatically, use `path.write_text(content, encoding="utf-8", newline="")`, never the bare default, which translates `\n` to `\r\n` on Windows.
- Sanity-check after generation: `find tasks/<new-task> -name '*.sh' -o -name 'Dockerfile' | xargs file | grep CRLF` should print nothing.
Step 1: Understand the skill under test
Ask the user:
- What does the skill do? (e.g., "helps agents refactor code", "guides test writing", "enforces DDD patterns")
- What languages/frameworks should it support? (e.g., "Python, TypeScript, and Go" or "any language")
- What task types exercise the skill? (e.g., "extract method, rename module, split class" for a refactoring skill)
- Are there known weak spots? (e.g., "seems to struggle with large files" or "not sure about Rust")
Step 2: Design the diversity matrix
Based on the skill description, define axes of variation that the benchmark should cover. Present these to the user as a table:
Example for a refactoring skill
| Axis | Values to cover | Why it matters |
|---|---|---|
| Language | Python, TypeScript, Go, Rust, C# | Refactoring idioms differ per language |
| Project size | Small (<5K LOC), Medium (5-50K), Large (>50K) | Large codebases stress navigation and context |
| Test coverage | Extensive tests, Minimal tests, No tests | Refactoring with no safety net is harder |
| Architecture | Monolith, Microservice, Library | Different refactoring patterns apply |
| Difficulty | Extract function, Split module, Restructure package | Increasing complexity |
Not every cell in the matrix needs a task. Aim for 8–15 tasks that provide meaningful coverage across the axes. Ask the user which axes matter most — they may want to emphasize language diversity over project size, or vice versa.
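One way to keep track of which matrix values still lack a task is a small coverage check. The sketch below is illustrative only: the `MATRIX` contents and the `coverage_gaps` helper are hypothetical, and it assumes each task is tagged with strings of the form `axis:value`, the same shape used for `diversity_axes` in task.toml.

```python
from collections import defaultdict

# Hypothetical matrix for a refactoring skill: axis -> values to cover.
MATRIX = {
    "language": ["python", "typescript", "go", "rust", "csharp"],
    "size": ["small", "medium", "large"],
    "tests": ["extensive", "minimal", "none"],
}


def coverage_gaps(matrix, task_tags):
    """Return {axis: [values with no task]} given per-task "axis:value" tags."""
    covered = defaultdict(set)
    for tags in task_tags:
        for tag in tags:
            axis, _, value = tag.partition(":")
            covered[axis].add(value)
    gaps = {}
    for axis, values in matrix.items():
        missing = [v for v in values if v not in covered[axis]]
        if missing:
            gaps[axis] = missing
    return gaps


# Three planned tasks, each described by its tags.
tasks = [
    ["language:python", "size:medium", "tests:extensive"],
    ["language:typescript", "size:large", "tests:minimal"],
    ["language:go", "size:small", "tests:none"],
]
gaps = coverage_gaps(MATRIX, tasks)  # {"language": ["rust", "csharp"]}
```

Here the size and test-coverage axes are fully covered, so the remaining tasks can be spent on Rust and C#.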
Step 3: Find candidate repositories
For each cell in the matrix that needs coverage, search for public repositories that fit.
Good source repositories have:
- A clear, active codebase (not abandoned, not a tutorial/toy project)
- A working build system and some test infrastructure
- A well-understood structure (README, organized directories)
- A permissive license (MIT, Apache 2.0, BSD — avoid GPL if the benchmark may be shared)
- Enough complexity to be a meaningful test (not a single-file script)
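The checklist above can be applied consistently by recording the facts for each candidate and screening them with one function. This is a sketch under assumptions: the `Candidate` record, the `screen` helper, and the 1,000-line `min_loc` threshold are all hypothetical, and the facts themselves still have to be gathered by inspecting the repository.

```python
from dataclasses import dataclass

# SPDX identifiers for the permissive licenses named in the checklist.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


@dataclass
class Candidate:
    """Facts collected about one repository (hypothetical record)."""
    url: str
    license_id: str   # SPDX identifier, e.g. "MIT"
    loc: int          # approximate lines of code
    has_build: bool   # a working build system exists
    has_tests: bool   # some test infrastructure exists
    archived: bool    # archived or visibly abandoned


def screen(c: Candidate, min_loc: int = 1000) -> list[str]:
    """Return the reasons a candidate fails the checklist; empty means it passes."""
    reasons = []
    if c.archived:
        reasons.append("repository is archived or abandoned")
    if not (c.has_build and c.has_tests):
        reasons.append("missing build system or test infrastructure")
    if c.license_id not in PERMISSIVE:
        reasons.append(f"license {c.license_id} is not on the permissive list")
    if c.loc < min_loc:
        reasons.append("too small to be a meaningful test")
    return reasons
```

Returning reasons rather than a boolean makes it easy to show the user why a repository was dropped. Structure quality (README, directory layout) is left out because it needs human judgment.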
Search strategies:
- GitHub search — search by language, stars, topic tags
- Known ecosystem repos — well-known open source projects in each language (e.g., FastAPI for Python, Express for Node, Gin for Go)
- GitHub Trending — find actively maintained repos with good structure
- User suggestions — the user may know repos that represent their target audience
For each candidate repo, present:
[1] github.com/user/repo — "Description from GitHub"
Language: Python | Size: ~15K LOC | Stars: 2.3K | License: MIT
Tests: pytest suite, good coverage
Why: Medium Python project, clean architecture, good refactoring target
Proposed task: "Extract the database access layer into a repository pattern"
[2] github.com/user/repo2 — "Description from GitHub"
Language: TypeScript | Size: ~40K LOC | Stars: 890 | License: Apache 2.0
Tests: Jest, moderate coverage
Why: Medium TS project (near the large end), component-heavy, tests UI refactoring
Proposed task: "Split the UserDashboard component into focused sub-components"
Ask the user to select which repos and tasks to include.
Step 4: Create tasks for each selected repo
For each approved repo+task pair, generate the full task directory. Work through each file with the user.
4a: Determine the "before" state
Unlike nasde-benchmark-from-history (which uses a specific commit), here you choose a state of the repo that presents the problem to solve:
- Option A: Current main branch. The repo as-is has the problem (e.g., a God class that should be split). Set `source.ref` to a specific commit hash on main for reproducibility.
- Option B: A tagged release. Use a specific version. More stable for long-lived benchmarks.
- Option C: Create a setup branch. If the task requires introducing a specific problem into a clean codebase, create a branch that sets up the scenario. Push it to a fork or document the setup in the Dockerfile.
Always pin to a specific commit hash, not a branch name — branches move, hashes don't.
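To turn a branch name into a hash to pin, one option is to ask the remote with `git ls-remote`, which prints one `<hash><TAB><ref>` line per ref. The sketch below assumes `git` is installed and the network is reachable; the helper names `parse_ls_remote` and `resolve_branch` are hypothetical.

```python
import subprocess


def parse_ls_remote(output: str) -> dict[str, str]:
    """Map ref names to commit hashes from `git ls-remote` output."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        refs[ref] = sha
    return refs


def resolve_branch(repo_url: str, branch: str = "main") -> str:
    """Return the hash the branch points to right now (needs git and network access)."""
    ref = f"refs/heads/{branch}"
    result = subprocess.run(
        ["git", "ls-remote", repo_url, ref],
        check=True, capture_output=True, text=True,
    )
    # Raises KeyError if the remote has no such branch.
    return parse_ls_remote(result.stdout)[ref]
```

Record the returned hash in the task config once, rather than resolving it on every run. For annotated tags, `git ls-remote` also lists a peeled entry ending in `^{}`, and that entry carries the commit hash.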
4b: task.toml (single task config, shared with Harbor)
version = "1.0"
[task]
name = "<benchmark-name>/<language>-<repo-slug>-<task-slug>" # Harbor requires org/name format
description = "<What the agent must do>"
[metadata]
difficulty = "<easy|intermediate|hard>"
language = "<language>"
framework = "<framework>"
source_repo = "https://github.com/<owner>/<repo>"
diversity_axes = ["<axis:value>", "<axis:value>"]
[agent]
timeout_sec = 1800 # Rule of thumb: estimated_time_minutes × 60
[environment]
memory_mb = 4096 # Claude Code needs 4096+, default 2048 is too low.
[verifier]
timeout_sec = 300
[nasde.source] # Only needed when task has no environment/Dockerfile (auto-generation).
git = "https://github.com/<owner>/<repo>.git"
ref = "<pinned-commit-hash>"
The [metadata] diversity_axes helps track coverage across the matrix. Always pin [nasde.source] ref to a specific commit hash, not a branch name.
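A quick lint can catch unpinned refs and untagged tasks before a run. The following is a minimal sketch, not a NASDE feature: `check_task_config` and `check_task_file` are hypothetical helpers, the check assumes full 40-character SHA-1 hashes, and `tomllib` is only available in the standard library from Python 3.11.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def check_task_config(config: dict) -> list[str]:
    """Return problems found in a parsed task.toml; an empty list means it looks fine."""
    problems = []
    org, _, name = config.get("task", {}).get("name", "").partition("/")
    if not org or not name or "/" in name:
        problems.append("[task] name is not in org/name format")
    axes = config.get("metadata", {}).get("diversity_axes", [])
    if not axes:
        problems.append("[metadata] diversity_axes is empty")
    for tag in axes:
        if ":" not in tag:
            problems.append(f"diversity axis {tag!r} is not of the form axis:value")
    # [nasde.source] is optional, so only validate it when present.
    source = config.get("nasde", {}).get("source")
    if source is not None and not FULL_SHA.fullmatch(source.get("ref", "")):
        problems.append("[nasde.source] ref is not a full commit hash")
    return problems


def check_task_file(path: str) -> list[str]:
    """Parse a task.toml file and lint it."""
    import tomllib  # imported lazily; standard library from Python 3.11

    with open(path, "rb") as fh:
        return check_task_config(tomllib.load(fh))
```

Running `check_task_file` over every `task.toml` in the project gives a single list of issues to resolve with the user.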
4c: instruction.md
Write a task instruction that:
- Describes the codebase context (what the project does, relevant directory structure)
- States the problem clearly (what needs to change and why)
- Defines success criteria the agent can verify
- Does NOT prescribe the implementation approach
# Task: <Descriptive name>
## Context
You are working in `/app`, a <language> <framework> project that <brief description>.
The project structure relevant to this task:
<tree of relevant directories/files>
## Requirement
<What needs to change. Describe the problem, not the solution.
Example: "The UserService class handles authentication, authorization, profile management,
and notification dispatch. It's 800 lines and growing. Separate these concerns into
focused services.">
## Scope
- Focus on: <specific files/directories>
- Do NOT modify: <test files, configuration, unrelated modules>
- Preserve: <all existing public APIs, test behavior>
## Quality Expectations
- Follow <language> idioms and the project's existing style
- Maintain or improve test coverage
- Keep changes minimal — change what needs changing, nothing more
## Success Criteria
1. <Specific, testable criterion>
2. <Specific, testable criterion>
3. All existing tests continue to pass
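A drafted instruction.md can be checked against the template above before it is shown to the user. This sketch is a heuristic under assumptions: the helper names are hypothetical, section names are taken from the template's level-2 headings, and the placeholder check only finds single-line `<...>` spans, so it may also flag legitimate angle-bracket text such as generics.

```python
import re

REQUIRED_SECTIONS = [
    "Context",
    "Requirement",
    "Scope",
    "Quality Expectations",
    "Success Criteria",
]


def missing_sections(markdown: str) -> list[str]:
    """Return template sections (level-2 headings) absent from the text."""
    found = {
        m.group(1).strip()
        for m in re.finditer(r"^##\s+(.+)$", markdown, re.MULTILINE)
    }
    return [s for s in REQUIRED_SECTIONS if s not in found]


def unfilled_placeholders(markdown: str) -> list[str]:
    """Return leftover single-line <...> placeholders copied from the template."""
    return re.findall(r"<[^<>\n]+>", markdown)


draft = (
    "# Task: Split UserService\n"
    "## Context\nYou are working in `/app`.\n"
    "## Requirement\nSeparate the concerns into focused services.\n"
    "## Scope\n- Focus on: <specific files/directories>\n"
    "## Success Criteria\n1. All existing tests continue to pass\n"
)
```

For this draft, `missing_sections(draft)` reports the absent Quality Expectations section and `unfilled_placeholders(draft)` reports the placeholder left in Scope.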
Important for universal skill benchmarks: Write the instruction as if the agent has no special skill. The skill configuration is