Skill Refiner: Iterative Self-Improvement Loop
Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch. Orchestrates repeated score-improve-verify cycles using skill-creator as the engine and mandatory peer review as an adversarial check (cross-model when a secondary harness is available, fresh-context self-review as the minimum fallback).
When to use
- Batch-improving the entire skill collection after a period of manual edits
- Targeted improvement of one named skill when the user explicitly asks for skill-refiner, multiple iterations, a score target, or "no ceiling" polish
- Running quality sweeps before a release or publish
- Triggering a self-improvement cycle where skills bootstrap each other
- After adding several new skills that need polish and consistency alignment
- When cross-model perspective would catch single-model blind spots
- Periodic maintenance: scheduled improvement runs to keep skills current
When NOT to use
- Creating a new skill from scratch - use skill-creator (Mode 1)
- Single-pass review of one skill without iteration, scoring, or peer review - use skill-creator (Mode 2)
- One-off collection audit without iteration - use skill-creator (Mode 3)
- Full codebase review (code, not skills) - use full-review
- Style/slop audit on application code - use anti-slop
Configuration
skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]
| Flag | Default | Description |
|---|---|---|
| --iterations | 10 | Maximum iterations for phase 1 |
| --mode | circuit-breaker | auto, circuit-breaker, or step |
| --secondary | auto-detect | Secondary harness for cross-model review, or none |
| --threshold | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| --plateau | 2 | Minimum score delta to keep iterating |
Environment override: SKILL_REFINER_SECONDARY=<harness> (CLI flag takes precedence)
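The flag defaults and the precedence rule above can be sketched as follows. This is a minimal illustration that assumes the flags are parsed by a Python wrapper, which the source does not state. The helper names (build_parser, resolve_secondary, parse_config) are hypothetical, and treating --plateau as a float is also an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flag table; defaults come from the table."""
    parser = argparse.ArgumentParser(prog="skill-refiner")
    parser.add_argument("--iterations", type=int, default=10)
    parser.add_argument(
        "--mode",
        choices=["auto", "circuit-breaker", "step"],
        default="circuit-breaker",
    )
    # Default is None so we can tell "flag not given" apart from an explicit value.
    parser.add_argument("--secondary", default=None)
    parser.add_argument("--threshold", type=int, default=85)
    # Assumption: a score delta may be fractional, so parse it as a float.
    parser.add_argument("--plateau", type=float, default=2.0)
    return parser


def resolve_secondary(cli_value, env):
    """CLI flag wins, then SKILL_REFINER_SECONDARY, then auto-detect."""
    if cli_value is not None:
        return cli_value
    return env.get("SKILL_REFINER_SECONDARY", "auto-detect")


def parse_config(argv, env=None):
    """Parse argv and apply the environment override for the secondary harness."""
    env = os.environ if env is None else env
    args = build_parser().parse_args(argv)
    args.secondary = resolve_secondary(args.secondary, env)
    return args
```

For example, parse_config(["--secondary", "none"]) keeps "none" even when SKILL_REFINER_SECONDARY is set, because the CLI flag takes precedence.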
Checkpoint Modes
circuit-breaker (default): runs autonomously, auto-pauses on score regression, contested major flags, or plateau. Always pauses before phase 2.
auto: fully autonomous through phase 1. Still pauses before phase 2 and on contested major flags (non-configurable).
step: pauses after every iteration for manual review. Best for first run or learning.
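One way to read the three modes is as a lookup from mode to the set of events that pause the run. The sketch below is illustrative only: the Event names and should_pause are hypothetical, and it assumes step mode pauses on every event, including the two pauses the source marks as non-configurable.

```python
from enum import Enum


class Event(Enum):
    """Hypothetical event names for the pause triggers described above."""
    ITERATION_END = "iteration_end"
    SCORE_REGRESSION = "score_regression"
    CONTESTED_MAJOR_FLAG = "contested_major_flag"
    PLATEAU = "plateau"
    BEFORE_PHASE_2 = "before_phase_2"


# These two pauses apply in every mode (described as non-configurable).
ALWAYS_PAUSE = {Event.CONTESTED_MAJOR_FLAG, Event.BEFORE_PHASE_2}

PAUSE_EVENTS = {
    "circuit-breaker": ALWAYS_PAUSE | {Event.SCORE_REGRESSION, Event.PLATEAU},
    "auto": set(ALWAYS_PAUSE),
    # Assumption: step pauses on everything, not only at iteration end.
    "step": set(Event),
}


def should_pause(mode: str, event: Event) -> bool:
    """Return True when the given mode pauses the run on this event."""
    if mode not in PAUSE_EVENTS:
        raise ValueError(f"unknown mode: {mode!r}")
    return event in PAUSE_EVENTS[mode]
```

Keeping the non-configurable pauses in one shared set makes it harder for a new mode to drop them by accident.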
Performance
- Batch similar edits and validations to reduce repeated full-collection scans.
- Prioritize low-scoring or recently changed skills before polishing already-healthy ones.
- Use focused diffs and line-count checks after each batch to avoid late cleanup churn.
Best Practices
- Snapshot evaluation criteria before editing the skills that define the criteria.
- Revert changes that add complexity without improving behavior.
- Keep run history factual and free of unverifiable score inflation.
- Treat composite deltas as noisy when different scorer instances run across iterations: a few points of swing is judge variance, not real change. Anchor keep/revert decisions on the structural gate, on whether the specific targeted weakness was fixed, and on cross-model NO_FLAGS - not on small composite moves. Use a consistent scoring approach within a single before/after comparison.
- Leave externally-maintained version, CVE, and EOL pins out of scope. When a collection has a freshness routine (or equivalent) that owns version currency, do not edit those pins during a run and do not score them as "unverifiable" failures; flag only internal contradictions.
- Audit offensive or security skills (privilege escalation, exploit research) in-loop rather than through web-researching subagents, which can trip platform safeguards. Keep their scoring and edits in the main session.
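The guidance on noisy composite deltas can be sketched as a keep/revert decision that leans on the three anchors rather than on small score moves. This is one possible reading, not the skill's defined procedure: ChangeEvidence, decide, and the 3-point noise band are all hypothetical, since the source only says "a few points".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvidence:
    """Hypothetical summary of one before/after comparison."""
    structural_gate_passed: bool
    targeted_weakness_fixed: bool
    cross_model_no_flags: bool  # True when the reviewer returned NO_FLAGS
    composite_delta: float      # after minus before, same scoring approach


# Assumption: "a few points of swing" is modelled as a 3-point band.
NOISE_BAND = 3.0


def decide(evidence: ChangeEvidence, noise_band: float = NOISE_BAND) -> str:
    """Return "keep" or "revert" for a proposed change."""
    anchors_hold = (
        evidence.structural_gate_passed
        and evidence.targeted_weakness_fixed
        and evidence.cross_model_no_flags
    )
    # Only a drop larger than the noise band counts as a real regression.
    real_regression = evidence.composite_delta < -noise_band
    return "keep" if anchors_hold and not real_regression else "revert"
```

Under these assumptions a change that fixes its targeted weakness but moves the composite by -1 is kept, while a +2 move that leaves the weakness unfixed is reverted.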
Workflow
Phase 0: Setup
- Create feature branch: skill-refiner/YYYY-MM-DD-HHMMSS from current HEAD. Preserve dirty worktrees; branching isolates the run, it does not imply cleanup. If already on a run branch for this sweep, record it instead of nesting another branch. Do not mix unrelated dirty files into refiner commits; isolate them with a path-limited stash only when authorized.
- Load run history: read .refiner-runs.json from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since last run - new model = new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
- Build skill inventory: list all skills, exclude phase-2 targets (skill-creator, skill-refiner) from the improvement pool
- Detect primary harness: check environment to identify which AI CLI is running this session
- Probe for secondary harness: run three-step validation (PATH check, config check, smoke test) per references/harness-detection.md. Announce result.
- If no secondary found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from references/harness-detection.md. Label as "same-model fresh-context review" in scoring, weight at 3% instead of 5% (composite becomes gate/40/55/3, renormalize the missing 2% proportionally to AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (claude -p, codex exec, gemini -p, etc.).
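The fallback weighting described above can be worked through numerically. The sketch assumes base weights of 40 (AI Self-Check), 55 (Behavioral), and 5 (review), inferred from the "gate/40/55/3" and "3% instead of 5%" wording; references/evaluation-criteria.md remains the authority. The function names are hypothetical, and the structural gate is treated as pass/fail outside the weighted sum.

```python
def fallback_weights(
    ai_self_check: float = 40.0,
    behavioral: float = 55.0,
    review: float = 5.0,
    fallback_review: float = 3.0,
) -> dict:
    """Redistribute the freed review weight proportionally to the other two."""
    freed = review - fallback_review   # the "missing 2%"
    base = ai_self_check + behavioral  # 95 with the assumed defaults
    return {
        "ai_self_check": ai_self_check + freed * ai_self_check / base,
        "behavioral": behavioral + freed * behavioral / base,
        "review": fallback_review,
    }


def composite(scores: dict, weights: dict) -> float:
    """Weighted composite on a 0-100 scale; weights are percentages."""
    return sum(scores[name] * weight for name, weight in weights.items()) / 100.0
```

With the assumed defaults, the weights come out at roughly 40.84 / 56.16 / 3, which again sum to 100.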
Phase 1: Regular Iterations
- Iteration 1 - full sweep: score every skill in the pool using the four-component model from references/evaluation-criteria.md
  - Structural: run lint-skills.sh + validate-spec.sh
  - AI Self-Check: invoke skill-creator review mode on each skill
  - Behavioral: run test prompts from references/test-cases.md. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
  - Cross-model: skip on first iteration (no diff to review yet)
- Log baseline scores: record per-skill and aggregate scores in a score ledger before any edits. The ledger must include structural gate (G), AI Self-Check (A), behavioral score (B), cross-model review (X), composite score, test source, reviewer source, and timestamp. After this step, if the ledger is missing, incomplete, or only records lint/spec status, pause and backfill scoring before applying changes. In headless mode, halt the run and report the missing score data.
- Iteration 2+: enter adaptive focus mode. For a user-requested single-skill run, treat that skill as the whole phase-1 pool, run at least the requested iteration count, and keep iterating until the explicit score target is reached, quality plateaus, or a circuit breaker fires.
- Select targets: identify skills scoring below the focus threshold
- For each targeted skill, run the improvement cycle:
  a. Read current SKILL.md and all reference files
  b. Invoke skill-creator review mode - collect findings
  c. Run behavioral test - score current output quality
  d. Propose targeted improvements based on findings (not random changes)
  e. Apply changes to SKILL.md (and references if needed)
  f. Re-score: run lint + AI Self-Check + behavioral test
  g. Karpathy gate: if score improved, keep. If not, revert. No exceptions.
  h. If cross-model review available, send