Skill Refiner: Iterative Self-Improvement Loop

Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch. Orchestrates repeated score-improve-verify cycles using skill-creator as the engine and mandatory peer review as an adversarial check (cross-model when a secondary harness is available, fresh-context self-review as the minimum fallback).

When to use

Batch-improving the entire skill collection after a period of manual edits
Targeted improvement of one named skill when the user explicitly asks for skill-refiner, multiple iterations, a score target, or "no ceiling" polish
Running quality sweeps before a release or publish
Triggering a self-improvement cycle where skills bootstrap each other
After adding several new skills that need polish and consistency alignment
When cross-model perspective would catch single-model blind spots
Periodic maintenance: scheduled improvement runs to keep skills current

When NOT to use

Creating a new skill from scratch - use skill-creator (Mode 1)
Single-pass review of one skill without iteration, scoring, or peer review - use skill-creator (Mode 2)
One-off collection audit without iteration - use skill-creator (Mode 3)
Full codebase review (code, not skills) - use full-review
Style/slop audit on application code - use anti-slop

Configuration

skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]

| Flag | Default | Description |
| --- | --- | --- |
| `--iterations` | 10 | Maximum iterations for phase 1 |
| `--mode` | circuit-breaker | `auto`, `circuit-breaker`, or `step` |
| `--secondary` | auto-detect | Secondary harness for cross-model review, or `none` |
| `--threshold` | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| `--plateau` | 2 | Minimum score delta to keep iterating |

Environment override: SKILL_REFINER_SECONDARY=<harness> (CLI flag takes precedence)
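
The flag defaults and the precedence rule above can be sketched as follows. This is a minimal illustration, not the tool's actual parser: the function name `parse_config`, the integer types for `--threshold` and `--plateau`, and the use of the literal string `"auto-detect"` as the fallback value are assumptions made for the example.

```python
import argparse
import os


def parse_config(argv, env=None):
    """Parse skill-refiner flags (hypothetical helper, for illustration only)."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="skill-refiner")
    parser.add_argument("--iterations", type=int, default=10)
    parser.add_argument(
        "--mode",
        choices=["auto", "circuit-breaker", "step"],
        default="circuit-breaker",
    )
    # Default is None so we can tell "flag not given" apart from an explicit value.
    parser.add_argument("--secondary", default=None)
    parser.add_argument("--threshold", type=int, default=85)
    parser.add_argument("--plateau", type=int, default=2)
    args = parser.parse_args(argv)

    # The CLI flag takes precedence; the environment variable is consulted
    # only when --secondary was not passed on the command line.
    if args.secondary is None:
        args.secondary = env.get("SKILL_REFINER_SECONDARY", "auto-detect")
    return args


if __name__ == "__main__":
    print(parse_config(["--mode", "step", "--secondary", "none"]))
```

Leaving `--secondary` unset until after parsing is what lets the environment variable act as a fallback rather than being shadowed by a hard-coded default.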

Checkpoint Modes

circuit-breaker (default): runs autonomously, auto-pauses on score regression, contested major flags, or plateau. Always pauses before phase 2.

auto: fully autonomous through phase 1. Still pauses before phase 2 and on contested major flags (non-configurable).

step: pauses after every iteration for manual review. Best for first run or learning.
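
The three modes differ only in which events cause a pause. A rough sketch of that decision, assuming the loop reports events through a hypothetical `IterationEvents` record (the field names are illustrative, not part of the skill):

```python
from dataclasses import dataclass


@dataclass
class IterationEvents:
    """Hypothetical summary of what happened in the iteration just finished."""
    score_regressed: bool = False
    contested_major_flag: bool = False
    plateaued: bool = False
    entering_phase_2: bool = False


def should_pause(mode, events):
    """Return True when the run should stop and wait for the user."""
    if mode not in {"auto", "circuit-breaker", "step"}:
        raise ValueError(f"unknown mode: {mode}")

    # Non-configurable pauses: before phase 2 and on contested major flags.
    if events.entering_phase_2 or events.contested_major_flag:
        return True
    # step pauses after every iteration.
    if mode == "step":
        return True
    # circuit-breaker additionally trips on regression or plateau.
    if mode == "circuit-breaker":
        return events.score_regressed or events.plateaued
    # auto runs through phase 1 without further pauses.
    return False
```

For example, `should_pause("auto", IterationEvents(score_regressed=True))` is `False`, while the same events under `circuit-breaker` pause the run.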

Performance

Batch similar edits and validations to reduce repeated full-collection scans.
Prioritize low-scoring or recently changed skills before polishing already-healthy ones.
Use focused diffs and line-count checks after each batch to avoid late cleanup churn.

Best Practices

Snapshot evaluation criteria before editing the skills that define the criteria.
Revert changes that add complexity without improving behavior.
Keep run history factual and free of unverifiable score inflation.
Treat composite deltas as noisy when different scorer instances run across iterations: a few points of swing is judge variance, not real change. Anchor keep/revert decisions on the structural gate, on whether the specific targeted weakness was fixed, and on cross-model NO_FLAGS - not on small composite moves. Use a consistent scoring approach within a single before/after comparison.
Leave externally-maintained version, CVE, and EOL pins out of scope. When a collection has a freshness routine (or equivalent) that owns version currency, do not edit those pins during a run and do not score them as "unverifiable" failures; flag only internal contradictions.
Audit offensive or security skills (privilege escalation, exploit research) in-loop rather than through web-researching subagents, which can trip platform safeguards. Keep their scoring and edits in the main session.

Workflow

Phase 0: Setup

Create feature branch: skill-refiner/YYYY-MM-DD-HHMMSS from current HEAD. Preserve dirty worktrees; branching isolates the run, it does not imply cleanup. If already on a run branch for this sweep, record it instead of nesting another branch. Do not mix unrelated dirty files into refiner commits; isolate them with a path-limited stash only when authorized.
Load run history: read .refiner-runs.json from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since last run - new model = new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
Build skill inventory: list all skills, exclude phase-2 targets (skill-creator, skill-refiner) from the improvement pool
Detect primary harness: check environment to identify which AI CLI is running this session
Probe for secondary harness: run three-step validation (PATH check, config check, smoke test) per references/harness-detection.md. Announce result.
If no secondary found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from references/harness-detection.md. Label the result "same-model fresh-context review" in scoring and weight it at 3% instead of 5%: the composite starts as gate/40/55/3, and the missing 2% is redistributed to AI Self-Check and Behavioral in proportion to their weights. This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option; a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (claude -p, codex exec, gemini -p, etc.).
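
The fallback weighting in the last step is plain arithmetic and can be checked with a short sketch. The function name `fallback_weights` and the dictionary keys are hypothetical; the 40/55/3 starting weights come from the step above, and the structural component is left out because it acts as a gate rather than a weighted term.

```python
def fallback_weights(ai=40.0, behavioral=55.0, review=3.0, total=100.0):
    """Redistribute the weight freed by down-weighting same-model review.

    The shortfall (2 points with the defaults) is split between AI Self-Check
    and Behavioral in proportion to their existing weights.
    """
    missing = total - (ai + behavioral + review)
    share = ai + behavioral
    return {
        "ai_self_check": ai + missing * ai / share,
        "behavioral": behavioral + missing * behavioral / share,
        "review": review,
    }


if __name__ == "__main__":
    for name, weight in fallback_weights().items():
        print(f"{name}: {weight:.2f}")
```

With the defaults, AI Self-Check rises from 40 to about 40.84 and Behavioral from 55 to about 56.16, so the three weights again sum to 100.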

Phase 1: Regular Iterations

Iteration 1 - full sweep: score every skill in the pool using the four-component model from references/evaluation-criteria.md
- Structural: run lint-skills.sh + validate-spec.sh
- AI Self-Check: invoke skill-creator review mode on each skill
- Behavioral: run test prompts from references/test-cases.md. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
- Cross-model: skip on first iteration (no diff to review yet)
Log baseline scores: record per-skill and aggregate scores in a score ledger before any edits. The ledger must include structural gate (G), AI Self-Check (A), behavioral score (B), cross-model review (X), composite score, test source, reviewer source, and timestamp. After this step, if the ledger is missing, incomplete, or only records lint/spec status, pause and backfill scoring before applying changes. In headless mode, halt the run and report the missing score data.
Iteration 2+: enter adaptive focus mode. For a user-requested single-skill run, treat that skill as the whole phase-1 pool, run at least the requested iteration count, and keep iterating until the explicit score target is reached, quality plateaus, or a circuit breaker fires.
Select targets: identify skills scoring below the focus threshold
For each targeted skill, run the improvement cycle:
a. Read current SKILL.md and all reference files
b. Invoke skill-creator review mode - collect findings
c. Run behavioral test - score current output quality
d. Propose targeted improvements based on findings (not random changes)
e. Apply changes to SKILL.md (and references if needed)
f. Re-score: run lint + AI Self-Check + behavioral test
g. Karpathy gate: if score improved, keep. If not, revert. No exceptions.
h. If cross-model review available, send
