Skill Creator
Default operating mode: autonomous — create or update the skill, run evals, improve it, optimize the description, and return with a final report. Only pause for human review if the user explicitly requests it or you hit an ambiguity that can't be resolved from the evidence alone.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run claude-with-access-to-the-skill on them
- Evaluate the results both qualitatively and quantitatively
- While the runs happen in the background, draft some quantitative evals if there aren't any (if there are some, you can either use them as is or modify them if you feel something needs to change)
- Use the eval-viewer/generate_review.py script to show the results if the user wants to review
- Rewrite the skill based on eval results, benchmark data, and user feedback (if review was requested)
- Repeat until quality thresholds are met or the user is satisfied
- Optimize the description for triggering accuracy in parallel
Figure out where the user is in this process and then jump in and help them progress through these stages. Route based on what they need. Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Workspace convention: All eval artifacts go in tmp/<skill-name>-workspace/ under the project root (the directory containing .claude/). This directory is gitignored. Within the workspace, organize by iteration (iteration-1/, iteration-2/, etc.).
Nested invocation: When invoked as a subagent with an explicit outputs directory, use that outputs/ directory as the root for all inner workspaces — not tmp/<skill-name>-workspace/.
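The path rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill's tooling: the helper names `find_project_root` and `iteration_dir` are hypothetical, and treating the explicit outputs directory as a direct replacement for tmp/<skill-name>-workspace/ is one reading of the nested-invocation rule.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path) -> Path:
    """Walk upward from `start` until a directory containing .claude/ is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".claude").is_dir():
            return candidate
    raise FileNotFoundError(f"no .claude/ directory found at or above {start}")


def iteration_dir(skill_name: str, iteration: int, start: Path,
                  outputs_dir: Optional[Path] = None) -> Path:
    """Return (and create) the directory for one eval iteration.

    Assumption: when `outputs_dir` is given (nested invocation), it takes the
    place of tmp/<skill-name>-workspace/ as the workspace root.
    """
    if outputs_dir is not None:
        workspace = outputs_dir
    else:
        workspace = find_project_root(start) / "tmp" / f"{skill_name}-workspace"
    path = workspace / f"iteration-{iteration}"
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Calling `iteration_dir("my-skill", 1, Path.cwd())` from anywhere inside the project would then resolve to tmp/my-skill-workspace/iteration-1/ under the project root.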
Track your progress with tasks/todos — without them, description optimization and the judge step are commonly skipped.
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you, it's only very recently that it started), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
If you're unsure whether the user will know a term, it's OK to briefly explain it with a short definition.
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. Skip if in autonomous mode: The user may need to fill the gaps, and should confirm before proceeding to the next step.
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- What evaluation strategy fits? Objectively verifiable outputs (file transforms, code generation) → expectations and baselines. Subjective outputs (writing style, art) → qualitative analysis.
In autonomous mode: if a gap is genuinely irresolvable from context and would materially change the skill's correctness, pause and ask. Otherwise infer the answer, state it, and proceed.
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out. Skip this interview in autonomous mode and make these decisions yourself.
Research existing skills — MUST read references/available-skill-resources.md for curated skill repositories, then fetch the README or index of relevant repos to check whether skills for this domain already exist. Don't deep-dive into repos with nothing relevant — a quick scan of the index is enough to know if there's something worth borrowing.
Check available MCPs - if useful for research (searching docs, finding similar skills, looking up best practices), research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce burden on the user.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier
- description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently Claude has a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'" Length: hard limit is 1024 characters (the optimizer enforces this); aim for under ~650 characters in practice — past that, every extra clause competes for attention with the other skills' descriptions and tends to dilute rather than sharpen triggering.
- compatibility: Required tools, dependencies (optional, rarely needed)
- the rest of the skill :)
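The length guidance for the description is easy to check mechanically. The sketch below uses the two numbers stated above (1024 hard limit, roughly 650 as the practical target); the function name `check_description` and its three return labels are hypothetical, and the real optimizer may count or enforce the limit differently.

```python
HARD_LIMIT = 1024   # hard limit on description length, in characters
SOFT_TARGET = 650   # approximate practical target, in characters


def check_description(description: str) -> str:
    """Classify a description as 'ok', 'long', or 'too_long' by character count."""
    n = len(description)
    if n > HARD_LIMIT:
        return "too_long"   # would be rejected under the hard limit
    if n > SOFT_TARGET:
        return "long"       # allowed, but likely to dilute triggering
    return "ok"
```

A result of "long" is a prompt to trim clauses, not an error.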
Skill Writing Guide
Anatomy of a Skill
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter (name, description required)
│   └── Markdown instructions
└── Bundled Resources (optional)
    ├── scripts/    - Executable code for deterministic/repetitive tasks
    ├── references/ - Docs loaded into context as needed
    └── assets/     - Files used in output (templates, icons, fonts)
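As a rough illustration of this layout, the following sketch creates an empty skill skeleton with the two required frontmatter fields. The function `scaffold_skill` is hypothetical, and it creates all three optional resource directories only for demonstration; a real skill would include just the ones it needs.

```python
import json
from pathlib import Path


def scaffold_skill(parent: Path, name: str, description: str) -> Path:
    """Create <parent>/<name>/ with a SKILL.md and the optional resource dirs."""
    skill_dir = parent / name
    for sub in ("scripts", "references", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)

    # json.dumps yields a double-quoted string, which is also a valid YAML
    # scalar, so colons or quotes in the description do not break the header.
    frontmatter = (
        "---\n"
        f"name: {name}\n"
        f"description: {json.dumps(description)}\n"
        "---\n\n"
    )
    (skill_dir / "SKILL.md").write_text(frontmatter + f"# {name}\n", encoding="utf-8")
    return skill_dir
```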
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - In context whenever skill triggers (<500 lines ideal)
- Bundled resources - As needed (unlimited, scripts can execute without loading)
These word and line counts are approximate, and you can go longer if needed.
Key patterns:
- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
- Routing lives at the parent. "When to use" conditions go in the SKILL.md reference table — not inside the spoke file. By the time the agent reads a "## When to Use" section, it's already paid the load cost. Reference files cover HOW; parent covers WHEN.
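The size-related patterns above lend themselves to a quick lint pass. This is a hedged sketch: `lint_skill` is a hypothetical helper, and detecting a table of contents by searching for the phrase "table of contents" is a crude assumption that a real check would likely refine.

```python
from pathlib import Path
from typing import List

SKILL_MD_MAX_LINES = 500   # SKILL.md should stay under this
REFERENCE_TOC_LINES = 300  # reference files longer than this need a TOC


def lint_skill(skill_dir: Path) -> List[str]:
    """Return warnings for size guidelines that the skill appears to violate."""
    warnings: List[str] = []

    skill_md = skill_dir / "SKILL.md"
    n_lines = len(skill_md.read_text(encoding="utf-8").splitlines())
    if n_lines >= SKILL_MD_MAX_LINES:
        warnings.append(f"SKILL.md has {n_lines} lines; consider another layer of hierarchy")

    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            text = ref.read_text(encoding="utf-8")
            too_long = len(text.splitlines()) > REFERENCE_TOC_LINES
            # Assumption: a TOC is recognizable by this phrase.
            if too_long and "table of contents" not in text.lower():
                warnings.append(f"{ref.name} is over {REFERENCE_TOC_LINES} lines with no table of contents")
    return warnings
```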
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
    ├── aws.md
    ├── gcp.md
    └── azure.md
Claude reads only the relevant reference file.
Principle of Lack of Surprise
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. A skill's contents should not su