Skill Scorer — 0-100 Rating for Any SKILL.md
Calibrated against Anthropic's own skill-creator guidelines and the patterns of 50k+ star repos (Superpowers, Caveman, mattpocock/skills). Average skill scores 45-55. If yours scores 80+, ship it.
Dependencies
None. Standalone. Reads any SKILL.md content.
CRITICAL: Auto-start
SKILL.md content in the message or just created in conversation → skip to Step 2. No preamble.
Step 1. Get the skill
Content in context? Use it. Otherwise:
Paste your SKILL.md or tell me which skill to score.
If the user names a skill in the current environment, read it.
Step 2. Score 10 Dimensions (each 0-10)
Be precise. 6.5 is 6.5, not 7. Every score needs one sentence of evidence referencing the actual skill content.
D1: Trigger Precision (0-10)
Will it fire when it should? Will it stay quiet when it shouldn't?
Evaluate the YAML description field:
- Specific trigger phrases listed? (+2)
- Natural language variations covered? (+2)
- "Pushy" quality per Anthropic guidance? (+2)
- Edge-case triggers (bare paste, implicit intent)? (+2)
- False positive risk controlled? (+2)
| Range | Calibration |
|---|---|
| 0-3 | Vague description, undertriggers or fires on everything |
| 4-6 | Some triggers but gaps in natural phrasing |
| 7-8 | Solid coverage, most natural phrasings caught |
| 9-10 | Comprehensive — includes implicit triggers and edge cases |
D2: Instruction Clarity (0-10)
Can Claude follow this without improvising?
- Unambiguous step sequence? (+2)
- Decision points with explicit branches? (+2)
- Deterministic enough that two instances produce similar output? (+3)
- No contradictory instructions? (+1.5)
- Clear scope boundaries (what it does AND doesn't do)? (+1.5)
D3: Output Predictability (0-10)
Does the user know what they'll get?
- Output format explicitly defined? (+3)
- Templates or examples provided? (+3)
- Length/scope appropriate and specified? (+2)
- Structured data output (JSON, table) available? (+2)
D4: Edge Case Coverage (0-10)
What happens with weird input?
- Empty/missing input handled? (+2)
- Extremely long input addressed? (+2)
- Invalid/unexpected input types? (+2)
- Multi-language or mixed input? (+2)
- Graceful degradation path? (+2)
D5: Anti-Hallucination Guardrails (0-10)
Does it prevent Claude from fabricating?
- Explicit "don't fabricate" rules? (+2.5)
- "If unsure, say so" instruction? (+2.5)
- Failure modes section (what skill does NOT do)? (+2.5)
- Separation of known vs. needs-verification? (+2.5)
D6: Developer Experience (0-10)
Is it pleasant to use?
- Engaging interaction style? (+2.5)
- Personality without being annoying? (+2.5)
- Appropriate for target audience? (+2.5)
- Memorable enough to recommend? (+2.5)
D7: Composability (0-10)
Does it play well with other tools?
- Output format consumable by other skills/tools? (+3)
- Explicit handoff points? (+3)
- Modular (not monolithic)? (+2)
- Avoids duplicating existing skills? (+2)
D8: Open-Source Readiness (0-10)
Would a stranger install this from a README?
- Self-contained (no private dependencies)? (+2)
- Documented well enough for strangers? (+2)
- Solves a real, recognized problem? (+2)
- Clear naming and discoverability? (+2)
- Cross-agent compatibility mentioned? (+2)
D9: Wow Factor (0-10)
Does it make people go "oh damn"?
- Novel approach or angle? (+3)
- Demo-able in 30 seconds? (+3)
- Would someone tweet/share this? (+2)
- Solves a problem in a way no one else has? (+2)
D10: Real-World Utility (0-10)
Would someone use this weekly?
- Solves a recurring problem? (+3)
- Problem is painful enough people seek solutions? (+3)
- Faster/better than manual alternative? (+2)
- Would someone miss it if removed? (+2)
Step 3. Calculate Composite Score
Formula
RAW = (D1 × 1.2) + (D2 × 1.2) + (D3 × 1.0) + (D4 × 0.8) +
(D5 × 1.0) + (D6 × 1.0) + (D7 × 0.6) + (D8 × 1.0) +
(D9 × 1.2) + (D10 × 1.0)
SCORE = RAW // max possible = 100
Weight rationale: Trigger precision (D1) and wow factor (D9) weighted highest — a skill that never fires or that nobody cares about is dead on arrival. Composability (D7) weighted lowest — standalone value matters more for v1.
Verdict
| Score | Verdict | Meaning |
|---|---|---|
| 80-100 | 🟢 SHIP IT | Publish. Polish the README and go. |
| 60-79 | 🟡 REWORK | Good bones. Fix the weak dimensions first. |
| 40-59 | 🟠 MAJOR REWORK | Concept may be sound. Execution needs rewrite. |
| 0-39 | 🔴 SCRAP | Start over or reconsider if this should be a skill. |
Calibration Anchors
- Anthropic's docx skill = ~72 (solid execution, low wow factor)
- Superpowers = ~88 (methodology is the product, cross-agent, massive adoption)
- Caveman = ~85 (extreme wow, simple concept, measurable benchmark)
- A typical first-draft skill = ~35-45
These are reference points, not dogma. Score based on the rubric, not by anchoring to these numbers.
Step 4. Output Format
## 🎯 SKILL SCORE: [Name]
| Dim | Dimension | Score | Evidence |
|-----|-----------|-------|----------|
| D1 | Trigger Precision | X/10 | [one line referencing the skill] |
| D2 | Instruction Clarity | X/10 | ... |
| D3 | Output Predictability | X/10 | ... |
| D4 | Edge Case Coverage | X/10 | ... |
| D5 | Anti-Hallucination | X/10 | ... |
| D6 | Developer Experience | X/10 | ... |
| D7 | Composability | X/10 | ... |
| D8 | Open-Source Readiness | X/10 | ... |
| D9 | Wow Factor | X/10 | ... |
| D10 | Real-World Utility | X/10 | ... |
## COMPOSITE: XX/100 — [VERDICT EMOJI] [VERDICT]
### 🔥 Top 3 Strengths
1. [Specific, with evidence]
2. ...
3. ...
### 💀 Top 3 Weaknesses
1. [Specific, with actionable fix]
2. ...
3. ...
### 🛠️ Priority Fixes
1. [Highest-impact change — specific enough to implement now]
2. [Second]
3. [Third]
Structured Output
Always append after the narrative:
{
"skill_score": {
"version": "2.0",
"skill_name": "...",
"dimensions": {
"trigger_precision": 0.0,
"instruction_clarity": 0.0,
"output_predictability": 0.0,
"edge_case_coverage": 0.0,
"anti_hallucination": 0.0,
"developer_experience": 0.0,
"composability": 0.0,
"open_source_readiness": 0.0,
"wow_factor": 0.0,
"real_world_utility": 0.0
},
"composite": 0,
"verdict": "SHIP|REWORK|MAJOR_REWORK|SCRAP",
"top_strengths": ["...", "...", "..."],
"top_weaknesses": ["...", "...", "..."],
"priority_fixes": ["...", "...", "..."]
}
}
Step 5. Interactive Options
What next?
A) Rewrite the weakest dimension
B) Optimize the trigger description (rewrite YAML)
C) Generate a README.md from this skill
D) Compare against another skill (side-by-side)
E) Score another skill
F) Run the Anthropic compliance check (progressive disclosure, token budget, etc.)
Step 6. Anthropic Compliance Check (option F)
When requested, run this secondary checklist against Anthropic's skill-creator guidelines:
| Check | Pass/Fail | Notes |
|---|---|---|
YAML frontmatter has name + description | ||
| Description ≤ 1024 chars | ||
| SKILL.md body ≤ 500 lines (~5000 tokens) | ||
| Progressive disclosure (3 tiers used correctly) | ||
| No hardcoded API keys or secrets | ||
| References files have guidance on when to load | ||
| Cross-agent install commands documented | ||
| Failure modes section present | ||
| Examples or demo included |
Operational Rules
- No flattery. "Great start!" is banned. Be specific or be quiet.
- No score inflation. Average = 45-55 out of 100. Not 70.
- Evidence required. Every score references the actual skill text.
- Fixes must be actionable. "Improve the description" → useless. "Add these 5 trigger phrases: [list]" → useful.
- Self-aware. This skill can score itself. Be hon