Skill Scorer — 0-100 Rating for Any SKILL.md

Calibrated against Anthropic's own skill-creator guidelines and the patterns of 50k+ star repos (Superpowers, Caveman, mattpocock/skills). Average skill scores 45-55. If yours scores 80+, ship it.

Dependencies

None. Standalone. Reads any SKILL.md content.

CRITICAL: Auto-start

SKILL.md content in the message or just created in conversation → skip to Step 2. No preamble.

Step 1. Get the skill

Content in context? Use it. Otherwise:

Paste your SKILL.md or tell me which skill to score.

If the user names a skill in the current environment, read it.

Step 2. Score 10 Dimensions (each 0-10)

Be precise. 6.5 is 6.5, not 7. Every score needs one sentence of evidence referencing the actual skill content.

D1: Trigger Precision (0-10)

Will it fire when it should? Will it stay quiet when it shouldn't?

Evaluate the YAML description field:

Specific trigger phrases listed? (+2)
Natural language variations covered? (+2)
"Pushy" quality per Anthropic guidance? (+2)
Edge-case triggers (bare paste, implicit intent)? (+2)
False positive risk controlled? (+2)

Range	Calibration
0-3	Vague description, undertriggers or fires on everything
4-6	Some triggers but gaps in natural phrasing
7-8	Solid coverage, most natural phrasings caught
9-10	Comprehensive — includes implicit triggers and edge cases

D2: Instruction Clarity (0-10)

Can Claude follow this without improvising?

Unambiguous step sequence? (+2)
Decision points with explicit branches? (+2)
Deterministic enough that two instances produce similar output? (+3)
No contradictory instructions? (+1.5)
Clear scope boundaries (what it does AND doesn't do)? (+1.5)

D3: Output Predictability (0-10)

Does the user know what they'll get?

Output format explicitly defined? (+3)
Templates or examples provided? (+3)
Length/scope appropriate and specified? (+2)
Structured data output (JSON, table) available? (+2)

D4: Edge Case Coverage (0-10)

What happens with weird input?

Empty/missing input handled? (+2)
Extremely long input addressed? (+2)
Invalid/unexpected input types? (+2)
Multi-language or mixed input? (+2)
Graceful degradation path? (+2)

D5: Anti-Hallucination Guardrails (0-10)

Does it prevent Claude from fabricating?

Explicit "don't fabricate" rules? (+2.5)
"If unsure, say so" instruction? (+2.5)
Failure modes section (what skill does NOT do)? (+2.5)
Separation of known vs. needs-verification? (+2.5)

D6: Developer Experience (0-10)

Is it pleasant to use?

Engaging interaction style? (+2.5)
Personality without being annoying? (+2.5)
Appropriate for target audience? (+2.5)
Memorable enough to recommend? (+2.5)

D7: Composability (0-10)

Does it play well with other tools?

Output format consumable by other skills/tools? (+3)
Explicit handoff points? (+3)
Modular (not monolithic)? (+2)
Avoids duplicating existing skills? (+2)

D8: Open-Source Readiness (0-10)

Would a stranger install this from a README?

Self-contained (no private dependencies)? (+2)
Documented well enough for strangers? (+2)
Solves a real, recognized problem? (+2)
Clear naming and discoverability? (+2)
Cross-agent compatibility mentioned? (+2)

D9: Wow Factor (0-10)

Does it make people go "oh damn"?

Novel approach or angle? (+3)
Demo-able in 30 seconds? (+3)
Would someone tweet/share this? (+2)
Solves a problem in a way no one else has? (+2)

D10: Real-World Utility (0-10)

Would someone use this weekly?

Solves a recurring problem? (+3)
Problem is painful enough people seek solutions? (+3)
Faster/better than manual alternative? (+2)
Would someone miss it if removed? (+2)

Step 3. Calculate Composite Score

Formula

RAW = (D1 × 1.2) + (D2 × 1.2) + (D3 × 1.0) + (D4 × 0.8) +
      (D5 × 1.0) + (D6 × 1.0) + (D7 × 0.6) + (D8 × 1.0) +
      (D9 × 1.2) + (D10 × 1.0)

SCORE = RAW  // max possible = 100

Weight rationale: Trigger precision (D1) and wow factor (D9) weighted highest — a skill that never fires or that nobody cares about is dead on arrival. Composability (D7) weighted lowest — standalone value matters more for v1.

Verdict

Score	Verdict	Meaning
80-100	🟢 SHIP IT	Publish. Polish the README and go.
60-79	🟡 REWORK	Good bones. Fix the weak dimensions first.
40-59	🟠 MAJOR REWORK	Concept may be sound. Execution needs rewrite.
0-39	🔴 SCRAP	Start over or reconsider if this should be a skill.

Calibration Anchors

Anthropic's docx skill = ~72 (solid execution, low wow factor)
Superpowers = ~88 (methodology is the product, cross-agent, massive adoption)
Caveman = ~85 (extreme wow, simple concept, measurable benchmark)
A typical first-draft skill = ~35-45

These are reference points, not dogma. Score based on the rubric, not by anchoring to these numbers.

Step 4. Output Format

## 🎯 SKILL SCORE: [Name]

| Dim | Dimension | Score | Evidence |
|-----|-----------|-------|----------|
| D1 | Trigger Precision | X/10 | [one line referencing the skill] |
| D2 | Instruction Clarity | X/10 | ... |
| D3 | Output Predictability | X/10 | ... |
| D4 | Edge Case Coverage | X/10 | ... |
| D5 | Anti-Hallucination | X/10 | ... |
| D6 | Developer Experience | X/10 | ... |
| D7 | Composability | X/10 | ... |
| D8 | Open-Source Readiness | X/10 | ... |
| D9 | Wow Factor | X/10 | ... |
| D10 | Real-World Utility | X/10 | ... |

## COMPOSITE: XX/100 — [VERDICT EMOJI] [VERDICT]

### 🔥 Top 3 Strengths
1. [Specific, with evidence]
2. ...
3. ...

### 💀 Top 3 Weaknesses
1. [Specific, with actionable fix]
2. ...
3. ...

### 🛠️ Priority Fixes
1. [Highest-impact change — specific enough to implement now]
2. [Second]
3. [Third]

Structured Output

Always append after the narrative:

{
  "skill_score": {
    "version": "2.0",
    "skill_name": "...",
    "dimensions": {
      "trigger_precision": 0.0,
      "instruction_clarity": 0.0,
      "output_predictability": 0.0,
      "edge_case_coverage": 0.0,
      "anti_hallucination": 0.0,
      "developer_experience": 0.0,
      "composability": 0.0,
      "open_source_readiness": 0.0,
      "wow_factor": 0.0,
      "real_world_utility": 0.0
    },
    "composite": 0,
    "verdict": "SHIP|REWORK|MAJOR_REWORK|SCRAP",
    "top_strengths": ["...", "...", "..."],
    "top_weaknesses": ["...", "...", "..."],
    "priority_fixes": ["...", "...", "..."]
  }
}

Step 5. Interactive Options

What next?
A) Rewrite the weakest dimension
B) Optimize the trigger description (rewrite YAML)
C) Generate a README.md from this skill
D) Compare against another skill (side-by-side)
E) Score another skill
F) Run the Anthropic compliance check (progressive disclosure, token budget, etc.)

Step 6. Anthropic Compliance Check (option F)

When requested, run this secondary checklist against Anthropic's skill-creator guidelines:

Check	Pass/Fail	Notes
YAML frontmatter has `name` + `description`
Description ≤ 1024 chars
SKILL.md body ≤ 500 lines (~5000 tokens)
Progressive disclosure (3 tiers used correctly)
No hardcoded API keys or secrets
References files have guidance on when to load
Cross-agent install commands documented
Failure modes section present
Examples or demo included

Operational Rules

No flattery. "Great start!" is banned. Be specific or be quiet.
No score inflation. Average = 45-55 out of 100. Not 70.
Evidence required. Every score references the actual skill text.
Fixes must be actionable. "Improve the description" → useless. "Add these 5 trigger phrases: [list]" → useful.
Self-aware. This skill can score itself. Be hon

skill-scorer

How to add

Drop this on your repo README

Related skills

claude-api

skill-creator

oh-my-issues

claude-mem

Get new Desenvolvimento skills every Monday