Prompt Validation
Calibration: Tier 2, Opus-primary. See repository README for model compatibility.
Evaluate whether a prompt is well-constructed, diagnose why it underperforms, and fix it systematically. This Skill treats prompt evaluation as a structured discipline — score the architecture, test the output, trace failures to their root layer.
Reasoning discipline
Before scoring any dimension, walk through the evidence explicitly. State the observations (specific prompt features, instruction clarity, structural elements, behavioral countermeasures), name the rubric anchor or principle they match, then assign the 1-5 score. Do not compress this sequence into a summary score.
If the evaluation scope is unclear (Scorecard only? Output Evaluation Rubric too? Scoring or rewrite?), confirm scope with the user before proceeding. Do not proceed on an inferred scope.
When to Use This Skill
Use this Skill when evaluating a single prompt or system prompt. The user has a prompt (or its output) and wants to know whether it is working, why it is not, or how to improve it.
This Skill evaluates prompts. For evaluating an entire Claude Project's architecture (Custom Instructions, knowledge file organization, mode design, multi-file structure), use the rootnode-project-audit Skill if available. The distinction: if the user shows you one prompt or system prompt and asks "is this good?" — use this Skill. If the user describes a Project with multiple files and asks "why isn't my Project working?" — that is a project-level audit.
Instructions
Step 1: Score the Prompt (Prompt Scorecard)
Score the prompt on each of the six dimensions below using the 1-5 anchoring criteria. A prompt scoring below 3 on any dimension has an identified weakness that should be addressed before use. A prompt averaging 4.0+ with no score below 3 is ready for output testing.
The Scorecard evaluates prompt architecture, not output. A structurally strong prompt can still underperform if runtime context is thin. Conversely, a prompt scoring 3 on every dimension might produce adequate output for a simple task. Use the Scorecard as a quality gate, not a final verdict.
Critical: Evidence-first scoring. For every score, cite the specific text in the prompt that justifies the rating. Never assign a score without pointing to evidence. If you cannot find evidence for a higher score, the prompt has not earned it.
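The thresholds and the evidence-first rule in this step can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the Skill: the names `DimensionScore` and `scorecard_gate` and the status strings are hypothetical, and the `"borderline"` status is an assumption covering the case the text does not name (no score below 3, average under 4.0).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScore:
    """One Scorecard row: a 1-5 score plus the prompt text that justifies it."""

    dimension: str
    score: int
    evidence: str  # text from the prompt under review that supports the score

    def __post_init__(self) -> None:
        if not 1 <= self.score <= 5:
            raise ValueError(f"{self.dimension}: score must be 1-5, got {self.score}")
        if not self.evidence.strip():
            # Evidence-first rule: a score without cited evidence is rejected.
            raise ValueError(f"{self.dimension}: score given without evidence")


def scorecard_gate(scores: list[DimensionScore]) -> dict:
    """Apply the Step 1 thresholds to a list of dimension scores."""
    weak = [s.dimension for s in scores if s.score < 3]
    average = mean(s.score for s in scores)
    if weak:
        status = "fix_weaknesses"  # any score below 3 must be addressed first
    elif average >= 4.0:
        status = "ready_for_output_testing"
    else:
        # Assumption: the source does not name this in-between case.
        status = "borderline"
    return {"status": status, "average": average, "weak_dimensions": weak}
```

For example, a scorecard with one dimension at 2 returns `"fix_weaknesses"` regardless of its average, which mirrors the rule that a single weak dimension blocks use of the prompt.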
Dimension 1: Objective Clarity
Does the objective define a specific task with evaluable success criteria?
| Score | Anchor |
|---|---|
| 1 | Topic only — names a subject area without specifying what to do with it. ("Analyze our marketing.") |
| 2 | Action verb present but success criteria absent. ("Evaluate our market entry options.") The reader knows the task type but not what a good output looks like. |
| 3 | Action verb, deliverable type, and audience specified, but success criteria are implicit. ("Produce an executive brief evaluating our three market entry options for the leadership team.") Two people might disagree on what "good" looks like. |
| 4 | Action verb, deliverable, audience, and explicit success criteria. Constraints stated. ("Evaluate three market entry options and recommend one, weighted toward operational feasibility given our 8-person team. Recommendation must be specific enough for a go/no-go decision.") |
| 5 | All of 4, plus the objective bounds what is out of scope. The two-person test passes cleanly — two readers would produce outputs with the same structure and analytical focus, differing only in judgment. |
If below 3 → most likely output failure is "answers an adjacent question." Fix the objective before touching any other layer.
Dimension 2: Context Specificity
Would the context need to change if the task were about a different organization in the same industry?
| Score | Anchor |
|---|---|
| 1 | No context, or purely categorical. ("We are a technology company looking to grow.") |
| 2 | Category with one or two specifics. ("B2B SaaS company with about 2,000 customers.") Output will be partially grounded but mostly generic. |
| 3 | Situation-specific with multiple concrete details (numbers, constraints, team composition) but missing prior decisions or what has been tried. Output will reference details but may suggest previously rejected approaches. |
| 4 | Specific situation with concrete numbers, named constraints, prior decisions, and what is off the table. Replacing the company name with a competitor's would make the context inaccurate. |
| 5 | All of 4, plus the context flags what is unknown or uncertain (rather than leaving gaps for Claude to guess), providing enough detail for tradeoff-aware rather than generic recommendations. |
If below 3 → most likely output failure is "generic — could apply to any company." No amount of reasoning refinement compensates for thin context.
Dimension 3: Reasoning Fit
Does the reasoning approach match the task type and direct analytical attention to the right dimensions?
| Score | Anchor |
|---|---|
| 1 | No reasoning guidance, or only "think step by step" / "analyze carefully." Claude receives no direction on what analytical moves to make. |
| 2 | Reasoning present but generic — steps could apply to any analytical task. ("Consider pros and cons. Identify risks. Recommend.") Steps do not reflect this task's specific challenge. |
| 3 | Reasoning from the correct task category (e.g., strategic reasoning for a strategic task) but used without customization. Steps are relevant but not tailored to the specific analytical challenge. |
| 4 | Task-specific reasoning with steps customized to the analytical challenge. Each step names a specific dimension to examine. If the task has a known pitfall (e.g., stakeholder bias, oversimplified tradeoff), the reasoning addresses it. |
| 5 | All of 4, plus the reasoning sequence builds on itself — each step uses output of prior steps. Total step count is 5-7 (under the focus ceiling). Cross-domain tasks combine elements from multiple reasoning categories. |
If below 3 → most likely output failure is "analysis is shallow — states the obvious." This is the highest-leverage dimension for analytical depth.
Dimension 4: Output Precision
Does the output specification provide enough structure for a predictable deliverable?
| Score | Anchor |
|---|---|
| 1 | No output specification, or only vague instruction. ("Write a thorough analysis.") Claude will choose its own format, length, and structure. |
| 2 | Deliverable type named but structure unspecified. ("Write an executive brief.") General format target but no section-level guidance. |
| 3 | Named sections specified, but per-section length guidance and format rules (prose vs. bullets, tone) absent. Output will have the right sections but may be unbalanced. |
| 4 | Named sections with per-section length guidance, total length target, and format constraints (prose, tone, audience calibration). Output structure is predictable across runs. |
| 5 | All of 4, plus the output specification includes what to exclude ("do not include a general background section") or how to handle edge cases within the format. Calibrated to the specific deliverable, not a generic template. |
If below 3 → most likely output failures are format-related: wrong length, unbalanced sections, bullets where prose is needed.
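The "If below 3" notes for Dimensions 1 through 4 amount to a lookup from a weak dimension to its most likely output failure. A minimal sketch of that lookup follows, assuming scores arrive as a plain dictionary keyed by dimension name. The names `LIKELY_FAILURE` and `triage` are hypothetical, and the failure descriptions are lightly paraphrased from the notes above. Objective Clarity is listed first because the text says to fix the objective before touching any other layer; the order of the remaining entries simply follows the document and is an assumption, not a stated priority.

```python
# Most likely output failure when a dimension scores below 3,
# paraphrased from the "If below 3" notes for Dimensions 1-4.
LIKELY_FAILURE = {
    "Objective Clarity": "answers an adjacent question",
    "Context Specificity": "generic, could apply to any company",
    "Reasoning Fit": "analysis is shallow, states the obvious",
    "Output Precision": "format-related: wrong length, unbalanced sections, "
    "bullets where prose is needed",
}


def triage(scores: dict[str, int]) -> list[tuple[str, str]]:
    """Return (dimension, likely failure) pairs for each dimension scored below 3.

    Dimensions missing from `scores` are skipped rather than treated as weak.
    """
    return [
        (dimension, failure)
        for dimension, failure in LIKELY_FAILURE.items()
        if dimension in scores and scores[dimension] < 3
    ]
```

Calling `triage({"Objective Clarity": 2, "Reasoning Fit": 1})` lists the objective failure first, so the fix order follows the guidance under Dimension 1.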
Dimension 5: Behavioral Calibration
Are countermeasures present for the failure modes this task is likely to trigger — and only those?
| Score | Anchor |
|---|---|
| 1 | No behavioral countermeasures. The prompt relies entirely on Claude's defaults, leaving task-relevant tendencies (agreeableness, hedging, verbosity, list overuse) unaddressed. |
| 2 | Generic countermeasures not targeted to this task. ("Be concise. Be direct. Challenge assumptions.") Reasonable defaults but not task-specific. |
| 3 | One or two targeted countermeasures for the most likely failure mode (e.g., agreeableness countermeasure |