Clarity Gate v2.1
Purpose: Pre-ingestion verification system that enforces epistemic quality before documents enter RAG knowledge bases. Produces Clarity-Gated Documents (CGD) compliant with the Clarity Gate Format Specification v2.1.
Core Question: "If another LLM reads this document, will it mistake assumptions for facts?"
Core Principle: "Detection finds what is; enforcement ensures what should be. In practice: find the missing uncertainty markers before they become confident hallucinations."
What's New in v2.1
| Feature | Description |
|---|---|
| Claim Completion Status | PENDING/VERIFIED determined by field presence (no explicit status field) |
| Source Field Semantics | Actionable source (PENDING) vs. what-was-found (VERIFIED) |
| Claim ID Format Guidance | Hash-based IDs preferred, collision analysis for scale |
| Body Structure Requirements | HITL Verification Record section mandatory when claims exist |
| New Validation Codes | E-ST10, W-ST11, W-HC01, W-HC02, E-SC06 (FORMAT_SPEC); E-TB01-07 (SOT validation) |
| Bundled Scripts | claim_id.py and document_hash.py for deterministic computations |
Specifications
This skill implements and references:
| Specification | Version | Location |
|---|---|---|
| Clarity Gate Format (Unified) | v2.1 | docs/CLARITY_GATE_FORMAT_SPEC.md |
Note: v2.0 unifies CGD and SOT into a single .cgd.md format. SOT is now a CGD with an optional tier: block.
Validation Codes
Clarity Gate defines validation codes for structural and semantic checks per FORMAT_SPEC v2.1:
HITL Claim Validation (§1.3.2-1.3.3)
| Code | Check | Severity |
|---|---|---|
| W-HC01 | Partial confirmed-by/confirmed-date fields | WARNING |
| W-HC02 | Vague source (e.g., "industry reports", "TBD") | WARNING |
| E-SC06 | Schema error in hitl-claims structure | ERROR |
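As a rough illustration of how field presence could drive both claim status and W-HC01, the sketch below classifies a single claim entry. It assumes each claim is available as a dict and that the two fields named in W-HC01 (confirmed-by, confirmed-date) are the ones that decide completion. The function name, the return shape, and the choice to treat a partial claim as PENDING are hypothetical; FORMAT_SPEC §1.3.2-1.3.3 remains the authority.

```python
def classify_claim(claim: dict) -> tuple[str, list[str]]:
    """Return (status, warnings) for one hitl-claims entry.

    Assumption: a claim counts as VERIFIED only when both confirmation
    fields are present and non-empty; exactly one of them triggers W-HC01.
    """
    has_by = bool(str(claim.get("confirmed-by") or "").strip())
    has_date = bool(str(claim.get("confirmed-date") or "").strip())

    if has_by and has_date:
        return "VERIFIED", []
    if has_by or has_date:
        # Partial confirmation: treated here as still pending, with a warning.
        return "PENDING", ["W-HC01"]
    return "PENDING", []
```

Note that no explicit status field is read anywhere; the status is derived entirely from which fields exist.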
Body Structure (§1.2.1)
| Code | Check | Severity |
|---|---|---|
| E-ST10 | Missing ## HITL Verification Record when claims exist | ERROR |
| W-ST11 | Table rows don't match hitl-claims count | WARNING |
SOT Table Validation (§3.1)
| Code | Check | Severity |
|---|---|---|
| E-TB01 | No ## Verified Claims section | ERROR |
| E-TB02 | Table has no data rows | ERROR |
| E-TB03 | Required columns missing | ERROR |
| E-TB04 | Column order wrong | ERROR |
| E-TB05 | Empty cell in required column | ERROR |
| E-TB06 | Invalid date format in Verified column | ERROR |
| E-TB07 | Verified date in future (beyond 24h grace) | ERROR |
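The two date checks (E-TB06 and E-TB07) might look something like the following. This is a minimal sketch under stated assumptions: the expected date format is taken to be ISO YYYY-MM-DD, dates are interpreted as midnight UTC, and the function name is hypothetical. The actual format and timezone rules are defined in FORMAT_SPEC §3.1, not here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE = timedelta(hours=24)  # grace window named in E-TB07


def check_verified_date(value: str, now: Optional[datetime] = None) -> Optional[str]:
    """Return a validation code for a Verified cell, or None if it passes."""
    now = now or datetime.now(timezone.utc)
    try:
        # Assumption: YYYY-MM-DD, interpreted as midnight UTC.
        verified = datetime.strptime(value.strip(), "%Y-%m-%d").replace(tzinfo=timezone.utc)
    except ValueError:
        return "E-TB06"
    if verified > now + GRACE:
        return "E-TB07"
    return None
```

Passing `now` explicitly keeps the check deterministic in tests; a validator would normally leave it unset.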
Note: Additional validation codes may be defined in RFC-001 (clarification document) but are not part of the normative FORMAT_SPEC.
Bundled Scripts
This skill includes Python scripts for deterministic computations per FORMAT_SPEC.
scripts/claim_id.py
Computes stable, hash-based claim IDs for HITL tracking (per §1.3.4).
# Generate claim ID
python scripts/claim_id.py "Base price is $99/mo" "api-pricing/1"
# Output: claim-75fb137a
# Run test vectors
python scripts/claim_id.py --test
Algorithm:
- Normalize text (strip + collapse whitespace)
- Concatenate with location using pipe delimiter
- SHA-256 hash, take first 8 hex chars
- Prefix with "claim-"
Test vectors:
- `claim_id("Base price is $99/mo", "api-pricing/1")` → `claim-75fb137a`
- `claim_id("The API supports GraphQL", "features/1")` → `claim-eb357742`
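The algorithm steps listed above can be sketched in a few lines. This is an illustrative reading, not the bundled script: it assumes the payload is the normalized text, a single pipe, then the location, encoded as UTF-8. Whether it reproduces the published test vectors depends on the exact byte layout used by scripts/claim_id.py, so treat `python scripts/claim_id.py --test` as the reference.

```python
import hashlib


def claim_id(text: str, location: str) -> str:
    """Sketch of a hash-based claim ID (assumed payload layout: text|location)."""
    # Strip leading/trailing whitespace and collapse internal runs to one space.
    normalized = " ".join(text.split())
    payload = f"{normalized}|{location}"
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # First 8 hex characters of the digest, prefixed with "claim-".
    return "claim-" + digest[:8]
```

Because whitespace is normalized before hashing, re-wrapping or re-indenting a claim does not change its ID, while any change to the wording or the location does.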
scripts/document_hash.py
Computes document SHA-256 hash per FORMAT_SPEC §2.2-2.4 with full canonicalization.
# Compute hash
python scripts/document_hash.py my-doc.cgd.md
# Output: 7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730
# Verify existing hash
python scripts/document_hash.py --verify my-doc.cgd.md
# Output: PASS: Hash verified: 7d865e...
# Run normalization tests
python scripts/document_hash.py --test
Algorithm (per §2.2-2.4):
- Extract content between opening `---\n` and `<!-- CLARITY_GATE_END -->`
- Remove `document-sha256` line from YAML frontmatter ONLY (with multiline continuation support)
- Canonicalize:
  - Strip trailing whitespace per line
  - Collapse 3+ consecutive newlines to 2
  - Normalize final newline (exactly 1 LF)
  - UTF-8 NFC normalization
- Compute SHA-256
Cross-platform normalization:
- BOM removed if present
- CRLF to LF (Windows)
- CR to LF (old Mac)
- Boundary detection (prevents hash computation on content outside CGD structure)
- Whitespace variations produce identical hashes (deterministic across platforms)
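To make the canonicalization concrete, here is a simplified sketch of the hashing pipeline. It is not the bundled script and makes several assumptions: the hashed region starts just after the opening frontmatter fence and stops just before the end marker, the frontmatter closes with a `---` line, and the `document-sha256` value sits on a single line (the multiline continuation handling mentioned above is omitted). All function names are hypothetical, and hashes from this sketch should not be expected to match scripts/document_hash.py.

```python
import hashlib
import re
import unicodedata

END_MARKER = "<!-- CLARITY_GATE_END -->"


def normalize_platform(text: str) -> str:
    """Remove a leading BOM and convert CRLF / CR line endings to LF."""
    return text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")


def extract_hashed_region(text: str) -> str:
    """Return the region to hash, with the frontmatter hash line removed."""
    text = normalize_platform(text)
    if not text.startswith("---\n"):
        raise ValueError("document does not open with a frontmatter fence")
    end = text.find(END_MARKER)
    if end == -1:
        raise ValueError("end marker not found")
    region = text[len("---\n"):end]
    frontmatter, fence, body = region.partition("\n---\n")
    if not fence:
        raise ValueError("frontmatter is not closed")
    # Drop the hash line from the frontmatter only; the body is left untouched.
    kept = [ln for ln in frontmatter.split("\n") if not ln.startswith("document-sha256:")]
    return "\n".join(kept) + fence + body


def canonicalize(text: str) -> str:
    """Apply the whitespace and Unicode canonicalization steps in order."""
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.rstrip("\n") + "\n"
    return unicodedata.normalize("NFC", text)


def document_hash(text: str) -> str:
    region = canonicalize(extract_hashed_region(text))
    return hashlib.sha256(region.encode("utf-8")).hexdigest()
```

Excluding the `document-sha256` line from its own input is what lets the stored hash be verified later: the hash never depends on its own value.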
The Key Distinction
Existing tools like UnScientify and HedgeHunter (CoNLL-2010) detect uncertainty markers already present in text ("Is uncertainty expressed?").
Clarity Gate enforces their presence where epistemically required ("Should uncertainty be expressed but isn't?").
| Tool Type | Question | Example |
|---|---|---|
| Detection | "Does this text contain hedges?" | UnScientify/HedgeHunter find "may", "possibly" |
| Enforcement | "Should this claim be hedged but isn't?" | Clarity Gate flags "Revenue will be $50M" |
Critical Limitation
Clarity Gate verifies FORM, not TRUTH.
This skill checks whether claims are properly marked as uncertain; it cannot verify whether the claims are actually true.
Risk: An LLM can hallucinate facts INTO a document, then "pass" Clarity Gate by adding source markers to false claims.
Solution: HITL (Human-In-The-Loop) verification is MANDATORY before declaring PASS.
When to Use
- Before ingesting documents into RAG systems
- Before sharing documents with other AI systems
- After writing specifications, state docs, or methodology descriptions
- When a document contains projections, estimates, or hypotheses
- Before publishing claims that haven't been validated
- When handing off documentation between LLM sessions
The 9 Verification Points
Relationship to Spec Suite
The 9 Verification Points guide semantic review: content quality checks that require judgment (human or AI). They answer questions like "Should this claim be hedged?" and "Are these numbers consistent?"
When review completes, output a CGD file conforming to CLARITY_GATE_FORMAT_SPEC.md. The C/S rules in CLARITY_GATE_FORMAT_SPEC.md validate file structure, not semantic content.
The connection:
- Semantic findings (9 points) determine what issues exist
- Issues are recorded in CGD state fields (`clarity-status`, `hitl-status`, `hitl-pending-count`)
- State consistency is enforced by structural rules (C7-C10)
Example: If Point 5 (Data Consistency) finds conflicting numbers, you'd mark clarity-status: UNCLEAR until resolved. Rule C7 then ensures you can't claim REVIEWED while still UNCLEAR.
Epistemic Checks (Core Focus: Points 1-4)
1. HYPOTHESIS vs FACT LABELING: Every claim must be clearly marked as validated or hypothetical.
| Fails | Passes |
|---|---|
| "Our architecture outperforms competitors" | "Our architecture outperforms competitors [benchmark data in Table 3]" |
| "The model achieves 40% improvement" | "The model achieves 40% improvement [measured on dataset X]" |
Fix: Add markers: "PROJECTED:", "HYPOTHESIS:", "UNTESTED:", "(estimated)", "~", "?"
2. UNCERTAINTY MARKER ENFORCEMENT: Forward-looking statements require qualifiers.
| Fails | Passes |
|---|---|
| "Revenue will be $50M by Q4" | "Revenue is projected to be $50M by Q4" |
| "The feature will reduce churn" | "The feature is expected to reduce churn" |
Fix: Add "projected", "estimated", "expected", "designed to", "intended to"
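A crude first-pass filter for this point could flag sentences that use "will" without any of the qualifiers listed in the fix. The sketch below is only a heuristic under that assumption: the function name is hypothetical, the qualifier list is copied from the fix line above, and keyword matching cannot replace the judgment this review point calls for.

```python
import re

# Qualifiers taken from the suggested fix; anything else is not recognized.
QUALIFIERS = ("projected", "estimated", "expected", "designed to", "intended to")
FUTURE = re.compile(r"\bwill\b", re.IGNORECASE)


def needs_qualifier(sentence: str) -> bool:
    """Return True if the sentence looks forward-looking but unhedged."""
    lowered = sentence.lower()
    if any(q in lowered for q in QUALIFIERS):
        return False
    return bool(FUTURE.search(sentence))


for s in ("Revenue will be $50M by Q4", "Revenue is projected to be $50M by Q4"):
    print(needs_qualifier(s), "-", s)
```

Flagged sentences are candidates for review, not confirmed failures; a human or LLM reviewer still decides whether a qualifier is epistemically required.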
3. ASSUMPTION VISIBILITY: Implicit assumptions that affect interpretation must be explicit.
| Fails | Passes |
|---|---|
| "The system scales linearly" | "The system scales linearly [assuming <1000 concurre |