LLM Evaluation

Evaluate Large Language Model (LLM) systems using automated metrics, LLM-as-judge patterns, and standardized benchmarks to ensure production quality and safety.

When to Use This Skill

Apply this skill when:

Testing individual prompts for correctness and formatting
Validating RAG (Retrieval-Augmented Generation) pipeline quality
Measuring hallucinations, bias, or toxicity in LLM outputs
Comparing different models or prompt configurations (A/B testing)
Running benchmark tests (MMLU, HumanEval) to assess model capabilities
Setting up production monitoring for LLM applications
Integrating LLM quality checks into CI/CD pipelines

Common triggers:

"How do I test if my RAG system is working correctly?"
"How can I measure hallucinations in LLM outputs?"
"What metrics should I use to evaluate generation quality?"
"How do I compare GPT-4 vs Claude for my use case?"
"How do I detect bias in LLM responses?"

Evaluation Strategy Selection

Decision Framework: Which Evaluation Approach?

By Task Type:

Task Type	Primary Approach	Metrics	Tools
Classification (sentiment, intent)	Automated metrics	Accuracy, Precision, Recall, F1	scikit-learn
Generation (summaries, creative text)	LLM-as-judge + automated	BLEU, ROUGE, BERTScore, Quality rubric	GPT-4/Claude for judging
Question Answering	Exact match + semantic similarity	EM, F1, Cosine similarity	Custom evaluators
RAG Systems	RAGAS framework	Faithfulness, Answer/Context relevance	RAGAS library
Code Generation	Unit tests + execution	Pass@K, Test pass rate	HumanEval, pytest
Multi-step Agents	Task completion + tool accuracy	Success rate, Efficiency	Custom evaluators
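
Several metrics in the table above (EM, F1, Pass@K) can be computed with the standard library alone. The sketch below is illustrative, not a reference implementation: the `normalize` helper is a simplified assumption (lowercase and strip punctuation), and `pass_at_k` uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass the tests.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split into tokens (simplified)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> bool:
    """EM: normalized prediction equals normalized reference."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n passes, given c of n passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

Token-level F1 gives partial credit when an answer contains the reference plus extra words, which is why it is usually reported alongside EM rather than instead of it.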

By Volume and Cost:

Samples	Speed	Cost	Recommended Approach
1,000+	Immediate	$0	Automated metrics (regex, JSON validation)
100-1,000	Minutes	$0.01-0.10 each	LLM-as-judge (GPT-4, Claude)
< 100	Hours	$1-10 each	Human evaluation (pairwise comparison)

Layered Approach (Recommended for Production):

Layer 1: Automated metrics for all outputs (fast, cheap)
Layer 2: LLM-as-judge on a 10% sample of outputs (nuanced quality)
Layer 3: Human review on a 1% sample focused on edge cases (validation)
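
One way to wire the three layers together is a small router that decides, per output, which layers should run. This is a minimal sketch under stated assumptions: `automated_check`, `sample_bucket`, and `route` are hypothetical names, the Layer 1 check is a stand-in (valid JSON object), and treating a failed automated check as an "edge case" for human review is a design choice, not something the layers above prescribe.

```python
import hashlib
import json


def automated_check(output: str) -> bool:
    """Layer 1 stand-in: cheap check applied to every output (here: is it a JSON object?)."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def sample_bucket(output_id: str) -> float:
    """Map an ID to a stable value in [0, 1) so sampling is reproducible across runs."""
    digest = hashlib.sha256(output_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 16**8


def route(output_id: str, output: str,
          judge_rate: float = 0.10, human_rate: float = 0.01) -> list[str]:
    """Return the evaluation layers that should run for this output."""
    layers = ["automated"]  # Layer 1 always runs
    bucket = sample_bucket(output_id)
    if bucket < judge_rate:
        layers.append("llm_judge")  # Layer 2: sampled
    if bucket < human_rate or not automated_check(output):
        layers.append("human_review")  # Layer 3: sampled, plus automated failures
    return layers
```

Hashing the ID rather than calling a random number generator means the same output always lands in the same sample, which makes results easier to reproduce and audit.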

Core Evaluation Patterns

Unit Evaluation (Individual Prompts)

Test single prompt-response pairs for correctness.

Methods:

Exact Match: Response exactly matches expected output
Regex Matching: Response follows expected pattern
JSON Schema Validation: Structured output validation
Keyword Presence: Required terms appear in response
LLM-as-Judge: Binary pass/fail using evaluation prompt
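
The first four methods need no model call and can be sketched with the standard library. The function names below are illustrative assumptions, and `has_required_keys` is only a lightweight stand-in for full JSON Schema validation (it checks key presence and value types, nothing more).

```python
import json
import re


def exact_match(response: str, expected: str) -> bool:
    """Exact Match, ignoring surrounding whitespace."""
    return response.strip() == expected.strip()


def regex_match(response: str, pattern: str) -> bool:
    """Regex Matching: the whole response must follow the pattern."""
    return re.fullmatch(pattern, response.strip()) is not None


def has_required_keys(response: str, required: dict[str, type]) -> bool:
    """Structured output check: response parses as a JSON object with typed keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), expected_type)
               for key, expected_type in required.items())


def contains_keywords(response: str, keywords: list[str]) -> bool:
    """Keyword Presence: every required term appears (case-insensitive)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)
```

For example, `regex_match("2024-01-15", r"\d{4}-\d{2}-\d{2}")` passes, while a response with extra text around the date fails because the whole string must match.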

Example Use Cases:

Email classification (spam/not spam)
Entity extraction (dates, names, locations)
JSON output formatting validation
Sentiment analysis (positive/negative/neutral)

Quick Start (Python):

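from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(text: str) -> str:
    """Return the model's sentiment label, normalized to lowercase."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify sentiment as positive, negative, or neutral. Return only the label."},
            {"role": "user", "content": text}
        ],
        temperature=0  # reduce run-to-run variation so the test is more stable
    )
    # Strip whitespace and a trailing period so "Positive." still matches
    return response.choices[0].message.content.strip().rstrip(".").lower()

# pytest collects any function named test_*; run with: pytest <this file>
def test_positive_sentiment():
    result = classify_sentiment("I love this product!")
    assert result == "positive"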

For complete unit evaluation examples, see examples/python/unit_evaluation.py and examples/typescript/unit-evaluation.ts.

RAG (Retrieval-Augmented Generation) Evaluation

Evaluate RAG systems using RAGAS framework metrics.

Critical Metrics (Priority Order):

Faithfulness (Target: > 0.8) - MOST CRITICAL
- Measures: Is the answer grounded in retrieved context?
- Prevents hallucinations
- If failing: Adjust prompt to emphasize grounding, require citations
Answer Relevance (Target: > 0.7)
- Measures: How well does the answer address the query?
- If failing: Improve prompt instructions, add few-shot examples
Context Relevance (Target: > 0.7)
- Measures: Are retrieved chunks relevant to the query?
- If failing: Improve retrieval (better embeddings, hybrid search)
Context Precision (Target: > 0.5)
- Measures: Are relevant chunks ranked higher than irrelevant?
- If failing: Add re-ranking step to retrieval pipeline
Context Recall (Target: > 0.8)
- Measures: Are all relevant chunks retrieved?
- If failing: Increase retrieval count, improve chunking strategy
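
The targets and fixes above can be encoded as a small lookup so a failing run reports what to try next. This is a sketch, not part of RAGAS: `TARGETS` and `diagnose` are hypothetical names, the metric keys are illustrative and may not match the key names your RAGAS version returns, and a score is treated as failing when it does not exceed its target.

```python
# (target, suggested fix) per metric, in the priority order listed above.
TARGETS: dict[str, tuple[float, str]] = {
    "faithfulness": (0.8, "Adjust prompt to emphasize grounding, require citations"),
    "answer_relevance": (0.7, "Improve prompt instructions, add few-shot examples"),
    "context_relevance": (0.7, "Improve retrieval (better embeddings, hybrid search)"),
    "context_precision": (0.5, "Add re-ranking step to retrieval pipeline"),
    "context_recall": (0.8, "Increase retrieval count, improve chunking strategy"),
}


def diagnose(scores: dict[str, float]) -> list[tuple[str, float, str]]:
    """Return (metric, score, fix) for each reported metric that misses its target."""
    failures = []
    for metric, (target, fix) in TARGETS.items():  # dicts preserve insertion order
        score = scores.get(metric)
        if score is not None and score <= target:
            failures.append((metric, score, fix))
    return failures


if __name__ == "__main__":
    for metric, score, fix in diagnose({"faithfulness": 0.72, "context_recall": 0.9}):
        print(f"{metric} = {score:.2f}: {fix}")
```

Because failures come back in priority order, the first item in the list is the one to address first.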

Quick Start (Python with RAGAS):

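# Assumes a RAGAS release that accepts the question/answer/contexts/ground_truth
# column names, and an LLM API key (e.g. OPENAI_API_KEY) set in the environment.
from datasets import Dataset
from ragas import evaluate
# context_relevancy is deprecated in newer RAGAS releases; context_precision is used instead.
from ragas.metrics import faithfulness, answer_relevancy, context_precision

data = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],  # one list of chunks per question
    "ground_truth": ["Paris"]
}

dataset = Dataset.from_dict(data)
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(f"Faithfulness: {results['faithfulness']:.2f}")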

For comprehensive RAG evaluation patterns, see references/rag-evaluation.md and examples/python/ragas_example.py.

LLM-as-Judge Evaluation

Use powerful LLMs (GPT-4, Claude Opus) to evaluate other LLM outputs.

When to Use:

Generation quality assessment (summaries, creative writing)
Nuanced evaluation criteria (tone, clarity, helpfulness)
Custom rubrics for domain-specific tasks
Medium-volume evaluation (100-1,000 samples)

Correlation with Human Judgment: typically reported around 0.75-0.85 for well-designed rubrics; confirm against a human-labeled sample of your own data

Best Practices:

Use clear, specific rubrics (1-5 scale with detailed criteria)
Include few-shot examples in evaluation prompt
Average multiple evaluations to reduce variance
Be aware of biases (position bias, verbosity bias, self-preference)
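
Two of these practices, averaging repeated evaluations and guarding against position bias, can be sketched independently of any particular judge model. The `ScoreJudge` and `PairJudge` callables below are hypothetical signatures standing in for whatever function wraps your judge LLM; returning "tie" when the two orderings disagree is one possible policy, not the only one.

```python
from statistics import mean
from typing import Callable

# Hypothetical judge signatures (assumptions for this sketch):
ScoreJudge = Callable[[str, str], int]      # (prompt, response) -> score 1-5
PairJudge = Callable[[str, str, str], str]  # (prompt, first, second) -> "first" or "second"


def averaged_score(judge: ScoreJudge, prompt: str, response: str, runs: int = 3) -> float:
    """Call the judge several times and average the scores to reduce variance."""
    return mean(judge(prompt, response) for _ in range(runs))


def debiased_preference(judge: PairJudge, prompt: str, a: str, b: str) -> str:
    """Ask twice with the order swapped; accept a winner only if both orders agree."""
    forward = judge(prompt, a, b)   # a is shown first
    backward = judge(prompt, b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "tie"
```

A judge that always prefers whichever response it sees first will produce contradictory winners across the two orderings, so it is reported as a tie instead of silently favoring one side.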

Quick Start (Python):

import re

from openai import OpenAI

client = OpenAI()

def evaluate_quality(prompt: str, response: str) -> tuple[int, str]:
    """Returns (score 1-5, reasoning)"""
    eval_prompt = f"""
Rate the following LLM response on relevance and helpfulness.

USER PROMPT: {prompt}
LLM RESPONSE: {response}

Provide:
Score: [1-5, where 5 is best]
Reasoning: [1-2 sentences]
"""
    result = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": eval_prompt}],
        temperature=0.3
    )
    content = result.choices[0].message.content or ""
    # Parse with regexes instead of fixed line positions: judges often add
    # blank lines or extra text such as "Score: 4/5".
    score_match = re.search(r"Score:\s*\[?([1-5])", content)
    if score_match is None:
        raise ValueError(f"Could not parse a 1-5 score from: {content!r}")
    reasoning_match = re.search(r"Reasoning:\s*(.+)", content, re.DOTALL)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return int(score_match.group(1)), reasoning

For detailed LLM-as-judge patterns and prompt templates, see references/llm-as-judge.md and examples/python/llm_as_judge.py.

Safety and Alignment Evaluation

Measure hallucinations, bias, and toxicity in LLM outputs.

Hallucination Detection

Methods:

Faithfulness to Context (RAG):
- Use RAGAS faithfulness metric
- LLM checks if claims are supported by context
- Score: Supported claims / Total claims
Factual Accuracy (Closed-Book):
- LLM-as-judge with access to reliable sources
- Fact-checking APIs (e.g., Google Fact Check Tools API)
- Entity-level verification (dates, names, statistics)
Self-Consistency:
- Generate multiple responses to same question
- Measure agreement between responses
- Low consistency suggests hallucination
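
The faithfulness ratio and the self-consistency check both reduce to simple arithmetic once the per-claim verdicts or sampled answers are available. In this sketch the function names are hypothetical, the per-claim verdicts are assumed to come from an LLM or human checker, and agreement is measured by exact string match after normalization, which is a simplification (semantic similarity would be more forgiving of paraphrases).

```python
from collections import Counter


def faithfulness_score(claim_supported: list[bool]) -> float:
    """Supported claims / total claims, given one verdict per extracted claim."""
    if not claim_supported:
        raise ValueError("need at least one claim")
    return sum(claim_supported) / len(claim_supported)


def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower() for answer in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


if __name__ == "__main__":
    print(faithfulness_score([True, True, False, True]))
    print(self_consistency(["Paris", "paris ", "Lyon"]))
```

A self-consistency value near 1.0 means the model gives the same answer across samples; values well below that are a signal to inspect the question, not proof of hallucination on their own.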

Bias Evaluation

Types of Bias:

Gender bias (stereotypical associations)
Racial/ethnic bias (discriminatory outputs)
Cultural bias (Western-centric assumptions)
Age/disability bias (ableist or ageist language)

Evaluation Methods:

Stereotype Tests:
