LLM Evaluation
Master comprehensive evaluation strategies for LLM applications, from automated metrics to human evaluation and A/B testing.
Do not use this skill when
- The task is unrelated to llm evaluation
- You need a different domain or tool outside this scope
Instructions
- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open
resources/implementation-playbook.md.
Use this skill when
- Measuring LLM application performance systematically
- Comparing different models or prompts
- Detecting performance regressions before deployment
- Validating improvements from prompt changes
- Building confidence in production systems
- Establishing baselines and tracking progress over time
- Debugging unexpected model behavior
Core Evaluation Types
1. Automated Metrics
Fast, repeatable, scalable evaluation using computed scores.
Text Generation:
- BLEU: N-gram overlap (translation)
- ROUGE: Recall-oriented (summarization)
- METEOR: Semantic similarity
- BERTScore: Embedding-based similarity
- Perplexity: Language model confidence
Classification:
- Accuracy: Percentage correct
- Precision/Recall/F1: Class-specific performance
- Confusion Matrix: Error patterns
- AUC-ROC: Ranking quality
Retrieval (RAG):
- MRR: Mean Reciprocal Rank
- NDCG: Normalized Discounted Cumulative Gain
- Precision@K: Relevant in top K
- Recall@K: Coverage in top K
2. Human Evaluation
Manual assessment for quality aspects difficult to automate.
Dimensions:
- Accuracy: Factual correctness
- Coherence: Logical flow
- Relevance: Answers the question
- Fluency: Natural language quality
- Safety: No harmful content
- Helpfulness: Useful to the user
3. LLM-as-Judge
Use stronger LLMs to evaluate weaker model outputs.
Approaches:
- Pointwise: Score individual responses
- Pairwise: Compare two responses
- Reference-based: Compare to gold standard
- Reference-free: Judge without ground truth
Quick Start
from llm_eval import EvaluationSuite, Metric
# Define evaluation suite
suite = EvaluationSuite([
Metric.accuracy(),
Metric.bleu(),
Metric.bertscore(),
Metric.custom(name="groundedness", fn=check_groundedness)
])
# Prepare test cases
test_cases = [
{
"input": "What is the capital of France?",
"expected": "Paris",
"context": "France is a country in Europe. Paris is its capital."
},
# ... more test cases
]
# Run evaluation
results = suite.evaluate(
model=your_model,
test_cases=test_cases
)
print(f"Overall Accuracy: {results.metrics['accuracy']}")
print(f"BLEU Score: {results.metrics['bleu']}")
Automated Metrics Implementation
BLEU Score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
def calculate_bleu(reference, hypothesis):
"""Calculate BLEU score between reference and hypothesis."""
smoothie = SmoothingFunction().method4
return sentence_bleu(
[reference.split()],
hypothesis.split(),
smoothing_function=smoothie
)
# Usage
bleu = calculate_bleu(
reference="The cat sat on the mat",
hypothesis="A cat is sitting on the mat"
)
ROUGE Score
from rouge_score import rouge_scorer
def calculate_rouge(reference, hypothesis):
"""Calculate ROUGE scores."""
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, hypothesis)
return {
'rouge1': scores['rouge1'].fmeasure,
'rouge2': scores['rouge2'].fmeasure,
'rougeL': scores['rougeL'].fmeasure
}
BERTScore
from bert_score import score
def calculate_bertscore(references, hypotheses):
"""Calculate BERTScore using pre-trained BERT."""
P, R, F1 = score(
hypotheses,
references,
lang='en',
model_type='microsoft/deberta-xlarge-mnli'
)
return {
'precision': P.mean().item(),
'recall': R.mean().item(),
'f1': F1.mean().item()
}
Custom Metrics
def calculate_groundedness(response, context):
"""Check if response is grounded in provided context."""
# Use NLI model to check entailment
from transformers import pipeline
nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")
result = nli(f"{context} [SEP] {response}")[0]
# Return confidence that response is entailed by context
return result['score'] if result['label'] == 'ENTAILMENT' else 0.0
def calculate_toxicity(text):
"""Measure toxicity in generated text."""
from detoxify import Detoxify
results = Detoxify('original').predict(text)
return max(results.values()) # Return highest toxicity score
def calculate_factuality(claim, knowledge_base):
"""Verify factual claims against knowledge base."""
# Implementation depends on your knowledge base
# Could use retrieval + NLI, or fact-checking API
pass
LLM-as-Judge Patterns
Single Output Evaluation
def llm_judge_quality(response, question):
"""Use GPT-5 to judge response quality."""
prompt = f"""Rate the following response on a scale of 1-10 for:
1. Accuracy (factually correct)
2. Helpfulness (answers the question)
3. Clarity (well-written and understandable)
Question: {question}
Response: {response}
Provide ratings in JSON format:
{{
"accuracy": <1-10>,
"helpfulness": <1-10>,
"clarity": <1-10>,
"reasoning": "<brief explanation>"
}}
"""
result = openai.ChatCompletion.create(
model="gpt-5",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return json.loads(result.choices[0].message.content)
Pairwise Comparison
def compare_responses(question, response_a, response_b):
"""Compare two responses using LLM judge."""
prompt = f"""Compare these two responses to the question and determine which is better.
Question: {question}
Response A: {response_a}
Response B: {response_b}
Which response is better and why? Consider accuracy, helpfulness, and clarity.
Answer with JSON:
{{
"winner": "A" or "B" or "tie",
"reasoning": "<explanation>",
"confidence": <1-10>
}}
"""
result = openai.ChatCompletion.create(
model="gpt-5",
messages=[{"role": "user", "content": prompt}],
temperature=0
)
return json.loads(result.choices[0].message.content)
Human Evaluation Frameworks
Annotation Guidelines
class AnnotationTask:
"""Structure for human annotation task."""
def __init__(self, response, question, context=None):
self.response = response
self.question = question
self.context = context
def get_annotation_form(self):
return {
"question": self.question,
"context": self.context,
"response": self.response,
"ratings": {
"accuracy": {
"scale": "1-5",
"description": "Is the response factually correct?"
},
"relevance": {
"scale": "1-5",
"description": "Does it answer the question?"
},
"coherence": {
"scale": "1-5",
"description": "Is it logically consistent?"
}
},
"issues": {
"factual_error": False,
"hallucination": False,
"off_topic": False,
"unsafe_content": False
},
"feedback": ""
}
Inter-Rater Agreement
from sklearn.metrics import cohen_kappa_score
def calculate_agreement(rater1_scores, rater2_scores):
"""Cal