Understanding RLHF

Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning language models with human preferences. Rather than relying solely on next-token prediction, RLHF uses human judgment to guide model behavior toward helpful, harmless, and honest outputs.

Core Concepts
The RLHF Pipeline
Preference Data
Instruction Tuning
Reward Modeling
Policy Optimization
Direct Alignment Algorithms
Challenges
Best Practices
References

Core Concepts

Why RLHF?

Pretraining produces models that predict likely text, not necessarily good text. A model trained on internet data learns to complete text in ways that reflect its training distribution—including toxic, unhelpful, or dishonest patterns. RLHF addresses this gap by optimizing for human preferences rather than likelihood.

The core insight: humans can often recognize good outputs more easily than they can specify what makes an output good. RLHF exploits this by collecting human judgments and using them to shape model behavior.

The Alignment Problem

Alignment work targets several properties:

Helpfulness: Following instructions and providing useful information
Harmlessness: Avoiding toxic, dangerous, or inappropriate outputs
Honesty: Acknowledging uncertainty and avoiding fabrication
Intent alignment: Understanding what users actually want, not just what they say

RLHF provides a framework for encoding these properties through preference data.

Key Components

Preference data: Human judgments comparing model outputs
Reward model: A learned function approximating human preferences
Policy optimization: RL algorithms that maximize expected reward
Regularization: Constraints preventing deviation from the base model

The RLHF Pipeline

The standard RLHF pipeline consists of three main stages:

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pretrained language model and fine-tune it on high-quality demonstrations. This teaches the model the desired format and style of responses.

Input: Pretrained model + demonstration dataset
Output: SFT model that can follow instructions

Stage 2: Reward Model Training

Train a model to predict human preferences between pairs of outputs. The reward model learns to score outputs in a way that correlates with human judgment.

Input: SFT model + preference dataset (chosen/rejected pairs)
Output: Reward model that scores any output

Stage 3: Policy Optimization

Use reinforcement learning to optimize the SFT model against the reward model, while staying close to the original SFT distribution.

Input: SFT model + reward model
Output: Final aligned model

Alternative: Direct Alignment

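Direct alignment algorithms (DPO, IPO, KTO) skip the separate reward model and optimize the policy directly on preference data. This simplifies the pipeline, but it gives up some flexibility: with no standalone reward model, there is nothing to score freshly sampled responses during training.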
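As a rough illustration of what "optimizing directly from preference data" can look like, here is a minimal sketch of the DPO loss for one preference pair. It assumes the summed log-probabilities of each response under the policy and under the frozen reference model have already been computed. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not part of any specific library.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    All arguments are sequence log-probabilities; beta=0.1 is an assumed default.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)), i.e. softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The policy has moved toward the chosen response and away from the rejected one,
# so the loss falls below log(2), its value when policy and reference agree.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

The loss has the same pairwise form as a reward-model loss, with the policy-to-reference log-ratio standing in for the reward.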

Modern Post-Training Variants

Current post-training stacks often mix these stages rather than using a single linear pipeline:

Method	Data Signal	Best Fit
SFT	Demonstrations	Format, style, instruction following
DPO/IPO/KTO/ORPO	Offline preferences or binary feedback	Simpler alignment without online rollouts
PPO/RLOO	Reward model scores on sampled responses	Reward-model RL with explicit KL control
GRPO	Grouped completions scored by reward functions/models	Reasoning and verifiable-task optimization
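To make the GRPO row more concrete: GRPO scores a group of completions for the same prompt and uses each completion's reward relative to its group as the advantage. The sketch below assumes the common mean/std normalization; the function name `group_relative_advantages` and the `eps` stabilizer are illustrative, and implementations differ in details such as which standard deviation they use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds the scores of several completions for one prompt.
    `eps` (an assumed value) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by a verifiable pass/fail check.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group average get positive advantages and the rest get negative ones.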

Preference Data

Preference data encodes human judgment about model outputs. The most common format is pairwise comparisons.

Pairwise Preferences

Given a prompt, collect two or more model outputs and have humans indicate which is better:

Prompt: "Explain quantum entanglement"

Response A: [technical explanation]
Response B: [simpler explanation with analogy]

Human preference: B > A

This creates (prompt, chosen, rejected) tuples for training.
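One possible way to hold these tuples in code is sketched below, assuming a plain dataclass is enough. The names `PreferencePair` and `to_pair` are hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One pairwise comparison: the annotator preferred `chosen` over `rejected`."""
    prompt: str
    chosen: str
    rejected: str


def to_pair(prompt: str, response_a: str, response_b: str, preferred: str) -> PreferencePair:
    """Convert an A/B annotation into a (prompt, chosen, rejected) record.

    `preferred` must be "A" or "B"; ties are not representable in this sketch.
    """
    if preferred == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if preferred == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"preferred must be 'A' or 'B', got {preferred!r}")


# The example above: the annotator preferred response B.
pair = to_pair(
    "Explain quantum entanglement",
    response_a="[technical explanation]",
    response_b="[simpler explanation with analogy]",
    preferred="B",
)
```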

Collection Methods

Human annotation: Trained annotators compare outputs according to guidelines. Most reliable but expensive and slow.

AI feedback: Use a capable model to generate preferences. Faster and cheaper but may propagate biases. This is the basis for Constitutional AI (CAI) and RLAIF.

Implicit signals: User interactions like upvotes, regeneration requests, or conversation length. Noisy but abundant.

Data Quality Considerations

Annotator agreement: Low agreement suggests ambiguous criteria or subjective preferences
Distribution coverage: Preferences should cover the range of model behaviors
Prompt diversity: Diverse prompts prevent overfitting to narrow scenarios
Preference strength: Some comparisons are clear; others are nearly ties
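Annotator agreement can be checked with a simple raw agreement rate, sketched below. It assumes each item was labeled by several annotators with "A" or "B"; the function name `pairwise_agreement` is illustrative. Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's kappa may be a better fit in practice.

```python
from itertools import combinations


def pairwise_agreement(votes_per_item):
    """Fraction of annotator pairs that gave the same label, pooled over items.

    `votes_per_item` is a list of lists, e.g. [["A", "A", "B"], ["B", "B", "B"]].
    Items with fewer than two votes contribute no pairs.
    """
    agree = 0
    total = 0
    for votes in votes_per_item:
        for left, right in combinations(votes, 2):
            agree += left == right
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more votes")
    return agree / total


# First item: 1 of 3 annotator pairs agree; second item: 3 of 3 agree.
rate = pairwise_agreement([["A", "A", "B"], ["B", "B", "B"]])  # 4/6
```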

Instruction Tuning

Instruction tuning (supervised fine-tuning on instruction-response pairs) serves as the foundation for RLHF.

Purpose

Teaches the model to follow instructions rather than just complete text
Establishes the format and style for responses
Creates a starting point that already exhibits desired behaviors
Provides the reference policy for KL regularization

Dataset Composition

Typical instruction tuning datasets include:

Single-turn QA: Questions with direct answers
Multi-turn dialogue: Conversational exchanges
Task instructions: Specific tasks with examples
Chain-of-thought: Reasoning traces for complex problems

Relationship to RLHF

The SFT model defines the "prior" that RLHF refines. A better SFT model means:

The reward model has better starting outputs to compare
Policy optimization has less work to do
The KL penalty anchors the final model to a stronger baseline

Reward Modeling

The reward model transforms pairwise preferences into a scalar signal for RL optimization.

The Bradley-Terry Model

Preferences are modeled using the Bradley-Terry framework:

P(A > B) = sigmoid(r(A) - r(B))

Where r(x) is the reward for output x. This assumes preferences depend only on the difference in rewards.

The loss function is:

L = -log(sigmoid(r(chosen) - r(rejected)))

This pushes the reward model to assign higher scores to chosen outputs.
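The two formulas above can be written out in a few lines. This is a minimal scalar sketch rather than a training loop. The helper names are illustrative, and a real implementation would apply the same expression to batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_probability(r_a: float, r_b: float) -> float:
    """P(A > B) = sigmoid(r(A) - r(B))."""
    return math.exp(log_sigmoid(r_a - r_b))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """L = -log(sigmoid(r(chosen) - r(rejected)))."""
    return -log_sigmoid(r_chosen - r_rejected)


# Equal rewards give a coin flip; a correct ranking costs less than a wrong one.
p_tie = preference_probability(1.0, 1.0)
loss_correct = bradley_terry_loss(2.0, 0.0)
loss_wrong = bradley_terry_loss(0.0, 2.0)
```

Because only the difference r(chosen) - r(rejected) enters the loss, adding a constant to every reward leaves it unchanged.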

Architecture

A typical reward model:

Is the SFT model with a scalar head in place of the language modeling head
Is trained on (prompt, chosen, rejected) tuples
Outputs a single scalar reward for any (prompt, response) pair

Considerations

Scaling: Larger reward models generally produce better signals
Calibration: Absolute reward values are less important than rankings
Generalization: The model must score outputs it hasn't seen during training
Over-optimization: Policies can exploit reward model weaknesses

See reference/reward-modeling.md for detailed training procedures.

Policy Optimization

Policy optimization uses RL to maximize expected reward while staying close to the reference policy.

The RLHF Objective

maximize E[R(x, y)] - β * KL(π || π_ref)

Where:

R(x, y) is the reward for response y to prompt x
KL(π || π_ref) measures deviation from the reference policy
β controls the strength of the regularization
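In practice the objective is usually estimated from sampled responses. The sketch below assumes each sample carries its reward plus the log-probability of the response under the policy and under the reference model. For a response drawn from the policy, the log-probability difference is a single-sample estimate of the KL term. The function names and the sample values are hypothetical.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """Single-sample estimate of R(x, y) - beta * KL(pi || pi_ref).

    Valid as a KL estimate only when y was sampled from the policy pi.
    """
    return reward - beta * (logp_policy - logp_ref)


def estimate_objective(samples, beta):
    """Average the penalized reward over (reward, logp_policy, logp_ref) triples."""
    values = [kl_penalized_reward(r, lp, lr, beta) for r, lp, lr in samples]
    return sum(values) / len(values)


# Two sampled responses; the first has drifted from the reference, the second has not.
samples = [(1.0, -2.0, -2.5), (0.5, -3.0, -3.0)]
objective = estimate_objective(samples, beta=0.1)
```

Raising β makes the same drift more expensive, which pulls the optimum back toward the reference policy.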

PPO (Proximal Policy Optimization)

PPO is the classic reward-model RLHF algorithm:

Sample responses from the current policy
Score responses with the reward model
Compute advantage estimates
Update policy with clipped surrogate objective

The clipping prevents large policy updates that could destabilize training.
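The clipped surrogate can be sketched for a single sample as follows. The function name is illustrative and `clip_eps=0.2` is an assumed, commonly used default. A real implementation would average this quantity over tokens and batches and maximize it.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# The probability ratio is 1.5, but with a positive advantage the objective
# is capped at (1 + clip_eps) * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
```

Once the ratio leaves the clip range in the direction the advantage favors, the objective stops increasing, so there is no incentive to push the update further.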

KL Regularization

The KL penalty serves multiple purposes:

Prevents reward hacking: Stops the policy from drifting toward outputs that exploit weaknesses in the reward model
Maintains capabilities
