
Reinforcement Learning from Human Feedback (RLHF)

RLHF is the primary technique for turning pre-trained language models into useful assistants. Human annotators compare pairs of model outputs, and those preferences are used to train a "reward model" that scores responses. The language model is then fine-tuned with reinforcement learning to maximize this reward, producing an assistant whose behavior is more closely aligned with human preferences. ChatGPT, Claude, and Gemini all rely on variants of RLHF.

Category: Practices

Reading time: 6 min

Last updated: Feb 28, 2025

Published: Feb 28, 2025

Definition

A training technique that uses human preference ratings to fine-tune AI models to produce more helpful, accurate, and safe responses.

The three-stage process

Ouyang et al., 2022; Anthropic, 2022

RLHF typically involves three phases applied on top of a pre-trained foundation model.

• Supervised fine-tuning (SFT): the model is fine-tuned on high-quality human demonstrations of the desired behavior.
• Reward modeling: human annotators compare pairs of responses, and a separate reward model is trained to predict which response they prefer.
• RL optimization: the language model is updated using PPO (Proximal Policy Optimization) or similar algorithms to maximize the reward model's score.
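The reward modeling phase is often trained with a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The snippet below is a minimal sketch of that loss for a single comparison, assuming the reward model has already produced one scalar score per response; the function name and the example scores are illustrative and do not come from any particular library.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the preferred response
    well above the rejected one, and large when it ranks them the wrong way.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores: the reward model agrees with the annotator in the
# first call and disagrees in the second, so the second loss is larger.
agree = pairwise_reward_loss(score_chosen=2.0, score_rejected=0.0)
disagree = pairwise_reward_loss(score_chosen=0.0, score_rejected=2.0)
print(f"agree={agree:.4f} disagree={disagree:.4f}")
```

In practice this loss would be averaged over a batch of comparisons and minimized with gradient descent in a deep learning framework; the scalar version here only shows the shape of the objective.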

Variants: RLAIF and DPO

Anthropic, 2022; Rafailov et al., 2023

Because human annotation is expensive and slow, researchers have developed alternatives. RLAIF (Reinforcement Learning from AI Feedback) uses an AI model, rather than human annotators, to provide the preference labels. Direct Preference Optimization (DPO) removes the separate reward model and the RL step, optimizing the language model directly on preference data with a classification-style loss. This is simpler to implement and is often reported to be comparably effective.
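As a rough illustration of how DPO works without a reward model, the sketch below computes the DPO loss for one preference pair. It assumes the summed log-probabilities of each response under the model being trained (the "policy") and under a frozen reference copy are already available; the argument names and the default beta of 0.1 are illustrative choices, not values prescribed by the source.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed strength of the implicit KL constraint
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # How much more likely each response is under the policy than under the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# Hypothetical log-probabilities: the policy has shifted probability toward
# the chosen response relative to the reference, so the loss falls below log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy and the reference assign identical log-probabilities, the loss equals log(2); it decreases as the policy raises the chosen response's likelihood relative to the rejected one.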

FAQ

Why does RLHF sometimes make models less capable?

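The reward model is only an imperfect proxy for what annotators actually want. Over-optimizing against it can cause "reward hacking," where the model learns to produce responses that score well on the reward model but are not actually better. Fine-tuning also has to balance competing goals such as helpfulness and safety, so the strength of the optimization requires careful calibration.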

Ouyang et al., 2022
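One common guard against over-optimization, used for example in Ouyang et al., 2022, is to subtract a KL-style penalty from the reward so that the fine-tuned model is discouraged from drifting far from the model it started from. The sketch below shows the per-response form of that adjusted reward; the function name, argument names, and the default beta are assumptions made for illustration.

```python
def kl_penalized_reward(
    reward_score: float,
    policy_logp: float,
    ref_logp: float,
    beta: float = 0.02,  # assumed penalty weight; tuned per setup in practice
) -> float:
    """Reward model score minus a penalty for diverging from the reference model.

    policy_logp and ref_logp are the log-probabilities of the same response
    under the model being trained and under the frozen starting model.
    """
    log_ratio = policy_logp - ref_logp  # sample-based estimate of the KL term
    return reward_score - beta * log_ratio


# Hypothetical numbers: the response scores well, but the policy assigns it
# far more probability than the reference does, so part of the reward is removed.
print(kl_penalized_reward(reward_score=1.0, policy_logp=-2.0, ref_logp=-5.0, beta=0.1))
```

A larger beta keeps the model closer to its starting behavior at the cost of slower reward improvement, which is one concrete form the calibration mentioned above can take.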

Is RLHF the same as Constitutional AI?

No. Constitutional AI (CAI) is Anthropic's variant of the approach. Instead of relying solely on human preference labels, CAI uses a set of written principles (a "constitution") to guide an AI critic model that generates the preference data, reducing reliance on human annotators for safety-related labeling.

Anthropic, 2022
