RLHF typically involves three phases applied on top of a pre-trained foundation model.
- Supervised fine-tuning (SFT): the model is fine-tuned on high-quality human demonstrations of the desired behavior.
- Reward modeling: human annotators compare pairs of responses, and a separate reward model is trained to predict which response they prefer.
- RL optimization: the language model is updated with PPO (Proximal Policy Optimization) or a similar algorithm to maximize the reward model's score.
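
To make the reward-modeling phase concrete, here is a minimal sketch of a pairwise preference loss. It assumes a Bradley-Terry style objective, which is a common choice but is not specified in the text above. The function name `pairwise_preference_loss` and the scalar inputs are hypothetical; in practice the rewards would come from a neural reward model scoring full responses.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: a Bradley-Terry style objective. The loss is small when the
    reward model scores the human-preferred response higher than the other one.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical scores for a preferred and a rejected response.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < disagrees)  # the loss is lower when the model agrees with the annotator
```

Minimizing this loss over many annotated pairs pushes the reward model to assign higher scores to preferred responses. Those scores are what the RL optimization phase then tries to maximize.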