
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

(2305.18290)
Published May 29, 2023 in cs.LG, cs.AI, and cs.CL

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Figure: DPO outperforms PPO in expected reward and summarization win rates, showing superior optimization quality.

Overview

  • The paper discusses the challenges of fine-tuning language models to human preferences and presents Direct Preference Optimization (DPO) as a solution.

  • DPO is a stable and computationally efficient algorithm that fine-tunes language models with a simple classification objective, raising the likelihood of responses humans prefer over those they do not.

  • The approach is grounded in the Bradley-Terry preference model and sidesteps reinforcement-learning pitfalls such as training instability, while accounting for the fact that many reward functions induce the same preference distribution.

  • Empirical tests show DPO performs on par with or better than existing methods in tasks such as sentiment modulation, summarization, and single-turn dialogue.

  • The study indicates potential for future DPO applications in larger models and other modalities, but suggests the need for more research on its robustness and evaluation accuracy.

Summary of the Direct Preference Optimization Algorithm

Introduction to Language Model Tuning Challenges

Large-scale unsupervised language models, trained on massive datasets, have exhibited impressive capabilities. However, precisely controlling their behavior remains difficult because the human-generated data they learn from reflects a wide range of goals and intentions, and the models often pick up undesirable behaviors along with the desirable ones. Current methods for fine-tuning these models to reflect human preferences rely on complex reinforcement learning (RL) pipelines that are computationally expensive and unstable. Recognizing the need for safer and more controllable AI systems, the authors ask whether explicit reward modeling and reinforcement learning can be bypassed altogether.

Direct Preference Optimization Approach

The paper introduces Direct Preference Optimization (DPO), a stable, performant, and computationally efficient algorithm for fine-tuning language models (LMs) to human preferences without reward modeling or reinforcement learning. DPO directly optimizes the policy to satisfy human preferences using a simple classification objective: it increases the log probability of preferred responses relative to dispreferred ones, with a dynamic per-example weight that prevents the model degeneration a naive probability-ratio objective would cause. DPO's alignment with human preferences is shown to be comparable or superior to established methods such as RLHF with proximal policy optimization (PPO), especially on sentiment modulation, summarization, and single-turn dialogue. A minimal sketch of the loss appears below.
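To make the objective concrete, here is a minimal sketch of the DPO loss in PyTorch. The function and argument names are illustrative rather than taken from the paper's released code, and the inputs are assumed to be per-sequence log probabilities summed over response tokens under the trainable policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Sketch of the DPO objective for a batch of preference pairs.

    Each tensor holds one summed log probability per (prompt, response)
    pair; `beta` scales the implicit KL penalty toward the reference model.
    """
    # Implicit rewards are beta times the policy/reference log-ratio.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification on the reward margin: push the preferred
    # response's implicit reward above the dispreferred one's.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean(), chosen_rewards.detach(), rejected_rewards.detach()
```

The gradient of this loss weights each preference pair by how strongly the implicit reward model misorders it, which is the dynamic per-example weight mentioned above.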

Theoretical Underpinnings and Advantages

DPO's framework builds on theoretical models of how well a reward function explains empirical preference data, most notably the Bradley-Terry model. Reparameterizing the reward in terms of the policy itself lets DPO avoid the instabilities of the actor-critic algorithms commonly used in RL-based fine-tuning. It also handles the fact that the reward is only identified up to an equivalence class: many reward functions induce the same preference distribution, and rewards in the same class induce the same optimal policy, so nothing is lost by optimizing the policy directly. Furthermore, DPO's simpler formulation avoids much of the hyperparameter tuning that RL-based methods require.
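For reference, the core quantities from the paper can be written out as follows: the Bradley-Terry model of pairwise preferences, the reparameterization of the reward in terms of the policy and a fixed reference model, and the resulting DPO objective.

```latex
% Bradley-Terry model: probability that y_w is preferred to y_l given prompt x
p^*(y_w \succ y_l \mid x) = \sigma\!\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr)

% DPO reparameterization of the reward in terms of the policy \pi_\theta and
% reference model \pi_{\mathrm{ref}}; Z(x) is a prompt-only partition function
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% The \beta \log Z(x) term cancels in the pairwise difference, giving the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here \(\sigma\) is the logistic function and \(\beta\) controls how far the policy may deviate from the reference model.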

Empirical Validation and Future Potential

Experiments with DPO show its effectiveness against alternatives such as PPO and unlikelihood training, with alignment to human preferences that is at least as good and often better. These results cover sentiment-controlled generation, summarization, and single-turn dialogue, where DPO-tuned models matched or exceeded strong baselines under GPT-4 evaluation. There is room to scale DPO beyond the models of up to 6 billion parameters tested here, and its simplicity opens avenues beyond fine-tuning LMs, such as training generative models in other modalities. While the results are promising, further research is needed on DPO's robustness, including how its policies behave out of distribution and how reward over-optimization manifests in the direct preference setting. The accuracy of automated evaluators such as GPT-4 is another avenue for continued investigation.
