
Direct Preference Optimization: Your Language Model is Secretly a Reward Model

(2305.18290)
Published May 29, 2023 in cs.LG, cs.AI, and cs.CL

Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

Figure: DPO outperforms PPO in expected reward and summarization win rates, showing superior optimization quality.

Overview

  • The paper discusses the challenges of fine-tuning language models to human preferences and presents Direct Preference Optimization (DPO) as a solution.

  • DPO is a stable and computationally efficient algorithm that fine-tunes language models with a simple classification objective, raising the likelihood of responses humans prefer over those they do not.

  • The approach is grounded in the Bradley-Terry preference model and sidesteps reinforcement-learning pitfalls such as training instability, while accounting for the fact that many reward functions induce the same preference distribution.

  • Empirical tests show DPO performs on par with or better than existing methods in tasks such as sentiment modulation, summarization, and single-turn dialogue.

  • The study indicates potential for future DPO applications in larger models and other modalities, but suggests the need for more research on its robustness and evaluation accuracy.

Summary of the Direct Preference Optimization Algorithm

Introduction to Language Model Tuning Challenges

Large-scale unsupervised language models, trained on massive datasets, have exhibited impressive capabilities. However, precisely controlling their behavior remains difficult because the human-generated data they learn from reflects a wide range of goals and intentions, and the models often pick up undesirable behaviors along with the desirable ones. Current methods for fine-tuning these models to reflect human preferences rely on complex reinforcement learning (RL) pipelines that are computationally expensive and unstable. Recognizing the need for safer and more controllable AI systems, the authors ask whether explicit reward modeling and reinforcement learning can be bypassed altogether.

Direct Preference Optimization Approach

The paper introduces Direct Preference Optimization (DPO), a stable, performant, and computationally efficient algorithm for fine-tuning language models (LMs) to human preferences without reward modeling or reinforcement learning. DPO directly optimizes the policy to satisfy human preferences using a simple classification objective: it increases the log probability of preferred responses relative to dispreferred ones, with a dynamic per-example weight that prevents the model degeneration a naive probability-ratio objective would cause. DPO's alignment with human preferences is shown to be comparable or superior to established methods such as RLHF with proximal policy optimization (PPO), especially on sentiment modulation, summarization, and single-turn dialogue. A minimal sketch of the loss appears below.
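To make the objective concrete, here is a minimal sketch of the DPO loss in PyTorch. The function and argument names are illustrative rather than taken from the paper's released code, and the inputs are assumed to be per-sequence log probabilities summed over response tokens under the trainable policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Sketch of the DPO objective for a batch of preference pairs.

    Each tensor holds one summed log probability per (prompt, response)
    pair; `beta` scales the implicit KL penalty toward the reference model.
    """
    # Implicit rewards are beta times the policy/reference log-ratio.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Binary classification on the reward margin: push the preferred
    # response's implicit reward above the dispreferred one's.
    losses = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return losses.mean(), chosen_rewards.detach(), rejected_rewards.detach()
```

The gradient of this loss weights each preference pair by how strongly the implicit reward model misorders it, which is the dynamic per-example weight mentioned above.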

Theoretical Underpinnings and Advantages

DPO's framework builds on theoretical models of how well a reward function explains empirical preference data, most notably the Bradley-Terry model. Reparameterizing the reward in terms of the policy itself lets DPO avoid the instabilities of the actor-critic algorithms commonly used in RL-based fine-tuning. It also handles the fact that the reward is only identified up to an equivalence class: many reward functions induce the same preference distribution, and rewards in the same class induce the same optimal policy, so nothing is lost by optimizing the policy directly. Furthermore, DPO's simpler formulation avoids much of the hyperparameter tuning that RL-based methods require.
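For reference, the core quantities from the paper can be written out as follows: the Bradley-Terry model of pairwise preferences, the reparameterization of the reward in terms of the policy and a fixed reference model, and the resulting DPO objective.

```latex
% Bradley-Terry model: probability that y_w is preferred to y_l given prompt x
p^*(y_w \succ y_l \mid x) = \sigma\!\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr)

% DPO reparameterization of the reward in terms of the policy \pi_\theta and
% reference model \pi_{\mathrm{ref}}; Z(x) is a prompt-only partition function
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% The \beta \log Z(x) term cancels in the pairwise difference, giving the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Here \(\sigma\) is the logistic function and \(\beta\) controls how far the policy may deviate from the reference model.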

Empirical Validation and Future Potential

Experiments with DPO show its effectiveness against alternatives such as PPO and unlikelihood training, with alignment to human preferences that is at least as good and often better. These results cover sentiment-controlled generation, summarization, and single-turn dialogue, where DPO-tuned models matched or exceeded strong baselines under GPT-4 evaluation. There is room to scale DPO beyond the models of up to 6 billion parameters tested here, and its simplicity opens avenues beyond fine-tuning LMs, such as training generative models in other modalities. While the results are promising, further research is needed on DPO's robustness, including how its policies behave out of distribution and how reward over-optimization manifests in the direct preference setting. The accuracy of automated evaluators such as GPT-4 is another avenue for continued investigation.
