Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

(2404.10719)
Published Apr 16, 2024 in cs.CL

Abstract

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align LLMs with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performances of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experiment results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions.

Figure: PPO performance on the APPS dataset with varying batch sizes, using CodeLlama-13B, across difficulty levels.

Overview

  • The study compares Direct Preference Optimization (DPO), a reward-free method, and Proximal Policy Optimization (PPO), a reward-based method, in aligning LLMs with human preferences, particularly through Reinforcement Learning from Human Feedback (RLHF).

  • It highlights the theoretical and empirical limitations of DPO, including its susceptibility to biased solutions that exploit out-of-distribution responses and its performance degradation under distribution shift.

  • Key factors for optimizing PPO's performance in RLHF are uncovered, namely advantage normalization, large batch size, and exponential moving average updates of the reference model, demonstrating PPO's superior efficacy in LLM alignment.

  • Extensive empirical benchmarks across various RLHF testbeds, including dialogue and code generation tasks, show PPO's superior performance, challenging the academic acclaim of DPO and suggesting a need for a reevaluation of current alignment strategies.

Introduction

The alignment of LLMs with human preferences is a pivotal arena in AI research, particularly through Reinforcement Learning from Human Feedback (RLHF) approaches. This study juxtaposes Direct Preference Optimization (DPO), a reward-free method, against Proximal Policy Optimization (PPO), a reward-based method, to evaluate their efficacy in aligning LLMs. Despite DPO's academic acclaim, we scrutinize its theoretical and empirical limitations and conduct a thorough analysis of PPO, uncovering key factors for optimizing its performance in RLHF. Moreover, our empirical benchmarks across diverse RLHF testbeds, including dialogue and code generation tasks, provide novel insights into the comparative advantages of PPO over DPO and other alignment methods.

Theoretical and Empirical Insights into DPO's Limitations

Our study reveals significant theoretical limitations of DPO, demonstrating its susceptibility to biased solutions that exploit out-of-distribution (OOD) responses. Because the DPO objective is evaluated only on the preference data, the learned policy can shift probability mass toward responses never seen during training, a fundamental challenge for ensuring alignment with human preferences whenever model outputs drift away from the preference dataset. Empirical analyses further show that DPO's performance degradation can be traced to such distribution shifts, underscoring the need to mitigate these disparities to improve alignment efficacy.
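
For reference, the objective under analysis is the standard DPO loss from the original DPO formulation (a sketch, not a reproduction of this paper's notation): here pi_theta is the trained policy, pi_ref the frozen reference policy, beta a temperature, y_w and y_l the chosen and rejected responses, and D the preference dataset.

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
      = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[
          \log\sigma\Big(
            \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
            -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
          \Big)
        \Big]

Since the expectation runs only over pairs in D, the loss never directly penalizes probability mass that pi_theta places on responses outside the preference data, which is the mechanism behind the OOD exploitation discussed above.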

Unveiling Key Factors for PPO's Efficacy in RLHF

Our exploration of PPO's algorithmic components uncovers three key factors instrumental in enhancing its performance for LLM alignment: advantage normalization, large batch size, and exponential moving average (EMA) updates of the reference model. Comprehensive ablation studies show that these factors contribute significantly to PPO's robustness and effectiveness. Large-batch training in particular emerges as pivotal for mitigating performance degradation, cementing PPO's edge in challenging RLHF applications such as code generation.
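
To make the first and third of these concrete, the snippet below sketches how advantage normalization and an EMA reference-model update might look inside a PyTorch RLHF training loop. The function names, the decay value, and the policy/ref_policy modules are illustrative assumptions for this sketch, not code from the paper.

    import torch

    def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        """Standardize advantage estimates across the (large) batch before the clipped PPO policy loss."""
        return (advantages - advantages.mean()) / (advantages.std() + eps)

    @torch.no_grad()
    def ema_update(ref_policy: torch.nn.Module, policy: torch.nn.Module, decay: float = 0.995) -> None:
        """Drift the reference policy toward the current policy via an exponential moving average."""
        for ref_param, param in zip(ref_policy.parameters(), policy.parameters()):
            ref_param.mul_(decay).add_(param, alpha=1.0 - decay)

In such a loop, normalize_advantages would be applied to each batch of advantage estimates before computing the policy loss, and ema_update would be called periodically so that the KL penalty is measured against a slowly moving reference rather than a fixed SFT model.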

Benchmarking DPO and PPO Across RLHF Testbeds

Our extensive experimental evaluations across various RLHF testbeds underscore PPO's superior performance in aligning LLMs in all cases, notably achieving state-of-the-art results in challenging code competitions. Contrary to its academic acclaim, DPO proves limited in practice, held back by the theoretical and empirical constraints identified above, particularly in demanding tasks that push the boundaries of model alignment. These findings call the purported supremacy of DPO in LLM alignment into question and prompt a reevaluation of alignment strategies within the research community.

Implications and Future Directions

The comprehensive scrutiny of DPO and PPO within this study not only challenges prevailing notions regarding LLM alignment methods but also opens new avenues for future research. The insights into DPO's limitations and the delineation of critical factors for enhancing PPO's performance offer a foundation for developing more robust and effective alignment strategies. As the AI field continues to progress, the lessons from this study could guide the refinement of RLHF methodologies, ensuring that LLMs are more finely tuned to human preferences and societal values.

The evolving landscape of LLM alignment necessitates ongoing theoretical and empirical investigations to iteratively refine and develop methodologies that ensure models serve the broader interests of humanity. This study represents a step forward in this journey, offering a critical evaluation of existing approaches and paving the way for future advancements in AI alignment research.
