Learn Your Reference Model for Real Good Alignment

(2404.09656)
Published Apr 15, 2024 in cs.LG and cs.CL

Abstract

The complexity of the alignment problem stems from the fact that existing methods are unstable. Researchers continuously invent various tricks to address this shortcoming. For instance, in the fundamental Reinforcement Learning From Human Feedback (RLHF) technique of Language Model alignment, in addition to reward maximization, the Kullback-Leibler divergence between the trainable policy and the SFT policy is minimized. This addition prevents the model from being overfitted to the Reward Model (RM) and generating texts that are out-of-domain for the RM. The Direct Preference Optimization (DPO) method reformulates the optimization task of RLHF and eliminates the Reward Model while tacitly maintaining the requirement for the policy to be close to the SFT policy. In our paper, we argue that this implicit limitation in the DPO method leads to sub-optimal results. We propose a new method called Trust Region DPO (TR-DPO), which updates the reference policy during training. With such a straightforward update, we demonstrate the effectiveness of TR-DPO against DPO on the Anthropic HH and TLDR datasets. We show that TR-DPO outperforms DPO by up to 19%, measured by automatic evaluation with GPT-4. The new alignment approach that we propose allows us to improve the quality of models across several parameters at once, such as coherence, correctness, level of detail, helpfulness, and harmlessness.
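For context, the abstract refers to two standard objectives. Written in the usual notation of the RLHF and DPO literature (not necessarily the paper's exact notation), with \(\pi_\theta\) the trainable policy, \(\pi_{\mathrm{ref}}\) the frozen SFT/reference policy, \(\beta\) the KL-penalty weight, and \(\sigma\) the logistic function, they are typically stated as:

\[
\max_{\pi_\theta}\;\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]
\]

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})\;=\;-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}\;-\;\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

In standard DPO, \(\pi_{\mathrm{ref}}\) is the frozen SFT model, which is the implicit closeness constraint the paper argues leads to sub-optimal results; TR-DPO replaces this frozen reference with one that is updated during training.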

Figure: Comparison of training methods: standard DPO and TR-DPO with soft and hard reference policy updates.

Overview

  • The paper introduces a novel method for Language Model (LM) alignment called Trust Region Direct Preference Optimization (TR-DPO), aiming to produce safer and more efficient models.

  • TR-DPO updates the reference policy iteratively during training, using soft and hard updates to balance output characteristics and flexibility, showing significant improvements over Direct Preference Optimization (DPO).

  • Experimental results on the Anthropic HH and TLDR datasets, across several sizes of the Pythia model family, showed TR-DPO's superior performance, especially in configurations with well-tuned update parameters \(\alpha\) and \(\tau\).

  • TR-DPO's dynamic reference policy updates offer a promising avenue for continuous refinement and adaptation in LM training, with potential applications across various AI research and development areas.

Enhancing Language Model Alignment with Trust Region Direct Preference Optimization

Introduction to TR-DPO

Language Model (LM) alignment remains a central concern in NLP, with the goal of producing safe, effective, and controllable models. This paper introduces Trust Region Direct Preference Optimization (TR-DPO), a novel approach to LM alignment that advances beyond conventional Direct Preference Optimization (DPO). By iteratively updating the reference policy during training, TR-DPO improves alignment quality across several metrics, including coherence, correctness, level of detail, helpfulness, and harmlessness.

Methodology Overview

The TR-DPO method is predicated on the idea that a static reference model limits the optimization potential of alignment techniques. The authors propose two strategies for updating the reference policy: soft updates, which blend the current policy weights into the reference policy, and hard updates, which periodically replace the reference policy with a copy of the current policy. These strategies are designed to balance adherence to the desired output characteristics with enough flexibility for the model to keep learning from new data. Theoretical connections to trust region optimization methods suggest that TR-DPO strikes this balance by controlling the magnitude and frequency of reference updates through the soft-update weight \(\alpha\) and the hard-update interval \(\tau\).
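To make the two update schemes concrete, here is a minimal sketch assuming a PyTorch-style setup in which the trainable policy and the reference policy are separate modules with identical architectures. The function names, the training-loop placement, and the hypothetical `dpo_loss` call are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def soft_update(policy, ref_policy, alpha: float):
    """Soft update: blend the reference weights toward the current policy,
    theta_ref <- alpha * theta + (1 - alpha) * theta_ref."""
    for p, p_ref in zip(policy.parameters(), ref_policy.parameters()):
        p_ref.mul_(1.0 - alpha).add_(p, alpha=alpha)

@torch.no_grad()
def hard_update(policy, ref_policy):
    """Hard update: replace the reference policy with a copy of the current policy."""
    ref_policy.load_state_dict(policy.state_dict())

# Sketch of where the updates sit in a DPO-style training loop
# (dpo_loss is a placeholder for the standard DPO loss w.r.t. the current reference):
#
# for step, batch in enumerate(loader):
#     loss = dpo_loss(policy, ref_policy, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     soft_update(policy, ref_policy, alpha=0.6)   # soft variant (update schedule is a design choice)
#     # or, for the hard variant:
#     # if (step + 1) % tau == 0:
#     #     hard_update(policy, ref_policy)
```

The soft rule is a Polyak-style weight average, so the reference trails the policy smoothly; the hard rule resets the trust region every \(\tau\) steps, which is the trade-off the paper studies through its choice of \(\alpha\) and \(\tau\).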

Experimental Design and Results

The efficacy of TR-DPO was evaluated on the Anthropic HH and TLDR datasets across several sizes of the Pythia model family. Results indicated that TR-DPO outperformed DPO, with an \(\alpha\) setting of 0.6 yielding up to a 19% improvement in model performance based on GPT-4 evaluations. Human-centric metrics further affirmed the superiority of TR-DPO, especially in configurations with optimized \(\alpha\) and \(\tau\) parameters. These findings were backed by statistical analysis and an examination of the trade-offs between alignment accuracy and generation diversity.

Implications and Future Directions

The introduction of TR-DPO brings forth significant implications for the future of LM alignment. By dynamically updating the reference policy, TR-DPO provides a more nuanced approach to model training, allowing for continuous refinement and adaptation based on new data. This method holds promise for enhancing the quality and safety of generative AI, with potential applications extending beyond text generation to other areas of AI research and development.

Moreover, the success of TR-DPO opens avenues for future exploration, including further refinement of update parameters, broader application across different types of LMs, and investigation into the impact of dynamic reference policy updates on long-term model stability and performance.

Conclusion

TR-DPO represents a substantial step forward in the alignment of LLMs, offering a method that not only improves upon existing DPO techniques but also introduces a flexible framework for continuous model improvement. By leveraging dynamic reference policies, TR-DPO facilitates the development of more coherent, correct, detailed, helpful, and harmless generative models, underscoring the critical importance of adaptability in achieving optimal AI alignment.
