
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

(2402.13228)
Published Feb 20, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of LLMs on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we also find that DPOP significantly outperforms DPO across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. By fine-tuning with DPOP, we create and release Smaug-34B and Smaug-72B, which achieve state-of-the-art open-source performance. Notably, Smaug-72B is nearly 2% better than any other open-source model on the HuggingFace Open LLM Leaderboard and becomes the first open-source LLM to surpass an average accuracy of 80%.

Figure: Comparison of DPO's failure and DPOP's success when training on low edit-distance pairs in MetaMath.

Overview

  • The paper introduces DPO-Positive (DPOP), a novel training methodology designed to address a failure mode in Direct Preference Optimization (DPO), improving the alignment of LLMs with human preferences.

  • DPOP incorporates a corrective penalty term to the loss function to ensure that a model's likelihood for preferred completions is not reduced, particularly in datasets with minimal edit distances between preference pairs.

  • Empirical results show that DPOP significantly outperforms the standard DPO approach across various datasets and tasks, leading to the development of the Smaug class of models, which set new benchmarks in open-source LLM performance.

  • The study highlights the importance of continually evolving preference optimization methodologies to mitigate emerging challenges and enhance the alignment of LLMs with human values, while also acknowledging potential limitations and societal impacts.

Enhancing Preference Optimization in LLMs with DPO-Positive

Introduction

The evolution of LLMs has underscored the importance of aligning these models with human preferences to ensure their fluency and effectiveness across tasks. Direct Preference Optimization (DPO) has emerged as a key technique for this, leveraging preferred and dispreferred data pairs to model the relative probability of one response over another. However, the study identifies a notable limitation of the standard DPO approach: a potential reduction in the model's likelihood of the preferred examples, particularly evident in datasets with small edit distances between pairs of completions. To address this, the authors introduce DPO-Positive (DPOP), a new loss function and training methodology designed to avoid this failure mode, which demonstrates significant improvements over DPO across diverse datasets and tasks.

Background and Related Work

The development of LLMs has been significantly aided by methods capable of integrating human-written completions or human-preferred completions to fine-tune models for enhanced performance on downstream tasks. Among these methods, reinforcement learning from human feedback (RLHF) and DPO are prominent. DPO, especially, has gained traction for its ability to directly optimize preferences without explicit reward function learning, focusing on maximizing the likelihood of preferred completions relative to dispreferred ones.
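To make the relative-probability objective concrete, the following is a minimal PyTorch-style sketch of the standard DPO loss computed from per-sequence log-probabilities; the function name, signature, and default beta are illustrative rather than taken from the paper's code.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Each argument is a tensor of summed token log-probabilities for the
        # preferred ("chosen") or dispreferred ("rejected") completion, under
        # either the policy being trained or the frozen reference model.
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # Only the relative margin between the two log-ratios enters the loss.
        return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

Because only the difference of log-ratios is rewarded, the loss can keep decreasing even when the absolute likelihood of both completions falls, which is the behaviour examined next.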

Failure Mode of DPO

A closer analysis of DPO reveals a critical oversight: the likelihood of the preferred examples can fall during training, especially when the preferred and dispreferred completions closely resemble each other textually. The study shows, both theoretically and empirically, that in datasets where the edit distance between preference pairs is small, standard DPO can inadvertently reduce the probability of the preferred completions, degrading model performance.
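A small numerical illustration (with made-up log-probabilities, not figures from the paper) shows how this can happen: after an update, both completions become less likely under the policy, yet the DPO margin, and hence the objective, still improves.

    # Hypothetical per-sequence log-probabilities, for illustration only.
    beta = 0.1
    ref_chosen, ref_rejected = -10.0, -10.5      # frozen reference model

    def dpo_margin(policy_chosen, policy_rejected):
        # The quantity inside DPO's log-sigmoid: beta times the difference of log-ratios.
        return beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))

    print(dpo_margin(-10.0, -10.5))   # 0.0 before the update
    print(dpo_margin(-12.0, -15.0))   # 0.25 after the update: the DPO loss drops,
                                      # even though the preferred completion's
                                      # log-likelihood fell from -10.0 to -12.0

Intuitively, and consistent with the paper's focus on low edit-distance pairs, this is most likely when the two completions share most of their tokens, so pushing down the dispreferred completion drags the preferred one down with it.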

Introducing DPO-Positive (DPOP)

To counteract the identified failure mode, DPOP adds a corrective penalty term to the loss function, ensuring that the model's likelihood of the preferred completions does not diminish. This change not only keeps the preferred completions' probability from collapsing but also improves performance across a broad spectrum of datasets, including those with large edit distances between completion pairs. The empirical results show DPOP consistently outperforming DPO, most notably in the Smaug class of models, which achieve state-of-the-art open-source results.
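The sketch below extends the DPO sketch above with such a penalty. The exact form used here, a hinge on how far the policy's log-likelihood of the preferred completion has fallen below the reference model's, scaled by a coefficient lam, and the default value of lam are illustrative assumptions rather than a reproduction of the paper's implementation.

    import torch
    import torch.nn.functional as F

    def dpop_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps,
                  beta=0.1, lam=50.0):
        chosen_logratio = policy_chosen_logps - ref_chosen_logps
        rejected_logratio = policy_rejected_logps - ref_rejected_logps
        # The penalty is zero while the policy keeps the preferred completion
        # at least as likely as under the reference model, and grows as that
        # likelihood drops below the reference model's.
        penalty = torch.clamp(ref_chosen_logps - policy_chosen_logps, min=0.0)
        logits = beta * (chosen_logratio - rejected_logratio - lam * penalty)
        return -F.logsigmoid(logits).mean()

When the penalty is zero this reduces to standard DPO, so the modification only intervenes in the regime where the preferred completion's likelihood would otherwise be pushed down.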

Contribution and Results

The paper's contributions are threefold: a theoretical and empirical analysis of the DPO failure mode, the formulation of DPOP as a more robust alternative, and the development of the Smaug class of models, which push the boundaries of open-source LLM performance. In particular, Smaug-72B scores nearly 2% higher than any other open-source model on the HuggingFace Open LLM Leaderboard and is the first open-source LLM to surpass an average accuracy of 80%.

Conclusion and Future Directions

While DPOP marks a significant stride toward refining preference optimization in LLMs, this work also acknowledges the limitations inherent in the scale and linguistic focus of tested datasets. The research paves the way for further explorations into preference-based LLM fine-tuning, stressing the potential for DPOP's application across a more diverse range of datasets, including non-English languages. The study's findings not only contribute to the ongoing development of more accurate and aligned LLMs but also highlight the importance of continual evaluation and adaptation of existing methodologies to address emerging challenges.

Limitations and Impact

Acknowledging the potential misuse of such advanced techniques and models for generating harmful content is crucial. Still, the focus on mathematical and reasoning tasks, together with a deeper understanding of preference optimization, points towards a positive societal impact. The Smaug models are released as a significant contribution to the AI research community, with their performance relative to existing open-source models taken into account to prioritize safety and responsible use.

This work stands as a testament to the dynamic nature of AI research, where the detection of methodological weaknesses becomes the foundation for innovation, driving the field towards the development of LLMs that are not only powerful but also closely aligned with human values and preferences.
