Robust Preference Optimization through Reward Model Distillation (2405.19316v2)
Abstract: Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, empirical evidence suggests that DPO typically assigns implicit rewards that overfit and trend toward infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and use distillation to obtain a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
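To make the distillation idea concrete, here is a minimal PyTorch sketch of one way the matching step described above could be instantiated: on each preference pair, the policy's implicit reward margin (the scaled log-likelihood ratio to the reference model) is regressed onto the margin given by an explicit reward model. The squared-error form, the `beta` temperature, and all function and argument names are illustrative assumptions based on the abstract, not the paper's exact objective.

```python
# Illustrative sketch (assumed form, not the paper's exact loss): distill an
# explicit reward model into the policy by matching the DPO-style implicit
# reward margin to the explicit reward margin on each preference pair.
import torch


def reward_distillation_loss(
    policy_logp_w, policy_logp_l,   # log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_w, ref_logp_l,         # log pi_ref(y_w|x),   log pi_ref(y_l|x)
    rm_score_w, rm_score_l,         # explicit reward model scores r_phi(x, y)
    beta: float = 0.1,              # scaling of the implicit reward (assumed)
):
    # Implicit reward margin induced by the policy:
    #   beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    implicit_margin = beta * ((policy_logp_w - ref_logp_w)
                              - (policy_logp_l - ref_logp_l))
    # Target margin from the explicit reward model.
    target_margin = rm_score_w - rm_score_l
    # One natural distillation loss: squared error between the two margins.
    return ((implicit_margin - target_margin) ** 2).mean()


# Toy usage on a batch of 3 preference pairs (random scores for illustration):
loss = reward_distillation_loss(
    torch.randn(3), torch.randn(3),
    torch.randn(3), torch.randn(3),
    torch.randn(3), torch.randn(3),
)
```

A cross-entropy distillation between the Bradley-Terry preference probabilities implied by the two margins would be another natural instantiation of the same matching idea in place of the squared error.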