Robust Preference Optimization through Reward Model Distillation

(2405.19316)
Published May 29, 2024 in cs.LG and cs.CL

Abstract

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, typical preference datasets have only a single annotation, or at most a few, per preference pair, which causes DPO to overconfidently assign rewards that trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM to produce probabilities that match the distribution induced by a reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.

Overview

  • The paper addresses limitations of Direct Preference Optimization (DPO) in language model alignment and proposes an alternative method combining explicit reward modeling with distillation to enhance robustness.

  • The methodology involves using knowledge distillation to align language model outputs with a reward model distribution, and introduces a 'pessimistic' extension to handle worst-case performance scenarios.

  • Theoretical and empirical analysis demonstrates that the proposed methods reduce overfitting and improve alignment performance, particularly in tasks with biased preference data.

Robust Preference Optimization Through Reward Model Distillation

The paper "Robust Preference Optimization through Reward Model Distillation," authored by Adam Fisch, Jacob Eisenstein, Vicky Zayats, Alekh Agarwal, Ahmad Beirami, Chirag Nagpal, Pete Shaw, and Jonathan Berant, explores the landscape of language model alignment methods. It specifically addresses the limitations of Direct Preference Optimization (DPO), proposing an alternative methodology that couples explicit reward modeling with a distillation approach, aiming for improved robustness in language model policies.

Background and Motivation

Language model (LM) post-training, or alignment, typically relies on reward learning from human feedback in the form of preference annotations. Traditional pipelines optimize these models with reinforcement learning from human feedback (RLHF). DPO, although popular as an efficient offline method, tends to suffer from overconfidence and misalignment driven by idiosyncrasies in the preference data. The paper argues that explicit reward modeling remains valuable even in offline settings and proposes a method that merges the advantages of direct preference optimization and reward model distillation. The objective is to mitigate issues arising from distribution shift in preference annotations while maintaining a simple, supervised training framework.

Methodology

The proposed method builds on classical knowledge distillation, reformulating alignment as matching the preference distribution implied by the language model to the distribution induced by a reward model that has itself been trained on the preference data. Training the LM to produce probabilities consistent with this reward model distribution acts as a regularizer and directly addresses DPO's overconfidence, as sketched below.
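Concretely, the distillation target can be written in terms of pairwise preference probabilities. The formulation below is a schematic sketch based on the standard Bradley-Terry model and the DPO reparameterization; the notation (a trained reward model r_phi, policy pi_theta, reference pi_ref, KL temperature beta) is illustrative, and the paper's exact loss may use a different divergence between the two distributions.

```latex
% Preference distribution induced by a trained reward model r_phi (Bradley-Terry):
\[
p_\varphi(y_1 \succ y_2 \mid x) = \sigma\big(r_\varphi(x, y_1) - r_\varphi(x, y_2)\big)
\]
% Preference distribution implied by the policy via the DPO reparameterization:
\[
p_\theta(y_1 \succ y_2 \mid x) = \sigma\!\Big(\beta \log \tfrac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}
  - \beta \log \tfrac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\Big)
\]
% A distillation objective matches the two distributions over sampled pairs:
\[
\mathcal{L}_{\mathrm{distill}}(\theta) = \mathbb{E}_{x,\, y_1,\, y_2}\,
  \mathrm{KL}\big(p_\varphi(\cdot \mid x, y_1, y_2)\,\|\,p_\theta(\cdot \mid x, y_1, y_2)\big)
\]
```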

Reward Model Distillation

The core of the approach is distillation from a reward model under uncertainty, where the reward model serves as a proxy for the true preference distribution. The paper presents theoretical underpinnings relating this distillation objective to KL-regularized optimization in classical RLHF, showing that when the preference data provides sufficiently broad coverage of generation pairs, optimizing the distillation loss recovers the same optimum as the corresponding RLHF objective.
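To make the distillation step concrete, here is a minimal PyTorch-style sketch of a per-pair loss that regresses the policy's beta-scaled implied reward margin onto the reward model's margin. The function name, arguments, and the squared-error form are assumptions for illustration, not the paper's exact implementation (a cross-entropy form on the induced preference probabilities would be an equally natural choice).

```python
import torch
import torch.nn.functional as F

def distillation_loss(policy_logp_1, policy_logp_2,
                      ref_logp_1, ref_logp_2,
                      rm_reward_1, rm_reward_2,
                      beta: float = 0.1) -> torch.Tensor:
    """Per-pair reward-model distillation loss (illustrative sketch).

    policy_logp_k: log pi_theta(y_k | x), summed over response tokens
    ref_logp_k:    log pi_ref(y_k | x), summed over response tokens
    rm_reward_k:   scalar reward r_phi(x, y_k) from the trained reward model
    """
    # Implied reward margin of the policy (the DPO reparameterization).
    policy_margin = beta * ((policy_logp_1 - ref_logp_1)
                            - (policy_logp_2 - ref_logp_2))
    # Target margin given by the reward model being distilled.
    target_margin = rm_reward_1 - rm_reward_2
    # Regress the policy's margin onto the reward model's margin.
    return F.mse_loss(policy_margin, target_margin)
```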

Pessimistic Distillation

To further improve robustness, the paper introduces a "pessimistic" extension of reward model distillation. Rather than trusting a single point estimate of the reward, the policy is optimized against the worst case over a family of reward models that are all plausible given the preference data. The technique is inspired by conservative offline RL and retains KL-divergence regularization toward the reference policy, so the resulting policy remains stable even when the preference annotations are biased or noisy.
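One straightforward way to operationalize this worst-case idea (not necessarily the paper's exact objective) is to fit an ensemble of reward models to the preference data and train the policy against the least favorable member for each pair. The hypothetical helper below extends the per-pair sketch above under that simplifying assumption.

```python
import torch

def pessimistic_distillation_loss(policy_margin: torch.Tensor,
                                  rm_margins: torch.Tensor) -> torch.Tensor:
    """Worst-case distillation loss over a family of reward models (sketch).

    policy_margin: scalar tensor, the beta-scaled implied reward margin of the
                   policy for a pair (y1, y2), as in the previous sketch.
    rm_margins:    shape [K] tensor of margins r_k(x, y1) - r_k(x, y2) from K
                   reward models fit to (e.g., resampled) preference data.
    """
    # Squared error against every member of the reward-model family ...
    per_model_loss = (policy_margin - rm_margins) ** 2
    # ... then train against the least favorable (maximum-loss) member, so the
    # policy must stay consistent with every plausible reward model.
    return per_model_loss.max()
```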

Theoretical and Empirical Analysis

The paper analyzes the degenerative tendencies of DPO through theoretical constructions. It shows that DPO can heavily overfit the training preference data, sometimes producing degenerate policies that assign vanishing probability even to the preferred responses seen during training. The proposed distillation and pessimistic distillation methods mitigate this overfitting, yielding more reliable and robust policies.
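A simplified version of the argument, written with the standard DPO loss, shows where the degeneracy comes from:

```latex
% DPO loss for a pair (x, y_w, y_l) with y_w annotated as preferred:
\[
\ell_{\mathrm{DPO}}(\theta) = -\log \sigma\!\Big(
  \underbrace{\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
            - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}}_{\text{margin } m_\theta(x)}
\Big)
\]
% With a single annotation per pair, the empirical preference probability is 1,
% so the loss is strictly decreasing in the margin and approaches 0 only as the
% margin grows without bound. The margin constrains only the *ratio* of the two
% probabilities: it can diverge while probability mass leaks to responses never
% seen in training, so even pi_theta(y_w | x) can shrink toward zero.
```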

Empirical results support these theoretical insights. Comparisons are drawn on the TL;DR summarization task, where the preference data contains a spurious correlation between summary length and preference labels. The distilled and pessimistic methods show significant improvements in alignment performance over DPO and Identity Preference Optimization (IPO), particularly when the preference data is biased.

Implications and Future Directions

The research presented in this paper has significant practical and theoretical implications. Practically, it suggests that the overarching strategy for LM alignment should not solely rely on direct optimization from preference data but should incorporate robust distillation techniques that account for uncertainty and potential biases in the data. Theoretically, it opens paths for future research into combining offline and online methods for more effective and efficient LM post-training.

The approach has shown promise in yielding robust policies that outperform traditional DPO and IPO, particularly in scenarios involving biased preference data. Future research could explore broader application domains, evaluate other forms of distribution shift, and refine the trade-off between reward model fidelity and computational efficiency.

In conclusion, the exploration of reward model distillation—both regular and pessimistic—marks a significant stride in the robustness and efficacy of language model alignment. By combining the benefits of explicit reward models with the simplicity and efficiency of direct preference optimization, this approach promises a more stable path toward robust AI systems.
