
Abstract

Reinforcement learning from human feedback (RLHF) aligns LLMs with human preferences. However, these preferences can often change over time due to external factors (e.g. environment change and societal influence). Consequently, what was wrong then might be right now. Current preference optimization algorithms do not account for temporal preference drift in their modeling, which can lead to severe misalignment. To address this limitation, we use a Dynamic Bradley-Terry model that models preferences via time-dependent reward functions, and propose Non-Stationary Direct Preference Optimisation (NS-DPO). By introducing a discount parameter in the loss function, NS-DPO applies exponential weighting, which proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in the offline setting, providing upper bounds on the estimation error caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs in scenarios with drifting preferences. By simulating preference drift using renowned reward models and modifying popular LLM datasets accordingly, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.

Figure: Reward model shifts and performance comparison between NS-DPO and stationary DPO under various training conditions.

Overview

  • The paper 'Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift' introduces the Non-Stationary Direct Preference Optimization (NS-DPO) method to address the issue of evolving human preferences over time in RLHF algorithms.

  • NS-DPO leverages the Dynamic Bradley-Terry (DBT) model, which allows preference probabilities to vary over time, and incorporates an exponentially weighted loss function that prioritizes recent data, handling temporal preference drift effectively.

  • The efficacy of NS-DPO is demonstrated through synthetic experiments and tests on LLM datasets, showing superior performance over stationary methods. Theoretical analyses provide performance guarantees, extending research into non-stationary environments.

Overview of Non-Stationary Direct Preference Optimization under Preference Drift

Introduction

The paper "Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift" addresses a critical limitation in current Reinforcement Learning from Human Feedback (RLHF) algorithms: the assumption that human preferences remain stationary over time. This assumption does not hold in real-world scenarios where preferences evolve due to environmental changes, societal influences, or other external factors. To bridge this gap, the authors propose the Non-Stationary Direct Preference Optimization (NS-DPO) method, which incorporates a Dynamic Bradley-Terry (DBT) model to accommodate time-dependent preferences.

Methodology

Dynamic Bradley-Terry Model

The core innovation of NS-DPO is the application of the Dynamic Bradley-Terry model to preference optimization problems. In contrast to the stationary Bradley-Terry model, where the probability of preference between two responses is time-invariant, the DBT model allows these probabilities to vary over time. The DBT model uses time-dependent reward functions to capture this variation, formulated as:

\[ p(a_i \succ a_i' \mid x_i, t_i) = \sigma\big(r(x_i, a_i, t_i) - r(x_i, a_i', t_i)\big). \]

Here, the reward function \( r(x, a, t) \) changes over time \( t \). NS-DPO introduces an exponentially weighted loss function with a discount parameter \( \gamma \) to prioritize more recent data, addressing issues of temporal preference drift.
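
To make the time dependence concrete, here is a minimal sketch in PyTorch of how a DBT preference probability can be evaluated at different time steps. The toy reward function and its field names (base_score, drift_sensitivity) are illustrative assumptions, not taken from the paper:

```python
import torch

def dbt_preference_prob(reward_fn, x, a, a_prime, t):
    """p(a > a' | x, t) under the Dynamic Bradley-Terry model:
    a sigmoid of the reward margin, where the reward may change with t."""
    margin = reward_fn(x, a, t) - reward_fn(x, a_prime, t)
    return torch.sigmoid(margin)

# Hypothetical toy reward whose ranking of the two answers drifts over time.
def toy_reward(x, a, t):
    drift = torch.sin(torch.tensor(float(t)) / 10.0)  # slow, smooth drift
    return a["base_score"] + drift * a["drift_sensitivity"]

a = {"base_score": torch.tensor(1.0), "drift_sensitivity": torch.tensor(2.0)}
a_prime = {"base_score": torch.tensor(1.5), "drift_sensitivity": torch.tensor(-2.0)}
for t in (0, 15, 30):
    p = dbt_preference_prob(toy_reward, x=None, a=a, a_prime=a_prime, t=t)
    print(f"t={t:2d}  p(a > a' | x, t) = {p.item():.3f}")
```

Running this toy example shows the preference between the two responses flipping as \( t \) advances, which is exactly the kind of drift a stationary Bradley-Terry model cannot represent.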

Non-Stationary Direct Preference Optimization (NS-DPO)

NS-DPO modifies the Direct Preference Optimization (DPO) framework by incorporating time-varying rewards. The implicit reward function is defined as:

\[ r(x, a, T) = \tau \log \frac{\pi_{\theta_T}(a \mid x)}{\pi_{\text{ref}}(a \mid x)} + \tau \log Z_{\theta_T}(x), \]

where \( \pi_{\theta_T} \) is the policy at time \( T \), and \( Z_{\theta_T}(x) \) is a normalization constant. The NS-DPO loss function is then formulated as:

\[ \mathcal{L}_{\text{NS-DPO}}(\theta_T) = \sum_{(x_i, a_i, a_i', t_i) \sim \mathcal{D}} -\gamma^{T - t_i - 1} \log \sigma \left( \tau \log \frac{\pi_{\theta_T}(a_i \mid x_i)}{\pi_{\text{ref}}(a_i \mid x_i)} - \tau \log \frac{\pi_{\theta_T}(a_i' \mid x_i)}{\pi_{\text{ref}}(a_i' \mid x_i)} \right). \]

This loss function down-weights older data, thus allowing the model to learn more aggressively from recent data where preferences are most relevant.
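
As an illustration, here is a minimal sketch of this weighted loss in PyTorch. It assumes per-sequence log-probabilities have already been computed for the policy and the frozen reference model, and the argument names (tau, gamma, timestamps, current_time) are illustrative rather than taken from the paper's codebase:

```python
import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                timestamps, current_time, tau=0.1, gamma=0.95):
    """Exponentially weighted DPO loss over a batch of preference datapoints.

    Each *_logps input is a 1-D tensor of summed log-probabilities for the
    chosen (preferred) or rejected response; `timestamps` holds the collection
    time t_i of each datapoint and `current_time` is the training time T.
    """
    # Implicit reward margin in log-ratio form, as in standard DPO.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = tau * (chosen_logratio - rejected_logratio)

    # Exponential discounting: older datapoints (smaller t_i) receive
    # geometrically smaller weights gamma^(T - t_i - 1).
    weights = gamma ** (current_time - timestamps - 1).float()

    # Weighted negative log-likelihood of the observed preferences.
    return -(weights * F.logsigmoid(logits)).sum()
```

Setting gamma = 1 recovers the standard stationary DPO objective, since every datapoint then receives equal weight.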

Theoretical Analysis

The paper provides theoretical guarantees on the performance of NS-DPO. The authors prove an upper bound on the expected regret of NS-DPO for log-linear policies, showing that the regret scales as \( O\big(d\, B_T^{1/2} n^{-1/4}\big) \). This bound is comparable to state-of-the-art results in stationary settings while extending to non-stationary environments.

Empirical Validation

The efficacy of NS-DPO is demonstrated through comprehensive experiments:

Synthetic Experiments:

  • NS-DPO significantly outperforms stationary DPO and other baselines in settings with controlled preference drifts.
  • The performance of NS-DPO is robust across a range of discount parameters \( \gamma \), highlighting its adaptability.

Large Language Model (LLM) Experiments:

  • NS-DPO was tested on datasets with simulated preference drift using well-known reward models like PairRM and ArmoRM.
  • NS-DPO consistently outperformed stationary methods, whether preferences changed abruptly or gradually.

Implications and Future Work

The introduction of NS-DPO has several important implications:

  • Practical Impact: Practitioners can now fine-tune LLMs in environments where preferences are expected to drift over time. This is particularly useful for applications in social media, content recommendation, and personalized AI systems.
  • Theoretical Contribution: The theoretical framework for handling preference drift in RLHF provides a new direction for future research. This can be extended to other non-stationary settings, such as online learning and continuous deployment scenarios.
  • Future Developments: The approach can be adapted to more complex models and could incorporate advanced techniques like meta-learning to optimize the discount parameter dynamically.

Conclusion

The paper provides a robust solution to the problem of non-stationary preferences in RLHF. By leveraging the Dynamic Bradley-Terry model and introducing the NS-DPO framework, the authors offer a method that not only aligns models with current human preferences but also remains adaptive to changes over time. This work promises to significantly enhance the reliability and applicability of reinforcement learning systems in dynamic environments.
