
Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF

(2401.16335)
Published Jan 29, 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique that aligns language models closely with human-centric values. The initial phase of RLHF involves learning human values using a reward model from ranking data. It is observed that the performance of the reward model degrades after one epoch of training, and optimizing too much against the learned reward model eventually hinders the true objective. This paper explores these issues, leveraging theoretical insights to design an improved reward learning algorithm termed 'Iterative Data Smoothing' (IDS). The core idea is that during each training epoch, we not only update the model with the data, but also update the data using the model, replacing hard labels with soft labels. Our empirical findings highlight the superior performance of this approach over traditional methods.

Figure: The problem of incorrect reward learning from few samples and its correction by the IDS algorithm.
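
As background for the reward-learning phase summarized above, the block below states the Bradley-Terry style formulation commonly used to fit a reward model from ranking data; the notation is ours, and the paper's exact setup may differ in detail.

```latex
% Standard Bradley-Terry preference model for reward learning (notation ours).
% Given a prompt x and responses y_1, y_2, the probability that y_1 is preferred is
\[
  \mathbb{P}(y_1 \succ y_2 \mid x) = \sigma\!\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
\]
% The reward model r_\theta is fit by minimizing the cross-entropy (negative
% log-likelihood) over observed comparisons, with hard labels marking the
% human-preferred response y^w over the rejected response y^l:
\[
  \mathcal{L}(\theta) = -\sum_{i} \log \sigma\!\big(r_\theta(x_i, y_i^{w}) - r_\theta(x_i, y_i^{l})\big).
\]
```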

Overview

  • RLHF is a method for aligning LLMs with human values but faces issues like reward overfitting and overoptimization.

  • Reward overfitting deteriorates model performance due to the inadequacy of the cross-entropy loss on long-tailed preference datasets.

  • Reward overoptimization happens when a policy model, trained to maximize the learned reward, diverges from the true objective, causing reduced performance.

  • Iterative Data Smoothing (IDS) is proposed to mitigate these issues by smoothing infrequent observations and emphasizing common pairs.

  • IDS improves upon previous solutions by addressing both overfitting and overoptimization and shows efficacy in both bandit and neural network settings.

Introduction

Reinforcement Learning from Human Feedback (RLHF) has become an increasingly prominent method for aligning LLMs with human-centric values and preferences. Despite significant empirical successes across various applications, the RLHF paradigm frequently encounters issues like reward overfitting and reward overoptimization. These phenomena not only impede the stability and reliability of LLM deployment but also raise concerns about the scalability of RLHF.

Understanding Reward Overfitting and Overoptimization

The paper provides insight into two major challenges within RLHF. Reward overfitting emerges when a model's performance on the reward learning task deteriorates rapidly after only a single epoch of training. The degradation is partly due to the inadequacy of the cross-entropy loss on long-tailed preference datasets. Even simple 3-armed bandit problems demonstrate significant overfitting and overoptimization under these conditions. The root issue is that the empirical cross-entropy loss minimizer can underrepresent rarely compared items in the dataset, leading to extreme and inaccurate reward estimates.
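
To make the long-tail failure mode concrete, here is a minimal sketch (not the paper's code) of fitting rewards by cross-entropy minimization on a 3-armed bandit where one arm appears in only two comparisons; the data-generating choices, function names, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# True rewards for a 3-armed bandit; preferences follow the Bradley-Terry model.
true_r = np.array([0.0, 0.1, 0.2])

def sample_winner(i, j):
    """Return (winner, loser) for a comparison of arms i and j under Bradley-Terry."""
    p_i = 1.0 / (1.0 + np.exp(-(true_r[i] - true_r[j])))
    return (i, j) if rng.random() < p_i else (j, i)

# Long-tailed data: arms 0 and 1 are compared 500 times, arm 2 only twice,
# and (to illustrate the failure mode) arm 2 loses both of its comparisons.
data = [sample_winner(0, 1) for _ in range(500)] + [(0, 2), (0, 2)]

def fit_mle(data, n_arms=3, lr=0.5, epochs=2000):
    """Gradient descent on the empirical cross-entropy loss over comparisons."""
    r = np.zeros(n_arms)
    for _ in range(epochs):
        grad = np.zeros(n_arms)
        for winner, loser in data:
            p_win = 1.0 / (1.0 + np.exp(-(r[winner] - r[loser])))
            grad[winner] -= 1.0 - p_win   # d(-log sigmoid(r_w - r_l)) / dr_w
            grad[loser] += 1.0 - p_win
        r -= lr * grad / len(data)
    return r - r.mean()  # rewards are identifiable only up to an additive constant

print(fit_mle(data))
# Arm 2, seen only twice and never winning, is pushed toward an extreme negative
# estimate even though its true reward is the highest; training longer drives it
# further down, since its maximum-likelihood value has no finite minimizer.
```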

The other challenge, reward overoptimization, occurs in the policy learning stage. Typically, when the policy model is trained to maximize the learned reward, the ground-truth reward may initially increase but subsequently decrease as training continues. This phenomenon is notably observed when the policy diverges significantly from its original state in terms of KL divergence, which inadvertently steers the policy away from the true objective it is supposed to maximize.
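
For reference, the policy-learning stage is typically framed with the KL-regularized objective below; this is the common formulation in the RLHF literature (notation ours), and the paper's precise objective may differ.

```latex
% KL-regularized RLHF policy objective (common formulation in the literature; notation ours).
% The policy \pi is optimized against the learned reward \hat{r}, with a KL penalty
% toward the reference (initial) policy \pi_{\mathrm{ref}}:
\[
  \max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D}}\!\left[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[\hat{r}(x, y)\big]
    - \beta \, \mathrm{KL}\!\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
  \right].
\]
% Overoptimization: as the policy drifts far from \pi_{\mathrm{ref}} in KL, the proxy
% reward \hat{r} keeps improving while the ground-truth reward eventually declines.
```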

Iterative Data Smoothing as a Solution

To mitigate these concerns, a new algorithm, dubbed Iterative Data Smoothing (IDS), is proposed, taking inspiration from the pessimism mechanism found in bandit learning. IDS alternates between updating the model with the data and updating the data through soft labels, effectively smoothing the influence of infrequent observations. The mechanism discourages over-emphasis on sporadically seen samples and concentrates learning on the more commonly observed pairs. It combines the advantages of soft labels and iterative learning, where the data and model inform each other through successive training epochs.
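
The following is a minimal sketch of that alternating scheme in the same 3-armed bandit setting as the earlier snippet. The specific convex-combination label update and all hyperparameter names are our illustrative assumptions, not a verbatim transcription of the paper's algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def iterative_data_smoothing(pairs, n_arms, lr_model=0.5, lr_label=0.3, epochs=2000):
    """Sketch of IDS on pairwise comparisons (i, j) where arm i was preferred.

    Each epoch: (1) take a gradient step on the cross-entropy loss using the
    current *soft* labels, then (2) move each label toward the model's own
    predicted preference probability.
    """
    r = np.zeros(n_arms)            # reward estimates
    soft = np.ones(len(pairs))      # soft labels, initialized at the hard label 1.0
    for _ in range(epochs):
        # (1) Model update with the current soft labels.
        grad = np.zeros(n_arms)
        for k, (i, j) in enumerate(pairs):
            p = sigmoid(r[i] - r[j])          # predicted prob. that i beats j
            # Gradient of the soft-label cross-entropy -[y log p + (1-y) log(1-p)].
            grad[i] -= soft[k] - p
            grad[j] += soft[k] - p
        r -= lr_model * grad / len(pairs)
        # (2) Data update: pull each label toward the model's prediction.
        for k, (i, j) in enumerate(pairs):
            soft[k] = (1.0 - lr_label) * soft[k] + lr_label * sigmoid(r[i] - r[j])
    return r - r.mean()

# Usage: pass (winner, loser) index pairs, e.g. the long-tailed data built earlier.
# estimates = iterative_data_smoothing(data, n_arms=3)
```

Run on the long-tailed data from the earlier sketch, the intended behavior is that the two comparisons involving the rarely seen arm are smoothed toward the model's own predictions, so that arm is no longer pushed to an extreme estimate, while the heavily observed pair still determines the relative reward of the other two arms.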

Theoretical analysis reveals that IDS, unlike approaches such as lower-confidence-bound-based algorithms, effectively learns the ground-truth distribution for comparisons that garner sufficient observations while disregarding those infrequently seen. Experimental evidence confirms the algorithm's efficacy in both bandit and neural network settings.

Related Work and Future Directions

IDS builds upon an existing body of work in RLHF, Preference-based Reinforcement Learning, knowledge distillation, and ranking estimation from pairwise comparisons. Prior studies have highlighted similar challenges and proposed various solutions, yet few have offered a strategy that systematically addresses both overfitting and overoptimization in a single framework.

Going forward, it will be important to extend the IDS methodology to multi-armed bandits with more complex comparison scenarios and to integrate it further into neural network-based reward models. Exploring new ways to refine the algorithm and its practical implementation will enhance our understanding and its application in broader RLHF tasks. Further theoretical analyses, particularly of the IDS algorithm's long-term convergence properties, are essential steps to solidify its place in the RLHF toolkit.
