
Dataset Reset Policy Optimization for RLHF

(2404.08495)
Published Apr 12, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Reinforcement Learning (RL) from Human Preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude 3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that an offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win-rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.

Figure: DR-PO attains a superior tradeoff, learning higher rewards at lower KL than the baselines.

Overview

  • This paper introduces Dataset Reset Policy Optimization (DR-PO), an RLHF algorithm that leverages dataset resets for more efficient online learning.

  • DR-PO resets the learning agent to informative states from the offline preference dataset, improving exploration efficiency in settings such as text generation.

  • Theoretical analysis shows that DR-PO learns a policy at least as good as any policy covered by the offline data, under general function approximation and with finite sample complexity.

  • Empirical evaluations on RLHF benchmarks, including TL;DR summarization and the Anthropic HH dataset, show that DR-PO outperforms conventional methods such as PPO and DPO under GPT-4 win-rate.

Reinforcement Learning from Human Feedback with Dataset Reset Policy Optimization

Introduction

Reinforcement Learning from Human Feedback (RLHF) has emerged as a potent strategy for training generative models in scenarios where crafting an explicit reward function proves challenging. Utilizing human-labeled preference data, researchers have successfully trained large-scale models across diverse domains. Despite its successes, conventional RLHF protocols separate the processes of reward model learning and policy optimization, potentially overlooking the wealth of information embedded in the offline preference dataset during online policy training. This paper introduces an innovative RLHF algorithm, Dataset Reset Policy Optimization (DR-PO), leveraging dataset resets to enhance online learning significantly.

Dataset Reset Policy Optimization (DR-PO)

DR-PO capitalizes on the ability to reset to informative states within an offline dataset, enabling more efficient policy optimization. By resetting the learning agent directly to states from this dataset instead of initiating from the traditional starting state distribution, DR-PO increases exploration efficiency. This mechanism particularly benefits scenarios such as text generation in LLMs, where resets correspond to initiating generation from partial sentence states. Theoretical analysis confirms that DR-PO can match or surpass the performance of any policy covered by the offline data, offering a significant leap in efficiency and effectiveness within the RLHF paradigm.
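To make the reset mechanism concrete, the sketch below shows one way a rollout start could be chosen in a text-generation setting: with some probability the rollout begins from the prompt plus a random prefix of a labeler-preferred response, rather than from the prompt alone. This is a minimal illustration under assumed interfaces (the sample_rollout_start helper and the reset_prob mixing parameter are hypothetical), not the authors' implementation.

```python
import random

def sample_rollout_start(prompt, preferred_response, reset_prob=0.5):
    """Pick where a rollout begins: either the bare prompt (the usual initial
    state) or the prompt plus a random prefix of a labeler-preferred response
    (a dataset reset). Hypothetical helper, not the authors' exact API."""
    if preferred_response and random.random() < reset_prob:
        tokens = preferred_response.split()
        cut = random.randint(1, len(tokens))           # reset point
        return prompt + " " + " ".join(tokens[:cut])   # partial-sentence state
    return prompt

# The policy would continue generating from whichever start state is returned.
prompt = "Summarize: The city council voted to expand the bike-lane network ..."
preferred = "The council approved a major expansion of downtown bike lanes."
print(sample_rollout_start(prompt, preferred))
```

In this view, the offline dataset supplies high-quality intermediate states for free, so the policy optimizer spends less effort rediscovering them through exploration.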

Theoretical Guarantees

DR-PO is as simple to implement as standard policy optimization methods while also setting a new theoretical benchmark for RLHF. Under general function approximation, DR-PO is guaranteed, with finite sample complexity, to learn a policy at least as good as any policy covered by the offline preference dataset. The analysis also applies in computationally tractable settings, requiring only standard learning oracles such as Maximum Likelihood Estimation (MLE) for reward model fitting. DR-PO thus represents a notable theoretical advance in the RLHF domain.
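As a point of reference for the MLE oracle mentioned above, here is a minimal sketch of the standard Bradley-Terry maximum-likelihood objective commonly used to fit a reward model from pairwise preferences. The reward_model below is a toy stand-in over fixed-size features, purely for illustration; the reward model in the paper operates on token sequences.

```python
import torch
import torch.nn.functional as F

def bradley_terry_nll(reward_model, chosen, rejected):
    """Standard MLE objective for pairwise preferences: maximize
    log sigmoid(r(chosen) - r(rejected)) over the labeled pairs."""
    r_chosen = reward_model(chosen)       # shape: (batch,)
    r_rejected = reward_model(rejected)   # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy stand-in: a linear reward over 8-dimensional response features.
reward_model = torch.nn.Sequential(torch.nn.Linear(8, 1), torch.nn.Flatten(0))
chosen = torch.randn(4, 8)     # features of preferred responses
rejected = torch.randn(4, 8)   # features of rejected responses
loss = bradley_terry_nll(reward_model, chosen, rejected)
loss.backward()                # one MLE gradient step would follow
```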

Empirical Demonstrations

The paper evaluates DR-PO on two standard RLHF benchmarks, TL;DR summarization and the Anthropic Helpful Harmful (HH) dataset, comparing against Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). DR-PO outperforms both baselines across these benchmarks: on TL;DR summarization, its summaries surpass those produced by PPO and DPO as measured by GPT-4 win-rate. Moreover, when the trained policies are transferred zero-shot to the CNN/DailyMail dataset, DR-PO maintains its advantage, highlighting its robustness and generalizability beyond the training data. These results establish DR-PO's practical efficacy for RLHF, pairing theoretical soundness with real-world applicability.
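For context on the evaluation metric, the sketch below shows how a head-to-head win-rate is typically computed once a pairwise judge's verdicts are available; in the paper the judge is GPT-4 comparing two completions for the same prompt. The judge interface and the position-bias randomization here are illustrative assumptions, not the authors' exact evaluation harness.

```python
import random

def win_rate(judge, prompts, candidate_outputs, baseline_outputs):
    """Head-to-head win-rate given a pairwise judge. `judge(prompt, a, b)`
    returns "A" or "B"; a real evaluation would have GPT-4 fill this role.
    Candidate order is randomized per example to control for position bias."""
    wins = 0
    for prompt, ours, theirs in zip(prompts, candidate_outputs, baseline_outputs):
        if random.random() < 0.5:
            wins += judge(prompt, ours, theirs) == "A"
        else:
            wins += judge(prompt, theirs, ours) == "B"
    return wins / len(prompts)

# Toy judge that prefers the shorter summary, purely to make the sketch run.
toy_judge = lambda p, a, b: "A" if len(a) <= len(b) else "B"
print(win_rate(toy_judge, ["p1"], ["a short summary"], ["a much longer summary"]))
```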

Conclusion and Future Directions

Dataset Reset Policy Optimization introduces a pivotal advancement in the domain of RLHF, substantiated by both theoretical guarantees and strong empirical performance. The capability to leverage dataset resets in policy optimization presents a novel pathway toward more efficient and effective learning from human feedback. As the paper conjectures, the principles underpinning DR-PO may extend beyond the settings explored, suggesting a broad horizon for future investigations. The integration of dataset resets offers a promising avenue to enhance online RL algorithms further, warranting comprehensive exploration across diverse RLHF applications.
