Dataset Reset Policy Optimization for RLHF (2404.08495v3)
Abstract: Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and it has produced impressive models such as GPT-4 and Claude 3 Opus. This framework typically consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation and with finite sample complexity. In experiments, we demonstrate that on both TL;DR summarization and the Anthropic Helpful and Harmless (HH) dataset, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
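To make the dataset-reset idea concrete, below is a minimal sketch of how rollout collection might mix resets to dataset states with ordinary generation from the initial state (the bare prompt). This is only an illustration of the abstract's description, not the authors' implementation: the `policy.generate` and `reward_model.score` interfaces, the record format (`prompt`, `chosen` token list), and the `reset_prob` mixing parameter are hypothetical names introduced here.

```python
import random

def sample_reset_state(offline_dataset, rng=random):
    """Draw a reset state from the offline preference data: a prompt plus a
    random prefix of the labeler-preferred completion.
    Hypothetical record format: {"prompt": str, "chosen": list[int]}."""
    example = rng.choice(offline_dataset)
    cut = rng.randint(0, len(example["chosen"]))  # randint is inclusive on both ends
    return example["prompt"], example["chosen"][:cut]

def collect_rollouts(policy, reward_model, offline_dataset, batch_size, reset_prob=0.5):
    """Collect one batch for the policy optimizer (e.g., a PPO-style update).
    With probability `reset_prob` the rollout starts from a dataset state
    (prompt + preferred prefix) instead of the usual initial state."""
    batch = []
    for _ in range(batch_size):
        prompt, prefix = sample_reset_state(offline_dataset)
        if random.random() >= reset_prob:
            prefix = []  # ordinary rollout from the initial state distribution
        completion = policy.generate(prompt, prefix)       # continue from the reset state
        reward = reward_model.score(prompt, prefix + completion)
        batch.append((prompt, prefix, completion, reward))
    return batch  # fed to a standard policy-gradient / PPO-style update
```

The sketch relies on the fact that, in token-level text generation, a "state" is simply a prompt plus a partial response, so resetting to states covered by the offline preference data amounts to continuing generation from prefixes of the labeler-preferred completions and scoring the full sequence with the learned reward model.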
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Reinforcement learning: Theory and algorithms. Technical report.
- On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506.
- Reinforcement learning with a near optimal rate of convergence. Technical report, INRIA.
- Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org.
- Bagnell, J. A. (2004). Learning decisions: Robustness, uncertainty, and approximation. Carnegie Mellon University.
- Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1019–1024.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- Efficient online reinforcement learning with offline data. arXiv preprint arXiv:2302.02948.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792. PMLR.
- Learning to generate better than your LLM. arXiv preprint arXiv:2306.11816.
- Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pages 3773–3793. PMLR.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- Contextual.ai (2023). https://contextual.ai/better-cheaper-faster-llm-alignment-with-kto/.
- Search-based structured prediction. Machine Learning, 75:297–325.
- Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 169–176.
- Bilinear classes: A structural framework for provable generalization in RL.
- Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Contextual decision processes with low Bellman rank are PAC-learnable. arXiv preprint arXiv:1610.09512.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
- Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, volume 2, pages 267–274.
- Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing systems, 14.
- Kakade, S. M. (2003). On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom).
- Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192.
- Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438.
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Languages are rewards: Hindsight finetuning using human feedback. arXiv preprint arXiv:2302.02676.
- Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning, pages 2285–2294. PMLR.
- Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 1029–1038. PMLR.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Dueling RL: Reinforcement learning with trajectory preferences. arXiv preprint arXiv:2111.04850.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Is reinforcement learning (not) for natural language processing? Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241.
- Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381.
- Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
- Benchmarks and algorithms for offline preference-based reward learning. arXiv preprint arXiv:2301.01392.
- Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Demonstration-regularized RL.
- Jump-start reinforcement learning. In International Conference on Machine Learning, pages 34556–34583. PMLR.
- Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46.
- Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
- Making RL with preference-based feedback efficient via randomization. arXiv preprint arXiv:2310.14554.
- Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment. arXiv preprint arXiv:2310.00212.
- Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784–18794.
- Efficient local planning with linear function approximation. In International Conference on Algorithmic Learning Theory, pages 1165–1192. PMLR.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020.
- The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556.
- Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR.
- Provable offline preference-based reinforcement learning.
- Provable reward-agnostic preference-based reinforcement learning.
- Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.
- Fine-tuning language models with advantage-induced policy alignment. arXiv preprint arXiv:2306.02231.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
- Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 73–82.