Dataset Reset Policy Optimization for RLHF (2404.08495v3)
Abstract: Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, and it has produced impressive models such as GPT-4 and Claude 3 Opus. This framework typically consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset, under general function approximation and with finite sample complexity. In experiments, we demonstrate that on both TL;DR summarization and the Anthropic Helpful and Harmless (HH) dataset, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win rate. Code for this work can be found at https://github.com/Cornell-RL/drpo.
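To make the dataset-reset idea concrete, below is a minimal sketch of how rollout collection might mix resets to dataset states with ordinary generation from the initial state (the bare prompt). This is only an illustration of the abstract's description, not the authors' implementation: the `policy.generate` and `reward_model.score` interfaces, the record format (`prompt`, `chosen` token list), and the `reset_prob` mixing parameter are hypothetical names introduced here.

```python
import random

def sample_reset_state(offline_dataset, rng=random):
    """Draw a reset state from the offline preference data: a prompt plus a
    random prefix of the labeler-preferred completion.
    Hypothetical record format: {"prompt": str, "chosen": list[int]}."""
    example = rng.choice(offline_dataset)
    cut = rng.randint(0, len(example["chosen"]))  # randint is inclusive on both ends
    return example["prompt"], example["chosen"][:cut]

def collect_rollouts(policy, reward_model, offline_dataset, batch_size, reset_prob=0.5):
    """Collect one batch for the policy optimizer (e.g., a PPO-style update).
    With probability `reset_prob` the rollout starts from a dataset state
    (prompt + preferred prefix) instead of the usual initial state."""
    batch = []
    for _ in range(batch_size):
        prompt, prefix = sample_reset_state(offline_dataset)
        if random.random() >= reset_prob:
            prefix = []  # ordinary rollout from the initial state distribution
        completion = policy.generate(prompt, prefix)       # continue from the reset state
        reward = reward_model.score(prompt, prefix + completion)
        batch.append((prompt, prefix, completion, reward))
    return batch  # fed to a standard policy-gradient / PPO-style update
```

The sketch relies on the fact that, in token-level text generation, a "state" is simply a prompt plus a partial response, so resetting to states covered by the offline preference data amounts to continuing generation from prefixes of the labeler-preferred completions and scoring the full sequence with the learned reward model.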
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Reinforcement learning: Theory and algorithms. Technical report.
- On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506.
- Reinforcement learning with a near optimal rate of convergence. Technical report, INRIA.
- Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 263–272. JMLR.org.
- Bagnell, J. A. (2004). Learning decisions: Robustness, uncertainty, and approximation. Carnegie Mellon University.
- Covariant policy search. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1019–1024.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
- Efficient online reinforcement learning with offline data. arXiv preprint arXiv:2302.02948.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345.
- Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International Conference on Machine Learning, pages 783–792. PMLR.
- Learning to generate better than your LLM. arXiv preprint arXiv:2306.11816.
- Human-in-the-loop: Provably efficient preference-based reinforcement learning with general function approximation. In International Conference on Machine Learning, pages 3773–3793. PMLR.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- Contextual.ai (2023). https://contextual.ai/better-cheaper-faster-llm-alignment-with-kto/.
- Search-based structured prediction. Machine Learning, 75:297–325.
- Learning as search optimization: Approximate large margin methods for structured prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 169–176.
- Bilinear classes: A structural framework for provable generalization in RL.
- Contextual dueling bandits. In Conference on Learning Theory, pages 563–587. PMLR.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Contextual decision processes with low Bellman rank are PAC-learnable. arXiv preprint arXiv:1610.09512.
- Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR.
- Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, volume 2, pages 267–274.
- Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing systems, 14.
- Kakade, S. M. (2003). On the sample complexity of reinforcement learning. University of London, University College London (United Kingdom).
- Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192.
- Reinforcement learning with human feedback: Learning dynamic choices via pessimism. arXiv preprint arXiv:2305.18438.
- Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Languages are rewards: Hindsight finetuning using human feedback. arXiv preprint arXiv:2302.02676.
- Interactive learning from policy-dependent human feedback. In International Conference on Machine Learning, pages 2285–2294. PMLR.
- Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Dueling posterior sampling for preference-based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, pages 1029–1038. PMLR.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Dueling RL: Reinforcement learning with trajectory preferences. arXiv preprint arXiv:2111.04850.
- Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Is reinforcement learning (not) for natural language processing? Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241.
- Learning Montezuma's Revenge from a single demonstration. arXiv preprint arXiv:1812.03381.
- Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.
- Benchmarks and algorithms for offline preference-based reward learning. arXiv preprint arXiv:2301.01392.
- Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Hybrid RL: Using both offline and online data can make RL efficient. arXiv preprint arXiv:2210.06718.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Demonstration-regularized RL.
- Jump-start reinforcement learning. In International Conference on Machine Learning, pages 34556–34583. PMLR.
- Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46.
- Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
- Making RL with preference-based feedback efficient via randomization. arXiv preprint arXiv:2310.14554.
- Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment. arXiv preprint arXiv:2310.00212.
- Preference-based reinforcement learning with finite-time guarantees. Advances in Neural Information Processing Systems, 33:18784–18794.
- Efficient local planning with linear function approximation. In International Conference on Algorithmic Learning Theory, pages 1165–1192. PMLR.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020.
- The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556.
- Offline reinforcement learning with realizability and single-policy concentrability. In Conference on Learning Theory, pages 2730–2775. PMLR.
- Provable offline preference-based reinforcement learning.
- Provable reward-agnostic preference-based reinforcement learning.
- Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.
- Fine-tuning language models with advantage-induced policy alignment. arXiv preprint arXiv:2306.02231.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
- Relative confidence sampling for efficient on-line ranker evaluation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 73–82.