Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF (2405.19320v3)
Abstract: Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preferences. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data, regardless of how that data is collected. While the principles of optimism and pessimism under uncertainty are well established in standard reinforcement learning (RL), a practically implementable and theoretically grounded form amenable to LLMs is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations. In this paper, we introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO) -- which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a $\textit{sign}$ that indicates whether optimism or pessimism is chosen. VPO also optimizes the policy directly with implicit reward modeling, and therefore enjoys a simpler RLHF pipeline, similar to direct preference optimization. Theoretical guarantees of VPO are provided for both online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
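As a rough illustration of the objective the abstract describes, the sketch below combines a DPO-style maximum-likelihood preference loss with a value-style regularizer whose sign switches between optimism (online) and pessimism (offline). This is a minimal sketch, not the authors' implementation: the function name `vpo_loss`, the weight `alpha`, and the use of the preferred responses' implicit rewards as a stand-in for the value term are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def vpo_loss(policy_logps_w: torch.Tensor, policy_logps_l: torch.Tensor,
             ref_logps_w: torch.Tensor, ref_logps_l: torch.Tensor,
             beta: float = 0.1, alpha: float = 0.01, sign: int = -1) -> torch.Tensor:
    """Sketch of a VPO-style objective (illustrative, not the paper's exact recipe).

    *_logps_w / *_logps_l: summed log-probabilities of the preferred (w) and
    dispreferred (l) responses under the current policy and a frozen reference.
    sign = +1 adds the value term (optimism, online); sign = -1 subtracts it
    (pessimism, offline).
    """
    # Implicit rewards r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)), as in DPO.
    reward_w = beta * (policy_logps_w - ref_logps_w)
    reward_l = beta * (policy_logps_l - ref_logps_l)

    # Maximum-likelihood (Bradley-Terry) preference loss.
    mle_loss = -F.logsigmoid(reward_w - reward_l).mean()

    # Value-style regularizer, here proxied by the implicit reward of the
    # preferred responses (an assumption made for this illustration only).
    value_term = reward_w.mean()

    # Minimizing (mle_loss - sign * alpha * value_term) corresponds to
    # maximizing the preference likelihood plus a sign-weighted value bonus.
    return mle_loss - sign * alpha * value_term
```

With `sign = -1` the regularizer penalizes over-confident implicit rewards on the offline data (pessimism); with `sign = +1` it rewards responses the current policy values highly, encouraging exploration in the online setting.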