Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble (2401.16635v3)
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely adopted approach for aligning large language models (LLMs) with human values. However, RLHF relies on a reward model trained on a limited amount of human preference data, which can lead to inaccurate predictions; as a result, RLHF may produce outputs that are misaligned with human values. To mitigate this issue, we contribute a reward ensemble method that enables the reward model to make more accurate predictions. Because an ensemble of LLM-based reward models can be computationally expensive and resource-intensive, we explore efficient ensemble methods, including linear-layer ensembles and LoRA-based ensembles. Empirically, we run Best-of-$n$ and Proximal Policy Optimization with our ensembled reward models, and verify that our ensemble methods help improve the alignment performance of RLHF outputs.
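To make the linear-layer ensemble idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a reward model with a shared transformer backbone and several independent linear reward heads, plus an ensemble-scored Best-of-$n$ selection step. The backbone/tokenizer choice (`gpt2`), the class and function names (`EnsembleRewardModel`, `best_of_n`), and the conservative mean-minus-std aggregation are illustrative assumptions, not necessarily the paper's exact design.

```python
# Minimal sketch of a linear-layer reward ensemble and ensemble-scored Best-of-n.
# Assumes a Hugging Face causal LM backbone; all names here are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class EnsembleRewardModel(nn.Module):
    """Shared transformer backbone with k independent linear reward heads."""

    def __init__(self, backbone_name: str = "gpt2", num_heads: int = 4):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        # Only these small heads differ across ensemble members; the backbone is shared.
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(num_heads)])

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Summarize each sequence by the hidden state of its last non-padding token.
        last_idx = attention_mask.sum(dim=1) - 1
        h = out.last_hidden_state[torch.arange(input_ids.size(0)), last_idx]
        # Stack per-head scalar rewards: shape (batch, num_heads).
        return torch.stack([head(h).squeeze(-1) for head in self.heads], dim=-1)


@torch.no_grad()
def best_of_n(prompt, candidates, tokenizer, reward_model, beta: float = 1.0):
    """Pick the candidate with the best conservative ensemble score (mean - beta * std)."""
    texts = [prompt + c for c in candidates]
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    rewards = reward_model(enc["input_ids"], enc["attention_mask"])  # (n, num_heads)
    score = rewards.mean(dim=-1) - beta * rewards.std(dim=-1)
    return candidates[int(score.argmax())]


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
    rm = EnsembleRewardModel("gpt2", num_heads=4).eval()
    print(best_of_n("Q: Be helpful.\nA: ", ["Sure, here is how...", "I refuse."], tok, rm))
```

A LoRA-based ensemble would follow the same pattern but attach several low-rank adapter sets to the shared backbone instead of (or in addition to) separate linear heads, trading a small amount of extra computation for more diverse ensemble members.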
Authors: Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan