Reward Model Ensembles Help Mitigate Overoptimization

(arXiv:2310.02743)
Published Oct 4, 2023 in cs.LG

Abstract

Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning LLMs to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
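
The abstract names the two conservative ensemble objectives but does not spell them out, so the sketch below illustrates one plausible reading: worst-case optimization (WCO) scores a sample by the minimum reward across the ensemble, while uncertainty-weighted optimization (UWO) penalizes the ensemble mean by the intra-ensemble variance. This is a minimal illustration rather than the paper's implementation; the helper names (policy_sample, reward_models) and the coefficients (var_coeff, kl_coeff) are placeholders chosen here, not identifiers from the paper.

```python
import numpy as np

def wco(rewards: np.ndarray) -> float:
    """Worst-case optimization: use the minimum reward across the ensemble."""
    return float(rewards.min())

def uwo(rewards: np.ndarray, var_coeff: float = 1.0) -> float:
    """Uncertainty-weighted optimization: ensemble mean penalized by the
    intra-ensemble variance (var_coeff is a tunable hyperparameter)."""
    return float(rewards.mean() - var_coeff * rewards.var())

def best_of_n(prompt, policy_sample, reward_models, n: int = 16, objective=wco):
    """Best-of-n (BoN) sampling against a conservative ensemble score:
    draw n candidates from the policy and keep the highest-scoring one."""
    candidates = [policy_sample(prompt) for _ in range(n)]
    scores = [
        objective(np.array([rm(prompt, c) for rm in reward_models]))
        for c in candidates
    ]
    return candidates[int(np.argmax(scores))]

def ppo_reward(rewards: np.ndarray, logp_policy: float, logp_ref: float,
               kl_coeff: float = 0.01, objective=uwo) -> float:
    """Per-sample reward signal for PPO: conservative ensemble objective minus
    a small KL penalty toward the reference (SFT) policy, as is standard in RLHF."""
    kl_estimate = logp_policy - logp_ref  # simple per-sequence KL estimate
    return objective(rewards) - kl_coeff * kl_estimate
```

With an ensemble of size one, both objectives collapse to ordinary proxy-reward optimization (the variance term vanishes and the minimum equals the mean), which is the single-reward-model baseline the abstract compares against.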

References
  1. Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback
  2. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
  3. Constitutional AI: Harmlessness from AI Feedback
  4. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp.  2397–2430. PMLR
  5. Convex optimization. Cambridge University Press
  6. Disagreement-regularized imitation learning. In International Conference on Learning Representations
  7. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
  8. Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30
  9. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31
  10. Faulty reward functions in the wild. https://openai.com/research/faulty-reward-functions

  11. Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks
  12. Thomas G Dietterich. Ensemble methods in machine learning. In International Workshop on Multiple Classifier Systems, pp.  1–15. Springer
  13. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
  14. AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
  15. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, International Conference on Machine Learning
  16. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pp.  10835–10866. PMLR
  17. Reinforced Self-Training (ReST) for Language Modeling
  18. Mikael Henaff. Explicit explore-exploit algorithms in continuous state spaces. Advances in Neural Information Processing Systems, 32
  19. Reward learning from human preferences and demonstrations in Atari. Advances in Neural Information Processing Systems, 31
  20. OpenAssistant Conversations -- Democratizing Large Language Model Alignment
  21. Specification gaming: The flip side of AI ingenuity. April 2020. https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity.

  22. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30
  23. PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training
  24. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial Life, 26(2):274–306
  25. Let's Verify Step by Step
  26. Self-Refine: Iterative Refinement with Self-Feedback
  27. Timothy Prickett Morgan. Counting the cost of training LLMs, 2022. https://www.nextplatform.com/2022/12/01/counting-the-cost-of-training-large-language-models/.

  28. WebGPT: Browser-assisted question-answering with human feedback
  29. OpenAI. OpenAI models. https://platform.openai.com/docs/models/, 2023a.

  30. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023b.
  31. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744
  32. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32
  33. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
  34. Self-supervised exploration via disagreement. In International Conference on Machine Learning, pp.  5062–5071. PMLR
  35. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  36. Self-critiquing models for assisting human evaluators
  37. Training Language Models with Language Feedback at Scale
  38. John Schulman. Approximating KL divergence, 2020. http://joschu.net/blog/kl-approx.html.

  39. Proximal Policy Optimization Algorithms
  40. Model-based active exploration. In International Conference on Machine Learning, pp.  5779–5788. PMLR
  41. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471
  42. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021
  43. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  44. Solving math word problems with process- and outcome-based feedback
  45. Mosaic LLMs (Part 2): GPT-3 quality for <$500k, 2022. https://www.mosaicml.com/blog/gpt-3-quality-for-500k.

  46. Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 2006.
  47. Recursively Summarizing Books with Human Feedback
  48. Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning
  49. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020
  50. RRHF: Rank Responses to Align Language Models with Human Feedback without tears
  51. SLiC-HF: Sequence Likelihood Calibration with Human Feedback
  52. Secrets of RLHF in Large Language Models Part I: PPO
  53. Consequences of misaligned AI. Advances in Neural Information Processing Systems, 33:15763–15773
  54. Fine-Tuning Language Models from Human Preferences
