Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning LLMs to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.
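To make the two conservative objectives concrete, here is a minimal sketch of how an ensemble of proxy reward scores could be combined under WCO and UWO and then used for best-of-n selection. The array shapes, the UWO variance weight `lam`, and the function names are illustrative assumptions for this example, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): conservative ensemble objectives.
import numpy as np

def worst_case_objective(ensemble_rewards: np.ndarray) -> np.ndarray:
    """WCO: score each sample by the minimum reward across ensemble members.

    ensemble_rewards has shape (n_members, n_samples).
    """
    return ensemble_rewards.min(axis=0)

def uncertainty_weighted_objective(ensemble_rewards: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """UWO: mean reward penalized by the intra-ensemble variance, weighted by lam (assumed value)."""
    return ensemble_rewards.mean(axis=0) - lam * ensemble_rewards.var(axis=0)

def best_of_n(candidate_scores: np.ndarray) -> int:
    """BoN sampling: return the index of the candidate with the highest (conservative) score."""
    return int(np.argmax(candidate_scores))

# Example: 3 proxy reward models score 4 candidate completions.
scores = np.array([[0.9, 0.2, 0.5, 0.70],
                   [0.1, 0.3, 0.6, 0.65],
                   [0.8, 0.25, 0.55, 0.60]])
print(best_of_n(worst_case_objective(scores)))            # chosen candidate under WCO
print(best_of_n(uncertainty_weighted_objective(scores)))  # chosen candidate under UWO
```

A single proxy reward model can assign a spuriously high score to a completion (as member 0 does to candidate 0 above); taking the minimum or penalizing disagreement makes such a completion less likely to be selected.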