Bayesian Reward Models for LLM Alignment (2402.13210v2)
Abstract: To ensure that LLM responses are helpful and non-toxic, a reward model trained on human preference data is typically used. Responses with high rewards are then selected via best-of-$n$ (BoN) sampling, or the LLM is further optimized to produce high-reward responses through reinforcement learning from human feedback (RLHF). However, these processes are susceptible to reward overoptimization, or "reward hacking", where responses receive high rewards due to imperfections in the reward model rather than genuine human preference, particularly as prompts or responses deviate from the training data. To address this, we propose training a Bayesian reward model, which signals higher uncertainty for inputs further from the training data distribution. We train Bayesian reward models by applying a Laplace approximation to the LoRA weights, and find that the resulting uncertainty estimates can effectively mitigate reward overoptimization in BoN sampling.
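As a rough illustration of how such uncertainty estimates can be used at selection time, the sketch below performs uncertainty-penalized best-of-$n$ selection: each candidate response is scored by its posterior reward mean minus a multiple of its posterior standard deviation (both assumed to come from the Bayesian reward model, e.g. the Laplace approximation over LoRA weights), and the highest-scoring candidate is returned. The function and parameter names (`penalized_best_of_n`, `reward_mean`, `reward_std`, `beta`) are illustrative, and the lower-confidence-bound penalty form is an assumption rather than the paper's exact objective.

```python
import numpy as np

def penalized_best_of_n(reward_mean, reward_std, beta=1.0):
    """Uncertainty-penalized best-of-n selection (illustrative sketch).

    reward_mean : posterior mean reward per candidate response (assumed to
                  come from a Bayesian reward model, e.g. a Laplace
                  approximation over LoRA weights).
    reward_std  : posterior reward standard deviation per candidate.
    beta        : penalty strength; larger values are more conservative
                  toward responses the reward model is uncertain about.

    Returns the index of the selected candidate.
    """
    reward_mean = np.asarray(reward_mean, dtype=float)
    reward_std = np.asarray(reward_std, dtype=float)
    # Lower-confidence-bound style score: prefer high expected reward,
    # but discount high-uncertainty (likely out-of-distribution) responses.
    score = reward_mean - beta * reward_std
    return int(np.argmax(score))

# Toy usage: candidate 1 has the highest raw reward but also the highest
# uncertainty, so the penalized selection prefers candidate 0.
means = [2.1, 2.4, 1.8]
stds = [0.1, 0.9, 0.2]
print(penalized_best_of_n(means, stds, beta=1.0))  # -> 0
```

Setting `beta=0` recovers standard BoN sampling on the raw reward scores, so the penalty strength controls the trade-off between maximizing reward and guarding against reward hacking.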