RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (2304.06767v4)
Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Given a reward model and a sufficient number of samples, our approach selects the high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on the filtered set. Our studies show that RAFT effectively improves model performance in terms of both reward and other automated metrics, on both large language models (LLMs) and diffusion models.
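The abstract outlines RAFT's core loop: draw several candidate outputs per prompt, rank them with the reward model, keep only the highest-reward samples, and fine-tune the generator on that filtered set. Below is a minimal sketch of one such iteration; `generate`, `reward`, `fine_tune`, and `raft_step` are hypothetical placeholders standing in for a real generative model, reward model, and supervised fine-tuning step, not code from the paper.

```python
import random

# Minimal sketch of one RAFT iteration (reward-ranked fine-tuning).
# All helpers below are hypothetical stand-ins marking where a real
# LLM / diffusion sampler, reward model, and SFT call would go.

def generate(model, prompt, k):
    """Hypothetical sampler: draw k candidate responses for a prompt."""
    return [f"{prompt} -> sample {i} (seed {random.random():.3f})" for i in range(k)]

def reward(response):
    """Hypothetical reward model: a toy score (here, response length)."""
    return len(response)

def fine_tune(model, dataset):
    """Hypothetical SFT step on the reward-filtered (prompt, response) pairs."""
    print(f"fine-tuning on {len(dataset)} reward-ranked samples")
    return model

def raft_step(model, prompts, k=8):
    """One RAFT iteration: sample k responses per prompt, keep the best
    by reward (best-of-k filtering), then fine-tune on the kept pairs."""
    filtered = []
    for prompt in prompts:
        candidates = generate(model, prompt, k)
        best = max(candidates, key=reward)  # reward ranking / filtering
        filtered.append((prompt, best))
    return fine_tune(model, filtered)

# Usage: repeat raft_step over prompt batches until reward plateaus.
model = object()  # placeholder for the generative model being aligned
model = raft_step(model, ["Write a friendly greeting.", "Summarize RLHF."])
```

The key design choice this sketch illustrates is that the reward model is used only to rank and filter samples, so the update itself is plain supervised fine-tuning rather than an RL policy-gradient step, which is where the abstract's claimed robustness and simplicity come from.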