RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (2309.00267v3)
Abstract: Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF), a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, and show that it achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
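To make the abstract's two feedback pathways concrete, here is a minimal Python sketch, not code from the paper: canonical RLAIF collects AI preference labels that would train a reward model (RM), while d-RLAIF queries the off-the-shelf LLM for a score that is used directly as the RL reward. The names `label_preference_with_llm` and `score_with_llm` are hypothetical placeholders for LLM calls, and their bodies are dummy heuristics.

```python
from typing import List, Tuple

def label_preference_with_llm(prompt: str, resp_a: str, resp_b: str) -> int:
    """Placeholder for the AI labeler: return 0 if resp_a is preferred, else 1."""
    # Dummy heuristic standing in for an LLM preference-labeling call.
    return 0 if len(resp_a) >= len(resp_b) else 1

def score_with_llm(prompt: str, response: str) -> float:
    """Placeholder for the off-the-shelf LLM scorer: rate the response on a 1-10 scale."""
    # Dummy heuristic standing in for an LLM scoring call.
    return min(10.0, 1.0 + 0.5 * len(response.split()))

# Canonical RLAIF: AI preference labels -> train a reward model (RM) -> RL against the RM.
def build_preference_dataset(
    pairs: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str, int]]:
    """Each input element is (prompt, response_a, response_b); the AI labeler picks a winner."""
    return [(p, a, b, label_preference_with_llm(p, a, b)) for p, a, b in pairs]

# d-RLAIF: skip RM training and use the LLM's score directly as the RL reward.
def d_rlaif_reward(prompt: str, response: str) -> float:
    """Normalize the 1-10 LLM score to [0, 1] and use it as the reward signal."""
    return (score_with_llm(prompt, response) - 1.0) / 9.0

if __name__ == "__main__":
    pairs = [("Summarize the article:", "A short summary.", "A longer, more detailed summary.")]
    print(build_preference_dataset(pairs))
    print(d_rlaif_reward("Summarize the article:", "A concise, faithful summary."))
```

In canonical RLAIF the preference dataset would then train an RM, against which the policy is optimized with an RL algorithm; d-RLAIF removes that RM training stage entirely, as the abstract describes.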
- Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Constitutional AI: Harmlessness from AI feedback.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311.
- Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
- Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195, Toronto, Canada. Association for Computational Linguistics.
- Understanding dataset difficulty with $\mathcal{V}$-usable information. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5988–6008. PMLR.
- Tom Everitt and Marcus Hutter. 2016. Avoiding wireheading with value reinforcement learning. In Artificial General Intelligence: 9th International Conference, AGI 2016, New York, NY, USA, July 16-19, 2016, Proceedings 9, pages 12–22. Springer.
- Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.
- A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.
- Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
- Reward learning for efficient reinforcement learning in extractive document summarisation. arXiv preprint arXiv:1907.12894.
- A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169. PMLR.
- ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- Google. 2023. AI Platform Data Labeling Service pricing. https://cloud.google.com/ai-platform/data-labeling/pricing#labeling_costs. Accessed: 2023-09-28.
- PaLM 2 technical report.
- Ronald A. Howard. 1960. Dynamic programming and Markov processes. John Wiley.
- Large language models can self-improve. arXiv preprint arXiv:2210.11610.
- Sequence tutor: Conservative fine-tuning of sequence generation models with KL-control. In International Conference on Machine Learning, pages 1645–1654. PMLR.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- M. G. Kendall and B. Babington Smith. 1939. The Problem of $m$ Rankings. The Annals of Mathematical Statistics, 10(3):275–287.
- Reward design with language models. In The Eleventh International Conference on Learning Representations.
- Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.
- Summary of ChatGPT/GPT-4 research and perspective towards the future of large language models. arXiv preprint arXiv:2304.01852.
- Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
- James Manyika. 2023. An overview of Bard: an early experiment with generative AI. https://ai.google/static/documents/google-about-bard.pdf. Accessed: 2023-08-23.
- Tuning language models as training data generators for augmentation-enhanced few-shot learning. In International Conference on Machine Learning, pages 24457–24477. PMLR.
- Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- OpenAI. 2023a. GPT-4 technical report.
- OpenAI. 2023b. OpenAI pricing. https://openai.com/pricing. Accessed: 2023-09-28.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
- Pouya Pezeshkpour and Estevam Hruschka. 2023. Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483.
- Factually consistent summarization via reinforcement learning with textual entailment feedback. arXiv preprint arXiv:2306.00186.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235.
- Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021.
- Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12.
- LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
- Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001.
- Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
- Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205.
- Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Towards zero-label language learning. arXiv preprint arXiv:2109.09193.
- Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256.
- A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.
- Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, page 5602.
- RLCD: Reinforcement learning from contrast distillation for language model alignment.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.