Abstract

Reinforcement Learning from Human Feedback (RLHF) is the most prominent method for Language Model (LM) alignment. However, RLHF is an unstable and data-hungry process that continually requires new high-quality LM-generated data for finetuning. We introduce Advantage-Leftover Lunch RL (A-LoL), a new class of offline policy gradient algorithms that enable RL training on any pre-existing data. By treating the entire LM output sequence as a single action, A-LoL can incorporate sequence-level classifiers or human-designed scoring functions as rewards. Using the LM's value estimate, A-LoL then trains only on positive-advantage (leftover) data points, making it resilient to noise. Overall, A-LoL is an easy-to-implement, sample-efficient, and stable LM training recipe. We demonstrate the effectiveness of A-LoL and its variants on four different language generation tasks, comparing against online RL (PPO) as well as recent preference-based (DPO, PRO) and reward-based (GOLD) offline RL baselines. On the commonly used RLHF benchmark, Helpful and Harmless Assistant (HHA), LMs trained with A-LoL methods achieve the highest diversity while also being rated safer and more helpful than the baselines by human evaluators. In the remaining three tasks, A-LoL optimizes multiple distinct reward functions even when trained on noisy or suboptimal data. Our experimental code is available at https://github.com/abaheti95/LoL-RL.
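To make the recipe concrete, below is a minimal PyTorch sketch of the kind of objective the abstract describes: each (prompt, output) pair from the offline dataset is treated as a single action, the advantage is the sequence-level reward minus the reference LM's value estimate for the prompt, and only positive-advantage (leftover) examples contribute to the gradient. The function name, the clipped policy-to-reference importance weight, and the normalization are illustrative assumptions rather than the authors' released implementation; see the linked repository for the actual code.

```python
import torch


def a_lol_loss(policy_seq_logprobs: torch.Tensor,
               ref_seq_logprobs: torch.Tensor,
               rewards: torch.Tensor,
               value_estimates: torch.Tensor,
               clip_range: float = 0.2) -> torch.Tensor:
    """Advantage-Leftover-Lunch-style offline loss for one batch (sketch).

    Each element corresponds to one (prompt, output) pair whose entire output
    is treated as a single action. `rewards` are sequence-level scores (e.g.
    from a classifier or a human-designed scoring function) and
    `value_estimates` are the frozen reference LM's value predictions for the
    prompts.
    """
    # Sequence-level advantage: reward minus the reference LM's value estimate.
    advantages = rewards - value_estimates

    # "Leftover lunch": only positive-advantage data points are trained on.
    keep = (advantages > 0).float()

    # Clipped importance weight between the current policy and the frozen
    # reference policy, treated as a constant so gradients flow only through
    # the policy log-probabilities (an assumed stability choice).
    ratio = torch.exp(policy_seq_logprobs - ref_seq_logprobs).detach()
    ratio = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)

    # Advantage-weighted offline policy gradient over the retained examples.
    per_example = -advantages.detach() * ratio * policy_seq_logprobs
    return (per_example * keep).sum() / keep.sum().clamp(min=1.0)


# Toy usage with dummy numbers; in practice the sequence log-probabilities are
# obtained by summing per-token log-probabilities of each stored output under
# the current policy and the frozen reference model.
policy_lp = torch.tensor([-42.0, -37.5, -55.1], requires_grad=True)
ref_lp = torch.tensor([-41.0, -38.0, -54.0])
rewards = torch.tensor([0.9, 0.2, 0.6])
values = torch.tensor([0.5, 0.4, 0.7])
loss = a_lol_loss(policy_lp, ref_lp, rewards, values)
loss.backward()
```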

References
  1. Just say no: Analyzing the stance of neural dialogue generation in offensive contexts. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4846–4862, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.397. https://aclanthology.org/2021.emnlp-main.397.

  2. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a
  3. Constitutional AI: Harmlessness from AI feedback, 2022b
  4. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp.  131–198, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/W16-2301. https://aclanthology.org/W16-2301.

  5. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4762–4779, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1470. https://aclanthology.org/P19-1470.

  6. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  1877–1901. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

  7. Open problems and fundamental limitations of reinforcement learning from human feedback
  8. Deep RL with hierarchical action exploration for dialogue generation
  9. Scaling Instruction-Finetuned Language Models
  10. Did it happen? the pragmatic complexity of veridicality assessment. Computational Linguistics, 38(2):301–333, February 2012. https://doi.org/10.1162/COLI_a_00097.

  11. Off-Policy Actor-Critic. In International Conference on Machine Learning, Edinburgh, United Kingdom, June 2012. https://inria.hal.science/hal-00764021.

  12. QLoRA: Efficient finetuning of quantized LLMs
  13. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. https://openreview.net/forum?id=r1l73iRqKm.

  14. Measuring the carbon intensity of ai in cloud instances. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, pp.  1877–1894, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533234. https://doi.org/10.1145/3531146.3533234.
  15. FaithDial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022a. doi: 10.1162/tacl_a_00529. https://aclanthology.org/2022.tacl-1.84.

  16. On the origin of hallucinations in conversational models: Is it the datasets or the models? In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  5271–5285, Seattle, United States, July 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.387. https://aclanthology.org/2022.naacl-main.387.

  17. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
  18. Dialogue response ranking training with large-scale human feedback data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  386–395, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.28. https://aclanthology.org/2020.emnlp-main.28.

  19. Aligning language models with preferences through f-divergence minimization. In International Conference on Machine Learning (ICML), 2023. https://openreview.net/forum?id=ttga7UlrsE.

  20. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  708–719, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1065. https://aclanthology.org/N18-1065.

  21. Efficient (soft) Q-learning for text generation with limited good data. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  6969–6991, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.518. https://aclanthology.org/2022.findings-emnlp.518.

  22. Measurement and fairness. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp.  375–385, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445901. https://doi.org/10.1145/3442188.3445901.
  23. GPT-Critic: Offline reinforcement learning for end-to-end task-oriented dialogue systems. In International Conference on Learning Representations, 2022
  24. Critic-guided decoding for controlled text generation
  25. Should i run offline reinforcement learning or behavioral cloning? In International Conference on Learning Representations, 2022. https://openreview.net/forum?id=AP1MKT37rJ.

  26. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  110–119, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1014. https://aclanthology.org/N16-1014.

  27. Aligning generative language models with human values. In Findings of the Association for Computational Linguistics: NAACL 2022, pp.  241–252, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-naacl.18. https://aclanthology.org/2022.findings-naacl.18.

  28. Statistical rejection sampling improves preference optimization
  29. Quark: Controllable text generation with reinforced unlearning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  27591–27609. Curran Associates, Inc., 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/b125999bde7e80910cbdbd323087df8f-Paper-Conference.pdf.

  30. OpenAI. GPT-4 technical report, 2023
  31. Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=TG8KACxEON.

  32. Richard Yuanzhe Pang and He He. Text generation by learning from demonstrations. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=RovX-uQ1Hua.

  33. Reward gaming in conditional text generation
  34. Stabilizing RLHF through advantage model and selective rehearsal
  35. The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation
  36. Language models are unsupervised multitask learners. OpenAI technical report, 2019
  37. Direct preference optimization: Your language model is secretly a reward model
  38. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=8aHzds2uUyB.

  39. Mark B. Ring. Child: A first step towards continual learning. Mach. Learn., 28(1):77–104, July 1997. ISSN 0885-6125. doi: 10.1023/A:1007331723572. https://doi.org/10.1023/A:1007331723572.

  40. High-Dimensional Continuous Control Using Generalized Advantage Estimation
  41. Proximal policy optimization algorithms
  42. What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  1702–1723, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1170. https://aclanthology.org/N19-1170.

  43. Toward diverse text generation with inverse reinforcement learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, pp.  4361–4367. AAAI Press, 2018. ISBN 9780999241127.
  44. The curse of recursion: Training on generated data makes models forget
  45. Defining and characterizing reward gaming. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=yb3HOXO3lX2.

  46. Offline RL for natural language generation with implicit language Q learning. In International Conference on Learning Representations, 2023
  47. Preference ranking optimization for human alignment, 2023a
  48. Reward collapse in aligning large language models, 2023b
  49. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. https://aclanthology.org/P19-1355.

  50. LLaMA: Open and efficient foundation language models, 2023a
  51. Llama 2: Open foundation and fine-tuned chat models, 2023b
  52. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

  53. CHAI: A CHatbot AI for task-oriented dialogue with offline reinforcement learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  4471–4491, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.332. https://aclanthology.org/2022.naacl-main.332.

  54. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl

  55. Large language models are not fair evaluators
  56. Critic regularized regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  7768–7778. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper_files/paper/2020/file/588cb956d6bbe67078f29f8de420a13d-Paper.pdf.

  57. Finetuned language models are zero-shot learners. In International Conference on Learning Representations, 2022a. https://openreview.net/forum?id=gEZrGCozdqR.

  58. Chain-of-thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022b. https://openreview.net/forum?id=_VjQlMeSB_J.

  59. Generating sequences by learning to self-correct
  60. Lilian Weng. Policy gradient algorithms. lilianweng.github.io, 2018. https://lilianweng.github.io/posts/2018-04-08-policy-gradient/.

  61. Symbolic knowledge distillation: from general language models to commonsense models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  4602–4625, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.341. https://aclanthology.org/2022.naacl-main.341.

  62. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. https://aclanthology.org/2020.emnlp-demos.6.

  63. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
  64. RLCD: Reinforcement learning from contrast distillation for language model alignment
  65. RRHF: Rank responses to align language models with human feedback without tears
  66. DIALOGPT: Large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  270–278, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-demos.30. https://aclanthology.org/2020.acl-demos.30.

  67. SLiC-HF: Sequence likelihood calibration with human feedback
  68. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  69. Fine-tuning language models with advantage-induced policy alignment
