Emergent Mind

Pretraining Language Models with Human Preferences

(2302.08582)
Published Feb 16, 2023 in cs.CL and cs.LG

Abstract

Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, i.e., learning a distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
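To make the conditional-training idea concrete, here is a minimal sketch of how a pretraining corpus can be tagged with preference-based control tokens so the LM learns a token distribution conditioned on reward-model scores. This is an illustrative sketch, not the paper's exact implementation: the control-token names `<|good|>` and `<|bad|>`, the `reward_model` scorer, the threshold value, and the use of GPT-2 via Hugging Face are all assumptions made for the example.

```python
# Sketch of conditional training: prefix each pretraining segment with a
# control token derived from a reward model's preference score, then train
# with the standard next-token objective. The token names, threshold, and
# reward_model scorer below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the control tokens and grow the embedding matrix to match.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|good|>", "<|bad|>"]}
)
model.resize_token_embeddings(len(tokenizer))

def tag_example(text: str, reward_model, threshold: float = 0.5) -> torch.Tensor:
    """Prefix a pretraining document (or segment) with a control token
    chosen by thresholding the reward model's preference score."""
    score = reward_model(text)  # hypothetical scorer: text -> float
    tag = "<|good|>" if score >= threshold else "<|bad|>"
    return tokenizer(tag + text, return_tensors="pt").input_ids

# Training uses the ordinary language-modeling loss over the tagged sequence,
# so the model learns p(tokens | preference tag):
#   ids = tag_example(document, reward_model)
#   loss = model(ids, labels=ids).loss

# At inference time, conditioning on "<|good|>" steers generation toward
# preferred content:
prompt_ids = tokenizer("<|good|>", return_tensors="pt").input_ids
# sample = model.generate(prompt_ids, max_new_tokens=50, do_sample=True)
```

Because the objective stays plain maximum-likelihood over the tagged corpus, no data is discarded and no reinforcement-learning machinery is needed; the preference signal enters only through the conditioning tokens.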

