
Compositional preference models for aligning LMs

(2310.13011)
Published Oct 17, 2023 in cs.CL and cs.LG

Abstract

As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. These simple steps allow CPMs to control which properties of the preference data are used to train the preference model, and to build it on features that are believed to underlie human preference judgments. Our experiments show that CPMs not only improve generalization and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences, while relying on LM capabilities to extract those features in a scalable and robust way.
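As a rough illustration of the pipeline the abstract describes, the sketch below scores each response along a few interpretable features by prompting an LM, then fits a logistic regression classifier that aggregates those scores into a single preference score. The feature names, the prompt template, and the `lm` callable are hypothetical placeholders, not the authors' exact setup; this is a minimal sketch of the general idea.

```python
# Minimal sketch of a Compositional Preference Model (CPM).
# Feature names, the rating prompt, and the `lm` callable are illustrative
# assumptions, not the paper's exact implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["helpfulness", "factuality", "coherence", "harmlessness"]

def score_feature(lm, prompt: str, response: str, feature: str) -> float:
    """Ask a prompted LM for a scalar rating (1-10) of one feature."""
    query = (
        f"Rate the following response on {feature}, from 1 (worst) to 10 (best). "
        f"Reply with a single number.\n\n"
        f"Prompt: {prompt}\nResponse: {response}\nRating:"
    )
    return float(lm(query))  # `lm` is any callable that returns the LM's reply text

def featurize(lm, prompt: str, response: str) -> np.ndarray:
    """Decompose one global judgment into a vector of interpretable feature scores."""
    return np.array([score_feature(lm, prompt, response, f) for f in FEATURES])

def train_cpm(lm, pairs) -> LogisticRegression:
    """pairs: iterable of (prompt, preferred_response, rejected_response)."""
    X, y = [], []
    for prompt, chosen, rejected in pairs:
        # The classifier learns to separate preferred from rejected responses
        # in the interpretable feature space.
        X.append(featurize(lm, prompt, chosen))
        y.append(1)
        X.append(featurize(lm, prompt, rejected))
        y.append(0)
    return LogisticRegression().fit(np.array(X), np.array(y))

def cpm_score(clf: LogisticRegression, lm, prompt: str, response: str) -> float:
    """Preference score: probability that the response would be preferred."""
    return clf.predict_proba(featurize(lm, prompt, response).reshape(1, -1))[0, 1]
```

Under these assumptions, best-of-n sampling would simply generate n candidate responses and keep the one with the highest `cpm_score`; the fitted classifier weights also indicate how much each feature contributes to the overall preference.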
