
Compositional preference models for aligning LMs (2310.13011v2)

Published 17 Oct 2023 in cs.CL and cs.LG

Abstract: As language models (LMs) become more capable, it is increasingly important to align them with human preferences. However, the dominant paradigm for training Preference Models (PMs) for that purpose suffers from fundamental limitations, such as lack of transparency and scalability, along with susceptibility to overfitting the preference dataset. We propose Compositional Preference Models (CPMs), a novel PM framework that decomposes one global preference assessment into several interpretable features, obtains scalar scores for these features from a prompted LM, and aggregates these scores using a logistic regression classifier. Through these simple steps, CPMs make it possible to control which properties of the preference data are used to train the preference model and to ground it in features that are believed to underlie human preference judgments. Our experiments show that CPMs not only generalize better and are more robust to overoptimization than standard PMs, but also that best-of-n samples obtained using CPMs tend to be preferred over samples obtained using conventional PMs. Overall, our approach demonstrates the benefits of endowing PMs with priors about which features determine human preferences while relying on LM capabilities to extract those features in a scalable and robust way.
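The abstract describes a three-step pipeline: score a handful of interpretable features with a prompted LM, then aggregate the scores with a logistic regression classifier. The sketch below illustrates that pipeline under stated assumptions; the feature names, the `score_feature` helper, and the pairwise-difference training formulation are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of a Compositional Preference Model (CPM): score interpretable
# features with a prompted LM, then aggregate with logistic regression.
# The feature list, prompt helper, and pairwise-difference training setup are
# illustrative assumptions, not the paper's exact implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["helpfulness", "factuality", "coherence"]  # hypothetical feature set


def score_feature(prompt: str, response: str, feature: str) -> float:
    """Ask a prompted LM to rate `feature` of `response` on a scalar scale.

    Placeholder: replace with a real LM call returning, e.g., a 1-10 rating.
    """
    return float(len(response) % 10)  # dummy score so the sketch runs end to end


def featurize(prompt: str, response: str) -> np.ndarray:
    """Turn one (prompt, response) pair into a vector of feature scores."""
    return np.array([score_feature(prompt, response, f) for f in FEATURES])


def train_cpm(pairs):
    """Fit the aggregator on (prompt, chosen, rejected) preference triples."""
    X, y = [], []
    for prompt, chosen, rejected in pairs:
        diff = featurize(prompt, chosen) - featurize(prompt, rejected)
        X.append(diff)
        y.append(1)    # chosen preferred over rejected
        X.append(-diff)
        y.append(0)    # symmetric negative example
    return LogisticRegression().fit(np.array(X), np.array(y))


def cpm_score(clf, prompt: str, response: str) -> float:
    """Score a response as the learned weighted sum of its feature scores."""
    return float(featurize(prompt, response) @ clf.coef_[0])
```

Because the final score is a linear combination of per-feature scores, the learned coefficients can be read directly as the weight each interpretable feature carries in the overall preference judgment, which is the transparency benefit the abstract highlights.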
