
KTO: Model Alignment as Prospect Theoretic Optimization

(2402.01306)
Published Feb 2, 2024 in cs.LG and cs.AI

Abstract

Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them being human-aware loss functions (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO), and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B. Crucially, KTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.

Figure: KTO outperforms DPO and the unaligned Mistral 7B in win rate, showing greater robustness across sampling temperatures.

Overview

  • The paper introduces Kahneman-Tversky Optimization (KTO), a novel alignment approach for LLMs that matches or surpasses existing methods while requiring only a binary desirability signal rather than preference data.

  • KTO is based on prospect theory, emphasizing human biases in evaluating gains and losses, and aims to directly maximize human utility, allowing for simpler data collection.

  • Empirical results show KTO's effectiveness across model scales, its ability to match preference-based methods with far fewer desirable examples, and its potential to eliminate supervised fine-tuning when the pretrained model is already strong.

  • The paper highlights the importance of interdisciplinary insights from behavioral economics for advancing AI research and suggests future exploration into cognitive biases to improve alignment techniques.

Exploring Model Alignment through Kahneman-Tversky Optimization

Introduction

The paper explores aligning LLMs with human feedback, a pivotal step in making generative models more helpful, factual, and ethical. Alignment methods such as RLHF and DPO have shown clear gains over supervised fine-tuning alone, but they rely on preference data as their main input. This research introduces a new alignment approach, Kahneman-Tversky Optimization (KTO), which dispenses with preference data and instead uses a binary signal indicating whether a model's output is desirable or undesirable, building on the model of human utility from Kahneman & Tversky's prospect theory. KTO matches or surpasses the performance of existing preference-based methods across model scales from 1B to 30B.
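
To make the difference in data requirements concrete, the sketch below contrasts the two kinds of feedback. It is purely illustrative; the field names are hypothetical and do not correspond to any prescribed schema from the paper.

```python
# Illustrative only: field names are hypothetical, not a schema from the paper.

# Preference-based methods (RLHF, DPO) need a ranked pair for each prompt.
preference_example = {
    "prompt": "Explain why the sky is blue.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most...",
    "rejected": "The sky is blue because it reflects the ocean.",
}

# KTO needs only a single output plus a binary desirability label,
# so feedback like thumbs-up / thumbs-down can be used directly.
kto_examples = [
    {"prompt": "Explain why the sky is blue.",
     "completion": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter the most...",
     "label": True},   # desirable
    {"prompt": "Explain why the sky is blue.",
     "completion": "The sky is blue because it reflects the ocean.",
     "label": False},  # undesirable
]
```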

Prospect Theory and HALOs

KTO is grounded in prospect theory, which describes how humans evaluate gains and losses in a biased manner, most notably by being more sensitive to losses than to equivalent gains. The paper shows that many current alignment methods implicitly model such biases, which helps explain their success; these methods are termed human-aware loss functions (HALOs). KTO goes further by directly maximizing a Kahneman-Tversky model of the utility of generations, rather than the log-likelihood of preferences, which permits simpler and more abundant data collection in real-world scenarios.
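
For context, the value function Kahneman and Tversky fit to human behavior (in the cumulative prospect theory formulation of Tversky & Kahneman, 1992) is commonly written as below, where z is a gain or loss relative to a reference point; the widely cited empirical estimates α ≈ 0.88 and λ ≈ 2.25 capture diminishing sensitivity and loss aversion, respectively.

```latex
% Kahneman-Tversky value function over gains/losses z relative to a reference point.
% \alpha < 1 encodes diminishing sensitivity; \lambda > 1 encodes loss aversion.
v(z) =
\begin{cases}
  z^{\alpha}              & \text{if } z \ge 0 \\
  -\lambda\,(-z)^{\alpha} & \text{if } z < 0
\end{cases}
\qquad \alpha \approx 0.88,\quad \lambda \approx 2.25
```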

Performance of KTO

Empirical results demonstrate KTO's effectiveness across model scales, with performance matching or exceeding that of preference-based approaches such as DPO. Notably, KTO remains competitive even when up to 90% of the desirable examples are discarded, showing that it does not depend on balanced feedback, let alone preference pairs, a practical advantage over existing methods. Additionally, when the pretrained model is already of high quality, KTO can eliminate the need for supervised fine-tuning, outperforming DPO-aligned models that likewise skip it.
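
To illustrate how a binary desirability signal can drive training, here is a simplified sketch of a KTO-style per-example loss in PyTorch. It is not the paper's reference implementation: it assumes the policy/reference log-ratios and a scalar reference point z_ref have already been computed elsewhere, and it glosses over how the paper estimates that reference point. The weights lambda_d and lambda_u on desirable and undesirable examples are what let such a loss cope with heavily imbalanced feedback.

```python
import torch

def kto_style_loss(policy_logratios, labels, z_ref, beta=0.1,
                   lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss. Illustrative sketch, not the reference implementation.

    policy_logratios: log pi_theta(y|x) - log pi_ref(y|x) per completion, shape (B,)
    labels: 1 for desirable completions, 0 for undesirable, shape (B,)
    z_ref: scalar reference point (in the paper, an estimate of the
           policy-vs-reference KL divergence, treated as a constant here)
    beta: scaling hyperparameter controlling risk aversion
    lambda_d, lambda_u: weights on desirable vs. undesirable examples
    """
    labels = labels.float()
    # Desirable outputs gain value as the log-ratio rises above the reference point;
    # undesirable outputs gain value as the log-ratio falls below it.
    value_desirable = lambda_d * torch.sigmoid(beta * (policy_logratios - z_ref))
    value_undesirable = lambda_u * torch.sigmoid(beta * (z_ref - policy_logratios))
    value = labels * value_desirable + (1.0 - labels) * value_undesirable
    # Maximizing value is framed as minimizing (weight - value); the constant
    # weight term does not affect gradients but keeps the loss non-negative.
    weight = labels * lambda_d + (1.0 - labels) * lambda_u
    return (weight - value).mean()

# Toy usage with made-up numbers:
logratios = torch.tensor([0.7, -0.3, 1.2])   # log pi_theta - log pi_ref
labels = torch.tensor([1, 0, 1])             # desirable / undesirable flags
loss = kto_style_loss(logratios, labels, z_ref=0.2)
```

Tuning lambda_d and lambda_u relative to how many desirable and undesirable examples are available is one natural way for such a loss to compensate when one class is much rarer than the other, consistent with the data-imbalance robustness described above.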

Implications and Future Directions

The findings have profound implications for model alignment research and practical applications of AI. The ability of KTO to learn effectively from sparse, binary feedback opens new doors for efficiently gathering and using human feedback in model training. Given the varied performance across scales and datasets, further exploration into the optimal settings for KTO in different scenarios remains a rich area for future work.

Furthermore, the paper offers intriguing theoretical insights into the nature of human biases in model alignment and into the potential for HALOs to capture these biases better than current methods do. These insights raise the question of which other cognitive biases could be modeled to improve alignment techniques further.

Conclusion

Overall, the paper presents Kahneman-Tversky Optimization as a powerful tool for aligning LLMs with human feedback, capable of leveraging simpler, binary signals to achieve or surpass the performance of more complex preference-based methods. As we continue to push the boundaries of what AI can achieve, approaches like KTO, which combine insights from behavioral economics with cutting-edge AI research, will be crucial for developing more ethical, effective, and human-aligned models.

Acknowledgements

The research behind KTO stands on the shoulders of interdisciplinary insights, notably Kahneman & Tversky’s prospect theory. The success and insights derived from KTO highlight the importance of cross-disciplinary research, acknowledging the contributions from behavioral economics to the evolving field of AI. Thanks are also due to the team behind the implementation and evaluation of KTO, underscoring the collaborative effort required to advance AI research.
