Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner; for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them being human-aware loss functions (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach Kahneman-Tversky Optimization (KTO), and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B. Crucially, KTO does not need preferences -- only a binary signal of whether an output is desirable or undesirable for a given input. This makes it far easier to use in the real world, where preference data is scarce and expensive.
The paper introduces Kahneman-Tversky Optimization (KTO) as a novel alignment approach for LLMs, which surpasses or matches existing methods by utilizing binary signals instead of preference data.
KTO is based on prospect theory, emphasizing human biases in evaluating gains and losses, and aims to directly maximize human utility, allowing for simpler data collection.
Empirical results show KTO's effectiveness across various model scales and its ability to perform well with significantly fewer desirable examples, potentially reducing the need for supervised fine-tuning.
The paper highlights the importance of interdisciplinary insights from behavioral economics for advancing AI research and suggests future exploration into cognitive biases to improve alignment techniques.
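The loss aversion at the heart of prospect theory is captured by Kahneman & Tversky's value function, which is concave over gains, convex over losses, and steeper for losses than for gains. A minimal sketch, using the median parameter estimates from Tversky & Kahneman's 1992 paper (alpha ≈ 0.88, lambda ≈ 2.25):

```python
# Kahneman-Tversky value function: subjective value of an outcome z
# measured relative to a reference point. Parameters are the median
# estimates from Tversky & Kahneman (1992).
ALPHA = 0.88   # diminishing sensitivity to larger gains/losses
LAMBDA = 2.25  # loss aversion: losses loom ~2.25x larger than gains

def value(z: float) -> float:
    """Subjective value of a gain (z > 0) or loss (z < 0)."""
    if z >= 0:
        return z ** ALPHA
    return -LAMBDA * ((-z) ** ALPHA)

# Loss aversion in action: a 100-unit loss hurts more than an
# equal-sized gain pleases.
print(value(100))   # ~57.5
print(value(-100))  # ~-129.5
```

This asymmetry is the kind of bias that, per the paper, successful alignment objectives implicitly build into their loss functions.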
The paper explores aligning LLMs with human feedback, a pivotal step in making generative models more helpful, factual, and ethical. Alignment methods like RLHF and DPO have shown success over supervised fine-tuning alone, with preference data as their main input. This research introduces a novel alignment approach, Kahneman-Tversky Optimization (KTO), which forgoes preference data in favor of a binary signal indicating whether a model's output is desirable or undesirable, grounded in the human utility model of Kahneman & Tversky's prospect theory. KTO matches or surpasses the performance of existing preference-based methods across model scales (1B to 30B).
KTO is grounded in prospect theory, which accounts for the way humans evaluate gains and losses in a biased manner, notably being more sensitive to losses than equivalent gains. The paper shows that many current alignment methods implicitly model such biases, aiding in their success. These methods are termed human-aware loss functions (HALOs). Notably, KTO directly maximizes human utility, as opposed to preference likelihood, and allows for simpler, more abundant data collection in real-world scenarios.
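To make the "maximize utility, not preference likelihood" idea concrete, here is a schematic sketch of a KTO-style per-example loss. This is an illustration, not the paper's exact implementation: the function and hyperparameter names (`beta`, `lambda_d`, `lambda_u`, `kl_ref`) are assumptions, and the real method estimates the reference point from a batch rather than taking it as an argument.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             kl_ref: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Schematic KTO-style loss for one (prompt, completion) example.

    logp_policy / logp_ref: log-prob of the completion under the policy
    and the frozen reference model; kl_ref plays the role of the
    prospect-theoretic reference point.
    """
    # Implicit reward: how much more the policy likes this completion
    # than the reference model does.
    reward = beta * (logp_policy - logp_ref)
    if desirable:
        # Utility of a "gain": saturating (risk-averse) above the reference.
        return lambda_d - lambda_d * sigmoid(reward - kl_ref)
    # Utility of a "loss": penalize rewarding an undesirable output.
    return lambda_u - lambda_u * sigmoid(kl_ref - reward)

# A desirable output the policy already upweights incurs a smaller loss
# than one it downweights.
print(kto_loss(-1.0, -2.0, desirable=True))
print(kto_loss(-3.0, -2.0, desirable=True))
```

Note that each example contributes on its own: no paired comparison is needed, which is what lets KTO train on unpaired binary feedback.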
Empirical results demonstrate KTO’s effectiveness across model scales, with performance matching or exceeding that of preference-based approaches like DPO. Notably, KTO remains competitive even when trained on up to 90% fewer desirable examples, showing that it does not depend on paired preference data, a critical advantage over existing methods. Additionally, when the pre-trained model is already of high quality, KTO can eliminate the need for supervised fine-tuning: a KTO-aligned model can skip SFT and still outperform DPO-aligned models.
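The data-collection advantage is easiest to see side by side. The field names below are hypothetical, chosen only to contrast the two formats: preference-based methods need paired comparisons over the same prompt, while KTO needs only a per-example desirable/undesirable flag, e.g. a thumbs-up or thumbs-down.

```python
# Preference-based methods (e.g., DPO) require paired comparisons:
preference_pair = {
    "prompt": "Summarize the article.",
    "chosen": "A concise, faithful summary...",
    "rejected": "An off-topic ramble...",
}

# KTO needs only unpaired binary feedback per (prompt, completion):
binary_feedback = [
    {"prompt": "Summarize the article.",
     "completion": "A concise, faithful summary...",
     "label": True},    # desirable (e.g., a thumbs-up)
    {"prompt": "Summarize the article.",
     "completion": "An off-topic ramble...",
     "label": False},   # undesirable (e.g., a thumbs-down)
]
```

Any preference pair can be split into two binary examples, but binary signals can also come from sources where no pairing exists at all, which is why such data is far more abundant in practice.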
The findings have profound implications for model alignment research and practical applications of AI. The ability of KTO to learn effectively from sparse, binary feedback opens new doors for efficiently gathering and using human feedback in model training. Given the varied performance across scales and datasets, further exploration into the optimal settings for KTO in different scenarios remains a rich area for future work.
Furthermore, the paper offers intriguing theoretical insights into the role of human biases in model alignment and the potential for HALOs to capture these biases better than existing objectives do. These insights raise the question of what other cognitive biases could be modeled to further improve alignment techniques.
Overall, the paper presents Kahneman-Tversky Optimization as a powerful tool for aligning LLMs with human feedback, capable of leveraging simpler, binary signals to achieve or surpass the performance of more complex preference-based methods. As we continue to push the boundaries of what AI can achieve, approaches like KTO, which combine insights from behavioral economics with cutting-edge AI research, will be crucial for developing more ethical, effective, and human-aligned models.
The research behind KTO stands on the shoulders of interdisciplinary insights, notably Kahneman & Tversky’s prospect theory. The success and insights derived from KTO highlight the importance of cross-disciplinary research, acknowledging the contributions from behavioral economics to the evolving field of AI. Thanks are also due to the team behind the implementation and evaluation of KTO, underscoring the collaborative effort required to advance AI research.