Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization (2405.16681v2)
Abstract: Reinforcement Learning from Human Feedback (RLHF) enhances the alignment of LLMs. However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and to the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0 and 7.3 percentage points on Arena-Hard, 12.2 and 13.3 points on MixEval-Hard, 10.4 and 10.1 points on MMLU-Pro, and 19.0 and 19.2 points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.
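The abstract positions TPO against DPO-style pairwise objectives, so a minimal sketch of the standard DPO loss is given below for reference, alongside a purely hypothetical single-step objective over three responses per prompt (a gold response plus a chosen/rejected pair). The function names, the `alpha`/`beta` hyperparameters, and the form of `triple_preference_loss_sketch` are illustrative assumptions, not the objective defined in the paper; consult the paper for the actual TPO formulation.

```python
# Illustrative sketch only. dpo_loss follows the standard DPO objective the
# abstract compares against; triple_preference_loss_sketch is a hypothetical
# "gold + pairwise" combination, NOT the paper's TPO loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO: -log sigmoid(beta * (policy margin - reference margin))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def triple_preference_loss_sketch(policy_gold_logps,
                                  policy_chosen_logps,
                                  policy_rejected_logps,
                                  alpha=1.0, beta=0.1):
    """Hypothetical single-step objective over three responses per prompt:
    maximize likelihood of a gold response while separating chosen from
    rejected (assumed form for illustration only)."""
    sft_term = -policy_gold_logps.mean()  # behavioral cloning on the gold response
    pref_term = -F.logsigmoid(
        beta * (policy_chosen_logps - policy_rejected_logps)).mean()
    return sft_term + alpha * pref_term

# Toy usage with summed sequence log-probabilities for a batch of 4 prompts.
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
print(triple_preference_loss_sketch(lp(), lp(), lp()))
```

The sketch assumes per-sequence log-probabilities have already been summed over tokens; in practice these come from a forward pass of the policy (and, for DPO, a frozen reference model) over each response.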