Self-Play Preference Optimization for Language Model Alignment (2405.00675v5)
Abstract: Standard reinforcement learning from human feedback (RLHF) approaches that rely on parametric models like the Bradley-Terry model fall short of capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate LLM alignment. In this paper, we propose a self-play-based method for LLM alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium. Additionally, we propose a new SPPO objective that is both strongly motivated by theory and simple and effective in practice. In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset, no prompt augmentation, and a pre-trained 0.4B-parameter preference model (PairRM), SPPO fine-tunes Mistral-7B-Instruct-v0.2 into a model that achieves a state-of-the-art length-controlled win rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms iterative DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. Starting from the stronger base model Llama-3-8B-Instruct, we achieve a length-controlled win rate of 38.77%. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses or preference labels) from GPT-4 or other stronger LLMs. Code is available at https://github.com/uclaml/SPPO.
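To make the iterative scheme concrete, below is a minimal sketch of the kind of self-play square-loss update the abstract alludes to: in each round, responses are sampled from the current policy, a small pairwise preference model (such as PairRM) estimates how often each response beats the current policy's own responses, and the log-ratio between the fine-tuned policy and the current policy is regressed toward that centered, scaled win probability. The function name, tensor shapes, and the scaling constant `eta` are illustrative assumptions, not the paper's exact objective.

```python
import torch

def sppo_style_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    win_prob: torch.Tensor,
                    eta: float = 1000.0) -> torch.Tensor:
    """Square-loss sketch of a self-play preference update (assumed form).

    logp_new: log pi_theta(y|x) under the policy being fine-tuned.
    logp_old: log pi_t(y|x) under the frozen current-round policy.
    win_prob: estimated probability that response y beats the current
              policy's own responses on prompt x (e.g., from a pairwise
              preference model such as PairRM).
    """
    log_ratio = logp_new - logp_old           # log pi_theta / pi_t
    target = eta * (win_prob - 0.5)           # centered, scaled win rate
    return ((log_ratio - target) ** 2).mean()

# Illustrative usage with dummy per-response values for one mini-batch.
logp_new = torch.tensor([-12.3, -15.1, -9.8], requires_grad=True)
logp_old = torch.tensor([-12.0, -14.9, -10.2])
win_prob = torch.tensor([0.70, 0.40, 0.55])
loss = sppo_style_loss(logp_new, logp_old, win_prob)
loss.backward()
```

In such a loop, the fine-tuned policy of one round would become the frozen policy of the next, which matches the self-play iteration described in the abstract: only prompts and a pairwise preference model are needed, with no responses or preference labels from a stronger LLM.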
- A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
- A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
- Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika 39 324–345.
- Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
- Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
- Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
- AlpacaFarm: A simulation framework for methods that learn from human feedback. Advances in Neural Information Processing Systems 36.
- Contextual dueling bandits. In Conference on Learning Theory. PMLR.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
- Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 79–103.
- Scaling laws for reward model overoptimization. In International Conference on Machine Learning. PMLR.
- Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning. PMLR.
- DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691.
- Reinforcement learning from human feedback with active queries. arXiv preprint arXiv:2402.09401.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561.
- Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- Statistical rejection sampling improves preference optimization. arXiv preprint arXiv:2309.06657.
- Active ranking without strong stochastic transitivity. Advances in Neural Information Processing Systems.
- Nash learning from human feedback. arXiv preprint arXiv:2312.00886.
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 27730–27744.
- Smaug: Fixing failure modes of preference optimisation with DPO-positive. arXiv preprint arXiv:2402.13228.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36.
- Direct Nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715.
- WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM 64 99–106.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.
- A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056.
- Thurstone, L. (1927). A law of comparative judgment. Psychological Review 34 273.
- Tversky, A. (1969). Intransitivity of preferences. Psychological Review 76 31.
- Is RLHF more difficult than standard RL? A theoretical perspective. Advances in Neural Information Processing Systems 36.
- Borda regret minimization for generalized linear dueling bandits. In ICML 2023 Workshop The Many Facets of Preference-Based Learning.
- Gibbs sampling from human feedback: A provable KL-constrained framework for RLHF. arXiv preprint arXiv:2312.11456.
- Some things are more cringe than others: Preference optimization with the pairwise cringe loss. arXiv preprint arXiv:2312.16682.
- A theoretical analysis of Nash learning from human feedback under general KL-regularized preference. arXiv preprint arXiv:2402.07314.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36.
- Principled reinforcement learning with human feedback from pairwise or k-wise comparisons. arXiv preprint arXiv:2301.11270.