
Self-Play Preference Optimization for Language Model Alignment

(2405.00675)
Published May 1, 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract

Traditional reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates and enjoys a theoretical convergence guarantee. Our method can effectively increase the log-likelihood of the chosen response and decrease that of the rejected response, which cannot be trivially achieved by symmetric pairwise losses such as Direct Preference Optimization (DPO) and Identity Preference Optimization (IPO). In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench and the Open LLM Leaderboard. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.

Figure: Comparison of model win rates, showing SPPO's superior performance with stronger judge models.

Overview

  • The paper introduces Self-Play Preference Optimization (SPPO), a new approach in Reinforcement Learning from Human Feedback (RLHF) for LLMs, aiming to more effectively align LLM outputs with complex human preferences by approximating a Nash equilibrium.

  • SPPO differentiates itself with a convergence guarantee based on multiplicative-weight updates, practical evaluation on UltraFeedback with the PairRM preference model, and an explicit focus on handling non-transitivity in preferences, achieving closer alignment with human preferences than existing RLHF methods.

  • SPPO demonstrates robust performance in empirical studies, showing superior alignment across benchmarks and good scalability, and the paper points to potential applications in diverse AI domains along with possible enhancements in response sampling and preference estimation.

Understanding Self-Play Preference Optimization for Aligning LLMs

Introduction

Reinforcement Learning from Human Feedback (RLHF) has significantly advanced the development of LLMs, which are pivotal in generating human-like responses in various scenarios. However, existing RLHF techniques, heavily reliant on parametric models like the Bradley-Terry model, do not adequately address the complexity and non-transitivity found in human preferences. The paper introduces a novel approach, Self-Play Preference Optimization (SPPO), that reimagines RLHF, focusing on approximating the Nash equilibrium in a two-player constant-sum game. This method leverages iterative updates to refine LLM responses, aligning them more closely with human-like preferences.
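As a minimal formal sketch of this framing (notation mine, not quoted verbatim from the paper): the aligned policy is cast as the Nash equilibrium of a symmetric two-player constant-sum game in which each player proposes responses and the payoff is the probability that one response is preferred over the other.

```latex
% Two-player constant-sum game over policies (sketch; notation approximate).
% P(y \succ y' \mid x) denotes the probability that response y is preferred to y' for prompt x.
\pi^{*} \;=\; \arg\max_{\pi}\ \min_{\pi'}\;
  \mathbb{E}_{x \sim \mathcal{X},\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
  \left[ \mathbb{P}(y \succ y' \mid x) \right]
```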

What SPPO Brings to the Table

SPPO proposes a distinct methodology from traditional RLHF by emphasizing direct engagement with preference probabilities, improving flexibility in capturing human preferences. Here's what makes SPPO stand out:

  • Provably Convergent: SPPO's iterative update follows a multiplicative-weight (exponential-weight) scheme with a theoretical guarantee that, over iterations, the policy approaches the Nash equilibrium (see the sketch after this list).
  • Practical Excellence: Empirically tested on the UltraFeedback dataset with the PairRM preference model, SPPO showcases significant improvements. For instance, it achieves a 28.53% length-controlled win-rate over GPT-4-Turbo in the AlpacaEval 2.0 setup.
  • Deep Focus on Preference Interactions: Unlike symmetric pairwise losses such as DPO and IPO, SPPO's objective is designed to increase the log-likelihood of the chosen response and decrease that of the rejected one, addressing a common shortfall of those symmetric loss functions.
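To make the two points above concrete, here is a hedged sketch of the update and loss the method is built around (my paraphrase of the mechanics described in the paper; the exact constants and normalization may differ):

```latex
% Multiplicative-weight update toward the Nash equilibrium (sketch):
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
  \exp\!\big( \eta\, \mathbb{P}(y \succ \pi_t \mid x) \big)

% Practical surrogate (approximate form): fit \pi_\theta with a squared loss that
% pushes the log-ratio toward \eta(\hat{\mathbb{P}} - 1/2), so responses with
% estimated win probability above 1/2 gain log-likelihood and those below 1/2 lose it.
\mathcal{L}(\theta) \;=\;
  \mathbb{E}_{x \sim \mathcal{X},\; y \sim \pi_t(\cdot \mid x)}
  \Big[ \Big( \log \tfrac{\pi_\theta(y \mid x)}{\pi_t(y \mid x)}
        \;-\; \eta \big( \hat{\mathbb{P}}(y \succ \pi_t \mid x) - \tfrac{1}{2} \big) \Big)^{2} \Big]
```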

Theoretical Foundations and Practical Implications

SPPO constructs its methodology around the idea of each model iteration playing against its predecessor, honing the policy through self-play in a way that is both practical and theoretically grounded. It suggests that:

  • Effective Self-Play: By iteratively playing against itself, the model self-adjusts through exposure to a diverse range of responses generated from past iterations, enriching its response quality over time (a procedural sketch follows this list).
  • Handling Non-Transitivity: Directly tackling the non-transitivity in human preferences makes SPPO particularly adept at managing complex preference scenarios, unlike the transitivity assumptions built into parametric models like Bradley-Terry.
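Below is a minimal Python sketch of one such self-play iteration, assuming hypothetical `sample_responses` and `prefer_prob` helpers that stand in for the current policy and a PairRM-style preference model; it illustrates the general loop, not the authors' implementation:

```python
# Hedged sketch of one SPPO-style self-play iteration. The helper callables
# (`sample_responses`, `prefer_prob`) are hypothetical placeholders, not the
# paper's code or any library's actual API.
from typing import Callable, List, Tuple


def sppo_iteration(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],  # current policy pi_t: (prompt, K) -> K responses
    prefer_prob: Callable[[str, str, str], float],      # preference model: P(a beats b | prompt)
    k: int = 5,
    eta: float = 1.0,                                    # multiplier in the regression target (illustrative value)
) -> List[Tuple[str, str, float]]:
    """Build (prompt, response, regression_target) triples for the next fine-tuning step.

    Each target encodes eta * (P_hat(y beats pi_t | x) - 1/2); a later fine-tuning
    stage (not shown) would push log(pi_theta(y|x) / pi_t(y|x)) toward this value.
    """
    training_triples: List[Tuple[str, str, float]] = []
    for x in prompts:
        ys = sample_responses(x, k)  # K candidate responses drawn from pi_t
        for i, y in enumerate(ys):
            # Monte Carlo estimate of P(y beats a fresh sample from pi_t | x),
            # approximated by averaging pairwise scores against the other candidates.
            others = [ys[j] for j in range(len(ys)) if j != i]
            p_hat = sum(prefer_prob(x, y, other) for other in others) / max(len(others), 1)
            training_triples.append((x, y, eta * (p_hat - 0.5)))
    return training_triples
```

The triples produced here would feed a fine-tuning step that regresses the log-density ratio toward each target, matching the squared-loss surrogate sketched earlier; repeating the cycle yields the iterative self-play described above.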

SPPO Experimentation and Observations

SPPO's real-world application involves a series of experiments in which a base LLM is improved iteratively with minimal external supervision. Some notable achievements include:

  • Strong Performance Across Benchmarks: In comparative studies with existing methods, SPPO consistently demonstrates superior ability to align LLM outputs with human preferences across various benchmarks like MT-Bench and the Open LLM Leaderboard.
  • Scalability and Efficiency: Despite relying on a small 0.4B-parameter preference model (PairRM) and only about 60k prompts, SPPO-fine-tuned models match or even surpass much larger models in head-to-head comparisons.

Future Directions and Speculation

Looking ahead, SPPO sets a promising path for further research into efficient and scalable solutions for LLM training. Future research could explore:

  • Broader Application Domains: Applying SPPO in other areas of AI, such as automated dialog systems or personalized learning environments, could provide increased interactivity and satisfaction.
  • Improvements in Sampling and Estimation: Enhancements in how responses are sampled and preferences are estimated could lead to even more robust models.
  • Integration with Other Learning Paradigms: Combining SPPO's approach with other machine learning paradigms might yield interesting synergies, particularly in areas requiring nuanced understanding of human feedback.

In summary, the SPPO framework not only strengthens the theoretical foundation of RLHF for LLMs but also performs strongly in empirical evaluations. This dual strength paves the way for crafting more responsive and human-aligned language models in the future.
