Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

(2404.03715)
Published Apr 4, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

This paper studies post-training LLMs using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While recent advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research sidesteps the reward maximization presumption in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.
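To make the "general preferences" setting concrete: this line of work replaces a scalar reward with a pairwise preference function and targets the Nash equilibrium of the induced two-player game. The formulation below is our paraphrase of that standard setup rather than a verbatim statement from the paper; P denotes the preference oracle and rho the prompt distribution.

```latex
% Pairwise preference oracle: probability that response y is preferred to y' for prompt x
%   \mathcal{P}(y \succ y' \mid x) \in [0, 1]
% Target policy: (symmetric) Nash equilibrium of the two-player preference game
\pi^{\star} \;=\; \arg\max_{\pi}\; \min_{\pi'} \;
  \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
  \big[\, \mathcal{P}(y \succ y' \mid x) \,\big]
```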

Figure: Comparison of post-training techniques; Direct Nash Optimization (DNO) is the most effective; colored error bands highlight self-implemented methods.

Overview

  • Introduction of Direct Nash Optimization (DNO), a novel framework for aligning LLMs with complex human preferences without relying on scalar rewards.

  • DNO combines the theoretical robustness of optimizing general preferences with the practical efficiency of contrastive learning, avoiding explicit reward function computation.

  • Evidence shows DNO achieves superior performance on benchmarks, including a significant gain on AlpacaEval 2.0, and demonstrates monotonic improvement over iterations.

  • The research outlines theoretical benefits, practical efficacy, and future potential of DNO in refining LLMs beyond traditional reward-focused approaches.

Exploring Direct Nash Optimization for Self-Improving Language Models

Introduction to Direct Nash Optimization

In artificial intelligence research, and particularly in the development of LLMs, aligning models with complex human preferences has emerged as a significant challenge. Traditional approaches to post-training LLMs, such as Reinforcement Learning from Human Feedback (RLHF), focus on maximizing a scalar reward. However, this methodology encounters limitations when expressing general preferences, especially intransitive or cyclic preference relations. Addressing this challenge, the recent study on Direct Nash Optimization (DNO) presents a novel framework that diverges from the conventional reward-focused paradigm, embracing the optimization of general preferences through a scalable, contrastive-learning-based algorithm.
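As a toy illustration of the intransitivity problem (our own example, not from the paper): a "rock-paper-scissors" preference among three responses cannot be reproduced by any Bradley-Terry-style scalar reward, because scalar scores always induce a transitive ordering.

```python
# Toy example (not from the paper): cyclic preferences that no scalar reward can express.
# P[(i, j)] = probability that response i is preferred over response j.
import itertools

responses = ["A", "B", "C"]
P = {
    ("A", "B"): 0.9,  # A usually beats B
    ("B", "C"): 0.9,  # B usually beats C
    ("C", "A"): 0.9,  # C usually beats A -> a preference cycle
}

def prefers(i, j):
    """Probability that response i is preferred over response j."""
    return P[(i, j)] if (i, j) in P else 1.0 - P[(j, i)]

# A point-wise (Bradley-Terry) reward model scores each response with a scalar r(y)
# and predicts P(i beats j) = sigmoid(r(i) - r(j)), which is always transitive.
# No assignment of distinct scalar scores can match all three observed preferences:
for scores in itertools.permutations(range(3)):
    reward = dict(zip(responses, scores))
    matches_all = all((reward[i] > reward[j]) == (prefers(i, j) > 0.5) for (i, j) in P)
    assert not matches_all, "impossible: a transitive scalar ordering matched the cycle"
print("No scalar reward ordering reproduces the cyclic preferences.")
```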

Key Contributions of the Study

The study introduces DNO, an algorithm that combines the theoretical robustness associated with optimizing general preferences with the practical efficiency and stability of contrastive learning. The following points summarize the critical contributions and findings of this work:

  1. Algorithmic Foundation: DNO leverages batched on-policy iterations together with a regression-based objective (sketched after this list), yielding a stable and efficient approach to optimizing general preferences. This methodology sidesteps the need to compute an explicit reward function.
  2. Theoretical Insights: The paper demonstrates, through theoretical analysis, that DNO converges to a Nash equilibrium on average, offering a mathematical underpinning for its approach to learning from general preference feedback.
  3. Practical Efficacy: Empirical evaluations show that DNO, when applied to a 7B parameter language model, outperforms its counterparts, achieving record performance on standard benchmarks such as AlpacaEval 2.0.
  4. Monotonic Improvement: DNO is proven to exhibit monotonic improvement across iterations, ensuring consistent progress in aligning the LLM with the targeted preferences.
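For readers wondering what the "regression-based objective" looks like in practice: our understanding is that the paper's practical recipe fits the current policy to oracle-annotated preferred/dispreferred response pairs with a contrastive, DPO-style loss, roughly of the form below. This is a sketch under that assumption; beta is a temperature, pi_ref the previous iterate, and (x, y+, y-) an on-policy preference pair.

```latex
\mathcal{L}(\theta) \;=\;
  -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}
  \left[\,
    \log \sigma\!\left(
      \beta \log \frac{\pi_{\theta}(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)}
      \;-\;
      \beta \log \frac{\pi_{\theta}(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}
    \right)
  \,\right]
```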

Theoretical and Practical Implications

The exploration of DNO contributes significantly to both the theoretical understanding and practical applications of post-training LLMs with human feedback. Specifically, the paper sheds light on the following aspects:

  • Expressing Complex Preferences: By moving away from scalar reward functions, DNO addresses the critical limitation of expressing complex, potentially intransitive preferences, paving the way for more nuanced LLM tuning.
  • Stability and Efficiency: The batched on-policy approach, combined with a regression-based objective (see the iteration sketch below), marks a stride towards achieving both theoretical soundness and practical efficiency in learning from human feedback.
  • Benchmark Performance: The state-of-the-art performance of the resulting 7B parameter model underscores DNO's effectiveness in real-world applications, suggesting its potential as a new standard for post-training LLMs.
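To give a rough, schematic picture of what "batched on-policy" means here, the sketch below outlines one iteration of a DNO-style loop under our own assumptions; sample, preference_oracle, and contrastive_update are hypothetical placeholders standing in for the model's decoder, the preference annotator (e.g., GPT-4), and the regression-based training step, and the pair-filtering rule is illustrative rather than the paper's exact recipe.

```python
from typing import Callable, List, Tuple

def dno_iteration(
    policy,                                    # current model pi_t
    prompts: List[str],
    sample: Callable[[object, str, int], List[str]],      # hypothetical: draw k on-policy responses
    preference_oracle: Callable[[str, str, str], float],  # hypothetical: P(y1 preferred to y2 | x)
    contrastive_update: Callable[[object, List[Tuple[str, str, str]]], object],  # hypothetical DPO-style step
    k: int = 4,
    margin: float = 0.25,
):
    """One batched on-policy iteration: sample from pi_t, compare with the oracle,
    then run a single regression/contrastive pass over the collected pairs."""
    pairs: List[Tuple[str, str, str]] = []
    for x in prompts:
        candidates = sample(policy, x, k)      # on-policy: candidates come from pi_t itself
        for y_win in candidates:
            for y_lose in candidates:
                if y_win is y_lose:
                    continue
                # Keep only pairs the oracle clearly prefers, to reduce label noise.
                if preference_oracle(x, y_win, y_lose) >= 0.5 + margin:
                    pairs.append((x, y_win, y_lose))   # (prompt, preferred, dispreferred)
    # Offline, regression-based update on the batch of pairs -> pi_{t+1}
    return contrastive_update(policy, pairs)
```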

Future Directions

While DNO marks a significant advancement in the alignment of LLMs with human preferences, it also opens avenues for further exploration. Future work could focus on extending the algorithm to broader applications beyond text generation, exploring the integration of DNO with other LLM architectures, and further refining the algorithm for even greater efficiency and scalability.

Conclusion

The development and study of Direct Nash Optimization represent a noteworthy advancement in optimizing LLMs for alignment with human preferences. By theoretically and empirically demonstrating the effectiveness of this approach, the research sets a new precedent for future endeavors in fine-tuning language models in a manner that more accurately reflects the intricacies of human preferences.
