
Self-Improving Robust Preference Optimization

(2406.01660)
Published Jun 3, 2024 in cs.LG, cs.AI, and stat.ML

Abstract

Both online and offline RLHF methods, such as PPO and DPO, have been extremely successful in aligning AI with human preferences. Despite their success, existing methods suffer from a fundamental problem: their optimal solution is highly task-dependent (i.e., not robust to out-of-distribution (OOD) tasks). Here we address this challenge by proposing Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that is completely robust to changes in the task. The key idea of SRPO is to cast the problem of learning from human preferences as a self-improvement process, which can be mathematically expressed in terms of a min-max objective that jointly optimizes a self-improvement policy and a generative policy in an adversarial fashion. The solution of this optimization problem is independent of the training task and is thus robust to changes in it. We then show that this objective can be re-expressed as a non-adversarial offline loss, which can be optimized using standard supervised techniques at scale without any need for a reward model or online inference. We show the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when SRPO is evaluated on the OOD XSUM dataset, it outperforms the celebrated DPO by a clear margin of 15% after 5 self-revisions, achieving a WR of 90%.

Figure: Win rates of SRPO vs. human summaries across N-revision iterations at varying alpha values.

Overview

  • The paper 'Self-Improving Robust Preference Optimization' (SRPO) proposes a novel offline framework for Reinforcement Learning from Human Feedback (RLHF) that is robust to out-of-distribution (OOD) tasks.

  • The SRPO framework employs a two-step self-improvement process to achieve robustness: learning an in-context self-improvement model $\pi_\dagger$ and optimizing a robust generative large language model (LLM).

  • Numerical results demonstrate significant improvements over traditional methods such as DPO and IPO, with SRPO achieving markedly higher AI Win-Rate on OOD tasks such as the XSUM dataset.

Self-Improving Robust Preference Optimization: An Overview

Introduction

The paper "Self-Improving Robust Preference Optimization" introduces a new offline framework for Reinforcement Learning from Human Feedback (RLHF), focusing on mitigating the sensitivity of current RLHF methods to out-of-distribution (OOD) tasks. Unlike existing methods such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), which are task-dependent, the proposed Self-Improving Robust Preference Optimization (SRPO) framework aims to be robust against variations in the task distribution. The proposed method aligns AI preferences with human preferences using a theoretically grounded and practical min-max optimization approach. This essay provides a comprehensive overview of the SRPO method, strong numerical results, and its implications for future AI development.

Key Ideas and Contributions

The Problem with Existing Methods

Current RLHF methods face significant limitations due to their dependence on the training task's distribution. When the evaluation distribution deviates significantly from the training distribution, the performance of these methods degrades. This dependency makes it challenging to generalize and apply these models to OOD tasks.

SRPO Framework

The SRPO framework addresses these challenges by introducing a two-step self-improvement process:

  1. In-Context Self-Improving Preference Optimization: This step learns an in-context self-improvement model $\pi_\dagger$, which iteratively generates improved outputs conditioned on the completions and contexts produced by the initial model.
  2. Robust Preference Optimization of the Generative Model: Using the self-improvement policy learned in the first step, the framework trains a robust generative large language model (LLM) $\pi$. This model is optimized so that its outputs require minimal improvement, which is what yields robustness across different distributions (the sketch after this list illustrates how the two policies interact at inference time).
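
At inference time, the two policies compose naturally: the generative model proposes an initial completion and the self-improvement model revises it repeatedly. Below is a minimal sketch of that loop; the `generate` and `improve` callables and the default of 5 revisions are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable

def self_revise(
    prompt: str,
    generate: Callable[[str], str],      # assumed: samples y_1 ~ pi(. | x)
    improve: Callable[[str, str], str],  # assumed: samples y_2 ~ pi_dagger(. | y_1, x)
    n_revisions: int = 5,                # e.g. 5 self-revisions, as in the paper's evaluation
) -> str:
    """Iteratively refine a completion with the self-improvement policy."""
    y = generate(prompt)                 # initial completion from the generative LLM
    for _ in range(n_revisions):
        y = improve(prompt, y)           # condition on the prompt and the previous completion
    return y
```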

By recasting the min-max problem as a joint supervised optimization, SRPO circumvents the need for a reward model and for online inference during training.
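
To give a sense of what such a training step looks like, the schematic below combines log-probability ratios of the two policies against a frozen reference on a fixed preference triple, with no reward model and no sampling from the model. The particular combination of terms, the assumed `LogProb` callables, and the beta weighting are placeholders for illustration; this is not the exact non-adversarial loss derived in the paper.

```python
from typing import Callable
import torch

# Assumed signature: log_prob(completion, *conditioning) -> scalar tensor of sequence log-probability.
LogProb = Callable[..., torch.Tensor]

def joint_offline_loss(
    logp_pi: LogProb,      # generative policy pi(y | x)
    logp_dagger: LogProb,  # self-improvement policy pi_dagger(y2 | y1, x)
    logp_ref: LogProb,     # frozen reference policy
    prompt: str,
    chosen: str,           # y_w: preferred completion
    rejected: str,         # y_l: dispreferred completion
    beta: float,
) -> torch.Tensor:
    """Schematic offline loss on one (x, y_w, y_l) triple; a stand-in, not the paper's objective."""
    # The improvement policy should upweight the preferred completion when
    # conditioned on the dispreferred one and the prompt.
    improve_term = logp_dagger(chosen, rejected, prompt) - logp_ref(chosen, rejected, prompt)

    # The generative policy should produce completions that need little further improvement.
    gen_term = logp_pi(chosen, prompt) - logp_ref(chosen, prompt)

    # Minimizing this with standard gradient descent updates both policies jointly.
    return -beta * (improve_term + gen_term)
```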

The Mathematical Foundation of SRPO

The SRPO framework is formally expressed as a min-max optimization problem:

$$
J^*(x) = \min_{\pi} \max_{\pi_\dagger} \mathbb{E}\left[ p(y_2 \succ y_1 \mid x) - \beta\, \mathrm{KL}\!\left(\pi_\dagger \,\|\, \pi_{\mathrm{ref}} \mid y_1, x\right) + \beta\, \mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}} \mid x\right) \right]
$$

The inner maximization can be solved in closed form for $\pi_\dagger$:

$$
\pi^*_\dagger(y_2 \mid y_1, x) = \frac{\exp\!\left(\frac{p(y_2 \succ y_1 \mid x)}{\beta}\right) \pi_{\mathrm{ref}}(y_2 \mid y_1, x)}{Z^*(y_1, x)}
$$

The solution to this problem is then translated into a non-adversarial offline supervised loss, allowing the joint optimization of both $\pi_\dagger$ and $\pi$.
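
To make the closed form concrete, the toy snippet below evaluates it over a small discrete set of candidate revisions: each candidate's reference probability is tilted by $\exp(p(y_2 \succ y_1 \mid x)/\beta)$ and the result is renormalized by $Z^*(y_1, x)$. The candidate preference and reference probabilities are made-up numbers.

```python
import math

def closed_form_improvement(pref_probs, ref_probs, beta):
    """pi*_dagger(y_2 | y_1, x) over a finite candidate set.

    pref_probs[i] = p(y_i succ y_1 | x); ref_probs[i] = pi_ref(y_i | y_1, x).
    Returns the exponentially tilted, renormalized distribution.
    """
    weights = [math.exp(p / beta) * q for p, q in zip(pref_probs, ref_probs)]
    z = sum(weights)  # the partition function Z*(y_1, x)
    return [w / z for w in weights]

# Toy example: three candidate revisions of an initial completion y_1 (values are illustrative).
print(closed_form_improvement(pref_probs=[0.5, 0.7, 0.9], ref_probs=[0.5, 0.3, 0.2], beta=0.1))
# A smaller beta concentrates mass on the candidates most strongly preferred over y_1.
```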

Numerical Results

The paper evaluates the SRPO framework against the well-established baselines DPO and IPO. Notably, SRPO demonstrates substantial improvements in AI Win-Rate (WR) against human completions. For instance, when evaluated on the OOD XSUM dataset, SRPO outperforms DPO by a margin of 15% after 5 self-revisions, reaching a WR of 90%.
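
The win-rate metric itself is straightforward: for each prompt, a judge decides whether the model's completion is preferred to the human-written (GOLD) reference, and WR is the fraction of wins. A minimal sketch, assuming a user-supplied `judge` callable (the paper's judging setup is not reproduced here):

```python
from typing import Callable, Sequence

def win_rate(
    model_outputs: Sequence[str],
    gold_outputs: Sequence[str],
    judge: Callable[[str, str], bool],  # assumed: True if the model output is preferred to the reference
) -> float:
    """Fraction of prompts on which the model completion beats the human (GOLD) completion."""
    wins = sum(judge(m, g) for m, g in zip(model_outputs, gold_outputs))
    return wins / len(model_outputs)
```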

Implications and Future Research

Practical Implications:

  1. Robustness to Task Distribution: SRPO's independence from the behavior policy $\mu$ ensures that it remains robust to distribution shifts, making it suitable for deployment across various tasks without task-specific retraining.
  2. Scalability: The transformation of the min-max optimization problem into a joint supervised loss facilitates scalable, large-scale implementation.

Theoretical Implications:

  1. Generalization of Preference Models: Unlike previous models restricted to the Bradley-Terry framework, SRPO's formulation holds across all preference models, enhancing its applicability to diverse scenarios.
  2. Self-Improvement Mechanism: The self-improvement policy embedded within SRPO introduces a novel paradigm in language model training, focusing on iterative refinement of completions.

Future Developments:

  1. Application to Complex Multi-Task Benchmarks: Testing SRPO on more complex multi-task benchmarks could validate its robustness and scalability further.
  2. Improving Algorithms for General AI: By leveraging the robustness and self-improvement capabilities of SRPO, future research could focus on enhancing general AI's ability to perform consistently across varied tasks and distributions.

Conclusion

The "Self-Improving Robust Preference Optimization" framework represents a significant step towards making RLHF methods more robust and scalable. Through its innovative approach to self-improvement and robust preference optimization, SRPO addresses the limitations of task dependency in current methods. The strong numerical results and theoretical foundations pave the way for more resilient AI systems capable of maintaining performance across diverse and unforeseen tasks. Future research will likely build on these foundations, exploring broader applications and further refining robust AI training methodologies.
