West-of-N: Synthetic Preference Generation for Improved Reward Modeling

(2401.12086)
Published Jan 22, 2024 in cs.CL , cs.AI , and cs.LG

Abstract

The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.

West-of-N self-training generates pseudo-preference pairs to enhance reward model training.

Overview

  • The paper presents a method called West-of-N sampling for creating synthetic preference data to improve reward models in LLMs.

  • The need for quality preference data in Reinforcement Learning from Human Feedback is highlighted, with cost and effort being major hurdles.

  • Previous strategies like Best-of-N sampling have successfully improved language model outputs, and West-of-N aims to replicate this success within reward model training.

  • The paper provides empirical evidence that their synthetic data generation method performs on par with human-generated data and improves reward model performance.

  • Future research directions are suggested, particularly in the potential of self-training extensions to further enhance reward models.

Introduction

Reinforcement Learning from Human Feedback (RLHF) has been central to the success of LLMs, and the quality of the optimized model output hinges critically on the fidelity of the underlying reward model. Building a robust reward model, in turn, depends on procuring high-quality preference data, a process that can be cost-prohibitive and labor-intensive. Addressing this bottleneck, the paper introduces a novel method for generating synthetic preference data to enhance reward model training, thereby directly benefiting language model alignment.

Related Work

The problem framing is rooted in the established understanding that procuring and curating high-value preference data is crucial for modeling human preferences effectively. Prior strategies such as Best-of-N sampling have proven effective at improving language model outputs by steering models towards favorable generations, but their application to reward model optimization has not been thoroughly explored. Concurrently, self-training methods within the semi-supervised learning paradigm have shown promise across various domains in AI, yet their potential in reward modeling for LLMs remains untapped.
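
For context, Best-of-N sampling itself is simple to state: draw N candidate generations for a query and keep the one a reward model scores highest. The sketch below is a minimal illustration, assuming hypothetical `generate` and `reward_model` callables rather than any interface from the paper.

```python
def best_of_n(query, generate, reward_model, n=16):
    """Sample n candidate responses and return the one the reward model scores highest.

    `generate` and `reward_model` are illustrative placeholders:
    generate(query) -> response string, reward_model(query, response) -> float.
    """
    candidates = [generate(query) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(query, response))
```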

Approach and Contributions

The paper proposes a scheme termed West-of-N sampling: through self-training, synthetic high-quality preference pairs are produced by identifying the best and worst responses within a pool of sampled outputs to an input query, as judged by the current reward model. This approach yields substantial gains in reward model performance, with empirical validation suggesting its effect is comparable to adding an equivalent amount of human preference data. The authors highlight three principal contributions: a new method for creating synthetic preference data, validation of the method's ability to boost reward model performance, and the first evidence of the utility of Best-of-N sampling within the scope of reward model training.
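
A minimal sketch of how such West-of-N pair generation could look, again assuming the hypothetical `generate` and `reward_model` callables from the previous snippet (the scoring model standing in for the base preference model used in self-training):

```python
def west_of_n_pair(query, generate, reward_model, n=16):
    """Form a synthetic preference pair from the best and worst of n samples."""
    candidates = [generate(query) for _ in range(n)]
    # Rank the pool with the base reward model; extremes become the pseudo-labels.
    ranked = sorted(candidates, key=lambda response: reward_model(query, response))
    return {"query": query, "preferred": ranked[-1], "rejected": ranked[0]}


def build_synthetic_dataset(queries, generate, reward_model, n=16):
    """One West-of-N pseudo-preference pair per unlabeled query."""
    return [west_of_n_pair(q, generate, reward_model, n) for q in queries]
```

The resulting pseudo-preference pairs can then be mixed with human preference data when retraining the reward model.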

Empirical Validation and Avenues for Future Research

Empirical trials demonstrate the method's effectiveness across multiple datasets, showing consistent improvements over existing synthetic data generation approaches such as RLAIF and RLCD. The findings hold across varying initial data conditions, supporting the method's broad applicability. The paper also analyzes self-training strategies in depth, shedding light on mechanisms that drive the approach's success. These analyses point to further research directions, such as self-training extensions that could yield additional gains in reward model performance.
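
One mechanism of this kind, sketched here purely as an assumption about how it might be implemented rather than as the paper's exact procedure, is keeping only the synthetic pairs that the base model prefers with high confidence; `preference_probability` and the 0.9 threshold below are illustrative placeholders.

```python
def filter_pairs(pairs, preference_probability, threshold=0.9):
    """Keep only synthetic pairs the base model is confident about.

    `preference_probability(query, preferred, rejected)` is an assumed interface
    returning P(preferred beats rejected) under the base preference model.
    """
    return [
        pair for pair in pairs
        if preference_probability(pair["query"], pair["preferred"], pair["rejected"]) >= threshold
    ]
```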

The paper lays the groundwork for subsequent work on refining RLHF methodologies, emphasizing the central role synthetic preference generation can play in the continued improvement of language model alignment.
