
Abstract

Reward models (RMs) are crucial for aligning LLMs with human preferences. They are trained using preference datasets in which each example consists of one input prompt, two responses, and a preference label. Since curating a high-quality, human-labeled preference dataset is both time-consuming and expensive, practitioners often rely on existing powerful LLMs to generate preference labels, which can introduce noise and impede RM training. In this work, we present RMBoost, a novel synthetic preference data generation paradigm to boost reward model quality. Unlike traditional methods, which generate two responses before obtaining the preference label, RMBoost first generates one response and selects a preference label, and then generates the second, more (or less) preferred response conditioned on the pre-selected preference label and the first response. This approach offers two main advantages. First, RMBoost reduces labeling noise, since preference pairs are constructed intentionally. Second, RMBoost facilitates the creation of more diverse responses by incorporating various quality aspects (e.g., helpfulness, relevance, completeness) into the prompts. We conduct extensive experiments across three diverse datasets and demonstrate that RMBoost outperforms other synthetic preference data generation techniques and significantly boosts the performance of four distinct reward models.

Figure: Comparison of RMBoost with RLAIF and RLCD for synthetic preference data generation, highlighting how responses and preference labels are obtained.

Overview

  • The paper introduces RMBoost, a novel method for generating synthetic data to improve Reward Models (RMs) used for aligning LLMs with human preferences, addressing the limitations of previous methods.

  • RMBoost operates through a progressive framework involving generating an initial response, assigning a preference label, and generating a second response based on predefined multi-aspect evaluation instructions, enhancing diversity and reducing labeling noise.

  • Evaluations across various datasets demonstrate RMBoost's superiority in generating high-quality synthetic preference data, leading to improved performance of RMs over existing methods, with future directions focusing on optimization and broader applications.

Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation

The paper "Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation" introduces RMBoost, a novel synthetic data generation method designed to enhance the performance of Reward Models (RMs), which are essential for aligning LLMs with human preferences. RM training traditionally relies on preference datasets containing pairs of responses to prompts along with labels indicating the preferred response. RMBoost proposes a more nuanced method to generate these pairs, aiming to address the shortcomings of previous synthetic data generation techniques.

Methodology

The RMBoost framework operates in a distinct, progressive manner:

  1. Initial Response Generation: For each input prompt, RMBoost first generates a response using an LLM.
  2. Preference Label Assignment: A preference label is then pre-assigned, indicating whether the second response should be more or less preferred than the first.
  3. Conditional Second Response Generation: RMBoost conditions the generation of a second response on the initial response, the pre-assigned preference label, and predefined multi-aspect evaluation instructions (e.g., helpfulness, relevance).

This design ensures that preference pairs are not merely sampled LLM outputs but differ in diverse, intentional ways. Conditioning on multiple quality aspects lets responses vary along specific, intended dimensions, which reduces labeling noise and enhances response diversity.
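As a rough illustration of this generation order (not the paper's exact prompts), the sketch below assumes a generic `llm_generate` helper wrapping whatever LLM API is used, and shows how the pre-selected label and multi-aspect instructions condition the second response.

```python
import random

# Illustrative quality aspects; the paper's evaluation instructions are richer.
ASPECTS = ["helpfulness", "relevance", "completeness"]

def generate_preference_pair(llm_generate, prompt):
    # Step 1: generate the first response directly from the input prompt.
    response_1 = llm_generate(prompt)

    # Step 2: pre-assign the preference label for the second response.
    label = random.choice(["more_preferred", "less_preferred"])

    # Step 3: generate the second response conditioned on the first response,
    # the pre-selected label, and multi-aspect evaluation instructions.
    direction = "better" if label == "more_preferred" else "worse"
    conditional_prompt = (
        f"Prompt: {prompt}\n"
        f"Existing response: {response_1}\n"
        f"Write a new response that is {direction} than the existing one "
        f"with respect to these aspects: {', '.join(ASPECTS)}."
    )
    response_2 = llm_generate(conditional_prompt)

    # The preference label is known by construction, so no separate
    # LLM-judging step is needed.
    if label == "more_preferred":
        return prompt, response_2, response_1  # (prompt, chosen, rejected)
    return prompt, response_1, response_2
```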

Evaluation and Results

The paper evaluates RMBoost against existing synthetic data generation methods such as RLAIF, West-of-N, and RLCD across three diverse datasets:

  1. QA Feedback: A dataset for long-form question answering.
  2. Ultra Feedback: A dataset for general LLM alignment.
  3. TLDR Summarization: A dataset for summarization of Reddit posts.

Results demonstrate RMBoost's superiority in generating high-quality synthetic preference datasets. The experiments show that RMBoost significantly boosts the performance of various RM backbones (Gemini-Nano-1, Gemini-Nano-2, PaLM 2-XXS, and Gemma 2B). Notably, RMs trained on RMBoost data achieve higher preference prediction accuracy and remain robust when the synthetic data is mixed with real, human-labeled data.
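Preference prediction accuracy here is commonly computed as the fraction of held-out pairs for which the RM scores the chosen response above the rejected one. The sketch below shows that generic metric, assuming a hypothetical scalar-scoring `reward_model`; it is not the paper's exact evaluation code.

```python
def preference_accuracy(reward_model, eval_pairs):
    """eval_pairs: iterable of (prompt, chosen_response, rejected_response)."""
    correct = 0
    total = 0
    for prompt, chosen, rejected in eval_pairs:
        # Count a hit when the chosen response outscores the rejected one.
        if reward_model(prompt, chosen) > reward_model(prompt, rejected):
            correct += 1
        total += 1
    return correct / total if total else 0.0
```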

Implications and Future Work

Practically, RMBoost advances the methodology for creating synthetic preference data, making it a valuable tool for RM training when large-scale human-labeled datasets are impractical. Theoretically, it balances the trade-off between response distribution shift and label noise more effectively than previous methods, leading to a less biased RM and better downstream task performance.

Future developments might focus on optimizing RMBoost for different domains and extending its application to multi-modal data inputs. Enhancing the robustness and scalability of RMBoost will be crucial for broader adoption in diverse LLM applications.

Conclusion

RMBoost represents a significant methodological improvement for synthetic preference data generation and reward model training. By leveraging preference-conditional and multi-aspect synthetic data generation, RMBoost offers a pathway to more reliable and diverse synthetic datasets, enhancing the overall alignment and performance of LLMs. This work opens up new avenues for research in AI alignment and synthetic data generation, reaffirming the importance of carefully designed data creation methodologies in the development of advanced AI systems.
