RewardBench: Evaluating Reward Models for Language Modeling

(2403.13787)
Published Mar 20, 2024 in cs.LG

Abstract

Reward models (RMs) are at the crux of successful RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those reward models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. To date, very few descriptors of capabilities, training methods, or open-source reward models exist. In this paper, we present RewardBench, a benchmark dataset and code-base for evaluation, to enhance scientific understanding of reward models. The RewardBench dataset is a collection of prompt-win-lose trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We created specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO), and on a spectrum of datasets. We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.

Score distributions in the RewardBench dataset, comparing chosen vs. rejected responses as scored by DPO-trained models.

Overview

  • RewardBench introduces a comprehensive framework for evaluating reward models in reinforcement learning from human feedback, focusing on alignment with human values.

  • It includes a diverse set of prompts across domains like chat, reasoning, safety, and out-of-distribution queries, aiming to highlight limitations of current models.

  • Evaluation shows significant variability in model performance, with differences between Direct Preference Optimization models and classifier-based models in handling tasks.

  • The benchmark aims to spur further research into reward models, suggesting exploration of hybrid approaches and expansion to include dynamic scenarios and diverse perspectives.

Evaluating Reward Models for Language Modeling with RewardBench

Introduction to RewardBench

RewardBench presents a comprehensive framework for evaluating reward models in the context of Reinforcement Learning from Human Feedback (RLHF). This benchmark includes a diverse set of prompts to test reward models across various domains such as chat, reasoning, safety, and out-of-distribution queries. One of the primary goals is to explore the limitations of contemporary reward models and how they align with human values within language models. Further, RewardBench seeks to establish a repository that encourages reproducibility and consistent benchmarking across reward models, addressing a gap in the current literature where few resources exist for such evaluations.

Dataset Construction and Evaluation

RewardBench is structured into five principal sections, with prompts drawn from newly collected data and repurposed from existing benchmarks. Notably, the dataset emphasizes the role of refusals in safe content generation, includes instruction-following and reasoning tasks, and tests reward models against crafted adversarial prompts to probe how they handle nuanced language-understanding tasks.
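
To make the prompt-win-lose format concrete, here is a hedged sketch of scoring a single trio with an off-the-shelf classifier-style reward model from the Hugging Face Hub. The checkpoint name, example texts, and input formatting are illustrative choices, not part of the RewardBench specification.

```python
# Hedged sketch: scoring one prompt-win-lose trio with a classifier-style reward
# model. The checkpoint and example texts are illustrative only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example RM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

prompt = "What is the capital of Australia?"
chosen = "The capital of Australia is Canberra."
rejected = "The capital of Australia is Sydney."

def reward(prompt: str, response: str) -> float:
    """Scalar reward the model assigns to a prompt-response pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# The trio counts as "correct" if the chosen response receives the higher reward.
print(reward(prompt, chosen) > reward(prompt, rejected))
```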

The evaluation metric primarily used is accuracy, calculated as the percentage of instances where a reward model correctly identifies the preferred completion from a pair. This binary classification approach offers a straightforward measure of a reward model's effectiveness in aligning with human judgment. The final RewardBench score represents an average across the subset scores, presenting a holistic assessment of a reward model's performance across varied domains.
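
The aggregation described above can be illustrated with a minimal sketch: per-subset accuracy is the fraction of pairs where the chosen completion receives the higher reward, and the overall score averages the subset accuracies. The `results` records and subset names below are hypothetical, and the official codebase may group or weight subsets differently.

```python
# Minimal sketch of the accuracy metric and score aggregation described above.
# `results` is a hypothetical list of per-pair records; subset names are examples.
from collections import defaultdict
from statistics import mean

results = [
    {"subset": "chat", "score_chosen": 2.3, "score_rejected": -0.7},
    {"subset": "chat", "score_chosen": 0.1, "score_rejected": 0.4},
    {"subset": "reasoning", "score_chosen": 1.8, "score_rejected": 1.1},
    {"subset": "safety", "score_chosen": 0.9, "score_rejected": -1.2},
]

# Per-subset accuracy: fraction of pairs where the chosen completion scores higher.
per_subset = defaultdict(list)
for r in results:
    per_subset[r["subset"]].append(float(r["score_chosen"] > r["score_rejected"]))

subset_accuracy = {subset: mean(flags) for subset, flags in per_subset.items()}

# Overall score: unweighted average over the subset accuracies.
overall_score = mean(subset_accuracy.values())
print(subset_accuracy, overall_score)
```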

Key Findings and Insights

Significant variability exists in the performance of tested reward models across different categories within RewardBench. While some models demonstrate strong alignment with human preferences in certain domains, others falter, particularly with adversarial or nuanced prompts. This variability underscores the complexity of reward modeling and highlights areas for improvement in understanding human values and preferences.

The evaluation also sheds light on the distinction between reward models obtained implicitly through Direct Preference Optimization (DPO) and those trained explicitly as classifiers over preference data. Interestingly, DPO models generally excel in the reasoning and safety categories but exhibit lower performance on established preference datasets. This discrepancy points to a potential divide between models optimized for generative tasks and those fine-tuned for classification, suggesting different avenues for refinement in each approach.
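
For context on why DPO-trained models can serve as reward models at all: DPO's implicit reward for a completion is proportional to the log-probability ratio between the trained policy and its reference model, which is enough to rank two completions for the same prompt. The sketch below illustrates that ranking rule under this assumption; the checkpoint names are placeholders, and details such as chat templates and tokenization boundaries are simplified.

```python
# Hedged sketch of ranking completions with a DPO-trained model's implicit reward,
#   r(x, y) = beta * [log pi_theta(y|x) - log pi_ref(y|x)],
# where beta cancels when comparing two completions for the same prompt.
# Checkpoint names are placeholders; tokenization boundaries are simplified.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

policy = AutoModelForCausalLM.from_pretrained("my-org/dpo-policy").eval()        # placeholder
reference = AutoModelForCausalLM.from_pretrained("my-org/sft-reference").eval()  # placeholder
tokenizer = AutoTokenizer.from_pretrained("my-org/dpo-policy")                   # placeholder

def sequence_logprob(model, prompt: str, response: str) -> float:
    """Sum of token log-probabilities of `response` conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)  # predicts tokens 1..L-1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1 :].sum().item()       # response tokens only

def implicit_reward(prompt: str, response: str) -> float:
    return sequence_logprob(policy, prompt, response) - sequence_logprob(reference, prompt, response)

# A pair is scored correct if the chosen response has the higher implicit reward.
```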

Practical Implications and Future Directions

The RewardBench benchmark catalyzes further research into reward models, particularly in addressing their limitations in understanding complex instructions, safety considerations, and reasoning capabilities. Moreover, the observed differences between DPO and classifier-based models open pathways to exploring hybrid approaches or new training paradigms to enhance model alignment with human values.

Future work could expand RewardBench to include dynamic scenarios where reward models must adapt to evolving contexts or preferences, further pushing the boundaries of model evaluation. Additionally, incorporating broader datasets representing diverse global perspectives can ensure that reward models align with a more inclusive set of human values, addressing potential biases and promoting fairness in AI applications.

In conclusion, RewardBench contributes a valuable framework to the ongoing effort to develop and refine reward models in language technology. By highlighting current challenges and offering a basis for comparison, it paves the way for advancements in creating more aligned, ethical, and effective AI systems.
