
Transforming and Combining Rewards for Aligning Large Language Models

(2402.00742)
Published Feb 1, 2024 in cs.CL and cs.AI

Abstract

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is "better" than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice of transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is "good" in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.

Figure: Bradley-Terry reward improves the LLM's helpfulness and harmlessness over the base SFT model while avoiding overfitting.

Overview

  • The paper addresses challenges in training LLMs to generate desirable outputs through Reinforcement Learning from Human Feedback (RLHF) by optimizing reward models.

  • Two common problems in RLHF are addressed: selecting an optimal monotone transformation of the reward model and effectively combining multiple reward models for different desired properties.

  • A log-sigmoid transformation of the reward model is proposed, which prioritizes improvements to poorly-performing outputs and mitigates both underfitting and reward hacking.

  • The authors propose a method for combining rewards from multiple properties into a single utility function, using a probabilistic framework that assumes independent judgments.

  • Empirical experiments demonstrate that the proposed reward transformation and aggregation methods lead to significant improvements in aligning LLMs, especially for helpfulness and harmlessness.

Introduction

In the landscape of AI alignment, a central challenge lies in encouraging LLMs to generate outputs that possess desirable characteristics, such as being both helpful and harmless. This need has led researchers to use Reinforcement Learning from Human Feedback (RLHF), a two-stage process that first trains a reward model based on human preferences and then aligns the LLM's responses to increase the expected reward. However, two issues are often encountered: first, monotone transformations of the reward model do not alter preference rankings, raising the question of which transformation to choose; second, when aligning to multiple properties, it is unclear how to combine multiple reward models effectively.
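As background for the first stage, a minimal sketch of the standard Bradley-Terry pairwise loss used to fit the reward model is shown below; the function name and tensor shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry preference model.

    r_chosen, r_rejected: reward-model scores r(x, y_w) and r(x, y_l) for the
    preferred and dispreferred responses to the same prompt. The model assumes
    P(y_w preferred over y_l) = sigmoid(r(x, y_w) - r(x, y_l)).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```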

Reward Transformation

Wang et al. approach these problems by introducing a probabilistic interpretation of the alignment procedure and, from it, deriving a natural transformation for rewards learned from Bradley-Terry models. The proposed log-sigmoid transformation, denoted \( u(x, y) = \log \sigma(r(x, y) - r_{\text{ref}}(x)) \), prioritizes improvements in poorly-performing outputs, which mitigates both underfitting and reward hacking. Reward hacking is an undesirable behavior where LLMs game the reward model rather than genuinely improving. The transformation also offers a principled basis for combining rewards, as it corresponds to the (log) probability that an output is 'good' on the assessed property. This is a marked departure from the standard approach, which simply uses the raw reward values.
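A minimal sketch of this transformation, assuming PyTorch; here `r_ref` stands for the per-prompt reference value \( r_{\text{ref}}(x) \) in the formula above, and the names are illustrative:

```python
import torch
import torch.nn.functional as F

def transformed_reward(r: torch.Tensor, r_ref: torch.Tensor) -> torch.Tensor:
    """u(x, y) = log sigmoid(r(x, y) - r_ref(x)).

    Log-sigmoid saturates toward 0 for large positive margins, so extra reward
    on already-good outputs yields diminishing utility, while poorly-scoring
    outputs sit on the steep part of the curve and dominate the gradient.
    """
    return F.logsigmoid(r - r_ref)
```

For instance, `F.logsigmoid(torch.tensor(5.0))` is about -0.0067 while `F.logsigmoid(torch.tensor(0.0))` is about -0.693, so pushing an already high-margin output even higher buys very little utility compared with lifting a low-margin one.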

Reward Aggregation

Turning to the challenge of combining rewards for multiple properties, the authors extend their probabilistic framework under the assumption of independent judgments. The aggregated utility function becomes \( u(x, y) = \sum_i \log \sigma(r_i(x, y) - r_{i,\text{ref}}(x)) \), aligning the LLM to be 'good' across all target properties. This choice is not only mathematically convenient but also matches the intuitive notion of logical conjunction: the sum of log-probabilities is the log-probability that the output is 'good' in every property simultaneously.
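A minimal sketch of the aggregation under the same assumptions, with per-property rewards and per-prompt reference values stacked along the last dimension (names illustrative):

```python
import torch
import torch.nn.functional as F

def aggregated_utility(rewards: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
    """u(x, y) = sum_i log sigmoid(r_i(x, y) - r_{i,ref}(x)).

    rewards: shape (..., num_properties), per-property reward-model scores.
    refs:    same shape, per-property reference rewards for the prompt.

    Under the independence assumption, summing the log-probabilities gives the
    log-probability that the output is 'good' in every property at once.
    """
    return F.logsigmoid(rewards - refs).sum(dim=-1)
```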

Empirical Validation

The authors conducted extensive experiments applying their transformed reward and aggregation methodology in RLHF. They found significant improvements over baseline approaches in both single-property and multi-property alignment. In particular, when aligning LLMs to be both helpful and harmless, they observed that the approach helped models avoid reward hacking and underfitting, yielding robust gains in overall performance.

The paper concludes by situating the proposed technique among existing methods for mitigating reward hacking, noting that it can be used in tandem with other strategies. The work is relevant to any approach that maximizes a utility function within the RLHF framework, and it offers useful insights for future language model alignment procedures.
