- The paper proposes a log-sigmoid transformation for reward models that prioritizes fixing weak outputs and mitigates reward hacking.
- It extends a probabilistic framework to combine multiple reward signals, so that LLM outputs can be aligned to several desirable properties at once.
- Empirical results demonstrate significant improvements in aligning LLMs to be both helpful and harmless.
Introduction
In the landscape of AI alignment, a central challenge lies in encouraging LLMs to generate outputs that possess desirable characteristics, such as being both helpful and harmless. This need has led researchers to Reinforcement Learning from Human Feedback (RLHF), a two-stage process that first trains a reward model from human preferences and then aligns the LLM's responses to increase the expected reward. However, two issues commonly arise. First, monotone transformations of the reward model do not alter preference rankings, raising the question of which transformation is best for the alignment objective. Second, when aligning to multiple properties, it is unclear how to combine multiple reward models effectively.
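To make the first issue concrete, here is a minimal sketch, using made-up reward values, of how a strictly increasing transform preserves which output is preferred under a Bradley-Terry model while still changing the quantity that the RL stage maximizes in expectation:

```python
import numpy as np

# Bradley-Terry: P(y1 preferred over y2 | x) = sigma(r(x, y1) - r(x, y2)).
# A strictly increasing transform of r leaves the ranking of outputs
# unchanged, but it changes the objective that RLHF actually optimizes.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

r1, r2 = 1.5, 0.7                 # hypothetical raw rewards for two outputs
t = lambda z: z ** 3              # an arbitrary strictly increasing transform

print(r1 > r2, t(r1) > t(r2))                     # ranking preserved: True True
print(sigmoid(r1 - r2), sigmoid(t(r1) - t(r2)))   # implied preference strength differs
```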
Wang et al. approach these problems by introducing a probabilistic interpretation of the alignment procedure and, from it, identifying an optimal transformation for rewards learned from Bradley-Terry models. The proposed log-sigmoid transformation, u(x, y) = log σ(r(x, y) − r_ref(x)), where r_ref(x) is a prompt-dependent reference reward, prioritizes improving poorly performing outputs, which mitigates both underfitting and reward hacking. Reward hacking is the undesirable behavior in which an LLM games the reward model rather than genuinely improving. The transformation also offers a principled method for combining rewards, because it corresponds to the log-probability that an output is 'good' with respect to the assessed property. This is a marked departure from the standard approach of simply optimizing raw reward values.
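A minimal sketch of how the transformation reshapes incentives; the reward values and the reference point of zero are illustrative assumptions, not figures from the paper:

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigma(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def transformed_reward(r, r_ref):
    # u(x, y) = log sigma(r(x, y) - r_ref(x)). The utility is bounded above
    # by 0, so outputs already far better than the reference gain little,
    # while outputs far below the reference are heavily penalized.
    return log_sigmoid(r - r_ref)

# Illustrative raw rewards for a weak, a borderline, and a strong output,
# measured against a reference reward of 0.
for r in (-3.0, 0.0, 3.0):
    print(f"r={r:+.1f}  u={transformed_reward(r, r_ref=0.0):+.3f}")
```

The diminishing returns above the reference are what blunt reward hacking: once an output is clearly 'good', pushing its raw reward higher adds almost nothing to the utility.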
Reward Aggregation
Turning to the challenge of combining rewards from multiple properties, the authors extend their probabilistic framework under the assumption of independent judgments. The aggregated utility becomes u(x, y) = Σ_i log σ(r_i(x, y) − r_i,ref(x)), aligning the LLM to be 'good' across all target properties. This choice is not only mathematically principled but also matches the intuition of logical conjunction: an output counts as good only if it is good on every property, as sketched below.
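A minimal sketch of this aggregation under the same illustrative assumptions as before (hypothetical reward values, reference rewards of zero):

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigma(z)).
    return -np.logaddexp(0.0, -z)

def aggregated_utility(rewards, refs):
    # u(x, y) = sum_i log sigma(r_i(x, y) - r_i,ref(x)). Under independence
    # this is the log-probability that the output is 'good' on every property,
    # so a single very poor property drags down the whole sum (a conjunction).
    return sum(log_sigmoid(r - r_ref) for r, r_ref in zip(rewards, refs))

# Hypothetical helpfulness and harmlessness rewards for two candidate outputs.
print(aggregated_utility(rewards=[2.0, 2.0], refs=[0.0, 0.0]))    # decent on both
print(aggregated_utility(rewards=[5.0, -4.0], refs=[0.0, 0.0]))   # maximizes one, fails the other
```

Compare this with summing raw rewards, where the second candidate (5.0 − 4.0 = 1.0) would score about as well as a balanced one; the log-sigmoid aggregation instead penalizes it sharply.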
Empirical Validation
The authors conducted extensive experiments applying their transformed and aggregated rewards in RLHF, finding significant improvements over baseline approaches in both single-property and multiple-property alignment. In particular, when aligning LLMs to be helpful and harmless simultaneously, they observed that the approach helped models avoid reward hacking and underfitting, yielding robust gains in overall performance.
The paper concludes by situating the proposed technique among existing methods for mitigating reward hacking, noting that it can be used in tandem with other strategies. Their work is relevant to any approach that maximizes a learned utility in the RLHF framework, and may provide useful insights for future LLM alignment procedures.