
Abstract

Accurately aligning LLMs with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF), the predominant approach for aligning LLMs with human preferences through a reward model, suffers from an inherent algorithmic bias due to its Kullback-Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by preference matching divergence, compared to standard RLHF.

Figure: CDF of preference matching (PM) divergences for Llama-2-7B fine-tuning under four RLHF settings and three $\beta$ values.

Overview

  • The paper identifies a significant algorithmic bias in aligning LLMs with human preferences through Reinforcement Learning from Human Feedback (RLHF): the commonly used Kullback-Leibler (KL) divergence-based regularization can cause minority preferences to collapse.

  • The authors propose a novel method called Preference Matching (PM) RLHF, providing a theoretical foundation and a practical conditional variant to address this bias and align model outputs more accurately with diverse human preferences.

  • Empirical validation shows that PM RLHF significantly improves preference alignment, with results demonstrating a 29% to 41% reduction in preference matching divergence compared to standard RLHF in various model configurations.

An In-depth Analysis of Algorithmic Bias in RLHF for LLMs

Overview

The paper titled "On the Algorithmic Bias of Aligning LLMs with RLHF: Preference Collapse and Matching Regularization" explores the issue of algorithmic bias in aligning LLMs with human preferences through reinforcement learning from human feedback (RLHF). The central assertion is that the prevalent RLHF approach, which employs Kullback-Leibler (KL) divergence-based regularization, introduces inherent biases that can lead to what the authors term "preference collapse." This phenomenon results in the near-total disregard of minority preferences. To address this, the authors propose a novel method called preference matching (PM) RLHF, which aims to align LLMs accurately with the distribution of preferences expressed by a reward model.

Key Contributions

The authors identify the primary source of bias in RLHF as the KL divergence-based regularization, which uses a pretrained LLM as a reference model. This regularization carries unavoidable biases from the reference model into the final aligned LLM. The bias can become so severe that minority preferences collapse entirely in favor of the majority.
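To make the source of this bias concrete, the standard KL-regularized RLHF objective and its well-known closed-form maximizer can be written as follows (the notation here follows common RLHF conventions rather than the paper's exact statement):

```latex
% Standard KL-regularized RLHF objective with reward model r and reference policy pi_ref
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\;-\;
\beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big],
\qquad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, e^{\, r(x, y)/\beta}.
```

Because the maximizer is tilted by $\pi_{\mathrm{ref}}$, any skew of the reference model toward majority-style responses is carried into the aligned policy; this reference-model dependence is what the paper identifies as the root of preference collapse.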

The key contributions of the paper include:

  1. Introduction of PM RLHF: The authors propose PM RLHF as a method to eliminate the algorithmic bias inherent in standard RLHF. This technique involves a PM regularizer based on the negative logarithm of the LLM's policy probability distribution over responses.

  2. Theoretical Foundation: The paper establishes a theoretical basis for PM RLHF by solving an ordinary differential equation necessary for the PM property. This framework ensures that the LLM's output distribution matches the human preference distribution given by the reward model.

  3. Conditional Variant: For practical implementation, the authors propose a conditional variant of PM RLHF tailored to natural language generation. This variant penalizes responses with low probabilities according to a reference model, effectively filtering out unnatural or nonsensical outputs.

  4. Empirical Validation: Empirical results show significant improvements in alignment with human preferences. The proposed PM RLHF approach led to a 29% to 41% reduction in preference matching divergence compared to standard RLHF in experiments with the OPT-1.3B and Llama-2-7B models.

Methodological Insight

The PM RLHF method diverges from standard RLHF by directly addressing the distribution of preferences. The regularization term $R(\pi)$, derived from solving a differential equation, ensures that the optimization aligns with the preference distribution modeled by the reward function $r(x, y)$. Specifically, $R(\pi) = -\log(\pi) + C_{1,x} + C_{2,x}/\pi$, where $C_{1,x}$ and $C_{2,x}$ are constants that may depend on the prompt $x$.

This formulation ensures that the LLM not only maximizes the reward but also maintains diverse responses, preventing the exclusive preference of majority opinions.
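As a quick sanity check on this claim, the following minimal NumPy sketch (with made-up reward values, taking $C_{1,x} = C_{2,x} = 0$) verifies numerically that the policy maximizing expected reward plus the $-\log \pi$ regularizer is the softmax of the rewards, i.e., exactly the Plackett-Luce preference distribution induced by $r$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rewards r(x, y) for four candidate responses y to a single prompt x.
rewards = np.array([1.2, 0.3, -0.5, 2.0])

def pm_objective(pi, r):
    """Expected reward plus the PM regularizer -log(pi), with C_{1,x} = C_{2,x} = 0."""
    return np.sum(pi * (r - np.log(pi)))

# Policy predicted to be optimal by the PM analysis: softmax of the rewards,
# which coincides with the Plackett-Luce preference distribution given by r.
pi_star = np.exp(rewards) / np.exp(rewards).sum()

# Compare against many random policies drawn from the probability simplex.
best_random = max(
    pm_objective(rng.dirichlet(np.ones_like(rewards)), rewards) for _ in range(10_000)
)

print("objective at softmax(r):      ", pm_objective(pi_star, rewards))
print("best objective, random search:", best_random)
# softmax(r) attains the highest value, so the optimal policy matches the
# preference probabilities of the reward model rather than amplifying the mode.
```

Matching the softmax of the rewards is precisely the preference matching property: the policy's response probabilities equal the preference probabilities assigned by the BTL/PL reward model.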

Addressing Practical Challenges

One challenge noted in applying PM RLHF is the naturalness of the generated text. To resolve generation issues observed with the direct application of PM RLHF, the authors introduce conditional PM RLHF, which heavily penalizes responses that the reference model deems nonsensical or meaningless, effectively filtering them out. This conditional approach balances reward maximization with response naturalness.
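The paper's exact conditional formulation is not reproduced in this summary; the sketch below only illustrates the general idea, with the reference log-probability threshold `tau` and the penalty weight `lam` being hypothetical knobs introduced here for illustration:

```python
import torch

def conditional_pm_reward(reward, logp_policy, logp_ref, tau=-50.0, lam=10.0):
    """Illustrative conditional PM reward shaping (not the paper's exact form).

    reward:      r(x, y) from the reward model, shape (batch,)
    logp_policy: log pi(y | x) under the policy being trained, shape (batch,)
    logp_ref:    log pi_ref(y | x) under the reference model, shape (batch,)
    tau:         hypothetical threshold below which the reference model is taken
                 to consider the response unnatural
    lam:         hypothetical weight of the extra penalty for such responses
    """
    # PM-regularized reward: r(x, y) plus the -log pi(y | x) term of R(pi).
    pm_term = reward - logp_policy
    # Extra penalty that activates only when the reference model assigns very low
    # probability, discouraging nonsensical responses a pure PM objective might allow.
    naturalness_penalty = lam * (tau - logp_ref).clamp(min=0.0)
    return pm_term - naturalness_penalty
```

In this sketch the penalty is zero for responses the reference model finds sufficiently probable, so preference matching is preserved on natural text while degenerate outputs are pushed down.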

Empirical Results

The empirical results were robust, demonstrating that conditional PM RLHF substantially reduces preference matching divergence. In experiments, the divergence metrics for the aligned models showed that the PM RLHF approach significantly outperformed standard RLHF across multiple configurations and values of $\beta$.
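The summary does not define the preference matching divergence itself; one plausible instantiation, assuming it compares the policy's probabilities over a fixed set of candidate responses with the Plackett-Luce preference distribution induced by the reward model, might look like the following (an illustrative assumption, not the paper's definition):

```python
import numpy as np

def pm_divergence(policy_logprobs, rewards):
    """Illustrative PM divergence: KL from the reward-induced Plackett-Luce
    preference distribution to the policy's distribution over the same candidates.
    This definition is an assumption made for illustration only.

    policy_logprobs: log pi(y_i | x) for candidate responses y_1, ..., y_k
    rewards:         r(x, y_i) for the same candidate responses
    """
    # Renormalize the policy over the candidate set.
    log_pi = policy_logprobs - np.logaddexp.reduce(policy_logprobs)
    # Preference distribution implied by the BTL/PL reward model.
    log_pref = rewards - np.logaddexp.reduce(rewards)
    pref = np.exp(log_pref)
    return float(np.sum(pref * (log_pref - log_pi)))
```

Under this reading, a perfectly preference-matched policy drives the divergence to zero, so smaller values indicate better alignment.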

Interestingly, there was a trade-off observed between preference alignment and generative performance. While the PM RLHF models excelled in aligning with human preferences, they also exhibited changes in metrics like perplexity, reflecting the nuanced balance between these objectives.

Implications and Future Directions

The findings of this paper have profound implications for both practical and theoretical domains. Practically, improving the alignment of LLMs with diverse human preferences can lead to fairer and more effective decision-making systems in various applications. Theoretically, the introduction of PM RLHF opens new avenues for further research into RLHF methodologies and their inherent biases.

Future research could explore several directions:

  1. Scaling Up: Applying PM RLHF to larger industrial-level LLMs such as GPT-4 or Claude-3 Opus could help to better understand its impact on more complex models.

  2. Diverse Human Preferences: Extending PM RLHF to incorporate multiple reward models could address preference matching more finely when faced with heterogeneous human preferences.

  3. Generalized Models: Investigating generalized preference models beyond the PL model could yield insights into the adaptability and effectiveness of PM regularization in various contexts.

  4. Direct Preference Optimization (DPO): Developing a DPO counterpart of PM RLHF could benefit scenarios where computational efficiency is critical.

  5. Length Sensitivity: Exploring the impact of response length on preference alignment could further refine PM RLHF to handle biases arising from varied response lengths.

Conclusion

The paper makes a significant contribution to the field of aligning LLMs with human preferences by identifying and addressing the intrinsic algorithmic biases in standard RLHF. The proposed PM RLHF method offers a principled approach to achieving unbiased preference alignment, backed by strong theoretical foundations and empirical validation. This work not only advances the understanding of RLHF methodologies but also paves the way for developing fairer and more effective AI systems.
