
Abstract

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of LLMs with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests RRHF is a best-of-n learner. Code is available at https://github.com/GanjinZero/RRHF.

Overview

  • RRHF introduces a simplified method for aligning language models with human preferences by employing a ranking loss.

  • RRHF requires fewer models and resources compared to traditional methods, avoiding complex training and hyperparameter sensitivity.

  • On the Helpful and Harmless dataset, the method matches PPO's performance while being substantially less complex to train.

  • Wombat, a new language model trained with RRHF on prompts and responses from other models, outperforms an SFT baseline and illustrates the method's generalizability.

  • The paper suggests potential future work to address limitations of RRHF, such as the risk of over-optimization and increased GPU usage.

Overview of RRHF

Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent method for aligning language models with human preferences, as exemplified by InstructGPT. However, the algorithm conventionally used for this purpose, Proximal Policy Optimization (PPO), often faces scalability issues due to hyperparameter sensitivity and architectural complexity. This paper introduces a new learning paradigm called RRHF (Rank Responses to Align Language Models with Human Feedback) that seeks to alleviate these challenges. RRHF simplifies the alignment process by scoring model-generated responses and using a ranking loss to bring their ordering in line with human preferences.
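Concretely, the objective can be summarized as follows (a paraphrase of the paper's formulation; the notation below is a restatement, not a quotation from the paper). Given k sampled responses y_1, ..., y_k to a query x, each with a reward score r_i, RRHF scores every response by its length-normalized conditional log probability under the model being tuned,

    p_i = \frac{1}{\lVert y_i \rVert} \sum_t \log \pi(y_{i,t} \mid x, y_{i,<t}),

and combines a pairwise ranking hinge over these scores with a cross-entropy (SFT) term on the highest-reward response:

    L_{rank} = \sum_{r_i < r_j} \max(0, p_i - p_j), \qquad L = L_{rank} + L_{ft}.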

RRHF vs PPO

RRHF distinguishes itself from PPO by its minimalist approach: it requires only 1 to 2 models during tuning (compared to the 4 used in a standard PPO setup) and dispenses with PPO's reinforcement-learning machinery. Responses are scored by their log probabilities under the model being tuned and optimized with a ranking loss, which eliminates the need for auxiliary value and reference models and for KL-divergence regularization. The robustness of RRHF is demonstrated on the Helpful and Harmless dataset, where it matches PPO in both automated reward-model evaluation and human labeling, without the associated complexity.
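A minimal PyTorch-style sketch of this loss, written against the formulation above (an illustrative reconstruction, not the authors' released implementation; all function and variable names are placeholders):

    import torch
    import torch.nn.functional as F

    def response_score(logits, labels, mask):
        # logits: (T, V) policy logits for one candidate response to a query;
        # labels: (T,) response token ids; mask: (T,) 1.0 on response tokens.
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return (token_logp * mask).sum() / mask.sum()  # length-normalized log-prob p_i

    def rrhf_loss(scores, rewards, response_nll):
        # scores: (k,) length-normalized log-probs of k sampled responses;
        # rewards: (k,) reward-model scores; response_nll: (k,) token-level NLL.
        rank_loss = scores.new_zeros(())
        k = scores.size(0)
        for i in range(k):
            for j in range(k):
                if rewards[i] < rewards[j]:  # response j is preferred over i
                    rank_loss = rank_loss + torch.relu(scores[i] - scores[j])
        best = torch.argmax(rewards)           # highest-reward response
        return rank_loss + response_nll[best]  # ranking term + SFT-style term

Note that the only trainable component in this sketch is the language model itself; the reward model is frozen and merely supplies the rewards, so no value or reference networks enter the computation.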

Experiment Findings

Experiments conducted with RRHF yielded several insights. The approach aligns LLMs efficiently, achieving performance comparable to PPO while requiring significantly fewer resources and less implementation complexity. The experiments also showed that the quality of the responses sampled during training correlates directly with the performance of the tuned model, underscoring the importance of high-quality sampling in the alignment process.
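The "best-of-n learner" observation can be made concrete with a short sketch (hypothetical interfaces; `policy.generate` and `reward_model` are assumed stand-ins, not APIs from the paper's code): RRHF's training signal pushes the policy toward whichever sampled response the reward model ranks highest, so the tuned model's quality is bounded by the quality of that best sample.

    def best_of_n(policy, reward_model, prompt, n=4):
        # Sample n candidate responses and keep the one the reward model prefers.
        # RRHF learns to imitate this selection, which is why the quality of the
        # sampled responses bounds the quality of the tuned model.
        candidates = [policy.generate(prompt) for _ in range(n)]
        rewards = [reward_model(prompt, c) for c in candidates]
        return candidates[max(range(n), key=lambda i: rewards[i])]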

Additionally, to simulate real-world training conditions for models like ChatGPT, a new language model named Wombat was trained using RRHF. Wombat outperformed an SFT baseline and aligned effectively with human preferences when trained on prompts and responses from other language models, showcasing the generalizability of RRHF.

Contributions and Future Work

The key contributions of this paper are the development of RRHF as a simplified and efficient training paradigm, its framing as an extension of SFT and reward model training, and the demonstration of performance comparable to PPO on the Helpful and Harmless dataset. These contributions matter because they could make it easier to align language models with human preferences at scale, especially with limited resources.

In terms of future work, although RRHF has shown promise, the authors acknowledge limitations such as the risk of over-optimizing against the reward model and the need to provide multiple candidate responses per query, which increases GPU memory usage. Addressing these limitations will be crucial for further optimizing the method and for the safe deployment of language models aligned with human preferences.
