
Abstract

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of LLMs with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality, which suggests RRHF is a best-of-n learner. Code is available at https://github.com/GanjinZero/RRHF.

Overview

  • RRHF introduces a simplified method for aligning language models with human preferences by employing a ranking loss.

  • RRHF requires fewer models and resources compared to traditional methods, avoiding complex training and hyperparameter sensitivity.

  • On the Helpful and Harmless dataset, the method matches PPO's performance while being substantially less complex to train.

  • Wombat, a new language model trained with RRHF on prompts and responses from other models, outperforms an SFT baseline and illustrates the method's generalizability.

  • The paper suggests potential future work to address limitations of RRHF, such as the risk of over-optimization and increased GPU usage.

Overview of RRHF

Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent method for aligning language models with human preferences, as exemplified by InstructGPT. However, the algorithm conventionally used for this purpose, Proximal Policy Optimization (PPO), often faces scalability issues due to hyperparameter sensitivity and architectural complexity. This paper introduces a new learning paradigm called RRHF (Rank Responses to Align Language Models with Human Feedback) that seeks to alleviate these challenges. RRHF simplifies the alignment process by scoring model-generated responses and using a ranking loss to bring their ordering in line with human preferences.
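Concretely, the objective can be summarized as follows (a paraphrase of the paper's formulation; the notation below is a restatement, not a quotation from the paper). Given k sampled responses y_1, ..., y_k to a query x, each with a reward score r_i, RRHF scores every response by its length-normalized conditional log probability under the model being tuned,

    p_i = \frac{1}{\lVert y_i \rVert} \sum_t \log \pi(y_{i,t} \mid x, y_{i,<t}),

and combines a pairwise ranking hinge over these scores with a cross-entropy (SFT) term on the highest-reward response:

    L_{rank} = \sum_{r_i < r_j} \max(0, p_i - p_j), \qquad L = L_{rank} + L_{ft}.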

RRHF vs PPO

RRHF distinguishes itself from PPO by its minimalist approach: it requires only 1 to 2 models during tuning (compared to the 4 used in a standard PPO setup) and dispenses with PPO's reinforcement-learning machinery. Responses are scored by their log probabilities under the model being tuned and optimized with a ranking loss, which eliminates the need for auxiliary value and reference models and for KL-divergence regularization. The robustness of RRHF is demonstrated on the Helpful and Harmless dataset, where it matches PPO in both automated reward-model evaluation and human labeling, without the associated complexity.
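A minimal PyTorch-style sketch of this loss, written against the formulation above (an illustrative reconstruction, not the authors' released implementation; all function and variable names are placeholders):

    import torch
    import torch.nn.functional as F

    def response_score(logits, labels, mask):
        # logits: (T, V) policy logits for one candidate response to a query;
        # labels: (T,) response token ids; mask: (T,) 1.0 on response tokens.
        logp = F.log_softmax(logits, dim=-1)
        token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
        return (token_logp * mask).sum() / mask.sum()  # length-normalized log-prob p_i

    def rrhf_loss(scores, rewards, response_nll):
        # scores: (k,) length-normalized log-probs of k sampled responses;
        # rewards: (k,) reward-model scores; response_nll: (k,) token-level NLL.
        rank_loss = scores.new_zeros(())
        k = scores.size(0)
        for i in range(k):
            for j in range(k):
                if rewards[i] < rewards[j]:  # response j is preferred over i
                    rank_loss = rank_loss + torch.relu(scores[i] - scores[j])
        best = torch.argmax(rewards)           # highest-reward response
        return rank_loss + response_nll[best]  # ranking term + SFT-style term

Note that the only trainable component in this sketch is the language model itself; the reward model is frozen and merely supplies the rewards, so no value or reference networks enter the computation.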

Experiment Findings

Experiments conducted with RRHF yielded several insights. The approach aligns LLMs efficiently, achieving performance comparable to PPO while requiring significantly fewer resources and less implementation complexity. The experiments also showed that the quality of the responses sampled during training correlates directly with the performance of the tuned model, underscoring the importance of high-quality sampling in the alignment process.
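The "best-of-n learner" observation can be made concrete with a short sketch (hypothetical interfaces; `policy.generate` and `reward_model` are assumed stand-ins, not APIs from the paper's code): RRHF's training signal pushes the policy toward whichever sampled response the reward model ranks highest, so the tuned model's quality is bounded by the quality of that best sample.

    def best_of_n(policy, reward_model, prompt, n=4):
        # Sample n candidate responses and keep the one the reward model prefers.
        # RRHF learns to imitate this selection, which is why the quality of the
        # sampled responses bounds the quality of the tuned model.
        candidates = [policy.generate(prompt) for _ in range(n)]
        rewards = [reward_model(prompt, c) for c in candidates]
        return candidates[max(range(n), key=lambda i: rewards[i])]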

Additionally, to simulate real-world training conditions for models like ChatGPT, a new language model named Wombat was trained using RRHF. Wombat outperformed an SFT baseline and aligned effectively with human preferences when trained on prompts and responses from other language models, showcasing the generalizability of RRHF.

Contributions and Future Work

The key contributions of this paper are the development of RRHF as a simplified and efficient training paradigm, its framing as an extension of SFT and reward model training, and the demonstration of performance comparable to PPO on the Helpful and Harmless dataset. These contributions matter because they could make it easier to align language models with human preferences at scale, especially with limited resources.

In terms of future work, although RRHF has shown promise, the authors acknowledge limitations such as the risk of over-optimizing against the reward model and the need to provide multiple candidate responses per query, which increases GPU memory usage. Addressing these limitations will be crucial for further optimizing the method and for the safe deployment of language models aligned with human preferences.
