
PERL: Parameter Efficient Reinforcement Learning from Human Feedback

(2403.10704)
Published Mar 15, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Reinforcement Learning from Human Feedback (RLHF) has proven to be a strong method to align Pretrained LLMs with human preferences. But training models with RLHF is computationally expensive, and an overall complex process. In this work, we study RLHF where the underlying models are trained using the parameter efficient method of Low-Rank Adaptation (LoRA) introduced by Hu et al. [2021]. We investigate the setup of "Parameter Efficient Reinforcement Learning" (PERL), in which we perform reward model training and reinforcement learning using LoRA. We compare PERL to conventional fine-tuning (full-tuning) across various configurations on 7 benchmarks of reward modeling and reinforcement learning, including 2 novel datasets. We find that PERL performs on par with the conventional RLHF setting, while training faster and with less memory. This enables the high performance of RLHF, while reducing the computational burden that limits its adoption as an alignment technique for LLMs. We also release 2 novel thumbs up/down preference datasets: "Taskmaster Coffee" and "Taskmaster Ticketing" to promote research around RLHF.

Figure: Comparison between PERL and traditional reinforcement learning loops.

Overview

  • PERL introduces a new training framework within the RLHF paradigm that utilizes Low-Rank Adaptation (LoRA) to substantially reduce the training duration and memory requirements while maintaining performance.

  • The paper demonstrates that PERL can achieve the same results as traditional RLHF methods by adjusting only 0.1% of a model's parameters, marking a significant efficiency improvement.

  • Deploying PERL yields practical benefits: it roughly halves memory usage and substantially speeds up reward model training, with smaller but still notable memory and speed gains in the reinforcement learning phase.

  • PERL's evaluation across various tasks and datasets reveals its robustness and adaptability, proposing a new direction for research in AI alignment with human preferences.

PERL: A Leap in Computational Efficiency for Reinforcement Learning from Human Feedback

Introduction to PERL

In the quest to align pretrained LLMs with human preferences, Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal technique. However, its significant computational expense and complexity pose substantial challenges. Addressing these hurdles, the study "Parameter Efficient Reinforcement Learning from Human Feedback" introduces PERL, a framework that leverages Low-Rank Adaptation (LoRA) to train models within the RLHF pipeline. Notably, PERL matches the performance of conventional RLHF while reducing both training time and memory requirements.

The Efficiency of PERL

PERL marks a significant departure from conventional fine-tuning by incorporating LoRA into both reward model training and the reinforcement learning phase. This integration allows models to be trained with a substantially reduced number of trainable parameters: the study reports that PERL can operate by adjusting merely 0.1% of a model's total parameters. This efficiency does not come at the expense of performance; PERL matches the results obtained through traditional full-parameter tuning across the benchmarks evaluated.
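To make the tiny trainable fraction concrete, here is a minimal, hypothetical sketch of a LoRA-style linear layer in PyTorch. It is not the paper's code: the layer sizes, rank, and scaling are illustrative assumptions. The pretrained weight is frozen, and only the two low-rank matrices are trainable, so the trainable share of a large projection layer is well under 1%.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weights are never updated
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: the update starts at zero
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable share: {trainable / total:.3%}")       # well under 1% of this layer's parameters
```

Applying such adapters to only a subset of a full model's weight matrices is what drives the overall trainable fraction down to the roughly 0.1% figure cited above.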

Deployment and Practical Implications

Deploying PERL highlights its practical benefits, notably memory efficiency and faster training. In reward model training, a direct comparison to full-tuning setups shows that PERL halves memory usage while accelerating training by roughly 50%. The reinforcement learning phase sees smaller but still meaningful gains, with memory savings of about 20% and a speed increase of about 10%. These improvements make PERL a compelling choice for aligning LLMs with human preferences across applications ranging from text summarization to UI automation.
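The memory and speed gains follow from keeping gradients and optimizer states only for the small trainable set. The toy example below illustrates this for reward model training with a Bradley-Terry-style pairwise preference loss; the stand-in model, shapes, and hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in for a LoRA-adapted reward model: frozen backbone, small trainable adapter and scalar head."""

    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.backbone = nn.Linear(dim, dim)
        for p in list(self.embed.parameters()) + list(self.backbone.parameters()):
            p.requires_grad = False                        # "pretrained" weights stay frozen
        self.adapter = nn.Linear(dim, dim, bias=False)     # stands in for the LoRA adapters
        self.head = nn.Linear(dim, 1)                      # scalar reward head

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids).mean(dim=1)              # mean-pool token embeddings
        h = torch.tanh(self.backbone(h) + self.adapter(h))
        return self.head(h).squeeze(-1)                    # one scalar reward per sequence

model = ToyRewardModel()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)          # optimizer state exists only for the small trainable set

chosen = torch.randint(0, 1000, (4, 16))                   # token ids of preferred responses
rejected = torch.randint(0, 1000, (4, 16))                 # token ids of dispreferred responses
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()  # pairwise (Bradley-Terry) preference loss
loss.backward()                                            # gradients reach only the unfrozen parameters
optimizer.step()
```

Because the frozen backbone needs no gradient buffers or Adam moments, the per-step memory footprint scales with the adapter size rather than the full model, which is where the reported savings come from.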

Exploration of Datasets

PERL was evaluated across a range of datasets, covering tasks such as text summarization (e.g., Reddit TL;DR and BOLT Message Summarization), UI automation, and generating neutral-viewpoint responses. This analysis demonstrates both PERL's robustness and its adaptability to distinct task domains. The introduction of the Taskmaster Coffee and Taskmaster Ticketing datasets further enriches the research landscape, providing new avenues for exploring RLHF methods.
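The Taskmaster datasets carry thumbs up/down ratings rather than explicit pairwise rankings. One common way to use such ratings for reward modeling is to pair positively and negatively rated responses that share the same context; the snippet below sketches that conversion with an invented record format (the released datasets' actual schema may differ).

```python
from itertools import product

# Invented record format for illustration: each rated response is tied to its conversation context.
ratings = [
    {"context": "User: I'd like a large latte.", "response": "Sure, one large latte. Anything else?", "thumbs_up": True},
    {"context": "User: I'd like a large latte.", "response": "Sorry, we are out of oat milk.", "thumbs_up": False},
]

def to_preference_pairs(records):
    """Pair every thumbs-up response with every thumbs-down response sharing the same context."""
    grouped = {}
    for r in records:
        bucket = grouped.setdefault(r["context"], {"up": [], "down": []})
        bucket["up" if r["thumbs_up"] else "down"].append(r["response"])
    pairs = []
    for context, bucket in grouped.items():
        for chosen, rejected in product(bucket["up"], bucket["down"]):
            pairs.append({"context": context, "chosen": chosen, "rejected": rejected})
    return pairs

print(to_preference_pairs(ratings))
```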

Future Directions

The study’s findings prompt a reevaluation of parameter tuning in reinforcement learning, particularly within the RLHF framework. The efficiency gains observed with PERL pave the way for broader adoption and experimentation, potentially expanding the horizons of LLM alignment techniques. Future investigations might delve into enhancing PERL’s cross-domain generalization capabilities or exploring more efficient ensemble models. The potential integration of recent advancements, such as weight-averaging models, could also contribute to mitigating reward hacking issues, thereby ensuring a more reliable and robust learning process.

In Summary

PERL represents a significant stride toward computational efficiency in reinforcement learning, particularly in the context of aligning LLMs with human preferences. By harnessing LoRA, PERL achieves parity with traditional RLHF methods on performance metrics while substantially reducing the computational resources required. The release of new datasets alongside PERL's validation across multiple benchmarks points to a promising direction for future research in AI alignment, making reinforcement learning from human feedback a more accessible and efficient process.
