Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment

(arXiv:2405.17931)
Published May 28, 2024 in cs.CL and cs.LG

Abstract

Effectively aligning LLMs with human-centric values while preventing the degradation of abilities acquired through Pre-training and Supervised Fine-tuning (SFT) poses a central challenge in Reinforcement Learning from Human Feedback (RLHF). In this paper, we first discover that interpolating RLHF and SFT model parameters can adjust the trade-off between human preference and basic capabilities, thereby reducing the alignment tax at the cost of alignment reward. Inspired by this, we propose integrating the RL policy and SFT models at each optimization step in RLHF to continuously regulate the training direction, introducing the Online Merging Optimizer. Specifically, we merge gradients with the parameter differences between SFT and pretrained models, effectively steering the gradient towards maximizing rewards in the direction of SFT optimization. We demonstrate that our optimizer works well with different LLM families, such as Qwen and LLaMA, across various model sizes ranging from 1.8B to 8B, various RLHF algorithms like DPO and KTO, and existing model merging methods. It significantly enhances alignment reward while mitigating alignment tax, achieving higher overall performance across 14 benchmarks.

Figure: RLHF process with online merging optimizers.

Overview

  • The paper addresses the issue of alignment tax in the training of LLMs using Reinforcement Learning from Human Feedback (RLHF) and proposes Online Merging Optimizers as a solution.

  • The authors discovered that interpolating RLHF and Supervised Fine-Tuning (SFT) model parameters helps reduce alignment tax (see the interpolation sketch after this list), and built on this insight with Online Merging Optimizers, which merge gradients with SFT delta parameters to balance reward maximization against alignment tax.

  • Experimental validation across various LLM architectures and sizes demonstrated that Online Merging Optimizers outperform traditional techniques across multiple benchmarks, showing robustness and effectiveness.
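
To make the interpolation idea concrete, here is a minimal PyTorch sketch of linearly blending SFT and RLHF checkpoints. The function name and the use of a single scalar `alpha` are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch (assumption): linear interpolation between SFT and RLHF checkpoints.
# alpha -> 1 favors the RLHF policy (higher alignment reward);
# alpha -> 0 favors the SFT model (better-preserved basic capabilities).
import torch

@torch.no_grad()
def interpolate_checkpoints(sft_state: dict, rlhf_state: dict, alpha: float) -> dict:
    """Return a state dict with theta = alpha * theta_rlhf + (1 - alpha) * theta_sft."""
    return {
        name: alpha * rlhf_state[name] + (1.0 - alpha) * sft_param
        for name, sft_param in sft_state.items()
    }

# Usage (both models must share the same architecture):
# merged = interpolate_checkpoints(sft_model.state_dict(), rlhf_model.state_dict(), alpha=0.5)
# model.load_state_dict(merged)
```

Sweeping `alpha` traces out the reward-versus-capability trade-off described above.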

Overview of "Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment"

The paper explores the challenge of mitigating "alignment tax" in the training of LLMs using Reinforcement Learning from Human Feedback (RLHF). The alignment tax refers to the degradation of fundamental model abilities that occurs when LLMs are aligned with human preferences via RLHF. The authors address this by integrating offline merging techniques into the RLHF optimization steps, leading to the development of Online Merging Optimizers. This approach balances the trade-off between reward maximization and minimization of alignment tax.

The research presents a comprehensive examination and validation of online merging optimizers across various LLM architectures, RLHF algorithms, and model sizes. Experimental results indicate that these optimizers significantly enhance alignment performance and mitigate alignment tax, outperforming traditional regularization techniques and offline model merging methods.

Contributions

  1. Discovery of Parameter Interpolation: The authors initially discover that interpolating RLHF and Supervised Fine-Tuning (SFT) model parameters facilitates a trade-off between human preference alignment and foundational capabilities, effectively reducing alignment tax at some cost to alignment reward.

  2. Online Merging Optimizer Proposal: Building upon the parameter interpolation insight, the paper proposes the Online Merging Optimizer, which merges gradients with the SFT model's delta parameters at each optimization step in RLHF. This steers the gradient to maximize rewards while staying aligned with the SFT model's optimization direction (an illustrative sketch follows this list).

  3. Broad Experimental Validation: The optimizer is tested across several LLM families, including Qwen and LLaMA, with various model sizes ranging from 1.8B to 8B parameters. It is compatible with different RLHF algorithms such as DPO and KTO and existing model merging methods like DARE and TIES. Results demonstrate superior performance across 14 benchmarks.
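
As a rough illustration of the online merging idea in contribution 2, the sketch below wraps a standard optimizer and, at each step, keeps only the update components whose sign agrees with the SFT delta (theta_SFT minus theta_pretrained), a TIES-flavored rule. The class name, the sign-agreement rule, and the per-parameter bookkeeping are assumptions made for illustration; the paper's actual merging operators (e.g., its DARE- and TIES-based variants) may differ in detail.

```python
# Illustrative sketch only, not the paper's exact rule: wrap a base optimizer and,
# at every step, keep only the components of its update whose sign agrees with the
# SFT delta (theta_sft - theta_pretrained), so the policy moves toward higher reward
# without drifting against the SFT optimization direction.
import torch

class OnlineMergingOptimizerSketch:
    def __init__(self, params, sft_deltas, base_optimizer):
        self.params = list(params)            # RL policy parameters being trained
        self.sft_deltas = list(sft_deltas)    # per-parameter tensors: theta_sft - theta_pretrained
        self.base_optimizer = base_optimizer  # e.g., torch.optim.AdamW over the same params

    def zero_grad(self):
        self.base_optimizer.zero_grad()

    @torch.no_grad()
    def step(self):
        before = [p.detach().clone() for p in self.params]
        self.base_optimizer.step()            # ordinary gradient step on the RLHF loss (e.g., DPO, KTO)
        for p, prev, delta in zip(self.params, before, self.sft_deltas):
            update = p - prev                 # the update the base optimizer just applied
            # Sign-agreement mask: drop update components that point against the SFT delta.
            mask = (torch.sign(update) == torch.sign(delta)).to(update.dtype)
            p.copy_(prev + mask * update)
```

In practice the SFT deltas would be precomputed once and kept alongside the policy, which is the source of the memory overhead noted in the Limitations section.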

Key Results

Benchmark Performance:

The proposed Online Merging Optimizer achieves higher overall performance across 14 benchmarks compared to traditional optimization techniques.

MT-Bench and AlpacaEval 2.0:

It consistently performs better in terms of alignment rewards, as evidenced by MT-Bench and AlpacaEval 2.0 scores, indicating improved alignment with human preferences.

Effectiveness Across Models:

The optimizer shows robustness and effectiveness with different LLM backbones and sizes, demonstrating its general applicability.

Theoretical and Practical Implications

The research presents significant theoretical implications by framing the alignment tax issue within the context of mode connectivity in neural networks. The integration of offline model merging techniques into active training processes opens up new avenues for optimizing RLHF training. Practically, the introduction of Online Merging Optimizers is a step towards producing more balanced and capable LLMs with minimized alignment tax. These optimizers can be readily adopted in various settings, providing more robust models without requiring extensive modifications to existing training pipelines.

Speculations on Future Developments

Given the success of Online Merging Optimizers in mitigating alignment tax, future developments may focus on refining these techniques for even more granular control over the trade-off between reward maximization and tax minimization. Further research can explore:

  • Application of online merging optimizers in other areas prone to catastrophic forgetting, such as continual learning scenarios.
  • Hybrid models that combine the advantages of parameter-efficient training techniques like LoRA with online merging.
  • Enhanced memory efficiency methods to reduce the computational overhead associated with maintaining delta parameters.

Conclusion

This paper presents a well-structured approach to addressing the alignment tax problem in RLHF training of LLMs by introducing Online Merging Optimizers. The results are promising, demonstrating significant performance improvements across multiple benchmarks and alignment tasks. This research contributes valuable insights into optimizing human-aligned AI models and sets a foundation for further advancements in the field.

Limitations

The primary limitation discussed is the memory overhead of maintaining delta parameters of the reference model, which might hinder scalability in some scenarios. However, this drawback is outweighed by the gains the proposed method delivers in model performance and alignment capability. Future work could focus on alleviating this limitation through parameter-efficient techniques.

Overall, this research marks a notable step forward in the pursuit of more effective and balanced RLHF training methods.
