Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

(2404.05567)
Published Apr 8, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.

Overview

  • This paper introduces a hybrid Mixture-of-Experts (MoE) approach named DS-MoE, which combines dense training with sparse inference to improve parameter efficiency and computational cost.

  • The study presents a novel Dense Training method that utilizes dense gradient propagation, incorporating a Mutual Information (MI) loss to balance expert usage and ensure computational efficiency.

  • Sparse Inference in the DS-MoE model selectively activates experts, significantly reducing computational load while maintaining model performance, supported by Mixture of Attention Head (MoA) blocks.

  • Empirical evaluations demonstrate that DS-MoE models achieve comparable performance to dense models with significantly fewer parameters, showcasing enhanced parameter efficiency and throughput.

Dense Training, Sparse Inference: Optimizing Mixture-of-Experts Language Models

Introduction

The dichotomy between the computational cost of training LLMs and the necessity for efficiency during inference presents a significant challenge in deep learning. Mixture-of-Experts (MoE) models have emerged as a viable solution by facilitating selective parameter utilization, which increases computational efficiency while maintaining, or even enhancing, model performance. Nevertheless, MoE models' overwhelming parameter requirements, often 2 to 4 times those of dense models, exacerbate memory consumption and decrease efficiency in autoregressive tasks. This paper introduces a hybrid approach, employing dense training coupled with sparse inference (DS-MoE), aimed at retaining the computational benefits of MoE models while mitigating their parameter inefficiency.

Methodology

Dense Training

The cornerstone of the DS-MoE framework is the adoption of dense gradient propagation during the training phase, involving all experts in the computation, as opposed to traditional sparse training methods. This full participation ensures efficient GPU utilization and balanced expert usage, avoiding the load imbalance that is a common pitfall of sparse training. A Mutual Information (MI) loss is introduced to promote load balance among experts and an even distribution of the computational work. This loss complements the standard autoregressive language modeling loss, balancing the model's focus between the primary task and expert efficiency.
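
A minimal PyTorch sketch of how such a mutual-information-style balancing term could be computed is given below. It assumes the loss rewards a uniform marginal expert distribution while keeping each token's routing confident; the exact formulation and coefficient in the paper may differ, and names such as `router_logits` and `mutual_information_loss` are illustrative placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def mutual_information_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical MI-style load-balancing term.

    router_logits: (num_tokens, num_experts) pre-softmax routing scores.
    Minimizing the returned value approximately maximizes
    I(expert; token) = H(expert) - H(expert | token): the marginal expert
    usage is pushed toward uniform (balanced load), while each token's
    routing distribution is pushed toward a confident (low-entropy) choice.
    """
    probs = F.softmax(router_logits, dim=-1)          # per-token gate probabilities
    marginal = probs.mean(dim=0)                      # average expert usage over the batch
    h_marginal = -(marginal * torch.log(marginal + 1e-9)).sum()
    h_conditional = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    return -h_marginal + h_conditional

# Combined objective (the 0.01 coefficient is a placeholder, not from the paper):
# loss = language_modeling_loss + 0.01 * mutual_information_loss(router_logits)
```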

Sparse Inference

For inference, DS-MoE models revert to sparsity, activating only a subset of experts based on routing scores or a predefined threshold. This significantly reduces the computational load while preserving the model's quality at inference time. The implementation also features Mixture of Attention Head (MoA) blocks, which further reduce computational demand by managing the attention heads in a similarly selective fashion.
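
To make the routing step concrete, here is a minimal PyTorch sketch of top-k or threshold-based expert selection at inference time. The function and argument names are hypothetical, and the selection rule approximates rather than reproduces the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def sparse_moe_forward(x, router, experts, top_k=None, threshold=None):
    """Hypothetical sparse-inference pass for a dense-trained MoE layer.

    x:        (num_tokens, hidden) activations.
    router:   nn.Linear producing (num_tokens, num_experts) scores.
    experts:  list of expert feed-forward modules.
    Keeps either the top_k highest-scoring experts per token, or every
    expert whose gate probability exceeds `threshold`.
    """
    gates = F.softmax(router(x), dim=-1)                       # (tokens, experts)
    if top_k is not None:
        _, top_idx = gates.topk(top_k, dim=-1)
        mask = torch.zeros_like(gates).scatter_(-1, top_idx, 1.0)
    else:
        mask = (gates >= threshold).float()
    gates = gates * mask
    gates = gates / gates.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize kept gates
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx = mask[:, e].nonzero(as_tuple=True)[0]       # tokens that selected expert e
        if token_idx.numel() > 0:
            out[token_idx] += gates[token_idx, e].unsqueeze(-1) * expert(x[token_idx])
    return out
```

Because every expert is trained densely, dropping low-scoring experts at inference changes only how much compute is spent per token, not which parameters exist, which is what enables the 30-40% activation rates reported in the abstract.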

Results and Discussion

Empirical evaluations underscore the DS-MoE model's capability to closely rival dense models in performance while significantly outstripping traditional MoE models in parameter efficiency. Key findings include:

  • DS-MoE models require substantially fewer total parameters to reach comparable performance levels, directly addressing the parameter inefficiency of standard sparse MoEs.
  • The approach achieves a 30-40% activation rate of the model's parameters during inference, striking a balance between computational efficiency and model performance.
  • Enhanced throughput in both computation-bounded and I/O-bounded scenarios demonstrates the DS-MoE model's superior efficiency across diverse operational contexts.

These results underscore the utility of the DS-MoE framework in making MoE models more tractable and efficient, particularly in environments where computational and memory resources are at a premium.

Future Directions

This research opens promising avenues for further optimization and exploration in the training and inference paradigms of LLMs. Future work may delve into refining the mutual information loss to foster even greater efficiency and exploring the scalability of the DS-MoE approach for models beyond the scope of current experiments. Additionally, the dynamic nature of the sparse inference process offers a fertile ground for developing more adaptive and context-aware routing mechanisms, potentially tailoring computational efforts to the specific demands of given tasks or inputs.

Conclusion

The proposed DS-MoE framework marks a significant step forward in resolving the intrinsic tension between the desire for large, expressive models and the imperative for computational efficiency. By merging dense training with sparse inference, this approach promises to make large-scale models more accessible and practical for a broader range of applications, advancing the state-of-the-art in efficient language modeling.
