Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models (2404.05567v1)
Abstract: Mixture-of-Experts (MoE) LLMs can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bound scenarios. However, MoE models generally require 2-4$\times$ more parameters to match the performance of a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bound scenarios such as autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) that achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in total parameter count and performance while being computationally cheaper, activating only 30-40% of the model's parameters. Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
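To make the core idea concrete, below is a minimal PyTorch sketch of a dense-training / sparse-inference MoE layer: during training every expert processes every token, weighted by its router score, while at inference only the top-k experts per token are evaluated. This is an illustrative sketch under assumed settings, not the paper's implementation; the layer sizes, the softmax router, the top-k value, and all names (e.g. `DSMoELayer`) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSMoELayer(nn.Module):
    """Hypothetical MoE block: dense over all experts in training, top-k at inference."""

    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):
        # x: (num_tokens, d_model); scores: (num_tokens, n_experts)
        scores = F.softmax(self.router(x), dim=-1)
        if self.training:
            # Dense training: every expert sees every token, weighted by its router score.
            expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
            return (scores.unsqueeze(-1) * expert_out).sum(dim=1)
        # Sparse inference: route each token to its top-k experts only.
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == expert_id
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = DSMoELayer()
tokens = torch.randn(16, 512)

layer.train()
dense_out = layer(tokens)       # all 8 experts contribute to every token

layer.eval()
with torch.no_grad():
    sparse_out = layer(tokens)  # only the top-2 experts run per token
```

In this sketch the dense weighted sum during training gives every expert a gradient on every token, which is what allows the same weights to be served with sparse top-k routing at inference without the parameter overhead typically needed by conventionally trained sparse MoEs.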
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.
- Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
- Model preserving compression for neural networks. Advances in Neural Information Processing Systems, 35:38060–38074, 2022.
- Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- On the representation collapse of sparse mixture of experts. ArXiv, abs/2204.09179, 2022. URL https://api.semanticscholar.org/CorpusID:248266346.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- StableMoE: Stable routing strategy for mixture of experts. In Annual Meeting of the Association for Computational Linguistics, 2022. URL https://api.semanticscholar.org/CorpusID:248227505.
- DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
- GLaM: Efficient scaling of language models with mixture-of-experts. ArXiv, abs/2112.06905, 2021. URL https://api.semanticscholar.org/CorpusID:245124124.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- MegaBlocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Sparsely activated mixture-of-experts are robust multi-task learners. ArXiv, abs/2204.07689, 2022. URL https://api.semanticscholar.org/CorpusID:248227728.
- DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235358484.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Long short-term memory. Neural Computation, 9:1735–1780, 1997. URL https://api.semanticscholar.org/CorpusID:1915014.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Sparse upcycling: Training mixture-of-experts from dense checkpoints. ArXiv, abs/2212.05055, 2022. URL https://api.semanticscholar.org/CorpusID:254535822.
- Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
- Beyond distillation: Task-level mixture-of-experts for efficient inference. ArXiv, abs/2110.03742, 2021. URL https://api.semanticscholar.org/CorpusID:238531628.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- GShard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2020. URL https://api.semanticscholar.org/CorpusID:220265858.
- Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
- BASE layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 2021. URL https://api.semanticscholar.org/CorpusID:232428341.
- Branch-train-merge: Embarrassingly parallel training of expert language models. ArXiv, abs/2208.03306, 2022. URL https://api.semanticscholar.org/CorpusID:251371375.
- AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pp. 2736–2744, 2017.
- Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, pp. 22137–22176. PMLR, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066, 2017.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019.
- Is a modular architecture enough? ArXiv, abs/2206.02713, 2022. URL https://api.semanticscholar.org/CorpusID:249395289.
- Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1325–1334, 2019.
- Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp. 7197–7206. PMLR, 2020.
- CodeGen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.
- VA-RED$^2$: Video adaptive redundancy reduction. arXiv preprint arXiv:2102.07887, 2021.
- PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- Emergent mixture-of-experts: Can dense pre-trained transformers benefit from emergent modular structures? arXiv preprint arXiv:2310.10908, 2023.
- Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
- Hash layers for large sparse models. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235367626.
- Routing networks and the challenges of modular and compositional computation. ArXiv, abs/1904.12774, 2019. URL https://api.semanticscholar.org/CorpusID:139103965.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv preprint arXiv:2305.14705, 2023.
- ModuleFormer: Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640, 2023.
- Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.
- Scattered mixture-of-experts implementation. arXiv preprint arXiv:2403.08245, 2024.
- Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
- HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8612–8620, 2019.
- SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 409–424, 2018.
- Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017.
- Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45, 2020.
- BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8817–8826, 2018.
- Structured pruning learns compact and accurate models. In Association for Computational Linguistics (ACL), 2022.
- Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Mixture of attention heads: Selecting attention heads per token. arXiv preprint arXiv:2210.05144, 2022.
- MoEfication: Transformer feed-forward layers are mixtures of experts. arXiv preprint arXiv:2110.01786, 2021.
- PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
- Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368, 2022. URL https://api.semanticscholar.org/CorpusID:247011948.
- ST-MoE: Designing stable and transferable sparse expert models. 2022. URL https://api.semanticscholar.org/CorpusID:248496391.