Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models (2404.05567v1)
Abstract: Mixture-of-Experts (MoE) LLMs can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bound scenarios. However, MoE models generally require 2-4$\times$ more parameters to match the performance of a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bound scenarios such as autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) that achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in total parameter count and performance while being computationally cheaper, activating only 30-40% of the model's parameters. Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
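To make the core idea concrete, below is a minimal PyTorch sketch of a dense-training / sparse-inference MoE layer: during training every expert processes every token, weighted by its router score, while at inference only the top-k experts per token are evaluated. This is an illustrative sketch under assumed settings, not the paper's implementation; the layer sizes, the softmax router, the top-k value, and all names (e.g. `DSMoELayer`) are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSMoELayer(nn.Module):
    """Hypothetical MoE block: dense over all experts in training, top-k at inference."""

    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):
        # x: (num_tokens, d_model); scores: (num_tokens, n_experts)
        scores = F.softmax(self.router(x), dim=-1)
        if self.training:
            # Dense training: every expert sees every token, weighted by its router score.
            expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
            return (scores.unsqueeze(-1) * expert_out).sum(dim=1)
        # Sparse inference: route each token to its top-k experts only.
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == expert_id
                if mask.any():
                    out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


layer = DSMoELayer()
tokens = torch.randn(16, 512)

layer.train()
dense_out = layer(tokens)       # all 8 experts contribute to every token

layer.eval()
with torch.no_grad():
    sparse_out = layer(tokens)  # only the top-2 experts run per token
```

In this sketch the dense weighted sum during training gives every expert a gradient on every token, which is what allows the same weights to be served with sparse top-k routing at inference without the parameter overhead typically needed by conventionally trained sparse MoEs.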
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.
- Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791, 2019.
- Model preserving compression for neural networks. Advances in Neural Information Processing Systems, 35:38060–38074, 2022.
- Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
- On the representation collapse of sparse mixture of experts. ArXiv, abs/2204.09179, 2022. URL https://api.semanticscholar.org/CorpusID:248266346.
- Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- StableMoE: Stable routing strategy for mixture of experts. In Annual Meeting of the Association for Computational Linguistics, 2022. URL https://api.semanticscholar.org/CorpusID:248227505.
- DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:52967399.
- GLaM: Efficient scaling of language models with mixture-of-experts. ArXiv, abs/2112.06905, 2021. URL https://api.semanticscholar.org/CorpusID:245124124.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- MegaBlocks: Efficient sparse training with mixture-of-experts. Proceedings of Machine Learning and Systems, 5, 2023.
- The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- A framework for few-shot language model evaluation, December 2023. URL https://zenodo.org/records/10256836.
- Sparsely activated mixture-of-experts are robust multi-task learners. ArXiv, abs/2204.07689, 2022. URL https://api.semanticscholar.org/CorpusID:248227728.
- DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235358484.
- Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Long short-term memory. Neural Computation, 9:1735–1780, 1997. URL https://api.semanticscholar.org/CorpusID:1915014.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Sparse upcycling: Training mixture-of-experts from dense checkpoints. ArXiv, abs/2212.05055, 2022. URL https://api.semanticscholar.org/CorpusID:254535822.
- Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5, 2023.
- Beyond distillation: Task-level mixture-of-experts for efficient inference. ArXiv, abs/2110.03742, 2021. URL https://api.semanticscholar.org/CorpusID:238531628.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- GShard: Scaling giant models with conditional computation and automatic sharding. ArXiv, abs/2006.16668, 2020. URL https://api.semanticscholar.org/CorpusID:220265858.
- Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
- BASE layers: Simplifying training of large, sparse models. In International Conference on Machine Learning, 2021. URL https://api.semanticscholar.org/CorpusID:232428341.
- Branch-train-merge: Embarrassingly parallel training of expert language models. ArXiv, abs/2208.03306, 2022. URL https://api.semanticscholar.org/CorpusID:251371375.
- AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pp. 2736–2744, 2017.
- Deja Vu: Contextual sparsity for efficient LLMs at inference time. In International Conference on Machine Learning, pp. 22137–22176. PMLR, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- ThiNet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pp. 5058–5066, 2017.
- Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Are sixteen heads really better than one? Advances in neural information processing systems, 32, 2019.
- Is a modular architecture enough? ArXiv, abs/2206.02713, 2022. URL https://api.semanticscholar.org/CorpusID:249395289.
- Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1325–1334, 2019.
- Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp. 7197–7206. PMLR, 2020.
- CodeGen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.
- VA-RED$^2$: Video adaptive redundancy reduction. arXiv preprint arXiv:2102.07887, 2021.
- PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
- Emergent mixture-of-experts: Can dense pre-trained transformers benefit from emergent modular structures? arXiv preprint arXiv:2310.10908, 2023.
- Improving language understanding by generative pre-training. 2018. URL https://api.semanticscholar.org/CorpusID:49313245.
- ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
- Hash layers for large sparse models. In Neural Information Processing Systems, 2021. URL https://api.semanticscholar.org/CorpusID:235367626.
- Routing networks and the challenges of modular and compositional computation. ArXiv, abs/1904.12774, 2019. URL https://api.semanticscholar.org/CorpusID:139103965.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Mixture-of-experts meets instruction tuning: A winning combination for large language models. arXiv preprint arXiv:2305.14705, 2023.
- ModuleFormer: Learning modular large language models from uncurated data. arXiv preprint arXiv:2306.04640, 2023.
- Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.
- Scattered mixture-of-experts implementation. arXiv preprint arXiv:2403.08245, 2024.
- Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489.
- Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019.
- HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8612–8620, 2019.
- SkipNet: Learning dynamic routing in convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 409–424, 2018.
- Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209, 2017.
- Learning structured sparsity in deep neural networks. Advances in neural information processing systems, 29, 2016.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pp. 38–45, 2020.
- BlockDrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8817–8826, 2018.
- Structured pruning learns compact and accurate models. In Association for Computational Linguistics (ACL), 2022.
- Sheared LLaMA: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
- Mixture of attention heads: Selecting attention heads per token. arXiv preprint arXiv:2210.05144, 2022.
- MoEfication: Transformer feed-forward layers are mixtures of experts. arXiv preprint arXiv:2110.01786, 2021.
- PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
- Mixture-of-experts with expert choice routing. ArXiv, abs/2202.09368, 2022. URL https://api.semanticscholar.org/CorpusID:247011948.
- ST-MoE: Designing stable and transferable sparse expert models. 2022. URL https://api.semanticscholar.org/CorpusID:248496391.