Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (2402.14800v2)
Abstract: A pivotal advancement in the development of large language models (LLMs) is the emergence of Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer active parameters per token, yet they remain hard to deploy because of their immense total parameter counts. Unlike previous weight pruning methods that rely on specifically designed hardware, this paper aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to the best of our knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model size and increase inference speed while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.
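As a rough illustration of what expert-level sparsification means in practice, the following PyTorch sketch shows a toy top-k MoE feed-forward layer with (a) static expert pruning, which drops whole experts and the matching router rows after training, and (b) dynamic expert skipping, which ignores low-weight experts at inference time. The names used here (`ToyMoELayer`, `prune_experts`, `skip_threshold`) are illustrative assumptions rather than the paper's released API, and the criterion for choosing which experts to keep is left abstract.

```python
# Minimal sketch of expert-level sparsification for an MoE layer.
# Illustrative only: not the authors' released implementation.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Toy token-level top-k MoE feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, skip_threshold: float = 0.0) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                                   # (tokens, experts)
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)     # renormalize over top-k
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            for w, e in zip(weights[t], idx[t]):
                # Dynamic expert skipping: drop low-weight experts at inference.
                if w.item() < skip_threshold:
                    continue
                out[t] += w * self.experts[int(e)](x[t])
        return out


@torch.no_grad()
def prune_experts(layer: ToyMoELayer, keep_ids: list) -> None:
    """Statically remove experts not in keep_ids and shrink the router to match."""
    layer.experts = nn.ModuleList(layer.experts[i] for i in keep_ids)
    new_router = nn.Linear(layer.router.in_features, len(keep_ids), bias=False)
    new_router.weight.copy_(layer.router.weight[keep_ids])
    layer.router = new_router
    layer.top_k = min(layer.top_k, len(keep_ids))


if __name__ == "__main__":
    layer = ToyMoELayer(d_model=16, d_ff=32, num_experts=8, top_k=2)
    x = torch.randn(4, 16)
    prune_experts(layer, keep_ids=[0, 2, 3, 5])  # experts chosen by some calibration score
    y = layer(x, skip_threshold=0.3)             # skip experts with low routing weight
    print(y.shape)                               # torch.Size([4, 16])
```

In practice, the experts to keep and the skipping criterion would be selected from calibration data, in either a task-agnostic or task-specific manner, as the abstract describes; the toy selection above is purely for demonstration.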