Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models (2407.01906v2)
Abstract: Parameter-efficient fine-tuning (PEFT) is crucial for customizing LLMs with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT methods for LLMs with the Mixture-of-Experts (MoE) architecture, and our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning and find that MoE models with finer-grained experts are better at selecting the combination of experts most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.
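The core idea, tuning only the experts that the router most strongly activates on a downstream task while freezing everything else, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the toy MoE layer, the relevance score (mean routing probability over task tokens), and the 90% coverage threshold below are assumptions made purely for illustration.

```python
# Minimal sketch of expert-specialized fine-tuning as described in the abstract.
# Hypothetical names throughout (ToyMoELayer, freeze_all_but_relevant, ...);
# the relevance proxy and coverage threshold are illustrative assumptions.

import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """A toy MoE FFN layer: a linear router plus a list of expert MLPs."""

    def __init__(self, d_model: int = 64, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def routing_probs(self, x: torch.Tensor) -> torch.Tensor:
        # (tokens, n_experts) softmax routing distribution
        return self.router(x).softmax(dim=-1)


@torch.no_grad()
def expert_relevance(layer: ToyMoELayer, task_tokens: torch.Tensor) -> torch.Tensor:
    """Average routing probability per expert over downstream-task tokens."""
    return layer.routing_probs(task_tokens).mean(dim=0)


def freeze_all_but_relevant(layer: ToyMoELayer, task_tokens: torch.Tensor, coverage: float = 0.9):
    """Freeze the router and all experts, then unfreeze the smallest set of
    experts whose summed relevance reaches `coverage` of the routing mass."""
    scores = expert_relevance(layer, task_tokens)
    order = torch.argsort(scores, descending=True)
    cumulative = torch.cumsum(scores[order], dim=0)
    n_keep = int((cumulative < coverage).sum().item()) + 1
    selected = set(order[:n_keep].tolist())

    for p in layer.parameters():
        p.requires_grad_(False)          # freeze router and all experts
    for idx in selected:
        for p in layer.experts[idx].parameters():
            p.requires_grad_(True)       # tune only the task-relevant experts
    return selected


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ToyMoELayer()
    tokens = torch.randn(1024, 64)       # stand-in for hidden states of task data
    chosen = freeze_all_but_relevant(layer, tokens)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"tuning experts {sorted(chosen)}: {trainable}/{total} parameters trainable")
```

Selecting experts by cumulative routing mass mirrors the paper's observation that task routing tends to be highly concentrated: a small subset of experts typically covers most of the probability mass, so only a small fraction of parameters needs to be trained.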