Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models (2407.01906v2)
Abstract: Parameter-efficient fine-tuning (PEFT) is crucial for customizing LLMs with constrained resources. Although various PEFT methods exist for dense-architecture LLMs, PEFT for sparse-architecture LLMs remains underexplored. In this work, we study PEFT methods for LLMs with the Mixture-of-Experts (MoE) architecture, and our contributions are threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves tuning efficiency but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning and find that MoE models with finer-grained experts are better at selecting the combination of experts most relevant to downstream tasks, thereby enhancing both training efficiency and effectiveness. Our code is available at https://github.com/deepseek-ai/ESFT.
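The core idea, tuning only the experts that the router most strongly activates on a downstream task while freezing everything else, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the toy MoE layer, the relevance score (mean routing probability over task tokens), and the 90% coverage threshold below are assumptions made purely for illustration.

```python
# Minimal sketch of expert-specialized fine-tuning as described in the abstract.
# Hypothetical names throughout (ToyMoELayer, freeze_all_but_relevant, ...);
# the relevance proxy and coverage threshold are illustrative assumptions.

import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """A toy MoE FFN layer: a linear router plus a list of expert MLPs."""

    def __init__(self, d_model: int = 64, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def routing_probs(self, x: torch.Tensor) -> torch.Tensor:
        # (tokens, n_experts) softmax routing distribution
        return self.router(x).softmax(dim=-1)


@torch.no_grad()
def expert_relevance(layer: ToyMoELayer, task_tokens: torch.Tensor) -> torch.Tensor:
    """Average routing probability per expert over downstream-task tokens."""
    return layer.routing_probs(task_tokens).mean(dim=0)


def freeze_all_but_relevant(layer: ToyMoELayer, task_tokens: torch.Tensor, coverage: float = 0.9):
    """Freeze the router and all experts, then unfreeze the smallest set of
    experts whose summed relevance reaches `coverage` of the routing mass."""
    scores = expert_relevance(layer, task_tokens)
    order = torch.argsort(scores, descending=True)
    cumulative = torch.cumsum(scores[order], dim=0)
    n_keep = int((cumulative < coverage).sum().item()) + 1
    selected = set(order[:n_keep].tolist())

    for p in layer.parameters():
        p.requires_grad_(False)          # freeze router and all experts
    for idx in selected:
        for p in layer.experts[idx].parameters():
            p.requires_grad_(True)       # tune only the task-relevant experts
    return selected


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = ToyMoELayer()
    tokens = torch.randn(1024, 64)       # stand-in for hidden states of task data
    chosen = freeze_all_but_relevant(layer, tokens)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"tuning experts {sorted(chosen)}: {trainable}/{total} parameters trainable")
```

Selecting experts by cumulative routing mass mirrors the paper's observation that task routing tends to be highly concentrated: a small subset of experts typically covers most of the probability mass, so only a small fraction of parameters needs to be trained.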