Abstract

The Mixture of Experts (MoE) is a widely known neural architecture where an ensemble of specialized sub-models optimizes overall performance with a constant computational cost. However, conventional MoEs pose challenges at scale due to the need to store all experts in memory. In this paper, we push MoE to the limit. We propose extremely parameter-efficient MoE by uniquely combining MoE architecture with lightweight experts. Our MoE architecture outperforms standard parameter-efficient fine-tuning (PEFT) methods and is on par with full fine-tuning by only updating the lightweight experts -- less than 1% of an 11B parameter model. Furthermore, our method generalizes to unseen tasks as it does not depend on any prior task knowledge. Our research underscores the versatility of the mixture of experts architecture, showcasing its ability to deliver robust performance even when subjected to rigorous parameter constraints. Our code used in all the experiments is publicly available here: https://github.com/for-ai/parameter-efficient-moe.

A mixture of lightweight PEFT experts outperforms standard PEFT methods and matches full fine-tuning while updating less than 1% of the model's parameters.

Overview

  • Introduction of a highly parameter-efficient Mixture of Experts (MoE) architecture for neural network design.

  • Combination of MoE with parameter-efficient fine-tuning methods, greatly reducing the number of parameters updated during training.

  • Implementation of novel adaptations, Mixture of Vectors (MoV) and Mixture of LoRA (MoLoRA), which reduce memory usage and computational demands.

  • Comprehensive evaluation across 12 tasks and 55 datasets from the Public Pool of Prompts (P3) collection, showing performance competitive with full model fine-tuning.

  • Code made publicly available, inviting further research and application in the field.

Overview of Mixture of Experts Architecture

The Mixture of Experts (MoE) architecture is a concept in neural network design wherein a group of specialized models, known as experts, work in concert to optimize performance while maintaining constant computational cost. Traditional MoE architectures face scalability issues due to the necessity of storing all the experts in memory, making them less practical for large-scale use.
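To make this concrete, here is a minimal sketch of a token-level MoE layer with top-1 routing, written in PyTorch. The class and parameter names are illustrative assumptions rather than code from the paper; the point is that each token is processed by only one expert, so per-token compute stays roughly constant as experts are added.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top1MoE(nn.Module):
        """Illustrative token-level MoE layer with top-1 routing (not the paper's code)."""

        def __init__(self, d_model: int, d_ff: int, num_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)  # scores each token against each expert
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            tokens = x.reshape(-1, x.shape[-1])                 # (num_tokens, d_model)
            gate = F.softmax(self.router(tokens), dim=-1)       # routing probabilities
            weight, index = gate.max(dim=-1)                    # top-1 expert per token
            out = torch.zeros_like(tokens)
            for e, expert in enumerate(self.experts):
                mask = index == e
                if mask.any():
                    out[mask] = weight[mask, None] * expert(tokens[mask])
            return out.reshape(x.shape)

Memory, however, still grows with the number of experts, since every expert's weights must be resident; this is the scalability issue the paper targets.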

Advancements in Parameter-Efficient Fine-Tuning

The authors push MoE toward extreme parameter efficiency by pairing it with parameter-efficient fine-tuning (PEFT) methods, which substantially reduce the number of parameters that must be updated during fine-tuning. The PEFT methods used are (IA)³, which rescales inner activations with learned vectors, and Low-Rank Adaptation (LoRA). The proposed architecture matches the performance of full model fine-tuning while adjusting only a small fraction of the model's parameters (less than 1%). This is especially noteworthy as the method does not rely on prior knowledge of tasks and therefore generalizes well to new, unseen tasks.
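As a reference point, the sketch below shows the two PEFT primitives in isolation: a LoRA-style low-rank update added to a frozen linear layer, and an (IA)³-style learned rescaling vector. All names, ranks, and initializations here are assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update (LoRA-style sketch)."""

        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                         # pretrained weights stay frozen
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    class IA3Scale(nn.Module):
        """(IA)³-style rescaling: a single learned vector multiplies an activation."""

        def __init__(self, dim: int):
            super().__init__()
            self.vec = nn.Parameter(torch.ones(dim))            # ones init: identity at start

        def forward(self, h):
            return h * self.vec

In both cases the trainable parameters are tiny relative to the frozen base model, which is what makes a mixture of such experts affordable.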

Implementation and Practical Benefits

The proposed approach introduces two variants of MoE: Mixture of Vectors (MoV) and Mixture of LoRA (MoLoRA). In these adaptations, traditional dense experts are replaced with lightweight components such as (IA)³-style vectors or LoRA adapters. Unlike their dense counterparts, these experts require updates to far fewer parameters, significantly reducing memory usage and computational demands during both training and inference. This efficiency does not come at the cost of performance: MoV and MoLoRA outperform standard PEFT methods and match full model fine-tuning.
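A minimal sketch of the MoV idea follows, assuming a router that softly merges the per-expert (IA)³-style vectors into a single rescaling vector per token. This is a simplified illustration under those assumptions; the authors' actual implementation is in the linked repository.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MixtureOfVectors(nn.Module):
        """Sketch of an MoV-style layer: lightweight vector experts merged by a router."""

        def __init__(self, d_model: int, num_experts: int):
            super().__init__()
            self.router = nn.Linear(d_model, num_experts)
            # Each "expert" is just one vector, so parameters scale as num_experts * d_model.
            self.expert_vectors = nn.Parameter(torch.ones(num_experts, d_model))

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            # h: (batch, seq, d_model) activation to be rescaled
            gate = F.softmax(self.router(h), dim=-1)                      # (batch, seq, num_experts)
            merged = torch.einsum("bse,ed->bsd", gate, self.expert_vectors)
            return h * merged                                             # elementwise rescaling

Because only the router and the expert vectors are trainable, the update footprint stays a small fraction of the frozen backbone's parameters; MoLoRA follows the same pattern with LoRA adapters in place of the vectors.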

Comprehensive Evaluation

The models were evaluated through extensive experiments spanning 12 tasks across 55 datasets from the P3 collection, using Transformers from the T5 model family at sizes up to 11 billion parameters. In summary, this extremely parameter-efficient MoE framework demonstrates consistent improvements over standard PEFT methods and delivers performance competitive with full fine-tuning, offering a promising approach for large-scale model deployment. The research validates the effectiveness of MoE in parameter-constrained settings and contributes to the broader domain of efficient model fine-tuning. To encourage further exploration and application, the team has made their code publicly available.
