OrchMoE: Efficient Multi-Adapter Learning with Task-Skill Synergy (2401.10559v1)
Abstract: We advance the field of Parameter-Efficient Fine-Tuning (PEFT) with our novel multi-adapter method, OrchMoE, which capitalizes on a modular skill architecture for enhanced forward transfer in neural networks. Unlike prior models that depend on explicit task identification inputs, OrchMoE automatically discerns task categories, streamlining the learning process. This is achieved through an integrated mechanism comprising an Automatic Task Classification module and a Task-Skill Allocation module, which collectively deduce task-specific classifications and tailor skill allocation matrices. Our extensive evaluations on the 'Super Natural Instructions' dataset, featuring 1,600 diverse instructional tasks, indicate that OrchMoE substantially outperforms comparable multi-adapter baselines in both performance and sample efficiency while operating within the same parameter constraints. These findings suggest that OrchMoE offers a significant leap forward in multi-task learning efficiency.
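The sketch below illustrates the kind of task-skill routing the abstract describes: a task classification head infers a soft task assignment directly from the hidden states (no explicit task ID), and a task-skill allocation matrix mixes a small bank of LoRA-style low-rank adapters. This is a minimal illustration under stated assumptions, not the paper's exact architecture; the class name `OrchMoESketch`, the mean-pooling choice, the softmax routing, and all dimensions are hypothetical.

```python
# Minimal sketch of task-aware skill routing over LoRA-style adapters.
# All names, shapes, and routing choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrchMoESketch(nn.Module):
    """Hypothetical layer: infers a soft task assignment from the input,
    then mixes S low-rank "skill" adapters via a task-skill allocation matrix."""

    def __init__(self, d_model=768, n_tasks=8, n_skills=4, rank=8):
        super().__init__()
        # Automatic task classification: soft task probabilities from
        # mean-pooled hidden states (no explicit task identifier needed).
        self.task_classifier = nn.Linear(d_model, n_tasks)
        # Task-skill allocation matrix: one row of skill logits per task.
        self.task_skill_alloc = nn.Parameter(torch.zeros(n_tasks, n_skills))
        # Bank of S LoRA-style skill adapters (low-rank A/B factor pairs).
        self.lora_A = nn.Parameter(torch.randn(n_skills, d_model, rank) * 0.02)
        self.lora_B = nn.Parameter(torch.zeros(n_skills, rank, d_model))

    def forward(self, hidden):                     # hidden: (batch, seq, d_model)
        pooled = hidden.mean(dim=1)                # (batch, d_model)
        task_probs = F.softmax(self.task_classifier(pooled), dim=-1)           # (batch, n_tasks)
        skill_weights = F.softmax(task_probs @ self.task_skill_alloc, dim=-1)  # (batch, n_skills)
        # Combine the low-rank updates according to the inferred skill weights.
        delta = torch.einsum("bsd,kdr,kre,bk->bse",
                             hidden, self.lora_A, self.lora_B, skill_weights)
        return hidden + delta


if __name__ == "__main__":
    layer = OrchMoESketch()
    x = torch.randn(2, 16, 768)
    print(layer(x).shape)  # torch.Size([2, 16, 768])
```

Initializing the B factors to zero keeps the mixed update at zero before training, a common LoRA convention; the allocation matrix and classifier are then learned jointly with the adapters.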
- Multi-head adapter routing for cross-task generalization, 2023.
- BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
- GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
- GLM: General language model pretraining with autoregressive blank infilling, 2022.
- LoRA: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
- Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
- The power of scale for parameter-efficient prompt tuning, 2021.
- Prefix-tuning: Optimizing continuous prompts for generation, 2021.
- Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
- P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks, 2022.
- SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712, 2016.
- Combining modular skills in multitask learning, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer, 2023.
- Meta-dataset: A dataset of datasets for learning to learn from few examples. CoRR, abs/1903.03096, 2019.
- Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
- Attention is all you need, 2023.
- Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics.
- Customizable combination of parameter-efficient modules for multi-task learning, 2023.
- MultiLoRA: Democratizing LoRA for better multi-task learning, 2023.
- Pushing mixture of experts to the limit: Extremely parameter-efficient MoE for instruction tuning, 2023.
- AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning, 2023.
- Mixture-of-experts with expert choice routing, 2022.