OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (2402.01739v2)
Abstract: To help the open-source community better understand Mixture-of-Experts (MoE) based LLMs, we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting their potential for future LLM development. Another important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, which yields three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. Token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged thereafter. This imperfect routing can degrade performance, particularly in sequential tasks such as multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design in light of these observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
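Since the Drop-towards-the-End issue stems from capacity-limited routing, a small sketch may make the mechanism concrete. The snippet below is a minimal, hypothetical illustration (not the OpenMoE training code): it assumes top-1 routing with a fixed per-expert capacity and processes tokens in sequence order, so once an expert's capacity is exhausted, the tokens that get dropped cluster toward the end of the sequence. The function name, shapes, and the `capacity_factor` value are illustrative assumptions.

```python
# Minimal sketch (an assumption, not the OpenMoE implementation) of
# capacity-limited top-1 routing, illustrating "Drop-towards-the-End".
import numpy as np

def route_top1(router_logits: np.ndarray, capacity_factor: float = 1.25):
    """Assign each token to its top-1 expert, dropping overflow tokens.

    router_logits: [num_tokens, num_experts] scores from the router.
    Returns (assignments, dropped): assignments[i] is the expert index for
    token i, or -1 if the token was dropped.
    """
    num_tokens, num_experts = router_logits.shape
    # Each expert can process at most `capacity` tokens per batch.
    capacity = int(capacity_factor * num_tokens / num_experts)

    top1 = router_logits.argmax(axis=-1)            # preferred expert per token
    load = np.zeros(num_experts, dtype=int)          # tokens already assigned
    assignments = np.full(num_tokens, -1, dtype=int)
    dropped = []

    # Tokens are handled in sequence order, so once an expert is full, every
    # later token routed to it is dropped -- later positions (e.g. the end of
    # a multi-turn conversation) are therefore dropped more often.
    for i in range(num_tokens):
        e = top1[i]
        if load[e] < capacity:
            assignments[i] = e
            load[e] += 1
        else:
            dropped.append(i)
    return assignments, dropped

# Example: 16 tokens, 4 experts; a context-independent bias overloads
# expert 0, and the dropped positions cluster toward the end of the sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))
logits[:, 0] += 2.0
assignments, dropped = route_top1(logits)
print("dropped token positions:", dropped)
```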