OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (2402.01739v2)
Abstract: To help the open-source community better understand Mixture-of-Experts (MoE) based LLMs, we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to more than one trillion tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting their potential for future LLM development. Another important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, which leads to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We find that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged thereafter. This imperfect routing can result in performance degradation, particularly in sequential tasks such as multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design in light of these observations and analyses. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
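To make the Drop-towards-the-End finding concrete, the following sketch simulates top-1 token routing with a fixed expert-capacity buffer, the standard mechanism in capacity-limited MoE layers. It is a minimal illustration, not the authors' implementation: the number of experts, the capacity factor, and the random router logits are all illustrative assumptions. Because capacity slots are claimed in sequence order, the tokens that get dropped cluster toward the end of the sequence.

```python
# Minimal sketch (not the authors' code) of top-1 MoE routing with a fixed
# expert capacity, illustrating "Drop-towards-the-End": once an expert's
# capacity buffer is full, later tokens routed to it are dropped.
# All sizes and the random router logits below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

num_experts = 4
seq_len = 64
capacity_factor = 1.25
# A common definition of per-expert capacity in MoE layers:
# capacity = capacity_factor * num_tokens / num_experts
capacity = int(capacity_factor * seq_len / num_experts)

# Stand-in router logits; in a real MoE layer these come from a learned
# linear projection of each token's hidden state.
router_logits = rng.normal(size=(seq_len, num_experts))
expert_choice = router_logits.argmax(axis=-1)  # top-1 expert per token

# Tokens claim capacity slots in sequence order, so earlier tokens fill the
# buffers first and later tokens are the ones that end up dropped.
slots_used = np.zeros(num_experts, dtype=int)
dropped = []
for pos, expert in enumerate(expert_choice):
    if slots_used[expert] < capacity:
        slots_used[expert] += 1
    else:
        dropped.append(pos)  # token bypasses the expert (residual path only)

print(f"capacity per expert: {capacity}")
print(f"dropped token positions: {dropped}")
if dropped:
    print(f"mean dropped position: {np.mean(dropped):.1f} "
          f"(sequence midpoint = {seq_len / 2})")
```

Running this, the mean position of dropped tokens lands well past the sequence midpoint, which mirrors the paper's observation that later tokens, such as those in the later turns of a multi-turn conversation, are disproportionately affected by capacity-based dropping.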