OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (2402.01739v2)
Abstract: To help the open-source community better understand Mixture-of-Experts (MoE) based LLMs, we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting their potential for future LLM development. Another important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, which yields three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. Token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged thereafter. This imperfect routing can degrade performance, particularly in sequential tasks such as multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design in light of these observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.
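Since the Drop-towards-the-End issue stems from capacity-limited routing, a small sketch may make the mechanism concrete. The snippet below is a minimal, hypothetical illustration (not the OpenMoE training code): it assumes top-1 routing with a fixed per-expert capacity and processes tokens in sequence order, so once an expert's capacity is exhausted, the tokens that get dropped cluster toward the end of the sequence. The function name, shapes, and the `capacity_factor` value are illustrative assumptions.

```python
# Minimal sketch (an assumption, not the OpenMoE implementation) of
# capacity-limited top-1 routing, illustrating "Drop-towards-the-End".
import numpy as np

def route_top1(router_logits: np.ndarray, capacity_factor: float = 1.25):
    """Assign each token to its top-1 expert, dropping overflow tokens.

    router_logits: [num_tokens, num_experts] scores from the router.
    Returns (assignments, dropped): assignments[i] is the expert index for
    token i, or -1 if the token was dropped.
    """
    num_tokens, num_experts = router_logits.shape
    # Each expert can process at most `capacity` tokens per batch.
    capacity = int(capacity_factor * num_tokens / num_experts)

    top1 = router_logits.argmax(axis=-1)            # preferred expert per token
    load = np.zeros(num_experts, dtype=int)          # tokens already assigned
    assignments = np.full(num_tokens, -1, dtype=int)
    dropped = []

    # Tokens are handled in sequence order, so once an expert is full, every
    # later token routed to it is dropped -- later positions (e.g. the end of
    # a multi-turn conversation) are therefore dropped more often.
    for i in range(num_tokens):
        e = top1[i]
        if load[e] < capacity:
            assignments[i] = e
            load[e] += 1
        else:
            dropped.append(i)
    return assignments, dropped

# Example: 16 tokens, 4 experts; a context-independent bias overloads
# expert 0, and the dropped positions cluster toward the end of the sequence.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))
logits[:, 0] += 2.0
assignments, dropped = route_top1(logits)
print("dropped token positions:", dropped)
```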