
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models

(2402.01739)
Published Jan 29, 2024 in cs.CL, cs.AI, cs.DC, and cs.LG

Abstract

To help the open-source community have a better understanding of Mixture-of-Experts (MoE) based LLMs, we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting the potential effectiveness for future LLM development. One more important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.

Figure: OpenMoE models' token prediction accuracy, detailing usage of the UL2 and CausalLM objectives at specific training steps.

Overview

  • OpenMoE introduces decoder-only mixture-of-experts LLMs to the open-source community, offering a variety of sizes and trained on over 1 trillion tokens.

  • Efficiency gains of MoE LLMs over dense LLMs are highlighted, most notably in OpenMoE's strong performance on single-turn dialogue tasks.

  • An examination of MoE routing mechanisms reveals a bias towards token ID-based decisions over contextual understanding, highlighting potential limitations.

  • The paper proposes refinements to the MoE design and the training data mixture to address these limitations, and calls for future community efforts to advance LLM development.

Introduction to OpenMoE

The open-source community recently gained a remarkable tool with the release of OpenMoE, a series of decoder-only mixture-of-experts (MoE) LLMs. The models range in size from 650M to 34B parameters and are trained on datasets exceeding 1 trillion tokens. The ambition behind OpenMoE is threefold: to document the process of training a decoder-only MoE LLM, to delve into the intricacies of MoE routing mechanisms, and to serve as a catalyst for further MoE LLM development within the open-source milieu.

MoE Efficiency and Open Access

A central finding from the release of OpenMoE is the efficiency of MoE-based LLMs compared to their dense counterparts. MoE LLMs exhibit a more favorable cost-effectiveness balance, indicating their viability for future LLM endeavors. The paper details the performance of the OpenMoE-8B/32E models, which are compared against OpenLLaMA-3B and TinyLLaMA-1.1B, two dense models with a higher training cost. It is particularly notable that the OpenMoE-8B/32E-Chat model performed substantially better in single-turn conversations on MT-Bench, indicating its potential in conversational AI applications.
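To make the cost argument concrete, the sketch below contrasts the total and activated parameter counts of a generic MoE feed-forward layer. The layer sizes, expert count, and top-2 routing are illustrative assumptions rather than OpenMoE's actual configuration; the point is only that per-token compute scales with the activated experts while total capacity scales with all experts.

```python
# Rough illustration (not the paper's exact configs): per-token compute in an
# MoE layer scales with the number of *activated* experts, while the total
# parameter count scales with *all* experts.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of a standard two-matrix feed-forward block."""
    return 2 * d_model * d_ff

def moe_total_params(d_model: int, d_ff: int, num_experts: int) -> int:
    """Total parameters of an MoE feed-forward layer (router weights omitted)."""
    return num_experts * ffn_params(d_model, d_ff)

def moe_active_params(d_model: int, d_ff: int, top_k: int) -> int:
    """Parameters actually used per token when routing to top_k experts."""
    return top_k * ffn_params(d_model, d_ff)

# Hypothetical sizes chosen only for illustration.
d_model, d_ff, num_experts, top_k = 2048, 8192, 32, 2

total = moe_total_params(d_model, d_ff, num_experts)   # capacity of 32 experts
active = moe_active_params(d_model, d_ff, top_k)       # compute of ~2 dense FFNs
print(f"total MoE params: {total:,}, active per token: {active:,}")
```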

In-Depth Analysis of OpenMoE

Perhaps more compelling is the in-depth examination of the routing mechanisms within MoE models. Routing decisions appear to be largely token-ID based, with little regard for context. Moreover, routing specialization is established early in training and remains largely fixed thereafter. Combined with fixed expert capacity, this can degrade performance in settings where sequential understanding is critical, such as multi-turn conversations, because tokens that appear later in a sequence are more likely to be dropped.
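The drop-towards-the-end behavior follows directly from how capacity-limited routers work. The sketch below implements a simplified top-1 router with a fixed per-expert capacity; the capacity formula, the bias toward one expert, and all sizes are assumptions made for illustration, not the OpenMoE implementation, but they reproduce the pattern in which later tokens are dropped once popular experts fill up.

```python
import numpy as np

# Illustrative top-1 routing with a fixed expert capacity, processed in sequence
# order. Once an expert's buffer is full, later tokens assigned to it are
# dropped, which mirrors the "Drop-towards-the-End" effect described above.
# This is a simplified sketch, not the OpenMoE routing code.

def route_top1(logits: np.ndarray, num_experts: int, capacity_factor: float = 1.25):
    num_tokens = logits.shape[0]
    capacity = int(capacity_factor * num_tokens / num_experts)
    expert_load = np.zeros(num_experts, dtype=int)
    assignments = []  # (token_index, expert_index), or (token_index, None) if dropped
    for t in range(num_tokens):              # sequence order matters here
        expert = int(np.argmax(logits[t]))
        if expert_load[expert] < capacity:
            expert_load[expert] += 1
            assignments.append((t, expert))
        else:
            assignments.append((t, None))    # token dropped: no expert processes it
    return assignments

# Toy example: 16 tokens, 4 experts, router logits biased toward expert 0,
# mimicking context-independent, token-ID-driven routing.
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))
logits[:, 0] += 2.0
dropped = [t for t, e in route_top1(logits, num_experts=4) if e is None]
print("dropped token positions:", dropped)   # later positions bear the drops
```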

Recalibrating the Model Design

The paper does not shy away from acknowledging limitations, such as initial suboptimal design choices in MoE architecture and an overly code-heavy dataset mix. Reflecting on these aspects provides an opportunity to share learnings that could benefit model iteration and innovation in the community. To address the discovered challenges, a strategic pivot is suggested, including reducing the proportion of code in the training data mix and refining the MoE architecture to minimize context-independent token routing.
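As a concrete, if simplified, picture of the first suggestion, the snippet below down-weights a code-heavy data mixture and renormalizes the sampling weights. The domain names and proportions are hypothetical and do not reflect the paper's actual data mix.

```python
# Hypothetical sketch of rebalancing a pre-training data mixture by
# down-weighting code; domains and proportions are illustrative only.

def rebalance(mixture: dict[str, float], domain: str, scale: float) -> dict[str, float]:
    """Scale one domain's sampling weight and renormalize so weights sum to 1."""
    weights = dict(mixture)
    weights[domain] *= scale
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

original = {"web_text": 0.40, "code": 0.50, "books": 0.10}   # code-heavy mix
adjusted = rebalance(original, "code", scale=0.3)            # shrink the code share
print(adjusted)  # roughly {'web_text': 0.62, 'code': 0.23, 'books': 0.15}
```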

Conclusion and Future Directions

In closing, OpenMoE signifies an evolutionary step in LLM development. It delivers an enhanced understanding of MoE models, complete with strengths and areas for improvement. The research articulates potential strategies to ameliorate identified deficiencies, especially emphasizing the imperative for balanced token routing. The initiative sets the groundwork for the open-source community to push the boundaries of LLM capabilities and chart the course for subsequent endeavors in the generative AI landscape.
