
OLMoE: Open Mixture-of-Experts Language Models (2409.02060v2)

Published 3 Sep 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We introduce OLMoE, a fully open, state-of-the-art LLM leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.


Summary

  • The paper introduces a fully open-source Mixture-of-Experts language model with 6.9B total parameters and only 1.3B active per token, delivering strong performance at low per-token compute.
  • It employs a dropless token-choice routing mechanism with auxiliary load balancing and Z-losses, achieving faster training with fewer FLOPs than dense models.
  • The release of comprehensive training data, code, logs, and checkpoints fosters reproducibility and advances research in MoE architectures.

The paper "OLMoE: Open Mixture-of-Experts LLMs" (2409.02060) introduces OLMoE, a fully open-source Mixture-of-Experts (MoE) LLM, and its instruction-tuned variant, OLMoE-Instruct. The authors aim to address the lack of openness in existing MoE models, which hinders research and development in this area. OLMoE has 6.9 billion total parameters but only activates 1.3 billion parameters per input token, offering a favorable cost-performance trade-off. It was pretrained on 5.1 trillion tokens.

Key Contributions and Openness:

The primary contribution is the release of a state-of-the-art MoE model that is fully open:

  • Model Weights: Available on Hugging Face for OLMoE (base, SFT, and DPO/Instruct versions).
  • Training Data: The pretraining dataset (OLMoE-mix) and adaptation datasets are released.
  • Training Code: The codebase used for pretraining and adaptation is open-sourced on GitHub.
  • Training Logs: Detailed logs, including intermediate checkpoints every 5000 steps, are available via Weights & Biases.

This level of openness is intended to facilitate research into MoE architectures and training.

Model Architecture and Training:

OLMoE is a decoder-only transformer. Key architectural and training details include:

  • Active Parameters: 1.3 billion.
  • Total Parameters: 6.9 billion.
  • Expert Configuration: Each MoE layer has 64 small experts, with 8 experts activated per token. The FFN dimension for each expert is 1,024.
  • Routing Mechanism: Dropless token choice routing is used, where a learned linear router selects the top-k experts for each token.
  • Auxiliary Losses: The training objective includes the standard cross-entropy loss plus two auxiliary losses:
    • Load Balancing Loss ($\mathcal{L}_{LB}$, weight $\alpha = 0.01$): encourages an even distribution of tokens across experts.
    • Router Z-Loss ($\mathcal{L}_{RZ}$, weight $\beta = 0.001$): penalizes large router logits to improve training stability.
    • The final objective is $\mathcal{L} = \mathcal{L}_{CE} + \alpha \mathcal{L}_{LB} + \beta \mathcal{L}_{RZ}$ (a minimal sketch of this setup follows the list below).
  • Pretraining Data (OLMoE-mix): A 5.1 trillion token dataset combining DCLM-Baseline (filtered Common Crawl) with high-quality components from Dolma 1.7 (StarCoder, peS2o, arXiv, Wikipedia, OpenWebMath, Algebraic Stack). Specific filters were applied to enhance data quality.
  • Adaptation: OLMoE-Instruct is created through a two-stage process:

    1. Instruction Tuning (SFT): Using a mix including Tulu 2 SFT, No Robots, CodeFeedback, MetaMathQA, and a subset of Daring Anteater. More code and math data were added to boost performance in these areas.
    2. Preference Tuning (DPO): Using a binarized and filtered version of UltraFeedback.
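
To make the routing objective above concrete, here is a minimal PyTorch sketch of a dropless top-k token-choice router with the two auxiliary losses (load balancing and router z-loss). It follows the paper's description but is not the released OLMoE code; the module name, shapes, and the exact loss normalizations are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Minimal dropless token-choice router sketch (not the released OLMoE code).

    Routes each token to its top-k experts and returns the two auxiliary
    losses described above: a load-balancing loss and a router z-loss.
    """

    def __init__(self, d_model: int, n_experts: int = 64, k: int = 8):
        super().__init__()
        self.n_experts = n_experts
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # learned linear router

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model) -> logits: (num_tokens, n_experts)
        logits = self.router(x)
        probs = F.softmax(logits, dim=-1)

        # Top-k expert selection per token ("token choice"); dropless means
        # every selected (token, expert) pair is actually computed.
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)

        # Load-balancing loss: fraction of assignments dispatched to each expert
        # times the mean router probability for that expert, summed over experts
        # and scaled by n_experts (Switch-style formulation, assumed here).
        dispatch = F.one_hot(topk_idx, self.n_experts).float().sum(dim=1)  # (tokens, experts)
        frac_assignments = dispatch.mean(dim=0) / self.k
        mean_probs = probs.mean(dim=0)
        lb_loss = self.n_experts * torch.sum(frac_assignments * mean_probs)

        # Router z-loss: penalize large logits via the squared log-sum-exp.
        z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()

        return topk_probs, topk_idx, lb_loss, z_loss

# Combined objective with the weights reported above:
# loss = ce_loss + 0.01 * lb_loss + 0.001 * z_loss
```

In a full model, a router like this sits in front of each MoE layer's 64 expert FFNs, and the two auxiliary terms are added to the cross-entropy loss with the weights given above.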

Experimental Design Choices and Findings:

The paper details numerous experiments that informed OLMoE's design:

  • MoE vs. Dense: MoEs train ~2x faster in terms of wall-clock time and reach equivalent performance with ~3x fewer tokens/FLOPs compared to dense models with similar active parameters.

  • Expert Granularity: Finer-grained experts (a larger number of smaller experts) generally improve performance, with diminishing returns. OLMoE uses 64 experts per layer with 8 active.

  • Shared Experts: No shared expert is used, as experiments showed it slightly worsened performance by reducing expert combination flexibility.

  • Routing Algorithm: Dropless token-choice routing outperformed expert-choice routing.

  • Sparse Upcycling: Training from scratch was found to be more beneficial than sparsely upcycling a pretrained dense LM for their compute budget, especially as upcycling constrains hyperparameter choices.

  • Load Balancing Loss: Essential for preventing expert collapse and improving performance.

  • Router Z-Loss: Improves stability and performance.

  • Dataset: The custom OLMoE-mix outperformed Dolma 1.7.

  • Initialization: Truncated normal initialization (std 0.02, truncated at 3 standard deviations) provided more stable training; this and QK-Norm are illustrated in the sketch after this list.

  • Normalization: RMSNorm (with parameters included in weight decay) was chosen over non-parametric LayerNorm for better performance, despite a throughput reduction. QK-Norm (normalizing query and key projections) also improved stability and performance.

  • AdamW Epsilon: Reduced to 1e-8 for better convergence.

  • Adaptation Settings:

    • Auxiliary losses (load balancing) were not used during SFT/DPO, as including them slightly degraded performance while omitting them did not significantly harm expert balance.
    • The checkpoint taken after annealing was better for adaptation than the pre-annealing checkpoint.
    • DPO was chosen over KTO for the final OLMoE-Instruct, though KTO performed comparably.
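
Two of the stability-related choices above, truncated-normal initialization and QK-Norm on top of RMSNorm, are easy to illustrate. The snippet below is a hedged sketch rather than the OLMoE training code; the class names and the exact placement of the normalization are assumptions.

```python
import torch
import torch.nn as nn

# Truncated normal init: std 0.02, truncated at 3 standard deviations (|w| <= 0.06).
def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, mean=0.0, std=0.02, a=-0.06, b=0.06)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

class RMSNorm(nn.Module):
    """Minimal RMSNorm with a learnable scale (the paper keeps it in weight decay)."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class QKProjWithNorm(nn.Module):
    """QK-Norm sketch: normalize the query and key projections before attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.q_norm, self.k_norm = RMSNorm(d_model), RMSNorm(d_model)  # assumed placement

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        q = self.q_norm(self.q_proj(x)).view(b, t, self.n_heads, self.head_dim)
        k = self.k_norm(self.k_proj(x)).view(b, t, self.n_heads, self.head_dim)
        return q, k  # attention scores, values, and output projection omitted

attn = QKProjWithNorm(d_model=2048, n_heads=16)
attn.apply(init_weights)
```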

Performance Results:

  • During Pretraining: OLMoE achieves better performance with fewer FLOPs than dense OLMo models and matches or outperforms OLMo-7B.
  • After Pretraining (Base Model): OLMoE performs best among models with <2B active parameters. It outperforms some dense 7B models (e.g., Llama2-7B) but is behind others (e.g., Llama3.1-8B).
  • After Adaptation (OLMoE-Instruct): OLMoE-Instruct significantly improves over the base model, especially on GSM8k due to added math data in SFT. It outperforms larger models like Llama2-13B-Chat, OLMo-7B-Instruct, and DeepSeekMoE-16B on average across benchmarks like MMLU, GSM8k, HumanEval, and AlpacaEval.

MoE Analysis:

The paper analyzes four MoE-specific properties:

  1. Router Saturation: Router decisions (which experts are chosen for a given token) tend to saturate relatively early in pretraining (e.g., ~60% saturation for top-8 experts after 1% of training); a simple way to measure this is sketched after this list. Later layers saturate faster than earlier ones, with layer 0 being an outlier that saturates more slowly.
  2. Expert Co-activation: Generally low co-activation between experts within a layer, suggesting little redundancy and good specialization. Some small groups of experts tend to co-activate.
  3. Domain Specialization: OLMoE experts show significant specialization for specific data domains (e.g., arXiv, GitHub), with certain experts being activated much more or less frequently than random chance for these domains. This specialization is less pronounced for generic data (e.g., C4). OLMoE exhibits stronger domain specialization than Mixtral-8x7B, possibly due to OLMoE being trained from scratch.
  4. Vocabulary Specialization: Experts also specialize in particular vocabulary items (token IDs). Later layers show higher vocabulary specialization. Some experts focus on non-alphabetic tokens, geographic terms, or connector words. This is linked to domain specialization. OLMoE shows stronger vocabulary specialization than Mixtral.
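
Router saturation (point 1) is straightforward to estimate. Below is a hedged sketch of one plausible way to compute it: the average overlap between each token's top-k expert set at an intermediate checkpoint and at the final checkpoint. The function names and the exact definition are illustrative assumptions; the paper's analysis code may differ.

```python
import torch

def topk_expert_sets(router_logits: torch.Tensor, k: int = 8) -> torch.Tensor:
    """router_logits: (num_tokens, n_experts) -> boolean mask of each token's top-k experts."""
    idx = router_logits.topk(k, dim=-1).indices
    mask = torch.zeros_like(router_logits, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

def router_saturation(logits_ckpt: torch.Tensor, logits_final: torch.Tensor, k: int = 8) -> float:
    """Mean fraction of top-k experts shared between an intermediate checkpoint
    and the end of pretraining, averaged over tokens (assumed definition)."""
    sets_a = topk_expert_sets(logits_ckpt, k)
    sets_b = topk_expert_sets(logits_final, k)
    overlap = (sets_a & sets_b).sum(dim=-1).float() / k  # shared experts per token
    return overlap.mean().item()

# Example with random logits (64 experts, top-8); chance level is k/n_experts = 0.125.
a = torch.randn(1000, 64)
b = torch.randn(1000, 64)
print(f"saturation ≈ {router_saturation(a, b):.2f}")
```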

Implementation Considerations:

  • Pretraining Hardware: 256 H100 GPUs for ~10 days.
  • Adaptation Hardware: 32 H100 GPUs for SFT (~33 hours) and DPO (~14 hours).
  • Memory: While inference cost (active parameters) is similar to a 1B dense model, storing the full 6.9B parameters requires more GPU memory.
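
As a rough, hedged estimate (assuming bfloat16 weights at 2 bytes per parameter and ignoring activations, KV cache, and optimizer state): holding all 6.9B parameters takes about $6.9 \times 10^9 \times 2\ \text{bytes} \approx 13.8\ \text{GB}$, whereas a dense model with the same 1.3B active parameters would need only about $2.6\ \text{GB}$ of weights. The MoE therefore trades extra memory for lower per-token compute.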

Limitations and Future Work:

The paper acknowledges limitations such as the model's relatively small active parameter count, the amount of pretraining data (though substantial, less than some frontier models), its text-only modality, and its predominantly English focus. Future work could involve scaling parameters and data further, exploring multimodality, and improving multilingual capabilities.

In summary, OLMoE represents a significant step towards fully open and reproducible research in MoE LLMs, providing competitive performance for its size and a valuable suite of resources for the community. The detailed experiments offer practical insights into MoE design, and the analysis sheds light on the internal workings of these sparse models.
