OLMoE: Open Mixture-of-Experts Language Models (2409.02060v2)
Published 3 Sep 2024 in cs.CL, cs.AI, and cs.LG
Abstract: We introduce OLMoE, a fully open, state-of-the-art LLM leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
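The abstract's central architectural point is that only a small fraction of the model's total parameters is active for any given token. As a rough illustration of how a sparse MoE layer achieves this, the sketch below implements generic top-k expert routing in PyTorch. It is a minimal sketch under assumed settings: the expert count, top-k value, and layer sizes are illustrative and do not reflect OLMoE's actual configuration or code.

```python
# Minimal sketch of a sparse Mixture-of-Experts (MoE) layer with top-k routing.
# Illustrative assumptions: 8 experts, top-2 routing, small hidden sizes.
# This is NOT OLMoE's implementation; it only shows why "total parameters"
# and "active parameters per token" differ in a sparse MoE model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        # Router scores each token against every expert, then keeps the top-k.
        probs = F.softmax(self.router(x), dim=-1)          # (n_tokens, n_experts)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)    # (n_tokens, top_k)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True) # renormalize gate weights

        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so per-token compute and
        # "active" parameters scale with top_k / n_experts of the expert weights.
        for e, expert in enumerate(self.experts):
            token_idx, slot = (topk_i == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            gate = topk_p[token_idx, slot].unsqueeze(-1)
            out[token_idx] += gate * expert(x[token_idx])
        return out


if __name__ == "__main__":
    layer = SparseMoELayer()
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)  # torch.Size([16, 512])
```

In this toy configuration, each token activates 2 of 8 expert MLPs, so only about a quarter of the expert parameters participate per token; the same routing principle, with OLMoE's own expert count and top-k, is what lets a 7B-parameter model use roughly 1B parameters per input token.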