
JetMoE: Reaching Llama2 Performance with 0.1M Dollars

arXiv:2404.07413
Published Apr 11, 2024 in cs.CL and cs.AI

Abstract

LLMs have achieved remarkable results, but their increasing resource demands have become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained for less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, JetMoE-8B demonstrates impressive performance: JetMoE-8B outperforms the Llama2-7B model, and JetMoE-8B-Chat surpasses the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while activating only 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures are detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.

Figure: JetMoE architecture for efficient and scalable multi-expert model parallelism.

Overview

  • The paper 'JetMoE: Reaching Llama2 Performance with 0.1M Dollars' introduces the JetMoE-8B model, a cost-effective large language model achieving competitive performance against more expensive models like Llama2.

  • JetMoE-8B uses a Sparsely-gated Mixture-of-Experts (SMoE) architecture that activates only a subset of its parameters during training and inference, applying sparse routing to both the attention and feedforward layers to significantly reduce computational cost.

  • The model is pretrained on diverse open-source datasets and aligned through distilled supervised fine-tuning and preference optimization, demonstrating high relevance and coherence in outputs while maintaining efficient resource use.

Overview of JetMoE: Reaching Llama2 Performance with $0.1M

The paper "JetMoE: Reaching Llama2 Performance with 0.1M Dollars" presents a comprehensive study on the development and evaluation of the JetMoE-8B model, a large language model trained under significant budget constraints while achieving competitive performance against well-known models such as Llama2. This paper focuses on the efficient training methodologies and architectural optimizations employed to create a cost-effective model that maintains high performance across a variety of benchmarks.

Introduction

The research addresses a critical issue in the development of LLMs: the increasing computational and financial demands required to achieve state-of-the-art performance. The JetMoE-8B model utilizes a Sparsely-gated Mixture-of-Experts (SMoE) architecture to alleviate these demands. By activating only a subset of the total parameters during training and inference, this approach reduces computational costs significantly. JetMoE-8B applies sparse activation to both the attention and feedforward layers, activating only 2B of its 8B parameters per input token. This greatly reduces inference computation compared to dense models such as Llama2-7B, which activate all of their parameters for every token.
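
As a rough sanity check on the reported ~70% figure, one can treat per-token compute as proportional to the number of activated parameters (a simplifying assumption that ignores attention's sequence-length term and router overhead):

```python
# Rough estimate of the per-token compute saving, assuming compute scales
# linearly with activated parameters (an approximation only).
llama2_7b_active_params = 7e9   # dense: all parameters are used for every token
jetmoe_active_params = 2e9      # sparse: ~2B of 8B parameters activated per token

reduction = 1 - jetmoe_active_params / llama2_7b_active_params
print(f"Approximate inference-compute reduction: {reduction:.0%}")  # ~71%
```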

Model Architecture

The architecture of JetMoE-8B is designed to maximize efficiency without compromising performance. It extends the sparse activation technique to both the attention and feed-forward layers, inspired by the ModuleFormer architecture. By doing so, the model efficiently manages computational resources, activating only necessary parameters per input token.

Mixture of Experts

In the JetMoE framework, the Mixture of Experts (MoE) layer is a central feature. Each MoE layer comprises multiple experts and a router to select the top-k experts for each input. The sparse activation reduces the computational load during both training and inference phases.
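
A minimal sketch of this kind of top-k routing is shown below; the expert count, hidden sizes, and gating details are illustrative assumptions rather than JetMoE's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparsely-gated MoE layer: a router picks top-k experts per token.

    Illustrative only; expert count, hidden sizes, and gating details are
    assumptions, not the exact JetMoE configuration.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                               # (tokens, experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        gates = F.softmax(top_vals, dim=-1)                   # renormalize over the selected experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                # only run experts that were actually selected
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out

# Usage: route 4 tokens of width 16 through 8 experts, activating 2 per token.
tokens = torch.randn(4, 16)
moe = TopKMoE(d_model=16, d_hidden=64)
print(moe(tokens).shape)  # torch.Size([4, 16])
```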

FeedForward and Attention Experts

The model uses a standard 2-layer MLP for each feedforward expert while the attention experts incorporate innovations like the Mixture of Attention heads (MoA) with RoPE relative positioning. The shared key and value projection matrices across attention experts further enhance efficiency and training stability.
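
The shared key/value idea can be sketched as follows: each attention expert keeps its own query and output projections while all experts reuse a single key and value projection. The sizes here, and the omission of RoPE and top-k routing, are simplifications for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttentionExperts(nn.Module):
    """Sketch of attention experts with shared key/value projections.

    Each expert owns its query/output projections; keys and values are computed
    once and reused by every expert. Dimensions are illustrative assumptions,
    and RoPE plus top-k routing are omitted for brevity.
    """

    def __init__(self, d_model: int, d_head: int, num_experts: int = 4):
        super().__init__()
        self.d_head = d_head
        self.k_proj = nn.Linear(d_model, d_head, bias=False)   # shared across experts
        self.v_proj = nn.Linear(d_model, d_head, bias=False)   # shared across experts
        self.q_projs = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(num_experts)])
        self.o_projs = nn.ModuleList([nn.Linear(d_head, d_model, bias=False) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # x: (seq_len, d_model); expert_id selects which attention expert to run.
        k = self.k_proj(x)                     # computed once, reused by every expert
        v = self.v_proj(x)
        q = self.q_projs[expert_id](x)
        attn = F.softmax(q @ k.T / self.d_head ** 0.5, dim=-1)
        return self.o_projs[expert_id](attn @ v)

# Usage: run expert 0 over an 8-token sequence of width 32.
x = torch.randn(8, 32)
layer = SharedKVAttentionExperts(d_model=32, d_head=16)
print(layer(x, expert_id=0).shape)  # torch.Size([8, 32])
```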

Pretraining and Data Mixture

JetMoE-8B is pretrained on a mixture of open-source datasets spanning web documents, code, and mathematical content, including RefinedWeb, StarCoder, The Pile, and Dolma, among others. Training follows a two-phase strategy: the first phase uses a broad data mix, while the second phase, which coincides with the learning-rate decay, increases the weight of high-quality data.
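
A hypothetical sketch of such a phase switch is shown below; the sampling weights and the switch point are placeholders for illustration, not the mixture reported in the paper:

```python
# Hypothetical two-phase data-mixture schedule. The sampling weights below are
# placeholders, NOT the mixture reported in the JetMoE paper.
PHASE1_WEIGHTS = {"RefinedWeb": 0.5, "StarCoder": 0.2, "The Pile": 0.2, "Dolma": 0.1}
PHASE2_WEIGHTS = {"RefinedWeb": 0.3, "StarCoder": 0.3, "The Pile": 0.1, "Dolma": 0.3}  # upweight higher-quality sources

def mixture_for_step(step: int, total_steps: int, decay_start_frac: float = 0.75) -> dict:
    """Return the sampling weights in effect at a given training step.

    The switch point (start of learning-rate decay) is an assumption for
    illustration; JetMoE's actual schedule and weights are given in the report.
    """
    in_decay_phase = step >= decay_start_frac * total_steps
    return PHASE2_WEIGHTS if in_decay_phase else PHASE1_WEIGHTS

print(mixture_for_step(step=800_000, total_steps=1_000_000))  # phase-2 weights
```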

Training was conducted using the Megatron framework with modifications to support MoA and the z-loss. The infrastructure consisted of 96 H100 GPUs spread across 12 nodes. Hyperparameters were selected based on empirical results from prior research and set to balance performance and computational efficiency.
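
The z-loss referred to here is an auxiliary term that penalizes large logits to stabilize training; a minimal sketch of the standard formulation follows, with the coefficient being an illustrative default rather than the paper's value:

```python
import torch

def z_loss(logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """Auxiliary z-loss: penalizes the squared log-partition of the logits.

    Standard formulation used to keep router/output logits from growing too
    large; the coefficient is an illustrative default, not necessarily the
    value used for JetMoE.
    """
    log_z = torch.logsumexp(logits, dim=-1)   # log of the softmax normalizer, per token
    return coeff * (log_z ** 2).mean()

# Usage: router logits for 4 tokens over 8 experts.
print(z_loss(torch.randn(4, 8)))
```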

Model Alignment

JetMoE-8B-Chat is aligned through a two-step process comprising Distilled Supervised Fine-Tuning (dSFT) and Distilled Direct Preference Optimization (dDPO). dSFT performs instruction tuning on data distilled from a stronger teacher model, while dDPO further aligns the model on preference data ranked by the teacher, optimized with the direct preference optimization objective. This alignment gives JetMoE-8B-Chat a high degree of relevance and coherence in its responses.
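
dDPO builds on the standard direct preference optimization objective; a minimal sketch of that loss over per-sequence log-probabilities is shown below, with beta as an illustrative default and the teacher-ranked data pipeline omitted:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective on per-sequence log-probabilities.

    beta is an illustrative default; the paper's dDPO setup (teacher-ranked
    preference pairs, exact hyperparameters) is described in the report.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # how much the policy upweights the preferred response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # ...and the dispreferred one
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Usage with dummy per-sequence log-probabilities for a batch of 4 preference pairs.
dummy = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*dummy))
```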

Evaluation

The evaluation of JetMoE-8B includes comparisons with several leading models on the OpenLLM leaderboard and other domain-specific benchmarks. JetMoE-8B consistently matches or outperforms these models despite a much lower computational budget. On benchmarks such as HellaSwag, MMLU, and TruthfulQA, JetMoE-8B performs strongly, demonstrating the efficacy of its architecture and training regimen.

Implications and Future Work

This research underscores the potential for creating high-performance LLMs in a cost-effective manner. The adoption of the SMoE architecture proves that significant computational savings can be achieved without a considerable drop in model performance. The described methodologies and open-source nature of JetMoE-8B facilitate further research and collaboration across the AI community.

However, due to budget constraints, the study lacks ablation experiments that could provide deeper insight into the contributions of individual components. Future research could further optimize hyperparameters and data mixtures, potentially improving the performance and efficiency of subsequent models.

Conclusion

JetMoE-8B exemplifies a significant stride towards democratizing access to advanced language models by presenting an efficient, open-source approach to training LLMs. The detailed reporting of training parameters and data mixtures provided in this paper fosters reproducibility and further advancements in the field. By balancing cost and performance effectively, JetMoE-8B paves the way for future research aimed at creating accessible and potent AI solutions.
