
JetMoE: Reaching Llama2 Performance with 0.1M Dollars

arXiv:2404.07413
Published Apr 11, 2024 in cs.CL and cs.AI

Abstract

LLMs have achieved remarkable results, but their increasing resource demands have become a major obstacle to the development of powerful and accessible super-human intelligence. This report introduces JetMoE-8B, a new LLM trained for less than $0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours. Despite its low cost, JetMoE-8B demonstrates impressive performance: JetMoE-8B outperforms the Llama2-7B model, and JetMoE-8B-Chat surpasses the Llama2-13B-Chat model. These results suggest that LLM training can be much more cost-effective than generally thought. JetMoE-8B is based on an efficient Sparsely-gated Mixture-of-Experts (SMoE) architecture, composed of attention and feedforward experts. Both layers are sparsely activated, allowing JetMoE-8B to have 8B parameters while activating only 2B for each input token, reducing inference computation by about 70% compared to Llama2-7B. Moreover, JetMoE-8B is highly open and academia-friendly, using only public datasets and training code. All training parameters and data mixtures are detailed in this report to facilitate future efforts in the development of open foundation models. This transparency aims to encourage collaboration and further advancements in the field of accessible and efficient LLMs. The model weights are publicly available at https://github.com/myshell-ai/JetMoE.

Figure: JetMoE architecture for efficient and scalable multi-expert model parallelism.

Overview

  • The paper 'JetMoE: Reaching Llama2 Performance with 0.1M Dollars' introduces the JetMoE-8B model, a cost-effective large language model achieving competitive performance against more expensive models like Llama2.

  • JetMoE-8B uses a Sparsely-gated Mixture-of-Experts (SMoE) architecture that activates only a subset of its parameters during training and inference, applying sparse routing to both the attention and feedforward layers to significantly reduce computational cost.

  • The model is pretrained on diverse open-source datasets and aligned through distilled supervised fine-tuning and preference optimization, demonstrating high relevance and coherence in outputs while maintaining efficient resource use.

Overview of JetMoE: Reaching Llama2 Performance with $0.1M

The paper "JetMoE: Reaching Llama2 Performance with 0.1M Dollars" presents a comprehensive study on the development and evaluation of the JetMoE-8B model, a large language model trained under significant budget constraints while achieving competitive performance against well-known models such as Llama2. This paper focuses on the efficient training methodologies and architectural optimizations employed to create a cost-effective model that maintains high performance across a variety of benchmarks.

Introduction

The research addresses a critical issue in the development of LLMs: the increasing computational and financial demands required to achieve state-of-the-art performance. The JetMoE-8B model utilizes a Sparsely-gated Mixture-of-Experts (SMoE) architecture to alleviate these demands. By activating only a subset of the total parameters during training and inference, this approach reduces computational costs significantly. JetMoE-8B applies sparse activation to both the attention and feedforward layers, activating only 2B of its 8B parameters per input token. This greatly reduces inference computation compared to dense models such as Llama2-7B, which activate all of their parameters for every token.
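
As a rough sanity check on the reported ~70% figure, one can treat per-token compute as proportional to the number of activated parameters (a simplifying assumption that ignores attention's sequence-length term and router overhead):

```python
# Rough estimate of the per-token compute saving, assuming compute scales
# linearly with activated parameters (an approximation only).
llama2_7b_active_params = 7e9   # dense: all parameters are used for every token
jetmoe_active_params = 2e9      # sparse: ~2B of 8B parameters activated per token

reduction = 1 - jetmoe_active_params / llama2_7b_active_params
print(f"Approximate inference-compute reduction: {reduction:.0%}")  # ~71%
```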

Model Architecture

The architecture of JetMoE-8B is designed to maximize efficiency without compromising performance. It extends the sparse activation technique to both the attention and feed-forward layers, inspired by the ModuleFormer architecture. By doing so, the model efficiently manages computational resources, activating only necessary parameters per input token.

Mixture of Experts

In the JetMoE framework, the Mixture of Experts (MoE) layer is a central feature. Each MoE layer comprises multiple experts and a router to select the top-k experts for each input. The sparse activation reduces the computational load during both training and inference phases.
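
A minimal sketch of this kind of top-k routing is shown below; the expert count, hidden sizes, and gating details are illustrative assumptions rather than JetMoE's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparsely-gated MoE layer: a router picks top-k experts per token.

    Illustrative only; expert count, hidden sizes, and gating details are
    assumptions, not the exact JetMoE configuration.
    """

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                               # (tokens, experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts per token
        gates = F.softmax(top_vals, dim=-1)                   # renormalize over the selected experts

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                # only run experts that were actually selected
                    out[mask] += gates[mask, slot, None] * expert(x[mask])
        return out

# Usage: route 4 tokens of width 16 through 8 experts, activating 2 per token.
tokens = torch.randn(4, 16)
moe = TopKMoE(d_model=16, d_hidden=64)
print(moe(tokens).shape)  # torch.Size([4, 16])
```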

FeedForward and Attention Experts

The model uses a standard 2-layer MLP for each feedforward expert while the attention experts incorporate innovations like the Mixture of Attention heads (MoA) with RoPE relative positioning. The shared key and value projection matrices across attention experts further enhance efficiency and training stability.
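
The shared key/value idea can be sketched as follows: each attention expert keeps its own query and output projections while all experts reuse a single key and value projection. The sizes here, and the omission of RoPE and top-k routing, are simplifications for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttentionExperts(nn.Module):
    """Sketch of attention experts with shared key/value projections.

    Each expert owns its query/output projections; keys and values are computed
    once and reused by every expert. Dimensions are illustrative assumptions,
    and RoPE plus top-k routing are omitted for brevity.
    """

    def __init__(self, d_model: int, d_head: int, num_experts: int = 4):
        super().__init__()
        self.d_head = d_head
        self.k_proj = nn.Linear(d_model, d_head, bias=False)   # shared across experts
        self.v_proj = nn.Linear(d_model, d_head, bias=False)   # shared across experts
        self.q_projs = nn.ModuleList([nn.Linear(d_model, d_head, bias=False) for _ in range(num_experts)])
        self.o_projs = nn.ModuleList([nn.Linear(d_head, d_model, bias=False) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # x: (seq_len, d_model); expert_id selects which attention expert to run.
        k = self.k_proj(x)                     # computed once, reused by every expert
        v = self.v_proj(x)
        q = self.q_projs[expert_id](x)
        attn = F.softmax(q @ k.T / self.d_head ** 0.5, dim=-1)
        return self.o_projs[expert_id](attn @ v)

# Usage: run expert 0 over an 8-token sequence of width 32.
x = torch.randn(8, 32)
layer = SharedKVAttentionExperts(d_model=32, d_head=16)
print(layer(x, expert_id=0).shape)  # torch.Size([8, 32])
```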

Pretraining and Data Mixture

JetMoE-8B is pretrained on a mixture of open-source datasets spanning web documents, code, and mathematical content, including RefinedWeb, StarCoder, The Pile, and Dolma, among others. Training follows a two-phase strategy: the first phase uses a broad data mix, while the second phase, which coincides with the learning-rate decay, increases the weight of high-quality data.
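
A hypothetical sketch of such a phase switch is shown below; the sampling weights and the switch point are placeholders for illustration, not the mixture reported in the paper:

```python
# Hypothetical two-phase data-mixture schedule. The sampling weights below are
# placeholders, NOT the mixture reported in the JetMoE paper.
PHASE1_WEIGHTS = {"RefinedWeb": 0.5, "StarCoder": 0.2, "The Pile": 0.2, "Dolma": 0.1}
PHASE2_WEIGHTS = {"RefinedWeb": 0.3, "StarCoder": 0.3, "The Pile": 0.1, "Dolma": 0.3}  # upweight higher-quality sources

def mixture_for_step(step: int, total_steps: int, decay_start_frac: float = 0.75) -> dict:
    """Return the sampling weights in effect at a given training step.

    The switch point (start of learning-rate decay) is an assumption for
    illustration; JetMoE's actual schedule and weights are given in the report.
    """
    in_decay_phase = step >= decay_start_frac * total_steps
    return PHASE2_WEIGHTS if in_decay_phase else PHASE1_WEIGHTS

print(mixture_for_step(step=800_000, total_steps=1_000_000))  # phase-2 weights
```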

Training was conducted using the Megatron framework with modifications to support MoA and the z-loss. The infrastructure consisted of 96 H100 GPUs spread across 12 nodes. Hyperparameters were selected based on empirical results from prior research and set to balance performance and computational efficiency.
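
The z-loss referred to here is an auxiliary term that penalizes large logits to stabilize training; a minimal sketch of the standard formulation follows, with the coefficient being an illustrative default rather than the paper's value:

```python
import torch

def z_loss(logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """Auxiliary z-loss: penalizes the squared log-partition of the logits.

    Standard formulation used to keep router/output logits from growing too
    large; the coefficient is an illustrative default, not necessarily the
    value used for JetMoE.
    """
    log_z = torch.logsumexp(logits, dim=-1)   # log of the softmax normalizer, per token
    return coeff * (log_z ** 2).mean()

# Usage: router logits for 4 tokens over 8 experts.
print(z_loss(torch.randn(4, 8)))
```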

Model Alignment

JetMoE-8B-Chat is aligned through a two-step process comprising Distilled Supervised Fine-Tuning (dSFT) and Distilled Direct Preference Optimization (dDPO). dSFT performs instruction tuning on data distilled from a stronger teacher model, while dDPO further aligns the model on preference data ranked by the teacher, optimized with the direct preference optimization objective. This alignment gives JetMoE-8B-Chat a high degree of relevance and coherence in its responses.
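
dDPO builds on the standard direct preference optimization objective; a minimal sketch of that loss over per-sequence log-probabilities is shown below, with beta as an illustrative default and the teacher-ranked data pipeline omitted:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO objective on per-sequence log-probabilities.

    beta is an illustrative default; the paper's dDPO setup (teacher-ranked
    preference pairs, exact hyperparameters) is described in the report.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # how much the policy upweights the preferred response
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # ...and the dispreferred one
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Usage with dummy per-sequence log-probabilities for a batch of 4 preference pairs.
dummy = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*dummy))
```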

Evaluation

The evaluation of JetMoE-8B includes comparisons with several leading models on the OpenLLM leaderboard and other domain-specific benchmarks. JetMoE-8B consistently matches or outperforms these models despite a much lower computational budget. On benchmarks such as HellaSwag, MMLU, and TruthfulQA, JetMoE-8B performs strongly, demonstrating the efficacy of its architecture and training regimen.

Implications and Future Work

This research underscores the potential for creating high-performance LLMs in a cost-effective manner. The adoption of the SMoE architecture proves that significant computational savings can be achieved without a considerable drop in model performance. The described methodologies and open-source nature of JetMoE-8B facilitate further research and collaboration across the AI community.

However, due to budget constraints, the study lacks ablation experiments that could provide deeper insight into the contributions of individual components. Future research could further optimize hyperparameters and data mixtures, potentially improving the performance and efficiency of subsequent models.

Conclusion

JetMoE-8B exemplifies a significant stride towards democratizing access to advanced language models by presenting an efficient, open-source approach to training LLMs. The detailed reporting of training parameters and data mixtures provided in this paper fosters reproducibility and further advancements in the field. By balancing cost and performance effectively, JetMoE-8B paves the way for future research aimed at creating accessible and potent AI solutions.
