Abstract

Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up LLMs. However, training MoE from scratch at large scale still suffers from data hunger and instability problems. Motivated by this limitation, we investigate building MoE models from existing dense LLMs. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts with only part of the parameters activated. Empirically, by training on 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters. The source code and models are available at https://github.com/pjlab-sys4nlp/llama-moe .

Figure: Framework for building LLaMA-MoE models, with FFNs split into experts and a subset of experts selected per token during training.

Overview

  • This research introduces a cost-effective approach to developing Mixture-of-Experts (MoE) models by repurposing dense LLMs, such as LLaMA, to create the LLaMA-MoE model.

  • The methodology involves two main stages: Expert Construction, which transforms dense feed-forward networks into smaller experts through various partitioning methods, and Continual Pre-training, which optimizes these models using different data sampling strategies.

  • Empirical evaluations highlight that LLaMA-MoE models outperform similarly sized pre-trained models across multiple benchmarks, demonstrating efficient scaling and specialized expert behavior, especially in domain-specific tasks.

Overview of LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

The paper presents an approach to developing Mixture-of-Experts (MoE) models by leveraging existing dense LLMs, specifically LLaMA-2 7B. The approach is rooted in cost-effective model scaling while maintaining computational efficiency. The authors explore the intricacies of constructing MoE models and training them further through continual pre-training, along with comprehensive empirical evaluations demonstrating their effectiveness.

Introduction and Background

LLMs such as LLaMA have advanced significantly in language processing and reasoning, but at a high computational cost. This paper leverages the Mixture-of-Experts (MoE) architecture, which activates only a subset of model parameters per token, offering a sparse alternative to dense models. The primary challenge of training MoEs from scratch is the substantial data and compute budget required, which motivates the authors to repurpose pre-existing dense models instead.
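
To make the sparse-activation idea concrete, here is a minimal sketch of a top-k gated MoE layer in which only a few experts run per token. The class and parameter names (`TopKMoE`, `num_experts`, `top_k`) and the simple SiLU MLP experts are illustrative assumptions, not the paper's implementation (LLaMA's FFN uses a gated SwiGLU form).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k gated MoE layer: only `top_k` experts run per token."""

    def __init__(self, d_model, d_hidden, num_experts=8, top_k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)        # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep the top-k experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() > 0:
                out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out

# Usage: 16 tokens with model dim 64; only 2 of 8 experts are active per token.
moe = TopKMoE(d_model=64, d_hidden=128, num_experts=8, top_k=2)
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```

Because each token touches only `top_k` experts, the compute per token stays close to that of a much smaller dense FFN even as the total parameter count grows with the number of experts.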

Methodology

The proposed framework, labeled LLaMA-MoE, involves two pivotal stages: Expert Construction and Continual Pre-training.

Expert Construction

Expert construction transforms the dense feed-forward networks (FFNs) in LLaMA into multiple smaller experts. Several partitioning strategies are explored:

  1. Independent Random Splitting: Randomly partitions the FFN parameters.
  2. Independent Clustering: Utilizes balanced k-means clustering to partition neurons.
  3. Sharing Inner: Shares important neurons across multiple experts based on Taylor expansion loss estimates.
  4. Sharing Inter: Combines neuron sharing with residual blocks for less critical neurons.

Empirically, Independent Random Splitting produced the best average scores across downstream tasks and converged efficiently during training.
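
As a concrete illustration of the random-splitting idea, the sketch below partitions the intermediate neurons of a LLaMA-style FFN into equal-sized, disjoint experts. The function name, the equal-size constraint, and the exact weight layouts are assumptions for illustration, not the authors' implementation.

```python
import torch

def random_split_ffn(w_gate, w_up, w_down, num_experts, seed=0):
    """Partition an FFN's intermediate neurons into `num_experts` disjoint experts.

    w_gate, w_up: (d_inter, d_model) projections into the intermediate space.
    w_down:       (d_model, d_inter) projection back to the model dimension.
    Returns a list of (gate, up, down) weight triples, one per expert.
    """
    d_inter = w_gate.shape[0]
    assert d_inter % num_experts == 0, "assume equal-sized experts for simplicity"
    perm = torch.randperm(d_inter, generator=torch.Generator().manual_seed(seed))
    splits = perm.chunk(num_experts)  # disjoint, equal-sized neuron index sets
    return [(w_gate[idx], w_up[idx], w_down[:, idx]) for idx in splits]

# Toy example: d_model=8, d_inter=32 split into 4 experts of 8 neurons each.
d_model, d_inter, E = 8, 32, 4
experts = random_split_ffn(torch.randn(d_inter, d_model),
                           torch.randn(d_inter, d_model),
                           torch.randn(d_model, d_inter), num_experts=E)
print(len(experts), experts[0][0].shape)  # 4 torch.Size([8, 8])
```

The clustering and sharing variants differ only in how the index sets are chosen (balanced k-means over neurons, or importance scores from a Taylor expansion of the loss), not in how the selected rows and columns are copied out.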

Continual Pre-training

After constructing the experts, the next stage is continual pre-training of the transformed LLaMA-MoE model. Various data sampling strategies were explored to optimize training, including static and dynamic sampling weights. In particular, the static weights derived from Sheared-LLaMA delivered the best downstream performance, combined with filtering that improved the fluency and quality of the training data.
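
To make the static-weight sampling idea concrete, here is a minimal sketch of drawing training documents from several domains under fixed weights. The domain names and weight values are placeholders for illustration, not the actual Sheared-LLaMA proportions.

```python
import random

# Placeholder domains and static sampling weights (illustrative values, not the
# Sheared-LLaMA proportions; they just need to sum to 1.0).
DOMAIN_WEIGHTS = {"web": 0.6, "code": 0.1, "books": 0.1, "wiki": 0.1, "arxiv": 0.1}
DOMAIN_DATA = {d: [f"{d}-doc-{i}" for i in range(1000)] for d in DOMAIN_WEIGHTS}

def sample_batch(batch_size, rng):
    """Draw a batch: pick a domain per example from the static weights, then a document."""
    domains = rng.choices(list(DOMAIN_WEIGHTS), weights=list(DOMAIN_WEIGHTS.values()), k=batch_size)
    return [rng.choice(DOMAIN_DATA[d]) for d in domains]

rng = random.Random(0)
print(sample_batch(4, rng))  # a batch of four document ids, mostly from the "web" domain
```

A dynamic strategy would re-estimate the weights during training (for example, from per-domain validation loss) instead of fixing them up front; the paper finds the static Sheared-LLaMA weights to be the stronger choice here.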

Experiments and Results

The experimental evaluation involved rigorous comparisons with similar-sized pre-trained models, including OpenLLaMA-3B-v2 and Sheared-LLaMA-2.7B. LLaMA-MoE-3.5B models demonstrated superior performance across multiple benchmarks:

  • ARC Challenge: 44.2, compared to 41.6 for Sheared-LLaMA-2.7B.
  • HellaSwag: 73.3, outperforming all comparison models.
  • Multi-task evaluation: an overall improvement of 1.3 points over Sheared-LLaMA-2.7B.

Notably, the models exhibited specialized expert routing behavior, with deeper layers showing domain-specific preferences.
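
One way to surface this behavior is to count, per data domain, how often each expert is selected during inference. The sketch below is a generic illustration of such an analysis; the domain names and counts are made up and this is not the paper's measurement code.

```python
from collections import Counter, defaultdict

def routing_profile(routed_tokens):
    """routed_tokens: iterable of (domain, expert_id) pairs logged during inference.
    Returns, per domain, the fraction of tokens routed to each expert."""
    counts = defaultdict(Counter)
    for domain, expert_id in routed_tokens:
        counts[domain][expert_id] += 1
    return {d: {e: n / sum(c.values()) for e, n in c.items()} for d, c in counts.items()}

# Toy log: code-heavy text concentrates on expert 3, encyclopedic text spreads out.
log = [("github", 3)] * 7 + [("github", 1)] * 3 + [("wikipedia", e) for e in (0, 1, 2, 3, 1)]
print(routing_profile(log))
# {'github': {3: 0.7, 1: 0.3}, 'wikipedia': {0: 0.2, 1: 0.4, 2: 0.2, 3: 0.2}}
```

A strongly skewed distribution for a domain, especially in deeper layers, is the signature of the specialization described above.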

Theoretical and Practical Implications

The findings have significant implications for both theoretical advancements and practical applications in NLP:

  1. Cost-Effective Scaling: The method of utilizing existing dense models to build MoEs reduces computational expenses, a crucial advancement for resource-constrained environments.
  2. Expert Specialization: The domain-specific expert behavior opens avenues for more targeted and specialized models, potentially enhancing specific downstream application performance.
  3. Robust Performance: By combining robust partitioning methods with carefully tuned data sampling strategies, the approach delivers strong performance and efficient learning dynamics.

Future Directions

Looking ahead, the research domain offers several promising trajectories:

  1. Fine-Grained Expert Construction: Further exploration into dynamic and more granular expert construction methods could yield models with even higher specialization and efficiency.
  2. Extended Datasets: Incorporating diverse and rich datasets for training could enhance the generalization and robustness of MoE models.
  3. Compression and Pruning: Investigating strategies for compressing and pruning experts based on routing statistics could lead to more compact and faster models.
  4. Hybrid Architectures: Combining MoE with other advanced architectures to leverage the strengths of multiple approaches may push the boundaries of current NLP capabilities.

Conclusion

This paper presents a comprehensive framework for constructing Mixture-of-Experts models from existing dense models, specifically LLaMA-2 7B, and training them further through continual pre-training. The resulting LLaMA-MoE models not only outperform comparably sized models in extensive evaluations but also demonstrate efficient scaling and expert specialization. The implications of this research span practical advances in resource-efficient model training and insights into expert specialization within neural network architectures.
