
Abstract

Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.

Figure: Averaged routing weights show domain specialization of the trained experts across domains such as Books and Wikipedia.

Overview

  • The paper introduces 'Lory', a fully-differentiable Mixture-of-Experts (MoE) architecture for pre-training autoregressive language models, scaling model capacity without a proportional increase in compute.

  • Lory's two core innovations are causal segment routing, which makes expert merging efficient while preserving the autoregressive structure of language models, and similarity-based data batching, which groups semantically similar documents to encourage expert specialization.

  • Training results show that Lory outperforms parameter-matched dense models on both perplexity and downstream tasks, and competes favorably with state-of-the-art MoE models that use token-level routing, despite its less granular segment-level approach.

Exploring "Lory": A Fully-Differentiable Mixture-of-Experts for Language Models

Introducing the Lory Model

When we talk about scaling AI models, particularly language models, the challenge often lies in managing the computational cost while improving the model's performance. This is where Mixture-of-Experts (MoE) architectures step in, allowing growth in model size without a proportional increase in computation.

However, traditional MoE models rely on discrete routing decisions, which leave the router with a non-differentiable objective that is hard to optimize. Enter "Lory", a novel approach that introduces a fully-differentiable MoE architecture suitable for autoregressive language model pre-training.

Core Innovations in Lory

The Lory model introduces two pivotal techniques:

  • Causal Segment Routing: The input sequence is divided into fixed-length segments. For each segment, the router computes merging weights from the previous segment, and the experts are softly merged in parameter space into a single feed-forward module, so routing never sees future tokens and the model remains autoregressive. During inference, Lory simplifies further by making a single routing decision based on the input prompt, enhancing efficiency (see the sketch after this list).
  • Similarity-based Data Batching: Semantically similar documents are grouped into the same training instances, so consecutive segments tend to share a topic. This gives the segment-level router a meaningful signal and promotes expert specialization (a second sketch below illustrates one way such batches could be built).
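
The following is a minimal PyTorch sketch of the soft expert merging behind causal segment routing. The names and details (`SoftMergedMoE`, uniform weights for the first segment, mean-pooled routing features, plain two-layer feed-forward experts) are illustrative assumptions for this post, not the paper's implementation.

```python
# Minimal sketch of soft expert merging with causal segment routing.
import torch
import torch.nn as nn


class SoftMergedMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, segment_len: int):
        super().__init__()
        self.num_experts = num_experts
        self.segment_len = segment_len
        self.router = nn.Linear(d_model, num_experts)
        # Expert FFN weights stacked so they can be averaged in parameter space.
        self.w_in = nn.Parameter(torch.randn(num_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(num_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        batch = x.shape[0]
        outputs, prev_seg = [], None
        for seg in x.split(self.segment_len, dim=1):
            if prev_seg is None:
                # First segment has no preceding context; fall back to uniform
                # merging weights (an assumption made for this sketch).
                gate = x.new_full((batch, self.num_experts), 1.0 / self.num_experts)
            else:
                # Route from the *previous* segment only, so no future tokens
                # influence the merge and the model stays autoregressive.
                gate = torch.softmax(self.router(prev_seg.mean(dim=1)), dim=-1)
            # Softly merge experts: one weighted-average FFN per sequence.
            w_in = torch.einsum('be,edf->bdf', gate, self.w_in)
            w_out = torch.einsum('be,efd->bfd', gate, self.w_out)
            h = torch.einsum('btd,bdf->btf', seg, w_in).relu()
            outputs.append(torch.einsum('btf,bfd->btd', h, w_out))
            prev_seg = seg
        return torch.cat(outputs, dim=1)
```

Because merging happens in parameter space, every token in a segment is processed by a single dense feed-forward block, so per-token compute stays at the level of a dense model regardless of the number of experts.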

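Below is one hedged way similarity-based batching could be realized: embed documents, then greedily chain nearest neighbors so that adjacent documents in the training stream are topically related. The cosine-similarity embedding and greedy ordering are assumptions for illustration, not the paper's exact pipeline.

```python
# Hypothetical similarity-based data batching: order documents so that
# neighbors in the training stream are semantically similar.
import numpy as np


def order_by_similarity(doc_embeddings: np.ndarray) -> list[int]:
    """Greedily chain documents by cosine similarity; returns an index ordering."""
    emb = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    remaining = set(range(len(emb)))
    order = [remaining.pop()]  # arbitrary starting document
    while remaining:
        last = emb[order[-1]]
        # Pick the unused document most similar to the last one in the chain.
        best = max(remaining, key=lambda i: float(last @ emb[i]))
        remaining.remove(best)
        order.append(best)
    return order
```

Concatenating documents in this order (and then chunking into fixed-length training sequences) makes consecutive segments topically coherent, which is what gives the segment-level router something useful to specialize on.
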
Training and Performance

Lory was trained from scratch on 150 billion tokens, with models scaling up to 32 experts and 30 billion total parameters (1.5 billion active). The results have been quite promising:

  • On perplexity, a measure of how uncertain the model is about the next token (see the short example after this list), Lory significantly outperformed parameter-matched dense models (approximately a 13.9% improvement).
  • For downstream tasks, which include a diverse set from reading comprehension to text classification, the performance boost ranged from 1.5% to 11.1%.
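
For reference, perplexity is simply the exponential of the average per-token negative log-likelihood; the snippet below is a small illustrative calculation, not tied to the paper's evaluation code.

```python
import math

# Perplexity = exp(mean negative log-likelihood per token); lower is better.
def perplexity(token_nlls: list[float]) -> float:
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that assigns each token probability 0.25 has perplexity 4.
print(perplexity([-math.log(0.25)] * 8))  # ~4.0
```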

Importantly, despite using segment-level routing, Lory achieved competitive performance against state-of-the-art MoE models that use more granular (but computationally expensive) token-level routing.

Theoretical and Practical Implications

The research demonstrates several key implications:

  1. Specialization without Supervision: Lory's experts developed domain-level specialization on their own, in contrast to token-routed MoE models, whose experts often specialize in superficial token-level patterns.
  2. Scalability with Fully Differentiable Architecture: By replacing non-differentiable components, Lory simplifies the training process and opens up possibilities for more scalable and efficient model training regimes.
  3. Efficiency in Inference: Making a single routing decision per prompt gives Lory roughly the simplicity and computational cost of a dense model at inference time, making it practical for real-world applications where resources are constrained (a hypothetical inference-time sketch follows this list).
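
As a hypothetical illustration of that single routing decision (reusing the `SoftMergedMoE` sketch above, which is itself an assumption), inference can collapse the experts into one dense FFN once per prompt and reuse it for every generated token:

```python
import torch


@torch.no_grad()
def merge_once_for_inference(moe: "SoftMergedMoE", prompt_hidden: torch.Tensor):
    """Route on the prompt once, then decode with a single merged dense FFN."""
    gate = torch.softmax(moe.router(prompt_hidden.mean(dim=1)), dim=-1)  # (batch, experts)
    w_in = torch.einsum('be,edf->bdf', gate, moe.w_in)
    w_out = torch.einsum('be,efd->bfd', gate, moe.w_out)
    # Reusing (w_in, w_out) for every generated token keeps per-token FLOPs at
    # the level of a parameter-matched dense model.
    return w_in, w_out
```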

Looking Forward

The success of Lory suggests a promising future for fully-differentiable MoE architectures in language model pre-training. Future work could combine Lory's segment-level routing with token-level strategies, or push the model toward even more specialized capabilities.

Moreover, the principles behind Lory could potentially carry over to areas of AI beyond NLP, wherever MoE architectures are beneficial. Developments in these areas would further underline the versatility and utility of fully-differentiable MoE systems in modern AI.
