An Empirical Study of Mamba-based Language Models

(arXiv:2406.07887)
Published Jun 12, 2024 in cs.LG and cs.CL

Abstract

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.

Figure: Mamba and Mamba-2 blocks, highlighting the difference in per-layer all-reduce operations under tensor parallelism.

Overview

  • The study compares Mamba-based Selective State-Space Models (SSMs) with traditional Transformer-based architectures across models up to 8 billion parameters trained on datasets up to 3.5 trillion tokens.

  • Mamba-2 models generally matched or outperformed Transformers on standard tasks but fell short on tasks requiring in-context learning and long-context reasoning, while the hybrid Mamba-2-Hybrid model exceeded the Transformer on all 12 standard tasks evaluated.

  • Mamba-2-Hybrid models delivered a predicted speedup of up to 8x in inference-time token generation and were extended to sequence lengths of up to 128K, making them well suited to low-latency applications and to workloads with long, complex inputs.

An Empirical Study of Mamba-based Language Models

In the presented study, the authors conducted a comprehensive comparison between Mamba-based Selective State-Space Models (SSMs) and traditional Transformer-based architectures. The comparison spans a variety of scales, up to 8-billion-parameter models trained on datasets comprising as many as 3.5 trillion tokens. The principal focus of the study was to determine whether Mamba-based models, specifically the Mamba and Mamba-2 architectures, could match or exceed the performance of Transformers on standard and long-context natural language processing tasks.

Key Findings

The paper reports several key findings from the empirical evaluations:

Task and Training Comparisons:

  • Mamba and Mamba-2 models were evaluated against 8B-parameter Transformer models using the same datasets and hyperparameters, ensuring a controlled comparative analysis.
  • The study highlighted that pure Mamba-based models generally matched or exceeded Transformers on many downstream language tasks but fell short on tasks requiring in-context learning and long-context reasoning.

Performance on Standard Tasks:

  • On a benchmark suite of 12 standard tasks, including WinoGrande, PIQA, HellaSwag, ARC-Easy, and ARC-Challenge, pure Mamba-2 models achieved competitive or superior results compared to Transformers. However, they underperformed on the MMLU and Phonebook tasks, which require in-context learning and retrieval from long contexts (a minimal 5-shot MMLU-style prompt is sketched after this list).
  • The performance gap on MMLU was most pronounced when training on the smaller 1.1T-token dataset, suggesting that pure SSM models need training budgets approaching or exceeding those of the Transformer baselines to close this gap.
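To make the 5-shot MMLU setting concrete, the sketch below assembles a few-shot multiple-choice prompt in the usual MMLU style: five solved examples followed by the test question, from which the model must infer the answer format in context. The template and field names are illustrative assumptions; the paper's evaluation harness may format prompts differently.

```python
# Minimal sketch of 5-shot prompting in the MMLU style: a few solved
# multiple-choice examples precede the test question, and the model must
# copy the pattern and emit the correct letter. The exact template used in
# the paper's evaluation may differ.
def format_mmlu_prompt(shots: list[dict], question: dict) -> str:
    def render(ex: dict, with_answer: bool) -> str:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", ex["choices"]))
        answer = f" {ex['answer']}" if with_answer else ""
        return f"{ex['question']}\n{options}\nAnswer:{answer}"

    parts = [render(ex, with_answer=True) for ex in shots]
    parts.append(render(question, with_answer=False))
    return "\n\n".join(parts)


if __name__ == "__main__":
    shot = {"question": "What is 2 + 2?",
            "choices": ["3", "4", "5", "6"], "answer": "B"}
    test = {"question": "What is 3 + 3?",
            "choices": ["5", "6", "7", "8"], "answer": "B"}
    print(format_mmlu_prompt([shot] * 5, test))
```

Answering such prompts rewards exactly the copying and in-context pattern-following abilities on which the pure SSM models were observed to lag.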

Hybrid Architectures:

  • The hybrid model combining Mamba-2, self-attention, and MLP layers (termed Mamba-2-Hybrid) was substantially more effective. At 8B parameters and trained on 3.5T tokens, Mamba-2-Hybrid outperformed the corresponding Transformer on all 12 evaluated standard tasks, with an average improvement of 2.65 points (an illustrative layer schedule for such a hybrid is sketched after this list).
  • Mamba-2-Hybrid models are predicted to generate tokens up to 8x faster at inference time for long sequences, because the SSM layers carry a constant-size recurrent state instead of a key-value cache that grows with sequence length.
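As a concrete picture of what such a hybrid stack might look like, the sketch below builds a layer schedule matching the reported proportions of roughly 43% Mamba-2, 7% self-attention, and 50% MLP layers. The 56-layer depth and the greedy interleaving rule are illustrative assumptions, not the authors' exact configuration.

```python
from collections import Counter

# Illustrative layer schedule for a Mamba-2-Hybrid-style stack.
# The paper reports roughly 43% Mamba-2, 7% attention, and 50% MLP layers;
# the depth and interleaving below are placeholders, not the exact pattern.
def build_hybrid_schedule(num_layers: int = 56) -> list[str]:
    targets = {"mamba2": 0.43, "attention": 0.07, "mlp": 0.50}
    counts = {k: round(v * num_layers) for k, v in targets.items()}
    counts["mlp"] += num_layers - sum(counts.values())  # absorb rounding drift

    schedule: list[str] = []
    placed: Counter[str] = Counter()
    for step in range(1, num_layers + 1):
        # Place the layer type furthest behind its quota so the three types
        # stay roughly evenly interleaved through the depth of the stack.
        kind = max(counts, key=lambda k: counts[k] * step / num_layers - placed[k])
        schedule.append(kind)
        placed[kind] += 1
    return schedule


if __name__ == "__main__":
    sched = build_hybrid_schedule()
    print(Counter(sched))  # counts per layer type, e.g. 24 Mamba-2 / 4 attention / 28 MLP
    print(sched[:10])      # first few layers of the interleaving
```

Keeping only a handful of attention layers preserves the hybrid's retrieval and in-context learning behavior while leaving most of the stack's per-token generation cost independent of sequence length.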

Long-Context Capabilities:

  • Variants of the Mamba-2-Hybrid extended to sequence lengths of 16K, 32K, and even 128K maintained or improved accuracy on standard tasks and outperformed the Transformer on synthetic long-context benchmarks such as the Phonebook lookup task (a minimal sketch of such a prompt follows this list).
  • In long-context evaluations such as LongBench and RULER, Mamba-2-Hybrid models displayed strong in-context learning and copying capabilities, though certain multi-document question-answering settings favored the Transformer.
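The Phonebook task referenced above is a synthetic retrieval probe: a long list of name-number pairs is placed in the context and the model must return the number for a single queried name. A minimal generator in that spirit is sketched below; the exact formatting, book length, and few-shot setup used in the paper may differ.

```python
import random
import string

# Minimal sketch of a Phonebook-style retrieval prompt: bury one queried
# name-number pair inside a long synthetic phonebook and ask for a lookup.
def make_phonebook_prompt(num_entries: int = 500, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)

    def rand_name() -> str:
        return rng.choice(string.ascii_uppercase) + \
               "".join(rng.choices(string.ascii_lowercase, k=6))

    def rand_number() -> str:
        return f"{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"

    # Duplicate random names simply overwrite earlier entries; fine for a sketch.
    book = {rand_name(): rand_number() for _ in range(num_entries)}
    query_name, answer = rng.choice(list(book.items()))

    lines = [f"{name}: {number}" for name, number in book.items()]
    prompt = "\n".join(lines) + f"\n\nWhat is {query_name}'s phone number?"
    return prompt, answer


if __name__ == "__main__":
    prompt, answer = make_phonebook_prompt(num_entries=20)
    print(prompt.splitlines()[0])   # first phonebook entry
    print("expected answer:", answer)
```

Scaling `num_entries` stretches the prompt toward the 16K-128K regimes evaluated in the paper, which is what makes the task a sensitive probe of long-context retrieval.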

Implications and Future Work

The study underscores the potential of hybrid models incorporating Mamba-2 layers to achieve superior performance and inference efficiency compared to pure Transformer models. This has several practical implications for the future development and deployment of LLMs:

  • Inference Efficiency: The reduced computational and memory overhead during token generation makes Mamba-2-Hybrid models attractive for applications requiring real-time or low-latency responses; a back-of-envelope comparison of inference-time memory follows this list.
  • Scalability: The successful extension of Mamba-2-Hybrid models to 128K context lengths indicates their potential for handling extensive and complex input data sequences, benefiting use cases in document understanding and long-form content generation.
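To quantify the memory argument, the sketch below compares the per-sequence key-value cache an attention-only model must hold against the fixed-size recurrent state of a stack of SSM layers. The layer counts and dimensions are illustrative assumptions for a model of roughly this size, not the paper's exact configuration, and a hybrid's few attention layers would still keep a (much smaller) cache.

```python
# Back-of-envelope comparison of inference-time memory: a Transformer's KV
# cache grows linearly with sequence length, while an SSM layer carries a
# fixed-size recurrent state. All dimensions are illustrative assumptions
# for a model of roughly 8B parameters, not the paper's exact configuration.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Keys and values per token, per layer: 2 * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def ssm_state_bytes(n_layers: int = 32, d_model: int = 4096,
                    state_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Mamba-style layers keep a constant-size state per layer, independent of
    # sequence length (roughly d_model x state_dim elements per layer).
    return n_layers * d_model * state_dim * bytes_per_elem


if __name__ == "__main__":
    for seq_len in (4_096, 32_768, 131_072):
        kv = kv_cache_bytes(seq_len) / 2**30
        ssm = ssm_state_bytes() / 2**30
        print(f"seq_len={seq_len:>7}: KV cache ≈ {kv:5.2f} GiB, SSM state ≈ {ssm:4.2f} GiB")
```

Because the recurrent state does not grow with sequence length, both the memory footprint and the per-token generation cost stay essentially flat as contexts extend toward 128K tokens, which underlies the predicted inference speedup.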

Future research directions could involve:

  • Optimization of Training Procedures: Exploring tailored training recipes for SSM-based models, especially for mixed long-document datasets, to further enhance their performance on natural long-context tasks.
  • Fine-tuning and Prompt Techniques: Investigating more sophisticated prompt engineering strategies to improve the robustness of hybrid models in various knowledge retrieval and question-answering scenarios.
  • Hybrid Model Architectures: Delving deeper into the architectural nuances, such as the ratio and placement of SSM, attention, and MLP layers, to optimize hybrid model performance for specific tasks.

In conclusion, the comparison provides compelling evidence that integrating selective state-space models with attention mechanisms offers a promising avenue for pushing the boundaries of what is achievable with large-scale NLP models. The release of code and model checkpoints as part of NVIDIA's Megatron-LM project further promotes reproducibility and encourages continued innovation in this field.
