An Empirical Study of Mamba-based Language Models

(arXiv:2406.07887)
Published Jun 12, 2024 in cs.LG and cs.CL

Abstract

Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.

Figure: Mamba and Mamba-2 blocks, highlighting the difference in per-layer all-reduce operations under tensor parallelism.

Overview

  • The study compares Mamba-based Selective State-Space Models (SSMs) with traditional Transformer-based architectures across models up to 8 billion parameters trained on datasets up to 3.5 trillion tokens.

  • Mamba-2 models generally matched or outperformed Transformers on standard tasks but fell short on tasks requiring in-context learning and long-context reasoning, while the hybrid Mamba-2-Hybrid model exceeded the Transformer on all 12 standard tasks evaluated.

  • Mamba-2-Hybrid models delivered a predicted speedup of up to 8x in inference-time token generation and were extended to sequence lengths of up to 128K, making them well suited to low-latency applications and to workloads with long, complex inputs.

An Empirical Study of Mamba-based Language Models

In the presented study, the authors conducted a comprehensive comparison between Mamba-based Selective State-Space Models (SSMs) and traditional Transformer-based architectures. The comparison spans a variety of scales, up to 8-billion-parameter models trained on datasets comprising as many as 3.5 trillion tokens. The principal focus of the study was to determine whether Mamba-based models, specifically the Mamba and Mamba-2 architectures, could match or exceed the performance of Transformers on standard and long-context natural language processing tasks.

Key Findings

The paper reports several key findings from the empirical evaluations:

Task and Training Comparisons:

  • Mamba and Mamba-2 models were evaluated against 8B-parameter Transformer models using the same datasets and hyperparameters, ensuring a controlled comparative analysis.
  • The study highlighted that pure Mamba-based models generally matched or exceeded Transformers on many downstream language tasks but fell short on tasks requiring in-context learning and long-context reasoning.

Performance on Standard Tasks:

  • On a benchmark suite of 12 standard tasks, including WinoGrande, PIQA, HellaSwag, ARC-Easy, and ARC-Challenge, pure Mamba-2 models achieved competitive or superior results compared to Transformers. However, they underperformed on the MMLU and Phonebook tasks, which require in-context learning and retrieval from long contexts (a minimal 5-shot MMLU-style prompt is sketched after this list).
  • The performance gap on MMLU was most pronounced when training on the smaller 1.1T-token dataset, suggesting that pure SSM models need training budgets approaching or exceeding those of the Transformer baselines to close this gap.
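To make the 5-shot MMLU setting concrete, the sketch below assembles a few-shot multiple-choice prompt in the usual MMLU style: five solved examples followed by the test question, from which the model must infer the answer format in context. The template and field names are illustrative assumptions; the paper's evaluation harness may format prompts differently.

```python
# Minimal sketch of 5-shot prompting in the MMLU style: a few solved
# multiple-choice examples precede the test question, and the model must
# copy the pattern and emit the correct letter. The exact template used in
# the paper's evaluation may differ.
def format_mmlu_prompt(shots: list[dict], question: dict) -> str:
    def render(ex: dict, with_answer: bool) -> str:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", ex["choices"]))
        answer = f" {ex['answer']}" if with_answer else ""
        return f"{ex['question']}\n{options}\nAnswer:{answer}"

    parts = [render(ex, with_answer=True) for ex in shots]
    parts.append(render(question, with_answer=False))
    return "\n\n".join(parts)


if __name__ == "__main__":
    shot = {"question": "What is 2 + 2?",
            "choices": ["3", "4", "5", "6"], "answer": "B"}
    test = {"question": "What is 3 + 3?",
            "choices": ["5", "6", "7", "8"], "answer": "B"}
    print(format_mmlu_prompt([shot] * 5, test))
```

Answering such prompts rewards exactly the copying and in-context pattern-following abilities on which the pure SSM models were observed to lag.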

Hybrid Architectures:

  • The hybrid model combining Mamba-2, self-attention, and MLP layers (termed Mamba-2-Hybrid) was substantially more effective. At 8B parameters and trained on 3.5T tokens, Mamba-2-Hybrid outperformed the corresponding Transformer on all 12 evaluated standard tasks, with an average improvement of 2.65 points (an illustrative layer schedule for such a hybrid is sketched after this list).
  • Mamba-2-Hybrid models are predicted to generate tokens up to 8x faster at inference time for long sequences, because the SSM layers carry a constant-size recurrent state instead of a key-value cache that grows with sequence length.
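As a concrete picture of what such a hybrid stack might look like, the sketch below builds a layer schedule matching the reported proportions of roughly 43% Mamba-2, 7% self-attention, and 50% MLP layers. The 56-layer depth and the greedy interleaving rule are illustrative assumptions, not the authors' exact configuration.

```python
from collections import Counter

# Illustrative layer schedule for a Mamba-2-Hybrid-style stack.
# The paper reports roughly 43% Mamba-2, 7% attention, and 50% MLP layers;
# the depth and interleaving below are placeholders, not the exact pattern.
def build_hybrid_schedule(num_layers: int = 56) -> list[str]:
    targets = {"mamba2": 0.43, "attention": 0.07, "mlp": 0.50}
    counts = {k: round(v * num_layers) for k, v in targets.items()}
    counts["mlp"] += num_layers - sum(counts.values())  # absorb rounding drift

    schedule: list[str] = []
    placed: Counter[str] = Counter()
    for step in range(1, num_layers + 1):
        # Place the layer type furthest behind its quota so the three types
        # stay roughly evenly interleaved through the depth of the stack.
        kind = max(counts, key=lambda k: counts[k] * step / num_layers - placed[k])
        schedule.append(kind)
        placed[kind] += 1
    return schedule


if __name__ == "__main__":
    sched = build_hybrid_schedule()
    print(Counter(sched))  # counts per layer type, e.g. 24 Mamba-2 / 4 attention / 28 MLP
    print(sched[:10])      # first few layers of the interleaving
```

Keeping only a handful of attention layers preserves the hybrid's retrieval and in-context learning behavior while leaving most of the stack's per-token generation cost independent of sequence length.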

Long-Context Capabilities:

  • Variants of the Mamba-2-Hybrid extended to sequence lengths of 16K, 32K, and even 128K maintained or improved accuracy on standard tasks and outperformed the Transformer on synthetic long-context benchmarks such as the Phonebook lookup task (a minimal sketch of such a prompt follows this list).
  • In long-context evaluations such as LongBench and RULER, Mamba-2-Hybrid models displayed strong in-context learning and copying capabilities, though certain multi-document question-answering settings favored the Transformer.
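The Phonebook task referenced above is a synthetic retrieval probe: a long list of name-number pairs is placed in the context and the model must return the number for a single queried name. A minimal generator in that spirit is sketched below; the exact formatting, book length, and few-shot setup used in the paper may differ.

```python
import random
import string

# Minimal sketch of a Phonebook-style retrieval prompt: bury one queried
# name-number pair inside a long synthetic phonebook and ask for a lookup.
def make_phonebook_prompt(num_entries: int = 500, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)

    def rand_name() -> str:
        return rng.choice(string.ascii_uppercase) + \
               "".join(rng.choices(string.ascii_lowercase, k=6))

    def rand_number() -> str:
        return f"{rng.randint(100, 999)}-{rng.randint(1000, 9999)}"

    # Duplicate random names simply overwrite earlier entries; fine for a sketch.
    book = {rand_name(): rand_number() for _ in range(num_entries)}
    query_name, answer = rng.choice(list(book.items()))

    lines = [f"{name}: {number}" for name, number in book.items()]
    prompt = "\n".join(lines) + f"\n\nWhat is {query_name}'s phone number?"
    return prompt, answer


if __name__ == "__main__":
    prompt, answer = make_phonebook_prompt(num_entries=20)
    print(prompt.splitlines()[0])   # first phonebook entry
    print("expected answer:", answer)
```

Scaling `num_entries` stretches the prompt toward the 16K-128K regimes evaluated in the paper, which is what makes the task a sensitive probe of long-context retrieval.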

Implications and Future Work

The study underscores the potential of hybrid models incorporating Mamba-2 layers to achieve superior performance and inference efficiency compared to pure Transformer models. This has several practical implications for the future development and deployment of LLMs:

  • Inference Efficiency: The reduced computational and memory overhead during token generation makes Mamba-2-Hybrid models attractive for applications requiring real-time or low-latency responses; a back-of-envelope comparison of inference-time memory follows this list.
  • Scalability: The successful extension of Mamba-2-Hybrid models to 128K context lengths indicates their potential for handling extensive and complex input data sequences, benefiting use cases in document understanding and long-form content generation.
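To quantify the memory argument, the sketch below compares the per-sequence key-value cache an attention-only model must hold against the fixed-size recurrent state of a stack of SSM layers. The layer counts and dimensions are illustrative assumptions for a model of roughly this size, not the paper's exact configuration, and a hybrid's few attention layers would still keep a (much smaller) cache.

```python
# Back-of-envelope comparison of inference-time memory: a Transformer's KV
# cache grows linearly with sequence length, while an SSM layer carries a
# fixed-size recurrent state. All dimensions are illustrative assumptions
# for a model of roughly 8B parameters, not the paper's exact configuration.

def kv_cache_bytes(seq_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Keys and values per token, per layer: 2 * n_kv_heads * head_dim elements.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def ssm_state_bytes(n_layers: int = 32, d_model: int = 4096,
                    state_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Mamba-style layers keep a constant-size state per layer, independent of
    # sequence length (roughly d_model x state_dim elements per layer).
    return n_layers * d_model * state_dim * bytes_per_elem


if __name__ == "__main__":
    for seq_len in (4_096, 32_768, 131_072):
        kv = kv_cache_bytes(seq_len) / 2**30
        ssm = ssm_state_bytes() / 2**30
        print(f"seq_len={seq_len:>7}: KV cache ≈ {kv:5.2f} GiB, SSM state ≈ {ssm:4.2f} GiB")
```

Because the recurrent state does not grow with sequence length, both the memory footprint and the per-token generation cost stay essentially flat as contexts extend toward 128K tokens, which underlies the predicted inference speedup.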

Future research directions could involve:

  • Optimization of Training Procedures: Exploring tailored training recipes for SSM-based models, especially for mixed long-document datasets, to further enhance their performance on natural long-context tasks.
  • Fine-tuning and Prompt Techniques: Investigating more sophisticated prompt engineering strategies to improve the robustness of hybrid models in various knowledge retrieval and question-answering scenarios.
  • Hybrid Model Architectures: Delving deeper into the architectural nuances, such as the ratio and placement of SSM, attention, and MLP layers, to optimize hybrid model performance for specific tasks.

In conclusion, the comparison provides compelling evidence that integrating selective state-space models with attention mechanisms offers a promising avenue for pushing the boundaries of what is achievable with large-scale NLP models. The release of code and model checkpoints as part of NVIDIA's Megatron-LM project further promotes reproducibility and encourages continued innovation in this field.
