
BlackMamba: Mixture of Experts for State-Space Models

(arXiv:2402.01771)
Published Feb 1, 2024 in cs.CL , cs.AI , cs.DC , and cs.LG

Abstract

State-space models (SSMs) have recently demonstrated competitive performance to transformers at large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM model, shows impressive performance in both language modeling and long sequence processing tasks. Simultaneously, mixture-of-expert (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines both of the benefits of SSM and MoE architectures, combining linear-complexity generation from SSM with cheap and fast inference from MoE. We release all weights, checkpoints, and inference code open-source. Inference code at: https://github.com/Zyphra/BlackMamba

Figure: Comparison of dense transformer, dense Mamba, transformer-MoE, and Mamba-MoE architectures.

Overview

  • BlackMamba integrates State-Space Models (SSMs) and Mixture-of-Experts (MoE) models into a hybrid architecture that retains the linear time and memory complexity of SSMs.

  • The architecture alternates Mamba and MoE blocks, activating only a sparse subset of expert MLPs per token and using the SwiGLU activation for computational efficiency (a minimal sketch of such an expert follows this list).

  • BlackMamba achieves performance comparable to dense transformers with fewer training FLOPs and exhibits constant per-token generation latency, unlike traditional transformers.

  • The model's design opens up new research directions for AI architecture and has been open-sourced to contribute to wider community collaboration.
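As a concrete reference for the SwiGLU expert MLPs mentioned above, here is a minimal PyTorch sketch. The module name SwiGLUExpert and the dimension arguments are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert MLP with a SwiGLU activation; dimensions are illustrative."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gating branch
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # value branch
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # projection back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x W_gate) multiplied elementwise with (x W_up), then projected down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

For example, SwiGLUExpert(1024, 4096) maps a (batch, seq, 1024) tensor back to the same shape through a 4096-wide gated hidden layer.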

Introduction to BlackMamba

State-Space Models (SSMs) and Mixture-of-Experts (MoE) models each represent an advance in language modeling, addressing different limitations of traditional transformer architectures. The contribution of this work is the hybridization of the two into BlackMamba, which combines the linear time and memory complexity of SSMs with the compute and latency efficiency of MoE models. The result is a language model that remains competitive with strong baselines on standard language modeling benchmarks while substantially outperforming them in training and inference cost.

Distinctive Architecture and Implementation

BlackMamba's architecture alternates Mamba blocks, which replace the attention mechanism of standard transformers, with MoE blocks. This arrangement preserves the benefits of each component and puts them to full use within BlackMamba. Two notable design decisions were to use the SwiGLU activation function in the expert MLPs and to activate only a sparse subset of the model's total parameters on any given forward pass, improving compute efficiency. The 340M/1.5B and 630M/2.8B BlackMamba models (active/total parameters) were fully trained on 300 billion tokens of a custom dataset and then open-sourced.
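To make the block layout concrete, the following PyTorch sketch stacks a Mamba block and a sparse MoE block behind pre-norm residual connections, reusing the SwiGLUExpert module from the earlier sketch. The mamba_block argument stands in for any SSM implementation mapping (batch, seq, d_model) to the same shape; the class names, normalization choice, and simple top-1 routing are assumptions for illustration and may differ from the released code.

```python
import torch
import torch.nn as nn

class MoEBlock(nn.Module):
    """Sparse MoE layer: a top-1 router sends each token to a single SwiGLU expert.
    Routing and load-balancing details here are illustrative only."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])          # (batch * seq, d_model)
        probs = self.router(tokens).softmax(dim=-1)  # (batch * seq, num_experts)
        top_p, top_e = probs.max(dim=-1)             # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_e == e
            if mask.any():
                # only the selected expert runs for these tokens (sparse activation)
                out[mask] = top_p[mask].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)

class BlackMambaLayer(nn.Module):
    """One hybrid layer: a Mamba (SSM) block for sequence mixing followed by an
    MoE block for channel mixing, each behind a pre-norm residual connection."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, mamba_block: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mamba = mamba_block  # external SSM block: (B, T, d_model) -> (B, T, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = MoEBlock(d_model, d_ff, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mamba(self.norm1(x))  # linear-time sequence mixing
        x = x + self.moe(self.norm2(x))    # sparse expert MLPs
        return x
```

Because each token activates only one expert, the MoE block adds parameters (memory) without adding per-token compute, which is the trade-off the paper combines with the SSM's linear-time sequence mixing.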

Comprehensive Results and Performance

The results showcased by BlackMamba are striking. Using significantly fewer training FLOPs, BlackMamba achieves performance comparable to dense transformer models on a range of downstream tasks. In inference speed, it holds a clear advantage not only over dense transformers but also over Mamba and transformer-MoE models. Moreover, BlackMamba's per-token generation latency remains constant as a function of sequence length, whereas the attention cost of transformers grows quadratically with sequence length. Together, these results position BlackMamba as an exceptionally efficient model for both training and inference compared to its predecessors.
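A back-of-envelope comparison helps explain the constant generation latency: with a KV cache, an attention layer's per-token decode cost grows with context length, while an SSM layer only updates a fixed-size state. The functions below are rough asymptotic estimates with illustrative dimensions (d_model = 4096, d_state = 16), not measurements from the paper.

```python
def attention_decode_flops(d_model: int, context_len: int) -> int:
    """Rough per-token decode cost of one attention layer with a KV cache:
    the new query is scored against every cached key and mixed over the values,
    so cost grows linearly with context (full-sequence processing is quadratic).
    Constant projection costs are omitted; this is asymptotic only."""
    return 2 * context_len * d_model

def ssm_decode_flops(d_model: int, d_state: int) -> int:
    """Rough per-token decode cost of an SSM layer: update a fixed-size recurrent
    state, independent of how many tokens precede the current one."""
    return 2 * d_model * d_state

# Per-token generation cost: attention grows with context, the SSM stays flat.
for n in (1_000, 10_000, 100_000):
    print(f"context={n:>7}  attention≈{attention_decode_flops(4096, n):,}  "
          f"ssm≈{ssm_decode_flops(4096, 16):,}")
```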

Further Discussion and Implications

The implications of the BlackMamba architecture extend far beyond performance metrics alone. The combination of SSMs with MoE in our model underscores a potential paradigm shift in how various architectural components can be modularly combined for efficient AI model design. While still preliminary, our exploration opens numerous avenues for future research, such as optimizing hyperparameters, exploring fine-tuning approaches, and investigating the composite effect on the model’s learned representations and behaviors. The open-sourced nature of BlackMamba provides a valuable asset for the broader AI community to enhance the collective understanding and development of this pioneering architecture.

In conclusion, BlackMamba represents a significant step forward in the evolution of language models, offering an architecture that achieves remarkable efficiency without compromising quality or performance. Its linear complexity and fast inference pave the way for language models that can process longer sequences more quickly, marking an exciting juncture in the landscape of AI-driven language processing.
