BlackMamba: Mixture of Experts for State-Space Models (2402.01771v1)
Abstract: State-space models (SSMs) have recently demonstrated performance competitive with transformers on large-scale language modeling benchmarks while achieving linear time and memory complexity as a function of sequence length. Mamba, a recently released SSM, shows impressive performance in both language modeling and long-sequence processing tasks. Simultaneously, mixture-of-experts (MoE) models have shown remarkable performance while significantly reducing the compute and latency costs of inference at the expense of a larger memory footprint. In this paper, we present BlackMamba, a novel architecture that combines the Mamba SSM with MoE to obtain the benefits of both. We demonstrate that BlackMamba performs competitively against both Mamba and transformer baselines, and outperforms them in inference and training FLOPs. We fully train and open-source 340M/1.5B and 630M/2.8B BlackMamba models on 300B tokens of a custom dataset. We show that BlackMamba inherits and combines the benefits of the SSM and MoE architectures, pairing linear-complexity generation from the SSM with the cheap and fast inference of MoE. We release all weights, checkpoints, and inference code open-source. Inference code at: https://github.com/Zyphra/BlackMamba
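To make the combination concrete, below is a minimal PyTorch sketch of a BlackMamba-style block as one might read it from the abstract: a sequence-mixing sub-block followed by a routed mixture-of-experts MLP, each wrapped in a pre-norm residual connection. This is an illustrative assumption, not the authors' implementation (that lives in the linked repository): the names `BlackMambaStyleBlock`, `Top1MoE`, and `ExpertMLP`, the top-1 router, and the causal depthwise convolution standing in for the selective-SSM mixer of Gu and Dao are all placeholders chosen to keep the example self-contained and runnable.

```python
# Hedged sketch of a BlackMamba-style layer: NOT the authors' code.
# It only illustrates the structure implied by the abstract: an SSM-like
# sequence mixer plus a sparsely routed MoE MLP, each with pre-norm residuals.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertMLP(nn.Module):
    """A single feed-forward expert (sizes here are illustrative)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class Top1MoE(nn.Module):
    """Switch-style top-1 routing: each token is processed by one expert,
    so only ~1/num_experts of the MLP parameters are active per token."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [ExpertMLP(d_model, d_ff) for _ in range(num_experts)]
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        probs = self.router(x).softmax(dim=-1)
        gate, idx = probs.max(dim=-1)          # top-1 gate value and expert index
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e                    # tokens routed to expert e
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


class BlackMambaStyleBlock(nn.Module):
    """Pre-norm residual block: SSM-style sequence mixer, then routed MoE MLP.
    The mixer below is a causal depthwise convolution standing in for the
    selective SSM so that this sketch stays self-contained."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = nn.Conv1d(d_model, d_model, kernel_size=4, padding=3, groups=d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = Top1MoE(d_model, d_ff, num_experts)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.mixer(self.norm1(x).transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        x = x + h                              # residual around the sequence mixer
        return x + self.moe(self.norm2(x))     # residual around the MoE MLP


if __name__ == "__main__":
    block = BlackMambaStyleBlock(d_model=64, d_ff=256, num_experts=8)
    y = block(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

The property this sketch tries to preserve is the one the abstract emphasizes: per-token compute scales with the size of a single expert rather than with the total parameter count, while the sequence mixer avoids attention's quadratic cost in sequence length.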