Jamba: A Hybrid Transformer-Mamba Language Model (2403.19887v2)
Abstract: We present Jamba, a new base large language model built on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations; in the particular configuration we have implemented, the resulting model fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and a small memory footprint compared to vanilla Transformers, while achieving state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for context lengths of up to 256K tokens. We study various architectural decisions, such as how to combine Transformer and Mamba layers and how to mix experts, and show that some of them are crucial in large-scale modeling. We also describe several interesting properties of these architectures revealed by the training and evaluation of Jamba, and plan to release checkpoints from various ablation runs to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
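
The abstract describes interleaved Transformer and Mamba layers, with an MoE MLP replacing the dense MLP in some layers. Below is a minimal PyTorch sketch of that interleaving pattern only: the Mamba mixer is stubbed out (the real model uses selective state-space layers), attention is non-causal here for brevity, and all dimensions, layer ratios, and expert counts are illustrative placeholders rather than Jamba's released configuration.

```python
import torch
import torch.nn as nn


class MambaMixerStub(nn.Module):
    """Placeholder for a selective state-space (Mamba) mixer; a real
    implementation would perform the selective SSM scan here."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # stand-in for the SSM computation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class Top2MoE(nn.Module):
    """Token-level top-2 routing over a small pool of expert MLPs
    (computed densely here for clarity, not efficiency)."""

    def __init__(self, d_model: int, n_experts: int = 4, d_ff: int = 512):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x).softmax(dim=-1)   # (batch, tokens, n_experts)
        weights, idx = scores.topk(2, dim=-1)     # top-2 experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)
            for k in range(2):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k : k + 1] * expert_out
        return out


class HybridLayer(nn.Module):
    """One layer: an attention or Mamba mixer, followed by a dense or MoE MLP."""

    def __init__(self, d_model: int, n_heads: int, use_attention: bool, use_moe: bool):
        super().__init__()
        self.use_attention = use_attention
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        if use_attention:
            self.mixer = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        else:
            self.mixer = MambaMixerStub(d_model)
        self.mlp = Top2MoE(d_model) if use_moe else nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        if self.use_attention:
            h, _ = self.mixer(h, h, h, need_weights=False)  # causal masking omitted
        else:
            h = self.mixer(h)
        x = x + h                               # residual around the mixer
        return x + self.mlp(self.norm2(x))      # residual around the (MoE) MLP


# Interleave: a minority of attention layers among mostly Mamba layers,
# with MoE replacing the dense MLP in alternating layers (placeholder ratios).
d_model, n_heads, n_layers = 256, 8, 8
layers = nn.ModuleList(
    HybridLayer(
        d_model,
        n_heads,
        use_attention=(i % 4 == 0),  # attention in a minority of layers
        use_moe=(i % 2 == 1),        # MoE in every other layer
    )
    for i in range(n_layers)
)

x = torch.randn(2, 16, d_model)      # (batch, tokens, d_model)
for layer in layers:
    x = layer(x)
print(x.shape)                       # torch.Size([2, 16, 256])
```

Only the attention layers accumulate a key-value cache at inference time, so keeping them a small fraction of the stack is what gives the hybrid its small memory footprint and high throughput at long context, while the MoE layers raise capacity without increasing the number of active parameters per token.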