
MambaVision: A Hybrid Mamba-Transformer Vision Backbone

arXiv:2407.08083
Published Jul 10, 2024 in cs.CV

Abstract

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications. Our core contribution includes redesigning the Mamba formulation to enhance its capability for efficient modeling of visual features. In addition, we conduct a comprehensive ablation study on the feasibility of integrating Vision Transformers (ViT) with Mamba. Our results demonstrate that equipping the Mamba architecture with several self-attention blocks at the final layers greatly improves the modeling capacity to capture long-range spatial dependencies. Based on our findings, we introduce a family of MambaVision models with a hierarchical architecture to meet various design criteria. For image classification on the ImageNet-1K dataset, MambaVision model variants achieve new State-of-the-Art (SOTA) performance in terms of Top-1 accuracy and image throughput. In downstream tasks such as object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets, MambaVision outperforms comparably sized backbones and demonstrates more favorable performance. Code: https://github.com/NVlabs/MambaVision.

Figure: MambaVision model architecture with residual convolutional blocks and MambaVision-Transformer hybrid stages.

Overview

  • The paper 'MambaVision: A Hybrid Mamba-Transformer Vision Backbone' introduces a novel vision architecture that combines Mamba and Transformer models to enhance visual feature modeling and address long-range spatial dependency limitations.

  • Several critical innovations are introduced, including a redesigned Mamba block that improves accuracy and image throughput, a systematic investigation of hybrid integration patterns, and a hierarchical Mamba-Transformer model that achieves state-of-the-art performance on multiple benchmarks.

  • Experimental results show that MambaVision outperforms comparably sized ConvNet and Vision Transformer models in image classification, object detection, instance segmentation, and semantic segmentation, demonstrating its efficacy across a range of vision applications.

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

The paper "MambaVision: A Hybrid Mamba-Transformer Vision Backbone" by Ali Hatamizadeh and Jan Kautz presents a novel architecture for vision tasks, integrating the strengths of both Mamba and Transformer models. The core innovation lies in redesigning the Mamba formulation to enhance its effectiveness in modeling visual features, as well as integrating Vision Transformers (ViTs) to address the limitations in capturing long-range spatial dependencies inherent in Mamba's original design.

Summary of Contributions

The authors introduce several critical contributions:

  1. Redesigned Vision-Friendly Mamba Block: The Mamba block is redesigned to improve accuracy and image throughput. The redesign replaces the causal convolution with a regular convolution and adds a symmetric branch without the SSM for better global context modeling.
  2. Systematic Investigation of Integration Patterns: A thorough examination of different integration patterns between Mamba and Transformer blocks is conducted. It is demonstrated that incorporating self-attention blocks in the final layers significantly enhances the model’s ability to capture global context and long-range dependencies.
  3. Introduction of MambaVision: A new hybrid Mamba-Transformer model, MambaVision, is introduced. It features a hierarchical architecture with CNN-based residual blocks for fast feature extraction and achieves new state-of-the-art (SOTA) performance on ImageNet-1K in terms of the Top-1 accuracy versus image throughput tradeoff.

Methodology

MambaVision's architecture is hierarchical, consisting of four stages. The first two stages employ CNN-based layers for fast feature extraction at higher resolutions, while stages 3 and 4 use the redesigned MambaVision mixer and Transformer blocks. This design captures both short- and long-range context effectively, leading to strong performance.
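As a rough illustration of this stage layout, here is a minimal PyTorch sketch. The channel widths, depths, stem, and downsamplers are assumptions for illustration rather than the paper's exact configuration, and stages 3 and 4 are stubbed out (their blocks are sketched in the following sections).

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Residual CNN block for the high-resolution stages (1 and 2)."""

    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)


class HierarchicalBackboneSketch(nn.Module):
    """Four-stage layout: CNN blocks at high resolution, token mixers at low
    resolution. Channel widths and depths here are placeholder assumptions."""

    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.stage1 = nn.Sequential(*[ConvBlock(dims[0]) for _ in range(depths[0])])
        self.down1 = nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1)
        self.stage2 = nn.Sequential(*[ConvBlock(dims[1]) for _ in range(depths[1])])
        self.down2 = nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1)
        # Stages 3 and 4 hold MambaVision mixer blocks followed by self-attention
        # blocks; they are stubbed out here and sketched in the next two sections.
        self.stage3 = nn.Identity()
        self.down3 = nn.Conv2d(dims[2], dims[3], 3, stride=2, padding=1)
        self.stage4 = nn.Identity()

    def forward(self, x):
        x = self.stage1(self.stem(x))   # stride 4:  224 -> 56
        x = self.stage2(self.down1(x))  # stride 8:  56 -> 28
        x = self.stage3(self.down2(x))  # stride 16: 28 -> 14
        x = self.stage4(self.down3(x))  # stride 32: 14 -> 7
        return x


print(HierarchicalBackboneSketch()(torch.randn(1, 3, 224, 224)).shape)  # [1, 512, 7, 7]
```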

MambaVision Mixer

The MambaVision token mixer combines an SSM branch with a parallel, symmetric branch that omits the SSM. Regular convolutions replace Mamba's causal convolutions, removing a directional constraint that is unnecessary for vision tasks, and the symmetric branch compensates for content lost to the sequential constraints of the SSM. The outputs of both branches are then concatenated and projected into a single feature representation, strengthening the model's global context understanding.
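A minimal PyTorch sketch of this two-branch structure follows. The selective state-space scan is abstracted behind a placeholder (`ssm`), and the projection sizes, kernel size, and activation are assumptions for illustration; the official repository contains the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MambaVisionMixerSketch(nn.Module):
    """Two-branch token mixer sketch: an SSM branch and a symmetric branch without
    the SSM, each preceded by a regular (non-causal) 1D convolution; the two
    outputs are concatenated and projected. `ssm` is a stand-in for the scan."""

    def __init__(self, dim, ssm=None):
        super().__init__()
        half = dim // 2
        self.in_proj = nn.Linear(dim, dim)                    # split input into two half-width branches
        self.conv_ssm = nn.Conv1d(half, half, 3, padding=1)   # regular conv (replaces Mamba's causal conv)
        self.conv_sym = nn.Conv1d(half, half, 3, padding=1)   # symmetric branch without SSM
        self.ssm = ssm if ssm is not None else nn.Identity()  # placeholder for the selective state-space scan
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                                     # x: (batch, tokens, dim)
        x1, x2 = self.in_proj(x).chunk(2, dim=-1)
        x1 = F.silu(self.conv_ssm(x1.transpose(1, 2)).transpose(1, 2))
        x1 = self.ssm(x1)                                     # SSM branch
        x2 = F.silu(self.conv_sym(x2.transpose(1, 2)).transpose(1, 2))
        return self.out_proj(torch.cat([x1, x2], dim=-1))     # concatenate both branches, then project


tokens = torch.randn(2, 196, 256)                             # e.g. 14x14 tokens of width 256
print(MambaVisionMixerSketch(256)(tokens).shape)              # torch.Size([2, 196, 256])
```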

Hybrid Integration Patterns

A key finding is that positioning self-attention blocks in the final layers of each stage significantly bolsters the model’s capability for capturing global dependencies. This staged hybridization effectively balances the computational efficiency of Mamba blocks with the extensive contextual learning facilitated by Transformers.
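The sketch below illustrates this pattern for a single stage: mixer blocks occupy the first half of the stage and standard self-attention blocks occupy the final layers. The even split, the use of `nn.TransformerEncoderLayer` as the attention block, and the `mixer_cls` placeholder are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


def build_hybrid_stage(dim, depth, num_heads=8, mixer_cls=nn.Identity):
    """Illustrative stage: token-mixer blocks first, self-attention blocks in the
    final layers. `mixer_cls` stands in for a MambaVision mixer block."""
    blocks = []
    for i in range(depth):
        if i < depth // 2:
            blocks.append(mixer_cls())                  # first half: Mamba-style mixer blocks
        else:
            blocks.append(nn.TransformerEncoderLayer(   # final layers: self-attention blocks
                d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True))
    return nn.Sequential(*blocks)


stage = build_hybrid_stage(dim=256, depth=8)             # 4 mixer slots, then 4 attention blocks
out = stage(torch.randn(2, 196, 256))                    # token shape (batch, tokens, dim) preserved
```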

Experimental Evaluation

The experimental results corroborate the efficacy of MambaVision:

  • Image Classification: On the ImageNet-1K dataset, MambaVision models outperform comparable ConvNets and ViTs by substantial margins while delivering higher image throughput. Notably, MambaVision-B achieves a Top-1 accuracy of 84.2%, surpassing both ConvNeXt-B (83.8%) and Swin-B (83.5%).
  • Object Detection and Instance Segmentation: On MS COCO, models with MambaVision backbones consistently outperform counterparts in Mask R-CNN and Cascade Mask R-CNN setups. For instance, MambaVision-T achieves a box AP of 46.4 and a mask AP of 41.8, outperforming ConvNeXt-T and Swin-T.
  • Semantic Segmentation: On ADE20K, MambaVision models exhibit superior performance, with MambaVision-B achieving an mIoU of 49.1%, surpassing Swin-B (48.1%).

Implications and Future Directions

The introduction and success of the MambaVision backbone have significant implications for the field of computer vision. The hybrid architecture harmonizes the efficient contextual learning of Mamba models with the comprehensive dependency capturing of Transformer models, yielding robust and efficient performance across various vision tasks.

Looking forward, the flexibility of MambaVision in integrating different architectural components opens avenues for further exploration:

  • Multimodal Learning: Given its hybrid nature, MambaVision could be extended to multimodal learning tasks, facilitating more integrated processing of diverse data modalities.
  • Fine-tuning for Specialized Tasks: MambaVision's adaptive capacity can be leveraged for fine-tuning on specialized tasks such as medical imaging, autonomous driving, and remote sensing, where precision and efficiency are paramount.
  • Scalability: Future research could explore the scalability of the MambaVision model for larger datasets and more complex tasks, potentially integrating more advanced attention mechanisms or further optimizing the hybrid structure.

In conclusion, the MambaVision model presents a promising direction for vision backbones, combining the best of Mamba and Transformer architectures to achieve superior performance and efficiency. This paper lays foundational work that could inspire subsequent advancements in the design of hybrid neural networks for computer vision.
