VMamba: Visual State Space Model

(2401.10166)
Published Jan 18, 2024 in cs.CV

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long been the predominant backbone networks for visual representation learning. While ViTs have recently gained prominence over CNNs due to their superior fitting capabilities, their scalability is largely constrained by the quadratic complexity of attention computation. Inspired by the capability of Mamba in efficiently modeling long sequences, we propose VMamba, a generic vision backbone model aiming to reduce the computational complexity to linear while retaining ViTs' advantageous features. To enhance VMamba's adaptability in processing vision data, we introduce the Cross-Scan Module (CSM) to enable 1D selective scanning in 2D image space with global receptive fields. Additionally, we make further improvements in implementation details and architectural designs to enhance VMamba's performance and boost its inference speed. Extensive experimental results demonstrate VMamba's promising performance across various visual perception tasks, highlighting its pronounced advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

The VMamba series outperforms rival backbones in ImageNet-1K top-1 accuracy while offering dynamic weights and an effective global receptive field.

Overview

  • VMamba integrates strengths of CNNs and ViTs while enhancing computational efficiency.

  • VMamba's architecture features global receptive fields and dynamic weights with linear computational complexity.

  • The novel Cross-Scan Module (CSM) in VMamba enables traversal of spatial domains while retaining global properties.

  • VMamba utilizes the Selective Scan State Space Sequence Model (S6) mechanism for global information integration without quadratic complexity.

  • Benchmark tests show VMamba competes well against established models, particularly with high-resolution images.

Overview of Visual State Space Model (VMamba)

In recent advancements in visual representation learning, two primary foundation models, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have dominated the field. CNNs are known for their scalability, with computational complexity that grows linearly with image resolution. ViTs, by contrast, are celebrated for their superior fitting capabilities, but face the quadratic computational complexity of self-attention. Upon close examination, what gives ViTs their edge is the combination of global receptive fields and dynamic weights in their architecture.
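To make this scaling contrast concrete, the rough sketch below compares the multiply-add count of global self-attention with that of a linear selective scan as the number of image tokens grows. The embedding dimension and state size used here are illustrative assumptions, not figures from the paper.

```python
# Rough multiply-add counts: global self-attention vs. a linear selective scan.
# The embedding dimension (96) and state size (16) are illustrative assumptions.

def attention_cost(num_tokens: int, dim: int = 96) -> int:
    # Q·K^T and the attention-weighted sum over V each cost ~N^2 * d operations.
    return 2 * num_tokens ** 2 * dim

def selective_scan_cost(num_tokens: int, dim: int = 96, state_size: int = 16) -> int:
    # A selective scan visits each token once; per-token work scales with d * state_size.
    return num_tokens * dim * state_size

for side in (14, 28, 56):  # token-grid side lengths, e.g. 224/448/896 px at patch size 16
    n = side * side
    ratio = attention_cost(n) / selective_scan_cost(n)
    print(f"{n:5d} tokens: attention / scan cost ratio ≈ {ratio:.1f}x")
```

The gap widens linearly with the token count, which is precisely the high-resolution regime where the paper reports VMamba's advantage becoming most pronounced.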

Introducing VMamba

A novel architecture known as the Visual State Space Model (VMamba) has been introduced to combine the strengths of CNNs and ViTs while also tackling their respective computational efficiency issues. VMamba leverages the advantages of ViTs in retaining global receptive fields and dynamic weights, yet manages to do so with linear computational complexity. To surmount the inherent direction-sensitive problem associated with non-causal visual data, VMamba employs a new module called the Cross-Scan Module (CSM), which allows for traversing the spatial domain in a way that maintains these global properties without the computational expense typically incurred by ViTs.

The Backbone of VMamba

At the heart of VMamba is a mechanism inspired by state space models, particularly the Selective Scan State Space Sequence Model (S6), originally designed for long-sequence modeling in NLP tasks. The selective scan mechanism built into S6 is what enables VMamba to maintain a global receptive field while circumventing quadratic complexity. The CSM plays a crucial role as well, ensuring that every element within the spatial domain of an image can integrate information from all other locations. This is achieved via a four-way scanning strategy that keeps the overall computational complexity linear.
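The snippet below is a minimal sketch of the cross-scan / cross-merge idea: the feature map is unfolded into four 1D sequences (row-major, column-major, and the reversals of both), each of which would then be processed by a 1D selective scan before the results are folded back into a 2D map. Function names and shapes are illustrative; the released VMamba code uses an optimized implementation rather than this plain PyTorch version.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D scan orders: (B, 4, C, H*W).

    Orders: row-major, column-major, and the reversals of both, so that every
    position can aggregate context from all four traversal directions.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                          # (B, C, H*W), rows first
    col_major = x.transpose(2, 3).flatten(2)          # (B, C, H*W), columns first
    forward = torch.stack([row_major, col_major], 1)  # (B, 2, C, L)
    return torch.cat([forward, forward.flip(-1)], 1)  # append reversed traversals

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold four scanned sequences (B, 4, C, H*W) back into a (B, C, H, W) map."""
    B, _, C, L = y.shape
    fwd = y[:, :2] + y[:, 2:].flip(-1)                # undo the reversals and sum
    row = fwd[:, 0]                                   # already in row-major order
    col = fwd[:, 1].reshape(B, C, W, H).transpose(2, 3).reshape(B, C, L)
    return (row + col).reshape(B, C, H, W)

# Example: in an SS2D-style block, each of the four sequences would be run
# through a 1D selective scan (S6) before cross_merge recombines them.
x = torch.randn(2, 96, 14, 14)
seqs = cross_scan(x)               # (2, 4, 96, 196)
out = cross_merge(seqs, 14, 14)    # (2, 96, 14, 14)
```

Because each of the four traversals visits every token exactly once, the extra cost is a constant factor of four rather than anything super-linear in the number of tokens.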

Benchmarking VMamba's Performance

VMamba was put through rigorous testing across a variety of visual perception tasks. The results are revealing: VMamba consistently exhibits strong performance, and as the resolution of the input images increases, its advantages become even more pronounced. Compared to established benchmarks such as ResNet, ViT, and Swin transformers, VMamba holds its own, especially when dealing with larger image inputs where other models would see a significant rise in computational demands. Importantly, VMamba shows that it is feasible to have a model architecture that combines the desirable qualities of a global receptive field and dynamic weights without becoming computationally prohibitive.
