VMamba: Visual State Space Model

(2401.10166)
Published Jan 18, 2024 in cs.CV

Abstract

Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have long been the predominant backbone networks for visual representation learning. While ViTs have recently gained prominence over CNNs due to their superior fitting capabilities, their scalability is largely constrained by the quadratic complexity of attention computation. Inspired by the capability of Mamba in efficiently modeling long sequences, we propose VMamba, a generic vision backbone model aiming to reduce the computational complexity to linear while retaining ViTs' advantageous features. To enhance VMamba's adaptability in processing vision data, we introduce the Cross-Scan Module (CSM) to enable 1D selective scanning in 2D image space with global receptive fields. Additionally, we make further improvements in implementation details and architectural designs to enhance VMamba's performance and boost its inference speed. Extensive experimental results demonstrate VMamba's promising performance across various visual perception tasks, highlighting its pronounced advantages in input scaling efficiency compared to existing benchmark models. Source code is available at https://github.com/MzeroMiko/VMamba.

The VMamba series outperforms rival backbones in ImageNet-1K top-1 accuracy while offering dynamic weights and an effective global receptive field.

Overview

  • VMamba integrates strengths of CNNs and ViTs while enhancing computational efficiency.

  • VMamba's architecture features global receptive fields and dynamic weights with linear computational complexity.

  • The novel Cross-Scan Module (CSM) in VMamba enables traversal of spatial domains while retaining global properties.

  • VMamba utilizes the Selective Scan State Space Sequence Model (S6) mechanism for global information integration without quadratic complexity.

  • Benchmark tests show VMamba competes well against established models, particularly with high-resolution images.

Overview of Visual State Space Model (VMamba)

In recent advancements in visual representation learning, two primary foundation models, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have dominated the field. CNNs are known for their scalability, with computational complexity that grows linearly with image resolution. ViTs, by contrast, are celebrated for their superior fitting capabilities, but face the quadratic computational complexity of self-attention. Upon close examination, what gives ViTs their edge is the combination of global receptive fields and dynamic weights in their architecture.
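To make this scaling contrast concrete, the rough sketch below compares the multiply-add count of global self-attention with that of a linear selective scan as the number of image tokens grows. The embedding dimension and state size used here are illustrative assumptions, not figures from the paper.

```python
# Rough multiply-add counts: global self-attention vs. a linear selective scan.
# The embedding dimension (96) and state size (16) are illustrative assumptions.

def attention_cost(num_tokens: int, dim: int = 96) -> int:
    # Q·K^T and the attention-weighted sum over V each cost ~N^2 * d operations.
    return 2 * num_tokens ** 2 * dim

def selective_scan_cost(num_tokens: int, dim: int = 96, state_size: int = 16) -> int:
    # A selective scan visits each token once; per-token work scales with d * state_size.
    return num_tokens * dim * state_size

for side in (14, 28, 56):  # token-grid side lengths, e.g. 224/448/896 px at patch size 16
    n = side * side
    ratio = attention_cost(n) / selective_scan_cost(n)
    print(f"{n:5d} tokens: attention / scan cost ratio ≈ {ratio:.1f}x")
```

The gap widens linearly with the token count, which is precisely the high-resolution regime where the paper reports VMamba's advantage becoming most pronounced.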

Introducing VMamba

A novel architecture known as the Visual State Space Model (VMamba) has been introduced to combine the strengths of CNNs and ViTs while also tackling their respective computational efficiency issues. VMamba leverages the advantages of ViTs in retaining global receptive fields and dynamic weights, yet manages to do so with linear computational complexity. To surmount the inherent direction-sensitive problem associated with non-causal visual data, VMamba employs a new module called the Cross-Scan Module (CSM), which allows for traversing the spatial domain in a way that maintains these global properties without the computational expense typically incurred by ViTs.

The Backbone of VMamba

At the heart of VMamba is a mechanism inspired by state space models, particularly the Selective Scan State Space Sequence Model (S6), originally designed for long-sequence modeling in NLP tasks. The selective scan mechanism built into S6 is what enables VMamba to maintain a global receptive field while circumventing quadratic complexity. The CSM plays a crucial role as well, ensuring that every element within the spatial domain of an image can integrate information from all other locations. This is achieved via a four-way scanning strategy that keeps the overall computational complexity linear.
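The snippet below is a minimal sketch of the cross-scan / cross-merge idea: the feature map is unfolded into four 1D sequences (row-major, column-major, and the reversals of both), each of which would then be processed by a 1D selective scan before the results are folded back into a 2D map. Function names and shapes are illustrative; the released VMamba code uses an optimized implementation rather than this plain PyTorch version.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Unfold a (B, C, H, W) feature map into four 1D scan orders: (B, 4, C, H*W).

    Orders: row-major, column-major, and the reversals of both, so that every
    position can aggregate context from all four traversal directions.
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                          # (B, C, H*W), rows first
    col_major = x.transpose(2, 3).flatten(2)          # (B, C, H*W), columns first
    forward = torch.stack([row_major, col_major], 1)  # (B, 2, C, L)
    return torch.cat([forward, forward.flip(-1)], 1)  # append reversed traversals

def cross_merge(y: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Fold four scanned sequences (B, 4, C, H*W) back into a (B, C, H, W) map."""
    B, _, C, L = y.shape
    fwd = y[:, :2] + y[:, 2:].flip(-1)                # undo the reversals and sum
    row = fwd[:, 0]                                   # already in row-major order
    col = fwd[:, 1].reshape(B, C, W, H).transpose(2, 3).reshape(B, C, L)
    return (row + col).reshape(B, C, H, W)

# Example: in an SS2D-style block, each of the four sequences would be run
# through a 1D selective scan (S6) before cross_merge recombines them.
x = torch.randn(2, 96, 14, 14)
seqs = cross_scan(x)               # (2, 4, 96, 196)
out = cross_merge(seqs, 14, 14)    # (2, 96, 14, 14)
```

Because each of the four traversals visits every token exactly once, the extra cost is a constant factor of four rather than anything super-linear in the number of tokens.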

Benchmarking VMamba's Performance

VMamba was put through rigorous testing across a variety of visual perception tasks. The results are revealing: VMamba consistently exhibits strong performance, and as the resolution of the input images increases, its advantages become even more pronounced. Compared to established benchmarks such as ResNet, ViT, and Swin transformers, VMamba holds its own, especially when dealing with larger image inputs where other models would see a significant rise in computational demands. Importantly, VMamba shows that it is feasible to have a model architecture that combines the desirable qualities of a global receptive field and dynamic weights without becoming computationally prohibitive.
