
Abstract

Recently, state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long-sequence modeling. Meanwhile, building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20K semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images, and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.

Figure: Comparison with the DeiT-Ti model, showing training and performance improvements.

Overview

  • The paper introduces Vision Mamba (Vim), a state space model-based vision backbone designed to learn visual representations efficiently.

  • Vim uses bidirectional Mamba blocks to process high-resolution images quickly and with lower memory costs.

  • It outperforms the DeiT vision transformer on ImageNet classification, COCO object detection, and ADE20K semantic segmentation, in both accuracy and efficiency.

  • Vim is well suited to hardware accelerators, requiring fewer IO operations and less memory, and it applies recomputation strategies to further reduce resource demands.

  • Empirical results show Vim’s promise as a backbone for future vision models with potential applications in various domains.

Introduction

The field of computer vision has witnessed remarkable advancements, primarily driven by the success of convolutional neural networks (CNNs) and, more recently, vision transformers (ViTs). These established paradigms, however, face challenges when processing high-resolution images, a critical capability for numerous applications. One promising way to address these computational challenges involves state space models (SSMs), specifically the Mamba model, which captures long-range dependencies efficiently. This research introduces Vision Mamba (Vim), a pure SSM-based vision backbone that offers competitive performance on visual tasks without the usual reliance on self-attention mechanisms.

Methodology

The proposed Vim employs bidirectional Mamba blocks, integrating SSMs with awareness of both global visual context and spatial position. By marking image sequences with position embeddings and processing them with bidirectional selective state space models, Vim compresses the visual representation into a compact form. This design permits feature extraction at significantly higher speed and lower memory cost than current transformer-based models.
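
To make the block structure concrete, below is a minimal, self-contained PyTorch sketch of the idea rather than the authors' implementation: `ToySSM`, `BidirectionalBlock`, and `TinyVimSketch` are hypothetical names, the toy diagonal scan merely stands in for Mamba's hardware-aware selective SSM, and details such as class-token placement and how the forward and backward scans are fused are simplified assumptions.

```python
# Minimal, self-contained sketch of a Vim-style bidirectional sequence model.
# This is NOT the authors' implementation: ToySSM is a toy diagonal linear
# recurrence standing in for Mamba's hardware-aware selective SSM, and the
# class-token placement and scan-fusion details are simplified assumptions.
import torch
import torch.nn as nn


class ToySSM(nn.Module):
    """Toy diagonal state-space scan: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(dim))  # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (B, L, D)
        decay = torch.sigmoid(self.a)  # keep the recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.size(1)):  # sequential scan: O(L) time, constant-size state
            h = decay * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)


class BidirectionalBlock(nn.Module):
    """Mix tokens with a forward scan and a backward (flipped) scan, then merge."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = ToySSM(dim)
        self.bwd = ToySSM(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (B, L, D)
        z = self.norm(x)
        y_fwd = self.fwd(z)
        y_bwd = self.bwd(z.flip(dims=[1])).flip(dims=[1])
        return x + self.proj(y_fwd + y_bwd)  # residual connection


class TinyVimSketch(nn.Module):
    """Patchify -> add class token and position embeddings -> bidirectional blocks."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=2, num_classes=1000):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        self.blocks = nn.ModuleList([BidirectionalBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, imgs):  # imgs: (B, 3, H, W)
        x = self.patch_embed(imgs).flatten(2).transpose(1, 2)  # (B, L, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # mark positions
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 0])  # classify from the class token


if __name__ == "__main__":
    logits = TinyVimSketch()(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 1000])
```

The point the sketch tries to make visible is that every block sees the token sequence in both directions, which is how a recurrent-style model can recover the global context that a single causal scan would miss.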

Vim's approach is validated through extensive evaluations against existing models on ImageNet classification, COCO object detection, and ADE20K semantic segmentation, where it outperforms DeiT, a widely recognized vision transformer, in both accuracy and computational efficiency.

Efficiency Analysis

The researchers conducted a thorough analysis of Vim's efficiency, highlighting its suitability for hardware accelerators such as GPUs, particularly in how it manages input/output operations and memory. Vim requires notably less IO and applies recomputation strategies so that activations needed for gradient computation are recomputed rather than stored, reducing the memory footprint.
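
As a rough illustration of what a recomputation strategy looks like in practice, the sketch below applies PyTorch's generic gradient-checkpointing utility to a stack of stand-in blocks; it shows the general technique, not the authors' exact scheme, and `CheckpointedStack` is a made-up name.

```python
# Hedged sketch of activation recomputation (gradient checkpointing) using
# PyTorch's generic utility; this illustrates the general technique, not the
# authors' exact recomputation scheme, and CheckpointedStack is a made-up name.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedStack(nn.Module):
    def __init__(self, dim: int = 192, depth: int = 24):
        super().__init__()
        # Stand-in blocks; in a real model these would be the bidirectional SSM blocks.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
             for _ in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            if self.training:
                # Intermediate activations inside blk are not kept; they are
                # recomputed during the backward pass, lowering peak memory at
                # the cost of one extra forward pass per block.
                x = checkpoint(blk, x, use_reentrant=False)
            else:
                x = blk(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 192, requires_grad=True)
    out = CheckpointedStack().train()(tokens)
    out.mean().backward()
    print(tokens.grad.shape)  # torch.Size([2, 197, 192])
```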

Moreover, Vim's computational efficiency stands out when compared with self-attention in transformers. Because its cost scales linearly with sequence length, Vim can handle much longer sequences, thereby extending its applicability to image resolutions previously deemed challenging for transformer-style models.
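
As a back-of-the-envelope sketch of this scaling argument (illustrative notation, not the paper's exact cost formulas: $M$ is the token sequence length, $D$ the embedding dimension, and $N$ a small, fixed SSM state size):

$$\Omega(\text{self-attention}) \in O(M^2 D), \qquad \Omega(\text{SSM}) \in O(M D N).$$

Since $N$ is a constant (e.g., 16 in common Mamba configurations), the SSM cost grows linearly in $M$ while self-attention grows quadratically; doubling the image resolution roughly quadruples $M$, multiplying the attention cost by about 16 but the SSM cost by only about 4.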

Experimental Results

Empirical evidence supports the practicality and robustness of Vim. For instance, on ImageNet-1K image classification, Vim achieves a top-1 accuracy surpassing that of DeiT with fewer parameters. Semantic segmentation on the ADE20K dataset echoes these results, with Vim matching the performance of ResNet-101 while requiring significantly fewer computational resources.

The performance gains extend to object detection and instance segmentation tasks on the COCO dataset. Vim demonstrates a stronger ability to capture long-range context compared to DeiT, as illustrated by its superior performance in detecting medium and large-sized objects.

Conclusion

In summary, Vim is a compelling alternative to traditional CNNs and ViTs, offering an efficient and effective solution to the challenge of visual representation learning. With its capability to process long sequences more efficiently and its exceptional handling of high-resolution images, Vim presents itself as a potential backbone for the next generation of vision foundation models. Future research may leverage Vim for large-scale unsupervised visual data pretraining, multimodal task processing, and the analysis of complex images in various domains such as medical imaging and remote sensing.
