
Mamba-R: Vision Mamba ALSO Needs Registers

(2405.14858)
Published May 23, 2024 in cs.CV

Abstract

Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba -- they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba-R. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba-R's feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba-R attains stronger performance and scales better. For example, on the ImageNet benchmark, our base-size Mamba-R attains 82.9% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size (i.e., with 341M parameters), attaining a competitive accuracy of 83.2% (84.5% if finetuned with 384x384 inputs). Additional validation on the downstream semantic segmentation task also supports Mamba-R's efficacy.

Addressing Vision Mamba's artifacts by inserting input-independent register tokens to enhance final predictions.

Overview

  • The paper addresses the artifact issue in Vision Mamba's feature maps by introducing an improved architecture called Mamba-R, which uses register tokens to improve image representation and reduce computation wasted on uninformative tokens.

  • Empirical results show that Mamba-R achieves higher accuracy on ImageNet and stronger segmentation results on ADE20K than the vanilla Vision Mamba model.

  • The paper suggests practical implications for data scientists and proposes future research directions to further optimize the model and explore its applications in real-time scenarios.

Understanding Vision Mamba: Tackling Artifacts in Feature Maps

Introduction

Vision Transformers (ViTs) have made waves in computer vision by offering an alternative to traditional Convolutional Neural Networks (CNNs), often showing strong performance across a variety of tasks. However, a relatively newer approach called Vision Mamba has been gaining attention. This paper explores a key drawback of Vision Mamba, namely artifacts in its feature maps, and proposes an improved architecture, termed Mamba-R. Let's break this down.

Background: Vision Mamba and State Space Models (SSMs)

State Space Models (SSMs) are efficient at handling sequential data thanks to their linear computational complexity. Vision Mamba is an adaptation of SSMs for visual tasks, designed to handle high-resolution images more cheaply than CNNs or ViTs. Yet, Vision Mamba isn't without its problems. The primary issue? Artifacts in the feature maps: high-norm, low-information tokens that concentrate in background regions and waste computational resources.
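As a rough illustration of where that linear complexity comes from (a simplified, non-selective sketch, not the paper's implementation), a discrete state-space layer updates a hidden state once per token and reads each output from that state; the matrix names below (`A_bar`, `B_bar`, `C`) are generic placeholders.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Minimal (non-selective) discrete state-space recurrence.

    x:     (L, D)  input token sequence
    A_bar: (N, N)  discretized state transition
    B_bar: (N, D)  input projection into the hidden state
    C:     (D, N)  readout from the hidden state

    Each step touches the hidden state exactly once, so the scan is
    O(L) in sequence length -- the property that makes Mamba attractive
    for long visual token sequences.
    """
    L, D = x.shape
    N = A_bar.shape[0]
    h = np.zeros(N)
    y = np.zeros((L, D))
    for t in range(L):
        h = A_bar @ h + B_bar @ x[t]   # update hidden state with token t
        y[t] = C @ h                   # emit output for token t
    return y
```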

What's The Problem?

Imagine looking at an image where the model highlights irrelevant background areas rather than crucial objects or regions. These undesired activations are called artifacts. Vision Mamba suffers from this problem more severely than ViTs. Researchers found that the feature maps of Vision Mamba were cluttered with high-norm tokens in non-semantic background areas. These artifacts compromise the model's focus and efficiency, making it hard to scale the architecture effectively.
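One simple way to visualize these artifacts (an illustrative diagnostic, not a procedure from the paper) is to map the L2 norm of every patch token back onto the patch grid; artifact tokens show up as high-norm outliers sitting in low-information background areas.

```python
import torch

def token_norm_map(tokens, grid_hw):
    """Reshape per-token L2 norms into a 2D map for visualization.

    tokens:  (L, D) patch-token features from a vision backbone
    grid_hw: (H, W) patch grid, with H * W == L

    High-norm outliers appearing in background regions are the
    'artifacts' discussed above.
    """
    norms = tokens.norm(dim=-1)      # (L,) per-token L2 norm
    return norms.reshape(grid_hw)    # (H, W) heatmap over the patch grid
```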

The Proposed Solution: Mamba-R

To address this issue, the authors introduce Mamba-R, an improved Vision Mamba architecture. The core idea revolves around using "register tokens". Let's break down the two primary changes:

  1. Even Distribution of Register Tokens: Unlike the prior ViT solution, which appends register tokens at one end of the sequence, Mamba-R inserts these tokens evenly throughout the input sequence. This helps the model better capture and represent the global context of the image.
  2. Register Recycling: Instead of discarding registers after they pass through the backbone, Mamba-R recycles them for the final prediction, enriching the model's global image representation (see the sketch after this list).
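Below is a minimal sketch of these two ideas, assuming hypothetical component names (`mamba_blocks` as a stack of Mamba blocks, a simple linear head) and a patch count that divides evenly by the number of registers; the actual Mamba-R implementation may differ in how registers are spaced and fused for the final prediction.

```python
import torch
import torch.nn as nn

class RegisterSketch(nn.Module):
    """Toy illustration of evenly inserted, then recycled, register tokens."""

    def __init__(self, dim, num_registers, num_classes, mamba_blocks):
        super().__init__()
        # Learnable, input-independent register tokens.
        self.registers = nn.Parameter(torch.randn(num_registers, dim) * 0.02)
        self.blocks = mamba_blocks                        # hypothetical stack of Mamba blocks
        self.head = nn.Linear(dim * num_registers, num_classes)
        self.num_registers = num_registers

    def forward(self, patch_tokens):                      # patch_tokens: (B, L, D)
        B, L, D = patch_tokens.shape
        R = self.num_registers
        assert L % R == 0, "sketch assumes the patch count divides evenly"
        step = L // R
        regs = self.registers.unsqueeze(0).expand(B, -1, -1)
        # 1) Evenly insert registers: one register in front of each chunk of patches.
        pieces = []
        for i in range(R):
            pieces.append(regs[:, i:i + 1])
            pieces.append(patch_tokens[:, i * step:(i + 1) * step])
        seq = torch.cat(pieces, dim=1)                    # (B, L + R, D)
        seq = self.blocks(seq)
        # 2) Recycle registers: gather their output positions and concatenate
        #    them into the global image representation for classification.
        reg_pos = torch.arange(R, device=seq.device) * (step + 1)
        reg_out = seq[:, reg_pos, :].reshape(B, -1)       # (B, R * D)
        return self.head(reg_out)
```

Spreading the registers through the sequence, rather than placing them at one end, fits Mamba's uni-directional, recurrent processing: each register then sees a local stretch of the image while collectively covering the whole sequence.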

Empirical Results

The results? Promising, to say the least.

  • ImageNet Accuracy: Mamba-R-Base achieves an accuracy of 82.9%, outpacing Vim-B's (a vanilla Vision Mamba model) 81.8%. Scaling further, Mamba-R-Large with 341M parameters attains 83.2% accuracy, rising to 84.5% when fine-tuned with 384x384 inputs.
  • Semantic Segmentation: On the ADE20K benchmark, Mamba-R attains a mean Intersection over Union (mIoU) of 49.1%, significantly outperforming Vim.

These figures demonstrate Mamba-R's ability to scale to larger models while maintaining high accuracy, marking a notable advancement over previous Vision Mamba architectures.

Practical Implications

For data scientists working on image recognition or segmentation tasks, Mamba-R could offer a more efficient and accurate model, particularly when dealing with high-resolution images. The reduction of artifacts means the model focuses on more meaningful content, potentially improving performance on a variety of visual tasks.

Future Directions

While the improvements are clear, there is always room for further research. Possible future directions might include:

  • Further Optimization: Refining how register tokens are utilized and exploring other configurations could yield even better results.
  • Real-time Applications: Testing Mamba-R in real-world scenarios where computational efficiency is crucial, such as autonomous vehicles or medical imaging.

Conclusion

In summary, Mamba-R presents a significant improvement over the standard Vision Mamba by addressing the artifact issue head-on. Its even distribution of register tokens and recycling of them for prediction result in cleaner and more effective feature maps, with robust performance across benchmarks. As AI continues to evolve, such enhancements underscore the potential of refining model architectures to unlock new capabilities and efficiencies.
