
VSSD: Vision Mamba with Non-Causal State Space Duality

(arXiv:2407.18559)
Published Jul 26, 2024 in cs.CV

Abstract

Vision transformers have significantly advanced the field of computer vision, offering robust modeling capabilities and a global receptive field. However, their high computational demands limit their applicability when processing long sequences. To tackle this issue, State Space Models (SSMs) have gained prominence in vision tasks, as they offer linear computational complexity. Recently, State Space Duality (SSD), an improved variant of SSMs, was introduced in Mamba2 to enhance model performance and efficiency. However, the inherent causal nature of SSD/SSMs restricts their application to non-causal vision tasks. To address this limitation, we introduce the Visual State Space Duality (VSSD) model, a non-causal format of SSD. Specifically, we propose to discard the magnitude of interactions between the hidden state and tokens while preserving their relative weights, which relieves the dependence of token contributions on previous tokens. Combined with multi-scan strategies, we show that the scanning results can be integrated to achieve non-causality, which not only improves the performance of SSD in vision tasks but also enhances its efficiency. We conduct extensive experiments on various benchmarks including image classification, detection, and segmentation, where VSSD surpasses existing state-of-the-art SSM-based models. Code and weights are available at \url{https://github.com/YuHengsss/VSSD}.

The causal nature of SSMs/SSD poses challenges for image data; the VSSD model lifts this constraint and outperforms ConvNeXt in both accuracy and efficiency.

Overview

  • The paper introduces the Visual State Space Duality (VSSD) model, a novel adaptation of State Space Models (SSMs) for computer vision tasks, addressing limitations in Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) with improved performance and efficiency.

  • Key innovations include the introduction of Non-Causal SSD (NC-SSD), enhanced token processing, and the incorporation of Depth-Wise Convolution (DWConv), a Feed-Forward Network (FFN), a Local Perception Unit (LPU), and hybrid self-attention mechanisms to capture global and local features effectively.

  • Extensive benchmarking shows that VSSD outperforms existing models in various vision tasks like image classification (ImageNet-1K), object detection and instance segmentation (MS COCO), and semantic segmentation (ADE20K), indicating the model’s potential for real-time and complex vision applications.

Visual State Space Duality: Advances in State-Space-Based Vision Models

The paper presents the Visual State Space Duality (VSSD) model, an extension of State Space Duality (SSD) adapted to computer vision tasks. It addresses the limitations of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) by combining the strengths of State Space Models (SSMs) with non-causal, position-independent processing, enhancing both performance and efficiency.

State space models (SSMs) have emerged as efficient alternatives to attention-based methods in NLP due to their linear computational complexity. However, their causal nature restricts their utility in vision tasks, where pixels have no inherent ordering. To overcome this limitation, the paper introduces a non-causal format of SSD (NC-SSD), which modifies the iterative update process by discarding the magnitude of token interactions while preserving their relative weights. This adaptation allows the VSSD model to handle non-causal vision tasks more effectively; a minimal sketch contrasting the two formulations follows.
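
To make the contrast concrete, here is a minimal sketch (our illustration, not the authors' implementation) of the causal SSD recurrence next to the non-causal form it motivates; the per-token scalar gate `a` and all function names are our assumptions:

```python
import torch

def causal_ssd(x, a, B, C):
    """Causal SSD recurrence: h_t = a_t * h_{t-1} + B_t x_t^T, y_t = C_t h_t.
    x: (L, D) tokens; a: (L,) scalar gates; B, C: (L, N) projections."""
    L, D = x.shape
    N = B.shape[1]
    h = torch.zeros(N, D)
    ys = []
    for t in range(L):
        h = a[t] * h + torch.outer(B[t], x[t])  # earlier tokens decay via a product of gates
        ys.append(C[t] @ h)                     # token t only sees tokens <= t
    return torch.stack(ys)

def non_causal_ssd(x, a, B, C):
    """NC-SSD idea: keep each token's relative weight a_i but drop the
    position-dependent cumulative decay, yielding one shared hidden state."""
    H = torch.einsum('l,ln,ld->nd', a, B, x)    # H = sum_i a_i * B_i x_i^T
    return torch.einsum('ln,nd->ld', C, H)      # every token reads the same H
```

Because the non-causal form reduces to two matrix products instead of a sequential scan, all tokens can be processed concurrently, which is where the reported training and inference speedups come from.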

Key Innovations and Contributions

  1. NC-SSD Introduction: The paper introduces NC-SSD by transforming SSD into a non-causal, position-independent format. The transformation builds on the scalar A matrix in SSD's hidden-state update, which makes it possible to remove the causal mask and enables the model to capture global context.
  2. Enhanced Token Processing: By preserving the relative weights and discarding the magnitude of token interactions, the model can concurrently process tokens, thus improving training and inference speeds.
  3. Model Architecture Adjustments (a sketch of the assembled block follows this list):
  • Incorporation of Depth-Wise Convolution (DWConv): Replacing causal convolution with DWConv enhances local feature extraction in image data.
  • Integration of Feed-Forward Network (FFN): Adding an FFN module after the NC-SSD block improves information mixing across channels.
  • Local Perception Unit (LPU): LPUs capture finer, local features, further improving model performance.
  • Hybrid with Self-Attention: Employing self-attention in the final stages captures high-level features more effectively.
  • Overlapped Downsampling Layers: The use of overlapped convolutions for downsampling assists in maintaining feature continuity, mitigating the loss of important visual information.
  4. Extensive Benchmarking: The VSSD model demonstrates superior performance across various vision tasks, including image classification (ImageNet-1K), object detection and instance segmentation (MS COCO), and semantic segmentation (ADE20K).
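
As referenced above, here is a minimal sketch of how these components could be assembled into one block (LPU, NC-SSD token mixer, FFN); the module names, state size, and sigmoid gating are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class NCSSDMixer(nn.Module):
    """Single-head, illustrative non-causal SSD token mixer."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.to_B = nn.Linear(dim, state_dim)
        self.to_C = nn.Linear(dim, state_dim)
        self.to_a = nn.Linear(dim, 1)  # per-token scalar weight (assumed gating)

    def forward(self, x):                            # x: (B, L, D)
        a = torch.sigmoid(self.to_a(x))              # relative token weights in (0, 1)
        Bm, Cm = self.to_B(x), self.to_C(x)          # (B, L, N)
        H = torch.einsum('bln,bld->bnd', Bm * a, x)  # shared global hidden state
        return torch.einsum('bln,bnd->bld', Cm, H)

class VSSDBlock(nn.Module):
    """LPU -> NC-SSD mixing -> FFN, each with a residual connection."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.lpu = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise conv
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = NCSSDMixer(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        x = x + self.lpu(x)                          # local perception
        B, C, H, W = x.shape
        t = x.flatten(2).transpose(1, 2)             # (B, HW, C)
        t = t + self.mixer(self.norm1(t))            # global, non-causal mixing
        t = t + self.ffn(self.norm2(t))              # channel mixing
        return t.transpose(1, 2).reshape(B, C, H, W)
```

In the full model, the later stages would swap the NC-SSD mixer for standard self-attention (the hybrid design), and downsampling between stages would use overlapped convolutions.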

Experimental Findings

The VSSD model was rigorously tested against several benchmarks:

  • Image Classification on ImageNet-1K: VSSD outperformed CNNs, ViTs, and other SSM-based models across size categories. For instance, VSSD-T achieved a top-1 accuracy of 83.7%, outperforming VMambaV9-T by 1.2 percentage points.
  • Object Detection and Instance Segmentation on MS COCO: VSSD-T demonstrated considerable improvements, with box average precision (AP$^b$) increasing by 4.2 points over Swin-T.
  • Semantic Segmentation on ADE20K: VSSD consistently outperformed well-established models, including Swin and ConvNeXt, by substantial margins in the tiny model category.

Implications of the Research

The paper underscores the potential of SSM-based models for vision tasks, highlighting their ability to capture global context at linear computational cost. The NC-SSD formulation directly addresses the causal limitations of traditional state space models, and this non-causal approach opens new avenues for efficient and effective vision models in practical applications such as real-time object detection and complex scene segmentation.

Future Directions

The VSSD model's promising performance suggests several future research directions:

  • Scalability: Testing the model with larger datasets, like ImageNet-22K, could offer insights into its scalability and robustness.
  • Further Optimizations: Exploring advanced normalization techniques and fine-tuning the interplay between different modules (NC-SSD, self-attention, and FFNs) could enhance model efficiency and performance further.
  • Extended Applications: Applying VSSD to other vision tasks, such as video understanding and 3D vision, would demonstrate its flexibility and extend its applicability.

In conclusion, the VSSD model represents a significant stride in vision modeling by effectively leveraging non-causal state space duality. It not only advances the theoretical foundations of SSMs in vision but also showcases practical enhancements in terms of performance and computational efficiency.
