LocalMamba: Visual State Space Model with Windowed Selective Scan (2403.09338v1)

Published 14 Mar 2024 in cs.CV and cs.AI

Abstract: Recent advancements in state space models, notably Mamba, have demonstrated significant progress in modeling long sequences for tasks like language understanding. Yet, their application in vision tasks has not markedly surpassed the performance of traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). This paper posits that the key to enhancing Vision Mamba (ViM) lies in optimizing scan directions for sequence modeling. Traditional ViM approaches, which flatten spatial tokens, overlook the preservation of local 2D dependencies, thereby elongating the distance between adjacent tokens. We introduce a novel local scanning strategy that divides images into distinct windows, effectively capturing local dependencies while maintaining a global perspective. Additionally, acknowledging the varying preferences for scan patterns across different network layers, we propose a dynamic method to independently search for the optimal scan choices for each layer, substantially improving performance. Extensive experiments across both plain and hierarchical models underscore our approach's superiority in effectively capturing image representations. For example, our model significantly outperforms Vim-Ti by 3.1% on ImageNet with the same 1.5G FLOPs. Code is available at: https://github.com/hunto/LocalMamba.

Summary

The paper presents a novel approach to enhance visual state space modeling by partitioning images into local windows and dynamically selecting scan patterns.
It introduces a Spatial and Channel Attention module to integrate multiple scan outputs, effectively preserving local dependencies for improved image analysis.
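Experimental results show LocalMamba outperforming traditional CNNs and ViTs as well as the Vim and VMamba baselines; for example, LocalVim-T reaches 76.2% accuracy on ImageNet classification, 3.1 percentage points above Vim-Ti at the same 1.5G FLOPs.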

LocalMamba: Enhancing Visual State Space Models with Windowed Selective Scan

The paper "LocalMamba: Visual State Space Model with Windowed Selective Scan" aims to make Vision Mamba (ViM) models more effective on visual tasks. State space models such as Mamba model long sequences well in language tasks, but in vision they have not markedly surpassed established architectures such as CNNs and ViTs. The paper argues that the key to closing this gap is the order in which image tokens are scanned into a 1D sequence, and it focuses on optimizing scan directions for sequence modeling.

The authors observe that flattening 2D spatial tokens into a single sequence increases the distance between tokens that are adjacent in the image, which disrupts the local 2D dependencies that image analysis relies on. To counter this, they propose a local scanning strategy that partitions the image into distinct windows and scans within each window, preserving local dependencies while still maintaining global context.
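The effect of the scan order on token distance can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 14 x 14 grid, and the 2 x 2 window are assumptions chosen for illustration, and the code only computes visiting orders, not the selective scan itself.

```python
def row_major_scan(height, width):
    """Flatten an H x W token grid row by row (plain flattening)."""
    return [(r, c) for r in range(height) for c in range(width)]


def local_window_scan(height, width, window):
    """Visit tokens window by window; inside each window, go row by row.

    Assumes height and width are multiples of `window` (no padding handled).
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for wr in range(0, height, window):            # top row of each window
        for wc in range(0, width, window):         # left column of each window
            for r in range(wr, wr + window):       # rows inside the window
                for c in range(wc, wc + window):   # columns inside the window
                    order.append((r, c))
    return order


def sequence_gap(order, a, b):
    """Distance in the 1D sequence between two grid positions."""
    position = {token: i for i, token in enumerate(order)}
    return abs(position[a] - position[b])


# Hypothetical sizes: a 14 x 14 token grid scanned with 2 x 2 windows.
H, W, WIN = 14, 14, 2
flat = row_major_scan(H, W)
local = local_window_scan(H, W, WIN)

# (0, 0) and (1, 0) are vertical neighbours in the image.
gap_flat = sequence_gap(flat, (0, 0), (1, 0))
gap_local = sequence_gap(local, (0, 0), (1, 0))
print(gap_flat, gap_local)  # prints "14 2"
```

In this toy setting the two vertical neighbours are a full row apart (14 steps) under plain flattening but only 2 steps apart under the windowed order. Tokens on opposite sides of a window boundary can still end up far apart, which is one reason a windowed scan is paired with scans that keep global context.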

Methodological Innovations

Local Scans: By dividing images into distinct local windows, the approach ensures proximate processing of tokens from the same semantic areas, thereby enhancing the capture of local dependencies.
Dynamic Scan Selection: The paper introduces a method that searches for the best scan pattern independently for each network layer, based on the observation that different layers may prefer different scan patterns.
Spatial and Channel Attention (SCAttn): To effectively integrate the various scans, an attention module weighs channel and spatial dimensions, thus highlighting relevant features and filtering out redundant information.
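One way to picture per-layer scan selection is as a ranking over candidate scan patterns. The sketch below is a simplified assumption, not the paper's search procedure: it supposes each layer holds one learned score per candidate, turns the scores into softmax weights, and keeps the top-k candidates. The candidate names, the scores, and `k=2` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_scans(layer_logits, candidates, k):
    """Keep the k candidates with the largest softmax weight for one layer."""
    weights = softmax(layer_logits)
    ranked = sorted(zip(candidates, weights), key=lambda p: p[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical candidate scan patterns (illustrative names only).
CANDIDATES = ["horizontal", "vertical", "local_2x2", "local_7x7"]

# Hypothetical learned scores, one list per layer (made-up values).
LOGITS = {
    "layer_0": [0.1, 0.3, 1.2, 0.9],
    "layer_1": [1.5, 0.2, 0.1, 1.1],
}

chosen = {
    layer: select_scans(scores, CANDIDATES, k=2)
    for layer, scores in LOGITS.items()
}
print(chosen)
# {'layer_0': ['local_2x2', 'local_7x7'], 'layer_1': ['horizontal', 'local_7x7']}
```

With these made-up scores the two layers keep different pattern sets, which is the behaviour the per-layer search is meant to allow. The outputs of the selected scans would then be merged, a role the summary assigns to the SCAttn module.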

Experimental Validation

Experiments across multiple tasks support the approach. In image classification, the proposed models surpass traditional CNNs and ViTs and improve on the Vim and VMamba baselines. For example, LocalVim-T achieves 76.2% accuracy on ImageNet, 3.1 percentage points above Vim-Ti at the same 1.5G FLOPs. Experiments on object detection and semantic segmentation show similar advantages, indicating that the approach transfers beyond classification.

Implications and Future Directions

The advancements in LocalMamba offer both practical and theoretical implications. Practically, this method allows for more efficient and nuanced image interpretation, capitalizing on both local and global context. Theoretically, it opens avenues for further exploration into the dynamics of selective scanning in visual tasks, hinting at potential refinements in state space modeling.

Future research could optimize computational frameworks for SSMs, since current deep learning frameworks do not accelerate SSM computations as efficiently as they do for other architectures. Further work might also scale the proposed methods to more diverse and complex tasks, or make the scanning strategies more adaptable.

In conclusion, the paper improves state space models for vision through two refinements to scanning, windowed local scans and per-layer scan selection, and supports them with experiments on classification, object detection, and semantic segmentation. The LocalMamba framework shows that preserving local 2D dependencies while retaining global context is an effective way to adapt state space models to visual analysis.