Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation

Published 5 Apr 2024 in cs.CV | (2404.04256v2)

Abstract: Multi-modal semantic segmentation significantly enhances AI agents' perception and scene understanding, especially under adverse conditions like low-light or overexposed environments. Leveraging additional modalities (X-modality) like thermal and depth alongside traditional RGB provides complementary information, enabling more robust and reliable prediction. In this work, we introduce Sigma, a Siamese Mamba network for multi-modal semantic segmentation utilizing the advanced Mamba. Unlike conventional methods that rely on CNNs, with their limited local receptive fields, or Vision Transformers (ViTs), which offer global receptive fields at the cost of quadratic complexity, our model achieves global receptive fields with linear complexity. By employing a Siamese encoder and innovating a Mamba-based fusion mechanism, we effectively select essential information from different modalities. A decoder is then developed to enhance the channel-wise modeling ability of the model. Our proposed method is rigorously evaluated on both RGB-Thermal and RGB-Depth semantic segmentation tasks, demonstrating its superiority and marking the first successful application of State Space Models (SSMs) in multi-modal perception tasks. Code is available at https://github.com/zifuwan/Sigma.

Summary

The paper's main contribution is the integration of a Siamese encoder with a Mamba fusion mechanism, achieving global receptive fields with linear complexity.
Extensive evaluations on RGB-Thermal and RGB-Depth datasets demonstrate Sigma's superior accuracy and efficiency compared to conventional CNN and ViT models.
Innovative components like Selective Scan Modules and a channel-aware Mamba decoder enable effective cross-modal feature integration, advancing multi-modal scene understanding.

The paper introduces Sigma, an approach to multi-modal semantic segmentation that uses Mamba, a selective structured state space model, within a Siamese network architecture. It targets segmentation under adverse conditions, such as low-light or overexposed scenes, by combining complementary modalities such as thermal and depth with traditional RGB data.

Technical Contributions

Sigma departs from conventional CNN and ViT designs. CNNs have limited local receptive fields, and ViTs obtain global receptive fields only at quadratic complexity, whereas Sigma achieves global receptive fields with linear complexity. A Siamese encoder paired with a Mamba-based fusion mechanism selects and integrates essential features from the different modalities. Experiments on RGB-Thermal and RGB-Depth tasks show stronger performance than prior methods and, according to the authors, mark the first successful application of State Space Models to multi-modal perception.
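To make the linear-versus-quadratic contrast concrete, the sketch below counts how the dominant work grows with feature-map size. It is a back-of-the-envelope illustration, not the paper's cost model: the function name `token_interaction_costs` and the state size `state_dim=16` are assumptions chosen for the example, and constant factors are ignored.

```python
def token_interaction_costs(height, width, state_dim=16):
    """Return (attention_pairs, scan_updates) for an H x W feature map.

    state_dim is an illustrative assumption, not a value taken from the paper.
    """
    num_tokens = height * width
    # Self-attention compares every token with every token: quadratic in L.
    attention_pairs = num_tokens * num_tokens
    # A state-space scan performs one state update per token and state channel: linear in L.
    scan_updates = num_tokens * state_dim
    return attention_pairs, scan_updates


if __name__ == "__main__":
    for side in (32, 64, 128):
        pairs, updates = token_interaction_costs(side, side)
        print(f"{side}x{side}: attention pairs={pairs:,}  scan updates={updates:,}")
```

Doubling each side of the feature map quadruples the token count, so the scan work grows 4x while the attention work grows 16x.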

Evaluation and Results

Sigma was benchmarked against state-of-the-art models on MFNet and PST900 (RGB-Thermal) and on NYU Depth V2 and SUN RGB-D (RGB-Depth). The method consistently outperformed existing frameworks in both accuracy and computational efficiency. A notable advantage of Sigma is that it can process the concatenated token sequences of both modalities, preserving information from each. Transformer-based methods, by contrast, often consolidate token sequences and can lose potentially valuable data.
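The difference between concatenating and consolidating token sequences can be shown with a toy example. In this sketch each token is a single number, and element-wise averaging stands in for consolidation; both simplifications, and the function names, are assumptions for illustration and do not describe any specific Transformer method or Sigma's actual fusion block.

```python
def concat_tokens(rgb_tokens, x_tokens):
    """Keep every token from both modalities; the sequence length becomes 2L."""
    return list(rgb_tokens) + list(x_tokens)


def average_tokens(rgb_tokens, x_tokens):
    """Consolidate by element-wise averaging; the length stays L."""
    if len(rgb_tokens) != len(x_tokens):
        raise ValueError("both modalities must have the same number of tokens")
    return [(r + x) / 2 for r, x in zip(rgb_tokens, x_tokens)]


if __name__ == "__main__":
    rgb = [0.2, 0.9]      # hypothetical RGB token values
    thermal = [0.8, 0.1]  # hypothetical thermal token values
    print(concat_tokens(rgb, thermal))   # four values, each modality still recoverable
    print(average_tokens(rgb, thermal))  # two values, modality-specific detail merged away
```

In this toy case the two positions disagree strongly across modalities, yet averaging maps both to the same value, while the concatenated sequence keeps the disagreement visible to later layers. A longer sequence is affordable only when the cost of processing it grows linearly with length.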

Architectural Insights

The core component is the selective scan module, whose parameters depend on the input, so the model can decide per token what to retain and what to ignore. Sigma combines Cross Mamba and Concat Mamba blocks for cross-modal interaction and feature integration. A channel-aware Mamba decoder then extracts spatial and channel-specific information to refine the segmentation output.
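The input-dependent behavior of a selective scan can be sketched with a scalar recurrence. This is a deliberately simplified illustration, not Sigma's implementation: the state is a single number, and the weights `w_delta`, `w_b`, `w_c` and the transition value `a` are hypothetical parameters introduced for the example. What it preserves is the general shape of a selective state space update, in which the step size and the input and output projections are computed from the current token.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z)); keeps the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Scalar selective scan over a sequence in a single O(len(xs)) pass.

    All weights are illustrative assumptions; a must be negative so the state decays.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        b = w_b * x                    # input-dependent input projection
        c = w_c * x                    # input-dependent output projection
        a_bar = math.exp(delta * a)    # discretized transition, between 0 and 1
        h = a_bar * h + delta * b * x  # state carries information from all earlier tokens
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    print(selective_scan([1.0, 0.5, 2.0]))
    print(selective_scan([3.0, 0.5, 2.0]))  # changing the first token changes later outputs
```

Because the state `h` is carried through the whole pass, every output can depend on all earlier tokens, which is how a scan reaches a global receptive field along the scan direction with one linear pass.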

Broader Implications and Future Directions

Sigma's implications extend to domains where robust scene understanding is essential, such as autonomous vehicles and augmented reality. The demonstrated efficacy of State Space Models in this framework opens avenues for further exploration, particularly in tasks that combine a larger number of modalities.

Future work could apply Mamba to other complex tasks, given its largely unexplored capacity for modeling extremely long sequences. The Mamba encoder also remains resource-intensive, so reducing its cost is a prerequisite for deployment on edge devices. Finally, evaluating Sigma on datasets with more diverse sensory inputs, such as LiDAR, would help push the boundaries of multi-modal scene understanding.
