
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection

(2403.19888)
Published Mar 29, 2024 in cs.LG, cs.AI, and cs.CV

Abstract

Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space complexity in the input size, limiting their scalability for long-sequence modeling. Despite recent attempts to design efficient and effective architecture backbones for multi-dimensional data, such as images and multivariate time series, existing models are either data-independent or fail to allow inter- and intra-dimension communication. Recently, State Space Models (SSMs), and more specifically Selective State Space Models with efficient hardware-aware implementations, have shown promising potential for long-sequence modeling. Motivated by the success of SSMs, we present MambaMixer, a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels, called the Selective Token and Channel Mixer. MambaMixer connects its selective mixers using a weighted averaging mechanism, allowing layers to have direct access to early features. As a proof of concept, we design the Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block and explore their performance in various vision and time series forecasting tasks. Our results underline the importance of selective mixing across both tokens and channels. In ImageNet classification, object detection, and semantic segmentation tasks, ViM2 achieves competitive performance with well-established vision models and outperforms SSM-based vision models. In time series forecasting, TSM2 achieves outstanding performance compared to state-of-the-art methods while demonstrating significantly reduced computational cost. These results show that while Transformers, cross-channel attention, and MLPs are sufficient for good performance in time series forecasting, none of them is necessary.

Figure: Design of the MambaMixer block used in ViM2, featuring depth-wise convolution.

Overview

  • MambaMixer introduces a novel architecture in State Space Models (SSMs) featuring dual selection mechanisms for efficient modeling of long sequences.

  • The architecture is applied to create Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2), enhancing performance in vision and time series forecasting tasks.

  • ViM2 achieves competitive results with well-established vision backbones and outperforms prior SSM-based vision models, while TSM2 outperforms state-of-the-art time series forecasting methods at substantially lower computational cost.

  • The MambaMixer's dual selection mechanism facilitates efficient information mixing across tokens and channels, showing promise for future AI developments.

MambaMixer: Introducing Efficient Selectivity in State Space Models for Multidimensional Data

Introduction

Recent developments in State Space Models (SSMs) and their structured counterparts have ushered in a new era of sequence modeling, challenging the hegemony of attention-based architectures, notably Transformers. SSMs, by virtue of their linear time complexity, offer a promising avenue for efficient and scalable modeling of long sequences. The introduction of Selective State Space Models (S6), which incorporate data-dependent weights, has further enhanced their applicability, enabling selective focus on relevant context. Building on this advancement, MambaMixer emerges as a novel architecture that incorporates dual selection mechanisms across both tokens and channels, marking a significant stride in the evolution of SSMs. This paper, authored by Behrouz et al., elaborates on the MambaMixer block and demonstrates its application through Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) for tackling vision and time series forecasting tasks, respectively.
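For background, the sketch below illustrates in plain PyTorch what makes an S6-style layer "selective": the matrices B_t, C_t and the step size Delta_t are projected from the current input, so the recurrence weights are data-dependent. The class name, projections, and initialization are illustrative assumptions on our part; the actual S6 layer replaces this explicit loop with a hardware-aware parallel scan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Illustrative S6-style recurrence (not the paper's implementation).
    B_t, C_t, and the step size Delta_t are projected from the input; this
    'selection' makes the state transition data-dependent."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        # Simple diagonal state matrix; real models use a structured initialization.
        self.A_log = nn.Parameter(torch.log(torch.arange(1, state + 1).float()).repeat(dim, 1))
        self.B_proj = nn.Linear(dim, state)
        self.C_proj = nn.Linear(dim, state)
        self.dt_proj = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (batch, length, dim)
        A = -torch.exp(self.A_log)                         # (dim, state), kept negative for stability
        h = x.new_zeros(x.size(0), x.size(2), A.size(1))   # hidden state: (batch, dim, state)
        outputs = []
        for t in range(x.size(1)):                         # sequential scan, for exposition only
            xt = x[:, t]                                   # (batch, dim)
            dt = F.softplus(self.dt_proj(xt))              # data-dependent step size Delta_t
            Bt, Ct = self.B_proj(xt), self.C_proj(xt)      # data-dependent B_t, C_t
            A_bar = torch.exp(dt.unsqueeze(-1) * A)        # discretized transition exp(Delta_t * A)
            h = A_bar * h + dt.unsqueeze(-1) * Bt.unsqueeze(1) * xt.unsqueeze(-1)
            outputs.append((h * Ct.unsqueeze(1)).sum(-1))  # y_t = C_t h_t, shape (batch, dim)
        return torch.stack(outputs, dim=1)                 # (batch, length, dim)
```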

MambaMixer Architecture

The MambaMixer architecture introduces a Selective Token and Channel Mixer, designed to selectively mix and fuse information across both tokens and channels in a data-dependent manner. This dual selection mechanism, a defining feature of MambaMixer, is complemented by a weighted averaging mechanism that lets each layer directly access the inputs and outputs of earlier layers. This setup enhances information flow not only between the selective mixers but also across layers, facilitating the construction of large-scale, stable networks.
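One simple way to realize such a weighted averaging connection is sketched below, assuming one learnable, softmax-normalized weight per earlier feature map; the module name and the normalization choice are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class WeightedSkip(nn.Module):
    """Hypothetical weighted-averaging connection: combines the current features
    with the outputs of all earlier blocks so that deep layers keep direct
    access to early features."""
    def __init__(self, num_earlier: int):
        super().__init__()
        # One learnable scalar per earlier feature map, plus one for the current features.
        self.alpha = nn.Parameter(torch.ones(num_earlier + 1))

    def forward(self, current, earlier):
        # `earlier` is a list of `num_earlier` tensors with the same shape as `current`.
        feats = torch.stack(earlier + [current], dim=0)         # (num_earlier + 1, ...)
        weights = torch.softmax(self.alpha, dim=0)              # normalized mixing weights
        weights = weights.view(-1, *([1] * (feats.dim() - 1)))  # broadcast over feature dims
        return (weights * feats).sum(dim=0)
```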

The MambaMixer block sequentially employs Selective Token Mixer and Selective Channel Mixer blocks, each complemented by bidirectional S6 blocks. The inclusion of direct access to earlier features through a weighted averaging mechanism allows MambaMixer-based models to benefit from large numbers of layers while maintaining stability during training.
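A minimal sketch of one MambaMixer block, reusing the illustrative SelectiveSSM above, could then look as follows; the bidirectional wrapper, normalization placement, and the transpose-based channel mixing are expository assumptions rather than the reference implementation (which, per the figure above, also includes components such as depth-wise convolution, omitted here).

```python
import torch
import torch.nn as nn

class BiSelectiveSSM(nn.Module):
    """Bidirectional wrapper around the illustrative SelectiveSSM defined above:
    scan the sequence in both directions and merge the two passes."""
    def __init__(self, dim: int):
        super().__init__()
        self.fwd, self.bwd = SelectiveSSM(dim), SelectiveSSM(dim)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x):                                    # x: (batch, length, dim)
        backward = torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        return self.merge(torch.cat([self.fwd(x), backward], dim=-1))

class MambaMixerBlock(nn.Module):
    """Sketch of one MambaMixer block: a Selective Token Mixer over the token
    axis followed by a Selective Channel Mixer over the channel axis."""
    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.token_norm = nn.LayerNorm(dim)
        self.token_mixer = BiSelectiveSSM(dim)                # mixes information across tokens
        self.channel_norm = nn.LayerNorm(num_tokens)
        self.channel_mixer = BiSelectiveSSM(num_tokens)       # mixes information across channels

    def forward(self, x):                                     # x: (batch, num_tokens, dim)
        x = x + self.token_mixer(self.token_norm(x))          # residual around token mixing
        y = x.transpose(1, 2)                                 # channels become the sequence axis
        y = y + self.channel_mixer(self.channel_norm(y))      # residual around channel mixing
        return y.transpose(1, 2)                              # back to (batch, num_tokens, dim)
```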

Application to Vision and Time Series Forecasting

The MambaMixer block gives rise to two distinct architectures: Vision MambaMixer (ViM2) for vision tasks and Time Series MambaMixer (TSM2) for time series forecasting. ViM2 leverages the MambaMixer block to perform selective mixing across tokens and channels in image data, outperforming existing SSM-based vision models and achieving competitive performance with established vision models such as ViT and MLP-Mixer. TSM2, in turn, demonstrates superior performance in time series forecasting, outperforming state-of-the-art methods while requiring significantly less computation.
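As a rough usage illustration of the sketch above in the time series setting, one would treat time steps (or temporal patches) as tokens and variates as channels; the sizes below are arbitrary example values, not the paper's configuration. For ViM2, the analogous choice is image patches as tokens and embedding dimensions as channels.

```python
import torch

# Hypothetical TSM2-style usage of the MambaMixerBlock sketch: time steps act
# as tokens and variates act as channels.
batch, lookback, num_variates = 8, 96, 7                      # example sizes only
series = torch.randn(batch, lookback, num_variates)
block = MambaMixerBlock(num_tokens=lookback, dim=num_variates)
mixed = block(series)                                         # (8, 96, 7): mixed over time and variates
```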

Evaluation and Results

ViM2 and TSM2 were evaluated across various vision and time series forecasting tasks, respectively. ViM2 achieved strong results in ImageNet classification, object detection, and semantic segmentation, surpassing several well-established models. TSM2 outperformed state-of-the-art forecasting methods across a range of benchmark datasets.

Implications and Future Directions

The introduction of MambaMixer represents a significant advancement in the field of SSMs, offering a versatile architecture that can be adapted to various domains and tasks. The dual selection mechanism allows for efficient and effective selection and mixing of information across both tokens and channels, a capability that proves particularly beneficial for multi-dimensional data like images and multivariate time series. The performance of ViM2 and TSM2 illustrates the potential of MambaMixer-based models to challenge existing paradigms and set new standards for future developments in AI and deep learning.

Looking ahead, the MambaMixer architecture opens new avenues for exploring the possibilities of selective mixing in other domains, potentially leading to further innovations in AI models that are both efficient and effective. Beyond immediate practical applications, the principles underlying MambaMixer may inspire novel approaches to modeling complex data structures, further enriching the landscape of deep learning research.
