
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

(2406.03344)
Published Jun 5, 2024 in cs.SD, cs.AI, and eess.AS

Abstract

Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision tasks in this regard. In this study, we explore whether reliance on self-attention is necessary for audio classification tasks. By introducing Audio Mamba (AuM), the first self-attention-free, purely SSM-based model for audio classification, we aim to address this question. We evaluate AuM on various audio datasets - comprising six different benchmarks - where it achieves comparable or better performance than the well-established AST model.

Figure: The proposed Audio Mamba (AuM) architecture design.

Overview

  • The paper 'Audio Mamba: Bidirectional State Space Model for Audio Representation Learning' introduces Audio Mamba (AuM), a self-attention-free state space model (SSM) designed for audio classification that offers computational efficiency alongside performance competitive with Transformer-based models.

  • AuM is evaluated across six benchmark audio datasets and demonstrates comparable or superior performance to the Audio Spectrogram Transformer (AST), while also showing significant improvements in computational efficiency due to linear, rather than quadratic, scaling of memory and speed requirements.

  • Empirical evaluations, including ablation studies and analysis of pre-training impact, validate AuM's design, confirming its potential to perform efficiently and effectively without relying on self-attention mechanisms typical of Transformer-based architectures.

Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

The paper "Audio Mamba: Bidirectional State Space Model for Audio Representation Learning" explores a novel approach to audio classification that challenges the dominance of Transformer-based architectures. Specifically, the authors, Mehmet Hamza Erol, Arda Senocak, Jiu Feng, and Joon Son Chung from the Korea Advanced Institute of Science and Technology, introduce Audio Mamba (AuM) as a self-attention-free model based solely on state space models (SSMs).

Context and Motivation

The field of audio classification has witnessed a paradigm shift from convolutional neural networks (CNNs) to Transformer-based architectures. Despite their superior performance, Transformers' reliance on self-attention entails a computational cost that scales quadratically with sequence length, i.e. $\mathcal{O}(n^2)$. This constraint becomes particularly limiting when dealing with longer audio sequences. In contrast, state space models (SSMs) such as Mamba, having shown promise in both language and vision tasks, offer a more efficient alternative whose complexity is linear in sequence length.
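
As a rough illustration of this gap (using a hidden width $d$ and an SSM state size $N$; these symbols are illustrative notation, not the paper's own accounting), the per-layer costs compare as:

```latex
% Illustrative per-layer cost comparison (assumed notation, not from the paper):
% n = sequence length, d = hidden width, N = SSM state size
\[
\underbrace{\mathcal{O}(n^{2}\,d)}_{\text{self-attention}}
\quad\text{vs.}\quad
\underbrace{\mathcal{O}(n\,d\,N)}_{\text{selective SSM (Mamba)}}
\]
```

Since $N$ is a small constant, doubling the number of tokens doubles the SSM cost but quadruples the attention cost.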

Overview of Audio Mamba

Audio Mamba (AuM) is the first SSM-based model for audio classification built without any self-attention mechanism. The primary research question addressed is whether self-attention is indispensable for high-performing audio classification models. AuM is evaluated across six benchmark audio datasets, including AudioSet, VGGSound, VoxCeleb, Speech Commands V2, and Epic-Sounds, demonstrating comparable or superior performance to the Audio Spectrogram Transformer (AST). AuM achieves this whilst offering significant computational efficiency: Transformer memory and speed requirements scale quadratically with sequence length, in stark contrast to AuM's linear scaling.

Design and Methodology

The architecture of AuM involves several critical components:

  1. Input Representation: The input audio is converted into a spectrogram partitioned into non-overlapping square patches. Each patch is flattened and linearly projected into a higher-dimensional space to form embedding tokens.
  2. Classification Token: A learnable classification token is inserted in the middle of the sequence of patch embeddings.
  3. Bidirectional State Space Modules: The sequence of tokens, including the classification token, is input to the Audio Mamba encoder. The encoder consists of bidirectional SSM modules that process the sequence in both forward and backward directions, allowing for effective global context modeling.
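
A minimal sketch of this pipeline is shown below in PyTorch. It is a simplified illustration under stated assumptions, not the authors' implementation: `PatchEmbed`, `BidirectionalBlock`, and `AuMSketch` are hypothetical names, and the bidirectional SSM block is stood in for by a pair of GRUs run forward and backward, since the actual selective-scan Mamba kernel is omitted here.

```python
# Minimal sketch of the AuM input pipeline described above (assumptions:
# 16x16 patches, embedding width 192, toy recurrent stand-in for the SSM).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split a spectrogram into non-overlapping square patches and project them."""
    def __init__(self, patch_size=16, in_chans=1, embed_dim=192):
        super().__init__()
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 1, freq, time)
        x = self.proj(x)                       # (B, D, F', T')
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

class BidirectionalBlock(nn.Module):
    """Placeholder for a bidirectional SSM (Mamba) block: the real model runs a
    selective SSM over the sequence forward and backward and fuses the results."""
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # stand-in for forward SSM
        self.bwd = nn.GRU(dim, dim, batch_first=True)   # stand-in for backward SSM
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        f, _ = self.fwd(x)
        b, _ = self.bwd(torch.flip(x, dims=[1]))
        return self.norm(x + f + torch.flip(b, dims=[1]))

class AuMSketch(nn.Module):
    def __init__(self, num_classes=527, embed_dim=192, depth=4):
        super().__init__()
        self.patch_embed = PatchEmbed(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.blocks = nn.ModuleList(BidirectionalBlock(embed_dim) for _ in range(depth))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, spec):                   # spec: (B, 1, freq, time)
        tokens = self.patch_embed(spec)        # (B, N, D)
        mid = tokens.shape[1] // 2
        cls = self.cls_token.expand(tokens.shape[0], -1, -1)
        # Insert the learnable classification token in the middle of the sequence.
        x = torch.cat([tokens[:, :mid], cls, tokens[:, mid:]], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, mid])            # read out at the class-token position

logits = AuMSketch()(torch.randn(2, 1, 128, 1024))  # e.g. 128 mel bins x 1024 frames
print(logits.shape)                                  # torch.Size([2, 527])
```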

Empirical Evaluation

Comprehensive experiments were conducted to evaluate the efficacy of AuM. The key findings from these experiments can be summarized as follows:

  • Performance: AuM delivers competitive or superior performance compared to AST across multiple datasets. For instance, on AudioSet, AuM achieves a mean average precision (mAP) of 32.43 versus AST's 29.10, an absolute improvement of 3.33 mAP.
  • Efficiency: Empirical evaluations highlight significant improvements in computational efficiency. AuM's memory usage and inference time scale linearly with sequence length, which contrasts with the quadratic scaling observed in AST.
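
The following back-of-the-envelope sketch shows why this matters for longer recordings; the settings (128 mel bins, 16x16 patches, hidden width 768, SSM state size 16) are assumptions for illustration rather than the paper's measured configuration.

```python
# Back-of-the-envelope illustration of how per-layer cost grows with audio
# length (assumed settings, not the paper's measured configuration).
def num_patches(frames, mel_bins=128, patch=16):
    """Number of square patches obtained from a mel spectrogram."""
    return (mel_bins // patch) * (frames // patch)

d, state = 768, 16
for frames in (1024, 2048, 4096):                 # roughly 10 s, 20 s, 40 s of audio
    n = num_patches(frames)
    attn_cost = n * n * d                         # self-attention: quadratic in n
    ssm_cost = n * d * state                      # SSM recurrence: linear in n
    print(f"frames={frames:5d}  tokens={n:5d}  attn~{attn_cost:.2e}  ssm~{ssm_cost:.2e}")
```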

Ablation Studies

Design choices were meticulously ablated to determine their impact on model performance:

  • Bidirectional SSM vs. Unidirectional SSM: Bidirectional SSM modules showed superior performance, particularly in handling the middle-positioned classification token.
  • Position of Class Token: The optimal position for the classification token was found to be in the middle of the input sequence, reflecting a balance between forward and backward context.

Pre-Training Impact

Various scenarios were analyzed to understand the impact of pre-training:

  • Out-of-Domain Pre-Training: When initialized with ImageNet-pretrained weights, AuM-S (small version of AuM) performed similarly to AST-S, indicating potential for further performance gains with appropriate pre-trained weights.
  • Audio-Only Pre-Training: Initializing models with weights from audio-only pre-training datasets, such as AudioSet, improved performance for both AuM and AST, with AuM generally outperforming AST except in the VoxCeleb dataset.

Conclusion and Future Directions

The introduction of Audio Mamba (AuM) signifies a potential shift toward more computationally efficient audio classification models, free from the quadratic cost of self-attention. AuM demonstrates that competitive performance is achievable with SSMs, presenting a compelling alternative to Transformer-based architectures. This work opens avenues for further research into SSM-based models in various domains, including, but not limited to, self-supervised learning and multimodal learning.

Further developments might include leveraging more extensive and diverse pre-training datasets, enhancing bidirectional SSM architectures, and exploring integration with other modalities such as visual and textual data.

Overall, Audio Mamba's architecture and empirical validation set the stage for more efficient and scalable solutions in audio representation learning, warranting continued exploration and advancement.
