Emergent Mind

SPMamba: State-space model is all you need in speech separation

(2404.02063)
Published Apr 2, 2024 in cs.SD, cs.AI, and eess.AS

Abstract

In speech separation, both CNN- and Transformer-based models have demonstrated robust separation capabilities, garnering significant attention within the research community. However, CNN-based methods have limited modelling capability for long-sequence audio, leading to suboptimal separation performance. Conversely, Transformer-based methods are limited in practical applications due to their high computational complexity. Notably, within computer vision, Mamba-based methods have been celebrated for their formidable performance and reduced computational requirements. In this paper, we propose a network architecture for speech separation using a state-space model, namely SPMamba. We adopt the TF-GridNet model as the foundational framework and substitute its Transformer component with a bidirectional Mamba module, aiming to capture a broader range of contextual information. Our experimental results reveal an important role in the performance aspects of Mamba-based models. SPMamba demonstrates superior performance with a significant advantage over existing separation models in a dataset built on Librispeech. Notably, SPMamba achieves a substantial improvement in separation quality, with a 2.42 dB enhancement in SI-SNRi compared to the TF-GridNet. The source code for SPMamba is publicly accessible at https://github.com/JusperLee/SPMamba.

SPMamba model combines TF-GridNet and BMamba, integrating multi-head attention and convolutional layers for time-frequency attention.

Overview

  • SPMamba introduces a novel architecture for speech separation, utilizing State-Space Models for improved quality and efficiency.

  • The model uses Mamba, a selective State-Space Model, for long-range dependency modeling, addressing limitations of CNN-, RNN-, and Transformer-based approaches.

  • SPMamba achieves a 2.42 dB improvement in SI-SNRi over its baseline, TF-GridNet, indicating better speech separation performance.

  • The integration of State-Space Models in SPMamba sets a new benchmark in speech separation technology and suggests potential for broader AI applications.

SPMamba: Advancing Speech Separation with State-Space Models

Introduction to SPMamba

Speech separation is essential for improving audio clarity in environments with overlapping speakers, supporting audio analysis and clearer communication. Recent work has relied on CNN, RNN, and Transformer architectures, each with its own benefits and limitations for processing audio signals. CNN-based models, despite their robustness across auditory tasks, have limited receptive fields, which makes it hard for them to capture the full context of long audio sequences. Conversely, Transformer-based methods excel at modeling long-range dependencies but have high computational demands, which makes them less practical for real-time applications.

State-Space Models (SSMs) have emerged as a promising alternative: they model long-range dependencies in long sequences while keeping a manageable computational footprint. This paper introduces SPMamba, an architecture that brings the State-Space Model approach to speech separation, improving both separation quality and computational efficiency.

Background and Model Design

The Mamba Technique

Mamba, the sequence model on which SPMamba builds, is a general-purpose selective State-Space Model rather than a method designed specifically for speech separation. It combines benefits of CNNs and RNNs while mitigating their respective limitations. Its selection mechanism makes the state-space parameters depend on the input, so the model can dynamically emphasize the relevant parts of a sequence, here the audio signal to be separated. The design is computationally efficient and well suited to the long sequences that arise in speech separation.
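The selective mechanism can be illustrated with a toy recurrence. The sketch below is a simplified, single-channel illustration and not the Mamba implementation: it assumes a diagonal state matrix, a simplified discretization, and scalar projections, and the names `selective_scan`, `w_b`, `w_c`, `w_delta`, and `b_delta` are hypothetical. It shows only the core idea that the step size and the input/output projections are computed from the current input.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)); always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, a, w_b, w_c, w_delta, b_delta):
    """Toy selective SSM over a scalar sequence (illustrative assumption).

    x        : list of floats, the input sequence
    a        : list of negative floats, diagonal of the state matrix
    w_b, w_c : per-state weights that make B_t and C_t depend on x_t
    w_delta, b_delta : scalars that make the step size depend on x_t
    """
    h = [0.0] * len(a)  # hidden state, one entry per state dimension
    y = []
    for x_t in x:
        # Input-dependent step size: this is the "selective" part.
        delta = softplus(w_delta * x_t + b_delta)
        for n in range(len(a)):
            a_bar = math.exp(delta * a[n])  # decay factor in (0, 1)
            b_t = w_b[n] * x_t              # input-dependent B
            h[n] = a_bar * h[n] + delta * b_t * x_t
        # Input-dependent readout C_t applied to the updated state.
        y.append(sum(w_c[n] * x_t * h[n] for n in range(len(a))))
    return y


if __name__ == "__main__":
    out = selective_scan([1.0, 0.5, -0.3, 2.0],
                         a=[-1.0, -0.5], w_b=[0.3, 0.7], w_c=[1.0, -0.2],
                         w_delta=0.5, b_delta=0.0)
    print(out)
```

Because the recurrence visits each time step once, its cost grows linearly with sequence length, in contrast to the quadratic cost of full self-attention.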

SPMamba Architecture

SPMamba builds on the TF-GridNet model and replaces its Transformer component with a bidirectional Mamba module. This change improves the model's ability to capture a broader range of contextual information within audio sequences and addresses constraints that CNN and RNN methods face in speech separation. The architecture has two main features:

  • A bidirectional Mamba layer as the core, enabling effective modeling of both forward and backward sequences in non-causal speech separation tasks.
  • Integration within the TF-GridNet framework, leveraging its strengths in handling time-frequency dimensions while improving efficiency through the Mamba module.
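The bidirectional idea in the list above can be sketched independently of Mamba itself. In the following minimal example, a causal exponential moving average stands in for a Mamba layer, and the two directions are combined by pairing their outputs at each time step. Both choices, along with the names `causal_ema` and `bidirectional`, are assumptions made for illustration and are not taken from the SPMamba code.

```python
def causal_ema(x, alpha=0.5):
    """Stand-in causal sequence model: each output sees only past inputs."""
    out, state = [], 0.0
    for v in x:
        state = alpha * state + (1.0 - alpha) * v
        out.append(state)
    return out


def bidirectional(x, forward_fn, backward_fn):
    """Run one causal model left-to-right and another right-to-left.

    The backward branch processes the reversed sequence and its output is
    reversed again so that both branches are aligned in time. The two
    outputs are paired per time step (a concatenation along features).
    """
    fwd = forward_fn(x)
    bwd = backward_fn(x[::-1])[::-1]
    return list(zip(fwd, bwd))


if __name__ == "__main__":
    # An impulse at the last step: only the backward branch can "see" it
    # from earlier positions, which is what non-causal separation exploits.
    for pair in bidirectional([0.0, 0.0, 0.0, 1.0], causal_ema, causal_ema):
        print(pair)
```

With an impulse at the final step, the forward outputs at earlier positions stay at zero while the backward outputs are non-zero, which shows how the second branch supplies future context.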

Empirical Evaluation

SPMamba was evaluated on a challenging dataset, built on Librispeech, that includes noise and reverberation. The model outperformed existing speech separation models across several metrics. Notably, SPMamba achieved a 2.42 dB improvement in SI-SNRi over its baseline, TF-GridNet, while also reducing the number of parameters and the overall computational footprint. These results indicate that the model delivers higher-quality speech separation with improved efficiency.
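For readers unfamiliar with the metric, SI-SNRi is the gain in scale-invariant signal-to-noise ratio of the separated estimate relative to the unprocessed mixture. The sketch below follows the commonly used definition (zero-mean signals, projection of the estimate onto the reference). The function names and the `eps` stabilizer are illustrative choices, and the paper's exact evaluation code may differ.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [v - mean_e for v in estimate]   # remove DC offset
    r = [v - mean_r for v in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    interferer = [1.0, 1.0, -1.0, -1.0]  # orthogonal to the reference
    mixture = [s + i for s, i in zip(reference, interferer)]
    estimate = [s + 0.1 * i for s, i in zip(reference, interferer)]
    print(si_snr_improvement(estimate, mixture, reference))
```

In this toy case the interferer is attenuated by a factor of 10 in amplitude, so the improvement comes out at roughly 20 dB. The 2.42 dB figure reported for SPMamba is the difference between its SI-SNRi and that of TF-GridNet.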

Conclusion and Future Directions

SPMamba sets a new benchmark in speech separation by integrating State-Space Models into an established time-frequency architecture. Its performance and efficiency address current challenges in speech separation and open avenues for future research. The results also suggest potential for other audio processing tasks and invite the research community to explore SSMs in other domains of AI.
