SPMamba: State-space model is all you need in speech separation (2404.02063v2)

Published 2 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Existing CNN-based speech separation models face local receptive field limitations and cannot effectively capture long time dependencies. Although LSTM and Transformer-based speech separation models can avoid this problem, their high complexity makes them face the challenge of computational resources and inference efficiency when dealing with long audio. To address this challenge, we introduce an innovative speech separation method called SPMamba. This model builds upon the robust TF-GridNet architecture, replacing its traditional BLSTM modules with bidirectional Mamba modules. These modules effectively model the spatiotemporal relationships between the time and frequency dimensions, allowing SPMamba to capture long-range dependencies with linear computational complexity. Specifically, the bidirectional processing within the Mamba modules enables the model to utilize both past and future contextual information, thereby enhancing separation performance. Extensive experiments conducted on public datasets, including WSJ0-2Mix, WHAM!, and Libri2Mix, as well as the newly constructed Echo2Mix dataset, demonstrated that SPMamba significantly outperformed existing state-of-the-art models, achieving superior results while also reducing computational complexity. These findings highlighted the effectiveness of SPMamba in tackling the intricate challenges of speech separation in complex environments.


Summary

The paper introduces SPMamba, a novel architecture that integrates state-space models to enhance speech separation with superior efficiency.
It replaces TF-GridNet's BLSTM modules with bidirectional Mamba modules, capturing long-range audio dependencies with linear computational complexity.
Empirical results show a 2.42 dB SI-SNRi improvement over TF-GridNet, with fewer parameters and lower computational cost.

SPMamba: Advancing Speech Separation with State-Space Models

Introduction to SPMamba

Speech separation is essential for improving audio clarity when speakers overlap, supporting both audio analysis and clearer communication. Recent work has relied on CNN, RNN, and Transformer architectures, each with its own benefits and limitations. CNN-based models, despite their robustness across auditory tasks, have limited receptive fields that hinder capturing the full context of long audio sequences. At the other extreme, LSTM- and Transformer-based methods model long-range dependencies well but carry high computational costs, which makes them less practical for long audio and real-time applications.

State-Space Models (SSMs) have emerged as a promising alternative: they model long-range dependencies over long sequences at a manageable computational cost. This paper introduces SPMamba, an architecture that integrates the State-Space Model approach into speech separation to improve both separation quality and computational efficiency.

Background and Model Design

The Mamba Technique

Mamba, the sequence model on which SPMamba builds, is a selective State-Space Model that combines benefits of CNNs and RNNs while mitigating their respective limitations. Its selective mechanism makes the model's processing depend on the input, so it can dynamically focus on the parts of the audio signal that matter for separation. The resulting design is computationally efficient and well suited to the long sequences encountered in speech separation.
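To make the selective mechanism concrete, here is a minimal sketch of a single-channel selective state-space recurrence in plain Python. It is an illustration under simplifying assumptions, not the Mamba or SPMamba implementation: the state matrix is diagonal, the step size and the input/output projections are simple scalar functions of the current input, and all parameter names (`a`, `w_delta`, `b_delta`, `w_b`, `w_c`) are hypothetical.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size strictly positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, a, w_delta, b_delta, w_b, w_c):
    """Run a toy selective SSM over the sequence x.

    a        : diagonal state coefficients (negative values give a decaying state)
    w_delta, b_delta : hypothetical scalars producing the input-dependent step size
    w_b, w_c : hypothetical per-state weights producing input-dependent B_t and C_t
    The loop visits each time step once, so cost grows linearly with len(x).
    """
    n = len(a)
    h = [0.0] * n  # hidden state, carried across time steps
    ys = []
    for x_t in x:
        # "Selective" part: step size, B_t and C_t all depend on the input x_t.
        delta = softplus(w_delta * x_t + b_delta)
        y_t = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a[i])   # discretized state transition
            b_bar = delta * (w_b[i] * x_t)   # discretized input projection
            h[i] = a_bar * h[i] + b_bar * x_t
            y_t += (w_c[i] * x_t) * h[i]     # input-dependent readout
        ys.append(y_t)
    return ys


if __name__ == "__main__":
    out = selective_scan(
        [1.0, 2.0, 3.0],
        a=[-1.0, -0.5], w_delta=1.0, b_delta=0.0,
        w_b=[1.0, 0.5], w_c=[1.0, 1.0],
    )
    print(out)
```

Because the recurrence only looks backward, each output depends on the current and earlier inputs. That is why a non-causal task such as offline speech separation benefits from also running the scan in the reverse direction.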

SPMamba Architecture

SPMamba builds on the TF-GridNet model and replaces its BLSTM modules with bidirectional Mamba modules. This modification lets the model capture a broader range of contextual information within audio sequences, addressing the constraints that CNN and RNN methods face in speech separation. The architecture features:

A bidirectional Mamba layer as the core, enabling effective modeling of both forward and backward sequences in non-causal speech separation tasks.
Integration within the TF-GridNet framework, leveraging its strengths in handling time-frequency dimensions while improving efficiency through the Mamba module.
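The two points above can be sketched in a few lines of plain Python. The sketch is only an assumption-laden illustration: a cumulative sum stands in for a causal sequence model such as a Mamba scan, the forward and backward outputs are simply added (how SPMamba actually merges them is not stated here), and normalization, projections, and residual connections are omitted. The function names `bidirectional` and `scan_grid` are hypothetical.

```python
def cumsum(seq):
    # Stand-in for a causal sequence model: output t depends on inputs 0..t only.
    total, out = 0.0, []
    for v in seq:
        total += v
        out.append(total)
    return out


def bidirectional(fwd_fn, bwd_fn, seq):
    """Combine a forward pass with a pass over the time-reversed sequence."""
    fwd = fwd_fn(seq)
    bwd = bwd_fn(seq[::-1])[::-1]  # run on reversed input, then restore the order
    # Assumption: the two directions are merged by addition.
    return [f + b for f, b in zip(fwd, bwd)]


def scan_grid(grid, seq_fn):
    """Apply seq_fn along frequency within each frame, then along time per bin.

    grid[t][f] holds one value per time frame t and frequency bin f.
    """
    freq_pass = [seq_fn(frame) for frame in grid]
    n_freq = len(freq_pass[0])
    time_pass = [seq_fn([frame[f] for frame in freq_pass]) for f in range(n_freq)]
    # Transpose back to the grid[t][f] layout.
    return [[time_pass[f][t] for f in range(n_freq)] for t in range(len(grid))]


if __name__ == "__main__":
    layer = lambda seq: bidirectional(cumsum, cumsum, seq)
    print(scan_grid([[1, 2], [3, 4]], layer))
```

With the cumulative-sum stand-in, the 2x2 grid `[[1, 2], [3, 4]]` becomes `[[18.0, 21.0], [24.0, 27.0]]`: every output cell depends on every input cell, which illustrates how bidirectional passes over both axes spread context across the whole time-frequency grid.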

Empirical Evaluation

SPMamba was evaluated on a challenging dataset containing noise and reverberation, where it outperformed existing speech separation models across several metrics. Notably, it achieved a 2.42 dB improvement in SI-SNRi over its baseline, TF-GridNet, with fewer parameters and a lower computational cost. The abstract additionally reports experiments on WSJ0-2Mix, WHAM!, Libri2Mix, and the newly constructed Echo2Mix dataset.
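For readers unfamiliar with the metric, the following sketch computes SI-SNR (scale-invariant signal-to-noise ratio) and its improvement, SI-SNRi, using the standard definition. It is a plain-Python illustration on toy lists, not the paper's evaluation code; the function names and the `eps` stabilizer are assumptions.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [v - mean_e for v in estimate]    # zero-mean estimate
    r = [v - mean_r for v in reference]   # zero-mean reference
    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r) + eps
    scale = dot / ref_energy
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


if __name__ == "__main__":
    reference = [1.0, -1.0, 1.0, -1.0]
    mixture = [1.5, -0.5, 0.5, -1.5]    # reference plus interference
    estimate = [1.1, -0.9, 0.8, -1.2]   # hypothetical separator output
    print(si_snr_improvement(estimate, mixture, reference))
```

Read this way, a 2.42 dB SI-SNRi gain over the baseline means the separated output is, on average, 2.42 dB further above the mixture's SI-SNR than the baseline's output is.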

Conclusion and Future Directions

SPMamba sets a new benchmark in speech separation by integrating State-Space Models into an established time-frequency architecture. Its performance and efficiency address current challenges in speech separation and open new avenues for future research. The scalability and adaptability of SPMamba suggest broad potential for other audio processing tasks and invite the research community to explore SSMs in other domains of AI.
