Universal Sound Separation with Self-Supervised Audio Masked Autoencoder (2407.11745v2)

Published 16 Jul 2024 in eess.AS, cs.AI, and cs.SD

Abstract: Universal sound separation (USS) is a task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner, using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with the short-time Fourier transform (STFT) to serve as input features for the separation model. We evaluate our methods on the AudioSet dataset, and the experimental results indicate that the proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.

Citations (1)

Summary

  • The paper introduces a self-supervised audio masked autoencoder integrated with ResUNet and STFT features to achieve robust universal sound separation.
  • The methodology leverages concatenated SSL embeddings and FiLM-based latent source embeddings to guide the source separator effectively.
  • The approach improves separation performance, achieving an SDR improvement (SDRi) of 5.62 dB on AudioSet, with notably large gains for classes such as dial tones and smoke detectors.

Universal Sound Separation with Self-Supervised Audio Masked Autoencoder

Introduction

The task of Universal Sound Separation (USS) entails the segregation of arbitrary sound sources from a mixture. Traditional approaches primarily rely on supervised learning with labeled datasets. However, the emerging Self-Supervised Learning (SSL) paradigm, which utilizes unlabeled data to derive task-independent representations, offers compelling advantages for various downstream tasks. This research introduces a novel integration of SSL with USS, employing a pre-trained Audio Masked Autoencoder (A-MAE) to enhance separation capabilities. The SSL embeddings are concatenated with short-time Fourier transform (STFT) features, and performance is evaluated on the AudioSet dataset, demonstrating improvements in separation performance over a ResUNet-based baseline.

Methodology

The proposed system is structured around the integration of A-MAE, a self-supervised model designed to learn audio representations from Mel-spectrograms. During USS training, the paper explores two strategies: freezing and selectively updating the A-MAE parameters. The objective is to optimize the ResUNet-based USS model by concatenating SSL-derived representations with STFT features.
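The snippet below is a minimal PyTorch sketch of this feature-fusion idea: pre-trained SSL embeddings are projected, time-aligned with the STFT frame rate, and stacked with the spectrogram as an extra input channel. The module names (`FusedFrontEnd`, `ssl_encoder`), the embedding dimension, and the projection/stacking choices are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: fusing A-MAE (SSL) embeddings with STFT features.
# Shapes and the ssl_encoder API are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedFrontEnd(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, ssl_dim: int = 768,
                 n_fft: int = 1024, hop: int = 320):
        super().__init__()
        self.ssl_encoder = ssl_encoder          # pre-trained A-MAE encoder (frozen or fine-tuned)
        self.n_fft, self.hop = n_fft, hop
        # project SSL embeddings to one value per frequency bin so they can be
        # stacked with the STFT magnitude along a channel axis
        self.proj = nn.Linear(ssl_dim, n_fft // 2 + 1)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples)
        spec = torch.stft(wav, n_fft=self.n_fft, hop_length=self.hop,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True)
        mag = spec.abs().transpose(1, 2)        # (batch, frames, freq_bins)

        emb = self.ssl_encoder(wav)             # (batch, ssl_frames, ssl_dim), assumed API
        emb = self.proj(emb)                    # (batch, ssl_frames, freq_bins)
        # align the SSL time resolution with the STFT frame rate
        emb = F.interpolate(emb.transpose(1, 2), size=mag.shape[1],
                            mode="linear", align_corners=False).transpose(1, 2)

        # concatenated input for the separator: (batch, 2, frames, freq_bins)
        return torch.stack([mag, emb], dim=1)
```

Under the freezing strategy, the A-MAE parameters would simply be excluded from optimization (e.g. by setting `requires_grad = False` on the `ssl_encoder` parameters), whereas the updating strategy fine-tunes them jointly with the separator.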

The framework adopts a query-based training pipeline built on weakly labeled data from AudioSet. It comprises three major components: a Sound Event Detection (SED) system for localizing event occurrences, a source separator, and a Latent Source Embedding (LSE) processor. The SED system identifies relatively clean segments of the target sound event, which are used to construct training mixtures, while the latent source embeddings are injected via Feature-wise Linear Modulation (FiLM) to condition the source separator on the desired source.
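As a rough illustration of the FiLM conditioning step, the sketch below scales and shifts a separator feature map channel-wise using the latent source embedding. Where inside the ResUNet the modulation is applied, and the layer sizes, are assumptions on my part.

```python
# Hedged sketch of FiLM conditioning on a latent source embedding.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        self.gamma = nn.Linear(embed_dim, num_channels)  # per-channel scale
        self.beta = nn.Linear(embed_dim, num_channels)   # per-channel shift

    def forward(self, feature_map: torch.Tensor,
                source_embedding: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, channels, time, freq)
        # source_embedding: (batch, embed_dim), the latent source query
        g = self.gamma(source_embedding)[:, :, None, None]
        b = self.beta(source_embedding)[:, :, None, None]
        return g * feature_map + b
```

In this way the same separator weights can be steered toward different target classes simply by changing the conditioning embedding.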

Results

The experimental setup extensively evaluates the proposed methods on the AudioSet evaluation dataset, considering both oracle and average embedding conditions. Notably, the integration of A-MAE enhances the separation performance, achieving an SDR improvement (SDRi) of 5.62 dB using average embeddings, surpassing the previous state-of-the-art by 0.44 dB.

Figure 1: Class-wise USS results on some AudioSet sound classes.
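For reference, the SDRi metric reported above is the SDR of the separated estimate minus the SDR of the unprocessed mixture, both measured against the target source. The snippet below shows this generic definition; it is not necessarily the exact evaluation code used in the paper.

```python
# Generic SDR-improvement (SDRi) computation for a single source pair.
import numpy as np

def sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    # signal-to-distortion ratio in dB
    noise = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))

def sdr_improvement(estimate: np.ndarray, mixture: np.ndarray,
                    reference: np.ndarray) -> float:
    # SDRi = SDR(estimate, target) - SDR(mixture, target)
    return sdr(estimate, reference) - sdr(mixture, reference)
```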

Class-wise analysis reveals substantial improvements in SDRi for distinct sound categories such as dial tones and smoke detectors, which are characterized by line spectrum features. Interestingly, while the proposed method generally enhances performance across most classes, a few exhibit negligible or adverse changes, suggesting avenues for further refinement.

Figure 2: Visualization of separation results obtained by our model.

A visual inspection of separation results demonstrates the efficacy of the proposed method, as the extracted source closely aligns with its ground truth counterpart. The robust A-MAE-based features significantly aid in preserving the fidelity of separated sources.

Conclusion

This research marks a pioneering effort in deploying self-supervised pre-trained audio models like A-MAE for universal sound separation. By enriching the feature space with SSL embeddings, the approach significantly advances separation performance across various sound classes in AudioSet. Future work will focus on enhancing the system's adaptability to unseen sound categories and exploring additional modalities for sound separation enhancement. These directions promise to further optimize USS systems for practical, real-world applications.
