Co-Separating Sounds of Visual Objects

Published 16 Apr 2019 in cs.CV, cs.MM, cs.SD, and eess.AS | (1904.07750v2)

Abstract: Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos. Our novel training objective requires that the deep neural network's separated audio for similar-looking objects be consistently identifiable, while simultaneously reproducing accurate video-level audio tracks for each source training pair. Our approach disentangles sounds in realistic test videos, even in cases where an object was not observed individually during training. We obtain state-of-the-art results on visually-guided audio source separation and audio denoising for the MUSIC, AudioSet, and AV-Bench datasets.





Summary

The paper introduces a co-separation framework that learns object-level audio features from multi-source videos without requiring clean, labeled data.
It employs a pre-trained object detector together with co-separation and object-consistency losses to tie each separated sound to its corresponding visual object.
Experimental results on datasets like MUSIC and AudioSet demonstrate state-of-the-art performance in audio separation and denoising tasks.

An Expert Overview of "Co-Separating Sounds of Visual Objects"

This paper, authored by Ruohan Gao and Kristen Grauman, presents a novel approach for audio source separation using visual cues from video. The research introduces a co-separation training paradigm, which enables learning object-level sounds from videos with multiple simultaneous sound sources, thereby addressing limitations inherent in existing methods that rely on artificially mixed video clips.

Key Contributions

The primary contribution of this paper is the co-separation framework, which is designed to learn from naturally occurring multi-source videos without requiring labeled data. The proposed approach leverages noisy object detections from videos to separate audio sources at an object level rather than a video level. This shift allows the system to learn the unique sounds associated with visual objects, a capability that traditional "mix-and-separate" methods struggle to achieve effectively due to their reliance on single-source video clips and artificial mixes.

Methodology

The authors employ a pre-trained object detector to identify potential sound-emitting objects in video frames. These detected objects are used to guide the source separation process in the co-separation framework. Notably, the framework operates by considering pairs of videos during training to enforce object-level sound consistency across samples. The audio-visual separator network integrates visual features from detected objects with audio features from the mixed signals to predict spectrogram masks. This integration facilitates the separation of sounds that correspond to each detected object, circumventing the need for clean "solo" videos during training.
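The mask-based separation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `apply_masks`, the toy spectrogram shapes, and the simplification that magnitude spectrograms add linearly when two clips are mixed are all assumptions made for illustration. In the real system the masks would come from the audio-visual separator network; here ideal ratio masks stand in for them.

```python
import numpy as np

def apply_masks(mixture, masks):
    """Apply one mask per detected object to a mixed magnitude spectrogram.

    Hypothetical helper: each mask is clipped to [0, 1] and multiplied
    element-wise with the mixture to give that object's separated spectrogram.
    """
    return [np.clip(m, 0.0, 1.0) * mixture for m in masks]

# Toy magnitude spectrograms (frequency bins x time frames) for two sources.
rng = np.random.default_rng(0)
spec_a = rng.random((4, 5)) + 0.1  # offset keeps the mixture strictly positive
spec_b = rng.random((4, 5)) + 0.1

# Assumption for illustration: magnitudes are treated as additive.
mixture = spec_a + spec_b

# Ideal ratio masks stand in for the network's predictions.
ideal_masks = [spec_a / mixture, spec_b / mixture]
sep_a, sep_b = apply_masks(mixture, ideal_masks)
```

With ideal ratio masks, each separated spectrogram matches its source and the separated outputs sum back to the mixture; a trained network would only approximate this.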

The training objective has two parts. The co-separation loss requires that the sounds separated for the objects detected in each video of a training pair together reconstruct that video's original audio track. An object-consistency loss additionally requires that each separated sound remain identifiable as its corresponding visual object, which lets the network learn from complex, multi-source videos.
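As a rough sketch of how the co-separation loss and the object-consistency loss could be computed, the NumPy code below makes several assumptions that the overview does not confirm: the reconstruction term is an L1 distance between the sum of per-object masks and a per-video target mask, the consistency term is a cross-entropy over object categories, and the function names, argument layout, and toy values are all hypothetical.

```python
import numpy as np

def co_separation_loss(pred_masks, video_ids, target_masks):
    """Mean L1 gap between summed per-object masks and each video's target mask.

    pred_masks: list of arrays, one predicted mask per detected object.
    video_ids: which training video each object was detected in.
    target_masks: dict mapping video id to that video's target mask.
    """
    loss = 0.0
    for vid, target in target_masks.items():
        summed = sum(m for m, v in zip(pred_masks, video_ids) if v == vid)
        loss += np.abs(summed - target).mean()
    return loss

def object_consistency_loss(logits, labels):
    """Cross-entropy between category logits for separated audio and object labels."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy pair: video 1 has two detected objects, video 2 has one.
target_1 = np.full((4, 5), 0.6)
target_2 = np.full((4, 5), 0.4)
pred_masks = [0.5 * target_1, 0.5 * target_1, target_2]
video_ids = [1, 1, 2]

recon = co_separation_loss(pred_masks, video_ids, {1: target_1, 2: target_2})
uniform = object_consistency_loss(np.zeros((2, 3)), np.array([0, 2]))
```

In this toy case the per-object masks sum exactly to each video's target, so the reconstruction term is zero, while uninformative (all-zero) logits over three categories give a consistency loss of log 3. How the two terms are weighted against each other is not stated in this overview.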

Evaluation and Results

The effectiveness of the co-separation approach is demonstrated through experimental evaluations on several datasets, including MUSIC, AudioSet, and AV-Bench. The system achieves state-of-the-art results for visually-guided audio source separation and audio denoising tasks. Specifically, the co-separation framework outperforms existing methods on challenging datasets like AudioSet-Unlabeled by successfully learning from noisy web videos, which contain highly overlapping sound sources.

Implications and Future Directions

The approach has implications for real-world applications such as audio denoising, sound event detection, and video indexing. Because it separates sound sources in multi-source videos, the co-separation paradigm offers an alternative to traditional methods that struggle in less curated, realistic environments.

Looking to the future, the research opens several avenues for further investigation. There is potential to enhance the model by incorporating temporal object proposals and motion cues, which could improve separation performance for objects with similar acoustic properties. Moreover, extending the framework to handle other types of multi-modal perception challenges could broaden its utility across different domains.

In summary, the paper advances audio-visual source separation with a framework that uses visual object context to handle complex audio mixtures. The work lays a foundation for further developments at the intersection of computer vision and audio processing.
