Learning to Separate Object Sounds by Watching Unlabeled Video

Published 5 Apr 2018 in cs.CV, cs.MM, cs.SD, and eess.AS | (1804.01665v2)

Abstract: Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising. Our video results: http://vision.cs.utexas.edu/projects/separating_object_sounds/

Summary

The paper introduces a novel unsupervised framework using a MIML network to match audio bases with object categories.
It leverages non-negative matrix factorization and image recognition to extract and align audio signals with visual cues.
Results demonstrate improved SDR and robustness on benchmark datasets, highlighting advancements in real-world audio-visual processing.

An Expert Analysis of "Learning to Separate Object Sounds by Watching Unlabeled Video"

This paper presents an approach to audio source separation in videos: a model that learns to associate sounds with visible objects from unlabeled audio-visual data. The authors propose an unsupervised framework that exploits the natural synchronization of visual and auditory information in video, using visual cues to disentangle the audio signal. The central component is a multi-instance multi-label (MIML) learning framework that matches audio bases to the object categories detected in each video. The approach is inspired by human perception, where visual context helps isolate individual sounds in complex auditory environments.

Methodology

The method combines deep learning with a classical signal processing technique, non-negative matrix factorization (NMF). The system first extracts audio frequency bases from each video clip using NMF, then applies state-of-the-art image recognition models to predict which objects are present in the video frames. These predictions act as weak labels: they say which objects appear in a video, but not which audio basis belongs to which object. The MIML network resolves this ambiguity by learning an audio basis-object relation map, which tells the model which audio bases correspond to which visual objects, even when those objects are never heard in isolation.
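The basis-extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Euclidean (Frobenius) objective with the standard multiplicative updates, and the function name `nmf`, the number of bases, the iteration count, and the toy input are all hypothetical choices made for the example. In practice the input `V` would be a magnitude spectrogram computed from a video's audio track.

```python
import numpy as np

def nmf(V, num_bases, num_iters=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V (freq x time) as W @ H.

    W (freq x num_bases) holds the frequency bases and
    H (num_bases x time) holds their activations over time.
    Uses multiplicative updates for the Euclidean objective
    (an assumption; the summary does not state the objective used).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, num_bases)) + eps
    H = rng.random((num_bases, n_time)) + eps
    for _ in range(num_iters):
        # Each update keeps the factors non-negative because it only
        # multiplies by ratios of non-negative quantities.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for a spectrogram: an exactly rank-2 non-negative matrix.
toy_rng = np.random.default_rng(1)
V = toy_rng.random((16, 2)) @ toy_rng.random((2, 30))

W, H = nmf(V, num_bases=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)
```

Each column of `W` is one candidate audio basis. In the framework described above, the set of bases from a single video would form the "bag" of instances that the MIML network associates with the objects predicted for that video.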

The model is trained on a large corpus of videos from the AudioSet dataset, in which audio tracks typically contain multiple overlapping sources. This makes the dataset a demanding test of whether the approach scales to unconstrained video. Learning a detailed basis-object association without supervision is a significant contribution, given how noisy real-world audio-visual data is.

Results and Contributions

The authors report state-of-the-art performance on visually-aided audio source separation and audio denoising. The model outperforms contemporary methods on benchmark datasets such as AV-Bench and AudioSet, across scenarios involving musical instruments, animals, and vehicles. The authors also show that the model is robust to imperfect visual predictions, a property that matters for uncurated video data.

A particularly strong numerical result is the improvement in signal-to-distortion ratio (SDR), especially in synthetic experiments where test mixtures are created by artificially mixing the audio of pairs of single-source AudioSet videos. The results also suggest that using the learned audio bases to guide NMF, with visual context selecting which bases apply, significantly improves separation over traditional unsupervised clustering methods.
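A rough sketch of how learned bases can guide NMF at test time, and how an SDR-style score is computed, is given below. It rests on several assumptions that are not taken from the paper: the per-object bases are simply handed in as arrays, the activations are fit with Euclidean multiplicative updates, and each source is recovered with a soft mask. The names `separate_with_fixed_bases` and `sdr_db` are hypothetical. The metric is a simplified energy-ratio SDR applied here to magnitude spectrograms for illustration; published results are normally computed on time-domain waveforms with a full evaluation toolkit.

```python
import numpy as np

def separate_with_fixed_bases(V, bases_per_object, num_iters=200, eps=1e-9, seed=0):
    """Split mixture spectrogram V (freq x time) using fixed per-object bases.

    bases_per_object: list of (freq x k_j) non-negative arrays, one per
    object detected in the video (assumed to be already learned).
    Returns one (freq x time) magnitude estimate per object.
    """
    W = np.concatenate(bases_per_object, axis=1)  # bases are never updated
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(num_iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # fit activations only
    full = W @ H + eps
    estimates, start = [], 0
    for W_j in bases_per_object:
        stop = start + W_j.shape[1]
        part = W_j @ H[start:stop]                # this object's share of the fit
        estimates.append(V * part / full)         # soft mask applied to the mixture
        start = stop
    return estimates

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy."""
    err = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(err ** 2) + eps))

# Toy setup: two hypothetical objects whose bases occupy disjoint frequency
# bins, which makes the separation easy by construction.
rng = np.random.default_rng(2)
W_a = np.zeros((16, 2)); W_a[:8] = rng.random((8, 2))   # low-frequency object
W_b = np.zeros((16, 2)); W_b[8:] = rng.random((8, 2))   # high-frequency object
S_a = W_a @ rng.random((2, 30))
S_b = W_b @ rng.random((2, 30))
V_mix = S_a + S_b

est_a, est_b = separate_with_fixed_bases(V_mix, [W_a, W_b])
print("SDR of raw mixture vs. source A (dB):", sdr_db(S_a, V_mix))
print("SDR of estimate vs. source A (dB):  ", sdr_db(S_a, est_a))
```

Because the soft masks sum to (almost exactly) one, the per-object estimates add back up to the mixture. With real audio the bases of different objects overlap in frequency, so the gain in SDR depends on how well the bases retrieved for each object match the sounds actually present.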

Implications and Future Directions

This research has practical implications for audio-visual processing, including automatic scene understanding, audio remixing, and noise reduction in media applications. Separating individual object sounds from complex auditory scenes can support better audio indexing and querying in large video libraries, with use cases in media production and in accessibility tools such as enhanced closed captioning.

Future research could extend the method by integrating more sophisticated object detection models, or by modeling the temporal dynamics of the video stream to improve the temporal coherence of the separated sounds. Ambiguous cases also remain open: a sound may have no visible source because it originates off-screen, and a visible object may be producing no sound at all.

Overall, this paper contributes a substantive methodological advance in leveraging multi-modal learning from large-scale unlabeled data, setting a foundation for further research into the intersection of sound and vision processing in AI systems.
