Objects that Sound

Published 18 Dec 2017 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS | (1712.06651v2)

Abstract: In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space that is suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both these objectives by training from unlabelled video using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and localizing the sound source in an image, by using the AVC task. We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.


Authors (2)

Citations (516)


Summary

The paper introduces deep architectures, trained without labels, that align audio and visual data through the audio-visual correspondence (AVC) task and outperform both unsupervised and supervised baselines.
It presents two networks—AVE-Net for cross-modal retrieval using Euclidean embedding distances and AVOL-Net for accurately localizing sound sources via a Multiple Instance Learning framework.
Results on the AudioSet-Instruments dataset demonstrate high retrieval accuracy (nDCG) and effective object localization, paving the way for advanced multimedia and interactive AI applications.

In this paper, the authors study unsupervised cross-modal learning between audio and visual data. Their work uses large-scale, unlabeled video to train deep networks that associate a sound with its visual source. They apply this to two tasks: cross-modal retrieval and audio-visual object localization.

Methodology

The study introduces two network architectures, one per task: AVE-Net for cross-modal retrieval and AVOL-Net for object localization. Each is trained using the audio-visual correspondence (AVC) task, a form of self-supervision that requires no labeled data.

AVE-Net is designed for cross-modal retrieval. It bases the correspondence decision on the Euclidean distance between the normalized visual and audio embeddings. Training on the AVC task therefore forces the two embeddings into alignment, and retrieval across modalities becomes a nearest-neighbour search in the shared space.
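The distance-based decision can be illustrated with a minimal sketch. It assumes the embeddings are plain lists of floats, and it uses a logistic function with hypothetical `weight` and `bias` values as a stand-in for whatever learned mapping turns the distance into a correspondence score; the function names are illustrative and not taken from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def embedding_distance(visual, audio):
    """Euclidean distance between the L2-normalized embeddings (range 0 to 2)."""
    v = l2_normalize(visual)
    a = l2_normalize(audio)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, a)))


def correspondence_probability(visual, audio, weight=-4.0, bias=4.0):
    """Map the distance to a probability that the pair corresponds.

    weight and bias are hypothetical; in a trained network they would be
    learned. A negative weight makes small distances score high.
    """
    d = embedding_distance(visual, audio)
    return 1.0 / (1.0 + math.exp(-(weight * d + bias)))
```

Because the score depends on the inputs only through the distance, a pair can be judged as corresponding only if its embeddings lie close together, which is what makes the same embeddings usable for retrieval.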

AVOL-Net, on the other hand, uses a Multiple Instance Learning (MIL) framework to localize the source of a sound within an image. The image is treated as a bag of local regions, each with its own visual descriptor, and a similarity score is computed between every descriptor and the audio embedding. The resulting spatial map of scores indicates where the sound-generating object lies in the visual field.
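A rough sketch of this MIL-style scoring follows. It assumes a dot-product similarity, a sigmoid to squash each score, and a maximum over locations as the image-level correspondence score; the grid layout and function names are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def localization_map(visual_grid, audio):
    """Score every spatial location against the audio embedding.

    visual_grid is an H x W nested list of descriptors; audio is a single
    embedding of the same dimension. Returns an H x W grid of scores in (0, 1).
    """
    return [
        [sigmoid(sum(v * a for v, a in zip(cell, audio))) for cell in row]
        for row in visual_grid
    ]


def mil_score(score_map):
    """Image-level score: the best location wins (max over the bag).

    Returns (score, (row, col)) for the highest-scoring location.
    """
    return max(
        ((score, (r, c))
         for r, row in enumerate(score_map)
         for c, score in enumerate(row)),
        key=lambda item: item[0],
    )
```

Under these assumptions, only the image-level score from `mil_score` needs a training signal (the AVC label), while the full map from `localization_map` is read out afterwards as the localization result.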

Results

The networks were tested on the AudioSet-Instruments dataset, which includes labeled video segments focused on musical instruments and similar categories. The labels were not used during training; they served only to evaluate the quality of retrieval and localization. Measured by normalized discounted cumulative gain (nDCG), AVE-Net outperformed both unsupervised and supervised baselines, supporting the effectiveness of the cross-modal alignment technique.
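For readers unfamiliar with the metric, nDCG can be sketched as below. This is the common formulation with linear gain and a log2 rank discount; how relevance is derived from the dataset labels is not specified here, so the relevance values are assumed to be given, and the ideal ordering is computed from the supplied list itself (which presumes the list covers all candidates).

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: items lower in the ranking count less."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(relevances, k):
    """nDCG of the top-k results, normalized by the best possible ordering.

    relevances lists the relevance of each retrieved item in ranked order.
    Returns 0.0 when no item is relevant.
    """
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg(relevances[:k]) / ideal_dcg
```

A perfectly ordered result list scores 1.0, and the score drops as relevant items appear further down the ranking.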

For object localization, AVOL-Net demonstrated strong results, identifying sound-producing objects across a variety of contexts. A control experiment with mismatched audio-visual pairs supports this: it indicates that the localization is driven by the audio rather than by visual saliency alone.

Implications and Future Directions

This research has significant implications for the fields of multimedia retrieval, video understanding, and multi-modal learning. By enabling systems to correlate and retrieve data across modalities without explicit supervision, the methods reduce the dependence on the extensive labeled datasets that traditional supervised approaches typically demand.

The study also opens avenues for refining object localization, potentially by incorporating soft attention frameworks or by better handling background noise and cluttered scenes. Moreover, embedding such cross-modal capabilities in robotics, augmented reality, and interactive AI systems could enable richer, more contextually aware interactions with the environment.

Overall, the paper sets a foundation for future developments in the cross-utilization of audio and visual data, fostering advancements in AI's capacity to understand and leverage the multi-sensory richness of real-world phenomena.
