Look, Listen and Learn

Published 23 May 2017 in cs.CV and cs.LG | (1705.08168v2)

Abstract: We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.

Citations (864)

Summary

The paper introduces the L3-Net, a framework that learns audio-visual correspondence through self-supervised training on unlabeled videos.
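It trains on a novel Audio-Visual Correspondence task, in which fused visual and audio features are used to decide whether a frame and an audio clip match, reaching 74% and 78% accuracy on Kinetics-Sounds and Flickr-SoundNet, respectively.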
The method sets new benchmarks in audio classification on ESC-50 and DCASE, underscoring the potential of multimodal learning.

Analysis of the "Look, Listen and Learn" Paper

Overview

The paper "Look, Listen and Learn" by Relja Arandjelović and Andrew Zisserman examines the potential of learning visual and audio representations simultaneously from unlabelled videos using an Audio-Visual Correspondence (AVC) task. The primary aim is to leverage the natural co-occurrence of visual and audio events to train neural networks in a self-supervised manner. This paper introduces the $L^3$-Net, a network architecture designed to extract and fuse visual and audio features to determine if a video frame and an audio clip correspond to each other.

Methodology

The authors propose a novel learning task, the AVC task, whereby the network decides if a visual frame corresponds to an audio snippet from the same video. Positive pairs are taken from corresponding visual and audio streams, while negative pairs are generated by mismatching frames and audio clips from different videos. This setup ensures that the only way to succeed in the task is to learn meaningful visual and audio representations.
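The pair construction described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the `videos` layout, the function name `sample_avc_pair`, and the 50/50 split between positive and negative pairs are assumptions made for the example.

```python
import random

def sample_avc_pair(videos, rng):
    """Return (frame, audio, label) for one AVC training example.

    `videos` maps a video id to a dict with 'frames' and 'audio' lists,
    where frames[i] and audio[i] overlap in time (hypothetical layout).
    Label 1 means the pair corresponds, label 0 means it does not.
    """
    ids = sorted(videos)
    vid = rng.choice(ids)
    idx = rng.randrange(len(videos[vid]["frames"]))
    frame = videos[vid]["frames"][idx]
    if rng.random() < 0.5:
        # Positive pair: audio from the same video at the same moment.
        return frame, videos[vid]["audio"][idx], 1
    # Negative pair: audio drawn from a different video.
    other = rng.choice([v for v in ids if v != vid])
    return frame, rng.choice(videos[other]["audio"]), 0
```

Sampling positives and negatives with equal probability keeps the task balanced, so a network that ignores its inputs cannot do better than 50%.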

The $L^3$-Net architecture is composed of three parts:

Vision Subnetwork: Follows a VGG-like style with convolutional layers, pooling, and batch normalization, designed to process $224 \times 224$ input images.
Audio Subnetwork: Similar architecture to the vision subnetwork but adapted to process $1$-second audio clips converted into log-spectrograms.
Fusion Network: Takes the 512-D visual and audio features, concatenates them into a 1024-D vector, and passes them through fully connected layers to produce the final correspondence decision.
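To make the fusion step concrete, here is a dependency-free sketch of the shapes involved. The 512-D inputs and the 1024-D concatenation come from the description above; the hidden width of 128, the ReLU, the two-way softmax output, and the random initialization are assumptions for illustration, since the text says only "fully connected layers". The two subnetworks are not modelled; their outputs are stood in for by plain lists.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialization only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse(visual_feat, audio_feat, fc1, fc2):
    """Concatenate two 512-D features and return [p(mismatch), p(match)]."""
    assert len(visual_feat) == 512 and len(audio_feat) == 512
    joint = visual_feat + audio_feat  # list concatenation gives 1024-D
    hidden = relu(linear(joint, *fc1))
    return softmax(linear(hidden, *fc2))

rng = random.Random(0)
fc1 = init_layer(1024, 128, rng)  # assumed hidden width
fc2 = init_layer(128, 2, rng)     # assumed two-way output
visual = [rng.random() for _ in range(512)]
audio = [rng.random() for _ in range(512)]
probs = fuse(visual, audio, fc1, fc2)
```

Because the two features are only combined at this late stage, each subnetwork can later be used on its own as a feature extractor for a single modality.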

Results

Audio-Visual Correspondence

The $L^3$-Net performs well on the AVC task, achieving 74% and 78% accuracy on the Kinetics-Sounds and Flickr-SoundNet datasets, respectively, well above chance (50%). This indicates that the network learns effectively from raw, unlabelled video. Its accuracy is comparable to that of the supervised baselines, which demonstrates the efficacy of the self-supervised approach.
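The 50% chance level follows from evaluating on an equal number of corresponding and non-corresponding pairs. The short sketch below illustrates this under that assumption; the `accuracy` helper and the toy labels are hypothetical and do not reproduce the paper's evaluation code.

```python
def accuracy(predictions, labels):
    """Fraction of pairs whose predicted label matches the true label."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# A balanced evaluation set: half matching pairs (1), half mismatched (0).
labels = [1, 0] * 500

# A trivial predictor that always answers "corresponds" is right half the time.
always_positive = [1] * len(labels)
print(accuracy(always_positive, labels))  # 0.5
```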

Audio Feature Evaluation

The audio features learned by the $L^3$-Net set a new state of the art on the ESC-50 and DCASE sound classification datasets, achieving 79.3% and 93% accuracy, respectively. They outperform previous state-of-the-art models such as SoundNet, which relies on supervised visual networks as teachers. These findings underscore the potential of self-supervised learning to produce high-quality audio representations.

Visual Feature Evaluation

The visual features learned by the $L^3$-Net were evaluated on ImageNet classification, attaining a Top-1 accuracy of 32.3%. This is on par with other state-of-the-art self-supervised methods. Notably, the $L^3$-Net is trained on video frames, whose statistics differ from those of still images, yet its features generalize well despite this difference.

Qualitative Analysis

The qualitative assessment reveals that the visual subnetwork learns to recognize semantic concepts and objects, such as musical instruments and specific scenes like "concert" or "outdoor." Similarly, the audio subnetwork captures fine-grained audio distinctions and scene-specific sounds, such as "fingerpicking" versus "playing bass guitar." The network also shows the ability to localize these concepts within the visual and audio domains.

Implications and Future Directions

The implications of this research are significant for both practical and theoretical aspects of AI. Practically, the results suggest that self-supervised learning from multimodal, unlabelled data can rival supervised approaches, reducing the need for extensive labelled datasets. Theoretically, the paper paves the way for further exploration of multimodal learning, particularly the use of synchronized video and audio streams to uncover complex representations.

Future work could explore stronger concurrency constraints by leveraging video sequences instead of single frames. Additionally, exploiting datasets curated by audio events presents an opportunity to refine audio-visual learning and capture more nuanced semantic representations.

Conclusion

The "Look, Listen and Learn" paper demonstrates that the concurrent visual and audio streams in videos are a rich source of self-supervision. The $L^3$-Net exploits this correspondence effectively, producing state-of-the-art audio features and visual features on par with the best self-supervised methods. These findings highlight the potential of self-supervised learning and contribute to the growing understanding of multimodal representation learning.