Contrastive Audio-Visual Masked Autoencoder

Published 2 Oct 2022 in cs.MM, cs.CV, cs.SD, and eess.AS | (2210.07839v4)

Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.


Authors (7)



Summary

The paper introduces CAV-MAE, which combines contrastive learning and masked data modeling to fuse the audio and visual modalities into a joint representation.
It achieves state-of-the-art accuracy of 65.9% on VGGSound and is comparable with the previous best supervised pretrained model on AudioSet.
Multi-modal pretraining with CAV-MAE also improves performance on single-modality tasks, which suggests that the learned features transfer to unimodal settings.

Overview of Contrastive Audio-Visual Masked Autoencoder

The paper presents a novel approach to audio-visual learning by integrating two major self-supervised learning frameworks: contrastive learning and masked data modeling. The proposed model, Contrastive Audio-Visual Masked Autoencoder (CAV-MAE), seeks to address limitations in previous single-modality models by leveraging multi-modal data and audio-visual correspondences more effectively.

Key Contributions

Extension of Masked Auto-Encoder (MAE) Model: The paper extends MAE, originally proposed for a single modality, to joint audio-visual input. The result is the Audio-Visual Masked Auto-Encoder (AV-MAE), which fuses the audio and visual streams into a joint representation.
Introduction of CAV-MAE: CAV-MAE combines contrastive audio-visual learning with masked data modeling and is designed to exploit the complementary strengths of the two frameworks. Contrastive learning focuses on the correspondence between modalities, while masked modeling forces the model to retain enough information to reconstruct the masked portions of the input.
State-of-the-Art Performance: CAV-MAE achieves state-of-the-art (SOTA) accuracy of 65.9% on VGGSound and is comparable with the previous best supervised pretrained model on AudioSet for audio-visual event classification. The contrastive objective also enables audio-visual retrieval, and the experiments indicate that it helps the model learn a better joint representation.
Single-Modal Enhancement through Multi-Modal Pretraining: The paper also reports that multi-modal pretraining with CAV-MAE improves single-modality performance, which indicates that features learned from paired audio-visual data transfer to unimodal tasks.
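
The way the two objectives described above fit together can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the InfoNCE-style form of the contrastive term, the mean-squared-error reconstruction term, and the default values for `temperature` and `contrastive_weight` are all assumptions made for the example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(audio_embs, visual_embs, temperature=0.05):
    """InfoNCE-style loss over a batch of paired clips (assumed form).

    audio_embs[i] and visual_embs[i] are assumed to come from the same video;
    every other audio clip in the batch acts as a negative for visual clip i.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    visual = [l2_normalize(v) for v in visual_embs]
    n = len(audio)
    total = 0.0
    for i in range(n):
        # Scaled cosine similarity of visual clip i to every audio clip.
        logits = [
            sum(x * y for x, y in zip(visual[i], audio[k])) / temperature
            for k in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability of picking the true partner (index i).
        total += log_denom - logits[i]
    return total / n


def reconstruction_loss(predicted, target):
    """Mean squared error between predicted and true patch values."""
    errors = [
        (p - t) ** 2
        for pred_patch, tgt_patch in zip(predicted, target)
        for p, t in zip(pred_patch, tgt_patch)
    ]
    return sum(errors) / len(errors)


def combined_loss(recon, contrastive, contrastive_weight=0.01):
    """Weighted sum of the two objectives (the weight is a hypothetical default)."""
    return recon + contrastive_weight * contrastive
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairs are shuffled. That is the signal that pulls matching audio and visual clips together, while the reconstruction term is computed independently of the pairing.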

Technical Insights

Self-Supervised Framework Synergy: CAV-MAE exploits the complementary strengths of contrastive learning and masked autoencoding. The contrastive objective pulls the embeddings of paired audio and visual samples together, which is what makes cross-modal retrieval possible. The masked autoencoding objective requires the model to reconstruct the input from a partially masked version, which pushes it to keep more of the input information in its features.
Efficient Encoder Design: The model uses separate modality-specific encoders for the audio and visual inputs and passes their outputs to a shared joint encoder. This design keeps computation manageable while still allowing the two modalities to interact in the joint encoder.
Masked Contrastive Learning: CAV-MAE applies a high masking ratio, as in MAE. Because the model sees only part of each input, its representations cannot depend heavily on any single part of the input data, which helps guard against overfitting.
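
The masking step described above can be illustrated with a short sketch. It assumes uniform random masking over patch indices, a masking ratio of 0.75, and 196 patches per input (a 14 x 14 grid); these values and the function name `random_mask` are hypothetical choices for the example and are not taken from the summary.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into a visible set and a masked set.

    Returns (keep, masked): two sorted, disjoint lists of patch indices.
    Uniform random masking is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    num_keep = num_patches - num_masked
    shuffled = list(range(num_patches))
    rng.shuffle(shuffled)
    keep = sorted(shuffled[:num_keep])
    masked = sorted(shuffled[num_keep:])
    return keep, masked


# Example: 196 patches, of which a quarter stay visible.
keep, masked = random_mask(196, mask_ratio=0.75, seed=0)
print(len(keep), len(masked))
```

In a masked autoencoder, only the patches in `keep` would be passed to the encoder, and the reconstruction loss would be evaluated on the patches in `masked`. A different random split on every training step means no fixed region of the input is always available to the model.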

Implications and Future Directions

Practically, CAV-MAE could support more capable audio-visual event detection systems, with potential uses in multimedia information retrieval and context-aware content generation. Theoretically, the work shows that contrastive and reconstructive objectives, usually treated separately, can be combined in a single coherent audio-visual learning framework.

Future work could extend the model to other modalities, for example by adding text alongside the audio and visual inputs. Other directions include structured masking strategies and a study of how the masking ratio affects performance across datasets and tasks.

Overall, the paper positions CAV-MAE as a significant contribution to multi-modal learning, showing that two self-supervised learning paradigms that were previously used independently work well together.
