Masked Autoencoders that Listen (2207.06405v3)

Published 13 Jul 2022 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.

Summary

The paper introduces Audio-MAE, a self-supervised framework that masks audio spectrogram patches to learn robust representations.
It employs a Transformer encoder with high masking and a decoder using local window attention for efficient, accurate spectrogram reconstruction.
Experiments on six audio classification tasks show state-of-the-art performance without relying on external supervised pre-training.

Overview of "Masked Autoencoders that Listen"

The paper "Masked Autoencoders that Listen" extends image-based Masked Autoencoders (MAE) to the audio domain, focusing on self-supervised representation learning from audio spectrograms. The authors build on the success of masked autoencoding, which is well established in natural language processing (masked language modeling) and computer vision (MAE), to advance audio understanding tasks.

Methodology

The central contribution of the paper is the design of the Audio-MAE, which comprises a standard Transformer encoder and decoder architecture tailored to process spectrograms. The main aspects of the methodology include:

High Masking Ratio: During pre-training, most spectrogram patches are masked (a ratio of about 80%), and the encoder processes only the remaining non-masked patches (about 20%). This significantly reduces computation while still allowing the model to learn comprehensive audio representations.
Decoder with Local Attention: Recognizing the local correlations within audio spectrograms, the model employs local window attention in the decoder. This adaptation acknowledges the importance of temporal and frequency locality in audio signals, leading to accurate reconstruction of the masked spectrogram.
Fine-tuning with Masking: After pre-training, the encoder is fine-tuned on target datasets with a lower masking ratio, enhancing performance across various audio classification tasks.
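
The masking, mask-token padding, and local-attention ideas above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`random_masking`, `restore_with_mask_tokens`, `local_window_mask`), the 80% default ratio, and the non-overlapping window layout are chosen for this example, and the Transformer encoder and decoder themselves are omitted. The paper's decoder may use a different local attention scheme (for example, shifted windows); the mask here only illustrates the locality idea.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.8, rng=None):
    """Keep a random subset of patches (hypothetical helper).

    patches: array of shape (num_patches, dim).
    Returns the visible patches, a boolean mask (True = masked),
    and the indices needed to restore the original patch order.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    # Sorting random noise yields a random permutation of patch indices.
    shuffle_idx = np.argsort(rng.random(num_patches))
    restore_idx = np.argsort(shuffle_idx)
    keep_idx = shuffle_idx[:num_keep]
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, restore_idx

def restore_with_mask_tokens(encoded, restore_idx, mask_token):
    """Pad encoded visible tokens with mask tokens and undo the shuffle."""
    num_masked = restore_idx.shape[0] - encoded.shape[0]
    padding = np.repeat(mask_token[None, :], num_masked, axis=0)
    shuffled = np.concatenate([encoded, padding], axis=0)
    return shuffled[restore_idx]

def local_window_mask(freq_patches, time_patches, window=(4, 4)):
    """Boolean (N, N) attention mask over a freq x time patch grid.

    Entry [i, j] is True when patches i and j fall in the same
    non-overlapping window (an assumed, simplified form of locality).
    """
    f_idx, t_idx = np.meshgrid(
        np.arange(freq_patches), np.arange(time_patches), indexing="ij"
    )
    f_win = (f_idx // window[0]).ravel()
    t_win = (t_idx // window[1]).ravel()
    same_f = f_win[:, None] == f_win[None, :]
    same_t = t_win[:, None] == t_win[None, :]
    return same_f & same_t
```

In a full model, the visible patches would pass through the encoder before `restore_with_mask_tokens` is called, and the boolean mask from `local_window_mask` would be applied to the decoder's attention scores so that each patch attends only to its time-frequency neighbours.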

Experimental Results

The empirical evaluation reveals that Audio-MAE attains state-of-the-art performance on six audio and speech classification tasks. Notably, the model excels without relying on any external supervised pre-training, thus demonstrating its capability to learn robust audio representations from scratch:

AudioSet: Achieved new state-of-the-art mean Average Precision (mAP) on the AudioSet dataset, surpassing models initialized with external ImageNet weights.
ESC-50, Speech Commands, VoxCeleb: Similarly strong performance was reported across environmental sound classification, speech command recognition, and speaker identification tasks.
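
For readers unfamiliar with the mAP metric mentioned above, the following sketch computes a common non-interpolated form of average precision per class and averages it over classes. The function names and toy inputs are assumptions for illustration; the paper's evaluation code may compute the metric differently (for example, through a library routine).

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated average precision for one class.

    scores: predicted scores, shape (num_clips,).
    labels: binary ground truth (1 = class present), shape (num_clips,).
    """
    order = np.argsort(-scores, kind="stable")  # rank clips by descending score
    hits = labels[order].astype(float)
    if hits.sum() == 0:
        return float("nan")  # class has no positive clips
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values at the ranks where a positive occurs.
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(score_matrix, label_matrix):
    """Mean of per-class AP over the columns of (num_clips, num_classes) arrays."""
    aps = [
        average_precision(score_matrix[:, c], label_matrix[:, c])
        for c in range(score_matrix.shape[1])
    ]
    return float(np.nanmean(aps))
```

Because AudioSet is a multi-label dataset, each class is scored independently in this way and the per-class values are then averaged.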

The paper highlights that pre-training on AudioSet alone, coupled with a high masking ratio that keeps encoder computation low, is sufficient to reach this accuracy.

Implications and Future Directions

This research contributes significantly to the quest for versatile audio representation models. The paper provides evidence that sophisticated self-supervised frameworks, like MAE, can be adapted beyond text and image domains, holding promise for comprehensive multi-modal learning systems.

The approach matters for both practical applications and further research, since encoding only the non-masked patches improves efficiency and scalability, which is especially important for high-dimensional or lengthy inputs such as audio. Future research could explore joint audio-visual learning, leveraging the cross-modal relationships within videos to further improve model understanding and performance.

In summary, "Masked Autoencoders that Listen" adapts the MAE methodology to the audio domain, reports state-of-the-art results on audio and speech classification tasks, and opens avenues for multi-modal AI research.