Efficient Training of Audio Transformers with Patchout

Published 11 Oct 2021 in cs.SD, cs.LG, and eess.AS | (2110.05069v3)

Abstract: The great success of transformer-based models in NLP has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is the computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degrading predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST

Summary

The paper introduces Patchout to efficiently lower the computational complexity of audio transformers by omitting input patches during training.
The method combines structured and unstructured dropping of input patches with a disentangled positional encoding to improve training speed and model generalization.
Results show that Patchout models reach 0.471 mAP on Audioset, train up to eight times faster than previous transformer models while using less GPU memory, and outperform conventional CNNs.

Efficient Training of Audio Transformers with Patchout

The paper, "Efficient Training of Audio Transformers with Patchout," addresses a central obstacle to applying transformer architectures to audio processing. Despite the success of transformers in NLP and their recent adaptation to vision tasks, their computational cost remains a critical drawback compared to well-established Convolutional Neural Networks (CNNs). Compute and memory requirements grow quadratically with input length, which is a serious problem in resource-constrained settings. The authors propose a method, Patchout, that reduces the compute and memory needed to train transformers on audio spectrograms, with a focus on audio classification tasks.
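
To make the quadratic growth concrete, the following is a back-of-the-envelope estimate rather than a figure from the paper. It assumes standard dense self-attention over $n$ input patches with embedding dimension $d$; the symbols $n$, $d$, and the kept fraction $\rho$ are introduced here for illustration only.

```latex
\begin{align}
\text{cost}_{\text{attn}}(n) &\propto n^{2} d, \\
\frac{\text{cost}_{\text{attn}}(\rho n)}{\text{cost}_{\text{attn}}(n)}
  &= \frac{(\rho n)^{2} d}{n^{2} d} = \rho^{2}, \qquad 0 < \rho \le 1 .
\end{align}
```

For example, keeping half of the patches ($\rho = 0.5$) cuts the attention cost to roughly a quarter. Components that scale linearly in $n$, such as the per-token feed-forward layers, shrink only by the factor $\rho$, so the overall saving lies between the two.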

Contributions and Methodology

The primary contribution is Patchout, a method that reduces both compute and memory requirements during training. Some patches of the input spectrogram are omitted in each training step, either as individual randomly chosen patches (unstructured Patchout) or as whole rows or columns of the patch grid, corresponding to frequency bands or time frames (structured Patchout). Omitting patches shortens the input sequence, and because the cost of self-attention grows quadratically with sequence length, the computational demand falls faster than the number of patches. Patchout also acts as a regularizer, which can improve how well the models generalize.
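
The patch-dropping idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (see the PaSST repository for that): the function name `patchout`, its parameters, and the 12 x 100 patch grid are hypothetical, and it assumes that structured dropping removes whole frequency rows or time columns of the grid while unstructured dropping removes individual patches.

```python
import numpy as np

def patchout(patches, n_freq, n_time, rng,
             drop_patches=0, drop_freq_rows=0, drop_time_cols=0):
    """Drop patches from a flattened (n_freq * n_time, dim) patch sequence.

    Hypothetical helper: patches are assumed to be stored row-major over
    the (frequency, time) grid. Returns the kept patches and their flat
    indices, so positional information can still be looked up later.
    """
    keep = np.ones((n_freq, n_time), dtype=bool)

    # Structured: remove entire frequency rows and time columns.
    if drop_freq_rows:
        rows = rng.choice(n_freq, size=drop_freq_rows, replace=False)
        keep[rows, :] = False
    if drop_time_cols:
        cols = rng.choice(n_time, size=drop_time_cols, replace=False)
        keep[:, cols] = False

    # Unstructured: remove individual patches among those still present.
    if drop_patches:
        remaining = np.flatnonzero(keep)
        dropped = rng.choice(remaining, size=drop_patches, replace=False)
        keep.ravel()[dropped] = False  # ravel() is a view of the mask here

    kept_idx = np.flatnonzero(keep)
    return patches[kept_idx], kept_idx

# Example with made-up sizes: 12 frequency x 100 time patches, 4-dim embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(12 * 100, 4))
kept, idx = patchout(x, 12, 100, rng,
                     drop_patches=50, drop_freq_rows=2, drop_time_cols=20)
print(kept.shape)  # (750, 4): (12 - 2) * (100 - 20) - 50 patches remain
```

Only the kept patches are passed on to the transformer, so the shorter sequence is what produces the savings. In this sketch, dropping would be applied during training only, with the full sequence used at inference.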

The study also proposes a disentangled positional encoding in which the time and frequency dimensions are encoded separately, which simplifies the processing of audio snippets of variable length. This separation allows inference on clips of different lengths without the additional fine-tuning or interpolation of positional encodings that standard transformer architectures typically require.
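
A separate encoding per axis can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the paper's code: the function name `add_disentangled_pos` and all shapes are hypothetical, the encodings are random stand-ins for learned parameters, and shorter clips are assumed to be handled by slicing the time encoding.

```python
import numpy as np

def add_disentangled_pos(patch_grid, freq_enc, time_enc):
    """Add separate frequency and time encodings to a grid of patch embeddings.

    patch_grid: (n_freq, n_time, dim) patch embeddings
    freq_enc:   (n_freq, dim)   one vector per frequency position
    time_enc:   (max_time, dim) one vector per time position
    """
    n_freq, n_time, dim = patch_grid.shape
    if n_time > time_enc.shape[0]:
        raise ValueError("clip is longer than the available time encoding")
    # Slice the time encoding to the clip length, then broadcast both
    # encodings over the grid: each patch gets freq_enc[f] + time_enc[t].
    return patch_grid + freq_enc[:, None, :] + time_enc[None, :n_time, :]

# Example with made-up sizes: encodings sized for 100 time steps,
# applied unchanged to a shorter clip with 40 time steps.
rng = np.random.default_rng(0)
freq_enc = rng.normal(size=(12, 8))
time_enc = rng.normal(size=(100, 8))
short_clip = np.zeros((12, 40, 8))
encoded = add_disentangled_pos(short_clip, freq_enc, time_enc)
print(encoded.shape)  # (12, 40, 8)
```

Because the time encoding is indexed only by time position, a shorter clip reuses the first rows of the same table, whereas a single joint encoding over the whole grid would have to be resized to fit.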

Results

The authors report state-of-the-art performance on Audioset and show that their transformers outperform CNNs in both predictive performance and training speed. These results are obtained with models that can be trained on a single consumer-grade GPU, which makes this class of model considerably more accessible.

Numerical results show that model variants using structured Patchout perform best, with mean average precision (mAP) reaching 0.471 on Audioset, an improvement over previous state-of-the-art results. Furthermore, the proposed models train up to eight times faster than previous transformer models while requiring considerably less GPU memory.
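
For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP for multi-label classification: average precision is computed per class from the ranked scores and then averaged over classes. The function names and toy data are hypothetical, and the paper's evaluation code may differ in details such as tie handling.

```python
import numpy as np

def average_precision(labels, scores):
    """Average precision for one class: mean precision at each true positive."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order].astype(bool)
    if not hits.any():
        return float("nan")  # undefined when the class has no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

def mean_average_precision(label_matrix, score_matrix):
    """mAP over classes; inputs have shape (n_clips, n_classes)."""
    aps = [average_precision(label_matrix[:, c], score_matrix[:, c])
           for c in range(label_matrix.shape[1])]
    return float(np.nanmean(aps))

# Toy example: 4 clips, 2 classes (values invented for illustration).
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]])
print(mean_average_precision(y_true, y_score))
```

In the toy data, the first class has positives at ranks 1 and 3 (precisions 1 and 2/3) and the second class has its single positive at rank 1, so the two per-class values being averaged are 5/6 and 1.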

The study also evaluates how well these models transfer to downstream tasks, using the OpenMIC, ESC50, TAU Urban Acoustic Scenes, and FSD50K datasets. After fine-tuning on these tasks, the models pre-trained on Audioset show substantial improvements over CNNs, which suggests that the approach applies broadly.

Implications and Future Work

The findings have both practical and theoretical implications. Practically, the work makes it more feasible to train and deploy state-of-the-art transformer models for audio processing with limited computational resources. Theoretically, it opens new avenues for adapting transformer architectures to domains traditionally dominated by CNNs, and it challenges long-held assumptions about the best network architecture for spectrogram-based audio analysis.

In conclusion, Patchout mitigates the computational drawbacks of transformers and makes them more competitive and efficient for audio classification tasks. Future research could extend the method to other sequence-based tasks in other domains or explore further optimizations of sequential input processing in transformer architectures. Its flexibility and demonstrated performance suggest that it could be adapted to other data modalities with similar computational demands.
