
Audio Retrieval with Natural Language Queries: A Benchmark Study (2112.09418v2)

Published 17 Dec 2021 in eess.AS, cs.IR, and cs.SD

Abstract: The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries. Code, audio features for all datasets used, and the SoundDescs dataset are publicly available at https://github.com/akoepke/audio-retrieval-benchmark.

Citations (88)

Summary

  • The paper introduces novel benchmarks for audio-text and text-audio retrieval using datasets like AudioCaps and SoundDescs.
  • The paper employs pre-trained models and aggregates multiple audio experts, significantly enhancing retrieval performance.
  • The paper’s findings suggest practical benefits for multimedia search and cultural heritage, paving the way for future research.

Audio Retrieval with Natural Language Queries: A Benchmark Study

This paper presents a comprehensive study of audio retrieval using natural language queries, addressing a gap in content-based multimedia retrieval systems. The authors introduce novel benchmarks for text-audio and audio-text retrieval tasks, along with a new dataset, aimed at facilitating intuitive, text-based searches over extensive auditory databases.

Overview of Objectives and Methods

The primary objective of this work is to enable cross-modal retrieval, specifically, text-audio and audio-text retrieval, by matching audio content with written descriptions and vice versa. This approach allows users to search for audio using free-form natural language, creating a more accessible interface for exploring vast audio collections. The authors have constructed text-audio and audio-text retrieval benchmarks using existing datasets such as AudioCaps and Clotho, and further introduce the SoundDescs benchmark. SoundDescs pairs audio samples with diverse natural language descriptions, augmenting the content and scope of available audio sources.
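The core idea of cross-modal retrieval described above can be sketched as ranking a pool of audio embeddings by similarity to a text-query embedding in a shared space. The snippet below is a minimal illustration with placeholder random embeddings standing in for the outputs of real text and audio encoders; it is not the paper's architecture.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between each row of a and each row of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def rank_audio_for_query(text_emb, audio_embs):
    # text_emb: (d,) query embedding; audio_embs: (N, d) candidate pool.
    sims = cosine_sim(text_emb[None, :], audio_embs)[0]
    return np.argsort(-sims)  # candidate indices, best match first

# Toy pool of 4 audio embeddings in a shared 8-dim space (placeholders
# for outputs of trained text/audio encoders).
rng = np.random.default_rng(0)
audio_embs = rng.normal(size=(4, 8))
query_emb = audio_embs[2] + 0.01 * rng.normal(size=8)  # near candidate 2
ranking = rank_audio_for_query(query_emb, audio_embs)
print(ranking[0])  # index of the best-matching audio clip
```

Audio-text retrieval is the symmetric direction: an audio embedding queries a pool of text embeddings using the same similarity.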

To evaluate the retrieval task, the authors set baseline performances using various models adapted from video retrieval frameworks. These models benefit from pre-training on diverse audio tasks, showcasing the adaptability of pre-trained models when extended to cross-modal tasks. The integration of multiple datasets during pre-training is demonstrated to enhance model performance.
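One common way to combine features from multiple pre-trained audio "experts" is to normalise each expert's embedding and concatenate them, optionally with per-expert weights. The sketch below uses this simple scheme with hypothetical expert dimensions; the models in the paper use learned, video-retrieval-style fusion rather than plain concatenation.

```python
import numpy as np

def aggregate_experts(expert_feats, weights=None):
    # expert_feats: list of 1-D arrays, one per pre-trained audio expert
    # (e.g. features from models trained on tagging, sound classification).
    # Each expert is L2-normalised so no single feature scale dominates,
    # then the weighted features are concatenated into one joint vector.
    if weights is None:
        weights = [1.0] * len(expert_feats)
    parts = []
    for w, f in zip(weights, expert_feats):
        f = f / (np.linalg.norm(f) + 1e-8)  # epsilon guards zero vectors
        parts.append(w * f)
    return np.concatenate(parts)

# Placeholder features with dimensions typical of audio experts.
expert_a = np.ones(128)   # hypothetical 128-dim expert embedding
expert_b = np.ones(512)   # hypothetical 512-dim expert embedding
joint = aggregate_experts([expert_a, expert_b])
print(joint.shape)  # (640,)
```

In practice the joint vector would be projected into the shared text-audio embedding space by a learned layer.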

Numerical Results and Empirical Findings

The paper provides detailed numerical baselines across the proposed benchmarks. Substantial improvements were noted with pre-training on the SoundDescs dataset, pointing to the advantages of leveraging datasets rich in audio and textual variation. A notable finding is that aggregating multiple audio experts enhances retrieval performance. Additionally, the research examines how audio file duration relates to retrieval efficacy, finding consistent performance across audio segments of varied length, including long ones.
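Retrieval benchmarks of this kind are typically scored with recall@k: the fraction of queries whose ground-truth match appears in the top k ranked candidates. A minimal implementation, assuming the standard setup where query i is paired with candidate i, is:

```python
import numpy as np

def recall_at_k(sim_matrix, k):
    # sim_matrix[i, j]: similarity of query i to candidate j.
    # Ground truth: query i matches candidate i (the usual convention
    # when a benchmark pairs each caption with one audio clip).
    ranks = np.argsort(-sim_matrix, axis=1)  # best-first per query
    hits = (ranks[:, :k] == np.arange(len(sim_matrix))[:, None]).any(axis=1)
    return hits.mean()

# Toy similarity matrix: query 2's true match is only ranked second.
sims = np.array([[0.9, 0.1, 0.2],
                 [0.3, 0.8, 0.1],
                 [0.2, 0.7, 0.4]])
print(recall_at_k(sims, 1))  # 2 of 3 matches at rank 1 -> ~0.667
print(recall_at_k(sims, 2))  # all matches within top 2 -> 1.0
```

Reported results usually include R@1, R@5, and R@10 for both retrieval directions.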

Implications and Future Directions

The implications of this research are manifold, suggesting practical benefits in areas such as multimedia search, conservation through auditory monitoring, cultural heritage access, and creative industries such as podcast and audiobook production. The theoretical contributions lie in the extension of cross-modal retrieval models, potentially informing future advances in AI-driven audio understanding.

Future research directions could explore further integration with audio-visual data to enhance the retrieval performance and investigate more diverse and complex datasets. The benchmarks and datasets introduced in this paper open opportunities for continued exploration, refinement of retrieval models, and expansions into real-world applications.

Overall, this paper establishes a significant step towards comprehensive audio retrieval systems queried through natural language, laying the groundwork for future explorations into the vast potentials of multimedia retrieval technologies.
