
Improving Text-To-Audio Models with Synthetic Captions

(2406.15487)
Published Jun 18, 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art.

Figure: Distribution of sound types in AF-AudioSet.

Overview

  • The paper addresses the challenge of acquiring high-quality audio-caption pairs for Text-To-Audio (TTA) model training by generating synthetic captions using a pre-trained audio language model, resulting in the creation of the AF-AudioSet dataset.

  • The study systematically evaluates the impact of pre-training TTA models on AF-AudioSet, finding that it significantly enhances audio generation quality and leads to state-of-the-art performance on several benchmarks such as AudioCaps and MusicCaps.

  • The authors highlight the benefits of combining synthetic and real data during pre-training, emphasizing the potential for further improvements and the importance of this approach in the future development of advanced TTA models.

Improving Text-To-Audio Models with Synthetic Captions

The paper "Improving Text-To-Audio Models with Synthetic Captions" presents a novel approach to address the challenges in training robust Text-To-Audio (TTA) models, primarily focusing on the limitations in obtaining high-quality training data, specifically audio caption pairs. This work proposes an innovative solution through the generation of synthetic captions using an audio language model, leading to the creation of a substantial dataset named AF-AudioSet. The authors convincingly demonstrate that pre-training TTA models on this synthetic dataset significantly enhances the audio generation quality, achieving state-of-the-art performance in several benchmarks.

Key Challenges and Proposed Solution

One of the critical challenges in TTA model training is the scarcity of high-quality audio-caption pairs, a stark contrast to the abundance of data in the text-to-image domain. Traditional methods of augmenting such pairs include transforming tags and labels into natural language or manipulating existing audio-caption pairs. However, these approaches are inherently limited and often result in suboptimal alignment between the audio and text data.

To overcome these limitations, the authors use a pre-trained audio language model, specifically the Audio Flamingo chat model, to generate synthetic captions for audio samples. This model produces diverse and contextually relevant captions thanks to its training on extensive dialogues. The generated captions are then ranked by their CLAP (Contrastive Language-Audio Pretraining) similarity with the corresponding audio and filtered by a similarity threshold, retaining only high-quality synthetic captions.
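The following minimal sketch illustrates this kind of caption-synthesis loop. The helpers generate_captions (sampling from an audio language model such as Audio Flamingo) and clap_similarity (a CLAP audio-text scorer) are hypothetical stand-ins, and the candidate count and threshold are illustrative rather than the paper's exact settings.

```python
def synthesize_caption(audio_path, num_candidates=8, clap_threshold=0.45):
    """Generate candidate captions for one clip, rank them by CLAP
    audio-text similarity, and keep the best one only if it clears
    the threshold."""
    candidates = generate_captions(audio_path, n=num_candidates)         # audio LM sampling (hypothetical helper)
    scored = [(clap_similarity(audio_path, c), c) for c in candidates]   # CLAP scoring (hypothetical helper)
    best_score, best_caption = max(scored)                               # highest-similarity candidate
    return best_caption if best_score >= clap_threshold else None        # drop poorly aligned captions

# Building AF-AudioSet-style pairs over a collection of clips:
# pairs = {p: c for p in audioset_clips if (c := synthesize_caption(p)) is not None}
```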

Dataset and Experimental Methodology

The resulting dataset, AF-AudioSet, is a large-scale collection of synthetic captions filtered at various CLAP similarity thresholds to balance quality and quantity. The authors systematically evaluate the effect of pre-training TTA models on AF-AudioSet using benchmarks such as AudioCaps and MusicCaps. They explore different model architectures and sizes, demonstrating the broad applicability of their approach.

The pre-trained models undergo further fine-tuning on the specific benchmarks, with metrics such as Fréchet Distance (FD), Inception Score (IS), CLAP similarity, and Fréchet Audio Distance (FAD) used to measure performance. The systematic evaluations reveal that pre-training on AF-AudioSet, particularly on smaller, higher-quality subsets, consistently improves audio generation quality, outperforming recent state-of-the-art methods.
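As a concrete illustration of one of these metrics, the snippet below computes a CLAP-similarity score as the mean cosine similarity between CLAP embeddings of generated audio clips and their text prompts. It assumes the open-source laion_clap package; the exact CLAP checkpoint and evaluation protocol used in the paper may differ.

```python
import torch
import torch.nn.functional as F
import laion_clap

# Load a pretrained CLAP model (assumes the laion_clap package and its default checkpoint).
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

def clap_score(audio_files, prompts):
    """Mean audio-text cosine similarity over a batch of generated clips."""
    audio_emb = clap.get_audio_embedding_from_filelist(x=audio_files, use_tensor=True)
    text_emb = clap.get_text_embedding(prompts, use_tensor=True)
    return F.cosine_similarity(audio_emb, text_emb, dim=-1).mean().item()
```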

Results and Findings

The study shows that pre-training on AF-AudioSet leads to significant performance improvements across various settings. For instance, on the AudioCaps benchmark, models pre-trained on the dataset with a CLAP threshold of 0.45 show marked improvements in FD, IS, and CLAP similarity. Similar trends are observed in the MusicCaps benchmark, where a threshold of 0.35 yields the best results. Notably, these improvements are evident across different model sizes and architectures, underscoring the robustness of the proposed approach.

Furthermore, the authors highlight that combining synthetic and real data during pre-training can lead to even better performance, suggesting a synergistic effect that leverages the strengths of both data types. This mixed pre-training strategy results in state-of-the-art performance on both text-to-audio and text-to-music generation tasks, confirming the efficacy of the synthetic dataset.
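A simple way to realize such a mixed pre-training set is to concatenate the synthetic and real datasets and sample from both in every batch. The sketch below uses standard PyTorch utilities; AFAudioSetCaptions and RealCaptions are hypothetical Dataset classes, and the roughly 50/50 sampling weights are illustrative, not the paper's actual mixing ratio.

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

synthetic = AFAudioSetCaptions(clap_threshold=0.45)  # synthetic audio-caption pairs (hypothetical class)
real = RealCaptions()                                # human-written audio-caption pairs (hypothetical class)
mixed = ConcatDataset([synthetic, real])

# Weight each sample by the inverse of its dataset size so each batch draws
# roughly equally from the synthetic and real sources.
weights = [1.0 / len(synthetic)] * len(synthetic) + [1.0 / len(real)] * len(real)
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=32, sampler=sampler)
```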

Implications and Future Directions

The implications of this research are significant for the future development of TTA models. The ability to generate high-quality synthetic captions at scale addresses a fundamental bottleneck in training data availability, enabling the development of more advanced and accurate TTA models. This approach also opens new avenues for exploring better dataset synthesis pipelines and pre-training strategies, which could further enhance model performance.

As audio language models continue to evolve and improve, the quality of synthetic captions is expected to increase, leading to even better alignment and diversity in the training data. Future research could investigate integrating more sophisticated contrastive audio-text embeddings and exploring other modalities in the synthesis process.

Conclusion

"Improving Text-To-Audio Models with Synthetic Captions" provides a comprehensive and effective solution to one of the critical challenges in TTA model training. By leveraging state-of-the-art audio language models to generate synthetic captions and systematically evaluating their impact on model performance, the authors demonstrate the potential of this approach to achieve significant improvements in audio generation quality. This work not only sets a new benchmark in TTA research but also paves the way for future advancements in multimodal AI systems.
