
Improving Text-To-Audio Models with Synthetic Captions

(2406.15487)
Published Jun 18, 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract

It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model to synthesize accurate and diverse captions for audio at scale. We leverage this pipeline to produce a dataset of synthetic captions for AudioSet, named AF-AudioSet, and then evaluate the benefit of pre-training text-to-audio models on these synthetic captions. Through systematic evaluations on AudioCaps and MusicCaps, we find leveraging our pipeline and synthetic captions leads to significant improvements on audio generation quality, achieving a new state-of-the-art.

Figure: Distribution of sound types in AF-AudioSet.

Overview

  • The paper addresses the challenge of acquiring high-quality audio-caption pairs for Text-To-Audio (TTA) model training by generating synthetic captions using a pre-trained audio language model, resulting in the creation of the AF-AudioSet dataset.

  • The study systematically evaluates the impact of pre-training TTA models on AF-AudioSet, finding that it significantly enhances audio generation quality and leads to state-of-the-art performance on several benchmarks such as AudioCaps and MusicCaps.

  • The authors highlight the benefits of combining synthetic and real data during pre-training, emphasizing the potential for further improvements and the importance of this approach in the future development of advanced TTA models.

Improving Text-To-Audio Models with Synthetic Captions

The paper "Improving Text-To-Audio Models with Synthetic Captions" presents a novel approach to address the challenges in training robust Text-To-Audio (TTA) models, primarily focusing on the limitations in obtaining high-quality training data, specifically audio caption pairs. This work proposes an innovative solution through the generation of synthetic captions using an audio language model, leading to the creation of a substantial dataset named AF-AudioSet. The authors convincingly demonstrate that pre-training TTA models on this synthetic dataset significantly enhances the audio generation quality, achieving state-of-the-art performance in several benchmarks.

Key Challenges and Proposed Solution

One of the critical challenges in TTA model training is the scarcity of high-quality audio-caption pairs, a stark contrast to the abundance of data in the text-to-image domain. Traditional methods of augmenting such pairs include transforming tags and labels into natural language or manipulating existing audio-caption pairs. However, these approaches are inherently limited and often result in suboptimal alignment between the audio and text data.

To overcome these limitations, the authors use a pre-trained audio language model, specifically the Audio Flamingo chat model, to generate synthetic captions for audio samples. This model produces diverse and contextually relevant captions thanks to its training on extensive dialogues. The generated captions are then ranked by their CLAP (Contrastive Language-Audio Pretraining) similarity with the corresponding audio and filtered by a similarity threshold, retaining only high-quality synthetic captions.
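The following minimal sketch illustrates this kind of caption-synthesis loop. The helpers generate_captions (sampling from an audio language model such as Audio Flamingo) and clap_similarity (a CLAP audio-text scorer) are hypothetical stand-ins, and the candidate count and threshold are illustrative rather than the paper's exact settings.

```python
def synthesize_caption(audio_path, num_candidates=8, clap_threshold=0.45):
    """Generate candidate captions for one clip, rank them by CLAP
    audio-text similarity, and keep the best one only if it clears
    the threshold."""
    candidates = generate_captions(audio_path, n=num_candidates)         # audio LM sampling (hypothetical helper)
    scored = [(clap_similarity(audio_path, c), c) for c in candidates]   # CLAP scoring (hypothetical helper)
    best_score, best_caption = max(scored)                               # highest-similarity candidate
    return best_caption if best_score >= clap_threshold else None        # drop poorly aligned captions

# Building AF-AudioSet-style pairs over a collection of clips:
# pairs = {p: c for p in audioset_clips if (c := synthesize_caption(p)) is not None}
```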

Dataset and Experimental Methodology

The resulting dataset, AF-AudioSet, is a large-scale collection of synthetic captions filtered at various CLAP similarity thresholds to balance quality and quantity. The authors systematically evaluate the effect of pre-training TTA models on AF-AudioSet using benchmarks such as AudioCaps and MusicCaps. They explore different model architectures and sizes, demonstrating the broad applicability of their approach.

The pre-trained models undergo further fine-tuning on the specific benchmarks, with metrics such as Fréchet Distance (FD), Inception Score (IS), CLAP similarity, and Fréchet Audio Distance (FAD) used to measure performance. The systematic evaluations reveal that pre-training on AF-AudioSet, particularly on smaller, higher-quality subsets, consistently improves audio generation quality, outperforming recent state-of-the-art methods.
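As a concrete illustration of one of these metrics, the snippet below computes a CLAP-similarity score as the mean cosine similarity between CLAP embeddings of generated audio clips and their text prompts. It assumes the open-source laion_clap package; the exact CLAP checkpoint and evaluation protocol used in the paper may differ.

```python
import torch
import torch.nn.functional as F
import laion_clap

# Load a pretrained CLAP model (assumes the laion_clap package and its default checkpoint).
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()

def clap_score(audio_files, prompts):
    """Mean audio-text cosine similarity over a batch of generated clips."""
    audio_emb = clap.get_audio_embedding_from_filelist(x=audio_files, use_tensor=True)
    text_emb = clap.get_text_embedding(prompts, use_tensor=True)
    return F.cosine_similarity(audio_emb, text_emb, dim=-1).mean().item()
```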

Results and Findings

The study shows that pre-training on AF-AudioSet leads to significant performance improvements across various settings. For instance, on the AudioCaps benchmark, models pre-trained on the dataset with a CLAP threshold of 0.45 show marked improvements in FD, IS, and CLAP similarity. Similar trends are observed in the MusicCaps benchmark, where a threshold of 0.35 yields the best results. Notably, these improvements are evident across different model sizes and architectures, underscoring the robustness of the proposed approach.

Furthermore, the authors highlight that combining synthetic and real data during pre-training can lead to even better performance, suggesting a synergistic effect that leverages the strengths of both data types. This mixed pre-training strategy results in state-of-the-art performance on both text-to-audio and text-to-music generation tasks, confirming the efficacy of the synthetic dataset.
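A simple way to realize such a mixed pre-training set is to concatenate the synthetic and real datasets and sample from both in every batch. The sketch below uses standard PyTorch utilities; AFAudioSetCaptions and RealCaptions are hypothetical Dataset classes, and the roughly 50/50 sampling weights are illustrative, not the paper's actual mixing ratio.

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

synthetic = AFAudioSetCaptions(clap_threshold=0.45)  # synthetic audio-caption pairs (hypothetical class)
real = RealCaptions()                                # human-written audio-caption pairs (hypothetical class)
mixed = ConcatDataset([synthetic, real])

# Weight each sample by the inverse of its dataset size so each batch draws
# roughly equally from the synthetic and real sources.
weights = [1.0 / len(synthetic)] * len(synthetic) + [1.0 / len(real)] * len(real)
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=32, sampler=sampler)
```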

Implications and Future Directions

The implications of this research are significant for the future development of TTA models. The ability to generate high-quality synthetic captions at scale addresses a fundamental bottleneck in training data availability, enabling the development of more advanced and accurate TTA models. This approach also opens new avenues for exploring better dataset synthesis pipelines and pre-training strategies, which could further enhance model performance.

As audio language models continue to evolve and improve, the quality of synthetic captions is expected to increase, leading to even better alignment and diversity in the training data. Future research could investigate integrating more sophisticated contrastive audio-text embeddings and exploring other modalities in the synthesis process.

Conclusion

"Improving Text-To-Audio Models with Synthetic Captions" provides a comprehensive and effective solution to one of the critical challenges in TTA model training. By leveraging state-of-the-art audio language models to generate synthetic captions and systematically evaluating their impact on model performance, the authors demonstrate the potential of this approach to achieve significant improvements in audio generation quality. This work not only sets a new benchmark in TTA research but also paves the way for future advancements in multimodal AI systems.
