AudioLCM: Text-to-Audio Generation with Latent Consistency Models (2406.00356v2)

Published 1 Jun 2024 in eess.AS and cs.SD

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have propelled them to the forefront of various generative tasks. However, their iterative sampling process poses a significant computational burden, resulting in slow generation speeds and limiting their application in text-to-audio generation deployment. In this work, we introduce AudioLCM, a novel consistency-based model tailored for efficient and high-quality text-to-audio generation. AudioLCM integrates Consistency Models into the generation process, facilitating rapid inference through a mapping from any point at any time step to the trajectory's initial point. To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver. This innovation shortens the time schedule from thousands to dozens of steps while maintaining sample quality, thereby achieving fast convergence and high-quality generation. Furthermore, to optimize the performance of transformer-based neural network architectures, we integrate the advanced techniques pioneered by LLaMA into the foundational framework of transformers. This architecture supports stable and efficient training, ensuring robust performance in text-to-audio synthesis. Experimental results on text-to-sound generation and text-to-music synthesis tasks demonstrate that AudioLCM needs only 2 iterations to synthesize high-fidelity audios, while it maintains sample quality competitive with state-of-the-art models using hundreds of steps. AudioLCM enables a sampling speed of 333x faster than real-time on a single NVIDIA 4090Ti GPU, making generative models practically applicable to text-to-audio generation deployment. Our extensive preliminary analysis shows that each design in AudioLCM is effective.

Summary

The paper demonstrates that integrating consistency models into text-to-audio synthesis enables high-fidelity generation with only 2 inference steps.
The authors employ guided latent consistency distillation and multi-step ODE solvers to reduce computational demands and enhance convergence rates.
Empirical results show AudioLCM sampling 333 times faster than real-time on a single GPU while keeping sample quality competitive with state-of-the-art models that use hundreds of steps.

Analysis of AudioLCM: A Text-to-Audio Generative Model Based on Latent Consistency

The paper "AudioLCM: Text-to-Audio Generation with Latent Consistency Models" introduces a faster method for text-to-audio synthesis. It targets the main limitation of existing Latent Diffusion Models (LDMs): iterative sampling makes inference slow and computationally expensive. The proposed model, AudioLCM, integrates Consistency Models (CMs) into the generative process to achieve rapid, high-quality audio generation from text inputs.

Methodology

AudioLCM learns a consistency function that maps any point on a sampling trajectory, at any time step, directly to the trajectory's initial point. This removes most of the iterative denoising that traditional LDMs require, cutting computational cost while maintaining sample quality. Because LDMs converge poorly when the number of sampling iterations is reduced, the authors propose Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver, which shortens the time schedule from thousands to dozens of steps.
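
The idea of a consistency function and few-step sampling can be illustrated with a toy example. The sketch below is not AudioLCM's network or training procedure: it assumes scalar Gaussian data N(MU, S^2) and a noising process x_t = x_0 + t * noise, for which the consistency function has a closed form. The names `consistency_fn`, `trajectory_point`, and `multistep_sample`, and the values of `MU`, `S`, and the time steps, are all hypothetical choices made for illustration.

```python
import math
import random

# Toy data distribution N(MU, S^2); hypothetical values chosen for illustration.
MU, S = 2.0, 0.5


def consistency_fn(x, t):
    """Closed-form consistency function for Gaussian data under x_t = x_0 + t * noise.

    Maps a point x at noise level t to the start (t = 0) of its ODE trajectory.
    At t = 0 it is the identity, which is the boundary condition.
    """
    return MU + (x - MU) * S / math.sqrt(S * S + t * t)


def trajectory_point(c, t):
    """Point at noise level t on the toy ODE trajectory indexed by the constant c."""
    return MU + c * math.sqrt(S * S + t * t)


def multistep_sample(times, rng):
    """Few-step sampling: jump to a clean estimate, re-noise, jump again.

    `times` is a decreasing list of noise levels, e.g. [80.0, 1.0] for two steps.
    """
    x = rng.gauss(0.0, times[0])          # start from (approximately) pure noise
    x = consistency_fn(x, times[0])       # step 1: map directly to a clean estimate
    for t in times[1:]:
        x_noisy = x + t * rng.gauss(0.0, 1.0)  # re-noise to an intermediate level
        x = consistency_fn(x_noisy, t)         # map back to a clean estimate
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [multistep_sample([80.0, 1.0], rng) for _ in range(5)]
    print(samples)
```

In this toy setting two points on the same trajectory map to the same output, which is the self-consistency property; a learned model can only approximate it, which is why a second step that re-noises and maps back can improve quality.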

The authors also integrate techniques pioneered by LLaMA into the transformer backbone. The resulting architecture supports stable and efficient training and is described as supporting variable-length audio generation.
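
The text above does not say which LLaMA techniques are adopted. As one illustration only, LLaMA is known for using RMSNorm in place of LayerNorm; the sketch below shows that operation on a plain Python list. Whether AudioLCM uses exactly this layer is an assumption, and the function name `rms_norm` and the `eps` default are hypothetical.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide a vector by its root mean square, then apply a per-element gain.

    Unlike LayerNorm, no mean is subtracted and no bias is added.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


if __name__ == "__main__":
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

Because the output depends only on the direction of the input (up to `eps`), rescaling the activations leaves the result nearly unchanged, which is one reason this kind of normalization is associated with stable training.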

Empirical Results

The empirical evaluation covers both text-to-sound and text-to-music generation. AudioLCM needs only 2 inference steps to synthesize high-fidelity audio, with sample quality competitive with state-of-the-art models that use hundreds of steps. It samples 333 times faster than real-time on a single NVIDIA 4090Ti GPU. This real-time factor (RTF) makes the model practical for real-world settings where efficient audio generation is crucial.
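
To make the speed figure concrete, the short sketch below converts a speedup over real-time into wall-clock time. It assumes the common convention that RTF is generation time divided by audio duration; the 10-second clip length is a hypothetical example, not a figure from the paper.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF under the assumed convention: time spent generating per second of audio."""
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, speedup):
    """Wall-clock seconds needed for a clip at a given speedup over real-time."""
    return audio_seconds / speedup


if __name__ == "__main__":
    clip = 10.0  # hypothetical clip length in seconds
    t = generation_time(clip, 333.0)
    print(f"{clip:.0f} s of audio in about {t * 1000:.0f} ms, RTF = {real_time_factor(t, clip):.4f}")
```

Under these assumptions, a 333x speedup corresponds to an RTF of about 0.003, so a 10-second clip would take roughly 30 milliseconds to generate.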

Objective metrics include Kullback-Leibler (KL) divergence, Fréchet Audio Distance (FAD), and cross-modal alignment measures such as CLAP score, on which AudioLCM reports favorable results. Subjective evaluations support these findings, with human raters preferring the naturalness and faithfulness of AudioLCM-generated samples over competing systems.
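
For readers unfamiliar with these metrics, the sketch below shows the underlying formulas in simplified form. It is not the paper's evaluation code: KL divergence is computed between two small discrete distributions (in practice these would typically be classifier outputs for generated and reference audio), and the Fréchet distance assumes diagonal covariances so it needs no matrix square root, whereas FAD proper uses full covariance matrices of audio embeddings. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def frechet_distance_diag(mean_a, var_a, mean_b, var_b):
    """Squared Fréchet distance between two Gaussians with diagonal covariances.

    Simplification of the full formula
    |mu_a - mu_b|^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)),
    which reduces per dimension to (mu_a - mu_b)^2 + (sigma_a - sigma_b)^2.
    """
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += (ma - mb) ** 2 + (math.sqrt(va) - math.sqrt(vb)) ** 2
    return total


if __name__ == "__main__":
    print(kl_divergence([0.5, 0.5], [0.25, 0.75]))
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

Both quantities are zero when the two distributions match and grow as they diverge, so lower values indicate generated audio whose statistics are closer to the reference set.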

Theoretical and Practical Implications

Integrating consistency models into text-to-audio generation challenges the assumption that high-quality synthesis requires long iterative denoising. The results point to a way of reducing computational cost, a critical barrier to deploying such models at scale.

Practically, AudioLCM's enhanced capabilities directly translate to improved user experiences in applications spanning diverse domains, including automated music composition, personalized sound effect generation, and augmented reality technologies. The reduction in latency and increase in generation speed make it an attractive choice for industries where efficient and real-time audio synthesis is required.

Future Directions

Although AudioLCM makes notable advances, future research could focus on further minimizing discretization errors associated with multi-step ODE sampling processes. Exploring adaptive guidance parameters or more sophisticated distillation strategies may yield even higher fidelity audio samples.
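
The guidance parameter mentioned here can be pictured with the standard classifier-free guidance combination, sketched below. The source does not give AudioLCM's exact formulation, so the function name `guided_prediction`, the argument layout, and the example values are assumptions; the sketch only shows how a single scale trades off the conditional and unconditional predictions.

```python
def guided_prediction(cond, uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward
    (and, for scale > 1, past) the text-conditional one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]    # hypothetical text-conditional model output
    uncond = [0.0, 1.0]  # hypothetical unconditional model output
    for scale in (0.0, 1.0, 3.0):
        print(scale, guided_prediction(cond, uncond, scale))
```

A scale of 1 reproduces the conditional prediction and larger values push harder toward the text condition; an adaptive scheme would vary `scale` rather than fixing it.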

In summary, AudioLCM shows that consistency distillation can bring text-to-audio synthesis down to a few sampling steps while keeping sample quality competitive with much slower models. That combination of efficiency and quality makes it a useful reference point for future research and applications in audio generation.