Quality-aware Masked Diffusion Transformer for Enhanced Music Generation (2405.15863v4)

Published 24 May 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.

Summary

The paper introduces QA-MDT, a masked diffusion transformer that feeds quality estimates, in the form of pseudo-MOS (p-MOS) scores, into the diffusion process to improve music generation.
It refines captions using pretrained music-annotation models and CLAP alignment to improve the quality and diversity of training data.
Experiments show significant FAD reductions and improved p-MOS scores relative to existing models such as AudioLDM and MusicLDM.

Overview of the Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

The paper "Quality-aware Masked Diffusion Transformer for Enhanced Music Generation" addresses a central obstacle in text-to-music (TTM) generation: the scarcity of high-quality music data. The authors identify key issues in existing open-source datasets, namely mislabeling, weak labeling, unlabeled data, and low-quality audio recordings, all of which impede effective model training. They introduce a Quality-aware Masked Diffusion Transformer (QA-MDT) that assesses and accounts for the quality of music waveforms during training.

Key Contributions

QA-MDT Architecture: The method centers on the QA-MDT framework, which adds a quality-aware mechanism to the diffusion transformer architecture. Pseudo-MOS (p-MOS) scores let the model discern audio quality during training, so that generation can be steered toward high-quality outputs. Quality information is injected at two granularities: coarse-grained through quality prefixes and fine-grained through quantized quality tokens.
Caption Refinement Strategy: The paper also addresses low-quality textual annotations through a caption refinement process. A pretrained music caption model enriches the textual data, and CLAP is used to check text-audio alignment. LLMs are then used to increase the diversity and specificity of captions, yielding better training data for the generative model.
Objective and Subjective Evaluation: The authors ran experiments using both objective metrics (Fréchet Audio Distance (FAD), KL divergence, and Inception Score) and subjective evaluations. The subjective evaluations were performed by human raters from various professional backgrounds, who assessed aspects such as overall audio quality and relevance to the text input.
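
The two quality signals described under "QA-MDT Architecture" can be illustrated with a minimal sketch. This is not the authors' implementation: the equal-frequency bin edges, the number of levels, the prefix wording, and the function names `quality_levels` and `add_quality_prefix` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical prefix wording; the summary only says "quality prefixes" are used.
PREFIXES = ("low quality", "medium quality", "high quality")

def quality_levels(pmos, num_levels=5):
    """Quantize pseudo-MOS scores into integer levels 0..num_levels-1.

    Bin edges are equal-frequency quantiles of the given scores (an assumption).
    """
    pmos = np.asarray(pmos, dtype=float)
    # Interior quantile edges, e.g. 20/40/60/80 % for five levels.
    edges = np.quantile(pmos, np.linspace(0, 1, num_levels + 1)[1:-1])
    return np.digitize(pmos, edges)

def add_quality_prefix(caption, level, num_levels=5):
    """Prepend a coarse text prefix derived from the fine-grained level."""
    idx = min(level * len(PREFIXES) // num_levels, len(PREFIXES) - 1)
    return f"{PREFIXES[idx]}, {caption}"

scores = [2.1, 2.8, 3.3, 3.9, 4.6]           # made-up p-MOS values
levels = quality_levels(scores)               # fine-grained token ids
captions = [add_quality_prefix("a calm piano piece", int(l)) for l in levels]
```

In a setup like this, the integer level would index a learnable embedding table (the fine-grained quality token), while the prefixed caption would go through the text encoder (the coarse-grained signal). At inference time, requesting the highest level and the "high quality" prefix is how such a model would be steered toward better-sounding output.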

Experimental Insights

QA-MDT outperformed prior models on the MusicCaps benchmark and other public datasets. Objective evaluations showed significant reductions in FAD and improvements in p-MOS scores, indicating enhanced audio quality and diversity. Subjective tests corroborated these findings: QA-MDT received higher ratings for overall quality and text relevance than existing models such as AudioLDM and MusicLDM.
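
For readers unfamiliar with the metric, FAD is usually defined as the Fréchet distance between two Gaussians fitted to embeddings of reference audio and generated audio. The standard form is given below, where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of the reference and generated embeddings; the embedding model used in the paper is not stated in this summary.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated distribution is closer to the reference distribution, which is why a reduction in FAD is reported as an improvement.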

The paper also presents ablation studies on the effects of different architectural components and strategies. One major conclusion is that smaller patch sizes, combined with overlap between patches in the model's patchify strategy, model audio spectra better, improving both the objective metrics and the perceived musicality of the generated pieces.
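
What "overlap" means in a patchify step can be shown with a small sketch. The 8 x 8 input, the patch size of 4, and the strides are illustrative values only, and `patchify` is a hypothetical helper rather than the paper's code; the actual patch sizes used in the ablations are not given in this summary.

```python
import numpy as np

def patchify(latent, patch, stride):
    """Split a 2-D (time x frequency) array into flattened square patches.

    stride < patch gives overlapping patches; stride == patch gives none.
    """
    T, F = latent.shape
    rows = range(0, T - patch + 1, stride)
    cols = range(0, F - patch + 1, stride)
    return np.stack([latent[r:r + patch, c:c + patch].ravel()
                     for r in rows for c in cols])

latent = np.arange(8 * 8, dtype=float).reshape(8, 8)   # toy spectrogram latent
no_overlap = patchify(latent, patch=4, stride=4)       # 2 x 2 grid of patches
overlap = patchify(latent, patch=4, stride=2)          # 3 x 3 grid of patches
print(no_overlap.shape, overlap.shape)                 # (4, 16) (9, 16)
```

Overlap raises the number of tokens the transformer sees for the same input (9 instead of 4 here), so the finer modeling of the spectrum comes at a higher sequence length and compute cost.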

Implications and Future Directions

This research has both practical and theoretical implications. Practically, QA-MDT offers a more reliable framework for generating music that maintains high fidelity and aligns well with textual descriptions. Its quality-aware design is a significant step toward handling the quality discrepancies inherent in large-scale music datasets.

Theoretically, this work opens several avenues for future research. One is optimizing melodic structures in music generation to enhance aesthetic appeal. Another is exploring how QA-MDT scales to long-duration audio sequences, which could provide further insight into how generative models handle temporal correlation. As the field evolves, more sophisticated quality control mechanisms could further improve results.

In conclusion, QA-MDT offers a compelling solution to the challenges that diffusion models face in the TTM domain, and sets a new standard for building high-performance music generation systems from open-source, large-scale datasets.
