Quality-aware Masked Diffusion Transformer for Enhanced Music Generation (2405.15863v4)
Abstract: Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands large amounts of high-quality data, which available datasets rarely provide. Most open-source datasets suffer from low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties of the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets, including MusicCaps and the Song-Describer Dataset, on both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.