Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion (2301.11757v3)
Abstract: Recent years have seen the rapid development of large generative models for text; however, much less research has explored the connection between text and another "language" of communication -- music. Music, much like text, can convey emotions, stories, and ideas, and has its own unique structure and syntax. In our work, we bridge text and music via a text-to-music generation model that is highly efficient, expressive, and can handle long-term structure. Specifically, we develop Moûsai, a cascading two-stage latent diffusion model that can generate multiple minutes of high-quality stereo music at 48 kHz from textual descriptions. Moreover, our model is highly efficient, enabling real-time inference at reasonable speed on a single consumer GPU. Through experiments and property analyses, we show our model's competence over a variety of criteria compared with existing music generation models. Lastly, to promote open-source culture, we provide a collection of open-source libraries with the hope of facilitating future work in the field. We open-source the following: Code: https://github.com/archinetai/audio-diffusion-pytorch; music samples for this paper: http://bit.ly/44ozWDH; all music samples for all models: https://bit.ly/audio-diffusion.
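To make the two-stage cascade concrete, the sketch below mimics its control flow: an outer stage that decodes a compressed latent back to waveform samples, and an inner stage that iteratively denoises a text-conditioned latent from noise. This is only a toy illustration of the pipeline shape, not the paper's architecture: every function, the linear "denoiser", and all dimensions are invented placeholders standing in for the trained diffusion U-Nets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "text encoder": bucket words into a fixed-size embedding.
# (The real model uses a pretrained transformer text encoder.)
def embed_text(prompt: str, dim: int = 8) -> np.ndarray:
    vec = np.zeros(dim)
    words = prompt.lower().split()
    for word in words:
        vec[sum(map(ord, word)) % dim] += 1.0
    return vec / max(len(words), 1)

# Inner stage: iteratively denoise a latent, conditioned on the text.
# A random linear map stands in for the trained text-conditioned denoiser.
def generate_latent(text_emb: np.ndarray, latent_len: int = 16, steps: int = 50) -> np.ndarray:
    z = rng.standard_normal(latent_len)      # start from pure noise
    W = rng.standard_normal((latent_len, text_emb.size)) * 0.1
    target = W @ text_emb                    # pretend conditioning signal
    for t in range(steps):
        z = z + (target - z) / (steps - t)   # crude denoising update toward the target
    return z

# Outer stage: "decode" the latent back to audio samples.
# The real model runs a diffusion decoder; here we just upsample 64x.
def decode_latent(z: np.ndarray, upsample: int = 64) -> np.ndarray:
    return np.repeat(z, upsample)

emb = embed_text("calm piano with soft strings")
latent = generate_latent(emb)
audio = decode_latent(latent)
print(audio.shape)  # (1024,)
```

The point of the cascade is that the inner diffusion model only ever sees a short, heavily compressed latent sequence, which is what makes minute-scale context and fast consumer-GPU inference feasible.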
- BBC Music Magazine. 2022. Classical music: 50 greatest composers of all time. BBC Music Magazine.
- Michele Berlingerio and Francesca Bonin. 2018. Towards a music-language mapping. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Jeanette Bicknell. 2002. Can music convey semantic content? A Kantian approach. The Journal of Aesthetics and Art Criticism, 60(3):253–261.
- AudioLM: A language modeling approach to audio generation. CoRR, abs/2209.03143.
- Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription.
- Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
- Antoine Caillon and Philippe Esling. 2021. RAVE: A variational autoencoder for fast and high-quality neural audio synthesis. CoRR, abs/2111.05011.
- Muse: Text-to-image generation via masked generative transformers. CoRR, abs/2301.00704.
- Sheng-Kuan Chung. 2006. Digital storytelling in integrated arts education. The International Journal of Arts Education, 4(1):33–50.
- Sylvie Delacroix. 2023. Data rivers: Carving out the public domain in the age of Chat-GPT. Available at SSRN.
- Unsupervised audiovisual synthesis via exemplar autoencoders. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Jukebox: A generative model for music. CoRR, abs/2005.00341.
- The challenge of realistic music generation: Modelling raw audio at scale. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 8000–8010.
- CLAP: learning audio concepts from natural language supervision. CoRR, abs/2206.04769.
- Neural audio synthesis of musical notes with WaveNet autoencoders.
- GANSynth: Adversarial neural audio synthesis. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Taming transformers for high-resolution image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 12873–12883. Computer Vision Foundation / IEEE.
- European Commission. 2016. Proposal for a directive of the European parliament and of the council on copyright in the digital single market.
- Seth Forsgren and Hayk Martiros. 2022. Riffusion - Stable diffusion for real-time music generation.
- The exception for text and data mining (tdm) in the proposed directive on copyright in the digital single market-legal aspects. Centre for International Intellectual Property Studies (CEIPI) Research Paper, (2018-02).
- Mark Germer. 2011. Notes, 67(4):760–765.
- Learning dense representations for entity retrieval. In Computational Natural Language Learning (CoNLL).
- It’s raw! Audio generation with state-space models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 7616–7633. PMLR.
- Catch-a-waveform: Learning to generate audio from a single short example. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 20916–20928.
- Enabling factorized piano music modeling and generation with the MAESTRO dataset. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.
- Imagen video: High definition video generation with diffusion models. CoRR, abs/2210.02303.
- Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. CoRR, abs/2207.12598.
- Commu: Dataset for combinatorial music generation. CoRR, abs/2211.09385.
- Fréchet audio distance: A metric for evaluating music enhancement algorithms.
- Lip to speech synthesis with visual context attentional GAN. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 2758–2770.
- Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
- PANNs: Large-scale pretrained audio neural networks for audio pattern recognition.
- DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- AudioGen: Textually guided audio generation. CoRR, abs/2209.15352.
- MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 14881–14892.
- BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Autoregressive image generation using residual quantization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11513–11522. IEEE.
- BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis. CoRR, abs/2205.14807.
- CLIP-Event: Connecting text and images with event structures. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 16399–16408. IEEE.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
- SampleRNN: An unconditional end-to-end neural audio generation model. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.
- Rada Mihalcea and Carlo Strapparava. 2012. Lyrics, music, and emotions. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 590–599, Jeju Island, Korea. Association for Computational Linguistics.
- Chunked autoregressive GAN for conditional waveform synthesis. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Isabel Papadimitriou and Dan Jurafsky. 2020. Learning Music Helps You Read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839, Online. Association for Computational Linguistics.
- Marco Pasini and Jan Schlüter. 2022. Musika! Fast infinite waveform music generation. CoRR, abs/2208.08706.
- Diffusion autoencoders: Toward a meaningful and decodable representation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10609–10619. IEEE.
- Improving language understanding by generative pre-training. Technical report, OpenAI.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
- Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125.
- High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE.
- U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, volume 9351 of Lecture Notes in Computer Science, pages 234–241. Springer.
- DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. ArXiv, abs/2208.12242.
- Photorealistic text-to-image diffusion models with deep language understanding. CoRR, abs/2205.11487.
- Tim Salimans and Jonathan Ho. 2022. Progressive distillation for fast sampling of diffusion models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.
- Flavio Schneider. 2023. ArchiSound: Audio generation with diffusion.
- Jay A Seitz. 2005. Dalcroze, the body, movement and musicality. Psychology of music, 33(4):419–435.
- Denoising diffusion implicit models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Joseph P Swain. 1995. The concept of musical syntax. The Musical Quarterly, 79(2):281–308.
- Alan M. Turing. 1950. Computing machinery and intelligence. Mind, LIX(236):433–460.
- WaveNet: A generative model for raw audio. In The 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13-15 September 2016, page 125. ISCA.
- Neural discrete representation learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 6306–6315.
- Phenaki: Variable length video generation from open domain textual description. CoRR, abs/2210.02399.
- James Webster. 2001. Sonata form. The new Grove dictionary of music and musicians, 23:687–698.
- Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation.
- Diffsound: Discrete diffusion model for text-to-sound generation. CoRR, abs/2207.09983.
- Museformer: Transformer with fine- and coarse-grained attention for music generation. CoRR, abs/2210.10349.
- Scaling autoregressive models for content-rich text-to-image generation. CoRR, abs/2206.10789.
Authors: Flavio Schneider, Ojasv Kamal, Zhijing Jin, Bernhard Schölkopf