Long-form music generation with latent diffusion

(2404.10301)
Published Apr 16, 2024 in cs.SD, cs.LG, and eess.AS

Abstract

Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.

Figure: histogram of popular music lengths compared with model output lengths; the vertical axis is power-law warped for readability.

Overview

  • The paper presents a novel approach to generate full-length music tracks up to 4 minutes and 45 seconds long, with coherent musical structures, through training on long temporal contexts.

  • It introduces a model combining an autoencoder and diffusion-transformer to manage extended sequence lengths and produce continuous latent representations, enabling long-form music generation.

  • Quantitative and qualitative assessments indicate the model outperforms current methods in audio quality and structural coherence, generating music that rivals real samples without using semantic tokens for structure.

  • The findings highlight the model's potential for not just long-form music generation but also applications like audio-to-audio style transfer, melody generation, and production of short-form audio.

Leveraging Long Temporal Contexts for Generative Music Models

Introduction to the Challenge in Music Generation

Recent advances in deep-learning-based generation of musical audio have largely focused on short-duration music or on conditional generation from musical metadata and natural language prompts. These approaches, however, have struggled to generate full-length music tracks that maintain a coherent musical structure over longer durations. Recognizing this gap, the paper approaches the problem by training a generative model on long temporal contexts, aiming specifically to produce music tracks of up to 4 minutes and 45 seconds with a coherent musical structure throughout.

Novel Approach and Methodology

The paper introduces a model comprising a diffusion-transformer that operates on a continuous latent representation, downsampled to a latent rate of 21.5Hz. This significantly downsampled latent space is key for managing longer temporal contexts within the VRAM limitations of modern GPUs, thus enabling the generation of longer music pieces. The approach is distinguished by the absence of semantic tokens for long-term structural coherence, which contrasts with previous research that relied on such tokens for generating structured music. Instead, the model achieves long-form music generation through a combination of a highly compressed temporal representation and a transformer-based diffusion method tailored to these representations.
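
To put the compression in perspective, the arithmetic below (a rough sketch assuming 44.1 kHz input audio, not anything taken from the paper's code) shows how a 21.5 Hz latent rate shortens the sequences the diffusion-transformer has to model.

```python
# Back-of-the-envelope sketch (assumes 44.1 kHz input audio; not from the paper's code):
# what a 21.5 Hz latent rate means in terms of temporal compression.
SAMPLE_RATE_HZ = 44_100
LATENT_RATE_HZ = 21.5

downsampling_factor = SAMPLE_RATE_HZ / LATENT_RATE_HZ
print(f"temporal downsampling: ~{downsampling_factor:.0f}x")             # ~2051x

# One minute of music, before and after encoding:
print(f"raw samples per minute (per channel): {60 * SAMPLE_RATE_HZ:,}")  # 2,646,000
print(f"latent frames per minute: {60 * LATENT_RATE_HZ:.0f}")            # 1290
```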

Model Components:

  • Autoencoder: Uses convolutional blocks for downsampling together with ResNet-like layers employing Snake activation functions; it is trained with both reconstruction and adversarial loss terms, using a convolutional discriminator for the latter (a sketch of such a block appears after this list).
  • Diffusion-Transformer (DiT): A transformer architecture is adopted, supplemented with techniques like efficient block-wise attention and gradient checkpointing to handle the extended sequence lengths necessary for long-form music generation.
  • Variable-Length Generation Capability: The model supports generation within a specified window length, with timing conditions facilitating the adjustment to user-specified lengths.
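
The Snake activation mentioned above is defined as snake_α(x) = x + (1/α)·sin²(αx). Below is a minimal PyTorch sketch of this activation with a learnable per-channel α (a common choice, assumed here) inside a ResNet-style 1D convolutional block; channel counts, kernel sizes, and dilation are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code): Snake activation plus a ResNet-style
# 1D convolutional block of the kind described for the autoencoder.
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + (1/alpha) * sin(alpha * x)^2, alpha learned per channel."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))  # broadcasts over (batch, channels, time)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2

class ResSnakeBlock(nn.Module):
    """Residual block: Snake -> dilated conv -> Snake -> 1x1 conv, plus skip connection."""
    def __init__(self, channels: int, dilation: int = 1):
        super().__init__()
        self.net = nn.Sequential(
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=7,
                      dilation=dilation, padding=3 * dilation),  # "same" padding for kernel 7
            Snake(channels),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

# Example: a block with 128 channels applied to a (batch, channels, time) tensor.
block = ResSnakeBlock(channels=128, dilation=3)
y = block(torch.randn(2, 128, 1024))
```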

Training and Architecture Specifics

Training was conducted in multiple stages, beginning with the autoencoder and the CLAP text encoder, followed by the diffusion model. The diffusion model was first pre-trained to generate up to 3 minutes and 10 seconds of music and then fine-tuned to extend its capability to 4 minutes and 45 seconds. The model's architecture allows for variable-length music generation, a critical feature for producing diverse musical outputs.
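
As a quick sanity check on what these durations mean for the diffusion-transformer, the sketch below converts the two training-stage context lengths into latent frames at the stated 21.5 Hz rate (any rounding to convenient sizes in the actual implementation is an assumption, not a confirmed detail).

```python
# Context lengths (in latent frames) implied by the two training stages, at 21.5 Hz.
LATENT_RATE_HZ = 21.5

pretrain_seconds = 3 * 60 + 10    # 3m10s pre-training context
finetune_seconds = 4 * 60 + 45    # 4m45s fine-tuning context

print(round(pretrain_seconds * LATENT_RATE_HZ))   # 4085 latent frames
print(round(finetune_seconds * LATENT_RATE_HZ))   # 6128 latent frames
```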

Empirical Evaluation

Quantitative assessments show the model outperforming the state-of-the-art in terms of audio quality and prompt alignment across different lengths of generated music. Qualitative evaluations, through listening tests, suggest that the model can generate full-length music tracks with coherent structure, musicality, and high audio quality that compares favorably with real music samples. Remarkably, the model achieves these results without resorting to semantic tokens for imposing structure, suggesting that structural coherence in music can emerge from the model's training on prolonged temporal contexts.
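
Metrics of this kind are commonly computed as distances between embedding statistics of generated and reference audio, for example a Fréchet distance over audio embeddings. The sketch below shows that computation generically, with random arrays standing in for real embeddings; it is not the paper's evaluation code.

```python
# Generic sketch of a Fréchet distance between two sets of audio embeddings,
# the kind of statistic behind audio-quality metrics. Random arrays stand in
# for real embeddings (e.g. from an audio encoder such as OpenL3).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """FD between Gaussian fits of two embedding sets, each of shape (n_samples, dim)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    cov_sqrt = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(cov_sqrt):   # numerical noise can introduce tiny imaginary parts
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 64))             # stand-in "real" embeddings
generated = rng.normal(loc=0.1, size=(200, 64))    # stand-in "generated" embeddings
print(frechet_distance(reference, generated))
```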

Insights on Musical Structure and Future Directions

The exploration of generated music's structure through self-similarity matrices hints at the model's ability to create music with complex structures akin to those found in real-world tracks. This observation opens avenues for future research on how generative models can be further refined to capture and reproduce the intricate structures characteristic of human-composed music.
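
A self-similarity matrix compares every frame of a track's features with every other frame, so repeated sections show up as off-diagonal stripes or blocks. The sketch below illustrates the idea on a synthetic A-B-A tone sequence using MFCC features; the feature choice and parameters are illustrative, not those used in the paper's analysis.

```python
# Sketch (not the paper's analysis code) of a self-similarity matrix over audio features.
import numpy as np
import librosa

sr = 22_050
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
section_a = np.sin(2 * np.pi * 220.0 * t)                    # 2 s of a 220 Hz tone
section_b = np.sin(2 * np.pi * 330.0 * t)                    # 2 s of a 330 Hz tone
audio = np.concatenate([section_a, section_b, section_a])    # A-B-A structure

# Frame-level features; MFCCs are one common choice for structure analysis.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)       # shape (n_mfcc, n_frames)

# Cosine self-similarity between all pairs of frames.
feats = mfcc / (np.linalg.norm(mfcc, axis=0, keepdims=True) + 1e-9)
ssm = feats.T @ feats                                        # shape (n_frames, n_frames)
print(ssm.shape)   # the repeated A sections produce bright off-diagonal blocks
```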

Moreover, the study highlights the potential of the model in applications beyond long-form music generation, such as audio-to-audio style transfer, vocal-like melody generation without intelligible words, and the production of short-form audio like sound effects or instrument samples. These additional capabilities underscore the flexibility and broad applicability of the model in various audio generation contexts.

Conclusion

This research marks a significant step forward in the generation of full-length music tracks with coherent structures, showcasing the feasibility of extending generative models to longer temporal contexts without sacrificing audio quality or structural coherence. The findings not only contribute to the advancement of music generation technology but also open new research pathways exploring the limits of generative models in capturing the essence of musical composition.
