
Audio Conditioning for Music Generation via Discrete Bottleneck Features

(2407.12563)
Published Jul 17, 2024 in cs.SD and eess.AS

Abstract

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language model based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier free guidance method. We conduct automatic and human studies that validate our approach. We will release the code and we provide music samples on https://musicgenstyle.github.io in order to show the quality of our model.

Textual Inversion method overview using the pretrained text-to-music model MusicGen.

Overview

  • The paper introduces a new method for music generation that uses audio inputs to condition a language model, instead of traditional textual or parametric inputs.

  • Key contributions include adapting textual inversion for audio conditioning, designing a style conditioner that uses both audio waveforms and text, and implementing a double classifier free guidance method to balance audio and textual inputs during inference.

  • Significant results are highlighted, such as achieving a better Fréchet Audio Distance compared to baseline methods and demonstrating high subjective scores for audio quality and similarity.


The paper "Audio Conditioning for Music Generation via Discrete Bottleneck Features" presents a novel approach to music generation that employs audio inputs to condition a language model, diverging from the more traditional textual or parametric conditioning methods. The authors outline two primary strategies for this novel input: textual inversion and a jointly trained style conditioner.

Key Contributions

  1. Adaptation of Textual Inversion: The authors adapt the textual inversion method from the domain of image generation to a pre-trained text-to-music model. By optimizing a textual embedding through backpropagation, they establish a mechanism for audio conditioning without re-training the model from scratch (a toy sketch follows this list).
  2. Style Conditioner Design: A new style conditioner, trained jointly with the text-to-music model, is introduced. It passes audio through a frozen feature extractor, then a transformer encoder, a Residual Vector Quantizer (RVQ), and temporal downsampling, allowing the model to leverage both audio waveforms and textual descriptions simultaneously (see the second sketch below).
  3. Double Classifier Free Guidance: The authors assert that audio contains far more information than text, leading to the development of a double classifier free guidance method that balances textual and audio conditioning at inference time.
  4. Novel Objective Metrics: To validate their approach, the authors introduce objective metrics based on nearest neighbor searches in latent spaces, which they corroborate with human evaluations (a toy version appears as the last sketch below).
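
As a rough illustration of item 1, the PyTorch sketch below optimizes a single "pseudoword" embedding by backpropagation while the pre-trained text-to-music model stays frozen. The `model.loss(...)` call and its signature are placeholders for whatever objective the frozen model exposes, not the authors' actual API.

```python
import torch

def textual_inversion(model, target_audio_tokens, embed_dim, steps=1000, lr=1e-2):
    """Optimize a single "pseudoword" embedding so a frozen text-to-music LM
    reconstructs the target audio tokens. `model.loss(...)` is a placeholder
    for the frozen model's training objective (e.g. cross-entropy over audio tokens)."""
    pseudoword = torch.randn(1, embed_dim, requires_grad=True)   # the only trainable parameter
    optimizer = torch.optim.Adam([pseudoword], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        # The frozen LM is conditioned on the learned pseudoword placed in its
        # text-embedding space; gradients flow only into `pseudoword`.
        loss = model.loss(target_audio_tokens, text_embeddings=pseudoword)
        loss.backward()
        optimizer.step()

    return pseudoword.detach()   # reusable at inference, optionally mixed with real text embeddings
```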
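
The style conditioner of item 2 can be pictured as the chain below. The module sizes, the toy residual vector quantizer, and the generic `feature_extractor` argument are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimpleRVQ(nn.Module):
    """Toy residual vector quantizer: each stage quantizes the residual left by the previous one."""

    def __init__(self, dim, n_quantizers=4, codebook_size=1024):
        super().__init__()
        self.codebooks = nn.ModuleList(nn.Embedding(codebook_size, dim) for _ in range(n_quantizers))

    def forward(self, x):                                    # x: (batch, time, dim)
        residual, quantized, codes = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)   # (batch, time, K)
            idx = dists.argmin(dim=-1)
            chosen = codebook(idx)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        quantized = x + (quantized - x).detach()             # straight-through estimator
        return quantized, torch.stack(codes)

class StyleConditioner(nn.Module):
    """Illustrative bottleneck: frozen feature extractor -> transformer encoder -> RVQ -> downsample."""

    def __init__(self, feature_extractor, dim=512, n_quantizers=4, downsample=8):
        super().__init__()
        self.feature_extractor = feature_extractor           # any pre-trained module: waveform -> (B, T, dim)
        for p in self.feature_extractor.parameters():
            p.requires_grad = False                          # kept frozen
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.rvq = SimpleRVQ(dim, n_quantizers)
        self.pool = nn.AvgPool1d(kernel_size=downsample)     # temporal downsampling

    def forward(self, waveform):
        with torch.no_grad():
            feats = self.feature_extractor(waveform)         # (batch, time, dim)
        hidden = self.encoder(feats)
        quantized, _codes = self.rvq(hidden)                 # discrete bottleneck
        style = self.pool(quantized.transpose(1, 2)).transpose(1, 2)
        return style                                         # conditioning sequence for the music LM
```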
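
For item 4, a latent-space nearest-neighbor metric could, in spirit, look like the toy function below. It assumes generated and conditioning clips have already been embedded in a shared latent space, with `gen_embeddings[i]` generated from `ref_embeddings[i]`; the paper's actual metrics may differ.

```python
import numpy as np

def nn_match_rate(gen_embeddings, ref_embeddings):
    """Fraction of generated clips whose nearest reference clip (by cosine similarity)
    is the clip that conditioned them, assuming gen_embeddings[i] was generated from
    ref_embeddings[i]. Both arrays have shape (n_clips, latent_dim)."""
    gen = gen_embeddings / np.linalg.norm(gen_embeddings, axis=1, keepdims=True)
    ref = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    sims = gen @ ref.T                          # (n_generated, n_reference) cosine similarities
    nearest = sims.argmax(axis=1)               # index of the closest reference for each generation
    return float(np.mean(nearest == np.arange(len(gen_embeddings))))
```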

Numerical Results

The implementation and evaluation of these methods reveal several significant results:

  • The jointly trained style conditioner model achieves a Fréchet Audio Distance (FAD) of 0.85, better than both the baseline continuation method (1.22) and a model using CLAP embeddings (0.96); the FAD computation itself is sketched after this list.
  • In terms of high-level audio similarity, the new model scores well on subjective metrics such as "Overall Quality" (OVL), "Similarity" (SIM), and "Variation" (VAR), indicating a good balance between close stylistic adherence and variety in generated music.
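
For context, the Fréchet Audio Distance reported above compares Gaussian fits of embedding distributions of generated versus reference audio (lower is better), using the same formula as FID for images. A minimal sketch, assuming the embeddings have already been extracted with a fixed audio classifier such as VGGish:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(gen_embeddings, ref_embeddings):
    """Fréchet distance between Gaussians fitted to embeddings of generated and
    reference audio; same formula as FID, applied to audio features."""
    mu_g, mu_r = gen_embeddings.mean(axis=0), ref_embeddings.mean(axis=0)
    sigma_g = np.cov(gen_embeddings, rowvar=False)
    sigma_r = np.cov(ref_embeddings, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_g @ sigma_r, disp=False)   # matrix square root of the product
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g + sigma_r - 2.0 * covmean.real))
```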

Implications and Future Directions

Practical Implications

The models and methods described in the paper offer a versatile tool for music creators, providing a way to generate music that remains coherent in style while incorporating both textual and audio inputs. This flexibility could significantly enhance content creation platforms and music production workflows by allowing fine-grained control over generated outputs.

Theoretical Implications

From a theoretical perspective, the introduction of audio conditioning through a discrete bottleneck provides a promising avenue for future research. The double classifier free guidance method, in particular, offers a new approach to balancing multiple forms of conditioning, potentially applicable to other generative models beyond music.
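
One plausible formulation of such a double guidance, shown here as an assumption rather than the paper's stated equation, is to extrapolate first toward the audio-conditioned prediction and then toward the jointly audio-and-text-conditioned one, giving each modality its own weight:

```python
import torch

def double_cfg(logits_uncond, logits_audio, logits_audio_text, w_audio=3.0, w_text=3.0):
    """Two-stage classifier-free guidance (illustrative formulation): extrapolate from the
    unconditional logits toward the audio-conditioned ones, then from those toward the
    jointly audio+text-conditioned ones, so each modality gets its own guidance weight."""
    guided = logits_uncond + w_audio * (logits_audio - logits_uncond)   # strengthen the audio style
    guided = guided + w_text * (logits_audio_text - logits_audio)       # then the text description
    return guided

# Usage: at each decoding step, run the music LM three times (unconditioned, audio-only,
# audio+text), combine the logits with double_cfg, and sample the next token from the result.
```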

Speculation on Future Developments

Looking forward, the integration of more sophisticated audio feature extractors and further refinements to the RVQ and temporal downsampling techniques could enhance the fidelity and creative potential of these models. Additionally, expanding the scope of conditioning inputs to include other contextual data, such as user interaction patterns or even visual cues, could create even richer generative frameworks.

Conclusion

This paper marks a significant advancement in the field of music generation by demonstrating the feasibility and benefits of using discrete bottleneck features for audio conditioning. Through comprehensive experimentation and evaluation, the authors have provided a robust framework that sets the stage for future innovations in AI-driven music creation. The balanced interplay between textual and audio inputs, facilitated by innovative guidance methods and bottleneck designs, offers a compelling case for the broader adoption of these techniques in various AI-driven creative applications.
