- The paper introduces a latent diffusion approach integrated with audio autoencoding to generate bass accompaniments.
- The system employs a U-Net with Dynamic Positional Bias and style conditioning, achieving significant improvements in Fréchet Audio Distance and timbre matching.
- The approach offers user-controllable style grounding, enabling coherent bass stem generation across diverse input mixes.
Bass Accompaniment Generation via Latent Diffusion
The paper "Bass Accompaniment Generation via Latent Diffusion" explores an innovative approach for generating bass accompaniments using latent diffusion models. This method offers significant advancements in music accompaniment generation by combining audio autoencoding and conditional latent diffusion to produce user-controllable basslines.
Overview of the Approach
The proposed system aims to generate bass stems that are coherent with arbitrary input mixes. At its core, an audio autoencoder compresses waveforms into compact latent representations so that generation can run efficiently in latent space. A latent diffusion model then operates on these representations to produce bass stems that match the style and structure of the input mix, while style grounding lets the user adapt the timbre of the generated samples to their preferences.
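As a rough, conceptual sketch of this flow (the `encoder`, `decoder`, and `denoiser` names below are hypothetical placeholders, not the authors' API), generation can be pictured as:

```python
import torch

def generate_bass(mix_wave, encoder, decoder, denoiser, steps=50):
    """Encode the conditioning mix, iteratively denoise a noise latent
    concatenated with it, and decode the result to a bass waveform."""
    c = encoder(mix_wave)                          # latent of the conditioning mix
    x = torch.randn_like(c)                        # start from Gaussian noise
    for t in reversed(range(steps)):               # iterative denoising
        x = denoiser(torch.cat([x, c], dim=1), t)  # conditioning via channel concat
    return decoder(x)                              # back to audio

# Dummy stand-ins just to exercise the control flow:
enc = lambda w: torch.randn(w.shape[0], 64, w.shape[-1] // 1024)
dec = lambda z: torch.randn(z.shape[0], 1, z.shape[-1] * 1024)
den = lambda z, t: 0.99 * z[:, :64]                # keeps the generated channels
bass = generate_bass(torch.randn(1, 1, 2 ** 16), enc, dec, den)
```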
Audio Autoencoder
The audio autoencoder is designed for high compression efficiency while maintaining reconstruction fidelity. Unlike existing multi-stage pipelines, the model is trained end-to-end. It encodes audio into invertible latent representations and is optimized with an L1 waveform loss and a multi-scale spectral distance loss that accounts for both the magnitude and the phase of the spectrograms.
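A minimal sketch of such a reconstruction objective in PyTorch, assuming standard STFT settings; the exact window sizes, weights, and phase handling in the paper may differ:

```python
import torch

def reconstruction_loss(x, y, fft_sizes=(512, 1024, 2048)):
    """L1 waveform loss plus multi-scale STFT distances; the complex term
    retains phase information alongside the magnitude term.
    x, y: (batch, samples)."""
    loss = torch.mean(torch.abs(x - y))                          # time-domain L1
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True)
        loss = loss + torch.mean(torch.abs(X.abs() - Y.abs()))   # magnitude
        loss = loss + torch.mean(torch.abs(X - Y))               # real/imag (phase)
    return loss

loss = reconstruction_loss(torch.randn(2, 16384), torch.randn(2, 16384))
```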
Figure 1: Inference of the system. Noise is concatenated to the latent representation of the conditioning waveform cx, followed by iteration to generate the output latent, which is then decoded.
Latent Diffusion Model
Leveraging the flexibility of diffusion models, the system models the score of the conditional latent distribution so that generated stems adhere closely to the musical context provided by the input stems. A U-Net architecture augmented with Dynamic Positional Bias (DPB) allows the model to generalize to input lengths unseen during training, which is crucial for real-world applications. The diffusion process itself is grounded in score-based modelling related to Langevin dynamics: samples are produced by iteratively denoising a latent, guided by the estimated score at each step.
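A minimal sketch of Dynamic Positional Bias for a 1-D sequence, following the usual formulation in which a small MLP maps relative offsets to per-head biases added to the attention logits; the layer sizes here are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class DynamicPositionalBias(nn.Module):
    """A small MLP maps relative offsets to per-head attention biases, so the
    bias is defined for any sequence length, including unseen lengths."""
    def __init__(self, num_heads, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, num_heads),
        )

    def forward(self, seq_len):
        pos = torch.arange(seq_len, dtype=torch.float32)
        rel = (pos[None, :] - pos[:, None]).unsqueeze(-1)  # (L, L, 1) offsets
        return self.mlp(rel).permute(2, 0, 1)              # (heads, L, L)

# The bias is added to the attention logits before the softmax:
heads, L, d = 4, 128, 32
q = k = torch.randn(heads, L, d)
logits = q @ k.transpose(-1, -2) / d ** 0.5 + DynamicPositionalBias(heads)(L)
attn = logits.softmax(dim=-1)
```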

Figure 2: Left: FAD evaluation showing optimal DDIM steps for minimizing distortion. Right: Improved FAD with CFG at higher weights through latent rescaling.
Style Conditioning and Control
Style control is realized by exploiting the rich semantic encoding learned by the autoencoder. By grounding the generation process on reference style samples, the model enforces stylistic consistency across generated basslines. An adapted form of Classifier-Free Guidance (CFG) with latent rescaling keeps high guidance weights from introducing distortions, a challenge that arises because the latent space is unbounded.
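One common way to realize this, sketched below under the assumption that the adaptation resembles standard CFG rescaling (the paper's exact formulation may differ), is to pull the guided prediction's statistics back toward those of the conditional prediction:

```python
import torch

def guided_prediction(pred_cond, pred_uncond, weight=4.0, rescale=0.7):
    """Classifier-free guidance followed by a rescaling step that pulls the
    guided latent's standard deviation back toward the conditional prediction,
    so large guidance weights do not blow up an unbounded latent space."""
    guided = pred_uncond + weight * (pred_cond - pred_uncond)     # standard CFG
    dims = tuple(range(1, pred_cond.ndim))
    std_cond = pred_cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    rescaled = guided * (std_cond / (std_guided + 1e-8))          # match scale
    return rescale * rescaled + (1.0 - rescale) * guided          # partial blend

out = guided_prediction(torch.randn(2, 64, 128), torch.randn(2, 64, 128))
```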
Experiments and Evaluation
The paper evaluates the system on a dataset of music tracks with available stems. The primary metric, the Fréchet Audio Distance (FAD), shows clear improvements in sample quality and style adherence. The cosine and Euclidean distances between generated and reference embeddings indicate that the style grounding technique captures the desired timbre characteristics.
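For reference, FAD compares Gaussian fits of two embedding sets; a small sketch of the metrics mentioned above, with the embedding model itself assumed and not shown:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real, emb_gen):
    """FAD between two sets of embeddings (n, dim) from a fixed audio model:
    ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real           # drop tiny imaginary residue
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))

def style_distances(e_gen, e_ref):
    """Cosine and Euclidean distances between generated and reference style embeddings."""
    cos = 1.0 - e_gen @ e_ref / (np.linalg.norm(e_gen) * np.linalg.norm(e_ref))
    return cos, np.linalg.norm(e_gen - e_ref)

fad = frechet_audio_distance(np.random.randn(100, 128), np.random.randn(100, 128))
```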
Figure 3: Softmax assignments indicating high correlation between generated basslines and input mixes, confirming stylistic matching.
The experiments highlight the system's ability to produce musically coherent and stylistically grounded bass accompaniments. A contrastive model shows clear alignment between input mixes and generated stems, supporting the robustness of the proposed method.
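The matching test behind Figure 3 can be pictured as a row-wise softmax over pairwise similarities between mix and stem embeddings; a minimal sketch under that assumption (the actual contrastive model is not shown):

```python
import numpy as np

def softmax_assignments(mix_emb, stem_emb, temperature=0.07):
    """Rows index mixes, columns index generated stems; high diagonal mass
    means each generated bassline is most similar to its own input mix."""
    mix = mix_emb / np.linalg.norm(mix_emb, axis=1, keepdims=True)
    stem = stem_emb / np.linalg.norm(stem_emb, axis=1, keepdims=True)
    logits = mix @ stem.T / temperature                  # cosine similarities
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)          # row-wise softmax
```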
Conclusion and Future Directions
The system provides a powerful tool for generating musical accompaniments and supports artistic workflows through its controllability. While the current focus is on bass, future work could extend the methodology to a wider range of instruments and add user control over the generated note sequences, broadening its applicability within AI-driven music production and the creative industries.