A Diffusion-Based Generative Equalizer for Music Restoration (2403.18636v2)

Published 27 Mar 2024 in eess.AS and cs.SD

Abstract: This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of improvements. This research broadens the concept of bandwidth extension to "generative equalization", a novel task that, to the best of our knowledge, has not been explicitly addressed in previous studies. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing an enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music.


Summary

  • The paper introduces BABE-2, a diffusion-based generative equalizer that restores degraded recordings by extending blind bandwidth extension to generative equalization with an adaptive filter parameterization.
  • It combines diffusion-model priors with inverse-problem techniques, iteratively optimizing the reconstruction using noise regularization and breakpoint-collapse regularization.
  • Experiments on historical piano and vocal recordings show improved restoration quality over the prior BABE method, preserving spectral detail and tonal nuance.

Diffusion-Based Generative Equalizer for Music Restoration

This essay explores an advancement in music restoration: a diffusion-based generative equalizer that improves low-quality recordings, particularly historical ones, through a generative approach to bandwidth extension. The method, named BABE-2, builds on previous work by extending the degradation model used in audio restoration, with the aim of bringing recovered audio closer to contemporary quality standards.

Methodology and Innovations

Diffusion Models

Diffusion models are a class of generative models that learn to reverse a process of progressively adding noise to data. In the audio domain, sampling transforms an initial random-noise signal into a clean one. This reverse process can be formulated as an ordinary differential equation (ODE) whose drift is approximated by a neural network, making it a flexible foundation for audio restoration.
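For concreteness, here is a minimal sketch of how such an ODE sampler can be discretized. It assumes a Karras-style (EDM) noise schedule and a trained denoiser callable `denoise(x, sigma)`; both names are illustrative placeholders, not the authors' implementation.

```python
import torch

def sample_probability_flow_ode(denoise, shape, sigma_max=100.0, sigma_min=0.002,
                                num_steps=50, rho=7.0, device="cpu"):
    """Euler discretization of the probability-flow ODE (EDM-style schedule).

    `denoise(x, sigma)` is assumed to be a trained diffusion model that
    returns an estimate of the clean signal given noisy input `x` at
    noise level `sigma`.
    """
    # Karras et al. schedule: interpolate between sigma_max and sigma_min.
    steps = torch.arange(num_steps, device=device) / (num_steps - 1)
    sigmas = (sigma_max ** (1 / rho)
              + steps * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

    x = torch.randn(shape, device=device) * sigmas[0]  # start from pure noise
    for i in range(num_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, sigma)) / sigma   # ODE drift (score direction)
        x = x + (sigma_next - sigma) * d      # Euler step toward less noise
    return x
```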

Diffusion Posterior Sampling

The approach leverages diffusion models as priors for solving inverse problems: estimating the original audio signal from a degraded observation. By Bayes' rule, the posterior score decomposes into a prior score and a likelihood score, enabling gradient-based guidance throughout the restoration process, as shown below.
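In symbols, this is the standard decomposition used in posterior sampling (not notation specific to this paper), with observation y:

```latex
\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})
  = \nabla_{\mathbf{x}} \log p(\mathbf{x})
  + \nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x})
```

The first term is supplied by the pretrained diffusion model; the second is approximated by differentiating a reconstruction error between the degraded observation and the filtered current estimate, as in diffusion posterior sampling.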

Blind Inverse Problems

When the degradation model is unknown, solving the inverse problem becomes more challenging. BABE addresses blind bandwidth extension with a zero-phase frequency-domain filter whose parameters are adapted during the sampling process, so the degradation is estimated jointly with the restored signal rather than assumed known. A minimal illustration of such a filter is sketched below.
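The following sketch shows one way to apply a magnitude-only, zero-phase filter via the FFT; it is a simplified stand-in for the paper's filter, with an assumed per-bin magnitude response.

```python
import torch

def apply_zero_phase_filter(x, mag_response):
    """Apply a zero-phase filter by scaling FFT magnitudes only.

    `x` is a mono waveform tensor; `mag_response` holds one non-negative
    gain per rFFT bin. Because the gains are real and applied only to
    magnitudes, no phase distortion is introduced.
    """
    X = torch.fft.rfft(x)
    assert mag_response.shape[-1] == X.shape[-1]
    return torch.fft.irfft(X * mag_response, n=x.shape[-1])
```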

Unique Contributions of BABE-2

BABE-2 introduces an improved filter parameterization: a piecewise-linear function of log-frequency with adjustable slopes and breakpoints, forming a flexible frequency-response equalizer. This expands BABE's simplified degradation model to capture the spectral coloration typical of historical recordings, allowing more accurate adaptation across frequency bands (Figure 1; a simplified version of this parameterization is sketched after the figure).

Figure 1: The proposed frequency-response equalizer model consists of breakpoints that create a piecewise-linear magnitude response.
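The sketch below illustrates the idea with gains interpolated linearly between breakpoints on a log-frequency axis, which yields constant dB-per-octave slopes per segment. The paper parameterizes slopes and breakpoints directly; this interpolation over assumed breakpoint gains is a simplified, equivalent view.

```python
import numpy as np

def piecewise_linear_response_db(freqs, break_freqs, break_gains_db):
    """Piecewise-linear magnitude response in dB over log-frequency.

    `break_freqs` (Hz, increasing) and `break_gains_db` define the
    breakpoints; between breakpoints the gain is linearly interpolated
    on a log2-frequency axis. Outside the breakpoint range the response
    is held flat (np.interp clamps at the endpoints).
    """
    log_f = np.log2(np.maximum(freqs, 1e-6))
    log_bf = np.log2(np.asarray(break_freqs))
    gains_db = np.interp(log_f, log_bf, break_gains_db)
    return gains_db  # convert with 10 ** (gains_db / 20) for linear gain
```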

To prevent the breakpoint-collapse problem identified in BABE, in which filter stages merge improperly, BABE-2 introduces breakpoint-collapse regularization (BCR), which enforces spacing between breakpoints and preserves the flexibility of the richer frequency-response model. A possible form of such a penalty is sketched below.
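This sketch penalizes adjacent breakpoints that come closer than a minimum spacing in octaves; the exact functional form and weighting used in the paper may differ.

```python
import torch

def breakpoint_collapse_penalty(break_freqs, min_spacing_oct=0.25):
    """Penalize adjacent breakpoints closer than `min_spacing_oct`
    octaves, discouraging filter stages from merging during optimization.
    `break_freqs` is an increasing tensor of breakpoint frequencies (Hz).
    """
    log_f = torch.log2(break_freqs)
    spacing = log_f[1:] - log_f[:-1]              # octaves between neighbors
    violation = torch.relu(min_spacing_oct - spacing)
    return (violation ** 2).sum()
```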

BABE-2 further applies noise regularization to counteract local convergence and the nonlinear artifacts present in historical recordings, ensuring a more stable optimization. Initialization uses the LTAS-based (long-term average spectrum) procedure carried over from BABE, which improves convergence during inference and stabilizes the reconstruction; a sketch of such an initialization follows.
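As a rough illustration, an LTAS-based initialization can set the filter's starting magnitude response to the dB gap between the degraded recording's LTAS and that of a clean reference; the paper's exact procedure and parameters are not reproduced here.

```python
import torch

def ltas_db(x, n_fft=2048, hop=512):
    """Long-term average spectrum (LTAS) in dB: STFT magnitudes
    averaged over time frames."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
    return 20 * torch.log10(spec.mean(dim=-1) + 1e-8)

def init_filter_from_ltas(degraded, reference):
    """Initialize the filter magnitude response (in dB) as the gap
    between the degraded recording's LTAS and a clean reference LTAS."""
    return ltas_db(degraded) - ltas_db(reference)
```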

Implementation

Inference Algorithm

Inference proceeds by iteratively optimizing the generative equalizer: structured updates combine the prior and likelihood scores, steering the audio reconstruction while keeping the estimate consistent with the target spectral profile (Figure 2; a high-level sketch of the loop follows the figure).

Figure 2: Restoration process for vocal recordings showcasing pipeline stages of denoising and adaptive frequency equalization.
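The following is a heavily simplified sketch of a BABE-2-style blind restoration loop, alternating diffusion sampling with Adam updates of the filter parameters. All names are placeholders, step sizes and guidance weights are simplified, and this should not be read as the authors' exact algorithm.

```python
import torch

def restore(y, denoise, apply_filter, filter_params, sigmas,
            lr=1e-2, zeta=1.0, reg=lambda p: 0.0):
    """Alternating blind restoration sketch.

    `filter_params` is an iterable of tensors with requires_grad=True.
    At each diffusion step: (1) the filter parameters are updated so that
    filtering the current clean estimate matches the degraded observation
    `y`; (2) the sample takes a guided diffusion step combining the prior
    drift with the likelihood gradient.
    """
    opt = torch.optim.Adam(filter_params, lr=lr)
    x = torch.randn_like(y) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x = x.detach().requires_grad_(True)
        x0_hat = denoise(x, sigma)                  # current clean estimate

        # (1) update degradation-filter parameters to explain the observation
        loss = torch.nn.functional.mse_loss(
            apply_filter(x0_hat.detach(), filter_params), y)
        loss = loss + reg(filter_params)            # e.g. breakpoint-collapse term
        opt.zero_grad(); loss.backward(); opt.step()

        # (2) guided diffusion step: prior drift plus likelihood gradient
        guide = torch.autograd.grad(
            torch.nn.functional.mse_loss(
                apply_filter(x0_hat, filter_params), y), x)[0]
        d = (x - x0_hat) / sigma
        x = (x + (sigma_next - sigma) * d - zeta * guide).detach()
    return x
```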

Training and Parameters

The models were trained on large datasets (MAESTRO for piano; a collection of studio vocal recordings for singing voice), with pre-training followed by fine-tuning on reference singers chosen to match the historical performers. Inference follows a structured noise schedule that optimizes the filter parameters iteratively while maintaining consistency across frames.

Experiments and Analysis

Piano Recordings Evaluation

Experiments demonstrated BABE-2's efficacy in restoring historical piano recordings toward contemporary audio quality, outperforming baseline methods, particularly in preserving the frequency response (Figure 3).

Figure 3: Comparative LTAS analysis of original and restored piano recordings using different methods.

Vocal Recordings Evaluation

BABE-2 was also tested on vocal recordings by famous singers such as Enrico Caruso and Nellie Melba, demonstrating its adaptability in restoring vocal quality while preserving the tonal characteristics unique to each performer. Careful selection of reference singers during fine-tuning proved critical to restoring historically plausible vocal nuances (Figure 4).

Figure 4: Spectrogram representations of two vocal restoration examples. The colored boxes highlight key points discussed.

Conclusion

BABE-2 represents an advancement in music restoration, adapting diffusion models to the new task of generative equalization. By tackling the degradation challenges of historical music, it promises more accessible, higher-fidelity preservation of audio recordings. While the method achieves notable success, especially in vocal restoration, the research opens future inquiries into better handling nonlinear degradations and more accurately capturing temporal dynamics within the restoration process.
