A Diffusion-Based Generative Equalizer for Music Restoration (2403.18636v2)

Published 27 Mar 2024 in eess.AS and cs.SD

Abstract: This paper presents a novel approach to audio restoration, focusing on the enhancement of low-quality music recordings, and in particular historical ones. Building upon a previous algorithm called BABE, or Blind Audio Bandwidth Extension, we introduce BABE-2, which presents a series of improvements. This research broadens the concept of bandwidth extension to "generative equalization", a novel task that, to the best of our knowledge, has not been explicitly addressed in previous studies. BABE-2 is built around an optimization algorithm utilizing priors from diffusion models, which are trained or fine-tuned using a curated set of high-quality music tracks. The algorithm simultaneously performs two critical tasks: estimation of the filter degradation magnitude response and hallucination of the restored audio. The proposed method is objectively evaluated on historical piano recordings, showing an enhancement over the prior version. The method yields similarly impressive results in rejuvenating the works of renowned vocalists Enrico Caruso and Nellie Melba. This research represents an advancement in the practical restoration of historical music.


Summary

  • The paper introduces BABE-2, a diffusion-based generative equalizer that restores degraded recordings by extending blind bandwidth extension to generative equalization with an adaptive filter parameterization.
  • It combines diffusion-model priors with inverse-problem techniques, iteratively optimizing the reconstruction using noise regularization and breakpoint-collapse regularization.
  • Experiments on historical piano and vocal recordings show improved restoration quality over the prior BABE method, preserving spectral detail and tonal nuance.

Diffusion-Based Generative Equalizer for Music Restoration

This essay explores an advancement in music restoration: a diffusion-based generative equalizer that improves low-quality recordings, particularly historical ones, through a generative approach to bandwidth extension. The method, named BABE-2, builds on previous work by extending the degradation model used in audio restoration, with the aim of bringing recovered audio closer to contemporary quality standards.

Methodology and Innovations

Diffusion Models

Diffusion models are a class of generative models that learn to reverse a process of progressively adding noise to data. In the audio domain, sampling transforms an initial random-noise signal into a clean one. This reverse process can be formulated as an ordinary differential equation (ODE) whose drift is approximated by a neural network, making it a flexible foundation for audio restoration.
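For concreteness, here is a minimal sketch of how such an ODE sampler can be discretized. It assumes a Karras-style (EDM) noise schedule and a trained denoiser callable `denoise(x, sigma)`; both names are illustrative placeholders, not the authors' implementation.

```python
import torch

def sample_probability_flow_ode(denoise, shape, sigma_max=100.0, sigma_min=0.002,
                                num_steps=50, rho=7.0, device="cpu"):
    """Euler discretization of the probability-flow ODE (EDM-style schedule).

    `denoise(x, sigma)` is assumed to be a trained diffusion model that
    returns an estimate of the clean signal given noisy input `x` at
    noise level `sigma`.
    """
    # Karras et al. schedule: interpolate between sigma_max and sigma_min.
    steps = torch.arange(num_steps, device=device) / (num_steps - 1)
    sigmas = (sigma_max ** (1 / rho)
              + steps * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

    x = torch.randn(shape, device=device) * sigmas[0]  # start from pure noise
    for i in range(num_steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoise(x, sigma)) / sigma   # ODE drift (score direction)
        x = x + (sigma_next - sigma) * d      # Euler step toward less noise
    return x
```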

Diffusion Posterior Sampling

The approach leverages diffusion models as priors for solving inverse problems: estimating the original audio signal from a degraded observation. By Bayes' rule, the posterior score decomposes into a prior score and a likelihood score, enabling gradient-based guidance throughout the restoration process, as shown below.
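In symbols, this is the standard decomposition used in posterior sampling (not notation specific to this paper), with observation y:

```latex
\nabla_{\mathbf{x}} \log p(\mathbf{x} \mid \mathbf{y})
  = \nabla_{\mathbf{x}} \log p(\mathbf{x})
  + \nabla_{\mathbf{x}} \log p(\mathbf{y} \mid \mathbf{x})
```

The first term is supplied by the pretrained diffusion model; the second is approximated by differentiating a reconstruction error between the degraded observation and the filtered current estimate, as in diffusion posterior sampling.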

Blind Inverse Problems

When the degradation model is unknown, solving the inverse problem becomes more challenging. BABE addresses blind bandwidth extension with a zero-phase frequency-domain filter whose parameters are adapted during the sampling process, so the degradation is estimated jointly with the restored signal rather than assumed known. A minimal illustration of such a filter is sketched below.
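The following sketch shows one way to apply a magnitude-only, zero-phase filter via the FFT; it is a simplified stand-in for the paper's filter, with an assumed per-bin magnitude response.

```python
import torch

def apply_zero_phase_filter(x, mag_response):
    """Apply a zero-phase filter by scaling FFT magnitudes only.

    `x` is a mono waveform tensor; `mag_response` holds one non-negative
    gain per rFFT bin. Because the gains are real and applied only to
    magnitudes, no phase distortion is introduced.
    """
    X = torch.fft.rfft(x)
    assert mag_response.shape[-1] == X.shape[-1]
    return torch.fft.irfft(X * mag_response, n=x.shape[-1])
```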

Unique Contributions of BABE-2

BABE-2 introduces an improved filter parameterization: a piecewise-linear function of log-frequency with adjustable slopes and breakpoints, forming a flexible frequency-response equalizer. This expands BABE's simplified degradation model to capture the spectral coloration typical of historical recordings, allowing more accurate adaptation across frequency bands (Figure 1; a simplified version of this parameterization is sketched after the figure).

Figure 1: The proposed frequency-response equalizer model consists of breakpoints that create a piecewise-linear magnitude response.
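The sketch below illustrates the idea with gains interpolated linearly between breakpoints on a log-frequency axis, which yields constant dB-per-octave slopes per segment. The paper parameterizes slopes and breakpoints directly; this interpolation over assumed breakpoint gains is a simplified, equivalent view.

```python
import numpy as np

def piecewise_linear_response_db(freqs, break_freqs, break_gains_db):
    """Piecewise-linear magnitude response in dB over log-frequency.

    `break_freqs` (Hz, increasing) and `break_gains_db` define the
    breakpoints; between breakpoints the gain is linearly interpolated
    on a log2-frequency axis. Outside the breakpoint range the response
    is held flat (np.interp clamps at the endpoints).
    """
    log_f = np.log2(np.maximum(freqs, 1e-6))
    log_bf = np.log2(np.asarray(break_freqs))
    gains_db = np.interp(log_f, log_bf, break_gains_db)
    return gains_db  # convert with 10 ** (gains_db / 20) for linear gain
```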

To prevent the breakpoint-collapse problem identified in BABE, in which filter stages merge improperly, BABE-2 introduces breakpoint-collapse regularization (BCR), which enforces spacing between breakpoints and preserves the flexibility of the richer frequency-response model. A possible form of such a penalty is sketched below.
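This sketch penalizes adjacent breakpoints that come closer than a minimum spacing in octaves; the exact functional form and weighting used in the paper may differ.

```python
import torch

def breakpoint_collapse_penalty(break_freqs, min_spacing_oct=0.25):
    """Penalize adjacent breakpoints closer than `min_spacing_oct`
    octaves, discouraging filter stages from merging during optimization.
    `break_freqs` is an increasing tensor of breakpoint frequencies (Hz).
    """
    log_f = torch.log2(break_freqs)
    spacing = log_f[1:] - log_f[:-1]              # octaves between neighbors
    violation = torch.relu(min_spacing_oct - spacing)
    return (violation ** 2).sum()
```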

BABE-2 further applies noise regularization to counteract local convergence and the nonlinear artifacts present in historical recordings, ensuring a more stable optimization. Initialization uses the LTAS-based (long-term average spectrum) procedure carried over from BABE, which improves convergence during inference and stabilizes the reconstruction; a sketch of such an initialization follows.
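As a rough illustration, an LTAS-based initialization can set the filter's starting magnitude response to the dB gap between the degraded recording's LTAS and that of a clean reference; the paper's exact procedure and parameters are not reproduced here.

```python
import torch

def ltas_db(x, n_fft=2048, hop=512):
    """Long-term average spectrum (LTAS) in dB: STFT magnitudes
    averaged over time frames."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
    return 20 * torch.log10(spec.mean(dim=-1) + 1e-8)

def init_filter_from_ltas(degraded, reference):
    """Initialize the filter magnitude response (in dB) as the gap
    between the degraded recording's LTAS and a clean reference LTAS."""
    return ltas_db(degraded) - ltas_db(reference)
```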

Implementation

Inference Algorithm

Inference proceeds by iteratively optimizing the generative equalizer: structured updates combine the prior and likelihood scores, steering the audio reconstruction while keeping the estimate consistent with the target spectral profile (Figure 2; a high-level sketch of the loop follows the figure).

Figure 2: Restoration process for vocal recordings showcasing pipeline stages of denoising and adaptive frequency equalization.
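The following is a heavily simplified sketch of a BABE-2-style blind restoration loop, alternating diffusion sampling with Adam updates of the filter parameters. All names are placeholders, step sizes and guidance weights are simplified, and this should not be read as the authors' exact algorithm.

```python
import torch

def restore(y, denoise, apply_filter, filter_params, sigmas,
            lr=1e-2, zeta=1.0, reg=lambda p: 0.0):
    """Alternating blind restoration sketch.

    `filter_params` is an iterable of tensors with requires_grad=True.
    At each diffusion step: (1) the filter parameters are updated so that
    filtering the current clean estimate matches the degraded observation
    `y`; (2) the sample takes a guided diffusion step combining the prior
    drift with the likelihood gradient.
    """
    opt = torch.optim.Adam(filter_params, lr=lr)
    x = torch.randn_like(y) * sigmas[0]
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        x = x.detach().requires_grad_(True)
        x0_hat = denoise(x, sigma)                  # current clean estimate

        # (1) update degradation-filter parameters to explain the observation
        loss = torch.nn.functional.mse_loss(
            apply_filter(x0_hat.detach(), filter_params), y)
        loss = loss + reg(filter_params)            # e.g. breakpoint-collapse term
        opt.zero_grad(); loss.backward(); opt.step()

        # (2) guided diffusion step: prior drift plus likelihood gradient
        guide = torch.autograd.grad(
            torch.nn.functional.mse_loss(
                apply_filter(x0_hat, filter_params), y), x)[0]
        d = (x - x0_hat) / sigma
        x = (x + (sigma_next - sigma) * d - zeta * guide).detach()
    return x
```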

Training and Parameters

The models were trained on large datasets (MAESTRO for piano; a collection of studio vocal recordings for singing voice), with pre-training followed by fine-tuning on reference singers chosen to match the historical performers. Inference follows a structured noise schedule that optimizes the filter parameters iteratively while maintaining consistency across frames.

Experiments and Analysis

Piano Recordings Evaluation

Experiments demonstrated BABE-2's efficacy in restoring historical piano recordings toward contemporary audio quality, outperforming baseline methods, particularly in preserving the frequency response (Figure 3).

Figure 3: Comparative LTAS analysis of original and restored piano recordings using different methods.

Vocal Recordings Evaluation

BABE-2 was also tested on vocal recordings by famous singers such as Enrico Caruso and Nellie Melba, demonstrating its adaptability in restoring vocal quality while preserving the tonal characteristics unique to each performer. Careful selection of reference singers during fine-tuning proved critical to restoring historically plausible vocal nuances (Figure 4).

Figure 4: Spectrogram representations of two vocal restoration examples. The colored boxes highlight key points discussed.

Conclusion

BABE-2 represents an advancement in music restoration, adapting diffusion models to the new task of generative equalization. By tackling the degradation challenges of historical music, it promises more accessible, higher-fidelity preservation of audio recordings. While the method achieves notable success, especially in vocal restoration, the research opens future inquiries into better handling nonlinear degradations and more accurately capturing temporal dynamics within the restoration process.
