MusicHiFi: Fast High-Fidelity Stereo Vocoding (2403.10493v4)
Abstract: Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio with a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at low sampling rates (e.g., 16-24 kHz), which limits their usefulness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth extension, and upmixes to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach with objective metrics and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference than past work. Sound examples are at https://MusicHiFi.github.io/web/.
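The abstract names two structural constraints worth making concrete: the bandwidth-extension stage should be (near) downsampling-compatible, and the mono-to-stereo stage should be downmix-compatible. One standard way to guarantee the latter, consistent with classic sum-difference (mid/side) stereo coding, is to predict only a side signal s from the mono input m and emit L = m + s, R = m - s, so that the mono downmix (L + R)/2 recovers m exactly. Below is a minimal NumPy/SciPy sketch of both properties; the two predictor functions, sampling rates, and constants are hypothetical stand-ins for the trained GAN generators, not the authors' code.

```python
import numpy as np
from scipy.signal import resample_poly

def bandwidth_extend(x_lo, up=2):
    """Hypothetical stand-in for the bandwidth-extension GAN (22.05 -> 44.1 kHz).

    A trained model would also synthesize new high-band content; plain
    upsampling is used here only so the downsampling-compatibility check
    below holds by construction.
    """
    return resample_poly(x_lo, up, 1)

def predict_side(mid):
    """Hypothetical stand-in for the mono-to-stereo GAN's side-signal output."""
    return 0.1 * np.roll(mid, 220)  # any side signal satisfies the identity below

sr_lo = 22050
t = np.arange(sr_lo) / sr_lo
mono_lo = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 1 s of fake vocoder output

# Stage 2: bandwidth extension. Downsampling the output should (near) recover the input.
mono_hi = bandwidth_extend(mono_lo)
recon_lo = resample_poly(mono_hi, 1, 2)
print("low-band error:", np.max(np.abs(recon_lo - mono_lo)))  # small; "near" compatible, not exact

# Stage 3: downmix-compatible upmixing via the mid/side construction.
side = predict_side(mono_hi)
left, right = mono_hi + side, mono_hi - side
assert np.allclose((left + right) / 2, mono_hi)  # mono content preserved exactly
```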