Emergent Mind

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Published Mar 15, 2024 in cs.SD, eess.AS, and eess.SP


Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsample to high-resolution audio via bandwidth expansion, and upmix to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at https://MusicHiFi.github.io/web/.

Violin plots comparing subjective listening tests for BWE and M2S under various audio enhancement conditions.


  • MusicHiFi introduces a high-fidelity stereophonic vocoder that leverages a cascaded GAN architecture to transform low-resolution mel-spectrograms into high-quality stereo audio.

  • The methodology applies a unified GAN-based approach across three stages: vocoding, bandwidth extension (BWE), and mono-to-stereo upmixing (M2S) to ensure superior audio quality and efficient spatialization.

  • MusicHiFi demonstrates a notable improvement over existing models in vocoding and bandwidth extension performance, achieving faster inference speeds and better audio quality.

  • Potential applications include serving as the vocoder for mel-spectrogram-based music generators, enhancing low-resolution recordings, and spatializing monophonic music.

MusicHiFi: A New Frontier in High-Fidelity Stereo Vocoding


Generating high-quality audio through vocoding remains a significant challenge in music generation and audio processing. Existing vocoders typically produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits where they can be applied. MusicHiFi, an efficient high-fidelity stereophonic vocoder, addresses this gap. Using a cascade of three generative adversarial networks (GANs), it transforms low-resolution mel-spectrograms into high-fidelity stereophonic audio. The authors report faster inference, comparable or better audio quality, and better spatialization control than previous methods.


MusicHiFi employs a unified approach across its three stages: vocoding, bandwidth extension (BWE), and mono-to-stereo upmixing (M2S). Each stage utilizes a GAN-based generator and discriminator architecture, with adaptations to meet the specific requirements of each task.

  • Vocoding (MusicHiFi-V): Converts low-resolution mel-spectrograms into audio waveforms using the unified GAN-based generator and discriminator architecture.
  • Bandwidth Extension (MusicHiFi-BWE): Transforms low-resolution audio to high-resolution outputs. Incorporates a residual connection and an upsampling step, allowing the module to focus on generating high-frequency content effectively.
  • Mono-to-Stereo Upmixing (MusicHiFi-M2S): Utilizes mid-side encoding to produce stereo audio from mono inputs. This approach not only preserves the original monophonic content but also facilitates superior control over the spatial width of the audio.
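The residual connection in the BWE stage and the mid-side encoding in the M2S stage can be illustrated with a short sketch. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`upsample_2x`, `bwe_residual`, `ms_upmix`, `downmix`) and the `width` parameter are hypothetical, linear interpolation stands in for whatever resampler the paper uses, and `generator` is a placeholder for a trained network. It shows why a mid-side upmix preserves the mono content: averaging the left and right channels cancels the side signal and returns the mid signal.

```python
def upsample_2x(x):
    """Naive 2x upsampling by linear interpolation (stand-in resampler, an assumption)."""
    out = []
    for i, v in enumerate(x):
        nxt = x[i + 1] if i + 1 < len(x) else v  # hold the last sample at the edge
        out.extend([v, 0.5 * (v + nxt)])
    return out


def bwe_residual(x_low, generator):
    """Upsample the input, then add the generator output as a residual.

    The upsampled signal carries the existing low-frequency content, so the
    generator only has to supply what is missing (ideally the high band).
    """
    base = upsample_2x(x_low)
    residual = generator(base)  # placeholder for a trained network
    return [b + r for b, r in zip(base, residual)]


def ms_upmix(mid, side, width=1.0):
    """Build left/right channels from a mid (mono) and a side signal.

    `width` scales the side signal: 0.0 collapses to mono, larger values widen.
    """
    left = [m + width * s for m, s in zip(mid, side)]
    right = [m - width * s for m, s in zip(mid, side)]
    return left, right


def downmix(left, right):
    """Average the two channels back to mono."""
    return [0.5 * (l + r) for l, r in zip(left, right)]


if __name__ == "__main__":
    mid = [0.5, -0.25, 0.125]
    side = [0.1, 0.2, -0.3]  # in the real system this would come from the M2S generator
    left, right = ms_upmix(mid, side, width=0.5)
    print(downmix(left, right))  # matches `mid` up to floating-point rounding
```

In this sketch, spatial width is controlled after generation simply by scaling the side signal, which is one plausible reading of the width control described above.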

Experiment and Results

MusicHiFi was evaluated against baseline models using objective metrics and subjective listening tests. In vocoding, it achieved better scores on Mel-D, STFT-D, and ViSQOL while maintaining comparable SI-SDR, with significantly faster inference. The BWE module performed on par with or better than Aero and significantly outperformed AudioSR. Notably, MusicHiFi was hundreds of times faster than the baseline models. The M2S module outperformed conventional DSP-based decorrelation methods in objective assessments, indicating that the method is both efficient and effective at creating high-quality stereo audio.
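Of the metrics named above, SI-SDR (scale-invariant signal-to-distortion ratio) has a compact standard definition: project the estimate onto the reference, treat the remainder as distortion, and report the energy ratio in decibels. The sketch below follows that common definition in plain Python. The function name `si_sdr` and the `eps` stabilizer are choices made here for illustration, and the paper's own evaluation code may differ in details such as mean removal.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length signals."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in reference]
    # Whatever the scaled reference does not explain counts as distortion.
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    reference = [1.0, 0.0, -1.0, 0.0]
    estimate = [1.0, 0.1, -1.0, -0.1]
    print(si_sdr(estimate, reference))
    # Rescaling the estimate leaves the score essentially unchanged.
    print(si_sdr([2.0 * v for v in estimate], reference))
```

Because the reference is rescaled before the comparison, a vocoder is not penalized for producing output at a different overall gain than the ground truth.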

Implications and Future Directions

MusicHiFi offers an efficient, high-quality approach to stereo vocoding for audio and music generation tasks. Its design targets three challenges in the field: generation speed, audio quality, and spatialization control. The model can be integrated with mel-spectrogram-based music generators, used to enhance the fidelity of low-resolution recordings, or used to spatialize monophonic music. The unified GAN-based architecture may also serve as a framework for future work in audio processing and generative modeling.


MusicHiFi uses a cascade of GANs to transform low-resolution mel-spectrograms into high-quality stereophonic audio. In the reported evaluation, it matches or exceeds prior methods in audio quality, offers better spatialization control, and runs significantly faster at inference. These results support its use in the applications described above and provide a basis for further work in audio and music generation.
