MusicHiFi: Fast High-Fidelity Stereo Vocoding (2403.10493v4)
Abstract: Diffusion-based audio and music generation models commonly perform generation by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio with a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at low sampling rates (e.g., 16-24 kHz), which limits their usefulness. We propose MusicHiFi, an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth extension, and upmixes to stereophonic audio. Compared to past work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach with objective metrics and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference than past work. Sound examples are at https://MusicHiFi.github.io/web/.
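The abstract names two structural constraints worth making concrete: the bandwidth-extension stage should be (near) downsampling-compatible, and the mono-to-stereo stage should be downmix-compatible. One standard way to guarantee the latter, consistent with classic sum-difference (mid/side) stereo coding, is to predict only a side signal s from the mono input m and emit L = m + s, R = m - s, so that the mono downmix (L + R)/2 recovers m exactly. Below is a minimal NumPy/SciPy sketch of both properties; the two predictor functions, sampling rates, and constants are hypothetical stand-ins for the trained GAN generators, not the authors' code.

```python
import numpy as np
from scipy.signal import resample_poly

def bandwidth_extend(x_lo, up=2):
    """Hypothetical stand-in for the bandwidth-extension GAN (22.05 -> 44.1 kHz).

    A trained model would also synthesize new high-band content; plain
    upsampling is used here only so the downsampling-compatibility check
    below holds by construction.
    """
    return resample_poly(x_lo, up, 1)

def predict_side(mid):
    """Hypothetical stand-in for the mono-to-stereo GAN's side-signal output."""
    return 0.1 * np.roll(mid, 220)  # any side signal satisfies the identity below

sr_lo = 22050
t = np.arange(sr_lo) / sr_lo
mono_lo = np.sin(2 * np.pi * 440 * t).astype(np.float32)  # 1 s of fake vocoder output

# Stage 2: bandwidth extension. Downsampling the output should (near) recover the input.
mono_hi = bandwidth_extend(mono_lo)
recon_lo = resample_poly(mono_hi, 1, 2)
print("low-band error:", np.max(np.abs(recon_lo - mono_lo)))  # small; "near" compatible, not exact

# Stage 3: downmix-compatible upmixing via the mid/side construction.
side = predict_side(mono_hi)
left, right = mono_hi + side, mono_hi - side
assert np.allclose((left + right) / 2, mono_hi)  # mono content preserved exactly
```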