SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis (2402.01753v1)

Published 30 Jan 2024 in cs.SD, cs.LG, eess.AS, and eess.SP

Abstract: Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from a mel spectrogram. In our model, training stability is enhanced by means of a forward diffusion process that injects Gaussian noise into both real and fake samples before they are input to the discriminator. We further improve the model by using a spectrally-shaped noise distribution, with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in both audio quality and efficiency.
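
To make the mechanism concrete, the sketch below illustrates the two ingredients the abstract describes: a forward diffusion step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps applied to both real and generated waveforms before the discriminator scores them, with the noise eps colored by a spectral envelope rather than left white. This is a minimal, hypothetical PyTorch sketch: the function names, STFT settings, linear beta schedule, and envelope estimate are illustrative assumptions, not the authors' implementation (the paper's exact shaping filter may be derived differently, e.g., from the conditioning mel spectrogram as in SpecGrad).

```python
# Hypothetical sketch of diffusion-GAN training with spectrally-shaped noise.
# All names and hyperparameters below are illustrative assumptions.

import torch

N_FFT, HOP = 1024, 256  # assumed STFT settings

def shape_noise(white: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Color white Gaussian noise with the magnitude envelope of a reference
    waveform, so the injected noise follows the signal's spectral shape."""
    win = torch.hann_window(N_FFT, device=white.device)
    # Time-averaged magnitude envelope of the reference signal.
    ref_spec = torch.stft(reference, N_FFT, HOP, window=win, return_complex=True)
    env = ref_spec.abs().mean(dim=-1, keepdim=True)        # (B, F, 1)
    env = env / (env.mean(dim=1, keepdim=True) + 1e-8)     # unit average gain
    # Filter white noise by that envelope in the STFT domain.
    noise_spec = torch.stft(white, N_FFT, HOP, window=win, return_complex=True)
    return torch.istft(noise_spec * env, N_FFT, HOP, window=win,
                       length=white.shape[-1])

def diffuse(x: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor,
            reference: torch.Tensor) -> torch.Tensor:
    """Forward diffusion q(x_t | x_0) with shaped noise:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps_shaped."""
    eps = shape_noise(torch.randn_like(x), reference)
    a = alpha_bar[t].view(-1, 1)                           # (B, 1)
    return a.sqrt() * x + (1.0 - a).sqrt() * eps

# Toy usage: a linear beta schedule, then noising real and fake waveforms
# at a random timestep before they reach the discriminator.
T = 32
betas = torch.linspace(1e-4, 0.05, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

real = torch.randn(4, 16000)   # stand-in for a batch of real audio
fake = torch.randn(4, 16000)   # stand-in for generator output
t = torch.randint(0, T, (4,))
real_t = diffuse(real, t, alpha_bar, reference=real)
fake_t = diffuse(fake, t, alpha_bar, reference=real)
# real_t and fake_t would then replace the clean waveforms as inputs to the
# (possibly timestep-conditioned) discriminator.
```

Because both real and fake inputs are perturbed by the same noise process, the discriminator cannot separate them on superficial artifacts alone, which is the stabilizing effect the abstract attributes to the forward diffusion; shaping the noise to the signal's spectrum is what makes its task harder still.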

References (32)
  1. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018.
  2. Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen et al., “Direct speech-to-speech translation with a sequence-to-sequence model,” in Proc. Interspeech, 2019.
  3. B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 132–157, 2020.
  4. J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” arXiv preprint arXiv:1902.08710, 2019.
  5. S. Ji, J. Luo, and X. Yang, “A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions,” arXiv preprint arXiv:2011.06801, 2020.
  6. D. Moffat, R. Selfridge, and J. Reiss, “Sound effect synthesis,” in Foundations in Sound Design for Interactive Media, M. Filimowicz, Ed. Routledge, 2019.
  7. C. Schreck, D. Rohmer, D. L. James, S. Hahmann, and M.-P. Cani, “Real-time sound synthesis for paper material based on geometric analysis,” in Proc. ACM SIGGRAPH/Eurographics SCA, Jul. 2016.
  8. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves et al., “WaveNet: A generative model for raw audio,” in Proc. ISCA SSW, 2016.
  9. R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, May 2019.
  10. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair et al., “Generative adversarial nets,” in Proc. NeurIPS, 2014.
  11. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
  12. S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in Proc. ICLR, 2023.
  13. N. Kodali, J. Hays, J. Abernethy, and Z. Kira, “On convergence and stability of GANs,” arXiv preprint arXiv:1705.07215, 2017.
  14. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. NeurIPS, 2020.
  15. N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. ICLR, 2021.
  16. S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng et al., “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. ICLR, 2022.
  17. Z. Wang, H. Zheng, P. He, W. Chen, and M. Zhou, “Diffusion-GAN: Training GANs with diffusion,” in Proc. ICLR, 2023.
  18. Y. Koizumi, H. Zen, K. Yatabe, N. Chen, and M. Bacchiani, “SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping,” in Proc. Interspeech, 2022.
  19. X. Wu, “Enhancing unsupervised speech recognition with diffusion GANs,” in Proc. ICASSP, Jun. 2023.
  20. W. Jang, D. C. Y. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. Interspeech, 2021.
  21. C. Wang, C. Zeng, and X. He, “HiFi-WaveGAN: Generative adversarial network with auxiliary spectrogram-phase loss for high-fidelity singing voice generation,” arXiv preprint arXiv:2210.12740, 2022.
  22. T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, “Training generative adversarial networks with limited data,” in Proc. NeurIPS, 2020.
  23. K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  24. J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019.
  25. V. Emiya, N. Bertin, B. David, and R. Badeau, “MAPS - a piano database for multipitch estimation and automatic transcription of music,” INRIA, Tech. Rep., Jul. 2010.
  26. O. Gillet and G. Richard, “ENST-Drums: An extensive audio-visual database for drum signals processing,” in Proc. ISMIR, 2006.
  27. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
  28. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217.
  29. W. A. Jassim, J. Skoglund, M. Chinen, and A. Hines, “WARP-Q: Quality prediction for generative neural speech codecs,” in Proc. ICASSP, 2021, pp. 401–405.
  30. K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech, 2019.
  31. S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore et al., “CNN architectures for large-scale audio classification,” in Proc. ICASSP, 2017.
  32. L. Ziyin, T. Hartwig, and M. Ueda, “Neural networks fail to learn periodic functions and how to fix it,” in Proc. NeurIPS, 2020.