SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis (2402.01753v1)

Published 30 Jan 2024 in cs.SD, cs.LG, eess.AS, and eess.SP

Abstract: Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from a mel spectrogram. In our model, training stability is enhanced by means of a forward diffusion process that injects Gaussian noise into both real and fake samples before they are input to the discriminator. We further improve the model by using a spectrally-shaped noise distribution, with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in both audio quality and efficiency.
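
To make the mechanism concrete, the sketch below illustrates the two ingredients the abstract describes: a forward diffusion step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps applied to both real and generated waveforms before the discriminator scores them, with the noise eps colored by a spectral envelope rather than left white. This is a minimal, hypothetical PyTorch sketch: the function names, STFT settings, linear beta schedule, and envelope estimate are illustrative assumptions, not the authors' implementation (the paper's exact shaping filter may be derived differently, e.g., from the conditioning mel spectrogram as in SpecGrad).

```python
# Hypothetical sketch of diffusion-GAN training with spectrally-shaped noise.
# All names and hyperparameters below are illustrative assumptions.

import torch

N_FFT, HOP = 1024, 256  # assumed STFT settings

def shape_noise(white: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Color white Gaussian noise with the magnitude envelope of a reference
    waveform, so the injected noise follows the signal's spectral shape."""
    win = torch.hann_window(N_FFT, device=white.device)
    # Time-averaged magnitude envelope of the reference signal.
    ref_spec = torch.stft(reference, N_FFT, HOP, window=win, return_complex=True)
    env = ref_spec.abs().mean(dim=-1, keepdim=True)        # (B, F, 1)
    env = env / (env.mean(dim=1, keepdim=True) + 1e-8)     # unit average gain
    # Filter white noise by that envelope in the STFT domain.
    noise_spec = torch.stft(white, N_FFT, HOP, window=win, return_complex=True)
    return torch.istft(noise_spec * env, N_FFT, HOP, window=win,
                       length=white.shape[-1])

def diffuse(x: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor,
            reference: torch.Tensor) -> torch.Tensor:
    """Forward diffusion q(x_t | x_0) with shaped noise:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps_shaped."""
    eps = shape_noise(torch.randn_like(x), reference)
    a = alpha_bar[t].view(-1, 1)                           # (B, 1)
    return a.sqrt() * x + (1.0 - a).sqrt() * eps

# Toy usage: a linear beta schedule, then noising real and fake waveforms
# at a random timestep before they reach the discriminator.
T = 32
betas = torch.linspace(1e-4, 0.05, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

real = torch.randn(4, 16000)   # stand-in for a batch of real audio
fake = torch.randn(4, 16000)   # stand-in for generator output
t = torch.randint(0, T, (4,))
real_t = diffuse(real, t, alpha_bar, reference=real)
fake_t = diffuse(fake, t, alpha_bar, reference=real)
# real_t and fake_t would then replace the clean waveforms as inputs to the
# (possibly timestep-conditioned) discriminator.
```

Because both real and fake inputs are perturbed by the same noise process, the discriminator cannot separate them on superficial artifacts alone, which is the stabilizing effect the abstract attributes to the forward diffusion; shaping the noise to the signal's spectrum is what makes its task harder still.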

References (32)
  1. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. ICASSP, 2018.
  2. Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen et al., “Direct speech-to-speech translation with a sequence-to-sequence model,” in Proc. Interspeech, 2019.
  3. B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 132–157, 2020.
  4. J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “GANSynth: Adversarial neural audio synthesis,” arXiv preprint arXiv:1902.08710, 2019.
  5. S. Ji, J. Luo, and X. Yang, “A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions,” arXiv preprint arXiv:2011.06801, 2020.
  6. D. Moffat, R. Selfridge, and J. Reiss, “Sound effect synthesis,” in Foundations in Sound Design for Interactive Media, M. Filimowicz, Ed. Routledge, 2019.
  7. C. Schreck, D. Rohmer, D. L. James, S. Hahmann, and M.-P. Cani, “Real-time sound synthesis for paper material based on geometric analysis,” in Proc. ACM SIGGRAPH/Eurographics SCA, Jul. 2016.
  8. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves et al., “WaveNet: A generative model for raw audio,” in Proc. ISCA SSW, 2016.
  9. R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, May 2019.
  10. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair et al., “Generative adversarial nets,” in Proc. NeurIPS, 2014.
  11. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, 2020.
  12. S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” in Proc. ICLR, 2023.
  13. N. Kodali, J. Hays, J. Abernethy, and Z. Kira, “On convergence and stability of GANs,” arXiv preprint arXiv:1705.07215, 2017.
  14. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Proc. NeurIPS, 2020.
  15. N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. ICLR, 2021.
  16. S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng et al., “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” in Proc. ICLR, 2022.
  17. Z. Wang, H. Zheng, P. He, W. Chen, and M. Zhou, “Diffusion-GAN: Training GANs with diffusion,” in Proc. ICLR, 2023.
  18. Y. Koizumi, H. Zen, K. Yatabe, N. Chen, and M. Bacchiani, “SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping,” in Proc. Interspeech, 2022.
  19. X. Wu, “Enhancing unsupervised speech recognition with diffusion GANs,” in Proc. ICASSP, Jun. 2023.
  20. W. Jang, D. C. Y. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. Interspeech, 2021.
  21. C. Wang, C. Zeng, and X. He, “HiFi-WaveGAN: Generative adversarial network with auxiliary spectrogram-phase loss for high-fidelity singing voice generation,” arXiv preprint arXiv:2210.12740, 2022.
  22. T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila, “Training generative adversarial networks with limited data,” in Proc. NeurIPS, 2020.
  23. K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  24. J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019.
  25. V. Emiya, N. Bertin, B. David, and R. Badeau, “MAPS - a piano database for multipitch estimation and automatic transcription of music,” INRIA, Tech. Rep., Jul. 2010.
  26. O. Gillet and G. Richard, “ENST-Drums: An extensive audio-visual database for drum signals processing,” in Proc. ISMIR, 2006.
  27. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
  28. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217.
  29. W. A. Jassim, J. Skoglund, M. Chinen, and A. Hines, “WARP-Q: Quality prediction for generative neural speech codecs,” in Proc. ICASSP, 2021, pp. 401–405.
  30. K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Proc. Interspeech, 2019.
  31. S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore et al., “CNN architectures for large-scale audio classification,” in Proc. ICASSP, 2017.
  32. L. Ziyin, T. Hartwig, and M. Ueda, “Neural networks fail to learn periodic functions and how to fix it,” in Proc. NeurIPS, 2020.