
VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model (2310.04681v1)

Published 7 Oct 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Speaker verification (SV) performance deteriorates as utterances become shorter. To address this, we propose VoiceExtender, a new architecture designed to improve SV performance on short-duration speech signals. It uses two guided diffusion models, a built-in and an external speaker embedding (SE) guided diffusion model, both of which rely on a diffusion model-based sample generator that leverages SE guidance to augment the speech features of a short utterance. Extensive experiments on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for short-utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.
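
The abstract reports results as relative improvements in equal error rate (EER). As a rough illustration of what those two quantities mean, the sketch below computes an EER from trial scores with a simple threshold sweep and then a relative improvement between two EER values. This is not the paper's evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the scores are made up, and the EER is approximated at the threshold where the false-accept and false-reject rates are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False-accept rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # False-reject rate: genuine trials that fall below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Report the midpoint of FAR and FRR where they are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction over the baseline, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Made-up similarity scores, for illustration only.
    genuine = [0.9, 0.8, 0.7, 0.4]
    impostor = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(genuine, impostor))
    # Made-up EER values (in percent), not taken from the paper.
    print(relative_improvement(8.0, 6.0))
```

Under this definition, a relative improvement of 46.1% means the new EER is 53.9% of the baseline EER, not that the EER dropped by 46.1 percentage points.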
