VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model (2310.04681v1)
Abstract: Speaker verification (SV) performance deteriorates as utterances become shorter. To address this, we propose a new architecture called VoiceExtender, which offers a promising way to improve SV performance on short-duration speech signals. We use two guided diffusion models, the built-in and the external speaker embedding (SE) guided diffusion models, both of which rely on a diffusion model-based sample generator that leverages SE guidance to augment the speech features extracted from a short utterance. Extensive experimental results on the VoxCeleb1 dataset show that our method outperforms the baseline, with relative improvements in equal error rate (EER) of 46.1%, 35.7%, 10.4%, and 5.7% for the short utterance conditions of 0.5, 1.0, 1.5, and 2.0 seconds, respectively.
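For readers unfamiliar with the metric, the following is a minimal sketch of how an EER and a relative EER improvement are conventionally computed from verification trial scores. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the threshold sweep is a simple approximation, and the scores and EER values used are illustrative rather than taken from the paper. The relative improvement is assumed to follow the usual definition, (baseline EER − new EER) / baseline EER.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Hypothetical helper for illustration; higher scores mean "same speaker".
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))

    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scoring at or above the threshold.
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])

    # The EER is taken at the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent, assuming the usual definition."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Illustrative scores only (not data from the paper).
targets = [0.6, 0.4, 0.9, 0.8]
nontargets = [0.5, 0.1, 0.2, 0.3]
eer = equal_error_rate(targets, nontargets)
gain = relative_improvement(10.0, 8.0)
```

Under this definition, a relative improvement of 46.1% means the new EER is 53.9% of the baseline EER for that utterance length. It does not mean a drop of 46.1 percentage points.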