
Phonetic-aware speaker embedding for far-field speaker verification

Published 27 Nov 2023 in cs.SD, cs.AI, and eess.AS | (2311.15627v1)

Abstract: When a speaker verification (SV) system operates far from the sound source, significant challenges arise due to the interference of noise and reverberation. Studies have shown that incorporating phonetic information into speaker embedding can improve the performance of text-independent SV. Inspired by this observation, we propose a joint-training speech recognition and speaker recognition (JTSS) framework to exploit phonetic content for far-field SV. The framework encourages speaker embeddings to preserve phonetic information by matching the frame-based feature maps of a speaker embedding network with wav2vec's vectors. The intuition is that phonetic information can preserve low-level acoustic dynamics with speaker information and thus partly compensate for the degradation due to noise and reverberation. Results show that the proposed framework outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set. This indicates that leveraging phonetic information under far-field conditions is effective for learning robust speaker representations.


Summary

  • The paper presents the JTSS framework that jointly trains speaker verification and speech recognition to incorporate phonetic information.
  • It leverages unsupervised phonetic extraction from a pre-trained wav2vec 2.0 to preserve low-level acoustic dynamics in noisy and reverberant environments.
  • The approach achieves a 12.9% reduction in EER and a 14.4% reduction in minDCF, demonstrating significant improvements over baseline models.

Phonetic-aware Speaker Embedding for Far-field Speaker Verification

Introduction

The paper "Phonetic-aware Speaker Embedding for Far-field Speaker Verification" (2311.15627) addresses the challenges of speaker verification (SV) in far-field conditions, where noise and reverberation significantly degrade performance. Traditional SV techniques, from Gaussian Mixture Models (GMMs) and i-vectors to more recent deep learning approaches such as Time Delay Neural Networks (TDNNs) and ECAPA-TDNN, have performed well primarily under near-field conditions; their degradation in far-field settings calls for novel approaches.

Leveraging prior observations that phonetic information can enhance SV performance, this study introduces a joint-training framework—termed Joint-Training of Speech and Speaker recognition (JTSS)—which integrates phonetic content into speaker embedding learning. This framework aims to mitigate the challenges posed by far-field conditions by aligning phonetic information extracted via wav2vec 2.0 vectors with frame-based feature maps, thus preserving low-level acoustic dynamics and improving speaker recognition robustness.

Methods

The proposed JTSS framework incorporates both speech recognition and speaker verification tasks without requiring manual phonetic transcriptions, employing a pre-trained wav2vec 2.0 model for phonetic extraction. This unsupervised strategy allows the preservation of acoustic dynamics critical to speaker identity, addressing performance degradation due to noise and reverberation in far-field environments.

Figure 1: Framework of joint training of speech recognition and speaker classification (JTSS). The utterance-based speaker network in the speaker classification part comprises a pooling layer and a fully connected layer.
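
As a rough illustration of the phonetic-extraction step, the sketch below pulls frame-level representations from a pre-trained wav2vec 2.0 model using the Hugging Face transformers library; the checkpoint name, the 16 kHz input assumption, and the decision to keep the model frozen are our illustrative choices, not details confirmed by the paper.

```python
# Rough sketch (our assumptions): extract frame-level phonetic vectors from a
# pre-trained wav2vec 2.0 model with Hugging Face Transformers. The checkpoint
# name and 16 kHz input are illustrative, not taken from the paper.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
wav2vec.eval()  # treated as a frozen phonetic "teacher" in this sketch

waveform = torch.randn(16000 * 3)  # 3 seconds of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # shape (batch, frames, 768): one vector per ~20 ms frame of phonetic content
    phonetic_frames = wav2vec(inputs.input_values).last_hidden_state
print(phonetic_frames.shape)
```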

The speech recognition component and speaker classification component share frame-level layers to ensure the preservation of phonetic information. The JTSS framework jointly optimizes both tasks using a composite loss function that integrates the AAMSoftmax loss and a cosine similarity metric between phonetic content representations.
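
A minimal sketch of how such a composite objective could look follows; it reflects our reading of the description above rather than the authors' code, and the AAM-Softmax hyperparameters, projection layer, and weighting factor `lam` are hypothetical.

```python
# Hedged sketch of the joint objective: AAM-Softmax on speaker labels plus a
# frame-wise cosine term that pulls the speaker network's frame-level feature
# maps toward the corresponding wav2vec 2.0 vectors. Names and hyperparameters
# (margin, scale, lam, proj) are our illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim, n_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, label):
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
        target = torch.cos(torch.acos(cos) + self.margin)  # add angular margin
        onehot = F.one_hot(label, cos.size(1)).float()
        logits = self.scale * (onehot * target + (1 - onehot) * cos)
        return F.cross_entropy(logits, label)

def phonetic_match_loss(frame_maps, w2v_frames, proj):
    """frame_maps: (B, T, D_spk) from the speaker network's frame-level layers.
    w2v_frames: (B, T, D_w2v) from frozen wav2vec 2.0, assumed time-aligned."""
    return 1.0 - F.cosine_similarity(proj(frame_maps), w2v_frames, dim=-1).mean()

# Total loss (lam balances the two terms, cf. the lambda discussed later):
# loss = aam(spk_emb, spk_label) + lam * phonetic_match_loss(frame_maps, w2v, proj)
```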

Results

The JTSS framework was evaluated using the VOiCES Challenge 2019 and VoxCeleb datasets. It demonstrated superior performance relative to baseline models employing ECAPA-TDNN and x-vector architectures. Notably, the ECAPA-TDNN variant of JTSS achieved a 12.9% reduction in Equal Error Rate (EER) and a 14.4% reduction in minimum Detection Cost Function (minDCF) compared to its baseline, clearly evidencing the efficacy of incorporating phonetic information in improving far-field SV.
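
For reference, the two metrics behind these numbers can be computed from verification trial scores roughly as in the sketch below; the DCF parameters (target prior and costs) are our assumptions and may differ from the challenge's official settings.

```python
# Illustrative evaluation sketch (not the paper's scoring code): compute EER and
# minDCF from verification trial scores. The DCF parameters are assumptions.
import numpy as np
from sklearn.metrics import roc_curve

def eer_min_dcf(scores, labels, p_target=0.01, c_miss=1.0, c_fa=1.0):
    # labels: 1 for target (same-speaker) trials, 0 for impostor trials
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    eer = fpr[np.nanargmin(np.abs(fnr - fpr))]  # where miss rate ~ false-alarm rate
    dcf = c_miss * fnr * p_target + c_fa * fpr * (1.0 - p_target)
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer, min_dcf

# A 12.9% relative EER reduction would, for a hypothetical baseline EER of 6.2%,
# correspond to roughly 6.2% * (1 - 0.129) ~ 5.4%.
```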

On both clean and noisy Vox-O datasets, JTSS outperformed traditional methods, reinforcing its robustness under varied acoustic conditions. The reduced impact of noise and reverberation on JTSS performance highlights its potential to significantly enhance speaker discrimination capabilities in real-world settings.

Discussion

The study supports the hypothesis that integrating phonetic information, particularly from lower-level frame representations, enhances speaker verification under adverse acoustic conditions. The proposed framework offers a promising direction for further development of SV systems that are resilient to environmental noise and reverberation.

Future work could explore the refinement of phonetic extraction techniques and the integration of these frameworks into broader biometric security systems. Additionally, optimizing hyperparameters such as λ, which determines the balance between the phonetic and speaker loss contributions, could further refine the proposed methodology.

Conclusion

The paper successfully demonstrates that the JTSS framework, utilizing phonetic information extracted via unsupervised learning from a pre-trained wav2vec 2.0 model, significantly improves far-field speaker verification performance. Through robust empirical validation, it establishes a meaningful contribution to the ongoing evolution of speaker recognition methodologies, especially in challenging acoustic environments. Further research and development could amplify the impact and applicability of these findings, opening avenues for advanced biometric security systems.
