Adversarial Data Augmentation for Robust Speaker Verification (2402.02699v1)
Abstract: Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential issue with vanilla DA is augmentation residual, i.e., unwanted distortion caused by different types of augmentation. To address this problem, this paper proposes a novel approach called adversarial data augmentation (A-DA), which combines DA with adversarial learning. Specifically, it adds an augmentation classifier that categorizes the augmentation types used in data augmentation. Adversarial learning then pushes the network to generate speaker embeddings that can deceive the augmentation classifier, making the learned speaker embeddings more robust to augmentation variations. Experiments conducted on the VoxCeleb and CN-Celeb datasets demonstrate that the proposed A-DA outperforms standard DA in both augmentation-matched and augmentation-mismatched test conditions, showcasing its superior robustness and generalization against acoustic variations.
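The adversarial objective described in the abstract can be illustrated with a small sketch. The abstract does not say how the adversarial signal is applied, so the sketch assumes a gradient-reversal scheme of the kind used in domain-adversarial training: the gradient that reaches the speaker encoder is the speaker-loss gradient minus a weighted augmentation-classifier gradient. All names (`ce_loss_and_input_grad`, `encoder_side_gradient`, `lam`), the linear softmax heads, and the toy weights are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_loss_and_input_grad(W, x, label):
    # Linear softmax head: logits = W @ x, where W is a list of rows.
    # Returns the cross-entropy loss and its gradient with respect to x.
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    p = softmax(logits)
    loss = -math.log(p[label])
    dz = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    dx = [sum(W[i][j] * dz[i] for i in range(len(W))) for j in range(len(x))]
    return loss, dx

def encoder_side_gradient(emb, W_spk, spk_label, W_aug, aug_label, lam=1.0):
    # Gradient that would flow back into the encoder through the embedding.
    # ASSUMPTION: gradient reversal. The augmentation-classifier gradient is
    # subtracted (scaled by lam), so a descent step makes the augmentation
    # type harder to predict from the embedding.
    spk_loss, g_spk = ce_loss_and_input_grad(W_spk, emb, spk_label)
    aug_loss, g_aug = ce_loss_and_input_grad(W_aug, emb, aug_label)
    g = [a - lam * b for a, b in zip(g_spk, g_aug)]
    return spk_loss, aug_loss, g

# Toy setup (hypothetical values): a 3-dim embedding, 2 speakers,
# 3 augmentation types.
emb = [0.5, -0.2, 0.1]
W_spk = [[1.0, 0.0, 0.5], [-0.5, 1.0, 0.0]]
W_aug = [[0.2, -0.3, 0.8], [0.0, 0.5, -0.4], [-0.6, 0.1, 0.3]]

spk_before, aug_before, g = encoder_side_gradient(emb, W_spk, 0, W_aug, 1)
# One small descent step on the embedding along the combined gradient.
emb_after = [e - 0.01 * gi for e, gi in zip(emb, g)]
spk_after, aug_after, _ = encoder_side_gradient(emb_after, W_spk, 0, W_aug, 1)

print("speaker loss:     ", spk_before, "->", spk_after)
print("augmentation loss:", aug_before, "->", aug_after)
```

In a full training loop the augmentation classifier's own weights would still be updated with the ordinary (unreversed) gradient of its loss, so that it keeps trying to identify the augmentation type while the encoder tries to hide it. That part is omitted here for brevity.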