Speech-based Age and Gender Prediction with Transformers (2306.16962v1)
Published 29 Jun 2023 in cs.SD and eess.AS
Abstract: We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). Compared to a modelling approach built on handcrafted features, our proposed system shows an improvement of 9% UAR for age and 4% UAR for gender. To make our findings reproducible, we release the best performing model to the community as well as the sample lists of the data splits.
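The abstract reports three metrics: mean absolute error (MAE) for age in years, accuracy (ACC) for the three gender classes, and unweighted average recall (UAR), which averages the per-class recalls so that each class counts equally regardless of how many samples it has. The following is a minimal sketch of these standard definitions, not the authors' evaluation code. The function names and the toy labels are hypothetical, and the abstract does not state how UAR is computed for age, so the sketch only covers the generic class-label case.

```python
from collections import defaultdict


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, e.g. between true and predicted age in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class present in y_true has equal weight."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data (hypothetical, for illustration only).
ages_true = [20, 40, 60]
ages_pred = [25, 38, 60]
gender_true = ["female", "female", "female", "male", "child"]
gender_pred = ["female", "female", "male", "male", "child"]

mae = mean_absolute_error(ages_true, ages_pred)
acc = accuracy(gender_true, gender_pred)
uar = unweighted_average_recall(gender_true, gender_pred)
```

On imbalanced data ACC and UAR can diverge: in the toy example the single error falls in the largest class, so UAR comes out higher than ACC, while errors concentrated in a small class would pull UAR below ACC.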