
VoxCeleb-ESP: preliminary experiments detecting Spanish celebrities from their voices (2401.09441v1)

Published 20 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: This paper presents VoxCeleb-ESP, a collection of pointers and timestamps to YouTube videos facilitating the creation of a novel speaker recognition dataset. VoxCeleb-ESP captures real-world scenarios, incorporating diverse speaking styles, noises, and channel distortions. It includes 160 Spanish celebrities spanning various categories, ensuring a representative distribution across age groups and geographic regions in Spain. We provide two speaker trial lists for speaker identification tasks, one restricted to same-video target trials and the other to different-video target trials, accompanied by a cross-lingual evaluation of ResNet pretrained models. Preliminary speaker identification results suggest that the complexity of the detection task in VoxCeleb-ESP is equivalent to that of the original and much larger VoxCeleb in English. VoxCeleb-ESP contributes to the expansion of speaker recognition benchmarks with a comprehensive and diverse dataset for the Spanish language.
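To illustrate the kind of closed-set speaker identification protocol the abstract describes, the sketch below scores one test utterance against a set of enrolled speakers using cosine similarity over embeddings. All specifics here are assumptions for illustration: the manifest field layout (speaker ID, video ID, start/end timestamps), the speaker IDs, and the toy embedding vectors are hypothetical and stand in for the dataset's actual pointer format and for ResNet-derived embeddings.

```python
import math

# Hypothetical manifest entries: VoxCeleb-ESP distributes pointers and
# timestamps rather than audio. One plausible layout per entry:
# (speaker_id, youtube_video_id, start_sec, end_sec) -- illustrative only.
manifest = [
    ("spk001", "VIDEO_ID_A", 12.5, 19.0),
    ("spk002", "VIDEO_ID_B", 40.0, 47.5),
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy enrollment embeddings standing in for pretrained-model outputs.
enroll = {"spk001": [1.0, 0.1, 0.0], "spk002": [0.0, 1.0, 0.2]}
test_emb = [0.9, 0.2, 0.0]  # embedding of one test utterance

# Closed-set identification: pick the enrolled speaker whose embedding
# is most similar to the test utterance's embedding.
pred = max(enroll, key=lambda s: cosine(enroll[s], test_emb))
print(pred)  # → spk001
```

In a same-video trial list, the test utterance and the target speaker's enrollment segments would come from the same YouTube video (shared channel and recording conditions); in a different-video list they would not, which typically makes the task harder.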
