
Privacy-oriented manipulation of speaker representations (2310.06652v2)

Published 10 Oct 2023 in eess.AS

Abstract: Speaker embeddings are ubiquitous, with applications ranging from speaker recognition and diarization to speech synthesis and voice anonymisation. The amount of information held by these embeddings lends them versatility, but also raises privacy concerns. Speaker embeddings have been shown to contain information on age, sex, health and more, which speakers may want to keep private, especially when this information is not required for the target task. In this work, we propose a method for removing and manipulating private attributes from speaker embeddings that leverages a Vector-Quantized Variational Autoencoder architecture, combined with an adversarial classifier and a novel mutual information loss. We validate our model on two attributes, sex and age, and perform experiments with ignorant and fully-informed attackers, and with in-domain and out-of-domain data.
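
As a rough illustration of the vector-quantization step that gives a Vector-Quantized Variational Autoencoder its name, the sketch below maps each input vector to its nearest entry in a codebook. It is not the authors' implementation: the function name `quantize`, the codebook size, the embedding dimension, and the random toy data are all assumptions chosen for illustration, and the adversarial classifier and mutual information loss mentioned in the abstract are not modelled here.

```python
import numpy as np


def quantize(embeddings, codebook):
    """Map each embedding to its nearest codebook vector (Euclidean distance).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # Pairwise differences via broadcasting: shape (n_embeddings, n_codes, dim)
    diffs = embeddings[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distances: shape (n_embeddings, n_codes)
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each embedding
    indices = np.argmin(sq_dists, axis=1)
    return codebook[indices], indices


# Toy data (assumed sizes): 8 codes of dimension 4, 3 stand-in "speaker embeddings"
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))
embeddings = rng.normal(size=(3, 4))

quantized, indices = quantize(embeddings, codebook)
```

In a trained model the codebook would be learned jointly with an encoder and decoder; here it is random, so the sketch only shows the nearest-neighbour lookup itself.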
