3S-TSE: Efficient Three-Stage Target Speaker Extraction for Real-Time and Low-Resource Applications (2312.10979v2)

Published 18 Dec 2023 in cs.SD and eess.AS

Abstract: Target speaker extraction (TSE) aims to isolate a specific voice from a mixture of multiple speakers using a registered enrollment sample. Because voiceprint features usually vary greatly, current end-to-end neural networks require large numbers of parameters, making them computationally intensive and impractical for real-time applications, especially on resource-constrained platforms. In this paper, we address the TSE task with a microphone array and introduce a novel three-stage solution that systematically decouples the process: first, a neural network is trained to estimate the direction of the target speaker; second, with the direction determined, a Generalized Sidelobe Canceller (GSC) is used to extract the target speech; third, an Inplace Convolutional Recurrent Neural Network (ICRN) acts as a denoising post-processor, refining the GSC output to yield the final separated speech. Our approach delivers superior performance while drastically reducing computational load, setting a new standard for efficient real-time target speaker extraction.
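
The pipeline's middle stage is classical array processing, so a small worked sketch may help make the decoupling concrete. The following is a minimal time-domain Generalized Sidelobe Canceller written under several simplifying assumptions not stated in the paper: a far-field source, a linear microphone array with known geometry, integer-sample steering delays, a pairwise-difference blocking matrix, and an NLMS adaptive canceller. The direction estimate (Stage 1) and the ICRN post-filter (Stage 3) would come from separately trained networks and are not shown; all function and parameter names here are hypothetical.

```python
import numpy as np

def steering_delays(doa_deg, mic_positions, c=343.0, fs=16000):
    """Per-microphone steering delays (in samples) for a far-field source at
    angle doa_deg, for a linear array with mic_positions given in metres."""
    tau = mic_positions * np.cos(np.deg2rad(doa_deg)) / c
    return np.round((tau - tau.min()) * fs).astype(int)

def gsc_beamform(mixture, doa_deg, mic_positions, fs=16000, mu=0.1, order=64):
    """Minimal time-domain GSC: a delay-and-sum fixed beamformer, a
    pairwise-difference blocking matrix, and an NLMS adaptive noise canceller.
    mixture has shape (num_mics, num_samples)."""
    num_mics, n = mixture.shape
    delays = steering_delays(doa_deg, mic_positions, fs=fs)

    # Time-align all channels toward the estimated direction of arrival.
    aligned = np.stack([np.roll(mixture[m], -delays[m]) for m in range(num_mics)])

    fixed = aligned.mean(axis=0)          # fixed beamformer (target-preserving path)
    blocked = aligned[1:] - aligned[:-1]  # blocking matrix (target-cancelling path)

    # NLMS adaptive canceller: estimate residual noise from the blocked
    # channels and subtract it from the fixed-beamformer output.
    w = np.zeros((num_mics - 1, order))
    out = np.zeros(n)
    for t in range(order, n):
        frame = blocked[:, t - order:t]
        noise_est = np.sum(w * frame)
        err = fixed[t] - noise_est
        w += mu * err * frame / (np.sum(frame ** 2) + 1e-8)
        out[t] = err
    return out

# Example: a 4-mic linear array with 5 cm spacing and a source at 60 degrees.
mics = np.arange(4) * 0.05
mix = np.random.randn(4, 16000)           # stand-in for a real recorded mixture
target_estimate = gsc_beamform(mix, 60.0, mics)
```

In this sketch the GSC output would then be passed to the ICRN post-processor for the final denoising step described in the abstract.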
