Exploiting spatial information with the informed complex-valued spatial autoencoder for target speaker extraction (2210.15512v2)

Published 27 Oct 2022 in eess.AS and cs.SD

Abstract: In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, it has been shown that joint spectro-spatial filtering is more beneficial for neural spatial filters. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) to the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is well suited to learning pronounced spatial selectivity patterns, and we show that the results depend significantly on the training target and on the reference signal used when computing the evaluation metrics.
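The core operation the abstract alludes to, applying a learned, time-varying complex filter to each microphone channel and summing across channels ("filter-and-sum"), can be sketched minimally. This is not the paper's implementation: `predict_weights` and `doa_feature` below are hypothetical placeholders standing in for the COSPA network and for whatever position encoding the authors actually use.

```python
import numpy as np

# Minimal sketch of complex filter-and-sum, assuming a network
# (not reproduced here) predicts per-channel, per-time-frequency
# complex weights conditioned on the target speaker's direction.

rng = np.random.default_rng(0)
C, F, T = 4, 257, 50  # channels, frequency bins, time frames
# Toy multichannel STFT of a speaker mixture (random placeholder data).
X = rng.standard_normal((C, F, T)) + 1j * rng.standard_normal((C, F, T))

def doa_feature(azimuth_rad, F, T):
    """Hypothetical direction feature: broadcast the target azimuth
    over the time-frequency grid. The paper's exact encoding of the
    speaker position is an assumption not shown here."""
    return np.full((1, F, T), azimuth_rad)

def predict_weights(X, doa):
    """Placeholder for the learned network. It would map the
    multichannel spectrum plus the direction feature to complex,
    time-varying filter weights of shape (C, F, T). Here we return
    uniform weights, i.e. a trivial delay-and-sum-like filter."""
    C = X.shape[0]
    return np.full(X.shape, 1.0 / C, dtype=complex)

W = predict_weights(X, doa_feature(np.pi / 4, F, T))
# Filter-and-sum: one enhanced single-channel spectrum.
Y = np.sum(np.conj(W) * X, axis=0)  # shape (F, T)
```

With the uniform placeholder weights, `Y` reduces to the channel mean; a trained spectro-spatial network would instead produce weights that vary over time and frequency to pass the informed direction and suppress interferers.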

