
Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks (2403.05393v1)

Published 8 Mar 2024 in eess.AS

Abstract: Studies have shown that in noisy acoustic environments, providing binaural signals to the user of an assistive listening device may improve speech intelligibility and spatial awareness. This paper presents a binaural speech enhancement method using a complex convolutional neural network with an encoder-decoder architecture and a complex multi-head attention transformer. The model estimates individual complex ratio masks in the time-frequency domain for the left- and right-ear channels of binaural hearing devices, and is trained with a novel loss function that combines the preservation of spatial information with speech intelligibility improvement and noise reduction. Simulation results for acoustic scenarios with a single target speaker and isotropic noise of various types show that the proposed method improves estimated binaural speech intelligibility and preserves binaural cues better than several baseline algorithms.
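
As a rough illustration of the ideas in the abstract, the PyTorch sketch below shows how estimated complex ratio masks can be applied to binaural noisy STFTs and how a loss combining noise reduction with interaural-cue preservation might be formed. This is not the authors' code: the tensor shapes, the ILD/IPD penalty terms, and the weights alpha/beta are assumptions, and the paper's actual loss additionally incorporates a speech intelligibility (STOI-based) term.

```python
# Minimal sketch (assumptions, not the paper's implementation): complex ratio
# masking of binaural STFTs plus an illustrative loss with a spatial-cue term.
import torch


def apply_crm(noisy_stft: torch.Tensor, crm: torch.Tensor) -> torch.Tensor:
    """Complex ratio masking: element-wise complex product in the T-F domain.

    noisy_stft, crm: complex tensors of shape (batch, freq, time).
    """
    return noisy_stft * crm


def binaural_cue_loss(est_l, est_r, ref_l, ref_r, eps=1e-8):
    """Illustrative spatial-cue term: penalise deviations of the interaural
    level difference (ILD) and interaural phase difference (IPD) of the
    enhanced signals from those of the clean reference signals."""
    ild_est = 20 * torch.log10((est_l.abs() + eps) / (est_r.abs() + eps))
    ild_ref = 20 * torch.log10((ref_l.abs() + eps) / (ref_r.abs() + eps))
    ipd_est = torch.angle(est_l * est_r.conj())
    ipd_ref = torch.angle(ref_l * ref_r.conj())
    ild_term = (ild_est - ild_ref).abs().mean()
    ipd_term = (1 - torch.cos(ipd_est - ipd_ref)).mean()
    return ild_term + ipd_term


def total_loss(est_l, est_r, ref_l, ref_r, alpha=1.0, beta=0.1):
    """Hedged combination: magnitude reconstruction (a simple noise-reduction
    proxy) plus spatial-cue preservation; alpha and beta are placeholders."""
    recon = (est_l.abs() - ref_l.abs()).abs().mean() \
        + (est_r.abs() - ref_r.abs()).abs().mean()
    return alpha * recon + beta * binaural_cue_loss(est_l, est_r, ref_l, ref_r)


if __name__ == "__main__":
    # Toy usage with random complex T-F tensors (batch=1, freq=257, time=100).
    n_l = torch.randn(1, 257, 100, dtype=torch.complex64)
    n_r = torch.randn(1, 257, 100, dtype=torch.complex64)
    crm_l = torch.randn(1, 257, 100, dtype=torch.complex64)
    crm_r = torch.randn(1, 257, 100, dtype=torch.complex64)
    e_l, e_r = apply_crm(n_l, crm_l), apply_crm(n_r, crm_r)
    print(total_loss(e_l, e_r, n_l, n_r).item())
```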
