A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model (2401.02673v1)

Published 5 Jan 2024 in eess.AS, cs.AI, and cs.SD

Abstract: Far-field speech recognition is a challenging task that conventionally relies on signal-processing beamforming to combat noise and interference, but performance is usually limited by the heavy reliance on assumptions about the acoustic environment. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) recognizer, extending the end-to-end speech recognition system to include speech enhancement. The whole framework is then jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) is adopted to form the neural beamformer, and several pooling strategies for combining look directions are compared to find the optimal approach. Moreover, source-direction information is integrated into the beamformer to explore its usefulness as a prior, since such information is often available, especially in multi-modality scenarios. Experiments on different microphone array geometries evaluate robustness to variations in microphone spacing. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26% improvement over a strong baseline.
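The front-end the abstract describes lends itself to a compact illustration: one learned complex spatial filter per look direction, followed by a pooling step that collapses the look-direction axis before the features reach the recognizer. Below is a minimal PyTorch sketch of that idea only, under stated assumptions: the class name `FCLPFrontEnd`, the tensor shapes, the number of look directions, and the max/average pooling options are all illustrative, and the spectral stage of fCLP, the source-direction prior, and joint training with the LAS model are omitted. This is not the authors' implementation.

```python
import torch
import torch.nn as nn

class FCLPFrontEnd(nn.Module):
    """Sketch of an fCLP-style multichannel front-end: one learned complex
    spatial filter per look direction, then pooling across look directions.
    Names, shapes, and pooling choices are illustrative assumptions."""

    def __init__(self, n_mics: int = 4, n_freq: int = 257,
                 n_looks: int = 8, pooling: str = "max"):
        super().__init__()
        # Complex filter weights per (look direction, frequency bin, microphone),
        # stored as separate real/imaginary parts so the parameters stay real.
        self.w_real = nn.Parameter(0.01 * torch.randn(n_looks, n_freq, n_mics))
        self.w_imag = nn.Parameter(0.01 * torch.randn(n_looks, n_freq, n_mics))
        self.pooling = pooling

    def forward(self, stft: torch.Tensor) -> torch.Tensor:
        # stft: complex tensor of shape (batch, time, freq, mics).
        w = torch.complex(self.w_real, self.w_imag)          # (looks, freq, mics)
        # Spatial filtering: weighted sum over microphones, one beam per look.
        beams = torch.einsum("btfm,lfm->btfl", stft, w)      # (batch, time, freq, looks)
        energy = beams.abs().pow(2).clamp_min(1e-10).log()   # log-power features
        if self.pooling == "max":
            return energy.max(dim=-1).values                 # keep strongest look
        return energy.mean(dim=-1)                           # average across looks


# Example with hypothetical sizes: batch of 2, 100 frames, 257 bins, 4 mics.
x = torch.randn(2, 100, 257, 4, dtype=torch.cfloat)
feats = FCLPFrontEnd()(x)                                    # (2, 100, 257)
```

Max pooling keeps only the strongest look direction per time-frequency cell, while average pooling blends all of them; the paper compares several such strategies to find the best way to combine look directions.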

Authors (9)
  1. Dongdi Zhao (1 paper)
  2. Jianbo Ma (9 papers)
  3. Lu Lu (189 papers)
  4. Jinke Li (7 papers)
  5. Xuan Ji (11 papers)
  6. Lei Zhu (280 papers)
  7. Fuming Fang (13 papers)
  8. Ming Liu (421 papers)
  9. Feijun Jiang (13 papers)
