Challenges and Insights: Exploring 3D Spatial Features and Complex Networks on the MISP Dataset (2310.03901v1)
Abstract: Multi-channel multi-talker speech recognition presents formidable challenges in the realm of speech processing, marked by issues such as background noise, reverberation, and overlapping speech. Overcoming these complexities requires leveraging contextual cues to separate target speech from a cacophonous mix, enabling accurate recognition. Among these cues, the 3D spatial feature has emerged as a cutting-edge solution, particularly when equipped with spatial information about the target speaker. Its exceptional ability to discern the target speaker within mixed audio, often rendering intermediate processing redundant, paves the way for the direct training of "All-in-one" ASR models. These models have demonstrated commendable performance on both simulated and real-world data. In this paper, we extend this approach to the MISP dataset to further validate its efficacy. We delve into the challenges encountered and insights gained when applying 3D spatial features to MISP, while also exploring preliminary experiments that replace these features with more complex inputs and models.
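The abstract's core idea, extracting a target speaker using spatial information about their position, can be illustrated with a directional "angle" feature, a common building block of 3D spatial features. The sketch below is a minimal, hypothetical illustration (not the paper's implementation): it assumes a linear microphone array with known geometry and a known target direction of arrival (DOA), and measures, per time-frequency bin, how well the observed inter-channel phase differences match those expected for a plane wave from the target direction.

```python
import numpy as np

def angle_feature(stfts, mic_pos, doa_deg, fs=16000, c=343.0):
    """Directional ("angle") feature for a target speaker (illustrative sketch).

    stfts:   complex array (channels, frames, freq_bins) of per-channel STFTs
    mic_pos: (channels,) microphone x-positions in metres (linear array assumed)
    doa_deg: assumed target direction of arrival in degrees
    Returns: (frames, freq_bins) array; values near 1 indicate bins dominated
             by energy arriving from the target direction.
    """
    n_ch, n_frames, n_bins = stfts.shape
    n_fft = 2 * (n_bins - 1)
    freqs = np.arange(n_bins) * fs / n_fft  # centre frequency of each bin (Hz)

    # Expected inter-channel delay (relative to mic 0) for a plane wave
    # arriving from doa_deg, and the corresponding phase per frequency bin.
    tau = (mic_pos - mic_pos[0]) * np.cos(np.deg2rad(doa_deg)) / c  # seconds
    target_phase = 2 * np.pi * freqs[None, :] * tau[:, None]        # (ch, bins)

    # Observed inter-channel phase difference (IPD) relative to mic 0.
    ipd = np.angle(stfts) - np.angle(stfts[0:1])                    # (ch, frames, bins)

    # Cosine of the mismatch between observed IPD and the target steering
    # phase; cosine is wrap-invariant, so no explicit phase unwrapping needed.
    af = np.cos(ipd - target_phase[:, None, :])
    return af[1:].mean(axis=0)                                      # (frames, bins)
```

In a full "All-in-one" system of the kind the abstract describes, such a feature map would be concatenated with spectral features and fed directly to the ASR model, letting the network suppress interfering speakers without an explicit separation front-end.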