SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition (2306.10559v2)
Abstract: The Streaming Unmixing and Recognition Transducer (SURT) model was recently proposed as an end-to-end approach for continuous, streaming, multi-talker automatic speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage- and omission-related errors; (ii) it is computationally expensive, which has so far prevented its adoption in academia; and (iii) it has only been evaluated on synthetic mixtures. In this work, we propose several modifications to the original SURT that are carefully designed to address these limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v) use auxiliary objectives in the form of a masking loss and an encoder CTC loss, and (vi) perform domain adaptation for far-field recognition. We show that these modifications allow SURT 2.0 to outperform its predecessor on multi-talker ASR while being efficient enough to train with academic resources. We conduct our evaluations on three publicly available meeting benchmarks (LibriCSS, AMI, and ICSI), where our best model achieves WERs of 16.9%, 44.6%, and 32.2%, respectively, on far-field unsegmented recordings. We release training recipes and pre-trained models at https://sites.google.com/view/surt2.
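To make the "unmixing plus recognition with auxiliary objectives" structure concrete, here is a minimal PyTorch-style sketch of one training step, not the released icefall recipe: the modules and weights (`mask_estimator`, `encoder`, `transducer_head`, `ctc_head`, `lambda_mask`, `lambda_ctc`) are illustrative placeholders, the branch-to-reference assignment is assumed fixed for simplicity, and the transducer loss is abstracted behind a hypothetical callable.

```python
# Hypothetical sketch of a SURT-style training step (assumptions noted in the
# lead-in): a mask estimator "unmixes" the mixture into two branches, a shared
# encoder/transducer recognizes each branch, and masking + encoder CTC losses
# are added as auxiliary objectives. Branch-to-target assignment is simplified
# to a fixed order here; the actual recipe may assign branches differently.
import torch
import torch.nn.functional as F


def surt_training_step(mixture_feats, branch_targets, clean_branch_feats,
                       mask_estimator, encoder, transducer_head, ctc_head,
                       lambda_mask=0.2, lambda_ctc=0.2):
    # mixture_feats:      (B, T, F) features of the overlapped mixture
    # branch_targets:     two lists (one per branch) of B 1-D token-id tensors
    # clean_branch_feats: (B, 2, T, F) reference features for the masking loss
    masks = torch.sigmoid(mask_estimator(mixture_feats))    # (B, 2, T, F)
    branch_feats = masks * mixture_feats.unsqueeze(1)        # masked branches

    total = 0.0
    for b in range(2):
        enc_out = encoder(branch_feats[:, b])                # (B, T', D)

        # Main objective: transducer loss for this branch (placeholder call).
        total = total + transducer_head(enc_out, branch_targets[b])

        # Auxiliary 1: CTC loss on the encoder output (blank id 0 assumed).
        log_probs = F.log_softmax(ctc_head(enc_out), dim=-1).transpose(0, 1)
        in_lens = torch.full((enc_out.size(0),), enc_out.size(1),
                             dtype=torch.long)
        tgt_lens = torch.tensor([t.numel() for t in branch_targets[b]])
        total = total + lambda_ctc * F.ctc_loss(
            log_probs, torch.cat(branch_targets[b]), in_lens, tgt_lens)

        # Auxiliary 2: masking loss pulling masked features toward the
        # clean single-branch references.
        total = total + lambda_mask * F.mse_loss(
            branch_feats[:, b], clean_branch_feats[:, b])
    return total
```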