Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR (2309.16093v1)
Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework for a connectionist temporal classification (CTC) based ASR system in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, of which transformer attention is a special case. The CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with character error rates (CERs) of 3.64% and 3.94% on the development and test sets, corresponding to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system, respectively.
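The abstract states that transformer attention is a special case of Sinkhorn attention. The sketch below illustrates that relationship under stated assumptions: attention scores are normalized with a few Sinkhorn iterations (alternating row and column normalization in log space), and with a single iteration the procedure collapses to the ordinary row softmax of standard attention. Shapes, names, and the iteration count are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of Sinkhorn attention (assumption: scores normalized with a
# few log-space Sinkhorn iterations; n_iters=1 recovers standard softmax attention).
import numpy as np

def log_sinkhorn(logits: np.ndarray, n_iters: int = 3) -> np.ndarray:
    """Alternately normalize rows and columns of exp(logits) in log space."""
    log_p = logits
    for i in range(n_iters):
        # Row normalization (this is all that standard softmax attention does).
        log_p = log_p - np.logaddexp.reduce(log_p, axis=-1, keepdims=True)
        if i < n_iters - 1:
            # Column normalization pushes the matrix toward doubly stochastic.
            log_p = log_p - np.logaddexp.reduce(log_p, axis=-2, keepdims=True)
    return np.exp(log_p)

def sinkhorn_attention(q, k, v, n_iters: int = 3):
    """Cross-attention where acoustic queries attend to linguistic keys/values."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)  # (T_acoustic, T_text)
    attn = log_sinkhorn(scores, n_iters=n_iters)    # plain row softmax if n_iters == 1
    return attn @ v

# Toy usage: 6 acoustic frames aligned against 4 text tokens, model dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 8))  # e.g. acoustic encoder states
k = rng.standard_normal((4, 8))  # e.g. PLM (BERT-like) token representations
v = rng.standard_normal((4, 8))
print(sinkhorn_attention(q, k, v).shape)  # (6, 8)
```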
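The reported results use CTC greedy decoding for inference. As a reminder of what that entails, here is a minimal sketch, assuming the usual recipe of taking the per-frame argmax, collapsing consecutive repeats, and dropping the blank symbol; the blank index and toy vocabulary are illustrative assumptions.

```python
# Minimal sketch of CTC greedy decoding (assumption: blank_id = 0).
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> list[int]:
    """log_probs: (T, V) frame-level CTC outputs; returns decoded token ids."""
    best_path = log_probs.argmax(axis=-1)        # per-frame best label
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank_id:  # collapse repeats, drop blanks
            decoded.append(int(label))
        prev = label
    return decoded

# Toy usage: 5 frames over a 4-symbol vocabulary (0 = blank).
rng = np.random.default_rng(0)
print(ctc_greedy_decode(rng.standard_normal((5, 4))))
```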