Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR (2309.16093v1)

Published 28 Sep 2023 in eess.AS and cs.SD

Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a connectionist temporal classification (CTC) based ASR system in which hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in the cross-modality alignment process, where transformer attention is a special case of this Sinkhorn attention process. The CMKT learning is intended to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with 3.64% and 3.94% character error rates (CERs) on the development and test sets, corresponding to relative improvements of 34.18% and 34.88% over the baseline CTC-ASR system, respectively.
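
To make the two ingredients named in the abstract concrete, the sketch below shows (a) a Sinkhorn-normalized cross-attention in which a single row normalization recovers ordinary softmax attention (the sense in which transformer attention is a special case), and (b) CTC greedy decoding without any language model. This is a minimal illustration under assumptions, not the authors' implementation; the function names, tensor shapes, and number of Sinkhorn iterations are illustrative choices.

```python
import torch


def sinkhorn_attention(q, k, v, n_iters: int = 3):
    """Cross-attention with Sinkhorn-normalized weights (illustrative sketch).

    q: (B, Tq, d) queries, e.g. acoustic encoder states.
    k, v: (B, Tk, d) keys/values, e.g. linguistic (PLM) token states.
    n_iters = 1 performs only a single row normalization, i.e. ordinary
    softmax attention; larger values alternate column/row normalizations,
    pushing the attention matrix toward a doubly stochastic one.
    """
    d = q.size(-1)
    log_p = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, Tq, Tk)
    for it in range(n_iters):
        if it > 0:
            # column normalization (over query positions)
            log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)
        # row normalization (over key positions)
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
    return torch.matmul(log_p.exp(), v)  # (B, Tq, d)


def ctc_greedy_decode(log_probs, blank_id: int = 0):
    """CTC greedy decoding: frame-wise argmax, collapse repeats, drop blanks."""
    best = log_probs.argmax(dim=-1)  # (B, T)
    hyps = []
    for seq in best.tolist():
        prev, tokens = blank_id, []
        for t in seq:
            if t != blank_id and t != prev:
                tokens.append(t)
            prev = t
        hyps.append(tokens)
    return hyps
```

In a CMKT-style setup one would presumably use the acoustic encoder states as queries and the PLM token representations as keys/values, with the aligned output driving an auxiliary alignment objective alongside the CTC loss; that wiring is an assumption for illustration, not a detail taken from the abstract.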
