Cross-modal Alignment with Optimal Transport for CTC-based ASR (2309.13650v1)

Published 24 Sep 2023 in eess.AS and cs.SD

Abstract: Connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end-to-end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required, which undermines its fast parallel decoding property. Several studies have proposed transferring linguistic knowledge from a pretrained language model (PLM) to CTC-based ASR. Since the PLM is built from text while the acoustic model is trained on speech, a cross-modal alignment is required to transfer context-dependent linguistic knowledge from the PLM to the acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT and is then used to transform a latent acoustic representation to match the context-dependent linguistic features encoded by the PLM. Through this alignment, the latent acoustic feature is forced to encode context-dependent linguistic information. We integrate this latent acoustic feature into a conformer encoder-based CTC ASR system. On the AISHELL-1 corpus, our system achieves character error rates (CER) of 3.96% and 4.27% on the dev and test sets, respectively, corresponding to relative improvements of 28.39% and 29.42% over the baseline conformer CTC ASR system without cross-modal knowledge transfer.
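The abstract describes the mechanism at a high level; the sketch below illustrates the general recipe in PyTorch: an entropy-regularized OT coupling between acoustic frames and PLM token features is computed with Sinkhorn iterations, and the coupling is then used to transport the latent acoustic representation into the linguistic feature space, where an alignment loss can be applied. The tensor shapes, the cosine-distance cost, the uniform marginals, and the MSE loss are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F

def sinkhorn_coupling(cost, eps=0.1, n_iters=50):
    """Entropy-regularized OT plan via Sinkhorn iterations.

    cost: (T, N) pairwise cost between T acoustic frames and N text tokens.
    Returns a (T, N) transport coupling with uniform marginals (an assumption;
    the abstract does not state the marginals used).
    """
    T, N = cost.shape
    mu = torch.full((T,), 1.0 / T)        # mass on acoustic frames
    nu = torch.full((N,), 1.0 / N)        # mass on text tokens
    K = torch.exp(-cost / eps)            # Gibbs kernel
    u = torch.ones(T)
    for _ in range(n_iters):              # alternate marginal-scaling updates
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)

# Hypothetical features: 120 encoder frames and 20 PLM token embeddings, dim 256.
acoustic = torch.randn(120, 256)      # latent acoustic representation
linguistic = torch.randn(20, 256)     # context-dependent PLM features

# Cosine-distance cost matrix (one common choice for cross-modal OT).
cost = 1.0 - F.normalize(acoustic, dim=-1) @ F.normalize(linguistic, dim=-1).t()

plan = sinkhorn_coupling(cost)

# Barycentric projection: each token position receives a coupling-weighted
# average of acoustic frames, i.e., the acoustic representation transformed
# onto the token time scale.
aligned_acoustic = (plan / plan.sum(dim=0, keepdim=True)).t() @ acoustic  # (20, 256)

# An alignment loss pulls the transported acoustic features toward the PLM
# features, forcing the encoder to carry contextual linguistic information.
align_loss = F.mse_loss(aligned_acoustic, linguistic)
```

In training, such an alignment loss would be added to the CTC objective so that the conformer encoder learns linguistically informed representations while decoding remains parallel. As a sanity check on the reported numbers, a 28.39% relative improvement down to 3.96% CER implies a baseline dev CER of about 3.96 / (1 − 0.2839) ≈ 5.53%, and 4.27 / (1 − 0.2942) ≈ 6.05% on the test set.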
