Large Language Models are Efficient Learners of Noise-Robust Speech Recognition (2401.10446v1)
Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning abilities of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR models do, where one solution is to introduce noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from an audio encoder could harm LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of the source speech, which can promote the denoising process in GER. Furthermore, to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation that distills the real noise information in audio embeddings into our language embedding. Experiments on various latest LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate, while using limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of the source speech, under which off-the-shelf LLMs show a strong ability for language-space denoising.
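To make the two key ideas in the abstract concrete, below is a minimal PyTorch sketch of (i) a language-space noise embedding computed from the diversity of the N-best hypotheses and (ii) a MINE-style mutual information lower bound (Belghazi et al., 2018) used as the knowledge distillation signal. All module names, dimensions, and the specific diversity measure are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn


class NBestNoiseEmbedding(nn.Module):
    """Illustrative language-space noise extractor: noisier source speech
    tends to yield a more diverse N-best list, so the elementwise spread
    of the hypothesis sentence embeddings can stand in for the noise
    conditions. The diversity measure here is an assumption."""

    def __init__(self, sent_dim: int, emb_dim: int):
        super().__init__()
        self.proj = nn.Linear(sent_dim, emb_dim)

    def forward(self, hyp_embs: torch.Tensor) -> torch.Tensor:
        # hyp_embs: (N, sent_dim) sentence embeddings of the N-best
        # hypotheses, e.g. from Sentence-BERT (Reimers & Gurevych, 2019).
        n = hyp_embs.size(0)
        diffs = (hyp_embs.unsqueeze(0) - hyp_embs.unsqueeze(1)).abs()  # (N, N, D)
        off_diag = ~torch.eye(n, dtype=torch.bool)                     # drop self-pairs
        noise = diffs[off_diag].mean(dim=0)                            # (D,)
        return self.proj(noise)                                        # (emb_dim,)


class MINE(nn.Module):
    """Mutual information neural estimation (Belghazi et al., 2018):
    a statistics network T scores joint pairs against shuffled
    (marginal) pairs, giving the Donsker-Varadhan lower bound
        I(X; Y) >= E_joint[T(x, y)] - log E_marginal[exp(T(x, y))]."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        joint = self.net(torch.cat([x, y], dim=-1)).mean()
        y_perm = y[torch.randperm(y.size(0))]            # break the pairing
        marg = self.net(torch.cat([x, y_perm], dim=-1))
        # log E[exp(T)] computed stably via logsumexp.
        return joint - (torch.logsumexp(marg, dim=0) - math.log(y.size(0))).squeeze()


# Hypothetical KD step with illustrative shapes: maximize the MI bound
# between the language-space noise embedding and the audio encoder's
# noise embedding, i.e. minimize its negation during fine-tuning.
mine = MINE(x_dim=512, y_dim=512)
lang_noise = torch.randn(8, 512)    # batch of language-space noise embeddings
audio_noise = torch.randn(8, 512)   # batch of audio-side noise embeddings (teacher)
kd_loss = -mine(lang_noise, audio_noise)
```

In the paper's setting, such a KD term would be added to the efficient LLM fine-tuning objective so that the language embedding absorbs the real noise information from the audio side, while the LLM itself never has to consume cross-modal audio features directly.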
- Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5421–5425. IEEE, 2015.
- Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540. PMLR, 2018.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Generative error correction for code-switching speech recognition using large language models. arXiv preprint arXiv:2310.13013, 2023a.
- HyPoradise: An open baseline for generative speech recognition with large language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b.
- X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023c.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Prompting large language models with speech recognition abilities. arXiv preprint arXiv:2307.11795, 2023.
- Trapping LLM hallucinations using tagged context prompts. arXiv preprint arXiv:2306.06085, 2023.
- Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp. 411–412, 2013.
- MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031–2041. PMLR, 2019.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, 2023a.
- Joint audio and speech understanding. In IEEE Proc. ASRU, 2023b.
- The RATS collection: Supporting HLT research with degraded audio data. In LREC, pp. 1970–1977. Citeseer, 2014.
- Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
- A spelling correction model for end-to-end speech recognition. In Proc. ICASSP, pp. 5651–5655. IEEE, 2019.
- The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW), 2000.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7799–7803. IEEE, 2020.
- Improving deliberation by text-only and semi-supervised training. arXiv preprint arXiv:2206.14716, 2022.
- Scaling up deliberation for multilingual ASR. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 771–776. IEEE, 2023.
- Yi Hu and Philipos C Loizou. Subjective comparison of speech enhancement algorithms. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pp. I–I. IEEE, 2006.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Speech recognition with no speech or with noisy speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1090–1094. IEEE, 2019.
- FastCorrect 2: Fast error correction on multiple candidates for automatic speech recognition. arXiv preprint arXiv:2109.14420, 2021.
- An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, 2014.
- Robust automatic speech recognition: a bridge to practical applications, chapter 1, pp. 1–20. Academic Press, 2015.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Spatial-channel token distillation for vision MLPs. In International Conference on Machine Learning, pp. 12685–12695. PMLR, 2022.
- Prompting large language models for zero-shot domain adaptation in speech recognition. arXiv preprint arXiv:2306.16007, 2023b.
- Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport. Advances in Neural Information Processing Systems, 34:19935–19946, 2021.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
- N-best T5: Robust ASR error correction using multiple input hypotheses and constrained decoding space. arXiv preprint arXiv:2303.00456, 2023.
- Recurrent neural network based language model. In Interspeech, volume 2, pp. 1045–1048. Makuhari, 2010.
- OpenAI. Introducing ChatGPT. OpenAI Blog, 2022.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.
- Dual application of speech enhancement for automatic speech recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223–228. IEEE, 2021.
- Enhancing speaker diarization with large language models: A contextual beam search approach. arXiv preprint arXiv:2309.05248, 2023.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- An investigation of end-to-end models for robust speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6893–6897. IEEE, 2021.
- Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.
- Whispering LLaMA: A cross-modal generative error correction framework for speech recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10007–10016, 2023.
- Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- Effective sentence scoring method using BERT for speech recognition. In Asian Conference on Machine Learning, pp. 1081–1093. PMLR, 2019.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152, 2016.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 O-COCOSDA/CASLRE, pp. 1–4, 2013.
- The 4th CHiME speech separation and recognition challenge. URL: http://spandh.dcs.shef.ac.uk/chime_challenge (last accessed 1 August 2018), 2016.
- DiarizationLM: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506, 2024.
- Can Whisper perform speech-based in-context learning? arXiv preprint arXiv:2309.07081, 2023.
- ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- On decoder-only architecture for speech-to-text and large language model integration. arXiv preprint arXiv:2307.03917, 2023a.
- Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. arXiv preprint arXiv:2309.17352, 2023b.
- Multi-task language modeling for improving speech recognition of rare words. In Proc. IEEE ASRU, pp. 1087–1093. IEEE, 2021.
- Generative speech recognition error correction with large language models and task-activating prompting. In Proc. IEEE ASRU, 2023a.
- From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition. In Proc. ICASSP, pp. 1–5. IEEE, 2023b.
- Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In IEEE Proc. ASRU, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- Learning view-disentangled human pose representation by contrastive cross-view mutual information maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12793–12802, 2021.
- Arbitrary talking face generation via attentional audio-visual coherence learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 2362–2368, 2021.