Large Language Models are Efficient Learners of Noise-Robust Speech Recognition (2401.10446v1)
Abstract: Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning abilities of LLMs to improve recognition results. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR models do, where one solution is to introduce noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from an audio encoder could harm LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of the source speech, which can promote the denoising process in GER. Furthermore, to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation that distills the real noise information in audio embeddings into our language embedding. Experiments on various latest LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate, while using limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of the source speech, under which off-the-shelf LLMs show a strong ability for language-space denoising.
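To make the two key ideas in the abstract concrete, below is a minimal PyTorch sketch of (i) a language-space noise embedding computed from the diversity of the N-best hypotheses and (ii) a MINE-style mutual information lower bound (Belghazi et al., 2018) used as the knowledge distillation signal. All module names, dimensions, and the specific diversity measure are illustrative assumptions, not the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn


class NBestNoiseEmbedding(nn.Module):
    """Illustrative language-space noise extractor: noisier source speech
    tends to yield a more diverse N-best list, so the elementwise spread
    of the hypothesis sentence embeddings can stand in for the noise
    conditions. The diversity measure here is an assumption."""

    def __init__(self, sent_dim: int, emb_dim: int):
        super().__init__()
        self.proj = nn.Linear(sent_dim, emb_dim)

    def forward(self, hyp_embs: torch.Tensor) -> torch.Tensor:
        # hyp_embs: (N, sent_dim) sentence embeddings of the N-best
        # hypotheses, e.g. from Sentence-BERT (Reimers & Gurevych, 2019).
        n = hyp_embs.size(0)
        diffs = (hyp_embs.unsqueeze(0) - hyp_embs.unsqueeze(1)).abs()  # (N, N, D)
        off_diag = ~torch.eye(n, dtype=torch.bool)                     # drop self-pairs
        noise = diffs[off_diag].mean(dim=0)                            # (D,)
        return self.proj(noise)                                        # (emb_dim,)


class MINE(nn.Module):
    """Mutual information neural estimation (Belghazi et al., 2018):
    a statistics network T scores joint pairs against shuffled
    (marginal) pairs, giving the Donsker-Varadhan lower bound
        I(X; Y) >= E_joint[T(x, y)] - log E_marginal[exp(T(x, y))]."""

    def __init__(self, x_dim: int, y_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        joint = self.net(torch.cat([x, y], dim=-1)).mean()
        y_perm = y[torch.randperm(y.size(0))]            # break the pairing
        marg = self.net(torch.cat([x, y_perm], dim=-1))
        # log E[exp(T)] computed stably via logsumexp.
        return joint - (torch.logsumexp(marg, dim=0) - math.log(y.size(0))).squeeze()


# Hypothetical KD step with illustrative shapes: maximize the MI bound
# between the language-space noise embedding and the audio encoder's
# noise embedding, i.e. minimize its negation during fine-tuning.
mine = MINE(x_dim=512, y_dim=512)
lang_noise = torch.randn(8, 512)    # batch of language-space noise embeddings
audio_noise = torch.randn(8, 512)   # batch of audio-side noise embeddings (teacher)
kd_loss = -mine(lang_noise, audio_noise)
```

In the paper's setting, such a KD term would be added to the efficient LLM fine-tuning objective so that the language embedding absorbs the real noise information from the audio side, while the LLM itself never has to consume cross-modal audio features directly.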
- Bidirectional recurrent neural network language models for automatic speech recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5421–5425. IEEE, 2015.
- Mutual information neural estimation. In International Conference on Machine Learning, pp. 531–540. PMLR, 2018.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Generative error correction for code-switching speech recognition using large language models. arXiv preprint arXiv:2310.13013, 2023a.
- HyPoradise: An open baseline for generative speech recognition with large language models. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023b.
- X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160, 2023c.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
- A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Prompting large language models with speech recognition abilities. arXiv preprint arXiv:2307.11795, 2023.
- Trapping LLM hallucinations using tagged context prompts. arXiv preprint arXiv:2306.06085, 2023.
- Freesound technical demo. In Proceedings of the 21st ACM international conference on Multimedia, pp. 411–412, 2013.
- MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning, pp. 2031–2041. PMLR, 2019.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- Whisper-AT: Noise-robust automatic speech recognizers are also strong general audio event taggers. In Proc. Interspeech, 2023a.
- Joint audio and speech understanding. In IEEE Proc. ASRU, 2023b.
- The RATS collection: Supporting HLT research with degraded audio data. In LREC, pp. 1970–1977. Citeseer, 2014.
- Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
- A spelling correction model for end-to-end speech recognition. In Proc. ICASSP, pp. 5651–5655. IEEE, 2019.
- The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000 - Automatic Speech Recognition: Challenges for the New Millennium, ISCA Tutorial and Research Workshop (ITRW), 2000.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Deliberation model based two-pass end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7799–7803. IEEE, 2020.
- Improving deliberation by text-only and semi-supervised training. arXiv preprint arXiv:2206.14716, 2022.
- Scaling up deliberation for multilingual ASR. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 771–776. IEEE, 2023.
- Yi Hu and Philipos C Loizou. Subjective comparison of speech enhancement algorithms. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, volume 1, pp. I–I. IEEE, 2006.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Speech recognition with no speech or with noisy speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1090–1094. IEEE, 2019.
- FastCorrect 2: Fast error correction on multiple candidates for automatic speech recognition. arXiv preprint arXiv:2109.14420, 2021.
- An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4):745–777, 2014.
- Robust automatic speech recognition: a bridge to practical applications, chapter 1, pp. 1–20. Academic Press, 2015.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023a.
- Spatial-channel token distillation for vision MLPs. In International Conference on Machine Learning, pp. 12685–12695. PMLR, 2022.
- Prompting large language models for zero-shot domain adaptation in speech recognition. arXiv preprint arXiv:2306.16007, 2023b.
- Unsupervised noise adaptive speech enhancement by discriminator-constrained optimal transport. Advances in Neural Information Processing Systems, 34:19935–19946, 2021.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
- N-best T5: Robust ASR error correction using multiple input hypotheses and constrained decoding space. arXiv preprint arXiv:2303.00456, 2023.
- Recurrent neural network based language model. In Interspeech, volume 2, pp. 1045–1048. Makuhari, 2010.
- OpenAI. Introducing ChatGPT. OpenAI Blog, 2022.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.
- Dual application of speech enhancement for automatic speech recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223–228. IEEE, 2021.
- Enhancing speaker diarization with large language models: A contextual beam search approach. arXiv preprint arXiv:2309.05248, 2023.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- An investigation of end-to-end models for robust speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6893–6897. IEEE, 2021.
- Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pp. 28492–28518. PMLR, 2023.
- Whispering LLaMA: A cross-modal generative error correction framework for speech recognition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10007–10016, 2023.
- Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084, 2019.
- Effective sentence scoring method using BERT for speech recognition. In Asian Conference on Machine Learning, pp. 1081–1093. PMLR, 2019.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Investigating RNN-based speech enhancement methods for noise-robust text-to-speech. In SSW, pp. 146–152, 2016.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 O-COCOSDA/CASLRE, pp. 1–4, 2013.
- The 4th CHiME speech separation and recognition challenge. URL: http://spandh.dcs.shef.ac.uk/chime_challenge (last accessed 1 August 2018), 2016.
- DiarizationLM: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506, 2024.
- Can Whisper perform speech-based in-context learning? arXiv preprint arXiv:2309.07081, 2023.
- ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
- On decoder-only architecture for speech-to-text and large language model integration. arXiv preprint arXiv:2307.03917, 2023a.
- Improving audio captioning models with fine-grained audio features, text embedding supervision, and LLM mix-up augmentation. arXiv preprint arXiv:2309.17352, 2023b.
- Multi-task language modeling for improving speech recognition of rare words. In Proc. IEEE ASRU, pp. 1087–1093. IEEE, 2021.
- Generative speech recognition error correction with large language models and task-activating prompting. In Proc. IEEE ASRU, 2023a.
- From english to more languages: Parameter-efficient model reprogramming for cross-lingual speech recognition. In Proc. ICASSP, pp. 1–5. IEEE, 2023b.
- Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In IEEE Proc. ASRU, 2023.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- Learning view-disentangled human pose representation by contrastive cross-view mutual information maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12793–12802, 2021.
- Arbitrary talking face generation via attentional audio-visual coherence learning. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 2362–2368, 2021.