Contextualized Automatic Speech Recognition with Dynamic Vocabulary (2405.13344v2)

Published 22 May 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, including external language model shallow fusion or rescoring. However, the additional modules increase the workload. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token, unlike a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. This method is easily applied to various architectures because it only expands the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 to 4.9 points compared with the conventional DB method.

Summary

The paper introduces a dynamic vocabulary method that treats bias phrases as single tokens to improve recognition in end-to-end ASR systems.
Experimental results show a 3.1 to 4.9 point improvement in bias phrase WER on English and Japanese datasets compared with the conventional deep biasing method.
Because the approach only expands the embedding and output layers, it can be applied to a range of E2E-ASR architectures.

Contextualized Automatic Speech Recognition with Dynamic Vocabulary

The paper proposes a dynamic vocabulary for deep biasing in End-to-End Automatic Speech Recognition (E2E-ASR) systems. It is motivated by a shortcoming of existing methods, which represent rare words and contextual phrases as sequences of subwords from a static vocabulary. For rare or context-specific phrases, these subword sequences form unnatural token patterns that the model assigns a low probability.

Framework and Methodology

Conventional deep biasing techniques decompose each bias phrase into subwords from a fixed vocabulary, so the model must learn the dependencies between those subwords. More advanced techniques expand the vocabulary with additional modules, such as external language model shallow fusion or rescoring, but these modules add computational workload and complexity. The paper instead proposes a dynamic vocabulary in which bias tokens are added during inference. Each entry in the bias list is represented as a single token, so the model no longer needs to model subword dependencies within a bias phrase.
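The contrast between the two representations can be illustrated with a toy tokenizer. This is a minimal sketch and not the paper's tokenizer: the greedy longest-match rule, the tiny `STATIC_VOCAB`, the `tokenize` function, and the `<phrase>` naming of dynamic tokens are all assumptions made for illustration.

```python
# Hypothetical static subword vocabulary (illustration only).
STATIC_VOCAB = {"ni", "gh", "tin", "gal", "e", "n", "i", "g", "h", "t", "a", "l"}


def tokenize(text, static_vocab, bias_phrases=()):
    """Greedy longest-match tokenizer.

    Bias phrases, when given, are matched first and emitted as ONE dynamic
    token each; everything else falls back to static subwords.
    """
    tokens = []
    i = 0
    # Try longer bias phrases first; ignore empty entries.
    ordered = sorted((p for p in bias_phrases if p), key=len, reverse=True)
    while i < len(text):
        phrase = next((p for p in ordered if text.startswith(p, i)), None)
        if phrase is not None:
            tokens.append("<" + phrase + ">")  # single dynamic token
            i += len(phrase)
            continue
        # Fall back to the longest static subword starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in static_vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no subword covers {text[i]!r}")
    return tokens


static_tokens = tokenize("nightingale", STATIC_VOCAB)
dynamic_tokens = tokenize("nightingale", STATIC_VOCAB, bias_phrases=["nightingale"])
print(static_tokens)   # five subwords the model must predict in sequence
print(dynamic_tokens)  # one bias token
```

With the static vocabulary alone, the phrase becomes a chain of five subwords, each of which has to be predicted correctly given the previous ones. With the bias list supplied, the same phrase is a single prediction.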

The proposed approach only expands the embedding and output layers of an existing E2E-ASR model, so it applies to common architectures such as Connectionist Temporal Classification (CTC), Recurrent Neural Network Transducer (RNN-T), and attention-based systems. Because the rest of the network is unchanged, the method can be integrated into existing frameworks without a major architectural overhaul.
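One way to picture the expanded output layer is sketched below. It is a simplified illustration under stated assumptions, not the paper's implementation: the class `DynamicOutputLayer`, the stand-in `toy_bias_encoder`, the random weights, and the inner-product scoring of dynamic tokens are hypothetical choices. Only the output side is shown; the embedding layer would be extended with the same phrase vectors in a similar way.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def toy_bias_encoder(phrase, dim):
    """Hypothetical stand-in for a learned phrase encoder: a deterministic
    pseudo-random vector per phrase (seeded by the phrase string)."""
    rng = random.Random(phrase)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


class DynamicOutputLayer:
    """Output layer whose vocabulary = fixed static rows + per-utterance bias rows."""

    def __init__(self, static_tokens, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.static_tokens = list(static_tokens)
        # Static rows are fixed after training (random here, for illustration).
        self.static_weights = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in self.static_tokens
        ]
        self.dynamic_tokens = []
        self.dynamic_weights = []

    def set_bias_list(self, phrases):
        """Swap in a new bias list at inference time; one row per phrase."""
        self.dynamic_tokens = ["<" + p + ">" for p in phrases]
        self.dynamic_weights = [toy_bias_encoder(p, self.dim) for p in phrases]

    def posterior(self, hidden):
        """Softmax over the concatenated static and dynamic vocabulary."""
        rows = self.static_weights + self.dynamic_weights
        probs = softmax([dot(w, hidden) for w in rows])
        return dict(zip(self.static_tokens + self.dynamic_tokens, probs))


layer = DynamicOutputLayer(["<blank>", "a", "b"], dim=4)
hidden_state = [0.1, 0.2, 0.3, 0.4]  # hypothetical decoder or encoder output
before = layer.posterior(hidden_state)
layer.set_bias_list(["nightingale", "Tokyo Skytree"])
after = layer.posterior(hidden_state)
print(len(before), len(after))
```

The static rows never change when the bias list does, which is the sense in which the vocabulary is dynamic: the set of output classes grows or shrinks per utterance while the trained parameters stay fixed.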

Experimental Results

The experiments support the effectiveness of the dynamic vocabulary. Compared with the conventional deep biasing method, it reduces bias phrase Word Error Rate (WER) by 3.1 to 4.9 points on English and Japanese datasets. The results indicate that the model handles unseen and infrequent phrases by updating its vocabulary at inference time, and that it does so without noticeably degrading recognition of the remaining words, i.e., those outside the bias list.
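To make the metric concrete, the sketch below splits WER into a bias part and a non-bias part; a gain reported in "points" is then the absolute difference between two such percentages. The attribution rule used here (substitutions and deletions counted by the reference word, insertions by the inserted word) is an assumption for illustration, and the helper names `align` and `split_wer` are hypothetical; the paper's exact scoring procedure may differ.

```python
def align(ref, hyp):
    """Levenshtein alignment over words; returns (op, ref_word, hyp_word) triples."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))
            j -= 1
    return ops[::-1]


def split_wer(ref, hyp, bias_words):
    """Return WER (in percent) separately for bias words and all other words."""
    errors = {"bias": 0, "other": 0}
    totals = {"bias": 0, "other": 0}
    for op, ref_word, hyp_word in align(ref, hyp):
        word = ref_word if ref_word is not None else hyp_word
        group = "bias" if word in bias_words else "other"
        if ref_word is not None:
            totals[group] += 1  # only reference words count toward the denominator
        if op != "ok":
            errors[group] += 1
    return {g: 100.0 * errors[g] / totals[g] if totals[g] else 0.0 for g in totals}


reference = "please call nightingale ward now".split()
hypothesis = "please call martingale ward now".split()
scores = split_wer(reference, hypothesis, bias_words={"nightingale"})
print(scores)
```

In this toy case the only error falls on the bias word, so the bias part of the WER is 100% while the non-bias part is 0%.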

Experiments with varying bias list sizes show that, although larger lists gradually degrade some aspects of performance, the proposed method still outperforms the conventional alternatives. The method is also effective on both English and Japanese data, which supports its applicability across languages.

Contributions and Implications

The paper contributes to ASR research by reducing the reliance on static-vocabulary subword decomposition and on external language models for handling context-specific terms. It improves bias phrase recognition without separate add-on modules; the only change to the network is the expansion of the embedding and output layers. Because the model is trained with dynamic vocabularies, it can accept new bias tokens at inference time, which addresses a key bottleneck in contextual ASR and opens the way to more adaptable speech recognition systems.

The proposed dynamic vocabulary methodology suggests that future ASR systems could more effectively tailor their vocabularies in real-time use cases, perhaps even personalizing recognition capabilities for individual users' needs and contexts. This flexibility implies considerable potential for real-world applications, particularly where user-specific or rare terminologies are prominent.

Future Directions

Future work could explore dynamic vocabularies in multilingual and streaming ASR systems, where vocabulary adaptability matters even more because of the variety of languages and terms that users produce. Integrating the approach into interactive voice systems and real-time speech processing applications is another promising direction.

Overall, the framework is a useful step forward for contextualized ASR: it handles bias phrases by adding single tokens to the vocabulary at inference time rather than by decomposing them into static subwords or attaching external modules.
