Self-Supervised Learning for speech recognition with Intermediate layer supervision

Published 16 Dec 2021 in eess.AS and cs.CL | (2112.08778v1)

Abstract: Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information. Since the network capacity is limited, we believe the speech recognition performance could be further improved if the model is dedicated to audio content information learning. To this end, we propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), which forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers. Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly, which achieves a 23.5%/11.6% relative word error rate reduction in the w/o language model setting for base/large models. Detailed analysis shows the bottom layers of our model have a better correlation with phonetic units, which is consistent with our intuition and explains the success of our method for ASR.

Summary

The paper introduces ILS-SSL, which applies an additional self-supervised loss on intermediate layers to steer learning toward phonetic content.
Experiments on the LibriSpeech test-other set show a relative word error rate reduction of up to 23.5% over the HuBERT baseline without a language model.
The approach enables more efficient ASR systems by reducing reliance on labeled data and focusing model learning on audio content.

Overview of "Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision"

The paper "Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision" aims to improve speech recognition by concentrating the limited capacity of a pre-trained speech model on content information. The method, Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL), augments standard self-supervised learning (SSL) by applying an additional SSL loss to intermediate layers of the model.

Methodology

The goal of ILS-SSL is to steer pre-trained speech models toward audio content information rather than speaker characteristics. It does so by adding a self-supervised loss on selected intermediate layers, on top of the existing SSL loss. The method is evaluated in two configurations, Base and Large, which use different amounts of data for pre-training and fine-tuning.
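The idea can be illustrated with a minimal sketch. It assumes the additional loss has the same masked-prediction form as a HuBERT-style top-layer loss (cross-entropy against discrete targets on masked frames) and that the per-layer losses are simply summed. The function names, the layer indices 4 and 12, and the equal weighting are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def masked_cross_entropy(logits: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         mask: Sequence[bool]) -> float:
    """Mean cross-entropy over masked frames only (assumed loss form)."""
    total, count = 0.0, 0
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(frame_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in frame_logits))
        total += log_norm - frame_logits[target]
        count += 1
    return total / count if count else 0.0


def ils_ssl_loss(layer_logits: Dict[int, List[List[float]]],
                 targets: Sequence[int],
                 mask: Sequence[bool],
                 supervised_layers: Sequence[int]) -> float:
    """Sum the same SSL loss over every supervised layer (equal weights assumed)."""
    return sum(masked_cross_entropy(layer_logits[layer], targets, mask)
               for layer in supervised_layers)


# Toy example: three frames, two target classes, frames 0 and 1 masked.
toy_logits = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
toy_targets = [0, 1, 0]
toy_mask = [True, True, False]
# Hypothetical choice: supervise layer 4 (intermediate) and layer 12 (top).
toy_total = ils_ssl_loss({4: toy_logits, 12: toy_logits},
                         toy_targets, toy_mask, supervised_layers=[4, 12])
```

In a real system the per-layer logits would come from projecting each supervised layer's hidden states onto the target vocabulary; here they are passed in directly to keep the sketch self-contained.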

The architecture largely mirrors HuBERT: a convolutional feature encoder followed by a Transformer-based context encoder. The Transformer has 12 layers in the Base model and 24 layers in the Large model.
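To make the layer structure concrete, the following toy sketch shows how hidden states could be collected from chosen depths of a layer stack so that a loss can be attached to them. Only the 12 and 24 layer counts come from the text; the `run_context_encoder` helper, the stand-in `add_one` layer, and the tapped layer indices are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Layer = Callable[[List[float]], List[float]]

# Transformer depths as stated for the two configurations.
NUM_LAYERS = {"base": 12, "large": 24}


def run_context_encoder(features: List[float],
                        layers: List[Layer],
                        tap_layers: Iterable[int]) -> Dict[int, List[float]]:
    """Run the stack and keep the outputs of the requested layers (1-indexed)."""
    wanted = set(tap_layers)
    taps: Dict[int, List[float]] = {}
    hidden = features
    for index, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if index in wanted:
            taps[index] = hidden
    return taps


def add_one(hidden: List[float]) -> List[float]:
    """Stand-in for a Transformer layer; a real layer would apply attention and an MLP."""
    return [x + 1.0 for x in hidden]


# Toy Base-sized stack; tapping layers 4 and 12 is an illustrative assumption.
toy_layers = [add_one] * NUM_LAYERS["base"]
taps = run_context_encoder([0.0, 0.0], toy_layers, tap_layers=[4, 12])
```

The returned dictionary maps each tapped layer index to its output, which is the shape of input a per-layer SSL loss would need.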

Experimental Results

The authors evaluate the approach on LibriSpeech and report substantial Word Error Rate (WER) improvements over the HuBERT baseline. In the Base setting without a language model, ILS-SSL achieves a 23.5% relative WER reduction on the test-other subset. With larger-scale pre-training on the Libri-Light 60k dataset, a 9.5% WER reduction is observed. Integrating an external language model yields further gains.
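For readers less familiar with the metric, here is a short sketch of how WER and a relative WER reduction are conventionally computed. The helper names are my own, and the WER values in the usage line are hypothetical numbers chosen for illustration, not figures from the paper.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance (substitutions, insertions, deletions) / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent), not results from the paper.
example = relative_reduction(baseline_wer=10.0, new_wer=7.65)
```

A relative reduction is measured against the baseline's own error rate, so the same percentage corresponds to a smaller absolute WER change when the baseline is already strong.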

Insights and Analysis

The paper analyzes layer-wise representations using k-means clustering and finds that ILS-SSL shifts the model's focus toward phonetic content, with the bottom layers correlating better with phonetic units. ILS-SSL is also evaluated on the SUPERB benchmark, which covers a range of speech tasks. The model performs strongly on content-related tasks, while speaker identification performance declines, consistent with the method's focus on content.
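One common way to quantify how well cluster assignments line up with phonetic units is a purity score. The sketch below assumes frame-level k-means cluster IDs and frame-level phone labels are already available; the `phone_purity` function and this particular metric are illustrative assumptions about how such a correlation might be measured, not necessarily the exact analysis used in the paper.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def phone_purity(cluster_ids: Sequence[Hashable],
                 phone_labels: Sequence[Hashable]) -> float:
    """Fraction of frames whose phone label is the dominant phone of their cluster."""
    if len(cluster_ids) != len(phone_labels):
        raise ValueError("cluster_ids and phone_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one frame is required")
    per_cluster: Dict[Hashable, Counter] = defaultdict(Counter)
    for cluster, phone in zip(cluster_ids, phone_labels):
        per_cluster[cluster][phone] += 1
    # For each cluster, count the frames that carry its most frequent phone.
    dominant = sum(max(counts.values()) for counts in per_cluster.values())
    return dominant / len(cluster_ids)


# Toy frames: cluster 0 is mostly phone "a", cluster 1 is entirely phone "c".
toy_purity = phone_purity([0, 0, 0, 1, 1], ["a", "a", "b", "c", "c"])
```

Computing such a score separately for the features of each layer would give a layer-wise picture of where phonetic information is concentrated.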

Implications and Future Directions

The findings bear on the development of more efficient ASR systems that rely less on extensive labeled datasets. ILS-SSL improves the acquisition of ASR-specific knowledge and can be combined with external language models for further gains. Future research could explore similar intermediate supervision strategies in multi-modal speech and text learning.

Overall, the paper contributes a significant refinement to SSL methodologies in automatic speech recognition, offering a pragmatic approach to overcoming existing model capacity constraints by strategically guiding layer-wise learning focus.