Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems (2204.05188v2)

Published 11 Apr 2022 in cs.CL, cs.SD, and eess.AS

Abstract: Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) have been primarily due to effective pretraining of speech representations. One such pretraining paradigm is the distillation of semantic knowledge from state-of-the-art text-based models like BERT to speech encoder neural networks. This work is a step towards doing the same in a much more efficient and fine-grained manner where we align speech embeddings and BERT embeddings on a token-by-token basis. We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder such that these can be directly compared and aligned with BERT based contextual embeddings. This alignment is performed using a novel tokenwise contrastive loss. Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets. Our model improves further when fine-tuned with additional regularization using SpecAugment especially when speech is noisy, giving an absolute improvement as high as 8% over previous results.

Citations (7)

Summary

  • The paper presents a tokenwise contrastive pretraining method that aligns speech and text (BERT) embeddings at the token level.
  • It employs a cross-modal attention mechanism with a pyramidal BiLSTM to achieve fine-grained alignment, improving SLU performance.
  • Evaluations on SNIPS and FSC datasets show enhanced accuracy—up to 83.07%—with a compact 48M parameter design suitable for on-device deployment.

Tokenwise Contrastive Pretraining for End-to-End Speech-to-Intent Systems

Introduction

The paper "Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems" (2204.05188) addresses the limitations of traditional spoken language understanding (SLU) systems that run automatic speech recognition (ASR) followed by natural language understanding (NLU). Such cascaded systems are prone to error propagation from ASR to NLU and are inefficient, since combining state-of-the-art ASR and NLU components yields large models. Recent work therefore favors end-to-end (E2E) SLU systems that extract semantic intents directly from speech, bypassing explicit transcription. This paper improves the pretraining of speech encoders for E2E SLU with a tokenwise contrastive loss that finely aligns speech representations with BERT-based text embeddings.

Proposed Methodology

The core innovation of this paper is a novel cross-modal attention mechanism that facilitates token-by-token alignment between speech and BERT embeddings, using a contrastive loss function.

The speech encoder consumes 80-dimensional log-Mel features, processed by a 9-layer pyramidal Bidirectional LSTM (BiLSTM) with a self-attention layer. The resulting encoding then passes through a cross-modal attention layer, in which non-contextual word embeddings are converted into contextual embeddings conditioned on the corresponding speech signal. This alignment process is depicted in the model overview (Figure 1).

Figure 1: Model overview. During pretraining, the cross-attention mechanism yields contextual embeddings for all input tokens using the corresponding speech. During fine-tuning for downstream S2I, when only speech is available, only the [CLS] token is required.
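
A minimal PyTorch sketch of this pretraining path follows, under stated assumptions: the layer count, hidden sizes, and the use of standard nn.LSTM and nn.MultiheadAttention modules are illustrative choices, not the authors' exact 9-layer architecture.

```python
# Illustrative sketch only: layer counts, hidden sizes, and the use of
# nn.MultiheadAttention for both self- and cross-attention are assumptions,
# not the authors' exact implementation.
import torch
import torch.nn as nn


class PyramidalBiLSTMEncoder(nn.Module):
    """BiLSTM stack that halves the time resolution after each layer by
    concatenating adjacent frames (Listen-Attend-Spell style)."""

    def __init__(self, feat_dim=80, hidden=256, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(num_layers):
            self.layers.append(
                nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True))
            in_dim = 2 * hidden * 2  # two concatenated bidirectional frames
        self.out_dim = in_dim

    def forward(self, x):                     # x: (B, T, feat_dim) log-Mel
        for lstm in self.layers:
            x, _ = lstm(x)                    # (B, T, 2*hidden)
            B, T, D = x.shape
            if T % 2:                         # pad so frames pair up evenly
                x = torch.cat([x, x.new_zeros(B, 1, D)], dim=1)
                T += 1
            x = x.reshape(B, T // 2, 2 * D)   # halve time, double feature dim
        return x                              # (B, T', out_dim)


class CrossModalAligner(nn.Module):
    """Non-contextual token embeddings query the speech encoding to produce
    speech-conditioned contextual token embeddings comparable to BERT's."""

    def __init__(self, speech_dim, bert_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(speech_dim, bert_dim)
        self.self_attn = nn.MultiheadAttention(bert_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(bert_dim, num_heads, batch_first=True)

    def forward(self, speech_enc, token_emb):
        # speech_enc: (B, T', speech_dim); token_emb: (B, L, bert_dim)
        s = self.proj(speech_enc)
        s, _ = self.self_attn(s, s, s)                # self-attention over speech
        ctx, attn = self.cross_attn(token_emb, s, s)  # tokens attend to speech
        return ctx, attn                              # ctx: (B, L, bert_dim)
```

During fine-tuning for speech-to-intent, only the output at the [CLS] position from such an aligner would need to be fed to an intent classifier, consistent with the figure caption above.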

The tokenwise contrastive loss optimizes the alignment by minimizing the distance between representations of the same token in the speech and BERT modalities, improving cross-modal semantic representation. Pretraining is conducted on LibriSpeech, followed by fine-tuning on downstream SLU tasks without in-domain transcripts, with SpecAugment used for data augmentation.
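
The loss itself can be sketched as a symmetric, InfoNCE-style objective over matched token positions; this formulation and the temperature value are assumptions for illustration, not the paper's exact objective.

```python
# Hedged sketch: each speech-derived token embedding should be closest to the
# BERT embedding of the same token position, with all other tokens acting as
# negatives. InfoNCE form and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F


def tokenwise_contrastive_loss(speech_tok, bert_tok, mask, temperature=0.1):
    """
    speech_tok: (B, L, D) token-level embeddings from the speech branch
    bert_tok:   (B, L, D) contextual BERT embeddings for the same tokens
    mask:       (B, L) boolean, True for real (non-padding) tokens
    """
    s = F.normalize(speech_tok[mask], dim=-1)   # (N, D), N = total real tokens
    t = F.normalize(bert_tok[mask], dim=-1)     # (N, D)
    logits = s @ t.T / temperature              # (N, N) scaled cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric cross-entropy over speech->text and text->speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```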

Results

The proposed model was tested on SNIPS-Smartlights and Fluent Speech Commands (FSC) datasets, outperforming existing methods in terms of accuracy. Tokenwise contrastive pretraining demonstrated substantial improvements on the SNIPS dataset, particularly for far-field speech, achieving up to 83.07% accuracy when combined with SpecAugment, which highlights its robustness to noisy settings.
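
For concreteness, SpecAugment-style masking of the log-Mel inputs can be sketched with torchaudio's standard transforms; the mask widths here are illustrative and not taken from the paper.

```python
import torchaudio.transforms as T

# Illustrative mask widths; the paper does not specify these values here.
freq_mask = T.FrequencyMasking(freq_mask_param=15)  # mask up to 15 Mel bins
time_mask = T.TimeMasking(time_mask_param=35)       # mask up to 35 frames

def augment(log_mel):                # log_mel: (B, n_mels, T)
    return time_mask(freq_mask(log_mel))
```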

Attention heatmaps validated the cross-modal attention layer's ability to learn alignment structures (Figure 2), suggesting the model effectively captures token-level dependencies across modalities.

Figure 2: Attention heatmap of the cross-modal attention layer. The yellow boxes represent the actual alignment of the tokens with the speech. Cross-modal attention is implicitly learning alignment structures.
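
Such a heatmap could be reproduced from the attention weights returned by the CrossModalAligner sketch above; this plotting helper is purely illustrative.

```python
import matplotlib.pyplot as plt

def plot_alignment(attn, tokens):
    """attn: (L, T') cross-attention weights for one utterance
    (rows are input tokens, columns are downsampled speech frames)."""
    plt.imshow(attn.detach().cpu().numpy(), aspect="auto", origin="lower")
    plt.yticks(range(len(tokens)), tokens)
    plt.xlabel("speech frames (after pyramidal downsampling)")
    plt.ylabel("input tokens")
    plt.colorbar()
    plt.show()
```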

Additionally, the model is compact with just 48 million parameters, suitable for on-device deployment, offering a practical advantage over larger transformer-based architectures.

Discussion

This study contributes significantly to E2E SLU by demonstrating that finer alignment at the token level can enhance semantic understanding without reliance on large datasets or in-domain adaptation. The demonstrated performance improvements and model efficiency suggest potential deployment on resource-constrained devices.

While the results are promising, further exploration into entity extraction and slot-filling capabilities using tokenwise contrastive pretrained models would solidify its applicability in broader SLU tasks.

Conclusion

The paper presents a robust approach to overcoming traditional SLU limitations through refined cross-modal embedding alignment using tokenwise contrastive loss. The technique enhances speech-to-intent recognition accuracy, offering new directions for SLU research with practical implications for real-world voice assistant deployments. Future work should explore extending this framework to more diverse SLU applications, enhancing its versatility and generalization capabilities.
