Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling (2307.07057v1)

Published 13 Jul 2023 in cs.CL, cs.CV, cs.SD, and eess.AS

Abstract: We study speech intent classification and slot filling (SICSF) by proposing to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model, which achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1. We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. To explore parameter efficiency, we freeze the encoder and add Adapter modules, and show that parameter efficiency is only achievable with an ASR-pretrained encoder, while the SSL encoder needs full finetuning to achieve comparable results. In addition, we provide an in-depth comparison on end-to-end models versus cascading models (ASR+NLU), and show that E2E models are better than cascaded models unless an oracle ASR model is provided. Last but not least, our model is the first E2E model that achieves the same performance as cascading models with oracle ASR. Code, checkpoints and configs are available.

Summary

- The paper demonstrates that ASR pretraining outperforms SSL pretraining for speech intent classification and slot filling, reaching 90.14% intent accuracy and 82.27% SLURP-F1 on SLURP.
- It freezes the encoder and adds Adapter modules to improve parameter efficiency, a strategy that works only with an ASR-pretrained encoder.
- A comparison shows that end-to-end models outperform cascaded ASR+NLU models unless the cascade is given an oracle ASR model.

The paper "Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling" investigates speech intent classification and slot filling (SICSF). The authors initialize an end-to-end (E2E) Conformer-Transformer model with an encoder pretrained on automatic speech recognition (ASR). The resulting model sets a new state of the art on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.
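
To make the two reported numbers concrete, here is a minimal sketch of how intent accuracy and a slot-level F1 score can be computed. It is a simplification and not the official SLURP evaluation: it assumes strict exact matching on (slot type, slot value) pairs, whereas the official SLURP-F1 is, to my understanding, more lenient toward partially correct slot values. The function names and the toy intents and slots are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def intent_accuracy(gold_intents, pred_intents):
    """Fraction of utterances whose predicted intent equals the gold intent."""
    if len(gold_intents) != len(pred_intents):
        raise ValueError("gold and predicted lists must have the same length")
    if not gold_intents:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_intents, pred_intents))
    return correct / len(gold_intents)


def slot_f1(gold_slots, pred_slots):
    """Micro-averaged exact-match F1 over (slot_type, slot_value) pairs.

    Each argument is a list with one entry per utterance; each entry is a
    list of (slot_type, slot_value) tuples. Simplified, NOT official SLURP-F1.
    """
    tp = fp = fn = 0
    for gold, pred in zip(gold_slots, pred_slots):
        gold_counts, pred_counts = Counter(gold), Counter(pred)
        # Multiset intersection: pairs that appear in both gold and prediction.
        overlap = sum((gold_counts & pred_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(gold_counts.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration).
gold_intents = ["alarm_set", "weather_query", "play_music"]
pred_intents = ["alarm_set", "weather_query", "play_radio"]
gold_slots = [
    [("time", "seven am")],
    [("place", "london"), ("date", "tomorrow")],
    [("artist", "queen")],
]
pred_slots = [
    [("time", "seven am")],
    [("place", "london")],
    [("artist", "the queen")],
]

acc = intent_accuracy(gold_intents, pred_intents)
f1 = slot_f1(gold_slots, pred_slots)
print(f"intent accuracy: {acc:.3f}")  # 0.667
print(f"slot F1:         {f1:.3f}")   # 0.571
```

In the toy data, the third utterance shows why strict matching is harsh: "the queen" versus "queen" counts as both a false positive and a false negative.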

Key Contributions:

ASR vs. SSL Pretraining:
- The paper shows that pretraining the encoder on ASR is much more effective for SICSF than pretraining it with self-supervised learning (SSL). This bears directly on the choice of pretraining strategy for speech understanding tasks.
Parameter Efficiency:
- To explore parameter efficiency, the authors freeze the encoder and add Adapter modules. This strategy succeeds only with an ASR-pretrained encoder; an SSL-pretrained encoder needs full finetuning to achieve comparable results.
Comparison of E2E and Cascading Models:
- The paper provides an in-depth comparison of E2E models with cascaded models (ASR followed by natural language understanding, NLU).
- It concludes that E2E models outperform cascaded models unless the cascade is given an "oracle" ASR model, that is, one assumed to produce perfect transcripts.
- According to the authors, the proposed model is the first E2E model to match the performance of cascaded models with oracle ASR.
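
The parameter-efficiency argument can be illustrated with some back-of-the-envelope arithmetic. The sketch below counts trainable parameters when the encoder is frozen and only small adapters plus the decoder are trained. It assumes a common residual bottleneck adapter (a down-projection followed by an up-projection, each with a bias) inserted once per encoder layer; the summary does not specify the paper's adapter design, and every size below is a hypothetical placeholder rather than a figure from the paper.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of a dense layer mapping d_in -> d_out."""
    return d_in * d_out + (d_out if bias else 0)


def adapter_params(d_model, bottleneck):
    """Parameters of one bottleneck adapter: down-projection + up-projection."""
    return linear_params(d_model, bottleneck) + linear_params(bottleneck, d_model)


def trainable_summary(encoder_params, n_layers, d_model, bottleneck, decoder_params):
    """Return (adapter params, trainable params, trainable fraction).

    Assumes the encoder is frozen, one adapter is added per encoder layer,
    and the decoder is trained in full.
    """
    adapters = n_layers * adapter_params(d_model, bottleneck)
    trainable = adapters + decoder_params
    total = encoder_params + trainable
    return adapters, trainable, trainable / total


# Hypothetical sizes, chosen only to show the order of magnitude.
adapters, trainable, fraction = trainable_summary(
    encoder_params=100_000_000,  # frozen encoder (assumed size)
    n_layers=16,                 # assumed number of encoder layers
    d_model=512,                 # assumed hidden size
    bottleneck=32,               # assumed adapter bottleneck width
    decoder_params=10_000_000,   # trained decoder (assumed size)
)
print(f"adapter parameters:   {adapters:,}")
print(f"trainable parameters: {trainable:,}")
print(f"trainable fraction:   {fraction:.1%}")
```

Under these assumed sizes, the adapters add roughly half a million parameters, so almost all of the trainable budget sits in the decoder while the large encoder stays untouched.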

Overall, the work shows that ASR pretraining is an effective and parameter-efficient starting point for end-to-end speech intent classification and slot filling. The authors state that code, checkpoints, and configs are available, which should make the results easier to reproduce and extend.
