SLM: Bridge the thin gap between speech and text foundation models

Published 30 Sep 2023 in cs.CL, cs.SD, and eess.AS | (2310.00230v1)

Abstract: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of a pretrained foundational speech model and a large language model (LLM). SLM freezes the pretrained foundation models to maximally preserve their capabilities, and only trains a simple adapter with just 1% (156M) of the foundation models' parameters. This adaptation not only leads SLM to achieve strong performance on conventional tasks such as speech recognition (ASR) and speech translation (AST), but also introduces the novel capability of zero-shot instruction-following for more diverse tasks: given a speech input and a text instruction, SLM is able to perform unseen generation tasks including contextual biasing ASR using real-time context, dialog generation, speech continuation, and question answering. Our approach demonstrates that the representational gap between pretrained speech and language models might be narrower than one would expect, and can be bridged by a simple adaptation mechanism. As a result, SLM is not only efficient to train, but also inherits strong capabilities already acquired in foundation models of different modalities.


Authors (18)


Citations (47)


Summary

The paper introduces SLM, which bridges speech and text modalities using a lightweight adapter whose parameters amount to only about 1% (156M) of those in the frozen foundation models.
The architecture employs a transformer-based adapter that maps frozen speech encoder outputs into the LLM's textual embedding space, supporting multitask use.
Results demonstrate SLM’s zero-shot instruction-following and multilingual capabilities, including a 46.2% relative improvement on contextual biasing ASR.

Analyzing the SLM: Integrating Speech and Text Foundation Models

The paper under review introduces the Speech and Language Model (SLM), an approach to unifying foundation models that operate in the speech and text modalities. Using a lightweight adapter, SLM connects a frozen speech model to a frozen LLM and reuses their existing capabilities without retraining either model or requiring massive amounts of data. The paper focuses on bridging the representational gap between the two modalities and covers multitask, multilingual, and dual-modal functionality, including automatic speech recognition (ASR), automatic speech translation (AST), and zero-shot instruction-following.

Model Architecture and Methodology

The core innovation of SLM lies in its architecture, which is composed of a frozen pretrained speech encoder, a frozen pretrained LLM, and a small trainable adapter. The adapter maps the speech encoder's output into the textual embedding space that the LLM expects. It accounts for only about 1% (156M) of the foundation models' parameters, so the pretrained capabilities are preserved while new ones are added at low training cost.
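The parameter budget can be checked with simple arithmetic. The sketch below uses only the two figures quoted in the abstract (156M adapter parameters, about 1% of the foundation models' parameters); both are rounded, so the implied foundation-model size is an estimate rather than a reported number, and the function names are illustrative.

```python
# Figures quoted in the abstract; both are rounded values.
ADAPTER_PARAMS = 156_000_000   # trainable adapter parameters
ADAPTER_SHARE = 0.01           # adapter size relative to the frozen models


def implied_foundation_params(adapter_params: int, share: float) -> float:
    """Estimate the frozen parameter count implied by the adapter's share."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return adapter_params / share


def trainable_fraction(trainable: float, frozen: float) -> float:
    """Fraction of all parameters in the combined system that are trained."""
    return trainable / (trainable + frozen)


if __name__ == "__main__":
    frozen = implied_foundation_params(ADAPTER_PARAMS, ADAPTER_SHARE)
    print(f"implied frozen parameters: {frozen / 1e9:.1f}B")
    share = trainable_fraction(ADAPTER_PARAMS, frozen)
    print(f"trainable share of the full system: {share:.2%}")
```

Note the two denominators: "1% of the foundation models' parameters" is measured against the frozen models alone, so the adapter's share of the combined system is slightly below 1%.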

The adapter, implemented as a transformer stack with as few as two layers, processes the output of the speech model and subsamples the sequence so that its length is closer to that of a text input. This reduction is what allows the model to handle long speech inputs efficiently. That such a small adapter succeeds at the cross-modal mapping suggests that the representational gap between the two models is narrower than expected and can be bridged with minimal changes to the existing models.
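The data flow described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give the subsampling method or factor, so uniform striding with a caller-chosen `factor` is assumed; a single linear projection stands in for the transformer adapter; and placing the speech embeddings before the instruction embeddings is also an assumption. All function names are hypothetical.

```python
from typing import List

Vector = List[float]


def subsample(frames: List[Vector], factor: int) -> List[Vector]:
    """Keep every `factor`-th frame (uniform striding, an assumed scheme)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return frames[::factor]


def project(frame: Vector, weights: List[Vector]) -> Vector:
    """Linear map into the LLM embedding size.

    Stand-in for the trainable adapter; each row of `weights` produces
    one output dimension.
    """
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def adapt_speech(frames: List[Vector], weights: List[Vector],
                 factor: int) -> List[Vector]:
    """Shorten the frozen encoder's output, then map it to LLM space."""
    return [project(f, weights) for f in subsample(frames, factor)]


def build_llm_input(speech: List[Vector],
                    instruction: List[Vector]) -> List[Vector]:
    """Concatenate adapted speech with text-instruction embeddings."""
    return speech + instruction


if __name__ == "__main__":
    encoder_out = [[float(t), 1.0] for t in range(8)]     # 8 frames, dim 2
    weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]        # dim 2 -> dim 3
    instruction = [[0.0, 0.0, 0.0] for _ in range(5)]     # 5 text tokens
    llm_input = build_llm_input(adapt_speech(encoder_out, weights, 4),
                                instruction)
    print(len(llm_input), "embeddings of size", len(llm_input[0]))
```

Only the adapter weights (here `weights`) would be trained; the speech encoder that produces `encoder_out` and the LLM that consumes the concatenated sequence stay frozen.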

Results and Performance

SLM is evaluated on a range of tasks, with strong results on multitask ASR and multilingual AST. Although the model handles unseen tasks in a zero-shot manner, its main strength lies in speech recognition and translation accuracy, where it rivals or surpasses existing systems such as USM and AudioPaLM in several multilingual settings.

SLM also shows zero-shot instruction-following, most notably on contextual biasing ASR. Given context supplied at inference time, it outperforms a conventional ASR system by a 46.2% relative improvement, which underscores the potential of such cross-modal architectures.
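For readers unfamiliar with the term, a relative improvement is measured against the baseline's own score rather than as an absolute difference. The sketch below assumes the underlying metric is an error rate such as word error rate (WER), which this summary does not state; the two input values are hypothetical numbers chosen only to reproduce a 46.2% relative figure and are not results from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error rate removed by the new system."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical error rates, for illustration only.
    baseline, biased = 10.0, 5.38
    print(f"{relative_improvement(baseline, biased):.1%}")
```

The same relative figure can correspond to very different absolute gaps, so it is best read alongside the baseline error rate reported in the paper.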

Discussion and Future Directions

The paper also studies adapter depth and the effect of different pretrained LLMs on cross-modal integration. A shallow adapter suffices for substantial gains, which suggests that the two modalities can be unified with little computational overhead.

The paper further hypothesizes that the method extends to other LLM architectures, including decoder-only models, which remains an open area for experimentation. Fine-tuning on specific tasks while keeping the pretrained weights intact could also support customized applications in targeted scenarios.

Implications

On a theoretical level, SLM challenges the notion that cross-modal models depend on massive amounts of data, showing that foundation-model capabilities can be retained and extended through a lightweight bridge. Practically, it applies to scenarios that require accurate speech recognition and translation across many languages.

As speech processing becomes more closely tied to LLMs, SLM offers a scalable framework that balances computational efficiency with broad applicability. It is a stepping stone for future work on unifying more modalities within a single system.
