SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation (2310.09424v1)

Published 13 Oct 2023 in cs.CL, cs.HC, cs.SD, and eess.AS

Abstract: We present a novel Speech-Augmented LLM (SALM) with multitask and in-context learning capabilities. SALM comprises a frozen text LLM, an audio encoder, a modality adapter module, and LoRA layers to accommodate speech input and associated task instructions. The unified SALM not only achieves performance on par with task-specific Conformer baselines for Automatic Speech Recognition (ASR) and Speech Translation (AST), but also exhibits zero-shot in-context learning capabilities, demonstrated through a keyword-boosting task for ASR and AST. Moreover, speech supervised in-context training is proposed to bridge the gap between LLM training and downstream speech tasks, which further boosts the in-context learning ability of speech-to-text models. The proposed model is open-sourced via the NeMo toolkit.
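The four components named in the abstract (frozen LLM, audio encoder, modality adapter, LoRA layers) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the single-projection adapter, the ordering of speech embeddings before the text instruction, the zero initialization of the LoRA `B` matrix, and all names and dimensions are assumptions made for illustration only.

```python
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols, rng):
    """Small random matrix; stands in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class LoRALinear:
    """Frozen weight W plus a low-rank update: y = W x + (alpha / r) * B (A x)."""

    def __init__(self, d_in, d_out, rank, alpha, rng):
        self.W = rand_matrix(d_out, d_in, rng)          # frozen LLM weight
        self.A = rand_matrix(rank, d_in, rng)           # trainable
        self.B = [[0.0] * rank for _ in range(d_out)]   # trainable, zero init (assumed)
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]


def modality_adapter(audio_frames, P):
    """Hypothetical adapter: one linear map from audio-encoder dim to LLM dim."""
    return [matvec(P, frame) for frame in audio_frames]


def build_llm_input(audio_embeds, prompt_embeds):
    """Concatenate along time; speech-first ordering is an assumption."""
    return audio_embeds + prompt_embeds


rng = random.Random(0)
d_audio, d_llm = 4, 6
P = rand_matrix(d_llm, d_audio, rng)
# Stand-ins for audio-encoder outputs (3 frames) and instruction embeddings (2 tokens).
audio_frames = [[rng.uniform(-1, 1) for _ in range(d_audio)] for _ in range(3)]
prompt_embeds = [[rng.uniform(-1, 1) for _ in range(d_llm)] for _ in range(2)]

seq = build_llm_input(modality_adapter(audio_frames, P), prompt_embeds)
layer = LoRALinear(d_llm, d_llm, rank=2, alpha=4.0, rng=rng)
out = [layer(x) for x in seq]
```

With `B` initialized to zero, the LoRA-augmented layer reproduces the frozen layer exactly at the start of training, so only the adapter and the low-rank matrices would need to be learned in a setup like this.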
