Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System (2407.09817v2)

Published 13 Jul 2024 in cs.SD, cs.CL, and eess.AS

Abstract: Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to address both tasks simultaneously. In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. Specifically, (i) we freeze Whisper and plug a Sidecar separator into its encoder to separate the mixed embedding for multiple talkers; (ii) a Target Talker Identifier is introduced to identify the embedding flow of the target talker on the fly, requiring only three seconds of enrollment speech as a cue; (iii) soft prompt tuning for the decoder is explored for better task adaptation. Our method outperforms previous methods on the two- and three-talker LibriMix and LibriSpeechMix datasets for both tasks, and delivers acceptable zero-shot performance on multi-talker ASR on the AishellMix Mandarin dataset.

Authors (7)

Lingwei Meng
Jiawen Kang
Yuejiao Wang
Zengrui Jin
Xixin Wu
Xunying Liu
Helen Meng
