Extending Whisper with prompt tuning to target-speaker ASR (2312.08079v2)
Abstract: Target-speaker automatic speech recognition (ASR) aims to transcribe the speech of a desired target speaker from multi-talker overlapped utterances. Most existing target-speaker ASR (TS-ASR) methods either train from scratch or fully fine-tune a pre-trained model, incurring significant training costs and becoming inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches and their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning achieves performance comparable to state-of-the-art full-training approaches while requiring only about 1% of the task-specific model parameters. Notably, features of the original Whisper, such as inverse text normalization and timestamp tagging, are retained in TS-ASR, keeping the generated transcriptions natural and informative.
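The core idea of prompt tuning described in the abstract — freezing the pre-trained backbone and learning only a small set of soft prompt embeddings prepended to its input — can be sketched as follows. This is a minimal illustrative example using a toy Transformer encoder as a stand-in for Whisper's encoder, not the paper's actual implementation; the class name, prompt length, and model sizes are hypothetical choices.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Soft prompt tuning sketch: trainable prompt vectors are prepended
    to the input of a frozen pre-trained backbone (hypothetical example)."""

    def __init__(self, backbone: nn.Module, d_model: int, n_prompts: int = 16):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze all pre-trained weights
        # The only task-specific parameters: n_prompts soft prompt embeddings.
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); prepend prompts along the time axis.
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([p, x], dim=1))

# Toy frozen backbone standing in for a large pre-trained ASR encoder.
d_model = 64
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
model = PromptTunedEncoder(backbone, d_model, n_prompts=16)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the prompt embeddings receive gradients during fine-tuning, so the optimizer state and stored task-specific weights stay tiny relative to the backbone, which is what makes the approach practical for large foundation models.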