
Extending Whisper with prompt tuning to target-speaker ASR (2312.08079v2)

Published 13 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, leading to significant training costs and making them inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches, along with their configurations, are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full training approaches while requiring only about 1% of task-specific model parameters. Notably, the original Whisper features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.
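The core idea of prompt tuning described in the abstract — keeping the large pre-trained model frozen and training only a small set of prompt parameters — can be sketched as follows. This is a minimal illustration of the general soft-prompt mechanism, not the paper's exact method: the encoder here is a toy stand-in for Whisper's encoder, and `SoftPromptWrapper`, `spk_proj`, and the prompt length are hypothetical names and choices for illustration.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend trainable soft-prompt vectors (plus a projected speaker
    embedding) to a frozen encoder's input sequence. Only the prompt
    parameters and the speaker projection are updated during training."""

    def __init__(self, encoder: nn.Module, d_model: int, n_prompt: int = 16):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the foundation model frozen
        # trainable soft prompts, small random init
        self.prompt = nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        # hypothetical projection mapping a speaker embedding into model space
        self.spk_proj = nn.Linear(d_model, d_model)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, d_model); spk_emb: (batch, d_model)
        b = feats.size(0)
        prompts = self.prompt.unsqueeze(0).expand(b, -1, -1)
        spk = self.spk_proj(spk_emb).unsqueeze(1)  # (batch, 1, d_model)
        # prepend speaker token and soft prompts to the acoustic features
        return self.encoder(torch.cat([spk, prompts, feats], dim=1))

# Toy frozen "encoder" standing in for a large pre-trained ASR encoder.
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
    num_layers=1,
)
model = SoftPromptWrapper(enc, d_model=32, n_prompt=8)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

In this toy setup the trainable parameters (prompt embeddings plus the speaker projection) are a small fraction of the total, which is the mechanism behind the paper's reported ~1% task-specific parameter budget; the actual ratio depends on model and prompt sizes.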

