
Extending Whisper with prompt tuning to target-speaker ASR (2312.08079v2)

Published 13 Dec 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Target-speaker automatic speech recognition (ASR) aims to transcribe the desired speech of a target speaker from multi-talker overlapped utterances. Most existing target-speaker ASR (TS-ASR) methods involve either training from scratch or fully fine-tuning a pre-trained model, leading to significant training costs and becoming inapplicable to large foundation models. This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. Variants of prompt tuning approaches along with their configurations are explored and optimized for TS-ASR. Experimental results show that prompt tuning can achieve performance comparable to state-of-the-art full training approaches while requiring only about 1% of task-specific model parameters. Notably, the original Whisper's features, such as inverse text normalization and timestamp tagging, are retained in target-speaker ASR, keeping the generated transcriptions natural and informative.
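To make the parameter-efficiency claim concrete, the sketch below shows generic soft prompt tuning in PyTorch: a small matrix of learnable prompt vectors is prepended to a frozen encoder's input sequence, and only those vectors receive gradient updates. This is a minimal illustration assuming a toy Transformer backbone, not the paper's actual Whisper integration; `PromptTunedEncoder` and `n_prompts` are hypothetical names introduced here.

```python
import torch
import torch.nn as nn

class PromptTunedEncoder(nn.Module):
    """Freezes a backbone encoder and learns only a small prompt matrix."""

    def __init__(self, frozen_encoder: nn.Module, d_model: int, n_prompts: int = 16):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the backbone stays fixed
        # The only task-specific parameters: n_prompts learnable vectors.
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) input embeddings
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.encoder(torch.cat([prompts, x], dim=1))

# Usage: only the prompt vectors appear in the optimizer.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=2,
)
model = PromptTunedEncoder(backbone, d_model=512)
optimizer = torch.optim.AdamW([model.prompts], lr=1e-3)
out = model(torch.randn(4, 100, 512))  # -> (4, 116, 512): 16 prompts + 100 frames
```

In this toy setup the trainable parameter count is 16 x 512 = 8,192 values, a tiny fraction of the frozen backbone; the roughly 1% figure in the abstract refers to the same kind of ratio between task-specific prompt parameters and the full Whisper model.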
