
BLSP-KD: Bootstrapping Language-Speech Pre-training via Knowledge Distillation (2405.19041v1)

Published 29 May 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Recent end-to-end approaches have shown promise in extending LLMs to speech inputs, but face limitations in directly assessing and optimizing alignment quality and fail to achieve fine-grained alignment due to speech-text length mismatch. We introduce BLSP-KD, a novel approach for Bootstrapping Language-Speech Pre-training via Knowledge Distillation, which addresses these limitations through two key techniques. First, it optimizes speech-text alignment by minimizing the divergence between the LLM's next-token prediction distributions for speech and text inputs using knowledge distillation. Second, it employs a continuous integrate-and-fire strategy to segment speech into tokens that correspond one-to-one with text tokens, enabling fine-grained alignment. We also introduce Partial LoRA (PLoRA), a new adaptation method supporting LLM fine-tuning for speech inputs under knowledge distillation. Quantitative evaluation shows that BLSP-KD outperforms previous end-to-end baselines and cascaded systems at a comparable parameter scale, facilitating general instruction-following capabilities for LLMs with speech inputs. This approach provides new possibilities for extending LLMs to spoken language interactions.
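The paper's code is not reproduced here, but the two core techniques named in the abstract lend themselves to a compact illustration. The PyTorch sketch below is an assumption-laden approximation, not the authors' implementation: the first function is a generic token-level KD loss that minimizes the divergence between the frozen LLM's next-token distributions on text input (teacher) and on speech input (student), and the second is a simplified continuous integrate-and-fire routine that integrates per-frame weights until a threshold fires one speech token. The function names, the temperature parameter, the threshold `beta`, and all tensor shapes are illustrative choices, and details such as the divergence direction may differ from the paper.

```python
import torch
import torch.nn.functional as F

def kd_alignment_loss(student_logits, teacher_logits, temperature=1.0):
    # Both tensors are (batch, T, vocab): because CIF yields one speech
    # token per text token, the speech-input (student) and text-input
    # (teacher) next-token distributions can be compared position by position.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student); "batchmean" sums over positions and vocabulary
    # and averages over the batch. The temperature**2 factor is the usual
    # KD gradient rescaling.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

def cif_segment(encoder_states, alphas, beta=1.0):
    # Simplified continuous integrate-and-fire: accumulate per-frame weights
    # `alphas` over speech encoder frames of shape (T_frames, dim) and emit
    # one integrated vector each time the accumulator reaches `beta`.
    # Assumes every frame weight is below `beta` (no double fire per frame).
    tokens = []
    acc = 0.0
    frame = encoder_states.new_zeros(encoder_states.size(1))
    for h, a in zip(encoder_states, alphas):
        a = float(a)
        if acc + a < beta:              # keep integrating
            acc += a
            frame = frame + a * h
        else:                           # fire: split this frame's weight
            used = beta - acc           # portion completing the current token
            tokens.append(frame + used * h)
            acc = a - used              # remainder starts the next token
            frame = acc * h
    return (torch.stack(tokens) if tokens
            else frame.new_zeros(0, encoder_states.size(1)))
```

In training, `teacher_logits` would come from running the paired transcript through the frozen LLM, while `student_logits` would come from feeding the CIF speech tokens through the same LLM adapted with Partial LoRA; the abstract does not spell out PLoRA's mechanism, but the name suggests LoRA-style low-rank updates applied only to part of the computation so the text-input teacher path stays unaffected.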
