Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training (2404.10922v1)
Abstract: Recent advancements in language modeling have led to the emergence of LLMs capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.
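The abstract's idea of "generating synthetic targets in a multi-instructional style" can be illustrated with a minimal sketch. It assumes that each transcribed utterance is paired with several text instructions, and that a text-only LLM produces the target for each pair from the transcript. The abstract does not specify the instruction templates, data layout, or generation procedure, so `INSTRUCTIONS`, `SpeechExample`, `TrainingItem`, `build_items`, and `toy_text_llm` are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SpeechExample:
    """One transcribed utterance (hypothetical layout)."""
    audio_id: str
    transcript: str


@dataclass
class TrainingItem:
    """Speech input plus a text instruction and a synthetic text target."""
    audio_id: str
    instruction: str
    target: str


# Hypothetical instruction templates; the paper's actual prompts may differ.
INSTRUCTIONS: List[str] = [
    "Repeat the sentence.",
    "Translate the sentence into English.",
    "What is the topic of the sentence?",
]


def build_items(
    examples: List[SpeechExample],
    instructions: List[str],
    text_llm: Callable[[str, str], str],
) -> List[TrainingItem]:
    """Pair every utterance with every instruction.

    The target is whatever the text-only model returns when given the
    instruction and the reference transcript; at training time the
    speech-enabled model would see the audio instead of the transcript.
    """
    items: List[TrainingItem] = []
    for ex in examples:
        for instr in instructions:
            target = text_llm(instr, ex.transcript)
            items.append(TrainingItem(ex.audio_id, instr, target))
    return items


def toy_text_llm(instruction: str, transcript: str) -> str:
    """Stand-in for a real text LLM call (assumption: not a real model)."""
    return f"[{instruction}] {transcript}"


if __name__ == "__main__":
    data = [SpeechExample("utt1", "hola mundo")]
    for item in build_items(data, INSTRUCTIONS, toy_text_llm):
        print(item.audio_id, "|", item.instruction, "|", item.target)
```

In a real pipeline, `toy_text_llm` would be replaced by a call to an instruction-tuned text model, and the resulting items would supervise the speech-conditioned model so that it is trained on more than transcription alone.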