SALMONN: Towards Generic Hearing Abilities for Large Language Models (2310.13289v2)
Abstract: Hearing is arguably an essential ability of AI agents in the physical world; it refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events, and music. In this paper, we propose SALMONN, a speech audio language music open neural network, built by integrating a pre-trained text-based LLM with speech and audio encoders into a single multimodal model. SALMONN enables the LLM to directly process and understand general audio inputs and achieves competitive performance on a number of speech and audio tasks used in training, such as automatic speech recognition and translation, auditory-information-based question answering, emotion recognition, speaker verification, and music and audio captioning. SALMONN also has a diverse set of emergent abilities unseen in training, including but not limited to speech translation to untrained languages, speech-based slot filling, spoken-query-based question answering, audio-based storytelling, and speech audio co-reasoning. The presence of cross-modal emergent abilities is studied, and a novel few-shot activation tuning approach is proposed to activate such abilities. To our knowledge, SALMONN is the first model of its type and can be regarded as a step towards AI with generic hearing abilities. The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN.
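The abstract describes the core architecture only at a high level: the outputs of a speech encoder and an audio encoder are fused, compressed into a short sequence of audio tokens, and fed into a frozen, adapter-tuned LLM alongside the text prompt. The following PyTorch snippet is a minimal sketch of that connection-module idea under stated assumptions; the encoder stand-ins, window size, query count, and embedding dimensions are illustrative placeholders, not the released implementation.

```python
# Sketch of an encoder-to-LLM connection module in the spirit of the abstract:
# concatenate speech and audio encoder frames, compress each window of frames
# into a few query tokens, and prepend them to the LLM's text embeddings.
# All module names and sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class WindowQFormer(nn.Module):
    """Compress each window of encoder frames into a small number of query tokens."""
    def __init__(self, d_model=1024, n_queries=1, window=17, llm_dim=4096):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.proj = nn.Linear(d_model, llm_dim)  # map into the LLM embedding space

    def forward(self, feats):                          # feats: (B, T, d_model)
        B, T, D = feats.shape
        pad = (-T) % self.window                        # pad so T splits into whole windows
        feats = nn.functional.pad(feats, (0, 0, 0, pad))
        n_win = feats.shape[1] // self.window
        feats = feats.reshape(B * n_win, self.window, D)
        q = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        out, _ = self.attn(q, feats, feats)             # queries cross-attend to each window
        return self.proj(out.reshape(B, -1, D))         # (B, n_win * n_queries, llm_dim)

# Toy stand-ins for the frozen speech / audio encoders and the LLM text embeddings.
B, T = 2, 100
speech_feats = torch.randn(B, T, 768)                   # placeholder speech-encoder output
audio_feats  = torch.randn(B, T, 256)                   # placeholder audio-event-encoder output
fused = torch.cat([speech_feats, audio_feats], dim=-1)  # (B, T, 1024)

audio_tokens = WindowQFormer(d_model=1024)(fused)       # (B, n_windows, 4096)
text_embeds  = torch.randn(B, 20, 4096)                  # embedded instruction / prompt tokens
llm_inputs   = torch.cat([audio_tokens, text_embeds], dim=1)  # sequence fed to the adapter-tuned LLM
print(llm_inputs.shape)
```

In this sketch the audio tokens simply precede the text embeddings, so the LLM can attend to the auditory input exactly as it attends to prompt tokens; only the connection module and the LLM adapters would need to be trained, while the encoders and the LLM backbone stay frozen.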