Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration (2306.09093v1)

Published 15 Jun 2023 in cs.CL, cs.AI, and cs.CV

Abstract: Although instruction-tuned LLMs have exhibited remarkable capabilities across various NLP tasks, their effectiveness on data modalities beyond text has not been fully studied. In this work, we propose Macaw-LLM, a novel multi-modal LLM that seamlessly integrates visual, audio, and textual information. Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations. The alignment module bridges multi-modal features to textual features, simplifying the adaptation from the modality modules to the cognitive module. In addition, we construct a large-scale multi-modal instruction dataset of multi-turn dialogues, comprising 69K image instances and 50K video instances. We have made our data, code, and model publicly available, which we hope will pave the way for future research in multi-modal LLMs and expand the capabilities of LLMs to handle diverse data modalities and address complex real-world scenarios.
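The abstract's three-part design (modality encoders feeding an alignment module that adapts features for a pretrained LLM) lends itself to a compact illustration. Below is a minimal PyTorch sketch of one plausible alignment module: modality features are compressed with a strided 1D convolution and then cross-attend over the frozen LLM token-embedding table, so the output lands in the LLM's input-embedding space. The class name, the dimensions, and the convolutional compression step are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AlignmentModule(nn.Module):
    """Illustrative alignment module: modality features act as queries in a
    cross-attention over the (frozen) LLM token-embedding table, so the
    output lives in the LLM's input-embedding space. Names and dimensions
    are assumptions for illustration, not the paper's exact configuration.
    """

    def __init__(self, feat_dim: int, llm_dim: int, llm_embeddings: torch.Tensor,
                 num_heads: int = 8, compress_stride: int = 4):
        super().__init__()
        # Compress the (often long) modality sequence and project it to llm_dim.
        self.compress = nn.Conv1d(feat_dim, llm_dim,
                                  kernel_size=compress_stride, stride=compress_stride)
        # Frozen LLM token embeddings serve as keys/values: (vocab_size, llm_dim).
        self.register_buffer("llm_emb", llm_embeddings)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, feat_dim) from a modality encoder (e.g. CLIP, Whisper).
        q = self.compress(feats.transpose(1, 2)).transpose(1, 2)  # (batch, seq', llm_dim)
        kv = self.llm_emb.unsqueeze(0).expand(q.size(0), -1, -1)  # (batch, vocab, llm_dim)
        aligned, _ = self.attn(q, kv, kv)  # modality tokens expressed in the LLM's space
        return aligned  # ready to be concatenated with text embeddings

# Toy usage with random tensors standing in for real encoder outputs
# (toy sizes; a real LLaMA-scale table would be e.g. 32000 x 4096).
vocab, llm_dim = 1000, 512
align = AlignmentModule(feat_dim=768, llm_dim=llm_dim,
                        llm_embeddings=torch.randn(vocab, llm_dim))
image_feats = torch.randn(2, 64, 768)  # batch of 2, 64 patch features each
print(align(image_feats).shape)        # torch.Size([2, 16, 512])
```

In this sketch, the aligned tokens would simply be prepended to the text embeddings fed to the cognitive module (the pretrained LLM), so the modality encoders and the LLM can stay frozen while only the alignment module trains.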

