
From Image to Video, what do we need in multimodal LLMs? (2404.11865v2)

Published 18 Apr 2024 in cs.CV

Abstract: From Image LLMs to the more complex Video LLMs, Multimodal LLMs (MLLMs) have demonstrated profound capabilities in comprehending cross-modal information, as numerous studies have illustrated. Previous methods build comprehensive Video LLMs by integrating video foundation models with primitive LLMs. Despite its effectiveness, this paradigm makes the Video LLM's structure verbose and typically requires substantial video data for pre-training. Crucially, it neglects the foundational contributions of ready-made Image LLMs. In this paper, we introduce RED-VILLM, a Resource-Efficient Development pipeline that builds robust Video LLMs by leveraging the prior knowledge of Image LLMs. Specifically, since a video is naturally a sequence of images along the temporal dimension, we devise a plug-and-play temporal adaptation structure that endows the backbone Image LLM with the capability to grasp temporal information. Moreover, by applying this pipeline, we obtain the first Video LLM within the Chinese-speaking community. Extensive experiments demonstrate that Video LLMs developed with our approach surpass conventional Video LLMs while requiring minimal instruction data and training resources. Our approach highlights the potential for more cost-effective and scalable advances in multimodal models.
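The abstract describes adapting an Image LLM to video by treating a video as a stack of frames and adding a temporal adaptation plug-in. The sketch below illustrates one plausible form of such a module, in the spirit of spatio-temporal pooling used by prior Video LLMs: per-frame features from a frozen image encoder are pooled along the temporal and spatial axes and projected into the LLM's embedding space. All module and variable names here are illustrative assumptions, not the paper's actual RED-VILLM implementation.

```python
# Hypothetical sketch of a temporal adaptation plug-in for an Image LLM.
# Assumes frame features come from a frozen image encoder (e.g., a CLIP/ViT);
# names and dimensions are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Single linear projection into the LLM token space, as in common
        # Image-LLM visual adapters.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, N, D) = frames x patch tokens x feature dim,
        # produced per frame by the frozen image encoder.
        temporal = frame_feats.mean(dim=0)   # (N, D): average each patch over frames
        spatial = frame_feats.mean(dim=1)    # (T, D): average each frame over patches
        tokens = torch.cat([temporal, spatial], dim=0)  # (N + T, D) video tokens
        return self.proj(tokens)             # (N + T, llm_dim), fed to the Image LLM

# Example: 8 frames, 256 patch tokens per frame, 1024-dim features.
video_tokens = TemporalAdapter()(torch.randn(8, 256, 1024))
print(video_tokens.shape)  # torch.Size([264, 4096])
```

Because only the adapter and projection are new, the backbone Image LLM and image encoder can stay frozen, which is consistent with the paper's claim of needing minimal video instruction data and training resources.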


