Language as the Medium: Multimodal Video Classification through text only (2309.10783v1)

Published 19 Sep 2023 in cs.CV, cs.AI, and cs.CL

Abstract: Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information. Our method leverages the extensive knowledge learnt by LLMs, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for "sight" or "hearing" and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
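
The pipeline the abstract describes lends itself to a straightforward implementation: sample frames from a video, caption them with BLIP-2, transcribe the audio with Whisper, and hand the combined text to an LLM for zero-shot classification. Below is a minimal sketch of that flow, assuming BLIP-2 via Hugging Face Transformers, Whisper via the faster-whisper package (reference 1), and GPT-3.5 via the OpenAI Chat Completions API (reference 2). The checkpoints, prompt wording, and label handling are illustrative assumptions rather than the authors' exact setup, and the ImageBind audio-tagging step is omitted for brevity.

```python
# Minimal sketch (not the authors' exact pipeline): caption sampled frames with
# BLIP-2, transcribe audio with faster-whisper, then ask an LLM to pick a label
# in-context. Checkpoints and prompt wording are illustrative assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from faster_whisper import WhisperModel
from openai import OpenAI

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto"
)
whisper = WhisperModel("base")  # faster-whisper, as in reference 1
client = OpenAI()               # Chat Completions API, as in reference 2

def describe_video(frame_paths, audio_path):
    """Turn a video's frames and audio track into plain-text descriptions."""
    captions = []
    for path in frame_paths:
        inputs = processor(images=Image.open(path), return_tensors="pt").to(
            blip2.device, torch.float16
        )
        ids = blip2.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(ids[0], skip_special_tokens=True).strip())
    segments, _ = whisper.transcribe(audio_path)
    transcript = " ".join(seg.text.strip() for seg in segments)
    return captions, transcript

def classify_video(frame_paths, audio_path, labels):
    """Zero-shot classification: the LLM 'sees' and 'hears' through text only."""
    captions, transcript = describe_video(frame_paths, audio_path)
    prompt = (
        "Frame captions, in temporal order:\n- " + "\n- ".join(captions)
        + f"\n\nAudio transcript:\n{transcript or '(no speech detected)'}\n\n"
        + "Which one of the following actions does the video show? "
        + ", ".join(labels)
        + "\nAnswer with exactly one label."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example usage with labels drawn from a benchmark such as UCF-101:
# print(classify_video(["f0.jpg", "f1.jpg"], "clip.wav",
#                      ["PlayingGuitar", "Typing", "Surfing"]))
```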

References (19)
  1. faster-whisper. https://github.com/guillaumekln/faster-whisper
  2. OpenAI Chat Completions API. https://platform.openai.com/docs/guides/gpt/chat-completions-api
  3. Introducing Claude. Anthropic. https://www.anthropic.com/index/introducing-claude
  4. Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception. arXiv:2305.06324, 2023.
  5. Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS, 2022.
  6. Towards Language Models That Can See: Computer Vision Through the Lens of Natural Language. arXiv:2306.16410, 2023.
  7. Language Models Are Few-Shot Learners. NeurIPS, 2020.
  8. Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions. arXiv:2304.04227, 2023.
  9. ImageBind: One Embedding Space to Bind Them All. CVPR, 2023.
  10. VTC: Improving Video-Text Retrieval with User Comments. ECCV, 2022.
  11. Language Is Not All You Need: Aligning Perception with Language Models. arXiv:2302.14045, 2023.
  12. Perceiver IO: A General Architecture for Structured Inputs & Outputs. arXiv:2107.14795, 2021.
  13. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597, 2023.
  14. OpenAI. GPT-4 Technical Report, 2023.
  15. Learning Transferable Visual Models from Natural Language Supervision. ICML, 2021.
  16. Robust Speech Recognition via Large-Scale Weak Supervision. ICML, 2023.
  17. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100, 2022.
  18. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288, 2023.
  19. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592, 2023.
