SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (2407.15841v2)

Published 22 Jul 2024 in cs.CV

Abstract: We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video LLM that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., with 12x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets. Code has been made available at: https://github.com/apple/ml-slowfast-llava.
