Vamos: Versatile Action Models for Video Understanding (2311.13627v3)
Abstract: What makes good representations for video understanding, such as anticipating future activities or answering video-conditioned questions? While earlier approaches focus on end-to-end learning directly from video pixels, we propose to revisit text-based representations, such as general-purpose video captions, which are interpretable and can be directly consumed by LLMs. Intuitively, different video understanding tasks may require representations that are complementary and at different granularities. To this end, we propose versatile action models (Vamos), a learning framework powered by a large language model as the "reasoner", which can flexibly leverage visual embeddings and free-form text descriptions as its input. To interpret the important text evidence for question answering, we generalize the concept bottleneck model to work with tokens and nonlinear models, using hard attention to select a small subset of tokens from the free-form text as inputs to the LLM reasoner. We evaluate Vamos on five complementary benchmarks, Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema, for its capability to model temporal dynamics, encode visual history, and perform reasoning. Surprisingly, we observe that text-based representations consistently achieve competitive performance on all benchmarks, and that visual embeddings provide marginal or no performance improvement, demonstrating the effectiveness of text-based video representations in the LLM era. We also demonstrate that our token bottleneck model is able to select relevant evidence from free-form text, support test-time intervention, and achieve a nearly 5x inference speedup while maintaining competitive question-answering performance. Code and models are publicly released at https://brown-palm.github.io/Vamos/
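The abstract describes a token bottleneck that uses hard attention to pick a small subset of caption tokens as input to the LLM reasoner. Below is a minimal, self-contained sketch of one way such a bottleneck could be implemented in PyTorch, using a straight-through Gumbel top-k selector; this is an illustrative assumption, not the released Vamos code, and the module name `TokenBottleneck`, the linear scorer, and all dimensions are made up for the example.

```python
# Minimal sketch (assumed design, not the authors' implementation): score
# caption tokens, keep only a hard top-k subset, and keep the selection
# trainable with a straight-through Gumbel estimator.
import torch
import torch.nn as nn


class TokenBottleneck(nn.Module):
    """Hard-attention selection of k tokens from free-form caption embeddings."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token relevance score
        self.k = k

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim), e.g. embeddings of caption text
        logits = self.scorer(tokens).squeeze(-1)               # (batch, seq_len)
        if self.training:
            # Add Gumbel noise so the top-k choice is stochastic during training.
            u = torch.rand_like(logits).clamp_min(1e-9)
            logits = logits - torch.log(-torch.log(u))
        soft = torch.softmax(logits, dim=-1)                   # differentiable scores
        topk = logits.topk(self.k, dim=-1).indices             # hard token indices
        hard = torch.zeros_like(soft).scatter_(-1, topk, 1.0)  # 0/1 selection mask
        # Straight-through: forward pass uses the hard mask, gradients flow via `soft`.
        mask = hard + soft - soft.detach()
        selected = tokens * mask.unsqueeze(-1)                 # zero out unselected tokens
        return selected, mask


if __name__ == "__main__":
    bottleneck = TokenBottleneck(dim=768, k=8)
    captions = torch.randn(2, 64, 768)          # stand-in for encoded caption tokens
    selected, mask = bottleneck(captions)
    print(selected.shape, mask.detach().sum(dim=-1))  # k active tokens per sample
```

In a setup like this, only the selected token embeddings (or the corresponding caption words) would be passed on to the LLM reasoner, which is where the interpretability, test-time intervention, and inference speedup described in the abstract would come from.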
- Library of actions: Implementing a generic robot execution framework by using manipulation action semantics. The International Journal of Robotics Research, 2019.
- Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
- Physics of language models: Part 1, context-free grammar. arXiv preprint arXiv:2305.13673, 2023.
- When can transformers reason with abstract symbols? arXiv preprint arXiv:2310.09753, 2023.
- Revisiting the “video” in video-language understanding. In CVPR, 2022.
- Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
- Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
- Videollm: Modeling video sequence with large language models. arXiv preprint arXiv:2305.13292, 2023a.
- Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, 2022.
- Atm: Action temporality modeling for video question answering. In ACM Multimedia, 2023b.
- Uniter: Universal image-text representation learning. In ECCV, 2020.
- Scaling egocentric vision: The epic-kitchens dataset. In ECCV, 2018.
- Language models show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051, 2022.
- Attention over learned object embeddings enables complex visual reasoning. In NeurIPS, 2021.
- Learning temporal dynamics from cycles in narrated video. In ICCV, 2021.
- Slowfast networks for video recognition. In ICCV, 2019.
- Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
- An empirical study of end-to-end video-language transformers with masked visual modeling. In CVPR, 2023.
- Cloob: Modern hopfield networks with infoloob outperform clip. In NeurIPS, 2022.
- Mist: Multi-modal iterative spatial-temporal transformer for long-form video question answering. In CVPR, 2023.
- Ego4d: Around the world in 3,000 hours of egocentric video. In CVPR, 2022.
- Large language models are zero-shot time series forecasters. arXiv preprint arXiv:2310.07820, 2023.
- Ava: A video dataset of spatio-temporally localized atomic visual actions. In CVPR, 2018.
- Visual programming: Compositional visual reasoning without training. In CVPR, 2023.
- Video-based event recognition: activity representation and probabilistic recognition methods. Computer Vision and Image Understanding, 2004.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Avis: Autonomous visual information seeking with large language models. arXiv preprint arXiv:2306.08129, 2023.
- Palm: Predicting actions through language models @ Ego4D long-term action anticipation challenge 2023. arXiv preprint arXiv:2306.16545, 2023.
- Technical report for ego4d long term action anticipation challenge 2023. arXiv preprint arXiv:2307.01467, 2023.
- Recognition of visual activities and interactions by stochastic parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
- Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784, 2018.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Reasoning with heterogeneous graph alignment for video question answering. In AAAI, 2020.
- High-level event recognition in unconstrained videos. International journal of multimedia information retrieval, 2013.
- Action-gpt: Leveraging large-scale language models for improved and generalized zero shot action generation. arXiv preprint arXiv:2211.15603, 2022.
- Event detection in crowded videos. In ICCV, 2007.
- Large language models are temporal and causal reasoners for video question answering. In EMNLP, 2023.
- Concept bottleneck models. In ICML, 2020.
- A hybrid discriminative/generative approach for modeling human activities. In IJCAI, 2005.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
- Intentqa: Context-aware video intent reasoning. In ICCV, 2023c.
- Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Ai choreographer: Music conditioned 3d dance generation with aist++. In ICCV, 2021.
- Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
- Egocentric video-language pretraining. In NeurIPS, 2022.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- Forecasting human-object interaction: joint prediction of motor attention and actions in first person video. In ECCV, 2020.
- Egoschema: A diagnostic benchmark for very long-form video language understanding. arXiv preprint arXiv:2308.09126, 2023.
- Unsupervised learning of object structure and dynamics from videos. In NeurIPS, 2019.
- Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
- Anymal: An efficient and scalable any-modality augmented language model. arXiv preprint arXiv:2309.16058, 2023.
- An ontology for video event representation. In CVPR Workshop, 2004.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- The minimalist grammar of action. Philosophical Transactions of the Royal Society B: Biological Sciences, 2012.
- Parsing video events with goal inference and intent prediction. In ICCV, 2011.
- Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
- Action bank: A high-level representation of activity in video. In CVPR, 2012.
- Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Flava: A foundational language and vision alignment model. In CVPR, 2022.
- Videobert: A joint model for video and language representation learning. In ICCV, 2019.
- Vipergpt: Visual inference via python execution for reasoning. In ICCV, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Generating videos with scene dynamics. In NeurIPS, 2016.
- All in one: Exploring unified video-language pre-training. In CVPR, 2023.
- Actions ~ transformations. In CVPR, 2016.
- Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
- De-diffusion makes text a strong cross-modal interface. arXiv preprint arXiv:2311.00618, 2023.
- Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
- Next-qa: Next phase of question-answering to explaining temporal actions. In CVPR, 2021.
- Video as conditional graph hierarchy for multi-granular question answering. In AAAI, 2022a.
- Video graph transformer for video question answering. In ECCV, 2022b.
- Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- Hitea: Hierarchical temporal-aware video-language pre-training. In ICCV, 2023a.
- mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023b.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Self-chained image-language model for video localization and question answering. arXiv preprint arXiv:2305.06988, 2023.
- Post-hoc concept bottleneck models. arXiv preprint arXiv:2205.15480, 2022.
- Merlot: Multimodal neural script knowledge models. In NeurIPS, 2021.
- Merlot reserve: Neural script knowledge through vision and language and sound. In CVPR, 2022.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Antgpt: Can large language models help long-term action anticipation from videos? arXiv preprint arXiv:2307.16368, 2023.