STAR: A Benchmark for Situated Reasoning in Real-World Videos (2405.09711v1)
Abstract: Reasoning in the real world is not divorced from situations. Capturing the present knowledge from surrounding situations and reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark, Situated Reasoning in Real-World Videos (STAR Benchmark), that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering for real-world videos. The benchmark is built upon real-world videos associated with human actions or interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. We represent the situations in real-world videos by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Besides visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated. The answering logic of each question is represented by a functional program based on a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle on this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning to understand the challenges of this benchmark.
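To make the idea of a functional program executed over a situation hyper-graph more concrete, here is a minimal sketch in Python. It is an illustration only, not the benchmark's actual data format or program vocabulary: the class names (`Action`, `SituationGraph`), the operation names (`actions`, `filter_object`, `after`, `first_verb`), and the toy video content are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-in for a situation hyper-graph.
# An Action behaves like a hyperedge: it ties a verb and an object
# to a span of frames rather than to a single frame.
@dataclass(frozen=True)
class Action:
    verb: str
    obj: str
    start: int  # first frame of the action
    end: int    # last frame of the action

@dataclass
class SituationGraph:
    actions: list
    # Per-frame (frame, subject, predicate, object) relation tuples.
    # Kept for completeness; the two toy programs below do not use them.
    relations: list = field(default_factory=list)

# Hypothetical program operations. The first consumes the graph,
# the others consume the output of the previous step.
def op_actions(graph):
    return sorted(graph.actions, key=lambda a: a.start)

def op_filter_object(acts, obj):
    return [a for a in acts if a.obj == obj]

def op_after(acts, verb, obj):
    anchors = [a for a in acts if a.verb == verb and a.obj == obj]
    if not anchors:
        return []
    return [a for a in acts if a.start >= anchors[0].end]

def op_first_verb(acts):
    return acts[0].verb if acts else None

OPS = {
    "actions": op_actions,
    "filter_object": op_filter_object,
    "after": op_after,
    "first_verb": op_first_verb,
}

def run_program(graph, program):
    """Apply each (operation, args) step in order, threading the state."""
    state = graph
    for name, args in program:
        state = OPS[name](state, *args)
    return state

# Toy situation: take a cup, open the fridge, put the cup down.
graph = SituationGraph(
    actions=[
        Action("take", "cup", 0, 2),
        Action("open", "fridge", 3, 5),
        Action("put down", "cup", 6, 8),
    ],
    relations=[(0, "person", "holding", "cup"), (4, "person", "in_front_of", "fridge")],
)

# Interaction-style question: "What did the person do with the fridge?"
interaction_program = [
    ("actions", ()),
    ("filter_object", ("fridge",)),
    ("first_verb", ()),
]

# Sequence-style question: "What did the person do after opening the fridge?"
sequence_program = [
    ("actions", ()),
    ("after", ("open", "fridge")),
    ("first_verb", ()),
]

interaction_answer = run_program(graph, interaction_program)
sequence_answer = run_program(graph, sequence_program)
```

Because each answer is produced by an explicit chain of operations over the graph, a wrong answer can in principle be traced either to a faulty graph (perception or abstraction) or to a faulty program (language understanding or reasoning), which is the kind of separation the abstract attributes to the diagnostic neuro-symbolic model.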