
STAR: A Benchmark for Situated Reasoning in Real-World Videos (2405.09711v1)

Published 15 May 2024 in cs.AI, cs.CL, and cs.CV

Abstract: Reasoning in the real world is not divorced from situations. Capturing present knowledge from surrounding situations and reasoning accordingly is crucial and challenging for machine intelligence. This paper introduces a new benchmark, Situated Reasoning in Real-World Videos (STAR), that evaluates situated reasoning ability via situation abstraction and logic-grounded question answering over real-world videos. The benchmark is built upon real-world videos of human actions and interactions, which are naturally dynamic, compositional, and logical. The dataset includes four types of questions: interaction, sequence, prediction, and feasibility. Situations in the videos are represented by hyper-graphs connecting extracted atomic entities and relations (e.g., actions, persons, objects, and relationships). Beyond visual perception, situated reasoning also requires structured situation comprehension and logical reasoning. Questions and answers are procedurally generated, and the answering logic of each question is represented by a functional program grounded in a situation hyper-graph. We compare various existing video reasoning models and find that they all struggle with this challenging situated reasoning task. We further propose a diagnostic neuro-symbolic model that disentangles visual perception, situation abstraction, language understanding, and functional reasoning to analyze the challenges of this benchmark.
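
The abstract describes two ideas that are easy to picture in code: a situation hyper-graph (per-frame entities plus the actions and relationships connecting them) and a functional program that answers a question by filtering and querying that graph. The sketch below is a minimal illustration of that idea only, not the authors' implementation or data format; all class, function, and field names are hypothetical.

```python
# Minimal illustrative sketch (not the STAR authors' code): a situation
# hyper-graph as a time-ordered list of situations, each holding persons,
# objects, relationships, and actions, plus a tiny functional program
# that answers an interaction-style question by filtering the graph.
from dataclasses import dataclass, field

@dataclass
class Situation:
    """One time step: entities and the relations/actions that connect them."""
    frame: int
    persons: set = field(default_factory=set)
    objects: set = field(default_factory=set)
    relationships: list = field(default_factory=list)  # (person, relation, object)
    actions: list = field(default_factory=list)         # (person, verb, object)

@dataclass
class SituationHyperGraph:
    """A video's situations in temporal order."""
    situations: list

    def filter(self, predicate):
        return [s for s in self.situations if predicate(s)]

def query_objects(graph, verb, after_verb=None, after_obj=None):
    """Hypothetical program for a question like
    'Which object did the person take after putting down the cup?'"""
    start = 0
    if after_verb is not None:
        # locate the frame of the reference action, e.g. ("put down", "cup")
        for s in graph.situations:
            if any(v == after_verb and o == after_obj for _, v, o in s.actions):
                start = s.frame + 1
                break
    answers = []
    for s in graph.filter(lambda s: s.frame >= start):
        answers.extend(o for _, v, o in s.actions if v == verb)
    return answers

if __name__ == "__main__":
    graph = SituationHyperGraph([
        Situation(frame=0, persons={"p1"}, objects={"cup"},
                  actions=[("p1", "put down", "cup")]),
        Situation(frame=1, persons={"p1"}, objects={"book"},
                  actions=[("p1", "take", "book")],
                  relationships=[("p1", "holding", "book")]),
    ])
    print(query_objects(graph, verb="take",
                        after_verb="put down", after_obj="cup"))  # ['book']
```

In the benchmark itself such programs are generated procedurally alongside each question; the sketch only mirrors the filter-then-query composition the abstract describes.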
