VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding (2403.14743v3)

Published 21 Mar 2024 in cs.CV

Abstract: Recent studies have demonstrated the effectiveness of LLMs as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. We harness their contextual learning capabilities by presenting LLMs with pairs of instructions and their corresponding high-level programs to generate executable visual programs for video understanding. To enhance the program's accuracy and robustness, we implement two important strategies. Firstly, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs utilizing unsupported functions. Secondly, taking motivation from recent works on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, illustrate these enhancements' efficacy in improving the performance of visual programming approaches for video tasks.
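
The two mechanisms the abstract describes, prompting with instruction/program pairs and flagging programs that call unsupported functions, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the module names (LOCATE, VQA, RESULT), the program syntax, and the helpers build_prompt, unsupported_calls, and feedback_message do not come from the paper, and the LLM call itself is omitted.

```python
import re

# Hypothetical set of supported modules; the abstract does not list them.
SUPPORTED = {"LOCATE", "VQA", "RESULT"}

# Hypothetical in-context demonstrations: (instruction, high-level program).
EXAMPLES = [
    (
        "What is the person holding?",
        "CLIP0=LOCATE(video=VIDEO, query='person')\n"
        "ANS0=VQA(clip=CLIP0, question='What is the person holding?')\n"
        "FINAL=RESULT(var=ANS0)",
    ),
]


def build_prompt(examples, instruction):
    """Concatenate the demonstrations, then the new instruction left open."""
    parts = [f"Instruction: {i}\nProgram:\n{p}" for i, p in examples]
    parts.append(f"Instruction: {instruction}\nProgram:")
    return "\n\n".join(parts)


# Matches the function name in statements of the form NAME=FUNC(...).
CALL = re.compile(r"=\s*([A-Za-z_]\w*)\s*\(")


def unsupported_calls(program, supported=SUPPORTED):
    """Return the sorted names of called functions outside the supported set."""
    return sorted({name for name in CALL.findall(program) if name not in supported})


def feedback_message(program, supported=SUPPORTED):
    """Build a correction request for the LLM, or None if the program is valid."""
    bad = unsupported_calls(program, supported)
    if not bad:
        return None
    return (
        "The program calls unsupported functions: " + ", ".join(bad)
        + ". Rewrite it using only: " + ", ".join(sorted(supported)) + "."
    )
```

In a loop of this kind, the string returned by feedback_message would be sent back to the model together with the faulty program until no unsupported call remains. The self-refinement of the in-context examples is a separate procedure and is not sketched here.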

Citations (3)

Tweets

https://twitter.com/twelve_labs/status/1879356255061553219

https://twitter.com/CSVisionPapers/status/1772123999473901572

https://twitter.com/ai_papers/status/1772437582225543368
