CI w/o TN: Context Injection without Task Name for Procedure Planning

arXiv:2402.15579
Published Feb 23, 2024 in cs.CV and cs.CL

Abstract

This paper explores the challenge of procedure planning in instructional videos: producing goal-directed action plans from visual start and goal observations. Prior work has tackled this problem with progressively weaker training supervision, moving from heavy intermediate visual observations or language instructions down to task-class supervision. With the advent of LLMs, however, the task name alone is enough for these models to produce a detailed plan. In this study, we propose an even weaker setting with no task name as supervision, which existing LLMs cannot currently solve since they require well-formed prompts with sufficient information. Specifically, we hypothesize that the intermediate supervision used in prior work can serve as context information, and we use captions of the visual start and goal observations as a much cheaper form of supervision. This greatly reduces labeling cost, since captions can be obtained easily from large pre-trained vision-language foundation models. Technically, we apply BLIP to generate captions as supervision for training the context feature with a contrastive learning loss. The context feature is then fed into the generator to aid plan generation. Experiments on two datasets of varying scale show that our model achieves comparable performance on multiple metrics, validating our hypothesis.
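The abstract describes a two-step recipe: caption the start/goal frames with a frozen BLIP model, then align a learned context feature with those captions via a contrastive loss before passing it to the plan generator. The sketch below is a minimal, hypothetical rendering of that idea in PyTorch, not the authors' implementation: the BLIP checkpoint, the InfoNCE form of the loss, the feature dimension, and the random stand-ins for the planner's context features and the caption encoder are all assumptions made for illustration.

```python
# Hypothetical sketch of caption-supervised context learning (not the paper's code).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption the visual start/goal observations with a frozen BLIP captioner.
#    Checkpoint choice is an assumption; any BLIP captioning model would do.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device).eval()

def caption_frame(frame: Image.Image) -> str:
    """Generate a caption for one video frame (start or goal observation)."""
    inputs = processor(images=frame, return_tensors="pt").to(device)
    with torch.no_grad():
        ids = blip.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)

# 2) InfoNCE-style contrastive loss aligning the planner's context feature
#    with an embedding of the BLIP caption. Matched (context, caption) pairs
#    within a batch are positives; all other pairings are negatives.
def contrastive_loss(context_feat: torch.Tensor,
                     caption_feat: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    context_feat = F.normalize(context_feat, dim=-1)   # (B, D)
    caption_feat = F.normalize(caption_feat, dim=-1)   # (B, D)
    logits = context_feat @ caption_feat.t() / temperature  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    return F.cross_entropy(logits, targets)

# Toy usage: a blank frame and random features standing in for real encoders.
start = Image.new("RGB", (224, 224))
print("BLIP caption:", caption_frame(start))
ctx = torch.randn(8, 256, device=device)  # context features from the planner (assumed dim)
cap = torch.randn(8, 256, device=device)  # caption embeddings from some text encoder (assumed)
print("loss:", contrastive_loss(ctx, cap).item())
```

Contrastive alignment is a natural fit for this setting because it only needs paired (observation, caption) examples, which a frozen captioner like BLIP supplies without any manual task-name or step annotation.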
