VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

(arXiv 2406.13444)
Published Jun 19, 2024 in cs.CL and cs.CV

Abstract

Visual programs are executable code generated by LLMs to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/

Figure: Categorization of synthetic errors from greedy and mask-best decoding methods.

Overview

  • The paper presents VDebugger, a critic-refiner framework designed to debug visual programs generated by LLMs, improving their accuracy and interpretability by tracking execution step by step.

  • VDebugger features an automated pipeline for generating large-scale training datasets through a novel mask-best decoding technique, significantly increasing error injection success rates.

  • Empirical evaluations on six datasets show performance improvements of up to 3.2% in downstream task accuracy and 2.3% on unseen tasks, highlighting VDebugger's efficacy and generalization capabilities.

An Overview of "VDebugger: Harnessing Execution Feedback for Debugging Visual Programs"

This paper presents VDebugger, a sophisticated critic-refiner framework designed to debug visual programs by tracking their execution step by step. Visual programs are generated by LLMs to solve complex visual reasoning tasks by decomposing them into multiple reasoning steps and invoking specialized models for each step. These visual programs, while innovative, are prone to logic errors that significantly impair their effectiveness. The paper notes that 58% of total errors in visual programs are due to such logic errors, making debugging a crucial task for improving the programs' accuracy and interpretability.

Main Contributions

The primary contributions of this paper include:

  1. VDebugger Framework: The authors propose VDebugger, a novel framework that pairs a critic, which identifies errors, with a refiner, which corrects them, utilizing detailed execution feedback (see the sketch after this list).
  2. Automated Training Data Generation: The development of an automated pipeline to generate large-scale training datasets, including 47.7k program pairs. This is achieved through a novel mask-best decoding technique that increases the success rate of error injection by up to ten times compared to traditional greedy decoding.
  3. Empirical Results: Extensive evaluations on six datasets demonstrate VDebugger's efficacy, achieving performance improvements of up to 3.2% in downstream task accuracy. Moreover, VDebugger shows a notable improvement of 2.3% on unseen tasks, underscoring its generalization capability.
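
To make the critic-refiner interplay concrete, below is a minimal Python sketch of one critique-and-refine round. The callables `run_program`, `critic`, and `refiner` are hypothetical stand-ins for the program executor and the two fine-tuned models, and the "No error" convention is illustrative rather than the paper's exact output format.

```python
from typing import Callable, Tuple

def vdebugger_step(
    question: str,
    program: str,
    run_program: Callable[[str], Tuple[object, str]],  # -> (result, execution trace)
    critic: Callable[[str, str, str], str],            # (question, program, trace) -> critique
    refiner: Callable[[str, str, str, str], str],      # (..., critique) -> corrected program
) -> str:
    """One round of critique and refinement for a visual program."""
    # Execute the program, capturing step-by-step feedback: which lines
    # ran, how variable values changed, and any errors raised.
    _result, trace = run_program(program)

    # The critic reads the program together with its execution trace and
    # localizes the error (or declares the program correct).
    critique = critic(question, program, trace)
    if critique.strip().lower() == "no error":
        return program  # accepted as-is

    # The refiner rewrites the program conditioned on the critique.
    return refiner(question, program, trace, critique)
```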

Technical Details

The VDebugger framework involves two main components: the critic and the refiner. The critic tracks the execution states of the visual program and identifies errors at a fine-grained level, while the refiner corrects these errors based on detailed feedback from the critic. The execution feedback includes every code line executed, changes in variable values, and any errors encountered, enabling a comprehensive debugging process that mimics the stepping debugging strategy used by human programmers.
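
The paper's exact trace format is not reproduced here, but Python's built-in sys.settrace hook shows how this kind of line-by-line feedback (executed lines, variable state, raised exceptions) can be collected; `trace_execution` and its log format are illustrative assumptions:

```python
import sys
from types import FrameType

def trace_execution(code: str) -> list[str]:
    """Run `code`, logging each line as it executes plus the variable state."""
    log = []
    lines = code.splitlines()

    def tracer(frame: FrameType, event: str, arg):
        if event == "line":  # fires just before each line executes
            state = {k: repr(v) for k, v in frame.f_locals.items()
                     if not k.startswith("__")}
            log.append(f"line {frame.f_lineno}: "
                       f"{lines[frame.f_lineno - 1].strip()}  |  {state}")
        elif event == "exception":
            exc_type, exc_value, _tb = arg
            log.append(f"line {frame.f_lineno}: {exc_type.__name__}: {exc_value}")
        return tracer

    sys.settrace(tracer)
    try:
        exec(code, {})
    finally:
        sys.settrace(None)
    return log

# Each entry shows the line about to run and the variables visible to it.
for entry in trace_execution("x = 2\ny = x * 3\nz = y - x"):
    print(entry)
```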

To train VDebugger, the authors devised an automated pipeline that first generates correct visual programs using an LLM and then creates incorrect programs by injecting errors. This is achieved with the novel mask-best decoding approach, which selectively masks out the token with the highest predicted probability to induce errors, thereby improving the diversity and complexity of the generated incorrect programs. This method dramatically increases the success rate of error injection, facilitating the creation of a robust training dataset.
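
A single step of mask-best decoding can be sketched as follows, assuming PyTorch and a 1-D logits vector over the vocabulary; whether the mask is applied at every decoding step or only at selected ones is an implementation detail not pinned down here:

```python
import torch

def mask_best_sample(logits: torch.Tensor) -> int:
    """Sample a token id after masking out the most likely token.

    Greedy decoding would return argmax(logits); forbidding that token
    forces a plausible-but-different continuation, i.e. an injected error.
    """
    logits = logits.clone()
    best = torch.argmax(logits)       # the token greedy decoding would pick
    logits[best] = float("-inf")      # mask it out entirely
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy usage: greedy would pick id 2 (logit 3.0); mask-best never returns it.
toy_logits = torch.tensor([0.1, 0.5, 3.0, 1.2])
print(mask_best_sample(toy_logits))
```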

Empirical Evaluation

The paper conducts evaluations on six datasets spanning three task forms: visual question answering over a single image, visual question answering over multiple images, and visual grounding. The datasets include GQA for compositional question answering, TallyQA for counting, NLVRv2 for reasoning over image pairs, and three variants of RefCOCO for visual grounding. Performance is measured with accuracy and, for the grounding tasks, Intersection over Union (IoU).
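
For reference, IoU for axis-aligned boxes is the standard overlap-over-union ratio; the sketch below assumes (x1, y1, x2, y2) box coordinates, and the common practice of thresholding IoU (e.g. at 0.5) to score a grounding prediction is an assumption rather than a detail taken from the paper.

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 squares offset by (1, 1): intersection 1, union 7, IoU = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.1429
```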

The results indicate significant performance gains across all datasets, with improvements of up to 3.2% in accuracy. The framework also generalizes: it effectively debugs programs generated by larger, more sophisticated LLMs such as GPT-3.5 and CodeLlama-70B, achieving notable performance improvements even in these scenarios.

Implications and Future Directions

The practical implications of VDebugger are substantial. By improving the accuracy and interpretability of visual programs, VDebugger can enhance various applications in AI that require complex visual reasoning, such as autonomous navigation, medical image analysis, and intelligent surveillance systems. Theoretically, this work pushes the boundaries of debugging techniques for LLM-generated programs, offering a robust approach to handling errors and improving model reliability.

Future research could explore integrating visual information into the debugging process, potentially enhancing the critic's ability to identify errors in visual program steps more accurately. Additionally, joint training with foundational vision-language models (VLMs) might further optimize the framework, allowing the debugger to leverage richer multimodal context.

Conclusion

In conclusion, "VDebugger: Harnessing Execution Feedback for Debugging Visual Programs" introduces an innovative approach to address a significant bottleneck in visual program execution. By leveraging detailed execution feedback and a novel error injection technique, VDebugger substantially improves the accuracy and robustness of visual programs, demonstrating its potential for broad applicability and setting a strong foundation for future explorations in programmatic visual reasoning.
