Modular Visual Question Answering via Code Generation

Published 8 Jun 2023 in cs.CL | (2306.05392v1)

Abstract: We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained LMs, visual models pre-trained on image-caption pairs, and fifty VQA examples used for in-context learning. The generated Python programs invoke and compose the outputs of the visual models using arithmetic and conditional logic. Our approach improves accuracy on the COVR dataset by at least 3% and on the GQA dataset by roughly 2% compared to the few-shot baseline that does not employ code generation.

Authors (9)

Citations (40)

Summary

The paper presents a novel modular VQA framework that generates Python code to orchestrate visual primitives without requiring extra training.
It leverages pre-trained language and visual models to express multi-step reasoning as code, improving accuracy by at least 3% on COVR and roughly 2% on GQA over a few-shot baseline that does not use code generation.
The method offers an adaptable, low-retraining solution for complex visual queries, paving the way for advanced multi-modal reasoning in future research.

The paper "Modular Visual Question Answering via Code Generation" presents a framework that treats Visual Question Answering (VQA) as modular code generation. This contrasts with earlier modular approaches such as differentiable neural module networks, which require significant retraining when modules are added or modified. The work relies on pre-trained language models and pre-trained visual models with no additional training, and answers complex VQA questions by generating a program and executing it.

Methodology

The proposed framework uses a pre-trained code-generating language model such as Codex to produce Python programs. These programs orchestrate predefined visual primitives that interface with vision-language models (VLMs) to process and analyze image data. Each program breaks the question into steps that can include arithmetic or conditional logic, in effect recasting VQA as a form of program synthesis.
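The prompting step can be sketched roughly as follows. This is a minimal illustration, not the authors' prompt format: the `Example` dataclass, the `build_prompt` helper, the `API_HEADER` text, the comment-based layout, and the sample question/program pairs are all assumptions made for this sketch. The abstract mentions fifty in-context examples; only two are shown here for brevity.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One in-context example: a question paired with a program that answers it."""
    question: str
    program: str


# Hypothetical header describing the primitives to the code-generating LM.
# The signatures and return types are assumptions for illustration.
API_HEADER = '''"""Available primitives (assumed signatures):
query(image, question) -> str
get_pos(image, text) -> (x, y)
find_matching_image(images, text) -> image
"""'''


def build_prompt(examples, question):
    """Concatenate the API header, the worked examples, and the new question."""
    parts = [API_HEADER]
    for ex in examples:
        parts.append(f"# Question: {ex.question}\n{ex.program}")
    # The prompt ends with the new question so the LM continues with a program.
    parts.append(f"# Question: {question}\n")
    return "\n\n".join(parts)


examples = [
    Example(
        question="Is the cup to the left of the plate?",
        program=(
            "cup_x, _ = get_pos(image, 'cup')\n"
            "plate_x, _ = get_pos(image, 'plate')\n"
            "answer = 'yes' if cup_x < plate_x else 'no'"
        ),
    ),
    Example(
        question="What color is the car?",
        program="answer = query(image, 'What color is the car?')",
    ),
]

prompt = build_prompt(examples, "Is the dog above the sofa?")
print(prompt)
```

The resulting string would be sent to the language model, whose completion is then treated as the program for the new question. Because the examples live only in the prompt, swapping or adding primitives means editing text rather than retraining a model.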

The researchers introduce a suite of visual primitives:

query(image, question): Provides answers to questions about an image through iterative image patch captioning and LLM-based question answering.
get_pos(image, text): Uses localization techniques to determine the position of objects within an image.
find_matching_image(images, text): Identifies the image in a set that best matches a given text, using image-text similarity scores.
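To make the composition of these primitives concrete, here is a minimal sketch of a generated program and how it could be executed. The primitive bodies below are toy stubs that operate on dictionaries rather than real images; the actual primitives call pre-trained models. The `run_program` helper, the `GENERATED_PROGRAM` text, the convention of storing the result in a variable named `answer`, and the `(x, y)` return value of `get_pos` are all assumptions made for illustration.

```python
# Toy stand-ins: each "image" is a dict mapping object name -> (x, y) centre.
# Real primitives would call pre-trained vision-language models instead.

def query(image, question):
    """Stub: answer 'yes'/'no' to questions of the form 'Is there a <object>?'."""
    obj = question.rstrip("?").split()[-1]
    return "yes" if obj in image else "no"


def get_pos(image, text):
    """Stub: return the stored (x, y) position of the named object."""
    return image[text]


def find_matching_image(images, text):
    """Stub: return the image with the highest toy similarity score to the text."""
    def score(image):
        # Count how many of the image's object names appear in the text.
        return sum(1 for name in image if name in text)
    return max(images, key=score)


# A program of the kind the LM might generate for the question:
# "Are there more images with a dog than images with a cat?"
GENERATED_PROGRAM = """
dog_count = 0
cat_count = 0
for image in images:
    if query(image, "Is there a dog?") == "yes":
        dog_count += 1
    if query(image, "Is there a cat?") == "yes":
        cat_count += 1
answer = "yes" if dog_count > cat_count else "no"
"""


def run_program(program, images):
    """Execute generated code in a namespace holding the primitives and images."""
    namespace = {
        "query": query,
        "get_pos": get_pos,
        "find_matching_image": find_matching_image,
        "images": images,
    }
    exec(program, namespace)  # not safe for untrusted code; acceptable in a sketch
    return namespace["answer"]


images = [
    {"dog": (10, 40), "sofa": (50, 60)},
    {"dog": (30, 20), "cat": (70, 20)},
    {"tree": (5, 5)},
]

print(run_program(GENERATED_PROGRAM, images))

# A spatial question on one image: pick the image, then compare x-coordinates.
picture = find_matching_image(images, "a dog next to a cat")
dog_x, _ = get_pos(picture, "dog")
cat_x, _ = get_pos(picture, "cat")
print("left" if dog_x < cat_x else "right")
```

The counting and the comparison happen in ordinary Python, so the visual models only need to answer simple per-image questions. This is the sense in which the programs compose model outputs with arithmetic and conditional logic.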

Results

Evaluations on the COVR and GQA datasets show the gains from this framework. The approach improved accuracy by at least 3% on COVR and roughly 2% on GQA compared to the few-shot baseline that does not use code generation. The improvement is particularly significant on questions that involve spatial relationships or multiple conditions, reflecting the potential of modularity in reasoning tasks.

Implications

The implications of this research are twofold. Practically, the modular system is versatile and adapts to a wide range of VQA challenges, benefiting from advances in vision and language models without requiring model retraining. Theoretically, it shows the value of program synthesis for combining the capabilities of pre-trained models in multi-modal reasoning.

Future Directions

This work paves the way for future research in developing more sophisticated and nuanced primitives that could handle broader classes of reasoning or incorporate external libraries for additional functionality. Extending this framework into non-English language settings remains another prospective area for exploration. Additionally, addressing the computation and cost limitations inherent in deploying LLMs in real-world scenarios will be crucial to harnessing the full potential of this approach.
