Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering (2303.01903v4)
Abstract: Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve the required knowledge from explicit knowledge bases (KBs), which often introduces information irrelevant to the question and thus restricts model performance. Recent works instead use a powerful LLM as an implicit knowledge engine to acquire the knowledge needed for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of the blind LLM, as the provided textual input is insufficient to depict the visual information required to answer the question. In this paper, we present Prophet, a conceptually simple, flexible, and general framework designed to prompt an LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. We then extract two types of complementary answer heuristics from the VQA model: answer candidates and answer-aware examples. The two types of answer heuristics are jointly encoded into a formatted prompt to facilitate the LLM's understanding of both the image and the question, leading to a more accurate answer. By incorporating the state-of-the-art LLM GPT-3, Prophet significantly outperforms existing state-of-the-art methods on four challenging knowledge-based VQA datasets. Prophet is general in that it can be instantiated with combinations of different VQA models (both discriminative and generative) and different LLMs (both commercial and open-source). Moreover, Prophet can be integrated with modern large multimodal models at different stages, yielding a variant named Prophet++ that further improves performance on knowledge-based VQA tasks.
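To make the prompting scheme concrete, the following Python snippet is a minimal sketch of how answer candidates (with confidence scores from the VQA model) and answer-aware in-context examples might be assembled into a formatted prompt for the LLM. The `Example` dataclass, the `format_block` and `build_prompt` helpers, the field names, and the exact template wording are illustrative assumptions, not the paper's actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Example:
    """A hypothetical answer-aware in-context example selected from the training set."""
    context: str                          # textual description (e.g., caption) of the image
    question: str
    candidates: List[Tuple[str, float]]   # (answer, confidence) pairs from the VQA model
    answer: str                           # ground-truth answer for this training example

def format_block(context: str, question: str,
                 candidates: List[Tuple[str, float]],
                 answer: Optional[str] = None) -> str:
    """Render one instance (training example or test input) as a text block."""
    cand_str = ", ".join(f"{a} ({c:.2f})" for a, c in candidates)
    return (f"Context: {context}\n"
            f"Question: {question}\n"
            f"Candidates: {cand_str}\n"
            f"Answer: {answer if answer is not None else ''}")

def build_prompt(instruction: str, examples: List[Example],
                 test_context: str, test_question: str,
                 test_candidates: List[Tuple[str, float]]) -> str:
    """Concatenate the task instruction, answer-aware examples, and the test instance."""
    blocks = [instruction]
    blocks += [format_block(e.context, e.question, e.candidates, e.answer) for e in examples]
    blocks.append(format_block(test_context, test_question, test_candidates))
    return "\n\n".join(blocks)

if __name__ == "__main__":
    instruction = ("Please answer the question according to the context and the "
                   "answer candidates. Each candidate has a confidence score.")
    examples = [
        Example(context="A man riding a wave on a surfboard in the ocean.",
                question="What sport is shown?",
                candidates=[("surfing", 0.92), ("skateboarding", 0.05)],
                answer="surfing"),
    ]
    prompt = build_prompt(
        instruction, examples,
        test_context="A plate of pasta with tomato sauce on a table.",
        test_question="What country is this dish associated with?",
        test_candidates=[("italy", 0.71), ("france", 0.12), ("china", 0.04)],
    )
    print(prompt)  # this string would then be sent to the LLM (e.g., GPT-3)
```

The answer prediction returned by the LLM for this prompt would serve as the final answer; the confidence-annotated candidate list and the answer-aware examples are what distinguish this prompt from a caption-only prompt to a blind LLM.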