Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources

Published 22 Nov 2015 in cs.CV | (1511.06973v2)

Abstract: We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with textual information sourced from a knowledge base, to develop a deeper understanding of the scene viewed. Priming a recurrent neural network with this combined information, and the submitted question, leads to a very flexible visual question answering approach. We are specifically able to answer questions posed in natural language, that refer to information not contained in the image. We demonstrate the effectiveness of our model on two publicly available datasets, Toronto COCO-QA and MS COCO-VQA and show that it produces the best reported results in both cases.

Summary

The paper demonstrates that fusing CNN-driven image attributes with external knowledge via an RNN framework enhances responses to complex visual questions.
The method combines multi-label CNN attribute detection, LSTM-generated captions, and dynamic SPARQL-based knowledge querying to bridge informational gaps.
Empirical results show significant accuracy improvements on Toronto COCO-QA (69.73%) and VQA (59.44%), surpassing prior state-of-the-art models.

Overview of "Ask Me Anything: Free-form Visual Question Answering Based on Knowledge from External Sources"

This paper presents an approach to Visual Question Answering (VQA) that combines image content with external semantic knowledge to answer complex visual questions. VQA is difficult because neither the form of the question nor the operations needed to answer it are known in advance. Existing systems often rely solely on internal image representations without drawing on auxiliary knowledge, which limits their ability to handle questions that require broader context.

Methodological Framework

The core advancement of this research is the integration of a textual image representation with external knowledge to improve VQA. The method employs a recurrent neural network (RNN), primed with information derived from both the image and a general knowledge base. This setup allows the model to respond to free-form, natural language questions about image content, even when critical information is absent from the image itself.
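As a rough illustration of what "priming" could look like, the sketch below projects three toy vectors (standing in for the attribute, caption, and knowledge representations) into a shared size and concatenates them into a single vector that a recurrent network could receive before the question words. Everything here is an assumption made for illustration: the function names, the dimensions, the random weights, and the use of plain concatenation are hypothetical, and the toy vectors replace the CNN, caption model, and knowledge encoder of the actual system.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(weights: Matrix, vec: Vector) -> Vector:
    """Matrix-vector product: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def build_priming_vector(
    attributes: Vector,
    caption_vec: Vector,
    knowledge_vec: Vector,
    w_att: Matrix,
    w_cap: Matrix,
    w_know: Matrix,
) -> Vector:
    """Project each source separately, then concatenate the results."""
    return (
        project(w_att, attributes)
        + project(w_cap, caption_vec)
        + project(w_know, knowledge_vec)
    )


rng = random.Random(0)
EMBED = 2  # hypothetical per-source embedding size

# Toy stand-ins for the three representations (values are made up).
attributes = [0.9, 0.1, 0.7, 0.0]   # attribute probabilities
caption_vec = [0.2, -0.4, 0.5]      # encoded captions
knowledge_vec = [0.3, 0.3]          # encoded knowledge-base text

w_att = random_matrix(EMBED, len(attributes), rng)
w_cap = random_matrix(EMBED, len(caption_vec), rng)
w_know = random_matrix(EMBED, len(knowledge_vec), rng)

priming = build_priming_vector(
    attributes, caption_vec, knowledge_vec, w_att, w_cap, w_know
)
print(len(priming))  # three sources, EMBED values each
```

Keeping a separate projection per source lets each representation have its own input size while still contributing a fixed-length slice to the combined vector.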

The proposed method involves three primary components:

Attribute-Based Image Representation: Images are decomposed into high-level semantic attributes using convolutional neural networks (CNNs). This multi-label classification approach detects objects, scenes, actions, and other relevant concepts within an image.
Caption-Based Image Representation: Starting from the attribute-based representation, an LSTM generates multiple captions for the image. These captions form an internal textual representation that describes the image in the same natural-language style as human-written annotations.
Knowledge Base Integration: The model dynamically queries an external knowledge base (e.g., DBpedia) using SPARQL, retrieving textual information that complements the internal image description. This feature is designed to bridge informational gaps present in image data alone.
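The attribute-to-query step can be sketched as follows: pick the highest-scoring attributes from a multi-label classifier's output, then build one SPARQL query per attribute that asks for descriptive text. This is a minimal sketch under stated assumptions, not the authors' query: the helper names, the cutoff `k`, the example scores, and the exact query shape (matching on `rdfs:label` and returning `rdfs:comment`) are all hypothetical, and the code only builds query strings without contacting any endpoint.

```python
from typing import Dict, List


def top_k_attributes(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Return the k attribute names with the highest predicted scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def build_comment_query(attribute: str, language: str = "en") -> str:
    """Build a SPARQL query string for descriptive text about one attribute.

    The query shape is an assumption for illustration only.
    """
    # Escape backslashes and double quotes so the literal stays well-formed.
    safe = attribute.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?comment WHERE {\n"
        f'  ?entry rdfs:label "{safe}"@{language} .\n'
        "  ?entry rdfs:comment ?comment .\n"
        f'  FILTER (lang(?comment) = "{language}")\n'
        "}"
    )


# Made-up classifier output: attribute name -> predicted probability.
scores = {"dog": 0.92, "grass": 0.81, "frisbee": 0.77, "car": 0.05}

selected = top_k_attributes(scores, k=2)
queries = [build_comment_query(name) for name in selected]
print(queries[0])
```

In practice the label match would likely need normalization (for example capitalization or plural handling) before it finds entries in a real knowledge base; the exact-match literal here keeps the sketch short.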

Evaluation and Results

The method is evaluated on two datasets: Toronto COCO-QA and VQA. The proposed model reaches an accuracy of 69.73% on Toronto COCO-QA and 59.44% on VQA, which the authors describe as the best reported results on both. The model does particularly well when a question requires information beyond the visual content, such as common sense or domain-specific knowledge. This is relevant because VQA is sometimes described as an "AI-complete" task, in which answering well depends on knowledge that the image alone does not supply.

Implications and Future Directions

This research has both practical and theoretical implications. Practically, it shows that incorporating a large knowledge base improves VQA accuracy, pointing toward more human-like image understanding. Theoretically, it supports multimodal approaches that combine visual and semantic processing for complex AI tasks.

Future research could explore adaptive knowledge querying, in which the information retrieved is tailored to each question rather than to the image alone. Aligning image attributes more closely with the language used in captions could also improve how the model interprets a scene. As larger and more comprehensive knowledge bases become available, more capable VQA systems become feasible.
