Hierarchical Question-Image Co-Attention for Visual Question Answering

Published 31 May 2016 in cs.CV and cs.CL | (1606.00061v5)

Abstract: A number of recent works have proposed attention models for Visual Question Answering (VQA) that generate spatial maps highlighting image regions relevant to answering the question. In this paper, we argue that in addition to modeling "where to look" or visual attention, it is equally important to model "what words to listen to" or question attention. We present a novel co-attention model for VQA that jointly reasons about image and question attention. In addition, our model reasons about the question (and consequently the image via the co-attention mechanism) in a hierarchical fashion via a novel 1-dimensional convolution neural networks (CNN). Our model improves the state-of-the-art on the VQA dataset from 60.3% to 60.5%, and from 61.6% to 63.3% on the COCO-QA dataset. By using ResNet, the performance is further improved to 62.1% for VQA and 65.4% for COCO-QA.

Citations (1,560)

Summary

The paper introduces a novel co-attention mechanism that simultaneously generates attention maps for both image and question, improving VQA state-of-the-art results.
It implements a hierarchical question representation using word, phrase, and question levels along with parallel and alternating co-attention strategies.
The approach raises accuracy on the VQA dataset from 60.3% to 60.5% (62.1% with ResNet features) and on the COCO-QA dataset from 61.6% to 63.3% (65.4% with ResNet features).

The paper "Hierarchical Question-Image Co-Attention for Visual Question Answering" by Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh introduces a novel approach to the Visual Question Answering (VQA) problem. VQA is a multi-disciplinary problem that requires a model to answer natural-language questions about the content of an image. The core contribution of the paper is a co-attention model that attends jointly to the pertinent regions of the image and the relevant words of the question, so that the model learns both "where to look" and "what words to listen to".

Contributions

The main contributions of this paper include:

Co-Attention Mechanism: A co-attention mechanism jointly generates image and question attention maps. Previous models concentrated primarily on visual attention; this one adds question attention, so the image guides which words are weighted and the question guides which regions are weighted.
Hierarchical Representation of Questions: The authors propose a hierarchical architecture that represents the question at three levels: word, phrase, and question. Co-attention maps are generated at each of these levels.
Novel Convolution-Pooling Strategy: At the phrase level, the paper applies 1-dimensional convolutions over unigrams, bigrams, and trigrams and then max-pools across the three n-gram sizes at each word position, so the phrase length that responds most strongly is selected per position rather than fixed in advance.
Benchmark Performance: The model was evaluated on two datasets, VQA and COCO-QA. It improves the state of the art from 60.3% to 60.5% on VQA and from 61.6% to 63.3% on COCO-QA; with ResNet image features, the results rise to 62.1% on VQA and 65.4% on COCO-QA.

Methodology

Hierarchical Question Representation

The paper proposes a hierarchical question encoding mechanism. The question is processed at three levels:

Word Level: Individual words are embedded into a vector space.
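Phrase Level: 1-dimensional convolutions with window sizes of one, two, and three words (unigrams, bigrams, and trigrams) are applied to the word embeddings, and max-pooling across the three window sizes at each word position yields the phrase-level features.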
Question Level: The phrase-level embeddings are encoded using an LSTM to derive the question-level embedding.
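
The phrase-level step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `ngram_conv` and `phrase_features`, the tanh nonlinearity, the right-side zero padding, and the random weights are all choices made for this sketch. The question-level LSTM is omitted.

```python
import math
import random

def ngram_conv(words, weights, size):
    """Apply one filter bank over windows of `size` consecutive word vectors.

    Zero vectors are appended on the right (an assumption of this sketch)
    so the output has one feature vector per input word.
    """
    dim = len(words[0])
    padded = words + [[0.0] * dim for _ in range(size - 1)]
    out = []
    for t in range(len(words)):
        # Concatenate the window into a single vector of length size * dim.
        window = [x for vec in padded[t:t + size] for x in vec]
        out.append([math.tanh(sum(w * x for w, x in zip(row, window)))
                    for row in weights])
    return out

def phrase_features(words, filters):
    """Max-pool across n-gram sizes at every word position.

    `filters` maps a window size to a weight matrix with `dim` rows and
    `size * dim` columns.
    """
    per_size = [ngram_conv(words, filters[s], s) for s in sorted(filters)]
    return [[max(vals) for vals in zip(*(feat[t] for feat in per_size))]
            for t in range(len(words))]

# Toy usage: 4 words with 3-dimensional embeddings and random placeholder filters.
rng = random.Random(0)
dim = 3
words = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
filters = {s: [[rng.uniform(-1, 1) for _ in range(s * dim)] for _ in range(dim)]
           for s in (1, 2, 3)}
phrases = phrase_features(words, filters)
print(len(phrases), len(phrases[0]))
```

The output keeps the question length, so each word position ends up with one phrase vector that can be fed to the question-level encoder.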

Co-Attention Mechanism

Two co-attention strategies were proposed:

Parallel Co-Attention: This technique produces attention maps for the image and question simultaneously by computing an affinity matrix representing similarity scores between the image and question features.
Alternating Co-Attention: This method alternates between question attention and image attention in three steps. First, the question is summarized into a single vector. Second, the image is attended to using that question summary. Third, the question is attended to again using the attended image feature.
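
Both strategies can be sketched in plain Python. The block below is a minimal illustration, assuming a bilinear affinity followed by tanh for the parallel variant and a shared "guided attention" step for the alternating variant. All names (`parallel_coattention`, `guided_attention`, `W_b`, `W_v`, `W_q`, and so on), the row-major shapes, and the random weights are assumptions of this sketch and are not taken from the authors' code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def tanh_sum(a, b):
    """Elementwise tanh(a + b) for two equally shaped matrices."""
    return [[math.tanh(x + y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, hidden, w_h):
    """Score each hidden row with w_h, softmax, and average `features`."""
    weights = softmax([sum(h * w for h, w in zip(row, w_h)) for row in hidden])
    summary = [sum(a * f[j] for a, f in zip(weights, features))
               for j in range(len(features[0]))]
    return weights, summary

def parallel_coattention(V, Q, W_b, W_v, W_q, w_hv, w_hq):
    """V holds N image regions (N x d); Q holds T question positions (T x d)."""
    # Affinity between every question position and every image region (T x N).
    C = [[math.tanh(x) for x in row]
         for row in matmul(matmul(Q, W_b), transpose(V))]
    VW, QW = matmul(V, W_v), matmul(Q, W_q)        # N x k and T x k
    H_v = tanh_sum(VW, matmul(transpose(C), QW))   # image hidden states, N x k
    H_q = tanh_sum(QW, matmul(C, VW))              # question hidden states, T x k
    a_v, v_hat = attend(V, H_v, w_hv)
    a_q, q_hat = attend(Q, H_q, w_hq)
    return a_v, v_hat, a_q, q_hat

def guided_attention(X, g, W_x, W_g, w_h):
    """Attend over the rows of X given guidance g (all zeros = no guidance)."""
    gW = matmul([g], W_g)[0]
    hidden = tanh_sum(matmul(X, W_x), [gW] * len(X))
    return attend(X, hidden, w_h)

def alternating_coattention(V, Q, params):
    d = len(Q[0])
    _, s_hat = guided_attention(Q, [0.0] * d, *params["summarize"])  # step 1
    a_v, v_hat = guided_attention(V, s_hat, *params["image"])        # step 2
    a_q, q_hat = guided_attention(Q, v_hat, *params["question"])     # step 3
    return a_v, v_hat, a_q, q_hat

# Toy usage with random placeholder weights.
rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

d, k, N, T = 4, 3, 5, 6
V, Q = rand(N, d), rand(T, d)
par = parallel_coattention(V, Q, rand(d, d), rand(d, k), rand(d, k),
                           rand(1, k)[0], rand(1, k)[0])
params = {step: (rand(d, k), rand(d, k), rand(1, k)[0])
          for step in ("summarize", "image", "question")}
alt = alternating_coattention(V, Q, params)
```

In both cases the result is a pair of attention distributions (one over image regions, one over question positions) together with the corresponding attended feature vectors. The parallel variant computes them in one pass from the affinity matrix, while the alternating variant computes them sequentially, each step guided by the previous one.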

Both strategies are applied at each of the three levels (word, phrase, and question), and the attended image and question features from all levels are then combined, from the word level upward, to predict the answer.
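
One way to write the level-by-level combination is the recursive encoding below. It is a sketch of the idea rather than a verified transcription of the paper's equations: \hat{q}^r and \hat{v}^r denote the attended question and image features at level r (w for word, p for phrase, s for question), square brackets denote concatenation, and the weight matrices W_w, W_p, W_s, and W_h are symbols assumed here for illustration.

```latex
\begin{align}
h^{w} &= \tanh\!\big(W_w(\hat{q}^{w} + \hat{v}^{w})\big) \\
h^{p} &= \tanh\!\big(W_p\,[(\hat{q}^{p} + \hat{v}^{p}),\, h^{w}]\big) \\
h^{s} &= \tanh\!\big(W_s\,[(\hat{q}^{s} + \hat{v}^{s}),\, h^{p}]\big) \\
p &= \operatorname{softmax}(W_h h^{s})
\end{align}
```

Here p is the predicted distribution over candidate answers, and each level's hidden state carries the information from the level below it.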

Numerical Results and Analysis

Improvements in Benchmark Performance

The model demonstrated superior performance on the VQA and COCO-QA datasets. Specific improvements noted include:

On the VQA dataset, open-ended accuracy improved from 60.3% to 62.1% and multiple-choice accuracy from 64.2% to 66.1%. The 62.1% figure uses ResNet image features; the abstract reports 60.5% without them.
On the COCO-QA dataset, accuracy rose from 61.6% to 63.3%, and to 65.4% with ResNet features.

Ablation Studies

The authors conducted ablation studies to measure the contribution of each component of the model. Question-level attention contributed the most to performance, followed by phrase-level and then word-level attention.

Implications and Future Directions

Co-attention strategies that process visual and textual information simultaneously or in alternation open new avenues for developing multi-modal AI systems. Attending to the question as well as the image may also help a model cope with linguistic variation and complex visual scenes.

Furthermore, the hierarchical processing of the question ensures that varying granularities of textual information are captured, thus enhancing the overall comprehension of the question by the model.

Conclusion

This paper contributes a hierarchical co-attention model to the VQA field and backs it with accuracy gains on two benchmarks, laying groundwork for future research on multi-modal deep learning models. Models that integrate visual and textual data in this way are relevant to applications such as automated captioning, interactive AI, and real-time image-based querying systems.
