Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

Published 8 Oct 2018 in cs.CV | (1810.03649v2)

Abstract: Modern Visual Question Answering (VQA) models have been shown to rely heavily on superficial correlations between question and answer words learned during training such as overwhelmingly reporting the type of room as kitchen or the sport being played as tennis, irrespective of the image. Most alarmingly, this shortcoming is often not well reflected during evaluation because the same strong priors exist in test distributions; however, a VQA system that fails to ground questions in image content would likely perform poorly in real-world settings. In this work, we present a novel regularization scheme for VQA that reduces this effect. We introduce a question-only model that takes as input the question encoding from the VQA model and must leverage language biases in order to succeed. We then pose training as an adversarial game between the VQA model and this question-only adversary -- discouraging the VQA model from capturing language biases in its question encoding. Further, we leverage this question-only model to estimate the increase in model confidence after considering the image, which we maximize explicitly to encourage visual grounding. Our approach is a model agnostic training procedure and simple to implement. We show empirically that it can improve performance significantly on a bias-sensitive split of the VQA dataset for multiple base models -- achieving state-of-the-art on this task. Further, on standard VQA tasks, our approach shows significantly less drop in accuracy compared to existing bias-reducing VQA models.


Authors (3)

Citations (225)


Summary

The paper introduces an adversarial training framework that forces VQA models to reduce reliance on language priors by using a question-only adversary.
It employs Difference of Entropies regularization to enhance visual grounding by maximizing the information gain from image inputs.
Empirical tests on the VQA-CP dataset show significant performance improvements compared to previous bias mitigation techniques.

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization

This paper addresses a significant challenge in Visual Question Answering (VQA): the tendency of models to leverage superficial language biases rather than genuinely grounding their answers in the visual content of images. The authors propose a novel adversarial regularization scheme designed to mitigate this reliance on language priors, thereby enhancing the visual grounding of VQA models.

Problem Context

VQA sits at the intersection of computer vision and natural language processing, aiming to answer questions based on image content. Despite advancements, many VQA systems tend to exploit repetitive question-answer pairs in datasets instead of relying on actual image analysis. For example, such systems might habitually associate the question type "What sport ...?" with the answer "tennis," regardless of the depicted content. This reliance on dataset biases can lead to poor performance in real-world scenarios or novel instances where these biases don't hold. The VQA-CP dataset, which deliberately varies answer distributions between training and test splits, highlights these deficiencies.
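The effect of a shifted answer distribution can be illustrated with a small sketch. The code below fits an image-blind baseline that always returns the most frequent training answer for a question type, then scores it on two test sets. All data here is made up for illustration and is not drawn from VQA or VQA-CP; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_question_prior(train_pairs):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question_type, answer in train_pairs:
        counts[question_type][answer] += 1
    return {qt: c.most_common(1)[0][0] for qt, c in counts.items()}


def accuracy(prior, pairs):
    """Fraction of pairs answered correctly without looking at any image."""
    hits = sum(prior.get(question_type) == answer for question_type, answer in pairs)
    return hits / len(pairs)


# Toy, made-up (question type, answer) pairs.
train = [("what sport", "tennis")] * 8 + [("what sport", "skiing")] * 2
# Test set that shares the training prior.
test_same_prior = [("what sport", "tennis")] * 8 + [("what sport", "skiing")] * 2
# Test set whose answer distribution differs from training.
test_shifted_prior = [("what sport", "tennis")] * 2 + [("what sport", "skiing")] * 8

prior = fit_question_prior(train)
acc_same = accuracy(prior, test_same_prior)
acc_shifted = accuracy(prior, test_shifted_prior)
```

On this toy data the baseline looks strong when the test set shares the training prior and weak when it does not. That gap between the two scores is what a bias-sensitive split is meant to expose.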

Methodology

The authors introduce an adversarial training framework to counteract unwanted language biases. The proposed strategy involves two core components:

Question-Only Adversary: They introduce a model that predicts answers purely from question encodings, without considering the image. This adversary is set to compete against the base VQA model during training. The aim is for the VQA model to adjust its question encoding to minimize the performance of the adversary, thus reducing bias learned from the dataset.
Difference of Entropies (DoE) Regularization: Beyond curtailing bias, the method encourages visual grounding by maximizing the model's gain in confidence from seeing the image. Using the question-only adversary as an estimate of the image-blind prediction, it maximizes the difference between the entropy of that prediction and the entropy of the full VQA model's prediction, so the model is pushed to become more certain once it has processed the visual content.
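One way to write the two components down is sketched below. The notation is an assumption made for this summary rather than a quotation of the paper's equations: $g$ is the question encoder, $f$ the VQA model, $f_Q$ the question-only adversary, $H$ the Shannon entropy of a predicted answer distribution, and $\lambda_Q, \lambda_H$ are weighting hyperparameters.

```latex
% Answer losses for the full model and for the question-only adversary
L_{VQA} = -\log f(I, Q)[a], \qquad L_Q = -\log f_Q\big(g(Q)\big)[a]

% Adversarial game: the adversary minimizes L_Q, while the VQA model
% shapes its question encoding so that L_Q stays large
\min_{\theta_{VQA}} \; \max_{\theta_Q} \;\; L_{VQA} - \lambda_Q \, L_Q

% Difference of entropies, maximized with respect to the VQA model
L_H = H\big(f_Q(g(Q))\big) - H\big(f(I, Q)\big)
```

Under this reading, the VQA model's own training signal is $L_{VQA} - \lambda_Q L_Q - \lambda_H L_H$: it is rewarded for answering correctly, for making the question encoding uninformative to the adversary, and for becoming more confident once the image is taken into account.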

These strategies are model-agnostic and introduce minimal complexity, making them applicable to a range of existing VQA architectures.
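As a rough numerical illustration, the sketch below computes the entropy difference and a combined loss from the VQA model's point of view for a single example. It assumes both models output a probability distribution over the same answer set. The function names, the weights `lambda_q` and `lambda_h`, and the example probabilities are hypothetical and not taken from the paper. The sketch only evaluates the loss values; in an autodiff framework the adversarial term would additionally need gradient reversal or alternating updates, which is not shown here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def difference_of_entropies(question_only_probs, vqa_probs):
    """Entropy of the image-blind prediction minus entropy of the full prediction."""
    return entropy(question_only_probs) - entropy(vqa_probs)


def regularized_vqa_loss(vqa_probs, question_only_probs, target, lambda_q, lambda_h):
    """Loss seen by the VQA model (assumed form): answer loss, minus the
    weighted adversary loss, minus the weighted entropy difference."""
    l_vqa = -math.log(vqa_probs[target])          # cross-entropy of the full model
    l_q = -math.log(question_only_probs[target])  # cross-entropy of the adversary
    l_h = difference_of_entropies(question_only_probs, vqa_probs)
    return l_vqa - lambda_q * l_q - lambda_h * l_h


# Hypothetical predictions over four candidate answers.
question_only = [0.25, 0.25, 0.25, 0.25]  # adversary cannot tell the answers apart
with_image = [0.97, 0.01, 0.01, 0.01]     # full model is confident after seeing the image

doe = difference_of_entropies(question_only, with_image)
loss = regularized_vqa_loss(with_image, question_only, target=0,
                            lambda_q=0.1, lambda_h=0.1)
```

Here `doe` is positive because the full model is far more certain than the question-only prediction, which is the behavior the regularizer rewards. If the two distributions were identical, the difference would be zero and the image would have contributed no measurable confidence.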

Results and Analysis

Empirical evaluation on the bias-sensitive VQA-CP dataset demonstrated substantial improvements for various base models, including SAN and UpDn. The proposed adversarial regularization consistently outperformed existing bias mitigation techniques, achieving state-of-the-art results on VQA-CP. Specifically, combining both the question-only adversary and DoE regularization yielded significant cumulative benefits, markedly improving performance compared to either component used in isolation.

Interestingly, when evaluated on the more biased VQA v1 dataset, the proposed regularization led to a drop in accuracy, albeit a smaller one than with some existing bias-reducing methods. This suggests a trade-off: suppressing language priors improves robustness when answer distributions shift, but it costs some accuracy on benchmarks where those priors match the test distribution.

Implications and Future Directions

The paper's contributions offer a robust approach to enhancing the interpretability and reliability of VQA systems by ensuring that model predictions are better grounded in visual evidence. This approach paves the way for developing VQA systems more capable of generalizing beyond the constraints of their training datasets.

Further work could refine these strategies to address potential over-regularization, where useful language information is suppressed along with the bias. Extending the methodology to other multi-modal tasks prone to dataset biases could also be valuable. As AI is integrated into real-world applications that require nuanced understanding across diverse contexts, such bias-aware training strategies will become increasingly important.
