Counterfactual Visual Explanations (1904.07451v2)

Published 16 Apr 2019 in cs.LG, cs.AI, cs.CV, and stat.ML

Abstract: In this work, we develop a technique to produce counterfactual visual explanations. Given a 'query' image $I$ for which a vision system predicts class $c$, a counterfactual visual explanation identifies how $I$ could change such that the system would output a different specified class $c'$. To do this, we select a 'distractor' image $I'$ that the system predicts as class $c'$ and identify spatial regions in $I$ and $I'$ such that replacing the identified region in $I$ with the identified region in $I'$ would push the system towards classifying $I$ as $c'$. We apply our approach to multiple image classification datasets generating qualitative results showcasing the interpretability and discriminativeness of our counterfactual explanations. To explore the effectiveness of our explanations in teaching humans, we present machine teaching experiments for the task of fine-grained bird classification. We find that users trained to distinguish bird species fare better when given access to counterfactual explanations in addition to training examples.

Authors (6)
  1. Yash Goyal (14 papers)
  2. Ziyan Wu (59 papers)
  3. Jan Ernst (8 papers)
  4. Dhruv Batra (160 papers)
  5. Devi Parikh (129 papers)
  6. Stefan Lee (62 papers)
Citations (479)

Summary

  • The paper presents a novel approach for generating counterfactual visual explanations by identifying minimal edits in image regions to change model classifications.
  • It employs spatial region identification and edit transformation techniques using CNN feature maps to determine influential regions from distractor images.
  • Experimental results on MNIST, Omniglot, and CUB Birds demonstrate the method’s effectiveness in elucidating discriminative features and enhancing machine teaching.

Counterfactual Visual Explanations: An Overview

The paper "Counterfactual Visual Explanations" presents a novel approach to improving interpretability in computer vision models by generating counterfactual visual explanations. The authors propose a technique that identifies modifications to specific regions of an input image, necessary for altering the output classification from one class to another, termed as the distractor class. These insights are particularly aimed at addressing questions like "What changes to an input image would cause the model to classify it differently?"

Methodology

The methodology begins with the identification of a 'query' image, which the model classifies as belonging to a certain class, and a 'distractor' image, classified by the same model as belonging to a different target class. The approach entails identifying spatial regions within these images such that substituting regions from the distractor image into the query image would alter the model's classification of the query.
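
Concretely, the replacement can be posed in the feature space of the classifier. Writing $f$ for the spatial feature extractor (the convolutional layers, producing an $hw \times d$ map of spatial cells), $g$ for the decision layers above it, $a$ for a binary gate over cells, and $P$ for a permutation aligning distractor cells to query cells, the minimal-edit objective can be sketched as follows; this notation is a reconstruction from the summary above rather than a verbatim quotation of the paper.

$$
f^{*}(I) \;=\; (\mathbf{1} - a) \circ f(I) \;+\; a \circ \big(P\, f(I')\big),
\qquad
\min_{P,\,a}\ \|a\|_{1}
\quad \text{s.t.} \quad
c' = \arg\max\, g\big(f^{*}(I)\big)
$$

Minimizing $\|a\|_1$ keeps the number of replaced cells, and hence the highlighted image regions, as small as possible while still flipping the prediction to $c'$.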

Key Steps:

  1. Spatial Region Identification: The technique computes regions in feature space, derived from the convolutional layers of a CNN, that maximally influence the classification outcome.
  2. Edit Transformation: The identified regions in the query image are replaced with those from the distractor image, forming a counterfactual image. The choice of regions is optimized to require as little modification as possible while still changing the classification (a greedy search sketch follows this list).
  3. Quantitative and Qualitative Analysis: The approach is applied to datasets such as MNIST, Omniglot, and Caltech-UCSD Birds, showcasing its ability to highlight discriminative features and improve user interpretability of model decisions.
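
The edit search can be implemented directly over the flattened spatial feature maps. The sketch below is a minimal, illustrative version of a greedy exhaustive search for minimal edits, consistent with the steps above; the function and parameter names (`greedy_counterfactual_edits`, `decision_fn`) and the loop structure are assumptions of this sketch, not the authors' released code.

```python
import numpy as np

def greedy_counterfactual_edits(f_query, f_distractor, decision_fn, target_class,
                                max_edits=10):
    """Greedily replace spatial feature cells of the query with cells from the
    distractor until the classifier's prediction flips to `target_class`.

    f_query, f_distractor: arrays of shape (num_cells, channels) holding the
        flattened spatial feature maps of the query and distractor images.
    decision_fn: maps a (num_cells, channels) feature map to class probabilities
        (the layers of the CNN above the spatial features).
    Returns the list of (query_cell, distractor_cell) edits that were applied.
    """
    features = f_query.copy()
    edits = []
    for _ in range(max_edits):
        if decision_fn(features).argmax() == target_class:
            break  # prediction has already flipped to the distractor class
        best_score, best_edit = -np.inf, None
        # Exhaustively score every single-cell replacement:
        # query cell i receives the features of distractor cell j.
        for i in range(features.shape[0]):
            for j in range(f_distractor.shape[0]):
                candidate = features.copy()
                candidate[i] = f_distractor[j]
                score = decision_fn(candidate)[target_class]
                if score > best_score:
                    best_score, best_edit = score, (i, j)
        i, j = best_edit
        features[i] = f_distractor[j]  # commit the most helpful edit
        edits.append(best_edit)
    return edits
```

Each accepted edit corresponds to one spatial region swap; mapping the chosen cell indices back to image patches through the network's receptive field yields the highlighted regions shown to users.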

Experimental Results

The approach's efficacy is demonstrated through:

  • MNIST: The method accurately identifies the stroke changes needed to transform one digit into another, averaging 2.67 edits for an effective class change.
  • Omniglot: The method required 1.46 spatial edits on average, demonstrating its efficiency on complex character shapes across diverse alphabets.
  • CUB Birds: The method highlights distinguishing features such as plumage and wing patterns, needing about 5.3 spatial edits on average for attribute-neighbor classes.

Machine Teaching Application

One significant application presented is machine teaching, where counterfactual explanations enhance human learning. In a user study on fine-grained bird species classification, access to counterfactual explanations improved the accuracy of human learners compared to both a no-explanation baseline and simpler explanatory baselines.

Implications and Future Directions

The paper's contributions lie in defining a framework for counterfactual reasoning in vision systems, enriching their interpretability and facilitating human learning. These insights are valuable in safety-critical applications like automated vehicles and medical diagnostics, where understanding 'why' a model made a decision is as critical as the decision itself.

Future research could extend this work to more complex models and datasets, reduce the computational cost of the edit search, and explore the impact of counterfactual explanations on trust and usability in human-AI interaction. Integrating the approach with real-time systems to provide immediate, intuitive feedback is another promising direction for interactive AI systems.