Identifying Spurious Correlations using Counterfactual Alignment

(2312.02186)
Published Dec 1, 2023 in cs.CV , cs.AI , and cs.LG

Abstract

Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual (CF) alignment method to detect and explore spurious correlations of black box classifiers. Counterfactual images generated with respect to one classifier can be input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists as well as compute aggregate statistics over a dataset. Our work demonstrates the ability to detect spurious correlations in face attribute classifiers. This is validated by observing intuitive trends in a face attribute classifier as well as fabricating spurious correlations and detecting their presence, both visually and quantitatively. Further, utilizing the CF alignment method, we demonstrate that we can rectify spurious correlations identified in classifiers.

Overview

  • The paper discusses a critical issue in AI where models rely on spurious correlations in image classification.

  • Counterfactual alignment is introduced as a method to create and analyze counterfactual images to understand these correlations.

  • The technique detects both intuitive and non-intuitive spurious correlations and supports computing aggregate statistics over a dataset.

  • The method also allows for the rectification of biases by adjusting classifier parameters.

  • While promising, the paper acknowledges limitations related to autoencoder capacity and generalizability beyond face attribute classification.

In the realm of AI and machine learning, particularly in image classification, a critical issue arises when models rely on what are called spurious correlations: relationships in the data that exist due to coincidence or context but are not actually relevant to the task at hand. Such correlations can lead to questionable model decisions, where the logic the AI applies does not reflect the reasoning we would want it to use.

To address this issue, a method called counterfactual alignment has been introduced. This technique creates counterfactual images: versions of an input image that have been synthetically altered to change the classifier's prediction, while keeping all other aspects as unchanged as possible. By generating these images with respect to one classifier and testing them on other classifiers, researchers can gain insight into whether the classifiers are basing their decisions on similar features of the input images. If the alterations made in the counterfactual images also lead to changes in the predictions of other classifiers, it suggests shared feature usage: features that all the classifiers consider when making a decision.
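The cross-classifier check can be sketched as follows. This is an illustrative toy, not the paper's implementation: the linear "classifiers" and feature indices are hypothetical stand-ins for real models, chosen so the arithmetic is easy to follow. Each classifier scores both the original image and the counterfactual; a large induced change in a classifier other than the one the counterfactual was generated for would suggest shared feature usage.

```python
import numpy as np

def prediction_deltas(x, x_cf, classifiers):
    """Score the original input x and its counterfactual x_cf with every
    classifier and return each classifier's change in output.

    classifiers: dict mapping a name to a callable input -> scalar score.
    The counterfactual is assumed to have been generated w.r.t. one of them.
    """
    return {name: float(clf(x_cf) - clf(x)) for name, clf in classifiers.items()}

# Toy linear "classifiers" over a 2-feature input: feature 0 stands in for
# the target attribute, feature 1 for an unrelated attribute.
target_clf = lambda x: x @ np.array([1.0, 0.0])
other_clf = lambda x: x @ np.array([0.0, 1.0])

x = np.array([0.2, 0.5])
x_cf = np.array([0.9, 0.5])  # counterfactual: only the target feature moved

deltas = prediction_deltas(x, x_cf, {"target": target_clf, "other": other_clf})
# "other" barely moves here, so this counterfactual gives no evidence that
# the two classifiers share features.
```

In practice the callables would be trained networks and x_cf would come from a generative model, but the comparison of induced prediction changes is the same.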

The counterfactual alignment method not only spots specific instances of spurious correlations but also computes aggregate statistics over an entire dataset. This is particularly insightful in scenarios where the data involves complex features, such as in face attribute classification. In this context, researchers have demonstrated that counterfactual alignment can detect intuitive and non-intuitive relationships. For instance, one might intuitively expect heavy makeup to be correlated with the attractiveness attribute, but not necessarily with features like lip size if that was not explicitly part of the attractiveness definition.
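One way to aggregate this signal over a dataset is to correlate the base classifier's prediction changes with another classifier's changes across many counterfactuals; a strong correlation would suggest the two classifiers rely on overlapping features. This is a sketch of the idea, not necessarily the paper's exact statistic, and the delta values below are fabricated for illustration.

```python
import numpy as np

def alignment_correlation(base_deltas, other_deltas):
    """Pearson correlation between the prediction changes of the classifier
    the counterfactuals were generated for (base) and another classifier,
    measured over a dataset of (original, counterfactual) pairs."""
    return float(np.corrcoef(base_deltas, other_deltas)[0, 1])

# Hypothetical prediction changes over five counterfactuals: the "other"
# classifier moves almost in lockstep with the base one, hinting that it
# responds to the same underlying image features.
base = np.array([-0.8, -0.5, -0.9, -0.4, -0.7])
other = np.array([-0.7, -0.4, -0.8, -0.3, -0.6])
rho = alignment_correlation(base, other)  # close to 1.0
```

A correlation near zero, by contrast, would indicate that the counterfactual edits aimed at the base classifier leave the other classifier untouched.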

To further validate the efficacy of this method, researchers have successfully fabricated classifiers with specific spurious correlations and then used counterfactual alignment to detect these artificial biases. This verification step is critical because it shows that the method isn’t just sensitive to existing patterns in data but can also identify newly introduced ones.

An interesting extension of this method involves using it to rectify the biases that it discovers. By adjusting classifier parameters based on the insights gained from counterfactual alignment, it's possible to reduce the influence of spurious correlations on a classifier's output. Induced biases can be corrected, for instance, by composing classifiers with weights that counteract the influence of irrelevant attributes.

One example detailed in the work is the adjustment of a classifier trained to identify "heavy makeup" that inadvertently uses "lip size" as a predictive feature. By composing this classifier with another that positively identifies "big lips," the researchers were able to mitigate the unwanted correlation.
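In spirit, the fix composes classifier outputs so that a weighted counteracting term cancels the spurious contribution. The sketch below uses hypothetical linear stand-ins (the weight, feature indices, and classifier names are illustrative, not the paper's trained models):

```python
import numpy as np

def compose(primary, corrector, weight):
    """Return a classifier whose score is the primary score plus a weighted
    corrector score, used here to counteract a spurious feature."""
    return lambda x: primary(x) + weight * corrector(x)

# Toy stand-ins: the "heavy makeup" classifier leaks 0.5 weight onto a
# lip-size feature (index 1); the "big lips" classifier responds only to
# that feature.
makeup_clf = lambda x: x @ np.array([1.0, 0.5])
big_lips_clf = lambda x: x @ np.array([0.0, 1.0])

# Composing with the big-lips classifier at weight -0.5 cancels the leak.
rectified = compose(makeup_clf, big_lips_clf, weight=-0.5)

x = np.array([0.3, 0.9])
# makeup_clf(x) depends on lip size; rectified(x) no longer does.
```

The sign and magnitude of the weight would in practice be chosen from the alignment measurements themselves, i.e., large enough to neutralize the measured spurious response.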

While the technique offers a promising direction for understanding and correcting biases in classifiers, it also has limitations. The method's effectiveness can be constrained by the capacity of the autoencoder used to generate counterfactual images, and there may be inherent biases in the counterfactual generation process itself. Additionally, the study's focus on face attribute classification enables strong visual verification but also means its generalizability to other domains remains to be demonstrated.

In conclusion, counterfactual alignment offers a novel window into the inner workings of classifiers, allowing for a detailed examination of correlations and the establishment of more robust, fair, and explainable AI systems. It represents a step towards ensuring that machine learning models are "right for the right reasons," aligning model predictions with the human rationale behind them. The source code and model weights for these experiments have been made publicly available, inviting further exploration and adaptation of the counterfactual alignment method within the broader AI community.
