Abstract

Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as of intermediate tokens, to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement tool for enhancing the transparency of ViTs. We evaluate LeGrad in challenging segmentation, perturbation, and open-vocabulary settings, showcasing its versatility compared to other SotA explainability methods and demonstrating its superior spatial fidelity and robustness to perturbations. A demo and the code are available at https://github.com/WalBouss/LeGrad.

Figure: LeGrad generates heatmaps showing the image areas most sensitive to prompts for various vision-language models.

Overview

  • LeGrad enhances the interpretability of Vision Transformers by focusing on the sensitivity of feature formation and utilizing a layer-wise gradient approach.

  • It offers simplicity, versatility, and superior spatial fidelity in explaining model predictions, outperforming other state-of-the-art explainability methods.

  • LeGrad combines gradient computation, aggregation of layer-wise signals, and normalization to produce 2D explainability maps highlighting influential image regions.

  • The method advances transparency, trustworthiness, and debugging capabilities of ViTs, setting a foundation for future research in AI interpretability.

Exploring Vision Transformers' Interpretability with LeGrad

Introduction to LeGrad

In the dynamic landscape of computer vision research, the interpretability of Vision Transformers (ViTs) presents a challenge that is as complex as it is crucial. Traditional explainability methods tailored to convolutional architectures, such as GradCAM and LRP, encounter limitations when applied to the distinct architecture of ViTs. To bridge this gap, researchers have introduced LeGrad, an innovative method designed to enhance the transparency of Vision Transformers by focusing on the sensitivity of feature formation within these models.

The Essence of LeGrad

LeGrad stands out by taking a layer-wise approach to explainability, computing the gradient of a target activation with respect to the attention maps at every layer of a ViT. Unlike existing methods that weight attention maps by their gradients or use attention maps directly, LeGrad treats the gradient itself as the explainability signal and aggregates it across layers. This methodology offers several advantages:

  • Simplicity and Versatility: LeGrad's reliance on gradients makes it conceptually straightforward and adaptable to various ViTs, regardless of their size or the feature-aggregation mechanism they employ.
  • Robust Spatial Fidelity: Through extensive evaluations, including segmentation, perturbation tests, and open-vocabulary scenarios, LeGrad has demonstrated superior spatial fidelity in highlighting the image regions relevant to model predictions. Its performance significantly outpaces that of other state-of-the-art (SotA) explainability methods, particularly in large-scale, open-vocabulary settings (a sketch of defining such an open-vocabulary target activation follows this list).
  • Scalability to Large Models: Its layer-wise, gradient-based approach scales effectively to architectures with billions of parameters, such as ViT-bigG/14, without compromising computational efficiency or the quality of the generated explanations.
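
To make the open-vocabulary setting concrete, the snippet below sketches how a scalar target activation (the score whose gradient is taken) can be obtained from a CLIP-style vision-language model. It is an illustrative sketch, not code from the LeGrad paper: the open_clip library, the "ViT-B-16"/"openai" checkpoint, and the example file name are assumptions, and any model exposing image and text encoders could be substituted.

```python
import torch
from PIL import Image
import open_clip  # assumption: any CLIP-style model with image/text encoders would do

# Load a CLIP-style ViT and its preprocessing (checkpoint choice is illustrative).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # [1, 3, 224, 224]
text = tokenizer(["a photo of a cat"])                      # open-vocabulary prompt

# Run both encoders without torch.no_grad() so the graph is kept for later gradients.
image_features = model.encode_image(image)
text_features = model.encode_text(text)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Scalar target activation: similarity between the image and the prompt.
score = (image_features * text_features).sum()
```

For a closed-set classifier, the target activation would simply be the logit of the class of interest; capturing the per-layer attention maps needed for the gradient computation (e.g., with forward hooks) is left to the official implementation linked above.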

Methodology

At its core, LeGrad operates by computing the gradient of a target activation with respect to the attention maps of each ViT layer. It then aggregates these layer-specific signals into a unified explainability map. This process involves several key steps (a minimal code sketch follows the list):

  1. Gradient Computation: For each layer, the gradient of the target activation with respect to that layer's attention map is computed, capturing the layer's contribution to the final prediction.
  2. Aggregation of Layer-wise Signals: The explainability signals from all layers are pooled together, so that the final map reflects the model's decision-making process across its full depth.
  3. Normalization and Visualization: The aggregated signal is normalized and reshaped into a 2D explainability map that visually represents the regions of an image most influential to the model's predictions.
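
As a reading aid, here is a minimal PyTorch sketch of these three steps. It is not the official LeGrad implementation (see the repository linked in the abstract for that): it assumes the per-layer attention maps have already been captured during the forward pass (e.g., via forward hooks) and remain attached to the autograd graph, that the class token sits at index 0, and that `score` is a scalar target activation such as the prompt similarity sketched earlier; the helper name `legrad_style_map` and the simple mean over heads and layers are illustrative choices.

```python
import torch
import torch.nn.functional as F

def legrad_style_map(score, attn_maps, patches_per_side, patch_size=16):
    """Layer-wise gradient explainability sketch (illustrative, not the official code).

    score:            scalar target activation (e.g. image-text similarity or a class logit).
    attn_maps:        per-layer attention tensors, each [batch, heads, tokens, tokens],
                      captured during the forward pass and still part of the graph.
    patches_per_side: patches along one image side (e.g. 14 for a 224px ViT-B/16).
    """
    # 1. Gradient computation: gradient of the target score w.r.t. each layer's attention map.
    grads = torch.autograd.grad(score, attn_maps, retain_graph=True)

    per_layer = []
    for g in grads:
        g = g.clamp(min=0)             # keep only positive influence
        g = g.mean(dim=1)              # average over heads -> [batch, tokens, tokens]
        per_layer.append(g[:, 0, 1:])  # class-token row: sensitivity to each patch token

    # 2. Aggregation of layer-wise signals (a simple mean over layers here).
    expl = torch.stack(per_layer, dim=0).mean(dim=0)    # [batch, num_patches]

    # 3. Normalization and reshaping into a 2D explainability map.
    mn = expl.min(dim=-1, keepdim=True).values
    mx = expl.max(dim=-1, keepdim=True).values
    expl = (expl - mn) / (mx - mn + 1e-8)
    expl = expl.reshape(-1, patches_per_side, patches_per_side)

    # Upsample to the input resolution for visualization.
    heatmap = F.interpolate(expl.unsqueeze(1), scale_factor=patch_size, mode="bilinear")
    return heatmap.squeeze(1)          # [batch, H, W]
```

Given the `score` from the earlier snippet and a list of hooked attention tensors, `legrad_style_map(score, attn_maps, patches_per_side=14)` returns one heatmap per image; the actual LeGrad formulation differs in its details (e.g., how intermediate tokens and layers are pooled), so the official repository remains the reference.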

Practical Implications and Future Directions

LeGrad's ability to provide clear and accurate visual explanations of ViTs' decision-making processes has practical implications for improving model transparency, trustworthiness, and debugging. By elucidating what the model focuses on when making predictions, LeGrad can help researchers and practitioners identify biases, artifacts, or spurious correlations that models might rely on.

Looking ahead, LeGrad opens avenues for further research into making increasingly complex models interpretable. Its methodological foundations encourage exploration into more nuanced aspects of explainability, such as dissecting the role of individual attention heads or delving deeper into the specific interactions between layers that contribute to feature formation. Moreover, adapting and extending LeGrad's principles to other architectures within the broad spectrum of transformer models could further democratize access to model interpretability across various domains in AI.

Conclusion

LeGrad represents a significant step forward in the interpretability of Vision Transformers, addressing the nuanced challenge of understanding these models' decision-making processes. Its methodological soundness, combined with robust empirical results, positions LeGrad as a valuable tool in the quest for transparent and explainable AI. By highlighting the importance of considering the gradient's influence across all layers of a ViT, LeGrad sets a new standard in the field, paving the way for future advancements in explainable AI.
