Attention Mechanisms Don't Learn Additive Models: Rethinking Feature Importance for Transformers (2405.13536v2)

Published 22 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods to explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of representing linear or additive surrogate models used for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations with both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models or provide explanations of comparable quality at a fraction of their computational costs. We release code for SLALOM as an open-source project online at https://github.com/tleemann/slalom_explanations.

Summary

The paper demonstrates that transformer attention mechanisms inherently cannot represent additive models, challenging current feature attribution practices.
It introduces SLALOM, a novel surrogate model that effectively captures token-level interactions and non-linearities in transformer architectures.
Empirical results on synthetic and real-world datasets validate SLALOM's ability to recover accurate parameter mappings and logit scores.

Insights on Feature Attribution for Transformers: Analyzing and Addressing Limitations

The paper "Attention Mechanisms Don’t Learn Additive Models: Rethinking Feature Importance for Transformers" by Leemann et al. examines how feature attribution methods apply to transformer architectures. Because transformers dominate natural language processing applications, interpretation methods need to match the structure of these models.

Key Findings and Contributions

The core finding of this paper is that transformers cannot align with the linear or additive surrogate models that traditionally underpin feature attribution methods. The authors prove formally, and support with experiments, that transformers cannot represent additive models, including generalized additive models (GAMs) and linear models, because of the structure introduced by the attention mechanism. This result casts doubt on the faithfulness of existing explanation practices in interpretability-centric domains, such as judicial or medical settings, where LLMs are increasingly deployed.

In response to these challenges, the authors propose the Softmax-Linked Additive Log-Odds Model (SLALOM), a novel surrogate model crafted to harmonize with transformer architectures. SLALOM stands distinct from conventional models by providing a two-dimensional feature representation: the token value indicating independent contribution and the token importance indicating interaction weight vis-a-vis other tokens. By accommodating non-linearities and interactions, SLALOM transcends the capabilities of existing approaches.
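The two-score idea can be illustrated with a minimal sketch. It assumes a softmax-weighted form in which each token's importance score sets its softmax weight and the output log-odds is the weighted sum of token values; the function name, dictionaries, and toy scores below are hypothetical and are not taken from the authors' released code.

```python
import math

def slalom_log_odds(tokens, value, importance):
    """Softmax-weighted sum of token values (assumed SLALOM-style form)."""
    scores = [importance[t] for t in tokens]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * value[t] for e, t in zip(exps, tokens))

# Hypothetical toy parameters: one value score and one importance score per token.
value = {"great": 2.0, "bad": -2.0, "the": 0.0}
importance = {"great": 1.0, "bad": 1.0, "the": -1.0}

one = slalom_log_odds(["great"], value, importance)
many = slalom_log_odds(["great"] * 5, value, importance)
mixed = slalom_log_odds(["the", "great"], value, importance)
print(one, many, mixed)
```

Under this assumed form, repeating "great" does not increase the output, whereas an additive model would add its contribution each time. A low-importance token such as "the" only slightly dilutes the result.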

Empirical validation across synthetic and real-world datasets highlights SLALOM's superior performance in delivering faithful explanations. The model demonstrates robustness across diverse tasks, underlining the necessity for tailored feature attributions rather than a monolithic approach.

Theoretical and Empirical Analysis

The paper shows why common transformer architectures cannot represent GAMs or linear models. The argument rests on the attention mechanism: softmax normalizes the attention weights over the entire token sequence, so each output is a weighted average of token representations rather than a sum, which prevents additive functions from being represented accurately.
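The effect of this normalization can be seen in a small sketch. It is a simplified illustration, not the paper's proof: it assumes a single attention head, ignores positional encodings, learned projections, and later layers, and uses a hypothetical two-dimensional embedding as query, key, and value.

```python
import math

def attention_output(query, keys, values):
    """Single-query dot-product attention with softmax-normalized weights."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # weights sum to one
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

token = [0.5, -1.0]  # hypothetical embedding of one repeated token
lengths = (1, 3, 10)

# Attention output for a sequence made of n copies of the same token.
outputs = {n: attention_output(token, [token] * n, [token] * n) for n in lengths}
# What an additive model would produce for the same sequences.
additive = {n: [n * x for x in token] for n in lengths}

for n in lengths:
    print(n, outputs[n], additive[n])
```

In this setting the attention output is the same for every sequence length, while the additive target grows linearly with the number of repetitions. That mismatch is the intuition behind the incompatibility.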

Supporting the theory, experiments show that common transformers, regardless of the number of layers, fail to capture the linear relationships in synthetic datasets constructed with linear log-odds. In contrast, fully connected models recover these relationships successfully.
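A dataset with linear log-odds could be generated along the following lines. This is a sketch under stated assumptions, not the authors' setup: the vocabulary, per-token weights, sequence length, and sampling scheme are all hypothetical. The log-odds of the positive class is the sum of per-token weights, and labels are drawn from the corresponding Bernoulli distribution.

```python
import math
import random

def linear_log_odds(tokens, weights, bias=0.0):
    """Additive ground truth: the log-odds is a sum of per-token weights."""
    return bias + sum(weights[t] for t in tokens)

def sample_dataset(vocab_weights, n_samples, seq_len, seed=0):
    """Draw random token sequences and labels from the linear log-odds model."""
    rng = random.Random(seed)
    vocab = sorted(vocab_weights)
    data = []
    for _ in range(n_samples):
        tokens = [rng.choice(vocab) for _ in range(seq_len)]
        p = 1.0 / (1.0 + math.exp(-linear_log_odds(tokens, vocab_weights)))
        label = 1 if rng.random() < p else 0
        data.append((tokens, label))
    return data

# Hypothetical vocabulary with one weight per token.
weights = {"good": 1.5, "bad": -1.5, "film": 0.0}
dataset = sample_dataset(weights, n_samples=5, seq_len=4)
for tokens, label in dataset:
    print(tokens, label, linear_log_odds(tokens, weights))
```

A classifier trained on such data can be checked by comparing its predicted log-odds against `linear_log_odds` on held-out sequences, including sequences that repeat the same token several times.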

The research also examines how well SLALOM can recover and represent transformer model outputs. Experiments indicate that SLALOM recovers the true parameter mappings from transformers trained on SLALOM-generated data and that its predicted logit scores closely approximate the ground truth.

Implications and Future Directions

The implications of this work are multi-faceted. By illuminating the structural limitations of transformers in learning additive models, the paper signals a crucial oversight in existing XAI practices. The findings urge a re-evaluation of current interpretability methodologies, particularly for applications in high-stakes domains where model transparency is paramount.

Practically, the integration of SLALOM in interpretability pipelines may pave the way for more nuanced and reliable feature attribution mechanisms. Researchers and practitioners are encouraged to explore task-specific feature attributions, which promise enhanced explanatory power in line with the diverse capabilities of LLMs.

From a theoretical perspective, the paper opens avenues for exploring alternative surrogate models capable of capturing the complex interactions endemic to transformer-generated data. Refining SLALOM and pursuing other innovative models tailored to different architectures could significantly advance our understanding and application of XAI frameworks.

In conclusion, Leemann et al.'s work calls for a shift in how feature attribution is understood and applied to transformer models, directing the community toward more accurate and insightful interpretability solutions.