Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

Published 23 Apr 2020 in cs.CL | (2004.11207v2)

Abstract: The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but they fail to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside Transformer. Finally, we show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.

Authors (4)

Citations (191)

Summary

The paper presents a self-attention attribution method that uses integrated gradients to quantify token interactions and to score the importance of individual self-attention heads.
It visualizes internal information flow through an attribution tree, exposing hierarchical token dependencies across Transformer layers.
The attribution scores also guide attention-head pruning and yield adversarial patterns for non-targeted attacks on BERT, highlighting potential model vulnerabilities.

The paper proposes a self-attention attribution method for interpreting the information flow inside Transformer-based models. The authors apply it to BERT, a widely used Transformer variant, and demonstrate it on several natural language processing tasks. The primary objective is to understand how different parts of the input interact to influence model predictions, and to expose these interactions more fully than inspecting individual attention weights allows.

The method uses integrated gradients to compute an attribution score for each self-attention weight in the model. These scores identify the token dependencies that contribute most to the model's prediction. In the reported experiments, the attribution scores rank the importance of self-attention heads better than baseline measures such as Taylor expansion.
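The integrated-gradients step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an all-zero attention baseline and a Riemann-sum approximation of the path integral, and it replaces BERT with a hypothetical toy scoring function (`toy_score`, with analytic gradient `toy_grad` and fixed matrix `V`) so that the example runs with NumPy alone.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical stand-in for the model: a scalar score computed from one
# attention matrix. A real setup would backpropagate through BERT instead.
V = np.array([[0.5, -0.2, 0.1],
              [0.3, 0.4, -0.6],
              [-0.1, 0.2, 0.7]])

def toy_score(attn):
    return np.tanh(np.sum(attn * V))

def toy_grad(attn):
    # Analytic gradient of toy_score with respect to each attention weight.
    s = np.sum(attn * V)
    return (1.0 - np.tanh(s) ** 2) * V

def attention_attribution(attn, grad_fn, steps=500):
    """Integrated gradients along the straight path from 0 to attn.

    Averages the gradient at attn * k / steps for k = 1..steps and
    multiplies elementwise by attn (the distance from the zero baseline).
    """
    total = np.zeros_like(attn)
    for k in range(1, steps + 1):
        total += grad_fn(k / steps * attn)
    return attn * total / steps

logits = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.5, 0.2],
                   [0.1, 0.4, 1.0]])
attn = softmax(logits)                      # rows sum to 1, like attention
attr = attention_attribution(attn, toy_grad)
```

A useful sanity check on any such implementation is the completeness property of integrated gradients: the attributions should sum approximately to the difference between the score at the actual attention matrix and the score at the zero baseline.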

Highlights and Results

Attribution and Pruning: Attention heads with low attribution scores are treated as candidates for pruning. In the reported experiments, this strategy is competitive with existing head-importance measures such as accuracy difference, and the low-scoring heads can be removed with only marginal performance degradation.
Visualization via Attribution Tree: The method also supports the construction of an attribution tree from the most salient dependencies in each layer, giving a hierarchical view of how input tokens interact across Transformer layers. This visualization shows the hierarchical nature of information processing inside the Transformer and its ability to capture both local and global dependencies.
Adversarial Patterns: The authors show that the most strongly attributed token interactions can be reused as adversarial patterns for non-targeted attacks on BERT. Inserting these small, attribution-derived modifications into an input can significantly alter the model's predictions, which points to potential vulnerabilities in over-parameterized models such as Transformers.
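The pruning highlight above can be illustrated with a short sketch. It assumes the attribution scores have already been computed and stacked into an array of shape `(num_examples, num_heads, seq_len, seq_len)`, and it assumes one plausible aggregation: a head's importance is the mean, over examples, of its largest attribution value. The function names and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def head_importance(attr):
    """Mean over examples of each head's largest attribution value.

    attr has shape (num_examples, num_heads, seq_len, seq_len).
    """
    per_example_max = attr.max(axis=(2, 3))   # (num_examples, num_heads)
    return per_example_max.mean(axis=0)       # (num_heads,)

def heads_to_prune(importance, num_prune):
    """Indices of the num_prune least important heads, in ascending order."""
    order = np.argsort(importance, kind="stable")
    return sorted(order[:num_prune].tolist())

# Synthetic attributions: 5 examples, 4 heads, 3x3 attention maps.
# Heads 1 and 3 are scaled down so they carry little attribution.
base = np.linspace(0.1, 1.0, 5 * 3 * 3).reshape(5, 1, 3, 3)
scale = np.array([1.0, 0.01, 0.5, 0.02]).reshape(1, 4, 1, 1)
attr = base * scale                           # broadcasts to (5, 4, 3, 3)

importance = head_importance(attr)
pruned = heads_to_prune(importance, num_prune=2)
```

In a real model the selected heads would then be masked or removed and accuracy re-measured, since the ranking alone does not guarantee that performance is preserved.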

Theoretical and Practical Implications

The findings matter both theoretically and practically. Theoretically, they give a more precise picture of how information flows inside Transformers, showing which dependencies and token interactions are instrumental to model outputs. Practically, they inform model optimization, for example more targeted pruning that maintains performance while reducing computational cost.

Moreover, the demonstrated ability to generate adversarial examples could inform work on model robustness, prompting exploration of more resilient training regimes or defenses against such vulnerabilities.

Speculations on Future Developments

Future research building on this work could explore the generality of the self-attention attribution method across other Transformer-based architectures beyond BERT, including those designed for multi-modal tasks. Additionally, as the field of AI moves towards more interpretable and transparent models, integrating such granular attribution analyses into model development pipelines could become increasingly critical.

Overall, this paper provides a framework for interpreting and optimizing the information flow within self-attention mechanisms, which are central to modern NLP architectures. As researchers continue to seek more interpretable and efficient models, methods like this one are likely to play an important role.