Dissecting Query-Key Interaction in Vision Transformers

(2405.14880)
Published Apr 4, 2024 in cs.CV and cs.AI

Abstract

Self-attention in vision transformers has been thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features in an image. However, contextualization is also an important and necessary computation for processing signals. Contextualization potentially requires tokens to attend to dissimilar tokens such as those corresponding to backgrounds or different objects, but this effect has not been reported in previous studies. In this study, we investigate whether self-attention in vision transformers exhibits a preference for attending to similar tokens or dissimilar tokens, providing evidence of perceptual grouping and contextualization, respectively. To study this question, we propose the use of singular value decomposition on the query-key matrix $\mathbf{W}_q^{\top}\mathbf{W}_k$. Naturally, the left and right singular vectors are feature directions of the self-attention layer and can be analyzed in pairs to interpret the interaction between tokens. We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens. Moreover, many of these interactions between features represented by singular vectors are interpretable. We present a novel perspective on interpreting the attention mechanism, which may contribute to understanding how transformer models utilize context and salient features when processing images.
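
As a rough illustration of the decomposition the abstract describes, the sketch below factors a query-key matrix $\mathbf{W}_q^{\top}\mathbf{W}_k$ with NumPy and checks that the unscaled attention logit between two tokens splits into one term per singular mode. It is not the authors' code. The random weights, the dimensions, the column-vector convention ($q = \mathbf{W}_q x$, $k = \mathbf{W}_k x$), and the helper names `qk_singular_modes` and `attention_logit_by_mode` are all assumptions made for the example. Biases and the $1/\sqrt{d}$ scaling are omitted. Using the cosine similarity of each left/right pair as an indicator of similar versus dissimilar attention is likewise only one plausible reading of "analyzed in pairs".

```python
import numpy as np


def qk_singular_modes(W_q, W_k):
    """SVD of the bilinear query-key matrix W_q^T W_k (hypothetical helper).

    W_q, W_k: arrays of shape (d_head, d_model).
    Returns U, S, Vt with W_q.T @ W_k == U @ diag(S) @ Vt.
    """
    M = W_q.T @ W_k  # shape (d_model, d_model), rank at most d_head
    return np.linalg.svd(M)


def attention_logit_by_mode(x_query, x_key, U, S, Vt):
    """Per-mode terms s_n * (x_query . u_n) * (v_n . x_key) (hypothetical helper)."""
    return S * (x_query @ U) * (Vt @ x_key)


# Assumed toy setup: random weights stand in for a trained attention head.
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))
x_i = rng.standard_normal(d_model)  # embedding of the query token
x_j = rng.standard_normal(d_model)  # embedding of the key token

U, S, Vt = qk_singular_modes(W_q, W_k)
contrib = attention_logit_by_mode(x_i, x_j, U, S, Vt)

# The unscaled logit q_i . k_j equals the sum of the per-mode terms.
logit = (W_q @ x_i) @ (W_k @ x_j)
assert np.isclose(contrib.sum(), logit)

# Singular vectors have unit norm, so the dot product of each (u_n, v_n)
# pair is its cosine similarity. Only the first d_head modes carry signal.
pair_cosine = np.sum(U * Vt.T, axis=0)[:d_head]
print(pair_cosine)
```

Under this reading, a mode whose left and right singular vectors point in similar directions (cosine near 1) rewards query and key tokens that share a feature, while a mode with dissimilar or opposed vectors rewards attention between tokens carrying different features.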
