Uncovering hidden geometry in Transformers via disentangling position and context

Published 7 Oct 2023 in cs.LG, cs.AI, and stat.ML | (2310.04861v2)

Abstract: Transformers are widely used to extract semantic meanings from input tokens, yet they usually operate as black-box models. In this paper, we present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components. For any layer, embedding vectors of input sequence samples are represented by a tensor $\boldsymbol{h} \in \mathbb{R}^{C \times T \times d}$. Given embedding vector $\boldsymbol{h}_{c,t} \in \mathbb{R}^d$ at sequence position $t \le T$ in a sequence (or context) $c \le C$, extracting the mean effects yields the decomposition \[ \boldsymbol{h}_{c,t} = \boldsymbol{\mu} + \mathbf{pos}_t + \mathbf{ctx}_c + \mathbf{resid}_{c,t} \] where $\boldsymbol{\mu}$ is the global mean vector, $\mathbf{pos}_t$ and $\mathbf{ctx}_c$ are the mean vectors across contexts and across positions respectively, and $\mathbf{resid}_{c,t}$ is the residual vector. For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure: (1) $(\mathbf{pos}_t)_{t}$ forms a low-dimensional, continuous, and often spiral shape across layers, (2) $(\mathbf{ctx}_c)_c$ shows clear cluster structure that falls into context topics, and (3) $(\mathbf{pos}_t)_{t}$ and $(\mathbf{ctx}_c)_c$ are mutually nearly orthogonal. We argue that smoothness is pervasive and beneficial to transformers trained on languages, and our decomposition leads to improved model interpretability.
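
The decomposition described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `decompose` and the random test tensor are hypothetical, and it assumes that the positional and context means are taken after subtracting the global mean, which is what the four terms need in order to sum back to the original embedding.

```python
import numpy as np

def decompose(h):
    """Split embeddings h of shape (C, T, d) into mean-effect components.

    Assumption: pos and ctx are centered by the global mean mu, so that
    h[c, t] == mu + pos[t] + ctx[c] + resid[c, t] holds exactly.
    """
    mu = h.mean(axis=(0, 1))      # global mean vector, shape (d,)
    pos = h.mean(axis=0) - mu     # mean across contexts, shape (T, d)
    ctx = h.mean(axis=1) - mu     # mean across positions, shape (C, d)
    # Broadcast pos over contexts and ctx over positions, then subtract.
    resid = h - mu - pos[None, :, :] - ctx[:, None, :]
    return mu, pos, ctx, resid

# Hypothetical stand-in for one layer's hidden states: C=8, T=16, d=4.
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16, 4))

mu, pos, ctx, resid = decompose(h)

# Recombine the four components to check the identity.
recon = mu + pos[None, :, :] + ctx[:, None, :] + resid
```

With real hidden states in place of the random tensor, `pos` and `ctx` are the objects whose geometry the abstract describes. On random data only the algebraic identity is meaningful, not the spiral, cluster, or orthogonality findings.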