Abstract

We locate factual knowledge in LLMs by exploring the residual stream and analyzing subvalues in vocabulary space. We explain why subvalues correspond to human-interpretable concepts when projected into vocabulary space: layer outputs are combined by a direct addition function, so a subvalue with large before-softmax values raises the probability of its top tokens in vocabulary space. Based on this, we find that log probability increase is a better measure of the significance of layers and subvalues than probability increase, since the curve of log probability increase is linear and monotonically increasing. Moreover, we calculate inner products to evaluate how much a feed-forward network (FFN) subvalue is activated by previous layers. Based on our methods, we find where the factual knowledge <France, capital, Paris> is stored. Specifically, attention layers store "Paris is related to France", while FFN layers store "Paris is a capital/city", activated by attention subvalues related to "capital". We apply our method to Baevski-18, GPT2 medium, Llama-7B and Llama-13B. Overall, we provide a new method for understanding the mechanism of transformers. We will release our code on GitHub.

Overview

  • Transformer-based models have profoundly enhanced AI tasks but their complex mechanisms lack transparency.

  • The study explores the 'residual stream' of transformers, revealing its role in accumulating layer outputs and shaping predictions.

  • A new metric, log probability increase, measures each layer's contribution, showing that knowledge is distributed across attention and FFN layers.

  • Empirical evidence suggests that all layers contribute to predictions, and the importance is distributed rather than centralized.

  • The research introduces a novel approach for understanding layer influence and pledges to release the code for public use.

Introduction

Transformer-based models have drastically advanced performance across various AI tasks. While successful on the surface, the intricacies of how these models arrive at their predictions often remain opaque, a problem which impedes further improvement and trustworthiness. Current interpretability approaches struggle with the increasingly complex structures underlying these models, leaving us with pressing questions regarding parameter significance and the accurate location of knowledge within the network's architecture.

Unveiling the Mysteries of Transformers

The key to understanding transformers is dissecting the so-called residual stream—a pathway where the outputs of different layers interact and accumulate. By delving into the residual stream, the study shows that these outputs are combined by a direct addition function, and that this addition directly shifts the probabilities of predicted tokens: a token's probability increases when its before-softmax value becomes large.
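To make the addition picture concrete, here is a minimal sketch in PyTorch with dummy tensors (the names `unembed` and `layer_outputs`, and all shapes, are illustrative assumptions, not the paper's code): each layer's output is added to a running residual stream, and every partial sum can be projected through the unembedding matrix to see how before-softmax values and probabilities evolve.

```python
# Sketch of the residual-stream view: layer outputs add into a running hidden
# state, and any partial sum can be projected into vocabulary space.
# All tensors below are random placeholders for illustration only.
import torch

d_model, vocab_size, num_layers = 16, 100, 4
unembed = torch.randn(vocab_size, d_model)          # unembedding / LM-head matrix (assumed)
layer_outputs = [torch.randn(d_model) for _ in range(num_layers)]  # per-layer residual additions
hidden = torch.zeros(d_model)                        # residual stream (token embedding omitted)

for i, out in enumerate(layer_outputs):
    hidden = hidden + out                            # direct addition into the residual stream
    logits = unembed @ hidden                        # before-softmax values in vocabulary space
    probs = torch.softmax(logits, dim=-1)
    print(f"after layer {i}: top token {probs.argmax().item()}, p = {probs.max().item():.3f}")
```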

Assigning Contributions and Probing Layers

To pinpoint influential parameters, this research establishes log probability increase as a metric for quantifying a layer's contribution to a prediction. Leveraging this metric, the study illuminates how each layer—whether attention or feed-forward network (FFN)—supports word predictions. Furthermore, by analyzing inner products, the research shows how the outputs of preceding layers activate subsequent FFN subvalues.
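A hedged sketch of how such a log-probability-increase metric could be computed, assuming hypothetical placeholder tensors (`unembed`, `layer_outputs`, `target_id`) rather than the paper's released implementation: a layer's contribution is the change in log p(target token) caused by adding that layer's output to the residual stream.

```python
# Sketch of the log-probability-increase metric for a single target token.
# Tensors and names are illustrative placeholders, not the authors' code.
import torch

def log_prob(hidden, unembed, target_id):
    """Log probability of the target token given the current residual stream."""
    logits = unembed @ hidden
    return torch.log_softmax(logits, dim=-1)[target_id]

d_model, vocab_size, num_layers, target_id = 16, 100, 4, 7
unembed = torch.randn(vocab_size, d_model)
layer_outputs = [torch.randn(d_model) for _ in range(num_layers)]
hidden = torch.zeros(d_model)

for i, out in enumerate(layer_outputs):
    before = log_prob(hidden, unembed, target_id)
    hidden = hidden + out                            # add this layer's output
    after = log_prob(hidden, unembed, target_id)
    print(f"layer {i}: log-prob increase = {(after - before).item():+.3f}")
```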

Empirical Findings and Methodological Innovations

Empirical analyses on a collection of sampled cases indicate that every layer within transformers plays a role in next-word prediction, with knowledge distributed across both attention and FFN layers. Notably, no single layer or module monopolizes importance; several contribute jointly to predictions. Case studies reinforce these findings, demonstrating that the features most important for a prediction may reside in both attention and FFN subvalues. Lastly, the research presents a methodological contribution by showcasing a technique for quantifying the influence of preceding layers on upper FFN layers.
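That influence can be sketched with inner products: since the FFN input is a sum of earlier layer outputs, the pre-activation of one FFN subvalue (the inner product of its key vector with the FFN input) decomposes into per-layer terms. The names `ffn_key` and `prev_outputs` below are illustrative assumptions with random dummy values.

```python
# Sketch of attributing an FFN subvalue's activation to earlier layers via
# inner products. All tensors are dummy placeholders for illustration.
import torch

d_model, num_prev_layers = 16, 6
ffn_key = torch.randn(d_model)                                         # key vector of one FFN subvalue (assumed)
prev_outputs = [torch.randn(d_model) for _ in range(num_prev_layers)]  # outputs of earlier layers (assumed)

ffn_input = torch.stack(prev_outputs).sum(dim=0)                       # residual stream entering the FFN
total = ffn_key @ ffn_input                                            # pre-activation of the subvalue
per_layer = [(ffn_key @ out).item() for out in prev_outputs]           # per-layer inner-product terms

print(f"total pre-activation: {total.item():.3f}")
for i, score in enumerate(per_layer):
    print(f"  contribution from layer {i}: {score:+.3f}")
```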

Roadmap to Interpretability

The study promises to release the code on GitHub, enabling the public to apply these interpretability methods. Through such transparency, it is anticipated that the interpretability of transformer-based models will improve significantly.
