Abstract

We locate factual knowledge in LLMs by exploring the residual stream and analyzing subvalues in vocabulary space. We explain why subvalues correspond to human-interpretable concepts when projected into vocabulary space: layer outputs are combined by a direct addition function, so a subvalue with large before-softmax values raises the probability of its top tokens in vocabulary space. Based on this, we find that log probability increase is a better measure of the significance of layers and subvalues than probability increase, since the curve of log probability increase is linear and monotonically increasing. Moreover, we calculate inner products to evaluate how much a feed-forward network (FFN) subvalue is activated by previous layers. Based on our methods, we find where the factual knowledge <France, capital, Paris> is stored. Specifically, attention layers store "Paris is related to France", while FFN layers store "Paris is a capital/city", activated by attention subvalues related to "capital". We apply our method to Baevski-18, GPT2 medium, Llama-7B and Llama-13B. Overall, we provide a new method for understanding the mechanism of transformers. We will release our code on GitHub.

Overview

  • Transformer-based models have profoundly enhanced AI tasks but their complex mechanisms lack transparency.

  • The study explores the 'residual stream' of transformers, revealing its role in accumulating layer outputs and shaping predictions.

  • A new metric, log probability increase, measures each layer's contribution, showing that knowledge is distributed across attention and FFN layers.

  • Empirical evidence suggests that all layers contribute to predictions, and the importance is distributed rather than centralized.

  • The research introduces a novel approach for understanding layer influence and pledges to release the code for public use.

Introduction

Transformer-based models have drastically advanced performance across various AI tasks. While successful on the surface, the intricacies of how these models arrive at their predictions often remain opaque, a problem which impedes further improvement and trustworthiness. Current interpretability approaches struggle with the increasingly complex structures underlying these models, leaving us with pressing questions regarding parameter significance and the accurate location of knowledge within the network's architecture.

Unveiling the Mysteries of Transformers

The key to understanding transformers is dissecting the so-called residual stream—a pathway where the outputs of different layers interact and accumulate. By delving into the residual stream, the study shows that these outputs are combined by a direct addition function, and that this addition directly shifts the probabilities of predicted tokens: a token's probability increases when its before-softmax value becomes large.
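To make the addition picture concrete, here is a minimal sketch in PyTorch with dummy tensors (the names `unembed` and `layer_outputs`, and all shapes, are illustrative assumptions, not the paper's code): each layer's output is added to a running residual stream, and every partial sum can be projected through the unembedding matrix to see how before-softmax values and probabilities evolve.

```python
# Sketch of the residual-stream view: layer outputs add into a running hidden
# state, and any partial sum can be projected into vocabulary space.
# All tensors below are random placeholders for illustration only.
import torch

d_model, vocab_size, num_layers = 16, 100, 4
unembed = torch.randn(vocab_size, d_model)          # unembedding / LM-head matrix (assumed)
layer_outputs = [torch.randn(d_model) for _ in range(num_layers)]  # per-layer residual additions
hidden = torch.zeros(d_model)                        # residual stream (token embedding omitted)

for i, out in enumerate(layer_outputs):
    hidden = hidden + out                            # direct addition into the residual stream
    logits = unembed @ hidden                        # before-softmax values in vocabulary space
    probs = torch.softmax(logits, dim=-1)
    print(f"after layer {i}: top token {probs.argmax().item()}, p = {probs.max().item():.3f}")
```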

Assigning Contributions and Probing Layers

To pinpoint influential parameters, this research establishes log probability increase as a metric for quantifying a layer's contribution to a prediction. Leveraging this metric, the study illuminates how each layer—whether attention or feed-forward network (FFN)—supports word predictions. Furthermore, by analyzing inner products, the research shows how the outputs of preceding layers activate subsequent FFN subvalues.
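A hedged sketch of how such a log-probability-increase metric could be computed, assuming hypothetical placeholder tensors (`unembed`, `layer_outputs`, `target_id`) rather than the paper's released implementation: a layer's contribution is the change in log p(target token) caused by adding that layer's output to the residual stream.

```python
# Sketch of the log-probability-increase metric for a single target token.
# Tensors and names are illustrative placeholders, not the authors' code.
import torch

def log_prob(hidden, unembed, target_id):
    """Log probability of the target token given the current residual stream."""
    logits = unembed @ hidden
    return torch.log_softmax(logits, dim=-1)[target_id]

d_model, vocab_size, num_layers, target_id = 16, 100, 4, 7
unembed = torch.randn(vocab_size, d_model)
layer_outputs = [torch.randn(d_model) for _ in range(num_layers)]
hidden = torch.zeros(d_model)

for i, out in enumerate(layer_outputs):
    before = log_prob(hidden, unembed, target_id)
    hidden = hidden + out                            # add this layer's output
    after = log_prob(hidden, unembed, target_id)
    print(f"layer {i}: log-prob increase = {(after - before).item():+.3f}")
```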

Empirical Findings and Methodological Innovations

Empirical analyses on a collection of sampled cases indicate that every layer within transformers plays a role in next-word prediction, with knowledge distributed across both attention and FFN layers. Notably, no single layer or module monopolizes importance; several contribute jointly to predictions. Case studies reinforce these findings, demonstrating that the features most important for a prediction may reside in both attention and FFN subvalues. Lastly, the research presents a methodological contribution by showcasing a technique for quantifying the influence of preceding layers on upper FFN layers.
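That influence can be sketched with inner products: since the FFN input is a sum of earlier layer outputs, the pre-activation of one FFN subvalue (the inner product of its key vector with the FFN input) decomposes into per-layer terms. The names `ffn_key` and `prev_outputs` below are illustrative assumptions with random dummy values.

```python
# Sketch of attributing an FFN subvalue's activation to earlier layers via
# inner products. All tensors are dummy placeholders for illustration.
import torch

d_model, num_prev_layers = 16, 6
ffn_key = torch.randn(d_model)                                         # key vector of one FFN subvalue (assumed)
prev_outputs = [torch.randn(d_model) for _ in range(num_prev_layers)]  # outputs of earlier layers (assumed)

ffn_input = torch.stack(prev_outputs).sum(dim=0)                       # residual stream entering the FFN
total = ffn_key @ ffn_input                                            # pre-activation of the subvalue
per_layer = [(ffn_key @ out).item() for out in prev_outputs]           # per-layer inner-product terms

print(f"total pre-activation: {total.item():.3f}")
for i, score in enumerate(per_layer):
    print(f"  contribution from layer {i}: {score:+.3f}")
```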

Roadmap to Interpretability

The study promises to release the code on GitHub, enabling the public to apply these interpretability methods. Through such transparency, it is anticipated that the interpretability of transformer-based models will improve significantly.
