Neuron-Level Knowledge Attribution in Large Language Models (2312.12141v4)
Abstract: Identifying the neurons that are important for final predictions is essential for understanding the mechanisms of LLMs. Due to computational constraints, current attribution techniques struggle to operate at the neuron level. In this paper, we propose a static method for pinpointing significant neurons. Compared with seven other methods, our approach demonstrates superior performance across three metrics. Additionally, since most static methods typically identify only "value neurons" that contribute directly to the final prediction, we propose a method for identifying the "query neurons" that activate these "value neurons". Finally, we apply our methods to analyze six types of knowledge across both attention and feed-forward network (FFN) layers. Our method and analysis are helpful for understanding the mechanisms of knowledge storage and set the stage for future research in knowledge editing. The code is available at https://github.com/zepingyu0512/neuron-attribution.
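To make neuron-level attribution concrete, below is a minimal sketch of one common static scoring scheme: projecting each FFN neuron's weighted output ("value") vector onto the final prediction in vocabulary space. This illustrates the general family of approaches rather than the paper's exact scoring function; the tensor names, shapes, and the function itself are assumptions for the example.

```python
# Minimal sketch (assumed shapes and names, not the paper's exact method):
# score each FFN neuron by the logit it directly contributes to the
# predicted token, via its per-neuron update to the residual stream.
import torch

def ffn_neuron_scores(
    activations: torch.Tensor,  # [d_ffn] neuron activations at the final position
    W_out: torch.Tensor,        # [d_ffn, d_model] FFN down-projection ("value" vectors)
    unembed: torch.Tensor,      # [d_model, vocab] unembedding matrix
    target_id: int,             # token id of the final prediction
) -> torch.Tensor:
    # Neuron i writes activations[i] * W_out[i] into the residual stream.
    per_neuron_update = activations.unsqueeze(1) * W_out   # [d_ffn, d_model]
    # Project each per-neuron update onto the target token's unembedding
    # direction to estimate its direct contribution to that token's logit.
    target_direction = unembed[:, target_id]                # [d_model]
    return per_neuron_update @ target_direction             # [d_ffn] scores
```

Under this kind of scoring, "value neurons" would be those with the largest direct contributions, while identifying "query neurons" requires additionally tracing which upstream components drive those neurons' activations.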