AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745v1)
Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA LLMs. We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching, and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching, and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.
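To make the core idea concrete: AtP approximates the effect of patching a component's activation by a first-order Taylor expansion, effect ≈ (a_corrupt − a_clean) · ∇_a metric, with the gradient taken on the clean run. The sketch below is a minimal illustration on a toy network, not the paper's implementation; the hooked layer, toy inputs, and summed output standing in for a behavioral metric are all placeholder assumptions (in an LLM the site would be, e.g., an attention-head output and the metric a logit difference).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for an LLM; in practice the hooked site would be an
# attention-head or MLP output and `metric` a logit difference.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

def run_with_cache(x, keep_grad):
    """Run the model, caching the activation after the first layer."""
    cache = {}
    def hook(module, inputs, output):
        if keep_grad:
            output.retain_grad()  # so .grad is populated on backward()
        cache["act"] = output
    handle = model[0].register_forward_hook(hook)
    metric = model(x).sum()
    handle.remove()
    return metric, cache["act"]

x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

# Clean forward + backward: gradient of the metric at the clean activation.
metric_clean, act_clean = run_with_cache(x_clean, keep_grad=True)
metric_clean.backward()

# Corrupted forward: we only need its activation at the hooked site.
with torch.no_grad():
    _, act_corrupt = run_with_cache(x_corrupt, keep_grad=False)

# AtP estimate: first-order effect of splicing the corrupted activation
# into the clean run, (act_corrupt - act_clean) . d(metric)/d(act).
atp = ((act_corrupt - act_clean.detach()) * act_clean.grad).sum()

# Ground truth via Activation Patching: rerun the clean input with the
# corrupted activation patched in at the same site.
handle = model[0].register_forward_hook(lambda m, i, o: act_corrupt)
with torch.no_grad():
    metric_patched = model(x_clean).sum()
handle.remove()

print(f"AtP estimate: {atp.item():+.4f}")
print(f"true effect : {(metric_patched - metric_clean).item():+.4f}")
```

The single backward pass prices every component's attribution at once, which is the source of AtP's speedup over the linear-cost patching sweep; the estimate degrades exactly where the local linearization fails (e.g., saturated nonlinearities), which is the kind of false-negative failure mode AtP* is designed to address.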