Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042v2)
Abstract: Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in LLMs, we find that varying these hyperparameters can lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations on best practices for activation patching going forward.
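To make the technique concrete, below is a minimal sketch of one common activation patching variant: denoising a corrupted run with cached activations from a clean run, scored by logit difference. It assumes the TransformerLens library (cited in the references below), GPT-2 small, an indirect-object-identification style prompt pair, and patching of the residual stream; the prompts and the choice of component are illustrative and are not the paper's exact protocol.

```python
import functools

import torch
from transformer_lens import HookedTransformer, utils

torch.set_grad_enabled(False)

# Illustrative model choice: GPT-2 small loaded through TransformerLens.
model = HookedTransformer.from_pretrained("gpt2")

# A clean/corrupted prompt pair; both must tokenize to the same length so that
# activations can be swapped position-by-position.
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

correct_token = model.to_single_token(" Mary")  # answer on the clean prompt
wrong_token = model.to_single_token(" John")

def logit_diff(logits: torch.Tensor) -> float:
    """Logit difference between correct and incorrect answer at the final position."""
    final = logits[0, -1]
    return (final[correct_token] - final[wrong_token]).item()

# Run the clean prompt once and cache all intermediate activations.
clean_logits, clean_cache = model.run_with_cache(clean_tokens)

def patch_position(resid, hook, pos):
    """Overwrite the residual stream at one position with its clean-run value."""
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

n_layers = model.cfg.n_layers
n_pos = clean_tokens.shape[1]
results = torch.zeros(n_layers, n_pos)

# Denoising: patch clean activations into the corrupted run, one (layer, position)
# at a time, and record how much of the clean logit difference is recovered.
for layer in range(n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)
    for pos in range(n_pos):
        patched_logits = model.run_with_hooks(
            corrupt_tokens,
            fwd_hooks=[(hook_name, functools.partial(patch_position, pos=pos))],
        )
        results[layer, pos] = logit_diff(patched_logits)

print(results)  # high values localize where the clean signal is carried
```

Patching individual attention heads or MLP outputs follows the same pattern with different hook names; the paper's comparison of evaluation metrics and corruption methods concerns exactly these kinds of choices.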
- Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
- Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
- On privileged and convergent bases in neural network representations. arXiv preprint arXiv:2307.12941, 2023.
- Thread: Circuits. Distill, 5(3):e24, 2020.
- Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022.
- Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
- A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023.
- Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
- Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
- Analyzing transformers in embedding space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
- Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
- Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
- Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022.
- Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023.
- Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
- Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
- Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
- Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
- Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
- How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- The out-of-distribution problem in explainability and search methods for feature importance explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only, 2023.
- Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2021.
- A benchmark for interpretability methods in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
- Feature relevance quantification in explainable AI: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
- Interpreting transformer’s attention dynamic memory and visualizing the semantic information flow of GPT. arXiv preprint arXiv:2305.13417, 2023.
- NeuroSurgeon: A toolkit for subnetwork analysis. arXiv preprint arXiv:2309.00244, 2023.
- Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations (ICLR), 2023a.
- Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023b.
- How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning (ICML), 2023c.
- Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
- The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
- Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
- Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130, 2023.
- Compositional explanations of neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- TransformerLens. https://github.com/neelnanda-io/TransformerLens, 2022.
- Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023a.
- Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023b.
- Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html, 2022.
- In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
- Judea Pearl. Direct and indirect effects. In Conference on Uncertainty and Artificial Intelligence (UAI), 2001.
- Language models are unsupervised multitask learners. OpenAI blog, 2019.
- Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
- Discovering the compositional structure of vector representations with role learning networks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
- Understanding arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054, 2023.
- Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023.
- Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
- Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
- GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
- Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.
- (Un)interpretability of transformers: a case study with Dyck grammars. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Interpretability at scale: Identifying causal mechanisms in alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2021.
- The clock and the pizza: Two stories in mechanistic explanation of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
Authors: Fred Zhang, Neel Nanda