Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042v2)

Published 27 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in LLMs, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.

References (59)
  1. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  2. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  3. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
  4. On privileged and convergent bases in neural network representations. arXiv preprint arXiv:2307.12941, 2023.
  5. Thread: Circuits. Distill, 5(3):e24, 2020.
  6. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022.
  7. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
  8. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023.
  9. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  10. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
  11. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
  12. Analyzing transformers in embedding space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  13. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  14. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
  15. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
  16. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  17. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022.
  18. Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023.
  19. Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
  20. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
  21. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
  22. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
  23. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
  24. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  25. The out-of-distribution problem in explainability and search methods for feature importance explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  26. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  27. A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only, 2023.
  28. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2021.
  29. A benchmark for interpretability methods in deep neural networks. In Advances in neural information processing systems (NeurIPS), 2019.
  30. Feature relevance quantification in explainable ai: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  31. Interpreting transformer’s attention dynamic memory and visualizing the semantic information flow of GPT. arXiv preprint arXiv:2305.13417, 2023.
  32. NeuroSurgeon: A toolkit for subnetwork analysis. arXiv preprint arXiv:2309.00244, 2023.
  33. Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations (ICLR), 2023a.
  34. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023b.
  35. How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning (ICML), 2023c.
  36. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
  37. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
  38. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  39. Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130, 2023.
  40. Compositional explanations of neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  41. TransformerLens. https://github.com/neelnanda-io/TransformerLens, 2022.
  42. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023a.
  43. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023b.
  44. Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html, 2022.
  45. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
  46. Judea Pearl. Direct and indirect effects. In Conference on Uncertainty and Artificial Intelligence (UAI), 2001.
  47. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  48. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
  49. Discovering the compositional structure of vector representations with role learning networks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
  50. Understanding arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054, 2023.
  51. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023.
  52. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  53. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  54. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
  55. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.
  56. (Un)interpretability of transformers: a case study with Dyck grammars. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  57. Interpretability at scale: Identifying causal mechanisms in alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  58. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2021.
  59. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
Authors (2)
  1. Fred Zhang (15 papers)
  2. Neel Nanda (50 papers)
Citations (69)

Summary

  • The paper compares Gaussian noising and token replacement, finding that token replacement preserves in-distribution prompt properties better.
  • It reveals that evaluation metrics like logit difference offer nuanced insights by capturing both positive and negative component influences.
  • The study shows that sliding window patching exposes joint effects across adjacent layers, while recommending single-layer interventions to avoid amplification artifacts.

Towards Best Practices of Activation Patching in LLMs: Metrics and Methods

The field of mechanistic interpretability (MI) in machine learning aims to elucidate the internal workings of models, translating complex computations into human-understandable processes. A prominent technique within MI is activation patching, also known as causal tracing or interchange intervention, which is used to identify and assess the model components most responsible for a behavior. The existing literature on this technique, however, shows significant variance in methodological details, with no clear consensus on hyperparameters or evaluation metrics. This paper contributes a systematic examination of the methodological choices in activation patching, evaluating how changes in these parameters affect interpretability outcomes.

Methodological Variances in Activation Patching

The authors identify three major methodological dimensions in activation patching, each with a distinct impact on interpretability results; a minimal code sketch of the basic patching procedure follows the list:

  1. Corruption Method: The paper compares Gaussian Noising (GN) and Symmetric Token Replacement (STR) as ways of generating corrupted prompts. GN adds random Gaussian noise to the embeddings of key tokens (e.g., the subject in a factual-recall prompt), risking out-of-distribution behavior, while STR swaps the key tokens for semantically comparable alternatives, keeping the corrupted prompt in distribution.
  2. Evaluation Metric: The paper contrasts probability, logit difference, and Kullback-Leibler (KL) divergence as metrics to evaluate patching effects. Each metric captures different aspects of model behavior post-intervention, influencing the attributions made about component importance.
  3. Sliding Window Patching: This involves restoring activations across a window of adjacent layers simultaneously, as opposed to patching one layer at a time (or summing single-layer effects). Window patching emphasizes the joint effects of adjacent layers, indicating where clusters of computational dependencies might reside.
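
As a concrete illustration, here is a minimal sketch of single-layer activation patching with an STR-style corrupted prompt, written against the TransformerLens library (reference 41). The prompts, the use of GPT-2 small, and the choice to patch the residual stream at the corrupted token position are illustrative assumptions rather than the paper's exact experimental setup.

```python
# Minimal sketch: single-layer activation patching with symmetric token replacement.
# Prompts, model choice, and the patched component are illustrative assumptions.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small, for illustration

# STR corruption: the prompts are chosen to tokenize to the same length,
# differing only in the second subject name.
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Charlie gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

answer_token = model.to_single_token(" Mary")
wrong_token = model.to_single_token(" John")

# Cache the clean run's activations and locate the position where the prompts differ.
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
pos = (clean_tokens != corrupt_tokens).nonzero()[0, 1].item()

def logit_diff(logits: torch.Tensor) -> float:
    # Logit difference metric: correct minus incorrect answer logit at the last position.
    return (logits[0, -1, answer_token] - logits[0, -1, wrong_token]).item()

results = []
for layer in range(model.cfg.n_layers):
    def patch_resid(resid, hook):
        # Restore the clean residual stream at the corrupted position only.
        resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return resid

    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )
    results.append(logit_diff(patched_logits))

print(results)  # layers whose clean activations restore the answer score highest
```

This sketch runs in the denoising direction (clean activations restored into a corrupted run); the complementary noising direction instead patches corrupted activations into a clean run.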

Empirical Findings and Conceptual Considerations

Through empirical analyses on tasks such as factual recall, indirect object identification, arithmetic reasoning, and others, the paper reveals:

  • Corruption Impact: Disparate results with GN and STR highlight the susceptibility of activation patching to the choice of corruption method. In factual recall tasks, GN yielded pronounced peaks of activation importance not replicated by STR, indicating possible noise-induced misattributions.
  • Metric Influence: The choice of evaluation metric significantly alters interpretability outcomes. Probability, while useful, can obscure the detection of negatively contributing components because it is bounded below by zero. Logit difference provides a more balanced view by accounting for both positive and negative influences of components (see the metric sketch after this list).
  • Window Patching Effects: Sliding window patching tends to accentuate the localization of computational tasks within layers, suggesting a higher joint influence among consecutive layers, a phenomenon not as apparent in single-layer evaluations.
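
To make the metric comparison concrete, the following hedged sketch computes the three metrics from the final-position logits of the runs above; the direction of the KL divergence shown is one common convention and an assumption here, not necessarily the paper's exact choice.

```python
import torch
import torch.nn.functional as F

# Final-position logit vectors from the runs in the earlier sketch.
patched = patched_logits[0, -1]
clean = clean_logits[0, -1]

def prob_metric(patched, answer_token):
    # Probability of the correct answer. Bounded below by zero, so it cannot
    # separate "no effect" from "actively pushes toward the wrong answer".
    return F.softmax(patched, dim=-1)[answer_token].item()

def logit_diff_metric(patched, answer_token, wrong_token):
    # Correct-minus-incorrect logit. Can go negative, exposing components
    # whose clean activations hurt the correct answer.
    return (patched[answer_token] - patched[wrong_token]).item()

def kl_metric(patched, clean):
    # KL(clean || patched): overall distributional change, with no notion of a
    # task-specific "direction". The KL direction used here is an assumption.
    return F.kl_div(
        F.log_softmax(patched, dim=-1),
        F.log_softmax(clean, dim=-1),
        log_target=True,
        reduction="sum",
    ).item()
```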

Recommendations

Given these findings, the authors recommend STR for corruption due to its tendency to maintain model prompts in distribution, thereby reducing interpretability ambiguities arising from out-of-distribution effects. Logit difference is advocated as a robust metric for its nuanced reflection of component contributions. Additionally, while sliding window patching can reveal implicit dependencies across layers, single-layer interventions should be prioritized to mitigate amplification artifacts.
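
For comparison with single-layer patching, here is a hedged sketch of sliding window patching that reuses the model, tokens, cache, and logit_diff from the earlier sketch; restoring MLP outputs and using a window of three layers are illustrative assumptions rather than the paper's exact setup.

```python
# Sliding window patching: restore clean MLP outputs at the corrupted position
# for `width` adjacent layers at once. The window width and the patched
# component (mlp_out) are illustrative assumptions.
width = 3

window_results = []
for start in range(model.cfg.n_layers - width + 1):
    def patch_mlp(mlp_out, hook):
        # Restore the clean MLP output at the corrupted position.
        mlp_out[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return mlp_out

    hooks = [
        (utils.get_act_name("mlp_out", layer), patch_mlp)
        for layer in range(start, start + width)
    ]
    patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=hooks)
    window_results.append(logit_diff(patched_logits))

# A window peak that exceeds every single-layer effect inside it suggests a
# joint effect across adjacent layers rather than one critical layer.
print(window_results)
```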

Theoretical and Practical Implications

This paper provides valuable insights into the nuances of interpretability analysis in LLMs, cautioning against simplistic applications of activation patching without due attention to methodological detail. The findings urge future MI research to adopt standardized practices that ensure robustness and replicability, thus enhancing our understanding and control of LLM behaviors. Such methodological refinements are pivotal for advancing trustworthy AI systems, enabling reliable feature attributions, and facilitating the development of interpretable AI at scale. Future work might extend these findings to larger, more complex models and other architectural paradigms, further stabilizing the interpretability discourse within AI research.
