AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745v1)

Published 1 Mar 2024 in cs.LG and cs.CL

Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA LLMs. We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates.

Summary

  • The paper introduces AtP*, an enhanced attribution patching method for efficiently localizing LLM behavior.
  • It addresses two failure modes of attribution patching by recomputing the attention softmax when patching queries and keys, and by applying dropout to gradients in the backward pass (GradDrop), substantially reducing false negatives.
  • A systematic evaluation shows that AtP outperforms all other investigated approximations to Activation Patching, with AtP* providing a further significant improvement at a fraction of the cost of exhaustive patching.

Enhancing Interpretability of LLMs through Advanced Attribution Patching Techniques

Introduction

Understanding the internal mechanics of LLMs is increasingly important as their role in digital systems continues to expand. A critical part of this endeavor is causally attributing model behavior to individual components, which is crucial for enhancing model transparency, reliability, and control. Tracing behaviors back to specific elements within state-of-the-art LLMs is challenging because of their immense complexity. Activation Patching has been a preferred approach because it directly computes causal attributions by intervening on model components; however, its cost scales linearly with the number of components, making exhaustive sweeps impractical for SoTA models. This paper addresses that scalability issue by analyzing Attribution Patching (AtP), introducing a refined variant, AtP*, for improved efficiency and accuracy, and establishing a systematic comparison with alternative methods.
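For concreteness, the snippet below sketches what a single activation-patching experiment looks like on a toy PyTorch model. The model, the choice of component, and the metric are illustrative assumptions rather than the paper's setup; an exhaustive sweep repeats this experiment once per component, which is what becomes prohibitive at scale.

```python
# Minimal sketch of one activation-patching experiment on a toy model.
# The model, the "component", and the metric are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x_clean = torch.randn(1, 8)      # "clean" input
x_corrupt = torch.randn(1, 8)    # "corrupted" / counterfactual input

def metric(logits):
    return logits[0, 2]          # illustrative behaviour metric: one target logit

component = model[1]             # treat the ReLU output as the component of interest
cache = {}

def cache_hook(module, inputs, output):
    cache["act"] = output.detach()          # store the corrupted-run activation

def patch_hook(module, inputs, output):
    return cache["act"]                     # overwrite the clean-run activation

with torch.no_grad():
    h = component.register_forward_hook(cache_hook)
    model(x_corrupt)                        # corrupted forward pass
    h.remove()

    clean_value = metric(model(x_clean))    # unpatched baseline

    h = component.register_forward_hook(patch_hook)
    patched_value = metric(model(x_clean))  # clean run with the patched activation
    h.remove()

print(f"causal effect of patching this component: {(patched_value - clean_value).item():.4f}")
```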

Attribution Patching and its Pitfalls

Attribution Patching (AtP), a gradient-based approximation to Activation Patching, offers significant speedups but has limitations. It exhibits two main classes of failure modes: false negatives caused by attention saturation, where the linear approximation through the attention softmax breaks down, and brittle false negatives caused by cancellation between positive and negative contributions to the gradient. These failures reduce AtP's reliability and can cause it to overlook components that are crucial to the model's behavior.
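For intuition, here is a hedged sketch of the first-order approximation that AtP relies on, again on an illustrative toy setup: the effect of patching a component is approximated by the inner product of the activation difference with the gradient of the metric taken on the clean run. Both failure modes above are failures of exactly this linearization.

```python
# Hedged sketch of the AtP-style first-order approximation on a toy model.
# The model, component, and metric are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)
metric = lambda logits: logits[0, 2]        # illustrative behaviour metric

component = model[1]
acts = {}

def grab_clean(module, inputs, output):
    output.retain_grad()                    # keep the gradient on this non-leaf tensor
    acts["clean"] = output

def grab_corrupt(module, inputs, output):
    acts["corrupt"] = output

# Clean forward + one backward pass gives d(metric)/d(activation).
h = component.register_forward_hook(grab_clean)
metric(model(x_clean)).backward()
h.remove()
grad = acts["clean"].grad

# Corrupted forward pass gives the alternative activation to "patch in".
with torch.no_grad():
    h = component.register_forward_hook(grab_corrupt)
    model(x_corrupt)
    h.remove()

# AtP estimate: linearized effect of swapping the clean activation for the corrupted one.
atp_estimate = ((acts["corrupt"] - acts["clean"].detach()) * grad).sum()
print(f"AtP estimate of the patching effect: {atp_estimate.item():.4f}")
```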

Introducing AtP*

To counter these deficiencies, the paper proposes AtP*, an enhanced version of AtP with two key modifications. First, when attributing effects to attention queries and keys, it recomputes the attention softmax on the patched values rather than linearizing through it, which fixes the gradient approximation in saturated-attention scenarios. Second, it applies dropout to gradients in the backward pass (GradDrop) to break up cancellations, thereby reducing brittle false negatives. These adjustments preserve AtP's scalability while significantly curtailing its proneness to false negatives.
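The second modification can be illustrated with a custom autograd function that is the identity in the forward pass but randomly zeroes gradient entries in the backward pass. This is a minimal, hedged building block for the idea only; the paper's GradDrop drops gradients at specific residual-stream locations and aggregates AtP estimates over several sampled masks.

```python
# Hedged sketch of "dropout in the backward pass": identity forward, random
# gradient zeroing backward. Illustrative only, not the paper's exact GradDrop.
import torch

class GradDrop(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, p):
        ctx.p = p
        return x.view_as(x)                   # forward pass is untouched

    @staticmethod
    def backward(ctx, grad_output):
        # Zero a random fraction p of gradient entries and rescale the rest,
        # so cancellations between positive and negative contributions no
        # longer hide a large effect under every sampled mask.
        mask = (torch.rand_like(grad_output) > ctx.p).float() / (1 - ctx.p)
        return grad_output * mask, None       # no gradient for p

x = torch.randn(4, requires_grad=True)
GradDrop.apply(x, 0.3).sum().backward()
print(x.grad)                                 # some entries zeroed, the rest scaled by 1/(1-p)
```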

Systematic Evaluation of Patching Methods

Through comparisons against both brute-force Activation Patching and a range of faster approximations, including novel baselines introduced for this study, the paper establishes that AtP significantly outperforms the other methods, delivering large speedups without a comparable loss of accuracy, and that AtP* improves on it further. Additionally, the authors present a diagnostic method to statistically bound the probability of remaining false negatives in AtP* estimates, adding a layer of reliability to its application.
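As a rough, simplified stand-in for such a diagnostic (not the paper's actual procedure, which patches randomly sampled subsets of unverified nodes jointly), the sketch below shows how exactly verifying m uniformly sampled nodes out of N and finding no missed large effect yields a hypergeometric upper confidence bound on how many misses could remain.

```python
# Hedged, simplified stand-in for a false-negative bound: a textbook
# hypergeometric argument, assumed here for illustration only.
from math import comb

def false_negative_upper_bound(N, m, delta=0.05):
    """Largest k such that finding zero false negatives among m nodes sampled
    without replacement from N is still plausible at confidence level delta."""
    k = 0
    while k + 1 <= N - m and comb(N - (k + 1), m) / comb(N, m) >= delta:
        k += 1
    return k

# Example: 10,000 candidate nodes, 500 verified exactly, none was a false negative.
print(false_negative_upper_bound(N=10_000, m=500, delta=0.05))  # -> 58 for these illustrative numbers
```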

Implications and Future Directions

Aside from demonstrating superior performance, the refinement and validation of AtP* have broader implications:

  • Theoretical Advancements: AtP* contributes to the understanding of interpretability in LLMs, highlighting both the potential and limitations of gradient-based approximations.
  • Practical Applications: By offering a scalable method for causal attribution, AtP* aids researchers in dissecting LLM behavior, paving the way for more interpretable and controllable models.
  • Future Research: The findings invite further exploration into other components like layer normalization and extensions to edge attribution and coarser nodes, suggesting an expansive horizon for future investigations in LLM interpretability.

Conclusion

This paper's refinement of Attribution Patching underscores the balance between computational feasibility and the fidelity of causal attributions in LLMs. AtP*, with its targeted adjustments, represents a significant step forward in this domain, offering a practical path for rigorously unraveling the mechanisms underpinning LLM behaviors and moving closer to the goal of transparent, interpretable, and reliable AI systems.