Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042v2)

Published 27 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization -- identifying the important model components -- is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in LLMs, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.

References (59)
  1. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  2. Hidden progress in deep learning: SGD learns parities near the computational limit. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  3. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.
  4. On privileged and convergent bases in neural network representations. arXiv preprint arXiv:2307.12941, 2023.
  5. Thread: Circuits. Distill, 5(3):e24, 2020.
  6. Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 2022.
  7. Causal scrubbing, a method for rigorously testing interpretability hypotheses. AI Alignment Forum, 2022. https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing.
  8. A toy model of universality: Reverse engineering how networks learn group operations. In International Conference on Machine Learning (ICML), 2023.
  9. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  10. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.
  11. Knowledge neurons in pretrained transformers. In Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
  12. Analyzing transformers in embedding space. In Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
  13. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework/index.html.
  14. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP), 2021.
  15. Neural natural language inference models partially embed theories of lexical entailment and negation. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
  16. Causal abstractions of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  17. Inducing causal structure for interpretable neural networks. In International Conference on Machine Learning (ICML), 2022.
  18. Finding alignments between interpretable causal variables and distributed neural representations. arXiv preprint arXiv:2303.02536, 2023.
  19. Transformer feed-forward layers are key-value memories. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021.
  20. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.
  21. Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767, 2023.
  22. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969, 2023.
  23. Finding neurons in a haystack: Case studies with sparse probing. arXiv preprint arXiv:2305.01610, 2023.
  24. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  25. The out-of-distribution problem in explainability and search methods for feature importance explanations. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  26. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  27. A circuit for Python docstrings in a 4-layer attention-only transformer. https://www.alignmentforum.org/posts/u6KXXmKFbXfWzoAXn/a-circuit-for-python-docstrings-in-a-4-layer-attention-only, 2023.
  28. Natural language descriptions of deep visual features. In International Conference on Learning Representations (ICLR), 2021.
  29. A benchmark for interpretability methods in deep neural networks. In Advances in neural information processing systems (NeurIPS), 2019.
  30. Feature relevance quantification in explainable ai: A causal problem. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  31. Interpreting transformer’s attention dynamic memory and visualizing the semantic information flow of GPT. arXiv preprint arXiv:2305.13417, 2023.
  32. NeuroSurgeon: A toolkit for subnetwork analysis. arXiv preprint arXiv:2309.00244, 2023.
  33. Emergent world representations: Exploring a sequence model trained on a synthetic task. In International Conference on Learning Representations (ICLR), 2023a.
  34. Inference-time intervention: Eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS), 2023b.
  35. How do transformers learn topic structure: Towards a mechanistic understanding. In International Conference on Machine Learning (ICML), 2023c.
  36. Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla. arXiv preprint arXiv:2307.09458, 2023.
  37. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771, 2023.
  38. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  39. Language models implement simple word2vec-style vector arithmetic. arXiv preprint arXiv:2305.16130, 2023.
  40. Compositional explanations of neurons. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  41. TransformerLens. https://github.com/neelnanda-io/TransformerLens, 2022.
  42. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023a.
  43. Emergent linear representations in world models of self-supervised sequence models. arXiv preprint arXiv:2309.00941, 2023b.
  44. Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. https://transformer-circuits.pub/2022/mech-interp-essay/index.html, 2022.
  45. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
  46. Judea Pearl. Direct and indirect effects. In Conference on Uncertainty and Artificial Intelligence (UAI), 2001.
  47. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  48. Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892, 2022.
  49. Discovering the compositional structure of vector representations with role learning networks. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020.
  50. Understanding arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054, 2023.
  51. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023.
  52. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  53. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  54. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
  55. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR), 2023.
  56. (Un)interpretability of transformers: a case study with Dyck grammars. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  57. Interpretability at scale: Identifying causal mechanisms in alpaca. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
  58. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. In Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, 2021.
  59. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
Authors (2)
  1. Fred Zhang (15 papers)
  2. Neel Nanda (50 papers)
Citations (69)

Summary

  • The paper compares Gaussian noising and token replacement, finding that token replacement preserves in-distribution prompt properties better.
  • It reveals that evaluation metrics like logit difference offer nuanced insights by capturing both positive and negative component influences.
  • The study shows that sliding window patching exposes joint effects across adjacent layers, while recommending single-layer interventions to avoid amplification artifacts.

Towards Best Practices of Activation Patching in LLMs: Metrics and Methods

The field of mechanistic interpretability (MI) in machine learning aims to elucidate the internal workings of models, translating complex computations into human-understandable processes. A prominent technique within MI is activation patching, also known as causal tracing or interchange intervention, which is used to identify and assess the model components most responsible for a behavior. The existing literature on this technique, however, shows significant variance in methodological details, with no clear consensus on hyperparameters or evaluation metrics. This paper contributes a systematic examination of the methodological choices in activation patching, evaluating how changes in these parameters affect interpretability outcomes.

Methodological Variances in Activation Patching

The authors identify three major methodological dimensions in activation patching, each with a distinct impact on interpretability results; a minimal code sketch of the basic patching procedure follows the list:

  1. Corruption Method: The paper compares Gaussian Noising (GN) and Symmetric Token Replacement (STR) as ways of generating corrupted prompts. GN adds random Gaussian noise to the embeddings of key tokens (e.g., the subject in a factual-recall prompt), risking out-of-distribution behavior, while STR swaps the key tokens for semantically comparable alternatives, keeping the corrupted prompt in distribution.
  2. Evaluation Metric: The paper contrasts probability, logit difference, and Kullback-Leibler (KL) divergence as metrics to evaluate patching effects. Each metric captures different aspects of model behavior post-intervention, influencing the attributions made about component importance.
  3. Sliding Window Patching: This involves restoring activations across a window of adjacent layers simultaneously, as opposed to patching one layer at a time (or summing single-layer effects). Window patching emphasizes the joint effects of adjacent layers, indicating where clusters of computational dependencies might reside.
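
As a concrete illustration, here is a minimal sketch of single-layer activation patching with an STR-style corrupted prompt, written against the TransformerLens library (reference 41). The prompts, the use of GPT-2 small, and the choice to patch the residual stream at the corrupted token position are illustrative assumptions rather than the paper's exact experimental setup.

```python
# Minimal sketch: single-layer activation patching with symmetric token replacement.
# Prompts, model choice, and the patched component are illustrative assumptions.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small, for illustration

# STR corruption: the prompts are chosen to tokenize to the same length,
# differing only in the second subject name.
clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupt_prompt = "When John and Mary went to the store, Charlie gave a drink to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

answer_token = model.to_single_token(" Mary")
wrong_token = model.to_single_token(" John")

# Cache the clean run's activations and locate the position where the prompts differ.
clean_logits, clean_cache = model.run_with_cache(clean_tokens)
pos = (clean_tokens != corrupt_tokens).nonzero()[0, 1].item()

def logit_diff(logits: torch.Tensor) -> float:
    # Logit difference metric: correct minus incorrect answer logit at the last position.
    return (logits[0, -1, answer_token] - logits[0, -1, wrong_token]).item()

results = []
for layer in range(model.cfg.n_layers):
    def patch_resid(resid, hook):
        # Restore the clean residual stream at the corrupted position only.
        resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return resid

    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
    )
    results.append(logit_diff(patched_logits))

print(results)  # layers whose clean activations restore the answer score highest
```

This sketch runs in the denoising direction (clean activations restored into a corrupted run); the complementary noising direction instead patches corrupted activations into a clean run.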

Empirical Findings and Conceptual Considerations

Through empirical analyses on tasks such as factual recall, indirect object identification, arithmetic reasoning, and others, the paper reveals:

  • Corruption Impact: Disparate results with GN and STR highlight the susceptibility of activation patching to the choice of corruption method. In factual recall tasks, GN yielded pronounced peaks of activation importance not replicated by STR, indicating possible noise-induced misattributions.
  • Metric Influence: The choice of evaluation metric significantly alters interpretability outcomes. Probability, while useful, can obscure the detection of negatively contributing components because it is bounded below by zero. Logit difference provides a more balanced view by accounting for both positive and negative influences of components (see the metric sketch after this list).
  • Window Patching Effects: Sliding window patching tends to accentuate the localization of computational tasks within layers, suggesting a higher joint influence among consecutive layers, a phenomenon not as apparent in single-layer evaluations.
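
To make the metric comparison concrete, the following hedged sketch computes the three metrics from the final-position logits of the runs above; the direction of the KL divergence shown is one common convention and an assumption here, not necessarily the paper's exact choice.

```python
import torch
import torch.nn.functional as F

# Final-position logit vectors from the runs in the earlier sketch.
patched = patched_logits[0, -1]
clean = clean_logits[0, -1]

def prob_metric(patched, answer_token):
    # Probability of the correct answer. Bounded below by zero, so it cannot
    # separate "no effect" from "actively pushes toward the wrong answer".
    return F.softmax(patched, dim=-1)[answer_token].item()

def logit_diff_metric(patched, answer_token, wrong_token):
    # Correct-minus-incorrect logit. Can go negative, exposing components
    # whose clean activations hurt the correct answer.
    return (patched[answer_token] - patched[wrong_token]).item()

def kl_metric(patched, clean):
    # KL(clean || patched): overall distributional change, with no notion of a
    # task-specific "direction". The KL direction used here is an assumption.
    return F.kl_div(
        F.log_softmax(patched, dim=-1),
        F.log_softmax(clean, dim=-1),
        log_target=True,
        reduction="sum",
    ).item()
```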

Recommendations

Given these findings, the authors recommend STR for corruption due to its tendency to maintain model prompts in distribution, thereby reducing interpretability ambiguities arising from out-of-distribution effects. Logit difference is advocated as a robust metric for its nuanced reflection of component contributions. Additionally, while sliding window patching can reveal implicit dependencies across layers, single-layer interventions should be prioritized to mitigate amplification artifacts.
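
For comparison with single-layer patching, here is a hedged sketch of sliding window patching that reuses the model, tokens, cache, and logit_diff from the earlier sketch; restoring MLP outputs and using a window of three layers are illustrative assumptions rather than the paper's exact setup.

```python
# Sliding window patching: restore clean MLP outputs at the corrupted position
# for `width` adjacent layers at once. The window width and the patched
# component (mlp_out) are illustrative assumptions.
width = 3

window_results = []
for start in range(model.cfg.n_layers - width + 1):
    def patch_mlp(mlp_out, hook):
        # Restore the clean MLP output at the corrupted position.
        mlp_out[:, pos, :] = clean_cache[hook.name][:, pos, :]
        return mlp_out

    hooks = [
        (utils.get_act_name("mlp_out", layer), patch_mlp)
        for layer in range(start, start + width)
    ]
    patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=hooks)
    window_results.append(logit_diff(patched_logits))

# A window peak that exceeds every single-layer effect inside it suggests a
# joint effect across adjacent layers rather than one critical layer.
print(window_results)
```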

Theoretical and Practical Implications

This paper provides valuable insights into the nuances of interpretability analysis in LLMs, cautioning against simplistic applications of activation patching without due attention to methodological detail. The findings urge future MI research to adopt standardized practices that ensure robustness and replicability, thus enhancing our understanding and control of LLM behaviors. Such methodological refinements are pivotal for advancing trustworthy AI systems, enabling reliable feature attributions, and facilitating the development of interpretable AI at scale. Future work might extend these findings to larger, more complex models and other architectural paradigms, further stabilizing the interpretability discourse within AI research.
