Towards Automated Circuit Discovery for Mechanistic Interpretability

Abstract

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process's steps: identifying the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery.

Overview

  • The paper discusses the development of the Automatic Circuit Discovery (ACDC) algorithm designed to automate the discovery of circuits in neural networks for mechanistic interpretability.

  • Mechanistic interpretability breaks down AI behaviors into algorithms and involves a multi-step process including observation, dataset creation, and activation patching.

  • Representing the network as a computational graph at a chosen granularity helps researchers define the scope of interpretation and analyze the network's functionality.

  • ACDC automates iterative activation patching, using a quantitative metric to prune connections while retaining those essential to the behavior under study.

  • While it has limitations, ACDC's open-source availability may accelerate future interpretability research and understanding of complex AI models.

Introduction

Recent advances in AI have presented an exciting yet challenging frontier: understanding how complex neural networks like transformers achieve particular behaviors. The effort to build this understanding is called mechanistic interpretability. Central to this endeavor is the concept of circuits: subgraphs of a model's computational graph that implement specific behaviors. Unraveling these circuits has typically been a manual process that is not only time-consuming but also scales poorly with model complexity. The authors introduce Automatic Circuit Discovery (ACDC), an algorithm that automates the discovery of circuits in neural network models.
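To make the graph-and-circuit framing concrete, here is a minimal Python sketch (ours, not the paper's code): the computational graph is a set of directed edges over named components, and a circuit is a subset of those edges. The component names are hypothetical.

```python
# Minimal sketch: a computational graph as a set of directed edges over
# named components; a circuit is a subgraph hypothesized to implement
# one specific behavior.
from typing import Set, Tuple

Edge = Tuple[str, str]  # (upstream component, downstream component)

# Hypothetical head-level components in a tiny transformer.
full_graph: Set[Edge] = {
    ("embed", "attn.0.head.5"),
    ("embed", "mlp.0"),
    ("attn.0.head.5", "mlp.0"),
    ("mlp.0", "logits"),
    ("attn.0.head.5", "logits"),
}

# The circuit hypothesized to carry the behavior of interest.
circuit: Set[Edge] = {
    ("embed", "attn.0.head.5"),
    ("attn.0.head.5", "logits"),
}
assert circuit <= full_graph  # a circuit is a subset of the full graph
```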

The Mechanistic Interpretability Workflow

Mechanistic interpretability breaks down AI behaviors into identifiable algorithms within a model's architecture. The process unfolds in steps: researchers first observe a behavior and create a dataset that elicits it; they then define the scope of interpretation by choosing the granularity at which the network will be analyzed (for example, attention heads and MLP layers), representing it as a computational graph; finally, they apply activation patching iteratively to prune the model's components until a satisfactory explanation of the behavior remains.
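A single activation-patching experiment can be sketched in PyTorch as below. This is illustrative rather than the paper's implementation, and it assumes the component under test is exposed as a submodule, that `clean_batch` elicits the behavior while `corrupted_batch` does not, and that both batches have matching shapes.

```python
import torch

@torch.no_grad()
def run_patched(model, submodule, clean_batch, corrupted_batch):
    # 1. Cache the component's activation on the corrupted input.
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output
    handle = submodule.register_forward_hook(save_hook)
    model(corrupted_batch)
    handle.remove()

    # 2. Re-run on the clean input, splicing in the corrupted activation.
    def patch_hook(module, inputs, output):
        return cache["act"]  # returning a value replaces the output
    handle = submodule.register_forward_hook(patch_hook)
    patched_out = model(clean_batch)
    handle.remove()
    return patched_out  # compare with the unpatched clean run via a task metric
```

If patching a component's activation barely changes the task metric, that component is likely unnecessary for the behavior; a large change marks it as part of the circuit.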

Automating Circuit Discovery

ACDC automates this crucial activation-patching step. Starting from the full computational graph, it iteratively removes connections whose ablation barely changes the model's outputs, retaining only those essential for the designated task. The quantitative metrics introduced to evaluate how well an algorithm recovers a circuit reinforce ACDC's legitimacy and potential utility in interpretability research. Since transformer models have been particularly challenging to dissect due to their 'black-box' nature, ACDC's ability to rediscover known circuits with high accuracy demonstrates its effectiveness in this area.
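Schematically, the pruning loop can be sketched as follows. This is a sketch under stated assumptions, not the released implementation: `run_with_circuit` is an assumed helper that runs the model with only the given edges carrying clean activations (all other edges receive corrupted activations) and returns a divergence from the full model's outputs, such as a KL divergence; `tau` controls how aggressively edges are pruned.

```python
def acdc(edges, run_with_circuit, tau):
    """Prune a computational graph down to a circuit.

    edges: the full graph's edges, ordered from outputs back toward inputs.
    run_with_circuit: assumed helper; returns the divergence between the
        pruned model's outputs and the full model's outputs (near 0 when
        no edges are patched out).
    tau: tolerance for how much each removed edge may raise the divergence.
    """
    circuit = list(edges)                  # start from the full graph
    current = run_with_circuit(circuit)    # baseline divergence
    for edge in list(circuit):             # visit edges output-to-input
        candidate = [e for e in circuit if e != edge]
        score = run_with_circuit(candidate)
        if score - current < tau:          # edge barely matters: prune it
            circuit, current = candidate, score
    return circuit
```

Smaller values of `tau` keep more edges and recover larger, more faithful circuits; larger values yield sparser but potentially incomplete circuits.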

Conclusion

The advent of ACDC marks significant progress in AI interpretability, demonstrating that part of what has conventionally been manual, labor-intensive work can be automated. While the method has limitations, its open-source availability sets an encouraging stage for future improvements and wider research contributions. The ultimate goal is to scale such methods to larger models, moving closer to a comprehensive understanding of the algorithms that underpin AI behaviors.
