Towards Automated Circuit Discovery for Mechanistic Interpretability

Abstract

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process they followed. First, researchers choose a metric and dataset that elicit the desired model behavior. Then, they apply activation patching to find which abstract neural network units are involved in the behavior. By varying the dataset, metric, and units under investigation, researchers can understand the functionality of each component. We automate one of the process's steps: identifying the circuit that implements the specified behavior in the model's computational graph. We propose several algorithms and reproduce previous interpretability results to validate them. For example, the ACDC algorithm rediscovered 5/5 of the component types in a circuit in GPT-2 Small that computes the Greater-Than operation. ACDC selected 68 of the 32,000 edges in GPT-2 Small, all of which were manually found by previous work. Our code is available at https://github.com/ArthurConmy/Automatic-Circuit-Discovery.

Overview

  • The paper discusses the development of the Automatic Circuit Discovery (ACDC) algorithm designed to automate the discovery of circuits in neural networks for mechanistic interpretability.

  • Mechanistic interpretability breaks down AI behaviors into algorithms and involves a multi-step process including observation, dataset creation, and activation patching.

  • Representing the network as a computational graph at a chosen granularity helps researchers define the scope of interpretation and analyze the network's functionality.

  • ACDC automates iterative activation patching, using a quantitative metric to prune connections while retaining those essential to the behavior under study.

  • While it has limitations, ACDC's open-source availability may accelerate future interpretability research and understanding of complex AI models.

Introduction

Recent advances in AI have presented an exciting yet challenging frontier: understanding how complex neural networks like transformers achieve particular behaviors. The effort to build this understanding is called mechanistic interpretability. Central to this endeavor is the concept of circuits: subgraphs of a model's computational graph that implement specific behaviors. Unraveling these circuits has typically been a manual process that is not only time-consuming but also scales poorly with model complexity. The authors introduce Automatic Circuit Discovery (ACDC), an algorithm that automates the discovery of circuits in neural network models.
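To make the graph-and-circuit framing concrete, here is a minimal Python sketch (ours, not the paper's code): the computational graph is a set of directed edges over named components, and a circuit is a subset of those edges. The component names are hypothetical.

```python
# Minimal sketch: a computational graph as a set of directed edges over
# named components; a circuit is a subgraph hypothesized to implement
# one specific behavior.
from typing import Set, Tuple

Edge = Tuple[str, str]  # (upstream component, downstream component)

# Hypothetical head-level components in a tiny transformer.
full_graph: Set[Edge] = {
    ("embed", "attn.0.head.5"),
    ("embed", "mlp.0"),
    ("attn.0.head.5", "mlp.0"),
    ("mlp.0", "logits"),
    ("attn.0.head.5", "logits"),
}

# The circuit hypothesized to carry the behavior of interest.
circuit: Set[Edge] = {
    ("embed", "attn.0.head.5"),
    ("attn.0.head.5", "logits"),
}
assert circuit <= full_graph  # a circuit is a subset of the full graph
```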

The Mechanistic Interpretability Workflow

Mechanistic interpretability breaks down AI behaviors into identifiable algorithms within a model's architecture. The process unfolds in steps: researchers first observe a behavior and create a dataset that elicits it; they then define the scope of interpretation by choosing the granularity at which the network will be analyzed (for example, attention heads and MLP layers), representing it as a computational graph; finally, they apply activation patching iteratively to prune the model's components until a satisfactory explanation of the behavior remains.
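A single activation-patching experiment can be sketched in PyTorch as below. This is illustrative rather than the paper's implementation, and it assumes the component under test is exposed as a submodule, that `clean_batch` elicits the behavior while `corrupted_batch` does not, and that both batches have matching shapes.

```python
import torch

@torch.no_grad()
def run_patched(model, submodule, clean_batch, corrupted_batch):
    # 1. Cache the component's activation on the corrupted input.
    cache = {}
    def save_hook(module, inputs, output):
        cache["act"] = output
    handle = submodule.register_forward_hook(save_hook)
    model(corrupted_batch)
    handle.remove()

    # 2. Re-run on the clean input, splicing in the corrupted activation.
    def patch_hook(module, inputs, output):
        return cache["act"]  # returning a value replaces the output
    handle = submodule.register_forward_hook(patch_hook)
    patched_out = model(clean_batch)
    handle.remove()
    return patched_out  # compare with the unpatched clean run via a task metric
```

If patching a component's activation barely changes the task metric, that component is likely unnecessary for the behavior; a large change marks it as part of the circuit.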

Automating Circuit Discovery

ACDC automates this crucial activation-patching step. Starting from the full computational graph, it iteratively removes connections whose ablation barely changes the model's outputs, retaining only those essential for the designated task. The quantitative metrics introduced to evaluate how well an algorithm recovers a circuit reinforce ACDC's legitimacy and potential utility in interpretability research. Since transformer models have been particularly challenging to dissect due to their 'black-box' nature, ACDC's ability to rediscover known circuits with high accuracy demonstrates its effectiveness in this area.
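Schematically, the pruning loop can be sketched as follows. This is a sketch under stated assumptions, not the released implementation: `run_with_circuit` is an assumed helper that runs the model with only the given edges carrying clean activations (all other edges receive corrupted activations) and returns a divergence from the full model's outputs, such as a KL divergence; `tau` controls how aggressively edges are pruned.

```python
def acdc(edges, run_with_circuit, tau):
    """Prune a computational graph down to a circuit.

    edges: the full graph's edges, ordered from outputs back toward inputs.
    run_with_circuit: assumed helper; returns the divergence between the
        pruned model's outputs and the full model's outputs (near 0 when
        no edges are patched out).
    tau: tolerance for how much each removed edge may raise the divergence.
    """
    circuit = list(edges)                  # start from the full graph
    current = run_with_circuit(circuit)    # baseline divergence
    for edge in list(circuit):             # visit edges output-to-input
        candidate = [e for e in circuit if e != edge]
        score = run_with_circuit(candidate)
        if score - current < tau:          # edge barely matters: prune it
            circuit, current = candidate, score
    return circuit
```

Smaller values of `tau` keep more edges and recover larger, more faithful circuits; larger values yield sparser but potentially incomplete circuits.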

Conclusion

The advent of ACDC marks significant progress in AI interpretability, demonstrating that part of what has conventionally been manual, labor-intensive work can be automated. While the method has limitations, its open-source availability sets an encouraging stage for future improvements and wider research contributions. The ultimate goal is to scale such methods to larger models, moving closer to a comprehensive understanding of the algorithms that underpin AI behaviors.
