Transcoders Find Interpretable LLM Feature Circuits

(2406.11944)
Published Jun 17, 2024 in cs.LG and cs.CL

Abstract

A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of models corresponding to specific behaviors or capabilities. However, MLP sublayers make fine-grained circuit analysis on transformer-based language models difficult. In particular, interpretable features -- such as those found by sparse autoencoders (SAEs) -- are typically linear combinations of extremely many neurons, each with its own nonlinearity to account for. Circuit analysis in this setting thus either yields intractably large circuits or fails to disentangle local and global behavior. To address this we explore transcoders, which seek to faithfully approximate a densely activating MLP layer with a wider, sparsely-activating MLP layer. We successfully train transcoders on language models with 120M, 410M, and 1.4B parameters, and find them to perform at least on par with SAEs in terms of sparsity, faithfulness, and human-interpretability. We then introduce a novel method for using transcoders to perform weights-based circuit analysis through MLP sublayers. The resulting circuits neatly factorize into input-dependent and input-invariant terms. Finally, we apply transcoders to reverse-engineer unknown circuits in the model, and we obtain novel insights regarding the greater-than circuit in GPT2-small. Our results suggest that transcoders can prove effective in decomposing model computations involving MLPs into interpretable circuits. Code is available at https://github.com/jacobdunefsky/transcoder_circuits.

Figure: Three highest activation variance features of the MLP10 transcoder for the "greater-than" operation.

Overview

  • The paper introduces transcoders, a novel tool designed to approximate densely activating MLP layers in transformers with wider, sparsely-activating MLP layers, facilitating fine-grained circuit analysis and identifying interpretable feature circuits.

  • Extensive evaluations demonstrate that transcoders perform on par with or better than sparse autoencoders (SAEs) in terms of interpretability, sparsity, and faithfulness, confirmed through empirical studies involving various models and tasks, including GPT2-small.

  • Transcoders mark a significant advancement in mechanistic interpretability of LLMs, offering theoretical and practical implications, such as enabling more tractable circuit analysis and providing insights into model behaviors at a finer resolution.

Interpretable Feature Circuits Discovered by Transcoders in LLMs

This essay discusses the core contributions and methodologies presented in the paper, "Transcoders Find Interpretable LLM Feature Circuits" by Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. The paper aims to advance mechanistic interpretability in LLMs by introducing and exploring transcoders. Transcoders are designed to approximate the densely activating MLP layers in transformers with wider, sparsely-activating MLP layers. This approximation facilitates fine-grained circuit analysis, enabling the identification of interpretable feature circuits responsible for model behaviors.

Key Contributions

The primary contributions of the paper are manifold:

  1. Introduction of Transcoders: The authors present transcoders as a novel tool to approximate MLP layers in transformer models. Transcoders are trained with an L1 regularization penalty to encourage sparsity, which aids interpretability while largely preserving fidelity to the original model's computations.
  2. Comparison with Sparse Autoencoders (SAEs): Extensive evaluations demonstrate that transcoders perform on par with or better than SAEs regarding interpretability, sparsity, and faithfulness.
  3. Circuit-Finding Methodology: A novel method for using transcoders in circuit analysis is introduced, leveraging the disentangling property of transcoders to cleanly factorize circuits into input-dependent and input-invariant terms.
  4. Empirical Evaluations: The paper provides empirical evidence by applying transcoders to various tasks and models, such as reverse-engineering the "greater-than circuit" in GPT2-small, and detailed case studies that showcase the practical utility of transcoders in mechanistic interpretability.

Methodology

Transcoder Training and Architecture

Transcoders replace dense MLP sublayers with wider, sparsely activating alternatives. Training minimizes a loss that balances faithfulness (matching the original MLP's output) against sparsity (an L1 penalty on the feature activations). Architecturally, a transcoder is a single-hidden-layer MLP with an encoder-decoder structure:

\[ z_{\mathrm{TC}}(x) = \mathrm{ReLU}(W_{\mathrm{enc}} x + b_{\mathrm{enc}}) \]
\[ \mathrm{TC}(x) = W_{\mathrm{dec}}\, z_{\mathrm{TC}}(x) + b_{\mathrm{dec}} \]
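The following is a minimal PyTorch sketch of this architecture and training objective, offered for illustration rather than as the authors' released implementation (their code is in the linked repository); the dimension arguments and the sparsity coefficient are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Transcoder(nn.Module):
    """Sparse, widened approximation of an MLP sublayer: maps the MLP's
    input activations to a prediction of the MLP's output activations."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # Encoder: d_model -> d_features (d_features >> d_model for an overcomplete basis)
        self.W_enc = nn.Linear(d_model, d_features)
        # Decoder: d_features -> d_model
        self.W_dec = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # z_TC(x) = ReLU(W_enc x + b_enc): sparse feature activations
        return F.relu(self.W_enc(x))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # TC(x) = W_dec z_TC(x) + b_dec: approximation of the MLP output
        return self.W_dec(self.encode(x))


def transcoder_loss(tc: Transcoder,
                    mlp_in: torch.Tensor,
                    mlp_out: torch.Tensor,
                    sparsity_coef: float = 1e-3) -> torch.Tensor:
    """Faithfulness (squared error to the true MLP output) plus an L1 penalty
    on the feature activations to encourage sparsity."""
    z = tc.encode(mlp_in)
    recon = tc.W_dec(z)
    faithfulness = (recon - mlp_out).pow(2).sum(dim=-1).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return faithfulness + sparsity_coef * sparsity
```

In practice the transcoder is trained on (MLP input, MLP output) activation pairs collected from the underlying language model, with the hidden width chosen much larger than the model dimension so that features can be both overcomplete and sparse.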

Comparison to SAEs

Transcoders are evaluated against SAEs trained on MLP outputs across multiple language models, including GPT2-small, Pythia-410M, and Pythia-1.4B. The metrics used for evaluation include interpretability (human-judged), sparsity (mean L0 norm of activations), and faithfulness (cross-entropy loss difference when transcoders replace MLPs).
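A minimal sketch of how the sparsity and faithfulness metrics can be computed for one layer, assuming the Transcoder class from the sketch above and the Hugging Face transformers GPT-2 implementation; the layer index, feature width, and prompt are illustrative, and this is not the authors' evaluation pipeline:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
layer = 8                       # which MLP sublayer to splice (illustrative)
tc = Transcoder(768, 24576)     # untrained here, for shapes only; use trained weights in practice


@torch.no_grad()
def ce_loss(tokens: torch.Tensor) -> torch.Tensor:
    # Baseline next-token cross-entropy loss of the unmodified model.
    return model(tokens, labels=tokens).loss


@torch.no_grad()
def ce_loss_with_transcoder(tokens: torch.Tensor) -> torch.Tensor:
    # Splice the transcoder in: replace the MLP's output with TC(MLP input).
    def hook(module, inputs, output):
        return tc(inputs[0])
    handle = model.transformer.h[layer].mlp.register_forward_hook(hook)
    try:
        loss = model(tokens, labels=tokens).loss
    finally:
        handle.remove()
    return loss


@torch.no_grad()
def mean_l0(tokens: torch.Tensor) -> torch.Tensor:
    # Mean number of nonzero transcoder features per token (sparsity).
    acts = {}
    def hook(module, inputs, output):
        acts["z"] = tc.encode(inputs[0])
    handle = model.transformer.h[layer].mlp.register_forward_hook(hook)
    try:
        model(tokens)
    finally:
        handle.remove()
    return (acts["z"] > 0).float().sum(dim=-1).mean()


tokens = tokenizer("The war lasted from 1741 to 1745.", return_tensors="pt").input_ids
print(ce_loss(tokens).item(), ce_loss_with_transcoder(tokens).item(), mean_l0(tokens).item())
```

Faithfulness is then reported as the difference between the spliced and baseline cross-entropy losses, averaged over an evaluation corpus.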

Circuit Analysis with Transcoders

The paper introduces a method to perform circuit analysis using transcoders:

  • Attribution Calculation: The attribution of an earlier-layer feature to a later-layer feature is computed as the product of the earlier feature's activation (input-dependent) and the dot product between its decoder vector and the later feature's encoder vector (input-invariant); see the sketch after this list.
  • Computational Subgraphs: Important computational paths are identified by analyzing attributions iteratively.
  • De-Embeddings: De-embedding vectors are used to determine the direct effect of input tokens on transcoder features, providing input-invariant insights into model behavior.
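A rough sketch of the attribution and de-embedding computations, reusing the Transcoder class from earlier; the function names are illustrative, and for clarity this omits paths through attention heads and the layernorm scaling a full treatment would include:

```python
import torch


def feature_attribution(tc_early, tc_late,
                        z_early: torch.Tensor,
                        i: int, j: int) -> torch.Tensor:
    """Attribution of earlier-layer feature i to later-layer feature j:
    (input-dependent activation) x (input-invariant connection strength)."""
    dec_i = tc_early.W_dec.weight[:, i]   # decoder vector of feature i, shape (d_model,)
    enc_j = tc_late.W_enc.weight[j]       # encoder vector of feature j, shape (d_model,)
    connection = dec_i @ enc_j            # input-invariant term
    return z_early[..., i] * connection   # scaled by feature i's activation on this input


def de_embedding_scores(tc, W_E: torch.Tensor, j: int, k: int = 10):
    """Input-invariant 'de-embedding': dot each token embedding with feature j's
    encoder vector to see which input tokens most directly excite the feature."""
    enc_j = tc.W_enc.weight[j]            # (d_model,)
    scores = W_E @ enc_j                  # (vocab_size,)
    return scores.topk(k)
```

Here `W_E` would be the model's token embedding matrix (e.g. `model.transformer.wte.weight` for GPT-2 in transformers), and the returned top-k indices can be decoded with the tokenizer to read off which input tokens most strongly excite the feature.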

Empirical Results

Blind Case Studies

Several blind case studies are conducted, in which the authors infer the semantics of hidden features purely through circuit analysis. One notable study involved reverse-engineering a feature in a GPT2-small transcoder and correctly identifying it as activating on semicolons within parenthetical citation patterns.

Greater-Than Circuit in GPT2-small

The authors revisit the "greater-than circuit" previously analyzed by Hanna et al. (2023). Using transcoders, they not only corroborate earlier findings but also identify the relevant MLP10 features and show how these features contribute to the model's behavior when predicting end years greater than a prompt's start year. They demonstrate that transcoders provide a sparser and more interpretable computational subgraph than direct neuron-level analysis.
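As a hedged illustration, one way to surface candidate MLP10 features is to rank transcoder features by the variance of their activation at the final token across greater-than prompts (in the spirit of the figure above); the prompt template, the layer-10 transcoder `tc10`, and the selection criterion here are assumptions rather than the paper's exact procedure, and `model` and `tokenizer` are assumed loaded as in the earlier sketch:

```python
import torch

# Prompt family from the greater-than task: the model should favor end years
# greater than the start year when completing the final " 17".
prompts = [f"The war lasted from the year 17{yy:02d} to the year 17"
           for yy in range(2, 99)]


@torch.no_grad()
def last_token_features(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, return_tensors="pt").input_ids
    acts = {}
    def hook(module, inputs, output):
        acts["z"] = tc10.encode(inputs[0])   # layer-10 transcoder activations
    handle = model.transformer.h[10].mlp.register_forward_hook(hook)
    try:
        model(tokens)
    finally:
        handle.remove()
    return acts["z"][0, -1]                  # feature activations at the final token


Z = torch.stack([last_token_features(p) for p in prompts])   # (n_prompts, d_features)
top_var_features = Z.var(dim=0).topk(3).indices              # candidate MLP10 features
```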

Implications and Speculations on Future Developments

Practical Implications

The introduction of transcoders has significant practical implications for debugging and understanding LLM behaviors. By providing a clear and sparse approximation of MLP sublayers, transcoders make fine-grained circuit analysis more tractable. This can lead to better model interpretability, facilitate identification of emergent behaviors, and potentially guide the development of more reliable and controllable AI systems.

Theoretical Implications

The ability of transcoders to disentangle input-dependent and input-invariant components of model behavior offers a profound theoretical tool for understanding neural networks. This factorization might enable the formulation of new hypotheses about how higher-level cognitive tasks are represented within transformer models.

Future Directions

Future research could explore extensions of transcoders to other neural architectures beyond transformers or employ transcoders in understanding attention mechanisms. Additionally, enhancing the scalability of transcoders to larger models and datasets will be crucial for generalizing their applicability.

Conclusion

Transcoders mark a significant step forward in mechanistic interpretability of LLMs, providing a bridge between the dense computations of MLP layers and sparse, human-interpretable circuits. The paper's rigorous methodology and comprehensive evaluations offer compelling evidence of the utility of transcoders in fine-grained circuit analysis, providing both practical tools and theoretical insights into deep model behaviors.
