- The paper demonstrates that circuit analysis techniques can scale to large models by identifying key output nodes critical for multiple-choice reasoning.
- The study reveals that the query and key subspaces of the relevant attention heads can be compressed to a low-rank representation, pinpointing the components that encode the structure of the answer labels.
- The analysis acknowledges limitations: when the answer labels are randomized, the identified features explain the heads' behaviour only partially, exposing gaps in current interpretability methods.
Analysis of Circuit Interpretability in the Chinchilla Model
The paper "Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla" explores the challenges and successes in extending circuit analysis techniques to large-scale LLMs. This paper focuses on the Chinchilla model, a 70-billion-parameter entity, overshadowing the smaller models typically scrutinized in prior research.
Objectives and Methodology
The central aim of the paper is to test the scalability of circuit analysis, focusing on the multiple-choice question-answering capabilities of the Chinchilla model. The inquiry applies methodologies that have proven useful on smaller models, namely logit attribution, attention pattern visualization, and activation patching, to determine whether they remain effective at this scale.
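To make one of these techniques concrete, the following is a minimal sketch of activation patching using PyTorch forward hooks: cache a component's activation on a "clean" prompt, splice it into a run on a "corrupted" prompt, and measure the change in the correct answer's logit. This is not the paper's actual Chinchilla tooling; the model interface, module names, and the final-token logit metric are illustrative assumptions.

```python
# Minimal activation-patching sketch (illustrative assumptions: the model is a
# standard PyTorch module returning [batch, seq, vocab] logits, and components
# are addressed by their named_modules() names).
import torch

def run_with_cache(model, tokens, layer_names):
    """Run the model once and record the output of each named submodule."""
    cache, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def hook(mod, inputs, output, name=name):
            cache[name] = output.detach()
        handles.append(modules[name].register_forward_hook(hook))
    with torch.no_grad():
        logits = model(tokens)
    for h in handles:
        h.remove()
    return logits, cache

def patch_and_measure(model, corrupted_tokens, clean_cache, layer_name, answer_id):
    """Overwrite one component's activation on the corrupted prompt with the
    clean activation, then return the correct answer's final-token logit."""
    module = dict(model.named_modules())[layer_name]
    def hook(mod, inputs, output):
        return clean_cache[layer_name]      # returning a value replaces the output
    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        logits = model(corrupted_tokens)
    handle.remove()
    return logits[0, -1, answer_id].item()
```

Components whose patched-in activation restores the correct answer's logit are candidates for the circuit under study.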
The authors dissect the multiple-choice answering process to understand how the Chinchilla model selects the correct answer among options labeled A, B, C, and D, given its knowledge of the correct answer text. The analysis combines identifying the circuit components that directly determine the output, referred to as 'output nodes', with interpreting the semantics of the features computed by the relevant attention heads.
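As a hedged illustration of how output nodes can be ranked, the sketch below applies direct logit attribution: each component's contribution to the residual stream at the final token is projected onto the unembedding direction that separates the correct label from the incorrect ones. The per-component contribution dictionary, the unembedding matrix layout, and the logit-difference metric are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of direct logit attribution for ranking candidate output nodes.
# Assumptions: `component_outputs` maps a node name (attention head or MLP) to
# its [d_model] residual-stream contribution at the final token, and
# `unembed_matrix` has shape [d_model, vocab].
import torch

def rank_output_nodes(component_outputs, unembed_matrix, correct_id, wrong_ids):
    """Score each node by how much it pushes the correct label's logit above
    the mean logit of the incorrect labels (a logit-difference metric)."""
    correct_dir = unembed_matrix[:, correct_id]              # [d_model]
    wrong_dir = unembed_matrix[:, wrong_ids].mean(dim=-1)    # [d_model]
    direction = correct_dir - wrong_dir
    scores = {
        name: torch.dot(contrib, direction).item()
        for name, contrib in component_outputs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```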
Results
- Successful Scaling of Techniques: The analysis demonstrates that existing circuit analysis techniques do scale to the Chinchilla model. A small set of output nodes, consisting of attention heads and MLPs, is identified, validating the application of these techniques at this scale.
- Attention Heads Analysis: The paper examines the 'correct letter' attention heads in detail. These heads have interpretable roles: they attend to the label of the correct answer, and their queries and keys appear to encode a feature akin to 'the n-th item in an enumeration'. Using singular value decomposition, the authors find that the query and key subspaces can be compressed to low rank without compromising the model's performance, underscoring an efficient feature representation (see the sketch after this list).
- Limitations in Generalization: While the analysis provides insight, the authors acknowledge that these findings offer only a partial explanation. When the answer labels are replaced with randomized letters, the current understanding of the heads' operation proves incomplete, suggesting further complexity in how these large models generalize.
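As referenced in the attention-heads bullet above, the following sketch shows the kind of low-rank compression the SVD analysis suggests: the effective query-key interaction of a head is factored and truncated to a handful of directions. The weight shapes and the chosen rank are assumptions; the paper's exact compression procedure may differ.

```python
# Sketch of compressing a head's query/key interaction to a low-rank subspace
# via SVD. Assumptions: W_Q and W_K have shape [d_model, d_head], and rank=3
# is an illustrative choice, not the paper's reported value for every head.
import torch

def low_rank_qk(W_Q, W_K, rank=3):
    """Factor the effective QK bilinear form W_Q @ W_K.T and keep only the top
    singular directions, returning the rank-limited approximation."""
    qk = W_Q @ W_K.T                                    # [d_model, d_model]
    U, S, Vh = torch.linalg.svd(qk)
    approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    # Fraction of the matrix's squared singular values captured by the kept rank.
    explained = (S[:rank] ** 2).sum() / (S ** 2).sum()
    return approx, explained.item()
```

Substituting such a low-rank factor back into a head and checking that task accuracy is preserved is the kind of test that supports the claim that only a few query/key directions carry the relevant feature.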
Implications and Future Directions
The implications of this research are significant for mechanistic interpretability. Understanding circuits within LLMs not only aids in comprehending their reasoning processes but also supports techniques for mitigating risks such as deceptive alignment. At the same time, the methodologies, though successful at this scale, expose the complexities and challenges inherent in interpreting frontier models.
Future research could explore automating the identification of relevant nodes within models and further disentangling the semantic meanings within heads and MLPs. Moreover, expanding this analysis across different model architectures might uncover more insights into how various models implement similar linguistic tasks.
The success of scaling circuit analysis to the Chinchilla model indicates a positive trajectory for interpretability in LLMs, serving as a foundation for subsequent explorations and refinements within this dynamic field.