Emergent Mind

Abstract

Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in an LM's circuit by performing causal interventions on each edge independently, but this scales poorly with model size. Edge attribution patching (EAP), a gradient-based approximation to interventions, has emerged as a scalable but imperfect solution to this problem. In this paper, we introduce a new method, EAP with integrated gradients (EAP-IG), that aims to better maintain a core property of circuits: faithfulness. A circuit is faithful if all model edges outside the circuit can be ablated without changing the model's performance on the task; faithfulness is what justifies studying circuits, rather than the full model. Our experiments demonstrate that circuits found using EAP are less faithful than those found using EAP-IG, even though both have high node overlap with circuits found previously using causal interventions. We conclude more generally that when using circuits to compare the mechanisms models use to solve tasks, faithfulness, not overlap, is what should be measured.

EAP-IG circuits match or outperform EAP circuits in faithfulness, with values close to 1.0 being ideal.

Overview

  • The paper introduces a novel methodology, Edge Attribution Patching with Integrated Gradients (EAP-IG), to enhance the faithfulness of circuits in transformer language models.

  • A circuit is faithful if all model edges outside it can be ablated without changing the model's performance on the task.

  • The study demonstrates that circuits derived from EAP-IG are more faithful than those obtained through conventional Edge Attribution Patching (EAP), particularly in tasks like Greater-Than and Country-Capital.

  • The paper argues against using component overlap alone to judge circuits or to compare model mechanisms, and recommends measuring faithfulness directly.

Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms

Introduction to Circuits in Language Models

The circuits framework has gained traction as a way to explain how transformer language models (LMs) work. It seeks the minimal computational subgraph, termed a circuit, that explains the model's behavior on a specific task. Earlier work identified a circuit's components and connections through causal interventions. These methods scale poorly with model size, because each edge must be tested individually and a model may contain thousands of edges.

Edge Attribution Patching (EAP) addresses this scalability problem by using a gradient-based approximation to estimate the effect of intervening on each edge. The paper argues, however, that high component overlap between EAP-derived circuits and circuits found previously with causal interventions is not enough to establish that the EAP circuits are correct or useful. What matters is faithfulness: how well the circuit alone reproduces the model's performance on the task. On this basis, the paper proposes edge attribution patching with integrated gradients (EAP-IG), which aims to produce more faithful circuits.
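For reference, the gradient-based approximation behind EAP is usually written as a first-order estimate of how the loss would change if an edge carried its corrupted-input activation instead of its clean one. The notation below is an assumption, since this summary does not define symbols: z_u and z'_u denote the output of the upstream node u on the clean and corrupted inputs, and the gradient is taken with respect to the input of the downstream node v on the clean input s.

```latex
\[
\mathcal{L}(\text{edge } (u,v) \text{ patched}) - \mathcal{L}(\text{clean})
\;\approx\;
(z'_u - z_u)^{\top} \, \nabla_{v} \mathcal{L}(s)
\]
```

Because the activations and gradients for every node can be collected in a small, fixed number of forward and backward passes, every edge can be scored at once, whereas direct intervention needs at least one forward pass per edge.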

The Essence of Circuit Faithfulness

Faithfulness is a stricter criterion than component overlap. A circuit is faithful if ablating every model edge outside it does not change the model's performance on the task. This property is what justifies studying the circuit in place of the full model. The study finds that circuits found with EAP are less faithful than those found with EAP-IG, even though both overlap heavily with previously identified circuits. Sharing components with a known circuit therefore does not guarantee that a circuit is faithful.
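A minimal sketch of how such a score might be computed, assuming a normalization in which 1.0 means the circuit fully recovers the model's performance and 0.0 means it does no better than a corrupted-input baseline. The function name, its parameters, and the example scores are hypothetical; the summary does not specify the exact formula used in the paper.

```python
def normalized_faithfulness(circuit_score, model_score, corrupted_score):
    """Fraction of the full model's task performance that the circuit recovers.

    circuit_score:   task metric with all edges outside the circuit ablated
    model_score:     task metric of the unmodified model
    corrupted_score: task metric on corrupted inputs (assumed baseline)
    """
    gap = model_score - corrupted_score
    if gap == 0:
        raise ValueError("model and corrupted baseline scores are identical")
    return (circuit_score - corrupted_score) / gap


# Hypothetical metric values, for illustration only
score = normalized_faithfulness(circuit_score=3.0, model_score=4.0, corrupted_score=0.0)
print(score)
```

Under this assumed normalization, a perfectly faithful circuit scores 1.0, which matches the convention that values close to 1.0 are ideal.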

Unpacking EAP-IG

EAP-IG builds on the parallel between EAP and gradient-based input attribution methods. EAP relies on a single gradient, which can be zero even when a component contributes substantially, so important edges can be missed. EAP-IG replaces that single gradient with integrated gradients (IG), giving a better approximation of the effect of intervening on an edge. Evaluations across several tasks show that circuits found with EAP-IG are more faithful than those found with EAP.
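The zero-gradient problem can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's implementation: the scalar "metric" f, the activation values, and the function names are all hypothetical, and a real circuit-finding run would use gradients from the model itself. Here f saturates, so its gradient at the clean activation is zero even though replacing the clean activation with the corrupted one changes the output.

```python
def f(z):
    """Toy saturating metric: grows with z, then stays flat at 1."""
    return min(z, 1.0)


def grad_f(z):
    """Derivative of f: 1 before saturation, 0 once saturated."""
    return 1.0 if z < 1.0 else 0.0


def eap_estimate(z_clean, z_corrupt):
    """Single-gradient estimate: activation difference times gradient at the clean point."""
    return (z_corrupt - z_clean) * grad_f(z_clean)


def eap_ig_estimate(z_clean, z_corrupt, steps=100):
    """Integrated-gradients estimate: average the gradient at `steps` points
    on the straight line from the corrupted to the clean activation."""
    total = 0.0
    for k in range(1, steps + 1):
        z = z_corrupt + (k / steps) * (z_clean - z_corrupt)
        total += grad_f(z)
    return (z_corrupt - z_clean) * total / steps


z_clean, z_corrupt = 2.0, 0.0                # clean activation sits in the flat region
true_effect = f(z_corrupt) - f(z_clean)      # actual change caused by patching
single = eap_estimate(z_clean, z_corrupt)    # misses the effect: gradient is zero here
integrated = eap_ig_estimate(z_clean, z_corrupt)  # close to the true effect
print(true_effect, single, integrated)
```

The single-gradient estimate sees a flat function and reports no effect, while averaging gradients along the path recovers most of the true change. The number of interpolation steps trades accuracy against the cost of extra gradient evaluations.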

Experimental Findings and Implications

The paper compares EAP, EAP-IG, and exact activation patching on six tasks, and the results vary by task. On the IOI and Gender Bias tasks, circuits found with activation patching are more faithful than those found with EAP or EAP-IG. On Greater-Than and Country-Capital, EAP-IG closes the faithfulness gap and occasionally surpasses the activation patching circuits. On SVA and Hypernymy, EAP fails in ways that EAP-IG avoids. Overall, the results suggest that EAP-IG yields more faithful circuits than EAP, while leaving room for further improvement.

Relationship Between Overlap and Faithfulness

The paper also examines how circuit overlap relates to faithfulness. Within a single task, high overlap often accompanies high faithfulness, but overlap does not predict faithfulness consistently, particularly across tasks. Overlap alone is therefore an inadequate metric for evaluating circuits.
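To make the notion of overlap concrete, here is a small sketch that scores two circuits by intersection over union of their edge sets. Intersection over union is an assumed choice of overlap measure, and the node names are invented for illustration; the summary does not say which measure the paper uses.

```python
def edge_overlap(circuit_a, circuit_b):
    """Intersection over union of two circuits given as collections of (source, target) edges."""
    a, b = set(circuit_a), set(circuit_b)
    union = a | b
    # Two empty circuits are treated as identical
    return len(a & b) / len(union) if union else 1.0


# Hypothetical circuits that share two of their three edges
reference = [("input", "a0.h1"), ("a0.h1", "m1"), ("m1", "logits")]
candidate = [("input", "a0.h1"), ("a0.h1", "m1"), ("m1", "a2.h3")]
print(edge_overlap(reference, candidate))
```

A score like this says only how many edges two circuits share. Two circuits can share most of their edges and still differ in the one edge that matters for task performance, which is why overlap and faithfulness can diverge.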

Concluding Thoughts

The study's main contribution to language model interpretability is its emphasis on circuit faithfulness. EAP-IG is a step toward circuits that are more faithful and therefore more informative about model behavior. The results also invite further work on metrics and methods for finding and evaluating circuits, since EAP-IG does not match activation patching on every task.
