Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (2403.17806v2)

Published 26 Mar 2024 in cs.LG and cs.CL

Abstract: Many recent language model (LM) interpretability studies have adopted the circuits framework, which aims to find the minimal computational subgraph, or circuit, that explains LM behavior on a given task. Most studies determine which edges belong in an LM's circuit by performing causal interventions on each edge independently, but this scales poorly with model size. Edge attribution patching (EAP), a gradient-based approximation to interventions, has emerged as a scalable but imperfect solution to this problem. In this paper, we introduce a new method - EAP with integrated gradients (EAP-IG) - that aims to better maintain a core property of circuits: faithfulness. A circuit is faithful if all model edges outside the circuit can be ablated without changing the model's performance on the task; faithfulness is what justifies studying circuits, rather than the full model. Our experiments demonstrate that circuits found using EAP are less faithful than those found using EAP-IG, even though both have high node overlap with circuits found previously using causal interventions. We conclude more generally that when using circuits to compare the mechanisms models use to solve tasks, faithfulness, not overlap, is what should be measured.

Summary

The paper demonstrates that combining edge attribution patching with integrated gradients (EAP-IG) yields circuits that better replicate the full model's task performance.
The paper treats faithfulness as the key metric for circuits: a circuit is faithful if ablating all edges outside it does not change the model's task performance.
The paper compares EAP, EAP-IG, and activation patching across tasks, showing that EAP-IG narrows the faithfulness gap to activation patching on several of them.

Introduction to Circuits in LLMs

The circuits framework has gained traction as a way to explain how transformer language models (LMs) work. It seeks the minimal computational subgraph, termed a circuit, that accounts for the model's behavior on a specific task. Prior work has typically used causal interventions to identify the important components and the connections between them. However, these methods scale poorly with model size, because each edge must be assessed individually and a model may contain thousands of edges.

To address this scalability problem, edge attribution patching (EAP) uses a gradient-based approximation to estimate the effect of intervening on each model edge. The paper argues, however, that component overlap between EAP-derived circuits and manually identified circuits is not enough to establish that a circuit is correct or useful. It instead emphasizes circuit faithfulness: how well the circuit alone reproduces the model's performance on the task. On this basis, it proposes a new method, edge attribution patching with integrated gradients (EAP-IG), designed to yield more faithful circuits.
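The gradient-based approximation mentioned above can be illustrated with a first-order Taylor expansion. This is a sketch of the general idea only, not a formula taken from this summary; the symbols are assumptions: $z$ is an activation on the clean input, $z'$ the same activation on a corrupted input, and $L$ the task metric viewed as a function of that activation.

```latex
% Exact effect of patching the corrupted activation z' in place of z:
%   \Delta L = L(z') - L(z)
% First-order approximation, computable for all edges from one backward pass:
\[
  L(z') - L(z) \;\approx\; (z' - z)^{\top} \, \nabla_{z} L(z)
\]
```

The approximation is exact when $L$ is linear in $z$ and degrades as the metric becomes more curved between $z$ and $z'$, which is one reason a gradient-based method can be scalable yet imperfect.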

The Essence of Circuit Faithfulness

Circuit faithfulness goes beyond component overlap: a circuit is faithful if all model edges outside it can be ablated without changing the model's task performance. This criterion is what justifies studying the circuit rather than the full model. The paper finds that circuits found with EAP are less faithful than those found with EAP-IG. The comparison shows that sharing components with a previously identified circuit does not by itself guarantee that a circuit is faithful.
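To make the definition concrete, here is a minimal sketch on a toy "model" whose output is simply the sum of its edge contributions. Everything in it is a hypothetical illustration: the edge names, the `run_model` helper, the choice to ablate an edge by substituting its value from a corrupted run, and the normalized score are assumptions, not the paper's implementation.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

def run_model(clean: Dict[Edge, float], corrupted: Dict[Edge, float],
              keep: Set[Edge]) -> float:
    """Toy model: the output is the sum of edge contributions.
    Edges in `keep` use their clean value; all others are ablated
    by substituting their value from the corrupted run."""
    return sum(clean[e] if e in keep else corrupted[e] for e in clean)

def normalized_faithfulness(clean: Dict[Edge, float],
                            corrupted: Dict[Edge, float],
                            circuit: Set[Edge]) -> float:
    """Hypothetical score: 1.0 means the circuit alone recovers the full
    model's output, 0.0 means it does no better than ablating everything."""
    full = run_model(clean, corrupted, set(clean))    # nothing ablated
    baseline = run_model(clean, corrupted, set())     # everything ablated
    circ = run_model(clean, corrupted, circuit)       # only non-circuit edges ablated
    return (circ - baseline) / (full - baseline)

# Edge ("c", "out") contributes the same on clean and corrupted inputs,
# so it is irrelevant to the task and can be left out of the circuit.
clean = {("a", "out"): 2.0, ("b", "out"): 1.5, ("c", "out"): 0.5}
corrupted = {("a", "out"): 0.5, ("b", "out"): 0.5, ("c", "out"): 0.5}

faithful_circuit = {("a", "out"), ("b", "out")}
unfaithful_circuit = {("a", "out")}

print(normalized_faithfulness(clean, corrupted, faithful_circuit))
print(normalized_faithfulness(clean, corrupted, unfaithful_circuit))
```

In this toy setting the first circuit is fully faithful even though it omits an edge, while the second loses part of the behavior because it drops an edge that matters.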

Unpacking EAP-IG

Drawing on the parallels between EAP and gradient-based input attribution methods, the paper introduces EAP-IG. By incorporating integrated gradients (IG), EAP-IG mitigates problems such as zero gradients, which can hide the contribution of important model components. The result is a better approximation of the effect of intervening on an edge, and therefore more faithful circuits. Evaluations across several tasks show that circuits found with EAP-IG are more faithful than those found with EAP, supporting the use of IG in circuit finding.
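The zero-gradient problem can be seen in a one-dimensional toy case. The sketch below is an assumption-laden illustration, not the paper's code: a single scalar activation stands in for an edge, `tanh` stands in for a saturating task metric, and the interpolation is done directly on that activation. A single-gradient score evaluated at a saturated clean activation is nearly zero, while averaging gradients along the path from the corrupted to the clean value recovers most of the true effect.

```python
import math

def metric(z: float) -> float:
    """Toy saturating metric as a function of one activation (assumption)."""
    return math.tanh(z)

def metric_grad(z: float) -> float:
    """Analytic derivative of tanh."""
    return 1.0 - math.tanh(z) ** 2

def true_effect(z_clean: float, z_corrupt: float) -> float:
    """Exact change in the metric from patching in the corrupted activation."""
    return metric(z_corrupt) - metric(z_clean)

def single_gradient_score(z_clean: float, z_corrupt: float) -> float:
    """EAP-style score: activation difference times one gradient at the clean point."""
    return (z_corrupt - z_clean) * metric_grad(z_clean)

def integrated_gradient_score(z_clean: float, z_corrupt: float, steps: int = 50) -> float:
    """EAP-IG-style score: activation difference times the average gradient
    over `steps` points interpolated from the corrupted to the clean value."""
    avg_grad = sum(
        metric_grad(z_corrupt + (k / steps) * (z_clean - z_corrupt))
        for k in range(1, steps + 1)
    ) / steps
    return (z_corrupt - z_clean) * avg_grad

z_clean, z_corrupt = 5.0, 0.0  # tanh is saturated at the clean activation

print("true effect:       ", true_effect(z_clean, z_corrupt))
print("single gradient:   ", single_gradient_score(z_clean, z_corrupt))
print("integrated (m=50): ", integrated_gradient_score(z_clean, z_corrupt))
```

Here the true effect is close to -1, the single-gradient score is close to 0, and the integrated score lands much nearer the true value. Increasing `steps` tightens the approximation at the cost of more gradient evaluations.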

Experimental Findings and Implications

The comparison across six tasks shows that results vary between EAP, EAP-IG, and activation patching. On the IOI and Gender Bias tasks, circuits found with activation patching are more faithful than those found with EAP or EAP-IG. On Greater-Than and Country-Capital, however, EAP-IG closes the faithfulness gap and occasionally surpasses activation patching circuits. On the SVA and Hypernymy tasks, EAP performs poorly in ways that EAP-IG avoids. Overall, the results favor EAP-IG over EAP for finding faithful circuits, while leaving room for further improvement.

Relationship Between Overlap and Faithfulness

Further analysis of the relationship between circuit overlap and faithfulness gives a mixed picture. Within a given task, high overlap often correlates with high faithfulness, but overlap does not predict faithfulness consistently, particularly across tasks. Overlap is therefore inadequate as the sole metric for evaluating or comparing circuits.
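A small sketch shows why overlap alone can mislead. It uses intersection-over-union (Jaccard) as the overlap measure, which is an assumption made for illustration and may differ from the measure used in the paper; the two circuits are hypothetical.

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[str, str]

def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Intersection over union of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def nodes_of(edges: Set[Edge]) -> Set[str]:
    """All nodes that appear as an endpoint of any edge."""
    return {node for edge in edges for node in edge}

# Two hypothetical circuits over the same four nodes, wired differently.
circuit_a = {("in", "h1"), ("h1", "out"), ("in", "h2"), ("h2", "out")}
circuit_b = {("in", "h1"), ("h1", "h2"), ("h2", "out")}

node_overlap = jaccard(nodes_of(circuit_a), nodes_of(circuit_b))
edge_overlap = jaccard(circuit_a, circuit_b)

print("node overlap:", node_overlap)
print("edge overlap:", edge_overlap)
```

The two circuits share every node yet fewer than half of their edges, so a high node overlap says little about whether they route information the same way, and nothing about whether either is faithful.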

Concluding Thoughts

The paper contributes to LM interpretability research by highlighting the importance of circuit faithfulness. EAP-IG is a promising step toward more faithful, and therefore more informative, circuits. The discussion also invites further work on metrics and methods for uncovering the mechanisms behind LM behavior.

Tweets

https://twitter.com/michaelwhanna/status/1773665533423923631

https://twitter.com/fly51fly/status/1774186926804992060

https://twitter.com/knishimae0531/status/1774311938190958934