
Does Transformer Interpretability Transfer to RNNs?

(2404.05971)
Published Apr 9, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.

Figure: the tuned lens applied to a Mamba model, eliciting latent next-token predictions from intermediate layers.

Overview

  • The paper investigates whether interpretability methods developed for transformer models transfer to RNN architectures, specifically Mamba and RWKV.

  • Experiments test whether Contrastive Activation Addition, the tuned lens, and the elicitation of latent knowledge from 'quirky' fine-tuned models work effectively on RNNs.

  • Results indicate that most interpretability techniques originally designed for transformers remain effective on RNN architectures, and the paper introduces 'state steering' as a novel RNN-specific variant.

  • The paper suggests extending this research to further architectures and emphasizes the practical and theoretical value of improving AI interpretability.


Introduction

The paper by Gonçalo Paulo, Thomas Marshall, and Nora Belrose from EleutherAI explores whether interpretability methods originally designed for transformer models carry over to recurrent neural network (RNN) architectures. They focus on the Mamba and RWKV architectures, which have been shown to match or exceed equal-size transformers on language modeling and downstream evaluations. Through a series of experiments, the paper examines whether techniques such as Contrastive Activation Addition (CAA), the tuned lens, and the elicitation of latent predictions and knowledge can be effectively applied to these RNNs.

Architectures Analyzed

The study concentrates on two main RNN architectures: Mamba and RWKV. Both architectures are designed for efficiency and performance, circumventing the quadratic complexity of the transformer’s self-attention mechanism.

  • Mamba: Incorporates a causal convolution block and a selective state-space model (SSM) for routing information, significantly enhancing model expressivity.
  • RWKV (Receptance Weighted Key Value): Utilizes alternating time-mix and channel-mix modules. RWKV v5 is noted for its "multi-headed", matrix-valued state, an improvement in how information is carried compared to its predecessors.

These architectures, available on the HuggingFace Hub, provide a base for investigating the transferability of interpretability methods initially tailored for transformers.
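
As a concrete starting point, both families of checkpoints can be loaded through the `transformers` library. The sketch below is illustrative rather than a reproduction of the paper's setup: the model IDs are assumed examples of publicly released Mamba and RWKV checkpoints, and some RWKV releases ship custom modeling code that requires `trust_remote_code=True`.

```python
# Minimal loading sketch. The model IDs are assumptions; substitute the exact
# Mamba / RWKV checkpoints to be studied.
from transformers import AutoModelForCausalLM, AutoTokenizer

mamba_id = "state-spaces/mamba-2.8b-hf"   # assumed public Mamba checkpoint
rwkv_id = "RWKV/rwkv-5-world-1b5"         # assumed public RWKV v5 checkpoint

mamba = AutoModelForCausalLM.from_pretrained(mamba_id)
mamba_tok = AutoTokenizer.from_pretrained(mamba_id)

# Some RWKV releases rely on custom modeling code hosted alongside the weights.
rwkv = AutoModelForCausalLM.from_pretrained(rwkv_id, trust_remote_code=True)
rwkv_tok = AutoTokenizer.from_pretrained(rwkv_id, trust_remote_code=True)
```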

Interpretability Techniques

The authors delve into three primary interpretability techniques:

  1. Contrastive Activation Addition (CAA): The authors hypothesize that CAA can be effective for RNNs, and that the models' compressed state may even make them easier to steer (a minimal sketch follows this list).
  2. The Tuned Lens: Viewing each layer as incrementally refining the next-token prediction, the authors test whether this lens, originally developed for transformers, can offer similar insight into how RNNs operate (a translator sketch appears under Findings and Implications below).
  3. 'Quirky' Models: These models, fine-tuned to produce incorrect outputs under specific conditions, serve to probe the extent to which RNNs retain latent knowledge that can still be elicited correctly (a linear-probe sketch appears at the end of this section).
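
To make the CAA recipe concrete, here is a minimal sketch of computing and applying a steering vector with PyTorch forward hooks. It is not the authors' implementation: it assumes the model exposes its residual blocks as a module list (the path `model.backbone.layers` matches the Hugging Face Mamba port and must be adapted for other architectures) and takes the contrast at the last token of paired positive/negative prompts.

```python
# Sketch of Contrastive Activation Addition (CAA) with forward hooks.
# Assumption: residual blocks live at `model.backbone.layers`.
import torch

def get_block(model, layer_idx):
    return model.backbone.layers[layer_idx]  # adapt to the architecture at hand

@torch.no_grad()
def contrastive_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Mean last-token activation difference between positive and negative prompts."""
    captured = []

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured.append(hidden[:, -1, :].detach())

    handle = get_block(model, layer_idx).register_forward_hook(hook)
    try:
        for prompt in pos_prompts + neg_prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
            model(input_ids=ids)
    finally:
        handle.remove()

    acts = torch.cat(captured)          # (n_pos + n_neg, d_model)
    n_pos = len(pos_prompts)
    return acts[:n_pos].mean(0) - acts[n_pos:].mean(0)

def add_steering(model, layer_idx, vector, scale=1.0):
    """Add the scaled steering vector to the block's output on every forward pass."""
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + output[1:]
        return output + scale * vector

    # Keep the returned handle and call .remove() to stop steering.
    return get_block(model, layer_idx).register_forward_hook(hook)
```

A positive scale steers generations toward the behavior exemplified by the positive prompts; a negative scale steers away from it.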

These methods serve the paper's broader aim: to understand whether the inner workings and behaviors of RNNs can be interpreted with the same tools used for transformers.
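
As a rough illustration of the latent-knowledge setup, the sketch below trains a logistic-regression probe on hidden states collected from contexts where a quirky model answers truthfully; the probe can then be evaluated on contexts where the model has been tuned to answer falsely. How the hidden states and labels are collected, and from which layer, are assumptions left to the reader.

```python
# Sketch of a latent-knowledge probe: logistic regression on hidden states.
# `hidden` is assumed to be a (batch, d_model) tensor of activations and
# `labels` the ground-truth 0/1 answers; collecting them is model-specific.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, hidden):               # (batch, d_model) -> (batch,)
        return self.linear(hidden).squeeze(-1)

def train_probe(hidden, labels, epochs=200, lr=1e-3):
    """Fit the probe on 'truthful' contexts; evaluate it later on 'untruthful' ones."""
    probe = LinearProbe(hidden.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(hidden), labels.float()).backward()
        opt.step()
    return probe
```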

Findings and Implications

  • Efficacy Across Models: Most of the interpretability techniques tested proved effective when applied to RNNs, notably in steering model outputs and in eliciting latent predictions and knowledge.
  • State Steering: The study introduces 'state steering', a novel variant of CAA for RNNs that perturbs the models' compressed recurrent state directly for more effective behavior control (sketched at the end of this section).
  • Tuned Lens Perplexity: The tuned lens reveals a systematic decrease in perplexity across layers for both RNN architectures, mirroring findings with transformers; a minimal translator sketch follows this list.
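
The translators behind such a lens can be sketched as follows. This is a simplified reconstruction rather than the authors' code: `final_norm` and `unembed` stand in for the model's final normalization layer and unembedding head (attribute names vary between Mamba, RWKV, and transformers), and each layer gets an identity-initialized affine map trained to match the model's own output distribution. Per-layer perplexity is then read off the lens logits.

```python
# Sketch of a tuned-lens translator: one affine map per layer, trained so the
# layer's hidden state predicts the model's final next-token distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Translator(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        nn.init.eye_(self.proj.weight)    # start as the identity map
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden):
        return self.proj(hidden)

def lens_logits(translator, hidden, final_norm, unembed):
    """Latent next-token logits read off an intermediate hidden state."""
    return unembed(final_norm(translator(hidden)))

def translator_loss(translator, hidden, final_logits, final_norm, unembed):
    """KL divergence between the lens distribution and the model's final distribution."""
    lens = lens_logits(translator, hidden, final_norm, unembed)
    return F.kl_div(F.log_softmax(lens, dim=-1),
                    F.softmax(final_logits, dim=-1),
                    reduction="batchmean")

def lens_perplexity(translator, hidden, targets, final_norm, unembed):
    """Perplexity of the lens predictions against the true next tokens."""
    logits = lens_logits(translator, hidden, final_norm, unembed)
    return torch.exp(F.cross_entropy(logits, targets))
```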

These results not only confirm the possibility of extending transformer interpretability methods to RNNs but also open avenues for further optimization leveraging RNNs' unique structural characteristics.
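
A schematic of the state-steering idea, under explicit assumptions: instead of adding a vector to a layer's activations at every generation step, the contrast is taken over the RNN's compressed recurrent state after processing paired prompts, and the state is shifted once before generation continues. The accessors `run_and_get_state` and `generate_from_state` are hypothetical placeholders, since Mamba and RWKV expose their recurrent caches differently.

```python
# Schematic of state steering. The two callables are hypothetical placeholders:
#   run_and_get_state(prompt)           -> recurrent state tensor after the prompt
#   generate_from_state(prompt, state)  -> continuation generated from that state
# Implementing them depends on the specific Mamba / RWKV cache API.
import torch

@torch.no_grad()
def contrastive_state(run_and_get_state, pos_prompts, neg_prompts):
    """Mean difference between states reached after positive vs. negative prompts."""
    pos = torch.stack([run_and_get_state(p) for p in pos_prompts]).mean(0)
    neg = torch.stack([run_and_get_state(p) for p in neg_prompts]).mean(0)
    return pos - neg

@torch.no_grad()
def steer_and_generate(run_and_get_state, generate_from_state,
                       prompt, steering_state, scale=1.0):
    """Process the prompt, shift the resulting state, then continue generating."""
    state = run_and_get_state(prompt)
    return generate_from_state(prompt, state + scale * steering_state)
```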

Future Directions

The paper notes the potential for deeper exploration of RNN states to improve interpretability and suggests extending this research to other architectures. It also recommends exploring additional interpretability tools, especially mechanistic or circuit-based approaches, to broaden understanding of model behavior and to continue improving the efficacy of AI models in real-world applications.

Conclusion

In conclusion, the work by Paulo, Marshall, and Belrose makes a significant contribution to the field of AI by demonstrating that transformer-based interpretability methods can, to a large extent, be applied to RNN architectures. This research not only enhances our understanding of RNN behavior but also broadens the toolkit available for interpreting and steering the outputs of diverse neural network models. The practical and theoretical implications of this research underscore the importance of continued exploration in the realm of AI interpretability.
