
Does Transformer Interpretability Transfer to RNNs?

(2404.05971)
Published Apr 9, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Recent advances in recurrent neural network architectures, such as Mamba and RWKV, have enabled RNNs to match or exceed the performance of equal-size transformers in terms of language modeling perplexity and downstream evaluations, suggesting that future systems may be built on completely new architectures. In this paper, we examine if selected interpretability methods originally designed for transformer language models will transfer to these up-and-coming recurrent architectures. Specifically, we focus on steering model outputs via contrastive activation addition, on eliciting latent predictions via the tuned lens, and eliciting latent knowledge from models fine-tuned to produce false outputs under certain conditions. Our results show that most of these techniques are effective when applied to RNNs, and we show that it is possible to improve some of them by taking advantage of RNNs' compressed state.

Figure: the tuned lens applied to a Mamba model, eliciting latent next-token predictions from intermediate layers.

Overview

  • The paper investigates whether interpretability methods developed for transformer models transfer to RNN architectures, specifically Mamba and RWKV.

  • Experiments test whether Contrastive Activation Addition, the tuned lens, and the elicitation of latent knowledge from 'quirky' fine-tuned models work effectively on RNNs.

  • Results indicate that most interpretability techniques originally designed for transformers remain effective on RNN architectures, and the paper introduces 'state steering' as a novel RNN-specific variant.

  • The paper suggests extending this research to further architectures and emphasizes the practical and theoretical value of improving AI interpretability.


Introduction

The paper by Gonçalo Paulo, Thomas Marshall, and Nora Belrose from EleutherAI explores whether interpretability methods originally designed for transformer models carry over to recurrent neural network (RNN) architectures. They focus on the Mamba and RWKV architectures, which have been shown to match or exceed equal-size transformers on language modeling and downstream evaluations. Through a series of experiments, the paper examines whether techniques such as Contrastive Activation Addition (CAA), the tuned lens, and the elicitation of latent predictions and knowledge can be effectively applied to these RNNs.

Architectures Analyzed

The study concentrates on two main RNN architectures: Mamba and RWKV. Both architectures are designed for efficiency and performance, circumventing the quadratic complexity of the transformer’s self-attention mechanism.

  • Mamba: Incorporates a causal convolution block and a selective state-space model (SSM) for routing information, significantly enhancing model expressivity.
  • RWKV (Receptance Weighted Key Value): Utilizes alternating time-mix and channel-mix modules. RWKV v5 is noted for its "multi-headed", matrix-valued state, an improvement in how information is carried compared to its predecessors.

These architectures, available on the HuggingFace Hub, provide a base for investigating the transferability of interpretability methods initially tailored for transformers.
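
As a concrete starting point, both families of checkpoints can be loaded through the `transformers` library. The sketch below is illustrative rather than a reproduction of the paper's setup: the model IDs are assumed examples of publicly released Mamba and RWKV checkpoints, and some RWKV releases ship custom modeling code that requires `trust_remote_code=True`.

```python
# Minimal loading sketch. The model IDs are assumptions; substitute the exact
# Mamba / RWKV checkpoints to be studied.
from transformers import AutoModelForCausalLM, AutoTokenizer

mamba_id = "state-spaces/mamba-2.8b-hf"   # assumed public Mamba checkpoint
rwkv_id = "RWKV/rwkv-5-world-1b5"         # assumed public RWKV v5 checkpoint

mamba = AutoModelForCausalLM.from_pretrained(mamba_id)
mamba_tok = AutoTokenizer.from_pretrained(mamba_id)

# Some RWKV releases rely on custom modeling code hosted alongside the weights.
rwkv = AutoModelForCausalLM.from_pretrained(rwkv_id, trust_remote_code=True)
rwkv_tok = AutoTokenizer.from_pretrained(rwkv_id, trust_remote_code=True)
```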

Interpretability Techniques

The authors delve into three primary interpretability techniques:

  1. Contrastive Activation Addition (CAA): The authors hypothesize that CAA can be effective for RNNs, and that the models' compressed state may even make them easier to steer (a minimal sketch follows this list).
  2. The Tuned Lens: Viewing each layer as incrementally refining the next-token prediction, the authors test whether this lens, originally developed for transformers, can offer similar insight into how RNNs operate (a translator sketch appears under Findings and Implications below).
  3. 'Quirky' Models: These models, fine-tuned to produce incorrect outputs under specific conditions, serve to probe the extent to which RNNs retain latent knowledge that can still be elicited correctly (a linear-probe sketch appears at the end of this section).
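
To make the CAA recipe concrete, here is a minimal sketch of computing and applying a steering vector with PyTorch forward hooks. It is not the authors' implementation: it assumes the model exposes its residual blocks as a module list (the path `model.backbone.layers` matches the Hugging Face Mamba port and must be adapted for other architectures) and takes the contrast at the last token of paired positive/negative prompts.

```python
# Sketch of Contrastive Activation Addition (CAA) with forward hooks.
# Assumption: residual blocks live at `model.backbone.layers`.
import torch

def get_block(model, layer_idx):
    return model.backbone.layers[layer_idx]  # adapt to the architecture at hand

@torch.no_grad()
def contrastive_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Mean last-token activation difference between positive and negative prompts."""
    captured = []

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured.append(hidden[:, -1, :].detach())

    handle = get_block(model, layer_idx).register_forward_hook(hook)
    try:
        for prompt in pos_prompts + neg_prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
            model(input_ids=ids)
    finally:
        handle.remove()

    acts = torch.cat(captured)          # (n_pos + n_neg, d_model)
    n_pos = len(pos_prompts)
    return acts[:n_pos].mean(0) - acts[n_pos:].mean(0)

def add_steering(model, layer_idx, vector, scale=1.0):
    """Add the scaled steering vector to the block's output on every forward pass."""
    def hook(_module, _inputs, output):
        if isinstance(output, tuple):
            return (output[0] + scale * vector,) + output[1:]
        return output + scale * vector

    # Keep the returned handle and call .remove() to stop steering.
    return get_block(model, layer_idx).register_forward_hook(hook)
```

A positive scale steers generations toward the behavior exemplified by the positive prompts; a negative scale steers away from it.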

These methods serve the paper's broader aim: to understand whether the inner workings and behaviors of RNNs can be interpreted with the same tools used for transformers.
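
As a rough illustration of the latent-knowledge setup, the sketch below trains a logistic-regression probe on hidden states collected from contexts where a quirky model answers truthfully; the probe can then be evaluated on contexts where the model has been tuned to answer falsely. How the hidden states and labels are collected, and from which layer, are assumptions left to the reader.

```python
# Sketch of a latent-knowledge probe: logistic regression on hidden states.
# `hidden` is assumed to be a (batch, d_model) tensor of activations and
# `labels` the ground-truth 0/1 answers; collecting them is model-specific.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, hidden):               # (batch, d_model) -> (batch,)
        return self.linear(hidden).squeeze(-1)

def train_probe(hidden, labels, epochs=200, lr=1e-3):
    """Fit the probe on 'truthful' contexts; evaluate it later on 'untruthful' ones."""
    probe = LinearProbe(hidden.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(hidden), labels.float()).backward()
        opt.step()
    return probe
```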

Findings and Implications

  • Efficacy Across Models: Most of the interpretability techniques tested proved effective when applied to RNNs, notably in steering model outputs and in eliciting latent predictions and knowledge.
  • State Steering: The study introduces 'state steering', a novel variant of CAA for RNNs that perturbs the models' compressed recurrent state directly for more effective behavior control (sketched at the end of this section).
  • Tuned Lens Perplexity: The tuned lens reveals a systematic decrease in perplexity across layers for both RNN architectures, mirroring findings with transformers; a minimal translator sketch follows this list.
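
The translators behind such a lens can be sketched as follows. This is a simplified reconstruction rather than the authors' code: `final_norm` and `unembed` stand in for the model's final normalization layer and unembedding head (attribute names vary between Mamba, RWKV, and transformers), and each layer gets an identity-initialized affine map trained to match the model's own output distribution. Per-layer perplexity is then read off the lens logits.

```python
# Sketch of a tuned-lens translator: one affine map per layer, trained so the
# layer's hidden state predicts the model's final next-token distribution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Translator(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        nn.init.eye_(self.proj.weight)    # start as the identity map
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden):
        return self.proj(hidden)

def lens_logits(translator, hidden, final_norm, unembed):
    """Latent next-token logits read off an intermediate hidden state."""
    return unembed(final_norm(translator(hidden)))

def translator_loss(translator, hidden, final_logits, final_norm, unembed):
    """KL divergence between the lens distribution and the model's final distribution."""
    lens = lens_logits(translator, hidden, final_norm, unembed)
    return F.kl_div(F.log_softmax(lens, dim=-1),
                    F.softmax(final_logits, dim=-1),
                    reduction="batchmean")

def lens_perplexity(translator, hidden, targets, final_norm, unembed):
    """Perplexity of the lens predictions against the true next tokens."""
    logits = lens_logits(translator, hidden, final_norm, unembed)
    return torch.exp(F.cross_entropy(logits, targets))
```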

These results not only confirm the possibility of extending transformer interpretability methods to RNNs but also open avenues for further optimization leveraging RNNs' unique structural characteristics.
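
A schematic of the state-steering idea, under explicit assumptions: instead of adding a vector to a layer's activations at every generation step, the contrast is taken over the RNN's compressed recurrent state after processing paired prompts, and the state is shifted once before generation continues. The accessors `run_and_get_state` and `generate_from_state` are hypothetical placeholders, since Mamba and RWKV expose their recurrent caches differently.

```python
# Schematic of state steering. The two callables are hypothetical placeholders:
#   run_and_get_state(prompt)           -> recurrent state tensor after the prompt
#   generate_from_state(prompt, state)  -> continuation generated from that state
# Implementing them depends on the specific Mamba / RWKV cache API.
import torch

@torch.no_grad()
def contrastive_state(run_and_get_state, pos_prompts, neg_prompts):
    """Mean difference between states reached after positive vs. negative prompts."""
    pos = torch.stack([run_and_get_state(p) for p in pos_prompts]).mean(0)
    neg = torch.stack([run_and_get_state(p) for p in neg_prompts]).mean(0)
    return pos - neg

@torch.no_grad()
def steer_and_generate(run_and_get_state, generate_from_state,
                       prompt, steering_state, scale=1.0):
    """Process the prompt, shift the resulting state, then continue generating."""
    state = run_and_get_state(prompt)
    return generate_from_state(prompt, state + scale * steering_state)
```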

Future Directions

The paper notes the potential for deeper exploration of RNN states to improve interpretability and suggests extending this research to other architectures. It also recommends exploring additional interpretability tools, especially mechanistic or circuit-based approaches, to broaden understanding of model behavior and to continue improving the efficacy of AI models in real-world applications.

Conclusion

In conclusion, the work by Paulo, Marshall, and Belrose makes a significant contribution to the field of AI by demonstrating that transformer-based interpretability methods can, to a large extent, be applied to RNN architectures. This research not only enhances our understanding of RNN behavior but also broadens the toolkit available for interpreting and steering the outputs of diverse neural network models. The practical and theoretical implications of this research underscore the importance of continued exploration in the realm of AI interpretability.
