
Abstract

Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against \emph{supervised} feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or OpenWebText datasets. We find that these SAEs capture interpretable features for the IOI task, but they are less successful than supervised features in controlling the model. Finally, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and feature over-splitting (where binary features split into many smaller, less interpretable features). We hope that our framework will provide a useful step towards more objective and grounded evaluations of sparse dictionary learning methods.

Figure: Distribution of interpretations for IOI features learned by SAEs trained on the OpenWebText dataset, at an F_1 score threshold.
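The F_1 threshold mentioned in the caption refers to treating each learned feature as a binary detector for some task attribute and scoring how well it tracks that attribute. Here is a minimal sketch of that kind of scoring (the thresholding scheme and names are assumptions, not the paper's exact procedure):

```python
import torch

def feature_f1(feature_acts: torch.Tensor, attribute_labels: torch.Tensor, threshold: float = 0.0) -> float:
    """Score how well one learned feature tracks one binary task attribute.

    Sketch only: treat the feature as "firing" when its activation exceeds
    `threshold`, then compute F_1 against ground-truth labels (1 where the
    attribute is present). The threshold and matching scheme are illustrative.
    """
    predicted = feature_acts > threshold
    actual = attribute_labels.bool()
    tp = (predicted & actual).sum().item()
    fp = (predicted & ~actual).sum().item()
    fn = (~predicted & actual).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```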

Overview

  • The paper evaluates sparse autoencoders (SAEs) as a method for interpreting and controlling LLMs, comparing their performance against supervised feature dictionaries.

  • A framework is proposed for evaluating unsupervised feature dictionaries on specific tasks, with a focus on indirect object identification (IOI) using GPT-2 Small, to benchmark their interpretability and controllability.

  • Key findings indicate that while task-specific SAEs capture interpretable features more effectively than general ones, they still fall short in precision and control compared to supervised methods, highlighting the need for further refinement.

Evaluating Sparse Feature Dictionaries in Language Models

Introduction

When it comes to understanding what goes on inside LLMs like GPT-2, interpretability is a big deal. The core idea here is to disentangle the complex representations these models use into meaningful features. Recently, sparse autoencoders (SAEs) have been suggested as a promising way to achieve this. The paper we're discussing dives deep into validating these methods by comparing them against dictionaries built using supervised features—essentially, features we "know" to be meaningful beforehand.
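To make "sparse autoencoder" concrete, here is a minimal sketch of the standard setup (dimensions, hyperparameters, and training details are illustrative, not the paper's exact configuration): an activation vector is encoded into many non-negative feature coefficients, decoded back through a learned dictionary, and trained with a reconstruction loss plus an L1 sparsity penalty.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct model activations as a sparse,
    non-negative combination of learned dictionary directions.
    d_model=768 matches GPT-2 Small; the dictionary size is illustrative."""

    def __init__(self, d_model: int = 768, d_dict: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)  # activation -> feature coefficients
        self.decoder = nn.Linear(d_dict, d_model)  # decoder columns act as dictionary directions
        self.relu = nn.ReLU()

    def forward(self, acts: torch.Tensor):
        codes = self.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(codes)            # reconstruction of the original activation
        return recon, codes


def sae_loss(recon: torch.Tensor, acts: torch.Tensor, codes: torch.Tensor, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature activations to zero.
    return ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
```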

What’s the Problem?

We don't have a clear "ground truth" for what these model features should be, making it a tough job to evaluate new methods. This paper's approach is to create a framework for evaluating these feature dictionaries on specific tasks, comparing them against supervised dictionaries for context. They focus on the indirect object identification (IOI) task using GPT-2 Small to find out how well these dictionaries perform.

The Core Approach

To break it down, here’s the framework the researchers proposed:

  1. Supervised Feature Dictionaries: First, they demonstrate that supervised dictionaries can do an excellent job of approximating, controlling, and interpreting model computations (one way such a dictionary can be built is sketched just after this list).
  2. Contextualizing Unsupervised Dictionaries: They then use these supervised dictionaries to benchmark unsupervised ones (like those learned via SAEs).
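For step 1, here is a minimal sketch of one plausible way to build a supervised dictionary for a task like IOI, assuming (as a simplification that may differ from the paper's exact construction) that each feature is the mean-centered activation associated with one value of a known task attribute, such as the indirect object's name or its position in the prompt:

```python
from collections import defaultdict
import torch

def supervised_dictionary(acts: torch.Tensor, attributes: list[dict]):
    """Build a supervised feature dictionary from labeled task prompts.

    acts:       (n_prompts, d_model) activations at a fixed model site
    attributes: one dict per prompt, e.g. {"io_name": "Mary", "s_name": "John"}

    Sketch only: each feature is the average deviation from the overall mean
    activation among prompts sharing one attribute value.
    """
    mean_act = acts.mean(dim=0)
    groups = defaultdict(list)
    for i, attrs in enumerate(attributes):
        for name, value in attrs.items():
            groups[(name, value)].append(i)
    features = {key: acts[idxs].mean(dim=0) - mean_act for key, idxs in groups.items()}
    return mean_act, features

def reconstruct(mean_act: torch.Tensor, features: dict, attrs: dict) -> torch.Tensor:
    # Approximate an activation as the mean plus one feature per attribute of the prompt.
    return mean_act + sum(features[(name, value)] for name, value in attrs.items())
```

Step 2 then scores an SAE's learned dictionary against this supervised baseline on the same approximation, control, and interpretability tests.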

Key Results

They trained SAEs on datasets specific to the task (IOI) and a larger, general dataset (OpenWebText). Here’s what they found:

  • Interpretability: Both the task-specific and the general (OpenWebText) SAEs captured interpretable features for the IOI task, but the task-specific ones did so more reliably.
  • Sparse Controllability: Supervised dictionaries allowed more precise editing of features to change model behavior than SAEs did. Task-specific SAEs fared better than general ones, but still fell short of the supervised features (a sketch of this kind of edit follows the list).
  • Occlusion and Over-splitting: They noticed two phenomena in SAE training:
      • Feature Occlusion: A causally relevant feature can be robustly overshadowed by even slightly higher-magnitude ones.
      • Feature Over-splitting: A single binary feature can get split into several smaller, less interpretable features.
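To make the controllability comparison above concrete, here is a minimal sketch of the kind of sparse edit involved (variable names, the choice of SAE feature index, and the residual-error term are illustrative assumptions, not the paper's exact procedure). With a supervised dictionary, you subtract the feature vector for one attribute value and add the vector for another; with an SAE, you overwrite one learned feature's coefficient and decode.

```python
import torch

def edit_attribute(act: torch.Tensor, features: dict, old_key, new_key) -> torch.Tensor:
    """Supervised-dictionary edit: remove the feature for one attribute value and
    add the feature for another, e.g. swap ("io_name", "Mary") for ("io_name", "Anne").
    `features` is the supervised dictionary sketched earlier."""
    return act - features[old_key] + features[new_key]

def edit_with_sae(act: torch.Tensor, sae, feature_idx: int, new_value: float) -> torch.Tensor:
    """Analogous edit with an SAE (the SparseAutoencoder sketched earlier):
    overwrite one learned feature's coefficient and decode."""
    recon, codes = sae(act)
    codes = codes.clone()
    codes[..., feature_idx] = new_value
    # Keep the part of the activation the SAE does not explain (its reconstruction error).
    return sae.decoder(codes) + (act - recon)
```

The edited activation would then be patched back into the forward pass at the same site, and controllability judged by how often the model's prediction changes in the intended way.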

Practical Implications

These findings imply that while SAEs can indeed produce some meaningful and interpretable features, there's still a way to go before they can match the precision and control offered by supervised methods. This matters because the better we can understand what our models are doing, the more effectively we can use and refine them.

Future Directions

Future work could involve tuning SAE training procedures to improve their performance or exploring ways to lessen the issues of occlusion and over-splitting. Also, it might be valuable to apply this evaluation framework to a wider range of tasks and models, making it more robust and generalizable.

Conclusion

In a nutshell, the paper provides a robust framework for evaluating the effectiveness of sparse feature dictionaries in LLMs. It sets a benchmark using supervised features and uses it to measure how well unsupervised methods like SAEs stack up. While SAEs show promise, there's still a gap between them and the more reliable supervised methods. The ongoing challenge is to close this gap, making LLMs more interpretable and controllable along the way.
