
Sparse Autoencoders Find Highly Interpretable Features in Language Models

(2309.08600)
Published Sep 15, 2023 in cs.LG and cs.CL

Abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

Method overview: Sampling model activations, training a sparse autoencoder, interpreting features using autointerpretability scores.

Overview

  • The paper addresses the challenge of polysemanticity in neural networks, where neurons activate in various, unrelated contexts, by hypothesizing that this is due to superposition in the activation space.

  • The authors propose using sparse autoencoders to make the features represented by neural network activations more interpretable, by training these autoencoders to reconstruct activations sparsely.

  • The results indicate that features identified by sparse autoencoders are more interpretable than those found through traditional methods like PCA and ICA, with implications for better model debugging, ethical AI, and fine-grained control.

Demystifying Neural Network Internals with Sparse Autoencoders

What’s the Big Idea?

Have you ever wondered what's really going on inside those neural networks you're deploying? I mean, sure, they work—often incredibly well—but what exactly are they doing under the hood? That's the mystery our paper is wrestling with, and trust me, it's a fascinating ride.

The big issue here is polysemanticity. This means that neurons in a network can activate in multiple, seemingly unrelated contexts, making it hard to pin down what a neuron actually represents. One hypothesis is that this polysemanticity is due to superposition, where the network tries to represent more features than it has neurons for, by using an overcomplete set of directions in activation space. This could be why understanding these networks can feel like cracking a code with too many secret keys.
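To make superposition concrete, here is a tiny, illustrative NumPy sketch (not from the paper): twenty hypothetical features are stored as directions in a ten-dimensional activation space, and because only one feature is active at a time, projecting onto each stored direction still recovers which feature fired, with only modest interference from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_features = 10, 20              # more features than neurons

# Each feature gets a unit-norm direction in activation space, not its own neuron.
directions = rng.normal(size=(n_features, d_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 7 is active.
feature_acts = np.zeros(n_features)
feature_acts[7] = 1.0

# The observed activation vector is a sum of the active feature directions.
activation = feature_acts @ directions      # shape: (d_neurons,)

# Projecting onto each stored direction recovers the active feature,
# with only modest interference from the other, inactive directions.
readout = directions @ activation
print(np.argmax(np.abs(readout)))           # 7
print(np.round(readout, 2))
```

When many features are active at once, or when there are far too many directions, the interference grows; sparsity is what makes this trick work, which is exactly why the method below enforces it.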

The Approach: Sparse Autoencoders to the Rescue

Think of our approach as a detective story where we're trying to find the hidden directions in the neural network's activation space. We use sparse autoencoders: neural networks trained to reconstruct the internal activations of a language model under a sparsity constraint, so that only a few neurons in the hidden layer activate for any given input. That sparsity makes the features they represent much easier to interpret.

Sparse autoencoders help us identify sets of sparsely activating features that are more monosemantic (i.e., they activate in specific, human-understandable contexts) than neurons found by other methods.
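For readers who think in code, here is a minimal PyTorch sketch of this kind of sparse autoencoder. The class name, dictionary size, and L1 coefficient are illustrative choices rather than the paper's exact hyperparameters; the essential ingredients are an overcomplete hidden layer, a ReLU nonlinearity, and a reconstruction loss plus an L1 sparsity penalty on the hidden activations.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct model activations,
    with an L1 penalty that encourages sparse feature activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)          # d_model -> dictionary size
        self.decoder = nn.Linear(n_features, d_model, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        features = self.relu(self.encoder(x))                  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Illustrative training step on a batch of sampled activations.
d_model, n_features = 512, 4096                                 # e.g. an 8x overcomplete dictionary
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, d_model)                          # stand-in for real LM activations
recon, feats = sae(activations)
loss = sae_loss(activations, recon, feats)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the batch of random numbers above would be replaced by activations sampled from a particular layer of the language model, and training would loop over a large corpus of such activations.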

Strong Numerical Results: Interpretability at Scale

One of the coolest parts of our findings? Our sparse autoencoder-based features turn out to be way more interpretable than those generated by traditional methods like Principal Component Analysis (PCA) or Independent Component Analysis (ICA).

When tested using autointerpretability scores—a metric that measures how well the activation of a feature can be predicted based on its description—our dictionary features outperform the competition. Take a look at this:

[Figure: mean autointerpretability scores with 95% confidence intervals, comparing sparse autoencoder dictionary features against alternative decompositions of the residual stream.]

See those error bars? They show 95% confidence intervals, and it's clear that our method performs better on average compared to other ways of finding dictionary features.
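As a rough sketch of what "autointerpretability" means in practice: assume you already have a feature's real activations on some text, plus the activations a language model simulates purely from the feature's natural-language description. Scoring then amounts to measuring how well the two agree. The Pearson-correlation scorer below is an illustrative stand-in, not the exact scoring pipeline used in the paper.

```python
import numpy as np

def autointerp_score(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    """Score how well activations predicted from a feature's text description
    match the feature's real activations (higher = more interpretable).
    Pearson correlation is used here as the agreement measure."""
    true = true_acts - true_acts.mean()
    sim = simulated_acts - simulated_acts.mean()
    denom = np.sqrt((true ** 2).sum() * (sim ** 2).sum())
    return float((true * sim).sum() / denom) if denom > 0 else 0.0

# Example: a feature whose simulated activations roughly track the real ones.
true_acts = np.array([0.0, 2.1, 0.0, 0.0, 3.5, 0.1])
simulated_acts = np.array([0.0, 2.0, 0.3, 0.0, 3.0, 0.0])
print(autointerp_score(true_acts, simulated_acts))   # close to 1.0
```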

Implications and Future Directions

So why should you care? Well, improving interpretability gets us closer to building AI systems that humans can trust. This enhanced understanding can lead to:

  • Better Model Debugging: Knowing precisely which features contribute to specific behaviors can help swiftly identify and fix problematic parts of the network.
  • Ethical AI: With clearer insights into decision-making processes, aligning AI behaviors with ethical guidelines becomes more feasible.
  • Fine-Grained Control: Understanding these inner workings allows for more nuanced interventions, like steering model behavior by tweaking specific features.

Case Study: Putting Theory into Practice

Let's look at a hands-on example. Suppose we have a dictionary feature that activates on apostrophes. We can analyze what happens when we ablate this feature (essentially “turn it off”). It turns out that removing this feature mainly reduces the likelihood of the model predicting an “s” token right after an apostrophe. This makes sense for contractions and possessive forms in English, like "it's" or "Bob's."

[Figure: effect of ablating the apostrophe feature on next-token predictions, showing the drop in probability of the "s" token following an apostrophe.]
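Here is a hedged sketch of what ablating a dictionary feature can look like in code, reusing the illustrative SparseAutoencoder class from the earlier snippet: we subtract the feature's reconstructed contribution (its activation times its decoder direction) from the activations before they flow onward. The function name and the hook-based patching mentioned in the comment are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def ablate_feature(activations: torch.Tensor,
                   sae: "SparseAutoencoder",     # class from the sketch above
                   feature_idx: int) -> torch.Tensor:
    """Remove one dictionary feature's contribution from a batch of activations
    by subtracting (feature activation) * (decoder direction)."""
    recon, feats = sae(activations)
    direction = sae.decoder.weight[:, feature_idx]               # shape: (d_model,)
    contribution = feats[:, feature_idx].unsqueeze(-1) * direction
    return activations - contribution

# The edited activations can then be patched back into the forward pass
# (e.g. via model hooks) to measure the effect on next-token predictions,
# such as the probability of "s" after an apostrophe.
```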

Future Developments: What's Next?

Looking ahead, there are several intriguing paths to explore:

  1. Scalable Interpretability: Using sparse autoencoders on larger and more complex models.
  2. Enhanced Steering: Combining these insights with model steering frameworks for fine-grained control.
  3. Ethical Governance: Applying these techniques in real-world applications to ensure models act in accordance with societal norms.

By continuing the journey of mechanistic interpretability, we aim to peel back the layers of these complex models and make them as transparent and reliable as possible.

Final Thoughts

Understanding neural networks' internal operations isn't just an academic exercise—it's a cornerstone for building safe, trustworthy AI systems. Our work with sparse autoencoders offers a promising path forward, making it easier to demystify these models and bring AI development into a realm where we can truly understand and control its outcomes.

So whether you're debugging a model or ensuring it aligns with ethical standards, these insights can be a game-changer in your toolkit. Happy modeling!
