
Sparse Autoencoders Find Highly Interpretable Features in Language Models

(2309.08600)
Published Sep 15, 2023 in cs.LG and cs.CL

Abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task (Wang et al., 2022) to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

Method overview: Sampling model activations, training a sparse autoencoder, interpreting features using autointerpretability scores.

Overview

  • The paper addresses the challenge of polysemanticity in neural networks, where neurons activate in various, unrelated contexts, by hypothesizing that this is due to superposition in the activation space.

  • The authors propose using sparse autoencoders to make the features represented by neural network activations more interpretable, by training these autoencoders to reconstruct activations sparsely.

  • The results indicate that features identified by sparse autoencoders are more interpretable than those found through traditional methods like PCA and ICA, with implications for better model debugging, ethical AI, and fine-grained control.

Demystifying Neural Network Internals with Sparse Autoencoders

What’s the Big Idea?

Have you ever wondered what's really going on inside those neural networks you're deploying? I mean, sure, they work—often incredibly well—but what exactly are they doing under the hood? That's the mystery our paper is wrestling with, and trust me, it's a fascinating ride.

The big issue here is polysemanticity. This means that neurons in a network can activate in multiple, seemingly unrelated contexts, making it hard to pin down what a neuron actually represents. One hypothesis is that this polysemanticity is due to superposition, where the network tries to represent more features than it has neurons for, by using an overcomplete set of directions in activation space. This could be why understanding these networks can feel like cracking a code with too many secret keys.
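To make superposition concrete, here is a tiny, illustrative NumPy sketch (not from the paper): twenty hypothetical features are stored as directions in a ten-dimensional activation space, and because only one feature is active at a time, projecting onto each stored direction still recovers which feature fired, with only modest interference from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_features = 10, 20              # more features than neurons

# Each feature gets a unit-norm direction in activation space, not its own neuron.
directions = rng.normal(size=(n_features, d_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only feature 7 is active.
feature_acts = np.zeros(n_features)
feature_acts[7] = 1.0

# The observed activation vector is a sum of the active feature directions.
activation = feature_acts @ directions      # shape: (d_neurons,)

# Projecting onto each stored direction recovers the active feature,
# with only modest interference from the other, inactive directions.
readout = directions @ activation
print(np.argmax(np.abs(readout)))           # 7
print(np.round(readout, 2))
```

When many features are active at once, or when there are far too many directions, the interference grows; sparsity is what makes this trick work, which is exactly why the method below enforces it.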

The Approach: Sparse Autoencoders to the Rescue

Think of our approach as a detective story where we're trying to find the hidden directions in the neural network's activation space. We use sparse autoencoders: neural networks trained to reconstruct the internal activations of a language model under a sparsity constraint, so that only a few neurons in the hidden layer activate for any given input. That sparsity makes the features they represent much easier to interpret.

Sparse autoencoders help us identify sets of sparsely activating features that are more monosemantic (i.e., they activate in specific, human-understandable contexts) than neurons found by other methods.
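For readers who think in code, here is a minimal PyTorch sketch of this kind of sparse autoencoder. The class name, dictionary size, and L1 coefficient are illustrative choices rather than the paper's exact hyperparameters; the essential ingredients are an overcomplete hidden layer, a ReLU nonlinearity, and a reconstruction loss plus an L1 sparsity penalty on the hidden activations.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct model activations,
    with an L1 penalty that encourages sparse feature activations."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)          # d_model -> dictionary size
        self.decoder = nn.Linear(n_features, d_model, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        features = self.relu(self.encoder(x))                  # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Illustrative training step on a batch of sampled activations.
d_model, n_features = 512, 4096                                 # e.g. an 8x overcomplete dictionary
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(64, d_model)                          # stand-in for real LM activations
recon, feats = sae(activations)
loss = sae_loss(activations, recon, feats)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the batch of random numbers above would be replaced by activations sampled from a particular layer of the language model, and training would loop over a large corpus of such activations.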

Strong Numerical Results: Interpretability at Scale

One of the coolest parts of our findings? Our sparse autoencoder-based features turn out to be way more interpretable than those generated by traditional methods like Principal Component Analysis (PCA) or Independent Component Analysis (ICA).

When tested using autointerpretability scores—a metric that measures how well the activation of a feature can be predicted based on its description—our dictionary features outperform the competition. Take a look at this:

[Figure: mean autointerpretability scores with 95% confidence intervals, comparing sparse autoencoder dictionary features against alternative decompositions of the residual stream.]

See those error bars? They show 95% confidence intervals, and it's clear that our method performs better on average compared to other ways of finding dictionary features.
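As a rough sketch of what "autointerpretability" means in practice: assume you already have a feature's real activations on some text, plus the activations a language model simulates purely from the feature's natural-language description. Scoring then amounts to measuring how well the two agree. The Pearson-correlation scorer below is an illustrative stand-in, not the exact scoring pipeline used in the paper.

```python
import numpy as np

def autointerp_score(true_acts: np.ndarray, simulated_acts: np.ndarray) -> float:
    """Score how well activations predicted from a feature's text description
    match the feature's real activations (higher = more interpretable).
    Pearson correlation is used here as the agreement measure."""
    true = true_acts - true_acts.mean()
    sim = simulated_acts - simulated_acts.mean()
    denom = np.sqrt((true ** 2).sum() * (sim ** 2).sum())
    return float((true * sim).sum() / denom) if denom > 0 else 0.0

# Example: a feature whose simulated activations roughly track the real ones.
true_acts = np.array([0.0, 2.1, 0.0, 0.0, 3.5, 0.1])
simulated_acts = np.array([0.0, 2.0, 0.3, 0.0, 3.0, 0.0])
print(autointerp_score(true_acts, simulated_acts))   # close to 1.0
```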

Implications and Future Directions

So why should you care? Well, improving interpretability gets us closer to building AI systems that humans can trust. This enhanced understanding can lead to:

  • Better Model Debugging: Knowing precisely which features contribute to specific behaviors can help swiftly identify and fix problematic parts of the network.
  • Ethical AI: With clearer insights into decision-making processes, aligning AI behaviors with ethical guidelines becomes more feasible.
  • Fine-Grained Control: Understanding these inner workings allows for more nuanced interventions, like steering model behavior by tweaking specific features.

Case Study: Putting Theory into Practice

Let's look at a hands-on example. Suppose we have a dictionary feature that activates on apostrophes. We can analyze what happens when we ablate this feature (essentially “turn it off”). It turns out that removing this feature mainly reduces the likelihood of the model predicting an “s” token right after an apostrophe. This makes sense for contractions and possessive forms in English, like "it's" or "Bob's."

[Figure: effect of ablating the apostrophe feature on next-token predictions, showing the drop in probability of the "s" token following an apostrophe.]
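Here is a hedged sketch of what ablating a dictionary feature can look like in code, reusing the illustrative SparseAutoencoder class from the earlier snippet: we subtract the feature's reconstructed contribution (its activation times its decoder direction) from the activations before they flow onward. The function name and the hook-based patching mentioned in the comment are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def ablate_feature(activations: torch.Tensor,
                   sae: "SparseAutoencoder",     # class from the sketch above
                   feature_idx: int) -> torch.Tensor:
    """Remove one dictionary feature's contribution from a batch of activations
    by subtracting (feature activation) * (decoder direction)."""
    recon, feats = sae(activations)
    direction = sae.decoder.weight[:, feature_idx]               # shape: (d_model,)
    contribution = feats[:, feature_idx].unsqueeze(-1) * direction
    return activations - contribution

# The edited activations can then be patched back into the forward pass
# (e.g. via model hooks) to measure the effect on next-token predictions,
# such as the probability of "s" after an apostrophe.
```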

Future Developments: What's Next?

Looking ahead, there are several intriguing paths to explore:

  1. Scalable Interpretability: Using sparse autoencoders on larger and more complex models.
  2. Enhanced Steering: Combining these insights with model steering frameworks for fine-grained control.
  3. Ethical Governance: Applying these techniques in real-world applications to ensure models act in accordance with societal norms.

By continuing the journey of mechanistic interpretability, we aim to peel back the layers of these complex models and make them as transparent and reliable as possible.

Final Thoughts

Understanding neural networks' internal operations isn't just an academic exercise—it's a cornerstone for building safe, trustworthy AI systems. Our work with sparse autoencoders offers a promising path forward, making it easier to demystify these models and bring AI development into a realm where we can truly understand and control its outcomes.

So whether you're debugging a model or ensuring it aligns with ethical standards, these insights can be a game-changer in your toolkit. Happy modeling!
