- The paper demonstrates that toy ReLU networks can represent more features than they have dimensions by encoding unrelated features in superposition.
- The paper identifies a phase change: as input features become sparser, models shift from representing only the most important features to storing many features in superposition, at the cost of non-linear interference among them.
- The paper reveals that superposition organizes features into complex geometric structures, which may affect adversarial robustness and model interpretability.
Toy Models of Superposition: A Detailed Analysis
The paper "Toy Models of Superposition" by Nelson Elhage et al. explores the perplexing phenomenon of polysemanticity in neural networks, where individual neurons encode multiple unrelated features. By utilizing toy models, specifically small ReLU networks trained on synthetic data with sparse input features, the authors probe into the conditions under which models represent more features than they have dimensions—a phenomenon they term "superposition."
Key Findings
- Superposition in Neural Networks: The paper confirms that neural networks can represent more features than they have dimensions via superposition. This contrasts with linear models, which can do no better than keeping the top principal components and discarding the remaining features.
- Phase Change in Superposition: The research identifies a phase change as feature sparsity increases: models transition from densely representing only the most important features to storing additional features in superposition (a rough training sketch follows this list). Whenever models use superposition, interference among the encoded features makes their behavior non-linear.
- Geometric Structures: The toy models show that superposition organizes features into complex geometric structures based on uniform polytopes. These include digons, triangles, pentagons, and tetrahedrons, indicating a deep geometric underpinning for feature representation.
- Computation in Superposition: Surprisingly, the models can perform computations, such as the absolute value function, entirely while in superposition. This leads the authors to hypothesize that practical neural networks may be noisily simulating larger, highly sparse networks.
- Implications for Adversarial Robustness and Grokking: The paper finds preliminary evidence that superposition may make networks more susceptible to adversarial examples. It also hints at a potential link to grokking, in which a model suddenly generalizes after an initial period of memorization during training.
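The phase change can be observed directly in the toy setup. The rough training loop below assumes the `ToyModel` and `sparse_batch` sketched earlier, importance weights of the form 0.7^i as in the paper's examples, and illustrative hyperparameters; it trains one model per sparsity level and counts how many feature directions end up with non-trivial norm, a count that exceeds the number of hidden dimensions once inputs are sparse enough.

```python
def train(model: ToyModel, sparsity: float, importance: torch.Tensor,
          steps: int = 5000, batch: int = 1024, lr: float = 1e-3) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    n_features = model.W.shape[1]
    for _ in range(steps):
        x = sparse_batch(batch, n_features, sparsity)
        # Importance-weighted reconstruction loss, as in the paper.
        loss = (importance * (model(x) - x) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

n_features, n_hidden = 20, 5
importance = 0.7 ** torch.arange(n_features).float()
for sparsity in [0.0, 0.7, 0.9, 0.99]:
    model = ToyModel(n_features, n_hidden)
    train(model, sparsity, importance)
    # Represented features correspond to columns of W with norm near 1;
    # the 0.5 threshold is an arbitrary cutoff for this illustration.
    represented = (model.W.norm(dim=0) > 0.5).sum().item()
    print(f"sparsity={sparsity:.2f}: ~{represented} features in {n_hidden} dims")
```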
Implications and Future Directions
Theoretical Implications
The finding that neural networks can store and compute with more features than they have dimensions through superposition opens various theoretical avenues. Primarily, it raises the question of how feature importance and sparsity drive the emergence of rich geometric structures. Furthermore, understanding the nature of phase changes in superposition could shed light on the deeper mathematical properties governing neural representations.
Practical Implications
This research has substantial consequences for the interpretability and safety of AI systems. Superposition presents a challenge because it complicates the task of identifying and quantifying the features a model uses. For instance, highly polysemantic neurons mean that the standard practice of interpreting individual neurons as detectors of single, specific features no longer holds. The implications for adversarial robustness are also notable: understanding and mitigating superposition could lead to models that are more resilient to such attacks.
Speculative Future Developments
Going forward, the findings suggest several speculative trajectories:
- Model Architectures Without Superposition: Investigating whether models can be designed inherently free of superposition while retaining performance. Mixture of Experts models could offer a pathway, as they dynamically activate different subsets of neurons, potentially alleviating superposition.
- Sparse Coding and Decomposition Techniques: Developing overcomplete bases or sparse coding algorithms to decode features post hoc from models that use superposition (see the sketch after this list). This could make models more interpretable without altering their underlying structure.
- Stratified Interpretability: Applying these insights to build hierarchical interpretations, in which each layer is described in a local, superposition-free basis, providing a mechanism for understanding computations efficiently layer by layer.
- Further Probing of Computation in Superposition: Exploring whether more complex functions can be computed effectively in superposition, deepening our understanding of how neural networks handle non-linear computation over a large number of sparse features.
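As one concrete illustration of the sparse coding direction above, the sketch below fits a small sparse autoencoder to a model's hidden activations so that each activation is explained by a few directions from an overcomplete dictionary. This is a minimal sketch of the general idea, not a method from the paper; the class name, L1 penalty, and hyperparameters are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete dictionary (d_dict > d_hidden) with sparse, non-negative codes."""
    def __init__(self, d_hidden: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_dict)
        self.decoder = nn.Linear(d_dict, d_hidden, bias=False)

    def forward(self, h: torch.Tensor):
        codes = torch.relu(self.encoder(h))   # sparse feature activations
        return self.decoder(codes), codes

def fit(sae: SparseAutoencoder, acts: torch.Tensor, l1: float = 1e-3,
        steps: int = 2000, lr: float = 1e-3) -> None:
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(steps):
        recon, codes = sae(acts)
        # Reconstruction error plus an L1 penalty that encourages sparse codes.
        loss = ((recon - acts) ** 2).mean() + l1 * codes.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```

After fitting, the decoder's weight columns serve as candidate feature directions, and inputs that strongly activate a given code can be inspected to label that direction.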
Conclusion
The paper by Elhage et al. offers a nuanced understanding of how neural networks encode information through superposition, representing a sophisticated and essential progression in the quest for interpretable AI. The discovery of geometric structures and phase changes within toy models provides a solid foundation for future research aimed at mitigating the challenges posed by polysemanticity and superposition in practical neural networks. As AI systems continue to evolve, insights from such foundational studies will be crucial in guiding the development of models that are both performant and interpretable.