Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2310.07582v2)
Abstract: Foundation models exhibit significant capabilities in decision-making and logical deduction. Nonetheless, debate persists over whether they genuinely understand the world or merely perform stochastic mimicry. This paper examines a simple transformer trained to play Othello, extending prior work to deepen understanding of Othello-GPT's emergent world model. The investigation reveals that Othello-GPT encodes a linear representation of opposing pieces, and that this representation causally steers its decision-making. The paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity. The code has been made public.
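The linear-representation claim in the abstract is typically tested with a linear probe: a linear map fit from a layer's residual-stream activations to the board state. The sketch below illustrates the idea on synthetic data only; the dimensions, variable names, and the synthetic activation model are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative linear-probe sketch (assumed setup, not the paper's code).
# We fabricate "residual stream" activations that are a hidden linear
# function of a 64-square Othello board labelled {empty, mine, theirs},
# then check that a least-squares linear probe recovers the board.
rng = np.random.default_rng(0)
D, N, SQUARES, CLASSES = 256, 2000, 64, 3

# Synthetic board states and their one-hot encoding.
boards = rng.integers(0, CLASSES, size=(N, SQUARES))
onehot = np.eye(CLASSES)[boards].reshape(N, -1)        # (N, 64*3)

# Hidden linear encoding of the board into D-dim activations, plus noise.
W_true = rng.normal(size=(SQUARES * CLASSES, D))
acts = onehot @ W_true + 0.1 * rng.normal(size=(N, D))

# Fit one linear probe for all squares jointly via least squares:
# acts @ probe should approximate the one-hot board encoding.
probe, *_ = np.linalg.lstsq(acts, onehot, rcond=None)  # (D, 64*3)
scores = (acts @ probe).reshape(N, SQUARES, CLASSES)
pred = scores.argmax(-1)
accuracy = (pred == boards).mean()
print(f"linear probe accuracy: {accuracy:.3f}")
```

Because the synthetic activations are linear in the board encoding, the probe recovers the board almost perfectly; on a real model, high probe accuracy at a given layer is the evidence for a linear world representation, and interventions along the probe directions test the causal claim.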