Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models (2408.00113v2)
Abstract: What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of these SAEs is difficult because we lack a ground-truth collection of interpretable features that we expect good SAEs to recover. We thus propose to measure progress in interpretable dictionary learning by working in the setting of LMs trained on chess and Othello transcripts. These settings carry natural collections of interpretable features -- for example, "there is a knight on F3" -- which we leverage into $\textit{supervised}$ metrics for SAE quality. To guide progress in interpretable dictionary learning, we introduce a new SAE training technique, $\textit{p-annealing}$, which improves performance on prior unsupervised metrics as well as our new metrics.
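Since the abstract names p-annealing only at a high level, the following is a minimal sketch of the idea, assuming it anneals the exponent p of an Lp sparsity penalty on the SAE's feature activations from 1 toward a smaller value over the course of training. The architecture, schedule, and hyperparameters (`d_model`, `d_dict`, `p_end`, `sparsity_coeff`) are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: SAE training with a p-annealed Lp sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the LM activation
        return x_hat, f

def lp_penalty(f: torch.Tensor, p: float, eps: float = 1e-8) -> torch.Tensor:
    # ||f||_p^p summed over dictionary features, averaged over the batch.
    # eps keeps gradients finite at f = 0 when p < 1.
    return ((f.abs() + eps) ** p).sum(dim=-1).mean()

d_model, d_dict, n_steps = 512, 4096, 10_000
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
p_start, p_end, sparsity_coeff = 1.0, 0.5, 1e-3  # assumed schedule endpoints

for step in range(n_steps):
    x = torch.randn(64, d_model)  # stand-in for a batch of LM activations
    # Anneal p linearly from p_start toward p_end over training, so the
    # penalty starts as an L1 term and gradually approaches an L0-like one.
    p = p_start + (p_end - p_start) * step / (n_steps - 1)
    x_hat, f = sae(x)
    loss = (x_hat - x).pow(2).sum(dim=-1).mean() + sparsity_coeff * lp_penalty(f, p)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The intuition behind the schedule: an L1 penalty gives a well-behaved optimization target early on, while Lp penalties with p < 1 more closely approximate the L0 sparsity objective, so easing p downward trades optimization stability for a sharper sparsity pressure as training progresses.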