From Neurons to Neutrons: A Case Study in Interpretability (2405.17425v1)

Published 27 May 2024 in cs.LG and nucl-th

Abstract: Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case study, we extract nuclear physics concepts by studying models trained to reproduce nuclear data.


Summary

  • The paper demonstrates that neural networks can extract low-dimensional, physics-relevant embeddings from nuclear data via mechanistic interpretability.
  • Using PCA, the study uncovers ordered proton and neutron embeddings whose structure mirrors established nuclear physics, including parity splits and helical patterns.
  • Multi-task training enhances model performance while revealing hidden layer features analogous to the semi-empirical mass formula and nuclear shell model.

Interpretability in Neural Networks: Insights from Nuclear Physics

In the paper "From Neurons to Neutrons: A Case Study in Interpretability" (2405.17425), the authors explore the potential of mechanistic interpretability (MI) to uncover scientific insights from neural networks trained on nuclear physics data. They investigate whether neural networks can learn representations that align with established human knowledge and provide new scientific understanding.

Mechanistic Interpretability Approach

Mechanistic Interpretability is an approach that aims to understand how neural networks make predictions by analyzing their internal representations and computational mechanisms. The paper explores the ability of high-dimensional neural networks to learn low-dimensional embeddings that can reveal insights into the training data, particularly in the domain of nuclear physics.

The authors use Principal Component Analysis (PCA) to analyze the embeddings learned by neural networks. Through PCA, they project the proton and neutron embeddings onto lower-dimensional spaces to identify structural patterns that reflect known nuclear physics concepts such as pairing effects and the Semi-Empirical Mass Formula (SEMF).

Figure 1: Projections of neutron number embeddings onto their first three principal components (PCs). Models were trained on nuclear data (left) or a human-derived nuclear theory (right).
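
To make the procedure concrete, here is a minimal NumPy sketch of this kind of embedding analysis (not code from the paper); `neutron_embeddings` is a hypothetical stand-in for an embedding table extracted from a trained model.

```python
import numpy as np

# Hypothetical stand-in for a learned embedding table extracted from a
# trained model: one d-dimensional vector per neutron number.
rng = np.random.default_rng(0)
neutron_embeddings = rng.normal(size=(120, 64))  # (num_neutron_numbers, d)

# Center the embeddings and compute principal components via SVD.
centered = neutron_embeddings - neutron_embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project every embedding onto the first three principal components.
pcs = centered @ Vt[:3].T          # shape: (num_neutron_numbers, 3)

# Fraction of embedding variance captured by those three components.
explained = (S[:3] ** 2).sum() / (S ** 2).sum()
print(f"First 3 PCs explain {explained:.1%} of embedding variance")
```

In the paper's setting, the rows of the embedding table correspond to neutron (or proton) numbers, and the low-dimensional projections are what reveal the ordered, helical, and parity-split structure discussed below.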

Nuclear Physics Case Study

The case study focuses on nuclear physics, where the objective is to predict nuclear properties from the numbers of protons (Z) and neutrons (N). The authors argue that the structured representations learned by neural networks can mirror established physics concepts and assist in scientific discovery.

Figure 2: Binding energy per nucleon as given by the SEMF formula (left) and observed in measurements (right).
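
For context, the SEMF writes the binding energy of a nucleus with Z protons and N neutrons (A = Z + N) as a sum of volume, surface, Coulomb, asymmetry, and pairing terms. The sketch below uses common textbook coefficient values, which are approximate and not taken from this paper.

```python
import numpy as np

def semf_binding_energy(Z: int, N: int) -> float:
    """Semi-empirical mass formula binding energy in MeV.

    B = a_V*A - a_S*A^(2/3) - a_C*Z(Z-1)/A^(1/3) - a_A*(N-Z)^2/A + delta
    Coefficients are typical textbook values; exact values vary by fit.
    """
    A = Z + N
    a_V, a_S, a_C, a_A, a_P = 15.8, 18.3, 0.714, 23.2, 12.0
    delta = 0.0
    if Z % 2 == 0 and N % 2 == 0:      # even-even nuclei are more bound
        delta = +a_P / np.sqrt(A)
    elif Z % 2 == 1 and N % 2 == 1:    # odd-odd nuclei are less bound
        delta = -a_P / np.sqrt(A)
    return (a_V * A
            - a_S * A ** (2 / 3)
            - a_C * Z * (Z - 1) / A ** (1 / 3)
            - a_A * (N - Z) ** 2 / A
            + delta)

# Example: binding energy per nucleon of iron-56 (Z=26, N=30), roughly 8.8 MeV.
print(semf_binding_energy(26, 30) / 56)
```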

Embedding Structures

The paper highlights several features of the learned embeddings that correlate with generalization performance and known physics concepts:

  1. Helicity: A spiral (helical) pattern in the embeddings, most prominent in models trained to predict binding energy, suggests a geometric structure that parallels the geometry implied by the SEMF.
  2. Orderedness: Proton and neutron numbers appear in order along the leading principal components, and the degree of ordering correlates with model generalization, implying inductive biases encoded within the embeddings (a diagnostic sketch is given below).
  3. Parity Split: Embeddings of even and odd proton and neutron numbers separate into distinct branches, reflecting the pairing effects known in nuclear physics.

Figure 3: Projection of proton number (Z) embeddings onto the first two principal components (PCs), superimposed on the neural network's binding energy predictions.
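
As a rough illustration of how such structure might be quantified (these diagnostics are illustrative assumptions, not necessarily the metrics used in the paper), one can measure orderedness as a rank correlation along the first PC and the parity split as the separation between even and odd indices along a later PC:

```python
import numpy as np
from scipy.stats import spearmanr

def embedding_diagnostics(embeddings: np.ndarray):
    """Simple 'orderedness' and 'parity split' diagnostics for an
    embedding table whose row i corresponds to nucleon number i."""
    n = embeddings.shape[0]
    centered = embeddings - embeddings.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ Vt.T

    # Orderedness: rank correlation between nucleon number and PC1 coordinate.
    orderedness, _ = spearmanr(np.arange(n), pcs[:, 0])

    # Parity split: gap between even- and odd-index means along PC2,
    # in units of the overall spread along PC2.
    even, odd = pcs[::2, 1], pcs[1::2, 1]
    split = abs(even.mean() - odd.mean()) / np.std(pcs[:, 1])
    return abs(orderedness), split

# Example with a random (structureless) table: both scores should be small.
rng = np.random.default_rng(0)
print(embedding_diagnostics(rng.normal(size=(120, 64))))
```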

Principal Components Analysis

Principal component analysis played a crucial role in identifying meaningful patterns in the neural network's representations. The paper provides evidence supporting the efficacy of using PCA to capture essential features, illustrating that a few principal components can recover a significant portion of the model's performance.

Figure 4: Fitting a helix to the PC-projected embeddings.
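
A minimal sketch of the underlying idea: replace the embedding table with its rank-k PCA reconstruction, which in an experiment like the paper's would then be substituted back into the trained model to see how much predictive performance the top k components retain. The helper below shows only the reconstruction step and is an illustrative assumption, not the authors' code.

```python
import numpy as np

def rank_k_reconstruction(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Replace an embedding table with its projection onto the top-k PCs."""
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    # Project onto the top-k principal directions, then map back to d dims.
    return centered @ Vt[:k].T @ Vt[:k] + mean

# Example: relative reconstruction error of a rank-3 approximation.
rng = np.random.default_rng(0)
E = rng.normal(size=(120, 64))
E3 = rank_k_reconstruction(E, k=3)
print(np.linalg.norm(E - E3) / np.linalg.norm(E))
```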

Experiments and Observations

  1. Multi-Task Learning: Training the model on multiple related tasks simultaneously improved performance and yielded richer learned representations. This multi-task setup enhances the model's ability to generalize and captures the shared structure underlying various nuclear observables (a minimal sketch of such a setup is given below).
  2. Hidden Layer Features: Analysis of penultimate-layer activations revealed features that strongly resemble terms in established nuclear models, such as the volume and pairing terms of the SEMF and contributions from the nuclear shell model.

Figure 5: Test performance on different observables for models trained on a single task versus multiple tasks jointly.
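
A minimal PyTorch sketch of one way such a multi-task model could be wired up; the architecture, layer sizes, and task names here are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskNuclearModel(nn.Module):
    """Shared proton/neutron embeddings with one output head per observable."""

    def __init__(self, max_z=120, max_n=180, d=64, hidden=256,
                 tasks=("binding_energy", "charge_radius", "separation_energy")):
        super().__init__()
        self.z_embed = nn.Embedding(max_z, d)   # proton-number embedding table
        self.n_embed = nn.Embedding(max_n, d)   # neutron-number embedding table
        self.trunk = nn.Sequential(
            nn.Linear(2 * d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One scalar regression head per nuclear observable.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in tasks})

    def forward(self, Z: torch.Tensor, N: torch.Tensor) -> dict:
        h = self.trunk(torch.cat([self.z_embed(Z), self.n_embed(N)], dim=-1))
        return {task: head(h).squeeze(-1) for task, head in self.heads.items()}

# Example forward pass for a small batch of nuclei (Fe-56 and Pb-208).
model = MultiTaskNuclearModel()
out = model(torch.tensor([26, 82]), torch.tensor([30, 126]))
print({k: v.shape for k, v in out.items()})
```

Because the proton and neutron embedding tables are shared across all heads, the tables (and penultimate-layer features) must encode structure useful for every observable, which is the setting in which the paper finds SEMF-like and shell-model-like features.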

Conclusion

The research demonstrates that neural networks trained on nuclear physics data can learn representations that provide insights into both the model's prediction process and the underlying physical phenomena. Through mechanistic interpretability, the authors effectively reinterpret the learned embeddings and latent features in terms of established theoretical models.

The paper marks a step forward in using intrinsic model representations to gain scientific understanding, potentially paving the way for discovering new phenomena in other domains. Future research may extend these interpretability techniques to more complex scientific problems and unexplored datasets.
