Initialization is Critical to Whether Transformers Fit Composite Functions by Reasoning or Memorizing (2405.05409v5)
Abstract: Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential (reasoning-based) solutions, which capture the underlying compositional primitives, or symmetric (memory-based) solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential (reasoning-based) solutions exhibit a low-complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of initialization scale in tuning the reasoning and memorizing abilities, and we propose the initialization rate $\gamma$ as a convenient tunable hyper-parameter in common deep learning frameworks, where $1/d_{\mathrm{in}}^{\gamma}$ is the standard deviation of the parameters of a layer with $d_{\mathrm{in}}$ input neurons.
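The initialization rate described in the abstract can be illustrated with a minimal sketch. It assumes the weights are drawn from a zero-mean Gaussian (the abstract only specifies the standard deviation $1/d_{\mathrm{in}}^{\gamma}$), and the function names `init_std`, `init_weights`, and `sample_std` are hypothetical helpers introduced here, not part of the paper or of any framework.

```python
import math
import random


def init_std(d_in: int, gamma: float) -> float:
    """Standard deviation d_in ** (-gamma) for a layer with d_in input neurons."""
    return d_in ** (-gamma)


def init_weights(d_in: int, d_out: int, gamma: float, seed: int = 0):
    """Draw a d_out x d_in weight matrix with std d_in ** (-gamma).

    Assumption: zero-mean Gaussian entries; the abstract fixes only the std.
    """
    rng = random.Random(seed)
    std = init_std(d_in, gamma)
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]


def sample_std(weights) -> float:
    """Empirical standard deviation over all entries of a weight matrix."""
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))


# Compare the target and empirical scale for a few initialization rates.
for gamma in (0.3, 0.5, 0.8):
    weights = init_weights(d_in=256, d_out=64, gamma=gamma)
    print(gamma, init_std(256, gamma), sample_std(weights))
```

For $d_{\mathrm{in}} > 1$, a larger $\gamma$ gives a smaller initialization scale, and $\gamma = 0.5$ recovers the familiar $1/\sqrt{d_{\mathrm{in}}}$ scaling. In a deep learning framework, the same standard deviation would be passed to that framework's normal initializer for each layer.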