
Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing (2405.05409v3)

Published 8 May 2024 in cs.LG

Abstract: Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate. In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks. We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential solutions, which capture the underlying compositional primitives, or symmetric solutions, which simply memorize mappings without understanding the compositional structure. By analyzing the information flow and vector representations within the model, we reveal the distinct mechanisms underlying these solution types. We further find that inferential solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors. We validate our conclusions on various real-world datasets. Our findings provide valuable insights into the role of initialization scale in shaping the type of solution learned by transformers and their ability to learn and generalize compositional tasks.
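As an illustrative sketch (not the paper's code), the "parameter initialization scale" the abstract refers to can be thought of as a multiplier on the standard deviation of the weight draw; the function name `init_weights` and the He-style baseline below are assumptions for illustration, and which regime a given scale lands in is the paper's empirical question:

```python
import numpy as np

def init_weights(fan_in, fan_out, init_scale, seed=0):
    """Draw a weight matrix whose std is a He-style baseline times `init_scale`.

    Varying `init_scale` while holding everything else fixed is the kind of
    intervention the abstract describes: the scale of the initial weights,
    not the architecture, determines which solution type training finds.
    (Illustrative only; the paper's actual setup may differ.)
    """
    rng = np.random.default_rng(seed)
    std = init_scale * np.sqrt(2.0 / fan_in)  # He baseline, scaled
    return rng.normal(0.0, std, size=(fan_in, fan_out))

# Two initializations differing only in scale: a small-initialization
# regime versus a standard-scale regime.
small = init_weights(512, 512, init_scale=0.1)
large = init_weights(512, 512, init_scale=1.0)
print(small.std(), large.std())  # the small draw has ~10x smaller spread
```

The point of the sketch is that the intervention is a single scalar applied uniformly at initialization; training dynamics then diverge from these otherwise identical starting points.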

Authors (5)
  1. Zhongwang Zhang (17 papers)
  2. Pengxiao Lin (5 papers)
  3. Zhiwei Wang (223 papers)
  4. Yaoyu Zhang (43 papers)
  5. Zhi-Qin John Xu (66 papers)
Citations (4)
