Self-attention Networks Localize When QK-eigenspectrum Concentrates (2402.02098v1)

Published 3 Feb 2024 in stat.ML and cs.LG

Abstract: The self-attention mechanism prevails in modern machine learning. It adaptively selects tokens from an input sequence by modulating the degree of attention localization, a capability that many researchers speculate underlies the strong performance of these models but that also complicates the analysis of their learning dynamics. In recent years, two main arguments have connected attention localization to model performance. One is rank collapse, in which the token embeddings produced by a self-attention block become nearly identical across tokens, yielding a less expressive network. The other is entropy collapse, in which the attention probabilities become highly non-uniform and thus have low entropy, making the learning dynamics more likely to be trapped in plateaus. These two failure modes appear to contradict each other, since rank collapse and entropy collapse are associated with uniform and non-uniform attention, respectively. To reconcile these views, we characterize attention localization through the eigenspectrum of the query-key parameter matrices and show that a small eigenspectrum variance causes attention to localize. Interestingly, a small eigenspectrum variance prevents both rank and entropy collapse, leading to better model expressivity and trainability.
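The quantities the abstract relates are the eigenspectrum of the query-key parameter product and the entropy of the resulting attention distribution. The snippet below is a minimal sketch (not the authors' code) of how one might probe that relationship empirically; the Gaussian token embeddings, the dimensions, and the symmetric construction of the QK matrix W_qk are illustrative assumptions rather than the paper's setup.

```python
# Minimal sketch (assumptions, not the paper's experiment): measure how the
# eigenvalue spread of a QK parameter matrix relates to attention entropy.
import numpy as np

def attention_entropy(W_qk, X):
    """Mean row-wise entropy of softmax(X @ W_qk @ X.T / sqrt(d))."""
    d = X.shape[1]
    logits = X @ W_qk @ X.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    return float(-(P * np.log(P + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
n_tokens, d = 64, 32
X = rng.standard_normal((n_tokens, d))                 # hypothetical token embeddings

for eig_std in (0.0, 1.0, 4.0):                        # sweep the eigenvalue spread
    # Build a symmetric W_qk with eigenvalue mean 2.0 and standard deviation eig_std
    # (an illustrative way to control the QK eigenspectrum).
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    eigvals = 2.0 + eig_std * rng.standard_normal(d)
    W_qk = Q @ np.diag(eigvals) @ Q.T
    print(f"eigenvalue std {eig_std:4.1f} -> "
          f"spectrum variance {np.var(np.linalg.eigvalsh(W_qk)):7.2f}, "
          f"mean attention entropy {attention_entropy(W_qk, X):.3f}")
```

Sweeping eig_std in this way gives a simple empirical handle on the abstract's claim that the variance of the QK eigenspectrum governs how localized (low-entropy) the attention distribution becomes.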
