Fourier Circuits in Neural Networks and Transformers: A Case Study of Modular Arithmetic with Multiple Inputs (2402.09469v3)

Published 12 Feb 2024 in cs.LG and stat.ML

Abstract: In the evolving landscape of machine learning, a pivotal challenge lies in deciphering the internal representations harnessed by neural networks and Transformers. Building on recent progress toward comprehending how networks execute distinct target functions, our study explores the underlying reasons why networks adopt specific computational strategies. We direct our focus to the complex algebraic learning task of modular addition involving $k$ inputs. Our research presents a thorough analytical characterization of the features learned by stylized one-hidden-layer neural networks and one-layer Transformers in addressing this task. A cornerstone of our theoretical framework is the elucidation of how the principle of margin maximization shapes the features adopted by one-hidden-layer neural networks. Let $p$ denote the modulus, $D_p$ denote the dataset of modular arithmetic with $k$ inputs, and $m$ denote the network width. We demonstrate that with a neuron count of $m \geq 2^{2k-2} \cdot (p-1)$, these networks attain a maximum $L_{2,k+1}$-margin on the dataset $D_p$. Furthermore, we establish that each hidden-layer neuron aligns with a specific Fourier spectrum, integral to solving modular addition problems. By correlating our findings with the empirical observations of similar studies, we contribute to a deeper comprehension of the intrinsic computational mechanisms of neural networks. Furthermore, we observe similar computational mechanisms in the attention matrices of one-layer Transformers. Our work is a significant stride toward unraveling the operational complexities of these models, particularly on complex algebraic tasks.
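
The abstract's claim that each hidden-layer neuron aligns with a specific Fourier frequency can be illustrated concretely. The sketch below is not the paper's construction: it assumes idealized cosine features at every frequency of the modulus (with illustrative values $p = 7$, $k = 3$) rather than weights obtained by margin maximization, and it verifies that summing such features solves $k$-input modular addition exactly, the same "clock"-style Fourier mechanism the paper analyzes.

```python
import numpy as np
from itertools import product

# Illustrative sketch only: hand-built Fourier (cosine) features for
# k-input modular addition, not the paper's trained network.

p, k = 7, 3              # modulus and number of inputs (small demo values)
freqs = np.arange(p)     # one frequency per residue: 0, ..., p-1

def fourier_logits(inputs):
    """Score each candidate output c with cosine features.

    logit(c) = sum_f cos(2*pi*f*(a_1 + ... + a_k - c) / p)
    The cosines interfere constructively only when the phase is a
    multiple of 2*pi, so the score peaks at c = (a_1 + ... + a_k) mod p.
    """
    s = sum(inputs)
    c = np.arange(p)                                  # candidate outputs
    phases = 2 * np.pi * np.outer(freqs, s - c) / p   # shape (p, p)
    return np.cos(phases).sum(axis=0)                 # sum over frequencies

# Check the mechanism on the full dataset D_p of k-input modular addition.
correct = sum(
    int(np.argmax(fourier_logits(x))) == sum(x) % p
    for x in product(range(p), repeat=k)
)
print(f"accuracy on D_p: {correct / p**k:.3f}")       # 1.000
```

In the paper's setting the frequencies are not supplied by hand: the analysis argues that margin maximization drives each trained hidden neuron toward one such Fourier frequency, so the network as a whole recovers this mechanism.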
