KAN: Kolmogorov-Arnold Networks (2404.19756v4)

Published 30 Apr 2024 in cs.LG, cond-mat.dis-nn, cs.AI, and stat.ML

Abstract: Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.

Authors (8)
  1. Ziming Liu (87 papers)
  2. Yixuan Wang (95 papers)
  3. Sachin Vaidya (23 papers)
  4. Fabian Ruehle (64 papers)
  5. James Halverson (66 papers)
  6. Marin Soljačić (141 papers)
  7. Thomas Y. Hou (57 papers)
  8. Max Tegmark (133 papers)
Citations (217)

Summary

  • The paper introduces KANs, which replace fixed neuron activations with learnable spline-based functions on edges, leading to improved accuracy and interpretability.
  • It presents technical innovations such as dynamic grid updates, residual activation functions, and sparsification to optimize training and simplify network structure.
  • Experimental results show KANs outperform traditional MLPs in tasks ranging from PDE solutions to symbolic regression, achieving efficient scaling and robust function approximation.

Kolmogorov-Arnold Networks (KANs) are proposed as a promising alternative to Multi-Layer Perceptrons (MLPs), drawing inspiration from the Kolmogorov-Arnold representation theorem. While traditional MLPs utilize fixed activation functions on nodes (neurons), KANs introduce learnable activation functions on the edges (weights). This means KANs have no linear weight matrices; instead, each weight parameter is replaced by a univariate function parameterized as a spline. The nodes in a KAN simply sum the incoming signals. This architectural change is shown to lead to improved accuracy and interpretability on small-scale AI + Science tasks.

The Kolmogorov-Arnold representation theorem states that any continuous function $f$ of $n$ variables on a bounded domain can be written as a finite composition involving only univariate functions and the binary operation of addition: $f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right)$. While the original theorem implies a shallow, specific structure (depth-2, width $2n+1$), early attempts to build neural networks directly from this theorem were hindered by the requirement for potentially non-smooth or fractal-like univariate functions and by the limited applicability of the shallow structure. The paper generalizes this idea to arbitrary widths and depths, framing a KAN as a composition of "KAN layers."

A KAN layer with $n_{in}$ inputs and $n_{out}$ outputs is defined by a matrix of univariate functions $\mathbf{\Phi} = \{\phi_{q,p}\}$, where $p = 1, \dots, n_{in}$ and $q = 1, \dots, n_{out}$. The forward pass through a layer is given by $x_{l+1,j} = \sum_{i=1}^{n_l} \phi_{l,j,i}(x_{l,i})$, where $x_{l,i}$ is the activation of neuron $(l,i)$ and $\phi_{l,j,i}$ is the learnable activation function on the edge connecting $(l,i)$ to $(l+1,j)$. A deep KAN is formed by stacking these layers.
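To make the layer equation concrete, here is a minimal NumPy sketch (my own illustration, not the authors' pykan code) in which every edge carries a pure B-spline activation on a uniform, non-repeating knot grid; the residual SiLU term and the trainable scales described in the list below are added in a second sketch after that list.

```python
import numpy as np

def bspline_basis(x, grid, k):
    """Cox-de Boor evaluation of B-spline basis functions of order k.
    x: (batch,) inputs; grid: uniform knot vector of length G + 2k + 1 (half-open
    intervals); returns a (batch, G + k) matrix of basis values."""
    B = ((x[:, None] >= grid[None, :-1]) & (x[:, None] < grid[None, 1:])).astype(float)
    for d in range(1, k + 1):
        left = (x[:, None] - grid[None, :-(d + 1)]) / (grid[None, d:-1] - grid[None, :-(d + 1)]) * B[:, :-1]
        right = (grid[None, d + 1:] - x[:, None]) / (grid[None, d + 1:] - grid[None, 1:-d]) * B[:, 1:]
        B = left + right
    return B

def kan_layer_forward(x, coef, grid, k):
    """One KAN layer. x: (batch, n_in); coef: (n_out, n_in, G + k), one spline
    coefficient vector per edge. Node j outputs sum_i phi_{j,i}(x_i)."""
    basis = np.stack([bspline_basis(x[:, i], grid, k) for i in range(x.shape[1])], axis=1)
    phi = np.einsum('bic,oic->boi', basis, coef)   # phi_{j,i}(x_i) for every edge (i -> j)
    return phi.sum(axis=2)                         # sum over incoming edges -> (batch, n_out)
```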

Key implementation details are crucial for making KANs trainable:

  1. Residual activation functions: Each activation function $\phi(x)$ is parameterized as the sum of a fixed basis function $b(x)$ (such as SiLU) and a spline: $\phi(x) = w_b\, b(x) + w_s\, \text{spline}(x)$. The spline part is a linear combination of B-spline basis functions, $\sum_i c_i B_i(x)$, where the $c_i$ are trainable coefficients; $w_b$ and $w_s$ are trainable scaling factors. (A short sketch follows this list.)
  2. Initialization: Spline coefficients $c_i$ are initialized near zero so that $\text{spline}(x) \approx 0$ initially. $w_s$ is initialized to 1, and $w_b$ is initialized like a linear weight in an MLP (e.g., Xavier initialization). This makes the network behave initially like an MLP with residual connections.
  3. Dynamic Grid Update: Spline grids are updated during training based on the distribution of input activations to handle evolving activation ranges.
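A sketch of points 1 and 2 above (the initialization constants are illustrative assumptions; bspline_basis is the helper from the earlier sketch):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def init_edge(n_in, G=5, k=3, noise_scale=0.1):
    """One edge's parameters: spline coefficients near zero (so spline(x) ~ 0 at init),
    w_s = 1, and w_b drawn like an MLP weight (Xavier-style); the scales are assumptions."""
    coef = np.random.randn(G + k) * noise_scale
    w_b = np.random.randn() * np.sqrt(1.0 / n_in)
    w_s = 1.0
    return coef, w_b, w_s

def phi(x, coef, w_b, w_s, grid, k):
    """Residual activation phi(x) = w_b * b(x) + w_s * spline(x), with b = SiLU
    and spline(x) = sum_i c_i B_i(x)."""
    return w_b * silu(x) + w_s * (bspline_basis(x, grid, k) @ coef)
```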

The parameter count for a KAN with depth $L$, width $N$, spline order $k$, and $G$ grid intervals is $O(N^2 L G)$. While this appears larger than an MLP's $O(N^2 L)$, KANs often achieve better performance with much smaller widths and depths, resulting in fewer total parameters.
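As a rough worked comparison (the network shapes below are hypothetical stand-ins chosen for illustration, not figures from the paper):

```python
def kan_params(widths, G, k):
    """Spline parameters of a KAN: each edge carries G + k B-spline coefficients
    (the two scalars w_b, w_s per edge are ignored, as in the O(N^2 L G) estimate)."""
    return sum(n_in * n_out * (G + k) for n_in, n_out in zip(widths[:-1], widths[1:]))

def mlp_params(widths):
    """Dense MLP with biases, for comparison."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

print(kan_params([2, 5, 1], G=5, k=3))   # 15 edges * 8 coefficients = 120
print(mlp_params([2, 100, 100, 1]))      # 10,501
```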

A theoretical analysis shows that if a function admits a smooth Kolmogorov-Arnold representation, a KAN with finite grid size $G$ can approximate it with an error bound of $O(G^{-k-1+m})$ in the $C^m$ norm (Theorem 2.1). This bound is independent of the input dimension $n$, suggesting that KANs can potentially beat the curse of dimensionality (COD) for functions with compositional structure, unlike standard approximation theories for MLPs, whose error exponents typically degrade with the input dimension $d$. This leads to a theoretical neural scaling law of $\ell \propto N^{-(k+1)}$ for KANs, faster than typical MLP scaling laws (e.g., $\alpha = (k+1)/d$ or $\alpha = (k+1)/2$).
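Read in compressed form (my paraphrase of the argument, writing $P$ for the total parameter count to avoid overloading $N$, and suppressing constants), the exponent follows because the error is set by the grid resolution while the parameter count grows only linearly in $G$ at fixed width and depth:

```latex
\[
  \|f - \mathrm{KAN}_G\|_{C^0} \;\lesssim\; G^{-(k+1)},
  \qquad P = O(N^2 L\, G) \;\Rightarrow\; G \propto P,
\]
\[
  \Rightarrow\quad \ell \;\lesssim\; P^{-(k+1)},
  \qquad \alpha = k + 1 = 4 \ \text{for cubic splines } (k = 3).
\]
```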

A practical technique to improve KAN accuracy is grid extension. Since splines can approximate functions more accurately with finer grids, a trained KAN can be extended to a higher-resolution KAN by fitting the old coarse-grained splines with new fine-grained ones. This allows improving accuracy without retraining from scratch, unlike scaling up MLPs. Experiments show staircase-like loss curves where loss drops significantly after each grid extension. Smaller KANs tend to generalize better and tolerate larger grid sizes before overfitting.
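A minimal sketch of the refit step on a single edge (an illustration of the idea rather than the pykan routine; it reuses bspline_basis from the earlier sketch): sample the trained coarse spline densely and least-squares fit finer-grid coefficients to reproduce it.

```python
import numpy as np

def extend_grid(coef_coarse, grid_coarse, grid_fine, k, n_samples=200):
    """Refit a fine-grid spline to match a trained coarse-grid spline on one edge."""
    lo, hi = grid_coarse[k], grid_coarse[-k - 1]            # interior range of the coarse grid
    xs = np.linspace(lo, hi - 1e-6, n_samples)              # stay inside the half-open intervals
    y_coarse = bspline_basis(xs, grid_coarse, k) @ coef_coarse
    B_fine = bspline_basis(xs, grid_fine, k)
    coef_fine, *_ = np.linalg.lstsq(B_fine, y_coarse, rcond=None)
    return coef_fine                                         # length G_fine + k
```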

KANs emphasize interpretability. Techniques are developed to simplify trained networks:

  1. Sparsification: L1 regularization on the magnitude of the activation functions, plus an additional entropy regularization, encourages sparsity among activation functions. The total loss is $\ell_{\rm total} = \ell_{\rm pred} + \lambda \left(\mu_1 \sum_l |\mathbf{\Phi}_l|_1 + \mu_2 \sum_l S(\mathbf{\Phi}_l)\right)$, where $|\mathbf{\Phi}_l|_1$ is the sum of the L1 norms of the activations in layer $l$ and $S(\mathbf{\Phi}_l)$ is an entropy term (a sketch of this penalty follows the list).
  2. Visualization: Activation functions are visualized with transparency proportional to their magnitude, highlighting important connections.
  3. Pruning: Nodes are pruned based on the maximum L1 norm of their incoming and outgoing connections.
  4. Symbolification: Numerical activation functions can be snapped to symbolic forms (e.g., sin, exp, log). This is done by fitting affine parameters $(a, b, c, d)$ such that the numerical output $y$ matches $c\, f(ax + b) + d$ for a candidate symbolic function $f$. Functions like fix_symbolic and suggest_symbolic facilitate this.
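A sketch of the sparsification penalty from point 1 (my NumPy rendering of the formula; the L1 norm of an activation is taken as its mean absolute value over a batch of inputs):

```python
import numpy as np

def kan_regularization(phi_vals_per_layer, mu1=1.0, mu2=1.0, eps=1e-8):
    """phi_vals_per_layer: list of arrays, one per layer, each of shape
    (batch, n_out, n_in) holding the edge activations phi_{j,i}(x_i)."""
    reg = 0.0
    for phi_vals in phi_vals_per_layer:
        l1_per_edge = np.abs(phi_vals).mean(axis=0)        # |phi_{j,i}|_1 for every edge
        l1_total = l1_per_edge.sum()                       # |Phi_l|_1
        p = l1_per_edge / (l1_total + eps)
        entropy = -(p * np.log(p + eps)).sum()             # S(Phi_l)
        reg += mu1 * l1_total + mu2 * entropy
    return reg

# total_loss = pred_loss + lam * kan_regularization(phi_vals_per_layer)
```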

An interactive workflow using these techniques allows human users to collaborate with KANs to discover symbolic formulas. Starting with a larger KAN, training with sparsification, pruning away unimportant neurons, visually inspecting activation functions, manually fixing them to guessed symbolic forms, and retraining the affine parameters can lead to discovering the underlying symbolic expression, as demonstrated with the example $f(x, y) = \exp(\sin(\pi x) + y^2)$. This iterative process offers more transparency and debuggability than traditional symbolic regression methods.
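For orientation, this is roughly how that workflow looks with the authors' pykan package on the same example; method names and signatures have shifted across pykan versions (e.g., train vs. fit), so treat this as an illustrative sketch rather than a pinned API.

```python
import torch
from kan import KAN, create_dataset

# Target from the example above: f(x, y) = exp(sin(pi*x) + y^2)
f = lambda x: torch.exp(torch.sin(torch.pi * x[:, [0]]) + x[:, [1]] ** 2)
dataset = create_dataset(f, n_var=2)

model = KAN(width=[2, 5, 1], grid=5, k=3)                 # start wider than needed
model.train(dataset, opt="LBFGS", steps=50, lamb=0.01)    # lamb > 0 turns on sparsification
model = model.prune()                                     # drop unimportant nodes
model.train(dataset, opt="LBFGS", steps=50)               # refine the pruned network

model.auto_symbolic(lib=['sin', 'exp', 'x^2'])            # or fix_symbolic / suggest_symbolic per edge
model.train(dataset, opt="LBFGS", steps=50)               # retrain the affine parameters
print(model.symbolic_formula()[0][0])                     # recovered symbolic expression
```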

Experimental results confirm KANs' practical advantages.

  • On toy datasets with known compositional structures, KANs exhibit neural scaling laws much closer to the theoretically predicted $\ell \propto N^{-4}$ than MLPs, which scale more slowly and plateau quickly.
  • Fitting special functions (multivariate functions common in science like Bessel or Legendre functions), KANs consistently outperform MLPs on Pareto frontiers comparing parameter count and RMSE. KANs can find surprisingly compact representations for these functions.
  • On the Feynman dataset (real-world physics equations), KANs achieve comparable or better accuracy than MLPs with fewer parameters. Auto-pruned KAN shapes are often smaller than human-constructed ones, hinting at potentially more efficient representations.
  • For solving partial differential equations (PDEs) using physics-informed neural networks (PINNs), a small KAN can achieve significantly higher accuracy and parameter efficiency than a much larger MLP on a Poisson equation example (a loss sketch follows this list).
  • In preliminary experiments on continual learning, the locality of spline basis functions in KANs helps prevent catastrophic forgetting in a toy 1D regression task, allowing the network to learn new tasks without degrading performance on previously learned ones.
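For context, the Poisson comparison in the fourth bullet uses a standard PINN objective; below is a hedged sketch of such a loss (the collocation sampling, the weighting alpha, and the forcing term for the manufactured solution $u = \sin(\pi x)\sin(\pi y)$ are my reconstruction of the usual setup, and `model` can be any differentiable network, KAN or MLP).

```python
import torch

def poisson_pinn_loss(model, n_interior=1000, n_boundary=400, alpha=0.01):
    """PINN loss for u_xx + u_yy = f on [-1, 1]^2 with u = 0 on the boundary."""
    # Interior residual at random collocation points
    x = (2 * torch.rand(n_interior, 2) - 1).requires_grad_(True)
    u = model(x)
    grad_u = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(grad_u[:, 0].sum(), x, create_graph=True)[0][:, 0]
    u_yy = torch.autograd.grad(grad_u[:, 1].sum(), x, create_graph=True)[0][:, 1]
    f = -2 * torch.pi ** 2 * torch.sin(torch.pi * x[:, 0]) * torch.sin(torch.pi * x[:, 1])
    pde_loss = ((u_xx + u_yy - f) ** 2).mean()
    # Dirichlet boundary: random points with one coordinate clamped to +/-1
    xb = 2 * torch.rand(n_boundary, 2) - 1
    side = torch.randint(0, 2, (n_boundary,))
    signs = 2.0 * torch.randint(0, 2, (n_boundary,), dtype=torch.float32) - 1.0
    xb[torch.arange(n_boundary), side] = signs
    bc_loss = (model(xb) ** 2).mean()
    return alpha * pde_loss + bc_loss
```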

Beyond supervised tasks, KANs can be used for unsupervised learning to discover implicit relations $f(x_1, \dots, x_d) \approx 0$ among variables. By training a KAN to classify real data vs. permuted (corrupted) data and structuring the last layer with a Gaussian activation centered at zero, the network implicitly learns $f \approx 0$ on real data. This method successfully rediscovers known mathematical relations in the knot theory dataset (the dependence of the signature on meridional/longitudinal translations, the relation $V = \mu_r \lambda$, and a relation between the short geodesic and the injectivity radius).
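A sketch of how such a contrastive training set can be assembled (my reading of the setup; the column-wise permutation scheme and the Gaussian width are illustrative assumptions):

```python
import numpy as np

def make_contrastive_batch(X):
    """Real rows get label 1; 'corrupted' rows, built by permuting each feature column
    independently (destroying the joint structure but keeping the marginals), get label 0."""
    X_fake = np.stack([np.random.permutation(X[:, j]) for j in range(X.shape[1])], axis=1)
    X_all = np.concatenate([X, X_fake], axis=0)
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_fake))])
    return X_all, y_all

def gaussian_head(f_out, sigma=1.0):
    """Final activation centered at zero: predicting label 1 on real data forces f ~ 0."""
    return np.exp(-f_out ** 2 / (2 * sigma ** 2))
```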

In condensed matter physics, KANs are applied to extract mobility edges in quasiperiodic models. For simpler models such as the Mosaic model and the generalized Aubry-André model, KANs, guided by user assumptions, can accurately extract mobility-edge functions or formulas close to the theoretical ground truths. For the more complex modified Aubry-André model, a human-KAN collaborative approach involving initial training, pruning, manual symbolic snapping based on visual inspection, and testing different symbolic hypotheses (e.g., a $\cosh(p)$ dependence) demonstrates how scientists can use KANs as a flexible tool to trade off simplicity against accuracy in discovered formulas, facilitating scientific discovery.

Potential limitations include the need for a deeper mathematical understanding of KANs beyond the original Kolmogorov-Arnold theorem, algorithmic aspects such as training efficiency (KANs are currently slower to train than MLPs, largely because of spline-evaluation overhead), and the exploration of hybrid architectures combining KANs and MLPs or using different basis functions. Nevertheless, the paper argues that KANs' interpretability and potential for better accuracy make them a valuable tool for AI + Science, serving, by the paper's analogy, as a kind of "language model" for functions in human-AI collaboration.

The decision to use KANs over MLPs depends on one's priorities: if training speed is paramount, MLPs may be preferable. If accuracy and interpretability are key, however, especially for small-to-medium-scale science and engineering problems where understanding the underlying relationships matters, KANs offer significant advantages despite slower training.
