The Expressive Capacity of State Space Models: A Formal Language Perspective (2405.17394v2)
Abstract: Recently, recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers. However, there is little understanding of the in-principle abilities of such models, which could provide useful guidance to the search for better LM architectures. We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs. We find that SSMs and transformers have overlapping but distinct strengths. In star-free state tracking, SSMs implement straightforward and exact solutions to problems that transformers struggle to represent exactly. They can also model bounded hierarchical structure with optimal memory even without simulating a stack. On the other hand, we identify a design choice in current SSMs that limits their expressive power. We discuss implications for SSM and LM research, and verify results empirically on a recent SSM, Mamba.
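The claim that SSMs admit straightforward, exact solutions to star-free state tracking can be made concrete with a toy example. The sketch below is not the paper's construction: the Python encoding, the token names (set0/set1/ignore), and the one-dimensional gate/input tables are illustrative assumptions. It shows how a diagonal gated linear recurrence h_t = a(x_t) · h_{t-1} + b(x_t), the core update in selective SSMs such as Mamba, exactly tracks a flip-flop, a canonical star-free state-tracking problem: a write token overwrites the state, and an ignore token retains it.

```python
# Minimal sketch (illustrative, not the paper's construction): a one-dimensional
# gated linear recurrence h_t = a(x_t) * h_{t-1} + b(x_t), as in diagonal
# selective SSMs, exactly tracking a flip-flop. Token names are assumptions.

def flip_flop_ssm(tokens):
    # gate a: 0.0 on a write token (forget the old state), 1.0 on ignore (retain it)
    gate = {"set0": 0.0, "set1": 0.0, "ignore": 1.0}
    # input b: the newly written bit on a write token, 0.0 otherwise
    inp = {"set0": 0.0, "set1": 1.0, "ignore": 0.0}

    h = 0.0  # initial state: flip-flop starts at 0
    states = []
    for tok in tokens:
        h = gate[tok] * h + inp[tok]  # linear recurrence with input-dependent coefficients
        states.append(h)
    return states

# The state after each step is exactly the most recently written bit.
print(flip_flop_ssm(["set1", "ignore", "set0", "ignore", "ignore"]))
# -> [1.0, 1.0, 0.0, 0.0, 0.0]
```

Because each step's coefficients depend only on the current token, the update is associative and can be evaluated with a parallel scan, which is what makes this kind of exact state tracking cheap for SSM architectures.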
- In-context language learning: Architectures and algorithms. arXiv preprint arXiv:2401.12973, 2024.
- J. Almeida. Finite semigroups and universal algebra, volume 3. World Scientific, 1995.
- Masked hard-attention transformers and Boolean RASP recognize exactly the star-free languages. arXiv preprint arXiv:2310.13897, 2023.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Regular languages in NC¹. Journal of Computer and System Sciences, 44(3):478–499, 1992.
- On the ability and limitations of transformers to recognize formal languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7096–7116, 2020.
- On the distribution of deep clausal embeddings: A large cross-linguistic study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3938–3943, 2019.
- Quasi-recurrent neural networks. In International Conference on Learning Representations, 2016.
- D. Chiang and P. Cholak. Overcoming a theoretical limitation of self-attention. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7654–7664, 2022.
- Tighter bounds on the expressivity of transformer encoders. In International Conference on Machine Learning. PMLR, 2023.
- N. Chomsky. Syntactic structures, 1957.
- N. Chomsky and M. P. Schützenberger. The algebraic theory of context-free languages. In Studies in Logic and the Foundations of Mathematics, volume 35, pages 118–161. Elsevier, 1963.
- Language modeling with gated convolutional networks. In International conference on machine learning, pages 933–941. PMLR, 2017.
- Griffin: Mixing gated linear recurrences with local attention for efficient language models. CoRR, abs/2402.19427, 2024. doi: 10.48550/ARXIV.2402.19427. URL https://doi.org/10.48550/arXiv.2402.19427.
- Neural networks and the Chomsky hierarchy. In The Eleventh International Conference on Learning Representations, 2022.
- S. Eilenberg. Automata, languages, and machines. Academic press, 1974.
- J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
- Structures, not strings: linguistics as part of the cognitive sciences. Trends in cognitive sciences, 19(12):729–743, 2015.
- Counter machines and counter languages. Mathematical Systems Theory, 2(3):265–283, Sep 1968. ISSN 1433-0490. doi: 10.1007/BF01694011. URL https://doi.org/10.1007/BF01694011.
- Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=COZDy0WYGg.
- A. Ginzburg. Algebraic theory of automata. Academic Press, 1968.
- A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. CoRR, abs/2312.00752, 2023. doi: 10.48550/ARXIV.2312.00752. URL https://doi.org/10.48550/arXiv.2312.00752.
- Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2021.
- M. Hahn. Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8:156–171, 2020.
- M. Hahn and M. Rofin. Why are sensitive functions hard for transformers? CoRR, abs/2402.09963, 2024. doi: 10.48550/ARXIV.2402.09963. URL https://doi.org/10.48550/arXiv.2402.09963.
- Visibly counter languages and the structure of NC¹. In G. F. Italiano, G. Pighizzini, and D. Sannella, editors, Mathematical Foundations of Computer Science 2015 - 40th International Symposium, MFCS 2015, Milan, Italy, August 24-28, 2015, Proceedings, Part II, volume 9235 of Lecture Notes in Computer Science, pages 384–394. Springer, 2015. doi: 10.1007/978-3-662-48054-0_32. URL https://doi.org/10.1007/978-3-662-48054-0_32.
- RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, 2020.
- S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Introduction to automata theory, languages, and computation. ACM New York, NY, USA, 2001.
- B. Horne and D. Hush. Bounds on the complexity of recurrent neural network implementations of finite state machines. Advances in neural information processing systems, 6, 1993.
- P. Indyk. Optimal simulation of automata by neural nets. In Annual Symposium on Theoretical Aspects of Computer Science, pages 337–348. Springer, 1995.
- Repeat after me: Transformers are better than state space models at copying, 2024.
- R. E. Kalman. On the general theory of control systems. In Proceedings First International Conference on Automatic Control, Moscow, USSR, pages 481–492, 1960.
- R. E. Kalman. Mathematical description of linear dynamical systems. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 1(2):152–192, 1963.
- F. Karlsson. Constraints on multiple center-embedding of clauses. Journal of Linguistics, 43(2):365–392, 2007.
- S. Kleene. Representation of events in nerve nets and finite automata. In Automata Studies. 1951.
- Visibly counter languages and constant depth circuits. In E. W. Mayr and N. Ollinger, editors, 32nd International Symposium on Theoretical Aspects of Computer Science, STACS 2015, March 4-7, 2015, Garching, Germany, volume 30 of LIPIcs, pages 594–607. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015. doi: 10.4230/LIPICS.STACS.2015.594. URL https://doi.org/10.4230/LIPIcs.STACS.2015.594.
- K. Krohn and J. Rhodes. Algebraic theory of machines. i. prime decomposition theorem for finite semigroups and machines. Transactions of the American Mathematical Society, 116:450–464, 1965.
- Input-driven multi-counter automata. Theoretical Computer Science, 870:121–136, 2021.
- Simple recurrent units for highly parallelizable recurrence. arXiv preprint arXiv:1709.02755, 2017.
- Jamba: A hybrid Transformer-Mamba language model, 2024.
- Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016.
- Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022.
- Exposing attention glitches with flip-flop language modeling. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/510ad3018bbdc5b6e3b10646e2e35771-Abstract-Conference.html.
- R. McNaughton and S. A. Papert. Counter-Free Automata (MIT research monograph no. 65). The MIT Press, 1971.
- Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=5MkYIYCbva.
- W. Merrill and A. Sabharwal. A logic for expressing log-precision transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- The illusion of state in state-space models. In International Conference on Machine Learning, 2024.
- G. A. Miller and N. Chomsky. Finitary models of language users. 1963.
- M. Suzgun, Y. Belinkov, and S. M. Shieber. On evaluating the generalization of LSTM models in formal languages. In Proceedings of the Society for Computation in Linguistics, volume 2, pages 277–286, 2019. doi: 10.7275/s02b-4d91. URL https://openpublishing.library.umass.edu/scil/article/id/1167/.
- Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.
- On limitations of the transformer architecture. arXiv preprint arXiv:2402.08164, 2024.
- On the Turing completeness of modern neural network architectures. arXiv preprint arXiv:1901.03429, 2019.
- P. C. Phillips and V. Solo. Asymptotics for linear processes. The Annals of Statistics, pages 971–1001, 1992.
- HGRN2: Gated linear RNNs with state expansion. arXiv preprint arXiv:2404.07904, 2024a.
- Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems, 36, 2024b.
- J. Sakarovitch. Elements of automata theory. Cambridge university press, 2009.
- Representational strengths and limitations of transformers. Advances in Neural Information Processing Systems, 36, 2024.
- M. P. Schützenberger. On finite monoids having only trivial subgroups. Information and Control, 8(2):190–194, 1965.
- N. Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
- H. T. Siegelmann and E. D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50:132–150, 1995.
- H. T. Siegelmann. Neural networks and analog computation: beyond the Turing limit. Springer Science & Business Media, 1999.
- H. Straubing. Finite automata, formal logic, and circuit complexity. Birkhaeuser, 1994.
- L. Strobl. Average-hard attention transformers are constant-depth uniform threshold circuits, 2023.
- Transformers as recognizers of formal languages: A survey on expressivity. CoRR, abs/2311.00208, 2023. doi: 10.48550/ARXIV.2311.00208. URL https://doi.org/10.48550/arXiv.2311.00208.
- Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- M. Tomita. Dynamic construction of finite-state automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, pages 105–108, 1982.
- Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, 2018.
- A. Yang and D. Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers. arXiv preprint arXiv:2404.04393, 2024.
- Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635, 2023.
- Self-attention networks can process bounded hierarchical languages. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3770–3785, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.292. URL https://aclanthology.org/2021.acl-long.292.
- B. Zhang and R. Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.