RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval (2402.18510v4)
Abstract: This paper investigates the gap in representation power between Recurrent Neural Networks (RNNs) and Transformers in the context of solving algorithmic problems. We focus on understanding whether RNNs, known for their memory efficiency in handling long sequences, can match the performance of Transformers, particularly when enhanced with Chain-of-Thought (CoT) prompting. Our theoretical analysis reveals that CoT improves RNNs but is insufficient to close the gap with Transformers. A key bottleneck lies in the inability of RNNs to perfectly retrieve information from the context, even with CoT: for several tasks that explicitly or implicitly require this capability, such as associative recall and determining whether a graph is a tree, we prove that RNNs are not expressive enough to solve them, while Transformers solve them with ease. Conversely, we prove that adopting techniques to enhance the in-context retrieval capability of RNNs, including Retrieval-Augmented Generation (RAG) and adding a single Transformer layer, enables RNNs to solve all polynomial-time solvable problems with CoT, closing the representation gap with Transformers.
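To make the in-context retrieval bottleneck concrete, the sketch below (our illustration, not code from the paper) sets up the associative recall task the abstract mentions: the context lists key-value pairs, then a query key, and the model must output the matching value. The exact-lookup function stands in for attention, which can address every position in the context, while the bounded-state function loosely mimics an RNN's constant-size memory. All names and parameters (`make_associative_recall`, `state_size`, etc.) are ours and purely illustrative.

```python
# Minimal sketch (assumptions ours, not from the paper) of the associative
# recall task: given key-value pairs followed by a query key, emit the value
# that was paired with that key.
import random


def make_associative_recall(n_pairs: int, vocab: int = 100, seed: int = 0):
    """Build one instance: a list of (key, value) pairs, a query key, and the target value."""
    rng = random.Random(seed)
    keys = rng.sample(range(vocab), n_pairs)        # distinct keys
    values = [rng.randrange(vocab) for _ in keys]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    return list(zip(keys, values)), query, target


def retrieve_with_full_context(pairs, query):
    """Attention-like retrieval: every pair in the context stays addressable,
    so the lookup is exact (memory grows with context length)."""
    table = dict(pairs)
    return table[query]


def retrieve_with_bounded_state(pairs, query, state_size: int = 4):
    """Streaming retrieval with a fixed-size state, loosely mimicking an RNN's
    constant-memory constraint: only the last `state_size` pairs survive, so
    recall fails once the context outgrows the state."""
    state = {}
    for k, v in pairs:
        state[k] = v
        if len(state) > state_size:        # evict the oldest entry to stay bounded
            state.pop(next(iter(state)))
    return state.get(query)                # may be None: the pair was forgotten


if __name__ == "__main__":
    pairs, query, target = make_associative_recall(n_pairs=32)
    print("exact retrieval correct:", retrieve_with_full_context(pairs, query) == target)
    print("bounded state correct:  ", retrieve_with_bounded_state(pairs, query) == target)
```

Running the script shows that exact lookup always recovers the target, whereas the bounded-state reader typically fails once the number of pairs exceeds its state budget, which is the intuition behind the retrieval gap the abstract describes.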
Authors: Kaiyue Wen, Xingyu Dang, Kaifeng Lyu