Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875v4)
Abstract: Instructing the model to generate a sequence of intermediate steps, a.k.a. a chain of thought (CoT), is a highly effective method for improving the accuracy of LLMs on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length $n$, previous works have shown that constant-depth transformers with finite precision and $\mathsf{poly}(n)$ embedding size can only solve problems in $\mathsf{TC}^0$ without CoT. We first show an even tighter expressiveness upper bound: constant-depth transformers with constant-bit precision can only solve problems in $\mathsf{AC}^0$, a proper subset of $\mathsf{TC}^0$. However, with $T$ steps of CoT, constant-depth transformers using constant-bit precision and $O(\log n)$ embedding size can solve any problem solvable by Boolean circuits of size $T$. Empirically, enabling CoT dramatically improves accuracy on tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and the circuit value problem, especially for low-depth transformers.
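To make the notion of an inherently serial task concrete, below is a minimal sketch of the permutation-composition task mentioned in the abstract; the task format, function names, and group size ($S_5$) are illustrative assumptions, not taken from the paper's experimental setup. The running products play the role of the intermediate CoT steps: each one is a cheap update of the previous one, whereas producing the final product directly is hard for shallow parallel computation (the word problem over $S_5$ is $\mathsf{NC}^1$-complete by Barrington's theorem).

```python
# Toy sketch of the permutation-composition task (illustrative, not the paper's code):
# given a word g_1, ..., g_T over S_5, compute the product g_1 g_2 ... g_T.
import random

def random_permutation(k: int = 5) -> tuple:
    """Return a uniformly random permutation of {0, ..., k-1} as a tuple."""
    p = list(range(k))
    random.shuffle(p)
    return tuple(p)

def compose(p: tuple, q: tuple) -> tuple:
    """Composition p ∘ q, i.e. (p ∘ q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(len(q)))

def cot_trace(word: list) -> list:
    """Running products g1, g1·g2, ..., mimicking a chain-of-thought trace:
    each intermediate permutation depends only on the previous state and the next input."""
    acc = tuple(range(len(word[0])))  # identity permutation
    trace = []
    for g in word:
        acc = compose(acc, g)
        trace.append(acc)
    return trace

if __name__ == "__main__":
    random.seed(0)
    word = [random_permutation() for _ in range(8)]  # a word of length 8 over S_5
    trace = cot_trace(word)
    print("intermediate CoT steps:", trace[:-1])
    print("final product (answer):", trace[-1])
```

In the paper's framing, the $T$ running products correspond to the $T$ CoT tokens a decoder-only transformer would emit: with the trace available, each step is a constant-size local computation, while answering without the trace requires the whole serial product in one forward pass.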