Teaching Arithmetic to Small Transformers (2307.03381v1)
Abstract: Large language models (LLMs) such as GPT-4 exhibit emergent capabilities on general-purpose tasks, including basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition and multiplication, as well as elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and that simple formatting changes can significantly improve accuracy. These formatting changes lead to sharp phase transitions in accuracy as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that accounts for the particular characteristics of the next-token prediction objective for rapidly eliciting arithmetic capabilities.