Large Language Models Are Reasoning Teachers (2212.10071v2)
Abstract: Recent works have shown that chain-of-thought (CoT) prompting can elicit LLMs to solve complex reasoning tasks step by step. However, prompt-based CoT methods depend on very large models such as GPT-3 175B, which are prohibitively costly to deploy at scale. In this paper, we use these large models as reasoning teachers to enable complex reasoning in smaller models and reduce model size requirements by several orders of magnitude. We propose Fine-tune-CoT, a method that generates reasoning samples from very large teacher models to fine-tune smaller models. We evaluate our method on a wide range of public models and complex tasks. We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model on many tasks. Additionally, we extend our method by leveraging the teacher model's ability to generate multiple distinct rationales for each original sample. Enriching the fine-tuning data with such diverse reasoning yields a substantial performance boost across datasets, even for very small models. We conduct ablations and sample studies to understand the emergence of reasoning capabilities in student models. Our code implementation and data are available at https://github.com/itsnamgyu/reasoning-teacher.
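The data-generation recipe described in the abstract, namely prompting a large teacher with zero-shot CoT, keeping rationales that reach the correct answer, and turning them into fine-tuning examples for a small student, can be sketched roughly as below. This is a minimal illustration under assumptions, not the authors' released code: `query_teacher`, `extract_answer`, and the prompt/completion delimiters are hypothetical placeholders, and the exact prompting and formatting details follow the general zero-shot-CoT style rather than the paper's precise templates.

```python
from typing import Callable, Dict, List


def extract_answer(rationale: str) -> str:
    # Hypothetical answer parser: take the last whitespace-separated token.
    # Real extraction is task-specific (numbers, multiple-choice letters, etc.).
    tokens = rationale.strip().split()
    return tokens[-1].rstrip(".") if tokens else ""


def generate_cot_samples(
    dataset: List[Dict[str, str]],            # each item: {"question": ..., "answer": ...}
    query_teacher: Callable[..., List[str]],  # returns n sampled completions per prompt
    rationales_per_sample: int = 8,           # >1 gives the "diverse reasoning" variant
) -> List[Dict[str, str]]:
    """Collect teacher rationales that reach the gold answer (Fine-tune-CoT style)."""
    finetune_data = []
    for item in dataset:
        # Zero-shot CoT prompt: ask the teacher to reason step by step.
        prompt = f"Q: {item['question']}\nA: Let's think step by step."
        completions = query_teacher(
            prompt,
            n=rationales_per_sample,  # multiple distinct rationales per question
            temperature=0.7,          # sampling encourages rationale diversity
        )
        for rationale in completions:
            # Keep only rationales whose final answer matches the gold label.
            if extract_answer(rationale) == item["answer"]:
                finetune_data.append(
                    {
                        # Illustrative prompt/completion format; the actual
                        # fine-tuning template is an assumption here.
                        "prompt": f"{item['question']} ###",
                        "completion": f" {rationale.strip()} END",
                    }
                )
    return finetune_data
```

The resulting prompt/completion pairs would then feed a standard supervised fine-tuning run on the smaller student model; the abstract's "diverse reasoning" extension corresponds to sampling several correct rationales per question rather than just one.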