Large Language Models Are Reasoning Teachers

(2212.10071)
Published Dec 20, 2022 in cs.CL , cs.AI , and cs.LG

Abstract

Recent works have shown that chain-of-thought (CoT) prompting can elicit step-by-step solutions to complex reasoning tasks from language models. However, prompt-based CoT methods depend on very large models such as GPT-3 175B, which are prohibitive to deploy at scale. In this paper, we use these large models as reasoning teachers to enable complex reasoning in smaller models and reduce model size requirements by several orders of magnitude. We propose Fine-tune-CoT, a method that generates reasoning samples from very large teacher models to fine-tune smaller models. We evaluate our method on a wide range of public models and complex tasks. We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model on many tasks. Additionally, we extend our method by leveraging the teacher model's ability to generate multiple distinct rationales for each original sample. Enriching the fine-tuning data with such diverse reasoning yields a substantial performance boost across datasets, even for very small models. We conduct ablations and sample studies to understand the emergence of reasoning capabilities in student models. Our code implementation and data are available at https://github.com/itsnamgyu/reasoning-teacher.
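The data pipeline described in the abstract can be sketched as follows: a large teacher model produces several step-by-step rationales per training question, rationales whose final answer disagrees with the gold label are discarded, and the survivors become (prompt, completion) pairs for fine-tuning a small student. This is a minimal sketch, not the authors' exact implementation; the teacher call is stubbed out, and the `###`/`END` formatting markers are illustrative assumptions.

```python
def teacher_generate(question, n_rationales=4):
    """Stub standing in for a large teacher model.

    A real implementation would prompt the teacher zero-shot with
    something like "Q: {question}\nA: Let's think step by step." and
    sample n_rationales diverse completions at temperature > 0.
    Returns (rationale, final_answer) pairs.
    """
    return [
        ("There are 3 pairs and each pair has 2 shoes, so 3 * 2 = 6.", "6"),
        ("Counting 2 shoes per pair across 3 pairs gives 6 shoes.", "6"),
        ("Adding 3 and 2 gives 5 shoes.", "5"),  # wrong: gets filtered out
    ][:n_rationales]


def build_finetune_samples(dataset, n_rationales=4):
    """Keep only rationales whose final answer matches the gold label,
    then format them as (prompt, completion) pairs for the student.
    Multiple surviving rationales per question give the 'diverse
    reasoning' data enrichment mentioned in the abstract."""
    samples = []
    for question, gold_answer in dataset:
        for rationale, answer in teacher_generate(question, n_rationales):
            if answer == gold_answer:  # filter by answer correctness
                prompt = f"{question} ###"
                completion = f" {rationale} --> {answer} END"
                samples.append((prompt, completion))
    return samples


dataset = [("Sam has 3 pairs of shoes. How many shoes does he have?", "6")]
samples = build_finetune_samples(dataset)
```

With the stub above, the single question yields two fine-tuning samples (the two rationales ending in the correct answer "6"); the incorrect rationale is dropped. In practice the filtered pairs would be written to a JSONL file and passed to a standard fine-tuning routine for the student model.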

