AS-ES Learning: Towards Efficient CoT Learning in Small Models (2403.01969v1)

Published 4 Mar 2024 in cs.CL

Abstract: Chain-of-Thought (CoT) serves as a critical emerging ability in LLMs, especially when it comes to logical reasoning. Attempts have been made to induce such ability in small models as well by distilling from the data with CoT generated by LLMs. However, existing methods often simply generate and incorporate more data from LLMs and fail to note the importance of efficiently utilizing existing CoT data. We here propose a new training paradigm AS-ES (Abstractive Segments - Extractive Segments) learning, which exploits the inherent information in CoT for iterative generation. Experiments show that our methods surpass the direct seq2seq training on CoT-extensive tasks like MWP and PET summarization, without data augmentation or altering the model itself. Furthermore, we explore the reason behind the inefficiency of small models in learning CoT and provide an explanation of why AS-ES learning works, giving insights into the underlying mechanism of CoT.
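
The abstract names the AS-ES paradigm but does not spell out how a rationale is segmented or how the iterative generation is set up. The following is a minimal Python sketch of one plausible reading, assuming that extractive segments (ES) are CoT sentences that mostly restate tokens already present in the input and abstractive segments (AS) are sentences that contribute newly inferred content, and that each (question, CoT, answer) triple is unrolled into step-by-step seq2seq examples. The helpers split_as_es and build_iterative_examples, the token-overlap heuristic, and the target tagging format are illustrative assumptions, not the paper's actual procedure.

    import re

    def split_as_es(question, cot, overlap_threshold=0.6):
        """Partition a CoT rationale into (label, segment) pairs.

        Heuristic assumption: a sentence whose tokens largely overlap the
        question is treated as extractive (ES); otherwise abstractive (AS).
        """
        q_tokens = set(re.findall(r"\w+", question.lower()))
        segments = [s.strip() for s in re.split(r"(?<=[.!?])\s+", cot) if s.strip()]
        labeled = []
        for seg in segments:
            s_tokens = set(re.findall(r"\w+", seg.lower()))
            overlap = len(s_tokens & q_tokens) / max(len(s_tokens), 1)
            labeled.append(("ES" if overlap >= overlap_threshold else "AS", seg))
        return labeled

    def build_iterative_examples(question, cot, answer):
        """Unroll one (question, CoT, answer) triple into iterative seq2seq
        examples: at each step the model sees the question plus the segments
        produced so far and must generate the next segment, tagged with its
        AS/ES label (the tagging format is an illustrative choice)."""
        examples = []
        context = question
        for label, seg in split_as_es(question, cot):
            examples.append({"input": context, "target": f"[{label}] {seg}"})
            context = context + " " + seg
        examples.append({"input": context, "target": answer})
        return examples

    if __name__ == "__main__":
        q = "Tom has 3 boxes with 4 apples each. How many apples does he have?"
        cot = ("Tom has 3 boxes and each box has 4 apples. "
               "Therefore the total number of apples is 3 times 4, which equals 12.")
        for ex in build_iterative_examples(q, cot, "12"):
            print(ex)

In this reading, a small model trained on the unrolled pairs generates one segment at a time at inference, feeding each completed segment back into its input, which is one way to interpret the "iterative generation" described in the abstract; the actual segmentation criterion and training setup are those defined in the paper itself.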

Authors (6)
  1. Nuwa Xi (11 papers)
  2. Yuhan Chen (39 papers)
  3. Sendong Zhao (31 papers)
  4. Haochun Wang (17 papers)
  5. Bing Qin (186 papers)
  6. Ting Liu (329 papers)
