The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Abstract

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with the step-by-step reasoning capability by instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (including only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to have better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report average improvements of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in improvements of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT utilizing demonstrations up to the maximum input length, by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.

CoT fine-tuning with the CoT Collection enhances accuracy on in-domain tasks.

Overview

  • The CoT Collection dataset is introduced to improve the reasoning abilities of smaller language models (LMs) through Chain-of-Thought (CoT) fine-tuning, providing 1.84 million rationales across 1,060 tasks.

  • This dataset aims to address the challenges smaller LMs face in zero-shot and few-shot learning scenarios, offering a substantial resource for instruction tuning with CoT rationales.

  • Evaluation of CoT fine-tuned Flan-T5 models on the BIG-Bench-Hard (BBH) benchmark and four domain-specific tasks showed significant improvements in task accuracy in both zero-shot and few-shot contexts.

  • The paper highlights the potential of the CoT Collection for enhancing LMs' reasoning capabilities and suggests further research into CoT prompting strategies for diverse languages and tasks.

Introduction to the CoT Collection

The CoT Collection dataset aims to enhance the reasoning capabilities of smaller language models (LMs) by enabling Chain-of-Thought (CoT) fine-tuning. It contributes 1.84 million rationales across 1,060 tasks, supplementing the existing Flan Collection, which includes only 9 CoT tasks. This extension is designed to close the gap smaller LMs face in zero-shot and few-shot learning relative to their larger counterparts.
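
As a concrete starting point, here is a minimal sketch of loading the released data and turning an instance into a CoT fine-tuning pair. The Hugging Face dataset id and field names are assumptions about the public release, and the rationale-then-answer target template is illustrative rather than the paper's exact format:

```python
# A minimal sketch, assuming the dataset is released on the Hugging Face Hub.
# The dataset id and field names below are assumptions, not verified.
from datasets import load_dataset

ds = load_dataset("kaist-ai/CoT-Collection", split="train")  # assumed id

def to_training_pair(example):
    """Build a seq2seq pair whose target emits the rationale first,
    then the final answer (the CoT fine-tuning setup)."""
    return {
        "input_text": example["source"],  # assumed field: instruction + input
        # Illustrative target template; the released data may join the
        # rationale and answer differently.
        "label_text": f"{example['rationale']} So the answer is {example['target']}.",
    }

train_pairs = ds.map(to_training_pair)
print(train_pairs[0]["input_text"][:200])
```

Fine-tuning a seq2seq model such as Flan-T5 on pairs like these, so that it learns to produce the rationale before the answer, is the CoT fine-tuning setup the paper evaluates.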

The Origin and Composition of the CoT Collection

The CoT Collection is introduced against the backdrop of prior work showing that CoT prompting is difficult to apply effectively to smaller LMs. Its inception is rooted in the need for a large-scale dataset for instruction tuning with CoT rationales, aimed at equipping smaller LMs with step-by-step reasoning ability. Distinctively, the dataset not only expands the Flan Collection's CoT data by orders of magnitude but also introduces rationales spanning a wide range of tasks, which is essential for generalization to unseen tasks.
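
The summary does not spell out how the 1.84 million rationales were produced. A common recipe for this kind of augmentation, sketched below under stated assumptions, is to prompt a strong teacher LM with a few worked rationales for each existing (instruction, input, answer) instance and keep a generated rationale only if it supports the gold answer. The `generate_with_teacher` callable and the field names are hypothetical stand-ins, not the paper's exact pipeline:

```python
# Hypothetical rationale-augmentation sketch. `generate_with_teacher` stands
# in for whatever teacher-LM API is used; field names are illustrative.

def build_prompt(demos, instruction, question, answer):
    """In-context prompt: a few worked rationales, then the new instance.
    Conditioning on the gold answer steers the teacher toward a rationale
    that actually supports it."""
    parts = [
        f"{d['instruction']}\n{d['question']}\n"
        f"Answer: {d['answer']}\nRationale: {d['rationale']}\n"
        for d in demos
    ]
    parts.append(f"{instruction}\n{question}\nAnswer: {answer}\nRationale:")
    return "\n".join(parts)

def augment(instances, demos, generate_with_teacher):
    """Generate one rationale per instance and keep only those that appear
    to support the gold answer. This is a crude quality filter; a stricter
    one would parse the rationale's concluding answer and compare exactly."""
    kept = []
    for ex in instances:
        prompt = build_prompt(demos, ex["instruction"], ex["question"], ex["answer"])
        rationale = generate_with_teacher(prompt)
        if ex["answer"].lower() in rationale.lower():
            kept.append({**ex, "rationale": rationale.strip()})
    return kept
```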

Evaluation and Key Findings

The paper reports the performance of CoT fine-tuned Flan-T5 models at two scales in both zero-shot and few-shot settings. Zero-shot evaluation on the BIG-Bench-Hard (BBH) benchmark shows average accuracy improvements of +4.34% for Flan-T5 3B and +2.60% for Flan-T5 11B. In few-shot tests across four domain-specific tasks, CoT fine-tuning yields further gains of +2.24% (3B) and +2.37% (11B), even outperforming ChatGPT with demonstrations packed up to the maximum input length by a +13.98% margin, demonstrating the efficacy of the approach in enhancing smaller LMs' reasoning capabilities.
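
To make the zero-shot protocol concrete, here is a minimal sketch of CoT-style evaluation on a BBH-like task: generate a rationale, extract the final answer, and score exact-match accuracy. The checkpoint id and the "the answer is" extraction pattern are assumptions, not the paper's actual evaluation harness:

```python
# Hedged sketch of zero-shot CoT evaluation; checkpoint id and answer
# format are assumptions.
import re
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "kaist-ai/CoT-T5-3B"  # assumed id for a CoT-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def answer_zero_shot(question: str) -> str:
    """Generate a rationale plus answer, then extract the answer span."""
    prompt = f"{question}\nLet's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    # Assumes the model ends its rationale with "... the answer is X".
    match = re.search(r"answer is\s*(.+)", text, flags=re.IGNORECASE)
    return (match.group(1) if match else text).strip().rstrip(".")

def accuracy(examples) -> float:
    """Exact-match accuracy over (input, target) pairs."""
    correct = sum(answer_zero_shot(ex["input"]) == ex["target"] for ex in examples)
    return correct / len(examples)
```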

Practical Implications and Future Directions

The paper underscores the potential of the CoT Collection to narrow the gap between smaller and larger LMs in reasoning and instruction-following capabilities. It opens avenues for further research into CoT prompting and fine-tuning strategies for a broader set of languages and tasks, especially in low-resource settings. The findings also challenge the predominant focus on model scale as the sole driver of performance improvement, highlighting the critical role of diverse training data in achieving generalization.

Conclusive Assessment

Overall, the CoT Collection and its associated findings make a significant contribution to ongoing efforts to refine the reasoning and learning capabilities of LMs. By demonstrating tangible improvements in both zero-shot and few-shot settings, the dataset not only serves as a valuable resource for further research but also emphasizes the importance of curated, task-diverse training data in unlocking the full potential of smaller LMs.
