The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

Abstract

Language models (LMs) with fewer than 100B parameters are known to perform poorly on chain-of-thought (CoT) reasoning, in contrast to large LMs, when solving unseen tasks. In this work, we aim to equip smaller LMs with the step-by-step reasoning capability by instruction tuning with CoT rationales. To achieve this goal, we first introduce a new instruction-tuning dataset called the CoT Collection, which augments the existing Flan Collection (including only 9 CoT tasks) with an additional 1.84 million rationales across 1,060 tasks. We show that CoT fine-tuning Flan-T5 (3B & 11B) with the CoT Collection enables smaller LMs to have better CoT capabilities on unseen tasks. On the BIG-Bench-Hard (BBH) benchmark, we report average improvements of +4.34% (Flan-T5 3B) and +2.60% (Flan-T5 11B) in zero-shot task accuracy. Furthermore, we show that instruction tuning with the CoT Collection gives LMs stronger few-shot learning capabilities on 4 domain-specific tasks, resulting in improvements of +2.24% (Flan-T5 3B) and +2.37% (Flan-T5 11B), even outperforming ChatGPT utilizing demonstrations up to the maximum input length, by a +13.98% margin. Our code, the CoT Collection data, and model checkpoints are publicly available.

CoT fine-tuning with the CoT Collection enhances accuracy on in-domain tasks.

Overview

  • The CoT Collection dataset is introduced to improve the reasoning abilities of smaller language models (LMs) through Chain-of-Thought (CoT) fine-tuning, providing 1.84 million rationales across 1,060 tasks.

  • This dataset aims to address the challenges smaller LMs face in zero-shot and few-shot learning scenarios, offering a substantial resource for instruction tuning with CoT rationales.

  • Evaluation of CoT fine-tuned Flan-T5 models on the BIG-Bench-Hard (BBH) benchmark and four domain-specific tasks showed significant improvements in task accuracy in both zero-shot and few-shot contexts.

  • The paper highlights the potential of the CoT Collection for enhancing LMs' reasoning capabilities and suggests further research into CoT prompting strategies for diverse languages and tasks.

Introduction to the CoT Collection

The CoT Collection dataset aims to enhance the reasoning capabilities of smaller language models (LMs) by enabling Chain-of-Thought (CoT) fine-tuning. It contributes 1.84 million rationales across 1,060 tasks, supplementing the existing Flan Collection, which includes only 9 CoT tasks. This extension is designed to close the gap smaller LMs face in zero-shot and few-shot learning relative to their larger counterparts.
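
As a concrete starting point, here is a minimal sketch of loading the released data and turning an instance into a CoT fine-tuning pair. The Hugging Face dataset id and field names are assumptions about the public release, and the rationale-then-answer target template is illustrative rather than the paper's exact format:

```python
# A minimal sketch, assuming the dataset is released on the Hugging Face Hub.
# The dataset id and field names below are assumptions, not verified.
from datasets import load_dataset

ds = load_dataset("kaist-ai/CoT-Collection", split="train")  # assumed id

def to_training_pair(example):
    """Build a seq2seq pair whose target emits the rationale first,
    then the final answer (the CoT fine-tuning setup)."""
    return {
        "input_text": example["source"],  # assumed field: instruction + input
        # Illustrative target template; the released data may join the
        # rationale and answer differently.
        "label_text": f"{example['rationale']} So the answer is {example['target']}.",
    }

train_pairs = ds.map(to_training_pair)
print(train_pairs[0]["input_text"][:200])
```

Fine-tuning a seq2seq model such as Flan-T5 on pairs like these, so that it learns to produce the rationale before the answer, is the CoT fine-tuning setup the paper evaluates.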

The Origin and Composition of the CoT Collection

The CoT Collection is introduced against the backdrop of prior work showing that CoT prompting is difficult to apply effectively to smaller LMs. Its inception is rooted in the need for a large-scale dataset for instruction tuning with CoT rationales, aimed at equipping smaller LMs with step-by-step reasoning ability. Distinctively, the dataset not only expands the Flan Collection's CoT data by orders of magnitude but also introduces rationales spanning a wide range of tasks, which is essential for generalization to unseen tasks.
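
The summary does not spell out how the 1.84 million rationales were produced. A common recipe for this kind of augmentation, sketched below under stated assumptions, is to prompt a strong teacher LM with a few worked rationales for each existing (instruction, input, answer) instance and keep a generated rationale only if it supports the gold answer. The `generate_with_teacher` callable and the field names are hypothetical stand-ins, not the paper's exact pipeline:

```python
# Hypothetical rationale-augmentation sketch. `generate_with_teacher` stands
# in for whatever teacher-LM API is used; field names are illustrative.

def build_prompt(demos, instruction, question, answer):
    """In-context prompt: a few worked rationales, then the new instance.
    Conditioning on the gold answer steers the teacher toward a rationale
    that actually supports it."""
    parts = [
        f"{d['instruction']}\n{d['question']}\n"
        f"Answer: {d['answer']}\nRationale: {d['rationale']}\n"
        for d in demos
    ]
    parts.append(f"{instruction}\n{question}\nAnswer: {answer}\nRationale:")
    return "\n".join(parts)

def augment(instances, demos, generate_with_teacher):
    """Generate one rationale per instance and keep only those that appear
    to support the gold answer. This is a crude quality filter; a stricter
    one would parse the rationale's concluding answer and compare exactly."""
    kept = []
    for ex in instances:
        prompt = build_prompt(demos, ex["instruction"], ex["question"], ex["answer"])
        rationale = generate_with_teacher(prompt)
        if ex["answer"].lower() in rationale.lower():
            kept.append({**ex, "rationale": rationale.strip()})
    return kept
```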

Evaluation and Key Findings

The paper reports the performance of CoT fine-tuned Flan-T5 models at two scales in both zero-shot and few-shot settings. Zero-shot evaluation on the BIG-Bench-Hard (BBH) benchmark shows average accuracy improvements of +4.34% for Flan-T5 3B and +2.60% for Flan-T5 11B. In few-shot tests across four domain-specific tasks, CoT fine-tuning yields further gains of +2.24% (3B) and +2.37% (11B), even outperforming ChatGPT with demonstrations packed up to the maximum input length by a +13.98% margin, demonstrating the efficacy of the approach in enhancing smaller LMs' reasoning capabilities.
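
To make the zero-shot protocol concrete, here is a minimal sketch of CoT-style evaluation on a BBH-like task: generate a rationale, extract the final answer, and score exact-match accuracy. The checkpoint id and the "the answer is" extraction pattern are assumptions, not the paper's actual evaluation harness:

```python
# Hedged sketch of zero-shot CoT evaluation; checkpoint id and answer
# format are assumptions.
import re
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "kaist-ai/CoT-T5-3B"  # assumed id for a CoT-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def answer_zero_shot(question: str) -> str:
    """Generate a rationale plus answer, then extract the answer span."""
    prompt = f"{question}\nLet's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    # Assumes the model ends its rationale with "... the answer is X".
    match = re.search(r"answer is\s*(.+)", text, flags=re.IGNORECASE)
    return (match.group(1) if match else text).strip().rstrip(".")

def accuracy(examples) -> float:
    """Exact-match accuracy over (input, target) pairs."""
    correct = sum(answer_zero_shot(ex["input"]) == ex["target"] for ex in examples)
    return correct / len(examples)
```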

Practical Implications and Future Directions

The paper underscores the potential of the CoT Collection to narrow the gap between smaller and larger LMs in reasoning and instruction-following capabilities. It opens avenues for further research into CoT prompting and fine-tuning strategies for a broader set of languages and tasks, especially in low-resource settings. The findings also challenge the predominant focus on model scale as the sole driver of performance improvement, highlighting the critical role of diverse training data in achieving generalization.

Conclusive Assessment

Overall, the CoT Collection and its associated findings make a significant contribution to ongoing efforts to refine the reasoning and learning capabilities of LMs. By demonstrating tangible improvements in both zero-shot and few-shot settings, the dataset not only serves as a valuable resource for further research but also emphasizes the importance of curated, task-diverse training data in unlocking the full potential of smaller LMs.
