Specializing Smaller Language Models towards Multi-Step Reasoning

Published 30 Jan 2023 in cs.CL, cs.AI, and cs.LG | (2301.12726v1)

Abstract: The surprising ability of LLMs to perform well on complex reasoning with only few-shot chain-of-thought prompts is believed to emerge only in very large-scale models (100+ billion parameters). We show that such abilities can, in fact, be distilled down from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B). We propose model specialization, to specialize the model's ability towards a target task. The hypothesis is that large models (commonly viewed as larger than 100B) have strong modeling power, but are spread on a large spectrum of tasks. Small models (commonly viewed as smaller than 10B) have limited model capacity, but if we concentrate their capacity on a specific target task, the model can achieve a decent improved performance. We use multi-step math reasoning as our testbed because it is a very typical emergent ability. We show two important aspects of model abilities: (1). there exists a very complex balance/ tradeoff between LLMs' multi-dimensional abilities; (2). by paying the price of decreased generic ability, we can clearly lift up the scaling curve of models smaller than 10B towards a specialized multi-step math reasoning ability. We further give comprehensive discussions about important design choices for better generalization, including the tuning data format, the start model checkpoint, and a new model selection method. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.


Authors (5)

Citations (197)


Summary

The paper demonstrates that fine-tuning smaller models with chain-of-thought data can achieve a 10-point accuracy improvement in multi-step math reasoning tasks.
It reveals a tradeoff where enhanced performance on specialized tasks comes at the expense of broader, general-purpose reasoning capabilities.
The study introduces an optimized distillation method using distribution matching to improve training convergence and model stability.

Specializing Smaller Language Models Towards Multi-Step Reasoning

In recent years, the expanding capabilities of LLMs in NLP have largely overshadowed smaller models. The study "Specializing Smaller Language Models towards Multi-Step Reasoning" explores how smaller models can be tuned to reproduce complex reasoning abilities that are often attributed solely to their larger counterparts. Its central question is whether the emergent reasoning ability typically seen in LLMs with 100+ billion parameters can be distilled into smaller models such as T5 variants with up to 11 billion parameters.

Key Findings and Methodological Innovations

The paper hypothesizes that large models (more than 100 billion parameters) have strong modeling power, but that this power is spread across a wide range of tasks. Smaller models (typically under 10 billion parameters) have limited capacity, but can perform well on a specific task if that capacity is concentrated on it. The study tests this on multi-step math reasoning, a well-defined task that is commonly treated as a typical emergent ability. Key findings and methodologies include:

Model Specialization: The paper shows that fine-tuning smaller models on chain-of-thought (CoT) data generated by a larger model significantly improves their multi-step reasoning. Specifically, it distills GPT-3.5 (≥ 175B) into smaller T5 models (≤ 11B) to concentrate their capacity on multi-step math reasoning.
Performance Tradeoffs: A central contribution is making the tradeoffs of specialization explicit. Specializing a smaller model yields a pronounced improvement on the target task, a gain of about 10 accuracy points on multi-step math reasoning. This comes at the cost of generic ability: performance on the BigBench Hard suite drops, reflecting a loss of broader CoT abilities.
Generalization and Data Formats: The experiments examine how the format of the tuning data (e.g., in-context versus zero-shot) affects model performance. The paper concludes that tuning on zero-shot data alone can improve the model's zero-shot performance but diminishes its ability to learn in context, so the data format needs careful consideration during training.
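Distillation Techniques: Distillation uses distribution matching rather than sample matching. That is, the student is trained to match the teacher's per-token output distribution instead of only the tokens the teacher sampled. Because the teacher and student use different tokenizers, their token sequences are first aligned via dynamic programming. The paper reports that this gives more stable convergence during training.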
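
The difference between the two distillation objectives can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names are hypothetical, the direction of the KL divergence (teacher to student) is an assumption, and the teacher and student positions are assumed to be already aligned over a shared toy vocabulary.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_matching_loss(student_logits, teacher_tokens):
    """Mean negative log-likelihood of the teacher's sampled tokens.

    Only the single token the teacher emitted at each position is used.
    """
    loss = 0.0
    for logits, token in zip(student_logits, teacher_tokens):
        probs = softmax(logits)
        loss -= math.log(probs[token])
    return loss / len(teacher_tokens)

def distribution_matching_loss(student_logits, teacher_probs):
    """Mean KL(teacher || student) over positions (direction is an assumption).

    The teacher's full per-position distribution is used as the target.
    """
    loss = 0.0
    for logits, p in zip(student_logits, teacher_probs):
        q = softmax(logits)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(teacher_probs)

# Toy example: one position, vocabulary of three tokens.
teacher_probs = [[0.7, 0.2, 0.1]]
student_logits = [[math.log(0.7), math.log(0.2), math.log(0.1)]]
kl = distribution_matching_loss(student_logits, teacher_probs)
nll = sample_matching_loss(student_logits, [0])
```

In this toy case the student already reproduces the teacher's distribution, so the distribution matching loss is (numerically) zero, while the sample matching loss on the sampled token stays positive. That gap is one intuition for why a distribution target can give a smoother training signal.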

Practical and Theoretical Implications

The research has both theoretical and practical implications:

Broadening Accessibility: Compressing complex reasoning skills into smaller models gives more researchers and practitioners access to capable models without the vast computational resources traditionally associated with large ones.
Revising Emergent Abilities: The research challenges the notion that certain reasoning abilities emerge only in large models. It shows that, with targeted specialization, smaller models can exhibit similar log-linear scaling behavior, which calls for a reevaluation of what counts as an emergent property.
Impact on Cross-Domain Applicability: The insights from specialized model training could further impact areas like education and content creation, where domain-specific reasoning is crucial.
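
As a reading aid for the log-linear scaling mentioned above, the relationship can be written as a short formula. The symbols $N$, $\alpha$, and $\beta$ are notation introduced here for illustration and do not come from the paper.

```latex
% Log-linear scaling: accuracy grows roughly linearly in the logarithm
% of the parameter count N (alpha and beta are illustrative constants).
\mathrm{Acc}(N) \approx \alpha + \beta \log N
```

Under this reading, "lifting up the scaling curve" for models below 10B corresponds to specialization raising the curve (a larger $\alpha$, a larger $\beta$, or both) on the target task, while accuracy on generic tasks falls.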

Future Directions

While this study takes significant strides in reframing the capabilities of smaller models, it also lays foundational groundwork for subsequent inquiries:

Exploration of Additional Task Specializations: Extending specialization to other domains beyond math reasoning could open new frontiers in specialized AI applications.
Integration with Auxiliary Techniques: Methods such as adding a calculator or applying self-consistency could further improve specialized models, and their combined effect on efficiency and accuracy warrants investigation.
Longitudinal Studies on Scalability: Longer-term studies focusing on how fine-tuning strategies evolve as models and tasks grow more complex could provide deeper insights into scaling specialized abilities efficiently.
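
For readers unfamiliar with self-consistency, the idea is to sample several reasoning chains for the same question and take a majority vote over their final answers. The sketch below is a minimal illustration, not something taken from the paper: the function names are hypothetical, and it assumes each chain ends with a phrase of the form "The answer is <number>".

```python
import re
from collections import Counter

# Assumed answer format: "... the answer is 18" (optionally with a "$" sign).
ANSWER_PATTERN = re.compile(r"answer is\s*\$?(-?\d+(?:\.\d+)?)", re.IGNORECASE)

def extract_answer(chain):
    """Return the final numeric answer in a reasoning chain, or None."""
    match = ANSWER_PATTERN.search(chain)
    return match.group(1) if match else None

def self_consistent_answer(chains):
    """Majority vote over the answers extracted from sampled chains."""
    answers = [a for a in (extract_answer(c) for c in chains) if a is not None]
    if not answers:
        return None
    # most_common(1) returns the answer with the highest vote count.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical sampled chains for one question.
chains = [
    "16 - 3 - 4 = 9 eggs. 9 * 2 = 18. The answer is 18.",
    "She sells 9 eggs at $2 each, so the answer is $18",
    "16 - 3 = 13. 13 * 2 = 26. The answer is 26.",
    "I am not sure how to solve this.",
]
voted = self_consistent_answer(chains)
```

Answers are compared as strings here, so "18" and "18.0" would count as different votes; a fuller version would normalize numbers before voting.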

In summary, this paper advocates for an innovative approach to maximizing the utility of smaller LLMs by strategically aligning their abilities with particular tasks. It encourages the AI community to look beyond the sheer scale of models and invest in the precision of their design and application.
