Emergent Abilities in Reduced-Scale Generative Language Models

(2404.02204)
Published Apr 2, 2024 in cs.CL and cs.LG

Abstract

Large language models can solve new tasks without task-specific fine-tuning. This ability, also known as in-context learning (ICL), is considered an emergent ability and is primarily seen in LLMs with billions of parameters. This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify pre-training data and pre-train 36 causal language models with parameters ranging from 1 million to 165 million. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models with limited size. Additionally, we find that these smaller models pre-trained on simplified data demonstrate a power law relationship between the evaluation loss and the three scaling factors: compute, dataset size, and model size.

Filtering the SlimPajama dataset with the AO-Childes vocabulary to pre-train simplified models and compare their performance.

Overview

  • This study explores whether emergent abilities in language models, typically seen in large models, can be unlocked in smaller models by simplifying pre-training data.

  • 36 causal language models with parameter counts ranging from 1 million to 165 million were pre-trained on a dataset simplified to resemble child-directed speech, and they demonstrated zero-shot capabilities comparable to those of much larger models.

  • Training used a tokenizer built on the simplified dataset and an effective batch size scaled to the token count, and evaluations distinguished zero-shot from few-shot capabilities.

  • Findings suggest smaller models can exhibit emergent abilities with simplified pre-training, offering a cost-effective alternative and opening new research avenues in model scaling and in-context learning.

Introduction

The capability of LLMs to perform in-context learning (ICL) without fine-tuning has spurred significant interest. This feature, predominantly observed in billion-parameter models, raises the question: can emergent abilities be unlocked in smaller models through simplified pre-training data? This study explores that question by pre-training 36 causal language models with parameter counts ranging from 1 million to 165 million on a simplified English dataset. The results indicate that smaller models, when trained on simplified data, exhibit zero-shot capabilities on par with models six times their size trained on comprehensive datasets.

Simplifying Pre-training Data

The fundamental approach involved filtering existing pre-training corpora to adhere to a simplified vocabulary based on child-directed speech, resulting in a dataset consisting predominantly of simple linguistic structures. The SlimPajama dataset served as the basis for this simplified corpus and was filtered against the defined vocabulary so that the out-of-vocabulary rate remained minimal.
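
A minimal sketch of this kind of vocabulary-constrained filtering is shown below. It assumes a plain-text word list derived from child-directed speech and a line-per-document corpus file; the file names, the crude tokenization, and the out-of-vocabulary threshold are illustrative assumptions, not the authors' exact pipeline.

```python
import re

OOV_THRESHOLD = 0.01  # assumed tolerance for out-of-vocabulary tokens (illustrative)


def load_vocabulary(path):
    """Load a lowercase word list (e.g., derived from child-directed speech)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}


def oov_rate(text, vocab):
    """Fraction of crudely tokenized words that fall outside the vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 1.0
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)


def filter_corpus(in_path, out_path, vocab):
    """Keep only documents whose out-of-vocabulary rate stays below the threshold."""
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for doc in src:
            if oov_rate(doc, vocab) <= OOV_THRESHOLD:
                dst.write(doc)
                kept += 1
    return kept


if __name__ == "__main__":
    # Hypothetical file names, used only to illustrate the flow.
    vocab = load_vocabulary("ao_childes_vocab.txt")
    n = filter_corpus("slimpajama_shard.txt", "simplified_shard.txt", vocab)
    print(f"kept {n} documents")
```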

Pre-training and Evaluation

Models spanning 1M to 165M parameters were trained with a tokenizer built on the slimmed-down dataset and an effective batch size adjusted to the token count. The evaluation covered a broad spectrum of tasks, distinguishing between zero-shot and few-shot capabilities, and used both a standard and a simplified variant of these tasks for a comprehensive analysis.
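
Zero-shot evaluation of a causal language model on classification-style tasks is commonly done by comparing the likelihood the model assigns to each candidate completion. The sketch below illustrates that scoring scheme with the Hugging Face transformers API; the model name ("gpt2") and the prompt format are placeholders rather than the paper's exact setup, and it assumes candidates begin with a space so the prompt/candidate token boundary splits cleanly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder checkpoint, not one of the paper's simplified models

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def candidate_logprob(prompt, candidate):
    """Sum of log-probabilities the model assigns to the candidate tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Position i predicts token i+1, so shift logits/targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict candidate tokens.
    n_candidate = full_ids.size(1) - prompt_ids.size(1)
    return token_lp[0, -n_candidate:].sum().item()


def zero_shot_classify(prompt, candidates):
    """Pick the candidate completion with the highest likelihood under the model."""
    return max(candidates, key=lambda c: candidate_logprob(prompt, c))


print(zero_shot_classify("The movie was wonderful. The review is", [" positive", " negative"]))
```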

Findings

  • Zero-Shot Learning Capabilities: Simplified models demonstrated enhanced zero-shot learning capabilities across various tasks in simplified language, suggesting that tailoring the complexity of the language allows smaller models to exhibit emergent abilities.
  • Model Scaling and Performance: An observed power law relationship between evaluation loss and the scaling factors (compute, dataset size, and model size) was consistent with findings from larger models, indicating predictable performance improvements with increasing scale even in a simplified language setting (see the fitting sketch after this list).
  • Comparative Performance: Simplified models, particularly the Simple 165M model, showed zero-shot performance on simplified datasets that was comparable or superior to that of larger counterparts trained on comprehensive datasets.
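
The shape of such a power law can be made concrete with a small fit. The sketch below uses made-up (model size, loss) pairs purely for illustration, not the paper's measurements, and fits L(N) ≈ a·N^(−α) by linear regression in log-log space.

```python
import numpy as np

# Hypothetical (model size in parameters, evaluation loss) pairs for illustration only;
# the paper reports actual measurements for its 36 simplified models.
model_sizes = np.array([1e6, 5e6, 20e6, 60e6, 165e6])
eval_losses = np.array([4.9, 4.2, 3.7, 3.4, 3.2])

# Fit L(N) ~ a * N^(-alpha) as a straight line in log-log space.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(eval_losses), deg=1)
alpha, a = -slope, np.exp(intercept)
print(f"L(N) ~ {a:.2f} * N^(-{alpha:.3f})")

# Predict the loss for an intermediate model size from the fitted curve.
N = 100e6
print(f"predicted loss at {N:.0e} params: {a * N ** (-alpha):.2f}")
```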

Implications and Future Directions

The study sheds light on the potential of simplifying pre-training data as a viable strategy for eliciting emergent abilities in smaller models. This not only has ramifications for reducing computational costs but also opens new avenues for research into the mechanisms behind in-context learning and the bounds of model scaling. Future investigations could explore the effects of further data simplification, integration with model distillation techniques, and the efficacy of simplified models in specific application domains.

Conclusions

This study posits that the emergent abilities typically reserved for LLMs can be accessed by smaller models through the strategic simplification of pre-training data. The implications of this are twofold: firstly, it highlights the adaptability and potential of smaller models in capturing complex language phenomena; secondly, it proposes a cost-effective alternative to the prevailing trend of scaling up model size for achieving advanced linguistic capabilities. As such, this research contributes valuable insights into the ongoing dialogue on effective and efficient ways to enhance the performance of generative language models.
