Emergent Abilities in Reduced-Scale Generative Language Models (2404.02204v1)
Abstract: Large language models (LLMs) can solve new tasks without task-specific fine-tuning. This ability, known as in-context learning (ICL), is considered an emergent ability and is primarily seen in LLMs with billions of parameters. This study investigates whether such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data. To explore this, we simplify the pre-training data and pre-train 36 causal language models ranging from 1 million to 165 million parameters. We show that models trained on this simplified pre-training data demonstrate enhanced zero-shot capabilities across various tasks in simplified language, achieving performance comparable to that of pre-trained models six times larger on unrestricted language. This suggests that downscaling the language allows zero-shot learning capabilities to emerge in models of limited size. Additionally, we find that these smaller models pre-trained on simplified data exhibit a power-law relationship between evaluation loss and the three scaling factors: compute, dataset size, and model size.
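The power-law relationship noted above follows the standard functional form used in the scaling-laws literature. As an illustrative sketch only (the exponents and critical constants below are placeholder symbols, not values reported in this paper), the evaluation loss as a function of each scaling factor can be written as

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C},
\]

where \(N\) is the number of model parameters, \(D\) the dataset size in tokens, and \(C\) the training compute; the exponents \(\alpha_N, \alpha_D, \alpha_C\) and constants \(N_c, D_c, C_c\) would be obtained by fitting the observed loss curves.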