Observational Scaling Laws and the Predictability of Language Model Performance (2405.10938v3)
Abstract: Understanding how LLM performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models at many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~100 publicly available models. Building a single scaling law from multiple model families is challenging because of large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law in which LLM performance is a function of a low-dimensional capability space, and model families vary only in their efficiency at converting training compute into capabilities. Using this approach, we demonstrate the surprising predictability of complex scaling phenomena: several emergent phenomena follow a smooth, sigmoidal curve and are predictable from small models; the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and the impact of post-training interventions such as Chain-of-Thought and Self-Consistency can be predicted as LLM capabilities continue to improve.
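The core idea above — that many benchmark scores collapse onto a low-dimensional capability space, and that downstream (even "emergent") metrics are smooth sigmoidal functions of position in that space — can be illustrated with a small sketch. This is not the paper's actual fitting procedure; it is a minimal illustration on synthetic data, where the latent capability, the benchmark generator, and all parameter values are assumptions chosen for the example. It extracts a capability score as the first principal component of a model-by-benchmark accuracy matrix, then fits a sigmoid from that score to a held-out downstream metric:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 40 models, each with a scalar latent "capability" s.
# Benchmark accuracies are noisy sigmoids of s, loosely mirroring the
# low-dimensional capability-space assumption described in the abstract.
n_models, n_benchmarks = 40, 6
s = rng.uniform(-3, 3, n_models)                       # latent capability (unobserved)
shifts = rng.uniform(-1, 1, n_benchmarks)              # per-benchmark difficulty offsets
X = 1 / (1 + np.exp(-(s[:, None] - shifts[None, :])))  # accuracies in (0, 1)
X += rng.normal(0, 0.02, X.shape)                      # observation noise

# Step 1: extract a capability measure as the first principal component
# of the centered model-by-benchmark accuracy matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                                       # one score per model

# Step 2: predict a downstream ("emergent-looking") metric by regressing
# its log-odds linearly on the capability score, i.e. a sigmoidal fit.
y = 1 / (1 + np.exp(-(1.5 * s - 0.5)))                 # downstream accuracy
y = np.clip(y, 1e-4, 1 - 1e-4)
b, a = np.polyfit(pc1, np.log(y / (1 - y)), 1)         # slope b, intercept a
y_hat = 1 / (1 + np.exp(-(a + b * pc1)))               # sigmoidal prediction

print(f"|corr(pc1, latent capability)| = {abs(np.corrcoef(pc1, s)[0, 1]):.3f}")
print(f"corr(predicted, true metric)   = {np.corrcoef(y_hat, y)[0, 1]:.3f}")
```

The design choice here — fitting in log-odds space rather than on raw accuracy — is what makes sharp-looking capability jumps appear as smooth, extrapolatable sigmoids of the capability score.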