gzip Predicts Data-dependent Scaling Laws (2405.16684v1)
Abstract: Past work has established scaling laws that predict the performance of a neural language model (LM) as a function of its parameter count and the number of tokens it is trained on, enabling optimal allocation of a fixed compute budget. Are these scaling laws agnostic to training data, as some prior work suggests? We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG, finding that 1) scaling laws are sensitive to differences in data complexity and 2) gzip, a compression algorithm, is an effective predictor of how data complexity impacts scaling properties. We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility; its compute-optimal frontier shifts toward preferring dataset size over parameter count as training data becomes harder to compress.
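The abstract describes the method only at a high level; the sketch below is an illustrative assumption rather than the paper's actual grammars or measurement code. It samples sentences from a small NLTK PCFG (a stand-in for the paper's complexity-modulated grammars) and scores the resulting corpus by its gzip compression ratio, the data-complexity signal the proposed scaling law conditions on. The `sample_sentence` helper, the toy grammar, and the 1,000-sentence corpus size are all hypothetical choices.

```python
import gzip
import random

from nltk import PCFG
from nltk.grammar import Nonterminal

# Toy grammar standing in for the paper's PCFGs; the real grammars vary
# syntactic properties to control data complexity.
TOY_GRAMMAR = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.7] | Det N PP [0.3]
VP -> V NP [0.8] | V NP PP [0.2]
PP -> P NP [1.0]
Det -> 'the' [0.6] | 'a' [0.4]
N -> 'cat' [0.5] | 'dog' [0.5]
V -> 'saw' [0.5] | 'chased' [0.5]
P -> 'with' [1.0]
""")


def sample_sentence(grammar, symbol=None):
    """Sample one sentence by recursively expanding productions by their probabilities."""
    symbol = grammar.start() if symbol is None else symbol
    productions = grammar.productions(lhs=symbol)
    chosen = random.choices(productions, weights=[p.prob() for p in productions])[0]
    parts = [
        sample_sentence(grammar, sym) if isinstance(sym, Nonterminal) else sym
        for sym in chosen.rhs()
    ]
    return " ".join(parts)


def gzip_ratio(text: str) -> float:
    """Compressed size over raw size; higher values mean harder-to-compress data."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


if __name__ == "__main__":
    corpus = "\n".join(sample_sentence(TOY_GRAMMAR) for _ in range(1000))
    print(f"gzip compressibility of toy corpus: {gzip_ratio(corpus):.3f}")
```

In the paper's setup, grammars of differing complexity would yield different compression ratios, and that ratio is the quantity the data-dependent scaling law is fit against; here it only demonstrates the measurement itself.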