How to Train Data-Efficient LLMs (2402.09668v1)
Abstract: The training of LLMs is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage- and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, and converge up to 70% faster.
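To make the Ask-LLM idea concrete, the sketch below scores each candidate training example by asking an instruction-tuned model whether the text looks useful for pre-training, then keeps only the top-scoring fraction of the corpus. This is a minimal illustration under assumptions, not the authors' implementation: the scoring model (`google/flan-t5-small`), the prompt wording, and the simple top-fraction selection are all hypothetical stand-ins for the paper's actual setup.

```python
# Hedged sketch of an Ask-LLM-style quality scorer (not the paper's exact code).
# Assumes an instruction-tuned seq2seq model served via Hugging Face transformers;
# the prompt text and scoring details here are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-small"  # assumption: any instruction-tuned scorer could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous text contain informative content that would help "
    "pre-train a large language model? Answer yes or no."
)  # hypothetical prompt, loosely paraphrasing the idea described in the abstract

@torch.no_grad()
def ask_llm_score(example: str) -> float:
    """Return P("yes") from the scoring model as a data-quality estimate."""
    inputs = tokenizer(
        PROMPT.format(example=example),
        return_tensors="pt", truncation=True, max_length=512,
    )
    # Score only the first decoder step: compare the logits of "yes" vs. "no".
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

def select_top_fraction(corpus: list[str], keep: float = 0.1) -> list[str]:
    """Keep the highest-scoring fraction of the corpus (e.g. 10%, mirroring
    the 90% rejection rate mentioned in the abstract)."""
    ranked = sorted(corpus, key=ask_llm_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]
```

In practice one would batch the scoring calls and cache scores, since the whole point of the paper is weighing this expensive per-example quality estimate against cheaper coverage-oriented samplers such as Density sampling.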