
Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data (2306.13840v3)

Published 24 Jun 2023 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: Current trends in pre-training LLMs primarily focus on the scaling of model and dataset size. While the quality of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality -- measuring the variability of natural language data -- specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled interventional experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data characterizes useful aspects of downstream model evaluation performance -- totaling 44 models of various sizes (51M to 7B parameters). We conclude that our formal notion of diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance.
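
The abstract describes the diversity coefficient only at a high level: sample portions of a corpus, embed them, and quantify how spread out those embeddings are. The sketch below illustrates one such estimator, the mean pairwise cosine distance between embeddings of randomly sampled batches of text. This is a minimal illustration under stated assumptions, not the authors' implementation: the batch-embedding function `embed_batch` (e.g., a Task2Vec-style probe-network embedding) is a caller-supplied placeholder, and the batch size, batch count, and distance choice are illustrative defaults.

```python
# Minimal sketch of a diversity-coefficient-style estimator.
# Assumption: `embed_batch` maps a list of texts to a fixed-size vector
# (e.g., a Task2Vec-style batch embedding); it is NOT defined by the paper text here.
import random
from typing import Callable, List, Sequence

import numpy as np


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def diversity_coefficient(
    dataset: Sequence[str],
    embed_batch: Callable[[List[str]], np.ndarray],
    batch_size: int = 512,
    num_batches: int = 20,
    seed: int = 0,
) -> float:
    """Estimate corpus variability as the mean pairwise distance between
    embeddings of randomly sampled batches (a sketch, not the paper's
    exact procedure)."""
    rng = random.Random(seed)

    # Embed several randomly sampled batches of the corpus.
    embeddings = []
    for _ in range(num_batches):
        batch = rng.sample(list(dataset), k=min(batch_size, len(dataset)))
        embeddings.append(embed_batch(batch))

    # Average distance over all unordered pairs of batch embeddings.
    dists = [
        cosine_distance(embeddings[i], embeddings[j])
        for i in range(len(embeddings))
        for j in range(i + 1, len(embeddings))
    ]
    return float(np.mean(dists))
```

Read this way, a corpus drawn from many latent concepts yields batches whose embeddings sit far apart on average (high score), while a narrow corpus yields tightly clustered batch embeddings (low score), which matches the intuition the abstract appeals to.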
