A major factor in the recent success of LLMs is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.
Data selection is critical when training language models, especially LLMs, and requires strategies for managing vast, heterogeneous datasets in order to improve model accuracy, efficiency, and fairness.
The paper introduces a taxonomy of data selection methods focused on distribution matching for domain-specific precision and distribution diversification for general applicability and robustness.
Pretraining LLMs involves filtering extensive raw datasets (such as Common Crawl) with both simple heuristics and more sophisticated model-based methods in order to remove low-quality content while preserving high-quality data.
Future advancements in data selection are tied to developing direct data evaluation metrics, comprehensive benchmarks, and strategies for balancing memorization and generalization.
Data selection is a pivotal aspect of the machine learning pipeline, and it is particularly relevant in the age of LLMs, which are trained on massive, heterogeneous corpora. Selecting the right data for training these models is not straightforward: it involves identifying which subsets of data will lead to the best model performance in terms of accuracy, efficiency, and fairness. The challenge lies not only in handling the sheer volume of available data but also in mitigating the variance in its quality.
A broad classification of data selection practices can be encapsulated into two primary goals: matching the distribution of the training data to the target task (distribution matching) and enhancing the coverage and diversity of the dataset (distribution diversification). Both approaches have their applications, with the former being crucial for domain-specific tasks requiring high precision, and the latter for general-purpose models necessitating robustness and broad applicability.
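To make the contrast concrete, the sketch below scores a pool of pre-computed document embeddings in two ways: keeping the candidates closest to a target-domain centroid (distribution matching) and greedily selecting maximally spread-out points (distribution diversification). The embedding source, the function names, and the farthest-point heuristic are illustrative assumptions for this example, not methods prescribed by any particular paper.

```python
# Illustrative sketch: distribution matching vs. distribution diversification
# over pre-computed document embeddings (e.g., from a sentence encoder).
import numpy as np


def match_to_target(candidates: np.ndarray, target: np.ndarray, k: int) -> np.ndarray:
    """Distribution matching: keep the k candidates closest to the target-domain centroid."""
    centroid = target.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    normed = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = normed @ centroid                      # cosine similarity to the target domain
    return np.argsort(-scores)[:k]                  # indices of the best-matching documents


def diversify(candidates: np.ndarray, k: int) -> list[int]:
    """Distribution diversification: greedy farthest-point selection for broad coverage."""
    normed = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    selected = [0]                                  # seed with an arbitrary document
    min_dist = np.full(len(normed), np.inf)
    for _ in range(k - 1):
        last = normed[selected[-1]]
        min_dist = np.minimum(min_dist, 1.0 - normed @ last)  # cosine distance to selected set
        min_dist[selected] = -np.inf                # never re-pick an already selected point
        selected.append(int(np.argmax(min_dist)))  # farthest from everything chosen so far
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = rng.normal(size=(1000, 64))              # candidate document embeddings
    domain = rng.normal(loc=0.5, size=(50, 64))     # small sample of target-domain documents
    print(match_to_target(pool, domain, k=10))
    print(diversify(pool, k=10))
```

In practice these two objectives are often combined, for example by matching a target distribution while enforcing a minimum level of diversity among the selected documents.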
The process of data selection comprises several strategic components, most notably determining which candidate data points to include in the training dataset and how to appropriately sample from those that are selected.
For pretraining LLMs, the goal is often to filter and curate data from extensive datasets like the Common Crawl corpus, ensuring the removal of low-quality or irrelevant information while retaining high-quality content. Various heuristic approaches are employed for this purpose, alongside more sophisticated model-based and perplexity-based quality filtering. The challenge is to achieve a balance that favors data efficiency and model performance without introducing significant biases.
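As a concrete illustration of this kind of pipeline, the following sketch combines a few cheap rule-based heuristics with a perplexity filter computed against a reference language model, in the spirit of C4- and CCNet-style filtering. The thresholds, the blocklist, and the KenLM model path are placeholder assumptions; real pipelines tune these choices carefully and differ in their exact rules.

```python
# Minimal sketch of heuristic + perplexity-based quality filtering for web text.
# All cutoffs and the reference model path are illustrative assumptions.
import kenlm  # requires the KenLM Python bindings

BAD_WORDS = {"lorem", "ipsum"}             # placeholder blocklist
MODEL = kenlm.Model("wikipedia.arpa.bin")  # hypothetical LM trained on a "clean" corpus


def heuristic_ok(doc: str) -> bool:
    """Cheap rule-based filters: minimum length, terminal punctuation, blocklist."""
    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if len(doc.split()) < 50:                                   # too short to be useful
        return False
    if sum(l.endswith((".", "!", "?")) for l in lines) < 0.5 * max(len(lines), 1):
        return False                                            # mostly boilerplate lines
    if any(w in doc.lower() for w in BAD_WORDS):
        return False
    return True


def perplexity(doc: str) -> float:
    """Word-level perplexity under the reference LM (KenLM scores are log10 probabilities)."""
    words = doc.split()
    log10_prob = MODEL.score(doc, bos=True, eos=True)
    return 10 ** (-log10_prob / max(len(words) + 1, 1))         # +1 for the </s> token


def keep(doc: str, ppl_cutoff: float = 1000.0) -> bool:
    # Documents must pass the cheap heuristics first, then the model-based filter.
    return heuristic_ok(doc) and perplexity(doc) < ppl_cutoff
```

Note that filtering for low perplexity under a reference model implicitly defines "quality" as similarity to that model's training corpus, which is one way such filters can introduce bias.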
The review underlines the nuanced trade-offs between memorization and generalization inherent in data selection decisions. Innovations in direct data evaluation metrics, the development of comprehensive benchmarks, and a shift toward more holistic data processing strategies are highlighted as key future directions.
This survey aims to provide a structured understanding of the landscape of data selection methods in machine learning, with a focus on LLMs. It emphasizes the intricate balance required in selecting data that both aligns with target tasks and ensures models are robust, fair, and efficient. As the field evolves, so too will the strategies for selecting the optimal datasets, underscoring the importance of continued research and innovation in this space.