Emergent Mind

A Survey on Data Selection for Language Models

(arXiv:2402.16827)
Published Feb 26, 2024 in cs.CL and cs.LG

Abstract

A major factor in the recent success of LLMs is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

Overview of a data pipeline detailing steps from raw data to training language models.

Overview

  • Data selection is critical in training language models, especially LLMs, necessitating strategies for managing vast, diverse datasets to enhance model accuracy, efficiency, and fairness.

  • The paper introduces a taxonomy of data selection methods focused on distribution matching for domain-specific precision and distribution diversification for general applicability and robustness.

  • Pretraining LLMs involves filtering extensive datasets (like Common Crawl) to eliminate low-quality content, using heuristic and sophisticated model-based methods to preserve high-quality data.

  • Future advancements in data selection are tied to developing direct data evaluation metrics, comprehensive benchmarks, and strategies for balancing memorization and generalization.

Comprehensive Review on Data Selection Methods for Language Models

Introduction to Data Selection in Machine Learning

Data selection is a pivotal aspect of the machine learning pipeline, particularly relevant in the age of LLMs, which are trained on massive, heterogeneous corpora. Selecting the right data for training these models is not straightforward: it involves identifying which subsets of data will lead to the best model performance in terms of accuracy, efficiency, and fairness. The challenge lies not only in handling the sheer volume of available data but also in accounting for the wide variation in its quality.

Taxonomy of Data Selection Methods

A broad classification of data selection practices can be encapsulated into two primary goals: matching the distribution of the training data to the target task (distribution matching) and enhancing the coverage and diversity of the dataset (distribution diversification). Both approaches have their applications, with the former being crucial for domain-specific tasks requiring high precision, and the latter for general-purpose models necessitating robustness and broad applicability.
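To make distribution matching concrete, a classic instance is cross-entropy difference scoring (in the Moore-Lewis style), which prefers documents that look more like a target domain than like general text. The sketch below is illustrative only: it substitutes add-one-smoothed unigram models for the n-gram language models used in practice, and the corpora, function names, and scores are invented for the example.

```python
import math
from collections import Counter

def train_unigram(corpus):
    counts = Counter(tok for doc in corpus for tok in doc.split())
    return counts, sum(counts.values())

def cross_entropy(counts, total, vocab_size, doc):
    # Per-token negative log-likelihood under an add-one-smoothed unigram LM.
    toks = doc.split()
    nll = -sum(math.log((counts.get(t, 0) + 1) / (total + vocab_size)) for t in toks)
    return nll / max(len(toks), 1)

# Toy corpora: an in-domain (medical-flavored) set vs. general web text.
target = ["patient dosage clinical trial", "clinical patient symptoms dosage"]
general = ["cheap flights book now", "celebrity news gossip today"]
vocab_size = len({t for d in target + general for t in d.split()})

t_counts, t_total = train_unigram(target)
g_counts, g_total = train_unigram(general)

def moore_lewis_score(doc):
    # Lower score => closer to the target domain than to general text.
    return (cross_entropy(t_counts, t_total, vocab_size, doc)
            - cross_entropy(g_counts, g_total, vocab_size, doc))

candidates = ["patient clinical dosage", "celebrity gossip now"]
ranked = sorted(candidates, key=moore_lewis_score)  # in-domain text ranks first
```

Distribution diversification would instead score documents against data already selected (e.g., penalizing near-duplicates), so the two goals reuse the same scoring-and-ranking machinery with different utility functions.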

The process of data selection comprises several strategic components, notably:

  • Utility Function Definition: This involves mapping data points to a numeric value representing their utility, which is crucial for filtering and prioritizing data.
  • Selection Mechanism: Utilized to decide which data points are included in the training set based on their assigned utility values.
  • Dataset Characteristics Adjustment: Methods in this category alter the dataset's distribution to favor characteristics deemed desirable for the training objectives.
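These components can be sketched in a few lines. In the toy example below, the utility function, thresholds, and documents are all hypothetical; real pipelines would plug in model-based scorers in place of the hand-written heuristic:

```python
import heapq
import random

def utility(doc: str) -> float:
    # Hypothetical utility function: reward reasonable length and lexical diversity.
    toks = doc.split()
    if not toks:
        return 0.0
    length_score = min(len(toks) / 20.0, 1.0)   # saturates at 20 tokens
    diversity = len(set(toks)) / len(toks)      # penalizes repetition
    return length_score * diversity

def select_top_k(docs, k):
    # Deterministic selection mechanism: keep the k highest-utility documents.
    return heapq.nlargest(k, docs, key=utility)

def sample_by_utility(docs, k, seed=0):
    # Stochastic selection mechanism: sample in proportion to utility,
    # one way to adjust the selected dataset's characteristics.
    rng = random.Random(seed)
    return rng.choices(docs, weights=[utility(d) for d in docs], k=k)

docs = [
    "the the the the the",
    "a short but varied sentence about language models and data selection",
    "buy buy buy now now now",
]
best = select_top_k(docs, 1)  # the varied sentence scores highest
```

The split between scoring (utility) and choosing (mechanism) is what lets the same pipeline swap in perplexity filters, classifiers, or influence-based scores without changing the selection logic.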

Pretraining Data Selection

For pretraining LLMs, the goal is often to filter and curate data from extensive datasets like the Common Crawl corpus, ensuring the removal of low-quality or irrelevant information while retaining high-quality content. Various heuristic approaches are employed for this purpose, alongside more sophisticated model-based and perplexity-based quality filtering. The challenge is to achieve a balance that favors data efficiency and model performance without introducing significant biases.
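As a concrete illustration, the sketch below combines C4-style surface heuristics with a perplexity gate in the spirit of CCNet-style filtering. The thresholds, blocklist, and stand-in perplexity scorer are invented for the example; a real pipeline would use a trained language model (e.g., a KenLM model) as the scorer:

```python
MIN_WORDS = 5
BLOCKLIST = {"lorem", "ipsum"}  # placeholder for a real bad-word list

def passes_heuristics(line: str) -> bool:
    words = line.split()
    if len(words) < MIN_WORDS:
        return False
    if not line.rstrip().endswith((".", "!", "?", '"')):
        return False  # C4 keeps only lines ending in terminal punctuation
    return not any(w.lower() in BLOCKLIST for w in words)

def passes_perplexity(line: str, ppl_fn, threshold: float = 1000.0) -> bool:
    # Keep text whose perplexity under a reference LM is below a cutoff;
    # very high perplexity often indicates boilerplate or gibberish.
    return ppl_fn(line) < threshold

raw = [
    "Click here",
    "lorem ipsum dolor sit amet something.",
    "This sentence reads like ordinary well-formed prose.",
]
fake_ppl = lambda s: 50.0  # stand-in scorer for the sketch
kept = [l for l in raw if passes_heuristics(l) and passes_perplexity(l, fake_ppl)]
```

Note the bias risk the section mentions: a perplexity gate trained on one register of text (e.g., edited news prose) will systematically discard valid text in other dialects or styles, which is exactly the balance the survey flags.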

Enhancing Language Model Performance through Specific Data Selection Techniques

  • Fine-tuning and Multitask Learning: These methods leverage auxiliary datasets or diverse tasks to improve model performance on specific targets or across a multitude of tasks. The emphasis here is on domain-specific selection, where additional data is judiciously chosen to closely mirror the task at hand.
  • In-Context Learning: Selecting or generating strong demonstrations to include in prompts can guide the model more effectively, showing that precise data selection significantly influences model behavior even without direct training on that data.
  • Task-specific Fine-tuning: Task-specific settings call for strategies that either increase the training data’s alignment with the target task or optimize data efficiency and robustness by carefully curating and diversifying the training samples.
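A common instance of the in-context selection idea above is retrieving the demonstrations most similar to the test input. The sketch below substitutes a bag-of-words vector for the neural sentence encoder usually used in practice; the example pool and labels are invented:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a neural sentence encoder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_demonstrations(pool, query, k=2):
    # Retrieve the k labeled examples most similar to the test input;
    # these are then placed in the prompt ahead of the query.
    q = embed(query)
    return sorted(pool, key=lambda ex: cosine(embed(ex["text"]), q), reverse=True)[:k]

pool = [
    {"text": "the movie was wonderful", "label": "positive"},
    {"text": "terrible plot and acting", "label": "negative"},
    {"text": "stock prices rose sharply", "label": "neutral"},
]
demos = select_demonstrations(pool, "the movie was terrible", k=2)
```

The same retrieve-by-similarity pattern underlies the fine-tuning bullets as well: auxiliary training examples nearest to the target task are selected, only with gradients applied rather than prompting.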

Future Directions and Challenges

The review underlines the nuanced trade-offs between memorization and generalization inherent in data selection decisions. It highlights innovations in direct data evaluation metrics, the development of comprehensive benchmarks, and a shift toward more holistic data processing strategies as key future directions.

Conclusion

This survey aims to provide a structured understanding of the landscape of data selection methods in machine learning, with a focus on LLMs. It emphasizes the intricate balance required in selecting data that both aligns with target tasks and ensures models are robust, fair, and efficient. As the field evolves, so too will the strategies for selecting the optimal datasets, underscoring the importance of continued research and innovation in this space.

  144. Unified demonstration retriever for in-context learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  4644–4668, Toronto, Canada, July 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.256. https://aclanthology.org/2023.acl-long.256.

  145. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, December 2022a. ISSN 1095-9203. doi: 10.1126/science.abq1158. http://dx.doi.org/10.1126/science.abq1158.

  146. Making something out of nothing: Building robust task-oriented dialogue systems from scratch. In Alexa Prize TaskBot Challenge 1 Proceedings, 2022b. https://www.amazon.science/alexa-prize/proceedings/making-something-out-of-nothing-building-robust-task-oriented-dialogue-systems-from-scratch.

  147. SlimOrca: An open dataset of GPT-4 augmented FLAN reasoning traces, with verification, 2023. https://huggingface.co/Open-Orca/SlimOrca.

  148. Exploration with Principles for Diverse AI Supervision
  149. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp.  100–114, Dublin, Ireland and Online, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.deelio-1.10. https://aclanthology.org/2022.deelio-1.10.

  150. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens, 2024a
  151. Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, 2024b. https://openreview.net/forum?id=xbjSwwrQOe.

  152. What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning
  153. Multi-task deep neural networks for natural language understanding. In Anna Korhonen, David Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4487–4496, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1441. https://aclanthology.org/P19-1441.

  154. The flan collection: Designing data and methods for effective instruction tuning, 2023a
  155. The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
  156. A pretrainer’s guide to training data: Measuring the effects of data age, domain coverage, quality, & toxicity, 2023c
  157. SELF: Self-Evolution with Language Feedback
  158. #instag: Instruction tagging for analyzing supervised fine-tuning of large language models, 2023b
  159. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.556. https://aclanthology.org/2022.acl-long.556.

  160. What’s in the box? an analysis of undesirable content in the Common Crawl corpus. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp.  182–189, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-short.24. https://aclanthology.org/2021.acl-short.24.

  161. FinGPT: Large Generative Models for a Small Language
  162. Learning adversarially fair and transferable representations. In International Conference on Machine Learning, pp.  3384–3393. PMLR
  163. Paloma: A benchmark for evaluating language model fit
  164. D2 pruning: Message passing for balancing diversity and difficulty in data pruning
  165. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi: 10.1137/0222058. https://doi.org/10.1137/0222058.

  166. Impact of missing data imputation on the fairness and accuracy of graph node classifiers. In IEEE International Conference on Big Data, pp.  5988–5997
  167. Data portraits: Recording foundation model training data
  168. Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection
  169. Dataperf: Benchmarks for data-centric AI development. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. https://openreview.net/forum?id=LaFKTgrZMG.

  170. The Natural Language Decathlon: Multitask Learning as Question Answering
  171. Data curation: A study of researcher practices and needs. portal: Libraries and the Academy, 14(2):139–164
  172. MetaICL: Learning to learn in context. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2791–2809, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.201. https://aclanthology.org/2022.naacl-main.201.

  173. Prioritized training on points that are learnable, worth learning, and not yet learnt. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  15630–15649. PMLR, 17–23 Jul 2022. https://proceedings.mlr.press/v162/mindermann22a.html.

  174. Coresets for data-efficient training of machine learning models. In International Conference on Machine Learning, pp.  6950–6960. PMLR, 2020a.
  175. Coresets for robust training of deep neural networks against noisy labels. Advances in Neural Information Processing Systems, 33:11465–11477, 2020b.
  176. Cross-Task Generalization via Natural Language Crowdsourcing Instructions
  177. Measuring data
  178. Intelligent selection of language model training data. In Jan Hajič, Sandra Carberry, Stephen Clark, and Joakim Nivre (eds.), Proceedings of the ACL 2010 Conference Short Papers, pp.  220–224, Uppsala, Sweden, July 2010. Association for Computational Linguistics. https://aclanthology.org/P10-2041.

  179. Chenghao Mou. Large-scale near-deduplication behind bigcode, May 2023. https://huggingface.co/blog/dedup. Accessed: 2023-12-06.

  180. SGPT: GPT Sentence Embeddings for Semantic Search
  181. MTEB: Massive Text Embedding Benchmark
  182. Crosslingual Generalization through Multitask Finetuning
  183. OctoPack: Instruction Tuning Code Large Language Models
  184. Scaling data-constrained language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. https://openreview.net/forum?id=j5BuTrEj35.

  185. Generative Representational Instruction Tuning
  186. Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. ISBN 0262018020.
  187. WebGPT: Browser-assisted question-answering with human feedback
  188. Can foundation models wrangle your data?
  189. In-context Example Selection with Influences
  190. Quality not quantity: On the interaction between dataset design and robustness of CLIP. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=LTCBavFWp5C.

  191. Dataset meta-learning from kernel ridge-regression. In International Conference on Learning Representations
  192. Dataset distillation with infinitely wide convolutional networks. Advances in Neural Information Processing Systems, 34:5186–5198
  193. Gpt-4 technical report
  194. Distributionally robust language modeling. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4227–4237, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1432. https://aclanthology.org/D19-1432.

  195. Proving test set contamination in black box language models
  196. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi (eds.), Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019, Cardiff, 22nd July 2019, pp.  9–16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. http://nbn-resolving.de/urn:nbn:de:bsz:mh39-90215.

  197. Training language models to follow instructions with human feedback
  198. West-of-n: Synthetic preference generation for improved reward modeling
  199. Logic-LM: Empowering LLMs with symbolic solvers for faithful logical reasoning. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  3806–3824, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.248. https://aclanthology.org/2023.findings-emnlp.248.

  200. Trak: Attributing model behavior at scale
  201. Deep learning on a data diet: Finding important examples early in training. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. https://openreview.net/forum?id=Uj7pF-D-YvT.

  202. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only
  203. True few-shot learning with language models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. https://openreview.net/forum?id=ShnM-rRh4T.

  204. Deep contextualized word representations. NAACL, 2018. https://aclanthology.org/N18-1202.

  205. Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks
  206. Dynamic pretraining of vision-language models. In ICLR 2023 Workshop on Multimodal Representation Learning: Perks and Pitfalls, 2023. https://openreview.net/forum?id=meQWVbMCqXr.

  207. GAIA search: Hugging face and pyserini interoperability for NLP training data exploration. In Danushka Bollegala, Ruihong Huang, and Alan Ritter (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp.  588–598, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-demo.57. https://aclanthology.org/2023.acl-demo.57.

  208. Identifying mislabeled data using the area under the margin ranking. Advances in Neural Information Processing Systems, 33:17044–17056
  209. Adaptive second order coresets for data-efficient machine learning. In International Conference on Machine Learning, pp.  17848–17869. PMLR
  210. Intermediate-task transfer learning with pretrained language models: When and why does it work? In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  5231–5247, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.467. https://aclanthology.org/2020.acl-main.467.

  211. Improving language understanding by generative pre-training, 2018. https://api.semanticscholar.org/CorpusID:49313245.

  212. Language models are unsupervised multitask learners, 2019. https://api.semanticscholar.org/CorpusID:160025533.

  213. Learning transferable visual models from natural language supervision
  214. Scaling language models: Methods, analysis & insights from training gopher
  215. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  216. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435. https://jmlr.org/papers/volume21/20-074/20-074.pdf.

  217. No robots. https://huggingface.co/datasets/HuggingFaceH4/no_robots

  218. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  3982–3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. https://aclanthology.org/D19-1410.

  219. Learning to retrieve prompts for in-context learning. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2655–2671, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.191. https://aclanthology.org/2022.naacl-main.191.

  220. Distributionally robust neural networks. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=ryxGuJrFvS.

  221. FAWOS: Fairness-Aware Oversampling Algorithm Based on Distributions of Sensitive Attributes. IEEE Access, 9:81370–81379, 2021. ISSN 2169-3536. doi: 10.1109/ACCESS.2021.3084121. Conference Name: IEEE Access.
  222. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022. https://openreview.net/forum?id=9Vrb9D0WI4.

  223. Paul Tremblay, Mona Awad v. OpenAI, Inc., et al., 2023. https://storage.courtlistener.com/recap/gov.uscourts.cand.414822/gov.uscourts.cand.414822.1.0_1.pdf. Case 3:23-cv-03223-AMO, Document 1, Filed 06/28/23, United States District Court, Northern District of California, San Francisco Division.

  224. Teven Le Scao. Scaling multilingual language models under constrained data. PhD thesis, Université de Lorraine
  225. What Language Model to Train if You Have One Million GPU Hours?
  226. FairPrep: Promoting Data to a First-Class Citizen in Studies on Fairness-Enhancing Interventions
  227. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=Yacmpz84TH.

  228. Cross-lingual supervision improves large language models pre-training
  229. Apricot: Submodular selection for data summarization in python. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435.
  230. LAION-5b: An open large-scale dataset for training next generation image-text models. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. https://openreview.net/forum?id=M3Y74vmsMcY.

  231. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018. https://openreview.net/forum?id=H1aIuk-RW.

  232. Text data acquisition for domain-specific language models. In Dan Jurafsky and Eric Gaussier (eds.), Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp.  382–389, Sydney, Australia, July 2006. Association for Computational Linguistics. https://aclanthology.org/W06-1645.

  233. Detecting pretraining data from large language models, 2023a
  234. Effective robustness against natural distribution shifts for models with different training data. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. https://openreview.net/forum?id=PAYXfIUKWY.

  235. Aya dataset: An open-access collection for multilingual instruction tuning
  236. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
  237. Dolma: an open corpus of three trillion tokens for language model pretraining research
  238. Carpe diem, seize the samples uncertain "at the moment" for adaptive batch selection. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pp.  1385–1394, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450368599. doi: 10.1145/3340531.3411898. https://doi.org/10.1145/3340531.3411898.
  239. Ryosuke Sonoda. Fair oversampling technique using heterogeneous clusters. Information Sciences, 640:119059, September 2023. ISSN 0020-0255. doi: 10.1016/j.ins.2023.119059. https://www.sciencedirect.com/science/article/pii/S0020025523006448.
  240. Beyond neural scaling laws: beating power law scaling via data pruning. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=UmvSlP-PyV.

  241. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. https://openreview.net/forum?id=uyTL5Bvosj.

  242. Selective annotation makes language models better few-shot learners. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=qY1hlv7gwg.

  243. Detecting personal information in training corpora: an analysis. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pp.  208–220, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.trustnlp-1.18. https://aclanthology.org/2023.trustnlp-1.18.

  244. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  9275–9293, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.746. https://aclanthology.org/2020.emnlp-main.746.

  245. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  246. Data curation with deep learning. In EDBT, pp.  277–286
  247. D4: Improving llm pretraining via document de-duplication and diversification
  248. A reproduction of apple’s bi-directional LSTM models for language identification in short strings. In Ionut-Teodor Sorodoc, Madhumita Sushil, Ece Takmaz, and Eneko Agirre (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pp.  36–42, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-srw.6. https://aclanthology.org/2021.eacl-srw.6.

  249. Llama: Open and efficient foundation language models, 2023a
  250. Llama 2: Open foundation and fine-tuned chat models, 2023b
  251. Zephyr: Direct Distillation of LM Alignment
  252. Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
  253. Writing system and speaker metadata for 2,800+ language varieties. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp.  5035–5046, Marseille, France, June 2022. European Language Resources Association. https://aclanthology.org/2022.lrec-1.538.

  254. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi (eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  353–355, Brussels, Belgium, November 2018a. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. https://aclanthology.org/W18-5446.

  255. Training data selection for support vector machines. In Lipo Wang, Ke Chen, and Yew Soon Ong (eds.), Advances in Natural Computation, pp.  554–564, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. ISBN 978-3-540-31853-8.
  256. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12196–12205, 2022a.
  257. Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  5310–5319
  258. Shepherd: A critic for language model generation, 2023a
  259. Dataset Distillation
  260. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. https://openreview.net/forum?id=BGvkwZEGt7.

  261. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  5085–5109, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.340. https://aclanthology.org/2022.emnlp-main.340.

  262. How far can camels go? exploring the state of instruction tuning on open resources, 2023c
  263. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  13484–13508, Toronto, Canada, July 2023d. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.754. https://aclanthology.org/2023.acl-long.754.

  264. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics
  265. Findings of the BabyLM challenge: Sample-efficient pretraining on developmentally plausible corpora. In Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjabe, Adina Williams, Tal Linzen, and Ryan Cotterell (eds.), Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pp.  1–34, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.conll-babylm.1. https://aclanthology.org/2023.conll-babylm.1.

  266. Finetuned language models are zero-shot learners. ICLR 2022, 2021. https://openreview.net/forum?id=gEZrGCozdqR.

  267. Challenges in detoxifying language models. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp.  2447–2469, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.210. https://aclanthology.org/2021.findings-emnlp.210.

  268. CCNet: Extracting high quality monolingual datasets from web crawl data. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, pp.  4003–4012, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. https://aclanthology.org/2020.lrec-1.494.

  269. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1101. https://aclanthology.org/N18-1101.

  270. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
  271. Scattershot: Interactive in-context example curation for text transformation. In Proceedings of the 28th International Conference on Intelligent User Interfaces, IUI ’23, pp.  353–367, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701061. doi: 10.1145/3581641.3584059. https://doi.org/10.1145/3581641.3584059.
  272. Sheared llama: Accelerating language model pre-training via structured pruning, 2023a
  273. Less: Selecting influential data for targeted instruction tuning
  274. Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In The Eleventh International Conference on Learning Representations, 2023b. https://openreview.net/forum?id=7D5EECbOaf9.

  275. C-Pack: Packaged Resources To Advance General Chinese Embedding
  276. Doremi: Optimizing data mixtures speeds up language model pretraining. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. https://openreview.net/forum?id=lXuByUeHhd.

  277. Data selection for language models via importance resampling. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. https://openreview.net/forum?id=uPSQv0leAu.

  278. Detoxifying Language Models Risks Marginalizing Minority Voices
  279. Curriculum learning for natural language understanding. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  6095–6104, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.542. https://aclanthology.org/2020.acl-main.542.

  280. WizardLM: Empowering Large Language Models to Follow Complex Instructions
  281. Misconfidence-based Demonstration Selection for LLM In-Context Learning
  282. Perils of self-feedback: Self-bias amplifies in large language models, 2024a
  283. In-context Learning with Retrieved Demonstrations for Language Models: A Survey
  284. mT5: A massively multilingual pre-trained text-to-text transformer. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  483–498, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.41. https://aclanthology.org/2021.naacl-main.41.

  285. Fair Class Balancing: Enhancing Model Fairness without Observing Sensitive Attributes. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, pp.  1715–1724, New York, NY, USA, October 2020. Association for Computing Machinery. ISBN 978-1-4503-6859-9. doi: 10.1145/3340531.3411980. https://dl.acm.org/doi/10.1145/3340531.3411980.
  286. Fairness with overlapping groups; a probabilistic perspective. Advances in Neural Information Processing Systems, 33, 2020a.
  287. Towards fairer datasets: filtering and balancing the distribution of the people subtree in the ImageNet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp.  547–558, Barcelona Spain, January 2020b. ACM. ISBN 978-1-4503-6936-7. doi: 10.1145/3351095.3375709. https://dl.acm.org/doi/10.1145/3351095.3375709.
  288. Image data augmentation for deep learning: A survey
  289. Compositional Exemplars for In-context Learning
  290. BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
  291. Real-fake: Effective training data synthesis through distribution matching. In The Twelfth International Conference on Learning Representations, 2024a. https://openreview.net/forum?id=svIdLLZpsA.

  292. Self-rewarding language models, 2024b
  293. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
  294. MC^2: A Multilingual Corpus of Minority Languages in China
  295. IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models
  296. Instruction tuning for large language models: A survey, 2023c
  297. Fairness in Missing Data Imputation
  298. Active example selection for in-context learning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  9134–9148, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.622. https://aclanthology.org/2022.emnlp-main.622.

  299. Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pp.  12674–12685. PMLR
  300. Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp.  6514–6523
  301. Dataset condensation with gradient matching. In International Conference on Learning Representations
  302. Improved distribution matching for dataset condensation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7856–7865
  303. Coverage-centric coreset selection for high pruning rates. In The Eleventh International Conference on Learning Representations
  304. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=KBMOKmX2he.

  305. Dataset distillation using neural feature regression. Advances in Neural Information Processing Systems, 35:9813–9827
  306. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023a
  307. Multimodal c4: An open, billion-scale corpus of images interleaved with text, 2023b
  308. Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models
  309. Fine-tuning language models from human preferences