Proving Test Set Contamination in Black Box Language Models (2310.17623v2)
Abstract: Large language models (LLMs) are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in LLMs without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for LLMs to memorize example order means that a contaminated LLM will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible LLMs for test set contamination and find little evidence for pervasive contamination.
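The core idea can be illustrated with a simple Monte Carlo permutation test: score the benchmark in its published (canonical) order, score it under random shuffles, and ask how often a shuffle is at least as likely. The sketch below is an illustration of this idea under stated assumptions, not the authors' exact procedure; `seq_log_prob` is a hypothetical function that returns a model's log-likelihood of the benchmark examples concatenated in a given order.

```python
import random

def contamination_p_value(examples, seq_log_prob, num_shuffles=100, seed=0):
    """Permutation p-value: is the canonical ordering unusually likely?

    Under no contamination the examples are exchangeable, so every ordering
    should be about equally likely and the canonical ordering's rank among
    shuffled orderings is uniform.

    NOTE: `seq_log_prob` is a hypothetical scoring function (e.g., summing a
    model's token log-probabilities over the concatenated examples); it is
    not defined in the paper excerpt above.
    """
    rng = random.Random(seed)
    canonical = seq_log_prob(examples)  # log-likelihood of the published order

    exceed = 0
    for _ in range(num_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if seq_log_prob(shuffled) >= canonical:  # shuffle at least as likely
            exceed += 1

    # Add-one correction so a Monte Carlo permutation p-value is never exactly
    # zero (Phipson & Smyth, 2010).
    return (exceed + 1) / (num_shuffles + 1)
```

A small p-value indicates that the canonical ordering is far more likely than shuffled orderings, which is the signature of order memorization that the test treats as evidence of contamination.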