
Proving Test Set Contamination in Black Box Language Models (2310.17623v2)

Published 26 Oct 2023 in cs.CL and cs.LG

Abstract: LLMs are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in LLMs without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for LLMs to memorize example order means that a contaminated LLM will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible LLMs for test set contamination and find little evidence for pervasive contamination.


Summary

  • The paper introduces a rigorous method leveraging exchangeability and sharded likelihood tests to detect test set contamination.
  • The approach requires no access to pre-training data, ensuring independent verification of language model performance.
  • Empirical results on models like GPT-2, LLaMA2, and Mistral-7B demonstrate robust detection of even low contamination rates.

A Formal Analysis of Test Set Contamination in Black Box LLMs

The paper "Proving Test Set Contamination in Black Box Language Models" addresses a basic question about the trustworthiness of LLM performance metrics: whether a model's pre-training data included evaluation benchmarks, so that reported scores reflect memorization rather than genuine generalization. The question is hard to settle because many models are proprietary and their training data is not publicly accessible.

Methodological Contributions

The authors introduce a statistical test for test set contamination that requires no access to the model's pre-training data or weights. The test relies on exchangeability: for an exchangeable benchmark, every ordering of the examples is equally likely. An LLM that has not seen the benchmark should therefore show no preference for any specific ordering, whereas a contaminated LLM tends to memorize example order and to find the canonical ordering much more likely than others. The crux of the method is to compare the likelihood the model assigns to the dataset in its canonical ordering against the likelihood it assigns to random permutations; a significantly higher canonical likelihood suggests memorization.
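The permutation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `permutation_p_value` and `make_toy_scorer` are hypothetical names, and the toy scorer stands in for a real model's log-likelihood by rewarding adjacent pairs that appear in a "memorized" order. The p-value uses the standard add-one correction so that it is never exactly zero.

```python
import random
from typing import Callable, Dict, List, Sequence

LogProb = Callable[[Sequence[str]], float]


def permutation_p_value(
    examples: Sequence[str],
    log_prob: LogProb,
    num_shuffles: int = 99,
    seed: int = 0,
) -> float:
    """One-sided permutation test: is the canonical order unusually likely?"""
    rng = random.Random(seed)
    canonical_score = log_prob(examples)
    at_least_as_likely = 0
    for _ in range(num_shuffles):
        shuffled: List[str] = list(examples)
        rng.shuffle(shuffled)
        if log_prob(shuffled) >= canonical_score:
            at_least_as_likely += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_likely + 1) / (num_shuffles + 1)


def make_toy_scorer(memorized: Sequence[str]) -> LogProb:
    """Hypothetical stand-in for a contaminated model (an assumption).

    Each adjacent pair that does NOT follow the memorized order costs 1 nat.
    """
    successor: Dict[str, str] = dict(zip(memorized, memorized[1:]))

    def log_prob(sequence: Sequence[str]) -> float:
        return sum(
            0.0 if successor.get(a) == b else -1.0
            for a, b in zip(sequence, sequence[1:])
        )

    return log_prob


if __name__ == "__main__":
    benchmark = [f"example-{i}" for i in range(20)]
    contaminated = make_toy_scorer(benchmark)
    order_blind = lambda seq: -float(len(seq))  # ignores ordering entirely
    print("contaminated:", permutation_p_value(benchmark, contaminated))
    print("order-blind: ", permutation_p_value(benchmark, order_blind))
```

With a real model, `log_prob` would be replaced by the summed token log-probabilities of the concatenated examples, which is where the computational cost of this test comes from.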

A key innovation is the sharded likelihood comparison test. The dataset is partitioned into multiple shards, and within each shard the log-likelihood of the canonical ordering is compared with that of permuted orderings. Aggregating these comparisons across shards improves both the statistical power and the computational efficiency of the method, addressing limits of a conventional permutation test. The authors provide asymptotic false positive guarantees for the test, which support the validity of any contamination it flags.
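A rough sketch of the sharded comparison follows. The summary does not spell out how the per-shard results are aggregated, so this sketch makes an assumption: it runs a one-sided test on the mean per-shard difference and, to stay within the standard library, uses a normal approximation that is only reasonable with many shards. The names `shard_differences`, `sharded_p_value`, and `make_toy_scorer` are hypothetical, and the toy scorer again stands in for a real model.

```python
import random
from statistics import NormalDist, mean, stdev
from typing import Callable, Dict, List, Sequence

LogProb = Callable[[Sequence[str]], float]


def shard_differences(
    examples: Sequence[str],
    log_prob: LogProb,
    num_shards: int = 10,
    shuffles_per_shard: int = 20,
    seed: int = 0,
) -> List[float]:
    """Per shard: canonical log-prob minus the mean log-prob over shuffles."""
    shard_size = len(examples) // num_shards
    if num_shards < 2 or shard_size < 2:
        raise ValueError("need at least 2 shards with at least 2 examples each")
    rng = random.Random(seed)
    differences: List[float] = []
    for index in range(num_shards):
        shard = list(examples[index * shard_size:(index + 1) * shard_size])
        canonical_score = log_prob(shard)
        shuffled_scores: List[float] = []
        for _ in range(shuffles_per_shard):
            permuted = shard[:]
            rng.shuffle(permuted)
            shuffled_scores.append(log_prob(permuted))
        differences.append(canonical_score - mean(shuffled_scores))
    return differences


def sharded_p_value(differences: Sequence[float]) -> float:
    """One-sided test that the mean difference exceeds zero.

    Assumption: normal approximation to the test statistic.
    """
    spread = stdev(differences)
    if spread == 0.0:
        # Degenerate case: limiting value of the test as the spread vanishes.
        return 0.0 if mean(differences) > 0.0 else 1.0
    z = mean(differences) / (spread / len(differences) ** 0.5)
    return 1.0 - NormalDist().cdf(z)


def make_toy_scorer(memorized: Sequence[str]) -> LogProb:
    """Hypothetical contaminated model: penalizes pairs out of memorized order."""
    successor: Dict[str, str] = dict(zip(memorized, memorized[1:]))

    def log_prob(sequence: Sequence[str]) -> float:
        return sum(
            0.0 if successor.get(a) == b else -1.0
            for a, b in zip(sequence, sequence[1:])
        )

    return log_prob
```

Because each shard is scored separately, the model only ever sees short sequences, and the number of shuffles per shard can stay small while the shard count supplies the statistical evidence.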

Empirical Findings

The authors validate the approach empirically by injecting known test sets into the pre-training corpora of 1.4 billion parameter models. The experiments show that the method reliably detects contamination even when a dataset appears only a few times in the corpus, with strong statistical significance when it is duplicated ten times or more. Furthermore, the sharded likelihood comparison test outperforms the traditional permutation test in computationally demanding settings.

An audit of publicly accessible models, including LLaMA2, Mistral-7B, and GPT-2, finds little evidence of pervasive contamination, consistent with previous findings reported by model developers. This application underlines the method's potential as a tool for independent audits of benchmark integrity in LLMs.

Implications and Future Directions

This work has both practical and theoretical implications. Practically, it gives the research community a tool for independently checking whether reported benchmark results are affected by training data contamination. The authors release their models and datasets as benchmarks, encouraging further work in this area. Theoretically, the method refines our understanding of information leakage from very large pre-training datasets and suggests new directions for privacy-preserving machine learning research.

Future research avenues could explore improving the power of these methods to detect single-instance contamination, aligning theoretical developments with practical applications in model auditing. Moreover, expanding the framework to handle non-exchangeable datasets, or those partially represented in training data without direct duplication, would enhance its applicability. Given the growing complexity and capability of LLMs, ensuring the veracity of their performance remains an essential task for advancing reliable AI deployment.

Conclusion

The paper presents a rigorous framework for evaluating test set contamination in black box LLMs, adding to the methods available to the field. The statistical tests it develops are both computationally efficient and powerful, and can reveal otherwise hidden dataset contamination. This work underscores the need for transparency in LLM training and argues for consistent external auditing to uphold the integrity of AI research benchmarking.
