Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models (2404.06209v3)
Abstract: While many have shown how LLMs can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of techniques to assess whether an LLM has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to their performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, their sample efficiency in few-shot learning lags behind that of traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release a Python package (https://github.com/interpretml/LLM-Tabular-Memorization-Checker) to test LLMs for memorization of tabular datasets.
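To make the idea behind such memorization tests concrete, here is a minimal sketch of one test in the spirit of the paper: show the model consecutive rows of a CSV file and check whether it completes the next row verbatim. This is an illustration under stated assumptions, not the released package's actual API: the `query_llm` helper, the `row_completion_test` name, the prompt wording, and the `titanic.csv` path are all hypothetical stand-ins; see the linked repository for the authors' implementation.

```python
# Sketch of a row-completion memorization test. A verbatim match on a
# held-out row is evidence of memorization, because the exact cell values
# of the next row cannot be inferred from the preceding rows alone.
import csv


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion call; wire up your provider here."""
    raise NotImplementedError


def row_completion_test(csv_path: str,
                        num_prefix_rows: int = 10,
                        num_trials: int = 5) -> float:
    """Return the fraction of trials where the model reproduces the next row verbatim."""
    with open(csv_path, newline="") as f:
        # Re-join fields with commas; ignores quoting/escaping, fine for a sketch.
        rows = [",".join(fields) for fields in csv.reader(f)]
    header, body = rows[0], rows[1:]

    needed = num_trials * (num_prefix_rows + 1)
    if len(body) < needed:
        raise ValueError(f"need at least {needed} data rows, got {len(body)}")

    matches = 0
    for trial in range(num_trials):
        start = trial * (num_prefix_rows + 1)
        prefix = body[start:start + num_prefix_rows]
        target = body[start + num_prefix_rows]
        prompt = (
            "The following lines are consecutive rows of a CSV file. "
            "Reply with the next row, and nothing else.\n\n"
            + header + "\n" + "\n".join(prefix)
        )
        response = query_llm(prompt).strip()
        first_line = response.splitlines()[0] if response else ""
        matches += int(first_line == target)
    return matches / num_trials


# Example (hypothetical file): a match rate far above what the marginal column
# statistics would allow suggests the dataset was seen during training.
# rate = row_completion_test("titanic.csv")
```

A complementary check, also described in the paper, is to compare the model's behavior on datasets released before versus after its training cutoff, since a post-cutoff dataset cannot have been memorized.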