
EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models (2404.12404v4)

Published 15 Apr 2024 in cs.LG and cs.AI

Abstract: LLMs have demonstrated remarkable in-context learning capabilities across diverse applications. In this work, we explore the effectiveness of LLMs for generating realistic synthetic tabular data, identifying key prompt design elements to optimize performance. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets. Evaluations on real-world datasets show that EPIC achieves state-of-the-art machine learning classification performance while significantly improving generation efficiency. These findings highlight the effectiveness of EPIC for synthetic tabular data generation, particularly in addressing class imbalance. Our source code is available at: https://seharanul17.github.io/project-synthetic-tabular-LLM/
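The abstract names three prompt-design elements: balanced, grouped class samples; consistent row formatting; and unique variable mapping. The sketch below illustrates, under assumptions, how such a prompt could be assembled. It is not the paper's actual implementation; every name here (build_epic_style_prompt, per_class, the value-mapping scheme) is hypothetical and chosen only to make the three elements concrete.

```python
import random
from collections import defaultdict

def build_epic_style_prompt(rows, label_key, per_class=3):
    """Assemble a balanced, grouped few-shot prompt for tabular synthesis.

    `rows` is a list of dicts sharing the same keys; `label_key` names the
    class column. All names here are illustrative, not the paper's API.
    """
    # Balanced sampling: group rows by class so every class, including
    # minority classes, contributes the same number of examples.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    columns = [k for k in rows[0] if k != label_key] + [label_key]

    # Unique variable mapping (assumed scheme): give each categorical value
    # a distinct token so identical strings in different columns stay apart.
    value_map = {}
    def mapped(col, val):
        if isinstance(val, (int, float)):
            return str(val)
        key = (col, val)
        if key not in value_map:
            value_map[key] = f"{col}_{len(value_map)}"
        return value_map[key]

    # Consistent formatting: every example uses one fixed CSV-like layout.
    def render(row):
        return ", ".join(mapped(c, row[c]) for c in columns)

    lines = ["Generate new rows in exactly this format: " + ", ".join(columns)]
    for label, group in sorted(by_class.items(), key=lambda kv: str(kv[0])):
        # Grouped presentation: each class's examples form a contiguous block.
        lines.append(f"Examples of class {mapped(label_key, label)}:")
        sample = random.sample(group, min(per_class, len(group)))
        lines.extend(render(r) for r in sample)
    lines.append("New rows:")
    return "\n".join(lines), value_map
```

The returned `value_map` would let generated rows be decoded back to the original categorical values by inverting the mapping; how EPIC actually encodes and decodes values is specified in the paper and source code linked above.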
