A Survey on Self-Supervised Learning for Non-Sequential Tabular Data (2402.01204v4)
Abstract: Self-supervised learning (SSL) has been incorporated into many state-of-the-art models across various domains, where SSL defines pretext tasks on unlabeled data to learn contextualized and robust representations. Recently, SSL has become a new trend for exploring representation learning in the tabular domain, which is more challenging because tabular data lack the explicit relations that help in learning descriptive representations. This survey systematically reviews and summarizes the recent progress and challenges of SSL for non-sequential tabular data (SSL4NS-TD). We first give a formal definition of NS-TD and clarify its relation to adjacent lines of research. We then categorize existing approaches into three groups, predictive learning, contrastive learning, and hybrid learning, and discuss the motivations and strengths of representative methods in each direction. Moreover, we present application issues of SSL4NS-TD, including automatic data engineering, cross-table transferability, and domain knowledge integration. We also describe existing benchmarks and datasets for NS-TD applications and analyze the performance of existing tabular models on them. Finally, we discuss the challenges of SSL4NS-TD and suggest directions for future research. We hope this work encourages further research that lowers the barrier to entry for SSL in the tabular domain and strengthens the foundations for implicit tabular data.
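To make the contrastive-learning branch of the taxonomy concrete, the sketch below illustrates a SCARF-style pretext task on unlabeled tabular rows: a view of each row is produced by replacing a random subset of features with values drawn from the corresponding columns of other rows, and an encoder is trained with an InfoNCE objective so that a row and its corrupted view embed close together. The encoder architecture, corruption rate, and training loop are illustrative assumptions, not any surveyed method's reference implementation.

```python
# Minimal, hypothetical sketch of a SCARF-style contrastive pretext task for
# tabular data. All module names and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPEncoder(nn.Module):
    def __init__(self, num_features: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):
        return self.net(x)

def corrupt(x: torch.Tensor, corruption_rate: float = 0.6) -> torch.Tensor:
    """Replace a random subset of features with values sampled column-wise
    from other rows in the batch (an empirical-marginal approximation)."""
    batch, num_features = x.shape
    mask = torch.rand(batch, num_features, device=x.device) < corruption_rate
    idx = torch.randint(0, batch, (batch, num_features), device=x.device)
    marginals = torch.gather(x, 0, idx)  # x[idx[i, j], j]
    return torch.where(mask, marginals, x)

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """InfoNCE loss: matching (row, corrupted view) pairs are positives;
    all other rows in the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

# Toy pre-training loop on unlabeled numeric tabular data (placeholder tensor).
x_unlabeled = torch.randn(512, 20)
encoder = MLPEncoder(num_features=20)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(100):
    batch = x_unlabeled[torch.randperm(512)[:128]]
    z_anchor = encoder(batch)
    z_view = encoder(corrupt(batch))
    loss = info_nce(z_anchor, z_view)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

After pre-training, the encoder's representations would typically be reused for downstream prediction, e.g. by attaching a small classification head and fine-tuning on the available labeled rows.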