TabularFM: An Open Framework For Tabular Foundational Models (2406.09837v2)

Published 14 Jun 2024 in cs.LG

Abstract: Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated one million tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.

Summary

  • The paper introduces an open framework integrating curated datasets and modular generative models to enable rigorous pretraining and transferability studies in tabular data.
  • The paper demonstrates that pretrained CTGAN and STVAE variants consistently outperform scratch-trained models by approximately 10 metric points across benchmarks.
  • The paper highlights limitations such as dataset scale and meta-information integration while laying a foundation for future advancements in tabular FMs.

TabularFM: An Open Framework for Tabular Foundational Models

Motivation and Context

Pretrained foundational models (FMs) have yielded strong generalization advances in domains such as text and vision, but tabular data—despite its ubiquity and practical significance—remains comparatively under-explored in FM research. Key challenges include tabular data heterogeneity, a lack of standardized, high-quality benchmarks, and open questions regarding cross-table transferability and model design. TabularFM systematically addresses these deficiencies by providing large-scale cleaned datasets, pretrained generative models, transferability studies, and a modular open-source framework for rigorous experimentation with tabular FMs.

Figure 1: t-SNE representation of the top 10 tabular data domains, showing clear structure in the projected space and enabling analysis of domain-level transferability.

Datasets: Curation and Domain Splitting

TabularFM constructs its experimental foundation on two corpora: Kaggle and GitTables. From >1 million candidate tables, rigorous quality filtering (file format, usability score, metadata completeness, and value-type constraints) yields 1,435 from Kaggle and 1,258 from GitTables. Notably, extensive data cleaning—column filtering, missing data imputation, noise/ID/timestamp exclusion—is critical given the highly noisy distribution of web-acquired tabular resources.
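
As a schematic illustration of the quality filtering step, the sketch below applies the four stated criteria; the field names and thresholds are entirely illustrative stand-ins, not the paper's exact cutoffs:

```python
# Schematic quality filter for candidate tables. Field names and thresholds
# are illustrative stand-ins, not TabularFM's exact criteria.
def passes_quality_filter(meta: dict) -> bool:
    return (
        meta.get("file_format") == "csv"                   # file-format constraint
        and meta.get("usability_score", 0.0) >= 0.8        # usability-score cutoff
        and meta.get("has_metadata", False)                # metadata completeness
        and meta.get("valid_value_fraction", 0.0) >= 0.95  # value-type constraint
    )
```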

Multiple splitting protocols are introduced. Beyond standard random splits, TabularFM establishes a domain-based partition using k-means clustering over BERT-encoded table names, yielding domain-homogeneous test sets for probing out-of-domain transfer. This is essential for scientific analysis of model generalization and transferability beyond i.i.d. settings.
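
A minimal sketch of this domain split, assuming sentence-transformers for the BERT-style encoder and scikit-learn for k-means; the encoder checkpoint, cluster count, and holdout policy are illustrative rather than the paper's exact settings:

```python
# Sketch: domain-based split via k-means over encoded table names.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def domain_split(table_names, n_domains=10, holdout_domains=2, seed=0):
    """Cluster table names into pseudo-domains, then hold out whole
    clusters to form a domain-homogeneous test set."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in BERT-style encoder
    embeddings = encoder.encode(table_names)           # shape: (n_tables, dim)
    labels = KMeans(n_clusters=n_domains, random_state=seed).fit_predict(embeddings)

    # Reserve the last `holdout_domains` clusters as the test partition.
    test_clusters = set(range(n_domains - holdout_domains, n_domains))
    train = [n for n, c in zip(table_names, labels) if c not in test_clusters]
    test = [n for n, c in zip(table_names, labels) if c in test_clusters]
    return train, test
```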

TabularFM Framework and Model Architectures

The TabularFM framework supports end-to-end processing from data acquisition through preprocessing, model training, and evaluation with extensible modules for (i) data transformation, (ii) generative model pretraining and fine-tuning, and (iii) transferability-centric benchmarking.

Supported Generative Models

  • CTGAN: Conditional GAN for tabular data, employing WGAN-GP objectives, conditional vectors per column, and PacGAN-style batches to mitigate mode collapse.
  • TVAE: VAE adapted for tabular modality, coupling Gaussian mixture-based normalization and ELBO optimization.
  • STVAE: A modification of TVAE that removes the dataset-specific trainable std-dev parameters, reducing the reconstruction term to plain MSE—a design intended to increase inter-table transferability (see the loss sketch after this list).
  • STVAEM: Extends STVAE with per-column signature embeddings, concatenated via column-name encoding from large-scale pretrained LLMs.
  • GReaT: Transformer decoder models (distilled GPT-2 baseline), utilizing language modeling over serialized, textified table rows.
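
To make the STVAE design point concrete, here is a minimal PyTorch sketch contrasting TVAE's learned per-column std-dev with STVAE's fixed unit variance, under which the Gaussian reconstruction term reduces to plain MSE; the shapes and KL weight are simplifications, not the framework's exact implementation:

```python
import torch
import torch.nn.functional as F

def tvae_recon_loss(x, x_hat, log_sigma):
    """TVAE-style Gaussian NLL with a trainable, dataset-specific std-dev
    per column (log_sigma has shape [n_columns])."""
    return (((x - x_hat) ** 2) / (2 * torch.exp(2 * log_sigma)) + log_sigma).sum(1).mean()

def stvae_recon_loss(x, x_hat):
    """STVAE-style reconstruction with fixed unit variance: plain MSE."""
    return F.mse_loss(x_hat, x, reduction="none").sum(1).mean()

def vae_objective(x, x_hat, mu, logvar, recon_fn, kl_weight=1.0):
    """Standard ELBO-style objective: reconstruction plus KL to N(0, I)."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
    return recon_fn(x, x_hat) + kl_weight * kl
```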

Each model is integrated with data transformation pipelines: categorical columns are one-hot encoded; numerical columns are normalized via Gaussian mixture modeling; for transformer models, data is serialized as natural-language phrases, e.g., "Age is 26 and Gender is M". This modularization supports reproducible comparison and experimentation.
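
A condensed sketch of the two transformation paths, using scikit-learn's BayesianGaussianMixture for the mode-specific numerical normalization (as in CTGAN) and a simple string template for transformer serialization; the component count and scaling constant are illustrative:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def gmm_normalize(column, max_modes=10):
    """Mode-specific normalization: fit a GMM to a numerical column, then
    scale each value within its most likely mode (CTGAN-style, simplified)."""
    values = np.asarray(column, dtype=float).reshape(-1, 1)
    gmm = BayesianGaussianMixture(n_components=max_modes, random_state=0).fit(values)
    modes = gmm.predict(values)
    means = gmm.means_[modes, 0]
    stds = np.sqrt(gmm.covariances_[modes, 0, 0])
    return (values[:, 0] - means) / (4 * stds), modes  # scalar part + mode label

def serialize_row(row: dict) -> str:
    """Textify a table row for transformer models."""
    return " and ".join(f"{col} is {val}" for col, val in row.items())

# serialize_row({"Age": 26, "Gender": "M"}) -> "Age is 26 and Gender is M"
```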

Experimental Protocol and Evaluation Metrics

TabularFM evaluates transferability by pretraining models on curated pretraining sets, then fine-tuning and evaluating on distinct validation/test splits against a baseline of models trained from scratch. Synthetic data is systematically assessed against real data using:

  • Column Shape Similarity: Kolmogorov-Smirnov (KS) statistic for numerical columns, total variation distance (TVD) for categorical columns.
  • Column Trend Similarity: Pearson correlation for numerical column pairs, TVD over contingency tables for categorical/heterogeneous pairs.

Overall scores are mean-aggregated to enable rigorous, interpretable statistical comparison (e.g., using Mann-Whitney U tests to assess significance).
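
A minimal sketch of the column-shape metric, the mean aggregation, and the significance test, using scipy; rescaling each statistic into a higher-is-better similarity follows common practice and may differ in detail from the framework's implementation:

```python
import numpy as np
from scipy import stats

def column_shape_similarity(real, synth, categorical=False):
    """Per-column shape score: 1 - TVD for categoricals, 1 - KS statistic otherwise."""
    if categorical:
        cats = sorted(set(real) | set(synth))
        p = np.array([np.mean(np.asarray(real) == c) for c in cats])
        q = np.array([np.mean(np.asarray(synth) == c) for c in cats])
        return 1.0 - 0.5 * np.abs(p - q).sum()  # 1 - total variation distance
    ks_stat, _ = stats.ks_2samp(real, synth)
    return 1.0 - ks_stat

def overall_score(per_column_scores):
    """Mean-aggregate per-column scores into a table-level score."""
    return float(np.mean(per_column_scores))

def compare_methods(scores_a, scores_b):
    """Mann-Whitney U test over per-dataset scores of two methods."""
    return stats.mannwhitneyu(scores_a, scores_b).pvalue
```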

Results and Empirical Analysis

Consistently, CTGAN and STVAE variants pretrained on large table collections outperform models trained from scratch by approximately 10 points in overall metrics, on both random and domain splits. However, adding meta-information (STVAEM) only marginally improves transferability. Transformers pretrained solely on text (GReaT) perform better than any tabular-pretrained transformer, and surprisingly, domain-adaptive fine-tuning sometimes slightly degrades transformer performance. This suggests that existing tabular datasets may be too small for further transformer pretraining to outcompete LLMs trained on massive text corpora.

The transferability of pretrained generative models is also evident in faster convergence and a superior fit to empirical data distributions:

Figure 2: Training/validation loss for STVAE; pretrained initialization converges faster and reaches a lower final loss than training from scratch.

Figure 3: Column-wise distributions for pretrained vs. scratch-trained STVAE reveal superior tail modeling and distributional fidelity after pretraining.

Performance improvements are robust across a range of network sizes and learning rates:

Figure 4: Learning rate sensitivity: Pretrained CTGAN models outpace scratch-trained models across learning rates in validation trials.

Figure 5: Larger CTGAN architectures yield improved validation performance, with pretraining consistently outperforming training from scratch.

Transferability, Generality, and Limits of Knowledge Capture

Analysis of column-level transferability reveals that columns corresponding to general semantics (e.g., Age, Gender, Disease) benefit most from pretraining, while columns containing highly specific or domain-unique semantics (e.g., budget codes) do not:

Figure 6: Wordclouds for columns with high vs. low transferability highlight semantic generality as a key factor.

Moreover, pretrained models are able to capture and transfer meaningful correlations, such as public-health and clinical patterns, across unrelated tables:

Figure 7: Visualization of correlation trends where pretrained models exhibit significant improvements or deficits relative to scratch-trained models.

Limitations and Open Problems

Despite clear evidence for transferability gains, several limitations are identified:

  • Current training focuses exclusively on numerical and categorical columns, omitting richer data types (dates, free text, time series).
  • The scale and diversity of current tabular datasets are still insufficient, especially for transformer FMs, whose pretraining appears bounded by dataset size.
  • Pretraining with additional meta-information (signature/column embeddings) yields at best incremental improvements, indicating a need for more expressive or context-sensitive metadata integration.
  • Comparison to larger-scale LLMs remains difficult without more expansive tabular pretraining corpora.

Implications and Future Directions

The systematic findings of TabularFM establish that:

  • Generative models such as CTGAN and STVAE, when pretrained on large, diverse corpora, exhibit substantially improved sample quality, faster convergence, and modest cross-domain generalization capacity in tabular data synthesis.
  • Transformers, despite their column permutation invariance and high baseline performance when pretrained on text, do not automatically benefit from further pretraining on modest tabular corpora unless tabular FM datasets can be scaled much further.
  • Evaluation and benchmarking of tabular foundation models are now standardized thanks to the provided open framework, cleaned datasets, pretrained checkpoints, and reproducible leaderboards.
  • The field remains in early stages relative to canonical NLP/Vision FMs: challenges in data curation, transfer across heavily heterogeneous domains, and appropriate use of meta-information must be addressed for next-generation tabular FMs.

Conclusion

TabularFM significantly advances the systematic study of foundational models for tabular data by providing curated datasets, pretrained generative model architectures, and rigorous benchmarks for transferability. The experimental findings robustly support the efficacy of pretraining (especially for GANs and VAEs) in tabular domains and illustrate the nuanced limits of transfer in transformer-based approaches given current dataset scale. The openly released framework, models, and benchmarks form an essential foundation for future research in tabular FM design and deployment. Further scaling, richer architectural innovation, and broader data-type support remain important directions for future work.

Knowledge Gaps

Below is a single, consolidated list of concrete gaps and open problems that remain unresolved and could guide follow-up research.

  • Scope and scale of pretraining data
    • The “cleaned” corpus shrinks from ~1M raw tables to only 2,693 usable tables; it is unclear whether transferability conclusions hold at larger, more diverse scales and in non-public/enterprise domains.
    • Lack of analysis on dataset biases induced by sourcing only from Kaggle and GitHub (topic skew, language skew, quality skew) and how these biases affect transfer.
    • No ablation on how the degree of table heterogeneity (schemas, domain coverage, cardinalities) influences pretraining benefits.
  • Domain split validity and leakage
    • The “domain-based” split relies on k-means over BERT embeddings of table names; this risks leakage via naming conventions and does not ensure distributional independence at the value level.
    • No alternative domain partitioning strategies (e.g., based on value distributions, metadata fields, provenance/repo, or leave-one-domain-out) are tested to validate robustness of transfer claims.
  • Evaluation scope: synthetic data quality vs task utility
    • Transferability is assessed only through synthetic-versus-real similarity (column shapes, pairwise correlations) rather than downstream task utility (classification/regression, imputation, anomaly detection).
    • No TSTR/TRTS-style evaluations (Train on Synthetic, Test on Real, and vice versa) or linear-probe tests on learned representations to verify practical utility of the pretraining.
  • Metrics and statistical rigor
    • Evaluation relies on marginal distributions and Pearson pairwise correlations; higher-order dependencies, nonlinear relationships, and constraint satisfaction (e.g., functional dependencies) are not measured.
    • No classifier two-sample tests, mutual information, copula-based distances, or multi-variate goodness-of-fit metrics to capture complex tabular structure.
    • Multiple-comparisons handling and robustness across random seeds are not reported; sensitivity to evaluation protocol (sample sizes, subsampling strategy) is unexplored.
  • Missing data, outliers, and real-world artifacts
    • Datasets were filtered to remove noisy/unstructured content; the framework does not model missingness patterns, outliers, or data-entry errors common in real-world tables.
    • No evaluation of imputation capability or how pretraining affects robustness to missing values and extreme values.
  • Categorical encoding and scalability
    • One-hot encoding is used for categorical variables; the framework does not address high-cardinality categories, rare category handling, or memory-efficient encodings (e.g., hashing, learned embeddings).
    • No analysis of how category cardinality and imbalance affect pretraining transfer and model stability.
  • Units, scales, and semantic typing
    • Preprocessing normalizes numerics using mixture-of-Gaussians per column but does not incorporate units, value ranges, or semantic types; cross-table comparability of semantically similar columns is therefore unclear.
    • Open question: which semantic signals (types, units, ontologies, value vocabularies) most improve cross-table transfer?
  • Column-name dependence and obfuscation
    • LLM-based methods and metadata use heavily depend on column names; there is no evaluation with obfuscated, noisy, multilingual, or synonym-rich schema names to test semantic robustness beyond surface tokens.
    • It remains unclear how much transfer stems from general world knowledge in GPT-2 versus actual tabular structure learning.
  • Metadata design and incorporation
    • The proposed STVAEM (dataset-level “signature” from column-name embeddings) did not help; there is no systematic exploration of alternative metadata (types, units, statistical profiles, schema graphs) or how to incorporate them (adapters, prompts, multi-task losses).
    • No experiments with ontology alignment, schema linking, or entity/value-level metadata that could bridge schema heterogeneity.
  • Permutation invariance and table symmetries
    • CTGAN/TVAE variants are not permutation-invariant; the impact of column order and row order is not ablated.
    • No evaluation of architectures explicitly enforcing set/sequence invariances (e.g., DeepSets, Set Transformers, permutation-invariant positional encodings) or order-agnostic serializations for LLMs.
  • LLM adaptation challenges
    • Fine-tuning GPT-style models on the provided tabular corpora hurts performance; the causes (catastrophic forgetting, insufficient data scale, poor serialization, optimization settings) remain untested.
    • No exploration of parameter-efficient tuning (LoRA, adapters), replay/regularization for anti-forgetting, curriculum learning, or serialization/prompting ablations to stabilize tabular adaptation.
  • Architectural coverage and baselines
    • The framework evaluates only CTGAN/TVAE family and one LLM-based generator; modern alternatives (e.g., diffusion models for tables like TabDDPM, normalizing flows, CTAB-GAN+, masked autoencoders, FT-Transformer/SAINT/TabTransformer) are absent.
    • Classical probabilistic baselines (Gaussian copulas, Bayesian networks) are missing, limiting interpretability of gains.
  • Privacy, memorization, and safety
    • No privacy audits (membership inference, attribute inference, record linkage) or differentially private training; the risk of memorization in generative tabular FMs is unquantified.
    • No fairness/bias assessment (e.g., transfer on sensitive attributes like Age/Gender), nor analysis of spurious correlations that may be amplified by pretraining.
  • Representations beyond generation
    • The framework focuses on data synthesis; it is unclear whether pretrained encoders yield transferable representations for non-generative tasks (e.g., few-shot classification/regression, retrieval, causal discovery).
    • No release/evaluation of reusable tabular encoders with task-agnostic objectives (contrastive, masked-cell modeling).
  • Constraint and relational awareness
    • Row/column-level constraints (uniqueness, sums, monotonicity), and inter-column logical relations are not modeled or evaluated in generation quality.
    • Only single-table settings are considered; multi-table, relational schemas (foreign keys, joins) and cross-table pretraining are unexplored.
  • Temporal, spatial, and mixed-type columns
    • The framework excludes temporal, geospatial, free-text, image-in-cell, and other mixed types frequently present in tabular data; typed handling and cross-modal pretraining remain open.
    • The impact of date/time-specific semantics and seasonality on transfer is not evaluated.
  • Hyperparameter, compute, and scaling laws
    • Pretraining budgets are limited (single A100, 2 days cap); undertraining may confound negative LLM fine-tuning results.
    • No systematic scaling-law study (model size, data size, training steps) to map performance vs. resources.
  • Reproducibility and licensing
    • Kaggle license restrictions prevent full release of trained models, impeding reproducibility; community-standard, fully open benchmarks/splits are still missing.
    • Seed control, run-to-run variance, and environment determinism are not documented across all experiments.
  • Cross-lingual and multi-locale generalization
    • No analysis of column/value languages, locale-specific formats (decimal separators, dates, currencies), or multilingual embeddings for cross-lingual transfer.
  • Robustness and security
    • Adversarial robustness (e.g., to schema perturbations, value corruptions) and distribution shift resilience are not evaluated.
    • No study of how small schema changes (renaming, reordering, type-casting) affect pretrained model performance.
  • Interpretability of transferred “knowledge”
    • Qualitative wordcloud/correlation analyses are suggestive but not causal; controlled studies are needed to validate which correlations truly generalize and which are dataset artifacts.
  • Leaderboard design and standardization
    • The proposed leaderboards center on synthetic data fidelity; standardized suites spanning utility, privacy, fairness, robustness, and constraint satisfaction are needed for comprehensive FM assessment.
  • Generalization to unseen schemas (zero-shot)
    • The framework does not evaluate zero-shot generation or adaptation to entirely unseen schemas (novel columns and combinations) without fine-tuning.
  • Environmental impact
    • No reporting on energy/carbon costs or efficiency comparisons across methods, which are increasingly expected for FM research.
