Why do tree-based models still outperform deep learning on tabular data?

(2207.08815)
Published Jul 18, 2022 in cs.LG, cs.AI, stat.ME, and stat.ML

Abstract

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.

Overview

  • The paper compares traditional tree-based models like XGBoost, Random Forests, and Gradient Boosting Trees with deep learning models on tabular data, revealing that tree-based models often perform better.

  • A benchmark of 45 tabular datasets, combined with an extensive hyperparameter search, shows that tree-based models remain ahead of deep learning models, particularly on medium-sized datasets (about 10K samples).

  • An empirical analysis highlights the differing inductive biases of tree-based models and neural networks, such as tree-based models' robustness to uninformative features and neural networks' bias towards smoother solutions.

  • The findings suggest that ensembles of decision trees should remain the preferred approach for tabular data problems, and that deep learning architectures for tabular data need inductive biases better matched to such data.

Understanding Tree-Based Model Dominance in Tabular Data Through Empirical Benchmarks

Introduction to Tree-Based Models vs. Deep Learning for Tabular Data

While deep learning has transformed domains such as vision, text, and audio, its performance on tabular data has remained less convincing. Ensemble tree-based methods such as XGBoost, Random Forests, and Gradient Boosting Trees continue to be the de facto choice for a wide range of applications involving tabular data, despite deep learning's capacity for modeling complex, hierarchical patterns. The paper investigates the reasons behind this discrepancy, and in particular the conditions under which tree-based models outperform neural networks on tabular datasets.

Benchmarking Methodology and Results Overview

The study designs a benchmarking process to compare tree-based models and deep learning architectures across a large collection of tabular datasets. Each model goes through a hyperparameter optimization step, so that the comparison accounts for both fitting the model and finding good hyperparameters. The methodology encompasses:

  • The selection and pre-processing of 45 diverse tabular datasets from publicly available sources, aiming to cover a wide spectrum of real-world applications.
  • An extensive hyperparameter search, amounting to about 20,000 compute hours, to tune each model.
  • A consistent evaluation setup that reports accuracy for classification tasks and the R² score for regression tasks.
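The evaluation ingredients listed above can be illustrated with a short sketch. It is not the paper's code: the function names (`accuracy`, `r2_score`, `best_test_score_curve`) are hypothetical, the scores in `toy_runs` are made-up numbers for illustration only, and the procedure shown (pick the configuration with the best validation score among the first n random-search runs, report its test score, average over shuffled orderings) is one common way to report performance as a function of tuning budget.

```python
import random
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (regression)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def best_test_score_curve(runs, n_shuffles=100, seed=0):
    """Average test score of the best-on-validation config per search budget.

    runs: list of (validation_score, test_score), one per sampled config.
    Entry n-1 of the result is the test score of the config with the highest
    validation score among the first n runs, averaged over random orderings.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(runs)
    for _ in range(n_shuffles):
        order = runs[:]
        rng.shuffle(order)
        best = order[0]
        for i, run in enumerate(order):
            if run[0] > best[0]:  # selection uses the validation score only
                best = run
            totals[i] += best[1]  # reporting uses the test score
    return [t / n_shuffles for t in totals]


# Toy (validation, test) scores; illustrative values, not results from the paper.
toy_runs = [(0.81, 0.80), (0.86, 0.84), (0.78, 0.79), (0.90, 0.88)]
curve = best_test_score_curve(toy_runs)
```

Selecting on the validation score and reporting the test score keeps the tuning budget part of the comparison: a model that only performs well after many search iterations shows up as a curve that rises slowly.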

The key finding from these benchmarks is that tree-based models remain state-of-the-art on medium-sized datasets (about 10K samples), even before their faster training is taken into account.

Empirical Investigation into Model Inductive Biases

To explain this performance gap, the paper conducts an empirical analysis of the differing inductive biases of tree-based models and neural networks. The investigation yields three key insights:

  • Tree-based models are better at fitting the irregular target functions found in tabular data, whereas neural networks are biased towards smooth solutions. Because neural networks tend to learn low-frequency functions first, they are less efficient at capturing the sharp changes in the target that occur in many real-world tabular datasets.
  • Tree-based models are robust to uninformative features, while neural networks are not. Tabular datasets often contain a significant proportion of such features, which adds to the competitive edge of tree-based methods.
  • Neural networks such as MLPs are rotationally invariant learners: rotating the feature space of the training and test data does not change their performance. In tabular data, however, each feature usually carries its own meaning, so the natural orientation of the data is informative, and a learner that ignores it is at a disadvantage. Tree-based models, which split on one feature at a time, are not rotationally invariant and can exploit this orientation.
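The role of data orientation in the list above can be illustrated with a toy experiment. This is a minimal sketch under stated assumptions, not the paper's experiment: the data is synthetic, the helper names `best_stump_accuracy` and `rotate` are hypothetical, and a single axis-aligned split (a depth-1 tree) stands in for a tree-based model. The target is a sharp step along the first feature, which one split captures exactly in the original orientation but not after the features are mixed by a 45 degree rotation.

```python
import math
import random


def best_stump_accuracy(X, y):
    """Best training accuracy of a single axis-aligned split (a depth-1 tree)."""
    n = len(y)
    best = 0.0
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        for t in column:  # candidate thresholds: the observed feature values
            hits = sum((v > t) == label for v, label in zip(column, y))
            best = max(best, hits / n, 1 - hits / n)  # allow either polarity
    return best


def rotate(X, angle):
    """Rotate every 2-D point by `angle` radians about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * a - s * b, s * a + c * b] for a, b in X]


# Synthetic data (an assumption for illustration): two features, uniform on
# [-1, 1], and a label that depends only on the sign of the first feature.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [a > 0 for a, _ in X]

acc_original = best_stump_accuracy(X, y)                      # axis-aligned
acc_rotated = best_stump_accuracy(rotate(X, math.pi / 4), y)  # features mixed

print(f"original orientation: {acc_original:.2f}")
print(f"after 45 degree rotation: {acc_rotated:.2f}")
```

A learner whose training procedure is rotationally invariant would, by definition, score the same on both versions of the data. The split-based learner does not, which is the sense in which it makes use of the original orientation of the features.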

Practical Implications and Future Directions

The observed advantage of tree-based models on tabular data has implications for both practice and research. From an applied perspective, the findings reinforce the view that ensembles of decision trees should remain the first-line approach for most tabular data problems. On the research front, the insights into neural networks' inductive biases point towards deep learning architectures designed specifically for tabular data. Such architectures would need to be able to learn irregular functions easily rather than defaulting to smooth ones, be robust to uninformative features, and preserve the orientation of the data.

Conclusion

In summary, this comprehensive benchmarking study and subsequent empirical analysis provide a clear picture of the current landscape of machine learning model performance on tabular data. While deep learning continues to advance rapidly, traditional tree-based methods still hold a strong position in this specific realm. The identified inductive biases and characteristics provide a roadmap for future research efforts aimed at bridging this performance gap.
