Why do tree-based models still outperform deep learning on tabular data? (2207.08815v1)

Published 18 Jul 2022 in cs.LG, cs.AI, stat.ME, and stat.ML

Abstract: While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.

Summary

The paper shows, through extensive benchmarks on 45 datasets from varied domains, that tree-based models outperform deep learning on medium-sized tabular data (about 10K samples).
It tunes the hyperparameters of every model, using about 20,000 compute hours in total, so that each model is compared at or near its best.
The study reveals that neural networks’ bias for smooth functions limits their ability to capture the sharp, irregular patterns present in many tabular datasets.

Understanding Tree-Based Model Dominance in Tabular Data Through Empirical Benchmarks

Introduction to Tree-Based Models vs. Deep Learning for Tabular Data

Deep learning has transformed domains such as vision, text, and audio, but its results on tabular data are less convincing. Ensemble tree-based methods such as XGBoost, Random Forests, and Gradient Boosting Trees remain the de facto choice for a wide range of tabular applications, despite deep learning's capacity for modeling complex, hierarchical patterns. The paper asks why this gap exists and under which conditions tree-based models outperform neural networks on tabular datasets.

Benchmarking Methodology and Results Overview

The paper designs a benchmark that compares tree-based models and deep learning architectures across a large collection of tabular datasets. Every model goes through hyperparameter optimization, so the results reflect what each model can achieve when tuned rather than its default behavior. The methodology comprises:

The selection and pre-processing of 45 diverse tabular datasets from publicly available sources, aiming to cover a wide spectrum of real-world applications.
An extensive hyperparameter search, amounting to about 20,000 compute hours, to tune each model.
A fair and consistent performance evaluation setup, including metrics like accuracy and R² score for classification and regression tasks, respectively.
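
The role of the tuning budget can be made concrete with a small sketch. It assumes that each search iteration has already produced a (validation score, test score) pair, and it reports the test score of the configuration that is best on validation after a given number of iterations, averaged over random reorderings of the search. The function name, the toy scores, and the default number of shuffles are illustrative assumptions, not the paper's code.

```python
import random
from statistics import mean


def score_at_budget(runs, n_iter, n_shuffles=15, seed=0):
    """Average test score of the best-on-validation config after n_iter draws.

    runs: list of (val_score, test_score) pairs, one per search iteration.
    """
    rng = random.Random(seed)
    picked = []
    for _ in range(n_shuffles):
        order = runs[:]
        rng.shuffle(order)
        # Select on the validation score only, then report the matching test score.
        _, best_test = max(order[:n_iter], key=lambda r: r[0])
        picked.append(best_test)
    return mean(picked)


# Toy, made-up scores for one hypothetical model on one dataset.
runs = [(0.71, 0.70), (0.80, 0.78), (0.76, 0.77), (0.83, 0.81), (0.65, 0.66)]

for n in (1, 3, len(runs)):
    print(n, round(score_at_budget(runs, n), 3))
```

Selecting on validation and reporting on test keeps the comparison honest: a model cannot look better simply because one of its many configurations happened to score well on the test set.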

The benchmarks show that tree-based models retain a clear edge over deep learning models on medium-sized datasets (about 10K samples), a size that is common in real-world applications. The gap holds even before accounting for the faster training of tree-based models.

Empirical Investigation into Model Inductive Biases

To explain this performance gap, the paper runs an empirical analysis of the differing inductive biases of tree-based models and neural networks. The analysis yields three key insights:

Tree-based models are better at fitting the irregular target functions found in tabular data, whereas neural networks are biased towards overly smooth solutions. Because neural networks tend to learn low-frequency functions first, they are less efficient at capturing the sharp, irregular patterns in many real-world tabular datasets.
Tree-based models exhibit a robustness to uninformative features that is not present in neural networks. Tabular datasets often contain a significant portion of such features, contributing further to the competitive edge of tree-based methods.
MLP-like neural networks are rotationally invariant learners: their performance does not change when the input features are rotated. This works against them on tabular data, where each column carries its own meaning and the natural orientation of the features is informative. Tree-based models split on one feature at a time and can therefore exploit that orientation.
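
A toy illustration of why orientation matters, assuming a two-feature dataset whose label depends on the first feature only. A single axis-aligned split (a depth-one tree) separates the original data perfectly, but after a 45-degree rotation every feature is a mixture of the two originals and no single split does. The data, the helper names, and the angle are illustrative assumptions; the paper's own experiments apply random rotations to real datasets.

```python
import math
import random


def best_stump_accuracy(X, y):
    """Best accuracy of any single axis-aligned split (one feature, one threshold)."""
    best = 0.0
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        for t in column:
            hits = sum((v > t) == label for v, label in zip(column, y))
            acc = hits / len(y)
            # Either side of the split may be the positive class.
            best = max(best, acc, 1.0 - acc)
    return best


def rotate(X, angle):
    """Rotate 2-D points by `angle` radians, mixing the two features."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * a - s * b, s * a + c * b) for a, b in X]


rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(400)]
y = [a > 0 for a, _ in X]  # the label depends on the first feature only

acc_original = best_stump_accuracy(X, y)
acc_rotated = best_stump_accuracy(rotate(X, math.pi / 4), y)
print(acc_original, acc_rotated)
```

On the original features one split recovers the label exactly. After the rotation the best single split is expected to land around three quarters accuracy for Gaussian features, so a tree needs many more splits to approximate the same boundary, while a rotationally invariant learner sees the two versions as equally hard.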

Practical Implications and Future Directions

The observed superiority of tree-based models on tabular data has implications for both practice and research. For practitioners, the findings reinforce the view that ensembles of decision trees should remain the first-line approach for most tabular data problems. For researchers, the insights into neural networks' inductive biases point towards deep learning architectures designed specifically for tabular data. Such architectures would need to learn irregular functions easily rather than over-smoothing, be robust to uninformative features, and preserve the orientation of the data.

Conclusion

In summary, this comprehensive benchmarking paper and subsequent empirical analysis provide a clear picture of the current landscape of machine learning model performance on tabular data. While deep learning continues to advance rapidly, traditional tree-based methods still hold a strong position in this specific field. The identified inductive biases and characteristics provide a roadmap for future research efforts aimed at bridging this performance gap.
