Emergent Mind

When Do Neural Nets Outperform Boosted Trees on Tabular Data?

(2305.02997)
Published May 4, 2023 in cs.LG , cs.AI , and stat.ML

Abstract

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

The work comprises a large-scale study on tabular data, an analysis of algorithms and metafeatures, and the release of the TabZilla benchmark datasets.

Overview

  • The paper conducts a comprehensive comparison between neural networks (NNs) and gradient-boosted decision trees (GBDTs) on tabular data, analyzing the conditions under which each model performs best.

  • Through an evaluation of 19 algorithms across 176 datasets, it finds that no single algorithm universally outperforms the others, and highlights the strong performance of TabPFN, a prior-data fitted network.

  • It identifies dataset characteristics that influence the performance of NNs and GBDTs, suggesting GBDTs are better for large or irregular datasets.

  • Introduces TabZilla, a benchmark suite for challenging and evaluating tabular algorithms, alongside discussing the implications for future research and practical applications.

Comprehensive Analysis of Neural Nets vs. Boosted Trees for Tabular Data

Overview

The paper undertakes a comprehensive study comparing the performance of neural networks (NNs) and gradient-boosted decision trees (GBDTs) on tabular data. The relative merits of NNs and GBDTs have long been debated, producing parallel streams of research advocating for each approach. This paper probes the core of that debate by analyzing the specific conditions under which each method excels or underperforms. Leveraging an analysis spanning 19 algorithms across 176 datasets, the largest study of its kind, the findings suggest the prevailing "NN vs. GBDT" debate may be overstated: for many datasets, light hyperparameter tuning on a GBDT matters more than the choice between NNs and GBDTs.
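The "light tuning" finding can be made concrete with a minimal random-search loop of the kind such tuning implies. This is a sketch, not the paper's exact protocol: the parameter names (learning_rate, max_depth, n_estimators) mirror common GBDT libraries but are illustrative assumptions, and the evaluate callback stands in for cross-validated scoring of a real model.

```python
import random

# Hypothetical GBDT search space; names mirror common libraries
# (XGBoost, LightGBM) but the ranges here are illustrative only.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-3, 0),
    "max_depth": lambda: random.randint(2, 10),
    "n_estimators": lambda: random.choice([100, 200, 500]),
}

def sample_config():
    """Draw one configuration from the search space."""
    return {name: draw() for name, draw in SPACE.items()}

def light_random_search(evaluate, n_trials=10, seed=0):
    """Evaluate a handful of sampled configs and keep the best one.

    `evaluate` maps a config dict to a score (higher is better),
    e.g. mean cross-validation accuracy of a GBDT fit with that config.
    """
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice `evaluate` would train and score an actual GBDT; even a budget of 10 to 30 trials counts as the "light" tuning the paper finds so impactful.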

Algorithmic Performance Comparisons

The paper initiates its comparative study by evaluating 19 distinct algorithms, including three GBDTs, eleven neural networks, and five baseline models across a swath of tabular datasets. It unveils two particularly interesting insights:

  1. No individual algorithm domination: No single algorithm emerged as a clear winner across all datasets. Even baseline models performed exceptionally well on some datasets, highlighting that often, a well-tuned simple model might suffice.
  2. TabPFN's notable performance: Among the evaluated algorithms, TabPFN, a prior-data fitted network, showed remarkable performance. Although it is effectively limited to training sets of size 3000, it achieved the best average results even on large datasets by randomly sampling 3000 data points for training.
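The subsampling step behind point 2 is simple enough to sketch. Below is a minimal version assuming the training data is held as parallel Python lists; the 3000-row cap comes from the paper, while the function name and interface are hypothetical, not TabPFN's actual API.

```python
import random

def subsample_for_tabpfn(X, y, max_rows=3000, seed=0):
    """Randomly downsample (X, y) to at most `max_rows` rows.

    TabPFN's usable training-set size is effectively capped (~3000 rows
    in the paper), so larger datasets are subsampled before fitting.
    Sampling is without replacement and keeps X/y rows paired.
    """
    if len(X) <= max_rows:
        return X, y
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), max_rows)
    return [X[i] for i in idx], [y[i] for i in idx]
```

The fixed seed makes the subsample reproducible across runs, which matters when comparing algorithms on the same downsampled split.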

The Impact of Dataset Characteristics

An in-depth meta-feature analysis deciphered dataset characteristics that influence the suitability of NNs or GBDTs. Findings reveal that:

  • GBDTs excel over neural networks for handling irregular datasets with skewed or heavy-tailed distributions.
  • Larger datasets and datasets with a high size-to-features ratio tend to favor GBDTs.

These findings offer critical guidance for practitioners and researchers in choosing the right algorithm based on the distinctive attributes of their datasets.
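As a concrete illustration of one such metafeature, the sketch below computes per-feature sample skewness and flags heavily skewed columns. The 2.0 threshold and the function names are hypothetical choices for illustration, not values taken from the paper.

```python
import math

def skewness(values):
    """Sample skewness (Fisher-Pearson): third central moment over
    the 1.5 power of the variance. Zero for symmetric distributions;
    large in magnitude for skewed or heavy-tailed ones."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

def flag_irregular_features(columns, threshold=2.0):
    """Return indices of columns whose |skewness| exceeds a
    (hypothetical) threshold; per the paper's findings, datasets with
    many such features tend to favor GBDTs over NNs."""
    return [i for i, col in enumerate(columns)
            if abs(skewness(col)) > threshold]
```

A practitioner could run such a check before model selection: a dataset dominated by flagged columns is, by the paper's analysis, a candidate for a GBDT rather than a neural network.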

Introducing TabZilla

A key contribution of the paper is the TabZilla Benchmark Suite, encapsulating the 36 "hardest" datasets identified through the extensive analysis. This suite is tailored to challenge and evaluate new tabular algorithms effectively. It is supported by an open-source codebase and well-documented metafeatures, providing a comprehensive toolkit for further research and practical application in the domain of tabular data.

Conclusion and Future Directions

This paper elevates the discourse on the neural networks vs. gradient-boosted decision trees debate by providing an extensive empirical foundation. By highlighting the nuanced dependencies of algorithmic performance on dataset characteristics and demonstrating the relative importance of hyperparameter tuning, it serves as a guide for both future research directions and practical applications in machine learning for tabular data.

The findings also lay groundwork for exploring more efficient neural network architectures for tabular applications, considering data irregularities and size. Additionally, the introduction of TabZilla opens avenues for developing more generalizable models capable of handling the complexity and diversity of real-world datasets.

In summary, the conversation shifts from a binary choice between NNs and GBDTs to a more informed selection process that considers dataset specifics, making strides towards more nuanced and effective machine learning in the realm of tabular data.
