Emergent Mind

When Do Neural Nets Outperform Boosted Trees on Tabular Data?

(2305.02997)
Published May 4, 2023 in cs.LG , cs.AI , and stat.ML

Abstract

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

The work comprises a large-scale study on tabular data, an analysis of algorithms and metafeatures, and the release of the TabZilla benchmark datasets.

Overview

  • The paper conducts a comprehensive comparison between neural networks (NNs) and gradient-boosted decision trees (GBDTs) on tabular data, analyzing the conditions under which each model performs best.

  • Through an evaluation of 19 algorithms across 176 datasets, it finds that no single algorithm universally outperforms the others, and highlights the strong performance of TabPFN, a prior-data fitted network.

  • It identifies dataset characteristics that influence the performance of NNs and GBDTs, suggesting GBDTs are better for large or irregular datasets.

  • Introduces TabZilla, a benchmark suite for challenging and evaluating tabular algorithms, alongside discussing the implications for future research and practical applications.

Comprehensive Analysis of Neural Nets vs. Boosted Trees for Tabular Data

Overview

The paper undertakes a comprehensive study comparing the performance of neural networks (NNs) and gradient-boosted decision trees (GBDTs) on tabular data. The relative merits of NNs and GBDTs have long been debated, producing parallel streams of research advocating for each approach. This paper probes the core of that debate by analyzing the specific conditions under which each method excels or underperforms. Leveraging an analysis spanning 19 algorithms across 176 datasets, the largest study of its kind, the findings suggest the prevailing "NN vs. GBDT" debate may be overstated: for many datasets, light hyperparameter tuning on a GBDT matters more than the choice between NNs and GBDTs.
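The "light tuning" finding can be made concrete with a minimal random-search loop of the kind such tuning implies. This is a sketch, not the paper's exact protocol: the parameter names (learning_rate, max_depth, n_estimators) mirror common GBDT libraries but are illustrative assumptions, and the evaluate callback stands in for cross-validated scoring of a real model.

```python
import random

# Hypothetical GBDT search space; names mirror common libraries
# (XGBoost, LightGBM) but the ranges here are illustrative only.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-3, 0),
    "max_depth": lambda: random.randint(2, 10),
    "n_estimators": lambda: random.choice([100, 200, 500]),
}

def sample_config():
    """Draw one configuration from the search space."""
    return {name: draw() for name, draw in SPACE.items()}

def light_random_search(evaluate, n_trials=10, seed=0):
    """Evaluate a handful of sampled configs and keep the best one.

    `evaluate` maps a config dict to a score (higher is better),
    e.g. mean cross-validation accuracy of a GBDT fit with that config.
    """
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config()
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice `evaluate` would train and score an actual GBDT; even a budget of 10 to 30 trials counts as the "light" tuning the paper finds so impactful.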

Algorithmic Performance Comparisons

The paper initiates its comparative study by evaluating 19 distinct algorithms, including three GBDTs, eleven neural networks, and five baseline models across a swath of tabular datasets. It unveils two particularly interesting insights:

  1. No individual algorithm domination: No single algorithm emerged as a clear winner across all datasets. Even baseline models performed exceptionally well on some datasets, highlighting that often, a well-tuned simple model might suffice.
  2. TabPFN's notable performance: Among the evaluated algorithms, TabPFN, a prior-data fitted network, showed remarkable performance. Although it is effectively limited to training sets of size 3000, it achieved the best average results even on large datasets by randomly sampling 3000 data points for training.
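The subsampling step behind point 2 is simple enough to sketch. Below is a minimal version assuming the training data is held as parallel Python lists; the 3000-row cap comes from the paper, while the function name and interface are hypothetical, not TabPFN's actual API.

```python
import random

def subsample_for_tabpfn(X, y, max_rows=3000, seed=0):
    """Randomly downsample (X, y) to at most `max_rows` rows.

    TabPFN's usable training-set size is effectively capped (~3000 rows
    in the paper), so larger datasets are subsampled before fitting.
    Sampling is without replacement and keeps X/y rows paired.
    """
    if len(X) <= max_rows:
        return X, y
    rng = random.Random(seed)
    idx = rng.sample(range(len(X)), max_rows)
    return [X[i] for i in idx], [y[i] for i in idx]
```

The fixed seed makes the subsample reproducible across runs, which matters when comparing algorithms on the same downsampled split.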

The Impact of Dataset Characteristics

An in-depth meta-feature analysis deciphered dataset characteristics that influence the suitability of NNs or GBDTs. Findings reveal that:

  • GBDTs excel over neural networks for handling irregular datasets with skewed or heavy-tailed distributions.
  • Larger datasets and datasets with a high size-to-features ratio tend to favor GBDTs.

These findings offer critical guidance for practitioners and researchers in choosing the right algorithm based on the distinctive attributes of their datasets.
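As a concrete illustration of one such metafeature, the sketch below computes per-feature sample skewness and flags heavily skewed columns. The 2.0 threshold and the function names are hypothetical choices for illustration, not values taken from the paper.

```python
import math

def skewness(values):
    """Sample skewness (Fisher-Pearson): third central moment over
    the 1.5 power of the variance. Zero for symmetric distributions;
    large in magnitude for skewed or heavy-tailed ones."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / (m2 ** 1.5) if m2 > 0 else 0.0

def flag_irregular_features(columns, threshold=2.0):
    """Return indices of columns whose |skewness| exceeds a
    (hypothetical) threshold; per the paper's findings, datasets with
    many such features tend to favor GBDTs over NNs."""
    return [i for i, col in enumerate(columns)
            if abs(skewness(col)) > threshold]
```

A practitioner could run such a check before model selection: a dataset dominated by flagged columns is, by the paper's analysis, a candidate for a GBDT rather than a neural network.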

Introducing TabZilla

A key contribution of the paper is the TabZilla Benchmark Suite, encapsulating the 36 "hardest" datasets identified through the extensive analysis. This suite is tailored to challenge and evaluate new tabular algorithms effectively. It is supported by an open-source codebase and well-documented metafeatures, providing a comprehensive toolkit for further research and practical application in the domain of tabular data.

Conclusion and Future Directions

This paper elevates the discourse on the neural networks vs. gradient-boosted decision trees debate by providing an extensive empirical foundation. By highlighting the nuanced dependencies of algorithmic performance on dataset characteristics and demonstrating the relative importance of hyperparameter tuning, it serves as a guide for both future research directions and practical applications in machine learning for tabular data.

The findings also lay groundwork for exploring more efficient neural network architectures for tabular applications, considering data irregularities and size. Additionally, the introduction of TabZilla opens avenues for developing more generalizable models capable of handling the complexity and diversity of real-world datasets.

In summary, the conversation shifts from a binary choice between NNs and GBDTs to a more informed selection process that considers dataset specifics, making strides towards more nuanced and effective machine learning in the realm of tabular data.
