
A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning

(arXiv:2310.18988)
Published Oct 29, 2023 in stat.ML and cs.LG

Abstract

Conventional statistical wisdom established a well-understood relationship between model complexity and prediction error, typically presented as a U-shaped curve reflecting a transition between under- and overfitting regimes. However, motivated by the success of overparametrized neural networks, recent influential work has suggested this theory to be generally incomplete, introducing an additional regime that exhibits a second descent in test error as the parameter count p grows past sample size n - a phenomenon dubbed double descent. While most attention has naturally been given to the deep-learning setting, double descent was shown to emerge more generally across non-neural models: known cases include linear regression, trees, and boosting. In this work, we take a closer look at evidence surrounding these more classical statistical machine learning methods and challenge the claim that observed cases of double descent truly extend the limits of a traditional U-shaped complexity-generalization curve therein. We show that once careful consideration is given to what is being plotted on the x-axes of their double descent plots, it becomes apparent that there are implicitly multiple complexity axes along which the parameter count grows. We demonstrate that the second descent appears exactly (and only) when and where the transition between these underlying axes occurs, and that its location is thus not inherently tied to the interpolation threshold p=n. We then gain further insight by adopting a classical nonparametric statistics perspective. We interpret the investigated methods as smoothers and propose a generalized measure for the effective number of parameters they use on unseen examples, using which we find that their apparent double descent curves indeed fold back into more traditional convex shapes - providing a resolution to tensions between double descent and statistical intuition.

Figure: Decomposing double descent in Random Fourier Features regression, adapted from Belkin et al. (2019).

Overview

  • This paper critically evaluates the double descent phenomenon in statistical machine learning, focusing on classical methods like trees, boosting, and linear regression.

  • It identifies that double descent does not stem from increased model complexity per se, but rather from shifts in underlying parameter augmentation methods and complexity axes.

  • The concept of an 'effective number of parameters' is introduced as a measure of model complexity that better explains generalization, under which the observed error curves fold back into traditional U shapes.

  • The conclusions suggest that the evidence for double descent in non-deep learning models can be understood within the existing framework of the bias-variance tradeoff when considering implicit complexity axes.

Rethinking Double Descent and the Role of Model Complexity in Statistical Learning

Introduction

Recent discussions within the machine learning community have focused on the phenomenon known as double descent, in which test error, after first tracing the classical U-shaped curve, descends a second time as model complexity continues to increase past the interpolation threshold. This behavior appears to challenge long-established beliefs regarding the trade-off between bias and variance in machine learning models. This paper, titled "A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning," critically evaluates the evidence for double descent in classical statistical machine learning methods, such as trees, boosting, and linear regression, through a detailed examination of experimental results and theoretical insights from recent works.

Revisiting the Evidence for Double Descent

Trees and Boosting

The paper's first section scrutinizes the evidence for double descent in decision trees and gradient boosting. The analyses show that the apparent second decrease in test error as model complexity increases can be deconstructed by considering two distinct axes of model complexity concurrently. In both trees and boosting, the phenomenon is tied to the transition from one mechanism for growing complexity (increasing tree depth, or the number of boosting rounds) to another (increasing ensemble size). These findings indicate that double descent arises not from increased model complexity per se but from a shift in the underlying parameter augmentation mechanism.
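To make the two-axes decomposition concrete, the following minimal sketch sweeps tree complexity in two stages: first growing a single tree's leaf count, then growing the ensemble at full depth. The dataset, the grid values, and the use of total leaf count as the "raw parameter" proxy are illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch of the two-axes experiment for trees (illustrative, not the
# paper's exact configuration): raw parameter count grows throughout,
# but the mechanism switches from deeper trees to larger ensembles.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

results = []  # (raw_parameter_proxy, test_mse, axis_label)

# Axis 1: grow a single tree by allowing more leaves.
for leaves in [2, 4, 8, 16, 32, 64, 128, len(X_tr)]:
    model = RandomForestRegressor(n_estimators=1, max_leaf_nodes=leaves,
                                  bootstrap=False, random_state=0)
    model.fit(X_tr, y_tr)
    total_leaves = sum(est.tree_.n_leaves for est in model.estimators_)
    results.append((total_leaves,
                    mean_squared_error(y_te, model.predict(X_te)), "depth"))

# Axis 2: keep trees fully grown and instead grow the ensemble.
for n_trees in [2, 5, 10, 20, 50]:
    model = RandomForestRegressor(n_estimators=n_trees, max_leaf_nodes=None,
                                  bootstrap=True, random_state=0)
    model.fit(X_tr, y_tr)
    total_leaves = sum(est.tree_.n_leaves for est in model.estimators_)
    results.append((total_leaves,
                    mean_squared_error(y_te, model.predict(X_te)), "ensemble"))

for p, mse, axis in results:
    print(f"{axis:>8}  raw params ~ {p:>6}  test MSE = {mse:.3f}")
```

Plotted against total leaf count alone, these points trace an apparent double descent; separating the two loops recovers a U-shaped curve on the first axis and a monotone variance-reduction curve on the second.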

Linear Regression with Random Fourier Features

The section dedicated to linear regression with Random Fourier Features (RFF) explores how growing the parameter count beyond the dataset size does not inherently increase model complexity. The observed double descent behavior is linked to the mixing of two mechanisms that grow the raw parameter count in disparate ways: direct feature augmentation below the interpolation threshold, and an implicit, unsupervised form of dimensionality reduction induced by the minimum-norm solution above it. Separating these complexity axes resolves the apparent conflict with traditional statistical learning principles, revealing that beyond a certain threshold, increases in raw parameter count do not equate to increases in model complexity.
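The underlying experiment is easy to reproduce in outline. The sketch below fits minimum-norm least squares on random Fourier features and sweeps the feature count p past the sample size n; the data-generating process, feature scale, and grid are illustrative assumptions in the spirit of Belkin et al. (2019), not the paper's exact setup.

```python
# Sketch of min-norm RFF regression (illustrative assumptions): test
# error typically peaks near the interpolation threshold p = n and
# descends again as p grows further.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X_tr = rng.normal(size=(n, d))
y_tr = np.sin(X_tr.sum(axis=1)) + 0.1 * rng.normal(size=n)
X_te = rng.normal(size=(1000, d))
y_te = np.sin(X_te.sum(axis=1))

def rff(X, W, b):
    """Map inputs to p random Fourier features sqrt(2/p) * cos(XW + b)."""
    return np.sqrt(2.0 / W.shape[1]) * np.cos(X @ W + b)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    W = rng.normal(size=(d, p))              # random frequencies
    b = rng.uniform(0, 2 * np.pi, size=p)    # random phases
    Phi_tr, Phi_te = rff(X_tr, W, b), rff(X_te, W, b)
    # Minimum-norm least squares: pinv handles both p <= n and p > n.
    beta = np.linalg.pinv(Phi_tr) @ y_tr
    mse = np.mean((Phi_te @ beta - y_te) ** 2)
    print(f"p = {p:>5}  test MSE = {mse:.3f}")
```

The pseudoinverse is what quietly changes the game past p = n: among the infinitely many interpolating solutions, it selects the one of minimum norm, which is the implicit regularization the paper identifies as a second, distinct complexity axis.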

A Nonparametric Statistics Perspective

Adopting a classical nonparametric statistics viewpoint, the authors reinterpret the methods under consideration as smoothers. They propose a generalized notion of the effective number of parameters to measure model complexity with respect to unseen data. Unlike the raw parameter count, this measure reveals that actual model complexity does not increase in what was previously considered the overparameterized regime. This insight systematically folds the double descent curves back into traditional U-shaped generalization curves when plotted against this more appropriate measure of complexity.
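For linear smoothers, the classical version of this idea is concrete: with in-sample predictions ŷ = S y, the effective number of parameters is tr(S). The paper's contribution is a generalization of this quantity to unseen inputs; the sketch below illustrates only the classical trace on training data, showing that a minimum-norm fit with p >> n raw parameters can use at most n effective ones.

```python
# Sketch of the classical effective-parameter count for a linear
# smoother (the paper's generalized measure for unseen inputs differs;
# this only illustrates the in-sample trace).
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5000
Phi_tr = rng.normal(size=(n, p))             # overparametrized design, p >> n

# Min-norm regression is a linear smoother: y_hat = S @ y with
# S = Phi @ pinv(Phi), a projection onto the column space of Phi.
S = Phi_tr @ np.linalg.pinv(Phi_tr)
eff_params = np.trace(S)

# trace(S) = rank(Phi_tr) <= n, no matter how large p grows.
print(f"raw parameters: {p}, effective parameters: {eff_params:.1f}")
```

This is the sense in which raw parameter counts overstate complexity: past the interpolation threshold, adding columns to the design cannot raise the trace above n, so the x-axis of a double descent plot stops tracking the quantity that actually governs generalization.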

Implications and Future Directions

The work concludes that the previously reported experimental evidence for double descent in non-deep-learning models can be fully explained within the existing U-shaped bias-variance tradeoff framework once both the implicit complexity axes and the effective number of parameters are taken into account. Practical implications include new avenues for model selection criteria that better capture the effect of model complexity on generalization. The paper also speculates that similar principles of parameter counting and complexity axes may help explain double descent phenomena in deep learning architectures.

Conclusion

The investigation into the 'double descent' phenomenon in classical machine learning methods reveals it to be an artifact of transitioning between different models or complexity-augmentation mechanisms, rather than a fundamental challenge to established learning theory. By decoupling raw parameter counts from model complexity and adopting a smoother-based viewpoint, traditional statistical intuitions about model generalization are not only preserved but enriched. This reevaluation encourages a more nuanced understanding of model complexity, emphasizing the critical distinction between raw parameters and effective parameters when assessing the generalization capabilities of learning algorithms.
