
Understanding the Role of Optimization in Double Descent

(2312.03951)
Published Dec 6, 2023 in cs.LG and stat.ML

Abstract

The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice (Belkin et al., 2018). Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.

Overview

  • The paper discusses the model-wise double descent phenomenon in machine learning, where a model's error rate exhibits a non-monotonic relationship with increasing complexity.

  • It explores how various optimization factors such as initialization, learning rates, and batch sizes interrelate and influence the severity of the double descent curve.

  • The condition number of the optimization problem is identified as a critical determinant in the optimization process, affecting the double descent phenomenon.

  • Empirical studies reveal that double descent is less prevalent in practical applications due to well-tuned models and regularizing techniques.

  • The research suggests that the duration of training is a key factor in the occurrence of double descent and has implications for future theoretical work.

Introduction to Model-wise Double Descent

The phenomenon of model-wise double descent in machine learning refers to the counterintuitive situation where a model's test error first decreases, then increases as the model approaches the interpolation threshold, and finally decreases again as model size grows further. This behavior challenges classical theories of generalization and has garnered significant interest.

Optimization's Impact on Double Descent

This research explores the phenomenon from the perspective of optimization. It argues that factors often viewed as separate contributors, such as model initialization, learning rate, feature scaling, normalization, batch size, and the choice of optimization algorithm, are in fact unified through their effect on optimization. Each of these factors directly or indirectly influences the condition number of the optimization problem or of the optimizer. The condition number, the ratio of the largest to the smallest singular value of the feature matrix, determines how easily the optimizer can reach a sufficiently low-loss minimum, and thereby governs the height of the double descent peak.
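
As a rough, self-contained illustration (not code from the paper), the sketch below estimates the condition number of a ReLU random feature matrix as the width varies. The data dimensions, widths, and feature map are arbitrary assumptions; the qualitative point is that the smallest singular value collapses when the width is close to the number of samples, which is where the condition number, and the double descent peak, is largest.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# condition number of a ReLU random feature matrix as the width changes.
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_input = 200, 20
X = rng.standard_normal((n_samples, d_input))

for width in [50, 100, 200, 400, 800]:
    W = rng.standard_normal((d_input, width)) / np.sqrt(d_input)  # fixed random first layer
    Phi = np.maximum(X @ W, 0.0)                                  # ReLU random features
    s = np.linalg.svd(Phi, compute_uv=False)                      # singular values, descending
    cond = s[0] / s[-1]                                           # largest / smallest singular value
    print(f"width={width:4d}  condition number={cond:.2e}")
```

The condition number is typically largest when the width matches the number of samples, which is exactly where a hard-to-solve least-squares problem and the double descent peak coincide.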

Empirical Observations and Implications for Real-World Application

The study's experiments, using controlled setups on random feature models and two-layer neural networks under various optimization settings, demonstrate that double descent does not always manifest and is unlikely to be a problem in practical applications. Real-world machine learning models are usually tuned on validation sets, and common regularization techniques often suppress the peak. Moreover, a strong double descent peak typically requires far more training iterations than practitioners run once a model has converged.
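
For concreteness, here is a minimal, hypothetical version of such a model-wise sweep: a ReLU random feature model fit with the minimum-norm least-squares solution on synthetic data. The teacher function, noise level, and widths are assumptions made for illustration, but the sweep shows the characteristic rise in test error when the number of features is close to the number of training samples.

```python
# Minimal sketch (toy data and model, not the paper's experiments):
# model-wise sweep of a ReLU random feature regression fit by the
# minimum-norm least-squares solution.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 100, 1000, 10

def make_data(n):
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)  # toy target + noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for width in [10, 25, 50, 75, 100, 150, 300, 1000]:
    W = rng.standard_normal((d, width)) / np.sqrt(d)    # fixed random first layer
    f_tr = np.maximum(X_tr @ W, 0.0)                    # ReLU random features (train)
    f_te = np.maximum(X_te @ W, 0.0)                    # ReLU random features (test)
    beta = np.linalg.pinv(f_tr) @ y_tr                  # minimum-norm least-squares fit
    test_mse = np.mean((f_te @ beta - y_te) ** 2)
    print(f"width={width:4d}  test MSE={test_mse:.3f}")
```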

Exploring the Underlying Causes and Solutions

Further investigation shows that when a given training setup does not display double descent, letting the training process run much longer allows the peak to emerge. This indicates that training duration is a simple yet significant factor behind the occurrence, or absence, of double descent in certain settings. The analysis underscores the importance of optimization details in understanding double descent and opens the way for future theoretical work from this perspective.
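
The following sketch illustrates this training-duration effect under the same toy assumptions as above (it is not the paper's exact experimental setup). Gradient descent is run on a random feature model sized near the interpolation threshold; with a modest number of steps the solution is implicitly regularized, while running far longer drives it toward the minimum-norm interpolator, where the peak tends to become visible as a rise in test error.

```python
# Minimal sketch (illustration only): test error of gradient descent on a
# random feature model near the interpolation threshold, checked at
# increasing iteration counts.
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, d, width = 100, 1000, 10, 110   # width close to n_train

X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.standard_normal(n_train)
y_te = np.sin(X_te[:, 0]) + 0.1 * rng.standard_normal(n_test)

W = rng.standard_normal((d, width)) / np.sqrt(d)
f_tr, f_te = np.maximum(X_tr @ W, 0.0), np.maximum(X_te @ W, 0.0)

beta = np.zeros(width)
lr = n_train / np.linalg.norm(f_tr, 2) ** 2      # stable step size (1 / Lipschitz constant)
checkpoints = {100, 1_000, 10_000, 100_000}
for step in range(1, 100_001):
    grad = f_tr.T @ (f_tr @ beta - y_tr) / n_train   # gradient of mean squared error / 2
    beta -= lr * grad
    if step in checkpoints:
        test_mse = np.mean((f_te @ beta - y_te) ** 2)
        print(f"steps={step:7d}  test MSE={test_mse:.3f}")
```

Since gradient descent initialized at zero converges to the minimum-norm solution, the later checkpoints approach the interpolating fit whose test error peaks near the threshold, matching the paper's observation that longer training can surface the peak.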
