
Understanding the Role of Optimization in Double Descent

(2312.03951)
Published Dec 6, 2023 in cs.LG and stat.ML

Abstract

The phenomenon of model-wise double descent, where the test error peaks and then reduces as the model size increases, is an interesting topic that has attracted the attention of researchers due to the striking observed gap between theory and practice (Belkin et al., 2018). Additionally, while double descent has been observed in various tasks and architectures, the peak of double descent can sometimes be noticeably absent or diminished, even without explicit regularization, such as weight decay and early stopping. In this paper, we investigate this intriguing phenomenon from the optimization perspective and propose a simple optimization-based explanation for why double descent sometimes occurs weakly or not at all. To the best of our knowledge, we are the first to demonstrate that many disparate factors contributing to model-wise double descent (initialization, normalization, batch size, learning rate, optimization algorithm) are unified from the viewpoint of optimization: model-wise double descent is observed if and only if the optimizer can find a sufficiently low-loss minimum. These factors directly affect the condition number of the optimization problem or the optimizer and thus affect the final minimum found by the optimizer, reducing or increasing the height of the double descent peak. We conduct a series of controlled experiments on random feature models and two-layer neural networks under various optimization settings, demonstrating this optimization-based unified view. Our results suggest the following implication: Double descent is unlikely to be a problem for real-world machine learning setups. Additionally, our results help explain the gap between weak double descent peaks in practice and strong peaks observable in carefully designed setups.

Overview

  • The paper discusses the model-wise double descent phenomenon in machine learning, where a model's error rate exhibits a non-monotonic relationship with increasing complexity.

  • It explores how various optimization factors such as initialization, learning rates, and batch sizes interrelate and influence the severity of the double descent curve.

  • The condition number of the optimization problem is identified as a critical determinant in the optimization process, affecting the double descent phenomenon.

  • Empirical studies reveal that double descent is less prevalent in practical applications due to well-tuned models and regularizing techniques.

  • The research suggests that the duration of training is a key factor in the occurrence of double descent and has implications for future theoretical work.

Introduction to Model-wise Double Descent

The phenomenon of model-wise double descent in machine learning refers to the counterintuitive situation where a model's test error first decreases, then increases as the model approaches the interpolation threshold, and finally decreases again as model size grows further. This behavior challenges classical theories of generalization and has garnered significant interest.

Optimization's Impact on Double Descent

This research explores the phenomenon from the perspective of optimization. It argues that factors often viewed as separate contributors, such as model initialization, learning rate, feature scaling, normalization, batch size, and the choice of optimization algorithm, are in fact unified through their effect on optimization. Each of these factors directly or indirectly influences the condition number of the optimization problem or of the optimizer. The condition number, the ratio of the largest to the smallest singular value of the feature matrix, determines how easily the optimizer can reach a sufficiently low-loss minimum, and thereby governs the height of the double descent peak.
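
As a rough, self-contained illustration (not code from the paper), the sketch below estimates the condition number of a ReLU random feature matrix as the width varies. The data dimensions, widths, and feature map are arbitrary assumptions; the qualitative point is that the smallest singular value collapses when the width is close to the number of samples, which is where the condition number, and the double descent peak, is largest.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact setup):
# condition number of a ReLU random feature matrix as the width changes.
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_input = 200, 20
X = rng.standard_normal((n_samples, d_input))

for width in [50, 100, 200, 400, 800]:
    W = rng.standard_normal((d_input, width)) / np.sqrt(d_input)  # fixed random first layer
    Phi = np.maximum(X @ W, 0.0)                                  # ReLU random features
    s = np.linalg.svd(Phi, compute_uv=False)                      # singular values, descending
    cond = s[0] / s[-1]                                           # largest / smallest singular value
    print(f"width={width:4d}  condition number={cond:.2e}")
```

The condition number is typically largest when the width matches the number of samples, which is exactly where a hard-to-solve least-squares problem and the double descent peak coincide.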

Empirical Observations and Implications for Real-World Application

The study's experiments, using controlled setups on random feature models and two-layer neural networks under various optimization settings, demonstrate that double descent does not always manifest and is unlikely to be a problem in practical applications. Real-world machine learning models are usually tuned on validation sets, and common regularization techniques often suppress the peak. Moreover, a strong double descent peak typically requires far more training iterations than practitioners run once a model has converged.
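
For concreteness, here is a minimal, hypothetical version of such a model-wise sweep: a ReLU random feature model fit with the minimum-norm least-squares solution on synthetic data. The teacher function, noise level, and widths are assumptions made for illustration, but the sweep shows the characteristic rise in test error when the number of features is close to the number of training samples.

```python
# Minimal sketch (toy data and model, not the paper's experiments):
# model-wise sweep of a ReLU random feature regression fit by the
# minimum-norm least-squares solution.
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, d = 100, 1000, 10

def make_data(n):
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)  # toy target + noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for width in [10, 25, 50, 75, 100, 150, 300, 1000]:
    W = rng.standard_normal((d, width)) / np.sqrt(d)    # fixed random first layer
    f_tr = np.maximum(X_tr @ W, 0.0)                    # ReLU random features (train)
    f_te = np.maximum(X_te @ W, 0.0)                    # ReLU random features (test)
    beta = np.linalg.pinv(f_tr) @ y_tr                  # minimum-norm least-squares fit
    test_mse = np.mean((f_te @ beta - y_te) ** 2)
    print(f"width={width:4d}  test MSE={test_mse:.3f}")
```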

Exploring the Underlying Causes and Solutions

Further investigation shows that when a given training setup does not display double descent, letting the training process run much longer allows the peak to emerge. This indicates that training duration is a simple yet significant factor behind the occurrence, or absence, of double descent in certain settings. The analysis underscores the importance of optimization details in understanding double descent and opens the way for future theoretical work from this perspective.
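
The following sketch illustrates this training-duration effect under the same toy assumptions as above (it is not the paper's exact experimental setup). Gradient descent is run on a random feature model sized near the interpolation threshold; with a modest number of steps the solution is implicitly regularized, while running far longer drives it toward the minimum-norm interpolator, where the peak tends to become visible as a rise in test error.

```python
# Minimal sketch (illustration only): test error of gradient descent on a
# random feature model near the interpolation threshold, checked at
# increasing iteration counts.
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, d, width = 100, 1000, 10, 110   # width close to n_train

X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
y_tr = np.sin(X_tr[:, 0]) + 0.1 * rng.standard_normal(n_train)
y_te = np.sin(X_te[:, 0]) + 0.1 * rng.standard_normal(n_test)

W = rng.standard_normal((d, width)) / np.sqrt(d)
f_tr, f_te = np.maximum(X_tr @ W, 0.0), np.maximum(X_te @ W, 0.0)

beta = np.zeros(width)
lr = n_train / np.linalg.norm(f_tr, 2) ** 2      # stable step size (1 / Lipschitz constant)
checkpoints = {100, 1_000, 10_000, 100_000}
for step in range(1, 100_001):
    grad = f_tr.T @ (f_tr @ beta - y_tr) / n_train   # gradient of mean squared error / 2
    beta -= lr * grad
    if step in checkpoints:
        test_mse = np.mean((f_te @ beta - y_te) ** 2)
        print(f"steps={step:7d}  test MSE={test_mse:.3f}")
```

Since gradient descent initialized at zero converges to the minimum-norm solution, the later checkpoints approach the interpolating fit whose test error peaks near the threshold, matching the paper's observation that longer training can surface the peak.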
