- The paper introduces the Frequency Principle (F-Principle), revealing that DNNs first capture low-frequency components before progressing to higher frequencies.
- The paper employs Fourier analysis across varied datasets, architectures, and activation functions to robustly validate the F-Principle.
- The paper concludes that small initialization and early stopping enhance generalization by guiding models to prioritize low-frequency signal learning.
Analyzing the Frequency Domain Behavior of Deep Neural Networks
This paper explores the training dynamics of Deep Neural Networks (DNNs) from a frequency-domain perspective. It aims to demystify why over-parameterized DNNs, which are theoretically susceptible to overfitting, often generalize well in practice. The authors introduce a concept they term the Frequency Principle (F-Principle), which asserts that DNNs with standard settings learn the low-frequency components of the data first and capture higher-frequency components only later in training. The paper's contribution lies in empirically demonstrating the ubiquity of the F-Principle across a variety of neural network architectures, activation functions, training algorithms, and datasets.
Key Findings and Methodology
The paper applies Fourier analysis to investigate the frequency-domain behavior of DNNs during training. Examining both synthetic and real-world data (such as MNIST and CIFAR10), the authors characterize DNN optimization as first fitting the low-frequency components of the target function before tackling higher frequencies, a behavior analogous to techniques employed in numerical algorithms such as the Multigrid method for partial differential equations. A minimal sketch of this frequency-wise analysis on a synthetic 1-D target follows.
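The snippet below is an illustrative sketch of this kind of experiment, not the authors' code: it trains a small tanh MLP on a 1-D target composed of three sinusoids and tracks the relative error of each frequency component of the network output via the discrete Fourier transform. The target function, network width, optimizer, and learning rate are assumptions chosen for readability.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target: low-, mid-, and high-frequency sinusoids on a periodic grid.
n = 256
x = torch.linspace(-np.pi, np.pi, n + 1)[:-1].unsqueeze(1)
y = torch.sin(x) + torch.sin(3 * x) + torch.sin(5 * x)

model = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

target_fft = np.fft.rfft(y.squeeze().numpy())
bins = [1, 3, 5]  # DFT bins of the three target frequencies

for step in range(1, 10001):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 2000 == 0:
        out_fft = np.fft.rfft(model(x).detach().squeeze().numpy())
        # Relative error of each target frequency in the network output;
        # under the F-Principle, bin 1 should converge before bins 3 and 5.
        rel_err = [abs(out_fft[k] - target_fft[k]) / abs(target_fft[k]) for k in bins]
        print(f"step {step:5d}: rel. error at freqs 1/3/5 = "
              + ", ".join(f"{e:.2f}" for e in rel_err))
```

Running the sketch typically shows the error at bin 1 shrinking well before the errors at bins 3 and 5, which is the qualitative signature the F-Principle describes.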
The authors conducted extensive experiments using DNNs of varying depths and widths, incorporating different activation functions (tanh, ReLU) and training algorithms (gradient descent, stochastic gradient descent, Adam) to establish the prevalence of the F-Principle. Their empirical evidence suggests that the F-Principle is a robust descriptor of the DNN training process, highlighting an underlying implicit bias in how DNNs fit data in the frequency domain.
Implications and Insights
The implications of the F-Principle are significant for understanding DNN generalization. By first capturing low-frequency components, which tend to generalize better because of their simplicity and their prevalence in natural data, DNNs can sidestep overfitting, especially when the high-frequency components of the data are contaminated with noise. This also explains why early stopping, a common regularization technique, often improves a DNN's generalization: it halts training after the salient low-frequency components have been fitted but before the model starts fitting high-frequency noise. The sketch below illustrates this effect on a noisy 1-D regression problem.
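This sketch is an assumed toy setup, not an experiment from the paper: the training labels are a smooth low-frequency signal plus noise, while the held-out target is the clean signal. Because the network fits low frequencies first, the error against the clean signal typically bottoms out before training converges; stopping at that point avoids fitting the noise. Network size, noise level, and step counts are illustrative choices.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

x_train = torch.linspace(-1, 1, 64).unsqueeze(1)
clean = torch.sin(2 * np.pi * x_train)            # low-frequency signal
y_train = clean + 0.3 * torch.randn_like(clean)   # noisy training labels

model = nn.Sequential(nn.Linear(1, 500), nn.Tanh(), nn.Linear(500, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

best_step, best_err = 0, float("inf")
for step in range(1, 20001):
    opt.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    opt.step()
    if step % 500 == 0:
        # Error against the clean signal: a proxy for generalization error.
        err = loss_fn(model(x_train), clean).item()
        if err < best_err:
            best_step, best_err = step, err

print(f"error vs. clean signal was lowest at step {best_step} "
      f"(the early-stopping point), err = {best_err:.4f}")
```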
Moreover, the paper suggests that the initialization of DNN parameters plays a critical role in the manifestation of the F-Principle. Networks initialized with small weight values adhere more strongly to the F-Principle and achieve better generalization than those initialized with larger values. Small initializations produce networks whose initial outputs are smooth, low-frequency functions, so training naturally proceeds from low to high frequencies. The sketch below compares the output spectrum at initialization for small and large weight scales.
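The following sketch is an assumed illustration, not the paper's protocol: it builds the same tanh MLP with two different initialization scales and measures how much of the output's spectral energy sits above a (hypothetical) cutoff bin. The smaller scale should yield an initial function with almost no high-frequency content.

```python
import numpy as np
import torch
import torch.nn as nn

def make_mlp(scale, seed=0):
    """Tanh MLP with all weights/biases drawn i.i.d. N(0, scale^2)."""
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1))
    with torch.no_grad():
        for p in net.parameters():
            p.normal_(0.0, scale)
    return net

x = torch.linspace(-1, 1, 256).unsqueeze(1)
for scale in (0.05, 2.0):          # small vs. large init (illustrative values)
    out = make_mlp(scale)(x).detach().squeeze().numpy()
    spectrum = np.abs(np.fft.rfft(out - out.mean()))
    # Fraction of spectral energy above DFT bin 10 (arbitrary cutoff).
    high_freq_share = spectrum[10:].sum() / max(spectrum.sum(), 1e-12)
    print(f"init scale {scale}: high-frequency share of initial output = "
          f"{high_freq_share:.3f}")
```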
Future Directions
The findings offer several promising directions for future research. One avenue involves developing theoretical models that more deeply analyze the F-Principle, potentially shedding light on the complex landscape of non-convex optimization in DNNs. Additionally, integrating the F-Principle into the design of novel training regimes or architectures might enhance their efficiency and robustness. Exploring the F-Principle's implications in other machine learning paradigms, such as meta-learning and transfer learning, could also yield valuable insights, particularly in understanding how pre-trained models adjust to new tasks.
In conclusion, this paper positions the Frequency Principle as a useful lens through which to view and understand generalization in DNNs. While industry and academia have moved toward larger, more complex models, understanding fundamental principles like the F-Principle may unlock new strategies for model training, ultimately leading to more reliable and efficient neural network deployments across applications.