Revisiting Random Weight Perturbation for Efficiently Improving Generalization (2404.00357v1)
Abstract: Improving the generalization ability of modern deep neural networks (DNNs) is a fundamental challenge in machine learning. Two branches of methods have been proposed to seek flat minima and improve generalization: one led by sharpness-aware minimization (SAM) minimizes the worst-case neighborhood loss through adversarial weight perturbation (AWP), and the other minimizes the expected Bayes objective with random weight perturbation (RWP). While RWP offers advantages in computation and is closely linked to AWP on a mathematical basis, its empirical performance has consistently lagged behind that of AWP. In this paper, we revisit the use of RWP for improving generalization and propose improvements from two perspectives: i) the trade-off between generalization and convergence and ii) the random perturbation generation. Through extensive experimental evaluations, we demonstrate that our enhanced RWP methods improve generalization more efficiently, particularly in large-scale problems, while also offering comparable or even superior performance to SAM. The code is released at https://github.com/nblt/mARWP.
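To make the contrast between the two branches concrete, the sketch below shows a single AWP/SAM update (two forward-backward passes: one to build the adversarial perturbation, one to compute the update gradient at the perturbed weights) next to a single RWP update (one forward-backward pass at randomly perturbed weights), which is where RWP's computational advantage comes from. This is a minimal PyTorch-style illustration, not the released mARWP implementation; the function names and the `rho` and `sigma` values are assumptions chosen for the example.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One AWP/SAM step (sketch): perturb weights along the gradient direction,
    then descend using the gradient taken at the perturbed point.
    Costs two forward-backward passes per update."""
    # First pass: gradient at the current weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    grad_norm = torch.norm(torch.stack(
        [g.norm() for g in grads if g is not None]))
    # Adversarial (worst-case) weight perturbation of radius rho.
    eps = []
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            e = torch.zeros_like(p) if g is None else rho * g / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    # Second pass: gradient at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    # Restore the original weights, then update with the perturbed gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    base_opt.step()


def rwp_step(model, loss_fn, x, y, base_opt, sigma=0.01):
    """One RWP step (sketch): add Gaussian noise to the weights, take the
    gradient there, then update the clean weights. Costs a single
    forward-backward pass, the same as plain SGD."""
    noise = []
    with torch.no_grad():
        for p in model.parameters():
            e = sigma * torch.randn_like(p)   # random, not adversarial
            p.add_(e)
            noise.append(e)
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), noise):
            p.sub_(e)
    base_opt.step()
```

In this form, an RWP step costs about as much as a plain SGD step, while a SAM step costs roughly twice as much per iteration; this efficiency gap is the starting point for the improved RWP variants studied in the paper.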