On the Convergence Proof of AMSGrad and a New Version

Published 7 Apr 2019 in cs.LG, math.OC, and stat.ML | (1904.03590v4)

Abstract: The adaptive moment estimation algorithm Adam (Kingma and Ba) is a popular optimizer in the training of deep neural networks. However, Reddi et al. have recently shown that the convergence proof of Adam is problematic and proposed a variant of Adam called AMSGrad as a fix. In this paper, we show that the convergence proof of AMSGrad is also problematic. Concretely, the problem in the convergence proof of AMSGrad is in handling the hyper-parameters, treating them as equal while they are not. This is also the neglected issue in the convergence proof of Adam. We provide an explicit counter-example of a simple convex optimization setting to show this neglected issue. Depending on manipulating the hyper-parameters, we present various fixes for this issue. We provide a new convergence proof for AMSGrad as the first fix. We also propose a new version of AMSGrad called AdamX as another fix. Our experiments on the benchmark dataset also support our theoretical results.

Authors (2)

Citations (79)

Summary

The paper presents a new convergence proof for AMSGrad by adjusting hyper-parameter decay to ensure average regret converges to zero.
It introduces AdamX, a new optimizer that modifies how AMSGrad tracks the maximum of the squared-gradient averages so that the components the proof relies on stay positive throughout training.
Empirical tests on benchmarks like CIFAR-10 with ResNet models demonstrate AdamX’s reliability and competitive performance against AMSGrad.

Analysis of Convergence Issues in AMSGrad and Introduction of AdamX

The paper addresses convergence issues in the AMSGrad optimizer, a well-known variant of Adam that is widely used to train deep neural networks. It builds on the earlier critique by Reddi et al., who showed that the original convergence proof of Adam is flawed and proposed AMSGrad as a fix. The authors show that the convergence proof of AMSGrad is itself problematic: it treats certain hyper-parameters as equal when they are not, an issue that was also neglected in the convergence proof of Adam.
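For orientation, the AMSGrad update that the proof concerns can be sketched as follows. This is a minimal scalar sketch, not the authors' code: the function names, the default values, the small `eps` term, and the $\alpha/\sqrt{t}$ step size are assumptions made for illustration, and the projection step used in the analysis is omitted.

```python
import math


def amsgrad_step(x, grad, state, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update on a scalar parameter x at step t (t starts at 1)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * state["m"] + (1.0 - beta1) * grad
    v = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # AMSGrad keeps the running maximum of v, so v_hat never decreases.
    v_hat = max(state["v_hat"], v)
    alpha_t = alpha / math.sqrt(t)  # assumed step-size schedule
    new_x = x - alpha_t * m / (math.sqrt(v_hat) + eps)  # eps is an assumption
    return new_x, {"m": m, "v": v, "v_hat": v_hat}


def minimize_quadratic(steps=500):
    """Run the sketch on f(x) = x**2 from x = 1.0; return x and the v_hat history."""
    x, state, v_hats = 1.0, {"m": 0.0, "v": 0.0, "v_hat": 0.0}, []
    for t in range(1, steps + 1):
        x, state = amsgrad_step(x, 2.0 * x, state, t)  # gradient of x**2 is 2x
        v_hats.append(state["v_hat"])
    return x, v_hats
```

Calling `minimize_quadratic()` drives `x` toward the minimizer at zero, and the recorded `v_hat` values never decrease, which is the property that distinguishes AMSGrad from Adam.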

The authors trace how this issue arises in AMSGrad's convergence proof and give an explicit counter-example in a simple convex optimization setting that makes the neglected step concrete. Depending on how the hyper-parameters are handled, they then present several fixes.
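The guarantee at stake is a regret bound. In the standard online optimization setting (the notation $f_t$, $x_t$, and $\mathcal{F}$ is assumed here for illustration and may differ from the paper's), the regret after $T$ rounds is

$$R(T) = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{F}} \sum_{t=1}^{T} f_t(x),$$

where $f_t$ is the convex loss revealed at round $t$, $x_t$ is the iterate chosen by the optimizer, and $\mathcal{F}$ is the feasible set. An algorithm converges in this sense when the average regret $R(T)/T$ tends to zero as $T$ grows.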

Three primary contributions are detailed in the paper:

New Convergence Proof for AMSGrad: The authors give a new convergence proof for AMSGrad that holds under special parameter conditions. The proof requires either an exponentially decaying schedule for the hyper-parameter $\beta_{1,t}$ or a specific inverse-time scaling, which avoids the flawed handling of these parameters in the original argument. Under these settings, the average regret satisfies the convergence criterion $R(T)/T \rightarrow 0$.
Introduction of AdamX: As a fix that also covers a general parameter schedule, the authors propose a new optimizer, AdamX. It keeps AMSGrad’s framework but modifies the maximum-tracking step for the squared-gradient averages. This change ensures that the components the proof relies on always remain positive, which removes the flaw in the convergence proof. In benchmark tests AdamX performs similarly to AMSGrad while coming with a rigorous convergence guarantee.
Empirical Evidence: The paper includes experiments that support the theoretical findings. Tests of AMSGrad and AdamX on benchmark datasets such as CIFAR-10 with ResNet models indicate that AdamX is reliable and performs comparably to AMSGrad.
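The first two contributions can be illustrated with a small numerical sketch. Everything concrete here is an assumption rather than a statement of the paper's exact algorithm: the schedule forms $\beta_1 \lambda^{t-1}$ and $\beta_1 / t$, the rescaling factor $((1-\beta_{1,t})/(1-\beta_{1,t-1}))^2$ applied before the maximum, the function names, and the reading of "remain positive" as "the quantity $\sqrt{\hat v_t}/(1-\beta_{1,t})$ does not decrease from one step to the next". These should be checked against the paper before being relied on.

```python
import math


def beta1_exponential(t, beta1=0.9, lam=0.99):
    """Assumed exponentially decaying schedule: beta1 * lam**(t - 1), t >= 1."""
    return beta1 * lam ** (t - 1)


def beta1_inverse_time(t, beta1=0.9):
    """Assumed inverse-time schedule: beta1 / t, t >= 1."""
    return beta1 / t


def track_ratio(grads, schedule, rescale, beta2=0.999):
    """Return the sequence sqrt(v_hat_t) / (1 - beta_{1,t}) for scalar gradients.

    rescale=False uses the plain AMSGrad maximum. rescale=True multiplies the
    previous v_hat by ((1 - beta_{1,t}) / (1 - beta_{1,t-1}))**2 before taking
    the maximum (the assumed form of the AdamX-style rule).
    """
    v, v_hat, b1_prev, ratios = 0.0, 0.0, None, []
    for t, g in enumerate(grads, start=1):
        b1_t = schedule(t)
        v = beta2 * v + (1.0 - beta2) * g * g
        if t == 1:
            v_hat = v  # first step: nothing to compare against yet
        else:
            scale = ((1.0 - b1_t) / (1.0 - b1_prev)) ** 2 if rescale else 1.0
            v_hat = max(scale * v_hat, v)
        ratios.append(math.sqrt(v_hat) / (1.0 - b1_t))
        b1_prev = b1_t
    return ratios


# One large gradient followed by small ones, so v_t falls below v_hat_{t-1}.
grads = [1.0] + [0.01] * 20
plain = track_ratio(grads, beta1_exponential, rescale=False)
rescaled = track_ratio(grads, beta1_exponential, rescale=True)
```

With this gradient sequence the `plain` ratios decrease after the first step, because $1-\beta_{1,t}$ grows while the running maximum stays flat, whereas the `rescaled` ratios never decrease. The step size $\alpha_t$ is left out of the tracked quantity for brevity; dividing by a non-increasing $\alpha_t$ would not change the direction of either observation for the rescaled case.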

The work matters for both theory and practice. A sound convergence proof makes adaptive moment estimation methods more trustworthy in gradient-based optimization, and the corrected analysis may in turn help in building more stable and efficient deep learning models.

Future work might examine other decay schedules for $\beta_{1,t}$ and test the optimizer on a wider range of tasks and models. Because AdamX keeps the structure of AMSGrad, it could plausibly be applied to new architectures and training methods without major changes.