
    On the Last Iterate Convergence of Momentum Methods

    SGD with Momentum (SGDM) is widely used for large-scale optimization of machine learning problems. Yet, the theoretical understanding of this algorithm is not complete. In fact, even the most recent results require changes to the algorithm like an averaging scheme and a projection onto a bounded domain, which are never used in practice. Also, no lower bound is known for SGDM. In this paper, we prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from an error $\Omega(\frac{\log T}{\sqrt{T}})$ after $T$ steps. Based on this fact, we study a new class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex optimization problems. Further, we show that in the interpolation setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(\frac{\log T}{T})$. Empirical results are shown as well.
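    To make the update concrete, here is a minimal sketch of SGD with momentum alongside a variant in the spirit described above, with a momentum factor that increases toward 1 and step sizes that shrink over time. The specific schedules ($\beta_t = t/(t+1)$, $\eta_t \propto 1/\sqrt{t}$) and the toy objective are illustrative assumptions, not the exact FTRL-based algorithm analyzed in the paper.

```python
import numpy as np

def sgdm(grad, x0, T, lr=0.1, beta=0.9, increasing_momentum=False):
    """SGD with momentum; optionally use a momentum factor that grows toward 1
    and step sizes that shrink over time. The schedules below are illustrative
    choices, not the exact ones analyzed in the paper."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for t in range(1, T + 1):
        g = grad(x, t)                      # stochastic (sub)gradient at x
        if increasing_momentum:
            beta_t = t / (t + 1.0)          # momentum increases toward 1
            lr_t = lr / np.sqrt(t)          # shrinking updates
        else:
            beta_t, lr_t = beta, lr         # constant momentum and step size
        m = beta_t * m + (1.0 - beta_t) * g
        x = x - lr_t * m
    return x                                # last iterate, no averaging

# Toy usage: noisy subgradients of f(x) = |x| (Lipschitz, convex, nonsmooth).
rng = np.random.default_rng(0)
grad_abs = lambda x, t: np.sign(x) + 0.1 * rng.standard_normal(x.shape)
print(sgdm(grad_abs, x0=[2.0], T=10_000, increasing_momentum=True))
```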

    Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$

    The sharpest known high probability generalization bounds for uniformly stable algorithms (Feldman, Vondr\'{a}k, 2018, 2019), (Bousquet, Klochkov, Zhivotovskiy, 2020) contain a generally inevitable sampling error term of order $\Theta(1/\sqrt{n})$. When applied to excess risk bounds, this leads to suboptimal results in several standard stochastic convex optimization problems. We show that if the so-called Bernstein condition is satisfied, the term $\Theta(1/\sqrt{n})$ can be avoided, and high probability excess risk bounds of order up to $O(1/n)$ are possible via uniform stability. Using this result, we show a high probability excess risk bound with the rate $O(\log n/n)$ for strongly convex and Lipschitz losses, valid for \emph{any} empirical risk minimization method. This resolves a question of Shalev-Shwartz, Shamir, Srebro, and Sridharan (2009). We discuss how $O(\log n/n)$ high probability excess risk bounds are possible for projected gradient descent in the case of strongly convex and Lipschitz losses without the usual smoothness assumption. Comment: 12 pages; presented at NeurIPS.
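    As a reference point for the last claim, below is a standard projected subgradient method with the $1/(\mu t)$ step size commonly used for $\mu$-strongly convex, Lipschitz losses. The projection, the step-size rule, and the toy objective are generic textbook choices assumed for illustration, not the exact procedure studied in the paper.

```python
import numpy as np

def projected_subgradient(grad, project, x0, T, mu):
    """Projected subgradient descent with the 1/(mu*t) step size commonly used
    for mu-strongly convex, Lipschitz objectives. A generic textbook sketch,
    not the exact procedure analyzed in the paper."""
    x = np.array(x0, dtype=float)
    for t in range(1, T + 1):
        g = grad(x)                      # subgradient of the loss at x
        x = project(x - g / (mu * t))    # gradient step, then project back
    return x                             # last iterate

# Toy usage: f(x) = |x - 1| + (mu/2)||x||^2 over the Euclidean unit ball.
mu = 0.5
grad_f = lambda x: np.sign(x - 1.0) + mu * x
proj_ball = lambda x: x / max(1.0, np.linalg.norm(x))   # project onto ||x|| <= 1
print(projected_subgradient(grad_f, proj_ball, x0=[0.0], T=5_000, mu=mu))
```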