
    Enhanced Adaptive Gradient Algorithms for Nonconvex-PL Minimax Optimization

    In this paper, we study a class of nonconvex-nonconcave minimax optimization problems (i.e., $\min_x\max_y f(x,y)$), where $f(x,y)$ is possibly nonconvex in $x$, and is nonconcave but satisfies the Polyak-Lojasiewicz (PL) condition in $y$. Moreover, we propose a class of enhanced momentum-based gradient descent ascent methods (i.e., MSGDA and AdaMSGDA) to solve these stochastic nonconvex-PL minimax problems. In particular, our AdaMSGDA algorithm can use various adaptive learning rates in updating the variables $x$ and $y$ without relying on any global or coordinate-wise adaptive learning rates. Theoretically, we present an effective convergence analysis framework for our methods. Specifically, we prove that our MSGDA and AdaMSGDA methods attain the best known sample (gradient) complexity of $O(\epsilon^{-3})$, requiring only one sample at each loop, in finding an $\epsilon$-stationary solution (i.e., $\mathbb{E}\|\nabla F(x)\|\leq \epsilon$, where $F(x)=\max_y f(x,y)$). This manuscript commemorates the mathematician Boris Polyak (1935-2023). Comment: 30 pages
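
    To make the setting concrete, here is a minimal sketch of a generic momentum-based stochastic gradient descent ascent loop for $\min_x\max_y f(x,y)$ that draws one sample per iteration. It is not the paper's MSGDA or AdaMSGDA; the toy objective, noise model, step sizes, and momentum weight are placeholder assumptions used only for illustration.

```python
# Illustrative sketch only: generic momentum-based stochastic gradient
# descent ascent (GDA) for min_x max_y f(x, y), one sample per iteration.
# NOT the paper's MSGDA/AdaMSGDA; objective and hyperparameters are assumed.
import numpy as np

def stochastic_grads(x, y, rng):
    # Toy objective f(x, y) = 0.5*||x||^2 + x.y - 0.5*||y||^2,
    # with Gaussian noise standing in for a stochastic gradient oracle.
    gx = x + y + 0.01 * rng.standard_normal(x.shape)   # grad_x f (noisy)
    gy = x - y + 0.01 * rng.standard_normal(y.shape)   # grad_y f (noisy)
    return gx, gy

def momentum_gda(dim=5, iters=2000, eta_x=0.02, eta_y=0.05, beta=0.9, seed=0):
    rng = np.random.default_rng(seed)
    x, y = rng.standard_normal(dim), rng.standard_normal(dim)
    mx, my = np.zeros(dim), np.zeros(dim)
    for _ in range(iters):
        gx, gy = stochastic_grads(x, y, rng)   # one sample per loop
        mx = beta * mx + (1 - beta) * gx       # momentum estimate for x-gradient
        my = beta * my + (1 - beta) * gy       # momentum estimate for y-gradient
        x -= eta_x * mx                        # descent step on x
        y += eta_y * my                        # ascent step on y
    return x, y
```

    An adaptive variant in the spirit described above would replace the fixed eta_x and eta_y with per-iteration learning rates derived from running gradient statistics; the convergence guarantees quoted in the abstract apply to the authors' methods, not to this sketch.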

    Escaping Saddle Points with Adaptive Gradient Methods

    Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points. Comment: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. the matrix Freedman inequality
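
    The "adaptive methods as preconditioned SGD" view can be made concrete with a short sketch: an RMSProp-style update written explicitly as an SGD step multiplied by a diagonal preconditioner built from an online second-moment estimate. The quadratic toy objective, step size, and decay rate below are assumptions for illustration, not the paper's analysis.

```python
# Illustrative sketch only: RMSProp-style update written explicitly as
# preconditioned SGD. Hyperparameters and the toy gradient are assumed.
import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-2, beta2=0.999,
                                  eps=1e-8, iters=1000):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                      # online estimate of E[g^2]
    for _ in range(iters):
        g = grad_fn(x)                        # stochastic gradient sample
        v = beta2 * v + (1 - beta2) * g * g   # update second-moment estimate
        precond = 1.0 / (np.sqrt(v) + eps)    # diagonal preconditioner
        x -= lr * precond * g                 # preconditioned SGD step
    return x

# Usage on a noisy quadratic: gradient of 0.5*||x||^2 plus Gaussian noise.
rng = np.random.default_rng(0)
x_final = rmsprop_as_preconditioned_sgd(
    lambda x: x + 0.1 * rng.standard_normal(x.shape), x0=np.ones(3))
```

    Near a stationary point the diagonal entries of the preconditioner scale inversely with the per-coordinate gradient-noise magnitude, which is the rescaling-toward-isotropic-noise effect the abstract credits with helping escape saddle points.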