
    Enhanced Adaptive Gradient Algorithms for Nonconvex-PL Minimax Optimization

    In this paper, we study a class of nonconvex-nonconcave minimax optimization problems (i.e., $\min_x\max_y f(x,y)$), where $f(x,y)$ is possibly nonconvex in $x$, and is nonconcave but satisfies the Polyak-Lojasiewicz (PL) condition in $y$. Moreover, we propose a class of enhanced momentum-based gradient descent ascent methods (i.e., MSGDA and AdaMSGDA) to solve these stochastic nonconvex-PL minimax problems. In particular, our AdaMSGDA algorithm can use various adaptive learning rates in updating the variables $x$ and $y$ without relying on any global or coordinate-wise adaptive learning rates. Theoretically, we present an effective convergence analysis framework for our methods. Specifically, we prove that our MSGDA and AdaMSGDA methods attain the best known sample (gradient) complexity of $O(\epsilon^{-3})$, requiring only one sample at each loop, in finding an $\epsilon$-stationary solution (i.e., $\mathbb{E}\|\nabla F(x)\|\leq \epsilon$, where $F(x)=\max_y f(x,y)$). This manuscript commemorates the mathematician Boris Polyak (1935-2023). Comment: 30 pages
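
    To make the setting concrete, here is a minimal sketch of a generic momentum-based stochastic gradient descent ascent loop for $\min_x\max_y f(x,y)$ that draws one sample per iteration. It is not the paper's MSGDA or AdaMSGDA; the toy objective, noise model, step sizes, and momentum weight are placeholder assumptions used only for illustration.

```python
# Illustrative sketch only: generic momentum-based stochastic gradient
# descent ascent (GDA) for min_x max_y f(x, y), one sample per iteration.
# NOT the paper's MSGDA/AdaMSGDA; objective and hyperparameters are assumed.
import numpy as np

def stochastic_grads(x, y, rng):
    # Toy objective f(x, y) = 0.5*||x||^2 + x.y - 0.5*||y||^2,
    # with Gaussian noise standing in for a stochastic gradient oracle.
    gx = x + y + 0.01 * rng.standard_normal(x.shape)   # grad_x f (noisy)
    gy = x - y + 0.01 * rng.standard_normal(y.shape)   # grad_y f (noisy)
    return gx, gy

def momentum_gda(dim=5, iters=2000, eta_x=0.02, eta_y=0.05, beta=0.9, seed=0):
    rng = np.random.default_rng(seed)
    x, y = rng.standard_normal(dim), rng.standard_normal(dim)
    mx, my = np.zeros(dim), np.zeros(dim)
    for _ in range(iters):
        gx, gy = stochastic_grads(x, y, rng)   # one sample per loop
        mx = beta * mx + (1 - beta) * gx       # momentum estimate for x-gradient
        my = beta * my + (1 - beta) * gy       # momentum estimate for y-gradient
        x -= eta_x * mx                        # descent step on x
        y += eta_y * my                        # ascent step on y
    return x, y
```

    An adaptive variant in the spirit described above would replace the fixed eta_x and eta_y with per-iteration learning rates derived from running gradient statistics; the convergence guarantees quoted in the abstract apply to the authors' methods, not to this sketch.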

    Escaping Saddle Points with Adaptive Gradient Methods

    Adaptive methods such as Adam and RMSProp are widely used in deep learning but are not well understood. In this paper, we seek a crisp, clean and precise characterization of their behavior in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points. Furthermore, we show that adaptive methods can efficiently estimate the aforementioned preconditioner. By gluing together these two components, we provide the first (to our knowledge) second-order convergence result for any adaptive method. The key insight from our analysis is that, compared to SGD, adaptive methods escape saddle points faster, and can converge faster overall to second-order stationary points. Comment: Update Theorem 4.1 and proof to use martingale concentration bounds, i.e. the matrix Freedman inequality
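
    The "adaptive methods as preconditioned SGD" view can be made concrete with a short sketch: an RMSProp-style update written explicitly as an SGD step multiplied by a diagonal preconditioner built from an online second-moment estimate. The quadratic toy objective, step size, and decay rate below are assumptions for illustration, not the paper's analysis.

```python
# Illustrative sketch only: RMSProp-style update written explicitly as
# preconditioned SGD. Hyperparameters and the toy gradient are assumed.
import numpy as np

def rmsprop_as_preconditioned_sgd(grad_fn, x0, lr=1e-2, beta2=0.999,
                                  eps=1e-8, iters=1000):
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)                      # online estimate of E[g^2]
    for _ in range(iters):
        g = grad_fn(x)                        # stochastic gradient sample
        v = beta2 * v + (1 - beta2) * g * g   # update second-moment estimate
        precond = 1.0 / (np.sqrt(v) + eps)    # diagonal preconditioner
        x -= lr * precond * g                 # preconditioned SGD step
    return x

# Usage on a noisy quadratic: gradient of 0.5*||x||^2 plus Gaussian noise.
rng = np.random.default_rng(0)
x_final = rmsprop_as_preconditioned_sgd(
    lambda x: x + 0.1 * rng.standard_normal(x.shape), x0=np.ones(3))
```

    Near a stationary point the diagonal entries of the preconditioner scale inversely with the per-coordinate gradient-noise magnitude, which is the rescaling-toward-isotropic-noise effect the abstract credits with helping escape saddle points.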