122,873 research outputs found

    Accelerating Stochastic Recursive and Semi-stochastic Gradient Methods with Adaptive Barzilai-Borwein Step Sizes

    Full text link
    The mini-batch versions of StochAstic Recursive grAdient algoritHm and Semi-Stochastic Gradient Descent method, employed the random Barzilai-Borwein step sizes (shorted as MB-SARAH-RBB and mS2GD-RBB), have surged into prominence through timely step size sequence. Inspired by modern adaptors and variance reduction techniques, we propose two new variant rules in the paper, referred to as RHBB and RHBB+, thereby leading to four algorithms MB-SARAH-RHBB, MB-SARAH-RHBB+, mS2GD-RHBB and mS2GD-RHBB+ respectively. RHBB+ is an enhanced version that additionally incorporates the importance sampling technique. They are aggressive in updates, robust in performance and self-adaptive along iterative periods. We analyze the flexible convergence structures and the corresponding complexity bounds in strongly convex cases. Comprehensive tuning guidance is theoretically provided for reference in practical implementations. Experiments show that the proposed methods consistently outperform the original and various state-of-the-art methods on frequently tested data sets. In particular, tests on the RHBB+ verify the efficacy of applying the importance sampling technique to the step size level. Numerous explorations display the promising scalability of our iterative adaptors.Comment: 44 pages, 33 figure

    Generalized Polyak Step Size for First Order Optimization with Momentum

    Full text link
    In machine learning applications, it is well known that carefully designed learning rate (step size) schedules can significantly improve the convergence of commonly used first-order optimization algorithms. Therefore how to set step size adaptively becomes an important research question. A popular and effective method is the Polyak step size, which sets step size adaptively for gradient descent or stochastic gradient descent without the need to estimate the smoothness parameter of the objective function. However, there has not been a principled way to generalize the Polyak step size for algorithms with momentum accelerations. This paper presents a general framework to set the learning rate adaptively for first-order optimization methods with momentum, motivated by the derivation of Polyak step size. It is shown that the resulting methods are much less sensitive to the choice of momentum parameter and may avoid the oscillation of the heavy-ball method on ill-conditioned problems. These adaptive step sizes are further extended to the stochastic settings, which are attractive choices for stochastic gradient descent with momentum. Our methods are demonstrated to be more effective for stochastic gradient methods than prior adaptive step size algorithms in large-scale machine learning tasks.Comment: 28 pages, ICML202

    Beyond the Golden Ratio for Variational Inequality Algorithms

    Full text link
    We improve the understanding of the golden ratio algorithm\textit{golden ratio algorithm}, which solves monotone variational inequalities (VI) and convex-concave min-max problems via the distinctive feature of adapting the step sizes to the local Lipschitz constants. Adaptive step sizes not only eliminate the need to pick hyperparameters, but they also remove the necessity of global Lipschitz continuity and can increase from one iteration to the next. We first establish the equivalence of this algorithm with popular VI methods such as reflected gradient, Popov or optimistic gradient descent-ascent in the unconstrained case with constant step sizes. We then move on to the constrained setting and introduce a new analysis that allows to use larger step sizes, to complete the bridge between the golden ratio algorithm and the existing algorithms in the literature. Doing so, we actually eliminate the link between the golden ratio 1+52\frac{1+\sqrt{5}}{2} and the algorithm. Moreover, we improve the adaptive version of the algorithm, first by removing the maximum step size hyperparameter (an artifact from the analysis) to improve the complexity bound, and second by adjusting it to nonmonotone problems with weak Minty solutions, with superior empirical performance

    On the Influence of Bias-Correction on Distributed Stochastic Optimization

    Full text link
    Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed {\em deterministic} optimization problems. These methods employ constant step-sizes and converge linearly to the {\em exact} solution under proper conditions. However, their performance under stochastic and adaptive settings is less explored. It is still unknown {\em whether}, {\em when} and {\em why} these bias-correction methods can outperform their traditional counterparts (such as consensus and diffusion) with noisy gradient and constant step-sizes. This work studies the performance of exact diffusion under the stochastic and adaptive setting, and provides conditions under which exact diffusion has superior steady-state mean-square deviation (MSD) performance than traditional algorithms without bias-correction. In particular, it is proven that this superiority is more evident over sparsely-connected network topologies such as lines, cycles, or grids. Conditions are also provided under which exact diffusion method match or may even degrade the performance of traditional methods. Simulations are provided to validate the theoretical findings.Comment: 17 pages, 9 figure, submitted for publicatio

    Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning

    Full text link
    We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization methods for minimizing empirical loss functions in deep learning, eliminating the need for the user to tune the learning rate (LR). The proposed approach exploits the layer-wise stochastic curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer. The method has memory requirements that are comparable to those of first-order methods, while its per-iteration time complexity is only increased by an amount that is roughly equivalent to an additional gradient computation. Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules and outperform fine-tuned LR versions of these methods as well as popular first-order and second-order algorithms for training DNNs on Autoencoder, Convolutional Neural Network (CNN) and Graph Convolutional Network (GCN) models. Finally, it is proved that an idealized version of SGD with the layer-wise step sizes converges linearly when using full-batch gradients

    Convergence of First-Order Methods for Constrained Nonconvex Optimization with Dependent Data

    Full text link
    We focus on analyzing the classical stochastic projected gradient methods under a general dependent data sampling scheme for constrained smooth nonconvex optimization. We show the worst-case rate of convergence O~(t−1/4)\tilde{O}(t^{-1/4}) and complexity O~(ε−4)\tilde{O}(\varepsilon^{-4}) for achieving an ε\varepsilon-near stationary point in terms of the norm of the gradient of Moreau envelope and gradient mapping. While classical convergence guarantee requires i.i.d. data sampling from the target distribution, we only require a mild mixing condition of the conditional distribution, which holds for a wide class of Markov chain sampling algorithms. This improves the existing complexity for the constrained smooth nonconvex optimization with dependent data from O~(ε−8)\tilde{O}(\varepsilon^{-8}) to O~(ε−4)\tilde{O}(\varepsilon^{-4}) with a significantly simpler analysis. We illustrate the generality of our approach by deriving convergence results with dependent data for stochastic proximal gradient methods, adaptive stochastic gradient algorithm AdaGrad and stochastic gradient algorithm with heavy ball momentum. As an application, we obtain first online nonnegative matrix factorization algorithms for dependent data based on stochastic projected gradient methods with adaptive step sizes and optimal rate of convergence.Comment: 32 pages, 1 figure, 1 tabl
    • …
    corecore