Accelerating Stochastic Recursive and Semi-stochastic Gradient Methods with Adaptive Barzilai-Borwein Step Sizes

Author: Peng Zheng
Wang Jiangshan
Yang Yiming
Publication date: 22/10/2023

The mini-batch versions of the StochAstic Recursive grAdient algoritHm and the Semi-Stochastic Gradient Descent method that employ random Barzilai-Borwein step sizes (abbreviated as MB-SARAH-RBB and mS2GD-RBB) have gained prominence through their timely step size sequences. Inspired by modern adaptors and variance reduction techniques, we propose two new variant rules in the paper, referred to as RHBB and RHBB+, thereby leading to four algorithms: MB-SARAH-RHBB, MB-SARAH-RHBB+, mS2GD-RHBB and mS2GD-RHBB+. RHBB+ is an enhanced version that additionally incorporates the importance sampling technique. They are aggressive in updates, robust in performance and self-adaptive along iterative periods. We analyze the flexible convergence structures and the corresponding complexity bounds in strongly convex cases. Comprehensive tuning guidance is theoretically provided for reference in practical implementations. Experiments show that the proposed methods consistently outperform the original and various state-of-the-art methods on frequently tested data sets. In particular, tests on RHBB+ verify the efficacy of applying the importance sampling technique at the step size level. Numerous explorations display the promising scalability of our iterative adaptors. Comment: 44 pages, 33 figures
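For readers unfamiliar with the Barzilai-Borwein (BB) step size named in this abstract, the following is a minimal sketch of the classical deterministic BB rule, eta = (s.s)/(s.y), where s is the difference of successive iterates and y the difference of successive gradients. It is not the RBB, RHBB or RHBB+ rule of the paper, and it uses full gradients rather than mini-batches. The function name `bb_gradient_descent`, the fallback step `eta0` and the toy quadratic in the usage note are assumptions made for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def bb_gradient_descent(grad, x0, eta0=0.1, iters=50):
    """Gradient descent with the classical (deterministic) BB step size.

    grad : callable returning the gradient as a list (assumed exact here).
    eta0 : fallback step used for the first move and whenever the
           curvature estimate s.y is not positive (an assumption of
           this sketch, not a rule taken from the paper).
    """
    x_prev = list(x0)
    g_prev = grad(x_prev)
    # First move uses the fallback step because no history exists yet.
    x = [xi - eta0 * gi for xi, gi in zip(x_prev, g_prev)]
    for _ in range(iters):
        g = grad(x)
        s = [a - b for a, b in zip(x, x_prev)]   # iterate difference
        y = [a - b for a, b in zip(g, g_prev)]   # gradient difference
        sy = dot(s, y)
        eta = dot(s, s) / sy if sy > 0.0 else eta0
        x_prev, g_prev = x, g
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x
```

As a usage example, calling `bb_gradient_descent(lambda x: [x[0], 10.0 * x[1]], [1.0, 1.0], eta0=0.05)` minimizes the toy quadratic 0.5*(x0^2 + 10*x1^2) without any knowledge of its smoothness constant.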

arXiv.org e-Print Archive

Generalized Polyak Step Size for First Order Optimization with Momentum

Author: Johansson Mikael
Wang Xiaoyu
Zhang Tong
Publication date: 22/05/2023

In machine learning applications, it is well known that carefully designed learning rate (step size) schedules can significantly improve the convergence of commonly used first-order optimization algorithms. Therefore, how to set the step size adaptively becomes an important research question. A popular and effective method is the Polyak step size, which sets the step size adaptively for gradient descent or stochastic gradient descent without the need to estimate the smoothness parameter of the objective function. However, there has not been a principled way to generalize the Polyak step size for algorithms with momentum acceleration. This paper presents a general framework to set the learning rate adaptively for first-order optimization methods with momentum, motivated by the derivation of the Polyak step size. It is shown that the resulting methods are much less sensitive to the choice of momentum parameter and may avoid the oscillation of the heavy-ball method on ill-conditioned problems. These adaptive step sizes are further extended to the stochastic setting, where they are attractive choices for stochastic gradient descent with momentum. Our methods are demonstrated to be more effective for stochastic gradient methods than prior adaptive step size algorithms in large-scale machine learning tasks. Comment: 28 pages, ICML202
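The classical Polyak step size that this abstract builds on can be sketched as follows: at each iterate the step is eta = (f(x) - f*) / ||grad f(x)||^2. This is only the plain gradient-descent version, not the momentum generalization proposed in the paper. The sketch assumes the optimal value `f_star` is known; the function name, the stopping tolerance `tol` and the toy objective in the usage note are illustrative choices.

```python
def polyak_gradient_descent(f, grad, x0, f_star=0.0, iters=2000, tol=1e-24):
    """Gradient descent with the classical Polyak step size.

    f_star : optimal objective value, assumed known in this sketch.
    tol    : stop once the squared gradient norm falls below this value
             (a hypothetical safeguard, not part of the original rule).
    """
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        if g_sq <= tol:
            break  # gradient is numerically zero; nothing left to do
        eta = (f(x) - f_star) / g_sq  # Polyak step: no smoothness constant needed
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x
```

For example, with `f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)` and the matching gradient `lambda x: [x[0], 10.0 * x[1]]`, the call `polyak_gradient_descent(f, grad, [1.0, 1.0])` drives the iterate toward the origin.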

arXiv.org e-Print Archive

Beyond the Golden Ratio for Variational Inequality Algorithms

Author: Alacaoglu Ahmet
Böhm Axel
Malitsky Yura
Publication date: 28/12/2022

We improve the understanding of the golden ratio algorithm, which solves monotone variational inequalities (VI) and convex-concave min-max problems via the distinctive feature of adapting the step sizes to the local Lipschitz constants. Adaptive step sizes not only eliminate the need to pick hyperparameters, but they also remove the necessity of global Lipschitz continuity and can increase from one iteration to the next. We first establish the equivalence of this algorithm with popular VI methods such as reflected gradient, Popov or optimistic gradient descent-ascent in the unconstrained case with constant step sizes. We then move on to the constrained setting and introduce a new analysis that allows the use of larger step sizes, to complete the bridge between the golden ratio algorithm and the existing algorithms in the literature. Doing so, we actually eliminate the link between the golden ratio $\frac{1+\sqrt{5}}{2}$ and the algorithm. Moreover, we improve the adaptive version of the algorithm, first by removing the maximum step size hyperparameter (an artifact from the analysis) to improve the complexity bound, and second by adjusting it to nonmonotone problems with weak Minty solutions, with superior empirical performance.
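As background for this abstract, here is a minimal sketch of the original fixed-step golden ratio algorithm in the unconstrained case, as commonly stated: an averaging step with weight determined by phi = (1 + sqrt(5))/2, followed by a step along the operator F with step size phi/(2L). It is not the adaptive or improved variant that the paper analyzes. The function name, the requirement that the caller supply the Lipschitz constant, and the bilinear test problem in the usage note are assumptions for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # the golden ratio


def golden_ratio_algorithm(F, z0, lipschitz, iters=300):
    """Fixed-step golden ratio algorithm for an unconstrained monotone VI.

    F         : monotone operator, callable on a list and returning a list.
    lipschitz : global Lipschitz constant of F (assumed known here; the
                adaptive variants discussed in the abstract avoid this).
    """
    lam = PHI / (2 * lipschitz)   # constant step size
    z = list(z0)
    z_bar = list(z0)              # averaged sequence, started at z0
    for _ in range(iters):
        # Convex combination of the current iterate and the previous average.
        z_bar = [((PHI - 1) * zi + bi) / PHI for zi, bi in zip(z, z_bar)]
        Fz = F(z)
        # Unconstrained case: the proximal/projection step is the identity.
        z = [bi - lam * fi for bi, fi in zip(z_bar, Fz)]
    return z
```

A standard test case is the bilinear saddle point min_x max_y x*y, whose operator is `F = lambda z: [z[1], -z[0]]` with Lipschitz constant 1. Plain gradient descent-ascent diverges on this problem, whereas `golden_ratio_algorithm(F, [1.0, 1.0], 1.0)` approaches the solution at the origin.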

arXiv.org e-Print Archive

On the Influence of Bias-Correction on Distributed Stochastic Optimization

Author: Alghunaim Sulaiman A.
Sayed Ali H.
Ying Bicheng
Yuan Kun
Publication date: 11/07/2019

Various bias-correction methods such as EXTRA, gradient tracking methods, and exact diffusion have been proposed recently to solve distributed deterministic optimization problems. These methods employ constant step-sizes and converge linearly to the exact solution under proper conditions. However, their performance under stochastic and adaptive settings is less explored. It is still unknown whether, when and why these bias-correction methods can outperform their traditional counterparts (such as consensus and diffusion) with noisy gradients and constant step-sizes. This work studies the performance of exact diffusion under the stochastic and adaptive setting, and provides conditions under which exact diffusion achieves better steady-state mean-square deviation (MSD) performance than traditional algorithms without bias-correction. In particular, it is proven that this superiority is more evident over sparsely-connected network topologies such as lines, cycles, or grids. Conditions are also provided under which the exact diffusion method matches or may even degrade the performance of traditional methods. Simulations are provided to validate the theoretical findings. Comment: 17 pages, 9 figures, submitted for publication
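The exact diffusion recursion referred to in this abstract is usually written as three steps per agent: adapt (a local gradient step), correct (add the difference between the current iterate and the previous adapt output), and combine (average with neighbors using the matrix (I + A)/2). The sketch below implements that recursion in the deterministic setting with scalar parameters, which is simpler than the stochastic setting the paper studies. The function name, the assumption that the combination matrix `A` is symmetric and doubly stochastic, and the three-agent line network in the usage note are illustrative.

```python
def exact_diffusion(grads, A, w0, mu=0.5, iters=200):
    """Deterministic exact diffusion for scalar parameters.

    grads : list of per-agent gradient callables, one per agent.
    A     : combination matrix (list of lists), assumed symmetric and
            doubly stochastic in this sketch.
    w0    : list of initial scalar iterates, one per agent.
    mu    : constant step size.
    """
    n = len(w0)
    # Exact diffusion combines with A_bar = (I + A) / 2 rather than A.
    A_bar = [[(A[i][j] + (1.0 if i == j else 0.0)) / 2 for j in range(n)]
             for i in range(n)]
    w = list(w0)
    psi_prev = list(w0)  # previous adapt output, initialized to w0
    for _ in range(iters):
        psi = [w[i] - mu * grads[i](w[i]) for i in range(n)]      # adapt
        phi = [psi[i] + w[i] - psi_prev[i] for i in range(n)]     # correct
        w = [sum(A_bar[i][j] * phi[j] for j in range(n))          # combine
             for i in range(n)]
        psi_prev = psi
    return w
```

As a usage example, take three agents on a line with local costs 0.5*(w - b_i)^2 for b = [1, 2, 6] and Metropolis weights `A = [[2/3, 1/3, 0], [1/3, 1/3, 1/3], [0, 1/3, 2/3]]`. Every agent's iterate approaches the global minimizer, the mean of the b_i, despite the constant step size.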

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning

Author: Bahamou Achraf
Goldfarb Donald
Publication date: 23/05/2023

We propose a new per-layer adaptive step-size procedure for stochastic first-order optimization methods for minimizing empirical loss functions in deep learning, eliminating the need for the user to tune the learning rate (LR). The proposed approach exploits the layer-wise stochastic curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer. The method has memory requirements that are comparable to those of first-order methods, while its per-iteration time complexity is only increased by an amount that is roughly equivalent to an additional gradient computation. Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules and outperform fine-tuned LR versions of these methods as well as popular first-order and second-order algorithms for training DNNs on Autoencoder, Convolutional Neural Network (CNN) and Graph Convolutional Network (GCN) models. Finally, it is proved that an idealized version of SGD with the layer-wise step sizes converges linearly when using full-batch gradients.

arXiv.org e-Print Archive

Convergence of First-Order Methods for Constrained Nonconvex Optimization with Dependent Data

Author: Alacaoglu Ahmet
Lyu Hanbaek
Publication date: 23/06/2023

We focus on analyzing the classical stochastic projected gradient methods under a general dependent data sampling scheme for constrained smooth nonconvex optimization. We show the worst-case rate of convergence $\tilde{O}(t^{-1/4})$ and complexity $\tilde{O}(\varepsilon^{-4})$ for achieving an $\varepsilon$-near stationary point in terms of the norm of the gradient of Moreau envelope and gradient mapping. While classical convergence guarantee requires i.i.d. data sampling from the target distribution, we only require a mild mixing condition of the conditional distribution, which holds for a wide class of Markov chain sampling algorithms. This improves the existing complexity for the constrained smooth nonconvex optimization with dependent data from $\tilde{O}(\varepsilon^{-8})$ to $\tilde{O}(\varepsilon^{-4})$ with a significantly simpler analysis. We illustrate the generality of our approach by deriving convergence results with dependent data for stochastic proximal gradient methods, adaptive stochastic gradient algorithm AdaGrad and stochastic gradient algorithm with heavy ball momentum. As an application, we obtain the first online nonnegative matrix factorization algorithms for dependent data based on stochastic projected gradient methods with adaptive step sizes and optimal rate of convergence. Comment: 32 pages, 1 figure, 1 table
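The gradient mapping used as a stationarity measure in this abstract can be illustrated with a small deterministic sketch: for a step size eta, it is G(x) = (x - P(x - eta * grad f(x))) / eta, where P is the projection onto the constraint set. The code below uses exact gradients and the nonnegative orthant as the constraint set, so it omits the stochastic, dependent-data sampling that the paper actually analyzes. All function names and the toy problem in the usage note are assumptions for illustration.

```python
def project_nonneg(x):
    """Euclidean projection onto the nonnegative orthant."""
    return [max(0.0, xi) for xi in x]


def gradient_mapping(grad, x, eta):
    """Gradient mapping G(x) = (x - P(x - eta * grad(x))) / eta."""
    g = grad(x)
    x_plus = project_nonneg([xi - eta * gi for xi, gi in zip(x, g)])
    return [(xi - pi) / eta for xi, pi in zip(x, x_plus)]


def projected_gradient(grad, x0, eta=0.5, iters=100):
    """Deterministic projected gradient method (exact gradients assumed)."""
    x = project_nonneg(x0)
    for _ in range(iters):
        g = grad(x)
        x = project_nonneg([xi - eta * gi for xi, gi in zip(x, g)])
    return x
```

For instance, minimizing 0.5*||x - c||^2 over x >= 0 with c = [1, -2] has the constrained solution [1, 0]. At that point the plain gradient is [0, 2], which is not zero, while the gradient mapping vanishes. This is why constrained analyses measure stationarity through the gradient mapping rather than the gradient itself.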

arXiv.org e-Print Archive