23,412 research outputs found

    Momentum-Based Variance Reduction in Non-Convex SGD

    Full text link
    Variance reduction has emerged in recent years as a strong competitor to stochastic gradient descent in non-convex problems, providing the first algorithms to improve upon the convergence rate of stochastic gradient descent for finding first-order critical points. However, variance reduction techniques typically require carefully tuned learning rates and willingness to use excessively large "mega-batches" in order to achieve their improved results. We present a new algorithm, STORM, that does not require any batches and makes use of adaptive learning rates, enabling simpler implementation and less hyperparameter tuning. Our technique for removing the batches uses a variant of momentum to achieve variance reduction in non-convex optimization. On smooth losses $F$, STORM finds a point $\boldsymbol{x}$ with $\mathbb{E}[\|\nabla F(\boldsymbol{x})\|] \le O(1/\sqrt{T} + \sigma^{1/3}/T^{1/3})$ in $T$ iterations with $\sigma^2$ variance in the gradients, matching the optimal rate without requiring knowledge of $\sigma$.
    https://arxiv.org/pdf/1905.10018.pdf
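
    The core of the method is a single recursive gradient estimator. Below is a minimal sketch of that recursion in Python, assuming a fixed step size and momentum parameter rather than the paper's adaptive learning rates; `stoch_grad(x, xi)` and `sample()` are hypothetical helpers standing in for the loss and the data source.

```python
import numpy as np

def storm(stoch_grad, sample, x0, T, a=0.9, lr=0.01):
    """Sketch of the STORM recursion with a fixed (non-adaptive) step size."""
    x_prev = np.asarray(x0, dtype=float)
    d = stoch_grad(x_prev, sample())   # d_1: plain stochastic gradient
    x = x_prev - lr * d
    for _ in range(T - 1):
        xi = sample()  # one fresh sample per iteration, no mega-batch
        # STORM estimator: the correction term reuses the SAME sample xi
        # at both x_t and x_{t-1}, which is what drives variance reduction
        # without any batching.
        d = stoch_grad(x, xi) + (1 - a) * (d - stoch_grad(x_prev, xi))
        x_prev, x = x, x - lr * d
    return x
```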

    Federated Learning Using Variance Reduced Stochastic Gradient for Probabilistically Activated Agents

    Full text link
    This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration. In distributed machine learning, FL is a practical tool when privacy matters. When FL is deployed in an environment with irregular connections among agents (devices), reaching a trained model both economically and quickly can be demanding. The first layer of our algorithm corresponds to the model parameter propagation across agents done by the server. In the second layer, each agent performs its local update with a stochastic, variance-reduced technique called Stochastic Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction from stochastic optimization in the agents' local update step to reduce the variance caused by stochastic gradient descent (SGD). We provide a convergence bound for our algorithm which improves the rate from $O(1/\sqrt{K})$ to $O(1/K)$ by using a constant step size. We demonstrate the performance of our algorithm using numerical examples.
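
    For reference, here is a minimal sketch of the standard SVRG step an agent might run locally, under stated assumptions: `grads` is a hypothetical list of per-sample gradient callables for the agent's local data, and the server-side selection and propagation layer is omitted entirely.

```python
import numpy as np

def svrg_local_update(grads, w, n_inner, lr):
    """One SVRG round: snapshot, full local gradient, then inner SGD steps."""
    w = np.asarray(w, dtype=float)
    w_snap = w.copy()
    # Full local gradient at the snapshot, computed once per round.
    mu = sum(g(w_snap) for g in grads) / len(grads)
    for _ in range(n_inner):
        i = np.random.randint(len(grads))
        # Variance-reduced direction: unbiased for the full local gradient,
        # with variance that shrinks as w approaches the snapshot.
        v = grads[i](w) - grads[i](w_snap) + mu
        w = w - lr * v
    return w
```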

    Accelerating Variance-Reduced Stochastic Gradient Methods

    Get PDF
    Variance reduction is a crucial tool for improving the slow convergence of stochastic gradient descent. Only a few variance-reduced methods, however, have yet been shown to directly benefit from Nesterov’s acceleration techniques to match the convergence rates of accelerated gradient methods. Such approaches rely on “negative momentum”, a technique for further variance reduction that is generally specific to the SVRG gradient estimator. In this work, we show for the first time that negative momentum is unnecessary for acceleration and develop a universal acceleration framework that allows all popular variance-reduced methods to achieve accelerated convergence rates. The constants appearing in these rates, including their dependence on the number of functions n, scale with the mean-squared-error and bias of the gradient estimator. In a series of numerical experiments, we demonstrate that versions of SAGA, SVRG, SARAH, and SARGE using our framework significantly outperform non-accelerated versions and compare favourably with algorithms using negative momentum.
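
    To make the plug-in idea concrete, the following is a schematic sketch only, not the paper's actual framework or its rates: a generic Nesterov-style extrapolation loop into which an arbitrary variance-reduced direction is plugged, with no snapshot-anchored negative-momentum term. The callable `estimator` is a hypothetical stand-in for a SAGA, SVRG, SARAH, or SARGE gradient estimator.

```python
import numpy as np

def accelerated_vr(estimator, x0, T, lr, beta=0.9):
    """Schematic: Nesterov-style extrapolation around a pluggable estimator."""
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for _ in range(T):
        y = x + beta * (x - x_prev)   # extrapolated point (no negative momentum)
        d = estimator(y)              # any variance-reduced gradient direction
        x_prev, x = x, y - lr * d
    return x
```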

    The Limitation and Practical Acceleration of Stochastic Gradient Algorithms in Inverse Problems.

    Get PDF
    In this work we investigate the practicability of stochastic gradient descent and recently introduced variants with variance-reduction techniques in imaging inverse problems, such as space-varying image deblurring. Such algorithms have been shown in the machine learning literature to have optimal complexities in theory, and to provide great empirical improvements over full gradient methods. Surprisingly, in some tasks such as image deblurring, many such methods fail to converge faster than the accelerated full gradient method (FISTA), even in terms of epoch counts. We investigate this phenomenon and propose a theory-inspired mechanism to characterize whether a given inverse problem is better solved by a stochastic optimization technique with a known sampling pattern. Furthermore, to overcome another key bottleneck of stochastic optimization, the heavy computation of proximal operators, while maintaining fast convergence, we propose an accelerated primal-dual SGD algorithm and demonstrate the effectiveness of our approach in image deblurring experiments.
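
    As a concrete instance of the baseline under study (not the paper's accelerated primal-dual algorithm), here is a minimal sketch of plain proximal SGD on a block-decomposed deblurring problem $\min_x \sum_i \tfrac{1}{2}\|A_i x - b_i\|^2 + g(x)$; all names (`A_blocks`, `prox`, ...) are hypothetical, and `prox(v, t)` is assumed to compute $\mathrm{prox}_{t g}(v)$ for the regularizer $g$.

```python
import numpy as np

def prox_sgd_deblur(A_blocks, b_blocks, prox, x0, epochs, lr):
    """Proximal SGD over row blocks of a (space-varying) blurring operator."""
    x = np.asarray(x0, dtype=float)
    n = len(A_blocks)
    for _ in range(epochs):
        for _ in range(n):  # n sampled steps per epoch
            i = np.random.randint(n)
            # Block gradient, scaled by n so it is unbiased for the full one.
            g = n * (A_blocks[i].T @ (A_blocks[i] @ x - b_blocks[i]))
            # Proximal step handles the (possibly expensive) regularizer.
            x = prox(x - lr * g, lr)
    return x
```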