Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods

Authors: Francis Bach, Robert M. Gower, Nicolas Le Roux
Publication date: 01/01/2018

Our goal is to improve variance reducing stochastic methods through better control variates. We first propose a modification of SVRG which uses the Hessian to track gradients over time, rather than to recondition, increasing the correlation of the control variates and leading to faster theoretical convergence close to the optimum. We then propose accurate and computationally efficient approximations to the Hessian, using both a diagonal and a low-rank matrix. Finally, we demonstrate the effectiveness of our method on a wide range of problems. Comment: 17 pages, 2 figures, 1 table
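One way to read "using the Hessian to track gradients" is that each stored gradient is corrected by a first-order Taylor term around the snapshot point. The sketch below is an illustration under that assumption, not the authors' implementation. It compares the plain SVRG estimator with a Hessian-tracked one on scalar softplus terms f_i(x) = log(1 + exp(a_i x)). The function names, the coefficients, and the choice of test functions are all hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad(a, x):
    """Derivative of the softplus term f(x) = log(1 + exp(a * x))."""
    return a * sigmoid(a * x)

def hess(a, x):
    """Second derivative of the same softplus term."""
    s = sigmoid(a * x)
    return a * a * s * (1.0 - s)

def svrg_estimates(coeffs, x, snap):
    """Plain SVRG estimator for every index i: the control variate is grad_i(snap)."""
    full = sum(grad(a, snap) for a in coeffs) / len(coeffs)
    return [grad(a, x) - grad(a, snap) + full for a in coeffs]

def tracked_estimates(coeffs, x, snap):
    """Assumed Hessian-tracked estimator: the control variate is the
    first-order Taylor expansion of grad_i around the snapshot."""
    z = [grad(a, snap) + hess(a, snap) * (x - snap) for a in coeffs]
    z_mean = sum(z) / len(z)
    return [grad(a, x) - zi + z_mean for a, zi in zip(coeffs, z)]

def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical data: four terms, current point close to the snapshot.
coeffs = [0.5, 1.0, 2.0, 3.0]
plain = svrg_estimates(coeffs, x=0.3, snap=0.2)
tracked = tracked_estimates(coeffs, x=0.3, snap=0.2)
print("SVRG variance:", variance(plain))
print("tracked variance:", variance(tracked))
```

Both estimators average to the full gradient at x, so both are unbiased under uniform sampling. The tracked version leaves only the second-order Taylor remainder as noise, which is why its variance is smaller near the snapshot.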

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Authors: Simon Lacoste-Julien, Rémi Leblond, Fabian Pedregosa
Publication date: 05/11/2017

Due to their simplicity and excellent performance, parallel asynchronous variants of stochastic gradient descent have become popular methods to solve a wide range of large-scale optimization problems on multi-core architectures. Yet, despite their practical success, support for nonsmooth objectives is still lacking, making them unsuitable for many problems of interest in machine learning, such as the Lasso, group Lasso or empirical risk minimization with convex constraints. In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse method inspired by SAGA, a variance reduced incremental gradient algorithm. The proposed method is easy to implement and significantly outperforms the state of the art on several nonsmooth, large-scale problems. We prove that our method achieves a theoretical linear speedup with respect to the sequential version under assumptions on the sparsity of gradients and block-separability of the proximal term. Empirical benchmarks on a multi-core architecture illustrate practical speedups of up to 12x on a 20-core machine. Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS 2017), 28 pages

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Minimizing Finite Sums with the Stochastic Average Gradient

Authors: Francis Bach, Nicolas Le Roux, Mark Schmidt
Publication date: 10/05/2016

We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p < 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies. Comment: Revision from January 2015 submission. Major changes: updated literature follow and discussion of subsequent work, additional Lemma showing the validity of one of the formulas, somewhat simplified presentation of Lyapunov bound, included code needed for checking proofs rather than the polynomials generated by the code, added error regions to the numerical experiments
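The "memory of previous gradient values" can be illustrated with a minimal sketch of a SAG-style loop on a least-squares sum. This is an illustration only: the least-squares objective, the function name `sag_least_squares`, the zero initialization of the memory, and the uniform sampling are assumptions made here, not details taken from the abstract.

```python
import random

def sag_least_squares(A, b, step, n_iters, seed=0):
    """Basic SAG loop for minimizing (1/n) * sum_i 0.5 * (a_i . x - b_i)^2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    # Most recent gradient seen for each term (assumed to start at zero).
    memory = [[0.0] * d for _ in range(n)]
    # Running sum of the stored gradients, maintained in O(d) per iteration.
    grad_sum = [0.0] * d
    for _ in range(n_iters):
        i = rng.randrange(n)  # uniform sampling of one term
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        new_grad = [residual * a for a in A[i]]
        for j in range(d):
            # Swap the stale gradient of term i for the fresh one,
            # then step along the average of all stored gradients.
            grad_sum[j] += new_grad[j] - memory[i][j]
            x[j] -= step * grad_sum[j] / n
        memory[i] = new_grad
    return x

# Hypothetical data: a consistent system whose solution is (2, -1).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [2.0, -1.0, 1.0, 3.0]
print(sag_least_squares(A, b, step=1.0 / 32.0, n_iters=5000))
```

Each iteration evaluates a single gradient, as in SG, but the step uses the average of all n stored gradients. The price is the O(n d) memory table, which for linear models can be reduced to one scalar per term.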

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

A cooperative conjugate gradient method for linear systems permitting multithread implementation of low complexity

Authors: Amit Bhaya, Pierre-Alexandre Bliman, Guilherme Niedu, Fernando Pazos
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 30/03/2012

This paper proposes a generalization of the conjugate gradient (CG) method used to solve the equation Ax = b for a symmetric positive definite matrix A of large size n. The generalization consists of permitting the scalar control parameters (= stepsizes in gradient and conjugate gradient directions) to be replaced by matrices, so that multiple descent and conjugate directions are updated simultaneously. Implementation involves the use of multiple agents or threads and is referred to as cooperative CG (cCG), in which the cooperation between agents resides in the fact that the calculation of each entry of the control parameter matrix now involves information that comes from the other agents. For a sufficiently large dimension n, the use of an optimal number of cores gives the result that the multithread implementation has worst case complexity O(n^{2+1/3}) in exact arithmetic. Numerical experiments, that illustrate the interest of theoretical results, are carried out on a multicore computer. Comment: Expanded version of manuscript submitted to the IEEE-CDC 2012 (Conference on Decision and Control)
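For orientation, the sketch below shows the standard scalar CG iteration that the abstract takes as its starting point. It is not the cooperative variant: the two scalars `alpha` and `beta` are the "control parameters" that cCG is described as replacing with matrices. The function name and the small test system are hypothetical.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iters=None):
    """Standard CG for A x = b with A symmetric positive definite."""
    n = len(b)
    if max_iters is None:
        max_iters = n  # n steps suffice in exact arithmetic

    def matvec(M, v):
        return [sum(m * vj for m, vj in zip(row, v)) for row in M]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)   # residual b - A x for the starting point x = 0
    p = list(r)   # first search direction is the residual
    rs = dot(r, r)
    for _ in range(max_iters):
        if rs ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)          # scalar stepsize along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs               # scalar weight for the new direction
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical 2x2 symmetric positive definite system.
print(conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]))
```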

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

Authors: Lin Xiao, Tong Zhang
Publication date: 01/01/2014

We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, where they are known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multi-stage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has a cost similar to that of the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than that of both the proximal full gradient method and the standard proximal stochastic gradient method.
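The multi-stage scheme can be sketched as follows. Each stage computes one full gradient at a snapshot point, then runs cheap inner iterations that combine a variance-reduced gradient with a proximal step. This is a minimal illustration under assumptions made here: a least-squares smooth part, an L1 regularizer whose proximal mapping is soft-thresholding, and the average of the inner iterates as the next snapshot. The function names, data, and parameter values are hypothetical.

```python
import math
import random

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vj) - t, 0.0), vj) for vj in v]

def prox_svrg_lasso(A, b, lam, step, n_stages, inner_iters, seed=0):
    """Minimize (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 + lam * ||x||_1."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])

    def grad_i(i, x):
        residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        return [residual * a for a in A[i]]

    snapshot = [0.0] * d
    for _ in range(n_stages):
        # One full gradient of the smooth part per stage.
        full = [0.0] * d
        for i in range(n):
            g = grad_i(i, snapshot)
            for j in range(d):
                full[j] += g[j] / n
        x = list(snapshot)
        avg = [0.0] * d
        for _ in range(inner_iters):
            i = rng.randrange(n)
            gx, gs = grad_i(i, x), grad_i(i, snapshot)
            # Variance-reduced gradient, followed by the proximal step.
            v = [gx[j] - gs[j] + full[j] for j in range(d)]
            x = soft_threshold([x[j] - step * v[j] for j in range(d)], step * lam)
            for j in range(d):
                avg[j] += x[j] / inner_iters
        snapshot = avg  # next stage restarts from the averaged iterate
    return snapshot

# Hypothetical data and parameters.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [2.0, -1.0, 1.0, 3.0]
print(prox_svrg_lasso(A, b, lam=0.3, step=0.05, n_stages=40, inner_iters=300))
```

The variance of `v` shrinks as both `x` and the snapshot approach the minimizer, which is what allows a constant stepsize instead of the decaying one used by the standard proximal stochastic gradient method.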

arXiv.org e-Print Archive

CiteSeerX