Tracking the gradients using the Hessian: A new look at variance reducing stochastic methods
Our goal is to improve variance reducing stochastic methods through better
control variates. We first propose a modification of SVRG which uses the
Hessian to track gradients over time, rather than to recondition, increasing
the correlation of the control variates and leading to faster theoretical
convergence close to the optimum. We then propose accurate and computationally
efficient approximations to the Hessian, both using a diagonal and a low-rank
matrix. Finally, we demonstrate the effectiveness of our method on a wide range
of problems.
Comment: 17 pages, 2 figures, 1 table
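One way to read the idea of tracking gradients with the Hessian is as a first-order Taylor correction added to the usual SVRG control variate. The sketch below is an illustration under that assumption, not the authors' implementation: it uses least-squares terms f_i(x) = 0.5 (a_i . x - b_i)^2 so that each Hessian is a_i a_i^T, and it forms the mean Hessian-vector product exactly, whereas the abstract proposes diagonal or low-rank approximations. The names `estimate` and `track_with_hessian` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def grad_i(a, b, x):
    """Gradient of the single term f_i(x) = 0.5 * (a . x - b)^2."""
    r = dot(a, x) - b
    return [r * aj for aj in a]

def mean_grad(A, y, x):
    """Full gradient of (1/n) * sum_i f_i at x."""
    n = len(A)
    g = [0.0] * len(x)
    for a, b in zip(A, y):
        g = [gj + gij / n for gj, gij in zip(g, grad_i(a, b, x))]
    return g

def estimate(A, y, x, snap, i, track_with_hessian):
    """Variance-reduced gradient estimate at x built from term i.

    Plain SVRG: g_i(x) - g_i(snap) + mean_grad(snap).
    Tracked (assumed form): additionally subtract H_i d and add mean(H) d,
    where d = x - snap, so the control variate follows x between snapshots.
    """
    snap_mean = mean_grad(A, y, snap)
    gx = grad_i(A[i], y[i], x)
    gs = grad_i(A[i], y[i], snap)
    est = [p - q + m for p, q, m in zip(gx, gs, snap_mean)]
    if track_with_hessian:
        n = len(A)
        d = [xj - sj for xj, sj in zip(x, snap)]
        # For least squares, H_i d = a_i * (a_i . d).
        hi_d = [dot(A[i], d) * aj for aj in A[i]]
        # Exact mean Hessian-vector product, computed in full for clarity only.
        hbar_d = [0.0] * len(x)
        for a in A:
            c = dot(a, d) / n
            hbar_d = [h + c * aj for h, aj in zip(hbar_d, a)]
        est = [e - p + q for e, p, q in zip(est, hi_d, hbar_d)]
    return est
```

On a quadratic objective the corrected estimate coincides with the full gradient for every index i, because the Taylor expansion is exact there; on general objectives the correction would only reduce the variance near the snapshot.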
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous
variants of stochastic gradient descent have become popular methods to solve a
wide range of large-scale optimization problems on multi-core architectures.
Yet, despite their practical success, support for nonsmooth objectives is still
lacking, making them unsuitable for many problems of interest in machine
learning, such as the Lasso, group Lasso or empirical risk minimization with
convex constraints.
In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse
method inspired by SAGA, a variance reduced incremental gradient algorithm. The
proposed method is easy to implement and significantly outperforms the state of
the art on several nonsmooth, large-scale problems. We prove that our method
achieves a theoretical linear speedup with respect to the sequential version
under assumptions on the sparsity of gradients and block-separability of the
proximal term. Empirical benchmarks on a multi-core architecture illustrate
practical speedups of up to 12x on a 20-core machine.
Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS
2017), 28 pages
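For orientation, the sequential, dense proximal SAGA step that ProxASAGA builds on can be sketched as follows. This is not the asynchronous sparse method of the abstract: it runs in a single thread, ignores gradient sparsity, and assumes least-squares terms with an L1 penalty so that the proximal operator is soft thresholding. The function names and the step size used in the usage note are illustrative assumptions.

```python
import random

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied coordinate-wise."""
    return [max(abs(vj) - t, 0.0) * (1.0 if vj > 0 else -1.0) for vj in v]

def grad_i(a, b, x):
    """Gradient of f_i(x) = 0.5 * (a . x - b)^2."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj for aj in a]

def prox_saga(A, y, lam, step, iters, seed=0):
    """Minimize (1/n) * sum_i f_i(x) + lam * ||x||_1 with sequential proximal SAGA."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    # Table of the most recently evaluated gradient of every term.
    memory = [grad_i(a, b, x) for a, b in zip(A, y)]
    mean_mem = [sum(g[j] for g in memory) / n for j in range(d)]
    for _ in range(iters):
        i = rng.randrange(n)
        g = grad_i(A[i], y[i], x)
        # SAGA estimate: fresh gradient minus stored gradient plus table mean.
        v = [gn - go + m for gn, go, m in zip(g, memory[i], mean_mem)]
        x = soft_threshold([xj - step * vj for xj, vj in zip(x, v)], step * lam)
        # Keep the running mean of the table consistent with the replacement.
        mean_mem = [m + (gn - go) / n for m, gn, go in zip(mean_mem, g, memory[i])]
        memory[i] = g
    return x
```

For example, `prox_saga([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.1], lam=0.25, step=0.2, iters=3000)` approaches the Lasso solution of that small problem, whose second coordinate is thresholded to zero.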
Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly-convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p < 1. Further, in many cases the convergence rate of the new method
is also faster than black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.
Comment: Revision from January 2015 submission. Major changes: updated
literature follow and discussion of subsequent work, additional Lemma showing
the validity of one of the formulas, somewhat simplified presentation of
Lyapunov bound, included code needed for checking proofs rather than the
polynomials generated by the code, added error regions to the numerical
experiments
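The "memory of previous gradient values" can be made concrete with a small sketch. It assumes least-squares terms f_i(x) = 0.5 (a_i . x - b_i)^2, uniform sampling, and a constant step size; the function name `sag` and its arguments are illustrative and not taken from the paper's code.

```python
import random

def sag(A, y, step, iters, seed=0):
    """Stochastic average gradient on (1/n) * sum_i 0.5 * (a_i . x - b_i)^2."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    memory = [[0.0] * d for _ in range(n)]  # last gradient seen for each term
    total = [0.0] * d                       # running sum of the memory table
    for _ in range(iters):
        i = rng.randrange(n)
        r = sum(aj * xj for aj, xj in zip(A[i], x)) - y[i]
        g = [r * aj for aj in A[i]]
        # Swap the stored gradient of term i for the fresh one inside the sum.
        total = [t - m + gn for t, m, gn in zip(total, memory[i], g)]
        memory[i] = g
        # Step along the average of the stored gradients; one term is evaluated
        # per iteration, so the cost does not grow with n.
        x = [xj - step * t / n for xj, t in zip(x, total)]
    return x
```

With `A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]` and `y = [1.0, 2.0, 3.0]`, the system is consistent, and `sag(A, y, step=1/32, iters=5000)` approaches its solution (1, 2).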
A cooperative conjugate gradient method for linear systems permitting multithread implementation of low complexity
This paper proposes a generalization of the conjugate gradient (CG) method
used to solve the equation Ax = b for a symmetric positive definite matrix A
of large size n. The generalization consists of permitting the scalar control
parameters (= stepsizes in gradient and conjugate gradient directions) to be
replaced by matrices, so that multiple descent and conjugate directions are
updated simultaneously. Implementation involves the use of multiple agents or
threads and is referred to as cooperative CG (cCG), in which the cooperation
between agents resides in the fact that the calculation of each entry of the
control parameter matrix now involves information that comes from the other
agents. For a sufficiently large dimension n, the use of an optimal number of
cores gives the result that the multithread implementation has worst case
complexity in exact arithmetic. Numerical experiments that illustrate the
theoretical results are carried out on a multicore computer.
Comment: Expanded version of manuscript submitted to the IEEE-CDC 2012
(Conference on Decision and Control)
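As a baseline for the generalization described above, here is a minimal sketch of the classical single-thread CG iteration with scalar step sizes. It is not the cooperative variant; in cCG the scalars `alpha` and `beta` below would be replaced by matrices shared among agents. The helper names are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Solve A x = b for a symmetric positive definite A, given as a list of rows."""
    n = len(b)
    if max_iter is None:
        max_iter = 10 * n
    x = [0.0] * n
    r = list(b)        # residual b - A x for the starting point x = 0
    p = list(r)        # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)              # scalar step along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs                   # scalar conjugation coefficient
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

For `A = [[4.0, 1.0], [1.0, 3.0]]` and `b = [1.0, 2.0]`, the exact solution is (1/11, 7/11), which the iteration reaches in two steps up to rounding.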
A Proximal Stochastic Gradient Method with Progressive Variance Reduction
We consider the problem of minimizing the sum of two convex functions: one is
the average of a large number of smooth component functions, and the other is a
general convex function that admits a simple proximal mapping. We assume the
whole objective function is strongly convex. Such problems often arise in
machine learning, where they are known as regularized empirical risk
minimization. We propose
and analyze a new proximal stochastic gradient method, which uses a multi-stage
scheme to progressively reduce the variance of the stochastic gradient. While
each iteration of this algorithm has a cost similar to that of the classical stochastic
gradient method (or incremental gradient method), we show that the expected
objective value converges to the optimum at a geometric rate. The overall
complexity of this method is much lower than both the proximal full gradient
method and the standard proximal stochastic gradient method.
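The multi-stage scheme can be sketched as an outer loop that recomputes a full gradient at a snapshot point and an inner loop of cheap proximal steps that use a variance-reduced gradient. The sketch assumes least-squares components with an L1 regularizer (so the proximal mapping is soft thresholding) and takes the next snapshot to be the average of the inner iterates; the abstract does not fix these details, and the names are illustrative.

```python
import random

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1, applied coordinate-wise."""
    return [max(abs(vj) - t, 0.0) * (1.0 if vj > 0 else -1.0) for vj in v]

def grad_i(a, b, x):
    """Gradient of f_i(x) = 0.5 * (a . x - b)^2."""
    r = sum(aj * xj for aj, xj in zip(a, x)) - b
    return [r * aj for aj in a]

def prox_svrg(A, y, lam, step, inner, stages, seed=0):
    """Minimize (1/n) * sum_i f_i(x) + lam * ||x||_1 in stages."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    snap = [0.0] * d
    for _stage in range(stages):
        # One full gradient of the smooth part per stage, at the snapshot.
        mu = [0.0] * d
        for a, b in zip(A, y):
            mu = [m + g / n for m, g in zip(mu, grad_i(a, b, snap))]
        x = list(snap)
        avg = [0.0] * d
        for _k in range(inner):
            i = rng.randrange(n)
            gx = grad_i(A[i], y[i], x)
            gs = grad_i(A[i], y[i], snap)
            # Variance-reduced gradient: its variance shrinks as x and the
            # snapshot both approach the optimum.
            v = [p - q + m for p, q, m in zip(gx, gs, mu)]
            x = soft_threshold([xj - step * vj for xj, vj in zip(x, v)], step * lam)
            avg = [s + xj / inner for s, xj in zip(avg, x)]
        snap = avg  # assumed choice: average of the inner iterates
    return snap
```

For example, `prox_svrg([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.1], lam=0.25, step=0.1, inner=200, stages=100)` approaches the minimizer of that small Lasso problem, with the second coordinate driven to zero by the regularizer.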