Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values, the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p < 1. Further, in many cases the convergence rate of the new method is also
faster than that of black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.
Comment: Revision of the January 2015 submission. Major changes: updated
literature review and discussion of subsequent work, an additional lemma
showing the validity of one of the formulas, a somewhat simplified presentation
of the Lyapunov bound, inclusion of the code needed for checking the proofs
rather than the polynomials generated by the code, and added error regions in
the numerical experiments.
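The core of the method the abstract describes is a running table of the most
recently evaluated gradient of each term, so that each iteration touches one
term but steps along an average over all n terms. A minimal sketch in Python,
assuming a user-supplied oracle grad_i(i, x) for the gradient of the i-th term
(the oracle name, step size, and initialization are illustrative, not from the
paper):

```python
import numpy as np

def sag(grad_i, n, x0, step=0.01, epochs=50):
    """Sketch of a SAG-style update: store one gradient per term,
    refresh one stored gradient per iteration, step along the average."""
    x = x0.astype(float)
    table = np.zeros((n, x.size))      # last seen gradient of each term
    total = np.zeros_like(x)           # running sum of the table
    for _ in range(epochs * n):
        i = np.random.randint(n)
        g = grad_i(i, x)               # evaluate one term's gradient only
        total += g - table[i]          # keep the sum current in O(d)
        table[i] = g
        x -= step * total / n          # step along the average of the table
    return x

# Example oracle for least squares with data rows A and targets b:
#   grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]
```

Note that the per-iteration cost is independent of n, as in SG methods; the
averaged memory is what buys the faster rate.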
A Lower Bound for the Optimization of Finite Sums
This paper presents a lower bound for optimizing a finite sum of n functions,
where each function is L-smooth and the sum is μ-strongly convex. We show that
no algorithm can reach an error ε in minimizing all functions from this class
in fewer than Ω(n + √(n(κ-1)) log(1/ε)) iterations, where κ = L/μ is a
surrogate condition number. We then compare this lower bound to upper bounds
for recently developed methods specializing to this setting. When the functions
involved in this sum are not arbitrary, but based on i.i.d. random data, we
further contrast these complexity results with those for optimal first-order
methods that directly optimize the sum. The conclusion we draw is that a lot of
caution is necessary for an accurate comparison, and we identify machine
learning scenarios where the new methods help computationally.
Comment: Added an erratum; we are currently working on extending the result to
randomized algorithms.
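For orientation, the comparison the abstract alludes to can be written out as
follows; the lower bound is the one stated above, and the upper bound is the
standard rate reported for incremental methods such as SAG, SVRG, and SDCA
(recalled from that literature, not quoted from this abstract):

```latex
\[
  \underbrace{\Omega\!\left(n + \sqrt{n(\kappa-1)}\,\log\frac{1}{\epsilon}\right)}
             _{\text{lower bound, } \kappa = L/\mu}
  \;\le\; T(\epsilon) \;\le\;
  \underbrace{O\!\left((n + \kappa)\log\frac{1}{\epsilon}\right)}
             _{\text{SAG/SVRG/SDCA-type upper bounds}}
\]
```

Here T(ε) counts component-gradient evaluations needed to reach error ε.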
Proximal Stochastic Newton-type Gradient Descent Methods for Minimizing Regularized Finite Sums
In this work, we generalize and unify two recent and completely different
works, by Sohl-Dickstein \cite{sohl2014fast} and Lee \cite{lee2012proximal}
respectively, by proposing the \textbf{prox}imal s\textbf{to}chastic
\textbf{N}ewton-type gradient (PROXTONE) method for optimizing the sum of two
convex functions: one is the average of a huge number of smooth convex
functions, and the other is a non-smooth convex function. While a set of
recently proposed proximal stochastic gradient methods, including MISO,
Prox-SDCA, Prox-SVRG, and SAG, converge at linear rates, PROXTONE
incorporates second-order information to obtain stronger convergence results:
it achieves a linear convergence rate not only in the value of the objective
function, but also in the \emph{solution}. The proof is simple and intuitive,
and the results and technique can serve as a starting point for research on
proximal stochastic methods that employ second-order information.
Comment: arXiv admin note: text overlap with arXiv:1309.2388, arXiv:1403.4699
by other authors.
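The composite objective in question is F(x) = (1/n) Σ_i f_i(x) + h(x) with
smooth f_i and non-smooth h. As a baseline sketch of the proximal machinery
involved (this is the generic first-order proximal gradient step, not PROXTONE
itself; the names grad_f and lam are illustrative, and h is taken to be the
ℓ1 norm so the prox has a closed form):

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form prox of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad_step(x, grad_f, step, lam):
    """One proximal gradient step on F(x) = f(x) + lam * ||x||_1:
    gradient step on the smooth part f, prox step on the regularizer."""
    return soft_threshold(x - step * grad_f(x), step * lam)
```

A Newton-type method like the one the abstract describes replaces the scalar
step with a (stochastically maintained) quadratic model of f, at the cost of a
prox subproblem that generally must be solved numerically.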
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite-Sum Structure
Stochastic optimization algorithms with variance reduction have proven
successful for minimizing large finite sums of functions. Unfortunately, these
techniques are unable to deal with stochastic perturbations of input data,
induced for example by data augmentation. In such cases, the objective is no
longer a finite sum, and the main candidate for optimization is the stochastic
gradient descent method (SGD). In this paper, we introduce a variance reduction
approach for these settings when the objective is composite and strongly
convex. The convergence rate outperforms that of SGD, with a typically much
smaller constant factor that depends only on the variance of the gradient
estimates induced by perturbations of a single example.
Comment: Advances in Neural Information Processing Systems (NIPS), Dec 2017,
Long Beach, CA, United States.
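The abstract's setting is an expectation over random perturbations wrapped
around a finite sum. A control-variate sketch in the SAGA style, adapted so
that each stored gradient anchors one example (grad(x, sample) and
perturb(example) are assumed user-supplied; this illustrates the
variance-reduction idea, not the authors' exact algorithm):

```python
import numpy as np

def vr_perturbed(grad, perturb, data, x0, step=0.01, iters=10000):
    """SAGA-style variance reduction with perturbed inputs: the estimate
    g - table[i] + avg is unbiased, and its residual variance comes only
    from the perturbation noise on the sampled example."""
    n = len(data)
    x = x0.astype(float)
    table = np.zeros((n, x.size))       # one stored gradient per example
    avg = table.mean(axis=0)
    for _ in range(iters):
        i = np.random.randint(n)
        g = grad(x, perturb(data[i]))   # gradient on an augmented sample
        x -= step * (g - table[i] + avg)
        avg += (g - table[i]) / n       # keep the table average current
        table[i] = g
    return x
```

Plain SGD would use g alone; subtracting the stored anchor and adding the
table average removes the across-example variance, leaving only the
per-example perturbation noise the abstract refers to.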