An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums
Modern large-scale finite-sum optimization relies on two key aspects:
distribution and stochastic updates. For smooth and strongly convex problems,
existing decentralized algorithms are slower than modern accelerated
variance-reduced stochastic algorithms when run on a single machine, and are
therefore not efficient. Centralized algorithms are fast, but their scaling is
limited by global aggregation steps that result in communication bottlenecks.
In this work, we propose an efficient Accelerated
Decentralized stochastic algorithm for Finite Sums
named ADFS, which uses local stochastic proximal updates and randomized
pairwise communications between nodes. On n machines, ADFS learns from nm
samples in the same time it takes optimal algorithms to learn from m samples
on one machine. This scaling holds until a critical network size is reached,
which depends on communication delays, on the number of samples m, and on the
network topology. We provide a theoretical analysis based on a novel augmented
graph approach combined with a precise evaluation of synchronization times and
an extension of the accelerated proximal coordinate gradient algorithm to
arbitrary sampling. We illustrate the improvement of ADFS over state-of-the-art
decentralized approaches with experiments.
Comment: Code available in source files. arXiv admin note: substantial text
overlap with arXiv:1901.0986
Stochastic Variance Reduction Methods for Saddle-Point Problems
We consider convex-concave saddle-point problems where the objective
functions may be split in many components, and extend recent stochastic
variance reduction methods (such as SVRG or SAGA) to provide the first
large-scale linearly convergent algorithms for this class of problems which is
common in machine learning. While the algorithmic extension is straightforward,
it comes with challenges and opportunities: (a) the convex minimization
analysis does not apply and we use the notion of monotone operators to prove
convergence, showing in particular that the same algorithm applies to a larger
class of problems, such as variational inequalities, (b) there are two notions
of splits, in terms of functions, or in terms of partial derivatives, (c) the
split does need to be done with convex-concave terms, (d) non-uniform sampling
is key to an efficient algorithm, both in theory and practice, and (e) these
incremental algorithms can be easily accelerated using a simple extension of
the "catalyst" framework, leading to an algorithm which is always superior to
accelerated batch algorithms.
Comment: Neural Information Processing Systems (NIPS), 2016, Barcelona, Spain
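To make the variance-reduction idea in this abstract concrete, here is a minimal sketch of an SVRG-style update applied to the monotone operator of a toy saddle-point problem. It is not the authors' algorithm: it uses a plain forward step with uniform sampling (the abstract stresses non-uniform sampling), and the toy objective K(x, y) = (lam/2)|x|^2 + (1/n) sum_i y^T A_i x - (lam/2)|y|^2, the function name `svrg_saddle`, and all parameter values are assumptions chosen for illustration. The saddle point of this toy problem is (0, 0).

```python
import numpy as np

def svrg_saddle(As, lam=1.0, step=0.05, n_epochs=30, seed=0):
    """SVRG-style iteration on a toy bilinear saddle-point problem (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d, _ = As.shape              # As holds n component matrices of shape (d, d)
    x, y = np.ones(d), np.ones(d)   # arbitrary starting point
    A_mean = As.mean(axis=0)

    def op(A, x, y):
        # Monotone operator (grad_x K, -grad_y K) for one component matrix A.
        return lam * x + A.T @ y, lam * y - A @ x

    for _ in range(n_epochs):
        # Snapshot point and full operator evaluated at the snapshot.
        xs, ys = x.copy(), y.copy()
        gx_full, gy_full = op(A_mean, xs, ys)
        for _ in range(n):
            i = rng.integers(n)     # uniform sampling (a simplification)
            gx_i, gy_i = op(As[i], x, y)
            gx_s, gy_s = op(As[i], xs, ys)
            # Variance-reduced estimate: component at the current point, minus
            # the same component at the snapshot, plus the full snapshot operator.
            x = x - step * (gx_i - gx_s + gx_full)
            y = y - step * (gy_i - gy_s + gy_full)
    return x, y

# Example run on small random component matrices (hypothetical data).
As = 0.3 * np.random.default_rng(1).standard_normal((20, 4, 4))
x_hat, y_hat = svrg_saddle(As)
```

The convergence argument for such a scheme goes through the strong monotonicity of the operator rather than through convex minimization, which is the point made in item (a) of the abstract.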
Herding as a Learning System with Edge-of-Chaos Dynamics
Herding defines a deterministic dynamical system at the edge of chaos. It
generates a sequence of model states and parameters by alternating parameter
perturbations with state maximizations, where the sequence of states can be
interpreted as "samples" from an associated MRF model. Herding differs from
maximum likelihood estimation in that the sequence of parameters does not
converge to a fixed point and differs from an MCMC posterior sampling approach
in that the sequence of states is generated deterministically. Herding may be
interpreted as a "perturb and map" method where the parameter perturbations are
generated using a deterministic nonlinear dynamical system rather than randomly
from a Gumbel distribution. This chapter studies the distinct statistical
characteristics of the herding algorithm and shows that the fast convergence
rate of the controlled moments may be attributed to edge of chaos dynamics. The
herding algorithm can also be generalized to models with latent variables and
to a discriminative learning setting. The perceptron cycling theorem ensures
that the fast moment matching property is preserved in the more general
framework.
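The alternation of state maximization and parameter perturbation described in this abstract can be sketched on a very small example. The sketch below assumes a single discrete variable with one-hot features, so the target moments are just a probability vector; the function name `herd` and this setup are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def herd(target_moments, features, n_steps):
    """Deterministic herding over a finite state set.

    features[s] is the feature vector of state s; target_moments is the
    vector of moments the generated states should match on average.
    """
    target_moments = np.asarray(target_moments, dtype=float)
    w = np.zeros_like(target_moments)      # parameters start at zero (an assumption)
    states = []
    for _ in range(n_steps):
        s = int(np.argmax(features @ w))   # state maximization
        w += target_moments - features[s]  # parameter perturbation
        states.append(s)
    return np.array(states), w

# Three states with one-hot features: the target moments are probabilities.
p = np.array([0.5, 0.3, 0.2])
states, w = herd(p, np.eye(3), n_steps=1000)
freq = np.bincount(states, minlength=3) / len(states)
```

Because the parameters start at zero, after T steps they equal T times the gap between the target moments and the empirical moments of the states. As long as the parameters stay bounded, that gap shrinks like 1/T, which is the fast moment matching the abstract refers to, and the sequence contains no randomness.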
Minimizing Finite Sums with the Stochastic Average Gradient
We propose the stochastic average gradient (SAG) method for optimizing the
sum of a finite number of smooth convex functions. Like stochastic gradient
(SG) methods, the SAG method's iteration cost is independent of the number of
terms in the sum. However, by incorporating a memory of previous gradient
values the SAG method achieves a faster convergence rate than black-box SG
methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in
general, and when the sum is strongly-convex the convergence rate is improved
from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for
p < 1. Further, in many cases the convergence rate of the new method
is also faster than black-box deterministic gradient methods, in terms of the
number of gradient evaluations. Numerical experiments indicate that the new
algorithm often dramatically outperforms existing SG and deterministic gradient
methods, and that the performance may be further improved through the use of
non-uniform sampling strategies.
Comment: Revision from January 2015 submission. Major changes: updated
literature follow and discussion of subsequent work, additional Lemma showing
the validity of one of the formulas, somewhat simplified presentation of
Lyapunov bound, included code needed for checking proofs rather than the
polynomials generated by the code, added error regions to the numerical
experiment
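The gradient-memory idea in this abstract can be illustrated with a short sketch on a least-squares finite sum, where f_i(x) = 0.5 * (a_i^T x - b_i)^2. The function name `sag_least_squares`, the uniform sampling, and the conservative step size 1/(16 L) with L the largest squared row norm are assumptions made for this example, not a reproduction of the authors' released code.

```python
import numpy as np

def sag_least_squares(A, b, n_iters=20000, seed=0):
    """Stochastic average gradient sketch for (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.max(np.sum(A**2, axis=1))   # largest Lipschitz constant among the grad f_i
    step = 1.0 / (16.0 * L)            # conservative step size (an assumption)
    x = np.zeros(d)
    grad_memory = np.zeros((n, d))     # last gradient seen for each term
    grad_sum = np.zeros(d)             # running sum of the stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)            # one term per iteration, as in SG methods
        g_new = (A[i] @ x - b[i]) * A[i]
        grad_sum += g_new - grad_memory[i]   # refresh term i inside the sum
        grad_memory[i] = g_new
        x -= step * grad_sum / n       # step along the average of stored gradients
    return x

# Example on a small noiseless system (hypothetical data).
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
x_hat = sag_least_squares(A, A @ x_true)
```

Each iteration evaluates a single gradient, so the cost per step does not grow with the number of terms, while the stored table supplies an estimate of the full gradient. The price is memory proportional to the number of terms times the dimension in this naive form.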