An Analysis of Asynchronous Stochastic Accelerated Coordinate Descent
Gradient descent, and coordinate descent in particular, are core tools in
machine learning and elsewhere. Large problem instances are common. To help
solve them, two orthogonal approaches are known: acceleration and parallelism.
In this work, we ask whether they can be used simultaneously. The answer is
"yes".
More specifically, we consider an asynchronous parallel version of the
accelerated coordinate descent algorithm proposed and analyzed by Lin, Liu and
Xiao (SIOPT'15). We give an analysis based on the efficient implementation of
this algorithm. The only constraint is a standard bounded asynchrony
assumption, namely that each update can overlap with at most q others. (q is at
most the number of processors times the ratio of the lengths of the longest and
shortest updates.) We obtain the following three results:
1. A linear speedup for strongly convex functions so long as q is not too
large.
2. A substantial, albeit sublinear, speedup for strongly convex functions for
larger q.
3. A substantial, albeit sublinear, speedup for convex functions.
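For reference, the serial accelerated coordinate descent step that such schemes
parallelize can be sketched in a few lines. The sketch below is an
APPROX/APCG-style scheme with uniform sampling on a random quadratic; the test
problem and parameter schedule are illustrative assumptions, not the exact
algorithm of Lin, Liu and Xiao or its asynchronous variant.

    import numpy as np

    # Accelerated randomized coordinate descent (APPROX/APCG-style,
    # uniform sampling) on f(x) = 0.5*x'Ax - b'x. Unoptimized sketch:
    # an efficient implementation avoids these full-vector operations.
    rng = np.random.default_rng(0)
    n = 100
    M = rng.standard_normal((n, n))
    A = M.T @ M + np.eye(n)        # positive definite Hessian
    b = rng.standard_normal(n)
    L = np.diag(A).copy()          # coordinate-wise Lipschitz constants

    x = np.zeros(n)
    z = np.zeros(n)
    theta = 1.0 / n
    for k in range(20000):
        y = (1 - theta) * x + theta * z
        i = rng.integers(n)                      # uniform coordinate
        g_i = A[i] @ y - b[i]                    # i-th partial derivative
        z_new = z.copy()
        z_new[i] -= g_i / (n * theta * L[i])
        x = y + n * theta * (z_new - z)          # momentum correction
        z = z_new
        theta = 0.5 * theta * (np.sqrt(theta**2 + 4) - theta)

    x_star = np.linalg.solve(A, b)
    f = lambda v: 0.5 * v @ A @ v - b @ v
    print("suboptimality:", f(x) - f(x_star))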
Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
We describe an asynchronous parallel stochastic proximal coordinate descent
algorithm for minimizing a composite objective function, which consists of a
smooth convex function plus a separable convex function. In contrast to
previous analyses, our model of asynchronous computation accounts for the fact
that components of the unknown vector may be written by some cores
simultaneously with being read by others. Despite the complications arising
from this possibility, the method achieves a linear convergence rate on
functions that satisfy an optimal strong convexity property and a sublinear
rate (1/K) on general convex functions. Near-linear speedup on a multicore
system can be expected if the number of processors is O(n^{1/2}), where n is
the number of variables. We describe results from implementation on ten cores
of a multicore processor.
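For illustration, a single proximal coordinate step of the kind such a method
applies, assuming an l1 regularizer as the separable term so that the
coordinate-wise prox is soft-thresholding. This is a serial sketch; the
algorithm in the paper executes such updates asynchronously, with reads of x
that may be inconsistent.

    import numpy as np

    def soft_threshold(v, t):
        # prox of t*|.| evaluated at v
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def prox_coord_step(x, i, grad_i, L_i, lam):
        # gradient step on coordinate i, then coordinate-wise prox
        x = x.copy()
        x[i] = soft_threshold(x[i] - grad_i / L_i, lam / L_i)
        return x

    # toy usage on the smooth part f(x) = 0.5*||Qx - c||^2
    rng = np.random.default_rng(1)
    Q = rng.standard_normal((30, 10))
    c = rng.standard_normal(30)
    L = np.sum(Q**2, axis=0)       # coordinate Lipschitz constants of f
    x = np.zeros(10)
    for k in range(2000):
        i = rng.integers(10)
        grad_i = Q[:, i] @ (Q @ x - c)
        x = prox_coord_step(x, i, grad_i, L[i], lam=0.1)
    print("nonzeros in solution:", np.count_nonzero(x))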
A2BCD: An Asynchronous Accelerated Block Coordinate Descent Algorithm With Optimal Complexity
In this paper, we propose the Asynchronous Accelerated Nonuniform Randomized
Block Coordinate Descent algorithm (A2BCD), the first asynchronous
Nesterov-accelerated algorithm that achieves optimal complexity. This parallel
algorithm solves the unconstrained convex minimization problem, using p
computing nodes which compute updates to shared solution vectors, in an
asynchronous fashion with no central coordination. Nodes in asynchronous
algorithms do not wait for updates from other nodes before starting a new
iteration, but simply compute updates using the most recent solution
information available. This allows them to complete iterations much faster than
traditional ones, especially at scale, by eliminating the costly
synchronization penalty of traditional algorithms.
We first prove that A2BCD converges linearly to a solution with a fast
accelerated rate that matches the recently proposed NU_ACDM, so long as the
maximum delay is not too large. Somewhat surprisingly, A2BCD pays no complexity
penalty for using outdated information. We then prove lower complexity bounds
for randomized coordinate descent methods, which show that A2BCD (and hence
NU_ACDM) has optimal complexity to within a constant factor. We confirm with
numerical experiments that A2BCD outperforms NU_ACDM, which is the current
fastest coordinate descent algorithm, even at small scale. We also derive and
analyze a second-order ordinary differential equation, which is the
continuous-time limit of our algorithm, and prove it converges linearly to a
solution with a similar accelerated rate.
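The continuous-time view can be illustrated generically: damped second-order
dynamics of the form x'' + a*x' + grad f(x) = 0 converge linearly to the
minimizer of a strongly convex f for suitable damping. The particular ODE,
damping coefficient, and test problem below are illustrative assumptions, not
the equation derived in the paper.

    import numpy as np

    # Integrate x'' + a*x' + grad f(x) = 0 for the strongly convex
    # quadratic f(x) = 0.5*x'Ax (minimizer 0) with semi-implicit Euler.
    rng = np.random.default_rng(2)
    n = 20
    M = rng.standard_normal((n, n))
    A = M.T @ M / n + np.eye(n)    # Hessian of f
    mu = np.linalg.eigvalsh(A).min()
    a = 2.0 * np.sqrt(mu)          # damping; illustrative choice
    x = rng.standard_normal(n)     # position
    v = np.zeros(n)                # velocity x'
    dt = 1e-2
    for step in range(20000):
        v += dt * (-a * v - A @ x)
        x += dt * v
    print("distance to minimizer:", np.linalg.norm(x))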
A Class of Parallel Doubly Stochastic Algorithms for Large-Scale Learning
We consider learning problems over training sets in which both the number of
training examples and the dimension of the feature vectors are large. To solve
these problems we propose the random parallel stochastic algorithm (RAPSA). We
call the algorithm random parallel because it utilizes multiple parallel
processors to operate on a randomly chosen subset of blocks of the feature
vector. We call the algorithm stochastic because processors choose training
subsets uniformly at random. Algorithms that are parallel in either of these
dimensions exist, but RAPSA is the first attempt at a methodology that is
parallel in both the selection of blocks and the selection of elements of the
training set. In RAPSA, processors utilize the randomly chosen functions to
compute the stochastic gradient component associated with a randomly chosen
block. The technical contribution of this paper is to show that this minimally
coordinated algorithm converges to the optimal classifier when the training
objective is convex. Moreover, we present an accelerated version of RAPSA
(ARAPSA) that incorporates the objective function curvature information by
premultiplying the descent direction by a Hessian approximation matrix. We
further extend the results to asynchronous settings and show that, if the
processors perform their updates without any coordination, the algorithms
still converge to the optimal argument. RAPSA and its extensions are then
numerically evaluated on a linear estimation problem and a binary image
classification task using the MNIST handwritten digit dataset.
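A minimal sketch of the doubly stochastic pattern: each step draws both a
random block of coordinates and a random minibatch of training examples, then
applies a stochastic gradient step to that block only. The least-squares
problem, block size, and step size are illustrative assumptions, and the
sketch is serial, standing in for what each processor does in parallel.

    import numpy as np

    rng = np.random.default_rng(3)
    m, d, B = 1000, 50, 10                 # examples, features, block size
    X = rng.standard_normal((m, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.01 * rng.standard_normal(m)

    w = np.zeros(d)
    blocks = np.arange(d).reshape(-1, B)   # fixed partition into blocks
    eta = 1e-3
    for k in range(20000):
        blk = blocks[rng.integers(len(blocks))]     # random block
        batch = rng.integers(m, size=32)            # random examples
        resid = X[batch] @ w - y[batch]
        grad_blk = X[batch][:, blk].T @ resid / 32  # block stochastic grad
        w[blk] -= eta * grad_blk
    print("estimation error:", np.linalg.norm(w - w_true))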
Coordinate Descent Algorithms
Coordinate descent algorithms solve optimization problems by successively
performing approximate minimization along coordinate directions or coordinate
hyperplanes. They have been used in applications for many years, and their
popularity continues to grow because of their usefulness in data analysis,
machine learning, and other areas of current interest. This paper describes the
fundamentals of the coordinate descent approach, together with variants and
extensions and their convergence properties, mostly with reference to convex
objectives. We pay particular attention to a certain problem structure that
arises frequently in machine learning applications, showing that efficient
implementations of accelerated coordinate descent algorithms are possible for
problems of this type. We also present some parallel variants and discuss their
convergence properties under several models of parallel execution.
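The fundamental step is easy to state concretely. A minimal sketch of
randomized coordinate descent with exact minimization along each chosen
coordinate, on a convex quadratic (an illustrative problem choice):

    import numpy as np

    # For f(x) = 0.5*x'Ax - b'x, minimizing over coordinate i with the
    # others fixed has the closed form below (randomized Gauss-Seidel).
    rng = np.random.default_rng(4)
    n = 50
    M = rng.standard_normal((n, n))
    A = M.T @ M + np.eye(n)
    b = rng.standard_normal(n)

    x = np.zeros(n)
    for k in range(5000):
        i = rng.integers(n)                  # randomized coordinate choice
        x[i] += (b[i] - A[i] @ x) / A[i, i]  # exact coordinate minimizer
    print("residual:", np.linalg.norm(A @ x - b))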
An Asynchronous Parallel Stochastic Coordinate Descent Algorithm
We describe an asynchronous parallel stochastic coordinate descent algorithm
for minimizing smooth unconstrained or separably constrained functions. The
method achieves a linear convergence rate on functions that satisfy an
essential strong convexity property and a sublinear rate (1/K) on general
convex functions. Near-linear speedup on a multicore system can be expected if
the number of processors is O(n^{1/2}) in unconstrained optimization and
O(n^{1/4}) in the separable-constrained case, where n is the number of
variables. We describe results from implementation on 40-core processors.
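A minimal sketch of the lock-free access pattern such methods rely on: several
threads update a shared vector with no synchronization, each reading entries
that may be stale. This is illustrative only; Python's GIL serializes the
bytecode, so it demonstrates the access pattern rather than true multicore
speedup, and the coordinate step rule is an assumption.

    import threading
    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    M = rng.standard_normal((n, n))
    A = M.T @ M + np.eye(n)
    b = rng.standard_normal(n)
    x = np.zeros(n)                  # shared iterate, no locking

    def worker(seed, iters=20000):
        r = np.random.default_rng(seed)
        for _ in range(iters):
            i = r.integers(n)
            g = A[i] @ x - b[i]      # read of x may be inconsistent
            x[i] -= g / A[i, i]      # in-place coordinate write

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("residual:", np.linalg.norm(A @ x - b))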
Markov Chain Block Coordinate Descent
The method of block coordinate gradient descent (BCD) has been a powerful
method for large-scale optimization. This paper considers the BCD method that
successively updates a series of blocks selected according to a Markov chain.
This kind of block selection is neither i.i.d. random nor cyclic. On the other
hand, it is a natural choice for some applications in distributed optimization
and Markov decision processes, where i.i.d. random and cyclic selections are
either infeasible or very expensive. By applying mixing-time properties of a
Markov chain, we prove convergence of Markov chain BCD for minimizing Lipschitz
differentiable functions, which can be nonconvex. When the functions are convex
and strongly convex, we establish both sublinear and linear convergence rates,
respectively. We also present a method of Markov chain inertial BCD. Finally,
we discuss potential applications.
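A minimal sketch of Markov-chain block selection: blocks are visited by a lazy
random walk on a ring, so consecutive selections are strongly dependent,
neither i.i.d. nor cyclic. The quadratic objective, ring topology, and step
size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(6)
    n, B = 60, 6                     # variables, block size
    nblocks = n // B
    M = rng.standard_normal((n, n))
    A = M.T @ M + np.eye(n)
    b = rng.standard_normal(n)
    L = np.linalg.eigvalsh(A).max()  # global Lipschitz constant

    x = np.zeros(n)
    blk = 0                          # current state of the chain
    for k in range(30000):
        idx = np.arange(blk * B, (blk + 1) * B)
        g = A[idx] @ x - b[idx]      # gradient of the selected block
        x[idx] -= g / L              # block gradient step
        blk = (blk + rng.choice([-1, 0, 1])) % nblocks  # lazy walk
    print("residual:", np.linalg.norm(A @ x - b))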
Distributed Asynchronous Dual Free Stochastic Dual Coordinate Ascent
Primal-dual distributed optimization methods have broad applications in
large-scale machine learning. Previous primal-dual distributed methods are not
applicable when the dual formulation is unavailable, e.g., for sums of
non-convex objectives. Moreover, these algorithms and their theoretical
analyses rest on the fundamental assumption that the computing speeds of the
machines in a cluster are similar. However, the straggler problem is an
unavoidable practical issue in distributed systems because of the presence of
slow machines, so the total computation time of distributed optimization
methods depends heavily on the slowest machine. In
this paper, we address these two issues by proposing a distributed
asynchronous dual-free stochastic dual coordinate ascent algorithm. Our method
does not need the dual formulation of the target problem during optimization.
We tackle the straggler problem through asynchronous communication, which
significantly alleviates the negative effect of slow machines. We also analyze
the convergence of our method and prove a linear convergence rate even if the
individual functions in the objective are non-convex. Experiments on both
convex and non-convex loss functions validate our claims.
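A serial sketch in the style of dual-free SDCA (Shalev-Shwartz), the family of
updates such methods build on: a pseudo-dual vector per example is maintained
alongside the primal iterate, preserving w = (1/(lam*n)) * sum_i alpha_i, and
both are updated from the same residual, so no dual formulation is required.
The logistic-loss problem and step size are illustrative assumptions; the
paper's algorithm runs updates of this flavor asynchronously across machines.

    import numpy as np

    rng = np.random.default_rng(7)
    n, d, lam = 500, 20, 0.1
    X = rng.standard_normal((n, d))
    y = np.sign(rng.standard_normal(n))

    def grad_phi(i, w):
        # gradient of the logistic loss phi_i(w) = log(1 + exp(-y_i x_i'w))
        z = -y[i] * (X[i] @ w)
        return -y[i] * X[i] / (1.0 + np.exp(-z))

    alpha = np.zeros((n, d))         # pseudo-dual vectors, one per example
    w = np.zeros(d)                  # invariant: w = sum(alpha)/(lam*n)
    eta = 0.1 / n
    for t in range(50000):
        i = rng.integers(n)
        resid = grad_phi(i, w) + alpha[i]
        alpha[i] -= eta * lam * n * resid
        w -= eta * resid             # keeps the invariant exactly
    full_grad = lam * w + np.mean([grad_phi(i, w) for i in range(n)], axis=0)
    print("gradient norm:", np.linalg.norm(full_grad))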
TMAC: A Toolbox of Modern Async-Parallel, Coordinate, Splitting, and Stochastic Methods
TMAC is a toolbox written in C++11 that implements algorithms based on a set
of modern methods for large-scale optimization. It covers a variety of
optimization problems, which can be both smooth and nonsmooth, convex and
nonconvex, as well as constrained and unconstrained. The algorithms implemented
in TMAC, such as coordinate update and operator splitting methods, are
scalable because they decompose a problem into simple subproblems. These
algorithms can run in a multi-threaded fashion, either synchronously or
asynchronously, to take advantage of all available cores. The TMAC
architecture mimics how a scientist writes down an optimization algorithm.
Therefore, it is easy to obtain a new algorithm by making simple
modifications, such as adding a new operator or a new splitting, while
maintaining the multicore parallelism and other features. The package is
available at https://github.com/uclaopt/TMAC
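The operator-as-building-block idea can be illustrated in a few lines, here in
Python rather than TMAC's C++11, with hypothetical names that do not reflect
TMAC's actual classes: an algorithm is assembled by composing small operator
objects, so swapping in a new operator yields a new method.

    import numpy as np

    class GradStep:                  # forward operator: x_i - eta*grad_i f(x)
        def __init__(self, A, b, eta):
            self.A, self.b, self.eta = A, b, eta
        def apply(self, x, i):       # acts on coordinate i only
            return x[i] - self.eta * (self.A[i] @ x - self.b[i])

    class Shrink:                    # backward operator: prox of lam*|.|
        def __init__(self, lam, eta):
            self.t = lam * eta
        def apply(self, v):
            return np.sign(v) * max(abs(v) - self.t, 0.0)

    # coordinate-wise forward-backward splitting from the two operators
    rng = np.random.default_rng(9)
    n = 40
    M = rng.standard_normal((n, n))
    A = M.T @ M + np.eye(n)
    b = rng.standard_normal(n)
    eta = 1.0 / np.linalg.eigvalsh(A).max()
    fwd, bwd = GradStep(A, b, eta), Shrink(0.05, eta)
    x = np.zeros(n)
    for k in range(20000):
        i = rng.integers(n)
        x[i] = bwd.apply(fwd.apply(x, i))   # compose the operators
    print("support size:", np.count_nonzero(x))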
DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization
Machine learning with big data often involves large optimization models. For
distributed optimization over a cluster of machines, frequent communication and
synchronization of all model parameters (optimization variables) can be very
costly. A promising solution is to use parameter servers to store different
subsets of the model parameters, and update them asynchronously at different
machines using local datasets. In this paper, we focus on distributed
optimization of large linear models with convex loss functions, and propose a
family of randomized primal-dual block coordinate algorithms that are
especially suitable for asynchronous distributed implementation with parameter
servers. In particular, we work with the saddle-point formulation of such
problems which allows simultaneous data and model partitioning, and exploit its
structure by doubly stochastic coordinate optimization with variance reduction
(DSCOVR). Compared with other first-order distributed algorithms, we show that
DSCOVR may require less overall computation and communication, and less or no
synchronization. We discuss the implementation details of the DSCOVR
algorithms, and present numerical experiments on an industrial distributed
computing system.
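A structure-only sketch of the doubly stochastic pattern: each step samples a
block of data indices (dual rows) and a block of model coordinates (primal
columns) and touches only the corresponding sub-block of the data matrix,
which is what enables simultaneous data and model partitioning. DSCOVR's
variance reduction is omitted, and the saddle-point problem below (equivalent
to ridge regression) and the step sizes are illustrative assumptions.

    import numpy as np

    # min_w max_a (1/n)*a'(Xw - y) - (1/(2n))*||a||^2 + (lam/2)*||w||^2
    rng = np.random.default_rng(8)
    n, d, K = 400, 40, 8             # examples, features, block size
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    lam = 0.1
    a = np.zeros(n)                  # dual variables, one per example
    w = np.zeros(d)                  # primal model
    eta_a, eta_w = 0.2, 0.01
    for t in range(60000):
        I = rng.choice(n, size=K, replace=False)   # data block
        J = rng.choice(d, size=K, replace=False)   # model block
        X_IJ = X[np.ix_(I, J)]
        # unbiased estimate of X[I] @ w from the sampled columns J
        a[I] += eta_a * ((d / K) * (X_IJ @ w[J]) - y[I] - a[I])
        # partial primal gradient from the sampled rows I
        w[J] -= eta_w * (X_IJ.T @ a[I] / K + lam * w[J])
    print("primal gradient norm:",
          np.linalg.norm(X.T @ (X @ w - y) / n + lam * w))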