172 research outputs found
Distributed Big-Data Optimization via Block-Iterative Convexification and Averaging
In this paper, we study distributed big-data nonconvex optimization in
multi-agent networks. We consider the (constrained) minimization of the sum of
a smooth (possibly) nonconvex function, i.e., the agents' sum-utility, plus a
convex (possibly) nonsmooth regularizer. Our interest is in big-data problems
wherein there is a large number of variables to optimize. If treated by means
of standard distributed optimization algorithms, these large-scale problems may
be intractable, due to the prohibitive local computation and communication
burden at each node. We propose a novel distributed solution method whereby at
each iteration agents optimize and then communicate (in an uncoordinated
fashion) only a subset of their decision variables. To deal with non-convexity
of the cost function, the novel scheme hinges on Successive Convex
Approximation (SCA) techniques coupled with i) a tracking mechanism
instrumental to locally estimate gradient averages; and ii) a novel block-wise
consensus-based protocol to perform local block-averaging operations and
gradient tracking. Asymptotic convergence to stationary solutions of the
nonconvex problem is established. Finally, numerical results show the
effectiveness of the proposed algorithm and highlight how the block dimension
affects the communication overhead and the practical convergence speed.
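The surrogate-minimization idea behind SCA can be illustrated on a single agent and a single variable. The sketch below is not the paper's distributed algorithm; it is a minimal centralized toy I constructed, assuming a proximal-linear surrogate (linearize the nonconvex smooth term around the current iterate, add a quadratic, keep the convex regularizer), which reduces each step to a soft-thresholding operation.

```python
import numpy as np

def sca_step(x, grad_f, prox_g, tau):
    """One SCA step: minimize the convex surrogate
    grad_f(x) * (z - x) + (tau / 2) * (z - x)**2 + g(z),
    whose minimizer is a proximal step on the regularizer g."""
    return prox_g(x - grad_f(x) / tau, 1.0 / tau)

# Toy objective: smooth nonconvex f(x) = x^2 + 0.5*sin(3x), plus g(x) = 0.1*|x|.
grad_f = lambda x: 2.0 * x + 1.5 * np.cos(3.0 * x)
# Prox of g with weight t: soft-thresholding at level 0.1*t.
prox_g = lambda v, t: np.sign(v) * max(abs(v) - 0.1 * t, 0.0)

x = 2.0
for _ in range(200):
    # tau is chosen above the Lipschitz constant of grad_f (here at most 6.5).
    x = sca_step(x, grad_f, prox_g, tau=8.0)
```

Because the surrogate upper-bounds the smooth part, each step decreases the objective, and the iterates settle at a stationary point of the nonconvex problem.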
Coordinate Descent Algorithms
Coordinate descent algorithms solve optimization problems by successively
performing approximate minimization along coordinate directions or coordinate
hyperplanes. They have been used in applications for many years, and their
popularity continues to grow because of their usefulness in data analysis,
machine learning, and other areas of current interest. This paper describes the
fundamentals of the coordinate descent approach, together with variants and
extensions and their convergence properties, mostly with reference to convex
objectives. We pay particular attention to a certain problem structure that
arises frequently in machine learning applications, showing that efficient
implementations of accelerated coordinate descent algorithms are possible for
problems of this type. We also present some parallel variants and discuss their
convergence properties under several models of parallel execution.
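As a concrete instance of the approach surveyed here, the sketch below (a standard textbook example, not code from this paper) implements cyclic coordinate descent with exact per-coordinate minimization on a least-squares objective; maintaining a running residual keeps each coordinate update at O(m) cost.

```python
import numpy as np

def coordinate_descent_lstsq(A, b, n_sweeps=100):
    """Cyclic coordinate descent for min_x 0.5*||Ax - b||^2:
    each step exactly minimizes over one coordinate, holding the rest fixed."""
    m, n = A.shape
    x = np.zeros(n)
    r = b - A @ x                      # running residual
    col_sq = (A ** 2).sum(axis=0)      # per-column squared norms
    for _ in range(n_sweeps):
        for j in range(n):
            step = A[:, j] @ r / col_sq[j]   # exact minimization along coordinate j
            x[j] += step
            r -= step * A[:, j]
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
x_cd = coordinate_descent_lstsq(A, b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)   # reference solution
```

For this well-conditioned problem, a hundred sweeps suffice for the coordinate descent iterate to agree with the least-squares solution to high accuracy.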
Distributed Big-Data Optimization via Block Communications
We study distributed multi-agent large-scale optimization problems, wherein
the cost function is composed of a smooth possibly nonconvex sum-utility plus a
DC (Difference-of-Convex) regularizer. We consider the scenario where the
dimension of the optimization variables is so large that optimizing and/or
transmitting the entire set of variables could cause unaffordable computation
and communication overhead. To address this issue, we propose the first
distributed algorithm whereby agents optimize and communicate only a portion of
their local variables. The scheme hinges on successive convex approximation
(SCA) to handle the nonconvexity of the objective function, coupled with a
novel block-signal tracking scheme, aiming at locally estimating the average of
the agents' gradients. Asymptotic convergence to stationary solutions of the
nonconvex problem is established. Numerical results on a sparse regression
problem show the effectiveness of the proposed algorithm and the impact of the
block size on its practical convergence speed and communication cost.
Markov Chain Block Coordinate Descent
The method of block coordinate gradient descent (BCD) has been a powerful
method for large-scale optimization. This paper considers the BCD method that
successively updates a series of blocks selected according to a Markov chain.
This kind of block selection is neither i.i.d. random nor cyclic. It is,
however, a natural choice for some applications in distributed optimization
and Markov decision processes, where i.i.d. random and cyclic selections are
either infeasible or very expensive. By applying mixing-time properties of a
Markov chain, we prove convergence of Markov chain BCD for minimizing Lipschitz
differentiable functions, which can be nonconvex. When the functions are convex
and strongly convex, we establish both sublinear and linear convergence rates,
respectively. We also present a method of Markov chain inertial BCD. Finally,
we discuss potential applications.
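A minimal serial sketch of the block-selection rule (my own toy construction, not the paper's algorithm): the block to update is chosen by a random walk on a block-connectivity graph, so consecutive selections are dependent, unlike i.i.d. or cyclic rules.

```python
import random
import numpy as np

def markov_chain_bcd(grad, x, neighbors, step=0.1, iters=5000, seed=1):
    """Block coordinate gradient descent in which the updated block follows a
    random walk on a block-connectivity graph (neither i.i.d. nor cyclic)."""
    rng = random.Random(seed)
    j = 0                                # current state of the Markov chain
    for _ in range(iters):
        x[j] -= step * grad(x)[j]        # gradient step on the selected block only
        j = rng.choice(neighbors[j])     # walk to a neighboring block
    return x

# Toy problem: f(x) = 0.5*||x - c||^2 with 4 scalar blocks on a ring graph.
c = np.array([1.0, -2.0, 3.0, 0.5])
grad = lambda x: x - c
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
x = markov_chain_bcd(grad, np.zeros(4), neighbors)
```

Because the walk visits every block infinitely often, each coordinate is contracted toward its optimum despite the dependent selection sequence.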
Incremental Aggregated Proximal and Augmented Lagrangian Algorithms
We consider minimization of the sum of a large number of convex functions,
and we propose an incremental aggregated version of the proximal algorithm,
which bears similarity to the incremental aggregated gradient and subgradient
methods that have received a lot of recent attention. Under cost function
differentiability and strong convexity assumptions, we show linear convergence
for a sufficiently small constant stepsize. This result also applies to
distributed asynchronous variants of the method, involving bounded
interprocessor communication delays.
We then consider dual versions of incremental proximal algorithms, which are
incremental augmented Lagrangian methods for separable equality-constrained
optimization problems. Contrary to the standard augmented Lagrangian method,
these methods admit decomposition in the minimization of the augmented
Lagrangian, and update the multipliers far more frequently. Our incremental
aggregated augmented Lagrangian methods bear similarity to several known
decomposition algorithms, including the alternating direction method of
multipliers (ADMM) and more recent variations. We compare these methods in
terms of their properties, and highlight their potential advantages and
limitations.
We also address the solution of separable inequality-constrained optimization
problems through the use of nonquadratic augmented Lagrangians such as the
exponential, and we dually consider a corresponding incremental aggregated
version of the proximal algorithm that uses nonquadratic regularization, such
as an entropy function. We finally propose a closely related linearly
convergent method for minimization of large differentiable sums subject to an
orthant constraint, which may be viewed as an incremental aggregated version of
the mirror descent method.
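The incremental aggregated gradient (IAG) scheme that this proposal is said to resemble can be sketched in a few lines (a toy instance I constructed, not the paper's proximal or augmented Lagrangian variants): a memory of per-component gradients is kept, one entry is refreshed per step, and the step direction is the aggregate of the stored, possibly outdated, gradients.

```python
import numpy as np

def iag(grads, x, stepsize, n_passes=200):
    """Incremental aggregated gradient: at each step refresh one component's
    stored gradient and move along the sum of all stored gradients."""
    n = len(grads)
    memory = [g(x) for g in grads]       # stored component gradients
    agg = sum(memory)                    # their running aggregate
    for k in range(n_passes * n):
        i = k % n                        # cyclic component selection
        new_g = grads[i](x)
        agg += new_g - memory[i]         # cheap O(1) aggregate refresh
        memory[i] = new_g
        x = x - stepsize * agg
    return x

# Sum of quadratics f_i(x) = 0.5*(x - a_i)^2; the minimizer is mean(a).
a = np.array([1.0, 2.0, 6.0])
grads = [lambda x, ai=ai: x - ai for ai in a]
x_star = iag(grads, 0.0, stepsize=0.05)
```

With a sufficiently small constant stepsize, the staleness of the stored gradients does not prevent linear convergence, matching the flavor of the result stated above for the strongly convex case.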
Asynchronous parallel primal-dual block coordinate update methods for affinely constrained convex programs
Recent years have witnessed a surge of asynchronous (async-) parallel
computing methods, driven by the extremely large data involved in many modern
applications and by the advancement of multi-core machines and computer
clusters. In optimization, most works on async-parallel methods address
unconstrained problems or those with block-separable constraints.
In this paper, we propose an async-parallel method based on block coordinate
update (BCU) for solving convex problems with nonseparable linear constraint.
Running on a single node, the method becomes a novel randomized primal-dual BCU
with adaptive stepsize for multi-block affinely constrained problems. For these
problems, Gauss-Seidel cyclic primal-dual BCU needs strong convexity to
guarantee convergence. In contrast, assuming mere convexity, we show that the
objective value sequence generated by the proposed algorithm converges in
probability to the optimal value and that the constraint residual converges to
zero. In addition, we establish an ergodic convergence rate in terms of the
number of iterations. Numerical experiments are performed to demonstrate the
efficiency of the proposed method and its significantly better speed-up over
its sync-parallel counterpart.
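A serial, randomized sketch of a primal-dual block coordinate update for an affinely constrained problem (my own illustrative construction with a fixed quadratic objective and hand-picked stepsizes; the paper's method, stepsize rules, and async execution are more involved): each iteration takes a gradient step on one random block of the augmented Lagrangian, then updates the multiplier.

```python
import numpy as np

def randomized_pd_bcu(A, b, rho=1.0, step=0.2, iters=4000, seed=0):
    """Randomized primal-dual block coordinate update for
    min 0.5*||x||^2  subject to  A x = b  (a nonseparable linear constraint).
    Each iteration: gradient step on one random coordinate block of the
    augmented Lagrangian, then a damped dual ascent step on y."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        j = rng.integers(n)                       # random block (one coordinate)
        r = A @ x - b                             # constraint residual
        grad_j = x[j] + A[:, j] @ (y + rho * r)   # d/dx_j of the augmented Lagrangian
        x[j] -= step * grad_j
        y += (rho / n) * (A @ x - b)              # damped dual update
    return x, y

# Constraint: coordinates of x must sum to one; the minimizer is x_j = 1/4.
A = np.ones((1, 4))
b = np.array([1.0])
x, y = randomized_pd_bcu(A, b)
```

The KKT conditions for this toy give x_j = 1/4 and a multiplier y = -1/4, and the randomized iterates settle at that saddle point.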
Distributed Big-Data Optimization via Block-wise Gradient Tracking
We study distributed big-data nonconvex optimization in multi-agent networks.
We consider the (constrained) minimization of the sum of a smooth (possibly)
nonconvex function, i.e., the agents' sum-utility, plus a convex (possibly)
nonsmooth regularizer. Our interest is in big-data problems in which there is a
large number of variables to optimize. If treated by means of standard
distributed optimization algorithms, these large-scale problems may be
intractable due to the prohibitive local computation and communication burden
at each node. We propose a novel distributed solution method where, at each
iteration, agents update in an uncoordinated fashion only one block of the
entire decision vector. To deal with the nonconvexity of the cost function, the
novel scheme hinges on Successive Convex Approximation (SCA) techniques
combined with a novel block-wise perturbed push-sum consensus protocol, which
is instrumental to perform local block-averaging operations and tracking of
gradient averages. Asymptotic convergence to stationary solutions of the
nonconvex problem is established. Finally, numerical results show the
effectiveness of the proposed algorithm and highlight how the block dimension
affects the communication overhead and the practical convergence speed.
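The gradient-tracking mechanism at the core of this line of work can be sketched in its basic full-vector form (a standard construction over a static doubly stochastic matrix; the paper's block-wise perturbed push-sum protocol is more general): each agent mixes with its neighbors and maintains a tracker y of the network-average gradient.

```python
import numpy as np

def gradient_tracking(W, grads, x0, alpha=0.1, iters=300):
    """Distributed gradient tracking (full-vector sketch; the paper updates
    one block at a time): agents mix via W and track the average gradient."""
    x = x0.copy()
    g = np.array([grads[i](x[i]) for i in range(len(x))])
    y = g.copy()                     # tracker initialized at local gradients
    for _ in range(iters):
        x_new = W @ x - alpha * y    # consensus mixing + tracked-gradient step
        g_new = np.array([grads[i](x_new[i]) for i in range(len(x))])
        y = W @ y + g_new - g        # tracking update preserves the average
        x, g = x_new, g_new
    return x

# Ring of 4 agents, f_i(x) = 0.5*(x - c_i)^2; the sum is minimized at mean(c).
c = np.array([1.0, 3.0, -2.0, 6.0])
grads = [lambda z, ci=ci: z - ci for ci in c]
W = np.array([[0.5, 0.25, 0.0, 0.25],
              [0.25, 0.5, 0.25, 0.0],
              [0.0, 0.25, 0.5, 0.25],
              [0.25, 0.0, 0.25, 0.5]])   # doubly stochastic mixing matrix
x = gradient_tracking(W, grads, np.zeros(4))
```

Because the tracker's network average always equals the average of the current local gradients, all agents converge to the same minimizer of the sum-utility, here mean(c) = 2.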
More Iterations per Second, Same Quality -- Why Asynchronous Algorithms may Drastically Outperform Traditional Ones
In this paper, we consider the convergence of a very general
asynchronous-parallel algorithm called ARock, which takes many well-known
asynchronous algorithms as special cases (gradient descent, proximal gradient,
Douglas-Rachford splitting, ADMM, etc.). In asynchronous-parallel algorithms, the
computing nodes simply use the most recent information that they have access
to, instead of waiting for a full update from all nodes in the system. This
means that nodes do not have to waste time waiting for information, which can
be a major bottleneck, especially in distributed systems. When the system has
many nodes, asynchronous algorithms may complete far more iterations than
synchronous algorithms in a given time period ("more iterations per second").
Although asynchronous algorithms may compute more iterations per second,
there is error associated with using outdated information. How many more
iterations in total are needed to compensate for this error is still an open
question. The main results of this paper aim to answer this question. We prove,
loosely, that as the size of the problem becomes large, the number of
additional iterations that asynchronous algorithms need becomes negligible
compared to the total number ("same quality" of the iterations). Taking these
facts together, our results provide solid evidence of the potential of
asynchronous algorithms to vastly speed up certain distributed computations.
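To make the "outdated information" point concrete, here is a toy serial simulation (my own construction, not ARock itself): gradient descent on a quadratic in which every update reads an iterate that is several steps stale, mimicking asynchronous reads of shared memory; with a suitably small stepsize, the iterates still converge to the optimum.

```python
def delayed_gradient_descent(grad, x0, step, delay, iters):
    """Gradient descent where each update uses an iterate that is `delay`
    steps stale -- a serial model of asynchronous reads of shared memory."""
    hist = [x0]
    for k in range(iters):
        stale = hist[max(0, k - delay)]          # outdated information
        hist.append(hist[-1] - step * grad(stale))
    return hist[-1]

# Quadratic f(x) = 0.5*x^2, minimized at 0; gradient is the identity.
grad = lambda x: x
fresh = delayed_gradient_descent(grad, 4.0, step=0.1, delay=0, iters=100)
stale = delayed_gradient_descent(grad, 4.0, step=0.1, delay=5, iters=100)
```

Both runs drive the iterate to the optimum; the stepsize must simply be small relative to the delay for stability, which is the regime the analysis above is concerned with.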
Pathwise Coordinate Optimization for Sparse Learning: Algorithm and Theory
Pathwise coordinate optimization is one of the most important
computational frameworks for high dimensional convex and nonconvex sparse
learning problems. It differs from classical coordinate optimization
algorithms in three salient features: warm start initialization, active set
updating, and the strong rule for coordinate preselection. Such a
complex algorithmic structure grants superior empirical performance, but also
poses a significant challenge to theoretical analysis. To tackle this
long-standing problem, we develop a new theory showing that these three features play
pivotal roles in guaranteeing the outstanding statistical and computational
performance of the pathwise coordinate optimization framework. Particularly, we
analyze the existing pathwise coordinate optimization algorithms and provide
new theoretical insights into them. The obtained insights further motivate the
development of several modifications to improve the pathwise coordinate
optimization framework, which guarantees linear convergence to a unique sparse
local optimum with optimal statistical properties in parameter estimation and
support recovery. This is the first result on the computational and statistical
guarantees of the pathwise coordinate optimization framework in high
dimensions. Thorough numerical experiments are provided to support our theory. Comment: Accepted by the Annals of Statistics, 2016
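The warm-start feature can be sketched in a few lines. The toy below is my own simplification, assuming plain cyclic coordinate descent with soft-thresholding as the inner solver: each regularization level is solved starting from the previous solution. The full framework additionally uses active-set updating and the strong rule for coordinate preselection, which are omitted here.

```python
import numpy as np

def lasso_cd(A, b, lam, x0, sweeps=200):
    """Cyclic coordinate descent for the Lasso:
    min_x 0.5*||Ax - b||^2 + lam*||x||_1 (soft-thresholding per coordinate)."""
    x = x0.copy()
    r = b - A @ x
    col_sq = (A ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(len(x)):
            rho = A[:, j] @ r + col_sq[j] * x[j]
            x_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)          # keep the residual current
            x[j] = x_new
    return x

def pathwise(A, b, lams):
    """Warm-start pathwise optimization: solve along a decreasing sequence of
    regularization parameters, initializing each solve at the previous one."""
    x = np.zeros(A.shape[1])
    path = []
    for lam in lams:                               # lams sorted large to small
        x = lasso_cd(A, b, lam, x)                 # warm start from previous lam
        path.append(x.copy())
    return path

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 8))
x_true = np.zeros(8); x_true[:2] = [3.0, -2.0]
b = A @ x_true + 0.01 * rng.standard_normal(40)
path = pathwise(A, b, lams=[10.0, 3.0, 1.0, 0.1])
```

Along the path, early (heavily regularized) solutions are sparse and each later solve starts close to its solution, which is what makes the scheme fast in practice; the final solution recovers the planted sparse signal.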