2,731 research outputs found
A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates
This paper proposes a novel proximal-gradient algorithm for a decentralized
optimization problem with a composite objective containing smooth and
non-smooth terms. Specifically, the smooth and nonsmooth terms are dealt with
by gradient and proximal updates, respectively. The proposed algorithm is
closely related to a previous algorithm, PG-EXTRA \cite{shi2015proximal}, but
has a few advantages. First of all, agents use uncoordinated step-sizes, and
the stable upper bounds on step-sizes are independent of network topologies.
The step-sizes depend on local objective functions, and they can be as large as
those of the gradient descent. Secondly, for the special case without
non-smooth terms, linear convergence can be achieved under the strong convexity
assumption. The dependence of the convergence rate on the objective functions
and on the network is separated, and the convergence rate of the new algorithm
is as good as one of the two convergence rates that match the typical rates for
general gradient descent and consensus averaging. We provide numerical
experiments to demonstrate the efficacy of the introduced algorithm and
validate our theoretical discoveries.
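
The split described in this abstract (a gradient update for the smooth term, a proximal update for the non-smooth term) can be illustrated with a single-agent sketch. This is not the decentralized algorithm of the paper or PG-EXTRA; it is plain proximal gradient descent, assuming an l1 non-smooth term and a hypothetical smooth term 0.5 * ||x - b||^2. The names `soft_threshold`, `proximal_gradient`, and `b` are illustrative only.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient(grad_smooth, x0, step, lam, iters=200):
    """Minimize s(x) + lam * ||x||_1, where grad_smooth returns the gradient of s."""
    x = list(x0)
    for _ in range(iters):
        g = grad_smooth(x)
        # Gradient step on the smooth term, then proximal step on the l1 term.
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, g)]
    return x


# Hypothetical smooth term (an assumption): s(x) = 0.5 * ||x - b||^2, so grad s(x) = x - b.
b = [3.0, -0.2, 1.5]
solution = proximal_gradient(lambda x: [xi - bi for xi, bi in zip(x, b)],
                             x0=[0.0, 0.0, 0.0], step=0.5, lam=1.0)
```

For this toy problem the minimizer is the soft-thresholding of `b` at level `lam`, so the iterates can be checked against a closed form.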
99% of Distributed Optimization is a Waste of Time: The Issue and How to Fix it
Many popular distributed optimization methods for training machine learning
models fit the following template: a local gradient estimate is computed
independently by each worker, then communicated to a master, which subsequently
performs averaging. The average is broadcast back to the workers, which use it
to perform a gradient-type step to update the local version of the model. It is
also well known that many such methods, including SGD, SAGA, and accelerated
SGD for over-parameterized models, do not scale well with the number of
parallel workers. In this paper we observe that the above template is
fundamentally inefficient in that too much data is unnecessarily communicated
by the workers, which slows down the overall system. We propose a fix based on
a new update-sparsification method we develop in this work, which we suggest be
used on top of existing methods. Namely, we develop a new variant of parallel
block coordinate descent based on independent sparsification of the local
gradient estimates before communication. We demonstrate that with only $m/n$
blocks sent by each of $n$ workers, where $m$ is the total number of parameter
blocks, the theoretical iteration complexity of the underlying distributed
methods is essentially unaffected. As an illustration, this means that when
$n=100$ parallel workers are used, the communication of $99\%$ blocks is
redundant, and hence a waste of time. Our theoretical claims are supported
through extensive numerical experiments which demonstrate an almost perfect
match with our theory on a number of synthetic and real datasets.
Comment: 41 pages, 8 algorithms, 10 theorems, 12 figures
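
The idea of sparsifying local gradient estimates before communication can be sketched as follows. This is an illustrative toy, not the method of the paper: each worker keeps a uniformly random subset of its gradient blocks and rescales them so that the sparse message is an unbiased estimate of the full gradient, and the master averages whatever arrives. The function names, the `num_sent` parameter, and the rescaling rule are assumptions made for this sketch.

```python
import random


def sparsify_blocks(grad_blocks, num_sent, rng):
    """Keep num_sent randomly chosen blocks, scaled so the message is unbiased."""
    m = len(grad_blocks)
    chosen = set(rng.sample(range(m), num_sent))
    scale = m / num_sent
    return {j: [scale * g for g in grad_blocks[j]] for j in chosen}


def aggregate(sparse_messages, grad_template):
    """Average the sparse messages from all workers into a dense list of blocks."""
    n = len(sparse_messages)
    total = [[0.0] * len(block) for block in grad_template]
    for message in sparse_messages:
        for j, block in message.items():
            for k, value in enumerate(block):
                total[j][k] += value / n
    return total


# Hypothetical setup: 4 workers, 4 blocks of 2 parameters each, 1 block sent per worker.
rng = random.Random(0)
worker_grads = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]] for _ in range(4)]
messages = [sparsify_blocks(g, num_sent=1, rng=rng) for g in worker_grads]
averaged = aggregate(messages, worker_grads[0])
```

Each worker here transmits a quarter of its blocks per round; the averaged result is noisy for any single round but matches the dense average in expectation.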
Convergence Analysis of Iterative Methods for Nonsmooth Convex Optimization over Fixed Point Sets of Quasi-Nonexpansive Mappings
This paper considers a networked system with a finite number of users and
supposes that each user tries to minimize its own private objective function
over its own private constraint set. It is assumed that each user's constraint
set can be expressed as a fixed point set of a certain quasi-nonexpansive
mapping. This enables us to consider the case in which the projection onto the
constraint set cannot be computed efficiently. This paper proposes two methods
for solving the problem of minimizing the sum of their nondifferentiable,
convex objective functions over the intersection of their fixed point sets of
quasi-nonexpansive mappings in a real Hilbert space. One method is a parallel
subgradient method that can be implemented under the assumption that each user
can communicate with other users. The other is an incremental subgradient
method that can be implemented under the assumption that each user can
communicate with its neighbors. Investigation of the two methods' convergence
properties for a constant step size reveals that, with a small constant step
size, they approximate a solution to the problem. Consideration of the case in
which the step-size sequence is diminishing demonstrates that the sequence
generated by each of the two methods strongly converges to the solution to the
problem under certain assumptions. Convergence rate analysis of the two methods
under certain situations is provided to illustrate the two methods' efficiency.
This paper also discusses nonsmooth convex optimization over sublevel sets of
convex functions and provides numerical comparisons that demonstrate the
effectiveness of the proposed methods.
Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs
This paper considers the problem of distributed optimization over
time-varying graphs. For the case of undirected graphs, we introduce a
distributed algorithm, referred to as DIGing, based on a combination of a
distributed inexact gradient method and a gradient tracking technique. The
DIGing algorithm uses doubly stochastic mixing matrices and employs fixed
step-sizes and, yet, drives all the agents' iterates to a global and consensual
minimizer. When the graphs are directed, in which case the implementation of
doubly stochastic mixing matrices is unrealistic, we construct an algorithm
that incorporates the push-sum protocol into the DIGing structure, thus
obtaining the Push-DIGing algorithm. Push-DIGing uses column stochastic
matrices and fixed step-sizes, but it still converges to a global and
consensual minimizer. Under the strong convexity assumption, we prove that the
algorithms converge at R-linear (geometric) rates as long as the step-sizes do
not exceed some upper bounds. We establish explicit estimates for the
convergence rates. When the graph is undirected, we show that DIGing scales
polynomially in the number of agents. We also provide some numerical
experiments to demonstrate the efficacy of the proposed algorithms and to
validate our theoretical findings.
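
A minimal sketch of the gradient tracking idea with a doubly stochastic mixing matrix and a fixed step-size is given below. It assumes a static ring of four agents rather than a time-varying graph, scalar decision variables, and hypothetical quadratic local objectives f_i(x) = 0.5 * a[i] * (x - c[i])^2; the step-size 0.05 is chosen by hand and is not derived from the bounds in the paper.

```python
def diging(a, c, W, step, iters):
    """Gradient tracking on local objectives f_i(x) = 0.5 * a[i] * (x - c[i])**2."""
    n = len(a)

    def grad(i, x):
        return a[i] * (x - c[i])

    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]  # tracker starts at the local gradients
    for _ in range(iters):
        # Consensus step on the iterates, then a move along the tracked direction.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - step * y[i] for i in range(n)]
        # Consensus step on the tracker plus the change in the local gradient.
        y = [sum(W[i][j] * y[j] for j in range(n)) + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


# Hypothetical data: four agents on a ring with a doubly stochastic mixing matrix.
a = [1.0, 2.0, 1.5, 1.0]
c = [0.0, 1.0, 2.0, 3.0]
W = [[0.5, 0.25, 0.0, 0.25],
     [0.25, 0.5, 0.25, 0.0],
     [0.0, 0.25, 0.5, 0.25],
     [0.25, 0.0, 0.25, 0.5]]
x_final = diging(a, c, W, step=0.05, iters=2000)
x_star = sum(ai * ci for ai, ci in zip(a, c)) / sum(a)  # minimizer of the summed objective
```

In this toy setting every agent's iterate should approach `x_star`, the minimizer of the sum of the local objectives, even though each agent only evaluates its own gradient.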
Distributed optimization in wireless sensor networks: an island-model framework
Wireless Sensor Networks (WSNs) are an emerging technology in several
application domains, ranging from urban surveillance to environmental and
structural monitoring. Computational Intelligence (CI) techniques are
particularly suitable for enhancing these systems. However, when embedding CI
into wireless sensors, severe hardware limitations must be taken into account.
In this paper we investigate the possibility of performing an online, distributed
optimization process within a WSN. Such a system might be used, for example, to
implement advanced network features like distributed modelling, self-optimizing
protocols, and anomaly detection, to name a few. The proposed approach, called
DOWSN (Distributed Optimization for WSN) is an island-model infrastructure in
which each node executes a simple, computationally cheap (both in terms of CPU
and memory) optimization algorithm, and shares promising solutions with its
neighbors. We perform extensive tests of different DOWSN configurations on a
benchmark made up of continuous optimization problems; we analyze the influence
of the network parameters (number of nodes, inter-node communication period and
probability of accepting incoming solutions) on the optimization performance.
Finally, we profile energy and memory consumption of DOWSN to show the
efficient usage of the limited hardware resources available on the sensor
nodes.
Proximal Point Algorithms for Nonsmooth Convex Optimization with Fixed Point Constraints
The problem of minimizing the sum of nonsmooth, convex objective functions
defined on a real Hilbert space over the intersection of fixed point sets of
nonexpansive mappings, onto which the projections cannot be efficiently
computed, is considered. The use of proximal point algorithms that use the
proximity operators of the objective functions and incremental optimization
techniques is proposed for solving the problem. With the focus on fixed point
approximation techniques, two algorithms are devised for solving the problem.
One blends an incremental subgradient method, which is a useful algorithm for
nonsmooth convex optimization, with a Halpern-type fixed point iteration
algorithm. The other is based on an incremental subgradient method and the
Krasnosel'skiĭ-Mann fixed point algorithm. It is shown that any weak
sequential cluster point of the sequence generated by the Halpern-type
algorithm belongs to the solution set of the problem and that there exists a
weak sequential cluster point of the sequence generated by the
Krasnosel'skiĭ-Mann-type algorithm, which also belongs to the solution set.
Numerical comparisons of the two proposed algorithms with existing subgradient
methods for concrete nonsmooth convex optimization show that the proposed
algorithms achieve faster convergence.
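
The two fixed point iterations named in this abstract can be sketched in isolation, without the incremental subgradient component the paper combines them with. The sketch assumes a hypothetical nonexpansive mapping `T` on the plane, namely the composition of projections onto the diagonal and onto the horizontal axis, whose only fixed point is the origin. The relaxation parameter 0.5 and the anchor weights 1/(k+2) are common textbook choices, not values taken from the paper.

```python
def T(x):
    """Project onto the diagonal x[1] == x[0], then onto the axis x[1] == 0."""
    mid = 0.5 * (x[0] + x[1])
    return (mid, 0.0)


def krasnoselskii_mann(mapping, x0, relax=0.5, iters=200):
    """Iterate x <- (1 - relax) * x + relax * mapping(x)."""
    x = x0
    for _ in range(iters):
        tx = mapping(x)
        x = tuple((1 - relax) * xi + relax * ti for xi, ti in zip(x, tx))
    return x


def halpern(mapping, x0, iters=10000):
    """Iterate x <- alpha_k * x0 + (1 - alpha_k) * mapping(x), with alpha_k = 1 / (k + 2)."""
    x = x0
    for k in range(iters):
        alpha = 1.0 / (k + 2)
        tx = mapping(x)
        # The starting point x0 acts as an anchor whose weight decays to zero.
        x = tuple(alpha * ui + (1 - alpha) * ti for ui, ti in zip(x0, tx))
    return x


start = (4.0, 2.0)
km_point = krasnoselskii_mann(T, start)
halpern_point = halpern(T, start)
```

Both runs approach the origin here. The Halpern iterates do so slowly because of the decaying anchor term, but in exchange the iteration is known to converge in norm to the fixed point nearest the anchor, whereas the Krasnosel'skiĭ-Mann iteration converges only weakly in general Hilbert spaces.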
Two Stochastic Optimization Algorithms for Convex Optimization with Fixed Point Constraints
Two optimization algorithms are proposed for solving a stochastic programming
problem for which the objective function is given in the form of the
expectation of convex functions and the constraint set is defined by the
intersection of fixed point sets of nonexpansive mappings in a real Hilbert
space. This setting of fixed point constraints enables consideration of the
case in which the projection onto each of the constraint sets cannot be
computed efficiently. Both algorithms use a convex function and a nonexpansive
mapping determined by a certain probabilistic process at each iteration. One
algorithm blends a stochastic gradient method with the Halpern fixed point
algorithm. The other is based on a stochastic proximal point algorithm and the
Halpern fixed point algorithm; it can be applied to nonsmooth convex
optimization. Convergence analysis showed that, under certain assumptions, any
weak sequential cluster point of the sequence generated by either algorithm
almost surely belongs to the solution set of the problem. Convergence rate
analysis illustrated their efficiency, and the numerical results of convex
optimization over fixed point sets demonstrated their effectiveness.
On the Acceleration of L-BFGS with Second-Order Information and Stochastic Batches
This paper proposes a framework of L-BFGS based on the (approximate)
second-order information with stochastic batches, as a novel approach to the
finite-sum minimization problems. Different from the classical L-BFGS where
stochastic batches lead to instability, we use a smooth estimate for the
evaluations of the gradient differences while achieving acceleration by
well-scaling the initial Hessians. We provide theoretical analyses for both
convex and nonconvex cases. In addition, we demonstrate that within the popular
applications of least-square and cross-entropy losses, the algorithm admits a
simple implementation in the distributed environment. Numerical experiments
support the efficiency of our algorithms.
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
Large-scale machine learning training, in particular distributed stochastic
gradient descent, needs to be robust to inherent system variability such as
node straggling and random communication delays. This work considers a
distributed training framework where each worker node is allowed to perform
local model updates and the resulting models are averaged periodically. We
analyze the true speed of error convergence with respect to wall-clock time
(instead of the number of iterations), and analyze how it is affected by the
frequency of averaging. The main contribution is the design of AdaComm, an
adaptive communication strategy that starts with infrequent averaging to save
communication delay and improve convergence speed, and then increases the
communication frequency in order to achieve a low error floor. Rigorous
experiments on training deep neural networks show that AdaComm can take less time than fully synchronous SGD, and still reach the same final
training loss.
Comment: Accepted to SysML 201
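
The effect of the averaging frequency in local-update training can be sketched with a toy model. This is not AdaComm: the schedule of local steps below is fixed by hand rather than adapted during training, the workers minimize hypothetical scalar quadratics 0.5 * a * (x - c)^2 with different curvatures, and the gradient noise is switched off by default so that the run is deterministic. In this toy, the error floor under infrequent averaging comes from the mismatch between the workers' objectives.

```python
import random


def local_update_sgd(curvatures, centers, taus, lr=0.1, noise=0.0, seed=0):
    """One averaging round per entry of taus; taus[r] local steps are taken in round r."""
    rng = random.Random(seed)
    n = len(centers)
    shared = 0.0
    for tau in taus:
        models = [shared] * n  # every worker starts the round from the averaged model
        for _ in range(tau):
            # Local (optionally noisy) gradient step on 0.5 * a * (x - c)**2.
            models = [x - lr * (a * (x - c) + noise * rng.gauss(0.0, 1.0))
                      for x, a, c in zip(models, curvatures, centers)]
        shared = sum(models) / n  # periodic model averaging
    return shared


# Hypothetical two-worker problem; x_star minimizes the sum of the local objectives.
curvatures, centers = [1.0, 3.0], [0.0, 4.0]
x_star = sum(a * c for a, c in zip(curvatures, centers)) / sum(curvatures)
fixed = local_update_sgd(curvatures, centers, taus=[8] * 50)
adaptive = local_update_sgd(curvatures, centers,
                            taus=[8] * 5 + [4] * 5 + [2] * 5 + [1] * 200)
```

With eight local steps per round throughout, the averaged model settles away from `x_star`; the schedule that ends with one local step per round removes that bias, which mirrors the "infrequent first, frequent later" pattern described in the abstract.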
Almost Sure Convergence of Random Projected Proximal and Subgradient Algorithms for Distributed Nonsmooth Convex Optimization
Two distributed algorithms are described that enable all users connected over
a network to cooperatively solve the problem of minimizing the sum of all
users' objective functions over the intersection of all users' constraint sets,
where each user has its own private nonsmooth convex objective function and
closed convex constraint set, which is the intersection of a number of simple,
closed convex sets. One algorithm enables each user to adjust its estimate by
using a proximity operator of its objective function and the metric projection
onto one set randomly selected from the simple, closed convex sets. The other
is a distributed random projection algorithm that determines each user's
estimate by using a subgradient of its objective function instead of the
proximity operator. Investigation of the two algorithms' convergence properties
for a diminishing step-size rule revealed that, under certain assumptions, the
sequences of all users generated by each of the two algorithms converge almost
surely to the same solution. Moreover, convergence rate analysis of the two
algorithms is provided, and desired choices of the step size sequences such
that the two algorithms have fast convergence are discussed. Numerical
comparisons for concrete nonsmooth convex optimization support the convergence
analysis and demonstrate the effectiveness of the two algorithms.