91 research outputs found
Asynchrony and Acceleration in Gossip Algorithms
This paper considers the minimization of a sum of smooth and strongly convex
functions dispatched over the nodes of a communication network. Previous works
on the subject either focus on synchronous algorithms, which can be heavily
slowed down by a few slow nodes (the straggler problem), or consider a model of
asynchronous operation (Boyd et al., 2006) in which adjacent nodes communicate
at the instants of Poisson point processes. We have two main contributions. 1)
We propose CACDM (a Continuously Accelerated Coordinate Dual Method), and for
the Poisson model of asynchronous operation, we prove CACDM to converge to
optimality at an accelerated convergence rate in the sense of Nesterov and
Stich, 2017. In contrast, previously proposed asynchronous algorithms have not
been proven to achieve such an accelerated rate. While CACDM is based on discrete
updates, the proof of its convergence crucially depends on a continuous time
analysis. 2) We introduce a new communication scheme based on Loss-Networks,
that is programmable in a fully asynchronous and decentralized way, unlike the
Poisson model of asynchronous operation that does not capture essential aspects
of asynchrony such as non-instantaneous communications and computations. Under
this Loss-Network model of asynchrony, we establish for CDM (a Coordinate Dual
Method) a rate of convergence in terms of the eigengap of the Laplacian of the
graph weighted by local effective delays. We believe this eigengap to be a
fundamental bottleneck for convergence rates of asynchronous optimization.
Finally, we verify empirically that CACDM enjoys an accelerated convergence
rate in the Loss-Network model of asynchrony.
A Coordinate Descent Primal-Dual Algorithm and Application to Distributed Asynchronous Optimization
Based on the idea of randomized coordinate descent of α-averaged
operators, a randomized primal-dual optimization algorithm is introduced, where
a random subset of coordinates is updated at each iteration. The algorithm
builds upon a variant of a recent (deterministic) algorithm proposed by Vũ
and Condat that includes the well known ADMM as a particular case. The obtained
algorithm is used to solve asynchronously a distributed optimization problem. A
network of agents, each having a separate cost function containing a
differentiable term, seek to find a consensus on the minimum of the aggregate
objective. The method yields an algorithm where at each iteration, a random
subset of agents wake up, update their local estimates, exchange some data with
their neighbors, and go idle. Numerical results demonstrate the attractive
performance of the method. The general approach can be naturally adapted to
other situations where coordinate descent convex optimization algorithms are
used with a random choice of the coordinates.
Randomized iterative methods for linear systems: momentum, inexactness and gossip
In the era of big data, one of the key challenges is the development of novel optimization
algorithms that can accommodate vast amounts of data while at the same time satisfying
constraints and limitations of the problem under study. The need to solve optimization problems
is ubiquitous in essentially all quantitative areas of human endeavour, including industry and
science. In the last decade there has been a surge in the demand from practitioners, in fields
such as machine learning, computer vision, artificial intelligence, signal processing and data
science, for new methods able to cope with these new large scale problems.
In this thesis we are focusing on the design, complexity analysis and efficient implementations
of such algorithms. In particular, we are interested in the development of randomized first
order iterative methods for solving large scale linear systems, stochastic quadratic optimization
problems and the distributed average consensus problem.
In Chapter 2, we study several classes of stochastic optimization algorithms enriched with
heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic
Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time
momentum variants of several of these methods are studied. We choose to perform our analysis
in a setting in which all of the above methods are equivalent: convex quadratic problems. We
prove global non-asymptotic linear convergence rates for all methods and various measures of
success, including primal function values, primal iterates, and dual function values. We also
show that the primal iterates converge at an accelerated linear rate in a somewhat weaker sense.
This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic
gradient descent method with momentum). Under somewhat weaker conditions, we establish
a sublinear convergence rate for Cesaro averages of primal iterates. Moreover, we propose a
novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing
the momentum step. We prove linear convergence of several stochastic methods with stochastic
momentum, and show that in some sparse data regimes and for sufficiently small momentum
parameters, these methods enjoy better overall complexity than methods with deterministic
momentum. Finally, we perform extensive numerical testing on artificial and real datasets.
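To make the heavy ball idea concrete, here is a minimal sketch of one special case: randomized Kaczmarz on a consistent linear system with an added momentum term. The function name `kaczmarz_heavy_ball`, the uniform row sampling, and the values of the step size `omega` and momentum `beta` are illustrative assumptions, not choices taken from the thesis.

```python
import random

def kaczmarz_heavy_ball(A, b, omega=1.0, beta=0.2, iters=3000, seed=0):
    """Randomized Kaczmarz with a heavy ball term for a consistent system Ax = b.

    Update: x_next = x - omega * (a_i.x - b_i) / ||a_i||^2 * a_i + beta * (x - x_prev)
    """
    rng = random.Random(seed)
    n = len(A[0])
    x_prev = [0.0] * n
    x = [0.0] * n
    row_norms = [sum(a * a for a in row) for row in A]
    for _ in range(iters):
        i = rng.randrange(len(A))  # uniform row sampling (an assumption)
        residual = sum(a * xi for a, xi in zip(A[i], x)) - b[i]
        step = omega * residual / row_norms[i]
        # Kaczmarz projection step plus momentum from the previous iterate.
        x_next = [xi - step * a + beta * (xi - xp)
                  for xi, a, xp in zip(x, A[i], x_prev)]
        x_prev, x = x, x_next
    return x

# Small consistent system with a known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x_hat = kaczmarz_heavy_ball(A, b)
print(x_hat)
```

Setting `beta=0.0` recovers plain randomized Kaczmarz, which gives a simple baseline for comparing iteration counts.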
In Chapter 3, we present a convergence rate analysis of inexact variants of stochastic gradient
descent, stochastic Newton, stochastic proximal point and stochastic subspace ascent.
A common feature of these methods is that in their update rule a certain sub-problem needs
to be solved exactly. We relax this requirement by allowing for the sub-problem to be solved
inexactly. In particular, we propose and analyze inexact randomized iterative methods for
solving three closely related problems: a convex stochastic quadratic optimization problem, a
best approximation problem and its dual, a concave quadratic maximization problem. We
provide iteration complexity results under several assumptions on the inexactness error. Inexact
variants of many popular and some more exotic methods, including randomized block
Kaczmarz, randomized Gaussian Kaczmarz and randomized block coordinate descent, can be
cast as special cases. Finally, we present numerical experiments which demonstrate the benefits
of allowing inexactness.
When the data describing a given optimization problem is big enough, it becomes impossible
to store it on a single machine. In such situations, it is usually preferable to distribute the data
among the nodes of a cluster or a supercomputer. In one such setting the nodes cooperate
to minimize the sum (or average) of private functions (convex or non-convex) stored at the
nodes. Among the most popular protocols for solving this problem in a decentralized fashion
(communication is allowed only between neighbours) are randomized gossip algorithms.
In Chapter 4 we propose a new approach for the design and analysis of randomized gossip
algorithms which can be used to solve the distributed average consensus problem, a fundamental
problem in distributed computing, where each node of a network initially holds a number or
vector, and the aim is to calculate the average of these objects by communicating only with
its neighbours (connected nodes). The new approach consists in establishing new connections to
recent literature on randomized iterative methods for solving large-scale linear systems. Our
general framework recovers a comprehensive array of well-known gossip protocols as special
cases and allows for the development of block and arbitrary sampling variants of all of these
methods. In addition, we present novel and provably accelerated randomized gossip protocols
where in each step all nodes of the network update their values using their own information but
only a subset of them exchange messages. The accelerated protocols are the first randomized
gossip algorithms that converge to consensus with a provably accelerated linear rate. The
theoretical results are validated via computational testing on typical wireless sensor network
topologies.
Finally, in Chapter 5, we move in a different direction and present the first randomized
gossip algorithms for solving the average consensus problem while at the same time protecting
the private values stored at the nodes as these may be sensitive. In particular, we develop
and analyze three privacy preserving variants of the randomized pairwise gossip algorithm
("randomly pick an edge of the network and then replace the values stored at vertices of this
edge by their average") first proposed by Boyd et al. [16] for solving the average consensus
problem. The randomized methods we propose are all dual in nature. That is, they are designed
to solve the dual of the best approximation optimization formulation of the average consensus
problem. We call our three privacy preservation techniques "Binary Oracle", "ε-Gap Oracle"
and "Controlled Noise Insertion". We give iteration complexity bounds for the proposed privacy
preserving randomized gossip protocols and perform extensive numerical experiments.
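The baseline randomized pairwise gossip rule quoted above (pick a random edge, replace both endpoint values by their average) can be sketched in a few lines. The ring topology, the initial values, and the function name `pairwise_gossip` are assumptions chosen for illustration; the privacy-preserving variants of the thesis are not reproduced here.

```python
import random

def pairwise_gossip(values, edges, iters=5000, seed=0):
    """Repeatedly pick a uniformly random edge and average its two endpoints."""
    rng = random.Random(seed)
    x = list(values)
    for _ in range(iters):
        i, j = rng.choice(edges)
        avg = 0.5 * (x[i] + x[j])
        x[i] = avg
        x[j] = avg
    return x

# Six nodes on a ring (hypothetical topology and data).
values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
x = pairwise_gossip(values, edges)
target = sum(values) / len(values)
print(x, target)
```

Each update preserves the sum of the node values, so on a connected graph the only consensus the iteration can reach is the average of the initial values.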
CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity
With distributed machine learning being a prominent technique for large-scale
machine learning tasks, communication complexity has become a major bottleneck
for speeding up training and scaling up machine numbers. In this paper, we
propose a new technique named Common randOm REconstruction (CORE), which can be
used to compress the information transmitted between machines in order to
reduce communication complexity without other strict conditions. Specifically,
our technique CORE projects the vector-valued information to a low-dimensional
one through common random vectors and reconstructs the information with the
same random noises after communication. We apply CORE to two distributed tasks,
respectively convex optimization on linear models and generic non-convex
optimization, and design new distributed algorithms, which achieve provably
lower communication complexities. For example, we show that for linear models a
CORE-based algorithm can encode the gradient vector to O(1) bits
(against O(d)), with a convergence rate that is no worse, improving on the
existing results.
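The projection and reconstruction step described in this abstract can be illustrated with a small sketch, under stated assumptions: sender and receiver share a seed, the common random vectors are standard Gaussian, and the receiver rebuilds the vector by averaging the projections times the same vectors. The names `common_vectors`, `encode` and `decode` are hypothetical, and this is not claimed to be the authors' exact algorithm.

```python
import random

def common_vectors(dim, m, seed):
    """Generate m Gaussian vectors from a seed shared by sender and receiver."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

def encode(g, seed, m):
    """Sender: project g onto m common random vectors; only m scalars are sent."""
    xis = common_vectors(len(g), m, seed)
    return [sum(gi * xi for gi, xi in zip(g, row)) for row in xis]

def decode(projections, dim, seed):
    """Receiver: regenerate the same vectors and form (1/m) * sum_j p_j * xi_j."""
    m = len(projections)
    xis = common_vectors(dim, m, seed)
    g_hat = [0.0] * dim
    for p, row in zip(projections, xis):
        for k in range(dim):
            g_hat[k] += p * row[k] / m
    return g_hat

g = [1.0, -2.0, 0.5, 3.0]
m = 2          # number of scalars transmitted per round
rounds = 4000  # average many independent rounds to show unbiasedness
mean_estimate = [0.0] * len(g)
for r in range(rounds):
    message = encode(g, seed=r, m=m)
    g_hat = decode(message, dim=len(g), seed=r)
    mean_estimate = [acc + v / rounds for acc, v in zip(mean_estimate, g_hat)]
print(mean_estimate)
```

Because the expectation of ξξᵀ is the identity for standard Gaussian ξ, a single reconstruction is an unbiased but noisy estimate of `g`; averaging over rounds here only serves to make that visible.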
- …