15 research outputs found
Optimal Algorithms for Non-Smooth Distributed Optimization in Networks
In this work, we consider the distributed optimization of non-smooth convex
functions using a network of computing units. We investigate this problem under
two regularity assumptions: (1) the Lipschitz continuity of the global
objective function, and (2) the Lipschitz continuity of local individual
functions. Under the local regularity assumption, we provide the first optimal
first-order decentralized algorithm called multi-step primal-dual (MSPD) and
its corresponding optimal convergence rate. A notable aspect of this result is
that, for non-smooth functions, while the dominant term of the error is in
$O(1/\sqrt{t})$, the structure of the communication network only impacts a
second-order term in $O(1/t)$, where $t$ is time. In other words, the error due
to limits in communication resources decreases at a fast rate even in the case
of non-strongly-convex objective functions. Under the global regularity
assumption, we provide a simple yet efficient algorithm called distributed
randomized smoothing (DRS) based on a local smoothing of the objective
function, and show that DRS is within a $d^{1/4}$ multiplicative factor of the
optimal convergence rate, where $d$ is the underlying dimension.
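The randomized-smoothing idea underlying DRS can be illustrated in a few lines: replace a non-smooth objective by its Gaussian smoothing and run plain gradient descent on the smoothed surrogate, estimating its gradient by sampling. This is a generic sketch under our own toy setup (function names, constants, and the $\ell_1$ test objective are ours, not the paper's DRS algorithm):

```python
import numpy as np

def smoothed_grad(f, x, gamma, rng, n_samples=1000):
    # Monte-Carlo gradient of the Gaussian smoothing
    #   f_gamma(x) = E[f(x + gamma * Z)],  Z ~ N(0, I),
    # via the identity grad f_gamma(x) = E[(f(x + gamma Z) - f(x)) Z] / gamma.
    Z = rng.standard_normal((n_samples, x.shape[0]))
    vals = np.array([f(x + gamma * z) for z in Z]) - f(x)
    return (vals[:, None] * Z).mean(axis=0) / gamma

f = lambda x: np.abs(x).sum()   # non-smooth convex test objective
rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
for _ in range(300):            # gradient descent on the smoothed surrogate
    x -= 0.02 * smoothed_grad(f, x, gamma=0.1, rng=rng)
print(np.abs(x).max())          # close to the minimizer 0, up to smoothing error
```

A smaller smoothing parameter `gamma` gives a surrogate closer to the original function but a noisier gradient estimate; trading these off is exactly the tuning a randomized-smoothing method must do.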
On the Complexity of Finite-Sum Smooth Optimization under the Polyak-{\L}ojasiewicz Condition
This paper considers the optimization problem of the form
$\min_{x \in \mathbb{R}^d} f(x) \triangleq \frac{1}{n}\sum_{i=1}^n f_i(x)$,
where $f$ satisfies the Polyak--{\L}ojasiewicz (PL) condition with
parameter $\mu$ and $\{f_i\}_{i=1}^n$ is $L$-mean-squared smooth. We
show that any gradient method requires at least
$\Omega(n + \kappa\sqrt{n}\log(1/\epsilon))$ incremental first-order oracle (IFO)
calls to find an $\epsilon$-suboptimal solution, where $\kappa \triangleq L/\mu$
is the condition number of the problem. This result nearly matches upper bounds
of IFO complexity for best-known first-order methods. We also study the problem
of minimizing the PL function in the distributed setting in which the
individual functions $f_1, \dots, f_n$ are located on a connected network of
$n$ agents. We provide lower bounds of
$\Omega((\kappa/\sqrt{\gamma})\log(1/\epsilon))$,
$\Omega((\kappa + \tau\kappa/\sqrt{\gamma})\log(1/\epsilon))$ and
$\Omega((n + \kappa\sqrt{n})\log(1/\epsilon))$ for communication rounds,
time cost and local first-order oracle calls respectively, where
$\gamma$ is the spectral gap of the mixing matrix associated with the
network and $\tau$ is the time cost per communication round. Furthermore,
we propose a decentralized first-order method that nearly matches the above
lower bounds in expectation.
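The PL condition states that $\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^\star)$ for all $x$: the gradient norm controls suboptimality without requiring convexity. A quick numerical check on a least-squares objective, for which PL holds with $\mu = \lambda_{\min}(A^\top A)$ (our own toy example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizer
f_star = f(x_star)
mu = np.linalg.eigvalsh(A.T @ A).min()           # PL constant for least squares

# verify (1/2) ||grad f(x)||^2 >= mu * (f(x) - f_star) at random points
for _ in range(100):
    x = rng.standard_normal(5)
    assert 0.5 * np.linalg.norm(grad(x)) ** 2 >= mu * (f(x) - f_star) - 1e-9
print("PL inequality holds at all sampled points")
```

The inequality follows here because $f(x) - f^\star = \frac{1}{2}(x - x^\star)^\top A^\top A\,(x - x^\star)$ and $\nabla f(x) = A^\top A\,(x - x^\star)$, so $(A^\top A)^2 \succeq \lambda_{\min}(A^\top A)\,A^\top A$ gives the bound.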
Accelerated Gossip in Networks of Given Dimension using Jacobi Polynomial Iterations
Consider a network of agents connected by communication links, where each
agent holds a real value. The gossip problem consists in estimating the average
of the values diffused in the network in a distributed manner. We develop a
method solving the gossip problem that depends only on the spectral dimension
of the network, that is, in the communication network set-up, the dimension of
the space in which the agents live. This contrasts with previous work that
required the spectral gap of the network as a parameter, or suffered from slow
mixing. Our method shows an important improvement over existing algorithms in
the non-asymptotic regime, i.e., when the values are far from being fully mixed
in the network. Our approach stems from a polynomial-based point of view on
gossip algorithms, as well as an approximation of the spectral measure of the
graphs with a Jacobi measure. We show the power of the approach with
simulations on various graphs, and with performance guarantees on graphs of
known spectral dimension, such as grids and random percolation bonds. An
extension of this work to distributed Laplacian solvers is discussed. As a side
result, we also use the polynomial-based point of view to show the convergence
of the message passing algorithm for gossip of Moallemi \& Van Roy on regular
graphs. The explicit computation of the rate of convergence shows that
message passing has a slow rate of convergence on graphs with small spectral
gap.
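The classical synchronous gossip baseline that this line of work accelerates is a fixed-point iteration $x \leftarrow W x$ with a doubly stochastic matrix $W$; its mixing speed is governed by the spectral gap. A minimal sketch on a cycle graph (our own toy setup, not the paper's Jacobi-polynomial method):

```python
import numpy as np

n = 20
rng = np.random.default_rng(2)
x = rng.standard_normal(n)             # one real value per agent
avg = x.mean()                         # target of the gossip problem

# doubly stochastic gossip matrix for the n-cycle:
# each agent averages with its two neighbours
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

for _ in range(500):
    x = W @ x                          # one synchronous gossip round
print(np.abs(x - avg).max())           # every agent is close to the average
```

On the cycle the second-largest eigenvalue of $W$ is $\frac{1}{2} + \frac{1}{2}\cos(2\pi/n)$, very close to $1$ for large $n$; this slow mixing is what polynomial-based acceleration attacks.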
Optimal Accelerated Variance Reduced EXTRA and DIGing for Strongly Convex and Smooth Decentralized Optimization
We study stochastic decentralized optimization for the problem of training
machine learning models with large-scale distributed data. We extend the famous
EXTRA and DIGing methods with accelerated variance reduction (VR), and propose
two methods, which require $O((\sqrt{n\kappa_s} + n)\log(1/\epsilon))$
stochastic gradient evaluations
and $O(\sqrt{\kappa_b\kappa_c}\log(1/\epsilon))$ communication rounds to
reach precision $\epsilon$, where $\kappa_s$ and $\kappa_b$ are the stochastic
condition number and batch condition number for strongly convex and smooth
problems, $\kappa_c$ is the condition number of the communication network, and
$n$ is the sample size on each distributed node. Our stochastic gradient
computation complexity is the same as the single-machine accelerated variance
reduction methods, such as Katyusha, and our communication complexity is the
same as the accelerated full batch decentralized methods, such as MSDA, and
they are both optimal. We also propose the non-accelerated VR based EXTRA and
DIGing, and provide explicit complexities, for example, the
$O((\kappa_s + n)\log(1/\epsilon))$ stochastic gradient computation
complexity and the $O((\kappa_b + \kappa_c)\log(1/\epsilon))$ communication
complexity for the VR based EXTRA. The two complexities are also the same as
the ones of single-machine VR methods, such as SAG, SAGA, and SVRG, and the
non-accelerated full batch decentralized methods, such as EXTRA, respectively.
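The base EXTRA iteration of Shi et al. that these methods build on alternates gossip mixing with a gradient-correction step: $x^{k+2} = (I+W)x^{k+1} - \tilde W x^k - \alpha(\nabla F(x^{k+1}) - \nabla F(x^k))$ with $\tilde W = (I+W)/2$. A minimal full-gradient sketch on a toy decentralized least-squares problem (our own setup, not the paper's accelerated VR variant):

```python
import numpy as np

rng = np.random.default_rng(3)
n_agents, d = 5, 3
A = rng.standard_normal((n_agents, 10, d))   # agent i holds (A[i], b[i])
b = rng.standard_normal((n_agents, 10))

def grads(X):   # row i = gradient of agent i's local objective at X[i]
    return np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(n_agents)])

# symmetric doubly stochastic gossip matrix for the 5-cycle
W = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    W[i, i] = 0.5
    W[i, (i - 1) % n_agents] = W[i, (i + 1) % n_agents] = 0.25
W_tilde = (np.eye(n_agents) + W) / 2

alpha = 0.01
X_prev = np.zeros((n_agents, d))
X = W @ X_prev - alpha * grads(X_prev)       # EXTRA's first step
G_prev = grads(X_prev)
for _ in range(2000):
    G = grads(X)
    X, X_prev = ((np.eye(n_agents) + W) @ X - W_tilde @ X_prev
                 - alpha * (G - G_prev)), X
    G_prev = G

# compare with the centralized minimizer of the summed objective
x_star, *_ = np.linalg.lstsq(A.reshape(-1, d), b.reshape(-1), rcond=None)
print(np.abs(X - x_star).max())   # all agents agree on the global minimizer
```

The gradient-correction term is what lets EXTRA reach the exact consensus minimizer with a constant step size, unlike plain decentralized gradient descent.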
Optimal algorithms for smooth and strongly convex distributed optimization in networks
In this paper, we determine the optimal convergence rates for strongly convex and smooth distributed optimization in two settings: centralized and decentralized communications over a network. For centralized (i.e. master/slave) algorithms, we show that distributing Nesterov's accelerated gradient descent is optimal and achieves a precision $\epsilon > 0$ in time $O(\sqrt{\kappa_g}(1 + \Delta\tau)\log(1/\epsilon))$, where $\kappa_g$ is the condition number of the (global) function to optimize, $\Delta$ is the diameter of the network, and $\tau$ (resp. $1$) is the time needed to communicate values between two neighbors (resp. perform local computations). For decentralized algorithms based on gossip, we provide the first optimal algorithm, called the multi-step dual accelerated (MSDA) method, that achieves a precision $\epsilon > 0$ in time $O(\sqrt{\kappa_l}(1 + \tau/\sqrt{\gamma})\log(1/\epsilon))$, where $\kappa_l$ is the condition number of the local functions and $\gamma$ is the (normalized) eigengap of the gossip matrix used for communication between nodes. We then verify the efficiency of MSDA against state-of-the-art methods for two problems: least-squares regression and classification by logistic regression.
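The single-machine building block that the centralized scheme distributes, Nesterov's accelerated gradient with constant momentum for a $\mu$-strongly convex, $L$-smooth objective, can be sketched as follows on a toy quadratic (our own example, not the distributed algorithm itself):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 10))
H = A.T @ A / 50 + 0.1 * np.eye(10)    # strongly convex quadratic Hessian
b = rng.standard_normal(10)
grad = lambda x: H @ x - b             # gradient of f(x) = x'Hx/2 - b'x

L = np.linalg.eigvalsh(H).max()        # smoothness constant
mu = np.linalg.eigvalsh(H).min()       # strong-convexity constant
kappa = L / mu                         # condition number

# Nesterov's method: gradient step at the extrapolated point y,
# then constant momentum with weight (sqrt(kappa)-1)/(sqrt(kappa)+1)
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
x = y = np.zeros(10)
for _ in range(100):
    x_next = y - grad(y) / L
    y = x_next + beta * (x_next - x)
    x = x_next

x_star = np.linalg.solve(H, b)
print(np.linalg.norm(x - x_star))      # near the minimizer
```

The $O(\sqrt{\kappa}\log(1/\epsilon))$ iteration count of this scheme is the source of the $\sqrt{\kappa_g}$ and $\sqrt{\kappa_l}$ factors in the distributed rates above.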