Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction
There is growing interest in large-scale machine learning and optimization
over decentralized networks, e.g. in the context of multi-agent learning and
federated learning. Due to the imminent need to alleviate the communication
burden, the investigation of communication-efficient distributed optimization
algorithms, particularly for empirical risk minimization, has flourished in
recent years. A large fraction of these algorithms have been developed for the
master/slave setting, relying on a central parameter server that can
communicate with all agents. This paper focuses on distributed optimization
over networks, or decentralized optimization, where each agent is only allowed
to aggregate information from its neighbors. By properly adjusting the global
gradient estimate via local averaging in conjunction with proper correction, we
develop a communication-efficient approximate Newton-type method Network-DANE,
which generalizes DANE to the decentralized scenarios. Our key ideas can be
applied in a systematic manner to obtain decentralized versions of other
master/slave distributed algorithms. A notable development is
Network-SVRG/SARAH, which employs variance reduction to further accelerate
local computation. We establish linear convergence of Network-DANE and
Network-SVRG for strongly convex losses, and Network-SARAH for quadratic
losses, which shed light on the impacts of data homogeneity, network
connectivity, and local averaging upon the rate of convergence. We further
extend Network-DANE to composite optimization by allowing a nonsmooth penalty
term. Numerical evidence is provided to demonstrate the appealing performance
of our algorithms over competitive baselines, in terms of both communication
and computation efficiency. Our work suggests that performing a certain amount
of local communications and computations per iteration can substantially
improve the overall efficiency.
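The gradient-tracking idea behind "adjusting the global gradient estimate via local averaging in conjunction with proper correction" can be illustrated with a minimal sketch. This is the plain gradient-tracking template, not Network-DANE itself; the scalar quadratic losses, the ring topology with uniform 1/3 weights, the step size, and the helper names `ring_mixing_matrix`, `mix`, and `gradient_tracking` are all assumptions made for illustration.

```python
# Gradient tracking on a ring of n agents with scalar losses
# f_i(x) = 0.5 * a_i * (x - b_i)**2 (toy losses, assumed for illustration).

def ring_mixing_matrix(n):
    """Symmetric doubly stochastic weights: 1/3 on self and each ring neighbor (n >= 3)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W

def mix(W, v):
    """One round of neighbor averaging: returns W @ v."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

def gradient_tracking(a, b, step=0.05, iters=1000):
    n = len(a)
    W = ring_mixing_matrix(n)
    grad = lambda i, x: a[i] * (x - b[i])
    x = [0.0] * n
    g = [grad(i, x[i]) for i in range(n)]  # local gradients at the current iterates
    y = list(g)                            # y_i tracks the network-average gradient
    for _ in range(iters):
        x_mixed = mix(W, x)
        x = [x_mixed[i] - step * y[i] for i in range(n)]
        g_new = [grad(i, x[i]) for i in range(n)]
        y_mixed = mix(W, y)
        # Correction term: add the change in the local gradient.
        y = [y_mixed[i] + g_new[i] - g[i] for i in range(n)]
        g = g_new
    return x

a = [1.0, 1.5, 2.0, 1.2, 1.8]
b = [0.0, 1.0, 2.0, 3.0, 4.0]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of the summed loss
x_final = gradient_tracking(a, b)
```

In this toy setting every entry of `x_final` should agree with `x_star`, even though each agent only ever averages with its two ring neighbors.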
Variance-Reduced Decentralized Stochastic Optimization with Gradient Tracking--Part I: GT-SAGA
In this paper, we study decentralized empirical risk minimization problems,
where the goal is to minimize a finite sum of smooth and strongly convex
functions available over a network of nodes. In this Part I, we propose
\textbf{\texttt{GT-SAGA}}, a decentralized stochastic first-order algorithm
based on gradient tracking \cite{DSGT_Pu,DSGT_Xin} and a variance-reduction
technique called SAGA \cite{SAGA}. We develop the convergence analysis and the
iteration complexity of this algorithm. We further demonstrate various
trade-offs and discuss scenarios in which \textbf{\texttt{GT-SAGA}} achieves
superior performance (in terms of the number of local gradient computations
required) with respect to existing decentralized schemes. In Part II
\cite{GT_SVRG} of this two-part paper, we develop and analyze
\textbf{\texttt{GT-SVRG}}, a decentralized gradient tracking based
implementation of SVRG \cite{SVRG}, another well-known variance-reduction
technique.
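As a rough illustration of how gradient tracking and a SAGA-style gradient table can be combined at each node, the following sketch runs on toy scalar least-squares losses. The data, the 4-node ring weights, the step size, and the function name `gt_saga` are assumptions for illustration; the paper's exact update order and parameter choices may differ.

```python
import random

def gt_saga(a, d, W, step=0.02, iters=5000, seed=0):
    """Sketch of gradient tracking with SAGA estimators.

    Sample j at node i has the assumed toy loss
    f_ij(x) = 0.5 * a[i][j] * (x - d[i][j])**2.
    """
    rng = random.Random(seed)
    n, m = len(a), len(a[0])
    grad = lambda i, j, x: a[i][j] * (x - d[i][j])
    x = [0.0] * n
    # SAGA table: the most recent gradient evaluated for each local sample.
    table = [[grad(i, j, x[i]) for j in range(m)] for i in range(n)]
    g = [sum(table[i]) / m for i in range(n)]  # initial estimator: full local gradient
    y = list(g)                                # gradient tracker
    for _ in range(iters):
        x_new = [sum(W[i][r] * x[r] for r in range(n)) - step * y[i]
                 for i in range(n)]
        g_new = []
        for i in range(n):
            s = rng.randrange(m)               # one random local sample per node
            fresh = grad(i, s, x_new[i])
            # SAGA estimator: fresh gradient, minus stored one, plus table average.
            g_new.append(fresh - table[i][s] + sum(table[i]) / m)
            table[i][s] = fresh
        y = [sum(W[i][r] * y[r] for r in range(n)) + g_new[i] - g[i]
             for i in range(n)]
        x, g = x_new, g_new
    return x

t = 1.0 / 3.0
W = [[t, t, 0.0, t], [t, t, t, 0.0], [0.0, t, t, t], [t, 0.0, t, t]]  # 4-node ring
a = [[1.0 + 0.25 * ((i + j) % 4) for j in range(3)] for i in range(4)]
d = [[float(i) + 0.5 * j for j in range(3)] for i in range(4)]
num = sum(a[i][j] * d[i][j] for i in range(4) for j in range(3))
den = sum(a[i][j] for i in range(4) for j in range(3))
x_star = num / den
x_final = gt_saga(a, d, W)
```

Each node evaluates a single sample gradient per iteration, which is the quantity the abstract counts when it compares algorithms by the number of local gradient computations.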
Newton Method over Networks is Fast up to the Statistical Precision
We propose a distributed cubic regularization of the Newton method for
solving (constrained) empirical risk minimization problems over a network of
agents, modeled as an undirected graph. The algorithm employs an inexact,
preconditioned Newton step at each agent's side: the gradient of the
centralized loss is iteratively estimated via a gradient-tracking consensus
mechanism and the Hessian is subsampled over the local data sets. No Hessian
matrices are thus exchanged over the network. We derive global complexity
bounds for convex and strongly convex losses. Our analysis reveals an
interesting interplay between sample and iteration/communication complexity:
statistically accurate solutions are achievable in roughly the same number of
iterations of the centralized cubic Newton method, with a communication cost
per iteration of the order of
, where characterizes
the connectivity of the network. This demonstrates a significant communication
saving with respect to that of existing, statistically oblivious, distributed
Newton-based methods over networks.
Comment: In proceedings of the 38th International Conference on Machine
Learning, PMLR 139, 202
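To make the cubic-regularized Newton step concrete, here is a minimal scalar sketch. It covers only the local model minimization, with no gradient tracking or Hessian subsampling; the test function, the constant `M`, and the function name `cubic_newton_step` are assumptions for illustration.

```python
import math

def cubic_newton_step(g, h, M):
    """Minimizer of the scalar model g*s + 0.5*h*s**2 + (M/6)*abs(s)**3.

    Assumes h >= 0 and M > 0. Setting the derivative to zero gives
    abs(g) = h*r + 0.5*M*r**2 with r = abs(s), and s has the opposite sign of g.
    """
    r = (-h + math.sqrt(h * h + 2.0 * M * abs(g))) / M
    return -math.copysign(r, g)

# Toy objective f(x) = log(cosh(x)) + 0.5*x**2 with minimizer x = 0.
grad = lambda x: math.tanh(x) + x
hess = lambda x: 2.0 - math.tanh(x) ** 2

x = 3.0
M = 1.0  # assumed upper bound on the Lipschitz constant of the Hessian
for _ in range(20):
    x += cubic_newton_step(grad(x), hess(x), M)
```

Far from the minimizer the cubic term shortens the step relative to a pure Newton step; near the minimizer the two nearly coincide.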
Variance-Reduced Decentralized Stochastic Optimization with Gradient Tracking -- Part II: GT-SVRG
Decentralized stochastic optimization has recently benefited from gradient
tracking methods \cite{DSGT_Pu,DSGT_Xin} providing efficient solutions for
large-scale empirical risk minimization problems. In Part I \cite{GT_SAGA} of
this work, we develop \textbf{\texttt{GT-SAGA}} that is based on a
decentralized implementation of SAGA \cite{SAGA} using gradient tracking and
discuss regimes of practical interest where \textbf{\texttt{GT-SAGA}}
outperforms existing decentralized approaches in terms of the total number of
local gradient computations. In this paper, we describe
\textbf{\texttt{GT-SVRG}} that develops a decentralized gradient tracking based
implementation of SVRG \cite{SVRG}, another well-known variance-reduction
technique. We show that the convergence rate of \textbf{\texttt{GT-SVRG}}
matches that of \textbf{\texttt{GT-SAGA}} for smooth and strongly-convex
functions and highlight different trade-offs between the two algorithms in
various settings.
Comment: arXiv admin note: text overlap with arXiv:1909.1177
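The SVRG estimator that GT-SVRG builds on can be sketched on a single node, leaving out the network and the gradient tracker. The toy scalar losses, step size, loop lengths, and the function name `svrg` are assumptions for illustration.

```python
import random

def svrg(a, d, step=0.05, epochs=40, inner=100, seed=0):
    """Single-node SVRG on assumed toy losses f_j(x) = 0.5 * a[j] * (x - d[j])**2."""
    rng = random.Random(seed)
    m = len(a)
    grad = lambda j, x: a[j] * (x - d[j])
    x = 0.0
    for _ in range(epochs):
        snapshot = x
        # The full gradient is computed once per outer loop, at the snapshot.
        full = sum(grad(j, snapshot) for j in range(m)) / m
        for _ in range(inner):
            s = rng.randrange(m)
            # Variance-reduced estimator: unbiased, with variance that shrinks
            # as x and the snapshot both approach the minimizer.
            x -= step * (grad(s, x) - grad(s, snapshot) + full)
    return x

a = [1.0, 1.25, 1.5, 1.75, 2.0]
d = [0.0, 1.0, -1.0, 2.0, 0.5]
x_star = sum(ai * di for ai, di in zip(a, d)) / sum(a)
x_out = svrg(a, d)
```

Unlike SAGA, this estimator stores no per-sample gradient table; it trades that memory for a periodic full-gradient pass, which is one of the trade-offs between the two approaches.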
Gradient tracking and variance reduction for decentralized optimization and machine learning
Decentralized methods to solve finite-sum minimization problems are important
in many signal processing and machine learning tasks where the data is
distributed over a network of nodes and raw data sharing is not permitted due
to privacy and/or resource constraints. In this article, we review
decentralized stochastic first-order methods and provide a unified algorithmic
framework that combines variance-reduction with gradient tracking to achieve
both robust performance and fast convergence. We provide explicit theoretical
guarantees of the corresponding methods when the objective functions are smooth
and strongly-convex, and show their applicability to non-convex problems via
numerical experiments. Throughout the article, we provide intuitive
illustrations of the main technical ideas by casting appropriate tradeoffs and
comparisons among the methods of interest and by highlighting applications to
decentralized training of machine learning models.
Comment: accepted for publication, IEEE Signal Processing Magazine
Model Linkage Selection for Cooperative Learning
Rapid developments in data collecting devices and computation platforms
produce a growing number of learners and data modalities in many scientific
domains. We consider the setting in which each learner holds a pair consisting
of a parametric statistical model and a specific data source, with the goal of
integrating information across a set of learners to enhance the prediction
accuracy of a specific learner. One natural way to integrate information is to
build a joint model across a set of learners that shares common parameters of
interest. However, the parameter sharing patterns across a set of learners are
not known a priori. Misspecifying the parameter sharing patterns and the
parametric statistical model for each learner yields a biased estimator and
degrades the prediction accuracy of the joint model. In this paper, we propose
a novel framework for integrating information across a set of learners that is
robust against model misspecification and misspecified parameter sharing
patterns. The crux is to sequentially incorporate additional learners
that can enhance the prediction accuracy of an existing joint model based on
user-specified parameter sharing patterns across a set of learners, starting
from a model with one learner. Theoretically, we show that the proposed method
can data-adaptively select the correct parameter sharing patterns based on
user-specified parameter sharing patterns, and thus enhances the prediction
accuracy of a learner. Extensive numerical studies are performed to evaluate
the performance of the proposed method.
A Primal-Dual Framework for Decentralized Stochastic Optimization
We consider the decentralized convex optimization problem, where multiple
agents must cooperatively minimize a cumulative objective function, with each
local function expressible as an empirical average of data-dependent losses.
State-of-the-art approaches for decentralized optimization rely on gradient
tracking, where consensus is enforced via a doubly stochastic mixing matrix.
Construction of such mixing matrices is not straightforward and requires
coordination even prior to the start of the optimization algorithm. This paper
puts forth a primal-dual framework for decentralized stochastic optimization
that obviates the need for such doubly stochastic matrices. Instead, dual
variables are maintained to track the disagreement between neighbors. The
proposed framework is flexible and is used to develop decentralized variants of
SAGA, L-SVRG, SVRG++, and SEGA algorithms. Using a unified proof, we establish
that the oracle complexity of these decentralized variants is ,
matching the complexity bounds obtained for the centralized variants.
Additionally, we also present a decentralized primal-dual accelerated SVRG
algorithm achieving oracle complexity, again matching
the bound for the centralized accelerated SVRG. Numerical tests on the
algorithms establish their superior performance as compared to the
variance-reduced gradient tracking algorithms.
Comment: 31 pages, 6 figures
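To illustrate why building a doubly stochastic mixing matrix needs coordination before optimization starts, here is a sketch of the Metropolis rule, one standard construction for undirected graphs that is not taken from this paper. Each node must learn the degrees of its neighbors to set its weights. The function name `metropolis_weights` and the example path graph are assumptions for illustration.

```python
def metropolis_weights(adjacency):
    """Metropolis weights for an undirected graph given as a 0/1 adjacency matrix.

    w_ij = 1 / (1 + max(deg_i, deg_j)) on each edge, and w_ii absorbs the rest,
    so the matrix is symmetric with unit row sums (hence doubly stochastic).
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i][j]:
                W[i][j] = 1.0 / (1 + max(degree[i], degree[j]))
        W[i][i] = 1.0 - sum(W[i])  # diagonal is still 0.0 here, so this sums off-diagonals
    return W

# Path graph 0 - 1 - 2 - 3.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
W = metropolis_weights(adjacency)
row_sums = [sum(row) for row in W]
col_sums = [sum(W[i][j] for i in range(4)) for j in range(4)]
```

The degree exchange is cheap on a static undirected graph, but it is exactly the kind of pre-optimization handshake that the primal-dual framework described above is designed to avoid.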
Fast decentralized non-convex finite-sum optimization with recursive variance reduction
This paper considers decentralized minimization of smooth non-convex
cost functions equally divided over a directed network of nodes.
Specifically, we describe a stochastic first-order gradient method, called
GT-SARAH, that employs a SARAH-type variance reduction technique and gradient
tracking (GT) to address the stochastic and decentralized nature of the
problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds
an -accurate first-order stationary point with
gradient complexity, where is the spectral gap of the
network weight matrix and is the smoothness parameter of the cost
functions. This gradient complexity outperforms that of the existing
decentralized stochastic gradient methods. In particular, in a big-data regime
such that , this gradient complexity
further reduces to , independent of the
network topology, and matches that of the centralized near-optimal
variance-reduced methods. Moreover, in this regime GT-SARAH achieves a
non-asymptotic linear speedup, in that the total number of gradient
computations at each node is reduced by a factor of compared to the
centralized near-optimal algorithms that perform all gradient computations at a
single node. To the best of our knowledge, GT-SARAH is the first algorithm that
achieves this property. In addition, we show that appropriate choices of local
minibatch size balance the trade-offs between the gradient and communication
complexity of GT-SARAH. Over an infinite time horizon, we establish that all nodes
in GT-SARAH asymptotically achieve consensus and converge to a first-order
stationary point in the almost sure and mean-squared sense.
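The SARAH-type recursive estimator can be sketched on a single node. The paper treats non-convex costs over a network; the version below uses convex toy quadratics and omits gradient tracking, so it only shows the recursion. The data, step size, loop lengths, and the function name `sarah` are assumptions for illustration.

```python
import random

def sarah(a, d, step=0.05, epochs=40, inner=100, seed=0):
    """Single-node SARAH on assumed toy losses f_j(x) = 0.5 * a[j] * (x - d[j])**2."""
    rng = random.Random(seed)
    m = len(a)
    grad = lambda j, x: a[j] * (x - d[j])
    x = 0.0
    for _ in range(epochs):
        # Each outer loop restarts the estimator from a full gradient.
        v = sum(grad(j, x) for j in range(m)) / m
        x_prev, x = x, x - step * v
        for _ in range(inner):
            s = rng.randrange(m)
            # Recursive update: the previous estimator plus the change in one
            # sample gradient between consecutive iterates.
            v = grad(s, x) - grad(s, x_prev) + v
            x_prev, x = x, x - step * v
    return x

a = [1.0, 1.1, 1.25, 1.4, 1.5]
d = [2.0, -1.0, 0.5, 3.0, 1.0]
x_out = sarah(a, d)
grad_norm = abs(sum(ai * (x_out - di) for ai, di in zip(a, d)) / len(a))
```

In contrast to SVRG, the estimator here is anchored to the previous iterate rather than to a fixed snapshot, which is what "recursive" refers to.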
Scaling-up Distributed Processing of Data Streams for Machine Learning
Emerging applications of machine learning in numerous areas involve
continuous gathering of and learning from streams of data. Real-time
incorporation of streaming data into the learned models is essential for
improved inference in these applications. Further, these applications often
involve data that are either inherently gathered at geographically distributed
entities or that are intentionally distributed across multiple machines for
memory, computational, and/or privacy reasons. Training of models in this
distributed, streaming setting requires solving stochastic optimization
problems in a collaborative manner over communication links between the
physical entities. When the streaming data rate is high compared to the
processing capabilities of compute nodes and/or the rate of the communications
links, this poses a challenging question: how can one best leverage the
incoming data for distributed training under constraints on computing
capabilities and/or communications rate? A large body of research has emerged
in recent decades to tackle this and related problems. This paper reviews
recently developed methods that focus on large-scale distributed stochastic
optimization in the compute- and bandwidth-limited regime, with an emphasis on
convergence analysis that explicitly accounts for the mismatch between
computation, communication and streaming rates. In particular, it focuses on
methods that solve: (i) distributed stochastic convex problems, and (ii)
distributed principal component analysis, which is a nonconvex problem with
geometric structure that permits global convergence. For such methods, the
paper discusses recent advances in terms of distributed algorithmic designs
when faced with high-rate streaming data. Further, it reviews guarantees
underlying these methods, which show there exist regimes in which systems can
learn from distributed, streaming data at order-optimal rates.
Comment: 45 pages, 9 figures; preprint of a journal paper published in
Proceedings of the IEEE (Special Issue on Optimization for Data-driven
Learning and Control)
Communication-Efficient Variance-Reduced Decentralized Stochastic Optimization over Time-Varying Directed Graphs
We consider the problem of decentralized optimization over time-varying
directed networks. The network nodes can access only their local objectives,
and aim to collaboratively minimize a global function by exchanging messages
with their neighbors. Leveraging sparsification, gradient tracking and
variance-reduction, we propose a novel communication-efficient decentralized
optimization scheme that is suitable for resource-constrained time-varying
directed networks. We prove that in the case of smooth and strongly-convex
objective functions, the proposed scheme achieves an accelerated linear
convergence rate. To our knowledge, this is the first decentralized
optimization framework for time-varying directed networks that achieves such a
convergence rate and applies to settings requiring sparsified communication.
Experimental results on both synthetic and real datasets verify the theoretical
results and demonstrate the efficacy of the proposed scheme.
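As a simple picture of what sparsified communication means, the sketch below keeps only the largest-magnitude entries of a message before it is sent. Top-k selection is one common sparsifier and is an assumption here; the abstract does not specify which sparsification operator the scheme uses, and the function name `top_k_sparsify` is hypothetical.

```python
def top_k_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

message = [0.1, -3.0, 2.0, 0.5]
sparse_message = top_k_sparsify(message, 2)
```

Only the kept entries and their indices need to be transmitted, which is where the communication saving comes from.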