15 research outputs found

    Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction

    There is growing interest in large-scale machine learning and optimization over decentralized networks, e.g. in the context of multi-agent learning and federated learning. Due to the imminent need to alleviate the communication burden, the investigation of communication-efficient distributed optimization algorithms - particularly for empirical risk minimization - has flourished in recent years. A large fraction of these algorithms have been developed for the master/slave setting, relying on a central parameter server that can communicate with all agents. This paper focuses on distributed optimization over networks, or decentralized optimization, where each agent is only allowed to aggregate information from its neighbors. By properly adjusting the global gradient estimate via local averaging in conjunction with proper correction, we develop a communication-efficient approximate Newton-type method Network-DANE, which generalizes DANE to the decentralized scenarios. Our key ideas can be applied in a systematic manner to obtain decentralized versions of other master/slave distributed algorithms. A notable development is Network-SVRG/SARAH, which employs variance reduction to further accelerate local computation. We establish linear convergence of Network-DANE and Network-SVRG for strongly convex losses, and Network-SARAH for quadratic losses, which shed light on the impacts of data homogeneity, network connectivity, and local averaging upon the rate of convergence. We further extend Network-DANE to composite optimization by allowing a nonsmooth penalty term. Numerical evidence is provided to demonstrate the appealing performance of our algorithms over competitive baselines, in terms of both communication and computation efficiency. Our work suggests that performing a certain amount of local communications and computations per iteration can substantially improve the overall efficiency

    Variance-Reduced Decentralized Stochastic Optimization with Gradient Tracking--Part I: GT-SAGA

    In this paper, we study decentralized empirical risk minimization problems, where the goal is to minimize a finite-sum of smooth and strongly-convex functions available over a network of nodes. In this Part I, we propose \textbf{\texttt{GT-SAGA}}, a decentralized stochastic first-order algorithm based on gradient tracking \cite{DSGT_Pu,DSGT_Xin} and a variance-reduction technique called SAGA \cite{SAGA}. We develop the convergence analysis and the iteration complexity of this algorithm. We further demonstrate various trade-offs and discuss scenarios in which \textbf{\texttt{GT-SAGA}} achieves superior performance (in terms of the number of local gradient computations required) with respect to existing decentralized schemes. In Part II \cite{GT_SVRG} of this two-part paper, we develop and analyze \textbf{\texttt{GT-SVRG}}, a decentralized gradient tracking based implementation of SVRG \cite{SVRG}, another well-known variance-reduction technique
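
As a rough illustration of the gradient-tracking mechanism that GT-SAGA builds on, the sketch below runs the plain deterministic gradient-tracking recursion on a toy problem. It is not GT-SAGA itself (there is no SAGA gradient table and no sampling). The scalar quadratic losses, the 4-node ring, the mixing weights, the step size, and all names (`gradient_tracking`, `a`, `b`, `W`, `alpha`) are assumptions made for the example.

```python
def gradient_tracking(a, b, W, alpha=0.05, iters=2000):
    """Deterministic gradient tracking on f_i(x) = 0.5 * a[i] * (x - b[i])**2.

    Each node i keeps an iterate x[i] and a tracker y[i] that estimates the
    network-average gradient. W is an assumed doubly stochastic mixing matrix.
    """
    n = len(a)

    def grad(i, x):
        return a[i] * (x - b[i])

    x = [0.0] * n
    g = [grad(i, x[i]) for i in range(n)]
    y = g[:]  # trackers start at the local gradients
    for _ in range(iters):
        # Mix with neighbors, then step along the tracked gradient.
        x_new = [sum(W[i][j] * x[j] for j in range(n)) - alpha * y[i]
                 for i in range(n)]
        g_new = [grad(i, x_new[i]) for i in range(n)]
        # Mix the trackers and add the change in the local gradient.
        y = [sum(W[i][j] * y[j] for j in range(n)) + g_new[i] - g[i]
             for i in range(n)]
        x, g = x_new, g_new
    return x


# Toy data (assumed): 4 nodes on a ring, self-weight 1/2, each neighbor 1/4.
a = [1.0, 2.0, 1.5, 1.2]
b = [1.0, -2.0, 0.5, 3.0]
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]

x_final = gradient_tracking(a, b, W)
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of sum_i f_i
print(x_final, x_star)
```

In this toy setting every node's iterate should approach the minimizer of the summed loss, even though each node only evaluates its own gradient and only mixes with its ring neighbors.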

    Newton Method over Networks is Fast up to the Statistical Precision

    We propose a distributed cubic regularization of the Newton method for solving (constrained) empirical risk minimization problems over a network of agents, modeled as an undirected graph. The algorithm employs an inexact, preconditioned Newton step at each agent's side: the gradient of the centralized loss is iteratively estimated via a gradient-tracking consensus mechanism and the Hessian is subsampled over the local data sets. No Hessian matrices are thus exchanged over the network. We derive global complexity bounds for convex and strongly convex losses. Our analysis reveals an interesting interplay between sample and iteration/communication complexity: statistically accurate solutions are achievable in roughly the same number of iterations of the centralized cubic Newton method, with a communication cost per iteration of the order of $\widetilde{\mathcal{O}}\big(1/\sqrt{1-\rho}\big)$, where $\rho$ characterizes the connectivity of the network. This demonstrates a significant communication saving with respect to that of existing, statistically oblivious, distributed Newton-based methods over networks. Comment: In proceedings of the 38th International Conference on Machine Learning, PMLR 139, 202

    Variance-Reduced Decentralized Stochastic Optimization with Gradient Tracking -- Part II: GT-SVRG

    Decentralized stochastic optimization has recently benefited from gradient tracking methods \cite{DSGT_Pu,DSGT_Xin} providing efficient solutions for large-scale empirical risk minimization problems. In Part I \cite{GT_SAGA} of this work, we develop \textbf{\texttt{GT-SAGA}} that is based on a decentralized implementation of SAGA \cite{SAGA} using gradient tracking and discuss regimes of practical interest where \textbf{\texttt{GT-SAGA}} outperforms existing decentralized approaches in terms of the total number of local gradient computations. In this paper, we describe \textbf{\texttt{GT-SVRG}} that develops a decentralized gradient tracking based implementation of SVRG \cite{SVRG}, another well-known variance-reduction technique. We show that the convergence rate of \textbf{\texttt{GT-SVRG}} matches that of \textbf{\texttt{GT-SAGA}} for smooth and strongly-convex functions and highlight different trade-offs between the two algorithms in various settings. Comment: arXiv admin note: text overlap with arXiv:1909.1177

    Gradient tracking and variance reduction for decentralized optimization and machine learning

    Decentralized methods to solve finite-sum minimization problems are important in many signal processing and machine learning tasks where the data is distributed over a network of nodes and raw data sharing is not permitted due to privacy and/or resource constraints. In this article, we review decentralized stochastic first-order methods and provide a unified algorithmic framework that combines variance-reduction with gradient tracking to achieve both robust performance and fast convergence. We provide explicit theoretical guarantees of the corresponding methods when the objective functions are smooth and strongly-convex, and show their applicability to non-convex problems via numerical experiments. Throughout the article, we provide intuitive illustrations of the main technical ideas by casting appropriate tradeoffs and comparisons among the methods of interest and by highlighting applications to decentralized training of machine learning models. Comment: accepted for publication, IEEE Signal Processing Magazin

    Model Linkage Selection for Cooperative Learning

    Rapid developments in data-collecting devices and computation platforms produce a growing number of learners and data modalities in many scientific domains. We consider the setting in which each learner holds a parametric statistical model paired with a specific data source, with the goal of integrating information across a set of learners to enhance the prediction accuracy of a specific learner. One natural way to integrate information is to build a joint model across a set of learners that shares common parameters of interest. However, the parameter sharing patterns across a set of learners are not known a priori. Misspecifying the parameter sharing patterns and the parametric statistical model for each learner yields a biased estimator and degrades the prediction accuracy of the joint model. In this paper, we propose a novel framework for integrating information across a set of learners that is robust against model misspecification and misspecified parameter sharing patterns. The crux is to sequentially incorporate additional learners that can enhance the prediction accuracy of an existing joint model based on user-specified parameter sharing patterns across a set of learners, starting from a model with one learner. Theoretically, we show that the proposed method can data-adaptively select the correct parameter sharing patterns based on user-specified parameter sharing patterns, and thus enhance the prediction accuracy of a learner. Extensive numerical studies are performed to evaluate the performance of the proposed method

    A Primal-Dual Framework for Decentralized Stochastic Optimization

    We consider the decentralized convex optimization problem, where multiple agents must cooperatively minimize a cumulative objective function, with each local function expressible as an empirical average of data-dependent losses. State-of-the-art approaches for decentralized optimization rely on gradient tracking, where consensus is enforced via a doubly stochastic mixing matrix. Construction of such mixing matrices is not straightforward and requires coordination even prior to the start of the optimization algorithm. This paper puts forth a primal-dual framework for decentralized stochastic optimization that obviates the need for such doubly stochastic matrices. Instead, dual variables are maintained to track the disagreement between neighbors. The proposed framework is flexible and is used to develop decentralized variants of SAGA, L-SVRG, SVRG++, and SEGA algorithms. Using a unified proof, we establish that the oracle complexity of these decentralized variants is $O(1/\epsilon)$, matching the complexity bounds obtained for the centralized variants. Additionally, we also present a decentralized primal-dual accelerated SVRG algorithm achieving $O(1/\sqrt{\epsilon})$ oracle complexity, again matching the bound for the centralized accelerated SVRG. Numerical tests on the algorithms establish their superior performance as compared to the variance-reduced gradient tracking algorithms. Comment: 31 pages, 6 Figure
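
To make the coordination cost of a doubly stochastic mixing matrix concrete, here is a minimal sketch of one standard construction, the Metropolis weight rule for undirected graphs. The abstract does not say which construction it has in mind, so the rule, the 3-node path graph, and the name `metropolis_weights` are assumptions made for illustration. Note that node i needs the degree of each neighbor j before it can set its weights, which is the kind of pre-optimization coordination the abstract refers to.

```python
def metropolis_weights(adj):
    """Build a symmetric doubly stochastic matrix from a 0/1 adjacency matrix.

    Assumes an undirected graph (adj is symmetric) with no self-loops.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j]:
                # The edge weight depends on BOTH endpoint degrees.
                W[i][j] = 1.0 / (1 + max(deg[i], deg[j]))
        # The self-weight absorbs the remainder so that row i sums to 1.
        W[i][i] = 1.0 - sum(W[i])
    return W


# Assumed toy graph: the path 0 - 1 - 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
W_path = metropolis_weights(adj)
for row in W_path:
    print(row)
```

Because the resulting matrix is symmetric with unit row sums, its column sums are also one. The rule does not carry over directly to directed graphs, where column-stochasticity is harder to arrange.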

    Fast decentralized non-convex finite-sum optimization with recursive variance reduction

    This paper considers decentralized minimization of $N:=nm$ smooth non-convex cost functions equally divided over a directed network of $n$ nodes. Specifically, we describe a stochastic first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique and gradient tracking (GT) to address the stochastic and decentralized nature of the problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds an $\epsilon$-accurate first-order stationary point with $O\big(\max\big\{N^{\frac{1}{2}},n(1-\lambda)^{-2},n^{\frac{2}{3}}m^{\frac{1}{3}}(1-\lambda)^{-1}\big\}L\epsilon^{-2}\big)$ gradient complexity, where $(1-\lambda)\in(0,1]$ is the spectral gap of the network weight matrix and $L$ is the smoothness parameter of the cost functions. This gradient complexity outperforms that of the existing decentralized stochastic gradient methods. In particular, in a big-data regime such that $n = O(N^{\frac{1}{2}}(1-\lambda)^{3})$, this gradient complexity further reduces to $O(N^{\frac{1}{2}}L\epsilon^{-2})$, independent of the network topology, and matches that of the centralized near-optimal variance-reduced methods. Moreover, in this regime GT-SARAH achieves a non-asymptotic linear speedup, in that the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the centralized near-optimal algorithms that perform all gradient computations at a single node. To the best of our knowledge, GT-SARAH is the first algorithm that achieves this property. In addition, we show that appropriate choices of local minibatch size balance the trade-offs between the gradient and communication complexity of GT-SARAH. Over an infinite time horizon, we establish that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in the almost sure and mean-squared sense
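
For readers unfamiliar with the SARAH-type recursive estimator mentioned above, the following single-node sketch shows only that ingredient. It is not GT-SARAH: there is no network and no gradient tracking, and the toy losses are convex scalar quadratics rather than the non-convex functions the paper treats. The function name `sarah`, the data, the step size, and the loop lengths are all assumptions made for the example.

```python
import random


def sarah(a, b, alpha=0.1, outer=50, inner=20, seed=0):
    """Single-node SARAH on F(x) = (1/n) * sum_i 0.5 * a[i] * (x - b[i])**2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(i, x):
        return a[i] * (x - b[i])

    def full_grad(x):
        return sum(grad_i(i, x) for i in range(n)) / n

    x = 0.0
    for _ in range(outer):
        # Each outer loop restarts the estimator from an exact full gradient.
        v = full_grad(x)
        x_prev, x = x, x - alpha * v
        for _ in range(inner):
            i = rng.randrange(n)
            # Recursive update: correct the previous estimate with the change
            # in one sampled component gradient between consecutive iterates.
            v = grad_i(i, x) - grad_i(i, x_prev) + v
            x_prev, x = x, x - alpha * v
    return x


# Assumed toy data.
a = [1.0, 2.0, 1.5, 1.2]
b = [1.0, -2.0, 0.5, 3.0]
x_sarah = sarah(a, b)
x_opt = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)  # minimizer of F
print(x_sarah, x_opt)
```

Unlike SVRG, which always compares against a fixed snapshot point, the recursive estimator compares against the immediately preceding iterate, so each inner step costs two component-gradient evaluations and no stored gradient table.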

    Scaling-up Distributed Processing of Data Streams for Machine Learning

    Emerging applications of machine learning in numerous areas involve continuous gathering of and learning from streams of data. Real-time incorporation of streaming data into the learned models is essential for improved inference in these applications. Further, these applications often involve data that are either inherently gathered at geographically distributed entities or that are intentionally distributed across multiple machines for memory, computational, and/or privacy reasons. Training of models in this distributed, streaming setting requires solving stochastic optimization problems in a collaborative manner over communication links between the physical entities. When the streaming data rate is high compared to the processing capabilities of compute nodes and/or the rate of the communications links, this poses a challenging question: how can one best leverage the incoming data for distributed training under constraints on computing capabilities and/or communications rate? A large body of research has emerged in recent decades to tackle this and related problems. This paper reviews recently developed methods that focus on large-scale distributed stochastic optimization in the compute- and bandwidth-limited regime, with an emphasis on convergence analysis that explicitly accounts for the mismatch between computation, communication and streaming rates. In particular, it focuses on methods that solve: (i) distributed stochastic convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence. For such methods, the paper discusses recent advances in terms of distributed algorithmic designs when faced with high-rate streaming data. 
    Further, it reviews guarantees underlying these methods, which show there exist regimes in which systems can learn from distributed, streaming data at order-optimal rates. Comment: 45 pages, 9 figures; preprint of a journal paper published in Proceedings of the IEEE (Special Issue on Optimization for Data-driven Learning and Control

    Communication-Efficient Variance-Reduced Decentralized Stochastic Optimization over Time-Varying Directed Graphs

    We consider the problem of decentralized optimization over time-varying directed networks. The network nodes can access only their local objectives, and aim to collaboratively minimize a global function by exchanging messages with their neighbors. Leveraging sparsification, gradient tracking and variance-reduction, we propose a novel communication-efficient decentralized optimization scheme that is suitable for resource-constrained time-varying directed networks. We prove that in the case of smooth and strongly-convex objective functions, the proposed scheme achieves an accelerated linear convergence rate. To our knowledge, this is the first decentralized optimization framework for time-varying directed networks that achieves such a convergence rate and applies to settings requiring sparsified communication. Experimental results on both synthetic and real datasets verify the theoretical results and demonstrate efficacy of the proposed scheme