5 research outputs found

    Distributed Gradient Methods for Nonconvex Optimization: Local and Global Convergence Guarantees

    The article discusses distributed gradient-descent algorithms for computing local and global minima in nonconvex optimization. For local optimization, we focus on distributed stochastic gradient descent (D-SGD)--a simple network-based variant of classical SGD. We discuss local minima convergence guarantees and explore the simple but critical role of the stable-manifold theorem in analyzing saddle-point avoidance. For global optimization, we discuss annealing-based methods in which slowly decaying noise is added to D-SGD. Conditions are discussed under which convergence to global minima is guaranteed. Numerical examples illustrate the key concepts in the paper.
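
    The D-SGD idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's algorithm or experiments: the ring topology, the mixing weights (1/2 for the node itself, 1/4 for each neighbor), the decaying step size, the Gaussian gradient noise, and the toy scalar objectives are all chosen here for illustration, and the name `dsgd_ring` is hypothetical. Each node averages its iterate with its neighbors and then takes a noisy step along its own local gradient.

```python
import random


def dsgd_ring(local_grads, x0, steps=2000, step0=0.1, noise_std=0.1, seed=0):
    """Run a D-SGD-style iteration on a ring and return the per-node iterates.

    Assumption: the ring topology, mixing weights, and step-size rule are
    illustrative choices, not taken from the article.
    """
    rng = random.Random(seed)
    n = len(local_grads)
    x = [float(x0)] * n
    for k in range(steps):
        alpha = step0 / (k + 1) ** 0.5  # decaying step size
        # Consensus step: each node averages with its two ring neighbors.
        mixed = [
            0.5 * x[i] + 0.25 * x[(i - 1) % n] + 0.25 * x[(i + 1) % n]
            for i in range(n)
        ]
        # Local step: noisy gradient evaluated at the node's own iterate.
        x = [
            mixed[i] - alpha * (local_grads[i](x[i]) + rng.gauss(0.0, noise_std))
            for i in range(n)
        ]
    return x


# Toy nonconvex problem: node i holds f_i(x) = x^4/4 - x^2/2 + b_i * x.
# The biases sum to zero, so the average objective is x^4/4 - x^2/2,
# which has local minima at x = -1 and x = +1 and a local maximum at x = 0.
biases = [0.2, -0.1, 0.1, -0.2]
local_grads = [lambda x, b=b: x**3 - x + b for b in biases]
final = dsgd_ring(local_grads, x0=0.5)
```

    Starting from 0.5, the nodes should end up in near agreement and close to the minimizer at +1 of the average objective, even though no single node's own objective is minimized there.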

    Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond

    Distributed learning has become a critical enabler of the massively connected world envisioned by many. This article discusses four key elements of scalable distributed processing and real-time intelligence --- problems, data, communication and computation. Our aim is to provide a fresh and unique perspective on how these elements should work together in an effective and coherent manner. In particular, we provide a selective review of recent techniques developed for optimizing non-convex models (i.e., problem classes) and processing batch and streaming data (i.e., data types) over networks in a distributed manner (i.e., communication and computation paradigm). We describe the intuitions and connections behind a core set of popular distributed algorithms, emphasizing how to trade off between computation and communication costs. Practical issues and future research directions will also be discussed. Comment: Submitted to IEEE Signal Processing Magazine Special Issue on Distributed, Streaming Machine Learning; THC, MH, HTW contributed equall

    Communication-Efficient Local Decentralized SGD Methods

    Recently, the technique of local updates has become a powerful tool in centralized settings for improving communication efficiency via periodic communication. For decentralized settings, it is still unclear how to efficiently combine local updates and decentralized communication. In this work, we propose an algorithm named LD-SGD, which incorporates arbitrary update schemes that alternate between multiple Local updates and multiple Decentralized SGDs, and provide an analytical framework for LD-SGD. Under the framework, we present a sufficient condition to guarantee convergence. We show that LD-SGD converges to a critical point for a wide range of update schemes when the objective is non-convex and the training data are not independent and identically distributed. Moreover, our framework brings many insights into the design of update schemes for decentralized optimization. As examples, we specify two update schemes and show how they help improve communication efficiency. Specifically, the first scheme alternates between a fixed number of local update steps and a fixed number of decentralized SGD steps. From our analysis, the ratio of the number of local updates to that of decentralized SGD steps trades off communication and computation. The second scheme periodically shrinks the length of local updates. We show that the decaying strategy helps improve communication efficiency both theoretically and empirically.
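
    A rough sketch of the first update scheme described above, in which a run of local steps alternates with a run of steps that also communicate. Everything concrete here is an assumption made for illustration: the function names `communication_schedule` and `ld_sgd`, the ring mixing weights, the constant step size, and the quadratic toy objectives. The gradient oracles are exact so that the result is reproducible; a stochastic oracle could be passed in the same way.

```python
def communication_schedule(total_steps, local_steps, decentralized_steps):
    """Return a list of booleans; True means the nodes mix at that step."""
    period = local_steps + decentralized_steps
    return [(t % period) >= local_steps for t in range(total_steps)]


def ld_sgd(local_grads, x0, schedule, step=0.05):
    """Alternate purely local steps with steps followed by neighbor mixing.

    Assumption: ring topology with weights 1/2 (self) and 1/4 (each neighbor).
    """
    n = len(local_grads)
    x = [float(x0)] * n
    for communicate in schedule:
        # Every step starts with a gradient step on the node's own objective.
        x = [x[i] - step * local_grads[i](x[i]) for i in range(n)]
        if communicate:
            # Decentralized step: additionally average with ring neighbors.
            x = [
                0.5 * x[i] + 0.25 * x[(i - 1) % n] + 0.25 * x[(i + 1) % n]
                for i in range(n)
            ]
    return x


# Toy heterogeneous data: node i holds f_i(x) = 0.5 * (x - a_i)^2.
targets = [1.0, 2.0, 3.0, 4.0]
local_grads = [lambda x, a=a: x - a for a in targets]

# Three local steps, then one communicating step, repeated.
schedule = communication_schedule(400, local_steps=3, decentralized_steps=1)
final = ld_sgd(local_grads, x0=0.0, schedule=schedule)
```

    With this schedule only a quarter of the steps use the network. Raising `local_steps` lowers communication further, at the price of letting the nodes drift apart between mixing rounds.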

    Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: A Joint Gradient Estimation and Tracking Approach

    Many modern large-scale machine learning problems benefit from decentralized and stochastic optimization. Recent works have shown that utilizing both decentralized computing and local stochastic gradient estimates can outperform state-of-the-art centralized algorithms, in applications involving highly non-convex problems, such as training deep neural networks. In this work, we propose a decentralized stochastic algorithm to deal with certain smooth non-convex problems where there are $m$ nodes in the system, and each node has a large number of samples (denoted as $n$). Unlike the majority of the existing decentralized learning algorithms for either stochastic or finite-sum problems, our focus is on reducing the total number of communication rounds among the nodes while accessing the minimum number of local data samples. In particular, we propose an algorithm named D-GET (decentralized gradient estimation and tracking), which jointly performs decentralized gradient estimation (which estimates the local gradient using a subset of local samples) and gradient tracking (which tracks the global full gradient using local estimates). We show that, to achieve a certain $\epsilon$-stationary solution of the deterministic finite sum problem, the proposed algorithm achieves an $\mathcal{O}(mn^{1/2}\epsilon^{-1})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity. These bounds significantly improve upon the best existing bounds of $\mathcal{O}(mn\epsilon^{-1})$ and $\mathcal{O}(\epsilon^{-1})$, respectively. Similarly, for online problems, the proposed method achieves an $\mathcal{O}(m\epsilon^{-3/2})$ sample complexity and an $\mathcal{O}(\epsilon^{-1})$ communication complexity, while the best existing bounds are $\mathcal{O}(m\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-2})$, respectively.
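
    The gradient-tracking component mentioned above can be illustrated with the standard tracking recursion. This sketch is not D-GET itself: it omits the sample-based gradient estimation step entirely and uses exact local gradients, and the names `mix` and `gradient_tracking`, the ring mixing weights, the step size, and the quadratic toy objectives are assumptions made for illustration. Each node keeps an auxiliary variable `y` that tracks the network-wide average gradient using only neighbor communication.

```python
def mix(values):
    """Average each entry with its two ring neighbors (assumed topology)."""
    n = len(values)
    return [
        0.5 * values[i] + 0.25 * values[(i - 1) % n] + 0.25 * values[(i + 1) % n]
        for i in range(n)
    ]


def gradient_tracking(local_grads, x0, steps=500, step=0.1):
    """Standard gradient tracking with exact local gradients.

    x_next = mix(x) - step * y
    y_next = mix(y) + grad(x_next) - grad(x)
    """
    n = len(local_grads)
    x = [float(x0)] * n
    g = [local_grads[i](x[i]) for i in range(n)]
    y = list(g)  # the tracker starts at the local gradients
    for _ in range(steps):
        x_new = [m - step * y_i for m, y_i in zip(mix(x), y)]
        g_new = [local_grads[i](x_new[i]) for i in range(n)]
        # Correct the mixed tracker by the change in the local gradient.
        y = [m + gn - go for m, gn, go in zip(mix(y), g_new, g)]
        x, g = x_new, g_new
    return x, y


# Toy heterogeneous data: node i holds f_i(x) = 0.5 * (x - a_i)^2,
# so the average objective is minimized at the mean of the a_i, 2.5.
targets = [1.0, 2.0, 3.0, 4.0]
local_grads = [lambda x, a=a: x - a for a in targets]
x_final, y_final = gradient_tracking(local_grads, x0=0.0)
```

    On this toy problem every node reaches the minimizer of the average objective with a constant step size, and the trackers go to zero, which is the behavior gradient tracking is designed to provide.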

    Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview

    Substantial progress has been made recently on developing provably accurate and efficient algorithms for low-rank matrix factorization via nonconvex optimization. While conventional wisdom often takes a dim view of nonconvex optimization algorithms due to their susceptibility to spurious local minima, simple iterative methods such as gradient descent have been remarkably successful in practice. The theoretical footings, however, had been largely lacking until recently. In this tutorial-style overview, we highlight the important role of statistical models in enabling efficient nonconvex optimization with performance guarantees. We review two contrasting approaches: (1) two-stage algorithms, which consist of a tailored initialization step followed by successive refinement; and (2) global landscape analysis and initialization-free algorithms. Several canonical matrix factorization problems are discussed, including but not limited to matrix sensing, phase retrieval, matrix completion, blind deconvolution, robust principal component analysis, phase synchronization, and joint alignment. Special care is taken to illustrate the key technical insights underlying their analyses. This article serves as a testament that the integrated consideration of optimization and statistics leads to fruitful research findings. Comment: Invited overview articl
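
    As a small illustration of gradient descent on a nonconvex factorization objective, the sketch below fits a rank-one factor to a fully observed symmetric matrix. It is a toy under stated assumptions, not one of the problems or algorithms surveyed in the overview: the objective $f(u) = \tfrac{1}{4}\|uu^\top - M\|_F^2$, the fixed small initialization, the step size, and the name `factorize_rank_one` are all chosen here for illustration.

```python
def factorize_rank_one(M, u0, steps=300, step=0.02):
    """Gradient descent on f(u) = 0.25 * ||u u^T - M||_F^2 for symmetric M."""
    n = len(M)
    u = [float(v) for v in u0]
    for _ in range(steps):
        norm_sq = sum(ui * ui for ui in u)
        Mu = [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
        # For symmetric M the gradient is (u u^T - M) u = ||u||^2 u - M u.
        grad = [norm_sq * u[i] - Mu[i] for i in range(n)]
        u = [u[i] - step * grad[i] for i in range(n)]
    return u


# Ground truth: M = v v^T with v = (1, 2, 2).
v = [1.0, 2.0, 2.0]
M = [[vi * vj for vj in v] for vi in v]

# Small initialization with a positive component along v.
u = factorize_rank_one(M, u0=[0.1, 0.1, 0.1])
```

    The objective is nonconvex (both `v` and `-v` are global minimizers and the origin is a stationary point), yet from this initialization plain gradient descent recovers `v`. Which sign is recovered depends on the sign of the initial inner product with `v`.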