5 research outputs found
Distributed Gradient Methods for Nonconvex Optimization: Local and Global Convergence Guarantees
The article discusses distributed gradient-descent algorithms for computing
local and global minima in nonconvex optimization. For local optimization, we
focus on distributed stochastic gradient descent (D-SGD)--a simple
network-based variant of classical SGD. We discuss local minima convergence
guarantees and explore the simple but critical role of the stable-manifold
theorem in analyzing saddle-point avoidance. For global optimization, we
discuss annealing-based methods in which slowly decaying noise is added to
D-SGD. Conditions are discussed under which convergence to global minima is
guaranteed. Numerical examples illustrate the key concepts in the paper
Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond
Distributed learning has become a critical enabler of the massively connected
world envisioned by many. This article discusses four key elements of scalable
distributed processing and real-time intelligence --- problems, data,
communication and computation. Our aim is to provide a fresh and unique
perspective about how these elements should work together in an effective and
coherent manner. In particular, we provide a selective review of the
recent techniques developed for optimizing non-convex models (i.e., problem
classes), processing batch and streaming data (i.e., data types), over networks
in a distributed manner (i.e., the communication and computation
paradigm). We describe the intuitions and connections behind a core set of
popular distributed algorithms, emphasizing how to trade off between
computation and communication costs. Practical issues and future research
directions will also be discussed.
Comment: Submitted to IEEE Signal Processing Magazine Special Issue on
Distributed, Streaming Machine Learning; THC, MH, HTW contributed equally.
Communication-Efficient Local Decentralized SGD Methods
The technique of local updates has recently become a powerful tool in centralized
settings for improving communication efficiency via periodic communication. For
decentralized settings, it is still unclear how to efficiently combine local
updates and decentralized communication. In this work, we propose an algorithm
named LD-SGD, which incorporates arbitrary update schemes that alternate
between multiple Local updates and multiple Decentralized SGDs, and provide an
analytical framework for LD-SGD. Under the framework, we present a sufficient
condition to guarantee the convergence. We show that LD-SGD converges to a
critical point for a wide range of update schemes when the objective is
non-convex and the training data are not independent and identically distributed (non-i.i.d.).
Moreover, our framework brings many insights into the design of update schemes
for decentralized optimization. As examples, we specify two update schemes and
show how they help improve communication efficiency. Specifically, the first
scheme alternates between a fixed number of local update steps and a fixed
number of global (decentralized SGD) steps; our analysis shows that the ratio
of local updates to decentralized SGD steps trades off communication against
computation. The second scheme periodically shrinks the length of the local
update phase. We show, both theoretically and empirically, that this decaying
strategy helps improve communication efficiency.
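As a rough illustration of the alternation between local updates and decentralized SGD steps, and of the decaying-length strategy, the sketch below is a simplified rendering under assumed names (ld_sgd, grads, W, decay_local); the exact update schemes and schedules analyzed in the paper may differ.

```python
import numpy as np

def ld_sgd(grads, W, x0, rounds=50, local_steps=8, decent_steps=2,
           lr=0.05, decay_local=False):
    """Illustrative LD-SGD-style alternation (not the paper's exact schemes).

    Each round runs `local_steps` purely local SGD updates (no communication),
    then `decent_steps` decentralized SGD steps (gossip averaging + gradient).
    With decay_local=True the local phase shrinks over rounds, mimicking the
    decaying-length strategy mentioned in the abstract.
    """
    n = W.shape[0]
    X = np.tile(x0, (n, 1))                       # row i holds node i's iterate
    for r in range(rounds):
        I1 = max(1, local_steps // (r + 1)) if decay_local else local_steps
        for _ in range(I1):                       # local updates: computation only
            X = X - lr * np.stack([g(X[i]) for i, g in enumerate(grads)])
        for _ in range(decent_steps):             # decentralized SGD: communicate, then update
            X = W @ X - lr * np.stack([g(X[i]) for i, g in enumerate(grads)])
    return X.mean(axis=0)
```

The ratio local_steps / decent_steps is the knob that trades communication for computation: larger local phases mean fewer gossip rounds per gradient evaluation.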
Improving the Sample and Communication Complexity for Decentralized Non-Convex Optimization: A Joint Gradient Estimation and Tracking Approach
Many modern large-scale machine learning problems benefit from decentralized
and stochastic optimization. Recent works have shown that utilizing both
decentralized computing and local stochastic gradient estimates can outperform
state-of-the-art centralized algorithms, in applications involving highly
non-convex problems, such as training deep neural networks.
In this work, we propose a decentralized stochastic algorithm to deal with
certain smooth non-convex problems in which each node of a network holds a
large number of local samples. Unlike the
majority of the existing decentralized learning algorithms for either
stochastic or finite-sum problems, our focus is on simultaneously reducing the
total number of communication rounds among the nodes and the number of local
data samples accessed. In particular, we propose an algorithm named D-GET
(decentralized gradient estimation and tracking), which jointly performs
decentralized gradient estimation (which estimates the local gradient using a
subset of local samples) and gradient tracking (which tracks the global full
gradient using local estimates). We show that, to reach a stationary solution
of the deterministic finite-sum problem to a given accuracy, the proposed
algorithm attains sample and communication complexities that significantly
improve upon the best existing bounds. Similar improvements in both sample and
communication complexity are established for online (stochastic) problems.
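The sketch below illustrates the two ingredients named above: a local gradient estimate that is refreshed from all local samples only periodically and otherwise updated from a small subset (estimation), combined with a consensus variable that tracks the network-average gradient (tracking). The function names, refresh schedule, and estimator form are assumptions for illustration, not the paper's exact D-GET updates.

```python
import numpy as np

def d_get(full_grads, batch_grads, W, x0, iters=200, lr=0.02, refresh=10):
    """Illustrative joint gradient estimation and tracking (D-GET-style sketch).

    full_grads[i](x)         : node i's gradient over all of its local samples
    batch_grads[i](x, x_old) : node i's minibatch estimate of grad(x) - grad(x_old)
    W                        : (n, n) doubly stochastic mixing matrix
    Every `refresh` iterations the local estimate V is recomputed from all local
    samples; otherwise it is updated recursively from a small subset (estimation).
    Y tracks the network-wide average gradient (tracking).
    """
    n = W.shape[0]
    X = np.tile(x0, (n, 1))                                      # node iterates
    V = np.stack([g(X[i]) for i, g in enumerate(full_grads)])    # local gradient estimates
    Y = V.copy()                                                 # tracking variable
    for t in range(1, iters + 1):
        X_new = W @ X - lr * Y                                   # consensus step + tracked descent
        if t % refresh == 0:
            V_new = np.stack([g(X_new[i]) for i, g in enumerate(full_grads)])
        else:
            V_new = V + np.stack([g(X_new[i], X[i]) for i, g in enumerate(batch_grads)])
        Y = W @ Y + V_new - V                                    # gradient tracking update
        X, V = X_new, V_new
    return X.mean(axis=0)
```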
Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Substantial progress has been made recently on developing provably accurate
and efficient algorithms for low-rank matrix factorization via nonconvex
optimization. While conventional wisdom often takes a dim view of nonconvex
optimization algorithms due to their susceptibility to spurious local minima,
simple iterative methods such as gradient descent have been remarkably
successful in practice. The theoretical footings, however, had been largely
lacking until recently.
In this tutorial-style overview, we highlight the important role of
statistical models in enabling efficient nonconvex optimization with
performance guarantees. We review two contrasting approaches: (1) two-stage
algorithms, which consist of a tailored initialization step followed by
successive refinement; and (2) global landscape analysis and
initialization-free algorithms. Several canonical matrix factorization problems
are discussed, including but not limited to matrix sensing, phase retrieval,
matrix completion, blind deconvolution, robust principal component analysis,
phase synchronization, and joint alignment. Special care is taken to illustrate
the key technical insights underlying their analyses. This article serves as a
testament that the integrated consideration of optimization and statistics
leads to fruitful research findings.
Comment: Invited overview article.
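To illustrate the two-stage recipe (tailored initialization followed by successive refinement) in its simplest form, here is a sketch for factorizing a symmetric positive semidefinite matrix: spectral initialization from the top-r eigenpairs, then gradient descent on the factored least-squares objective. The function name, step-size choice, and synthetic example are illustrative assumptions, not taken from the article.

```python
import numpy as np

def lowrank_factorize(M, r, iters=500, lr=None):
    """Two-stage sketch for a symmetric PSD matrix M: spectral initialization,
    then gradient descent on f(X) = 0.25 * ||X X^T - M||_F^2 (illustrative only)."""
    # Stage 1: tailored initialization from the top-r eigenpairs of M.
    vals, vecs = np.linalg.eigh(M)
    idx = np.argsort(vals)[::-1][:r]
    X = vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
    # Stage 2: successive refinement by gradient descent on the factored objective.
    lr = lr if lr is not None else 0.5 / max(vals.max(), 1e-12)   # conservative step size
    for _ in range(iters):
        grad = (X @ X.T - M) @ X          # gradient of f at X (M symmetric)
        X = X - lr * grad
    return X

# Usage sketch on a synthetic rank-2 PSD matrix.
rng = np.random.default_rng(0)
U = rng.standard_normal((20, 2))
M = U @ U.T
X = lowrank_factorize(M, r=2)
print(np.linalg.norm(X @ X.T - M))        # residual should be near zero
```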