7,724 research outputs found
More Iterations per Second, Same Quality -- Why Asynchronous Algorithms may Drastically Outperform Traditional Ones
In this paper, we consider the convergence of a very general
asynchronous-parallel algorithm called ARock, which takes many well-known
asynchronous algorithms as special cases (gradient descent, proximal gradient,
Douglas-Rachford, ADMM, etc.). In asynchronous-parallel algorithms, the
computing nodes simply use the most recent information that they have access
to, instead of waiting for a full update from all nodes in the system. This
means that nodes do not have to waste time waiting for information, which can
be a major bottleneck, especially in distributed systems. When the system has
$p$ nodes, asynchronous algorithms may complete $\Theta(\ln(p))$ more
iterations than synchronous algorithms in a given time period ("more iterations
per second").
Although asynchronous algorithms may compute more iterations per second,
there is error associated with using outdated information. How many more
iterations in total are needed to compensate for this error is still an open
question. The main results of this paper aim to answer this question. We prove,
loosely, that as the size of the problem becomes large, the number of
additional iterations that asynchronous algorithms need becomes negligible
compared to the total number ("same quality" of the iterations). Taking these
facts together, our results provide solid evidence of the potential of
asynchronous algorithms to vastly speed up certain distributed computations.
Comment: 29 pages
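The "more iterations per second" effect can be illustrated with a toy timing model (a sketch of ours, not from the paper): with heterogeneous workers, a synchronous scheme advances at the pace of the slowest worker each round, while an asynchronous scheme lets every worker contribute updates as fast as it can produce them.

```python
# Toy timing model (illustrative numbers): count how many updates p = 4
# heterogeneous workers complete within a fixed time budget.

def sync_iterations(times, budget):
    """Synchronous: each round waits for the slowest worker."""
    rounds = int(budget // max(times))
    return rounds * len(times)

def async_iterations(times, budget):
    """Asynchronous: each worker proceeds at its own pace."""
    return sum(int(budget // t) for t in times)

times = [1.0, 1.2, 1.5, 3.0]   # per-iteration cost of each worker
budget = 30.0
print(sync_iterations(times, budget))   # 40
print(async_iterations(times, budget))  # 85
```

Whether those extra iterations translate into faster convergence is exactly the "same quality" question the paper answers.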
Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent
Asynchronous parallel optimization algorithms for solving large-scale machine
learning problems have drawn significant attention from both academia and industry
recently. This paper proposes a novel algorithm, decoupled asynchronous
proximal stochastic gradient descent (DAP-SGD), to minimize an objective
function that is the composite of the average of multiple empirical losses and
a regularization term. Unlike the traditional asynchronous proximal stochastic
gradient descent (TAP-SGD) in which the master carries much of the computation
load, the proposed algorithm off-loads the majority of computation tasks from
the master to workers, and leaves the master to conduct simple addition
operations. This strategy yields an easy-to-parallelize algorithm, whose
performance is justified by theoretical convergence analyses. To be specific,
DAP-SGD achieves an $O(\log T/T)$ rate when the step-size is diminishing and an
$O(1/\sqrt{T})$ ergodic rate when the step-size is constant, where $T$ is the
total number of iterations.
Comment: 19 pages
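The decoupling can be sketched in a few lines (a serial sketch of ours, with an $\ell_1$ regularizer whose proximal operator is soft-thresholding; names and numbers are illustrative, not from the paper): the worker performs both the gradient step and the proximal step, and the master only adds the resulting increment.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def worker_step(x, grad, eta, lam):
    """Worker does the heavy lifting: a gradient step followed by the
    proximal step, shipping only the increment to the master."""
    return soft_threshold(x - eta * grad, eta * lam) - x

def master_apply(x, delta):
    """Master performs only a simple addition, as in DAP-SGD."""
    return x + delta

# Toy composite problem: minimize 0.5*||x - b||^2 + lam*||x||_1,
# whose solution is soft_threshold(b, lam).
b = np.array([1.0, -0.3, 0.02])
x = np.zeros(3)
for _ in range(200):
    grad = x - b                      # gradient of the smooth part
    x = master_apply(x, worker_step(x, grad, eta=0.1, lam=0.05))
print(np.round(x, 3))                 # ≈ soft_threshold(b, 0.05)
```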
A Model Parallel Proximal Stochastic Gradient Algorithm for Partially Asynchronous Systems
Large models are prevalent in modern machine learning scenarios, including
deep learning, recommender systems, etc., which can have millions or even
billions of parameters. Parallel algorithms have become an essential solution
technique to many large-scale machine learning jobs. In this paper, we propose
a model parallel proximal stochastic gradient algorithm, AsyB-ProxSGD, to deal
with large models using model-parallel blockwise updates while simultaneously
handling a large amount of training data using proximal stochastic gradient
descent (ProxSGD). In our algorithm, worker nodes communicate with the
parameter servers asynchronously, and each worker performs a proximal stochastic
gradient update for only one block of model parameters during each iteration. Our
proposed algorithm generalizes ProxSGD to the asynchronous and model parallel
setting. We prove that AsyB-ProxSGD achieves a convergence rate of
$O(1/\sqrt{K})$ to stationary points for nonconvex problems under
\emph{constant} minibatch sizes, where $K$ is the total number of block
updates. This rate matches the best-known rates of convergence for a wide range
of gradient-like algorithms. Furthermore, we show that when the number of
workers is bounded by $O(K^{1/4})$, we can expect AsyB-ProxSGD to achieve
linear speedup as the number of workers increases. We implement the proposed
algorithm on MXNet and demonstrate its convergence behavior and near-linear
speedup on a real-world dataset involving both a large model size and large
amounts of data.
Comment: arXiv admin note: substantial text overlap with arXiv:1802.0888
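The blockwise update pattern can be sketched as follows (a serial sketch under our own toy objective; in AsyB-ProxSGD these block updates are issued asynchronously by different workers against parameter servers):

```python
import numpy as np
rng = np.random.default_rng(0)

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def block_prox_step(x, blk, grad_blk, eta, lam):
    """Proximal gradient update for a single block, leaving the rest of the
    model untouched."""
    x = x.copy()
    x[blk] = soft_threshold(x[blk] - eta * grad_blk, eta * lam)
    return x

# Toy composite problem: minimize 0.5*||x - b||^2 + lam*||x||_1 with the
# model split into two blocks; each step updates one randomly chosen block.
b = np.array([1.0, -0.5, 0.2, 0.0])
blocks = [slice(0, 2), slice(2, 4)]
x = np.zeros_like(b)
for _ in range(600):
    blk = blocks[rng.integers(len(blocks))]   # a worker owns one block
    grad_blk = x[blk] - b[blk]                # block gradient of smooth part
    x = block_prox_step(x, blk, grad_blk, eta=0.1, lam=0.05)
print(np.round(x, 3))                         # ≈ soft_threshold(b, 0.05)
```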
Asynchronous ADMM for Distributed Non-Convex Optimization in Power Systems
Large scale, non-convex optimization problems arising in many complex
networks such as the power system call for efficient and scalable distributed
optimization algorithms. Existing distributed methods are usually iterative and
require synchronization of all workers at each iteration, which is hard to
scale and could result in the under-utilization of computation resources due to
the heterogeneity of the subproblems. To address those limitations of
synchronous schemes, this paper proposes an asynchronous distributed
optimization method based on the Alternating Direction Method of Multipliers
(ADMM) for non-convex optimization. The proposed method only requires local
communications and allows each worker to perform local updates with information
from a subset of but not all neighbors. We provide sufficient conditions on the
problem formulation, the choice of algorithm parameters, and the network delay, and
show that under those mild conditions, the proposed asynchronous ADMM method
asymptotically converges to the KKT point of the non-convex problem. We
validate the effectiveness of asynchronous ADMM by applying it to the Optimal
Power Flow problem in multiple power systems and show that the convergence of
the proposed asynchronous scheme could be faster than its synchronous
counterpart in large-scale applications.
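For intuition, the partial-update pattern can be mimicked on a toy convex consensus problem (a sketch of ours; the paper's setting is non-convex Optimal Power Flow, and all numbers below are illustrative): in each round only a random subset of workers refreshes its local and dual variables.

```python
import numpy as np
rng = np.random.default_rng(1)

# Consensus ADMM for min_x sum_i 0.5*(x - a_i)^2, whose solution is mean(a).
a = np.array([1.0, 2.0, 6.0])
rho = 1.0
x = np.zeros(3)      # local copies held by the workers
u = np.zeros(3)      # scaled dual variables
z = 0.0              # consensus variable

for _ in range(500):
    active = rng.random(3) < 0.7      # only the workers that happen to finish
    x[active] = (a[active] + rho * (z - u[active])) / (1 + rho)
    z = np.mean(x + u)                # coordinator update is a simple average
    u[active] += x[active] - z
print(round(float(z), 4))             # ≈ 3.0 (= mean(a))
```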
Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization
We provide the first theoretical analysis on the convergence rate of the
asynchronous stochastic variance reduced gradient (SVRG) descent algorithm on
non-convex optimization. Recent studies have shown that the asynchronous
stochastic gradient descent (SGD) based algorithms with variance reduction
converge with a linear convergence rate on convex problems. However, there is
no existing analysis of asynchronous SGD with the variance reduction technique
on non-convex problems. In this paper, we study two asynchronous parallel
implementations of SVRG: one is on a distributed memory system and the other is
on a shared memory system. We provide a theoretical analysis showing that both
algorithms can obtain a convergence rate of $O(1/T)$, and that linear speedup
is achievable if the number of workers is upper bounded.
Comment: v1, v2, and v3 have been withdrawn due to a reference issue; because of arXiv policy, we cannot delete them. Please refer to the newest version, v4.
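The variance-reduced gradient at the heart of both implementations can be sketched serially (our toy least-squares data and step size; the asynchronous versions run the inner loop across workers on stale iterates):

```python
import numpy as np
rng = np.random.default_rng(2)

# SVRG for min_w (1/n) * sum_i 0.5*(a_i*w - b_i)^2.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.1, 5.9, 8.2])
n = len(a)
w_star = (a @ b) / (a @ a)           # closed-form least-squares solution

w_tilde = 0.0                        # snapshot point
for epoch in range(30):
    mu = np.mean((a * w_tilde - b) * a)     # full gradient at the snapshot
    w = w_tilde
    for _ in range(2 * n):                  # cheap stochastic inner loop
        i = rng.integers(n)
        g = (a[i] * w - b[i]) * a[i] \
            - (a[i] * w_tilde - b[i]) * a[i] + mu   # variance-reduced gradient
        w -= 0.02 * g
    w_tilde = w
print(abs(w_tilde - w_star) < 1e-6)  # True: converges to the exact solution
```

Because the correction term vanishes at the optimum, SVRG tolerates a constant step size, which is what makes its asynchronous variants attractive.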
Impact of Communication Delay on Asynchronous Distributed Optimal Power Flow Using ADMM
Distributed optimization has attracted considerable attention in the operation of
power systems in recent years, where a large area is decomposed into smaller
control regions each solving a local optimization problem with periodic
information exchange with neighboring regions. However, most distributed
optimization methods are iterative and require synchronization of all regions
at each iteration, which is hard to achieve without a centralized coordinator
and might lead to under-utilization of computation resources due to the
heterogeneity of the regions. To address such limitations of synchronous
schemes, this paper investigates the applicability of asynchronous distributed
optimization methods to power system optimization. In particular, we focus on
solving the AC Optimal Power Flow problem and propose an algorithmic framework
based on the Alternating Direction Method of Multipliers (ADMM) method that
allows the regions to perform local updates with information received from a
subset of but not all neighbors. Through experimental studies, we demonstrate
that the convergence performance of the proposed asynchronous scheme is
dependent on the communication delay of passing messages among the regions.
Under mild communication delays, the proposed scheme can achieve comparable or
even faster convergence compared with its synchronous counterpart, which can be
used as a good alternative to centralized or synchronous distributed
optimization approaches.
Comment: SmartGridComm 201
Revisiting Asynchronous Linear Solvers: Provable Convergence Rate Through Randomization
Asynchronous methods for solving systems of linear equations have been
researched since Chazan and Miranker's pioneering 1969 paper on chaotic
relaxation. The underlying idea of asynchronous methods is to avoid processor
idle time by allowing the processors to continue to make progress even if not
all progress made by other processors has been communicated to them.
Historically, the applicability of asynchronous methods for solving linear
equations was limited to certain restricted classes of matrices, such as
diagonally dominant matrices. Furthermore, analysis of these methods focused on
proving convergence in the limit. Comparison of the asynchronous convergence
rate with its synchronous counterpart and its scaling with the number of
processors were seldom studied, and are still not well understood.
In this paper, we propose a randomized shared-memory asynchronous method for
general symmetric positive definite matrices. We rigorously analyze the
convergence rate and prove that it is linear, and is close to that of the
method's synchronous counterpart if the processor count is not excessive
relative to the size and sparsity of the matrix. We also present an algorithm
for unsymmetric systems and overdetermined least-squares. Our work presents a
significant improvement in the applicability of asynchronous linear solvers as
well as in their convergence analysis, and suggests randomization as a key
paradigm to serve as a foundation for asynchronous methods.
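The core update is easy to state (a sequential sketch of ours with an illustrative 3x3 SPD system): pick a random equation and solve it exactly in its own unknown; the asynchronous method lets processors apply such relaxation steps concurrently on possibly stale iterates.

```python
import numpy as np
rng = np.random.default_rng(3)

# Randomized relaxation for A x = b with A symmetric positive definite.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(3)

for _ in range(600):
    i = rng.integers(3)                   # random equation, uniform sampling
    x[i] += (b[i] - A[i] @ x) / A[i, i]   # zero out residual component i

print(np.allclose(A @ x, b, atol=1e-8))   # True
```

Each step touches a single coordinate, which is why the shared-memory version needs no locking beyond atomic coordinate writes.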
Asynchronous Distributed Optimization with Stochastic Delays
We study asynchronous finite sum minimization in a distributed-data setting
with a central parameter server. While asynchrony is well understood in
parallel settings where the data is accessible by all machines -- e.g.,
modifications of variance-reduced gradient algorithms like SAGA work well --
little is known for the distributed-data setting. We develop an algorithm
ADSAGA based on SAGA for the distributed-data setting, in which the data is
partitioned between many machines. We show that with $m$ machines, under a
natural stochastic delay model with a mean delay of $m$, ADSAGA converges in
$\tilde{O}\left((n + \sqrt{m\kappa})\log(1/\epsilon)\right)$ iterations, where
$n$ is the number of component functions and $\kappa$ is a condition number.
This complexity sits squarely between the complexity
$\tilde{O}\left((n + \kappa)\log(1/\epsilon)\right)$ of SAGA
\textit{without delays} and the complexity
$\tilde{O}\left((n + m\kappa)\log(1/\epsilon)\right)$ of parallel asynchronous
algorithms where the delays are \textit{arbitrary} (but bounded by $O(m)$), and
the data is accessible by all. Existing asynchronous algorithms in the
distributed-data setting with arbitrary delays have only been shown to converge
in $\tilde{O}\left(n^2\kappa\log(1/\epsilon)\right)$
iterations. We empirically compare on
least-squares problems the iteration complexity and wallclock performance of
ADSAGA to existing parallel and distributed algorithms, including synchronous
minibatch algorithms. Our results demonstrate the wallclock advantage of
variance-reduced asynchronous approaches over SGD or synchronous approaches.
Comment: arXiv admin note: substantial text overlap with arXiv:2006.0963
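For reference, the SAGA update that ADSAGA builds on keeps a table of the last gradient seen for each component (a serial sketch with our own toy quadratics; ADSAGA additionally handles data partitioning and stochastic delays):

```python
import numpy as np
rng = np.random.default_rng(4)

# SAGA for min_w (1/n) * sum_i 0.5*(w - a_i)^2, whose solution is mean(a).
a = np.array([0.5, 1.5, 4.0, 2.0])
n = len(a)
w = 0.0
table = np.zeros(n)          # last gradient seen for each component i
avg = table.mean()           # running mean of the table

for _ in range(400):
    i = rng.integers(n)
    g_new = w - a[i]                        # fresh gradient of component i
    w -= 0.2 * (g_new - table[i] + avg)     # variance-reduced step
    avg += (g_new - table[i]) / n           # keep the running mean in sync
    table[i] = g_new

print(round(w, 4))   # ≈ 2.0 (= mean(a))
```

As with SVRG, the correction vanishes at the optimum, so a constant step size gives linear convergence on strongly convex finite sums.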
Decentralized Dynamic Optimization for Power Network Voltage Control
Voltage control in power distribution networks has been greatly challenged by
the increasing penetration of volatile and intermittent devices. These devices
can also provide limited reactive power resources that can be used to regulate
the network-wide voltage. A decentralized voltage control strategy can be
designed by minimizing a quadratic voltage mismatch error objective using
gradient-projection (GP) updates. Coupled with the power network flow, the
local voltage can provide the instantaneous gradient information. This paper
aims to analyze the performance of this decentralized GP-based voltage control
design under two dynamic scenarios: i) the nodes perform the decentralized
update in an asynchronous fashion, and ii) the network operating condition is
time-varying. For the asynchronous voltage control, we improve the existing
convergence condition by recognizing that the voltage-based gradient is always
up-to-date. By modeling the network dynamics using an autoregressive process
and considering time-varying resource constraints, we provide an error bound in
tracking the instantaneous optimal solution to the quadratic error objective.
This result can be extended to more general \textit{constrained dynamic
optimization} problems with smooth strongly convex objective functions under
stochastic processes that have bounded iterative changes. Extensive numerical
tests have been performed to demonstrate and validate our analytical results
for realistic power networks.
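The GP update itself is compact (a sketch of ours with an illustrative linearized model $v = Xq + v_{\mathrm{ext}}$ and made-up numbers): each node nudges its reactive power against the local voltage mismatch, which is exactly the gradient, then projects onto its resource limits.

```python
import numpy as np

# Gradient-projection voltage control on a toy 2-bus linearized model.
X = np.array([[0.5, 0.1],
              [0.1, 0.4]])          # positive definite sensitivity matrix
v_ext = np.array([1.04, 0.95])      # uncontrolled voltage profile (p.u.)
v_ref = 1.0                         # flat voltage target
q_min, q_max = -0.2, 0.2            # reactive power resource limits

q = np.zeros(2)
for _ in range(300):
    v = X @ q + v_ext                       # measured local voltages
    q = np.clip(q - 0.5 * (v - v_ref),      # v - v_ref is the local gradient
                q_min, q_max)
print(np.round(q, 3))
```

Here the box constraint does not bind at the optimum, so q converges to the unconstrained minimizer of the quadratic mismatch; a time-varying v_ext would turn this into the tracking problem analyzed above.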
Asynchronous Parallel Algorithms for Nonconvex Optimization
We propose a new asynchronous parallel block-descent algorithmic framework
for the minimization of the sum of a smooth nonconvex function and a nonsmooth
convex one, subject to both convex and nonconvex constraints. The proposed
framework hinges on successive convex approximation techniques and a novel
probabilistic model that captures key elements of modern computational
architectures and asynchronous implementations in a more faithful way than
current state-of-the-art models. Other key features of the framework are: i) it
covers in a unified way several specific solution methods; ii) it accommodates
a variety of possible parallel computing architectures; and iii) it can deal
with nonconvex constraints. Almost sure convergence to stationary solutions is
proved, and theoretical complexity results are provided, showing nearly ideal
linear speedup when the number of workers is not too large.
Comment: This is the first part of a two-paper work. The second part can be found at: arXiv:1701.0490