
    More Iterations per Second, Same Quality -- Why Asynchronous Algorithms may Drastically Outperform Traditional Ones

    In this paper, we consider the convergence of a very general asynchronous-parallel algorithm called ARock, which includes many well-known asynchronous algorithms as special cases (gradient descent, proximal gradient, Douglas-Rachford splitting, ADMM, etc.). In asynchronous-parallel algorithms, the computing nodes simply use the most recent information they have access to, instead of waiting for a full update from all nodes in the system. Nodes therefore do not waste time waiting for information, which can be a major bottleneck, especially in distributed systems. When the system has $p$ nodes, asynchronous algorithms may complete $\Theta(\ln(p))$ times more iterations than synchronous algorithms in a given time period ("more iterations per second"). Although asynchronous algorithms may compute more iterations per second, there is error associated with using outdated information. How many additional iterations are needed in total to compensate for this error is still an open question, and the main results of this paper aim to answer it. We prove, loosely, that as the size of the problem becomes large, the number of additional iterations that asynchronous algorithms need becomes negligible compared to the total number ("same quality" of the iterations). Taken together, our results provide solid evidence of the potential of asynchronous algorithms to vastly speed up certain distributed computations. Comment: 29 pages
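As a toy illustration of the idea (not the paper's actual ARock analysis), the sketch below simulates ARock-style updates on a contractive fixed-point problem $x = Mx + c$: each "node" updates one random coordinate using a copy of the iterate that may be several iterations stale. The map, delay model, and step size are all illustrative assumptions.

```python
import numpy as np

# Toy ARock-style simulation: coordinate updates with bounded staleness.
rng = np.random.default_rng(0)
n = 20
G = rng.standard_normal((n, n))
M = 0.4 * G / np.linalg.norm(G, 2)           # contraction: ||M||_2 = 0.4 < 1
c = rng.standard_normal(n)
x_star = np.linalg.solve(np.eye(n) - M, c)   # the fixed point of x = Mx + c

tau, eta = 5, 0.3        # maximum staleness and relaxation step (assumed)
x = np.zeros(n)
history = [x.copy()]     # past iterates, to model delayed reads
for _ in range(4000):
    lag = int(rng.integers(0, tau + 1))
    x_hat = history[-1 - min(lag, len(history) - 1)]   # stale read
    i = int(rng.integers(n))                           # this node's coordinate
    x = x.copy()
    x[i] -= eta * (x_hat[i] - (M[i] @ x_hat + c[i]))   # relaxed ARock update
    history.append(x.copy())

err = np.linalg.norm(x - x_star)
```

Despite every update using information up to `tau` iterations old, the iterates still converge to the fixed point, which is the qualitative behavior the paper quantifies.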

    Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent

    Asynchronous parallel optimization algorithms for solving large-scale machine learning problems have recently drawn significant attention from both academia and industry. This paper proposes a novel algorithm, decoupled asynchronous proximal stochastic gradient descent (DAP-SGD), to minimize an objective function that is the composite of the average of multiple empirical losses and a regularization term. Unlike traditional asynchronous proximal stochastic gradient descent (TAP-SGD), in which the master carries much of the computation load, the proposed algorithm off-loads the majority of computation tasks from the master to the workers, leaving the master to conduct simple addition operations. This strategy yields an easy-to-parallelize algorithm, whose performance is justified by theoretical convergence analyses. Specifically, DAP-SGD achieves an $O(\log T/T)$ rate when the step-size is diminishing and an ergodic $O(1/\sqrt{T})$ rate when the step-size is constant, where $T$ is the total number of iterations. Comment: 19 pages
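A minimal sketch of the decoupling idea, assuming an L1 regularizer (so the proximal operator is soft-thresholding): the worker performs both the gradient step and the proximal step locally and ships only a delta, so the master's job reduces to a single addition. All values are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def worker_update(x, grad, eta, lam):
    # the worker does the gradient step AND the proximal step locally,
    # then ships only the delta back to the master
    x_new = soft_threshold(x - eta * grad, eta * lam)
    return x_new - x

x = np.array([1.0, -0.2, 0.05])
grad = np.array([0.5, 0.1, 0.0])      # stochastic gradient from one sample
delta = worker_update(x, grad, eta=0.1, lam=0.5)
x_master = x + delta                  # master: a single addition
```

In TAP-SGD the master would instead receive the raw gradient and evaluate the proximal operator itself; moving that work to the workers is what makes the algorithm easy to parallelize.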

    A Model Parallel Proximal Stochastic Gradient Algorithm for Partially Asynchronous Systems

    Large models are prevalent in modern machine learning scenarios, including deep learning and recommender systems, and can have millions or even billions of parameters. Parallel algorithms have become an essential solution technique for many large-scale machine learning jobs. In this paper, we propose a model parallel proximal stochastic gradient algorithm, AsyB-ProxSGD, that handles large models using model parallel blockwise updates while handling large amounts of training data using proximal stochastic gradient descent (ProxSGD). In our algorithm, worker nodes communicate with the parameter servers asynchronously, and each worker performs a proximal stochastic gradient update for only one block of model parameters during each iteration. Our proposed algorithm generalizes ProxSGD to the asynchronous and model parallel setting. We prove that AsyB-ProxSGD achieves a convergence rate of $O(1/\sqrt{K})$ to stationary points for nonconvex problems under \emph{constant} minibatch sizes, where $K$ is the total number of block updates. This rate matches the best-known convergence rates for a wide range of gradient-like algorithms. Furthermore, we show that when the number of workers is bounded by $O(K^{1/4})$, AsyB-ProxSGD can be expected to achieve linear speedup as the number of workers increases. We implement the proposed algorithm on MXNet and demonstrate its convergence behavior and near-linear speedup on a real-world dataset involving both a large model size and large amounts of data. Comment: arXiv admin note: substantial text overlap with arXiv:1802.0888
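A small sketch of a blockwise proximal stochastic gradient update, assuming an L1 regularizer and a model stored as named blocks (the block names, sizes, and gradients here are invented for illustration): each worker touches exactly one block per iteration, leaving the rest untouched.

```python
import numpy as np

def prox_l1(z, t):
    # proximal operator of t * ||.||_1 (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# model parameters split into named blocks (model parallelism)
x = {"b0": np.array([1.0, -1.0]), "b1": np.array([0.3, 0.0])}

def block_prox_sgd_step(x, blk, grad, eta, lam):
    # proximal stochastic gradient update restricted to one block
    x[blk] = prox_l1(x[blk] - eta * grad, eta * lam)

# one worker updates block "b0" with its stochastic gradient; "b1" is untouched
block_prox_sgd_step(x, "b0", np.array([2.0, 0.0]), eta=0.1, lam=1.0)
```

Because each update is confined to a single block, workers holding different blocks can proceed asynchronously without write conflicts on each other's parameters.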

    Asynchronous ADMM for Distributed Non-Convex Optimization in Power Systems

    Large-scale, non-convex optimization problems arising in many complex networks such as the power system call for efficient and scalable distributed optimization algorithms. Existing distributed methods are usually iterative and require synchronization of all workers at each iteration, which is hard to scale and can result in under-utilization of computation resources due to the heterogeneity of the subproblems. To address these limitations of synchronous schemes, this paper proposes an asynchronous distributed optimization method based on the Alternating Direction Method of Multipliers (ADMM) for non-convex optimization. The proposed method requires only local communications and allows each worker to perform local updates with information from a subset of, but not all, neighbors. We provide sufficient conditions on the problem formulation, the choice of algorithm parameters, and the network delay, and show that under these mild conditions, the proposed asynchronous ADMM method asymptotically converges to a KKT point of the non-convex problem. We validate the effectiveness of asynchronous ADMM by applying it to the Optimal Power Flow problem in multiple power systems and show that the convergence of the proposed asynchronous scheme can be faster than its synchronous counterpart in large-scale applications.
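For readers unfamiliar with the ADMM updates being made asynchronous, here is a minimal consensus ADMM sketch on a convex toy problem (not the paper's non-convex OPF formulation): minimize $\sum_i \tfrac{1}{2}(x_i - a_i)^2$ subject to $x_i = z$. The rounds are shown synchronously for clarity; an asynchronous variant would let each worker run its x/u updates on its own clock.

```python
import numpy as np

a = np.array([1.0, 3.0, 5.0, 7.0])   # each worker's local data (illustrative)
m, rho = len(a), 1.0
x = np.zeros(m)   # local primal variables
u = np.zeros(m)   # scaled dual variables
z = 0.0           # consensus variable
for _ in range(100):
    x = (a + rho * (z - u)) / (1 + rho)   # local x-updates (per worker)
    z = np.mean(x + u)                    # consensus z-update
    u = u + x - z                         # dual updates (per worker)
```

The iterates drive every $x_i$ to the consensus value $z = \operatorname{mean}(a)$; the asynchronous scheme in the paper relaxes the requirement that all workers finish their x-update before z is refreshed.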

    Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization

    We provide the first theoretical analysis of the convergence rate of the asynchronous stochastic variance reduced gradient (SVRG) descent algorithm for non-convex optimization. Recent studies have shown that asynchronous stochastic gradient descent (SGD) based algorithms with variance reduction converge at a linear rate on convex problems. However, no prior work analyzes asynchronous SGD with variance reduction on non-convex problems. In this paper, we study two asynchronous parallel implementations of SVRG: one on a distributed memory system and the other on a shared memory system. We prove that both algorithms obtain a convergence rate of $O(1/T)$, and that linear speedup is achievable if the number of workers is upper bounded. Comment: v1, v2, and v3 have been withdrawn due to a reference issue; because of arXiv policy they cannot be deleted. Please refer to the newest version, v4.
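To make the SVRG update concrete, here is a serial sketch on a least-squares toy problem (the paper's versions run the inner loop asynchronously across workers; the data, step size, and epoch counts below are illustrative assumptions). The key line is the variance-reduced gradient `v`, built from a fresh stochastic gradient, the same stochastic gradient at a snapshot, and the full gradient at that snapshot.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def grad_i(x, i):                     # gradient of f_i(x) = 0.5*(A_i x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

x, eta = np.zeros(d), 0.01
for _ in range(50):                   # outer epochs
    snap = x.copy()
    mu = A.T @ (A @ snap - b) / n     # full gradient at the snapshot
    for _ in range(n):                # inner stochastic loop
        i = int(rng.integers(n))
        v = grad_i(x, i) - grad_i(snap, i) + mu   # variance-reduced gradient
        x = x - eta * v
```

Because the variance of `v` vanishes as the iterate approaches the snapshot and the snapshot approaches the optimum, SVRG converges to the exact solution with a constant step size, unlike plain SGD.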

    Impact of Communication Delay on Asynchronous Distributed Optimal Power Flow Using ADMM

    Distributed optimization has attracted significant attention in the operation of power systems in recent years, where a large area is decomposed into smaller control regions, each solving a local optimization problem with periodic information exchange with neighboring regions. However, most distributed optimization methods are iterative and require synchronization of all regions at each iteration, which is hard to achieve without a centralized coordinator and may lead to under-utilization of computation resources due to the heterogeneity of the regions. To address such limitations of synchronous schemes, this paper investigates the applicability of asynchronous distributed optimization methods to power system optimization. In particular, we focus on solving the AC Optimal Power Flow problem and propose an algorithmic framework based on the Alternating Direction Method of Multipliers (ADMM) that allows the regions to perform local updates with information received from a subset of, but not all, neighbors. Through experimental studies, we demonstrate that the convergence performance of the proposed asynchronous scheme depends on the communication delay of messages passed among the regions. Under mild communication delays, the proposed scheme achieves comparable or even faster convergence than its synchronous counterpart, making it a good alternative to centralized or synchronous distributed optimization approaches. Comment: SmartGridComm 201
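A toy sketch of the delay phenomenon (not the paper's AC-OPF setup): regions on a ring average their values, but each region only sees information that is `delay` rounds old. The topology, mixing weights, and delay value are illustrative assumptions; under this mild, bounded delay the iterates still reach consensus on the network average.

```python
import numpy as np
from collections import deque

vals = np.array([1.0, 3.0, 5.0, 7.0])   # each region's initial value
target = vals.mean()
n, delay = len(vals), 3
I = np.eye(n)
# ring topology: mix with the two neighbors (doubly stochastic weights)
W = 0.5 * I + 0.25 * (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1))
hist = deque([vals.copy() for _ in range(delay + 1)], maxlen=delay + 1)
x = vals.copy()
for _ in range(200):
    stale = hist[0]                   # oldest stored iterate: delayed info
    x = 0.5 * x + 0.5 * (W @ stale)   # mix own value with delayed neighbors
    hist.append(x.copy())
```

Larger delays slow the mixing down, which mirrors the paper's empirical finding that convergence of the asynchronous scheme degrades with communication delay.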

    Revisiting Asynchronous Linear Solvers: Provable Convergence Rate Through Randomization

    Asynchronous methods for solving systems of linear equations have been researched since Chazan and Miranker's pioneering 1969 paper on chaotic relaxation. The underlying idea of asynchronous methods is to avoid processor idle time by allowing processors to continue making progress even if not all progress made by other processors has been communicated to them. Historically, the applicability of asynchronous methods for solving linear equations was limited to certain restricted classes of matrices, such as diagonally dominant ones. Furthermore, analysis of these methods focused on proving convergence in the limit; how the asynchronous convergence rate compares with its synchronous counterpart, and how it scales with the number of processors, was seldom studied and is still not well understood. In this paper, we propose a randomized shared-memory asynchronous method for general symmetric positive definite matrices. We rigorously analyze the convergence rate and prove that it is linear and close to that of the method's synchronous counterpart, provided the processor count is not excessive relative to the size and sparsity of the matrix. We also present an algorithm for unsymmetric systems and overdetermined least-squares problems. Our work significantly improves the applicability of asynchronous linear solvers as well as their convergence analysis, and suggests randomization as a key paradigm to serve as a foundation for asynchronous methods.
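The randomized building block can be sketched as a randomized coordinate (Gauss-Seidel-style) update for an SPD system $Ax = b$; a shared-memory asynchronous solver would run many such single-coordinate updates concurrently. The matrix construction and iteration count below are illustrative assumptions, and the simulation is serial.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)       # symmetric positive definite (assumed setup)
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

x = np.zeros(n)
for _ in range(4000):
    i = int(rng.integers(n))      # pick a coordinate uniformly at random
    # exact update of coordinate i: zero out residual component i
    x[i] += (b[i] - A[i] @ x) / A[i, i]
err = np.linalg.norm(x - x_star)
```

Randomizing the coordinate choice is what enables the linear-rate analysis: unlike a fixed chaotic ordering, the expected progress per update can be bounded for any SPD matrix.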

    Asynchronous Distributed Optimization with Stochastic Delays

    We study asynchronous finite-sum minimization in a distributed-data setting with a central parameter server. While asynchrony is well understood in parallel settings where the data is accessible by all machines -- e.g., modifications of variance-reduced gradient algorithms like SAGA work well -- little is known for the distributed-data setting. We develop ADSAGA, a SAGA-based algorithm for the distributed-data setting, in which the data is partitioned among many machines. We show that with $m$ machines, under a natural stochastic delay model with a mean delay of $m$, ADSAGA converges in $\tilde{O}((n + \sqrt{m}\kappa)\log(1/\epsilon))$ iterations, where $n$ is the number of component functions and $\kappa$ is a condition number. This complexity sits squarely between the complexity $\tilde{O}((n + \kappa)\log(1/\epsilon))$ of SAGA \textit{without delays} and the complexity $\tilde{O}((n + m\kappa)\log(1/\epsilon))$ of parallel asynchronous algorithms in which the delays are \textit{arbitrary} (but bounded by $O(m)$) and the data is accessible by all machines. Existing asynchronous algorithms for the distributed-data setting with arbitrary delays have only been shown to converge in $\tilde{O}(n^2\kappa\log(1/\epsilon))$ iterations. On least-squares problems, we empirically compare the iteration complexity and wallclock performance of ADSAGA to existing parallel and distributed algorithms, including synchronous minibatch algorithms. Our results demonstrate the wallclock advantage of variance-reduced asynchronous approaches over SGD or synchronous approaches. Comment: arXiv admin note: substantial text overlap with arXiv:2006.0963
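For reference, here is the underlying SAGA update that ADSAGA adapts to the distributed-data setting, sketched serially on a least-squares toy problem (the data, step size, and iteration budget are illustrative assumptions). SAGA keeps a table of the last gradient seen for each component and uses it to de-bias the stochastic step.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 40, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

x = np.zeros(d)
table = np.zeros((n, d))           # last gradient stored for each f_i
avg = table.mean(axis=0)           # running average of the table
eta = 0.01
for _ in range(10000):
    i = int(rng.integers(n))
    g = A[i] * (A[i] @ x - b[i])               # fresh gradient of f_i
    x = x - eta * (g - table[i] + avg)         # SAGA variance-reduced step
    avg = avg + (g - table[i]) / n             # keep the average in sync
    table[i] = g
```

In the distributed-data setting, the gradient table is naturally partitioned with the data across machines, which is what makes a SAGA-style scheme a good fit there.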

    Decentralized Dynamic Optimization for Power Network Voltage Control

    Voltage control in power distribution networks has been greatly challenged by the increasing penetration of volatile and intermittent devices. These devices can also provide limited reactive power resources that can be used to regulate the network-wide voltage. A decentralized voltage control strategy can be designed by minimizing a quadratic voltage mismatch error objective using gradient-projection (GP) updates. Coupled with the power network flow, the local voltage provides the instantaneous gradient information. This paper analyzes the performance of this decentralized GP-based voltage control design under two dynamic scenarios: i) the nodes perform the decentralized update in an asynchronous fashion, and ii) the network operating condition is time-varying. For asynchronous voltage control, we improve the existing convergence condition by recognizing that the voltage-based gradient is always up-to-date. By modeling the network dynamics using an autoregressive process and considering time-varying resource constraints, we provide an error bound on tracking the instantaneous optimal solution to the quadratic error objective. This result can be extended to more general \textit{constrained dynamic optimization} problems with smooth strongly convex objective functions under stochastic processes that have bounded iterative changes. Extensive numerical tests have been performed to demonstrate and validate our analytical results on realistic power networks.
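The gradient-projection primitive can be sketched as follows, on a toy quadratic with box constraints standing in for reactive power limits (the objective, limits, and step size are illustrative assumptions, not the paper's network model): take a gradient step, then project back onto the feasible box.

```python
import numpy as np

def gp_step(x, grad, eta, lo, hi):
    # gradient step followed by projection onto the box [lo, hi]
    return np.clip(x - eta * grad, lo, hi)

v = np.array([0.2, -0.8, 1.5])    # unconstrained minimizer of 0.5*||x - v||^2
lo, hi = -1.0, 1.0                # box limits (stand-in for reactive limits)
x = np.zeros(3)
for _ in range(100):
    x = gp_step(x, x - v, eta=0.5, lo=lo, hi=hi)   # gradient of the quadratic
```

The iterates converge to the projection of `v` onto the box: components inside the limits reach their unconstrained values, while the third saturates at the upper limit.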

    Asynchronous Parallel Algorithms for Nonconvex Optimization

    We propose a new asynchronous parallel block-descent algorithmic framework for the minimization of the sum of a smooth nonconvex function and a nonsmooth convex one, subject to both convex and nonconvex constraints. The proposed framework hinges on successive convex approximation techniques and a novel probabilistic model that captures key elements of modern computational architectures and asynchronous implementations in a more faithful way than current state-of-the-art models. Other key features of the framework are: i) it covers in a unified way several specific solution methods; ii) it accommodates a variety of possible parallel computing architectures; and iii) it can deal with nonconvex constraints. Almost sure convergence to stationary solutions is proved, and theoretical complexity results are provided, showing nearly ideal linear speedup when the number of workers is not too large. Comment: This is the first part of a two-paper work. The second part can be found at: arXiv:1701.0490
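A simple instance of the blockwise convex-surrogate idea (illustrative, not the paper's general framework): minimize a smooth nonconvex term $\sum_i (x_i^2-1)^2/4$ plus an L1 term, where each block update minimizes a convex surrogate (linearized smooth part plus a proximal term). The blocks are updated serially here as a stand-in for parallel workers; the objective, step size, and partition are assumptions.

```python
import numpy as np

def prox_l1(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def f_smooth(x):
    return np.sum((x**2 - 1.0)**2) / 4.0

def grad_smooth(x):
    return x * (x**2 - 1.0)

lam, eta = 0.1, 0.1
x = np.array([2.0, -1.5, 0.7, 0.1])
blocks = [np.array([0, 1]), np.array([2, 3])]   # block partition (assumed)
obj0 = f_smooth(x) + lam * np.abs(x).sum()
for _ in range(200):
    for blk in blocks:           # serial stand-in for asynchronous workers
        # convex surrogate for the block: linearize f, keep the L1 term exact
        x[blk] = prox_l1(x[blk] - eta * grad_smooth(x)[blk], eta * lam)
obj = f_smooth(x) + lam * np.abs(x).sum()
```

Each surrogate minimization decreases the composite objective even though the smooth part is nonconvex, which is the mechanism behind the framework's convergence to stationary solutions.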