2,904 research outputs found

    Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms

    The implementation of the vast majority of machine learning (ML) algorithms boils down to solving a numerical optimization problem. In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in terms of convergence and accuracy. Recently, several parallelization approaches have been proposed in order to scale SGD to very large ML problems. At their core, most of these approaches follow a map-reduce scheme. This paper presents a novel parallel updating algorithm for SGD, which utilizes the asynchronous single-sided communication paradigm. Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent (ASGD) provides faster (or at least equal) convergence, close to linear scaling, and stable accuracy.

    Efficient Distributed Online Prediction and Stochastic Optimization with Approximate Distributed Averaging

    We study distributed methods for online prediction and stochastic optimization. Our approach is iterative: in each round nodes first perform local computations and then communicate in order to aggregate information and synchronize their decision variables. Synchronization is accomplished through the use of a distributed averaging protocol. When an exact distributed averaging protocol is used, it is known that the optimal regret bound of $\mathcal{O}(\sqrt{m})$ can be achieved using the distributed mini-batch algorithm of Dekel et al. (2012), where $m$ is the total number of samples processed across the network. We focus on methods using approximate distributed averaging protocols and show that the optimal regret bound can also be achieved in this setting. In particular, we propose a gossip-based optimization method which achieves the optimal regret bound. The amount of communication required depends on the network topology through the second largest eigenvalue of the transition matrix of a random walk on the network. In the setting of stochastic optimization, the proposed gossip-based approach achieves nearly-linear scaling: the optimization error is guaranteed to be no more than $\epsilon$ after $\mathcal{O}(\frac{1}{n \epsilon^2})$ rounds, each of which involves $\mathcal{O}(\log n)$ gossip iterations, when nodes communicate over a well-connected graph. This scaling law is also observed in numerical experiments on a cluster. Comment: 30 pages, 2 figures
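    The approximate averaging step described in this abstract can be illustrated with a minimal sketch of randomized pairwise gossip. This is not the paper's protocol: the function name `gossip_average`, the complete-graph topology, the node count, and the iteration budget are all assumptions chosen for illustration. Each iteration picks one edge and replaces both endpoint values with their mean, which preserves the network-wide sum while driving every node toward the global average.

```python
import random


def gossip_average(values, edges, iterations, seed=0):
    """Randomized pairwise gossip (illustrative, not the paper's exact protocol).

    Each iteration picks one edge uniformly at random and replaces the
    values held by its two endpoints with their mean.
    """
    rng = random.Random(seed)
    x = list(values)
    for _ in range(iterations):
        i, j = rng.choice(edges)
        mean = 0.5 * (x[i] + x[j])
        x[i] = x[j] = mean
    return x


# Hypothetical setup: 8 nodes on a complete (well-connected) graph,
# where node i starts with the local value float(i).
n = 8
edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
local = [float(i) for i in range(n)]

approx = gossip_average(local, edges, iterations=200)
true_mean = sum(local) / n
max_err = max(abs(v - true_mean) for v in approx)
print(f"largest deviation from the true mean: {max_err:.2e}")
```

    On a sparser graph the same number of gossip iterations would typically leave a larger error, which is consistent with the abstract's statement that the communication cost depends on the network topology.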

    Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms

    Stochastic Gradient Descent (SGD) is the standard numerical method used to solve the core optimization problem for the vast majority of machine learning (ML) algorithms. In the context of large scale learning, as utilized by many Big Data applications, efficient parallelization of SGD is the focus of active research. Recently, we were able to show that the asynchronous communication paradigm can be applied to achieve a fast and scalable parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD) outperforms other, mostly MapReduce based, parallel algorithms solving large scale machine learning problems. In this paper, we investigate the impact of asynchronous communication frequency and message size on the performance of ASGD applied to large scale ML on HTC cluster and cloud environments. We introduce a novel algorithm for the automatic balancing of the asynchronous communication load, which allows ASGD to adapt to changing network bandwidths and latencies. Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495

    An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums

    Modern large-scale finite-sum optimization relies on two key aspects: distribution and stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms are slower than modern accelerated variance-reduced stochastic algorithms when run on a single machine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is limited by global aggregation steps that result in communication bottlenecks. In this work, we propose an efficient Accelerated Decentralized stochastic algorithm for Finite Sums named ADFS, which uses local stochastic proximal updates and randomized pairwise communications between nodes. On $n$ machines, ADFS learns from $nm$ samples in the same time it takes optimal algorithms to learn from $m$ samples on one machine. This scaling holds until a critical network size is reached, which depends on communication delays, on the number of samples $m$, and on the network topology. We provide a theoretical analysis based on a novel augmented graph approach combined with a precise evaluation of synchronization times and an extension of the accelerated proximal coordinate gradient algorithm to arbitrary sampling. We illustrate the improvement of ADFS over state-of-the-art decentralized approaches with experiments. Comment: Code available in source files. arXiv admin note: substantial text overlap with arXiv:1901.0986