2,904 research outputs found
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
The implementation of a vast majority of machine learning (ML) algorithms
boils down to solving a numerical optimization problem. In this context,
Stochastic Gradient Descent (SGD) methods have long proven to provide good
results, both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order to scale SGD to solve
very large ML problems. At their core, most of these approaches are following a
map-reduce scheme. This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication paradigm.
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent
(ASGD) provides faster (or at least equal) convergence, close to linear scaling
and stable accuracy
Efficient Distributed Online Prediction and Stochastic Optimization with Approximate Distributed Averaging
We study distributed methods for online prediction and stochastic
optimization. Our approach is iterative: in each round nodes first perform
local computations and then communicate in order to aggregate information and
synchronize their decision variables. Synchronization is accomplished through
the use of a distributed averaging protocol. When an exact distributed
averaging protocol is used, it is known that the optimal regret bound of
can be achieved using the distributed mini-batch
algorithm of Dekel et al. (2012), where is the total number of samples
processed across the network. We focus on methods using approximate distributed
averaging protocols and show that the optimal regret bound can also be achieved
in this setting. In particular, we propose a gossip-based optimization method
which achieves the optimal regret bound. The amount of communication required
depends on the network topology through the second largest eigenvalue of the
transition matrix of a random walk on the network. In the setting of stochastic
optimization, the proposed gossip-based approach achieves nearly-linear
scaling: the optimization error is guaranteed to be no more than
after rounds, each of which involves
gossip iterations, when nodes communicate over a
well-connected graph. This scaling law is also observed in numerical
experiments on a cluster.Comment: 30 pages, 2 figure
Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
Stochastic Gradient Descent (SGD) is the standard numerical method used to
solve the core optimization problem for the vast majority of machine learning
(ML) algorithms. In the context of large scale learning, as utilized by many
Big Data applications, efficient parallelization of SGD is in the focus of
active research. Recently, we were able to show that the asynchronous
communication paradigm can be applied to achieve a fast and scalable
parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD)
outperforms other, mostly MapReduce based, parallel algorithms solving large
scale machine learning problems. In this paper, we investigate the impact of
asynchronous communication frequency and message size on the performance of
ASGD applied to large scale ML on HTC cluster and cloud environments. We
introduce a novel algorithm for the automatic balancing of the asynchronous
communication load, which allows to adapt ASGD to changing network bandwidths
and latencies.Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495
An Accelerated Decentralized Stochastic Proximal Algorithm for Finite Sums
Modern large-scale finite-sum optimization relies on two key aspects:
distribution and stochastic updates. For smooth and strongly convex problems,
existing decentralized algorithms are slower than modern accelerated
variance-reduced stochastic algorithms when run on a single machine, and are
therefore not efficient. Centralized algorithms are fast, but their scaling is
limited by global aggregation steps that result in communication bottlenecks.
In this work, we propose an efficient \textbf{A}ccelerated
\textbf{D}ecentralized stochastic algorithm for \textbf{F}inite \textbf{S}ums
named ADFS, which uses local stochastic proximal updates and randomized
pairwise communications between nodes. On machines, ADFS learns from
samples in the same time it takes optimal algorithms to learn from samples
on one machine. This scaling holds until a critical network size is reached,
which depends on communication delays, on the number of samples , and on the
network topology. We provide a theoretical analysis based on a novel augmented
graph approach combined with a precise evaluation of synchronization times and
an extension of the accelerated proximal coordinate gradient algorithm to
arbitrary sampling. We illustrate the improvement of ADFS over state-of-the-art
decentralized approaches with experiments.Comment: Code available in source files. arXiv admin note: substantial text
overlap with arXiv:1901.0986
- …