Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
The implementation of a vast majority of machine learning (ML) algorithms
boils down to solving a numerical optimization problem. In this context,
Stochastic Gradient Descent (SGD) methods have long proven to provide good
results, both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order to scale SGD to solve
very large ML problems. At their core, most of these approaches follow a
map-reduce scheme. This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication paradigm.
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent
(ASGD) provides faster (or at least equal) convergence, close-to-linear
scaling, and stable accuracy.
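A minimal sketch of the asynchronous update scheme the abstract describes: several workers read a shared parameter vector, compute mini-batch gradients, and write their updates back without waiting for each other. Python threads, a shared NumPy array, and the toy least-squares model below are stand-ins chosen for illustration (the paper itself targets one-sided, RDMA-style communication across nodes), not the authors' implementation.

```python
# Illustrative sketch of asynchronous parallel SGD: workers update a shared
# parameter vector in place, with no synchronization between them.
import numpy as np
import threading

def make_data(n=10_000, d=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.01 * rng.normal(size=n)
    return X, y

def worker(w, X, y, steps=500, batch=32, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ w - yb) / batch  # mini-batch least-squares gradient
        w -= lr * grad                       # in-place write, no locking

X, y = make_data()
w_shared = np.zeros(X.shape[1])              # parameters shared by all workers
threads = [threading.Thread(target=worker, args=(w_shared, X, y), kwargs={"seed": s})
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final training loss:", np.mean((X @ w_shared - y) ** 2))
```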
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed
perspective because workers do not wait for synchronization. However, the
Transformer model converges poorly with asynchronous SGD, resulting in
substantially lower quality compared to synchronous SGD. To investigate why
this is the case, we isolate the differences between asynchronous and
synchronous methods, examining batch-size and staleness effects. We find that summing
several asynchronous updates, rather than applying them immediately, restores
convergence behavior. With this hybrid method, Transformer training for a
neural machine translation task reaches a near-convergence level 1.36x faster
in single-node multi-GPU training, with no impact on model quality.
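The hybrid scheme the abstract describes (summing several asynchronous updates before applying them, rather than applying each stale update immediately) can be sketched roughly as follows. The toy quadratic objective, the way staleness is simulated, and all constants are assumptions made purely for illustration.

```python
# Illustrative sketch: accumulate K asynchronous (stale) gradients and apply
# them as one combined update, instead of applying each one as it arrives.
import numpy as np

def gradient(w, rng):
    # toy quadratic objective 0.5 * ||w - w_star||^2 with noisy gradients
    w_star = np.ones_like(w)
    return (w - w_star) + 0.1 * rng.normal(size=w.shape)

def train(dim=8, accumulate_k=4, lr=0.1, rounds=200, staleness=3, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    stale_w = w.copy()
    for r in range(rounds):
        # workers see a parameter copy that lags a few rounds behind,
        # mimicking asynchronous staleness
        if r % staleness == 0:
            stale_w = w.copy()
        # sum K stale updates ...
        summed = sum(gradient(stale_w, rng) for _ in range(accumulate_k))
        # ... and apply them as a single combined step
        w = w - lr * summed
    return w

w_final = train()
print("distance to optimum:", np.linalg.norm(w_final - np.ones(8)))
```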
Asynchronous Distributed Semi-Stochastic Gradient Optimization
With the recent proliferation of large-scale learning problems, there has
been considerable interest in distributed machine learning algorithms, particularly
those that are based on stochastic gradient descent (SGD) and its variants.
However, existing algorithms either suffer from slow convergence due to the
inherent variance of stochastic gradients, or have a fast linear convergence
rate but at the expense of poorer solution quality. In this paper, we combine
their merits by proposing a fast distributed asynchronous SGD-based algorithm
with variance reduction. A constant learning rate can be used, and it is also
guaranteed to converge linearly to the optimal solution. Experiments on the
Google Cloud Computing Platform demonstrate that the proposed algorithm
outperforms state-of-the-art distributed asynchronous algorithms in terms of
both wall-clock time and solution quality.
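The variance-reduction ingredient this abstract refers to is in the spirit of SVRG: each stochastic gradient is corrected with a periodically recomputed full gradient, which is what permits a constant learning rate. Below is a minimal single-machine sketch of that idea only, not the authors' distributed asynchronous algorithm; the least-squares model and all constants are assumptions for illustration.

```python
# Illustrative sketch of SVRG-style variance reduction with a constant step size.
import numpy as np

def svrg(X, y, lr=0.05, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = X.T @ (X @ w_snap - y) / n       # full gradient at the snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = X[i] * (X[i] @ w - y[i])            # stochastic gradient at w
            gi_snap = X[i] * (X[i] @ w_snap - y[i])  # same sample at the snapshot
            w -= lr * (gi - gi_snap + full_grad)     # variance-reduced step
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10)
w_hat = svrg(X, y)
print("training loss:", np.mean((X @ w_hat - y) ** 2))
```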