Balancing the Communication Load of Asynchronously Parallelized Machine Learning Algorithms
Stochastic Gradient Descent (SGD) is the standard numerical method used to
solve the core optimization problem for the vast majority of machine learning
(ML) algorithms. In the context of large scale learning, as utilized by many
Big Data applications, efficient parallelization of SGD is a focus of
active research. Recently, we were able to show that the asynchronous
communication paradigm can be applied to achieve a fast and scalable
parallelization of SGD. Asynchronous Stochastic Gradient Descent (ASGD)
outperforms other, mostly MapReduce based, parallel algorithms solving large
scale machine learning problems. In this paper, we investigate the impact of
asynchronous communication frequency and message size on the performance of
ASGD applied to large scale ML on HTC cluster and cloud environments. We
introduce a novel algorithm for the automatic balancing of the asynchronous
communication load, which allows ASGD to adapt to changing network bandwidths
and latencies.
Comment: arXiv admin note: substantial text overlap with arXiv:1505.0495
SCOPE: Scalable Composite Optimization for Learning on Spark
Many machine learning models, such as logistic regression (LR) and support
vector machine (SVM), can be formulated as composite optimization problems.
Recently, many distributed stochastic optimization (DSO) methods have been
proposed to solve the large-scale composite optimization problems, which have
shown better performance than traditional batch methods. However, most of these
DSO methods are not scalable enough. In this paper, we propose a novel DSO
method, called scalable composite optimization for learning (SCOPE), and
implement it on the fault-tolerant distributed platform Spark. SCOPE is both
computation-efficient and communication-efficient. Theoretical analysis shows
that SCOPE converges at a linear rate when the objective function is convex.
Furthermore, empirical results on real datasets show that
SCOPE can outperform other state-of-the-art distributed learning methods on
Spark, including both batch learning methods and DSO methods.
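For orientation, a composite optimization problem of the kind named in this abstract is commonly written as an average of smooth per-example losses plus a regularizer. The notation below ($w$, $f_i$, $R$, $\lambda$, $x_i$, $y_i$) is a standard convention assumed here for illustration and is not taken from the abstract:

```latex
\min_{w \in \mathbb{R}^d} \; P(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w) + R(w),
\qquad
\text{e.g. } f_i(w) = \log\bigl(1 + e^{-y_i x_i^{\top} w}\bigr),
\quad
R(w) = \frac{\lambda}{2} \lVert w \rVert_2^2 .
```

The example instance on the right is L2-regularized logistic regression with labels $y_i \in \{-1, +1\}$; replacing the loss with a hinge loss would give a linear SVM under the same template.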
Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
The implementation of a vast majority of machine learning (ML) algorithms
boils down to solving a numerical optimization problem. In this context,
Stochastic Gradient Descent (SGD) methods have long proven to provide good
results, both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order to scale SGD to solve
very large ML problems. At their core, most of these approaches follow a
MapReduce scheme. This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication paradigm.
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent
(ASGD) provides faster (or at least equal) convergence, close to linear scaling
and stable accuracy.
Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization
Due to their simplicity and excellent performance, parallel asynchronous
variants of stochastic gradient descent have become popular methods to solve a
wide range of large-scale optimization problems on multi-core architectures.
Yet, despite their practical success, support for nonsmooth objectives is still
lacking, making them unsuitable for many problems of interest in machine
learning, such as the Lasso, group Lasso or empirical risk minimization with
convex constraints.
In this work, we propose and analyze ProxASAGA, a fully asynchronous sparse
method inspired by SAGA, a variance reduced incremental gradient algorithm. The
proposed method is easy to implement and significantly outperforms the state of
the art on several nonsmooth, large-scale problems. We prove that our method
achieves a theoretical linear speedup with respect to the sequential version
under assumptions on the sparsity of gradients and block-separability of the
proximal term. Empirical benchmarks on a multi-core architecture illustrate
practical speedups of up to 12x on a 20-core machine.
Comment: Appears in Advances in Neural Information Processing Systems 30 (NIPS
2017), 28 pages
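The nonsmooth problems named in this abstract, such as the Lasso, are typically handled through a proximal operator. The following is a minimal sequential sketch, assuming an L1 penalty and a least-squares loss, of the soft-thresholding operator and a single proximal gradient step. It is not ProxASAGA, which is asynchronous, sparse, and variance-reduced; the function names and parameters (`soft_threshold`, `prox_gradient_step`, `step`, `lam`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def soft_threshold(x, threshold):
    """Proximal operator of threshold * ||x||_1: shrink each entry toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - threshold, 0.0)


def prox_gradient_step(w, X, y, step, lam):
    """One sequential proximal gradient step for the Lasso objective
    (1 / (2n)) * ||X w - y||^2 + lam * ||w||_1  (assumed form, for illustration)."""
    n = X.shape[0]
    grad = X.T @ (X @ w - y) / n          # gradient of the smooth least-squares part
    return soft_threshold(w - step * grad, step * lam)  # prox of the L1 part
```

Because the L1 penalty is separable across coordinates, the operator acts on each coordinate independently, which is the simplest case of the block-separability assumption the abstract mentions.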