Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms
The implementation of a vast majority of machine learning (ML) algorithms
boils down to solving a numerical optimization problem. In this context,
Stochastic Gradient Descent (SGD) methods have long proven to provide good
results, both in terms of convergence and accuracy. Recently, several
parallelization approaches have been proposed in order to scale SGD to solve
very large ML problems. At their core, most of these approaches follow a
map-reduce scheme. This paper presents a novel parallel updating algorithm for
SGD, which utilizes the asynchronous single-sided communication paradigm.
Compared to existing methods, Asynchronous Parallel Stochastic Gradient Descent
(ASGD) provides faster (or at least equal) convergence, close to linear scaling
and stable accuracy.
Distributed Delayed Stochastic Optimization
We analyze the convergence of gradient-based optimization algorithms that
base their updates on delayed stochastic gradient information. The main
application of our results is to the development of gradient-based distributed
optimization algorithms where a master node performs parameter updates while
worker nodes compute stochastic gradients based on local information in
parallel, which may give rise to delays due to asynchrony. We take motivation
from statistical problems where the size of the data is so large that it cannot
fit on one computer; with the advent of huge datasets in biology, astronomy,
and the internet, such problems are now common. Our main contribution is to
show that for smooth stochastic problems, the delays are asymptotically
negligible and we can achieve order-optimal convergence results. In application
to distributed optimization, we develop procedures that overcome communication
bottlenecks and synchronization requirements. We show $n$-node architectures
whose optimization error in stochastic problems, in spite of asynchronous
delays, scales asymptotically as $\mathcal{O}(1/\sqrt{nT})$ after $T$ iterations.
This rate is known to be optimal for a distributed system with $n$ nodes even
in the absence of delays. We additionally complement our theoretical results
with numerical experiments on a statistical machine learning task.
Comment: 27 pages, 4 figures
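The delayed-gradient setting described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' algorithm or experiment; it assumes a toy one-dimensional least-squares problem (true slope 3.0), a fixed delay, and a hypothetical step-size schedule 1/(t + 100), all chosen only for illustration. The function name `delayed_sgd` and its parameters are likewise assumptions.

```python
import random
from collections import deque


def delayed_sgd(delay, steps=20000, seed=0):
    """Fit w in y = w * x with SGD whose gradients are `delay` steps stale.

    Toy illustration only: the problem, the step size and the fixed delay
    are assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    w = 0.0
    # history[0] is the iterate from `delay` updates ago (or the start value
    # while fewer than `delay` updates have happened).
    history = deque([w] * (delay + 1), maxlen=delay + 1)
    for t in range(1, steps + 1):
        stale_w = history[0]
        # A "worker" draws one sample and evaluates the gradient at the stale iterate.
        x = rng.gauss(0.0, 1.0)
        y = 3.0 * x + rng.gauss(0.0, 0.1)
        grad = (stale_w * x - y) * x
        # The "master" applies the stale gradient to the current iterate.
        w -= grad / (t + 100.0)
        history.append(w)  # the oldest iterate is dropped automatically
    return w


if __name__ == "__main__":
    for d in (0, 5, 20):
        print(f"delay={d:2d}  estimated slope={delayed_sgd(d):.4f}")
```

Running the script prints the estimated slope for several delays, which makes it easy to compare the delayed runs against the undelayed one on this toy problem.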
Memory-Efficient Topic Modeling
As one of the simplest probabilistic topic modeling techniques, latent
Dirichlet allocation (LDA) has found many important applications in text
mining, computer vision and computational biology. Recent training algorithms
for LDA can be interpreted within a unified message passing framework. However,
message passing requires storing previous messages, which takes a large amount of
memory that grows linearly with the number of documents or the number of
topics. Therefore, high memory usage is often a major problem for topic
modeling of massive corpora containing a large number of topics. To reduce the
space complexity, we propose a novel algorithm for training LDA that does not
store previous messages: tiny belief propagation (TBP). The basic idea of TBP
is to relate message passing algorithms to non-negative matrix
factorization (NMF) algorithms, which absorb the message update into the
message passing process and thus avoid storing previous messages. Experimental
results on four large data sets confirm that TBP performs comparably to or
even better than current state-of-the-art training algorithms for LDA, but with
much lower memory consumption. TBP can do topic modeling when massive corpora
cannot fit in computer memory, for example, extracting thematic topics from
a 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory.
Comment: 20 pages, 7 figures