479 research outputs found
Theano: new features and speed improvements
Theano is a linear algebra compiler that optimizes a user's
symbolically-specified mathematical computations to produce efficient low-level
implementations. In this paper, we present new features and efficiency
improvements to Theano, and benchmarks demonstrating Theano's performance
relative to Torch7, a recently introduced machine learning library, and to
RNNLM, a C++ library targeted at recurrent neural networks.Comment: Presented at the Deep Learning Workshop, NIPS 201
Sparse Stochastic Inference for Latent Dirichlet allocation
We present a hybrid algorithm for Bayesian topic models that combines the
efficiency of sparse Gibbs sampling with the scalability of online stochastic
inference. We used our algorithm to analyze a corpus of 1.2 million books (33
billion words) with thousands of topics. Our approach reduces the bias of
variational inference and generalizes to many Bayesian hidden-variable models.Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012
MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is very useful in optimization problems
with high-dimensional non-convex target functions, and hence constitutes an
important component of several Machine Learning and Data Analytics methods.
Recently there have been significant works on understanding the parallelism
inherent to SGD, and its convergence properties. Asynchronous, parallel SGD
(AsyncPSGD) has received particular attention, due to observed performance
benefits. On the other hand, asynchrony implies inherent challenges in
understanding the execution of the algorithm and its convergence, stemming from
the fact that the contribution of a thread might be based on an old (stale)
view of the state. In this work we aim to deepen the understanding of AsyncPSGD
in order to increase the statistical efficiency in the presence of stale
gradients. We propose new models for capturing the nature of the staleness
distribution in a practical setting. Using the proposed models, we derive a
staleness-adaptive SGD framework, MindTheStep-AsyncPSGD, for adapting the step
size in an online-fashion, which provably reduces the negative impact of
asynchrony. Moreover, we provide general convergence time bounds for a wide
class of staleness-adaptive step size strategies for convex target functions.
We also provide a detailed empirical study, showing how our approach implies
faster convergence for deep learning applications.Comment: 12 pages, 3 figures, accepted in IEEE BigData 201
Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
Distributed deep learning becomes very common to reduce the overall training
time by exploiting multiple computing devices (e.g., GPUs/TPUs) as the size of
deep models and data sets increases. However, data communication between
computing devices could be a potential bottleneck to limit the system
scalability. How to address the communication problem in distributed deep
learning is becoming a hot research topic recently. In this paper, we provide a
comprehensive survey of the communication-efficient distributed training
algorithms in both system-level and algorithmic-level optimizations. In the
system-level, we demystify the system design and implementation to reduce the
communication cost. In algorithmic-level, we compare different algorithms with
theoretical convergence bounds and communication complexity. Specifically, we
first propose the taxonomy of data-parallel distributed training algorithms,
which contains four main dimensions: communication synchronization, system
architectures, compression techniques, and parallelism of communication and
computing. Then we discuss the studies in addressing the problems of the four
dimensions to compare the communication cost. We further compare the
convergence rates of different algorithms, which enable us to know how fast the
algorithms can converge to the solution in terms of iterations. According to
the system-level communication cost analysis and theoretical convergence speed
comparison, we provide the readers to understand what algorithms are more
efficient under specific distributed environments and extrapolate potential
directions for further optimizations
Adaptiveness and Lock-free Synchronization in Parallel Stochastic Gradient Descent
The emergence of big data in recent years due to the vast societal digitalization and large-scale sensor deployment has entailed significant interest in machine learning methods to enable automatic data analytics. In a majority of the learning algorithms used in industrial as well as academic settings, the first-order iterative optimization procedure Stochastic gradient descent (SGD), is the backbone. However, SGD is often time-consuming, as it typically requires several passes through the entire dataset in order to converge to a solution of sufficient quality.In order to cope with increasing data volumes, and to facilitate accelerated processing utilizing contemporary hardware, various parallel SGD variants have been proposed. In addition to traditional synchronous parallelization schemes, asynchronous ones have received particular interest in recent literature due to their improved ability to scale due to less coordination, and subsequently waiting time. However, asynchrony implies inherent challenges in understanding the execution of the algorithm and its convergence properties, due the presence of both stale and inconsistent views of the shared state.In this work, we aim to increase the understanding of the convergence properties of SGD for practical applications under asynchronous parallelism and develop tools and frameworks that facilitate improved convergence properties as well as further research and development. First, we focus on understanding the impact of staleness, and introduce models for capturing the dynamics of parallel execution of SGD. This enables (i) quantifying the statistical penalty on the convergence due to staleness and (ii) deriving an adaptation scheme, introducing a staleness-adaptive SGD variant MindTheStep-AsyncSGD, which provably reduces this penalty. Second, we aim at exploring the impact of synchronization mechanisms, in particular consistency-preserving ones, and the overall effect on the convergence properties. To this end, we propose LeashedSGD, an extensible algorithmic framework supporting various synchronization mechanisms for different degrees of consistency, enabling in particular a lock-free and consistency-preserving implementation. In addition, the algorithmic construction of Leashed-SGD enables dynamic memory allocation, claiming memory only when necessary, which reduces the overall memory footprint. We perform an extensive empirical study, benchmarking the proposed methods, together with established baselines, focusing on the prominent application of Deep Learning for image classification on the benchmark datasets MNIST and CIFAR, showing significant improvements in converge time for Leashed-SGD and MindTheStep-AsyncSGD
- …