Distributed learning of CNNs on heterogeneous CPU/GPU architectures
Convolutional Neural Networks (CNNs) have been shown to be powerful classification
tools in tasks that range from check reading to medical diagnosis, reaching
close to human perception and in some cases surpassing it. However, the
problems to solve are becoming larger and more complex, which translates into
larger CNNs and longer training times that even the adoption of Graphics
Processing Units (GPUs) cannot keep up with. This problem is partially
addressed by using more processing units and the distributed training methods
offered by several frameworks dedicated to neural network training. However,
these techniques do not take full advantage of the parallelization possible
within CNNs or of the cooperative use of heterogeneous devices with different
processing capabilities, clock speeds, and memory sizes. This paper
presents a new method for the parallel training of CNNs that can be considered
a particular instantiation of model parallelism, where only the
convolutional layer is distributed. In fact, the convolutions processed during
training (forward and backward propagation included) account for a large
fraction of global processing time. The paper analyzes the influence of network size,
bandwidth, batch size, number of devices, including their processing
capabilities, and other parameters. Results show that this technique reduces
training time without affecting the classification performance for both CPUs
and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers,
the best speedups are achieved with four CPUs and with three GPUs. Modern
imaging datasets, larger and more complex than CIFAR-10, will devote an even
larger share of processing time to computing convolutions, and the speedups
will tend to increase accordingly.
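As a rough sketch of the idea (not the authors' implementation), the Python fragment below partitions a convolutional layer's kernels across simulated workers in proportion to an assumed per-device capability score, computes each shard's feature maps independently, and concatenates the results; the partitioning heuristic and all names are hypothetical.

    import numpy as np

    def conv2d_valid(x, k):
        # Naive single-channel "valid" 2-D convolution, for illustration only.
        H, W = x.shape
        kh, kw = k.shape
        out = np.empty((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
        return out

    def partition(n_kernels, capabilities):
        # Split kernel indices across devices proportionally to an assumed
        # per-device capability score (hypothetical heuristic).
        total = float(sum(capabilities))
        bounds = np.cumsum([round(n_kernels * c / total) for c in capabilities])
        bounds[-1] = n_kernels  # absorb rounding error in the last shard
        return np.split(np.arange(n_kernels), bounds[:-1])

    # One input image, eight kernels, three "devices" of unequal speed.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    kernels = rng.standard_normal((8, 5, 5))
    shards = partition(len(kernels), capabilities=[1.0, 2.0, 1.5])

    # Each device computes only its shard of feature maps; in a real system
    # these run concurrently and the maps are gathered over the network.
    feature_maps = np.concatenate(
        [np.stack([conv2d_valid(x, kernels[i]) for i in idx]) for idx in shards]
    )
    assert feature_maps.shape == (8, 28, 28)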
PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
Dynamic sparsity, where the sparsity patterns are unknown until runtime,
poses a significant challenge to deep learning. The state-of-the-art
sparsity-aware deep learning solutions are restricted to pre-defined, static
sparsity patterns due to significant overheads associated with preprocessing.
Efficient execution of dynamic sparse computation often faces a misalignment
between the GPU-friendly tile configuration needed for efficient execution and
the sparsity-aware tile shape that minimizes coverage waste (so that tiles
cover the non-zero values of the tensor with little padding).
In this paper, we propose PIT, a deep-learning compiler for dynamic sparsity.
PIT introduces a novel tiling mechanism that leverages Permutation Invariant
Transformation (PIT), a mathematically proven property, to transform multiple
sparsely located micro-tiles into a GPU-efficient dense tile without changing
the computation results, thus achieving both high GPU utilization and low
coverage waste. Given a model, PIT first finds feasible PIT rules for all its
operators and generates efficient GPU kernels accordingly. At runtime, with the
novel SRead and SWrite primitives, PIT rules can be executed extremely fast to
support dynamic sparsity in an online manner. Extensive evaluation on diverse
models shows that PIT can accelerate dynamic sparsity computation by up to 5.9x
(average 2.43x) over state-of-the-art compilers.
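The trick PIT relies on can be pictured with a toy gather-compute-scatter in Python. The sketch below is not PIT's actual SRead/SWrite GPU primitives, only an illustration of the permutation invariance itself: gathering the non-empty rows of a dynamically sparse matrix into one dense tile, computing densely, and scattering the results back leaves the product unchanged.

    import numpy as np

    def sparse_row_matmul(A, B):
        # Gather the "micro-tiles" (here: non-empty rows) into one dense tile,
        # compute densely, then scatter results back to their original rows.
        nz = np.flatnonzero(np.abs(A).sum(axis=1))  # index built at runtime
        dense_tile = A[nz]                          # gather (cf. SRead)
        partial = dense_tile @ B                    # GPU-friendly dense compute
        C = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)
        C[nz] = partial                             # scatter (cf. SWrite)
        return C

    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 16))
    A[rng.random(64) < 0.7] = 0.0                   # dynamic sparsity pattern
    B = rng.standard_normal((16, 8))
    assert np.allclose(sparse_row_matmul(A, B), A @ B)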
Doubly Stochastic Variational Inference for Deep Gaussian Processes
Gaussian processes (GPs) are a good choice for function approximation as they
are flexible, robust to over-fitting, and provide well-calibrated predictive
uncertainty. Deep Gaussian processes (DGPs) are multi-layer generalisations of
GPs, but inference in these models has proved challenging. Existing approaches
to inference in DGP models assume approximate posteriors that force
independence between the layers, and do not work well in practice. We present a
doubly stochastic variational inference algorithm, which does not force
independence between layers. With our method of inference we demonstrate that a
DGP model can be used effectively on data ranging in size from hundreds to a
billion points. We provide strong empirical evidence that our inference scheme
for DGPs works well in practice in both classification and regression.
(NIPS 2017)
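The key mechanism, propagating reparameterised samples through the layers so that each layer is conditioned on the sampled output of the one below, can be sketched in a few lines of Python. Everything here (the toy RBF kernel, inducing inputs Z, variational means m, and collapsing the variational covariance to zero) is a simplifying assumption, not the paper's full algorithm.

    import numpy as np

    rng = np.random.default_rng(0)

    def rbf(A, B, ls=1.0):
        # Squared-exponential kernel between row-vectors of A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / ls ** 2)

    def gp_layer_sample(F, Z, m, jitter=1e-6):
        # One sparse-GP layer: predictive mean/variance given inducing inputs
        # Z and variational means m (the variational covariance S is collapsed
        # to zero here for brevity; the paper keeps a full S), followed by a
        # reparameterised sample.
        Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
        Kxz = rbf(F, Z)
        A = np.linalg.solve(Kzz, Kxz.T).T            # K_xz K_zz^{-1}
        mu = A @ m                                   # predictive mean
        var = np.clip(1.0 - np.sum(A * Kxz, axis=1, keepdims=True), 1e-10, None)
        return mu + np.sqrt(var) * rng.standard_normal(mu.shape)

    # Propagate samples layer by layer: each layer conditions on the *sampled*
    # output of the previous one, so no independence between layers is forced
    # (the "doubly" stochastic part pairs this with minibatch subsampling).
    F = rng.standard_normal((8, 1))                  # toy inputs
    for _ in range(3):                               # three GP layers
        Z = rng.standard_normal((5, F.shape[1]))     # toy inducing inputs
        m = rng.standard_normal((5, 1))              # toy variational means
        F = gp_layer_sample(F, Z, m)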
Theano: new features and speed improvements
Theano is a linear algebra compiler that optimizes a user's
symbolically-specified mathematical computations to produce efficient low-level
implementations. In this paper, we present new features and efficiency
improvements to Theano, and benchmarks demonstrating Theano's performance
relative to Torch7, a recently introduced machine learning library, and to
RNNLM, a C++ library targeted at recurrent neural networks.
(Presented at the Deep Learning Workshop, NIPS 2012)
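For context, a minimal example of the symbolic style Theano compiles (a toy softmax classifier; the shapes and data are arbitrary):

    import numpy as np
    import theano
    import theano.tensor as T

    # Symbolically specify a softmax classifier's forward pass; Theano's
    # graph optimizer then emits an efficient low-level implementation.
    x = T.matrix('x')
    w = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='w')
    b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')
    p = T.nnet.softmax(T.dot(x, w) + b)

    # Compiling the graph is where Theano's optimizations are applied.
    predict = theano.function(inputs=[x], outputs=T.argmax(p, axis=1))
    print(predict(np.random.rand(2, 784).astype(theano.config.floatX)))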