One weird trick for parallelizing convolutional neural networks
I present a new way to parallelize the training of convolutional neural
networks across multiple GPUs. The method scales significantly better than all
alternatives when applied to modern convolutional neural networks.
Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
The past few years have witnessed growth in the computational requirements
for training deep convolutional neural networks. Current approaches parallelize
training onto multiple devices by applying a single parallelization strategy
(e.g., data or model parallelism) to all layers in a network. Although easy to
reason about, these approaches result in suboptimal runtime performance in
large-scale distributed training, since different layers in a network may
prefer different parallelization strategies. In this paper, we propose
layer-wise parallelism that allows each layer in a network to use an individual
parallelization strategy. We jointly optimize how each layer is parallelized by
solving a graph search problem. Our evaluation shows that layer-wise
parallelism outperforms state-of-the-art approaches by increasing training
throughput, reducing communication costs, and achieving better scalability to
multiple GPUs, while maintaining the original network accuracy.
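The graph-search formulation this abstract alludes to can be pictured as a shortest-path problem over per-layer strategy choices. The sketch below is a minimal illustration of that idea, not the paper's algorithm; the layer names, the two-strategy set, and all cost numbers are invented for demonstration.

```python
# Minimal sketch: pick a parallelization strategy per layer by dynamic
# programming over the layer chain, trading per-layer execution cost against
# the cost of switching strategies between adjacent layers.
layers = ["conv1", "conv2", "fc1", "fc2"]
strategies = ["data_parallel", "model_parallel"]

# Assumed per-layer execution cost under each strategy (arbitrary units).
exec_cost = {
    ("conv1", "data_parallel"): 4.0, ("conv1", "model_parallel"): 7.0,
    ("conv2", "data_parallel"): 5.0, ("conv2", "model_parallel"): 6.0,
    ("fc1",   "data_parallel"): 9.0, ("fc1",   "model_parallel"): 3.0,
    ("fc2",   "data_parallel"): 8.0, ("fc2",   "model_parallel"): 2.0,
}

def transition_cost(prev_strategy, next_strategy):
    # Assumed communication cost for changing strategies between layers.
    return 0.0 if prev_strategy == next_strategy else 1.5

# best[s] = cheapest cost so far with the current layer using strategy s.
best = {s: exec_cost[(layers[0], s)] for s in strategies}
choice = {s: [s] for s in strategies}
for layer in layers[1:]:
    new_best, new_choice = {}, {}
    for s in strategies:
        prev = min(strategies, key=lambda p: best[p] + transition_cost(p, s))
        new_best[s] = best[prev] + transition_cost(prev, s) + exec_cost[(layer, s)]
        new_choice[s] = choice[prev] + [s]
    best, choice = new_best, new_choice

winner = min(strategies, key=lambda s: best[s])
print("per-layer plan:", dict(zip(layers, choice[winner])), "cost:", best[winner])
```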
User-transparent Distributed TensorFlow
Deep Learning (DL) algorithms have become the de facto choice for data
analysis. Several DL implementations -- primarily limited to a single compute
node -- such as Caffe, TensorFlow, Theano and Torch have become readily
available. Distributed DL implementations capable of execution on large scale
systems are becoming important to address the computational needs of large data
produced by scientific simulations and experiments. Yet, the adoption of
distributed DL implementations faces significant impediments: 1) most
implementations require DL analysts to modify their code significantly -- which
is a show-stopper, 2) several distributed DL implementations are geared towards
cloud computing systems -- which is inadequate for execution on massively
parallel systems such as supercomputers.
This work addresses each of these problems. We provide a distributed memory
DL implementation by incorporating required changes in the TensorFlow runtime
itself. This dramatically reduces the entry barrier for using a distributed
TensorFlow implementation. We use the Message Passing Interface (MPI) -- which
provides performance portability, especially since MPI-specific changes are
abstracted from users. Lastly -- and arguably most importantly -- we make our
implementation available for broader use, under the umbrella of the Machine
Learning Toolkit for Extreme Scale (MaTEx) at http://hpc.pnl.gov/matex. We refer
to our implementation as MaTEx-TensorFlow.
Comment: 9 pages, 8 figures
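MaTEx-TensorFlow hides the MPI machinery inside the TensorFlow runtime, so users do not write communication code themselves. Purely as background on what MPI buys in this setting, here is a minimal, self-contained sketch (assuming mpi4py and an MPI launcher such as mpirun) of the data-parallel gradient-averaging pattern; it is not MaTEx code.

```python
# Each rank computes gradients on its own data shard; an allreduce averages
# them so every rank applies the same update. Run with e.g.
#   mpirun -np 4 python this_script.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(seed=rank)           # each rank gets its own shard
x = rng.normal(size=(256, 10))
y = x @ np.arange(10.0) + rng.normal(scale=0.1, size=256)

w = np.zeros(10)
for step in range(100):
    grad = 2.0 / len(x) * x.T @ (x @ w - y)      # local gradient on this shard
    avg_grad = np.empty_like(grad)
    comm.Allreduce(grad, avg_grad, op=MPI.SUM)   # sum across ranks...
    avg_grad /= size                             # ...then average
    w -= 0.05 * avg_grad                         # identical update on every rank

if rank == 0:
    print("learned weights:", np.round(w, 2))
```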
HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array
With the rise of artificial intelligence in recent years, Deep Neural
Networks (DNNs) have been widely used in many domains. To achieve high
performance and energy efficiency, hardware acceleration of DNNs (especially
for inference) is intensively studied in both academia and industry. However, we still
face two challenges: large DNN models and datasets, which incur frequent
off-chip memory accesses; and the training of DNNs, which is not well-explored
in recent accelerator designs. To provide truly high-throughput and
energy-efficient acceleration for training deep and large models, we inevitably
need multiple accelerators that exploit coarse-grain parallelism, beyond the
fine-grain parallelism inside a layer considered in most existing
architectures. This poses the key research question of how best to organize
computation and dataflow among the accelerators. In this paper,
inspired by recent work in machine learning systems, we propose HyPar, a
solution that determines layer-wise parallelism for deep neural network
training with an array of DNN accelerators. HyPar partitions the feature map tensors (input
and output), the kernel tensors, the gradient tensors, and the error tensors
for the DNN accelerators. A partition constitutes the choice of parallelism for
weighted layers. The optimization target is to search for a partition that
minimizes the total communication during the training of a complete DNN. To solve this
problem, we propose a communication model to explain the source and amount of
communications. Then, we use a hierarchical layer-wise dynamic programming
method to search for the partition for each layer.
Comment: To appear in the 25th International Symposium on High-Performance
Computer Architecture (HPCA 2019)
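To make the notion of a communication model concrete, the toy sketch below compares the bytes a single weighted layer moves per step under a data-style partition (weights replicated, weight gradients allreduced) versus a model-style partition (weights split, input activations exchanged). The tensor shapes, the ring-allreduce cost formula, and the byte counts are illustrative assumptions, not HyPar's actual model.

```python
# Toy per-layer communication model for N accelerators, fp32 tensors.
BYTES = 4

def allreduce_bytes(elements, n_devices):
    # Ring allreduce moves roughly 2 * (n-1)/n of the tensor per device.
    return 2 * (n_devices - 1) / n_devices * elements * BYTES

def data_partition_traffic(batch, in_feat, out_feat, n_devices):
    # Weights replicated: each device allreduces the full weight gradient.
    return allreduce_bytes(in_feat * out_feat, n_devices)

def model_partition_traffic(batch, in_feat, out_feat, n_devices):
    # Output features split across devices: every device needs the full input
    # activations, so it must receive the shards held by the other devices.
    full_input = batch * in_feat * BYTES
    return (n_devices - 1) / n_devices * full_input

for n in (2, 4, 8):
    d = data_partition_traffic(batch=256, in_feat=4096, out_feat=4096, n_devices=n)
    m = model_partition_traffic(batch=256, in_feat=4096, out_feat=4096, n_devices=n)
    print(f"{n} devices: data-style {d/1e6:.1f} MB vs model-style {m/1e6:.1f} MB per step")
```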
DeepSpark: A Spark-Based Distributed Deep Learning Framework for Commodity Clusters
The increasing complexity of deep neural networks (DNNs) has made it
challenging to exploit existing large-scale data processing pipelines for
handling massive data and parameters involved in DNN training. Distributed
computing platforms and GPGPU-based acceleration provide a mainstream solution
to this computational challenge. In this paper, we propose DeepSpark, a
distributed and parallel deep learning framework that exploits Apache Spark on
commodity clusters. To support parallel operations, DeepSpark automatically
distributes workloads and parameters to Caffe/Tensorflow-running nodes using
Spark, and iteratively aggregates training results using a novel lock-free,
asynchronous variant of the popular elastic averaging stochastic gradient
descent (EASGD) update scheme, effectively complementing the synchronized
processing capabilities of Spark. DeepSpark is an ongoing project, and the
current release is available at http://deepspark.snu.ac.kr
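For readers unfamiliar with elastic averaging SGD, the single-process toy below illustrates the basic update that DeepSpark's asynchronous variant builds on: each worker is pulled toward a shared center variable while the center drifts toward the workers. The quadratic toy objective, learning rate, and elasticity constant are assumptions; DeepSpark itself applies this lock-free and asynchronously across Spark executors.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = [rng.normal(size=5) for _ in range(4)]    # each "worker" sees its own data

def local_grad(w, target):
    return w - target                               # gradient of 0.5 * ||w - target||^2

workers = [np.zeros(5) for _ in targets]
center = np.zeros(5)
lr, rho = 0.1, 0.05                                 # step size and elastic coupling

for step in range(200):
    for i, target in enumerate(targets):
        # Worker step: follow the local gradient, but stay elastically tied to the center.
        workers[i] -= lr * (local_grad(workers[i], target) + rho * (workers[i] - center))
        # Center step: move toward the worker that just updated.
        center += lr * rho * (workers[i] - center)

print("center:", np.round(center, 2))
print("mean of targets:", np.round(np.mean(targets, axis=0), 2))
```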
ReNet: A Recurrent Neural Network Based Alternative to Convolutional Networks
In this paper, we propose a deep neural network architecture for object
recognition based on recurrent neural networks. The proposed network, called
ReNet, replaces the ubiquitous convolution+pooling layer of the deep
convolutional neural network with four recurrent neural networks that sweep
horizontally and vertically in both directions across the image. We evaluate
the proposed ReNet on three widely used benchmark datasets: MNIST, CIFAR-10 and
SVHN. The result suggests that ReNet is a viable alternative to the deep
convolutional neural network, and that further investigation is needed.
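A rough sketch of a ReNet-style layer, assuming a PyTorch implementation (the abstract does not prescribe a framework): the image is split into patches, one bidirectional RNN sweeps each row of patches (covering both horizontal directions) and a second sweeps the resulting columns (covering both vertical directions), replacing convolution plus pooling. The patch size, hidden width, and choice of GRUs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    def __init__(self, in_channels, patch, hidden):
        super().__init__()
        self.patch = patch
        feat = in_channels * patch * patch
        self.row_rnn = nn.GRU(feat, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                                     # x: (B, C, H, W)
        B, C, H, W = x.shape
        p = self.patch
        # Split into non-overlapping patches: (B, H/p, W/p, C*p*p).
        x = x.unfold(2, p, p).unfold(3, p, p)                 # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, H // p, W // p, -1)
        # Horizontal sweep: each row of patches is one sequence.
        rows = x.reshape(B * (H // p), W // p, -1)
        rows, _ = self.row_rnn(rows)
        x = rows.reshape(B, H // p, W // p, -1)
        # Vertical sweep: each column of the previous output is one sequence.
        cols = x.permute(0, 2, 1, 3).reshape(B * (W // p), H // p, -1)
        cols, _ = self.col_rnn(cols)
        return cols.reshape(B, W // p, H // p, -1).permute(0, 3, 2, 1)  # (B, 2*hidden, H/p, W/p)

layer = ReNetLayer(in_channels=3, patch=2, hidden=32)
print(layer(torch.randn(1, 3, 32, 32)).shape)                 # torch.Size([1, 64, 16, 16])
```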
Parle: parallelizing stochastic gradient descent
We propose a new algorithm called Parle for parallel training of deep
networks that converges 2-4x faster than a data-parallel implementation of SGD,
while achieving significantly improved error rates that are nearly
state-of-the-art on several benchmarks including CIFAR-10 and CIFAR-100,
without introducing any additional hyper-parameters. We exploit the phenomenon
of flat minima that has been shown to lead to improved generalization error for
deep networks. Parle requires very infrequent communication with the parameter
server and instead performs more computation on each client, which makes it
well-suited to both single-machine, multi-GPU settings and distributed
implementations.
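The "communicate rarely, compute more locally" pattern can be illustrated with plain local SGD and periodic averaging, as in the toy below. This is not Parle's actual update, which couples replicas through a flat-minima-seeking objective; the objective, step counts, and averaging rule here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
data = [rng.normal(loc=3.0, size=1000) for _ in range(4)]     # one shard per client

server_w = 0.0
for round_ in range(20):
    client_ws = []
    for shard in data:
        w = server_w
        for _ in range(50):                                   # many cheap local steps
            batch = rng.choice(shard, size=32)
            w -= 0.05 * (w - batch.mean())                    # grad of 0.5 * (w - mean)^2
        client_ws.append(w)
    server_w = float(np.mean(client_ws))                      # one rare communication

print("server estimate:", round(server_w, 3))                 # close to the true mean, ~3.0
```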
Image Classification at Supercomputer Scale
Deep learning is extremely computationally intensive, and hardware vendors
have responded by building faster accelerators in large clusters. Training deep
learning models at petaFLOPS scale requires overcoming both algorithmic and
systems software challenges. In this paper, we discuss three systems-related
optimizations: (1) distributed batch normalization to control per-replica batch
sizes, (2) input pipeline optimizations to sustain model throughput, and (3)
2-D torus all-reduce to speed up gradient summation. We combine these
optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes
on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million
images/second and no accuracy drop.
Comment: Presented as part of the Systems for ML Workshop @ NIPS 2018
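The 2-D torus all-reduce mentioned above replaces one global summation with two smaller phases: sum along one torus dimension, then along the other. The single-process NumPy simulation below only checks that the two-phase result equals the global sum; the 4x8 grid and gradient values are arbitrary, and real TPU pods perform these phases with hardware collectives.

```python
import numpy as np

rows, cols, n_params = 4, 8, 16                    # assumed 4x8 chip grid
rng = np.random.default_rng(0)
grads = rng.normal(size=(rows, cols, n_params))    # each chip's local gradient

row_sums = grads.sum(axis=1)                       # phase 1: reduce along each row of the torus
total = row_sums.sum(axis=0)                       # phase 2: reduce the row results along columns

assert np.allclose(total, grads.sum(axis=(0, 1)))  # matches the flat global sum
print("2-D reduction matches the global sum over", rows * cols, "chips")
```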
DyVEDeep: Dynamic Variable Effort Deep Neural Networks
Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety
of machine learning tasks and are deployed in increasing numbers of products
and services. However, the computational requirements of training and
evaluating large-scale DNNs are growing at a much faster pace than the
capabilities of the underlying hardware platforms that they are executed upon.
In this work, we propose Dynamic Variable Effort Deep Neural Networks
(DyVEDeep) to reduce the computational requirements of DNNs during inference.
Previous efforts propose specialized hardware implementations for DNNs,
statically prune the network, or compress the weights. Complementary to these
approaches, DyVEDeep is a dynamic approach that exploits the heterogeneity in
the inputs to DNNs to improve their compute efficiency with comparable
classification accuracy. DyVEDeep equips DNNs with dynamic effort mechanisms
that, in the course of processing an input, identify how critical a group of
computations are to classify the input. DyVEDeep dynamically focuses its
compute effort only on the critical computations, while skipping or
approximating the rest. We propose 3 effort knobs that operate at different
levels of granularity viz. neuron, feature and layer levels. We build DyVEDeep
versions for 5 popular image recognition benchmarks - one for CIFAR-10 and four
for ImageNet (AlexNet, OverFeat and VGG-16, weight-compressed AlexNet). Across
all benchmarks, DyVEDeep achieves 2.1x-2.6x reduction in the number of scalar
operations, which translates to 1.8x-2.3x performance improvement over a
Caffe-based implementation, with < 0.5% loss in accuracy.
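As a hedged illustration of a neuron-level "effort knob" in the spirit of this abstract (not the paper's actual mechanism), the sketch below computes a cheap partial dot product first and spends full effort only on neurons whose estimate suggests they will survive the ReLU; the partial fraction and threshold are invented for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=512)                 # input activations
W = rng.normal(size=(256, 512))          # one fully connected layer
k = 128                                  # how many inputs the cheap pass looks at

partial = W[:, :k] @ x[:k] * (512 / k)   # scaled estimate of each pre-activation
full_needed = partial > -0.5             # effort knob: only these neurons get full compute

out = np.zeros(256)
out[full_needed] = np.maximum(W[full_needed] @ x, 0.0)   # exact ReLU for "critical" neurons
exact = np.maximum(W @ x, 0.0)

saved = 1.0 - full_needed.mean()
err = np.abs(out - exact).mean()
print(f"skipped {saved:.0%} of neurons, mean absolute error {err:.3f}")
```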
Deep convolutional networks for pancreas segmentation in CT imaging
Automatic organ segmentation is an important prerequisite for many
computer-aided diagnosis systems. The high anatomical variability of organs in
the abdomen, such as the pancreas, prevents many segmentation methods from
achieving high accuracies when compared to other segmentation of organs like
the liver, heart or kidneys. Recently, the availability of large annotated
training sets and the accessibility of affordable parallel computing resources
via GPUs have made it feasible for "deep learning" methods such as
convolutional networks (ConvNets) to succeed in image classification tasks.
These methods have the advantage that used classification features are trained
directly from the imaging data. We present a fully-automated bottom-up method
for pancreas segmentation in computed tomography (CT) images of the abdomen.
The method is based on hierarchical coarse-to-fine classification of local
image regions (superpixels). Superpixels are extracted from the abdominal
region using Simple Linear Iterative Clustering (SLIC). An initial probability
response map is generated, using patch-level confidences and a two-level
cascade of random forest classifiers, from which superpixel regions with
probabilities larger than 0.5 are retained. These retained superpixels serve as a
highly sensitive initial input of the pancreas and its surroundings to a
ConvNet that samples a bounding box around each superpixel at different scales
(and random non-rigid deformations at training time) in order to assign a more
distinct probability of each superpixel region being pancreas or not. We
evaluate our method on CT images of 82 patients (60 for training, 2 for
validation, and 20 for testing). Using ConvNets we achieve average Dice scores
of 68% ± 10% (range, 43-80%) in testing. This shows promise for accurate
pancreas segmentation using a deep learning approach, and compares favorably to
state-of-the-art methods.
Comment: SPIE Medical Imaging conference, Orlando, FL, USA: SPIE Proceedings |
Volume 9413 | Classification
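To make the candidate-selection step concrete, the NumPy-only sketch below fakes a superpixel label map and per-superpixel probabilities (standing in for SLIC and the random-forest cascade), keeps superpixels scoring above the paper's 0.5 threshold, and crops multi-scale bounding boxes around each survivor for a downstream ConvNet. All names, sizes, and values are placeholders, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 128
superpixels = rng.integers(0, 50, size=(H, W))          # stand-in for a SLIC label map
prob = {label: rng.random() for label in range(50)}     # stand-in for cascade output

def bounding_boxes(label, scales=(1.0, 1.5)):
    # Crop a box around the superpixel's extent at a couple of scales.
    ys, xs = np.nonzero(superpixels == label)
    cy, cx = ys.mean(), xs.mean()
    h, w = np.ptp(ys) + 1, np.ptp(xs) + 1
    boxes = []
    for s in scales:
        half_h, half_w = s * h / 2, s * w / 2
        boxes.append((max(0, int(cy - half_h)), min(H, int(cy + half_h)),
                      max(0, int(cx - half_w)), min(W, int(cx + half_w))))
    return boxes

retained = [label for label, p in prob.items() if p > 0.5]   # the 0.5 threshold from the abstract
crops = {label: bounding_boxes(label) for label in retained}
print(len(retained), "superpixels retained;",
      sum(len(b) for b in crops.values()), "crops for the ConvNet")
```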