TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1},
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks. Our source code
is available. Comment: NIPS 2017 Oral.
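A minimal NumPy sketch of the kind of stochastic ternarization TernGrad describes (the function name and the per-tensor scaling are illustrative assumptions; the paper applies ternarizing layer-wise together with gradient clipping):

```python
import numpy as np

def ternarize(grad: np.ndarray, rng=None) -> np.ndarray:
    """Stochastically ternarize a gradient to the levels {-s, 0, +s}.

    s is the maximum magnitude in the tensor; each component keeps
    its sign with probability |g_i| / s, so the result equals the
    original gradient in expectation (unbiased).
    """
    rng = rng or np.random.default_rng()
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s  # Bernoulli(|g|/s)
    return s * np.sign(grad) * keep
```

Only the ternary sign pattern and the single scalar s need to be transmitted per tensor, which is where the communication saving comes from.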
Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools
Deep Learning (DL) has had an immense success in the recent past, leading to
state-of-the-art results in various domains such as image recognition and
natural language processing. One of the reasons for this success is the
increasing size of DL models and the proliferation of vast amounts of training
data being available. To keep on improving the performance of DL, increasing
the scalability of DL systems is necessary. In this survey, we perform a broad
and thorough investigation on challenges, techniques and tools for scalable DL
on distributed infrastructures. This incorporates infrastructures for DL,
methods for parallel DL training, multi-tenant resource scheduling and the
management of training and model data. Further, we analyze and compare 11
current open-source DL frameworks and tools and investigate which of the
techniques are commonly implemented in practice. Finally, we highlight future
research trends in DL systems that deserve further investigation. Comment: accepted at ACM Computing Surveys, to appear.
IBM Deep Learning Service
Deep learning driven by large neural network models is overtaking traditional
machine learning methods for understanding unstructured and perceptual data
domains such as speech, text, and vision. At the same time, the
"as-a-Service"-based business model on the cloud is fundamentally transforming
the information technology industry. These two trends, deep learning and
"as-a-Service", are converging to give rise to a new business model for cognitive
application delivery: deep learning as a service in the cloud. In this paper,
we will discuss the details of the software architecture behind IBM's deep
learning as a service (DLaaS). DLaaS provides developers the flexibility to use
popular deep learning libraries such as Caffe, Torch and TensorFlow, in the
cloud in a scalable and resilient manner with minimal effort. The platform uses
a distribution and orchestration layer that facilitates learning from a large
amount of data in a reasonable amount of time across compute nodes. A resource
provisioning layer enables flexible job management on heterogeneous resources,
such as graphics processing units (GPUs) and central processing units (CPUs),
in an infrastructure as a service (IaaS) cloud.
Building DNN Acoustic Models for Large Vocabulary Speech Recognition
Deep neural networks (DNNs) are now a central component of nearly all
state-of-the-art speech recognition systems. Building neural network acoustic
models requires several design decisions including network architecture, size,
and training loss function. This paper offers an empirical investigation on
which aspects of DNN acoustic model design are most important for speech
recognition system performance. We report DNN classifier performance and final
speech recognizer word error rates, and compare DNNs using several metrics to
quantify factors influencing differences in task performance. Our first set of
experiments uses the standard Switchboard benchmark corpus, which contains
approximately 300 hours of conversational telephone speech. We compare standard
DNNs to convolutional networks, and present the first experiments using
locally-connected, untied neural networks for acoustic modeling. We
additionally build systems on a corpus of 2,100 hours of training data by
combining the Switchboard and Fisher corpora. This larger corpus allows us to
more thoroughly examine performance of large DNN models -- with up to ten times
more parameters than those typically used in speech recognition systems. Our
results suggest that a relatively simple DNN architecture and optimization
technique produces strong results. These findings, along with previous work,
help establish a set of best practices for building DNN hybrid speech
recognition systems with maximum likelihood training. Our experiments in DNN
optimization additionally serve as a case study for training DNNs with
discriminative loss functions for speech tasks, as well as DNN classifiers more
generally.
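As a rough illustration of the "relatively simple DNN architecture" the paper argues for, here is a hedged PyTorch sketch of a feed-forward hybrid acoustic model trained with frame-level cross-entropy (the layer sizes, feature dimension, and number of senone targets are placeholder assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

# Placeholder dimensions: stacked acoustic frames in, senone posteriors out.
FEAT_DIM, HIDDEN, N_SENONES = 440, 2048, 9000

# A plain fully connected stack: the kind of simple architecture the
# paper finds sufficient for strong results.
model = nn.Sequential(
    nn.Linear(FEAT_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, N_SENONES),   # logits over tied triphone states
)

loss_fn = nn.CrossEntropyLoss()     # frame-level cross-entropy training
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(frames: torch.Tensor, senone_ids: torch.Tensor) -> float:
    """One SGD step on a minibatch of (frames, aligned senone labels)."""
    opt.zero_grad()
    loss = loss_fn(model(frames), senone_ids)
    loss.backward()
    opt.step()
    return loss.item()
```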
Deep Learning At Scale and At Ease
Recently, deep learning techniques have enjoyed success in various multimedia
applications, such as image classification and multi-modal data analysis. Large
deep learning models are developed for learning rich representations of complex
data. There are two challenges to overcome before deep learning can be widely
adopted in multimedia and other applications. One is usability: non-experts must
be able to implement different models and training algorithms without much
effort, especially when the model is large and complex. The other is
scalability: the system must be able to provision the huge amount of computing
resources needed to train large models on massive datasets. To address these
two challenges, in this paper, we
design a distributed deep learning platform called SINGA, which has an intuitive
programming model based on the common layer abstraction of deep learning
models. Good scalability is achieved through flexible distributed training
architecture and specific optimization techniques. SINGA runs on GPUs as well
as on CPUs, and we show that it outperforms many other state-of-the-art deep
learning systems. Our experience with developing and training deep learning
models for real-life multimedia applications in SINGA shows that the platform
is both usable and scalable. Comment: submitted to TOMM (under review).
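To illustrate what a programming model built on the common layer abstraction looks like, here is a generic Python sketch (this is an illustration of the concept, not SINGA's actual API):

```python
import numpy as np

class Layer:
    """Generic layer: networks are built by composing forward/backward pairs."""
    def forward(self, x): raise NotImplementedError
    def backward(self, grad): raise NotImplementedError

class Dense(Layer):
    def __init__(self, n_in, n_out):
        self.w = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                 # cache the input for the backward pass
        return x @ self.w + self.b

    def backward(self, grad):
        self.dw = self.x.T @ grad  # parameter gradients for the updater
        self.db = grad.sum(axis=0)
        return grad @ self.w.T     # gradient w.r.t. the layer input

class ReLU(Layer):
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask

    def backward(self, grad):
        return grad * self.mask

# A model is just an ordered list of layers; training threads activations
# forward and gradients backward through the same list.
net = [Dense(784, 256), ReLU(), Dense(256, 10)]
```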
Deep Learning Towards Mobile Applications
Recent years have witnessed an explosive growth of mobile devices. Mobile
devices are permeating every aspect of our daily lives. With the increasing
usage of mobile devices and intelligent applications, there is a soaring demand
for mobile applications with machine learning services. Inspired by the
tremendous success achieved by deep learning in many machine learning tasks, it
becomes a natural trend to push deep learning towards mobile applications.
However, there exist many challenges to realize deep learning in mobile
applications, including the contradiction between the miniature nature of
mobile devices and the resource requirement of deep neural networks, the
privacy and security concerns about individuals' data, and so on. To resolve
these challenges, great leaps have been made in this area during the past few
years. In this paper, we provide an overview of the current challenges and
representative achievements about pushing deep learning on mobile devices from
three aspects: training with mobile data, efficient inference on mobile
devices, and applications of mobile deep learning. The former two aspects cover
the primary tasks of deep learning. We then go through two of our recent
applications, which use data collected by mobile devices for mood disturbance
inference and user identification. Finally, we conclude with a discussion of
the future of this area. Comment: Conference version accepted by ICDCS'1
Human Motion Prediction using Semi-adaptable Neural Networks
Human motion prediction is an important component to facilitate human robot
interaction. A robot needs to accurately predict a human's future movement in
order to safely plan its own motion trajectories and efficiently collaborate
with humans. Many recent approaches predict human movement using deep learning
methods, such as recurrent neural networks. However, existing methods lack the
ability to adapt to time-varying human behaviors, and many of them do not
quantify uncertainties in the prediction. This paper proposes an approach that
uses a semi-adaptable neural network for human motion prediction, and provides
uncertainty bounds of the predictions in real time. In particular, a neural
network is trained offline to represent the human motion transition model, and
then the recursive least squares parameter adaptation algorithm (RLS-PAA) is adopted
for online parameter adaptation of the neural network and for uncertainty
estimation. Experiments on several human motion datasets verify that the
proposed method significantly outperforms the state-of-the-art approach in
terms of prediction accuracy and computational efficiency.
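A minimal NumPy sketch of the recursive-least-squares style of online adaptation the paper describes: the offline network's last-layer features phi are kept fixed and only the last-layer weights theta are adapted per observation (the variable names, initial covariance, and forgetting-factor value are illustrative assumptions):

```python
import numpy as np

class RLSAdapter:
    """Online RLS adaptation of last-layer weights theta, y_hat = phi @ theta."""
    def __init__(self, n_feat, n_out, lam=0.98):
        self.theta = np.zeros((n_feat, n_out))
        self.P = np.eye(n_feat) * 1e3   # parameter covariance (uncertainty)
        self.lam = lam                  # forgetting factor for time variation

    def update(self, phi, y):
        """phi: (n_feat,) features from the frozen network; y: (n_out,) target."""
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)    # RLS gain vector
        err = y - phi @ self.theta            # prediction residual
        self.theta += np.outer(k, err)        # adapt the weights online
        self.P = (self.P - np.outer(k, Pphi)) / self.lam
        return phi @ self.theta               # adapted prediction

    def predictive_variance(self, phi):
        """Scalar proxy for prediction uncertainty at phi."""
        return float(phi @ self.P @ phi)
```

Because each update is a few matrix-vector products, this kind of adaptation can run in real time alongside the frozen network.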
Deep Learning for Explicitly Modeling Optimization Landscapes
In all but the most trivial optimization problems, the structure of the
solutions exhibits complex interdependencies between the input parameters.
Decades of research with stochastic search techniques have shown the benefit of
explicitly modeling the interactions between sets of parameters and the overall
quality of the solutions discovered. We demonstrate a novel method, based on
learning deep networks, to model the global landscapes of optimization
problems. To represent the search space concisely and accurately, the deep
networks must encode information about the underlying parameter interactions
and their contributions to the quality of the solution. Once trained, the
networks are probed to reveal parameter combinations with high
expected performance with respect to the optimization task. These estimates are
used to initialize fast, randomized, local search algorithms, which in turn
expose more information about the search space that is subsequently used to
refine the models. We demonstrate the technique on multiple optimization
problems that have arisen in a variety of real-world domains, including:
packing, graphics, job scheduling, layout, and compression. The problems include
combinatorial search spaces and discontinuous, highly non-linear spaces, and
span binary, higher-cardinality discrete, and continuous parameters.
Strengths, limitations, and extensions of the approach are extensively
discussed and demonstrated.
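A compressed sketch of the model-probe-search-refine loop the abstract describes, using scikit-learn's MLPRegressor as a stand-in for the deep network (the toy objective, candidate sampler, and hill-climbing step are placeholder assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def objective(x):                 # placeholder black-box objective
    return -np.sum((x - 0.3) ** 2)

rng = np.random.default_rng(0)
DIM = 8
X = rng.random((64, DIM))                     # initial random samples
y = np.array([objective(x) for x in X])

for _ in range(5):                            # model -> probe -> search -> refine
    net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
    net.fit(X, y)                             # model the landscape

    cand = rng.random((2048, DIM))            # probe: rank candidates by the net
    starts = cand[np.argsort(net.predict(cand))[-4:]]

    for x in starts:                          # fast randomized local search
        for _ in range(50):
            x2 = np.clip(x + rng.normal(0, 0.05, DIM), 0, 1)
            if objective(x2) > objective(x):
                x = x2
        X = np.vstack([X, x])                 # refine the model with what
        y = np.append(y, objective(x))        # the search exposed

print("best found:", X[np.argmax(y)], "value:", y.max())
```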
Control of a Quadrotor with Reinforcement Learning
In this paper, we present a method to control a quadrotor with a neural
network trained using reinforcement learning techniques. With reinforcement
learning, a common network can be trained to map state directly to actuator
commands, making any predefined control structure unnecessary for training.
Moreover, we present a new learning algorithm that differs from existing ones
in several respects. Our algorithm is conservative but stable for complicated
tasks, and we found it better suited to controlling a quadrotor than existing
algorithms. We demonstrate the performance of the
trained policy both in simulation and with a real quadrotor. Experiments show
that our policy network can respond to step inputs relatively accurately. With
the same policy, we also demonstrate that we can stabilize the quadrotor in the
air even under very harsh initialization (manually throwing it upside-down in
the air with an initial velocity of 5 m/s). Evaluating the policy takes only
7 µs per time step, which is two orders of magnitude less than common
trajectory optimization algorithms with an approximated model.
Nested Dithered Quantization for Communication Reduction in Distributed Training
In distributed training, the communication cost due to the transmission of
gradients or the parameters of the deep model is a major bottleneck in scaling
up the number of processing nodes. To address this issue, we propose
dithered quantization for the transmission of the stochastic gradients and show
that training with Dithered Quantized Stochastic Gradients (DQSG) is similar to
training with unquantized SGs perturbed by independent bounded uniform noise,
in contrast to other quantization methods, where the perturbation depends on
the gradients and hence complicates the convergence analysis. We study the
convergence of training algorithms using DQSG and the trade-off between the
number of quantization levels and the training time.
Next, we observe that there is a correlation among the SGs computed by
workers that can be utilized to further reduce the communication overhead
without any performance loss. Hence, we develop a simple yet effective
quantization scheme, nested dithered quantized SG (NDQSG), that can reduce the
communication significantly without requiring the workers to communicate extra
information to each other. We prove that although NDQSG requires significantly
fewer bits, it achieves the same quantization variance bound as DQSG.
Our simulation results confirm the effectiveness of training using DQSG and
NDQSG in reducing the communication bits or the convergence time compared to
the existing methods, without sacrificing the accuracy of the trained model.
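A minimal NumPy sketch of subtractive dithered quantization (the uniform step size and the shared-seed mechanism for regenerating the dither at the receiver are illustrative assumptions):

```python
import numpy as np

def dq_encode(grad, step, seed):
    """Quantize grad + dither; only the integer indices are transmitted."""
    u = np.random.default_rng(seed).uniform(-step / 2, step / 2, grad.shape)
    return np.round((grad + u) / step).astype(np.int32)

def dq_decode(idx, step, seed):
    """Receiver regenerates the same dither from the seed and subtracts it."""
    u = np.random.default_rng(seed).uniform(-step / 2, step / 2, idx.shape)
    return idx * step - u

# With subtractive dither, the reconstruction error is uniform on
# (-step/2, step/2) and independent of the gradient itself, which is
# the property that simplifies the convergence analysis.
```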