A Comparative Measurement Study of Deep Learning as a Service Framework
Big data powered Deep Learning (DL) and its applications have blossomed in
recent years, fueled by three technological trends: a large amount of digitized
data openly accessible, a growing number of DL software frameworks in open
source and commercial markets, and a selection of affordable parallel computing
hardware devices. However, no single DL framework, to date, dominates in terms
of performance and accuracy even for baseline classification tasks on standard
datasets, making the selection of a DL framework an overwhelming task. This
paper takes a holistic approach to conduct empirical comparison and analysis of
four representative DL frameworks with three unique contributions. First, given
a selection of CPU-GPU configurations, we show that for a specific DL
framework, different configurations of its hyper-parameters may have a
significant impact on both performance and accuracy of DL applications. Second,
to the best of our knowledge, this study is the first to identify the
opportunities for improving the training time performance and the accuracy of
DL frameworks by configuring parallel computing libraries and tuning individual
and multiple hyper-parameters. Third, we also conduct a comparative measurement
study on the resource consumption patterns of four DL frameworks and their
performance and accuracy implications, including CPU and memory usage, and
their correlations to varying settings of hyper-parameters under different
configuration combinations of hardware and parallel computing libraries. We argue
that this measurement study provides in-depth empirical comparison and analysis
of four representative DL frameworks, and offers practical guidance for service
providers to deploy and deliver DL as a Service (DLaaS), and for application
developers and DLaaS consumers to select the right DL frameworks for the right
DL workloads.
Comment: To appear in IEEE Transactions on Services Computing. The benchmark
tool used in this study is GTDLBench (https://git-disl.github.io/GTDLBench/).
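The hyper-parameter and parallel-library sensitivity described above can be pictured as a small measurement sweep. The sketch below is not the GTDLBench tool; it assumes PyTorch, a placeholder model, and synthetic data, and simply times one training pass for each combination of batch size, learning rate, and CPU thread count.

```python
import itertools, time
import torch
import torch.nn as nn

def train_once(batch_size: int, lr: float, num_threads: int) -> float:
    torch.set_num_threads(num_threads)           # parallel computing library setting
    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.randn(4096, 1, 28, 28)             # stand-in for a real dataset
    y = torch.randint(0, 10, (4096,))
    start = time.perf_counter()
    for i in range(0, len(x), batch_size):       # one pass over the data
        opt.zero_grad()
        loss = loss_fn(model(x[i:i + batch_size]), y[i:i + batch_size])
        loss.backward()
        opt.step()
    return time.perf_counter() - start

# Sweep a tiny grid of hyper-parameter / thread settings and report timings.
for bs, lr, threads in itertools.product([64, 256], [0.01, 0.1], [1, 4]):
    print(f"batch={bs} lr={lr} threads={threads} time={train_once(bs, lr, threads):.2f}s")
```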
Transfer Learning for Performance Modeling of Deep Neural Network Systems
Modern deep neural network (DNN) systems are highly configurable, with a large
number of options that significantly affect their non-functional behavior, for
example inference time and energy consumption. Performance models make it
possible to understand and predict the effects of such configuration options on
system behavior, but they are costly to build because of the large configuration
spaces involved.
Performance models from one environment cannot be transferred directly to
another; usually models are rebuilt from scratch for different environments,
for example different hardware. Recently, transfer learning methods have been
applied to reuse knowledge from performance models trained in one environment
in another. In this paper, we perform an empirical study to understand the
effectiveness of different transfer learning strategies for building
performance models of DNN systems. Our results show that transferring
information on the most influential configuration options and their
interactions is an effective way of reducing the cost to build performance
models in new environments.
Comment: 2 pages, 2 figures, USENIX Conference on Operational Machine
Learning, 201
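As an illustration of the transfer idea, the sketch below (not the paper's exact strategies) fits a performance model in a "source" environment with scikit-learn, reads off the most influential configuration options from feature importances, and then fits a much smaller model over only those options in a "target" environment. The option names and the synthetic measure() function are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
options = ["batch_size", "num_threads", "precision", "prefetch", "fusion"]

# Synthetic stand-in for measured configurations -> inference time.
def measure(env_scale, X):
    # Only the first two options really matter in this toy example.
    return env_scale * (2.0 * X[:, 0] + 1.5 * X[:, 1]) + 0.05 * rng.standard_normal(len(X))

# Step 1: learn a performance model in the cheap source environment.
X_source = rng.random((200, len(options)))
y_source = measure(1.0, X_source)
source_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_source, y_source)

# Step 2: transfer the identity of the most influential options.
top = np.argsort(source_model.feature_importances_)[::-1][:2]
print("influential options:", [options[i] for i in top])

# Step 3: in the target environment, measure far fewer configurations and fit a
# model over only the transferred influential options.
X_target = rng.random((30, len(options)))
y_target = measure(3.0, X_target)        # different hardware -> different scale
target_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_target[:, top], y_target)
print("target R^2:", target_model.score(X_target[:, top], y_target))
```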
Benchmarking the Performance and Energy Efficiency of AI Accelerators for AI Training
Deep learning has become widely used in complex AI applications. Yet,
training a deep neural network (DNN) model requires a considerable amount of
computation, a long running time, and a great deal of energy. Nowadays, many-core AI
accelerators (e.g., GPUs and TPUs) are designed to improve the performance of
AI training. However, processors from different vendors perform dissimilarly in
terms of performance and energy consumption. To investigate the differences
among several popular off-the-shelf processors (i.e., Intel CPU, NVIDIA GPU,
AMD GPU, and Google TPU) in training DNNs, we carry out a comprehensive
empirical study on the performance and energy efficiency of these processors by
benchmarking a representative set of deep learning workloads, including
computation-intensive operations, classical convolutional neural networks
(CNNs), recurrent neural networks (LSTM), Deep Speech 2, and Transformer.
Unlike existing end-to-end benchmarks, which report only the training time, we
investigate the impact of the hardware, the vendor's software library, and the
deep learning framework on the performance and energy consumption
of AI training. Our evaluation methods and results not only provide an
informative guide for end-users to select proper AI accelerators, but also
expose some opportunities for the hardware vendors to improve their software
library.
Comment: Revised some minor issues
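A measurement of this kind needs, at minimum, wall-clock time and sampled power draw per workload. The sketch below is not the paper's methodology; it assumes an NVIDIA GPU with nvidia-smi on the PATH, polls power draw in a background thread, and uses a placeholder train_one_epoch() in place of a real workload.

```python
import subprocess, threading, time

samples = []
stop = threading.Event()

def poll_power(interval=0.5):
    # Sample instantaneous GPU power draw (watts) until told to stop.
    while not stop.is_set():
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"]).decode().strip()
        samples.append(float(out.splitlines()[0]))   # first GPU only
        time.sleep(interval)

def train_one_epoch():
    time.sleep(5)    # placeholder for the real training loop being benchmarked

poller = threading.Thread(target=poll_power, daemon=True)
start = time.perf_counter()
poller.start()
train_one_epoch()
stop.set()
poller.join()
elapsed = time.perf_counter() - start

avg_power = sum(samples) / max(len(samples), 1)       # W
print(f"time={elapsed:.1f}s  avg_power={avg_power:.1f}W  energy={avg_power * elapsed:.0f}J")
```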
AxTrain: Hardware-Oriented Neural Network Training for Approximate Inference
The intrinsic error tolerance of neural network (NN) makes approximate
computing a promising technique to improve the energy efficiency of NN
inference. Conventional approximate computing focuses on balancing the
efficiency-accuracy trade-off for existing pre-trained networks, which can lead
to suboptimal solutions. In this paper, we propose AxTrain, a hardware-oriented
training framework to facilitate approximate computing for NN inference.
Specifically, AxTrain leverages the synergy between two orthogonal methods:
one actively searches for a network parameter distribution with high
error tolerance, and the other passively learns resilient weights by
numerically incorporating the noise distributions of the approximate hardware
in the forward pass during the training phase. Experimental results from
various datasets with near-threshold computing and approximation multiplication
strategies demonstrate AxTrain's ability to obtain resilient neural network
parameters and system energy efficiency improvement.
Comment: In International Symposium on Low Power Electronics and Design
(ISLPED) 201
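The "passive" half of the idea, learning weights under an injected model of hardware noise, can be sketched as follows. This is not the AxTrain framework: it assumes PyTorch, a simple multiplicative Gaussian noise model, and placeholder layer sizes, and merely shows noise entering the forward pass so that the gradients see it during training.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Linear):
    """Linear layer that perturbs its output with a hardware-noise stand-in."""
    def __init__(self, in_features, out_features, rel_noise=0.05):
        super().__init__(in_features, out_features)
        self.rel_noise = rel_noise

    def forward(self, x):
        y = super().forward(x)
        if self.training:                       # only perturb during training
            y = y * (1.0 + self.rel_noise * torch.randn_like(y))
        return y

model = nn.Sequential(NoisyLinear(784, 256), nn.ReLU(), NoisyLinear(256, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x, y = torch.randn(64, 784), torch.randint(0, 10, (64,))
loss = loss_fn(model(x), y)                     # noise is visible to the gradients
loss.backward()
opt.step()
```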
FPGA-based Accelerators of Deep Learning Networks for Learning and Classification: A Review
Due to recent advances in digital technologies, and availability of credible
data, an area of artificial intelligence, deep learning, has emerged, and has
demonstrated its ability and effectiveness in solving complex learning problems
not possible before. In particular, convolutional neural networks (CNNs) have
demonstrated their effectiveness in image detection and recognition
applications. However, their intensive computation and memory bandwidth
requirements prevent general-purpose CPUs from achieving the desired
performance levels.
Consequently, hardware accelerators that use application specific integrated
circuits (ASICs), field programmable gate arrays (FPGAs), and graphic
processing units (GPUs) have been employed to improve the throughput of CNNs.
More precisely, FPGAs have been recently adopted for accelerating the
implementation of deep learning networks due to their ability to maximize
parallelism as well as due to their energy efficiency. In this paper, we review
recent existing techniques for accelerating deep learning networks on FPGAs. We
highlight the key features employed by the various techniques for improving the
acceleration performance. In addition, we provide recommendations for enhancing
the utilization of FPGAs for CNN acceleration. The techniques investigated in
this paper represent the recent trends in FPGA-based accelerators of deep
learning networks. Thus, this review is expected to guide future advances in
efficient hardware accelerators and to be useful for deep learning
researchers.
Comment: This article has been accepted for publication in IEEE Access
(December 2018).
Approximate LSTMs for Time-Constrained Inference: Enabling Fast Reaction in Self-Driving Cars
The need to recognise long-term dependencies in sequential data such as video
streams has made Long Short-Term Memory (LSTM) networks a prominent Artificial
Intelligence model for many emerging applications. However, the high
computational and memory demands of LSTMs introduce challenges in their
deployment on latency-critical systems such as self-driving cars which are
equipped with limited computational resources on-board. In this paper, we
introduce a progressive inference computing scheme that combines model pruning
and computation restructuring leading to the best possible approximation of the
result given the available latency budget of the target application. The
proposed methodology enables mission-critical systems to make informed
decisions even in early stages of the computation, based on approximate LSTM
inference, meeting their specifications on safety and robustness. Our
experiments on a state-of-the-art driving model for autonomous vehicle
navigation demonstrate that the proposed approach can yield outputs with
similar quality of result compared to a faithful LSTM baseline, up to 415x
faster (198x on average, 76x geo. mean).
Comment: PREPRINT: Accepted for publication in the IEEE Consumer Electronics
Magazine (CEM). [Acceptance Date: 28-Oct-2019]
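One way to picture a time-constrained, progressively refined inference scheme is a loop over increasingly accurate model variants that stops when the latency budget is exhausted. The sketch below is a simplification, not the paper's pruning and computation-restructuring method; the DrivingLSTM module, the variant sizes, and the 50 ms budget are placeholders.

```python
import time
import torch
import torch.nn as nn

class DrivingLSTM(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)        # e.g. steering angle, speed

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

# Variants ordered from cheapest/coarsest to most accurate.
variants = [DrivingLSTM(h).eval() for h in (32, 128, 512)]

def progressive_infer(x, budget_s=0.05):
    deadline = time.perf_counter() + budget_s
    result = None
    with torch.no_grad():
        for model in variants:
            if result is not None and time.perf_counter() >= deadline:
                break                            # out of time: keep the approximation
            result = model(x)
    return result

prediction = progressive_infer(torch.randn(1, 20, 64))
print(prediction)
```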
A building block for hardware belief networks
Belief networks represent a powerful approach to problems involving
probabilistic inference, but much of the work in this area is software-based,
running on standard deterministic hardware built from transistors, whose gain
and directionality make it possible to interconnect billions of them into
useful networks. This paper proposes a transistor-like device that could
provide an analogous building block for probabilistic networks. We present two
proof-of-concept examples of belief networks, one reciprocal and one
non-reciprocal, implemented using the proposed device which is simulated using
experimentally benchmarked models.
Comment: Keywords: stochastic, sigmoid, phase transition, spin glass,
frustration, reduced frustration, Ising model, Bayesian network, Boltzmann
machine. 23 pages, 9 figures
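The sigmoid and Boltzmann-machine keywords suggest a stochastic building block whose output is 1 with probability given by a sigmoid of its weighted input. The sketch below simulates a tiny network of such units with reciprocal couplings using Gibbs-style sweeps; it illustrates that class of model, not the proposed device itself, and all weights are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny 3-unit network with symmetric (reciprocal) couplings and biases.
W = np.array([[ 0.0, 2.0, -1.0],
              [ 2.0, 0.0,  1.0],
              [-1.0, 1.0,  0.0]])
b = np.array([0.1, -0.2, 0.0])
s = rng.integers(0, 2, size=3).astype(float)     # initial binary state

counts = np.zeros(3)
for sweep in range(10_000):
    for i in range(3):                           # asynchronous unit updates
        p = sigmoid(W[i] @ s + b[i])             # sigmoid response to the local field
        s[i] = 1.0 if rng.random() < p else 0.0
    counts += s

print("marginal P(unit=1):", counts / 10_000)    # long-run firing probabilities
```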
Asynchrony begets Momentum, with an Application to Deep Learning
Asynchronous methods are widely used in deep learning, but have limited
theoretical justification when applied to non-convex problems. We show that
running stochastic gradient descent (SGD) in an asynchronous manner can be
viewed as adding a momentum-like term to the SGD iteration. Our result does not
assume convexity of the objective function, so it is applicable to deep
learning systems. We observe that a standard queuing model of asynchrony
results in a form of momentum that is commonly used by deep learning
practitioners. This forges a link between queuing theory and asynchrony in deep
learning systems, which could be useful for systems builders. For convolutional
neural networks, we experimentally validate that the degree of asynchrony
directly correlates with the momentum, confirming our main result. An important
implication is that the momentum parameter must be tuned differently for
different levels of asynchrony. We assert that properly tuned momentum reduces
the number of steps required for convergence. Finally, our theory suggests new
ways of counteracting the adverse effects of asynchrony: a simple mechanism
like using negative algorithmic momentum can improve performance under high
asynchrony. Since asynchronous methods have better hardware efficiency, this
result may shed light on when asynchronous execution is more efficient for deep
learning systems.
Comment: Full version of a paper published in the Annual Allerton Conference on
Communication, Control, and Computing (Allerton) 201
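For reference, the heavy-ball (momentum) SGD iteration the abstract alludes to is the standard one below. The paper's claim, paraphrased, is that the expected asynchronous SGD update takes the same form, with an implicit coefficient set by the staleness (queuing) distribution rather than chosen explicitly; the exact expression for that coefficient is in the paper and not reproduced here.

```latex
% Heavy-ball (momentum) SGD with step size \alpha and momentum coefficient \mu:
w_{t+1} \;=\; w_t \;-\; \alpha\,\nabla f(w_t) \;+\; \mu\,\bigl(w_t - w_{t-1}\bigr)
```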
Application of Quantum Annealing to Training of Deep Neural Networks
In Deep Learning, a well-known approach for training a Deep Neural Network
starts by training a generative Deep Belief Network model, typically using
Contrastive Divergence (CD), then fine-tuning the weights using backpropagation
or other discriminative techniques. However, the generative training can be
time-consuming due to the slow mixing of Gibbs sampling. We investigated an
alternative approach that estimates model expectations of Restricted Boltzmann
Machines using samples from a D-Wave quantum annealing machine. We tested this
method on a coarse-grained version of the MNIST data set. In our tests we found
that the quantum sampling-based training approach achieves comparable or better
accuracy with significantly fewer iterations of generative training than
conventional CD-based training. Further investigation is needed to determine
whether similar improvements can be achieved for other data sets, and to what
extent these improvements can be attributed to quantum effects.
Comment: 18 pages
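The substitution being investigated is easiest to see in the RBM weight update, which is the difference between a data-driven "positive phase" and a model-driven "negative phase". The sketch below shows a plain CD-1 update in NumPy (biases omitted for brevity); the comment marks where annealer-drawn samples would replace the Gibbs-sampled negative phase. Sizes and the random batch are placeholders, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, lr = 64, 32, 0.05
W = 0.01 * rng.standard_normal((n_v, n_h))
v_data = (rng.random((16, n_v)) < 0.5).astype(float)     # stand-in for a data batch

# Positive phase: hidden probabilities given the data.
h_data = sigmoid(v_data @ W)

# Negative phase via one step of Gibbs sampling (Contrastive Divergence, CD-1).
h_sample = (rng.random(h_data.shape) < h_data).astype(float)
v_model = sigmoid(h_sample @ W.T)
h_model = sigmoid(v_model @ W)
# ...alternatively, (v_model, h_model) could be replaced by samples returned
# from an annealer, which is the substitution the paper investigates.

W += lr * (v_data.T @ h_data - v_model.T @ h_model) / len(v_data)
print("update norm:", np.linalg.norm(v_data.T @ h_data - v_model.T @ h_model))
```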
Structurally Sparsified Backward Propagation for Faster Long Short-Term Memory Training
Exploiting sparsity enables hardware systems to run neural networks faster
and more energy-efficiently. However, most prior sparsity-centric optimization
techniques only accelerate the forward pass of neural networks and usually
require an even longer training process with iterative pruning and retraining.
We observe that artificially inducing sparsity in the gradients of the gates in
an LSTM cell has little impact on the training quality. Further, we can enforce
structured sparsity in the gate gradients to make the LSTM backward pass up to
45% faster than the state-of-the-art dense approach and 168% faster than the
state-of-the-art sparsifying method on modern GPUs. Though the structured
sparsifying method can impact the accuracy of a model, this performance gap can
be eliminated by mixing our sparse training method and the standard dense
training method. Experimental results show that the mixed method can achieve
comparable results in a shorter time span than using purely dense training.
- …
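The gate-gradient sparsification idea can be sketched with a gradient hook on the gate pre-activations of a hand-rolled LSTM step: during the backward pass, each row keeps only its largest-magnitude entries. This is an illustrative stand-in, not the paper's structured-sparsity scheme or GPU kernels; the sizes and the keep ratio are placeholders.

```python
import torch
import torch.nn as nn

stats = {}

def sparsify_rows(grad, keep_ratio=0.25):
    # Keep only the top-|keep_ratio| fraction of entries in each row of the
    # gate-gradient tensor; everything else is zeroed.
    k = max(1, int(grad.size(1) * keep_ratio))
    thresh = grad.abs().topk(k, dim=1).values[:, -1:]     # per-row k-th largest |g|
    sparse_grad = grad * (grad.abs() >= thresh)
    stats["nonzero"] = (sparse_grad != 0).float().mean().item()
    return sparse_grad

hidden, inp, batch = 128, 64, 32
Wx, Wh = nn.Linear(inp, 4 * hidden), nn.Linear(hidden, 4 * hidden)
x = torch.randn(batch, inp)
h, c = torch.zeros(batch, hidden), torch.zeros(batch, hidden)

gates = Wx(x) + Wh(h)                     # [i, f, g, o] pre-activations
gates.register_hook(sparsify_rows)        # sparsify gate gradients on backward
i, f, g, o = gates.chunk(4, dim=1)
c_next = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
h_next = torch.sigmoid(o) * torch.tanh(c_next)

h_next.pow(2).mean().backward()           # weight grads now flow through the sparse gate grads
print("nonzero fraction of gate gradient:", stats["nonzero"])
```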