CNNLab: a Novel Parallel Framework for Neural Networks using GPU and FPGA - a Practical Study with Trade-off Analysis
Designing and implementing efficient, provably correct parallel neural
network processing is challenging. Existing high-level parallel abstractions
like MapReduce are insufficiently expressive while low-level tools like MPI and
Pthreads leave ML experts repeatedly solving the same design challenges.
Moreover, the diversity of applications and the sheer scale of the data make
it difficult to construct a flexible, high-performance implementation of deep
learning neural networks. To improve performance while maintaining
scalability, we present CNNLab, a novel deep learning framework using GPU and
FPGA-based accelerators. CNNLab provides a uniform programming model to users
so that the hardware implementation and the scheduling are invisible to the
programmers. At runtime, CNNLab leverages the trade-offs between GPU and FPGA
before offloading the tasks to the accelerators. Experimental results on the
state-of-the-art Nvidia K40 GPU and Altera DE5 FPGA board demonstrate that
CNNLab provides a universal framework with efficient support for diverse
applications without increasing the burden on programmers. Moreover, we
report detailed quantitative measurements of performance, throughput, power,
energy, and performance density for both approaches. These results illuminate
the trade-offs between GPU and FPGA and offer practical guidance for the deep
learning research community.
AX-DBN: An Approximate Computing Framework for the Design of Low-Power Discriminative Deep Belief Networks
The power budget for embedded hardware implementations of Deep Learning
algorithms can be extremely tight. To address implementation challenges in such
domains, new design paradigms, like Approximate Computing, have drawn
significant attention. Approximate Computing exploits the innate
error-resilience of Deep Learning algorithms, a property that makes them
amenable for deployment on low-power computing platforms. This paper describes
an Approximate Computing design methodology, AX-DBN, for an architecture
belonging to the class of stochastic Deep Learning algorithms known as Deep
Belief Networks (DBNs). Specifically, we consider procedures for efficiently
implementing the Discriminative Deep Belief Network (DDBN), a stochastic neural
network used for classification tasks, extending Approximate Computing from
the analysis of deterministic to stochastic neural networks. To optimize the
DDBN for hardware implementation, we explore the use of: (a) Limited
precision of neurons and functional approximations of
activation functions; (b) Criticality analysis to identify nodes in the network
which can operate at reduced precision while allowing the network to maintain
target accuracy levels; and (c) A greedy search methodology with incremental
retraining to determine the optimal reduction in precision for all neurons to
maximize power savings. Using the AX-DBN methodology proposed in this paper, we
present experimental results across several network architectures that show
significant power savings under a user-specified accuracy-loss constraint
relative to ideal full-precision implementations.
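The greedy search with an accuracy-loss budget described above can be sketched in a few lines. This is an illustrative toy, not the authors' code: `evaluate`, the neuron names, and the tolerance model are hypothetical stand-ins for a real validation run, and the incremental retraining after each reduction is omitted.

```python
def greedy_precision_search(neurons, evaluate, full_bits=16,
                            min_bits=2, max_loss=0.01):
    """Lower each neuron's bit-width while accuracy loss stays in budget."""
    bits = {n: full_bits for n in neurons}
    baseline = evaluate(bits)
    for n in neurons:              # a criticality ranking would order these
        while bits[n] > min_bits:
            trial = dict(bits)
            trial[n] = bits[n] - 1
            if baseline - evaluate(trial) <= max_loss:
                bits = trial       # cheaper precision still meets the target
            else:
                break              # neuron is critical at this precision
    return bits

# Toy stand-in for validation accuracy: each neuron silently tolerates
# precision down to some bit-width, then costs 2% accuracy.
tolerance = {"h0": 4, "h1": 8, "h2": 3}
def toy_eval(bits):
    return 1.0 - sum(0.02 for n, b in bits.items() if b < tolerance[n])

result = greedy_precision_search(list(tolerance), toy_eval)
# result: {"h0": 4, "h1": 8, "h2": 3} -- each neuron at its tolerable minimum
```

In the real methodology the per-neuron evaluation would involve retraining and a validation pass, so a criticality analysis is used first to prune the search.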
Benchmarking Deep Spiking Neural Networks on Neuromorphic Hardware
With more and more event-based neuromorphic hardware systems being developed
at universities and in industry, there is a growing need for assessing their
performance with domain-specific measures. In this work, we use the methodology
of converting pre-trained non-spiking to spiking neural networks to evaluate
the performance loss and measure the energy-per-inference for three
neuromorphic hardware systems (BrainScaleS, Spikey, SpiNNaker) and common
simulation frameworks for CPU (NEST) and CPU/GPU (GeNN). For analog hardware we
further apply a re-training technique known as hardware-in-the-loop training to
cope with device mismatch. This analysis is performed for five different
networks, including three networks that have been found by an automated
optimization with a neural architecture search framework. We demonstrate that
the conversion loss is usually below one percent for digital implementations,
and moderately higher for analog systems, which in turn offer much lower
energy-per-inference costs.
Distilling Spikes: Knowledge Distillation in Spiking Neural Networks
Spiking Neural Networks (SNNs) are energy-efficient computing architectures
that exchange spikes, rather than continuous activations, to process
information, unlike classical Artificial Neural Networks (ANNs). This makes
them better suited for real-life
deployments. However, similar to ANNs, SNNs also benefit from deeper
architectures to obtain improved performance. Furthermore, like the deep ANNs,
the memory, compute and power requirements of SNNs also increase with model
size, and model compression becomes a necessity. Knowledge distillation is a
model compression technique that enables transferring the learning of a large
machine learning model to a smaller model with minimal loss in performance. In
this paper, we propose techniques for knowledge distillation in spiking neural
networks for the task of image classification. We present ways to distill
spikes from a larger SNN, also called the teacher network, to a smaller one,
also called the student network, while minimally impacting the classification
accuracy. We demonstrate the effectiveness of the proposed method with detailed
experiments on three standard datasets while proposing novel distillation
methodologies and loss functions. We also present a multi-stage knowledge
distillation technique for SNNs using an intermediate network to obtain higher
performance from the student network. Our approach is expected to open up new
avenues for deploying high performing large SNN models on resource-constrained
hardware platforms.
Comment: Preprint; manuscript under review.
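As a generic illustration of the distillation objective such work builds on (the paper's spike-level losses and multi-stage scheme are more elaborate and not reproduced here), a temperature-softened teacher/student loss over output firing rates might look like the following; all names and rates are illustrative.

```python
import math

def softmax(xs, T=1.0):
    """Tempered softmax; higher T gives softer distributions."""
    m = max(xs)
    exps = [math.exp((x - m) / T) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_rates, teacher_rates, label, T=2.0, alpha=0.5):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    ce = -math.log(softmax(student_rates)[label])     # hard-label term
    p_t = softmax(teacher_rates, T)
    p_s = softmax(student_rates, T)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    return alpha * ce + (1 - alpha) * T * T * kl
```

A student whose rates match the teacher pays only the cross-entropy term; the KL term grows as the two output distributions diverge.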
Dynamic Routing Networks
The deployment of deep neural networks in real-world applications is mostly
restricted by their high inference costs. Extensive efforts have been made to
improve the accuracy with expert-designed or algorithm-searched architectures.
However, the incremental improvement is typically achieved with increasingly
expensive models that only a small portion of input instances really need.
Inference with a static architecture that processes all input instances via the
same transformation would thus incur unnecessary computational costs.
Therefore, customizing the model capacity in an instance-aware manner is much
needed for higher inference efficiency. In this paper, we propose Dynamic
Routing Networks (DRNets), which support efficient instance-aware inference by
routing the input instance to only necessary transformation branches selected
from a candidate set of branches for each connection between transformation
nodes. The branch selection is dynamically determined via the corresponding
branch importance weights, which are first generated from lightweight
hypernetworks (RouterNets) and then recalibrated with Gumbel-Softmax before the
selection. Extensive experiments show that DRNets can reduce a substantial
amount of parameter size and FLOPs during inference with prediction performance
comparable to state-of-the-art architectures.
Comment: 10 pages, 3 figures, 3 tables.
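The Gumbel-Softmax recalibration step mentioned above can be sketched as follows; this is a minimal stand-alone implementation with made-up router logits, not DRNets' actual RouterNet code.

```python
import math, random

def gumbel_softmax(logits, tau=0.5, seed=0):
    """Perturb logits with Gumbel noise, then apply a tempered softmax."""
    rng = random.Random(seed)
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    ys = [(l + g) / tau for l, g in zip(logits, gumbels)]
    m = max(ys)                      # subtract max for numerical stability
    exps = [math.exp(y - m) for y in ys]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical router logits for three candidate branches of one connection:
weights = gumbel_softmax([2.0, 0.1, -1.0])
# Branches with negligible weight can be skipped at inference time.
active = [i for i, w in enumerate(weights) if w > 0.01]
```

With a low temperature `tau` the weights approach one-hot, so most branches fall below the threshold and only the selected transformations are executed.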
Bag of Tricks for Image Classification with Convolutional Neural Networks
Much of the recent progress made in image classification research can be
credited to training procedure refinements, such as changes in data
augmentations and optimization methods. In the literature, however, most
refinements are either briefly mentioned as implementation details or only
visible in source code. In this paper, we will examine a collection of such
refinements and empirically evaluate their impact on the final model accuracy
through ablation study. We will show that, by combining these refinements
together, we are able to improve various CNN models significantly. For example,
we raise ResNet-50's top-1 validation accuracy from 75.3% to 79.29% on
ImageNet. We will also demonstrate that improvement on image classification
accuracy leads to better transfer learning performance in other application
domains such as object detection and semantic segmentation.
Comment: 10 pages, 9 tables, 4 figures.
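One representative refinement of this kind is label smoothing. The sketch below (an assumed illustration, not the paper's code) builds the smoothed target distribution used in place of a one-hot label.

```python
def smooth_labels(num_classes, true_class, eps=0.1):
    """1 - eps on the true class, eps spread evenly over the rest."""
    off = eps / (num_classes - 1)
    return [1.0 - eps if i == true_class else off
            for i in range(num_classes)]

# Smoothed target for class 2 of 5: the hard 1.0 becomes 0.9,
# and each wrong class receives 0.025.
target = smooth_labels(5, 2)
```

Training against this softer target discourages the network from producing extremely confident logits, which tends to improve calibration and generalization.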
Training Deep Neural Network in Limited Precision
Energy and resource efficient training of DNNs will greatly extend the
applications of deep learning. However, three major obstacles mandate
accurate calculation in high precision. In this paper, we tackle two of
them related to the loss of gradients during parameter update and
backpropagation through a softmax nonlinearity layer in low precision training.
We implement SGD with Kahan summation, employing an additional variable per
parameter to virtually extend the bit-width of the parameters for a reliable
parameter update. We also propose a simple guideline to help select the appropriate
bit-width for the last FC layer followed by a softmax nonlinearity layer. It
determines the lower bound of the required bit-width based on the class size of
the dataset. Extensive experiments on various network architectures and
benchmarks verify the effectiveness of the proposed techniques for
low-precision training.
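The Kahan-compensated update described above can be sketched as follows. This is a plain-Python illustration of the idea, not the paper's implementation, and the function name is hypothetical.

```python
def kahan_sgd_step(params, comps, grads, lr):
    """SGD step that carries round-off error in a compensation term,
    virtually extending the precision of the stored parameters."""
    for i in range(len(params)):
        delta = -lr * grads[i] + comps[i]       # add back previously lost bits
        new_p = params[i] + delta               # may round away small deltas
        comps[i] = delta - (new_p - params[i])  # recapture what rounding lost
        params[i] = new_p

# Tiny gradient contributions that a naive update would drop entirely:
params, comps = [1.0], [0.0]
for _ in range(10000):
    kahan_sgd_step(params, comps, [-1e-16], lr=1.0)
# params[0] now exceeds 1.0, whereas 1.0 + 1e-16 == 1.0 in plain float math.
```

The compensation array plays the role of the extra low-order bits: in genuinely low-precision storage the effect is far more pronounced than in this float64 demonstration.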
Energy-based Tuning of Convolutional Neural Networks on Multi-GPUs
Deep Learning (DL) applications are gaining momentum in the realm of
Artificial Intelligence, particularly after GPUs have demonstrated remarkable
skills for accelerating their challenging computational requirements. Within
this context, Convolutional Neural Network (CNN) models constitute a
representative example of success on a wide set of complex applications,
particularly on datasets where the target can be represented through a
hierarchy of local features of increasing semantic complexity. In most of the
real scenarios, the roadmap to improve results relies on CNN settings involving
brute force computation, and researchers have lately proven Nvidia GPUs to be
one of the best hardware counterparts for acceleration. Our work complements
those findings with an energy study on critical parameters for the deployment
of CNNs on flagship image and video applications: object recognition and people
identification by gait, respectively. We evaluate energy consumption on four
different networks based on the two most popular ones (ResNet/AlexNet): ResNet
(167 layers), a 2D CNN (15 layers), a CaffeNet (25 layers) and a ResNetIm (94
layers) using batch sizes of 64, 128 and 256, and then correlate those with
speed-up and accuracy to determine optimal settings. Experimental results on a
multi-GPU server endowed with twin Maxwell and twin Pascal Titan X GPUs
demonstrate that energy correlates with performance and that Pascal may have up
to 40% gains versus Maxwell. Larger batch sizes extend performance gains and
energy savings, but we have to keep an eye on accuracy, which sometimes shows a
preference for small batches. We expect this work to provide a preliminary
guidance for a wide set of CNN and DL applications in modern HPC times, where
the GFLOPS/W ratio constitutes the primary goal.
Comment: To appear in Concurrency and Computation: Practice and Experience.
Structured Deep Neural Network Pruning via Matrix Pivoting
Deep Neural Networks (DNNs) are the key to the state-of-the-art machine
vision, sensor fusion and audio/video signal processing. Unfortunately, their
computation complexity and tight resource constraints on the Edge make them
hard to leverage on mobile, embedded and IoT devices. Due to great diversity of
Edge devices, DNN designers have to take into account the hardware platform and
application requirements during network training. In this work we introduce
pruning via matrix pivoting as a way to improve network pruning by striking a
balance between the design flexibility of architecture-oblivious pruning and
the performance efficiency of architecture-aware pruning, the two dominant
techniques for
obtaining resource-efficient DNNs. We also describe local and global network
optimization techniques for efficient implementation of the resulting pruned
networks. In combination, the proposed pruning and implementation result in
close to linear speed-up with the reduction of network coefficients during
pruning.
Comment: 16 pages, 3 figures, 2 tables, 1 listing.
E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a key technology for emerging
applications such as automatic speech recognition, machine translation or image
description. Long Short-Term Memory (LSTM) networks are the most successful RNN
implementation, as they can learn long term dependencies to achieve high
accuracy. Unfortunately, the recurrent nature of LSTM networks significantly
constrains the amount of parallelism and, hence, multicore CPUs and many-core
GPUs exhibit poor efficiency for RNN inference. In this paper, we present
E-PUR, an energy-efficient processing unit tailored to the requirements of LSTM
computation. The main goal of E-PUR is to support large recurrent neural
networks for low-power mobile devices. E-PUR provides an efficient hardware
implementation of LSTM networks that is flexible to support diverse
applications. One of its main novelties is a technique that we call Maximizing
Weight Locality (MWL), which improves the temporal locality of the memory
accesses for fetching the synaptic weights, reducing the memory requirements by
a large extent. Our experimental results show that E-PUR achieves real-time
performance for different LSTM networks, while reducing energy consumption by
orders of magnitude with respect to general-purpose processors and GPUs, and it
requires a very small chip area. Compared to a modern mobile SoC, an NVIDIA
Tegra X1, E-PUR provides an average energy reduction of 92x.