21,521 research outputs found
A Survey of the Recent Architectures of Deep Convolutional Neural Networks
The deep Convolutional Neural Network (CNN) is a special type of neural network that has shown exemplary performance in several competitions related to
Computer Vision and Image Processing. Some of the exciting application areas of
CNN include Image Classification and Segmentation, Object Detection, Video
Processing, Natural Language Processing, and Speech Recognition. The powerful
learning ability of deep CNN is primarily due to the use of multiple feature
extraction stages that can automatically learn representations from the data.
The availability of large amounts of data and improvements in hardware technology have accelerated research on CNNs, and interesting deep CNN architectures have recently been reported. Several inspiring ideas for advancing CNNs have been explored, such as the use of different
activation and loss functions, parameter optimization, regularization, and
architectural innovations. However, the significant improvement in the
representational capacity of the deep CNN is achieved through architectural
innovations. Notably, the ideas of exploiting spatial and channel information,
depth and width of architecture, and multi-path information processing have
gained substantial attention. Similarly, the idea of using a block of layers as
a structural unit is also gaining popularity. This survey thus focuses on the
intrinsic taxonomy present in the recently reported deep CNN architectures and,
consequently, classifies the recent innovations in CNN architectures into seven
different categories. These seven categories are based on spatial exploitation,
depth, multi-path, width, feature-map exploitation, channel boosting, and
attention. Additionally, an elementary understanding of CNN components, current challenges, and applications of CNNs are also provided.
Comment: Number of pages: 70, number of figures: 11, number of tables: 11. Artif Intell Rev (2020).
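The survey's notion of using a block of layers as a structural unit can be made concrete with a small sketch. The PyTorch module below is a generic residual block in the spirit of the depth and multi-path categories above; it illustrates the pattern, and is not code from the survey itself.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A block of layers used as a structural unit: two 3x3 convolutions
    with an identity shortcut (the multi-path idea), stackable to build
    depth. Illustrative sketch only."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.body(x) + x)  # shortcut path + convolutional path

# Repeating the block is how many modern architectures grow their depth.
stage = nn.Sequential(*[ResidualBlock(64) for _ in range(4)])
out = stage(torch.randn(1, 64, 32, 32))
```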
FireCaffe: near-linear acceleration of deep neural network training on compute clusters
Long training times for high-accuracy deep neural networks (DNNs) impede
research into new DNN architectures and slow the development of high-accuracy
DNNs. In this paper we present FireCaffe, which successfully scales deep neural
network training across a cluster of GPUs. We also present a number of best
practices to aid in comparing advancements in methods for scaling and
accelerating the training of deep neural networks. The speed and scalability of
distributed algorithms are almost always limited by the overhead of
communicating between servers; DNN training is not an exception to this rule.
Therefore, the key consideration here is to reduce communication overhead
wherever possible, while not degrading the accuracy of the DNN models that we
train. Our approach has three key pillars. First, we select network hardware
that achieves high bandwidth between GPU servers -- Infiniband or Cray
interconnects are ideal for this. Second, we consider a number of communication
algorithms, and we find that reduction trees are more efficient and scalable
than the traditional parameter server approach. Third, we optionally increase
the batch size to reduce the total quantity of communication during DNN
training, and we identify hyperparameters that allow us to reproduce the
small-batch accuracy while training with large batch sizes. When training
GoogLeNet and Network-in-Network on ImageNet, we achieve a 47x and 39x speedup,
respectively, when training on a cluster of 128 GPUs.
Comment: Version 2: added results on 128 GPUs.
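The communication argument can be sketched in a few lines. The simulation below is a generic binary reduction tree, not FireCaffe's implementation; the gradient sizes are made up, and the point is only that the number of sequential communication steps grows as O(log2 GPUs) instead of being serialized through a single parameter server.

```python
import numpy as np

def tree_allreduce(grads):
    """Sum gradients with a binary reduction tree, then broadcast the result.
    Each round halves the number of active workers, so a cluster of 128 GPUs
    needs only 7 sequential reduction steps (illustrative simulation)."""
    partial = [g.copy() for g in grads]
    step = 1
    while step < len(partial):
        for i in range(0, len(partial) - step, 2 * step):
            partial[i] += partial[i + step]  # pairwise reduce one tree level
        step *= 2
    total = partial[0]
    return [total.copy() for _ in grads]  # broadcast back down the tree

workers = [np.random.randn(1000) for _ in range(128)]  # 128 simulated GPUs
reduced = tree_allreduce(workers)
assert np.allclose(reduced[0], np.sum(workers, axis=0))
```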
NeuralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks
"How much energy is consumed for an inference made by a convolutional neural
network (CNN)?" With the increased popularity of CNNs deployed on the
wide spectrum of platforms (from mobile devices to workstations), the answer to
this question has drawn significant attention. From lengthening battery life of
mobile devices to reducing the energy bill of a datacenter, it is important to
understand the energy efficiency of CNNs when serving an inference, before actually training the model. In this work, we propose
NeuralPower: a layer-wise predictive framework based on sparse polynomial
regression, for predicting the serving energy consumption of a CNN deployed on
any GPU platform. Given the architecture of a CNN, NeuralPower provides an
accurate prediction and breakdown for power and runtime across all layers in
the whole network, helping machine learners quickly identify the power,
runtime, or energy bottlenecks. We also propose the "energy-precision ratio"
(EPR) metric to guide machine learners in selecting an energy-efficient CNN
architecture that better trades off the energy consumption and prediction
accuracy. The experimental results show that the prediction accuracy of the
proposed NeuralPower outperforms the best published model to date, yielding an
improvement in accuracy of up to 68.5%. We also assess the accuracy of
predictions at the network level, by predicting the runtime, power, and energy
of state-of-the-art CNN architectures, achieving an average accuracy of 88.24%
in runtime, 88.34% in power, and 97.21% in energy. We comprehensively
corroborate the effectiveness of NeuralPower as a powerful framework for
machine learners by testing it on different GPU platforms and Deep Learning
software tools.
Comment: Accepted as a conference paper at ACML 2017.
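A layer-wise predictor of this kind can be sketched with off-the-shelf tools. Below, sparse polynomial regression is approximated with degree-2 polynomial features plus an L1 (Lasso) penalty; the layer features and wattages are invented for illustration, whereas a real model would be fit on measurements profiled from the target GPU.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical per-layer features: (batch, in_channels, out_channels,
# kernel size, input resolution) -> measured power in watts (made up here).
X = np.array([[64, 3, 64, 3, 224],
              [64, 64, 128, 3, 112],
              [64, 128, 256, 3, 56],
              [64, 256, 512, 3, 28]], dtype=float)
y = np.array([41.0, 66.0, 83.0, 90.0])

# Degree-2 polynomial features with an L1 penalty: the Lasso drives most
# coefficients to zero, keeping the polynomial sparse.
power_model = make_pipeline(PolynomialFeatures(degree=2),
                            Lasso(alpha=0.1, max_iter=100_000))
power_model.fit(X, y)

# Runtime would get its own model of the same form; summing runtime * power
# over all layers then yields an energy estimate for the whole network.
print(power_model.predict(X))
```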
A Survey of Model Compression and Acceleration for Deep Neural Networks
Deep neural networks (DNNs) have recently achieved great success in many
visual recognition tasks. However, existing deep neural network models are
computationally expensive and memory intensive, hindering their deployment in
devices with low memory resources or in applications with strict latency
requirements. Therefore, a natural thought is to perform model compression and
acceleration in deep networks without significantly decreasing the model
performance. During the past five years, tremendous progress has been made in
this area. In this paper, we review the recent techniques for compacting and
accelerating DNN models. In general, these techniques are divided into four
categories: parameter pruning and quantization, low-rank factorization,
transferred/compact convolutional filters, and knowledge distillation. Methods
of parameter pruning and quantization are described first; after that, the other
techniques are introduced. For each category, we also provide insightful
analysis about the performance, related applications, advantages, and
drawbacks. Then we go through some very recent successful methods, for example,
dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmark efforts. Finally, we conclude this paper and discuss the remaining challenges and possible directions for future work.
Comment: Published in IEEE Signal Processing Magazine; updated version including more recent work.
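As a toy illustration of the first category, the sketch below performs magnitude-based parameter pruning followed by uniform quantization; it is a generic example of the two techniques, not any specific surveyed method.

```python
import numpy as np

def prune_and_quantize(w: np.ndarray, sparsity: float = 0.5, bits: int = 8):
    """Zero out the smallest-magnitude fraction of weights, then map the
    survivors onto a uniform grid of 2**bits signed levels (toy example,
    assumes bits <= 8 so the result fits in int8)."""
    threshold = np.quantile(np.abs(w), sparsity)
    pruned = np.where(np.abs(w) < threshold, 0.0, w)   # magnitude pruning
    scale = np.abs(pruned).max() / (2 ** (bits - 1) - 1)
    q = np.round(pruned / scale).astype(np.int8)       # uniform quantization
    return q, scale                                    # dequantize as q * scale

q, scale = prune_and_quantize(np.random.randn(256, 256))
print(f"nonzero weights: {np.count_nonzero(q) / q.size:.0%}")
```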
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithm, enabling ubiquitous machine vision and intelligent
decisions even on embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
implementing low-precision inference engines that leverage custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of 50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
Comment: To be published in ACM TRETS Special Edition on Deep Learning.
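To give a flavor of what such a performance prediction looks like, here is a back-of-the-envelope throughput estimate for a fully pipelined accelerator. The formula and parameter names are our own simplification for illustration, not FINN-R's actual cost model.

```python
def throughput_fps(ops_per_frame: float, pe_count: int, simd: int,
                   f_clk_hz: float) -> float:
    """Rough frames-per-second estimate: a fully pipelined design sustains
    pe_count * simd MAC-like operations per clock cycle (simplified model)."""
    ops_per_cycle = pe_count * simd
    return f_clk_hz * ops_per_cycle / ops_per_frame

# Hypothetical example: a 50 GOp/frame CNN on 64 PEs x 32 SIMD lanes at 200 MHz.
print(throughput_fps(50e9, pe_count=64, simd=32, f_clk_hz=200e6))  # ~8.2 fps
```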
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications in Machine Learning, Speech Recognition, Artificial Intelligence, and DNA Sequencing in recent years, computer architecture research communities are facing an explosive growth in data scale. To achieve high efficiency in data-intensive computing, studies of heterogeneous accelerators focused on the latest applications have become a hot topic in the computer architecture domain. At present, the
implementation of heterogeneous accelerators mainly relies on heterogeneous
computing units such as Application-specific Integrated Circuit (ASIC),
Graphics Processing Unit (GPU), and Field Programmable Gate Array (FPGA). Among
the typical heterogeneous architectures above, FPGA-based reconfigurable
accelerators have two merits. First, the FPGA fabric contains a large number of reconfigurable circuits, which satisfy the requirements of high performance and low power consumption when specific applications are running. Second, FPGA-based reconfigurable architectures enable rapid prototyping and feature excellent customizability and reconfigurability. A batch of acceleration works based on FPGAs or other reconfigurable architectures has recently emerged at top-tier computer architecture conferences. To better review this recent work, this survey takes the latest research on reconfigurable accelerator architectures and their algorithm applications as its basis. We compare the main research issues and application domains, and we analyze the advantages, disadvantages, and challenges of reconfigurable accelerators. Finally, we discuss the future development of accelerator architectures, hoping to provide a reference for computer architecture researchers.
Progressive Stochastic Binarization of Deep Networks
A plethora of recent research has focused on improving the memory footprint
and inference speed of deep networks by reducing the complexity of (i)
numerical representations (for example, by deterministic or stochastic
quantization) and (ii) arithmetic operations (for example, by binarization of
weights).
We propose a stochastic binarization scheme for deep networks that allows for
efficient inference on hardware by restricting itself to additions of small
integers and fixed shifts. Unlike previous approaches, the underlying
randomized approximation is progressive, thus permitting an adaptive control of
the accuracy of each operation at run-time. In a low-precision setting, we
match the accuracy of previous binarized approaches. Our representation is
unbiased - it approaches continuous computation with increasing sample size. In
a high-precision regime, the computational costs are competitive with previous
quantization schemes. Progressive stochastic binarization also permits
localized, dynamic accuracy control within a single network, thereby providing
a new tool for adaptively focusing computational attention.
We evaluate our method on networks of various architectures, already
pretrained on ImageNet. With representational costs comparable to previous
schemes, we obtain accuracies close to the original floating point
implementation. This includes pruned networks, except for the known special case of certain types of separable convolutions. By focusing computational attention using progressive sampling, we further reduce inference costs on ImageNet by up to 33% (before network pruning).
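The core idea of an unbiased, progressively refinable binary approximation can be sketched as follows. This is a generic illustration rather than the paper's exact scheme: each weight is replaced by the mean of K random draws from {-s, +s} whose expectation equals the weight, so accuracy improves as K grows.

```python
import numpy as np

def stochastic_binarize(w: np.ndarray, samples: int, seed: int = 0) -> np.ndarray:
    """Unbiased stochastic binarization sketch: w in [-s, s] becomes the mean
    of `samples` draws from {-s, +s} with P(+s) = (w/s + 1)/2, so that
    E[result] = w. More samples -> higher precision ('progressive')."""
    rng = np.random.default_rng(seed)
    s = np.abs(w).max()
    p_plus = (w / s + 1.0) / 2.0                        # probability of drawing +s
    draws = rng.random((samples,) + w.shape) < p_plus   # Bernoulli(+s) samples
    return s * (2.0 * draws.mean(axis=0) - 1.0)

w = np.random.randn(4, 4)
coarse = stochastic_binarize(w, samples=1)    # pure binary: cheapest, noisiest
fine = stochastic_binarize(w, samples=256)    # converges toward w as K grows
print(np.abs(coarse - w).mean(), np.abs(fine - w).mean())
```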
A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers
Current Deep Learning approaches have been very successful using
convolutional neural networks (CNN) trained on large graphical processing units
(GPU)-based computers. Three limitations of this approach are: 1) they are
based on a simple layered network topology, i.e., highly connected layers,
without intra-layer connections; 2) the networks are manually configured to
achieve optimal results; and 3) the implementation of the neuron model is expensive
in both cost and power. In this paper, we evaluate deep learning models using
three different computing architectures to address these problems: quantum
computing to train complex topologies, high performance computing (HPC) to
automatically determine network topology, and neuromorphic computing for a
low-power hardware implementation. We use the MNIST dataset for our experiment,
due to input size limitations of current quantum computers. Our results show
the feasibility of using the three architectures in tandem to address the above
deep learning limitations. We show that a quantum computer can find high-quality values for intra-layer connection weights in tractable time as the complexity of the network increases; a high-performance computer can find
optimal layer-based topologies; and a neuromorphic computer can represent the
complex topology and weights derived from the other architectures in low power
memristive hardware.
Espresso: Efficient Forward Propagation for BCNNs
There are many application scenarios for which the computational performance and memory footprint of the prediction phase of Deep Neural Networks (DNNs) need to be optimized. Binary Neural Networks (BDNNs) have been shown to be an
effective way of achieving this objective. In this paper, we show how
Convolutional Neural Networks (CNNs) can be implemented using binary
representations. Espresso is a compact, yet powerful library written in C/CUDA
that features all the functionalities required for the forward propagation of
CNNs, in a binary file less than 400KB, without any external dependencies.
Although it is mainly designed to take advantage of massive GPU parallelism,
Espresso also provides an equivalent CPU implementation for CNNs. Espresso
provides special convolutional and dense layers for BCNNs, leveraging
bit-packing and bit-wise computations for efficient execution. These techniques
provide a speed-up of matrix-multiplication routines, and at the same time,
reduce memory usage when storing parameters and activations. We experimentally
show that Espresso is significantly faster than existing implementations of
optimized binary neural networks (about two orders of magnitude). Espresso is
released under the Apache 2.0 license and is available at
http://github.com/fpeder/espresso.
Comment: 10 pages, 4 figures.
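The bit-packing trick behind such speed-ups is easy to demonstrate. The sketch below packs ±1 vectors into a machine integer and computes their dot product with XNOR and popcount; it is plain Python (3.10+ for int.bit_count) for clarity, whereas Espresso performs the same computation wordwise in C/CUDA.

```python
import numpy as np

def pack(x: np.ndarray) -> int:
    """Pack a {-1, +1} vector into one integer: bit i is set iff x[i] == +1."""
    bits = (x > 0).astype(np.uint8)
    return int.from_bytes(np.packbits(bits, bitorder="little").tobytes(), "little")

def binary_dot(a: int, b: int, n: int) -> int:
    """Dot product of two packed {-1, +1} vectors of length n via XNOR and
    popcount: agreeing bits contribute +1, disagreeing bits -1."""
    matches = ~(a ^ b) & ((1 << n) - 1)  # XNOR, masked to the n valid bits
    return 2 * matches.bit_count() - n   # requires Python >= 3.10

x = np.sign(np.random.randn(256)); x[x == 0] = 1.0
y = np.sign(np.random.randn(256)); y[y == 0] = 1.0
assert binary_dot(pack(x), pack(y), 256) == int(x @ y)
```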
NetScore: Towards Universal Metrics for Large-scale Performance Analysis of Deep Neural Networks for Practical On-Device Edge Usage
Much of the focus in the design of deep neural networks has been on improving
accuracy, leading to more powerful yet highly complex network architectures
that are difficult to deploy in practical scenarios, particularly on edge
devices such as mobile and other consumer devices given their high
computational and memory requirements. As a result, there has been a recent
interest in the design of quantitative metrics for evaluating deep neural
networks that account for more than just model accuracy as the sole indicator
of network performance. In this study, we continue the conversation towards
universal metrics for evaluating the performance of deep neural networks for
practical on-device edge usage. In particular, we propose a new balanced metric
called NetScore, which is designed specifically to provide a quantitative
assessment of the balance between accuracy, computational complexity, and
network architecture complexity of a deep neural network, which is important
for on-device edge operation. In what is one of the largest comparative
analyses of deep neural networks in the literature, the NetScore metric, the
top-1 accuracy metric, and the popular information density metric were compared
across a diverse set of 60 different deep convolutional neural networks for
image classification on the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC 2012) dataset. The evaluation results across these three metrics for
this diverse set of networks are presented in this study to act as a reference
guide for practitioners in the field. The proposed NetScore metric, along with the other tested metrics, is by no means perfect, but the hope is to push the conversation towards better universal metrics for evaluating deep neural networks in practical on-device edge scenarios and to help guide practitioners in model design for such scenarios.
Comment: 9 pages.
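For reference, the NetScore formula as we read it from the paper balances accuracy against two complexity terms. The sketch below uses the exponents and units we believe the paper reports (accuracy weighted quadratically, parameters and MACs in millions under square roots); treat these constants as our reading rather than a normative definition.

```python
import math

def netscore(top1_acc_pct: float, params_millions: float, macs_millions: float,
             alpha: float = 2.0, beta: float = 0.5, gamma: float = 0.5) -> float:
    """Omega = 20 * log10(a^alpha / (p^beta * m^gamma)): higher is better.
    Accuracy dominates, but parameter count and multiply-accumulate
    operations both drag the score down (exponents assumed, see lead-in)."""
    return 20.0 * math.log10(
        top1_acc_pct ** alpha / (params_millions ** beta * macs_millions ** gamma)
    )

# Hypothetical network: 71% top-1, 4.2M parameters, 570M MACs -> ~40.3
print(netscore(71.0, 4.2, 570.0))
```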