CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices
Large-scale deep neural networks (DNNs) are both compute and memory
intensive. As the size of DNNs continues to grow, it is critical to improve the
energy efficiency and performance while maintaining accuracy. For DNNs, the
model size is an important factor affecting performance, scalability and energy
efficiency. Weight pruning achieves good compression ratios but suffers from
three drawbacks: 1) the irregular network structure after pruning; 2) the
increased training complexity; and 3) the lack of rigorous guarantee of
compression ratio and inference accuracy. To overcome these limitations, this
paper proposes CirCNN, a principled approach to represent weights and process
neural networks using block-circulant matrices. CirCNN utilizes the Fast
Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the
computational complexity (both in inference and training) from O(n^2) to
O(n log n) and the storage complexity from O(n^2) to O(n), with negligible
accuracy loss. Compared to other approaches, CirCNN is distinct due to its
mathematical rigor: it can converge to the same effectiveness as DNNs without
compression. The CirCNN architecture is a universal DNN inference engine that
can be implemented on various hardware/software platforms with a configurable
network architecture. To demonstrate its performance and energy efficiency, we test
CirCNN in FPGA, ASIC and embedded processors. Our results show that CirCNN
architecture achieves very high energy efficiency and performance with a small
hardware footprint. Based on the FPGA implementation and ASIC synthesis
results, CirCNN achieves 6-102X energy efficiency improvements compared with
the best state-of-the-art results.
Comment: 14 pages, 15 figures, conference
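To make the FFT shortcut concrete, here is a minimal numpy sketch of the circulant matrix-vector product the abstract relies on; the block partitioning, training procedure, and hardware mapping of CirCNN itself are not modeled.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by the vector x.

    Uses the identity C @ x = IFFT(FFT(c) * FFT(x)) (circular convolution),
    which replaces the O(n^2) dense product with O(n log n) FFTs.
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Sanity check against the explicit dense circulant product.
n = 8
rng = np.random.default_rng(0)
c, x = rng.normal(size=n), rng.normal(size=n)
C = np.stack([np.roll(c, j) for j in range(n)], axis=1)  # C[i, j] = c[(i - j) % n]
assert np.allclose(C @ x, circulant_matvec(c, x))
```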
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows.
Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal, 2018
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference
Research has shown that convolutional neural networks contain significant
redundancy, and high classification accuracy can be obtained even when weights
and activations are reduced from floating point to binary values. In this
paper, we present FINN, a framework for building fast and flexible FPGA
accelerators using a flexible heterogeneous streaming architecture. By
utilizing a novel set of optimizations that enable efficient mapping of
binarized neural networks to hardware, we implement fully connected,
convolutional and pooling layers, with per-layer compute resources being
tailored to user-provided throughput requirements. On a ZC706 embedded FPGA
platform drawing less than 25 W total system power, we demonstrate up to 12.3
million image classifications per second with 0.31 µs latency on the MNIST
dataset with 95.8% accuracy, and 21906 image classifications per second with
283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9%
accuracy, respectively. To the best of our knowledge, ours are the fastest
classification rates reported to date on these benchmarks.
Comment: To appear in the 25th International Symposium on Field-Programmable Gate Arrays, February 2017
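To make the binarized arithmetic concrete, the sketch below shows the XNOR-popcount identity that binarized dot products reduce to; this is a plain software illustration of the identity, not FINN's FPGA datapath.

```python
import numpy as np

def binarized_dot(w_bits, x_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed as 0/1 bits.

    XNOR marks positions where the two vectors agree; with
    matches = popcount(XNOR), the {-1,+1} dot product is
    matches - (n - matches) = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(w_bits ^ x_bits) & mask).count("1")  # XNOR + popcount
    return 2 * matches - n

# Pack {-1,+1} vectors into integers (+1 -> bit 1, -1 -> bit 0) and compare.
n = 16
rng = np.random.default_rng(0)
w, x = rng.choice([-1, 1], n), rng.choice([-1, 1], n)
pack = lambda v: int("".join("1" if b > 0 else "0" for b in v), 2)
assert binarized_dot(pack(w), pack(x), n) == int(w @ x)
```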
WRPN: Wide Reduced-Precision Networks
For computer vision applications, prior works have shown the efficacy of
reducing numeric precision of model parameters (network weights) in deep neural
networks. Activation maps, however, occupy a large memory footprint during both
the training and inference step when using mini-batches of inputs. One way to
reduce this large memory footprint is to reduce the precision of activations.
However, past works have shown that reducing the precision of activations hurts
model accuracy. We study schemes to train networks from scratch using
reduced-precision activations without hurting accuracy. We reduce the precision
of activation maps (along with model parameters) and increase the number of
filter maps in a layer, and find that this scheme matches or surpasses the
accuracy of the baseline full-precision network. As a result, one can
significantly improve the execution efficiency (e.g. reduce dynamic memory
footprint, memory bandwidth and computational energy) and speed up the training
and inference process with appropriate hardware support. We call our scheme
WRPN - wide reduced-precision networks. We report results showing that the
WRPN scheme achieves better accuracy on the ILSVRC-12 dataset than previously
reported reduced-precision networks, while being computationally less
expensive.
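As a rough illustration of the activation side of this scheme, here is a minimal sketch of k-bit uniform quantization of post-ReLU activation maps, assuming values clipped to [0, 1]; WRPN couples such reduced precision with wider layers (more filter maps), which the sketch does not model.

```python
import numpy as np

def quantize_activations(x, k):
    """Uniformly quantize post-ReLU activations to k bits.

    Values are clipped to [0, 1] and snapped to one of 2^k - 1 uniform
    steps; the quantization error is at most half a step for in-range values.
    """
    levels = 2 ** k - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=(4, 8)), 0.0)   # toy ReLU activation map
x4 = quantize_activations(x, k=4)              # 4-bit activations
print(np.abs(x4 - np.clip(x, 0, 1)).max())     # <= 1 / (2 * 15) for in-range values
```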
Recent Advances in Convolutional Neural Network Acceleration
In recent years, convolutional neural networks (CNNs) have shown great
performance in various fields such as image classification, pattern
recognition, and multi-media compression. Two of the feature properties, local
connectivity and weight sharing, can reduce the number of parameters and
increase processing speed during training and inference. However, as data
dimensionality grows and CNN architectures become more complicated, end-to-end
or combined CNN pipelines become computationally intensive, which limits the
further deployment of CNNs. It is therefore necessary and urgent to accelerate
CNNs. In this paper, we first summarize acceleration methods that apply to,
but are not limited to, CNNs by reviewing a broad variety of research
papers. We propose a taxonomy of acceleration methods with three levels: the
structure level, the algorithm level, and the implementation level. We also
analyze the acceleration methods in terms of CNN architecture compression,
algorithm optimization, and hardware-based improvement. Finally, we discuss
these acceleration and optimization methods from different perspectives within
each level, showing that the methods at every level still leave a large space
for exploration. By incorporating such a wide range of disciplines, we expect
to provide a comprehensive reference for researchers who are interested in CNN
acceleration.
Comment: Submitted to Neurocomputing
Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework
Hardware accelerations of deep learning systems have been extensively
investigated in industry and academia. The aim of this paper is to achieve
ultra-high energy efficiency and performance for hardware implementations of
deep neural networks (DNNs). An algorithm-hardware co-optimization framework is
developed, which is applicable to different DNN types, sizes, and application
scenarios. The algorithm part adopts the general block-circulant matrices to
achieve a fine-grained tradeoff between accuracy and compression ratio. It
applies to both fully-connected and convolutional layers and contains a
mathematically rigorous proof of the effectiveness of the method. The proposed
algorithm reduces the computational complexity per layer from O(n^2) to
O(n log n) and the storage complexity from O(n^2) to O(n), both for training and
inference. The hardware part consists of highly efficient Field Programmable
Gate Array (FPGA)-based implementations using effective reconfiguration, batch
processing, deep pipelining, resource re-using, and hierarchical control.
Experimental results demonstrate that the proposed framework achieves at least
152X speedup and 71X energy efficiency gain compared with IBM TrueNorth
processor under the same test accuracy. It achieves at least 31X energy
efficiency gain compared with the reference FPGA-based work.
Comment: 6 figures, AAAI Conference on Artificial Intelligence, 2018
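The sketch below illustrates the block-circulant forward pass behind these complexity figures, assuming a weight matrix partitioned into k x k circulant blocks; the FPGA-side techniques (reconfiguration, batching, pipelining) are not modeled.

```python
import numpy as np

def block_circulant_matvec(w, x_blocks):
    """Forward pass y = W @ x for a block-circulant weight matrix W.

    w has shape (p, q, k): block (i, j) of W is the k x k circulant matrix
    defined by the vector w[i, j]. Storage is O(p*q*k) instead of
    O(p*q*k^2), and each block product costs O(k log k) via the FFT.
    """
    W = np.fft.fft(w, axis=-1)                  # (p, q, k) block spectra
    X = np.fft.fft(x_blocks, axis=-1)           # (q, k) input block spectra
    Y = (W * X[None, :, :]).sum(axis=1)         # accumulate over block columns
    return np.real(np.fft.ifft(Y, axis=-1))     # (p, k) output blocks

p, q, k = 2, 3, 8                               # toy layer: 16 outputs, 24 inputs
rng = np.random.default_rng(0)
y = block_circulant_matvec(rng.normal(size=(p, q, k)), rng.normal(size=(q, k)))
print(y.shape)                                  # (2, 8)
```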
Reconfigurable Hardware Accelerators: Opportunities, Trends, and Challenges
With the emergence of big data applications such as machine learning, speech
recognition, artificial intelligence, and DNA sequencing in recent years, the
computer architecture research community is facing an explosive growth in data
scale. To achieve high efficiency in data-intensive computing, heterogeneous
accelerators targeting these applications have become a major research topic
in computer architecture. At present, heterogeneous accelerators are mainly
implemented with computing units such as Application-Specific Integrated
Circuits (ASICs), Graphics Processing Units (GPUs), and Field Programmable
Gate Arrays (FPGAs). Among these, FPGA-based reconfigurable accelerators have
two merits. First, an FPGA contains a large number of reconfigurable circuits
that can satisfy the high-performance, low-power requirements of specific
applications. Second, FPGA-based reconfigurable architectures enable rapid
prototyping and offer excellent customizability and reconfigurability. A
growing number of acceleration works based on FPGAs and other reconfigurable
architectures are now appearing at top-tier computer architecture conferences.
To review this recent work, this survey takes the latest research on
reconfigurable accelerator architectures and their algorithmic applications as
its basis. We compare the main research issues and application domains, and
analyze the advantages, disadvantages, and challenges of reconfigurable
accelerators. Finally, we discuss likely future directions for accelerator
architectures, hoping to provide a reference for computer architecture
researchers.
Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-Neural-Network Architectures
Over the last five years, Deep Neural Nets have offered more accurate
solutions to many problems in speech recognition and computer vision, and
these solutions have surpassed a threshold of acceptability for many
applications. As a result, Deep Neural Networks have supplanted other
approaches to solving problems in these areas, and enabled many new
applications. While the design of Deep Neural Nets is still something of an art
form, in our work we have found basic principles of design space exploration
used to develop embedded microprocessor architectures to be highly applicable
to the design of Deep Neural Net architectures. In particular, we have used
these design principles to create a novel Deep Neural Net called SqueezeNet
that requires as little as 480KB of storage for its model parameters. We have
further integrated all these experiences to develop something of a playbook for
creating small Deep Neural Nets for embedded systems.
Comment: Keynote at Embedded Systems Week (ESWEEK) 2017
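As a back-of-the-envelope illustration of where such savings come from, the sketch below compares the weight count of a SqueezeNet-style Fire module (a 1x1 "squeeze" layer feeding parallel 1x1 and 3x3 "expand" layers) with a plain 3x3 convolution; the layer sizes are illustrative, not the exact SqueezeNet configuration.

```python
def fire_params(c_in, s1, e1, e3):
    """Weight count of a Fire module: a 1x1 squeeze conv down to s1 channels,
    then parallel 1x1 (e1 filters) and 3x3 (e3 filters) expand convs.
    Biases are ignored for simplicity."""
    return c_in * s1 + s1 * e1 + s1 * e3 * 3 * 3

c_in, c_out = 128, 128
print(fire_params(c_in, s1=16, e1=c_out // 2, e3=c_out // 2))  # 12288 weights
print(c_in * c_out * 3 * 3)  # 147456 weights for a plain 3x3 conv, ~12x more
```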
NICE: Noise Injection and Clamping Estimation for Neural Network Quantization
Convolutional Neural Networks (CNNs) are very popular in many fields,
including computer vision, speech recognition, and natural language
processing. Though deep learning leads to groundbreaking performance in these
domains, the networks used are computationally demanding and far from
real-time even on a GPU, which is not power efficient and therefore does not
suit low-power systems such as mobile devices. To overcome this challenge, some
solutions have been proposed for quantizing the weights and activations of
these networks, which accelerate the runtime significantly. Yet, this
acceleration comes at the cost of a larger error. The NICE method proposed in
this work trains quantized neural networks by noise injection and learned
clamping, which improve accuracy. This leads to state-of-the-art results on
various regression and classification tasks, e.g., ImageNet classification with
architectures such as ResNet-18/34/50 with as low as 3-bit weights and
activations. We implement the proposed solution on an FPGA to demonstrate its
applicability for low power real-time applications. The implementation of the
paper is available at https://github.com/Lancer555/NIC
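The sketch below illustrates the two ingredients named here, noise injection and clamping; it is an assumed simplification, not the authors' exact procedure (in NICE the clamp bound is learned by backpropagation and quantization is introduced gradually during training).

```python
import numpy as np

def noisy_quantize(w, bits, rng, hard_frac=0.05):
    """Emulate weight quantization during training via noise injection.

    A random subset of weights is hard-quantized; the rest receive uniform
    noise matching the quantization step, so the forward pass behaves like
    the quantized network while most weights keep smooth gradients.
    (Sketch of the idea; subset size and scheduling are assumptions.)
    """
    step = 2 * np.abs(w).max() / (2 ** bits - 1)       # uniform step size
    hard = rng.random(w.shape) < hard_frac             # subset to hard-quantize
    noise = rng.uniform(-step / 2, step / 2, w.shape)
    return np.where(hard, np.round(w / step) * step, w + noise)

def clamped_relu(x, alpha):
    """ReLU clamped to a bound alpha (learned by backprop in the method)."""
    return np.clip(x, 0.0, alpha)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
print(noisy_quantize(w, bits=3, rng=rng))
print(clamped_relu(rng.normal(size=5), alpha=1.5))
```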
EIE: Efficient Inference Engine on Compressed Deep Neural Network
State-of-the-art deep neural networks (DNNs) have hundreds of millions of
connections and are both computationally and memory intensive, making them
difficult to deploy on embedded systems with limited hardware resources and
power budgets. While custom hardware helps the computation, fetching weights
from DRAM is two orders of magnitude more expensive than ALU operations, and
dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs
(AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by
pruning the redundant connections and having multiple connections share the
same weight. We propose an energy efficient inference engine (EIE) that
performs inference on this compressed network model and accelerates the
resulting sparse matrix-vector multiplication with weight sharing. Going from
DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x;
weight sharing gives 8x; and skipping zero activations from ReLU saves another 3x.
Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to
CPU and GPU implementations of the same DNN without compression. EIE has a
processing power of 102 GOPS working directly on a compressed network,
corresponding to 3 TOPS on an uncompressed network, and processes FC layers of
AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It is
24,000x and 3,400x more energy efficient than a CPU and a GPU, respectively.
Compared with DaDianNao, EIE has 2.9x, 19x, and 3x better throughput, energy
efficiency, and area efficiency, respectively.
Comment: Published as a conference paper in ISCA 2016
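To make the compressed-domain computation concrete, below is a minimal Python sketch of a sparse matrix-vector product with shared weights and zero-activation skipping; it is an illustration under simplifying assumptions (a CSC-like layout with one codebook index per nonzero), not EIE's design, which uses relative 4-bit indices and many parallel processing elements.

```python
import numpy as np

def compressed_matvec(codebook, w_idx, row_idx, col_ptr, x, n_rows):
    """Sparse matrix-vector product with weight sharing, skipping zeros.

    The matrix is stored column-wise (CSC-like): the nonzeros of column j sit
    in positions col_ptr[j]:col_ptr[j+1], each as a (row, codebook index)
    pair. Columns with x[j] == 0 are never touched, which is where the
    activation-sparsity savings come from.
    """
    y = np.zeros(n_rows)
    for j in np.flatnonzero(x):                        # skip zero activations
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_idx[k]] += codebook[w_idx[k]] * x[j]
    return y

# Toy 3x3 matrix with two shared weight values and one zero activation.
codebook = np.array([0.5, -1.0])                       # shared weight table
col_ptr = np.array([0, 1, 2, 4])                       # column offsets
row_idx = np.array([1, 0, 1, 2])                       # row of each nonzero
w_idx = np.array([1, 0, 0, 1])                         # codebook index per nonzero
x = np.array([2.0, 0.0, 3.0])
print(compressed_matvec(codebook, w_idx, row_idx, col_ptr, x, n_rows=3))  # [0. -0.5 -3.]
```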