Recent Advances in Efficient Computation of Deep Convolutional Neural Networks
Deep neural networks have evolved remarkably over the past few years and they
are currently the fundamental tools of many intelligent systems. At the same
time, the computational complexity and resource consumption of these networks
also continue to increase. This will pose a significant challenge to the
deployment of such networks, especially in real-time applications or on
resource-limited devices. Thus, network acceleration has become a hot topic
within the deep learning community. For the hardware implementation of deep
neural networks, a number of FPGA/ASIC-based accelerators have been proposed
in recent years. In this paper, we provide a comprehensive survey of recent
advances in network acceleration, compression and accelerator design from both
algorithm and hardware points of view. Specifically, we provide a thorough
analysis of each of the following topics: network pruning, low-rank
approximation, network quantization, teacher-student networks, compact network
design and hardware accelerators. Finally, we introduce and discuss a few
possible future directions.
Comment: 14 pages, 3 figures
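As one concrete instance of the surveyed techniques, the sketch below illustrates
low-rank approximation: a fully connected weight matrix is replaced by the product
of two thin factors obtained from a truncated SVD, trading a small approximation
error for fewer parameters and multiply-accumulates. The matrix size and rank are
illustrative assumptions, not values from the survey.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Return thin factors U (m x r) and V (r x n) with W ~= U @ V."""
    u, s, vt = np.linalg.svd(W, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

# Illustrative fully connected layer; trained weights usually compress far
# better than this random matrix does.
W = np.random.randn(1024, 1024).astype(np.float32)
U, V = low_rank_factorize(W, rank=128)
err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
ratio = (U.size + V.size) / W.size
print(f"relative error: {err:.3f}, parameters/MACs vs dense: {ratio:.2f}")
```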
SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation
We present SmartExchange, an algorithm-hardware co-design framework to trade
higher-cost memory storage/access for lower-cost computation, for
energy-efficient inference of deep neural networks (DNNs). We develop a novel
algorithm to enforce a specially favorable DNN weight structure, where each
layerwise weight matrix can be stored as the product of a small basis matrix
and a large sparse coefficient matrix whose non-zero elements are all
power-of-2. To the best of our knowledge, this algorithm is the first formulation that
integrates three mainstream model compression ideas: sparsification or pruning,
decomposition, and quantization, into one unified framework. The resulting
sparse and readily-quantized DNN thus enjoys greatly reduced energy consumption
in data movement as well as weight storage. On top of that, we further design a
dedicated accelerator to fully utilize the SmartExchange-enforced weights to
improve both energy efficiency and latency performance. Extensive experiments
show that 1) on the algorithm level, SmartExchange outperforms state-of-the-art
compression techniques that apply sparsification/pruning, decomposition, or
quantization alone, in various ablation studies based on nine DNN
models and four datasets; and 2) on the hardware level, the proposed
SmartExchange based accelerator can improve the energy efficiency by up to
6.7x and the speedup by up to 19.2x over four state-of-the-art
DNN accelerators, when benchmarked on seven DNN models (including four standard
DNNs, two compact DNN models, and one segmentation model) and three datasets.
Comment: Accepted by the 47th International Symposium on Computer Architecture (ISCA'2020)
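The sketch below illustrates the weight structure described above, not the
authors' training algorithm: a layer weight W is approximated as Ce @ B, where B
is a small dense basis and Ce is a large sparse coefficient matrix whose non-zero
entries are rounded to powers of two. The sizes, sparsity level, and the simple
alternating fit are illustrative assumptions.

```python
import numpy as np

def round_to_power_of_two(x, min_exp=-8, max_exp=2):
    """Round each entry to the nearest signed power of two (zeros stay zero)."""
    sign = np.sign(x)
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(x), 2.0**min_exp))),
                  min_exp, max_exp)
    return sign * 2.0**exp

def smartexchange_like_decompose(W, k=8, sparsity=0.7, iters=5):
    """Alternately fit a small basis B (k x N) and a sparse power-of-2
    coefficient matrix Ce (M x k) so that W ~= Ce @ B."""
    M, N = W.shape
    rng = np.random.default_rng(0)
    B = W[rng.choice(M, size=k, replace=False)]          # initial basis: k rows of W
    for _ in range(iters):
        Ce = np.linalg.lstsq(B.T, W.T, rcond=None)[0].T  # dense least-squares fit
        thr = np.quantile(np.abs(Ce), sparsity)          # prune small coefficients
        Ce = np.where(np.abs(Ce) < thr, 0.0, round_to_power_of_two(Ce))
        B = np.linalg.lstsq(Ce, W, rcond=None)[0]        # refit the dense basis
    return Ce, B

W = np.random.randn(256, 64).astype(np.float32)
Ce, B = smartexchange_like_decompose(W)
err = np.linalg.norm(W - Ce @ B) / np.linalg.norm(W)
print(f"relative error: {err:.3f}, non-zeros in Ce: {np.count_nonzero(Ce)}")
```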
FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks
Convolutional Neural Networks have rapidly become the most successful machine
learning algorithm, enabling ubiquitous machine vision and intelligent
decisions even on embedded computing systems. While the underlying arithmetic
is structurally simple, compute and memory requirements are challenging. One of
the promising opportunities is leveraging reduced-precision representations for
inputs, activations and model parameters. The resulting scalability in
performance, power efficiency and storage footprint provides interesting design
compromises in exchange for a small reduction in accuracy. FPGAs are ideal for
exploiting low-precision inference engines leveraging custom precisions to
achieve the required numerical accuracy for a given application. In this
article, we describe the second generation of the FINN framework, an end-to-end
tool which enables design space exploration and automates the creation of fully
customized inference engines on FPGAs. Given a neural network description, the
tool optimizes for given platforms, design targets and a specific precision. We
introduce formalizations of resource cost functions and performance
predictions, and elaborate on the optimization algorithms. Finally, we evaluate
a selection of reduced precision neural networks ranging from CIFAR-10
classifiers to YOLO-based object detection on a range of platforms including
PYNQ and AWS F1, demonstrating unprecedented measured throughput of
50 TOp/s on AWS F1 and 5 TOp/s on embedded devices.
Comment: to be published in the ACM TRETS Special Edition on Deep Learning
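To make the resource cost and performance prediction idea concrete, here is a
hedged, roofline-style sketch of how a folded dataflow accelerator's frame rate
could be estimated: each layer's multiply-accumulates are divided over an assumed
pe x simd array of compute units, and the slowest layer bounds the pipeline. The
formulas, folding factors, and clock rate are illustrative and are not FINN-R's
actual cost model.

```python
def layer_cycles(total_macs, pe, simd):
    """Cycles per frame when a layer is folded onto pe x simd MAC units
    (one MAC per unit per cycle)."""
    return -(-int(total_macs) // (pe * simd))   # ceiling division

def pipeline_fps(layer_macs, folding, clock_hz=200e6):
    """In a dataflow pipeline the slowest layer sets the frame rate."""
    worst = max(layer_cycles(m, pe, simd)
                for m, (pe, simd) in zip(layer_macs, folding))
    return clock_hz / worst

# Three hypothetical conv layers: per-frame MAC counts and chosen folding.
macs = [1.2e6, 4.6e6, 2.3e6]
folding = [(16, 8), (32, 16), (16, 16)]
print(f"predicted throughput: {pipeline_fps(macs, folding):.0f} frames/s")
```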
CodeX: Bit-Flexible Encoding for Streaming-based FPGA Acceleration of DNNs
This paper proposes CodeX, an end-to-end framework that facilitates encoding,
bitwidth customization, fine-tuning, and implementation of neural networks on
FPGA platforms. CodeX incorporates nonlinear encoding to the computation flow
of neural networks to save memory. The encoded features demand significantly
lower storage compared to the raw full-precision activation values; therefore,
the execution flow of the CodeX hardware engine is performed entirely within the
FPGA using on-chip streaming buffers with no access to the off-chip DRAM. We
further propose a fully-automated algorithm inspired by reinforcement learning
which determines the customized encoding bitwidth across network layers. The
CodeX full-stack framework comprises a compiler that takes a high-level Python
description of an arbitrary neural network architecture. The compiler then
instantiates the corresponding elements from the CodeX hardware library for FPGA
implementation. Proof-of-concept evaluations on MNIST, SVHN, and CIFAR-10
datasets demonstrate an average of 4.65x throughput improvement compared to
stand-alone weight encoding. We further compare CodeX with six existing
full-precision DNN accelerators on ImageNet, showing an average of 3.6x and
2.54x improvement in throughput and performance-per-watt, respectively.
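A minimal sketch of non-uniform activation encoding in this spirit follows:
activations are replaced by short codebook indices so that only b-bit codes need
to sit in on-chip buffers. The quantile-placed codebook is an illustrative
stand-in; CodeX learns its encoding and chooses per-layer bitwidths with a
reinforcement-learning-based search.

```python
import numpy as np

def build_codebook(activations, bits):
    """Place 2^bits codebook entries at quantiles of the activation distribution."""
    levels = 2 ** bits
    qs = (np.arange(levels) + 0.5) / levels
    return np.quantile(activations, qs)

def encode(activations, codebook):
    """Map each activation to the index of its nearest codebook entry."""
    return np.abs(activations[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

def decode(codes, codebook):
    return codebook[codes]

acts = np.maximum(np.random.randn(4, 32, 32, 16), 0)   # ReLU-like feature maps
for bits in (2, 3, 4):
    cb = build_codebook(acts.ravel(), bits)
    rec = decode(encode(acts, cb), cb)
    mse = np.mean((acts - rec) ** 2)
    print(f"{bits}-bit codes: {32 / bits:.1f}x smaller than fp32, MSE={mse:.4f}")
```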
Neural Network-Hardware Co-design for Scalable RRAM-based BNN Accelerators
Recently, RRAM-based Binary Neural Network (BNN) hardware has been gaining
interest, as it requires only 1-bit sense amplifiers and eliminates the need for
high-resolution ADCs and DACs. However, RRAM-based BNN hardware still requires
high-resolution ADCs for partial-sum calculation when implementing a large-scale
neural network with multiple memory arrays. We propose a neural network-hardware
co-design approach that splits the input so that each split network fits on an
RRAM array and the reconstructed BNN computes a 1-bit output neuron in each
array. As a result, ADCs can be completely eliminated from the design even for
large-scale neural networks. Simulation results show that the proposed network
reconstruction and retraining recovers the inference accuracy of the original
BNN. The accuracy loss of the proposed scheme on the CIFAR-10 test case was less
than 1.1% compared to the original network. The code for training and running
the proposed BNN models is available at:
https://github.com/YulhwaKim/RRAMScalable_BNN
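The sketch below illustrates the split-input idea (not the authors' exact mapping
or the released code): a wide binary layer is partitioned so that each input slice
fits one RRAM array, and every array emits a 1-bit neuron through a sign
comparison against a per-array threshold, so no multi-bit ADC is needed to combine
partial sums. The array size and the zero thresholds are illustrative; in the
paper the reconstructed BNN is retrained.

```python
import numpy as np

ARRAY_ROWS = 128   # assumed number of rows (inputs) per RRAM array

def binary_layer_split(x_bin, w_bin, thresholds):
    """x_bin: (n_in,) in {-1,+1}; w_bin: (n_in, n_out) in {-1,+1}.
    Each slice of ARRAY_ROWS inputs maps to its own array, and the array's
    partial sum is binarized locally (modeling a 1-bit sense amplifier)."""
    outputs = []
    for start in range(0, x_bin.shape[0], ARRAY_ROWS):
        rows = slice(start, start + ARRAY_ROWS)
        partial = x_bin[rows] @ w_bin[rows]
        outputs.append(np.where(partial >= thresholds[start // ARRAY_ROWS], 1, -1))
    return np.concatenate(outputs)   # 1-bit outputs from all arrays

x = np.sign(np.random.randn(512))
w = np.sign(np.random.randn(512, 64))
thr = np.zeros((512 // ARRAY_ROWS, 64))   # per-array thresholds (learned in retraining)
print(binary_layer_split(x, w, thr).shape)   # 4 arrays x 64 one-bit neurons
```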
A Survey of FPGA-Based Neural Network Accelerator
Recent research on neural networks has shown significant advantages in
machine learning over traditional algorithms based on handcrafted features and
models. Neural networks are now widely adopted in areas such as image, speech,
and video recognition. However, the high computation and storage complexity of
neural network inference poses great difficulty for its application. CPU
platforms can hardly offer enough computation capacity. GPU platforms are the
first choice for neural network processing because of their high computation
capacity and easy-to-use development frameworks.
On the other hand, FPGA-based neural network inference accelerators are
becoming a research topic. With specifically designed hardware, FPGAs are a
possible next solution to surpass GPUs in speed and energy efficiency. Various
FPGA-based accelerator designs have been proposed, using software and hardware
optimization techniques to achieve high speed and energy efficiency. In this
paper, we give an overview of previous work on FPGA-based neural network
inference accelerators and summarize the main techniques used. An investigation
from software to hardware and from circuit level to system level is carried out
to provide a complete analysis of FPGA-based neural network inference
accelerator design and to serve as a guide for future work.
AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
Highly distributed training of Deep Neural Networks (DNNs) on future compute
platforms (offering 100s of TeraOps/s of computational capacity) is expected to
be severely communication constrained. To overcome this limitation, new
gradient compression techniques are needed that are computationally friendly,
applicable to a wide variety of layers seen in Deep Neural Networks and
adaptable to variations in network architectures as well as their
hyper-parameters. In this paper we introduce a novel technique - the Adaptive
Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized
selection of gradient residues and automatically tunes the compression rate
depending on local activity. We show excellent results on a wide spectrum of
state of the art Deep Learning models in multiple domains (vision, speech,
language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers
(SGD with momentum, Adam) and network parameters (number of learners,
minibatch-size etc.). Exploiting both sparsity and quantization, we demonstrate
end-to-end compression rates of ~200X for fully-connected and recurrent layers,
and ~40X for convolutional layers, without any noticeable degradation in model
accuracies.
Comment: IBM Research AI, 9 pages, 7 figures, AAAI18 accepted
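A simplified sketch in the spirit of AdaComp follows (the paper's exact selection
rule differs, and the bin size and keep ratio here are illustrative): gradients
are folded into a residue, the residue is split into fixed-size local bins, and in
each bin only entries close to the bin's largest magnitude are transmitted; unsent
values stay in the residue, so the compression rate adapts to local activity.

```python
import numpy as np

def adacomp_like_step(grad, residue, bin_size=256, keep_ratio=0.9):
    """Return the sparse update to send and the carried-over residue."""
    acc = (residue + grad).ravel()
    send = np.zeros_like(acc)
    for start in range(0, acc.size, bin_size):
        sl = slice(start, start + bin_size)
        local_max = np.abs(acc[sl]).max()
        selected = np.abs(acc[sl]) >= keep_ratio * local_max   # locally adaptive pick
        send[sl][selected] = acc[sl][selected]
        acc[sl][selected] = 0.0                                # sent values leave the residue
    return send.reshape(grad.shape), acc.reshape(grad.shape)

grad = np.random.randn(1024, 1024).astype(np.float32) * 1e-3
residue = np.zeros_like(grad)
send, residue = adacomp_like_step(grad, residue)
rate = send.size / max(np.count_nonzero(send), 1)
print(f"effective compression rate this step: {rate:.0f}x")
```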
Precision Highway for Ultra Low-Precision Quantization
Neural network quantization has an inherent problem called accumulated
quantization error, which is the key obstacle towards ultra-low precision,
e.g., 2- or 3-bit precision. To resolve this problem, we propose precision
highway, which forms an end-to-end high-precision information flow while
performing the ultra low-precision computation. First, we describe how the
precision highway reduces the accumulated quantization error in both
convolutional and recurrent neural networks. We also provide a quantitative
analysis of the benefit of the precision highway and evaluate its overhead on a
state-of-the-art hardware accelerator. In the experiments, our proposed method
outperforms the best existing quantization methods while offering 3-bit
weight/activation quantization with no accuracy loss and 2-bit quantization
with a 2.45% top-1 accuracy loss on ResNet-50. We also report that the
proposed method significantly outperforms the existing method in the 2-bit
quantization of an LSTM for language modeling.
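The sketch below shows one way to read the precision-highway idea for a residual
block (illustrative, not the authors' exact formulation): the skip/accumulation
path stays in high precision end to end, while the compute-heavy branch quantizes
its activations and weights to a few bits, so quantization error is not
accumulated along the main signal path.

```python
import numpy as np

def quantize(x, bits, x_max):
    """Uniform symmetric quantization to `bits` bits (returned de-quantized)."""
    levels = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / x_max * levels), -levels, levels) * x_max / levels

def residual_block(x_hp, w1, w2, bits=3):
    # Low-precision branch: quantize activations and weights before each matmul.
    q = lambda t: quantize(t, bits, np.abs(t).max())
    a = np.maximum(q(x_hp) @ q(w1), 0)     # low-precision layer + ReLU
    b = q(a) @ q(w2)
    # High-precision highway: the skip connection and the sum stay in float.
    return x_hp + b

x = np.random.randn(8, 64).astype(np.float32)
w1 = 0.1 * np.random.randn(64, 64).astype(np.float32)
w2 = 0.1 * np.random.randn(64, 64).astype(np.float32)
print(residual_block(x, w1, w2).shape)   # (8, 64), carried in full precision
```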
Exploring the Regularity of Sparse Structure in Convolutional Neural Networks
Sparsity helps reduce the computational complexity of deep neural networks by
skipping zeros. Taking advantage of sparsity is listed as a high priority in
next generation DNN accelerators such as TPU. The structure of sparsity, i.e.,
the granularity of pruning, affects the efficiency of hardware accelerator
design as well as the prediction accuracy. Coarse-grained pruning creates
regular sparsity patterns, making it more amenable for hardware acceleration
but more challenging to maintain the same accuracy. In this paper we
quantitatively measure the trade-off between sparsity regularity and prediction
accuracy, providing insights into how to maintain accuracy while having a
more structured sparsity pattern. Our experimental results show that
coarse-grained pruning can achieve a sparsity ratio similar to unstructured
pruning without loss of accuracy. Moreover, due to the index saving effect,
coarse-grained pruning is able to obtain a better compression ratio than
fine-grained sparsity at the same accuracy threshold. Based on the recent
sparse convolutional neural network accelerator (SCNN), our experiments further
demonstrate that coarse-grained sparsity requires about 2x fewer memory
references than fine-grained sparsity. Since a memory reference is more than two
orders of magnitude more expensive than an arithmetic operation, the regularity
of sparse structure leads to more efficient hardware design.
Comment: submitted to NIPS 201
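The small sketch below contrasts fine-grained and coarse-grained pruning at the
same sparsity target (the block size and layer shape are illustrative): fine-grained
pruning removes individual weights and needs roughly one index per non-zero, while
coarse-grained pruning removes whole vectors by their norm, giving a regular
pattern and roughly one index per surviving block.

```python
import numpy as np

def fine_grained_prune(w, sparsity):
    """Zero out the smallest-magnitude individual weights."""
    thr = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thr, 0.0, w)

def coarse_grained_prune(w, sparsity, block=8):
    """Zero out whole length-`block` vectors along the input dimension."""
    blocks = w.reshape(w.shape[0], -1, block)
    norms = np.linalg.norm(blocks, axis=-1)
    thr = np.quantile(norms, sparsity)
    return np.where(norms[..., None] < thr, 0.0, blocks).reshape(w.shape)

w = np.random.randn(256, 512).astype(np.float32)
for name, pruned, idx_per_nz in [("fine  ", fine_grained_prune(w, 0.75), 1.0),
                                 ("coarse", coarse_grained_prune(w, 0.75), 1.0 / 8)]:
    nnz = np.count_nonzero(pruned)
    print(f"{name}: sparsity={1 - nnz / w.size:.2f}, "
          f"index entries per non-zero={idx_per_nz:.3f}")
```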
DNN Feature Map Compression using Learned Representation over GF(2)
In this paper, we introduce a method to compress intermediate feature maps of
deep neural networks (DNNs) to decrease memory storage and bandwidth
requirements during inference. Unlike previous works, the proposed method is
based on converting fixed-point activations into vectors over GF(2), the
smallest finite field, followed by nonlinear dimensionality reduction (NDR) layers
embedded into a DNN. Such an end-to-end learned representation finds more
compact feature maps by exploiting quantization redundancies within the
fixed-point activations along the channel or spatial dimensions. We apply the
proposed network architectures derived from modified SqueezeNet and MobileNetV2
to the tasks of ImageNet classification and PASCAL VOC object detection.
Compared to prior approaches, the conducted experiments show a factor-of-2
decrease in memory requirements with minor degradation in accuracy while adding
only bitwise computations.
Comment: CEFRL201
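As an illustration of the conversion step described above (the learned NDR layers
are omitted, and the shapes are assumptions), the sketch below expands quantized
activations into binary bit planes over GF(2) and reconstructs them; in the paper,
learned layers then compress these binary channels before they are stored.

```python
import numpy as np

def to_gf2(acts_uint, bits=8):
    """(N, H, W, C) fixed-point activations -> (N, H, W, C*bits) binary tensor."""
    planes = [((acts_uint >> b) & 1)[..., None] for b in range(bits)]
    return np.concatenate(planes, axis=-1).reshape(*acts_uint.shape[:-1], -1)

def from_gf2(bit_tensor, channels, bits=8):
    """Inverse of to_gf2: recombine bit planes into fixed-point values."""
    planes = bit_tensor.reshape(*bit_tensor.shape[:-1], channels, bits)
    weights = (1 << np.arange(bits)).astype(np.int64)
    return (planes.astype(np.int64) * weights).sum(axis=-1)

acts = np.random.randint(0, 256, size=(1, 14, 14, 32), dtype=np.uint8)
binary = to_gf2(acts)
assert np.array_equal(from_gf2(binary, channels=32), acts)
print(binary.shape)   # (1, 14, 14, 256): binary channels ready for learned NDR layers
```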