Accelerator-Aware Pruning for Convolutional Neural Networks
Convolutional neural networks have shown tremendous performance capabilities
in computer vision tasks, but their excessive amounts of weight storage and
arithmetic operations prevent them from being adopted in embedded environments.
One of the solutions involves pruning, where certain unimportant weights are
forced to have a value of zero. Many pruning schemes have been proposed, but
these have mainly focused on the number of pruned weights. Previous pruning
schemes scarcely considered ASIC or FPGA accelerator architectures. When these
pruned networks are run on accelerators, the lack of consideration of the
architecture causes some inefficiency problems, including internal buffer
misalignments and load imbalances. This paper proposes a new pruning scheme
that reflects accelerator architectures. In the proposed scheme, pruning is
performed so that the same number of weights remain for each weight group
corresponding to activations fetched simultaneously. In this way, the pruning
scheme resolves the inefficiency problems, doubling the accelerator
performance. Even with this constraint, the proposed pruning scheme reached a
pruning ratio similar to that of previous unconstrained pruning schemes, not
only on AlexNet and VGG16 but also on state-of-the-art very deep networks such
as ResNet. Furthermore, the proposed scheme demonstrated a comparable pruning
ratio on compact networks such as MobileNet and on slimmed networks that were
already pruned in a channel-wise manner. In addition to improving the
efficiency of previous sparse accelerators, it is also shown that the proposed
pruning scheme can be used to reduce the logic complexity of sparse
accelerators. The pruned models are publicly available at
https://github.com/HyeongjuKang/accelerator-aware-pruning. Comment: 11 pages, 9 figures
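As a rough illustration of the group-balanced constraint described in this abstract, the sketch below (not the authors' code) prunes a weight vector so that every group of weights fetched together keeps the same number of nonzeros; the group size and keep count are arbitrary example values.

```python
import numpy as np

def group_balanced_prune(weights, group_size=8, keep_per_group=2):
    """Prune so every group of `group_size` consecutive weights keeps exactly
    `keep_per_group` nonzeros (the largest magnitudes), zeroing the rest.

    `weights` is a 1-D array for simplicity; a real implementation would group
    along the dimension whose activations the accelerator fetches together.
    """
    w = weights.copy()
    assert w.size % group_size == 0, "pad the weight vector to a multiple of group_size"
    groups = w.reshape(-1, group_size)
    # indices of the smallest-magnitude entries in each group
    order = np.argsort(np.abs(groups), axis=1)
    drop = order[:, : group_size - keep_per_group]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

# toy usage: every group of 8 weights ends up with exactly 2 nonzeros,
# so per-group work (and buffer occupancy) is identical across groups.
w = np.random.randn(64).astype(np.float32)
pruned = group_balanced_prune(w, group_size=8, keep_per_group=2)
print((pruned.reshape(-1, 8) != 0).sum(axis=1))  # -> [2 2 2 2 2 2 2 2]
```

Keeping the nonzero count identical per fetch group is what avoids the buffer misalignment and load imbalance the abstract mentions.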
Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
This paper describes a novel approach of packing sparse convolutional neural
networks for their efficient systolic array implementations. By combining
subsets of columns in the original filter matrix associated with a
convolutional layer, we increase the utilization efficiency of the systolic
array substantially (e.g., ~4x) due to the increased density of nonzeros in the
resulting packed filter matrix. When columns are combined, all filter weights
in each row are pruned except the one with the largest magnitude. We retrain
the remaining weights to preserve high accuracy. We demonstrate that, to
mitigate data privacy concerns, this retraining can be accomplished with only a
fraction of the original dataset (e.g., 10% for CIFAR-10). We study the effectiveness of this
joint optimization for both high utilization and classification accuracy with
ASIC and FPGA designs based on efficient bit-serial implementations of
multiplier-accumulators. We present analysis and empirical evidence on the
superior performance of our column combining approach against prior art under
metrics such as energy efficiency (3x) and inference latency (12x). Comment: To appear in ASPLOS 2019
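The following toy sketch illustrates the column-combining idea in its simplest form, assuming fixed groups of consecutive columns (the paper optimizes the grouping jointly with accuracy): within each group, only the largest-magnitude weight per row survives, yielding a denser packed matrix plus the routing indices a systolic array would need.

```python
import numpy as np

def combine_columns(filter_matrix, cols_per_group=4):
    """Toy column combining: partition columns into fixed groups (here simply
    consecutive columns), and in every row of a group keep only the
    largest-magnitude weight, pruning the others. Returns the dense packed
    matrix plus the index of the surviving source column for each entry."""
    rows, cols = filter_matrix.shape
    assert cols % cols_per_group == 0
    n_groups = cols // cols_per_group
    packed = np.zeros((rows, n_groups), dtype=filter_matrix.dtype)
    src_col = np.zeros((rows, n_groups), dtype=np.int64)
    for g in range(n_groups):
        block = filter_matrix[:, g * cols_per_group:(g + 1) * cols_per_group]
        winner = np.argmax(np.abs(block), axis=1)            # survivor per row
        packed[:, g] = block[np.arange(rows), winner]
        src_col[:, g] = g * cols_per_group + winner           # needed to route activations
    return packed, src_col

W = np.random.randn(8, 16).astype(np.float32)
packed, src = combine_columns(W, cols_per_group=4)
print(packed.shape)  # (8, 4): a 4x denser matrix mapped onto the systolic array
```

In the actual scheme the pruned weights are retrained afterwards to recover accuracy.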
Neural Network-Hardware Co-design for Scalable RRAM-based BNN Accelerators
Recently, RRAM-based Binary Neural Network (BNN) hardware has been gaining
interest, as it requires only 1-bit sense amplifiers and eliminates the need
for high-resolution ADCs and DACs. However, RRAM-based BNN hardware still
requires a high-resolution ADC for partial-sum calculation when a large-scale
neural network is implemented across multiple memory arrays. We propose a
neural network-hardware co-design approach that splits the inputs so that each
split network fits on a single RRAM array and the reconstructed BNN computes a
1-bit output neuron in each array. As a result, ADCs can be completely
eliminated from the design, even for large-scale neural networks. Simulation
results show that the proposed network
reconstruction and retraining recovers the inference accuracy of the original
BNN. The accuracy loss of the proposed scheme on the CIFAR-10 test case was less
than 1.1% compared to the original network. The code for training and running
proposed BNN models is available at:
https://github.com/YulhwaKim/RRAMScalable_BNN
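A minimal sketch of the splitting idea, under the assumption of a 128-row crossbar and a small binary merge stage (the exact network reconstruction and retraining procedure is the paper's contribution and is not reproduced here): each sub-layer's fan-in fits one array, so every array only needs to emit a sign bit.

```python
import numpy as np

ARRAY_ROWS = 128  # assumed RRAM crossbar height (inputs per array)

def binarize(x):
    return np.where(x >= 0, 1, -1)

def reconstructed_bnn_layer(x, weight_splits, merge_weights):
    """The original wide layer is reconstructed into sub-layers whose fan-in
    fits a single RRAM array (ARRAY_ROWS inputs), so every array only needs a
    1-bit sense amp (the sign) instead of a high-resolution partial-sum ADC.
    A small binary merge stage combines the per-array bits; in the paper its
    weights come from retraining, here they are random placeholders."""
    per_array_bits = []
    for w in weight_splits:                      # each w: (ARRAY_ROWS, out_neurons)
        chunk, x = x[: w.shape[0]], x[w.shape[0]:]
        per_array_bits.append(binarize(chunk @ w))   # 1-bit output per array
    h = np.concatenate(per_array_bits)
    return binarize(h @ merge_weights)               # merge stage, still binary

# toy usage with two 128-input arrays feeding 16 binary neurons each
x = binarize(np.random.randn(256))
splits = [binarize(np.random.randn(ARRAY_ROWS, 16)) for _ in range(2)]
merge = binarize(np.random.randn(32, 16))
print(reconstructed_bnn_layer(x, splits, merge).shape)  # (16,)
```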
ChewBaccaNN: A Flexible 223 TOPS/W BNN Accelerator
Binary Neural Networks enable smart IoT devices, as they significantly reduce
the required memory footprint and computational complexity while retaining a
high network performance and flexibility. This paper presents ChewBaccaNN, a
0.7 mm²-sized binary convolutional neural network (CNN) accelerator designed
in GlobalFoundries 22 nm technology. By exploiting efficient data re-use, data
buffering, latch-based memories, and voltage scaling, a throughput of 241 GOPS
is achieved while consuming just 1.1 mW at 0.4V/154MHz during inference of
binary CNNs with up to 7x7 kernels, leading to a peak core energy efficiency of
223 TOPS/W. ChewBaccaNN's flexibility allows it to run a much wider range of
binary CNNs than other accelerators, drastically improving the accuracy-energy
trade-off beyond what can be captured by the TOPS/W metric. In fact, it can
perform CIFAR-10 inference at 86.8% accuracy with merely 1.3 , thus
exceeding the accuracy while at the same time lowering the energy cost by 2.8x
compared to even the most efficient and much larger analog processing-in-memory
devices, while keeping the flexibility of running larger CNNs for higher
accuracy when needed. It also runs a binary ResNet-18 trained on the 1000-class
ILSVRC dataset and improves the energy efficiency by 4.4x over accelerators of
similar flexibility. Furthermore, it can perform inference on a binarized
ResNet-18 trained with 8-bases Group-Net to achieve a 67.5% Top-1 accuracy with
only 3.0 mJ/frame -- at an accuracy drop of merely 1.8% from the full-precision
ResNet-18. Comment: Accepted at IEEE ISCAS 2021, Daegu, South Korea, 23-26 May 2021
Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability
Tartan (TRT), a hardware accelerator for inference with Deep Neural Networks
(DNNs), is presented and evaluated on Convolutional Neural Networks. TRT
exploits the variable per layer precision requirements of DNNs to deliver
execution time that is proportional to the precision p in bits used per layer
for convolutional and fully-connected layers. Prior art has demonstrated an
accelerator with the same execution performance only for convolutional layers.
Experiments on image classification CNNs show that on average across all
networks studied, TRT outperforms a state-of-the-art bit-parallel accelerator
by 1.90x without any loss in accuracy while it is 1.17x more energy efficient.
TRT requires no network retraining while it enables trading off accuracy for
additional improvements in execution performance and energy efficiency. For
example, if a 1% relative loss in accuracy is acceptable, TRT is on average
2.04x faster and 1.25x more energy efficient than a conventional bit-parallel
accelerator. A Tartan configuration that processes 2 bits at a time is also
presented: it requires less area than the 1-bit configuration and improves
energy efficiency to 1.24x over the bit-parallel baseline, while being 73%
faster for convolutional layers and 60% faster for fully-connected layers.
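A back-of-the-envelope model of the precision-proportional execution claim, with made-up layer profiles (the real hardware and its cycle counts are far more involved): if a bit-serial engine spends cycles proportional to each layer's precision p while a bit-parallel baseline always pays 16 bits, the speedup is the MAC-weighted average of 16/p.

```python
# Illustrative only: layer names, MAC counts and precisions are invented.
layers = [
    # (name, MACs, precision bits needed for this layer)
    ("conv1", 100_000_000, 9),
    ("conv2", 220_000_000, 8),
    ("fc1",    40_000_000, 10),
]
BASELINE_BITS = 16  # bit-parallel accelerator always pays full precision

base_cycles = sum(macs for _, macs, _ in layers)                     # cycles ~ MACs
trt_cycles = sum(macs * p / BASELINE_BITS for _, macs, p in layers)  # cycles ~ MACs * p/16
print(f"modelled speedup: {base_cycles / trt_cycles:.2f}x")
```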
Precision Highway for Ultra Low-Precision Quantization
Neural network quantization has an inherent problem called accumulated
quantization error, which is the key obstacle towards ultra-low precision,
e.g., 2- or 3-bit precision. To resolve this problem, we propose precision
highway, which forms an end-to-end high-precision information flow while
performing the ultra low-precision computation. First, we describe how the
precision highway reduces the accumulated quantization error in both
convolutional and recurrent neural networks. We also provide the quantitative
analysis of the benefit of precision highway and evaluate the overhead on the
state-of-the-art hardware accelerator. In the experiments, our proposed method
outperforms the best existing quantization methods while offering 3-bit
weight/activation quantization with no accuracy loss and 2-bit quantization
with a 2.45% top-1 accuracy loss on ResNet-50. We also report that the
proposed method significantly outperforms the existing method in the 2-bit
quantization of an LSTM for language modeling.
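A minimal sketch of the precision-highway idea on a residual block, assuming plain matmuls and a naive symmetric fake-quantizer in place of the paper's quantization scheme: the computation-side signal is quantized to 3 bits, while the identity path is kept in full precision so quantization error does not accumulate from block to block.

```python
import numpy as np

def fake_quant(x, bits=3):
    """Naive symmetric per-tensor fake quantization -- just a stand-in."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def residual_block_with_highway(x, w1, w2, bits=3):
    """Quantize the conv-side path to ultra-low precision but keep the skip
    path `x` in full precision: the sum x + h carries a high-precision signal
    forward (the "highway"). Real blocks use convolutions, BN and ReLU;
    plain matmuls are used here to keep the sketch short."""
    h = fake_quant(x, bits) @ fake_quant(w1, bits)
    h = np.maximum(h, 0)
    h = fake_quant(h, bits) @ fake_quant(w2, bits)
    return x + h  # high-precision identity path

x = np.random.randn(4, 64).astype(np.float32)
w1 = np.random.randn(64, 64).astype(np.float32)
w2 = np.random.randn(64, 64).astype(np.float32)
print(residual_block_with_highway(x, w1, w2, bits=3).shape)  # (4, 64)
```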
Recurrent Residual Module for Fast Inference in Videos
Deep convolutional neural networks (CNNs) have made impressive progress in
many video recognition tasks such as video pose estimation and video object
detection. However, CNN inference on video is computationally expensive due to
processing dense frames individually. In this work, we propose a framework
called Recurrent Residual Module (RRM) to accelerate the CNN inference for
video recognition tasks. The framework exploits the similarity of the
intermediate feature maps of two consecutive frames to largely reduce redundant
computation. One unique property of the proposed method compared to previous
work is that the feature maps of each frame are computed exactly. The
experiments show that, while maintaining similar recognition performance, our
RRM yields an average 2x acceleration on commonly used CNNs such as AlexNet,
ResNet, and the deep compression model (thus 8-12x faster than the original
dense models using the efficient inference engine), and an impressive 9x
acceleration on some binary networks such as XNOR-Nets
(thus 500x faster than the original model). We further verify the effectiveness
of the RRM on speeding up CNNs for video pose estimation and video object
detection. Comment: To appear in CVPR 2018
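The core trick can be shown on a linear operator: because convolutions and fully-connected layers are linear, the current output equals the previous output plus the operator applied to the (typically sparse) frame-to-frame difference, so per-frame feature maps stay exact. A minimal numpy sketch, not the authors' implementation:

```python
import numpy as np

def rrm_linear(prev_in, prev_out, cur_in, weight, tol=1e-3):
    """For a linear op, W @ x_t = W @ x_{t-1} + W @ (x_t - x_{t-1}), so the
    current output can be updated from the previous one using only the columns
    of W where the input actually changed. Exact up to `tol`-pruning of tiny
    differences, and cheap when consecutive frames are similar."""
    delta = cur_in - prev_in
    changed = np.nonzero(np.abs(delta) > tol)[0]      # sparse set of changed inputs
    update = weight[:, changed] @ delta[changed]      # work ~ number of changes
    return prev_out + update, changed.size

W = np.random.randn(128, 512).astype(np.float32)
x0 = np.random.randn(512).astype(np.float32)
x1 = x0.copy(); x1[:32] += 0.5                        # only 32 of 512 inputs change
y0 = W @ x0
y1, n_changed = rrm_linear(x0, y0, x1, W)
print(n_changed, np.allclose(y1, W @ x1, atol=1e-4))  # 32 True
```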
Efficient Hardware Realization of Convolutional Neural Networks using Intra-Kernel Regular Pruning
The recent trend toward increasingly deep convolutional neural networks
(CNNs) leads to a higher demand for computational power and memory storage.
Consequently, the deployment of CNNs in hardware has become more challenging.
In this paper, we propose an Intra-Kernel Regular (IKR) pruning scheme to
reduce the size and computational complexity of the CNNs by removing redundant
weights at a fine-grained level. Unlike other pruning methods such as
Fine-Grained pruning, IKR pruning maintains regular kernel structures that are
exploitable in a hardware accelerator. Experimental results demonstrate up to
10x parameter reduction and 7x computational reduction at a cost of less than
1% degradation in accuracy versus the unpruned case. Comment: 6 pages, 8 figures, ISMVL 201
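A toy interpretation of intra-kernel regular pruning, using a small hypothetical codebook of 3x3 patterns (the paper derives its patterns and selection criterion differently): each kernel is assigned the pattern that preserves the most weight magnitude, so the resulting sparsity stays regular and cheap to index in a hardware accelerator.

```python
import numpy as np

# Hypothetical codebook of regular 3x3 sparsity patterns (1 = keep).
PATTERNS = np.array([
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
    [[0, 1, 0], [1, 1, 1], [0, 1, 0]],
    [[1, 1, 0], [1, 0, 0], [1, 1, 0]],
    [[0, 1, 1], [0, 0, 1], [0, 1, 1]],
], dtype=np.float32)

def ikr_prune(kernels):
    """For every 3x3 kernel, pick the codebook pattern that preserves the most
    magnitude, then zero everything outside that pattern. Because only a few
    regular patterns occur, weight indexing stays simple in hardware."""
    out = np.empty_like(kernels)
    choice = np.empty(kernels.shape[0], dtype=np.int64)
    for i, k in enumerate(kernels):
        kept = np.abs(k)[None] * PATTERNS            # magnitude kept per pattern
        best = int(np.argmax(kept.sum(axis=(1, 2))))
        out[i] = k * PATTERNS[best]
        choice[i] = best
    return out, choice

kernels = np.random.randn(16, 3, 3).astype(np.float32)
pruned, pattern_ids = ikr_prune(kernels)
print(pattern_ids[:8], (pruned != 0).sum() / pruned.size)  # density ~5/9
```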
Hardware-oriented Approximation of Convolutional Neural Networks
High computational complexity hinders the widespread usage of Convolutional
Neural Networks (CNNs), especially in mobile devices. Hardware accelerators are
arguably the most promising approach for reducing both execution time and power
consumption. One of the most important steps in accelerator development is
hardware-oriented model approximation. In this paper we present Ristretto, a
model approximation framework that analyzes a given CNN with respect to
numerical resolution used in representing weights and outputs of convolutional
and fully connected layers. Ristretto can condense models by using fixed point
arithmetic and representation instead of floating point. Moreover, Ristretto
fine-tunes the resulting fixed point network. Given a maximum error tolerance
of 1%, Ristretto can successfully condense CaffeNet and SqueezeNet to 8-bit.
The code for Ristretto is available. Comment: 8 pages, 4 figures, accepted as a
workshop contribution at ICLR 2016; updated comparison to other work.
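A simplified stand-in for the dynamic fixed-point conversion that a framework like Ristretto performs per layer (the subsequent fine-tuning is omitted): pick enough integer bits to cover the tensor's range, spend the remaining bits on the fraction, then round and saturate.

```python
import numpy as np

def to_fixed_point(x, total_bits=8):
    """Toy dynamic fixed-point conversion: choose an integer length large
    enough for the tensor's range (including sign), use the remaining bits
    for the fraction, then round and saturate to the representable range."""
    int_bits = max(0, int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12))) + 1)  # incl. sign
    frac_bits = total_bits - int_bits
    step = 2.0 ** (-frac_bits)
    qmin = -(2 ** (total_bits - 1)) * step
    qmax = (2 ** (total_bits - 1) - 1) * step
    return np.clip(np.round(x / step) * step, qmin, qmax)

w = np.random.randn(1000).astype(np.float32) * 0.3
w8 = to_fixed_point(w, total_bits=8)
print(float(np.max(np.abs(w - w8))))  # worst-case quantization error
```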
Compressing Low Precision Deep Neural Networks Using Sparsity-Induced Regularization in Ternary Networks
A low precision deep neural network training technique for producing sparse,
ternary neural networks is presented. The technique incorporates hardware
implementation costs during training to achieve significant model compression
for inference. Training involves three stages: network training using L2
regularization and a quantization threshold regularizer, quantization pruning,
and finally retraining. Resulting networks achieve improved accuracy, reduced
memory footprint and reduced computational complexity compared with
conventional methods on the MNIST and CIFAR-10 datasets. Our networks are up to
98% sparse and 5 and 11 times smaller than equivalent binary and ternary models, respectively,
translating to significant resource and speed benefits for hardware
implementations. Comment: To appear as a conference paper at the 24th
International Conference on Neural Information Processing (ICONIP 2017).
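A sketch of the quantization-pruning stage only, with an ad hoc threshold (the paper shapes the weights toward this form during training with an extra regularizer and then retrains): small-magnitude weights go to zero and the rest collapse to a single shared +/- scale, giving a sparse ternary tensor.

```python
import numpy as np

def ternarize(weights, threshold_ratio=0.5):
    """Quantization pruning: weights below a magnitude threshold become 0,
    the rest become +/- one shared scale (mean of surviving magnitudes).
    The threshold here is an ad hoc fraction of the maximum magnitude."""
    t = threshold_ratio * np.max(np.abs(weights))
    mask = np.abs(weights) > t
    scale = np.abs(weights[mask]).mean() if mask.any() else 0.0
    return np.sign(weights) * mask * scale, mask.mean()

w = np.random.randn(4096).astype(np.float32)
tw, density = ternarize(w, threshold_ratio=0.5)
print(np.unique(tw).size, f"{1 - density:.0%} sparse")  # 3 distinct values
```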