Kernel Quantization for Efficient Network Compression
This paper presents a novel network compression framework, Kernel Quantization (KQ), which aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) into a low-precision version without significant performance loss. Unlike existing methods that struggle to shorten the weight bit-length, KQ improves the compression ratio by treating the convolution kernel as the quantization unit. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Instead of representing each weight parameter with a
low-bit index, we learn a kernel codebook and replace all kernels in the
convolution layer with corresponding low-bit indexes. Thus, KQ represents the weight tensor of a convolution layer with low-bit indexes and a kernel codebook of limited size, which enables a significant compression ratio. We then apply 6-bit parameter quantization to the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet classification task show that KQ needs, on average, only 1.05 and 1.62 bits to represent each convolution-layer parameter in VGG and ResNet-18, respectively, and achieves a state-of-the-art compression ratio with little accuracy loss.
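The codebook construction is easy to prototype. Below is a minimal sketch that clusters all 2D kernels of a convolution layer with k-means and replaces each kernel by a low-bit index; the paper learns the codebook rather than using plain k-means, so the clustering method, tensor shapes, and the 8-bit index width here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kernel_quantize(weight, codebook_size=256):
    """Cluster the 2D kernels of a conv weight (out_c, in_c, kh, kw) into a
    codebook and replace each kernel with an index (k-means stand-in for
    the paper's learned codebook)."""
    out_c, in_c, kh, kw = weight.shape
    kernels = weight.reshape(-1, kh * kw)              # one row per kernel
    km = KMeans(n_clusters=codebook_size, n_init=4).fit(kernels)
    codebook = km.cluster_centers_.reshape(codebook_size, kh, kw)
    indexes = km.labels_.reshape(out_c, in_c)          # log2(K) bits each
    return codebook, indexes

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
cb, idx = kernel_quantize(w)
w_approx = cb[idx]                                     # reconstruct (64, 32, 3, 3)
# 8-bit indexes plus a 6-bit-quantized codebook, amortized per parameter:
bits_per_param = (idx.size * 8 + cb.size * 6) / w.size
print(f"approx. {bits_per_param:.2f} bits per parameter")
```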
Recent Advances in Efficient Computation of Deep Convolutional Neural Networks
Deep neural networks have evolved remarkably over the past few years and they
are currently the fundamental tools of many intelligent systems. At the same
time, the computational complexity and resource consumption of these networks
also continue to increase, which poses a significant challenge to the deployment of such networks, especially in real-time applications or on resource-limited devices. Network acceleration has thus become a hot topic within the deep learning community. On the hardware side, a number of FPGA/ASIC-based accelerators have been proposed
in recent years. In this paper, we provide a comprehensive survey of recent
advances in network acceleration, compression and accelerator design from both
algorithm and hardware points of view. Specifically, we provide a thorough
analysis of each of the following topics: network pruning, low-rank
approximation, network quantization, teacher-student networks, compact network
design, and hardware accelerators. Finally, we introduce and discuss a few possible future directions.
Comment: 14 pages, 3 figures
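As a concrete taste of one surveyed technique, the snippet below sketches low-rank approximation: a weight matrix is factored by truncated SVD into two thin matrices, trading a small approximation error for far fewer parameters. The matrix size and rank are illustrative choices, not values from the survey.

```python
import numpy as np

def low_rank_factor(W, rank):
    """Factor W (m x n) into U_r (m x r) @ V_r (r x n) via truncated SVD,
    cutting the parameter count from m*n to r*(m+n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

W = np.random.randn(1024, 1024).astype(np.float32)
U_r, V_r = low_rank_factor(W, rank=64)
rel_err = np.linalg.norm(W - U_r @ V_r) / np.linalg.norm(W)
print(f"params: {W.size} -> {U_r.size + V_r.size}, relative error {rel_err:.3f}")
```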
DNN Feature Map Compression using Learned Representation over GF(2)
In this paper, we introduce a method to compress intermediate feature maps of
deep neural networks (DNNs) to decrease memory storage and bandwidth
requirements during inference. Unlike previous works, the proposed method is
based on converting fixed-point activations into vectors over the smallest
GF(2) finite field followed by nonlinear dimensionality reduction (NDR) layers
embedded into a DNN. Such an end-to-end learned representation finds more
compact feature maps by exploiting quantization redundancies within the
fixed-point activations along the channel or spatial dimensions. We apply the
proposed network architectures derived from modified SqueezeNet and MobileNetV2
to the tasks of ImageNet classification and PASCAL VOC object detection.
Compared to prior approaches, our experiments show a factor-of-2 decrease in memory requirements with minor accuracy degradation, while adding only bitwise computations.
Comment: CEFRL201
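A minimal sketch of the binarization step: fixed-point activations are expanded into their bit planes over GF(2) along the channel axis, after which a small learned layer can act as the nonlinear dimensionality reduction (NDR). The bit width, layer sizes, and the 1x1-conv NDR choice below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def to_gf2_bits(x, bits=4):
    """Expand activations in [0, 1) into `bits` binary planes over GF(2),
    stacked along the channel dimension."""
    q = torch.clamp((x * (1 << bits)).long(), 0, (1 << bits) - 1)
    planes = [((q >> b) & 1).float() for b in range(bits)]
    return torch.cat(planes, dim=1)                 # (N, C*bits, H, W)

x = torch.rand(1, 8, 16, 16)                        # fixed-point-like activations
b = to_gf2_bits(x)                                  # binary tensor, 32 channels
ndr = nn.Sequential(nn.Conv2d(32, 8, kernel_size=1), nn.ReLU())  # learned NDR
z = ndr(b)                                          # compact feature map
print(b.shape, z.shape)
```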
Design Automation for Efficient Deep Learning Computing
Efficient deep learning computing requires algorithm and hardware co-design
to enable specialization: we usually need to change the algorithm to reduce
memory footprint and improve energy efficiency. However, the extra degree of
freedom from the algorithm makes the design space much larger: it's not only
about designing the hardware but also about how to tweak the algorithm to best
fit the hardware. Human engineers can hardly explore this design space exhaustively with heuristics; manual design is labor-intensive and sub-optimal. We propose design automation
techniques for efficient neural networks. We investigate automatically
designing specialized fast models, auto channel pruning, and auto
mixed-precision quantization. We demonstrate that such learning-based, automated design achieves performance and efficiency superior to rule-based human design. Moreover, we shorten the design cycle by 200x compared to previous work, so that we can afford to design specialized neural network models for different hardware platforms.
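To make the automation idea concrete, here is a toy search loop that picks per-layer channel-keep ratios under a parameter budget; the paper uses learning-based search with real accuracy feedback, so the random search, layer sizes, and scoring function below are purely hypothetical stand-ins.

```python
import random

layer_params = [1.2e6, 2.4e6, 4.8e6]   # hypothetical per-layer parameter counts
budget = 5.0e6                          # hypothetical total parameter budget

def score(ratios):
    """Stand-in for measured accuracy after pruning and fine-tuning;
    a real system would evaluate the pruned model on validation data."""
    return -sum((1.0 - r) ** 2 for r in ratios)     # prefers milder pruning

best, best_score = None, float("-inf")
for _ in range(10_000):
    ratios = [random.uniform(0.2, 1.0) for _ in layer_params]
    if sum(r * p for r, p in zip(ratios, layer_params)) > budget:
        continue                                    # violates the budget
    s = score(ratios)
    if s > best_score:
        best, best_score = ratios, s
print("kept-channel ratios per layer:", [round(r, 2) for r in best])
```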
CNN inference acceleration using dictionary of centroids
It is well known that multiplication operations in the convolutional layers of common CNNs consume a large share of inference time. In this article we present a flexible method to decrease both the computational complexity of convolutional layers at inference time and the amount of space needed to store them. The
method is based on centroid filter quantization and outperforms approaches
based on tensor decomposition by a large margin. We performed a comparative analysis of the proposed method against a series of CP tensor decompositions on the ImageNet benchmark and found that our method provides an almost 2.9x better computational gain. Despite its simplicity, the method cannot be applied directly at the inference stage in modern frameworks, but it could be useful where the calculation flow can be changed, e.g., for CNN-chip designers.
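The compute saving comes from sharing: once filters are replaced by a small dictionary of centroids, each centroid's response to the input is computed once and reused by every filter mapped to it. The sketch below uses k-means and im2col-style shapes as illustrative assumptions, not the authors' exact scheme.

```python
import numpy as np
from sklearn.cluster import KMeans

filters = np.random.randn(256, 27).astype(np.float32)   # 256 filters, 3x3x3 each
patches = np.random.randn(27, 5000).astype(np.float32)  # im2col input patches

km = KMeans(n_clusters=32, n_init=4).fit(filters)        # dictionary of centroids
centroids, idx = km.cluster_centers_.astype(np.float32), km.labels_

centroid_out = centroids @ patches      # only 32 rows of multiplications
approx_out = centroid_out[idx]          # 256 filter outputs by table lookup
exact_out = filters @ patches
rel_err = np.linalg.norm(approx_out - exact_out) / np.linalg.norm(exact_out)
print(f"32 of 256 filter GEMM rows computed, relative error {rel_err:.3f}")
```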
Low-Dimensional Bottleneck Features for On-Device Continuous Speech Recognition
Low power digital signal processors (DSPs) typically have a very limited
amount of memory in which to cache data. In this paper we develop efficient
bottleneck feature (BNF) extractors that can be run on a DSP, and retrain a
baseline large-vocabulary continuous speech recognition (LVCSR) system to use
these BNFs with only a minimal loss of accuracy. The small BNFs allow the DSP
chip to cache more audio features while the main application processor is
suspended, thereby reducing overall battery usage. Our system reduces the footprint of standard, fixed-point DSP spectral features by a factor of 10 without any loss in word error rate (WER), and by a factor of 64 with only a 5.8% relative increase in WER.
Comment: Submitted to ICASSP 201
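Architecturally, a BNF extractor is just a small network that projects wide spectral features down to a narrow bottleneck that fits the DSP cache. The sketch below is illustrative: the layer sizes are assumptions chosen so the 512-to-8 projection mirrors the abstract's factor-of-64 reduction, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class BNFExtractor(nn.Module):
    """Project wide spectral features to a low-dimensional bottleneck."""
    def __init__(self, in_dim=512, bottleneck=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck),   # the small cached BNF output
        )

    def forward(self, x):
        return self.net(x)

frames = torch.randn(100, 512)            # 100 frames of spectral features
bnf = BNFExtractor()(frames)              # (100, 8): 64x smaller footprint
print(bnf.shape)
```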
Optimize Deep Convolutional Neural Network with Ternarized Weights and High Accuracy
Deep convolutional neural networks have achieved great success in many artificial intelligence applications. However, their enormous model size and massive computation cost have become the main obstacle to deploying such powerful algorithms in low-power, resource-limited embedded systems. To counter this problem, in this work we propose statistical weight scaling and residual expansion methods that reduce the bit-width of all network weight parameters to ternary values (i.e., -1, 0, +1), with the objective of greatly reducing model size and computation cost while limiting the accuracy degradation caused by compression. With about a 16x model compression
rate, our ternarized ResNet-32/44/56 outperform their full-precision counterparts by 0.12%, 0.24%, and 0.18% on the CIFAR-10 dataset. We also test our ternarization method with AlexNet and ResNet-18 on the ImageNet dataset; both achieve the best top-1 accuracy among recent similar works at the same 16x compression rate. When our residual expansion method is further incorporated, our ternarized ResNet-18 even improves top-5 accuracy by 0.61% and degrades top-1 accuracy by only 0.42% relative to the full-precision counterpart on ImageNet, at an 8x model compression rate. It outperforms the recent ABC-Net by 1.03% in top-1 accuracy and 1.78% in top-5 accuracy, with an around 1.25x higher compression rate and a more than 6x computation reduction due to weight sparsity.
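Threshold-based ternarization with a statistically derived scale is straightforward to sketch. The version below follows the common scheme of zeroing weights under a magnitude threshold and fitting one scaling factor to the survivors; the paper's statistical weight scaling may differ in its exact statistics.

```python
import torch

def ternarize(w, t=0.7):
    """Map weights to alpha * {-1, 0, +1}: zero out weights below a
    threshold, then fit the scale alpha to the surviving magnitudes
    (a common scheme, illustrative of rather than identical to the paper's)."""
    delta = t * w.abs().mean()                    # per-tensor threshold
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum()   # L2-optimal scale on survivors
    return alpha * torch.sign(w) * mask

w = torch.randn(64, 64, 3, 3)
w_t = ternarize(w)
print("distinct levels:", torch.unique(w_t).tolist())   # [-alpha, 0, alpha]
```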
Exploring the Regularity of Sparse Structure in Convolutional Neural Networks
Sparsity helps reduce the computational complexity of deep neural networks by
skipping zeros. Taking advantage of sparsity is listed as a high priority for next-generation DNN accelerators such as the TPU. The structure of sparsity, i.e.,
the granularity of pruning, affects the efficiency of hardware accelerator
design as well as the prediction accuracy. Coarse-grained pruning creates
regular sparsity patterns, making it more amenable for hardware acceleration
but more challenging to maintain the same accuracy. In this paper we quantitatively measure the trade-off between sparsity regularity and prediction accuracy, providing insights into how to maintain accuracy with a more structured sparsity pattern. Our experimental results show that
coarse-grained pruning can achieve a sparsity ratio similar to unstructured
pruning without loss of accuracy. Moreover, due to the index saving effect,
coarse-grained pruning is able to obtain a better compression ratio than
fine-grained sparsity at the same accuracy threshold. Based on the recent
sparse convolutional neural network accelerator (SCNN), our experiments further
demonstrate that coarse-grained sparsity saves about 2x in memory references compared to fine-grained sparsity. Since memory references are more than two orders of magnitude more expensive than arithmetic operations, the regularity of the sparse structure leads to more efficient hardware design.
Comment: submitted to NIPS 201
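The index-saving effect is simple arithmetic: fine-grained pruning stores one index per surviving weight, while coarse-grained pruning stores one index per surviving block. The sketch below uses illustrative numbers (1M parameters, 80% sparsity, 3x3-kernel blocks, 16-bit indexes), not the paper's measurements.

```python
# Back-of-the-envelope index storage cost, fine- vs coarse-grained pruning.
params, sparsity, block = 1_000_000, 0.8, 9   # block = one 3x3 kernel
idx_bits = 16                                  # assumed bits per stored index

nonzero = int(params * (1 - sparsity))
fine_bits = nonzero * idx_bits                 # one index per surviving weight
coarse_bits = (nonzero // block) * idx_bits    # one index per surviving block

print(f"fine-grained:   {fine_bits / 8 / 1024:.0f} KiB of indexes")
print(f"coarse-grained: {coarse_bits / 8 / 1024:.0f} KiB of indexes")
```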
HadaNets: Flexible Quantization Strategies for Neural Networks
On-board processing elements on UAVs are currently inadequate for training
and inference of Deep Neural Networks. This is largely due to the energy
consumption of memory accesses in such networks. HadaNets introduce a flexible
train-from-scratch tensor quantization scheme by pairing a full precision
tensor to a binary tensor in the form of a Hadamard product. Unlike wider reduced-precision neural network models, we preserve the train-time parameter count, thus outperforming XNOR-Nets without a train-time memory penalty. Such
training routines could see great utility in semi-supervised online learning
tasks. Our method also offers advantages in model compression, as we reduce the
model size of ResNet-18 by 7.43x with respect to a full-precision model
without utilizing any other compression techniques. We also demonstrate a
'Hadamard Binary Matrix Multiply' kernel, which delivers a 10-fold increase in performance over full-precision matrix multiplication with a similarly optimized kernel.
Comment: Accepted in CVPR 2019, UAVision 2019
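The pairing itself is compact to express: the effective weight is the elementwise (Hadamard) product of a full-precision scaling tensor and a binary tensor, and the granularity of the scales is the flexibility knob. The sketch below derives the scales from group-wise magnitude means; it is an illustrative reading of the scheme, with training details (e.g., straight-through gradients) omitted.

```python
import torch

def hadamard_quantize(w, group=8):
    """Approximate w as A * sign(w): a Hadamard product of a full-precision
    scale tensor A (one scale per `group` weights along the last axis)
    and a binary tensor."""
    sign = torch.sign(w)
    scales = w.abs().reshape(w.shape[0], -1, group).mean(dim=2, keepdim=True)
    a = scales.expand(-1, -1, group).reshape_as(w)   # broadcast scales back
    return a * sign

w = torch.randn(64, 64)
w_q = hadamard_quantize(w)
rel_err = (torch.norm(w - w_q) / torch.norm(w)).item()
print(f"relative error at group size 8: {rel_err:.3f}")
```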
Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM
Although deep learning models are highly effective for various learning
tasks, their high computational costs prohibit their deployment in scenarios where either memory or computational resources are limited. In this paper, we focus on compressing and accelerating deep models whose network weights are represented by very small numbers of bits, referred to as extremely low-bit neural networks. We model this problem as a discretely constrained optimization
problem. Borrowing the idea from Alternating Direction Method of Multipliers
(ADMM), we decouple the continuous parameters from the discrete constraints of the network, and cast the original hard problem into several subproblems. We propose to solve these subproblems using extragradient and iterative quantization algorithms, which lead to considerably faster convergence compared
to conventional optimization methods. Extensive experiments on image
recognition and object detection verify that the proposed algorithm is more effective than state-of-the-art approaches when it comes to extremely low-bit neural networks.
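The decoupling is easy to see on a toy problem: alternate a gradient step on the continuous weights, a projection of the shifted weights onto the discrete set, and a dual update. The quadratic loss, the scaled-binary constraint set, and plain SGD below are illustrative stand-ins for the paper's losses and extragradient solver.

```python
import torch

def project_binary(v):
    """L2-optimal projection onto the set {alpha * {-1, +1}^n}."""
    return v.abs().mean() * torch.sign(v)

target = torch.randn(256)
w = target.clone().requires_grad_(True)   # continuous parameters
u = torch.zeros(256)                       # scaled dual variable
rho = 0.1
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(200):
    q = project_binary(w.detach() + u)                    # discrete subproblem
    loss = ((w - target) ** 2).mean() \
         + 0.5 * rho * ((w - q + u) ** 2).mean()          # augmented Lagrangian
    opt.zero_grad()
    loss.backward()
    opt.step()                                            # continuous subproblem
    u = u + w.detach() - q                                # dual ascent

gap = torch.norm(w.detach() - project_binary(w.detach())).item()
print(f"distance from w to the discrete set: {gap:.4f}")
```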