Alternating Multi-bit Quantization for Recurrent Neural Networks
Recurrent neural networks have achieved excellent performance in many
applications. However, on portable devices with limited resources, the models
are often too large to deploy. For applications on servers with large-scale
concurrent requests, the inference latency can also be critical given costly
computing resources. In this work, we address these problems by
quantizing the network, both weights and activations, into multiple binary
codes {-1,+1}. We formulate the quantization as an optimization problem. Under
the key observation that, once the quantization coefficients are fixed, the
binary codes can be derived efficiently by a binary search tree, alternating
minimization is then applied. We test the quantization on two well-known RNNs,
i.e., long short-term memory (LSTM) and gated recurrent unit (GRU), on
language models. Compared with the full-precision counterpart, with 2-bit
quantization we achieve ~16x memory saving and ~6x real inference
acceleration on CPUs, with only a reasonable loss in accuracy. With 3-bit
quantization, we achieve almost no loss in accuracy or even surpass the
original model, with ~10.5x memory saving and ~3x real inference acceleration.
Both results beat existing quantization works by large margins. We extend
our alternating quantization to image classification tasks. In both RNNs and
feedforward neural networks, the method also achieves excellent performance.
Comment: Published as a conference paper at ICLR 2018
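As a rough illustration of the alternating scheme sketched in the abstract, the NumPy snippet below quantizes a weight vector into k binary codes plus k coefficients: a least-squares solve for the coefficients alternates with a nearest-candidate search for the codes. The greedy initialization and the brute-force enumeration that stands in for the paper's binary-search-tree lookup are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of alternating multi-bit quantization:
# approximate w by sum_i alpha_i * b_i with b_i in {-1, +1}.
import itertools
import numpy as np

def alternating_quantize(w, k=2, iters=10):
    # Greedy initialization: peel off one binary code at a time.
    residual, B, alpha = w.copy(), [], []
    for _ in range(k):
        b = np.sign(residual); b[b == 0] = 1.0
        a = np.abs(residual).mean()
        B.append(b); alpha.append(a)
        residual -= a * b
    B, alpha = np.stack(B, axis=1), np.array(alpha)      # B: (n, k)

    for _ in range(iters):
        # Step 1: with codes fixed, coefficients are a least-squares solve.
        alpha, *_ = np.linalg.lstsq(B, w, rcond=None)
        # Step 2: with coefficients fixed, the best code for each weight is
        # the closest of the 2^k candidate values sum(+-alpha_i); the paper
        # finds it via a binary search tree, here we simply enumerate.
        signs = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
        candidates = signs @ alpha                       # (2^k,)
        best = np.argmin(np.abs(w[:, None] - candidates[None, :]), axis=1)
        B = signs[best]
    return B, alpha

w = np.random.randn(1024)
B, alpha = alternating_quantize(w, k=2)
print("relative error:", np.linalg.norm(B @ alpha - w) / np.linalg.norm(w))
```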
Compression of Acoustic Event Detection Models with Low-rank Matrix Factorization and Quantization Training
In this paper, we present a compression approach based on the combination of
low-rank matrix factorization and quantization training, to reduce complexity
for neural network based acoustic event detection (AED) models. Our
experimental results show this combined compression approach is very effective.
For a three-layer long short-term memory (LSTM) based AED model, the original
model size can be reduced to 1% with negligible loss of accuracy. Our approach
makes it feasible to deploy AED in resource-constrained applications.
Comment: NeurIPS 2018 CDNNRIA workshop
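A minimal sketch of the two ingredients the abstract combines, low-rank matrix factorization and quantization of the factors, might look as follows in NumPy. The rank, the 8-bit symmetric per-tensor quantizer, and the use of truncated SVD are assumptions for illustration; the paper's training procedure is not reproduced here.

```python
import numpy as np

def low_rank_factorize(W, rank):
    # W ~= A @ B with A: (m, rank), B: (rank, n), via truncated SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank, :]

def quantize(x, bits=8):
    # Symmetric uniform quantizer; returns the dequantized tensor so the
    # accuracy impact of quantization can be simulated during training.
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

W = np.random.randn(512, 512)
A, B = low_rank_factorize(W, rank=32)
W_hat = quantize(A) @ quantize(B)
print(f"params kept: {(A.size + B.size) / W.size:.1%}, "
      f"rel. error: {np.linalg.norm(W_hat - W) / np.linalg.norm(W):.3f}")
```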
Network Pruning for Low-Rank Binary Indexing
Pruning is an efficient model compression technique to remove redundancy in
the connectivity of deep neural networks (DNNs). Computations using sparse
matrices obtained by pruning parameters, however, exhibit vastly different
parallelism depending on the index representation scheme. As a result,
fine-grained pruning has not gained much attention due to its irregular index
form leading to large memory footprint and low parallelism for convolutions and
matrix multiplications. In this paper, we propose a new network pruning
technique that generates a low-rank binary index matrix to compress index data
while index decompression is performed by a simple binary matrix
multiplication. The proposed compression method finds a particular
fine-grained pruning mask that can be decomposed into two binary matrices. We
also propose a tile-based factorization technique that not only lowers memory
requirements but also enhances compression ratio. Various DNN models can be
pruned with far fewer indexes than previous sparse matrix formats while
maintaining the same pruning rate.
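To make the decompression step concrete, here is a small NumPy sketch in which the fine-grained pruning mask is stored as two small binary matrices and reconstructed by a Boolean matrix product. The binary rank and the OR-of-ANDs reconstruction rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 256, 8                       # mask shape and binary rank

# Compressed index data: two small binary factor matrices.
P = rng.random((m, r)) < 0.15
Q = rng.random((r, n)) < 0.15

# Decompression: Boolean matrix product (a position is kept if any of the
# r rank-1 binary components covers it).
mask = (P.astype(np.uint8) @ Q.astype(np.uint8)) > 0

bits_dense = m * n                          # one bit per index, dense mask
bits_lowrank = r * (m + n)                  # the two factor matrices
print(f"index compression ratio: {bits_dense / bits_lowrank:.1f}x, "
      f"mask density: {mask.mean():.2f}")
```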
Learning to Skip Ineffectual Recurrent Computations in LSTMs
Long Short-Term Memory (LSTM) is a special class of recurrent neural network,
which has shown remarkable successes in processing sequential data. The typical
architecture of an LSTM involves a set of states and gates: the states retain
information over arbitrary time intervals and the gates regulate the flow of
information. Due to the recursive nature of LSTMs, they are computationally
intensive to deploy on edge devices with limited hardware resources. To reduce
the computational complexity of LSTMs, we first introduce a method that learns
to retain only the important information in the states by pruning redundant
information. We then show that our method can prune over 90% of information in
the states without incurring any accuracy degradation over a set of temporal
tasks. This observation suggests that a large fraction of the recurrent
computations are ineffectual and can be avoided to speed up inference, as
they involve noncontributory multiplications/accumulations
with zero-valued states. Finally, we introduce a custom hardware accelerator
that can perform the recurrent computations using both sparse and dense states.
Experimental measurements show that performing the computations using the
sparse states speeds up the process and improves energy efficiency by up to
5.2x when compared to implementation results of the accelerator performing the
computations using dense states.
Comment: Accepted as a conference paper for presentation at DATE 2019
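The computational saving from sparse states can be sketched in a few lines of NumPy: once state entries are zeroed, the recurrent matrix-vector product only needs the columns matching nonzero entries. The magnitude threshold below is a stand-in for the paper's learned retention mechanism and is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 256
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
h = rng.standard_normal(hidden)

threshold = np.quantile(np.abs(h), 0.9)     # keep ~10% of the state
h_sparse = np.where(np.abs(h) >= threshold, h, 0.0)

# The dense recurrent computation touches every column of W_hh; with a
# sparse state only the columns matching nonzero entries contribute.
nz = np.flatnonzero(h_sparse)
out_sparse = W_hh[:, nz] @ h_sparse[nz]
out_dense = W_hh @ h_sparse
print("skipped MAC fraction:", 1 - nz.size / hidden)
print("results match:", np.allclose(out_sparse, out_dense))
```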
Learning Low-Rank Approximation for CNNs
Low-rank approximation is an effective model compression technique that
reduces not only parameter storage requirements but also computations.
For convolutional neural networks (CNNs), however, well-known low-rank
approximation methods, such as Tucker or CP decomposition, result in degraded
model accuracy because decomposed layers hinder training convergence. In this
paper, we propose a new training technique that finds a flat minimum in the
view of low-rank approximation without a decomposed structure during training.
By preserving the original model structure, 2-dimensional low-rank
approximation, which demands lowering (such as im2col), is available in our proposed
scheme. We show that CNN models can be compressed by low-rank approximation
with much higher compression ratio than conventional training methods while
maintaining or even enhancing model accuracy. We also discuss various
2-dimensional low-rank approximation techniques for CNNs.
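One way to realize training toward a low-rank-friendly solution without a decomposed structure, under assumptions that simplify the paper's method to a toy quadratic objective, is to occasionally project the 2-D weight matrix onto its best rank-r approximation and continue training from the projected point:

```python
import numpy as np

def project_low_rank(W, r):
    # Best rank-r approximation via truncated SVD.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(0)
target = project_low_rank(rng.standard_normal((64, 64)), r=4)
W = rng.standard_normal((64, 64))

for step in range(1, 501):
    grad = W - target                       # gradient of 0.5*||W - target||^2
    W -= 0.1 * grad
    if step % 50 == 0:                      # occasional low-rank projection;
        W = project_low_rank(W, r=8)        # the layer structure is unchanged

# After training, the final low-rank projection loses almost nothing.
print("loss at rank-8 projection:",
      np.linalg.norm(project_low_rank(W, 8) - target))
```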
DeepTwist: Learning Model Compression via Occasional Weight Distortion
Model compression has been introduced to reduce the required hardware
resources while maintaining the model accuracy. Lots of techniques for model
compression, such as pruning, quantization, and low-rank approximation, have
been suggested along with different inference implementation characteristics.
Adopting model compression is, however, still challenging because its design
complexity is rapidly increasing due to the additional hyper-parameters and
computation overhead required to achieve a high compression ratio. In this
paper, we propose a simple and efficient model compression framework called
DeepTwist, which occasionally distorts weights without modifying the
underlying training algorithms. The ideas of
designing weight distortion functions are intuitive and straightforward given
formats of compressed weights. We show that our proposed framework
significantly improves the compression rate for pruning, quantization, and
low-rank approximation techniques while greatly reducing the effort of
additional retraining and/or hyper-parameter search. Regularization effects
of DeepTwist are also reported.
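A minimal sketch of the DeepTwist loop, with magnitude pruning as the weight distortion function and a toy quadratic objective standing in for real training, might look like this; quantization or low-rank projection would slot into the same place. The period, sparsity, and step size are illustrative assumptions.

```python
import numpy as np

def prune_distortion(w, sparsity=0.9):
    # Distortion for the pruning format: zero the smallest-magnitude weights.
    k = int(sparsity * w.size)
    idx = np.argpartition(np.abs(w), k)[:k]
    w = w.copy(); w[idx] = 0.0
    return w

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
target = prune_distortion(rng.standard_normal(1000), 0.9)

period = 20
for step in range(1, 401):
    w -= 0.05 * (w - target)                # any training step, unmodified
    if step % period == 0:
        w = prune_distortion(w)             # occasional weight distortion

print("final distance after pruning:",
      np.linalg.norm(prune_distortion(w) - target))
```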
Dynamically Hierarchy Revolution: DirNet for Compressing Recurrent Neural Network on Mobile Devices
Recurrent neural networks (RNNs) achieve cutting-edge performance on a
variety of problems. However, due to their high computational and memory
demands, deploying RNNs on resource constrained mobile devices is a challenging
task. To guarantee minimal accuracy loss at a higher compression rate, and
driven by mobile resource requirements, we introduce a novel model
compression approach, DirNet, based on an optimized fast dictionary learning
algorithm, which 1) dynamically mines the dictionary atoms of the projection
dictionary matrix within each layer to adjust the compression rate, and 2)
adaptively changes the sparsity of sparse codes across the hierarchical layers.
Experimental results on a language model and an ASR model trained with a 1000h
speech dataset demonstrate that our method significantly outperforms prior
approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce
the size of original model by eight times with real-time model inference and
negligible accuracy loss.
Comment: Accepted by IJCAI-ECAI 2018
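As background for the dictionary-learning core, the NumPy sketch below compresses a projection matrix as W ≈ D·S with a small dictionary D and column-sparse codes S, using generic alternating least squares with hard thresholding. This is a stand-in for DirNet's optimized algorithm; the atom count and sparsity level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
atoms, keep = 64, 8                         # dictionary size, nonzeros/column

D = rng.standard_normal((256, atoms))
for _ in range(20):
    # Sparse coding step: least squares followed by hard thresholding,
    # keeping only the `keep` largest coefficients per column.
    S, *_ = np.linalg.lstsq(D, W, rcond=None)
    cut = np.argsort(np.abs(S), axis=0)[:-keep, :]
    np.put_along_axis(S, cut, 0.0, axis=0)
    # Dictionary update step: least squares for D given the sparse codes.
    D = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T

stored = D.size + 2 * keep * W.shape[1]     # atoms + (value, index) pairs
print(f"compression: {W.size / stored:.1f}x, "
      f"rel. error: {np.linalg.norm(D @ S - W) / np.linalg.norm(W):.3f}")
```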
FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs
It is well known that many types of artificial neural networks, including
recurrent networks, can achieve a high classification accuracy even with
low-precision weights and activations. The reduction in precision generally
yields much more efficient hardware implementations in regards to hardware
cost, memory requirements, energy, and achievable throughput. In this paper, we
present the first systematic exploration of this design space as a function of
precision for a Bidirectional Long Short-Term Memory (BiLSTM) neural network.
Specifically, we include an in-depth investigation of precision vs. accuracy
using a fully hardware-aware training flow, where quantization of all aspects
of the network, including weights, inputs, outputs, and in-memory cell
activations, is taken into consideration during training. In addition, hardware resource
cost, power consumption and throughput scalability are explored as a function
of precision for FPGA-based implementations of BiLSTM, and multiple approaches
of parallelizing the hardware. We provide the first open-source HLS library
extension of FINN for parameterizable hardware architectures of LSTM layers on
FPGAs, which offers full precision flexibility and allows for parameterizable
performance scaling with different levels of parallelism within the
architecture. Based on this library, we present an FPGA-based accelerator for
BiLSTM neural network designed for optical character recognition, along with
numerous other experimental proof points for a Zynq UltraScale+ XCZU7EV MPSoC
within the given design space.
Comment: Accepted for publication, 28th International Conference on Field Programmable Logic and Applications (FPL), August 2018, Dublin, Ireland
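The precision dimension of this design space can be illustrated with a parameterizable uniform quantizer applied to both weights and activations of a single recurrent step, as in the NumPy sketch below. The symmetric quantizer and the swept bit-widths are assumptions, not the FINN-L training flow itself.

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantizer with a parameterizable word length.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.clip(np.round(x / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
W, x, h = (rng.standard_normal(s) for s in [(128, 192), (64,), (128,)])

ref = np.tanh(W @ np.concatenate([x, h]))          # full-precision step
for bits in (2, 4, 8):
    Wq = quantize(W, bits)                         # quantized weights
    hq = quantize(np.concatenate([x, h]), bits)    # quantized activations
    err = np.linalg.norm(np.tanh(Wq @ hq) - ref) / np.linalg.norm(ref)
    print(f"{bits}-bit: relative output error {err:.3f}")
```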
Knowledge distillation for optimization of quantized deep neural networks
Knowledge distillation (KD) is a very popular method for model size
reduction. Recently, the technique has been exploited for training quantized
deep neural networks (QDNNs) as a way to restore the performance sacrificed
by word-length reduction. KD, however, employs additional hyper-parameters,
such as the temperature, the coefficient, and the size of the teacher
network, for QDNN training. We analyze the effect of these hyper-parameters
on QDNN optimization with KD. We find that these hyper-parameters are
inter-related, and also introduce a simple and effective technique that
reduces the coefficient during training. With KD employing the proposed
hyper-parameters, we achieve test accuracies of 92.7% and 67.0% on ResNet20
with 2-bit ternary weights for the CIFAR-10 and CIFAR-100 datasets,
respectively.
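A hedged sketch of the distillation loss with a decaying coefficient, the schedule the abstract proposes, is given below in NumPy; the linear decay and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max(axis=-1, keepdims=True)) / T)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, coeff, T=4.0):
    # Cross-entropy with the true labels plus temperature-scaled KL
    # divergence to the teacher, blended by the coefficient.
    p_s, p_t = softmax(student_logits, T), softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels])
    kl = (p_t * (np.log(p_t) - np.log(p_s))).sum(axis=-1)
    return ((1 - coeff) * ce + coeff * (T ** 2) * kl).mean()

rng = np.random.default_rng(0)
s, t = rng.standard_normal((8, 10)), rng.standard_normal((8, 10))
labels = rng.integers(0, 10, size=8)

total_steps = 100
for step in (0, 50, 99):
    coeff = 0.9 * (1 - step / total_steps)   # coefficient decays toward 0
    print(f"step {step}: coeff={coeff:.2f}, "
          f"loss={kd_loss(s, t, labels, coeff):.3f}")
```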
Dataflow-based Joint Quantization of Weights and Activations for Deep Neural Networks
This paper addresses a challenging problem: how to reduce energy consumption
without incurring a performance drop when deploying deep neural networks
(DNNs) at the inference stage. In order to alleviate the computation and storage
burdens, we propose a novel dataflow-based joint quantization approach with
the hypothesis that fewer quantization operations incur less information
loss and thus improve the final performance. It first introduces a
quantization scheme with efficient bit-shifting and rounding operations to
represent network parameters and activations in low precision. Then it
restructures the network architectures to form unified modules for optimization
on the quantized model. Extensive experiments on ImageNet and KITTI validate
the effectiveness of our model, demonstrating that state-of-the-art results for
various tasks can be achieved by this quantized model. In addition, we
designed and synthesized an RTL model to measure the hardware costs of
various quantization methods. For each quantization operation, our design
reduces area cost by about 15 times and energy consumption by about 9 times,
compared to a strong baseline.
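A quantizer built only from the two operations named in the abstract, bit-shifting and rounding, can be sketched as follows; the 8-bit word length and the shift-selection rule are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def shift_round_quantize(x, bits=8):
    # Choose the power-of-two scale that covers the dynamic range, so the
    # scaling multiply reduces to a bit shift in an RTL implementation.
    shift = int(np.ceil(np.log2(np.abs(x).max() / (2 ** (bits - 1) - 1))))
    q = np.round(x / 2.0 ** shift)                     # shift, then round
    q = np.clip(q, -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q.astype(np.int32), shift                   # integers + shift

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
q, shift = shift_round_quantize(w, bits=8)
w_hat = q.astype(np.float64) * 2.0 ** shift            # dequantize
print(f"shift={shift}, rel. error="
      f"{np.linalg.norm(w_hat - w) / np.linalg.norm(w):.4f}")
```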