Precision Highway for Ultra Low-Precision Quantization
Neural network quantization has an inherent problem called accumulated
quantization error, which is the key obstacle towards ultra-low precision,
e.g., 2- or 3-bit precision. To resolve this problem, we propose the precision
highway, which forms an end-to-end high-precision information flow while
performing ultra-low-precision computation. First, we describe how the
precision highway reduces the accumulated quantization error in both
convolutional and recurrent neural networks. We also provide a quantitative
analysis of the benefit of the precision highway and evaluate its overhead on a
state-of-the-art hardware accelerator. In our experiments, the proposed method
outperforms the best existing quantization methods while offering 3-bit
weight/activation quantization with no accuracy loss and 2-bit quantization
with a 2.45% top-1 accuracy loss on ResNet-50. We also report that the
proposed method significantly outperforms existing methods in the 2-bit
quantization of an LSTM for language modeling.
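To make the mechanism concrete, here is a minimal NumPy sketch of the idea (not the paper's code; `quantize` and `residual_block_highway` are illustrative names): the residual branch computes at 2-bit precision while the identity path carries full-precision activations, so branch quantization error does not compound across stacked blocks.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fake-quantization: round to the nearest of
    2**(bits-1) - 1 levels per sign and map back to float."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def residual_block_highway(x, w, bits=2):
    # Residual branch: activations and weights both run at ultra-low precision.
    branch = np.maximum(quantize(x, bits) @ quantize(w, bits), 0.0)
    # Highway: the identity path is added in full precision, so quantization
    # error from the branch does not accumulate across stacked blocks.
    return x + branch

x = np.random.randn(8, 64).astype(np.float32)
w = (np.random.randn(64, 64) * 0.05).astype(np.float32)
print(residual_block_highway(x, w, bits=2).shape)  # (8, 64)
```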
A Survey on Methods and Theories of Quantized Neural Networks
Deep neural networks are the state-of-the-art methods for many real-world
tasks, such as computer vision, natural language processing and speech
recognition. For all their popularity, deep neural networks are also
criticized for consuming large amounts of memory and draining device batteries
during training and inference. This makes them hard to deploy on mobile or
embedded devices with tight resource constraints. Quantization is recognized as
one of the most effective approaches to cope with the extreme memory
requirements of deep neural network models. Instead of adopting the 32-bit
floating-point format to represent weights, quantized representations
store weights using more compact formats such as integers or even binary
numbers. Despite a possible degradation in predictive performance, quantization
provides a potential solution to greatly reduce the model size and the energy
consumption. In this survey, we give a thorough review of different aspects of
quantized neural networks. Current challenges and trends of quantized neural
networks are also discussed.
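As a concrete illustration of the representational trade-off discussed above (a sketch under our own assumptions, not code from the survey), the following maps float32 weights to int8 codes with a per-tensor scale, and to 1-bit signs with a mean-magnitude scale:

```python
import numpy as np

def to_int8(w):
    """Map float32 weights to int8 codes plus one float32 scale per tensor."""
    scale = np.max(np.abs(w)) / 127.0 + 1e-12
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def binarize(w):
    """1-bit quantization: keep only the sign and one mean-magnitude scale."""
    return np.sign(w).astype(np.int8), np.mean(np.abs(w))

w = np.random.randn(256, 256).astype(np.float32)
codes, scale = to_int8(w)
print("fp32:", w.nbytes, "bytes -> int8:", codes.nbytes, "bytes")  # 4x smaller
print("max dequantization error:", np.max(np.abs(codes * scale - w)))
```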
QGAN: Quantized Generative Adversarial Networks
The intensive computation and memory requirements of generative adversarial
neural networks (GANs) hinder their real-world deployment on edge devices such
as smartphones. Despite the success of model reduction for CNNs, neural network
quantization methods have not yet been studied for GANs, where the main
challenges are the effectiveness of quantization algorithms and the
instability of GAN training. In this paper, we start with an extensive
study on applying existing successful methods to quantize GANs. Our
observations reveal that none of them generates samples with reasonable
quality, owing to the underrepresentation of quantized values in the model
weights, and that the generator and discriminator networks show different
sensitivities to quantization methods. Motivated by these observations, we
develop a novel quantization
method for GANs based on EM algorithms, named QGAN. We also propose a
multi-precision algorithm to help find the optimal number of bits for quantized
GAN models together with the corresponding result quality. Experiments on
CIFAR-10 and CelebA show that QGAN can quantize GANs to even 1-bit or 2-bit
representations with results of quality comparable to the original models.
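The paper's EM-based quantizer is described only at a high level here, so the following hard-EM (k-means-style) stand-in is an assumption-laden sketch of the idea: fitting centroids to the weight distribution so that quantized values are no longer underrepresented.

```python
import numpy as np

def em_quantize(w, n_levels=4, iters=20):
    """Hard-EM weight quantization: the E-step assigns each weight to its
    nearest centroid, the M-step re-estimates centroids as cluster means."""
    flat = w.ravel()
    centroids = np.quantile(flat, np.linspace(0.05, 0.95, n_levels))
    for _ in range(iters):
        assign = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_levels):
            if np.any(assign == k):
                centroids[k] = flat[assign == k].mean()
    return centroids[assign].reshape(w.shape)

w = np.random.randn(128, 128) * 0.1
wq = em_quantize(w, n_levels=4)        # four levels, i.e. 2-bit codes
print("levels used:", np.unique(wq).size, "mse:", np.mean((w - wq) ** 2))
```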
Feature Map Transform Coding for Energy-Efficient CNN Inference
Convolutional neural networks (CNNs) achieve state-of-the-art accuracy in a
variety of tasks in computer vision and beyond. Among the major obstacles
hindering the ubiquitous use of CNNs for inference on low-power edge devices
are their high computational complexity and memory bandwidth requirements. The
latter often dominates the energy footprint on modern hardware. In this paper,
we introduce a lossy transform coding approach, inspired by image and video
compression, designed to reduce the memory bandwidth consumed by storing
intermediate activation results. Our method does not require
fine-tuning the network weights and halves the data transfer volumes to the
main memory by compressing feature maps, which are highly correlated, with
variable-length coding. Our method outperforms previous approaches in terms of
the number of bits per value, with minor accuracy degradation, on ResNet-34 and
MobileNetV2. We analyze the performance of our approach on a variety of CNN
architectures and demonstrate that an FPGA implementation of ResNet-18 with our
approach results in a reduction of around 40% in the memory energy footprint
compared to a quantized network, with negligible impact on accuracy. When an
accuracy degradation of up to 2% is allowed, a reduction of 60% is achieved. A
reference implementation is available at
https://github.com/CompressTeam/TransformCodingInferenc
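As a rough, hypothetical sketch of what transform coding of feature maps involves (the paper's actual transform and coder may differ), the snippet below decorrelates an 8x8 activation tile with an orthonormal DCT, quantizes the coefficients, and estimates the resulting bits per value from the coefficient entropy:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis, the kind of transform used in image codecs."""
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def transform_code(fmap, step=0.25):
    """Decorrelate a feature-map tile with a 2-D DCT, then quantize; the peaked
    coefficient histogram is what variable-length coding exploits."""
    d = dct_matrix(fmap.shape[0])
    q = np.round(d @ fmap @ d.T / step)
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return q, float(-(p * np.log2(p)).sum())   # empirical bits per value

tile = np.random.randn(8, 8)
tile = tile + np.roll(tile, 1, axis=1)         # correlated, like real activations
q, bpv = transform_code(tile)
print(f"~{bpv:.2f} bits/value after transform + quantization (raw fp32: 32)")
```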
Proximal Mean-field for Neural Network Quantization
Compressing large Neural Networks (NNs) by quantizing their parameters while
maintaining performance is highly desirable, as it reduces memory and time
complexity. In this work, we cast NN quantization as a discrete labelling
problem, and by examining relaxations, we design an efficient iterative
optimization procedure that involves stochastic gradient descent followed by a
projection. We prove that our simple projected gradient descent approach is, in
fact, equivalent to a proximal version of the well-known mean-field method.
These findings would allow the decades-old and theoretically grounded research
on MRF optimization to be used to design better network quantization schemes.
Our experiments on standard classification datasets (MNIST, CIFAR10/100,
TinyImageNet) with convolutional and residual architectures show that our
algorithm obtains fully-quantized networks with accuracies very close to the
floating-point reference networks.
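A toy version of the projected-gradient scheme (our own simplification on a least-squares problem, not the paper's experiments) looks as follows: gradients are evaluated at the projected, quantized weights, while updates are applied to continuous auxiliary weights.

```python
import numpy as np

def project(w, levels):
    """Projection onto the discrete label set: nearest quantized value."""
    levels = np.asarray(levels)
    return levels[np.argmin(np.abs(w[..., None] - levels), axis=-1)]

def pgd_quantize(x, y, levels=(-1.0, 1.0), lr=0.05, steps=300):
    """Gradient of a least-squares loss taken at the projected (quantized)
    weights, with the update applied to continuous auxiliary weights."""
    w_aux = np.random.randn(x.shape[1]) * 0.1
    for _ in range(steps):
        w_q = project(w_aux, levels)                 # projection step
        grad = x.T @ (x @ w_q - y) / len(y)          # gradient at quantized point
        w_aux -= lr * grad                           # SGD step on auxiliary weights
    return project(w_aux, levels)

rng = np.random.default_rng(0)
x = rng.standard_normal((512, 16))
w_true = np.sign(rng.standard_normal(16))
y = x @ w_true
print("recovered signs:", (pgd_quantize(x, y) == w_true).mean())
```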
Memory-Driven Mixed Low Precision Quantization For Enabling Deep Network Inference On Microcontrollers
This paper presents a novel end-to-end methodology for enabling the
deployment of low-error deep networks on microcontrollers. To fit the memory
and computational limitations of resource-constrained edge-devices, we exploit
mixed low-bitwidth compression, featuring 8, 4 or 2-bit uniform quantization,
and we model the inference graph with integer-only operations. Our approach
aims at determining the minimum bit precision of every activation and weight
tensor given the memory constraints of a device. This is achieved through a
rule-based iterative procedure that cuts the number of bits of the most
memory-demanding layers until the memory constraints are met. After a
quantization-aware retraining step, the fake-quantized graph is converted into
an inference integer-only model by inserting the Integer Channel-Normalization
(ICN) layers, which introduce a negligible loss as demonstrated on INT4
MobilenetV1 models. We report the latency-accuracy evaluation of
mixed-precision MobilenetV1 family networks on an STM32H7 microcontroller. Our
experimental results demonstrate an end-to-end deployment of an integer-only
Mobilenet network with Top1 accuracy of 68% on a device with only 2MB of FLASH
memory and 512kB of RAM, improving the Top1 accuracy by 8% with respect to
previously published 8-bit implementations for microcontrollers.
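The rule-based procedure can be sketched in a few lines (illustrative names and budget; not the authors' code): repeatedly lower the bitwidth of the tensor currently consuming the most memory until the model fits the device budget.

```python
def fit_to_budget(tensor_sizes, budget_bits, choices=(8, 4, 2)):
    """tensor_sizes: {name: element count}. Returns {name: bitwidth}."""
    bits = {name: choices[0] for name in tensor_sizes}  # start everything at 8-bit

    def cost():
        return sum(n * bits[name] for name, n in tensor_sizes.items())

    while cost() > budget_bits:
        # Rule: cut the tensor that currently consumes the most memory.
        name = max(tensor_sizes, key=lambda t: tensor_sizes[t] * bits[t])
        idx = choices.index(bits[name])
        if idx + 1 == len(choices):
            raise ValueError("budget unreachable even at the lowest precision")
        bits[name] = choices[idx + 1]
    return bits

sizes = {"conv1.w": 864, "conv2.w": 18_432, "fc.w": 512_000}  # hypothetical layers
print(fit_to_budget(sizes, budget_bits=2_000_000))
# {'conv1.w': 8, 'conv2.w': 8, 'fc.w': 2}
```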
Entropy-Constrained Training of Deep Neural Networks
We propose a general framework for neural network compression that is
motivated by the Minimum Description Length (MDL) principle. To that end, we first
derive an expression for the entropy of a neural network, which measures its
complexity explicitly in terms of its bit-size. Then, we formalize the problem
of neural network compression as an entropy-constrained optimization objective.
This objective generalizes many of the compression techniques proposed in the
literature, in that pruning or reducing the cardinality of the weight elements
of the network can be seen as special cases of entropy-minimization techniques.
Furthermore, we derive a continuous relaxation of the objective, which allows
us to minimize it using gradient based optimization techniques. Finally, we
show that we can reach state-of-the-art compression results on different
network architectures and data sets, e.g., achieving 71x compression gains on a
VGG-like architecture.
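To see how entropy measures bit-size, the sketch below (our own illustration, not the paper's estimator) computes the empirical entropy of a weight tensor in bits per weight; pruning makes the distribution peaked, so the entropy, and hence the MDL-style code length, drops, which is the sense in which pruning is a special case of entropy minimization.

```python
import numpy as np

def weight_entropy_bits(w, n_levels=16):
    """Empirical entropy of a weight tensor in bits per weight; multiplied by
    the weight count it lower-bounds the losslessly coded model size."""
    edges = np.linspace(w.min(), w.max(), n_levels + 1)
    idx = np.clip(np.digitize(w, edges) - 1, 0, n_levels - 1)
    p = np.bincount(idx, minlength=n_levels) / idx.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000)
print(f"{weight_entropy_bits(w):.2f} bits/weight (raw fp32 uses 32)")
w[rng.random(w.size) < 0.9] = 0.0        # prune 90% of the weights to zero
print(f"{weight_entropy_bits(w):.2f} bits/weight after pruning")
```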
Value-aware Quantization for Training and Inference of Neural Networks
We propose a novel value-aware quantization that applies aggressively
reduced precision to the majority of data while separately handling a small
amount of large-magnitude data in high precision, thereby reducing the total
quantization error at very low precision. We present new techniques to apply
the proposed quantization to training and inference. The experiments show that
our method with 3-bit activations (with the 2% largest kept in high precision)
can give the same training accuracy as the full-precision baseline while
offering significant (41.6% and 53.7%)
reductions in the memory cost of activations in ResNet-152 and Inception-v3
compared with the state-of-the-art method. Our experiments also show that deep
networks such as Inception-v3, ResNet-101 and DenseNet-121 can be quantized for
inference with 4-bit weights and activations (with 1% 16-bit data) within 1%
top-1 accuracy drop.
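A minimal sketch of the idea (assumed details: a 2% outlier fraction and uniform quantization; not the paper's implementation) shows why separating large values helps: without it, a handful of outliers stretch the quantizer's range and inflate the error on everything else.

```python
import numpy as np

def value_aware_quantize(x, bits=3, big_frac=0.02):
    """Quantize all but the largest `big_frac` of values to `bits` bits;
    the few large-magnitude values stay at full precision, so they no
    longer stretch the quantizer's range."""
    thresh = np.quantile(np.abs(x), 1.0 - big_frac)
    scale = thresh / (2 ** (bits - 1) - 1)
    xq = np.round(np.clip(x, -thresh, thresh) / scale) * scale
    big = np.abs(x) > thresh
    xq[big] = x[big]                       # ~2% of data kept high precision
    return xq

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000) * np.where(rng.random(100_000) < 0.01, 10, 1)
naive_scale = np.abs(x).max() / 3          # plain 3-bit range covers outliers
naive = np.round(x / naive_scale) * naive_scale
print("value-aware mse:", np.mean((value_aware_quantize(x) - x) ** 2))
print("naive 3-bit mse:", np.mean((naive - x) ** 2))
```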
Granger Causality Analysis Based on Quantized Minimum Error Entropy Criterion
The linear regression model (LRM) based on the mean square error (MSE)
criterion is widely used in Granger causality analysis (GCA), the most commonly
used method to detect causality between a pair of time series. However,
when the signals are seriously contaminated by non-Gaussian noise, the LRM
coefficients will be identified inaccurately, which may cause GCA to detect
a wrong causal relationship. The minimum error entropy (MEE) criterion can
replace the MSE criterion to deal with non-Gaussian noise, but its
calculation requires a double summation operation, which creates a
computational bottleneck for GCA, especially when the signals are long. To address
the aforementioned problems, in this study we propose a new method called GCA
based on the quantized MEE (QMEE) criterion (GCA-QMEE), in which the QMEE
criterion is applied to identify the LRM coefficients and the quantized error
entropy is used to calculate the causality indexes. Compared with
traditional GCA, the proposed GCA-QMEE makes the results not only more
discriminative but also more robust, while the quantization operation keeps its
computational complexity low. Illustrative examples on synthetic
and EEG datasets are provided to verify the desirable performance and
applicability of GCA-QMEE.
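The computational saving is easy to sketch (our own toy, with illustrative function names): the plain MEE information potential needs an O(N^2) double sum of kernels over all error pairs, while the quantized version snaps errors to a small codebook of M words and sums N x M kernels weighted by codeword counts.

```python
import numpy as np

def gaussian_kernel(u, sigma=1.0):
    return np.exp(-u ** 2 / (2 * sigma ** 2))

def mee_potential(e, sigma=1.0):
    """Plain MEE information potential: an O(N^2) double summation."""
    return gaussian_kernel(e[:, None] - e[None, :], sigma).mean()

def qmee_potential(e, sigma=1.0, step=0.5):
    """Quantized MEE: snap errors to a coarse codebook, then sum kernels
    between the N samples and the M << N codewords, weighted by counts."""
    words, counts = np.unique(np.round(e / step) * step, return_counts=True)
    k = gaussian_kernel(e[:, None] - words[None, :], sigma)    # N x M
    return float((k * counts).sum()) / (len(e) ** 2)

rng = np.random.default_rng(0)
e = rng.standard_t(df=2, size=2_000)       # heavy-tailed, non-Gaussian errors
print(f"MEE  {mee_potential(e):.4f}  (O(N^2) kernel evaluations)")
print(f"QMEE {qmee_potential(e):.4f}  (O(N*M), M = codebook size)")
```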
ReLeQ: A Reinforcement Learning Approach for Deep Quantization of Neural Networks
Deep Neural Networks (DNNs) typically require massive amounts of computational
resources for inference in computer vision applications. Quantization can
significantly reduce DNN computation and storage by decreasing the bitwidth of
network encodings. Recent research affirms that carefully selecting the
quantization levels for each layer can preserve the accuracy while pushing the
bitwidth below eight bits. However, without arduous manual effort, this deep
quantization can lead to significant accuracy loss, leaving it in a position of
questionable utility. As such, deep quantization opens a large hyper-parameter
space (bitwidth of the layers), the exploration of which is a major challenge.
We propose a systematic approach to tackle this problem, by automating the
process of discovering the quantization levels through an end-to-end deep
reinforcement learning framework (ReLeQ). We adapt policy optimization methods
to the problem of quantization, and focus on finding the best design decisions
in choosing the state and action spaces, network architecture and training
framework, as well as the tuning of various hyperparameters. We show how ReLeQ
can balance speed and quality, and provide an asymmetric general solution for
quantization of a large variety of deep networks (AlexNet, CIFAR-10, LeNet,
MobileNet-V1, ResNet-20, SVHN, and VGG-11) that virtually preserves the
accuracy (=< 0.3% loss) while minimizing the computation and storage cost. With
these DNNs, ReLeQ enables conventional hardware to achieve 2.2x speedup over
8-bit execution. Similarly, a custom DNN accelerator achieves 2.0x speedup and
energy reduction compared to 8-bit runs. These encouraging results mark ReLeQ
as the initial step towards automating the deep quantization of neural
networks.
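ReLeQ itself adapts policy-optimization methods over carefully designed state and action spaces; as a heavily reduced toy (the reward shape and all names below are invented for illustration), an epsilon-greedy agent picking one bitwidth per layer shows the accuracy-versus-cost balance such a reward must encode:

```python
import random

LAYERS, BITS = 4, [2, 4, 8]

def reward(assignment):
    """Toy reward: an accuracy proxy that degrades at low bitwidths minus a
    cost term that grows with bitwidth."""
    return sum(-1.0 / b - b / 16.0 for b in assignment)

q = {(l, b): 0.0 for l in range(LAYERS) for b in BITS}   # per-(layer, bit) value
n = dict.fromkeys(q, 0)
for episode in range(2000):
    eps = max(0.05, 1.0 - episode / 1000)                # decaying exploration
    choice = [random.choice(BITS) if random.random() < eps
              else max(BITS, key=lambda b: q[(l, b)])
              for l in range(LAYERS)]
    r = reward(choice)
    for l, b in enumerate(choice):
        n[(l, b)] += 1
        q[(l, b)] += (r - q[(l, b)]) / n[(l, b)]         # incremental mean return
print("learned bitwidths:",
      [max(BITS, key=lambda b: q[(l, b)]) for l in range(LAYERS)])
```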