532 research outputs found

    Data-Free Quantization Through Weight Equalization and Bias Correction

    We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on modern deep learning hardware. However, quantizing models to run in 8-bit is a non-trivial task, frequently leading to either significant performance reduction or engineering time spent on training a network to be amenable to quantization. Our approach relies on equalizing the weight ranges in the network by making use of a scale-equivariance property of activation functions. In addition, the method corrects biases in the error that are introduced during quantization. This improves quantization accuracy and can be applied to many common computer vision architectures with a straightforward API call. For common architectures, such as the MobileNet family, we achieve state-of-the-art quantized model performance. We further show that the method also extends to other computer vision architectures and tasks such as semantic segmentation and object detection. Comment: ICCV 2019
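
    As a rough illustration of the cross-layer equalization idea described above, the sketch below rescales the output channels of one layer and compensates in the following layer so their weight ranges match; the function name, the two-layer numpy setup, and the epsilon guard are assumptions for illustration, not the authors' code.

        import numpy as np

        def equalize_pair(W1, b1, W2):
            # W1: (out1, in1) weights of the first layer, b1 its bias,
            # W2: (out2, out1) weights of the next layer, with a ReLU in between.
            r1 = np.abs(W1).max(axis=1)            # range of each output channel of layer 1
            r2 = np.abs(W2).max(axis=0) + 1e-12    # range seen by each input channel of layer 2
            s = np.sqrt(r1 * r2) / r2              # per-channel equalizing scale
            W1_eq = W1 / s[:, None]                # rescale layer-1 rows ...
            b1_eq = b1 / s                         # ... and the bias with them
            W2_eq = W2 * s[None, :]                # compensate in layer-2 columns
            return W1_eq, b1_eq, W2_eq             # same end-to-end function, matched weight ranges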

    Data-Free Network Quantization With Adversarial Knowledge Distillation

    Network quantization is an essential procedure in deep learning for developing efficient fixed-point inference models on mobile or edge platforms. However, as datasets grow larger and privacy regulations become stricter, data sharing for model compression becomes more difficult and restricted. In this paper, we consider data-free network quantization with synthetic data. The synthetic data are generated from a generator, while no data are used in training the generator or in quantization. To this end, we propose data-free adversarial knowledge distillation, which minimizes the maximum distance between the outputs of the teacher and the (quantized) student for any adversarial samples from a generator. To generate adversarial samples similar to the original data, we additionally propose matching statistics from the batch normalization layers for generated data and the original data in the teacher. Furthermore, we show the gain of producing diverse adversarial samples by using multiple generators and multiple students. Our experiments show state-of-the-art data-free model compression and quantization results for (wide) residual networks and MobileNet on the SVHN, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The accuracy losses compared to using the original datasets are shown to be minimal. Comment: CVPR 2020 Joint Workshop on Efficient Deep Learning in Computer Vision (EDLCV)
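
    A minimal PyTorch sketch of the min-max game described above, assuming externally built generator, teacher, and student models and their optimizers; the BatchNorm-statistics matching term and the multi-generator/multi-student variants are omitted for brevity.

        import torch
        import torch.nn.functional as F

        def data_free_akd_step(generator, teacher, student, g_opt, s_opt, z_dim, batch=64):
            # Generator step: MAXIMIZE the teacher/student output divergence ...
            z = torch.randn(batch, z_dim)
            x = generator(z)
            d = F.kl_div(F.log_softmax(student(x), dim=-1),
                         F.softmax(teacher(x), dim=-1), reduction='batchmean')
            g_opt.zero_grad(); (-d).backward(); g_opt.step()
            # ... student step: MINIMIZE the same divergence on fresh samples.
            x = generator(torch.randn(batch, z_dim)).detach()
            d = F.kl_div(F.log_softmax(student(x), dim=-1),
                         F.softmax(teacher(x), dim=-1), reduction='batchmean')
            s_opt.zero_grad(); d.backward(); s_opt.step()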

    Subtensor Quantization for Mobilenets

    Quantization for deep neural networks (DNNs) has enabled developers to deploy models with less memory and more efficient low-power inference. However, not all DNN designs are friendly to quantization. For example, the popular MobileNet architecture has been tuned to reduce parameter size and computational latency with separable depth-wise convolutions, but not all quantization algorithms work well on it, and accuracy can suffer compared to the floating-point version. In this paper, we analyzed several root causes of quantization loss and proposed alternatives that do not rely on per-channel or training-aware approaches. We evaluate the image classification task on the ImageNet dataset, and our post-training quantized 8-bit inference top-1 accuracy is within 0.7% of the floating-point version. Comment: Embedded Vision Workshop, 16th European Conference on Computer Vision (ECCV), Aug 2020
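
    For context, a minimal numpy sketch of the plain per-tensor affine (asymmetric) 8-bit post-training quantization that such work starts from; this is not the paper's subtensor scheme, and the helper names are illustrative.

        import numpy as np

        def quantize_per_tensor(x, num_bits=8):
            # One scale and zero point for the whole tensor.
            qmin, qmax = 0, 2 ** num_bits - 1
            scale = max(float(x.max() - x.min()) / (qmax - qmin), 1e-12)
            zero_point = int(round(qmin - float(x.min()) / scale))
            q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
            return q, scale, zero_point

        def dequantize(q, scale, zero_point):
            return (q.astype(np.float32) - zero_point) * scale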

    A Data and Compute Efficient Design for Limited-Resources Deep Learning

    Thanks to their improved data efficiency, equivariant neural networks have gained increased interest in the deep learning community. They have been successfully applied in the medical domain, where symmetries in the data can be effectively exploited to build more accurate and robust models. To reach a much larger body of patients, mobile, on-device implementations of deep learning solutions have been developed for medical applications. However, equivariant models are commonly implemented using large and computationally expensive architectures not suitable to run on mobile devices. In this work, we design and test an equivariant version of MobileNetV2 and further optimize it with model quantization to enable more efficient inference. We achieve close to state-of-the-art performance on the Patch Camelyon (PCam) medical dataset while being more computationally efficient. Comment: Accepted for poster presentation at the Practical Machine Learning for Developing Countries (PML4DC) workshop, ICLR 2020
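
    A heavily simplified sketch of the rotation-robust convolution idea, assuming a plain PyTorch module that applies one shared kernel at four 90-degree rotations and pools over orientations; the paper's actual blocks are full group-equivariant MobileNetV2 layers, which this does not reproduce.

        import torch
        import torch.nn.functional as F

        class C4InvariantConv(torch.nn.Module):
            # Shared 3x3 kernels applied at four rotations, max-pooled over
            # orientations: a crude stand-in for group-equivariant blocks.
            def __init__(self, in_ch, out_ch):
                super().__init__()
                self.weight = torch.nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.1)

            def forward(self, x):
                outs = [F.conv2d(x, torch.rot90(self.weight, r, dims=(2, 3)), padding=1)
                        for r in range(4)]
                return torch.stack(outs, dim=0).max(dim=0).values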

    MRQ: Support Multiple Quantization Schemes through Model Re-Quantization

    Despite the proliferation of diverse hardware accelerators (e.g., NPU, TPU, DPU), deploying deep learning models on edge devices with fixed-point hardware is still challenging due to complex model quantization and conversion. Existing model quantization frameworks like Tensorflow QAT [1], TFLite PTQ [2], and Qualcomm AIMET [3] support only a limited set of quantization schemes (e.g., only asymmetric per-tensor quantization in TF1.x QAT [4]). Accordingly, deep learning models cannot be easily quantized for diverse fixed-point hardware, mainly due to slightly different quantization requirements. In this paper, we envision a new type of model quantization approach called MRQ (model re-quantization), which takes existing quantized models and quickly transforms them to meet different quantization requirements (e.g., asymmetric -> symmetric, non-power-of-2 scale -> power-of-2 scale). Re-quantization is much simpler than quantizing from scratch because it avoids costly re-training and supports multiple quantization schemes simultaneously. To minimize re-quantization error, we developed a new set of re-quantization algorithms, including weight correction and rounding error folding. We demonstrate that the MobileNetV2 QAT model [7] can be quickly re-quantized into two different quantization schemes (i.e., symmetric and symmetric + power-of-2 scale) with less than 0.64 units of accuracy loss. We believe our work is the first to leverage this concept of re-quantization for model quantization, and models obtained from the re-quantization process have been successfully deployed on the NNA in Echo Show devices. Comment: 8 pages, 6 figures, 3 tables, TinyML Conference
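
    An illustrative sketch of the re-quantization idea, assuming an already affine-quantized tensor that is mapped to a symmetric, power-of-two-scale int8 representation; the paper's weight correction and rounding-error folding steps are not shown here.

        import numpy as np

        def requantize_to_sym_pow2(q, scale, zero_point, num_bits=8):
            x = (q.astype(np.float32) - zero_point) * scale           # dequantize once
            qmax = 2 ** (num_bits - 1) - 1
            raw_scale = float(np.abs(x).max()) / qmax                 # symmetric range
            pow2_scale = 2.0 ** np.ceil(np.log2(max(raw_scale, 1e-12)))  # snap up to a power of 2
            q_new = np.clip(np.round(x / pow2_scale), -qmax - 1, qmax).astype(np.int8)
            return q_new, pow2_scale                                  # new codes + power-of-2 scale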

    Weight Equalizing Shift Scaler-Coupled Post-training Quantization

    Post-training, layer-wise quantization is preferable because it is free from retraining and is hardware-friendly. Nevertheless, accuracy degradation occurs when a neural network model has large differences in per-output-channel weight ranges. In particular, the MobileNet family suffers a catastrophic drop in top-1 accuracy, from 70.60% ~ 71.87% to 0.1%, on the ImageNet dataset after 8-bit weight quantization. To mitigate this significant accuracy reduction, we propose a new weight equalizing shift scaler, i.e., rescaling the weight range per channel by a 4-bit binary shift prior to layer-wise quantization. To recover the original output range, the inverse binary shift is efficiently fused into the existing per-layer scale compounding in the fixed-point convolutional operator of the custom neural processing unit. The binary shift is a key feature of our algorithm, which significantly improves accuracy without increasing the memory footprint. As a result, our proposed method achieves a top-1 accuracy of 69.78% ~ 70.96% on MobileNets and shows robust performance across varying network models and tasks, competitive with channel-wise quantization results. Comment: 9 pages, 4 figures, 4 tables
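
    A rough numpy sketch of the per-channel binary-shift rescaling described above; the shift-selection rule and the conv-weight layout are assumptions, and fusing the inverse shifts into the per-layer output scale of the NPU is left out.

        import numpy as np

        def equalizing_shift(W, max_shift=15):
            # W: (out_ch, in_ch, kh, kw). Scale each output channel up by a
            # power of two (a 4-bit shift, 0..15) so channel ranges become
            # comparable before one layer-wise quantization step; the shifts
            # are undone later in the per-layer output scale.
            ch_range = np.abs(W).max(axis=(1, 2, 3))
            target = ch_range.max()
            shifts = np.clip(np.floor(np.log2(target / np.maximum(ch_range, 1e-12))),
                             0, max_shift).astype(np.int32)
            W_eq = W * (2.0 ** shifts)[:, None, None, None]
            return W_eq, shifts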

    Exploring Neural Networks Quantization via Layer-Wise Quantization Analysis

    Quantization is an essential step in the efficient deployment of deep learning models and as such is an increasingly popular research topic. An important practical aspect that is not addressed in the current literature is how to analyze and fix failure cases where quantization results in excessive degradation. In this paper, we present a simple analytic framework that breaks down overall degradation into its per-layer contributions. We analyze many common networks and observe that a layer's contribution is determined by both intrinsic (local) factors - the distribution of the layer's weights and activations - and extrinsic (global) factors having to do with the interaction with the rest of the layers. Layer-wise analysis of existing quantization schemes reveals local failure cases of existing techniques which are not reflected when inspecting their overall performance. As an example, we consider ResNext26, on which SoTA post-training quantization methods perform poorly. We show that almost all of the degradation stems from a single layer. The same analysis also allows for local fixes: applying a common weight clipping heuristic only to this layer reduces degradation to a minimum, while applying the same heuristic globally results in high degradation. More generally, layer-wise analysis allows for a more nuanced examination of how quantization affects the network, enabling the design of better performing schemes.
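
    A small sketch of the kind of per-layer attribution the abstract describes: quantize one layer at a time in an otherwise floating-point model and record the accuracy drop. The quantize_layer and evaluate callables are assumed to be supplied by the user; they are not part of the paper.

        import copy

        def layerwise_degradation(model, layer_names, quantize_layer, evaluate):
            # Returns {layer name: accuracy drop when only that layer is quantized},
            # sorted from the worst offender down.
            baseline = evaluate(model)
            report = {}
            for name in layer_names:
                m = copy.deepcopy(model)            # fresh float copy each time
                quantize_layer(m, name)             # quantize only this layer
                report[name] = baseline - evaluate(m)
            return dict(sorted(report.items(), key=lambda kv: -kv[1]))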

    Quantization of Neural Network Equalizers in Optical Fiber Transmission Experiments

    The quantization of neural networks for the mitigation of nonlinear and component distortions in dual-polarization optical fiber transmission is studied. Two low-complexity neural network equalizers are applied in three 16-QAM 34.4 GBaud transmission experiments with different representative fibers. A number of post-training quantization and quantization-aware training algorithms are compared for casting the weights and activations of the neural network into a small number of bits, combined with uniform, additive power-of-two, and companding quantization. For quantization in the large bit-width regime of ≥5 bits, quantization-aware training with straight-through estimation incurs a Q-factor penalty of less than 0.5 dB compared to the unquantized neural network. For quantization in the low bit-width regime, an algorithm dubbed companding successive alpha-blending quantization is suggested. This method compensates for the quantization error aggressively by successive grouping and retraining of the parameters, as well as an incremental transition from the floating-point representations to the quantized values within each group. The activations can be quantized at 8 bits and the weights on average at 1.75 bits, with a penalty of ≤0.5 dB. If the activations are quantized at 6 bits, the weights can be quantized at 3.75 bits with minimal penalty. The computational complexity and required storage of the neural networks are drastically reduced, typically by over 90%. The results indicate that low-complexity neural networks can mitigate nonlinearities in optical fiber transmission. Comment: 15 pages, 9 figures, 5 tables
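
    A minimal sketch of fake quantization with the straight-through estimator mentioned above, written as a PyTorch autograd function; the symmetric scaling, the default bit width, and the class name are assumptions, and the companding successive alpha-blending scheme is not shown.

        import torch

        class STEQuantize(torch.autograd.Function):
            # Symmetric uniform fake-quantization with a straight-through gradient.
            @staticmethod
            def forward(ctx, x, num_bits=5):
                qmax = 2 ** (num_bits - 1) - 1
                scale = x.abs().max().clamp(min=1e-12) / qmax
                return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

            @staticmethod
            def backward(ctx, grad_output):
                return grad_output, None   # gradients pass straight through round/clamp

    In a quantization-aware training loop this would be applied inside the forward pass, e.g. w_q = STEQuantize.apply(w, 5), so the float weights still receive useful gradients despite the rounding.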

    Bit Efficient Quantization for Deep Neural Networks

    Quantization for deep neural networks has afforded models for edge devices that use less on-board memory and enable efficient low-power inference. In this paper, we present a comparison of model-parameter-driven quantization approaches that can achieve as low as 3-bit precision without affecting accuracy. The post-training quantization approaches are data-free, and the resulting weight values are closely tied to the dataset distribution on which the model has converged to optimality. We show quantization results for a number of state-of-the-art deep neural networks (DNNs) using large datasets like ImageNet. To better analyze quantization results, we describe the overall range and local sparsity of values afforded through various quantization schemes. We show methods to lower the bit precision beyond typical quantization limits with object class clustering. Comment: EMC2 - NeurIPS workshop 2019, #latenta
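
    One common way to push weights toward ~3-bit precision without data is a small 1-D k-means that clusters a layer's weights into 2**num_bits codebook entries; this is an illustrative sketch, not necessarily the exact clustering scheme used in the paper.

        import numpy as np

        def cluster_quantize(w, num_bits=3, iters=20):
            flat = w.reshape(-1)
            # Initialize the codebook uniformly over the weight range.
            centroids = np.linspace(flat.min(), flat.max(), 2 ** num_bits)
            for _ in range(iters):
                codes = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
                for k in range(len(centroids)):
                    if np.any(codes == k):
                        centroids[k] = flat[codes == k].mean()
            # Return per-weight codebook indices plus the shared codebook.
            return codes.reshape(w.shape).astype(np.uint8), centroids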

    Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence

    Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight into how to improve the efficiency of machine learning systems.