Accelerating Neural Network Inference by Overflow Aware Quantization
Hongwei Xie , Shuo Zhang , Huanghao Ding , Yafei Song , Baitao Shao ,
Conggang Hu , Ling Cai and Mingyang Li
Alibaba Group
{hongwei.xhw, zs157140, huanghao.dhh, huaizhang.syf, baitao.sbt
conggang.hcg, cailing.cl, mingyangli}@alibaba-inc.com
Abstract
The inherent heavy computation of deep neural
networks prevents their widespread applications.
A widely used method for accelerating model in-
ference is quantization, by replacing the input
operands of a network using fixed-point values.
The majority of the computation cost then lies in
integer matrix multiply-accumulation. In practice, a
high-bit accumulator leads to partially wasted
computation, while a low-bit one typically suffers from
numerical overflow. To address this problem, we
propose an overflow aware quantization method by
designing trainable adaptive fixed-point represen-
tation, to optimize the number of bits for each in-
put tensor while prohibiting numeric overflow dur-
ing the computation. With the proposed method,
we are able to fully utilize the computing power
to minimize the quantization loss and obtain opti-
mized inference performance. To verify the effec-
tiveness of our method, we conduct image classi-
fication, object detection, and semantic segmenta-
tion tasks on ImageNet, Pascal VOC, and COCO
datasets, respectively. Experimental results demon-
strate that the proposed method can achieve com-
parable performance with state-of-the-art quantiza-
tion methods while accelerating the inference pro-
cess by about 2 times.
1 Introduction
To date, as a powerful machine learning architecture, the deep neural network (DNN) has been applied to numerous applications, e.g., image classification [He et al., 2016], object detection [Ren et al., 2015; Liu et al., 2016], and semantic segmentation [Chen et al., 2018]. However, to achieve
high-end performance on complicated problems, most DNN
systems require heavy computational resources for model in-
ference, which inevitably limits DNN’s deployment on low-
cost processors. Such processors are extensively used in bil-
lions of commercial products, such as mobile phones, drones,
and Internet-of-Things (IoT) devices, which makes DNN ac-
celeration a critical problem in both academia and industry.
Figure 1: A representative example showing that replacing a 32-bit accumulator with a 16-bit one doubles the number of multiply-add operations computed at the same time. (a) For a 32-bit accumulator, one instruction can compute 4 multiply-adds with a 128-bit register (operands A and B: 4×8-bit each; accumulators C: 4×32-bit). (b) For a 16-bit accumulator, one instruction can compute 8 multiply-adds at the same time (operands A and B: 8×8-bit each; accumulators C: 8×16-bit).
To allow DNN acceleration, researchers have developed
various approaches, which can be roughly divided into two
groups. The first group of methods focus on designing or
searching for more compact and efficient network structures,
which allow for reduced number of parameters and compu-
tations while achieving comparable performance. Represen-
tative methods in this category include MobileNet [Howard
et al., 2017], EfficientNet [Tan and Le, 2019], Proxyless-
NAS [Cai et al., 2019], and so on. The second group aims at
improving the efficiency of arithmetic computation, i.e., the
multiply-accumulate (MAC) operation, as it dominates most
computations during the DNN model inference. One widely
used method is to approximate the original floating-point cal-
culation using fixed-point operation to achieve computation
acceleration. This type of method is well-known as quanti-
zation [Jacob et al., 2018]. Representative visualization of
quantization is shown in Figure 1(a), where 8-bit fixed-point
integers are used to approximate floating-point values and 32-bit fixed-point variables are used to hold MAC results. Moreover, in addition to speeding up the MAC operations, quantization also achieves better parallel computing based on the capability of modern CPUs.
arXiv:2005.13297v1 [cs.CV] 27 May 2020
By comparing Figure 1(b) against Figure 1(a), it can be seen that if 16-bit fixed-point variables are used to hold the MAC results, the degree of parallelism is doubled and the number of I/O operations is halved. However, when a 16-bit holder is used, numerical overflow of the MAC results becomes a frequent problem that must be explicitly considered. A straightforward solution is to use low-bit quantization (<8-bit) for all operands, which however leads to a loss of quantization precision and significantly reduces performance. In addition, low-bit quantization still takes up more physical bits (e.g., 4-bit quantization still requires 8-bit physical operands) in most modern CPUs, leaving the computational resources partially wasted.
To summarize, existing quantization methods utilize fixed
number of bits to represent float values, while both high-
bit and low-bit representation have limitations. The former
suffers from numerical overflow problems and the latter one
leads to model precision degradation. To tackle this problem,
we propose a novel method to adaptively determine the quan-
tization precision in DNNs, by optimizing the number of bits
for operands while prohibiting overflow on the low-bit MAC
result holders. To achieve this, we introduce a trainable quan-
tization range mapping factor α into each layer of a DNN
network, which automatically scales the quantized result to
prevent the undesirable overflow. In addition, we propose
a quantization-overflow aware training framework for learn-
ing the quantization parameters, to minimize the performance
loss caused by post quantization [Krishnamoorthi, 2018]. To
verify the effectiveness of our method, we conducted tests on several state-of-the-art lightweight DNNs for a variety of tasks on different benchmark datasets. Specifically, our experiments include image classification, object detection, and semantic segmentation, which are tested on ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that, compared with state-of-the-art quantization methods, the proposed method achieves comparable performance while speeding up inference by about 2 times.
The main contributions of this paper are listed as follows:
1. We propose an overflow aware quantization (OAQ) algo-
rithm for accelerating DNNs, that is able to adaptively
maximize the number of bits for operands while pro-
hibiting the numeric overflow.
2. To ensure optimized performance, we design a quanti-
zation overflow aware training framework (QOAT), to
automatically learn the parameters used by the proposed
OAQ algorithm.
3. We conduct extensive experiments on three public
datasets using state-of-the-art light-weight DNNs. The
results verify the effectiveness of our method on a vari-
ety of tasks including image classification, object detec-
tion, and semantic segmentation.
2 Related Work
In this work, we focus on quantization methods that accelerate inference on off-the-shelf hardware platforms. While non-uniform quantization methods [Stock et al., 2019; Gao et al., 2019] are also shown to be effective, they do not allow efficient implementation on modern CPUs.
A representative early method is binary quantization
[Rastegari et al., 2016], which quantizes both weights and
activations to one bit, by using bit-shift and bit-count instead
of multiply-adds operators to speed up. This method achieves
acceptable performance on common over-parameterized net-
works, like AlexNet [Krizhevsky et al., 2012], but leads
to substantial performance degradation on light-weight net-
works, e.g. ResNet-18 [He et al., 2016] and MobileNet
[Howard et al., 2017]. As of today, one of the most widely used quantization methods is 8-bit quantization [Jacob et al., 2018; Krishnamoorthi, 2018; Jain et al., 2019], which is extensively applied in different applications on various hardware platforms. 8-bit quantization converts the inference process into integer-only operations, which can result in 2× to 3× faster inference on mobile CPUs. However, when tackling resource-demanding large networks or deploying on resource-constrained platforms, additional DNN acceleration is still required.
To allow further DNN acceleration, low-bit quantization
techniques are under active exploration, in which a key prob-
lem is to balance the inference speed and model performance.
[Choi et al., 2018] proposes PACT to train not only the
weights but also the clipping parameters for clipped ReLU
using gradient descent. [Louizos et al., 2018] presents RQ to
optimize the quantization part with gradient descent. How-
ever, both methods suffer from performance degradation on
lightweight networks. In fact, when quantizing both weights
and activations to 4 bits, PACT [Choi et al., 2018] leads to
accuracy reduction from 70.9% to 62.44%. Also, quantizing
MobileNet using RQ [Louizos et al., 2018] to 6-bit achieves
68% accuracy only. Additionally, existing low-bit techniques focus only on the classification task. Evaluation results on other popular tasks, e.g., object detection or semantic segmentation, are limited in the literature. Accelerating inference by exploiting processor properties is another research direction.
[Zeng et al., 2019] proposes to decrease computational load
of multiplications, by exploiting the parallel computing ca-
pability of modern CPUs. [Gong et al., 2019] implements
2-bit fast integer arithmetic with ARM NEON technology
and achieves 1.7× speed up over NCNN [Tecent.Inc, 2017].
However, the reported results still suffer from significant per-
formance loss, e.g., 4-bit quantized MobileNet-v2 leads to
performance drop from 71.8% to 64.8%.
Moreover, there are a number of mixed-precision quanti-
zation methods [Wang et al., 2019; Wu et al., 2018; Dong et
al., 2019], which focus on searching for an optimal bit-width
setup that can achieve high-level acceleration on customized
hardware platforms while avoiding performance drop. Com-
pared to those methods, the proposed OAQ framework fo-
cuses on off-the-shelf devices, and addresses the numerical
overflow problem by designing trainable parameters in each
layer of a network.
3 Quantization Prerequisites
In this section, we first present the general algorithmic frame-
work of quantization methods. Subsequently, we analyze
the inherent overflow problem of the general framework and
provide mathematical conditions that allow overflow aware
quantization (OAQ). Detailed steps on performing OAQ ap-
proach are discussed in the next section.
3.1 Standard Quantization Method
To convert a floating-point real number r ∈ R to a fixed-
point quantized number q ∈ Z, the following affine mapping
function is typically used [Jacob et al., 2018]:
r = S(q − Z), (1)
where S is the scale factor and Z is the zero-point parameter.
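As a minimal sketch of the affine mapping (1), the Python snippet below quantizes a real value and maps it back. The tensor range, the rounding and clamping steps, the way the zero point is derived, and the helper names `quantize` and `dequantize` are assumptions made for illustration; they are not taken from the paper.

```python
def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value r to an integer q so that r is close to scale * (q - zero_point)."""
    q = round(r / scale) + zero_point
    return max(qmin, min(qmax, q))  # clamp to the signed 8-bit range (assumption)


def dequantize(q, scale, zero_point):
    """Equation (1): r = S (q - Z)."""
    return scale * (q - zero_point)


# Hypothetical tensor range [-1.0, 3.0] mapped onto signed 8-bit integers.
r_min, r_max = -1.0, 3.0
scale = (r_max - r_min) / (2**8 - 1)
zero_point = round(-128 - r_min / scale)  # chosen so that r_min maps to -128

q = quantize(0.5, scale, zero_point)
r_back = dequantize(q, scale, zero_point)  # differs from 0.5 by at most scale / 2
```

The round trip is lossy: the recovered value differs from the input by at most half a quantization step as long as the input lies inside the clamped range.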
By denoting $q_a^{(i,k)}$ the element at the $i$th row and $k$th column of matrix $A$, the matrix multiplication $C = A \times B$ in the quantized domain can be computed element-wise as:
$$q_c^{(i,k)} = Z_c + P \sum_{j=1}^{N} \left( (q_a^{(i,j)} - Z_a)(q_b^{(j,k)} - Z_b) \right), \tag{2}$$
where the multiplier $P$ is defined as
$$P = \frac{S_a S_b}{S_c}, \tag{3}$$
which can be implemented by fixed-point multiplication and
efficient bit-shift [Jacob et al., 2018]. By expanding terms
in (2), we are able to obtain:
$$q_c^{(i,k)} = Z_c + P \left( N Z_a Z_b - Z_a M_b^{(k)} - Z_b M_a^{(i)} + \sum_{j=1}^{N} q_a^{(i,j)} q_b^{(j,k)} \right), \tag{4}$$
where
$$M_b^{(k)} = \sum_{j=1}^{N} q_b^{(j,k)}, \quad M_a^{(i)} = \sum_{j=1}^{N} q_a^{(i,j)}. \tag{5}$$
In (4), the majority of computation costs are the core integer
matrix multiplication accumulation:
$$\sum_{j=1}^{N} q_a^{(i,j)} q_b^{(j,k)}. \tag{6}$$
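The expansion from (2) to (4) can be checked numerically. The sketch below uses random integers for one row of A and one column of B and compares the two forms of the sum; the zero points, the vector length, and the random seed are arbitrary values chosen for illustration.

```python
import random

random.seed(0)
N = 16
Za, Zb = 3, -5  # hypothetical zero points
qa = [random.randint(-128, 127) for _ in range(N)]  # row i of A
qb = [random.randint(-128, 127) for _ in range(N)]  # column k of B

# The sum in equation (2), before multiplying by P and adding Z_c.
direct = sum((a - Za) * (b - Zb) for a, b in zip(qa, qb))

# The terms of equation (4): the sums from (5) and the core sum (6).
M_a = sum(qa)
M_b = sum(qb)
core = sum(a * b for a, b in zip(qa, qb))
expanded = N * Za * Zb - Za * M_b - Zb * M_a + core
```

Both forms agree for any integer inputs. Only `core` requires N multiplications per output element, which is why it dominates the cost.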
To compute (6), the standard method is to accumulate products of 8-bit values (signed or unsigned) with a 32-bit integer accumulator (also see Figure 1(a)):
$$C_{\in \mathbb{Z}_{32}} \mathrel{+}= A_{\in \mathbb{Z}_{8}} \times B_{\in \mathbb{Z}_{8}}, \tag{7}$$
where $\mathbb{Z}_y$ represents the set of numbers representable with $y$ bits. As a result, a 32-bit register is required for caching the intermediate result. A NEON instruction can compute multiple multiply-adds at the same time, but it is limited by the register size and the number of multiplying and summing units on board. The limited register resource is usually the main bottleneck. Moreover, we note that under most ARM architectures, there is no instruction to implement (7). To this end, existing popular mobile inference engines (e.g., TFLite and NCNN) typically rely on
$$C_{\in \mathbb{Z}_{32}} \mathrel{+}= A_{\in \mathbb{Z}_{16}} \times B_{\in \mathbb{Z}_{16}} \tag{8}$$
as a replacement, i.e., VMLAL.S16 on ARM architecture.
Additionally, we note that moving data between registers and memory is a heavy operation. By applying (8), more register space is used to implement a bigger micro-kernel (see footnote 1). This largely reduces the frequency of transferring intermediate results between registers and memory. To show more details, we evaluated the performance of convolutions on the MTK8167s CPU with different implementations. Using VMLAL.S8 and a 4x8 micro-kernel, the computation efficiency is improved by 36%, while with a 4x16 micro-kernel we achieved a 73% speed-up.
3.2 Optimize Quantization Operations
To optimize the efficiency of the current quantization scheme, we seek to use a 16-bit integer as the accumulator:
$$C_{\in \mathbb{Z}_{16}} \mathrel{+}= A_{\in \mathbb{Z}_{8}} \times B_{\in \mathbb{Z}_{8}}. \tag{9}$$
Compared to (7), one NEON instruction here (VMLAL.S8 on ARM architecture) is able to compute twice as many multiply-add operations, as shown in Figure 1(a) and Figure 1(b).
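The trade-off between (7) and (9) can be illustrated in a few lines of Python. The sketch counts accumulator lanes in a 128-bit register and simulates a 16-bit accumulator. It assumes two's-complement wraparound on overflow, and the helper names `wrap_int16` and `dot_int16` are hypothetical.

```python
def wrap_int16(x):
    """Reduce x to the signed 16-bit range (two's-complement wraparound, an assumption)."""
    return (x + 2**15) % 2**16 - 2**15


def dot_int16(a, b):
    """Accumulate products of 8-bit values in a simulated 16-bit register."""
    acc = 0
    for x, y in zip(a, b):
        acc = wrap_int16(acc + x * y)
    return acc


REGISTER_BITS = 128
lanes_32 = REGISTER_BITS // 32  # accumulators per register with a 32-bit holder
lanes_16 = REGISTER_BITS // 16  # accumulators per register with a 16-bit holder

a = [127, 127, 127]
b = [127, 127, 127]
exact = sum(x * y for x, y in zip(a, b))  # 48387, outside the signed 16-bit range
wrapped = dot_int16(a, b)                 # -17149 after wraparound
```

Three products of the largest 8-bit values are already enough to overflow a 16-bit accumulator, which is the problem the rest of this section addresses.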
However, by directly using (9), numerical overflow becomes an unavoidable problem. This is one of the core problems we seek to resolve in this work. To make this possible, we first re-expand (2) by taking $q_a$ as the quantized inputs and $q_b$ as the quantized weights:
$$q_c^{(i,k)} = Z_c + P \sum_{j=1}^{N} \left( q_a^{(i,j)} (q_b^{(j,k)} - Z_b) - Z_a q_b^{(j,k)} + Z_a Z_b \right), \tag{10}$$
which can be rewritten as
$$q_c^{(i,k)} = Z_c + P \left( \sum_{j=1}^{N} \left( q_a^{(i,j)} \hat{q}_b^{(j,k)} \right) + B \right), \tag{11}$$
where
$$B = -\sum_{j=1}^{N} Z_a q_b^{(j,k)} + N Z_a Z_b, \tag{12}$$
$$\hat{q}_b^{(j,k)} = q_b^{(j,k)} - Z_b. \tag{13}$$
In the above equations, $Z_a$, $Z_b$, and $q_b$ are all constant once training is complete, and thus $B$ and $\hat{q}_b^{(j,k)}$ can be computed in advance to improve inference efficiency.
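The split between offline and online work in (11) to (13) can be sketched as follows. The zero points, vector length, and random values are hypothetical; the check at the end compares the result with the sum in (2).

```python
import random

random.seed(1)
N = 32
Za, Zb = 4, -2  # hypothetical zero points
qa = [random.randint(-128, 127) for _ in range(N)]  # quantized inputs (row i)
qb = [random.randint(-128, 127) for _ in range(N)]  # quantized weights (column k)

# Offline, once training is complete: equations (12) and (13).
B = -Za * sum(qb) + N * Za * Zb
qb_hat = [b - Zb for b in qb]

# At inference time: only the core sum in (11) depends on the inputs.
online = sum(a * bh for a, bh in zip(qa, qb_hat)) + B

# Reference: the sum in equation (2).
reference = sum((a - Za) * (b - Zb) for a, b in zip(qa, qb))
```

Note that `qb_hat` can leave the 8-bit range when the weight zero point is far from zero, which is why the second condition in (14) is not automatically satisfied.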
Based on the above equations, we point out that, to allow efficient computation under (9), the following three conditions must be satisfied:
$$\begin{cases} q_a^{(i,j)} \in \mathbb{Z}_8 \\ \hat{q}_b^{(j,k)} \in \mathbb{Z}_8 \\ \sum_{j=1}^{N} q_a^{(i,j)} \hat{q}_b^{(j,k)} \in \mathbb{Z}_{16} \end{cases} \tag{14}$$
It is important to note that the last condition in (14) should hold for both the final value and all intermediate values. We also point out that the second and last conditions in (14) are not always naturally true. Without special consideration, numerical overflow will happen frequently. To guarantee (14) in a DNN system, additional algorithms need to be designed and implemented.
Footnote 1: https://engineering.fb.com/ml-applications/qnnpack/
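A direct way to read the three conditions in (14) is as a check on one output element, including every partial sum. The function below is an illustrative sketch for signed 8-bit operands; its name and the choice of signed ranges are assumptions.

```python
INT8_MIN, INT8_MAX = -2**7, 2**7 - 1
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def overflow_free(qa_row, qb_hat_col):
    """Return True if the conditions in (14) hold for one output element."""
    if any(not INT8_MIN <= a <= INT8_MAX for a in qa_row):
        return False  # first condition violated
    if any(not INT8_MIN <= b <= INT8_MAX for b in qb_hat_col):
        return False  # second condition violated
    acc = 0
    for a, b in zip(qa_row, qb_hat_col):
        acc += a * b
        if not INT16_MIN <= acc <= INT16_MAX:
            return False  # an intermediate or final sum leaves the 16-bit range
    return True
```

For example, `overflow_free([127, 127, 127, -127, -127], [127] * 5)` is `False` even though the final sum fits in 16 bits, because the third partial sum does not.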
4 Overflow Aware Quantization Framework
This section describes the details of our overflow aware quan-
tization algorithm, including both representation and training,
to ensure the important three conditions in (14).
4.1 Adaptive Integer Representation
One straightforward method to reduce accumulation overflow
is to narrow the range of each quantized value. For example,
by using 4-bit quantization instead of 8-bit quantization, real
values are mapped to [−8, 7] instead of [−128, 127]. As a
result, numerical overflow becomes significantly less likely
to happen, at a cost of wasting a large number of bits and
reducing accuracy. To this end, we propose an adaptive float-
bit-width method to fully utilize the representation capability
without arithmetic overflow.
Specifically, we use a float quantization range mapping factor α to adjust the affine relationship between the real range and the quantized range, as shown in Figure 2. Scaled by α, the original 8-bit quantization range [−128, 127] (8 bits being the largest bit-width that can be used) is mapped to [⌊−128/α⌋, ⌊127/α⌋]. By enlarging α, we are able to narrow down the quantized value range until the arithmetic overflows are eliminated.
To present this mathematically, the affine function (1) can be written as
$$q = \frac{r}{S} + Z. \tag{15}$$
By applying the scale factor α on $S$,
$$S' = \alpha \cdot S = \alpha \cdot \frac{r_{max} - r_{min}}{2^b - 1}, \tag{16}$$
the quantized value is narrowed to
$$q' = \frac{r}{S'} + Z, \tag{17}$$
where $r_{min}$ and $r_{max}$ are the minimum and maximum limits of the real value, and $b$ is the number of bits for the quantized value.
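The effect of α in (16) and (17) can be seen with a small numerical sketch. It assumes rounding to the nearest integer, a zero point of 0, and a hypothetical tensor range chosen so that α = 1 spans [−128, 127]. The helper `effective_bits` is one way to express the narrowed range as a fractional bit-width and is an interpretation, not a definition from the paper.

```python
import math


def quantized_range(alpha, r_min, r_max, bits=8, zero_point=0):
    """Integer range reached by [r_min, r_max] under the scaled step S' of (16)."""
    S = (r_max - r_min) / (2**bits - 1)
    S_scaled = alpha * S                          # equation (16)
    q_lo = round(r_min / S_scaled) + zero_point   # equation (17), rounded (assumption)
    q_hi = round(r_max / S_scaled) + zero_point
    return q_lo, q_hi


def effective_bits(alpha, bits=8):
    """Bit-width equivalent of dividing the 2**bits levels by alpha (interpretation)."""
    return math.log2(2**bits / alpha)


# Hypothetical tensor range.
r_min, r_max = -1.28, 1.27
ranges = {alpha: quantized_range(alpha, r_min, r_max) for alpha in (1.0, 2.0, 4.0)}
```

With these inputs, α = 1 keeps the full 8-bit range, while α = 2 and α = 4 shrink it to roughly the span of 7-bit and 6-bit integers, respectively.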
We name this the float-bit-width method since it differs from the traditional integer low-bit representation, whose quantization ranges have to be chosen from a limited set of bit-widths, e.g., 3-bit for [−4, 3] or 4-bit for [−8, 7]. By using the proposed method, it is feasible to utilize a different quantization range mapping factor α in each layer, to maximize the representation capability while prohibiting overflow. Additionally, since α is continuous, it can also be easily integrated into the training process of DNNs. Details on adaptively learning α for weights and activations of each layer will be discussed in the next subsection.
In addition to the range of the quantized values, the value distribution is also critical. We expect the quantized values to be centered and gathered around zero. As a result, the accumulated number will be more likely to stay away from overflow. Initializing weights with a normal-distributed-like initializer and applying L1-L2-Normalization are representative methods that can be used.
Figure 2: We introduce a float factor α to adjust the affine relationship between the real range and quantized range, e.g., enlarging α to narrow down the quantized value range. (Diagram labels: real range of tensor W from r_min to r_max; 8-bit quantization from −128 to 127; scaled range from α·r_min to α·r_max with −128/α and 127/α marked; "No real data will fall in this gray region when inference".)
Figure 3: We insert Quant nodes into each layer of the computation graph to calculate the amount of arithmetic overflow. (Diagram nodes: FakeQuant, Conv, +Bias, ReLU, MaxPool, Weights, min/max, α, EMA, Quant, Conv-INT16, N_o; data types Int8 and Float.)
4.2 Learning Quantization Range Mapping Factor
To find proper α for each layer to simultaneously prohibit
arithmetic overflow and retain the model’s performance, we
propose a quantization-overflow aware training framework.
As shown in Figure 3, inspired by simulating quantization
effects in forwarding pass [Jacob et al., 2018], we add the
overflow aware module that simulates neural operations, e.g.,
Convolution, FullyConnect using 16-bit accumulator for cap-
turing arithmetic overflow in the forward pass.
In the forward pass, 8-bit quantization is simulated in floating-point arithmetic, which is called FakeQuantization ($Q_{fake}$) [Jacob et al., 2018]. To adaptively learn α, we need to capture the amount of arithmetic overflow in the quantized inference process. Therefore, we insert a Quantization node ($Q_{real}$) into each layer of the computation graph in addition to $Q_{fake}$. $Q_{real}$ requires $r_{min}$ and $r_{max}$ to produce 8-bit quantized values that are the same as those in the inference engine. $r_{min}$ and $r_{max}$ are scaled by α before being passed into $Q_{real}$ and $Q_{fake}$. For activations, $r_{min}$ and $r_{max}$ are aggregated by exponential moving averages (EMA) with the smoothing parameter close to 1, such that the observed ranges are smoothed, allowing the network to enter a more stable state. The created convolution operation with a 16-bit accumulator, Conv-INT16, takes the 8-bit quantized values as input to simulate integer-only inference on the inference engine. This calculates the amount of arithmetic overflow $N_o$ that accumulates in the inference process, including all types of overflow defined in the overflow-free conditions (14). An easy way to capture overflow signals in a popular training framework, e.g., TensorFlow or PyTorch, is to compare the results of the regular 32-bit convolution and the 16-bit convolution. A more efficient implementation can be written in a lower-level language such as C++. As the process of obtaining $N_o$ is an integer-only computation, gradient descent is not a suitable method for back-propagation. To address this problem, we use a simple rule to update α instead.
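The comparison between a 32-bit and a 16-bit result can be sketched on a plain matrix product, used here as a stand-in for a convolution. The sketch assumes two's-complement wraparound in the 16-bit accumulator; the function names are hypothetical and the code is not the authors' Conv-INT16 implementation.

```python
def wrap_int16(x):
    """Reduce x to the signed 16-bit range (wraparound is an assumption)."""
    return (x + 2**15) % 2**16 - 2**15


def count_overflow(A, B_hat):
    """Count output elements whose 16-bit result differs from the exact result."""
    n_overflow = 0
    columns = list(zip(*B_hat))  # B_hat is given as a list of rows
    for row in A:
        for col in columns:
            exact = sum(a * b for a, b in zip(row, col))  # wide accumulator
            acc16 = 0
            for a, b in zip(row, col):
                acc16 = wrap_int16(acc16 + a * b)         # 16-bit accumulator
            if acc16 != exact:
                n_overflow += 1
    return n_overflow


A = [[127, 127, 127], [1, 2, 3]]
B_hat = [[127, 1], [127, 1], [127, 1]]
N_o = count_overflow(A, B_hat)  # only the first output element overflows
```

Under wraparound arithmetic this comparison flags an element only when its final sum leaves the 16-bit range; intermediate overflows that wrap back are not detected this way and need a per-step check.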
Specifically, when $N_o$ is bigger than zero, we increase α by:
$$\alpha \mathrel{+}= \min(lr_i \cdot \log(N_o),\ l_c), \tag{18}$$
where $l_c$ is the fixed maximum learning rate. Additionally, $lr_i$ is the dynamic learning rate for increasing α, which decays with the training steps. After a large number of iterations, $lr_i$ has decayed to a very small value, which stabilizes the training state.
Alternatively, if $N_o$ is zero, we decrease α by:
$$\alpha \mathrel{-}= lr_d, \tag{19}$$
where $lr_d$ is the learning rate for decreasing α, whose properties and physical meaning are both similar to those of $lr_i$. To improve training efficiency, calculating $N_o$ and updating α are executed every $M$ steps, e.g., 10 or 50.
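The update rule in (18) and (19) fits in a few lines. The sketch below assumes the natural logarithm, since the base is not stated, and leaves the decay schedule of the learning rates to the caller; the function and parameter names are hypothetical.

```python
import math


def update_alpha(alpha, n_overflow, lr_inc, lr_dec, lr_cap):
    """Apply one update of the range mapping factor following (18) and (19)."""
    if n_overflow > 0:
        # Equation (18): grow alpha with the log of the overflow count, capped by lr_cap.
        return alpha + min(lr_inc * math.log(n_overflow), lr_cap)
    # Equation (19): no overflow observed, so shrink alpha to regain precision.
    return alpha - lr_dec


# Hypothetical values: overflow seen in the first check, none in the second.
alpha = 1.0
alpha = update_alpha(alpha, n_overflow=100, lr_inc=0.1, lr_dec=0.01, lr_cap=0.5)
alpha = update_alpha(alpha, n_overflow=0, lr_inc=0.1, lr_dec=0.01, lr_cap=0.5)
```

With a natural logarithm, a single overflow event gives log(1) = 0 and leaves α unchanged in this sketch. Whether that matches the intended behavior depends on details the text does not specify.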
The strategy for inserting $Q_{real}$ is slightly different from that for $Q_{fake}$. As shown in Figure 3, fake quantization for inputs or outputs is inserted after an activation function or a bypass connection, e.g., an add or a concatenation, as these change the real min-max ranges. For operations such as max-pooling, up-sampling, or padding, $Q_{fake}$ is not required. However, a 16-bit neural operation, e.g., Conv-INT16, only accepts quantized integer inputs, so the outputs of $Q_{real}$ have to be passed directly into it. This means $Q_{real}$ must be inserted ahead of Conv-INT16 in the computation graph. Finally, $Q_{real}$ and $Q_{fake}$ share the same $r_{min}$, $r_{max}$, and α, to strictly simulate the same inference process.
5 Overflow Study
In this section, we first present an in-depth analysis of the proposed adaptive scale parameter framework and explain why we focus on comparing OAQ against state-of-the-art 6-bit quantization methods. Subsequently, we study the overflow sensitivity of different neural network models and show how the overflow ratio affects the overall performance.
5.1 Per-Layer Adaptive Scale
Since our key algorithmic contribution is a method to adaptively learn the quantization range mapping factor α for each layer of a DNN, it is also important to demonstrate the distribution of α across networks in our experiments. Figure 4 provides two representative results of per-layer α values, adaptively trained by using the proposed quantization-overflow aware training framework. From this figure, we can observe that the computed scale factor varies across layers, emphasizing our ‘scale-per-layer’ adaptive design. Additionally, we note that the majority of activation factors α fall into the range of [2, 4], indicating that the quantized values are mostly between 6 bits and 7 bits (α = 2 halves the 8-bit range, which corresponds to 7 bits, and α = 4 quarters it, which corresponds to 6 bits).
Figure 4: Two representative results of per-layer factor values, implying that the quantized values are mostly between 6 bits and 7 bits. (a) The activation α of each layer in MobileNet-v1 trained on ImageNet (x-axis: layer index of MobileNet-v1, 1 to 27; y-axis: value of α). (b) The activation α of each layer in MobileNet-v1-SSD trained on Pascal VOC (x-axis: layer index of MobileNet-v1-SSD, 1 to 35; y-axis: value of α).
Additionally, we estimate the overflow ratio of low-bit
quantization with a 16-bit accumulator by random multiply-
adds simulation. Specifically, we uniformly sampled N val-
ues from the quantization value range and tried to capture an
overflow signal. The signal is computed by applying consec-
utive multiply-adds operators of the N numbers on a 16-bit
holder. Since 3x3 depth-wise convolution and 1x1 point-wise
convolution are widely used in light-weight DNN architec-
tures, we chose N to be {9, 64, 256, 1024} representatively.
Subsequently, the approximate Non-Overflow ratio was calculated from 100,000 independent simulation runs. As shown in Figure 5, 6-bit quantization remains overflow-free in most cases, while 7-bit and 8-bit methods are risky.
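A simulation in the spirit of the one described above can be sketched as follows. The sketch assumes signed operands drawn uniformly from the full b-bit range for both factors and counts a run as overflowing as soon as any partial sum leaves the signed 16-bit range; the exact sampling protocol of the paper is not specified, so the resulting numbers need not match Figure 5.

```python
import random

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1


def non_overflow_ratio(bits, n_ops, trials, rng):
    """Fraction of runs in which n_ops multiply-adds never leave the 16-bit range."""
    lo, hi = -2**(bits - 1), 2**(bits - 1) - 1
    ok = 0
    for _ in range(trials):
        acc = 0
        for _ in range(n_ops):
            acc += rng.randint(lo, hi) * rng.randint(lo, hi)
            if not INT16_MIN <= acc <= INT16_MAX:
                break  # overflow on an intermediate or final sum
        else:
            ok += 1    # the inner loop finished without overflow
    return ok / trials


rng = random.Random(0)
for bits in (8, 7, 6, 5):
    # Fewer trials than the 100,000 runs in the text, to keep the sketch fast.
    print(bits, non_overflow_ratio(bits, n_ops=64, trials=1000, rng=rng))
```

For 5-bit operands and 64 operations the ratio is exactly 1, because the largest possible sum, 64 × 16 × 16 = 16384, still fits in 16 bits.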
Both Figure 4 and Figure 5 indicate that at most 6 bits can be used if we apply low-bit quantization to address the overflow problem. Therefore, we will focus on comparing OAQ against state-of-the-art 6-bit quantization methods.
Figure 5: The Non-Overflow ratio of randomly simulated 9, 64, 256, 1024 operands’ multiply-adds 100,000 times with different quantization bits and a 16-bit accumulator. Bar values (Non-Overflow %):
| Num. multiply-add operations | 8bit | 7bit | 6bit | 5bit |
|---|---|---|---|---|
| 1024 | 0 | 17.3 | 99.6 | 100 |
| 256 | 0 | 77 | 100 | 100 |
| 64 | 21.7 | 99.6 | 100 | 100 |
| 9 | 94.2 | 100 | 100 | 100 |
5.2 Overflow Sensitive Study
One may still worry that the quantization range mapping factors learned from the training data cannot be applied safely to all test data. For example, a small amount of arithmetic overflow might cause the entire model to fail. In fact, we did find some overflow (usually on the order of tens after quantization-overflow aware training) during inference on test data, but the outputs still appeared correct. To analyze the impact of overflow quantitatively, we designed overflow simulation experiments. Specifically, we manually inject overflow into the outputs of some layers to observe their influence on the final result. We tested the overflow sensitivity of MobileNet-v1 on ImageNet and MobileNet-v2-SSD on the COCO Detection Challenge, respectively. The results are shown in Figure 6. In the classification model, injecting 0.05% arithmetic overflow into any single layer has no impact on the overall accuracy. When overflow is injected into all layers, the accuracy slowly drops as the overflow ratio grows. In the detection model, overflow in the last layers seriously affects the final result: 0.01% overflow makes the model unusable. However, the detection model maintains high performance when overflow is injected into other layers.
6 Experiments
To demonstrate the performance of our proposed OAQ frame-
work, we evaluate both inference accuracy/recall and run-
time characteristics on representative public benchmarking
datasets.
6.1 ImageNet
The first experiment benchmarks MobileNet-v2 [Sandler et al., 2018], MobileNet-v1 [Howard et al., 2017], and MobileNet-v1 with depth-multiplier 0.25 on ILSVRC-2012 ImageNet. This dataset consists of 1000 classes, 1.28 million training images, and 50K validation images. We fine-tuned MobileNet models from the pre-trained model zoo (TF-slim, see footnote 2). All of those models were re-trained for 20 epochs. During training, the inputs were randomly cropped and resized to 224x224 before being fed into the network. Since the inputs of the first layer are not suitable for scaling, we skipped learning the quantization range mapping factor of the first layer’s weights. This rule was applied to all the following experiments, including PACT [Choi et al., 2018], which always uses 8-bit weights in the first layers. We report our evaluation results using Top-1 and Top-5 accuracy.
Footnote 2: https://github.com/tensorflow/models/tree/master/research/slim
Figure 6: Injecting arithmetic overflow into different layers of a neural network, e.g., MobileNet-v1 and MobileNet-v2-SSD, at overflow ratios from 0.00% to 0.05%. In the classification model, even 0.05% arithmetic overflow does not damage the overall accuracy, but in the detection model, 0.01% overflow makes the model unusable. (a) Top-1 accuracy of MobileNet-v1 on ImageNet when overflow is injected into some layers (L0: the first layer, L1: the second layer, L26: the penultimate layer, L28: the last layer before Softmax, ALL: overflow injected into all layers). (b) mAP of MobileNet-v2-SSD on COCO when overflow is injected into some layers (L0: the first layer, L1: the second layer, LFs: the 6 feature layers that provide inputs to the last headers, LLs: the last 12 headers that regress bounding boxes and classification scores).
In our tests, we focus on comparing OAQ against state-of-the-art 6-bit quantization methods, including PACT [Choi et al., 2018], RQ [Louizos et al., 2018], and SR+DR [Gysel et al., 2018]. Comparisons against other selected state-of-the-art methods were also conducted. As shown in Table 1, OAQ even outperforms 8-bit Quantization-Aware Training (QAT) [Jacob et al., 2018] in certain cases. From Table 1, we observe that when quantizing MobileNet-v1, OAQ outperforms RQ [Louizos et al., 2018] and SR+DR [Gysel et al., 2018] by large margins. Although PACT [Choi et al., 2018] is better on Top-5, the proposed method consistently performs better in other metrics, especially on MobileNet-v1-0.25, i.e., 1.36% higher than PACT. We also take DSQ [Gong et al., 2019] into comparison, as it also achieved a 1.7× speed-up over NCNN on an ARM Cortex-A53 CPU. The results show that on MobileNet-v2 our results are significantly better, e.g., 6.84% higher.
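| Model | Method | Bits | MAC bits | Top-1 / Top-5 accuracy (%) |
|---|---|---|---|---|
| MobileNet-v2 | FP | 32 | 32 | 71.80 / 91.00 |
| MobileNet-v2 | QAT | 8 | 32 | 70.90 / 90.00 |
| MobileNet-v2 | PACT | 6 | 32 | 71.25 / 90.00 |
| MobileNet-v2 | DSQ | 4 | 32 | 64.90 / - |
| MobileNet-v2 | Ours | adaptive | 16 | 71.64 / 90.10 |
| MobileNet-v1 | FP | 32 | 32 | 70.90 / 89.90 |
| MobileNet-v1 | QAT | 8 | 32 | 70.10 / 88.90 |
| MobileNet-v1 | PACT | 6 | 32 | 70.46 / 89.59 |
| MobileNet-v1 | RQ | 6 | 32 | 68.02 / 88.00 |
| MobileNet-v1 | SR+DR | 6 | 32 | 66.66 / 87.17 |
| MobileNet-v1 | Ours | adaptive | 16 | 70.87 / 89.56 |
| MobileNet-v1-0.25 | FP | 32 | 32 | 49.80 / 74.20 |
| MobileNet-v1-0.25 | QAT | 8 | 32 | 48.00 / 72.80 |
| MobileNet-v1-0.25 | PACT | 6 | 32 | 46.03 / 70.07 |
| MobileNet-v1-0.25 | Ours | adaptive | 16 | 47.38 / 72.14 |
Table 1: Evaluating on ImageNet and comparing against SOTA low-bit quantization methods.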
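| Model | Method | Bits | MAC bits | mAP (%) |
|---|---|---|---|---|
| MobileNet-v1 SSD | FP | 32 | 32 | 73.83 |
| MobileNet-v1 SSD | QAT | 8 | 32 | 72.54 |
| MobileNet-v1 SSD | PACT | 6 | 32 | 70.88 |
| MobileNet-v1 SSD | Ours | adaptive | 16 | 72.53 |
| MobileNet-v2 SSDLite | FP | 32 | 32 | 72.79 |
| MobileNet-v2 SSDLite | QAT | 8 | 32 | 72.02 |
| MobileNet-v2 SSDLite | PACT | 6 | 32 | 70.50 |
| MobileNet-v2 SSDLite | Ours | adaptive | 16 | 71.84 |
Table 2: Evaluating on Pascal VOC Detection Challenge and comparing with 6-bit PACT.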
6.2 Detection on VOC and COCO
To illustrate the applicability of our method to object detection, we applied OAQ on MobileNet-v1-SSD [Howard et al., 2017; Liu et al., 2016] and MobileNet-v2-SSDLite [Sandler et al., 2018] and evaluated on the 2012 Pascal VOC object detection challenge and the 2017 MSCOCO detection challenge. We implemented our experiments with TensorFlow and fine-tuned models from the TensorFlow Object Detection API (see footnote 3; only the backbone, since the SSD header differs between tasks). We first trained them with 32-bit floating-point precision to achieve state-of-the-art performance, and subsequently fine-tuned on VOC for 40,000 steps with batch size 32 and on COCO for 60,000 steps with batch size 48.
The results on VOC and COCO are listed in Table 2 and Table 3, respectively. On both the VOC detection challenge and the COCO detection challenge, the proposed method outperforms 6-bit PACT [Choi et al., 2018] significantly. In addition, our method achieves comparable performance with 8-bit QAT [Jacob et al., 2018] and shows only a small mAP drop from the original model, i.e., about 1% mAP drop on VOC and less than 2% mAP drop on COCO.
Footnote 3: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
| Model | Method | Bits | MAC bits | mAP (%) |
|---|---|---|---|---|
| MobileNet-v1 SSD | FP | 32 | 32 | 23.7 |
| MobileNet-v1 SSD | QAT | 8 | 32 | 23.0 |
| MobileNet-v1 SSD | PACT | 6 | 32 | 18.4 |
| MobileNet-v1 SSD | Ours | adaptive | 16 | 22.0 |
| MobileNet-v2 SSDLite | FP | 32 | 32 | 22.7 |
| MobileNet-v2 SSDLite | QAT | 8 | 32 | 21.4 |
| MobileNet-v2 SSDLite | PACT | 6 | 32 | 18.1 |
| MobileNet-v2 SSDLite | Ours | adaptive | 16 | 21.8 |
Table 3: Evaluating on COCO Detection Challenge and comparing with 6-bit PACT.
| Model | Method | Bits | MAC bits | mIOU (%) |
|---|---|---|---|---|
| MobileNet-v2 dm0.5 deeplab | FP | 32 | 32 | 71.8 |
| MobileNet-v2 dm0.5 deeplab | QAT | 8 | 32 | 70.4 |
| MobileNet-v2 dm0.5 deeplab | PACT | 6 | 32 | 60.9 |
| MobileNet-v2 dm0.5 deeplab | Ours | adaptive | 16 | 70.0 |
| MobileNet-v2 deeplab | FP | 32 | 32 | 75.3 |
| MobileNet-v2 deeplab | QAT | 8 | 32 | 74.8 |
| MobileNet-v2 deeplab | PACT | 6 | 32 | 70.4 |
| MobileNet-v2 deeplab | Ours | adaptive | 16 | 74.9 |
Table 4: Evaluating on Pascal VOC Segmentation Challenge and comparing with 6-bit PACT.
6.3 Segmentation on VOC
To demonstrate that our method generalizes to semantic
segmentation, we applied OAQ to DeepLab [Chen et al.,
2018] with MobileNet-v2 backbones (depth multipliers 0.5 and
1.0). Performance was evaluated on the Pascal VOC
segmentation challenge, which contains 1464 training images
and 1449 validation images. The results are shown
in Table 4. Quantizing the original model to 6 bits
with PACT [Choi et al., 2018] causes a significant drop
in performance; e.g., the MobileNet-v2-dm0.5 backbone drops
10.9% in mIOU. By comparison, the proposed OAQ method
achieves performance comparable to QAT, dropping only
1.8% and 0.4% mIOU on the MobileNet-v2-dm0.5 and MobileNet-v2
backbones, respectively, relative to the original model.
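The drops quoted in this paragraph are absolute differences in mIOU (percentage points) relative to the floating-point model. As a quick check, the following sketch recomputes them from the Table 4 values; the dictionary layout and the helper name `miou_drop` are illustrative assumptions, not part of the original experiments.

```python
# mIOU values (percent) transcribed from Table 4.
TABLE4_MIOU = {
    "MobileNet-v2-dm0.5": {"FP": 71.8, "QAT": 70.4, "PACT": 60.9, "Ours": 70.0},
    "MobileNet-v2": {"FP": 75.3, "QAT": 74.8, "PACT": 70.4, "Ours": 74.9},
}


def miou_drop(backbone, method):
    """Absolute mIOU drop (percentage points) relative to the FP baseline."""
    row = TABLE4_MIOU[backbone]
    return round(row["FP"] - row[method], 1)


if __name__ == "__main__":
    for backbone, row in TABLE4_MIOU.items():
        for method in ("QAT", "PACT", "Ours"):
            print(f"{backbone:20s} {method:5s} drop = {miou_drop(backbone, method)}")
```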
6.4 Inference Efficiency Benchmark
Finally, we demonstrate the DNN acceleration capability
of the proposed method on different low-cost hardware
platforms. Specifically, we benchmarked computational
efficiency on two selected platforms, the Allwinner V328
and the MTK8167, whose processor architectures are ARM
Cortex-A7 and ARM Cortex-A35, respectively. In this test,
MobileNet-v1 and ResNet-18 were used as representative
DNN models. To run inference, three popular neural network
inference engines for low-cost platforms were selected,
namely TFLite, MNN [Jiang et al., 2020], and NCNN, each
running single-threaded on one core.

CPU             Inference Engine  MobileNet-v1  ResNet18
Allwinner V328  TFLite            550           1370
                MNN               469           1605
                MNN + OAQ         341           1021
                Ours + OAQ        277           895
MTK8167s        TFLite            387           950
                NCNN              351           706
                MNN               311           916
                MNN + OAQ         220           604
                Ours + OAQ        189           585
Table 5: Comparison of inference time (msec) using the MobileNet-v1
and ResNet18 networks.
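The experimental results are listed in Table 5 and show that
the proposed method outperforms the competing engines by
wide margins. On MobileNet-v1, our method runs about 2×
faster than TFLite on both the MTK8167s and the Allwinner
V328, and about 1.85× faster than NCNN on the MTK8167s,
the only platform for which Table 5 reports an NCNN result.
On ResNet18 the gains are smaller, about 1.5× to 1.6× over
TFLite.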
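The speedup factors can be reproduced directly from Table 5 as the ratio of a baseline engine's latency to the latency of "Ours + OAQ". The short sketch below does this; the data structure and the function name `speedup` are assumptions introduced only for illustration.

```python
# Inference times in milliseconds, transcribed from Table 5.
# Layout: platform -> engine -> (MobileNet-v1, ResNet18).
TABLE5_MS = {
    "Allwinner V328": {
        "TFLite": (550, 1370),
        "MNN": (469, 1605),
        "MNN + OAQ": (341, 1021),
        "Ours + OAQ": (277, 895),
    },
    "MTK8167s": {
        "TFLite": (387, 950),
        "NCNN": (351, 706),
        "MNN": (311, 916),
        "MNN + OAQ": (220, 604),
        "Ours + OAQ": (189, 585),
    },
}
MODELS = ("MobileNet-v1", "ResNet18")


def speedup(platform, baseline, model, ours="Ours + OAQ"):
    """Baseline latency divided by our latency (values above 1 mean faster)."""
    idx = MODELS.index(model)
    engines = TABLE5_MS[platform]
    return engines[baseline][idx] / engines[ours][idx]


if __name__ == "__main__":
    for platform, engines in TABLE5_MS.items():
        for baseline in engines:
            if baseline == "Ours + OAQ":
                continue
            ratios = ", ".join(
                f"{m}: {speedup(platform, baseline, m):.2f}x" for m in MODELS
            )
            print(f"{platform:15s} vs {baseline:10s} {ratios}")
```

Note that the ratios depend on the network: they are largest for MobileNet-v1 and smaller for ResNet18.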
7 Conclusion
In this paper, we propose an overflow aware quantization
method that significantly accelerates DNN inference while
minimizing the loss of accuracy. To achieve this, we
adaptively adjust the number of bits used to represent
quantized fixed-point integers. This scheme is incorporated
into a novel training framework that adaptively learns the
overflow-free quantization range while maintaining high-end
performance. With the proposed method, an extremely
lightweight neural network achieves performance comparable
to 8-bit quantization on the ImageNet classification
challenge. Comprehensive experiments also verify that our
method applies to various dense prediction tasks, e.g.,
object detection and semantic segmentation, where it
outperforms 6-bit PACT and performs comparably to 8-bit QAT.
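To make the overflow problem concrete, the following minimal sketch simulates a multiply-accumulate loop with a 16-bit two's-complement accumulator and shows how reducing the operand bit-widths avoids wraparound. It is not the training procedure proposed in this paper: the function names, the convention of unsigned activations with signed weights, and the conservative worst-case bound are all assumptions made purely for illustration.

```python
def wrap_signed(value, bits=16):
    """Wrap an integer into the two's-complement range of a `bits`-wide register."""
    modulus = 1 << bits
    half = 1 << (bits - 1)
    return (value + half) % modulus - half


def dot_with_accumulator(acts, weights, acc_bits=16):
    """Multiply-accumulate with wraparound applied after every addition.

    Returns (result, overflowed); `overflowed` is True if any partial sum
    left the representable range of the accumulator.
    """
    acc = 0
    overflowed = False
    for a, w in zip(acts, weights):
        exact = acc + a * w
        acc = wrap_signed(exact, acc_bits)
        overflowed = overflowed or (acc != exact)
    return acc, overflowed


def worst_case_fits(k, act_bits, weight_bits, acc_bits=16):
    """Conservative (sufficient) check that K products cannot overflow.

    Assumption: activations are unsigned `act_bits` integers and weights are
    signed `weight_bits` integers, so |a * w| <= (2^a - 1) * 2^(w - 1).
    """
    max_act = (1 << act_bits) - 1
    max_abs_weight = 1 << (weight_bits - 1)
    return k * max_act * max_abs_weight <= (1 << (acc_bits - 1)) - 1


if __name__ == "__main__":
    # Four 8-bit x 8-bit products already wrap a 16-bit accumulator.
    print(dot_with_accumulator([255] * 4, [127] * 4))
    # The same accumulation with 6-bit operands stays in range.
    print(dot_with_accumulator([63] * 4, [31] * 4))
    print(worst_case_fits(4, 8, 8), worst_case_fits(16, 6, 6))
```

A fixed worst-case bound like this is pessimistic for real data, which is one motivation for learning the quantization range during training instead of fixing the bit-width in advance.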
References
[Cai et al., 2019] Han Cai, Ligeng Zhu, and Song Han. Prox-
ylessNAS: Direct neural architecture search on target task
and hardware. In ICLR, 2019.
[Chen et al., 2018] Liang-Chieh Chen, Yukun Zhu, George
Papandreou, Florian Schroff, and Hartwig Adam.
Encoder-decoder with atrous separable convolution for se-
mantic image segmentation. In ECCV, 2018.
[Choi et al., 2018] Jungwook Choi, Zhuo Wang, Swagath
Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan,
and Kailash Gopalakrishnan. PACT: Parameterized
clipping activation for quantized neural networks. arXiv
preprint arXiv:1805.06085, 2018.
[Dong et al., 2019] Zhen Dong, Zhewei Yao, Amir Gholami,
Michael Mahoney, and Kurt Keutzer. HAWQ: Hessian
aware quantization of neural networks with mixed-precision.
arXiv preprint arXiv:1905.03696, 2019.
[Gao et al., 2019] Lianli Gao, Xiaosu Zhu, Jingkuan Song,
Zhou Zhao, and Heng Tao Shen. Beyond product quanti-
zation: Deep progressive quantization for image retrieval.
In IJCAI. AAAI Press, 2019.
[Gong et al., 2019] Ruihao Gong, Xianglong Liu, Shenghu
Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu,
and Junjie Yan. Differentiable soft quantization: Bridging
full-precision and low-bit neural networks. In ICCV, 2019.
[Gysel et al., 2018] Philipp Gysel, Jon Pimentel, Moham-
mad Motamedi, and Soheil Ghiasi. Ristretto: A framework
for empirical study of resource-efficient inference in con-
volutional neural networks. TNNLS, 29(11):5784–5789,
2018.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing
Ren, and Jian Sun. Deep residual learning for image recog-
nition. In CVPR, 2016.
[Howard et al., 2017] Andrew G Howard, Menglong Zhu,
Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias
Weyand, Marco Andreetto, and Hartwig Adam. MobileNets:
Efficient convolutional neural networks for mobile
vision applications. arXiv preprint arXiv:1704.04861,
2017.
[Jacob et al., 2018] Benoit Jacob, Skirmantas Kligys,
Bo Chen, Menglong Zhu, Matthew Tang, Andrew
Howard, Hartwig Adam, and Dmitry Kalenichenko.
Quantization and training of neural networks for effi-
cient integer-arithmetic-only inference. In CVPR, pages
2704–2713, 2018.
[Jain et al., 2019] Sambhav R Jain, Albert Gural, Michael
Wu, and Chris H Dick. Trained quantization thresholds for
accurate and efficient fixed-point inference of deep neural
networks. arXiv preprint arXiv:1903.08066, 2019.
[Jiang et al., 2020] Xiaotang Jiang, Huan Wang, Yiliu Chen,
Ziqi Wu, Lichuan Wang, Bin Zou, Yafeng Yang,
Zongyang Cui, Yu Cai, Tianhang Yu, Chengfei Lv, and
Zhihua Wu. MNN: A universal and efficient inference
engine. In MLSys, 2020.
[Krishnamoorthi, 2018] Raghuraman Krishnamoorthi.
Quantizing deep convolutional networks for efficient in-
ference: A whitepaper. arXiv preprint arXiv:1806.08342,
2018.
[Krizhevsky et al., 2012] Alex Krizhevsky, Ilya Sutskever,
and Geoffrey E Hinton. ImageNet classification with deep
convolutional neural networks. In NIPS, 2012.
[Liu et al., 2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan,
Christian Szegedy, Scott Reed, Cheng-Yang Fu, and
Alexander C Berg. SSD: Single shot multibox detector. In
ECCV, 2016.
[Louizos et al., 2018] Christos Louizos, Matthias Reisser,
Tijmen Blankevoort, Efstratios Gavves, and Max Welling.
Relaxed quantization for discretized neural networks.
arXiv preprint arXiv:1810.01875, 2018.
[Rastegari et al., 2016] Mohammad Rastegari, Vicente Ordonez,
Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet
classification using binary convolutional neural networks.
In ECCV. Springer, 2016.
[Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick,
and Jian Sun. Faster R-CNN: Towards real-time object
detection with region proposal networks. In NIPS, pages
91–99, 2015.
[Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong
Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
MobileNetV2: Inverted residuals and linear bottlenecks. In
CVPR, 2018.
[Stock et al., 2019] Pierre Stock, Armand Joulin, Rémi Gribonval,
Benjamin Graham, and Hervé Jégou. And the bit
goes down: Revisiting the quantization of neural networks.
arXiv preprint arXiv:1907.05686, 2019.
[Tan and Le, 2019] Mingxing Tan and Quoc V Le. EfficientNet:
Rethinking model scaling for convolutional neural
networks. arXiv preprint arXiv:1905.11946, 2019.
[Tecent.Inc, 2017] Tencent Inc. NCNN.
https://github.com/Tencent/ncnn, 2017.
[Wang et al., 2019] Kuan Wang, Zhijian Liu, Yujun Lin,
Ji Lin, and Song Han. HAQ: Hardware-aware automated
quantization with mixed precision. In CVPR, 2019.
[Wu et al., 2018] Bichen Wu, Yanghan Wang, Peizhao
Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer.
Mixed precision quantization of convnets via differ-
entiable neural architecture search. arXiv preprint
arXiv:1812.00090, 2018.
[Zeng et al., 2019] Linghua Zeng, Zhangcheng Wang, and
Xinmei Tian. KCNN: Kernel-wise quantization to remarkably
decrease multiplications in convolutional neural network.
In IJCAI. AAAI Press, 2019.
