MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network
  Quantization Framework by Chang, Sung-En et al.
Sung-En Chang,1∗ Yanyu Li, 1∗ Mengshu Sun, 1∗ Weiwen Jiang, 2 Runbin Shi, 3 Xue Lin, 1 Yanzhi
Wang 1
1 Northeastern University
2 University of Notre Dame
3 University of Hong Kong
chang.sun@northeastern.edu, li.yanyu@northeastern.edu, sun.meng@northeastern.edu, wjiang2@nd.edu, rbshi@eee.hku.hk,
xue.lin@northeastern.edu, yanz.wang@northeastern.edu
Abstract
With the tremendous success of deep learning, there is an imminent need to
deploy deep learning models onto edge devices. To tackle the limited computing
and storage resources of edge devices, model compression techniques have been
widely used to trim deep neural network (DNN) models for on-device inference.
This paper targets the commonly used FPGA (field-programmable gate array)
devices as the hardware platform for DNN edge computing. We focus on DNN
quantization as the main model compression technique, since quantization is of
great importance for implementing DNN models on hardware platforms. The novelty
of this work is twofold: (i) We propose a mixed-scheme DNN quantization method
that incorporates both linear and non-linear number systems for quantization,
with the aim of boosting the utilization of the heterogeneous computing
resources on an FPGA, i.e., LUTs (look-up tables) and DSPs (digital signal
processors). Note that all existing (single-scheme) quantization methods can
utilize only one type of resource (either LUTs or DSPs) for the MAC
(multiply-accumulate) operations in deep learning computations. (ii) We use a
quantization method that supports multiple precisions along the intra-layer
dimension, whereas existing quantization methods apply multi-precision
quantization along the inter-layer dimension. The intra-layer multi-precision
method can unify the hardware configurations of different layers to reduce
computation overhead while preserving the same model accuracy as the
inter-layer approach.
Our proposed mixed-scheme, multi-precision (MSP) DNN quantization framework
achieves 70.47% Top-1 accuracy with ResNet-18 on the ImageNet dataset. We also
validate the proposed MSP framework on two FPGA devices, i.e., Xilinx XC7Z020
and XC7Z045. We achieve a 3.53× speedup in end-to-end inference time on
ImageNet compared with the fixed-point quantization method.
1 Introduction
Deep neural networks (DNNs) have been employed in various tasks with
outstanding performance, such as convolutional neural networks (CNNs) for
computer vision (LeCun, Bengio, and Hinton 2015) and recurrent neural networks
(RNNs) for natural language processing (NLP) (Bishop 2006; Goodfellow, Bengio,
and Courville 2016). However, due to the large model size and extremely
intensive computation, it is still challenging to deploy these DNN models on
edge devices.
∗Equal contribution.
To support broad applications of deep learning, such as autonomous vehicles,
wireless access points, robotic vision and control, and smart health devices,
there are two aspects to enabling DNNs under these resource-constrained
circumstances. The first is to utilize specialized hardware platforms for DNN
inference. Extensive research efforts have been dedicated to various kinds of
edge-computing platforms such as ASICs (Application Specific Integrated
Circuits) (Mao et al. 2018; Hegde et al. 2018; Han et al. 2016a), FPGAs
(Sharma et al. 2016; Li et al. 2018; Zhang et al. 2015; Shi et al. 2020a), and
embedded CPUs/GPUs (Leng et al. 2019; Niu et al. 2020; Han et al. 2016b).
The other is DNN model compression, which explores the potential of algorithm
and hardware co-design and finds better trade-offs between task performance
(accuracy, etc.) and hardware efficiency (latency, power consumption, etc.).
There are in general two techniques for model compression: DNN pruning (Lym
et al. 2019; Shi et al. 2020b; Han et al. 2015; Liu et al. 2019) and
quantization (Courbariaux, Bengio, and David 2015). These techniques allow
models to work with significantly smaller sizes and fewer operations, so they
have become a must-do step for deployment on edge devices. Here we focus only
on the quantization approach, which is imperative for hardware acceleration,
especially on FPGA and ASIC platforms. By representing weights with fewer
bits, weight quantization directly reduces model size and accelerates
inference.
In this paper, we propose the novel Mixed-Scheme, Multi-Precision (MSP)
Quantization Framework specifically for FPGA devices, which achieves
unprecedented hardware efficiency without sacrificing accuracy at low bit
width (e.g., 4-bit). Our contributions are as follows:
• We develop a hardware-friendly sum-of-power-of-two (SPoT) quantization
scheme, which mitigates the accuracy degradation of the vanilla power-of-two
(PoT) scheme while still taking advantage of bit shifting operations to
accelerate computation.
arXiv:2009.07460v1 [cs.LG] 16 Sep 2020
• We propose a novel Mixed Scheme Quantization (MS) to fully utilize the
heterogeneous resources of FPGAs. More specifically, based on the DSP and LUT
modules available in a given FPGA, we determine the corresponding ratio of
the SPoT and fixed-point quantization schemes, so that both kinds of modules
can be fully exploited.
• Rather than using a different precision in the first and last layers, or
employing inter-layer multi-precision for an extreme compression rate, we
propose Intra-Layer Flexibility, which can be applied to all layers in a DNN
model, achieving lossless accuracy without damaging hardware efficiency.
2 Related Work and Motivation
In this section, we summarize quantization schemes and
compare them in terms of their hardware deployment impli-
cations and accuracy degradation. Further, we also briefly
discuss recent works on neural network quantization. Fi-
nally, we describe the motivation of our FPGA-specific MSP
quantization.
2.1 Quantization Schemes
Based on whether the distances between the quantization levels are equal or
not, there are linear and non-linear quantization schemes. In terms of bit
width, we can also classify neural network quantization into the categories of
single-precision and multi-precision.
Linear Number System Linear quantization schemes
contain binary, ternary, and fixed-point number systems. Bi-
nary or ternary uses extremely low-bit weight representation
for DNNs, which can achieve very high inference computing
efficiency by eliminating multiplications, but sacrifice ac-
curacy. Representative binary quantization methods include
Binaryconnect (Courbariaux, Bengio, and David 2015), Bi-
narized Neural Network (BNN) (Courbariaux et al. 2016),
XNOR-net (Rastegari et al. 2016), and ABC-Net (Lin,
Zhao, and Pan 2017). Ternary quantization schemes are im-
plemented in TWN (Li, Zhang, and Liu 2016), TTQ (Zhu
et al. 2017), and (He and Fan 2019).
On the other hand, fixed-point quantization uses more bits
of weight representation to preserve accuracy. For exam-
ple, compared with the floating-point (e.g. 32 bits), fixed-
point quantization can use 4-bit to represent the weights with
negligible accuracy loss. Fixed-point quantization scheme
has been implemented with different methods/algorithms.
DoReFa-Net (Zhou et al. 2016) first explored it by intro-
ducing hyperbolic tangent transformation to weights and ac-
tivations, with scaling factors to minimize quantization error.
PACT (Choi et al. 2018) improved this method by adding a
parameterized clipping threshold to activations. DSQ (Gong
et al. 2019b) developed an evolving training method to gradually approximate
STE. QIL (Jung et al. 2019) parameterized the quantization interval and
trained it with task loss, avoiding access to the original training data.
µL2Q (Cheng
et al. 2019) introduced data distribution loss during training
to minimize quantization error. LSQ (Esser et al. 2019) pro-
posed a differentiable method to learn the quantizer for each
layer jointly with parameters.
Fixed-point quantization still needs multiplication operations, which execute
on the DSPs of an FPGA. However, DSP resources are limited, e.g., ranging from
240 to 1,540 DSP slices in the Xilinx Kintex-7 series, which becomes the
bottleneck when employing only fixed-point quantization.
Non-Linear Number System Miyashita et al. (Miyashita, Lee, and Murmann 2016)
first replaced the fixed-point quantizer with a logarithmic representation to
exploit bit shift operations and accelerate inference. We refer to this kind
of non-linear number system as power-of-two (PoT) quantization. In this way,
the multiplication of an input (a fixed-point number) and a weight (a PoT
number) can be replaced by a bit shift operation, executed as shown below,
$$
2^b \times a =
\begin{cases}
a \ll b, & b > 0 \\
a, & b = 0 \\
a \gg |b|, & b < 0
\end{cases}
\tag{1}
$$
where $2^b$ is the quantized weight and $a$ is the input value.
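As an illustration of Eq. (1), the following minimal Python sketch multiplies an integer activation by a power-of-two weight using shifts only. The function name `pot_multiply` and the use of plain Python integers in place of fixed-point hardware registers are assumptions made for this example; they are not part of the original design.

```python
def pot_multiply(a: int, b: int) -> int:
    """Multiply the integer activation a by the PoT weight 2**b using shifts only."""
    if b > 0:
        return a << b          # left shift multiplies by 2**b
    if b == 0:
        return a               # weight is exactly 1
    return a >> (-b)           # right shift by |b| divides by 2**|b| (floored)


if __name__ == "__main__":
    print(pot_multiply(12, 2))   # 12 * 4
    print(pot_multiply(12, -2))  # 12 / 4
```

Note that the right shift floors the result, so low-order bits of the activation are discarded, as they would be in a fixed-point datapath without extra fractional bits.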
Following the PoT scheme, INQ (Zhou et al. 2017) split weights into groups and
iteratively quantized the model to low bit width. Leng et al. (Leng et al.
2018) employed the ADMM training technique to increase the accuracy of
extremely low bit-width DNNs. Li et al. (Li, Dong, and Wang 2020) introduced a
reparameterization of the clipping function to get better-defined gradients,
and employed weight normalization to stabilize training.
Even though PoT can reduce computation by replacing multiplication with bit
shift operations, the distance between the quantization levels grows
exponentially, causing PoT to suffer from significant accuracy degradation.
Li et al. (Li, Dong, and Wang 2020) proposed additive power-of-two (APoT) to
reduce the accuracy degradation of power-of-two. However, APoT uses more bit
shifts and additions as the bit width of the weight quantization increases.
For example, APoT uses 3 bit shifts and 2 additions for a 6-bit weight
representation; it only pursues less accuracy degradation and does not
consider the hardware availability ratio. Besides, APoT alternately assigns
the power-of-two values to each part (e.g., in 5-bit quantization, APoT
assigns $\{0, 2^{-1}, 2^{-3}, 2^{-5}\}$ to the first part and
$\{0, 1, 2^{-2}, 2^{-4}\}$ to the second part), which makes the quantized DNN
models hard to deploy in hardware implementations.
Bit Width Selection The majority of previous works (Zhou et al. 2016; Choi
et al. 2018; Zhang et al. 2018) assign the same bit width to all layers, which
we refer to as single-precision quantization. The other track (Dong et al.
2019; Shen et al. 2020) optimizes the bit width for each individual layer so
that the maximum compression rate can be achieved with minimum degradation of
accuracy. We refer to this methodology as multi-precision quantization. These
methods rely on Hessian information to determine the bit width for each layer.
The general idea is to assign more bits to layers that are sensitive to
quantization error. In fact, very few works are strictly single-precision,
because the first and last layers have a much larger effect on accuracy.
Therefore, current research follows (Han et al. 2015) in quantizing the first
and last layers to a higher bit width (e.g., 8 bits) or even leaving them
unquantized. Though the first and last layers make up only a small percentage
of the computation cost, if they are unquantized, specific configurations are
needed on the FPGA to execute this different bit-width GEMM (general matrix
multiplication), which degrades hardware utilization.
2.2 Motivation
FPGA performance is constrained by the amount of available DSP resources,
which are required for fixed-point computations. In general, inference with a
fixed-point quantized DNN model is much slower than with PoT quantization. The
PoT scheme can replace multiplications with bit-shift and addition operations
performed by the LUT resources on an FPGA, but it does not fit the true weight
distribution because of its high resolution around zero and low resolution at
the tails, as shown in Figure 1. Specifically, 4-bit PoT quantization suffers
from at least 1-2% accuracy degradation. This motivates us to adopt an
improved version of PoT, i.e., sum-of-power-of-two (SPoT), to mitigate the
accuracy degradation. However, relying only on SPoT quantization does not
exploit the DSP resources on an FPGA. To achieve optimal utilization of the
on-chip computing resources (both DSPs and LUTs), we propose a combination of
SPoT quantization and fixed-point quantization, called mixed-scheme (MS)
quantization, which applies the two quantization methods to different filters
within a DNN layer. MS adds another dimension for choosing the appropriate
scheme according to the weight distribution, which is beneficial to accuracy.
We also observe that existing works either do not quantize the first and last
layers or use no less than 8 bits for their fixed-point weight representation
(Courbariaux, Bengio, and David 2015; He and Fan 2019; Ren et al. 2019), and
most multi-precision works (Dong et al. 2019; Shen et al. 2020) also employ
such inter-layer flexibility. Deploying these models on an FPGA for inference
requires different configurations for different layers. We therefore explore a
novel multi-precision (MP) quantization that quantizes the first and the last
layer while preserving accuracy. Based on the optimal ratio given by the
hardware (FPGA) resources, we propose our mixed-scheme, multi-precision (MSP)
quantization, which enjoys two benefits: (1) better utilization of the FPGA
resources, which comes from the mixed scheme, and (2) zero accuracy
degradation, which is due to multi-precision.
3 Quantization Formulation
In this section, we first formulate typical fixed point and
power-of-two quantization schemes. Then we improve the
vanilla PoT with a summation method, namely sum-of-
power-of-two (SPoT) to better fit weight distribution so as
to preserve accuracy with minor computation overhead in-
troduced. Lastly we give the algorithm to perform quantiza-
tion aware training.
3.1 Fixed Point Quantization
As mentioned in Section 2, the most intuitive way to per-
form quantization is to map the full precision parameters to
low bit width uniform representations. The weight represen-
tation can be defined as follows:
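$$
Q^{FP}(m, \alpha) = \pm\alpha \times \left\{0, \frac{1}{2^{m-1}-1}, \frac{2}{2^{m-1}-1}, \ldots, 1\right\},
\tag{2}
$$
where $Q^{FP}$ refers to the quantized numbers, $m$ is the bit width, and
$\alpha$ is a scaling factor. The mapping function from a 32-bit
floating-point weight $w$ into the quantized weight $\hat{w}$ by $m$-bit
fixed-point representation is as follows:
$$
\hat{w} = \prod_{Q^{FP}(m,\alpha)}(w)
= \alpha \cdot h^{-1}\left(\frac{1}{2^{m}-1}\,\mathrm{round}\left((2^{m}-1) \cdot h(\lceil w, \alpha \rfloor)\right)\right),
\tag{3}
$$
where $\prod_{Q^{FP}(m,\alpha)}(\cdot)$ denotes the quantizer function that
projects onto $Q^{FP}(m, \alpha)$; the function $h(\cdot)$ transforms a value
within $[-1, +1]$ into the range $[0, 1]$, for example we can use
$h(\cdot) = \tanh(\cdot)/2 + 0.5$; and $\lceil w, \alpha \rfloor$ clips $w$
according to
$$
\lceil w, \alpha \rfloor =
\begin{cases}
-1, & w < -\alpha \\
w/\alpha, & -\alpha \le w \le \alpha \\
1, & w > \alpha
\end{cases}
\tag{4}
$$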
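The level set of Eq. (2) and the clipping of Eq. (4) can be sketched in a few lines of Python. This is a simplified illustration, not the exact mapping of Eq. (3): it rounds the clipped weight directly onto the symmetric grid of Eq. (2) and omits the transformation $h(\cdot)$. The function names are hypothetical.

```python
def fixed_point_levels(m, alpha):
    """All quantization levels of Eq. (2): +/- alpha * k / (2**(m-1) - 1)."""
    n = 2 ** (m - 1) - 1
    magnitudes = [alpha * k / n for k in range(n + 1)]
    return sorted({-x for x in magnitudes} | set(magnitudes))


def clip(w, alpha):
    """Clipping function of Eq. (4): scale by alpha and saturate to [-1, 1]."""
    return max(-1.0, min(1.0, w / alpha))


def quantize_fixed(w, m, alpha):
    """Simplified projection onto Eq. (2) (assumption: h(.) is omitted)."""
    n = 2 ** (m - 1) - 1
    return alpha * round(clip(w, alpha) * n) / n


if __name__ == "__main__":
    print(len(fixed_point_levels(4, 1.0)))  # number of distinct 4-bit levels
    print(quantize_fixed(0.3, 4, 1.0))      # nearest level to 0.3
    print(quantize_fixed(2.0, 4, 1.0))      # saturates at alpha
```

With $m = 4$ the grid has 7 positive levels, 7 negative levels, and zero, so the step between adjacent levels is uniformly $\alpha/7$.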
3.2 Power of Two (PoT) Quantization
In order to replace multiplications with bit shifting opera-
tions, we need to make weights in the form of powers-of-
two. The quantized weight values by PoT scheme with an
m-bit weight representation are as follows:
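$$
Q^{P2}(m, \alpha) = \pm\alpha \times \left\{0, \frac{1}{2^{2^{m-1}-2}}, \frac{1}{2^{2^{m-1}-3}}, \ldots, 1\right\}.
\tag{5}
$$
The mapping from continuous parameters to the PoT number system is defined by
$$
\hat{w} = \prod_{Q^{P2}(m,\alpha)}(w) =
\begin{cases}
\alpha \cdot h^{-1}\left(2^{\mathrm{round}(\log_2 h')}\right), & h' > 2^{-2^{m}+1} \\
0, & h' \le 2^{-2^{m}+1}
\end{cases},
\qquad h' = h(\lceil w, \alpha \rfloor),
\tag{6}
$$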
From this mapping we can infer another disadvantage of pure PoT quantization:
increasing the bit width merely increases the resolution around the mean, and
has no effect at the tails, as displayed in Figure 1.
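The tail-resolution issue can be checked numerically from the level set in Eq. (5). The sketch below, with the hypothetical helper `pot_levels`, enumerates the non-negative PoT magnitudes for several bit widths and reports the largest gap between adjacent levels.

```python
def pot_levels(m, alpha=1.0):
    """Non-negative PoT magnitudes of Eq. (5): 0 and alpha * 2**-k, k = 0 .. 2**(m-1) - 2."""
    max_exp = 2 ** (m - 1) - 2
    return [0.0] + [alpha * 2.0 ** -k for k in range(max_exp, -1, -1)]


def largest_gap(levels):
    """Largest distance between two adjacent (sorted) quantization levels."""
    return max(hi - lo for lo, hi in zip(levels, levels[1:]))


if __name__ == "__main__":
    for m in (3, 4, 5):
        levels = pot_levels(m)
        print(m, len(levels), largest_gap(levels))
```

For every bit width the largest gap stays at $\alpha/2$ (between $\alpha/2$ and $\alpha$); the additional levels gained from more bits all fall near zero.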
3.3 Sum of Power of Two (SPoT) Quantization
Sum-of-power-of-two (SPoT) can be considered as an im-
proved version of power-of-two (PoT) quantization, which
can also replace the multiplications with bit-shift operations.
The quantization levels are defined as follows:
$$
Q^{SPoT}(m, \alpha) = Q^{P2}_{1}(m_1, \alpha) + Q^{P2}_{2}(m_2, \alpha),
\tag{7}
$$
where $m_1$ has a larger range and $m_2$ has a smaller range ($m_1 \ge m_2$).
All quantization levels are then combinations of lower bit-width PoT levels.
In Figure 1 we take 6-bit SPoT quantization as an example. First, SPoT needs
1 bit for the sign; then we split the remaining 5 bits into the smaller range
$m_1$ (3 bits) and the larger range $m_2$ (2 bits). In Figure 1, the number
0.625 can be represented by $2^{-3} + 2^{-1}$, where $2^{-3}$ is encoded as
"011" in the smaller range $m_1$ and $2^{-1}$ is encoded as "10" in the larger
range $m_2$. Therefore, when multiplying an input value by the weight
"101101", we shift the input right by 3 bits and by 1 bit, respectively, and
sum the two results.
Figure 1: Illustration of the PoT and the proposed SPoT quantization schemes.
It also shows the corresponding quantization levels along with the weight
distribution of an actual layer in MobileNet-v2.
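The two-shift multiplication in the 0.625 example can be sketched as follows. The function `spot_multiply` is a hypothetical helper that takes the two shift amounts directly; decoding the weight bit fields into shift amounts is not modeled here.

```python
def spot_multiply(a: int, shift1: int, shift2: int, negative: bool = False) -> int:
    """Multiply a by (2**-shift1 + 2**-shift2) using two right shifts and one add."""
    result = (a >> shift1) + (a >> shift2)
    return -result if negative else result


if __name__ == "__main__":
    # 0.625 = 2**-3 + 2**-1, so 0.625 * 64 = (64 >> 3) + (64 >> 1)
    print(spot_multiply(64, 3, 1))
```

Compared with a PoT weight, the SPoT weight costs one extra shift and one addition, both of which map to LUT logic rather than DSP multipliers.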
Figure 1 also shows a notable shortcoming of PoT. PoT
has very high precision around the mean, but the tail ends
present very low precision. It causes a mismatch with the
weight value distribution. Thus, PoT suffers from non-
negligible accuracy degradation. On the other hand, the pro-
posed SPoT has relatively even quantization intervals, which
is close to that of fixed-point quantization levels. Therefore,
the SPoT can achieve similar accuracy performance as the
fixed-point quantization scheme.
3.4 Quantization-Aware Training Algorithm
Quantization is a projection from continuous values onto a discrete number
system, which makes the gradients flowing from the loss function zero almost
everywhere during backpropagation. There are two approaches to address this
issue. One is employing a Straight-Through Estimator (STE) (Bengio, Léonard,
and Courville 2013; Yin et al. 2018) to set the gradient to the constant
value 1 as
$$
\text{Forward}: \; y = \mathrm{round}(x), \qquad
\text{Backward}: \; \frac{\partial y}{\partial x} = 1_{x \in \mathbb{R}}.
\tag{8}
$$
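As a toy illustration of Eq. (8), the sketch below applies the STE by hand to a single scalar: the forward pass rounds, and the backward pass passes the upstream gradient through unchanged. The quadratic objective, the learning rate, and the function names are assumptions chosen only to make the example self-contained.

```python
def ste_forward(x):
    """Forward pass of Eq. (8): y = round(x)."""
    return float(round(x))


def ste_backward(grad_output):
    """Backward pass of Eq. (8): treat dy/dx as 1, so the gradient passes through."""
    return grad_output


# Toy problem (assumption): minimize (round(x) - target)**2 over the latent value x.
x, target, lr = 0.2, 3.0, 0.1
for _ in range(100):
    y = ste_forward(x)
    grad_y = 2.0 * (y - target)        # d loss / d y
    x -= lr * ste_backward(grad_y)     # d loss / d x under the STE

if __name__ == "__main__":
    print(ste_forward(x))
```

Without the STE, the true derivative of `round` is zero almost everywhere and `x` would never move; with it, the latent value drifts until its rounded value matches the target.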
The other is the Alternating Direction Method of Multipliers (ADMM) (Leng
et al. 2018), which iteratively solves for the parameters with a target
quantization scheme as the optimization constraint. These two methods are
equivalent in terms of convergence, but the ADMM algorithm shows more
flexibility, as gradients cannot be defined appropriately by STE in some
specific tasks. In this paper, we use ADMM for weight quantization and STE for
activation quantization. We do not include detailed derivations of ADMM due to
space limits; please refer to the previous work mentioned above.

Algorithm 1: FPGA-Specific Mixed-Scheme Multi-Precision Quantization (MSP)
input : 32-bit floating-point DNN model M, with weights W to be quantized.
target: Quantized model M̂
// Initialize auxiliary variables:
U^0 = 0; Z^0 = W;
Partition rate from FPGA resource: PR_SPoT;
S_f = Fixed-point; S_p = SPoT; S_8 = 8 bit;
foreach Epoch do
    Sort variance v^(l)_{1:R} of weight matrix W^(l) and
    obtain the threshold θ^(l) by PR_SPoT;
    if v^(l)_r < θ^(l) then
        S ← S_p/S_8 based on quantization error;
    else
        S ← S_f;
    // Update Z, U:
    Z^t ← proj_S(W + U^{t−1});
    U^t ← W − Z^t + U^{t−1};
    foreach Batch do
        input ← proj_S(input);
        loss ← M(input);
        loss ← loss + Σ ½‖W − Z^t + U^t‖²;
        Backpropagate loss and update W;
Return M̂ ← M{proj_S(W)}.
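The per-epoch Z and U updates and the penalty term of Algorithm 1 can be sketched on a flat list of weights as below. This is a minimal illustration under stated assumptions: the projection is a nearest-level search over an explicit level list, the weights are plain Python floats, and the task loss and the gradient step on W are left out. All function names are hypothetical.

```python
def project(values, levels):
    """proj_S: map each value to its nearest quantization level."""
    return [min(levels, key=lambda q: abs(q - v)) for v in values]


def admm_epoch_update(W, U_prev, levels):
    """Z^t <- proj_S(W + U^{t-1});  U^t <- W - Z^t + U^{t-1}."""
    Z = project([w + u for w, u in zip(W, U_prev)], levels)
    U = [w - z + u for w, z, u in zip(W, Z, U_prev)]
    return Z, U


def admm_penalty(W, Z, U):
    """Regularization term added to the task loss: 1/2 * ||W - Z^t + U^t||^2."""
    return 0.5 * sum((w - z + u) ** 2 for w, z, u in zip(W, Z, U))


if __name__ == "__main__":
    levels = [-1.0, -0.5, 0.0, 0.5, 1.0]   # toy level set (assumption)
    W, U0 = [0.3, -0.8], [0.0, 0.0]
    Z, U = admm_epoch_update(W, U0, levels)
    print(Z, U, admm_penalty(W, Z, U))
```

During the batch loop, the gradient of this penalty pulls W toward the quantized target Z while the task loss preserves accuracy; the final model uses `project(W, levels)`.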
4 MSP Framework
4.1 Mixed-Scheme (MS) Quantization
Relying on SPoT quantization only or fixed-point quantization only will not
achieve optimal performance on FPGA devices. Thus, we propose mixed-scheme
(MS) quantization. In each layer of a DNN, we split the parameters into two
parts: one uses SPoT, while the other is quantized under the fixed-point
scheme. At the algorithm level, the weight matrix of each layer is obtained by
transforming the weight tensor into a 2D GEMM matrix. The distribution of
weights in different rows is rather random. For rows that have smaller
variances (more Gaussian-like weight distributions), the SPoT scheme is a
better fit, while for rows with larger variances (more uniform-like
distributions), the fixed-point scheme avoids high quantization error.
In our work, the ratio of SPoT to fixed-point is determined by the available
resources on the FPGA device rather than by accuracy. Usually, the utilization
of DSPs needs to be maintained at 100% to take full advantage of the DSP
resource. Alongside the DSPs, we assign an appropriate workload to the LUTs so
that both finish simultaneously, thereby enhancing throughput.
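The variance-based row assignment described above can be sketched as follows. The helper `assign_schemes` is hypothetical; it assumes the weight matrix is given as a list of rows and that the SPoT fraction (derived from the FPGA resources) is supplied by the caller.

```python
from statistics import pvariance


def assign_schemes(weight_rows, spot_fraction):
    """Give the lowest-variance rows to SPoT and the remaining rows to fixed-point."""
    variances = [pvariance(row) for row in weight_rows]
    order = sorted(range(len(weight_rows)), key=lambda r: variances[r])
    n_spot = round(spot_fraction * len(weight_rows))
    schemes = ["fixed"] * len(weight_rows)
    for r in order[:n_spot]:
        schemes[r] = "spot"
    return schemes


if __name__ == "__main__":
    rows = [
        [0.10, -0.10, 0.00, 0.05],   # small variance
        [0.90, -0.80, 0.70, -0.90],  # large variance
        [0.20, -0.20, 0.10, 0.00],   # small variance
    ]
    print(assign_schemes(rows, 2 / 3))
```

Sorting by variance and cutting at a fixed fraction is equivalent to the per-layer threshold θ used in Algorithm 1, since the threshold is itself obtained from the partition rate.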
4.2 Multi-Precision (MP) Quantization
We further propose a novel multi-precision (MP) quantization scheme to achieve
lossless performance compared with the original full-precision DNN models, as
illustrated in Figure 2. In image classification tasks, the first and last
layers are extremely sensitive. Thus, most existing works either do not
quantize the first and last layers or use no less than an 8-bit fixed-point
weight representation for them to maintain accuracy (Courbariaux, Bengio, and
David 2015; He and Fan 2019; Ren et al. 2019). Recent work (Zhu et al. 2019;
Gong et al. 2019a) has investigated FPGA-based inference engines that support
different quantization bits adaptively. However, such an online
reconfiguration ability inside each PE incurs non-negligible hardware
overhead. Besides, such inter-layer flexibility in quantization bits brings
only minor accuracy improvement, even if a sophisticated search method for
per-layer quantization bits is employed (Wang et al. 2019; Lou et al. 2019).
To overcome this challenge, we propose multi-precision (MP) quantization. In
each layer, we reserve 5% of the weights for the 8-bit weight representation.
Based on the average distance between each weight and the nearest 4-bit
quantization level in each row of the weight matrix (i.e., the average
quantization error), we assign the rows with the highest 5% average
quantization error to 8 bits, while still using the 4-bit weight
representation for the rest. The overall DNN accuracy can be maintained as
long as 5% of the weights in each layer are quantized using 8 bits, even if
the rest of the weights (in all layers) are quantized using very few bits.
This is because, with intra-layer flexibility, the weights quantized using
8 bits can be trained to mitigate the imprecision caused by the weights
quantized using fewer bits. This mitigation happens in every layer. On the
other hand, with the prior inter-layer flexibility, the majority of layers are
quantized with fewer bits. The resulting imprecision cannot be mitigated
within a layer and accumulates across layers. It is difficult to recover the
accumulated imprecision with a limited number of layers quantized with more
bits.
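The row selection rule for the 8-bit portion can be sketched as below. The helper names are hypothetical, the 4-bit level set is passed in explicitly, and keeping at least one 8-bit row in very small matrices is an assumption made for this example.

```python
def mean_quant_error(row, levels):
    """Average distance between each weight in a row and its nearest level."""
    return sum(min(abs(w - q) for q in levels) for w in row) / len(row)


def select_8bit_rows(weight_rows, levels_4bit, fraction=0.05):
    """Indices of the rows with the highest average 4-bit quantization error."""
    errors = [mean_quant_error(row, levels_4bit) for row in weight_rows]
    n_8bit = max(1, round(fraction * len(weight_rows)))  # assumption: at least one row
    ranked = sorted(range(len(weight_rows)), key=lambda r: errors[r], reverse=True)
    return sorted(ranked[:n_8bit])


if __name__ == "__main__":
    levels = [k / 7 for k in range(-7, 8)]                # toy 4-bit fixed-point grid
    rows = [[k / 7 for k in (1, 2, 3)] for _ in range(20)]  # rows already on the grid
    rows[13] = [0.07, 0.21, 0.36]                         # one row that sits off the grid
    print(select_8bit_rows(rows, levels))
```

Because the same fraction is used in every layer, the split between 4-bit and 8-bit processing elements can be fixed once at FPGA configuration time.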
Besides its algorithm-level advantages, the proposed intra-layer flexibility
also exhibits an advantage at the FPGA hardware level. Recall that the same
quantization scheme (e.g., 4-bit for 95% of the weights and 8-bit for the
remaining 5%) is applied to all layers of a DNN. At FPGA configuration time
for a specific DNN inference task, one can allocate a portion of the PEs to
the low-bit portion of the computation and the rest of the PEs to the 8-bit
portion, and this allocation works for every layer. As for the traditional
inter-layer multi-precision scheme, it is almost impossible to perform online
reconfiguration; that is, the PEs assigned to execute the 8-bit first/last
layers are idle while the middle layers are processed.
Figure 2: Illustration of a DNN integrating the proposed intra-layer
flexibility with the FPGA-specific quantization scheme and comparison with
other state-of-the-art quantization works. $k$ refers to the bit width of each
layer. Generally, the middle layers share the same bit width (e.g.,
$k_2 = k_3 = \ldots = k_{n-1}$), but for multi-precision works, different bit
widths are employed in the middle layers (e.g., $k_i \neq k_j$).
4.3 Mixed-Scheme, Multi-Precision (MSP)
Quantization Framework
To fully utilize all on-chip resources, we propose the Mixed-Scheme,
Multi-Precision (MSP) quantization framework as the combination of MS and MP.
Based on the optimal ratio of SPoT/fixed-point given by the hardware (FPGA)
resources (e.g., 2:1 for XC7Z045), we further set the ratio of
SPoT/fixed-point/8-bit to 65:30:5, which can (1) better utilize the resources
of the FPGA, (2) achieve the highest throughput (GOPS) and lowest latency, and
(3) achieve lossless accuracy compared with the original full-precision
model. The detailed implementation is shown in Algorithm 1.
5 Evaluation
5.1 Experiment Setup
We evaluate MSP on image classification tasks with convolutional neural
networks (CNNs). We use no extra data augmentations other than those already
employed for training the 32-bit floating-point baseline models. Our
quantization training algorithm uses step or cosine learning rate decay and
ℓ2 regularization, following the training algorithms of the baseline models.
Our experiments are run on server-grade machines running Ubuntu 18.04, CUDA
10.2, and PyTorch 1.5. Our models are trained on NVIDIA TITAN RTX GPUs and
GeForce RTX 2080Ti GPUs. We evaluate with the deep residual network ResNet-18
(He et al. 2016), which generalizes well to a variety of tasks, as well as the
lightweight MobileNet-v2 model (Sandler et al. 2018). We test on the CIFAR-10,
CIFAR-100 (Krizhevsky 2009), and ImageNet ILSVRC-2012 (Krizhevsky, Sutskever,
and Hinton 2012) datasets. DNN models for the CIFAR-10 and CIFAR-100 datasets
are trained from scratch and quantized for 150 epochs. For the ImageNet
dataset, pre-trained 32-bit floating-point models are used and quantized for
90 epochs. The initial learning rates are 8e-3 for CIFAR-10, 4e-3 for
CIFAR-100, and 1e-2 for ImageNet.
Quantization  Bit width       ResNet-18 Accuracy (%)          MobileNet-v2 Accuracy (%)
Scheme        (Wght./Actv.)   Top1 (N/Y)      Top5 (N/Y)      Top1 (N/Y)      Top5 (N/Y)
CIFAR-10
Baseline      32/32           93.62           -               92.51           -
PoT           4/4             92.97 / 92.14   -               91.34 / 90.92   -
Fixed         4/4             93.43 / 92.97   -               92.34 / 91.76   -
SPoT          4/4             93.47 / 92.94   -               92.72 / 91.83   -
MS            4/4             93.53 / 92.98   -               92.57 / 91.99   -
MP            4/4             - / 93.63       -               - / 92.54       -
MSP           4/4             - / 93.72       -               - / 92.58       -
CIFAR-100
Baseline      32/32           74.49           92.70           71.48           91.98
PoT           4/4             73.88 / 72.97   92.14 / 91.65   68.68 / 67.11   90.06 / 89.21
Fixed         4/4             74.37 / 73.88   92.31 / 91.72   71.16 / 70.22   91.63 / 90.88
SPoT          4/4             74.33 / 73.97   92.49 / 92.03   71.13 / 70.21   91.69 / 90.85
MS            4/4             74.58 / 74.03   92.51 / 92.05   71.21 / 70.25   91.74 / 90.92
MP            4/4             - / 74.54       - / 92.61       - / 71.49       - / 91.82
MSP           4/4             - / 74.61       - / 92.69       - / 71.51       - / 91.97
ImageNet Wght./Actv. 4/32
Baseline      32/32           69.76           89.08           71.88           90.29
PoT           4/4             68.20 / 67.11   87.14 / 85.93   69.93 / 67.88   88.63 / 86.83
Fixed         4/4             69.72 / 68.66   88.67 / 87.54   71.26 / 69.23   90.18 / 88.03
SPoT          4/4             69.74 / 68.48   88.71 / 87.92   71.32 / 69.76   90.17 / 88.42
MS            4/4             70.11 / 69.22   89.41 / 88.33   71.26 / 69.31   90.04 / 88.11
MP            4/4             - / 69.99       - / 89.12       - / 71.68       - / 90.22
MSP           4/4             - / 70.47       - / 89.52       - / 71.73       - / 90.27
Table 1: Results from different quantization schemes for the ResNet-18 and
MobileNet-v2 DNN models on the CIFAR-10, CIFAR-100, and ImageNet datasets.
Y/N: With/Without quantization of the first and the last layer.
Methods                       Top-1 (%)  Top-5 (%)  Quant. 1st layer  Quant. last layer
Baseline                      69.76      89.08      -                 -
Dorefa (Zhou et al. 2016)     68.10      88.10      ×                 ×
PACT (Choi et al. 2018)       69.20      89.00      ×                 ×
DSQ (Gong et al. 2019b)       69.56      N/A        ×                 ×
QIL (Jung et al. 2019)        70.10      N/A        ×                 ×
µL2Q (Cheng et al. 2019)      65.92      86.72      -                 16bit
LQ-NETS (Zhang et al. 2018)   69.30      88.80      ×                 ×
MSP (ours)                    70.47      89.52      ✓                 ✓
Table 2: Comparisons with existing works with the ResNet-18 model on the
ImageNet dataset for 4-bit quantization.
To demonstrate the hardware efficiency of the proposed MSP framework, we
implemented the architecture with heterogeneous GEMM cores on embedded FPGA
devices, in which high efficiency is usually the top priority under resource
constraints. We validate our framework on Zynq XC7Z020 and XC7Z045 devices
with different design parameters that result in different throughput and
resource utilization. We set up different ratios between the PE array sizes of
the GEMM_SPoT, GEMM_fixed, and GEMM_8-bit cores. Specifically, we
progressively increase the size of the GEMM_SPoT core (Blk_out,SPoT) until the
LUT utilization reaches its highest ratio. For all implementations, the
working frequency is set to 100 MHz.
5.2 Accuracy Performance
Tables 1 and 2 summarize the quantization results for the image classification
task. We use PR_{SPoT:Fixed:8-bit} = 65:30:5, which is the optimal ratio
validated from the resource utilization results on the FPGA. MSP obtains the
minimum accuracy degradation.
The accuracy increase of MSP compared to PoT, Fixed, SPoT, MS, or MP results
from several aspects. First, combining SPoT and Fixed makes the quantized DNN
weights fit the original weight distribution better. Second, leaving 5% of the
weights at 8 bits, i.e., the "intra-layer flexibility", achieves lossless
performance even when the first and the last layers are quantized. In
addition, model compression can slightly increase accuracy when the weight bit
width is ≥ 4, as quantization noise can act as regularization that benefits
generalization and mitigates overfitting.
Table 2 compares our MSP with existing DNN quantization works including Dorefa
(Zhou et al. 2016), PACT (Choi et al. 2018), DSQ (Gong et al. 2019b), QIL
(Jung et al. 2019), µL2Q (Cheng et al. 2019), and LQ-NETS (Zhang et al. 2018).
Those works and our MSP start with the same pre-trained models with the same
baseline accuracy. Table 2 shows that Dorefa, PACT, DSQ, µL2Q, and LQ-NETS
have up to 3.84% accuracy degradation, and only QIL reports lossless accuracy.
Our MSP increases accuracy by 0.71% compared with the floating-point model.
Note that the works above do not quantize the first and the last layer; only
µL2Q uses a 16-bit weight representation for the last layer. In contrast, MSP
quantizes the first and the last layer and still achieves lossless accuracy.

Device   Ratio                   Utilization                                                 Throughput (GOPS)
         (SPoT/4-bit/8-bit)      LUT            DSP          BRAM36       FF                 ResNet-18     MobileNet-v2
                                                                                             on ImageNet   on ImageNet
XC7Z020  0:1:0                   12.2K (23%)    220 (100%)   39 (56%)     9.4K (9%)          31.8          26.6
XC7Z020  1:1:0                   22.9K (43%)    220 (100%)   49 (70%)     14.5K (14%)        51.6          47.4
XC7Z020  1.5:1:0                 28.3K (53%)    220 (100%)   56 (80%)     17.1K (16%)        58.2          55.3
XC7Z020  60:35:5 (MSP)           31.1K (58%)    220 (100%)   59 (84%)     20.5K (19%)        73.8          58.7
XC7Z045  0:1:0                   41.8K (19%)    900 (100%)   160 (29%)    31.3K (7%)         120.5         102.8
XC7Z045  1:1:0                   93.4K (43%)    900 (100%)   194 (36%)    65.7K (15%)        195.3         190.6
XC7Z045  2:1:0                   145.0K (66%)   900 (100%)   225.5 (41%)  111.6K (26%)       244.5         236.5
XC7Z045  65:30:5 (MSP)           151.4K (69%)   900 (100%)   245 (45%)    114.2K (26%)       325.0         262.2
Table 3: Performance of various DNN applications on hardware under different
settings.

Scheme  DSP:       DSP:       LUT:        LUT:       First and Last  Top-1         FPGA
        8-bit FP   4-bit FP   4-bit SPoT  4-bit PoT  Layer (Y/N)     Accuracy (%)  Latency (ms)
MSP     5%         30%        65%         0          Y               70.47         11.2
MSP     5%         0          95%         0          Y               70.09         13.7
MSP     10%        0          90%         0          Y               70.24         13.3
MP      5%         95%        0           0          Y               69.99         29.1
MS      0          50%        50%         0          N               70.08         20.1
MS      0          50%        50%         0          Y               69.52         12.2
MS      0          33%        67%         0          N               70.11         15.4
Fixed   0          100%       0           0          N               69.72         39.5
Fixed   0          100%       0           0          Y               68.66         25.4
PoT     0          0          0           100%       N               68.20         19.5
PoT     0          0          0           100%       Y               67.11         14.7
SPoT    0          0          100%        0          N               69.74         22.9
SPoT    0          0          100%        0          Y               68.48         16.5
Table 4: Ablation study of ResNet-18 on ImageNet. Y/N: With/Without
quantization of the first and the last layer. The results are measured on
XC7Z045.
5.3 Hardware Efficiency
To evaluate performance on real-world applications, we
implemented different CNN models with suitable
SPoT/fixed-point/8-bit ratios on the two devices. ResNet-18
and MobileNet-v2 are implemented for the ImageNet dataset.
The results for each network under various hardware
configurations are shown in Table 3. DSP utilization is 100%
in every configuration, while increasing the SPoT ratio
raises the utilization of the remaining resources (e.g., LUT
utilization on XC7Z045 rises from 19% to 69%). With the
optimal design, the heterogeneous GEMM_SPoT, GEMM_fixed, and
GEMM_8-bit cores improve throughput by 2.2× to 2.7× compared
with using the GEMM_fixed core alone (e.g., from 120.5 to
325.0 GOPS for ResNet-18 on XC7Z045).
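The throughput gain quoted above can be re-derived from the XC7Z045 rows of Table 3. The snippet below is a minimal sketch of that arithmetic, not part of the original evaluation flow. The dictionary name, the function name, and the reading of the second throughput column as MobileNet-v2 are assumptions made for illustration.

```python
# Throughput (GOPS) copied from the XC7Z045 rows of Table 3.
# Keys are the ratio labels used in the table. The first value is the
# ResNet-18 column; the second is ASSUMED to be the MobileNet-v2 column.
THROUGHPUT_GOPS = {
    "0:1:0": (120.5, 102.8),     # GEMM_fixed core only
    "1:1:0": (195.3, 190.6),
    "2:1:0": (244.5, 236.5),
    "65:30:5": (325.0, 262.2),   # MSP design
}


def speedup_over_baseline(table, baseline="0:1:0"):
    """Return each configuration's throughput divided by the baseline's."""
    base = table[baseline]
    return {
        ratio: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for ratio, values in table.items()
    }


speedups = speedup_over_baseline(THROUGHPUT_GOPS)
for ratio, (resnet18, mobilenet_v2) in speedups.items():
    print(f"{ratio:>8}: ResNet-18 {resnet18:.2f}x, MobileNet-v2 {mobilenet_v2:.2f}x")
```

On these numbers the MSP row works out to roughly 2.70× for ResNet-18 and 2.55× for the second column, both inside the 2.2× to 2.7× range stated in the text.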
5.4 Ablation Study
We experiment with different quantization configurations to
validate the proposed MSP scheme. All percentages refer to
the share of each quantization scheme in the main body of a
DNN, since the first and last layers are specified separately.
As shown in Table 4, single-scheme quantization (fixed-point
(Zhou et al. 2016), PoT, SPoT) shows no advantage in either
accuracy or speed (measured as FPGA latency in our work).
Moreover, simply quantizing the first and last layers to a
low bit width introduces noticeable accuracy degradation.
With our mixed-scheme (MS) method, execution speed increases
significantly because both DSPs and LUTs are utilized
efficiently. We also observe a slight improvement in
accuracy, because the two schemes are organized to better fit
the weight distribution, as described in Section 3. Finally,
we employ our intra-layer multi-precision (MP) method to
mitigate the need for high bit widths in the first and last
layers, further improving speed while achieving comparable
accuracy.
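The comparison in Table 4 can also be checked programmatically. The following is a small sketch that transcribes the accuracy and latency rows and computes the latency ratio against pure fixed-point quantization together with the accuracy/latency Pareto front. The `Config` type, the short configuration labels, and the helper names are hypothetical and introduced only for this illustration.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    name: str            # hypothetical label: scheme plus percentage shares
    quantize_ends: bool  # True if the first and last layers are quantized (Y)
    top1: float          # Top-1 accuracy (%)
    latency_ms: float    # FPGA latency (ms)


# Values copied from Table 4 (ResNet-18 on ImageNet, XC7Z045).
# MSP/MP labels list 8-bit / 4-bit fixed / 4-bit SPoT shares;
# MS labels list 4-bit fixed / 4-bit SPoT shares.
TABLE4 = [
    Config("MSP 5/30/65", True, 70.47, 11.2),
    Config("MSP 5/0/95", True, 70.09, 13.7),
    Config("MSP 10/0/90", True, 70.24, 13.3),
    Config("MP 5/95/0", True, 69.99, 29.1),
    Config("MS 50/50", False, 70.08, 20.1),
    Config("MS 50/50", True, 69.52, 12.2),
    Config("MS 33/67", False, 70.11, 15.4),
    Config("Fixed", False, 69.72, 39.5),
    Config("Fixed", True, 68.66, 25.4),
    Config("PoT", False, 68.20, 19.5),
    Config("PoT", True, 67.11, 14.7),
    Config("SPoT", False, 69.74, 22.9),
    Config("SPoT", True, 68.48, 16.5),
]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that are not dominated, i.e. no other one is at
    least as good on both metrics and strictly better on one of them."""
    def dominated(c: Config) -> bool:
        return any(
            o.top1 >= c.top1 and o.latency_ms <= c.latency_ms
            and (o.top1 > c.top1 or o.latency_ms < c.latency_ms)
            for o in configs
        )
    return [c for c in configs if not dominated(c)]


# Baseline: pure fixed-point with unquantized first and last layers.
baseline = next(c for c in TABLE4 if c.name == "Fixed" and not c.quantize_ends)
best = min(TABLE4, key=lambda c: c.latency_ms)
speedup = baseline.latency_ms / best.latency_ms

print(best.name, f"{speedup:.2f}x faster than pure fixed-point")
print([c.name for c in pareto_front(TABLE4)])
```

With the values as transcribed, the 65% SPoT / 30% fixed-point / 5% 8-bit configuration is the only point on the Pareto front, since it has both the highest Top-1 accuracy and the lowest latency in the table.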
6 Conclusion
This work proposes a mixed-scheme (MS) quantization
method that combines the sum-of-power-of-2 (SPoT)
quantization scheme with the traditional fixed-point
quantizer. Multiplications under the SPoT scheme can be
replaced with bit-shift and addition operations that use
FPGA LUT resources, while fixed-point operations execute
efficiently on DSP modules, so the heterogeneous resources
on the FPGA are fully exploited. Furthermore, we develop a
novel intra-layer multi-precision (MP) method that mitigates
the need for high bit widths in the first and last layers
while preserving accuracy. As a new dimension for mixing
bit widths, our MP method is much more hardware-friendly
than inter-layer multi-precision quantization. Finally, we
combine MS and MP into MSP, an ensemble of SPoT, low-bit
fixed-point, and 5% 8-bit quantization. MSP achieves the
best accuracy as well as the fastest inference speed on FPGA
compared with prior art. With optimal SPoT/fixed-point/8-bit
ratios, our FPGA-specific quantization scheme not only
achieves 70.47% Top-1 accuracy for ResNet-18 on the ImageNet
dataset but also speeds up inference by 3.53× compared with
pure fixed-point quantization on FPGA devices.
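To make the shift-and-add claim concrete, the sketch below multiplies an integer activation by a weight that is a signed sum of powers of two, using only shifts and additions. It is a software illustration under stated assumptions, not the paper's hardware design: the weight encoding (a sign plus a list of right-shift amounts), the function name `spot_multiply`, and the `frac_bits` scaling parameter are all hypothetical.

```python
from typing import Sequence


def spot_multiply(x: int, sign: int, shifts: Sequence[int], frac_bits: int = 8) -> int:
    """Return x * w * 2**frac_bits using only shifts and additions.

    ASSUMED encoding: w = sign * sum(2**-s for s in shifts), with every
    shift amount s between 0 and frac_bits. The result is a fixed-point
    integer carrying frac_bits fractional bits.
    """
    if sign not in (-1, 1):
        raise ValueError("sign must be +1 or -1")
    acc = 0
    for s in shifts:
        if not 0 <= s <= frac_bits:
            raise ValueError("shift amount out of range")
        # x * 2**-s, rescaled by 2**frac_bits, is a single left shift.
        acc += x << (frac_bits - s)
    return acc if sign == 1 else -acc


# Example: w = 2**-1 + 2**-3 = 0.625 and x = 12, so x * w = 7.5.
product_fixed = spot_multiply(12, 1, (1, 3))
print(product_fixed, product_fixed / 2**8)
```

Each term costs one shift and one addition, which is why such products map naturally onto LUT logic, whereas a general fixed-point product needs a multiplier such as a DSP block.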
References
Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating
or propagating gradients through stochastic neurons for
conditional computation. arXiv preprint arXiv:1308.3432.
Bishop, C. M. 2006. Pattern recognition and machine learn-
ing. springer.
Cheng, G.; Ye, L.; Tao, L.; Xiaofan, Z.; Cong, H.; Deming,
C.; and Yao, C. 2019. µL2Q: An Ultra-Low Loss Quantiza-
tion Method for DNN. The 2019 International Joint Confer-
ence on Neural Networks (IJCNN) .
Choi, J.; Wang, Z.; Venkataramani, S.; Chuang, P. I.-J.;
Srinivasan, V.; and Gopalakrishnan, K. 2018. Pact: Param-
eterized clipping activation for quantized neural networks.
arXiv preprint arXiv:1805.06085 .
Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Bi-
naryconnect: Training deep neural networks with binary
weights during propagations. In Advances in neural infor-
mation processing systems (NeurIPS), 3123–3131.
Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; and
Bengio, Y. 2016. Binarized neural networks: Training deep
neural networks with weights and activations constrained
to +1 or -1. arXiv preprint arXiv:1602.02830.
Dong, Z.; Yao, Z.; Gholami, A.; Mahoney, M. W.; and
Keutzer, K. 2019. Hawq: Hessian aware quantization of
neural networks with mixed-precision. In Proceedings of the
IEEE International Conference on Computer Vision (ICCV),
293–302.
Esser, S. K.; McKinstry, J. L.; Bablani, D.; Appuswamy, R.;
and Modha, D. S. 2019. Learned step size quantization.
International Conference on Learning Representations (ICLR).
Gong, C.; Jiang, Z.; Wang, D.; Lin, Y.; Liu, Q.; and Pan,
D. Z. 2019a. Mixed Precision Neural Architecture Search
for Energy Efficient Deep Learning. In 2019 IEEE/ACM
International Conference on Computer-Aided Design (ICCAD),
1–7. IEEE.
Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.;
and Yan, J. 2019b. Differentiable soft quantization: Bridging
full-precision and low-bit neural networks. In Proceedings
of the IEEE International Conference on Computer Vision
(ICCV), 4852–4861.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep
learning. MIT press.
Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz,
M. A.; and Dally, W. J. 2016a. EIE: efficient inference en-
gine on compressed deep neural network. In Proceedings
of the 43rd Annual International Symposium on Computer
Architecture (ISCA), 243–254. IEEE.
Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both
weights and connections for efficient neural network. In Ad-
vances in neural information processing systems (NeurIPS),
1135–1143.
Han, S.; Shen, H.; Philipose, M.; Agarwal, S.; Wolman, A.;
and Krishnamurthy, A. 2016b. Mcdnn: An approximation-
based execution framework for deep stream processing un-
der resource constraints. In Proceedings of the 14th Annual
International Conference on Mobile Systems, Applications,
and Services (MobiSys), 123–136. ACM.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep resid-
ual learning for image recognition. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion (CVPR), 770–778.
He, Z.; and Fan, D. 2019. Simultaneously optimizing weight
and quantizer of ternary neural network using truncated
gaussian approximation. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR),
11438–11446.
Hegde, K.; Agrawal, R.; Yao, Y.; and Fletcher, C. W. 2018.
Morph: Flexible Acceleration for 3D CNN-Based Video Un-
derstanding. In Proceedings of the 51st Annual IEEE/ACM
International Symposium on Microarchitecture (MICRO),
933–946. IEEE.
Jung, S.; Son, C.; Lee, S.; Son, J.; Han, J.-J.; Kwak, Y.;
Hwang, S. J.; and Choi, C. 2019. Learning to quantize deep
networks by optimizing quantization intervals with task loss.
In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 4350–4359.
Krizhevsky, A. 2009. Learning multiple layers of features
from tiny images .
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Im-
agenet classification with deep convolutional neural net-
works. In Advances in neural information processing sys-
tems, 1097–1105.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning.
nature 521(7553): 436–444.
Leng, C.; Dou, Z.; Li, H.; Zhu, S.; and Jin, R. 2018. Ex-
tremely low bit neural network: Squeeze the last bit out with
admm. In Thirty-Second AAAI Conference on Artificial In-
telligence (AAAI).
Leng, Y.; Chen, C.-C.; Sun, Q.; Huang, J.; and Zhu, Y. 2019.
Energy-efficient video processing for virtual reality. In Pro-
ceedings of the 46th International Symposium on Computer
Architecture (ISCA), 91–103.
Li, F.; Zhang, B.; and Liu, B. 2016. Ternary weight net-
works. arXiv preprint arXiv:1605.04711 .
Li, Y.; Dong, X.; and Wang, W. 2020. Additive Powers-of-
Two Quantization: An Efficient Non-uniform Discretization
for Neural Networks. In International Conference on Learn-
ing Representations (ICLR).
Li, Y.; Park, J.; Alian, M.; Yuan, Y.; Qu, Z.; Pan, P.; Wang,
R.; Schwing, A.; Esmaeilzadeh, H.; and Kim, N. S. 2018. A
Network-Centric Hardware/Algorithm Co-Design to Accel-
erate Distributed Training of Deep Neural Networks. In Pro-
ceedings of the 51st Annual IEEE/ACM International Sym-
posium on Microarchitecture (MICRO), 175–188. IEEE.
Lin, X.; Zhao, C.; and Pan, W. 2017. Towards accurate bi-
nary convolutional neural network. In Advances in Neural
Information Processing Systems (NeurIPS), 345–353.
Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; and Darrell, T. 2019.
Rethinking the value of network pruning. International Con-
ference on Learning Representations (ICLR) .
Lou, Q.; Guo, F.; Liu, L.; Kim, M.; and Jiang, L. 2019.
Autoq: Automated kernel-wise neural network quantization.
arXiv preprint arXiv:1902.05690 .
Lym, S.; Choukse, E.; Zangeneh, S.; Wen, W.; Sanghavi, S.;
and Erez, M. 2019. PruneTrain: fast neural network training
by dynamic sparse model reconfiguration. In Proceedings
of the International Conference for High Performance Com-
puting, Networking, Storage and Analysis, 1–13.
Mao, H.; Song, M.; Li, T.; Dai, Y.; and Shu, J. 2018. Ler-
GAN: A Zero-Free, Low Data Movement and PIM-Based
GAN Architecture. In Proceedings of the 51st Annual
IEEE/ACM International Symposium on Microarchitecture
(MICRO), 669–681. IEEE.
Miyashita, D.; Lee, E. H.; and Murmann, B. 2016. Convolu-
tional neural networks using logarithmic data representation.
arXiv preprint arXiv:1603.01025 .
Niu, W.; Ma, X.; Lin, S.; Wang, S.; Qian, X.; Lin, X.; Wang,
Y.; and Ren, B. 2020. Patdnn: Achieving real-time DNN ex-
ecution on mobile devices with pattern-based weight prun-
ing. In Proceedings of the 25th International Conference
on Architectural Support for Programming Languages and
Operating Systems (ASPLOS), 907–922.
Rastegari, M.; Ordonez, V.; Redmon, J.; and Farhadi, A.
2016. Xnor-net: Imagenet classification using binary convo-
lutional neural networks. In European conference on com-
puter vision (ECCV), 525–542. Springer.
Ren, A.; Zhang, T.; Ye, S.; Li, J.; Xu, W.; Qian, X.; Lin, X.;
and Wang, Y. 2019. Admm-nn: An algorithm-hardware co-
design framework of dnns using alternating direction meth-
ods of multipliers. In Proceedings of the 24th International
Conference on Architectural Support for Programming Lan-
guages and Operating Systems (ASPLOS), 925–938.
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and
Chen, L.-C. 2018. Mobilenetv2: Inverted residuals and lin-
ear bottlenecks. In Proceedings of the IEEE conference
on computer vision and pattern recognition (CVPR), 4510–
4520.
Sharma, H.; Park, J.; Mahajan, D.; Amaro, E.; Kim, J. K.;
Shao, C.; Mishra, A.; and Esmaeilzadeh, H. 2016. From
high-level deep neural models to FPGAs. In Proceedings
of the 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), 17. IEEE Press.
Shen, S.; Dong, Z.; Ye, J.; Ma, L.; Yao, Z.; Gholami, A.;
Mahoney, M. W.; and Keutzer, K. 2020. Q-BERT: Hessian
Based Ultra Low Precision Quantization of BERT. In
Thirty-Fourth AAAI Conference on Artificial Intelligence
(AAAI), 8815–8821.
Shi, R.; Ding, Y.; Wei, X.; Liu, H.; So, H.; and Ding, C.
2020a. FTDL: An FPGA-tailored Architecture for Deep
Learning Systems. In The 2020 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (FPGA),
320–320.
Shi, R.; Dong, P.; Geng, T.; Ding, Y.; Ma, X.; So, H. K.-
H.; Herbordt, M.; Li, A.; and Wang, Y. 2020b. CSB-
RNN: A Faster-than-Realtime RNN Acceleration Frame-
work with Compressed Structured Blocks. arXiv preprint
arXiv:2005.05758 .
Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; and Han, S. 2019. HAQ:
Hardware-Aware Automated Quantization with Mixed Pre-
cision. International Conference on Computer Vision and
Pattern Recognition (CVPR) .
Yin, P.; Lyu, J.; Zhang, S.; Osher, S.; Qi, Y.; and Xin, J. 2018.
Understanding Straight-Through Estimator in Training Ac-
tivation Quantized Neural Nets. In International Conference
on Learning Representations (ICLR).
Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; and Cong,
J. 2015. Optimizing fpga-based accelerator design for
deep convolutional neural networks. In Proceedings of
the 2015 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays (FPGA), 161–170. ACM.
Zhang, D.; Yang, J.; Ye, D.; and Hua, G. 2018. Lq-nets:
Learned quantization for highly accurate and compact deep
neural networks. In Proceedings of the European conference
on computer vision (ECCV), 365–382.
Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; and Chen, Y. 2017. In-
cremental Network Quantization: Towards Lossless CNNs
with Low-precision Weights. In International Conference
on Learning Representations (ICLR).
Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; and Zou, Y.
2016. Dorefa-net: Training low bitwidth convolutional neu-
ral networks with low bitwidth gradients. arXiv preprint
arXiv:1606.06160 .
Zhu, C.; Han, S.; Mao, H.; and Dally, W. J. 2017. Trained
ternary quantization. In International Conference on Learn-
ing Representations (ICLR).
Zhu, Z.; Sun, H.; Lin, Y.; Dai, G.; Xia, L.; Han, S.; Wang,
Y.; and Yang, H. 2019. A configurable multi-precision cnn
computing framework based on single bit rram. In 2019
56th ACM/IEEE Design Automation Conference (DAC), 1–
6. IEEE.
