Efficient non-uniform quantizer for quantized neural network targeting
  reconfigurable hardware by Liss, Natan et al.
Efficient Non-uniform Quantizer for Quantized Neural
Network Targeting Re-configurable Hardware
Natan Liss ∗
Technion
Haifa, Israel
lissnatan@campus.technion.ac.il
Chaim Baskin ∗
Technion
Haifa, Israel
chaimbaskin@cs.technion.ac.il
Avi Mendelson
Technion
Haifa, Israel
avi.mendelson@cs.technion.ac.il
Alexander M. Bronstein
Technion
Haifa, Israel
bron@cs.technion.ac.il
Raja Giryes
Tel-Aviv University
Tel-Aviv, Israel
raja@tauex.tau.ac.il
ABSTRACT
Convolutional Neural Networks (CNNs) have become a popular
choice for various tasks such as computer vision, speech
recognition and natural language processing. Thanks to their
large computational capability and throughput, GPUs are the
most common platform for both training and inference, although
they are not power efficient and therefore do not suit
low-power systems such as mobile devices. Recent studies have
shown that FPGAs can provide a good alternative to GPUs as CNN
accelerators, due to their reconfigurable nature, low power
and small latency. In order for FPGA-based accelerators to
outperform GPUs in inference tasks, both the parameters of the
network and the activations must be quantized. While most
works use uniform quantizers for both parameters and
activations, a uniform quantizer is not always the optimal
choice, and a non-uniform quantizer needs to be considered. In
this work we introduce a custom hardware-friendly approach to
implementing non-uniform quantizers. In addition, we use a
single-scale integer representation of both parameters and
activations, for both training and inference. The combined
method yields a hardware-efficient non-uniform quantizer, fit
for real-time applications. We have tested our method on the
CIFAR-10 and CIFAR-100 image classification datasets with
ResNet-18 and VGG-like architectures, and saw little
degradation in accuracy.
INTRODUCTION
Neural Networks on custom hardware
When implementing systems involving arbitrary precision, FPGAs
and ASICs are a natural choice of target device due to their
customizable nature. It has already been shown that there is a
lot of redundancy when using floating-point representation in
Neural Networks (NNs). Therefore, a custom low-precision
representation can be used with little impact on accuracy. Due
to the steadily increasing on-chip memory size (tens of
megabytes) and the integration of high-bandwidth memory
(hundreds of megabytes), it is feasible to fit all the
parameters inside an ASIC or FPGA when using a low bitwidth.
Besides the obvious advantage of reducing latency, this
approach has several further advantages: reduced power
consumption and smaller resource utilization, which, in
addition to DSP blocks and LUTs, also includes routing
resources. The motivation for quantizing the activations is
similar to that for the parameters. Although activations are
not stored during inference, their quantization can lead to
major savings in routing resources, which in turn can increase
the maximal operating frequency of the fabric, resulting in
increased throughput. In recent years, FPGAs have become more
popular as inference accelerators. While ASICs [6, 15] usually
offer more throughput with lower energy consumption, they do
not enjoy the reconfigurability of FPGAs. This is important
since neural network algorithms evolve with time, and so
should their hardware implementations. Since the
implementation of a neural network involves complex scheduling
and data movement, FPGA-based inference accelerators have been
described either as heterogeneous systems using OpenCL
[21, 22, 2] or as standalone accelerators using HLS compilers
[8, 20, 24].

∗ Equal contributors.
The effect of quantization choice on hardware
It is known that FPGAs and ASICs handle arithmetic and logical
operations on custom data types well, especially integers and
fixed point. During inference, only the activations are
actually quantized, since the parameters stored in memory have
already been quantized during training. The simplest and
probably the most hardware-efficient implementation of
activation quantization is linear quantization, which can be
easily implemented by applying a bit shift to the result of the
Multiply-And-Accumulate (MAC) operation, after all the
negative values have been zeroed. The size of the shift depends
on the maximum value of the MAC operation and the number of
bits representing the activation. Suppose the activation is
represented as a $BW_a$-bit unsigned integer and the network
parameters are represented as $BW_p$-bit signed integers. Also,
suppose we have a filter of size $N_f$. The amount of right bit
shift required for linear quantization is:

$$\log_2\left[(2^{BW_a}-1)(2^{BW_p-1}-1)N_f\right] - BW_a$$

Another hardware-efficient activation is the logarithmic one.
When using $\log_2$, the implementation simply boils down to a
priority encoder on the MAC result, which indicates the
location of the most significant '1'. The major drawback of a
$\log_2$-based activation is that the gradient can diminish
rapidly, which can lead to slow convergence, especially for the
first layers. A more general form of quantized activation can
be formulated using thresholds, for example when using
non-uniform quantization, which in hardware is realized with
chains of comparators and MUXs.

arXiv:1811.10869v1 [cs.LG] 27 Nov 2018
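The two hardware-friendly activations described above can be
illustrated with a short Python sketch. It is only an
illustration under stated assumptions: the function names are
hypothetical, the shift is rounded up to an integer (the text
gives the expression without rounding), and the priority
encoder is modelled with Python's int.bit_length().

```python
import math


def linear_shift(bw_a: int, bw_p: int, n_f: int) -> int:
    """Right shift that maps the worst-case MAC value into bw_a bits."""
    max_mac = (2**bw_a - 1) * (2**(bw_p - 1) - 1) * n_f
    # Assumption: round up so that the shift is an integer number of bits.
    return max(0, math.ceil(math.log2(max_mac)) - bw_a)


def linear_quantize(mac: int, shift: int, bw_a: int) -> int:
    """Zero negative values, shift right, and saturate to bw_a bits."""
    relu = max(mac, 0)
    return min(relu >> shift, 2**bw_a - 1)


def log2_quantize(mac: int, bw_a: int) -> int:
    """Priority-encoder view: 1-based position of the most significant '1'."""
    relu = max(mac, 0)
    return min(relu.bit_length(), 2**bw_a - 1)


if __name__ == "__main__":
    # Hypothetical layer: 4-bit activations, 4-bit weights, 3x3 filter.
    shift = linear_shift(bw_a=4, bw_p=4, n_f=9)
    print(shift, linear_quantize(945, shift, 4), log2_quantize(945, 4))
```

In hardware, the shift is a fixed wiring choice per layer, which
is why linear quantization costs almost no logic.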
RELATED WORK
Previous works have investigated the challenging task of
efficiently deploying neural networks (NNs) on custom
hardware. This task usually boils down to two major sub-tasks:
• Network quantization.
• Pipelining and scheduling.
Extreme quantization setups, with weights and activations
quantized to as low as 1-2 bits, have been explored
[3, 18, 13, 12, 7, 25]. In some of these works, the
multiplication has been replaced by a sequence of XNOR and
pop-count operations, which increases the custom logic
utilization of an FPGA or ASIC. However, these works suffer
from an accuracy drop due to the low precision. In [26], the
authors address this issue by adding a scaling layer after a
quantized layer as well as learning two independent weight
thresholds, while [17] approaches it by widening the layer
depth, at the cost of higher computational effort. Recently,
[16] and [17] proposed a teacher-student setup for knowledge
distillation [11]. In this setup, a full-precision model is
used to distill its knowledge to the target network. While this
approach achieves impressive results, the training computation
complexity increases by more than 2×. In the aforementioned
papers, the general approach is to perform the forward pass
with the quantized weight values, while the backward pass is
performed on a full-precision copy of the weights. In [14], the
authors represent each parameter as an 8-bit integer coupled
with layer-specific scaling and zero-point factors. This
quantization approach allows integer-only inference with
little drop in accuracy. A common approach in the
aforementioned papers, and in most previous works on
quantization, is to assume a uniform distribution of weights
and activations. While this assumption leads to a more
compelling hardware implementation, the weights in practice
look as if they were drawn from a Gaussian distribution [9, 1].
Following this observation, [4] suggested first calculating µ
and σ for each layer and applying the CDF of the normal
distribution, which transforms the parameters from a normal to
a uniform distribution. Noise drawn from a uniform distribution
is then added to simulate the quantization error. Finally, a
quantile function is applied in order to transform the
parameters back to a normal distribution. On the pipelining and
scheduling side, FPGA-based inference accelerators have been
described either as heterogeneous systems using OpenCL
[21, 2, 22] or as standalone accelerators using HLS compilers
[20, 24, 8].
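The three-step noise-injection procedure attributed to [4]
above (normal CDF, uniform noise, quantile function) can be
sketched in a few lines of Python. This is a rough
illustration, not the implementation of [4]: the function name
and the seed parameter are hypothetical, and the noise
amplitude of half a quantization bin is an assumption, since
the text does not specify it.

```python
import random
from statistics import NormalDist, mean, pstdev


def inject_uniform_noise(weights, bits, seed=0):
    """CDF -> add uniform noise -> quantile, applied per layer."""
    rng = random.Random(seed)
    dist = NormalDist(mean(weights), pstdev(weights))
    # Assumption: noise spans +/- half of one of the 2**bits uniform bins.
    half_bin = 0.5 / 2**bits
    noisy = []
    for w in weights:
        u = dist.cdf(w)                        # normal -> uniform on (0, 1)
        u += rng.uniform(-half_bin, half_bin)  # simulated quantization error
        u = min(max(u, 1e-6), 1.0 - 1e-6)      # keep inside the open interval
        noisy.append(dist.inv_cdf(u))          # uniform -> normal
    return noisy


if __name__ == "__main__":
    layer = [-1.0, -0.5, 0.0, 0.5, 1.0]
    print(inject_uniform_noise(layer, bits=4))
```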
NON-UNIFORM QUANTIZATION HARDWARE IMPLEMENTATION
Activation quantization using thresholds
In recent years, ReLU has been a popular choice of activation
function because it allows faster convergence of deep networks.
This activation function also appeals to custom hardware such
as FPGAs [20, 21, 19] and ASICs, since the implementation
involves a 1-bit comparator for the sign bit and a MUX which
passes either a zero vector or the input. However, the fact
that ReLU is unbounded, coupled with the large MAC results
imposed by integer-only networks, makes low-bit quantization
very challenging. We make the reasonable assumption that the
activations follow a Gaussian distribution. In order to achieve
an equiprobable distribution among all bins, we use a hybrid of
the CDF of a normal distribution and ReLU. Let
$T = \{0 < t_0 < t_1 < \cdots < t_{k-3} < t_{k-2} = \infty\}$
be a set of thresholds and $B_a$ the number of bits allocated
for the quantized activation. We define a quantizer $Q(x)$
which maps the result of the MAC operation $x \in \mathbb{Z}$
to $y \in \mathbb{Z}$ as follows:

$$
y = Q(x) =
\begin{cases}
0, & x \in (-\infty, 0] \\
1, & x \in (0, t_1] \\
\vdots \\
2^{B_a}-1, & x \in (t_{k-2}, t_{k-1}] \\
2^{B_a}-1, & x > t_{k-1}
\end{cases}
$$

During training, we calculate the running statistics, $\mu$ and
$\sigma$, for each layer, and use them during evaluation to
calculate the thresholds $T$. Let $F_X(x,\mu,\sigma)$ be the
CDF of a normal distribution and $F_X^{-1}(x,\mu,\sigma)$ be
the quantile function of a normal distribution. The activation
thresholds $T$ are calculated as follows:

$$Z = F_X(0,\mu,\sigma)$$
$$b_i = \left[\frac{1-Z}{2^{B_a}}\right] i, \quad i \in [1 \,..\, 2^{B_a}-1]$$
$$t_i = F_X^{-1}(b_i,\mu,\sigma)$$

An illustration of this function, for activations with µ = 900
and σ = 900, is presented in Figure 1. From it we can clearly
see that the distances on the y-axis are equal, which implies
equiprobable bins. The intersections of the vertical green
lines with the x-axis represent the integer thresholds, which
are loaded to the custom device during inference and define
the boundaries between the bins.
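A minimal Python sketch of the threshold computation and of the
comparator-chain quantizer Q(x) may help make this concrete. It
rests on assumptions that are not stated in the text: the
function names are hypothetical, thresholds are rounded to
integers, and the levels b_i are offset by Z (that is,
Z + i(1-Z)/2^Ba) so that every threshold is positive, as the
constraint 0 < t requires; the formula as printed has no such
offset.

```python
from bisect import bisect_left
from statistics import NormalDist


def activation_thresholds(mu: float, sigma: float, bits: int) -> list:
    """Integer thresholds splitting the positive MAC range into equiprobable bins."""
    dist = NormalDist(mu, sigma)
    z = dist.cdf(0.0)              # probability mass removed by the ReLU part
    step = (1.0 - z) / 2**bits     # probability mass of each positive bin
    # Assumption: levels are offset by z so that all thresholds exceed zero.
    return [round(dist.inv_cdf(z + step * i)) for i in range(1, 2**bits)]


def quantize_activation(x: int, thresholds: list, bits: int) -> int:
    """Software model of the comparator chain and MUX."""
    if x <= 0:
        return 0
    # Number of thresholds strictly below x, plus one, gives the bin index.
    level = bisect_left(thresholds, x) + 1
    # The two topmost bins share the largest code, as in the definition of Q(x).
    return min(level, 2**bits - 1)


if __name__ == "__main__":
    t = activation_thresholds(mu=900, sigma=900, bits=3)
    print(t)
    print([quantize_activation(x, t, 3) for x in (-50, 1, 1000, 10**6)])
```

The list returned by activation_thresholds is what would be
loaded into the device; quantize_activation needs only
comparisons, with no multiplication or division.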
Figure 1: k-quantile quantization to discrete integer levels.
(x-axis: MAC result; y-axis: percentile of the full-scale
value; bins labeled 000 through 111; legend: CDF function, MAC
thresholds, quantization boundaries.)

Non-uniform integer weight quantization
Following the assumption that weights are normally distributed,
we use a non-uniform quantizer similar to the one described
above. Let $F_X(x,\mu,\sigma)$ be the CDF of a normal
distribution and $B_w$ the number of bits allocated for the
quantized weights. We quantize the weights as follows:

$$W_q = \left[F_X(W,\mu_w,\sigma_w) - 0.5\right] \times 2^{B_w}$$
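The weight quantizer can be sketched as follows. The function
name is hypothetical, and two details are assumptions rather
than statements from the text: the result is rounded to the
nearest integer, and it is clamped to the signed range of the
allocated bit width.

```python
from statistics import NormalDist, mean, pstdev


def quantize_weights(weights, bits):
    """Map normally distributed weights to signed integers via the normal CDF."""
    dist = NormalDist(mean(weights), pstdev(weights))
    lo, hi = -(2**(bits - 1)), 2**(bits - 1) - 1
    quantized = []
    for w in weights:
        q = (dist.cdf(w) - 0.5) * 2**bits   # value in (-2**(bits-1), 2**(bits-1))
        # Assumption: round to nearest integer and clamp to the signed range.
        quantized.append(min(max(round(q), lo), hi))
    return quantized


if __name__ == "__main__":
    print(quantize_weights([-2.0, -1.0, 0.0, 1.0, 2.0], bits=4))
```

Because the CDF spreads the weights evenly over (0, 1), each
integer level is used by roughly the same number of weights.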
Back propagating through Gaussian threshold activations
quantizer
Since $Q(x)$ is a non-differentiable function, we use the
Straight-Through Estimator (STE) [5] approach in order to
approximate the gradient of a quantized variable. Our chosen
activation function is the normal CDF
$F_X(x,\mu(x),\sigma(x))$. Combining this fact with the STE
approach, the combined derivative of the CDF activation and
quantization functions w.r.t. the input $x$ is the incoming
derivative multiplied by the normal probability density
function: $F'_X(x,\mu(x),\sigma(x))$.
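As a small illustration of this backward rule, the sketch below
multiplies the gradient arriving from the next layer by the
normal probability density evaluated at the input. The function
name is hypothetical, and µ and σ are treated as constants
during the backward pass, which is an assumption.

```python
from statistics import NormalDist


def ste_backward(x: float, grad_output: float, mu: float, sigma: float) -> float:
    """Gradient w.r.t. x of the quantized CDF activation under the STE."""
    # STE: the quantization step is treated as identity, so only the
    # derivative of the normal CDF (the normal PDF) remains.
    return grad_output * NormalDist(mu, sigma).pdf(x)


if __name__ == "__main__":
    # Gradient is largest near the mean and vanishes far from it.
    for x in (900.0, 1800.0, 9000.0):
        print(x, ste_backward(x, grad_output=1.0, mu=900.0, sigma=900.0))
```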
Residual block adaptation
In order to incorporate integer-only arithmetic into
residual-based architectures [10], we had to slightly modify
the residual block. The modification to the basic block is
shown in Figure 2. In the original basic residual block, the
last activation function is applied after adding the result of
the convolution layer to the residual result, as shown in
Figure 2b. If we simply replaced the second ReLU function with
integer quantization, this would bound the residual of the
next layer to a maximum value of $M_a = 2^{BW_a}-1$, where
$BW_a$ is the number of bits allocated for the quantized
activation. For the second convolution layer, of dimensions
$O \times I \times K^2$, where $O$, $I$ and $K^2$ are the
number of output feature maps, the number of input feature
maps and the filter size respectively, the maximal MAC value is
$M_m = (2^{BW_a}-1)(2^{BW_p-1}-1)(I \times K^2)$, where $BW_p$
is the number of bits allocated for the parameters. Since
$M_m \gg M_a$, the effect of the residual path would become
negligible and eventually lead to an accuracy drop. To tackle
this issue, we moved the post-residual activation to the input
of the next basic block, as shown in Figure 2a. In this case,
the scales of the residual and of the MAC result of the second
convolution layer are roughly the same.
Figure 2: Custom vs original residual basic block architecture.
(a) Custom residual basic block for integer quantization.
(b) Original residual basic block.
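The scale mismatch that motivates moving the activation can be
checked with a short calculation. The layer size below
(128 input feature maps, 3x3 filters, 4-bit weights and
activations) is a hypothetical example, and the second factor
uses the weight bit width, following the shift formula given
in the introduction.

```python
def max_activation(bw_a: int) -> int:
    """Largest value of a quantized activation, M_a."""
    return 2**bw_a - 1


def max_mac(bw_a: int, bw_p: int, in_maps: int, kernel: int) -> int:
    """Largest MAC magnitude of a convolution layer, M_m."""
    return (2**bw_a - 1) * (2**(bw_p - 1) - 1) * in_maps * kernel**2


if __name__ == "__main__":
    m_a = max_activation(4)
    m_m = max_mac(bw_a=4, bw_p=4, in_maps=128, kernel=3)
    # The quantized residual is orders of magnitude smaller than the MAC
    # result, so adding the two would make the skip path negligible.
    print(m_a, m_m, m_m // m_a)
```

For this example the MAC result can be several thousand times
larger than the quantized residual, which is why both are kept
at the MAC scale in the custom block.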
EXPERIMENTAL RESULTS
Training setup
In order to avoid model divergence, which may occur when all
of the layers' parameters are quantized at once, we follow a
gradual quantization scheme, similar to the one proposed by
[23]. The gradual training works as follows. Let us denote the
total number of layers as $N$. At the $i$-th stage, the
parameters and activations of layers $\{L_1, \ldots, L_i\}$ are
quantized, while layers $\{L_{i+1}, \ldots, L_N\}$ remain in
full precision. This training scheme allows a smoother and more
stable convergence to the point at which the whole network is
quantized. As a representative non-uniform quantization
function, for both parameters and activations, we chose the
normal CDF. We use an SGD optimizer with momentum. The learning
rate is set to $10^{-3}$, the momentum to 0.9 and the weight
decay to $10^{-4}$.
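The gradual scheme can be summarized by the following sketch,
which only enumerates which layers are quantized at each stage.
The function and dictionary names are hypothetical; the
hyperparameter values are those stated above, and the actual
training loop is omitted.

```python
# Optimizer settings taken from the text (SGD with momentum).
SGD_HYPERPARAMS = {"lr": 1e-3, "momentum": 0.9, "weight_decay": 1e-4}


def gradual_schedule(num_layers: int):
    """Yield (stage, quantized layer indices, full-precision layer indices)."""
    for stage in range(1, num_layers + 1):
        quantized = list(range(1, stage + 1))
        full_precision = list(range(stage + 1, num_layers + 1))
        yield stage, quantized, full_precision


if __name__ == "__main__":
    for stage, quantized, full_precision in gradual_schedule(4):
        print(f"stage {stage}: quantized={quantized} fp={full_precision}")
```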
Results
We have tested threshold quantization on the CIFAR-10 and
CIFAR-100 image classification datasets using ResNet-18 and
VGG-like architectures. The residual blocks in the ResNet-18
architecture were modified as described in the previous
sections. Our VGG-like architecture is shown in Table 1. The
results for the CIFAR-10 and CIFAR-100 datasets are presented
in Table 2 and Table 3, respectively. The rows with the
(w,a)=32,32 setup represent the baseline accuracy. As one can
observe, there is little degradation in accuracy for both
datasets. We did observe a slightly bigger degradation for the
VGG-like architecture on the CIFAR-100 dataset, which may be
attributed to the overall weakness of this type of architecture
in terms of diminishing gradients and the ability to adapt to
quantization via gradient descent. ResNet-18, thanks to its
skip connections, tends to fine-tune better to quantization.
conv3-128
conv3-128
maxpool
conv3-256
conv3-256
maxpool
conv3-512
conv3-512
maxpool
FC-1024
FC-10
Table 1: VGG-like architecture for CIFAR-10. Convolution
layers are denoted as "conv〈receptive field size〉-〈number of
channels〉".
Network     Precision (w,a)   Accuracy (% top-1)
ResNet-18   32,32             93.76
ResNet-18   4,4               93.45
ResNet-18   3,3               93.05
VGG-like    32,32             90.5
VGG-like    4,4               89.7
VGG-like    3,3               88.1
Table 2: Threshold-based integer-only quantization results for
the CIFAR-10 dataset
Network     Precision (w,a)   Accuracy (% top-1)
ResNet-18   32,32             73.95
ResNet-18   4,4               71.71
ResNet-18   3,3               70.50
VGG-like    32,32             67.1
VGG-like    4,4               64.5
VGG-like    3,3               63.1
Table 3: Threshold-based integer-only quantization results for
the CIFAR-100 dataset
CONCLUSION AND FUTURE WORK
In this work we have presented a hardware-efficient technique
for implementing a non-uniform quantizer. Tables 2 and 3 show
that using integer parameters and activations leads to only a
small degradation in accuracy on the CIFAR-10 and CIFAR-100
datasets (at most 2.4 and 4.0 percentage points, respectively,
at 3-bit precision), and the quantizer can be implemented
efficiently on custom hardware. In the future we would like to
implement the threshold-based non-uniform quantizer on an FPGA
and compare it to a straightforward implementation in terms of
logic utilization, power and runtime.
REFERENCES
1. Alexander G. Anderson and Cory P. Berg. 2018. The
High-Dimensional Geometry of Binary Neural Networks.
International Conference on Learning Representations
(ICLR) (2018).
2. Utku Aydonat, Shane O’Connell, Davor Capalija,
Andrew C. Ling, and Gordon R. Chiu. 2017. An
OpenCL™ Deep Learning Accelerator on Arria 10. In
Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (FPGA
’17). ACM, New York, NY, USA, 55–64. DOI:
http://dx.doi.org/10.1145/3020078.3021738
3. Chaim Baskin, Natan Liss, Avi Mendelson, and Evgenii
Zheltonozhskii. 2017. Streaming Architecture for
Large-Scale Quantized Neural Networks on an
FPGA-Based Dataflow Platform. CoRR abs/1708.00052
(2017).
4. Chaim Baskin, Eli Schwartz, Evgenii Zheltonozhskii,
Natan Liss, Raja Giryes, Alexander M. Bronstein, and
Avi Mendelson. 2018. UNIQ: Uniform Noise Injection
for the Quantization of Neural Networks. CoRR
abs/1804.10969 (2018). http://arxiv.org/abs/1804.10969
5. Yoshua Bengio, Nicholas Léonard, and Aaron C.
Courville. 2013. Estimating or Propagating Gradients
Through Stochastic Neurons for Conditional
Computation. CoRR abs/1308.3432 (2013).
http://arxiv.org/abs/1308.3432
6. Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016.
Eyeriss: A Spatial Architecture for Energy-Efficient
Dataflow for Convolutional Neural Networks. In 2016
ACM/IEEE 43rd Annual International Symposium on
Computer Architecture (ISCA), Vol. 44. 367–379. DOI:
http://dx.doi.org/10.1109/ISCA.2016.40
7. Matthieu Courbariaux and Yoshua Bengio. 2016.
BinaryNet: Training Deep Neural Networks with Weights
and Activations Constrained to +1 or -1. CoRR
abs/1602.02830 (2016). http://arxiv.org/abs/1602.02830
8. Sina Ghaffari and Saeed Sharifian. 2016. FPGA-based
convolutional neural network accelerator design using
high level synthesize. In 2016 2nd International
Conference of Signal Processing and Intelligent Systems
(ICSPIS). 1–6. DOI:
http://dx.doi.org/10.1109/ICSPIS.2016.7869873
9. Song Han, Huizi Mao, and William J Dally. 2016. Deep
Compression: Compressing Deep Neural Networks with
Pruning, Trained Quantization and Huffman Coding.
International Conference on Learning Representations
(ICLR) (2016).
10. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. 2016. Deep Residual Learning for Image
Recognition. In The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR).
11. Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015.
Distilling the Knowledge in a Neural Network.
http://arxiv.org/abs/1503.02531
12. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran
El-Yaniv, and Yoshua Bengio. 2016. Binarized neural
networks. In Advances in neural information processing
systems. 4107–4115.
13. Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran
El-Yaniv, and Yoshua Bengio. 2018. Quantized Neural
Networks: Training Neural Networks with Low Precision
Weights and Activations. Journal of Machine Learning
Research 18, 187 (2018), 1–30.
http://jmlr.org/papers/v18/16-456.html
14. Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong
Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam,
and Dmitry Kalenichenko. 2017. Quantization and
Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference. CoRR
abs/1712.05877 (2017). http://arxiv.org/abs/1712.05877
15. Norman P. Jouppi, Cliff Young, Nishant Patil, David A.
Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah
Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick
Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,
Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben
Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati,
William Gulland, Robert Hagmann, Richard C. Ho, Doug
Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz,
Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit
Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James
Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan
Liu, Kyle Lucke, Alan Lundin, Gordon MacKean,
Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul
Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix,
Thomas Norrie, Mark Omernick, Narayana Penukonda,
Andy Phelps, Jonathan Ross, Amir Salek, Emad
Samadiani, Chris Severn, Gregory Sizikov, Matthew
Snelham, Jed Souter, Dan Steinberg, Andy Swing,
Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma,
Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter
Wang, Eric Wilcox, and Doe Hyun Yoon. 2017.
In-datacenter performance analysis of a tensor processing
unit. In 2017 ACM/IEEE 44th Annual International
Symposium on Computer Architecture (ISCA). 1–12. DOI:
http://dx.doi.org/10.1145/3079856.3080246
16. Asit Mishra and Debbie Marr. 2018. Apprentice: Using
Knowledge Distillation Techniques To Improve
Low-Precision Network Accuracy. International
Conference on Learning Representations (2018).
https://openreview.net/forum?id=B1ae1lZRb
17. Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018.
Model compression via distillation and quantization.
International Conference on Learning Representations
(2018). https://openreview.net/forum?id=S1XolQbRW
18. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon,
and Ali Farhadi. 2016. Xnor-net: Imagenet classification
using binary convolutional neural networks. In European
Conference on Computer Vision. Springer, 525–542.
19. M. Sit, R. Kazami, and H. Amano. 2017. FPGA-based
accelerator for losslessly quantized convolutional neural
networks. In 2017 International Conference on Field
Programmable Technology (ICFPT). 295–298. DOI:
http://dx.doi.org/10.1109/FPT.2017.8280164
20. Yaman Umuroglu, Nicholas J. Fraser, Giulio
Gambardella, Michaela Blott, Philip Leong, Magnus
Jahre, and Kees Vissers. 2017. FINN: A Framework for
Fast, Scalable Binarized Neural Network Inference. In
Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (FPGA
’17). ACM, New York, NY, USA, 65–74. DOI:
http://dx.doi.org/10.1145/3020078.3021744
21. Dong Wang, Jianjing An, and Ke Xu. 2016a. PipeCNN:
An OpenCL-Based FPGA Accelerator for Large-Scale
Convolution Neuron Networks. CoRR abs/1611.02450
(2016). http://arxiv.org/abs/1611.02450
22. Zhengrong Wang, Fei Qiao, Zhen Liu, Yuxiang Shan,
Xunyi Zhou, Li Luo, and Huazhong Yang. 2016b.
Optimizing convolutional neural network on FPGA under
heterogeneous computing framework with OpenCL. In
2016 IEEE Region 10 Conference (TENCON).
3433–3438. DOI:
http://dx.doi.org/10.1109/TENCON.2016.7848692
23. Yuhui Xu, Yongzhuang Wang, Aojun Zhou, Weiyao Lin,
and Hongkai Xiong. 2018. Deep Neural Network
Compression With Single and Multiple Level
Quantization. (2018).
https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16479
24. Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei
Xing, Jeng-Hau Lin, Mani Srivastava, Rajesh Gupta, and
Zhiru Zhang. 2017. Accelerating Binarized Convolutional
Neural Networks with Software-Programmable FPGAs.
In Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays (FPGA
’17). ACM, New York, NY, USA, 15–24. DOI:
http://dx.doi.org/10.1145/3020078.3021741
25. Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He
Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low
bitwidth convolutional neural networks with low bitwidth
gradients. arXiv preprint arXiv:1606.06160 (2016).
26. Chenzhuo Zhu, Song Han, Huizi Mao, and William J
Dally. 2016. Trained ternary quantization. International
Conference on Learning Representations (ICLR) (2016).
