Memory Requirement Reduction of Deep Neural Networks Using Low-bit
  Quantization of Parameters by Nicodemo, Niccoló et al.
Niccolò Nicodemo†, Gaurav Naithani‡, Konstantinos Drossos‡,
Tuomas Virtanen‡, and Roberto Saletti†
†{n.nicodemo1@studenti., r.saletti@}unipi.it,
University of Pisa, Italy
‡{firstname.lastname}@tuni.fi,
Audio Research Group, Tampere University, Finland
Abstract
Effective employment of deep neural networks (DNNs) in mobile devices and embedded systems is
hampered by requirements for memory and computational power. This paper presents a non-uniform
quantization approach which allows for dynamic quantization of DNN parameters for different layers
and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy
of the proposed scheme. Our method reduces the memory requirements, preserving the performance
of the network. The performance of our method is validated in a speech enhancement application,
where a fully connected DNN is used to predict the clean speech spectrum from the input noisy
speech spectrum. A DNN is optimized, and its memory footprint and performance are evaluated using
the short-time objective intelligibility (STOI) metric. Applying the low-bit quantization reduces
the DNN memory footprint by 50% while the STOI performance drops by only 2.7%.
Keywords: neural network quantization, memory footprint reduction, FPGA, hardware accelerators
1 Introduction
Field programmable gate arrays (FPGAs) are widely used in mobile devices as they allow for the design of highly efficient systems with low-latency and low-power requirements. FPGAs are particularly useful for speeding up signal processing by using specifically designed hardware (called hardware accelerators) that runs in parallel with the main CPUs, which are usually embedded in the FPGA itself. Deep neural networks (DNNs) often set the state of the art in many signal processing tasks, e.g., speech separation [1, 2, 3] and speech recognition [4]. However, the memory footprint, memory bandwidth requirements, and associated power consumption of DNNs are an issue to be solved for the deployment of a DNN on an FPGA.
Two main approaches have been used to decrease
the memory requirements for neural networks: i)
changing the architecture of the network in order
to reduce the parameter number, and ii) quantizing
the parameters of the network to directly reduce the
amount of memory needed for storing them (i.e. re-
ducing the memory footprint) and the memory band-
width needed to read them. The first approach in-
volves methods like parameter pruning and sharing
[5], i.e., removing redundant weights or layers, knowl-
edge distillation [6], i.e., retrieving a smaller network
from a pretrained bigger one, and the use of low-rank
factorization [7] or specific convolutional filters [8].
All these methods produce networks with lower computational needs but require a modification of the architecture of the network itself. This is less desirable from the perspective of hardware deployment, since each change in the architecture affects the hardware design and may require designing specific hardware for a specific architecture. Additionally, all the above-mentioned methods require the optimization of the new DNN architecture.

arXiv:1911.00527v1 [eess.AS] 1 Nov 2019

The second approach quantizes the network parameters from a floating-point (e.g. 32-bit) to an n-bit fixed-point representation. In FPGAs, the parameters of a DNN are usually stored in external memories (e.g. flash memories). The access time of a flash memory can be a bottleneck and severely slow down the corresponding calculations for the DNN. Reducing the number of bits needed to store each DNN parameter reduces memory requirements and improves the execution speed. Furthermore, smaller, slower, and cheaper memories can be used by employing low-bit fixed-point arithmetic, which also reduces the power consumption [9]. However, parameter quantization can degrade the DNN performance and gives very poor results if too few bits are used (i.e. fewer than 8) [10]. Several quantization strategies have been tried, such as normalization [11], uniform and non-uniform quantization for different ranges of values [12], minimum mean squared error [10], weight clipping and bias correction [13], and different per-channel or per-layer scaling [13, 14, 15]. Mixed approaches have also been proposed, such as binarized neural networks [16], in which weights and activations are forced to the values -1 and +1, requiring a specific architecture and specific training for the network.
In this paper we consider the quantization of the weights of a DNN, focusing on the use case of FPGAs, and we propose a low-bit quantization method based on non-uniform and dynamic quantization methods [12, 14, 17, 18, 19]. Our approach distinguishes itself from earlier similar works by introducing a virtual bit shift (VBS) scheme that allows for dynamically adjusting the parameter representation for parameter ranges within the same layer as well as for different layers. VBS mitigates the drawbacks of a fixed-point quantization scheme and increases the accuracy, thereby reducing the performance loss. Our method encodes the parameters of the DNN with a probabilistic and hardware-oriented approach, using codes that can be stored in slow, external memories, while the actual values can be kept in FPGA-mapped lookup tables (LUTs). Specifically, we apply a quantization which stores 4-bit codes of the parameters in external memory, thus reducing the memory footprint by up to 50% compared to an 8-bit fixed-point representation of the parameters. The quantization technique is applied to a speech enhancement task, achieving the aforementioned footprint reduction with a performance reduction of only 2.7% in terms of STOI. Furthermore, using 4-bit codes reduces the bandwidth requirement too: halving the bit-width of the stored weights halves the bandwidth of the memory accesses, which often represents a bottleneck of the whole system.
2 Proposed Quantization Method
Our method takes as input the set of parameters Θ of a deep neural network (DNN), quantizes them to fixed-point values of m-bit width by applying a non-uniform quantization, and then encodes the m-bit values using the codes of an n-bit lookup table (LUT) that associates the n-bit wide codes with the m-bit wide values. The n-bit wide codes are stored in a slow, external memory, while the fixed-point m-bit wide values of the parameters of the network are kept in the FPGA memory and are retrieved using the LUT.
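
As a rough illustration of the storage scheme described above, the following Python sketch packs 4-bit codes two per byte (as they might sit in external memory) and decodes them through a 16-entry table. The function names, the byte layout (high nibble first), and the zero placeholder LUT entries are assumptions made for illustration; only the three values copied from Table 1 come from the paper.

```python
def pack_codes(codes):
    """Pack 4-bit codes two per byte, high nibble first (assumed layout)."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad with code 0 so the last byte is complete
    packed = bytearray()
    for hi, lo in zip(codes[0::2], codes[1::2]):
        packed.append((hi << 4) | lo)
    return bytes(packed)


def unpack_and_decode(packed, lut, count):
    """Read the packed codes back and map each one to its m-bit value."""
    values = []
    for byte in packed:
        values.append(lut[byte >> 4])
        values.append(lut[byte & 0x0F])
    return values[:count]  # drop the padding entry, if any


# Hypothetical 16-entry LUT: entries 0000, 0001 and 0100 are taken from
# Table 1, every other entry is a placeholder.
EXAMPLE_LUT = [0.0] * 16
EXAMPLE_LUT[0b0000] = -0.3359375
EXAMPLE_LUT[0b0001] = -0.15625
EXAMPLE_LUT[0b0100] = -0.0703125
```

With this layout, three weights occupy two bytes instead of the three bytes an 8-bit representation would need, while the decoded values keep their full m-bit precision.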
2.1 Quantization of parameters Θ
Any quantization scheme that converts the DNN parameters $\Theta$ from floating-point to fixed-point values leads to quantization errors and subsequent performance losses. The aim of any such scheme is to reduce this error to a minimum. As $\Theta$ is generally non-uniformly distributed, it seems appropriate to use a non-uniform scheme. Given $\Theta$, its range can be expressed as $A = [a_l, a_h]$, where $a_l \le \theta \le a_h$, $\theta \in \Theta$. We can use an n-bit encoding scheme for quantizing the range $A$ into discrete intervals, resulting in $2^n$ intervals $B = \{B_i\}_{i=1}^{2^n}$, with $B_i = [b_i^l, b_i^u]$ and $b_{i-1}^u < b_i^l < b_i^u$. For the purpose of illustration, we will from now on consider the cumulative distribution of parameters $\phi$, shown in Figure 1, as an example of a feedforward network where the parameter values are clamped to the range $[-1, 1]$. The quantization error corresponding to each interval is directly related to the interval span $\Delta_i = b_i^u - b_i^l$. We divide $A$ into two partitions: $B_{\mathrm{int}} \subset B$, termed the internal partition, where the parameters are densely concentrated, and $B_{\mathrm{ext}} \subset B$, termed the external partition, where the parameters are sparsely concentrated. The interval span in $B_{\mathrm{ext}}$ is larger than the interval span in $B_{\mathrm{int}}$, i.e. $B_i \in B_{\mathrm{int}}, B_e \in B_{\mathrm{ext}} \Rightarrow \Delta_i < \Delta_e$, and $B_{\mathrm{int}} \cap B_{\mathrm{ext}} = \emptyset$.

Figure 1: Cumulative distribution and partitions

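We use uniform quantization in $B_{\mathrm{int}}$ and non-uniform quantization in $B_{\mathrm{ext}}$. We define the ratio of the number of intervals in the internal and external partitions as $R_B = |B_{\mathrm{int}}| / |B_{\mathrm{ext}}|$, where $|\cdot|$ is the number of elements in a set, and the probability values $p_{\mathrm{start}}$ and $p_{\mathrm{stop}}$ denote the lower and upper boundaries of $B_{\mathrm{int}}$, respectively. We define the number of intervals $|B_{\mathrm{int}}|$ and $|B_{\mathrm{ext}}|$ as

$$|B_{\mathrm{ext}}| = \left\lfloor \frac{2^n}{1 + R_B} \right\rfloor + c, \quad \text{and} \tag{1}$$

$$|B_{\mathrm{int}}| = 2^n - |B_{\mathrm{ext}}|, \quad \text{where} \tag{2}$$

$$c = \begin{cases} 1, & \text{if } \left\lfloor \frac{2^n}{1 + R_B} \right\rfloor \text{ is odd} \\ 0, & \text{if } \left\lfloor \frac{2^n}{1 + R_B} \right\rfloor \text{ is even.} \end{cases} \tag{3}$$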
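
Eqs. (1)-(3) can be checked with a few lines of Python. This is a minimal sketch; the function name `partition_sizes` is hypothetical, and the remark that the correction term keeps the external count even (so it can be split between two tails) is an interpretation, not a statement from the paper.

```python
import math


def partition_sizes(n, ratio):
    """Return (|B_int|, |B_ext|) for an n-bit code and ratio R_B, per Eqs. (1)-(3)."""
    total = 2 ** n
    base = math.floor(total / (1 + ratio))
    c = 1 if base % 2 else 0   # Eq. (3): makes the external count even
    n_ext = base + c           # Eq. (1)
    n_int = total - n_ext      # Eq. (2)
    return n_int, n_ext
```

For the configuration used later in the evaluation (n = 4, R_B = 1), this gives 8 internal and 8 external intervals.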
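For the external partition, we uniformly split the range of $\phi$ and invert it back to get the set of intervals $B_i^{\mathrm{ext}}$ as

$$B_i^{\mathrm{ext}} = \left[ \phi^{-1}(i \cdot \Delta_i^{\phi}),\; \phi^{-1}([i+1] \cdot \Delta_i^{\phi}) \right), \tag{4}$$

where $\Delta_i^{\phi} = \frac{2 \cdot p_{\mathrm{start}}}{|B_{\mathrm{ext}}|}$ is the interval span in the range of $\phi$, and hence the corresponding $\Delta_i$ is non-uniform. Finally, we uniformly divide $B_{\mathrm{int}}$ with step $\Delta_i = \frac{\phi^{-1}(p_{\mathrm{stop}}) - \phi^{-1}(p_{\mathrm{start}})}{|B_{\mathrm{int}}|}$, thereby obtaining a set of intervals $B_i^{\mathrm{int}}$ as

$$B_i^{\mathrm{int}} = \left[ \phi^{-1}(p_{\mathrm{start}}) + i \cdot \Delta_i,\; \phi^{-1}(p_{\mathrm{start}}) + (i+1) \cdot \Delta_i \right). \tag{5}$$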
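
The construction of Eqs. (4) and (5) can be sketched in Python as follows. Several things here are assumptions rather than the paper's implementation: the empirical quantile of the sorted parameters stands in for $\phi^{-1}$; Eq. (4) as printed covers the lower tail, and the upper tail is treated the same way with half of the external intervals on each side; and the Gaussian test weights and the helper names are purely illustrative.

```python
import random


def quantile(sorted_vals, p):
    """Empirical inverse CDF: value at cumulative probability p (stand-in for phi^-1)."""
    idx = min(int(p * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]


def build_edges(params, n_int, n_ext, p_start, p_stop):
    """Return the 2^n + 1 interval edges: lower tail, internal part, upper tail."""
    vals = sorted(params)
    half = n_ext // 2                      # assumed: external intervals split over both tails
    d_low = 2 * p_start / n_ext            # probability step of Eq. (4)
    d_up = (1 - p_stop) / half             # same step when p_stop = 1 - p_start
    lower = [quantile(vals, i * d_low) for i in range(half)]
    lo, hi = quantile(vals, p_start), quantile(vals, p_stop)
    step = (hi - lo) / n_int               # uniform step of Eq. (5)
    inner = [lo + i * step for i in range(n_int)]
    upper = [quantile(vals, p_stop + i * d_up) for i in range(half)]
    return lower + inner + upper + [vals[-1]]


# Illustrative data only: Gaussian weights clamped to [-1, 1].
random.seed(0)
weights = [max(-1.0, min(1.0, random.gauss(0.0, 0.2))) for _ in range(10000)]
edges = build_edges(weights, n_int=8, n_ext=8, p_start=0.04, p_stop=0.96)
```

The internal edges are equally spaced in value, while the external edges are equally spaced in probability, so their spacing in value follows the shape of the tails.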
For any such interval $B_i$, the quantized level $\tilde{\theta}_i$ can be computed as the m-bit quantized mean of the parameters lying in the interval $B_i$ as

$$\tilde{\theta}_i = \left. \frac{\sum_{\theta \in \Theta} \mathbb{1}_{\theta \in B_i}\, \theta}{\sum_{\theta \in \Theta} \mathbb{1}_{\theta \in B_i}} \right|_{m\text{-bit}}, \quad \text{where} \tag{6}$$

$$\mathbb{1}_{\Xi} = \begin{cases} 1, & \text{if } \Xi, \\ 0, & \text{otherwise,} \end{cases} \tag{7}$$

and $\cdot|_{m\text{-bit}}$ means the m-bit representation. Using $\tilde{\theta}_i$ from Eq. (6), we can quantize the parameters $\Theta$ of a DNN with an m-bit representation. The finite number of values assumed by $\tilde{\theta}_i$ enables the reduction of the memory word length from m to n, with n < m. This is achieved through a lookup table (LUT) which stores the relationship between the n-bit code and its corresponding m-bit value and partition (i.e. external or internal). An example of such a LUT is shown in Table 1.

Figure 2: Example with the resolution of quantization $\Delta_i$ being smaller than the resolution of m-bit encoding $\delta_i$. (Axis labels: quantization intervals, bit resolution.)

Table 1: LUT of the n-bit code, the m-bit wide $\tilde{\theta}_i$ (in decimal), and the corresponding partition.

Code    $\tilde{\theta}_i$ (decimal)    Partition
0000    -0.3359375                      ext
0001    -0.15625                        ext
0100    -0.0703125                      int
...     ...                             ...
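
Eq. (6) and the resulting lookup table can be sketched as below. The signed fixed-point format with m-1 fractional bits is inferred from the values in Table 1 (for example, -0.3359375 = -43/128 for m = 8); the midpoint fallback for an empty interval, the saturation, and all function names are assumptions added for illustration.

```python
import bisect


def quantize_to_m_bits(value, m):
    """Round to signed fixed point with m-1 fractional bits, saturating at the range limits."""
    scale = 2 ** (m - 1)
    q = max(-scale, min(scale - 1, round(value * scale)))
    return q / scale


def encode(theta, edges):
    """Index of the half-open interval [edges[i], edges[i+1]) containing theta."""
    i = bisect.bisect_right(edges, theta) - 1
    return min(max(i, 0), len(edges) - 2)


def build_lut(params, edges, m):
    """Eq. (6): m-bit quantized mean of the parameters falling in each interval."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for theta in params:
        i = encode(theta, edges)
        sums[i] += theta
        counts[i] += 1
    lut = []
    for i, (s, c) in enumerate(zip(sums, counts)):
        # Assumption: an interval with no parameters falls back to its midpoint.
        mean = s / c if c else 0.5 * (edges[i] + edges[i + 1])
        lut.append(quantize_to_m_bits(mean, m))
    return lut
```

A parameter is then stored as the n-bit index returned by `encode`, and its value is recovered as `lut[index]`.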
2.2 Virtual bit shift
The encoding scheme using $B$ is heavily dependent on the choice of the parameters $p_{\mathrm{start}}$ and $p_{\mathrm{stop}}$ (the boundaries of $B_{\mathrm{int}}$). The smallest number that can be represented using a signed m-bit encoding, i.e., the resolution of the encoding scheme $\delta_i$, is $2^{-(m-1)}$ and hence depends on the bit-width. If the span from $p_{\mathrm{start}}$ to $p_{\mathrm{stop}}$ is very narrow, we may end up with an interval span $\Delta_i$ smaller than $\delta_i$. In that case, adjacent intervals will map to the same m-bit value, as the resolution of the encoding scheme $\delta_i$ is coarser than the interval span $\Delta_i$. Figure 2 depicts such a situation, where $\tilde{\theta}_i$ and $\tilde{\theta}_{i+1}$ are the quantized parameter values corresponding to the $i$-th and $(i+1)$-th interval, respectively, and share the same m-bit representation. To avoid this, we need $\delta_i < \Delta_i$.

Table 2: An example of LUT with virtual bit shift for n = 4, m = 8, and k = 4.

Code    Value         Binary value (12 bit)    LUT value
0100    0.02099609    .000001010110            01010110

We propose a different quantization scheme for $B_{\mathrm{int}}$. Since $\delta_i$ depends on the bit-width of the encoding, a higher resolution can in principle be achieved by quantizing $\tilde{\theta}_i$ using $m+k$ bits. Since for all $\tilde{\theta}_i \in B_{\mathrm{int}}$ we have $|\tilde{\theta}_i| \le \max(|\phi^{-1}(p_{\mathrm{start}})|, |\phi^{-1}(p_{\mathrm{stop}})|)$, a variable $k \in \mathbb{N}$ can be found so that $|\tilde{\theta}_i| < 2^{-k}$. This implies that the $k$ most significant bits will contain either zeros or the sign bit and can be considered redundant for storage purposes. The sign bit can be stored in the n-bit indexing code itself. Storing only the $m$ least significant bits of an $(m+k)$-bit representation of $\tilde{\theta}_i$ can be thought of as a shift of $k$ bits to the left, which implies multiplication by $2^k$ in binary arithmetic. Let us denote the $m$ least significant bits of $\tilde{\theta}_i$ as $\tilde{\theta}_i^m$, so that we have

$$\tilde{\theta}_i^m = \tilde{\theta}_i \cdot 2^k. \tag{8}$$

We can store $\tilde{\theta}_i^m$, from which the actual parameter values $\tilde{\theta}_i$ can be retrieved using Eq. (8). In essence, we perform a range adjustment by a virtual bit shift of the actual parameter values. An example is shown in Table 2. The resolution error can now be avoided by observing a less stringent condition than before, namely $\delta_i^{m+k} < \Delta_i$, where $\delta_i^{m+k}$ is the resolution of the $(m+k)$-bit encoding. Thus absolute values, signs, and representation range information can be stored in the same code and conversion table mapped in an FPGA-embedded LUT. An n-bit quantization is obtained in which the actual parameter values are not bound to uniformly quantized values, but can be chosen so as to reduce errors.
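
The effect of the virtual bit shift can be illustrated with the numbers of Table 2. The sketch below follows the convention that the table appears to use (an unsigned magnitude with m + k fractional bits, of which the lowest m bits are stored, and the sign kept separately); that reading, the function names, and the two test values 0.020 and 0.021 are assumptions for illustration.

```python
def plain_encode(theta, m):
    """Signed m-bit fixed point without VBS: resolution 2^-(m-1)."""
    return round(theta * 2 ** (m - 1))


def vbs_encode(theta, m, k):
    """Keep the m least significant bits of an (m+k)-fractional-bit magnitude."""
    stored = round(abs(theta) * 2 ** (m + k))
    if stored >= 2 ** m:
        raise ValueError("magnitude too large for this k: the top k bits must be zero")
    return theta < 0, stored  # sign flag (kept in the n-bit code) and m-bit LUT value


def vbs_decode(negative, stored, m, k):
    """Undo the virtual shift: the stored integer is the value scaled by 2^(m+k)."""
    magnitude = stored / 2 ** (m + k)
    return -magnitude if negative else magnitude
```

With m = 8 and k = 4, `vbs_encode(0.02099609, 8, 4)` yields the stored integer 86, i.e. the bit pattern 01010110 listed in Table 2. Two nearby levels such as 0.020 and 0.021 collapse to the same integer under `plain_encode` with m = 8, but remain distinct after the shift.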
3 DNN-based speech enhancement
We apply the proposed quantization scheme to a speech enhancement task using a feedforward DNN. Input noisy mixtures are represented using the magnitude of the short-time Fourier transform (STFT); the STFT values are scaled so that their magnitude and phase can be properly calculated with a coordinate rotation digital computer (CORDIC) algorithm [20] based on integer arithmetic. $N$ frames of these features, $\{\tilde{x}_{t-N+1}, \tilde{x}_{t-N+2}, \ldots, \tilde{x}_t\}$, are first stacked together and then fed to the DNN to estimate the denoised (clean) speech magnitude spectrum $x_t$. The features are stacked to allow the DNN to implicitly model temporal dependencies. The CORDIC algorithm is then applied again to $x_t$ to restore the phase information extracted from the mixture features. The values thus obtained are scaled back and converted to time-domain speech via the inverse fast Fourier transform (IFFT) and overlap-add.
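
For readers unfamiliar with CORDIC, the sketch below shows the textbook vectoring mode, which converts a complex STFT coefficient into magnitude and phase using only additions and bit shifts on the integer coordinates. It is not the authors' implementation: the iteration count, the floating-point angle accumulator, and the final gain division are simplifications chosen for readability (hardware would typically use a fixed-point angle table and fold the gain into a later scaling step).

```python
import math

ITERATIONS = 16
# Elementary rotation angles atan(2^-i) and the accumulated CORDIC gain.
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
GAIN = 1.0
for _i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * _i))


def cordic_polar(re, im):
    """Vectoring-mode CORDIC on integer inputs: returns (magnitude, phase in radians)."""
    x, y, z = re, im, 0.0
    if x < 0:
        # Pre-rotate by 180 degrees so the vector lies in the right half-plane.
        z = math.pi if y >= 0 else -math.pi
        x, y = -x, -y
    for i in range(ITERATIONS):
        if y > 0:   # rotate clockwise, accumulate a positive angle
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
        else:       # rotate counter-clockwise, accumulate a negative angle
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
    return x / GAIN, z
```

Scaling the inputs up before the call (for instance by 2^16) keeps the truncation introduced by the shifts small, which is presumably why the text above scales the STFT values first.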
4 Evaluation
For evaluation, synthetic mixtures are created using the Wall Street Journal (WSJ0) dataset for speech and the TUT Acoustic Scenes 2016 development dataset [21] for noise. The latter consists of sound recordings from 15 real-world environments, e.g., cafe, train, metro station, etc. A random speech signal is selected, and an equal-length noise segment is sampled from the noise signal. The training and validation data consist of about 12,000 mixtures (around 20 hours) and 5,000 mixtures (around 8 hours), respectively. Similarly, the test data consists of about 2,800 mixtures (around 5 hours). The speech and noise signals are mixed with a signal-to-noise ratio (SNR) chosen randomly from the set {0, 5} dB. The native sampling rate of the noise signals is 44.1 kHz, which is down-sampled to 8 kHz, the native sampling rate of the WSJ0 audio. The short-time objective intelligibility (STOI) [22] metric is used as a measure of the intelligibility of the enhanced speech.
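
Mixing at a prescribed SNR is commonly done by rescaling the noise so that the ratio of average powers matches the target. The paper does not state how the scaling was performed, so the following is only one plausible sketch; the function names and the choice to scale the noise rather than the speech are assumptions.

```python
import math


def power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled_noise = [gain * x for x in noise]
    mixture = [s + x for s, x in zip(speech, scaled_noise)]
    return mixture, scaled_noise
```

Calling `mix_at_snr(speech, noise, 5.0)` with two equal-length sample lists returns the mixture together with the rescaled noise, which makes it easy to verify the achieved SNR.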
The STFT features are extracted with a Hann window of 128 samples (16 ms) with 50% overlap. Eight input frames are stacked and fed to a two-layer feedforward network with 256 and 129 neurons in the input and hidden layer, respectively. The rectified linear unit is used as the non-linearity for each layer. Adam optimization [23] with default parameters is used. The PyTorch [24] library is used for training the networks, and the Librosa [25] library is used for audio processing. Since our focus is on fixed-point arithmetic devices, the network weights and biases are clamped to the range (-1, +1) in order to avoid overflow and reduce the number of bits needed for a correct numeric representation in the network's operations. The DNN weights have been quantized using n = 4, m = 8, and $k_{\mathrm{1st\,layer}} = 3$, $k_{\mathrm{2nd\,layer}} = 2$, obtained by choosing $R_B = 1$, $p_{\mathrm{start}} = 0.04$, and $p_{\mathrm{stop}} = 0.96$. The ratio $R_B$, $p_{\mathrm{start}}$, and $p_{\mathrm{stop}}$ have been chosen empirically after optimizing for the test data. The biases used are the 8-bit uniform quantized values.

Table 3: Evaluation results in terms of STOI and memory reduction. m-U stands for m-bit uniform and m-NU for m-bit non-uniform quantization. Memory reduction is with respect to 8-bit fixed-point uniform quantization as reference.

Quantization scheme        STOI / STOI loss          Memory usage (bytes) / reduction
1st layer    2nd layer
8-U          8-U           0.87 / -0.002 (0.2%)      297216 / –
4-U          4-U           0.63 / -0.246 (28.2%)     148608 / -50.0%
4-NU         4-NU          0.85 / -0.024 (2.7%)      148608 / -50.0%
4-NU         8-U           0.86 / -0.016 (1.8%)      165120 / -44.4%
No quantization            0.87 / –
Mix                        0.84 / –

Figure 3: Comparison in terms of STOI performance of different quantization approaches being swept on the first/second DNN layer while keeping the other at 8-bit uniform quantization.

Figure 4: Comparison in terms of STOI perfor-
mance of different quantization approaches with 4/8-
bit quantization for first and second layer.
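
The memory figures in Table 3 can be reproduced with simple arithmetic. The layer shapes used below are an inference, not something the text states: they assume 129 frequency bins per frame, so that eight stacked frames give 1032 inputs feeding 256 units, followed by a 256-to-129 layer, and they count weights only (no biases). Under those assumptions the byte counts match the table.

```python
# Assumed weight-matrix shapes (inputs, outputs); see the hedges in the text above.
LAYER_1 = (8 * 129, 256)
LAYER_2 = (256, 129)


def layer_bytes(shape, bits):
    """Bytes needed to store one weight matrix at the given bit-width."""
    n_in, n_out = shape
    return n_in * n_out * bits // 8


def footprint(bits_layer_1, bits_layer_2):
    """Total weight storage in bytes for the two layers."""
    return layer_bytes(LAYER_1, bits_layer_1) + layer_bytes(LAYER_2, bits_layer_2)


def reduction_percent(bits_layer_1, bits_layer_2):
    """Memory reduction relative to the 8-bit / 8-bit reference, in percent."""
    return 100.0 * (1.0 - footprint(bits_layer_1, bits_layer_2) / footprint(8, 8))
```

Because the first layer holds most of the weights, quantizing only that layer to 4 bits already recovers most of the saving (44.4% out of the possible 50%).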
5 Results
Table 3 compares the STOI values obtained with different approaches over 2800 noisy samples. 8-bit uniform quantization gives good results, with a very small degradation in STOI (0.21%) compared to the non-quantized network, while 4-bit uniform quantization leads to a drastic fall in performance, producing a result that is even less intelligible than the input noisy signal. On the other hand, the 4-bit non-uniform quantization proposed in this paper yields a better STOI than the noisy mixtures and is only 2.7% worse than the non-quantized network, while halving the memory footprint in comparison to 8-bit uniform quantization and simultaneously decreasing the memory bandwidth requirement.
Figure 3 compares the different approaches using 8-bit quantization (uniform and non-uniform) for different layers, and shows how the proposed quantization scheme, consisting of range split (RS) and virtual bit shift (VBS), affects the performance. For each simulation, one of the two layers is kept at 8-bit uniform quantization while the other is swept over the following four approaches: uniform quantization (U), uniform quantization with virtual bit shift (UVBS), range split (RS), and range split with virtual bit shift (RSVB). For the first-layer sweep, the performance clearly suffers when no VBS is used, because $\delta_i > \Delta_i$. Figure 4 shows the effect of these approaches for two cases: 4-bit quantization for both layers, and 4-bit quantization for the first layer with 8-bit quantization for the second layer.
6 Conclusions
This work proposes a low-bit quantization method inspired by the companding approach. It achieves a good trade-off between performance and resource requirements in a hardware implementation of a DNN and is thus very appealing for FPGA applications. The method does not require any change or pruning of the network, so no retraining is needed. The case studied, a two-layer feedforward neural network, shows that a substantial reduction of the memory requirements (50%) is obtained with only a slight reduction of the performance. Further research should address the application of the method to deeper networks and the use of a non-symmetrical range split or of a custom multiplying architecture for the weighting of the input values.
Acknowledgement
The authors wish to acknowledge CSC-IT Center for
Science, Finland, for computational resources.
References
[1] Y. Luo and N. Mesgarani, “Conv-tasnet: Sur-
passing ideal time–frequency magnitude mask-
ing for speech separation,” IEEE/ACM Trans-
actions on Audio, Speech, and Language Process-
ing, vol. 27, no. 8, pp. 1256–1266, 2019.
[2] D. Wang and J. Chen, “Supervised speech sep-
aration based on deep learning: An overview,”
IEEE/ACM Transactions on Audio, Speech, and
Language Processing, vol. 26, no. 10, pp. 1702–
1726, 2018.
[3] Z.-Q. Wang, J. Le Roux, and J. R. Hershey,
“Alternative objective functions for deep clus-
tering,” in Proc. IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP), 2018, pp. 686–690.
[4] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prab-
havalkar, P. Nguyen, Z. Chen, A. Kannan, R. J.
Weiss, K. Rao, E. Gonina et al., “State-of-the-
art speech recognition with sequence-to-sequence
models,” in 2018 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2018, pp. 4774–4778.
[5] S. Han, H. Mao, and W. J. Dally, “Deep compres-
sion: Compressing deep neural networks with
pruning, trained quantization and huffman cod-
ing,” arXiv preprint arXiv:1510.00149, 2015.
[6] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil,
“Model compression,” in Proceedings of the
12th ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM,
2006, pp. 535–541.
[7] T. N. Sainath, B. Kingsbury, V. Sindhwani,
E. Arisoy, and B. Ramabhadran, “Low-rank ma-
trix factorization for deep neural network train-
ing with high-dimensional output targets,” in
2013 IEEE international conference on acous-
tics, speech and signal processing. IEEE, 2013,
pp. 6655–6659.
[8] Y. Cheng, D. Wang, P. Zhou, and T. Zhang,
“A survey of model compression and accelera-
tion for deep neural networks,” arXiv preprint
arXiv:1710.09282, 2017.
[9] W. Dally, “High-performance hardware for ma-
chine learning,” NIPS Tutorial, 2015.
[10] Y. Choukroun, E. Kravchik, and P. Kisilev,
“Low-bit quantization of neural networks
for efficient inference,” arXiv preprint
arXiv:1902.06822, 2019.
[11] R. A. Solovyev, A. A. Kalinin, A. G. Kustov,
D. V. Telpukhov, and V. S. Ruhlov, “FPGA im-
plementation of convolutional neural networks
with fixed-point calculations,” arXiv preprint
arXiv:1808.09945, 2018.
[12] E. Park, S. Yoo, and P. Vajda, “Value-aware
quantization for training and inference of neural
networks,” in Proceedings of the European Con-
ference on Computer Vision (ECCV), 2018, pp.
580–595.
[13] R. Banner, Y. Nahshan, E. Hoffer, and
D. Soudry, “Post training 4-bit quantization
of convolution networks for rapid-deployment,”
arXiv preprint arXiv:1810.05723, 2018.
[14] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou,
J. Yu, T. Tang, N. Xu, S. Song et al., “Going
deeper with embedded FPGA platform for convolutional
neural network,” in Proceedings of the
2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2016,
pp. 26–35.
[15] P. Judd, J. Albericio, T. Hetherington,
T. Aamodt, N. E. Jerger, R. Urtasun, and
A. Moshovos, “Reduced-precision strategies for
bounded memory in deep neural nets,” arXiv
preprint arXiv:1511.05236, 2015.
[16] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv,
and Y. Bengio, “Binarized neural networks:
Training deep neural networks with
weights and activations constrained to +1 or -1,”
arXiv preprint arXiv:1602.02830, 2016.
[17] S. Seo and J. Kim, “Efficient weights quantiza-
tion of convolutional neural networks using ker-
nel density estimation based non-uniform quan-
tizer,” Applied Sciences, vol. 9, p. 2559, 06 2019.
[18] N. Liss, C. Baskin, A. Mendelson, A. M. Bron-
stein, and R. Giryes, “Efficient non-uniform
quantizer for quantized neural network target-
ing reconfigurable hardware,” arXiv preprint
arXiv:1811.10869, 2018.
[19] H. Tann, S. Hashemi, R. I. Bahar, and
S. Reda, “Hardware-software codesign of accu-
rate, multiplier-free deep neural networks,” in
2017 54th ACM/EDAC/IEEE Design Automa-
tion Conference (DAC), June 2017, pp. 1–6.
[20] J. E. Volder, “The CORDIC trigonometric computing
technique,” IRE Transactions on Electronic
Computers, no. 3, pp. 330–334, 1959.
[21] A. Mesaros, T. Heittola, and T. Virtanen, “TUT
acoustic scenes 2016, development dataset,” Feb
2016.
[22] C. H. Taal, R. C. Hendriks, R. Heusdens, and
J. Jensen, “A short-time objective intelligibil-
ity measure for time-frequency weighted noisy
speech,” in 2010 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing.
IEEE, 2010, pp. 4214–4217.
[23] D. Kingma and J. Ba, “Adam: A method for
stochastic optimization,” in Proc. International
Conference on Learning Representations, 2014.
[24] A. Paszke, S. Gross, S. Chintala, G. Chanan,
E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, and A. Lerer, “Automatic differenti-
ation in PyTorch,” in NIPS Autodiff Workshop,
2017.
[25] B. McFee et al., “librosa 0.5.0,” Feb. 2017.
[Online]. Available: https://doi.org/10.5281/
zenodo.293021
