Effective deployment of deep neural networks (DNNs) in mobile devices and embedded systems is hampered by their memory and computational requirements. This paper presents a non-uniform quantization approach that allows dynamic quantization of DNN parameters across different layers and within the same layer. A virtual bit shift (VBS) scheme is also proposed to improve the accuracy of the quantization. Our method reduces the memory requirements while preserving the performance of the network. The method is validated in a speech enhancement application, where a fully connected DNN is used to predict the clean speech spectrum from the input noisy speech spectrum. A DNN is optimized, and its memory footprint and performance are evaluated using the short-time objective intelligibility (STOI) metric. The low-bit quantization allows a 50% reduction of the DNN memory footprint, while the STOI performance drops by only 2.7%.
Introduction
Field programmable gate arrays (FPGAs) are widely used in mobile devices as they allow the design of highly efficient systems with low-latency and low-power requirements. FPGAs are particularly useful for speeding up signal processing by means of specifically designed hardware (called hardware accelerators) that runs in parallel with the main CPUs, which are usually embedded in the FPGA itself. Deep neural networks (DNNs) often set the state of the art in many signal processing tasks, e.g., speech separation [1, 2, 3], speech recognition [4], etc. However, the memory footprint, the memory bandwidth requirements, and the associated power consumption of DNNs are an issue that must be solved before a DNN can be deployed on an FPGA. Two main approaches have been used to decrease the memory requirements of neural networks: i) changing the architecture of the network in order to reduce the number of parameters, and ii) quantizing the parameters of the network to directly reduce the amount of memory needed for storing them (i.e. reducing the memory footprint) and the memory bandwidth needed to read them. The first approach involves methods like parameter pruning and sharing [5], i.e., removing redundant weights or layers, knowledge distillation [6], i.e., deriving a smaller network from a pretrained bigger one, and the use of low-rank factorization [7] or specific convolutional filters [8]. All these methods produce networks with lower computational needs but require a modification of the network architecture itself. This is less desirable from the perspective of hardware deployment, since each change in the architecture affects the hardware design and may require designing specific hardware for a specific architecture. Additionally, all the above-mentioned methods require the optimization of the new DNN architecture.
The second approach quantizes the network parameters from a floating-point (e.g. 32-bit) to an n-bit fixed-point representation. In FPGAs the parameters of a DNN are usually stored in external memories (i.e. flash memories). The access time of a flash memory can be a bottleneck and can severely slow down the corresponding DNN calculations. Reducing the number of bits needed to store each DNN parameter reduces the memory requirements and improves the execution speed. Furthermore, low-bit fixed-point arithmetic allows smaller, slower, and cheaper memories to be used, which also reduces the power consumption [9]. However, parameter quantization can degrade the DNN performance and lead to very poor results if too few bits are used (i.e. fewer than 8) [10]. Several quantization strategies have been tried, such as normalization [11], uniform and non-uniform quantization for different ranges of values [12], minimum mean squared error [10], weight clipping and bias correction [13], and different per-channel or per-layer scaling [13, 14, 15]. Mixed approaches have been proposed too, like binarized neural networks [16], in which weights and activations are forced to the values -1 and +1, requiring a specific architecture and a specific training procedure for the network.
In this paper we consider the quantization of the weights of a DNN. We focus on the FPGA use case and propose a low-bit quantization method based on non-uniform and dynamic quantization methods [12, 14, 17, 18, 19]. Our approach differs from earlier similar works by introducing a virtual bit shift (VBS) scheme that dynamically adjusts the parameter representation for parameter ranges within the same layer as well as for different layers. VBS mitigates the drawbacks of the fixed-point quantization scheme and increases the accuracy, thereby reducing the performance loss. Our method encodes the parameters of the DNN with a probability-based and hardware-oriented approach, using codes that can be stored in slow, external memories, while the actual values can be kept in FPGA-mapped lookup tables (LUTs). Specifically, we apply a quantization which stores 4-bit codes of the parameters in external memory, thus reducing the memory footprint by up to 50% compared to an 8-bit fixed-point representation of the parameters. The quantization technique is applied to a speech separation task, achieving the aforementioned footprint reduction with a performance reduction of only 2.7% in terms of STOI. Using 4-bit codes also reduces the bandwidth requirement: halving the bit-width of the stored weights halves the bandwidth of the memory accesses, which is often a bottleneck of the whole system.
Proposed Quantization Method
Our method takes as input the set of parameters $\Theta$ of a deep neural network (DNN), quantizes them to fixed-point values of $m$-bit width by applying a non-uniform quantization, and then encodes the $m$-bit values using the codes of an $n$-bit lookup table (LUT), which associates the $n$-bit wide codes with the $m$-bit wide values. The $n$-bit wide codes are stored in an external, slow memory, while the fixed-point $m$-bit wide values of the network parameters are kept in the FPGA memory and are retrieved using the LUT.
Quantization of parameters Θ
Any quantization scheme that converts the DNN parameters $\Theta$ from floating-point to fixed-point values introduces quantization errors and consequent performance losses. The aim of any such scheme is to keep this error to a minimum. As $\Theta$ is generally non-uniformly distributed, a non-uniform scheme is appropriate. Given $\Theta$, its range can be expressed as $A = [a_l, a_h]$, where $a_l \le \theta \le a_h$ for all $\theta \in \Theta$. We can use an $n$-bit encoding scheme to quantize the range $A$ into discrete intervals, resulting in a set $B = \{B_i\}$ of $2^n$ intervals, one per code.
For the purpose of illustration, we will from now on consider the cumulative distribution $\phi$ of the parameters, shown in Figure 1.
We use uniform quantization in the internal partition $B_{int}$ and non-uniform quantization in the external partition $B_{ext}$. We define the ratio of the number of intervals in the internal and external partitions as $R_B = |B_{int}|/|B_{ext}|$, where $|\cdot|$ is the number of elements in a set, and the probability values $p_{start}$ and $p_{stop}$ as the lower and upper boundaries of $B_{int}$, respectively. We define the number of intervals $|B_{int}|$ and $|B_{ext}|$ as
For the external partition, we uniformly split the range of φ and invert it back to get the set of intervals
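where $\Delta\phi_i = \frac{2 \cdot p_{start}}{|B_{ext}|}$ is the interval span in the range of $\phi$, and hence the corresponding $\Delta_i$ is non-uniform. Finally, we uniformly divide $B_{int}$ with step
$$\Delta_i = \frac{\phi^{-1}(p_{stop}) - \phi^{-1}(p_{start})}{|B_{int}|}.$$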
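As a worked illustration of the interval bookkeeping, the counts and the probability span can be evaluated for the settings reported in the evaluation ($n = 4$, $R_B = 1$, $p_{start} = 0.04$). This is only a sketch under two assumptions that are not stated explicitly in the text: the two partitions together use all $2^n$ available codes, and the range split is symmetric, i.e. $p_{stop} = 1 - p_{start}$, which is what makes the external probability mass equal to $2 \cdot p_{start}$.

```latex
\begin{align*}
|B_{int}| + |B_{ext}| = 2^n, \quad R_B = \frac{|B_{int}|}{|B_{ext}|}
  &\;\Rightarrow\; |B_{ext}| = \frac{2^n}{1 + R_B}, \quad
                   |B_{int}| = \frac{R_B \, 2^n}{1 + R_B}, \\
n = 4,\; R_B = 1
  &\;\Rightarrow\; |B_{int}| = |B_{ext}| = 8, \\
\Delta\phi_i = \frac{2 \cdot p_{start}}{|B_{ext}|}
  &= \frac{2 \cdot 0.04}{8} = 0.01 .
\end{align*}
```

Under these assumptions each external interval holds about 1% of the parameters (four intervals per tail), while the eight internal intervals evenly divide the value range that contains the remaining 92%.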
For any such interval $B_i$, the quantized level $\hat{\theta}_i$ can be computed as the $m$-bit quantized mean of the parameters lying in the $B_i$ interval as
$$\hat{\theta}_i = \operatorname{mean}\{\theta \in \Theta : \theta \in B_i\}\big|_{m\text{-bit}}, \tag{6}$$
where $\operatorname{mean}\{\cdot\}$ is the arithmetic mean and $\cdot|_{m\text{-bit}}$ means the $m$-bit representation. Using $\hat{\theta}_i$ from Eq. (6), we can quantize the parameters $\Theta$ of a DNN with an $m$-bit representation. The finite number of values assumed by $\hat{\theta}_i$ enables the reduction of the memory word length from $m$ to $n$, with $n < m$. This is achieved through a lookup table (LUT) which stores the relationship between the $n$-bit code and its corresponding $m$-bit value and partition (i.e. external or internal). An example of such a LUT is shown in Table 1.
Figure 2: Example with the resolution of quantization $\Delta_i$ being smaller than the resolution of $m$-bit encoding $\delta_i$.
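The range split and the LUT construction described in this section can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the empirical quantiles stand in for $\phi^{-1}$, the two tails are assumed to share $B_{ext}$ equally, the partitions are assumed to use all $2^n$ codes, and an empty interval falls back to its midpoint.

```python
import numpy as np

def range_split_edges(theta, n=4, ratio=1.0, p_start=0.04, p_stop=0.96):
    """Return the 2**n + 1 interval edges of the range split.

    Outside [p_start, p_stop] the edges are uniform in probability
    (non-uniform in value); inside they are uniform in value.
    """
    total = 2 ** n
    n_ext = int(round(total / (1.0 + ratio)))  # assumes |B_int| + |B_ext| = 2**n
    n_int = total - n_ext
    half = n_ext // 2                          # assumes both tails get equal shares
    lo_edges = np.quantile(theta, np.linspace(0.0, p_start, half + 1))
    hi_edges = np.quantile(theta, np.linspace(p_stop, 1.0, n_ext - half + 1))
    int_edges = np.linspace(lo_edges[-1], hi_edges[0], n_int + 1)
    return np.concatenate([lo_edges[:-1], int_edges, hi_edges[1:]])

def quantize_fixed(x, m=8):
    """Round to signed m-bit fixed point with resolution 2**-(m-1)."""
    scale = 2 ** (m - 1)
    return np.clip(np.round(x * scale), -scale, scale - 1) / scale

def build_lut(theta, edges, m=8):
    """Assign each parameter an interval code and build the code -> value LUT."""
    n_codes = len(edges) - 1
    codes = np.clip(np.searchsorted(edges, theta, side="right") - 1, 0, n_codes - 1)
    lut = np.zeros(n_codes)
    for i in range(n_codes):
        members = theta[codes == i]
        # Quantized level: m-bit quantized mean of the parameters in interval i.
        centre = members.mean() if members.size else 0.5 * (edges[i] + edges[i + 1])
        lut[i] = quantize_fixed(centre, m)
    return codes.astype(np.uint8), lut
```

With these helpers, `codes` is what would be written to external memory and `lut[codes]` recovers the quantized parameter values.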
Virtual bit shift
The encoding scheme using $B$ depends heavily on the choice of the parameters $p_{start}$ and $p_{stop}$ (the boundaries of $B_{int}$). The smallest non-zero magnitude that can be represented using a signed $m$-bit encoding, i.e., the resolution $\delta_i$ of the encoding scheme, is $2^{-(m-1)}$ and hence depends on the bit-width. If the span from $p_{start}$ to $p_{stop}$ is very narrow, we may end up with an interval span $\Delta_i$ smaller than $\delta_i$. In that case, adjacent intervals map to the same $m$-bit value, because the resolution $\delta_i$ of the encoding scheme is coarser than the interval partition $\Delta_i$. Figure 2 depicts such a situation, where $\hat{\theta}_i$ and $\hat{\theta}_{i+1}$ are the quantized parameter values corresponding to the $i$-th and $(i+1)$-th intervals, respectively, and share the same $m$-bit representation. To avoid this, we need $\delta_i < \Delta_i$. We therefore propose a different quantization scheme for $B_{int}$. Since $\delta_i$ depends on the bit-width of the encoding, a higher resolution can in principle be achieved by quantizing $\hat{\theta}_i$ using $m + k$ bits. Since for all $\hat{\theta}_i \in B_{int}$ we have $|\hat{\theta}_i| \le \max(|\phi^{-1}(p_{start})|, |\phi^{-1}(p_{stop})|)$, a value $k \in \mathbb{N}$ can be found such that $|\hat{\theta}_i| < 2^{-k}$. This implies that the $k$ most significant bits contain either zeros or the sign bit and can be considered redundant for storage purposes. The sign bit can be stored in the $n$-bit indexing code itself. Storing only the $m$ least significant bits of an $(m + k)$-bit representation of $\hat{\theta}_i$ can be thought of as a shift of $k$ bits to the left, which corresponds to a multiplication by $2^k$ in binary arithmetic. Let us denote the $m$ least significant bits of $\hat{\theta}_i$ as $\hat{\theta}_i^m$, so that we have
$$\hat{\theta}_i = \hat{\theta}_i^m \cdot 2^{-k}. \tag{8}$$
We can store $\hat{\theta}_i^m$, from which the actual parameter values $\hat{\theta}_i$ can be retrieved using Eq. (8). In essence, we perform a range adjustment by a virtual bit shift of the actual parameter values. An example is shown in Table 2. The resolution error can now be avoided by satisfying a less stringent condition than before, namely $\delta_i^{m+k} < \Delta_i$, where $\delta_i^{m+k}$ is the resolution of the $(m + k)$-bit encoding. Thus absolute values, signs, and representation range information can be stored in the same code and conversion table, mapped in an FPGA-embedded LUT. An $n$-bit quantization is obtained in which the actual parameter values are not bound to uniformly quantized values but can be chosen so as to reduce errors.
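The effect of the virtual bit shift can be illustrated with a small pure-Python sketch. It is a simplified reading of the scheme rather than the authors' implementation: the function names are hypothetical, the shift is taken as the largest $k$ with $|\hat{\theta}_i| < 2^{-k}$, and the sign is kept inside the stored $m$-bit word instead of in the $n$-bit code.

```python
def choose_shift(max_abs, k_max=16):
    """Largest k (capped at k_max) such that max_abs < 2**-k."""
    k = 0
    while k < k_max and max_abs < 2.0 ** -(k + 1):
        k += 1
    return k

def vbs_encode(x, m=8, k=0):
    """Quantize x with (m + k)-bit resolution and keep it as a signed m-bit word.

    If |x| < 2**-k, the k leading bits carry no information, so the word
    fits in m bits (simplification: the sign stays in the word).
    """
    word = round(x * 2 ** (m + k - 1))
    return max(-(2 ** (m - 1)), min(2 ** (m - 1) - 1, word))

def vbs_decode(word, m=8, k=0):
    """Interpret the word as m-bit fixed point and undo the shift (times 2**-k)."""
    return (word / 2 ** (m - 1)) * 2.0 ** -k
```

For example, with $m = 8$ the values 0.009 and 0.011 collapse onto the same word when $k = 0$, whereas with $k = 3$ (valid for magnitudes below 0.125) they are stored as distinct words and decoded with an error below $2^{-11}$.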
DNN-based speech enhancement
We apply the proposed quantization scheme to a speech enhancement task using a feedforward DNN. The input noisy mixtures are represented by their short-time Fourier transform (STFT), which is scaled so that its magnitude and phase can be properly calculated with a coordinate rotation digital computer (CORDIC) algorithm [20] based on integer arithmetic. $N$ frames of these magnitude features, $\{x_{t-N+1}, x_{t-N+2}, \ldots, x_t\}$, are first stacked together and then fed to the DNN to estimate the denoised/clean speech magnitude spectrum for frame $t$. The features are stacked to allow the DNN to implicitly model temporal dependencies. The CORDIC algorithm is then applied again to the estimated spectrum to restore the phase information extracted from the mixture features. The values thus obtained are scaled back and converted to time-domain speech via the inverse fast Fourier transform (IFFT) and overlap-add.
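The frame stacking step can be sketched as follows. This is a minimal illustration with a hypothetical function name; padding the start of the signal with zero frames is an assumption, since the handling of the first $N - 1$ frames is not described.

```python
import numpy as np

def stack_frames(mag, n_frames=8):
    """Stack each frame with its n_frames - 1 predecessors.

    mag: array of shape (T, F) holding magnitude spectra, one row per frame.
    Returns an array of shape (T, n_frames * F); row t holds the frames
    t - n_frames + 1, ..., t in chronological order.
    """
    n_time, n_freq = mag.shape
    # Assumption: missing context before the first frame is zero padded.
    pad = np.zeros((n_frames - 1, n_freq), dtype=mag.dtype)
    padded = np.vstack([pad, mag])
    return np.hstack([padded[i:i + n_time] for i in range(n_frames)])
```

Each output row is one DNN input vector, so the network sees the current frame together with its recent history.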
Evaluation
For evaluation, synthetic mixtures are created using the Wall Street Journal (WSJ0) dataset for speech and the TUT Acoustic Scenes 2016 development dataset [21] for noise. The latter consists of sound recordings from 15 real-world environments, e.g., cafe, train, metro station, etc. A random speech signal is selected and an equal-length noise segment is sampled from the noise signal. The training and validation data consist of about 12,000 mixtures (around 20 hours) and 5000 mixtures (around 8 hours), respectively. Similarly, the test data consist of about 2800 mixtures (around 5 hours). The speech and noise signals are mixed at a signal-to-noise ratio (SNR) chosen randomly from the set {0, 5} dB. The native sampling rate of the noise signals is 44.1 kHz; they are down-sampled to 8 kHz, the native sampling rate of the WSJ0 audio. The short-time objective intelligibility (STOI) [22] metric is used as a measure of the intelligibility of the enhanced speech.
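Mixing at a prescribed SNR can be sketched as below. The text does not state which signal is rescaled, so this sketch assumes that the noise is scaled and the speech is left unchanged; the function name is hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Assumption: the noise is rescaled and the speech level is preserved.
    Both inputs are 1-D arrays of equal length.
    """
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose the gain so that speech_power / (gain**2 * noise_power) = 10**(snr_db / 10).
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Drawing `snr_db` at random from `[0, 5]` for every speech and noise pair would reproduce the SNR set used here.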
The STFT features are extracted with a Hann window of 128 samples (16 ms) with 50% overlap. Eight input frames are stacked and fed to a two-layer feedforward network with 256 and 129 neurons in the input and hidden layer, respectively. The rectified linear unit is used as the non-linearity for each layer. Adam optimization [23] with default parameters is used. The PyTorch [24] library is used for training the networks and the Librosa [25] library for audio processing. Since our focus is on fixed-point arithmetic devices, the network weights and biases are clamped to the range (-1, +1) in order to avoid overflow and to reduce the number of bits needed for a correct numeric representation in the network's operations.
The DNN weights have been quantized using $n = 4$, $m = 8$, $k = 3$ for the first layer and $k = 2$ for the second layer, obtained by choosing the ratio $R_B = 1$, $p_{start} = 0.04$ and $p_{stop} = 0.96$. The ratio, $p_{start}$ and $p_{stop}$ have been chosen empirically after optimizing for the test data. The biases are stored as 8-bit uniformly quantized values.
Table 3 compares the STOI values obtained with the different approaches over 2800 noisy samples. The 8-bit uniform quantization gives good results, with a very small degradation in STOI (0.21%) compared to the non-quantized network, while the 4-bit uniform quantization leads to a drastic fall in performance, producing a result that is even less intelligible than the input noisy signal. On the other hand, the 4-bit non-uniform quantization proposed in this paper yields a better STOI than the noisy mixtures and is only 2.7% worse than the non-quantized network, while halving the memory footprint compared to the 8-bit uniform quantization and simultaneously decreasing the memory bandwidth requirement. Figure 3 compares the different approaches using 8-bit quantization (uniform and non-uniform) for different layers, and shows how the proposed quantization scheme, consisting of range split (RS) and virtual bit shift (VBS), affects the performance.
For each simulation, one of the two layers is kept at 8-bit uniform quantization while the other is swept over the following four approaches: uniform quantization (U), uniform quantization with virtual bit shift (UVBS), range split (RS), and range split with virtual bit shift (RSVB). For the first-layer sweep, the performance clearly suffers when no VBS is used, since $\delta_i > \Delta_i$. Figure 4 shows the effect of these approaches for two cases: 4-bit quantization for both layers, and 4-bit quantization for the first layer with 8-bit quantization for the second layer.
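The 50% footprint reduction relative to 8-bit storage corresponds to keeping two 4-bit codes in every byte of external memory. A minimal sketch of such packing is given below; the high-nibble-first layout and the function names are assumptions, since the memory layout is not specified here.

```python
def pack_codes(codes):
    """Pack 4-bit codes (integers 0..15) two per byte, high nibble first."""
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    # Pad with a zero code when the number of codes is odd.
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))

def unpack_codes(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]
```

Storing 64 weights this way takes 32 bytes instead of the 64 bytes needed for 8-bit values, and each unpacked code indexes the LUT that holds the actual $m$-bit value.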
Results

Conclusions
This work proposes a low-bit quantization method inspired by the companding approach. It achieves a good trade-off between performance and resource requirements in a hardware implementation of a DNN and is thus very appealing for FPGA applications. The method does not require any change or pruning of the network, so no retraining is needed. In the case studied, a two-layer feedforward neural network, a dramatic reduction of the memory requirements (50%) is obtained with only a slight reduction of the performance. Further research should address the application of the method to deeper networks and the use of a non-symmetrical range split or of a custom multiplying architecture for the weighting of the input values.
