This paper addresses a challenging problem: how to reduce energy consumption without incurring a performance drop when deploying deep neural networks (DNNs) at the inference stage. To alleviate the computation and storage burdens, we propose a novel dataflow-based joint quantization approach, with the hypothesis that fewer quantization operations incur less information loss and thus improve the final performance. The approach first introduces a quantization scheme with efficient bit-shifting and rounding operations to represent network parameters and activations in low precision. It then restructures the network architecture to form unified modules for optimization on the quantized model. Extensive experiments on ImageNet and KITTI validate the effectiveness of our model, demonstrating that state-of-the-art results for various tasks can be achieved by this quantized model. In addition, we designed and synthesized an RTL model to measure the hardware costs of various quantization methods. For each quantization operation, our method reduces area cost by ∼15× and energy consumption by ∼9× compared to a strong baseline.
Introduction
DNNs have been increasingly deployed at the edge due to their remarkable performance. In particular, energy becomes an important factor when processing DNNs at the edge in embedded devices with limited battery capacity (e.g., smartphones, smart sensors, UAVs, and wearables) [1]. Hence, there is an increasing demand for reducing energy consumption and increasing throughput without incurring a significant drop in accuracy. Extensive research has focused on reducing the precision of operations and operands to address this challenge (e.g., DoReFa-Net [2], BinaryNet [3], XNOR-Net [4] and ternary quantization [5]).
Although these quantization methods remarkably reduce the model complexity, they are limited in two key ways. Firstly, a noticeable performance drop exists. For example, DoReFa-Net has a 2.9% performance loss for AlexNet on ImageNet with 8-bit weights and activations. Secondly, additional hardware costs are introduced due to extra inefficient operations such as codebooks in [6] and scaling factors in DoReFa-Net and IOA [7] .
To address the above issues, we introduce an energy-efficient quantization scheme that represents both network parameters and activations in low precision. It contains only efficient bit-shifting and rounding-to-nearest operations. Then, based on this scheme, we restructure the network to form unified modules, which reduces the number of quantization operations. Here we hypothesize that fewer quantization operations incur less information loss and thus boost the final accuracy. Finally, a joint reconstruction error loss function is set up on these unified modules to optimize the quantized model.
We evaluate our proposed method on ImageNet [8] with ResNet [9] and KITTI [10] with Faster R-CNN [11] . Furthermore, we designed and synthesized a hardware unit to measure the hardware costs among various quantization operations. We show that our proposed approach performs comparably or even better in terms of accuracy and computational costs over existing state-of-the-art approaches on various tasks. In summary, our key contributions are as follows:
1. A purely quantized network is introduced to represent network parameters and activations in low precision with energy-efficient operations. Compared to the floating-point representation, the 8-bit quantized model reduces computation and memory accesses by ∼4× without a significant performance drop.
2. A novel joint quantization approach is introduced that takes the network dataflow into consideration to define unified modules and then sets up joint reconstruction error objectives to learn the optimal parameters for the quantized model. In addition, our approach does not require a time-consuming fine-tuning stage.
3. Extensive experiments on various benchmark datasets with state-of-the-art deep models demonstrate that our proposed method reduces computational costs while still achieving decent accuracy.
Proposed Method
In this section, we first describe our quantization scheme for DNNs and then provide a detailed derivation of integer-only inference. Finally, a joint quantization approach for network parameters and activations is presented.
Quantization Scheme
A common practice in reducing the precision of DNNs is to introduce a quantization function [2, 7]. A basic requirement of our quantization scheme is that it is hardware-friendly, i.e., it does not incur intensive hardware operations and it allows equal conversion between the integer and floating-point representations. The quantization scheme is defined as follows. Given a floating-point value r, we use a quantization function, Q(·), to approximate it:
where r_q is the quantized floating-point value, r_I is the integer value, and N_r is the fractional bit, which is the only parameter to set. n_bits is the bit-width, including 1 sign bit. For example, when n_bits = 8, r_I is an 8-bit integer within the range [−128, 127]. When N_r is negative, only data before the decimal point is selected. For instance, if N_r is −3 with an 8-bit bit-width, we select digits 3 to 10 before the decimal point as the low-precision data. Our quantization scheme contains only bit-shifting, which differs from previous works that use multiplication-based scaling factors [2, 7] or load weights from a codebook [12].
In DNNs, a separate fractional bit is selected for the weights, biases, and activations of each layer. For instance, a convolution layer has a set of quantization parameters N_w, N_b, N_x, and N_o for weights, biases, input activations and output activations, respectively. N_x comes from the output activation of the previous layer.
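As an illustration of the scheme described above, the following is a minimal NumPy sketch. The exact form of Q(·) is an assumption reconstructed from the surrounding definitions (scale by 2^N_r, round to nearest, clip to the signed n_bits range, scale back); the function name `quantize` and its parameters are hypothetical and not taken from the original work.

```python
import numpy as np

def quantize(r, n_frac, n_bits=8):
    """Return (r_q, r_I) for a floating-point array r.

    Assumed form: r_I = clip(round(r * 2^N_r)), r_q = r_I * 2^(-N_r),
    where n_frac plays the role of the fractional bit N_r.
    """
    r = np.asarray(r, dtype=np.float64)
    q_min = -(2 ** (n_bits - 1))        # -128 for 8 bits
    q_max = 2 ** (n_bits - 1) - 1       # 127 for 8 bits
    # Multiplying by a power of two is a bit shift in fixed-point hardware.
    r_int = np.clip(np.round(r * 2.0 ** n_frac), q_min, q_max).astype(np.int64)
    r_q = r_int * 2.0 ** (-n_frac)      # quantized value back in floating point
    return r_q, r_int

# Positive fractional bit: 5 bits after the binary point, 5.0 saturates at 127.
r_q, r_int = quantize([0.3, -1.7, 5.0], n_frac=5)

# Negative fractional bit: the three lowest integer bits are dropped.
r_q_neg, r_int_neg = quantize([90.0, 1000.0], n_frac=-3)
```

With n_frac = 5 the representable range is roughly [−4, 3.97], so the value 5.0 is clipped; choosing the fractional bit therefore trades range against resolution.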
Integer-arithmetic-only Operations
The above quantization scheme is expressed in floating-point arithmetic and is thus intended for deployment on general hardware platforms such as GPUs and CPUs. In this section, we describe integer-only arithmetic inference in DNNs by adopting the proposed scheme. The derivation provides a step-by-step account of how custom hardware units compute DNNs using fixed-point representations.
Following the notations in [1] , we define the 2-D convolution operation to be:
where O, X, W, and B are the matrices of the output feature maps (ofmaps), input feature maps (ifmaps), filters, and biases, respectively. S is a given stride, while C, H and W are the ifmap channels, height and width, respectively. After quantizing weights, biases and activations, the convolution operation becomes:
where X_I, W_I and B_I are the integer matrices of the ifmaps, filters, and biases, respectively. N_x, N_w and N_b are the fractional bits of the ifmaps, filters, and biases, respectively. We carefully align the biases with the convolution output by sacrificing smaller values, which results in less information loss. The intermediate result of the convolution is a 32-bit integer to handle overflow during accumulation.
Then, we need to quantize the output of the convolution layer, where O_I is the integer matrix of the ofmaps. In custom hardware units, two sets of data are stored: one is the integer matrices of the ifmaps X_I, ofmaps O_I, filters W_I, and biases B_I; the other is the bit-shifting values used for data alignment in biases and activations, rather than the fractional bits themselves.
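The integer-only computation described in this section can be sketched as follows. This is a simplified illustration, not the authors' hardware design: a matrix product stands in for the convolution sum, the shift amounts (N_x + N_w − N_b for bias alignment and N_x + N_w − N_o for the output) are assumptions derived from the fractional-bit definitions, and the names `shift_round` and `int_layer` are hypothetical. NumPy's int64 is used for simulation headroom, whereas the text specifies a 32-bit accumulator.

```python
import numpy as np

def shift_round(acc, shift):
    """Right-shift by `shift` bits with rounding; left-shift if shift <= 0."""
    acc = np.asarray(acc, dtype=np.int64)
    if shift > 0:
        # Adding half of the discarded range before shifting rounds to nearest.
        return (acc + (1 << (shift - 1))) >> shift
    return acc << (-shift)

def int_layer(x_int, w_int, b_int, n_x, n_w, n_b, n_o, n_bits=8):
    """Integer-only x @ w + b, re-quantized to n_bits with N_o fractional bits."""
    # Each product of an ifmap value and a filter value carries N_x + N_w
    # fractional bits, so the accumulator does as well.
    acc = x_int.astype(np.int64) @ w_int.astype(np.int64)
    # Align the bias to the accumulator: left shift by (N_x + N_w - N_b),
    # or a rounding right shift when the bias has more fractional bits.
    acc = acc + shift_round(b_int, n_b - (n_x + n_w))
    # Re-quantize the accumulator down to N_o fractional bits.
    out = shift_round(acc, (n_x + n_w) - n_o)
    q_min, q_max = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return np.clip(out, q_min, q_max)

# x = [1.0, -0.5] with N_x = 6, w = [0.5, 0.25] with N_w = 6, b = 0.25 with N_b = 5.
x_int = np.array([[64, -32]])
w_int = np.array([[32], [16]])
b_int = np.array([8])
o_int = int_layer(x_int, w_int, b_int, n_x=6, n_w=6, n_b=5, n_o=4)
# Floating-point reference: 1.0 * 0.5 - 0.5 * 0.25 + 0.25 = 0.625 = 10 / 2^4.
```

Only the two shift amounts need to be stored alongside the integer tensors, which matches the observation that hardware keeps the alignment shifts rather than the fractional bits.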
Dataflow-based Joint Quantization
The impact of activations on storage capacity depends on the network architecture and dataflow [1]. Hence, the dataflow of the network architecture becomes an important factor in quantizing the activations. Here we hypothesize that applying more quantization operations is likely to introduce more noise, which may lead to information loss. To reduce the number of quantization operations, we propose to combine several basic layers into a unified module based on the network architecture. We consider only ReLU as the activation function because it is the de facto choice for DNNs due to its effectiveness. For instance, the quantization is conducted after a ReLU layer in Figure 1(b) because 1) the negative part can be skipped when computing the quantization error, as only non-negative values exist after the ReLU layer; and 2) the cost of memory accesses is reduced dramatically by not writing the convolution output back to memory. This differs from existing methods such as [2], which quantize activations immediately after convolution. Furthermore, the batch normalization layer is merged into the weights and biases of the next convolution layer at the inference stage.
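The hypothesis that fewer quantization operations lose less information can be illustrated on synthetic data. The sketch below is only a toy comparison under stated assumptions: the "convolution output" is random Gaussian data, the fractional bits 6 and 5 are arbitrary, and `quantize` is the hypothetical helper form assumed earlier. It compares quantizing both after the convolution and after ReLU against quantizing once after ReLU.

```python
import numpy as np

def quantize(r, n_frac, n_bits=8):
    """Assumed scheme: scale by 2^N, round to nearest, clip, scale back."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    r_int = np.clip(np.round(r * 2.0 ** n_frac), q_min, q_max)
    return r_int * 2.0 ** (-n_frac)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
conv_out = rng.normal(0.0, 0.5, size=10_000)   # synthetic stand-in for a conv output
reference = relu(conv_out)                     # floating-point module output

# (a) Two quantization operations: after the convolution and again after ReLU.
two_step = quantize(relu(quantize(conv_out, n_frac=6)), n_frac=5)
# (b) Unified module: a single quantization after ReLU.
one_step = quantize(reference, n_frac=5)

mse_two = float(np.mean((two_step - reference) ** 2))
mse_one = float(np.mean((one_step - reference) ** 2))
print(f"two quantizations: {mse_two:.3e}, one quantization: {mse_one:.3e}")
```

Because rounding to nearest picks the closest representable value, the single quantization in (b) can never be further from the reference than the result of (a), which lands on the same output grid after an extra rounding step. This covers only the rounding error of one module and does not replace the experimental evidence.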
Optimization
Following [13], we assume that the quantization error in each layer is positively correlated with the loss in final accuracy. Hence, it is reasonable to minimize each layer's reconstruction error to boost performance. Our target is to minimize the reconstruction error between the outputs and the quantized outputs as follows:
where N_w, N_b and N_o are the optimal fractional bits of the weights, biases and output activations, respectively. N_x is taken from the optimal fractional bit of the activations of the previous layer. Our proposed method starts from the pre-trained model without fine-tuning, and the optimization learns the optimal fractional bits. To speed up the search, we first narrow down the search space and then apply a simple grid search. This is motivated by the observation [14] that larger weights play a more important role than smaller ones. Hence, we hypothesize that the optimal fractional bit should be located in the upper bits. Algorithm 1 provides a detailed solution to the optimization. First, the largest fractional bit to represent the maximum value of parameter W is:
The search space would be [N − τ], where τ is a hyper-parameter set empirically. Biases and activations follow a similar search strategy. We can then traverse all the solutions in the narrowed-down search space to obtain the optimal fractional bits. The time complexity of Algorithm 1 is O(τ^3 Γ), where Γ comes from the convolution operation. In our experiments, we empirically set τ to 4.
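The narrowed grid search can be sketched as below. This is an illustrative reading of the procedure, not a transcription of Algorithm 1: the formula in `max_frac_bit`, the inclusive window from N − τ to N, the squared-error objective, and the use of a matrix product in place of a convolution are all assumptions, and every function name is hypothetical.

```python
import itertools
import math
import numpy as np

def quantize(r, n_frac, n_bits=8):
    """Assumed scheme: scale by 2^N, round to nearest, clip, scale back."""
    q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    return np.clip(np.round(r * 2.0 ** n_frac), q_min, q_max) * 2.0 ** (-n_frac)

def max_frac_bit(t, n_bits=8):
    """Largest fractional bit whose range still covers max|t| (assumed formula).

    Requires a non-zero tensor; an exact power-of-two peak saturates by one step.
    """
    peak = float(np.max(np.abs(t)))
    return n_bits - 1 - math.ceil(math.log2(peak))

def search_frac_bits(x_q, w, b, tau=4, n_bits=8):
    """Grid-search (N_w, N_b, N_o) minimizing the layer reconstruction error."""
    out_ref = x_q @ w + b   # reference output; x_q already uses the previous layer's N_x

    def candidates(t):
        top = max_frac_bit(t, n_bits)
        return range(top - tau, top + 1)   # upper bits only

    best = None
    for n_w, n_b, n_o in itertools.product(
            candidates(w), candidates(b), candidates(out_ref)):
        out_q = quantize(x_q @ quantize(w, n_w, n_bits) + quantize(b, n_b, n_bits),
                         n_o, n_bits)
        err = float(np.sum((out_q - out_ref) ** 2))
        if best is None or err < best[0]:
            best = (err, n_w, n_b, n_o)
    return best

rng = np.random.default_rng(0)
x_q = quantize(rng.normal(size=(16, 8)), 4)    # already-quantized input activations
w = rng.normal(scale=0.2, size=(8, 4))
b = rng.normal(scale=0.1, size=4)
err, n_w, n_b, n_o = search_frac_bits(x_q, w, b)
print(f"N_w={n_w}, N_b={n_b}, N_o={n_o}, error={err:.4e}")
```

Each of the three parameters contributes on the order of τ candidates, and every candidate triple requires one evaluation of the layer, which is where the cubic dependence on τ comes from.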
Experiments
We conduct experiments on the ImageNet [8] and KITTI [10] datasets with two widely used deep models, ResNet [9] and Faster R-CNN [11], to investigate the performance of our proposed method. In addition, we conduct hardware measurements on various approaches to verify the computational efficiency of our proposed method.
Implementation Settings
Our optimization is conducted on a single image, as the number of weights, biases and activations in each layer is large enough to introduce little bias into the optimization. The baselines are implemented according to their default settings. The CPU platform is an AMD 1950X 16-core CPU with 64GB RAM, and the GPU platform is an NVIDIA Tesla V100.
To evaluate the hardware cost of the various methods, we created an RTL model for each method and synthesized it using a UMC 40nm library. The area and power are then estimated at a 500MHz clock.
Evaluation on ImageNet
For training, all images are resized, preserving the aspect ratio, so that the smallest side is 256. Then the center 224 × 224 patch is cropped and the mean is subtracted from each of the RGB channels. For inference, we apply single-crop testing for standard evaluation. We apply our method to ResNet with different layer depths on ImageNet. The pre-trained floating-point model is from TensorFlow's official repository. The IOA method [7] is fine-tuned from the pre-trained floating-point model.

Table 1 reports the performance of ResNet-50, ResNet-101 and ResNet-152 on ImageNet. It can be seen that 1) our method is robust across network depths, with only about a 1.8% drop; 2) compared to the other baselines, our proposed approach achieves competitive performance. Although the IOA baseline has similar results, it uses scaling factors and 32-bit biases and requires extra addition operations on the "zero-point" values, whereas our method saves more hardware cost with bit-shifting and 8-bit biases; and 3) the quantized model with optimal fractional bits is probably a good starting point for further fine-tuning, which may improve the performance further and speed up training, as the search space of the parameters becomes smaller.

Table 2 lists the training time of our proposed method on ResNet with various network depths. Compared with fine-tuning a pre-trained model, which takes several days, our approach is dramatically faster and finishes within several minutes.

Table 3 reports the ImageNet accuracy of ResNet-50 for various approaches with various bit-widths. As expected, the integer-only work performs better than ABCnet, which uses 5 bits for weights and activations. INQ and CLIP-Q perform better than ours, as they represent activations in 32 bits.
Evaluation on KITTI
For object detection, we evaluate our method on the autonomous driving benchmark dataset KITTI, using Faster R-CNN (F-RCNN) with ResNet-152 as the backbone network. Since the ground truth for the test set is not publicly available, we randomly sample 80% of the training images for training F-RCNN and use the remaining images for validation of the quantized model. All images are resized to HD resolution (1280 × 720). Table 4 lists the object detection performance of our proposed method for various bit-widths on the KITTI dataset. We observe that our method in 8-bit achieves comparable or even better performance than the full-precision models. Our method in 7-bit precision also has competitive performance. However, the 6-bit representation shows a dramatic performance drop.
Hardware Experiments
We measure the energy and area costs of various operations in Table 5. For a fair comparison, all implementations are constrained to a 32-bit input and an 8-bit output.
In particular, the scaling factor operation is implemented by a 32-bit multiplier, and the output is then clipped to the rightmost 8-bit integer value. K-means has a 4-bit codebook in which each entry is an 8-bit value; the selected entry is multiplied with the input data and then clipped to the rightmost 8 bits. For bit-shifting, the input data is shifted right by an amount in the range [1, 10] and then clipped to the rightmost 8 bits. The results show that bit-shifting saves the most power and area. The scaling factor operation consumes ∼2× more energy and area than the bit-shifting operation. The codebook consumes the most power and area, as it involves intensive encoding-decoding operations.

Following [20], in floating-point precision, the computational cost of quantizing activations is about 1/(filter size) of that of a standard convolution layer, occupying only about 1−2% of the whole network computation. However, in fixed-point precision, the computational cost of quantization cannot be ignored. For example, in the scaling factor experiment, the convolution layer is implemented with 8-bit multipliers while the quantization uses 32-bit multipliers, so the computational cost of quantization increases by ∼16× and should not be ignored.

Figure 2 shows statistics on the mean squared error (MSE) between quantized activations and floating-point activations against residual block depth, and on the shifting bits against layer depth, for ResNet-50. From Figure 2a, we observe that 1) the MSE of the residual addition is larger than that of the first two convolution layers, as it integrates the last convolution layer and the shortcut connection; and 2) as the layers go deeper, the MSE of the first two convolutions in a residual block remains stable while that of the addition units increases. From Figure 2b, it can be seen that 1) the bit-shifting operation operates in the range [1, 10]; and 2) the bit-shifting values often cluster around 3 and 8.
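One way to read the ∼16× figure for the scaling-factor case is as a ratio of multiplier costs. The sketch below assumes, as a common rule of thumb that the text does not state, that the cost of an n-bit by n-bit multiplier grows roughly with n²; the function name is hypothetical.

```python
def relative_multiplier_cost(bits, ref_bits=8):
    """Cost of a bits-by-bits multiplier relative to a ref_bits-by-ref_bits one.

    Assumption: multiplier cost scales quadratically with operand width.
    """
    return (bits / ref_bits) ** 2

# 32-bit quantization multiplier versus the 8-bit multipliers of the convolution.
ratio = relative_multiplier_cost(32)
print(ratio)
```

Under this assumption, a quantization step that is negligible in floating point becomes 16 times more expensive per operation relative to the 8-bit convolution arithmetic, which is consistent with the estimate above.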
Discussion

Related Work
Recent work on compressing DNNs while accelerating inference can be classified as: a) adopting low precision for both operands and operations; and b) reducing the number of operations by designing compact architectures or pruning.
Lower precision. Some works have focused on reducing the precision of weights for efficient on-chip memory storage [17, 5]. For instance, [17] proposes incremental network quantization (INQ) to convert pre-trained full-precision models into low-precision ones by quantizing weights to powers of two or zero. To further save computing cost and memory storage, recent works also consider the quantization of activations [2, 3, 4, 21]. For instance, binarized neural networks (BNNs) [3] use binary weights and activations and thus reduce the MAC to an XNOR operation. In contrast to these methods, we propose a novel dataflow-based joint quantization approach to quantize the weights and activations.
Fewer operations. There is a significant amount of research on designing efficient network architectures [22, 20, 23]. SqueezeNet [22] uses 1 × 1 instead of 3 × 3 filters, which dramatically increases the depth of DNNs while reducing the model complexity. Xception [23] and MobileNet [20] rely on depthwise separable convolutions. Another way to reduce the number of operations is network pruning [6, 24, 25]. For example, [25] prunes the network based on the magnitude of the weights. As the model size of a DNN does not directly reflect the hardware energy consumption, [24] proposes an energy-aware method to prune the weights.
Conclusions
In this work, we explored a dataflow-based joint quantization mechanism to lower the precision of network parameters and activations in DNNs, and we performed extensive experiments to demonstrate the effectiveness of our proposed method in terms of accuracy and hardware costs. The quantized model is a purely low-precision model, with weights, biases and activations all in low precision. In addition, batch normalization is merged into the convolution layer.
