Outliers in weights and activations pose a key challenge for fixed-point quantization of neural networks. While outliers can be addressed by finetuning, this is not practical for machine learning (ML) service providers (e.g., Google, Microsoft) who often receive customers' models without the training data. Specialized hardware for handling outliers can enable low-precision DNNs, but incurs nontrivial area overhead. In this paper, we propose overwrite quantization (OverQ), a novel hardware technique which opportunistically increases bitwidth for outliers by letting them "overwrite" adjacent values. An FPGA prototype shows OverQ can significantly improve ResNet-18 accuracy at 4 bits while incurring relatively little increase in resource utilization.
Introduction
Deep neural networks (DNNs) have achieved state-of-the-art results in many machine learning domains including computer vision, natural language processing, robotic control, and game AI. However, a key drawback of DNNs is their high computational and storage requirements. For example, the cutting-edge BERT architecture (Devlin et al., 2018) contains as many as 345 million floating-point parameters. The execution costs of DNNs impede deployment on edge devices (Xu et al., 2018) and have a measurable impact on the carbon footprint of datacenters (Strubell et al., 2019).
Quantizing neural network weights and activations from floating-point to low-precision fixed-point is a promising approach for reducing DNN size and complexity. DNN quantization is a highly active research topic, with many works that propose fine-tuning a network to make the weights quantization-friendly (Wu et al., 2018; Jacob et al., 2018; Yang et al., 2018b; Choi et al., 2018; Zhou et al., 2017; Banner et al., 2018). Recently, however, a number of works have argued for the importance of post-training or data-free quantization techniques (Migacz, 2017; Zhao et al., 2019; Nagel et al., 2019; Meller et al., 2019). This is because ML service providers in industry must optimize their customers' neural networks without having access to the training data.
DNN weights and activations are distributed in a bell curve, with the majority of values concentrated near zero and a long tail of rare outliers with large magnitude. The outliers present a challenge for uniform quantization, and a number of techniques have been proposed to deal with them, including clipping (Sung et al., 2015; Shin et al., 2016; McKinstry et al., 2018; Migacz, 2017), channel splitting (Zhao et al., 2019; Park & Choi, 2019), and channel equalization (Nagel et al., 2019; Meller et al., 2019). These techniques have made data-free quantization of many popular DNNs and CNNs possible at 8 or 7 bits without accuracy degradation.
For even lower precision, specialized hardware can be used. Park et al. (2018a; b) proposed an outlier-aware DNN accelerator (OLAccel) which uses a conventional processing engine (PE) for central activations and a second sparse PE for outliers. OLAccel achieves < 1% accuracy loss on AlexNet using 4 bits for most values and 16 bits for a small percentage of outliers. Though effective, OLAccel's major drawback is its use of an entirely separate outlier engine: this requires additional multiply-accumulate (MAC) units and incurs hardware overheads due to sparsity.
In this paper, we present overwrite quantization (OverQ), a data-free quantization technique which uses lightweight architectural extensions to address outliers. OverQ's primary goal is to reduce the area overhead of OLAccel while still achieving significant accuracy gains over a naïve baseline. To achieve this, OverQ opportunistically re-allocates bits from unimportant values to represent adjacent outliers. Unlike OLAccel, OverQ does not handle every outlier. However, an exploratory study on ImageNet ResNet-18 shows that OverQ addresses 80-95% of outliers and achieves a 1.0-2.5% Top-1 accuracy improvement. An FPGA prototype built using high-level synthesis shows that OverQ requires no additional MAC units, and that its logic overhead is negligible compared to the MAC array size. Although the scope of the evaluation is currently limited, it demonstrates the potential of overwrite quantization in larger networks.
Figure 1. Execution flow of the outlier-aware accelerator (OLAccel) (Park et al., 2018a): the input vector is separated into a dense vector containing most values and a set of sparse outliers. Twice the regular bitwidth is used to represent the outliers. The dense vector is processed by a dense PE while the outliers are processed by a separate sparse outlier PE.
Related Work
We limit the discussion of related work to existing data-free quantization techniques in software and hardware.
Handling Outliers in Software
Clipping weights and activations to a pre-determined threshold is a straightforward way to control outlier effects. Proposed methods to compute the clip threshold include minimal mean-squared error (Sung et al., 2015; Shin et al., 2016), percentile of values (McKinstry et al., 2018), and KL divergence (Migacz, 2017). On popular ImageNet CNNs, clipping with 8-bit fixed-point incurs no accuracy degradation (Migacz, 2017).
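As an illustration of the minimal mean-squared error criterion, the following is a minimal sketch that grid-searches the clip threshold for a list of sampled values. The function names (`quantize_clip`, `mmse_clip_threshold`), the grid of 100 candidates, and the sign-magnitude level count are assumptions made for this example and are not taken from the cited works.

```python
def quantize_clip(x, clip, bits):
    """Clip x to [-clip, clip], round to one of 2**bits - 1 magnitude levels, rescale."""
    levels = 2 ** bits - 1              # largest integer magnitude (assumed sign-magnitude)
    step = clip / levels
    clipped = max(-clip, min(clip, x))
    return round(clipped / step) * step


def mmse_clip_threshold(values, bits, num_candidates=100):
    """Return the candidate threshold with the lowest mean-squared quantization error."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0.0
    best_clip, best_err = max_abs, float("inf")
    for k in range(1, num_candidates + 1):
        clip = max_abs * k / num_candidates   # candidates are fractions of the max value
        err = sum((v - quantize_clip(v, clip, bits)) ** 2 for v in values) / len(values)
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip


# Many small values plus one outlier: clipping below the maximum lowers the error.
values = [i / 1000 for i in range(-1000, 1001)] + [3.0]
threshold = mmse_clip_threshold(values, bits=4)
```

In this toy distribution the search settles on a threshold below the largest sampled magnitude, trading a clipping error on the single outlier for finer resolution on the bulk of the values.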
Channel equalization seeks to balance the magnitude of different channels by scaling one layer's channels by a constant vector, then scaling the following layer by the inverse vector (Nagel et al., 2019; Meller et al., 2019). Nagel et al. (2019) also propose bias correction, which adjusts the biases in each layer to compensate for the fact that quantization skews the activation mean. A combination of these techniques enables 8-bit integer quantization for MobileNetV1 and ResNet18 without accuracy loss.
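The scaling step can be illustrated with a small sketch, assuming two fully-connected layers separated by a ReLU (which commutes with positive scaling). The helper names and the hand-picked scale vector are hypothetical; the cited works derive the scales from the weight ranges.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def equalize(W1, W2, s):
    """Scale output channel i of layer 1 by s[i] and input channel i of layer 2 by 1/s[i]."""
    W1s = [[w * s[i] for w in row] for i, row in enumerate(W1)]
    W2s = [[w / s[j] for j, w in enumerate(row)] for row in W2]
    return W1s, W2s


W1 = [[8.0, -4.0], [0.5, 0.25]]   # channel 0 has a much larger range than channel 1
W2 = [[1.0, 2.0]]
x = [1.0, 0.5]

W1s, W2s = equalize(W1, W2, s=[0.25, 4.0])   # positive scales chosen by hand
y_before = matvec(W2, relu(matvec(W1, x)))
y_after = matvec(W2s, relu(matvec(W1s, x)))
```

After scaling, both rows of `W1s` span the same magnitude range while the network output is unchanged, which is the property that makes the transformation data-free.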
Weight splitting takes nodes or channels containing outlier weights and duplicates them while dividing the weight in half (Zhao et al., 2019; Park & Choi, 2019) . This preserves network equivalence but reduces the magnitude of the outliers. Splitting requires static information on the locations of outliers and is not effective for activations (Zhao et al., 2019) . Even for weights, splitting requires substantial model size overhead to restore a <8-bit model to floating-point accuracy.
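The equivalence behind splitting can be seen in a one-neuron sketch. The function `split_channel` and the flat weight/input lists are assumptions for illustration, not the implementation from the cited works.

```python
def split_channel(weights, inputs, idx):
    """Duplicate input channel idx and give each copy half of the original weight."""
    half = weights[idx] / 2
    new_w = weights[:idx] + [half, half] + weights[idx + 1:]
    new_x = inputs[:idx] + [inputs[idx], inputs[idx]] + inputs[idx + 1:]
    return new_w, new_x


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


weights = [0.5, 8.0, -1.0]        # 8.0 is the outlier weight
inputs = [2.0, 1.5, 4.0]
new_w, new_x = split_channel(weights, inputs, idx=1)
```

The output is preserved while the largest weight magnitude is halved, at the cost of one extra channel; this is the model size overhead mentioned above.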
Handling Outliers with Specialized Hardware
Software techniques have difficulties reaching precisions below 8-bit without accuracy degradation. Specialized hardware can help overcome this barrier. Park et al. (Park et al., 2018b; a) proposed OLAccel, an outlier-aware DNN accelerator which uses a dense 4/8-bit PE for most activations and a sparse 8/16-bit PE for outliers. Increasing the bitwidth for outliers affords greater dynamic range. Figure 1 shows how OLAccel separates an input vector into a dense vector and sparse outliers for separate processing. Each outlier is accompanied by indices (width, height, channels) to track its location in the original vector. On AlexNet, OLAccel with 4-bit quantization achieves accuracy within 1% of the float baseline when the outlier PE handles 1% of all values.
The outlier PE in OLAccel incurs non-trivial hardware overhead. Although Park et al. did not specify the exact area of the outlier PE in their area comparison, we can observe that: (1) the outlier PE requires additional MAC units since it operates at a different bitwidth; (2) the sparse representation incurs storage overhead. For each 8/16-bit outlier, 32 additional bits are used for indices (see Figure 1 ). OverQ seeks to handle outliers in a higher precision but in a more area-efficient manner.
Figure 2. Basic idea of overwrite quantization (panels: Base, OverQ): an outlier $x_i$ can overwrite the adjacent value $x_{i+1}$ when $x_{i+1}$ is small. If so, twice the bitwidth is used to represent $x_i$ and $x_{i+1}$ is dropped.
Overwrite Quantization
Consider the problem of quantizing an $N$-element activation vector $\{x_i\}_{i=1}^{N}$. The basic idea of overwrite quantization is illustrated in Figure 2. An outlier $x_i$ can overwrite its adjacent value $x_{i+1}$ when $x_{i+1}$ is smaller than a fixed threshold (see footnote 1). When this overwrite occurs, $x_i$ is represented using twice the normal bitwidth while $x_{i+1}$ is dropped. In this way, OverQ dynamically re-allocates bitwidth from an insignificant value to an important outlier. OverQ exploits two properties of DNN activations: (1) outliers are rare, but contribute disproportionately to a layer's output; (2) activation distributions peak near zero, and contain many ReLU-induced zeros. The first property means that overwrites occur infrequently, so few values are dropped. The second property means that an outlier will lie beside a small value with significant probability. In hardware, OverQ reuses the MAC unit from the adjacent value and performs two low-precision MACs to compute the outlier result.
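The overwrite rule described above can be sketched in a few lines. This is an illustrative model only: it assumes non-negative (post-ReLU) activations already divided by the quantization step, and the names `overq_basic` and `small_thresh` are hypothetical.

```python
def overq_basic(acts, bits, small_thresh):
    """Quantize a vector, letting an outlier take double bitwidth when its right
    neighbour is below small_thresh (the neighbour is then dropped to zero)."""
    max_norm = 2 ** bits - 1             # largest value in a regular slot
    max_wide = 2 ** (2 * bits) - 1       # largest value with twice the bitwidth
    out = []
    i = 0
    while i < len(acts):
        q = round(acts[i])
        has_small_neighbour = i + 1 < len(acts) and acts[i + 1] < small_thresh
        if q > max_norm and has_small_neighbour:
            out.append(min(q, max_wide))  # outlier kept at higher precision
            out.append(0)                 # adjacent value is overwritten
            i += 2
        else:
            out.append(min(q, max_norm))  # regular value, or outlier that must be clipped
            i += 1
    return out


result = overq_basic([3.2, 40.0, 1.0, 20.0, 9.0, 50.0], bits=4, small_thresh=4)
```

In this example the outlier 40 is preserved because its neighbour (1.0) is small, 20 is clipped to 15 because its neighbour (9.0) is not, and the final outlier is clipped because the boundary element has no neighbour.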
In DNNs, OverQ would be applied along a single dimension of the activation tensor (width, height, or channels for convolutional networks). The chosen dimension affects the hardware architecture as it determines which dimensions must be spatially unrolled. Experiments showed that OverQ along the channels is much more effective than along either spatial dimension, since spatially adjacent values are highly correlated in CNNs while adjacent channels exhibit less correlation. We assume OverQ along the channels of a CNN from this point onward.
Computing with OverQ
Figure 3 illustrates two variants of overwrite quantization and how computation is done with each. Figure 3(a) shows a dot product between activations $x_i$ and weights $w_i$. Such dot products are the basis of convolutional and fully-connected layers in DNNs. In OverQ, each activation is associated with a flag bit indicating whether that activation is being overwritten. Figure 3(b) shows OverQ-Split, where half the outlier value ($x_i/2$) is stored in both hardware vector slots.
Footnote 1: The boundary value $x_N$ has no adjacent value and cannot benefit from OverQ.
During computation, the PE detects the flag bit and copies $w_i$ to the position of $w_{i+1}$; the two products at indices $i$ and $i+1$ thus sum to $x_i \times w_i$. Copying the weight is simple in hardware as long as we spatially unroll the vector such that the compute units for $x_i$ and $x_{i+1}$ are physically close by. OverQ-Split can represent a value with twice the normal dynamic range, essentially adding one extra bit of representation to affected outliers. The primary advantage of OverQ-Split is its simplicity. We will show later how it can be implemented in a spatial accelerator with only basic muxing logic.
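A software model of the OverQ-Split dot product might look as follows. It is a sketch under stated assumptions: activations are non-negative and pre-scaled, the flag is attached to the overwritten slot, and an odd outlier is stored as a ceiling/floor pair rather than two identical halves so that the sum stays exact. The function names are hypothetical.

```python
def overq_split_encode(acts, bits, small_thresh):
    """Return (slots, flags); flags[j] == 1 marks a slot that was overwritten."""
    max_norm = 2 ** bits - 1
    slots, flags = [], []
    i = 0
    while i < len(acts):
        q = round(acts[i])
        if q > max_norm and i + 1 < len(acts) and acts[i + 1] < small_thresh:
            q = min(q, 2 * max_norm)      # split doubles the representable range
            low_half = q // 2
            slots += [q - low_half, low_half]
            flags += [0, 1]
            i += 2
        else:
            slots.append(min(q, max_norm))
            flags.append(0)
            i += 1
    return slots, flags


def overq_split_dot(slots, flags, weights):
    """Dot product in which a flagged slot reuses the weight of the slot before it."""
    acc = 0
    for j, (x, flag) in enumerate(zip(slots, flags)):
        w = weights[j - 1] if flag else weights[j]   # models the weight mux in the PE
        acc += x * w
    return acc


acts = [3.0, 25.0, 0.0, 7.0]
weights = [2, -3, 5, 1]
slots, flags = overq_split_encode(acts, bits=4, small_thresh=4)
overq_result = overq_split_dot(slots, flags, weights)
clipped_result = sum(min(round(a), 15) * w for a, w in zip(acts, weights))
```

Here the OverQ-Split result matches the unquantized dot product (-62), whereas plain 4-bit clipping of the outlier yields -32.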
Figure 3(b) shows OverQ-Shift, a more complex but also more flexible variant of overwrite quantization. OverQ-Shift uses the adjacent slot to hold out-of-range bits of the outlier $x_i$. These could be either additional more-significant bits (MSBs) or less-significant bits (LSBs). One bit in the adjacent slot is designated as the shift direction bit, and selects between MSB or LSB representation. The remaining bits store data. Because the adjacent slot represents different binary positions than a regular slot, its product requires either a shift right (for MSBs) or a shift left (for LSBs) before accumulation. For the example in Figure 3(b), a regular slot holds binary positions $2^3, 2^2, 2^1, 2^0$, while the OverQ slot holds positions $2^6, 2^5, 2^4$. A shifter after the multiplier uses the shift direction bit to determine whether to shift left or right. OverQ-Shift is more hardware intensive as it requires two different (constant) shifters as well as additional decode and mux logic.
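The slot layout of OverQ-Shift can be modelled arithmetically as below. This is a sketch, not the hardware datapath: it assumes 4-bit slots with one direction bit and three data bits in the adjacent slot, non-negative values in units of the quantization step, and it applies the adjacent slot's place value by multiplication, which stands in for the constant shifters. The names `encode_shift` and `shift_product` are hypothetical.

```python
BITS = 4            # regular slot width, as in the 4-bit example above
EXTRA = BITS - 1    # data bits left in the adjacent slot after the direction bit


def encode_shift(value, mode):
    """Return (regular_slot, extra_bits).
    mode "msb": extra bits hold positions 2^4..2^6 of an outlier.
    mode "lsb": extra bits hold positions 2^-1..2^-3 of a regular value."""
    cap = 2 ** (BITS + EXTRA) - 1
    if mode == "msb":
        q = min(round(value), cap)
        return q & (2 ** BITS - 1), q >> BITS
    q = min(round(value * 2 ** EXTRA), cap)   # fixed point with EXTRA fractional bits
    return q >> EXTRA, q & (2 ** EXTRA - 1)


def shift_product(regular, extra, mode, weight):
    """Sum both slot products, weighting the adjacent slot by its place value."""
    place = 2 ** BITS if mode == "msb" else 2 ** -EXTRA
    return regular * weight + (extra * weight) * place


outlier_slots = encode_shift(100, "msb")          # 100 = 6 * 16 + 4
outlier_product = shift_product(*outlier_slots, "msb", 3)
precise_slots = encode_shift(5.625, "lsb")        # 5.625 = 5 + 5/8
precise_product = shift_product(*precise_slots, "lsb", 2)
```

Both calls reproduce the exact product of the original value and the weight, one by extending the range upward and the other by adding fractional precision.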
One important advantage of OverQ-Shift over Split is that in addition to handling outliers, it can also improve the precision of non-outlier values by exploiting zero activations created by ReLU. When $x_i$ is not an outlier and $x_{i+1}$ is a zero, we can overwrite $x_{i+1}$ with additional LSBs of $x_i$ with no loss of information. We call this zero-reuse, since it reuses bits which would all be zero to store useful information instead. Zero-reuse increases the effective bitwidth and precision of $x_i$, and is a secondary source of accuracy improvement in addition to outlier quantization. Figure 4 gives numerical examples of OverQ-Shift being used for outliers and for zero-reuse. On the left side, out-of-range MSBs of an outlier are stored in the adjacent slot, overwriting regular data. On the right side, out-of-range LSBs of a regular value overwrite a zero in the adjacent slot. In each case the OverQ bit is set to 1, but the shift direction bit is different.
Figure 5. Channel reordering for OverQ: reordering puts channels with high outlier count next to channels with low outlier count to increase the probability of OverQ. Network used is ImageNet ResNet-18.
Channel Reordering
Although OverQ is intended to be a run-time hardware technique, its effectiveness can potentially be improved by compile-time transformations. OverQ is predicated on the fact that outliers will be adjacent to a small value with high probability. We can increase this probability by reordering the channels in a layer such that channels likely to contain outliers are adjacent to channels likely to contain zeros. Our proposed channel reordering procedure is as follows. First, activation distributions are sampled using a small profiling dataset (this is already fairly standard practice for data-free activation quantization (Migacz, 2017)). Then, the number of outliers in each channel is counted (outliers can be defined as the largest 1% of activations). Finally, the channels are reordered by outlier count following a sequence of high, low, high, low, high, etc. Figure 5 shows the outlier count in each channel of a CNN layer before and after reordering. Reordering can be implemented as a compile-time graph transformation which swaps weight filters in a layer such that the output channels follow the desired order. Importantly, channel reordering incurs no overhead at run time as the size of all tensors remains unchanged.
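The final step of the procedure can be sketched as a permutation over channel indices. Pairing the highest remaining count with the lowest remaining count is one way to realize the high, low, high, low sequence and is an assumption of this example, as is the function name `reorder_channels`.

```python
def reorder_channels(outlier_counts):
    """Return a channel permutation alternating high and low outlier counts."""
    # Channel indices sorted from most to fewest outliers.
    order = sorted(range(len(outlier_counts)),
                   key=lambda c: outlier_counts[c], reverse=True)
    perm = []
    lo, hi = 0, len(order) - 1
    while lo <= hi:
        perm.append(order[lo])        # next highest-count channel
        lo += 1
        if lo <= hi:
            perm.append(order[hi])    # next lowest-count channel
            hi -= 1
    return perm


counts = [5, 0, 9, 1, 7, 2]           # profiled outlier count per channel
perm = reorder_channels(counts)
reordered_counts = [counts[c] for c in perm]
```

The resulting permutation would then be applied to the weight filters producing these channels (and to the matching input channels of the next layer), so the tensor sizes are unchanged.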
Mapping OverQ to DNN Accelerators
From the outset, OverQ was designed for efficient hardware implementation in realistic DNN accelerators. In this section, we show how OverQ can be added using only lightweight architectural changes to a weight-stationary spatial array, a common template for DNN accelerators. Here, weight-stationary (WS) refers to a type of DNN dataflow in which weights are held stationary in PEs while inputs and partial sums move through the accelerator. A recent study on dataflow choice in DNN hardware literature (Yang et al., 2018a) showed that the WS dataflow was the most popular. Experiments in the study also demonstrated WS to be the most hardware-efficient dataflow, albeit only by a small margin.
Figure 6(a) shows a weight-stationary spatial accelerator for matrix-matrix multiplication. The left side shows how the input activations move from left to right while the output partial sums (psums) move from top to bottom. The right side depicts the architecture of a PE, which contains a single multiplier and adder. This accelerator can target fully-connected or 1 × 1 convolutional layers.
A weight-stationary spatial architecture is well-suited for OverQ for two reasons. First, the array spatially unrolls the input channels. In Figure 6(a) the input channels are mapped to rows along the vertical axis. Adjacent channels are therefore mapped to physically adjacent PEs during processing. Second, the weights are held stationary in each PE. These two factors make it relatively easy for a PE to access its adjacent weight. Figure 6(b) illustrates the implementation of OverQ-Split in a PE. A 1-bit wire and register are added to each PE to propagate the OverQ flag bit, which is used to multiplex between the PE's own weight and its adjacent weight. Figure 6(c) shows the modifications needed for OverQ-Shift, which is more complex. In addition to the hardware for OverQ-Split, a second mux is needed after the multiplier to implement a possible shift on its output.
For simplicity, the figure only depicts selecting between no shift and a right shift, which is sufficient for outlier overwrite. To support zero-reuse, a larger mux is needed to choose between no shift, right shift, or left shift. A second select bit would be routed from the input activation register.
To decide which activations to overwrite, we take advantage of the fact that output accumulation typically occurs at a larger bitwidth than the input weights or activations (e.g., this is done in Google's TPU (Jouppi et al., 2017)). When outputs exit the WS array at the bottom, they must be rescaled and re-quantized down to the activation bitwidth for the next layer. The rescaling unit can be modified to make decisions on where to perform OverQ. The size of the rescaling unit scales with the width of the array only, whereas the PEs scale with both width and height of the array. As a result, we believe that the dominant resource overhead of OverQ will be from the modifications to each PE.
Experimental Proof-of-Concept
To demonstrate the potential of overwrite quantization, two sets of experimental results are provided below. The first is an evaluation of the impact of OverQ on the accuracy of ResNet-18 (He et al., 2015) for ImageNet classification (Deng et al., 2009). ResNet-18 was the most modern network used in the experimental section of OLAccel (the others being AlexNet and VGG-16). The second is a set of resource usage measurements on a small-scale prototype FPGA accelerator for CNN inference. Our experiments define outliers as all values above a pre-computed threshold; the threshold was determined using MMSE clipping (Sung et al., 2015; Shin et al., 2016) unless otherwise noted. MMSE clipping was shown by Zhao et al. (2019) to be the best performing clipping method from the literature.
For the accuracy experiments, baseline and OverQ quantization were implemented in TensorFlow (Abadi et al., 2015) and applied to ResNet-18 for ImageNet classification. The basic flow of quantization is to scale and clip values into the range $[-(2^B - 1),\ 2^B - 1]$, round to integers, and scale back to the original range. $B$ here is the unsigned bitwidth in a sign-magnitude representation. OverQ was implemented on top of this using tf.select, TensorFlow's ternary if-else operator.
Effectiveness of Channel Reordering
Reordering was implemented as a TensorFlow graph edit pass which visits each layer and performs static weight shuffling to generate the desired output channel order. This pass can be applied to a TensorFlow model checkpoint to generate an alternative checkpoint with reordered channels. To evaluate channel reordering, let us first define outlier coverage as the fraction of outliers which can benefit from overwrite. Ideally, coverage would be 100% like in OLAccel, but the opportunistic nature of OverQ prevents this. Nevertheless it is preferred that coverage be as high as possible. Print statements (tf.print) were used to log outlier coverage in ResNet-18 during inference for a single batch of 250 images. The outlier threshold for this experiment was determined using MMSE clipping. Table 1 compares the coverage between the original and reordered models in the first 8 layers. The data shows that reordering improves outlier coverage by up to 8.8% in some layers and increases overall accuracy. Reordering is most effective in early layers where the number of channels is relatively small. The remaining experiments in this section use the reordered ResNet-18 model. 
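The coverage metric defined above can be computed for a single activation vector along the channel dimension as in the following sketch. The function name and arguments are hypothetical, and the values are assumed to be non-negative post-ReLU activations.

```python
def outlier_coverage(acts, clip, small_thresh):
    """Fraction of outliers (values above clip) whose right neighbour is below
    small_thresh, i.e. outliers that OverQ is able to overwrite into."""
    covered = 0
    total = 0
    for i, a in enumerate(acts):
        if a > clip:
            total += 1
            if i + 1 < len(acts) and acts[i + 1] < small_thresh:
                covered += 1
    return covered / total if total else 0.0


# Four outliers: two have a small neighbour, one does not, one sits at the boundary.
coverage = outlier_coverage([20, 0, 3, 18, 9, 30, 1, 40], clip=15, small_thresh=4)
```

Averaging this quantity over all vectors in a layer would give a per-layer coverage figure of the kind reported in Table 1.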
Accuracy Impact on ImageNet Classification
The accuracy of a quantized CNN depends greatly on the clipping threshold (i.e., the scaling value mentioned above). A data-free quantization flow following NVIDIA TensorRT (Migacz, 2017) and OCS (Zhao et al., 2019) was used. A profiling dataset of 500 training images was used to sample the activation distribution, and from this the clip threshold $S$ for each layer was determined. $S$ is the maximum representable value of the fixed-point format. The threshold below which values may be overwritten in OverQ is set at $S/4$; comparing a fixed-point value against $S/4$ can be done efficiently using bitwise logic. Two experiments were conducted: (1) sweeping the clip threshold from 0.2× to 0.9× of the maximum sampled value; (2) using the MMSE clipping threshold in each layer. Weights were quantized to eight bits in all experiments. The floating-point baseline accuracy for ResNet-18 is 69.7%.
Figure 7 shows the sweep of the clip threshold $S$ with 5- and 4-bit activation quantization. OL and ZR refer to outlier overwrite and zero-reuse, respectively. The plots show Baseline, OL, ZR, and simultaneous OL+ZR. In both plots we see the same pattern: with a small $S$, OL performs above baseline and ZR near baseline, while with a large $S$, OL performs near baseline and ZR above it. Conceptually, a small clip threshold produces more outliers (cases where the activation exceeds $S$) and thus OL becomes more effective, while ZR performs poorly since it cannot address the outliers which are responsible for the accuracy degradation. In the large $S$ regime there are very few outliers for OL, meaning ZR is more effective as it can utilize zeros to increase the precision of some fraction of activations. Fortunately, it is not necessary to choose: simultaneous OL+ZR strictly outperforms either technique alone and the curve behaves as the sum of the accuracy gains from OL and ZR. At the accuracy peak, OL+ZR improves Top-1 by 0.7% at 5 bits and 2.5% at 4 bits.
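One plausible reading of the bitwise comparison against $S/4$ mentioned above is a check that the two most significant magnitude bits are zero. The sketch below assumes unsigned $B$-bit magnitude codes with $S = 2^B - 1$; the function name is hypothetical and the actual logic used in the prototype is not specified in the text.

```python
def below_quarter_range(code, bits):
    """True when the top two magnitude bits of code are zero, i.e. code < 2**(bits - 2).
    For integer codes this is equivalent to code < S / 4 with S = 2**bits - 1."""
    return (code >> (bits - 2)) == 0


bits = 4
S = 2 ** bits - 1
overwritable = [c for c in range(S + 1) if below_quarter_range(c, bits)]
```

For 4-bit codes this selects the values 0 through 3, matching the arithmetic comparison against 15/4 = 3.75 without requiring a comparator.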
Hardware Resource Impact on FPGA Prototype
OverQ was implemented on a small matrix-vector multiply accelerator, which is used to execute 3 × 3 conv layers using real data extracted from TensorFlow. The accelerator was built using C++ source and synthesized to Verilog using Xilinx Vivado HLS version 2016.3, then implemented to bitstream for a Xilinx xc7z020 FPGA device. The baseline design is an M × N array pipelined in HLS, where M and N are the unroll factors for the input and output channels. Evaluation was done on inputs, weights and outputs taken from a single layer in ResNet-18. OverQ had no impact on latency. Table 3 shows resource usage of the baseline, OverQ-Split (OL only) and OverQ-Shift (OL+ZR) designs.
The key observation is that although the percentage increase in LUTs and FFs is considerable (LUT usage grows by over 3× from baseline to OverQ-Shift), the LUT and FF utilization on the device remains very low in comparison to DSPs (which are used to implement MAC units). Based on our data, scaling up the design will result in DSPs running out well before the logic overhead of OverQ becomes an issue. Commercial FPGAs have an abundance of LUTs and FFs available, but DSPs are at a premium because they are much larger in area. DNN accelerators are almost always bottlenecked by DSP usage (Qiu et al., 2016; Zhang & Li, 2017). Fortunately, OverQ was specifically designed to avoid MAC overhead and this is reflected in the results. Another important finding is that the addition of OverQ does not hurt timing.
It is important to note that the accelerator is only for matrix-matrix products, and lacks other components typically found in a DNN accelerator such as a memory system, rescaling unit, bias unit, etc. The hardware which decides which activations to overwrite was not prototyped. Section 4 describes how the rescaling unit can be modified to make the OverQ decisions and why its resource overhead is expected to be small compared to the overhead in the PE array.
Conclusions and Future Work
Our preliminary results demonstrate that OverQ has potential as a lightweight hardware-centric technique for improving activation quantization. However, we have only performed tests on a single, relatively simple network (ResNet-18) and our hardware prototype consisted of just the core MAC array, leaving out other important components required for a full-fledged DNN accelerator (e.g., memory system, rescaling and bias add). Nevertheless, the data gives us confidence that OverQ is indeed feasible in real hardware at fairly low hardware overhead. The most pertinent future work is expanding the scope of the evaluation to an accelerator which can execute an entire DNN on its own, and to collect data on larger networks.
