Overwrite Quantization: Opportunistic Outlier Handling for Neural
  Network Accelerators by Zhao, Ritchie et al.
Ritchie Zhao¹, Christopher De Sa¹, Zhiru Zhang¹
Abstract
Outliers in weights and activations pose a key
challenge for fixed-point quantization of neural
networks. While outliers can be addressed by fine-
tuning, this is not practical for machine learning
(ML) service providers (e.g., Google, Microsoft)
who often receive customers’ models without
the training data. Specialized hardware for han-
dling outliers can enable low-precision DNNs,
but incurs nontrivial area overhead. In this pa-
per, we propose overwrite quantization (OverQ),
a novel hardware technique which opportunis-
tically increases bitwidth for outliers by letting
them “overwrite” adjacent values. An FPGA pro-
totype shows OverQ can significantly improve
ResNet-18 accuracy at 4 bits while incurring rela-
tively little increase in resource utilization.
1. Introduction
Deep neural networks (DNNs) have achieved state-of-the-
art results in many machine learning domains including
computer vision, natural language processing, robotic con-
trol, and game AI. However, a key drawback of DNNs is
high computational and storage requirements. For example,
the cutting-edge BERT architecture (Devlin et al., 2018)
contains as many as 345 million floating-point parameters.
The execution costs of DNNs impede deployment on edge
devices (Xu et al., 2018) and have a measurable impact on the
carbon footprint of datacenters (Strubell et al., 2019).
Quantizing neural network weights and activations from
floating-point to low-precision fixed-point is a promising ap-
proach for reducing DNN size and complexity. DNN quan-
tization is a highly active research topic, with many works
that propose fine-tuning a network to make the weights
quantization-friendly (Wu et al., 2018; Jacob et al., 2018;
Yang et al., 2018b; Choi et al., 2018; Zhou et al., 2017;
Banner et al., 2018). Recently, however, a number of works
have argued for the importance of post-training or data-free
quantization techniques (Migacz, 2017; Zhao et al., 2019;
Nagel et al., 2019; Meller et al., 2019). This is because ML
service providers in industry must optimize their customers’
neural networks without having access to the training data.

¹Cornell University, Ithaca, New York 14850, USA. Correspondence to: Ritchie Zhao <rz252@cornell.edu>.
Preprint, work in progress

DNN weights and activations are distributed in a bell curve,
with the majority of values concentrated near zero and a
long tail of rare outliers with large magnitude. The outliers
present a challenge for uniform quantization, and a number
of techniques have been proposed to deal with them, including
clipping (Sung et al., 2015; Shin et al., 2016; McKinstry
et al., 2018; Migacz, 2017), channel splitting (Zhao et al.,
2019; Park & Choi, 2019), and channel equalization (Nagel
et al., 2019; Meller et al., 2019). These techniques have
made data-free quantization of many popular DNNs and
CNNs possible at 8 or 7 bits without accuracy degradation.
For even lower precision, specialized hardware can be used.
Park et al. (Park et al., 2018b;a) proposed an outlier-aware
DNN accelerator (OLAccel) which uses a conventional pro-
cessing engine (PE) for central activations and a second
sparse PE for outliers. OLAccel achieves < 1% accuracy
loss on AlexNet using 4 bits for most values and 16 bits for
a small percentage of outliers. Though effective, OLAccel’s
major drawback is using an entirely separate outlier engine
— this requires additional multiply-accumulate (MAC) units
and incurs hardware overheads due to sparsity.
In this paper, we present overwrite quantization (OverQ),
a data-free quantization technique which uses lightweight
architectural extensions to address outliers. OverQ’s pri-
mary goal is to reduce the area overhead of OLAccel while
still achieving significant accuracy gains over a naïve baseline.
To achieve this, OverQ opportunistically re-allocates
bits from unimportant values to represent adjacent outliers.
Unlike OLAccel, OverQ does not handle every outlier. However,
an exploratory study on ImageNet ResNet-18 shows
that OverQ addresses 80–95% of outliers and achieves a
1.0–2.5% Top-1 accuracy improvement. An FPGA prototype
built using high-level synthesis shows that OverQ
requires no additional MAC units, and that its logic overhead
is negligible compared to the MAC array size. Although the
scope of the evaluation is currently limited, it demonstrates
the potential of overwrite quantization in larger networks.
arXiv:1910.06909v1 [cs.LG] 13 Oct 2019
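[Figure 1 diagram: the input vector x1..x4 contains an outlier x2. The outlier is stored as an 8-bit value with 10-bit h, 10-bit w, and 12-bit c indices and sent to the Sparse Outlier PE; the remaining values are stored as a dense vector of 4-bit entries, with 0 in place of x2, and sent to the Dense PE.]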
Figure 1. Execution flow of outlier-aware accelerator (OLAccel) (Park et al., 2018a) — The input vector is separated into a dense
vector containing most values, and sparse outliers. Twice the regular bitwidth is used to represent the outliers. The vector is processed by
a dense PE while the outliers are processed by a different sparse PE.
2. Related Work
We limit the discussion of related work to existing data-free
quantization techniques in software and hardware.
2.1. Handling Outliers in Software
Clipping weights and activations to a pre-determined thresh-
old is a straightforward way to control outlier effects. Pro-
posed methods to compute the clip threshold include mini-
mal mean-squared error (Sung et al., 2015; Shin et al., 2016),
percentile of values (McKinstry et al., 2018), and KL di-
vergence (Migacz, 2017). On popular ImageNet CNNs,
clipping with 8-bit fixed-point can achieve no accuracy
degradation (Migacz, 2017).
Channel equalization seeks to balance the magnitude of
different channels by scaling one layer’s channels by a con-
stant vector, then scaling the following layer by the inverse
vector (Nagel et al., 2019; Meller et al., 2019). Nagel et
al. (Nagel et al., 2019) also propose bias correction, which
adjusts the biases in each layer to compensate for the fact
that quantization skews the activation mean. A combination
of these techniques enables 8-bit integer quantization for
MobileNetV1 and ResNet-18 without accuracy loss.
Weight splitting takes nodes or channels containing outlier
weights and duplicates them while dividing the weight in
half (Zhao et al., 2019; Park & Choi, 2019). This preserves
network equivalence but reduces the magnitude of the out-
liers. Splitting requires static information on the locations
of outliers and is not effective for activations (Zhao et al.,
2019). Even for weights, splitting requires substantial model
size overhead to restore a <8-bit model to floating-point
accuracy.
2.2. Handling Outliers with Specialized Hardware
Software techniques have difficulty reaching precisions
below 8 bits without accuracy degradation. Specialized hardware
can help overcome this barrier. Park et al. (Park et al.,
2018b;a) proposed OLAccel, an outlier-aware DNN acceler-
ator which uses a dense 4/8-bit PE for most activations and
a sparse 8/16-bit PE for outliers. Increasing the bitwidth
for outliers affords greater dynamic range. Figure 1 shows
how OLAccel separates an input vector into a dense vector
and sparse outliers for separate processing. Each outlier is
accompanied by indices (width, height, channels) to track
its location in the original vector. On AlexNet, OLAccel
with 4-bit quantization achieves accuracy within 1% of the
float baseline when the outlier PE handles 1% of all values.
The outlier PE in OLAccel incurs non-trivial hardware over-
head. Although Park et al. did not specify the exact area
of the outlier PE in their area comparison, we can observe
that: (1) the outlier PE requires additional MAC units since
it operates at a different bitwidth; (2) the sparse representa-
tion incurs storage overhead. For each 8/16-bit outlier, 32
additional bits are used for indices (see Figure 1). OverQ
seeks to handle outliers in a higher precision but in a more
area-efficient manner.
[Figure 2 diagram: Base stores x1, x2 (outlier), x3, x4 in four slots; OverQ stores x1, x2 at double width, and x4, with x3 dropped.]
Figure 2. Basic idea of Overwrite Quantization: an outlier x_i
can overwrite the adjacent value x_{i+1} when x_{i+1} is small. If so,
twice the bitwidth is used to represent x_i and x_{i+1} is dropped.
3. Overwrite Quantization
Consider the problem of quantizing an N-element activation
vector {x_i}_{i=1}^N. The basic idea of overwrite quantization is
illustrated in Figure 2. An outlier x_i can overwrite its adjacent
value x_{i+1} when x_{i+1} is smaller than a fixed threshold¹.
When this overwrite occurs, x_i is represented using twice
the normal bitwidth while x_{i+1} is dropped. In this way, OverQ
dynamically re-allocates bitwidth from an insignificant value to an
important outlier. OverQ exploits two properties of DNN
activations: (1) outliers are rare, but contribute disproportionately
to a layer’s output; (2) activation distributions peak
near zero, and contain many ReLU-induced zeros. The first
property ensures that overwrites occur infrequently, so
few values are dropped. The second property
means that an outlier will lie beside a small value with significant
probability. In hardware, OverQ reuses the MAC
unit from the adjacent value and performs two low-precision
MACs to compute the outlier result.

[Figure 3 diagram, three panels: Baseline, OverQ-Split, and OverQ-Shift. Each panel shows four activation slots with per-slot OverQ flag bits, multiplied by weights and summed. In Baseline the slots hold x1..x4 with weights w1..w4 and all flags 0. In the OverQ panels x3 is overwritten: its slot holds x2/2 (Split) or x2[6:4] plus a shift direction bit followed by an L/R shift (Shift), its flag is 1, and w2 is used in place of w3.]
Figure 3. Dot product computation with OverQ: the two variants, OverQ-Split and OverQ-Shift, differ in how they use the additional
bits to represent outliers. For both variants the weight w_i is copied to the adjacent MAC.
In DNNs, OverQ would be applied along a single dimen-
sion of the activation tensor (width, height, or channels for
convolutional networks). The chosen dimension affects the
hardware architecture as it determines which dimensions
must be spatially unrolled. Experiments showed that OverQ
along the channels is much more effective than along either
spatial dimension, since spatially adjacent values are highly
correlated in CNNs while adjacent channels exhibit less
correlation. We assume OverQ along the channels of a CNN
from this point onward.
3.1. Computing with OverQ
Figure 3 illustrates two variants of overwrite quantization
and how computation is done with each. Figure 3(a) shows a
dot product between activations x_i and weights w_i. Such dot
products are the basis of convolutional and fully-connected
layers in DNNs. In OverQ, each activation is associated
with a flag bit indicating whether that activation is being
overwritten. Figure 3(b) shows OverQ-Split, where half the
outlier value (x_i/2) is stored in both hardware vector slots.
During computation, the PE detects the flag bit and copies
w_i to the position of w_{i+1}; the two products at indices i
and i+1 thus sum to x_i × w_i. Copying the weight is
simple in hardware as long as we spatially unroll the vector
such that the compute units for x_i and x_{i+1} are physically
close by. OverQ-Split can represent a value with twice the
normal dynamic range, essentially adding one extra bit of
representation to affected outliers. The primary advantage
of OverQ-Split is its simplicity. Section 4 shows how it
can be implemented in a spatial accelerator with only basic
muxing logic.

¹The boundary value x_N has no adjacent value and cannot
benefit from OverQ.
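The OverQ-Split computation described above can be illustrated with a small software model. This is a minimal sketch, not the authors' implementation: the function names, the 4-bit slot width, and the overwrite threshold of 4 are illustrative assumptions. The text stores x_i/2 in both slots; for odd integers this sketch stores the ceiling and floor halves so that the two products still sum exactly to x_i × w_i, which is also an assumption.

```python
def overq_split_encode(x, bits, small_threshold):
    """Encode non-negative integer activations into `bits`-wide slots.

    Returns (slots, flags). flags[j] == 1 marks slot j as overwritten by
    the outlier stored in slot j - 1.
    """
    qmax = (1 << bits) - 1                 # largest value one slot can hold
    slots = [min(v, qmax) for v in x]      # baseline behaviour: clip everything
    flags = [0] * len(x)
    i = 0
    while i < len(x) - 1:                  # the last value has no right neighbour
        if x[i] > qmax and x[i + 1] < small_threshold:
            v = min(x[i], 2 * qmax)        # two slots give twice the range
            slots[i] = (v + 1) // 2        # ceiling half (assumption for odd v)
            slots[i + 1] = v // 2          # floor half overwrites the small value
            flags[i + 1] = 1
            i += 2                         # an overwritten slot cannot overwrite
        else:
            i += 1
    return slots, flags


def overq_split_dot(slots, flags, weights):
    """Dot product in which a flagged slot uses the previous slot's weight."""
    total = 0
    for j, (s, f) in enumerate(zip(slots, flags)):
        w = weights[j - 1] if f else weights[j]   # mux: adjacent or own weight
        total += s * w
    return total


x = [3, 25, 1, 7]        # 25 exceeds the 4-bit maximum of 15
w = [2, -1, 4, 3]
slots, flags = overq_split_encode(x, bits=4, small_threshold=4)
overq_result = overq_split_dot(slots, flags, w)
clipped_result = sum(min(v, 15) * wi for v, wi in zip(x, w))
exact_result = sum(v * wi for v, wi in zip(x, w))
```

In this example the exact dot product is 6, plain clipping gives 16, and the OverQ-Split model gives 2: the outlier is represented exactly, and the only remaining error comes from dropping the small neighbour.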
Figure 3(c) shows OverQ-Shift, a more complex but also
more flexible variant of overwrite quantization. OverQ-Shift
uses the adjacent slot to hold out-of-range bits of the outlier
x_i. These could be either additional more-significant bits
(MSBs) or less-significant bits (LSBs). One bit in the adjacent
slot is designated as the shift direction bit, and selects
between MSB or LSB representation. The remaining bits
store data. Because the adjacent slot represents different binary
positions than a regular slot, its product requires either
a shift right (for MSBs) or a shift left (for LSBs) before
accumulation. For the example in Figure 3(c), a regular
slot holds binary positions 2^3 2^2 2^1 2^0, while the OverQ slot
holds positions 2^6 2^5 2^4. A shifter after the multiplier uses
the shift direction bit to determine whether to shift left or
right. OverQ-Shift is more hardware intensive as it requires
two different (constant) shifters as well as additional decode
and mux logic.
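The bit positions in this example can be checked with a short software model of the outlier case. This is a sketch under stated assumptions: the function names and the encoding of the shift direction bit (top bit of the adjacent slot, 1 meaning "extra MSBs") are hypothetical, and the shifter is modelled arithmetically as multiplication by the positional weight 2^4 rather than as a particular hardware shift.

```python
DATA_BITS = 4                  # regular slot: positions 2^3 .. 2^0
EXTRA_BITS = DATA_BITS - 1     # adjacent slot: 1 direction bit + 3 data bits
DIR_MSB = 1                    # hypothetical encoding: 1 = extra bits are MSBs


def shift_encode_outlier(x):
    """Return (regular_slot, adjacent_slot) for a non-negative integer outlier."""
    x = min(x, (1 << (DATA_BITS + EXTRA_BITS)) - 1)   # saturate at 7 bits
    low = x & ((1 << DATA_BITS) - 1)                  # positions 2^3 .. 2^0
    high = x >> DATA_BITS                             # positions 2^6 .. 2^4
    return low, (DIR_MSB << EXTRA_BITS) | high


def shift_mac_outlier(regular_slot, adjacent_slot, w):
    """Two low-precision products; the adjacent one is rescaled by 2^4."""
    direction = adjacent_slot >> EXTRA_BITS
    high = adjacent_slot & ((1 << EXTRA_BITS) - 1)
    if direction != DIR_MSB:
        raise ValueError("this sketch only covers the outlier (MSB) case")
    return regular_slot * w + (high * w) * (1 << DATA_BITS)


low, adjacent = shift_encode_outlier(90)      # 90 = 0b1011010
result = shift_mac_outlier(low, adjacent, w=3)
```

Here the regular slot holds 0b1010 and the adjacent slot holds the direction bit followed by 0b101, and the two partial products recombine to 90 × 3.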
One important advantage of OverQ-Shift over OverQ-Split is that,
in addition to handling outliers, it can also improve the precision
of non-outlier values by exploiting zero activations
created by ReLU. When x_i is not an outlier and x_{i+1} is a
zero, we can overwrite x_{i+1} with additional LSBs of x_i
with no loss of information. We call this zero-reuse, since
it reuses bits which would all be zero to store useful information
instead. Zero-reuse increases the effective bitwidth
and precision of x_i, and is a secondary source of accuracy
improvement in addition to outlier quantization.

[Figure 4 diagram, two panels: "Outlier (OL) stores additional MSB bits" and "Zero-Reuse (ZR) stores additional LSB bits". Each panel shows binary fixed-point activations, the slot contents and OverQ flag bits after encoding, and the shift applied after the multiplier.]
Figure 4. Handling outliers and zero-reuse with OverQ-Shift: if x_i is an outlier, the adjacent slot stores out-of-range MSBs. If
x_i is not an outlier but x_{i+1} is a zero, the adjacent slot can be “reused” to store out-of-range LSBs of x_i. The shift direction bit
(green) indicates the required shift direction after multiplication.
Figure 4 gives numerical examples of OverQ-Shift being
used for outliers and for zero-reuse. On the left side, out-of-range
MSBs of an outlier are stored in the adjacent slot,
overwriting regular data. On the right side, out-of-range
LSBs of a regular value overwrite a zero in the adjacent
slot. In each case the OverQ bit is set to 1, but the shift
direction bit is different.
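The zero-reuse case can be sketched in the same spirit. The values below are illustrative and are not the numbers from Figure 4. The sketch assumes a regular slot with 4 integer bits and 3 extra data bits in the adjacent slot holding positions 2^-1 to 2^-3; the function names are hypothetical.

```python
def zero_reuse_encode(x_real, data_bits=4, extra_bits=3):
    """Split a non-negative real value into (main, extra) slot contents.

    `main` holds positions 2^3 .. 2^0 and `extra` holds positions
    2^-1 .. 2^-3, both as unsigned integers.
    """
    scaled = round(x_real * (1 << extra_bits))            # 3 fractional bits
    scaled = min(scaled, (1 << (data_bits + extra_bits)) - 1)
    return scaled >> extra_bits, scaled & ((1 << extra_bits) - 1)


def zero_reuse_value(main, extra, extra_bits=3):
    """Value represented by the regular slot plus the reused zero slot."""
    return main + extra / (1 << extra_bits)


x = 5.625
main, extra = zero_reuse_encode(x)
with_reuse = zero_reuse_value(main, extra)   # uses the neighbouring zero slot
without_reuse = round(x)                     # a single 4-bit integer slot
```

With the extra LSBs the value 5.625 is stored exactly, whereas a single integer slot rounds it to 6. Nothing is lost in the neighbouring slot because it held a zero.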
Figure 5. Channel reordering for OverQ — reordering puts
channels with high outlier count next to channels with low out-
lier count to increase the probability of OverQ. Network used is
ImageNet ResNet-18.
3.2. Channel Reordering
Although OverQ is intended to be a run-time hardware
technique, its effectiveness can potentially be improved by
compile-time transformations. OverQ is predicated on the
fact that outliers will be adjacent to a small value with high
probability. We can increase this probability by reordering
the channels in a layer such that channels likely to contain
outliers are adjacent to channels likely to contain zeros. Our
proposed channel reordering procedure is as follows. First,
activation distributions are sampled using a small profiling
dataset (this is already fairly standard practice for data-free
activation quantization (Migacz, 2017)). Then, the number
of outliers in each channel is counted (outliers can be defined
as the largest 1% of activations). Finally, the channels
are reordered by outlier count following a sequence of
high, low, high, low, and so on. Figure 5 shows the outlier
count in each channel of a CNN layer before and after reordering.
Reordering can be implemented as a compile-time
graph transformation which swaps weight filters in a layer
such that the output channels follow the desired order. Importantly,
channel reordering incurs no overhead at run time
as the size of all tensors remains unchanged.
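The final reordering step can be sketched as follows. The text specifies only the alternating high, low, high, low sequence; pairing the highest remaining count with the lowest remaining count is an assumption of this sketch, as are the function name and the example counts.

```python
def reorder_channels(outlier_counts):
    """Return a channel permutation alternating high and low outlier counts."""
    ranked = sorted(range(len(outlier_counts)),
                    key=lambda c: outlier_counts[c], reverse=True)
    order = []
    lo, hi = 0, len(ranked) - 1
    while lo <= hi:
        order.append(ranked[lo])          # next-highest outlier count
        if lo != hi:
            order.append(ranked[hi])      # next-lowest outlier count
        lo += 1
        hi -= 1
    return order


counts = [40, 2, 35, 0, 7, 1]             # profiled outliers per channel
order = reorder_channels(counts)
reordered_counts = [counts[c] for c in order]
```

The returned permutation would then be applied to the output filters of the layer producing these activations, and to the matching input channels of the layer consuming them, so that the network function is unchanged.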
4. Mapping OverQ to DNN Accelerators
From the outset, OverQ was designed for efficient hard-
ware implementation in realistic DNN accelerators. In this
section, we show how OverQ can be added using only
lightweight architectural changes to a weight-stationary
spatial array — a common template for DNN accelera-
tors. Here, weight-stationary (WS) refers to a type of DNN
dataflow (Chen et al., 2017) in which weights are held sta-
tionary in PEs while inputs and partial sums move through
the accelerator. A recent study on dataflow choice in DNN
hardware literature (Yang et al., 2018a) showed that WS
dataflow was the most popular. Experiments in the study
also demonstrated WS to be the most hardware-efficient
dataflow, albeit only by a small margin.
Figure 6(a) shows the organization of a baseline WS spatial
accelerator for matrix-matrix multiplication. The left side
shows how the input activations move from left to right
while the output partial sums (psums) move from top to
bottom. The right side depicts the architecture of a PE,
which contains a single multiplier and adder. This accelerator
can target fully-connected or 1×1 convolutional layers.

[Figure 6 diagram: (a) Baseline: a 2D PE array with input channels along the rows and output channels along the columns, followed by a "Rescaling, Bias Add" stage, plus a PE with stationary weight w, channel activation x, partial sum (PSum), multiplier, and adder; (b) OverQ-Split PE, which adds the OverQ flag and a mux selecting the adjacent weight; (c) OverQ-Shift PE, which adds a shifter after the multiplier.]
Figure 6. OverQ hardware architecture sketch: OverQ can be supported with lightweight changes in a weight-stationary 2D spatial
array, a common DNN accelerator design. (a) Baseline systolic array and PE showing weight, activation, and partial sum (PSum) along
with MAC unit; (b) OverQ-Split PE requires only a flag bit and a mux between own and adjacent weights; (c) OverQ-Shift PE requires
additional muxing and shifters.
A weight-stationary spatial architecture is well-suited for
OverQ for two reasons. First, the array spatially unrolls
the input channels. In Figure 6(a) the input channels are
mapped to rows along the vertical axis. Adjacent channels
are therefore mapped to physically adjacent PEs during pro-
cessing. Second, the weights are held stationary in each PE.
These two factors make it relatively easy for a PE to access
its adjacent weight.
Figure 6(b) illustrates the implementation of OverQ-Split in
a PE. A 1-bit wire and register are added to each PE to propagate
the OverQ flag bit, which is used to multiplex between
the PE’s own weight and its adjacent weight. Figure 6(c)
shows the modifications needed for OverQ-Shift, which is
more complex. In addition to the hardware for OverQ-Split,
a second mux is needed after the multiplier to implement a
possible shift on its output. For simplicity the figure only
depicts selecting between no shift and a right shift, which
is sufficient for outlier overwrite. To support zero-reuse, a
larger mux is needed to choose between no shift, right shift,
or left shift. A second select bit would be routed from the input
activation register.
To decide which activations to overwrite, we take advantage
of the fact that output accumulation typically occurs
at a larger bitwidth than the input weights or activations
(e.g., this is done in Google’s TPU (Jouppi et al., 2017)).
When outputs exit the WS array at the bottom, they must
be rescaled and re-quantized down to activation bitwidth
for the next layer. The rescaling unit can be modified to
make decisions on where to perform OverQ. The size of
the rescaling unit scales with the width of the array only,
whereas the PEs scale with both width and height of the
array. As a result we believe that the dominant resource
overhead of OverQ will be from the modifications to each
PE.
5. Experimental Proof-of-Concept
To demonstrate the potential of overwrite quantization, two
sets of experimental results are provided below. The first
is an evaluation of the impact of OverQ on the accuracy
of ResNet-18 (He et al., 2015) for ImageNet classifica-
tion (Deng et al., 2009). ResNet-18 was the most modern
network used in the experimental section of OLAccel (the
others being AlexNet and VGG-16). The second is measure-
ments of resource usage on a small-scale prototype FPGA
accelerator for CNN inference. Our experiments define
outliers as all values above a pre-computed threshold; the
threshold was determined using MMSE clipping (Sung et al.,
2015; Shin et al., 2016) unless otherwise noted. MMSE clip-
ping was shown to be the best performing clipping method
from literature in Zhao et al. (Zhao et al., 2019).
For the accuracy experiments, baseline and OverQ quantization
were implemented in TensorFlow (Abadi et al., 2015) and
applied to ResNet-18 for ImageNet classification. The basic
flow of quantization is to scale and clip values into the range
[−(2^B − 1), 2^B − 1], round to integers, and scale back
to the original range. B here is the unsigned bitwidth in a
sign-magnitude representation. OverQ was implemented on
top of this leveraging tf.select, TensorFlow’s ternary
if-else operator.
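The baseline quantization flow described above can be written in a few lines. This is a plain-Python sketch rather than the TensorFlow implementation used in the experiments; the function name and the `clip` parameter (the real value mapped to the largest integer level) are assumptions.

```python
def quantize(values, clip, bits):
    """Uniform sign-magnitude quantization with `bits` magnitude bits.

    Values are scaled so that `clip` maps to 2^bits - 1, rounded to
    integers, clipped to [-(2^bits - 1), 2^bits - 1], and scaled back.
    """
    qmax = (1 << bits) - 1
    scale = qmax / clip
    out = []
    for v in values:
        q = round(v * scale)               # scale and round to an integer
        q = max(-qmax, min(qmax, q))       # clip to the representable range
        out.append(q / scale)              # scale back to the original range
    return out


quantized = quantize([0.2, 0.8, 1.4, -4.0], clip=1.5, bits=2)
```

Note that Python's `round` rounds exact ties to the nearest even integer, which may differ from the rounding mode of a given hardware or framework implementation.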
5.1. Effectiveness of Channel Reordering
Reordering was implemented as a TensorFlow graph edit
pass which visits each layer and performs static weight shuf-
fling to generate the desired output channel order. This
pass can be applied to a TensorFlow model checkpoint to
generate an alternative checkpoint with reordered channels.
To evaluate channel reordering, let us first define outlier
coverage as the fraction of outliers which can benefit from
overwrite. Ideally, coverage would be 100% like in OLAc-
cel, but the opportunistic nature of OverQ prevents this.
Nevertheless it is preferred that coverage be as high as possi-
ble. Print statements (tf.print) were used to log outlier
coverage in ResNet-18 during inference for a single batch
of 250 images. The outlier threshold for this experiment
was determined using MMSE clipping. Table 1 compares
the coverage between the original and reordered models
in the first 8 layers. The data shows that reordering im-
proves outlier coverage by up to 8.8% in some layers and
increases overall accuracy. Reordering is most effective
in early layers where the number of channels is relatively
small. The remaining experiments in this section use the
reordered ResNet-18 model.
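The coverage metric defined in this section can be computed along one vector as sketched below. The function name and the example thresholds are hypothetical; the sketch assumes the overwrite threshold is no larger than the outlier threshold, so an overwritten neighbour is never itself an outlier.

```python
def outlier_coverage(x, outlier_threshold, small_threshold):
    """Fraction of outliers whose next neighbour is small enough to overwrite.

    Returns None when the vector contains no outliers.
    """
    outliers = 0
    covered = 0
    for i, v in enumerate(x):
        if v > outlier_threshold:
            outliers += 1
            # The last element has no neighbour and can never be covered.
            if i + 1 < len(x) and x[i + 1] < small_threshold:
                covered += 1
    return covered / outliers if outliers else None


channels = [20, 1, 18, 17, 0, 3, 19]
coverage = outlier_coverage(channels, outlier_threshold=15, small_threshold=4)
```

In this example four values exceed the outlier threshold, and two of them (20 and 17) sit next to a small value, giving a coverage of 0.5.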
Table 1. Outlier fraction and coverage in ResNet-18: data for
the first 8 layers are shown. Channel reordering improves coverage
in most layers and increases network accuracy.

Layer Name      Outlier Fraction   Coverage (Base)   Coverage (Reorder)   Delta
S1/blk1/conv1   0.22%              76.5%             85.3%                +8.8%
S1/blk1/conv2   0.32%              99.5%             95.6%                −3.8%
S1/blk2/conv1   0.15%              82.4%             85.6%                +3.2%
S1/blk2/conv2   0.09%              83.0%             85.4%                +2.4%
S2/blk1/conv1   0.15%              73.0%             80.1%                +7.1%
S2/blk1/conv2   0.09%              88.7%             90.3%                +1.6%
S2/blk2/conv1   0.08%              88.8%             77.8%                −10.9%
S2/blk2/conv2   0.02%              93.1%             96.6%                +3.5%
Accuracy                           67.8%             68.7%                +0.9%
5.2. Accuracy Impact on ImageNet Classification
The accuracy of a quantized CNN depends greatly on
the clipping threshold (i.e., the scaling value mentioned
above). A data-free quantization flow following NVIDIA
TensorRT (Migacz, 2017) and OCS (Zhao et al., 2019) was
used. A profiling dataset of 500 training images was used
to sample the activation distribution, and from this the clip
threshold S for each layer was determined. S is the maximum
representable value of the fixed-point format. The
threshold below which values may be overwritten in OverQ
is set at S/4; comparing a fixed-point value against S/4
can be done efficiently using bitwise logic. Two experiments
were conducted: (1) sweeping the clip threshold from
0.2× to 0.9× of the maximum sampled value; (2) using the
MMSE clipping threshold in each layer. Weights were quantized
to eight bits in all experiments. The floating-point
baseline accuracy for ResNet-18 is 69.7%.
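The remark that comparing against S/4 reduces to bitwise logic can be made concrete. The sketch below assumes the quantized magnitude is an unsigned integer of `bits` bits, so that S corresponds to the integer level 2^bits − 1; the function name is hypothetical. Under that assumption a value is below S/4 exactly when its two most significant magnitude bits are zero.

```python
def below_quarter_of_max(q, bits):
    """True if the unsigned `bits`-wide magnitude q is below S/4, S = 2^bits - 1."""
    return (q >> (bits - 2)) == 0       # both top magnitude bits are zero


bits = 4
S = (1 << bits) - 1
# Exhaustive check of the bitwise test against the arithmetic comparison.
agrees = all(below_quarter_of_max(q, bits) == (q < S / 4) for q in range(S + 1))
```

In hardware this amounts to a NOR of the two top bits, which is why the check adds very little logic to the unit that makes the overwrite decision.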
Figure 7 shows the sweep of the clip threshold S with 5-
and 4-bit activation quantization. OL and ZR refer to outlier
and zero-reuse, respectively. The plots show Baseline, OL,
ZR, and simultaneous OL+ZR. In both plots we see the
same pattern: with a small S OL performs above baseline
and ZR near baseline, while with a large S OL performs
near baseline and ZR above. Conceptually, a small clip
threshold produces more outliers (cases where the activation
exceeds S) and thus OL becomes more effective, while
ZR performs poorly since it cannot address the outliers
which are responsible for the accuracy degradation. In the
large S regime there are very few outliers for OL, meaning
ZR is more effective as it can utilize zeros to increase the
precision of some fraction of activations. Fortunately, it is
not necessary to choose — simultaneous OL+ZR strictly
outperforms either technique alone and the curve behaves
as the sum of the accuracy gains from OL and ZR. At the
accuracy peak, OL+ZR improves Top-1 by 0.7% at 5 bits
and 2.5% at 4 bits.
[Figure 7 plots, two panels: "Accuracy vs. Clip Threshold (5-bit)" and "Accuracy vs. Clip Threshold (4-bit)". X-axis: Clip Threshold (Fraction of Max), 0.0 to 1.0. Y-axis: Top-1 Accuracy, 66% to 70% for the 5-bit panel and 56% to 70% for the 4-bit panel. Series: Baseline, OL, ZR, OL+ZR.]
Figure 7. OverQ accuracy vs. clip threshold on ResNet-18 — plot shows uniform quantization (Baseline), outlier overwrite (OL),
zero-reuse (ZR), and simultaneous OL+ZR. Clip threshold is expressed as a fraction of the maximum profiled activation value.
Table 2. OverQ ResNet-18 top-1 accuracy: the table compares
no clip, MMSE clip, and OverQ with MMSE clip thresholds. The
floating-point baseline achieves 69.7% accuracy.

Weight Bits   Act. Bits   No Clip   MMSE Clip   MMSE Clip + OverQ
8             5           69.1      69.4        69.7
8             4           64.3      67.1        68.8
8             3           36.8      56.3        64.7
Table 2 shows the accuracy under MMSE clipping. OverQ
obtains a 0.3% accuracy improvement at 5 bits and 1.7% at
4 bits. With 5-bit activations, MMSE clipping with OverQ
achieves accuracy equal to the floating-point baseline. With
4-bit activations, OverQ is within 0.9% of the floating-point
baseline. With 3-bit activations, OverQ achieves a significant
8.4% accuracy improvement compared to just clipping.
OLAccel (Park et al., 2018a) reports that for deep models
(ResNet-101 and DenseNet-121), OLAccel with 4-bit quantization
and 3% outlier threshold results in < 1% accuracy
loss. While OLAccel’s benchmarks are larger and deeper,
the accuracy results of OverQ are comparable on ResNet-18.
5.3. Hardware Resource Impact on FPGA Prototype
OverQ was implemented on a small matrix-vector multiply
accelerator, which is used to execute 3×3 conv layers using
real data extracted from TensorFlow. The accelerator was
written in C++ and synthesized to Verilog using
Xilinx Vivado HLS version 2016.3, then implemented to bitstream
for a Xilinx xc7z020 FPGA device. The baseline
design is an M × N array pipelined in HLS, where M and
N are the unroll factors for the input and output channels.
Evaluation was done on inputs, weights, and outputs taken
from a single layer in ResNet-18. OverQ had no impact
on latency. Table 3 shows resource usage of the baseline,
OverQ-Split (OL only), and OverQ-Shift (OL+ZR) designs.
The key observation is that although the percentage increase
in LUTs and FFs is considerable (LUT usage grows by
over 3× from baseline to OverQ-Shift in the 4×4 array), the LUT and FF
utilization on the device remains very low in comparison to
DSPs (which are used to implement MAC units). Based on
our data, scaling up the design will result in DSPs running
out well before the logic overhead of OverQ becomes an
issue. Commercial FPGAs have an abundance of LUTs
and FFs available, but DSPs are at a premium because
they are much larger in area. DNN accelerators are almost
always bottlenecked by DSP usage (Zhang et al., 2015; Qiu
et al., 2016; Zhang & Li, 2017). Fortunately, OverQ was
specifically designed to avoid MAC overhead and this is
reflected in the results. Another important finding is that the
addition of OverQ does not hurt timing.
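The claim that DSPs run out first can be checked with simple arithmetic on the Table 3 numbers for the 8×8 OverQ-Shift design. The linear-scaling extrapolation below is a hypothetical simplification introduced for illustration; it is not a measurement from the prototype.

```python
# Device totals and 8x8 OverQ-Shift usage, taken from Table 3.
DEVICE = {"LUT": 53_200, "FF": 106_400, "DSP": 220}
OVERQ_SHIFT_8X8 = {"LUT": 1660, "FF": 2781, "DSP": 64}


def utilization(used, device):
    """Percentage of each device resource consumed by a design."""
    return {k: 100.0 * used[k] / device[k] for k in used}


util = utilization(OVERQ_SHIFT_8X8, DEVICE)

# Hypothetical extrapolation: assume every resource grows linearly with the
# number of PEs, and ask how far the array could grow before each runs out.
headroom = {k: DEVICE[k] / OVERQ_SHIFT_8X8[k] for k in DEVICE}
bottleneck = min(headroom, key=headroom.get)
```

Under this assumption the design could grow by roughly 3.4× before exhausting DSPs, compared with more than 30× for LUTs and FFs.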
It is important to note that the accelerator is only for matrix-
matrix products, and lacks other components typically found
in a DNN accelerator such as memory system, rescaling unit,
bias unit, etc. The hardware which decides which activa-
tions to overwrite was not prototyped. Section 4 describes
how the rescaling unit can be modified to make the OverQ
decisions and how its resource overhead is expected to be
small compared to the overhead in the PE array.
6. Conclusions and Future Work
Our preliminary results demonstrate that OverQ has poten-
tial as a lightweight hardware-centric technique for improv-
ing activation quantization. However, we have only per-
formed tests on a single, relatively simple network (ResNet-
18) and our hardware prototype consisted of just the core
MAC array, leaving out other important components re-
quired for a full-fledged DNN accelerator (e.g., memory
system, rescaling and bias add). Nevertheless, the data gives
us confidence that OverQ is indeed feasible in real hardware
at fairly low hardware overhead. The most pertinent future
work is expanding the scope of the evaluation to an accelerator
which can execute an entire DNN on its own, and
collecting data on larger networks.

Table 3. OverQ resource usage in an FPGA prototype: OverQ incurs non-trivial LUT and FF overhead, but total utilization of these
resources remains very low. OverQ has no impact on DSP, BRAM, and timing.

                  4x4                                   8x8                                   Device
Design            Base    OverQ-Split   OverQ-Shift    Base    OverQ-Split   OverQ-Shift    Total
CP                4.79    4.56          4.81           5.01    4.91          4.98
LUT               116     220           414            671     923           1660           53,200
LUT Utilization   0.22%   0.41%         0.78%          1.26%   1.73%         3.12%
FF                752     752           860            2290    2619          2781           106,400
FF Utilization    0.71%   0.71%         0.81%          2.15%   2.46%         2.61%
DSP               16      16            16             64      64            64             220
BRAM              0       0             0              0       0             0              140
Acknowledgments
This work was supported in part by the Semiconductor Re-
search Corporation (SRC) and DARPA. One of the Titan Xp
GPUs used for this research was donated by the NVIDIA
Corporation.
References
Banner, R., Hubara, I., Hoffer, E., and Soudry, D. Scalable
Methods for 8-bit Training of Neural Networks. arXiv
preprint, arXiv:1805.11046, 2018.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss:
An Energy-Efficient Reconfigurable Accelerator for Deep
Convolutional Neural Networks. Journal of Solid-State
Circuits (JSSC), 52(1):127–138, 2017.
Choi, J., Wang, Z., Venkataramani, S., Chuang, P. I.-J.,
Srinivasan, V., and Gopalakrishnan, K. PACT: Parameterized
Clipping Activation for Quantized Neural Networks.
arXiv e-print, arXiv:1805.06085, May 2018.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. ImageNet: A Large-Scale Hierarchical Image
Database. Conf. on Computer Vision and Pattern Recog-
nition (CVPR), pp. 248–255, 2009.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT:
Pre-Training of Deep Bidirectional Transformers for Lan-
guage Understanding. arXiv preprint, arXiv:1810.04805,
Oct 2018.
Abadi, M., et al. TensorFlow: Large-Scale Machine Learning
on Heterogeneous Systems, 2015. URL
http://tensorflow.org/. Software available from
tensorflow.org.
He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual
Learning for Image Recognition. arXiv e-print,
arXiv:1512.03385, Dec 2015.
Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard,
A., Adam, H., and Kalenichenko, D. Quantization
and Training of Neural Networks for Efficient Integer-
Arithmetic-Only Inference. Conf. on Computer Vision
and Pattern Recognition (CVPR), pp. 2704–2713, Jun
2018.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal,
G., Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers,
A., et al. In-Datacenter Performance Analysis of a Tensor
Processing Unit. Int’l Symp. on Computer Architecture
(ISCA), pp. 1–12, 2017.
McKinstry, J. L., Esser, S. K., Appuswamy, R., Bablani, D.,
Arthur, J. V., Yildiz, I. B., and Modha, D. S. Discover-
ing Low-Precision Networks Close to Full-Precision Net-
works for Efficient Embedded Inference. arXiv preprint,
arXiv:1809.04191, Sep 2018.
Meller, E., Finkelstein, A., Almog, U., and Grobman, M.
Same, Same but Different - Recovering Neural Network
Quantization Error through Weight Factorization. Int’l
Conf. on Machine Learning (ICML), Jun 2019.
Migacz, S. 8-bit Inference with TensorRT. NVIDIA GPU
Technology Conference, May 2017.
Nagel, M., van Baalen, M., Blankevoort, T., and Welling,
M. Data-Free Quantization Through Weight Equalization
and Bias Correction. Int’l Conf. on Computer Vision
(ICCV), Oct 2019.
Park, E., Kim, D., and Yoo, S. Energy-Efficient Neural Net-
work Accelerator Based on Outlier-Aware Low-Precision
Computation. Int’l Symp. on Computer Architecture
(ISCA), Jun 2018a.
Park, E., Yoo, S., and Vajda, P. Value-aware Quantization
for Training and Inference of Neural Networks. arXiv
e-print, arXiv:1804.07802, Apr 2018b.
Park, H. and Choi, K. Cell Division: Weight Bit-Width
Reduction Technique for Convolutional Neural Network
Hardware Accelerators. Asia and South Pacific Design
Automation Conf. (ASP-DAC), pp. 286–291, Jan 2019.
Qiu, J., Wang, J., Yao, S., Guo, K., Li, B., Zhou, E., Yu,
J., Tang, T., Xu, N., Song, S., et al. Going Deeper with
Embedded FPGA Platform for Convolutional Neural Net-
work. Int’l Symp. on Field-Programmable Gate Arrays
(FPGA), pp. 26–35, Feb 2016.
Shin, S., Hwang, K., and Sung, W. Fixed-Point Performance
Analysis of Recurrent Neural Networks. Int’l Conf. on
Acoustics, Speech and Signal Processing (ICASSP), pp.
976–980, 2016.
Strubell, E., Ganesh, A., and McCallum, A. Energy and
Policy Considerations for Deep Learning in NLP. arXiv
preprint, arXiv:1906.02243, Jun 2019.
Sung, W., Shin, S., and Hwang, K. Resiliency of Deep
Neural Networks Under Quantization. arXiv preprint
arXiv:1511.06488, 2015.
Wu, S., Li, G., Chen, F., and Shi, L. Training and Inference
with Integers in Deep Neural Networks. Int’l Conf. on
Learning Representations (ICLR), May 2018.
Xu, X., Ding, Y., Hu, S. X., Niemier, M., Cong, J., Hu, Y.,
and Shi, Y. Scaling for Edge Inference of Deep Neural
Networks. Nature Electronics, 1(4):216, 2018.
Yang, X., Gao, M., Pu, J., Nayak, A., Liu, Q., Bell, S. E., Set-
ter, J. O., Cao, K., Ha, H., Kozyrakis, C., and Horowitz,
M. DNN Dataflow Choice Is Overrated. arXiv preprint,
arXiv:1809.04070, Sep 2018a.
Yang, Y., Huang, Q., Wu, B., Zhang, T., Ma, L., Gam-
bardella, G., Blott, M., Lavagno, L., Vissers, K.,
Wawrzynek, J., and Keutzer, K. Synetgy: Algorithm-
Hardware Co-Design for Convnet Accelerators on Em-
bedded FPGAs. arXiv preprint, arXiv:1811.08634, Nov
2018b.
Zhang, C., Li, P., Sun, G., Guan, Y., Xiao, B., and Cong,
J. Optimizing FPGA-based Accelerator Design for Deep
Convolutional Neural Networks. Int’l Symp. on Field-
Programmable Gate Arrays (FPGA), pp. 161–170, Feb
2015.
Zhang, J. and Li, J. Improving the Performance of OpenCL-
based FPGA Accelerator for Convolutional Neural Net-
work. Int’l Symp. on Field-Programmable Gate Arrays
(FPGA), pp. 25–34, Feb 2017.
Zhao, R., Hu, Y., Dotzel, J., De Sa, C., and Zhang, Z. Im-
proving Neural Network Quantization without Retraining
using Outlier Channel Splitting. Int’l Conf. on Machine
Learning (ICML), pp. 7543–7552, Jun 2019.
Zhou, A., Yao, A., Guo, Y., Xu, L., and Chen, Y. In-
cremental Network Quantization: Towards Lossless
CNNs with Low-Precision Weights. arXiv preprint,
arXiv:1702.03044, 2017.
