Term Revealing: Furthering Quantization at Run Time on Quantized DNNs by Kung, H. T. et al.
This preliminary version has been accepted by
Intl. Conf. for High Performance Computing,
Networking, Storage, and Analysis (SC20).
Term Revealing: Furthering Quantization at Run
Time on Quantized DNNs
H. T. Kung*
Harvard University
kung@harvard.edu
Bradley McDanel*
Harvard University
mcdanel@fas.harvard.edu
Sai Qian Zhang*
Harvard University
zhangs@g.harvard.edu
Abstract—We present a novel technique, called Term Revealing
(TR), for furthering quantization at run time for improved
performance of Deep Neural Networks (DNNs) already quantized
with conventional quantization methods. TR operates on power-of-two
terms in binary expressions of values. When computing a dot
product, TR dynamically selects a fixed number
of largest terms to use from the values of the two vectors
in the dot product. By exploiting normal-like weight and data
distributions typically present in DNNs, TR has a minimal impact
on DNN model performance (i.e., accuracy or perplexity). We
use TR to facilitate tightly synchronized processor arrays, such
as systolic arrays, for efficient parallel processing. We show an
FPGA implementation that can use a small number of control
bits to switch between conventional quantization and TR-enabled
quantization with a negligible delay. To enhance TR's efficiency
further, we propose “HESE encoding” (Hybrid Encoding for
Signed Expressions) of values, as opposed to classic binary
encoding with nonnegative power-of-two terms. We evaluate TR
with HESE encoded values on an MLP for MNIST, multiple
CNNs for ImageNet, and an LSTM for Wikitext-2, and show
significant reductions in inference computations (between 3× and 10×)
compared to conventional quantization for the same level of
model performance.
I. INTRODUCTION
Deep Neural Networks (DNNs) have achieved state-of-the-art
performance across a variety of domains, including Recurrent
Neural Networks (RNNs) and Transformers for natural lan-
guage processing and Convolutional Neural Networks (CNNs)
for computer vision. However, the high computational complexity
of DNNs makes them expensive to deploy at scale in datacenter
contexts, as a popular model (e.g., Google's Smart Compose
email autocomplete [1]) may be queried millions of times per
day, with each query requiring 10s to 100s of GFLOPs.
To address these high computational costs, significant
research effort has been spent on developing techniques that
reduce the computational complexity of pre-trained DNNs.
One of the most commonly used techniques is post-training
quantization (see, e.g., [2]), where 32-bit floating-point DNN
weights and data (activations) are converted to a fixed-point
representation (e.g., 8-bit fixed-point) to reduce the amount of
computation performed per inference sample. One benefit of
post-training quantization is that it does not require access to
the original training dataset, and can therefore be applied by a
third-party (such as a cloud service) as a step to reduce costs.
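As a minimal sketch of the uniform quantization step described above, the snippet below rounds each floating-point weight to the nearest multiple of a scale factor. The function name `quantize_uniform`, the symmetric clipping range, and the use of Python are illustrative assumptions rather than the procedure of [2]; the input values and the scale factor of 0.05 are those shown in Fig. 1.

```python
def quantize_uniform(values, scale, num_bits=8):
    """Symmetric uniform quantization: round(value / scale), clipped to the signed range.

    Assumption: one sign bit plus (num_bits - 1) magnitude bits, so 8 bits
    gives integers in [-127, 127].
    """
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


weights_fp32 = [3.21, 0.17, 1.84, 0.72, 0.18]  # floating-point values shown in Fig. 1
weights_int8 = quantize_uniform(weights_fp32, scale=0.05)
print(weights_int8)  # [64, 3, 37, 14, 4]
```

Because only the trained weights and a scale factor are needed, this step can be carried out without the training dataset.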
In this work, we define a term as a nonzero bit in a quantized
*Equal contribution ordered alphabetically.
[Figure 1 panels: a 32-bit floating-point weight matrix (3.21, 0.17, 1.84, 0.72, 0.18) is converted by 8-bit uniform quantization into an 8-bit fixed-point weight matrix with scale factor = 0.05 (64, 3, 37, 14, 4), and then by group-based run-time quantization with TR into (64, 2, 37, 12, 4). TR keeps the top k = 4 terms across a group of 3 values: $2^6$, $2^1 + 2^0$, $2^3 + 2^2 + 2^1$.]
Fig. 1: In conventional quantization, a 32-bit floating point
weight matrix (left) is converted to an 8-bit fixed-point format
via uniform quantization (middle). We propose to further
quantization via Term Revealing (TR) which is a group-based
run-time quantization method (right). By limiting the number
of power-of-two terms across a group of values, TR enables a
tighter processing bound for DNN dot product computations.
fixed-point value. For instance, we say that the 8-bit value 3
(00000011) is composed of two terms: $2^1 + 2^0$.
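To make the definition of a term concrete, the following sketch lists the exponents of the nonzero bits of a nonnegative integer. The helper name `terms` is hypothetical and is used only for illustration.

```python
def terms(value):
    """Return the exponents of the power-of-two terms of a nonnegative integer, largest first."""
    return [i for i in reversed(range(value.bit_length())) if (value >> i) & 1]


print(terms(3))   # [1, 0]     i.e., 3 = 2^1 + 2^0
print(terms(14))  # [3, 2, 1]  i.e., 14 = 2^3 + 2^2 + 2^1
```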
In this paper, we further quantize the computation of an
already quantized DNN at run time to realize additional
substantial computation savings. That is, we propose to perform
further run-time quantization on, for example, a quantized
8-bit DNN while still achieving the same level of model
performance. Note that for furthering quantization, we must
use new techniques beyond conventional quantization methods,
for otherwise, the original DNN could have been quantized to a
lower precision in the first place while maintaining acceptable
performance (e.g., a 4-bit DNN instead of an 8-bit DNN).
Specifically, we introduce a novel group-based quantization
method, which we call Term Revealing (TR). TR, shown in
Figure 1, ranks the terms in a group of values associated
with a dot-product computation to reveal a fixed number of
top terms (called a group budget) to use for the dot-product
computation. By limiting the number of terms to a group
budget k and pruning the remaining smaller terms, TR enables
a more efficient implementation of dot-product computations in
DNNs. With TR, the selected terms for a value are based on
their relative rankings against terms of other values in the
group. TR’s run-time group-based quantization, a departure
from traditional value-based quantization, allows TR to carry
out additional quantization on an already quantized DNN.
[arXiv:2007.06389v1 [cs.CV] 13 Jul 2020]
While allowing further quantization at run time, TR is able
to achieve the same level of model performance as the original
quantized DNNs for two reasons. First, TR uses group-based
term selection, which prunes only smaller terms in a group (e.g.,
$2^1$ and $2^0$ terms), leading to minimal added quantization error.
For many groups, with fewer terms than the allocated group
budget, no additional quantization is performed. Second, by
leveraging normal-like weight and data distributions typically
present in DNNs (see Section III-A), TR can use a small group
budget without introducing quantization error for many groups.
With a simple FPGA design, we can use a small number
of control bits to reconfigure hardware supporting quantized
computations under conventional quantization to hardware supporting
run-time TR quantization, and vice versa.
To simplify our introduction of TR, we use conventional bi-
nary representations where all terms are nonnegative. However,
shorter signed expressions which use both positive and negative
terms, such as Booth encodings [3], can typically allow fewer
terms in expressing a value and lead to increased computation
savings in TR-enabled quantization. To this end, we have
developed a new signed encoding called Hybrid Encoding
for Shortened Expressions (HESE), which we use in the later
sections of the paper to express DNN weights and data.
The novel contributions of the paper are:
• The concept of run-time quantization on already quantized
DNNs to realize further computation savings.
• A group-based term ranking mechanism, called term
revealing (TR), and its term MAC (tMAC) hardware
design for the implementation of our proposed further
quantization at run time.
• An FPGA system which requires minimal reconfiguration
to efficiently support both conventional quantization and
TR-enabled quantization.
• A signed power-of-two encoding called Hybrid Encoding
for Shortened Expressions (HESE). Using fewer terms
compared to previous signed representations, HESE en-
hances TR’s computation efficiency.
II. BACKGROUND AND RELATED WORK
In Section II-A, we discuss related work on pruning and
quantization techniques for performing efficient DNN inference.
Then, in Section II-B, we discuss prior work on hardware
architectures which aim to exploit bit-level sparsity. Finally,
in Section II-C we illustrate how matrix multiplication is
performed with systolic arrays.
A. Pruning and Quantization Methods
There have been significant research efforts in pruning-based
methods which exploit value-level sparsity in CNN weights, as
performing multiplication with zero operands can be viewed as
wasted computation [4]–[12]. However, these pruning methods
typically require model retraining, making them infeasible
for a third party that is hosting the model (as it requires access
to the full training dataset). Additionally, unstructured pruning
methods which achieve the best performance (e.g., [4]) are
hard to implement efficiently in special-purpose hardware,
as the remaining nonzero weights are randomly distributed.
In this paper, we propose to further reduce the amount of
computation even for nonzero values by exploiting bit-level
sparsity as opposed to conventional value-level sparsity.
Quantization [13]–[26] lowers precision of the values in
weights and data in order to reduce the associated storage, I/O
and/or computation costs. However, aggressive post-training
quantization (e.g., to 4-bit representations) introduces addi-
tional error into the computation, leading to decreased model
performance. Due to this, many low-precision quantization
approaches, such as binary neural networks [27], must be
performed during training. Our proposed TR approach is
applied on top of 8-bit quantization and does not require
additional training. Note that our approach does not reduce the
precision of the weights (i.e., weights are still 8-bit fixed-point
values after term revealing) but instead reduces the number of
nonzero terms to be used at runtime across a group of weights.
B. Hardware Architectures for Exploiting Bit-level Sparsity
There has been growing interest in exploiting bit-level
sparsity (i.e., the zero bits present in weight and data values)
as opposed to value-level sparsity discussed in the previous
section. A bit-level multiplication with a zero bit can be viewed
as wasted computation in the same manner as a value-level
multiplication with a zero value, in that both operations do
not affect the result. Based on this observation, Bit-Pragmatic
introduces an architecture that utilizes a nonzero term-based rep-
resentation to remove multiplication with zero bits in weights
while keeping data in a conventional representation [28]. Bit-
Tactical follows up this work by grouping nonzero weight and
data values to achieve more efficient scheduling of nonzero
computation [29]. Both of these approaches assume 8-bit or
16-bit fixed-point quantization (i.e., the first step in Figure 1).
However, due to the more fine-grained nature of these bit-
level architectures, efficiently scheduling bit-level operations
across multiple groups of computations becomes challenging,
as each group may have a different amount of computation
to perform. Generally, this leads to stragglers that require
significantly more bit-level operations than other groups. Both
Bit-Pragmatic and Bit-Tactical handle this straggler problem by
adding a synchronization barrier which makes all groups wait
until the straggler is finished. Due to this, in processing many
groups concurrently, they can only exploit bit-level sparsity up
to the degree of the group with most bit-level operations (i.e.,
the straggler). We find that this worst case can be a factor
of 2-3× more bit-level operations compared to the average
case. By comparison, TR provides a tighter processing bound
which enables synchronous computation across all groups. This
is done by removing smaller terms from groups with a large
number of terms (the second quantization step in Figure 1).
C. Systolic Arrays for Matrix Multiplication
The majority of computation in the forward propagation of
a DNN consists of matrix multiplications between a learned
weight matrix in each layer and input or data being propagated
[Figure 2 panels: weight matrix with tiles W1 to W4 and data matrix with tiles X1 to X4 (left); weight-stationary systolic array of systolic cells computing dot products for the highlighted W2 and X2 tiles (right).]
Fig. 2: The weight matrix W and data matrix X for a layer
in a DNN (left) are partitioned into four tiles to be processed
in a systolic array (right). The highlighted weight tile (W2) is
shown loaded into the systolic array with the data tile (X2)
entering the systolic array from below.
through the layer, as shown on the left side of Figure 2. Systolic
arrays are known to be able to efficiently implement matrix
multiplication due to their regular design, dataflow architectures
and reduced memory access [30]. The right side of Figure 2
shows a 3×3 systolic array, for computing dot products between
W2 and X2 (highlighted in the weight and data matrices). The
data in the partition (e.g., X2,1,1) are passed into the systolic
array from below in a skewed fashion in order to maintain
synchronization between cells. We use this systolic array design
as the starting point for our FPGA system in Section V.
III. TERM REVEALING
In this section, we introduce a group-based quantization
method called term revealing (TR), which is applied to
quantized DNNs at run time.
A. DNN Weight and Data Distributions
As mentioned earlier, TR leverages weight and data dis-
tributions of DNNs. DNNs are often trained with weight
decay regularization to improve model generalization [31]
and batch normalization [32] on data which both improves
the stability of convergence and improves the performance
of the learned model. A consequence is that the weights are
approximately normally distributed and the data follow a half-
normal distribution (as ReLU sets negative values to 0). Figure 3
(top row) illustrates these distributions for the weights in the 7th
convolutional layer of ResNet-18 [33] trained on ImageNet [34]
and the data input to the layer. Both the weights and data are
quantized to 8-bit fixed-point using uniform quantization (QT).
The higher frequency of small values means that most
elements are represented with only 2 or 3 power-of-two terms
as shown in Figure 3 (bottom). For instance, the value 6 is
represented with two power-of-two terms ($2^2 + 2^1$). In the figure,
79% of weight values and 84% of data are represented with 3
or fewer power-of-two terms. Note that the most significant
bit (MSB) in the 8-bit representation is used to represent the
sign of each value, thus each value has at most 7 terms.
[Figure 3 panels: value frequency histograms over weight values (−128 to 128) and data values (0 to 128) (top); frequency histograms over the number of weight terms and the number of data terms, each from 0 to 7 (bottom).]
Fig. 3: The distributions of weight and data values (top) shape
the distribution of the number of terms in a binary encoding
for both weights and data (bottom).
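[Figure 4 panels: matrix-matrix multiplication between a weight matrix and a data matrix (left); a partial dot product over a group of 16 (middle); weight terms, data terms, and term pair multiplications (right). For example, weight $12 = 2^3 + 2^2$ with data $2 = 2^1$ gives $2^3 \times 2^1$ and $2^2 \times 2^1$, and weight $6 = 2^2 + 2^1$ with data $5 = 2^2 + 2^0$ gives $2^2 \times 2^2$, $2^2 \times 2^0$, $2^1 \times 2^2$ and $2^1 \times 2^0$.]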
Fig. 4: A matrix-matrix multiplication between a weight and
data matrix (left) divided into partial dot products of length 16
(one partial dot product is shown in the middle). Each partial
dot product is computed by multiplying all pairs of terms
(right) across the 16 values in the weight and data vectors.
B. Computing Dot Products via Term Pair Multiplications
Assume that we compute dot products in matrix-matrix
multiplication between quantized weights and data by dividing
both vectors into groups of a given length (e.g., 16). This
group-based formulation is motivated by efficient hardware im-
plementations described later in Section V. Figure 4 illustrates
how partial dot products, partitioned into groups of length 16,
are computed using term pair multiplications. In the example,
the first value in the weight vector $12 = 2^3 + 2^2$ multiplied with
the first data value $2 = 2^1$ is computed using two term pair
multiplications ($2^3 \times 2^1 + 2^2 \times 2^1 = 2^{3+1} + 2^{2+1} = 2^4 + 2^3 = 24$)
as shown on the right of Figure 4. Using this paradigm, we
can analyze the number of term pair multiplications that are
required per partial dot product (e.g., with a group size of 16)
across all groups in a matrix-matrix multiplication.
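The term pair paradigm can be sketched in a few lines. The snippet below is an illustrative software model, not the hardware design of Section V: it assumes nonnegative integer magnitudes (signs are ignored), and the names `terms` and `term_pair_dot` are hypothetical. It evaluates a partial dot product one term pair at a time and counts the pairs, using values that appear in Figure 4.

```python
def terms(value):
    """Exponents of the power-of-two terms of a nonnegative integer."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


def term_pair_dot(weights, data):
    """Compute a partial dot product one term pair at a time.

    Returns (result, number_of_term_pair_multiplications).
    """
    total, pairs = 0, 0
    for w, x in zip(weights, data):
        for i in terms(w):
            for j in terms(x):
                total += 1 << (i + j)  # 2^i * 2^j = 2^(i+j): an exponent addition
                pairs += 1
    return total, pairs


print(term_pair_dot([12], [2]))                  # (24, 2)
print(term_pair_dot([12, 3, 6], [2, 0, 5]))      # (54, 6)
print(term_pair_dot([127] * 16, [127] * 16)[1])  # 784
```

The last line reproduces the worst case for a group of 16 in which every weight and data value uses all 7 terms.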
Figure 5 shows a histogram of the number of term pair
multiplications for partial dot products with groups of 16 values
in the 7th convolutional layer of ResNet-18. Interestingly, 99%
of these groups require under 110 term pair multiplications
even though the theoretical maximum, where all weight and
data values use 7 terms (i.e., every value is $127 = 2^6 + 2^5 + 2^4 + 2^3 + 2^2 + 2^1 + 2^0$),
is $16 \times 7 \times 7 = 784$. In this work, we propose
to restrict the number of term pair multiplications performed in
each partial dot product (e.g., to 110 instead of 784) in order to
[Figure 5 plot: "Term Pair Multiplications in Dot Products (Groups of 16)"; x-axis: term pair multiplications (0 to 160); y-axis: frequency (%). Annotation: 99% of partial dot products (groups of 16) require fewer than 110 term pair multiplications; the theoretical max is $16 \times 7 \times 7 = 784$.]
Fig. 5: The number of term pair multiplications required for
partial dot products with groups of size 16 in an 8-bit DNN.
achieve tightly synchronized parallel processing across systolic
cells. As we know that DNNs are robust in performance to small
amounts of error (e.g., through the initial uniform quantization
step), it is reasonable to expect that they would also tolerate an
additional quantization step such as our proposed TR-enabled
quantization that makes small modifications to enforce a tighter
processing bound. We use term pair multiplications as a proxy
for the amount of computation performed during inference,
as the hardware system described in Section V performs dot
products using this term pair multiplication approach.
C. Overview of Term Revealing
Term revealing is a group-based term ranking method that
sets a limit on the number of terms allotted to a scheduling
group. TR consists of three steps:
1) Group elements as shown in Figure 6 (left). For a
given weight matrix, we partition it into equal-size groups
which are used in dot product computations. The group
size g denotes the number of values per group and may
assume various values such as 2, 3, 4, 8, 16, etc.
2) Configure a group budget k which is used for every
group. The budget bounds the number of terms used in
dot-product computations across the values in a group.
3) Identify the top k terms in the group using a receding water
algorithm that ranks and selects the terms as shown in
Figure 6 (right). It keeps the largest k terms in a group
and prunes the remaining smaller terms below a waterline.
Note that some groups may have fewer than k terms,
meaning that no pruning occurs.
Figure 6 illustrates how TR is applied to a group of g = 3
values in a weight matrix with a term budget k = 4. The three
values are decomposed into their term representations, and
scanned row by row (viewed as a waterline), starting from the
$2^6$ term and finishing at the $2^0$ term, until the group budget is
reached. In the example, the group budget of k = 4 is reached
at the $2^3$ term for w2. The remaining low-order terms (e.g., $2^2$
and below for this group) are pruned, adding a small amount
of additional quantization error. For instance, after TR, w3 is
quantized from 81 to 80.
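The receding water procedure can be sketched in software as follows. This is a hedged illustration rather than the term comparator hardware: the function name `term_reveal` is hypothetical, values are assumed to be nonnegative 7-bit magnitudes, and ties within a bit position are broken from left to right, which is an assumption chosen because it reproduces the groups shown in Figures 1 and 6.

```python
def term_reveal(values, k, num_bits=7):
    """Keep the k largest power-of-two terms across a group of nonnegative values.

    Bit positions are scanned from high to low (the receding waterline);
    within a position, values are scanned left to right (assumed tie-break).
    """
    kept = [0] * len(values)
    budget = k
    for bit in reversed(range(num_bits)):
        for idx, v in enumerate(values):
            if budget == 0:
                return kept  # budget exhausted: everything below is pruned
            if (v >> bit) & 1:
                kept[idx] |= 1 << bit
                budget -= 1
    return kept  # group had at most k terms, so nothing more to prune


print(term_reveal([67, 10, 81], k=4))  # [64, 8, 80]  (group in Fig. 6)
print(term_reveal([64, 3, 14], k=4))   # [64, 2, 12]  (group in Fig. 1)
print(term_reveal([1, 2, 4], k=6))     # [1, 2, 4]    (fewer than k terms)
```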
Since the position of the waterline is determined by the
distribution of terms in a group, the amount of pruning induced
by TR varies across each group of values. Consider two groups
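[Figure 6 panels: a weight matrix partitioned with group size g = 3 (left); the group values 67, 10, 81, which become 64, 8, 80 after TR (middle); term rows from $2^6$ down to $2^0$ for w1, w2, w3 with a waterline separating kept terms from pruned terms, where TR keeps the k = 4 top terms (right).]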
Fig. 6: A weight matrix (left) is partitioned into groups of size
3. The elements of each group (middle) are passed into TR.
The receding water algorithm (right) based on term ranking
keeps the top k = 4 terms (red) for the values in the group;
the rest of the terms are pruned.
of weights (group a: w1, w2, w3), and (group b: w4, w5, w6).
Figure 7 illustrates the quantization error incurred when 4-bit
QT (truncating the $2^0$ and $2^1$ terms) and TR (keeping the top
k = 6 terms) are applied to both groups. For group a, we see
that TR introduces no error as the group has only 6 terms.
By comparison, 4-bit QT introduces error by pruning all of
the $2^0$ and $2^1$ terms, as conventional quantization keeps only
the 4 highest-order term positions of every value. For group b, which has
significantly more terms, TR and 4-bit QT perform a similar
amount of truncation. Group b represents a worst case for TR,
as most groups will have significantly fewer terms.
Therefore, in practice, we can use a small group budget such
as k = 6 without introducing significant quantization error. By
constraining the number of terms to k = 6 across the g = 3
values, TR is able to ensure a tighter processing bound
compared to 4-bit QT. Specifically, assuming each data value
has up to 7 terms, the maximum number of term pairs with TR
is reduced to $7 \times k = 42$, which is smaller than the 4-bit QT bound of
$7 \times 4 \times 3 = 84$ by a factor of 2×.
D. Term Pair Reduction for Term Revealing Groups
To more formally quantify the term pair reduction due to
TR, suppose for a group size of g = 3 that the group budget
is $k$ and the receding water algorithm reveals $k_1$, $k_2$ and $k_3$
terms for weight values $w_1$, $w_2$ and $w_3$, respectively,
with $k = k_1 + k_2 + k_3$. Suppose further that $x_1$, $x_2$ and $x_3$
have $r_1$, $r_2$ and $r_3$ terms, respectively. (For example, $k_1 = 2$,
$k_2 = 3$, $k_3 = 1$, and $r_1 = 2$, $r_2 = 4$, $r_3 = 3$.) Then, with
TR, the total number of term pairs to be processed for the dot
product computation between x and w is
$$r_1 k_1 + r_2 k_2 + r_3 k_3 \le 7 \times (k_1 + k_2 + k_3) = 7 \times k$$
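As a quick numerical check of this bound, the sketch below evaluates the term pair count for the example term counts given in the text and compares it with the TR bound and with the baseline of 7 terms for both weights and data. The function name `term_pairs` is hypothetical.

```python
def term_pairs(weight_terms, data_terms):
    """Number of term pair multiplications: the sum of r_i * k_i over the group."""
    return sum(r * k for k, r in zip(weight_terms, data_terms))


k_terms = [2, 3, 1]  # k1, k2, k3 revealed by TR (group budget k = 6)
r_terms = [2, 4, 3]  # r1, r2, r3 terms in x1, x2, x3

pairs = term_pairs(k_terms, r_terms)
tr_bound = 7 * sum(k_terms)        # 7 * k
baseline = 7 * 7 * len(k_terms)    # 7 * 7 * g

print(pairs, tr_bound, baseline)  # 19 42 147
```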
In reality, since most weights and data require significantly
fewer than the maximum allotted number of power-of-two
terms, for most groups, dot products will complete the
computation below this bound, as discussed earlier in relation
to Figure 5. In this sense, TR can be viewed as shifting this
upper bound from $7 \times 7 \times g$ term pairs per group in the baseline case
(7 terms for both weights and data) to $7 \times k$ term pairs per group,
where $k \ll 7 \times g$. In Section V, we utilize this significantly
[Figure 7 panels: bit matrices for group a (x1, x2, x3) and group b (x4, x5, x6) over term positions $2^5$ down to $2^0$, each shown after QT and after TR. 4-bit quantization (QT) truncates the $2^0$ and $2^1$ terms; term revealing (TR) keeps the k = 6 largest terms per group.]
Fig. 7: 4-bit uniform quantization (QT) always truncates smaller
terms (e.g., the $2^0$ and $2^1$ terms). This leads to large quantization
error for groups with many small terms as in group a. In
contrast, by keeping the top 6 terms, TR introduces significantly
less quantization error on average. Additionally, TR reduces
the number of term pair multiplications to $7 \times k = 42$, which
is smaller than the 4-bit QT bound of $7 \times 4 \times 3 = 84$ by a factor of 2×.
reduced upper bound enabled via TR to implement tightly
synchronized processor arrays for DNN inference.
E. Relationship Between Group Size and Group Budget
TR budgets k terms for a group of size g. Let k = α × g
for some α, where α is the average number of terms budgeted
for each value in the group. Recall from Figure 3 that 79%
of weight values are represented in 3 or fewer terms. This
means that as the group size g increases, the average number
of budgeted terms per value approaches the mean of the weight
term distribution. For the weight term distribution in Figure 3,
the mean is only 2.46 terms per value, even though some
values have as many as 7 terms. Practically, this means that a
larger group size allows for a smaller relative term budget k
which is close to the mean, as it becomes increasingly unlikely
that many groups have more than k terms.
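The effect of group size can be illustrated with a deliberately simplified model. The sketch below assumes, purely for illustration, that each of the 7 magnitude bits of a value is set independently with probability p = 2.46/7, so that the mean number of terms per value matches the 2.46 quoted above; real weight distributions do not have independent bits. Under this assumption the total number of terms in a group of g values is binomial, and the probability that a group exceeds its budget k = α × g can be computed exactly. The function name `prob_group_exceeds` is hypothetical.

```python
from math import comb


def prob_group_exceeds(g, alpha, p, bits=7):
    """P(total terms in a group of g values > k = alpha * g), independent-bit model."""
    n = bits * g        # total bit positions in the group
    k = int(alpha * g)  # group budget
    return sum(comb(n, t) * p**t * (1 - p) ** (n - t) for t in range(k + 1, n + 1))


p = 2.46 / 7  # assumed per-bit probability, matching a mean of 2.46 terms per value
for g in (1, 2, 4, 8, 16):
    print(g, round(prob_group_exceeds(g, alpha=3, p=p), 3))
```

With α = 3, which is above the assumed mean, the fraction of groups that require pruning in this model is much smaller for g = 16 than for g = 1, in line with the argument above.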
F. Bounding Truncation-induced Error in Dot Products
TR strives to minimize truncation-induced relative error $\sigma$.
Suppose that $2^i$ is the waterline determined by TR under a
given group budget. That is, terms smaller than $2^i$ are truncated.
Then, for a group size $g$ and $\alpha = 1.5$, kept terms have value at
least $g \times 2^i + \frac{g}{2} \times 2^{i+1}$, or $g \times 2^{i+1}$, and truncated terms have
value at most $g \times (2^{i-1} + 2^{i-2} + \cdots + 2^0)$, or $g \times (2^i - 1)$. We have
$$\sigma = \frac{\text{truncated terms}}{\text{kept terms} + \text{truncated terms}} \le \frac{\text{truncated terms}}{\text{kept terms}} \le \frac{2^i - 1}{2^{i+1}} \le \frac{1}{2}.$$
Larger $\alpha$ results in a reduced upper bound on $\sigma$.
We provide a bound on the relative error introduced by TR
in truncated dot products between weights and data. For a given
group of data $(x_1, x_2, x_3)$, the dot product over the group is
$w_1 x_1 + w_2 x_2 + w_3 x_3$, where $w_1$, $w_2$ and $w_3$ are the corresponding
weights of the filter. Each $w_i$ value may be positive, negative
or zero, while the $x_i$ data values are non-negative. For simplicity,
we assume here that all $w_i$ are positive while noting the result
also holds when they are all negative. After TR, each $x_i$ is
replaced with a truncated $x'_i$ in the dot product computation.
Let $\sigma_i$ denote the relative error of $x'_i$ induced by TR, i.e.,
$x'_i = x_i(1 - \sigma_i) = x_i - x_i \sigma_i$. Then, the dot product result $y$
with $x'_i$ can be decomposed as follows:
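$$\begin{aligned}
y &= w_1 x'_1 + w_2 x'_2 + w_3 x'_3 \\
  &= w_1 (x_1 - x_1 \sigma_1) + w_2 (x_2 - x_2 \sigma_2) + w_3 (x_3 - x_3 \sigma_3) \\
  &= w_1 x_1 + w_2 x_2 + w_3 x_3 - (w_1 x_1 \sigma_1 + w_2 x_2 \sigma_2 + w_3 x_3 \sigma_3)
\end{aligned}$$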
Therefore, the relative error of the dot product with truncated
values as an approximation to the original dot product is:
$$\frac{w_1 x_1 \sigma_1 + w_2 x_2 \sigma_2 + w_3 x_3 \sigma_3}{w_1 x_1 + w_2 x_2 + w_3 x_3}$$
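Suppose that, as described above, by TR we can assure that
$\sigma_i \le \sigma$ for $i = 1, 2, 3$. Then,
$$\frac{w_1 x_1 \sigma_1 + w_2 x_2 \sigma_2 + w_3 x_3 \sigma_3}{w_1 x_1 + w_2 x_2 + w_3 x_3} \le \frac{w_1 x_1 \sigma + w_2 x_2 \sigma + w_3 x_3 \sigma}{w_1 x_1 + w_2 x_2 + w_3 x_3} = \sigma$$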
Thus, the relative error in the computed dot product
$w_1 x'_1 + w_2 x'_2 + w_3 x'_3$ is bounded by $\sigma$.
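A small numerical check of this bound is sketched below using exact rational arithmetic. The weights are hypothetical positive values chosen for illustration, and the data values reuse the group 67, 10, 81 from Figure 6 together with its truncated counterpart 64, 8, 80; none of this is taken from the evaluation in the paper.

```python
from fractions import Fraction


def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))


w = [1, 2, 3]       # hypothetical positive weights
x = [67, 10, 81]    # original values (group from Fig. 6, reused here as data)
x_tr = [64, 8, 80]  # the same values after TR with k = 4

sigmas = [Fraction(a - b, a) for a, b in zip(x, x_tr)]   # per-value relative errors
rel_err = Fraction(dot(w, x) - dot(w, x_tr), dot(w, x))  # dot-product relative error

print(rel_err, max(sigmas))  # 1/33 1/5
```

Here the relative error of the dot product (1/33) is well below the largest per-value relative error (1/5), as the bound requires.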
IV. HYBRID ENCODING FOR SHORTENED EXPRESSIONS
In this section, we present Hybrid Encoding for Shortened
Expressions (HESE), a signed power-of-two encoding which
reduces the number of terms required to represent 8-bit fixed-
point values for DNN weights and data. HESE complements
TR by reducing the number of terms used before TR is applied.
A. Signed power-of-two Representations
Booth radix-4 encoding [3] converts a conventional binary
representation with only positive power-of-two terms
(e.g., $30 = 2^4 + 2^3 + 2^2 + 2^1$) into a representation with both
positive and negative power-of-two terms (e.g., $30 = 2^5 - 2^1$).
Note that while the underlying value is the same, the signed
representation of Booth uses only two terms as opposed to four
in the positive-only binary case. Booth reduces the number of
terms by encoding strings of consecutive 1s, corresponding to
positive power-of-two terms, such as (11110) of 30, into a pair
of positive and negative terms (+1 0 0 0 −1 0).
Booth radix-4 bounds the number of power-of-two terms in
an $n$-bit value to $\frac{n}{2} + 1$ [3]. This is utilized in the design of
efficient Booth multipliers to provide a smaller bound on the
amount of computation required for any pair of n-bit values.
This bound is necessary for synchronization purposes across
multiple processing elements (e.g., systolic arrays). In this
work, we are interested in representations with fewer terms
as TR enforces a tighter computational bound by truncating
smaller terms in a group.
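The way a string of 1s collapses into one positive and one negative term can be sketched with the simpler radix-2 Booth digit rule, in which digit i equals the bit below position i minus the bit at position i. This is an illustrative assumption: the text above cites the radix-4 variant, and for the value 30 both yield the same expression. The function name `booth_radix2` is hypothetical, and the input is assumed to satisfy 0 ≤ value < 2^(num_bits − 1).

```python
def booth_radix2(value, num_bits=8):
    """Radix-2 Booth recoding: digit_i = b_(i-1) - b_i, with b_(-1) = 0.

    Returns (sign, exponent) terms, largest exponent first.
    Assumes 0 <= value < 2 ** (num_bits - 1).
    """
    out = []
    prev = 0  # b_(i-1); a 0 is padded below the least significant bit
    for i in range(num_bits):
        bit = (value >> i) & 1
        digit = prev - bit
        if digit:
            out.append((digit, i))
        prev = bit
    return out[::-1]


print(booth_radix2(30))  # [(1, 5), (-1, 1)]  i.e., 30 = +2^5 - 2^1
```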
B. Overview of HESE
HESE is a hybrid encoding method that combines Booth,
which efficiently handles strings of 1s, with additional rules
for reducing the number of terms required for isolated 1s
and 0s. Figure 8a shows how HESE is used to encode 8-bit
binary values into a signed power-of-two expression with
fewer terms. For example, Booth translates 95 (01011111) to
$+2^7 - 2^6 + 2^5 - 2^0$. In comparison, as shown in Figure 8b, HESE
will translate the value to a shorter expression $+2^6 + 2^5 - 2^0$.
[Figure 8 panels: (a) HESE encoding table (bit pattern → output): 010 → $+2^i$, skip 1; 011 → $+2^{i+1}$; 11011 → $-2^{i-1}$, skip 2; 110 → $-2^i$; else → 0. (b) HESE encoding example: the input value 95 (01011111) with a 0 pad is encoded as $+2^6 + 2^5 - 2^0$. (c) Encoding comparison: cumulative % of values (y-axis) against number of terms (x-axis) for HESE, Radix-4 and Binary, each over data and uniform (unif) distributions.]
Fig. 8: (a) HESE converts binary encodings into shorter signed
encodings such as the example in (b). (c) HESE requires fewer
terms than both binary and radix-4 encodings for 8-bit values
over DNN data (data) and a uniform distribution (unif).
The rules for the generated output for this example can be
explained by combining the strategy of Booth radix-2 encoding
for strings of 1s with that of the standard binary representation
for isolated 1s which are surrounded by zeros. For instance, in
the example shown in Figure 8b, 95 (01011111) has a single
isolated 1 in the 7th bit position, and a string of 1s from the
5th to 1st bit positions. This translates to $+2^6$ (for the isolated
1) $+2^5 - 2^0$ (for the string of 1s). By using the third rule in
Figure 8a for an isolated 0, we can save an additional term for
values such as 55 (0110111) by translating into $+2^6 - 2^3 - 2^0$.
Because this rule is rarely needed for 7-bit binary values, for
implementation simplicity, we omit it in the design and analysis
reported in this paper. That is, we only use the other four rules.
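The combination of the two strategies can be sketched as a run-based encoder. This is a simplified illustration of the behaviour described above, not a transcription of the encoding table or of the hardware encoder: an isolated 1 is kept as a positive term, a string of two or more 1s is replaced by one positive and one negative term, and the isolated-0 rule is omitted. The function name `hybrid_encode` is hypothetical.

```python
def hybrid_encode(value, num_bits=8):
    """Encode a nonnegative integer as signed power-of-two terms.

    An isolated 1 at bit i becomes +2^i; a run of two or more 1s spanning
    bits lo..hi becomes +2^(hi+1) - 2^lo. The isolated-0 rule is omitted.
    Returns (sign, exponent) pairs, largest exponent first.
    """
    out = []
    i = 0
    while i < num_bits:
        if not (value >> i) & 1:
            i += 1
            continue
        lo = i
        while i < num_bits and (value >> i) & 1:
            i += 1  # advance to the end of the run of 1s
        hi = i - 1
        if hi == lo:
            out.append((1, lo))                  # isolated 1: plain binary term
        else:
            out.extend([(-1, lo), (1, hi + 1)])  # run of 1s: Booth-style pair
    return out[::-1]


print(hybrid_encode(95))  # [(1, 6), (1, 5), (-1, 0)]  i.e., +2^6 + 2^5 - 2^0
```

Under these rules a value never needs more terms than its binary representation, since an isolated 1 costs one term and a run of two or more 1s costs two.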
C. Reducing Number of Terms per Encoding
HESE encodings never use more terms than binary or
Booth radix-4 encodings. Figure 8c shows the number of
terms required for these encodings across two distributions of
values: data values from ResNet-18 and values drawn from
a uniform distribution over the same range as the data. The
x-axis is the number of terms required to represent a value
and the y-axis is the cumulative percentage of values that are
represented within a given number of terms. HESE outperforms
both Booth and binary across both distributions of values.
As expected, Booth leads to more compact representations
than binary for values drawn from the uniform distribution.
However, most of the reduction in terms comes from larger
values in the 8-bit range (with many 1s), which occur much
less frequently for the data, as depicted in Figure 3 (bottom).
Therefore, radix-4 performs no better than binary for the
distribution of data values we are interested in. By comparison,
when applying HESE on data, 99% of values are represented
in 3 or fewer terms. Practically, this means we can use 3
power-of-two terms for both weights and data.
V. HARDWARE DESIGN FOR EFFICIENT TERM REVEALING
In this section, we present our hardware design for TR-
based quantization. Figure 9 provides an overview of the TR
[Figure 9 diagram: systolic array with term revealing, comprising a weight buffer, a data buffer, an array of tMAC cells, binary stream converters, ReLU blocks, HESE encoders and a term comparator.]
Fig. 9: The term revealing system design.
system design, consisting of the following components: (1)
weight and data buffers which store DNN layer weights and
input/intermediate data, (2) a systolic array which performs
dot products between weights and data using term MACs
(tMAC) described in Section V-B, (3) a binary stream converter
to convert systolic array output into binary representation
(Section V-C), (4) a ReLU block (Section V-C), (5) a HESE
encoder (Section V-D) to convert the binary representations
to shortened signed expressions, and (6) a term comparator
(Section V-E) which applies TR by selecting the top k terms
in a group. In Section V-A, we first give some high-level
reasoning on how tMAC can save computation.
A. High-level Comparison Between Bit-parallel MAC (pMAC)
and Term MAC (tMAC)
To help understand the inherent advantage of TR, we
provide a high-level argument on how our proposed term MAC
(tMAC) saves a significant amount of work over a conventional
parallel MAC (pMAC). Here, we define work as the amount
of computation, including both arithmetic and bookkeeping
operations, which are performed per group. The work incurred
by a method largely determines the energy, area, and latency
of its implementation.
To be concrete, we study a 1 × 3 1D systolic array of 3
cells, as depicted in Figure 10a, for the processing of groups
of 3 data values (x1, x2, x3) in computing their dot products
with weights (w1, w2, w3) pre-stored in the systolic array. To
provide a baseline for comparison, we consider a conventional
implementation, where each cell is a pMAC performing an 8-
bit bit-parallel multiplication, w×x, and a 32-bit accumulation
adding an intermediate y to the computed w×x. In comparison,
Figure 10b depicts a tMAC-based implementation which
significantly reduces the work by only processing available
term pairs. The number of terms is relatively small due to the high
bit-level sparsity generally present in CNN weights and data.
This comparison result applies to a general 2D systolic array,
which is a stack of 1D systolic arrays. In Section VII-A, we
show how this analysis translates to realized performance
on an FPGA with a group size of g = 8.
For this illustrative analysis, we assume that tMAC uses
a TR group of size g = 3 and budget k = 6 for weight
values, and s = 2 leading terms for data values under HESE
(Section IV-B). Thus, for weights, each value in a group uses
Fig. 10: (a) A 1 × 3 systolic array where each of the three
systolic cells is a conventional bit-parallel MAC (pMAC) which
performs an 8-bit multiplication between weight (w) and data
(x) values and a 32-bit accumulation each systolic array cycle.
(b) The proposed term MAC (tMAC) processes all term-pair
multiplications (e.g., $2^2 \cdot 2^1$), for the same systolic array cycle,
across a group of weight and data values (group size g = 3
here) in a bit-serial fashion. For a group budget of k and s-term
data, the number of term-pair multiplications is bounded by
k · s. Here, with k = 6 and s = 2, it is 8 (< 6 · 2 = 12).
on average α = 2 terms. As we show in Table III, under similar
settings, TR incurs only a minimal decrease in classification
accuracy (e.g., less than 0.15%) when dropping lower-order
terms exceeding the group budget k across multiple CNNs.
Our analysis on work proceeds as follows. A conventional
pMAC implementation of a single systolic cell incurs 7 8-bit
additions for the multiplication w×x and 1 32-bit accumulation
operation for y+w×x. Therefore, the pMAC implementation
of a 1 × 3 1D systolic array with three cells requires 21
8-bit additions and 3 32-bit accumulation operations. In
contrast, a tMAC implementation incurs significantly less work.
Specifically, it uses at most 12 3-bit additions on exponents
of power-of-two terms (weight and data exponents are both
less than 8) for term-pair multiplications. (Recall that we
assume k = 6 and s = 2 terms for data values, as depicted in
Figure 10b). Updating the accumulating coefficient vector
(discussed in detail in the next section) requires bookkeeping
operations for bit alignment, etc., with work we assume is
no larger than the equivalent of 12 3-bit additions. Thus, tMAC
substantially reduces work compared to pMAC, that is, 24 3-bit
additions vs. 21 8-bit additions plus 3 32-bit accumulations.
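The counts in this comparison can be reproduced with a few lines of arithmetic. The Python sketch below is illustrative only. It assumes the example values shown in Figure 10 (5, 6, 2 and 1, 10, 3; which triple is the weights and which is the data is an assumption that does not change the count), and it uses a hypothetical cost proxy, additions weighted by operand bit width, which is not a metric defined in this paper.

```python
def count_terms(value):
    """Count the nonzero power-of-two terms of a nonnegative integer."""
    return bin(value).count("1")

# Example values assumed from Figure 10 (weight/data assignment is an assumption).
weights = [5, 6, 2]   # 2^2 + 2^0, 2^2 + 2^1, 2^1
data = [1, 10, 3]     # 2^0, 2^3 + 2^1, 2^1 + 2^0

k, s = 6, 2  # group budget for weights, leading terms per data value

# Each weight term is paired with each data term of the matching value.
term_pairs = sum(count_terms(w) * count_terms(x) for w, x in zip(weights, data))
bound = k * s

def weighted_work(num_additions, bit_width):
    """Hypothetical proxy: number of additions times operand width in bits."""
    return num_additions * bit_width

# pMAC: 21 8-bit additions plus 3 32-bit accumulations for three cells.
pmac_work = weighted_work(21, 8) + weighted_work(3, 32)
# tMAC: 12 3-bit exponent additions plus an assumed equal bookkeeping cost.
tmac_work = weighted_work(12, 3) + weighted_work(12, 3)

print(term_pairs, bound)     # 8 12
print(pmac_work, tmac_work)  # 264 72
```

Under this proxy the tMAC performs roughly a quarter of the bit-level work of the three pMACs, which is in line with the qualitative argument above; the exact ratio depends on how the bookkeeping operations are costed.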
B. Term MAC (tMAC) Design
The term MAC (tMAC) performs dot products between a
data and weight vector of group size g by multiplying all term
pairs. Figure 11 illustrates how these term pair multiplications
in tMAC are performed for a group of size g = 4 and a group
budget k = 8. In this example, TR ensures that there are
8 or fewer terms across all weight values in the group. For
illustration simplicity, assume all data values can be represented
with a single term (in our implementation there are as many
as 3 terms per data value). Under these assumptions, 8 term
pair multiplications are performed and the results are added to
a coefficient vector depicted in the upper right of Figure 11.
Fig. 11: Term pair multiplication for a dot product across a
group of 4 weight and data values over 8 cycles (t0 to t7).
In the example, the weight values are $w_1 = 2^3 - 2^0$,
$w_2 = 2^2 - 2^0$, $w_3 = 2^2$, and $w_4 = 2^3 + 2^2 - 2^0$, and the data
values are $x_1 = 2^2$, $x_2 = 2^1$, $x_3 = 2^2$, and $x_4 = 2^2$.
The coefficient vector stores the current partial result of
the dot product as coefficients for each power of two. In this
example, the coefficients are set to (1, 3, −1, 0, 4, 1), which
represents a value of
$1 \times 2^5 + 3 \times 2^4 - 1 \times 2^3 + 0 \times 2^2 + 4 \times 2^1 + 1 \times 2^0 = 81$.
For the first term pair in the figure, $(-2^0, +2^2)$ in w1x1, the
coefficient for $2^2$ is decremented by 1
as the signs of the terms differ. Once all exponent additions are
completed for a dot product, the coefficient vector is reduced
to a single value. As the largest term is $2^7$, assuming 8-bit
uniform quantization, the largest term pair is $2^7 \times 2^7 = 2^{14}$.
Therefore, the coefficient vector has a length of 15 in order to
store all possible term pair results from $2^0$ to $2^{14}$. To ensure
overflow is not possible for dot products of length as large as
4,096, each element in the coefficient vector is 12 bits.
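To make the coefficient-vector bookkeeping concrete, the following Python sketch emulates term pair multiplication in software. It is a minimal illustration, not the hardware design: the names `tmac_dot` and `reduce_coefficients` are hypothetical, each term is represented as an assumed `(sign, exponent)` tuple, and the example operands are the values shown in Figure 11.

```python
def tmac_dot(weight_terms, data_terms, length=15):
    """Accumulate term pairs into a coefficient vector indexed by exponent."""
    coeffs = [0] * length
    for w_terms, x_terms in zip(weight_terms, data_terms):
        for w_sign, w_exp in w_terms:
            for x_sign, x_exp in x_terms:
                # A term pair multiplication is an exponent addition; the
                # matching coefficient moves by +1 or -1 depending on the signs.
                coeffs[w_exp + x_exp] += w_sign * x_sign
    return coeffs

def reduce_coefficients(coeffs):
    """Reduce a coefficient vector (index = exponent) to a single integer."""
    return sum(c * (1 << e) for e, c in enumerate(coeffs))

# Weights 2^3 - 2^0, 2^2 - 2^0, 2^2, 2^3 + 2^2 - 2^0 (8 terms in total).
weights = [[(+1, 3), (-1, 0)], [(+1, 2), (-1, 0)], [(+1, 2)],
           [(+1, 3), (+1, 2), (-1, 0)]]
# Data 2^2, 2^1, 2^2, 2^2 (one term per value).
data = [[(+1, 2)], [(+1, 1)], [(+1, 2)], [(+1, 2)]]

coeffs = tmac_dot(weights, data)
result = reduce_coefficients(coeffs)
print(coeffs[:6], result)  # [0, -1, -2, 1, 2, 2] 94

# The coefficients (1, 3, -1, 0, 4, 1) from the text, listed here from 2^0 up.
text_example = reduce_coefficients([1, 4, 0, -1, 3, 1])
print(text_example)  # 81
```

The result 94 matches the ordinary dot product 7·4 + 3·2 + 4·4 + 11·4, and the loop body runs exactly 8 times, one per term pair.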
The hardware design of tMAC is shown in Figure 12a. The
exponents for term pairs are stored in data and weight exponent
arrays, with the sign of each term stored in the parallel arrays
with one bit per term. For instance, the term −22 would store
a 2 in the exponent array and a minus (−) in the sign array.
The yellow, red, blue, and green colors denote the term pair
boundaries for each data × weight multiplication. The exponent
duplicator takes in data exponents and duplicates them based
on the number of weight exponents in each value. Each cycle,
a pair of exponents from these two arrays is passed into
the adder, which computes the sum of the exponents, sets the
sign, and sends the result to a coefficient accumulator (CA)
(Figure 12b) within one cycle. Therefore, processing a group
with 8 term pairs takes 8 cycles in total.
The CAs perform bit-serial addition between the coefficient
vector and the output of the exponent adder in the tMAC. Due
to the bit-serial design, the number of CAs must match the
size of the data and weight register arrays (8 in this example)
in order to maintain synchronization across the cells of the
systolic array. At each cycle, one of the eight CAs takes the
sum of two exponents from the adder, and adds/subtracts 1
to/from the corresponding coefficient. In our implementation,
each tMAC can choose to reuse the current coefficient vector
or take the new coefficient vector from its neighboring cell via
the selection signal sel_acc, as depicted in Figure 12a.
Fig. 12: (a) The term MAC (tMAC) performs term pair
multiplications between data and weight terms for a group
of values. (b) A coefficient accumulator (CA) takes the adder
result and adds or subtracts 1 from the corresponding coefficient.
C. Binary Stream Converter and ReLU Block
The binary stream converter takes coefficient vectors, output
from the systolic array, and transforms them into a binary
format by multiplying each element of the coefficient vector
with the corresponding power-of-two term then summing over
the partial results. The outputs of the binary stream converter
are sent to the ReLU block in a bit-serial fashion. Using a two’s
complement representation for the outputs, the sign can be
determined from the most significant bit (MSB) of the output
streams. The ReLU block buffers all the lower bits until the
MSB arrives. Then, it outputs zero if the MSB
indicates that the value is negative; otherwise it outputs the
original bit stream.
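The conversion and ReLU steps described above can be sketched in Python as follows. This is a software illustration under stated assumptions rather than the hardware implementation: the function names are hypothetical, the output width of 32 bits is assumed to match the accumulation width used elsewhere in the paper, and the stream is modeled as a list of bits with the least significant bit first.

```python
def to_bit_stream(coeffs, width=32):
    """Reduce a coefficient vector and emit two's complement bits, LSB first."""
    value = sum(c * (1 << e) for e, c in enumerate(coeffs))
    twos = value & ((1 << width) - 1)  # two's complement wrap for negatives
    return [(twos >> i) & 1 for i in range(width)]

def relu_stream(bits):
    """Buffer the lower bits until the MSB arrives, then apply ReLU."""
    buffered = list(bits)
    msb = buffered[-1]  # the last bit to arrive carries the sign
    if msb == 1:
        return [0] * len(buffered)  # negative value: output zero
    return buffered                 # nonnegative value: pass through

def from_bit_stream(bits):
    """Interpret an LSB-first stream of a nonnegative value as an integer."""
    return sum(b << i for i, b in enumerate(bits))

positive = relu_stream(to_bit_stream([1, 4, 0, -1, 3, 1]))  # value 81
negative = relu_stream(to_bit_stream([0, -3]))              # value -6
print(from_bit_stream(positive), from_bit_stream(negative))  # 81 0
```

Because the sign is only known once the final bit arrives, the buffer must hold the full output width, which is the reason the ReLU block cannot forward bits as they are produced.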
D. HESE Encoder
The HESE encoder produces two bit streams, which represent
the magnitude and sign of each power-of-two term, respectively.
For example, for a bit-serial input of 31 = 00011111, the
HESE encoder will produce two output streams: 00100001
(magnitude) and 00000001 (sign), to indicate $31 = 2^5 - 2^0$.
The HESE encoder is implemented with a finite state machine.
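As a rough software illustration of the two output streams, the Python sketch below encodes a value into signed power-of-two terms. Note the substitution: it uses the standard non-adjacent form (NAF) recoding as a stand-in, not the HESE algorithm of Section IV-B, and the function names are hypothetical. For the input 31 it happens to reproduce the magnitude and sign streams given above.

```python
def signed_digits_naf(value):
    """Recode a nonnegative integer into digits in {-1, 0, 1}, LSB first.

    This is the standard non-adjacent form, used here only as a stand-in
    for HESE.
    """
    digits = []
    while value > 0:
        if value & 1:
            digit = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits

def to_streams(digits, width=8):
    """Build MSB-first magnitude and sign bit strings from LSB-first digits."""
    width = max(width, len(digits))
    msb_first = (digits + [0] * (width - len(digits)))[::-1]
    magnitude = "".join("1" if d != 0 else "0" for d in msb_first)
    sign = "".join("1" if d < 0 else "0" for d in msb_first)
    return magnitude, sign

print(to_streams(signed_digits_naf(31)))  # ('00100001', '00000001')
```

The binary form of 31 has five nonzero terms, while the signed form has two, which is the kind of reduction that makes signed expressions attractive before term revealing is applied.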
E. Term Comparator
The term comparator in Figure 13a selects the top k terms
from the outputs of every g consecutive HESE encoders, where
k and g are the group budget and group size, respectively.
Figure 13b shows the operation of the term comparator on the
outputs of four HESE encoders. The HESE encoder outputs
are divided into two groups, where each group has a group
size g = 2 and a group budget of k = 3. The inputs enter
the term comparator in reverse order such that their most
significant bits (MSBs) enter the term comparator first. Each
cycle, the term comparator counts the total number of nonzero
bits encountered so far, and truncates the remaining low-order
terms once the group budget k is reached for a group.
The term comparator contains multiple accumulate and
compare (A&C) blocks which are arranged into a tree structure.
Each A&C block takes a single input bit stream and counts
the total number of nonzero bits in this stream. Figure 14
Fig. 13: (a) The design of the term comparator which implements
term revealing. The term comparator takes the group
size g and group budget k as inputs, counts the total number
of terms within each group, and sets the corresponding terms
to zero once the group budget is reached. (b) An example of
the term comparator operating on two groups. At T=2, the group
budget is reached for group 1 and all the remaining terms are
pruned. At T=3, the group budget is also reached for group 2.
Fig. 14: Configurations of the term comparator under different
group sizes (group size = 1, 2, and 4).
shows how the A&C blocks can be reconfigured for different
group sizes g. For g = 1, the A&C blocks on the first level
of the tree will compare the number of nonzero bits in their
input stream against the group budget k, and truncate each
stream accordingly. If the group size is larger than 1 (e.g.,
2), each A&C block in the first level of the tree will forward
its input stream together with the nonzero bit count to its
parent A&C block. The parent A&C block then operates on
these two streams in a similar fashion to its children. The tree
architecture allows for minimal changes to the term comparator
under different group sizes, which leads to a low reconfiguration
overhead and maximum level of hardware reuse.
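The behavior of the term comparator on one group can be sketched in Python as below. This is a behavioral model under stated assumptions, not the A&C tree itself: each value is given as a list of signed digits with the MSB first, one bit position is consumed per cycle, and when several values in the group have a nonzero digit in the same cycle, the sketch keeps them in value order until the budget is exhausted (this tie-breaking order is an assumption). The names `term_comparator` and `value_of` are hypothetical.

```python
def term_comparator(group, k):
    """Keep the first k nonzero digits of a group, scanning MSB first."""
    width = len(group[0])
    kept = [[0] * width for _ in group]
    count = 0  # nonzero digits seen so far across the whole group
    for pos in range(width):              # one bit position per cycle
        for i, digits in enumerate(group):
            if digits[pos] != 0 and count < k:
                kept[i][pos] = digits[pos]
                count += 1
            # Digits arriving after the budget is reached stay zero.
    return kept

def value_of(digits):
    """Value of an MSB-first signed-digit list."""
    width = len(digits)
    return sum(d * (1 << (width - 1 - pos)) for pos, d in enumerate(digits))

# A group of size g = 2 with budget k = 3.
a = [0, 0, 1, 0, 0, 0, 0, -1]   # 2^5 - 2^0 = 31
b = [0, 0, 0, 1, 0, 1, 0, -1]   # 2^4 + 2^2 - 2^0 = 19
kept = term_comparator([a, b], k=3)
print([value_of(d) for d in kept])  # [32, 20]
```

In this example the budget is reached after the terms 2^5, 2^4, and 2^2, so both low-order −2^0 terms are truncated and the values 31 and 19 become 32 and 20.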
F. Memory Subsystem
Our memory subsystem consists of a data and weight buffer.
The data buffer holds the term exponents and signs for both
the input and result data of the current layer, and the weight
buffer holds the term exponents and signs of the weights for
each group. For the weight buffer, we use double buffering
to prefetch the next weight tile from the off-chip DRAM so
that the computation of the systolic array can overlap with the
data transfer from the off-chip DRAM to the weight buffer. Note
that TR does not reduce the storage complexity of the model,
as each weight is stored in an 8-bit fixed-point format.
TABLE I: Control registers for supporting QT and TR.
Register (width)       | Uniform quantization (QT)                                | Term revealing (TR)
HESE_ENCODER_ON (1 bit) | HESE encoder is turned off by setting this bit to 0      | HESE encoder is turned on by setting this bit to 1
COMPARATOR_ON (1 bit)  | term comparator is turned off by setting this bit to 0   | term comparator is turned on by setting this bit to 1
QUANT_BITWIDTH (4 bit) | quantization bitwidth used for QT                        | quantization bitwidth used for TR
DATA_TERMS (4 bit)     | same as the quantization bitwidth for QT                 | maximum number of power-of-two terms in data for TR
GROUP_SIZE (3 bit)     | group size is set to 1 for QT                            | group size is between 2 and 8 for TR
GROUP_BUDGET (5 bit)   | group budget is the same as quantization bitwidth for QT | group budget can be up to 8 × 3 = 24 for TR
G. FPGA Reconfiguration for QT and TR
Our TR system can be easily reconfigured for
different group sizes g and group budgets k, in order to adapt to
dynamic requirements on group size and group budget during
inference with a negligible delay. In addition, our system
also supports conventional quantization (QT) by performing
power-of-two operations with binary representations. Since QT
does not require TR or HESE encoding, the term comparator
and HESE encoder can be turned off by using clock gating
to reduce power consumption. Table I summarizes all of the
control registers which need to be modified when switching
between TR and QT. The switching process only takes several
clock cycles (i.e., within 100ns for our FPGA implementation).
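The register settings in Table I can be modeled in software as a small configuration record. The Python sketch below is hypothetical: the class, field, and function names are invented for illustration, the default of 3 data terms is an assumption taken from the settings used later in the evaluation, and the sketch does not model how the values are encoded into the register bit fields.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ControlRegisters:
    hese_encoder_on: int
    comparator_on: int
    quant_bitwidth: int
    data_terms: int
    group_size: int
    group_budget: int

def qt_config(bitwidth=8):
    """QT: encoder and comparator off, group size 1, budgets equal the bitwidth."""
    return ControlRegisters(0, 0, bitwidth, bitwidth, 1, bitwidth)

def tr_config(group_size, group_budget, data_terms=3, bitwidth=8):
    """TR: encoder and comparator on, with a group size and a group budget."""
    if not 2 <= group_size <= 8:
        raise ValueError("group size for TR is between 2 and 8")
    if not 1 <= group_budget <= 8 * 3:
        raise ValueError("group budget for TR can be up to 8 x 3 = 24")
    return ControlRegisters(1, 1, bitwidth, data_terms, group_size, group_budget)

def changed_fields(old, new):
    """Names of the registers that must be rewritten when switching modes."""
    return [f.name for f in fields(old)
            if getattr(old, f.name) != getattr(new, f.name)]

qt = qt_config()
tr = tr_config(group_size=8, group_budget=12)
print(changed_fields(qt, tr))
```

Switching from 8-bit QT to this TR setting changes five of the six registers, which is consistent with a reconfiguration that only touches a handful of control bits.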
VI. TERM REVEALING EVALUATION
In this section, we evaluate the performance of TR when applied
to an MLP on MNIST [35], a broad range of CNNs (VGG-16 [36],
ResNet-18 [33], MobileNet-v2 [37], and EfficientNet-b0 [38])
on ImageNet [34], and an LSTM [39] on WikiText-2 [40].
In Section VI-A, we compare TR against conventional
uniform quantization (QT) on the performance (i.e., accuracy or
perplexity) of these DNNs. Then, in Section VI-B, we provide
analysis on how the α (average number of terms) and g (group
size) parameters impact the classification accuracy. Next, in
Section VI-C, we analyze the individual contribution of HESE
and TR on model performance. Finally, in Section VI-D, we
show that the quantization error introduced by TR is substantially
less than when using a more aggressive QT setting (e.g., 6-bit
uniform quantization).
To perform this analysis, we have implemented a CUDA
kernel for TR which increases the inference runtime
of a pre-trained model running on an NVIDIA 1080 Ti by
less than 5%. This means that the validation accuracy for a
pre-trained CNN for ImageNet can still be obtained within several
minutes. Using pre-trained models has the advantage of making
parameter search (e.g., for group size g and term budget k)
simple compared to methods such as weight pruning [4], which
require model retraining that takes hours or days for each
setting. Before applying TR, each model is quantized from
32-bit floating-point to 8-bit fixed-point using a layerwise
procedure described in [41].
A. Comparing Term Revealing to Uniform Quantization
Motivated by the design in Section V, we are interested
in minimizing the number of term pair multiplications per
sample, as this directly translates to the processing latency of a
sample. For the uniform quantization (QT) approach with 8-bit
fixed-point weights and data, each multiplication translates
to 7 × 7 = 49 term pair multiplications. By comparison, for
term revealing (TR), the number of term pair multiplications
is instead bounded by a term-pair budget that
is shared across a group of values. We show that TR gives a
significant reduction (e.g., 3-10×) over QT while maintaining
nearly identical performance (e.g., within 0.1% accuracy).
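The two bounds can be compared directly. The Python sketch below is a back-of-the-envelope calculation with hypothetical function names; it uses the per-group bounds stated elsewhere in the paper (7 × 7 × g for 8-bit QT and s × k for TR) and an example setting of g = 8, k = 12, s = 3. The resulting ratio is a ratio of worst-case bounds, not a measured reduction, so it should not be read as one of the reported results.

```python
def qt_term_pairs(group_size, weight_terms=7, data_terms=7):
    """Worst-case term pairs for a group under 8-bit QT (7 x 7 per multiply)."""
    return weight_terms * data_terms * group_size

def tr_term_pairs(group_budget, data_terms):
    """Worst-case term pairs for a group under TR (s x k per group)."""
    return data_terms * group_budget

g, k, s = 8, 12, 3
qt_bound = qt_term_pairs(g)
tr_bound = tr_term_pairs(k, s)
print(qt_bound, tr_bound)            # 392 36
print(round(qt_bound / tr_bound, 1))  # 10.9
```

Measured reductions are smaller than this bound ratio because QT values rarely contain the maximum number of nonzero terms.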
1) MLP on MNIST: We train an MLP with one hidden layer
with 512 neurons for MNIST using the parameter settings
given in the PyTorch examples for MNIST¹. Figure 15 (left)
shows the performance of QT and TR applied to the pre-trained
MLP. TR achieves a 5× reduction in the number of term
pair multiplications over QT while achieving a classification
accuracy of 98.4% (compared to the 98.5% baseline).
2) CNNs on ImageNet: We use pre-trained models provided
by the PyTorch torchvision package² for VGG-16, ResNet-18,
and MobileNet-v2 and a PyTorch implementation³ of
EfficientNet with pre-trained models. Figure 15 (center) shows
the performance of TR and QT for the 4 CNNs. TR achieves a
14× reduction in term pair multiplications over QT for VGG-16,
which is known to be significantly overprovisioned (e.g.,
amenable to quantization and pruning). Even for more recent
models, which have significantly fewer parameters, such as
MobileNet-v2 and EfficientNet-b0, TR is still able to achieve a
4× and 6× reduction in term pair multiplications, respectively,
losing less than 0.1% classification accuracy compared to the
8-bit QT settings. Generally, we see that more aggressive TR
settings (e.g., with a reduced group budget) appear to degrade
the accuracy more gracefully than more aggressive QT settings
(e.g., with reduced bit-width for weights).
3) LSTM on WikiText-2: We train a 1-layer LSTM with
650 hidden units (i.e., neurons), a word embedding of length
650, and a dropout rate of 0.5, following the PyTorch word
language model example⁴. This baseline model achieves a
perplexity of 86.85. Figure 15 (right) shows how the perplexity
of the pre-trained model is impacted by QT and TR. Again,
we find that TR is able to reduce the number of term pair
multiplications by a significant factor of 3×, while achieving
the same perplexity.
B. Improved Term Allocation with Larger Group Size
Figure 16 shows the classification accuracy for ResNet-18
as α is varied for different group sizes. As the group size
increases, the variance in number of terms across values in a
group shrinks, meaning that a larger group budget k at a fixed
α ratio is strictly better than a smaller k for the same α. As
observed, the classification accuracy for a larger group size
strictly outperforms all settings with smaller group sizes. For
instance, a group size of 8 with α of 1 achieves a classification
accuracy of 67.72% which is 5.21% better than a group size of
¹ https://github.com/pytorch/examples/tree/master/mnist
² https://github.com/pytorch/vision/tree/master/torchvision/models
³ https://github.com/lukemelas/EfficientNet-PyTorch
⁴ https://github.com/pytorch/examples/blob/master/word_language_model
Fig. 15: Comparing uniform quantization (QT) and term revealing (TR) for an MLP on
MNIST (left), CNNs on ImageNet (center), and an LSTM on WikiText-2 (right). The
QT settings vary the weight bit-width (from 4 to 8 bits), while the TR settings vary g
(group size) and α (number of terms per group). TR reduces the number of term pair
multiplications per sample over QT by 3-10× across the three types of DNNs.
[Axes: term pair multiplications per sample (log scale) versus top-1 accuracy (%) for
the MLP and CNN panels and perplexity for the LSTM panel.]
Fig. 16: A larger group size g improves ResNet-18 ImageNet
classification accuracy for a given α.
[Axes: average number of terms α versus top-1 accuracy (%), for g = 1, 2, 8, and 32.]
Fig. 17: Measuring the individual contributions of TR and
HESE in reducing number of terms while maintaining high
classification accuracy.
[Axes: average number of terms α versus top-1 accuracy (%) on ResNet-18, for
HESE + TR (g=8), QT + TR (g=8), HESE, and QT.]
1 at the same α value. Note that a group size of 1 is equivalent
to truncating each value to exactly α terms.
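The effect of the group size can be seen on a toy example. The Python sketch below is illustrative and makes several assumptions: it operates on nonnegative integers using plain binary terms (no HESE), it takes α as an integer so that the group budget is k = α · g, and it breaks ties between equal exponents by position within the group. The function names are hypothetical.

```python
def binary_terms(value):
    """Exponents of the nonzero binary terms of a nonnegative integer."""
    return [e for e in range(value.bit_length() - 1, -1, -1) if (value >> e) & 1]

def term_reveal(values, group_size, alpha):
    """Keep the k = alpha * group_size largest terms within each group."""
    k = alpha * group_size
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        terms = [(e, i) for i, v in enumerate(group) for e in binary_terms(v)]
        terms.sort(key=lambda t: (-t[0], t[1]))  # largest exponents first
        kept = [0] * len(group)
        for e, i in terms[:k]:
            kept[i] += 1 << e
        out.extend(kept)
    return out

def total_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx))

values = [5, 1, 0, 0, 12, 0, 3, 0]           # 7 terms over 8 values
per_value = term_reveal(values, group_size=1, alpha=1)
grouped = term_reveal(values, group_size=8, alpha=1)
print(per_value, total_error(values, per_value))  # [4, 1, 0, 0, 8, 0, 2, 0] 6
print(grouped, total_error(values, grouped))      # [5, 1, 0, 0, 12, 0, 3, 0] 0
```

With the same average of one term per value, per-value truncation discards the low-order terms of 5, 12, and 3, while the group of 8 lets the zero-valued entries donate their unused budget, so nothing is lost.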
C. Isolating the Impact of TR and HESE
Figure 17 shows the relative impact of TR and HESE in
terms of classification accuracy by measuring them in isolation.
The HESE and QT settings (without TR) apply term truncation
by keeping the top k terms in each individual weight. In this
case, α is equal to k as the group size is 1 (i.e., there is no
grouping). We see that HESE substantially outperforms QT
until the top 4 terms are kept per weight (α = 4), as HESE
requires fewer terms per value. For the settings with TR, QT + TR and
HESE + TR, we use a group size of g = 8, with term budget k
values of 8, 12, 16, 20, and 24 to generate comparable values
of α as in the settings without TR. We find that TR improves
the performance of both the QT and HESE encoding methods,
with HESE + TR achieving the best performance.
D. Quantization Error Analysis
TR’s superior performance (e.g., accuracy
or perplexity) over QT discussed in Section VI-A is due to
TR introducing less quantization error. Figure 18 shows the
Fig. 18: The average quantization error (relative to the original
32-bit floating-point weights) across the convolutional layers
in ResNet-18 for 3 QT settings and one TR setting.
[Axes: ResNet-18 layer (1 to 18) versus average error per weight, for 8-bit QT,
7-bit QT, 6-bit QT, and TR (g=8, k=14).]
quantization error across the layers in ResNet-18 for 3 QT
settings (from 6-bit to 8-bit) and TR with a group size g = 8
and a group budget k = 14. We see that TR introduces a
small amount of quantization error over 8-bit QT, which makes
sense as TR is applied on top of 8-bit QT. The 7-bit and 6-bit
QT settings truncate the low-order terms, leading to larger
quantization error and reduced classification accuracy.
VII. FPGA EVALUATION
In this section, we evaluate the hardware performance of the
TR system described in Section V. We have synthesized our TR
system using a Xilinx VC707 FPGA evaluation board. We first
compare the performance of tMAC against a bit-parallel MAC
in Section VII-A. Then, we demonstrate that our TR system
can be used to implement both QT and TR in Section VII-B.
Finally, in Section VII-C, we compare our TR system against
other FPGA-based CNN accelerators on ResNet-18.
A. Comparing Performance of Bit-parallel MAC and tMAC
In this section, we evaluate the hardware performance of
a single tMAC by comparing it against a bit-parallel MAC
(pMAC) shown in Figure 10a. For both designs, we perform
a group of MAC computations: $y_{out} = \sum_{i=1}^{g} x_i w_i + y_{in}$,
TABLE II: FPGA resource consumption of pMAC and tMAC.
      | LUT | FF
pMAC  | 154 | 148
tMAC  | 25  | 26

TABLE III: Classification accuracy and energy efficiency
comparison for the two MAC designs across four CNNs.
Model           | MAC  | s | k  | g | Accuracy | Energy Eff.
ResNet-18       | pMAC | - | -  | - | 69.62%   | 1.0×
ResNet-18       | tMAC | 3 | 12 | 8 | 69.60%   | 2.1×
VGG-16          | pMAC | - | -  | - | 73.11%   | 1.0×
VGG-16          | tMAC | 2 | 12 | 8 | 73.11%   | 3.1×
MobileNet-v2    | pMAC | - | -  | - | 71.76%   | 1.0×
MobileNet-v2    | tMAC | 3 | 18 | 8 | 71.65%   | 1.5×
EfficientNet-b0 | pMAC | - | -  | - | 75.99%   | 1.0×
EfficientNet-b0 | tMAC | 3 | 16 | 8 | 75.84%   | 1.7×
where $y_{in}$, $y_{out}$, $x_i$, and $w_i$ are 32-bit, 32-bit, 8-bit, and
8-bit, respectively, and g = 8 is the number of elements in the
weight and data vectors (i.e., the group size in TR). In one
cycle, the pMAC performs an 8-bit multiplication between $x_i$
and $w_i$ and a 32-bit accumulation between the result and $y_{in}$.
Therefore, $y_{out}$ is generated in g = 8 cycles. By comparison,
the tMAC takes a variable number of cycles to process each
multiplication in the group, depending on the number of term
pairs in the multiplication. In total, it requires no more than
s × k cycles, where s is the maximum number of terms for
each data value and k is the group term budget.
Table II shows the FPGA resource consumption of the two
MAC designs in terms of lookup tables (LUTs) and flip-flops
(FFs). The tMAC consumes 6.5× fewer LUTs and 6.0× fewer
FFs than the pMAC. The tMAC requires fewer FPGA resources
as it performs 3-bit exponent additions as opposed to the 8-bit
additions and 32-bit accumulation in the pMAC.
We evaluate the two designs in terms of the energy efficiency,
which is the ratio between the throughput and power consump-
tion. Table III shows the energy efficiency and classification
accuracy for the two designs across four CNNs. For the tMAC
settings, different values of s and k are selected for each CNN
such that the classification accuracy stays competitive with
the baseline model (less than 0.15% difference in accuracy
across all settings) while keeping the group size fixed (g = 8).
For each CNN, the energy efficiency of both MAC designs
is normalized to that of the pMAC. We observe that tMAC
achieves superior energy efficiency (2.1× on average) compared
to pMAC across the four CNNs. This reflects that pMAC
needs to perform more work than tMAC, as our analysis in
Section V-A shows.
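The normalization and the reported average can be reproduced from Table III. The Python sketch below is a small arithmetic check with hypothetical names; the `energy_efficiency` helper simply restates the definition given above (throughput divided by power), and the normalized values are copied from the tMAC rows of Table III.

```python
from statistics import mean

def energy_efficiency(throughput, power):
    """Energy efficiency as the ratio between throughput and power."""
    return throughput / power

def normalized(design_efficiency, baseline_efficiency):
    """Efficiency of a design relative to the pMAC baseline."""
    return design_efficiency / baseline_efficiency

# Normalized tMAC energy efficiency from Table III (pMAC = 1.0x).
tmac_energy_eff = {
    "ResNet-18": 2.1,
    "VGG-16": 3.1,
    "MobileNet-v2": 1.5,
    "EfficientNet-b0": 1.7,
}
average = mean(tmac_energy_eff.values())
print(round(average, 1))  # 2.1
```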
B. System Comparison of QT and TR
In this section, we compare the hardware performance of TR
against QT with the DNNs shown in Figure 15. The systolic
array in the TR system has 128 rows by 64 columns, with each
systolic cell implementing a tMAC with a group size g = 8.
The group budget k is chosen independently for each network
Fig. 19: Normalized energy efficiency and latency improvements
of TR over QT. All models use a group size of g = 8.
The group budget k is selected for each model such that it is
within 0.15% accuracy of the corresponding QT setting (k is 8,
12, 12, 18, 16, 20 for MLP, VGG-16, ResNet-18, MobileNet-v2,
EfficientNet-b0, and LSTM, respectively). All models keep
the top s = 3 terms except for VGG-16 which uses s = 2.
such that TR is within 0.15% accuracy of the QT setting (or
0.05 perplexity for the LSTM). In this section, we use the same
TR system (Figure 9) for the implementation of both QT and
TR in order to show the reconfigurability of our design. The
implementation of QT does not require group-based ranking
and HESE encoding, so we turn off these components of the
hardware system to reduce dynamic power consumption. All
control registers are configured based on Table I.
We evaluate our FPGA system with the following performance
metrics: (1) average processing latency of the hardware
system to generate the prediction result, and (2) energy
efficiency, based on the average amount of energy required to process
a single input sample. As shown in Figure 19, our TR system
outperforms QT by 7.8× and 4.3× on average in terms
of processing latency and energy efficiency, respectively. For
more difficult tasks, such as Wikitext-2 for the LSTM, a more
conservative group budget k = 20 is selected, leading to less
relative improvement over QT. For overprovisioned models
(e.g., VGG-16), a more aggressive group budget of k = 8 is
used, leading to more substantial improvements in latency and
energy efficiency.
C. FPGA System Evaluation
In this section, we evaluate our TR system over ResNet-18
on ImageNet, using a group size g = 8 and group budget
k = 16. While an even larger group size could theoretically
lead to additional savings, there are diminishing returns as
shown in Figure 16 when comparing g = 8 and g = 32.
Additionally, larger group sizes increase the complexity of the
term comparator due to additional tree levels of A&C blocks.
We compare our TR system with other FPGA-based
accelerators which implement different CNN architectures
(e.g., AlexNet) on ImageNet. We evaluate our design in
terms of the average processing latency for the input samples,
energy efficiency of the hardware system, and classification
accuracy. As shown in Table IV, our design achieves the
highest classification accuracy (69.48%), the highest energy efficiency
(25.22 frames/J), and the second lowest latency (7.21 ms).
TABLE IV: Comparison of our FPGA implementation of
ResNet-18 to other FPGA-based accelerators on ImageNet.
                       | [42]      | [43]       | [44]       | [45]      | Ours
FPGA Chip              | VC706     | Virtex-7   | ZC706      | ZC706     | VC707
Acc. (%)               | 53.30     | 55.70      | 64.64      | N/A       | 69.48
Frequency (MHz)        | 200       | 100        | 150        | 100       | 170
FF                     | 51k (12%) | 348k (40%) | 127k (29%) | 96k (22%) | 316k (51%)
LUT                    | 86k (39%) | 236k (55%) | 182k (83%) | 148k (68%) | 201k (65%)
DSP                    | 808 (90%) | 3177 (88%) | 780 (89%)  | 725 (80%) | 756 (27%)
BRAM                   | 303 (56%) | 1436 (49%) | 486 (86%)  | 901 (82%) | 606 (59%)
Latency (ms)           | 5.88      | 11.7       | 224        | 17.3      | 7.21
Energy eff. (frames/J) | 23.6      | 8.39       | 0.46       | 6.13      | 25.22
Our hardware system achieves the best performance for
multiple reasons. First, TR coupled with the proposed HESE
encoding greatly reduces the number of term pair multiplications,
which reduces the number of cycles in tMACs. TR allows
tMACs to achieve a much tighter processing bound of 3 × k
pairs per group as opposed to 7 × 7 × g in the case of standard
binary encoding without TR. Second, the bit-serial design of
the coefficient accumulator in tMAC together with the systolic
architecture of the computing engine leads to a highly regular
layout with low routing complexity.
VIII. CONCLUSION
We proposed term revealing (TR) as a general run-time
approach for furthering quantized computation on already
quantized DNNs. Departing from conventional quantization
that operates on individual values, TR is a group-based method
that keeps a fixed number of terms within a group of values.
TR leverages the weight and data distributions of DNNs, so
that it can achieve good model performance even with a small
group budget. We measure the computation cost of TR-enabled
quantization using the number of term pair multiplications per
inference sample. Under this clearly defined cost proxy, we have
shown that TR significantly lowers computation costs for MLPs,
CNNs, and LSTMs. As shown in Section VII-B, this reduction
in operations translates to improved energy efficiency and
reduced latency over conventional quantization for our FPGA
system. Furthermore, our FPGA system demonstrates that by
changing a small number of control bits we can reconfigure
a quantized computation under conventional quantization to
one under TR-enabled quantization, and vice versa (Table I).
Quantization is one of the most widely used approaches in
streamlining DNNs; TR, proposed in this paper, brings the
success of the quantization approach to another level.
REFERENCES
[1] “Smart compose: Using neural networks to help write emails.” Available
at: https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html.
[2] D. Lin, S. Talathi, and S. Annapureddy, “Fixed point quantization of
deep convolutional networks,” in International Conference on Machine
Learning, pp. 2849–2858, 2016.
[3] A. D. Booth, “A signed binary multiplication technique,” The Quarterly
Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–
240, 1951.
[4] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep
neural networks with pruning, trained quantization and huffman coding,”
arXiv preprint arXiv:1510.00149, 2015.
[5] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured
sparsity in deep neural networks,” in Advances in Neural Information
Processing Systems, pp. 2074–2082, 2016.
[6] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger, “Con-
densenet: An efficient densenet using learned group convolutions,” arXiv
preprint arXiv:1711.09224, 2017.
[7] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very
deep neural networks,” in International Conference on Computer Vision
(ICCV), vol. 2, 2017.
[8] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method
for deep neural network compression,” arXiv preprint arXiv:1707.06342,
2017.
[9] S. Narang, E. Undersander, and G. F. Diamos, “Block-sparse recurrent
neural networks,” CoRR, vol. abs/1711.02782, 2017.
[10] S. Gray, A. Radford, and D. Kingma, “Gpu kernels for block-sparse
weights.” https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/
blocksparsepaper.pdf, 2017. [Online; accessed 12-January-2018].
[11] H. T. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional
neural networks for efficient systolic array implementations: Column
combining under joint optimization,” 24th ACM International Conference
on Architectural Support for Programming Languages and Operating
Systems, 2019.
[12] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang,
“Admm-nn: An algorithm-hardware co-design framework of dnns using
alternating direction methods of multipliers,” in Proceedings of the
Twenty-Fourth International Conference on Architectural Support for
Programming Languages and Operating Systems, pp. 925–938, ACM,
2019.
[13] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neu-
ral networks with low precision multiplications,” arXiv preprint
arXiv:1412.7024, 2014.
[14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
learning with limited numerical precision,” in International Conference
on Machine Learning, pp. 1737–1746, 2015.
[15] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,”
arXiv preprint arXiv:1612.01064, 2016.
[16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio,
“Quantized neural networks: Training neural networks with low precision
weights and activations,” The Journal of Machine Learning Research,
vol. 18, no. 1, pp. 6869–6898, 2017.
[17] E. Park, J. Ahn, and S. Yoo, “Weighted-entropy-based quantization
for deep neural networks,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 5456–5464, 2017.
[18] S. Kapur, A. Mishra, and D. Marr, “Low precision rnns: Quantizing rnns
without losing accuracy,” arXiv preprint arXiv:1710.07706, 2017.
[19] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network
quantization: Towards lossless cnns with low-precision weights,” arXiv
preprint arXiv:1702.03044, 2017.
[20] P. Wang and J. Cheng, “Fixed-point factorized networks,” in Computer
Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on,
pp. 3966–3974, IEEE, 2017.
[21] C.-Y. Chen, J. Choi, K. Gopalakrishnan, V. Srinivasan, and S. Venkatara-
mani, “Exploiting approximate computing for deep learning acceleration,”
in 2018 Design, Automation & Test in Europe Conference & Exhibition
(DATE), pp. 821–826, IEEE, 2018.
[22] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator
based on outlier-aware low-precision computation,” in 2018 ACM/IEEE
45th Annual International Symposium on Computer Architecture (ISCA),
pp. 688–698, IEEE, 2018.
[23] E. Park, S. Yoo, and P. Vajda, “Value-aware quantization for training and
inference of neural networks,” in Proceedings of the European Conference
on Computer Vision (ECCV), pp. 580–595, 2018.
[24] Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training
binaryweight networks via hashing,” arXiv preprint arXiv:1802.02733,
2018.
[25] B. McDanel, S. Q. Zhang, H. T. Kung, and X. Dong, “Full-stack
optimization for accelerating cnns using powers-of-two weights with
fpga validation,” International Conference on Supercomputing, 2019.
[26] A. Li, T. Geng, T. Wang, M. Herbordt, S. L. Song, and K. Barker,
“Bstc: a novel binarized-soft-tensor-core design for accelerating bit-based
approximated neural nets,” in Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis,
pp. 1–30, 2019.
[27] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training
deep neural networks with binary weights during propagations,” in
Advances in neural information processing systems, pp. 3123–3131,
2015.
[28] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov,
and A. Moshovos, “Bit-pragmatic deep neural network computing,” in
Proceedings of the 50th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 382–394, ACM, 2017.
[29] A. Delmas, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify,
M. Nikolic, and A. Moshovos, “Bit-tactical: Exploiting ineffectual
computations in convolutional neural networks: Which, why, and
how,” 24th ACM International Conference on Architectural Support
for Programming Languages and Operating Systems, 2019.
[30] H. T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15,
pp. 37–46, 1982.
[31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understand-
ing deep learning requires rethinking generalization,” arXiv preprint
arXiv:1611.03530, 2016.
[32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778, 2016.
[34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A
large-scale hierarchical image database,” in Computer Vision and Pattern
Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE,
2009.
[35] Y. LeCun, “The mnist database of handwritten digits,”
http://yann.lecun.com/exdb/mnist/, 1998.
[36] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional
networks: Visualising image classification models and saliency maps,”
arXiv preprint arXiv:1312.6034, 2013.
[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 4510–4520, 2018.
[38] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
[39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[40] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture
models,” arXiv preprint arXiv:1609.07843, 2016.
[41] J. H. Lee, S. Ha, S. Choi, W.-J. Lee, and S. Lee, “Quantization for rapid
deployment of deep neural networks,” arXiv preprint arXiv:1810.05488,
2018.
[42] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen,
“Dnnbuilder: an automated tool for building high-performance dnn
hardware accelerators for fpgas,” in Proceedings of the International
Conference on Computer-Aided Design, p. 56, ACM, 2018.
[43] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator
efficiency through resource partitioning,” in Computer Architecture
(ISCA), 2017 ACM/IEEE 44th Annual International Symposium on,
pp. 535–547, IEEE, 2017.
[44] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for
convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 26–35,
ACM, 2016.
[45] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, “Exploring heterogeneous
algorithms for accelerating deep convolutional neural networks on fpgas,”
in Proceedings of the 54th Annual Design Automation Conference 2017,
p. 62, ACM, 2017.
