Improving Efficiency in Neural Network Accelerator Using Operands
  Hamming Distance optimization by Li, Meng et al.
IMPROVING EFFICIENCY IN NEURAL NETWORK ACCELERATOR USING
OPERANDS HAMMING DISTANCE OPTIMIZATION
Meng Li * 1 Yilei Li * 1 Pierce Chuang 1 Liangzhen Lai 1 Vikas Chandra 1
ABSTRACT
Neural network accelerators are a key enabler for on-device AI inference, for which energy efficiency is an
important metric. The datapath energy, including the computation energy and the data movement energy among
the arithmetic units, claims a significant part of the total accelerator energy. By revisiting the basic physics of the
arithmetic logic circuits, we show that the datapath energy is highly correlated with the bit flips when streaming the
input operands into the arithmetic units, defined as the hamming distance of the input operand matrices. Based on
the insight, we propose a post-training optimization algorithm and a hamming-distance-aware training algorithm
to co-design and co-optimize the accelerator and the network synergistically. The experimental results based on
post-layout simulation with MobileNetV2 demonstrate on average 2.85× datapath energy reduction and up to
8.51× datapath energy reduction for certain layers.
1 INTRODUCTION
Deep neural networks (DNNs) have revolutionized different
applications ranging from computer vision to speech and
natural language processing (LeCun et al., 2015), and are
now widely deployed in data centers (Jouppi et al., 2017;
Hazelwood et al., 2018; Park et al., 2018b) and edge de-
vices (Du et al., 2017; Zhang et al., 2017; Wu et al., 2019).
As modern DNNs usually require significant computation,
neural network accelerators are extensively studied in re-
cent years to enable energy-efficient processing (Chen et al.,
2014; 2016; Sze et al., 2017; Jouppi et al., 2017; Sharma
et al., 2018).
The datapath of a neural network accelerator, including the
arithmetic compute units and the data bus among the units,
lies at the heart of the design and plays an important role
in its energy consumption. With the trend of aggressive operand
quantization (< 8 bit) and near/in-memory computation, the
energy consumption of memory accesses in neural network
accelerators is greatly reduced. In many state-of-the-art
accelerator designs (Andri et al., 2016; Gao et al., 2017;
Park et al., 2018a), datapath can consume 40-70% of the
total energy.
*Equal contribution. 1Facebook, 1 Hacker Way, Menlo Park, CA
94025. Correspondence to: Meng Li <meng.li@fb.com>, Yilei
Li <yileil@fb.com>.
Conventionally, the datapath energy consumption in a neural
network accelerator can be estimated as
$E_{\mathrm{datapath}} = \lambda \cdot \mathrm{OPs} \cdot \mathrm{Energy/OP}$,
where OPs denotes the total number of operations of the neural
network, Energy/OP is the datapath energy consumption of one
operation, and λ is a correction term that depends on the network
parameters and the underlying hardware design. Previous research
mainly focuses on reducing OPs, e.g., by optimizing the
network topology (Iandola et al., 2016; Howard et al., 2017;
Tan et al., 2019) or network pruning (Han et al., 2015; He
et al., 2017), and reducing Energy/OP , e.g., by network
quantization (Moons & Verhelst, 2016; Park et al., 2018a;
Sharma et al., 2018) or binarization (Courbariaux et al.,
2016) etc. In contrast, reducing λ receives less attention.
Existing works mainly focus on exploiting the sparsity of
the network parameters and activations to gate the compute
units and skip the unnecessary computations (Chen et al.,
2016).
In this work, we explore a new dimension to reduce λ and
the datapath energy. We show that as most accelerators
leverage spatial data reuse (Chen et al., 2016) and stream
input operands into the compute array, the sequence of the
input operands significantly impacts the datapath energy.
Specifically, we find that the datapath energy is strongly
correlated to the bit flips when streaming the input operands.
In this paper, we leverage the concept of hamming distance
to formalize the bit flip analysis. A series of post-training
and training-aware techniques are proposed to co-design
and co-optimize the accelerator and the network to reduce
the hamming distance of the input operand sequence. Experimental
results based on the post-layout simulation demonstrate
on average 2.85× datapath energy reduction and up
to 8.51× energy reduction for certain layers. The proposed
techniques are compatible with other optimization knobs,
e.g., pruning, quantization, etc.
arXiv:2002.05293v1 [cs.CV] 13 Feb 2020
The contributions of the paper can be summarized as follows:
• We discover the correlation between the datapath en-
ergy and the hamming distance when streaming the
input operands and further propose the concept of ham-
ming distance optimization as a new direction of data-
path energy optimization;
• We propose a post-training optimization algorithm to
reduce the hamming distance of the neural network
model, which introduces negligible hardware overhead
and no impact on the model output;
• We propose a hamming-distance-aware training algo-
rithm, which reduces the hamming distance of the neu-
ral network model with negligible effect on accuracy;
• Experiments based on the post-layout simulation
demonstrate promising results (up to 8.51× datapath
energy reduction) by combining the hamming-distance-
aware training and the post-processing algorithm.
2 BACKGROUND: SPATIAL
ACCELERATORS
Modern NN accelerators usually comprise the following
major components: a two-dimensional arithmetic compute
array, a network-on-chip (NoC), control blocks, and an on-
chip memory (Sze et al., 2017). Specifically, the on-chip
memory usually consists of several levels of hierarchies,
including a global buffer, an inter-unit network to facilitate
data pass among the arithmetic units, and register files (RFs)
within each arithmetic unit (Chen et al., 2016). The mem-
ory access energy to different memory hierarchies can vary
significantly.
[Figure 1 graphic: two PE array diagrams, panels (a) and (b). In (a) the array spans C and H x W, with weights streamed along K and partial sums (Psum) flowing out; in (b) the array spans K and H x W, with weights and activations (Act) streamed along C.]
Figure 1. Different dataflow variants: (a) exploits input stationary
dataflows, and (b) leverages the output stationary dataflow. For
these dataflow variants, weights are organized in a sequence along
either the output channel or the input channel dimension and are
sent into the compute array consecutively.
To reduce access to more expensive memory hierarchies,
specialized processing dataflows are designed to enable data
reuse across different computation units. Representative
dataflows include input stationary, output stationary, row
stationary, etc (Chen et al., 2016; Sze et al., 2017). The
dataflow architecture dictates what data gets read into the
memory hierarchy and how data are propagated in the com-
pute array. Figure 1 shows two widely used designs (Chen
et al., 2018). The design in Figure 1(a) leverages the input
stationary dataflow and relies on unrolling both the input channel
dimension (C) and input spatial locations (H × W) to map the
operations spatially onto the array to exploit the computa-
tion parallelism. The weights are streamed into the array and
can be reused horizontally with input pixels from different
spatial locations, while the partial sums are accumulated spa-
tially across the column. Instead of saving the partial sums
directly to the activation SRAM, they are usually stored into
an accumulation buffer first to reduce the memory access
energy. Once the partial sums are fully reduced, they may
go through the nonlinear units and be stored back to the
global SRAM. Similarly, the design in Figure 1(b) leverages
the output stationary dataflow and relies on unrolling the
output channel dimension (K) and output spatial dimen-
sions (H ×W ) to enable data reuse. In this scheme, the
weights are still streamed along the row direction and the
input activations are streamed in the orthogonal direction to
reuse across different output channels.
Popular neural network layers, such as the convolution layer
and the fully-connected layer, can be easily mapped to the
accelerator. Consider the example of a 1-by-1 convolution
in Figure 2. To map the computation into the input sta-
tionary compute array in Figure 1(a), the input activations
are pre-filled with different input spatial locations unrolled
horizontally and different input channels unrolled vertically.
The weights are streamed in a sequence into the arithmetic
array. For the input stationary dataflow, weights from dif-
ferent input channels are fed spatially into different rows
and weights from different output channels are streamed
temporally into the same row.
[Figure 2 graphic: (a) the 1-by-1 convolution written as a K × C weight matrix W multiplied by a C × (H x W) activation matrix A; (b) the corresponding mapping, with the activations A laid out over the C and H x W dimensions and the weights W streamed along K.]
Figure 2. Mapping a 1-by-1 convolution to the input stationary
accelerator in Figure 1(a).
The energy consumption of the accelerator is composed of
the datapath energy (including the arithmetic computation
energy and the data propagation energy among compute
units), the memory access energy and the control energy.
When all the operands can fit into the local SRAM (Park
et al., 2018a), the datapath and memory access energy can
be computed as
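$$E_{\mathrm{datapath}} = \mathrm{OPs} \times \mathrm{Energy/OP}$$
$$E_{\mathrm{mem}} = \mathrm{OPs} \times \left( \frac{1}{\tau_{\mathrm{weight}}} + \frac{1}{\tau_{\mathrm{input}}} + \frac{1}{\tau_{\mathrm{psum}}} \right) \times \mathrm{Energy/SRAMAccess}$$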
where τweight, τinput, τpsum denote the reuse factor of the
weights, input activation, and the partial sums, respec-
tively. Energy/OP denotes the datapath energy and
Energy/SRAMAccess denotes the SRAM access energy
that includes the SRAM read/write and the data movement
energy from SRAM to the compute array.
Assume the ratio between the compute energy, inter-unit
propagation energy and the SRAM access energy is 1:2:6
(Chen et al., 2016). For a reasonable design with τweight =
16, τinput = 16 and τpsum = 16, the ratio between the datapath
energy and the SRAM energy becomes 4:3 (assuming
weights and inputs are 8-bit and partial sums are 32-bit).
The datapath energy consumes a significant portion of the
total energy and hence it is crucial to reduce the datapath
energy.
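The energy split for this design point can be checked with a few lines of arithmetic. The sketch below assumes that the 1:2:6 ratio refers to 8-bit operands and that a 32-bit partial-sum access costs four times an 8-bit access; both are our reading of the text rather than details stated by the authors, and the function name is illustrative. Under these assumptions the datapath-to-SRAM energy ratio evaluates to 4:3.

```python
from fractions import Fraction

def datapath_to_sram_ratio(tau_weight, tau_input, tau_psum,
                           e_compute=1, e_propagate=2, e_sram=6,
                           weight_bits=8, input_bits=8, psum_bits=32):
    """Return E_datapath / E_mem per operation as an exact fraction."""
    # Datapath energy per OP: compute plus inter-unit propagation.
    e_datapath = Fraction(e_compute + e_propagate)

    # Assumption: SRAM access energy scales linearly with operand width,
    # where e_sram is the cost of one 8-bit access.
    def scale(bits):
        return Fraction(bits, 8)

    e_mem = e_sram * (scale(weight_bits) / tau_weight
                      + scale(input_bits) / tau_input
                      + scale(psum_bits) / tau_psum)
    return e_datapath / e_mem

ratio = datapath_to_sram_ratio(16, 16, 16)
print(ratio)  # 4/3
```

Raising any of the reuse factors shrinks the memory term, which is why the datapath share grows as designs exploit more reuse.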
The datapath energy can be further divided into three parts,
including switching energy, glitch energy, and leakage
(Rabaey et al., 2008). Both the switching energy and glitch
energy are caused by the circuit nodes switching from 0 to
1 or from 1 to 0, denoted as bit flips. Leakage energy is
caused by the small leakage current when the transistors are
turned off and its contribution to the datapath energy is usu-
ally orders of magnitude smaller than glitch and switching.
Hence, we ignore the leakage energy in the paper.
3 MOTIVATION: BIT FLIPS IMPACT
DATAPATH ENERGY
As described in Section 2, while the datapath energy ac-
counts for a significant portion of the total energy, the bit
flips inside the datapath are the main culprit. The datapath
bit flips are determined by the value and the streaming pat-
tern of the input operands, i.e., weights, input activations,
and partial sums. Because the activations and partial sums
are input dependent, we focus on analyzing the impact of
weight matrices.
Consider the example of the 2-bit weight matrix $W \in \mathbb{R}^{K \times C}$ with K = 4 rows and C = 4 columns:
$$W = \begin{bmatrix} 00 & 00 & 00 & 00 \\ 11 & 11 & 11 & 11 \\ 00 & 00 & 00 & 00 \\ 11 & 11 & 11 & 11 \end{bmatrix}.$$
Without loss of generality, we assume an input-stationary
compute array as shown in Figure 1(a) and W is streamed
into the array following Figure 2(b). Then, the weight
sequence fed into the first row of the compute array is
{00, 11, 00, 11} and the bit flips of the weight sequence
at the compute array input are 6. To confirm the relation
between the bit flips of the weight sequence and the data-
path energy, we use the weight matrices of MobileNetV2
(Sandler et al., 2018) trained on Cifar100 dataset as an ex-
ample and generate random input activations. We evaluate
the bit flips of the weight sequence and the datapath energy
consumption with post-layout simulation (see Section 6 for
detailed experimental setup). As shown in Figure 3, the total
bit flips of the weight sequence and the energy consumption
demonstrate a strong linear relation. Moreover, given a fixed
total bit flips, the energy is independent of the length of the
weight sequence and the bit flipping probability.
[Figure 3 plot: normalized energy (0 to 500) versus total bit flips (0 to 4 ×10^5), showing simulation data and a linear regression fit; color scale from 0.1 to 0.5.]
Figure 3. Total bit flips of the weight sequence and the energy con-
sumption demonstrate strong correlation: the colormap represents
the average bit flip probability of the input sequence.
Hence, to minimize the datapath energy, an effective ap-
proach is to reduce the bit flips of the weight sequence. We
observe that the bit flips can be reduced if the sequence of
streamed weights is carefully reordered. Consider W in
the example above. If we swap the second row and the third
row of the matrix, we have W′ as below:
$$W' = \begin{bmatrix} 00 & 00 & 00 & 00 \\ 00 & 00 & 00 & 00 \\ 11 & 11 & 11 & 11 \\ 11 & 11 & 11 & 11 \end{bmatrix}$$
Now, by streaming W ′ into the compute array, the bit flips
can be reduced from 24 to 8. Note that swapping the rows
of the weight matrix is essentially adjusting the order of gen-
erating output channels and there is no influence in terms
of neural network functionality. As the swapping can be
finished via post-processing in model level, no specific hard-
ware support is needed.
Besides the post-training processing of the weight matrices,
another orthogonal approach is to incorporate the bit flips of
the weight sequence into the training procedure and reduce
the bit flips without sacrificing the model accuracy. Consider
W′′ and v below:
$$W'' = \begin{bmatrix} 10 & 10 & 10 & 01 \\ 11 & 11 & 11 & 11 \\ 10 & 10 & 10 & 01 \\ 11 & 11 & 11 & 11 \end{bmatrix}, \quad v = \begin{bmatrix} 01 \\ 01 \\ 01 \\ 10 \end{bmatrix}.$$
While Wv = W ′′v, the bit flips of the weight sequence
for W ′′ is 12. Hence, without impacting the computation
results, the bit flips can be reduced by 2×. In fact, by further
reordering the output channels of W ′′, the bit flips can be
reduced to 4.
In the next section, we will formalize our analysis of
the bit flips of the weight sequence and formally describe
our post-training and training-aware techniques to reduce
the bit flips of the weight sequence.
4 METHODOLOGY: HAMMING DISTANCE
OPTIMIZATION
In this section, we will formalize the concept of bit flips and
propose both post-training and training-aware techniques
to minimize the bit flips of the streaming weights. For con-
venience, we use the input stationary dataflow (e.g., Figure
1(a)) as an example throughout the analysis but the defi-
nition, analysis, and conclusion can be easily applied to
other dataflow schemes once the weights are streamed into
the compute array. The notations used in this paper are
summarized in Table 1.
Table 1. Notations used in the paper.
Notation      Description
N, H, W, K    Output batch, height, width, channel
C, Fx, Fy     Input channel, filter height, width
W             Model weight matrix
B             Bit width of model weights
S             Sequence of output channels
T             Cluster of input channels
4.1 Problem Formulation
[Figure 4 plot: probability density (0 to 25) versus NHD(W) (0.4 to 0.56).]
Figure 4. NHD(W) distribution for different layers in MobileNetV2 and ResNet26.
In coding theory, the bit difference between two binary
strings is formally defined as the hamming distance (Hamming,
1950). Accordingly, we define the hamming distance
between two B-bit numbers a and b as
$$HD(a, b) = \sum_{i=1}^{B} \mathrm{Bit}_i(a) \oplus \mathrm{Bit}_i(b),$$
where ⊕ denotes the XOR operation and Biti(·) is the func-
tion that extracts the i-th bit of the number.
Consider a weight matrix $W \in \mathbb{R}^{K \times C}$ (see footnote 1). As the input
stationary dataflow unrolls the input channel dimension (C)
along the compute array column direction and streams the
weights along different output channels (K) in temporal
sequence to the array, we define the hamming distance of
streaming W as
$$HD(W) = \sum_{j=1}^{K-1} HD(W[j, :], W[j+1, :]) = \sum_{j=1}^{K-1} \sum_{i=1}^{C} HD(W[j, i], W[j+1, i]).$$
We also define the normalized hamming distance (NHD) of
streaming W as
$$NHD(W) = \frac{HD(W)}{C \times (K - 1) \times B}.$$
Hence, HD(W ) captures the total bit flips of streaming
W and NHD(W ) represents the bit flip probability. We
show NHD(W ) for different layers of the MobileNetV2
and ResNet26 trained on Cifar100 in Figure 4 and as we
can see, NHD(W ) is close to 0.5 for all the layers. In the
following sections, we will propose techniques to minimize
HD(W ) and NHD(W ) to reduce the bit flips and the data-
path energy.
1We assume Fx = Fy = 1 for the weight matrix in this case,
but the definition and analysis can be easily extended to cases
where Fx and Fy are larger than 1.
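The definitions of HD and NHD translate directly into code. The following is a minimal sketch, assuming the weights are stored as non-negative integers of a known bit width; the function names are ours, and bits are indexed from 0 rather than from 1 as in the formula. It reproduces the bit flip counts of the 2-bit example in Section 3.

```python
def bit(x, i):
    """Extract the i-th bit of a non-negative integer (i = 0 is the LSB)."""
    return (x >> i) & 1

def hd(a, b, bits):
    """Hamming distance between two `bits`-wide numbers: sum of bitwise XORs."""
    return sum(bit(a, i) ^ bit(b, i) for i in range(bits))

def hd_stream(W, bits):
    """HD(W): total bit flips when the rows of W (K x C) are streamed in order."""
    return sum(hd(a, b, bits)
               for prev, cur in zip(W, W[1:])
               for a, b in zip(prev, cur))

def nhd_stream(W, bits):
    """NHD(W): HD(W) normalized by C * (K - 1) * B."""
    K, C = len(W), len(W[0])
    return hd_stream(W, bits) / (C * (K - 1) * bits)

# The 2-bit example from Section 3, before and after swapping rows 2 and 3.
W = [[0b00] * 4, [0b11] * 4, [0b00] * 4, [0b11] * 4]
W_swapped = [W[0], W[2], W[1], W[3]]

print(hd_stream(W, 2), hd_stream(W_swapped, 2))  # 24 8
print(nhd_stream(W, 2))                          # 1.0
```

An NHD of 1.0 means every bit flips on every transition, which is the worst case; the layers shown in Figure 4 sit near 0.5.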
4.2 Output Channel Reordering
Inspired by the example in Section 3, a straightforward
technique to minimize HD(W ) is to reorder the sequence
of W streaming into the compute array. Let S denote the
sequence of output channels to stream W into the array and
HDS(W ) denote the hamming distance of streaming W
following S, then, we have
$$HD_S(W) = \sum_{j=1}^{K-1} \sum_{i=1}^{C} HD(W[S[j], i], W[S[j+1], i]).$$
The output channel reordering problem is defined as follows.
Problem 1 (Output Channel Reordering) Given a weight
matrix $W \in \mathbb{R}^{K \times C}$, find $S^*$ such that $HD_{S^*}(W)$ is minimized, i.e.,
$$S^* = \arg\min_S HD_S(W).$$
As S is a reordering of the output channels which consists
of each output channel exactly once, we map the reordering
problem to a Traveling Salesman Problem (TSP) (Miller
et al., 1960). Specifically, each output channel i corresponds
to one location to visit, and the hamming distance between
two output channels i and j, i.e., HD(W[i, :], W[j, :]), corresponds
to the distance between two locations. Hence, minimizing
HDS(W) is equivalent to searching for the shortest
path that visits every location exactly once. Because TSP is
NP-hard, the cost of solving the output channel reordering
problem exactly scales exponentially with the number of output
channels and quickly becomes intractable even for moderately
sized problems.
To efficiently solve the reordering problem, we propose a
greedy search algorithm as described in Algorithm 1. The
algorithm first initializes the sequence S by assigning the
first output channel to the starting position of S. After that,
the output channel that has the smallest hamming distance
compared with the previous channel in S is added to S.
The complexity of the algorithm scales quadratically with
the number of output channels, which is very efficient in
practice.
Algorithm 1 Greedy Output Channel Reordering.
Input: weight matrix W
Output: sequence S that approximately minimizes HD_S(W)
S ← INITIALIZE()
for i = 2 : K do
    j ← argmin_{j ∉ S} HD(W[S[i − 1], :], W[j, :])
    S[i] ← j
end
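Algorithm 1 can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the authors' implementation: weights are non-negative integers, the search starts from channel 0, ties are broken by the smaller channel index, and the helper names are ours.

```python
def row_hd(u, v):
    """Hamming distance between two rows of non-negative integer weights."""
    return sum(bin(a ^ b).count("1") for a, b in zip(u, v))

def greedy_reorder(W, start=0):
    """Greedy output-channel order: repeatedly append the unvisited row that
    is closest in Hamming distance to the row placed last."""
    order = [start]
    remaining = set(range(len(W))) - {start}
    while remaining:
        last = W[order[-1]]
        # Tie-break on the channel index (an arbitrary, deterministic choice).
        nxt = min(remaining, key=lambda j: (row_hd(last, W[j]), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def hd_in_order(W, order):
    """Total bit flips when rows are streamed in the given order."""
    return sum(row_hd(W[a], W[b]) for a, b in zip(order, order[1:]))

# The 2-bit example from Section 3.
W = [[0b00] * 4, [0b11] * 4, [0b00] * 4, [0b11] * 4]
S = greedy_reorder(W)
print(S, hd_in_order(W, list(range(4))), hd_in_order(W, S))  # [0, 2, 1, 3] 24 8
```

Each of the K steps scans at most K candidate rows, so the number of row comparisons grows quadratically with K, matching the complexity stated above.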
4.3 Input Channel Segmentation and Clustering
While the output channel reordering can help reduce the
hamming distance of streaming W, its effectiveness is impacted
by the number of input channels C. We use MobileNetV2
on the Cifar100 dataset (Krizhevsky et al., 2009) as
an example and evaluate the hamming distance reduction for
different layers. As shown in Table 2, as C increases,
the hamming distance reduction diminishes significantly.
Table 2. Hamming Distance Reduction with Various C and K.
LAYER       C       K      HD REDUCTION
LAYER 7     192     32     1.53×
LAYER 15    384     64     1.33×
LAYER 21    576     96     1.27×
LAYER 27    960     160    1.18×
LAYER 33    1280    320    1.21×
One straightforward method to improve the effectiveness
of the output channel reordering is to segment the weight
matrix W along the input channel direction into several
small sub-matrices. For different sub-matrices, we can use
Algorithm 1 to search for the optimal output channel order
to reduce the hamming distance. We denote this method
as the segment-then-reorder approach. It should be noted
that as the output channel sequence changes, specific hard-
ware support in the accumulator is required to make sure the
partial sums corresponding to the same output channel are
correctly accumulated. We will detail the hardware support
in Section 5, which introduces negligible overhead to the
accumulator. With the segment-then-reorder approach, the
hamming distance can be further reduced by 1.5-2.5× com-
pared with the direct output channel reordering (see Section
6).
As expected, the smaller each input channel group is, the bet-
ter the hamming distance reduction can be achieved. Hence,
the segment-then-reorder algorithm would favor the com-
pute array with a skewed aspect ratio, i.e., more columns
and fewer rows. However, the aspect ratio of the compute ar-
ray also impacts the reuse of different operands (Chen et al.,
2016) and utilization. For example, with more
columns in the input stationary array, it takes more pixels
in the spatial plane to fill the whole array, which leads to
underutilization for small input activation sizes. While this
is not a problem for small-scale arrays with a small number
of compute units, it may cause utilization issues for large
arrays.
To further improve the effectiveness when the input channel
per segment is large, we propose to cluster the input chan-
nels first before segmenting the weight matrix. Then, the
output channels are reordered for each cluster separately.
We denote this approach as cluster-then-reorder.
[Figure 5 plots: (a) HD reduction versus number of iterations (0 to 20) for Layer 4 and Layer 10; (b) HD reduction of segment-then-reorder and cluster-then-reorder for Layers 4, 6, 8, 10, and 12; (c) HD reduction versus channels per cluster (8, 16, 32, 64) for both algorithms on Layer 4 and Layer 10.]
Figure 5. Performance of the cluster-then-reorder algorithm: (a) convergence plot; and normalized hamming distance comparison with the
segment-then-reorder algorithm for (b) different layers (channels per cluster is 8) and (c) different channels per cluster.
Consider the example of the following weight matrix:
$$W = \begin{bmatrix} 00 & 11 & 00 & 11 & 01 & 10 & 01 & 10 \\ 11 & 11 & 00 & 00 & 10 & 10 & 01 & 01 \\ 11 & 00 & 00 & 11 & 10 & 01 & 10 & 10 \\ 11 & 11 & 11 & 11 & 10 & 10 & 10 & 10 \end{bmatrix}$$
Assume the compute array has 4 rows and only allows for
streaming 4 input channels simultaneously. Instead of di-
rectly segmenting W , we can first cluster the input channels
into 2 groups, i.e., {0, 2, 4, 6} and {1, 3, 5, 7}, and then
segment W into W ′ and W ′′ as below. Compared to the
segment-then-reorder approach, the hamming distance can
be reduced from 22 to 16. Note that the clustering of the
input channels does not impact the output.
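$$W' = \begin{bmatrix} 00 & 00 & 01 & 01 \\ 11 & 00 & 10 & 01 \\ 11 & 00 & 10 & 10 \\ 11 & 11 & 10 & 10 \end{bmatrix}, \quad W'' = \begin{bmatrix} 11 & 11 & 10 & 10 \\ 11 & 00 & 10 & 01 \\ 00 & 11 & 01 & 10 \\ 11 & 11 & 10 & 10 \end{bmatrix}.$$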
Let {T1, . . . , Tt} denote the t clusters of the input channel.
The input channel clustering problem is then defined as
follows.
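Problem 2 (Input Channel Clustering) Given a weight matrix
$W \in \mathbb{R}^{K \times C}$, find t clusters $T_1, \ldots, T_t$ such that the
total hamming distance of streaming each sub-matrix $W_{T_i}$
is minimized, i.e.,
$$\begin{aligned} \min_{T_1, \ldots, T_t} \quad & \sum_{i=1}^{t} HD_{S_i^*}(W_{T_i}) \\ \text{s.t.} \quad & S_i^* = \arg\min_S HD_S(W_{T_i}), \\ & T_i \cap T_j = \emptyset \quad \forall i \neq j, \\ & \bigcup_{i=1}^{t} T_i = \{1, \ldots, C\}. \end{aligned}$$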
This is a nested optimization problem which is computa-
tionally expensive to solve optimally even if the inner opti-
mization loop can be solved with the proposed Algorithm 1.
Hence, we propose a greedy iterative method to solve the
nested optimization problem. As shown in Algorithm 2, in
the initialization process, t input channels are randomly selected
and $\{S_1^{(0)}, \ldots, S_t^{(0)}\}$ are initialized to minimize the
total hamming distance for each input channel. The algorithm
alternates between the assignment step and the update
step for N total iterations. In the assignment step, for each
input channel i, we evaluate its hamming distance following
the optimal sequence of each cluster, i.e., $S_1^{(n)}, \ldots, S_t^{(n)}$.
The input channel is then added to the cluster with the small-
est hamming distance. In the update step, we re-compute
the optimal sequence for each cluster of the input channels.
The convergence of the proposed clustering algorithm can
be guaranteed if the inner loop optimization, i.e., the output
channel reordering problem, can be optimally solved. This
is because the objective function of the clustering problem
is always bounded and it is guaranteed to be reduced in the
assignment and update step in each iteration. In practice,
we use the greedy algorithm to solve the update step as
described in Section 4.2. We find the cluster-then-reorder
algorithm converges well and consistently outperforms
the segment-then-reorder algorithm.
We use the layers of the MobileNetV2 (Sandler et al., 2018)
on Cifar100 dataset as an example and run the clustering
algorithm 20 times with random initialization. The con-
vergence plot is shown in Figure 5(a). The normalized
hamming distance is computed as the hamming distance
of different algorithms normalized by the hamming dis-
tance without output channel reordering. As we can see,
the clustering algorithm converges within 15 iterations and
the run-to-run variation of the clustering algorithm is very
small. We also compare the cluster-then-reorder algorithm
with the segment-then-reorder algorithm for different layers
and different numbers of channels per cluster. As shown in
Figure 5(b) and 5(c), the cluster-then-reorder algorithm can
outperform the baseline algorithm by up to 1.21×.
Algorithm 2 Cluster-then-reorder Algorithm
Input: weight matrix W ∈ R^{K×C}, number of iterations N,
number of clusters t
Output: cluster of input channels {T_1^*, ..., T_t^*} and the
optimal sequence of output channels {S_1^*, ..., S_t^*}
{S_1^(0), ..., S_t^(0)}, n ← RANDOM_INITIALIZE(), 0
while n ≤ N do
    // Assignment step
    for i = 1 : C do
        l ← argmin_k HD_{S_k^(n)}(W[:, i])
        T_l^(n) ← T_l^(n) ∪ {i}
    end
    // Update step
    for i = 1 : t do
        S_i^(n+1) ← argmin_S HD_S(W_{T_i})
    end
    n += 1
end
{S_1^*, ..., S_t^*} ← {S_1^(N), ..., S_t^(N)}
{T_1^*, ..., T_t^*} ← {T_1^(N), ..., T_t^(N)}
The proposed algorithm is very efficient since the complex-
ity of the update step scales O(K2C) and the complexity
of the assignment step scales O(CKt) with the number of
input channels C, output channels K, and the number of
clusters t.
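The alternation between the assignment and update steps can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation: the update step uses a greedy order that always starts from output channel 0, cluster sizes are not constrained to the array height, an empty cluster keeps its previous order, and all names are hypothetical.

```python
import random

def col_hd_along(W, col, order):
    """Bit flips of one input channel (column) streamed in the given row order."""
    return sum(bin(W[a][col] ^ W[b][col]).count("1")
               for a, b in zip(order, order[1:]))

def greedy_order(W, cols):
    """Greedy output-channel order for the sub-matrix restricted to `cols`."""
    def dist(a, b):
        return sum(bin(W[a][c] ^ W[b][c]).count("1") for c in cols)
    order, remaining = [0], set(range(1, len(W)))
    while remaining:
        nxt = min(remaining, key=lambda j: (dist(order[-1], j), j))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def cluster_then_reorder(W, t, n_iters, seed=0):
    """Return (clusters, orders, total HD) after n_iters >= 1 iterations."""
    rng = random.Random(seed)
    C = len(W[0])
    # Initialization: one randomly chosen seed channel per cluster.
    orders = [greedy_order(W, [c]) for c in rng.sample(range(C), t)]
    clusters = [[] for _ in range(t)]
    for _ in range(n_iters):
        # Assignment step: each channel joins the cluster whose current
        # order streams it with the fewest bit flips.
        clusters = [[] for _ in range(t)]
        for c in range(C):
            k = min(range(t), key=lambda k: (col_hd_along(W, c, orders[k]), k))
            clusters[k].append(c)
        # Update step: recompute the order of every non-empty cluster.
        orders = [greedy_order(W, cl) if cl else orders[i]
                  for i, cl in enumerate(clusters)]
    total = sum(col_hd_along(W, c, orders[i])
                for i, cl in enumerate(clusters) for c in cl)
    return clusters, orders, total

W = [
    [0b00, 0b11, 0b00, 0b11, 0b01, 0b10, 0b01, 0b10],
    [0b11, 0b11, 0b00, 0b00, 0b10, 0b10, 0b01, 0b01],
    [0b11, 0b00, 0b00, 0b11, 0b10, 0b01, 0b10, 0b10],
    [0b11, 0b11, 0b11, 0b11, 0b10, 0b10, 0b10, 0b10],
]
clusters, orders, total = cluster_then_reorder(W, t=2, n_iters=5)
print(clusters, orders, total)
```

The assignment step evaluates C channels against t orders of length K, and the update step runs one greedy pass per cluster, which is where the O(CKt) and O(K²C) terms come from.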
4.4 Hamming Distance-Aware Training
While the techniques proposed above focus on post-training
optimization, we also propose a hamming distance-aware
training procedure to further reduce the hamming distance of
streaming W . The basic idea is to incorporate the hamming
distance loss into the loss function and explicitly encourage
the reduction of hamming distance as shown below:
$$L = L_{CE} + \lambda L_{HD}(W),$$
where LCE represents the original cross-entropy loss. λ is
used to explicitly control the trade-off between the accuracy
and the hamming distance reduction.
However, there are two main problems with LHD(W ).
Firstly, to compute LHD, Bit(·) is needed. For an
integer x, the b-th bit is given by
$$\mathrm{Bit}_b(x) = \mathrm{floor}\left( \frac{1}{2^b} \left( x - 2^{b+1}\, \mathrm{floor}\left( \frac{x}{2^{b+1}} \right) \right) \right).$$
Because floor(·) is not differentiable, LHD is not differentiable
either.
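As a sanity check, the floor-based bit extraction can be compared against ordinary bit shifting. The sketch below reads the expression as floor((x mod 2^(b+1)) / 2^b), with b = 0 denoting the least significant bit; this reading and the function name are our assumptions.

```python
import math

def bit_floor(x, b):
    """b-th bit of a non-negative integer x, written with floor() only."""
    # x - 2^(b+1) * floor(x / 2^(b+1)) is x mod 2^(b+1): it drops bits above b.
    remainder = x - 2 ** (b + 1) * math.floor(x / 2 ** (b + 1))
    # Dividing by 2^b and flooring drops the bits below b.
    return math.floor(remainder / 2 ** b)

B = 8
ok = all(bit_floor(x, b) == (x >> b) & 1
         for x in range(2 ** B) for b in range(B))
print(ok)  # True
```

The function is piecewise constant in x, which is why its true gradient is zero almost everywhere and a surrogate gradient is needed during training.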
[Figure 6 flowchart with blocks: Target NN Model; Comp. LayerWise HD Loss and Param. Update (per batch); Output Chl Reordering (per epoch); Fix MSB (after multiple epochs); HD-Aware NN Model.]
Figure 6. Hamming distance-aware training procedure.
Previously, the straight-through estimator (STE) has been proposed
to approximate the gradients of floor(·) (Bengio
et al., 2013). However, directly applying STE leads to
$$\frac{\partial \mathrm{Bit}_b}{\partial x} = 0, \quad \forall b \neq B - 1.$$
This indicates that only the most significant bit of the weight
parameters can be regularized. Hence, we propose an itera-
tive freeze-and-regularize procedure. In the network training
process, we first add regularization to the most significant
bit and after several epochs, we freeze the most significant
bit and after that regularize the second most significant bit.
The iterative process continues until we fix all the bits of
the weights.
The second problem with LHD is that to compute LHD,
the input channel clusters and output channel orders are
needed. As the weight matrices get updated during training,
both the optimal input channel clusters and the optimal
output channel order can change. Hence, after each epoch
of training, we leverage the cluster-then-reorder algorithm
to cluster the input channels and reorder the output channels.
The final training procedure is shown in Figure 6.
5 HARDWARE SUPPORT
In this section, we discuss the necessary hardware support
for the proposed algorithms, including direct greedy reorder,
segment-then-reorder, and cluster-then-reorder algorithms.
The direct reorder algorithm only switches the sequence for
the output channel generation. No extra hardware support
is needed for the direct reorder algorithm. Instead, a post-
training processing of the model to re-arrange the weight
matrices is sufficient. Consider the example in Figure 7. To
switch the output channels of the first layer, both the rows of
the weight matrix in the first layer, i.e.,W1, and the columns
of the weight matrix in the second layer, i.e., W2, need to
be switched accordingly.
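The equivalence behind this post-training step is easy to verify numerically. The sketch below is a toy check with made-up weights and an element-wise ReLU between the two layers (our assumption for illustration): permuting the rows of W1 and the columns of W2 with the same order leaves the second layer's output unchanged.

```python
def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def relu(v):
    return [max(0, e) for e in v]

def reorder_output_channels(W1, W2, order):
    """Permute the rows of W1 and the columns of W2 consistently."""
    W1_new = [W1[k] for k in order]
    W2_new = [[row[k] for k in order] for row in W2]
    return W1_new, W2_new

W1 = [[1, 2], [3, 4], [5, 6]]   # 3 output channels, 2 input channels
W2 = [[1, 0, 2], [0, 1, 3]]     # consumes the 3 channels produced by W1
x = [7, -1]
order = [2, 0, 1]               # new streaming order of W1's output channels

y_ref = matvec(W2, relu(matvec(W1, x)))
W1p, W2p = reorder_output_channels(W1, W2, order)
y_new = matvec(W2p, relu(matvec(W1p, x)))
print(y_ref, y_new)  # [63, 104] [63, 104]
```

The check works because element-wise operations commute with a permutation of channels, so only the two adjacent weight matrices need to be rearranged.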
The segment-then-reorder algorithm divides the input chan-
nels into segments and reorders the output channels for each
Improving Efficiency in Neural Network Accelerator using Operands Hamming Distance Optimization
1
2
3
1
2
1
2
3
W1
Figure 7. Post-training processing to reorder the output channel.
Figure 8. (a) Hardware support for the segment-then-reorder algorithm
and (b) an example with 2 segments.
segment separately. Hence, the same row in different seg-
mented weight sub-matrices may correspond to the partial
sum of different output channels. To guarantee correct re-
duction of the partial sums, we add an output address lookup
table (LUT) to translate the index of the counter in the ac-
cumulator to the actual address for accumulation as shown
in Figure 8(a). We also show in Figure 8(b) an example on
how to use the address LUT to guide the accumulation. As
can be seen, by modifying the LUT entry corresponding
to different counter indices, the partial sums are correctly
accumulated.
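The role of the address LUT can be illustrated with a small functional model. This is only a sketch under stated assumptions: the sizes `K`, `C`, and `num_segments` are hypothetical, random row permutations stand in for the optimized row orders, and the loop models the arithmetic result rather than the cycle-level behavior of the systolic array.

```python
import numpy as np

rng = np.random.default_rng(0)

K, C, num_segments = 4, 8, 2            # output channels, input channels, segments
W = rng.integers(-8, 8, size=(K, C))    # 4-bit signed weights
x = rng.integers(0, 256, size=C)        # 8-bit activations

# Split the columns into segments; each segment gets its own row order.
col_groups = np.array_split(np.arange(C), num_segments)
row_orders = [rng.permutation(K) for _ in col_groups]

out_buffer = np.zeros(K, dtype=np.int64)
for cols, order in zip(col_groups, row_orders):
    W_seg = W[np.ix_(order, cols)]      # rows streamed in the reordered sequence
    lut = order                         # lut[counter] -> output buffer address
    for counter in range(K):
        psum = W_seg[counter] @ x[cols]     # partial sum of one output channel
        out_buffer[lut[counter]] += psum    # accumulate at the translated address

# The accumulated result matches the unsegmented, unreordered product.
assert np.array_equal(out_buffer, W @ x)
```

Because each segment carries its own LUT contents, the row order can differ freely between segments without changing the final outputs.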
If we assume the output buffer depth to be D, the LUT
needs at least D entries and each entry needs
log2 D bits. For a reasonable output buffer depth, e.g., 1024,
the LUT needs 1024 entries of 10 bits each, i.e., 1.25 KB of
SRAM. This is less than 2 KB, which is very small
and thus has negligible energy and area overhead.
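The size estimate follows from simple arithmetic, sketched below. The helper name `lut_size_bytes` is hypothetical, and the calculation assumes exactly D entries with the minimum address width.

```python
def lut_size_bytes(depth: int) -> float:
    """Size of a LUT with `depth` entries, each able to address `depth` locations."""
    bits_per_entry = (depth - 1).bit_length()   # equals log2(depth) for powers of two
    return depth * bits_per_entry / 8

print(lut_size_bytes(1024))  # 1280.0 bytes, i.e., 1.25 KB
```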
Compared to the segment-then-reorder algorithm, the
cluster-then-reorder algorithm also changes the order of
the input channels, i.e., the columns of the weight matrices.
While the clustering does not impact the correctness of the
outputs, it may impact the memory fetching of the input
activations. We leverage the output address LUT to swap
the activations to avoid any complication or modification
to the input fetching logic. For example, let’s assume the
required input channel sequence for the current layer to be
{1, 3, 4, 2}. When executing the previous layer, the output
address LUT can simply be set to {1 : 1, 2 : 3, 3 : 4, 4 : 2}
to reorder the sequence of channel generation.
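The claim that reordering the input channels does not affect correctness can be checked numerically. The sketch below only verifies this invariance, assuming the activations are delivered in the same order as the weight columns; it does not model the LUT hardware, and the matrix sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.integers(-8, 8, size=(3, 4))    # 3 output channels, 4 input channels
x = rng.integers(0, 256, size=4)        # one activation per input channel

seq = np.array([1, 3, 4, 2]) - 1        # the sequence {1, 3, 4, 2}, zero-based
W_reordered = W[:, seq]                 # weight columns in the clustered order
x_reordered = x[seq]                    # activations delivered in the same order

# A dot product is a sum over channels, so a shared permutation leaves it unchanged.
assert np.array_equal(W_reordered @ x_reordered, W @ x)
```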
6 EXPERIMENTAL RESULTS
6.1 Experimental Setup
In this section, we report on our experiments to demonstrate
the effectiveness of the proposed hamming distance reduction
techniques. We use MobileNetV2 (Sandler et al., 2018)
and ResNet26 (He et al., 2016) trained on the Cifar10 and
Cifar100 datasets for the evaluation. We select the 1-by-1
convolution layers in MobileNetV2 and the 3-by-3 convolution
layers in ResNet26 (see footnote 2). The layer shapes are
listed in Appendix A. To evaluate the energy consumption, we
run simulations on a post-layout extracted netlist. We designed
an input-stationary systolic array with 8 rows and 8 columns.
Each PE in the array supports the multiplication and
accumulation of 8-bit activations and 4-bit weights. The array
is synthesized, placed, and routed using a commercial
technology library, and the energy consumption is evaluated
at a typical process corner. The leakage energy is ignored
in the evaluation as it is more than two orders of magnitude
smaller than the dynamic energy.
6.2 Post-Training Hamming Distance Optimization
We first compare the effectiveness of different post-training
hamming distance optimization algorithms, including the di-
rect reorder, segment-then-reorder, and cluster-then-reorder
algorithms. We select the 1-by-1 convolution layers from
the MobileNetV2 and the 3-by-3 layers from the ResNet26
for the evaluation. We compare the hamming distance of
different algorithms with the baseline setting without any
optimization. As shown in Figure 10, when the number of
input channels per cluster is 8, the average hamming dis-
tance can be reduced by 1.96× and 1.54× for MobileNetV2
Footnote 2: 3-by-3 depthwise separable convolutions are not considered
as they are usually hard to map on the systolic arrays and they only
consume a very small part of the total energy.
Figure 9. Hamming distance reduction comparison for the segment-then-reorder and the cluster-then-reorder algorithms on MobileNetV2 (x-axis: layer index, 1 to 33; y-axis: average HD reduction).
Table 3. Training-aware hamming distance optimization for MobileNetV2 on the Cifar10 and Cifar100 datasets.
DATASET    λ          TOP-1 ACC (%)   TOP-5 ACC (%)   BEST-LAYER HD REDUCTION   AVERAGE HD REDUCTION   BEST-LAYER ENERGY REDUCTION   AVERAGE ENERGY REDUCTION
CIFAR10    0.0        94.38           99.82           1.0×                      1.0×                   1.0×                          1.0×
CIFAR10    1×10^-4    94.22           99.00           41.6×                     7.55×                  18.6×                         6.63×
CIFAR100   0.0        78.21           94.53           1.0×                      1.0×                   1.0×                          1.0×
CIFAR100   1×10^-5    77.98           94.20           2.31×                     1.24×                  2.17×                         1.26×
CIFAR100   3×10^-5    77.47           94.07           3.19×                     1.50×                  2.86×                         1.47×
CIFAR100   5×10^-5    77.29           94.24           4.54×                     1.76×                  3.88×                         1.67×
CIFAR100   7×10^-5    77.62           94.26           5.95×                     2.00×                  4.92×                         1.86×
and ResNet26, respectively, which translate to 1.62× and
1.49× reduction of the average energy consumption.
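To make the metric concrete, the sketch below counts bit flips between consecutive rows of a 4-bit weight matrix and applies a simple row reordering. It assumes that one row is streamed per cycle and that weights are stored in two's complement. The greedy nearest-neighbour ordering is used here only as a stand-in heuristic and is not the reordering procedure evaluated above; all function names are hypothetical.

```python
import numpy as np

def bit_flips(a: np.ndarray, b: np.ndarray, bits: int = 4) -> int:
    """Number of differing bits between two rows of `bits`-wide integers."""
    mask = (1 << bits) - 1
    diff = (a.astype(np.int64) ^ b.astype(np.int64)) & mask
    return sum(bin(int(v)).count("1") for v in diff.ravel())

def stream_hamming_distance(W: np.ndarray, bits: int = 4) -> int:
    """Total bit flips when the rows of W are streamed one after another."""
    return sum(bit_flips(W[i], W[i + 1], bits) for i in range(len(W) - 1))

def greedy_reorder(W: np.ndarray, bits: int = 4) -> np.ndarray:
    """Greedy nearest-neighbour row order (a heuristic, not an optimal solver)."""
    order = [0]
    remaining = list(range(1, W.shape[0]))
    while remaining:
        nxt = min(remaining, key=lambda r: bit_flips(W[order[-1]], W[r], bits))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

rng = np.random.default_rng(0)
W = rng.integers(-8, 8, size=(16, 8))   # 16 output channels, 8 input channels
order = greedy_reorder(W)
hd_before = stream_hamming_distance(W)
hd_after = stream_hamming_distance(W[order])
print(hd_before, hd_after)
```

The greedy order typically lowers the total, but it gives no optimality guarantee.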
We also compare the segment-then-reorder and the
cluster-then-reorder algorithms in more detail for
MobileNetV2 in Figure 9. The cluster-then-reorder algorithm
usually results in a higher reduction for the even layers,
e.g., layer 2, layer 4, etc. These layers are the second
1-by-1 convolution layers in the inverted residual blocks,
which have a larger number of input channels and a smaller
number of output channels. For these layers, more clusters
can be formed to achieve better results. For the odd layers,
which have a smaller number of input channels and a larger
number of output channels, the two methods perform similarly.
6.3 Training-Aware Hamming Distance Optimization
We now evaluate the effectiveness of the training-aware
hamming distance algorithms. We select MobileNetV2 and
train the network on Cifar10 and Cifar100 datasets. By
controlling the regularization coefficients λ2, we explore
the trade-off between accuracy and hamming distance
reduction. For practical purposes, we constrain the
accuracy degradation to within 1%. As shown in Table 3,
on the Cifar10 dataset with λ = 1×10^-4, the average hamming
distance is reduced by 7.55×, which leads to a 6.63×
reduction of the average energy across layers. On the
Cifar100 dataset with λ = 7×10^-5, the average hamming
distance reduction and the average energy
reduction are 2.00× and 1.86×, respectively.
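The selection of the operating point can be reproduced from the Cifar100 rows of Table 3. The sketch below assumes the 1% constraint applies to the top-1 accuracy drop in percentage points; the variable names are hypothetical.

```python
# (λ, top-1 accuracy in %, average energy reduction) for Cifar100, from Table 3
rows = [
    (0.0,  78.21, 1.00),
    (1e-5, 77.98, 1.26),
    (3e-5, 77.47, 1.47),
    (5e-5, 77.29, 1.67),
    (7e-5, 77.62, 1.86),
]
baseline_acc = rows[0][1]
max_drop = 1.0  # allowed top-1 degradation, in percentage points

# Keep the settings within the accuracy budget, then take the largest energy saving.
feasible = [r for r in rows if baseline_acc - r[1] <= max_drop]
best = max(feasible, key=lambda r: r[2])
print(best)  # (7e-05, 77.62, 1.86)
```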
Figure 10. Comparison of the post-training optimization techniques
(baseline, direct reorder, segment-then-reorder, and cluster-then-reorder)
for HD reduction on (a) MobileNetV2 and (b) ResNet26 (x-axis: channels
per cluster, 8 to 64; y-axis: average HD reduction).
6.4 Combined Hamming Distance Optimization
We now combine the post-training optimization techniques
with the training-aware optimization algorithm. As shown
Table 4. Average hamming distance and energy reduction of the combined methods (CTR is short for the cluster-then-reorder algorithm).
SETTING              BEST-LAYER HD REDUCTION   AVERAGE HD REDUCTION   BEST-LAYER ENERGY REDUCTION   AVERAGE ENERGY REDUCTION
BASELINE             1.0×                      1.0×                   1.0×                          1.0×
λ = 0, CTR           2.44×                     1.96×                  2.27×                         1.84×
λ = 7×10^-5          5.95×                     2.00×                  4.92×                         1.86×
λ = 7×10^-5, CTR     10.2×                     3.79×                  8.51×                         2.85×
in Table 4, the proposed training-aware and post-training
optimization techniques are orthogonal to each other and can
be applied together. By combining them, for MobileNetV2
trained on Cifar100, the average hamming distance of
streaming the weight matrices is reduced by 3.79×
and the average datapath energy is reduced by 2.85×.
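One rough way to read the orthogonality claim is to compare the combined reductions with the product of the individual ones, using the average columns of Table 4. This is only an arithmetic sketch; the multiplicative model is an assumption made here for illustration and is not a claim from the evaluation.

```python
ctr_only   = {"hd": 1.96, "energy": 1.84}   # λ = 0, CTR
train_only = {"hd": 2.00, "energy": 1.86}   # λ = 7e-5
combined   = {"hd": 3.79, "energy": 2.85}   # λ = 7e-5, CTR

for key in ("hd", "energy"):
    product = ctr_only[key] * train_only[key]
    print(f"{key}: product {product:.2f}x, measured {combined[key]:.2f}x")
# hd: product 3.92x, measured 3.79x
# energy: product 3.42x, measured 2.85x
```

Under this assumption, the hamming distance gains compose almost multiplicatively, while the energy gain composes less than multiplicatively.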
7 CONCLUSION
The energy consumption of the arithmetic datapath in a neural
network accelerator depends heavily on the hamming distance
of the input sequence. With the proposed
hamming-distance-aware training and post-training
optimization algorithms, the datapath energy can be
significantly reduced. Evaluation with MobileNetV2 and
ResNet26 shows that the proposed methods reduce the
datapath energy; for MobileNetV2 trained on Cifar100, the
reduction is 2.85× on average and up to 8.51× for certain
layers, which demonstrates significant potential for
energy-critical neural network accelerator designs.
REFERENCES
Andri, R., Cavigelli, L., Rossi, D., and Benini, L. Yodann:
An ultra-low power convolutional neural network accel-
erator based on binary weights. 2016 IEEE Computer
Society Annual Symposium on VLSI (ISVLSI), pp. 236–
241, 2016.
Bengio, Y., Léonard, N., and Courville, A. Estimating or
propagating gradients through stochastic neurons for con-
ditional computation. arXiv preprint arXiv:1308.3432,
2013.
Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and
Temam, O. Diannao: A small-footprint high-throughput
accelerator for ubiquitous machine-learning. In Proceed-
ings of the 19th International Conference on Architectural
Support for Programming Languages and Operating Sys-
tems, ASPLOS ’14, pp. 269–284, New York, NY, USA,
2014. ACM.
Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss:
An energy-efficient reconfigurable accelerator for deep
convolutional neural networks. IEEE Journal of Solid-
State Circuits, 52(1):127–138, 2016.
Chen, Y.-H., Yang, T.-J., Emer, J., and Sze, V. Eyeriss v2: A
flexible accelerator for emerging deep neural networks on
mobile devices. arXiv preprint arXiv:1807.07928, 2018.
Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., and
Bengio, Y. Binarized neural networks: Training deep
neural networks with weights and activations constrained
to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
Du, L., Du, Y., Li, Y., Su, J., Kuan, Y.-C., Liu, C.-C., and
Chang, M.-C. F. A reconfigurable streaming deep convo-
lutional neural network accelerator for internet of things.
IEEE Transactions on Circuits and Systems I: Regular
Papers, 65(1):198–208, 2017.
Gao, M., Pu, J., Yang, X., Horowitz, M., and Kozyrakis, C.
Tetris: Scalable and efficient neural network acceleration
with 3d memory. In Proceedings of the Twenty-Second
International Conference on Architectural Support for
Programming Languages and Operating Systems, ASP-
LOS ’17, 2017. ISBN 978-1-4503-4465-4.
Hamming, R. W. Error detecting and error correcting codes.
The Bell System Technical Journal, 29(2):147–160, 1950.
Han, S., Mao, H., and Dally, W. J. Deep compres-
sion: Compressing deep neural networks with pruning,
trained quantization and huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
Hazelwood, K., Bird, S., Brooks, D., Chintala, S., Diril, U.,
Dzhulgakov, D., Fawzy, M., Jia, B., Jia, Y., Kalro, A.,
et al. Applied machine learning at facebook: A datacenter
infrastructure perspective. In 2018 IEEE International
Symposium on High Performance Computer Architecture
(HPCA), pp. 620–629. IEEE, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition,
pp. 770–778, 2016.
He, Y., Zhang, X., and Sun, J. Channel pruning for acceler-
ating very deep neural networks. In Proceedings of the
IEEE International Conference on Computer Vision, pp.
1389–1397, 2017.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang,
W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets:
Efficient convolutional neural networks for mobile vision
applications. arXiv preprint arXiv:1704.04861, 2017.
Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K.,
Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and <0.5 MB model
size. arXiv preprint arXiv:1602.07360, 2016.
Jouppi, N. P., Young, C., Patil, N., Patterson, D., et al.
In-datacenter performance analysis of a tensor processing
unit. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA ’17, 2017.
Krizhevsky, A., Hinton, G., et al. Learning multiple layers
of features from tiny images. Technical report, Citeseer,
2009.
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature,
521(7553):436, 2015.
Miller, C. E., Tucker, A. W., and Zemlin, R. A. Integer pro-
gramming formulation of traveling salesman problems. J.
ACM, 7(4), October 1960.
Moons, B. and Verhelst, M. A 0.3–2.6 tops/w precision-
scalable processor for real-time large-scale convnets. In
2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits),
2016.
Park, E., Kim, D., and Yoo, S. Energy-efficient neural net-
work accelerator based on outlier-aware low-precision
computation. In Proceedings of the 45th Annual Interna-
tional Symposium on Computer Architecture, ISCA ’18,
2018a.
Park, J., Naumov, M., Basu, P., Deng, S., Kalaiah, A., Khu-
dia, D., Law, J., Malani, P., Malevich, A., Nadathur, S.,
et al. Deep learning inference in facebook data cen-
ters: Characterization, performance optimizations and
hardware implications. arXiv preprint arXiv:1811.09886,
2018b.
Rabaey, J. M., Chandrakasan, A., and Nikolic, B. Digital
Integrated Circuits. Prentice Hall Press, Upper Saddle
River, NJ, USA, 3rd edition, 2008.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. Mobilenetv2: Inverted residuals and linear
bottlenecks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 4510–
4520, 2018.
Sharma, H., Park, J., Suda, N., Lai, L., Chau, B., Chan-
dra, V., and Esmaeilzadeh, H. Bit fusion: Bit-level dy-
namically composable architecture for accelerating deep
neural networks. In Proceedings of the 45th Annual In-
ternational Symposium on Computer Architecture, pp.
764–775. IEEE Press, 2018.
Sze, V., Chen, Y., Yang, T., and Emer, J. S. Efficient pro-
cessing of deep neural networks: A tutorial and survey.
Proceedings of the IEEE, 105(12), 2017.
Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M.,
Howard, A., and Le, Q. V. Mnasnet: Platform-aware
neural architecture search for mobile. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2820–2828, 2019.
Wu, C.-J., Brooks, D., Chen, K., Chen, D., Choudhury, S.,
Dukhan, M., Hazelwood, K., Isaac, E., Jia, Y., Jia, B.,
et al. Machine learning at facebook: Understanding infer-
ence at the edge. In 2019 IEEE International Symposium
on High Performance Computer Architecture (HPCA),
pp. 331–344. IEEE, 2019.
Zhang, Y., Suda, N., Lai, L., and Chandra, V. Hello edge:
Keyword spotting on microcontrollers. arXiv preprint
arXiv:1711.07128, 2017.
APPENDIX A LAYER SHAPES OF MOBILENETV2 AND RESNET26
Table 5. Layer shapes of 1-by-1 convolutions in MobileNetV2 (C: number of input channels, K: number of output channels).
# Layer 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
C 16 96 24 144 24 144 32 192 32 192 32 192 64 384 64 384 64 384 64 384 96 576 96 576 96 576 160 960 160 960 160 960 320
K 96 24 144 24 144 32 192 32 192 32 192 64 384 64 384 64 384 64 384 96 576 96 576 96 576 160 960 160 960 160 960 320 1280
Table 6. Layer shapes of 3-by-3 convolutions in ResNet26 (C: number of input channels, K: number of output channels).
# Layer 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
C 16 16 16 16 16 16 16 16 16 32 16 32 32 32 32 32 32 32 64 32 32 32 32 32 32 32
K 16 16 16 16 16 16 16 16 32 32 16 32 32 32 32 32 32 64 64 64 64 64 64 64 64 64
