SparseNN: An Energy-Efficient Neural Network Accelerator Exploiting
  Input and Output Sparsity by Zhu, Jingyang et al.
arXiv:1711.01263v1 [cs.LG] 3 Nov 2017
SparseNN: An Energy-Efficient Neural Network
Accelerator Exploiting Input and Output Sparsity
Jingyang Zhu1, Jingbo Jiang1, Xizi Chen1 and Chi-Ying Tsui2
Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong
Email: 1{jzhuak, jingbo.jiang, xchenbn}@connect.ust.hk, 2eetsui@ust.hk
Abstract—Contemporary Deep Neural Network (DNN) contains
millions of synaptic connections with tens to hundreds of layers.
The large computation and memory requirements pose a challenge
to the hardware design. In this work, we leverage the intrinsic
activation sparsity of DNN to substantially reduce the execution
cycles and the energy consumption. An end-to-end training algo-
rithm is proposed to develop a lightweight run-time predictor for
the output activation sparsity on the fly. From our experimental
results, the computation overhead of the prediction phase can
be reduced to less than 5% of the original feedforward phase
with negligible accuracy loss. Furthermore, an energy-efficient
hardware architecture, SparseNN, is proposed to exploit both the
input and output sparsity. SparseNN is a scalable architecture with
distributed memories and processing elements connected through
a dedicated on-chip network. Compared with the state-of-the-art
accelerators which only exploit the input sparsity, SparseNN can
achieve a 10%-70% improvement in throughput and a power
reduction of around 50%.
I. INTRODUCTION
Deep Neural Networks (DNNs) are one of the fundamental
machine learning models. In the past decade, DNNs have
attracted great research interest due to their promising results
in various domains, including visual recognition [1], natural
language processing [2], and artificial intelligence [3]. Although
DNNs can outperform many traditional machine learning mod-
els, the large computation and storage requirements pose an
obstacle to the extensive deployment in embedded applications.
For instance, Inception-v4 [4], the latest Google DNN model
for visual recognition, requires 12GMACs and more than 42M
parameters for classifying a single frame. Therefore, considering
the limited resources in the embedded platform, both algorith-
mic and architectural optimizations are required to deliver an
energy-efficient solution for DNNs.
In order to avoid overfitting and to be biologically plausible,
Rectified Linear Unit (ReLU) is extensively used in DNNs,
which leads to a large amount of zeros in the activations of
hidden layers [5]. It is reported that there is around 50% sparsity
in the contemporary DNN models [6]. The zero activations can
be exploited for the design of an energy-efficient implementation
as the multiplications and memory access associated with these
zero activations can be safely bypassed without affecting the
performance. The activation sparsity can be classified into
two categories: the input activation sparsity and the output
activation sparsity. The input activation sparsity refers to the
zero activations within the input feature map, and it is always
known when the computation starts. On the other hand, the
output activation sparsity, indicating the zero activations in
the output feature map, is unknown until the computation
of the current layer is finished. In this work, we propose
an efficient end-to-end training algorithm to form a run-time
predictor that can predict the output activation sparsity before
the actual computation of the current layer. The computation
overhead of making the prediction is less than 5% of the original
feedforward calculation. To efficiently exploit both types of sparsity
and achieve high energy efficiency, a specialized hardware architecture,
SparseNN, is proposed. SparseNN is a scalable Network-on-Chip (NoC)
based architecture with distributed processing
elements and memories. It can effectively take advantage of
both input and output activation sparsity. From the experimental
results, it is shown that the throughput can be improved by
10%∼70% with a power reduction of 50% when these two types
of sparsity are jointly utilized.
II. RELATED WORK
Different optimization techniques have been proposed to
improve the energy-efficiency of the deep learning accelerator.
The DianNao series presents a family of specialized architectures
for deep learning [7], [8]. The customized datapath,
including multiplier arrays, adder trees, and nonlinear units,
shows superior performance over conventional computing
platforms like CPU and GPU. Cnvlutin [6] enhances the computation
scheduling in DianNao by deliberately skipping the zero
input activations. Han et al. [9] proposed the deep compression
algorithm to significantly reduce the memory footprint of DNNs.
A specialized hardware architecture, EIE, was designed to
accelerate the inference phase of the compressed models in
[10]. Davis et al. [11] adopted singular value decomposition
(SVD) as the output sparsity predictor to reduce the com-
putation complexity in the feedforward pass. Based on that,
an architecture, LRADNN, was proposed to utilize the output
sparsity to improve the energy efficiency by bypassing the zero
output activations in [12]. However, none of the previous works
leverage both the input and output activation sparsity at the same
time. In summary, this work brings the following contributions:
• A novel end-to-end training algorithm is proposed to gen-
erate the output sparsity predictor of the neural network.
The scalability and the predicted sparsity are improved
compared with the previous SVD approach.
• A scalable NoC based architecture is developed to take
advantage of both input and output activation sparsity.
• The proposed architecture is implemented in ASIC and
simulations are carried out to verify the improvement
in throughput and energy consumption using three real
benchmarks.
y = f(∑)
x1
x2
x3
xn
w1
w2
w3
wn
hidden layers
output layer
input layer
x1
x2
x3
xn
y1
y2
y3
y4
ym
(a) (b)
Fig. 1. (a) The arithmetic computation associated with one neuron. (b) The
layer-wise structure of the DNNs.
III. PRELIMINARIES
A. Neural Networks
Neural networks are usually represented as a directed acyclic
graph (DAG), where each node refers to a neuron and the
synaptic connection between two neurons is represented by an
edge. The neural networks are usually organized into a stack of
hierarchical layers as shown in Fig. 1. The layer-wise structure
is biology-inspired and is demonstrated to have algorithmic
superiority in real applications. Generally speaking, the first
several layers usually act as the low-level feature extractors
(e.g. edges), and the last several layers are able to represent
the high-level features (e.g. complex contour). The arithmetic
computation associated with a neuron is the weighted sum of
the connected input activations as shown in Fig. 1, and the layer-
wise computation can be compressed into the following concise
matrix-vector function:
$$a^{(l+1)} = f\left(W^{(l)} a^{(l)}\right) \tag{1}$$
where $a^{(l)}$ is the input activation vector, $a^{(l+1)}$ is the output
activation vector to be computed, $W^{(l)}$ is the weight matrix,
and $f$ is a nonlinear function (ReLU is typically used). By
iteratively applying Eq. (1) following the topological order of
the DAG, we can obtain the final prediction results of the neural
network. Such feedforward pass of DNNs is usually known as
the inference phase. On the other hand, the training phase of the
DNNs is often termed as the backpropagation as it is conducted
in the reversed order of the feedforward pass.
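As a minimal sketch of Eq. (1), the layer-wise feedforward pass can be written in plain Python as follows. The helper names (`relu`, `layer_forward`, `network_forward`) and the toy weights are illustrative assumptions and do not come from the paper.

```python
def relu(x):
    """Rectified Linear Unit: max(0, x)."""
    return x if x > 0.0 else 0.0


def layer_forward(W, a):
    """Compute a_next = ReLU(W a) for one layer, as in Eq. (1).

    W is a list of m rows with n weights each; a holds n input activations.
    """
    return [relu(sum(w * x for w, x in zip(row, a))) for row in W]


def network_forward(weights, a):
    """Apply Eq. (1) layer by layer (hypothetical helper)."""
    for W in weights:
        a = layer_forward(W, a)
    return a


# Toy 3-2-1 network with made-up weights.
W1 = [[1.0, -2.0, 0.5],
      [-1.0, 0.0, -1.0]]
W2 = [[2.0, 3.0]]
x = [1.0, 0.25, 1.0]

hidden = layer_forward(W1, x)          # second neuron is clipped to zero by ReLU
output = network_forward([W1, W2], x)
print(hidden, output)
```

The zero produced by ReLU in `hidden` is exactly the kind of activation that the rest of the paper tries to skip.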
B. Sparsity Predictor
Due to the nature and the structure of DNNs, a large
number of zeros exist in the input and output activations [6]. The
input activation sparsity can be easily exploited using the leading
nonzero detector [10] because a(l) is already known in the feed-
forward pass when Eq. (1) needs to be computed. However, the
output sparsity is not known until the calculation of the current
layer is finished. A common technique to exploit the output
sparsity is to add a new prediction phase, with lightweight
computation complexity, to predict whether the output is zero
as shown in Fig. 2 [11]. Before the feedforward computation
using the original weight matrix begins, the activeness of each
neuron in the output layer is predicted using a lightweight output
predictor. Based on the predicted output result, we only execute
the feedforward computations associated with the neurons that
are predicted as nonzero while the operations of the others are
bypassed. Since the prediction phase is a lightweight procedure
and the amount of the bypassed computations is significant, the
overall computation complexity is reduced.
[Figure 2: input activations feed an output predictor that marks each output activation as 0 or 1.]
Fig. 2. The output activation sparsity predictor of DNNs. Only the nonzero
output activations are scheduled for the feedforward computation (solid line).
The predicted zero output activations are bypassed (dotted line).
In [11], the low rank approximation of the weight matrix
is used as the output sparsity predictor of the DNNs. More
specifically, the weight matrix $W^{(l)} \in \mathbb{R}^{m \times n}$ is decomposed
into the product of two low rank matrices $U^{(l)} \in \mathbb{R}^{m \times r}$
and $V^{(l)} \in \mathbb{R}^{r \times n}$. $U^{(l)}$ and $V^{(l)}$ are the first $r$ left-singular
vectors and right-singular vectors of $W^{(l)}$, respectively. The
feedforward pass of DNNs using the SVD output predictor can then
be summarized in the following equations:
$$p^{(l+1)} = \mathrm{sign}\left(U^{(l)} V^{(l)} a^{(l)}\right) \tag{2}$$
$$a^{(l+1)} = p^{(l+1)} \circ f\left(W^{(l)} a^{(l)}\right) \tag{3}$$
where $p^{(l+1)}$ is the sparsity predictor of the output activations
and $\circ$ represents the Hadamard product between the predictor
and the original feedforward pass. The computational complexity
of the truncated SVD scheme is $O(r(m + n))$, which
is smaller than the complexity of the original feedforward
computation (i.e. $O(mn)$) when $r$ is much smaller than $m$
and $n$. Depending on the polarity of $p^{(l+1)}$, only the positive
output neurons are calculated in the subsequent feedforward
pass. However, the truncated SVD scheme is not a good sparsity
predictor. More specifically, it always looks for a solution with
the minimum difference of Frobenius norm [11] but it may not
be an optimal sparsity predictor. For instance, 0.1 and −0.1 are
close to each other in terms of Frobenius norm, but they give
opposite polarity predictions. Moreover, U (l) and V (l) are only
updated once-per-epoch in the training [11]. The static updating
rule limits the flexibility of the backpropagation.
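The complexity argument and the polarity example above can be checked with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the layer size (m = n = 1000, r = 15) is chosen to match the 1000-neuron hidden layers and rank size 15 used later in Section VI.

```python
def feedforward_macs(m, n):
    """Multiply-accumulates for the dense product W a, with W of size m x n."""
    return m * n


def predictor_macs(m, n, r):
    """MACs for U (V a): r*n for V a, then m*r for U times the result."""
    return r * (m + n)


def overhead_ratio(m, n, r):
    """Prediction cost relative to the original feedforward cost."""
    return predictor_macs(m, n, r) / feedforward_macs(m, n)


def predicted_active(z):
    """Polarity test used by the predictor: 1 if positive, else 0."""
    return 1 if z > 0 else 0


ratio = overhead_ratio(1000, 1000, 15)
print(f"prediction overhead: {ratio:.1%}")

# 0.1 and -0.1 are close in squared error but disagree in polarity.
exact, approx = 0.1, -0.1
squared_error = (exact - approx) ** 2
print(squared_error, predicted_active(exact), predicted_active(approx))
```

For this layer size the predictor costs about 3% of the dense product, which is consistent with the "less than 5%" overhead quoted in the introduction.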
IV. SPARSITY PREDICTOR: END-TO-END TRAINING
In order to address the issue of the truncated SVD approach,
we propose a more powerful end-to-end training algorithm to
search for a better solution for the output sparsity predictor. It is
noted that the internal structure of the predictor remains the same
as [11], i.e. it is based on a pair of U (l) and V (l). However,
the way to come up with U (l) and V (l) is different. Instead of
using SVD, they are derived from an end-to-end training phase.
During training we need to backpropagate the gradient of
loss ℓ into not only the original feedforward pass but also the
sparsity predictor pass. The gradients are derived iteratively
from the output layer to the input layer using the chain rule.
Most of the derivative calculation is straightforward except the
passing of the derivative from the predictor to U and V . In
Eq. (2), the derivative of the sign function blocks the output
gradient from propagating back to U and V during the backpropagation,
since the derivative is zero for every input except
0, where it is undefined. Inspired by [13], we adopt a similar approach
using the “straight-through estimator”. The basic idea is to
approximate the sign(x) with the piece-wise linear function
max(−1,min(1, x)), whose derivative is 1 when the input is in
[−1, 1]. The overall end-to-end training algorithm is summarized
in Alg. 1. It has three steps: (1) A feedforward step to calculate
the activations at each layer; (2) A backpropagation step to
calculate the error term at each layer and the gradients with
respect to the parameters; (3) A gradient descent step to update
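the trainable parameters.
Input : Network with trainable parameters $\{U^{(l)}, V^{(l)}, W^{(l)}\}_{l=1}^{L}$
Output: Network with the output sparsity predictor for inference
// Step 1: Feedforward pass
for $l \leftarrow 1$ to $L-1$ do
    $a_{ori}^{(l+1)} = \mathrm{ReLU}(W^{(l)} a^{(l)})$;
    $p^{(l+1)} = \mathrm{sign}(U^{(l)} V^{(l)} a^{(l)})$;
    $a^{(l+1)} = p^{(l+1)} \circ a_{ori}^{(l+1)}$;
end
// Step 2: Backpropagate pass
Compute $\delta^{(L)} = \frac{\partial \ell}{\partial a^{(L)}}$ knowing $a^{(L)}$ and $a^{*}$;
for $l \leftarrow L-1$ to $1$ do
    $\frac{\partial \ell}{\partial p^{(l+1)}} = \delta^{(l+1)} \circ a_{ori}^{(l+1)}$;
    $\frac{\partial \ell}{\partial a_{ori}^{(l+1)}} = \delta^{(l+1)} \circ p^{(l+1)}$;
    $\theta^{(l)} = \frac{\partial \ell}{\partial U^{(l)} V^{(l)} a^{(l)}} = \frac{\partial \ell}{\partial p^{(l+1)}} \mathbb{1}_{|U^{(l)} V^{(l)} a^{(l)}| < 1}$;
    $\gamma^{(l)} = \frac{\partial \ell}{\partial W^{(l)} a^{(l)}} = \frac{\partial \ell}{\partial a_{ori}^{(l+1)}} \mathbb{1}_{W^{(l)} a^{(l)} > 0}$;
    $\delta^{(l)} = \frac{\partial \ell}{\partial a^{(l)}} = (W^{(l)})^{T} \gamma^{(l)}$;
end
// Step 3: Stochastic gradient descent
for $l \leftarrow 1$ to $L-1$ do
    $\frac{\partial \ell}{\partial U^{(l)}} = \theta^{(l)} (V^{(l)} a^{(l)})^{T}$; $\frac{\partial \ell}{\partial V^{(l)}} = (U^{(l)})^{T} \theta^{(l)} (a^{(l)})^{T}$;
    $\frac{\partial \ell}{\partial W^{(l)}} = \gamma^{(l)} (a^{(l)})^{T}$;
    $(U^{(l)}, V^{(l)}, W^{(l)}) \mathrel{-}= \eta \left( \frac{\partial \ell}{\partial U^{(l)}}, \frac{\partial \ell}{\partial V^{(l)}}, \frac{\partial \ell}{\partial W^{(l)}} \right)$;
end
Algorithm 1: The proposed End-to-End training algorithm
for DNNs with the output sparsity predictor.
To regularize the sparsity of the output
activations, we add the ℓ1 norm of the sparsity predictor $p^{(l)}$ to
the original loss function to optimize both error rate and sparsity
level during training. Therefore, the gradients with respect to
$p^{(l+1)}$ in Alg. 1 will be modified to:
$$\frac{\partial \ell}{\partial p^{(l+1)}} = \delta^{(l+1)} \circ a_{ori}^{(l+1)} + \lambda\, \mathrm{sign}(p^{(l+1)}) \tag{4}$$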
where λ is the regularization factor controlling the sparsity of
the predictor.
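The straight-through estimator and the regularized gradient of Eq. (4) can be sketched elementwise as follows. This is an illustration under stated assumptions, not the authors' training code: the function names are hypothetical, sign(0) is taken as −1 here, and the numeric values are made up.

```python
def hard_sign(z):
    """Forward pass of the predictor: polarity of z.

    Assumption: sign(0) is mapped to -1, which the paper does not specify.
    """
    return 1.0 if z > 0.0 else -1.0


def ste_backward(z, grad_p):
    """Straight-through estimator: in the backward pass sign(z) is treated as
    max(-1, min(1, z)), so the gradient passes only where |z| < 1 (the
    indicator used in Alg. 1)."""
    return grad_p if abs(z) < 1.0 else 0.0


def predictor_grad(delta, a_ori, p, lam):
    """Eq. (4), elementwise: dl/dp = delta * a_ori + lam * sign(p)."""
    return [d * a + lam * hard_sign(pi) for d, a, pi in zip(delta, a_ori, p)]


# Made-up values for one layer with three output neurons.
z = [0.3, -2.0, 0.8]                 # pre-activations U V a of the predictor
p = [hard_sign(v) for v in z]        # predictor polarity: [1, -1, 1]
delta = [0.5, 0.1, -0.2]             # dl/da for the layer output
a_ori = [2.0, 0.0, 1.0]              # ReLU(W a)
lam = 0.01                           # regularization factor (lambda)

dl_dp = predictor_grad(delta, a_ori, p, lam)
theta = [ste_backward(v, g) for v, g in zip(z, dl_dp)]
print(dl_dp, theta)
```

The second neuron has |z| ≥ 1, so its gradient is blocked, while the other two neurons pass their gradient through unchanged.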
V. SPARSENN: A SCALABLE HARDWARE ARCHITECTURE
After the output sparsity predictor U (l) and V (l) are obtained
using the proposed end-to-end training algorithm, a specialized
hardware architecture is required to accelerate the inference
phase of the DNNs with both input and output sparsity. Tra-
ditional Single Instruction Multiple Data (SIMD) microarchi-
tecture like [12][14] is not a scalable solution because the
memory bandwidth increases linearly with the SIMD width.
In addition, for each memory access, several consecutive
weights are read out. However, due to the sparsity, not all
weights are used in the computation of outputs. The redundant
memory access reduces the energy efficiency and, at the same
time, leaves some of the processing elements idle, which
lowers the overall throughput. To exploit the input
sparsity, a distributed hardware accelerator, called EIE, targeted
for accelerating DNNs with compressed weights was proposed
in [10]. In this work, we adopt the basic microarchitecture of
EIE and enhance it to exploit both the input and output sparsity.
The proposed architecture, SparseNN, has the following distinct
features apart from EIE: (a) A buffered NoC flow control leads
to a more efficient use of the on-chip routing fabrics; (b) A sparsity
predictor is added and a different computation schedule is
developed for computing the predictor; (c) Additional skipping
blocks are designed for both input and output sparsity.
[Figure 3: (a) weight matrix entries W11 ... Wmn, output activations o1 ... om, and input activations i1 ... in assigned to PE1 ... PE64; (b) tree connecting the PEs through leaf, internal, and root routing nodes.]
Fig. 3. (a) The row-based interleaving of weights and activations in SparseNN.
(b) The hierarchical structure of SparseNN with 64 processing elements and
3-level routing fabrics.
A. Hierarchical Architecture of SparseNN
SparseNN is a scalable distributed hardware architecture
consisting of 64 processing elements (PEs). As shown in Fig. 3,
64 PEs are connected through a dedicated 3-level on-chip H-tree
network, which has routers at the leaf-level, the internal-level,
and the root-level. The computation of the matrix-vector multi-
plication of Eq. (1) is distributed to each PE. More specifically,
the rows $W_{(j,:)}$ of the weight matrix and the input activations $i_j$
with $j \bmod 64 = k$ are stored in the $k$th PE, and the corresponding
output activations $o_j$ are computed by the $k$th PE. Since each PE only
stores a subset of the input activations, the output activations
cannot be computed locally until all the input activations
are received. As a result, an additional broadcasting stage is
required to distribute the local input activations stored in each
PE along with the input indices to all PEs through the on-chip
network. In order to exploit the input sparsity, only the nonzero
activations in the PE will be broadcast. Each PE starts the
local multiplication and accumulation of the input activations
as soon as it receives the nonzero input activations from the on-
chip network. During the inference computation, SparseNN is
first used to calculate the sparsity predictor (i.e. Eq. (2)) and then
the original matrix-vector multiplication in Eq. (3) is computed.
Since the dimensions of the matrices U , V and W are very
different, different schedulings for computing these matrices are
needed and will be discussed in Section V.C.
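The row-based interleaving and the broadcast of nonzero inputs can be mimicked in software as below. This is a functional sketch only: it ignores timing and the NoC, uses 0-based indices, and runs the toy example with 4 PEs instead of 64. The function name and the test matrix are assumptions.

```python
def row_based_matvec(W, a, num_pes=64):
    """Compute W a with rows and inputs interleaved over PEs (j mod num_pes).

    Returns the output vector and the number of multiply-accumulates spent.
    """
    m, n = len(W), len(a)
    # Broadcast stage: PE k scans its local inputs (indices j with
    # j % num_pes == k) and sends only the nonzero ones with their indices.
    broadcast = [(j, a[j])
                 for k in range(num_pes)
                 for j in range(k, n, num_pes)
                 if a[j] != 0]
    out = [0] * m
    macs = 0
    for k in range(num_pes):
        for row in range(k, m, num_pes):      # rows owned by PE k
            for j, a_j in broadcast:          # every PE sees every nonzero input
                out[row] += W[row][j] * a_j
                macs += 1
    return out, macs


# Toy 6 x 5 weight matrix and an input vector with two nonzero entries.
W = [[i + j for j in range(5)] for i in range(6)]
a = [1, 0, 2, 0, 0]
out, macs = row_based_matvec(W, a, num_pes=4)
print(out, macs)
```

With two nonzero inputs out of five, only 12 of the 30 dense multiply-accumulates are performed, while the result matches the dense product.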
B. On-chip Network Design
In EIE, the timing overhead of broadcasting the input acti-
vations to the PEs does not cause degradation in performance.
This is because the dimension of matrix W is usually very large and
multiple rows are mapped to each PE, so whenever a PE
receives an input activation, it takes multiple cycles to
compute the multiplications with the weight of each mapped
row, and the next input activation is only needed many
cycles later. Therefore, there is enough time margin for the next
broadcast input activation to arrive without causing idle cycles.
However, in a general accelerator for DNNs, the weight matrix
is not necessarily square. For example, the V matrix
of the sparsity predictor is usually short and wide (r × n with a small r). Very
few output activations are mapped to each PE if the weight
matrix has a small number of rows, and hence each PE
only takes a few cycles to consume the received input activation.
If the next input activation does not arrive on time, the PE
idles and the overall performance degrades. As a result,
the on-chip network of SparseNN is deliberately designed to
make sure the activation can arrive every clock cycle to keep
the datapath in the PE always busy. Here we adopt a general
buffered flow control of the on-chip network. Four nonzero
input activations are arbitrated at each level of the routing node.
The activation with the smallest index will be granted to the
next level while the others will be stored in the buffer at the
current node, waiting for the arbitration in the next cycle. The
transmission of activations is fully pipelined, and hence each PE
can receive the data every cycle. However, the arriving input
activation is out-of-order, meaning the index of the received
nonzero input activations at each PE may not follow a strictly
increasing order. This is because arbitration is performed locally
at each routing level. The earlier nonzero activations might be
blocked in a leaf node, while some of the activations with a
higher index may enter into a higher level node from another
leaf node. However, the out-of-order input activations do not
affect the computation results, as the accumulation in the matrix-vector
multiplication is commutative and the receiving order is not important. The
buffered flow control needs additional temporary storage in the
routing node but as shown in Section VI, the routing logic takes
less than 1% of the total area of SparseNN and this additional
overhead is negligible.
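The arbitration rule at a single routing node (grant the smallest index, keep the rest buffered) can be sketched as follows. This toy model covers one node only and does not reproduce the multi-level pipeline timing that leads to out-of-order arrival; it merely shows that accumulating the same products in a different order gives the same sum. All names and values are illustrative assumptions.

```python
from collections import deque


def arbitrate(input_buffers):
    """Grant the buffered activation with the smallest index.

    Each buffer is a FIFO of (index, value) pairs; only buffer heads compete.
    Returns the granted pair, or None when every buffer is empty.
    """
    heads = [(buf[0][0], port) for port, buf in enumerate(input_buffers) if buf]
    if not heads:
        return None
    _, winner = min(heads)
    return input_buffers[winner].popleft()


def drain(input_buffers):
    """Run one arbitration per cycle until all buffers are empty."""
    granted = []
    while True:
        item = arbitrate(input_buffers)
        if item is None:
            return granted
        granted.append(item)


# Four input ports of one routing node, holding nonzero activations.
buffers = [deque([(3, 2)]), deque([(1, 4), (9, 1)]), deque(), deque([(6, -3)])]
granted = drain(buffers)
order = [idx for idx, _ in granted]

# Integer weights of one output row, keyed by input index (made-up values).
weights = {1: 2, 3: 5, 6: 1, 9: 7}
in_order_sum = sum(weights[j] * v for j, v in granted)
out_of_order_sum = sum(weights[j] * v for j, v in reversed(granted))
print(order, in_order_sum, out_of_order_sum)
```

Integer values are used here to mirror fixed-point accumulation, where the sum does not depend on the arrival order as long as no saturation occurs.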
C. Computation Schedule for Sparsity Predictor
In the original computation scheduling of EIE, each row
of the weight matrix W is distributed to one of the 64 PEs,
and the corresponding output activations are calculated. We
call this the row-based scheduling. The computation of the
sparsity predictor U and V can also be conducted in a similar
way but then the hardware utilization will not be optimized.
If the row number of the weight matrix is smaller than 64,
not all PEs are mapped with the output activations under
the row-based scheduling. This situation happens for the V
matrix in the sparsity predictor because the rank size r is
typically smaller than 64. In order to address this limitation
of row-based scheduling of the matrix-vector multiplication, we
propose another column-based scheduling as shown in Fig. 4. In
Fig. 4(a), the columns (instead of rows) of V are mapped to
the 64 PEs in an interleaved manner. Each PE calculates the partial sum of the
output activations o on the right hand side. The accumulation
of the partial sums is conducted through the 3-level H-Tree in
Fig. 4(b), and the final results of the output activations are stored
in the root node. The accumulation operation is embedded in
the 4-stage pipelined router shown in Fig. 4(c). The utilization
rate of the V computation is close to 100% even when the rank
size r is as low as 16. The following U computation stage in
the sparsity predictor uses the original row-based scheduling as
the row number is the same as the number of output activations
of W , which is usually much greater than 64.
[Figure 4: (a) matrix V with entries V11 ... Vmn, output activations o1 ... om, and input activations i1 ... in, with columns assigned to PE1 ... PE64; (b) tree of PEs with leaf, internal, and root nodes; (c) router pipeline stages RC, SA, ST, ACC, LT.]
Fig. 4. (a) The column-based interleaving of V, where the partial sum of each
row is calculated in the 64 PEs. (b) The accumulation of partial sums is conducted
through 3-level routing nodes. (c) The modified pipeline stage in NoC router:
Routing Computation (RC), Switch Allocation (SA), Switch Traversal (ST),
ACCumulation (ACC), and Link Traversal (LT).
[Figure 5: block diagram of the PE with ActQueue (act value, act index), predictor p(l) with LNZD, MemAddrComp (input act index, output act index), MemAccess (W MEM, U MEM, V MEM), MAC, ActRegFile (Src Reg, Dst Reg) with LNZD, network interface, and controller.]
Fig. 5. The microarchitecture of the processing element. The major blocks of the
PE include the input activation queue (ActQueue), the leading nonzero detector
(LNZD), the memory address computing unit (MemAddrComp), the memory
(MemAccess), the multiplier-accumulator (MAC), and the activation register file
(ActRegFile).
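A functional sketch of the column-based scheduling is given below: each PE accumulates partial sums for the columns it owns, and the partial sums are then reduced through a tree with fan-in 4 (64 → 16 → 4 → 1), standing in for the 3-level H-tree. Timing is not modeled. The function names, the fan-in parameter, the PE-coverage helper, and the test matrix are assumptions made for illustration.

```python
def column_based_matvec(V, a, num_pes=64, fan_in=4):
    """Compute V a with the columns of V interleaved over the PEs."""
    r, n = len(V), len(a)
    # Each PE keeps one partial sum per output row.
    partial = [[0] * r for _ in range(num_pes)]
    for j in range(n):
        if a[j] == 0:
            continue                      # zero inputs are skipped
        k = j % num_pes                   # PE that owns column j
        for i in range(r):
            partial[k][i] += V[i][j] * a[j]
    # Tree reduction: every routing node adds the partial sums of its children.
    level = partial
    while len(level) > 1:
        level = [[sum(vals) for vals in zip(*level[g:g + fan_in])]
                 for g in range(0, len(level), fan_in)]
    return level[0]


def row_based_pe_coverage(rows, num_pes=64):
    """Fraction of PEs that own at least one row under row-based scheduling."""
    return min(rows, num_pes) / num_pes


# Made-up 3 x 100 integer matrix and input vector.
V = [[(i + j) % 5 - 2 for j in range(100)] for i in range(3)]
a = [j % 3 for j in range(100)]
result = column_based_matvec(V, a)
print(result, row_based_pe_coverage(16))
```

With a rank of 16, row-based scheduling would give work to only a quarter of the 64 PEs, whereas the column-based mapping spreads the columns over all of them.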
D. Architecture of PE with the Output Sparsity Bypass
The architecture of the PE in SparseNN is shown in Fig. 5.
The datapath of the PE consists of 5 pipeline stages: memory
address computation, memory access, multiplication, addition,
and write back. The two physical register files are organized
as a pair of ping-pong buffers, which alternately act as the
source and destination register files from layer to layer. A
complete computation flow of the PE undergoes three matrix-
vector computation phases for V , U , and W , respectively.
1) V computation phase: The local nonzero input activation
aj and its associated index j are scanned from the source
register file which stores all local input activations, and pushed
into the datapath. The column-based scheduling in Fig. 4 is then
used to calculate the partial sum in each row. When the
partial sum of one row is finished, the result will be sent to the
on-chip network for the accumulation. The root node receives
the final accumulated result of V computation and broadcasts it
back to all 64 PEs. The results will be temporarily stored in the
activation queue if the PE has not finished the V computation.
2) U computation phase: With the received V results, the
row-based scheduling (i.e. Fig. 3) of U computation is con-
ducted in each PE. In each clock cycle, the PE only processes
the head of activation queue, and pushes the locally-stored rows
of the U matrix and the results of V computation phase to the
datapath. At the end of the U computation phase, the output
sparsity predictor p(l) is stored in a dedicated 1-bit register bank.
3) W computation phase: The local nonzero input activation
aj and its associated index j are scanned from the source
register file, and broadcast to all the PEs through the H-Tree.
After receiving the nonzero input activation and the index, each
PE then multiplies the received input activation with the weights
of all output activations mapped to the PE that are predicted by
the sparsity predictor to be nonzero. In each cycle, the leading
nonzero detector of the predictor register bank searches the next
nonzero output activation for computation and the intermediate
results are stored in the destination registers.
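The W computation phase of one PE, with the output sparsity bypass, can be sketched as follows. Several points are assumptions rather than details from the paper: the predictor is stored as 1/0 bits (1 for a predicted nonzero output), ReLU is applied after accumulation, and the generator below is only a software stand-in for the leading nonzero detector.

```python
def leading_nonzero_indices(mask_bits):
    """Yield the positions of set bits in increasing order (LNZD stand-in)."""
    for i, bit in enumerate(mask_bits):
        if bit:
            yield i


def w_phase(W_local, predictor_bits, nonzero_inputs):
    """Accumulate W a only for rows whose predictor bit is set.

    W_local: rows of W mapped to this PE.
    predictor_bits: one bit per local row from the U computation phase.
    nonzero_inputs: (index, value) pairs received from the on-chip network.
    Returns the local output activations and the number of MACs performed.
    """
    dst = [0] * len(W_local)              # destination register file
    macs = 0
    for j, a_j in nonzero_inputs:
        for row in leading_nonzero_indices(predictor_bits):
            dst[row] += W_local[row][j] * a_j
            macs += 1
    return [max(0, v) for v in dst], macs


# Three local rows, four inputs; row 1 is predicted to be zero and is bypassed.
W_local = [[1, 0, 0, 2],
           [5, 5, 5, 5],
           [-3, 1, 1, 1]]
predictor_bits = [1, 0, 1]
nonzero_inputs = [(0, 2), (3, 1)]
acts, macs = w_phase(W_local, predictor_bits, nonzero_inputs)
print(acts, macs)
```

Only 4 of the 12 dense multiply-accumulates are executed here: half of the inputs are skipped as zeros and one of the three rows is bypassed by the predictor.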
VI. EXPERIMENTAL RESULTS
A. Experimental Setup
We first compare the performance of the proposed end-to-
end training algorithm with that of the conventional truncated
SVD scheme on MNIST-BASIC dataset (BASIC) along with
two challenging variants [15]. The variants extend the original
hand-written digits with the rotation (ROT) and background
superimposition (BG-RAND). Two different neural network
architectures are explored in this work: the 3-layer (with 1
hidden layer) and the 5-layer (with 3 hidden layers). Each
hidden layer has 1000 neurons.
To evaluate the hardware performance (i.e. area, power, and
latency), we implement SparseNN using Verilog HDL. The
functional simulation of the hardware implementation is verified
against the fixed-point simulation in Matlab. SparseNN is
synthesized using the Synopsys Design Compiler with TSMC
65nm LP library under the worst case PVT. To model the
area, power, and access time of the memory, CACTI 6.5 [16]
is used. We collect the toggling rate from the post-synthesis
simulation on the real benchmarks and use it to estimate the
power consumption of SparseNN using Synopsys PrimeTime.
B. Performance of the End-to-End Training Algorithm
The test error rate (TER) and the predicted output sparsity
ρ(l) of the 3-layer neural network are shown in Fig. 6. Due
to the limitation of space, we only show the results of the 5-layer
neural network with a rank size of 15 in Table I; the results
for the other rank sizes show a similar trend.
TABLE I
TEST ERROR RATE AND PREDICTED OUTPUT SPARSITY ρ OF 5-LAYER
NEURAL NETWORK WITH RANK SIZE 15
Dataset    Algorithm     TER(%)    ρ(1)     ρ(2)     ρ(3)
ROT        NO UV         8.54      N.A.     N.A.     N.A.
ROT        SVD           10.69     90.74    28.12    34.27
ROT        End-to-End    8.8       69.41    64.13    71.07
BASIC      NO UV         2.738     N.A.     N.A.     N.A.
BASIC      SVD           2.728     62.5     38.15    39.38
BASIC      End-to-End    2.718     56.34    65.89    66.7
BG-RAND    NO UV         10.08     N.A.     N.A.     N.A.
BG-RAND    SVD           10.036    51.61    51.49    24.01
BG-RAND    End-to-End    10.03     52.79    48.23    41.44
[Figure 6: six panels, two for each of ROT, BASIC, and BG-RAND. One panel plots TER (%) against rank size r (NO UV, 100, 75, 50, 25, 10, 5) and the other plots output sparsity (%) against rank size r (100, 75, 50, 25, 10, 5), each comparing Truncated SVD with End-to-End.]
Fig. 6. Comparison of the proposed end-to-end training algorithm with the
truncated SVD scheme on the neural network with one hidden layer.
From Fig. 6, we can observe that the proposed end-to-end
training algorithm of the sparsity predictor scales well with
the rank size of the UV predictor. For instance, the TER of
the truncated SVD scheme is around 1% larger than the end-
to-end training algorithm in ROT dataset when a small rank
size is used. The performance difference is mainly because
the UV update is static in the conventional truncated SVD
scheme and cannot be tuned. The low rank approximation of
the weight matrix W is inaccurate when the rank size is small.
In Table I, we compare the TER and the output sparsity at each
hidden layer of the 5-layer neural network trained using different
algorithms. The network trained by the proposed end-to-end
training algorithm preserves a similar (or even better) accuracy
to the SVD approach, but with a higher average sparsity ratio
of the hidden layers. It is mainly because the output sparsity
is considered in the end-to-end training algorithm as we use
the ℓ1 regularization in the cost function (Eq. (4)). A higher
sparsity ratio is preferred for the better energy-efficiency of
SparseNN, because more computation can be skipped. It is noted
that a larger regularization factor λ can result in a larger sparsity
prediction in each layer, but TER might be affected due to the
underfitting.
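The claim about the average sparsity ratio can be checked directly against the values in Table I. The short script below only averages the three per-layer sparsity figures reported there; the variable names are illustrative.

```python
# Predicted output sparsity (rho(1), rho(2), rho(3)) copied from Table I.
table_i = {
    "ROT": {"SVD": [90.74, 28.12, 34.27], "End-to-End": [69.41, 64.13, 71.07]},
    "BASIC": {"SVD": [62.5, 38.15, 39.38], "End-to-End": [56.34, 65.89, 66.7]},
    "BG-RAND": {"SVD": [51.61, 51.49, 24.01], "End-to-End": [52.79, 48.23, 41.44]},
}

# Average sparsity over the three hidden layers, per dataset and algorithm.
averages = {
    dataset: {alg: sum(rho) / len(rho) for alg, rho in algs.items()}
    for dataset, algs in table_i.items()
}

for dataset, avg in averages.items():
    print(f"{dataset}: SVD {avg['SVD']:.2f}%, End-to-End {avg['End-to-End']:.2f}%")
```

On all three datasets the end-to-end predictor yields the higher average, even though the SVD predictor is sparser in individual layers (for example, the first hidden layer on ROT).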
C. Performance of the SparseNN
The design parameters of the microarchitecture of the proposed
architecture, SparseNN, are listed in Table II. The
microarchitecture of SparseNN is inspired by EIE. For instance,
the total on-chip weight memory is 128KB×64 = 8MB, and the
maximum number of activations in each layer is 64×64 = 4K.
The target critical path of SparseNN is set to 2ns because the
access time of the 128KB SRAM is more than 1.7ns.
TABLE II
THE MICROARCHITECTURE PARAMETERS OF 64-PE SPARSENN
Micro-architectural parameters     Value
Quantization scheme                16-bit fixed point
On-chip W/U/V memory per PE        128KB/8KB/8KB
Activation register no. per PE     64
Flow control of NoC router         Packet-buffer with credit
[Figure 7: two bar charts, "Execution Cycles on Three Datasets" and "Power Consumption [mW] on Three Datasets", each showing basic, bg_rand, and rot for the 1st, 2nd, and 3rd hidden layers with uv_off and uv_on.]
Fig. 7. Comparison of execution cycles and power consumption on three
datasets using the 5-layer DNN. The results are organized layer-wise, where
uv_on and uv_off indicate that the output sparsity predictor of SparseNN is enabled
and disabled, respectively.
The area breakdown of SparseNN is listed in Table III. The
routing nodes occupy only a small fraction (less than 1%) of
the total area, and the major area contributors are the PEs. The
main reason is the large on-chip SRAMs for W , U , and V in
each PE, which take around 95% of the overall area.
TABLE III
THE AREA BREAKDOWN OF SPARSENN BY COMPONENT AND BY MODULE
                      Area (µm²)      (%)
Total                 78,443,365      (100%)
Combinational         1,716,373       (2.4%)
Buf/Inv               199,038         (0.2%)
Non-combinational     2,068,996       (2.6%)
Macro (Memory)        74,426,310      (94.8%)
Processing element    1,216,457×64    (99.2%)
Routing logics        590,062         (0.8%)
The results on the execution cycle and the power consumption
of SparseNN on the three benchmarks are shown in Fig. 7.
When UV predictor is not used, SparseNN is the same as the
conventional EIE architecture which only exploits the input
activation sparsity. From Fig. 7, it can be seen that the im-
provement in the number of execution cycles with the output
sparsity varies from layer to layer. For the 1st hidden layer, the
reduction of cycles ranges from 10% to 31%. The inputs to the
1st hidden layer are the same for the UV enabled and the UV
disabled networks, and hence the improvement of throughput
only comes from the output sparsity. The difference of the
throughput improvement at different layers and benchmarks is
due to the difference in predicted output sparsity. In addition, the
number of nonzero output activations predicted by the sparsity
predictor also varies from PE to PE. For the remaining hidden
layers, the reduction of cycles can be as high as 70%. The
predicted output sparsity of the previous layer will increase the
input sparsity of the current layer. Therefore, the throughput
is jointly improved by the input sparsity as well as the output
sparsity. The improvement in power consumption with output
sparsity is almost uniform among all datasets and all hidden
layers: around 50%. The reasons for the power reduction are
twofold: the number of accesses to the large W memory decreases
with the output sparsity, and the access energy of the U and V
memories during the sparsity prediction phase is small.
We also compare SparseNN with existing SIMD hardware platforms
for DNNs in Table IV.

TABLE IV
COMPARISON WITH EXISTING SIMD HARDWARE PLATFORMS FOR DNNS
Platform      LRADNN [12]   DNN-Engine [14]   This work
Technology    65nm          28nm              65nm
Peak Perf.    7.08GOPs      19GOPs            64GOPs
W memory      3.5MB         1MB               8MB
Power (mW)    439-487       63.5              452-705
Area (mm^2)   51            5.76              78

In a SIMD architecture, there is a tradeoff between the
parallelism (i.e., the SIMD window) and the
on-chip bandwidth. More specifically, the working frequency of
LRADNN is lower because the unified memory needs to provide
32 data items in each cycle. On the other hand, the parallelism
level in DNN-Engine is limited to 8 in order to achieve a
frequency as high as 1.2GHz. Ideally, DNN-Engine takes
(785×1000)/8 cycles to finish the 1st hidden layer computation of
the dataset BG-RAND. At 1.2GHz and 63.5mW, the corresponding
energy consumption of DNN-Engine is therefore approximately
5.1µJ. In contrast, the energy consumption of SparseNN for the
1st hidden layer in BG-RAND is around 14µJ. However, as the two
designs are implemented in different technology nodes, we need to
scale the energy consumption accordingly for a fair comparison.
According to the CACTI memory model, the energy consumption per
read access is roughly 11x higher when the technology node is
scaled from 28nm to 65nm and the memory size changes from 1MB to
8MB. Therefore, if we take this scaling into account, SparseNN
has a 4x better energy efficiency than the conventional SIMD
architecture.
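The arithmetic behind this comparison can be retraced with a short script. This is only a sketch built from the figures quoted above and in Table IV; the helper names are hypothetical, and it assumes one multiply-accumulate per SIMD lane per cycle, constant power over the whole run, and that the 11x CACTI read-energy factor can be applied to the total SparseNN energy (i.e., that memory access dominates).

```python
import math

# Figures quoted in the text and in Table IV.
DNN_ENGINE_LANES = 8            # SIMD parallelism of DNN-Engine
DNN_ENGINE_FREQ_HZ = 1.2e9      # 1.2GHz clock
DNN_ENGINE_POWER_W = 63.5e-3    # 63.5mW
LAYER_INPUTS = 785              # 1st hidden layer of BG-RAND
LAYER_OUTPUTS = 1000
SPARSENN_ENERGY_UJ = 14.0       # SparseNN, same layer, measured at 65nm
CACTI_SCALING = 11.0            # read-energy ratio, 28nm/1MB -> 65nm/8MB


def ideal_simd_cycles(n_in: int, n_out: int, lanes: int) -> int:
    """Cycles for a dense layer, assuming one MAC per lane per cycle."""
    return math.ceil(n_in * n_out / lanes)


def energy_uj(cycles: int, freq_hz: float, power_w: float) -> float:
    """Energy in microjoules, assuming constant power over the run."""
    return cycles / freq_hz * power_w * 1e6


cycles = ideal_simd_cycles(LAYER_INPUTS, LAYER_OUTPUTS, DNN_ENGINE_LANES)
dnn_engine_uj = energy_uj(cycles, DNN_ENGINE_FREQ_HZ, DNN_ENGINE_POWER_W)

# Assumption: the whole SparseNN energy scales with the memory read energy.
sparsenn_scaled_uj = SPARSENN_ENERGY_UJ / CACTI_SCALING
ratio = dnn_engine_uj / sparsenn_scaled_uj

print(f"DNN-Engine cycles: {cycles}")
print(f"DNN-Engine energy: {dnn_engine_uj:.2f} uJ")
print(f"SparseNN energy scaled to 28nm/1MB: {sparsenn_scaled_uj:.2f} uJ")
print(f"Energy-efficiency ratio: {ratio:.1f}x")
```

Under these assumptions the script gives 98,125 cycles and roughly 5.2µJ for DNN-Engine, close to the 5.1µJ quoted above, and a ratio of about 4 after scaling.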
VII. CONCLUSION
In this work, we first propose an end-to-end training algorithm
that obtains the U and V matrices for the sparsity predictor
through backpropagation. Its scalability with rank and its
predicted sparsity are better than those of the traditional
truncated SVD scheme. Then, a specialized architecture, SparseNN,
is developed to exploit both the input and output sparsity. Our
evaluations demonstrate that with the output sparsity, the
throughput of SparseNN can be improved by 10% to 70% while the
power consumption is reduced by approximately half. Moreover,
SparseNN shows better scalability and higher energy efficiency
compared with the state-of-the-art SIMD architecture.
REFERENCES
[1] Alex Krizhevsky et al. ImageNet classification with deep convolutional
neural networks. In NIPS, pages 1097–1105, 2012.
[2] Dario Amodei et al. Deep Speech 2: End-to-end speech recognition in
English and Mandarin. In ICML, pages 173–182, 2016.
[3] David Silver et al. Mastering the game of Go with deep neural networks
and tree search. Nature, 529(7587):484–489, 2016.
[4] Christian Szegedy et al. Inception-v4, Inception-ResNet and the impact of
residual connections on learning. In AAAI, pages 4278–4284, 2017.
[5] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. Improving deep
neural networks for LVCSR using rectified linear units and dropout. In
ICASSP, pages 8609–8613. IEEE, 2013.
[6] Jorge Albericio et al. Cnvlutin: ineffectual-neuron-free deep neural
network computing. In ISCA, pages 1–13. IEEE, 2016.
[7] Tianshi Chen et al. DianNao: A small-footprint high-throughput accelerator
for ubiquitous machine-learning. In ACM SIGPLAN Notices, volume 49,
pages 269–284. ACM, 2014.
[8] Yunji Chen et al. DaDianNao: A machine-learning supercomputer. In
MICRO, pages 609–622. IEEE Computer Society, 2014.
[9] Song Han et al. Deep compression: Compressing deep neural networks
with pruning, trained quantization and Huffman coding. arXiv preprint
arXiv:1510.00149, 2015.
[10] Song Han et al. EIE: Efficient inference engine on compressed deep neural
network. In ISCA, pages 243–254. IEEE Press, 2016.
[11] Andrew Davis et al. Low-rank approximations for conditional feedforward
computation in deep neural networks. arXiv preprint arXiv:1312.4461,
2013.
[12] Jingyang Zhu et al. LRADNN: High-throughput and energy-efficient deep
neural network accelerator using low rank approximation. In ASP-DAC,
pages 581–586. IEEE, 2016.
[13] Matthieu Courbariaux et al. Binarized neural networks: Training deep
neural networks with weights and activations constrained to +1 or -1. arXiv
preprint arXiv:1602.02830, 2016.
[14] Paul N Whatmough et al. 14.3 A 28nm SoC with a 1.2GHz 568nJ/prediction
sparse deep-neural-network engine with >0.1 timing error rate tolerance
for IoT applications. In ISSCC, pages 242–243. IEEE, 2017.
[15] Hugo Larochelle et al. An empirical evaluation of deep architectures on
problems with many factors of variation. In ICML, pages 473–480. ACM,
2007.
[16] Naveen Muralimanohar et al. CACTI 6.0: A tool to model large caches. HP
Laboratories, pages 22–31, 2009.
