An Efficient Hardware Accelerator for Structured Sparse Convolutional
  Neural Networks on FPGAs by Zhu, Chaoyang et al.
THIS MANUSCRIPT IS FOR IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS
An Efficient Hardware Accelerator for Structured
Sparse Convolutional Neural Networks on FPGAs
Chaoyang Zhu, Student Member, IEEE, Kejie Huang, Senior Member, IEEE,
Shuyuan Yang, Student Member, IEEE, Ziqi Zhu, Student Member, IEEE,
Hejia Zhang and Haibin Shen
Abstract—Deep Convolutional Neural Networks (CNNs) have achieved
state-of-the-art performance in a wide range of applications.
However, deeper CNN models, which are usually computationally
intensive, are widely required for complex Artificial Intelligence
(AI) tasks. Although recent progress on network compression, such
as pruning, has emerged as a promising direction to mitigate the
computational burden, existing accelerators still cannot fully
exploit sparsity because of the irregularity caused by pruning.
On the other hand, Field-Programmable Gate Arrays (FPGAs) have
been regarded as a promising hardware platform for CNN inference
acceleration. However, most existing FPGA accelerators focus on
dense CNNs and cannot address the irregularity problem. In this
paper, we propose a sparse wise dataflow that skips the cycles of
processing Multiply-and-Accumulates (MACs) with zero weights and
exploits data statistics to minimize energy through zero gating,
which avoids unnecessary computations. The proposed sparse wise
dataflow leads to a low bandwidth requirement and high data
sharing. We then design an FPGA accelerator containing a Vector
Generator Module (VGM) that matches the indexes of sparse weights
with input activations according to the proposed dataflow.
Experimental results demonstrate that our implementation achieves
987 image/s and 48 image/s for AlexNet and VGG-16 on Xilinx
ZCU102, respectively, which provides 1.5× to 6.7× speedup and
2.0× to 6.2× higher energy efficiency over previous CNN FPGA
accelerators.
Index Terms—Deep convolutional neural networks, dataflow,
structured pruning, FPGAs, hardware accelerator.
I. INTRODUCTION
The remarkable performance improvement in various domains [1]–[3]
achieved by Convolutional Neural Networks (CNNs) like AlexNet [4]
and VGG-16 [5] comes at a computational and data cost that
challenges both on-chip storage and off-chip bandwidth in
accelerator architecture design. Even though most of the
operations in CNN training and inference can be converted to
matrix multiplication operations and accelerated with modern
Graphics Processing Units (GPUs), deploying CNNs on GPUs suffers
from high power and area consumption. Customized accelerators
have been regarded as a promising alternative because they can be
tailored more flexibly to performance requirements and energy
constraints [6]–[17].
C. Zhu, S. Yang, Z. Zhu, K. Huang and H. Shen are with the College of
Information Science and Electronic Engineering, Zhejiang University,
Hangzhou, 310058, China (email: {21760249, huangkejie, 21931061,
21960370, shen_hb}@zju.edu.cn).
H. Zhang is with the Department of Computer Science, University of
Southern California, Los Angeles, CA 90089, USA (email: hejiazha@usc.edu).
Recently, CNN pruning has been shown to be an effective way to
reduce the computation and memory requirements of these models
[18], [19]. For example, Han et al. [18] pointed out that pruning
can reduce the amount of data by more than 10× with negligible
accuracy loss. In addition, weight encoding, including
quantization and entropy coding, has been proposed to further
reduce the bitwidth of each weight, e.g., 4 bits per weight for
AlexNet [19]. Unstructured pruning techniques like Deep
Compression [19] suffer from imbalanced load and high
irregularity. Therefore, structured pruning techniques [20]–[23]
were proposed, which are more hardware friendly at the cost of a
slightly lower compression ratio.
However, irregularity caused by sparsity prevents accelerators
from fully leveraging the computation and data reduction.
Existing architectures on FPGAs for dense models are not
efficient for sparse CNN models: because many weights are pruned,
most multiplication operations involve zero operands, leading to
low hardware efficiency [8], [24]–[28]. Sparse architectures on
Field-Programmable Gate Arrays (FPGAs) have been investigated in
recent years [29], [30]. The design in [29] targets the
Fully-Connected (FC) layers, which use matrix-vector
multiplication operations. However, the major operators in CNNs
are convolution operations. Although the spatial convolution can
be converted to matrix-vector multiplications, this leads to a
large memory footprint since the input feature map has to be
copied multiple times when being flattened to a vector. The
design in [30] proposes a dataflow that exploits element-matrix
multiplication as the key operation. However, this design has low
computation efficiency due to the imbalanced load of each
Processing Engine (PE). Since this accelerator requires a large
number of Look-Up Tables (LUTs) to buffer the input activations
for nonzero weights, the performance is bounded by the number of
LUTs on the FPGA, which leads to inefficient resource utilization.
To design an efficient FPGA accelerator for sparse CNN models,
the following challenges have to be tackled:
First, each output activation connects to several input
activations through the sliding window (kernel) in dense
Convolutional (CONV) layers. The connection becomes irregular
after pruning. Meanwhile, the sparse weights are encoded in a
compressed format, which requires extra coordinate computation to
reconstruct the connection or to locate the output. It is
therefore challenging to design a dataflow that addresses the
irregularity while efficiently leveraging the data and
computation reduction and maintaining the high parallelism of the
FPGA.
arXiv:2001.01955v1 [eess.SY] 7 Jan 2020
[Figure 1: kernels Kernel 1, Kernel 2, ..., Kernel B form Group 1; the remaining kernels up to Kernel F form further groups.]
Fig. 1. Illustration of shape-wise structured pruning. The F kernels are split
into multiple groups, kernels in the same group are pruned on the same
locations.
Second, FPGAs can only provide limited on-chip memory
and off-chip bandwidth. Although sparse CNNs have signif-
icant data reduction, it is difficult to save all the weights
into on-chip memory for complex CNNs like VGG-16. In
addition, different CNN models have different sizes, which
results in high variability in the number of operations. A rigid
accelerator architecture for CNNs may not fully utilize the
FPGA’s limited resources for every CNN model.
To address both challenges, we focus only on structured pruning
to reduce the irregularity of sparse weights. On this basis, a
sparse wise dataflow is proposed to address the remaining
irregularity of sparse weights. With the proposed dataflow, we do
not need extra coordinate computation to reconstruct the
connection or to locate the output. Furthermore, we minimize
energy through zero gating, which avoids unnecessary computations
when the input activations equal zero.
In summary, we make the following contributions: 1) we propose a
sparse wise dataflow to skip the cycles of processing
Multiply-and-Accumulates (MACs) that have zero weights and
minimize energy through zero gating to further avoid unnecessary
computations. 2) We propose a Vector Generator Module to reuse
and generate the necessary input activations for sparse CNNs.
3) We co-design the accelerator architecture and the loop tiling
to minimize off-chip memory accesses and maximize performance by
slicing the input feature maps to best match the capacity of the
Block Random Access Memories (BRAMs) on the FPGA.
Experiments demonstrate that the proposed accelerator achieves
987 image/s and 48 image/s for AlexNet and VGG-16 on Xilinx
ZCU102, respectively, which provides 1.5× to 6.7× speedup and
2.0× to 6.2× higher energy efficiency over previous CNN FPGA
accelerators.
II. BACKGROUND
CNNs consist of multiple types of layers, including con-
volutional layers, pooling layers and fully-connected layers.
Through these layers, inputs are processed and propagated,
thus to be classified or recognized. The convolution operation
uses an R×R window to slide through the input feature map to
extract features. At every location, the input activations inside
the window are multiplied by corresponding weights and the
products are accumulated to compute the partial sum of an
output activation. Note that the partial sums in different input
channels are accumulated to compute the output activation.
TABLE I
SPARSITY AND ACCURACY COMPARISON
            Deep Compression             Coarse-grained pruning
Model       Sparsity (%)   Top1-E (%)    Sparsity (%)   Top1-E (%)
AlexNet     11.15          42.78         11.03          42.72
VGG-16      7.61           31.17         8.07           31.33
[Figure 2: C input feature maps of size H × W are convolved with F kernels of size R × R to produce F output feature maps of size U × V.]
Fig. 2. The logical graph of convolutional operation. Convolutional layers
conduct 2D convolutions on a set of input feature maps and add the results
to get output feature maps.
A. Network Pruning
CNNs have achieved remarkable success in various applications
[1]–[3], [31], [32] at the cost of a huge amount of computation.
Weight pruning has been shown to be an effective way to reduce
the computation and memory burden of these models without
significant accuracy loss [18]–[23]. Compared to unstructured
pruning techniques, structured pruning techniques are less
irregular and more hardware friendly at the cost of a slightly
lower compression rate. With shape-wise structured pruning (cf.
Fig. 1), kernels in the same group are encoded in the same
compressed format. Sharing a common compressed format reduces the
irregularity of sparse weights. In addition, the load of each
parallel processing engine is balanced. Table I shows the
sparsity and accuracy comparison of Deep Compression [19] and
Coarse-grained pruning [22]. The difference in sparsity between
Deep Compression and Coarse-grained pruning is almost negligible.
Although structured pruning techniques reduce the irregularity of
sparse weights, previous FPGA CNN accelerators still cannot
exploit weight sparsity efficiently. To address the remaining
irregularity, we propose a sparse wise dataflow and a set of
architecture optimization techniques.
B. Loop Operation
In a typical CNN, convolutional layers take up about 90% of the
computation in the inference procedure. A convolutional layer can
be characterized by six parameters: F, the number of output
feature maps (output channels); C, the number of input feature
maps (input channels); U, the height of the output feature map;
V, the width of the output feature map; R, the kernel size; and
S, the stride size, as shown in Fig. 2. The computation can
Algorithm 1: Pseudo code of convolutional layer
for f = 0; f < F; f++ do
  for c = 0; c < C; c++ do
    for u = 0; u < U; u++ do
      for v = 0; v < V; v++ do
        for kh = 0; kh < R; kh++ do
          for kw = 0; kw < R; kw++ do
            Y^{f}_{u,v} += W^{f,c}_{kh,kw} × X^{c}_{u+kh,v+kw};
          end
        end
      end
    end
  end
end
be described in a deep nested loop as illustrated in Algorithm
1. Feature map related loops f and c index the output and
input channels, respectively. Activation related loops u and v
index each activation of feature maps. Finally, weight related
loops kh and kw index weights of each kernel. Obviously,
convolution operation exhibits high parallelism in channel,
activation and weight levels.
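The loop nest of Algorithm 1 can be written as a small runnable reference. The following is only an illustrative sketch, not the authors' implementation: the function name `conv_layer`, the nested-list layout, the absence of padding, and the explicit `stride` parameter (Algorithm 1 itself is written for a stride of 1) are all assumptions made here.

```python
def conv_layer(X, W, stride=1):
    """Direct convolution following the six nested loops of Algorithm 1.

    X: input feature maps, indexed X[c][h][w]
    W: kernels, indexed W[f][c][kh][kw] (square R x R kernels)
    Returns Y indexed Y[f][u][v]. No padding is applied (assumption).
    """
    F, C, R = len(W), len(X), len(W[0][0])
    H, width = len(X[0]), len(X[0][0])
    U = (H - R) // stride + 1      # output height
    V = (width - R) // stride + 1  # output width
    Y = [[[0.0] * V for _ in range(U)] for _ in range(F)]
    for f in range(F):                      # output channels
        for c in range(C):                  # input channels
            for u in range(U):              # output rows
                for v in range(V):          # output columns
                    for kh in range(R):     # kernel rows
                        for kw in range(R):  # kernel columns
                            Y[f][u][v] += (W[f][c][kh][kw]
                                           * X[c][u * stride + kh][v * stride + kw])
    return Y


# One 3x3 input channel and one 2x2 kernel that adds the two diagonal taps.
X = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
W = [[[[1, 0], [0, 1]]]]
Y = conv_layer(X, W)   # [[[6.0, 8.0], [12.0, 14.0]]]
```

Half of the multiplications in this tiny example involve a zero weight, which is the waste the sparse wise dataflow is designed to skip.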
To achieve high performance, the above deep nested loop is
unrolled and mapped to parallel hardware. This involves loop
tiling and loop interchange strategies [33]. Loop tiling keeps all
the input data of a loop tile stored on chip to reduce external
memory access. External memory access happens when the
operation of a new tile begins. Besides, a part of data may be
temporally reused across the adjacent tiles. Loop interchange
strategy decides the order of the loop tiles. Both loop tiling
and loop interchange strategy decide the dataflow of hardware.
III. SPARSE WISE DATAFLOW
Dataflows for dense CNNs on FPGAs have been proposed [24], [34],
[35]. However, these dataflows cannot leverage the benefits of
sparse CNNs since most multiplication operations involve zero
weights that do not contribute to the corresponding output
feature map. Therefore, it is essential to skip the cycles of
processing MACs that have zero weights.
Recently, designing dataflows for sparse CNNs on ASIC platforms
has attracted attention from the research community. However,
these dataflows are not efficient on FPGA platforms due to the
architectural differences between ASICs and FPGAs. For example,
the SCNN architecture [36] applies an input-stationary dataflow
where the inner computation is a Cartesian product. In the SCNN
architecture, there are N PEs and each PE contains an I × F
multiplier array. Consequently, it simultaneously requires N × I
input activations and N × F weights for computation. Although
this dataflow temporally reuses input activations, it still needs
to update N × F weights in each cycle. Furthermore, this method
first requires significant coordinate computation to locate
output activations. Then, because it uses a Cartesian product,
this dataflow returns multiple partial sums, which need to be
arbitrated before being saved. Cambricon-S [22] applies an
output-stationary dataflow where the inner computation is a
vector dot product. Cambricon-S implements a centralized indexing
module to select input activations and only transfers selected
activations and indexes to PEs without extra coordinate
computation. However, this dataflow only performs parallel computation
[Figure 3: cycle-by-cycle execution on PE0 to PE3. Adjacent PEs receive overlapping windows of input activations (e.g., X0,0 X0,1 X0,2 and X0,1 X0,2 X0,3); kernel rows such as [0, W0,1, W0,2] and [0, W0,4, 0] are applied, and the cycles that correspond to zero weights are marked "Skip".]
Fig. 3. Sparse wise dataflow for CONV layers. When a weight equals zero,
the cycle of processing its MACs is skipped by controlling the upper bound
of the corresponding loop illustrated in Algorithm 2.
in channel dimensions by spatially sharing input activations,
thus requires M input activations and N × M weights for
computation, which will lead to poor parallelism on FPGAs.
We propose a sparse wise dataflow to accelerate CNNs on FPGA (cf.
Alg. 2). For a convolutional layer with input feature maps
Ifmap(C, H, W) and kernels Kernel(F, C, R, R), we select and
fetch a vector of Toc nonzero weights from Toc adjacent kernels
at each access. Similarly, we also fetch a vector of Tom
activations which are in the same row of the input feature map.
Meanwhile, we compute Toc element-vector multiplications on the
PE array. After shape-wise structured pruning, the coordinates of
nonzero weights in different kernels of a group are identical, so
each element-vector multiplication involves one weight and a
shared vector of Tom activations. For example, in step 1, the
weight vector is $[W^{0,0}_{0,0}, \cdots, W^{T_{oc},0}_{0,0}]$
and the input vector is $[X^{0}_{0,0}, \cdots, X^{0}_{0,T_{om}}]$.
In step 2, we fetch the weight vector
$[W^{0,0}_{0,1}, \cdots, W^{T_{oc},0}_{0,1}]$ and the input vector
$[X^{0}_{0,jump}, \cdots, X^{0}_{0,T_{om}+jump}]$ for the next
computation and compute Toc element-vector multiplications in
pipeline mode. The variable jump equals the number of zeros
between adjacent nonzero weights in the same row of a kernel.
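To make the element-vector step concrete, the sketch below processes one kernel row for a group of kernels that share the same nonzero positions, and compares the result with a dense reference. It is a simplified software model under stated assumptions (stride 1, a single input row, hypothetical function names `dense_row_conv` and `sparse_row_conv`), not the hardware dataflow itself. The start column `kw + jump` follows the indexing used in Algorithm 2, where jump accumulates the step indexes of the row.

```python
def dense_row_conv(x_row, kernel_row, M):
    """Reference: M adjacent outputs of a 1-D convolution with stride 1."""
    return [sum(w * x_row[m + k] for k, w in enumerate(kernel_row))
            for m in range(M)]


def sparse_row_conv(x_row, group_weights, steps, M):
    """One iteration per nonzero weight position; zero weights cost nothing.

    group_weights[n][k]: k-th nonzero weight of kernel n in this row
    steps[k]: number of zeros skipped before nonzero position k
              (shared by all kernels of the group)
    Returns one list of M partial sums per kernel.
    """
    psums = [[0.0] * M for _ in group_weights]
    jump = 0                          # zeros skipped so far in this row
    for kw, step in enumerate(steps):
        jump += step
        start = kw + jump             # column of this weight in the dense row
        vec = x_row[start:start + M]  # activation vector shared by all kernels
        for n, weights in enumerate(group_weights):
            w = weights[kw]
            for m in range(M):        # element-vector multiply-accumulate
                psums[n][m] += w * vec[m]
    return psums


x_row = [1, 2, 3, 4, 5, 6]
# Two kernels pruned at the same location: rows [0, 2, 3] and [0, 1, -1].
group = [[2.0, 3.0], [1.0, -1.0]]
steps = [1, 0]
out = sparse_row_conv(x_row, group, steps, M=4)
# out[0] == dense_row_conv(x_row, [0, 2.0, 3.0], 4) == [13.0, 18.0, 23.0, 28.0]
```

The sparse version runs two iterations per row instead of three, and the same activation vector `vec` is reused by every kernel in the group.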
Convolution windows that produce adjacent outputs share
part of input activations. Therefore, the two successive fetched
input vectors from the same row of input feature map involve
partially shared data. We add a dedicated Vector Generator
Module (introduced in the next section) that updates the
input register to reuse data. We describe the convolution
operation execution pattern in Fig. 3. Using the pattern in Fig.
3, the four PEs compute four adjacent output activations and
skip the cycles of processing MACs that have zero weights.
For a fully-connected layer with input vector I_vector(C, 1) and
weight matrix W(F, C), the weight buffer delivers a vector of Tom
nonzero weights from adjacent rows of matrix W at each access
(see Fig. 4). The input register only delivers a single
activation.
This dataflow requires a group of buffers to save the partial
sums produced by convolution operations. Consider a convolutional
layer with output feature map Ofmap(F, U, V) mapped onto an
N × M PE array, with Toc = N and Tom = M. To prevent partial sums
from being saved to and restored from DRAM, the depth of the
partial sum (Psum) buffer should
Algorithm 2: Pseudo code of our dataflow
Input:
  The nonzero weight array W^{oc,ic}_{kh,kw};
  The index array Index[ ];
  The row pointer of index R_pointer[ ];
  The number of nonzero weights in each input channel Offset[ ];
  The input activation X^{ic}_{ih,iw};
  The kernel size R × R × C × F;
  The output feature map size U × V × F;
Output:
  The output activation Y^{oc}_{oh,ow};
Initialize the parameters kw = 0, kh = 0, i = 0 and jump = 0;
for oh = 0; oh < U; oh = oh + Ut do
  for oc = 0; oc < F; oc = oc + N do
    for ic = 0; ic < C; ic++ do
      if ic > 0 then
        i = i + Offset[ic − 1];
      end
      for Toh = oh; Toh < oh + Ut; Toh++ do
        for ow = 0; ow < V; ow = ow + M do
          if kh = R && ow > 0 then
            i = i − Offset[ic];
          end
          for kh = 0; kh < R; kh++ do
            kh_count = ic × R + kh;
            kw_max = R_pointer[kh_count];
            if kh > 0 then
              i = i + R_pointer[kh_count − 1];
            end
            jump = 0;
            for kw = 0; kw < kw_max; kw++ do
              jump = jump + Index[i + kw];
              // Loop unroll
              for Toc = oc; Toc < oc + N; Toc++ do
                for Tow = ow; Tow < ow + M; Tow++ do
                  Y^{Toc}_{Toh,Tow} += W^{Toc,ic}_{kh,kw} × X^{ic}_{Toh+kh,Tow+kw+jump};
                end
              end
            end
          end
        end
      end
    end
  end
end
[Figure 4: cycle-by-cycle execution of an FC layer on PE0 to PE3. A single input activation (X0,1, X1,1, ..., X9,1) is delivered per cycle; PEn multiplies it by weights from row n (e.g., [0, Wn,1, Wn,2], ..., Wn,5, Wn,7), and the cycles that correspond to zero weights are marked "Skip".]
Fig. 4. Execution pattern of FC layers. Similar to CONV layers, the cycles
of processing MACs are skipped when the weights equal zero.
[Figure 5: two mappings of output activations onto the PEs of one PU, panels (a) and (b).]
Fig. 5. The two situations of network mapping. (a) When V ≥ M, one
Processing Unit (PU) computes M partial sums of M output activations which
are in the same row; (b) when V < M, one PU computes M partial sums of M
output activations which are in different rows.
be at least ⌈U × V/M⌉. However, some CNNs like VGG-16
involve a quite large product U × V in the first two layers.
Therefore, we partition the input feature map into ⌈H/Ht⌉
tiles, where
$$H_t = (U_t - 1) \times \mathit{Stride} + R = \left(\left\lfloor \frac{M \times \mathit{Slice}_{BRAM}}{V} \right\rfloor - 1\right) \times \mathit{Stride} + R \qquad (1)$$
The parameter SliceBRAM represents the capacity of one
block of BRAM on FPGA. The adjacent tiles share (R −
Stride) rows because the sliding-window nature of the con-
volution operation introduces data dependency at tile edges.
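As a worked example of Eq. (1), the sketch below evaluates the tile sizes for a VGG-16-like first layer. The helper name `tile_rows` is hypothetical, and treating Slice_BRAM as the number of partial-sum entries one Psum buffer can hold is an interpretation made here; the value 512 corresponds to a 2 KB buffer of 32-bit entries and is used only as an illustrative assumption.

```python
import math


def tile_rows(H, V, R, stride, M, slice_bram):
    """Evaluate Eq. (1).

    Returns (U_t, H_t, n_tiles): output rows per tile, input rows per tile,
    and the number of tiles ceil(H / H_t) stated in the text.
    slice_bram is taken as the Psum-buffer depth in entries (assumption).
    """
    U_t = (M * slice_bram) // V       # floor(M * Slice_BRAM / V)
    H_t = (U_t - 1) * stride + R
    n_tiles = math.ceil(H / H_t)
    return U_t, H_t, n_tiles


# 224 x 224 feature map, 3 x 3 kernel, stride 1, M = 28 PEs per PU.
U_t, H_t, n_tiles = tile_rows(H=224, V=224, R=3, stride=1, M=28, slice_bram=512)
# U_t = 64, H_t = 66, n_tiles = 4

# Psum entries needed per PE for one tile: ceil(U_t * V / M) = 512,
# which does not exceed the assumed buffer depth.
depth_needed = math.ceil(U_t * 224 / 28)
```

For layers where floor(M × Slice_BRAM / V) already exceeds U, a single tile suffices; this sketch does not clamp U_t in that case.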
When the width of the output feature map V < M, we map a
⌊M/V⌋ × V tile to one PU for efficiency because each PE
computes one output activation. Otherwise, we map a 1 × M
tile to one PU, as shown in Fig. 5. The PE array consists
of N PUs and each PU consists of M PEs in the same row.
The input activations are shared across PUs. Different PUs
compute output activations from different output
channels.
By spatially sharing both input activations and weights and
temporally reusing part of the input activations, we reduce
the bandwidth requirement to less than N × B + M × A.
Therefore, the bandwidth requirement of the accelerator is
much smaller than that of Cambricon-S (M × A + N × M × B)
and SCNN (N × F × B), where A and B represent the widths
of an activation and a weight, respectively.
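The three expressions can be compared numerically. The sketch below simply plugs values into the formulas quoted above; the function name `bandwidth_bits` is hypothetical, and the chosen values (N = 48, M = 28, 16-bit activations and weights, and F = 4 for the SCNN multiplier-array dimension) are illustrative assumptions rather than figures reported by the authors.

```python
def bandwidth_bits(N, M, A, B, F):
    """Per-cycle bandwidth expressions quoted in the text, in bits."""
    return {
        "proposed_upper_bound": N * B + M * A,
        "cambricon_s": M * A + N * M * B,
        "scnn": N * F * B,
    }


bw = bandwidth_bits(N=48, M=28, A=16, B=16, F=4)
# proposed_upper_bound = 1216, cambricon_s = 21952, scnn = 3072
```

With these assumed values, the N × M × B term dominates the Cambricon-S requirement because every PE receives its own weight, while the proposed dataflow broadcasts one weight per PU.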
IV. ARCHITECTURE OF SPARSE WISE ACCELERATOR
In this section, we introduce the detailed architecture of our
accelerator, to address the remaining irregularity of shape-wise
pruned CNNs.
Overview. Fig. 6 depicts the overall block diagram of our
proposed accelerator. Following the proposed dataflow, we de-
sign a Vector Generator Module (VGM) to address the sparsity
with shared indexes. We design a PU that has multiple PEs
to compute adjacent output activations in parallel. Multiple
PUs constitute an array to compute multiple output activations
[Figure 6: block diagram with DRAM, DMA, Main Controller, ABin, ABout, WIB, VGM, weight path, and the PU array PU0, PU1, PU2, ..., PUN.]
Fig. 6. Accelerator Architecture. The DRAM is implemented with Double
Data Rate Random Access Memory in Processing System on Xilinx FPGA.
[Figure 7: compression example.
  Tile:             0    0.2  0    1.1  0    2.5  0    0    1.7
                    0    5.1  0.8  0    0    0    0    0    2.7
                    0.3  0    1.3  4.1  2    0    0    1    0
  Non-zero weight:  0.2 1.1 2.5 1.7 5.1 0.8 2.7 0.3 1.3 4.1 2 1
  Index:            1 0 1 2 1 0 2 0 1 0 0 1
  R_pointer:        1 2 1 2 0 1 2 2 1
  Offset:           4 3 5]
Fig. 7. Index representation of weights in CONV layers. The Index represents
the number of pruned weights between two nonzero weights. The R_pointer
represents the number of remaining weights in each row. The Offset shows
the remaining weights in each channel of one kernel.
across output channel in parallel. The storage module consists
of two activation buffers (ABin and ABout), a Weight Index
Buffer (WIB), N Weight Buffers (WBs), and N ×M Partial
Sum Buffers (PSBs). The Main Controller decodes instructions
and weight indexes into detailed control signals for all other
modules.
A. Addressing with sparsity
The accelerator is designed to exploit structured sparsity
for performance gain and energy reduction. In our accelerator,
sparsity is processed by VGM and PU together. The VGM
receives input activations from ABin and decodes weight
indexes from WIB, then produces the selected activations
that are broadcast to all the PUs. Meanwhile, these input
activations will be cached for next selection in VGM since
there is overlapping when the kernel slides across the input
feature map. Each PU receives the read address of the weight from
the Main Controller and reads out the needed weight following
the proposed dataflow, thus avoiding
unnecessary computations.
Index. Before elaborating on the VGM and PU, we
explain how we store and index the sparse weights. We store
the sparse weights that result from shape-wise pruning using
Compressed Sparse Row (CSR) format, which only requires
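[Figure 8: FC-layer compression example. The nonzero weights of OCt adjacent output channels share one index along the input channel (IC) dimension:
  Non-zero weight:  1.5 0.7 0 3.1 2.5 4.6 0.4
                    0.5 1.7 0 4.2 7.2 2.1 1.3
                    ...
                    2.5 2.2 0 1.1 4.3 5.1 2
  Index:            0 10 15 4 13 2 4
  Offset:           7]
Fig. 8. Index representation of weights in FC layers. A filler zero is padded
to prevent overflow.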
2a + R × C + C numbers, where a is the number of nonzero
weights, R is the number of rows and C is the number of input
channels. Owing to shape-wise pruning, the kernels in a group
share a common index. To compress further, we store the step
index instead of the absolute position. We encode both the step
index and R_pointer in 4 bits, and encode Offset in 16 bits, as
illustrated in Fig. 7.
The 4-bit index is large enough for convolutional layers;
however, the situation is different in fully-connected layers.
When we need an index larger than this bound, we pad a filler
zero to prevent overflow. In the example of Fig. 8, when the step
index exceeds the largest 4-bit unsigned number, we pad a filler
zero.
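The encoding of Fig. 7 can be reproduced with a few lines of code. The sketch below is an illustration rather than the authors' tool: the function name `compress_kernel` is hypothetical, and it assumes that each row of the tile in Fig. 7 is one input channel of a 3 × 3 kernel flattened row by row, a reading that is consistent with the R_pointer and Offset values shown in the figure. Filler zeros for FC layers are not modeled.

```python
def compress_kernel(kernel):
    """Encode kernel[c][kh][kw] as (nonzero, index, r_pointer, offset).

    index[k]     : zeros skipped before nonzero weight k within its row
    r_pointer[j] : number of nonzero weights in kernel row j
    offset[c]    : number of nonzero weights in input channel c
    """
    nonzero, index, r_pointer, offset = [], [], [], []
    for channel in kernel:
        count_in_channel = 0
        for row in channel:
            zeros_run, count_in_row = 0, 0
            for w in row:
                if w == 0:
                    zeros_run += 1
                else:
                    nonzero.append(w)
                    index.append(zeros_run)  # step index, not absolute position
                    zeros_run = 0
                    count_in_row += 1
            r_pointer.append(count_in_row)
            count_in_channel += count_in_row
        offset.append(count_in_channel)
    return nonzero, index, r_pointer, offset


# The tile of Fig. 7, read as three 3x3 input channels (assumption).
tile = [
    [[0, 0.2, 0], [1.1, 0, 2.5], [0, 0, 1.7]],
    [[0, 5.1, 0.8], [0, 0, 0], [0, 0, 2.7]],
    [[0.3, 0, 1.3], [4.1, 2, 0], [0, 1, 0]],
]
nonzero, index, r_pointer, offset = compress_kernel(tile)
# index     -> [1, 0, 1, 2, 1, 0, 2, 0, 1, 0, 0, 1]
# r_pointer -> [1, 2, 1, 2, 0, 1, 2, 2, 1]
# offset    -> [4, 3, 5]
```

Because all kernels of a group are pruned at the same locations, `index`, `r_pointer` and `offset` need to be stored only once per group, while `nonzero` is stored per kernel.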
VGM. The VGM processes the sparsity by selecting the needed
input activations and transferring them to all the PUs (see
Fig. 9). We design a central VGM shared by multiple PUs to
process the shared indexes from structured sparsity more
efficiently. For example, in the first cycle, when the index is
"1", the activations in register REG0 are shifted left by A bits
(the leftmost A bits are shifted out, where A represents the
width of an activation), and the data
(REG0[0], REG0[Stride], ..., REG0[(M − 1) × Stride]) are
broadcast to all the PUs. In cycle 1, when the index is "0", the
activations in REG0 are shifted by a further A bits. The PUs
perform MACs with the previously selected input activations and
cache the activations selected by the VGM in this cycle. In
cycle 2, the kernel row number kh (see Algorithm 2) increases.
Therefore, REG0 reloads activations from REG1 and shifts the data
according to the new weight index. The depths of both REG0 and
REG1 are set to (M − 1) × Stride + R. To overlap activation
selection with memory access, the activations in REG1 are updated
after REG0 has reloaded activations from REG1 for the last time.
As PUs share the same indexes of weights due to shape-wise
pruning, the module for selecting activations (VGM) is shared
by all the PUs, thus reducing the indexing module overhead
and bandwidth requirement between VGM and PUs.
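The selection performed by the VGM can be modeled in software as follows. This is a behavioral sketch under assumptions, not the RTL: the function name `vgm_select` is hypothetical, and the shift register is modeled by a running position `pos` instead of physically shifting REG0. Following the description above, the first nonzero weight of a row shifts by its step index, and each later one shifts by its step index plus one activation.

```python
def vgm_select(reg_row, steps, M, stride=1):
    """Return one vector of M activations per nonzero weight of a kernel row.

    reg_row: activations loaded for this kernel row; expected length is
             (M - 1) * stride + R, the register depth given in the text
    steps:   step indexes of the nonzero weights in this kernel row
    """
    pos = 0          # total shift applied to the register so far
    vectors = []
    for kw, step in enumerate(steps):
        pos += step + (1 if kw > 0 else 0)
        # Taps REG0[0], REG0[stride], ..., REG0[(M - 1) * stride] after shifting.
        vectors.append([reg_row[pos + m * stride] for m in range(M)])
    return vectors


# R = 3, M = 4, stride 1: the register holds (4 - 1) * 1 + 3 = 6 activations.
vectors = vgm_select([0, 1, 2, 3, 4, 5], steps=[1, 0], M=4)
# vectors == [[1, 2, 3, 4], [2, 3, 4, 5]]
```

Each returned vector is what would be broadcast to all PUs in one cycle; because the step indexes are shared across the kernels of a group, one such module can serve every PU.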
PU. The PU processes all operations in CNNs. Each PU
consists of M zero-value discriminators, M homogeneous PEs,
M homogeneous PSBs, a WB, and pooling, normalization, and
activation modules (see Fig. 10). Weights can be stored separately
in PUs because output activations from different channels involve
independent kernels. A selected activation first streams into
a zero-value discriminator and is then used in the PE if it
is not equal to zero. Meanwhile, the discriminator produces
an enable signal to control the clock gating of the PE. When the
[Figure 9: three panels (Cycle0, Cycle1, Cycle2). An activation buffer (0 1 2 ... 9 A ...) feeds REG1 and REG0 in the VGM; with the index sequence 1, 0, REG0 is shifted and the selected activations are broadcast to PE0 to PE3 of PU0 ... PUN, which apply weights W1 and W2 to update Y^0_{1,0..3} ... Y^N_{1,0..3}; in Cycle2, REG0 reloads from REG1 for the next row of the kernel.]
Fig. 9. Vector Generator Module. The VGM buffers and selects input activations to reuse them and to address the weight sparsity. It is shared by all the PUs.
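[Figure 10: (a) accelerator overview with DRAM, DMA, Main Controller, ABin, ABout, WIB, VGM and PU0 ... PUN; (b) PU internals with a Weight Buffer, PE0 ... PEM each with a Psum Buffer, and Normalization, Activation and Pooling modules; (c) VGM internals with Index Decoder, Shifter and Activation Reg, and the signals Index, jump, Data_in, {pop, push}, New_data and Data_out. Zero detectors (==0?) on the selected activations generate the enable (EN) signals of the PEs; the PU takes selected activations and compressed weights as inputs and produces output activations.]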
Fig. 10. The architecture of the PU. The PU processes all operations in CNNs.
It contains M homogeneous PEs, and the number of PEs can be configured
according to the CNN model and FPGA platform.
selected activation equals zero, the corresponding PE is
clock gated. To minimize data communication, the partial sums
produced by a PE are saved in a local partial sum buffer
placed next to the PE. Once the entire computation
of an output activation is done, the final partial sum is
transferred to the activation and pooling modules.
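The effect of the zero-value discriminators can be illustrated with a small behavioral model. The sketch below is an assumption-laden software analogy (the function name `gated_mac` and the counting of gated PEs are introduced here for illustration); in hardware, gating saves switching energy in the PE rather than cycles.

```python
def gated_mac(psums, weight, activations):
    """One broadcast weight applied to M selected activations.

    A PE whose activation is zero is 'clock gated': its partial sum is left
    untouched. Returns the updated partial sums and the number of gated PEs.
    """
    updated = list(psums)
    gated = 0
    for m, a in enumerate(activations):
        if a == 0:
            gated += 1                # enable signal low, MAC skipped
        else:
            updated[m] += weight * a  # normal multiply-accumulate
    return updated, gated


psums, gated = gated_mac([0.0, 0.0, 0.0, 0.0], 2.0, [1, 0, 3, 0])
# psums == [2.0, 0.0, 6.0, 0.0], gated == 2
```

Skipping the update is numerically equivalent to adding zero, so gating does not change the result.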
As there is neither spatial sharing nor temporal reuse of
weights in the fully-connected layer, it requires quite large
input bandwidth. Although all the PUs could be active in the
fully-connected layer, the required off-chip memory bandwidth
cannot be satisfied on the FPGA platform. Thus, we keep only
one PU active in the fully-connected layer when M is large
enough.
B. Optimize PE for quantization
One of the most common methods for model compression is to
quantize both weights and activations. Low-bit activations and
weights require less bandwidth, which helps improve the
throughput if the accelerator is a computation-bound design [14].
However, the proposed design is BRAM limited. Directly increasing
the size of the PE array would make the implementation fail on
the FPGA because the number of required BRAMs would easily exceed
the available number. Therefore, we maintain the datapath of
activations and optimize the structure of the PE. For example,
when the activations and weights are quantized to 8 bits, we
dispatch two selected 8-bit activations into one PE. Owing to the
proposed dataflow, the two activations are multiplied by the same
weight in two DSP slices, respectively. The two partial sums can
then be concatenated and stored in the buffer.
C. Storage
As the data processed in our accelerator have different
behaviors, we split storage into five parts: an ABin, an ABout,
a WIB, N WBs, and N ×M PSBs.
For the ABin and ABout, we set the widths to 16 × (M − 1 + R)
bits and 16 × M bits, respectively, so as to provide (M − 1 + R)
input activations to the VGM and to fetch M output activations
from a PU at each access. Thanks to the proposed dataflow, the N
PUs share the same input activations, and we read input
activations and produce output activations row by row. Thus, we
set a small depth for both ABin and ABout.
For the WB in each PU, we use a dual-port RAM whose read width is
16 bits for one port and 16 × M bits for the other port. The
write width of both ports is 16 × M bits. In the convolutional
layer, only one weight is read out and broadcast to all the PEs,
whereas in the fully-connected layer, M weights are fetched and
transferred to the corresponding PEs.
For the WIB, we select a width of 16 bits because we use the CSR
format, where we deploy 16 bits, 4 bits and 4 bits for Offset,
Index, and R_pointer, respectively. Thus, the Index and R_pointer
are stored aligned to 4 bits. We divide the WIB into three parts
for the three components of the compressed weight index.
For the PSB, we set the width to 32 bits. We store the partial
sums and cache output activations in the PSBs, as the ABout
fetches output activations from the PUs in turn. We map each PSB
to one BRAM, because if each PSB were mapped to two or more
BRAMs, the on-chip memory resources would easily become the
bottleneck of the achievable peak performance and reduce the
throughput.
The sizes of the five buffers are decisive for overall
performance and energy consumption. For example, the size of the
PSB decides the number of tiles. A small PSB requires a large
number of tiles, which leads to costly off-chip memory accesses,
whereas a large PSB is poorly utilized by the small layers of
CNNs. Thus, we generally deploy 2KB, 2KB, 4KB, 512B and 2KB for
ABin, ABout, WIB, WB and PSB, respectively. The configuration of
these buffers can be adjusted according to the CNNs.
V. EXPERIMENT
A. Experiments Setup
We evaluate our design on the Xilinx ZCU102 evaluation kit,
consisting of an UltraScale FPGA, quad ARM Cortex-A53 processors,
4GB PS DDR4 and 512MB PL DDR4. In this work, we use Verilog for
the RTL implementation and employ Xilinx Vivado (v2017.2) to
compile the source code to a bitstream. The design method is
inspired by DNNWEAVER [6]. Our FPGA implementation is synthesized
at a frequency of 200MHz. We use a GPIO-to-USB adapter to read
the power directly from the PMBus on the FPGA board. We apply the
methods of [22], [23] to train the CNN models. In our experiment,
we test typical CNNs including Lenet, Alexnet and VGG-16 and
achieve 11.85%, 32.92% and 36.75% sparsity for Lenet, Alexnet and
VGG-16, respectively, without significant accuracy loss.
TABLE II
RESOURCE UTILIZATION BREAKDOWN
             BRAM         LUT          DSP          FF
PU           28           6407         28           5463
DMA          111          26875        0            6940
Controller   0            28392        6            1282
VGM          0            27963        0            7285
ABin         2            0            0            0
ABout        1            0            0            0
WIB          2            0            0            0
Total        1460 (80%)   390K (65%)   1350 (53%)   278K (51%)
Available    1824         600K         2520         550K
B. Resource utilization
Table II shows the resource utilization breakdown with the
configuration (N = 48, M = 28). BRAMs are mainly used to
construct the buffers and FIFOs (First In First Out). The
parameter N determines the number of weight buffers in the PUs
and the number of output FIFOs in the DMA (Direct Memory Access).
The product of parameters M and N determines the number of Psum
buffers. Each DSP (Digital Signal Processing) slice can perform a
16-bit × 16-bit multiplication or an 8-bit × 8-bit
multiplication. The number of DSPs can be calculated as
N × M + 6 when the width of the operands is 16 bits. The 6 DSPs
are used to calculate the index address for the WIB and the
number of shift
N 8 M 8 N 8 M 1 6 N 1 6 M 1 6 N 1 6 M 3 2 N 3 2 M 2 8 N 4 8 M 2 80
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
Res
our
ce u
tiliz
atio
n (%
)
C o n f i g u r a t i o n
 L U T   F F   B R A M   D S P
Fig. 11. Resource utilization of the accelerator under different configuration.
c o n v 1 c o n v 2 c o n v 3 c o n v 4 c o n v 5
0 . 7
0 . 8
0 . 9
1 . 0
Com
put
atio
n E
ffici
enc
y (%
)
A l e x n e t
c o n v 1 xc o n v 2 xc o n v 3 xc o n v 4 xc o n v 5 x
0 . 8
0 . 9
1 . 0
Com
put
atio
n E
ffici
enc
y (%
)
V G G 1 6
 M = 8   M = 1 4   M = 1 6   M = 2 8   M = 3 2
Fig. 12. Computation efficiency of Alexnet and VGG-16 under different
configuration.
bit for VGM. Besides the buffers, the rest modules consume
LUT and Flip-Flop (FF).
Fig. 11 shows the resource utilization for different
parallelism factors, obtained from the Xilinx Vivado tool
(v2017.2). The LUT utilization increases as the total number of
PEs, N × M, increases. BRAMs are mainly used to construct
buffers and FIFOs. When (N, M) increases beyond a certain
point, some large FIFOs are implemented with LUTs and FFs
rather than BRAMs to meet the timing constraints.
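The DSP relation stated above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `dsp_count_16bit` and the constant names are hypothetical, and the configuration list is the one shown on the axis of Fig. 11. Only the (48, 28) point is compared against the total reported in Table II.

```python
# Illustrative sketch (hypothetical names): DSP usage for 16-bit operands,
# following the relation N x M + 6 given in the text.
AVAILABLE_DSP = 2520   # "Available" DSP count from Table II
EXTRA_DSP = 6          # WIB index address and VGM shift computation


def dsp_count_16bit(n: int, m: int) -> int:
    """One DSP per PE (N x M PEs) plus the 6 auxiliary DSPs."""
    return n * m + EXTRA_DSP


# Configurations shown on the axis of Fig. 11, as (N, M) pairs.
configs = [(8, 8), (8, 16), (16, 16), (16, 32), (32, 28), (48, 28)]
for n, m in configs:
    used = dsp_count_16bit(n, m)
    print(f"N{n}M{m}: {used} DSPs, {used / AVAILABLE_DSP:.1%} of the device")
```

For (N, M) = (48, 28) the relation gives 1350 DSPs, which matches the total in Table II.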
TABLE III
PERFORMANCE COMPARISON WITH PREVIOUS IMPLEMENTATION
Design                  [8]              [37]             [30]             Ours             [30]             [25]             Ours             Ours
CNN type                AlexNet          AlexNet          AlexNet          AlexNet          VGG-16           VGG-16           VGG-16           VGG-16
Device                  Zynq ZC706       XC7VX690T        Zynq ZCU102      Zynq ZCU102      Zynq ZCU102      Arria-10 GX1150  Zynq ZCU102      Zynq ZCU102
Accelerator type        sparse           dense            sparse           sparse           sparse           dense            sparse           sparse
Frequency (MHz)         -                200              200              200              200              150              200              200
Precision               -                16bit fixed      16bit fixed      16bit fixed      16bit fixed      16bit fixed      16bit fixed      8bit int
DSP utilization         -                1436(40%)        1144(45%)        1350(53%)        1144(45%)        1518(100%)       1350(53%)        2520(100%)
Logic utilization       -                468K(67%)        552K(92%)        390K(65%)        552K(92%)        161K(38%)        390K(65%)        405K(67%)
BRAM                    -                423(39%)         912(48%)         1460(80%)        912(48%)         1900(70%)        1460(80%)        1460(80%)
Performance (imag/s)    147              548              640              987              21               22               48               96
Power (W)               9.6              17.3             23.6             15.4             23.6             45.0             15.4             17.1
Efficiency (imag/s/W)   15.37            31.71            27.13            64.13            0.92             0.50             3.11             5.61
TABLE IV
PERFORMANCE DENSITY COMPARISON WITH PREVIOUS IMPLEMENTATION
Design                                  [37]             Ours             [30]             [38]             [25]             Ours             Ours
CNN type                                AlexNet          AlexNet          VGG-16           VGG-16           VGG-16           VGG-16           VGG-16
Device                                  XC7VX690T        Zynq ZCU102      Zynq ZCU102      XC7VX690T        Arria-10 GX1150  Zynq ZCU102      Zynq ZCU102
Frequency (MHz)                         200              200              200              200              150              200              200
Precision                               16bit fixed      16bit fixed      16bit fixed      8bit BFP         16bit fixed      16bit fixed      8bit int
Performance (GOP/s)                     270              476.7            223.4            281.5            232.2            495.4            990.8
DSP efficiency (GOP/s/DSP)              0.188            0.353            0.195            0.275            0.153            0.367            0.393
Logic cell efficiency (GOP/s/K cells)   0.577            1.222            0.405            1.213            1.442            1.270            2.446
Fig. 13. The activation sparsity (%) of the CONV layers (conv2
to conv5) in AlexNet. The overall sparsity is about 39.6%.
Fig. 14. The activation sparsity (%) of the CONV layers
(conv1_2 to conv5_3) in VGG-16. The overall sparsity is about
39.5%.
Fig. 15. The proportion of active PEs on conv2_1 of VGG-16,
plotted as average efficiency (%) over time (cycles). We sample
the number of active PEs by simulation under the configuration
<N, M> = <32, 28>. Each point represents the average efficiency
of 100 samples.
C. Computation Efficiency
In this work, we parallelize the output channels across PUs.
Thanks to shape-wise pruning, the load of each PU is balanced.
When we map the network onto the PEs, the inefficiency of our
design comes mainly from two sources: Dynamic Activation
Inefficiency (DAI) and Dataflow Mapping Inefficiency (DMI).
First, there are zero activations in the input feature maps,
which causes some PEs to be gated. This behavior is deliberate
and saves energy. Second, the size of the feature map cannot
always be divided evenly by M, where M is the number of PEs in
each PU. The DAI is dynamic and depends on the input data but
not on the proposed dataflow, whereas the DMI depends on the
proposed dataflow and can be computed with the following
equations.
We assume that the size of a 3-D output feature map is
U × V × F. According to our dataflow, when V > M, the average
computation efficiency is given by Eq. (2).
\[ \mathrm{Compute}_{\mathrm{eff}} = \frac{V}{\lceil V/M \rceil \, M} \tag{2} \]
When V < M, the average computation efficiency is given by
Eq. (3).
\[ \mathrm{Compute}_{\mathrm{eff}} = \frac{U \times V}{\left\lceil \dfrac{U}{\lfloor M/V \rfloor} \right\rceil M} \tag{3} \]
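Eqs. (2) and (3) can be evaluated numerically with a short Python sketch. The function name `compute_efficiency` is hypothetical, and the list of VGG-16 output sizes assumes the standard 224 × 224 input, which the text does not state. The case V = M is not covered explicitly by the two equations; the sketch groups it with Eq. (2), and both equations evaluate to 1 there.

```python
def compute_efficiency(u: int, v: int, m: int) -> float:
    """Average PE utilization for a U x V output plane with M PEs per PU."""
    if v >= m:
        # Eq. (2): each row of V outputs needs ceil(V / M) passes over M PEs.
        passes = -(-v // m)            # integer ceiling division
        return v / (passes * m)
    # Eq. (3): floor(M / V) rows are packed into one pass over M PEs.
    rows_per_pass = m // v
    passes = -(-u // rows_per_pass)    # ceil(U / floor(M / V))
    return (u * v) / (passes * m)


# Assumed VGG-16 output feature map sizes (standard 224 x 224 input).
vgg16_sizes = [224, 112, 56, 28, 14]
for m in (8, 14, 16, 28, 32):
    effs = [compute_efficiency(s, s, m) for s in vgg16_sizes]
    print(f"M={m}: " + ", ".join(f"{e:.3f}" for e in effs))
```

With these assumed sizes, every layer evaluates to an efficiency of 1 for M = 14 and M = 28, because all sizes are multiples of 14.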
Fig. 16. Roofline model for 16-bit fixed point in the FC layer
under different configurations. When the number of PEs
increases to 8, the performance touches the roof because the
off-chip bandwidth is fully utilized.
We measure the DMI on AlexNet and VGG-16. Fig. 12 shows the
computation efficiency, which accounts for the DMI, across
different layers with different parallelism factors. Because
both U and V in VGG-16 are divisible by 14, the computation
efficiency stays at 100% when M equals 14 or 28. In conclusion,
our sparse-wise dataflow can maintain a high computation
efficiency for different neural networks.
To analyse the DAI, we first count the sparse activations in
the convolutional layers of AlexNet and VGG-16 using PyTorch
vision. The dataset is ImageNet 2012. As shown in Fig. 13,
layer conv5 shows the lowest activation sparsity, below 10%,
and layer conv2 shows the highest activation sparsity, over
90%. The overall sparsity is about 39.6%. As for VGG-16,
Fig. 14 shows that the last seven convolutional layers have a
low sparsity below 30% and that layer conv2_1 has the highest
sparsity, about 75%. Overall, the activation sparsity is about
39.5%. From the activation sparsity, we can estimate the DAI.
Then we measure the DAI on conv2_1 of VGG-16 by simulation. On
each compute cycle, we sample the total number of active PEs
and calculate the efficiency. Because the total number of
samples is too large, we calculate an average efficiency every
100 samples. Fig. 15 shows the proportion of ungated PEs.
Indeed, part of the PEs are gated to save energy. The
proportion of active PEs on conv2_1 is inversely related to the
activation sparsity, since a PE is gated whenever its input
activation is zero. The sampling circuit is designed only for
simulation with the configuration <N, M> = <16, 28> and is not
implemented on the FPGA.
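The averaging step described above (one point per 100 samples) can be sketched as follows. This is a post-processing illustration in Python, not the sampling circuit itself; the function name `block_average_efficiency` is hypothetical and the sample values are synthetic, not measured data.

```python
def block_average_efficiency(active_counts, total_pes, block=100):
    """Average the fraction of active (ungated) PEs over consecutive blocks.

    A trailing block with fewer than `block` samples is dropped.
    """
    averages = []
    for start in range(0, len(active_counts) - block + 1, block):
        window = active_counts[start:start + block]
        averages.append(sum(window) / (block * total_pes))
    return averages


# Synthetic per-cycle counts of active PEs for a 16 x 28 = 448 PE array.
samples = [224] * 100 + [448] * 100 + [100] * 50
print(block_average_efficiency(samples, total_pes=16 * 28))
```

Each returned value corresponds to one plotted point of the kind shown in Fig. 15.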
Fig. 17. Roofline model for 8-bit int in the FC layer under
different configurations. Because the weights are 8 bits wide,
the performance does not touch the roof until the number of PEs
increases to 16.
Fig. 18. Roofline model for 16-bit fixed point in the CONV
layer under different configurations. When the number of PEs in
a PU increases from 32 to 64, the performance does not improve.
D. Performance Analysis
In this section, we adopt the well-known roofline model [14] to
explore the impact of insufficient off-chip bandwidth on
performance. We set the bit width of the AXI data bus that
connects to the DDR to 128 bits. First, we do not quantize the
weights and vary the configuration to find the optimal
parameters for mapping the FC6 layer of VGG-16 onto our
accelerator. We normalize all performance numbers to that of
"N1M2". As illustrated in Fig. 16, when the number of PEs
reaches 8, the performance touches the roof. Second, we
quantize the weights to 8 bits. We find that the performance
does not touch the roof until the number of PEs reaches 16, and
the roof becomes higher in Fig. 17. This is because there is
neither weight sharing nor weight reuse in the FC layer: when
the bit width of the weights is halved, the number of weights
transferred in each DDR access doubles, so the speedup also
doubles.
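One way to read the FC-layer result is as a simple bandwidth bound. The Python sketch below assumes, beyond what the text states, that the 128-bit AXI bus delivers one full word per accelerator clock cycle and that each PE consumes one new weight per cycle because FC weights are never reused. The function names are hypothetical, and throughput is expressed in MACs per cycle rather than normalized to "N1M2".

```python
def pes_to_saturate(bus_bits: int, weight_bits: int) -> int:
    """Number of PEs that can be fed with one fresh weight per cycle."""
    return bus_bits // weight_bits


def fc_macs_per_cycle(num_pes: int, bus_bits: int, weight_bits: int) -> int:
    """Bandwidth-bound FC throughput: extra PEs beyond the bound stay idle."""
    return min(num_pes, pes_to_saturate(bus_bits, weight_bits))


BUS_BITS = 128  # AXI data bus width given in the text
for weight_bits in (16, 8):
    roof = pes_to_saturate(BUS_BITS, weight_bits)
    curve = [fc_macs_per_cycle(p, BUS_BITS, weight_bits) for p in (2, 4, 8, 16, 32)]
    print(f"{weight_bits}-bit weights: roof at {roof} PEs, MACs/cycle = {curve}")
```

Under these assumptions the bound is reached at 8 PEs for 16-bit weights and at 16 PEs for 8-bit weights, and the roof itself doubles, in line with Figs. 16 and 17.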
After that, we map the conv2_1 layer of VGG-16 onto our
accelerator to explore the optimal parameters. To obtain a high
compute efficiency, we set N to a divisor of the number of
output channels F. We normalize all performance numbers to that
of "N1M8". In Fig. 18, the configuration "N48M28" achieves the
best peak performance.
Finally, we analyze the performance of our implementation. We
set the PE array size to <N, M> = <48, 28>, which gives 1344
PEs. In this configuration, the peak throughput can be
calculated as 2 × 0.2GHz × 28 × 48 = 537.6GOP/s when the
operand width is 16 bits. In particular, the proposed design
supports two 8bit × 8bit multiplications in a PE with 2 DSPs,
which leads to a peak performance of 1075.2GOP/s.
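The peak-throughput arithmetic above can be reproduced with a small Python helper. The function name `peak_gops` and the parameter `mults_per_pe` are hypothetical; the factor 2 counts one multiplication and one accumulation per MAC, as in the expression in the text.

```python
def peak_gops(freq_ghz: float, n: int, m: int, mults_per_pe: int = 1) -> float:
    """Peak throughput in GOP/s; each MAC counts as 2 operations."""
    return 2 * freq_ghz * n * m * mults_per_pe


# <N, M> = <48, 28> at 200 MHz, values taken from the text.
print(f"16-bit: {peak_gops(0.2, 48, 28):.1f} GOP/s")
print(f"8-bit:  {peak_gops(0.2, 48, 28, mults_per_pe=2):.1f} GOP/s")
```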
We compare our design with previous CNN FPGA accelerators in
Table III. The performance in Table III represents the
effective throughput. [25] and [37] are dense CNN accelerators,
and [8] and [30] are sparse CNN accelerators. Both [8] and [30]
address only the weight sparsity, not the activation sparsity,
so we compute our throughput with the DMI defined in the
previous subsection. For the dense accelerators, the
performance is computed by dividing the effective throughput by
the computation of the dense network. According to Table III,
our accelerator achieves an effective performance of 987 imag/s
on structured sparse AlexNet, which is a 1.5× to 6.7× speedup
and a 2.0× to 4.2× improvement in energy efficiency compared
with [8], [30] and [37]. As for VGG-16, our implementation
achieves 48 imag/s, which is a 2.2× to 2.3× speedup and a 3.4×
to 6.2× improvement in energy efficiency compared with [25] and
[30]. For the 8-bit int case, we achieve 96 imag/s, which is a
4.4× to 4.6× speedup and a 6.1× to 11.2× improvement in energy
efficiency compared with [25] and [30]. The speedup arises
because our dataflow can effectively skip the multiplications
with zero weights. In addition, this dataflow flexibly maps
the network onto the PEs, which leads to a high utilization of
on-chip resources. Previous works either cannot efficiently
exploit the zeros or suffer from a low compute efficiency.
Besides, we apply clock gating to the PEs whose input
activations are zero, so we obtain a higher energy efficiency
than previous works.
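The speedup and energy-efficiency ranges quoted above follow directly from the Performance and Efficiency rows of Table III. The Python sketch below recomputes them; the helper name `ratio_range` is hypothetical, and the baseline lists simply collect the values reported for the previous works in each group.

```python
def ratio_range(ours: float, baselines: list) -> tuple:
    """Smallest and largest improvement of `ours` over the baselines."""
    ratios = [ours / b for b in baselines]
    return min(ratios), max(ratios)


# Values from Table III: throughput in imag/s, efficiency in imag/s/W.
alexnet_fps, alexnet_eff = [147, 548, 640], [15.37, 31.71, 27.13]
vgg16_fps, vgg16_eff = [21, 22], [0.92, 0.50]

cases = {
    "AlexNet 16-bit": (987, alexnet_fps, 64.13, alexnet_eff),
    "VGG-16 16-bit": (48, vgg16_fps, 3.11, vgg16_eff),
    "VGG-16 8-bit": (96, vgg16_fps, 5.61, vgg16_eff),
}
for name, (fps, base_fps, eff, base_eff) in cases.items():
    s_lo, s_hi = ratio_range(fps, base_fps)
    e_lo, e_hi = ratio_range(eff, base_eff)
    print(f"{name}: speedup {s_lo:.1f}x to {s_hi:.1f}x, "
          f"energy efficiency {e_lo:.1f}x to {e_hi:.1f}x")
```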
The proposed design achieves a higher resource efficiency
because we leverage the sparsity and achieve a high mapping
efficiency, as tabulated in Table IV. The accelerator in [30]
also leverages the sparsity, but its DSP efficiency is limited
by the low mapping efficiency caused by the unbalanced load of
its PEs. In addition, its logic cell efficiency is only 0.405
GOP/s/K cells because the TLUT and CMUX [30] consume a large
amount of logic resources. Reducing the bit width helps to
improve the resource utilization efficiency. The design in [38]
achieves 0.275 GOP/s/DSP and 1.213 GOP/s/K cells because it
uses 8-bit block floating point to represent activations and
weights, so that two multiplications can be carried out in one
DSP slice. In the proposed design, if the data width is 8 bits,
the logic cell efficiency improves by nearly 100%.
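The resource-efficiency figures for the proposed design can be recomputed from the throughput in Table IV and the DSP and logic utilization in Table III. The Python sketch below does this for the two VGG-16 configurations; the function and variable names are hypothetical.

```python
def resource_efficiency(gops: float, dsps: int, logic_k_cells: float) -> tuple:
    """Return (GOP/s per DSP, GOP/s per K logic cells)."""
    return gops / dsps, gops / logic_k_cells


# Proposed design on VGG-16: throughput from Table IV, resources from Table III.
dsp16, logic16 = resource_efficiency(495.4, 1350, 390)   # 16-bit fixed
dsp8, logic8 = resource_efficiency(990.8, 2520, 405)     # 8-bit int

print(f"16-bit: {dsp16:.3f} GOP/s/DSP, {logic16:.3f} GOP/s/K cells")
print(f"8-bit:  {dsp8:.3f} GOP/s/DSP, {logic8:.3f} GOP/s/K cells")
print(f"Logic cell efficiency gain at 8-bit: {logic8 / logic16 - 1:.1%}")
```

The gain in logic cell efficiency comes out slightly below 100%, consistent with the statement above.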
VI. CONCLUSION
In this work, we have proposed a sparse CNN FPGA accelerator
with a sparse-wise dataflow that skips zero-weight
computations. Moreover, we have exploited data statistics to
minimize energy through zero gating, which avoids unnecessary
computations. In addition, we have proposed a set of
architecture optimization techniques for sparse CNNs.
Experiments demonstrated that our implementation achieves
987 imag/s and 48 imag/s for AlexNet and VGG-16 on Xilinx
ZCU102, respectively, which provides a 1.5× to 6.7× speedup and
a 2.0× to 6.2× improvement in energy efficiency over previous
CNN FPGA accelerators. Furthermore, the performance improvement
and energy efficiency would be much larger if we achieved the
sparsity of CNNs described in Cambricon-S [22].
VII. ACKNOWLEDGEMENT
This work is funded by the National Key R&D Program of
China (2018YFB0904902).
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification,” in
Proceedings of the IEEE international conference on computer vision,
2015, pp. 1026–1034.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2014, pp. 580–587.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 779–
788.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.
pdf
[5] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[6] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao,
A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models
to fpgas,” in 2016 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), Oct 2016, pp. 1–12.
[7] Y. Ma, N. Suda, Y. Cao, S. Vrudhula, and J.-s. Seo, “Alamo: Fpga ac-
celeration of deep learning algorithms with a modularized rtl compiler,”
Integration, vol. 62, pp. 14–23, 2018.
[8] H. Zeng, R. Chen, C. Zhang, and V. Prasanna, “A framework for gener-
ating high throughput cnn implementations on fpgas,” in Proceedings of
the 2018 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2018, pp. 117–126.
[9] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “An automatic rtl compiler
for high-throughput fpga implementation of diverse deep convolutional
neural networks,” in 2017 27th International Conference on Field
Programmable Logic and Applications (FPL). IEEE, 2017, pp. 1–8.
[10] Y. Ma, Y. Cao, S. Vrudhula et al., “Optimizing the convolution operation
to accelerate deep neural networks on fpga,” IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, vol. 26, no. 7, pp. 1354–1367,
2018.
[11] A. Podili, C. Zhang, and V. Prasanna, “Fast and efficient implemen-
tation of convolutional neural networks on fpga,” in 2017 IEEE 28th
International Conference on Application-specific Systems, Architectures
and Processors (ASAP). IEEE, 2017, pp. 11–18.
[12] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang,
and H. Yang, “Angel-eye: A complete design flow for mapping cnn
onto embedded fpga,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.
[13] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high perfor-
mance fpga-based accelerator for large-scale convolutional neural net-
works,” in 2016 26th International Conference on Field Programmable
Logic and Applications (FPL). IEEE, 2016, pp. 1–9.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
fpga-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[15] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space
exploration of fpga-based deep convolutional neural networks,” in 2016
21st Asia and South Pacific Design Automation Conference (ASP-DAC).
IEEE, 2016, pp. 575–580.
[16] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for
convolutional neural networks on fpgas,” in 2017 IEEE 25th Annual
International Symposium on Field-Programmable Custom Computing
Machines (FCCM). IEEE, 2017, pp. 101–108.
[17] Y. Ma, N. Suda, Y. Cao, J.-s. Seo, and S. Vrudhula, “Scalable and
modularized rtl compilation of convolutional neural networks onto fpga,”
in 2016 26th International Conference on Field Programmable Logic
and Applications (FPL). IEEE, 2016, pp. 1–8.
[18] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and con-
nections for efficient neural network,” in Advances in neural information
processing systems, 2015, pp. 1135–1143.
[19] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman
coding,” arXiv preprint arXiv:1510.00149, 2015.
[20] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian,
Y. Bai, G. Yuan et al., “CirCNN: accelerating and compressing deep
neural networks using block-circulant weight matrices,” in Proceedings
of the 50th Annual IEEE/ACM International Symposium on Microarchi-
tecture. ACM, 2017, pp. 395–408.
[21] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan, “Per-
mdnn: Efficient compressed dnn architecture with permuted diagonal
matrices,” in 2018 51st Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, 2018, pp. 189–202.
[22] X. Zhou, Z. Du, Q. Guo, S. Liu, C. Liu, C. Wang, X. Zhou, L. Li,
T. Chen, and Y. Chen, “Cambricon-s: Addressing irregularity in sparse
neural networks through a cooperative software/hardware approach,” in
2018 51st Annual IEEE/ACM International Symposium on Microarchi-
tecture (MICRO). IEEE, 2018, pp. 15–28.
[23] T. Zhang, K. Zhang, S. Ye, J. Li, J. Tang, W. Wen, X. Lin, M. Fardad, and
Y. Wang, “Adam-admm: A unified, systematic framework of structured
weight pruning for dnns,” arXiv preprint arXiv:1807.11091, 2018.
[24] J. Zhang and J. Li, “Improving the performance of opencl-based fpga
accelerator for convolutional neural network,” in Proceedings of the
2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2017, pp. 25–34.
[25] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop oper-
ation and dataflow in fpga acceleration of deep convolutional neural
networks,” in Proceedings of the 2017 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays. ACM, 2017, pp. 45–54.
[26] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang,
and J. Cong, “Fp-dnn: An automated framework for mapping deep
neural networks onto fpgas with rtl-hls hybrid templates,” in 2017 IEEE
25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 2017, pp. 152–159.
[27] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and
D. Chen, “Accdnn: an ip-based dnn generator for fpgas,” in 2018 IEEE
26th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 2018, pp. 210–210.
[28] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang,
and J. Cong, “Automated systolic array architecture synthesis for high
throughput cnn inference on fpgas,” in Proceedings of the 54th Annual
Design Automation Conference 2017. ACM, 2017, p. 29.
[29] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
Y. Wang et al., “Ese: Efficient speech recognition engine with sparse
lstm on fpga,” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays. ACM, 2017, pp.
75–84.
[30] L. Lu, J. Xie, R. Huang, J. Zhang, W. Lin, and Y. Liang, “An
efficient hardware accelerator for sparse convolutional neural networks
on fpgas,” in 2019 IEEE 27th Annual International Symposium on Field-
Programmable Custom Computing Machines (FCCM). IEEE, 2019, pp.
17–25.
[31] H. Zhang, P.-J. Lai, S. Paul, S. Kothawade, and S. Nikolaidis, “Learning
collaborative action plans from youtube videos,” in Proceedings of the
International Symposium on Robotics Research (ISRR 2019), Hanoi,
Vietnam, 2019.
[32] H. Zhang and S. Nikolaidis, “Robot learning and execution of col-
laborative manipulation plans from youtube videos,” arXiv preprint
arXiv:1911.10686, 2019.
[33] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A survey of fpga-based
neural network accelerator,” arXiv preprint arXiv:1712.08934, 2017.
[34] S. Li, W. Wen, Y. Wang, S. Han, Y. Chen, and H. Li, “An fpga
design framework for cnn sparsification and acceleration,” in 2017 IEEE
25th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM). IEEE, 2017, pp. 28–28.
[35] J. H. Ko, B. Mudassar, T. Na, and S. Mukhopadhyay, “Design of an
energy-efficient accelerator for training of convolutional neural networks
using frequency-domain computation,” in 2017 54th ACM/EDAC/IEEE
Design Automation Conference (DAC). IEEE, 2017, pp. 1–6.
[36] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An
accelerator for compressed-sparse convolutional neural networks,” in
2017 ACM/IEEE 44th Annual International Symposium on Computer
Architecture (ISCA). IEEE, 2017, pp. 27–40.
[37] S. Kala, B. R. Jose, J. Mathew, and S. Nalesh, “High-performance cnn
accelerator on fpga using unified winograd-gemm architecture,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27,
no. 12, pp. 2816–2828, 2019.
[38] X. Lian, Z. Liu, Z. Song, J. Dai, W. Zhou, and X. Ji, “High-performance
fpga-based cnn accelerator with block-floating-point arithmetic,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
Chaoyang Zhu received the B.S. degree from
College of Physics and Technology, Central China
Normal University, China, in 2017. He is currently
pursuing the master’s degree in College of Infor-
mation Science & Electronic Engineering, Zhejiang
University. His research interest includes hardware
acceleration of neural network.
Kejie Huang (M’13-SM’18) received his Ph.D de-
gree from the Department of Electrical Engineering,
National University of Singapore (NUS), Singapore,
in 2014. He has been a principal investigator at the
College of Information Science & Electronic Engineering,
Zhejiang University (ZJU) since 2016. Prior to
joining ZJU, he spent five years in the IC design
industry including Samsung and Xilinx, two years
in the Data Storage Institute, Agency for Science
Technology and Research (A*STAR), and another
three years in Singapore University of Technology
and Design (SUTD), Singapore. He has authored or coauthored more than
30 scientific papers in international peer-reviewed journals and conference
proceedings. He holds four granted international patents, and another eight
pending ones.
His research interests include low power circuits and systems design
using emerging non-volatile memories, architecture and circuit optimization
for reconfigurable computing systems and neuromorphic systems, machine
learning, and deep learning chip design. He is the Associate Editor of the IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMS-PART II: EXPRESS
BRIEFS.
Shuyuan Yang received the B.S. degree in elec-
tronic science and technology from Huazhong Uni-
versity of Science and Technology, Wuhan, China,
in 2019. He is currently working toward the M.S.
degree of electronic science and technology in Zhe-
jiang University, Hangzhou, China. His current re-
search interests include deep learning accelerator
and network on chip.
Ziqi Zhu received the B.S. degree in Electronic
and Information Engineering from the Zhejiang Uni-
versity, Hangzhou, China, in 2019. He is currently
working towards the M.S. degree of Integrated Cir-
cuits. His current research interests include computer
vision and 3D object detection.
Hejia Zhang received the B.E. degree in Bio-
engineering from Zhejiang University, Hangzhou,
China, in 2017. He is currently working towards the
Ph.D. degree in Computer Science at the University
of Southern California, USA. His current research
interests include robot learning from videos and
submodular optimization for active learning.
Haibin Shen is currently a Professor with Zhejiang
University, a member of the second level of 151 tal-
ents project of Zhejiang Province, and a member of
the Key Team of Zhejiang Science and Technology
Innovation. His research interests include learning
algorithm, processor architecture, and modeling. His
research achievement has been used by many major
enterprises. He has published more than 100 papers
on academic journals, and he has been granted more
than 30 patents of invention. He was a recipient of
the First Prize of Electronic Information Science and
Technology Award from the Chinese Institute of Electronics, and has won a
second prize at the provincial level.
