Stacked Filters Stationary Flow For Hardware-Oriented Acceleration Of
  Deep Convolutional Neural Networks by Gao, Yuechao et al.
STACKED FILTERS STATIONARY FLOW FOR
HARDWARE-ORIENTED ACCELERATION OF DEEP
CONVOLUTIONAL NEURAL NETWORKS
Gao Yuechao, Liu Nianhong & Zhang Sheng ∗
Department of Microelectronics and Nanoelectronics
Tsinghua University
Beijing, 100084, China
{gyc15,lnh15}@mails.tsinghua.edu.cn
ABSTRACT
To address memory and computation resource limitations in hardware-oriented
acceleration of deep convolutional neural networks (CNNs), we present a computation
flow, stacked filters stationary flow (SFS), and a corresponding data encoding
format, relative indexed compressed sparse filter format (CSF), which exploit
data sparsity and simplify data handling at execution time. We also propose a
three-dimensional Single Instruction Multiple Data (3D-SIMD) processor architecture
to illustrate how deep CNNs can be accelerated by taking advantage of the SFS flow
and the CSF format. Compared with the state-of-the-art result (Han et al., 2016b),
our methods achieve a 1.11× improvement in reducing the storage required by AlexNet,
and a 1.09× improvement in reducing the storage required by SqueezeNet, without
loss of accuracy on the ImageNet dataset. Moreover, with these approaches the chip
area for logic handling irregular sparse data access can be saved (about 19.1% of
the chip area in (Han et al., 2016a)). Compared with the 2D-SIMD processor
structures in DVAS, ENVISION, etc., our methods achieve about a 3.65× improvement
in processing element (PE) array utilization rate (from 26.4% to 96.5%) on the
data from Deep Compression on AlexNet¹.
1 INTRODUCTION
CNNs have achieved substantial progress in recent years, but hardware resource limitations
have hindered their wide use in embedded devices. Various efforts have been made to address this
issue, such as ShiftCNN (Gudovskiy & Rigazio, 2017), Ristretto (Gysel, 2016), Eyeriss (Chen et al.,
2017), Deep Compression (Han et al., 2016b) and EIE (Han et al., 2016a). By compressing
deep neural networks with pruning, trained quantization and Huffman coding, Deep Compression
(Han et al., 2016b), the best paper of ICLR 2016, achieved state-of-the-art results in reducing the
storage requirement of neural networks without affecting their accuracy.
In spite of the progress achieved so far, many problems remain to be solved. The
first problem is that manipulating compressed sparse data needs considerable extra logic and
consumes extra clock cycles. Eyeriss (Chen et al., 2017) uses a network on chip (NoC) to handle
sparsity by only performing data reads and MACs on nonzero values; DVAS (Moons & Verhelst, 2015)
and ENVISION (Moons et al., 2017) use input guard memories and guard control units to handle data
sparsity. Several sparse matrix encoding formats have been proposed, such as CSC, CSR and CISR
(Fowers et al., 2014). But existing encoding formats complicate the computation at run time because
of their irregular memory access patterns. This makes the computation harder to parallelize
and increases chip area. For example, EIE (Han et al., 2016a) uses Pointer Read Units (accounting
for about 19.1% of the chip area) and a Sparse Matrix Read Unit (accounting for about 73.57% of the
chip area) to handle compressed sparse data. Therefore, it would be desirable if the sparse data
could be handled easily during execution without complex transformations, lookups and computation.
The second
∗ Corresponding author: zhang_sh@tsinghua.edu.cn
¹ https://github.com/songhan/Deep-Compression-AlexNet
arXiv:1801.07459v3 [cs.CV] 6 Feb 2018
one is that, for deeply compressed sparse networks, the PE array utilization rate of recently
proposed hardware acceleration designs, such as Eyeriss (Chen et al., 2017), DVAS (Moons & Verhelst,
2015), ENVISION (Moons et al., 2017), DNPU (Shin et al., 2017), etc., is fairly low. In this paper
we present a novel computation flow, SFS, and a corresponding data memory layout and encoding
format, CSF, which together achieve the goal that data can be handled straightforwardly at run time.
We also propose a three-dimensional Single Instruction Multiple Data (3D-SIMD) processor
architecture which takes full advantage of these two features.
2 STACKED FILTERS STATIONARY FLOW (SFS)
Computations of convolutional (CONV) and fully connected (FC) layers in CNNs can be unified
into one formula, Eq.1 (ignoring biases). Eq.2-6 illustrate the SFS approach. $V_o$, $V_i$
and $W_f$ are the arrays of output feature maps, input feature maps and filters, respectively.
$S$, $C$, $K$, $M$, $M'$ and $m$ are the stride size, channel number, filter kernel size, total
filter number, number of batches and batch size; $W$, $H$ and $W'$, $H'$ are the width and height
of the input feature and of the output feature (Eq.6). Consider a bank of $M$ filters, each with
kernel size $K$, and an $H \times W$ feature with $C$ input channels in a layer. We denote the
filter bank as a four-dimensional array $W_f$ with size $M \times C \times K \times K$, the
input feature as a three-dimensional array $V_i$ with size $C \times H \times W$, and the output
feature as a three-dimensional array $V_o$ with size $M \times H' \times W'$. Filters are first
grouped into $M'$ batches with batch size $m$ (Eq.2), and each $W_f^{(n)}$ is then reshaped to
$W_f'^{(n)}$ (Eq.3). One channel of feature data is convolved with the $m$ filters from the same
channel in parallel (Eq.4, $j = 0, ..., m-1$; pseudo code is given in algorithm 1). At the end of
the computation, $V_o'^{(0)}$ to $V_o'^{(M'-1)}$ are concatenated back into $V_o$ (Eq.5).
$$V_o[ch_o][y][x] = \sum_{ch_i=0}^{C-1} \sum_{r=0}^{K-1} \sum_{c=0}^{K-1} W_f[ch_o][ch_i][r][c] \times V_i[ch_i][Sy+r][Sx+c] \tag{1}$$

$$W_f = [W_f^{(0)}, W_f^{(1)}, ..., W_f^{(M'-1)}], \quad W_f' = [W_f'^{(0)}, W_f'^{(1)}, ..., W_f'^{(M'-1)}] \tag{2}$$

$$W_f'^{(n)}[ch_i][r][c][j] = W_f^{(n)}[j][ch_i][r][c] \tag{3}$$

$$V_o'^{(n)}[j][y][x] = \sum_{ch_i=0}^{C-1} \sum_{r=0}^{K-1} \sum_{c=0}^{K-1} W_f'^{(n)}[ch_i][r][c][j] \times V_i[ch_i][Sy+r][Sx+c] \tag{4}$$

$$V_o = [V_o'^{(0)}, V_o'^{(1)}, ..., V_o'^{(M'-1)}] \tag{5}$$

$$\begin{gathered}
0 \le ch_o < M,\ 0 \le ch_i < C,\ 0 \le r < K,\ 0 \le c < K,\ 0 \le j < m, \\
0 \le x < W',\ 0 \le y < H',\ 0 \le n < M', \\
M' = M/m,\ W' = (W-K)/S + 1,\ H' = (H-K)/S + 1.
\end{gathered} \tag{6}$$
Algorithm 1 SFS parallel computing pseudo code
for each ch_i ∈ [0, C − 1] do
    for each r ∈ [0, K − 1] do
        for each c ∈ [0, K − 1] do
            output channel 0:     V_o'^(n)[0][y][x]     += W_f'^(n)[ch_i][r][c][0]     × V_i[ch_i][Sy + r][Sx + c]
            output channel 1:     V_o'^(n)[1][y][x]     += W_f'^(n)[ch_i][r][c][1]     × V_i[ch_i][Sy + r][Sx + c]
            ...
            output channel m − 1: V_o'^(n)[m − 1][y][x] += W_f'^(n)[ch_i][r][c][m − 1] × V_i[ch_i][Sy + r][Sx + c]
        end for
    end for
end for
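As an illustration only, the following pure-Python sketch checks that the batched flow of Eq.2-5 gives the same result as the direct formula of Eq.1. The function names `conv_direct` and `conv_sfs`, the nested-list data layout, and the requirement that m divides M are assumptions made for this sketch; they are not part of the proposed hardware.

```python
def conv_direct(Wf, Vi, S):
    """Eq. 1. Wf is M x C x K x K and Vi is C x H x W (nested lists)."""
    M, C, K = len(Wf), len(Vi), len(Wf[0][0])
    H, W = len(Vi[0]), len(Vi[0][0])
    Ho, Wo = (H - K) // S + 1, (W - K) // S + 1
    Vo = [[[0] * Wo for _ in range(Ho)] for _ in range(M)]
    for cho in range(M):
        for y in range(Ho):
            for x in range(Wo):
                Vo[cho][y][x] = sum(
                    Wf[cho][chi][r][c] * Vi[chi][S * y + r][S * x + c]
                    for chi in range(C) for r in range(K) for c in range(K))
    return Vo


def conv_sfs(Wf, Vi, S, m):
    """Eq. 2-5: process the filters in M/m batches of m stacked filters."""
    M, C, K = len(Wf), len(Vi), len(Wf[0][0])
    H, W = len(Vi[0]), len(Vi[0][0])
    Ho, Wo = (H - K) // S + 1, (W - K) // S + 1
    assert M % m == 0, "this sketch assumes that m divides M"
    Vo = []
    for n in range(M // m):
        batch = Wf[n * m:(n + 1) * m]  # W_f^(n), Eq. 2
        # Eq. 3: reshape so the m weights at one (chi, r, c) are adjacent.
        Wp = [[[[batch[j][chi][r][c] for j in range(m)]
                for c in range(K)] for r in range(K)] for chi in range(C)]
        Vop = [[[0] * Wo for _ in range(Ho)] for _ in range(m)]
        for y in range(Ho):
            for x in range(Wo):
                for chi in range(C):
                    for r in range(K):
                        for c in range(K):
                            v = Vi[chi][S * y + r][S * x + c]
                            col = Wp[chi][r][c]
                            # Algorithm 1: one feature value feeds m outputs.
                            for j in range(m):
                                Vop[j][y][x] += col[j] * v
        Vo.extend(Vop)  # Eq. 5: concatenate the batch outputs
    return Vo


# Small integer example: M = 4 filters, C = 2 channels, K = 2, 4 x 4 input.
Wf = [[[[(cho + chi + r * c) % 3 - 1 for c in range(2)] for r in range(2)]
       for chi in range(2)] for cho in range(4)]
Vi = [[[chi + y * x for x in range(4)] for y in range(4)] for chi in range(2)]
assert conv_sfs(Wf, Vi, S=1, m=2) == conv_direct(Wf, Vi, S=1)
```

In software the inner loop over j runs sequentially; in the proposed flow these m multiply-accumulates are the ones intended to run in parallel.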
3 RELATIVE INDEXED COMPRESSED SPARSE FILTER (CSF) FORMAT
In the CSF encoding format, the memory layout of the m grouped filters shown in figure 1 is
further rearranged so that the elements are stored column by column. Consequently, in the SFS
computation flow, when each element of the feature map is multiplied by a column of data from the
m filters (algorithm 1), the filter weights can be loaded sequentially. The first row of figure 2
illustrates this rearrangement. In figure 2, if a weight value equals 0, that value and its index
are removed, the relative index of the next value is increased by one, and the pointer of the next
column is decreased by one. The number of stored values in a column (nonzero values plus padding
zeros) is given by the pointer of the next column. A column pointer of 0 means that all the values
in the preceding column equal 0. The relative column pointer is not needed when the parameters are
stored in files.
Filter 1 W1,11 W1,12 W1,13 W1,21 W1,22 W1,23 W1,31 W1,32 W1,33
Filter 2 W2,11 W2,12 W2,13 W2,21 W2,22 W2,23 W2,31 W2,32 W2,33
Filter 3 W3,11 W3,12 W3,13 W3,21 W3,22 W3,23 W3,31 W3,32 W3,33
... ... ... ... ... ... ... ... ... ...
Filter m Wm,11 Wm,12 Wm,13 Wm,21 Wm,22 Wm,23 Wm,31 Wm,32 Wm,33
Figure 1: Memory layout for the m filters with kernel size 3 from a single channel.
virtual weight value W1,11 W2,11 … Wm,11 W1,12 W2,12 … Wm,12 … W1,kk W2,kk … Wm,kk
relative filter index 0 0 … 0 0 0 … 0 … 0 0 … 0
relative column pointer 0 m … m m
Figure 2: Memory layout in the relative indexed CSF format.
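The encoding rules above can be sketched in Python as follows. This is a minimal reading of the text, not a reference implementation: the names `csf_encode` and `csf_decode` are hypothetical, and the sketch assumes that the relative index restarts at every column and that a padding zero is stored whenever a zero run exceeds the largest value representable in `index_bits` bits (as in Deep Compression).

```python
def csf_encode(filters, index_bits):
    """Encode m filters (each a flat list of K*K weights from one channel).

    Returns (values, rel_idx, col_ptr). col_ptr[k + 1] is the number of
    stored entries (nonzeros plus padding zeros) of column k.
    """
    m, n_cols = len(filters), len(filters[0])
    max_gap = (1 << index_bits) - 1
    values, rel_idx, col_ptr = [], [], [0]
    for col in range(n_cols):
        gap = 0      # zeros skipped since the last stored entry
        stored = 0
        for j in range(m):
            w = filters[j][col]
            if w == 0:
                gap += 1
                continue
            while gap > max_gap:
                # The run is too long for the index field: store a padding zero.
                values.append(0)
                rel_idx.append(max_gap)
                gap -= max_gap + 1
                stored += 1
            values.append(w)
            rel_idx.append(gap)
            gap = 0
            stored += 1
        col_ptr.append(stored)
    return values, rel_idx, col_ptr


def csf_decode(values, rel_idx, col_ptr, m):
    """Rebuild the dense m x (K*K) layout from the CSF arrays."""
    n_cols = len(col_ptr) - 1
    filters = [[0] * n_cols for _ in range(m)]
    pos = 0
    for col in range(n_cols):
        j = -1
        for _ in range(col_ptr[col + 1]):
            j += rel_idx[pos] + 1
            filters[j][col] = values[pos]
            pos += 1
    return filters


dense = [[1, 0], [0, 0], [0, 0], [2, 0]]   # m = 4 filters, 2 columns
encoded = csf_encode(dense, index_bits=1)
assert csf_decode(*encoded, m=4) == dense
```

For a fully dense bank the sketch produces relative indices that are all 0 and column pointers of the form 0, m, ..., m, which matches the layout shown in figure 2.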
4 3D-SIMD PROCESSOR ARCHITECTURE
The SFS flow and the CSF encoding format are the two key features of the proposed 3D-SIMD processor
architecture (see figure 3). In this architecture, feature data from a single channel of the input
feature map are loaded into the line buffer and window registers, and the data of m filters from
the same channel are loaded into the local filter buffer. Each element in the window is then
multiplied by the column of data from the m filters at the same position (algorithm 1). Data in
CSF format can therefore be loaded sequentially at run time and handled straightforwardly, without
complex transformations, lookups and computation, and zeros are skipped by design. This shows that
the two approaches can greatly simplify sparse data handling, saving zero-bypassing and data lookup
time. No complex sparse data handling logic is needed, in contrast to former works (Moons &
Verhelst, 2015; Moons et al., 2017; Han et al., 2016a).
[Figure 3 block diagram. Blocks shown: global feature buffer, global filter buffer, global output
feature buffer, center controller, line buffer (rows V11 ... V1w to V31 ... V3w), window registers
(V11 ... V33), local filter buffer (CSF data: virtual weight values, relative filter indices and
relative column pointers), main process unit with computation FIFOs 1 to K×K (FIFO 1 holds
W1,11×V11 ... Wm,11×V11; FIFO K×K holds W1,33×V33 ... Wm,33×V33), PE arrays, registers and adders
1 to m with output feature offsets, local output registers, NL, Pool and output data format.]
Figure 3: 3D-SIMD processor architecture.
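The data path described above can be mimicked in software. The sketch below assumes CSF arrays of the form described in section 3 (stored values, relative filter indices that restart at each column, and column pointers giving the entry count of the preceding column). The function name `sfs_csf_accumulate` and the flat window layout are assumptions of this sketch, not details taken from the hardware design.

```python
def sfs_csf_accumulate(window, values, rel_idx, col_ptr, m):
    """Accumulate m partial sums for one K x K window of one input channel.

    window: flat list of K*K feature values (V11, V12, ..., VKK).
    values, rel_idx, col_ptr: CSF arrays of the m stacked filters.
    """
    acc = [0] * m   # one register and adder per stacked filter
    pos = 0         # sequential read position in the local filter buffer
    for col, v in enumerate(window):
        j = -1
        for _ in range(col_ptr[col + 1]):
            j += rel_idx[pos] + 1       # relative index selects the adder
            acc[j] += values[pos] * v   # padding zeros contribute nothing
            pos += 1
    return acc


# Three stacked 2 x 2 filters from one channel (rows are filters).
dense = [[2, 0, 0, 1],
         [0, 0, 3, 0],
         [4, 0, 0, 5]]
# Hand-encoded CSF form of `dense` (column by column, zeros removed).
values = [2, 4, 3, 1, 5]
rel_idx = [0, 1, 1, 0, 1]
col_ptr = [0, 2, 0, 1, 2]

window = [1, 2, 3, 4]
expected = [sum(w * v for w, v in zip(f, window)) for f in dense]
assert sfs_csf_accumulate(window, values, rel_idx, col_ptr, m=3) == expected
```

The weights are read strictly in storage order and no position lookup is needed beyond adding the relative index, which is the property the text relies on.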
5 RESULT
We first evaluate the distribution of the numbers of continuous zeros after applying the
rearrangement. As figure 4 shows, the distribution narrows toward the left. This means that fewer
bits are needed to store the relative index values, and fewer padding zeros are needed when the
data are compressed into the encoding format, which further reduces storage space. We also
evaluate the distribution of the numbers of continuous nonzeros after applying the rearrangement.
This distribution also narrows toward the left, which means that the computation load during
execution is better balanced than in the reference work (Han et al., 2016b). We also analyze the
effect of the batch size m on storage space, and find that there is an optimum batch size for each
layer. A smaller batch size requires a smaller local buffer, but the data are reused less and the
input feature map needs to be loaded more times. For simplicity, all the experiments in this
section use the filter number as the batch size. That is, all the filters and input feature maps
are loaded only once and the output feature maps are saved once per inference. Eq.7 is used to
find the best number of bits for storing the relative index values of each layer. Table 1 and
table 2 show the reduction in the extra space needed to store indices and padding zeros in each
layer of AlexNet compared with former work, and the improvement in total storage requirement after
applying our method. We also evaluate the improvement in PE array utilization rate² of the
convolutional and fully connected layers on several networks. On AlexNet, as shown in table 3, the
number of MACs of the dense network is 1.06 GOPS, the number of MACs on nonzero weight values of
the sparse network is 0.28 GOPS, and the number of MACs after applying our method is 0.29 GOPS.
So, compared with a dense network processor such as the 2D-SIMD processor structures in DVAS,
ENVISION, etc., the PE array utilization rate improves from 26.4% to 96.5% (about a 3.65×
improvement), using the data from Deep Compression on AlexNet. We also evaluate the amount of data
lookup calculation³. Using the same data, the amount of calculation of the SFS and CSF approach is
about 1/20 that of the algorithm in EIE (see table 3).
$$\underset{bit}{\arg\min} \left\{ f_{total\_bits}(bit) = Nz\_num \times bit + \sum_{i=2^{bit}}^{max} zero\_stat[i] \times \left( \frac{i}{2^{bit}} \right) \times (wbit + bit) \right\} \tag{7}$$

Nz_num: the number of nonzero weight values; wbit: the number of bits to store a weight value;
bit: the number of bits to store a relative index value; zero_stat: the distribution of continuous
zero numbers.
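A small Python sketch of the search in Eq.7 is given below. It assumes that the ratio i / 2^bit is an integer (floor) division, read as the number of padding zeros needed for a zero run of length i, and it represents zero_stat as a dictionary from run length to count. The function names and the `max_bit` search bound are hypothetical, and the example histogram is made up for illustration rather than taken from the experiments.

```python
def total_bits(bit, nz_num, zero_stat, wbit):
    """Evaluate f_total_bits(bit) of Eq. 7.

    zero_stat maps a zero-run length i to the number of runs of that length.
    """
    span = 1 << bit
    # Assumption: i / 2^bit is taken as floor division (padding zeros per run).
    padding = sum(count * (i // span) * (wbit + bit)
                  for i, count in zero_stat.items() if i >= span)
    return nz_num * bit + padding


def best_index_bits(nz_num, zero_stat, wbit, max_bit=8):
    """Return the index width in [1, max_bit] that minimizes Eq. 7."""
    return min(range(1, max_bit + 1),
               key=lambda b: total_bits(b, nz_num, zero_stat, wbit))


# Made-up histogram: 60 runs of length 1, 30 of length 3, 5 of length 9.
example_stat = {1: 60, 3: 30, 9: 5}
best = best_index_bits(nz_num=100, zero_stat=example_stat, wbit=8)
```

The trade-off is visible in the two terms: a wider index costs `bit` for every nonzero value, while a narrower index forces more padding zeros, each costing `wbit + bit`.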
[Figure 4 plots: four histogram panels (Conv1, Conv2, FC6, Conv4); x-axis: Continuous Zero Number
(1 to 15); y-axis: Count; series: Ref, Ours.]
Figure 4: Distributions of continuous zero numbers in different layers of AlexNet, compared with
the reference work (Han et al., 2016b).
6 CONCLUSION
In this paper, we propose a stacked filters stationary flow, SFS, and its corresponding data
encoding format, CSF. We also propose a 3D-SIMD processor architecture for this computation flow
and data encoding format. Experimental results show that our approaches shift the distributions of
the numbers of continuous zeros and nonzeros toward smaller values. This helps to further
compress the network parameters by about 8% to 10% and to balance the computation load at run time.
By adopting the proposed 3D-SIMD processor architecture, the chip area for logic handling irregular
memory access of sparse data can be saved; for example, about 19.1% of the chip area in EIE (Han
et al., 2016a), used for pointer reads, can be saved. Moreover, directly using the encoded data
without complex transformations, lookups and computation at run time also saves zero-bypassing
clock cycles and data lookup time (Moons & Verhelst, 2015; Moons et al., 2017; Han et al., 2016a).
² PE array utilization rate is estimated by: no. of nonzero value MACs / total no. of MACs.
³ A single calculation of locating a batch of data is defined as a basic unit.
Table 1: Space (in bits) needed in each layer of AlexNet for storing extra padding zeros and indices
Layer                           Nonzeros    Index (bits)   Extra space   Improvement
conv1 in (Han et al., 2016b)    235088      4              117568
conv1 by SFS+CSF                235088      1              37189         3.16×
conv2 in (Han et al., 2016b)    930448      4              491456
conv2 by SFS+CSF                930448      3              393897        1.25×
conv3 in (Han et al., 2016b)    2447520     4              1262136
conv3 by SFS+CSF                2447520     3              1082226       1.17×
conv4 in (Han et al., 2016b)    1975696     4              999260
conv4 by SFS+CSF                1975696     3              822902        1.21×
conv5 in (Han et al., 2016b)    1304496     4              662352
conv5 by SFS+CSF                1304496     3              545220        1.21×
fc6 in (Han et al., 2016b)      13345544    4              23978248
fc6 by SFS+CSF                  13345544    5              18849823      1.27×
fc7 in (Han et al., 2016b)      6149136     4              9525904
fc7 by SFS+CSF                  6149136     5              8373849       1.14×
fc8 in (Han et al., 2016b)      4204128     4              4289032
fc8 by SFS+CSF                  4204128     3              4033864       1.06×
total in (Han et al., 2016b)    30592056                   41325956
total by SFS+CSF                30592056                   34138970      1.21×
Table 2: Extra space (in bits) improvement and total storage requirement improvement
Network                              Nonzeros    Extra space   Improvement   Total
AlexNet by (Han et al., 2016b)       30592056    41325956
AlexNet by SFS+CSF                   30592056    34138970      1.21×         1.11×
SqueezeNet by (Han et al., 2016b)    3327368     1737628
SqueezeNet by SFS+CSF                3327368     1307160       1.33×         1.09×
Table 3: PE array utilization rate and data lookup calculation improvement (MACs in GOPS)
                          Total no.     No. of nonzero   Total no. of   Speed-up   Lookup
                          of MACs       value MACs       MACs (CSF)     (CSF)      (CSF)
AlexNet CONV layers       1.00269368    0.2744399        0.2839878      3.53×      1/13
AlexNet FC layers         0.05459595    0.0055178        0.0059305      9.21×      1/42
AlexNet CONV+FC layers    1.05728963    0.2799577        0.2899182      3.65×      1/20
PE utilization ratio      0.2647881                      0.9656438
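As a check of the arithmetic, the utilization figures quoted for AlexNet follow from the definition in footnote 2 and the CONV+FC row of Table 3. The short Python sketch below recomputes them; the variable names are chosen for this illustration only.

```python
# MAC counts in GOPS, taken from the AlexNet CONV+FC row of Table 3.
dense_macs = 1.05728963     # total MACs of the dense network
nonzero_macs = 0.2799577    # MACs on nonzero weight values
csf_macs = 0.2899182        # total MACs with SFS+CSF (includes padding zeros)

# Footnote 2: utilization = no. of nonzero value MACs / total no. of MACs.
util_dense = nonzero_macs / dense_macs
util_csf = nonzero_macs / csf_macs
gain = util_csf / util_dense

print(f"dense: {util_dense:.4f}, CSF: {util_csf:.4f}, gain: {gain:.2f}x")
```

Because the numerator is the same in both ratios, the gain equals dense_macs / csf_macs, which is why it coincides with the 3.65× speed-up reported in the same row.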
REFERENCES
Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State
Circuits, 52(1):127–138, 2017.
Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S. Chung, and Greg Stitt. A high memory
bandwidth FPGA accelerator for sparse matrix-vector multiplication. In IEEE International
Symposium on Field-Programmable Custom Computing Machines, pp. 36–43, 2014.
Denis A. Gudovskiy and Luca Rigazio. ShiftCNN: Generalized low-precision architecture for
inference of convolutional neural networks. 2017.
Philipp Gysel. Ristretto: Hardware-oriented approximation of convolutional neural networks. 2016.
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J.
Dally. EIE: Efficient inference engine on compressed deep neural network. In International
Symposium on Computer Architecture (ISCA), 2016a.
Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks
with pruning, trained quantization and Huffman coding. In International Conference on Learning
Representations (ICLR), 2016b.
Bert Moons and Marian Verhelst. DVAS: Dynamic voltage accuracy scaling for increased
energy-efficiency in approximate computing. In IEEE/ACM International Symposium on Low Power
Electronics and Design, pp. 237–242, 2015.
Bert Moons, Roel Uytterhoeven, Wim Dehaene, and Marian Verhelst. 14.5 ENVISION: A
0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural
network processor in 28nm FDSOI. In Solid-State Circuits Conference, pp. 246–247, 2017.
Dongjoo Shin, Jinmook Lee, Jinsu Lee, and Hoi-Jun Yoo. 14.2 DNPU: An 8.1TOPS/W reconfigurable
CNN-RNN processor for general-purpose deep neural networks. In Solid-State Circuits Conference,
pp. 240–241, 2017.
