Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs by Chen, Xuhao
Escoin: Efficient Sparse Convolutional Neural Network
Inference on GPUs
Xuhao Chen
University of Texas at Austin
cxh@utexas.edu
ABSTRACT
Deep neural networks have achieved remarkable accuracy in many
artificial intelligence applications, e.g. computer vision, at the cost
of a large number of parameters and high computational complexity.
Weight pruning can compress DNN models by removing redundant
parameters in the networks, but it brings sparsity in the weight
matrix, and therefore makes the computation inefficient on GPUs.
Although pruning can remove more than 80% of the weights, it
actually hurts inference performance (speed) when running models
on GPUs.
Two major problems cause this unsatisfactory performance on
GPUs. First, lowering convolution onto matrix multiplication re-
duces data reuse opportunities and wastes memory bandwidth. Sec-
ond, the sparsity brought by pruning makes the computation irregu-
lar, which leads to inefficiency when running on massively parallel
GPUs. To overcome these two limitations, we propose Escort, an
efficient sparse convolutional neural network method for GPUs. Instead
of using the lowering method, we compute the sparse
convolutions directly. We then orchestrate the parallelism and locality
for the direct sparse convolution kernel, and apply customized
optimization techniques to further improve performance. Evaluation
on NVIDIA GPUs shows that Escort can improve sparse convolution
speed by 2.63× and 3.07×, and inference speed by 1.38× and 1.60×,
compared to CUBLAS and CUSPARSE respectively.
ACM Reference format:
Xuhao Chen. 2019. Escoin: Efficient Sparse Convolutional Neural Network
Inference on GPUs. In Proceedings of ACM Conference, Washington, DC,
USA, July 2017 (Conference’17), 9 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Deep neural networks (DNNs) [27] have been widely used in many
artificial intelligence (AI) applications including computer vision [23,
25, 26], speech recognition [3, 15], natural language processing [12],
and robotics [6, 40]. Modern DNNs are composed of five to more
than a thousand network layers, with a trend of going deeper and
more complex. A common form of DNNs is convolutional neural
networks (CNNs), which are mainly composed of multiple convolu-
tional (CONV) layers. In recent CNNs, the CONV layers dominate
the entire network and consume most of the execution time. This
paper focuses on improving the speed of CONV layers in CNNs.
In many application domains, DNNs are now able to exceed hu-
man accuracy [42]. The superior accuracy of DNNs, however, comes
at the cost of high computational complexity. With continuous in-
crease of their model sizes, DNNs consume considerable storage,
memory bandwidth, and computational resources. To address this
limitation, weight pruning [19] has been proposed to compress DNN
models by removing redundant connections in the networks. How-
ever, although this technique can significantly reduce the model size
by removing an average of 80% of the weights, pruning actually
hurts inference performance (i.e. speed) when running CNN models
on GPUs [47].
To understand the performance effect of weight pruning, we mea-
sured the inference speed of 3 popular CNNs on NVIDIA GPUs
using the CUBLAS [? ] and CUSPARSE [? ] libraries, respectively. Although
weight pruning removes a large portion of multiply-accumulate (MAC)
operations, we find that the inference speed of the networks using
CUSPARSE is barely faster than that using CUBLAS, and in some cases
slower. Two issues cause this unsatisfactory performance.
First, the overhead of lowering convolution onto matrix multiplication
becomes a severe problem when the computation becomes
sparse after pruning. The lowering approach has demonstrated overhead
for dense convolution [? ], since it duplicates the input features
multiple times, which wastes memory bandwidth and reduces the
data reuse opportunities. For the dense case, this overhead is trivial.
However, it becomes unacceptable for sparse convolution whose
computational intensity is already much lower than dense convolu-
tion. For a highly memory bound operation like sparse convolution,
lowering is no longer a suitable choice for implementing convolution
on GPUs.
Second, sparse matrix computation is much less efficient than its
dense counterpart on GPUs. Although sparse matrix multiplication
avoids unnecessary MAC operations, its memory access pattern is
fairly irregular and cannot take full advantage of the compute
capability of the GPU architecture. In addition, although sparse matrix
computation using compressed data structures saves memory
space, decoding the sparse format at runtime adds overhead.
To overcome the limitations, we propose Escort, an efficient
sparse CNN method customized for GPU’s data-parallel architec-
ture. Instead of lowering the convolution onto matrix multiplication,
we choose to directly compute the sparse convolution. To take advan-
tage of GPU’s tremendous computational horsepower, we customize
the dataflow and apply a series of optimization techniques based
on the understanding of the memory access pattern. We implement
Escort using CUDA and evaluate it on NVIDIA GPUs. Experimen-
tal results show that Escort substantially outperforms the lowering
method using either CUBLAS or CUSPARSE. To the best of our
arXiv:1802.10280v2 [cs.DC] 3 Apr 2019
Conference’17, July 2017, Washington, DC, USA Xuhao Chen
[Figure 1 diagram: filters (M filters, each R × S with C channels), input fmaps (N inputs, each H × W with C channels), and output fmaps (N outputs, each E × F with M channels).]
Figure 1: 3-D convolutions in CNNs [42].
Algorithm 1 Sequential Convolution [36]
1: procedure CONV(in, weight, out)
2:   for n in [0, N) do
3:     for m in [0, M) do
4:       for c in [0, C) do
5:         for h in [0, E) do
6:           for w in [0, F) do
7:             for r in [0, R) do
8:               for s in [0, S) do
9:                 out[n][m][h][w] +=
10:                  in[n][c][h+r][w+s] *
11:                  weight[m][c][r][s]
knowledge, this is the first direct sparse convolution tailored for the
GPU architecture.
This paper makes the following contributions:
• We propose Escort, a direct sparse convolution approach
that can efficiently run on modern GPUs.
• We orchestrate the parallelism and locality for Escort and
optimize it for the GPU architecture.
• We measure the inference speed of Escort on NVIDIA
GPUs, and demonstrate its superior performance over CUBLAS
and CUSPARSE.
The rest of the paper is organized as follows: Section 2 introduces
the background of sparse convolutional neural networks and explains
the motivation of this work. Our proposed design is described
in Section 3. We present the evaluation in Section 4. Section 5
summarizes related work and Section 6 concludes.
2 BACKGROUND AND MOTIVATION
In AI applications, employing DNNs can be decomposed into two
tasks: training and inference. Today, training is often done on GPUs,
while inference depends on the application and can employ CPUs,
GPUs, FPGAs or ASICs [42]. This paper focuses on CNN inference
on GPUs. Since over 90% of the computation of recent CNN designs
is in convolutions [38], we aim to speed up this core operation.
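[Figure 2 diagram: convolution of a 2×2 filter (rows [1 2], [3 4]) with a 3×3 input (rows [1 2 3], [4 5 6], [7 8 9]) producing a 2×2 output, shown next to the equivalent matrix multiplication: the flattened filter [1 2 3 4] times the lowered input matrix with rows [1 2 4 5], [2 3 5 6], [4 5 7 8], [5 6 8 9].]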
Figure 2: Lowering 2-D convolution to matrix multiplica-
tion [42].
2.1 Convolutional Neural Networks
Convolutional neural networks (CNNs) [48] have become the most
popular algorithmic approach for deep learning in many application
domains. Each of the CONV layers in a CNN is primarily composed
of high-dimensional convolutions as visualized in Fig. 1. In this
computation, the core operation is a 2-D sliding window convolution
of an R×S filter kernel over an H×W input channel to produce an E×F
output channel. An input feature map (ifmap) can include multiple
(C) input channels. A distinct filter kernel is applied to each input
channel, and the outputs for each of the C channels are accumulated
together element-wise into a single channel of output feature map
(ofmap). Multiple 3-D filters (M) can be applied to the same volume
of input activations to create M output channels. Finally, N ifmaps
may be processed together as a batch to potentially improve reuse of
the filter weights [42].
Given the shape parameters in Table 1, the computation of a
CONV layer is defined by Eq. (1), where O, I and W are the matrices
of the ofmaps, ifmaps and filters, respectively. Filters are composed
of weights, while input and output feature maps (ifmaps, ofmaps)
are composed of activations. Algorithm 1 shows the pseudo code of
computing a complete CONV layer. It is performed as a loop nest
over 7 variables. Each point in the 7-dimensional space formed from
these variables represents a single multiply-accumulate operation
(lines 9 to 11).
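\begin{equation*}
\begin{aligned}
O[n][m][h][w] &= \sum_{c=0}^{C-1} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} I[n][c][h+r][w+s] \times W[m][c][r][s], \\
& 0 \le n < N,\quad 0 \le m < M,\quad 0 \le h < E,\quad 0 \le w < F, \\
& E = H - R + 1,\quad F = W - S + 1.
\end{aligned}
\tag{1}
\end{equation*}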
Shape Parameter    Description
N                  batch size
M                  # of filters / # of ofmap channels
C                  # of ifmap/filter channels
H/W                ifmap height/width
R/S                filter height/width
E/F                ofmap height/width
Table 1: Shape Parameters of a CONV Layer [42]
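As an illustration of Eq. (1) and Algorithm 1, the following is a minimal pure-Python sketch of the seven-level loop nest, assuming stride 1 and no padding as in Eq. (1). The function name conv_direct and the nested-list tensor representation are choices made for this sketch only; the sample values are the 3×3 input and 2×2 filter of Fig. 2.

```python
def conv_direct(inp, weight):
    """Dense direct convolution following Eq. (1): stride 1, no padding.

    inp:    nested list with shape [N][C][H][W]
    weight: nested list with shape [M][C][R][S]
    returns a nested list with shape [N][M][E][F]
    """
    N, C = len(inp), len(inp[0])
    H, W = len(inp[0][0]), len(inp[0][0][0])
    M = len(weight)
    R, S = len(weight[0][0]), len(weight[0][0][0])
    E, F = H - R + 1, W - S + 1
    out = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    # Each iteration of the innermost statement is one multiply-accumulate.
    for n in range(N):
        for m in range(M):
            for c in range(C):
                for h in range(E):
                    for w in range(F):
                        for r in range(R):
                            for s in range(S):
                                out[n][m][h][w] += (
                                    inp[n][c][h + r][w + s] * weight[m][c][r][s]
                                )
    return out


# One image, one channel, one filter (N = C = M = 1).
inp = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
weight = [[[[1, 2], [3, 4]]]]
out = conv_direct(inp, weight)
print(out[0][0])
```

The loop order matches Algorithm 1; any permutation of the seven loops yields the same result, which is what allows the dataflow to be reorganized for a particular architecture.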
Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs Conference’17, July 2017, Washington, DC, USA
[Figure 3 diagram: O (output fmap, M × EF) = W (filters, M × CRS) × I (input fmap, CRS × EF).]
Figure 3: Matrix multiplication is used when computing one
output feature map from one input feature map.
2.2 The Lowering Method
To leverage highly optimized GEMM (General Matrix Multiply)
libraries, CONV layers in DNNs are usually mapped to matrix mul-
tiplication. Fig. 2 gives an example of transforming 2-D convolution
into matrix multiplication. The 2-D filter is flattened into a 1-D
array, and the input features are filled into a matrix such that the dot
product of the 1-D array and each column of the matrix generates
an output element. This process is called the lowering method [10].
Extending this process to the 3-D convolution in Fig. 1, the filters
are reshaped into a matrix W with dimensions M × CRS, and an input
matrix is gathered by duplicating the original input data into a
matrix I with dimensions CRS × EF. After this transformation, the
convolution is replaced by a single matrix multiplication (Fig. 3) that
forms an output matrix O with dimensions M × EF.
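The lowering transformation can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the im2col kernel of any particular framework; the helper names im2col and matmul are chosen for this sketch, and the data is the single-channel example of Fig. 2.

```python
def im2col(ifmap, R, S):
    """Lower a [C][H][W] input into a (C*R*S) x (E*F) matrix by copying
    every sliding-window element (stride 1, no padding)."""
    C, H, W = len(ifmap), len(ifmap[0]), len(ifmap[0][0])
    E, F = H - R + 1, W - S + 1
    rows = []
    for c in range(C):
        for r in range(R):
            for s in range(S):
                rows.append(
                    [ifmap[c][h + r][w + s] for h in range(E) for w in range(F)]
                )
    return rows


def matmul(A, B):
    """Plain dense matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


ifmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]  # C = 1, H = W = 3
filters = [[1, 2, 3, 4]]                      # M = 1, flattened C*R*S = 4
lowered = im2col(ifmap, 2, 2)                 # CRS x EF = 4 x 4
output = matmul(filters, lowered)             # M x EF = 1 x 4

original_elems = 9
lowered_elems = sum(len(row) for row in lowered)
print(lowered, output, lowered_elems)
```

Even in this tiny case the lowered matrix holds 16 values for an input of 9, which is the duplication that the lowering method pays for in memory traffic.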
There are software libraries designed for GPUs (e.g., cuBLAS)
that highly optimize matrix multiplications. The implementation
is tiled to the memory hierarchy of the target GPU to capture lo-
cality. Due to the simplicity of implementation and consistency of
performance across the parameter space, the lowering method is
adopted by most DNN frameworks (e.g. TensorFlow [1], Caffe [24],
Theano [4], and Torch7 [11]).
The downside of using GEMM for CONV layers is that there is
redundant data in the input matrix I as highlighted in Fig. 2. This
can lead to inefficiency in storage and waste of bandwidth at run-
time. Constructing I requires duplicating the input features up to
R ×S times, which might require a prohibitively large memory space
allocation. In this case, implementations (e.g., Caffe) need to ma-
terialize I piece by piece, e.g., by calling GEMM iteratively for
each element of the batch. However, this approach limits the paral-
lelism, and can lead to cases where the matrix multiplications are
too small to efficiently utilize the GPU [10]. Besides, the operation
of forming I in memory itself is costly, requiring significant mem-
ory traffic. More importantly, due to duplication, lowering reduces
data reuse opportunities and wastes memory bandwidth at runtime,
which increases the burden of the memory subsystem. The lowering
approach has demonstrated overhead for dense convolution [16],
and unfortunately, this issue becomes unacceptably severe when the
computation becomes sparse after weight pruning.
2.3 Weight Pruning
Weight pruning techniques [19] measure the importance of each
weight and remove those deemed unimportant, resulting in both
memory storage and computation reductions with no accuracy loss.
Matrix (4 × 6):
  10 20  0  0  0  0
   0 30  0 40  0  0
   0  0 50 60 70  0
   0  0  0  0  0 80
value  = [ 10 20 30 40 50 60 70 80 ]
rowptr = [ 0 2 4 7 8 ]
colidx = [ 0 1 1 3 2 3 4 5 ]
Figure 4: An example of the compressed sparse row (CSR) for-
mat.
After weight pruning, redundant weights and related MAC operations
are removed. One such method, Deep Compression [18], can
reduce the number of weights in AlexNet [26] and VGG-16 [41] by
9× and 13×, respectively.
After pruning, the remaining weights are stored in a sparse matrix
format. Compressed sparse row (CSR) format, as shown in Fig. 4,
is often used to store the sparse weight matrix in a compressed form.
The CSR data structure consists of three arrays. The data array
value stores only the non-zero elements, row by row. To recover
the original location of each non-zero element, two auxiliary
arrays are added. The column-indices array colidx contains nnz
integers (nnz is the total number of non-zero elements), and entry
colidx[i] is the column index of the ith element in value. The
row-pointers array rowptr contains M + 1 integers (M is the number of
rows of the matrix), and entry rowptr[i] is the starting index in
colidx of the ith row. This implies that rowptr[i + 1] − rowptr[i] is
the number of non-zero elements in the ith row.
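The three CSR arrays can be built with a short pure-Python routine. This is a minimal sketch for illustration; the function names dense_to_csr and csr_row are not from the paper or from CUSPARSE. The input is the 4 × 6 matrix of Fig. 4.

```python
def dense_to_csr(matrix):
    """Return (value, rowptr, colidx) for a dense nested-list matrix."""
    value, colidx, rowptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                value.append(x)
                colidx.append(j)
        # rowptr[i + 1] marks where row i ends in value/colidx.
        rowptr.append(len(value))
    return value, rowptr, colidx


def csr_row(value, rowptr, colidx, i):
    """List the (column, value) pairs stored for row i."""
    return [(colidx[k], value[k]) for k in range(rowptr[i], rowptr[i + 1])]


dense = [
    [10, 20, 0, 0, 0, 0],
    [0, 30, 0, 40, 0, 0],
    [0, 0, 50, 60, 70, 0],
    [0, 0, 0, 0, 0, 80],
]
value, rowptr, colidx = dense_to_csr(dense)
row2 = csr_row(value, rowptr, colidx, 2)
print(value, rowptr, colidx, row2)
```

Reading a row back requires two indirections (rowptr, then colidx), which is the runtime decoding cost that sparse kernels must absorb.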
Using the CSR format, the memory space used to store the weight
matrix is (2 × nnz + M + 1) × 4 bytes (assuming 4-byte floating-point
values and 4-byte integer indices). We define the sparsity of a sparse
matrix as the ratio of the number of zero values to the total number
of cells in the matrix. Since more than 80% of the weights are set to
zero by the pruning technique, the sparsity of the weight matrix is often
over 0.8, i.e. nnz < 0.2 × total (total is the total number of cells in
the matrix), and the memory space consumed by the compressed weight
matrix is then less than 40% of the original dense matrix (assuming
M ≪ nnz). This can enable deeper CNN models in the future,
and is also important for deployment of CNN models in memory
constrained platforms, such as desktops and mobile devices.
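The storage estimate above can be checked with a small calculation. The layer shape (M = 256 rows, CRS = 2304 columns) and the 85% sparsity used here are hypothetical values chosen only to exercise the formula; they are not measurements from the paper.

```python
def csr_bytes(nnz, num_rows, word=4):
    """CSR footprint: value (nnz) + colidx (nnz) + rowptr (num_rows + 1) words."""
    return (2 * nnz + num_rows + 1) * word


def dense_bytes(num_rows, num_cols, word=4):
    """Dense footprint: one word per matrix cell."""
    return num_rows * num_cols * word


M, CRS = 256, 2304           # hypothetical weight-matrix shape
total = M * CRS
nnz = total * 15 // 100      # hypothetical: 85% of the weights pruned
ratio = csr_bytes(nnz, M) / dense_bytes(M, CRS)
print(f"CSR uses {ratio:.1%} of the dense storage")
```

Because every non-zero costs two words (value and column index), CSR only saves space once the sparsity exceeds roughly 0.5; the rowptr term is negligible when M ≪ nnz.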
2.4 Limitations of Lowering for Sparse CNN
Despite the advantage of dramatic reduction of MAC operations,
the sparsity of pruned networks often leads to performance loss in
CNN computation on GPUs [47]. This is because sparse weight
matrices lose the regular structure of dense matrices. On GPUs, the
sparse matrix computation [5, 8, 13] cannot make full use of
hardware features such as memory coalescing. Also, dense matrix
optimizations, like matrix tiling, are less effective [47]. Therefore
sparsity brings limited benefit if running on GPUs. Worse still, extra
overhead is needed to decode the sparse format at runtime. With
limited benefit, it is not surprising that this overhead would lead to
performance degradation.
Fig. 8 illustrates execution time spent on CONV layers when
performing inference on NVIDIA GPUs using CUBLAS and CUS-
PARSE respectively. For both methods, the weight matrices are
pruned, but they are stored as dense matrices (filled with lots of
zeros) for CUBLAS and as sparse matrices (i.e. CSR) for CUS-
PARSE. We can observe a consistent performance loss on the Tesla
Algorithm 2 Sequential Sparse Convolution [37]
1: procedure SCONV(in, W, out)
2:   for n in [0, N) do
3:     for m in [0, M) do
4:       for j in [W.rowptr[m], W.rowptr[m+1]) do
5:         off ← W.colidx[j]
6:         val ← W.value[j]
7:         for h in [0, E) do
8:           for w in [0, F) do
9:             out[n][m][h][w] += val *
10:              in[n][off + f(0, h, w)]
P100 GPU. On the GTX 1080Ti GPU, CUSPARSE achieves only a
limited performance improvement over CUBLAS. This unsatisfactory
performance motivates us to rethink the mapping of
convolution operations to GPUs and optimize the implementation
for the data-parallel architecture.
2.5 GPU Programming and Memory Hierarchy
From the programmers’ point of view, each CUDA kernel includes
groups of threads called thread blocks. All threads in a thread
block are guaranteed to execute concurrently on the same streaming
multiprocessor (SM). Within each thread block, subgroups of threads
called warps (usually containing 32 threads) are executed in lockstep
fashion. This programming paradigm is defined to fit GPU’s SIMT
architecture [? ]. When a multiprocessor is given one or more thread
blocks to execute, it partitions them into warps and each warp gets
scheduled for execution on the SIMD execution units.
The GPU memory hierarchy consists of several levels of storage
with variable sizes, properties, and access constraints. Register files
are the closest to the streaming multiprocessor, and they are local
memories for each thread. Shared memory, a.k.a. scratchpad, is
programmer manageable and can be shared by the threads in the
same thread block. At the same level there is a hardware-managed
read-only cache, which is used to hold the read-only data specified
by the programmer. The L2 cache is shared across all threads of
the entire CUDA kernel and usually works as the central point of
coherency. Memory requests reach the off-chip GDDR
or HBM2 DRAM only when the required data is not found in any of the
above levels [35].
3 ESCORT DESIGN
As mentioned, the lowering approach replicates the input features
multiple times, significantly reducing arithmetic intensity, and this
issue is particularly severe for sparse convolution since its intensity
is already much lower than that of dense convolution. To avoid this
limitation, we use the direct sparse convolution method [37] to perform
convolution. We then map the operations onto GPUs with SIMT
parallelism in mind. We also analyze the memory access pattern
of sparse convolution and employ optimization techniques to im-
prove data locality. For various layers with different parameters (e.g.
the sizes of filters and ofmaps), we adaptively apply customized
compute kernels to improve efficiency.
3.1 Overview
For the lowering method, materializing the lowered matrix in mem-
ory can be costly for GPUs whose memory size is relatively limited.
To avoid this overhead, cuDNN materializes the lowered matrix by
lazily loading the input matrix into on-chip cache at runtime, rather
than by constructing it in off-chip memory before calling a GEMM
routine [10]. Escort follows this approach, but adapts it for direct
sparse convolution. A 1-D array is used to hold the ifmaps, padded
if necessary. As the computation proceeds, we dynamically compute
the offset of the input array, and then use the index to load the correct
elements into on-chip memories. After the computation is complete,
we perform the required index calculation to store the result in the
correct output position. We refer to this technique as dynamic indexing.
In SkimCaffe [37], a layout function f is defined such that
f(c, y, x) maps to the offset of the (c, y, x)th element of
the input array (assuming that f(c, y + r, x + s) = f(c, y, x) + f(0, r, s)).
For example, in CHW layout, f(c, r, s) = (c · H_in + r) · W_in + s. We
use this function f to compute the correct index into the input array.
The dynamic indexing approach improves arithmetic intensity, at
the cost of dynamically calculating the index of input array. This
trade-off is made based on the fact that sparse computation is often
highly memory bound and it is important to reduce off-chip memory
accesses to improve GPU efficiency, even at the cost of more index
calculation.
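The layout function and the additivity assumption it relies on can be illustrated as follows. This is a sketch assuming the CHW layout given in the text; make_layout is a helper name introduced here, and the 2-channel 6×6 input is a made-up example.

```python
def make_layout(h_in, w_in):
    """Return the CHW layout function f(c, y, x) for an input of height
    h_in and width w_in (both already including any padding)."""
    def f(c, y, x):
        return (c * h_in + y) * w_in + x
    return f


C, H_IN, W_IN = 2, 6, 6                 # hypothetical input shape
f = make_layout(H_IN, W_IN)

# A flattened input whose value at (c, y, x) encodes its own coordinates.
flat = [None] * (C * H_IN * W_IN)
for c in range(C):
    for y in range(H_IN):
        for x in range(W_IN):
            flat[f(c, y, x)] = (c, y, x)

# Dynamic indexing: the address of the element under filter tap (c, r, s)
# at output position (h, w) is the tap offset plus the position offset.
c, r, s, h, w = 1, 2, 0, 3, 1
index = f(c, r, s) + f(0, h, w)
print(index, flat[index])
```

The additivity property is what lets the per-weight part f(c, r, s) be precomputed once, leaving only f(0, h, w) to be evaluated inside the inner loops.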
To match the dimension of the input array, the weight matrix is
stretched beforehand. This preprocessing is performed only once, when
constructing the sparse weight matrix (i.e. the CSR data structures).
We refer to this preprocessing operation as weight stretching [37]. This
operation only modifies the column indices of the weight matrix
which are stored in the colidx array. No extra memory space is
consumed.
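A possible form of weight stretching is sketched below, under the assumption that each original column index encodes a filter tap (c, r, s) as c·R·S + r·S + s and that the input uses the CHW layout described earlier. The function name stretch_colidx is hypothetical, and for clarity the sketch returns a new list rather than overwriting colidx in place.

```python
def stretch_colidx(colidx, R, S, h_in, w_in):
    """Map each column index from the M x CRS weight matrix to the offset
    f(c, r, s) = (c * h_in + r) * w_in + s in the flattened input array."""
    stretched = []
    for col in colidx:
        c = col // (R * S)      # input channel of this weight
        r = (col // S) % R      # filter row
        s = col % S             # filter column
        stretched.append((c * h_in + r) * w_in + s)
    return stretched


# Filter of Fig. 5: weight 2 at (r=1, s=2) and weight 3 at (r=2, s=0),
# one channel, applied to a 6x6 input.
single_channel = stretch_colidx([5, 6], R=3, S=3, h_in=6, w_in=6)

# The same taps in a second channel (c = 1) of a hypothetical 2-channel filter.
second_channel = stretch_colidx([9 + 5, 9 + 6], R=3, S=3, h_in=6, w_in=6)
print(single_channel, second_channel)
```

With the 6×6 input numbered 1 to 36 in row-major order, offsets 8 and 12 point at elements 9 and 13, the top-left corners of the two sub-matrices shown in Fig. 5.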
The sequential algorithm of direct sparse convolution is shown
in Algorithm 2. For each ofmap (line 2) and each output
channel in the ofmap (line 3), it traverses all the non-zero elements
in the corresponding filter (line 4), and gets the offset (line 5) and
weight value (line 6) from the CSR data structures. It then iterates
over the 2-D output channel in row-major order (lines 7 and 8). Finally,
it loads the input feature using the dynamically calculated index,
multiplies it by the weight value, and accumulates the product into the
correct output location (lines 9 and 10).
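Algorithm 2 translates almost line for line into Python. The sketch below is sequential and unoptimized, and assumes the colidx array has already been stretched to input-array offsets; the parameter names (in_size for the number of elements per input image, h_in and w_in for the input shape) are introduced here for illustration. The data is the example of Fig. 7: a 5×5 input numbered 1 to 25 and a 3×3 filter with non-zeros 1, 2 and 3.

```python
def f(c, y, x, h_in=5, w_in=5):
    """CHW layout function for the 5x5 example input."""
    return (c * h_in + y) * w_in + x


def sconv(inp_flat, rowptr, colidx, value, N, M, E, F, in_size):
    """Sequential direct sparse convolution (Algorithm 2).

    inp_flat holds N flattened input images of in_size elements each;
    colidx holds stretched offsets into one flattened image."""
    out = [[[[0.0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):
        base = n * in_size
        for m in range(M):
            for j in range(rowptr[m], rowptr[m + 1]):
                off = colidx[j]
                val = value[j]
                for h in range(E):
                    for w in range(F):
                        out[n][m][h][w] += val * inp_flat[base + off + f(0, h, w)]
    return out


inp_flat = list(range(1, 26))                 # 5x5 input, values 1..25
# Non-zeros at (r, s) = (0, 1), (1, 2), (2, 0), stretched to input offsets.
colidx = [f(0, 0, 1), f(0, 1, 2), f(0, 2, 0)]  # [1, 7, 10]
value = [1, 2, 3]
rowptr = [0, 3]                               # a single filter (M = 1)
out = sconv(inp_flat, rowptr, colidx, value, N=1, M=1, E=3, F=3, in_size=25)
print(out[0][0])
```

Only three of the nine filter positions contribute work, and the input is read in place without building a lowered matrix.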
3.2 Parallelism Strategy
A CNN’s dataflow defines how the loops are ordered, partitioned,
and parallelized [9]. A straightforward implementation of Algo-
rithm 2 is not necessarily efficient on GPUs if the dataflow is not
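[Figure 5 diagram: a 3×3 filter with rows [0 0 0], [0 0 2], [3 0 0] and a 6×6 input with elements 1 to 36 in row-major order. Weight 2 multiplies the 4×4 sub-matrix with rows [9 10 11 12], [15 16 17 18], [21 22 23 24], [27 28 29 30]; weight 3 multiplies the 4×4 sub-matrix with rows [13 14 15 16], [19 20 21 22], [25 26 27 28], [31 32 33 34].]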
Figure 5: An example sparse convolution of one 3×3 filter
against a 6×6 input feature with 1 channel [38].
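[Figure 6 diagram: threads t0, t1, t2, t3 of a warp mapped to consecutive elements of the input feature map, a shared element of the weight value array, and consecutive elements of the output feature map.]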
Figure 6: Data-to-thread mapping for a pseudo 4-thread warp.
carefully designed for the underlying architecture. For example, non-
contiguous indirect memory access is a major overhead of typical
sparse-matrix computations on GPUs [5]. If consecutive threads
in a warp access consecutive memory locations, the memory requests
are coalesced into one or several memory transactions to save
memory bandwidth. Otherwise, memory divergence occurs and the
efficiency of the GPU memory subsystem declines sharply [7].
We adopt the dataflow in [38] to minimize memory divergence.
The basic idea is illustrated in Fig. 5. The sparse convolution of a
3×3 filter against a 6×6 input feature can be divided into two parts:
the nonzero weight “2” in the filter times a 4×4 sub-matrix (blue),
and the other nonzero weight “3” in the filter times another 4×4
sub-matrix (red). The final result is then obtained by simply
accumulating the two products. Unstructured computation is avoided
because each multiplication is conducted separately on a regular sub-matrix.
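The decomposition of Fig. 5 can be reproduced directly: each non-zero weight scales one regular sub-matrix of the input, and the scaled sub-matrices are summed. The sketch below uses the filter and input of Fig. 5; the helper name scaled_window is chosen for this illustration.

```python
def scaled_window(inp, weight, r, s, E, F):
    """Return weight * inp[r:r+E, s:s+F] as a nested list."""
    return [[weight * inp[h + r][w + s] for w in range(F)] for h in range(E)]


# 6x6 input with elements 1..36 in row-major order.
inp = [[6 * i + j + 1 for j in range(6)] for i in range(6)]
# Non-zero filter entries as (row, column, weight).
nonzeros = [(1, 2, 2), (2, 0, 3)]
E = F = 4

out = [[0] * F for _ in range(E)]
parts = []
for r, s, wgt in nonzeros:
    part = scaled_window(inp, wgt, r, s, E, F)
    parts.append(part)
    for h in range(E):
        for w in range(F):
            out[h][w] += part[h][w]
print(out)
```

Within each part every row of the sub-matrix is a contiguous run of the input, so the irregularity of the sparse filter never reaches the inner loops.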
The data-to-thread mapping on GPU is shown in Fig. 6. Assum-
ing a 4-thread warp, the accesses to the input array by a warp are
coalesced as long as the array elements with contiguous row or col-
umn indices are stored contiguously. Each thread is responsible for
calculating one output element in the output matrix. When writing
the product sum into the output array, the accesses are also contigu-
ous since consecutive threads are assigned to calculate consecutive
output positions. Thus each non-zero weight is multiplied
with consecutive input data in the same row, and each product is
added to the partial sum of the output element
assigned to the corresponding thread. In this way, we avoid most
uncoalesced memory accesses to GPU global memory, and thus
improve memory access efficiency.
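The benefit of this mapping can be illustrated with a deliberately simplified model of coalescing. The model assumes that a warp's loads are served in 128-byte aligned segments and that elements are 4 bytes wide; both numbers are assumptions of this sketch, and real hardware behavior depends on the GPU generation and cache configuration.

```python
def transactions(addresses, segment_bytes=128):
    """Count the distinct aligned segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in addresses})


WARP = 32   # threads per warp
WORD = 4    # bytes per 32-bit element

# Mapping of Fig. 6: thread t reads element (base + t), i.e. consecutive
# threads read consecutive input elements for the same non-zero weight.
base = 0
contiguous = [(base + t) * WORD for t in range(WARP)]

# A hypothetical scattered pattern: thread t reads element (t * 64),
# as could happen if each thread followed its own column index.
scattered = [(t * 64) * WORD for t in range(WARP)]

coalesced_count = transactions(contiguous)
diverged_count = transactions(scattered)
print(coalesced_count, diverged_count)
```

Under this model the contiguous pattern needs a single segment for the whole warp, while the scattered pattern needs one segment per thread.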
3.3 Locality
The key to highly efficient sparse convolution on GPUs is to max-
imize data locality. Previous research has shown that the overall
performance of memory intensive applications on GPU is highly
affected by its on-chip cache performance [7]. To have a deep under-
standing of the reuse pattern of sparse convolution, we analyze the
nested loop in Algorithm 2. It can be transformed in numerous ways
to capture different reuse patterns of the weights and activations, and
to map the computation to the underlying hardware [36]. For exam-
ple, an input channel is reused against multiple filters to generate
multiple output channels, and there is also ample reuse out of an
[Figure 7 diagram: weights: a 3×3 filter with rows [0 1 0], [0 0 2], [3 0 0]; input features: a 5×5 map with elements 1 to 25 in row-major order; partial sums: [2 3 4; 7 8 9; 12 13 14], [16 18 20; 26 28 30; 36 38 40], [33 36 39; 48 51 54; 63 66 69], with intermediate accumulation [18 21 24; 33 36 39; 48 51 54]; output features: [51 57 63; 81 87 93; 111 117 123].]
Figure 7: An example of data reuse in sparse convolution.
The colored boxes represent the data that is accessed multiple
times.
input channel due to overlapping between sliding windows. A filter
is reused not only when it is sliding across an ifmap, but also against
multiple ifmaps in a batch. Thus the arithmetic intensity of sparse
convolution is significantly higher than that of typical sparse-matrix
computations. Also, direct sparse convolution offers more potential
data reuse than lowered GEMM, since some reuse is lost
when duplicating the input features. This implies that it is possible
to achieve high compute efficiency on GPUs.
In Algorithm 2, three major data structures are used in the sparse
convolution: the input features, the sparse weight matrix (CSR
format), and the output features. Accordingly, there are three
types of dataflow to capture reuse [42]: 1) Weight Stationary
minimizes the overhead of loading weights by maximizing the
accesses of weights in the on-chip cache; 2) Output Stationary
minimizes the overhead of reading and writing the partial sums by
keeping the accumulation of partial sums in the on-chip cache while
streaming the input features across the processor and broadcasting the
weights; and 3) Input Stationary minimizes the overhead of
reading inputs by keeping the input features in the cache and streaming
the weights.
We try to maximize the reuse and accumulation in the cache for
all types of data, i.e., weights, inputs and partial sums. We assign the
work of processing one output channel to a thread block. It keeps
the corresponding filter weights stationary inside the cache, and then
streams the input features into the SM. Since there are overlaps of
input features between different sliding windows, the input features
are also kept in the cache and reused. Fig. 7 shows an example
of the data reuse captured when calculating sparse convolution. In
this case, each element read from the sparse weight matrix is reused
E × F times.
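The reuse in the example of Fig. 7 can be quantified by counting accesses. The sketch below simply tallies how often each weight and each input element is read by the direct sparse convolution loops; count_accesses is a helper written for this illustration and does not model any cache.

```python
def count_accesses(H, W, R, S, nonzeros):
    """Tally reads per non-zero weight and per input element for one
    single-channel filter applied with stride 1 and no padding."""
    E, F = H - R + 1, W - S + 1
    input_reads = [[0] * W for _ in range(H)]
    weight_reads = {}
    for (r, s) in nonzeros:
        weight_reads[(r, s)] = 0
        for h in range(E):
            for w in range(F):
                weight_reads[(r, s)] += 1
                input_reads[h + r][w + s] += 1
    return input_reads, weight_reads


# Fig. 7: 5x5 input, 3x3 filter with non-zeros at (0, 1), (1, 2), (2, 0).
input_reads, weight_reads = count_accesses(5, 5, 3, 3, [(0, 1), (1, 2), (2, 0)])
total_reads = sum(sum(row) for row in input_reads)
touched = sum(1 for row in input_reads for n in row if n > 0)
print(weight_reads, total_reads, touched)
```

Each weight is read E × F = 9 times, and the 27 input reads fall on fewer distinct elements because the three windows overlap; those repeated reads are the ones an on-chip cache can serve.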
To fully exploit data locality, we should carefully arrange the
data placement, i.e. find the suitable kinds of memory to place
different types of data. The input features and the weight matrix are
read-only, while the output features are written. Since the weight
matrix is stored in CSR format, we use the threads in a thread block
to cooperatively load the colidx and value arrays into
shared memory. These are all coalesced memory accesses. Since the input
data is not modified throughout the entire process, we put it in
the read-only cache so that it can be shared across thread blocks
running on the same SM and reused multiple times. The partial
sums are kept in the register file to keep the accumulation local
and fast.
3.4 Kernel Customization
Implementations following the direct sparse convolution approach
should be specifically optimized for convolutions in certain parts of
the parameter space. The major factors to consider include
the filter size, the ofmap size, the batch size and the stride. We use
C++ templates to generalize the kernel source, and let the compiler
do the work of generating customized kernels for specific parameters.
The optimization space we explore includes the grid shape and
thread block size. Besides, to improve performance when the filter
size is smaller than 3×3, cuDNN uses the Winograd [28] algorithm instead
of lowering onto matrix multiplication to perform convolution.
This approach is compatible with Escort. We leave this as future
work.
4 EVALUATION
We evaluate the performance of Escort on the two platforms shown in
Table 2. The NVIDIA Tesla P100 [35] represents a data-center server
platform. The NVIDIA GeForce GTX 1080Ti represents a desktop
environment. Escort is implemented as an extension of the Caffe deep
learning framework [24]. We use gcc 4.8 and NVCC 8.0 for compilation.
nvprof [33] is used to collect execution time and performance
metrics of CUDA kernels. All the experiments use the 32-bit floating
point data type and a batch size of 128. We use trained and pruned
models of AlexNet [26], GoogLeNet [43], and ResNet [20], which are
available in the SkimCaffe repository [22] (along with the sparsity
information). All these models are trained on the ImageNet [14]
ILSVRC-2012 dataset. Details about the models are listed in Table 3.
Since the optimizations used in Escort have no effect on accuracy, our
evaluation focuses on performance (i.e. inference speed).
               GTX 1080Ti       Tesla P100
# of cores     3584             3584
Boost Clock    1582 MHz         1480 MHz
Mem. Size      11 GB GDDR5X     16 GB HBM2
Bandwidth      484 GB/s         732 GB/s
Table 2: Evaluated GPU Platforms
Model        # of CONV Layers   # of Sparse CONV Layers   Weights   MACs
AlexNet      5                  4                         61M       724M
GoogLeNet    57                 19                        7M        1.43G
ResNet       53                 16                        25.5M     3.9G
Table 3: Summary of Networks
[Figure 8 plot: y-axis: Runtime Speedup (0 to 3, with one clipped bar labeled 5.57); x-axis groups: AlexNet-P100, GoogLeNet-P100, ResNet-P100, AlexNet-1080Ti, GoogLeNet-1080Ti, ResNet-1080Ti, Geomean; series: CUBLAS, CUSPARSE, Escort.]
Figure 8: Execution time speedup of sparse CONV layers in
three models all normalized to the CUBLAS approach. Dense
CONV layers and other layers are not included in this experi-
ment.
[Figure 9 plot: y-axis: Execution time (ms), 0 to 1600; x-axis: CUBLAS, CUSPARSE and Escort bars for each of AlexNet, GoogLeNet and ResNet; stacked components: sgemm, csrmm, im2col, sconv, pad_in.]
Figure 9: Execution Time Breakdown for Sparse CONV Lay-
ers.
4.1 Sparse CONV Performance
Firstly, we compare the performance of sparse CONV layers in
CUBLAS, CUSPARSE and Escort. Fig. 8 shows the normalized
execution time of the sparse CONV layers in these three implemen-
tations. We collect the timing information using nvprof. We only
accumulate execution time related to sparse CONV layers, i.e., the
execution time spent on dense CONV layers and other non-CONV
layers (such as FC, ReLU, LRN and Pooling layers) is not collected.
We can observe that CUSPARSE on Tesla P100 suffers a consistent
performance degradation compared to CUBLAS due to the irregu-
larity of the sparse kernels in CUSPARSE, while on GTX 1080Ti,
CUSPARSE can accelerate GoogLeNet and ResNet by 1.25× and
2.33×, but still causes slowdown for AlexNet. In contrast,
Escort consistently achieves significant performance improvement
over CUBLAS, with speedups ranging from 1.50× to 5.57×. On average,
sparse CONV layers in Escort are 2.63× and 3.07× faster than those
in CUBLAS and CUSPARSE respectively.
4.2 Execution Time Breakdown
To investigate the performance effect in detail, we break down the
execution time of sparse CONV layers on Tesla P100 into several parts,
each of which is a CUDA kernel. The kernel timing is collected
using nvprof. The kernels include: sgemm, csrmm, im2col,
sconv and pad_in. sgemm is the dense matrix multiplication
Escoin: Efficient Sparse Convolutional Neural Network Inference on GPUs Conference’17, July 2017, Washington, DC, USA
 0
 0.2
 0.4
 0.6
 0.8
 1
AlexNet-tex
GoogLeNet-tex
ResNet-tex
AlexNet-l2
GoogLeNet-l2
ResNet-l2
Ca
ch
e 
Hi
t R
at
e
csrmm
sconv
Figure 10: Cache hit rate of two CUDA kernels on Tesla P100.
routine in CUBLAS. csrmm is the sparse matrix dense matrix mul-
tiplication routine in CUSPARSE. im2col is the CUDA kernel to
lower the input data onto matrices. sconv is CUDA kernel of our
proposed sparse convolution in Escort. pad in is the kernel that
Escort uses to pad the input data. Fig. 9 shows the execution time dis-
tribution of different CNN models using different approaches. Since
CUBLAS and CUSPARSE are both base on the lowering method,
they have the same execution time spent on im2col. Escort does
not require this data transformation, and the input padding process
pad in is less costly than im2col. As for the core computation
part, sgemm is faster than csrmm, due to the irregularity of sparse
matrix computation. However, sconv is faster than sgemm, which
demonstrates the effectiveness of our optimization techniques.
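To make the difference between the lowering path (im2col followed by a CSR sparse-dense multiplication) and direct sparse convolution concrete, here is a minimal pure-Python sketch. It is not the sconv CUDA kernel and says nothing about its parallelization. The helper names (im2col, to_csr, conv_lowered, conv_direct), the tiny tensor sizes and the weight values are all assumptions chosen for illustration, and stride 1 with no padding is assumed.

```python
# Illustrative sketch only: stride 1, no padding, tiny made-up sizes.
C, H, W = 2, 4, 4              # input channels, height, width
R, S = 3, 3                    # filter height, width
E, F = H - R + 1, W - S + 1    # output height, width

def im2col(x):
    """Lower the input to a (C*R*S) x (E*F) matrix; overlapping pixels are copied."""
    return [[x[c][e + r][f + s] for e in range(E) for f in range(F)]
            for c in range(C) for r in range(R) for s in range(S)]

def to_csr(dense_rows):
    """Convert a dense weight matrix (one row per output channel) to CSR."""
    vals, col_idx, row_ptr = [], [], [0]
    for row in dense_rows:
        for j, v in enumerate(row):
            if v != 0:
                vals.append(v)
                col_idx.append(j)
        row_ptr.append(len(vals))
    return vals, col_idx, row_ptr

def conv_lowered(csr, lowered):
    """Lowering path: CSR weights times the im2col matrix."""
    vals, col_idx, row_ptr = csr
    out = []
    for m in range(len(row_ptr) - 1):
        acc = [0.0] * (E * F)
        for k in range(row_ptr[m], row_ptr[m + 1]):
            row = lowered[col_idx[k]]
            for j in range(E * F):
                acc[j] += vals[k] * row[j]
        out.append(acc)
    return out

def conv_direct(csr, x):
    """Direct path: each non-zero weight reads the original input, no im2col."""
    vals, col_idx, row_ptr = csr
    out = []
    for m in range(len(row_ptr) - 1):
        acc = [0.0] * (E * F)
        for k in range(row_ptr[m], row_ptr[m + 1]):
            c, rs = divmod(col_idx[k], R * S)   # decode column -> (c, r, s)
            r, s = divmod(rs, S)
            for e in range(E):
                for f in range(F):
                    acc[e * F + f] += vals[k] * x[c][e + r][f + s]
        out.append(acc)
    return out

# Made-up input and a mostly-zero weight matrix with two output channels.
x = [[[float(c * H * W + h * W + w) for w in range(W)]
      for h in range(H)] for c in range(C)]
weights = [[0.0] * (C * R * S) for _ in range(2)]
weights[0][0], weights[0][4], weights[0][17] = 1.0, -2.0, 0.5
weights[1][9] = 3.0

csr = to_csr(weights)
lowered = im2col(x)
out_lowered = conv_lowered(csr, lowered)
out_direct = conv_direct(csr, x)

input_elems = C * H * W
lowered_elems = len(lowered) * len(lowered[0])
print(out_lowered == out_direct, input_elems, lowered_elems)
```

Both paths produce the same output, but in this toy case the lowered matrix holds 72 elements for an input of 32, which is the kind of duplication the direct method avoids.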
4.3 On-chip Memory Efficiency
Fig. 10 compares the texture (i.e., read-only) cache and L2 cache hit
rates of two CUDA kernels, csrmm and sconv. The results are collected
by nvprof. For all three models, sconv in Escort consistently achieves
better read-only cache performance (hit rates from 71% to 81%)
compared to csrmm in CUSPARSE (hit rates from 52% to 57%). For the L2
cache, we observe a similar trend. This is reasonable because cache
tiling is not as effective for sparse matrix computation as for its
dense counterpart [47], and some data reuse has already been lost
when the lowering step duplicates the input features. In contrast, we
store the weights and input features separately in different kinds of
on-chip memories, avoiding possible cache conflicts, and adaptively
tile the output channel to make good use of the read-only cache.
4.4 Overall Performance
Fig. 11 illustrates the overall inference performance of the three
approaches. In this experiment, we collect the execution time spent
on an entire iteration (i.e., the time spent on processing one batch)
in Caffe. To avoid noise, we run 10 iterations and calculate the
average time. We observe a similar trend to Fig. 8, but the
performance variation among different approaches is less significant
since we add up the execution time of all the other layers. Even so,
Escort still achieves consistent speedup over CUBLAS, i.e., 1.47×,
1.18× and 1.19× on Tesla P100, and 1.74×, 1.34× and 1.43× on GTX
1080Ti, for AlexNet, GoogLeNet and ResNet respectively. Escort gets
the smallest speedup for GoogLeNet because a large portion of its
CONV layers are dense and cannot benefit from our sparse convolution
method. As for ResNet, the performance of CUSPARSE and Escort is not
as significantly affected as that of AlexNet because of its
relatively lower proportion of CONV layers in all layers.
[Figure 11 plot: runtime speedup (y-axis, 0.2 to 1.8) of CUBLAS,
CUSPARSE and Escort for AlexNet-P100, GoogLeNet-P100, ResNet-P100,
AlexNet-1080Ti, GoogLeNet-1080Ti, ResNet-1080Ti and Geomean.]
Figure 11: Overall performance speedup of three models, all
normalized to the CUBLAS approach.
On average, Escort achieves a geomean speedup of 1.38× over the
CUBLAS approach, which is the default GPU configuration of Caffe.
Compared to CUSPARSE, the speedup is 1.60×. Note that this
performance improvement requires neither changes to high-level
programs nor modifications to the underlying hardware.
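As a quick sanity check, the reported geomean can be recomputed from the six per-configuration speedups quoted above. The sketch below uses only those published numbers; the helper name geomean is our own, and the last step assumes that the 1.60× figure is also a geomean over the same six configurations.

```python
import math

def geomean(values):
    """Geometric mean, computed via logarithms for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Escort speedups over CUBLAS as reported in the text:
# AlexNet, GoogLeNet, ResNet on Tesla P100, then on GTX 1080Ti.
escort_over_cublas = [1.47, 1.18, 1.19, 1.74, 1.34, 1.43]

g = geomean(escort_over_cublas)
print(f"geomean speedup over CUBLAS: {g:.2f}x")

# Assumption: if 1.60x is a geomean over the same configurations, the
# ratio of the two averages gives CUSPARSE relative to CUBLAS.
implied_cusparse_over_cublas = 1.38 / 1.60
print(f"implied CUSPARSE vs. CUBLAS: {implied_cusparse_over_cublas:.2f}x")
```

The recomputed value rounds to 1.38×, matching the reported average. Under the stated assumption, the ratio of the two averages suggests that the CUSPARSE-based path is somewhat slower end to end than the CUBLAS baseline.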
5 RELATED WORK
Sparse CNN on CPUs. Liu et al. [29] proposed a sparse-dense MM
algorithm for inference on CPUs, which exploits sparsity in the
weights. Park et al. [38] implemented direct sparse convolution for
inference on CPUs and optimized it for different kinds of Intel CPUs.
Meanwhile, Rajbhandari et al. [39] proposed to leverage sparsity for
training DNNs on CPUs and developed an optimization framework to
automatically choose the best performing implementations for various
CNN computations. Vooturi et al. [44] proposed parallel algorithms
to perform efficient inference on multicore CPUs using MKL.
These efforts give us insights for implementing sparse CNNs on
GPUs.
Sparse CNN Accelerators. Recent works have examined how to
efficiently support processing of sparse CNNs in hardware. EIE [17]
compresses the model in the fully connected layers to speed up
inference. Eyeriss [9] gates the multiplier when the input activation
is zero, while Cnvlutin [2] compresses activation values to skip over
the ineffectual computations. However, neither of them leverages
pruning to exploit weight sparsity. Cambricon-X [49] employs weight
sparsity to keep only non-zero weights in its internal buffers.
SCNN [36] keeps both weights and activations in a compressed form and
uses a Cartesian product to compute convolution. Compared with these
hardware solutions, Escort is a pure software approach and requires
no effort from either high-level programmers or hardware designers.
Structured Pruning. Recent works have explored the use of
structured pruning to regularize sparse matrix computation on GPUs.
Structured Sparsity Learning (SSL) [45] adaptively regularizes DNN
structures, and employs locality optimization to accelerate computa-
tion. Scalpel [47] leverages SIMD-aware weight pruning and node
pruning for CPUs and GPUs respectively. DeftNN [21] presents
synapse vector elimination and applies a transformation to the DNN
data layout, producing efficient computations on GPUs. Molchanov et al. [31]
proposed to prune filters to enable efficient inference. Mao et al. [30]
compared different kinds of pruning techniques at different pruning
granularities. Compared with these structured pruning approaches,
Escort directly improves performance on arbitrary sparse networks,
requiring no adjustment of the training and pruning process, and it
has no effect on the inference accuracy.
6 CONCLUSION
CNNs have been applied in a wide range of AI applications and
achieved remarkable performance. To enable deeper and more complex
neural networks on various platforms, e.g., mobile devices, weight
pruning has been proposed to remove redundant parameters.
Unfortunately, pruning generates unstructured sparse matrices and
leads to unsatisfactory inference speed on GPUs, which are suited to
accelerating structured compute kernels. To handle the irregularity
of sparse computation, we propose Escort, an efficient sparse
convolution method customized for GPUs. Escort improves arithmetic
intensity by directly computing sparse convolution instead of
lowering it onto matrix multiplication, and is specifically optimized
for the GPU architecture by orchestrating the parallelism and
exploiting data locality. Our evaluation demonstrates that Escort
outperforms the lowering method implemented on top of either CUBLAS
or CUSPARSE, successfully turning sparsity into inference speedup on
GPUs, not just memory savings.
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey
Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Man-
junath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray,
Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan
Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine
Learning. In Proceedings of the 12th USENIX Conference on Operating Systems
Design and Implementation, 265–283.
[2] Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright
Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free Deep
Neural Network Computing. In Proceedings of the 43rd International Symposium
on Computer Architecture, 1–13.
[3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan
Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos,
Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Tony Han, Awni Y.
Hannun, Billy Jun, Patrick LeGresley, Libby Lin, Sharan Narang, Andrew Y.
Ng, Sherjil Ozair, Ryan Prenger, Jonathan Raiman, Sanjeev Satheesh, David
Seetapun, Shubho Sengupta, Yi Wang, Zhiqian Wang, Chong Wang, Bo Xiao,
Dani Yogatama, Jun Zhan, and Zhenyao Zhu. 2015. Deep Speech 2: End-to-End
Speech Recognition in English and Mandarin. CoRR abs/1512.02595 (2015).
[4] Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J.
Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua
Bengio. 2012. Theano: new features and speed improvements. CoRR abs/1211.5590
(2012).
[5] Nathan Bell and Michael Garland. 2009. Implementing Sparse Matrix-vector
Multiplication on Throughput-oriented Processors. In Proceedings of the Confer-
ence on High Performance Computing Networking, Storage and Analysis, Article
18, 11 pages.
[6] C. Chen, A. Seff, A. Kornhauser, and J. Xiao. 2015. DeepDriving: Learning Affor-
dance for Direct Perception in Autonomous Driving. In 2015 IEEE International
Conference on Computer Vision (ICCV), 2722–2730.
[7] Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang,
and Wen-Mei Hwu. 2014. Adaptive Cache Management for Energy-Efficient
GPU Computing. In Proceedings of the 47th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), 343–355.
[8] Xuhao Chen, Cheng Chen, Jie Shen, Jianbin Fang, Tao Tang, Canqun Yang,
and Zhiying Wang. 2017. Orchestrating parallel detection of strongly connected
components on GPUs. Parallel Comput. (2017).
[9] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze. 2017. Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks.
IEEE Journal of Solid-State Circuits 52, 1 (Jan 2017), 127–138.
[10] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John
Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives
for Deep Learning. CoRR abs/1410.0759 (2014).
[11] Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet. 2011. Torch7: A
Matlab-like Environment for Machine Learning. In BigLearn, Neural Information
Processing Systems Workshop.
[12] Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natu-
ral Language Processing: Deep Neural Networks with Multitask Learning. In
Proceedings of the 25th International Conference on Machine Learning, 160–167.
[13] Steven Dalton, Luke Olson, and Nathan Bell. 2015. Optimizing Sparse Matrix-
Matrix Multiplication for the GPU. ACM Trans. Math. Softw. 41, 4, Article 25
(Oct. 2015), 20 pages.
[14] J. Deng, W. Dong, R. Socher, L. J. Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A
large-scale hierarchical image database. In 2009 IEEE Conference on Computer
Vision and Pattern Recognition, 248–255.
[15] L. Deng, J. Li, J. T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X.
He, J. Williams, Y. Gong, and A. Acero. 2013. Recent advances in deep learning
for speech research at Microsoft. In 2013 IEEE International Conference on
Acoustics, Speech and Signal Processing, 8604–8608.
[16] Stefan Hadjis, Firas Abuzaid, Ce Zhang, and Christopher Ré. 2015. Caffe Con
Troll: Shallow Ideas to Speed Up Deep Learning. In Proceedings of the Fourth
Workshop on Data Analytics in the Cloud, Article 2, 4 pages.
[17] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz,
and William J. Dally. 2016. EIE: Efficient Inference Engine on Compressed
Deep Neural Network. In Proceedings of the 43rd International Symposium on
Computer Architecture, 243–254.
[18] Song Han, Huizi Mao, and William J Dally. 2016. Deep Compression: Com-
pressing Deep Neural Networks with Pruning, Trained Quantization and Huffman
Coding. International Conference on Learning Representations (ICLR) (2016).
[19] Song Han, Jeff Pool, John Tran, and William J. Dally. 2015. Learning Both
Weights and Connections for Efficient Neural Networks. In Proceedings of the
28th International Conference on Neural Information Processing Systems - Vol-
ume 1, 1135–1143.
[20] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Im-
age Recognition. In 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 770–778.
[21] Parker Hill, Animesh Jain, Mason Hill, Babak Zamirai, Chang-Hong Hsu,
Michael A. Laurenzano, Scott Mahlke, Lingjia Tang, and Jason Mars. 2017.
DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse
Vector Elimination and Near-compute Data Fission. In Proceedings of the 50th
Annual IEEE/ACM International Symposium on Microarchitecture, 786–799.
[22] Intel. 2017. SkimCaffe. https://github.com/IntelLabs/SkimCaffe. (2017).
[23] S. Ji, W. Xu, M. Yang, and K. Yu. 2013. 3D Convolutional Neural Networks for
Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence 35, 1 (Jan 2013), 221–231.
[24] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long,
Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolu-
tional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093
(2014).
[25] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. 2014.
Large-Scale Video Classification with Convolutional Neural Networks. In 2014
IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Clas-
sification with Deep Convolutional Neural Networks. In Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105.
[27] Y. LeCun, Y. Bengio, and G. Hinton. 2015. Deep learning. Nature 521, 7553
(May 2015), 436–444.
[28] Sheng R. Li, Jongsoo Park, and Ping Tak Peter Tang. 2017. Enabling Sparse
Winograd Convolution by Native Pruning. CoRR abs/1702.08597 (2017).
[29] Baoyuan Liu, Min Wang, H. Foroosh, M. Tappen, and M. Pensky. 2015. Sparse
Convolutional Neural Networks. In 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 806–814.
[30] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William
Dally. 2017. Exploring the Regularity of Sparse Structure in Convolutional
Neural Networks. In Proceedings of the 31st Conference on Neural Information
Processing Systems (NIPS 2017), 1–10.
[31] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. 2017.
Pruning Convolutional Neural Networks for Resource Efficient Inference. Proc.
International Conference on Learning Representations (ICLR) (2017).
[32] NVIDIA. 2016. cuBLAS Library. http://docs.nvidia.com/cuda/cublas/. (2016).
[33] NVIDIA. 2016. CUDA Profiler Users Guide v8.0. NVIDIA.
[34] NVIDIA. 2016. cuSPARSE Library. http://docs.nvidia.com/cuda/cusparse/.
(2016).
[35] NVIDIA. 2016. NVIDIA Tesla P100: The Most Advanced Datacenter Accelerator
Ever Built. NVIDIA.
[36] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio Puglielli, Rang-
harajan Venkatesan, Brucek Khailany, Joel Emer, Stephen W. Keckler, and
William J. Dally. 2017. SCNN: An Accelerator for Compressed-sparse Con-
volutional Neural Networks. In Proceedings of the 44th Annual International
Symposium on Computer Architecture, 27–40.
[37] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and
Pradeep Dubey. 2017. Faster CNNs with Direct Sparse Convolutions and Guided
Pruning. In the 5th International Conference on Learning Representations, 1–11.
[38] Jongsoo Park, Sheng R. Li, Wei Wen, Hai Li, Yiran Chen, and Pradeep Dubey.
2016. Holistic SparseCNN: Forging the Trident of Accuracy, Speed, and Size.
CoRR abs/1608.01409 (2016).
[39] Samyam Rajbhandari, Yuxiong He, Olatunji Ruwase, Michael Carbin, and Trishul
Chilimbi. 2017. Optimizing CNNs on Multicores for Scalability, Performance
and Goodput. In Proceedings of the Twenty-Second International Conference
on Architectural Support for Programming Languages and Operating Systems,
267–280.
[40] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershel-
vam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalch-
brenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu,
Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep
neural networks and tree search. Nature 529, 7587 (Jan 2016), 484–489.
[41] K. Simonyan and A. Zisserman. 2015. Very Deep Convolutional Networks for
Large-Scale Image Recognition. In 3rd International Conference on Learning
Representation (ICLR 2015).
[42] V. Sze, Y. H. Chen, T. J. Yang, and J. S. Emer. 2017. Efficient Processing of
Deep Neural Networks: A Tutorial and Survey. Proc. IEEE 105, 12 (Dec 2017),
2295–2329.
[43] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. 2015. Going deeper with convolutions. In 2015
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1–9.
[44] Dharma Teja Vooturi, Saurabh Goyal, Anamitra R. Choudhury, Yogish Sabharwal,
and Ashish Verma. 2017. Efficient Inferencing of Compressed Deep Neural
Networks. CoRR abs/1711.00244 (2017).
[45] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning
Structured Sparsity in Deep Neural Networks. In Advances in Neural Information
Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and
R. Garnett (Eds.). Curran Associates, Inc., 2074–2082.
[46] Zhen Xu, Xuhao Chen, Jie Shen, Yang Zhang, Cheng Chen, and Canqun Yang.
2017. GARDENIA: A Domain-specific Benchmark Suite for Next-generation
Accelerators. CoRR abs/1708.04567 (2017).
[47] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das,
and Scott Mahlke. 2017. Scalpel: Customizing DNN Pruning to the Underlying
Hardware Parallelism. In Proceedings of the 44th Annual International Symposium
on Computer Architecture, 548–560.
[48] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convo-
lutional Networks. In Computer Vision – ECCV 2014, 818–833.
[49] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen.
2016. Cambricon-X: An accelerator for sparse neural networks. In 2016 49th
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
1–12.
