SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks by Parashar, Angshuman et al.
SCNN: An Accelerator for Compressed-sparse
Convolutional Neural Networks
Angshuman Parashar† Minsoo Rhu†
Anurag Mukkara‡ Antonio Puglielli∗ Rangharajan Venkatesan† Brucek Khailany†
Joel Emer†‡ Stephen W. Keckler† William J. Dally†
NVIDIA† Massachusetts Institute of Technology‡ UC-Berkeley∗ Stanford University
Abstract—Convolutional Neural Networks (CNNs) have
emerged as a fundamental technology for machine learning.
High performance and extreme energy efficiency are critical for
deployments of CNNs in a wide range of situations, especially
mobile platforms such as autonomous vehicles, cameras, and
electronic personal assistants. This paper introduces the Sparse
CNN (SCNN) accelerator architecture, which improves perfor-
mance and energy efficiency by exploiting the zero-valued weights
that stem from network pruning during training and zero-valued
activations that arise from the common ReLU operator applied
during inference. Specifically, SCNN employs a novel dataflow
that enables maintaining the sparse weights and activations
in a compressed encoding, which eliminates unnecessary data
transfers and reduces storage requirements. Furthermore, the
SCNN dataflow facilitates efficient delivery of those weights and
activations to the multiplier array, where they are extensively
reused. In addition, the accumulation of multiplication products
are performed in a novel accumulator array. Our results show
that on contemporary neural networks, SCNN can improve both
performance and energy by a factor of 2.7× and 2.3×, respec-
tively, over a comparably provisioned dense CNN accelerator.
I. INTRODUCTION
Driven by the availability of massive data and the com-
putational capability to process it, deep learning has recently
emerged as a critical tool for solving complex problems across
a wide range of domains, including image recognition [20],
speech processing [12], [16], [2], natural language process-
ing [8], language translation [10], and autonomous vehi-
cles [21]. Convolutional neural networks (CNNs) have become
the most popular algorithmic approach for deep learning for
many of these domains. Employing CNNs can be decomposed
into two tasks: (1) training — in which the parameters of a
neural network are learned by observing massive numbers of
training examples, and (2) inference — in which a trained
neural network is deployed in the field and classifies the
A version appears in the 44th IEEE/ACM International Symposium on
Computer Architecture (ISCA-44), 2017.
This research was, in part, funded by the U.S. Government, under the
DARPA CRAFT program. The views and conclusions contained in this
document are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of the U.S.
Government.
observed data. Today, training is often done on GPUs [24]
or farms of GPUs, while inference depends on the application
and can employ CPUs, GPUs, FPGA, or specially-built ASICs.
During the training process, a deep learning expert will
typically architect the network, establishing the number of
layers, the operation performed by each layer, and the connec-
tivity between layers. Many layers have parameters, typically
filter weights, which determine their exact computation. The
objective of the training process is to learn these weights,
usually via a stochastic gradient descent-based excursion
through the space of weights. This process typically employs
a forward-propagation calculation for each training example,
a measurement of the error between the computed and de-
sired output, and then back-propagation through the network
to update the weights. Inference has similarities, but only
includes the forward-propagation calculation. Nonetheless, the
computation requirements for inference can be prohibitively
large, particularly with the emergence of deeper networks
(hundreds of layers [17], [18], [26], [29]) and larger inputs
sets, such as high-definition video. Furthermore, the energy
efficiency of this computation is important, especially for
mobile platforms, such as autonomous vehicles, cameras, and
electronic personal assistants.
Recent published works have shown that common networks
have significant redundancy and can be pruned dramatically
during training without substantively affecting accuracy [15].
Our experience shows that the number of weights that can
be eliminated varies widely across the layers but typically
ranges from 20% to 80% [15], [14]. Eliminating weights
results in a network with a substantial number of zero values,
which can potentially reduce the computational requirements
of inference.
The inference computation also offers a further optimization
opportunity. In specific, many networks employ as their non-
linear operator the ReLU (rectified linear unit) function which
clamps all negative activation values to zero. The activations
are the output values of an individual layer that are passed
as inputs to the next layer. Our experience shows that for
typical data sets, 50–70% of the activations are clamped to
zero. Since the multiplication of weights and activations is the
key computation for inference, the combination of these two
ar
X
iv
:1
70
8.
04
48
5v
1 
 [c
s.N
E]
  2
3 M
ay
 20
17
factors can reduce the amount of computation required by over
an order of magnitude. Additional benefits can be achieved by
a compressed encoding for zero weights and activations, thus
allowing more to fit in on-chip RAM and eliminating energy-
costly DRAM accesses.
In this paper, we introduce the Sparse CNN (SCNN) ac-
celerator architecture, a new CNN inference architecture that
exploits both weight and activation sparsity to improve both
performance and power. Previous works have employed tech-
niques for exploiting sparsity, including saving computation
energy for zero-valued activations and compressing weights
and activations stored in DRAM [7], [6]. Other works have
used a compressed encoding of activations [1] or weights [30]
in parts of their dataflow to reduce data transfer bandwidth
and save time for computations of some multiplications with
a zero operand. While these prior architectures have largely
focused on eliminating computations and exploiting some
data compression, SCNN couples an algorithmic dataflow
that eliminates all multiplications with a zero operand while
employing a compressed representation of both weights and
activations through almost the entire computation.
At the heart of the SCNN design is a processing element
(PE) with a multiplier array that accepts a vector of weights
and a vector of activations. Unlike previous convolutional
dataflows [5], [11], [7], [25], the SCNN dataflow only delivers
weights and activations to the multiplier array that can all
be multiplied by one another in the manner of a Cartesian
product. Furthermore, the activation vectors are reused in
an input stationary [6] fashion against a number of weight
vectors to reduce data accesses. Finally, only non-zero weights
and activations are fetched from the input storage arrays and
delivered to the multiplier array. As with any CNN accelerator,
SCNN must accumulate the partial products generated by the
multipliers. However, since the products generated by the mul-
tiplier array cannot be directly summed together, SCNN tracks
the output coordinates associated with each multiplication and
sends the coordinate and product to a scatter accumulator array
for summing.
To increase parallelism beyond a single PE, multiple PEs
can be run in parallel with each working on a disjoint 3D tile
of input activations. Because of the end-to-end compression
of activations, SCNN can keep the both the input and output
activations of each tile local to its PE, further reducing energy-
hungry data transmission. Overall, this results in a design with
efficient compressed storage and delivery of input operands,
high reuse of the input operands in the multiplier array, and
that spends no time on multiplications with zero operands.
To evaluate SCNN, we developed a cycle-level performance
model and a validated analytical model that allows us to
quickly explore the design space of different types of accel-
erators. We also implemented an SCNN PE in synthesizable
System C and compiled the design into gates using a combina-
tion of commercial high-level synthesis (HLS) tools and a tra-
ditional verilog compiler. Our results show that 64 PE SCNN
implementation with 16 multipliers per PE (1024 multipliers
in total) can be implemented in approximately 7.9mm2, which
TABLE I
NETWORK CHARACTERISTICS. WEIGHTS AND ACTIVATIONS ASSUME A
DATA-TYPE SIZE OF TWO BYTES.
# Conv. Max. Layer Max. Layer Total #
Network Layers Weights Activations Multiplies
AlexNet [20] 5 1.73 MB 0.31 MB 0.69 B
GoogLeNet [28] 54 1.32 MB 1.52 MB 1.1 B
VGGNet [27] 13 4.49 MB 6.12 MB 15.3 B
is somewhat larger than an equivalently provisioned dense
accelerator architecture due to the overheads of managing the
sparse dataflow. On a range of networks, SCNN provides a
factor of 2.7× speedup and a 2.3× energy reduction relative
to the dense architecture.
II. MOTIVATION
Convolutional Neural Network algorithms (CNNs) are es-
sentially a cascaded set of pattern recognition filters trained
with supervision [21]. A CNN consists of a series of layers,
which include convolutional layers, non-linear scalar operator
layers, and layers that downsample the intermediate data, for
example by pooling. The convolutional layers represent the
core of the CNN computation and are characterized by a
set of filters that are usually 1×1 or 3×3, and occasionally
5×5 or larger. The values of these filters are the weights that
are trained using a training set for the network. Some deep
neural networks (DNNs) also include fully-connected layers,
typically toward the end of the DNN. During inference, a
new image (in the case of image recognition) is presented
to the network, which classifies into the training categories by
computing in succession each of the layers in the network. The
intermediate data between the layers are called activations and
the output activation of one layer becomes the input activation
of the next layer. In this paper, we focus on accelerating the
convolutional layers as they constitute the majority of the
computation [9].
Table II lists the attributes of three commonly used networks
in image processing: AlexNet [20], GoogLeNet [28], and
VGGNet [27], whose specifications come from the Caffe
BVLC Model Zoo [4]. The increasing layer depth across the
networks represents the successively more accurate networks
in the ImageNet [19] competition. The Maximum Weights and
Activations columns indicate the size of the largest weight and
activation matrices across the layer of the network. The last
column lists the total number of multiplies required to compute
a single inference pass through all of the convolutional layers
of the network. These data and computational requirements
are derived from the standard ImageNet inputs images of
224×224 pixels. Processing larger, higher resolution images
will result in greater computational and data requirements.
Sparsity in CNNs. Sparsity in a layer of a CNN is
defined as the fraction of zeros in the layer’s weight and
input activation matrices. The primary technique for creating
weight sparsity is to prune the network during training. Han,
et al. developed a pruning algorithm that operates in two
2
phases [15]. First, any weight with an absolute value that
is close to zero (e.g. below a defined threshold) is set to
zero. This process has the effect of removing weights from
the filters, and sometimes even forcing an output activation to
always to be zero. Second, the remaining network is retrained,
to regain the accuracy lost through naı¨ve pruning. The result
is a smaller network with accuracy extremely close to the
original network. The process can be iteratively repeated to
reduce network size while maintaining accuracy.
Activation sparsity occurs dynamically during inference and
is highly dependent on the data being processed. Specifically,
the rectified linear unit (ReLU) function that is commonly
used as the non-linear operator in CNNs forces all negatively
valued activations to be clamped to zero. After completing
computation of a convolutional layer, a ReLU function is
applied point-wise to each element in the output activation
matrices before the data is passed to the next layer.
To measure the weight and activation sparsity, we used the
Caffe framework [3] to prune and train the three networks
listed in Table II, using the pruning algorithm of [15]. We then
instrumented the Caffe framework to inspect the activations
between the convolutional layers. Figure 1 shows the weight
and activation density (fraction of non-zeros or complement of
sparsity) of the layers of the networks, referenced to the left-
hand y-axes. As GoogLeNet has 54 convolutional layers, we
only show a subset of representative layers. The data shows
that weight density varies across both layers and networks,
reaching a minimum of 30% for some of the GoogLeNet
layers. Activation density also varies, with density typically
being higher in early layers. Activation density can be as
low as 30% as well. The triangles shows the ideal amount of
work (measured in multiplies of non-zero values) that could be
achieved through maximum exploitation of sparsity by taking
the product of the weight and activation densities on a per-
layer basis. Typical layers can reduce work by a factor of 4,
and can reach as high as a factor of ten.
Exploiting sparsity. Since multiplication by zero just re-
sults in a zero, it should require no work. In addition, that
zero will contribute nothing to the partial sum it is part of,
so the addition is unnecessary as well. Furthermore, data with
many zeros can be represented in a compressed form. Together
these characteristics provide a number of opportunities for
optimization:
• Compressing data: Encoding the sparse weights and/or
activations provides an architecture an opportunity to
reduce the amount of data that must be moved throughout
the memory hierarchy. It also reduces the data footprint,
which allows larger matrices to be held a given size
storage structure.
• Eliminating computation: For multiplications that have
a zero weight and/or activation operand, the operation
can be data gated or the operands might never be sent to
the multiplier. This can save energy consumption or both
time and energy consumption, respectively.
Our SCNN architecture exploits both these opportunities.
First, it employs a dense encoding of sparse weights and
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
conv1 conv2 conv3 conv4 conv5
W
o
rk
 (
# 
o
f 
m
u
lt
ip
lie
s)
 
D
e
n
si
ty
 (
IA
, W
) 
Density (IA)
Density (W)
Work (# of multiplies)
(a) AlexNet
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
p
o
o
l_
p
ro
j
1
x1
3
x3
_r
ed
u
ce
3
x3
5
x5
_r
ed
u
ce
5
x5
p
o
o
l_
p
ro
j
1
x1
3
x3
_r
ed
u
ce
3
x3
5
x5
_r
ed
u
ce
5
x5
inception_3a inception_5b
W
o
rk
 (
# 
o
f 
m
u
lt
ip
lie
s)
 
D
e
n
si
ty
 (
IA
, W
) 
Density (IA)
Density (W)
Work (# of multiplies)
(b) GoogLeNet
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
W
o
rk
 (
# 
o
f 
m
u
lt
ip
lie
s)
 
D
e
n
si
ty
 (
IA
, W
) 
Density (IA)
Density (W)
Work (# of multiplies)
(c) VGGNet
Fig. 1. Input activation and weight density and the reduction in the amount
of work achievable by maximally exploiting sparsity.
activations. Second, it uses a novel dataflow that delivers
only those densely encoded weights and activations to the
multipliers.
III. SCNN DATAFLOW
The core operation in a CNN convolutional layer is a 2-
dimensional sliding-window convolution of an R×S element
filter over a W×H element input activation plane to produce a
W ×H element output activation plane. There can be multiple
(C) input activation planes, which are referred to as input
channels. A distinct filter is applied to each input activation
channel, and the filter output for each of the C channels
are accumulated together element-wise into a single output
activation plane. Multiple filters (K) can be applied to the
same body of input activations to produce K output channels
of activations. Finally, a batch of length N of groups of C
channels of input activation planes can be applied to the same
volume of filter weights. Figure 2 shows these parameters
applied to the computation of a single CNN layer.
The set of computations for the complete layer can be
formulated as a loop nest over these 7 variables. Because
3
17-dimensional network layer
.
.
.
N
.
.
.
N
C
C
K
K
C
W-R+1
Weights Inputs
Outputs
S
R
H
W
H-S+1
Fig. 2. CNN computations and parameters.
multiply-add operations are associative (modulo rounding er-
rors, which we will ignore in this study), all permutations of
these 7 loop variables are legal. Figure 3 shows an example
loop nest based on one such permutation. We can concisely
describe this nest as N → K→C→W → H → R→ S. Each
point in the 7-dimensional space formed from these variables
represents a single multiply-accumulate operation. Note that
for the remainder of this paper, we assume a batch size of 1,
which is common for inferencing tasks.
This simple loop nest can be transformed in numerous
ways to capture different reuse patterns of the activations
and weights and to map the computation to a hardware
accelerator implementation. A CNN’s dataflow defines how
the loops are ordered, partitioned, and parallelized [6]. Prior
work[6] has shown that the choice of dataflow has a significant
impact on the area and energy-efficiency of an architecture.
In fact, the choice of dataflow is perhaps the single most
significant differentiator between many prior works on CNN
architectures.
While the concept of dataflow has been studied for dense
architectures, sparse architectures can also employ various
alternative dataflows, each with its own set of trade-offs. While
an exhaustive enumeration of sparse dataflows is beyond the
scope of this paper, we present a specific dataflow called
PlanarTiled-InputStationary-CartesianProduct-sparse (or PT-
IS-CP-sparse). After examining a range of different dataflows,
we selected PT-IS-CP-sparse because it enables reuse pat-
terns that exploit the characteristics of sparse weights and
activations. This section first presents an equivalent dense
dataflow (PT-IS-CP-dense) to explain the decomposition of the
computations and then adds the specific features for PT-IS-CP-
sparse .
A. The PT-IS-CP-dense Dataflow
Single-multiplier temporal dataflow. To understand the
temporal component of the PT-IS-CP-dense dataflow, consider
the operation of a processing element (or PE) with a single
multiply-accumulate unit. The dataflow employs an input-
stationary computation order in which an input activation is
held stationary at the computation units as it is multiplied by
all the filter weights needed to make all its contributions to
for n = 1 to N
for k = 1 to K
for c = 1 to C
for w = 1 to W
for h = 1 to H
for r = 1 to R
for s = 1 to S
out[n][k][w][h] +=
in[n][c][w+r-1][h+s-1] *
filter[k][c][r][s];
Fig. 3. 7-dimensional CNN loop nest.
each of the the K output channels (a K×R× S subvolume).
Thus each input activation will contribute to a volume of
K×R×S output activations. This order maximizes the reuse
of the input activations, while paying a cost to stream the
weights to the computation units. Accommodating multiple
input channels (C) adds an additional outer loop and results
in the loop nest C→W → H→ K→ R→ S.
The PT-IS-CP-dense dataflow requires input buffers for
weights and input activations, and a accumulator buffer to store
the partial sums of the output activations. The accumulator
buffer must perform a read-add-write operation for every
access to a previously-written index. We call this accumulator
buffer along with the attached adder an accumulation unit.
Parameters of contemporary networks cause these buffers to
be large and energy-expensive to access. The input-stationary
temporal loop nest amortizes the energy cost of accessing
the input buffer over multiple weight and accumulator buffer
accesses. More precisely, the register in which the stationary
input is held over K×R×S iterations serves as an inner buffer
filtering accesses to the larger input buffer.
Unfortunately, the stationarity of input activations comes at
the cost of more streaming accesses to the weights and partial
sums (in the accumulator buffer). Blocking the weights and
partial sums in the output channel (K) dimension can increase
reuse of these data structures and improve energy efficiency.
We factor the output channel variable (K) into Kc (which we
call a output-channel group) and K/Kc, and only store weights
and outputs for a single output-channel group at a time inside
the weight and accumulator buffers. Thus the subvolumes that
are housed in buffers at the computation unit are:
• Weights: C×Kc×R×S
• Inputs: C×W ×H
• Partial Sums: Kc×W ×H
An outer loop over all the K/Kc output-channel chunks
results in the complete loop nest K/Kc→C→W→H→Kc→
R→ S. Note that each iteration of this outer loop will require
the weight buffer to be refilled and the accumulator buffer
must be drained and cleared, while the contents of the input
buffer will be fully reused because the same input activations
are used across all output channels.
Intra-PE parallelism. To exploit the parallelism of many
multipliers within a PE, we fetch a vector of F filter-weights
4
from the weight buffer and a vector of I inputs from the input
activation buffer. These values are delivered to an array of
F×I multipliers to compute a full Cartesian product of output
partial-sums. Each product yields a useful partial sum such that
no extraneous fetches or computations are performed. PT-IS-
CP-sparse will exploit this same property to make computation
efficient on compressed-sparse weights and input activations.
The multiplier outputs are sent to the accumulation unit,
which updates the partial sums of the output activation.
Each multiplier output is accumulated with a partial sum
at the matching output coordinates in the output activation
space. These coordinates are computed in parallel with the
multiplications. The accumulation unit must employ at least
F×I adders to match the throughput of the multipliers. Fig-
ure 4 shows pseudo-code for the PT-IS-CP-dense dataflow,
including blocking in the K dimension (A,C), fetching vectors
of input activations and weights (B,D), and computing the
Cartesian product in parallel (E,F). The Kcoord(), Xcoord(),
and Y coord() functions compute the k, x, and y coordinates
of the uncompressed output volume using a de-linearization
of the temporal loop indices a and w, the spatial loop indices
i and f , and the known filter width and height. Note that this
dataflow is simply a reordered, partitioned and parallelized
version of Figure 3.
Inter-PE parallelism. To scale beyond the practical limits
of multiplier count and buffer sizes within a PE, we employ
a tiling strategy to spread the work across an array of PEs
so that each PE can operate independently. PT-IS-CP-dense
partitions the W×H element activation plane into smaller
Wt×Ht element tiles that are distributed across the PEs. Each
tile extends fully into the input-channel dimension C, resulting
in an input-activation volume of C×Wt×Ht assigned to each
PE. Weights are broadcast to the PEs and each PE operates
on its own subset of the input and output activation space.
Unfortunately, strictly partitioning both input and output
activations into Wt×Ht tiles does not work because the sliding-
window nature of the convolution operation introduces cross-
tile dependencies at tile edges. These dependencies are called
halos. Halos can be resolved in two ways:
• Input halos: The input buffers at each PE are sized to be
slightly larger than C×Wt×Ht to accommodate the halos.
These halo input values are replicated across adjacent
PEs, but outputs are strictly private to each PE. Replicated
input values can be multicast when they are being fetched
into the buffers.
• Output halos: The output buffers at each PE are sized to
be slightly larger than Kc×Wt ×Ht to accommodate the
halos. The halos now contain incomplete partial sums that
must be communicated to neighbor PEs for accumulation.
This communication occurs at the end of computing each
output-channel group.
Our PT-IS-CP-dense dataflow uses output halos, though the
efficiency difference between the two approaches is minimal.
BUFFER wt_buf[C][Kc*R*S/F][F];
BUFFER in_buf[C][Wt*Ht/I][I];
BUFFER acc_buf[Kc][Wt+R-1][Ht+S-1];
BUFFER out_buf[K/Kc][Kc*Wt*Ht];
(A) for k’ = 0 to K/Kc-1
{
for c = 0 to C-1
for a = 0 to (Wt*Ht/I)-1
{
(B) in[0:I-1] = in_buf[c][a][0:I-1];
(C) for w = 0 to (Kc*R*S/F)-1
{
(D) wt[0:F-1] = wt_buf[c][w][0:F-1];
(E) parallel_for (i=0 to I-1) x (f=0 to F-1)
{
k = Kcoord(w,f);
x = Xcoord(a,i,w,f);
y = Ycoord(a,i,w,f);
(F) acc_buf[k][x][y] += in[i]*wt[f];
}
}
}
out_buf[k’][0:Kc*Wt*Ht-1] =
acc_buf[0:Kc-1][0:Wt-1][0:Ht-1];
}
Fig. 4. PT-IS-CP-dense dataflow.
B. PT-IS-CP-sparse Dataflow
PT-IS-CP-sparse is a natural extension of PT-IS-CP-dense
that exploits sparsity in the weights and input activations. The
dataflow is specifically designed to operate on compressed-
sparse encodings of the weights and input activations and to
produce a compressed-sparse encoding of the output activa-
tions. At a CNN layer boundary, the output activations of
the previous layer become the input activations of the next
layer. While prior work has proposed a number of compressed-
sparse representations [13], [1], [30], the specific format used
is orthogonal to the sparse architecture itself. What is key is
that decoding a sparse format ultimately yields a non-zero data
value and an index indicating the coordinates of the value in
the weight or input activation matrices.
To facilitate easier decoding of the compressed-sparse
blocks, weights are grouped into compressed-sparse blocks at
the granularity of an output-channel group, with of Kc×R×S
weights encoded into one compressed block. Likewise, input
activations are encoded at the granularity of input channels,
with a block of Wt×Ht encoded into one compressed block.
At each access, the weight buffer delivers a vector of F non-
zero filter weights along with each of their coordinates within
the Kc×R×S region. Similarly, the input buffer delivers a
vector of F non-zero input activations along with each of
their coordinates within the Wt×Ht region. Similar to the dense
dataflow, the multiplier array computes the full cross-product
of F×I partial sum outputs, with no extraneous computations.
Unlike a dense architecture, output coordinates are not derived
from loop indices in a state machine but from the coordinates
of non-zero values embedded in the compressed format.
Even though calculating output coordinates is trivial, the
multiplier outputs are not typically contiguous as they are in
PT-IS-CP-dense. Thus the F×I multiplier outputs must be
5
scattered to discontiguous addresses within the Kc×Wt×Ht
output range. Because any value in the output range can be
non-zero, the accumulation buffer must be kept in a dense
format. In fact, output activations will probabilistically have
high density even with a very low density of weights and input
activations, until they pass through a ReLU operation.
To accommodate the needs of accumulation of sparse partial
sums, we modify the monolithic Kc×Wt×Ht accumulation
buffer from the PT-IS-CP-dense dataflow into a distributed
array of smaller accumulation buffers accessed via a scatter
network which can be implemented as a crossbar switch. The
scatter network routes an array of F×I partial sums to an array
of A accumulator banks based on the output index associated
with each partial sum. Taken together, the complete accumula-
tor array still maps the same Kc×Wt×Ht address range, though
the address space is now split across a distributed set of banks.
PT-IS-CP-sparse can be implemented via small adjustments
of Figure 4. Instead of a dense vector fetches, (B) and (D)
fetch the compressed sparse input activations and weights,
respectively. In addition, the coordinates of the non-zero values
in the compressed-sparse form of these data structures must be
fetched from their respective buffers (not shown). After that
the accumulator buffer (F) must be indexed with the computed
output coordinates from the sparse weight and activations.
Finally, when the computation for the output-channel group
has been completed the accumulator buffer is drained and
compressed into the next level of buffer.
IV. SCNN ARCHITECTURE
A complete SCNN accelerator employing the PT-IS-CP-
sparse dataflow of Section III consists of multiple SCNN pro-
cessing elements (PEs) connected via simple interconnections.
Figure 5 shows an array of PEs, with each PE including
channels for receiving weights and input activations, and
channels delivering output activations. The PEs are connected
to their nearest neighbors to exchange halo values during the
processing of each CNN layer. The PE array is driven by a
layer sequencer to orchestrate the movement of weights and
activations and is connected to a DRAM controller that can
broadcast weights to the PEs and stream activations to/from
the PEs.
Processing Element (PE) Architecture. Figure 6 shows
the microarchitecture of an SCNN PE, including a weight
buffer, input/output activation RAMs (IARAM and OARAM),
a multiplier array, a scatter crossbar, a bank of accumulator
buffers, and a post-processing unit (PPU). To process the
first CNN layer, the layer sequencer streams a portion of the
input image into the IARAM of each PE and broadcasts the
compressed-sparse weights into the weight buffer of each PE.
Upon completion of the layer, the sparse-compressed output
activation is distributed across the OARAMs of the PEs. When
possible, the activations are held in the IARAMs/OARAMs
and are never swapped out to DRAM. If the output activation
volume of a layer can serve as the input activation volume
for the next layer, the IARAMs and OARAMs are logically
swapped between the two layers’ computation sequences.
PE	 PE	 PE	
PE	 PE	
PE	PE	 PE	
PE	
Weights	
IA	 OA	D
RA
M
	
La
ye
r	
Se
qu
en
ce
r	
Fig. 5. Complete SCNN architecture.
Each PE’s state machine operates on the weight and in-
put activations in the order defined by the PT-IS-CP-sparse
dataflow to produce an output-channel group of Kc×Wt ×Ht
partial sums inside the accumulation buffers. First, a vector F
of compressed weights and a vector I of compressed input
activations are fetched from their respective buffers. These
vectors are distributed into the F×I multiplier array which
computes a form of the cartesian product of the vectors. At
the same time, the indices from the sparse-compressed weights
and activations are processed to compute the coordinates in
dense output activation. The F×I products are delivered to
an array of A accumulator banks, indexed by the output
coordinates. To reduce contention among products that hash to
the same accumulator bank, A is set to be larger than F×I. Our
results show that A = 2×F×I sufficiently reduces accumulator
bank contention. Each accumulator bank includes adders and
small set of entries for the output channels associated with
the output-channel group being processed. The accumulation
buffers are double-buffered so that one set of banks can be
updated by incoming partial sums while the second set of
banks are drained out by the PPU. When the output-channel
group is complete, the PPU performs the following tasks: (1)
exchange partial sums with neighbor PEs for the halo regions
at the boundary of the PE’s output activations, (2) apply
the non-linear activation (e.g. ReLU), pooling, and dropout
functions, and (3) compress the output activations into the
compressed-sparse form and write them into the OARAM.
SCNN uses a simple compressed-sparse encoding approach
based on run-length encoding scheme. The index vector
encodes the number of zeros between each element in the
compressed-sparse data vector. Determining the coordinates
in the accumulator buffer for each multiplier output requires
reading the index vectors for F and I and combining them
with the coordinates of portion of the output activation space
currently being processed. Four bits per index allows for up
to 15 zeros to appear between any two non-zero elements.
Non-zero elements that are further apart can have a zero-
value placeholder without incurring any noticable degradation
in compression efficiency.
While SCNN will operate most efficiently when the activa-
6
1  
Weight FIFO 
(sparse) 
 
IARAM 
(sparse) 
 
F 
I 
F*I …
 
FxI multiplier array 
indices 
indices 
Coordinate 
Computation 
F 
I 
F*I 
A accumulator buffers 
Buffer bank 
Buffer bank 
F
*I
—
A
 a
rb
it
ra
te
d
 x
b
a
r 
O
A
R
A
M
 
(s
p
a
rs
e
) 
 
in
d
ic
e
s 
Neighbors 
PPU: 
Halos 
ReLU 
Compress 
… 
…
 
Fig. 6. SCNN PE employing the PT-IS-CP-sparse dataflow.
TABLE II
SCNN DESIGN PARAMETERS.
PE Parameter Value
Multiplier width 16 bits
Accumulator width 24 bits
IARAM/OARAM (each) 10KB
Weight FIFO 50 entries (500 B)
Multiply array (F×I) 4×4
Accumulator banks 32
Accumulator bank entries 32
SCNN Parameter Value
# PEs 64
# Multipliers 1024
IARAM + OARAM data 1MB
IARAM + OARAM indices 0.2MB
tions fit in the on-chip activation RAMs, Large networks, such
as VGGNet, require activations to be saved to and restored
from DRAM. CNN accelerator architectures such as SCNN
can employ a tiling approach that operates on a 2D subset
of the activation space at a time. Our analysis shows that for
both dense and sparse architectures, the DRAM accesses for
one tile can be hidden by pipelining them in tandem with
the computation of another tile. While we do not present
the details of the tiled approach here, we do account for the
DRAM access energy overhead for tiling in Section VI.
SCNN Architecture Configuration. While the SCNN ar-
chitecture can be scaled across a number of dimensions,
Table II lists the key parameters of the SCNN design we
explore in this paper. The design employs an 8×8 array of PEs,
each with a 4×4 multiplier array, and an accumulator buffer
with 32 banks. We chose a design point of 1,024 multipliers
to match the expected computation throughput required to
process HD video in real-time at acceptable frame rates. The
IARAM and OARAM are sized so that the sparse activations
of AlexNet and GoogleNet can fit entirely within these RAMs
so that activations need not spill to DRAM. The weight FIFO
and the activation RAMs each carry a 10-bit overhead for
each 16-bit value to encode the coordinates in the compressed-
TABLE III
SCNN PE AREA BREAKDOWN.
PE Component Size Area (mm2)
IARAM + OARAM 20 KB 0.031
Weight FIFO 0.5 KB 0.004
Multiplier array 16 ALUs 0.008
Scatter network 16×32 crossbar 0.026
Accumulator buffers 6 KB 0.036
Other — 0.019
Total — 0.123
Accelerator total 64 PEs 7.9
sparse format. In total, the SCNN design includes a total
of 1,024 multipliers and 1MB of activation RAM. At the
synthesized clock speed of the PE of slightly more than 1
GHz, this design achieves a peak throughput of 2 Tera-ops
(16-bit multiplies plus 24-bit adds).
Area Analysis. To prototype the SCNN architecture, we
designed an SCNN PE in synthesizable SystemC and then
used the Catapult high-level synthesis (HLS) tool [23], [22]
to generate Verilog RTL. During this step, we used HLS
design constraints to optimize the design by mapping different
memory structures to synchronous RAMs and latch arrays
and pipelining the design to achieve full throughput. We then
used Synopsys Design Compiler to perform placement-aware
logic synthesis and obtain post-synthesis area estimates in a
TSMC 16nm FinFET technology. Table III summarizes the
area of the major structures of the SCNN PE. A significant
fraction of the PE area is contributed by memories (IARAM,
OARAM, accumulator buffers), which consume 57% of the PE
area, while the multiplier array only consumes 6%. IARAM
and OARAM are large in size and consume 25% of the PE
area. Accumulator buffers, though smaller in size compared to
IARAM/OARAM, are heavily banked (32 banks), contributing
to their large area.
V. EXPERIMENTAL METHODOLOGY
CNN performance and power measurements. To model
7
the performance of the SCNN architecture, we rely primar-
ily on a custom-built cycle-level simulator. This simulator
is parameterizable across dimensions including number of
processing element (PE) tiles, RAM capacity, multiplier array
dimensions (F and I), and accumulator buffers (A). The SCNN
simulator is driven by the pruned weights and sparse input
activation maps extracted from the Caffe Python interface
(pycaffe) [3] and executes each layers of the network one
at a time. As a result, the simulator captures the effects of the
sparsity of the data and its effect on load balancing within the
SCNN architecture.
We also developed TimeLoop, a detailed analytical model
for CNN accelerators to enable an exploration of the design
space of dense and sparse architectures. TimeLoop can model
a wide range of data flows, including PT-IS-CP-dense, PT-IS-
CP-sparse and others. Architecture parameters to TimeLoop
include the memory hierarchy configuration (buffer size and
location), ALU count and partitioning, and dense/sparse hard-
ware support. TimeLoop analyzes the input data parameters,
the architecture, and the dataflows, and computes the number
of cycles to process the layer based on a bottleneck analysis
and the counts of ALU operations and accesses to different
buffers in the memory hierarchy. We apply an energy model
to the time loop events derived from the synthesis modeling
to compute the overall energy required to execute the layer.
TimeLoop also computes the overall area of the accelerator
based on the inputs from the synthesis modeling. For SCNN,
the area model includes all of the elements from the synthe-
sizable SystemC implementation. For dense architectures, area
is computed using area of the major structures (RAMs, ALUs,
and interconnect) derived from the SystemC modeling.
Architecture configurations. Table IV summarizes the ma-
jor accelerator configurations that we explore, including both
dense and sparse accelerators. All of the accelerators employ
the same number of multiply ALUs so that we can compare the
performance of the accelerators with the same computational
resources. The dense DCNN accelerator operates solely on
dense weights and activations and employs a dataflow called
PT-IS-DP-dense, which is a variant of the PT-IS-CP-dense
data flow described in Section III that uses a dot-product as
its inner core operation. The optimized DCNN-opt architec-
ture has the same configuration as DCNN but employs two
optimizations: (1) compression/decompression of activations
as they are transferred out of/into DRAM, and (2) multiply
ALU gating to save energy when a multiplier input is zero.
The DCNN architecture is configured with 2MB of SRAM for
holding inter-layer activations, and can hold all of them for
AlexNet and GoogLeNet. The SCNN configuration matches
the architecture described in Section IV, and includes a total
of 1MB of IARAM + OARAM. Because the activations are
compressed, this capacity enables all of the activation data
for the two networks to be held on chip, without requiring
DRAM transfers for activations. The larger VGGNet requires
the activation data to be transferred in and out of DRAM.
The area required for each accelerator is listed in the last
column. While SCNN has smaller activation RAM capacity,
TABLE IV
CNN ACCELERATOR CONFIGURATIONS.
# PEs # MULs SRAM Area (mm2)
DCNN 64 1024 2MB 5.9
DCNN-opt 64 1024 2MB 5.9
SCNN 64 1024 1MB 7.9
its larger size is due to the accumulator buffers, as described
in Section IV.
Benchmarks. To explore the sensitivity of the architectures
to sparsity parameters, we employ a synthetic network where
we can adjust the degree of sparsity of both weights and
activations. As described in Section II, we use AlexNet and
GoogLeNet for the bulk of our experiments. With respect to
GoogLeNet, we primarily focus on the convolutional layers
that are within the inception modules [28]. VGGNet is known
to be over-parameterized, which results in an extremely large
amount of inter-layer activation data (6 MB or about 4×) the
largest GoogLeNet layer. However, we use VGGNet as a proxy
for large input data (such has high-resolution images) to ex-
plore the implications of tiling data on the SCNN architecture
in Section VI-D.
VI. EVALUATION
This section first evaluates the sensitivity of SCNN to the
sparseness of weights and activations using a synthetic CNN
benchmark. We then discuss the performance and energy-
efficiency of SCNN versus a dense CNN accelerator, using
real world CNN applications. For brevity, all the inception
modules in GoogLeNet are denoted as IC_id in all of the
figures discussed in this section.
A. Sensitivity to CNN Sparsity
To explore the efficacy of sparse versus dense architecture,
we first measure performance and energy as a function of
density of weights and activations. Using the TimeLoop anal-
ysis tool, we examine GoogLeNet as we artificially sweep
the weight and activation densities together from 100% (fully
dense) down to 10%. Figure 7 shows results on SCNN,
DCNN, and DCNN-opt. On the x-axis, the total density is
the product of the weight and activation density. For example,
0.5/0.5 represents the point where both weights and activations
are 50% dense so the overall density is 25%. Figure 7a
shows that at 100% density, SCNN achieves about 79% of
the performance of DCNN because it suffers from multiplier
underutilization (discussed shortly). SCNN starts to perform
better than DCNN as density decreases to 85%, reaching a
24× improvement at 10% weight/activation density. This chart
does not include DCNN-opt as the energy optimizations over
DCNN do not affect performance.
Figure 7b first shows that DCNN-opt’s energy optimizations
of zero gating and DRAM traffic compression cause it to be
better than DCNN at every level of density. These energy
optimizations are surprisingly effective given that they have
such a small effect on the design of the accelerator. At
high density, SCNN is notably less energy efficient than
8
00.2
0.4
0.6
0.8
1
1.2
1.4
0.1/0.1 0.2/0.2 0.3/0.3 0.4/0.4 0.5/0.5 0.6/0.6 0.7/0.7 0.8/0.8 0.9/0.9 1.0/1.0
La
te
n
cy
 (
cy
cl
e
s)
Weight / Activation Density
DCNN/DCNN-opt SCNN
(a) Performance
0
0.2
0.4
0.6
0.8
1
1.2
1.4
0.1/0.1 0.2/0.2 0.3/0.3 0.4/0.4 0.5/0.5 0.6/0.6 0.7/0.7 0.8/0.8 0.9/0.9 1.0/1.0
En
e
rg
y
Weight / Activation Density
DCNN DCNN-opt SCNN
(b) Energy
Fig. 7. GoogLeNet performance and energy as a function of density.
either DCNN architecture due to the overheads of storing and
maintaining the sparse data structures. SCNN becomes more
efficient than DCNN at about 83% weight/activation density
and more efficient than DCNN-opt at 60% density. Given the
density measurements of the networks in Figure 1, we expect
SCNN to outperform the dense architectures on nearly all
of the layers of the networks we examined and be largely
comparable in energy efficiency vs. DCNN-opt across these
layers.
B. SCNN Performance and Energy
Performance. This subsection discusses the performance
improvements offered by SCNN over the baseline dense
DCNN accelerator. To demonstrate the upper-bound oppor-
tunity on performance, we also present the speedups offered
by an oracular SCNN design (SCNN(oracle)). The perfor-
mance of SCNN(oracle) is derived by dividing the number
of multiplication operations required for Cartesian product-
based convolution (Section III) with the number of multipliers
available on-chip. Figure 8 summarizes the speedups offered
by SCNN versus a dense CNN accelerator. Overall, SCNN
consistently outperforms the DCNN design across all the
layers of AlexNet, GoogLeNet, and VGGNet, achieving an
average 2.37×, 2.19×, and 3.52× network-wide performance
improvement, respectively.
The performance gap between SCNN versus
SCNN(oracle) widens in later layers of the network,
from left to right in the x-axis of Figure 8. The key reason
behind this gap is closely related to each layer’s average
PE multiplier array utilization. In general, the working set
allocated to each of the PEs in the later layers (e.g., IC_5b)
are smaller than those in the earlier layers (e.g., IC_3a).
As a result, it is more difficult to assign to a PE a sufficient
0
2
4
6
8
10
12
conv1 conv2 conv3 conv4 conv5 all
Per-layer Network
Sp
e
e
d
u
p
DCNN/DCNN-opt SCNN SCNN (oracle)
(a) AlexNet
0
2
4
6
8
10
12
14
IC_3a IC_3b IC_4a IC_4b IC_4c IC_4d IC_4e IC_5a IC_5b all
Per-layer Network
Sp
e
e
d
u
p
DCNN/DCNN-opt SCNN SCNN (oracle)
(b) GoogLeNet
0
3
6
9
12
15
co
n
v1
_1
co
n
v1
_2
co
n
v2
_1
co
n
v2
_2
co
n
v3
_1
co
n
v3
_2
co
n
v3
_3
co
n
v4
_1
co
n
v4
_2
co
n
v4
_3
co
n
v5
_1
co
n
v5
_2
co
n
v5
_3 a
ll
Per-layer Network
Sp
e
e
d
u
p
DCNN/DCNN-opt SCNN SCNN (oracle)
(c) VGGNet
Fig. 8. Performance.
number of non-zero activations and weights in the later layers
to fully utilize the multiplier arrays.
Figure 9 quantitatively demonstrates this intra-PE frag-
mentation of the multiplier array. For the last two inception
modules of GoogLeNet, the fragmentation issue becomes
noticeably severe, with less than an average 20% multiplier
utilization. In this layer 3 out of the 6 convolutional layers
within the inception module have a filter size of 1× 1, only
having up to 8 non-zero weights within an output-channel
group with a Kc value of 8. Nonetheless, the last layers
generally account for a small portion of the overall execution
time and SCNN generally provides significant performance
improvement across the network. The right y-axis of Figure 9
demonstrates the effect of load imbalance across the PEs by
showing the fraction of cycles spent waiting at inter-PE barrier.
Overall, although the inter-PE global barriers and intra-PE
fragmentation prevents SCNN from reaching similar speedups
offered by oracular SCNN(oracle), it still provides an
average 2.7× performance over DCNN across the three state-
of-the-art CNNs.
Energy-efficiency. Figure 10 compares the energy of the
three accelerator architectures across the layers of the three
networks. DCNN-opt improves energy efficiency by 2.0× over
DCNN, while SCNN increases average efficiency by 2.3×.
The behavior across the layers varies widely depending on the
density of the layer, ranging from 0.89× to 4.7× improvement
over DCNN and 0.76× to 1.9× improvement over DCNN-opt.
Input layers such as VGGNet conv1 1 and AlexNet conv1
9
00.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
conv1 conv2 conv3 conv4 conv5
A
vg
. P
E 
id
le
 c
yc
le
s 
A
vg
. m
u
lt
ip
lie
r 
u
ti
l.
 
Multiplier util.
PE idle cycles
(a) AlexNet
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
IC_3a IC_3b IC_4a IC_4b IC_4c IC_4d IC_4e IC_5a IC_5b
A
vg
. P
E 
id
le
 c
yc
le
s 
A
vg
. m
u
lt
ip
lie
r 
u
ti
l.
 
mul_util
idle_cycles
(b) GoogLeNet
0
0.2
0.4
0.6
0.8
1
0
0.2
0.4
0.6
0.8
1
A
vg
. P
E 
id
le
 c
yc
le
s 
A
vg
. m
u
lt
ip
lie
r 
u
ti
l.
 
mul_util
idle_cycles
(c) VGGNet
Fig. 9. Average multiplier array utilization (left-axis) and the average fraction
of time PEs are stalled on a global barrier (right-axis), set at the boundaries
of output channel groups.
usually present a challenge for sparse architectures because
of their 100% input activation density. In such cases, the
overheads of SCNN’s structures such as the crossbar and
distributed accumulation RAMs overshadow any benefits from
fewer arithmetic operations and data movement.
C. PE Granularity
As outlined in Section VI-B, both cross-PE global barriers
and intra-PE multiplier array fragmentation can contribute to
degradation in the performance of SCNN. We quantify the
effects of both of these factors on system performance by
conducting the following sensitivity study. Assuming a fixed,
chip-wide math throughput of 1,024 FLOPS, we sweep the
total number of PEs on-chip from 64 (8×8 PEs, 16 multipliers
per PE) down to 4 (2×2 PEs, 256 multipliers per PE). Clearly,
an SCNN with 4 PEs can better sustain the effects of the
global barriers than an SCNN with 64 PEs. However, the 4
PE configuration is also more likely to suffer from intra-PE
fragmentation because each PE must now process a larger
working set to fully utilize the math units. When evaluated
on GoogLeNet, SCNN with 64 PEs achieves an 11% speedup
than the one with 4 PEs as it does a better job utilizing the
math arrays (average 59% math utilization versus 35%). We
observed similar trends for AlexNet and VGGNet, concluding
that addressing intra-PE fragmentation is more critical than
0
0.2
0.4
0.6
0.8
1
1.2
conv1 conv2 conv3 conv4 conv5 all
En
e
rg
y 
(r
e
la
ti
ve
 t
o
 D
C
N
N
)
Per-layer
DCNN DCNN-opt SCNN
(a) AlexNet
0
0.2
0.4
0.6
0.8
1
1.2
IC_3a IC_3b IC_4a IC_4b IC_4c IC_4d IC_4e IC_5a IC_5b all
En
e
rg
y 
(r
e
la
ti
ve
 t
o
 D
C
N
N
)
Per-layer
DCNN DCNN-opt SCNN
(b) GoogLeNet
0
0.2
0.4
0.6
0.8
1
1.2
En
e
rg
y 
(r
e
la
ti
ve
 t
o
 D
C
N
N
)
Per-layer
DCNN DCNN-opt SCNN
(c) VGGNet
Fig. 10. Energy-efficiency.
inter-PE barriers for system-wide performance under our PT-
IS-CP-sparse dataflow.
D. Larger Networks
SCNN is at its most efficient when it can capture all of the
activations in its IARAM and OARAM. For a network with
large layers like VGGNet, there are two choices. First, the
designer could choose to provision SCNN with IARAM and
OARAM large enough to hold the largest layer. While this ap-
proach would make processing the large layers more efficient
by avoiding off-chip DRAM accesses, smaller layers would
be penalized by incurring larger on-chip access overheads to
RAMs that are larger than necessary. The second choice is
to employ a tiling approach as described in Section IV. Our
results show that this is only required for 9 of the 72 total
evaluated layers, and for these 9 layers the overall energy
penalty for shuttling the activation data to and from DRAM
ranges from 5–62%, with a mean of 18%. While the tiling
approach is attractive, we expect that there will be low power
deployment scenarios that will motivate the designers of the
neural networks to size them so that they fit in the on-chip
SRAM capacity provided by the accelerator implementation.
10
VII. RELATED WORK
Previous efforts to exploit sparsity in CNN accelerators
have focused on reducing energy or saving time, which will
invariably also save energy. Eliminating the multiplication
when an input operand is zero is a natural way to save energy.
Eyeriss [7] gates the multiplier when it sees an input activation
of zero. For non-pruned networks the sparsity of weights is
a less significant factor, and Eyeriss opted not to gate the
multiplier on zero weights. This approach will save energy,
but not save cycles. SCNN also saves energy by eliminating
all the unnecessary multiplications, but doesn’t even prepare
to do a multiplication when any input operand is zero, thus
saving time as well.
Another approach to reducing energy is to reduce data
transfer costs when the data is sparse. Eyeriss uses a run length
encoding scheme when transferring activations to and from
DRAM. This saves energy (and time) by reducing the number
of DRAM accesses. However, after the data is loaded into
the on-chip buffer the data is stored in expanded form. Thus,
there is no savings on data transfers from one internal buffer to
another internal buffer or to the multipliers. SCNN also uses
a compressed representation for all data coming from DRAM,
but also maintains that compressed representation in the on-die
buffers.
Another CNN accelerator designed to exploit sparsity, Cn-
vlutin [1] compresses activation values. However, it does not
compress weights. Cambricon-X [30], does not compress the
data coming from DRAM. Nor does it keep activations in a
compressed form in the internal buffers, except in the queues
directly delivering activations to the multipliers. Cambricon-
X does keep only non-zero weights in the internal buffers.
In contrast, SCNN keeps both weights and activations in a
compressed form in both DRAM and internal buffers. This
saves data transfer time and energy on all data transfers and
allows the chip hold larger models for a given amount of
internal storage.
Not delivering zero activation values or zero weights to
the multipliers can save time by avoiding cycles when the
multiplier has nothing to do. Cnvlutin selects only non-zero
activation values for delivery as multiplier operands, but does
occupy a multiplier with zero weights. Cambricon-X does not
deliver either zero activations or weights to the multipliers.
SCNN also does not deliver either zero activations or weights
to the multipliers.
Finally, the EIE CNN accelerator [13] uses a compressed
representation of both activations and weights, and only de-
livers non-zero operands to the multipliers. However, EIE is
designed for the fully connected layers of a CNN model, while
SCNN is targeting the convolutional layers where the majority
of the computations are [9].
VIII. CONCLUSION
This paper presents the Sparse CNN (SCNN) accelerator
architecture for inference in convolutional neural networks.
SCNN exploits sparsity in both weights and activations us-
ing the PlanarTiled-InputStationary-CartesianProduct-sparse
(PT-IS-CP-sparse) dataflow. This approach enables SCNN to
use a novel Cartesian product-based computation architecture
that maximizes reuse of weights and activations within a set
of distributed processing elements. In addition, it allows the
use of a dense compressed representation for both weights
and activations to be used through almost the entire pro-
cessing flow. This approach reduces data movement and on-
die storage capacity relative to alternative architectures. Our
results show that the SCNN architecture starts to beat a dense
architecture in performance and energy efficiency when the
weights and activations are each less than 85% dense. On three
contemporary networks (AlexNet, GoogLeNet, and VGGNet)
SCCN achieves performance improvements over the dense
architecture by a factor of 2.7× while still being energy-
efficient by a factor of 2.3×.
REFERENCES
[1] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos. Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional
Neural Network Computing. In Proceedings of the International
Symposium on Computer Architecture (ISCA), pages 1–13, June 2016.
[2] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro,
J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. Engel,
L. Fan, C. Fougner, T. Han, A. Hannun, B. Jun, P. LeGresley, L. Lin,
S. Narang, A. Ng, S. Ozair, R. Prenger, J. Raiman, S. Satheesh,
D. Seetapun, S. Sengupta, Y. Wang, Z. Wang, C. Wang, B. Xiao,
D. Yogatama, J. Zhan, and Z. Zhu. Deep Speech 2: End-To-End Speech
Recognition in English and Mandarin. https://arxiv.org/abs/1512.02595,
2015.
[3] Caffe. http://caffe.berkeleyvision.org, 2016.
[4] Caffe Model Zoo. https://github.com/BVLC/caffe/wiki/Model-Zoo,
2017.
[5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam.
DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous
Machine-learning. In Proceedings of the International Conference
on Architectural Support for Programming Languages and Operation
Systems (ASPLOS), pages 269–284, March 2014.
[6] Y.-H. Chen, J. Emer, and V. Sze. Eyeriss: A Spatial Architecture
for Energy-Efficient Dataflow for Convolutional Neural Networks. In
Proceedings of the International Symposium on Computer Architecture
(ISCA), pages 367–379, June 2016.
[7] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze. Eyeriss: An Energy-efficient
Reconfigurable Accelerator for Deep Convolutional Neural Networks.
In Proceedings of the International Solid State Circuits Conference
(ISSCC), February 2016.
[8] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and
P. Kuksa. Natural Language Processing (Almost) From Scratch. https:
//arxiv.org/abs/1103.0398, 2011.
[9] J. Cong and B. Xiao. Minimizing Computation in Convolutional Neural
Networks. In Proceedings of the International Conference on Artificial
Neural Networks (ICANN), pages 281–290, September 2014.
[10] G. Diamos, S. Sengupta, B. Catanzaro, M. Chrzanowski, A. Coates,
E. Elsen, J. Engel, A. Hannun, and S. Satheesh. Persistent RNNs:
Stashing Recurrent Weights On-Chip. In Proceedings of the Interna-
tional Conference on Machine Learning (ICML), June 2016.
[11] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam. ShiDianNao: Shifting Vision Processing Closer to the
Sensor. In Proceedings of the International Symposium on Computer
Architecture (ISCA), pages 92–104, June 2015.
[12] A. Graves and J. Schmidhuber. Framewise Phoneme Classification With
Bidirectional LSTM and Other Neural Network Architectures. In Neural
Networks, 2005.
[13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally.
EIE: Efficient Inference Engine on Compressed Deep Neural Network.
In Proceedings of the International Symposium on Computer Architec-
ture (ISCA), pages 243–254, June 2016.
[14] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing
Deep Neural Networks with Pruning, Trained Quantization and Huffman
Coding. https://arxiv.org/abs/1510.00149, 2015.
11
[15] S. Han, J. Pool, J. Tran, and W. J. Dally. Learning Both Weights
and Connections for Efficient Neural Networks. In Proceedings of
the International Conference on Neural Information Processing Systems
(NIPS), pages 1135–1143, December 2015.
[16] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen,
R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Y. Ng. Deep
Speech: Scaling Up End-To-End Speech Recognition. https://arxiv.org/
abs/1412.5567, 2014.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image
Recognition. https://arxiv.org/abs/1512.03385, 2015.
[18] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep Networks
with Stochastic Depth. https://arxiv.org/abs/1603.09382, 2016.
[19] ImageNet. http://image-net.org, 2016.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification
with Deep Convolutional Neural Networks. In Proceedings of the
International Conference on Neural Information Processing Systems
(NIPS), December 2012.
[21] Y. LeCun, Y. Bengio, and G. Hinton. Deep Learning. Nature, 521:436–
444, May 2015.
[22] G. Martin and G. Smith. High-Level Synthesis: Past, Present, and Future.
IEEE Design & Test of Computers, 26(4):18–25, July/August 2009.
[23] Catapult High-Level Synthesis. https://www.mentor.com/hls-lp/catapult-
high-level-synthesis, 2017.
[24] NVIDIA cuDNN. https://developer.nvidia.com/cudnn, 2016.
[25] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. Lee, J. M. H.
Lobato, G.-Y. Wei, and D. Brooks. Minerva: Enabling Low-Power, High-
Accuracy Deep Neural Network Accelerators. In Proceedings of the
International Symposium on Computer Architecture (ISCA), pages 267–
278, June 2016.
[26] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keck-
ler. vDNN: Virtualized Deep Neural Networks for Scalable, Memory-
Efficient Neural Network Design. In Proceedings of the International
Symposium on Microarchitecture (MICRO), October 2016.
[27] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for
Large-Scale Image Recognition. https://arxiv.org/abs/1409.1556, May
2015.
[28] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions.
In Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR), June 2015.
[29] Wired. Microsoft Neural Net Shows Deep Learning Can Get Way
Deeper, 2016.
[30] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen. Cambricon-X: An Accelerator for Sparse Neural Networks.
In Proceedings of the International Symposium on Microarchitecture
(MICRO), October 2016.
12
