Accelerator-Aware Pruning for Convolutional Neural Networks by Kang, Hyeong-Ju
Accelerator-Aware Pruning for Convolutional Neural Networks
Hyeong-Ju Kang, Member, IEEE
Abstract—Convolutional neural networks have shown tremen-
dous performance capabilities in computer vision tasks, but their
excessive amounts of weight storage and arithmetic operations
prevent them from being adopted in embedded environments.
One of the solutions involves pruning, where certain unimportant
weights are forced to have a value of zero. Many pruning schemes
have been proposed, but these have mainly focused on the
number of pruned weights, scarcely considering ASIC or FPGA
accelerator architectures. When a pruned network is run on an
accelerator, the lack of the architecture consideration causes some
inefficiency problems, including internal buffer misalignments
and load imbalances. This paper proposes a new pruning scheme
that reflects accelerator architectures. In the proposed scheme,
pruning is performed so that the same number of weights
remain for each weight group corresponding to activations
fetched simultaneously. In this way, the pruning scheme resolves
the inefficiency problems, doubling the accelerator performance.
Even with this constraint, the proposed pruning scheme reached
a pruning ratio similar to that of previous unconstrained pruning
schemes, not only on AlexNet and VGG16 but also on state-of-
the-art very deep networks such as ResNet. Furthermore, the
proposed scheme demonstrated a comparable pruning ratio on
compact networks such as MobileNet and on slimmed networks
that were already pruned in a channel-wise manner. In addition
to improving the efficiency of previous sparse accelerators, the
proposed pruning scheme is also shown to reduce the logic
complexity of sparse accelerators.
Index Terms—Deep learning, convolutional neural networks,
neural network pruning, neural network accelerator
I. INTRODUCTION
Convolutional neural networks (CNNs) are attracting interest
in the fields of image recognition [1]–[5], object detection
[6]–[10], and image segmentation [11].
Although they provide great performance for computer vision
tasks, there are certain obstacles that must be overcome
before they can be adopted in embedded environments. A
CNN usually requires excessive amounts of weight storage
and arithmetic operations. For fast processing and low power
consumption under these requirements, ASIC or FPGA ac-
celerators have been proposed [12]–[18], but the amounts of
weights and operations remain a major concern.
Manuscript received XXXXX, XX, XXXX; revised XXXXX XX, XXXX.
This work was supported by Basic Science Research Program through the
National Research Foundation of Korea (NRF) funded by the Ministry of
Education (2015R1D1A1A01058768). This work was also supported by IDEC
(EDA Tool).
H.-J. Kang is with the School of Computer Science and Engineering,
Korea University of Technology and Education, Cheonan, Chungnam, 31253
Republic of Korea (e-mail: hjkang@koreatech.ac.kr).
The weight amounts can be reduced by network pruning
[19]–[33], where some unimportant weights are forced to
have a value of zero. Multiplication with zero is meaningless,
implying that reductions in operation amounts can be expected
as well. Various pruning schemes have been proposed, some
of which prune the weights without constraints [19], [20]
while others prune the weights considering the neural network
structures. It is known that relatively more weights can be
pruned in fully-connected layers than in convolutional layers.
However, pruning convolutional layers can reduce more energy
and realize higher throughput [21]. Given that the operational
structure of convolutional layers is more complex than that
of fully-connected layers, convolutional layers have a greater
variety of pruning schemes [22]–[32].
Pruned networks can be executed on certain ASIC or FPGA
accelerators that can exploit the weight sparsity. The EIE
proposed in [34] used the sparsity of weights generated by the
pruning algorithm of [20], but the architecture only processes
fully-connected layers. The EIE architecture was modified
for a long short-term memory (LSTM) network to become
the ESE architecture [35]. Cambricon-X is an architecture
that can exploit weight sparsity in convolutional layers [36].
SCNN proposed in [37] exploits both weight sparsity and
activation sparsity for convolutional layers. In ZENA proposed
in [38], good operating efficiency was reached by skipping
zero weights and activations.
Despite the various pruning schemes and accelerators, it re-
mains challenging to utilize the performance of ASIC or FPGA
accelerators efficiently with pruned networks. The aforemen-
tioned pruning schemes were developed without considering
the concept of accelerator acceptability. They mainly focused
on the amount of weight that can be pruned away. Pruning
with no constraints creates irregular patterns in the remaining
non-zero weights. These irregular patterns cause some degree
of inefficiency when the pruned networks are executed
on accelerators. The misalignment of the fetched activations
and weights requires padding-zero insertions. The processing
elements (PEs) process different numbers of weights, meaning
that some PEs must wait for other PEs to complete, known as
the load-imbalance problem. To alleviate these problems, some
accelerators use complex structures. For example, Cambricon-
X uses very wide (256×16-bits width) memory and very wide
(256-to-1) multiplexers (MUXs).
This paper proposes a new approach, a pruning algorithm
which considers the accelerator architecture. There can be
various architecture consideration points, and this paper will
focus on the activation groups fetched and processed simulta-
neously in a PE and the number of remaining weights for each
group. These points are related to certain critical parameters
in accelerator designs, including the number of multipliers
and the width of the internal weight buffer. In the proposed
arXiv:1804.09862v2 [cs.NE] 14 May 2019
Fig. 1. Weight structure of a convolutional layer.
algorithm, pruning is applied to weight groups, each of which
corresponds to an activation group fetched together, and after
pruning, a fixed number of weights remain in a group. Because
pruning is applied to a weight group aligned with the operating
boundary of a PE, the scheme can resolve the misalignment
problem, removing waste of the internal buffer or multipliers.
The pruning group can be adjusted to reduce the complexity of
the data selection and indexing logic. Furthermore, since the
remaining non-zero weights are distributed evenly, the load-
imbalance problem can be solved naturally.
This paper is organized as follows. Section II explains
CNNs, the previous pruning schemes, and the accelerator
architectures. The accelerator-aware pruning scheme is pro-
posed in Section III, and the experimental results are shown
in Section IV. Section V concludes the paper.
II. CNN AND PRUNING
A. CNN
A CNN usually consists of many convolutional layers
and a few fully-connected layers. Between the layers, there
are activation layers such as rectified linear units (ReLUs),
pooling layers, and batch-normalization layers. It is known
that the convolutional layers account for more than 90% of
the arithmetic operations. Because the fully-connected layers
require much more weight storage, recent CNNs have tended
to have only one fully-connected layer [3], [4] or none [5].
The operation in a convolutional layer is as follows:
$$f_o(m, y, x) = \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} \mathrm{weight}(m, c, i, j) \times f_i(c, S \times y + i, S \times x + j) + \mathrm{bias}(m), \tag{1}$$
where fo(m, y, x) is the activation of row y and column x
in output feature map m, and fi(c, h, w) is the activation of
row h and column w in input feature map c. In the equation,
C is the number of the input feature maps or channels, K
is the spatial size of the kernel, and S is the stride size. The
weight(m, c, i, j) variables are the convolution weights, and
the bias(m) variables are the bias terms. Fig. 1 shows the
structure of the weights, consisting of M filters with a spatial
size of K×K and a depth of C.
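As an illustration of Eq. (1), the following NumPy sketch evaluates the equation directly with nested loops. It assumes no padding, a square K×K kernel, and the array layouts fi[c, h, w] and weight[m, c, i, j]; the function name conv_layer is chosen here for illustration only.

```python
import numpy as np

def conv_layer(fi, weight, bias, stride=1):
    """Direct evaluation of Eq. (1), without padding (assumption).

    fi:     input activations, shape (C, H, W)
    weight: convolution weights, shape (M, C, K, K)
    bias:   bias terms, shape (M,)
    """
    M, C, K, _ = weight.shape
    _, H, W = fi.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    fo = np.zeros((M, out_h, out_w))
    for m in range(M):
        for y in range(out_h):
            for x in range(out_w):
                # C x K x K input window whose top-left corner is (S*y, S*x)
                window = fi[:, stride * y:stride * y + K,
                            stride * x:stride * x + K]
                # Sum over c, i, j of weight(m, c, i, j) * fi(c, S*y+i, S*x+j)
                fo[m, y, x] = np.sum(weight[m] * window) + bias[m]
    return fo
```

Every output value costs C·K·K multiplications, which is the operation count that pruning aims to reduce.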
TABLE I
PRUNING RATIO OF THE PREVIOUS PRUNING SCHEME IN [19]
Network   Conv1   Other Conv   FC1 & FC2   FC3
AlexNet   16%     62–65%       91%         75%
VGG16     42%     47–78%       96%         77%
TABLE II
PREVIOUS STRUCTURED PRUNING SCHEMES
Granularity                    Pruned weights           References
channel-wise or filter-wise    weight(:, n, :, :) or    [22]–[27]
                               weight(m, :, :, :)
shape-wise or GPU-aware        weight(:, n, i, j)       [28], [29]
                               weight(m, n, :, :)
2D/1D granularity              weight(m, n, i, :)       [30]
                               weight(m, n, :, j)
SIMD aware                                              [25]
For ease of discussion, some axes are defined. The channel
axis of the input activations denotes the fi(c, h, w) variables
with identical values of h and w. Similarly, the channel axis of
the weights is defined by the weight(m, c, i, j) entries with the
same m, i, and j values. The spatial axis represents fi(c, h, w)
variables with the same c values or weight(m, c, i, j) variables
with identical values of m and c as well. The filter axis denotes
weight(m, c, i, j) variables with the same c, i, and j values.
Fig. 1 shows the weights along each axis.
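The three axes can be read as array slices. The short sketch below assumes the weights are stored as a NumPy array indexed as weight[m, c, i, j]; the sizes and the fixed indices are arbitrary values chosen for illustration.

```python
import numpy as np

M, C, K = 4, 8, 3  # illustrative sizes, not taken from any network
weight = np.arange(M * C * K * K).reshape(M, C, K, K)  # weight[m, c, i, j]

m, c, i, j = 1, 2, 0, 1  # arbitrary fixed indices

channel_axis = weight[m, :, i, j]  # same m, i, j; c varies    -> C entries
spatial_axis = weight[m, c, :, :]  # same m, c;    i, j vary   -> K x K entries
filter_axis = weight[:, c, i, j]   # same c, i, j; m varies    -> M entries
```

A fetching group along a given axis is then a contiguous run of entries taken from the corresponding slice.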
B. Pruning
The pruning of neural networks removes some of the
unimportant weights or nodes to reduce the amount of storage
and the number of operations. The early works were presented
in [39]–[41], and pruning in CNNs was proposed in [19],
whose pruning results are summarized in Table I. The table
shows the ratio of the pruned weights at the first convolutional
layer, the other convolutional layers, the first and second fully-
connected layers, and the last fully-connected layer of AlexNet
and VGG16. This work was expanded with quantization and
Huffman coding in [20] and with an accelerator architecture
for pruned fully-connected layers in [34]. The pruning scheme
can prune many weights but shows irregularities in the pruned
pattern. Moreover, the corresponding accelerator architecture
can only deal with fully-connected layers. The energy-aware
pruning scheme proposed in [21] focuses on convolutional
layers because convolutional layers consume more energy.
However, this work did not consider the regularity of the
pruning pattern, either.
The regularity of the pruning was considered in [22]–[31].
The pruning schemes in these studies can be categorized as
channel-wise, filter-wise, and shape-wise pruning, as shown
in Table II. In channel-wise pruning, for example, all of
the weights in a channel are pruned or not together. These
pruning schemes are referred to as structured pruning schemes,
whereas pruning schemes with no constraint, as presented in
[19], [20], are termed unstructured pruning.
C. Neural Network Accelerators
Some ASIC or FPGA accelerator architectures have been
proposed for unpruned, dense CNNs, which will be called
Fig. 2. Typical PE structures: (a) MWMA and (b) MWSA.
dense architectures. An accelerator usually consists of several
processing elements (PEs), each of which has a single mul-
tiplier or many multipliers. In a PE with many multipliers,
the multipliers may multiply multiple weights and multiple
activations (MWMA), multiple weights and a single activa-
tion (MWSA), or a single weight and multiple activations
(SWMA). The MWMA and MWSA structures are shown in
Fig. 2, where NPE PEs operate in parallel. With the same
naming convention, a PE with a single multiplier can be called
a single weight and single activation (SWSA) structure.
If a PE has multiple multipliers, it is important to fetch
the operands of the multipliers simultaneously. In this paper,
a fetching group is defined as activations or weights that are
fetched and processed simultaneously in a PE. If the size of
a fetching group is Npar and the number of multipliers in a
PE is Nmul, Npar is usually equal to Nmul. The weights and
activations are usually fetched from the internal buffers, but
the buffers are not drawn in the figures.
The PE structures can be further categorized by the axis
followed by the weight- and activation-fetching groups. For
example, DianNao [12] adopts the MWMA structure, where
some of the input activations along the channel axis are fetched
and multiplied with the corresponding kernel weights. The
multiplication results are summed and accumulated to be an
output activation. The MWSA structure is adopted in Cnvlutin
[13] to exploit the sparsity of the input activations. Multiple
weights are fetched along the filter axis. The outputs of the
multiplications in PEs are gathered into Npar NPE-input
adder trees. The architectures proposed in [16], [17] consist of
MWMA-structured PEs and fetch the activations and weights
along the spatial axis.
Fig. 3. Sparse MWMA PE structure.
D. Sparse Accelerator Architectures
The architectures in Fig. 2 can be modified to exploit the
weight sparsity, which will be called sparse architectures. In
such architectures, only non-zero weights are stored in the
weight memories to reduce weight storage use. An example
of the sparse MWMA structure is shown in Fig. 3, where Npar
activations are fetched simultaneously. Npar is usually larger
than the number of multipliers, Nmul, as some weights are
zero and multiplications with zero weights are meaningless.
Fig. 4. Process of a sparse channel-axis-parallel CNN accelerator.
Multipliers receive non-zero weights and their corresponding
activations. Selecting the Nmul activations from the Npar
fetched activations requires Nmul·b Npar-to-1 MUXs, where b
is the bit width of the activations. To select a proper activation,
a non-zero weight is stored with an index, denoted by the grey
rectangle in Fig. 3. A process example of the sparse MWMA
PE structure is illustrated in Fig. 4, where non-zero weights
are shown in grey. In the figure, the upper MUX selects the
second activation from the front, and the lower MUX selects
the fifth activation.
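The behavior of such a PE can be modeled in a few lines. The sketch below is a functional model only, not a description of any real hardware: each stored non-zero weight carries the index of its activation, the index plays the role of the MUX select signal, and Nmul weights are consumed per cycle. The function name and the numeric values are hypothetical; the first two indices follow the Fig. 4 description (second and fifth activations), while the third index is an assumption.

```python
def sparse_mwma_pe(activations, indexed_weights, n_mul):
    """Functional model of a sparse MWMA PE.

    activations:     the Npar activations of one fetching group
    indexed_weights: list of (index, weight) pairs for the non-zero weights
    n_mul:           number of multipliers in the PE
    Returns the accumulated partial sum and the number of cycles used.
    """
    acc = 0.0
    cycles = 0
    for start in range(0, len(indexed_weights), n_mul):
        # One cycle: each multiplier takes one non-zero weight, and the stored
        # index selects the matching activation (the Npar-to-1 MUX).
        for index, weight in indexed_weights[start:start + n_mul]:
            acc += weight * activations[index]
        cycles += 1
    return acc, cycles

# Npar = 8 activations, Nmul = 2 multipliers, three non-zero weights.
acts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
weights = [(1, 0.5), (4, -1.0), (6, 2.0)]  # (index, value); values are made up
partial_sum, n_cycles = sparse_mwma_pe(acts, weights, n_mul=2)
```

With three non-zero weights and two multipliers, the model needs two cycles, and the second cycle leaves one multiplier idle.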
As an example of such sparse architectures, Cambricon-X
consists of the sparse MWMA structure PEs with Npar=256
and Nmul=16. In this architecture, 256 activations are fetched
simultaneously, from which 16 activations are selected. For
the selection, there are 16·b 256-to-1 MUXs for each PE, and
the MUXs are gathered into the indexing module (IM).
For ease of discussion, the weight-fetching group in a sparse
architecture is defined to include all of the zero and non-zero
weights corresponding to an activation-fetching group. If the
number of non-zero weights in a weight-fetching group is
Nnon−zero, it takes ⌈Nnon−zero/Nmul⌉ cycles for a PE to
process the activation- and weight-fetching group. In Fig. 4,
one more cycle is required to process the third non-zero
weight.
Other PE structures have been used in sparse accelerators
as well. EIE [34], ESE [35], and ZENA [38] adopt SWSA
structured PEs to easily skip the pruned weights in the irregular
pattern. SCNN [37] has multiple multipliers in a PE but
exploits a special structure of Cartesian Products. In this
structure, all of the fetched non-zero activations are multiplied
with all of the fetched non-zero weights. The multiplication
results are delivered to proper output accumulators.
E. Previous Pruning Scheme and Accelerator Architecture
Previous unstructured pruning schemes could reach high
pruning ratios, but they rarely resulted in pruned networks
that fit sparse accelerator architectures well. The main reason
for this is the non-uniform distribution of the non-zero weights
left after pruning, especially the number of non-zero weights,
Nnon−zero, in each weight-fetching group.
This non-uniform distribution can cause a misalignment
between the activations and weights. In an accelerator with
MWMA PEs, the multiplier operands should be fetched simul-
taneously. When Npar activations and Nmul non-zero weights
are fetched from the internal buffers, it is not guaranteed that
every fetched weight can find its counterpart in the fetched
activations. Another activation-fetching group may have to be
fetched for the process of the weights.
To solve the misalignment problem, Cambricon-X intro-
duces a padding-zero scheme. These padding zeros, however,
not only waste the internal weight buffer but also cause another
type of inefficiency. The total number of padding zeros would
be smaller with a larger Npar. This appears to be one of the reasons
why Cambricon-X uses a very large Npar compared to Nmul.
For a usual pruning ratio of 75% in convolutional layers [19],
Npar=64 may be enough for Nmul=16, but Cambricon-X uses
Npar=256. Due to the large Npar, the activation selection part,
the IM block, has very wide, 256-to-1, MUXs. Because the
number of the MUXs is also large, Nmul·b MUXs per PE,
these wide MUXs cause a large IM block area, occupying
more than 30% of the total chip area. The large Npar also
requires a very wide (Npar·b bit-width) internal activation
buffer. This type of wide memory usually induces a larger
area than that associated with a square-shaped memory.
Furthermore, the load balance between PEs is also
a problem. A PE processes an activation-fetching group
for ⌈Nnon−zero/Nmul⌉ cycles. Owing to the diversity of
Nnon−zero, the number of cycles will vary as well. This
may cause a problem in certain types of architectures such as
Cambricon-X, which shares the fetched activations between
PEs. If some PEs complete the process of fetched activations
early, the PEs must wait for the other PEs to finish. The load-
imbalance problem is also a major concern in accelerators
with a weight serial structure such as the SWMA and SWSA
structures [34], [35], [38].
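The cost of the imbalance can be estimated with a small model. The sketch below assumes that all PEs share one activation-fetching group and that the next group is fetched only after the slowest PE has finished; the non-zero counts are hypothetical values, not measurements.

```python
import math

def step_cycles(nonzero_counts, n_mul):
    """Cycles per PE for one shared fetching group, and the cycles the whole
    step takes when every PE must wait for the slowest one."""
    per_pe = [math.ceil(n / n_mul) for n in nonzero_counts]
    return per_pe, max(per_pe)

def multiplier_utilization(nonzero_counts, n_mul):
    """Fraction of multiplier slots that perform a useful multiplication."""
    _, step = step_cycles(nonzero_counts, n_mul)
    if step == 0:
        return 0.0
    total_slots = len(nonzero_counts) * step * n_mul
    return sum(nonzero_counts) / total_slots

N_MUL = 16
unbalanced = [3, 16, 40, 5]   # hypothetical Nnon-zero per PE, irregular pruning
balanced = [16, 16, 16, 16]   # same total, evenly distributed

util_unbalanced = multiplier_utilization(unbalanced, N_MUL)
util_balanced = multiplier_utilization(balanced, N_MUL)
```

Both cases contain 64 non-zero weights, but in this model the irregular case occupies three cycles and uses one third of the multiplier slots, whereas the even case finishes in one cycle with full utilization.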
Few pruning studies have considered the uneven distribution
of the remaining non-zero weights. The study presented in [33]
considers the distribution of non-zero weights, but deals only
with fully-connected layers. Furthermore, its target was to
reduce the width of the ternary weight coding without
consideration of the accelerator architectures. A load-balance-aware
pruning scheme was proposed in [35], but this pruning scheme
focused solely on fully-connected layers, and its discussion of
the accelerator architecture is also limited. In contrast, the
proposed scheme, which
will be presented in the next section, is closely related to
accelerator architectures, resolving all of the mentioned inef-
ficiency problems. Furthermore, our scheme focuses on both
convolutional layers and fully-connected layers.
III. ACCELERATOR-AWARE PRUNING
In this section, we propose an accelerator-aware pruning
algorithm that generates a more regular non-zero weight
pattern that fits accelerator architectures well. There can be
Fig. 5. Example result of the proposed pruning scheme where Nnon−zero=2
for every weight-fetching group along the channel axis (Npar=8).
various architecture consideration points, and this paper will
concentrate on two parameters, the activation- and weight-
fetching groups and the number of non-zero weights left in
each weight group. These two parameters are closely related
to accelerator architectures. The size of the activation- and
weight-fetching group determines the internal buffer width,
and the number of non-zero weights is associated with the
required number of multipliers and processing cycles. Previous
pruning schemes do not consider these points, creating irreg-
ular distributions of the non-zero weights and the problems
mentioned in the previous section. We will discuss pruning
approaches for the architectures mentioned in the previous
section, but the algorithm is not limited to these architectures.
A. Proposed Accelerator-Aware Pruning Scheme
In the proposed pruning scheme, the weights are pruned
within the weight-fetching groups so that the number of re-
maining non-zero weights, Nnon−zero, is uniform for all of the
weight-fetching groups. The accelerator in Fig. 4, for example,
simultaneously fetches and processes an activation-fetching
group consisting of eight activations along the channel axis
(Npar=8). Nnon−zero for the corresponding weight-fetching
group is three in the figure. However, previous pruning
schemes provide no guarantee for Nnon−zero. Nnon−zero can
be two or four in another weight-fetching group. Even zero or
the size of a weight-fetching group is possible. In contrast, the
proposed scheme leaves a fixed number of non-zero weights
for all of the weight-fetching groups, as shown in Fig. 5. In
the figure, every weight-fetching group has six weights pruned
away with two non-zero weights remaining (Nnon−zero=2), as
indicated in white and grey, respectively.
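The constraint can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: weights are stored as weight[m, c, i, j], a weight-fetching group consists of Npar consecutive channels at a fixed (m, i, j), C is divisible by Npar, and the weights to remove are chosen by least magnitude. The function name prune_channel_groups is hypothetical.

```python
import numpy as np

def prune_channel_groups(weight, n_par, n_nonzero):
    """Keep the n_nonzero largest-magnitude weights in every group of n_par
    consecutive channels and set the remaining weights to zero.

    weight: array of shape (M, C, K, K); C must be divisible by n_par.
    """
    M, C, K1, K2 = weight.shape
    assert C % n_par == 0, "C is assumed to be a multiple of n_par"
    # Bring the channel axis last and split it into groups of n_par.
    w = np.moveaxis(weight, 1, -1).reshape(M, K1, K2, C // n_par, n_par)
    # Indices sorted by ascending magnitude inside each group.
    order = np.argsort(np.abs(w), axis=-1)
    mask = np.ones_like(w, dtype=bool)
    # Mark the (n_par - n_nonzero) smallest weights of each group as pruned.
    np.put_along_axis(mask, order[..., :n_par - n_nonzero], False, axis=-1)
    pruned = (w * mask).reshape(M, K1, K2, C)
    return np.moveaxis(pruned, -1, 1)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16, 3, 3))
w_pruned = prune_channel_groups(w, n_par=8, n_nonzero=2)  # as in Fig. 5
```

With n_par=8 and n_nonzero=2, every group keeps exactly two weights, which corresponds to the 75% pruning ratio of the Fig. 5 example.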
The result of the proposed pruning scheme can be applied
to the previous sparse accelerators, resolving the inefficiency
problems mentioned in the previous section. The misalignment
problem in the sparse MWMA and MWSA structures will be
solved if the number of remaining non-zero weights per group
is set to be a multiple of Nmul. This alignment would make
the padding zeros obsolete in Cambricon-X. Furthermore,
the proposed scheme can solve the load-imbalance problem,
one of the main concerns in the weight serial structures as
well. Every weight-fetching group has the same number of
remaining non-zero weights, so the number of weights for
a PE to process is naturally balanced if every PE processes
Fig. 6. PE structure when Npar=8 and Ngroup=4.
the same number of weight-fetching groups. Every sparse
architecture mentioned in the previous section can benefit from
the proposed pruning scheme.
A similar scheme was presented in [33], but the regularity
was used only to reduce the amount of weight storage. They
did not consider the effect on the accelerator architecture.
Furthermore, they only dealt with fully-connected layers. The
load-balance-aware pruning scheme, proposed in [35], adopts
a similar concept, but it can be thought of as a special case
of the proposed scheme. The load-balance-aware pruning is
also limited to fully-connected layers and the EIE and ESE
architectures.
B. Accelerator Complexity Reduction
In addition to improving the efficiency of previous sparse
architectures, the proposed scheme can also be used to reduce
the degree of accelerator complexity, especially with regard to
the indexing and activation selection logic. Cambricon-X has
very wide (256·b-bit width) activation buffers and very wide
(256-to-1) MUXs in the activation selection logic to deal with
the irregular distribution of non-zero weights by the previous
pruning schemes, as mentioned in Subsection II-E. The input
width of the MUXs can be narrowed by the proposed pruning
scheme, simplifying the activation selection logic. First, every
weight-fetching group is divided evenly into g sub-groups,
referred to as the pruning groups. The size of a pruning group,
Ngroup, will be Npar/g. The pruning is then performed so that
each pruning group has a constant number of weights pruned.
If the number of weights pruned away in a pruning group is
Nprune, the pruning ratio will be Nprune/Ngroup.
Since a weight corresponds to one of the Ngroup activations,
the width of a MUX for the activation selection can be
reduced to Ngroup. With the smaller Ngroup, the PE structure
in Fig. 3 can be modified, becoming the structure shown
in Fig. 6. As indicated in the figure, the activation selec-
tion logic becomes simplified with narrower MUXs. When
Ngroup becomes smaller, however, degradation of the network
performance can increase with the same pruning ratio. The
experiment shows that Ngroup=16 with a 75% pruning ratio
does not deteriorate the CNN performance.
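A rough estimate of the saving in the selection logic can be obtained with a simple cost model. The sketch below is an illustration, not a synthesis result. It assumes that an n-to-1 MUX is built as a tree of n−1 two-input MUXs, that a PE still needs Nmul·b MUXs, and that b=16 bits. The function names are hypothetical.

```python
def mux2_equivalents(n_inputs):
    """Number of 2-to-1 MUXs in a tree that implements one n-to-1 MUX."""
    return n_inputs - 1

def selection_logic_cost(n_mul, bit_width, mux_inputs):
    """2-to-1 MUX equivalents for the activation selection logic of one PE,
    assuming n_mul * bit_width MUXs with mux_inputs inputs each."""
    return n_mul * bit_width * mux2_equivalents(mux_inputs)

B = 16  # assumed activation bit width

wide = selection_logic_cost(n_mul=16, bit_width=B, mux_inputs=256)   # Npar=256
narrow = selection_logic_cost(n_mul=16, bit_width=B, mux_inputs=16)  # Ngroup=16

# Pruning ratio Nprune/Ngroup for Ngroup=16 and Nprune=12.
pruning_ratio = 12 / 16
```

Under these assumptions, narrowing the MUXs from 256 to 16 inputs reduces the count from 65,280 to 3,840 two-input MUX equivalents per PE, a factor of 17, at a pruning ratio of 75%.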
The proposed pruning scheme can induce a reduction of
Npar, too. As mentioned in Subsection II-E, Npar=256 for
Cambricon-X is large compared to Nmul=16. Npar=64 is
enough for the common pruning ratio of 75% in the
convolutional layers. A large Npar may be chosen to reduce
the number of padding zeros. However, the proposed scheme
makes the padding zeros unnecessary, implying that Npar=64
can be used. In this case, the width of the activation buffers
can be reduced to 64·b bits, enabling more square-like memory
components to be used. Square-like memory components are
more area-efficient than wide memory components.
Furthermore, the indexing logic can be simplified. For the
indexing of the irregular non-zero weights, Deep Compression
and EIE use relative indexing [20], [34], where the number
of zero weights between two adjacent non-zero weights is
stored. The interval is encoded with four bits, and an interval
larger than the encoding bound requires filler zero insertion.
A similar indexing scheme is used in Cambricon-X, known
as step indexing. The indexing can be simplified with the
proposed pruning scheme, where pruning is performed within
a pruning group of Ngroup. The small Ngroup enables direct
indexing, where an index indicates the position of the non-
zero weight within the pruning group. Given that Ngroup is
small, the indexing bit width is also small, ⌈log2 Ngroup⌉ bits,
even with direct indexing. Direct indexing is much simpler
than relative indexing in EIE or step indexing in Cambricon-
X and does not require filler zeros, removing the waste of the
weight storage.
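The two indexing styles can be contrasted with a short sketch. The code below is a simplified model written for illustration; it does not reproduce the exact storage formats of EIE or Cambricon-X. The relative scheme stores the zero-run length before each non-zero weight in a fixed number of bits and inserts a filler zero when a run exceeds the encodable bound, while the direct scheme stores the position inside the pruning group. The function names are hypothetical.

```python
def direct_index_bits(n_group):
    """Bits needed for a direct index, ceil(log2(n_group)), for n_group >= 1."""
    return (n_group - 1).bit_length()

def direct_indices(group):
    """Positions of the non-zero weights inside one pruning group."""
    return [pos for pos, w in enumerate(group) if w != 0]

def relative_indices(weights, bits=4):
    """Relative indexing: (zero-run length, value) pairs. A run longer than
    2**bits - 1 is split by storing a filler zero as if it were a weight."""
    max_run = 2 ** bits - 1
    entries = []
    run = 0
    for w in weights:
        if w == 0:
            if run == max_run:
                entries.append((max_run, 0))  # filler zero wastes a slot
                run = 0
            else:
                run += 1
        else:
            entries.append((run, w))
            run = 0
    return entries

group = [0, 5, 0, 0, -2, 0, 0, 0]  # Ngroup=8 with two non-zero weights
idx = direct_indices(group)        # positions 1 and 4, 3 bits each
long_run = relative_indices([0] * 20 + [7])
```

In the last example, a run of 20 zeros exceeds the 4-bit bound of 15, so the relative scheme stores one filler entry before the actual weight. Direct indexing within a small group never needs such an entry.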
C. Incremental Pruning
In the proposed scheme, Nprune weights are pruned in a
pruning group. The pruning can be processed in a few different
ways. At one extreme, the target number of weights with the
least magnitude are pruned at the same time in each group, and
the pruned network is retrained. This scheme is called one-time
pruning in this paper. At the other extreme, pruning begins
with only one weight with the least magnitude in each group.
Subsequently, a period of retraining is undertaken, followed by
the pruning of one more weight with the least retrained mag-
nitude. Retraining is performed again. The one weight pruning
and retraining processes are iterated until the target number is
reached. In the middle of the two extreme methods, we can
set the initial pruning number and the increment number. This
scheme is referred to here as incremental pruning.
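The three variants differ only in the initial pruning number and the increment, as the following sketch shows. It is a minimal illustration under stated assumptions: groups are formed from consecutive weights along the last axis, whose length is a multiple of Ngroup, and retraining is represented by a caller-supplied function retrain(weight, mask), a hypothetical placeholder for the actual training procedure. All function and parameter names are chosen here for illustration.

```python
import numpy as np

def prune_groups(weight, n_group, n_prune):
    """Zero the n_prune smallest-magnitude weights in every group of n_group
    consecutive weights along the last axis."""
    groups = weight.reshape(-1, n_group).copy()
    order = np.argsort(np.abs(groups), axis=1)
    np.put_along_axis(groups, order[:, :n_prune], 0.0, axis=1)
    return groups.reshape(weight.shape)

def incremental_prune(weight, n_group, n_target, n_init, n_step, retrain):
    """Prune n_init weights per group, retrain, then prune n_step more per
    round until n_target weights per group are pruned (n_init <= n_target)."""
    n_prune = n_init
    while True:
        weight = prune_groups(weight, n_group, n_prune)
        mask = weight != 0
        # Pruned weights are forced to stay at zero after retraining.
        weight = retrain(weight, mask) * mask
        if n_prune >= n_target:
            return weight
        n_prune = min(n_prune + n_step, n_target)

def no_retraining(weight, mask):
    """Placeholder for retraining that leaves the weights unchanged."""
    return weight

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16))
w_inc = incremental_prune(w, n_group=8, n_target=6, n_init=4, n_step=1,
                          retrain=no_retraining)
```

Setting n_init equal to n_target gives one-time pruning, and n_init=1 with n_step=1 gives the other extreme. Because already pruned weights are zero, they have the least magnitude and remain pruned in later rounds.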
Incremental pruning is expected to be better than or
equal to the one-time pruning method. However, incremen-
tal pruning requires a long retraining time. Accordingly, in
this paper, it will be applied when one-time pruning is not
sufficient.
D. Pruning for Various Architectures
In the proposed pruning scheme, pruning can be applied to
CNNs for various accelerator architectures. In the previous
subsections, pruning along the channel axis is shown for
architectures such as Cambricon-X. As an example of pruning
for another type of accelerator architecture, the proposed
pruning scheme can be adjusted to a weight-sparse version
of Cnvlutin. Cnvlutin adopts the MWSA structure, and the
weights are fetched along the filter axis. In this architecture,
the dense MWSA structure can be modified to the sparse
MWSA structure shown in Fig. 7. For the sparse architecture,
the proposed pruning scheme can be applied to CNNs with
Fig. 7. Sparse MWSA PE structure.
Fig. 8. Example result of the proposed pruning scheme with Ngroup=8 and
Nprune=6 on a fully-connected layer.
pruning groups set to the weights along the filter axis. Because
Cnvlutin already exploits activation sparsity, the structure in
Fig. 7 can exploit both activation sparsity and weight sparsity.
The architectures in [16], [17] adopt sparse MWMA struc-
ture PEs, where the activations and weights are fetched and
processed along the spatial axis. For these architectures, the
proposed pruning scheme can be applied with pruning groups
established along the spatial axis.
E. Application to Fully-Connected Layers
The proposed pruning scheme can be applied to fully-
connected layers as well. In a fully-connected layer, the
weights can be arranged in a matrix format. We can define two
axes: the column axis and the row axis. Along the row axis, the
weights are multiplied with different activations, while along
the column axis, the weights are multiplied with the same activation.
An MWMA-structured PE processes the weights along the row
axis simultaneously, and an MWSA-structured PE processes
the weights along the column axis.
According to the PE structures, accelerator-aware pruning is
applied following different axes. If an MWMA-structured PE
is used, the weights are grouped along the row axis and pruned
so that each group has a fixed number of non-zero weights
remaining. If an MWSA-structured PE is used, the weights
are grouped and pruned along the column axis. An example of
the proposed pruning scheme along the row axis is illustrated
in Fig. 8. In the figure, Npar=Ngroup=8 and Nprune=6.
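For a fully-connected layer, the same grouping can be applied along either axis of the weight matrix. The sketch below assumes the convention y = Wx with W of shape (outputs, inputs), so that the row axis corresponds to axis 1 and the column axis to axis 0; it also assumes that the length of the chosen axis is a multiple of Ngroup and that weights are selected by least magnitude. The function name prune_fc is hypothetical.

```python
import numpy as np

def prune_fc(weight, n_group, n_prune, axis):
    """Prune n_prune weights in every group of n_group consecutive weights
    along the given axis of a fully-connected weight matrix.

    axis=1: groups along the row axis (MWMA-structured PE)
    axis=0: groups along the column axis (MWSA-structured PE)
    """
    w = np.moveaxis(weight, axis, -1)
    shape = w.shape
    groups = w.reshape(-1, n_group).copy()
    order = np.argsort(np.abs(groups), axis=1)
    np.put_along_axis(groups, order[:, :n_prune], 0.0, axis=1)
    return np.moveaxis(groups.reshape(shape), -1, axis)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8))                    # 16 outputs, 8 inputs
row_pruned = prune_fc(w, n_group=8, n_prune=6, axis=1)  # setting of Fig. 8
col_pruned = prune_fc(w, n_group=8, n_prune=6, axis=0)
```

With Ngroup=8 and Nprune=6, each row group in row_pruned and each column group in col_pruned retains exactly two non-zero weights.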
IV. EXPERIMENTAL RESULTS
To show that the proposed pruning scheme can preserve
the performance of CNNs well even with the given constraint,
the top-5 accuracy for the ImageNet 2012 validation data
set [42] was measured. The retraining was performed by
Caffe [43] in one of three modes. In Retraining 1, after one-
time pruning, retraining was performed with a learning rate
of 5·10^−4 for 12 epochs. If the original accuracy was not
recovered, 8-epoch retraining was performed additionally with
a learning rate of 10^−4. If Retraining 1 was not enough,
Retraining 2 was applied, where the learning rate begins
at 5·10^−4 and decreases to 10^−4, 10^−5, and 10^−6 when
the validation accuracy becomes saturated. In Retraining 3,
Retraining 2 is applied with incremental pruning. When the
validation accuracy is saturated at the learning rate of 10^−6,
Nprune is increased and the retraining resumes with a learning
rate of 5·10^−4. The retraining mini-batch size was set to 256.
The network models were obtained publicly [44]–[48]. Various
methods can be used to select the weights to be pruned. Any
selection method is applicable, but the simplest method was
used in the experiments. The weights with the least magnitude
are pruned first in a pruning group.
The pruning of convolutional layers will be discussed first
because convolutional layers account for most of the compu-
tations. The proposed pruning scheme does not change the
channel number of output feature maps and can therefore be
applied to the residual blocks of ResNet including multiple
branches without difficulty. The bias values are not pruned.
Our pruning scheme was then applied to fully-connected layers
with the pruned convolutional layers. It will be shown that
the proposed pruning method is also applicable to compact
networks. The last two subsections will analyze the effects
of the proposed scheme on the accelerator performance and
complexity.
A. Convolutional Layer Pruning
Table III shows the accuracy results right after the pruning
of the convolutional layers. The weights are grouped along the
channel axis, and the first convolutional layer is not pruned
as it has a much smaller number of weights and operations
than the other layers. The table shows that 37.5% (3/8 or 6/16)
pruning already begins to degrade the accuracy. However,
the degradation can be recovered with retraining as shown
in Table IV. Retraining 1 and 3 are applied; the results of
Retraining 3 are denoted by asterisks.
In the table, with up to 75% (6/8 or 12/16) pruning,
the validation accuracy could be recovered to the baseline
accuracy with Retraining 1 in very deep networks, including
TABLE III
IMAGENET VALIDATION ACCURACY (%) RIGHT AFTER CONVOLUTIONAL LAYER PRUNING
Ngroup  Nconvprune  AlexNet  VGG16  ResNet-50  ResNet-152
-       -           79.81    88.44  91.14      92.20
8       1           79.68    88.31  90.94      92.12
8       2           79.17    88.02  90.30      91.69
8       3           76.89    85.52  88.27      90.72
8       4           65.02    70.20  78.79      86.37
8       5           32.19    12.23  17.21      46.58
16      1           79.83    88.47  91.16      92.22
16      2           79.76    88.36  91.08      92.18
16      4           79.35    88.19  90.80      92.07
16      6           78.00    86.64  88.88      91.36
16      8           69.99    75.34  80.74      87.90
16      9           60.84    55.15  59.39      80.36
16      10          38.61    20.12  32.42      65.67
TABLE IV
IMAGENET VALIDATION ACCURACY (%) AFTER CONVOLUTIONAL LAYER PRUNING AND RETRAINING
Ngroup  Nconvprune  AlexNet  VGG16  ResNet-50  ResNet-152
-       -           79.81    88.44  91.14      92.20
8       4           80.45    90.71  91.95      93.03
8       5           *80.58   90.45  91.68      92.79
8       6           *80.42   89.91  91.14      92.33
8       7           *79.47   88.48  88.86      91.00
16      8           80.46    90.73  91.96      93.02
16      9           80.38    90.80  91.80      92.89
16      10          *80.62   90.54  91.86      92.75
16      11          *80.74   90.38  91.55      92.55
16      12          *80.50   90.22  91.35      92.48
16      13          *80.22   89.65  90.88      92.20
Unstructured (81.25%)  *80.34  90.06  91.17    92.37
VGG16, ResNet-50, and ResNet-152. The result of VGG16
matches that of the unstructured pruning algorithm in [19],
where the pruning ratios of the convolutional layers lie
around 75% in VGG16. The experiment shows that a
similar pruning ratio can be reached with the proposed pruning
scheme considering the accelerator constraint. It can also be
seen that a 75% pruning ratio does not degrade the accuracy
in the residual networks, ResNet-50 and ResNet-152. The
networks have more complicated structures, such as a residual
path and 1×1 convolution. The results show that the proposed
scheme can be applied to recent state-of-the-art CNNs. With
some pruning ratios, the accuracy is improved, which has
been observed in other pruning papers, too [19], [30]. This
improvement appears to be caused by a kind of regularization
[30].
In a relatively shallow network such as AlexNet, it was more
difficult to recover the accuracy. With the more advanced effort
of Retraining 3, however, the original accuracy level can be
recovered with pruning ratios comparable to those in [19]. In
AlexNet, Retraining 3 begins with (Nprune, Ngroup)=(5,8) or
(10,16), and Nprune is increased by one. While the pruning
ratio of the AlexNet convolutional layers was around 65%
with the unstructured pruning in [19], the proposed scheme
can reach a pruning ratio of 81.25% after approximately 300
epochs of retraining.
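The pruning ratios visited by the incremental schedule follow directly
from Nprune/Ngroup. The sketch below only tabulates those ratios; the
retraining step between increments is not modelled, the function name
is hypothetical, and the end points (13 for Ngroup=16, 7 for Ngroup=8)
are read from the AlexNet column of Table IV.

```python
from fractions import Fraction


def incremental_schedule(n_start, n_end, n_group):
    """Return (n_prune, pruning ratio in percent) for each schedule step."""
    return [(n, float(Fraction(n, n_group) * 100))
            for n in range(n_start, n_end + 1)]


# Starting points given in the text: (Nprune, Ngroup) = (5, 8) or (10, 16).
for n_start, n_end, n_group in ((5, 7, 8), (10, 13, 16)):
    for n_prune, percent in incremental_schedule(n_start, n_end, n_group):
        print(f"Nprune={n_prune:2d}, Ngroup={n_group:2d}: {percent:.2f}% pruned")
```

With Ngroup=16 the last step, Nprune=13, corresponds to the 81.25%
pruning ratio quoted above.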
The last row of Table IV shows the results of an unstructured
pruning scheme, which prunes the 81.25% of the weights with the
least magnitude in each convolutional layer. The unstructured
pruning scheme shows slightly better accuracy, but the differences
are small (at most about 0.4 percentage points).

TABLE V
IMAGENET VALIDATION ACCURACY (%) AFTER FULLY-CONNECTED LAYER PRUNING AND RETRAINING WITH Ngroup=16 AND Nconvprune=12

Nfc1,2prune  Nfc3prune  AlexNet  VGG16  ResNet-50  ResNet-152
-            -          79.81    88.44  91.14      92.20
12           12         *80.24   89.84  91.24      92.54
13           12         *80.29   89.56  -          -
14           12         *80.01   89.19  -          -
15           12         *79.47   88.95  -          -

B. Fully-Connected Layer Pruning
After the pruning of the convolutional layers, the fully-
connected layers were pruned along the row axis. In the fully-
connected layer pruning step, we attempted to prune more
weights than were pruned in the convolutional layers because,
generally, more weights can be pruned in fully-connected
layers [19]. The size of a pruning group, Ngroup, was equally
set for the convolutional layers and fully-connected layers.
The retrained accuracy is shown in Table V. In the table,
Nconvprune, Nfc1,2prune, and Nfc3prune are the numbers of weights
pruned in each group in the convolutional layers, the first and
second fully-connected layers, and the last fully-connected layer,
respectively. ResNet-50 and ResNet-152 have one fully-connected
layer; hence, Nfc1,2prune was ignored in these networks.
Retraining 1 was applied to all of the networks except for AlexNet,
to which Retraining 3 was applied. In Retraining 3, Nfc1,2prune
was increased by one at a time, starting from 12.
The table shows that the proposed pruning scheme can reach
a pruning ratio similar to that of the previous pruning scheme
in the fully-connected layers. In [19], 90–96% of the weights
were pruned in the first and the second fully-connected layers
of AlexNet and VGG16, as were 75–77% of the weights
in the last layer. In all of the presented networks, the proposed
pruning scheme could prune 75% of the weights (Nfc3prune=12)
in the last fully-connected layers. For the first and second fully-
connected layers of VGG16, pruning with Nfc1,2prune=15 (93.75%
pruning) did not degrade the accuracy. In AlexNet, Nfc1,2prune=15
showed an accuracy of 79.47%, which is slightly worse than
the accuracy of the pruned AlexNet in [19], 79.68%. Because
the convolutional-layer pruning ratio with Nconvprune=12 (75%) is
higher than the 62–65% in [19], the pruning results are quite
comparable.
C. Pruning Along Other Axes

TABLE VI
IMAGENET VALIDATION ACCURACY (%) AFTER PRUNING AND RETRAINING ALONG OTHER AXES

Axis     Ngroup  Nconvprune  VGG16  ResNet-50
-        -       0           88.44  91.14
Filter   16      10          90.63  91.71
Filter   16      11          90.46  91.54
Filter   16      12          90.12  91.31
Filter#  16      12          89.49  90.85
Spatial  9       5           90.32  91.85
Spatial  9       6           89.69  91.49
Spatial  9       7           88.86  91.31

TABLE VII
IMAGENET VALIDATION ACCURACY (%) AFTER PRUNING AND RETRAINING OF COMPACT NETWORKS

Ngroup  Nconvprune  SqueezeNet v1.0  MobileNetV1-224 1.0
-       -           80.39            89.24
16      8           *80.93           89.97
16      9           *80.65           89.67
16      10          *80.29           *89.79
16      11          *79.86           *89.50
16      12          *78.80           *89.06

TABLE VIII
IMAGENET VALIDATION ACCURACY (%) AFTER PRUNING AND RETRAINING OF SLIMMED NETWORKS IN [24]

Ngroup  Nconvprune  VGG16-4X  VGG16-5X  ResNet-50 CP
-       -           89.06     86.98     89.64
16      8           89.53     88.25     91.44
16      10          88.58     87.91     91.28
16      12          *87.05    87.36     90.30

The proposed pruning scheme can be applied along other
axes. The accuracy results after pruning and retraining along
the filter and the spatial axes are presented in Table VI. In the
third to fifth rows of this table, the convolutional layers are
pruned along the filter axis, and in the next row, denoted by '#',
the fully-connected layers are further pruned with Nfc1,2prune=15
and Nfc3prune=12. In the last three rows, the convolutional layers
are pruned along the spatial axis. These results show that the
accuracy is not meaningfully degraded by pruning along other axes,
meaning that the proposed accelerator-aware pruning scheme can be
used for various accelerator architectures.
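How the pruning groups might be formed along each axis can be sketched
as follows. The index layout (k, c, r, s) for filter, channel, kernel
row, and kernel column, the symbols K, R, and S, and the rule that a
spatial group spans one whole R x S kernel (consistent with Ngroup=9 in
Table VI for 3x3 kernels) are assumptions of this sketch rather than
definitions given in this section.

```python
from itertools import product


def weight_groups(K, C, R, S, axis, n_group):
    """List the pruning groups of a K x C x R x S convolution weight tensor.

    Each group is a list of (k, c, r, s) index tuples. The grouping rules
    below are one interpretation of pruning along each axis.
    """
    groups = []
    if axis == "channel":
        if C % n_group:
            raise ValueError("C must be a multiple of n_group")
        # Consecutive channels at a fixed filter and kernel position.
        for k, r, s in product(range(K), range(R), range(S)):
            for c0 in range(0, C, n_group):
                groups.append([(k, c, r, s) for c in range(c0, c0 + n_group)])
    elif axis == "filter":
        if K % n_group:
            raise ValueError("K must be a multiple of n_group")
        # Consecutive filters at a fixed channel and kernel position.
        for c, r, s in product(range(C), range(R), range(S)):
            for k0 in range(0, K, n_group):
                groups.append([(k, c, r, s) for k in range(k0, k0 + n_group)])
    elif axis == "spatial":
        if n_group != R * S:
            raise ValueError("spatial groups are assumed to span one R x S kernel")
        # All kernel positions of one (filter, channel) pair.
        for k, c in product(range(K), range(C)):
            groups.append([(k, c, r, s) for r in range(R) for s in range(S)])
    else:
        raise ValueError("axis must be 'channel', 'filter', or 'spatial'")
    return groups


# Hypothetical 3x3 layer with 32 filters and 16 channels.
for axis, n_group in (("channel", 16), ("filter", 16), ("spatial", 9)):
    groups = weight_groups(32, 16, 3, 3, axis, n_group)
    print(axis, len(groups), "groups of", len(groups[0]))
```

Whichever axis is chosen, every weight belongs to exactly one group, so
the same per-group pruning rule can be applied unchanged.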
D. Compact Network Pruning
Some compact networks have been proposed recently [5],
[49], [50]. The compact networks are less redundant, meaning
that pruning may be more harmful. Table VII, however,
shows that the original accuracy can be recovered with
the accelerator-aware pruning along the channel axis. In
SqueezeNet, the proposed scheme can reach a pruning ratio
comparable to 66.3%, the pruning ratio of an unstructured
pruning scheme in [5]. With Nconvprune=10 or 11 (62.5% or 68.75%
pruning), the accuracy degradation of the proposed pruning scheme
is at most about 0.5 percentage points. In MobileNet [49], the
accuracy is recovered more easily; in this case, 75% pruning still
yields good accuracy. MobileNet consists of depthwise convolutional
layers and pointwise convolutional layers, and only the latter
were pruned because they account for most of the operations
and weight storage.
E. Slimmed Network Pruning

TABLE IX
ACCELERATOR PERFORMANCE ESTIMATION

Architecture (Layers)   Metric          Han15 [19]  Han15 [19]#  AAP
Cambricon-X (Conv2–5)   Ntotalnonzero   1098K       751K         751K
                        Npadding        160K        171K         38K
                        NMAC            283M        191M         191M
                        Ncycle          1946K       1540K        857K
                        Utilization     57%         49%          87%
EIE (FC1–3)             Ntotalnonzero   5925K       4456K        4456K
                        Ncycle          187K        154K         70K
                        Utilization     49%         45%          100%

Some previous works pruned convolutional layers in a
channel-wise or filter-wise approach [22]–[24]. The resulting
networks are slimmer networks with fewer channels in the
layers. Slimmed networks can also be pruned by the proposed
pruning scheme. For this experiment, the networks slimmed
in [24] were used because their models are publicly available
[51]. In [24], the number of weights in the networks was also
reduced by other methods such as decomposition. The proposed
scheme was applied with Ngroup=16. In some layers of the
slimmed networks, C is not a multiple of Ngroup. In such a
case, it is assumed that zero weights are added to make C a
multiple of Ngroup.
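The zero-padding assumption for layers whose channel count is not a
multiple of Ngroup can be sketched as below. The helper names and the
example channel count of 24 are hypothetical and chosen only for
illustration.

```python
def padded_channel_count(c, n_group):
    """Smallest multiple of n_group that is not below c."""
    return -(-c // n_group) * n_group


def pad_channels(channel_weights, n_group):
    """Append zero weights so the channel count is a multiple of n_group."""
    n_pad = padded_channel_count(len(channel_weights), n_group) - len(channel_weights)
    return list(channel_weights) + [0.0] * n_pad


# Hypothetical layer with C=24 channels and Ngroup=16: C is padded to 32.
weights = [0.1 * (i + 1) for i in range(24)]
padded = pad_channels(weights, n_group=16)
print(len(weights), "->", len(padded))
```

After padding, the channel axis splits into whole groups, so the same
per-group pruning can be applied as in the unslimmed networks.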
The accuracy results are shown in Table VIII. The table
shows that the proposed pruning scheme prunes weights
fairly well even in the already slimmed networks. In this case,
50% pruning of the convolutional layers (Nconvprune=8) does not
degrade the accuracy of the slimmed networks. Moreover, 75%
pruning (Nconvprune=12) does not degrade the
accuracy of VGG16-5X or ResNet-50 CP.
F. Accelerator Performance
A neural network pruned by the proposed scheme can be
processed more efficiently in accelerators. The upper part
of Table IX compares the efficiency of Cambricon-X when
it runs the second to fifth convolutional layers of AlexNet
pruned by the unstructured pruning in [19] and the proposed
accelerator-aware pruning. In the table, Ntotalnonzero, Npadding,
NMAC, and Ncycle are the total number of non-zero weights
after each pruning, the number of padding zeros required for
the alignment, the number of multiply-accumulate (MAC)
operations with non-zero weights, and the estimated number of
processing cycles, respectively. The utilization is the
fraction of time during which the multipliers operate on valid
operands, calculated as follows:

$$\text{Utilization} = \frac{N_{MAC}}{N_{cycle} \times N_{mul} \times N_{PE}}. \qquad (2)$$

With the conventional pruning scheme, Cambricon-X executes
the pruned network with some inefficiency, as shown in
the third column. The required padding zeros amount to
approximately 14.6% of the non-zero weights, which wastes
space in the internal buffer and slots in the multipliers.
Together with the load imbalance between PEs, this limits the
utilization of the multipliers to only 57%.
With the proposed scheme, shown in the last column,
the resource waste can be greatly reduced. For this column,
AlexNet is pruned by the proposed scheme with Nconvprune=12
and Ngroup=16. The padding zeros amount to merely
5% of the non-zero weights. They are required in the second
convolutional layer because this layer does not have enough
input channels: with 48 input channels, only twelve non-zero
weights remain for an activation-fetching group, which is less
than Nmul. The small amount of padding zeros and the naturally
load-balanced PEs lead to a high utilization of 87.24%, and the
number of processing cycles is reduced by 56%.
Since the pruning ratio of [19] is less than 75%, we
compared an additional case in the fourth column. For this
column, each layer of the network in the third column is
additionally pruned to reach a 75% pruning ratio. With the
additional pruning, the numbers of non-zero weights and
MACs are reduced, but a similar number of padding zeros is
still required, so the utilization deteriorates further. Compared
to this case, the proposed pruning scheme can reduce the
number of processing cycles by 44%.
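The figures quoted for Cambricon-X can be checked against Equation (2)
with a short calculation. The sketch assumes Nmul=16 multipliers per PE
and NPE=16 PEs for Cambricon-X, and it uses the rounded values printed
in Table IX, so the results agree with the table only to within about
one percentage point. The function and variable names are illustrative.

```python
def utilization(n_mac, n_cycle, n_mul, n_pe):
    """Equation (2): fraction of multiplier slots doing useful MACs."""
    return n_mac / (n_cycle * n_mul * n_pe)


def reduction(baseline, value):
    """Relative reduction of value with respect to baseline."""
    return 1.0 - value / baseline


# Assumed configuration: 16 PEs with 16 multipliers each.
N_MUL, N_PE = 16, 16

# Cambricon-X rows of Table IX (values as printed, hence rounded).
table_ix = {
    "Han15": {"nonzero": 1098e3, "padding": 160e3, "mac": 283e6, "cycle": 1946e3},
    "Han15#": {"nonzero": 751e3, "padding": 171e3, "mac": 191e6, "cycle": 1540e3},
    "AAP": {"nonzero": 751e3, "padding": 38e3, "mac": 191e6, "cycle": 857e3},
}

for name, row in table_ix.items():
    util = utilization(row["mac"], row["cycle"], N_MUL, N_PE)
    pad = row["padding"] / row["nonzero"]
    print(f"{name}: utilization {util:.1%}, padding/non-zero {pad:.1%}")

aap_cycles = table_ix["AAP"]["cycle"]
print(f"cycle reduction vs Han15: {reduction(table_ix['Han15']['cycle'], aap_cycles):.1%}")
print(f"cycle reduction vs Han15#: {reduction(table_ix['Han15#']['cycle'], aap_cycles):.1%}")
```

Under these assumptions the calculation gives utilizations of roughly
57%, 48%, and 87% and cycle reductions of roughly 56% and 44%, in line
with the values discussed above.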
The lower part of Table IX compares the number of processing
cycles of the EIE architecture. Because the EIE architecture
only deals with the fully-connected layers, the table
compares the cycles for processing the fully-connected layers of
AlexNet. In this part, NMAC and Npadding are not listed
because NMAC is equal to Ntotalnonzero in fully-connected layers
and EIE does not require padding zeros. In EIE, NPE is 64,
and each PE has one multiplier. In accordance with the EIE
processing scheme, the accelerator-aware pruning is applied along
the column axis with Ngroup=16, Nfc1,2prune=15, and Nfc3prune=12.
As shown in the table, with the unstructured pruning of [19] the
load-imbalance problem degrades the utilization
to 49%. The proposed pruning scheme, shown in the fifth
column, improves the utilization to 100%, reducing the number
of processing cycles by around 63%. For a fair comparison,
the network of the third column is pruned additionally, and the
result is shown in the fourth column. With the utilization under
50%, the additionally pruned network still requires more than
twice as many processing cycles as that of the fifth column.
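The effect of load imbalance on an EIE-like array can be illustrated
with a toy model. It assumes that each PE performs one MAC per cycle and
that all PEs wait for the slowest one; the real EIE pipeline is not
modelled. The per-PE weight count of 64, the random mask that stands in
for unstructured pruning, and the mapping of whole pruning groups to a
single PE are all hypothetical choices made for this sketch.

```python
import random


def cycles_and_utilization(nonzeros_per_pe):
    """Toy model: one MAC per PE per cycle, all PEs wait for the slowest."""
    n_cycle = max(nonzeros_per_pe)
    if n_cycle == 0:
        return 0, 0.0
    n_mac = sum(nonzeros_per_pe)
    return n_cycle, n_mac / (n_cycle * len(nonzeros_per_pe))


N_PE = 64              # number of PEs, as stated for EIE
WEIGHTS_PER_PE = 64    # hypothetical share of one weight column per PE
N_GROUP, N_PRUNE = 16, 15

# Unstructured stand-in: each weight survives independently with the same
# average keep rate, so the per-PE counts differ.
rng = random.Random(0)
keep_prob = (N_GROUP - N_PRUNE) / N_GROUP
unstructured = [sum(rng.random() < keep_prob for _ in range(WEIGHTS_PER_PE))
                for _ in range(N_PE)]

# Group-wise pruning: every group of N_GROUP keeps N_GROUP - N_PRUNE weights,
# so every PE ends up with the same count.
structured = [(WEIGHTS_PER_PE // N_GROUP) * (N_GROUP - N_PRUNE)] * N_PE

for name, counts in (("unstructured", unstructured), ("group-wise", structured)):
    n_cycle, util = cycles_and_utilization(counts)
    print(f"{name}: {n_cycle} cycles, utilization {util:.1%}")
```

In this model the group-wise case always reaches full utilization,
whereas the unstructured case is limited by the most heavily loaded PE.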
G. Accelerator Logic Complexity
As mentioned in Subsection III-B, the proposed pruning
scheme can reduce the accelerator complexity. In this subsec-
tion, attempts are made to reduce the area of Cambricon-X.
The second row in Table X presents the area, delay time, and
power consumption of Cambricon-X. The power values were
measured with dynamic simulation. In the table, NBin, NBout,
and SB are the input activation buffer, the output activation
buffer, and the weight buffer, respectively. IM is the indexing
and activation selection unit including the activation selection
MUXs.
Because the RTL code of Cambricon-X is not publicly
available, we re-implemented the blocks in the table for the
area comparison. The other blocks in Cambricon-X are not
affected by the proposed scheme. Hardware parameters are
inferred from the context of the paper. The synthesis results
of the re-implemented case are presented at the third row. The
synthesis was performed with Synopsys Design Compiler using
a GlobalFoundries 65 nm process library. Since the results
of Cambricon-X are obtained after the placement and routing
(P&R) process with a different library, it is difficult to compare
the values of the second and third rows directly. Therefore, the
expected reduction effect on Cambricon-X will be estimated
from the reduction in the re-implemented version, which is
shown in the following rows. The number of PEs, NPE, is
assumed to be 16.

TABLE X
ACCELERATOR SYNTHESIS RESULTS

Accelerator (Measurement Condition)  Npar  Ngroup  Nmul  NBin  NBout  SB    IM    16 PEs  Total  Delay (ns)  Power (W)
Cambricon-X (P&R)                    -     -       -     0.55  0.55   1.05  1.98  1.78    6.38   1.00        0.95
Re-implemented (Synthesis)           256   256     16    0.32  0.32   0.43  2.39  1.46    4.91   1.02        2.56
Reduced (Synthesis)                  64    8       16    0.11  0.11   0.43  0.17  1.19    2.01   1.05        1.24
Reduced (Synthesis)                  64    16      16    0.11  0.11   0.43  0.27  1.41    2.34   1.02        1.49
Reduced (Synthesis)                  64    32      16    0.11  0.11   0.43  0.44  1.44    2.53   1.02        1.58
Cambricon-X Reduced (Estimation)     64    16      16    0.19  0.19   1.05  0.22  1.72    3.84   1.00        0.56
Reduced (Synthesis)                  128   16      16    0.18  0.18   0.43  0.28  1.41    2.47   1.02        1.58
Reduced (Synthesis)                  64    16      8     0.11  0.11   0.29  0.16  0.62    1.29   1.02        0.71
The NBin, NBout, SB, IM, 16 PEs, and Total columns give the area in mm2.

The fourth to sixth rows present the synthesis results of
reduced implementation assuming the application of the pro-
posed pruning scheme with various Ngroup values. Compared
to the results of the re-implemented Cambricon-X at the third
row, the area is reduced greatly, especially that used by NBin,
NBout and IM. The area of NBin and NBout is reduced
due to the memory width reduction from 256×16 to 64×16.
Although the memory capacity remains unchanged, the more
square-shaped memory configuration results in a smaller area.
The area reduction of the IM block is even more pronounced, with a
decrease of 82–93%. The wide 256-to-1 MUXs in the
IM block of Cambricon-X are replaced with narrow Ngroup-to-1
MUXs. Because the area of a MUX is proportional
to its input width, the narrow MUXs lead to a smaller IM
block. With the smaller area, the table also shows lower power
consumption. The delay increased slightly, but the difference is
negligible. The seventh row shows the estimated Cambricon-
X when the simplification proposed in Subsection III-B is
applied. The values were obtained by comparing the third and
fifth rows. Due to the area reduction of NBin, NBout, and IM,
the total area can be reduced by 40%.
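One way to read "comparing the third and fifth rows" is proportional
scaling: each published Cambricon-X block area is multiplied by the
ratio of the reduced to the re-implemented synthesis area for the same
block. This section does not spell out the procedure, so the sketch
below is an assumption; it is included because it reproduces the area
values printed in the estimation row of Table X. The dictionary and
function names are illustrative.

```python
def scale(published, reimplemented, reduced):
    """Scale a published value by the ratio measured on the re-implementation."""
    return published * reduced / reimplemented


# Block areas in mm2, copied from Table X.
cambricon_x = {"NBin": 0.55, "NBout": 0.55, "SB": 1.05, "IM": 1.98, "PEs": 1.78}
cambricon_x_total = 6.38  # also covers blocks that are not listed separately
reimplemented = {"NBin": 0.32, "NBout": 0.32, "SB": 0.43, "IM": 2.39, "PEs": 1.46}
reduced = {"NBin": 0.11, "NBout": 0.11, "SB": 0.43, "IM": 0.27, "PEs": 1.41}

# Estimated block areas, rounded to two decimals as in the table.
estimate = {block: round(scale(cambricon_x[block], reimplemented[block], reduced[block]), 2)
            for block in cambricon_x}

# Blocks outside the listed ones are assumed to be unchanged.
saving = sum(cambricon_x[block] - estimate[block] for block in cambricon_x)
total_estimate = round(cambricon_x_total - saving, 2)

print(estimate)
print(f"total {total_estimate} mm2, reduction {1 - total_estimate / cambricon_x_total:.0%}")

# IM reduction in the re-implemented design for Ngroup = 8, 16, 32.
for im_area in (0.17, 0.27, 0.44):
    print(f"IM reduction: {1 - im_area / reimplemented['IM']:.0%}")
```

Under this reading the estimated total is 3.84 mm2, about 40% below the
published 6.38 mm2, and the IM reductions fall in the 82–93% range
stated above.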
The results of various configurations are presented at the
last two rows for comparison. With Npar=128, the areas of
activation memory components are increased due to wide
memory configuration. The area of the IM block is not
increased because the same Ngroup leads to the same MUX
input width. The area of the IM block is affected by Ngroup
and Nmul. The table shows that Nmul also affects the area of SB
even though the capacity of SB is not changed. This is also due to
the memory configuration.
V. CONCLUSIONS
In this paper, an accelerator-aware pruning scheme was pro-
posed for CNNs. In the pruning process, the proposed scheme
considers the accelerator parameters: the width of the internal
activation buffer and the number of multipliers. After pruning,
each weight-fetching group has a fixed number of non-zero
weights left. Networks pruned by the proposed scheme can
be efficiently run on the target accelerator. Furthermore, the
proposed pruning scheme can be used to reduce the complexity
of the accelerators, too. Even with the accelerator constraint, it
was shown that the proposed scheme can reach pruning ratios
close to those of existing unstructured pruning schemes. In
this paper, the pruning scheme was discussed in relation to
representative sparse accelerator architectures, but the scheme
can be used for any sparse architecture.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Proc. Advances in Neural
Inf. Process. Syst., 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” in Proc. Int. Conf. Learning
Representations, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” in Proc. Comput. Vision and Pattern Recognition, 2015,
pp. 1–9. [Online]. Available: http://arxiv.org/abs/1409.4842
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proc. Comput. Vision and Pattern Recognition,
2016, pp. 770–778. [Online]. Available: http://arxiv.org/abs/1512.03385
[5] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
and K. Keutzer. (2016) SqueezeNet: AlexNet-level accuracy with
50x fewer parameters and <0.5MB model size. [Online]. Available:
http://arxiv.org/abs/1602.07360
[6] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun,
“OverFeat: Integrated recognition, localization and detection using
convolutional networks,” in Proc. Int. Conf. Learning Representations,
2014. [Online]. Available: http://arxiv.org/abs/1312.6229
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature
hierarchies for accurate object detection and semantic segmentation,”
in Proc. Comput. Vision and Pattern Recognition, 2014, pp. 580–587.
[8] R. Girshick, “Fast R-CNN,” in Proc. Int. Conf. Comput. Vision, 2015,
pp. 1440–1448. [Online]. Available: http://arxiv.org/abs/1504.08083
[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
time object detection with region proposal networks,” in Proc. Advances
in Neural Inf. Process. Syst., 2015, pp. 1–9.
[10] N. McLaughlin, J. M. del Rincon, and P. C. Miller, “Person reidentifica-
tion using deep convnets with multitask learning,” IEEE Trans. Circuits
Syst. Video Technol., vol. 27, no. 3, pp. 525–539, Mar. 2017.
[11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proc. Comput. Vision and Pattern
Recognition, 2015. [Online]. Available: http://arxiv.org/abs/1411.4038
[12] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” in Proc. Int. Conf. Architectural Support for Pro-
gramming Languages and Operating Syst., 2014, pp. 269–283.
[13] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network
computing,” in Proc. Int. Symp. Comput. Architecture, 2016, pp. 1–13.
[14] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in Proc. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays, 2015,
pp. 161–170.
[15] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-Brain: A deep
learning accelerator that tames the diversity of CNNs through adaptive
data-level parallelization,” in Proc. Design Automation Conf., 2016, pp.
123:1–123:6.
[16] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu,
S. Song, Y. Wang, and H. Yang, “Going deeper with embedded FPGA
platform for convolutional neural network,” in Proc. ACM/SIGDA Int.
Symp. Field Programmable Gate Arrays, 2016, pp. 26–35.
[17] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space
exploration of FPGA-based deep convolutional neural networks,” in
Proc. Asia and South Pacific Design Automation Conf., 2016, pp. 575–
580.
[18] J. Jo, S. Cha, D. Rho, and I.-C. Park, “DSIP: A scalable inference
accelerator for convolutional neural networks,” IEEE J. Solid-State
Circuits, vol. 53, no. 2, pp. 605–618, Feb. 2018.
[19] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights
and connections for efficient neural networks,” in Proc. Advances in
Neural Inf. Process. Syst., 2015, pp. 1135–1143. [Online]. Available:
http://arxiv.org/abs/1506.02626
[20] S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing
deep neural networks with pruning, trained quantization and Huffman
coding,” in Proc. Int. Conf. Learning Representations, 2016. [Online].
Available: http://arxiv.org/abs/1510.00149
[21] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient
convolutional neural networks using energy-aware pruning,” in Proc.
Comput. Vision and Pattern Recognition, 2017. [Online]. Available:
http://arxiv.org/abs/1611.05128
[22] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured
sparsity in deep neural networks,” in Proc. Advances in Neural Inf.
Process. Syst., 2016, pp. 2074–2082.
[23] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters
for efficient ConvNets,” in Proc. Int. Conf. Learning Representations,
2017. [Online]. Available: http://arxiv.org/abs/1608.08710
[24] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very
deep neural networks,” in Proc. Int. Conf. Comput. Vision, 2017, pp.
1398–1406. [Online]. Available: http://arxiv.org/abs/1707.06168
[25] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke,
“Scalpel: Customizing DNN pruning to the underlying hardware paral-
lelism,” in Proc. Int. Symp. Comput. Architecture, 2017, pp. 548–560.
[26] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning
convolutional neural networks for resource efficient inference,” in
Proc. Int. Conf. Learning Representations, 2017. [Online]. Available:
http://arxiv.org/abs/1611.06440
[27] N. Yu, S. Qiu, X. Hu, and J. Li, “Accelerating convolutional neural
networks by group-wise 2D-filter pruning,” in Proc. Int. Joint Conf.
Neural Networks, 2017, pp. 2502–2509.
[28] V. Lebedev and V. Lempitsky, “Fast ConvNets using group-wise brain
damage,” in Proc. Comput. Vision and Pattern Recognition, 2016, pp.
2554–2564. [Online]. Available: http://arxiv.org/abs/1506.02515
[29] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep
convolutional neural networks,” ACM J. on Emerging Technologies in
Computing Syst., vol. 13, no. 3, pp. 32:1–32:18, Feb. 2017.
[30] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J.
Dally, “Exploring the regularity of sparse structure in convolutional
neural networks,” in Proc. Int. Conf. Learning Representations, 2017.
[Online]. Available: http://arxiv.org/abs/1705.08922
[31] D. Kadetotad, S. Arunachalam, C. Chakrabarti, and J.-S. Seo, “Efficient
memory compression in deep neural networks using coarse-grain spar-
sification for speech applications,” in Proc. Int. Conf. Comput. Aided
Design, 2016.
[32] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey,
“Faster CNNs with direct sparse convolutions and guided pruning,” in
Proc. Int. Conf. Learning Representations, 2017.
[33] Y. Boo and W. Sung, “Structured sparse ternary weight coding of deep
neural networks for efficient hardware implementations,” in Proc. IEEE
Workshop on Signal Process. Syst. Design and Implementation, 2017.
[34] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: Efficient inference engine on compressed deep neural
network,” in Proc. Int. Symp. Comput. Architecture, 2016, pp. 243–254.
[Online]. Available: http://arxiv.org/abs/1602.01528
[35] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition
engine with sparse LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp.
Field Programmable Gate Arrays, 2017, pp. 75–84.
[36] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,”
in Proc. Int. Symp. Micro-architecture, 2016, pp. 20:1–20:12.
[37] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “SCNN: An
accelerator for compressed-sparse convolutional neural networks,” in
Proc. Int. Symp. Comput. Architecture, 2017.
[38] D. Kim, J. Ahn, and S. Yoo, “A novel zero weight/activation-aware
hardware architecture of convolutional neural network,” in Proc. Design,
Automation & Test in Europe Conf. & Exhibition, 2017, pp. 1462–1467.
[39] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in
Proc. Advances in Neural Inf. Process. Syst., 1990, pp. 598–605.
[40] B. Hassibi and D. G. Stork, “Second order derivatives for network
pruning: Optimal brain surgeon,” in Proc. Advances in Neural Inf.
Process. Syst., 1993, pp. 164–171.
[41] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and
general network pruning,” in Proc. Int. Conf. Neural Networks, 1993,
pp. 293–299.
[42] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and
L. Fei-Fei, “ImageNet large scale visual recognition challenge,” Int. J. Comput.
Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.
[43] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell. (2014) Caffe: Convolutional architecture
for fast feature embedding. [Online]. Available: http://arxiv.org/abs/1408.5093
[44] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell. Caffe. [Online]. Available:
https://github.com/BVLC/caffe
[45] K. Simonyan and A. Zisserman. VGG16/19. [Online]. Available:
http://www.robots.ox.ac.uk/~vgg/research/very_deep
[46] K. He, X. Zhang, S. Ren, and J. Sun. ResNet. [Online]. Available:
https://github.com/KaimingHe/deep-residual-networks
[47] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J.
Dally, and K. Keutzer. SqueezeNet. [Online]. Available:
https://github.com/DeepScale/SqueezeNet
[48] MobileNet caffe model. [Online]. Available:
https://github.com/shicai/MobileNet-Caffe
[49] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam. (2017) MobileNets: Efficient
convolutional neural networks for mobile vision applications. [Online].
Available: http://arxiv.org/abs/1704.04861
[50] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely
efficient convolutional neural network for mobile devices,” in Proc.
Comput. Vision and Pattern Recognition, 2018. [Online]. Available:
http://arxiv.org/abs/1707.01083
[51] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very
deep neural networks. [Online]. Available:
https://github.com/yihui-he/channel-pruning
