Abstract-Convolutional neural networks have shown tremendous performance in computer vision tasks, but their excessive amount of weights and operations prevents them from being adopted in embedded environments. One of the solutions involves pruning, where some unimportant weights are forced to be zero. Many pruning schemes have been proposed, but they have focused mainly on the number of pruned weights. The previous pruning schemes hardly considered ASIC or FPGA accelerator architectures. When the pruned networks are run on the accelerators, the lack of architecture consideration causes some inefficiency problems including internal buffer mis-alignment and load imbalance. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remain for each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems. Even with the constraint, the proposed pruning scheme reached a pruning ratio similar to that of the previous unconstrained pruning schemes not only in AlexNet and VGG16 but also in the state-of-the-art very-deep networks like ResNet. Furthermore, the proposed scheme demonstrated a comparable pruning ratio in slimmed networks that were already pruned channel-wise. In addition to improving the efficiency of previous sparse accelerators, it is also shown that the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators.
I. INTRODUCTION
CONVOLUTIONAL neural networks (CNNs) are attracting interest in the fields of image recognition [1]-[5], object detection [6]-[10], and image segmentation [11]. Although they provide great performance for computer vision tasks, there are some obstacles before they can be adopted in embedded environments. A CNN usually requires an excessive amount of weight storage and arithmetic operations. For fast processing and low power consumption under these requirements, ASIC or FPGA accelerators have been proposed [12]-[18], but the amount of weights and operations is still a major concern.
The amount of weights can be reduced by network pruning [19] - [33] , where some unimportant weights are forced to be zero. Multiplication with zero is meaningless, so reduction in operation amount can be expected, too. Various pruning schemes have been proposed. Some of them prune the weights without a constraint [19] , [20] , and others prune the weights along the neural network structures. It is known that a larger part of the weights can be pruned in fully connected layers than in convolutional layers. However, pruning convolutional layers can save more energy and reach higher throughput [21] . Since the operation structure of the convolutional layers is more complex than that of the fully connected layers, the pruning of the convolutional layers is more varied [22] - [32] .
The pruned networks can be executed on some ASIC or FPGA accelerators that can exploit the weight sparsity. The EIE proposed in [34] used the sparsity of weights generated by the pruning algorithm of [20] , but the architecture only processes fully-connected layers. The EIE architecture was modified for a long short-term memory (LSTM) network to become the ESE architecture [35] . Cambricon-X is an architecture that can exploit the weight sparsity in the convolutional layers [36] . SCNN proposed in [37] exploits both the weight sparsity and activation sparsity for convolutional layers. In the ZENA proposed in [38] , operating efficiency was reached by skipping zero weights and activations.
Despite the various pruning schemes and accelerators, it is hard to efficiently utilize the performance of the ASIC or FPGA accelerators with pruned networks. The previous pruning schemes were developed without considering accelerator adoptability. They mainly focused on the amount of weights that can be pruned away. Pruning with no constraints creates irregular patterns in the remaining non-zero weights. The irregular pattern causes some inefficiency when the pruned networks are run on the sparse accelerators. The misalignment of the activation fetching and the weight fetching requires the insertion of padding zeros. The processing elements (PEs) process different numbers of weights, so some PEs need to wait for other PEs to complete. To alleviate this problem, some accelerators use complex structures. For example, Cambricon-X uses a very wide (256×16-bit width) memory and very wide (256-to-1) multiplexers (MUXs).
This paper proposes a new approach, a pruning algorithm considering the accelerator architecture. There can be various architecture consideration points; this paper focuses on the activation groups fetched and processed simultaneously in a PE and the number of remaining weights for each group. Those points are related to some critical parameters in an accelerator design including the number of multipliers and the width of the internal weight buffer. In the proposed algorithm, pruning is applied to weight groups, each of which corresponds to an activation group, and after pruning, a fixed number of weights remain in a group. Since the pruning is applied to a weight group aligned with the operating boundary of a PE, the scheme can resolve the mis-alignment problem, removing the waste of the internal buffer or multipliers. The pruning group can be adjusted to reduce the data selection and indexing logic complexity. Furthermore, since the remaining non-zero weights are distributed evenly, the load-imbalance problem can be solved naturally.
This paper is organized as follows. Section II explains CNNs, the previous pruning schemes, and the accelerator architectures. The accelerator-aware pruning is proposed in Section III, and the experimental results are shown in Section IV. Section V makes the concluding remarks.
II. CNN AND PRUNING

A. CNN
A CNN usually consists of many convolutional layers and a few fully-connected layers. Between the layers, there are activation layers like rectified linear units (ReLU), pooling layers, and batch-normalization layers. It is known that the convolutional layers occupy more than 90% of the arithmetic operations. Since the fully-connected layers require a lot of weight storage, recent CNNs have tended to have just one fully-connected layer [3] , [4] or none [5] .
The operation in a convolutional layer is as follows:

$$f_o(m, h, w) = \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} w(m, c, i, j) \cdot f_i(c, S \cdot h + i, S \cdot w + j)$$

where f_o(m, h, w) is the activation of row h and column w in output feature map m, and f_i(c, y, x) is the activation of row y and column x in input feature map c. In the equation, C is the number of input feature maps or channels, K is the spatial size of the kernel, and S is the stride size. The w(m, c, i, j)'s are the convolution weights. Fig. 1 shows the structure of the weights consisting of M filters with spatial size K×K and depth C. For ease of discussion, some axes are defined. The channel axis of the input activations is f_i(c, y, x)'s with the same y and x. Similarly, the channel axis of the weights is w(m, c, i, j)'s with the same m, i, and j. The spatial axis is f_i(c, y, x)'s with the same c or w(m, c, i, j)'s with the same m and c. The filter axis is w(m, c, i, j)'s with the same c, i, and j. Fig. 1 shows the weights along each axis.

TABLE I
PRUNING RATIO OF THE PREVIOUS PRUNING SCHEME IN [19]

| | conv1 | conv | fc1&fc2 | fc3 |
|---|---|---|---|---|
| AlexNet | 16% | 62-65% | 91% | 75% |
| VGG16 | 42% | 47-78% | 96% | 77% |

TABLE II
PREVIOUS STRUCTURED PRUNING SCHEMES

| granularity | pruned weights | references |
|---|---|---|
| channel-wise or filter-wise | w(:, n, :, :) or w(m, :, :, :) | [22]-[27] |
| shape-wise or GPU-aware | w(:, n, i, j) | [28], [29] |
| 2D/1D granularity | w(m, n, :, :), w(m, n, i, :), w(m, n, :, j) | [30] |
| SIMD aware | | [25] |
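The convolutional-layer operation of this subsection can be sketched as a direct loop nest. This is only an illustrative sketch under assumptions not stated in the paper: no padding, no bias term, and nested Python lists standing in for the feature maps. The function name `conv_layer` is hypothetical.

```python
def conv_layer(f_in, weights, stride=1):
    """Direct convolution (assumed: no padding, no bias).

    f_in[c][y][x]       : input activations, C channels
    weights[m][c][i][j] : M filters of depth C and spatial size K x K
    returns f_out[m][h][w]
    """
    C = len(f_in)
    H_in, W_in = len(f_in[0]), len(f_in[0][0])
    M = len(weights)
    K = len(weights[0][0])
    H_out = (H_in - K) // stride + 1
    W_out = (W_in - K) // stride + 1
    f_out = [[[0.0] * W_out for _ in range(H_out)] for _ in range(M)]
    for m in range(M):
        for h in range(H_out):
            for w in range(W_out):
                acc = 0.0
                # The innermost loop over c walks the channel axis; an
                # accelerator fetching along the channel axis processes
                # several consecutive c values in one cycle.
                for i in range(K):
                    for j in range(K):
                        for c in range(C):
                            acc += (weights[m][c][i][j]
                                    * f_in[c][stride * h + i][stride * w + j])
                f_out[m][h][w] = acc
    return f_out
```

Every multiplication whose weight is zero contributes nothing to `acc`, which is the redundancy that pruning and sparse accelerators try to remove.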
B. Pruning
The pruning of neural networks removes some unimportant weights or nodes, to reduce the amount of storage and operations. The early works are presented in [39] - [41] . Recently, pruning in CNNs was proposed in [19] , whose pruning results are summarized in Table I . The table shows the ratio of the pruned weights at the first convolutional layer, the other convolutional layers, the first and second fully-connected layers, and the last fully-connected layer of AlexNet and VGG16. The quantization with indirect indexing and Huffman coding were added in [20] , and a special architecture for the pruned fully-connected layers was proposed in [34] . The pruning scheme can prune many weights but shows irregularity in the pruned pattern. Moreover, the corresponding accelerator architecture can only deal with the fully-connected layers. The energy-aware pruning scheme proposed in [21] focuses on the convolutional layers since the convolutional layers consume more energy. However, the regularity of the pruning pattern was not considered in this work, either.
The regularity of the pruning has been considered in some other studies [22]-[31]. The pruning schemes of these studies can be categorized as channel-wise, filter-wise, and shape-wise pruning as shown in Table II. For example, in channel-wise pruning, all of the weights in a channel, w(:, n, :, :)'s, are either pruned together or kept together. These pruning schemes are called structured pruning whereas the pruning schemes with no constraint like [19], [20] are called unstructured pruning.
Some pruning schemes target a general-purpose graphics processing unit (GPGPU). In a GPGPU implementation, the convolution is usually transformed into a matrix multiplication. By removing columns in the transformed weight matrix, the amount of operations can be reduced [28]. A column in the weight matrix corresponds to the weights, w(:, n, i, j)'s, so this scheme can be thought of as a shape-wise pruning. A strided version was proposed in [29].
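The column removal can be illustrated with a small sketch. It assumes the weight matrix has one row per filter and one column per (n, i, j) position, and that the activations have already been unrolled into a matching patch matrix; the function names are hypothetical and the plain-list matrix product only stands in for the GPGPU kernel.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def drop_zero_columns(weight_matrix, patch_matrix):
    """Remove all-zero weight columns and the matching activation rows.

    weight_matrix : M x (C*K*K), one row per filter
    patch_matrix  : (C*K*K) x P, one column per output position
    """
    keep = [k for k in range(len(patch_matrix))
            if any(row[k] != 0 for row in weight_matrix)]
    w_small = [[row[k] for k in keep] for row in weight_matrix]
    p_small = [patch_matrix[k] for k in keep]
    return w_small, p_small
```

Because a removed column is zero for every filter, the smaller product gives the same output as the full one while performing fewer multiplications.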
C. Neural Network Accelerators
Some ASIC or FPGA accelerator architectures have been proposed for unpruned, dense CNNs, which will be called dense architectures. An accelerator usually consists of several processing elements (PEs), each of which has a single multiplier or many multipliers. In a PE with many multipliers, the multipliers may multiply multiple weights and multiple activations (MWMA), multiple weights and single activation (MWSA), or single weight and multiple activations (SWMA). The MWMA and MWSA structures are shown in Fig. 2, where N PE PEs operate in parallel. With the same naming convention, a PE with a single multiplier can be called a single weight and single activation (SWSA) structure.
If a PE has multiple multipliers, it is important to fetch the operands of the multipliers simultaneously. In this paper, a fetching group means the activations or weights that are fetched and processed simultaneously in a PE. If the size of a fetching group is N par and the number of multipliers in a PE is N mul , N par is usually equal to N mul . The weights and activations are usually fetched from the internal buffers, but the buffers are not drawn in the figure.
The PE structures can be further categorized by the axis that the weight and activation fetching groups follow. For example, DianNao [12] adopts the MWMA structure where the multiple weights and activations are fetched along the channel axis. A process example of such an architecture is shown in Fig. 3 . A part of the input activations along the channel axis are fetched with a part of the kernel weights and multiplied. The multiplication results are summed and accumulated to be one output activation.
The MWSA structure is adopted in Cnvlutin [13] to exploit the sparsity of the input activations. The multiple weights are fetched along the filter axis. The Cnvlutin architecture skips the MAC operations of zero-valued input activations. The outputs of the multiplications in the PEs are gathered into N par adder trees, each with N PE inputs. The architectures proposed in [14]-[17] adopt MWMA-structured PEs, and some of them fetch the activations and weights along the spatial axis [16], [17].
D. Sparse Accelerator Architectures
The architectures in Fig. 2 can be modified to exploit the weight sparsity, which will be called sparse architectures. In such architectures, only non-zero weights are stored in the weight memories to save weight storage. An example of the sparse MWMA structure is shown in Fig. 4, where N par activations are fetched simultaneously. N par is usually larger than the number of multipliers, N mul, since some weights are zero and multiplications with zero weights are meaningless. Multipliers receive non-zero weights and their corresponding activations. Selecting the N mul activations from the N par fetched ones requires N mul · w MUXs, each N par-to-1, where w is the bit-width of the activations. To select a proper activation, a non-zero weight is stored with an index, which is drawn as a grey rectangle in Fig. 4. A process example of the sparse MWMA PE structure is illustrated in Fig. 5, where non-zero weights are marked with a grey color. In the figure, the upper MUX selects the second activation from the front, and the lower one selects the fifth one.
As an example of such sparse architectures, Cambricon-X consists of PEs of the sparse MWMA structure with N par = 256 and N mul = 16. In the architecture, 256 activations are fetched simultaneously, and 16 activations are selected from them. For the selection, there are 16·w 256-to-1 MUXs for each PE, and the MUXs are gathered into a block called IM to reduce the number of wires between blocks.
In those sparse architectures, fewer weights are usually fetched than activations because only non-zero weights are necessary. For ease of discussion, however, the weight fetching group is defined to include all the zero and non-zero weights corresponding to an activation fetching group. If the number of non-zero weights in a weight fetching group is larger than N mul, the weights are fetched and processed through several cycles with the same activation group. In Fig. 5, one more cycle is required to process the third non-zero weight. If the number of non-zero weights for a weight fetching group is N non−zero, it takes ⌈N non−zero / N mul⌉ cycles for a PE to process the activation and weight fetching group.
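The behavior of a sparse MWMA PE on one fetching group can be modeled with a short sketch. It is a functional model under assumptions, not a description of any particular accelerator: each non-zero weight is stored as a hypothetical (index, weight) pair, where the index plays the role of the MUX select signal, and at most `n_mul` pairs are consumed per cycle.

```python
def sparse_mwma_pe(activations, nonzero_weights, n_mul):
    """Process one activation fetching group in a modeled sparse MWMA PE.

    activations     : the N par activations fetched together
    nonzero_weights : list of (index, weight); index is the position of
                      the weight inside the weight fetching group
    n_mul           : number of multipliers in the PE
    returns (partial_sum, cycles)
    """
    partial_sum = 0.0
    cycles = 0
    for start in range(0, len(nonzero_weights), n_mul):
        # One cycle: up to n_mul multipliers work in parallel.
        for index, weight in nonzero_weights[start:start + n_mul]:
            # The N par-to-1 MUX selects activations[index].
            partial_sum += weight * activations[index]
        cycles += 1
    return partial_sum, cycles
```

With eight activations, two multipliers, and three non-zero weights, as in the situation of Fig. 5, the model needs two cycles, and one multiplier is idle in the second cycle.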
Other PE structures have been used in some sparse architectures, too. EIE [34] and ESE [35] are sparse architectures for sparse fully-connected layers. To handle the irregular pattern of sparse weights, they use SWSA-structured PEs. ZENA [38] also adopts SWSA-structured PEs and skips both zero-valued activations and weights. SCNN [37] has multiple multipliers in a PE, but exploits a special structure of Cartesian products. In the structure, all of the fetched non-zero activations are multiplied with all of the fetched non-zero weights. The multiplication results are delivered to proper output accumulators.
E. Previous Pruning Scheme and Accelerator Architecture
The previous unstructured pruning schemes like [19] can reach a high pruning ratio, but they rarely result in pruned networks that fit sparse accelerator architectures well. The main reason is the non-uniform distribution of the non-zero weights left after pruning, especially the number of non-zero weights, N non−zero, for each weight fetching group.
This non-uniform distribution can cause a mis-alignment between the activations and weights required for multiplications. In an accelerator with MWMA PEs, the multiplier operands should be fetched simultaneously. When N par activations and N mul non-zero weights are fetched from the internal buffers, it is not guaranteed that every fetched weight can find its counterpart in the fetched activations. Another activation fetching group may have to be fetched for the process of the weights.
To solve this mis-alignment problem, Cambricon-X introduces the padding-zero scheme, where padding zeros are inserted so that the number of non-zero weights and padding zeros for each activation fetching group is always a multiple of N mul, which is 16. The padding zeros, however, not only waste the internal weight buffer but also cause another inefficiency. The number of total padding zeros would be smaller with a larger N par. This seems to be one of the reasons that Cambricon-X uses a very large N par compared to N mul. For a usual pruning ratio of 75% in convolutional layers [19], N par = 64 may be enough for N mul = 16, but Cambricon-X uses N par = 256. Due to the large N par, the activation selection part, the IM block, has very wide, 256-to-1, MUXs. Since the number of the MUXs is also large, 16·w 256-to-1 MUXs per PE, these wide MUXs cause a large IM block area, which occupies more than 30% of the total chip area. The large N par also requires a very wide (256·w bit-width) internal activation buffer. Such a wide memory usually induces a larger area than a square-shaped memory.
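The cost of the padding-zero scheme can be estimated with a small sketch. It assumes a simplified model in which each weight fetching group is padded independently up to the next multiple of N mul and a group without non-zero weights needs no padding; the function name is hypothetical.

```python
def padding_zeros(nonzero_counts, n_mul):
    """Total padding zeros needed so that every group stores a multiple
    of n_mul weights (simplified model, one entry per fetching group)."""
    return sum((-n) % n_mul for n in nonzero_counts)
```

For example, four groups holding 3, 16, 20, and 0 non-zero weights with N mul = 16 need 13 + 0 + 12 + 0 = 25 padding zeros, whereas four groups holding exactly 16 non-zero weights each need none.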
Furthermore, the load balance between PEs is also a problem. A PE processes an activation fetching group for ⌈N non−zero / N mul⌉ cycles. Because of the diversity of N non−zero, the number of cycles will vary, too. This may cause a problem in some architectures like DianNao and Cambricon-X, which share the fetched activations between PEs to reduce the required bandwidth between the internal buffers and the external memories. If some PEs complete the process of the fetched activations early, the PEs need to wait for the other PEs to finish. The load-imbalance problem is also a major concern in accelerators with a weight serial structure like the SWMA and SWSA structures, where the zero weights are skipped [35], [38].
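The waiting caused by load imbalance can be quantified with a short sketch. It assumes that all PEs share one activation fetching group and that the next group can only be fetched after the slowest PE finishes; the function name and the returned idle-cycle list are illustrative.

```python
import math

def shared_fetch_cycles(nonzero_per_pe, n_mul):
    """Cycles spent on one shared activation fetching group.

    nonzero_per_pe : N non-zero of the weight fetching group in each PE
    returns (cycles_for_group, idle_cycles_per_pe)
    """
    per_pe = [math.ceil(n / n_mul) for n in nonzero_per_pe]
    total = max(per_pe)
    return total, [total - c for c in per_pe]
```

With N mul = 2 and PEs holding 3, 1, 4, and 2 non-zero weights, the group takes two cycles and the second and fourth PEs are idle for one cycle each. With a uniform N non−zero, the idle list is all zeros.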
There are few studies on pruning that consider the divergent distribution of the remaining non-zero weights. The reference [33] considers the distribution of non-zero weights, but deals with only fully-connected layers. Furthermore, the target was to reduce the width of ternary weight coding, so the accelerator architecture was not considered. A load-balance-aware pruning was proposed in [35], which prunes the weights so that the total number of non-zero weights processed in a PE is balanced with that in another PE. However, the pruning scheme was only proposed for fully-connected layers, and the accelerator architecture discussion is insufficient, too. In contrast, the proposed scheme, which will be presented in the next section, is closely related to accelerator architectures, resolving all of the mentioned inefficiency problems. Furthermore, my scheme focuses on both the convolutional layers and the fully-connected layers.
III. ACCELERATOR-AWARE PRUNING
In this section, I will propose an accelerator-aware pruning algorithm, which will generate a more regular non-zero weight pattern that fits accelerator architectures well. There can be various architecture consideration points, and this paper will concentrate on two parameters, the activation and weight fetching group and the number of non-zero weights left for the weight group. The two parameters are closely related to accelerator architectures. The size of the activation and weight fetching group determines the internal buffer widths, and the number of non-zero weights is associated with the required number of multipliers and the processing cycles. The previous pruning schemes do not consider these points, creating irregular distribution of the non-zero weights and the problems mentioned in the previous section. I will discuss the pruning for the architectures mentioned in the previous section, but the algorithm is not limited to the architectures.
A. Proposed Accelerator-Aware Pruning Scheme
In the proposed pruning scheme, the weights are pruned within the weight fetching groups so that the number of remaining non-zero weights, N non−zero , is uniform for all of the weight fetching groups. The accelerator in Fig. 5 , for example, simultaneously fetches and processes an activation fetching group consisting of eight activations along the channel axis. The N non−zero for the corresponding weight fetching group is three in the figure. However, there is no guarantee about N non−zero in the previous pruning schemes. N non−zero can be two or four in another weight fetching group. Even zero or the size of a weight fetching group is possible. In contrast, the proposed scheme leaves a fixed number of non-zero weights for all the weight fetching groups as shown in Fig. 6 . In the figure, every weight fetching group has six weights pruned away and two non-zero weights left (N non−zero = 2), which are marked by white and grey colors, respectively.
The result of the proposed pruning scheme can be applied to the previous sparse accelerators, resolving the inefficiency problems mentioned in the previous section. The misalignment problem in the sparse MWMA and MWSA structures will be solved if the number of remaining non-zero weights per group is set to be a multiple of N mul . This alignment would make the padding zeros obsolete in Cambricon-X. Furthermore, the proposed scheme can solve the load-imbalance problem, one of the main concerns in the weight serial structures, too. Every weight fetching group has the same number of remaining nonzero weights, so the number of weights for a PE to process is naturally balanced if every PE processes the same number of weight fetching groups. Every sparse architecture mentioned in the previous section can benefit from the proposed pruning scheme.
B. Accelerator Complexity Reduction
In addition to improving efficiency in the previous sparse architectures, the proposed scheme can also be used to reduce accelerator complexity, especially the indexing and activation selection logic. To deal with the irregular distribution of nonzero weights by the previous pruning schemes, Cambricon-X has very wide (256·w-bit width) activation buffers and very wide (256-to-1) MUXs in the activation selection logic, as mentioned in Subsection II-E. The input width of the MUXs can be narrowed by the proposed pruning scheme, simplifying the activation selection logic. First, every weight fetching group is divided evenly into m sub-groups, which will be called pruning groups. The size of a pruning group, g, will be N par /m. Then the pruning is performed so that each pruning group has a constant number of non-zero weights. Since a weight corresponds to one of g activations, the width of a MUX for the activation selection can be reduced to g. With the smaller g, the PE structure in Fig. 4 can be simplified into the structure in Fig. 7 . As can be seen in the figure, the activation selection logic becomes simplified with narrower MUXs. When g becomes smaller, however, the network performance can degrade more with the same pruning ratio. The experiment shows that g = 16 with 75% pruning ratio does not deteriorate the CNN performance.
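The effect of pruning groups on the activation selection can be sketched as follows. The model assumes that the non-zero weights of each pruning group are stored as hypothetical (local index, weight) pairs, so that every selection is made among only g activations; it is a functional sketch and does not reproduce the exact structure of Fig. 7.

```python
def pruning_group_size(n_par, m):
    """Size g of a pruning group when a fetching group is split into m parts."""
    assert n_par % m == 0, "the fetching group must divide evenly"
    return n_par // m

def grouped_sparse_dot(activations, packed, g):
    """Multiply-accumulate one fetching group split into pruning groups.

    activations : the N par activations fetched together
    packed[q]   : list of (local_index, weight) for pruning group q,
                  with 0 <= local_index < g
    """
    total = 0.0
    for q, entries in enumerate(packed):
        base = q * g
        for local_index, weight in entries:
            assert 0 <= local_index < g
            # A g-to-1 MUX is enough: the candidate activations are
            # activations[base] ... activations[base + g - 1].
            total += weight * activations[base + local_index]
    return total
```

For N par = 64 and m = 4, the pruning group size is g = 16, so each MUX in this model has 16 inputs instead of 64.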
The proposed pruning scheme induces the reduction of N par , too. As mentioned in Subsection II-E, N par = 256 of Cambricon-X is large compared to N mul = 16. With the common pruning ratio of 75% in convolutional layers, N par = 64 is enough. The large N par may be chosen to reduce the amount of padding zeros. However, the proposed scheme makes the padding zeros unnecessary, so N par = 64 can be used. Then the width of the activation buffers can be reduced to 64, enabling more square-like memories to be used, which are more area-efficient than wide memories.
Furthermore, the indexing logic can be simplified, too. For indexing of the irregular non-zero weights, Deep Compression and EIE use relative indexing [20], [34], where the number of zero weights between two adjacent non-zero weights is stored. An interval is encoded with 4 bits, and an interval larger than the encoding bound requires filler zero insertions. A similar indexing is used in Cambricon-X, called step indexing. The indexing can be simplified with the proposed pruning scheme. In the proposed scheme, the pruning is performed within a pruning group of g. The small g enables direct indexing, where an index indicates the position of the non-zero weight within the pruning group. Since g is small, the indexing bit width, w i, is also small, log2 g bits, even with the direct indexing. The direct indexing is much simpler than the relative indexing in EIE or the step indexing in Cambricon-X and does not require the filler zeros, removing the waste of the weight storage.
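The two indexing styles can be contrasted with a minimal sketch. The relative encoder follows one possible convention, assumed here for illustration: each stored entry records the number of zeros that precede it, and a filler zero is stored whenever that count would exceed the 4-bit bound. The exact conventions of [20], [34], and Cambricon-X may differ, and all function names are hypothetical.

```python
def relative_index_encode(weights, bits=4):
    """Encode as (zeros_before, value) entries; insert filler zeros
    when a run of zeros exceeds the encoding bound (assumed convention)."""
    max_gap = (1 << bits) - 1
    entries, gap = [], 0
    for w in weights:
        if w != 0:
            entries.append((gap, w))
            gap = 0
        elif gap == max_gap:
            entries.append((gap, 0))  # filler zero: occupies weight storage
            gap = 0
        else:
            gap += 1
    return entries

def direct_index_encode(pruning_group):
    """Encode as (position_in_group, value) entries; no fillers needed."""
    return [(i, w) for i, w in enumerate(pruning_group) if w != 0]

def direct_index_bits(g):
    """Bits needed to address one of g positions."""
    return (g - 1).bit_length()
```

A run of 20 zeros followed by one non-zero weight costs two stored entries with the 4-bit relative encoding, one of them a filler. With direct indexing in a pruning group of g = 16, every stored entry is a real weight and its index needs 4 bits.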
C. Pruning Ratio and Pruning Weight Selection
The proposed pruning scheme can set the pruning ratio target directly, as opposed to the previous schemes. In [19], weights with a magnitude less than a threshold are pruned away. With this scheme, the pruning ratio cannot be predicted easily. The pruning ratio is determined after pruning. If the target ratio is not reached, the pruning is tried again with an adjusted threshold. The process is iterated until the target pruning ratio is reached. In the proposed pruning, however, the target is the number of pruned weights per pruning group. In other words, the pruning ratio is targeted directly. If the number of weights pruned away in a pruning group is p, the pruning ratio will be p/g.
Various methods can be used to select the weights to be pruned. The simplest one is to select the weights by weight magnitude. In a more complex method, the effect of weights is estimated and the weights with a smaller effect are removed earlier. The proposed pruning scheme can be used with any selection method, but, in this paper, the simplest method will be used. The weights with the least magnitude are pruned first in a pruning group.
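Magnitude-based selection within pruning groups can be sketched as below. For illustration, the weights along the grouping axis (for example, the channel axis) are assumed to be laid out as a flat list whose length is a multiple of g; the function name is hypothetical, and ties are broken toward the lower index, which is a choice of this sketch only.

```python
def accelerator_aware_prune(weights, g, p):
    """Zero the p smallest-magnitude weights in every pruning group.

    weights : weights along the grouping axis, length a multiple of g
    g       : pruning group size
    p       : number of weights pruned per group (pruning ratio p / g)
    """
    assert len(weights) % g == 0 and 0 <= p <= g
    pruned = list(weights)
    for start in range(0, len(pruned), g):
        # Rank the positions of this group by weight magnitude.
        order = sorted(range(start, start + g), key=lambda i: abs(pruned[i]))
        for i in order[:p]:
            pruned[i] = 0.0
    return pruned
```

Every group ends with exactly g − p stored weights, so the pruning ratio is fixed to p/g before pruning starts rather than observed afterwards.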
D. Incremental Pruning
In the proposed scheme, p weights are pruned in a pruning group. The pruning can be processed in a few ways. In an extreme, the target number of weights with the least magnitude in each group are pruned at the same time, and the pruned network is retrained. This scheme will be called one-time pruning. In the other extreme, only one weight with the least magnitude in each group is pruned at first. After a period of retraining, one more weight with the least retrained magnitude is selected and pruned. The retraining is performed again. The one weight pruning and retraining are iterated until the target number is reached. In the middle of the two extreme methods, we can set the initial pruning number and the increment number. This scheme will be called the incremental pruning in this paper.
Obviously, the incremental pruning would be better than or equal to the one-time pruning method. However, the incremental pruning requires a long retraining time, so, in this paper, it will be applied when one-time pruning is not sufficient. 
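The pruning schedules of this subsection can be sketched with one loop. The `retrain` argument is a hypothetical callback standing in for a full retraining run; the sketch assumes that pruned weights are masked so that they stay at zero during retraining, and it returns the sequence of p values used.

```python
def prune_groups(weights, g, p):
    """Zero the p smallest-magnitude weights in every group of g weights."""
    assert len(weights) % g == 0
    out = list(weights)
    for start in range(0, len(out), g):
        order = sorted(range(start, start + g), key=lambda i: abs(out[i]))
        for i in order[:p]:
            out[i] = 0.0
    return out

def incremental_prune(weights, g, p_init, p_target, p_step, retrain):
    """Prune p_init weights per group, then raise p by p_step until p_target.

    One-time pruning is the special case p_init == p_target.
    """
    schedule = []
    p = p_init
    while True:
        weights = prune_groups(weights, g, p)
        mask = [w != 0 for w in weights]
        retrained = retrain(weights)
        # Assumed masking: pruned weights are kept at zero after retraining.
        weights = [w if keep else 0.0 for w, keep in zip(retrained, mask)]
        schedule.append(p)
        if p >= p_target:
            return weights, schedule
        p = min(p + p_step, p_target)
```

Since already pruned weights have zero magnitude, they are selected again in the next round, so each increment only removes additional weights, chosen by their retrained magnitude.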
E. Pruning for Various Architectures
In the proposed pruning scheme, the pruning can be applied for various accelerator architectures. In the previous subsections, pruning along the channel axis is shown for architectures like Cambricon-X. As an example of pruning for another accelerator architecture, the proposed pruning scheme can be applied for a weight-sparse version of Cnvlutin. Cnvlutin adopts the MWSA structure, and the weights are fetched along the filter axis. In the architecture, the dense MWSA structure can be modified to the sparse MWSA structure shown in Fig. 8 . For the sparse architecture, the proposed pruning can be applied with pruning groups set to the weights along the filter axis. Since Cnvlutin already exploits activation sparsity, the structure in Fig. 8 can exploit both activation sparsity and weight sparsity.
The architectures in [16] , [17] adopt sparse MWMA structure PEs, where the activations and weights are fetched and processed along the spatial axis. For those architectures, the proposed pruning scheme can be applied with pruning groups established along the spatial axis.
F. Application to Fully-Connected Layers
The proposed pruning scheme can be applied to fully-connected layers, too. In a fully-connected layer, the weights can be arranged in a matrix format. We can define two axes: the column axis and the row axis. Along the row axis, the weights are multiplied with different activations, and along the column axis, the weights are multiplied with the same activation. An MWMA-structured PE processes the weights along the row axis, and an MWSA-structured PE processes the weights along the column axis.
With the consideration of the PE structures, the accelerator-aware pruning can be applied to fully-connected layers. If an MWMA-structured PE is used, the weights are grouped along the row axis and pruned so that each group has a fixed number of non-zero weights left. If an MWSA-structured PE is used, the weights are grouped and pruned along the column axis. An example of the proposed pruning along the row axis is illustrated in Fig. 9. In the figure, N par = g = 8 and p = 6.
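The row-axis and column-axis variants for a fully-connected layer can be sketched as follows. The weight matrix is assumed to be a nested list with one row per output, the selection is by magnitude as in the convolutional case, and the function names and the `axis` argument are hypothetical.

```python
def prune_vector(vector, g, p):
    """Zero the p smallest-magnitude weights in every group of g weights."""
    assert len(vector) % g == 0
    out = list(vector)
    for start in range(0, len(out), g):
        order = sorted(range(start, start + g), key=lambda i: abs(out[i]))
        for i in order[:p]:
            out[i] = 0.0
    return out

def prune_fc(matrix, g, p, axis="row"):
    """Group and prune a weight matrix along the row or the column axis."""
    if axis == "row":  # groups as fetched by an MWMA-structured PE
        return [prune_vector(row, g, p) for row in matrix]
    # Column axis, for an MWSA-structured PE: prune the transposed matrix.
    columns = [prune_vector(list(col), g, p) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]
```

With g = 8 and p = 6, as in the setting of Fig. 9, every group of eight weights in a row keeps its two largest-magnitude weights.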
IV. EXPERIMENTAL RESULTS
To show that the proposed pruning scheme can preserve the performance of CNNs well even with the constraint, the top-5 accuracy for the ImageNet 2012 validation data set [42] was measured. The retraining was performed by Caffe [43] in one of three modes. In Retraining 1, after one-time pruning, the retraining was performed with the learning rate of 5 · 10^-4 for 12 epochs. If the original accuracy was not recovered, 8-epoch retraining was performed additionally with the learning rate of 10^-4. If Retraining 1 was not enough, Retraining 2 was applied, where the learning rate begins with 5 · 10^-4 and decreases to 10^-4, 10^-5, and 10^-6 when the validation accuracy is saturated. In Retraining 3, Retraining 2 is applied with incremental pruning. When the validation accuracy is saturated with the learning rate of 10^-6, p is increased, and the retraining resumes with the learning rate of 5 · 10^-4. The retraining mini-batch size was set to 256. The network models were obtained publicly [44]-[47]. The pruning of the convolutional layers will be discussed first because convolutional layers occupy most of the computations. Then my pruning scheme was applied to the fully-connected layers with the pruned convolutional layers. It will be shown that the proposed pruning can be applied to networks that are already slimmed by the previous channel pruning schemes, too. In the last subsection, the accelerator complexity reduction will be analyzed.
A. Convolutional Layer Pruning
Table III shows the accuracy results right after the pruning of the convolutional layers. The weights are grouped along the channel axis, and the first convolutional layer is not pruned since it has a much smaller number of weights and operations than the other layers. The table shows that 37.5% (3/8 or 6/16) pruning already begins to degrade the accuracy. However, the degradation can be recovered with retraining as shown in Table IV. Retraining 1 and 3 are applied, and the results of Retraining 3 are marked by asterisks.
In the table, up to 75% (6/8 or 12/16) pruning, the validation accuracy could be recovered to the baseline accuracy with Retraining 1 in the very-deep networks including VGG16, ResNet-50, and ResNet-152. The result of VGG16 matches that of the unstructured pruning algorithm in [19], where the pruning ratios of the convolutional layers are roughly 75% or less in VGG16. The experiment shows that a similar pruning ratio can be reached with the proposed pruning scheme considering the accelerator constraint. It can also be seen that a 75% pruning ratio does not degrade the accuracy in the residual networks, ResNet-50 and ResNet-152, either. The networks have more complicated structures like the residual path and the 1×1 convolution. The results show that the proposed scheme can be applied to recent state-of-the-art CNNs.
In the relatively shallow networks like AlexNet and SqueezeNet, it was harder to recover the accuracy degradation. With the more advanced effort of Retraining 3, however, the original accuracy can be recovered with pruning ratios comparable to those in [19] and [5] . In AlexNet, Retraining 3 begins with (p, g) = (5, 8) or (10, 16) , and p is increased by one. While the pruning ratio of the AlexNet convolutional layers was around 65% with the unstructured pruning in [19] , the proposed scheme can reach a pruning ratio of 81.25%. In SqueezeNet, p = g/2 weights are pruned initially, and p is increased by one, too. In the network, the unstructured pruning can reach a 66.3% pruning ratio [5] , which is similar to p = 10 or 11 with g = 16 in the proposed scheme. With this pruning ratio, the accuracy degradation of the proposed pruning scheme is around 0.005.
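As a worked check of the ratios quoted above, the pruning ratio of a pruning group is p/g. The pairing of 81.25% with (p, g) = (13, 16) is inferred here from the arithmetic and is not stated explicitly in the text.

```latex
\begin{align*}
\text{AlexNet, initial: } & \frac{5}{8} = \frac{10}{16} = 62.5\% \\
\text{AlexNet, final: }   & \frac{13}{16} = 81.25\% \\
\text{SqueezeNet, initial: } & \frac{g/2}{g} = 50\% \\
\text{SqueezeNet, compared: } & \frac{10}{16} = 62.5\%, \qquad \frac{11}{16} = 68.75\%
\end{align*}
```

The 66.3% ratio of the unstructured pruning in [5] lies between the two SqueezeNet values, which is why p = 10 or 11 is used for the comparison.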
B. Fully-Connected Layer Pruning
After the pruning of the convolutional layers, the fully connected layers were pruned along the row axis. In the fully connected layer pruning, an attempt was made to prune more weights than in the convolutional layers, since a larger fraction of the weights can usually be pruned in fully connected layers [19] . The size of a pruning group, g, was set to the same value for the convolutional layers and the fully connected layers.
The retrained accuracy is shown in Table V . In the table, p1, p2, and p3 are the numbers of pruned weights in the convolutional layers, the first and second fully connected layers, and the last fully connected layer, respectively. ResNet-50 and ResNet-152 have only one fully connected layer, so p2 does not apply to those networks. Retraining 1 was applied to all the networks except AlexNet, to which Retraining 3 was applied. In Retraining 3, p2 was increased in steps of one, starting from 12.
The table shows that the proposed pruning scheme can reach a pruning ratio similar to that of the previous pruning scheme in the fully connected layers, too. In [19] , 90-96% of the weights were pruned in the first and second fully connected layers of AlexNet and VGG16, and 75-77% of the weights in the last layer. In all of the presented networks, the proposed pruning scheme could prune 75% of the weights (p3 = 12) in the last fully connected layer. For the first and second fully connected layers of VGG16, pruning with p2 = 15 (93.75% pruning) did not degrade the accuracy. In AlexNet, p2 = 15 gave an accuracy of 0.79470, which is slightly worse than the accuracy of the pruned AlexNet in [20] , 0.7968. Since p1 = 12 (75% pruning) in the convolutional layers exceeds the 62-65% of [20] , the pruning result is quite comparable.
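The percentages quoted in this section follow directly from p and g. The short sketch below, with the hypothetical helper name `pruning_ratio`, converts a few of the (p, g) settings mentioned in the text into pruning ratios; it assumes every group contains exactly g weights.

```python
from fractions import Fraction

def pruning_ratio(p, g):
    """Fraction of weights removed when p of every g weights are pruned."""
    if not 0 <= p <= g:
        raise ValueError("p must lie between 0 and g")
    return Fraction(p, g)

# (p, g) settings taken from the text.
for p, g in [(6, 8), (12, 16), (13, 16), (15, 16)]:
    print(f"p = {p}, g = {g}: {float(pruning_ratio(p, g)):.2%}")
```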
C. Slimmed Network Pruning
Some of the previous works pruned convolutional layers channel-wise or filter-wise [22] - [24] . The resulting networks are slimmer, with fewer channels in the layers. The slimmed networks can also be pruned by the proposed pruning scheme. For the experiment, the networks slimmed in [24] were used because their models are publicly available [48] . In [24] , the number of weights in the networks was also reduced by other methods, such as decomposition. The proposed scheme was applied with g = 16. In some layers of the slimmed networks, C is not a multiple of g. In that case, zero-valued weights are assumed to be appended so that C becomes a multiple of g.
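The zero-padding assumption for layers where C is not a multiple of g can be sketched as follows. The (K, C, R, S) weight layout, the helper name `pad_channels`, and the example channel count of 24 are assumptions made for illustration only.

```python
import numpy as np

def pad_channels(weights, g):
    """Append zero-valued channels so that C becomes a multiple of g.

    weights: array of shape (K, C, R, S) (assumed layout).
    """
    C = weights.shape[1]
    pad = (-C) % g  # number of channels missing to reach the next multiple
    # np.pad fills with zeros by default (mode="constant").
    return np.pad(weights, ((0, 0), (0, pad), (0, 0), (0, 0)))

# Usage with a hypothetical slimmed layer of C = 24 channels and g = 16.
w = np.ones((2, 24, 3, 3))
w_padded = pad_channels(w, g=16)
print(w_padded.shape)
```

After padding, every group along the channel axis contains exactly g weights, so the same per-group pruning can be applied to all layers.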
The accuracy results are shown in Table VI . The table shows that the proposed pruning scheme prunes weights effectively even in the already slimmed networks. The 50% pruning of the convolutional layers (p = 8) does not degrade the accuracy of the slimmed networks. This shows that the proposed pruning scheme achieves a comparable pruning ratio for the slimmed networks. It is also notable that 75% pruning (p = 12) does not degrade the accuracy of VGG16-5X and ResNet-50 CP.
D. Accelerator Logic Complexity
As mentioned in Subsection III-B, the proposed pruning scheme can reduce accelerator complexity. In this subsection, the area reduction of Cambricon-X is examined. The second row in Table VII presents the area, delay time, and power consumption of Cambricon-X. In the table, NBin, NBout, and SB are the input activation buffer, the output activation buffer, and the weight buffer, respectively. IM is the indexing and activation selection unit.
Since the RTL code of Cambricon-X is not publicly available, the blocks in the table were reimplemented for the area comparison. The other blocks in Cambricon-X are not affected by the proposed scheme. Hardware parameters were inferred from the description in the Cambricon-X paper. The synthesis results of the reimplemented version are presented in the third row. The synthesis was performed with Synopsys Design Compiler and a GlobalFoundries 65 nm process library. Since the results of Cambricon-X were obtained after the placement and routing (P&R) process with a different library, it is hard to compare the values of the second and third rows directly. Therefore, the expected reduction effect on Cambricon-X is estimated from the reduction in the reimplemented version, which is shown in the following rows. Measurement and implementation conditions are given in parentheses in the first column. N_par is 256 for Cambricon-X and 64 for the other cases. N_mul and N_PE are 16.
The fourth to sixth rows present the synthesis results of the reduced implementation, assuming the application of the proposed pruning scheme with various g values. Compared to the results of the reimplemented Cambricon-X in the third row, the area is reduced substantially, especially that of NBin, NBout, and IM. The area of NBin and NBout is reduced by narrowing the memory width from 256×16 to 64×16. Although the memory capacity is not changed, the more square-shaped memory configuration results in a smaller area. The area reduction of the IM block is even more pronounced, with a decrease of 82-93%. The wide 256-to-1 MUXs in the IM block of Cambricon-X are replaced with narrow g-to-1 MUXs. Since the area of MUX logic is proportional to the number of inputs, the narrow MUXs lead to a smaller IM block. With the smaller area, the table also shows lower power consumption. The delay increased slightly, but the difference is negligible. The last row shows the estimated results for Cambricon-X when the simplification proposed in Subsection III-B is applied. The values were obtained by comparing the third and fifth rows. Due to the area reduction of NBin, NBout, and IM, the total area can be reduced by 40%.
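The effect of narrowing the selection MUXs can be illustrated with a first-order count of 2-to-1 MUX cells. The sketch assumes that each n-to-1 MUX is built as a tree of 2-to-1 cells, which needs n − 1 cells per data bit; the g values in the loop are illustrative and are not taken from Table VII. The model covers only the selection MUXs and not the rest of the IM logic, so it is not expected to reproduce the synthesized figures in Table VII.

```python
def mux2_cells(n_inputs):
    """2-to-1 MUX cells in a tree-structured n-to-1 MUX (per data bit)."""
    return n_inputs - 1

def mux_reduction(g, n_wide=256):
    """Fractional cell reduction when an n_wide-to-1 MUX becomes g-to-1."""
    return 1.0 - mux2_cells(g) / mux2_cells(n_wide)

# Illustrative group sizes (assumed, not the ones used in Table VII).
for g in (4, 8, 16):
    print(g, round(mux_reduction(g), 3))
```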
V. CONCLUSIONS
In this paper, an accelerator-aware pruning scheme was proposed for CNNs. In the pruning process, the proposed scheme considers accelerator parameters such as the width of the internal activation buffer and the number of multipliers. After pruning, each weight fetching group has a fixed number of non-zero weights left. The network pruned by the proposed scheme can be run efficiently on a target accelerator. Furthermore, the proposed pruning scheme can be used to reduce accelerator complexity. Even with the accelerator constraint, it was shown that the proposed scheme can reach a pruning ratio close to that of the previous unstructured pruning schemes. In this paper, the pruning scheme was discussed in relation to some representative sparse accelerator architectures, but the scheme can be applied to any sparse architecture.
