Deep Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance in a wide range of applications. However, complex Artificial Intelligence (AI) tasks increasingly require deeper CNN models, which are computationally expensive. Although network compression techniques such as pruning have emerged as a promising way to reduce the computational burden, existing accelerators cannot fully exploit the resulting sparsity because of the irregularity that pruning introduces. Meanwhile, Field-Programmable Gate Arrays (FPGAs) are regarded as a promising hardware platform for CNN inference acceleration, but most existing FPGA accelerators target dense CNNs and cannot address the irregularity problem. In this paper, we propose a sparse-wise dataflow that skips the cycles of Multiply-and-Accumulate (MAC) operations with zero weights and exploits data statistics to minimize energy through zero gating, which avoids unnecessary computations. The proposed sparse-wise dataflow leads to a low bandwidth requirement and a high degree of data sharing. We then design an FPGA accelerator containing a Vector Generator Module (VGM) that matches the indexes between sparse weights and input activations according to the proposed dataflow. Experimental results demonstrate that our implementation achieves 987 images/s and 48 images/s for AlexNet and VGG-16 on the Xilinx ZCU102, respectively, which provides 1.5× to 6.7× speedup and 2.0× to 6.2× higher energy efficiency over previous CNN FPGA accelerators.
I. INTRODUCTION
The remarkable performance improvements achieved in various domains [1]-[3] by Convolutional Neural Networks (CNNs) such as AlexNet [4] and VGG-16 [5] come at a computational and data cost that challenges both on-chip storage and off-chip bandwidth in accelerator architecture design. Even though most operations in CNN training and inference can be converted to matrix multiplications and accelerated with modern Graphics Processing Units (GPUs), deploying CNNs on GPUs suffers from high power and area consumption. Customized accelerators are regarded as a promising alternative because they can be tailored more flexibly to performance requirements and energy constraints [6]-[17].
Recently, CNN pruning has proved to be an effective way to reduce the computation and memory requirements of these models [18], [19]. For example, Han et al. [18] pointed out that pruning can reduce the amount of data by more than 10× with negligible accuracy loss. In addition, weight encoding, including quantization and entropy coding, has been proposed to further reduce the bitwidth of each weight, e.g., 4 bits per weight for AlexNet [19]. Unstructured pruning techniques like Deep Compression [19] suffer from imbalanced load and high irregularity. Therefore, structured pruning techniques [20]-[23] were proposed, which are more hardware friendly at the cost of a slightly lower compression ratio.
However, the irregularity caused by sparsity prevents accelerators from fully leveraging the computation and data reduction. Existing FPGA architectures for dense models are not efficient for sparse CNN models: because many weights are pruned, most multiplication operations involve zero operands, which leads to low hardware efficiency [8], [24]-[28]. Sparse architectures on Field-Programmable Gate Arrays (FPGAs) have been investigated in recent years [29], [30]. The design in [29] targets Fully-Connected (FC) layers, which use matrix-vector multiplication. However, the major operators in CNNs are convolutions. Although spatial convolution can be converted to matrix-vector multiplication, this leads to a large memory footprint since the input feature map has to be copied multiple times when it is flattened to a vector. The design in [30] proposes a dataflow that uses element-matrix multiplication as the key operation. However, it has low computation efficiency due to the imbalanced load of each Processing Engine (PE). Since this accelerator requires a large number of Look-Up Tables (LUTs) to buffer the input activations for nonzero weights, its performance is bounded by the number of LUTs on the FPGA, which leads to inefficient resource utilization.
To design an efficient FPGA accelerator for sparse CNN models, the following challenges have to be tackled:
First, each output activation connects to several input activations through the sliding window (kernel) in dense convolutional (CONV) layers, and this connection becomes irregular after pruning. Meanwhile, the sparse weights are encoded in a compressed format, which requires extra coordinate computation to reconstruct the connection or to locate the output. It is therefore challenging to design a dataflow that addresses the irregularity while efficiently leveraging the data and computation reduction and maintaining the high parallelism of the FPGA.
Fig. 1. Illustration of shape-wise structured pruning. The F kernels are split into multiple groups, and kernels in the same group are pruned at the same locations.
Second, FPGAs can only provide limited on-chip memory and off-chip bandwidth. Although sparse CNNs have significant data reduction, it is difficult to save all the weights into on-chip memory for complex CNNs like VGG-16. In addition, different CNN models have different sizes, which results in high variability in the number of operations. A rigid accelerator architecture for CNNs may not fully utilize the FPGA's limited resources for every CNN model.
To address both challenges, we focus only on structured pruning to reduce the irregularity of sparse weights. On this basis, a sparse-wise dataflow is proposed to address the remaining irregularity. With the proposed dataflow, no extra coordinate computation is needed to reconstruct the connection or to locate the output. Furthermore, we minimize energy through zero gating, which avoids unnecessary computations when the input activations equal zero.
In summary, we make the following contributions:
1) We propose a sparse-wise dataflow that skips the cycles of Multiply-and-Accumulate (MAC) operations with zero weights and minimizes energy through zero gating to further avoid unnecessary computations.
2) We propose a Vector Generator Module that reuses and generates the necessary input activations for sparse CNNs.
3) We co-design the accelerator architecture and the loop tiling to minimize off-chip memory accesses and maximize performance by slicing the input feature maps to best match the capacity of the Block Random Access Memories (BRAMs) on the FPGA.
Experiments demonstrate that the proposed accelerator achieves 987 images/s and 48 images/s for AlexNet and VGG-16 on the Xilinx ZCU102, respectively, which provides 1.5× to 6.7× speedup and 2.0× to 6.2× higher energy efficiency over previous CNN FPGA accelerators.
II. BACKGROUND
CNNs consist of multiple types of layers, including convolutional layers, pooling layers and fully-connected layers. Inputs are processed and propagated through these layers in order to be classified or recognized. The convolution operation slides an R×R window over the input feature map to extract features. At every location, the input activations inside the window are multiplied by the corresponding weights, and the products are accumulated into the partial sum of an output activation. Note that the partial sums from different input channels are accumulated to compute the output activation.
A. Network Pruning
CNNs have achieved remarkable success in various applications [1]-[3], [31], [32] at the cost of a huge amount of computation. Weight pruning has proved to be an effective way to reduce the computation and memory burden of these models without significant accuracy loss [18]-[23]. Compared to unstructured pruning techniques, structured pruning techniques are less irregular and more hardware friendly at the cost of a slightly lower compression rate. With shape-wise structured pruning (cf. Fig. 1), kernels in the same group are encoded in the same compressed format. Sharing a common compressed format reduces the irregularity of the sparse weights and also balances the load of the parallel processing engines. Table I compares the sparsity and accuracy of Deep Compression [19] and coarse-grained pruning [22]. The difference in sparsity between the two is almost negligible. Although structured pruning techniques reduce the irregularity of sparse weights, previous FPGA CNN accelerators still cannot exploit weight sparsity efficiently. To address the remaining irregularity, we propose a sparse-wise dataflow and a set of architecture optimization techniques.
B. Loop Operation
In a typical CNN, convolutional layers account for about 90% of the computation during inference. A convolutional layer can be characterized by six parameters, as shown in Fig. 2: F, the number of output feature maps (output channels); C, the number of input feature maps (input channels); U, the height of the output feature map; V, the width of the output feature map; R, the kernel size; and S, the stride. The computation can be expressed as a deep nested loop over these dimensions.
To achieve high performance, this deep nested loop is unrolled and mapped to parallel hardware, which involves loop tiling and loop interchange strategies [33]. Loop tiling keeps all the input data of a loop tile on chip to reduce external memory access, so that external memory is accessed only when the computation of a new tile begins. In addition, part of the data may be temporally reused across adjacent tiles. The loop interchange strategy decides the order of the loop tiles. Together, loop tiling and loop interchange determine the dataflow of the hardware.
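As a reference point for the tiling and interchange discussion, the untiled loop nest of a convolutional layer can be sketched as below. This is a minimal illustration in Python, assuming no padding and a square R × R kernel; the function name `conv_layer` and the nested-list data layout are choices made here for illustration and are not taken from the paper.

```python
def conv_layer(x, w, stride=1):
    """Dense convolution as a plain six-level loop nest (no padding assumed).

    x: input feature maps, indexed x[c][h][w]      (C x H x W)
    w: kernels, indexed w[f][c][kh][kw]            (F x C x R x R)
    returns y[f][u][v]                             (F x U x V)
    """
    F, C, R = len(w), len(w[0]), len(w[0][0])
    H, W = len(x[0]), len(x[0][0])
    U = (H - R) // stride + 1          # output height
    V = (W - R) // stride + 1          # output width
    y = [[[0] * V for _ in range(U)] for _ in range(F)]
    for f in range(F):                 # output channels
        for u in range(U):             # output rows
            for v in range(V):         # output columns
                for c in range(C):     # input channels
                    for kh in range(R):        # kernel rows
                        for kw in range(R):    # kernel columns
                            y[f][u][v] += (w[f][c][kh][kw]
                                           * x[c][u * stride + kh][v * stride + kw])
    return y
```

Loop tiling splits some of these loops into an outer loop over tiles and an inner loop within a tile, and loop interchange reorders them; the arithmetic performed is unchanged.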
III. SPARSE WISE DATAFLOW
Dataflows for dense CNNs on FPGAs have been proposed in [24], [34], [35]. However, these dataflows cannot leverage the benefits of sparse CNNs, since most multiplication operations involve zero weights that do not contribute to the corresponding output feature map. Therefore, it is essential to skip the cycles of processing MACs that have zero weights.
Recently, designing dataflows for sparse CNNs on ASIC platforms has attracted attention from the research community. However, these dataflows are not efficient on FPGA platforms due to the architectural differences between ASICs and FPGAs. For example, the SCNN architecture [36] applies an input-stationary dataflow in which the inner computation is a Cartesian product. In SCNN, there are N PEs and each PE contains an I × F multiplier array. Consequently, it simultaneously requires N × I input activations and N × F weights for computation. Although this dataflow temporally reuses input activations, it still needs to update N × F weights in each cycle. Furthermore, this method first requires significant coordinate computation to locate the output activations. Then, because of the Cartesian product, the dataflow returns multiple partial sums, which need to be arbitrated before being saved. Cambricon-S [22] applies an output-stationary dataflow in which the inner computation is a vector dot product. Cambricon-S implements a centralized indexing module to select input activations and transfers only the selected activations and indexes to the PEs, without extra coordinate computation. However, this dataflow performs parallel computation only in the channel dimension by spatially sharing input activations, and thus requires M input activations and N × M weights for computation, which leads to poor parallelism on FPGAs.
We propose a sparse-wise dataflow to accelerate CNNs on FPGA (cf. Alg. 2). For a convolutional layer with input feature maps Ifamp(C, H, W) and kernels Kernel(F, C, R, R), we select and fetch a vector of $T_{oc}$ nonzero weights from $T_{oc}$ adjacent kernels at each access. Similarly, we fetch a vector of $T_{om}$ activations that lie in the same row of the input feature map. We then compute $T_{oc}$ element-vector multiplications on the PE array. After shape-wise structured pruning, the coordinates of the nonzero weights in different kernels are uniform, so each element-vector multiplication involves one weight and a shared vector of $T_{om}$ activations. For example, in step 1, the weight vector is $[W^{0,0}_{0,0}, \cdots, W^{T_{oc},0}_{0,0}]$ and the input vector is $[X^{0}_{0,0}, \cdots, X^{0}_{0,T_{om}}]$. In step 2, we fetch the weight vector $[W^{0,0}_{0,1}, \cdots, W^{T_{oc},0}_{0,1}]$ and the input vector $[X^{0}_{0,jump}, \cdots, X^{0}_{0,T_{om}+jump}]$ for the next computation and compute $T_{oc}$ element-vector multiplications in pipeline mode. The variable jump equals the number of zeros between adjacent nonzero weights in the same row of a kernel.
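The element-vector pattern described above can be sketched functionally for a single kernel row. The Python model below is illustrative only: it assumes each shared step index counts the zeros skipped before a stored nonzero weight, and the names `sparse_wise_row` and `dense_row` are hypothetical. It is not a transcription of Alg. 2.

```python
def dense_row(x_row, w_row, m, stride=1):
    """Reference: M adjacent outputs from one dense kernel row."""
    return [sum(w * x_row[ow * stride + kw] for kw, w in enumerate(w_row))
            for ow in range(m)]

def sparse_wise_row(x_row, nz_weights, steps, m, stride=1):
    """Sparse-wise processing of one kernel row for several kernels.

    nz_weights[k][j]: j-th nonzero weight of kernel k (same positions in all
                      kernels, as after shape-wise pruning)
    steps[j]:         zeros skipped before the j-th nonzero weight (assumed
                      convention, shared by all kernels)
    Returns psum[k][pe]: partial sums of M adjacent outputs per kernel.
    """
    psum = [[0] * m for _ in nz_weights]
    col = -1
    for j, step in enumerate(steps):           # one step per nonzero weight
        col += step + 1                        # column of this nonzero weight
        # shared activation vector: one activation per PE
        vec = x_row[col: col + (m - 1) * stride + 1: stride]
        for k, kernel in enumerate(nz_weights):    # one PU per kernel
            for pe in range(m):                    # one PE per output
                psum[k][pe] += kernel[j] * vec[pe]
    return psum
```

The outer loop runs once per nonzero weight, so columns that hold zero weights cost no cycles, while every kernel in the group reuses the same activation vector.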
Convolution windows that produce adjacent outputs share part of their input activations. Therefore, two successively fetched input vectors from the same row of the input feature map contain partially shared data. We add a dedicated Vector Generator Module (introduced in the next section) that updates the input register to reuse this data. The execution pattern of the convolution operation is described in Fig. 3. Using this pattern, the four PEs compute four adjacent output activations and skip the cycles of processing MACs that have zero weights.
For a fully-connected layer with input vector I_vector(C, 1) and weight matrix W(F, C), the weight buffer delivers a vector of $T_{om}$ nonzero weights from adjacent rows of W at each access (see Fig. 4), while the input register delivers only a single activation.
Algorithm 2: Pseudo code of our dataflow
Input:
    The nonzero weight array $W^{oc,ic}_{kh,kw}$;
    The index array Index[ ];
    The row pointer of index R_pointer[ ];
    The number of nonzero weights in each input channel Offset[ ];
    The input activation $X^{ic}_{ih,iw}$;
    The kernel size R×R×C×F;
    The output feature map size U×V×F;
Output:
    The output activation $Y^{oc}_{oh,ow}$;
Initialize the parameters kw = 0, kh = 0, i = 0 and jump = 0;
for oh = 0; oh < U; oh = oh + Ut do
    ...
This dataflow requires a group of buffers to save the partial sums produced by convolution operations. Considering a convolutional layer with output feature map Ofamp(F, U, V), we apply an N × M PE array and set $T_{oc} = N$ and $T_{om} = M$. To prevent partial sums from being saved to and restored from DRAM, the depth of the partial sum (Psum) buffer should be at least U × V/M. However, some CNNs like VGG-16 involve a quite large product U × V in the first two layers. Therefore, we partition the input feature map into $H/H_t$ tiles, where the tile height $H_t$ is chosen according to the BRAM capacity.
The parameter Slice_BRAM represents the capacity of one BRAM block on the FPGA. Adjacent tiles share (R − Stride) rows because the sliding-window nature of the convolution introduces data dependency at tile edges. When the width of the output feature map V < M, we map an M/V × V tile to one PU for efficiency, because each PE computes one output activation. Otherwise, we map a 1 × M tile to one PU, as shown in Fig. 5. The PE array consists of N PUs, and each PU consists of M PEs in the same row. The input activations are shared across PUs, and different PUs compute output activations from different output channels. By spatially sharing both input activations and weights and temporally reusing part of the input activations, we reduce the bandwidth requirement to less than N × B + M × A, where A and B represent the bit widths of an activation and a weight, respectively. Therefore, the bandwidth requirement of the accelerator is much smaller than that of Cambricon-S [22].
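To make the bandwidth comparison concrete, the short calculation below evaluates the upper bound N × B + M × A against the M activations and N × M weights that Section III attributes to a channel-parallel, output-stationary dataflow. Applying the same N, M, A and B to both dataflows is an assumption made here for illustration, and the function names are hypothetical.

```python
def sparse_wise_bits(n, m, a_bits, b_bits):
    """Upper bound on bits fetched per cycle: N weights + M activations."""
    return n * b_bits + m * a_bits

def channel_parallel_bits(n, m, a_bits, b_bits):
    """Bits per cycle when M activations and N x M weights are required."""
    return m * a_bits + n * m * b_bits

# Example: the <N, M> = <48, 28> array with 16-bit activations and weights.
ours = sparse_wise_bits(48, 28, 16, 16)          # 1216 bits
other = channel_parallel_bits(48, 28, 16, 16)    # 21952 bits
```

Under these assumptions the weight traffic dominates the channel-parallel case, because it grows with N × M rather than with N.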
IV. ARCHITECTURE OF SPARSE WISE ACCELERATOR
In this section, we introduce the detailed architecture of our accelerator, to address the remaining irregularity of shape-wise pruned CNNs.
Overview. Fig. 6 depicts the overall block diagram of our proposed accelerator. Following the proposed dataflow, we design a Vector Generator Module (VGM) to address the sparsity with shared indexes. We design a PU that has multiple PEs to compute adjacent output activations in parallel. Multiple PUs constitute an array to compute multiple output activations 
A. Addressing with sparsity
The accelerator is designed to exploit structured sparsity for performance gain and energy reduction. In our accelerator, sparsity is processed by the VGM and the PUs together. The VGM receives input activations from ABin and decodes weight indexes from WIB, then produces the selected activations, which are broadcast to all the PUs. Meanwhile, these input activations are cached in the VGM for the next selection, since the windows overlap when the kernel slides across the input feature map. Each PU receives the weight read address from the Main Controller and reads out the needed weight following the principles of the proposed dataflow, thus avoiding unnecessary computations.
Index. Before elaborating on the VGM and PU, we explain how we store and index the sparse weights. We store the sparse weights that result from shape-wise pruning in Compressed Sparse Row (CSR) format, which requires only 2a + R × C + C numbers, where a is the number of nonzero weights, R is the number of rows and C is the number of input channels. Owing to shape-wise pruning, all kernels in a group share a common index. To compress further, we store the step index instead of the absolute position. We encode both the step index and R_pointer in 4 bits and Offset in 16 bits, as illustrated in Fig. 7. A 4-bit index is large enough for convolutional layers, but the situation is different in fully-connected layers. When an index larger than the bound is needed, we pad a filler zero to prevent overflow. In the example in Fig. 8, when the step index exceeds the largest 4-bit unsigned number, a filler zero is padded.
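The step-index scheme with filler zeros can be sketched as follows. This is a minimal illustration that assumes a step index counts the zeros skipped before each stored value and that an overflowing run of zeros is broken by storing an explicit zero with the maximum step; the exact convention of Fig. 8 is not reproduced here, and `encode_steps` and `decode_steps` are hypothetical helper names. The R_pointer and Offset arrays are omitted.

```python
MAX_STEP = 15  # largest value representable in a 4-bit unsigned step index

def encode_steps(row):
    """Encode one row of weights as (values, steps).

    steps[j] is the number of zeros skipped before values[j] (assumed
    convention). A run longer than MAX_STEP is split by a filler zero.
    """
    values, steps = [], []
    run = 0                          # zeros seen since the last stored value
    for w in row:
        if w != 0:
            values.append(w)
            steps.append(run)
            run = 0
        elif run == MAX_STEP:        # step would overflow 4 bits
            values.append(0)         # store a filler zero instead
            steps.append(MAX_STEP)
            run = 0
        else:
            run += 1
    return values, steps

def decode_steps(values, steps, length):
    """Rebuild the dense row from (values, steps)."""
    row = [0] * length
    pos = -1
    for v, s in zip(values, steps):
        pos += s + 1                 # advance past the skipped zeros
        row[pos] = v
    return row
```

A filler zero costs one stored value and one wasted multiplication, which is why the text notes that it matters mainly in fully-connected layers, where long runs of zeros are more likely than inside an R × R kernel.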
VGM. The VGM processes the sparsity by selecting the needed input activations and transferring them to all the PUs (see Fig. 9). We design a central VGM shared by multiple PUs to process the shared indexes from structured sparsity more efficiently. For example, in the first cycle, when the index is "1", the activations in register REG 0 are shifted left by A bits (the leftmost A bits are shifted out, where A is the width of an activation), and the selected data are broadcast to all the PUs. In cycle 1, when the index is "0", the activations in REG 0 are shifted by a further A bits. The PUs perform MACs with the previously selected input activations and cache the activations selected by the VGM in this cycle. In cycle 2, the kernel row number kh (see Algorithm 2) increases. Therefore, REG 0 reloads activations from REG 1 and shifts the data according to the new weight index. The depths of both REG 0 and REG 1 are set to (M − 1) × stride + R. To overlap activation selection with memory access, the activations in REG 1 are updated after REG 0 has reloaded from REG 1 for the last time.
As PUs share the same indexes of weights due to shape-wise pruning, the module for selecting activations (VGM) is shared by all the PUs, thus reducing the indexing module overhead and bandwidth requirement between VGM and PUs.
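The shifting behaviour of the VGM can be modelled at the register level with a few lines of Python. The sketch assumes that the first index in a kernel row shifts REG 0 by `index` activations and every later index shifts it by `index + 1`, which matches the example above (index "1" and then index "0" each shift by one activation). The function name `vgm_row` and the list-based register model are illustrative assumptions.

```python
def vgm_row(reg1, indexes, m, stride=1):
    """Model the VGM for one kernel row.

    reg1:    activations held in REG 1, length (m - 1) * stride + R
    indexes: step indexes of the nonzero weights in this kernel row
    Returns the list of M-element activation vectors broadcast to the PUs,
    one vector per nonzero weight.
    """
    reg0 = list(reg1)                    # REG 0 reloads from REG 1
    broadcast = []
    for j, index in enumerate(indexes):
        shift = index if j == 0 else index + 1   # assumed shift rule
        reg0 = reg0[shift:] + [0] * shift        # left shift by `shift` activations
        # every stride-th activation feeds one PE
        broadcast.append(reg0[0:(m - 1) * stride + 1:stride])
    return broadcast
```

Because the indexes are shared by all kernels in a group, this selection is done once and its result is broadcast, instead of being repeated inside every PU.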
PU. The PU processes all operations in CNNs. Each PU consists of M zero-value discriminators, M homogeneous PEs, M homogeneous PSBs, a WB, and pooling, normalization, and activation modules (see Fig. 10). Weights can be stored separately in the PUs because output activations from different channels involve independent kernels. A selected activation first streams into a zero-value discriminator and is used by the PE only if it is nonzero. Meanwhile, the discriminator produces an enable signal that controls the clock gating of the PE: when the selected activation equals zero, the corresponding PE is clock gated. To minimize data communication, the partial sums produced by a PE are saved in a local partial sum buffer placed next to the PE. Once the entire computation of an output activation is done, the final partial sum is transferred to the activation and pooling modules. As weights are neither spatially shared nor temporally reused in a fully-connected layer, it requires a large input bandwidth. Although all the PUs could be active in a fully-connected layer, the required off-chip memory bandwidth cannot be satisfied on the FPGA platform. Thus, we keep only one PU active in fully-connected layers when M is large enough.
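The zero-value discriminator and clock gating can be illustrated with a behavioural sketch of one PU cycle in a convolutional layer. It is a software model only: gating is represented by skipping the accumulate, and the name `pu_cycle` is hypothetical.

```python
def pu_cycle(weight, activations, psums):
    """One convolutional-layer cycle of a PU with M PEs.

    weight:      the single weight broadcast to all PEs in this cycle
    activations: the M selected activations from the VGM
    psums:       the M local partial sums, updated in place
    Returns the number of PEs that were not gated in this cycle.
    """
    active = 0
    for pe, act in enumerate(activations):
        enable = act != 0            # zero-value discriminator
        if enable:                   # PE is clocked only when enabled
            psums[pe] += weight * act
            active += 1
    return active
```

Skipping a zero activation does not change the partial sum, so gating saves energy without affecting the result; it does not shorten the cycle count, since the other PEs in the PU still need the cycle.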
B. Optimize PE for quantization
One of the most common methods for model compression is to quantize both weights and activations. Low-bit activations and weights require less bandwidth, which helps improve the throughput if the accelerator is a computation-bounded design [14]. However, the proposed design is BRAM limited. Directly increasing the size of the PE array would make the implementation on the FPGA fail, because the number of required BRAMs would easily exceed the number available. Therefore, we keep the activation datapath and optimize the structure of the PE. For example, when the activations and weights are quantized to 8 bits, we dispatch two selected 8-bit activations to one PE. Under the proposed dataflow, the two activations are multiplied by the same weight in two DSP slices, respectively. The partial sums can then be concatenated and stored in the buffer.
C. Storage
As the data processed in our accelerator have different behaviors, we split storage into five parts: an ABin, an ABout, a WIB, N WBs, and N × M PSBs.
For the ABin and ABout, we set the widths to 16 × (M − 1 + R) bits and 16 × M bits, respectively, so as to provide (M − 1 + R) input activations to the VGM and to fetch M output activations from a PU at each access. Thanks to the proposed dataflow, the N PUs share the same input activations, and we read input activations and produce output activations row by row. Thus, we set a small depth for both ABin and ABout.
For the WB in each PU, we use a dual-port RAM whose read width is 16 bits for one port and 16 × M bits for the other port. The write width of both ports is 16 × M bits. In a convolutional layer, only one weight is read out and broadcast to all the PEs, whereas in a fully-connected layer, M weights are fetched and each is transferred to the corresponding PE.
For the WIB, we set the width to 16 bits, since we use the CSR format with 16 bits, 4 bits and 4 bits for Offset, Index, and R_pointer, respectively. Thus, Index and R_pointer are stored aligned to 4 bits. We divide the WIB into three parts for the three components of the compressed weight index.
For the PSB, we set the width to 32 bits. We store the partial sums and cache output activations in the PSB, as the ABout fetches output activations from the PUs in turn. We map each PSB to one BRAM, because mapping each PSB to two or more BRAMs would easily make the on-chip memory resources the bottleneck of the attainable peak performance and reduce the throughput.
The sizes of the five buffers are decisive for overall performance and energy consumption. For example, the size of the PSB decides the number of tiles. A small PSB requires a large number of tiles, which leads to costly off-chip memory accesses, whereas a large PSB scales poorly to the small layers in CNNs. Thus, we generally deploy 2KB, 2KB, 4KB, 512B and 2KB for ABin, ABout, WIB, WB and PSB, respectively. The configuration of these buffers can be adjusted according to the CNN.
V. EXPERIMENT

A. Experimental Setup
We evaluate our design on the Xilinx ZCU102 evaluation kit, which consists of an UltraScale FPGA, quad ARM Cortex-A53 processors, 4GB PS DDR4 and 512MB PL DDR4. We use Verilog for the RTL implementation and Xilinx Vivado (v2017.2) to compile the source code to a bitstream. The design method is inspired by DNNWEAVER [6]. Our FPGA implementation is synthesized at a frequency of 200MHz. We use a GPIO-to-USB adapter to read the power directly from the PMBus on the FPGA board. We combine the methods of [22], [23] to train the CNN models. In our experiments, we test typical CNNs including LeNet, AlexNet and VGG-16 and achieve 11.85%, 32.92% and 36.75% sparsity for LeNet, AlexNet and VGG-16, respectively, without significant accuracy loss.
B. Resource utilization
Table II shows ... bit for VGM. Besides the buffers, the remaining modules consume LUTs and Flip-Flops (FFs). Fig. 11 shows the resource utilization for different parallelism factors, obtained from the Xilinx Vivado tool (v2017.2). The LUT utilization increases as the total number of PEs N × M increases. BRAMs are mainly used to construct buffers and FIFOs. When (N, M) increases beyond a certain point, some large FIFOs are implemented with LUTs and FFs rather than BRAMs to meet the timing constraints.
C. Computation Efficiency
In this work, we parallelize across output channels among the PUs. Thanks to shape-wise pruning, the load of each PU is balanced. When we map the network onto the PEs, the inefficiency of our design comes mainly from two sources: Dynamic Activation Inefficiency (DAI) and Dataflow Mapping Inefficiency (DMI). First, there are zero activations in the input feature maps, which causes some PEs to be gated. This behavior is deliberate and is designed to save energy. Second, the size of the feature map may not be evenly divisible by M, where M is the number of PEs in each PU. The DAI is dynamic: it depends on the input data but not on the proposed dataflow. The DMI, in contrast, depends on the proposed dataflow and can be computed with the following equations. We assume the size of a 3-D output feature map is U × V × F. According to our dataflow, when V > M, the average computation efficiency is given by Eq. (2).
When V < M, the average computation efficiency is given by Eq. (3). We measure the DMI on AlexNet and VGG-16. Fig. 12 shows the computation efficiency, accounting for DMI, across different layers with different parallelism factors. Because both U and V in VGG-16 are divisible by 14, the computation efficiency stays at 100% when M equals 14 or 28. In conclusion, our sparse-wise dataflow maintains a high computation efficiency for different neural networks.
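A rough estimate of the mapping efficiency can be derived from the tiling rule of Section III. The sketch below is an assumption-based reading of that rule (a 1 × M tile per PU when V ≥ M, otherwise ⌊M/V⌋ whole output rows per tile) and is not a reproduction of Eq. (2) and Eq. (3); the function name `mapping_efficiency` is hypothetical.

```python
import math

def mapping_efficiency(u, v, m):
    """Fraction of PE slots in a PU that hold a real output activation.

    Assumed tiling (an illustrative reading of the text):
      * v >= m: each output row is covered by ceil(v / m) tiles of 1 x m;
      * v <  m: floor(m / v) whole output rows share one tile of m PEs.
    """
    if v >= m:
        tiles = u * math.ceil(v / m)
    else:
        rows_per_tile = m // v
        tiles = math.ceil(u / rows_per_tile)
    return (u * v) / (tiles * m)
```

Under these assumptions, feature maps whose sides are multiples of 14 (224, 112, 56, 28 and 14 in VGG-16) give an efficiency of 1.0 for M = 14 or M = 28, which agrees with the observation in the text, while a 13 × 13 map with M = 28 gives 169/196.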
To analyze DAI, we first count the sparse activations in the convolutional layers of AlexNet and VGG-16 using PyTorch (torchvision) on the ImageNet 2012 dataset. As shown in Fig. 13, for AlexNet, layer conv5 shows the lowest activation sparsity, below 10%, and layer conv2 shows the highest activation sparsity, over 90%. The overall sparsity is about 39.6%. As for VGG-16, Fig. 14 shows that the last seven convolutional layers have a low sparsity, below 30%, and layer conv2_1 has the highest sparsity, about 75%. Overall, the activation sparsity is about 39.5%. From the activation sparsity, we can estimate the DAI.
We then measure the DAI on conv2_1 of VGG-16 by simulation. On each compute cycle, we sample the total number of active PEs and calculate the efficiency. Because the total number of samples is very large, we calculate an average efficiency over every 100 samples. Fig. 15 shows the proportion of ungated PEs. Indeed, part of the PEs are gated to save energy. The proportion of active PEs on conv2_1 is positively related to the activation sparsity. The sampling circuit is designed for simulation with the configuration <N, M> = <16, 28> and is not implemented on the FPGA.
D. Performance analysis
In this section, we adopt the well-known roofline model [14] to explore the impact of insufficient off-chip bandwidth on performance. We set the bitwidth of the AXI data bus that connects to DDR to 128 bits. First, without quantizing the weights, we vary the configuration to find the optimal parameters for mapping the FC6 layer of VGG-16 onto our accelerator. We normalize all performance numbers to that of "N1M2". As illustrated in Fig. 16, the performance touches the roof when the number of PEs reaches 8. Second, we quantize the weights to 8 bits. As shown in Fig. 17, the performance does not touch the roof until the number of PEs reaches 16, and the roof becomes higher. This is because there is neither weight sharing nor weight reuse in FC layers: when the bitwidth of a weight is halved, the number of weights transferred in each DDR access doubles, so the speedup also doubles.
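The saturation points reported above are consistent with a simple bandwidth bound. The sketch below assumes, hypothetically, that the 128-bit AXI bus delivers one beat per accelerator cycle and that every fetched weight feeds exactly one MAC (no weight sharing or reuse in FC layers); `fc_macs_per_cycle` is an illustrative name, and 16 bits is taken as the unquantized weight width.

```python
def fc_macs_per_cycle(num_pe, weight_bits, bus_bits=128):
    """Roofline-style bound for an FC layer.

    Useful MACs per cycle are limited by the smaller of the number of PEs
    (compute roof) and the number of weights arriving per cycle (bandwidth roof).
    """
    weights_per_beat = bus_bits // weight_bits
    return min(num_pe, weights_per_beat)

# Sweep the number of PEs for 16-bit and 8-bit weights.
for bits in (16, 8):
    print(bits, [fc_macs_per_cycle(p, bits) for p in (2, 4, 8, 16, 32)])
```

With these assumptions the bound flattens at 8 PEs for 16-bit weights and at 16 PEs for 8-bit weights, and the 8-bit roof is twice as high.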
After that, we map the conv2_1 layer of VGG-16 onto our accelerator to explore the optimal parameters. To obtain a high computation efficiency, we set N to a divisor of the number of output channels F. We normalize all performance numbers to that of "N1M8". In Fig. 18, configuration "N48M28" achieves the best peak performance.
Finally, we analyze the performance of our implementation. We set the PE array size to <N, M> = <48, 28>, which gives 1344 PEs. In this configuration, the peak throughput is 2 × 0.2GHz × 28 × 48 = 537.6 GOP/s when the operand width is 16 bits. In particular, the proposed design supports two 8-bit × 8-bit multiplications in a PE with 2 DSPs, which leads to a peak performance of 1075.2 GOP/s.
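The peak-throughput arithmetic can be checked with a few lines of Python. The helper name `peak_gops` and the `macs_per_pe` parameter are introduced here for illustration; each MAC is counted as two operations, as in the expression above.

```python
def peak_gops(n, m, freq_mhz, macs_per_pe=1):
    """Peak throughput in GOP/s for an N x M PE array.

    Each MAC counts as two operations (one multiply, one add).
    """
    ops_per_cycle = 2 * n * m * macs_per_pe
    return ops_per_cycle * freq_mhz / 1000

peak_16bit = peak_gops(48, 28, 200)                 # one 16-bit MAC per PE
peak_8bit = peak_gops(48, 28, 200, macs_per_pe=2)   # two 8-bit MACs per PE
```

These are upper bounds: the effective throughput additionally depends on the mapping efficiency and on off-chip bandwidth.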
We compare our design with convolutional FPGA accelerators in Table III. The performance in Table III represents the effective throughput. [25], [37] are dense CNN accelerators, and [8], [30] are sparse CNN accelerators. Both [8] and [30] address only weight sparsity and not activation sparsity, so we compute our throughput with the DMI defined in the previous subsection. For the dense accelerators, the performance is computed by dividing the effective throughput by the computation of the dense network. According to Table III, our accelerator achieves an effective performance of 987 images/s on structured sparse AlexNet, which is a 1.5× to 6.7× speedup and 2.0× to 4.2× higher energy efficiency compared with [8], [30], [37]. As for VGG-16, our implementation achieves 48 images/s, which is a 2.2× to 2.3× speedup and 3.4× to 6.2× higher energy efficiency compared with [25], [30]. With 8-bit integers, we achieve 96 images/s, which is a 4.4× to 4.6× speedup and 6.1× to 11.2× higher energy efficiency compared with [25], [30]. The speedup arises because our dataflow effectively skips the multiplications with zero weights. In addition, this dataflow flexibly maps the network onto the PEs, which leads to a high utilization of on-chip resources. Previous works either cannot exploit the zeros efficiently or suffer from low computation efficiency. Besides, we clock gate the PEs whose input activations equal zero, so we obtain a higher energy efficiency than previous works.
The proposed design achieves a higher resource efficiency because we leverage the sparsity and achieve a high mapping efficiency, as tabulated in Table IV. The accelerator in [30] also leverages the sparsity, but its DSP efficiency is encumbered by the low mapping efficiency caused by the unbalanced load of the PEs. In addition, its logic cell efficiency is only 0.405 GOP/s/K cells because the TLUT and CMUX [30] consume a large number of logic resources. Reducing the bitwidth helps improve the resource utilization efficiency. The design in [38] achieves 0.275 GOP/s/DSP and 1.213 GOP/s/K cells because it uses 8-bit block floating point to represent activations and weights, so that two multiplication operations can be carried out in one DSP slice. In the proposed design, if the data width is 8 bits, the logic cell efficiency improves by nearly 100%.
VI. CONCLUSION
In this work, we have proposed a sparse CNN FPGA accelerator with a sparse-wise dataflow that skips zero-weight computations. Moreover, we have exploited data statistics to minimize energy through zero gating, which avoids unnecessary computations. In addition, we have proposed a set of architecture optimization techniques for sparse CNNs. Experiments demonstrated that our implementation achieves 987 images/s and 48 images/s for AlexNet and VGG-16 on the Xilinx ZCU102, respectively, which provides 1.5× to 6.7× speedup and 2.0× to 6.2× higher energy efficiency over previous CNN FPGA accelerators. Furthermore, the performance improvement and energy efficiency would be much larger if we achieved the CNN sparsity described in Cambricon-S [22].
Haibin Shen is currently a Professor with Zhejiang University, a member of the second level of 151 talents project of Zhejiang Province, and a member of the Key Team of Zhejiang Science and Technology Innovation. His research interests include learning algorithm, processor architecture, and modeling. His research achievement has been used by many major enterprises. He has published more than 100 papers on academic journals, and he has been granted more than 30 patents of invention. He was a recipient of the First Prize of Electronic Information Science and Technology Award from the Chinese Institute of Electronics, and has won a second prize at the provincial level.
VII. ACKNOWLEDGEMENT
