Abstract: This paper proposes a convolution core for sparse CNNs that can flexibly alternate the parallelism scheme and degree, exploiting both intra- and inter-output parallelism of the convolutional layer, and that leverages weight sparsity using a compressed sparse model in the compressed sparse column format and an output-stationary dataflow. The experimental results show that performance is improved by 3.9 times even in a deeper layer where the conventional accelerator could not fully exploit the parallelism due to the small layer size. The proposed architecture also exploits weight sparsity. By combining multi-parallelism and weight sparsity, the proposed architecture achieves 5.2 times better performance than the conventional accelerator.
I. Introduction
Convolutional Neural Networks (CNNs) have been among the most prominent deep learning models for extracting knowledge in a vast number of applications, especially in the image and video analytics domain, such as image recognition, scene understanding, and autonomous driving, because of their remarkable classification performance. Recent studies have demonstrated that deep CNN models, such as VGG [12] and ResNet [8], can achieve higher accuracy than human recognition.
A CNN consists of two kinds of layers: (1) convolutional layers, which function as a feature extractor, and (2) fully connected layers, which work as a classifier. Fig. 1 illustrates the computation of a convolutional layer, which computes a two-dimensional convolution between sliding windows of input feature maps and kernels, and consumes most of the computation time of a CNN.
The high accuracy obtained by CNNs comes at the price of excessive computation, which becomes critical for real-time and low-power inference processing. One promising approach is the use of low-power, high-performance hardware accelerators, such as FPGAs and ASICs. In order to accelerate a vast amount of computation, recent accelerators employ three major techniques: (1) data reuse maximization; (2) calculation skip maximization, referred to as weight pruning; and (3) calculation parallelism maximization.
The effectiveness of data reuse maximization is achieved through a specific dataflow. A reconfigurable processor array architecture uses column-wise data delivery together with a forwarding bus to reuse input data [1]. The effectiveness of calculation skip maximization comes from the fact that a state-of-the-art study has removed more than 80% of weights without loss of accuracy. The Efficient Inference Engine (EIE) leverages sparsity in addition to employing a weight-stationary dataflow [6], where input activations are delivered to processing elements (PEs) in order to be multiplied with locally stored weight elements.
Fig. 1: The computation of a convolutional layer and its parallelism schemes
While the above accelerators help exploit data reuse and sparsity, they fail to maximize the calculation parallelism inherent in CNNs, since only the parallelism inside an output feature map (the X-Y plane shown in Fig. 1) of each layer, known as intra-output parallelism, has been exploited. In other words, the parallelism between multiple output feature maps (the C_o axis shown in Fig. 1), known as inter-output parallelism, has not been exploited. Efficiently exploiting both kinds of parallelism greatly benefits the acceleration of CNNs.
The major contribution of this paper is that the proposed convolution core exploits multiple parallelism schemes and degrees together with weight sparsity to accelerate the convolutional layers of sparse CNNs. It flexibly alternates the parallelism scheme and degree for two-dimensional convolution between intra-output parallelism and the multi-parallelism of inter- and intra-output parallelism, based on the characteristics of each convolutional layer, in order to increase PE utilization. In addition, the core exploits weight sparsity effectively using a compressed sparse model in a channel-wise modified compressed sparse column (CSC) format [6] and an output-stationary dataflow, in which the architecture delivers the weight elements to the PEs for multiplication and accumulates the output locally.
II. Related Studies
Two keys to accelerating CNNs are computation optimization and specific architecture design. The amount of computation is optimized in several ways, including quantization and weight pruning, in order to reduce the number of multiply-accumulate (MACC) operations and the required resources. A specific architecture leverages the CNN's parallelism and data locality to optimize the memory footprint.
Computation optimization focuses on quantizing arithmetic precision and reducing the number of weights by pruning techniques, since CNNs contain redundant operations. Most studies quantize data to a fixed-point format using techniques such as a greedy algorithm [11] or an approximation method [5]. The Deep Compression study [7] creates a sparse network without loss of accuracy by iteratively pruning small-valued weights and retraining the model. It further compresses the model by weight-sharing quantization in order to reduce the number of bits required for weight representation. These optimizations lower both the computational and the storage resource requirements of the customized hardware.
Previous studies on CNN accelerators exploit the CNN's unique pattern of data usage, parallelism, and data sparsity. Most of them emphasize leveraging a specific energy-efficient dataflow, which can be categorized into weight stationary-based [4, 2], output stationary-based [1, 6], and row stationary-based dataflow [3]. Nevertheless, only a few works consider data sparsity. For example, EIE has both a leading non-zero detection circuit for detecting zero-valued activation data and a sparse matrix access circuit for extracting non-zero weights from the compressed sparse model [6]. Similarly, SCNN employs zero-skipping circuits to dynamically compress sparse activation data and eliminate zero-operand multiplications with a tile-based PT-IS-CP-sparse dataflow [10].
However, these architectures suffer from inefficiency because the dominant parallelism differs among the convolutional layers within a CNN, varying with the size and the number of kernels, input feature maps, and output feature maps of each layer. A specific dataflow restricts the parallelism that the accelerator can exploit, which leads to underutilization of the available resources. For instance, most PEs of the reconfigurable processor array architecture [1], which employs a spatial output-stationary dataflow, are idle when the size of the output feature maps (X × Y in Fig. 1) is smaller than the dimension of the processor array. Likewise, many multipliers within a PE of the tile-based SCNN [10] are not occupied because the tiles become smaller in the deep layers. For that reason, most architectures fail to deliver their maximum performance because their parallelization capability is not flexible.
The FlexFlow architecture [9] comes closest to ours. It leverages multiple parallelism schemes and degrees to improve PE utilization with multiple dataflows that can realize various kinds of parallelism. However, it neither supports the compressed sparse model nor exploits sparsity.
III. Parallelism-Flexible Convolution Core for Sparse CNN
The proposed parallelism-flexible convolution core architecture for sparse CNNs includes the following key concepts: (1) multiple parallelism schemes and degrees of parallelism that can be flexibly changed layer by layer; (2) exploitation of weight sparsity using a compressed sparse weight format and an output-stationary dataflow to efficiently eliminate the operations related to zero-valued weights.
First, the proposed architecture flexibly alternates the effective parallelism scheme and degree according to the characteristics of each convolutional layer to increase the PE utilization. The effective parallelism scheme of each layer is selected as either intra-output parallelism or the multi-parallelism of both inter- and intra-output parallelism. It can be determined in advance because the layer's characteristics and the number of available processing elements (PEs) are known. If the size of the output feature maps is equal to or larger than the number of PEs, the effective parallelism is intra-output parallelism, which means that the PEs compute different pixels of the same output feature map. Otherwise, the effective parallelism is the multi-parallelism, and the degree of parallelism, P, is determined so as to maximize the number of occupied PEs. In the multi-parallelism scheme, a group of PEs computes different pixels of an output feature map (intra-output parallelism), and P different groups of PEs compute P different output feature maps at once using P duplications of the input data and P different kernels (inter-output parallelism). Section III.B explains the layer-wise determination of the effective parallelism scheme and the degree of parallelism in detail. As a consequence, the PE utilization can be improved by exploiting both inter- and intra-output parallelism.
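The layer-wise selection described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the closed-form rule used for P, namely P = floor(B / D) with D = ceil(X·Y / (N·G)), is one reading that is consistent with the symbol definitions given in Section III.B, and the function name `select_parallelism` and the example values (N = 16, G = 4, B = 16) are hypothetical.

```python
import math

def select_parallelism(X, Y, N, G, B):
    """Return (scheme, P) for one convolutional layer.

    X, Y: width and height of the output feature maps
    N:    number of PEs in one PE group
    G:    number of PE groups in one PE bank
    B:    number of BCUs (equal to the number of PE banks)
    """
    pes_per_bank = N * G
    total_pes = pes_per_bank * B
    if X * Y >= total_pes:
        # One output feature map already fills every PE.
        return "intra-output", 1
    # D: smallest number of PE banks whose PEs cover one output feature map.
    D = math.ceil(X * Y / pes_per_bank)
    # Assumed rule: compute as many kernels in parallel as the banks allow.
    P = max(1, B // D)
    scheme = "multi-parallelism" if P > 1 else "intra-output"
    return scheme, P

if __name__ == "__main__":
    print(select_parallelism(224, 224, 16, 4, 16))  # large map, shallow layer
    print(select_parallelism(14, 14, 16, 4, 16))    # small map, deep layer
```

Under these assumptions a 14 × 14 output map needs four banks of 64 PEs, so four kernels can be processed at once on 16 banks, while a 224 × 224 map stays in the intra-output scheme.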
Second, the convolution core leverages weight sparsity by extracting non-zero weights and their indices from the compressed sparse model and convolving the weights with the corresponding input data. The kernels of each convolutional layer are serialized channel-wise in a modified compressed sparse column (CSC) format [6]. For example, the kernels of size 3 × 3 in Fig

A. Architecture Organization

Fig. 2 illustrates the proposed parallelism-flexible convolution core for sparse CNNs, which consists of a parallelism controller, a sparse weight broadcaster, and a PE grid. Compared to conventional architectures, the proposed architecture contains the parallelism controller and the sparse weight broadcaster in order to exploit the multiple parallelism schemes and the sparsity of the CNN. The core receives input feature maps, a compressed sparse model, and the characteristics of a convolutional layer, including information on the pre-determined effective parallelism, as input. The characteristics of a convolutional layer include the size and the number of kernels, input feature maps, and output feature maps. The information on the pre-determined effective parallelism includes the parallelism scheme and the degree of parallelism, P, which indicates the number of kernels to be computed in parallel. The results are stored in the output buffer.

Fig. 2: The architecture of the proposed parallelism-flexible convolution core
Parallelism Controller. The parallelism controller is composed of a broadcast parallelism controller and a data sequencing controller. The broadcast parallelism controller forwards P to the sparse weight broadcaster. The data sequencing controller determines the coordinate of the output feature map to be computed by each PE group. If the size of the output feature maps is larger than the number of PEs, the data sequencing controller repeats the coordinate assignment process until all pixels are covered.
The execution of the convolution core alternates according to the characteristics of a convolutional layer and the information on the pre-determined effective parallelism scheme and degree. Specifically, if the pre-determined effective parallelism scheme is intra-output parallelism, the broadcast parallelism controller forwards 1 as P to the broadcaster, and the data sequencing controller assigns different coordinates to all PE groups in the PE banks of the PE grid. When multi-parallelism is indicated, the broadcast parallelism controller forwards P, where P is greater than 1, to the broadcaster, and the data sequencing controller assigns P duplications of the coordinates, one to each set of M/P PE banks, where M is the number of PE banks. For example, assuming P is 2, the coordinates assigned to PE bank#1 through PE bank#(M/2) are different from one another, but are the same as the coordinates assigned to PE bank#(M/2+1) through PE bank#M.
Sparse Weight Broadcaster. To leverage weight sparsity, the sparse weight broadcaster extracts the non-zero weights, decodes the corresponding indices, and broadcasts them to the PE grid. It is composed of a sparse weight memory, an index memory, a broadcast controller, and multiple broadcast units (BCUs). First, the compressed sparse model of each layer is loaded channel-wise into the weight and index memories. Next, after the input feature maps are loaded into the PE grid's input buffers, the broadcast controller starts reading all weights and indices of one channel from both memories and passes them to the BCUs according to P. Finally, the compressed model is decoded channel-wise at the BCUs, and the non-zero weights and their indices are broadcast consecutively to the PE banks.
For ease of passing the compressed sparse model to the BCUs, the non-zero weights and their numbers of leading zeros are re-ordered in advance according to P in such a way that the weights and indices from P different kernels are read simultaneously. Fig. 3 illustrates an example of the weight arrangement in the weight memory, assuming that one memory word can store four weights and that P equals 1, 2, and 4. When P = 1, which indicates that all BCUs broadcast the same weight value, the weights in one channel of all kernels are ordered consecutively. On the other hand, when P is greater than one, the weights from different kernels that must be broadcast at the same time are placed in the same memory word. For example, when P = 2, the first memory word contains k1#1, k1#2, k3#1, and k3#2, so that the weights of kernels 1 and 3 can be read at the same time. Likewise, k1#1, k2#1, k3#1, and k4#1 are stored in the first memory word when P = 4, so that the weights of four kernels can be read simultaneously.
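To make the decoding step concrete, the following Python sketch encodes one serialized kernel channel as pairs of a non-zero weight and the number of zeros preceding it, and decodes it again with an index accumulator. It is an illustrative sketch only: the function names are hypothetical, the 4-bit index width follows the evaluation setup, and the padding-zero rule for runs longer than the index field can express is an assumption borrowed from EIE-style encodings [6].

```python
def encode_channel(weights, index_bits=4):
    """Encode a serialized kernel channel as (value, leading_zeros) pairs."""
    max_run = (1 << index_bits) - 1
    pairs = []
    run = 0
    for w in weights:
        if w == 0:
            run += 1
            if run > max_run:
                # Assumption: emit an explicit zero-valued entry when the
                # run of zeros no longer fits in the index field.
                pairs.append((0, max_run))
                run = 0
            continue
        pairs.append((w, run))
        run = 0
    return pairs  # trailing zeros are implied by the channel length

def decode_channel(pairs, length):
    """Rebuild the dense channel with an index accumulator."""
    dense = [0] * length
    pos = -1  # position of the most recently decoded element
    for value, zeros in pairs:
        pos += zeros + 1
        dense[pos] = value
    return dense

if __name__ == "__main__":
    kernel = [0, 5, 0, 0, -3, 0, 0, 0, 2]  # one 3 x 3 channel, serialized
    pairs = encode_channel(kernel)
    print(pairs)
    print(decode_channel(pairs, len(kernel)) == kernel)
```

Only the pairs are stored and broadcast, so the number of broadcast steps for a channel follows the number of non-zero weights rather than the kernel size.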
The broadcast controller distributes the weights to the BCUs according to P. For instance, assuming that there are four BCUs, the weights in Fig. 3 are passed to the BCUs to be broadcast as follows:
• When P = 1: First, the weight k1#1 is passed to all BCUs, followed by k1#2, and so on.
• When P = 2: First, the weight k1#1 is passed to BCU#1 and BCU#2, and the weight k3#1 is passed to BCU#3 and BCU#4, followed by k1#2 and k3#2, and so on.
• When P = 4: The weights k1#1, k2#1, k3#1, and k4#1 are passed to BCU#1 through BCU#4, respectively.
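The three cases above can be reproduced with a short Python sketch that lists, for each broadcast step, which weight every BCU receives. It is a simplified illustration under stated assumptions: all kernels are given the same number of weights, the number of BCUs and the number of kernels are both divisible by P, and the function name `bcu_schedule` is hypothetical.

```python
def bcu_schedule(kernels, num_bcus, P):
    """Return, per broadcast step, the weight handed to each BCU.

    kernels: per-kernel weight lists for one channel. Equal lengths are
             assumed here; real sparse kernels may differ in length.
    """
    assert num_bcus % P == 0 and len(kernels) % P == 0
    bcus_per_group = num_bcus // P
    kernels_per_group = len(kernels) // P
    steps = []
    for rnd in range(kernels_per_group):
        # Kernel processed by each of the P BCU groups in this round.
        active = [kernels[u * kernels_per_group + rnd] for u in range(P)]
        for i in range(len(active[0])):
            steps.append([active[b // bcus_per_group][i]
                          for b in range(num_bcus)])
    return steps

if __name__ == "__main__":
    kernels = [[f"k{k}#{i}" for i in (1, 2)] for k in (1, 2, 3, 4)]
    for P in (1, 2, 4):
        print(P, bcu_schedule(kernels, num_bcus=4, P=P)[:2])
```

With four kernels and P = 2, the first group of BCUs works through kernels 1 and 2 while the second group works through kernels 3 and 4, which is why k1 and k3 share a memory word in the arrangement described for Fig. 3.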
Consequently, two and four kernels can be convolved simultaneously when P equals 2 and 4, respectively.

Fig. 3: Example of weight arrangement of four kernels in weight memory, so that the BCUs can broadcast weights from different kernels at the same time

Processing Element Grid. A PE grid consists of multiple PE banks, each of which receives a pair of weight and index from the BCUs, the coordinates of the output from the parallelism controller, and the input feature maps. The number of PE banks and the number of BCUs are equal, and every PE group within a PE bank consumes the same pair of weight and index from one BCU. The output coordinates assigned to the PE groups within a PE bank are unique. Each PE group stores locally in its buffer the pixels of the input feature maps required to compute the output at the assigned coordinate.
Fig. 2b shows the block diagram of a PE bank, which includes multiple PE groups that compute different output pixels. First, a forward register (Fwd register in Fig. 2b) receives input feature maps and forwards them to the neighbouring PE group in order to reduce wire delay. Then, the address calculator determines the addresses of the input feature map pixels needed to compute the output at the assigned coordinate, and stores the input from the forward register in the input buffer, IN BUF. Assuming that there are N PEs in one PE group, N consecutive outputs of the same output row, starting from the assigned coordinate, are computed within a PE group. The IN BUF consists of registers that store K rows of N overlapping input feature map windows, where K is the size of the kernel. In other words, it stores input pixels x to x + N + K − 2 of rows y to y + K − 1, for a total of K × (N + K − 1) input pixels, when the assigned output coordinate is (x, y) and the stride of the sliding window is one. The data is reused locally for the computation of all kernels.
Next, the local data sequencer, D SEQ, selects the data from IN BUF according to the index of the broadcast non-zero weight, so that the weight is multiplied with the corresponding data. Each PE, which is composed of a multiplier, an adder, and an accumulation register, computes the MACC and stores the accumulation result of each kernel in the output buffer, OUT BUF, after the result of one channel has been accumulated. The operations of D SEQ and the PEs are pipelined in order to compute one MACC in every clock cycle. In addition, both IN BUF and OUT BUF are implemented as ping-pong buffers in order to hide data transfer time and enable pipelined processing.
Currently, the proposed convolution core is implemented on an Altera FPGA. The arithmetic precision is 16 bits for multiplication and 32 bits for accumulation. Two multiplications are mapped onto the same DSP block. In other words, two PEs are implemented with one DSP block, two 32-bit adders, and two registers for accumulation. Note that the proposed architecture can also be implemented on other FPGAs, such as Xilinx FPGAs.
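The role of D SEQ inside one PE group can be illustrated with a small Python sketch. It is a hedged illustration rather than the actual circuit: it assumes stride one, a row-major serialization of the K × K kernel channel (the real serialization order may differ), and a hypothetical function name `pe_group_macc`.

```python
def pe_group_macc(in_buf, weight, index, K, acc):
    """Apply one broadcast (weight, index) pair to the N PEs of a group.

    in_buf: K rows of (N + K - 1) input pixels for the assigned coordinate
    index:  position of the weight inside the serialized K x K channel
    acc:    N partial sums held in the PEs (output-stationary), updated in place
    """
    # Assumption: the kernel channel is serialized in row-major order.
    r, c = divmod(index, K)
    for n in range(len(acc)):
        # PE n computes the output at column x + n, so it reads column n + c.
        acc[n] += weight * in_buf[r][n + c]
    return acc

if __name__ == "__main__":
    K, N = 3, 2
    in_buf = [[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]]          # K rows of N + K - 1 pixels
    acc = [0] * N
    for weight, index in [(2, 1), (-1, 8)]:  # the only non-zero weights
        pe_group_macc(in_buf, weight, index, K, acc)
    print(acc)
```

Each broadcast pair costs one step for the whole group, and zero-valued weights never appear in the stream, so they cost no cycles.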
The architecture was implemented with 16 BCUs and 16 PE

Algorithm 1:
  for t from 1 to C_i do            // Loop over all input channels
    for u from 1 to P do            // Loop over all degrees of parallelism
      for F_o(x, y) ∈ T_s do        // Loop over all outputs in the tile
        // Loop over all non-zero weights in the kernel
B. Flow of the Sparse Convolution
The parallelism scheme and the degree of parallelism of a layer are determined in advance based on the size of the output feature maps, the number of BCUs, and the number of PEs. The parallelism scheme is determined as intra-output parallelism with a degree of parallelism, P, of 1 when the size of the output feature maps is equal to or larger than the number of PEs. Otherwise, the parallelism scheme is determined as multi-parallelism with P determined as follows:
where B is the number of BCUs, D is the smallest number of PE banks whose PEs number at least the size of one output feature map, X and Y are the width and height of the output feature maps, N is the number of PEs in one PE group, and G is the number of PE groups in one PE bank. The degree of parallelism, P, expresses the inter-output parallelism: P different kernels are convolved with P duplications of the input data at the same time on different PEs. Intra-output parallelism is realized by convolving one copy of all the input data with one kernel to produce all pixels of an output feature map. The convolution results of each layer are computed as shown in Algorithm 1. First, the output feature maps are divided into T equal tiles, where
and M is the number of PE banks. Then, the algorithm loops through all input channels, C_i, in the second loop in order to maximally reuse the input feature maps. The proposed architecture unrolls the third and fourth loops. Unrolling the third loop parallelizes the convolution of P different kernels as an implementation of inter-output parallelism. In other words, P different output feature maps are computed simultaneously on PEs in P different PE banks. Each PE is assigned to compute C_o/P kernels, where C_o is the number of kernels or output channels.
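The loop structure described above can be written as a sequential Python reference model. This is a sketch under stated assumptions, not the authors' algorithm: the placement of the loop over the C_o/P kernels of each group, the row-major weight index, stride one, no padding, and the name `sparse_conv_layer` are all hypothetical choices made for illustration, and the two loops that the hardware unrolls are simply executed in sequence here.

```python
def sparse_conv_layer(inputs, kernels, K, P, tiles):
    """Output-stationary sparse convolution (stride 1, no padding).

    inputs:  inputs[t][row][col], one 2-D map per input channel t
    kernels: kernels[o][t] is a list of (weight, index) non-zero pairs for
             output channel o and input channel t
    tiles:   list of tiles, each a list of (x, y) output coordinates
    """
    C_i, C_o = len(inputs), len(kernels)
    per_group = C_o // P          # kernels handled by each of the P groups
    out = {}                      # (o, x, y) -> accumulated output value
    for tile in tiles:                           # loop 1: tiles
        for t in range(C_i):                     # loop 2: input channels
            for u in range(P):                   # loop 3: unrolled in hardware
                for (x, y) in tile:              # loop 4: unrolled in hardware
                    for j in range(per_group):   # assumed position of this loop
                        o = u * per_group + j
                        for w, idx in kernels[o][t]:  # non-zero weights only
                            r, c = divmod(idx, K)     # assumed row-major index
                            key = (o, x, y)
                            out[key] = out.get(key, 0) + w * inputs[t][y + r][x + c]
    return out

if __name__ == "__main__":
    inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
    kernels = [[[(1, 0), (2, 3)], []],      # output channel 0
               [[(-1, 1)], [(3, 2)]]]       # output channel 1
    tiles = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
    print(sparse_conv_layer(inputs, kernels, K=2, P=2, tiles=tiles))
```

Because the accumulation key is the output pixel, every partial sum stays with the PE that owns that pixel, which is the output-stationary property the architecture relies on.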
IV. Evaluation Methodology and Results
The parallelism-flexible convolution core was implemented in Verilog HDL and synthesized for the Arria10GX115 FPGA using Quartus. The performance was evaluated by RTL simulation.
The performance was evaluated on layers conv1_1 and conv5_1 of VGG-16 using a compressed sparse model with 16-bit weight elements and 4-bit indices. These layers were selected because they have different dominant parallelism. The output feature map size of conv1_1, the first convolutional layer, is as large as 224 × 224, while the number of kernels is only 64. In contrast, conv5_1, which is located deeper in the VGG network, contains as many as 512 kernels, while the output feature map size is as small as 14 × 14 due to the preceding pooling layers. Hence, the dominant parallelism of the conv1_1 and conv5_1 layers is intra-output and inter-output parallelism, respectively. The compressed sparse models of both layers were randomly generated according to the sparsity reported by the Deep Compression study [7].
A. Performance Evaluation
The performance was compared with the baseline reconfigurable processor array, which is composed of a broadcast buffer, an input buffer, an output buffer, and an array of PEs [1]. The number of execution cycles of the baseline architecture was derived from the equation:
where K is the kernel size, C_i and C_o are the numbers of input and output channels, X and Y are the numbers of columns and rows of the output feature maps, and P is the number of PEs in one row and column of the processor array. We estimated the performance of a baseline architecture that contains a 32 × 32 processor array to match the number of PEs in our architecture. Fig. 4 illustrates the execution cycles for computing layers conv1_1 and conv5_1 of VGG-16. It shows the execution cycles of the baseline architecture and of the proposed architecture exploiting flexible parallelism, sparsity, and both techniques.
In the computation of conv5_1, the proposed convolution core, denoted by Proposed (Flexible), achieved a 3.9 times speed-up over the baseline architecture by exploiting the multi-parallelism of the convolutional layer. Since the baseline architecture employs a specific dataflow, its parallelism is limited to intra-output parallelism. Therefore, its performance is limited by the small output size of the conv5_1 layer, because the architecture can occupy only 19% of the available PEs. On the other hand, the proposed convolution core is capable of exploiting multi-parallelism, that is, both intra-output and inter-output parallelism, using the parallelism controller together with multiple BCUs, and is able to occupy up to 76% of the available PEs. Hence, the PE utilization of the proposed architecture is increased by exploiting both intra- and inter-output parallelism.
Furthermore, the proposed architecture can exploit weight sparsity efficiently using the compressed sparse model in CSC format and the output-stationary dataflow. The results show that the proposed architecture, denoted by Proposed (Sparse), accelerates the convolution by 1.7 times and 2.8 times compared to the baseline architecture for the conv1_1 and conv5_1 layers, respectively. The reason is that the proposed architecture decodes the compressed sparse model into non-zero weight elements and their indices with the index accumulator within the BCUs, so that it does not perform any zero-weight multiplication. On the other hand, the baseline architecture is incapable of skipping those operations. In computing conv1_1 with 42% weight sparsity, the execution cycles of the proposed architecture are reduced by 42% compared to the baseline architecture. Likewise, the execution cycles are reduced by 64% in the computation of the conv5_1 layer.
By combining both techniques, the proposed architecture, denoted by Proposed (Both), achieved 1.7 and 5.2 times speed-up for conv1_1 and conv5_1, respectively. The number of execution cycles of conv5_1 is estimated from the average execution cycles of the first 64 input channels. The speed-up of the conv1_1 layer comes only from the exploitation of sparsity, because its dominant parallelism is intra-output, so the flexibility of the proposed architecture is not utilized. The speed-up of the conv5_1 layer comes from both the parallelism flexibility and the sparsity, because the number of kernels is large and the number of pixels of one output feature map is much smaller than the number of PEs, which means that inter-output parallelism dominates. However, the combined speed-up is only 5.2 times, because dividing the sparse kernels equally into P parallel groups yields an imbalanced workload, which gives rise to PE idle cycles.
The architecture suffers from two limitations. First, PE utilization decreases when the width of the output feature map, X in Fig. 1, is not divisible by the number of PEs in a PE group or a PE bank. That is because the input buffer can store input data for only one output row, and the one-to-one connection between a BCU and a PE bank means that only one kernel is broadcast to a bank at a time. Second, the number of BCUs limits the degree of inter-output parallelism available on the architecture. When either the number of PEs increases or the output size becomes very small, the maximum degree of inter-output parallelism may exceed the number of BCUs.
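The occupancy figures quoted above follow from simple arithmetic, sketched below in Python. The 1,024 PEs and the 14 × 14 output size come from the text; the degree of parallelism P = 4 for this layer is an assumption inferred from those numbers, and the helper name `pe_occupancy` is hypothetical.

```python
def pe_occupancy(X, Y, P, total_pes):
    """Fraction of PEs that compute a useful output pixel
    when P kernels are processed in parallel."""
    return min(1.0, P * X * Y / total_pes)

baseline = pe_occupancy(14, 14, 1, 1024)   # intra-output parallelism only
proposed = pe_occupancy(14, 14, 4, 1024)   # assumes P = 4 for this layer
print(f"baseline {baseline:.1%}, proposed {proposed:.1%}")
```

The ratio of the two occupancies is 4, which is in line with the measured 3.9 times speed-up.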
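The relation between weight sparsity and the reported speed-ups can be checked with a one-line model, shown here in Python. It is an idealized sketch that assumes every zero-weight MACC is skipped at no extra cost and that nothing else in the execution changes; the function name `ideal_sparse_speedup` is hypothetical.

```python
def ideal_sparse_speedup(sparsity):
    """Speed-up when a fraction `sparsity` of the MACC operations is skipped."""
    return 1.0 / (1.0 - sparsity)

for layer, sparsity in (("conv1_1", 0.42), ("conv5_1", 0.64)):
    print(layer, round(ideal_sparse_speedup(sparsity), 1))
```

Removing 42% and 64% of the cycles corresponds to about 1.7 and 2.8 times, matching the Proposed (Sparse) results.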
B. Area Evaluation
The resource utilization of the core that employs 1,024 PEs is shown in Table I. For better performance, the proposed architecture can be scaled up to 1,536 PEs on the same FPGA. Compared to the baseline processor array, the proposed architecture requires an additional 9%, 2%, 3%, and 3% of LUTs, registers, DSPs, and M20K block RAM (BRAM), respectively, for leveraging both the multi-parallelism and the sparsity of sparse CNNs using the parallelism controller (Parallelism Cntl in Table I) and the broadcaster (Broadcaster in Table I). The proposed convolution core can operate at 270 MHz, compared to 200 MHz for the baseline architecture.
The results show that the bottleneck of the architecture is BRAM, whose consumption reaches 65% and 80% for the core with 1,024 and 1,536 PEs, respectively. The BRAM is used for two purposes: 104 blocks as memory for storing the compressed sparse model of each layer (52 blocks each for the weight and index memories) and 1,664 blocks as output buffers (26 blocks for each PE group). The main cause of the large BRAM usage is that the design requires high-bit-width memory. The memory for storing the model, which is located in the broadcaster, consumes a large number of BRAMs and requires high bandwidth in order to support the maximum degree of inter-output parallelism allowed by the number of BCUs. The number of BRAMs can be reduced by decreasing the number of BCUs. Likewise, the output buffer requires high-bandwidth memory because the bit width of the intermediate accumulation results is as high as 32 bits.
V. Conclusion
This paper proposes a parallelism-flexible convolution core for sparse CNNs to resolve two problems: PE underutilization due to an inflexible dataflow, and redundant cycles spent on computing zero-valued weights of the sparse CNN. The proposed architecture is capable of operating with intra-output and inter-output parallelism, and of skipping all of the computation related to zero-valued weights. As a result, a maximum speed-up of 5.2 times is achieved. However, the proposed architecture has several weaknesses and limitations. The main problem to be resolved is the imbalanced workload that leads to PE idle cycles. This remains as future work.
