Abstract. Deploying computationally and memory intensive state-of-the-art deep neural networks (DNNs) on embedded systems with limited hardware resources and power budgets is a challenging task. Recently developed techniques like Deep Compression make it possible to fit large DNNs, such as AlexNet and VGGNet, entirely in on-chip SRAM. But sparse networks compressed with existing encoding formats, like CSR or CSC, complicate the computation at runtime due to their irregular memory access characteristics. In [1], we introduced a computation dataflow, the stacked filters stationary (SFS) dataflow, and a corresponding data encoding format, the relative indexed compressed sparse filter (CSF) format, to make full use of data sparsity and to simplify data handling at execution time. In this paper we present FPGA implementations of these methods. We implement several compact streaming fully connected (FC) and convolutional (CONV) neural network processors to show their efficiency. Compared with the state-of-the-art results [2, 3, 4], our methods achieve at least a 2× improvement in computation efficiency per PE on most layers. In particular, our methods achieve an 8× improvement on AlexNet layer CONV4 with 384 filters, and an 11× improvement on VGG16 layer CONV5-3 with 512 filters.
INTRODUCTION
In recent years, DNN technology has made breakthroughs in many areas such as motion detection, object detection, image classification and recognition, image semantic understanding, natural language processing, and translation. Many groundbreaking neural networks have been proposed, such as AlexNet [11], VGG [12], GoogleNet [13], ResNet [14], R-FCN [15], and Deformable-ConvNets [16]. However, these networks contain millions of parameters and tens to hundreds of convolution layers, and require billions of arithmetic operations. They also produce large amounts of intermediate data and require frequent data transfers between processing units and memory. These heavy computation and memory demands have hindered their wide adoption in embedded devices. Various efforts have been made to address this issue, such as ShiftCNN [8], Ristretto [9], Eyeriss [4], Deep Compression [2] and EIE [3]. Among these approaches, Deep Compression is a promising method for embedded applications.
In our previous work [1], we pointed out several remaining problems with these approaches. The first is that manipulating compressed sparse data needs considerable extra logic and consumes extra clock cycles. Eyeriss [4] uses a network on chip (NoC) to handle sparsity by performing data reads and MACs only on nonzero values; DVAS [5] and ENVISION [6] use input guard memories and guard control units to handle data sparsity. Several existing sparse matrix encoding formats, such as CSC, CSR and CISR [10], complicate the computation at runtime due to their irregular memory access characteristics, which results in inefficient parallelization and larger chip area. The second is that, for deeply compressed sparse networks, the PE array utilization rate of recently proposed hardware acceleration designs, such as Eyeriss [4], DVAS [5], ENVISION [6], and DNPU [7], is fairly low. In [1], we proposed the stacked filters stationary (SFS) dataflow, the relative indexed compressed sparse filter (CSF) format, and a three dimensional Single Instruction Multiple Data (3D-SIMD) processor architecture to address these problems. With these methods, sparse data can be handled easily during execution without complex transformations, lookups, or computation. In this paper, we implement several compact streaming fully connected (FC) and convolutional (CONV) neural network processors to show their efficiency. For convenience, we reproduce equation (4) of [1] as equation (1) here.
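In the notation defined below, equation (1) is the convolution whose M filters are processed in M′ batches of m stacked filters; the indexing written here is a reconstruction consistent with these definitions, and the exact form appears in [1]:

$$V_o(t,f,x,y)=\sum_{c=0}^{C-1}\sum_{i=0}^{K-1}\sum_{j=0}^{K-1} W_f(t\,m+f,\,c,\,i,\,j)\;V_i(c,\,Sx+i,\,Sy+j), \quad (1)$$

$$0 \le t < M',\quad 0 \le f < m,\quad 0 \le x < W',\quad 0 \le y < H'.$$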
Vo, Vi and Wf are the matrices of output feature maps, input feature maps and filters, respectively. S is the stride size, C the channel number, K the filter kernel size, M the total filter number, M′ the number of filter batches, m the batch size (filters per batch), W and H the input feature width and height, and W′ and H′ the output feature width and height.
COMPUTATION FLOW OPTIMIZATION AND PARALLELIZATION

FC Layer
For equation (1), one channel of feature data is convolved with the same channel of m filters in parallel. Since the output feature count of FC layers is usually at most 4096, all the output feature data can be buffered in the on-chip local registers. Grouping the filters is not needed, so to take full advantage of CSF and SFS, the batch number M′ can be set to 1, stacking all M filters in a single batch.
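As a software illustration, the following is a minimal sketch of an SFS-style FC layer using a CSF-like relative-index weight encoding. The exact bit-level format of CSF in [1] is not reproduced here; it is assumed that each nonzero weight carries the zero-run distance from the previous nonzero weight in the same filter, and the function names are illustrative:

```python
import numpy as np

def encode_csf(dense_filter):
    """Encode one dense filter row into (rel_idx, weight) pairs,
    where rel_idx is the zero-run length before each nonzero weight."""
    pairs, last = [], -1
    for i, w in enumerate(dense_filter):
        if w != 0:
            pairs.append((i - last - 1, w))
            last = i
    return pairs

def fc_sfs(in_feat, filters_csf):
    """in_feat: dense 1-D input feature vector.
    filters_csf: one (rel_idx, weight) list per filter (output neuron)."""
    # The output accumulators stay stationary; this array stands in for the
    # on-chip local output registers. The filter loop is serialized here,
    # while the hardware processes the stacked filters in parallel.
    out = np.zeros(len(filters_csf), dtype=np.float32)
    for f, pairs in enumerate(filters_csf):
        pos = -1                       # absolute input index, decoded on the fly
        for rel_idx, w in pairs:
            pos += rel_idx + 1         # relative index -> absolute index
            out[f] += w * in_feat[pos]
    return out
```

For a dense weight matrix W, `fc_sfs(x, [encode_csf(row) for row in W])` matches `W @ x` while touching only the nonzero weights.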
CONV Layer
For DNNs with large output feature dimensions, like VGG16, the output feature size of a CONV layer can be up to 12MB if 32 bit floating point numbers are used (for VGG16's first layer, 224 × 224 × 64 values × 4 bytes ≈ 12.8 MB). It is hard to buffer all the intermediate computing results on-chip, which can be addressed with two techniques: filter grouping and feature (image) division. For designs using the CSF data encoding format and the SFS computing flow, filter grouping may cause performance degradation. Using feature division, as shown in figure 2, only one portion of the input feature is processed at a time; the output buffer needed in figure 2 is only 1/4 of the original size. The advantage of these two techniques is that the on-chip buffer size is greatly reduced, so the buffer can be implemented with registers instead of RAM, which increases the processing bandwidth and the degree of parallelism. The disadvantage is that they greatly increase the amount of filter weight data or feature data loaded in each inference operation. Tables 1 and 2 compare the two methods: compared with filter grouping, feature division needs only half the output feature buffer size, and its total number of loaded data grows less. Moreover, feature division does not lose computation efficiency, and filters are usually compressed, so feature division is recommended for handling large CONV layers. For feature division, features are divided according to the output feature dimensions. If the output feature height and width of a division are Hdo and Wdo, equation (2) gives the input feature height and width (Hdi and Wdi) of that division.
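Assuming unit dilation and no padding across the division boundary, the relation follows directly from the convolution geometry:

$$H_{di} = (H_{do} - 1)\,S + K, \qquad W_{di} = (W_{do} - 1)\,S + K. \quad (2)$$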
The CONV layer parallel computing pseudo code can be rewritten as the code shown in figure 3. The computation flow of a single 3D-SIMD instruction is illustrated in figure 4: for each input channel, the input feature data and the filter weight data of that channel are buffered before computation, as sketched below.
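The following is a hedged software sketch of this per-channel flow; the loop structure and names are illustrative and do not reproduce the exact pseudo code of figure 3. It processes one batch of m stacked filters, buffers one channel at a time, and keeps the output accumulators stationary:

```python
import numpy as np

def conv_sfs(in_feat, filters, stride):
    """in_feat:  (C, H, W) input feature map.
    filters: (m, C, K, K) one batch of m stacked filters.
    Returns (m, H', W') output feature maps (no padding)."""
    C, H, W = in_feat.shape
    m, _, K, _ = filters.shape
    Ho = (H - K) // stride + 1
    Wo = (W - K) // stride + 1
    out = np.zeros((m, Ho, Wo), dtype=np.float32)  # stationary accumulators
    for c in range(C):                 # one input channel buffered at a time
        feat_c = in_feat[c]            # channel buffer (global buffer)
        filt_c = filters[:, c]         # same channel of all m stacked filters
        for y in range(Ho):
            for x in range(Wo):
                patch = feat_c[y*stride:y*stride+K, x*stride:x*stride+K]
                # accumulate the channel's partial sums for all m filters
                out[:, y, x] += np.einsum('kl,fkl->f', patch, filt_c)
    return out
```

Here the vectorized einsum stands in for the PE array, which applies the m stacked filters in parallel.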
HARDWARE IMPLEMENTATION
Figure 5 shows the 3D-SIMD processor architecture, with local output registers holding the stationary accumulators. To lower computation complexity, techniques described in ShiftCNN [8] are used to simplify floating point multiplication, so the PEs in this paper perform only 32 bit floating point shifts and additions. For FC layers, feature values, pointers, filter weight values and indices can all be streamed in from external RAM; there is no need to buffer them internally as long as the data load speed matches the PE processing speed. For CONV layers, before a channel is processed, its input feature data and filter data are first buffered in the on-chip global buffer. All the designs are implemented on a Xilinx ZCU102 evaluation kit, which features a Zynq MPSoC device with ARM processors and programmable logic fabric. Figure 6 shows the resources used by a simple high speed streaming FC layer processor for LeNet, figure 7 the resources used by a CONV layer processor for AlexNet, and figure 8 the resources used by the CONV layer processor for VGG16.
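As an illustration of the ShiftCNN-style simplification, the sketch below quantizes a weight to a single signed power of two so that multiplication reduces to an exponent adjustment (a shift) plus sign handling; ShiftCNN [8] itself uses a richer codebook built from sums of such terms, which is omitted here:

```python
import math

def quantize_shift(w):
    """Quantize w to sign * 2**k, the nearest power of two in the
    log domain (zero stays zero)."""
    if w == 0.0:
        return 0, 0
    sign = 1 if w > 0 else -1
    k = round(math.log2(abs(w)))   # exponent of the chosen power of two
    return sign, k

def shift_mul(x, sign, k):
    """Multiply x by sign * 2**k without a general multiplier.
    math.ldexp adds k to the floating point exponent, which is what a
    shift-based floating point PE implements."""
    if sign == 0:
        return 0.0
    y = math.ldexp(x, k)           # exponent adjustment = shift
    return y if sign > 0 else -y
```

For example, a weight of 0.24 quantizes to (+1, −2), and `shift_mul(3.0, 1, -2)` returns 0.75, i.e. 3.0 × 0.25.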
RESULTS
The LeNet implementation is tested on the MNIST data set [17]. Using 8 PEs, this simple design processes the data set at 70 fps (Table 3). The AlexNet and VGG16 implementations are tested on the ImageNet 2012 data set [18].
For the AlexNet CONV processor, the first CONV layer of AlexNet is divided into 4 × 4 divisions. For the VGG16 CONV processor, the first CONV layer of VGG16 is divided into 16 × 16 divisions. Tables 4 and 5 show their performance.
CONCLUSION
In this paper we present FPGA implementations of the proposed 3D-SIMD processor architecture. We implement several compact streaming FC and convolutional neural network processors to show their efficiency. Compared with the state-of-the-art results [2, 3, 4], our methods achieve at least a 2× improvement in computation efficiency per PE on most layers. In particular, our methods achieve an 8× improvement in computation efficiency per PE on AlexNet layer CONV4 with 384 filters, and an 11× improvement on VGG16 layer CONV5-3 with 512 filters.
For a future ASIC implementation of the 3D-SIMD processor architecture, approximating the networks with 16 bit fixed point numbers should be considered to lower the computation complexity. All the experiments in this paper are done with 32 bit floating point numbers, which consume 11 clock cycles for a single addition. To get the best performance out of CONV and FC layers, care should be taken in handling the filter parameters: the best performance is achieved when the maximum number of filter parameters that can be loaded in one clock cycle matches the number of PEs used in the implementation.
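A minimal sketch of such a fixed point approximation is given below, assuming a Q1.15-style format with 15 fractional bits; the format choice and helper names are illustrative assumptions, not design decisions from this paper:

```python
import numpy as np

FRAC_BITS = 15  # Q1.15: 1 sign bit, 15 fractional bits

def to_fixed(x):
    """Round a float array to 16 bit fixed point with FRAC_BITS fraction bits."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << FRAC_BITS))
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def fixed_mac(acc, a_fx, w_fx):
    """One multiply-accumulate in fixed point; acc is a 32 bit accumulator.
    The 16x16 product fits in 32 bits and is rescaled back by a shift."""
    return acc + (np.int32(a_fx) * np.int32(w_fx) >> FRAC_BITS)
```

Replacing the multi-cycle floating point adder with a single-cycle integer adder is what removes the 11-cycle addition cost noted above.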
