ABSTRACT Deep convolutional neural networks (DCNNs) achieve state-of-the-art results in current computer vision tasks and are now widely used in many scenarios. In the age of artificial intelligence, it is also important to develop DCNNs on supercomputers. The Matrix2000, a general-purpose accelerator deployed in a supercomputer that ranks fourth in the latest Top500 list, is a vector single instruction multiple data (vector-SIMD) digital signal processor (DSP). Research on DSP-based DCNN implementation is therefore important and meaningful, yet few systematic studies have been published so far. Qualcomm, CEVA, and Cadence have said that they have implemented DCNN on their own DSPs. However, Qualcomm and CEVA have not published how they did so, and Cadence has only implemented a convolution layer on Vision P6. In this paper, we propose a vectorization mapping method and a highly efficient partition analysis model for implementing DCNN on vector-SIMD DSP. Based on the vectorization mapping method and analysis model, we implemented all layers of typical DCNN models on Matrix2000 and tested the computation and energy efficiency. The experiments demonstrate that the average computation efficiency of this work on Matrix2000 is 20∼35% higher than GPU, 35∼45% higher than Xeon Phi, about 8% higher than Vision P6 DSP, and about 62∼75% higher than an existing evaluation of DCNN based on Matrix2000. The average energy efficiency is about 9∼30% higher than GPU and about 56% higher than the existing evaluation of DCNN based on Matrix2000. The results show that a vector-SIMD DSP with a suitable programming mapping method is also a suitable platform in the age of artificial intelligence.
I. INTRODUCTION
In recent years, deep convolutional neural network (DCNN) algorithms have been widely adopted for computer vision tasks including image classification, object detection and semantic segmentation [1]-[8]. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, AlexNet [9] achieved the best results, making DCNN the core algorithm model for image classification. Better DCNN models then emerged, such as Inception-v1 [10] and ResNet18 [11]. The accuracy improvement of DCNN comes at the cost of huge computational complexity, due to the large number of convolutional operations on feature maps [12], [13]. DCNN algorithms are now widely used on mobile devices [14], large-scale workstations, self-driving cars, data centers and supercomputer centers [15].
The associate editor coordinating the review of this manuscript and approving it for publication was Huanqing Wang.
For usage scenarios that only require acceleration of deep neural network algorithms, dedicated-purpose hardware accelerators such as ASICs (e.g., TPU [16], Eyeriss [17]) or FPGAs [18]-[21] are widely used. However, scenarios such as supercomputers and workstations need to speed up more than deep neural network algorithms, so general-purpose hardware accelerators should be employed for their ability to accelerate multiple types of algorithms. Although the architecture of a general-purpose hardware accelerator is fixed, different kinds of algorithms can be implemented through programming. For example, the Intel Xeon Phi many-core processor, the digital signal processor (DSP) and the graphics processing unit (GPU) can all be programmed to accelerate fast Fourier transformation (FFT), general matrix multiplication (GEMM) and deep neural network algorithms.
Once the hardware architecture is fixed, the programming method determines the actual hardware performance. For example, compute unified device architecture (CUDA) deep neural network (CuDNN) programming is adopted to accelerate DCNN algorithms on NVIDIA GPU [22]. Several versions of CuDNN have been released: CuDNN v1 lowers the convolutions into a matrix multiplication [23], and CuDNN v4 adopts Winograd acceleration for small convolution kernels [24]. The latest version is CuDNN v7.
Generally, as one of the most important embedded general-purpose accelerators, the DSP offers high computation and energy efficiency when accelerating multiple types of algorithms. We consider DCNN algorithms very suitable for DSP processing, because a DCNN mainly contains convolutional layers and fully connected layers [1], which are collections of large numbers of multiply-add operations. However, few systematic studies have been published so far on the implementation of DCNN based on DSP. Qualcomm, CEVA, and Cadence said they have implemented DCNN on their own DSPs. However, Qualcomm and CEVA have not published how they did so, and Cadence has only implemented a convolution layer on Vision P6.
In this paper, we study the DCNN programming method on the vector single instruction multiple data (SIMD) DSP Matrix2000, which is a new version of the FT-Matrix series chip [25]. Matrix2000 is employed in a supercomputer that ranks fourth in the latest Top500 list. Matrix2000 has 12 DSP cores, and its throughput is 2.4 trillion floating-point operations per second (TFLOPS) at 1 GHz. This paper makes the following contributions.
1) We implemented all layers of DCNN models on a vector-SIMD DSP with our vectorization mapping method, which incurs almost no bandwidth waste or unnecessary clock cycles.
2) We proposed an efficient analysis model for DCNN, which splits a complex DCNN layer into small blocks suitable for efficient processing on a vector-SIMD DSP.
3) We are the first to systematically summarize the programming methods for DCNN implementation on vector-SIMD DSP. Our vectorization method and analysis model can be easily applied to other vector-SIMD DSPs.
Finally, based on our vectorization mapping method and analysis model, we implemented all layers of some typical DCNN models and tested the computation and energy efficiency on the vector-SIMD DSP Matrix2000. We also tested these DCNN models on GPU with the latest CuDNN v7. Experimental results show that, among general-purpose accelerators, the average computation efficiency of this work on Matrix2000 is higher than that of GPU with CuDNN programming, the Intel Xeon Phi many-core processor, the Vision P6 DSP and an existing evaluation of DCNN based on Matrix2000. The average energy efficiency is also higher than that of GPU and the existing evaluation of DCNN based on Matrix2000.
The rest of this article is organized as follows. Section II presents the backgrounds of DCNN, Matrix2000 and GEMM. Section III proposes the vectorization mapping method and the basic analysis model for DCNN on vector-SIMD DSP. Section IV optimizes the partition and analysis model for multiple input images and multiple DSP cores. Section V evaluates and discusses the performance of the proposed system. Section VI concludes this paper.
II. BACKGROUNDS
A. DCNN BASICS
The DCNN algorithm is an important algorithm in the field of machine learning [26]-[29]. Since AlexNet's great success in classification tasks in ILSVRC2012, a growing number of higher-performance DCNN models have appeared for multiple vision and prediction tasks [30], [31]. These DCNN models take different forms. For example, GoogLeNet adopts the Inception module and ResNet adopts a residual structure [10], [11]. However, the convolutional layer is the core and foundation of these algorithms [1]. A typical DCNN is shown in Figure 1. The main layers of DCNN are as follows.
1) Convolutional Layer: The convolutional layer is the most important layer in DCNN, and its calculation accounts for about 95% of the DCNN's total calculations. Therefore, it is very important to study how to implement the convolutional layer on a general-purpose hardware accelerator.
2) Pooling Layer: The pooling layer is generally performed after the convolutional layer, and max pooling is widely used in DCNN. The pooling layer helps reduce the size of the feature map and highlights features.
3) Activation Layer: The activation layer mainly activates the feature map. Rectified linear unit (ReLU) activation is commonly used, which sets any input value less than zero to zero. The activation layer and the pooling layer are implemented by comparison operators.
4) Fully Connected (FC) Layer and Normalization Layer: There are also FC layers, normalization layers and other layers in DCNN. The FC layer connects every output neuron to the previous layer's neurons, and it is transmission-intensive. The normalization layer smooths the feature map of the previous layer.
B. THE ARCHITECTURE OF MATRIX2000
Matrix2000 is a new version of the FT-Matrix series chip and is employed in a supercomputer that ranks fourth in the latest Top500 list. Matrix2000 has 12 DSP cores, each with the same architecture. The cores are connected through a ring interconnection, and multiple DSP cores are connected to external double data rate (DDR) synchronous dynamic random access memories (SDRAM), as shown in Figure 2. As shown in Figure 3, each DSP core in Matrix2000 mainly includes three portions: a vector portion, a scalar portion and a direct memory access (DMA) unit. The vector portion processes vector data and consists of a vector processing unit (VPU), a set of vector registers (VR) and a vector memory (VM, 768KB). Similarly, the scalar portion processes scalar data and includes a scalar processing unit (SPU) and a scalar memory (SM, 128KB). The DMA is used for data transmission between SM, VM and DDR. The VPU in a single DSP core contains 16 processing elements (PEs), each with six 32-bit arithmetic units, which can perform floating-point and fixed-point operations such as multiply-add, comparison and shift. Matrix2000 also has other instructions, including gather-scatter and shuffle. Because our DCNN implementation does not use gather-scatter or shuffle for data reorganization, we do not describe these features here.
The architecture of the DSP core, with the SPU and VPU executing in parallel, improves overall performance by increasing the utilization of computing resources. The local VM, with multiple banks corresponding to each PE, provides a large memory bandwidth [25]. Each memory bank has a ping-pong structure that facilitates simultaneous VPU and DMA access. This structure is well suited for algorithms that can be vectorized and have large amounts of data reuse, and DCNN is just such an algorithm.
C. GEMM
Lowering the convolution to a matrix multiplication is the approach used on GPU, where GEMM is optimized. One optimized GEMM example is shown in [32]. This method of accelerating convolution requires some data reorganization: the weight filters are reshaped into a matrix, and a data matrix is gathered by duplicating the original input data, which needs a high memory bandwidth [23]. A simple convolution lowered to a matrix multiplication is shown in Figure 4. GEMM can be used directly for the FC layer on GPU, and in this paper we also implemented the FC layer using GEMM on the vector-SIMD DSP.
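The lowering described above can be illustrated with a minimal pure-Python sketch. It assumes a single-channel input, stride 1 and no padding, and the names `im2col` and `conv_as_gemm` are ours for illustration, not functions from CuDNN or from this paper. It shows why the data matrix needs extra bandwidth: overlapping windows are copied, so a 3×3 input becomes a 4×4 data matrix.

```python
def im2col(image, t):
    """Gather every t-by-t window of a 2-D image into one row of a data matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - t + 1):
        for x in range(w - t + 1):
            rows.append([image[y + dy][x + dx] for dy in range(t) for dx in range(t)])
    return rows


def conv_as_gemm(image, kernels):
    """Lower a stride-1, no-padding convolution to a matrix multiplication."""
    t = len(kernels[0])
    data = im2col(image, t)  # duplicated input data
    # Reshape each t-by-t filter into one flat weight row.
    weights = [[k[dy][dx] for dy in range(t) for dx in range(t)] for k in kernels]
    # Plain matrix product: one output row per window, one column per filter.
    return [[sum(d * w for d, w in zip(row, wrow)) for wrow in weights] for row in data]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],
           [[0, 1], [1, 0]]]
data_matrix = im2col(image, 2)
output = conv_as_gemm(image, kernels)
duplicated_values = sum(len(row) for row in data_matrix)  # 16 values from 9 inputs
```

Here 9 input values expand to 16 entries in the data matrix, and the expansion grows with kernel size and overlap.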
III. RESEARCH OF VECTORIZATION MAPPING AND PARTITION

A. CONVOLUTIONAL LAYERS
Since the convolution layer accounts for the largest proportion of total DCNN computations, research on the implementation of convolution layers is very important. When mapping convolutional operations onto the DSP core, we applied a vector-centric methodology to maximize the utilization of vector computing resources, and we proposed a partition and analysis model to minimize the amount of data transfer between DDR and DSP core, thereby minimizing the total time and energy consumption.
1) VECTORIZATION MAPPING OF CONVOLUTION LAYER AND CALCULATION
Based on DSP, we mainly study the implementation of convolutional layer through vectorization calculation. Our approach is to lower the multiple sets of convolutions into multiple sets of vector multiply-add, which is fundamentally different from the approach of lowering the convolutions into a matrix multiplication in CuDNN.
As shown in Figure 5, the input feature map is InF[H][W][C], and the weight is Weight[C][T][T][K][N]. The weight can also be viewed in vector form as vWeight[C][T][T][K]. Bias is not listed in Figure 5. To conduct vector multiplication efficiently, the following steps are followed:
1) A vector is formed with a length equal to the number of arithmetic units (N) and loaded into VR. The vector contains N identical input feature points.
2) The number of convolutional output channels in a DCNN is often a multiple of the number of arithmetic units (N) in the VPU of a DSP core. In view of this, we treat the data at the same position of N groups of weights (corresponding to output channels) as a vector, and there are multiple sets of such vectors (K).
3) The vector in 1) is multiplied with the corresponding K vectors in 2).
4) The intermediate result is temporarily stored in a vector register, and the vector multiply-add operation is repeated until the final output vectors are calculated.
With the vectorization mapping method, we can maximize the use of vector computing resources and hide the transmission time of input feature maps. For example, the parameter values marked in Figure 6 are those of the first layer in AlexNet. For ease of understanding, assume the VPU in a DSP core has 16 arithmetic units, so that N is 16 and K is 6. Therefore, each input feature point is loaded once and multiplied with six groups of vectors. Input feature points are generally transmitted from DDR to VR by the scalar components, while the vector multiply-add operations are performed by the vector components. With one input feature point transmitted for every six multiply-add operations, the scalar and vector components can work fully in parallel, so that computation proceeds almost all the time, apart from some data transmission time.
The pseudo code is shown in Figure 7. In the pseudo code, vIF is reused six times, which greatly reduces the workload of the scalar components. Since the vector memory and scalar memory in the DSP are both ping-pong structures, data transmission can be further optimized, and loop unrolling can further improve performance. As these optimization methods are routine, they are not discussed in detail here.
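As an illustration of the four mapping steps, the following pure-Python sketch simulates the lane-wise behaviour on ordinary lists. It is not the Figure 7 pseudo code and not Matrix2000 code: the function names, the use of lists for vector registers, and the tiny shapes in the example are all our assumptions. Only the array layouts InF[H][W][C] and Weight[C][T][T][K][N] and the reuse of the broadcast vector across K weight vectors follow the text.

```python
def vec_mac(acc, a, b):
    """Lane-wise multiply-add, standing in for one vector instruction."""
    return [s + x * y for s, x, y in zip(acc, a, b)]


def conv_vectorized(inf, weight, n, stride=1):
    """inf[H][W][C], weight[C][T][T][K][N] -> out[Ho][Wo][K][N] (no padding)."""
    h, w, c = len(inf), len(inf[0]), len(inf[0][0])
    t, k = len(weight[0]), len(weight[0][0][0])
    ho, wo = (h - t) // stride + 1, (w - t) // stride + 1
    out = []
    for oy in range(ho):
        row = []
        for ox in range(wo):
            # K accumulators play the role of K vector registers (step 4).
            acc = [[0.0] * n for _ in range(k)]
            for ci in range(c):
                for dy in range(t):
                    for dx in range(t):
                        point = inf[oy * stride + dy][ox * stride + dx][ci]
                        v_if = [point] * n  # step 1: N identical input points
                        for ki in range(k):  # step 3: reuse v_if for K weight vectors
                            acc[ki] = vec_mac(acc[ki], v_if, weight[ci][dy][dx][ki])
            row.append(acc)
        out.append(row)
    return out


# Tiny hypothetical example: H = W = 2, C = 1, T = 2, K = 2, N = 2.
inf = [[[1], [2]],
       [[3], [4]]]
weight = [[[[[1, 0], [0, 1]], [[0, 1], [1, 0]]],
           [[[1, 1], [0, 0]], [[2, 0], [0, 2]]]]]
out = conv_vectorized(inf, weight, n=2)
```

Each scalar load feeds K vector multiply-adds, which is the ratio that lets scalar transfers overlap with vector computation in the description above.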
2) PARTITION AND ANALYSIS MODEL
The above is the mapping method of the convolutional layer to the DSP core in the case where the weight and bias values can be stored in the vector memory. However, for many convolutional layers the storage space required for the weights is relatively large, and the vector memory space in the DSP core is insufficient. The following focuses on the analysis of this situation.
As shown in Figure 8, the same group of weights is marked in one color, as before. The output feature map is also divided into e×f blocks that are calculated separately. Assuming that the vector memory in the DSP core cannot meet the storage requirements of the weights, the weight data need to be partitioned. There are three ways to divide the weights. First, keeping multiple sets of weights together, all weights can be divided into v groups. Second, each set of weights can be split internally, and the same parts of different sets combined, so that all weights are divided into u groups. Third, these two methods can be combined to divide the weights into u×v groups. When the weights are divided into u groups or u×v groups, an intermediate result is obtained each time, and the complete output result is obtained by adding multiple intermediate results.
Since the energy and time consumed in moving data between DDR and DSP core are very high, the following focuses on reducing data transmission. The partition and analysis model consists of three equations. Total in (1) is the amount of data transferred between DDR and DSP core; (2) and (3) are the corresponding constraints. M((NKT/v)×(CT/u)) is the storage capacity that the vector memory can provide for weights; M(Temp) is the storage capacity that the vector memory can provide for intermediate results; OF is a small block of the output feature map, as shown in Figure 8. When u is 1, each set of operations yields a complete part of the output feature map, in which case g is also 1. For a specific convolutional layer and its constraints, the smallest value of Total in (1) is obtained, and the convolutional layer is then mapped to the DSP core according to the parameters u, v, g, etc. In this way the minimum amount of data transfer between DDR and DSP core is obtained for each convolutional layer, which means the least energy is consumed on data transmission.
According to the method in Figures 5, 6 and 7, the transmission time of the first item in (1) can be hidden. In this case, the minimum data transmission time is obtained by minimizing the sum of the second and third items in (1). After analyzing (1) with actual networks such as AlexNet and GoogLeNet, we found that it is best to let all parameters be 1, which gives the least data transmission time and power consumption. In most situations this cannot be satisfied because of hardware resource limitations, so it is better to find the minimum value of (1) under the constraints of (2) and (3).
u, v, g, e, f are integers.
B. VECTORIZATION MAPPING OF POOLING LAYERS
The pooling layer is a common layer in deep learning. It usually takes the form of max pooling or average pooling, which effectively reduce the size of the feature map and highlight features. Max pooling is mostly adopted in DCNN. The VPU in each DSP core has a set of comparison units, which can carry out the comparisons that max pooling requires. Figure 9 shows a pooling operation with a feature map of 5×5, a selection box of 3×3 and a stride of 2. In the first step, we compare the input data of rows 1 to 3, and then compare the data of rows 3 to 5, to obtain 2 rows of intermediate results. In the second step, we compare the intermediate results of columns 1 to 3, and then those of columns 3 to 5, to obtain the final result. The above is the operation for a single channel of the pooling layer, and each comparison unit in the VPU processes one channel of the feature map to form a vectorized operation.
The input data of the pooling layer should be stored in vector memory. If the vector memory capacity is insufficient, the input feature map should be partitioned by rows and then transferred into vector memory. This does not significantly increase the amount of data transmission between DDR and DSP core.
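The two-step comparison for the 5×5 map with a 3×3 selection box and stride 2 can be sketched in pure Python as follows. Each feature point is a small list whose entries stand for the channels handled by separate comparison units; the names `vec_max` and `max_pool` and the two-lane test data are illustrative assumptions, not part of the paper's implementation.

```python
def vec_max(a, b):
    """Lane-wise maximum, standing in for a vector compare-and-select."""
    return [x if x >= y else y for x, y in zip(a, b)]


def max_pool(fmap, size=3, stride=2):
    """fmap[H][W][lanes] -> pooled[Ho][Wo][lanes], comparing rows first, then columns."""
    h, w = len(fmap), len(fmap[0])
    # Step 1: compare the rows inside each vertical window (rows 1-3, then rows 3-5).
    rows = []
    for top in range(0, h - size + 1, stride):
        merged = fmap[top]
        for r in range(top + 1, top + size):
            merged = [vec_max(a, b) for a, b in zip(merged, fmap[r])]
        rows.append(merged)
    # Step 2: compare the columns of the intermediate rows (columns 1-3, then 3-5).
    pooled = []
    for row in rows:
        out_row = []
        for left in range(0, w - size + 1, stride):
            best = row[left]
            for col in range(left + 1, left + size):
                best = vec_max(best, row[col])
            out_row.append(best)
        pooled.append(out_row)
    return pooled


# Two lanes per point: lane 0 increases with position, lane 1 decreases.
fmap = [[[5 * r + c, -(5 * r + c)] for c in range(5)] for r in range(5)]
pooled = max_pool(fmap)
```

Because the row comparisons are shared by neighbouring windows, the intermediate rows are computed once and reused by every column window.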
C. VECTORIZATION MAPPING OF NORMALIZATION LAYER
The normalization layer is also a common layer in DCNN algorithm, and this layer mainly smooths the feature maps. Vectorization mapping of this layer utilizes matrix transposition and dimensional transformations to allow efficient processing in the DSP core. The following describes our method.
As shown in subplot 1 in Figure 10 , the input feature map of this layer is a three-dimensional array InF 
D. OTHER LAYERS
In addition to the above-mentioned layers, the fully connected (FC) layer and the activation layer are also common layers in the DCNN algorithm. However, these layers usually do not involve a significant amount of computation, and their vectorization mapping is relatively simple.
The FC layer is the common form of the last few layers in the DCNN algorithm. For example, AlexNet has three FC layers, and GoogLeNet has one FC layer. The FC layer can be viewed as matrix-vector and matrix-matrix multiplication operations, so it can be processed through a GEMM library. GEMM on vector-SIMD DSP is already highly efficient, so it is not discussed further here.
The activation layer is a common operation in deep learning algorithms. ReLU is commonly adopted as the activation function in DCNN, which sets an input that is less than zero to zero. In the DSP core, the activation operation can easily be completed by a vector comparison instruction.
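A minimal sketch of these two layers, written in plain Python, treats the FC layer as a matrix-vector product followed by ReLU as an element-wise comparison with zero. The function names and the small weight values are hypothetical and only illustrate the arithmetic; a real implementation would call an optimized GEMM routine.

```python
def fully_connected(weights, bias, inputs):
    """One FC layer as a matrix-vector product plus bias (one row per output neuron)."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Compare each element with zero and keep the larger one."""
    return [v if v > 0 else 0 for v in values]


weights = [[1, -2],
           [3, 4]]
bias = [0.5, -1]
pre_activation = fully_connected(weights, bias, [2, 3])
activated = relu(pre_activation)
```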
IV. ANALYSIS MODEL OPTIMIZATION FOR MULTI-INPUT AND MULTI DSP CORES
Above, we mainly analyzed the situation of a single input image on one DSP core. However, in actual supercomputer centers, data centers and large workstations, multiple images are input at a time, as shown in Figure 11, and the accelerator contains multiple DSP cores. It is therefore necessary to optimize the analysis model for multiple input images and multiple DSP cores.
FIGURE 11. DCNN of multiple input images.
When multiple input images are processed, the values of different input feature maps and output feature maps are not the same, so the transmission time of the first and third items in (1) has little room for optimization. However, all input images can share the same weights. So when the number of input images is I, (1) becomes (4).
u, v, g, e, f are integers. (6)
Matrix2000 has 12 DSP cores, so it is important to decompose the above tasks effectively across multiple cores. It can be seen from (4) that the K value and the C value in the second item (corresponding to the weight data divided into u and v parts in Figure 8), or the multiple input images, can be mapped to multiple DSP cores. Thanks to Matrix2000's broadcast sharing function, the amount of data transmission does not increase if every DSP core accesses the same data from DDR. When the K value is decomposed across multiple cores, the value of Total in (4) does not increase. However, when the C value is decomposed across multiple cores, the g value increases, that is, the third item in (4) increases. If the value of I is decomposed across multiple cores, none of the parameters in (4) increase. In summary, for multiple DSP cores and multiple input images, the analysis of (4) shows that if the value of I, the value of K, or a combination of the two is mapped to multiple cores, the value of Total does not increase. In other words, based on our analysis model, we decompose the computational tasks onto multiple DSP cores without causing unnecessary losses.
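The decomposition rule can be sketched as a simple task assignment. The sketch below assumes a round-robin schedule, which is our illustrative choice and not necessarily the scheduling used on Matrix2000. Work is split only along the image index I and the weight-vector index K, and every task keeps all input channels C, so no core has to exchange partial sums with another.

```python
def assign_to_cores(num_images, num_k_groups, num_cores=12):
    """Distribute (image, k-group) tasks over cores; C is never split."""
    cores = [[] for _ in range(num_cores)]
    task = 0
    for image in range(num_images):
        for k_group in range(num_k_groups):
            cores[task % num_cores].append((image, k_group))
            task += 1
    return cores


# Batch of 16 images and K = 6, as in the AlexNet first-layer example.
schedule = assign_to_cores(16, 6)
loads = [len(tasks) for tasks in schedule]
```

With these values the 96 tasks divide evenly, so each of the 12 cores receives the same amount of work.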
V. RESULTS
Currently there are mainly three kinds of general-purpose accelerators deployed in supercomputers [36]: many-core processors, GPUs and DSPs. In this section, we analyze and compare the performance of these three accelerators by running some DCNN models. The DCNN models we chose as benchmarks are all typical and practical, and most other DCNN models are developed from them.
Intel Xeon Phi 7250 is a processor with 68 cores deployed in the Cori supercomputer. Buck adopted the Caffe framework and tested the speed of Intel Xeon Phi 7250 and Pascal Titan X for DCNN [33]. Based on the running time and parameters of Xeon [37], we calculated the computation efficiency of Xeon Phi 7250. Here we define computation efficiency as the utilization of computing resources.
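One way to make this definition concrete is to divide the achieved throughput by the peak throughput. The sketch below is our reading of the definition, not a formula taken from the paper. The layer shape is the commonly cited AlexNet first layer (55×55×96 output, 3 input channels, 11×11 kernel), the peak value is the 2.4 TFLOPS quoted for Matrix2000, and the 1 ms running time is a made-up number used only to show the arithmetic, not a measurement.

```python
def conv_flops(out_h, out_w, out_channels, in_channels, kernel):
    """Floating-point operations of one convolutional layer (2 per multiply-add)."""
    macs = out_h * out_w * out_channels * in_channels * kernel * kernel
    return 2 * macs


def computation_efficiency(flops, seconds, peak_flops_per_second):
    """Fraction of the peak throughput that a run actually achieved."""
    return flops / seconds / peak_flops_per_second


PEAK = 2.4e12  # peak throughput quoted in the text for Matrix2000

alexnet_conv1 = conv_flops(55, 55, 96, 3, 11)

# Hypothetical running time of 1 ms, for illustration only.
example = computation_efficiency(alexnet_conv1, 1e-3, PEAK)
```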
Jcjohnson measured the performance of NVIDIA GPUs adopting Caffe framework with CuDNN v5 and a batch size of 16 [38] . For a fair comparison, we adopted Caffe framework with the latest CuDNN v7 and a batch size of 16 to measure the performance of GPUs. Power consumption of GPU was obtained using the NVIDIA profiling tools.
Efland implemented DCNN on the vector-SIMD DSP Vision P6 [34], adopting a method that combines gather-scatter and shuffle. Only the computation efficiency of AlexNet was tested, and it is 57%. Jun-Yang implemented only the convolutional layers on Matrix2000, using a method based on shuffle [35]. Data reorganization with shuffle causes some bandwidth waste and unnecessary clock cycles, and the computation efficiency of this method is only 40%.
Combining our vectorization mapping method and analysis model, we implemented some DCNN models on the Matrix2000 testing board shown in Figure 12 to evaluate our methods, with the same batch size of 16. The DDRs and the Matrix2000 chip are integrated on the testing board. We measured the power of the testing board with a power test instrument.
The comparisons among Xeon Phi 7250, NVIDIA GPUs, the vector-SIMD DSP Vision P6 and Matrix2000 are listed in Table 1. Some information is missing, so we mark it as "−" in Table 1. As the testing time for each DCNN model was not reported in [33] and [35] by Buck and Jun-Yang, we used the average computation efficiency for these two hardware accelerators. As shown in Table 1, the average computation efficiency of this work on Matrix2000 is 20∼35% higher than the CuDNN approaches on NVIDIA GPU, 35∼45% higher than the Xeon Phi 7250 many-core processor, about 8% higher than Vision P6, and about 62∼75% higher than Matrix2000 with the method in [35].
The energy efficiency of GPUs and the Matrix2000 DSP is plotted in Figure 13 (Buck and Efland did not report the energy efficiency of Xeon Phi 7250 [33] and Vision P6 [34]). As the power consumption of a hardware accelerator (inversely proportional to its energy efficiency) is directly determined by the fabrication technology process, technology scaling is considered to ensure a fair comparison of energy efficiency. For technology scaling, the definition of S and the normalization of power consumption are given by (7) and (8) from [39]. As a result, the energy efficiency of the vector-SIMD DSP Matrix2000 is normalized to 16nm so that it can be compared with the GPUs on the same fabrication technology in Figure 13. As shown in the plot, the energy efficiency of Matrix2000 with our programming method is about 9∼30% higher than GPUs, and about 56% higher than the existing evaluation of DCNN based on Matrix2000 [35].
VI. CONCLUSION AND FUTURE WORK
The method of lowering the convolution into GEMM can also be used on a vector-SIMD DSP, but it causes some bandwidth waste. Likewise, the method combining gather-scatter and shuffle can be used on Matrix2000, but the data reorganization consumes some unnecessary clock cycles.
As Table 1 shows, our vectorization mapping method achieves good results. We think there are two reasons. The first is that the DSP contains a large number of MAC units, which are well suited to the large number of multiply-add operations required by DCNN. The second is that our method avoids both converting convolution to matrix multiplication and reorganizing data with gather-scatter and shuffle, so it incurs almost no bandwidth waste or unnecessary clock cycles.
The results in this paper demonstrate that the vector-SIMD DSP Matrix2000, a general-purpose accelerator deployed in a supercomputer that ranks fourth in the latest Top500 list, is still a suitable platform in the age of artificial intelligence when used with a suitable programming mapping method. For the first time, we systematically summarized the programming methods for DCNN implementation on vector-SIMD DSP. The vector-SIMD DSP architecture is a common architecture, so our vectorization mapping method and analysis model can serve as a reference for other vector-SIMD DSPs, just as CuDNN for NVIDIA GPU serves as a reference for AMD GPU. It is also meaningful to develop the next generation of vector-SIMD DSP and the corresponding programming method for future supercomputers, based on Matrix2000 and the method in this paper.
Our future work mainly focuses on the following aspects: 1) further improving the computing efficiency and energy efficiency of DSP for DCNN; 2) implementing the training of DCNN on DSP; 3) providing support on Matrix2000 for common DCNN programming frameworks, for example TensorFlow and PyTorch.
