The Sunway TaihuLight supercomputer is powered by SW26010, a new 260-core processor designed with on-chip fusion of heterogeneous cores. In this article, we present our work on optimizing the training process of convolutional neural networks (CNNs) on the Sunway TaihuLight supercomputer. Specifically, a highly efficient library (swDNN) and a customized Caffe framework (swCaffe) are proposed. Architecture-oriented optimization methods targeting the many-core architecture of SW26010 are introduced and are able to achieve 48× speedup for the convolution routine in swDNN and 4× speedup for the complete training process of the VGG-16 network using swCaffe, compared to the unoptimized algorithm and framework. Compared to the cuDNN library and the Caffe framework based on the NVIDIA K40m GPU, the proposed swDNN library and swCaffe framework on SW26010 have nearly half the performance of K40m in single precision and achieve 3.6× and 1.8× speedup over K40m in double precision, respectively.
1 INTRODUCTION
The convolutional neural network (CNN [14]) is one of the most successful deep learning models in modern artificial intelligence applications [8, 12, 20, 22, 24]. The training process of a CNN involves a large amount of computation and has become a popular research topic in the field of high-performance computing (HPC). GPUs are currently considered the most efficient hardware choice for deep learning tasks and are well supported by high-level deep learning frameworks [1, 4, 7, 13].
Sunway TaihuLight [11] , a supercomputer that ranks number one in the latest release (November 2017) of the TOP 500 list with over 100 PFlops computing capacity, is powered by the SW26010 many-core processor, which is designed with on-chip fusion of heterogeneous cores and is able to provide a peak double-precision performance of 3.06 TFlops. SW26010 introduces several unique features that could potentially accelerate the training process of CNNs, such as user-controlled local directive memory (LDM), hardware-supported register-level data sharing, and a unified memory space shared by all processing elements.
Our previous publication [10] introduced the optimization of the convolution algorithm targeting the many-core architecture of SW26010. As an extension of that work, in this article we present a more systematic optimization process to accelerate CNN training tasks on the Sunway TaihuLight supercomputer. Specifically, a highly efficient library and a customized Caffe framework for the SW26010 many-core processor are proposed.
The major contributions of this article include the following:
• We present a systematic algorithm design and optimization process with methods related to the LDM usage, register communication, and instruction pipeline, guided by a performance model, including a modified performance model and a new register blocking strategy that extend our previous work [10]. The optimized convolution routine can achieve 48× speedup over the basic implementation on SW26010.
• A customized deep learning library for the SW26010 many-core processor is developed, called swDNN, to provide the support for various computation and data processing layers in CNN models.
• An optimized Caffe framework for the SW26010 many-core processor is proposed, called swCaffe, which is integrated with the swDNN library and supports four-core-group (CG) parallelization on a SW26010 processor. The swCaffe framework can achieve about 4× speedup over a BLAS-based Caffe framework on SW26010.
Evaluation results also show that the proposed convolution implementation and the swCaffe framework have nearly half the performance of the NVIDIA K40m GPU in single precision while achieving 3.6× and 1.8× speedup over K40m in double precision, respectively.
The article is organized as follows. Section 2 introduces the background of the work, including the CNN algorithms, the detailed architecture of SW26010, and the related work on the optimization of CNN algorithms. Section 3 presents the performance model and architecture-oriented optimization methods targeting the convolution algorithm, including the evaluation of the implementation. Section 4 presents the swDNN library and the swCaffe framework, as well as the evaluation of the complete training process with swCaffe on a SW26010 many-core processor. Section 5 presents our conclusion.
2 BACKGROUND

2.1 Convolutional Neural Networks
CNNs usually contain multiple computing layers, among which the convolutional layers usually account for the majority of the computing time (greater than 90%). We first describe the configuration of a convolutional layer, listed in Table 1. The input data of a convolutional layer consists of N_i channels, each of which can be considered as a feature map of size R_i × C_i. Similarly, the output of a convolutional layer consists of N_o feature maps of size R_o × C_o. To calculate the values in an output feature map, N_i convolutional kernels of size K × K and 1 bias value are required. Each kernel convolves with an input feature map. The output value equals the sum of the N_i convolution results and the bias value. Therefore, there are N_i × N_o convolutional kernels and N_o bias values in a convolutional layer.
The training process of a CNN model is based on the stochastic gradient descent (SGD) algorithm. In each training step, the network is trained with a batch of samples. We define the batch size as B_s, and the original algorithm of a convolutional layer in a training iteration can be described as Algorithm 1. The input data, output data, and convolution weights are organized in four-dimensional tensors, and there are seven nested loops in the algorithm, which provides opportunities for parallel optimization on many-core processors like SW26010.
In addition to convolutional layers, a CNN usually contains other kinds of layers, such as pooling layers, fully connected layers, softmax layers, and other data processing layers, such as activation function layers and normalization layers. Different CNN models have different network structures, which describe how different kinds of layers are stacked in the neural network.
The major algorithm of fully connected layers is matrix multiplication, which can be supported by high-performance basic linear algebra subprograms (BLAS). Other layers, such as pooling, activation functions, and softmax, can be considered as data processing layers, and they are not the critical points of performance optimization.
2.2 SW26010 Many-Core Architecture
Figure 1 shows the architecture of a SW26010 many-core processor. SW26010 consists of four CGs, and each CG includes 65 cores: one management processing element (MPE) and 64 computing processing elements (CPEs) organized as an 8 × 8 mesh. The MPE and CPE are both complete 64-bit RISC cores but serve different roles in a computing task.
ALGORITHM 1: Original Algorithm of a Convolutional Layer in a Training Iteration
1: //IN, OUT, W, b are the input/output feature maps, convolutional kernels, and bias
2: //K_r = K_c = K represent the number of rows and columns of a 2-dimensional convolutional kernel
3: //The output images OUT are initialized with the bias b
4: for cB := 0 : 1 : B_s do
5:   for cN_o := 0 : 1 : N_o do
6:     for cR_o := 0 : 1 : R_o do
7:       for cC_o := 0 : 1 : C_o do
8:         for cN_i := 0 : 1 : N_i do
9:           for cK_r := 0 : 1 : K_r do
10:            for cK_c := 0 : 1 : K_c do
11:              OUT[cB][cN_o][cR_o][cC_o] += IN[cB][cN_i][cR_o + cK_r][cC_o + cK_c] × W[cN_o][cN_i][cK_r][cK_c]
12:            end for
13:          end for
14:        end for
15:      end for
16:    end for
17:  end for
18: end for
2.3 Related Work
Existing work on optimizing the convolution computation can be grouped into three categories.
• Time-domain transformation methods are first introduced in the early phase of CNN optimization research [2, 6, 15]. By expanding convolution operations into matrix multiplications, the performance can be improved with the help of BLAS on different hardware platforms. However, additional data transformation is required, which either consumes more memory space and extra data copy operations or involves complicated memory address remapping. Therefore, memory consumption and bandwidth are the major problems for time-domain transformation methods, and the overall performance is limited by the performance of BLAS.
• Frequency-domain transformation methods can reduce the arithmetic complexity of convolution operations. FFT-based [18, 23] and Winograd's filtering-based [16] convolution algorithms are proposed and perform well in cases with both large and small convolution kernel sizes. Similar to time domain-based methods, additional data transformation, as well as extra memory consumption, is required, and the overall performance is limited by the performance of transformation.
• Direct convolution optimization methods can reduce the data dependency by redesigning the convolution algorithm with loop reordering and data blocking, so as to improve the parallelism of the core computation. Instead of relying on existing BLAS or FFT libraries, direct convolution implementations require hardware-oriented optimization methods to take full advantage of the hardware architecture, and therefore the overall performance can approach the peak performance of the processor. Moreover, by carefully designing the data blocking strategies, additional data transformation and extra memory consumption can be avoided, which is more suitable for memory- and bandwidth-bound architectures.
In addition to the algorithm optimization, various hardware accelerators are employed to accelerate the convolution computation, such as GPUs, FPGAs, and ASICs, targeting both the classification and the training process of CNNs. FPGAs [19, 25-27] and ASICs [3, 5, 9, 17] are usually used for classification tasks due to the customizability of data precision, low latency, and high energy efficiency. GPUs currently dominate the competition among HPC platforms for training tasks. In particular, NVIDIA launched GPUs like the V100, which include deep-learning-specific units such as tensor cores. Correspondingly, the cuDNN [6] library was released to provide highly efficient routines for deep learning algorithms on NVIDIA GPUs and can be neatly integrated into widely used deep learning frameworks such as Caffe [13] and TensorFlow [1].
To explore the potential of training CNNs on other off-the-shelf many-core processors, in this article we present the detailed architecture-oriented optimization methods for the convolution algorithm on the SW26010 processor. Then we show the design of the deep learning library and the customized Caffe framework dedicated for the SW26010 processor, so as to support a highly efficient training process of the CNN on the Sunway TaihuLight supercomputer.
3 CONVOLUTION ALGORITHM OPTIMIZATION
We first introduce a performance model that shows the features of the SW26010 architecture and indicates the key factors that affect the performance of an implementation. Guided by the performance model, we redesign the convolution algorithm and propose LDM-related, register-related, and instruction-related methods for further optimization.
3.1 Performance Model
We consider different factors that affect the performance of one CG and propose a performance model shown in Figure 2 . The frequency of a CPE is 1.45GHz and the vectorization size is 4.
Assuming that each CPE executes one vector floating-point multiplication and addition (vfmad) instruction per cycle, the peak performance of a CG can be derived as:

Peak_CG = 64 (CPEs) × 4 (vector width) × 2 (multiply and add) × 1.45 GHz = 742.4 GFlops.    (1)
For an implementation, we define the execution efficiency (EE) as the ratio of vfmad instructions to the total execution cycles. Therefore, considering the loss from EE, the theoretical performance of an implementation is 742.4GFlops · EE.
Before a computing instruction can be executed, we need to make sure that the data has been loaded into registers. For a vfmad instruction, 12 double-precision numbers (12 × 64 = 768 bits) are needed. In Figure 2 , the required bandwidth (RBW) of an implementation is defined as the minimum data access bandwidth that could overlap the data access and computation.
A CPE supports two data access patterns to load data into registers. One is the global memory access (the gload instruction), which loads 64 bits of data into a scalar register directly from main memory. In this case, to guarantee the overlapping of computation and data access, the data accessed by a gload instruction should be involved in at least 12 vfmad instructions (768 bits : 64 bits). Here we define the computation to data access ratio (CDR), which represents the ratio of computation instructions (vfmad) to data access instructions. In the global memory access pattern, to overlap the computation and data access, the CDR should be greater than 12, which can hardly be met by most algorithms. Therefore, the global memory access pattern is relatively inefficient.
The performance model of the global memory access pattern is shown in Figure 2. The maximum memory bandwidth of one CG is about 8 GB/s. We denote the RBW by RBW_{MEM->REG}. Here we assume that the computation and the data access are parallel and independent processes, which can be realized through optimization methods such as double buffering (see Section 3.3.2). Therefore, if RBW_{MEM->REG} is greater than 8 GB/s, it will lower the performance by a rate of 8 GB/s / RBW_{MEM->REG}.
The other memory access pattern is to use the LDM as a data cache, which means that the data is loaded first from main memory into the LDM and then from the LDM into registers. There are two stages of data access in this case. We denote the RBW of the two stages by RBW_{MEM->LDM} and RBW_{LDM->REG}. When loading data from the LDM to registers, the vectorized load instruction (vload) is supported. Each vload instruction can load 256 bits (32 Bytes) of data into a vector register. The execution of load/store instructions usually takes three or four CPU cycles and is a nonblocking process, so one instruction can be issued every cycle, and the bandwidth between the LDM and registers is 32 Bytes × 1.45 GHz = 46.4 GB/s.
Data is transferred from main memory to the LDM through the direct memory access (DMA) interface, and the theoretical maximum bandwidth of the DMA is 36 GB/s. A DMA put/get operation accesses one or more memory blocks, which have a size of 128 Bytes on SW26010. The latency of a DMA request from the CPEs is usually more than 100 CPU cycles. Therefore, theoretically, successive DMA operations with large granularity can make full use of the DMA bandwidth. In practice, the actual bandwidth is not a constant value and varies with the size of the contiguous memory blocks accessed by one CPE. We wrote a microbenchmark on one CG to measure the actual DMA bandwidth and present the results in Table 2, where Size indicates the granularity of a DMA operation. We denote the measured DMA bandwidth (MBW) by MBW_{MEM->LDM}. We can see that the bandwidth of the DMA ranges from 4 GB/s to 36 GB/s. In general, a higher bandwidth is achieved when using a block size larger than 256 Bytes and aligned to 128 Bytes. Figure 2 also shows the performance model of the LDM-cache memory access pattern. Here the required CDR is 3 (768 bits : 256 bits), which is more easily achieved compared to the global memory access pattern. Our design is based on the LDM-cache memory access pattern.
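To make the model concrete, the following C sketch (our own illustration, not part of swDNN; the constants are the ones quoted above) computes the peak performance of a CG and the performance bound implied by a given required and measured DMA bandwidth.

#include <stdio.h>

int main(void) {
    const double freq_ghz = 1.45;                      /* CPE frequency (GHz) */
    const double peak_gflops = 64 * 4 * 2 * freq_ghz;  /* 64 CPEs, 4-wide vfmad: 742.4 GFlops */

    double ee  = 16.0 / 17.0;  /* execution efficiency of the optimized kernel task */
    double rbw = 20.0;         /* required MEM->LDM bandwidth of an implementation (GB/s, example) */
    double mbw = 36.0;         /* measured DMA bandwidth (GB/s) */

    /* When the required bandwidth exceeds the measured one, the data access
     * cannot be fully overlapped and performance is scaled by MBW/RBW. */
    double bw_factor = (rbw > mbw) ? (mbw / rbw) : 1.0;
    printf("peak = %.1f GFlops, attainable = %.1f GFlops\n",
           peak_gflops, peak_gflops * ee * bw_factor);
    return 0;
}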
3.2 Algorithm Design
Considering the original algorithm of a convolutional layer (Algorithm 1), the inner loops perform a K × K convolution. Usually, the value of K is relatively small and is odd, such as 3, 5, 7. Therefore, it is hard to map the inner loops onto the CPE mesh and is also inefficient for the vectorization of core computation.
To improve the parallelism, we reschedule the seven nested loops, making the inner computation a matrix multiplication with dimensions N_i, N_o, and B_s, which are relatively large in most convolutional layers and are suitable for mapping the inner computation onto the CPE mesh. Each CPE(i, j) holds a block of the convolutional kernels, W(i, j), and a block of the input feature maps, D_i(i, j), both of which can be shared between the CPEs either in the same row or in the same column. Therefore, for the core computation, the amount of data to be accessed by a CPE is

N_i × B_s / 8 + N_i × N_o / 8 + N_o × B_s / 64 elements.
We use the vload instruction for data access, so the theoretical CDR of the core computation is

CDR = (N_i × N_o × B_s) / (8 × N_i × B_s + 8 × N_i × N_o + N_o × B_s).    (2)

Assuming that N_i, N_o, and B_s have the same value, the CDR can meet the requirement (CDR ≥ 3) of the LDM-cache pattern when the value is larger than 51, which can be realized in most of the convolutional layers. For values that are not a multiple of 8, zero padding can be adopted and will not cause too much decrease in performance. Therefore, for brevity, we focus on the configurations that are a multiple of 8 in the following discussion. The following sections show the detailed implementation and optimization methods based on Algorithm 2.
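As a quick check of Equation (2), the following C sketch (our own helper, assuming the per-CPE data-access count given above) evaluates the CDR for a few layer configurations and reports whether the LDM-cache requirement CDR ≥ 3 is met.

#include <stdio.h>

/* CDR of the core computation following Equation (2), assuming the data
 * accessed per CPE is N_i*B_s/8 + N_i*N_o/8 + N_o*B_s/64 elements. */
static double cdr(int ni, int no, int bs) {
    double vfmad = (double)ni * no * bs / 256.0;                  /* vfmad instructions per CPE */
    double vload = ((double)ni * bs / 8.0 + (double)ni * no / 8.0
                    + (double)no * bs / 64.0) / 4.0;              /* vload instructions per CPE */
    return vfmad / vload;
}

int main(void) {
    int cfg[][3] = { {32, 32, 32}, {64, 64, 64}, {128, 128, 128}, {256, 256, 256} };
    for (int i = 0; i < 4; i++) {
        double r = cdr(cfg[i][0], cfg[i][1], cfg[i][2]);
        printf("Ni=No=Bs=%d -> CDR=%.2f (%s)\n", cfg[i][0], r,
               r >= 3.0 ? "compute-bound" : "bandwidth-bound");
    }
    return 0;
}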
3.3 LDM-Related Optimization
LDM-related optimization methods focus on an effective implementation of the outer loops of the algorithm. The targets are to realize the overlap of the data access from main memory to the LDM with the core computation on the CPE mesh, so as to increase MBW_{MEM->LDM} and reduce RBW_{MEM->LDM}.
ALGORITHM 2: Redesigned Convolution Algorithm. The outer loops iterate over the output coordinates (cR_o, cC_o) and the kernel coordinates (cK_r, cK_c); for each iteration, the corresponding blocks of the input/output feature maps and the convolutional kernels are loaded (with coordinate transform and data transposition), and the core computation, an (N_o, N_i, B_s) matrix multiplication, is performed on the 8 × 8 CPE mesh.
3.3.1 Optimized Data Layout.
The input data of the core computation is a part of the input/output feature maps and the convolutional kernels. Based on the original data layout, the data in W, D_i, and D_o is not stored contiguously in IN, OUT, and CONVW, so MBW_{MEM->LDM} will be limited by the small data access blocks. To increase MBW_{MEM->LDM}, we redesign the data layout of the input/output feature maps as (R_i, C_i, N_i, B_s) and (R_o, C_o, N_o, B_s), and reorganize the convolutional kernels accordingly.
In addition, we rotate the convolutional kernels on the K_r and K_c dimensions to eliminate the coordinate transform when loading the kernels in Algorithm 2. For IN and OUT, we put B_s as the lowest dimension, which eliminates the data transposition when loading and storing the feature maps in Algorithm 2 and supports vectorized operations on the B_s dimension in the core computation.
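As an illustration of the layout change, the following C sketch (our own helper function, not the swDNN code) transforms an output tensor from the original (B_s, N_o, R_o, C_o) layout to the optimized (R_o, C_o, N_o, B_s) layout, so that the batch dimension becomes contiguous.

#include <stddef.h>

/* Transform the output feature maps from the original layout (Bs, No, Ro, Co)
 * to the optimized layout (Ro, Co, No, Bs), making Bs the contiguous dimension. */
void transform_out_layout(const double *src, double *dst,
                          int Bs, int No, int Ro, int Co) {
    for (int b = 0; b < Bs; b++)
        for (int n = 0; n < No; n++)
            for (int r = 0; r < Ro; r++)
                for (int c = 0; c < Co; c++) {
                    size_t s = (((size_t)b * No + n) * Ro + r) * Co + c;
                    size_t d = (((size_t)r * Co + c) * No + n) * Bs + b;
                    dst[d] = src[s];
                }
}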
3.3.2 Double Buffering.
Double buffering is adopted to overlap the data access from main memory to the LDM and the core computation. Because the DMA is asynchronous, we design two LDM buffers of the same size. While the data in one buffer is used for the core computation, the data to be used in the next core computation can be loaded into the other buffer. Note that the double buffering design halves the maximum available space of the LDM for one computation iteration, which means that for one CPE, only a 32KB LDM is available for the core computation.
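A minimal sketch of the double-buffering pattern on one CPE is given below. dma_get_async() and dma_wait() are hypothetical wrappers standing in for the asynchronous DMA interface of SW26010, and compute_block() stands for the core computation; the actual swDNN code differs in detail.

#include <stddef.h>

enum { BLOCK = 2048 };                 /* elements per DMA block (example size) */
static double ldm_buf[2][BLOCK];       /* two LDM buffers of the same size */

/* Hypothetical wrappers for the asynchronous DMA interface and the core computation. */
extern void dma_get_async(double *ldm_dst, const double *mem_src, size_t n, volatile int *flag);
extern void dma_wait(volatile int *flag);
extern void compute_block(const double *ldm_src, size_t n);

void process_blocks(const double *mem, int nblocks) {
    volatile int flag[2] = {0, 0};
    dma_get_async(ldm_buf[0], mem, BLOCK, &flag[0]);              /* prefetch block 0 */
    for (int i = 0; i < nblocks; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < nblocks)                                      /* start loading the next block */
            dma_get_async(ldm_buf[nxt], mem + (size_t)(i + 1) * BLOCK, BLOCK, &flag[nxt]);
        dma_wait(&flag[cur]);                                     /* wait for the current block */
        compute_block(ldm_buf[cur], BLOCK);                       /* overlaps with the next DMA */
    }
}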
3.3.3 LDM Blocking.
We consider the total LDM usage of the 64 CPEs in the core computation with different convolutional-layer configurations. It can be described as follows:

LDM_total = (N_i × B_s + N_i × N_o + N_o × B_s) × DataLen,
where DataLen is the number of bytes of the data type. Assuming that N_i, N_o, and B_s are equal to 256, which are relatively large configurations for most convolutional layers, and the data type is double precision, the total LDM usage of the 64 CPEs is 3 × 256 × 256 × 8 Bytes = 1,536 KBytes. By using register communication techniques, the data stored in one CPE's LDM can be shared with other CPEs (more details are shown in Section 3.4.1), so the exact LDM usage of each CPE is 1,536 KB / 64 = 24 KB. Therefore, for most convolutional layers, a 32KB LDM is enough for the core computation; in other words, it is possible to take advantage of the remaining LDM space to improve the overall performance of the implementation.
ALGORITHM 3: Convolution Algorithm with LDM Blocking. Compared with Algorithm 2, an inner loop over b_C blocks is added: for each pair of kernel coordinates (cK_r, cK_c), the blocks of W and D_i are fetched by DMA (with double buffering), and the core computation is repeated b_C times on different blocks of the input/output feature maps so that the kernel data is reused.
In the convolution algorithm, the convolutional kernel is shared by the computation of values in the same output image. In the core computation of Algorithm 2, the data of the convolutional kernel (W) is only used for one core computation, corresponding to the values in the output feature maps at coordinate (cR_o, cC_o). To improve the data reuse of W, and in the meantime to improve the CDR of the core computation, we propose the LDM blocking strategy shown in Algorithm 3.
In the core computation of Algorithm 3, we load b_C times more data of the input/output feature maps and reuse the data of the convolutional kernels to complete b_C matrix multiplication computations. RBW_{MEM->LDM} is reduced, and the CDR of a CPE is increased compared to Equation (2). The larger the b_C we choose, the greater the CDR we can get. However, b_C is limited by the available size of the LDM, and we maximize its value to take full advantage of the LDM.
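The choice of b_C can be illustrated with a small C helper (our own bookkeeping, under the assumption that the kernel blocks are held once per iteration while the feature-map blocks are buffered b_C times; the 32 KB budget comes from the double-buffering discussion above).

#include <stdio.h>

/* Choose the largest b_C that fits the per-CPE LDM budget: the kernel block is
 * held once, while b_C blocks of the input/output feature maps are buffered. */
static int max_bc(int ni, int no, int bs, int elem_bytes, long budget_bytes) {
    long kernel_bytes = (long)ni * no * elem_bytes / 64;                 /* per CPE */
    for (int bc = 64; bc >= 1; bc--) {
        long fmap_bytes = (long)bc * (ni + no) * bs * elem_bytes / 64;   /* per CPE */
        if (kernel_bytes + fmap_bytes <= budget_bytes)
            return bc;
    }
    return 1;
}

int main(void) {
    /* 32 KB per CPE remains available for one iteration under double buffering. */
    printf("b_C = %d\n", max_bc(128, 128, 128, 8, 32 * 1024));
    return 0;
}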
3.4 Register-Related Optimization
Register-related optimization methods mainly focus on effectively mapping the core computation onto the 8 × 8 CPE mesh. Two key problems are targeted in our work: (i) realizing register-level data sharing between CPEs to reduce RBW_{LDM->REG} for each CPE, and (ii) making full use of the vector registers to implement the computation efficiently on a CPE.
3.4.1 Register Communication.
In the core computation, CPE(i, j) is responsible for a block D_o(i, j) of the output. For a given pair of (i, j), the computation of D_o(i, j) can be described as follows:

D_o(i, j) = Σ_{k=0..7} W(i, k) × D_i(k, j),
which can be done in four steps by CPE(i, j). D_i(i, j), W(i, j), and D_o(i, j) are preloaded into the LDM of CPE(i, j) before executing the core computation. Without loss of generality, we take CPE(2, 1) as an example to show the process.
Based on the proposed register communication strategy, the core computation can be done on the 8 × 8 CPE mesh following eight steps with highly efficient data sharing between CPEs.
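The following serial C sketch models the eight-step blocked computation; on SW26010 the W(i, k) and D_i(k, j) blocks travel between CPEs through register communication, which is modeled here with ordinary array indexing for clarity.

/* Serial model of the eight-step core computation D_o = W x D_i on an 8 x 8
 * mesh of blocks: CPE(i, j) accumulates W(i, k) x D_i(k, j) for k = 0..7.
 * W is No x Ni, D_i is Ni x Bs, and D_o is No x Bs, all row-major; D_o is
 * assumed to already hold the bias or earlier partial results. */
void core_blocked(const double *W, const double *Di, double *Do,
                  int No, int Ni, int Bs) {
    int bo = No / 8, bi = Ni / 8, bb = Bs / 8;      /* block sizes held by one CPE */
    for (int i = 0; i < 8; i++)                     /* CPE row    */
        for (int j = 0; j < 8; j++)                 /* CPE column */
            for (int k = 0; k < 8; k++)             /* the eight communication steps */
                for (int r = 0; r < bo; r++)
                    for (int c = 0; c < bb; c++)
                        for (int t = 0; t < bi; t++)
                            Do[(i * bo + r) * Bs + (j * bb + c)] +=
                                W [(i * bo + r) * Ni + (k * bi + t)] *
                                Di[(k * bi + t) * Bs + (j * bb + c)];
}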
3.4.2 Register Blocking.
In each step of the register communication process, the computation task of a CPE is to calculate the matrix multiplication of W(i, j) and D_i(i, j). The sizes of the blocks are (N_o/8, N_i/8) and (N_i/8, B_s/8), respectively.
For each CPE, there are only 32 vector registers, including the zero register and the stack pointer (sp) register, so fewer than 30 registers are available for the implementation. We should use vectorized computation to improve the data reuse in registers and reduce the data dependency to achieve an efficient instruction flow. Therefore, we propose a register blocking strategy to implement the computation in each step. Figure 4 shows the details.
We use four vector registers to load a block of W(i, j), four vector registers to load a block of D_i(i, j), and 16 vector registers C[0:15] to hold the results. Then, for i, j ∈ {0, 1, 2, 3}, we calculate Equation (6) using 16 vfmad instructions. The W and D_i registers are reloaded in each kernel task, whereas C[0:15] only needs to be loaded once in the first kernel task, which improves the data reuse at the register level and thus reduces RBW_{LDM->REG}. Because there is no data dependency between the vfmad instructions in a kernel task, one instruction can be issued in each CPU cycle, which increases the EE of the implementation.
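The kernel task can be pictured with the scalar C sketch below; each element of a, b, and c stands for one 256-bit vector register, and each update corresponds to one vfmad instruction (the register allocation and instruction scheduling of the real implementation are not shown).

/* One kernel task: a 4 x 4 register-blocked multiply-accumulate. a[0:3] and
 * b[0:3] model the eight loaded vector registers, c[0:15] the 16 accumulators
 * that stay in registers across consecutive kernel tasks; each update below
 * corresponds to one independent vfmad instruction. */
static inline void kernel_task(const double a[4], const double b[4], double c[16]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            c[4 * i + j] += a[i] * b[j];
}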
3.5 Instruction-Related Optimization
We adopt instruction-related optimization methods to overlap the data loading and computation instructions and to further improve the EE of the kernel task. Figure 5 shows the instruction flow of a kernel task.
ALGORITHM 4: Four-CG Parallel Convolution Algorithm. The algorithm follows the loop structure of Algorithm 3, with the DMA transfers of W and D_i overlapped with the b_C core computations on each CG and the outermost R_o loop partitioned among the four CGs.
As we can see from Figure 5(a), in cycles 4, 8, 23, and 24, two instructions can be issued to pipelines P0 and P1 simultaneously, because there is no data dependency and the instructions can be executed on P0 and P1 separately. Only data loading instructions (vldr, which can load data into a vector register and send it out through row register communication) are issued in the first few cycles, which lowers the EE of the implementation.
Considering that N_i/8 kernel tasks are executed consecutively, we unroll the loop over kernel tasks and reorder the instructions to overlap the vldr instructions of a kernel task with the vfmad instructions at the end of the previous kernel task. The implementation after loop unrolling and instruction reordering is shown in Figure 5(b), where only 17 CPU cycles are required to finish a kernel task and the EE is improved to 16/17 = 94.1%.
3.6 CG-Level Parallel Scheme
Based on the preceding optimization methods, the convolution algorithm can be mapped onto a CG efficiently. Considering that there are four CGs in a SW26010 processor, we can further design a parallel scheme for four CGs. The simplest yet efficient way is to introduce parallelism on the outermost loop (R_o). As discussed in Section 2.2, data can be shared by the four CGs without extra data copies. Therefore, we can set the data of the input/output feature maps, convolutional kernels, and bias to the shared mode, and implement a four-CG convolution algorithm as shown in Algorithm 4.
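The CG-level scheme can be sketched with pthreads as follows; conv_rows() is a stand-in for the optimized single-CG convolution over a range of output rows, and the binding of threads to CGs is omitted.

#include <pthread.h>

/* Four-CG scheme: the outermost R_o loop is split into four contiguous ranges,
 * one per CG. conv_rows() stands in for the optimized single-CG convolution
 * over a range of output rows; thread-to-CG binding is omitted. */
typedef struct { int ro_begin, ro_end; } RowRange;

extern void conv_rows(int ro_begin, int ro_end);   /* single-CG routine (stub) */

static void *cg_task(void *arg) {
    RowRange *r = (RowRange *)arg;
    conv_rows(r->ro_begin, r->ro_end);
    return NULL;
}

void conv_four_cg(int Ro) {
    pthread_t tid[4];
    RowRange range[4];
    int chunk = (Ro + 3) / 4;
    for (int cg = 0; cg < 4; cg++) {
        range[cg].ro_begin = cg * chunk < Ro ? cg * chunk : Ro;
        range[cg].ro_end   = (cg + 1) * chunk < Ro ? (cg + 1) * chunk : Ro;
        pthread_create(&tid[cg], NULL, cg_task, &range[cg]);
    }
    for (int cg = 0; cg < 4; cg++)
        pthread_join(tid[cg], NULL);
}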
ALGORITHM 5: Test Case Generation. Four test sets are generated, targeting different values of N_i/N_o, R_o (C_o), K, and B_s, respectively; for example, in the test set targeting the batch size, B_s ranges from 32 to 512, doubling each time (for B_s = 32; B_s <= 512; B_s *= 2).
3.7 Single-Precision Support
During the preceding design and optimization process, we consider double precision (64 bits) as the data representation for both feature maps and weights. However, unlike in most scientific applications, single precision (32 bits) is sufficient for training CNN models in deep learning applications. Therefore, to support practical CNN training tasks, we further improve our optimized algorithm implementation to support single-precision operations.
Originally designed for major scientific applications that mostly rely on double-precision data types, the features of the SW26010 architecture, such as the vectorized instructions and register communication operations, are generally more suitable for handling double-precision operations. There is no special hardware optimization for single-precision operations on SW26010. Therefore, theoretically, the peak performance for single-precision computation is equal to that for double precision. In practice, there will be performance loss due to the lack of support for single-precision operations in the instruction set, which is discussed in the following.
A straightforward way to support single precision is to redesign the kernel task instruction flow based on single-precision instructions. The major problem is that there is no instruction like vldr or vlddec for single-precision data in the instruction set of SW26010. Instead, we should first load four single-precision data into a vector register using the vlds or vldse instruction, then call register communication using the putr or putc instruction. Therefore, for single precision, the instruction flow of the kernel task has eight more instructions than the double-precision implementation shown in Figure 5 , and more importantly, these instructions cannot be overlapped by computation instructions due to the register dependency. Eight more cycles in the kernel task will lower the EE to 16/(17 + 8) = 64%, which indicates that the overall performance loss will be more than 30% (compared to 94.1%).
In the straightforward way, all data accessed in the kernel task requires extra cycles. Considering that there is data reuse in the core computation (e.g., W will be reused b_C times), we propose another way to reduce the overall extra cycles for single-precision data access, called the float2double implementation. After we load the data from main memory to the LDM, we cast the single-precision data in the LDM to double precision and then do the core computation in double precision. Correspondingly, we cast the double-precision data back to single precision before storing the computation results to main memory. The data casting can be implemented using a flow of vlds/vsts (for single precision) and vldd/vstd (for double precision) instructions.
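A plain C sketch of the float2double idea is shown below; on the CPEs these loops would be implemented with the vlds/vsts and vldd/vstd instruction flow described above.

#include <stddef.h>

/* float2double sketch: widen the single-precision data in the LDM to double
 * before the core computation, and narrow the results back before storing
 * them to main memory. */
static void widen_to_double(const float *src, double *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (double)src[i];
}

static void narrow_to_float(const double *src, float *dst, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)src[i];
}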
3.8 Evaluation
To show the performance improvement obtained from the proposed algorithm design and optimization methods, we first evaluate the performance of the implementation based on double precision. Different convolutional layer configurations, as listed in Table 1, lead to different practical performance. Since the configurations change irregularly with CNN models and applications, it is unnecessary to traverse all possibilities. Therefore, we derive the test cases according to Algorithm 5, where four sets of test cases are generated targeting different values of N_i/N_o, R_o (C_o), K, and B_s separately. Table 3 lists the specifications of SW26010 and the NVIDIA K40/K80 GPUs. Taking the peak performance in both single and double precision into consideration, we choose the K40m GPU as the comparison to SW26010 in our evaluation. We run the test cases using our implementation and the convolution subroutine of cuDNN (v5.1) on the NVIDIA K40m GPU. The evaluation results are summarized into four categories to show how the performance changes with different configurations, as shown in Figures 6 and 7.
As we can see from Figure 6(a), the performance of our implementation is more sensitive to the value of N_i. As discussed in Section 3.4.2, in each step of the register communication process, N_i/8 kernel tasks are executed. Therefore, a larger N_i leads to a longer process of consecutive kernel tasks, which provides better performance. Figure 7(a) shows that the performance with a small value of R_o (and C_o) is relatively low, because we use a double buffering design to achieve the overlap of the data access from main memory to the LDM and the core computation. The design can be considered as a pipeline, and there is a starting phase at the beginning of the process. Small R_o and C_o shorten the pipeline and therefore lower the overall performance. The performance with different N_o and different K is relatively stable according to Figure 6(b) and Figure 7(b).
In Figure 7(c), small B_s (e.g., 32 and 64) cannot make full use of the 8 × 8 CPE mesh, so the performance penalty of the proposed implementation is quite apparent. For large B_s (e.g., 256 and 512), the performance improvement is also apparent, since large B_s benefits the performance of the innermost matrix multiplication computation in our design.
Considering all test cases, the performance of our implementation ranges from 1.3TFlops to 2.0TFlops, and the average performance is about 1.68TFlops, which is about 56% of the peak performance of SW26010. For the evaluation of cuDNN on the K40m GPU, the average performance is about 0.47TFlops. The peak double-precision performance of K40m is 1.43TFlops, so the efficiency of cuDNN is about 32%. Compared to cuDNN, our work can achieve about 3.6× speedup on performance and about 24% improvement on hardware efficiency.
To illustrate the effectiveness of the optimization methods proposed in this article, we show the performance of the implementations after adopting different optimization methods in Figure 8. In our work, we take the implementation of Algorithm 2 as the basic version and follow the steps of adopting the vectorization design, LDM-related optimization, register-related optimization, and instruction-related optimization successively, which forms an optimization process guided by the performance model. Finally, we propose the four-CG parallelization design and introduce the implementation based on Algorithm 4. As we can see, in the optimization process, a distinct performance improvement is achieved in each step, and 48× speedup is achieved in total.
Generally speaking, some of the proposed optimization techniques, such as vectorization, register blocking, and instruction-related optimization, are also applicable to other heterogeneous architectures, such as GPUs and the Intel Xeon Phi. In our work, we consider the features of the SW26010 many-core architecture and customize the practical optimization strategies for these general optimization techniques. In addition, other optimization techniques, such as LDM utilization and register communication, are specific to the SW26010 architecture. In summary, both the architecture-specific optimization techniques and the architecture-customized strategies for general optimization techniques are considered architecture-oriented optimization methods, which can also generalize to other application and algorithm optimization problems on the SW26010 many-core architecture.
We further evaluate the performance of our convolution implementation based on single-precision data representation, which is generally used in the practical training process of CNN models. As discussed in Section 3.7, a straightforward implementation and a float2double implementation are proposed. In the experiment, we train VGG-16 [21] with both float and double data precision based on our work on SW26010 and cuDNN on K40m. There are 13 convolutional layers with different configurations in VGG-16. We show the average performance and hardware efficiency of the convolutional layers in Figure 9.
Considering the performance on SW26010, the float2double implementation has about 10% improvement in hardware efficiency over the straightforward implementation. Therefore, we adopt the float2double implementation when training CNN models with single precision. To support more efficient computation in lower data precisions, such as single or half precision, further improvement of the SW26010 architecture is necessary and should target two main aspects. The first is to provide optimized SIMD operations for lower data precisions, such as 8× SIMD in single precision or 16× SIMD in half precision, which can be realized by reusing the current 256-bit vector registers and adopting hardware optimization for the ALU part of the CPEs. The second is to support a more complete set of low-precision instructions (e.g., vldr for single/half precision), so as to improve the overlapping of computation and data access (or data transfer).
4 TRAINING PROCESS OPTIMIZATION
Based on the proposed algorithm optimization methods in Section 3, we further design the swDNN library and a customized Caffe framework, called swCaffe, which can provide a complete solution to train CNN models on the Sunway TaihuLight supercomputer.
4.1 swDNN Library
Section 3 focused on detailed algorithm and code optimization methods for the convolution algorithm, which is the most computationally intensive part of a CNN. To provide a high-performance solution for the complete training process of a CNN on the Sunway TaihuLight supercomputer, we further optimize the computation of all kinds of layers under all possible conditions, as well as the backward propagation process, to support practical CNN models.
First, we consider different conditions for a convolutional layer. The proposed implementation performs well when the numbers of input and output channels are large enough to assign the tasks to all CPEs. Usually for the first few layers in a practical CNN, the number of channels is small. In this case, the performance of the proposed implementation is poor, so we provide an alternate implementation based on a time-domain transformation method proposed by Jia et al. [13] , which contains an Img2Col function to transform the input maps to a matrix and a general matrix-matrix multiplication (GEMM) function to do the computation. The GEMM implementation on SW26010 shares the same core computation with the proposed convolution algorithm, so we can skip over the optimization details for brevity. Different implementations are chosen under different conditions, together to support all kinds of convolutional layers.
The fully connected layer, which is realized through matrix multiplication, involves the second largest amount of computation in a CNN. The implementation is also based on GEMM.
In addition to the computation-intensive layers, such as the convolutional and fully connected layers, other layers can be considered memory-intensive layers, such as pooling layers, normalization layers, and activation function layers. Memory access bandwidth is the key factor that affects the performance of these layers. As shown in Algorithms 1 and 3, the original data layout is (B_s, N_o, R_o, C_o) and the optimized data layout is (R_o, C_o, N_o, B_s). Therefore, as discussed in Section 3.1, we propose parallel implementations using the CPEs with task partition along the output channel dimension (N_o), which in both data layout cases can guarantee successive memory accesses with large granularity, so as to take full advantage of the memory bandwidth. Similarly, data transformation operations, such as the data layout transformation and Img2Col, can also be accelerated using the CPEs.
Usually the output layer of a CNN is a softmax layer. The softmax algorithm is hard to parallelize, and the computation amount of the softmax layer is rather small. Therefore, there is no need to design a CPE-based implementation for the softmax layer.
Each iteration of the training process contains a forward process and a backward process. The preceding implementations are focused on the forward process. The output of a forward process is the classification result given by the current model. In the backward process, we first evaluate the error of the output results with respect to the true labels of the input samples. Then we propagate the error back from the output layer to the input layer and adjust the weights in the layers to minimize the error. In each layer, the backward process shares similar computation patterns with the forward process but involves approximately twice the computation operations, for both error propagation and weight update. Therefore, the algorithm design and optimization for the backward process of a layer is similar to the forward process but has different input/output data. We implement a CPE-based backward process for each layer to provide highly efficient backward propagation in the training process.
Integrating the preceding implementations for different layers and the corresponding data transformation functions, we present a library for accelerating deep neural networks on the SW26010 many-core architecture, called swDNN. A summary of the swDNN library is shown in Table 4. For each subroutine in swDNN, we provide two implementations. The basic implementation utilizes one CG of SW26010. For the training process of large CNN models, we provide a four-CG parallel implementation to take advantage of the all-shared memory. The four-CG parallel strategy is to adopt the task partition along the outermost dimension of the data. For the convolutional layers with the optimized data layout, the outermost dimension is R_o, as introduced in Section 3.6. For other layers with the original data layout, the outermost dimension is the batch size (B_s).
4.2 swCaffe Framework
To support more efficient CNN model development and training task deployment, we port Caffe, an open-source deep learning framework, onto the Sunway TaihuLight supercomputer. The original Caffe calls the BLAS library to do the arithmetic computation. On Sunway TaihuLight, swBLAS is one of the fundamental libraries that provide CPE-based implementations on one CG. We consider the Caffe framework built on swBLAS as the basic version, which has no specialized optimization for CNN models.
Based on the basic version, we propose three optimization methods to customize the Caffe framework for the SW26010 many-core architecture, and finally we present swCaffe.
First, we implement swDNN-based layers, as listed in Table 4 , to substitute for the original implementations of different layers in Caffe.
Second, we add a new data transformation layer to swCaffe. In most CNN models, there are consecutive convolutional layers and pooling layers that can be accelerated with the optimized data layout, such as the 2nd to 5th convolutional layers in AlexNet [14] and the 2nd to 13th convolutional layers in VGG-16. Here we take the convolutional layers in VGG-16 as an example. If we do the data transformation for the input/output feature maps and weights in each layer, the data transformation time is about 27% of the total execution time of all convolutional layers, as listed in Table 5. We add a dedicated data transformation layer to swCaffe so that, for consecutive convolutional layers, the data transformation of the input/output feature maps is performed only once. The data transformation time is reduced to 16% of the total execution time of all convolutional layers. Specifically, the data transformation time for feature maps is reduced to about one fourth (from 3.16s to 0.73s).
Third, we extend swCaffe to support four-CG parallelism in the complete training process. A straightforward way is to utilize the four-CG implementations in swDNN for each layer, and the parallel process is shown in Figure 10(a). As we can see, the training process is started on CG0. For layers that have four-CG parallel implementations in swDNN, we call pthread_create to start three computing threads on CG1-CG3. After the computation finishes, we call pthread_join to release the computing threads and continue the process on the main thread. In a CNN model, when most of the layers are based on swDNN, calling pthread_create and pthread_join repeatedly leads to a relatively large overhead for creating and releasing the thread contexts.
Addressing the preceding problems, we propose a framework-level parallelization design as shown in Figure 10(b). At the beginning of the process, we call pthread_create to start four threads on the four CGs, all of which remain active during the whole training process. For layers that can be implemented in parallel, such as Layer 1 in Figure 10, the computation can be done on four CGs without extra overhead. For layers that cannot be implemented in parallel, such as Layer 2, we first call a simple synchronization function to guarantee that all four threads are at the same stage, and then do the computation on the main thread. The resulting framework, swCaffe, is the swDNN-based Caffe with customized data transformation layers and the framework-level parallelization design for four CGs.
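The framework-level design can be sketched with a persistent thread pool and a barrier, as below; run_layer_on_cg() and layer_is_parallel() are hypothetical stand-ins for the swCaffe layer dispatch.

#include <pthread.h>

/* Framework-level design: four worker threads (one per CG) are created once
 * and kept alive across the layers; a barrier replaces the repeated
 * pthread_create/pthread_join calls. */
static pthread_barrier_t layer_barrier;

extern int  num_layers;
extern int  layer_is_parallel(int layer);            /* stub */
extern void run_layer_on_cg(int layer, int cg);      /* stub */

static void *cg_worker(void *arg) {
    int cg = (int)(long)arg;
    for (int l = 0; l < num_layers; l++) {
        if (layer_is_parallel(l)) {
            run_layer_on_cg(l, cg);                   /* all four CGs compute */
        } else {
            pthread_barrier_wait(&layer_barrier);     /* simple synchronization */
            if (cg == 0)
                run_layer_on_cg(l, 0);                /* serial layer on the main thread */
        }
        pthread_barrier_wait(&layer_barrier);         /* layer boundary */
    }
    return NULL;
}

void train_forward_backward(void) {
    pthread_t tid[4];
    pthread_barrier_init(&layer_barrier, NULL, 4);
    for (long cg = 0; cg < 4; cg++)
        pthread_create(&tid[cg], NULL, cg_worker, (void *)cg);
    for (int cg = 0; cg < 4; cg++)
        pthread_join(tid[cg], NULL);
    pthread_barrier_destroy(&layer_barrier);
}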
For comparison, we provide the performance of training VGG-16 with Caffe on Intel multicore CPUs (2 × E5-2670v3, 24 cores, with 128GB memory) and the NVIDIA K40m GPU. The training dataset is the ImageNet (ILSVRC) 2012 image classification dataset. We use samples per second (samples/s) as the metric to show the average training speed. Results are shown in Figure 11.
As we can see, for the single-precision-based training process, the proposed swCaffe framework can achieve about 4.6× speedup over the basic swBLAS-based framework, mainly because of the utilization of four CGs. In addition, the optimization targeting data transformation layers and four-CG parallelization can provide about 20% (speedup from 3.8× to 4.6×) performance improvement. Overall, the proposed optimization methods are proven to be effective.
Compared to the CPU and GPU results, swCaffe is 4.6× more efficient than two 12-core CPUs (based on OpenBLAS) and has nearly half the performance of K40m (based on cuDNN v5.1). As a supplement, the performance of the double-precision-based training process is also provided. The double-precision performance of swCaffe is even higher than the single-precision performance on SW26010, and is 8.9× and 1.8× that of the CPUs and the K40m GPU, respectively.
5 CONCLUSIONS
In this article, we present our work on optimizing the CNN on the SW26010 many-core processor. We propose architecture-oriented optimization methods for the algorithm implementation and framework parallelization. Based on the proposed optimization methods, we develop a customized deep learning library (swDNN) and a customized Caffe framework (swCaffe).
Evaluation results show that the proposed optimization methods can bring 48× performance improvement to the convolution routine in swDNN compared to the basic implementation. The optimized swCaffe framework achieves 4× performance improvement for the complete training process of the VGG-16 network compared to the original Caffe with swBLAS. Moreover, the proposed convolution routine in swDNN and the swCaffe framework show nearly half the performance of the cuDNN library (on a K40m GPU) in single precision while achieving 3.6× and 1.8× speedup over cuDNN (on a K40m GPU) in double precision, respectively.
The presented work can provide highly efficient solutions for training CNN models with the SW26010 many-core processor. Moreover, it proves the capability of deploying large-scale deep learning applications on the Sunway TaihuLight supercomputer.
