This paper reports our efforts on swCaffe, a highly efficient parallel framework for accelerating deep neural network (DNN) training on Sunway TaihuLight, one of the fastest supercomputers in the world, which adopts a unique heterogeneous many-core architecture. First, we point out some insightful principles for fully exploiting the performance of this innovative many-core architecture. Second, we propose a set of optimization strategies for redesigning a variety of neural network layers based on Caffe. Third, we put forward a topology-aware parameter synchronization scheme to scale the synchronous Stochastic Gradient Descent (SGD) method to multiple processors efficiently. We evaluate our framework by training a variety of widely used neural networks with the ImageNet dataset. On a single node, swCaffe achieves 23%~119% of the overall performance of Caffe running on a K40m GPU. Compared with Caffe on CPU, swCaffe runs 3.04~7.84× faster on all networks. When training AlexNet and ResNet50 with 1024 nodes, swCaffe achieves up to 715.45× and 928.15× speedup, respectively.
* L. Li and J. Fang equally contributed to this work.
I. INTRODUCTION
Deep learning has already proven its usability in a variety of applications [1]. To achieve better results or to deal with more complex problems, networks keep growing in scale. As large networks require substantial computational power, memory throughput and storage capacity, training neural networks poses a great challenge to the underlying hardware. In addition, as single-processor efficiency has reached its physical limits, scaling deep neural network (DNN) training over parallel supercomputers becomes a good solution to satisfy the computation and storage requirements.
This paper targets Sunway TaihuLight [2], formerly the fastest supercomputer in the world, which is powered by SW26010 many-core processors and has a peak performance of more than 100 petaflops. Our previous work [3] has already explored the possibility of developing highly efficient convolution subroutines on SW26010. However, great challenges remain in scaling the entire DNN training process.

Xeon Phi. Table I shows a comparison of these processors. The detailed architecture of SW26010 is shown in Fig. 1. It consists of 4 core-groups (CGs) connected via a network on chip (NoC). Each CG has one management processing element (MPE), one computing processing element (CPE) cluster with 8×8 CPEs, one protocol processing unit (PPU), and one memory controller (iMC). The processor interacts with the outside world via a system interface (SI).

Fig. 1: The architecture of SW26010 many-core processor

Each CG has its own 8 GB memory that is shared by the MPE and the CPE cluster and is accessible via the iMC. The memory has a theoretical bandwidth of 136 GB/s. The 8×8 CPEs can interact with each other via register buses. CPEs that fall into the same row or column can exchange messages via the fast register communication mechanism, which supports up to 256-bit broadcast or unicast in one cycle.
Both the MPE and the CPE adopt a 64-bit RISC (Reduced Instruction Set Computer) design running at 1.45 GHz and supporting 256-bit SIMD instructions. Each MPE has a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 256 KB L2 cache. Each CPE has a 16 KB instruction cache and a 64 KB local directive memory (LDM), also known as Scratch Pad Memory (SPM), which must be explicitly controlled by the user.
B. Network Topology of Sunway TaihuLight
The Sunway TaihuLight supercomputer is composed of 40,960 nodes connected via a customized network. The network is divided into 2 levels: a fat tree at the top and a supernode network at the bottom. While the fat tree network is used for communication between different supernodes, the bottom network connects the 256 nodes within a supernode.
TaihuLight uses FDR (Fourteen Data Rate) 56 Gbps network interface cards (NICs) and provides a total bisection network bandwidth of 70 TB/s. The theoretical bandwidth between any two nodes is 16 GB/s. When nodes communicate via the Message Passing Interface (MPI), the measured network speed is 12 GB/s with a latency at the microsecond level.
C. DNN Training Process and Frameworks
Deep learning tries to solve the optimization problem below.
$$\arg\min_{\theta} f(\theta) := \frac{1}{N} \sum_{n=1}^{N} f_n(\theta) \tag{1}$$
where θ denotes the model parameters (or weights) we are looking for; N is the number of samples; f(θ) typically takes the form of a DNN; and f_n(θ) is the loss function of the n-th sample. The SGD method is the de facto method for DNN training. It fulfills the task by iterating forward-backward propagations. In the forward propagation step, a mini-batch of training data is used as input to calculate the intermediate activations of each layer. In the backward propagation step, these intermediate activations are used to compute the gradients. Afterward, the newly computed gradients with respect to the model parameters are applied to the model.
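The forward, backward and update steps described above can be sketched on a toy problem. The following is a minimal illustration, assuming a linear least-squares model in place of a DNN; the function name `sgd_least_squares` and all hyperparameters are hypothetical and not part of swCaffe or Caffe.

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.1, batch_size=8, epochs=200, seed=0):
    """Mini-batch SGD on f(theta) = (1/N) * sum_n 0.5 * (x_n . theta - y_n)^2."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Forward propagation: predictions for the mini-batch.
            pred = X[idx] @ theta
            # Backward propagation: gradient of the mini-batch loss w.r.t. theta.
            grad = X[idx].T @ (pred - y[idx]) / len(idx)
            # Update: apply the gradient to the parameters.
            theta -= lr * grad
    return theta

# Toy data (an assumption for illustration): noiseless linear targets.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(64, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = sgd_least_squares(X, y)
```

In a real DNN the forward and backward steps run layer by layer, but the structure of the loop is the same.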
Caffe [4] is an open-source software framework written in C++ for DNN training. It is widely adopted by both academia and industry. Caffe implements DNN training with three major components, namely layers, net and solvers, corresponding to three optimization levels. Layers implement the algorithms of the different neural network layers and are where algorithm-level optimizations targeting different underlying hardware and platforms take place. The net defines the network structure of a DNN model and implements the forward and backward propagations, so it allows optimizations of a single training iteration, such as process parallelization and memory optimizations. Solvers control the network training process and implement parameter tuning algorithms such as SGD. Therefore, optimizations of network training algorithms and of the distributed training process belong in the solvers.
The original Caffe framework is designed for standalone training on a single server, and only supports GPU accelerators. To map the framework efficiently onto the Sunway TaihuLight supercomputer, we need to refactor or redesign the implementation of the above three components, so as to fit the unique architecture of the processors and to support distributed training over multiple nodes.
III. PRINCIPLES OF PARALLEL ALGORITHM DESIGN ON SW26010
It is not an easy task to squeeze out the full potential of Sunway TaihuLight. To design high-performance applications, the following principles should be followed.
Principle 1: Fully utilize the 8 × 8 CPE mesh for computation-intensive tasks. In each CG, the CPE cluster theoretically provides a computing capacity of 742.4 GFlop/s, while the MPE provides only 11.6 GFlop/s. So the most important step to improve performance is to offload the computationally intensive kernels to the 8 × 8 CPE mesh. Various levels of parallelism are supported by the CPE cluster:
• The parallelism between 64 CPEs can be exploited by orchestrating data-independent tasks on each CPE simultaneously.
• For each CPE, data-level parallelism can be exploited by using the 256-bit vector registers for SIMD operations.
• In addition, instruction-level parallelism is also supported by two instruction pipelines, namely the floating-point pipeline and the memory access pipeline. Instructions within each pipeline are issued in order, whereas independent instructions on different pipelines can be issued out of order.
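As a quick sanity check, the peak figures quoted in Principle 1 can be reproduced from the clock rate and SIMD width given earlier. This is only a sketch: the text does not state how the peaks are derived, so the lane width (double precision) and the one fused multiply-add per lane per cycle are assumptions.

```python
CLOCK_GHZ = 1.45      # from the text
CPES_PER_CG = 64      # 8 x 8 mesh, from the text
SIMD_BITS = 256       # from the text

# Assumptions (not stated in the text): 64-bit lanes, and each lane retires
# one fused multiply-add (2 flops) per cycle.
lanes = SIMD_BITS // 64
flops_per_cycle = lanes * 2

cpe_cluster_gflops = CPES_PER_CG * CLOCK_GHZ * flops_per_cycle
mpe_gflops = CLOCK_GHZ * flops_per_cycle

print(f"CPE cluster: {cpe_cluster_gflops:.1f} GFlop/s, MPE: {mpe_gflops:.1f} GFlop/s")
```

Under these assumptions the CPE cluster supplies 64 times the MPE peak, which is why offloading the compute kernels matters so much.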
Principle 2: Always use LDM as an intermediary cache for data movement to and from DDR3 memory. In each CG, the memory controller is responsible for connecting the MPE and the CPE cluster to the DDR3 memory. The theoretical shared memory bandwidth is 32 GB/s. According to the benchmarking results shown in Fig. 2, the DMA (Direct Memory Access) bandwidth saturates at around 28 GB/s for both read and write. However, the Memory-to-MPE and Memory-to-LDM bandwidths differ greatly. The bandwidth of copying data from one DDR3 memory space to another through Memory-to-MPE is only 9.9 GB/s. As a result, it is always preferable to use LDM as the intermediary cache rather than accessing main memory from CPEs directly.
Principle 3: Increase available memory bandwidth by transferring large data blocks. The limited aggregate memory bandwidth and the high computing power lead to an extremely high flop-per-byte ratio of $\frac{742.4\ \text{GFlop/s}}{28\ \text{GB/s}} = 26.5$, compared with ratios of 14.90 and 14.56 for K40m and KNL, respectively. To achieve satisfactory DMA bandwidth, we should keep the following points in mind during algorithm design. First, data transfer should be conducted by all 64 CPEs together. Second, fine-grained memory access from the CPE cluster should be avoided as much as possible. The amount of data transferred by each CPE should be larger than 2 KB so that the data transfer time can hide the LDM transfer latency of hundreds of cycles. The data block size for strided access should be at least 256 bytes to achieve satisfactory bandwidth.
Principle 4: Reduce memory access by register-level communication among CPEs. Besides increasing available bandwidth, we can also improve performance by reducing the amount of data transferred between LDM and memory. Register-level communication (RLC), which enables 256-bit unicast/broadcast communication at the register level among CPEs, is a unique hardware characteristic of SW26010. Direct RLCs are allowed only between CPEs within the same row or the same column, following an anonymous producer-consumer pattern with FIFO sending/receiving buffers (i.e., the send instruction is asynchronous, and the sender/receiver gets stalled if the sending/receiving buffer is full/empty). If RLC transfers are fully pipelined, the overall unicast and broadcast bandwidth can reach 2549 GB/s and 4461 GB/s respectively [5]. In this way, we can reuse the data in other LDMs on the same row/column of the CPE cluster to reduce bandwidth requirements between the main memory and LDMs.
IV. PARALLEL DNN LAYERS DESIGN ON SW26010
A deep neural network consists of various layers. Here we present our optimization methods for the most frequently used layers in DNN applications, according to the principles pointed out in the previous section.
A. Matrix-Multiplication Layer
The inner-product layers and other more complicated layers, such as Long Short-Term Memory (LSTM) layers, mainly involve General Matrix-to-Matrix Multiplication (GEMM) operations. If data locality is fully exploited and near-optimal memory bandwidth is achieved, GEMM operations can be implemented with a high flop-to-byte ratio. To implement GEMM on the CPE cluster, we use the register communication proposed in [3] [6] to increase data locality in LDM. Assume we intend to perform the GEMM operation C += A × B, where matrices A, B and C are of sizes m × k, k × n and m × n, respectively, and can all fit into the 64 KB LDM. Each matrix is evenly divided into 8 × 8 blocks, with block dimensions m/8, n/8 and k/8. A CPE is responsible for computing an m/8 × n/8 block of C, which requires an m/8 × k tile of A and a k × n/8 tile of B. Note that, in this case, 7/8 of the tiles of both A and B required by this CPE reside in the remote LDMs of other CPEs. According to Principle 4, we can take advantage of the row and column register communication scheme to fetch remote data, as CPEs in the same row of the cluster share the tile of A, and CPEs in the same column of the cluster share the tile of B.
The GEMM operation can be finished in 8 steps as

$$C(i, j) \mathrel{+}= \sum_{t=0}^{7} A(i, t) \times B(t, j)$$

For each time step t (0 ≤ t ≤ 7), CPE(i, t) loads the data of A(i, t) from LDM and broadcasts it to the other CPEs in the same row by row register communication. Similarly, CPE(t, j) loads the data of B(t, j) from LDM and broadcasts it to the CPEs in the same column. Thus, CPE(i, j) receives the data of both CPE(i, t) and CPE(t, j), and the computation of C(i, j) += A(i, t) × B(t, j) can be performed. Fig. 3 illustrates the register communication operations when t is 2. This design is optimal, with the highest flop-to-byte ratio, as the matrices only need to be fetched from memory to LDM once. Blocking techniques are applied to matrices that are too large to fit into the LDM. As the memory-LDM bandwidth is critical for GEMM performance, the contiguous data blocks of the matrices that each CPE accesses should be large enough, according to Principle 3. As a result, the matrix dimensions should be large enough to obtain good memory bandwidth.
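The 8-step schedule can be checked with a small NumPy simulation. This is a sketch of the data flow only: the broadcasts are modeled as plain array reads, so nothing here reflects the real register communication instructions, and the function name `mesh_gemm` is hypothetical.

```python
import numpy as np

GRID = 8  # the 8 x 8 CPE mesh

def mesh_gemm(A, B, C):
    """Simulate C += A @ B where CPE(i, j) owns block (i, j) of A, B and C."""
    m, k = A.shape
    n = B.shape[1]
    assert m % GRID == 0 and k % GRID == 0 and n % GRID == 0
    bm, bk, bn = m // GRID, k // GRID, n // GRID
    out = C.astype(float).copy()
    for t in range(GRID):  # the 8 time steps
        for i in range(GRID):
            # Block A(i, t), held by CPE(i, t) and supplied to every CPE(i, j).
            a_it = A[i * bm:(i + 1) * bm, t * bk:(t + 1) * bk]
            for j in range(GRID):
                # Block B(t, j), held by CPE(t, j) and supplied to every CPE(i, j).
                b_tj = B[t * bk:(t + 1) * bk, j * bn:(j + 1) * bn]
                # Local update on CPE(i, j): C(i, j) += A(i, t) x B(t, j).
                out[i * bm:(i + 1) * bm, j * bn:(j + 1) * bn] += a_it @ b_tj
    return out

rng = np.random.default_rng(0)
A = rng.normal(size=(16, 24))
B = rng.normal(size=(24, 8))
C = rng.normal(size=(16, 8))
result = mesh_gemm(A, B, C)
```

Each block of A and B is read from its owner exactly once per step, which mirrors the claim that the matrices only need to be fetched from memory to LDM once.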
SW26010 provides no native support for single-precision floating-point operations, which is the default precision used in DNNs. As there is no instruction in the SW26010 instruction set that supports RLC for single-precision data, we always perform RLC operations with double-precision data and convert elements inline between double precision and single precision with SIMD instructions.
B. Convolutional Layer
The convolutional layers are the most compute-intensive parts when training Convolutional Neural Networks (CNNs). Both time-domain methods with GEMM operations [7] and frequency-domain methods with FFT operations [8] have been proposed to optimize convolutional layers on GPU. Because GEMM operations can be optimized very well on the CPE cluster with register-level communication, as mentioned previously, we adopt time-domain transformation methods. To support the different convolutional layer parameter configurations found in real CNN applications, we propose a mixed strategy combining the explicit GEMM plan used in the original Caffe and the implicit GEMM plan proposed in [3].
1) Explicit GEMM transformation: To map convolution operations to GEMM and reuse the GEMM routine described in Sec. IV-A, we adopt the explicit GEMM transformation used in the original Caffe. During forward propagation, im2col (image-to-column) operations are performed to transform input tensors into matrices before the GEMM operations. During backward propagation, col2im (column-to-image) operations are performed after the GEMM operations. Assuming a convolutional layer has a filter of size $(N_o, N_i, K, K)$, the im2col operation transforms a 3D multi-channel image tensor of size $(N_i, R_i, C_i)$ into a 2D matrix of size $(K \times K \times N_i, R_o \times C_o)$. Here $C_{i/o}$ and $R_{i/o}$ are the numbers of columns and rows of the input/output image, $N_{i/o}$ is the number of input/output channels, and $K$ is the filter size. A batch-size dimension $B$ is also introduced to bring more optimization space for GEMM blocking.
As the filter tensor can be viewed as a matrix of size $(N_o, K \times K \times N_i)$, the GEMM operation is performed on two matrices with a common dimension of size $K \times K \times N_i$. Both im2col and col2im have irregular memory access patterns. In backward propagation, the convolutional layers transfer the matrix back to a tensor with col2im, which performs the reverse memory movement. As indicated by Principle 2, the irregular memory access of im2col and col2im should be implemented with DMA on the CPE cluster. Fig. 4 shows our im2col and col2im plan on one CPE. During the im2col process, each CPE reads one row of an input image into an LDM buffer with a DMA get operation. After adding the padding, each CPE writes K × K lines of data into memory. Block sizes are critical for memory bandwidth in the GEMM operation.
2) Implicit GEMM transformation: The time overheads of im2col and col2im are not negligible for some layers. An implicit GEMM transformation proposed in our previous work [3] is integrated into swCaffe to implement convolutional layers by blocking on the image width and on the input and output channel dimensions, which increases data reuse in LDM. However, when the numbers of input and output filter channels are smaller than 64, the performance of the implicit method degrades substantially, because the amount of data in LDM with small channel counts is not large enough to support 256-bit SIMD and register communication operations.
Real applications apply convolutional layers to input images after zero padding. Since the padding operation is already combined with the im2col/col2im operations in the explicit scheme, we also propose a padding optimization for convolutional layers using the implicit GEMM transformation, which uses a coordinate mapping technique to avoid explicit padding operations. Details of padding and other optimization techniques for convolutional layers can be found in our technical report released with the source code.
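To make the explicit transformation concrete, the sketch below unfolds a single image with im2col and then computes the forward convolution as one matrix product. It is a plain NumPy illustration, not the DMA-based CPE implementation; the function names and the row ordering inside the unfolded matrix (channel, then filter row, then filter column) are assumptions chosen to match a row-major reshape of the filter tensor.

```python
import numpy as np

def im2col(image, K, pad=0, stride=1):
    """Unfold an (N_i, R_i, C_i) image into a (N_i*K*K, R_o*C_o) matrix."""
    n_i, r_i, c_i = image.shape
    padded = np.pad(image, ((0, 0), (pad, pad), (pad, pad)))
    r_o = (r_i + 2 * pad - K) // stride + 1
    c_o = (c_i + 2 * pad - K) // stride + 1
    cols = np.empty((n_i * K * K, r_o * c_o), dtype=image.dtype)
    row = 0
    for ch in range(n_i):
        for kr in range(K):
            for kc in range(K):
                # All input pixels that meet filter tap (kr, kc) of channel ch.
                patch = padded[ch,
                               kr:kr + stride * r_o:stride,
                               kc:kc + stride * c_o:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, (r_o, c_o)

def conv_forward(image, filters, pad=0, stride=1):
    """Convolution as GEMM: (N_o, N_i*K*K) x (N_i*K*K, R_o*C_o)."""
    n_o, n_i, K, _ = filters.shape
    cols, (r_o, c_o) = im2col(image, K, pad, stride)
    return (filters.reshape(n_o, -1) @ cols).reshape(n_o, r_o, c_o)

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 6, 5))       # (N_i, R_i, C_i)
flt = rng.normal(size=(4, 3, 3, 3))    # (N_o, N_i, K, K)
out = conv_forward(img, flt, pad=1)
```

The unfolded matrix holds each input pixel up to K × K times, which is where the extra memory traffic of the explicit plan comes from.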
C. Tensor Transformation Layer
The data of the explicit and implicit GEMM transformations are arranged differently. In the explicit GEMM transformation plan, input and output tensors are of shape (B, N, R, C) and filters are of shape $(N_o, N_i, K, K)$, which is also the default data layout for other layers. In the implicit GEMM transformation plan, input and output tensors are of shape (R, C, N, B) and filters are of shape $(K, K, N_o, N_i)$. Note that the convolutional layers that can be accelerated with the implicit transformation are gathered together. The filters are local variables of these layers, and their layout does not affect other layers. In swCaffe, we add a tensor transformation layer, which has a 4D tensor input and a 4D tensor output and transposes dimensions between the two data layouts.
The tensor transformation in the trans layer is mainly irregular memory movement and should also be accelerated on the CPE cluster. Strided DMA access is adopted to load a block of the tensor into LDM. SIMD shuffle instructions are used to transform the data after it is loaded from LDM into registers.
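Functionally, the transformation layer is a 4D transpose between the two layouts named above. The following NumPy sketch shows only the index mapping; the function names are hypothetical, and the strided DMA and SIMD shuffle machinery of the real layer is not modeled.

```python
import numpy as np

def to_implicit_layout(x):
    """(B, N, R, C) -> (R, C, N, B)."""
    return np.ascontiguousarray(x.transpose(2, 3, 1, 0))

def to_default_layout(x):
    """(R, C, N, B) -> (B, N, R, C)."""
    return np.ascontiguousarray(x.transpose(3, 2, 0, 1))

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)  # (B, N, R, C)
y = to_implicit_layout(x)
x_back = to_default_layout(y)
```

`np.ascontiguousarray` forces the data to be physically rearranged, which is the costly part that the trans layer has to perform.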
D. Pooling Layer
The pooling layer partitions the input image into a set of non-overlapping tiles and, for each tile, outputs the maximum or average value of the elements inside. Since pooling layers are dominated by massive memory copy operations, they should be implemented with DMA operations on the CPE cluster. We design different data movement strategies according to the sizes of the input images. Assume the tile size is K × K. According to Principle 3, we should make the contiguous data size of each data block as large as possible. In most cases, each CPE is in charge of the pooling operation for multiple groups of K rows of the input image. When K rows of the image cannot fit in LDM, we load as many columns into LDM as possible. In this case, the data needed by a CPE is not stored contiguously in memory, and strided DMA is used to access it.
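As a reference for what the layer computes, here is a minimal NumPy sketch of non-overlapping K × K pooling on a single 2D image. It assumes the image dimensions are multiples of K and says nothing about how the work is split across CPEs; the function name `pool2d` is hypothetical.

```python
import numpy as np

def pool2d(image, K, mode="max"):
    """Non-overlapping K x K pooling of an (R, C) image."""
    R, C = image.shape
    assert R % K == 0 and C % K == 0, "this sketch requires R and C divisible by K"
    # Split rows and columns into (tile index, offset inside tile).
    tiles = image.reshape(R // K, K, C // K, K)
    if mode == "max":
        return tiles.max(axis=(1, 3))
    return tiles.mean(axis=(1, 3))

image = np.arange(16, dtype=float).reshape(4, 4)
max_pooled = pool2d(image, 2, mode="max")
avg_pooled = pool2d(image, 2, mode="avg")
```

Each output element reads K × K inputs and performs almost no arithmetic, which is why the layer is limited by memory bandwidth rather than by compute.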
V. SCALING SWCAFFE ON SUNWAY TAIHULIGHT
In this section, we describe how to scale swCaffe to multiple processors on Sunway TaihuLight.
A. Optimization for Communication of Model Parameters
In our work, we devise a data-parallel scheme to scale swCaffe using the synchronous Stochastic Gradient Descent (SSGD) algorithm, which is widely adopted on HPC clusters and supercomputer systems [9] [10] because of their high network quality and balanced node performance. There are mainly two methods to implement model parameter synchronization in SSGD. One method is to use parameter servers [11] as an intermediary that stores the parameters across several servers. The parameter server scheme is unable to sufficiently exploit the bandwidth potential of the fully connected network infrastructure of Sunway TaihuLight: since each processor has only one network port, receiving gradients simultaneously from a large number of workers can become a bottleneck, and the bandwidth between workers is not fully used. The other method is to perform all-reduce operations on the gradients among all nodes and to update the parameters on each node independently [10]. We adopt the latter approach to take advantage of the MPI routines optimized for the supercomputer system, as the former approach is designed for synchronization over low-bandwidth network infrastructures such as Ethernet. Our parallel synchronous SGD algorithm is described in Algorithm 1.
As shown in Fig. 5, we use multithreading among the 4 CGs inside one processor to calculate the average gradients. At the beginning of each iteration, we call pthread_create() to start 4 threads on the 4 CGs. Each thread is able to launch lightweight CPE threads to load work tasks onto its CPE cluster, in order to perform forward-backward propagation on 1/4 of the data in the mini-batch. Afterwards, each CG obtains its local parameter gradients, and CG 0 sums them together to obtain the average gradients of this mini-batch. To synchronize the threads, we implement our own synchronization function, which is based on a handshake (initiation-confirmation) strategy through a semaphore stored in shared memory.
To synchronize the gradients across nodes, we implement a customized all-reduce communication. The default MPI_Allreduce routine provided by the compiler, which is modified from Open MPI, cannot be directly applied to swCaffe for three main reasons. First, the Sunway network is characterized by high latency, so MPI_Allreduce routines designed for low-latency network hardware are no longer suitable. Fig. 6 compares the Sunway network with an Infiniband FDR network: while achieving similarly high bandwidth, the Sunway network has higher latency when the message size is larger than 2 KB. Second, the communication pattern in MPI_Allreduce is not aware of the hierarchical network topology described in Sec. II-B. If every node in one supernode performs point-to-point communication with a different node in another supernode, the interconnect across supernodes becomes over-subscribed. As shown in Fig. 6, the over-subscribed bandwidth between two supernodes is around 1/4 of the full bandwidth. Third, the sum operation after data gathering in MPI_Allreduce is performed on MPEs, which is inefficient when the number of parameters is large.

Before introducing our improvement to the all-reduce operation, we describe the cost model proposed in [12], which we use to evaluate our all-reduce in terms of latency and bandwidth use. We assume that the time taken to send a message between any two nodes can be modeled as $\alpha + \beta n$, where $\alpha$ is the latency (or startup time) per message, independent of message size, $\beta$ is the transfer time per byte, and $n$ is the number of bytes transferred. More specifically, $\beta_1$ is the transfer time per byte inside one supernode and $\beta_2$ is the transfer time per byte across supernodes when bandwidth is over-subscribed; since the over-subscribed bandwidth is about 1/4 of the full bandwidth, $\beta_2 \approx 4\beta_1$. In the case of reduction operations, we define $\gamma$ to be the computation cost per byte for performing the reduction operation locally on any node.
We also define $p$ to be the total number of nodes in the all-reduce operation and $q$ to be the number of nodes in one supernode.
Considering the high-latency characteristics of the Sunway network, the popular ring-based algorithms [13], which have a $p\alpha$ latency term, are not our best candidates. We choose a binomial-tree-based algorithm used in MPICH [12], which has a $2 \log p \cdot \alpha$ latency term, as our baseline to improve. An all-reduce operation is implemented as a reduce-scatter phase followed by an allgather phase. Instead of storing all results at the root node, the reduce-scatter phase adopts the Recursive Halving algorithm to scatter reduction results among all nodes. In the first step, each node exchanges $n/2$ data with a node that is a distance $p/2$ away. Each node sends the data needed by all nodes in the other half, receives the data needed by all nodes in its own half, and performs the reduction operation on the received data. In the second step, each node exchanges $n/4$ data with a node that is a distance $p/4$ away. This procedure continues recursively, halving the data communicated at each step, for a total of $\log p$ steps. The Recursive Doubling algorithm, analogous to the Recursive Halving algorithm, is adopted in the allgather phase to collect partial results from other nodes on each node. In the first step, nodes that are a distance 1 apart exchange their $n/p$ data. In the second step, nodes that are a distance 2 apart exchange their own data as well as the data they received in the previous step, which has a size of $2n/p$ in total. In the third step, nodes that are a distance 4 apart exchange their own data as well as the data they received in the previous two steps, which has a size of $4n/p$ in total. In the last step, nodes exchange messages of size up to $2^{\log p - 1} n / p$ with nodes that are a distance $p/2$ apart. A simple example of such an all-reduce implementation is illustrated on the left side of Fig. 7.
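The two phases can be traced with a small single-process simulation. This is a sketch under simplifying assumptions: the number of nodes is a power of two that divides the vector length, message exchange is modeled as array copies, and partners are chosen with an XOR on the rank, which is one common way to realize the distances described above. The function name is hypothetical and no MPI is involved.

```python
import numpy as np

def allreduce_halving_doubling(vectors):
    """Simulate a sum all-reduce over p nodes holding vectors of length n."""
    p, n = len(vectors), len(vectors[0])
    assert p & (p - 1) == 0 and n % p == 0
    data = [np.asarray(v, dtype=float).copy() for v in vectors]
    lo, hi = [0] * p, [n] * p   # slice each node is currently responsible for
    sizes = []                  # elements sent by each node at every step

    # Reduce-scatter by recursive halving: distances p/2, p/4, ..., 1.
    dist = p // 2
    while dist >= 1:
        nxt = [d.copy() for d in data]
        for r in range(p):
            peer = r ^ dist
            mid = (lo[r] + hi[r]) // 2
            s, e = (lo[r], mid) if r & dist == 0 else (mid, hi[r])
            nxt[r][s:e] = data[r][s:e] + data[peer][s:e]  # receive and reduce
            lo[r], hi[r] = s, e
        sizes.append(hi[0] - lo[0])
        data, dist = nxt, dist // 2

    # Allgather by recursive doubling: distances 1, 2, ..., p/2.
    dist = 1
    while dist < p:
        sizes.append(hi[0] - lo[0])
        nxt = [d.copy() for d in data]
        for r in range(p):
            peer = r ^ dist
            nxt[r][lo[peer]:hi[peer]] = data[peer][lo[peer]:hi[peer]]
        lo, hi = ([min(lo[r], lo[r ^ dist]) for r in range(p)],
                  [max(hi[r], hi[r ^ dist]) for r in range(p)])
        data, dist = nxt, dist * 2
    return data, sizes

rng = np.random.default_rng(0)
vecs = [rng.normal(size=16) for _ in range(8)]
result, sizes = allreduce_halving_doubling(vecs)
```

For 8 nodes and 16 elements, `sizes` records the per-step message lengths n/2, n/4, n/8 for the reduce-scatter phase and n/8, n/4, n/2 for the allgather phase.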
In the original implementation, nodes within the same supernode are assigned adjacent logical node numbers. In the first several steps of Recursive Halving and the last several steps of Recursive Doubling, each node has to communicate with a node far away in another supernode, causing over-subscription between supernodes. As a result, only 1/4 of the full bidirectional network bandwidth can be utilized. The costs of the original all-reduce are given in Eq. 2, Eq. 3, and Eq. 4. The last two equations are obtained by summing the costs of all time steps, which form a geometric progression. If $p$ is much larger than $q$, the term $(p - q)\beta_2 \frac{n}{p}$ accounts for most of the communication time.
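The dominant term can be rederived from the $\alpha$-$\beta$-$\gamma$ model. The following is a sketch under stated assumptions ($p$ and $q$ are powers of two, logarithms are base 2, and nodes of one supernode have contiguous logical numbers); it is a reconstruction consistent with the $(p - q)\beta_2 \frac{n}{p}$ term quoted in the text, not a verbatim copy of the paper's equations.

```latex
% Step s of recursive halving (s = 1, ..., \log p) exchanges n / 2^s bytes
% with a partner at distance p / 2^s. The partner lies in another supernode
% when p / 2^s >= q, i.e. for s <= \log(p/q).
\begin{align*}
\sum_{s=1}^{\log (p/q)} \frac{n}{2^{s}} &= n\left(1 - \frac{q}{p}\right) = (p - q)\,\frac{n}{p}
  && \text{(across supernodes, cost } \beta_2\text{)} \\
\sum_{s=\log (p/q) + 1}^{\log p} \frac{n}{2^{s}} &= n\left(\frac{q}{p} - \frac{1}{p}\right) = (q - 1)\,\frac{n}{p}
  && \text{(inside a supernode, cost } \beta_1\text{)} \\
t_{\text{reduce-scatter}} &= \log p \,\alpha + (p - q)\,\beta_2\,\frac{n}{p} + (q - 1)\,\beta_1\,\frac{n}{p} + (p - 1)\,\gamma\,\frac{n}{p} \\
t_{\text{allgather}} &= \log p \,\alpha + (p - q)\,\beta_2\,\frac{n}{p} + (q - 1)\,\beta_1\,\frac{n}{p}
\end{align*}
```

The allgather phase sends the same message sizes in reverse order and performs no reduction, so it has the same communication terms without the $\gamma$ term.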
We notice that the communication traffic in different steps is not balanced. Recursive Halving gradually reduces traffic, while Recursive Doubling gradually increases traffic. Considering the topology of the Sunway network, a better all-reduce implementation should place heavy communication traffic inside one supernode and light traffic across supernodes. We redesign the relationship between the physical distance and the logical distance used in the all-reduce algorithm by assigning logical numbers to the nodes of different supernodes incrementally in a round-robin way. For example, assuming we have 4 supernodes, nodes numbered 0, 4, 8, ... belong to supernode 0, nodes numbered 1, 5, 9, ... belong to supernode 1, and so on. As shown in Fig. 7, the new all-reduce conducts cross-supernode communication in the last $\log(p/q)$ steps of the reduce-scatter phase and the first $\log(p/q)$ steps of the allgather phase. For these steps, we only need to exchange relatively small messages. The new costs are shown in Eq. 5 and Eq. 6. As we can see, the new implementation greatly reduces the coefficient of $\beta_2$ from $p - q$ to $p/q - 1$, thus reducing the overhead caused by over-subscribed communication.
In addition, the sum operations after data gathering are implemented on the four CPE clusters of the processor. The parameters of different layers can vary greatly in size. In VGG-16, the first fully-connected layer is 102 MB, while the first convolutional layer is only 1.7 KB. The sum operation for layer gradients with a small parameter size can be inefficient, because fine-grained data access cannot fully utilize the memory bandwidth. We therefore pack the gradients of all layers together and perform one all-reduce after backward propagation. This scheme can fully utilize both the network bandwidth for communication and the memory bandwidth for the sum operation.
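The effect of the renumbering can be checked by counting, for one node, how many bytes cross a supernode boundary during the reduce-scatter phase under each numbering. This is a minimal sketch assuming $p$ and $q$ are powers of two and partners are selected by XOR on the logical rank; the function name and the example sizes are hypothetical.

```python
def cross_supernode_bytes(p, q, n, round_robin):
    """Bytes one node sends across supernodes in recursive-halving reduce-scatter."""
    num_supernodes = p // q

    def supernode(rank):
        # Round robin: ranks 0, S, 2S, ... share a supernode (S = num_supernodes).
        # Adjacent numbering: ranks 0 .. q-1 share a supernode.
        return rank % num_supernodes if round_robin else rank // q

    total = 0
    size, dist = n // 2, p // 2
    while dist >= 1:
        # Rank 0 is representative: every rank sees the same pattern by symmetry.
        if supernode(0) != supernode(0 ^ dist):
            total += size
        size //= 2
        dist //= 2
    return total

p, q, n = 1024, 256, 1 << 20   # example sizes, chosen for illustration
adjacent = cross_supernode_bytes(p, q, n, round_robin=False)
interleaved = cross_supernode_bytes(p, q, n, round_robin=True)
```

With these example sizes the cross-supernode volume per node drops from $(p - q)\,n/p$ to $(p/q - 1)\,n/p$, matching the change of coefficient described above.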
B. Parallel I/O optimization
Computing nodes in Sunway TaihuLight adopt a shared file system. Each worker of the parallel DNN training task uses an I/O thread to prefetch one mini-batch data via random sampling prior to each iteration. The file system of Sunway TaihuLight adopts a single-split mode for data distribution by default, that is, one file will only be distributed to one disk array. In this case, if we read the file concurrently, as the number of processes increases, the aggregate read bandwidth of multiple concurrent processes can quickly reach the upper limit of a single disk array. As a result, each process will get a bandwidth drop and the entire read time becomes longer.
We improve the aggregate bandwidth of the disk arrays by increasing the number of stripes to 32 and setting the stripe size to 256 MB. Data is distributed over 32 disk arrays in a round-robin manner with a block size of 256 MB. Assume that one process needs to read a mini-batch of 256 ImageNet images. The data size of this mini-batch is around 192 MB. Since each process always accesses 192 MB of consecutive data, a single process accesses at most two disk arrays. Accordingly, the number of processes served by each disk array is reduced to at most N/32 × 2, where N is the number of processes.
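The claim that a 192 MB read touches at most two disk arrays follows from simple block arithmetic, sketched below. The helper name `disk_arrays_touched` is hypothetical and the model (a file laid out in fixed-size blocks assigned to arrays round robin) is a simplification of the real file system.

```python
MB = 1 << 20

def disk_arrays_touched(offset, length, stripe_size=256 * MB, stripe_count=32):
    """Indices of the disk arrays holding bytes [offset, offset + length)."""
    first_block = offset // stripe_size
    last_block = (offset + length - 1) // stripe_size
    # Block b of the file lives on array b mod stripe_count.
    return sorted({block % stripe_count for block in range(first_block, last_block + 1)})

aligned = disk_arrays_touched(0, 192 * MB)
straddling = disk_arrays_touched(128 * MB, 192 * MB)
worst_case = max(len(disk_arrays_touched(off, 192 * MB))
                 for off in range(0, 64 * 256 * MB, 64 * MB))
```

Because the read length is smaller than the stripe size, a read can cross at most one block boundary, regardless of its offset.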
VI. FRAMEWORK EVALUATION
We implement swCaffe with the customized Sunway REACH (Open64-based) C compiler and the SWMPI 2.2 (MVAPICH 2.2-based) C++/MPI compiler on TaihuLight. We compare its performance with the original Caffe built with g++-4.8.0, CUDA-8.0 Toolkit and cuDNN-v5.1, and deployed on a hybrid system with an Intel 12-core E5-2680 v3 CPU and an NVIDIA K40m GPU card. The CPU has a memory bandwidth of 68 GB/s and a peak performance of 1.28 TFlop/s. We conduct our experiments on the public 1000-way ImageNet dataset.
A. Results for Optimizations on Different Layers
We analyze the performance of convolutional layers with both the explicit and implicit GEMM transformation strategies proposed in Sec. IV-B. Table II presents the measured time and throughput for each convolutional layer of the VGG-16 [14] network with batch size 128. VGG-16 has 13 convolutional layers and covers the most commonly used parameter configurations. For forward propagation in conv1_1 and backward propagation in conv1_1, conv1_2 and conv2_1, the implicit strategy is unable to handle the small channel sizes and the explicit strategy is the only solution. For most parameter configurations, the implicit strategy outperforms the explicit strategy. However, the explicit strategy is slightly better for layers with large image sizes and large channel numbers, where GEMM operations can be performed with large block sizes on the matrices generated by im2col. Since layers can be implemented with either method, swCaffe can use the first two iterations of training to determine the best strategy for the remaining iterations.

Fig. 8 and Fig. 9 present the processing time of each DNN layer on SW26010 and on the K40m GPU for forward and backward propagation on AlexNet [15] and VGG-16, respectively. We refine AlexNet without affecting the accuracy by changing local response normalization (LRN) to batch normalization (BN). The performance differences between the two architectures mainly come from the following aspects.

i) Although DNN training has long been considered a compute-intensive task on GPU, we notice that most DNN training time on SW26010 is taken by bandwidth-bound operations. As the bandwidth of GPU device memory can reach 288 GB/s, bandwidth-bound layers, such as pooling layers, can be processed in device memory very fast. However, these layers still consume a significant amount of time on SW26010.

ii) Although we achieve comparable performance for most compute-intensive layers, SW26010 has low efficiency compared with the GPU for the first two convolutional layers in both networks. Given that these layers have large image sizes, im2col and col2im operations account for most of the time in the first two layers. In addition, the input/output channel sizes are 3/64 and 64/64 for the first two convolutional layers, which is not enough for compute-bound blocked GEMM operations. The flop-to-byte ratio of a GEMM operation with A (size of m × k) and B (size of k × n) in single precision is $\frac{2mnk}{4(mk + kn + mn)}$. The best ratio is $m/6$, if m = n = k. The architectural flop-to-byte ratio calculated with the best measured bandwidth is $\frac{742.4}{28} = 26.5$. As a result, to make GEMM compute-bound, we must have m > 160. However, small channel sizes limit the m dimension of the transformed matrices.
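The threshold on m can be reproduced numerically. The sketch below assumes a simple traffic model in which A, B and C each move between memory and LDM once, with 4-byte single-precision elements; this model is an assumption, chosen because it yields the m/6 ratio quoted in the text for square matrices.

```python
def gemm_flop_per_byte(m, n, k, bytes_per_elem=4):
    """Flops per byte of memory traffic for C += A x B (assumed traffic model)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

machine_ratio = 742.4 / 28   # peak GFlop/s over measured DMA GB/s, from the text

# Smallest square size for which GEMM becomes compute-bound under this model.
min_m = next(m for m in range(1, 10_000)
             if gemm_flop_per_byte(m, m, m) > machine_ratio)
```

For non-square matrices the smallest dimension dominates the ratio, which is why a small channel count keeps the first convolutional layers memory-bound.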
B. Results for Different Network Structures
We evaluate the performance of swCaffe for complete DNN training using different network structures, with the results shown in Table III. AlexNet, VGG-16, VGG-19 [14], ResNet50 [16] and GoogleNet [17] are tested with batch sizes 256, 64, 64, 32, and 128 respectively. Compared with the 12-core CPU, SW26010 with our framework is 3.04~7.84× faster on the five DNNs. Our framework on SW26010 outperforms the K40m GPU on AlexNet with a speedup of 1.19×. Reading data from CPU host memory to GPU device memory through the PCI-E bus accounts for over 40% of the time during training of AlexNet, as the calculation time is too short to hide the memory I/O overhead. In contrast, CPEs in SW26010 can directly access memory with DMA, which eliminates this data reading overhead. Our framework on SW26010 achieves 45% and 49% of the overall performance of the NVIDIA K40m GPU on VGG-16 and VGG-19, with a theoretical memory bandwidth that is only 44% of that of the GPU. Implementations of ResNet50 and GoogleNet with swCaffe achieve 21% and 23% of the overall performance of GPU Caffe, because their convolutional layers adopt smaller channel settings than VGG-16 and VGG-19. Since only limited memory bandwidth is achieved on convolutional layers with small channel numbers, these two networks exhibit stronger memory-bound behavior on SW26010.
C. Results for Scalability
Recently, the work in [9] and [10] has increased the mini-batch size in data-parallel SGD without losing accuracy over a fixed number of epochs. A larger mini-batch size exposes more parallelism for scaling DNN training to multiple nodes, as the computing task of each node can achieve a high compute-to-communication ratio. Fig. 10 shows the result of scaling AlexNet and ResNet50 to 1024 nodes on Sunway TaihuLight. Compared with training on a single node, speedups of 715.45×, 561.58× and 409.50× are achieved for AlexNet trained with sub-mini-batch sizes of 256, 128, and 64, respectively. For ResNet50, speedups of 928.15× and 828.32× are achieved with sub-mini-batch sizes of 32 and 64, respectively. Although the current large-batch method [10] limits the mini-batch size for AlexNet and ResNet to 32K, TaihuLight equipped with our framework would benefit further from new training algorithms that support larger batch sizes.
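To put the reported speedups in perspective, they can be converted to parallel efficiency, that is, speedup divided by the number of nodes. The short sketch below only restates the figures quoted in this section; the dictionary layout and function name are hypothetical and are not part of swCaffe.

```python
NODES = 1024

# Speedups over a single node, as reported in the text,
# keyed by (network, sub-mini-batch size).
reported_speedup = {
    ("AlexNet", 256): 715.45,
    ("AlexNet", 128): 561.58,
    ("AlexNet", 64): 409.50,
    ("ResNet50", 32): 928.15,
    ("ResNet50", 64): 828.32,
}


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Fraction of ideal linear scaling achieved on the given number of nodes."""
    return speedup / nodes


if __name__ == "__main__":
    for (network, sub_batch), speedup in reported_speedup.items():
        eff = parallel_efficiency(speedup, NODES)
        print(f"{network:9s} sub-mini-batch {sub_batch:3d}: "
              f"speedup {speedup:7.2f}x, efficiency {eff:.1%}")
```

For AlexNet, the efficiency falls as the sub-mini-batch size shrinks, which matches the argument that more work per node raises the compute-to-communication ratio.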
VII. RELATED WORK
Existing methods for accelerating basic DNN layers mainly focus on NVIDIA GPUs and Intel Xeon Phi. cuDNN [7] is a widely used GPU-accelerated library of primitives for deep neural networks. Intel-Caffe [18] provides a library of DNN performance primitives optimized for Intel architectures. Both provide a set of highly optimized building blocks intended to accelerate the compute-intensive parts of deep learning applications.
The work in [19] was the first to train DNN models on CPU-GPU hybrid HPC systems. Since then, a large number of works have focused on scaling DNN training on GPU supercomputers and HPC clusters. Inspur-Caffe [20] is an MPI-based Caffe fork that exploits a parameter-server approach with stale asynchronous gradient updates. FireCaffe [21] discusses scaling DNN models on a cluster of 128 GPUs connected with Infiniband interconnects. It also adopts an allreduce-based parameter synchronization implemented with reduction trees. S-Caffe [22] provides modern multi-GPU clusters with a CUDA-Aware MPI runtime for reduce and broadcast operations and scales DNN training to 160 GPUs.
A variety of general DNN frameworks are deployed on HPC systems. TensorFlow [23], developed by Google, is the most famous DNN framework that operates at large scale in heterogeneous environments. It implements communication using the Google RPC library. Caffe2 [24] is developed by Facebook and built on Caffe, and CNTK [25] is developed by Microsoft. Both Caffe2 and CNTK natively support MPI for inter-node communication. MXNet [26] supports multi-GPU training with a parameter server called PS-lite, which is implemented with the ZeroMQ library for communication. Intel-Caffe also supports multi-node training through Intel MLSL (Machine Learning Scaling Library), a library built on top of MPI that works across various interconnects.
VIII. CONCLUSION
We shared our experience in designing swCaffe, a highly efficient parallel DNN training framework, on Sunway TaihuLight from the processor architecture and networking perspectives. We showed how to derive highly efficient routines for some key DNN layers on SW26010 by fully taking into consideration its unique hardware features. We also presented approaches to efficiently scale swCaffe to multiple nodes, namely the topology-aware optimization of the all-reduce operation for gradient synchronization and the file system mode optimization for parallel I/O. Compared to Caffe on the NVIDIA K40m GPU, swCaffe on SW26010 has competitive performance for networks with compute-intensive convolution operations such as AlexNet and VGG. In addition, good scalability can be achieved on multiple nodes, as indicated by the experimental results.
ACKNOWLEDGMENT
This work is co-sponsored by the National Key R&D Program of China (2018YFB0204100) and the Natural Science Foundation of China (61572280, 61672312).
