Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using methods such as distributed synchronous SGD. Among the issues with this approach is that to make the distributed cluster work with high utilization, the workload distributed to each node must be large, which implies nontrivial growth in the SGD mini-batch size. In this paper, we propose a framework called FPDeep, which uses a hybrid of model and layer parallelism to configure distributed reconfigurable clusters to train DNNs. This approach has numerous benefits. First, the design does not suffer from batch size growth. Second, novel workload and weight partitioning leads to balanced loads of both among nodes. And third, the entire system is a fine-grained pipeline. This leads to high parallelism and utilization and also minimizes the time features need to be cached while waiting for back-propagation. As a result, storage demand is reduced to the point where only on-chip memory is used for the convolution layers. We evaluate FPDeep with the AlexNet, VGG-16, and VGG-19 benchmarks. Experimental results show that FPDeep has good scalability to a large number of FPGAs, with the limiting factor being the FPGA-to-FPGA bandwidth. With 6 transceivers per FPGA, FPDeep shows linearity up to 83 FPGAs. Energy efficiency is evaluated with respect to GOPs/J. FPDeep provides, on average, 6.36x higher energy efficiency than comparable GPU servers.
Introduction
Deep convolutional neural networks (CNNs) have revolutionized applications such as image classification and object recognition [1, 2, 3, 4]. As there remains an open-ended demand for more complex networks and larger datasets, new computing solutions are critical.
Distributed synchronous stochastic gradient descent (SGD) has enabled large-scale CNN training by partitioning SGD mini-batches into smaller data batches that can then be processed in parallel and so accelerate CNN training [5]. A drawback of this method is scalability: to enable continued high utilization as the number of nodes increases, each node must be allocated an ever larger workload. But larger mini-batches slow training convergence. Thus, while larger clusters provide increased computation capacity, the training time is not proportionally reduced [5].
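To make the scaling issue concrete, the following minimal sketch computes the global mini-batch size under synchronous data-parallel SGD, where every node processes its own data batch in each step. The per-node batch of 32 and the node counts are illustrative assumptions, not figures from this work.

```python
def global_minibatch(num_nodes: int, per_node_batch: int) -> int:
    """Global SGD mini-batch when every node processes per_node_batch samples per step."""
    return num_nodes * per_node_batch


# Hypothetical per-node batch needed to keep a single node well utilized.
PER_NODE_BATCH = 32

for nodes in (1, 8, 64):
    # The global mini-batch grows linearly with the cluster size.
    print(nodes, global_minibatch(nodes, PER_NODE_BATCH))
```

Keeping the per-node workload fixed therefore forces the mini-batch to grow with the cluster, which is the effect that slows convergence.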
arXiv:1901.01007v1 [cs.LG] 4 Jan 2019
FPGA clusters are recognized to be a competitive technology for CNN inference [6, 7, 8, 9, 10, 11, 12, 13]. For CNN training, however, their efficacy is still an open question; one that is addressed in this work. Previous FPGA clusters for CNN training have generally worked in batch mode (batch in the computational sense), which uses the distributed synchronous SGD algorithm just described [14, 15, 16, 17, 18, 19]. In this approach, called Data-Parallelism [20], each FPGA executes all layers of the CNN. This is done in sequential order, a layer at a time, and a new layer starts only after the previous one has completed. Data-Parallelism has three significant disadvantages. First, optimal FPGA configurations for the different CNN layers vary greatly. Therefore, either the FPGA is suboptimally configured, or the FPGA needs to be reconfigured repeatedly at run-time. Second, storage required for weights and intermediate features is generally large enough that off-chip memory must be used. And third, this entire approach suffers from the scalability problem of the distributed synchronous SGD algorithm.
Another method, which we call Layer Parallelism, is to daisy-chain multiple FPGAs and map the entire CNN onto a single pipeline. Zhang et al. [6] have used Layer Parallelism to accelerate CNNs using FPGA clusters, though only for inference. Their approach, however, still leaves two problems. First, the pipeline is not seamless; a particular layer might be stalled until the previous layer is finished. All features must, therefore, be cached until the last feature of a layer is obtained. This requires large storage that necessitates the use of off-chip memory. Second, the computational intensity varies greatly among layers. A naive workload distribution can therefore result in a large number of idle cycles due to inter-layer dependencies. These two problems exist in both inference and training, but have a greater impact on the latter. In training, all features of the hidden layers must be cached until their corresponding errors arrive through Back Propagation (BP), thus requiring much larger memory. Moreover, due to BP, the number of operations per layer triples.
We propose FPDeep, a novel FPGA-cluster-based training framework for CNNs that solves the problems just described. FPDeep does this by using a hybrid of layer and model parallelism (described below) together with a number of new workload/weight balancing strategies. No reconfiguration is needed: each device computes only certain layers, or a part of a single layer; each device is optimized independently with respect to its own computation. The cluster is now a single fine-grained pipeline, so the batch size can be arbitrarily small. The amount of data that must be saved is drastically reduced, eliminating most off-chip memory accesses. Internode communication is simple and pipeline utilization is very high. To the best of our knowledge, our work is the first framework for CNN training on FPGA-based clusters using this method of parallelism and also the first framework with fine-grained workload/weight balancing.
The underlying theme of this work is to convert batch parallelism to pipeline parallelism, which has obvious benefits. Parallelism is equal to the depth of the pipeline, in this case many thousands of stages across the cluster. Communication paths can be short so cycle times are as small as the designer can make them. Communication among devices is direct and contention-free with any latency having no effect on throughput. There is also the aforementioned benefit of having all of the latency reduction applied to individual problem instances and so obviating the algorithmic challenges that come with larger batches.
We find this approach to be effective with performance similar to that of GPU clusters of similar size and technology, but with far better power efficiency. The limiting factor is inter-FPGA bandwidth. But, somewhat surprisingly, we find that a 1D topology suffices and that, even using only six transceivers per FPGA, FPDeep achieves linear speed-up to 83 FPGAs. The main contributions are as follows:
• The possibility to break down the well-known scalability wall of CNN training and the demonstration of FPGA clusters as a competitive technology for CNN training;
• A novel framework for mapping CNN training logic to distributed FPGA clusters that achieves both high efficiency and scalability; that does not suffer from issues related to mini-batch size; and that needs only a simple interconnection network as is available in any multi-FPGA system with consistent communication and reasonable bandwidth;
• A fine-grained pipeline design that minimizes the time that features need to remain available while waiting for back-propagation, thus reducing the storage demand to the point where only on-chip memory is required for the convolution layers;
• Fine-grained partitioning and mapping methodologies, which provide almost perfect workload and weight balancing among FPGAs; this is done by increasing the flexibility of workload and weight allocation, thus leading to improved utilization: multiple FPGAs can cooperatively compute the same layer, while multiple layers can also be mapped to the same device;
• An RTL code generator that automatically creates RTL implementations based on the mapping scheme generated by FPDeep.
The organization of this paper is as follows. In Section II, related work is discussed. In Section III, the methodology of FPDeep is presented and the workload/weight balanced partition methods are defined. In Section IV, the overall architecture and design of FPDeep accelerators are discussed. In Section V, experimental results are given. Finally, we conclude and suggest further work in Section VI.
Design Space and Related Work
Much work has addressed the mapping of inference/training of CNNs to clusters with programmable accelerators, including [21, 4]. Many frameworks and libraries have been deployed, e.g., MXNet [22], Caffe [23], and TensorFlow [24]. These systems hide the complexity of workload decomposition and provide friendly programmer interfaces, including Python, R, and Scala. For FPGA-based clouds, the prior work is more limited. Microsoft's Catapult project [25, 26] implements a parameterized CNN accelerator cluster which can deliver over 1 TFLOPS with very high energy efficiency. Zhang's CDSC FPGA-Enabled Cluster accelerates CNNs on top of Spark and Hadoop [16, 6]. In [6], researchers build a deeply pipelined FPGA cluster with 6 Xilinx VC709 boards to accelerate CNNs. In [14], an FPGA-based framework for CNN training is proposed, but it focuses mainly on single-FPGA designs.
Most distributed CNN systems, including TensorFlow and CNTK, are based on the distributed synchronous SGD algorithm (the Centralized Parallel SGD algorithm, C-PSGD; see Fig.1(A)). The Parameter Server Topology [27] uses a central parameter node connected with multiple worker nodes. This topology has several bottlenecks: communication load on the central node [28] and idle time waiting for straggling worker nodes [29]. Also, for large-scale clusters, the growth in the SGD mini-batch size limits scalability. Lian et al. use a decentralized parallel SGD algorithm (D-PSGD) to build a large-scale cluster [28]. As shown in Fig.1(B), each node must maintain its own local copy of the model and data duplication is inevitable. We complete the taxonomy of mapping CNN applications to distributed clusters in the remainder of Fig.1. Fig.1(C) shows the primary design choices: note the present work is decentralized/MP. Fig.2 shows the design space for mapping CNNs onto distributed nodes. We use terminology introduced by [20].
Data parallelism (Fig.2(A)) is the most popular approach in CPU and GPU clouds [22, 24]. It is also widely used in existing FPGA clouds, such as Catapult and CDSC [16]. This method has drawbacks as mentioned in Section I. In CNNs, the configurations of each layer, such as kernel size, pooling size, and stride size, vary greatly, requiring different hardware designs to obtain optimal performance. Thus, FPGAs need to be reconfigured between layers, leading to significant overhead. In addition, as each FPGA executes all layers in sequential order, each layer starts only after the previous layer has completed. Thus, intermediate features and weights need to be stored to and loaded from the host upon completion of each layer, leading to heavy communication with off-chip memory.
Layer Parallelism (Fig.2(B)) maps layers of the CNN onto individual nodes and pipelines CNN computation. It has been employed by both GPU and FPGA frameworks. In [30], multiple GPUs are used in a pipelined manner. Each LSTM layer is assigned to a different GPU. After GPU 1 finishes computing layer 1 for the first sentence, it passes its output to GPU 2. At the same time, GPU 1 fetches the next sentence and starts training. In their work, each layer is allocated to a particular GPU; thus, workloads are not balanced among devices. For multi-FPGA systems, Zhang et al. [6] focus only on inference; also, the parallelism is coarse-grained, the workload is unbalanced, and there is heavy off-chip memory communication. So while Layer Parallelism mitigates some of the problems with batch size
FPDeep Methods and Design
Preliminaries
In this subsection, we review basics of CNN training and define the most important parameters used by FPDeep. A six layer CNN (in Fig.3) is used as an example. The red datapath refers to Forward Propagation (FP). It calculates the errors of output features in the final layer. Starting with an input image (Cat), neurons in each layer are evaluated with weights w_i. Errors are calculated by comparing inference results to the label "Cat" in the training dataset. BP has two sub-steps: Error Back-propagation (EB, green) and Weight Gradient calculation (WG, orange). In EB, errors are back-propagated through the network. In WG, using the errors of each layer, gradients of the weights are calculated (∂error/∂w_i).
Table 1
Network Configuration Parameters
Number of layer l input feature map channels (IC[l])
Number of layer l output feature map channels
Row num of layer l input feature map
Column num of layer l output feature map
Width of layer l convolution kernel
Height of layer l convolution kernel
Layer l pooling size
Layer l pooling function
Layer l activation function
Hardware Constraint Parameters
Number of DSP per FPGA
Number of transceiver per FPGA (TRANS_max)
There are two parts to FPDeep: mapping and implementation (Fig.4). The Mapping Framework partitions a CNN into a number of fine-grained segments and maps them to FPGA clusters so that every FPGA gets a balanced workload and weights. In the Implementation Framework, the RTL generator creates RTL implementations for each FPGA based on the parameterized mapping, while a cycle-accurate simulator gives measures of throughput, percent idle stages, and bandwidth demand.
The two parts each have their own set of input parameters (see Table 1 and Fig.5). The network configuration parameters include 1) numbers (IC, OC) and sizes (Ci * Ri, Co * Ro) of the input and output feature maps, 2) convolution kernel size KW * KH, 3) stride size S, 4) padding size PS, 5) activation function of each layer, 6) layer number, 7) layer type, 8) pooling function, and 9) pooling size P. Hardware constraints include 1) number of available FPGAs in the target cluster N, 2) on-chip memory resources per FPGA chip BRAM, 3) DSP resources per FPGA chip DSP, and 4) available transceiver channels per FPGA board TRANS.
The organization of the rest of this and the next section is as follows. First, we describe how it is possible for FPDeep to implement a very deep, highly fine-grained pipeline, starting with a data dependency analysis of CNN training. Second, we present the mathematical model that is the basis for the mapping of CNN training to the distributed FPGA clusters. In the next section we describe the methods for partitioning and mapping and how they ensure balanced loads. We follow by describing how to allocate CNN weights among the FPGA devices so that each node's on-chip memory requirement is balanced and use of off-chip memory avoided.
Deep Fine-Grained Pipeline
The pseudo code of the convolution layer is shown in Algorithm 1. At each layer, IC feature maps are convolved by OC sets of IC × KW × KH convolution kernels. For each kernel set, the convolution is performed by sliding the filters across feature maps. At each location, the resulting products of parameters and covered activations are summed, giving one activation at an output feature map. After OC iterations, every output feature map receives a new activation.
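As an illustration of the loop nest described above, the following sketch implements a naive convolution layer in plain Python. It is not a reproduction of Algorithm 1; the function name conv_layer, the nested-list data layout, and the omission of padding and bias are assumptions made for brevity.

```python
def conv_layer(inputs, kernels, stride=1):
    """Naive convolution layer (no padding, no bias).

    inputs:  nested lists indexed [ic][row][col]
    kernels: nested lists indexed [oc][ic][kh][kw]
    returns: nested lists indexed [oc][row][col]
    """
    ic_n, rows, cols = len(inputs), len(inputs[0]), len(inputs[0][0])
    oc_n, kh, kw = len(kernels), len(kernels[0][0]), len(kernels[0][0][0])
    ro = (rows - kh) // stride + 1
    co = (cols - kw) // stride + 1
    out = [[[0.0] * co for _ in range(ro)] for _ in range(oc_n)]
    for oc in range(oc_n):              # one kernel set per output channel
        for r in range(ro):             # slide the filters over the rows
            for c in range(co):         # ... and over the columns
                acc = 0.0
                for ic in range(ic_n):  # sum over all input channels
                    for i in range(kh):
                        for j in range(kw):
                            acc += (inputs[ic][r * stride + i][c * stride + j]
                                    * kernels[oc][ic][i][j])
                out[oc][r][c] = acc     # one activation of output map oc
    return out
```

Each innermost accumulation corresponds to one kernel position: the products of parameters and covered activations are summed into a single output activation.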
To illustrate data dependencies during training, we use as an example two CONV layers with 3 × 3 kernels. The operations of these two layers' forward and backward propagation are shown in Fig.6(A). In forward propagation, a 7 × 7 feature map is fed into Layer 1 and a 5 × 5 feature map is generated. At Layer 2, the 5 × 5 feature map is convolved with the parameters to produce a 3 × 3 feature map. In backward propagation, the 3 × 3 error map is padded to 7 × 7 before it is fed to Layer 2. Next, the error map and corresponding parameters are convolved and another (5 × 5) error map is produced; this is used for Layer 2's weight/bias gradient calculation. At Layer 1, the 5 × 5 error map is padded to 9 × 9 and then convolved to 7 × 7.
Fig.6(B) depicts the data dependencies of forward and backward propagation during CNN training. In the forward propagation phase, the image is inferred through all layers. To determine the data dependency, we start from the four activations of the output feature map at the last layer, which are marked as black, red, blue, and yellow, respectively, and trace backwards to find the region of the input feature maps on which each depends. In the backward propagation phase, the calculated errors are propagated backward through the network. To calculate gradients at a particular layer, both the errors back-propagated from the next layer and the activations of that layer's feature maps are necessary. Hence, the feature maps generated in the forward propagation phase need to remain available awaiting backward propagation. As shown in Fig.6, activations and errors among CONV layers show only fine-grained dependencies. That is, to begin computing the value of a pixel in a layer, only a fraction of the pixels from the previous layer are needed. Therefore the computation of a layer can start much earlier, before the previous layer is completely done.
This provides the opportunity to process all CONV layers in parallel in a fine-grained pipeline.
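The map sizes quoted in the two-layer example, and the size of the region that a single output activation depends on, can be checked with a few lines of arithmetic. The sketch below assumes stride-1, unpadded forward convolutions and zero-padding of error maps by (kernel - 1) on every side; the helper names are illustrative.

```python
def conv_output_size(size, kernel=3):
    """Side length after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def padded_error_size(size, kernel=3):
    """Side length of an error map after zero-padding by kernel-1 on every side."""
    return size + 2 * (kernel - 1)


def receptive_field(num_layers, kernel=3):
    """Side length of the input region one output activation depends on (stride 1)."""
    return 1 + num_layers * (kernel - 1)


# Forward propagation: 7 x 7 -> 5 x 5 -> 3 x 3.
forward_sizes = [7]
for _ in range(2):
    forward_sizes.append(conv_output_size(forward_sizes[-1]))
print(forward_sizes)  # [7, 5, 3]

# Backward propagation: 3 -> padded 7 -> 5, then 5 -> padded 9 -> 7.
for err in (3, 5):
    padded = padded_error_size(err)
    print(err, padded, conv_output_size(padded))

# One activation needs a 3 x 3 patch one layer back and a 5 x 5 patch two layers back.
print(receptive_field(1), receptive_field(2))  # 3 5
```

Because an activation in Layer 2 depends on only 9 of the 25 activations produced by Layer 1, Layer 2 can begin work long before Layer 1 has finished.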
Fig.6(C) shows the traditional method of accelerating CNN training. First, the N channels of the feature maps are fed into convolution layer L1. Next, results from all M channels begin to be processed while the convolution kernel slides across the N-channel feature maps. Much storage capacity is needed to maintain all temporary feature maps. Clearly, this method is not efficient. The fine-grained alternative is shown in Fig.6(D): the calculation of an activation/error starts as soon as its dependent activations/errors are propagated from the previous/next layer. The basic processing unit of FPDeep is an activation/error of a feature/error map; this is in contrast to the traditional method's basic unit of the entire feature/error map. The result is both a large increase in parallelism through the added pipeline stages and a reduction in storage so that only on-chip memory is needed.
Figure 7: Compute capacity vs storage demand
Mathematical Model
We present a mathematical model based on the fine-grained pipelined mapping of CNN training to distributed FPGA clusters. Resource allocation is an optimization problem. The pipeline is constructed from Computation Engines (CEs), which are used to handle compute-intensive convolution operations, and buffers (Buf) to store CNN model weights and temporary feature maps. Convolution engines are composed of 2-D systolic arrays that consume the input features from shift-registers. Their design is similar to the ones in [14, 32]. The FPGAs' resources can be expressed as a tuple: (LUT, FF, BRAM, DSP). The objective is to maximize the cluster's throughput (T). The constraints lie in the limited hardware resources at each device, which are denoted as (LUT_max, FF_max, BRAM_max, DSP_max).
The number of CEs is denoted as N_CE. The overall throughput (T) and number of buffers (N_Buf) are both functions of N_CE. We build a fraction of the CEs (αN_CE) with hard DSP slices and the other CEs ((1 − α)N_CE) with LUTs/FFs. Similarly, some buffers (βN_Buf) are built with hard BRAMs and others ((1 − β)N_Buf) with LUTs/FFs. Equations (1)-(6) define the problem.
Subject to:
As Fig.7 shows, the computational capacity of the cluster increases linearly with the number of available devices. However, since FPDeep uses model parallelism, a single execution pipeline is distributed over all of the FPGAs. The implication for storage demand is also shown in Fig.7: storage demand per device decreases as the number of FPGAs increases. The gap between computational capacity and storage demand shows that computation is the bottleneck and therefore also the larger challenge for load balancing (except for very small clusters).
1) Partition by image's row/column (r/c). As shown in Fig.8, the input feature maps are partitioned in the r/c dimension and each device handles part of the image's convolution operations. A drawback of this method is that the CNN model weights need to be duplicated for each device. Also, during CNN training's back-propagation phase, the weight gradient descent results must be synchronized among all CNN model copies. This increases latency, which, in turn, increases the required memory since more activations must be stored. Furthermore, the activations and errors at the edges of these segments need to be shared by the neighbors, which makes the inter-device communication more complex.
2) Partition by convolution kernel width/kernel height (kw/kh). As shown in Fig.9, all input feature maps are broadcast to the FPGAs and the CNN weights are partitioned into four pieces. Each device generates partial results of the output features.
3) Partition by image input/output channel (ic/oc). As shown in Fig.10 and 11, each device executes the operations of a part of the input/output channels. Input feature maps, along with model parameters, are partitioned in the ic/oc dimension and allocated among FPGA devices. Each device generates partial results, and their sum gives the final output features. In this method, only one copy of the CNN model is stored. For most CNNs, there are hundreds of activation channels, resulting in sufficient parallelism. In short, partitioning by the images' input/output channels is our choice. In the following subsections, more details of the ic/oc partition method are presented.
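The property that makes partitioning by input channel work is that an output activation is a sum over input channels, so per-device partial sums can simply be added. The following sketch checks this for a single output activation; the flattened 3 × 3 windows, the integer test data, and the helper names are illustrative assumptions.

```python
def window_product(features, weights, channel):
    """Sum of products between one channel's window and its kernel."""
    return sum(f * w for f, w in zip(features[channel], weights[channel]))


def output_activation(features, weights, segments):
    """Accumulate one output activation from per-device partial sums.

    Each entry of segments is the set of input channels held by one device.
    """
    running = 0
    for segment in segments:
        # Partial result computed locally from this device's channels only.
        local = sum(window_product(features, weights, ch) for ch in segment)
        running += local
    return running


# Six input channels, each contributing a flattened 3 x 3 window (hypothetical data).
features = [[ch + i for i in range(9)] for ch in range(6)]
weights = [[(ch + 1) * (i % 3 - 1) for i in range(9)] for ch in range(6)]

unpartitioned = output_activation(features, weights, [range(6)])
three_devices = output_activation(
    features, weights, [range(0, 2), range(2, 4), range(4, 6)])
print(unpartitioned == three_devices)  # True
```

Each device needs only the weights for its own channels, which is why a single copy of the model suffices.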
Dataflow Analysis for CNN training
As shown in Fig.12(A) , an N layer CNN is mapped to an FPGA cluster with M devices. Each CNN layer contains
The computation capacity of each device is C operations per second. The goal is to allocate balanced computation workloads among devices and to minimize the idle stages of the whole fine-grained pipeline. Fig.12(B) zooms in on the CNN training procedure and turns the data flow graph into a directed acyclic graph (DAG). Sum operations highlighted by the red dotted boxes limit the scalability. Therefore, we transform the sum operations into several distributed reductions; the new DAG is shown in Fig.12(C). Continuing with Fig.12(C), the summation of all activation channels is cut into several pieces. Each piece only needs to add the local convolution result R_i to the previous node's intermediate result R_{i−1}. In Fig.12(D), the workloads, i.e., the arithmetic operations, of a whole network are partitioned and mapped to M FPGA nodes proportionally to their computation capacities (the number of DSP slices and LUTs which can be used for computation). Fig.13 shows the data streams in the FPGA cluster (forward propagation activations, forward propagation partial activations, backward propagation activations, and backward propagation partial activations). Note that a 1-D interconnect topology is sufficient.
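A workload-proportional mapping can be sketched by counting multiply-accumulate operations per convolution layer and giving each layer a matching share of the devices. The sketch below assumes identical devices and uses two hypothetical layer shapes; the function names are illustrative, and the exact allocation rule used by FPDeep may differ.

```python
from fractions import Fraction


def conv_macs(ic, oc, kh, kw, ro, co):
    """Multiply-accumulate operations of one convolution layer (forward pass)."""
    return ic * oc * kh * kw * ro * co


def device_shares(layer_macs, num_devices):
    """Devices per layer, proportional to workload (identical devices assumed)."""
    total = sum(layer_macs)
    return [Fraction(num_devices * macs, total) for macs in layer_macs]


# Hypothetical layers: (ic, oc, kh, kw, ro, co).
layers = [(3, 8, 3, 3, 8, 8), (8, 16, 3, 3, 4, 4)]
macs = [conv_macs(*layer) for layer in layers]
shares = device_shares(macs, num_devices=4)
print([float(share) for share in shares])
```

The shares are generally fractional, so some devices end up serving two adjacent layers while a large layer is spread over several devices. A constant factor applied to every layer, such as the extra operations for back-propagation, leaves the shares unchanged.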
1) ICP: The ICP method is illustrated in Fig.15(A). Layer 1 is partitioned and mapped to 4.8 devices. 192 input features and corresponding weights are partitioned into 5 segments including 4 segments each containing 40 features and 1 smaller segment containing 32 features. Each FPGA receives one of the five segments and calculates partial results of features for all output channels. Each complete output feature is calculated by summing up the related partial results from the 5 FPGAs.
2) OCP: The OCP method is shown in Fig.15(B). 256 output features are partitioned into 5 segments, including 4 segments each containing 53 output features and 1 smaller segment containing 44 features. Each FPGA is responsible for calculating a certain segment of output features. After joining the results of all 5 segments, the output feature channels of Layer 1 are complete. In CNN training, all activations need to remain available while waiting for back-propagation. Therefore, all 192 input feature maps are cached in every FPGA. This duplication leads to additional on-chip memory overhead. This defect of OCP does not exist in ICP, so FPDeep prefers to use ICP: OCP is only used when the number of input channels is too small to provide enough parallelism. For example, the first layer of AlexNet has only 3 channels of input features and 96 channels of output features, so OCP is used.
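The segment sizes quoted for ICP and OCP are consistent with a simple rule: every fully used device receives the floor of (channels / device share), and the partially used device receives the remainder. The sketch below applies that rule; the rule itself is inferred from the numbers in the text rather than stated by it, and the function name is illustrative.

```python
import math
from fractions import Fraction


def split_channels(num_channels, device_share):
    """Split channels over a fractional number of devices.

    device_share is given as a string such as "4.8" so that it is parsed exactly.
    """
    share = Fraction(device_share)
    per_device = math.floor(num_channels / share)  # channels per fully used device
    full_devices = math.floor(share)
    segments = [per_device] * full_devices
    remainder = num_channels - per_device * full_devices
    if remainder > 0:
        segments.append(remainder)                 # the partially used device
    return segments


print(split_channels(192, "4.8"))  # ICP: [40, 40, 40, 40, 32]
print(split_channels(256, "4.8"))  # OCP: [53, 53, 53, 53, 44]
```

Exact rational arithmetic is used so that the division by 4.8 does not depend on floating-point rounding.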
Partitioning and Mapping the Weights
Figs.16(A&B) show the number of weights and activations of two VGGNets. Observe that from the first to the last layer, the number of activations decreases while the number of weights increases. The number of activations decreases because the dimensions of the feature maps are reduced by the pooling layers. The number of weights increases because the number of input and output channels increases in the later layers.
Using clusters with a small number of FPGAs to accelerate VGG-16 and VGG-19, the memory demands of weights for the FPGAs allocated to the later layers increase to the point where the on-chip memories in each FPGA are not big enough to cache the allocated weights. To make it possible to map big networks to a small number of FPGAs, weight balancing can be used. Figs.16(C-E) show the weight balancing methodology. Simply, weights from the later layers are stored in FPGAs where there is room, even if those FPGAs are some distance away from where those weights will eventually be used. For example, the weights of layer 8 are stored in FPGAs 3 and 1. During computation, the weights stored in FPGA 1 are transferred to the destination FPGA 3 through the 1D network together with activations. Note that the transport of weights does not tighten the constraint of inter-FPGA communication. This is because the smaller number of activations in the later layers cancels out the added traffic for the weights. Our experiments demonstrate the benefit of this approach: only on-chip memory is needed for the CONV layers.
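The idea of parking overflow weights on FPGAs that have spare room can be sketched with a simple greedy placement. This is only an illustration under stated assumptions: the first-fit policy, the uniform per-FPGA capacity, and the abstract weight units are hypothetical and are not taken from the FPDeep mapping framework.

```python
def balance_weights(demand, capacity):
    """Greedy weight placement.

    demand[i] is the amount of weights FPGA i consumes; capacity is the
    on-chip weight storage of each FPGA (same unit). Returns, for every
    FPGA, a list of (host_fpga, amount) pairs saying where its weights live.
    """
    free = [capacity - min(d, capacity) for d in demand]
    placement = []
    for i, d in enumerate(demand):
        local = min(d, capacity)
        parts = [(i, local)]            # keep as much as possible locally
        overflow = d - local
        for host in range(len(demand)):
            if overflow == 0:
                break
            if host == i or free[host] == 0:
                continue
            moved = min(overflow, free[host])
            free[host] -= moved         # park the overflow on a host with room
            overflow -= moved
            parts.append((host, moved))
        if overflow:
            raise ValueError("weights do not fit in on-chip memory")
        placement.append(parts)
    return placement


# Hypothetical cluster: later layers (FPGA 2) need more weights than fit locally.
print(balance_weights(demand=[2, 4, 12], capacity=8))
```

In this example the last FPGA keeps 8 units locally and parks 4 units on FPGA 0, from where they would be streamed over the 1D network when needed.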
Accelerator Implementation
We briefly sketch the accelerator implementation. FPDeep will be open-source and explained in detail in accompanying documentation.
As shown in Fig.17, each FPGA instance includes FP, WG, and EB modules, as well as a memory subsystem to cache weights, gradients, and activations. Each accelerator has 6 interconnection modules to communicate with its neighbors (this number is selected because it is available on many boards used for FPGA clusters and is sufficient for good scaling). An FPGA can be allocated to multiple layers. Implementations with FPGAs working on a single layer and on multiple layers are illustrated in Fig.17(A) and (C). Weights are transferred from the node where they are cached for weight load balancing to the node where they are used to compute the output features.
Data-path 6: The gradients of weights are transferred from the node where they are produced to the node where they are cached for weight load balancing.
Memory Subsystem
The memory subsystem includes BRAM-based modules storing features, weights, and gradients.
1. Feature RAM (FRAM) caches input features mapped to the target FPGA until they are consumed in back-propagation and provides input features as operands to the FP and WG modules.
2. Local Gradient Buffer (LGB) caches the gradients of the weights stored in LWRAM.
3. Balanced Gradient Buffer (BGB) caches the gradients of the weights stored in BWRAM. These gradients are generated by and transferred from the node where the corresponding weights are consumed.
4. Local Weight RAM (LWRAM) caches weights used as operands to produce the output features at the local FPGA.
Forward Propagation (FP)
The Line Buffer (LB) reads input features from the FRAM and feeds them to the Convolution Engines (CE) which are implemented as 2-D systolic arrays. The CEs perform convolutions with weights from the WRAM and input features from LB. In the Special Function Unit (SFU), the output features are activated, normalized, and sampled.
Error Back-Propagation (EB)
The EB module consumes errors of the next layer propagated by succeeding nodes and produces errors of the target layer. To calculate an error of a certain input feature map, errors of all output feature maps need to be convolved with respect to all weight filters. Errors of these input features are cached in the LB and propagated to the preceding node.
Weight Gradient Calculation (WG)
WG consumes errors of the next layer propagated from the succeeding neighbor and calculates gradients of weights and biases. To obtain gradients of weights, errors of output feature maps are used as a filter and convolved with input feature maps cached in the FRAM. Gradients are cached in the Gradient Buffer and used to update weights in WRAM.
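The three computations performed by the FP, EB, and WG modules can be written down for a single channel in a few lines. The sketch below is a software reference only, assuming a square kernel, stride 1, and no padding in the forward pass; the function names are illustrative and do not correspond to the RTL modules.

```python
def correlate(x, k):
    """Valid 2-D sliding-window sum of products (kernel is not flipped)."""
    kh, kw = len(k), len(k[0])
    ro, co = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * k[i][j] for i in range(kh) for j in range(kw))
             for c in range(co)] for r in range(ro)]


def pad(x, p):
    """Zero-pad a 2-D map by p on every side."""
    width = len(x[0]) + 2 * p
    border = [[0] * width for _ in range(p)]
    body = [[0] * p + list(row) + [0] * p for row in x]
    return border + body + [[0] * width for _ in range(p)]


def flip(k):
    """Rotate a kernel by 180 degrees."""
    return [row[::-1] for row in k[::-1]]


def forward(x, w):
    """FP: output features from input features and weights."""
    return correlate(x, w)


def error_backprop(err_out, w):
    """EB: errors of the input map from errors of the output map (square kernel)."""
    return correlate(pad(err_out, len(w) - 1), flip(w))


def weight_gradient(x, err_out):
    """WG: the output errors act as a filter over the cached input features."""
    return correlate(x, err_out)


x = [[1, 2, 0, 1], [0, 1, 3, 2], [2, 1, 0, 1], [1, 0, 2, 3]]
w = [[1, 0, -1], [2, 1, 0], [0, -1, 1]]
err = [[1, -2], [3, 1]]
print(weight_gradient(x, err))  # 3 x 3, same shape as w
print(error_backprop(err, w))   # 4 x 4, same shape as x
```

The weight gradient has the shape of the kernel and the back-propagated error has the shape of the input map, which is why WG must read the input features cached in the FRAM.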
Experimental Results
In this section, we describe experiments performed to evaluate the efficiency of FPDeep. First, we evaluate the design of each FPGA generated by FPDeep based on real implementations. Second, we evaluate the performance of the cluster-level design of FPDeep based on results of the FPDeep simulator.
Single FPGA Implementation and Experiments
Results in this subsection are generated with a single FPGA board. We use FPDeep to map three of the most widely used CNNs (AlexNet, VGGNet-16, and VGGNet-19) onto a 15-FPGA cluster. For each network, the RTL generator creates 15 bitfiles, one for each FPGA. The experimental setup is shown in Fig.18.
We evaluate the design of each FPGA separately with a Xilinx VC709 board (Virtex-7 XC7VX690T) and gather resource utilization, throughput, and bandwidth requirements. Stochastic rounding is used during low-precision fixed-point training. In [33], it is shown that CNNs can be trained using only 16-bit fixed-point numbers when using stochastic rounding, with no degradation in classification accuracy. Table 2 shows a performance and power efficiency comparison among the Titan X GPU [6], the Tesla K80 GPU [34], a previous FPGA implementation [6], and our work. FPDeep provides performance which is 5× higher than previous FPGA work and comparable to the Titan X GPU. We evaluate energy efficiency with respect to GOPs/J. FPDeep provides 8.8× better energy efficiency than Titan X and 5.6× better than the previous FPGA work. Compared with the K80, FPDeep provides 5.7× better energy efficiency.
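Stochastic rounding, mentioned above, rounds a value up or down at random with probabilities chosen so that the result is correct on average. The sketch below is a software illustration only; the split of the 16-bit word into 8 fractional bits and the saturation behaviour are assumptions, not details taken from the FPDeep hardware.

```python
import math
import random


def stochastic_round(value, frac_bits=8, word_bits=16, rng=random):
    """Round value to signed fixed point with frac_bits fractional bits.

    The value is rounded up with probability equal to its fractional
    remainder, so the expected result equals the input (before saturation).
    """
    scale = 1 << frac_bits
    scaled = value * scale
    lower = math.floor(scaled)
    if rng.random() < scaled - lower:
        lower += 1
    # Saturate to the representable range of a signed word_bits integer.
    lo, hi = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return max(lo, min(hi, lower)) / scale


rng = random.Random(0)
samples = [stochastic_round(0.3, frac_bits=2, rng=rng) for _ in range(10000)]
# Every sample is 0.25 or 0.5, yet the mean stays close to 0.3.
print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time with 2 fractional bits, so small updates would be systematically lost; the randomized version preserves them in expectation.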
Cluster-Level Performance Evaluation
We use a cycle-accurate simulator to evaluate cluster-level performance. AlexNet, VGG-16, and VGG-19 are mapped onto clusters of sizes 5 to 85. To demonstrate that the workload among FPGAs remains balanced in different sized clusters, we present the proportions of idle stages. Figs.20(B)(E)(H) show that the proportion of idle stages is always under 5%. When the number of FPGAs is more than 30, the proportion of idle stages is stable, fluctuating only between 0.5% and 1%. Generally, as the number of FPGAs increases, the proportion of idle stages decreases. The reason is that during IFP and OFP, the number of DSPs allocated to each layer is rounded to a multiple of K × K. With more FPGAs and more DSP resources, the effects of rounding error are lessened. As each transceiver (of that generation) can reach a maximum rate of 28 Gb/s, using 6 transceivers per FPGA achieves this number [36, 37, 38].
Since high-end FPGAs frequently have more than 50 transceivers, scaling to much larger clusters is possible. The reason that the bandwidth required by VGG-16 is larger than that required by VGG-19 is straightforward: VGG-19 has more layers and thus more workload. During partitioning, with the same overall hardware resources, each layer of VGG-19 is allocated fewer resources. Thus, fewer batch (B_f, B_w, B_e) features in each layer can be computed and transferred in parallel, which results in a smaller bandwidth requirement.
Comparing AlexNet with VGGNets, we observe that AlexNet requires less bandwidth than both VGGNets, despite having a lower workload. Using an 85-FPGA cluster as an example, the communication bottleneck of all three networks is located at the 2nd layer. For AlexNet, K and F_i of layer 2 are bigger than those of the VGGNets, which means that computing each output feature requires more resources. Although the 2nd layer of AlexNet is allocated more resources, B of AlexNet is still smaller than that of the other two networks. The B_e values of the 2nd layer are 12, 16, and 13 for AlexNet, VGG-16, and VGG-19, respectively.
Figs.20(C)(F)(I) show the number of epochs that can be trained per hour. FPDeep provides linear speedup of training per epoch. As hybrid model/layer parallelism does not constrain the choice of mini-batch size, the optimal learning rate and mini-batch size can always be applied in SGD, leading to the minimal number of epochs needed to be trained for a certain accuracy. Hence, linear speedup of training per epoch results in linear speedup of CNN training.
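Two pieces of arithmetic in this discussion can be made explicit: the DSPs lost when an allocation is rounded to a multiple of K × K, and the peak aggregate rate of six 28 Gb/s transceivers. The sketch below assumes that rounding is downward and uses hypothetical DSP allocations; it is not the model used by the FPDeep simulator.

```python
def usable_dsps(allocated, k):
    """Round a DSP allocation down to a multiple of k * k (assumed rounding rule)."""
    block = k * k
    return (allocated // block) * block


def rounding_loss(allocated, k):
    """Fraction of the allocated DSPs left unused by the rounding."""
    return (allocated - usable_dsps(allocated, k)) / allocated


def aggregate_link_rate_gbps(transceivers, rate_gbps):
    """Peak aggregate transceiver rate of one FPGA."""
    return transceivers * rate_gbps


# Hypothetical allocations for a layer with 3 x 3 kernels.
for allocated in (100, 1000):
    print(allocated, usable_dsps(allocated, 3), rounding_loss(allocated, 3))

print(aggregate_link_rate_gbps(6, 28))  # 168
```

At most K × K − 1 DSPs are lost per layer, so the worst-case relative loss shrinks as the allocation grows, which matches the trend of fewer idle stages in larger clusters.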
Conclusion
In this paper, we propose a framework, FPDeep, which maps training logic of CNNs to a multi-FPGA cluster efficiently with workload balancing, and also automatically generates RTL implementations for the target networks.
With FPDeep, clusters of FPGAs work in a deeply-pipelined manner using a 1-D topology; this enables the accelerators to map directly onto existing platforms, including Catapult, Catapult2, and almost any tightly-coupled FPGA cluster. FPDeep uses two mechanisms to facilitate high performance and energy efficiency: 1) various fine-grained partition and mapping strategies to balance workload among FPGAs, and 2) training of CNNs is executed in a fine-grained inter- and intra-layer pipelined manner, which reduces the time features must be cached while waiting for backward propagation, leading to reduced storage demand to the point where only on-chip memory is required for CONV layers. Experiments show that FPDeep has good scalability to a large number of FPGAs. The bottleneck is the inter-FPGA communication bandwidth. Using AlexNet and the VGGNets as benchmarks, with 6 transceivers per FPGA, FPDeep shows linearity up to 83 FPGAs. We evaluate energy efficiency with respect to GOPs/J and find that FPDeep provides 5.7x to 8.8x higher energy efficiency than GPU servers.
Currently, we are implementing FPDeep on the AWS FPGA cloud and will then make FPDeep open source.
Acknowledgement
