Intensive computation is entering data centers with multiple workloads of deep learning. To balance the compute efficiency, performance, and total cost of ownership (TCO), the use of a field-programmable gate array (FPGA) with reconfigurable logic provides an acceptable acceleration capacity and is compatible with diverse computation-sensitive tasks in the cloud. In this paper, we develop an FPGA acceleration platform that leverages a unified framework architecture for general-purpose convolutional neural network (CNN) inference acceleration at a data center. To overcome the computation bound, 4,096 DSPs are assembled and shaped as supertile units (SUs) for different types of convolution, which provide up to 4.2 TOP/s 16-bit fixed-point performance at 500 MHz. The interleaved-task-dispatching method is proposed to map the computation across the SUs, and the memory bound is solved by a dispatching-assembling buffering model and broadcast caches. For various non-convolution operators, a filter processing unit is designed for general-purpose filter-like/pointwise operators. In the experiment, the performances of CNN models running on server-class CPUs, a GPU, and an FPGA are compared. The results show that our design achieves the best FPGA peak performance and a throughput at the same level as that of the state-of-the-art GPU in data centers, with more than 50 times lower latency.
I. INTRODUCTION
Developing a hardware computing platform for convolutional neural network (CNN) based deep learning inference in a modern data center is a significant challenge [1] [2]. The introduction of AlexNet has enabled the creation of very deep CNNs with hundreds of layers that improve the accuracy [3] [4] [5]. Subsequently, most research has focused on reducing the redundancy and improving the efficiency to lower the computational cost in applications, such as low-bit representation [6] [7] [8], parameter pruning and sparsification [9], Winograd/Fast Fourier Transform (FFT)-based optimization, and implementation of novel layer types such as squeeze and shuffle [10]. Hence, in terms of the total cost of ownership (TCO), a reconfigurable accelerator with homogeneity on the hardware level is desirable, particularly when models are still evolving quickly. The field-programmable gate array (FPGA) is an ideal choice for maintaining the same infrastructure while providing customized computing architectures for different solutions. However, three aspects of FPGA architectures for online CNN inference still need to be discussed: achieving a higher throughput to lower the inference cost per image in CPU-FPGA-based servers; efficiently supporting diverse CNN workloads, which differ considerably in input image size, model topology, and basic operators; and exploring architectures that reduce the time needed to move from one model to another from hours to minutes.
In this paper, an FPGA acceleration platform based on supertile methods is proposed for general-purpose CNNs in a data center. The principal contributions are as follows:
1. A unified and scalable framework is proposed for a CNN accelerator for applications in a data center, in which basic supertile units (SUs) are scaled up with an interleaved task dispatch to achieve maximum performance and efficiently support different types of convolution.
2. A dispatching-assembling buffering model with broadcast cache (BC) sets is designed for a multi-SU architecture to scale up the reading and writing bandwidth.
3. Postprocessing architectures with logic sharing are proposed to support various operations and simplify the design. A two-dimensional filter processing unit (FPU) for a class of filter-like and pointwise operations is discussed to balance design complexity and performance.
The remainder of this paper is organized as follows: Section II discusses the related works. Section III provides the supertile-based design for the convolution computation. Section IV focuses on the design for non-convolution operators. Section V discusses the memory organization and implementation of the system. After a comparison with CPUs, a GPU and related works in Section VI, Section VII concludes the paper.
II. RELATED WORKS
FPGAs have been adopted by most cloud service providers, such as Amazon, Microsoft, Tencent, Baidu, Alibaba, and Huawei, as a reconfigurable heterogeneous computing resource. NN-based inference solutions on an FPGA have also been discussed in [8, 11-16]. Project Catapult and Brainwave from Microsoft are the most widely deployed examples of FPGAs in data centers on both the infrastructure and application levels [14] for search ranking, network acceleration, low-latency LSTM [11], and CNN processing [15, 16]. The solutions also provide interconnection across chips, cards, and servers and organize FPGAs at the data center scale into an acceleration pool. Baidu [12] developed a software-defined accelerator for matrix multiplication on an FPGA, in which the activation functions were reconfigurable depending on the case to fit more models.
In the recent literature, studies have proposed CNN accelerators on high-end FPGAs that enhance the processing abilities of inference for cloud services. The works in [17] and [18] used an OpenCL-designed architecture to accelerate AlexNet and VGG on Arria 10. A reusable CNN engine with a unified framework and a scalable PE array was proposed in [19], which provided an end-to-end solution for deploying CNN models from Caffe onto an FPGA. The motivation addresses the gap between deep learning researchers and hardware well, but there is still room to improve the performance and resource utilization. The work in [20] proposed a layer-based pipeline structure built by automated tools for both edge and cloud applications, which provided a deeply optimized architecture for different models.
Previous solutions on high-end FPGAs have the following drawbacks:
1) Limited architecture scalability: A number of studies have discussed solutions on middle/low-end FPGAs with dedicated off-chip memory. When on-chip resources, such as digital signal processors (DSPs), grow from hundreds to thousands while the off-chip bandwidth remains unchanged, the scalability of the processing elements and memory system has not been sufficiently discussed.
2) Limited types of operators: Various kinds of convolution and non-convolution operators are emerging in CNN-model design, and only a few types of operators have been given significant attention in FPGA-based architecture design.
3) Higher deployment cost: Instead of a specific model, inference tasks involve many types of CNN networks. Deploying a new model onto an existing CNN architecture without general-purpose orientation design would cost extra development time. Although exploration methods with automatic tools may perform well for bottleneck analysis and resource scheduling, the time cost associated with re-synthesis in the deployment remains.
In this paper, we address the drawbacks above. We discuss a scalable solution for both the computation and the memory architecture on high-end FPGAs, and we reduce the deployment cost for different models with a general-purpose design.
III. UNIFIED COMPUTING ENGINE FOR CONVOLUTION
Like multi-pumping used in [21], the supertile method proposed in [22] runs the DSP systolic array at twice the clock rate of the surrounding system logic. This approach has three benefits: 1) extensive use of built-in DSP cascades enables the systolic array to operate at maximum throughput while consuming little fabric resources; 2) the DSP is efficiently used as both a multiplier and an adder; and 3) the same input data is reused and multiplied by at least two different weights from the local weight buffer in each DSP supertile. However, [22] focused on the supertile model and computing behavior from the DSP to a 2D processing array. Cross-array processing, task mapping and scheduling, adaptation to various types of convolution, and on-chip memory design remain open. In this section, we focus on the challenges of scaling up a supertile unit (SU) to multiple units and supporting different types of convolution.
Fig. 1(b) shows the structure of the enhanced processing element (EPE), which is designed to fit the physical resource layout of the Xilinx UltraScale FPGA in Fig. 1(a) for better timing performance. An EPE running at the double clock rate increases the bandwidth demand for both activations and weights. Therefore, in each EPE, small distributed RAMs cache weights in ping-pong mode for fast responses. While one of the weight caches supplies weights for computing, the other waits for a weight update, so that the weight transfer overlaps with the computation. The weight cache also contains ping-pong buffers running at twice the clock speed to provide weight data to the DSP. The activation input is shared not only within the local EPE but also with all EPEs in the same row. When the EPEs are spread in the form of an m × n array, this array becomes an SU, as shown in Figs. 1(c) and (d).
A. Supertile Unit
When the data from a ks × ks sliding window of a feature map are streamed into a row of EPEs as a 1D vector, the corresponding weights are fetched from the weight caches for dot-product operations. The weight-activation products in the same position of the sliding window from different channels in the EPEs along the same column are summed and then produced at the top of each column. Because EPEs run at the double clock rate, two results from two kernel groups are stored into two output buffer blocks every ks × ks cycles.
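The column-wise accumulation described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function name `su_window_dot` and its list-based interface are hypothetical, each EPE row is taken to hold one input channel, and the window of each channel is flattened into a 1D vector that is consumed one element per system-clock cycle.

```python
def su_window_dot(activations, weights_a, weights_b):
    """Behavioral model of one SU column for one sliding-window position.

    activations: list of rows (one per input channel / EPE row); each row is
                 the flattened window, streamed one element per cycle.
    weights_a, weights_b: same shape as activations; the two kernel groups
                 served by the double-pumped DSPs of this column.
    Returns the two column sums, one per kernel group.
    """
    out_a = 0
    out_b = 0
    cycles = len(activations[0])
    for t in range(cycles):                      # one system-clock cycle per window element
        for row in range(len(activations)):      # EPE rows along the column
            x = activations[row][t]
            # Double clock rate: the same activation meets two weights per cycle.
            out_a += x * weights_a[row][t]
            out_b += x * weights_b[row][t]
    return out_a, out_b


if __name__ == "__main__":
    acts = [[1, 2, 3, 4], [1, 1, 1, 1]]          # 2 channels, 2 x 2 window
    wa = [[1, 0, 0, 1], [2, 2, 2, 2]]
    wb = [[1, 1, 1, 1], [0, 0, 0, 1]]
    print(su_window_dot(acts, wa, wb))
```

The model ignores pipelining and the DSP cascade timing; it only shows which products end up in which of the two output buffer blocks.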
B. Scaled-up SU
We organize the DSPs into two levels, with the first level being the SU and the second level crossing the SUs. Different from [22], we set n = 16 and m = 32 so that each SU contains 512 DSPs, for two reasons: from the model perspective, the number of channels of most input and output feature maps is a multiple of 32 in [3-5, 25-26, 28], and from the hardware perspective, the input and output data paths prefer a matched data bit width in the loop.
Two challenges exist when putting SUs into practice. First, directly deploying individual batch-based tasks onto SUs would introduce more resource cost in terms of both memory and bandwidth. A proper method for the task partition should be explored to use all SUs simultaneously and efficiently. Second, the input and output data bandwidth is multiplied when multiple SUs are applied, so a dedicated design for data buffering is required. We propose the interleaved task dispatching method and a dispatching-assembling buffer model to solve these problems, as shown in Fig. 2.
Interleaved task dispatching explores column-based parallelism for fine-grained task partitioning, as shown in Fig. 2(a). The same 2n kernel groups are loaded into four SUs for weight sharing before the convolution. Initially, the vectors from successive sliding windows are sent to the four SUs. The activations from the same IFT feed into the SUs, but the window locations are different. The convolutions of the four sliding windows are processed simultaneously, which provides parallelism in the processing of one row.
Each SU calls for its exclusive reading and writing memory bandwidth, and a dispatching-assembling buffer model is proposed, as shown in Fig. 2(b). In the case of four processing paths, the data buffered in the input buffer (IB) are shared among the paths. Each path contains a broadcast cache (BC) set as the local data cache, an SU as the processing unit, and an output buffer (OB) set providing writing bandwidth and buffering temporary convolution results. Under the control of interleaved task dispatching, the four paths work synchronously on the same convolutional task. When the output tensor is ready after convolution, an assemble reader reads the data distributed across the OB sets, reorders the data, and writes them back to the IB for the next convolution. Note that the IB provides only one-fourth of the total input bandwidth for the SUs. To bridge this gap, multiple BC sets buffer the same temporary tensor dispatched from the IB and output the data from different sliding windows for the SUs, as further discussed in Section V.
In this way, sliding-window-based tasks are dispatched onto multiple SUs, and the memory bottleneck in data reading and writing is avoided. Finally, up to 4096 DSPs are organized as 8 SUs and are shaped as 3D tensor processing units of two CNN engines on the two dies of the KU115. The four SUs in each engine share the same IB, kernel load controller, and kernel data, thus allowing the deployment of more DSPs on one convolution layer and reducing the processing time by a factor of four.
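As a rough illustration of interleaved task dispatching, the sketch below assigns the sliding-window positions of one output row to four SUs in round-robin order. The function names and the round-robin rule are assumptions made for illustration; the text only states that successive sliding windows are sent to the four SUs.

```python
def dispatch_windows(out_width, num_sus=4):
    """Assign the window positions of one output row to SUs in an interleaved way.

    Returns one list of window indices per SU (assumed round-robin order).
    """
    schedule = [[] for _ in range(num_sus)]
    for col in range(out_width):
        schedule[col % num_sus].append(col)
    return schedule


def steps_per_row(out_width, num_sus=4):
    """Number of synchronous steps needed to finish one output row."""
    return -(-out_width // num_sus)   # ceiling division


if __name__ == "__main__":
    print(dispatch_windows(10))   # window indices handled by SU0..SU3
    print(steps_per_row(56))      # a 56-wide output row on four SUs
```

With four SUs working on neighbouring windows of the same row, a row of width W takes about W/4 steps instead of W, which is the factor-of-four reduction mentioned above when W is a multiple of four.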
C. Tile-Based Slice-Loop-Hiding Cross Input Feature Map
A tile is defined as a sub-tensor with a smaller size in H and W but the same size in Cin compared with the original tensor (Fig. 1(c)). A slice is a sub-tensor with the same H and W that contains only a few channels of Cin, like the IFT with m channels in Fig. 1(d). Instead of processing each slice separately with its own commands, a slice-loop controller is placed between the command decoder and the SUs' controller. It handshakes with both, controls the slice loop, and updates a few address parameters automatically, with the following benefits: 1) The on-chip memory is used efficiently, and no extra storage capacity is needed to buffer the results between slices. 2) The command efficiency is improved, and the total command length and the command loading and decoding times are reduced. 3) The idle time of the SUs is minimized, and the next-slice processing is triggered immediately in the local controller instead of via handshaking with the global controller.
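The slice loop can be pictured as an accumulation over groups of m input channels. The following sketch is a software analogy only: `slice_ranges` and `conv_point_by_slices` are hypothetical names, and the running partial sum stands in for the temporary results kept in the OB between slices.

```python
def slice_ranges(c_in, m=32):
    """Split c_in input channels into consecutive slices of at most m channels."""
    return [(start, min(start + m, c_in)) for start in range(0, c_in, m)]


def conv_point_by_slices(window, kernel, m=32):
    """Compute one output value slice by slice.

    window[c][k] and kernel[c][k] hold the flattened window and weights of
    channel c. The partial sum is updated after every slice, mirroring the
    accumulation of temporary results between slices.
    """
    partial = 0
    for lo, hi in slice_ranges(len(window), m):
        slice_sum = 0
        for c in range(lo, hi):
            for x, w in zip(window[c], kernel[c]):
                slice_sum += x * w
        partial += slice_sum        # accumulate with the previous slices
    return partial


if __name__ == "__main__":
    print(slice_ranges(96))         # three full slices for m = 32
    win = [[1, 2]] * 70
    ker = [[1, 1]] * 70
    print(conv_point_by_slices(win, ker))
```

In the hardware, the loop over slices is what the slice-loop controller hides from the command stream: one command covers all slices, and only the address parameters change between iterations.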
D. Efficient Processing for Different Types of Convolutions on SUs
A standard convolution with different parameters, together with various types of convolutions, appears in CNN-model evaluation, wherein the utilization and real performance vary significantly [20, 34] . We will discuss methods to map the convolutions onto this unified SU-based architecture efficiently, which include the following special cases.
The first layer of a standard convolution has only three channels, and only 3/m DSP-rows of each SU can be used without optimization. We partition the data within a sliding window of a channel into pieces and send them into different EPE rows with the help of the parameters window_pos_begin and window_pos_end. Thereafter, the partition results of a sliding window are added together with cascade adders between the EPEs in a column. The number of pieces (NP) is defined as = max{ , × 3, × }, where ≤ . For example, when m = 32 and the kernel size (ks) of the first layer is 7 × 7 with 3 channels, the computation proceeds as = 1 × 7 with 21 channels to improve the utilization of the SUs from 3/32 into 21/32.
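The 7 × 7 example can be checked with a short sketch. It assumes that each channel's window is split row-wise into seven 1 × 7 pieces and that each (channel, piece) pair occupies one EPE row; this reading is inferred from the 21/32 figure in the example and is not a statement of the exact partition rule. The function names are hypothetical.

```python
def partitioned_window_dot(window, kernel):
    """Dot product of one window computed piece by piece.

    window, kernel: nested lists indexed as [channel][row][col].
    Each (channel, row) pair is treated as one piece mapped to one EPE row;
    the piece results are then added, as the cascade adders in a column would.
    """
    piece_sums = []
    for c in range(len(kernel)):
        for r in range(len(kernel[c])):
            piece = sum(w * x for w, x in zip(kernel[c][r], window[c][r]))
            piece_sums.append(piece)
    return piece_sums, sum(piece_sums)


def row_utilization(channels, pieces, m=32):
    """Fraction of the m EPE rows in use when each channel is split into pieces."""
    used = channels * pieces
    if used > m:
        raise ValueError("partition does not fit into m EPE rows")
    return used / m


if __name__ == "__main__":
    ks, ch = 7, 3
    win = [[[1] * ks for _ in range(ks)] for _ in range(ch)]
    ker = [[[2] * ks for _ in range(ks)] for _ in range(ch)]
    pieces, total = partitioned_window_dot(win, ker)
    print(len(pieces), total)
    print(row_utilization(3, 1), row_utilization(3, 7))
```

The sum over pieces equals the full 7 × 7 × 3 dot product, so the partition changes only how the work is spread over EPE rows, not the result.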
Kernel fusion for the non-first layer of a standard convolution with ks = 1: A convolution with 1 × 1 kernels usually comes with lower activation reuse, which shifts the bound from computation to memory. In contrast to the kernel partition, kernel fusion is applied. When the kernel group number is Cout and Fu = ⌈Cout / 2n⌉, where Fu ∈ [1, 16], each weight buffer with a depth of 16 can buffer Fu weights. For example, when n = 16 and a kernel tensor of size 1 × 1 × Cin × 384 is convolved with an input tile tensor, Fu = ⌈384 / 32⌉ = 12, and weight buffer A at EPEij in Fig. 1(d) caches 12 kernels with the indices {Cin = i, Cout = j + Sn × 32}, where Sn is the slice index with a value of 0, 1, ..., 11. When processing is enabled, the activation sent into the SUs is updated every 12 clock cycles and shared with the 12 weights in each weight buffer at each EPE. After accumulation with the cascaded adder, the data at the end of each EPE column are updated every clock cycle and written into the OB with the corresponding OFT address.
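The arithmetic of the kernel-fusion example can be reproduced with a few lines. This sketch assumes that the fusion factor is the ceiling of Cout over 2n, which matches ⌈384 / 32⌉ = 12 in the example, and that the kernels cached in one weight buffer are spaced 2n output channels apart; the second point is an interpretation of the index set in the text, and both function names are hypothetical.

```python
import math


def fusion_factor(c_out, n=16, depth=16):
    """Number of 1 x 1 kernels fused into one weight buffer (assumed ceil(c_out / 2n))."""
    fu = math.ceil(c_out / (2 * n))
    if not 1 <= fu <= depth:
        raise ValueError("fusion factor exceeds the weight buffer depth")
    return fu


def fused_kernel_indices(j, c_out, n=16):
    """Output-channel indices assumed to be cached in weight buffer A of column j."""
    fu = fusion_factor(c_out, n)
    indices = [j + sn * 2 * n for sn in range(fu)]
    return [idx for idx in indices if idx < c_out]


if __name__ == "__main__":
    print(fusion_factor(384))                 # fused kernels per weight buffer
    print(fused_kernel_indices(0, 384))       # kernels seen by column 0
```

Because one activation is held for Fu cycles and reused against Fu cached weights, the activation bandwidth demand drops by the same factor, which is the point of fusing 1 × 1 kernels.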
Nonstandard convolution types comprise the transposed convolution [35] , dilated convolution [36] , and depthwise convolution [27] . Both transposed and dilated convolutions can be computed as a standard convolution layer, except upsampling should be performed before the transposed convolution. Depthwise convolution is performed in an FPU, which will be discussed in Section IV A.
IV. POSTPROCESSING DESIGN
Convolution accounts for most of the computation in a CNN model but comes in only a few types, whereas non-convolution operations have low computational cost but account for most of the operator types across models (e.g., pooling, normalization, activation functions, and elementwise operations between branches [23-26]). We divide these operations into three classes: 2D filter-like operations, in which a sliding window traverses the height (H) and width (W) of a channel of the feature map; operations across channels (C); and operations that can be fused with convolution to reduce the memory access. We design a general-purpose processing unit for the first category and a custom module for operators in the second category (such as LRN), because the latter do not frequently appear in current inference tasks. The operations in the third category are processed with operator fusion.
A. 2D Processing Unit for Filter-Like Operations
When designing the circuit for 2D filter-like operations, we consider the following aspects. The first is compatibility and configurability to support current and potential operators. The second is reducing the hardware complexity, which avoids multi-module scheduling, maintaining a large parameter field, and providing a dedicated data load and storage path with a complex multiplexer for each module. The third is the resource limitation. SUs consume most of the resources of a specific physical region after mapping, as shown in Figs. 1(a) and (b), and only a few resources outside the region are available. This limitation calls for more sharing of functional logic. For these reasons, an FPU is proposed.
We observe that filter-like non-convolution operators share common computing and data access styles: they traverse the feature map with a sliding window (kernel) on each channel, and no operations exist across channels, as in max/average pooling. Most of the parameters are also similar, such as the source-address reading parameters StartAddr/Stride/Pad/WindowSize/FeatureMapSize/ChannelNum and the destination-address writing parameters StartAddr and FeatureMapSize. Pointwise operations, such as relu/relu6/linear transformations, can be considered special cases in which the kernel size is equal to 1. A depthwise convolution [27] can also be performed on the proposed FPU because no cross-channel addition is needed. Better performance can be achieved when standard and depthwise convolutions are interleaved [27] [28], because they can be performed with the SUs and the FPU, respectively, in parallel.
Fig. 3 shows the micro-architecture of the FPU. It is mainly divided into the function sharing part (FSP), which contains the functional modules common to the processing of different operators, and the worker part (WP), which can be changed or extended with more processing modules depending on the needs. The FSP fetches the micro-commands from the ucmd buffer [37] and decodes them into parameters of the categories DataLoad/DataStore/FuncSet/ScalarValue. Then, the tensors are read continuously from the OB by the address generation module, streamed into the WP for processing, and finally written back into the OB. When the channel number is more than 2n, the slice-loop-control module is enabled for better command efficiency, as with slice-loop-hiding for convolution. In the WP, 2n channels of ALUs process the data stream in SIMD mode. The reconfigurable ALU is designed with a cascaded pre-multiplier, mid-adder/comparator, and final multiplier, each of which can be enabled or bypassed depending on the function setting in the micro-command. To improve the efficiency and save memory bandwidth, these stages are pipelined and provide a maximum of three operations per clock. Note that the module for the kernel load control is located next to the slice-loop controller and provides the weights when depthwise convolution is enabled. For max-pool, the mid-comparator is enabled, and the others are bypassed. In the mid-comparator, the maximum value is saved into the register after each comparison, and the window-end signal triggers the output and resets the register. For avg-pool, the adder is enabled, and the final multiplier performs the division.
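The three-stage ALU can be mimicked in software to show how one datapath covers several operators. The sketch below is a simplified single-lane model with hypothetical names (`alu_window`, `max_pool`, `avg_pool`); it uses floating-point values for readability, whereas the hardware works on fixed-point data, and it does not model the pipeline timing.

```python
def alu_window(values, pre_scale=None, mode="max", post_scale=None):
    """Process one sliding window as a stream through a three-stage ALU model.

    pre_scale:  optional pre-multiplier factor (bypassed when None).
    mode:       "max" uses the mid-comparator, "add" uses the mid-adder.
    post_scale: optional final-multiplier factor (bypassed when None).
    values must be non-empty.
    """
    if mode not in ("max", "add"):
        raise ValueError("mode must be 'max' or 'add'")
    acc = None                                   # register cleared at window start
    for v in values:
        if pre_scale is not None:
            v = v * pre_scale                    # stage 1: pre-multiplier
        if acc is None:
            acc = v
        elif mode == "max":
            acc = max(acc, v)                    # stage 2: comparator keeps the maximum
        else:
            acc = acc + v                        # stage 2: adder accumulates
    if post_scale is not None:
        acc = acc * post_scale                   # stage 3: final multiplier
    return acc                                   # emitted on the window-end signal


def max_pool(window):
    return alu_window(window, mode="max")


def avg_pool(window):
    # Division by the window size is done as a multiplication by its reciprocal.
    return alu_window(window, mode="add", post_scale=1.0 / len(window))


if __name__ == "__main__":
    print(max_pool([1, 5, 3, 2]))
    print(avg_pool([1.0, 2.0, 3.0, 6.0]))
    print(alu_window([4], pre_scale=3))          # pointwise scaling as a window of size 1
```

A pointwise operator appears here as a window of size 1, which is the special case noted in the text.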
B. Operator Fusion
Each operation across the tensor would introduce extra memory access, which would suspend other memory-access-dependent operations. For this reason, we fuse some pointwise operators with the convolution even though they are already supported by the FPU. Fig. 4 shows a cascade of four kinds of operations at the end of the column of each SU. The first adder adds the output of the SU to the temporary results in the OB from the previous slice. The second adder performs element-wise addition across branches, which is widely used in residual blocks [23]. Relu is then performed, followed by dynamic precision data quantization [7]. Finally, the results are written into the OB. These operations can be individually enabled or bypassed with control instructions.
Fig. 5 shows a heterogeneous server architecture with the CPU + FPGA, including a server system with the CPU and memory, two channels of DDR4 on an FPGA-PCIE card, and two CNN engines in the FPGA. The memory/buffer is shown in green, and the control/processing logic is shown in blue.
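The fused cascade at the SU output (Fig. 4) can be sketched as a chain of optional stages. In this illustration, the function name and flags are hypothetical, values are plain integers, and the quantization step is modeled as a right shift followed by saturation to 16 bits; that is an assumed stand-in for the dynamic precision quantization of [7], not its actual definition.

```python
def fused_output(su_out, temp=0, branch=0, use_temp=True, use_branch=True,
                 use_relu=True, shift=None, bits=16):
    """Apply the fused post-convolution stages to one integer output value.

    temp:   partial result from the previous slice (first adder).
    branch: value from another branch, e.g. a residual path (second adder).
    shift:  right shift used as a simple quantization model (bypassed when None).
    """
    x = su_out
    if use_temp:
        x += temp                       # accumulate with the previous slice
    if use_branch:
        x += branch                     # element-wise add across branches
    if use_relu:
        x = max(x, 0)                   # relu
    if shift is not None:
        x >>= shift                     # drop low-order bits (assumed quantization)
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        x = max(lo, min(hi, x))         # saturate to the target bit width
    return x


if __name__ == "__main__":
    print(fused_output(100, temp=20, branch=-200))        # relu clamps the negative sum
    print(fused_output(1000, temp=24, shift=2))           # 1024 shifted right by 2
    print(fused_output(1 << 20, use_relu=False, shift=2)) # saturates at the 16-bit maximum
```

Each stage can be switched off independently, which mirrors the enable/bypass control described above and avoids a separate pass over the tensor for every pointwise operator.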

A. System Overview
Here, 2048 DSPs shaped as four SUs running at the double clock speed provide enough computing capacity in each engine, while FPU and operator fusion at the output of each SU integrate most of the non-convolution operators and simplify the design. Only the modules customized for LRN remain.
B. Memory Organization
The on-chip memory is mainly divided into IB and OB sets as shown in Fig. 5(b) . The IB is shared globally among multiple SUs. The OB sets are placed at the output of each SU and each set consists of 2n components. Each component provides an exclusive read and write port for one SU's column. The OB sets are designed in the form of a ping-pong structure so that when n = 16, a 64 GB/s on-chip reading and writing bandwidth can be provided for both SUs and the FPU, allowing them to run in parallel to overlap the convolution and filter-like operations.
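One way to arrive at a figure of 64 GB/s is sketched below. It assumes 16-bit words, 2n = 32 OB components per SU, four SUs per engine, and the 250 MHz clock of the non-EPE logic reported in the experimental setup; the text does not spell out this derivation, so the calculation is only a plausibility check, and the function name is hypothetical.

```python
def ob_bandwidth_gbytes(n=16, bits=16, freq_hz=250e6, num_sus=4):
    """Aggregate OB bandwidth in GB/s for one direction (read or write).

    Assumes each of the 2n components per SU moves one word per clock cycle.
    """
    components = 2 * n
    bytes_per_cycle = components * bits // 8 * num_sus
    return bytes_per_cycle * freq_hz / 1e9


if __name__ == "__main__":
    print(ob_bandwidth_gbytes())   # bandwidth under the stated assumptions
```

Under these assumptions, each SU column pair contributes 2 bytes per cycle, and the four OB sets together move 256 bytes per cycle.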
C. Broadcast Cache
When four SUs operate together, each consumes an input bandwidth of 16 × 32 × f bit/s in Fig. 2(b), where f is the frequency of the non-EPE logic, so that up to 4 × 512 × f bit/s of bandwidth is needed. However, the output bandwidth of the IB is only 512 × f bit/s. The BC is designed to bridge this gap: the BCs buffer the same input feature maps but output the data from different sliding windows for column-based parallelism. The BC at the input port of each row of the SU is a circular buffer that updates the data continuously. Data that have been used are overwritten with data from the following rows. The window in each BC slides along the row with a step size of 4 × Convolution-Stride but starts at a different position for each SU. Fig. 6 shows the behavior of one BC with a 3 × 3 sliding window and Convolution-Stride = 1. In this case, the data within rows 3 to 7 are buffered in the cache, and the window moves with its center on row 4 and Window-Stride = 4. After the computation for the last sliding window position of row 4, the window center moves to the first position of row 5, and the BC starts to load the data of row 8 to overwrite the data of row 3. To accommodate enough rows to fit the sliding window and leave an extra row as the margin, the cache should buffer at least Kernel-Size + 1 rows of data. On the basis of the tile partition method in Section III C, the width of the tile of the input feature map can be flexibly narrowed to buffer more rows when a larger kernel size is used.
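The circular-buffer behavior of one BC can be illustrated with a small model. The class below is a sketch with hypothetical names: it stores rows by index modulo the cache depth, so that loading row 8 into a five-row cache replaces row 3, as in the Fig. 6 example, and it lists the window start columns of one SU under the assumption that SU k begins at window k and then advances by 4 × Convolution-Stride.

```python
class BroadcastCache:
    """Behavioral model of one broadcast cache as a circular row buffer."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = [None] * depth              # each slot holds (row_index, data)

    def load_row(self, row_index, data):
        # A new row overwrites the slot of the oldest row with the same residue.
        self.slots[row_index % self.depth] = (row_index, data)

    def rows(self):
        """Row indices currently buffered, in ascending order."""
        return sorted(entry[0] for entry in self.slots if entry is not None)

    @staticmethod
    def window_starts(su_index, out_width, conv_stride=1, num_sus=4):
        """Input columns at which the windows of one SU start along a row."""
        return [k * conv_stride for k in range(su_index, out_width, num_sus)]


if __name__ == "__main__":
    bc = BroadcastCache(depth=5)
    for r in range(3, 8):
        bc.load_row(r, data=[r] * 4)             # rows 3..7 buffered
    bc.load_row(8, data=[8] * 4)                 # row 8 replaces row 3
    print(bc.rows())
    print(BroadcastCache.window_starts(1, 10))   # windows handled by SU1
```

Because every BC receives the same rows from the IB but reads them at a different window offset, the four SUs can be fed in parallel from a single IB output stream.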
VI. EXPERIMENT AND PERFORMANCE
The performance of the implemented platform is evaluated in this section. We set up the system on the server and run three CNN models on it. The performance of the proposed system is then compared with that of other CNN acceleration solutions in the data center, including high-end FPGAs, CPUs, and a GPU.
A. Experimental Setup
The proposed CNN engines are implemented on KU115 with Vivado 2016.4. Table I lists the resource utilization for AlexNet/GoogLeNet. Each type of resource exceeds 70% of the total, which makes it difficult to reach the maximum frequency of 661 MHz reported in [22]. In the end, 500 MHz is used for the EPEs and 250 MHz for the other logic, giving a peak performance of 4.2 TOP/s with 16-bit quantization. The system is built on Semptian's FPGA card with a PCIe interface, and the card is half height and half length (Fig. 7, left side). A SUPERMICRO 6028UX-TR4 (Fig. 7, right side) is used as the server, with two Intel Xeon E5-2680V4 CPUs and 16 × 16 GB of DDR3 SDRAM.
B. Acceleration for Different CNN Models
The experiments are performed with three models (Table  II) . AlexNet is well discussed in most of the literature on accelerator design, and 92% and 8% of the computations are performed on the FPGA and CPU, respectively. The performance reaches 2.3 TOP/s with a latency of 2.3 ms. The second model is GoogLeNet. We adjust the batch size to 2 and achieve approximately 1.6 TOP/s with a 3.8 ms latency. We select the high-concurrency network (HCNet) as the third model, which is a customized compact CNN model to lower the classification cost per image at Tencent. This model achieves almost the same accuracy as GoogLeNet but three times the throughput upon testing on an Intel Xeon E5-2620v3. Inspired by ResNet [5] and ShuffleNet [10] , the HCNet begins with the convolution and pooling layers changing the input image from 224×224 into 56×56 with 32 channels. Three stages follow until the end of the model with avg-pooling and FC layers. There are four, eight, and four basic residual blocks (see Fig. 8 ) in three stages, respectively.
Convolutions with 1 × 1 kernels and with few channels are used extensively in the HCNet. Although kernel fusion for 1 × 1 kernels is performed, fewer output channels decrease the reusability of the activation, and frequent data transfer lowers the SU utilization, which increases the power cost. Furthermore, with one-tenth of the layers having only 16 channels, the SUs are not fully used. Although a speedup of 3.44× over the P4 is achieved under a 7 ms time constraint [2], the throughput is limited to 650.5 GOP/s at 225/450 MHz. A higher frequency and platform with the same design as AlexNet/GoogLeNet could be used if the limitation on the PCIE power supply could be ignored. Owing to the efficient model structure, 2.8 times the throughput of GoogLeNet is achieved, and the TCO is significantly reduced. Compared with the Nvidia TESLA P4, which is the state-of-the-art GPU for deep learning inference in data centers, the speedup ratios of the FPGA in these three tests are 1.35, 3.91 and 3.44, respectively, with a 7 ms response time limitation.
Fig. 8. The residual blocks shown in (b) are used as the basic building block for the HCNet. Each block consists of three convolutional layers, where two 1 × 1 convolutional layers are used for feature squeezing and unsqueezing, and a 3 × 3 convolutional layer with group 2 is used for spatial convolution. Downsampling is performed at the beginning of each stage, where the 3 × 3 convolution with stride 2 is used, as in (a). The kernel group numbers of these layers in three stages are [32, 32, 128]
C. Comparison with FPGA-based Accelerators
Table III shows the comparison among different high-end FPGA-based CNN accelerators for potential cloud computing applications with the precision of fix16. We deliver the best peak performance of 4.2 TOP/s, which is more than twice that of the others. In addition, real performance reaches 2.3 TOP/s in the AlexNet test, benefiting from efficient task dispatching across multiple SUs, overlapping of computation and data movement, and minimized scheduling time through slice-loop-hiding.
D. Comparison with CPU and GPU
Table IV lists the available processing solutions that can be deployed in the data center for CNN inference. The Intel Xeon E5-2680V4 is a high-performance CPU and is also used as the host in the server for the FPGA/GPU. MKL 2018.0.0 is used for the optimization. The Nvidia TESLA P4 is a 16 nm GPU with 2560 CUDA cores and a 1 GHz clock speed, and it reaches 5.5 TeraFlops with 192 GB/s of memory bandwidth. CUDA 8.0.44 and cuDNN 6.0.21 are used to improve the performance. In the CPU test, the two CPUs provide 28 cores, exposed as 56 threads. Each thread is bound to one of the batches with the corresponding batch size to keep all the cores busy. In the GPU test, a single process is used as a scheduler that sends the tasks to the GPU with different batch sizes. In the FPGA test, eight threads are used, and each is bound to a physical core. Fig. 9 shows the results. Running with a fixed batch size, the FPGA presents a steady performance, providing the lowest latency of 3.8 ms for GoogLeNet. The FPGA also shows the highest throughput until the batch size of the GPU exceeds 32. Finally, the GPU achieves the highest throughput of 684 FPS at a batch size of 128 (it runs out of memory at a batch size of 256). The situation is slightly different when the task moves to the HCNet. When the batch size ≥ 4, the FPGA reaches its highest performance and maintains a constant batch size of 4. A comparison of the FPGA and GPU shows that the FPGA runs at a higher frame rate before the batch size of the GPU reaches 64. The gap does not change significantly until a batch size of 256 is reached. When the P4 achieves its peak performance, the FPGA provides 89% of the GPU's throughput with 1/57 of its latency.
The performance of the FPGA is remarkable even when limitations exist. An FPGA with a simpler fabrication process, approximately 20% of the off-chip memory bandwidth, and one-fourth of the frequency of the P4 can achieve superior performance in low-latency circumstances. In tests with larger batch sizes, in which the GPU benefits from higher data reuse, the performance improvement of the GPU over the FPGA is no more than 20%. The performance can be further improved when the proposed architecture is implemented on the next generation of FPGAs, e.g., UltraScale+ VU9P (16 nm), which uses the same fabrication process as the P4.
VII. CONCLUSION
In this paper, an FPGA acceleration platform with a supertile-based design is introduced for general-purpose CNNs and for performing various image/video inference tasks in data centers. Basic supertile EPEs are scaled up and shaped as multiple SUs to maximize the performance, with the interleaved-task-dispatching method supporting different types of convolution, and the increased bandwidth is provided by a dispatching-assembling buffering model. Given the resource limitations, a configurable FPU is proposed to simplify the design and support different types of non-convolution operators, which makes it possible to run different CNN models on the same platform and reduce the deployment cost. We implement the design and make comparisons with high-end FPGAs and data-center-scale CPU/GPU. The experiment shows that the proposed architecture on KU115 achieves the best peak performance and throughput on FPGAs, and it performs at the same level as a state-of-the-art GPU with more than 50 times lower latency. In terms of TCO, the FPGA enhances the throughput of the server by 149.2% with a 31.5% cost increase. The system is now deployed in a data center to serve over one billion people every day.
