The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by a custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but much remains to be optimized. We propose the full-stack compiler DNNVM, which integrates optimizers for graphs, loops and data layouts with an assembler, a runtime supporter and a validation environment. DNNVM works in the context of deep learning frameworks and transforms CNN models into a directed acyclic graph: XGraph. Based on XGraph, we transform the optimization challenges for both the data layout and the pipeline into graph-level problems. DNNVM enumerates all potentially profitable fusion opportunities with a heuristic subgraph isomorphism algorithm to leverage pipeline and data layout optimizations, and searches for the optimal execution strategies of the whole computing graph. On the Xilinx ZU2 @330 MHz and ZU9 @330 MHz, naive implementations without optimizations already achieve performance on par with the state of the art on our benchmarks, and the throughput is further improved by up to 1.26x by leveraging heterogeneous optimizations in DNNVM. Finally, with ZU9 @330 MHz, we achieve state-of-the-art performance for VGG and ResNet50. We achieve a throughput of 2.82 TOPs/s and an energy efficiency of 123.7 GOPs/s/W for VGG. Additionally, we achieve 1.38 TOPs/s for ResNet50.
Introduction
Deep convolutional neural networks (CNNs) [1, 2, 3, 4] are extensively employed in various artificial intelligence tasks, such as object detection, classification, natural language processing and semantic segmentation. The extensive variety in application complexity produces substantial challenges for hardware platforms [5], such as computation ability and power consumption.
Multi-core CPUs and GPUs have been the dominant hardware platforms for CNN training and inference. Following the single-instruction, multiple-data (SIMD) or single-instruction, multiple-thread (SIMT) parallel-processing methods, CNN algorithms can be efficiently processed. While the potentialities of a CPU have been fully exploited [6], unfortunately, a CPU cannot provide acceptable computation ability for CNN models. The high utilization of a GPU relies on a large batch size, which indicates that input feature maps are processed in parallel; thus, GPUs are very suitable for training. However, the applications of CNN in practice, especially vision tasks and video processing, require input feature maps to be separately executed. The low utilization of GPU resources in inference, high cost and relatively low energy efficiency limit the applications of GPUs for CNNs.
There is a significant trend of selecting custom hardware platforms, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), as the next generation of CNN accelerators for inference. ASICs such as TPU [7] and Nervana [8] can achieve state-of-the-art performance; however, the design process of ASICs is too time-consuming and expensive to keep pace with the rapid evolution of CNN algorithms. FPGA-based accelerators that target a specific CNN model achieve appealing performance while sacrificing flexibility across different networks and platforms. To avoid this problem, some authors have proposed the alternative of hardware templates with many configurable parameters [9]. However, hardware designers must still deal with low-level hardware behaviour in a cycle-accurate manner or optimize the frequency and throughput through a long process of parameter value selection. Although the use of high-level implementation tools such as Vivado High-level Synthesis (HLS) can simplify this design process, describing an efficient parallel architecture in HLS remains difficult, and long compilations are needed to realize the throughput gains.
An alternative solution is to construct a flexible hardware structure with a compiler that maps different CNN models onto it by generating different instructions. In this way, we can take advantage of both software programmability and the efficiency of custom hardware. Additionally, several advanced compiler technologies, graph-level algorithms and optimization methods can be applied in a compiler to improve both the throughput of accelerators and productivity. Building on these considerations and challenges, we design our hardware architecture as an extension of Angel-Eye [10] and propose a full-stack compiler infrastructure. Our main contributions are as follows:
• An end-to-end compilation infrastructure named the Deep Neural Network Virtual Machine (DNNVM) is proposed for a custom hardware design that is extended from Angel-Eye. The DNNVM is an integration of optimizers for graphs, loops and data layouts, an assembler, a runtime supporter and a validation environment.
• In the DNNVM, we employ a domain-specific computing graph named XGraph to decouple the DNNVM from various deep learning frameworks and hardware platforms. Based on XGraph, we propose an efficient subgraph isomorphism algorithm to enumerate all valid, potentially profitable fusion opportunities and optimize the pipeline.
• We adopt a heuristic shortest-path algorithm to extend the optimization scope to the whole computing graph and obtain the optimal fusion strategy, which is largely disregarded by other designs.
• Compared with naïve implementations without fusion, the experimental results demonstrate up to 1.26x improvement on our benchmarks VGG [1], ResNet [2], and GoogLeNet [3]. We achieve state-of-the-art performance on ZU2 for embedded applications, and on ZU9 for VGG and ResNet.
The remainder of this paper is organized as follows: Section 2 introduces the background and motivations of this research. An overview of DNNVM and our custom hardware is provided in Section 3. We leverage a heuristic algorithm to explore pipeline optimizations in Section 4 and apply a heuristic shortest-path algorithm to obtain the optimal execution strategy in Section 5. The performance that we achieve is presented in Section 6. Section 7 introduces related studies about optimizations in hardware designs and implementations of compiler designs. We conclude this paper in Section 8.
Background and Motivations

CNN Accelerators
Accelerators for CNNs can be classified into two categories: (1) general-purpose processors (GPPs), such as CPUs [11] and GPUs [12], and (2) specialized domain accelerators (SDAs) that are predominantly implemented on FPGAs [10, 13, 14, 15, 16] or ASICs [17]. GPPs are concerned with flexibility, while SDAs can offer 10-100× more energy efficiency than GPPs in dedicated application domains. Due to the advantages of reconfigurability, flexibility and energy efficiency, FPGA-based SDAs have attracted a considerable amount of attention. As shown in Figure 1, two kinds of FPGA-based SDAs [18] exist: Streaming Architectures (SAs) [15, 19, 20] and Single Computation Engines (SCEs) [10, 21, 16, 14, 22]. SA designs favour customization over generality; using high-level synthesis (HLS) [23], they can directly map each layer or subgraph of a target CNN model into one computation block in an easy and elegant way. SCE designs employ a design trade-off between flexibility and customization; they are easily scaled and configured based on the availability of hardware resources. A custom instruction-set architecture (ISA) and the corresponding compiler are designed for better throughput on SCEs. SCEs not only guarantee compatibility with neural networks with different scales and topologies but also achieve appealing performance. In this study, we prefer SCE designs due to their flexibility and appealing performance. Without reconfiguring the FPGA, multiple CNNs can be consecutively implemented by running different instructions generated by our compiler.
Figure 1: FPGA-based accelerators can be classified into two types: SA and SCE.
Compiler Toolchains
A compiler is generally an architecture-aware tool that efficiently maps algorithms to hardware implementations and control signals. The generation of CPU instructions from CNN algorithms is relatively simple due to advanced compiler technologies and a mature compiler ecosystem, such as LLVM [24]. Additionally, explorations in data scheduling optimizations [25], such as fusion, tiling, vectorizing, and parallelism [26], and many other affine transformation methods [27], are utilized to produce instructions for efficient CNN processing. Algorithms that are capable of auto-tuning and optimizing scheduling attract a substantial amount of attention as compiler optimization methods [28].
Compilers for FPGA-based accelerators or ASICs are responsible for tuning the configurable parameters of the hardware architectures [13, 22] and mapping an NN to custom instructions [21, 22, 14]. These domain-specific compilers maximize the hardware efficiency according to different topologies of CNN and hardware platforms on both the hardware side and the software side. Due to the immature ecosystem and originality of custom hardware designs, explicit scheduling and memory management are required for each compiler that targets custom hardware platforms. In the literature, only a few compilers for custom FPGA-based accelerators, such as TVM [29] and DLA [22], have been presented as blueprints for standardized compiler toolchains. Inspired by these studies and traditional compilers [30, 31, 26, 29] for GPPs, we present a complete end-to-end compiler toolchain for our custom hardware design.
The first step of a compiler is transforming a high-level framework-dependent CNN representation to an IR. For example, Caffe [32] and Darknet [33] adopt layer-based IRs, which define each typical CNN operation as a layer. A specific layer can be optimized in a straightforward manner to produce an efficient implementation. Unfortunately, difficulties arise when implementing cross-layer optimizations with a layer-based IR. TensorFlow [34] and Torch employ a graph-level IR and construct computing graphs. Each node in a computing graph represents a coarse-grained operation, such as convolution, or a fine-grained operation, such as add and pad. Various graph algorithms can be leveraged to optimize the implementations. Because operations differ among frameworks, optimizing operator implementations from different deep learning frameworks requires $O(N_f \cdot N_o \cdot N_p)$ effort, where $N_f$ is the number of deep learning frameworks, $N_o$ is the number of operations, and $N_p$ is the number of platforms. An appropriate IR that decouples the compiler from deep learning frameworks and hardware platforms is necessary; ideally, the complexity can be reduced to $O(N_o)$. Note that ONNX [35] defines a unified format that provides the exchangeability of representations among different frameworks. However, accuracy loss and errors occur in practice when we transform trained models from other frameworks to ONNX. Additionally, the IR is also coupled with the hardware design, so a new IR is necessary.
Figure 2: Motivations of compiler optimization methods. (a) Data layout in Caffe: N denotes the batch; NCHW is used for feature maps, where the width of a feature map comes first and the height next; $OCK_hK_w$ is used for weights, where the kernel comes first and the channel next. (b) Operation fusion allows concurrent implementations of computation and optimizes the pipeline. (c) Tasks are automatically distributed to the CPU and FPGA.
Optimization Methods
Data Layout and Allocation
Mapping data onto on-chip buffers helps enhance data locality and improve the total throughput. However, as the resolution of an image increases and the neural networks become deeper, the size of on-chip buffers is not sufficient to store all required data. In addition, an implementation of operations usually requires a specific data layout of input data. As shown in Figure 2(a), the data layout of feature maps in Caffe [32] is NCHW. In a further optimized implementation [6], the data layout NCHW[x]c is required for CONVs on a CPU, instead of the default (NCHW or NHWC) from Caffe or TensorFlow. Additionally, transformations such as flatten, concat and reorganization introduce significant overheads. As a result, an appropriate data layout, data slicing and the rule for allocating on-chip memory to each data tile should be considered.
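To make the layout names above concrete, the following minimal NumPy sketch converts an NCHW tensor to NHWC and to a channel-blocked NCHW[x]c layout. The helper names and the block size of 4 are illustrative assumptions and are not taken from Caffe, TensorFlow or [6].

```python
import numpy as np

def nchw_to_nhwc(x):
    """Reorder a feature map from (N, C, H, W) to (N, H, W, C)."""
    return np.ascontiguousarray(x.transpose(0, 2, 3, 1))

def nchw_to_nchwxc(x, block):
    """Split channels into blocks: (N, C, H, W) -> (N, C // block, H, W, block)."""
    n, c, h, w = x.shape
    if c % block != 0:
        raise ValueError("channel count must be a multiple of the block size")
    blocked = x.reshape(n, c // block, block, h, w)
    return np.ascontiguousarray(blocked.transpose(0, 1, 3, 4, 2))

# Hypothetical 2 x 8 x 3 x 3 feature map filled with distinct values.
x = np.arange(2 * 8 * 3 * 3).reshape(2, 8, 3, 3)
nhwc = nchw_to_nhwc(x)
blocked = nchw_to_nchwxc(x, block=4)
print(nhwc.shape, blocked.shape)
```

Each conversion copies the whole tensor, which illustrates why a compiler benefits from fixing one layout up front instead of converting between operations.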
Pipeline
As shown in Figure 2(b), pipeline parallelism enables simultaneous implementations of computation and data communication in a single operation. Additionally, operation fusion, which fuses adjacent operations, has been demonstrated as an effective pipelining technology on various hardware platforms [29, 25, 26, 31, 36, 19]. In this way, on-chip input feature maps pass through the fused operations to produce output feature maps without intermediate results being exchanged with the off-chip memory. Operations can be concurrently executed in parallel computation engines for latency hiding. Fusion technology has been employed in multiple acceleration platforms, such as fpgaConvnet [15] and VTA [16]. However, architectures such as fpgaConvnet adopt SA designs and leverage fusion on the hardware side. On the software side, compilers such as VTA can only fuse limited operations. We extend the optimization scope to the entire computing graph and propose heuristic algorithms instead of greedy algorithms to ensure the inclusion of several potentially profitable fusion strategies.
Verification
A CNN is trained with 32-bit floating point data on a GPU or CPU, and data are usually simplified to a fixed-point format with the same bit-width with negligible accuracy loss to reduce on-chip memory consumption and computation complexity. The hardware design should be adapted to the fixed-point computation, and the various optimization methods in a compiler must not introduce extra accuracy loss. However, verification is largely disregarded in previous compiler research.
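As an illustration of the float-to-fixed-point step, the sketch below quantizes values to INT8 using a power-of-two scale, round-to-nearest and saturation. These three choices, the function names and the sample values are assumptions made for the example; the text does not specify the exact shifting, truncation or rounding policy.

```python
import numpy as np

def quantize_int8(x, frac_bits):
    """Map floats to INT8 with `frac_bits` fractional bits (scale = 2**frac_bits)."""
    scaled = np.round(np.asarray(x, dtype=np.float64) * (1 << frac_bits))
    # Saturate instead of wrapping around (assumed policy).
    return np.clip(scaled, -128, 127).astype(np.int8)

def dequantize_int8(q, frac_bits):
    """Recover the real value represented by each fixed-point code."""
    return q.astype(np.float64) / (1 << frac_bits)

# Hypothetical weights; the last value falls outside the representable range.
weights = np.array([0.5, -0.25, 0.8125, 1.5, -3.0])
q = quantize_int8(weights, frac_bits=6)
restored = dequantize_int8(q, frac_bits=6)
print(q, restored)
```

A bit-exact reference of this kind is what allows CPU results and accelerator results to be compared value by value during verification.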
Mixed Compilation
State-of-the-art CNNs use a variety of algorithms and technologies to solve complex artificial intelligence tasks. Implementing these operations on FPGAs may be inefficient and infeasible, which increases the design complexity of accelerators and is time-consuming. Computationally intensive operations are accelerated on FPGA while other operations are implemented by a CPU. To simplify the usability of the tool chain, a compiler should automatically distribute different operations to a CPU and FPGA, generate instructions for specialized domain accelerators and produce CPU code by general compilers.
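The automatic distribution of operations can be pictured with a small sketch that splits a topologically ordered operation list into consecutive CPU and FPGA segments. The set of accelerated operation types and the example model are hypothetical and serve only to show the idea.

```python
# Hypothetical set of operation types that the accelerator supports.
FPGA_OPS = {"conv", "pool", "eltwise_add"}

def partition(ops):
    """Group a topologically ordered list of (name, type) pairs into
    consecutive (device, [names]) segments."""
    segments = []
    for name, kind in ops:
        device = "fpga" if kind in FPGA_OPS else "cpu"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(name)   # extend the current segment
        else:
            segments.append((device, [name]))
    return segments

# Hypothetical model with two unsupported operations.
model = [
    ("conv1", "conv"),
    ("pool1", "pool"),
    ("proposal", "proposal"),
    ("conv2", "conv"),
    ("prob", "softmax"),
]
segments = partition(model)
print(segments)
```

Each FPGA segment would then be compiled to accelerator instructions, and each CPU segment would be handed to a general-purpose compiler.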
Based on a well-designed hardware architecture, all of the optimization methods mentioned above work together in our compiler and greatly improve the productivity and the overall throughput of the system. In our case especially, even a one percent improvement in the final performance of the system is appealing and contributes to a large performance gap over the original implementations without optimizations by DNNVM.
Framework Overview
This section provides an overview of our hardware design and the DNNVM stack.
Hardware Design
DNNVM's instruction set architecture (ISA) is composed of four kinds of instructions: LOAD/SAVE, CONV, POOL, and MISC. Execution modules are designed to correspond to our customized ISA. The implementations of computation modules are inspired by Angel-Eye [10] . LOAD/SAVE modules move data between on-chip buffers and off-chip DDR.
The CONV module performs convolution over its input data that are fetched from the input buffers. The POOL module performs pooling over its input data, during which one bit is used to specify the pooling type. MISC modules execute other operations, such as element-wise add, reorganization, start, and end. These instructions are variable in size, and a few bits are set for the dependency relation among instructions to guarantee that they are executed in the correct order. The coarse-grained nature of the ISA enables the accelerators to incorporate graph-level optimizations and instruction pipelining without incurring the overhead of decoding and fetching a large number of low-level instructions. Another important advantage of our ISA is that each computation module can be reused for various operations. For example, the convolution module can be reused for deconvolution, depth-wise convolution, and dilated convolution by reconfiguring fields such as stride, length, and mode in conv-related instructions.
As shown in Figure 3(a), instructions are fetched from off-chip DDR to the Instruction FIFO. The Dispatcher is responsible for decoding instructions into control signals and register addresses for the execution modules. The Bank arbiter determines the order in which requests access the buffers. We use BRAMs and DSP slices to construct Buffers and ALUs and pack two or four INT8 Mults into one DSP48E1 slice to achieve peak performance of 380 GOPs/s and 4.05 TOPs/s on ZU2 and ZU9 devices, respectively, at 330 MHz. A fixed number of BRAMs are pre-allocated for input feature maps, output feature maps and parameters. Each buffer can be reused by instructions with different access addresses. In Figure 3(b), we partition CONV into several pieces and parallelize over the input channel, height and output channel. In this way, we do not need optimizations for specific kernel sizes, which enhances flexibility. The algorithm in Figure 3(b) fixes the data layout of feature maps to NHWC, which means that the input channel comes first and the batch last. The OWHC data layout is used for weights, which can be pre-arranged on DDR before computation so that no extra overhead is introduced. In Figure 3(c), different data transformation operations such as flatten and concat introduce significant overheads. Based on our understanding of the hardware design and computation implementations, our custom data layout prunes the flatten operation naturally because the results have exactly the same layout. Other dimension transformation operations can be fused into the SAVE instruction of the previous operation. For example, in Figure 3(c), a channel-first data layout makes SAVE1 store the first $O_1$ pixels at address 0 and then stride by $O_1 + O_2$ pixels to store the next $O_1$ feature maps in the output buffer, which is equivalent to a single concat after convolution.
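The equivalence between strided SAVE instructions and a channel concat can be checked with a small NumPy sketch. The function name, buffer shapes and channel counts below are hypothetical; the sketch only assumes a channel-innermost memory layout, as described above.

```python
import numpy as np

def save_with_stride(out_buffer, tile, channel_offset):
    """Write an (H, W, C_part) result into an (H, W, C_total) buffer:
    C_part values per pixel, then skip ahead by C_total for the next pixel."""
    h, w, c_part = tile.shape
    c_total = out_buffer.shape[2]
    flat = out_buffer.reshape(-1)            # view of the linear memory
    pixels = tile.reshape(-1, c_part)
    for pixel in range(h * w):
        start = pixel * c_total + channel_offset
        flat[start:start + c_part] = pixels[pixel]

o1, o2 = 3, 5                                # hypothetical channel counts
rng = np.random.default_rng(0)
conv1 = rng.integers(-8, 8, size=(4, 4, o1))
conv2 = rng.integers(-8, 8, size=(4, 4, o2))

out = np.zeros((4, 4, o1 + o2), dtype=conv1.dtype)
save_with_stride(out, conv1, channel_offset=0)    # plays the role of SAVE1
save_with_stride(out, conv2, channel_offset=o1)   # plays the role of SAVE2
print(np.array_equal(out, np.concatenate([conv1, conv2], axis=2)))
```

Because both convolutions write directly into their final positions, no separate concat pass over the data is needed.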
Compiler Infrastructure
The DNNVM is a full-stack compiler that takes advantage of many advanced compiler technologies [29, 31, 22] . As presented in Figure 4 , the DNNVM transforms CNN models into a framework-independent computing graph, searches for pipeline optimization opportunities, allocates buffers for fused-operations, and importantly, selects the optimal execution strategies. In general, DNNVM integrates various optimizers, runtime support, a data transformer and a validation bench to improve the productivity.
• Front-End and XGraph: DNNVM presents the coarse-grained computing graph format XGraph to decouple the compiler from deep learning frameworks and hardware platforms and to reduce the complexity to $O(N_o)$. As shown in Figure 8 and Figure 4(b), different deep learning frameworks have a great diversity of operations with different granularities, such as 1 from Caffe and 4 from TensorFlow in Figure 8. As shown in the second row of Figure 4(b), XGraph has a format similar to that of Caffe and PyTorch but is a graph-level IR. DNNVM transforms the computing graphs from different deep learning frameworks into the coarse-grained XGraph by leveraging intrinsic fusion, such as convolution + BN + Scale, which can be pre-calculated, and point-wise fusion, such as convolution + ReLU. Additionally, dimension transformation operations are pruned and fused into the previous computation operation.
• Middle-End: We recognize three types of operation fusion templates to maximize the utilization of locality and parallelism and to optimize the pipeline. In the middle-end of DNNVM, we adopt graph-level optimizations. A heuristic subgraph isomorphism algorithm is designed to efficiently enumerate all fusion opportunities. In addition, using the directed acyclic graphs (DAGs) of computational operations, we design a DSL based on Scheme [37] for these hypergraph IRs and automate the DRAM allocation, synchronization, and distribution according to these IRs. We primarily allocate DRAM for feature maps and parameters. The space allocated for feature maps that none of the following operations depends on can be freed, while the space for parameters such as weights and bias is protected.
• Back-End: In the back-end, we leverage a combination of affine transformations, such as hierarchical tiling using nested loops and pipelines. We search for the optimal execution strategy with a heuristic shortest-path algorithm to maximize the utilization of locality and parallelism.
We map the optimal execution strategy into our ISA IRs. The assembler transforms the instructions into machine code. For validation, two issues must be handled. First, at the beginning of the hardware design, we need to make a trade-off between the accuracy loss and the hardware resource consumption to determine the shifting, truncation, and rounding methods. Second, differences in the implementations among different deep learning frameworks will always exist. As a result, we pack all implementations of operations from different frameworks and modify the packed implementations according to our hardware design, underlying libraries, granularity of operations, processing strategies of padding zeros, fixed-point policies, and the data layout to prevent even a one-bit difference between the CPU results and the FPGA results.
These modules are indispensable parts in a full-stack compiler design. We focus on the optimization methods in DNNVM in the following sections, especially the pipeline optimization and operation fusion of the entire computing graph.
Optimizations
As mentioned above, DNNVM works in the context of deep learning frameworks and transforms the computing graph of a CNN model into XGraph. Due to the coarse-grained attributes of XGraph, the transformation process can be seen as operation fusion. Fine-grained operations are fused into a complex operation, and point-wise operations such as ReLU and dimension transformation operations are fused into the previous operation. In addition, prior optimization approaches [20] usually consider each individual computation operation separately and start with the assumption that each operation loads the feature maps and parameters from the off-chip memory and then writes the intermediate results back to the off-chip memory. Computation operation fusion, such as conv + conv, can avoid frequent data exchange with the off-chip memory and reduce the bandwidth requirement. The memory access latency is hidden by the concurrent execution of computation and communication, and the implementation overheads are largely hidden by the concurrent execution of different computations.
Fusion Template
We recognize three categories of operation fusion, as shown in the right column of Figure 4: intrinsic fusion, point-wise fusion and kernel fusion. In DNNVM, the injective operation in the templates can be convolution, pooling, nonlinear, deconvolution, depth-wise convolution, upsample, or reorganization.
Intrinsic Fusion
We fuse the parts of the graph that can be pre-computed or statically determined. For example, the parameters of batch normalization and scale can be pre-computed, and these operations can be fused to the adjacent convolution. In addition, we fuse operations from different deep learning frameworks to a uniform grain. As shown in Figure 8 , we fuse pad, weights, bias, conv2d, biasadd and relu from TensorFlow to a single convolution operation in XGraph.
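The pre-computation behind convolution + BN + Scale fusion follows from the standard batch-normalization formula. The NumPy sketch below folds the normalization statistics and the scale parameters into the convolution weights and bias and compares the result with the unfused computation; the function names, tensor shapes and the 1 × 1 convolution used for the comparison are assumptions made for the example.

```python
import numpy as np

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the
    convolution. weight: (OC, IC, KH, KW); all other arguments: (OC,)."""
    scale = gamma / np.sqrt(var + eps)
    return weight * scale[:, None, None, None], (bias - mean) * scale + beta

def conv1x1(x, w, b):
    """Reference 1x1 convolution on a (IC, H, W) input."""
    return np.einsum("oi,ihw->ohw", w[:, :, 0, 0], x) + b[:, None, None]

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 3, 1, 1))
bias = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.5
x = rng.standard_normal((3, 5, 5))

# Unfused: convolution followed by normalization and scaling.
normalized = (conv1x1(x, weight, bias) - mean[:, None, None]) / np.sqrt(var[:, None, None] + 1e-5)
reference = gamma[:, None, None] * normalized + beta[:, None, None]

# Fused: a single convolution with folded parameters.
folded_w, folded_b = fold_batchnorm(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, folded_w, folded_b)
print(np.allclose(reference, fused))
```

Since the folded parameters depend only on trained constants, they can be computed once at compile time.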
Point-wise Fusion
We fully integrate the point-wise operations. A typical example is the ReLU after convolution. We apply ReLU to the intermediate data by setting the bits that denote the nonlinear type in the CONV instruction. After a convolution result is calculated from a large number of inputs and parameters, this intermediate result is sent directly to the nonlinear module to avoid the kernel boot time and to reduce the time required for off-chip memory access.
Kernel Fusion
We leverage kernel fusion for data reuse. First, horizontally adjacent layers share the same input feature maps; as a result, input data does not need to be reloaded. Second, vertically fused-operations can avoid the exchange of intermediate results between on-chip buffers and off-chip memory. Third, different operations can be concurrently executed by different computation modules. In addition, dimension transformation and the reorganization of feature maps can be skipped by fusing operations of reshaping, such as concat and flatten, into the SAVE process of the previous operator or the LOAD process of the following operator.
We summarize our fusion templates in Figure 4. Once DNNVM obtains a computation graph from a deep learning framework, it needs to identify all possibilities for operation fusion and generate the corresponding instructions.
Modeling of Fusion and Tiling
Ideally, the buffers are sized to fit the entire feature maps and parameters. Unfortunately, on-chip memory is often too small to store the required data of fused-operations. For example, VGG has 16 convolution layers with more than 100 MB parameters; however, Xilinx ZU2 has only 0.66 MB of on-chip BRAMs. To overcome this constraint, segmenting the CNN models into subgraphs and slicing along the feature maps are potential solutions.
Given a group of operations $f$ with $l$ layers, as depicted in Figure 7, we let $W$ and $H$ represent the width and height, respectively, of an output feature map, where $F^{-1}(W)$ and $G^{-1}(H)$ denote the corresponding width and height, respectively, of an input feature map.
Figure 7: Details of Operation Fusion Techniques
$A_{comp}$ and $A_{ac}$ denote the amount of computation and data exchange, respectively, for each operation. Given a CNN model, we obtain the total amount of computation, which is a fixed number and can be calculated with Equation (1).
$\langle T_w, T_h, T_{ic}, T_{oc}, T_k \rangle$ comprises a specific tile size combination for the width, height, channel of input feature maps, channel of output feature maps and computation kernels, respectively. $\langle \alpha_{in}, \alpha_{out}, \alpha_p \rangle$ and $\langle \beta_{in}, \beta_{out}, \beta_p \rangle$ represent the trip counts and the size of the memory accesses to each tiled feature map and parameter, respectively.
We extend the Computation to Communication (CTC) ratio [20], which describes the computation operations per memory access, to describe the influence of fusion and tiling. The total amount of computation is fixed for a given CNN, and the performance improvement is in proportion to the reduction in communication.
In previous studies, without operation fusion techniques, CTC would be:
With the operation fusion:
Figure 8: A case study of the operation fusion. DNNVM works in the context of deep learning frameworks and transforms CNN models into XGraph by leveraging intrinsic fusion and point-wise fusion, then searches for fusion opportunities according to pre-defined fusion templates with a subgraph isomorphism algorithm.
Figure 9: Evaluation and path searching. By resolving configurations generated by DNNVM in Scheme, DNNVM automatically evaluates implementation costs on-board or by simulator, and searches for the optimal execution strategies with a heuristic algorithm.
Slicing and Fusion
According to Equation (10), to map a CNN into instructions and increase the CTC, it is necessary to determine appropriate sliced sizes to fit the feature maps and parameters. However, slicing causes recomputation and reloading of intermediate values in the overlapping region between two neighbouring tiles if the stride is smaller than the kernel size.
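The overlap between neighbouring tiles can be quantified with the usual relation between output and input extents of a sliding-window operation. The sketch below walks a fused chain backwards to find the input tile width, in the spirit of $F^{-1}(W)$; it ignores padding, and the function names, the layer list and the tile width of 14 are assumptions chosen for illustration.

```python
def input_extent(out_extent, kernel, stride):
    """Input width (or height) needed to produce `out_extent` outputs, no padding."""
    return (out_extent - 1) * stride + kernel

def fused_input_extent(out_extent, layers):
    """Walk a fused chain backwards; `layers` lists (kernel, stride) in execution order."""
    extent = out_extent
    for kernel, stride in reversed(layers):
        extent = input_extent(extent, kernel, stride)
    return extent

def overlap(kernel, stride):
    """Columns shared by two neighbouring tiles of a single layer."""
    return max(kernel - stride, 0)

# Hypothetical fused chain: 5x5 convolution then 3x3 pooling, both with stride 1.
chain = [(5, 1), (3, 1)]
tile = fused_input_extent(14, chain)        # input width for one 14-wide output tile
merged = fused_input_extent(28, chain)      # input width for the untiled 28-wide output
reloaded = 2 * tile - merged                # input columns loaded twice by two tiles
print(tile, merged, reloaded)
```

The halo grows with every additional fused layer, which is why fusion depth and tile size have to be chosen together.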
The bandwidth utilization will increase as the tiling size increases. In addition, the feature maps and kernels that are loaded at one time cannot exceed the size of the on-chip buffer as the topologies of NNs become deeper.
As a result, the fusion depth, tiling across multiple stages, and on-chip memory capacity are decisive factors that influence operation fusion. The number of strategies for tiling and fusion increases geometrically due to these factors. Based on these considerations, our heuristics start from an operation and then iterate over the computing graph to check for fusion opportunities. Operations are fused when either of the following conditions is satisfied. First, the dependency among operations is pre-defined in our fusion templates shown in Figure 4; thus, different operations can be concurrently executed by different blocks. We maximize the tile size for output feature maps along the width and height under the constraint that the corresponding input feature maps and the parameters can be simultaneously stored in their respective buffers. The second condition is that all feature maps and parameters required for all fused layers can be stored on-chip without tiling to avoid the exchange of intermediate results. In this case, more than two operations can be fused; the number of operations to be fused is not the limitation.
As a case study, we fuse one adjacent convolution and pooling in GoogLeNet-v1 [3]. In Figure 5, the input feature maps are sliced, and intermediate data calculated by the convolution are not stored to off-chip memory; thus, LOAD and SAVE between CONV and POOL can be skipped. As soon as an output feature map is produced, SAVE is issued and can be hidden behind computations. In addition, LOAD, SAVE, CONV, and POOL can be concurrently performed. We set the input feature map to be 28 × 28 × 32, the convolution kernel size to be 5 × 5, the pooling kernel size to be 3 × 3, the stride of both to be 1 and the output feature map to be 28 × 28 × 256. According to Equation (1), the total workload is 0.32 GOPs, and according to Equations (9) and (10), the data transfer is 6.27 KB. On ZU2, 0.375 ms is needed for this convolution operation, and 0.242 ms is needed for the following pooling operation. The fusion methodology reduces the total data transfer by 64% and achieves a 1.67x speedup.
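The workload figure in this case study can be reproduced with a short calculation. The sketch assumes that one multiply-accumulate counts as two operations and that the convolution dominates the count; the function name is hypothetical, and the fused time is only what the quoted 1.67x speedup would imply, not a measured value.

```python
def conv_ops(out_h, out_w, out_c, in_c, k_h, k_w):
    """Operation count of a convolution, counting one multiply-accumulate as two operations."""
    return 2 * out_h * out_w * out_c * in_c * k_h * k_w

# 28 x 28 x 32 input, 5 x 5 kernel, 28 x 28 x 256 output.
ops = conv_ops(28, 28, 256, 32, 5, 5)
gops = ops / 1e9

# Serial execution time of the convolution followed by the pooling on ZU2.
serial_ms = 0.375 + 0.242
# Fused time implied by the quoted 1.67x speedup (derived, not measured).
implied_fused_ms = serial_ms / 1.67
print(round(gops, 2), round(serial_ms, 3), round(implied_fused_ms, 3))
```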
In Figure 6, the element-wise-add operation needs to load the results from different convolution operations. The large amount of data exchange is one of the main types of overhead. As we fuse one adjacent convolution with the element-wise-add in ResNet50, SAVE 1 and LOAD 3 can be skipped. Tiling is used to fit the data into on-chip buffers, and communications and computations are implemented concurrently by different blocks. On ZU2, we need 0.467 ms and 0.463 ms for the convolutions, and an extra 0.833 ms is required for the element-wise-add. After fusing element-wise-add with one of the prior convolutions, the total execution time is reduced to 1.039 ms. Fusion of convolution and element-wise-add achieves a 2.2x speedup and a 36.4% reduction of data transfer compared with the primitive serial implementations.
To efficiently investigate fusion technology, we divide the problem into two parts: 1) the traversal of operation fusion opportunities in the following subsection and 2) the exploration of optimal execution strategies in Section 5.
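The timing figures quoted for this case can be cross-checked with simple arithmetic. The text does not state which operations the 2.2x figure covers, so the second ratio below is only one possible reading, under the assumption that the convolution left unfused keeps its serial time; the variable names are hypothetical.

```python
conv_a_ms, conv_b_ms, add_ms = 0.467, 0.463, 0.833
fused_total_ms = 1.039

# Ratio over the whole subgraph: both convolutions plus the element-wise-add.
serial_total_ms = conv_a_ms + conv_b_ms + add_ms
whole_speedup = serial_total_ms / fused_total_ms

# Assumed reading: conv_b is fused with the add, conv_a still takes 0.467 ms.
pair_serial_ms = conv_b_ms + add_ms
pair_fused_ms = fused_total_ms - conv_a_ms
pair_speedup = pair_serial_ms / pair_fused_ms
print(round(whole_speedup, 2), round(pair_speedup, 2))
```

Under this reading, the ratio for the fused pair alone comes out close to the quoted 2.2x, while the ratio over all three operations is lower.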
Subgraph Isomorphism Algorithm
Use of the fusion template for traversal matching on the computation graph represented by XGraph is a subgraph isomorphism problem, which is a well-known NP-complete problem that cannot effectively traverse all possibilities.
Currently, operation fusion techniques on GPPs employ a greedy method for fusion template matching. However, if we greedily search for fusion opportunities, several valid and potentially profitable opportunities may be missed. The sequence of template searching influences the final performance. For example, as shown in Figure 8 b, if we fuse 4 with 3 , the combination of 4 and 5 will be missed.
To solve this problem, we adopt a heuristic subgraph isomorphism algorithm that enumerates all fusion opportunities instead of searching greedily. Ullmann [38], VF2 [39], and boostIso [40] present several methods for subgraph isomorphism, and Jinsoo et al. [41] compare their performance. We learn from these methods and apply a heuristic algorithm in our design:
Algorithm 1 Subgraph Isomorphism Algorithm
Input: set of subgraph pattern templates Q, Q_i ∈ Q
Input: computing graph G, start point S_i
Output: all subgraph isomorphisms of Q_i in G
[...]
Algorithm 2 (fragment)
[...]
    the optimal strategy = path of min(cost_i + cost_j)
  else
    ShortestPath(between adjacent barriers)
  end if
end for
DEF ShortestPath():
for (k = 0; k < |V|; k++) do
  for (i = 0; i < |V|; i++) do
    for (j = 0; j < |V|; j++) do
[...]
The computing graph G of the CNN model is defined as a triple <V, E, T>, where V is the set of vertices that represent the operations in the neural network, E represents the dataflow dependencies and T is a labelling function that maps each operation to its type and configuration. Unless otherwise specified, we use the symbols Q, v and b to denote the set of fusion templates, a query vertex, and a query edge, respectively. Given a set of query fusion templates Q, Q_i ∈ Q represents one candidate fusion template. The problem is to identify all distinct embeddings of Q_i in G.
As shown in Algorithm 1, FilterCandidates searches for candidate vertices in G that have the same type as v in Q_i. We set the start point S_i to the type in Q_i that has the fewest occurrences, in order to reduce the size of the recursive call tree. NextQueryVertex selects a vertex adjacent to the previously matched query vertices in M; a breadth-first search is implemented here. Using RefineCandidates, we prune any vertex u in the candidate set C_v that is not adjacent to the previously matched vertex. u is not pushed into M unless it satisfies all query requirements, such as the type of the operation, the kernel size, and the stride. The edge that connects u with a previously matched vertex should have the same type as the corresponding edge in the query. The recursion stops when the algorithm obtains a complete solution (i.e., when |M| = |V(Q_i)|). Otherwise, the algorithm calls NextQueryVertex to select a query vertex v ∈ V(Q_i) that has not yet been matched.
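The matching procedure described above can be sketched as a small backtracking search. This is a minimal illustration under stated assumptions, not the DNNVM implementation: the function name `find_embeddings` and its data layout are hypothetical, operation configurations such as kernel size and stride are assumed to be folded into a single type label, edge types are ignored, and the query template is assumed to be connected.

```python
from collections import Counter


def find_embeddings(g_types, g_edges, q_types, q_edges):
    """Enumerate all embeddings of a connected query template in a directed graph.

    g_types / q_types: dict mapping vertex -> operation type label.
    g_edges / q_edges: set of directed (src, dst) pairs.
    """
    # Start from the query vertex whose type is rarest in G (smaller recursion tree).
    freq = Counter(g_types.values())
    start = min(q_types, key=lambda v: freq[q_types[v]])

    # Breadth-first matching order over the query, ignoring edge direction.
    q_adj = {v: set() for v in q_types}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    order, seen = [start], {start}
    for v in order:  # the list grows while we iterate, which gives BFS order
        for w in sorted(q_adj[v]):
            if w not in seen:
                seen.add(w)
                order.append(w)

    results = []

    def consistent(v, u, mapping):
        # Every query edge between v and an already matched vertex must exist in G.
        for a, b in q_edges:
            if a == v and b in mapping and (u, mapping[b]) not in g_edges:
                return False
            if b == v and a in mapping and (mapping[a], u) not in g_edges:
                return False
        return True

    def recurse(mapping):
        if len(mapping) == len(order):  # complete embedding found
            results.append(dict(mapping))
            return
        v = order[len(mapping)]  # next unmatched query vertex
        for u in g_types:
            if g_types[u] != q_types[v] or u in mapping.values():
                continue  # filter by type and keep the mapping injective
            if consistent(v, u, mapping):
                mapping[v] = u
                recurse(mapping)
                del mapping[v]

    recurse({})
    return results


# Hypothetical graph: conv -> pool -> conv -> pool; template: conv followed by pool.
g_types = {0: "conv", 1: "pool", 2: "conv", 3: "pool"}
g_edges = {(0, 1), (1, 2), (2, 3)}
embeddings = find_embeddings(g_types, g_edges, {"a": "conv", "b": "pool"}, {("a", "b")})
print(embeddings)
```

On this toy graph the search returns both conv-pool pairs and rejects the pool-conv edge in the middle, because the direction of the query edge must be preserved.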
Execution Path Searching
As shown in Figure 8c, after we find all embeddings of Q_i in G, one operation may be covered by several different fusion templates. Thus, we need to determine the optimal fusion strategy. Two challenges arise: 1) the execution cost of each fused operation needs to be evaluated; and 2) due to the complex topologies of neural networks, the search for the optimal execution strategy is subject to combinatorial explosion. To address these challenges, we present the following methods.
Fused-Operations Evaluation
To determine the optimal execution strategy, we need the cost of each fused operation, where the cost denotes the execution time. As shown in Figure 9a, after parsing configurations written in our DSL, we employ three methods to evaluate the performance of these fused operations. First, we can directly execute each subgraph on our hardware platforms, which is the simplest evaluation method. Second, for offline evaluation, we use a model-driven method to estimate the cost. Instead of providing an empirical formula, we design a small neural network that fits the cost with a maximum deviation of 5%. Third, because compiler optimization and hardware design are both long processes, we want the two to be designed and optimized in parallel. Therefore, we design a cycle-accurate simulator that analyses the dependencies among the instructions and records the number of cycles consumed by each instruction.
Auto-tuning of Execution Path
The problem then becomes how to obtain the execution strategy with the smallest total cost of performing all operations in the computing graph. First, we exchange the attributes of the nodes and the edges, as shown in Figure 8d. In this way, once operations can be fused, such as 1 and 2, an additional edge 6 is inserted into the computing graph.
To extend the algorithm to the computing graph of the entire CNN, we treat as barriers the operations that depend on more than one operation or that more than one operation depends on. Given our fusion templates, fusion opportunities never span a barrier. Thus, we only need to determine the optimal execution strategy between each pair of adjacent barriers, as shown in Figure 8d. In Algorithm 2, cost[i][j] denotes the execution time of the operations from the i-th node to the j-th node.
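The search between two adjacent barriers can be sketched with the triple-loop shortest-path scheme (Floyd-Warshall) that the surviving ShortestPath fragment suggests. The sketch makes several assumptions: graph nodes are numbered points between operations, each edge carries the cost of one operation or of one fused group, and the costs below are hypothetical values in microseconds. The helper names `shortest_paths` and `path` are illustrative and do not come from DNNVM.

```python
INF = float("inf")


def shortest_paths(n, edges):
    """Floyd-Warshall over n nodes; edges maps (src, dst) -> cost of a (fused) operation."""
    cost = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for (i, j), c in edges.items():
        if c < cost[i][j]:
            cost[i][j], nxt[i][j] = c, j
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if cost[i][k] + cost[k][j] < cost[i][j]:
                    cost[i][j] = cost[i][k] + cost[k][j]
                    nxt[i][j] = nxt[i][k]
    return cost, nxt


def path(nxt, i, j):
    """Recover the node sequence of the cheapest strategy from i to j."""
    if i != j and nxt[i][j] is None:
        return []  # j is not reachable from i
    hops = [i]
    while i != j:
        i = nxt[i][j]
        hops.append(i)
    return hops


# Hypothetical segment between two barriers (node 0 and node 3), costs in microseconds.
edges = {
    (0, 1): 375,  # convolution executed alone
    (1, 2): 242,  # pooling executed alone
    (0, 2): 369,  # extra edge inserted because convolution and pooling can be fused
    (2, 3): 400,  # following operation
}
cost, nxt = shortest_paths(4, edges)
best = path(nxt, 0, 3)
print(cost[0][3], best)
```

Here the cheapest strategy skips node 1, which corresponds to choosing the fused edge over the two separate operations.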
Unfortunately, a problem occurs at the barriers when we apply the ShortestPath algorithm directly. For example, as shown in Figure 8c, if we fuse the operations denoted by 2 and 5, the edges data1 and data3 that target barrier1 in Figure 8d should be pruned, because 5 denotes an element-wise-add, which only needs to be executed once. Horizontal fusion has a similar problem. In topologies such as the Inception module of GoogLeNet, convolutions that share the same input feature maps can be fused to avoid reloading the input data. Once these convolutions are fused, the other edges that proceed from the input feature map should be pruned. In our study, we enumerate these special cases at barriers, as shown in Figure 9c, d and e.
Evaluation
This section presents on-board tests that evaluate DNNVM and the performance improvement obtained from operation fusion, using devices for both embedded applications and data centres.
Experiment Environment
As shown in Figure 10, we evaluate DNNVM with operation fusion, using the designs described previously, on Xilinx ZU2 and ZU9 FPGA devices. ZU2 is a low-cost FPGA chip that is commonly employed in embedded devices; it contains approximately 0.66 MB of on-chip storage. ZU9 has 4 MB of on-chip storage and targets data centre scenarios. Table 2 shows our benchmark CNN models. The selected CNNs are extensively employed in multiple applications, including face recognition, object recognition, classification, and tracking; accordingly, VGG [1], ResNet [2], and GoogLeNet [3] are selected for comparison with other designs. We train all neural network models with Caffe and perform adjustments to convert feature maps, weights and biases from 32-bit floating point to 8-bit fixed point. We map all operations, with the exception of fully connected layers, onto FPGA accelerators. We demonstrate that our design is practical for both embedded platforms and data centres. We synthesize the hardware logic with Vivado 2017.1.
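The conversion of 32-bit floating-point values to 8 bits is not detailed in this section. As an illustration only, the sketch below assumes a signed 8-bit fixed-point format with one shared power-of-two scale per tensor; the function names `quantize_fixed8` and `dequantize`, and the scheme itself, are assumptions rather than the procedure actually used for the benchmarks.

```python
import math


def quantize_fixed8(values):
    """Quantize floats to signed 8-bit fixed point with a shared power-of-two scale.

    Assumed scheme: pick the number of fractional bits so that the largest
    magnitude fits into the range [-128, 127], then round and clamp.
    """
    max_abs = max(abs(v) for v in values)
    frac_bits = 7 - math.ceil(math.log2(max_abs)) if max_abs > 0 else 7
    scale = 2.0 ** frac_bits
    q = [max(-128, min(127, round(v * scale))) for v in values]
    return q, frac_bits


def dequantize(q, frac_bits):
    """Map the 8-bit integers back to floating point for error inspection."""
    return [x / 2.0 ** frac_bits for x in q]


# Hypothetical weights, chosen only to exercise the functions.
weights = [0.5, -0.25, 0.125, 0.9]
q, frac_bits = quantize_fixed8(weights)
restored = dequantize(q, frac_bits)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, frac_bits, max_error)
```

With a power-of-two scale, rescaling between layers reduces to bit shifts, which is one reason such formats are common on fixed-point accelerators.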
Experimental Results
The most critical characteristic of an FPGA-based accelerator is the performance the system actually achieves. Peak performance reflects how well the hardware platform is designed, but comparing the peak performance of multiple designs is of limited value because the hardware never reaches peak performance for specific CNN models in practice. We therefore implement typical CNN models and report the performance achieved in practice. First, as shown in Table 2 and Table 4, we demonstrate that our baseline design, which executes instructions generated by DNNVM without pipeline optimizations, achieves state-of-the-art performance on embedded FPGA platforms, and that the optimized implementations that leverage operation fusion improve the throughput further.
Throughput Improvement by the Operation Fusion
To analyse the performance improvement from operation fusion and pipeline optimizations, we test instructions generated by DNNVM on ZU2 @330 MHz. Table 2 shows the execution time of the compilation process and the on-board execution time of the generated instructions. The results show that operation fusion provides a speedup of 1.02× to 1.26×. To further analyse how much operation fusion improves resource utilization, we calculate the average performance normalized by the peak performance (380 GOPs/s). As shown in Figure 11, the resource utilization rate increases by 10% and 20% on ResNet and GoogLeNet, respectively, but only by approximately 2% on VGG. Given the variety of operations and the complex topologies in ResNet and GoogLeNet, operation fusion keeps the computation blocks busy and hides communication behind computation, as shown in Figure 6. VGG, however, only has convolution and pooling, and its original utilization is already approximately 90%; thus, the improvement is modest. Table 2 indicates that compilation and the corresponding optimizations only cost dozens of seconds, which has a minimal impact on the entire deployment process. Our heuristic fusion algorithm can effectively improve the performance of complex neural networks with minimal compilation cost.
Comparison to Fusion Designs
In our work, meaningful and fair comparisons require designs that leverage pipeline optimizations, especially operation fusion, for the same CNN on the same FPGA device. It is hard to satisfy all of these requirements; thus, we choose designs that leverage fusion technology on FPGA platforms and are as comparable as possible. We compare our design with them on either the hardware side or the compiler side. Additionally, the frequency of some hardware designs differs from ours and cannot be raised due to the lack of hardware resources, so we compare our design with the others at the highest frequency reported in the literature. As shown in Table 4, fpgaConvNet [15], fuse1 [36] and fuse2 [19] optimize throughput on the hardware side and fuse different operations into a single block. fpgaConvNet [15] leverages layer fusion in an SA design. They propose the alternative exploitation of the capabilities of FPGAs and partition a CNN along the depth into several subgraphs, each of which is mapped to a different bitstream. Although reconfiguration overheads are added, fpgaConvNet achieves 48.53 GOPs/s for VGG on Zynq XC7Z020 and 155 GOPs/s on Zynq XC7Z045. Fuse1 [36] focuses on optimizing the external memory bandwidth utilization. This fused-layer accelerator reduces memory transfer from 77 MB to 3.6 MB for VGG. Unfortunately, this design requires 6.5% additional clock cycles with fusion. Fuse2 [19] explores algorithms to determine the fusion strategy for each layer; it explores fusion possibilities with a branch-and-bound algorithm, depending on the hardware resources, bandwidth and workloads. This heterogeneous design achieves 76.9 GOPs/s and 230 GOPs/s for AlexNet and VGG, respectively. Table 3 provides the detailed utilization of hardware resources of our design on different platforms.
We only use 25% of the BRAMs, 24% of the DSPs, 33% of the FFs and 14% of the LUTs on ZU2 compared with the fusion design [19] on Zynq XC7Z045, but achieve 1.45x the throughput and 1.8x the energy efficiency, even with the substantially smaller on-chip memory of ZU2. As leveraging fusion on the software side simplifies the hardware design and causes a performance gap, we can focus on optimizing frequency and resource utilization. In addition, compared with these SA designs, we can scale more efficiently to deeper and more complicated CNNs with equivalent performance by adjusting instructions.
Comparison to Compilers
As shown in Table 4, Snowflake [21], DnnWeaver [13], VTA [16] (not in Table 4), xfDNN [14], and DLA [22] propose an SCE design and optimize the implementations on the software side. Their compilers leverage fusion with a computing graph, loops, or co-optimizations. Snowflake and DnnWeaver provide a powerful compiler; however, their final performance cannot be guaranteed. VTA exploits the simultaneous utilization of compute and memory resources, reducing the inference time of ResNet18 from more than 3 s to less than 0.5 s. Currently, xfDNN and DLA show state-of-the-art performance. xfDNN adopts VU9P, which has 8.9 MB of BRAMs and more than 35 MB of UltraRAMs. As a result, xfDNN can pre-load a large number of parameters and feature maps into on-chip memory to avoid data exchange. A batch size larger than 1 also contributes to the performance of xfDNN, since it allows more reuse of parameters. We outperform xfDNN on VGG and ResNet50. DLA is the most efficient accelerator on GoogLeNet. Unfortunately, we achieve a substantially lower performance than xfDNN and DLA on GoogLeNet. First, our frequency is limited to 330 MHz for batch 3 on ZU9 due to a lack of wiring resources. Second, GoogLeNet has many layers with small feature maps; our on-chip memory, which is substantially smaller than that of xfDNN and DLA, causes frequent data exchange that cannot be hidden by fusion. Third, bandwidth saturation also causes a performance gap.
Related Work
Many domains in machine learning benefit from CNN algorithms. Due to the high computational complexity of CNNs, as shown in Table 5, multiple compiler infrastructures have emerged, and various compiler technologies are employed to improve the throughput on GPPs or on specialized processors.
For GPP compilers, a side-effect-free representation of operations, applicability and generality across deep learning frameworks, and optimized scheduling are highlighted. Intel nGraph [42] and Google XLA [43] act as a bridge between deep learning frameworks and hardware back-ends. nGraph utilizes MKL-DNN to produce highly optimized implementations on CPUs and the PTX-emitting back-end of LLVM to generate assembly code for GPUs. The XLA compiler acts as a back-end for TensorFlow. TVM [29] proposes an ahead-of-time compiler that supports multiple front-ends and hardware platforms. These compilers adopt high-level computing graphs and leverage fusion across layers based on predetermined rules. DLVM [30] employs another DSL, which is inspired by LLVM, and can specify forward and backward operations based on tensors. Auto-scheduling algorithms have gradually attracted substantial attention and offer appealing productivity. Tensor Comprehensions (TC) [31] adopts polyhedral IRs and employs a genetic search over affine transformation options (e.g., tile sizes, loop fusion and scheduling strategies). In addition, PolyMage [27] introduces fusion methods with loop nests and determines the rules of fusion and the range of tiling sizes to ensure a small auto-tuning space. Mullapudi et al. [28] introduce an automatic fusion and tiling selection method for Halide [26] to provide the best performance. Jangda et al. [44] develop a cost function for evaluating all valid fusion opportunities in only O(n^2) time instead of O(2^n). In this way, all potentially profitable fusion opportunities are considered, in contrast to greedy algorithms [28, 27].
Similarly, to accelerate a CNN on specialized accelerators, the previously mentioned methods are important, especially an optimized pipeline strategy that enhances data locality and parallelism. To leverage operation fusion and other pipeline strategies, most FPGA-based accelerators adopt an SA design and are optimized on the hardware side. Alwani et al. [36] and Xiao et al. [19] instantiate a fusion design with full consideration of hardware resources. fpgaConvNet [15] maps fused layers to different bitstreams of SA designs. Although they achieve appealing performance for dedicated neural networks, scaling these designs to deeper networks is difficult on a resource-limited platform, and reconfiguration overheads are introduced when switching to other models or applications. Even with high-level synthesis, a long compilation process is needed to realize these optimizations.
A few compilers map NNs to instructions that are loaded on custom FPGA-based SCEs. These designs simplify the hardware design and improve the usability of applications. Writing algorithms in compilers to maximize the hardware throughput can be more efficient. Additionally, instruction-based computation blocks can be reused for various operations. VTA [16], together with TVM, adopts a fusion technology named Task-Level Pipeline Parallelism to generate optimized instruction chains that are loaded on a custom SCE design. The main challenge of TVM/VTA lies in the combinatorial explosion of fusion opportunities; which fusion sequence is the optimal execution strategy has not been determined. Xilinx and Intel propose xfDNN [14] and DLA [22], respectively, as tool chains to deploy CNNs on custom accelerators, and achieve state-of-the-art throughput. They fuse many adjacent layers, and the weights and biases of these layers can be pre-loaded thanks to the large on-chip memory capacity. However, public evidence of their feasibility on platforms with less on-chip memory is not available.
