Deep Convolutional Neural Networks (CNNs) are the state-of-the-art systems for image classification and scene understanding. However, such techniques are computationally intensive and involve highly regular parallel computation. CNNs can thus benefit from a significant acceleration in execution time when running on fine-grain programmable logic devices. As a consequence, several studies have proposed FPGA-based accelerators for CNNs. However, because of the huge amount of required hardware resources, none of these studies was based on a direct mapping of the CNN computing elements onto the FPGA physical resources. In this work, we demonstrate the feasibility of this so-called direct hardware mapping approach and discuss several associated implementation issues. As a proof of concept, we introduce the Haddoc2 open source tool, which is able to automatically transform a CNN description into a platform-independent hardware description for FPGA implementation.
Introduction
Convolutional Neural Networks (CNNs) [1] have become a de facto standard that increased the robustness and accuracy of machine vision systems. It is nowadays possible to build high-performance image classification systems by deploying large-scale, pre-trained CNN models. However, this accuracy comes at the price of a high computational cost, as state-of-the-art CNNs may require up to 38 GOP to classify a single frame [2]. As a result, implementing CNNs under real-time constraints is a challenging task. A possible way to address this challenge is to take advantage of the massive fine-grain parallelism offered by FPGA devices to embody the large amount of intrinsic parallelism exhibited by CNN-based algorithms. In this case, the problem boils down to finding an adequate and efficient mapping between the computation model of the latter and the execution model supported by the former. Based on our previous experience in the implementation of real-time vision applications on FPGA-based platforms [3], we advocate the use of a stream-based dataflow model to solve this mapping problem. In this approach, a CNN-based algorithm is described as a graph of dataflow actors exchanging data through unidirectional channels, and this graph is statically and physically mapped onto the target FPGA using a library of pre-defined computing elements to implement actors.
In the sequel, we demonstrate the feasibility of this so-called Direct Hardware Mapping (DHM) approach for implementing realistic CNN-based applications on Field-Programmable Gate Arrays (FPGAs). Moreover, we introduce Haddoc2, a software framework providing a fully automated implementation path for CNNs onto FPGAs using the DHM approach. The Haddoc2 tool is compatible with the widely used Caffe deep learning framework [4] and generates platform-independent synthesizable VHDL code. In other words, we introduce in this work a tool that automatically maps a Caffe pre-trained model onto an FPGA device.
CNNs: Computations and parallelism sources
CNNs are a category of feed-forward artificial neural networks that are bio-inspired by the visual cortex of the brain. The huge improvement of CNN-based algorithms was made possible by two factors. On the one hand, the availability of massive annotated image data sets [5] made it possible to train robust large-scale feature extractors and accurate classifiers. On the other hand, the growth of high-performance processors, and especially Graphics Processing Units (GPUs), provided the computational power required to train deeper and more complex neural networks [6]. A typical CNN structure, as shown in figure 1, performs a succession of convolutions interspersed with sub-sampling layers. The last stages typically include two or three fully connected layers for classification tasks. The depth (number of layers) of a CNN ensures better accuracy and less over-fitting. As a result, the depth of neural networks tends to increase (from 8 up to 19 layers for VGG [7]).
Convolution layers
Convolutional layers are the most computationally intensive and are responsible, in a typical implementation, for more than 90% of the CNN execution time [8]. Each layer (l) extracts N feature maps from C input channels by performing N convolutions of size K × K on each input. This filtering is followed by the addition of a bias term $b^{(l)}[n]$ and the application of a non-linear activation function act to each set of features. As shown in equation 1, N × C convolutions are required to process a given layer.
$$\forall l = 1:L,\quad \forall n = 1:N^{(l)}: \qquad f^{(l)}[n] = \mathrm{act}\left( b^{(l)}[n] + \sum_{c=1}^{C^{(l)}} \mathrm{conv2D}\left( \phi^{(l)}[c],\; w^{(l)}[n,c] \right) \right) \tag{1}$$
where
• $f^{(l)}$ is a tensor of output feature maps of layer (l)
• $b^{(l)}[n]$ is the bias term applied to feature n
• $\phi^{(l)}$ is a tensor of input feature maps of layer (l)
• $w^{(l)}$ is a tensor of pre-learned filters
As already pointed out in [9], the computations described in equation 1 exhibit a large amount of potential parallelism:
• Inter-layer parallelism: CNNs have a feed-forward hierarchical structure consisting of a succession of data-dependent layers. Layers can therefore be executed in a pipelined fashion, where the execution of layer (l) can start before the execution of layer (l − 1) ends.
• Inter-neuron parallelism: Each neuron of a layer is independent when processing features. Full data parallelism can thereby be exploited by computing concurrently each of the $N^{(l)}$ elements of equation 1.
• Inter-convolution parallelism: All of the convolutions performed by a single neuron can also be evaluated simultaneously by computing concurrently the $C^{(l)}$ convolutions of equation 1.
• Intra-convolution parallelism: 2D image convolution can be implemented in a pipelined fashion [10], allowing the K × K multiplications of equation 1 to be computed concurrently.
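To make the parallelism sources listed above concrete, the following is a minimal software sketch of one convolutional layer. It is not the implementation used in this work: the function names, the "valid" border handling, the absence of kernel flipping and the ReLU default for act are all assumptions made for illustration. Each loop is annotated with the parallelism source it corresponds to; a sequential program runs them one after the other, whereas a hardware mapping can unroll them.

```python
def conv2d_valid(image, kernel):
    """One K x K convolution over a single channel ('valid' borders, no kernel flip: assumptions)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Intra-convolution parallelism: the K*K products are mutually independent.
            out[i][j] = sum(image[i + p][j + q] * kernel[p][q]
                            for p in range(k) for q in range(k))
    return out


def conv_layer(inputs, weights, biases, act=lambda v: max(v, 0.0)):
    """inputs: C feature maps; weights[n][c]: K x K kernel; biases[n]: scalar.

    Returns N output feature maps. The ReLU default for `act` is an assumption.
    """
    outputs = []
    for n, bias in enumerate(biases):
        # Inter-neuron parallelism: each of the N neurons is independent.
        # Inter-convolution parallelism: the C convolutions of a neuron are independent.
        partial = [conv2d_valid(x, weights[n][c]) for c, x in enumerate(inputs)]
        rows, cols = len(partial[0]), len(partial[0][0])
        outputs.append([[act(bias + sum(m[i][j] for m in partial))
                         for j in range(cols)] for i in range(rows)])
    return outputs
```

For a layer with C inputs and N outputs, `conv2d_valid` is called N × C times, which matches the convolution count stated for equation 1.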
Subsampling layers
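A common operation when designing CNNs is to periodically insert subsampling (or pooling) layers in between successive convolutional layers. These downsample the inputs by selecting the average or, more commonly, the maximum of a given neighborhood of each pixel, as described in equation 2 for the maximum, where $\phi^{(l)}$ and $f^{(l)}$ denote the input and output feature maps of layer (l) and K is the size of the pooling neighborhood.
$$\forall l = 1:L \ \text{(number of pool layers)},\quad \forall n = 1:N^{(l)} \ \text{(number of output feature maps)},\quad \forall i = 1:I_x \ \text{(feature map rows)},\quad \forall j = 1:I_y \ \text{(feature map columns)}:$$
$$f^{(l)}[n,i,j] = \max_{p,q \in [1:K]} \phi^{(l)}[n,\, i+p,\, j+q] \tag{2}$$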
Pooling layers reduce the number of parameters required to process the next stages of the network, which controls over-fitting on the one hand and decreases the computation load on the other.
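The subsampling operation described above can be sketched in a few lines. This is an illustrative sketch only: the function name and the `k` and `stride` parameters are assumptions, and the non-overlapping 2 × 2 default is a common choice rather than something stated in this work.

```python
def max_pool(feature_map, k=2, stride=2):
    """Keep the maximum of each k x k neighborhood, moving by `stride` pixels (assumed parameters)."""
    rows = (len(feature_map) - k) // stride + 1
    cols = (len(feature_map[0]) - k) // stride + 1
    return [[max(feature_map[i * stride + p][j * stride + q]
                 for p in range(k) for q in range(k))
             for j in range(cols)] for i in range(rows)]
```

With the defaults, a 4 × 4 feature map is reduced to 2 × 2, so the next layer processes four times fewer values.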
Fully connected layers
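A Fully Connected (FC) neural network, usually with 3 or 4 hidden layers, terminates CNNs and acts as a classifier. In this case, no parameters are shared across the feature maps (feature maps and learned parameters have the same dimension). FC layer activations are therefore computed with the inner product operation followed by a bias offset, as detailed in equation 3, where $\langle \cdot,\cdot \rangle$ denotes the inner product operator, $\phi^{(l)}$ and $f^{(l)}$ are the input and output features of layer (l), $w^{(l)}[n]$ is the set of learned parameters of neuron n and $b^{(l)}[n]$ is its bias.
$$f^{(l)}[n] = b^{(l)}[n] + \left\langle \phi^{(l)},\, w^{(l)}[n] \right\rangle \tag{3}$$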
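As a small illustration of the inner product followed by a bias offset, the sketch below flattens the incoming feature maps and computes one output per neuron. The function name and the flattening order are assumptions; no activation is applied, since the text above only mentions the inner product and the bias.

```python
def fully_connected(features, weights, biases):
    """features: list of 2D maps; weights[n]: flat weight vector of neuron n; biases[n]: scalar."""
    # Flatten all feature maps into one vector (row-major order is an assumption).
    flat = [v for fmap in features for row in fmap for v in row]
    # One inner product plus bias per neuron; no weights are shared between neurons.
    return [b + sum(x * w for x, w in zip(flat, w_n))
            for w_n, b in zip(weights, biases)]
```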
3 Direct Hardware Mapping of CNN entities
Data ow processing of CNNs
The foundations of dataflow Models of Computation (MoC) were formalized by [11] in order to create an architecture where multiple fragments of instructions can simultaneously process a stream of data. Programs respecting dataflow semantics are described as a network (graph) of fundamental processing units, commonly called actors, that communicate abstract data messages, called tokens, over unidirectional First-In First-Out (FIFO) channels. In terms of architecture-application matching, the CNN layout fits naturally with a stream-based model of computation. All of the operations involved in the feed-forward propagation of a CNN, described in the previous section, can be executed following the stream-based dataflow MoC. In fact, CNN-based algorithms can be modeled as dataflow process networks (DPNs) where nodes correspond to processing actors and edges correspond to communication channels. Each actor follows a purely data-driven execution model where execution (firing) is triggered only by the availability of input operands.
The DHM approach consists of physically mapping the entire graph of actors onto the target device. Each actor becomes a computing unit with its own specific instance on the FPGA, and each edge is mapped to a signal.
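The data-driven firing rule can be emulated in software with a few lines. The sketch below is only an illustration of the execution model: the `Actor` class, the `run` scheduler and the `demo` graph are hypothetical names, and the sequential round-robin loop stands in for hardware in which every actor is a separate instance running concurrently.

```python
from collections import deque


class Actor:
    """A dataflow actor: fires only when every input FIFO holds a token."""

    def __init__(self, func, inputs, output):
        self.func, self.inputs, self.output = func, inputs, output

    def can_fire(self):
        return all(len(channel) > 0 for channel in self.inputs)

    def fire(self):
        # Consume one token per input channel, produce one token on the output.
        tokens = [channel.popleft() for channel in self.inputs]
        self.output.append(self.func(*tokens))


def run(actors):
    """Sequential emulation: keep firing until no actor has its operands available."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


def demo():
    pixels, coeffs = deque([1, -2, 3]), deque([2, 2, 2])
    products, result = deque(), deque()
    graph = [Actor(lambda p, w: p * w, [pixels, coeffs], products),  # multiplier actor
             Actor(lambda v: max(v, 0), [products], result)]         # activation actor
    run(graph)
    return list(result)
```

In this toy graph each FIFO corresponds to an edge and each `Actor` to a node; under DHM the FIFOs become signals and no scheduler is needed.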
DHM of Convolution layers
As stated in section 2.1, convolutional layers are the most computation-intensive tasks in a given network. However, the DHM approach fully exploits all the parallelism sources of these layers. All neurons of a layer are mapped on the device to take advantage of inter-neuron parallelism (Fig 2-a). Within each neuron, each convolution is mapped separately (Fig 2-b) and finally, within a convolution engine, each multiplier is instantiated separately (Fig 2-c). As an example, figure 3 illustrates how a convolution layer C1 (C = 3, N = 5, K = 3) extracts 5 features from a 3-channel input pixel flow. In this example, 15 convolution blocks and 5 activation blocks are mapped onto the FPGA as a result of the layer graph transformation, which corresponds to 135 multiplications, 20 summations and 5 activations.
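The operator counts of the C1 example can be reproduced with a short calculation. The function below is a sketch: its name is hypothetical, and the summation count assumes one accumulation per convolution engine plus one cross-channel sum per neuron, which is one reading that yields the 20 summations quoted above.

```python
def dhm_conv_layer_resources(c, n, k):
    """Count the operators instantiated when a conv layer (C inputs, N outputs, K x K kernels) is fully mapped."""
    convolutions = n * c
    return {
        "convolutions": convolutions,
        "multipliers": convolutions * k * k,
        # Assumption: one accumulation per convolution engine + one channel sum per neuron.
        "summations": convolutions + n,
        "activations": n,
    }
```

Calling `dhm_conv_layer_resources(3, 5, 3)` gives 15 convolutions, 135 multipliers, 20 summations and 5 activations, in line with the figures given for C1.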
. . . 
Optimizing DHM-based CNN accelerators
Direct Hardware Mapping of CNNs completely removes the need for an external memory to store intermediate results or parameters. Moreover, thanks to the fully pipelined execution model, the global throughput is only limited by the maximum clock frequency. However, these advantages come at the cost of a high resource consumption, since the whole graph has to be mapped onto the physical resources of the FPGA. In certain cases, this could limit the complexity of the CNNs that can be handled by the DHM approach. It is therefore crucial to ensure that the core operations involved in CNN actors can be translated efficiently into hardware. The most important issues, by far, are those related to on-chip memory requirements on the one hand, and to the implementation of arithmetic operators on the other.
Neighborhood extraction
The literature provides multiple approaches to efficiently accelerate the computation of convolutions. Dataflow-based accelerators, such as in [10], are based on a fully pipelined architecture that is able to process one convolution per clock cycle. Such an architecture can be divided into 2 parts: Neighborhood Extraction (NE) and Multiply-ACCumulation (MAC). NE relies on buffers to grant full access to the $K^{(l)} \times K^{(l)}$ neighbors of each pixel (as shown in figure 4). Such an architecture is advantageous since it can directly extract the neighborhood of a stream of pixels at each clock cycle.
MAC performs a multiplication of the neighborhood pixels with the pre-learned kernels, then accumulates the result into the output feature maps. As long as access to the full neighborhood is guaranteed, each of the multiplications can be performed in parallel using $K^{(l)} \times K^{(l)}$ multipliers (as shown in Fig 2-c).
In the case of CNNs, combining NE with a parallel MAC strategy fully exploits the intra-kernel parallelism, which grants a high acceleration to convolutions and, consequently, to the feature extraction process. However, mapping a full CNN graph involving millions of convolutions comes down to mapping millions of memory buffers on the FPGA fabric, which increases the power consumption of the system and lowers the maximum frequency (and thus the computation throughput).
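The behaviour of the neighborhood extraction stage can be modelled in software as a sliding buffer over a raster-scan pixel stream. This is a behavioural sketch under assumptions, not the hardware description used here: the function name is hypothetical, pixels are assumed to arrive in row-major order, and a single bounded deque stands in for the K − 1 line buffers plus the K window taps.

```python
from collections import deque


def neighborhoods(pixel_stream, width, k):
    """Yield the K x K neighborhood ending at each incoming pixel, once it is fully available."""
    depth = (k - 1) * width + k       # pixels retained: K-1 full lines plus K taps
    buffer = deque(maxlen=depth)      # oldest pixel is dropped automatically
    for index, pixel in enumerate(pixel_stream):
        buffer.append(pixel)
        row, col = divmod(index, width)
        if row >= k - 1 and col >= k - 1:
            # Tap (p, q) sits p lines and q columns after the oldest retained pixel.
            yield [[buffer[p * width + q] for q in range(k)] for p in range(k)]
```

Once the buffer is primed, one neighborhood is available per incoming pixel, which is what lets the K × K multipliers of the MAC stage work on a new window at every step.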
Neighborhood Extraction Factorization (NEF)
One way to address the latter issue is to factorize the neighborhood extraction process in order to optimize the memory footprint of convolutional layers. In this case, it is possible to rely only on on-chip memory buffers to process a whole convolutional layer.
Since multiple neurons in a given layer have the same input features to process (only the convolution kernels change), the neighborhood extraction entity can be factorized for each input feature map, which divides the memory requirements of each layer by a factor $N^{(l)}$ (cf. figure 5). For instance, while the first layer of the AlexNet CNN (N = 96, C = 3, K = 11) would require 96 × 3 × 11 × 11 ≈ 34 KB of buffer memory to be processed, a factorization of the neighborhood extractors needs 0.3 KB, which corresponds to 96 times less memory. Full results of NEF on AlexNet layers are detailed in figure 6.
Several studies [12, 13] have demonstrated that CNNs, and more generally deep learning applications, usually tolerate approximate computations with short fixed-point arithmetic. Frameworks such as Ristretto [13], for example, can perform fine-tuning of the data representation in order to support fixed-point numerical representations with variable data lengths. In particular, an 8-bit (resp. 2-bit) precision is sufficient to infer the AlexNet [14] (resp. LeNet [15]) CNNs with little to no degradation in classification accuracy. The DHM approach advocated in this work can take advantage of this to significantly reduce the amount of required hardware resources by first inferring the minimal required precision and then deriving the size of the hardware resources to exactly match this precision with the adequate bit-width.
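To illustrate what a short fixed-point representation means in practice, the sketch below rounds a real-valued parameter to a signed fixed-point code. It is a generic scheme and not the procedure used by Ristretto or by this work: the function names and the 8-bit word with 6 fractional bits are assumptions chosen for the example.

```python
def quantize(value, total_bits=8, frac_bits=6):
    """Round `value` to a signed fixed-point integer code, saturating at the word limits."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, round(value * scale)))


def dequantize(code, frac_bits=6):
    """Real value represented by a fixed-point code."""
    return code / (1 << frac_bits)
```

With these assumed settings the representable range is [-2, 2) in steps of 1/64, and small parameters such as 0.004 collapse to the code 0.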
Multiplications with Logic Elements
Convolutions require many multiplications. If these multiplications are implemented using hardwired Digital Signal Processing (DSP) blocks within the target FPGA, this dramatically limits the complexity of the CNN that can be implemented. For instance, the second layer of the LeNet5 network (C = 6, N = 16, K = 5) requires 2400 multipliers. This number largely exceeds the number of hard-wired multiplier blocks provided by many FPGAs, especially embedded devices. We overcome this problem by systematically forcing the synthesis tool to implement multiplications with logic elements instead of DSP blocks, leading the resulting implementations to rely on AND gates and trees of half-adders [16]. In this case, the number of logic elements required to implement a convolution increases quadratically with precision. Moreover, due to the large number of multiplications involved in CNNs, the available logic on embedded FPGA devices may not suffice to support a full complex CNN graph. We take advantage of the fact that, in the case of CNNs, the convolution kernels (and hence the second operand of the multiplications) are actually constants derived from the offline training stage. It is therefore possible to use a specialized version for those multiplier instances. While this approach limits the flexibility of the system (it requires re-compiling and re-synthesizing the VHDL design whenever parameter values are changed), it delegates to the synthesis tool the task of performing low-level area and performance optimizations. More particularly, multiplications by 0 (resp. 1) are removed (resp. replaced by a simple signal connection) and multiplications by a power of 2 are implemented using shift registers.
Moreover, we find that a large proportion of CNN parameters are, after the quantization process, equal to zero, one or a power of two. This is illustrated in figure 7, where 72% of the AlexNet multiplications (with an 8-bit precision) can be either removed or replaced with signals or shift registers. Figure 8 shows how the number of logic elements required to implement a pipelined convolution decreases as the proportion of these "special" kernels increases.
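The share of "special" constants in a set of quantized kernels can be estimated with a simple classification. The sketch below is illustrative only: the function names and category labels are hypothetical, and classifying by magnitude (so that a negative power of two also counts as a shift, ignoring the cost of negation) is an assumption that may differ from how the synthesis tool or figure 7 counts them.

```python
def classify_constant(code):
    """Classify an integer fixed-point constant by the hardware its multiplication needs."""
    magnitude = abs(code)             # assumption: the sign is handled separately
    if magnitude == 0:
        return "removed"              # multiplication by 0: no hardware
    if magnitude == 1:
        return "wire"                 # multiplication by 1: plain signal connection
    if magnitude & (magnitude - 1) == 0:
        return "shift"                # power of two: shift only
    return "multiplier"               # generic constant multiplier


def special_fraction(codes):
    """Fraction of constants that do not need a generic multiplier."""
    special = sum(1 for code in codes if classify_constant(code) != "multiplier")
    return special / len(codes)
```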
The Haddoc2 utility
The Haddoc2 framework is a set of tools built upon the principles and optimization techniques described in the previous section. It is capable of automatically generating a platform-independent hardware description of a CNN from a Caffe model [4]. First, layer specifications (layer type, number of input channels C, number of output features N, kernel size K) are extracted from the Caffe model, and the learned parameters are read, rounded to a fixed-point representation format and written as generic parameters in a configuration file. Second, a top-level VHDL file is created by transforming the dataflow graph described in Caffe. The top level instantiates a set of generic layers parametrized according to the Caffe model specifications. These layers are described using a small number of basic predefined actors. These actors, written in structural VHDL, follow the dataflow execution semantics discussed in the previous sections. The output is platform-independent VHDL code that can be implemented on the FPGA device using the adequate 
... 
Experimental Results with Haddoc2
As a proof of concept, we have implemented, using the Haddoc2 framework, FPGA-based accelerators for three CNN-based applications, listed in Table 2. The first one is the Caffe version of the LeNet5 [15] CNN, which requires 20.78 MOPs to process a frame of size 28x28. The second application is the face detector used in [?], which requires 622.08 MOPs to process a 320x240 frame. The last one is introduced in [18] to perform car type classification and requires 268.28 MOPs to process 96x96 frames. The first two CNNs have been trained using Caffe, while the third model has been directly downloaded as a Caffe pre-trained model. Table 2 gives parameter values for each CNN convolutional layer. The LeNet5 and CarType CNNs have 2 convolutional layers while FaceDetect has 3. The corresponding hardware descriptions of each network have been automatically generated using Haddoc2 on an Intel i7-4770 CPU and were synthesized on two FPGA devices using Intel Quartus 16.1 and Xilinx Vivado 2016.4 respectively.
Table 3 reports post-fitting results of the LeNet5 accelerator on an embedded Intel Cyclone V 5CGXFC9E7 device using 3 implementation strategies. In the first case, only DSP blocks are used to map the CNN multiplications. The resulting hardware requires 72× the available resources of the device. The second case features an implementation of multiplications based on logic elements and requires 3.8× the available logic. Using tailored multipliers reduces resources by a factor of 8.6×, fitting the CNN accelerator onto an Intel Cyclone V device.
Table 4 details post-fitting results on two embedded FPGA platforms: the Intel Cyclone V 5CGXFC9E7 and the Xilinx Kintex7 XC7Z045FBG. To the best of our knowledge, these numbers are the first to demonstrate the applicability of a DHM-based approach for the implementation of CNNs on embedded FPGAs. The three hardware accelerators fit onto the embedded devices with no off-chip memory requirement.
The memory footprint shown in the post-fitting reports corresponds to the line buffers used by the dataflow-based convolution engine, and both synthesis tools instantiate LUT-based memory blocks to implement these buffers. As expected when using DHM, the logic utilization in the FPGA grows with the size of the CNN topology. However, in all the studied cases, the resources are sufficient to support direct hardware mapping. Finally, the same table reports timing analysis results of the three generated hardware accelerators. With a peak frequency of 62.3 MHz for the CarType CNN, DHM grants a maximum computation throughput of 1813 GOP/s. For the face detection neural network, the presence of a third convolutional layer in the pipeline drops the maximum frequency to 56.7 MHz (i.e. 357 GOP/s) on the Cyclone device, which corresponds to 164 classifications/sec on 512x512 images with a 3-multiscale pyramid.
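The relation between clock frequency and throughput can be checked with a short calculation. The sketch below rests on two assumptions that are not stated explicitly above: the fully pipelined accelerator consumes one pixel per clock cycle, and each level of the 3-scale pyramid halves both image dimensions. The function names are hypothetical. Under these assumptions the CarType throughput and the FaceDetect classification rate quoted above are reproduced.

```python
def frames_per_second(clock_hz, pixels_per_frame):
    """Frame rate of a pipeline that accepts one pixel per clock cycle (assumption)."""
    return clock_hz / pixels_per_frame


def throughput_gops(clock_hz, pixels_per_frame, mops_per_frame):
    """Computation throughput in GOP/s for a workload of `mops_per_frame` MOP per frame."""
    return frames_per_second(clock_hz, pixels_per_frame) * mops_per_frame / 1e3


# CarType: 96x96 frames, 268.28 MOP per frame, 62.3 MHz.
cartype_gops = throughput_gops(62.3e6, 96 * 96, 268.28)

# FaceDetect: 512x512 image plus two downscaled pyramid levels (assumed 1/4 and 1/16 of the pixels).
pyramid_pixels = 512 * 512 * (1 + 1 / 4 + 1 / 16)
face_rate = frames_per_second(56.7e6, pyramid_pixels)
```

Here `cartype_gops` evaluates to roughly 1813.6 and `face_rate` to roughly 164.8.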
Related work
Several studies leverage FPGA computational power and hardware flexibility to implement the feed-forward propagation of CNNs. A non-exhaustive review of these can be found in [20]. In most approaches, acceleration of CNN-based applications is provided by mapping a limited subset of processing elements onto the target device. This is the case, for example, in [21], where the authors describe an accelerator for the AlexNet CNN [14] implemented on a large Stratix V FPGA which, to the best of our knowledge, outperforms most state-of-the-art implementations, such as [19, 22, 23], in terms of computational throughput. Most of these designs are FPGA-based accelerators for convolution with a relatively similar architecture of parallel processing elements associated with embedded hard-core processors running a software layer. Other approaches, like [24], rely on an analytical design scheme using the roofline model and loop tiling to propose an inference engine where the attainable computation roof of the FPGA is reached. This loop tiling optimization is performed on C code, which is then implemented in floating point on a Virtex 7 485T using the Vivado HLS tool.
As seen in the previous sections, feed-forward propagation is an algorithm that is intrinsically suited to dataflow processing. Thus, dedicated stream processors for CNNs have been proposed. The most notable contribution was neuFlow [25], a runtime-reconfigurable processor for real-time image classification. In this work, Farabet et al. introduced a grid of processing tiles that were configured at runtime to build a dataflow graph for CNN applications. It was associated with "luaFlow", a dataflow compiler that transforms a high-level flow-graph representation of an algorithm into machine code for neuFlow. This architecture was implemented on a Virtex 6 VLX240T and provided a 12 fps categorization for 512x375 images. NeuFlow thus transformed a CNN graph into a set of dataflow instructions, where each instruction is described as a hardware configuration of 2D processing elements called Processing Tiles (PTs). Execution of the graph is carried out by sequencing the instructions on the target FPGA. This approach requires an external memory to store intermediate results which, even with the help of a DMA, limits the final speedup. The study in [26] features a partitioning of the CNN graph with one bitstream per subgraph, in such a way that only on-chip memory is needed to store intermediate results. This, however, requires the reconfiguration of the FPGA whenever data has to enter a different subgraph, which adds a substantial reconfiguration time overhead.
By contrast, the DHM approach and the Haddoc2 tool introduced in the present work perform all processing on the fly and do not require an external memory to store intermediate results. Throughput is therefore not limited by off-chip memory bandwidth. Previous work in [27] describes a first version of Haddoc that relied on Caph [3], a High-Level Synthesis (HLS) tool, to provide dataflow-based hardware accelerators for CNNs on FPGAs. While this implementation operated at very high frame rates (800 classifications/sec on 256 × 256 images), the overhead that comes with HLS heavily restricted the size of the CNNs that could be implemented, which motivated us to bypass the Caph HLS layer by hand-crafting RTL IP cores that respect the dataflow execution model and support the detailed DHM concepts.
