Deep neural networks (DNNs) are used by many applications that run on a range of computer architectures, from IoT devices to supercomputers. The memory footprint of these networks is huge, as are their computational and communication needs. To ease the pressure on resources, research indicates that in many cases a low-precision representation (1-2 bits per parameter) of weights and other parameters can achieve similar accuracy while requiring fewer resources. Using quantized values enables the use of FPGAs to run NNs, since FPGAs are well suited to these primitives; e.g., FPGAs provide efficient support for bitwise operations and can work with arbitrary-precision representations of numbers.
INTRODUCTION
A neural network (NN) [1] [2] is a computational model inspired by the way we believe our brain operates: the data that comes from our sensors, e.g., eyes, is processed by multiple simple computational units called neurons. The neurons are interconnected through a complex network of connections (axons), and after several transformations, the input is translated into a conclusion such as "there is a chair in the picture." Similarly, artificial NNs use vast amounts of simple computational elements that are organized in interconnected layers. Modern NNs usually have multiple layers (sometimes 1000 [3] or more) and thus are called deep neural networks (DNNs). These networks are widely used in image processing, medicine, autonomous driving, translation and other fields.
In order to better interpret local features of multidimensional inputs such as images, convolutional neural networks (CNNs) are commonly used. This type of NN has been shown to be efficient in image-related problems such as classification or scene parsing. To achieve these results, CNNs need many parameters (over 100M parameters reported in [4]) and require huge amounts of computational resources and memory. As a result, expensive and power-hungry computers are needed to efficiently process these networks, which has led researchers to seek ways to reduce the computational, memory, and bandwidth requirements [5] [6] [7] [8] [9].
Using binarized neural networks (BNNs) [10] [11] [12] is one proposed solution to the problem. In BNNs, each parameter is represented by only one bit, which saves memory, communication time and energy, and enables the use of bitwise operations, which are simpler and faster than multiplications. For this reason, FPGAs seem to be the most appropriate architecture for BNN execution. Programming FPGAs, however, is non-trivial, especially in comparison to the modern scripting languages that are used for NN development. In order to simplify development, major FPGA manufacturers have invested heavily in high-level synthesis (HLS) tools that can translate programs written in high-level languages and frameworks such as OpenSPL [13], C (C-to-VHDL, presented as part of Vivado HLS [14]) or OpenCL [15] [16]. Today, HLS-based tools provide a decent tradeoff between development time and resource utilization, compared to custom-written HDL code.
In this paper, we focus on architectural and optimization techniques for implementing quantized neural networks (QNNs) on FPGAs using high-level programming languages. The main objective of this work is to investigate architectural features of reduced-precision NNs without focusing on low-level optimizations, and accordingly we used an HLS-based platform to model our architecture. We propose a streaming model based on functional decomposition of the computations, which are embedded in data flow engines (DFEs) based on FPGAs. For this purpose, we used the OpenSPL programming environment and Maxeler's hardware platform, since the latter allowed us to implement the desired processor model using high-level languages.
Our results indicate that QNNs scale well with both input and network size, showing only a minor increase in resource usage on larger inputs. In addition, our system can easily be partitioned across several FPGAs with almost no performance drop. This allows us to run a full-sized ResNet-18 and AlexNet on two and three FPGAs, respectively, achieving runtime comparable with the latest GPUs while consuming less power and energy. Moreover, in contrast to previous work, we implemented multiple-bit activations, which improve the accuracy of the network by up to 10% [17] [18].
We also analyze skip connections and their impact on resource utilization and runtime, concluding that streaming architecture allows us to add skip connections for a relatively small price.
The paper is organized as follows: Section 2 provides background on NNs, CNNs, and NN size-reduction methods, including BNNs and NN parallelization approaches. Section 3 reviews related work and its notable achievements. Section 4 explains the platform on which we built our network. Section 5 describes our model architecture and optimizations. Section 6 presents our experimental evaluation, and Section 7 presents our conclusions and proposes ideas for future work.
BACKGROUND
We now provide a brief review of NNs, in particular CNNs and DNNs, their architecture and implementation in hardware.
Neural networks
In an NN, inspired by a human brain, neurons are activated in response to input. The activation of neurons allows the network to detect and classify patterns. Depending on the input data, an NN will calculate the probability that the data belong to a certain class (e.g., an object in a specific image). The network can be trained to recognize different classes by being provided a set of labeled training data. For example, given a set of faces and a set of non-faces, it can learn to decide whether an image contains a face. This is called supervised learning.
Training of the NN involves more computations and takes more time than using a trained network (inference). To train the network, it is necessary to repeatedly compute gradients of output errors from the last to the first layer and update the weights of the network, trying to minimize the loss function and improve accuracy.
Convolutional neural networks
CNNs are a type of NN commonly used in image processing. Convolutions allow NNs to use the way information is structured in the image to reduce the number of calculations and improve feature extraction. In contrast to fully connected layers, where each neuron is connected to every neuron of the previous layer, neurons in a convolutional layer are connected only to a small group of adjacent neurons in the previous layer. By stacking a number of convolutional layers, the network hierarchically learns high-level features of the image.
In a typical CNN, convolutional layers are interleaved with pooling layers. Pooling layers are used to reduce feature map dimensions by subsampling with some simple function; for example, average or maximum. An activation function, such as Rectified Linear Unit (ReLU), softmax or tanh, is applied to the output of convolutions to introduce nonlinearity. At the end, a fully connected layer is usually added to transform spatially organized feature maps into the required output format.
CNNs perform well in object recognition tasks with pixel-based input. Most successful CNN models such as AlexNet [19] , GoogLeNet [20] , ResNet [3] and Inception [21] can be used to classify thousands of different objects with high accuracy. These models can include dozens of convolutional layers and require more than 10 GFLOPs of calculations [3] , outperforming humans on complex image processing tasks [22] .
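The image convolution algorithm uses discrete convolution in two dimensions. The value of the output pixel is calculated as follows:

$$O(x, y) = \sum_{i=-\lfloor K/2 \rfloor}^{\lfloor K/2 \rfloor} \; \sum_{j=-\lfloor K/2 \rfloor}^{\lfloor K/2 \rfloor} I(x+i,\, y+j)\, H(i, j)$$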
where O(x, y) is the pixel at position (x, y) in the output image, I(x, y) is the corresponding pixel in the input image, K is the kernel size and H(i, j) is the filter kernel. Both the input and output images consist of X (width) × Y (height) pixels. To calculate the output pixel, O(x, y), we sum over the pixels in the K × K neighborhood of the input image, each multiplied by the corresponding coefficient in the filter kernel, H. Figure 1 illustrates a simple example of image convolution. The figure shows an input image, I, with 7 × 7 pixels and a 3 × 3 filter kernel, H. The resulting output, O, will be a 5 × 5 image where every pixel has been processed.
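The sliding-window sum described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `convolve2d_valid` is hypothetical, the window is anchored at its top-left corner (which differs from a centered formulation only by an index offset), and, as is common in CNN practice, the kernel is not flipped.

```python
def convolve2d_valid(image, kernel):
    """Slide a K x K kernel over the image without padding (stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            # Sum the K x K window, weighted by the kernel coefficients.
            for j in range(k):
                for i in range(k):
                    acc += image[y + j][x + i] * kernel[j][i]
            output[y][x] = acc
    return output


# A 7 x 7 input and a 3 x 3 kernel give a 5 x 5 output, as in the example above.
image = [[1] * 7 for _ in range(7)]
kernel = [[1] * 3 for _ in range(3)]
result = convolve2d_valid(image, kernel)
```

With an all-ones image and kernel, every output pixel is simply the number of kernel coefficients, which makes the window size easy to check by eye.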
In addition to filter size, two more parameters must be defined for a convolution: padding and stride. Padding (usually zero-padding, i.e., padding with zeros) defines the number of pixels added to each side of the image. Padding is usually applied when it is necessary to preserve the spatial size of the feature map, in which case the borders of the image are zero-padded. Using the previous notation, in order to have the same dimensions of output and input, the image must be padded with $\lfloor K/2 \rfloor$ zero pixels on each side. The stride of the convolution is defined as the size of the filter shift between two applications of the convolution. Strided convolution is usually used for the same purpose as pooling: reduction of spatial dimensions. The dimensions of the output, given padding P and stride S, are

$$X_{out} = \left\lfloor \frac{X - K + 2P}{S} \right\rfloor + 1, \qquad Y_{out} = \left\lfloor \frac{Y - K + 2P}{S} \right\rfloor + 1$$
ImageNet [23] is one of the biggest image databases and is used in a yearly competition for 1000-class image classification. The first big success of NNs in ImageNet classification was AlexNet [19]. AlexNet is a DNN consisting of five convolutional, three pooling and three fully connected (FC) layers, which uses ReLU for activation.
Neural network size reduction
Inference, and especially training, of NNs requires a large number of multiply-accumulate operations (MACs) to compute the weighted sums of the neurons' inputs. For example, AlexNet requires 724M MACs for the processing of a single image [24]. Thus, DNNs are now mostly trained on one or more GPUs [25]. The high power usage of GPUs makes it hard to run DNNs on low-power devices such as embedded systems. Researchers have proposed different solutions for reducing DNN power demands and computational requirements on both custom [26] [27] [28] and general-purpose [29] [7] hardware.
The obvious solution is to compress an already trained network. For example, HashedNets [6] utilized a low-cost hash function to find connection weights with similar values (i.e., same hash) and assign the same value to all of them. Boulch [30] developed the idea of shared weights further, achieving a significant decrease in model size with negligible reduction in accuracy. Gong et al. [29] used vector quantization to reduce model size by more than an order of magnitude with only 1% loss of accuracy.
Han et al. [7] integrated pruning, quantization and Huffman coding to compress state-of-the-art large scale neural networks by almost two orders of magnitude without loss of accuracy.
After Han et al. [7] showed that parameters can contain a lot of redundant information, deep learning (DL) frameworks such as TensorFlow, Torch, and Caffe worked on providing software support to reduce network parameter size by quantizing parameters to 16-bit floating point numbers (FP16) and 8-bit integers (INT8), since storing parameters in single-precision (32-bit) floating point format is usually redundant [31]. Miyashita et al. [32] proposed a linear quantization scheme for parameters:

$$\mathrm{LinearQuant}(x, bw) = \mathrm{Clip}\left(\mathrm{round}\left(\frac{x}{bw}\right) \times bw,\ \min,\ \max\right)$$
where bw represents target bit-width, x is an input and min, max are the minimum and maximum of the scale range, respectively. Initial results were promising [33] [8] , and encouraged further research in the field. Recently, researchers have shown competitive results with ternary [34] [35] [36] or binary quantization.
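To make the round-and-clip idea behind linear quantization concrete, the following minimal Python sketch quantizes a value onto a uniform grid. It is an assumption-laden illustration rather than the exact scheme of [32]: the function name `linear_quantize` and the choice of a grid with 2^bw levels evenly spaced between the range limits are hypothetical.

```python
def linear_quantize(x, bw, lo, hi):
    """Clip x to [lo, hi] and round it to the nearest of 2**bw uniform levels.

    Assumption: the step size is derived from the range and the bit-width;
    other step definitions are possible.
    """
    levels = 2 ** bw - 1           # number of intervals between lo and hi
    step = (hi - lo) / levels      # distance between adjacent levels
    clipped = min(max(x, lo), hi)  # saturate values outside the range
    return lo + round((clipped - lo) / step) * step


# With bw = 2 and the range [0, 1], the representable values are 0, 1/3, 2/3, 1.
quantized = [linear_quantize(v, 2, 0.0, 1.0) for v in (-0.5, 0.1, 0.4, 0.9, 5.0)]
```

Values outside the range saturate at its limits, which is the role of the clipping step.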
Quantized and binary neural networks
Binary neural networks (BNNs) are the extreme case of quantized neural networks (QNNs), in which each parameter is represented by 1 bit. One of the first successful works on binarization of both training and inference is BinaryConnect by Courbariaux et al. [12], followed by XNOR-Net by Rastegari et al. [37], which achieved top-1 accuracy of up to 51.2% for ImageNet, using a ResNet-18-based BNN. BinaryNet by Courbariaux et al. [38] shows nearly state-of-the-art results on the MNIST, CIFAR-10 and SVHN datasets with full binarization and batch normalization (BatchNorm). In the same paper, efficient GPU kernels for binary matrix multiplication were presented. DoReFa-Net by Zhou et al. [17] improved on fully binarized networks by using different bitwidths for different parameters. One successful setup includes 1-bit weights, 2-bit activations and 6-bit gradients. The authors proposed a method to improve NN performance by eliminating multiplications over non-binary parameters, similarly to the original approach, which could be applied exclusively to binary parameters. They reported results on different datasets, including ImageNet, achieving top-1 accuracy of up to 53% for partial and 43% for full binarization. In addition, they noted that the reduced precision may make FPGAs more attractive for NN training and inference.
Hubara et al. [18] demonstrated simple and robust QNNs that achieved results comparable to full-precision networks on various datasets, including ImageNet, and their QNN is the basis of our work. They noted that quantization, which is an addition of noise to weights and activations, can be seen as a variant of Dropout [39]. Instead of randomly setting half of the activations to zero when computing the parameter gradients, as Dropout does, both the activations and the weights are quantized. As can be seen in Figure 2, while there are some similarities between full-precision feature maps and the original images, binarized feature maps appear as noise to humans.
Zhou et al. [40] achieved approximately a 1% drop in top-1 precision for different deep CNN models, and even improved accuracy over the 32-bit floating-point references for 5-bit quantization of a pre-trained network. Hou et al. [41] introduced a binarization algorithm that is more robust for recurrent and deep networks. Additionally, in the work by Bulat and Tzimiropoulos [42], BNNs often exhibited state-of-the-art performance for human pose estimation and face alignment. All the above-mentioned works show that QNNs not only reduce resource consumption, but even improve performance over regular precision networks.

Figure 2: (a) Input, (b) full precision feature maps, (c) binarized feature maps. Unlike full precision feature maps, binarized feature maps are interpreted by humans as noise. Picture by Itay Hubara.
Binarized parameters are constrained to two values: 1 and −1. There are several options for binarization functions, but the simplest one is the deterministic Sign() function:

$$\mathrm{Sign}(x) = \begin{cases} +1, & x \geq 0 \\ -1, & \text{otherwise} \end{cases}$$
Another option is stochastic functions, where the value of the function is assigned with a certain probability.
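The two binarization options can be sketched as follows. The deterministic variant maps non-negative inputs to +1 and all others to −1. For the stochastic variant, the probability function is not specified in the text; the sketch assumes a "hard sigmoid", clip((x + 1)/2, 0, 1), as the probability of returning +1, and the function names are hypothetical.

```python
import random


def sign_binarize(x):
    """Deterministic binarization: +1 for x >= 0, otherwise -1."""
    return 1 if x >= 0 else -1


def hard_sigmoid(x):
    """Assumed probability of +1: clip((x + 1) / 2, 0, 1)."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))


def stochastic_binarize(x, rng=random):
    """Return +1 with probability hard_sigmoid(x), otherwise -1."""
    return 1 if rng.random() < hard_sigmoid(x) else -1


# Deterministic results for a few sample values.
binarized = [sign_binarize(v) for v in (-0.7, 0.0, 0.3)]
```

Under this assumption, inputs at or above 1 always binarize to +1 and inputs at or below −1 always binarize to −1; only values in between are randomized.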
Since binarized values are constrained to ±1, we can replace matrix multiplication by XNOR and population count (popcount) operations. These operations are especially efficient on FPGAs that, in contrast to GPUs, have native bitwise operations. The same replacement can also be applied to more precise representations. In such cases we apply XNOR-popcount operations on each bit of the input:

$$x \cdot w = \sum_{n=1}^{m} 2^{n-1}\, \mathrm{XnorPopcount}(x^{n}, w)$$

where $m$ is the bitwidth of the input, $x^{n}$ is the $n$th bit of the input, and $w$ is a vector of 1-bit weights.
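The replacement of multiplication by XNOR and popcount can be checked with a short Python sketch. It assumes that +1 is encoded as bit 1 and −1 as bit 0, so that a dot product of two ±1 vectors of length n equals 2 × (number of matching bits) − n. The helper names and the packing into Python integers are illustrative assumptions, not the hardware layout used in this work.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int (bit i is set when values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, w_word, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    matches = bin(~(a_word ^ w_word) & mask).count("1")  # XNOR, then popcount
    return 2 * matches - n


def multibit_dot(bit_planes, w_word, n):
    """Weighted sum over bit planes; bit_planes[k] carries weight 2**k."""
    return sum((1 << k) * xnor_popcount_dot(plane, w_word, n)
               for k, plane in enumerate(bit_planes))


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
single = xnor_popcount_dot(pack_bits(a), pack_bits(w), len(a))

# Two bit planes: the second plane equals w, so it contributes 2 * 4.
planes = [pack_bits(a), pack_bits([1, 1, -1, 1])]
multi = multibit_dot(planes, pack_bits(w), len(w))
```

The single-bit result matches the ordinary dot product of `a` and `w`, and the two-plane result matches the dot product of `w` with the per-element values obtained by weighting the planes by 1 and 2.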
On the output we apply BatchNorm and linear quantization. Batch normalization [43] acts as a normalizer while quantization serves as an activation function. According to Hubara et al. [18] , both are necessary for the convergence of the QNN.
Neural network parallelization
In order to accelerate NN applications, hardware parallelism strategies are utilized. Most of these strategies can be classified into one of three categories:
Data parallelism. Data parallelism makes use of simultaneous execution on multiple cores/threads of the same function across the elements of a dataset. This approach focuses on distributing the data across different nodes, which operate on the data in parallel. It can be applied on regular data structures such as arrays and matrices by working on each element in parallel.
The main advantage of this strategy is the lack of communication in a forward pass, which speeds up the system. When data parallelism is applied among multiple GPUs, however, the whole gradient must be passed to every other GPU during the backward pass; this requires a lot of communication, and can degrade performance dramatically.
Model parallelism. Model parallelism is based on splitting the model among GPUs and using the same data for each model part.
This method splits weights among devices and does not pass gradients between them, improving performance for large models and allowing larger models to fit in memory [25] .
If, however, the model is small and the GPUs are not saturated, the performance of model parallelism-based applications would be lower than for data-parallel analogues since no device would be fully utilized.
Krizhevsky [44] showed that for different layer types, different parallelism approaches should be used. Fully connected layers with more parameters and less computations are better suited to model parallelism, while for convolutional layers, data parallelism performs better.
Pipeline parallelism. This type of parallelism is based on different dependent computation steps performed concurrently on different threads, so that output from one step is streamed as input to the next, while execution of steps is overlapping. The feed-forward computation of CNNs is well suited to pipeline parallelism, so the hardware that can exploit deep pipeline parallelism, such as FPGAs, offers significant speedup.
Today, DL frameworks are mostly suited to GPU acceleration and are based on data or model parallelism [45] [46] while FPGAs are more efficient at pipeline parallelism. This is the reason that integration of frameworks directly into FPGAs has not yielded high performance [47] .
RELATED WORK
Though many areas of machine learning benefit from DNN-like algorithms, a main drawback is the high computational complexity, especially for CNNs. Various accelerator designs for CNNs have been proposed. In this section, we introduce some notable CNN hardware implementations. A simple way to port an NN to an FPGA is to port an existing DL framework. This would allow usage of all existing code on FPGAs with minimal effort. Notable work in this field was done by DiCecco et al. [47], who presented an adaptation of the Caffe framework for FPGAs. Unfortunately, they did not achieve performance improvements over GPUs. As mentioned earlier, one reason for the lower performance is that parallelism techniques that achieve great results on GPUs and are used in existing frameworks do not perform well when ported to FPGAs. Another attempt to exploit existing frameworks was made by Qiao et al. [48], who also ported Caffe and showed higher energy efficiency than GPUs, but worse performance.
During inference, the processor needs to read numerous input data and weights, which requires a large number of memory reads as well as a large amount of memory itself. One approach to reducing the necessary bandwidth is to store the weights on-chip, which, however, is infeasible for large models. At the same time, storage of all the weights off-chip requires a high memory bandwidth. The bandwidth utilization can be increased by exploiting data reuse or data precision reduction [49] [50] [26] [51] [52] [53] .
Zhang and Prasanna [54] suggested a method for efficient computations of the convolutions on a CPU-FPGA platform with coherent shared memory. Nurvitadhi et al. [55] proposed using FPGAs to accelerate sparse ternary networks and showed improvements over a GPU implementation. Han et al. [56] showed 3× better performance on FPGAs compared to GPUs for speech recognition with Sparse LSTM, using 12-bit quantization. Another approach for boosting CNN inference on FPGAs was proposed by Aydonat et al. [57] . Using the Winograd transform [58] , they achieved state-of-the-art results on the AlexNet architecture.
Recently, systolic array architecture, which takes advantage of pipeline parallelism, was shown to achieve high performance on both ASICs [59] and FPGAs [60] .
One of the most promising works on quantized CNN custom hardware design was presented by Umuroglu et al. [61]. They proposed a BNN implementation and achieved state-of-the-art performance and power consumption on the CIFAR-10, SVHN and MNIST datasets. This design is particularly well suited to FPGAs because the model sizes are small enough to fit on-chip memory and the required operations are highly efficient on FPGAs. Similar results with ternary networks were shown by Prost-Boucle et al. [62].
Notwithstanding, the latter researchers demonstrated inference for CNNs with input size no larger than 32 × 32 × 3, which has somewhat limited application in real-life problems. In our paper, we present an implementation of a full-sized AlexNet for the ImageNet dataset (224 × 224 × 3) using only on-chip memory. We introduce optimizations to achieve computation speeds comparable to the GPU implementation.
DESIGN METHODOLOGIES FOR FPGA-BASED SYSTEMS
FPGAs operate at relatively low frequencies and are based on simple execution units that usually operate on only a few bits (typically fewer than 8). In order to achieve overall high performance and low power per operation, FPGAs rely on massively parallel operations at the chip level.
Traditionally, software for mainstream hardware is based on data decomposition: the same operations are executed in parallel on a massive amount of independent data. Another approach to achieving massively parallel operations in general, and on FPGAs in particular, is functional decomposition, also called dataflow. In this execution model, the functionality of an algorithm is decomposed into independent parallel threads and the data flows between them.
The use of pipeline parallelism for programming FPGAs
High level languages such as C, C++ or OpenCL are often used to program highly complicated algorithms, such as CNNs on FPGAs. Usually, a restricted version of these languages is used to simplify translation into lower-level representation by applying auto-vectorization techniques.
Many systems allow the addition of specific optimizations at this lower level, e.g., an efficient implementation of the XNOR primitive. Thus, the full development path starts by implementing the entire system using a high level language, followed by gradual replacement of critical blocks with highly optimized specific synthesized blocks to optimize the system.
The use of functional decomposition for programming FPGAs
Designing the system based on functional decomposition starts by identifying the different functionalities the system needs to perform and determining the flow of the data between blocks performing these functions.
This approach can use the notion of dataflow in which system activities are triggered by their inputs being ready and the output buffers able to hold the results. It fits well with the concept of streaming processing where "nodes" are implemented as threads and data are transferred using configurable routing resources, buffered on-chip memory, and flip-flops, embedded on an FPGA.
Functional decomposition has the advantage of scale-out (it can easily be extended over multiple FPGAs) but needs to be designed with extra care, since a bottleneck in one of the nodes can determine the performance of the entire system.
In this work, we chose to use the software environment of Maxeler's system, since it is (1) inherently built around the notion of data flow engines (DFEs) and (2) can be programmed using high level languages.
The general structure of Maxeler's environment is shown in Figure 3 .
Maxeler boards consist of multiple CPUs and multiple FPGAs. Each DFE contains a single FPGA, which interfaces with a CPU via PCIe. Multiple DFEs are interconnected in a daisy-chain topology via a proprietary link called MaxRing. Figure 3 depicts the architecture of a Maxeler dataflow processing system. The Maxeler system can execute multiple kernels concurrently to support multiple streams of data both at the level of internal computations of the DFEs and between the CPU and the DFEs.
Although Maxeler systems allow the attachment of a large amount of memory to each FPGA (LMem in Figure 3), in this work we used only the memory that is embedded in the FPGA fabric, called fast memory (FMem in Figure 3). FMem can store only a few megabytes of data, but access to it is much faster, and thus FMem can be used as a communication buffer between the DFEs.
The entire system is written in high level languages: Java for the kernels and manager and C++ for the CPU code.
FPGAs vs. GPUs for deep learning applications
GPUs are the most common platform to accelerate DNNs since they offer high performance (over 10 TFLOPS for FP32) and low power per operation, when compared to CPUs. The latest GPUs now have native support for short representation of numbers (both FP and integers). This feature supports the increasing need of DNNs to reduce their footprint and use more power-efficient arithmetic operations.
FPGAs are mainly optimized for the use of binary and integer operations and are less efficient when performing floating-point operations. This is one reason that most attempts to run a DNN on an FPGA accelerator have not been very successful in terms of performance. QNNs, however, involve only a small number of FP operations, which makes the use of FPGAs more attractive. Moreover, even if enough dedicated FP units are available, QNNs would still be more power efficient compared with full precision networks.
The next generation of FPGAs, such as the Intel Stratix 10 and Xilinx UltraScale+ families, will offer higher frequency, more internal high-bandwidth memory, and an increased number of dedicated floating-point units while maintaining efficient execution. These FPGAs have the potential to further boost the execution of QNNs and at the same time be more effective when executing the "traditional" DNN algorithms.
THE PROPOSED STREAMING ARCHITECTURE OF QNNS ON FPGAS
This work focuses on developing a streaming architecture that uses dataflow-based functional decomposition in order to efficiently run QNNs. In this section, we describe the architecture, the optimizations and the internal structure of a system that can efficiently run different QNNs and handle inputs of any size.

Table 1: ResNet-18 architecture [3]. Brackets contain one block, and each block is stacked twice. conv3_1, conv4_1 and conv5_1 have a stride of 2 to perform downsampling.
Overview of DNN architecture
We developed an architecture for regular CNNs and their main building blocks (convolutional, pooling and fully connected layers) and also for residual networks. Residual networks add skip connections to the CNN architecture. Skip connections forward the output of one layer to the layer after the adjacent one, skipping one layer. This mitigates the vanishing gradient [64] problem, which allows the number of layers to be increased and state-of-the-art accuracy to be achieved on image-related problems [65] [66] [67]. We developed a hardware design for skip connections and, to analyze their performance, implemented the ResNet-18 [3] network, whose architecture is shown in Table 1.
Additionally, we implemented AlexNet [19], since it is one of the most well-known DNNs and is often used as a basis for new techniques in DNNs such as network compression, performance improvements, and new types of layers [57]. The network consists of eight layers: the first five are convolutions interleaved with pooling layers, and the remaining three are fully connected. The output of the last layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.
Hardware implementation overview
CNN models used in our evaluations (ResNet and AlexNet) are based on the work of Hubara et al. [18] . We chose to use 1-bit weights and 2-bit activation function outputs. According to Hubara's evaluations, this set of parameters is a satisfactory compromise between memory requirements and model accuracy.
All the pre-trained weights and normalization parameters are stored on the CPU side, while all the computations required for the inference are performed on the DFE side. In order to fully utilize the DFE's spatial computation capabilities, we chose a streaming architecture in which the output of each layer is fed to the input of the next one as shown in Figure 5 . Unlike a traditional approach, in which the computation of the current layer starts once the previous one has finished, streaming architecture allows the current layer to begin its output calculation once enough data has been accumulated in its internal buffer. Moreover, in streaming architecture there is no need to store each layer's intermediate results in off-chip memory, since they are immediately passed down the stream.
The input to each kernel, which represents an NN layer, is a stream of pixels stored in an internal buffer (Shift Register in Figure 5). As soon as all the data required (shown as a stack of pixels in Current Window in Figure 5) for the calculation of a particular output pixel is present, the pixel is calculated and passed to the next layer. This means we can treat other layers as black boxes that receive or provide pixels. This approach simplifies the integration of layers and the building of complicated networks. Since each layer is represented in the DFE Manager by a single function call, building the network is similar to building one in high-level frameworks such as TensorFlow (Listing 1).
Each kernel starts the computation as soon as the previous one provides output to it. Due to this computation overlap, the latency is low, and after the initiation interval, computations are performed by all layers simultaneously. Additionally, due to the model's compact size, all NN parameters are kept in on-chip memory, eliminating the need to use slower off-chip memory. The following subsections describe the hardware design of each QNN component.
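The streaming behavior described above, where a layer emits an output as soon as its buffer holds enough data and layers are chained like black boxes, can be mimicked in software with Python generators. This is only a conceptual sketch in one dimension: `streaming_layer` and `build_network` are hypothetical names, a `deque` stands in for the shift register, and nothing here models the actual DFE kernels.

```python
from collections import deque


def streaming_layer(pixels, kernel):
    """Consume a pixel stream; emit an output as soon as the window is full."""
    window = deque(maxlen=len(kernel))  # plays the role of the shift register
    for p in pixels:
        window.append(p)
        if len(window) == len(kernel):
            yield sum(v * k for v, k in zip(window, kernel))


def build_network(source, kernels):
    """Chain layers so that each one consumes the previous layer's stream."""
    stream = source
    for kernel in kernels:
        stream = streaming_layer(stream, kernel)
    return stream


consumed = []


def pixel_source(n):
    """Yield n input pixels and record how many have been consumed."""
    for p in range(n):
        consumed.append(p)
        yield p


net = build_network(pixel_source(10), [[1, 1, 1], [1, -1]])
first_output = next(net)         # the second layer produces its first value
inputs_needed = len(consumed)    # only part of the input was read so far
```

In this sketch the second layer produces its first output after only four input pixels have been read, rather than after the first layer has processed the whole input, which is the overlap that keeps latency low.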
Convolution
The execution of the convolution kernel ( Figure 5 ) starts with inputs for weights, BatchNorm parameters, and feature maps. Pixels that are currently processed are stored in shift registers, while binarized weights and BatchNorm parameters are stored in the FPGA's internal memory caches. We replaced element-wise matrix multiplication of feature maps and their corresponding weights with the XNOR-popcount algorithm, followed by BatchNorm and activation functions.
The inference begins with fetching parameters: weights, biases and BatchNorm parameters. After all the parameters have been fetched, we start to input the feature maps. Every time there is enough data in the internal shift register, the kernel halts the input and calculates one output pixel per clock cycle, until all the filters are applied at this position (i.e., same (X,Y) coordinates in all feature maps). There are positions that do not produce any output; for example, the borders of the input feature map and, in the case of strided convolution, all pixels between two valid filter positions. Skipping these positions is especially important in the first layer, where, given the stride S = 4, we obtain around a 13× speedup.
If the image is padded, then, when the kernel is processing padding pixels, it stops the input stream and inputs padding values into the buffer instead. The only available values for BNNs are −1 and 1, meaning zero-padding is not possible, and −1 padding was used instead.
Weights and BatchNorm coefficient storage.
All the weights received by the FPGA are represented as 32-bit floating point numbers. Before storing these parameters in the internal memory cache, we transformed them into a 1-bit representation, using the Sign function, as described earlier.
For the filter dimensions K ×K ×I, where K is the size of the filter and I is the number of input feature maps, there are K × K × I × O weights at this layer, where O is the number of output feature maps. In order to calculate one output pixel, we need to access K × K × I weights simultaneously. Therefore, each address of the cache stores K × K × I weights and the cache has O entries.
Since BRAMs have a limited number of predefined width/depth configurations, there is no way to avoid overhead while storing weights. In our FPGA, the minimal depth of a BRAM is 512, while the maximal number of weight cache entries is 384. A BRAM allows only one access per clock, which means that at least 25% of each BRAM used for the weight cache is wasted.
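The cache geometry and the depth overhead described above can be checked with a few lines of arithmetic. This is only a sketch: the layer dimensions in the example (K = 3, I = 128, O = 384) are illustrative assumptions, and the helper names are hypothetical. Only the BRAM depth of 512 and the maximum of 384 entries come from the text.

```python
def weight_cache_shape(k, i, o):
    """Return (entries, bits_per_entry) for a K x K x I x O binarized filter bank."""
    return o, k * k * i


def wasted_depth_fraction(entries, bram_depth=512):
    """Fraction of BRAM addresses left unused when `entries` rows are stored."""
    rows_allocated = -(-entries // bram_depth) * bram_depth  # round up to a whole BRAM depth
    return (rows_allocated - entries) / rows_allocated


# Assumed example layer; one cache entry holds all weights of one output filter.
entries, width = weight_cache_shape(k=3, i=128, o=384)
waste = wasted_depth_fraction(entries)
```

With 384 entries in a 512-deep memory, a quarter of the addresses stay empty, which matches the 25% figure quoted above. Layers with fewer output feature maps waste proportionally more.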
The amount of memory required for normalization parameter storage is relatively small. We need to store 2 × O normalization parameters for each layer in its cache.
    KernelBlock conv = addKernel(new convolver3dKernel(
        makeKernelParameters(layer_name), input_size, num_in_channels, num_out_channels,
        filter_size, stride, padding, input_bitwidth, out_bitwidth, layer_number));
    conv.getInput("weights") <== addStreamFromCPU("weights");
    conv.getInput("normalization_params") <== addStreamFromCPU("normalization_params");
    conv.getInput("fmaps") <==
The weights and normalization parameters enter each layer in depth-first order, similarly to the feature maps. They are loaded into their dedicated caches only once, before inference of images starts, and then used repeatedly during inference.
Feature map buffering.
Let us define an input tensor of size H × W × I, and a filter tensor of size K × K × I × O. In order to calculate the first output pixel, we can choose between two options for scanning the input pixels, as shown in Figure 6. The necessary buffer size for Figure 6a is I × H × (K − 1) + I × K, and the size for Figure 6b is H × W × (I − 1) + H × (K − 1) + K, which means that the memory requirements per height dimension for the two methods are Θ(IK) and Θ(IW + K), respectively. Since W > K (sometimes by an order of magnitude), scanning depth-first guarantees a smaller buffer. This means that in order to minimize the number of flip-flops used for feature map buffering, all images should be streamed to the FPGA pixel by pixel and not channel by channel.
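To make the difference between the two scan orders concrete, the two buffer-size expressions can be evaluated directly. The sketch below uses a hypothetical layer (32 × 32 input, 128 channels, 3 × 3 filter) chosen only for illustration; the function names are likewise assumptions.

```python
def buffer_depth_first(h, i, k):
    """Elements buffered before the first output when all channels of a pixel
    arrive together (Figure 6a): I*H*(K-1) + I*K."""
    return i * h * (k - 1) + i * k


def buffer_channel_first(h, w, i, k):
    """Elements buffered before the first output when whole feature maps arrive
    one after another (Figure 6b): H*W*(I-1) + H*(K-1) + K."""
    return h * w * (i - 1) + h * (k - 1) + k


# Hypothetical layer dimensions.
H = W = 32
I, K = 128, 3
depth_first = buffer_depth_first(H, I, K)
channel_first = buffer_channel_first(H, W, I, K)
ratio = channel_first / depth_first
```

For this example the channel-by-channel order needs roughly 15 times more storage than the pixel-by-pixel order, and the gap grows with the width of the feature map.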
Pooling
The pooling kernel is built similarly to the convolutional one. Since the pooling has no parameters, output pixels are calculated as soon as enough data is accumulated inside the internal buffers. In addition, since each output pixel depends only on its own feature map, we do not need to wait until input is finished, but can produce output at the same clock cycle at which the input is received. In our implementation, max pooling is used in all cases, except for the last pooling in ResNet-18.
Batch normalization and activation function
As was shown in FINN [61], BatchNorm and one-bit activation can be replaced by a threshold function. We extend this idea to multiple-bit activations, performing BatchNorm and n-bit activation using only two additional parameters with an n-input comparator and a $2^n \to 1$ multiplexer.
Using the notation of [61], we denote the pre-activation output of neuron k as $a_k$ and the BatchNorm parameters as $\Theta_k = (\gamma_k, \mu_k, i_k, B_k)$. BatchNorm is then calculated as $\mathrm{BatchNorm}(a_k, \Theta_k) = \gamma_k \cdot (a_k - \mu_k) \cdot i_k + B_k$. The n-bit uniform activation (quantization) divides the range of inputs into $2^n$ equally sized ranges. Each range is mapped to a single output value of the activation function. Denote the size of each range as d. Given the mean µ and d, we can calculate the endpoints of all ranges. Thus, to acquire an output of the normalization and activation function combination for a pre-normalized value (i.e., which range it belongs to), it is enough to have the value of one of the endpoints and the size of the range. To this end, we first solve $\mathrm{BatchNorm}(\tau_k, \Theta_k) = 0$, acquiring $\tau_k = \mu_k - B_k / (\gamma_k \cdot i_k)$. Next, by solving $\mathrm{BatchNorm}(t_k, \Theta_k) = \alpha \cdot d$ for an integer α, we acquire $t_k = \tau_k + \alpha \cdot d / (\gamma_k \cdot i_k)$. Therefore, to calculate all endpoints, it is enough to have $\tau_k$ and $d / (\gamma_k \cdot i_k)$. Finally, we perform a binary search on the ranges to determine in which range $a_k$ falls.
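The threshold formulation can be compared against the direct computation in a short sketch. Several details below are assumptions made for illustration and are not stated in the text: the quantizer output is taken to be a level index from 0 to 2^n − 1, the 2^n − 1 thresholds are placed symmetrically around BatchNorm = 0, the product γ·i is positive (otherwise the comparisons would have to be reversed), and the parameter values are hypothetical.

```python
import bisect
import math


def batchnorm(a, gamma, mu, i, b):
    """BatchNorm(a, Theta) = gamma * (a - mu) * i + B, as defined in the text."""
    return gamma * (a - mu) * i + b


def direct_activation(a, theta, d, n):
    """Reference path: normalize, then quantize into 2**n levels of width d."""
    level = math.floor(batchnorm(a, *theta) / d) + 2 ** (n - 1)
    return min(max(level, 0), 2 ** n - 1)


def make_thresholds(theta, d, n):
    """Endpoints t = tau + alpha * step, built from only two stored parameters."""
    gamma, mu, i, b = theta
    tau = mu - b / (gamma * i)        # solves BatchNorm(tau) = 0
    step = d / (gamma * i)            # distance between neighbouring endpoints
    half = 2 ** (n - 1)
    return [tau + alpha * step for alpha in range(-(half - 1), half)]


def threshold_activation(a, thresholds):
    """Binary search for the range containing a; no multiplication is needed."""
    return bisect.bisect_right(thresholds, a)


theta = (2.0, 1.0, 0.5, 0.25)         # hypothetical (gamma, mu, i, B)
thresholds = make_thresholds(theta, d=0.5, n=2)
samples = [-1.0, 0.0, 0.3, 0.5, 0.75, 1.0, 1.3, 2.0, 5.0]
levels = [threshold_activation(a, thresholds) for a in samples]
expected = [direct_activation(a, theta, d=0.5, n=2) for a in samples]
```

Both paths yield the same level for every sample, while the threshold path operates on the raw pre-activation value and uses comparisons only, which is what allows the hardware to use a comparator and a multiplexer instead of a multiplier.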
Fully connected layer
As shown by Springenberg et al. [68], the traditional architecture of convolutional layers followed by FC layers can be replaced by an all-convolutional network (i.e., an NN that consists only of convolutional and pooling layers) where FC layers are represented as 1-by-1 convolutions. The specifics of fully connected layers (large numbers of weights and small numbers of neurons) influence resource utilization: more BRAMs, but fewer LUTs and FFs, are used.
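The equivalence between an FC layer and a 1-by-1 convolution can be verified on a toy example. The sketch below is plain Python with hypothetical data and function names; it treats the flattened input of the FC layer as a 1 × 1 feature map whose channels are the input features.

```python
def conv2d(fmap, filters):
    """Valid convolution. fmap[y][x][c]; filters[o][ky][kx][c]; returns out[y][x][o]."""
    h, w = len(fmap), len(fmap[0])
    k = len(filters[0])
    out = []
    for y in range(h - k + 1):
        row = []
        for x in range(w - k + 1):
            pixel = []
            for f in filters:
                acc = 0
                for ky in range(k):
                    for kx in range(k):
                        for c, weight in enumerate(f[ky][kx]):
                            acc += weight * fmap[y + ky][x + kx][c]
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out


def fully_connected(x, weights):
    """weights[o][c]; returns one value per output neuron."""
    return [sum(wc * xc for wc, xc in zip(row, x)) for row in weights]


x = [1, -1, 1, 1]                                   # flattened input features
weights = [[1, 1, -1, 1], [-1, -1, 1, 1], [1, -1, 1, -1]]
fc_out = fully_connected(x, weights)

fmap = [[x]]                                        # 1 x 1 feature map with 4 channels
filters = [[[row]] for row in weights]              # three 1 x 1 x 4 filters
conv_out = conv2d(fmap, filters)[0][0]
```

Since the two results coincide, the same convolution kernel can serve both layer types, and only the ratio of weights to neurons (and therefore the resource mix) differs.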
Skip connections
Skip connections are implemented as a part of the residual network building block, which contains two convolutional layers and additional infrastructure to manage a skip connection, namely, a buffer and an adder. As shown in Figure 4, the block receives two inputs: one via a skip connection and one via a regular connection.
The data passed in skip connections are 16-bit integers, which accumulate non-quantized outputs of convolutions. The whole block works as follows: the regular connection input, which is, as described earlier, 2 bits wide, enters a convolution block (5.2.1). At this stage, BatchNorm and activation are not applied. The convolution output is summed with the input from the skip connection and the result is split into two paths. The first one is a skip connection, where data is sent as is. The second one goes through BatchNorm and activation, and is then streamed to the next (regular) convolution. The output of the next convolution, together with the skip connection, forms the input of the next "residual block".
In order to sum the skip connection data and the corresponding convolution result, skip connection inputs are buffered to compensate for the delay created by the intermediate convolutional layer in the "regular" path. The required buffer is exactly the same size as the buffer in a convolutional layer. This is not accidental. Using the previous notation, and taking padding and the fact that I = O into account, $I \times [H \times (K - 1) + K]$ inputs in the first convolution produce $I \times [H \times \frac{K-1}{2} + K]$ inputs in the second convolution. This, together with padding, is exactly the amount of data needed to create one output pixel.
From the hardware perspective, the addition of a skip connection requires a minimal amount of resources: one adder and the buffer described earlier. The skip buffer is needed to compensate for the delay and never creates delays by itself. This means that generally, the overhead of the addition of a skip connection is negligible.
Multi-DFE implementation
Since our architecture comprises independent kernels and the Maxeler platform allows data to flow directly from DFE to DFE, the workload can be divided among multiple DFEs with very small performance degradation if the design cannot fit on one DFE. Since each pixel is represented by 2 bits, the required bandwidth of the DFE-to-DFE link, for a 105 MHz fabric clock, is 210 Mbps. According to the Maxeler specifications, this link can be set to rates of up to several Gbps, which is more than enough for our purposes.
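The link bandwidth figure follows from a one-line calculation, sketched here under the assumption that one 2-bit pixel crosses the link per fabric clock cycle (the function name and the `pixels_per_clock` parameter are hypothetical).

```python
def link_bandwidth_mbps(bits_per_pixel, clock_mhz, pixels_per_clock=1):
    """Bandwidth needed to forward pixels between DFEs at the fabric clock rate."""
    return bits_per_pixel * pixels_per_clock * clock_mhz


# 2-bit activations at a 105 MHz fabric clock.
required = link_bandwidth_mbps(bits_per_pixel=2, clock_mhz=105)
```

The same helper shows how the requirement would scale if wider activations or a faster clock were used.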
EVALUATION
We conducted our experiments on different platforms, including last-generation Nvidia GPUs and Intel FPGAs. As an FPGA-based system, we used Maxeler's MPC-X node, which provides 8 MAX4 (Maia) DFEs interconnected by a dedicated MaxRing connection. Each DFE contained an Intel Stratix V 5SGSD8 FPGA. The GPUs used as a baseline were Nvidia's Tesla P100-12GB and GeForce GTX 1080. Table 2 shows the hardware specifications of the GPUs and FPGAs used for evaluation.
We measured performance, power consumption and resource utilization of the FPGA implementation for three common datasets: CIFAR-10 [69], ImageNet [23], and STL-10 [70]. For our evaluation, we implemented ResNet-18, AlexNet and a VGG-like CNN, based on one proposed by Umuroglu et al. [61], on both DFEs and GPUs. The VGG-like CNN consisted of three blocks of two convolutions and one pooling layer, and three FC layers at the end. First, the CNNs were trained for the above-mentioned datasets, using GPUs to obtain the network parameters, i.e., weights and normalization values. These parameters were then loaded onto the DFEs prior to the inference process.
Methodology
Runtime measurements. We compared the execution time of our hardware design to the execution time of two different GPUs, using the code provided by Itay Hubara on the Theano framework. Baseline timings were obtained by running 50,000 pictures through the network and taking the average. For the DFE, we similarly ran our implementation 50,000 times and took the average. To achieve the fastest possible execution time for the GPU, we used the latest version of Theano, which has been configured to use the NVIDIA cuDNN library.
FPGA-based platform details. The kernels were written in MaxJ, a Java-based hardware description language, translated into VHDL by Maxeler's MaxCompiler, and thereafter synthesized by Quartus to run on an FPGA. Eventually, a bitstream is created and downloaded to the DFE at runtime. We obtained the resource utilization, timing analysis and power estimation of the board housing the DFEs. Board power measurements were obtained using Maxeler's library called from host code.
Results
This section characterizes our proposed streaming solution in terms of power, performance and scalability. We compare these parameters using different input sizes, up to 224 × 224. We also compare our results with the results of the same network running on a GPU using the Theano framework and the results claimed by Umuroglu et al. [61] for input sizes as described in their paper.
Performance against GPU-based implementation
We compared our implementation with QNN using Hubara's code [18] running on two different GPU-based systems. For comparison, we chose three datasets with different input sizes ranging from 32 × 32 to 224 × 224. To show performance variation for different input sizes, we also used STL-10 resized to 144 × 144. For the full-sized ImageNet dataset of size 224 × 224, we used the ResNet-
As shown in Figure 7, for an input size of 32 × 32, our network is 12% faster than the same network running on a GPU. This presumably results from the overhead of kernel invocation processes between the CPU and GPU. Even though the GPUs demonstrate faster inference for larger inputs, the power consumption of the DFE is significantly lower (at least 15×) for VGG-like networks, as can be seen from Figure 9. For AlexNet (input size 224 × 224), the power consumption of the DFE increases, since three DFEs are needed to fit the network. The energy consumption of a single-picture inference, as shown in Figure 10, is up to 20× better for FPGAs, and even when more than one FPGA is used, the energy consumption was at least 50% less compared to GPUs.
Nevertheless, it should be noted that GPUs, unlike our architecture, are capable of simultaneously processing multiple inputs (minibatches). Modern GPUs can process at least 128-256 inputs with very small inference time degradation. While this is not helpful in real-time applications, it can speed up the process if a large amount of already-available data must be processed.
ResNet-18 and AlexNet performance comparison
To analyze the effect of adding skip connections and increasing network depth, we compared the performance of AlexNet and ResNet on DFE.
First of all, it should be noted that GPU results
As for resource utilization, as shown in Table 3, ResNet-18 requires ∼75% more LUTs, which is the reason we were forced to divide it among three DFEs. Owing to the lack of large FC layers and a lower total number of parameters, ResNet requires fewer BRAMs than AlexNet.
Performance comparison with other FPGA-based implementations
We compared our implementation with FINN by Umuroglu et al. [61] using the same network architecture and dataset as in their paper. Their implementation, however, uses binary activations. Although binary activations demand fewer resources and allow faster inference, multi-bit activations have superior classification accuracy [17]. In addition, Umuroglu et al. store inputs in on-chip memory, while we stream them directly from the CPU. The comparison of resource utilization of both architectures is shown in Table 4. Note that the resource utilization cannot be compared directly, since the two implementations use FPGAs from different vendors, but we can refer to the general trends as presented.
As can be seen in Table 4a, we achieve 4.1% better accuracy compared to FINN, although execution time and power consumption are better in their solution. We assume that a major part of the difference in runtime is due to the quality of the compilers and the special optimizations that were implemented there. Nevertheless, the main purpose of our design was to show the scalability of our solution, so less effort was directed to optimizations for small inputs. Figure 8 shows the resource utilization of the VGG-like architecture with different input sizes. It indicates that our architecture is highly scalable and able to effectively utilize resources on both single and multiple FPGAs. For example, increasing the size of the input from 32 × 32 to 96 × 96 increases the resource utilization by approximately 5% for all types of resources.
Scalability of proposed architecture
Our theoretical estimation of the number of clocks per picture for ResNet-18 (the largest network implemented) is approximately $1.85 \times 10^6$. This estimation matches the measured time on a real system with a clock frequency of 105 MHz. Among other things, this allows us to approximate runtime on next-generation FPGAs. For example, Intel's upcoming Stratix 10 FPGA promises 5× higher frequency, allowing us to achieve an inference time of 3-4 ms per image with the same ResNet architecture, and at the same time to fit even bigger networks onto a single FPGA.
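The projection above is a direct consequence of the clock count and the clock frequency. A minimal sketch of the arithmetic follows, assuming that the number of clocks per image stays unchanged on the faster device (the function name is hypothetical).

```python
def inference_time_ms(clocks_per_image, clock_hz):
    """Time to process one image when the pipeline needs `clocks_per_image` cycles."""
    return clocks_per_image / clock_hz * 1e3


CLOCKS_PER_IMAGE = 1.85e6          # estimate for ResNet-18 from the text
BASE_CLOCK_HZ = 105e6              # fabric clock of the evaluated system

current_ms = inference_time_ms(CLOCKS_PER_IMAGE, BASE_CLOCK_HZ)
projected_ms = inference_time_ms(CLOCKS_PER_IMAGE, 5 * BASE_CLOCK_HZ)
```

At 105 MHz this gives roughly 17.6 ms per image, and a five-fold clock increase brings it to about 3.5 ms, in line with the 3-4 ms range quoted above.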
CONCLUSIONS AND FUTURE WORK

Conclusions
In this work, we have presented a streaming architecture for QNNs that scales well to large input sizes and large NNs. For inputs up to 144 × 144, resource utilization is small enough to fit on a single Stratix V 5SGSD8 FPGA. In addition, since the DFE platform allows us to easily split the network across multiple FPGAs, we can implement even larger networks, such as ResNet and AlexNet.
Although GPUs outperform our implementation with large inputs, the proposed architecture is still fast enough to meet real-time requirements, achieving more than 60 fps for all types of inputs. Our results showing at least 15× lower power and 4× lower energy consumption (for a single FPGA) indicate that FPGAs can be a better choice for embedded systems. In addition, the run-time is only a couple of times higher compared to the top GPUs, which allows us to speculate that next-generation FPGAs could outperform GPUs in both performance and power/energy consumption.
The usage of HLS tools and DFEs as a means for functional decomposition allowed us to achieve better scalability, simplify the development process and construct a complicated FPGA system with minimal resources. Such tools may enable DL researchers with virtually no hardware development experience to construct NNs in a way similar to current scripting language frameworks, making use of the key advantages of FPGAs such as dataflow parallelism and low power consumption.
Future work
We have shown that creating a scalable architecture and running large-scale NNs on FPGAs is possible. This development opens many directions for future research.
Recently, recurrent neural networks (RNNs) [71] [72] have been used for many different applications [73] [74] [75], especially in natural language and video processing. These networks can be implemented on an FPGA [56] and can potentially be accelerated significantly by a streaming architecture.
Another area to which we want to apply our architecture is transfer learning. We believe that some common techniques in transfer learning, such as the use of an adaptation layer [76], can achieve higher accuracy on FPGAs if they take advantage of our architecture.
ACKNOWLEDGMENTS
This work was supported by the Metro450 Israeli national consortium. The authors thank Maxeler Technologies Ltd for providing hardware for experiments.
