Deep neural networks (DNNs) have attracted significant attention for their excellent accuracy, especially in areas such as computer vision and artificial intelligence. To enhance their performance, technologies for their hardware acceleration are being studied. FPGA technology is a promising choice for hardware acceleration, given its low power consumption and high flexibility, which make it particularly suitable for embedded systems. However, complex DNN models may need more computing and memory resources than those available in many current FPGAs. This paper presents FP-BNN, a binarized neural network (BNN) design for FPGAs, which drastically cuts down hardware consumption while maintaining acceptable accuracy. We introduce a Resource-Aware Model Analysis (RAMA) method, remove the multiplier bottleneck with bit-level XNOR and shifting operations, and remove the parameter-access bottleneck with data quantization and optimized on-chip storage. We evaluate the FP-BNN accelerator designs for an MNIST multi-layer perceptron (MLP), a Cifar-10 ConvNet, and AlexNet on a Stratix-V FPGA system. An inference performance of Tera operations per second (TOP/s) with acceptable accuracy loss is obtained, showing improvements in speed and energy efficiency over other computing platforms.
Introduction
As the computational ability of processors rapidly grows, training and testing deep neural networks (NNs) have become much more feasible, which substantially boosts the design of various models targeting applications such as computer vision [1] [2] [3], speech recognition [4,5], and even artificial intelligence (AI) for games against human beings [6,7]. Higher accuracy typically demands more complex models. Take the ImageNet Large-Scale Vision Recognition Challenge (ILSVRC) as an example: Krizhevsky et al. [8] achieved 84.7% top-5 accuracy in the classification task in 2012 with a model including 5 convolution (CONV) layers and 3 fully-connected (FC) layers; He et al. [9] got a 95.1% result, surpassing human-level classification performance (94.9% [3]), with a 22-layer model, and they won the 2015 competition by achieving an accuracy of 96.4% with a model depth of 152 [10]. Such a model can take over 11.3 billion floating-point operations (11.3 GFLOPs) for the inference procedure, and even more for training.
These convolutional neural networks (CNNs) mostly consist of intensive multiplication and accumulation (MAC) operations. General-purpose processors execute these operations mostly sequentially, which leads to low efficiency. Graphics processing units (GPUs) can offer Giga to Tera FLOPs per second (FLOP/s) of computing speed thanks to their single-instruction-multiple-data (SIMD) architecture and high clock frequency. Therefore, researchers tend to use one or several GPUs to meet the model training demand [11] for quick development iterations. However, GPUs also suffer from a high energy cost: for an NVIDIA Tesla K40 GPU, the thermal design power (TDP) is 235 W [12]. Such power consumption can be tolerable for high-performance servers, but for embedded systems such as mobile devices, robots, etc., which are mostly powered by batteries, low power consumption becomes essential.
Field Programmable Gate Arrays (FPGAs) usually consume an order of magnitude less power than GPUs, while offering considerable speed-up over CPUs. Moreover, FPGAs offer more flexibility, since they are reconfigurable and support customizable data types, which can be useful in reducing resource utilization. There is much research on accelerating state-of-the-art NN models with FPGAs [13] [14] [15]. However, since most current FPGAs have limited resources (several dozen megabits of on-chip memory, several hundred to a few thousand digital signal processing (DSP) blocks), designers have to adopt techniques such as tiling to support many NN models, since most models have a large number of weights and MAC operations (Table 1). Furthermore, memory bandwidth can be a bottleneck during the data loading stage for wide data-dependency patterns such as FC layers [15]. To improve resource usage, there are several ways of compressing models to smaller sizes, such as exploiting sparsity in the network connections and narrowing the data bit-width [15] [16] [17]. Binarization is a promising method to compress NN models, as it directly shrinks the bit-width of inputs and weights from 32 bits (single-precision floating-point) to a single bit. Recently, Courbariaux et al. [18] introduced a method to train binarized neural networks (BNNs) over the MNIST, Cifar-10 and SVHN [19] datasets with near state-of-the-art accuracy. Shortly after that, Rastegari et al. [20] announced that they had successfully trained ImageNet models with the BNN-based XNOR-Net method, with an accuracy 12.4% below the full-precision AlexNet, providing a 58 times speedup and a 32 times model-size compression. The emergence of binarized models makes it feasible to implement a system on FPGAs with much higher performance than floating-point versions. This motivates us to design a method that takes a given BNN model and generates the datapath logic and data management pattern on FPGA according to an optimization metric, forming an accelerator system targeting Tera-operations-per-second (TOP/s) throughput.
In this paper, we introduce FP-BNN, a BNN acceleration system design on FPGA, with related optimizations. The contributions of this paper are as follows:
- An analytical Resource-Aware Model Analysis (RAMA) to assess the resource cost and help on-chip system architecture design.
- A datapath design with multipliers replaced by XNOR, popcount and shifting operations for BNNs, together with a compressor tree generation method for more efficient popcount.
- An optimized data management pattern with parameter quantization and an on-chip storage strategy.
- A demonstration with popular small (MNIST MLP and Cifar-10 ConvNet) and large (AlexNet) models implemented on FPGA in binarized style, achieving TOP/s performance with high power efficiency.
The rest of the paper is organized as follows. Section 2 reviews the basic concepts of CNN and BNN and discusses related work. Section 3 describes the RAMA method. Section 4 presents the system design and the details of each processing element (PE). Section 5 explains how we tile and schedule the large computing task onto our system. Section 6 covers the data quantization used to compress the model, and introduces the on-chip design of the memory system. The evaluation is discussed in Section 7, and conclusions are given in Section 8.
Background
In this section, we first provide an overview of the basic concepts of CNN, then explain how a binarized NN works, and on this basis briefly review and discuss related work.

Basics of CNN

Fig. 1 shows a typical CNN model structure [22]. A CNN model usually consists of CONV layers, FC layers and Pooling (POOL) layers, forming a trainable network.

CONV layer : The CONV layer realizes a filter-like process, which uses a K × K weight kernel W to convolve the input feature map (fmap) I in a sliding-window manner with a stride of S. This can be expressed as:

O_j = Σ_{i=1}^{N_in} I_i ∗ W_{j,i},   j = 1, …, N_out    (1)

where ∗ is defined as convolution, which equals K² element-wise multiplications with accumulation (K stands for the kernel size):

(I ∗ W)(x, y) = Σ_{i=1}^{K} Σ_{j=1}^{K} I((x−1)S + i, (y−1)S + j) · W(i, j)    (2)
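To make the indexing concrete, here is a minimal Python sketch of Eqs. (1)–(2); the function name `conv_layer` and the dense triple-loop formulation are ours for illustration, not the accelerator's datapath.

```python
import numpy as np

def conv_layer(I, W, S=1):
    """Direct (naive) convolution illustrating Eqs. (1)-(2).
    I: input fmaps, shape (N_in, R_in, R_in)
    W: kernels, shape (N_out, N_in, K, K)
    S: stride of the sliding window."""
    N_in, R_in, _ = I.shape
    N_out, _, K, _ = W.shape
    R_out = (R_in - K) // S + 1
    O = np.zeros((N_out, R_out, R_out))
    for no in range(N_out):           # one output fmap per kernel set
        for y in range(R_out):
            for x in range(R_out):
                # K*K*N_in element-wise multiply-accumulates per output pixel
                window = I[:, y*S:y*S+K, x*S:x*S+K]
                O[no, y, x] = np.sum(window * W[no])
    return O
```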
FC layer : The FC layer operates a linear transformation on the input 1-D vector with a weight matrix. The pattern of the input-output network is fully connected, which is how the layer got its name. This process can be shown as:

O = W · I + b    (3)

where I is the input vector, W the weight matrix, and b the bias vector.
POOL layer : The POOL layer realizes a "down-sampling" operation, which compresses the input images into smaller scales. Taking the most common max-POOL as an example, it extracts the maximum value from each K × K kernel window as the output:

O(x, y) = max_{1 ≤ i, j ≤ K} I((x−1)K + i, (y−1)K + j)    (4)
Activation Layer : Just like biological neurons, which "fire" once a key value exceeds a threshold and stay "silent" otherwise, various activation functions such as ReLU, tanh and sigmoid are implemented in neural network designs to imitate this neurological behaviour; they also introduce non-linearity into the networks.
Batch Normalization (BN) layer : Since the distribution of each layer's input can fluctuate during training, Batch Normalization [23] is introduced to speed up training. For a d-dimensional input vector x = (x^(1), x^(2), …, x^(d)), we can normalize each dimension with:

x̂^(k) = (x^(k) − E[x^(k)]) / √(Var[x^(k)] + ε)    (5)
After that, for each activation x^(k), we scale and shift the normalized value to allow an identity transform:

y^(k) = γ^(k) x̂^(k) + β^(k)    (6)
where γ ( k ) and β ( k ) are to be learned during the training process.
The whole process is described in Algorithm 1.

Table 2: Comparison between the Tanh, Sign and HTanh activation functions (function plots and derivative plots omitted).
Training : A NN model is trained by iteratively adjusting its weights so that its outputs approach the ground-truth results. The most commonly used training method is Back-Propagation (BP) training, which consists of two stages:

(1) Forward propagation (inference), which passes the input data through the network to obtain an output result;
(2) Back propagation, which calculates the error between the output and the ground-truth labels with a defined loss function C, and then propagates the gradient of each layer's output function backwards to update the weights, in order to minimize the loss function in the next training iteration.
A detailed derivation can be found in [24]. Since the overall process is compute-intensive, high-performance servers with accelerators such as GPUs are often used for training. The pre-trained models can then be used in many real-time scenarios by running the inference process only, with minor changes, which can be implemented on many embedded hardware platforms.
How BNN works
The essential idea of BNN is to constrain both weights and activations to +1 and −1 [18]. The binarization can be done in either a stochastic or a deterministic way; the latter is often realized by the Sign function:

x^b = Sign(x) = +1 if x ≥ 0, −1 otherwise    (7)
The problem is that during the training process, the derivative of the Sign function is almost zero everywhere (as shown in Table 2), which is incompatible with BP training. Hinton [25] introduced a "straight-through estimator" to cope with this problem. Courbariaux et al. [18] used a similar estimator in a deterministic way, which can be seen as a hard tanh (HTanh) function:

HTanh(x) = Clip(x, −1, 1) = max(−1, min(1, x))    (8)
Let u be the real-valued pre-activation and a = Sign(u); then we have the estimator of the gradient:

∂C/∂u = ∂C/∂a · 1_{|u| ≤ 1}    (9)
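For illustration, here is a minimal Python sketch of the deterministic binarization of Eq. (7) and the straight-through gradient estimate of Eq. (9); the function names are ours.

```python
import numpy as np

def binarize(u):
    # Deterministic binarization, Eq. (7): Sign(u) in {-1, +1}
    return np.where(u >= 0, 1.0, -1.0)

def ste_grad(grad_a, u):
    # Straight-through estimator, Eq. (9): pass the gradient of
    # a = Sign(u) through where |u| <= 1 (the derivative of HTanh),
    # and zero it elsewhere.
    return grad_a * (np.abs(u) <= 1.0)
```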
Since BN layers have the effect of avoiding internal covariate shift, which can accelerate the training process and reduce the impact of binarization, [18] introduces BN layers in their BNN models. To deal with the large number of multiplications in BN, they replace them with shift operations to get a Shift-Based BN (SBN). This largely reduces the computing resource cost with only a small loss of precision, which can actually be healed through the training process. The SBN replacement can be described as Eq. (10), where sal(x, y) means an arithmetic left shift of x by y bits:

x × y ≈ Sign(y) · sal(x, round(log2|y|))    (10)
Related work
To accelerate an NN model in embedded hardware, resources must be husbanded carefully. There have been many efforts to deploy CNN models in hardware. Farabet et al. [26] designed a simple face-detection system with 3 CONV layers and 5 FC layers on FPGA, achieving 10 frames (512 × 384) per second. Zhang et al. [14] proposed a nested-loop model to describe CNNs, and accelerated only the CONV layers under the guidance of a roofline model. Qiu et al. [15] realized an even deeper VGG model on FPGA. Most of these previous designs store weights and fmaps off-chip, since their size is too large for on-chip storage. As a result, the dataflow bandwidth is limited and frequent off-chip memory accesses occur. Some designs therefore support a dedicated memory cache for on-chip data reuse [27] [28] [29], but more memory placement means fewer arithmetic resources, since chip area is limited. Clearly a small model that supports high accuracy and high performance is ideal. One method is to exploit the sparsity inside the model by pruning off connections [16,30,31]. Another method is to reduce the bit-width of operations. Much previous work adopted a quantized fixed-point strategy for the on-chip data [15,27,28,32]; [33] presented a detailed analysis pointing out that for small models such as MNIST and Cifar-10, the weights can be quantized to 4 bits, while for large models such as AlexNet, 8 bits would be necessary.
Recently, some efforts successfully reduced the bit-width of weights to 2 bits, such as ternarized weight NNs (TWN) [34,35], or even to 1-bit binarized weight NNs (BWN) [20]. Moreover, activations can be reduced to 2 bits [36] [37] [38] or even 1 bit (BNN) with little loss for small datasets [18,20]. These results stimulate hardware development, for example the binarized-weight accelerator YodaNN [39]. We should notice that, since the bit-width of data has been reduced by 32 times in BNN, an execution speed of TOP/s is expected, as many recent non-BNN designs have already reached several hundred GOP/s. The key optimizations include: (1) single-bit MAC operations, which can be replaced by efficient XNOR and popcount operations, free from conventional multiply and add operations; (2) a small size for both parameters and intermediate results, which enables on-chip caching; (3) broader bandwidth for on-chip BRAMs, which reduces the data-dependency bottleneck for wide data-access patterns such as those in FC layers. Our FP-BNN design is developed based on the above motivations. Furthermore, FP-BNN supports large models such as the XNOR-Net version of AlexNet.
Resource-Aware Model Analysis (RAMA)
To design an NN accelerator on chip, we should consider how to tile the overall task onto limited resources, which can be classified into two classes: arithmetic units and memory units. To help choose the size of task tiles, we need to estimate the resource cost beforehand. The RAMA method is introduced to address this need.
In modern FPGA platforms, four kinds of resources are provided: look-up tables (LUTs), flip-flops (FFs), block RAMs (BRAMs) and digital signal processing units (DSPs). LUTs and DSPs are the key resources for forming arithmetic and control logic, while BRAMs are usually used as on-chip storage for fast data access. From the arithmetic perspective, MACs are the key operations and cost the most resources. DSPs have hard-wired multipliers and can be configured to deliver results quickly at high clock frequencies; alternatively, one can use LUTs to implement a customized multiplier. We compare the resource costs of these two approaches on a Stratix V FPGA synthesized with Altera Quartus v13.1; the result is shown in Table 3.
With the resource cost of a single MAC operation in hand, we further need to count the number of MACs in each layer, represented as N_layer(MAC). For CONV layers we have (FC layers can be seen as K = R_out = C_out = 1):

N_CONV(MAC) = N_in × N_out × R_out × C_out × K²    (11)

For operations in BN layers, the number of operations (NOP) has a linear relationship with the number of output channels N_out. Notice that the shift-based transformation turns multiplications into cheap sum and shift operations. To get the NOP after tiling, we just replace the original dimensions with the tiled ones; we can then estimate the resource cost of a certain type, C_res_type(layer), by summing up the products of the tiled NOP and the resource cost of one operation, which in turn helps us determine the tiling factor.
Next, from the memory perspective, we should concentrate on the size of parameters and the activation outputs of each layer.
Given that N_layer(data) denotes the amount of a certain kind of data in one layer, for the weights we have

N_CONV(W) = N_in × N_out × K²    (12)
For other parameters, such as biases and normalization parameters, the amount is given by the number of output channels, that is

N_CONV(Other) = N_out    (13)

For activations, we have

N_CONV(A) = N_out × R_out × C_out    (14)
The overall memory cost of each type of data is the product of its bit-width and its amount. For weights, from Fig. 2 we can see that under ideal binarized conditions, small models can be stored entirely in on-chip BRAMs, while the weights of large models can exceed the capacity of the available BRAMs. We therefore use a tiled weight storage strategy that loads from off-chip memory only the portion of weights required for the current tile. For activations (feature maps, fmaps), since data adjacency is needed along both the vertical and horizontal axes, BRAM is not a suitable choice: it can only be configured into fixed shapes, the maximum width of one BRAM is often no more than 40 bits, and accordingly the minimum depth is 512.
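As a worked example of the RAMA counting rules, the following Python sketch implements Eqs. (11)–(14); the function name and argument layout are ours.

```python
def rama_counts(N_in, N_out, R_out, C_out, K=1):
    """Per-layer operation/data counts used by RAMA.
    For FC layers set K = R_out = C_out = 1 (see Eq. (11))."""
    n_mac    = N_in * N_out * R_out * C_out * K * K   # MACs, Eq. (11)
    n_weight = N_in * N_out * K * K                   # weights, Eq. (12)
    n_other  = N_out                                  # biases/BN params, Eq. (13)
    n_act    = N_out * R_out * C_out                  # output activations, Eq. (14)
    return n_mac, n_weight, n_other, n_act

# Example: a 5x5 CONV layer, 3 input / 64 output channels, 28x28 output fmaps.
# The memory cost of each data type is then bit-width * amount.
print(rama_counts(N_in=3, N_out=64, R_out=28, C_out=28, K=5))
```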
Hardware logic design
In this section, we present the hardware logic design of our FPGA accelerator system.
Table 4: RAMA-based topology analysis of MNIST MLP, Cifar-10 ConvNet and AlexNet. Columns: macro layer, structure, N_in × R_in², K, S, K_POOL, N_out × R_out², N(W), N(Others), N(MAC), N(A).
Overall architecture
A normal structure of a BNN model is given in Fig. 3 . We can divide the model into several macro-layers with similar structures, each including a convolution or fully-connected ( C/F ) layer, a batch normalization (BN) layer and an activation layer which consists of a Hard Tanh (HTanh) layer and a Binarized Neuron (BNeu) layer. For some macro layers, pooling is introduced for down-sampling. Here we choose MNIST MLP and Cifar-10 ConvNet as small dataset examples, and AlexNet for large dataset ImageNet. The topology of each model is described in Table 4 , and the key features of each layer are extracted based on RAMA, in which R in and R out are respectively the input and output image size, K is the convolution kernel (window) size, S is the stride of the moving window, and K POOL is the pooling window size.
The overall system is shown in Fig. 4. We have altogether N_PE channels that process data from the input cache in parallel. The CONV/FC (C/F) layer includes processing elements (PEs) that are shared by CONV and FC layers, since both mainly consist of MAC computations. The Shift-Based Normalization (SBN) layer adopts shift operations to replace multiplications, as mentioned in Section 2.3. The activation layer merges the HTanh and BNeu layers to produce an output vector containing either 0 or 1. Parameters for each layer are fetched from on-chip BRAMs or registers to meet bandwidth requirements, and control signals select them for each iteration. The output of each iteration is transferred to the intermediate result cache. For each subsequent layer, the interconnection is reconfigured by the controller according to the layer type (CONV or FC).
Next, we take a look at the details of different types of PE design.
C/F PE

XNOR-based Binary MAC
Normally, it is necessary to utilize DSPs or customized LUT-based logic to complete a MAC operation for either floating-point or fixed-point input values. However, if the input values become binary, the situation changes considerably.
Consider two input vectors

A_{−1,1} = (a_1, …, a_n),   B_{−1,1} = (b_1, …, b_n)    (15)

which consist of binarized values, either +1 or −1. The product of the corresponding elements in the two vectors is then also either +1 or −1. The sign of the product depends on the signs of the two input elements: if they are identical, the product is positive; otherwise it is negative. Then we need to accumulate these binary values to get the final result. This process is depicted in Fig. 5(a).
Hardware implementations usually take 2 bits to represent +1 and −1. If we use only one bit, we should take 0 and 1 as the basic values. This can be achieved through affine transformation .
Since we have

A_{0,1} = (A_{−1,1} + A_1) / 2,   B_{0,1} = (B_{−1,1} + B_1) / 2    (16)

in which A_1 represents the all-1 vector of the same length as A_{−1,1}. To keep the truth table of the result as shown in Table 5, we can infer that the operation should be transformed from multiplication to XNOR.

Table 5: Truth table of the original multiplication and the affine-transformed XNOR operation.

In addition, if we assume r to be the dot product of A_{−1,1} and B_{−1,1}, then

r = 2 · popcount(XNOR(A_{0,1}, B_{0,1})) − n    (17)

in which n is the vector length and popcount(·) counts the number of 1 bits in a vector.
If one of the inputs is already 0/1-based, for example in the first layer, then we get the result with:

r = popcount(XNOR(A_{0,1}, B_{0,1})) + popcount(B_{0,1}) − n    (18)

This means we need to add the popcount of vector B_{0,1} instead of left-shifting by 1 bit, as shown in Fig. 5(b). A control signal selects the operation applied to the output of the popcount compressor tree.
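The following sketch models Eqs. (17) and (18) at the bit level, using Python integers as packed bit vectors; it is a behavioural model of the datapath, not HDL, and the helper names are ours.

```python
def popcount(x):
    return bin(x).count("1")

def bmac(a_bits, b_bits, n):
    # Both inputs encode {-1,+1} vectors as n-bit {0,1} words.
    # Eq. (17): r = 2 * popcount(XNOR(a, b)) - n
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    return 2 * popcount(xnor) - n

def bmac_01(a_bits, b_bits, n):
    # a is a genuine {0,1}-valued vector (e.g. one bit plane of a
    # fixed-point input); b encodes a {-1,+1} weight vector.
    # Eq. (18): add popcount(b) instead of left-shifting by 1 bit.
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)
    return popcount(xnor) + popcount(b_bits) - n
```

For example, bmac(0b1011, 0b1001, 4) returns 2: the encoded ±1 vectors agree in three positions and differ in one, so the dot product is 3 − 1 = 2.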
Popcount Compressor (PC) tree
The popcount value, also known as the Hamming weight, can easily be calculated in parallel hardware. However, for long vectors, this process can be demanding both in time and in resource usage. The most common way is to use a binary full-adder tree to sum up the bits of the vector, which results in a delay of log2(vec_len) adder stages and n − 1 adders of different bit-widths. Here we present a compressor tree method inspired by [42].
The popcount process can be seen as compressing N input bits into log2(N) + 1 result bits with weights. Since most modern FPGA architectures have 6-input LUTs, a 6:3 compressor (the case N = 6) is an efficient basic component: it calculates the popcount of a 6-bit input vector by table look-up, producing a 3-bit popcount output with only three 6-input LUTs in parallel.
Given that the tuple (p_k, p_{k+1}, p_{k+2}, …; q_k, q_{k+1}, q_{k+2}, …) denotes a generalized n:m compressor, where the subscript j = k, k+1, k+2, … stands for the bit weight 2^j, and p_j and q_j stand for the number of input and output bits of that weight, respectively. In this way, a 6:3 compressor can be represented as (6; 1, 1, 1).
As shown in Fig. 6, the input vector is divided into 6-bit portions, each connected to the input ports of a 6:3 compressor. Empty input bits are filled with dummy 0's (shown as hollow dots in Fig. 6). In the following stages, the bits with the same weight (we call them a column vector) repeat the same process, forming a compressor tree. The output bits pile up according to their weights. Our target is to reduce the height of this pile to 3, so that it can be accepted as the three input vectors of a ternary adder to produce the final sum. Thus, column vectors with a height of less than 4 stop being compressed in the next stage, and the compression process terminates when all column vectors' heights are less than 4. The overall process of generating a compressor tree is described in Algorithm 2.
(Algorithm 2: popcount compressor tree generation.)

With the help of Algorithm 2, we obtain the compressor tree topology for the hardware implementation of popcount functions for different sizes of binary vectors. Table 6 gives the comparison between the accumulation adder tree and the compressor tree. As we can see, for long vectors the compressor tree saves around one third of the LUT resources.

Table 6: Comparison between accumulation adder tree and popcount compressor tree, for different input bit-widths BW_in (bits).
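The following Python sketch mirrors the generation loop of Algorithm 2 as described above: it tracks the height of each weight column stage by stage, assigning 6:3 compressors to columns higher than 3 until the remaining heap can be fed to a ternary adder. It only reports the topology (column heights per stage); resource counting and wiring are omitted.

```python
from math import ceil, log2

def compressor_tree_stages(n_bits):
    """Sketch of the Algorithm-2 style topology generation: every column
    with height > 3 is fed into ceil(h/6) 6:3 compressors, whose three
    output bits land in columns k, k+1, k+2 of the next stage; shorter
    columns pass through unchanged. Stops when all heights are <= 3."""
    width = ceil(log2(n_bits)) + 1
    heights = [n_bits] + [0] * (width - 1)   # all input bits have weight 2^0
    stages = []
    while max(heights) > 3:
        nxt = [0] * (len(heights) + 2)       # room for carry columns
        for k, h in enumerate(heights):
            if h > 3:
                groups = ceil(h / 6)          # 6:3 compressors on column k
                for j in range(3):            # outputs of weight 2^(k+j)
                    nxt[k + j] += groups
            else:
                nxt[k] += h                   # short columns pass through
        heights = nxt
        stages.append(list(heights))
    return stages

print(compressor_tree_stages(128))   # column heights after each stage
```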
PE reuse
With the XNOR array connected to a popcount compressor tree, we can get the result for 0/1 input arrays. For all intermediate layers, the inputs (activations from the preceding layer) and weights are binarized (either 0 or 1). However, this is not the case for the first layer, where we usually take a fixed-point input image from the input cache. Also, for some large models like AlexNet, some layers' weights are not binarized. To deal with this, consider a vector x of n fixed-point inputs with m-bit precision:

x = Σ_{i=0}^{m−1} 2^i · x_i,   x_i ∈ {0, 1}^n    (19)

and a vector w of n p-bit weights:

w = Σ_{j=0}^{p−1} 2^j · w_j,   w_j ∈ {0, 1}^n    (20)

then the output s can be calculated by

s = x · w = Σ_{i=0}^{m−1} Σ_{j=0}^{p−1} 2^{i+j} (x_i · w_j)    (21)

where each x_i · w_j is a binary dot product. Implementing Eq. (21) requires reuse of the PE. Hence, we introduce an accumulator into the PE together with a selectable left-shifter. While the preceding lower-bit vector is being processed, the next input vector can be loaded behind it and added to the shifted result of the preceding vector. The start and the end of the accumulation are set by the controller signal. Detailed scheduling will be introduced in Section 5.
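To check Eq. (21) numerically, here is a minimal sketch of the shift-and-accumulate PE-reuse scheme for the common case of binarized weights (p = 1, w ∈ {−1, +1}); the dot product per bit plane stands in for one XNOR + popcount PE pass, and the names are ours.

```python
import numpy as np

def bitserial_dot(x, w, m):
    """Eq. (21): an m-bit fixed-point input vector x (unsigned, for
    simplicity) times a {-1,+1} weight vector w, computed as m binary
    dot products accumulated with left shifts."""
    acc = 0
    for j in range(m):                       # one PE pass per bit plane
        x_j = (x >> j) & 1                   # j-th bit plane, values in {0,1}
        acc += (int(np.dot(x_j, w)) << j)    # shift result by bit weight
    return acc

x = np.array([5, 3, 0, 7], dtype=np.int64)   # 3-bit inputs
w = np.array([1, -1, 1, 1], dtype=np.int64)  # binarized weights
assert bitserial_dot(x, w, m=3) == int(np.dot(x, w))   # 9 == 9
```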
The overall structure of the C/F PE is given in Fig. 7. For AlexNet, the binarization method is different from the sign function [20]: it introduces a binarized filter w_bin for w with a scaling factor α, in order to approximate the MAC operation by x · w ≈ α(x · w_bin), where α = (wᵀw_bin)/n = (1/n)‖w‖_ℓ1. So for AlexNet the accumulator is followed by a multiplier to apply the scaling factor α.
BN PE
As described in Algorithm 1 [23], the batch normalization process can be presented as:

y = ((x − μ) / σ) · γ + β    (22)

where μ stands for the running mean and σ for the standard deviation, and γ and β are the learnt values that implement the affine scale and shift for an identity transform. However, as mentioned in Section 2.3, floating-point multiplications are required for every normalization, which would lead to a considerable resource cost. For this reason, [18] uses shifting to approximate the multiply operation. For Eq. (22), the shift-based approximation would be:

y ≈ sal(x − μ, φ) + β    (23)

where φ = round(log2(γ/σ)) is the left-shift amount that folds both σ and γ into a single power of two.
As we have the pre-trained models in hand, we can calculate the required parameters for BN, such as γ/σ, in advance and store them into the corresponding parameter cache. With the above, we get the SBN PE as presented in Fig. 8. We have also noticed that the shift-based approximation would cause a severe accuracy drop for AlexNet (from 42.9% to 31.9%), so we avoid the shift replacement for AlexNet and keep the original batch normalization, using a multiplier in place of the shift and sign operations.
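A small sketch of the SBN computation of Eqs. (22)–(23) follows, with φ folded offline from the trained γ and σ (assumed positive); the variable names are ours, and the multiplication by 2^φ stands in for the hardware arithmetic shift.

```python
import numpy as np

def sbn_params(gamma, sigma):
    # Offline: fold gamma/sigma into a single power-of-two shift, Eq. (23).
    # Assumes gamma, sigma > 0 so that log2 is defined.
    return int(np.round(np.log2(gamma / sigma)))   # phi

def sbn_forward(x, mu, beta, phi):
    # y ~= sal(x - mu, phi) + beta; sal = arithmetic left shift,
    # i.e. multiplication by 2**phi (phi may be negative).
    return (x - mu) * (2.0 ** phi) + beta

# Compare against exact batch normalization, Eq. (22)
x, mu, sigma, gamma, beta = 3.2, 1.1, 0.52, 1.9, 0.3
phi = sbn_params(gamma, sigma)
exact  = (x - mu) / sigma * gamma + beta
approx = sbn_forward(x, mu, beta, phi)
print(exact, approx)   # close, up to the power-of-two rounding of gamma/sigma
```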
For all models, we keep the last BN layer as the original batch normalization during training in order to avoid accuracy loss, that is, with no shift-based operations. These floating-point multiplications are implemented independently with DSP blocks, and for ImageNet classification (AlexNet) the output process is tiled.
Activation PE
For the last part of a normal macro layer, we need to binarize the result into either 0 or 1. In the training process, the HTanh function, as shown in Table 2, constrains the values between −1 and 1, while the final BNeu layer pushes those values to the two boundaries, which means −1 for all the negatives and 1 for the others. Since, as introduced in Section 4.1, the (−1,1)-based vectors can be affine transformed into (0,1) vectors, the case becomes much simpler: we just need to discern all the negative values from the SBN layer and set them to 1, with the others set to 0. This can be done directly by reading the sign bit of these values.
Pooling
For some macro layers, pooling is applied for sub-sampling in order to reduce the output fmap size. As shown in Table 4, pooling comes right after the CONV layer, and the pooling type for all of our models is max-pooling. If pooling came after activation, most of the output values would be +1, which results in significant information loss during training [20], so a C-P-B-A macro-layer structure is adopted for training. However, for the inference process, a C-B-A-P structure gives an identical result, and pooling is then applied to values of 0 and 1 only, which can be implemented directly with OR operations (see the sketch below). We organize selective line buffers after the activation PEs; when a pooling process is required, the activation values stream into the line buffers. If the pooling size is K, we enable OR operations on the horizontally targeted locations once K rows of activations arrive, and a similar process is then repeated along the vertical direction with other line buffers to complete a 2-D max-pooling.
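A behavioural sketch of the OR-based binary max-pooling, assuming a (0,1)-encoded fmap where 1 is the "greater" value; the two reduction passes correspond to the horizontal and vertical line-buffer stages, and the function name is ours.

```python
import numpy as np

def binary_maxpool(fmap, K):
    """Max-pooling over a binarized fmap: the max of {0,1} values in a
    K x K window is simply their logical OR, evaluated here row-wise
    then column-wise like the two line-buffer passes."""
    R = fmap.shape[0] // K
    rows = np.zeros((fmap.shape[0], R), dtype=np.uint8)
    for x in range(R):                      # horizontal OR within each window
        rows[:, x] = np.bitwise_or.reduce(fmap[:, x*K:(x+1)*K], axis=1)
    out = np.zeros((R, R), dtype=np.uint8)
    for y in range(R):                      # vertical OR across K rows
        out[y] = np.bitwise_or.reduce(rows[y*K:(y+1)*K, :], axis=0)
    return out

fm = (np.random.rand(8, 8) > 0.7).astype(np.uint8)
print(binary_maxpool(fm, K=2))   # 4 x 4 pooled map
```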
Task tiling and scheduling (T&S)
This section introduces T&S, a method for tiling and scheduling models on chip. The T&S method is depicted in Fig. 9. We assume the width of one PE to be PE_size, and the number of tiled input channels to be T_N_in. The number of tiled output channels equals the number of PE channels, N_PE.
For FC layers, one tiled input is fed into all PE channels. For better resource utilization, we set T_N_in as close as possible to PE_size; it then takes ⌈N_in / T_N_in⌉ iterations to produce one output. For CONV layers, since most filter kernel sizes K are rather small, we join several filters together as one input to a C/F PE. For one PE, the input is summed up to get one output, so the joined filters should be at the same location of the input fmaps, as different locations have no dependency on each other. The input vector size is then L_in = T_N_in × K². Similar to the FC layers, ⌈N_in / T_N_in⌉ iterations are taken to get one output pixel, and the next tiled input location is given in a sliding-window style.
Considering the datapath shared among different layers, N_PE should be a common divisor of the N_out of the different layers, T_N_in for each layer should preferably be a divisor of N_in, and PE_size should be large and close to L_in to make the best use of the resources.
The first layer can become a bottleneck for the datapath, since its N_in is small (usually 3). Consider Eq. (21): for input values which are not binarized, T_N_in can effectively be multiplied if tiling is applied inside Eq. (21). Notice that the innermost MAC is implemented through XNOR + popcount, and the m bits of the input are calculated in different runs and accumulated with shifting. As PE_size is much larger than L_in = N_in × K² in the first layer, we can replicate the innermost MAC inside one PE to complete several of the bit iterations in one run. This process is illustrated in Fig. 10. If t bits are tiled in one run, then we have

L_in = t × N_in × K²    (24)

and through this tiling, the utilization rate of the PE for the first layer is increased. With RAMA and the information given in Tables 4 and 6, our tiling strategy is shown in Table 7.
Memory system design
In this section, we introduce the memory system design for FP-BNN. This mainly consists of two parts: the first is parameter quantization and storage, and the second is on-chip fmap caching.
Quantization over other parameters
In BNN models the weights have already been binarized, so as a further step we quantize the non-weight parameters, which we call other parameters. This is essential for small models, since we would like to store everything on-chip. These parameters are in floating-point format, and their number is mostly equal to the number of output channels of a layer. BRAMs would be too wasteful for their storage: these parameters need to be provided in parallel with a width equal to N_PE, which would make the depth of their storage too shallow. So we place them in fast distributed memories, which are available in Altera FPGAs as memory logic array blocks (MLABs). Since we also need MLABs to construct the intermediate cache, quantization of the other parameters is adopted to make the best use of the limited MLAB storage.
A fixed-point number can be represented as:

n_fixed = (−2^{BW−1} · B_{BW−1} + Σ_{i=0}^{BW−2} 2^i · B_i) · 2^{−fl}    (25)

where BW stands for the overall bitwidth (including the sign bit) of the number, B_i for its i-th bit, and fl for the fractional bitwidth. Here we use Q = (BW, −fl) to capture the quantization strategy for a particular type of parameter in a layer. To transform a floating-point number n_float into a fixed-point number n_fixed under a given Q strategy, we make use of the round-to-nearest rounding mode [43] to shift and cut:

n_fixed = round(n_float × 2^{fl}) × 2^{−fl}    (26)

with saturation to the representable range.
We use this method to quantize parameters for various models to see how the accuracy fluctuates with the bitwidth variation ( Fig. 11 ) . It is obvious that when the bitwidth of parameters drops below a particular threshold, the model accuracy drops significantly.
For the biases of the C/F layers, we discover that even if their bitwidth drops to 0, there is still no significant variation in the results. So, to save storage and computing resources, we ignore the C/F layer bias addition. For the running means μ and the affine biases β, the required bitwidth varies between 2 and 8 bits for different layers. For the shift parameter of the BN layers, we can express each φ as φ_min + Δφ, and therefore only need to store the difference Δφ to reduce resource usage, whose bitwidth can be calculated as:

BW(Δφ) = ⌈log2(φ_max − φ_min + 1)⌉    (27)

We choose a dynamic fixed-point strategy [44] to optimize the bitwidth for each layer. Table 8 shows all the value ranges and BW strategies that we have chosen for each parameter, and Table 9 shows the comparisons between the quantized models and the original models.
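A minimal sketch of the Q = (BW, −fl) round-to-nearest quantizer of Eq. (26), with saturation to the signed BW-bit range; the function name is ours.

```python
import numpy as np

def quantize(n_float, BW, fl):
    """Dynamic fixed-point quantization Q = (BW, -fl):
    scale by 2**fl, round to nearest, saturate to the signed
    BW-bit integer range, then scale back (Eq. (26))."""
    scaled = np.round(n_float * (2.0 ** fl))
    lo, hi = -(2 ** (BW - 1)), 2 ** (BW - 1) - 1
    return np.clip(scaled, lo, hi) * (2.0 ** -fl)

print(quantize(np.array([0.731, -1.25, 3.999]), BW=8, fl=4))
# -> [ 0.75 -1.25  4.  ]  (steps of 2**-4; saturates at the signed 8-bit range)
```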
Memories for parameters
The memory storage structure for weights is given in Fig. 12. In order to reduce memory access time, we keep the parallelism of the memories identical to the number of PE channels N_PE, and the width of each memory equal to PE_size. The weights for each computing tile are arranged serially. An address generator, driven by the overall controller, provides the exact weight locations. For MNIST MLP and Cifar-10 ConvNet, the weights fit into an array of BRAMs. For AlexNet, the weights are tiled per layer, or inside a layer once they get too large; the oldest weights of finished tiles are overwritten by the weights of the next tile from off-chip memory in a ping-pong fashion. For the other parameters, a similar storage structure is used, based on MLABs.
Intermediate cache
For the intermediate outputs, that is, the output fmaps of each macro layer, we place a cache to hold them for the next macro layer to read. The design of the intermediate cache is given in Fig. 13. For CONV layers, the cache structure facilitates sliding-window data fetching. For each input iteration, we separate the intermediate cache into T_N_in groups, each containing K memory blocks, with every two adjacent blocks storing consecutive rows. The input logic offers the corresponding addresses for the required K rows, and selects the required K horizontal bits from each row. In total we thus get T_N_in K × K windows, forming an L_in-long vector. The output fmaps of each layer are stored into the space spared from the input fmaps in a ping-pong way. For FC layers, we just need to ensure that the cache bitwidth can satisfy T_N_in. For MNIST MLP we choose 32
System evaluation
We evaluate the performance of our accelerator system in this section. The environment setup and NN model preparation are introduced first. We train binarized versions of the target CNN models, and apply quantization to the parameters to further compress them. We then map the optimized BNN models onto the FPGA, and provide a performance analysis comparing FP-BNN with general-purpose processors and other FPGA designs.
System environment
We train the models on an IBM x3650 M4 server equipped with an NVIDIA Tesla K40 GPU card (28 nm feature size, 2880 CUDA cores, 12 GB GDDR5 external memory) and a K80 GPU card (28 nm feature size, 4992 CUDA cores, 24 GB GDDR5 external memory), using both cards to accelerate the training process. The evaluation system is built on Maxeler's MPC-X2000 platform [45]. The system has 8 dataflow engines (DFEs), each comprising a single Altera Stratix-V 5SGSD8 FPGA (28 nm feature size) with access to up to 48 GB of DDR3 RAM, which can communicate with other DFEs through MaxRing interconnections. In addition, two Intel Xeon E5-2640 6-core CPUs (32 nm feature size) are included in the server and communicate with the DFEs through InfiniBand. Here we take only one of the DFEs to implement the models. The Maxeler system offers a convenient solution for data communication between software algorithms and FPGA hardware.
Model preparation
We use the Torch 7 framework [46] to train the NN models, based on Hubara's BNN framework [18] for MNIST and Cifar-10, and on Rastegari's XNOR-Net framework [20] for AlexNet. The MNIST dataset is a permutation-invariant version consisting of 60 K examples of 28 × 28 gray-level digit images for training and 10 K examples for testing. The Cifar-10 dataset consists of 50 K examples of 32 × 32 RGB colour images in 10 classes for training and 10 K examples for testing; global contrast normalization and ZCA whitening are used in the same way as Goodfellow et al. [47] and Lin et al. [48] did. The ImageNet dataset consists of 1.2 M training images from 1 K categories and 50 K images for testing; a center crop of 224 × 224 is extracted for forward propagation. The Adam [49] learning rule is adopted for training, with mini-batch sizes of 100, 200 and 800. The binarization method used here is deterministic [50], considering the convenience of implementing the inference hardware. Model accuracies are measured and presented in Table 9. As we can see, even the quantized versions for MNIST and Cifar-10 keep high accuracies close to the state-of-the-art results. The XNOR-Net based AlexNet for our design suffers a 13% accuracy drop, while offering state-of-the-art accuracy among existing BNN solutions for ImageNet.
Hardware implementation
We use MaxCompiler to generate the executable bit-stream for FPGA, which takes Altera Quartus II v13.1 to synthesize, place and route the designs. The resource utilization of the final implementation is shown in Table 10 . The design can be driven with an achievable 150 MHz clock. We notice that the utilization of DSP blocks is not high, since only a small portion of arithmetic operations needs floating-point multipliers. 
Performance analysis
We implement the binarized models on the Xeon E5-2640 CPU, the NVIDIA Tesla K40 GPU and the Altera Stratix-V FPGA. Performance is measured as shown in Table 11. To feed the CPU and GPU with enough data, we take the batch size for forward propagation to be identical to that used in training. As we can see, with about an order of magnitude slower clock frequency and much lower power consumption, our accelerator still achieves an average speed-up of 314.07 times over the CPU and 19.08 times over the GPU for MNIST, 51.83 times over the CPU and 5.07 times over the GPU for Cifar-10, and 11.67 times over the CPU and 2.72 times over the GPU for ImageNet. Peak speed-ups reach 705.19 times over the CPU and 70.75 times over the GPU. Although the model has been compressed by about 32 times, the low-precision operations exploit the potential of fine-grained parallelism in the FPGA, which offers higher performance than CPUs and GPUs. If we take energy efficiency as the criterion, at a similar feature size the FPGA implementation offers an efficiency two to three orders of magnitude higher than that of the CPU and the GPU.
We also compare with some previous FPGA accelerator designs for CNN and BNN models, as listed in Table 12 (where O = Overall, P = Peak, C = CONV, F = FC). Our FP-BNN reaches TOP/s speeds, significantly faster than the previous CNN designs. For some designs (such as that in [15]), one major problem is that for memory-centric FC layers the data and parameter loading time is much longer than the computing time, as the number of input ports for the data and weight RAMs is limited to 8, while in our design all the computing channels can be fed with data and weights in parallel. FP-BNN is also much faster than the most recent BNN design [40]; although our design involves a larger FPGA, the power efficiency is also 10 times better. Another BNN design, FINN [41], reaches a performance similar to ours. For the MNIST case, FINN uses a smaller MLP network in which the input dimension is larger than the number of neurons in each layer, which results in a higher resource utilization after task tiling. If we were prepared to reduce model accuracy for a smaller network, the overall performance should get closer to the peak value (12 TOP/s). For the Cifar-10 case, our ConvNet model achieves a throughput of almost 4 times FINN's. We also support large datasets with our FP-BNN design, which demonstrates the compatibility of our design method with various CNN models.
Discussion
There is considerable scope for improvement in FP-BNN especially for the first layers, since datapath utilization is low due to the limited number of input channels. Moreover, the utilization of DSP blocks is low, and more DSP blocks can be involved if they can effectively support low-bandwidth operations to enhance the overall throughput. Furthermore, we can exploit the heterogeneity of logic elements in FPGAs, such as introducing different bit-width choices together with binarized data for better use of DSP multipliers.
This implementation shows that it is promising to implement BNN models especially for embedded systems, where they can offer competitive speed and accuracy with low power consumption. Recently, various designs [36,52] have shown that more complicated NN models can also be binarized with tolerable loss of accuracy. Considering the similarity of the component layers and of the logic generation algorithms, it is feasible to implement these models layer-by-layer in a sequential way, as long as there is a sufficient amount of on-chip memory for the parameters.
Conclusion
This paper presents FP-BNN, our design for binarized neural networks targeting FPGA technology. Based on the RAMA analysis method, we design a 64-channel accelerator architecture which can accommodate both CONV and FC layers. An XNOR-based method is introduced for binarized vector MAC operations, and the summing-up process is achieved with a popcount compressor tree which can be generated automatically. For small models like the MNIST MLP and Cifar-10 ConvNet, shift-based normalization is introduced, which largely reduces the cost of multipliers. With proper dynamic quantization of the inputs and parameters, the models keep good accuracy with the weights binarized and the other parameters compressed by over 10 times. Optimized on-chip data storage is managed with parameter quantization. Our implementation on the Maxeler MPC-X2000 platform (with a Stratix-V 5SGSD8 FPGA) shows a promising TOP/s speed with only 26.2 W power at a 150 MHz clock frequency. We expect enhanced accuracy in future binarized models, which should greatly extend their range of applications.
