Hardware acceleration of deep learning systems has been extensively investigated in industry and academia. The aim of this paper is to achieve ultra-high energy efficiency and performance for hardware implementations of deep neural networks (DNNs). An algorithm-hardware co-optimization framework is developed, which is applicable to different DNN types, sizes, and application scenarios. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both fully-connected and convolutional layers and comes with a mathematically rigorous proof of the effectiveness of the method. The proposed algorithm reduces computational complexity per layer from $O(n^2)$ to $O(n \log n)$ and storage complexity from $O(n^2)$ to $O(n)$, both for training and inference. The hardware part consists of highly efficient Field Programmable Gate Array (FPGA)-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-use, and hierarchical control. Experimental results demonstrate that the proposed framework achieves at least 152X speedup and 71X energy efficiency gain compared with the IBM TrueNorth processor under the same test accuracy. It achieves at least 31X energy efficiency gain compared with the reference FPGA-based work.
Introduction
Recent deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have delivered remarkable success in visual and recognition tasks (Deng et al., Taigman et al.) and real-world applications (Huval et al., Collobert and Weston, Burbidge et al.), by leveraging large-scale neural network sizes and learning from a huge volume of data. Despite the advantage of improved overall accuracy, the deep layered structure and large model sizes increase the computational complexity and memory requirements. It is projected that the majority of inference tasks will be performed on embedded, IoT, and mobile systems, which have limited power and computational resources. In order to achieve higher scalability, performance, and energy efficiency, two orthogonal research and development trends have both attracted enormous interest.
The first is hardware acceleration of deep learning systems/applications, which has been extensively investigated in industry and academia (Farabet et al., Suda et al., Qiu et al., Zhang et al., Zhang et al., Han et al., Zhao et al., Zhang and Li, Umuroglu et al., com, com, Chen et al., Han et al., Chen et al.). As a representative technique, FPGA-based accelerators offer the advantages of programmability, a high degree of parallelism, and a short development cycle. Important progress has been reported on FPGA acceleration of original DNNs (Farabet et al., Suda et al., Zhang et al., Zhang et al.), binary neural networks (Zhao et al., Umuroglu et al.), and more recently, on DNNs and recurrent neural networks (RNNs) with model compression techniques (Qiu et al., Han et al.). These prior works mainly focus on the inference phase of DNNs, and suffer from frequent access to off-chip memory systems because the limited on-chip memory can hardly accommodate the large model sizes. Accessing off-chip memory is highly energy inefficient. As pointed out in (Han et al., Han, Mao, and Dally), the per-bit access energy of off-chip memory is 200X that of on-chip memory storage, and dominates the whole system power consumption. Besides, it is also desirable to achieve algorithmic-level acceleration to accommodate the further scaling of DNNs, instead of simply adding more and more hardware devices.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
The second important trend is the model size compression and algorithmic-level acceleration of DNNs (with very minor accuracy loss), including weight quantization (Lin, Talathi, and Annapureddy, Lin et al.), sparsity regularization (Feng and Darrell, Wen et al., Li, Park, and Tang), connection pruning (Han et al., Han, Mao, and Dally), and low rank approximation (Denil et al., Denton et al.). These approaches can offer a reasonable amount of parameter reduction (e.g., by 9× to 13× in (Han et al., Han, Mao, and Dally)) and/or a reasonable speedup (e.g., around 50% to 2× in (Wen et al.)). However, they suffer from the following limitations: (i) the sparsity regularization and pruning methods will likely result in an irregular and sparse network structure, thereby undermining the compression ratio and increasing computation time (this is especially inefficient on GPUs and dedicated hardware with high parallelism capability); (ii) the training complexity will be increased by incorporating an additional pruning process (Han et al., Han, Mao, and Dally), an additional low rank approximation step (Denil et al., Denton et al.), or extra trade-off parameters (Wen et al.); (iii) the compression or acceleration factors are heuristic numbers that cannot be precisely controlled, not to mention the lack of a mathematically rigorous proof of the effectiveness of these methods.
To combine these two directions, the aim of this paper is to address the limitations of existing model size compression and acceleration work and to achieve ultra-high energy efficiency and performance for FPGA-based hardware implementations of DNNs, by (i) deriving a highly suitable algorithm for efficient computation and storage reduction without significant accuracy loss, and (ii) deriving the corresponding optimized hardware implementations. We develop an algorithm-hardware co-optimization framework, which is applicable to different DNN types, sizes, and application scenarios. The proposed framework comprises algorithm and hardware parts. The algorithm part extends reference (Cheng et al.), which applies circulant matrices to the whole fully-connected (FC) layer for model compression, by (i) adopting general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio, (ii) generalizing to the convolutional (CONV) layers for significant acceleration, as CONV layers dominate the computation of DNNs (Krizhevsky, Sutskever, and Hinton, He et al.), (iii) providing a mathematically rigorous proof that the proposed algorithm will asymptotically converge to the same "effectiveness" as DNNs without compression, and (iv) decoupling the fast Fourier transform (FFT) and inverse FFT (IFFT) computations in the framework for accelerating computation and facilitating hardware implementations. The proposed algorithm reduces computational complexity per layer from $O(n^2)$ to $O(n \log n)$ and storage complexity from $O(n^2)$ to $O(n)$, both for training and inference, with negligible degradation in DNN accuracy. The hardware part consists of highly efficient FPGA-based implementations using effective reconfiguration, batch processing, deep pipelining, effective resource re-use, and a hierarchical control framework.
The proposed FPGA-based implementation can accommodate the whole DNN model using on-chip block memory, thereby significantly improving the overall energy efficiency. Finally, a comprehensive algorithm-hardware co-optimization is proposed which comprises (i) model selection and optimization, (ii) hardware optimization, and (iii) variational inference-based Bayesian learning for enhancing accuracy and robustness. In summary, the major contributions of this work include both algorithm and hardware parts. The algorithm part adopts block-circulant matrices for weight representation, which achieves a significant model compression ratio with minor accuracy degradation. It applies to the whole network, both fully-connected and convolutional layers. The hardware part consists of highly efficient FPGA-based implementations with multiple innovations in reconfiguration, batch processing, deep pipelining, resource re-use, etc.
Please note that the proposed framework is distinct from the prior work (Mathieu, Henaff, and LeCun) , which applies FFTs to accelerate the computations in the CONV layers. The prior work applies only to a single filter in the CONV layer and achieves no storage reduction (in fact it results in storage increase), whereas the proposed method applies both to CONV and FC layers and achieves simultaneous acceleration and storage reduction.
Because we focus on highly energy-efficient FPGA- (Zhang et al., Zhang et al.) or on compressed models using singular value decomposition (SVD) (Qiu et al.). This year the research on this topic has exploded, including accelerations of DNNs with weight pruning (Han et al.), binary neural networks (Zhao et al., Umuroglu et al.), and high-level synthesis for fast generation of FPGA implementations (Zhao et al., Zhang and Li). These works typically suffer from frequent access to off-chip memory systems because their model sizes cannot be effectively reduced for on-chip memory storage, thereby resulting in high energy consumption. The typical (equivalent) energy efficiency ranges from 7 GOPS/W to less than 1 TOPS/W, depending on the testing FPGA platform, implementation details, and compression techniques.
Connection Pruning and Weight Sparsifying. Han et al. (Han et al., Han, Mao, and Dally) reduced the number of parameters by 9X-13X using connection pruning. Since most of the reduction is achieved on FC layers, no significant speedup of CONV layers can be observed (Wen et al.). As CONV layers have become the computational bottleneck, compression and acceleration of CONV layers become essential. Liu et al. achieved a layer-wise 4.59X speedup on the CONV layers of AlexNet with 2% accuracy loss. Recently, (Wen et al.) adopted a structured sparsity learning method and derived an effective tradeoff between acceleration on CPU/GPU and test accuracy for the CONV layers. More specifically, for ResNet-20 on CIFAR-10 and AlexNet on ImageNet benchmarks, more than 50% acceleration can be achieved without any accuracy loss, while around 3X acceleration is achieved with an acceptable accuracy loss of 2%.
FFTs for CONV Layer Accelerations. LeCun et al. have proposed using FFTs to accelerate the computations in the CONV layers, which applies only to a single filter in the CONV layer (Mathieu, Henaff, and LeCun) . It uses FFT to calculate the traditional inner products of filters and input feature maps, and can achieve speedup for large filter sizes. The underlying neural network structure remains unchanged. The speedup is due to filter reuse and it cannot achieve either asymptotic speedup in Big-O notation or weight compression.
Structured Matrices in FC Layers for Model Compression. The most relevant work to this paper is (Cheng et al.), which directly applies circulant matrices to the FC layers for model compression. As an example, an FC layer of a DNN can be represented as $y = \psi(Wx + \theta)$, where vectors x and y represent the outputs of all neurons in the previous layer and the current layer, respectively; W is an n-by-n weight matrix; θ is the bias vector; and ψ(·) is the activation function. When W is a circulant matrix, the fast Fourier transform (FFT)-based fast multiplication method can be utilized, and the computational complexity and weight storage complexity will be reduced from $O(n^2)$ to $O(n \log n)$ and from $O(n^2)$ to $O(n)$, respectively. Despite the significant reduction in computation and weight storage, this approach has the limitations of (i) requiring a large amount of zero padding when the numbers of inputs and outputs are not equal, (ii) incurring a certain accuracy degradation for large-scale FC layers because of the aggressive weight reduction, and (iii) being applicable only to the FC layers, whereas the CONV layers are the most computationally intensive in DNNs.
Algorithm Development of Block-Circulant Matrix-Based DNNs
In this section, we develop the algorithmic framework of block-circulant matrix-based DNNs for simultaneous acceleration and model compression, for both inference and training phases. The proposed framework is able to accommodate weight matrices of arbitrary size and aspect ratio, and achieves a fine-grained tradeoff between test accuracy and compression/acceleration ratio (Ding et al.). Unlike (Cheng et al.), we develop algorithms for both FC and CONV layers, as shown in the following. We provide a mathematically rigorous proof that the proposed algorithm satisfies the same universal approximation property as uncompressed DNNs. Finally, we develop a decoupling technique for FFT/IFFT pairs for further acceleration and for facilitating hardware (FPGA) implementations.
Inference and Training Algorithms for FC Layers
The key idea of block-circulant matrix-based FC layers is to partition the original arbitrary-size unstructured weight matrix $W \in \mathbb{R}^{m \times n}$ into 2D blocks of square sub-matrices. Such a partitioning strategy has two advantages: 1) it is suitable for arbitrary-size weight matrices without any requirement on the aspect ratio of W; and 2) it is an adjustable approach that can conveniently control the compression ratio and potential accuracy loss by only changing the size of the sub-matrices.
For formal discussions on the proposed inference and training procedures, let k denote the block size (size of each sub-matrix) and there are p × q blocks after partitioning W, where p = m ÷ k and q = n ÷ k. Zero padding is required if k does not directly divide m or n, but the amount of zero padding will be significantly reduced compared with (Cheng et al.) .
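The partitioning arithmetic above can be illustrated with a small sketch. The function name `block_partition` and the example layer size (m = 1000, n = 784) are hypothetical choices made only for illustration; the sketch assumes that zero padding rounds m and n up to the next multiple of k and that one length-k vector is stored per block.

```python
import math

def block_partition(m, n, k):
    """Block counts, padded shape, and parameter counts for block size k.

    Assumes zero padding up to the next multiple of k in each dimension.
    """
    p = math.ceil(m / k)            # number of block rows
    q = math.ceil(n / k)            # number of block columns
    padded_shape = (p * k, q * k)   # weight matrix shape after zero padding
    dense_params = m * n            # weights in the unstructured matrix
    circulant_params = p * q * k    # one defining vector of length k per block
    return p, q, padded_shape, dense_params, circulant_params

# Hypothetical FC layer with 784 inputs and 1000 outputs, block size 128.
p, q, padded_shape, dense_params, circulant_params = block_partition(1000, 784, 128)
compression_ratio = dense_params / circulant_params
print(p, q, padded_shape, circulant_params, compression_ratio)
```

Changing k in this sketch is the single knob that trades parameter count against the amount of structure imposed on W, which is the adjustability the text refers to.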
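Let $W = [C_{ij}]$, $i \in \{1, \dots, p\}$, $j \in \{1, \dots, q\}$, where each $C_{ij}$ is a $k \times k$ circulant sub-matrix, and let the input be partitioned accordingly as $x = [x_1^T, x_2^T, \dots, x_q^T]^T$. Then the forward propagation process in the inference phase is given by:

$$a = Wx = \begin{bmatrix} \sum_{j=1}^{q} C_{1j} x_j \\ \sum_{j=1}^{q} C_{2j} x_j \\ \vdots \\ \sum_{j=1}^{q} C_{pj} x_j \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}, \tag{1}$$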
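where $a_i \in \mathbb{R}^k$ is a column vector. Assume each circulant matrix $C_{ij}$ is defined by a vector $w_{ij}$, i.e., $w_{ij}$ is the first row vector of $C_{ij}$. Then according to the circulant convolution theorem (Pan, Bini, Pan, and Eberly), the calculation of $C_{ij} x_j$ can be performed as $\mathrm{IFFT}(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j))$, where $\circ$ denotes element-wise multiplication. The operation procedure is shown in Fig. 1. For the inference phase, the computational complexity of this FC layer will be $O(pqk \log k)$, which is equivalent to $O(n \log n)$ for small p, q values. Similarly, the storage complexity will be $O(pqk)$ because we only need to store $w_{ij}$ or $\mathrm{FFT}(w_{ij})$ for each sub-matrix, which is equivalent to $O(n)$ for small p, q values. Simultaneous acceleration and model compression compared with the original DNN can thus be achieved.
Now consider the backward propagation process in the training phase. Let $a_{il}$ be the l-th output element in $a_i$, and let L denote the loss function. Then by using the chain rule we can derive the backward propagation process as follows:

$$\frac{\partial L}{\partial w_{ij}} = \sum_{l=1}^{k} \frac{\partial L}{\partial a_{il}} \frac{\partial a_{il}}{\partial w_{ij}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}}, \tag{2}$$

$$\frac{\partial L}{\partial x_j} = \sum_{i=1}^{p} \sum_{l=1}^{k} \frac{\partial L}{\partial a_{il}} \frac{\partial a_{il}}{\partial x_j} = \sum_{i=1}^{p} \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial x_j}. \tag{3}$$

We have proved that $\frac{\partial a_i}{\partial w_{ij}}$ and $\frac{\partial a_i}{\partial x_j}$ are block-circulant matrices. Therefore, $\frac{\partial L}{\partial w_{ij}}$ and $\frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial x_j}$ can be calculated using the "FFT→element-wise multiplication→IFFT" procedure, which is equivalent to $O(n \log n)$ computational complexity per layer. Due to space limitations, the algorithmic descriptions of forward and backward propagation are omitted.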
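Since the algorithmic descriptions are omitted, the forward propagation can be illustrated with a minimal NumPy sketch, which is not the authors' implementation. The helper names `circulant` and `block_circulant_matvec` are hypothetical. One assumption matters here: the sketch takes each defining vector to be the first column of its block, because that is the convention under which the FFT identity holds exactly with NumPy's transform; a first-row convention corresponds to the same computation after reversing the indices of the vector (modulo k).

```python
import numpy as np

def circulant(c):
    """Dense k-by-k circulant matrix whose first column is c (assumed convention)."""
    k = len(c)
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return c[idx]

def block_circulant_matvec(w, x):
    """Compute a = W x with the FFT -> element-wise multiply -> IFFT procedure.

    w has shape (p, q, k): one defining vector per block.
    x has shape (q * k,).
    """
    p, q, k = w.shape
    xj = x.reshape(q, k)
    a = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            # C_ij x_j computed as a circular convolution in the frequency domain
            a[i] += np.fft.ifft(np.fft.fft(w[i, j]) * np.fft.fft(xj[j])).real
    return a.reshape(p * k)

rng = np.random.default_rng(0)
p, q, k = 3, 2, 8
w = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)

# Reference: build the dense block-circulant matrix and multiply directly.
W_dense = np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])
max_err = np.max(np.abs(W_dense @ x - block_circulant_matvec(w, x)))
print(max_err)
```

The dense matrix is built only to check the result; the FFT path touches just the p·q·k stored values, which is where the storage and computation savings come from.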
Please note that there is no special need to translate each sub-matrix of W into circulant form or to approximate it. Instead, as shown in Eqns. (2) and (3), we directly learn the vector $w_{ij}$ (the first-row vector) of each sub-matrix of W in the training process.
The assumption is that the other rows of the sub-matrix follow the circulant formulation. In other words, when following the learning process of Eqns. (2) and (3), the learnt weight matrices naturally follow the block-circulant format. In fact, this is a key advantage of the proposed method in that there is no need for additional "translation" or "approximation" steps.
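Learning the defining vectors directly can be sketched as follows. This is an illustrative NumPy check rather than the authors' training code: `fft_grads` is a hypothetical name, `g` stands for the upstream gradient of the loss with respect to a, and the first-column convention for each block is again an assumption. Under that convention the two gradients reduce to circular correlations, which are computed with a complex conjugate in the frequency domain.

```python
import numpy as np

def circulant(c):
    """Dense k-by-k circulant matrix whose first column is c (assumed convention)."""
    k = len(c)
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return c[idx]

def fft_grads(w, x, g):
    """Gradients w.r.t. w (p, q, k) and x (q*k,), given g = dL/da of shape (p*k,)."""
    p, q, k = w.shape
    Xf = np.fft.fft(x.reshape(q, k), axis=1)    # (q, k)
    Gf = np.fft.fft(g.reshape(p, k), axis=1)    # (p, k)
    Wf = np.fft.fft(w, axis=2)                  # (p, q, k)
    # dL/dw_ij: circular correlation of g_i with x_j
    grad_w = np.fft.ifft(Gf[:, None, :] * np.conj(Xf)[None, :, :], axis=2).real
    # dL/dx_j: sum over i of the circular correlation of g_i with w_ij
    grad_x = np.fft.ifft((Gf[:, None, :] * np.conj(Wf)).sum(axis=0), axis=1).real
    return grad_w, grad_x.reshape(q * k)

rng = np.random.default_rng(1)
p, q, k = 2, 3, 8
w = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)
g = rng.standard_normal(p * k)

grad_w, grad_x = fft_grads(w, x, g)

# Dense references: da_i/dx_j = C(w_ij) and da_i/dw_ij = C(x_j), both circulant.
W_dense = np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])
ref_x = W_dense.T @ g
xj, gi = x.reshape(q, k), g.reshape(p, k)
ref_w = np.array([[circulant(xj[j]).T @ gi[i] for j in range(q)] for i in range(p)])

err_w = np.max(np.abs(grad_w - ref_w))
err_x = np.max(np.abs(grad_x - ref_x))
print(err_w, err_x)
```

Because the update is applied to the length-k vectors themselves, the weight matrix stays block-circulant after every step without any projection.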
Inference and Training for CONV Layers
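We generalize the inference and training algorithms to CONV layers, which have become the computation bottleneck of the whole DNN. The CONV layers are often associated with multiple input and multiple output feature maps:

$$\mathcal{Y}(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} \mathcal{F}(i, j, c, p)\, \mathcal{X}(x + i - 1, y + j - 1, c), \tag{4}$$

where $\mathcal{X} \in \mathbb{R}^{W \times H \times C}$, $\mathcal{Y} \in \mathbb{R}^{(W-r+1) \times (H-r+1) \times P}$, and $\mathcal{F} \in \mathbb{R}^{r \times r \times C \times P}$ represent the input, output, and weight tensors of the CONV layer, respectively. Here W and H are the spatial dimensions of the input feature maps, C is the number of input feature maps, r is the size of the convolutional kernel, and P is the number of output feature maps.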
Software tools such as Caffe provide an efficient methodology for transforming tensor-based operations in the CONV layer into matrix-based operations (Jia et al., Vedaldi and Lenc), in order to enhance the implementation efficiency (GPUs are optimized for matrix operations). Fig. 2 illustrates the application of the method to reformulate Eqn. (4) as the matrix multiplication $Y = XF$, where $X \in \mathbb{R}^{(W-r+1)(H-r+1) \times Cr^2}$, $Y \in \mathbb{R}^{(W-r+1)(H-r+1) \times P}$, and $F \in \mathbb{R}^{Cr^2 \times P}$.
We generalize the concept of "block-circulant structure" to the rank-4 tensor $\mathcal{F}$ in the CONV layer, i.e., all the slices of the form $\mathcal{F}(\cdot, \cdot, c, p)$ are block-circulant matrices. Then we can prove that the matrix F is actually a block-circulant matrix. Hence the fast multiplication approach for block-circulant matrices, i.e., the "FFT→component-wise multiplication→IFFT" procedure, can now be applied to accelerate $Y = XF$, thereby resulting in the acceleration of (4). The training phase can be derived similarly. The overall degrees of reduction in computational and storage complexities are similar to those in FC layers.
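The matrix reformulation Y = XF that the fast multiplication relies on can be checked numerically. The sketch below is a plain NumPy illustration and not Caffe's implementation; the names `conv_direct` and `im2col`, the tensor sizes, and the stride-1, no-padding setting are assumptions made for the example. It does not impose the block-circulant structure on F; it only shows that the tensor form and the matrix form agree.

```python
import numpy as np

def conv_direct(X, F):
    """Direct multi-channel convolution (stride 1, no padding assumed)."""
    W, H, C = X.shape
    r, _, _, P = F.shape
    Y = np.zeros((W - r + 1, H - r + 1, P))
    for x in range(W - r + 1):
        for y in range(H - r + 1):
            patch = X[x:x + r, y:y + r, :]                # (r, r, C)
            Y[x, y, :] = np.tensordot(patch, F, axes=3)   # sum over i, j, c
    return Y

def im2col(X, r):
    """Stack every r-by-r-by-C input patch as one row of a matrix."""
    W, H, C = X.shape
    rows = [X[x:x + r, y:y + r, :].reshape(-1)
            for x in range(W - r + 1) for y in range(H - r + 1)]
    return np.stack(rows)        # ((W-r+1)(H-r+1), r*r*C)

rng = np.random.default_rng(2)
W, H, C, r, P = 6, 5, 3, 3, 4
X = rng.standard_normal((W, H, C))
F = rng.standard_normal((r, r, C, P))

X_mat = im2col(X, r)                      # input patches as rows
F_mat = F.reshape(r * r * C, P)           # one column per output feature map
Y_mat = X_mat @ F_mat                     # matrix form of the CONV layer

Y_ref = conv_direct(X, F)
max_err = np.max(np.abs(Y_mat.reshape(W - r + 1, H - r + 1, P) - Y_ref))
print(X_mat.shape, F_mat.shape, max_err)
```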
Theoretical Foundation and Software Results
Given the substantial reduction of weight storage and computations, we also prove that the proposed block-circulant matrix-based framework will consistently yield similar overall accuracy compared with DNNs without compression. The theoretical proof makes the proposed method theoretically rigorous and distinct from prior work.
In the theory of neural networks, the universal approximation property states that a neural network should be able to approximate any continuous or measurable function with arbitrary accuracy provided that a sufficiently large number of parameters is available. This property provides the theoretical guarantee for using neural networks to solve machine learning problems, since machine learning tasks can be formulated as finding a proper approximation of an unknown, high-dimensional function. We have proved the universal approximation property of block-circulant matrix-based neural networks, and more generally, of networks based on arbitrary structured matrices satisfying the low displacement rank γ. As a result, we can guarantee the universal "effectiveness" of the proposed framework on different DNN types and sizes, application domains, and hardware/software platforms. The detailed proof procedure is provided in the supplementary file (pro).
Fig. 3 shows the model compression results on MNIST, SVHN, CIFAR-10, ImageNet, TIMIT (speech recognition) benchmarks, etc., using various DNN models. The accuracy degradations are constrained to be 1% to 2% between the original models and block-circulant matrix-based models. The overall model compression is contributed by both weight parameter reduction and bit quantization. It can be observed that a significant model size compression, and therefore acceleration, can be achieved using the proposed framework.
Accelerating Computation and Facilitating Hardware Implementations
We propose the decoupling technique of FFTs and IFFTs, which applies to both inference and training phases. We take the inference phase of the FC layer as an illustrative example. First, we make the observation that the FFT result of $x_j$, i.e., $\mathrm{FFT}(x_j)$, needs to be utilized to calculate all $a_i$ vectors. A similar observation also holds for $w_{ij}$. Hence, we could pre-calculate $\mathrm{FFT}(x_j)$ and $\mathrm{FFT}(w_{ij})$ and store them in memory for effective re-use. The $\mathrm{FFT}(w_{ij})$ values can even be pre-calculated and stored in memory before the inference phase because they are fixed after training. By performing such pre-calculation of $\mathrm{FFT}(x_j)$, the total number of FFTs needed to calculate Wx reduces from p · q to q (assuming the $\mathrm{FFT}(w_{ij})$'s are calculated and stored in advance), achieving a significant reduction in total computations.
Similarly, each vector $a_i$ to be calculated in Eqn. (1) is given by $\sum_{j=1}^{q} \mathrm{IFFT}(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j))$, which requires q IFFT calculations. Because FFTs and IFFTs are linear operations (Oppenheim), we can calculate the IFFT in the last step, i.e., calculate $a_i$ as $\mathrm{IFFT}(\sum_{j=1}^{q} \mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j))$. In this way the total number of IFFT calculations can be reduced by a factor of q.
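The effect of decoupling can be made concrete by counting transforms. The following NumPy sketch is illustrative only; the function names and the explicit counters are hypothetical, and the weight spectra `Wf` are assumed to be pre-computed in both variants, as in the text.

```python
import numpy as np

def matvec_coupled(Wf, x):
    """One FFT and one IFFT per block (i, j); Wf holds pre-computed FFT(w_ij)."""
    p, q, k = Wf.shape
    xj = x.reshape(q, k)
    counts = {"fft_x": 0, "ifft": 0}
    a = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            Xf = np.fft.fft(xj[j])
            counts["fft_x"] += 1
            a[i] += np.fft.ifft(Wf[i, j] * Xf).real
            counts["ifft"] += 1
    return a.reshape(-1), counts

def matvec_decoupled(Wf, x):
    """FFT each x_j once, accumulate in the frequency domain, IFFT once per a_i."""
    p, q, k = Wf.shape
    Xf = np.fft.fft(x.reshape(q, k), axis=1)    # q FFTs, shared by every i
    acc = (Wf * Xf[None, :, :]).sum(axis=1)     # sum over j before the IFFT
    a = np.fft.ifft(acc, axis=1).real           # p IFFTs in total
    return a.reshape(-1), {"fft_x": q, "ifft": p}

rng = np.random.default_rng(3)
p, q, k = 4, 6, 16
Wf = np.fft.fft(rng.standard_normal((p, q, k)), axis=2)   # fixed after training
x = rng.standard_normal(q * k)

a_coupled, n_coupled = matvec_coupled(Wf, x)
a_decoupled, n_decoupled = matvec_decoupled(Wf, x)
max_err = np.max(np.abs(a_coupled - a_decoupled))
print(n_coupled, n_decoupled, max_err)
```

Both variants return the same vector; only the number of transforms differs, from p·q FFTs and p·q IFFTs down to q FFTs and p IFFTs.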
High Energy Efficiency and Performance Implementation in FPGAs
Based on the algorithmic framework, we describe the developed high-efficiency FPGA-based implementation of DNNs. Since the target is low-power embedded applications, we focus on the inference phase of small to medium-scale DNNs, e.g., for MNIST, SVHN, and CIFAR datasets. We leave large-scale DNNs, e.g., for the ImageNet dataset, for future investigation because they do not target embedded applications. We first describe the proposed FPGA implementations using a set of reconfiguration and performance/efficiency enhancement techniques, then present the algorithm-hardware co-optimization framework.
FPGA Implementations: Reconfigurability, In-Place Computation, Batch Processing, Deep Pipelining, and Resource Re-Use
Reconfigurability, In-Place Computation, and Batch Processing. In order to accommodate different DNN models, sizes, and application scenarios, the proposed FPGA implementation possesses reconfigurability for different layer sizes and layer types (FC or CONV layers). The reconfigurability is achieved because (i) both FC and CONV layers are formulated as the "FFT→component-wise multiplication→IFFT" procedure; (ii) the IFFT can be implemented using the FFT structure with a simple pre-processing step (Salehi, Amirfattahi, and Parhi); and (iii) the FFT structure possesses an inherent recursive property in that small-scale FFTs can be implemented in parallel in larger-scale FFT structures (Oppenheim). More specifically, the first and second properties enable the implementation of a single FFT structure in a time-multiplexed manner for both FFTs and IFFTs and for both FC and CONV layers. For instance, a 128-input FFT structure can be implemented in the FPGA if a block size of 128 is utilized. The third property means that a single FFT structure can be utilized even if we use different block sizes for FC and CONV layers. Finally, in-place computation is utilized such that the same memory space can be used to store the outputs of every layer in the DNN, i.e., the outputs of each neuron layer i will replace the inputs (the outputs of layer i − 1). In this way, the execution of an overall DNN will use the single FFT structure in a sequential, time-multiplexed manner without extra memory requirements. The execution of the inference phase of the whole DNN is shown in Fig. 4. The batch processing technique is utilized, in that a batch of input pictures is processed in an interleaved manner in the FPGA. As shown in Fig. 4, we first compute the first layer of all input pictures in this batch, then the second layer, and so on. Different layers of a neural network will be time-multiplexed on the basic block.
The computations are all based on the implemented FFT structure discussed previously, in a time-multiplexed manner. All operations will be pipelined on the basic computing block. The reason for batch processing is the deep pipelining (to be discussed later) utilized in the hardware implementation. Otherwise, pipeline bubbles would have to be injected when computing all layers for one input picture consecutively, which results in timing overheads. A typical batch consists of around 50-100 pictures, because (i) state-of-the-art FPGAs have more than 2MB of on-chip memory storage (e.g., Intel (Altera) Cyclone V 5CEA9, Xilinx Kintex-7 XC7K325T) and (ii) the intermediate results of small to medium-scale DNNs (e.g., DNNs for CIFAR-10) typically take several KBs per picture.
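Property (ii) above, that an IFFT can be obtained from the FFT structure with a simple pre-processing step, can be illustrated with one standard identity: conjugate the input, apply the forward FFT, conjugate the result, and scale by 1/k. This is offered as a sketch of the general idea and is an assumption about the form of the pre-processing; it is not claimed to be the specific scheme of the cited reference. The function name `ifft_via_fft` is hypothetical.

```python
import numpy as np

def ifft_via_fft(X):
    """Inverse DFT computed with a forward FFT only.

    Uses the identity IFFT(X) = conj(FFT(conj(X))) / k along the last axis.
    """
    k = X.shape[-1]
    return np.conj(np.fft.fft(np.conj(X))) / k

rng = np.random.default_rng(4)
k = 128                                    # block size used as an example in the text
X = rng.standard_normal(k) + 1j * rng.standard_normal(k)

max_err = np.max(np.abs(ifft_via_fft(X) - np.fft.ifft(X)))
print(max_err)
```

In hardware terms, conjugation only flips the sign of the imaginary part, so the same butterfly network can serve both transforms in a time-multiplexed manner.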
Three-Phase Operations, Deep Pipelining, and Resource Re-Use. As described before, the calculation of Wx consists of three phases: calculation of the $\mathrm{FFT}(x_j)$ vectors for each j, calculation of the element-wise multiplications $\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)$ for each i, j (and the corresponding additions), and IFFTs for each i. For example, if W is 1024-by-1024 and the block size is 128, a total of 8 FFTs, 8 IFFTs, and 64 groups of element-wise multiplications will be performed. As shown in Fig. 4, the three-phase operations are integrated with batch processing. More specifically, an outer loop iterates over all layers of the DNN. Within the outer loop are the three calculation phases. Within each phase are the calculations for every i, j in each picture and for all pictures. In this way the timing overheads can be reduced to close to zero. The deep pipelining technique is utilized for FFTs and IFFTs in order to improve throughput and energy efficiency, as illustrated in Fig. 4. For example, if a 128-point FFT is implemented as the basic computing block in the FPGA, it needs 7 pipeline stages plus 4 additional stages corresponding to memory reading and writing. When the IFFT is implemented on such a basic computing block, 2 additional stages are needed, corresponding to the pre-processing and to biasing and ReLU activation. The element-wise multiplications and additions in the second phase are also pipelined.
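The operation counts in the example above follow directly from p and q, as the short sketch below shows. The function name `three_phase_counts` is hypothetical, and equating the FFT pipeline depth with log2(k) assumes a radix-2 FFT with one pipeline stage per butterfly stage, which is consistent with the 7 stages quoted for a 128-point FFT but is not stated explicitly in the text.

```python
import math

def three_phase_counts(m, n, k):
    """Transform and multiplication counts for one m-by-n layer with block size k."""
    p = math.ceil(m / k)     # block rows (output blocks a_i)
    q = math.ceil(n / k)     # block columns (input blocks x_j)
    return {
        "ffts": q,                                   # phase 1: one FFT per x_j
        "elementwise_groups": p * q,                 # phase 2: one group per (i, j)
        "iffts": p,                                  # phase 3: one IFFT per a_i
        "fft_butterfly_stages": int(math.log2(k)),   # radix-2 assumption
    }

counts = three_phase_counts(1024, 1024, 128)
print(counts)
```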
One clear advantage of the FPGA-based hardware implementation is the ability of resource re-use. Besides the effective time multiplexing of FFTs and IFFTs on the same hardware, the hardware multipliers utilized in the second phase can also re-use those in the FFT computing block. This effective resource re-use can be automatically determined in the FPGA synthesis process (qua), which could improve the area and energy efficiency of FPGA implementations.
Algorithm-Hardware Co-Optimizations
Finally, an algorithm-hardware co-optimization framework is developed, which comprises (i) model selection and optimization, (ii) hardware optimization, and (iii) variational inference-based Bayesian learning. The overall objective is to maximize the performance (throughput) and energy efficiency of the FPGA hardware implementation subject to certain accuracy requirements. More specifically, the first aspect determines the proper block size and weight matrix size, in order to facilitate FPGA-based FFT implementations while satisfying the overall accuracy requirement. For state-of-the-art FPGAs, a proper block size ranges from 64 to 256 (preferably a power of 2) for FC layers and may be smaller for CONV layers. The second aspect includes the exploitation of FFTs with real-valued inputs, i.e., the FFT result of a real-valued vector is symmetric (up to complex conjugation) except for the base (first) component (Oppenheim). Because both $x_j$ and $w_{ij}$ are real-valued vectors, we only need to store the first half of the vectors $\mathrm{FFT}(x_j)$ and $\mathrm{FFT}(w_{ij})$, which significantly reduces the storage requirement and the computations required in element-wise multiplications. The last aspect uses the variational inference process of Bayesian learning (Blei, Jordan, and others), which is compatible with the proposed framework and can result in accuracy and robustness enhancements. Bayesian training using variational inference (Blei, Jordan, and others) is an effective training method to enhance the accuracy and robustness of machine learning systems, including neural networks. During the training phase, it assumes that each weight is a variable that follows a certain prior distribution at the beginning. For each training sample, it generates a collection of random weights based on the distribution, and learns both the average and the variance of each weight variable. The inference phase (implemented in hardware) remains the same, using the average estimate of each weight.
Based on our results, Bayesian training is the most effective for small data training and small-to-medium neural networks. The algorithm-hardware co-optimization framework is shown in Fig. 5 . Overall, the proposed FPGA-based implementation can accommodate the whole DNN model using on-chip block memory, thereby significantly improving the overall energy efficiency.
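The real-valued FFT optimization mentioned in the hardware optimization aspect can be illustrated in NumPy, whose `rfft`/`irfft` routines store only the non-redundant half of the spectrum. This is a software sketch of the storage saving under the assumption of an even block size k, not a description of the FPGA data path; note that `rfft` keeps k/2 + 1 complex values, i.e., the first half plus the middle (Nyquist) component.

```python
import numpy as np

rng = np.random.default_rng(5)
k = 128                                   # block size, assumed even
w = rng.standard_normal(k)                # real-valued defining vector of one block
x = rng.standard_normal(k)                # real-valued input block

full = np.fft.fft(x)
# Conjugate symmetry of a real input: X[k - f] == conj(X[f]) for f = 1 .. k-1
sym_err = np.max(np.abs(full[1:] - np.conj(full[1:][::-1])))

# Store and multiply only the non-redundant half of each spectrum.
half_w = np.fft.rfft(w)                   # k // 2 + 1 complex values
half_x = np.fft.rfft(x)
a_half = np.fft.irfft(half_w * half_x, n=k)

# Reference computed with full-length complex spectra.
a_full = np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)).real
max_err = np.max(np.abs(a_half - a_full))
print(half_x.shape, sym_err, max_err)
```

The element-wise multiplication in this sketch runs over 65 complex values instead of 128, which mirrors the reduction in storage and multiplications described above.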
Experimental Results
In this section, we provide the experimental results of FPGA implementations of the proposed framework on small to medium-scale DNNs, using MNIST, SVHN, and CIFAR-10 benchmarks. Our FPGAs for implementation include the low-power Intel (Altera) Cyclone V 5CEA9 and the higher-performance Xilinx Kintex-7 XC7K325T. The former is the default FPGA used in the experiments. We compare the performance (throughput), energy efficiency, and accuracy with the state of the art, including the IBM TrueNorth neurosynaptic processor, emerging device (e.g., memristor crossbar) based neuromorphic systems, analog-based neuromorphic systems, and reference FPGA implementations. IBM TrueNorth (Esser et al., Esser et al.) is a neuromorphic CMOS chip fabricated in 28nm technology, with 4096 cores each simulating 256 programmable silicon neurons in a time-multiplexed manner. It implements the spiking neural network, which is a bio-inspired type of neural network and benefits from the ability of globally asynchronous implementations. It can accommodate the MNIST, SVHN, and CIFAR-10 benchmarks in the experiments.
First, we provide the comparison results on accuracy, performance (throughput, in kilo-frames per second (kFPS)), and energy efficiency (in kFPS/W) on the three benchmarks, as shown in Table 1. The baselines include the IBM TrueNorth processor and reference FPGA implementations of these benchmarks. We provide results of the proposed framework on three DNNs for the MNIST data set with different target accuracies, one for SVHN, and two for the CIFAR-10 data set. The first two DNNs for the MNIST data set are multi-layer perceptron (MLP) models that achieve 92.9% and 95.6% accuracies, respectively. Prior pooling is applied to reduce the input size to 256 and 128, respectively. The third DNN for the MNIST data set is a CNN similar to the LeNet-5 structure (LeCun et al.).
The baseline IBM TrueNorth processor also has different implementations with different accuracy levels for the MNIST data set. For the CIFAR-10 data set, the first DNN is a simple CNN structure, whereas the second is a wide ResNet model (He et al.) that can achieve 94.75% accuracy, only 0.75% lower than the best state-of-the-art software implementation. We can observe that under a similar accuracy level, the speedup and energy efficiency gain compared with IBM TrueNorth are at least 152X and 71X, respectively. Under a similar accuracy level, the energy efficiency gain is at least 31X compared with the reference FPGA-based implementation that achieves the highest energy efficiency (Umuroglu et al.) (using binary neural networks). Besides the reduction in computational complexity, the high suitability of the proposed framework for hardware implementation, and the highly efficient deep pipelined hardware structure, the reasons for such significant gains also include the need for spiking or binary neural networks to increase neuron numbers to achieve the same accuracy as an MLP or CNN, and the inherent long latency of spiking neural networks.
Next, we provide sample comparison results with emerging device and analog-based implementations. Because the neural networks and applications may be different, we use the equivalent performance in giga-operations per second (GOPS) and energy efficiency in GOPS/W for fair comparisons. The term "equivalent" is used because we normalize the number of (multiplication and addition) operations to the original matrix-vector multiplication format. The proposed framework achieves around 5.14 Tera OPS/W (TOPS/W) energy efficiency, which outperforms representative recent results using analog computing and emerging devices. For example, (Shafiee et al., Song et al., Lu et al.) achieve 380.7 GOPS/W, 142.9 GOPS/W, and 1.04 TOPS/W, respectively. The reference results may be based on either manufactured chips or device modeling. Performance-wise, as analyzed in (Bayat et al., Liu et al., Li et al.), a matrix-vector multiplication will take around 100ns and it takes around 1µs to perform one inference sample on the MNIST data set (with 90%-94% accuracy). Our highest achieved performance (throughput) for the MNIST data set, i.e., 11.6ns per image recognition on the Cyclone V FPGA or around 4ns per image on the Kintex-7 FPGA, is difficult to achieve even using emerging devices and technology.
Finally, we provide the comparison results with other FPGA implementations in terms of the equivalent performance (in GOPS) and energy efficiency (in GOPS/W), as shown in Fig. 6. These metrics provide relatively fair comparisons although the DNNs used for implementation may be different. The baseline FPGA implementations include high-level synthesis-based implementations, implementations of compressed models, etc. A minimum energy efficiency gain of more than 84X can be achieved compared with the reference FPGA implementations. Besides the reduced computational complexity and the high-efficiency hardware implementation, another key reason for such a significant energy efficiency gain is that the proposed FPGA-based implementation can accommodate the whole DNN model using on-chip block memory, thereby significantly improving the overall energy efficiency.
Conclusion
This paper presents an algorithm-hardware co-optimization framework to facilitate ultra-high-performance and highly energy-efficient hardware implementations of DNNs on FPGAs. The algorithm part adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. It applies to both FC and CONV layers and comes with a mathematically rigorous proof. The proposed algorithm reduces computational complexity per layer from $O(n^2)$ to $O(n \log n)$ and storage complexity from $O(n^2)$ to $O(n)$, both for training and inference phases. The hardware part consists of highly efficient FPGA-based implementations using effective reconfiguration, batch processing, deep pipelining, resource re-use, and a hierarchical control framework. Experimental results demonstrate that the proposed framework achieves at least 152X speedup in throughput and 71X energy efficiency gain compared with the IBM TrueNorth processor under the same test accuracy. It achieves at least 31X energy efficiency gain compared with the reference FPGA-based work.
