Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of a rigorous guarantee of compression ratio and inference accuracy.
INTRODUCTION
From the end of the first decade of the 21st century, neural networks have been experiencing a phenomenal resurgence thanks to big data and significant advances in processing speed. Large-scale deep neural networks (DNNs) have delivered impressive results on many challenging problems. For instance, DNNs have led to breakthroughs in object recognition accuracy on the ImageNet dataset [1], even achieving human-level performance for face recognition [2]. Such promising results have triggered a revolution in several traditional and emerging real-world applications, such as self-driving systems [3], automatic machine translation [4], and drug discovery and toxicology [5]. As a result, both academia and industry show rising interest, with significant resources devoted to the investigation, improvement, and promotion of deep learning methods and systems.
One of the key enablers of the unprecedented success of deep learning is the availability of very large models. Modern DNNs typically consist of multiple cascaded layers, with at least millions to hundreds of millions of parameters (i.e., weights) for the entire model [6] [7] [8] [9]. Larger-scale neural networks tend to enable the extraction of more complex high-level features and therefore lead to a significant improvement in overall accuracy [10] [11] [12]. On the other hand, the layered deep structure and large model sizes also demand increasing computational capability and memory. To achieve higher scalability, performance, and energy efficiency for deep learning systems, two orthogonal research and development trends have both attracted enormous interest.
The first trend is the hardware acceleration of DNNs, which has been extensively investigated in both industry and academia. As a representative technique, FPGA-based accelerators offer good programmability, a high degree of parallelism, and a short development cycle. FPGAs have been used to accelerate the original DNNs [13] [14] [15] [16] [17], binary neural networks [18, 19], and more recently, DNNs with model compression [20]. Alternatively, ASIC-based implementations have recently been explored to overcome the limitations of general-purpose computing approaches. A number of major high-tech companies, such as Intel and Google, have announced ASIC chip designs for DNN inference [21, 22]. In academia, three representative works at the architectural level are Eyeriss [23], EIE [24], and the DianNao family [25] [26] [27], which focus specifically on the convolutional layers, the fully-connected layers, and the memory design/organization, respectively. There are a number of recent tapeouts of hardware deep learning systems [23, 28, 29, 30, 31, 32, 33].
These prior works mainly focus on the inference phase of DNNs and usually suffer from frequent accesses to off-chip DRAM (e.g., when large-scale DNNs are used for the ImageNet dataset). This is because the limited on-chip SRAM can hardly accommodate large model sizes. Unfortunately, off-chip DRAM accesses consume significant energy. Recent studies [34, 35] show that the per-bit access energy of off-chip DRAM is 200× that of on-chip SRAM. Therefore, DRAM access can easily dominate the whole-system power consumption.
The energy efficiency challenge of large models motivates the second trend: model compression. Several algorithm-level techniques have been proposed to compress models and accelerate DNNs, including weight quantization [36, 37], connection pruning [34, 35], and low rank approximation [38, 39]. These approaches can offer a reasonable parameter reduction (e.g., by 9× to 13× in [34, 35]) with minor accuracy degradation. However, they suffer from three drawbacks: 1) sparsity regularization and pruning typically result in an irregular network structure, thereby undermining the compression ratio and limiting performance and throughput [40]; 2) the training complexity is increased due to the additional pruning process [34, 35] or low rank approximation step [38, 39], etc.; 3) the compression ratios are heuristic, depend on the network, and cannot be precisely controlled.
We believe that an ideal model compression technique should: i) maintain a regular network structure; ii) reduce the complexity of both inference and training; and, most importantly, iii) retain a rigorous mathematical foundation on compression ratio and accuracy.
As an effort to achieve these three goals, we propose CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices [41]. The concept of the block-circulant matrix compared to the ordinary unstructured matrix is shown in Fig. 1. In a square circulant matrix, each row (or column) vector is a circulant reformat of the other row (column) vectors. A non-square matrix can be represented by a set of square circulant submatrices (blocks). Therefore, by representing each circulant block with a single vector, the first benefit of CirCNN is storage size reduction. In Fig. 1, the unstructured 6 × 3 weight matrix (on the left) holds 18 parameters. If we can represent the weights using two 3 × 3 circulant matrices (on the right), we only need to store 6 parameters, which directly gives a 3× model size reduction. Intuitively, the reduction ratio is determined by the block size of the circulant submatrices: a larger block size leads to a higher compression ratio. In general, the storage complexity is reduced from $O(n^2)$ to $O(n)$.
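The storage argument above can be checked with a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the helper names `circulant` and `block_circulant` and the numeric values are made up for illustration, and the indexing convention W[r, c] = w[(r - c) mod k] (the defining vector is the first column) is an assumption chosen here.

```python
import numpy as np

def circulant(w):
    """Dense k x k circulant matrix with W[r, c] = w[(r - c) mod k]."""
    k = len(w)
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return np.asarray(w)[idx]

def block_circulant(vectors):
    """Assemble a (p*k) x (q*k) matrix from a p x q grid of defining vectors."""
    return np.block([[circulant(w) for w in row] for row in vectors])

# Two 3 x 3 circulant blocks stacked vertically give a 6 x 3 weight matrix.
vectors = [[np.array([1.0, 2.0, 3.0])],
           [np.array([4.0, 5.0, 6.0])]]
W = block_circulant(vectors)

dense_params = W.size                                         # entries of the dense matrix
stored_params = sum(len(w) for row in vectors for w in row)   # entries actually stored
print(W.shape, dense_params, stored_params, dense_params // stored_params)
```

With block size k, each k × k block stores k values instead of k², so the parameter count drops by a factor of k.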
The second benefit of CirCNN is computational complexity reduction. We explain the insights using a fully-connected layer of a DNN, which can be represented as $y = \psi(Wx + \theta)$, where vectors $x$ and $y$ represent the outputs of all neurons in the previous layer and the current layer, respectively; $W$ is the m-by-n weight matrix; $\theta$ is the bias vector; and $\psi(\cdot)$ is the activation function. When $W$ is a block-circulant matrix, the Fast Fourier Transform (FFT)-based fast multiplication method can be utilized, and the computational complexity is reduced from $O(n^2)$ to $O(n \log n)$.
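As a quick illustration of the FFT-based fast multiplication, the sketch below compares a dense circulant matrix-vector product with its FFT form for a single block. The function name `circulant_matvec_fft` is hypothetical, and the convention W[r, c] = w[(r - c) mod n] is an assumption made here so that the product is a plain circular convolution.

```python
import numpy as np

def circulant_matvec_fft(w, x):
    """Compute W @ x for the circulant W[r, c] = w[(r - c) mod n] via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(w) * np.fft.fft(x)))

rng = np.random.default_rng(0)
n = 8
w, x = rng.standard_normal(n), rng.standard_normal(n)

# Dense reference: builds all n*n entries and costs O(n^2) multiply-accumulates.
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
y_dense = w[idx] @ x

# FFT path: three length-n transforms plus n element-wise products.
y_fft = circulant_matvec_fft(w, x)
max_abs_diff = float(np.max(np.abs(y_dense - y_fft)))
```

The two results agree up to floating-point rounding, while the FFT path never materializes the n × n matrix.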
It is important to understand that CirCNN incurs no conversion between unstructured weight matrices and block-circulant matrices. Instead, we assume that the layers can be represented by block-circulant matrices, and the training generates a vector for each circulant submatrix. The fundamental difference is that current approaches apply various compression techniques (e.g., pruning) to the unstructured weight matrices and then retrain the network, whereas CirCNN directly trains the network assuming a block-circulant structure. This leads to two advantages. First, prior work can only reduce the model size by a heuristic factor that depends on the network, while CirCNN provides an adjustable but fixed reduction ratio. Second, with the same FFT-based fast multiplication, the computational complexity of training is also reduced from $O(n^2)$ to $O(n \log n)$. In contrast, prior work does not reduce, and may even increase, training complexity.
Due to the storage and computational complexity reduction, CirCNN is clearly attractive. The only question is: can a network really be represented by block-circulant matrices with no (or negligible) accuracy loss? This question is natural because, with far fewer weights in the vectors, the network may not be able to approximate the function of a network with unstructured weight matrices.
Fortunately, the answer to the question is YES. CirCNN is mathematically rigorous: we have developed a theoretical foundation and formal proof showing that DNNs represented by block-circulant matrices can converge to the same "effectiveness" as DNNs without compression, fundamentally distinguishing our method from prior art. The outline of the proof is discussed in Section 3.3 and the details are provided in technical reports [42, 43].
Based on block-circulant matrix-based algorithms, we propose the CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). Applying CirCNN to neural network accelerators enables notable architectural innovations. 1) Due to its recursive property and its intrinsic role in CirCNN, FFT is implemented as the basic computing block. It ensures universal and small-footprint implementations.
2) Pipelining and parallelism optimizations. Taking advantage of the compressed but regular network structures, we aggressively apply inter-level and intra-level pipelining in the basic computing block. Moreover, we can conduct joint optimizations considering parallelization degree, performance and power consumption.
3) Platform-specific optimizations focusing on weight storage and memory management.
To demonstrate the performance and energy efficiency, we test the CirCNN architecture on three platforms: FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6-102× energy efficiency improvements compared with the best state-of-the-art results.
BACKGROUND AND MOTIVATION
2.1 Deep Neural Networks
Deep learning systems can be constructed using different types of architectures, including deep convolutional neural networks (DCNNs), deep belief networks (DBNs), and recurrent neural networks (RNNs). Despite the differences in network structures and target applications, they share the same construction principle: multiple functional layers are cascaded together to extract features at multiple levels of abstraction [44] [45] [46]. Fig. 2 illustrates the multi-layer structure of an example DCNN, which consists of a stack of fully-connected layers, convolutional layers, and pooling layers. These three types of layers are fundamental in deep learning systems. The fully-connected (FC) layer is the most storage-intensive layer in DNN architectures [14, 15] since its neurons are fully connected with neurons in the previous layer. The computation procedure of an FC layer consists of matrix-vector arithmetic (multiplications and additions) and transformation by the activation function, as follows:

$$y = \psi(Wx + \theta), \tag{1}$$

where $W \in \mathbb{R}^{m \times n}$ is the weight matrix of the synapses between this FC layer (with $m$ neurons) and its previous layer (with $n$ neurons); $\theta \in \mathbb{R}^{m}$ is the bias vector; and $\psi(\cdot)$ is the activation function. The Rectified Linear Unit (ReLU) $\psi(x) = \max(0, x)$ is the most widely used activation function in DNNs.
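For concreteness, the FC-layer computation can be written in a few lines of NumPy. This is only an illustrative sketch: the names `fc_forward`, `psi` (activation) and `theta` (bias) are labels chosen here, and the numbers are made up.

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z) applied element-wise."""
    return np.maximum(0.0, z)

def fc_forward(W, x, theta):
    """y = psi(W x + theta) with psi = ReLU."""
    return relu(W @ x + theta)

W = np.array([[1.0, -2.0, 0.5],
              [0.0, 1.0, -1.0]])      # m = 2 neurons, n = 3 inputs
x = np.array([2.0, 1.0, 4.0])
theta = np.array([0.5, -1.0])
y = fc_forward(W, x, theta)           # pre-activations are [2.5, -4.0]
```

The dense product `W @ x` costs m·n multiply-accumulates and needs all m·n weights, which is why the FC layer dominates storage.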
The convolutional (CONV) layer, as the name implies, performs a two-dimensional convolution to extract features from its inputs that will be fed into subsequent layers for extracting higher-level features. A CONV layer is associated with a set of learnable filters (or kernels) [47], which are activated when specific types of features are found at some spatial positions in the inputs. A filter-sized moving window is applied to the inputs to obtain a set of feature maps, calculating the convolution of the filter and the inputs in the moving window. Each convolutional neuron, representing one pixel in a feature map, takes a set of inputs and the corresponding filter weights to calculate the inner product. Given input feature map $X$ and the $r \times r$-sized filter (i.e., the convolutional kernel) $F$, the output feature map $Y$ is calculated as

$$y_{a,b} = \sum_{i=1}^{r} \sum_{j=1}^{r} x_{a+i-1,\,b+j-1}\, f_{i,j}, \tag{2}$$

where $y_{a,b}$, $x_{a+i-1,\,b+j-1}$, and $f_{i,j}$ are elements in $Y$, $X$, and $F$, respectively. Multiple convolutional kernels can be adopted to extract different features in the same input feature map. Multiple input feature maps can be convolved with the same filter and the results summed up to derive a single feature map.
The pooling (POOL) layer performs a subsampling operation on the extracted features to reduce the data dimensions and mitigate overfitting. The subsampling operation on the inputs of the pooling layer can be realized by various non-linear operations, such as max, average or L2-norm calculation. Among them, max pooling is the dominant pooling strategy in state-of-the-art DCNNs due to its higher overall accuracy and convergence speed [20, 23].
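The CONV and POOL computations described above can be sketched directly. This is an illustrative example only, using 0-based indices, a single input map and a single filter; the function names and the stride-equals-window pooling are assumptions made here.

```python
import numpy as np

def conv2d_valid(X, F):
    """y[a, b] = sum_{i,j} x[a+i, b+j] * f[i, j] over an r x r moving window."""
    r = F.shape[0]
    H, W = X.shape
    Y = np.empty((H - r + 1, W - r + 1))
    for a in range(Y.shape[0]):
        for b in range(Y.shape[1]):
            Y[a, b] = np.sum(X[a:a + r, b:b + r] * F)
    return Y

def max_pool(Y, s):
    """Non-overlapping s x s max pooling; trailing rows/columns are dropped."""
    h, w = Y.shape[0] // s, Y.shape[1] // s
    return Y[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

X = np.arange(25, dtype=float).reshape(5, 5)   # toy 5 x 5 input map
F = np.ones((2, 2))                            # toy 2 x 2 filter
Y = conv2d_valid(X, F)                         # 4 x 4 feature map
P = max_pool(Y, 2)                             # 2 x 2 map after pooling
```

Each output pixel is one inner product of the filter with a window of the input; pooling then keeps only the largest value in each 2 × 2 tile.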
Among these three types of layers, the majority of computation occurs in the CONV and FC layers, while the POOL layer has a relatively lower computational complexity of $O(n)$. The storage requirement of DNNs is due to the weight matrices $W$ in the FC layers and the convolutional kernels $F$ in the CONV layers. As a result, the FC and CONV layers are the major research focus for energy-efficient implementation and weight reduction of DNNs.
2.2 DNN Weight Storage Reduction and Acceleration
Mathematical investigations have demonstrated significant sparsity and margin for weight reduction in DNNs, and a number of prior works leverage this property to reduce weight storage. The techniques can be classified into two categories. 1) Systematic methods [48] [49] [50] such as Singular Value Decomposition (SVD). Despite being systematic, these methods typically exhibit a relatively high degradation in overall accuracy (by 5%-10% at 10× compression). 2) Heuristic pruning methods [34, 35, 51], which use heuristic weight pruning together with weight quantization. These methods can achieve a better parameter reduction, i.e., 9×-13× [34, 35], with a very small accuracy degradation. However, the network structure and weight storage after pruning become highly irregular (cf. Fig. 3), and therefore indexing is always needed, which undermines the compression ratio and, more importantly, the performance improvement.
Besides the pros and cons of the two approaches, the prior works share the following common limitations: 1) mainly focusing on weight reduction rather than computational complexity reduction; 2) only reducing the model size by a heuristic factor instead of reducing the Big-O complexity; and 3) performing weight pruning or applying matrix transformations based on a trained DNN model, thereby adding complexity to the training process. The third item is crucial because it may limit the scalability of future larger-scale deep learning systems.
2.3 FFT-Based Methods
LeCun et al. have proposed using FFTs to accelerate the computations in the CONV layers, an approach that applies only to a single filter in the CONV layer [52]. It uses FFT to calculate the traditional inner products of filters and input feature maps, and can achieve speedup for large filter sizes (which are less common in state-of-the-art DCNNs [53]). The underlying neural network structure and parameters remain unchanged. The speedup is due to filter reuse, and the method achieves neither asymptotic speedup in big-O notation nor weight compression (in fact, additional storage space is needed).
The work most closely related to CirCNN is [54]. It proposed using a circulant matrix in the inference and training algorithms. However, it has a number of limitations. First, it applies only to FC layers, not CONV layers, which limits the potential gain in weight reduction and performance. Second, it uses a single circulant matrix to represent the weights in the whole FC layer. Since the numbers of input and output neurons are usually not the same, this method wastes storage on the padded zeros (needed to make the circulant matrix square).
2.4 Novelty of CirCNN
Compared with LeCun et al. [52], CirCNN is fundamentally different as it achieves asymptotic speedup in big-O notation and weight compression simultaneously. Compared with [54], CirCNN generalizes in three significant and novel aspects.
Supporting both FC and CONV layers. Unlike FC layers, the matrices in CONV layers are small filters (e.g., 3 × 3). Instead of representing each filter as a circulant matrix, CirCNN exploits the inter-filter sparsity among different filters. In other words, CirCNN represents a matrix of filters, where input and output channels are the two dimensions, by a vector of filters. The support for CONV layers allows CirCNN to be applied to the whole network.
Block-circulant matrices. To mitigate the inefficiency due to the single large circulant matrix used in [54], CirCNN uses block-circulant matrices for weight representation. The benefits are two-fold. First, it avoids the wasted storage/computation due to zero padding when the numbers of inputs and outputs are not equal. Second, it allows us to derive a fine-grained tradeoff between accuracy and compression/acceleration. Specifically, a larger block size gives a better compression ratio but may lead to more accuracy degradation, while a smaller block size provides better accuracy but less compression. There is no compression if the block size is 1.
Mathematical rigor. Importantly, we perform theoretical analysis to prove that the "effectiveness" of block-circulant matrix-based DNNs will (asymptotically) approach that of the original networks without compression. The theoretical proof also distinguishes the proposed method from prior work. The outline of the proof is discussed in Section 3.3 and the details are provided in reports [42, 43].
Fig. 4 illustrates the difference between the baseline [54] and CirCNN. The baseline method (a) formulates a large, square circulant matrix by zero padding for FC layer weight representation when the numbers of inputs and outputs are not equal. In contrast, CirCNN (b) uses the block-circulant matrix to avoid storage waste and achieve a fine-grained tradeoff between accuracy and compression/acceleration. Overall, with the novel techniques of CirCNN, it is possible at the algorithm level to achieve simultaneous and significant reduction of both computational and storage complexity, for both inference and training.
CIRCNN: ALGORITHMS AND FOUNDATION
3.1 FC Layer Algorithm
The key idea of block-circulant matrix-based FC layers is to partition the original arbitrary-size weight matrix $W \in \mathbb{R}^{m \times n}$ into 2D blocks of square sub-matrices, where each sub-matrix is a circulant matrix.
The insights are shown in Fig. 5. Let $k$ denote the block size (size of each sub-matrix) and assume there are $p \times q$ blocks after partitioning $W$, where $p = m \div k$ and $q = n \div k$.
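Then, the forward propagation process in the inference phase is given by (with bias and ReLU omitted):

$$a = Wx = \begin{bmatrix} \sum_{j=1}^{q} W_{1j} x_j \\ \sum_{j=1}^{q} W_{2j} x_j \\ \vdots \\ \sum_{j=1}^{q} W_{pj} x_j \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}, \tag{3}$$

where $a_i \in \mathbb{R}^{k}$ is a column vector. Assume each circulant matrix $W_{ij}$ is defined by a vector $w_{ij}$, i.e., $w_{ij}$ is the first row vector of $W_{ij}$. Then according to the circulant convolution theorem [41, 55], the calculation of $W_{ij} x_j$ can be performed as $\mathrm{IFFT}\big(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\big)$, where $\circ$ denotes element-wise multiplication. The operation procedure is shown on the right of Fig. 5.
For the inference phase, the computational complexity of this FC layer is $O(pqk \log k)$, which is equivalent to $O(n \log n)$ for small $p$, $q$ values. Similarly, the storage complexity is $O(pqk)$ because we only need to store $w_{ij}$ or $\mathrm{FFT}(w_{ij})$ for each sub-matrix, which is equivalent to $O(n)$ for small $p$, $q$ values. Therefore, simultaneous acceleration and model compression compared with the original DNN can be achieved for the inference process. Algorithm 1 illustrates the calculation of $Wx$ in the inference process in the FC layer of CirCNN.
Next, we consider the backward propagation process in the training phase. Let $a_{il}$ be the $l$-th output element in $a_i$, and let $L$ denote the loss function. Then by using the chain rule we can derive the backward propagation process as follows:

$$\frac{\partial L}{\partial w_{ij}} = \sum_{l=1}^{k} \frac{\partial L}{\partial a_{il}} \frac{\partial a_{il}}{\partial w_{ij}} = \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial w_{ij}}, \tag{4}$$

$$\frac{\partial L}{\partial x_j} = \sum_{i=1}^{p} \sum_{l=1}^{k} \frac{\partial L}{\partial a_{il}} \frac{\partial a_{il}}{\partial x_j} = \sum_{i=1}^{p} \frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial x_j}. \tag{5}$$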
Algorithm 1: Forward propagation process in the FC layer of CirCNN
Input: $w_{ij}$'s, $x$, $p$, $q$, $k$
Output: $a$
Initialize $a$ with zeros.
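The forward pass can be sketched in NumPy as follows. This is a hedged illustration rather than the paper's listing: the function names are hypothetical, and the sketch assumes the convention W_ij[r, c] = w_ij[(r - c) mod k] (defining vector as first column), under which the product is a plain circular convolution. With a first-row convention, the same identity holds after reversing the index order of the defining vector (equivalently, conjugating its FFT for real weights).

```python
import numpy as np

def block_circulant_forward(w, x, k):
    """a = W x for a block-circulant W given by vectors w[i, j] (each length k).

    w has shape (p, q, k); x has length q * k; the result has length p * k.
    """
    p, q, _ = w.shape
    a = np.zeros((p, k))                            # initialize a with zeros
    x_f = np.fft.fft(x.reshape(q, k), axis=1)       # FFT(x_j), reused for every i
    for i in range(p):
        for j in range(q):
            # FFT -> element-wise multiplication -> IFFT for block (i, j)
            a[i] += np.real(np.fft.ifft(np.fft.fft(w[i, j]) * x_f[j]))
    return a.reshape(p * k)

def dense_from_vectors(w):
    """Dense reference with blocks W_ij[r, c] = w[i, j, (r - c) mod k]."""
    p, q, k = w.shape
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return np.block([[w[i, j][idx] for j in range(q)] for i in range(p)])

rng = np.random.default_rng(1)
p, q, k = 2, 3, 4
w = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)
a_fast = block_circulant_forward(w, x, k)
err = float(np.max(np.abs(a_fast - dense_from_vectors(w) @ x)))
```

In an inference engine the transforms FFT(w_ij) would typically be precomputed and stored, since the weights do not change between inputs.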
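Algorithm 2: Backward propagation process in the FC layer of CirCNN
Input: $\frac{\partial L}{\partial a}$, $w_{ij}$'s, $x$, $p$, $q$, $k$
Output: $\frac{\partial L}{\partial w_{ij}}$'s, $\frac{\partial L}{\partial x}$
Initialize $\frac{\partial L}{\partial w_{ij}}$'s and $\frac{\partial L}{\partial x}$ with zeros.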
We have proved that $\frac{\partial a_i}{\partial w_{ij}}$ and $\frac{\partial a_i}{\partial x_j}$ are block-circulant matrices.
Therefore, $\frac{\partial L}{\partial w_{ij}}$ and $\frac{\partial L}{\partial a_i} \frac{\partial a_i}{\partial x_j}$ can be calculated using the "FFT → element-wise multiplication → IFFT" procedure, which is equivalent to $O(n \log n)$ computational complexity per layer. Algorithm 2 illustrates the backward propagation process in the FC layer of CirCNN.
In CirCNN, inference and training constitute an integrated framework in which the reduction of computational complexity is gained for both. We directly train the vectors $w_{ij}$, corresponding to the circulant sub-matrices $W_{ij}$, in each layer using Algorithm 2. Clearly, the network after such a training procedure naturally follows the block-circulant matrix structure. This is a key advantage of CirCNN compared with prior works, which require additional steps on a trained neural network.
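The backward pass admits the same FFT treatment. The sketch below is an assumption-laden illustration, not the paper's Algorithm 2: it uses the convention W_ij[r, c] = w_ij[(r - c) mod k], under which both gradients are circular cross-correlations and are obtained by multiplying with a complex-conjugated spectrum. The function name and test sizes are hypothetical.

```python
import numpy as np

def block_circulant_backward(grad_a, w, x, k):
    """Return (dL/dw, dL/dx) for a = W x with W_ij[r, c] = w[i, j, (r - c) mod k]."""
    p, q, _ = w.shape
    g_f = np.fft.fft(grad_a.reshape(p, k), axis=1)
    x_f = np.fft.fft(x.reshape(q, k), axis=1)
    w_f = np.fft.fft(w, axis=2)
    grad_w = np.zeros_like(w)                 # initialize gradients with zeros
    grad_x = np.zeros((q, k))
    for i in range(p):
        for j in range(q):
            # Circular cross-correlation: multiply by the conjugate spectrum.
            grad_w[i, j] = np.real(np.fft.ifft(g_f[i] * np.conj(x_f[j])))
            grad_x[j] += np.real(np.fft.ifft(g_f[i] * np.conj(w_f[i, j])))
    return grad_w, grad_x.reshape(q * k)

# Check against explicit dense formulas on a small random instance.
rng = np.random.default_rng(2)
p, q, k = 2, 3, 4
w = rng.standard_normal((p, q, k))
x = rng.standard_normal(q * k)
g = rng.standard_normal(p * k)                # stands in for dL/da
grad_w, grad_x = block_circulant_backward(g, w, x, k)

idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
W = np.block([[w[i, j][idx] for j in range(q)] for i in range(p)])
ref_x = W.T @ g                               # dL/dx for a linear layer
ref_w = np.zeros_like(w)
for i in range(p):
    for j in range(q):
        for n in range(k):
            ref_w[i, j, n] = sum(g[i * k + l] * x[j * k + (l - n) % k] for l in range(k))
err_x = float(np.max(np.abs(grad_x - ref_x)))
err_w = float(np.max(np.abs(grad_w - ref_w)))
```

Because only the vectors w_ij are updated, the trained layer keeps its block-circulant structure without any projection step.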
3.2 CONV Layer Algorithm
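In practical DNN models, the CONV layers are often associated with multiple input and multiple output feature maps. As a result, the computation in the CONV layer can be expressed in the format of tensor computations as below:

$$\mathcal{Y}(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} \mathcal{F}(i, j, c, p)\, \mathcal{X}(x+i-1,\, y+j-1,\, c), \tag{6}$$

where $\mathcal{X} \in \mathbb{R}^{W \times H \times C}$, $\mathcal{Y} \in \mathbb{R}^{(W-r+1) \times (H-r+1) \times P}$, and $\mathcal{F} \in \mathbb{R}^{r \times r \times C \times P}$ represent the input, output, and weight "tensors" of the CONV layer, respectively. Here, $W$ and $H$ are the spatial dimensions of the input maps, $C$ is the number of input maps, $r$ is the size of the convolutional kernel, and $P$ is the number of output maps.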
We generalize the concept of "block-circulant structure" to the rank-4 tensor ($\mathcal{F}$) in the CONV layer, i.e., all the slices of the form $\mathcal{F}(\cdot, \cdot, i, j)$ are circulant matrices. Next, we reformulate the inference and training algorithms of the CONV layer as matrix operations. We use the inference process as an example; the training process can be formulated in a similar way.
Software tools such as Caffe provide an efficient methodology for transforming tensor-based operations in the CONV layer to matrix-based operations [56, 57], in order to enhance implementation efficiency (GPUs are optimized for matrix operations). Fig. 6 illustrates the application of this method to reformulate Eqn. (6) as the matrix multiplication $Y = XF$, where $X \in \mathbb{R}^{(W-r+1)(H-r+1) \times Cr^2}$, $Y \in \mathbb{R}^{(W-r+1)(H-r+1) \times P}$, and $F \in \mathbb{R}^{Cr^2 \times P}$. Recall that the slice $\mathcal{F}(\cdot, \cdot, i, j)$ is a circulant matrix. Then according to the reshaping principle between $\mathcal{F}$ and $F$, we have:
which means $F$ is actually a block-circulant matrix. Hence the fast multiplication approach for block-circulant matrices, i.e., the "FFT → component-wise multiplication → IFFT" procedure, can now be applied to accelerate $Y = XF$, thereby accelerating (6). With the proposed approach, the computational complexity for (6) is reduced from $O(WHr^2CP)$ to $O(WHQ \log Q)$, where $Q = \max(r^2C, P)$.
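The reshaping step from tensor convolution to a matrix product can be illustrated with a small im2col sketch. It demonstrates only the reformulation Y = XF, not the block-circulant speedup; the function names, the 0-based indexing and the (i, j, c) ordering of the rows of the reshaped filter matrix are assumptions made here and may differ from the ordering used in Fig. 6.

```python
import numpy as np

def conv_tensor(X, F):
    """Direct evaluation: Y[x, y, p] = sum_{i,j,c} F[i, j, c, p] * X[x+i, y+j, c]."""
    W, H, C = X.shape
    r, _, _, P = F.shape
    Y = np.zeros((W - r + 1, H - r + 1, P))
    for i in range(r):
        for j in range(r):
            # Contract the channel axis for one kernel offset (i, j).
            Y += X[i:i + W - r + 1, j:j + H - r + 1, :] @ F[i, j]
    return Y

def im2col(X, r):
    """Rows are output positions; columns are the r*r*C inputs each one sees."""
    W, H, C = X.shape
    rows = [X[a:a + r, b:b + r, :].reshape(-1)
            for a in range(W - r + 1) for b in range(H - r + 1)]
    return np.stack(rows)

rng = np.random.default_rng(3)
W, H, C, r, P = 6, 5, 3, 2, 4
X = rng.standard_normal((W, H, C))
F = rng.standard_normal((r, r, C, P))

X_mat = im2col(X, r)                 # shape ((W-r+1)(H-r+1), r*r*C)
F_mat = F.reshape(r * r * C, P)      # same (i, j, c) ordering as the im2col columns
Y_mat = X_mat @ F_mat                # one matrix product replaces the tensor sum
err = float(np.max(np.abs(Y_mat.reshape(W - r + 1, H - r + 1, P) - conv_tensor(X, F))))
```

Once the layer is a single matrix product, imposing block-circulant structure on the reshaped filter matrix lets the FC-layer FFT procedure be reused for CONV layers.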
3.3 Outline of Theoretical Proof
With the substantial reduction of weight storage and computational complexity, we attempt to prove that the proposed block-circulant matrix-based framework consistently yields similar overall accuracy compared with DNNs without compression. Testing only on existing benchmarks is insufficient given the rapid emergence of new application domains, DNN models, and datasets. The theoretical proof makes the proposed method theoretically rigorous and distinct from prior work.
In the theory of neural networks, "effectiveness" is defined using the universal approximation property, which states that a neural network should be able to approximate any continuous or measurable function with arbitrary accuracy provided that a sufficiently large number of parameters is available. This property provides the theoretical guarantee for using neural networks to solve machine learning problems, since machine learning tasks can be formulated as finding a proper approximation of an unknown, high-dimensional function. Therefore, the goal is to prove the universal approximation property of block-circulant matrix-based neural networks and, more generally, of networks based on arbitrary structured matrices with low displacement rank. The detailed proofs for the block-circulant matrix-based networks and general structured matrix-based ones are provided in the technical reports [42, 43].
The proof of the universal approximation property for block-circulant matrix-based neural networks is briefly outlined as follows. Our objective is to prove that any continuous or measurable function can be approximated with arbitrary accuracy using a block-circulant matrix-based network. Equivalently, we aim to prove that the function space achieved by block-circulant matrix-based neural networks is dense in the space of continuous or measurable functions with the same inputs. An important property of the activation function, i.e., the component-wise discriminatory property, is proved first. Based on this property, the above objective is proved by contradiction using the Hahn-Banach Theorem [58].
We have further derived an approximation error bound of $O(1/n)$ when the number of neurons in the layer, $n$, is limited, with details shown in [43]. It implies that the approximation error decreases with increasing $n$, i.e., an increasing number of neurons/inputs in the network. As a result, we can guarantee the universal "effectiveness" of the proposed framework across different DNN types and sizes, application domains, and hardware/software platforms.
3.4 Compression Ratio and Test Accuracy
In this section, we apply CirCNN to different DNN models in software and investigate the weight compression ratio and accuracy. Fig. 7 (a) and (b) show the weight storage (model size) reduction in the FC layer and the test accuracy on various image recognition datasets and DCNN models: MNIST (LeNet-5), CIFAR-10, SVHN, STL-10, and ImageNet (using the AlexNet structure) [6, 59, 60, 61, 62]. Here, 16-bit weight quantization is adopted for model size reduction. The baselines are the original DCNN models with unstructured weight matrices using 32-bit floating point representations. We see that block-circulant weight matrices enable 400×-4000+× reduction in weight storage (model size) in the corresponding FC layers. This parameter reduction in FC layers is also observed in [54]. The entire DCNN model size (excluding the softmax layer) is reduced by 30-50× when applying block-circulant matrices only to the FC layer (and quantization to the overall network). Regarding accuracy, the loss is negligible, and sometimes the compressed models even outperform the baseline models.
Fig. 7 (c) illustrates the further application of block-circulant weight matrices to the CONV layers on the MNIST (LeNet-5), SVHN, CIFAR-10, and ImageNet (AlexNet structure) datasets, when the accuracy degradation is constrained to 1-2% by optimizing the block size. Again, 16-bit weight quantization is adopted, and the softmax layer is excluded. The 16-bit quantization also contributes a 2× reduction in model size. In comparison, the reductions in the number of parameters in [34, 35] are 12× for LeNet-5 (on the MNIST dataset) and 9× for AlexNet. Moreover, another crucial property of CirCNN is that the parameter storage after compression is regular, whereas [34, 35] result in irregular weight storage patterns. The irregularity requires an additional index per weight and significantly impacts the available degree of parallelism.
From the results, we clearly see the significant benefit and potential of CirCNN: it can produce highly compressed models with regular structure. CirCNN yields greater reductions in parameters compared with the state-of-the-art results for LeNet-5 and AlexNet. In fact, the actual gain could be even higher due to the indexing requirements of [34, 35].
We have also performed testing on other DNN models such as DBNs, and found that CirCNN can achieve a similar or even higher compression ratio, demonstrating the wide applicability of block-circulant matrices. Moreover, a 5× to 9× acceleration in training can be observed for DBNs, which is less pronounced than the model reduction ratio. This is because GPUs are less optimized for FFT operations than for matrix-vector multiplications.
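The way block size and quantization combine into a storage reduction factor can be made explicit with a short calculation. This is a sketch of the arithmetic only: the layer dimensions and block size below are hypothetical values chosen for illustration and are not the configurations behind Fig. 7.

```python
def fc_storage_reduction(m, n, k, dense_bits=32, compressed_bits=16):
    """Storage ratio of a dense m x n layer to its block-circulant form.

    Assumes k divides both m and n, so there are (m/k)*(n/k) blocks of k values.
    """
    dense = m * n * dense_bits
    compressed = (m // k) * (n // k) * k * compressed_bits
    return dense / compressed

# Hypothetical layer and block sizes, chosen only to show the arithmetic.
ratio_params_only = fc_storage_reduction(1024, 1024, 128, 32, 32)   # block structure alone
ratio_with_quant = fc_storage_reduction(1024, 1024, 128)            # plus 32-bit -> 16-bit
```

The block structure contributes a factor of k and the move from 32-bit to 16-bit weights contributes a further factor of 2, so the two effects multiply.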
CIRCNN ARCHITECTURE
Based on block-circulant matrix-based algorithms, we propose the CirCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.).
Applying CirCNN to neural network accelerators enables notable architectural innovations. 1) Due to its recursive property and its intrinsic role in CirCNN, FFT is implemented as the basic computing block (Section 4.1). It ensures universal and small-footprint implementations. 2) Pipelining and parallelism optimizations (Section 4.3). Taking advantage of the compressed but regular network structures, we aggressively apply inter-level and intra-level pipelining in the basic computing block. Moreover, we can conduct joint optimizations considering parallelization degree, performance and power consumption. 3) Platform-specific optimizations focusing on weight storage and memory management (Section 4.4).
4.1 Recursive Property of FFT: the Key to Universal and Small Footprint Design
In CirCNN, the "FFT → component-wise multiplication → IFFT" procedure in Fig. 8 is a universal procedure used in both FC and CONV layers, for both inference and training processes, and for different DNN models. We consider FFT the key computing kernel in the CirCNN architecture due to its recursive property. It is known that FFT can be highly efficient with $O(n \log n)$ computational complexity, and hardware implementation of FFT has been investigated in [63] [64] [65] [66] [67] [68]. The recursive property states that the calculation of a size-n FFT (with n inputs and n outputs) can be implemented using two FFTs of size n/2 plus one additional level of butterfly calculation, as shown in Fig. 9. It can be further decomposed into four FFTs of size n/4 with two additional levels. The recursive property of FFT is the key to ensuring a universal and reconfigurable design that can handle different DNN types, sizes, scales, etc. This is because: 1) a large-scale FFT can be calculated by recursively executing on the same computing block plus some additional calculations; and 2) IFFT can be implemented using the same structure as FFT with a different preprocessing procedure and parameters [63]. It also ensures a design with a small footprint, because: 1) multiple small-scale FFT blocks can be multiplexed to calculate a large-scale FFT with a certain degree of parallelism; and 2) the additional component-wise multiplication has $O(n)$ complexity and a relatively small hardware footprint.
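The recursive property can be seen directly in a textbook radix-2 decimation-in-time FFT, sketched below in plain Python. This is a software illustration of the decomposition, not the hardware design; the function names are chosen here, and the input length is assumed to be a power of two.

```python
import cmath

def fft_recursive(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft_recursive(x[0::2])     # size-n/2 FFT of x(0), x(2), ...
    odd = fft_recursive(x[1::2])      # size-n/2 FFT of x(1), x(3), ...
    out = [0j] * n
    for f in range(n // 2):           # one additional level of butterflies
        t = cmath.exp(-2j * cmath.pi * f / n) * odd[f]
        out[f] = even[f] + t
        out[f + n // 2] = even[f] - t
    return out

def dft_naive(x):
    """O(n^2) reference DFT used only to check the recursive version."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
err = max(abs(a - b) for a, b in zip(fft_recursive(x), dft_naive(x)))
```

Each call reuses the same routine on half-size inputs and adds one butterfly level, which mirrors how a fixed-size computing block can be executed iteratively to build larger transforms.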
Actual hardware systems, such as FPGA or ASIC designs, pose constraints on parallel implementation due to hardware footprint and logic block/interconnect resource limitations. As a result, we define the basic computing block with a parallelization degree $p$ and depth $d$ (of butterfly computations), as shown in Fig. 10. A butterfly computation in FFT comprises a cascade connection of complex number-based multiplications and additions [69, 70]. The basic computing block is responsible for implementing the major computational tasks (FFTs and IFFTs). An FFT operation (with reconfigurable size) is performed by decomposition and iterative execution on the basic computing blocks.
Compared with conventional FFT calculation, we simplify the FFT computation based on the following observation: the inputs of the deep learning system come from actual applications and are real values without imaginary parts. Therefore, the FFT result at each level is a symmetric sequence except for the base component [63]. As an example, in the basic computing block shown in Fig. 10, the partial FFT outcomes at each layer of butterfly computations are symmetric, and therefore the outcomes in the red circles do not need to be calculated or stored as partial outcomes. This observation can significantly reduce the amount of computation, the storage of partial results, and the memory traffic.
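The symmetry exploited here is the conjugate symmetry of the DFT of a real sequence, which NumPy exposes through its real-input transforms. The sketch below checks the symmetry and shows that roughly half of the spectrum is sufficient; it illustrates the mathematical property only, and the variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
x = rng.standard_normal(n)          # real-valued input, as in the text

full = np.fft.fft(x)                # n complex outputs
half = np.fft.rfft(x)               # only n//2 + 1 outputs are kept

# Conjugate symmetry: X[n - f] == conj(X[f]) for f = 1, ..., n - 1.
sym_err = float(np.max(np.abs(full[1:][::-1] - np.conj(full[1:]))))
# The half spectrum is enough to recover the original input.
rec_err = float(np.max(np.abs(np.fft.irfft(half, n) - x)))
```

Keeping n//2 + 1 values instead of n is the software analogue of skipping the redundant butterfly outputs.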
4.2 Overall Architecture
The overall CirCNN architecture is shown in Fig. 11, which includes the basic computing block, the peripheral computing block, the control subsystem, the memory subsystem, and the I/O subsystem (I/O buffers). The basic computing block is responsible for the major FFT and IFFT computations. The peripheral computing block is responsible for performing component-wise multiplication, ReLU activation, pooling, etc., which require lower (linear) computational complexity and a smaller hardware footprint. The implementations of ReLU activation and pooling use comparators and have no inherent difference compared with prior work [24, 26]. The control subsystem orchestrates the actual FFT/IFFT calculations on the basic computing block and peripheral computing block. Because of the different sizes of CONV and FC layers and the different types of deep learning applications, the settings of the FFT/IFFT calculations are configured by the control subsystem. The memory subsystem is composed of ROM, which is used to store the coefficients in FFT/IFFT calculations (i.e., the $W_n^i$ values including both real and imaginary parts), and RAM, which is used to store weights, e.g., the FFT results $\mathrm{FFT}(w_{ij})$. For ASIC design, a memory hierarchy may be utilized and carefully designed to ensure good performance.
We use 16-bit fixed-point numbers for input and weight representations, which is common and widely accepted to be sufficiently accurate for DNNs [23, 24, 26, 71]. Furthermore, it is pointed out in [35, 37] that the inaccuracy caused by quantization is largely independent of the inaccuracy caused by compression, and that the quantization inaccuracy does not accumulate significantly in deep layers.
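To make the 16-bit fixed-point representation concrete, the following is a minimal sketch assuming NumPy. The split between integer and fractional bits is not stated in the text, so the default of 12 fractional bits (and the helper names `to_fixed16` and `from_fixed16`) is purely an assumption for illustration.

```python
import numpy as np

def to_fixed16(values, frac_bits=12):
    """Quantize real values to signed 16-bit fixed point (frac_bits is an assumed format)."""
    scale = 1 << frac_bits
    q = np.round(np.asarray(values, dtype=np.float64) * scale)
    # Saturate to the representable int16 range instead of wrapping around.
    return np.clip(q, -32768, 32767).astype(np.int16)

def from_fixed16(q, frac_bits=12):
    """Convert signed 16-bit fixed-point values back to floating point."""
    return q.astype(np.float64) / (1 << frac_bits)

weights = np.linspace(-7.9, 7.9, 1001)
error = np.abs(from_fixed16(to_fixed16(weights)) - weights)
print(error.max() <= 2.0 ** -13)
```

With this assumed format, any value inside the representable range is reproduced to within half a quantization step, i.e., 2^-13.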
Pipelining and Parallelism
Thanks to the regular block-circulant matrix structure, effective pipelining can be utilized to achieve the optimal tradeoff between energy efficiency and performance (throughput). The CCNN architecture considers two pipelining techniques, as shown in Fig. 12.
In inter-level pipelining, each pipeline stage corresponds to one level in the basic computing block. In intra-level pipelining, additional pipeline stage(s) are added within each butterfly computation unit, i.e., by deriving the optimal stage division in the cascade connection of complex number-based multiplications and additions. The proper selection of the pipelining scheme depends strongly on the target operating frequency and the organization of the memory subsystem. In the experimental prototype, we target a clock ... frequency around 200MHz (close to state-of-the-art ASIC tapeouts of DCNNs [23, 29, 30]), and therefore inter-level pipelining, which has a simpler structure, is sufficient for an efficient implementation.
Based on the definition of p (parallelization degree) and d (parallelization depth), larger p and d values indicate a higher level of parallelism and therefore lead to higher performance and throughput, but also to higher hardware cost/footprint. A larger d value would also result in fewer memory accesses at the cost of higher as a suitable function of (average) performance and power consumption. The performance Perf(p, d) is an increasing function of p and d, but it also depends on the platform specifications and on the target type and size of the DNN models (averaged over a set of learning models). The power consumption Power(p, d) is a close-to-linear function of pd, accounting for both the static and dynamic components of power dissipation. The optimization of p and d is constrained by the compute and memory hardware resource limits as well as by memory and I/O bandwidth constraints. The proposed algorithm for design optimization is illustrated in Algorithm 3, which sets p as the optimization priority in order not to increase control complexity. This algorithm depends on an accurate estimation of performance and power consumption at each design configuration.
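The idea of giving p priority during the search can be illustrated with a small greedy loop. This is not Algorithm 3 itself; it is a hypothetical sketch in which `perf`, `power`, and `fits` are caller-supplied estimators (assumptions of this example) standing in for the performance model, the power model, and the resource/bandwidth constraints.

```python
def choose_p_d(perf, power, fits, p_options, d_options, power_budget):
    """Greedy sketch: raise p first, then d, while every constraint still holds.

    perf(p, d), power(p, d) and fits(p, d) are hypothetical estimator callbacks.
    Returns the chosen (p, d), or None if even the smallest configuration is infeasible.
    """
    def feasible(p_, d_):
        return fits(p_, d_) and power(p_, d_) <= power_budget

    p, d = p_options[0], d_options[0]
    if not feasible(p, d):
        return None

    # Step 1: p has priority, so grow it as far as constraints and gains allow.
    for cand in p_options[1:]:
        if feasible(cand, d) and perf(cand, d) > perf(p, d):
            p = cand
        else:
            break

    # Step 2: only then try to grow d with the chosen p held fixed.
    for cand in d_options[1:]:
        if feasible(p, cand) and perf(p, cand) > perf(p, d):
            d = cand
        else:
            break
    return p, d

# Toy models, invented for illustration only.
best = choose_p_d(
    perf=lambda p, d: p * d,
    power=lambda p, d: 0.1 * p * d,
    fits=lambda p, d: p * d <= 64,
    p_options=[16, 32, 64, 128],
    d_options=[1, 2, 3],
    power_budget=5.0,
)
print(best)
```

A real design flow would replace the toy lambdas with measured or simulated estimates, since the quality of the result depends entirely on how accurate those estimates are.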
We provide an example of design optimization and its effects, assuming a block size of 128 for an FPGA-based implementation (Cyclone V). Because of the low operating frequency, increasing the p value from 16 to 32 while maintaining d = 1 increases power consumption by less than 10%, whereas the performance increases by 53.8% with simple pipelining control. Increasing d from 1 to 2 results in an even smaller power increase of 7.8%, with a performance increase of 62.2%. These results seem to show that increasing d is slightly more beneficial because of the reduced memory access overheads. However, a d value higher than 3 will result in high control difficulty and pipelining bubbles, whereas p can be increased with the same control complexity thanks to the high bandwidth of block memory in FPGAs. As a result, we put p as the optimization priority in Algorithm 3.
Platform-Specific Optimizations
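The numbers in this example can be turned into energy-efficiency (performance per watt) ratios with a short calculation. The sketch below only uses the percentages quoted above; the helper name `efficiency_gain` is hypothetical, and because the power increase for p is given as "less than 10%", the first result is a lower bound.

```python
def efficiency_gain(perf_increase, power_increase):
    """Ratio of new to old performance-per-watt, given fractional increases."""
    return (1.0 + perf_increase) / (1.0 + power_increase)

# p: 16 -> 32 with d = 1 (power increase bounded above by 10%, so this is a lower bound).
gain_p = efficiency_gain(0.538, 0.10)
# d: 1 -> 2 (power +7.8%, performance +62.2%).
gain_d = efficiency_gain(0.622, 0.078)

print(round(gain_p, 3), round(gain_d, 3))
```

Under these figures, doubling p improves energy efficiency by at least about 1.40×, and doubling d by about 1.50×, which matches the remark that increasing d looks slightly more beneficial before control complexity is considered.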
Based on the generic CCNN architecture, this section describes platform-specific optimizations for the FPGA-based and ASIC-based hardware platforms. We focus on weight storage and memory management, in order to simplify the design and achieve higher energy efficiency and performance.
FPGA Platform. The key observation is that the weight storage requirement of representative DNN applications can (potentially) be met by the on-chip block memory in state-of-the-art FPGAs. As a representative large-scale DCNN model for the ImageNet application, the whole AlexNet [6] requires only around 4MB of storage after (i) applying block-circulant matrices only to FC layers, and (ii) using 16-bit fixed-point numbers, which results in negligible accuracy loss. Such a storage requirement can be fulfilled by the on-chip block memory of state-of-the-art FPGAs such as Intel (formerly Altera) Stratix, Xilinx Virtex-7, etc., which provide up to tens of MBs of on-chip memory [72, 73]. Moreover, when block-circulant matrices are also applied to CONV layers, the storage requirement can be further reduced to 2MB or even less (depending on the block size and the tolerable accuracy degradation). The storage requirement then becomes comparable with the input size and can potentially be supported by low-power, high energy-efficiency FPGAs such as Intel (Altera) Cyclone V or Xilinx Kintex-7 FPGAs. A similar observation holds for other applications (e.g., the MNIST data set [62]) and for different network models such as DBNs or RNNs. Even for future larger-scale applications, the full model after compression will (likely) fit in an FPGA SoC, leveraging the storage space of the integrated ARM core and DDR memory [72, 73]. This observation would make the FPGA-based design significantly more efficient.
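The effect of block-circulant structure on weight storage follows from simple counting: a k × k circulant block stores k values instead of k². The sketch below applies this to the first FC layer of AlexNet (9216 inputs, 4096 outputs). The block size of 128 is an assumption chosen for illustration, the helper name `weight_bytes` is hypothetical, and the result is not meant to reproduce the 4MB whole-network figure quoted above.

```python
def weight_bytes(rows, cols, bits=16, block_size=1):
    """Bytes needed for a rows x cols weight matrix.

    block_size=1 means an unstructured matrix; block_size=k means the matrix is
    partitioned into k x k circulant blocks, each stored as k values.
    """
    if rows % block_size or cols % block_size:
        raise ValueError("block_size must divide both dimensions in this sketch")
    params = rows * cols // block_size
    return params * bits // 8

dense = weight_bytes(4096, 9216)                        # unstructured, 16-bit
compressed = weight_bytes(4096, 9216, block_size=128)   # assumed block size of 128
print(dense, compressed, dense // compressed)
```

For this single layer the storage drops from 72 MiB to 576 KiB, a reduction by exactly the block size.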
In state-of-the-art FPGAs, on-chip memory is organized in the form of memory blocks, each with a certain capacity and bandwidth limit. The number of on-chip memory blocks represents a tradeoff between lower control complexity (with more memory blocks) and higher energy efficiency (with fewer memory blocks), and thus should become an additional knob for design optimization. State-of-the-art FPGAs are equipped with comprehensive DSP resources such as 18 × 18 or variable-size multipliers [72, 73], which are effectively exploited for performance and energy efficiency improvements.
ASIC Platform. We mainly investigate two aspects of the memory subsystem: 1) the potential memory hierarchy and 2) the memory bandwidth and aspect ratio. Representative deep learning applications require hundreds of KBs to multiple MBs of memory storage depending on the compression level, and we assume a conservative value of multiple MBs due to the universal and reconfigurable nature of the CCNN architecture. The potential memory hierarchy structure depends strongly on the target clock frequency of the proposed system. Specifically, if we target a clock frequency around 200MHz (close to state-of-the-art ASIC tapeouts of DCNNs [23, 29, 30]), then a memory hierarchy is not necessary because a single-level memory system can support such an operating frequency. Rather, memory/cache reconfiguration techniques [74, 75] can be employed when executing different types and sizes of applications, for performance enhancement and static power reduction. If we target a higher clock frequency, say 800MHz, an effective memory hierarchy with at least two levels (L1 cache and main memory) becomes necessary because a single-level memory cannot accommodate such a high operating frequency.
Please note that the cache-based memory hierarchy is highly efficient and results in a very low cache miss rate, because prefetching [76, 77], the key technique to improve performance, is highly effective given the regular weight access patterns. The effectiveness of prefetching is due to the regularity of the proposed block-circulant matrix-based neural networks, which is another advantage over prior compression schemes. In our experimental results in the next section, we target a lower clock frequency of 200MHz, and therefore the memory hierarchy structure is not needed.
Besides the memory hierarchy structure, the memory bandwidth is determined by the parallelization degree p in the basic computing block. The aspect ratio of the memory subsystem is determined based on this configuration. Because of the relatively high memory bandwidth requirement compared with the total memory capacity (after compression using block-circulant matrices), column decoders can be eliminated in general [78], resulting in a simpler layout and lower routing requirements.
EVALUATION
In this section, we provide the detailed experimental setups and results of the proposed universal inference framework on different platforms including FPGAs, ASIC designs, and embedded processors. We report experimental results on representative benchmarks such as MNIST, CIFAR-10, SVHN, and ImageNet, and conduct a comprehensive comparison with state-of-the-art work on hardware deep learning systems. Order(s) of magnitude improvements in energy efficiency and performance can be observed using the proposed universal inference framework.
FPGA-Based Testing
First, we illustrate our FPGA-based testing results using a low-power and low-cost Intel (Altera) Cyclone V 5CEA9 FPGA. The Cyclone V FPGA exhibits a low static power consumption of less than 0.35W and a highest operating frequency between 227MHz and 250MHz (although actual implementations typically run at less than 100MHz), making it a good choice for energy efficiency optimization of FPGA-based deep learning systems. Fig. 13 illustrates the comparison of performance (in giga operations per second, GOPS) and energy efficiency (in giga operations per Joule, GOPS/W) between the proposed and reference FPGA-based implementations. The FPGA implementation uses the AlexNet structure, a representative DCNN model with five CONV layers and three FC layers for ImageNet applications [1]. The reference FPGA-based implementations are state-of-the-art designs represented by [FPGA16] [14], [ICCAD16] [15], [FPGA17, Han] [20], and [FPGA17, Zhao] [18]. The reference works implement large-scale AlexNet, VGG-16, a medium-scale DNN for CIFAR-10, or a custom-designed recurrent neural network [20]. Note that we use equivalent GOPS and GOPS/W for all methods with weight storage compression, including ours. Although those references focus on different DNN models and structures, both GOPS and GOPS/W are general metrics that are independent of model differences. It is widely accepted in hardware deep learning research to compare the GOPS and GOPS/W metrics of a proposed design with those reported in reference work, as shown in [14, 15, 18]. Please note that this is not an entirely fair comparison, because our implementation is a layerwise implementation (some reference works implement end-to-end networks) and we extract on-chip FPGA power consumption.
In Fig. 13, we can observe the significant improvement in energy efficiency achieved by the proposed FPGA-based implementations compared with prior art, reaching 11×-16× even when compared with prior work that uses heuristic model size reduction techniques [18, 20] (reference [20] uses the heuristic weight pruning method, and [18] uses a binary-weighted neural network XOR-Net). When compared with prior art based on an uncompressed (or partially compressed) deep learning system [14, 15], the energy efficiency improvement can reach 60-70×. These results demonstrate a clear energy efficiency advantage of CCNN using block-circulant matrices. The performance and energy efficiency improvements are due to: 1) algorithm complexity reduction and 2) efficient hardware design, weight reduction, and elimination of weight accesses to off-chip storage. The first source results in a 10×-20× improvement and the second in 2×-5×. Please note that the CCNN architecture does not yield the highest throughput because we use a single low-power FPGA, while reference [20] uses a high-performance FPGA together with large off-chip DRAM on a custom-designed recurrent neural network. If needed, we can increase the number of FPGAs to process multiple neural networks in parallel, thereby improving the throughput without any degradation in energy efficiency.
Besides comparing the CCNN architecture with state-of-the-art FPGA implementations on large-scale DCNNs, we also compare it with the IBM TrueNorth neurosynaptic processor on a set of benchmark data sets including MNIST, CIFAR-10, and SVHN, which are all supported by IBM TrueNorth. This is a fair comparison because both the proposed system and IBM TrueNorth are end-to-end implementations. IBM TrueNorth [79] is a neuromorphic CMOS chip fabricated in 28nm technology, with 4096 cores each simulating 256 programmable silicon neurons in a time-multiplexed manner. It implements Spiking Neural Networks, a bio-inspired type of neural network that benefits from the ability of globally asynchronous implementations but is widely perceived to achieve lower accuracy than state-of-the-art DNN models. IBM TrueNorth exhibits the advantages of reconfigurability and programmability. Nevertheless, reconfigurability also applies to the CCNN architecture. [79, 80] ([80] for MNIST and [79] for CIFAR-10 and SVHN), and we choose results from the low-power mapping mode using only a single TrueNorth chip for high energy efficiency. We can observe an improvement in throughput for the MNIST and SVHN data sets, with energy efficiency at the same order of magnitude. The throughput for CIFAR-10 using the FPGA implementation of CCNN is lower because 1) TrueNorth requires specific preprocessing of CIFAR-10 [79] before performing inference, and 2) the DNN model we chose uses small-scale FFTs, which limits the degree of improvement. Besides, the CCNN architecture achieves higher test accuracy in general: it results in very minor accuracy degradation compared with software DCNNs (cf. Fig. 7), whereas the low-power mode of IBM TrueNorth incurs higher accuracy degradation [79, 80].
These results demonstrate the high effectiveness of the CCNN architecture, because it is widely perceived that FPGA-based implementations result in lower performance and energy efficiency than ASIC implementations, in exchange for a short development round and higher flexibility.
ASIC Designs Synthesis Results
In Fig. 15, we can observe that our synthesis results achieve both the highest throughput and the highest energy efficiency, the latter more than 6 times the highest energy efficiency among the best state-of-the-art implementations. It is also striking that even our FPGA implementation achieves the same order of energy efficiency and higher throughput compared with the best state-of-the-art ASICs. It is worth noting that the best state-of-the-art ASIC implementations report their highest energy efficiency in the near-threshold regime with an aggressively reduced bit-length (say 4 bits, with a significant accuracy reduction in this case). When using 4-bit input and weight representations and near-threshold computing at a 0.55V $V_{dd}$ voltage level, another 17× improvement in energy efficiency can be achieved in the synthesis results compared with our super-threshold implementation, as shown in Fig. 15. This makes a total of 102× improvement compared with the best state-of-the-art. Moreover, in our systems (in the super-threshold implementation), memory in fact consumes slightly less power than the computing blocks, which demonstrates that weight storage is no longer the system bottleneck. Please note that the overall accuracy when using a 4-bit representation is low in the CCNN architecture (e.g., less than 20% for AlexNet). Hence, the 4-bit representation is only used to provide a fair comparison with baseline methods that use the same number of bits.
We also compare performance and energy efficiency with the most energy-efficient NVIDIA Jetson TX1 embedded GPU toolkit, which is optimized for deep learning applications. An energy efficiency improvement of 570× can be achieved using our implementation, and the improvement reaches 9,690× when incorporating near-threshold computing and 4-bit weight and input representations.
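The combined improvement factors quoted in this subsection follow from multiplying the super-threshold gain by the additional near-threshold, 4-bit gain. The short calculation below uses only the factors stated in the text; the variable names are illustrative.

```python
# Factors quoted in the text, relative to the respective baselines.
asic_vs_best_asic = 6            # super-threshold synthesis vs. best prior ASIC ("more than 6 times")
asic_vs_jetson_tx1 = 570         # super-threshold synthesis vs. NVIDIA Jetson TX1
near_threshold_4bit_gain = 17    # extra gain from 0.55 V operation and 4-bit representations

total_vs_best_asic = asic_vs_best_asic * near_threshold_4bit_gain
total_vs_jetson_tx1 = asic_vs_jetson_tx1 * near_threshold_4bit_gain
print(total_vs_best_asic, total_vs_jetson_tx1)  # prints: 102 9690
```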
Embedded ARM-based Processors
Because ARM-based embedded processors are the most widely used embedded processors in smartphones, embedded devices, and IoT devices, we implement the proposed block-circulant matrix-based DNN inference framework on a smartphone using ARM Cortex A9 processor cores, and provide some sample results. The aim is to demonstrate the potential of real-time implementation of deep learning systems on embedded processors, thereby significantly promoting the wide adoption of (large-scale) deep learning systems in personal, embedded, and IoT devices. For the LeNet-5 DCNN model on the MNIST data set, the proposed embedded processor-based implementation achieves a performance of 0.9ms/image with 96% accuracy, which is slightly faster than IBM TrueNorth in the high-accuracy mode [80] (1000 Images/s). The energy efficiency is slightly lower but at the same level, due to the peripheral devices in a smartphone. Compared with a GPU-based implementation using an NVIDIA Tesla C2075 GPU with 2,333 Images/s, the energy efficiency is significantly higher because the GPU consumes 202.5W of power, while the embedded processor consumes only around 1W. Interestingly, on the fully-connected layer of AlexNet, our smartphone-based implementation of CCNN even achieves higher throughput (667 Layers/s vs. 573 Layers/s) than the NVIDIA Tesla GPU. This is because the benefits of computational complexity reduction become more significant as the model size grows.
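The throughput and energy figures in this paragraph can be put on a common footing as images per Joule. The sketch below uses only the numbers quoted above; it treats the embedded processor's power as exactly 1W, although the text says "around 1W", so the resulting ratio is approximate.

```python
def images_per_joule(images_per_second, watts):
    """Energy efficiency: (images/s) / (J/s) = images/J."""
    return images_per_second / watts

arm_rate = 1000.0 / 0.9                        # 0.9 ms/image -> about 1111 images/s
arm_eff = images_per_joule(arm_rate, 1.0)      # assumes exactly 1 W for the ARM platform
gpu_eff = images_per_joule(2333.0, 202.5)      # Tesla C2075 figures from the text

efficiency_ratio = arm_eff / gpu_eff           # MNIST / LeNet-5 energy-efficiency ratio
fc_throughput_ratio = 667.0 / 573.0            # AlexNet FC layer, Layers/s

print(round(arm_rate), round(gpu_eff, 2), round(efficiency_ratio, 1), round(fc_throughput_ratio, 3))
```

Under these assumptions, the embedded implementation processes roughly 1,111 images/s, is on the order of 96× more energy-efficient than the GPU on MNIST, and is about 1.16× faster on the AlexNet FC layer.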
Summary and Discussions
Energy Efficiency and Performance: Overall, the CCNN architecture achieves a significant gain in energy efficiency and performance compared with the best state-of-the-art designs on different platforms including FPGAs, ASIC designs, and embedded processors.
The key reasons for these improvements include the fundamental algorithmic improvements, the weight storage reduction, a significant reduction of off-chip DRAM accesses, and the highly efficient implementation of the basic computing block for FFT/IFFT calculations. The fundamental algorithmic improvements account for the most significant portion of the energy efficiency and performance improvements, around 10×-20×, and the rest account for 2×-5×. In particular, we emphasize that: 1) the hardware resources and power/energy consumption associated with memory storage are of the same order as those of the computing blocks and are not the absolute dominating factor of the overall hardware deep learning system; 2) medium- to large-scale DNN models can be implemented in a small footprint thanks to the recursive property of FFT/IFFT calculations. These characteristics are the key to enabling highly efficient implementations of the CCNN architecture in low-power FPGAs/ASICs and to eliminating complex control logic and high-power-consumption clock networks.
Reconfigurability: This is a key property of the CCNN architecture, allowing it to be applied to a wide set of deep learning systems. In this respect it resembles IBM TrueNorth, and it could significantly shorten the development round and promote the wide application of deep learning systems. Unlike IBM TrueNorth, CCNN 1) does not need a specialized offline training framework or specific preprocessing procedures for certain data sets like CIFAR [79]; and 2) does not waste hardware resources on small-scale neural networks or require additional chips for large-scale ones. The former property holds because the proposed training algorithms are general, and the latter because DNN models of different scales can be executed on the same basic computing block using different control signals, thanks to the recursive property of FFT/IFFT. The software interface for reconfigurability is under development and will be released for public testing.
Online Learning Capability: The CCNN architecture described here mainly focuses on the inference process of deep learning systems, although its algorithmic framework applies to both inference and training. We focus on inference because it is difficult to perform online training in hardware embedded deep learning systems, given their limited computing power and the limited data sets they encounter.
CONCLUSION
This paper proposes CCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from $O(n^2)$ to $O(n \log n)$ and the storage complexity from $O(n^2)$ to $O(n)$, with negligible accuracy loss. We propose the CCNN architecture, a universal DNN inference engine that can be implemented on various hardware/software platforms with a configurable network architecture (e.g., layer type, size, scales, etc.). To demonstrate the performance and energy efficiency, we test the CCNN architecture on FPGA, ASIC, and embedded processors. Our results show that the CCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CCNN achieves 6×-102× energy efficiency improvements compared with the best state-of-the-art results.
ACKNOWLEDGEMENT
This work is funded by the National Science Foundation Awards CNS-1739748, CNS-1704662, CNS-1337300, CRII-1657333, CCF-1717754, CNS-1717984, Algorithm-in-the-Field program, DARPA SAGA program, and CASE Center at Syracuse University.
