Both industry and academia have extensively investigated hardware acceleration of deep neural networks (DNNs). In this work, to address the increasing demands on computational capability and memory, we propose structured weight matrices (SWM)-based compression techniques for both field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) implementations. On the algorithm side, the SWM-based framework adopts block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. The SWM-based technique reduces the computational complexity from $O(n^2)$ to $O(n \log n)$ and the storage complexity from $O(n^2)$ to $O(n)$ for each layer, in both the training and inference phases. For FPGA implementations of deep convolutional neural networks (DCNNs), the SWM-based framework achieves at least 152X and 72X improvements in performance and energy efficiency, respectively, compared with the IBM TrueNorth processor baseline under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For FPGA implementations of long short-term memory (LSTM) networks, the proposed SWM-based LSTM achieves up to 21X higher performance and 33.5X higher energy efficiency than the baseline accelerator. For ASIC implementations, the SWM-based design exhibits clear advantages in power, throughput, and energy efficiency. Experimental results indicate that this method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.
INTRODUCTION
Deep learning has drawn increasing attention in many research fields, such as speech recognition [13], computer vision [12, 18], self-driving cars [14, 32], and unmanned aircraft systems [25]. Large-scale deep neural networks (DNNs) typically consist of multiple layers and at least millions of weight parameters for the entire model [18]. One major advantage of larger-scale DNNs is that they extract more complex high-level features from the inputs (e.g., images, videos, speech) and, as a result, achieve a significant improvement in model accuracy [32].
On the other hand, as the size of DNNs grows continuously, the demands on computational capability and memory increase tremendously. Therefore, improving performance and energy efficiency while maintaining the accuracy of DNNs becomes extremely critical. Two research trends have emerged toward higher performance and energy efficiency. The first trend is hardware acceleration. FPGA-based accelerators have the advantages of programmability and a high degree of parallelism. Stochastic computing (SC), in which all inputs and weight values are represented as streams of random bits, has been investigated and successfully applied to hardware acceleration of DNNs [22-24, 29, 30, 37]. Data-path optimization techniques [8] have also been studied to map a limited number of processing elements (PEs) onto an FPGA and reuse the mapped PEs by iterating data through them. In addition, ASIC-based implementations have been explored to further accelerate DNNs. A substantial number of high-tech companies have announced ASIC chip designs for DNNs, such as Google [15] and IBM (TrueNorth) [6]. In academia, Eyeriss [2], EIE [10], and DaDianNao [1] mainly focus on the convolutional layers, the fully-connected layers, and the memory design/organization at the architectural level, respectively.
The second trend is model compression, motivated by the energy efficiency limitations of large DNN models. Weight pruning [11] and low-rank approximation [33] aim to reduce the number of operations involved in DNNs. They achieve a certain degree of parameter reduction with inconsequential accuracy degradation. However, they introduce new challenges, such as the irregular network structure caused by sparsity regularization [36] and the increased training complexity caused by the additional pruning process [11] or low-rank approximation step [33].
In this work, to address the limitations of existing work on model size compression and acceleration, and to achieve ultra-high energy efficiency and performance for FPGA- and ASIC-based hardware implementations, we propose the structured weight matrices (SWM)-based compression technique for both FPGA and ASIC implementations. The SWM-based framework adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio. For FPGA implementations of DCNNs, the SWM-based framework achieves at least 152X and 72X improvements in performance and energy efficiency, respectively, compared with the IBM TrueNorth processor baseline under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For FPGA implementations of LSTM networks, the proposed SWM-based method achieves up to 21X higher performance and 33.5X higher energy efficiency than ESE. For ASIC implementations, the proposed SWM-based design exhibits clear advantages in power, throughput, and energy efficiency. These results indicate that the method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.
BACKGROUND OF DNNS
Deep Convolutional Neural Networks
DNN systems cover many different architectures, such as DCNNs, recurrent neural networks (RNNs), and deep belief networks (DBNs). Although different network structures target specific applications, they share the same construction principle, i.e., multiple layers connected in series for feature extraction [16, 21]. DNNs are commonly made up of three types of layers: fully-connected (FC) layers, convolutional (CONV) layers, and pooling (POOL) layers.
The FC layer is the most storage-intensive layer in DNNs [10, 28] since its neurons are fully connected with the neurons in the previous layer. The computation of an FC layer consists of a matrix-vector multiplication followed by the activation function, described as $y = \psi(Wx + \theta)$, where $W \in \mathbb{R}^{m \times n}$ is the weight matrix of the synapses between this FC layer (with $m$ neurons) and its previous layer (with $n$ neurons); $\theta \in \mathbb{R}^{m}$ is the bias vector; and $\psi(\cdot)$ is the activation function. The calculation of $Wx$ dominates the computational complexity because the rest has a lower complexity of $O(n)$.
The CONV layer performs a multi-dimensional convolution to extract features from its inputs, which are then fed into subsequent layers for extracting higher-level features. A CONV layer is associated with a set of learnable filters (or kernels) [19]. A filter-sized moving window is applied to the input feature maps, calculating the convolution of the filter and the input feature maps in the moving window. In practical DNN models, the CONV layers are often associated with multiple input and multiple output feature maps. As a result, the CONV layer can be expressed as a tensor computation:
$$Y(x, y, p) = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c=1}^{C} F(i, j, c, p)\, X(x + i - 1,\; y + j - 1,\; c),$$
where $X \in \mathbb{R}^{W \times H \times C}$, $Y \in \mathbb{R}^{(W-r+1) \times (H-r+1) \times P}$, and $F \in \mathbb{R}^{r \times r \times C \times P}$ represent the input, output, and weight "tensors" of the CONV layer, respectively. Here, $W$ and $H$ are the spatial dimensions of the input maps, $C$ is the number of input maps, $r$ is the size of the convolutional kernel, and $P$ is the number of output maps.
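The tensor computation of a CONV layer can be made concrete with a small NumPy sketch. It assumes unit stride, no padding, and the tensor names X (input, W×H×C), F (weights, r×r×C×P), and Y (output); the function name `conv_layer` is hypothetical and chosen only for illustration.

```python
import numpy as np

def conv_layer(X, F):
    """Naive CONV layer (assumed: stride 1, no padding).

    X: input tensor of shape (W, H, C)
    F: weight tensor of shape (r, r, C, P)
    Returns Y of shape (W - r + 1, H - r + 1, P).
    """
    W, H, C = X.shape
    r, r2, C2, P = F.shape
    assert r == r2 and C == C2, "kernel must be square and match input channels"
    Y = np.zeros((W - r + 1, H - r + 1, P))
    for x in range(W - r + 1):
        for y in range(H - r + 1):
            window = X[x:x + r, y:y + r, :]  # filter-sized moving window, (r, r, C)
            # Sum over the two kernel axes and the input-channel axis for all P outputs.
            Y[x, y, :] = np.tensordot(window, F, axes=([0, 1, 2], [0, 1, 2]))
    return Y
```

The loop makes the cost visible: every output position touches r·r·C·P weights, which is why CONV layers dominate computation even though their weight storage is small compared with FC layers.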
POOL layer performs a subsampling operation on the extracted features to reduce the data dimensions and mitigate overfitting issues. Max pooling is the dominant type of pooling strategy in state-of-the-art DCNNs due to its higher overall accuracy and convergence speed [1, 2] .
The majority of computations occur in CONV and FC layers, while the POOL layer has a lower computational complexity of O(n). The storage requirement of DNNs is due to the weight matrices W's in the FC layers and the convolutional kernels F's in CONV layers. As a result, the FC and CONV layers become the major research focuses for energy-efficient implementation of DNNs.
Recurrent Neural Networks
RNNs have been widely investigated and have many applications in natural language processing, speech recognition, and machine translation [31]. As one popular type of RNN, the long short-term memory (LSTM) network, shown in Fig. 1, has been broadly studied [31]. An LSTM-based RNN accepts an input sequence $X = (x_1; x_2; x_3; \dots; x_T)$ (each $x_t$ is a vector corresponding to time $t$) together with the output sequence from the previous step $Y_{T-1} = (y_0; y_1; y_2; \dots; y_{T-1})$ (each $y_t$ is a vector). It computes an output sequence $Y = (y_1; y_2; y_3; \dots; y_T)$ by using the following equations iteratively from $t = 1$ to $T$:
where the symbols $i$, $f$, $o$, $c$, $m$, and $y$ represent the input gate, forget gate, output gate, cell state, cell output, and projected output, respectively. The $\odot$ operation represents element-wise multiplication, and the $+$ operation is matrix addition. The $W$ terms represent weight matrices (for instance, $W_{ix}$ is the weight matrix from the input vector $x_t$ to the input gate), and the $b$ terms are the bias vectors. Additionally, the weight matrices $W_{ic}$, $W_{fc}$, and $W_{oc}$ are diagonal matrices for peephole connections, which can be treated as vectors during matrix-vector multiplication. Therefore, $W_{ic} c_{t-1}$ can be calculated using the $\odot$ operation. $\sigma$ is the logistic activation function and $h$ is a user-defined activation function. In this model we use the hyperbolic tangent (tanh) activation function as $h$.
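One time step of such an LSTM can be sketched in NumPy as follows. This is a minimal sketch that assumes the widely used peephole-plus-projection formulation; only $W_{ix}$, $W_{ic}$, $W_{fc}$, $W_{oc}$, $\sigma$, and $h$ are named in the text, so the remaining parameter names (`W_ir`, `W_cx`, `W_ym`, and so on), the use of tanh for the cell candidate, and the function `lstm_step` itself are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation (sigma in the text)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, p):
    """One LSTM step with peephole connections and an output projection.

    p is a dict of parameters; the peephole weights w_ic, w_fc, w_oc are
    stored as vectors because the corresponding matrices are diagonal.
    """
    # Input and forget gates see the previous cell state through peepholes.
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ y_prev + p["w_ic"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ y_prev + p["w_fc"] * c_prev + p["b_f"])
    # Cell candidate (tanh here is an assumption of this sketch).
    g_t = np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ y_prev + p["b_c"])
    c_t = f_t * c_prev + i_t * g_t
    # Output gate peeks at the updated cell state.
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ y_prev + p["w_oc"] * c_t + p["b_o"])
    m_t = o_t * np.tanh(c_t)   # cell output, h = tanh
    y_t = p["W_ym"] @ m_t      # projected output
    return y_t, c_t
```

Every `@` in this step is a matrix-vector product, which is where a structured weight matrix would replace the dense one; the peephole terms are already element-wise and need no compression.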
STRUCTURED WEIGHT MATRIX
This section discusses the inference and training algorithms of SWM-based DNNs (e.g., [5, 35]). The advantage is two-fold: 1) a fine-grained tradeoff between accuracy and compression/acceleration can be obtained by changing the block size; and 2) the method applies to both FC and CONV layers. The theoretical foundation is derived in [38], which shows that the "effectiveness" of SWM-based DNNs is the same as that of DNNs without compression. Experimental results in [5, 35] have demonstrated high model compression ratios (from 41× to 256×) with small (less than 2%) overall accuracy degradation. In the following, we discuss the inference and training algorithms for the FC layer; details of the CONV layer algorithms are provided in [5].
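The key idea of SWM-based FC layers is to partition the original weight matrix $W \in \mathbb{R}^{m \times n}$ into blocks of square sub-matrices, where each sub-matrix is a circulant matrix, as illustrated in Fig. 2. Let $k$ denote the block size (the size of each sub-matrix) and assume there are $p \times q$ blocks after partitioning $W$, where $p = m \div k$ and $q = n \div k$. Writing $W = [W_{ij}]$ with $i \in \{1, \dots, p\}$ and $j \in \{1, \dots, q\}$, and partitioning the input correspondingly as $x = [x_1^T, x_2^T, \dots, x_q^T]^T$, the forward propagation in the inference phase is given by (with bias and ReLU omitted for simplicity):
$$a = Wx = \begin{bmatrix} \sum_{j=1}^{q} W_{1j} x_j \\ \sum_{j=1}^{q} W_{2j} x_j \\ \vdots \\ \sum_{j=1}^{q} W_{pj} x_j \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix},$$
where $a_i \in \mathbb{R}^{k}$ is a column vector. Assume each circulant matrix $W_{ij}$ is defined by a vector $w_{ij}$, i.e., $w_{ij}$ is the first row vector of $W_{ij}$. According to the circulant convolution theorem [27], the calculation of $W_{ij} x_j$ can be performed as $\mathrm{IFFT}\big(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\big)$, where $\circ$ denotes element-wise multiplication. The operation procedure is shown on the right of Fig. 2. For the inference phase, the computational complexity of this FC layer is $O(pqk \log k)$, which is equivalent to $O(n \log n)$ for small $p$, $q$ values. Similarly, the storage complexity is $O(pqk)$ because only $w_{ij}$ or $\mathrm{FFT}(w_{ij})$ needs to be stored for each sub-matrix, which is equivalent to $O(n)$ for small $p$, $q$ values. Therefore, acceleration and model compression are achieved simultaneously.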
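The FFT-based computation of a block-circulant FC layer can be checked numerically against the equivalent dense product. The NumPy sketch below is illustrative only; the function names are hypothetical. It defines each circulant block by its first column, for which the product is a plain circular convolution; under a first-row convention the same code applies, for real-valued weights, with FFT(w) replaced by its complex conjugate.

```python
import numpy as np

def circulant(c):
    """Dense k x k circulant matrix whose first column is c."""
    k = len(c)
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return c[idx]

def dense_from_blocks(w):
    """Expand defining vectors w of shape (p, q, k) into the full (p*k, q*k) matrix."""
    p, q, _ = w.shape
    return np.block([[circulant(w[i, j]) for j in range(q)] for i in range(p)])

def block_circulant_matvec(w, x):
    """Compute a = W x using only FFTs of length k.

    w: (p, q, k) defining vectors, x: (q*k,) input. Returns a of shape (p*k,).
    """
    p, q, k = w.shape
    x_freq = np.fft.fft(x.reshape(q, k), axis=1)   # FFT(x_j) for every block j
    w_freq = np.fft.fft(w, axis=2)                 # FFT(w_ij); can be precomputed and stored
    # Element-wise products summed over j. By linearity of the IFFT, summing in
    # the frequency domain needs one IFFT per output block instead of one per product.
    a_freq = (w_freq * x_freq[None, :, :]).sum(axis=1)
    return np.fft.ifft(a_freq, axis=1).real.reshape(p * k)
```

For example, with p = 2, q = 3, and k = 4 the layer stores 24 weights instead of the 96 entries of the dense 8×12 matrix, and `block_circulant_matvec(w, x)` matches `dense_from_blocks(w) @ x` up to floating-point error.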
MODEL COMPRESSION AND ACCURACY
To reduce computational and storage complexity, many researchers have investigated reducing the number of weight parameters or the number of bits used for weight representation. However, compression techniques can degrade model accuracy. In this section, we discuss the tradeoff between model compression and accuracy loss for the SWM-based technique.
Quantization and Weight Reduction
Data quantization of weights and neurons is a commonly used method for model compression. We use low-bit fixed-point data to represent the neurons and weights instead of floating-point data. We design a bit-wise simulator in C++ to verify the number of bits used for the integer and fractional parts. The structured weight matrix, as a low-rank representation, uses one or several block-circulant matrices to replace the original weight matrix, as discussed in Section 3. As shown in Fig. 2, by partitioning the original weight matrix $W \in \mathbb{R}^{m \times n}$ into $p \times q$ blocks of square sub-matrices, the total number of weights is reduced from $m \times n$ to $\frac{m}{k} \times \frac{n}{k} \times k = (m \times n)/k$, where each block is a $k \times k$ matrix. We further investigate SWM-based DNN models, including DCNNs and LSTMs, with respect to the compression ratio (block size) and model accuracy.
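Both ideas in this subsection reduce to short calculations. The sketch below is not the C++ bit-wise simulator mentioned above; it is a hypothetical Python illustration that assumes a signed fixed-point format in which `int_bits` includes the sign bit, values are rounded to the nearest representable step, and out-of-range values saturate.

```python
import numpy as np

def quantize_fixed_point(values, int_bits, frac_bits):
    """Round values to a signed fixed-point grid and saturate at the range limits.

    Assumed format: int_bits integer bits (including sign) and frac_bits
    fractional bits, so the step is 2**-frac_bits.
    """
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    q = np.round(np.asarray(values, dtype=float) * scale) / scale
    return np.clip(q, lo, hi)

def swm_weight_count(m, n, k):
    """Number of stored weights for an m x n matrix with k x k circulant blocks."""
    assert m % k == 0 and n % k == 0, "block size must divide both dimensions"
    return (m // k) * (n // k) * k
```

For instance, a 512×512 matrix with block size k = 64 needs `swm_weight_count(512, 512, 64)` = 4096 stored weights instead of 262144, a reduction by exactly the factor k; quantizing from 32-bit to 16-bit words would halve the storage again.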
Accuracy Evaluation

Accuracy Evaluation on DCNNs.
The weight storage (model size) reduction and the test accuracy on various image recognition datasets and DCNN models (MNIST (LeNet-5), CIFAR-10, SVHN, STL-10, and ImageNet (using the AlexNet structure) [3, 4, 17, 18, 26]) are discussed in [5]. 16-bit data quantization is adopted, and the baselines are the original DCNN models with unstructured weight matrices and 32-bit floating-point representations. The SWM-based compression technique enables a 400× to more than 4000× reduction in model size in the corresponding FC layers, while the accuracy stays close to that of the original DCNN models, with negligible degradation. Another advantage of the SWM-based technique is that the storage of the weight parameters after compression is regular, whereas prior work [11] introduces irregularity in weight storage. This irregularity requires an extra index per weight parameter and therefore limits the available degree of parallelism.
Accuracy Evaluation on LSTM.
We evaluate the structured-matrix-based compression technique on the TIMIT benchmark, the most commonly used dataset for automatic speech recognition (ASR) applications. The LSTM network is built by stacking multiple LSTM layers. The Google LSTM model [31] with unstructured weight matrices is selected as the baseline model. We preprocess the TIMIT audio data using an FFT-based filterbank, as discussed in [34]. The input speech data have the same number of features as in ESE [9], and the network uses the same architecture. Phone error rate (PER) is adopted to evaluate the prediction accuracy of the model.
The block-circulant matrix based LSTM model enables a comprehensive tuning of model compression ratio by varying the block size k. The PER is close to baseline LSTM when the block size is 2 using SWM-based compression technique. For the SWM-based LSTM models with a block size of 8 and 16, 7.6X and 14.6X model size reduction can be achieved compared with baseline LSTM, respectively. On the other hand, the computational complexity is reduced by 2.6X and 3.7X while the PERs are only 0.32% and 1.23% higher than the baseline.
SWM-BASED HARDWARE DESIGN
FPGA
Overall Architecture.
The overall SWM-based architecture is shown in Fig. 3. The host CPU is responsible for issuing workloads or instructions to the FPGA logic block and monitoring its working status. The FPGA logic part includes the computing unit (containing the basic computing block and the peripheral computing block), the control subsystem, the BRAM block, and, for designs in which the data loaded from external memory requires preprocessing, a preprocess block. The memory hierarchy of the architecture primarily consists of three blocks: host memory, FPGA DDR, and on-chip block memory (BRAM). The control subsystem coordinates the actual FFT/IFFT operations in the basic computing block and the peripheral computing block, and also determines the input size of the FFT/IFFT operations. The twiddle factors of the FFT/IFFT operations (i.e., the $W_n^i$ values, including both real and imaginary parts) are stored in BRAM; the weights, e.g., the FFT results $\mathrm{FFT}(w_{ij})$, are also stored in BRAM.
Figure 4: An example of 8-point basic computing block for FFT using butterfly units.
Computing Unit Designs.
In the computing unit, the peripheral computing block mainly handles component-wise multiplication, activation (ReLU, tanh, and sigmoid), pooling, etc., which require lower computational cost and a smaller hardware footprint. The basic computing block implements an FFT operation with a parallelization degree of $N$ and a depth of $\log N$. Fig. 4 shows an example of an 8-point FFT operation in the basic computing block using butterfly units. The IFFT operation can also be implemented using the $N$-input basic computing block, together with a division operation (i.e., $\div N$) and two conjugations.
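The butterfly structure and the reuse of the FFT block for the IFFT can be illustrated in software. The following is a minimal sketch of a textbook iterative radix-2 decimation-in-time FFT, not a description of the actual hardware datapath; the function names are hypothetical.

```python
import numpy as np

def fft_radix2(x):
    """Iterative radix-2 FFT built from butterflies; len(x) must be a power of two.

    There are log2(N) stages, and each stage performs N/2 butterflies.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    bits = n.bit_length() - 1
    # Bit-reversal permutation of the input order.
    rev = [int(format(i, f"0{bits}b")[::-1], 2) if bits else 0 for i in range(n)]
    a = x[rev]
    size = 2
    while size <= n:                                   # one pass per stage
        half = size // 2
        tw = np.exp(-2j * np.pi * np.arange(half) / size)  # twiddle factors of this stage
        for start in range(0, n, size):
            top = a[start:start + half].copy()
            bot = a[start + half:start + size] * tw
            a[start:start + half] = top + bot          # butterfly: sum output
            a[start + half:start + size] = top - bot   # butterfly: difference output
        size *= 2
    return a

def ifft_via_fft(X):
    """IFFT computed with the forward FFT, two conjugations, and a division by N."""
    X = np.asarray(X, dtype=complex)
    return np.conj(fft_radix2(np.conj(X))) / len(X)
```

For an 8-point input the `while` loop runs three times, matching the depth of $\log N$, and `ifft_via_fft` shows why no separate inverse datapath is needed: conjugating before and after the same forward transform and scaling by $1/N$ recovers the time-domain signal.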
ASIC
To deploy DNNs on mobile/IoT devices, the DNN applications should be implemented in ASICs, which offer a small hardware footprint. The large reduction in both parameter size and computational time complexity makes our SWM-based method suitable for ASIC implementations. Figure 5 shows the architecture of our end-to-end ASIC implementation of SWM-based DNNs. The architecture consists of four main blocks: the input/output interface, the storage system, the processing system, and the global controller.
The input/output interface is in charge of communication between the external environment of the chip and the on-chip storage system. The input interface is composed of an input IO buffer and an input distributor. Similarly, the output interface is composed of an output IO buffer and an output distributor. In terms of data flow, the input IO buffer first receives and buffers data, including input images, weights, and biases, from the external environment. The bandwidth of the IO interface is typically limited to a small number, whereas the bandwidth of the processing system is rather large in order to achieve high parallelism of computation. This mismatch in bandwidth requires an input distributor to temporarily hold the external data until the size of the data reaches the bandwidth requirement of the storage system. In addition, since there are three storage modules inside the storage system, which store inputs/intermediate activations, weights, and biases, respectively, the global controller decides where the buffered data should flow. Similarly, the output distributor receives the final activations from the storage system and is controlled to distribute a portion of the activations into the output IO buffer, which then sends them back to the external system.
As depicted in Figure 6, the storage system comprises three subsystems: a memory bank for storing weights, a register file for storing biases, and a ping-pong buffer (i.e., two alternating register files) for storing image inputs and intermediate activations.
The processing system implements the following equation for each layer:
$$a_i = h\Big(\sum_{j} \mathrm{IFFT}\big(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\big) + b_i\Big),$$
where $w_{ij}$ is the vector of weights at the $i$th row and $j$th column of the weight matrix, $x_j$ is the $j$th vector of inputs/activations, $b_i$ is the $i$th vector of biases, and $h(\cdot)$ is an activation function. According to the above equation, the processing system should contain the modules illustrated in Fig. 7. As the first step in the core computation, the image inputs are loaded from the storage system into the FFT module. Since the weights are used repeatedly without changes, the weight memory bank stores the weights in the frequency domain. Thus the inputs of the Multiply module are $\mathrm{FFT}(x_j)$ and $\mathrm{FFT}(w_{ij})$. Next, the IFFT module performs the inverse FFT over the element-wise product vector, converting it from the frequency domain to the time domain. Then the summation is performed by the Accumulator module, which generates the dot-product of inputs and weights. Finally, the Bias module adds the biases to the dot-products, and the Activation module produces a vector of activations.
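The module sequence of the processing system (FFT, Multiply, IFFT, Accumulator, Bias, Activation) can be mirrored one-to-one in software. The sketch below is a functional model only, under several assumptions: weights are stored in the frequency domain, each block is defined by its first column, the bias is indexed by output block, and ReLU is used as the default activation. All function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def precompute_weight_spectra(w):
    """Offline step: store FFT(w_ij) instead of w_ij. w has shape (p, q, k)."""
    return np.fft.fft(w, axis=2)

def swm_layer(w_freq, x, b, h=relu):
    """Functional model of one layer following the module order in the text.

    w_freq: (p, q, k) stored weight spectra, x: (q*k,) inputs, b: (p, k) biases.
    """
    p, q, k = w_freq.shape
    x_freq = np.fft.fft(x.reshape(q, k), axis=1)   # FFT module
    out = np.zeros((p, k))
    for i in range(p):
        acc = np.zeros(k)                          # Accumulator module
        for j in range(q):
            prod = w_freq[i, j] * x_freq[j]        # Multiply module (element-wise)
            acc += np.fft.ifft(prod).real          # IFFT module, back to time domain
        out[i] = h(acc + b[i])                     # Bias and Activation modules
    return out.reshape(p * k)
```

Because the spectra are precomputed, only the activations pass through the FFT module at run time, which is the reason the text gives for storing frequency-domain weights in the memory bank.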
Another crucial module in the architecture is the global controller, which generates the control signals that keep the whole system functioning correctly.
EVALUATION
FPGA
We implement the proposed framework on small- to medium-scale DNNs using the MNIST, SVHN, and CIFAR-10 benchmarks. We compare the accuracy, performance (kFPS), and energy efficiency (kFPS/W) of the proposed SWM-based FPGA implementation with the state-of-the-art IBM TrueNorth neurosynaptic processor [6] for DCNNs, and with the state-of-the-art ESE accelerator on the Xilinx KU060 platform [9] for LSTMs. We first present the results of three DNNs on MNIST targeting different accuracies, one DNN on SVHN, and two DNNs on CIFAR-10 targeting different accuracies. The first two MNIST DNNs are multi-layer perceptron (MLP) models, which achieve accuracies of 92.9% and 95.6%, respectively. The third MNIST DNN has a CNN structure similar to LeNet-5 [20] and achieves 99.0% accuracy. The first CIFAR-10 DNN has a simple structure, while the second adopts a wide ResNet model [12] and achieves 94.75% accuracy. The baseline system (IBM TrueNorth) has two different MNIST DNNs at two accuracy levels. Experimental results show that, under similar accuracy constraints, the gains of the SWM-based framework in performance and energy efficiency are at least 152X and 72X, respectively. For the LSTM implementation, we propose two structures: (i) the proposed LSTM1 adopts a block size of 16 (FFT16), for which the relative PER degradation of the model is 1.23%; (ii) the proposed LSTM2 uses a block size of 8 (FFT8), for which the relative PER degradation of the model is 0.32%. On the KU060 platform, we achieve 21X and 11X performance speedups over ESE for the proposed LSTM1 and LSTM2, respectively. On the ADM-7V3 platform, compared with ESE, we achieve 18.8X and 10.2X performance enhancements and 33.5X and 19.1X energy efficiency gains using the proposed LSTM1 and LSTM2, respectively.
Since the power consumption of the SWM-based LSTM is only about half that of ESE, the energy efficiency gain is higher than the performance gain. Note that the XCKU060 FPGA is manufactured in a 20nm process while the Virtex-7 is manufactured in a 28nm process, which means that the actual energy efficiency gain should be greater than reported here.
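The relation between the reported speedups and energy efficiency gains can be checked with simple arithmetic. Assuming energy efficiency is performance divided by power, the energy efficiency gain equals the speedup multiplied by the ratio of baseline power to SWM power; the helper name below is hypothetical.

```python
def implied_power_ratio(energy_gain, speedup):
    """Baseline-to-proposed power ratio implied by the two reported gains.

    Assumes energy efficiency = performance / power, so that
    energy_gain = speedup * (P_baseline / P_proposed).
    """
    return energy_gain / speedup

# Reported gains over ESE for the two proposed LSTM structures.
lstm1_ratio = implied_power_ratio(33.5, 18.8)   # about 1.78
lstm2_ratio = implied_power_ratio(19.1, 10.2)   # about 1.87
```

Both ratios are a little under 2, which is consistent with the statement that the SWM-based LSTM consumes roughly half the power of ESE.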
ASIC
In this work, we implement an ASIC design of the SWM-based neural network for an image recognition task and test it with the MNIST dataset. The implemented neural network has the original structure 512 × 512 − 512 × 512 − 512 × 64 − 64 × 10, and this network is transformed into an SWM-based structure. The FFT module implemented in this work is a 64-point FFT, that is, it takes a vector of 64 real-valued numbers as input and generates their frequency-domain representation. Consequently, the weight matrices have the structure 8 × 8 × 64 − 8 × 8 × 64 − 1 × 8 × 64 − 64 × 10, where (m × n × s) denotes a weight matrix with m rows and n columns in which each element is a vector containing s weights (s is 64 in this case). Our weight matrix transformation is not applied to the output layer, so the weights in this layer keep the original 64 × 10 structure. Our ASIC design is implemented in SMIC 40nm technology (including memories) and synthesized with Synopsys Design Compiler 2016. Table 2 shows the hardware performance of our design. As can be observed from the table, the SWM-based neural network exhibits clear advantages in terms of power (0.14 W), throughput (1.14 × 10^6 images/s), and energy efficiency (8.08 × 10^6 images/J), suggesting that this method is well suited for deploying DNNs on mobile/IoT devices.
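The weight counts implied by the two structures, and the relation between the reported power, throughput, and energy efficiency, can be worked out directly. The short calculation below uses only the layer shapes and figures quoted above; biases are not counted, and the variable names are hypothetical.

```python
# Original dense structure: (rows, cols) of each weight matrix.
original_layers = [(512, 512), (512, 512), (512, 64), (64, 10)]
# SWM structure: (block rows, block cols, weights per block) for transformed layers.
swm_layers = [(8, 8, 64), (8, 8, 64), (1, 8, 64)]
output_layer = (64, 10)  # left untransformed

n_dense = sum(m * n for m, n in original_layers)
n_swm = sum(m * n * s for m, n, s in swm_layers) + output_layer[0] * output_layer[1]
compression = n_dense / n_swm

# Energy efficiency follows from throughput / power (images/s divided by W).
throughput = 1.14e6  # images/s
power = 0.14         # W
images_per_joule = throughput / power
```

This gives 557,696 weights for the dense network and 9,344 for the SWM-based one, a reduction of roughly 60×. The computed `images_per_joule` of about 8.14 × 10^6 is close to the reported 8.08 × 10^6; the small gap is presumably due to rounding of the reported power and throughput.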
CONCLUSION
In this work, we propose and evaluate the SWM-based compression technique on both FPGA and ASIC implementations. The SWM-based framework adopts general block-circulant matrices to achieve a fine-grained tradeoff between accuracy and compression ratio, works for both FC and CONV layers, and is supported by a mathematically rigorous proof. For FPGA implementations of DCNNs, the SWM-based framework achieves at least 152X and 72X improvements in performance and energy efficiency, respectively, compared with the IBM TrueNorth processor baseline under the same accuracy constraints on the MNIST, SVHN, and CIFAR-10 data sets. For LSTM networks, the proposed SWM-based LSTM achieves up to 21X higher performance and 33.5X higher energy efficiency than ESE. For ASIC implementations, the proposed SWM-based design exhibits clear advantages in power, throughput, and energy efficiency. Experimental results indicate that this method is well suited for deploying DNNs on both FPGAs and mobile/IoT devices.
ACKNOWLEDGEMENT
This work is funded by the National Science Foundation Awards CNS-1650469, CCF-1733701, CNS-1704662, CCF-1657333, CNS-1739748, and CCF-1733834.
