Deep Neural Networks (DNNs) have achieved immense success in cognitive applications and have greatly advanced today's artificial intelligence. The biggest challenge in executing DNNs is their extremely data-intensive computation. Traditional computing platforms constrain both speed and energy efficiency when running such computation-hungry workloads. Spiking neuromorphic computing (SNC) has been widely investigated for deep network implementation owing to its high efficiency in computation and communication. However, the weights and signals of DNNs must be quantized when deploying DNNs on the SNC, which results in unacceptable accuracy loss. Previous works mainly focus on weight discretization, while inter-layer signals are mostly neglected. In this work, we propose to represent DNNs with fixed integer inter-layer signals and fixed-point weights while maintaining good accuracy. We implement the proposed DNNs on a memristor-based SNC system as a deployment example. With 4-bit data representation, our results show that the accuracy loss can be controlled within 0.02% (2.3%) on MNIST (CIFAR-10). Compared with the 8-bit dynamic fixed-point DNNs, our system can achieve more than 9.8× speedup, 89.1% energy saving, and 30% area saving.
INTRODUCTION
Deep Neural Networks (DNNs) have achieved great success in cognitive applications such as image classification [1, 2, 3], object detection [4, 5], and natural language processing [6]. However, the computations are extremely data-intensive and expensive in terms of speed and energy. The computing power of current von Neumann machines, with their limited data bandwidth and energy efficiency, is insufficient to support these computations. This issue becomes more severe with the rapid growth in the depth of deep network models [7]. Consequently, novel non-von Neumann computing architectures and other hardware-software co-designs based on CPU, GPU, and FPGA have been extensively investigated to improve the computational efficiency [8, 9, 10].
Among these innovative works, brain-like neuromorphic computing appears to be a promising solution: deep networks are implemented in VLSI designs, and high computing efficiency in speed and energy is obtained inherently by performing data processing and communication on a single chip [11]. Neuromorphic designs with digital or analog computation have been reported not only in traditional CMOS technology but also in post-silicon devices such as spin devices and memristors [12, 11, 13, 14]. Owing to its event-driven computation and digitized data communication, spiking neuromorphic computing (SNC) has proven to be ultra-low-cost in design and energy and is highly attractive for deploying and executing deep networks.
Current system-level spiking designs mainly employ an off-line training methodology, in which well-trained deep networks are deployed on the hardware system. Nonetheless, one big challenge arises in such straightforward deployment: obvious system accuracy loss induced by the constrained precision of synapses (or synaptic weights) and neurons (or inter-layer signals). For example, IBM's TrueNorth chip has five synaptic states (i.e., 0, ±1, ±2), and acceptable precision can only be achieved by assembling multiple synaptic layers at the expense of design and energy cost [15]. Similarly, the synaptic weights in memristor-based designs are usually represented by three- or four-bit data. Although memristor devices can afford continuous conductance states, or 6-bit (64 levels) as reported by HP Labs [16], the heavy programming cost in speed and circuit design is not acceptable. In these SNC designs, the neuron signals are rate coded, and the signal strength is represented by the discrete number of spikes in a time window. Obtaining sufficient accuracy therefore sacrifices computation speed: e.g., an 8-bit precision corresponds to 256 spikes and requires a large time window for spike generation.
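To make the rate-coding cost concrete, the following minimal Python sketch counts the spikes needed per time window for a given bit width, following the 8-bit/256-spike correspondence stated above. The function names are hypothetical, and the assumption that the time window scales linearly with the spike count is ours for illustration only; it ignores all other circuit delays.

```python
def spikes_per_window(bits: int) -> int:
    """Spike count needed to rate-code a `bits`-bit signal (8 bits -> 256)."""
    if bits < 1:
        raise ValueError("bits must be a positive integer")
    return 2 ** bits


def window_ratio(bits_a: int, bits_b: int) -> float:
    """Ratio of spike-window lengths for two bit widths.

    Assumption (not from the paper): the window length is proportional
    to the number of spikes that must be generated.
    """
    return spikes_per_window(bits_a) / spikes_per_window(bits_b)


if __name__ == "__main__":
    for m in (8, 5, 4, 3):
        print(f"{m}-bit signal: {spikes_per_window(m)} spikes per window")
    print("8-bit vs 4-bit window ratio:", window_ratio(8, 4))
```

Under this simplified assumption, moving from 8-bit to 4-bit signals shortens the spike window by a factor of 16; the speedup of a complete system also depends on the remaining circuit components.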
While solutions have been explored in previous works, the issue is not yet completely solved. In [17], Wang et al. proposed a one-level precision synapse and applied it to a memristor-based neuromorphic design, but the constrained precision of the neuron signals was not considered. However, as shown in Figure 1, the computing speed of the spiking system is mainly constrained by data communication (i.e., the time required for spike generation to guarantee good accuracy). Moreover, compared to weight quantization, discretizing the neurons results in a larger accuracy loss. In addition, a realistic neuromorphic design of the proposed one-level synapse is challenging because the distribution of synaptic states varies across layers. Recently, researchers tried deploying binary synapses and neurons on the TrueNorth chip, targeting high speed and low energy while retaining accuracy [9]. The training rule is similar to BinaryNet [18] and usually leads to obvious accuracy loss [19].
In this work, we focus on tackling the unacceptable accuracy loss caused by low-precision spike neurons and synapses during deep network deployment. The 32-bit floating-point deep networks are transformed into data quantization-aware networks with fixed integer neurons and fixed-point synapses.
The proposed networks can be applied universally to emerging SNC systems. The memristor-based platform [8, 20, 12] is selected as our deployment example in this work. Our target is to retain the high accuracy of deep networks while building a dedicated hardware framework with high computing efficiency. Our major contributions are summarized as follows:
• We transform the inter-layer signals of deep networks into M-bit fixed integers during neural network training to mimic the discrete spike neurons in the SNC. These integer data are constrained to the same range in all layers and are hence hardware implementation-friendly;
• We propose a weight clustering methodology to represent the synapses with N-bit fixed-point data in a linear distribution. The best affordable states are obtained to improve system accuracy at low design cost;
• We deploy the proposed quantization-aware deep networks on the memristor-based SNC for performance evaluation. The system accuracy on state-of-the-art datasets such as MNIST and CIFAR-10 is measured. The speed, area, and energy are evaluated and compared with the previous 8-bit fixed-point design.
Our experimental results show that, when utilizing 4-bit integer neurons and fixed-point synapses, our accuracy loss relative to the ideal 32-bit floating-point DNNs can be controlled within 0.02% and 2.3% on MNIST and CIFAR10, respectively. Compared with 8-bit fixed-point precision, our system can achieve more than 13.9× speedup, 89.1% energy saving, and 30% area saving.
PRELIMINARY

Related Works in DNNs Quantization
Normally, state-of-the-art deep networks are represented in 32-bit floating point. Quantized DNNs have been explored in previous works to reduce the computation burden and hardware complexity while retaining comparable accuracy. Some earlier works focus on training DNNs with quantized weights, regardless of inter-layer signals [21, 22]. For example, Lin et al. trained deep networks efficiently with binary weights and quantized back propagation [21].
In recent works, implementations of DNNs with fixed-point synaptic weights and inter-layer signals have been proposed. Gysel et al. [23] compressed DNNs into 8-bit dynamic fixed-point values. Fine-tuning was employed to recover the accuracy loss incurred by weight quantization; however, the loss caused by inter-layer signal quantization cannot be recovered. Adopting Gysel's quantization process, Tann et al. [24] proposed DNNs with inter-layer signals in 8-bit dynamic fixed-point precision and weights in integer power-of-two values. Lin et al. [21] proposed to tune DNNs with fixed-point weights and inter-layer signals. These works achieve the target of improving computation efficiency in speed, hardware cost, and energy. Unfortunately, they are not suited to spiking neuromorphic systems for two reasons. First, using 8-bit inter-layer signals in spiking systems is extremely expensive in speed and hardware complexity. Second, the range of the dynamic values varies greatly across layers, which leads to large design complexity. Different from the above works, we implement quantized deep networks that are particularly feasible for spiking neuromorphic systems: the proposed networks have fixed integer neurons and fixed-point synapses, and different layers have uniform values.
DNNs Deployment on SNC
The memristor-based SNC platform is chosen to deploy DNNs in this work. A memristor is a two-terminal device with a MIM (metal-insulator-metal) structure that stores information in its resistance states [25, 26, 27]. Its high-density crossbar structure and multiple states enable a natural implementation of the vector-matrix computation in a neural network, and are thus extremely attractive for emerging neuromorphic approaches [8, 20, 12].
In the SNC implementation, the weights in a neural network layer correspond to the memristor devices in a crossbar array. The outputs of each layer are transformed into spikes and fed into the next layer as inputs. Fully connected layers in a DNN can be mapped onto a crossbar directly [12], while implementing a convolutional layer is more complicated. Figure 2 depicts how to deploy a convolutional layer on the memristor-based SNC. Filters in a convolutional layer are deployed to the crossbar column by column: K th layer. Constrained by the realistic size of the memristor-based crossbar [28], multiple crossbars are utilized in parallel to compose a large layer. The number of crossbars is calculated as in Equation 1.
where t is the row or column size of a square crossbar.
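As an illustration of how a layer can be tiled across fixed-size crossbars, the sketch below estimates the crossbar count for a convolutional layer. The tiling rule is an assumption made for this example and may differ from Equation 1: each filter is flattened into one column, so the layer needs (kernel height × kernel width × input depth) rows and one column per filter, and the resulting matrix is cut into t × t tiles. The function and parameter names are hypothetical.

```python
import math


def crossbars_needed(kernel_size: int, in_depth: int, num_filters: int, t: int) -> int:
    """Estimate the number of t x t crossbars for one convolutional layer.

    Assumed mapping (illustrative, not taken from the paper):
      rows    = kernel_size * kernel_size * in_depth  (one flattened filter)
      columns = num_filters                           (one filter per column)
    """
    if min(kernel_size, in_depth, num_filters, t) < 1:
        raise ValueError("all arguments must be positive integers")
    rows = kernel_size * kernel_size * in_depth
    cols = num_filters
    # Tiles needed along each dimension, rounded up to whole crossbars.
    return math.ceil(rows / t) * math.ceil(cols / t)


if __name__ == "__main__":
    # Hypothetical layer shapes, with t = 32 as in the experimental setup.
    print(crossbars_needed(kernel_size=5, in_depth=1, num_filters=20, t=32))
    print(crossbars_needed(kernel_size=3, in_depth=64, num_filters=128, t=32))
```

The sketch ignores design details such as separate arrays for positive and negative weights, which would scale the count further.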
DESIGN METHODOLOGY
In this work, we aim to construct quantized DNNs with high accuracy, and hence obtain an SNC with optimal accuracy, computation efficiency, and design cost. Two major approaches are proposed: Neuron Convergence for fixed integer inter-layer signals and Weight Clustering for fixed-point weights. All network layers are executed with uniform values to minimize hardware design complexity.
Neuron Convergence
In this work, inter-layer signals are constrained to fixed integers in a dedicated range that is decided by the target bit width. The ranges are the same in all layers to achieve uniform values across the network and alleviate design complexity. The fixed integer is adopted to mimic the discrete output spikes, and the dedicated bit width is designed to decrease the required number of spikes for high speed and low design cost. Notably, quantizing inter-layer signals causes significant accuracy loss. We propose a novel neuron regularization in neural network training to recover the loss. As indicated in Figure 3, during neural network training, l1-norm regularization and truncated l1-norm regularization are usually utilized for weight sparsity and range restriction, respectively. In contrast, we propose a regularization term that trains neural networks whose inter-layer signals are not only sparse but also range-fixed. In particular, the range is uniform across all layers. The loss function in neural network training is formulated as Equation 2.
Here, W represents the weights in the DNN; ED(·) is the loss term; R(·) is the normal regularization on weights; and Rg(·) is the proposed regularization on each inter-layer, whose calculation in the i-th layer is taken over the signals indexed by r, c, and d (row, column, and depth, respectively). The regularization of each inter-layer signal, rg(·), is calculated by Equation 3 and is described by the blue curve in Figure 3.
α is set to be 0. Following the proposed DNNs training, the inter-layer signals are quantized to M-bit integer values with sparse property while retaining good accuracy.
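The final quantization step can be sketched as follows. This is a minimal illustration, not the training procedure itself: it assumes that the M-bit integer range is [0, 2^M − 1] (non-negative, like a spike count) and that signals are rounded to the nearest integer. Both choices, and the function name, are assumptions made for this example.

```python
import math


def quantize_signal(x: float, m_bits: int) -> int:
    """Map a real-valued inter-layer signal to an M-bit integer.

    Assumptions (illustrative): the valid range is [0, 2**m_bits - 1]
    and values are rounded to the nearest integer (halves round up).
    """
    if m_bits < 1:
        raise ValueError("m_bits must be a positive integer")
    upper = 2 ** m_bits - 1
    clipped = min(max(x, 0.0), float(upper))   # enforce the fixed range
    return int(math.floor(clipped + 0.5))      # round to nearest integer


if __name__ == "__main__":
    signals = [-2.0, 0.4, 3.4, 7.5, 20.0]
    print([quantize_signal(s, m_bits=4) for s in signals])
```

Because the same range is used in every layer, one quantizer of this form can serve the whole network, which is the hardware-friendliness argument made above.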
Weight Clustering
In implementing the memristor-based SNC, the floating-point synaptic weights in a DNN are quantized to the available resistance states of the devices, which results in an accuracy drop. We further propose a weight clustering method to achieve fixed-point synaptic values in a linear distribution, which is hardware implementation-friendly and also reduces the accuracy loss. Based on the inter-layer signals obtained in Sec. 3.1, the accuracy loss generated by weight quantization can be represented by Equation 5. Similar to the explanation in Sec. 3.1, the accuracy loss in weight quantization is extremely small. To further lower the loss, we train a cluster to minimize the error between the original weights and the quantized weights, as depicted in Equation 6.
where the elements of D belong to {0, ±1, ±2, ...}, W represents the weight matrix of a DNN, $D/2^N$ is the quantized fixed-point matrix, and N is the target bit width of the weights.
Equation 6 is designed to find a matrix $D/2^N$ whose elements are N-bit fixed-point values with a linear distribution and that is closest to the ideal floating-point matrix in the DNN. Here, we transform the weight quantization into an optimization problem that can be solved by the k-nearest neighbors algorithm.
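A simple baseline for this kind of quantization is nearest-level rounding, sketched below. It is not the clustering procedure proposed here; it only illustrates what a linearly distributed N-bit fixed-point weight looks like, reading the quantized matrix as an integer matrix D divided by 2^N. The largest integer level `d_max`, the squared-error measure, and all function names are hypothetical choices for this example.

```python
import math
from typing import List


def quantize_weight(w: float, n_bits: int, d_max: int) -> float:
    """Round a weight to the nearest level d / 2**n_bits with |d| <= d_max.

    `d_max` is a hypothetical bound on the integer level; the levels are
    evenly spaced, i.e. linearly distributed.
    """
    scale = 2 ** n_bits
    d = math.floor(w * scale + 0.5)      # nearest integer level
    d = max(-d_max, min(d_max, d))       # keep the level inside the allowed set
    return d / scale


def quantization_error(weights: List[float], n_bits: int, d_max: int) -> float:
    """Sum of squared differences between original and quantized weights."""
    return sum((w - quantize_weight(w, n_bits, d_max)) ** 2 for w in weights)


if __name__ == "__main__":
    weights = [0.30, -0.30, 0.90, 0.01]
    print([quantize_weight(w, n_bits=4, d_max=8) for w in weights])
    print(quantization_error(weights, n_bits=4, d_max=8))
```

An optimization-based method can only improve on this baseline by choosing the levels or the assignment more carefully, which is the role of the clustering step.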
EXPERIMENTS
Experimental Setup
Three different DNNs, Lenet, Alexnet, and Resnet, are developed on Torch. The neural network model details and their ideal accuracy (Ideal Acc.) on MNIST and CIFAR10 without quantization are listed in Table 1. The quantized networks produced by our proposed method are implemented on the memristor-based SNC, and the hardware design methodology follows [12]. The resistance range of the memristor device is set to [50 kΩ, 1 MΩ] [12]. The crossbar size is set to 32 × 32, and the required number of crossbars for each network layer is calculated by Equation 1 in Sec. 2.2.
Neuron Convergence on Inter-layer Signals Quantization
The capability of Neuron Convergence in recovering quantization accuracy loss is evaluated, and the results are listed in Table 2. In this experiment, the weights are ideal floating-point values without quantization. The inter-layer signals of Lenet, Alexnet, and Resnet are quantized to 5-bit, 4-bit, and 3-bit integer values using either the proposed training or traditional training without Neuron Convergence. As an example, the accuracy of the two scenarios for Lenet is denoted by "Lenet (w/)" and "Lenet (w/o)" in Table 2. The accuracy recovered from traditional quantization by our proposed method is shown as "Recovered Acc.". The accuracy of our proposed design is also compared with the ideal accuracy, and the accuracy loss is listed in Table 2 as "Acc. Drop".
The results indicate that quantizing inter-layer signals directly, without the proposed Neuron Convergence, induces heavy and unacceptable accuracy loss. For example, the accuracy of Alexnet and Resnet with 3-bit inter-layer signals on CIFAR10 drops to 67.83% and 26.57% from the ideal accuracy of 85.35% and 93.05%, respectively. With our proposed method, the accuracy is recovered to 82.1% and 88.95%. The Lenet network on MNIST is robust: the 4-bit and 3-bit networks have only 0.01% and 0.03% accuracy loss with our proposed training and discretization. Our method can quantize Alexnet and Resnet to 4-bit signals with 83.15% and 91.33% accuracy on CIFAR10. Compared with the ideal accuracy, the accuracy loss caused by the proposed 4-bit precision is only 2.2% and 1.72%. With 5-bit signals, the accuracy drop of the three networks is fully recovered (0% on MNIST) or extremely small (0.15% and 0.55% on CIFAR10) after using our proposed method.
The above results demonstrate that our proposed Neuron Convergence can successfully recover the accuracy loss incurred during signal quantization. Neural networks with fixed integer signals and good accuracy are obtained. The best accuracy with 4-bit inter-layer signals reaches 91.33% on CIFAR10 and 98.15% on MNIST.
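The two derived metrics can be checked with a short calculation. The sketch below uses only the CIFAR10 accuracies quoted in the text and assumes that "Recovered Acc." is the accuracy with Neuron Convergence minus the accuracy without it, and that "Acc. Drop" is the ideal accuracy minus the accuracy with Neuron Convergence. The variable and function names are hypothetical.

```python
# CIFAR10 accuracies (%) quoted in the text above.
IDEAL = {"Alexnet": 85.35, "Resnet": 93.05}
WITHOUT_NC_3BIT = {"Alexnet": 67.83, "Resnet": 26.57}
WITH_NC_3BIT = {"Alexnet": 82.10, "Resnet": 88.95}
WITH_NC_4BIT = {"Alexnet": 83.15, "Resnet": 91.33}


def recovered_acc(with_nc: float, without_nc: float) -> float:
    """Accuracy regained by training with Neuron Convergence (assumed definition)."""
    return round(with_nc - without_nc, 2)


def acc_drop(ideal: float, with_nc: float) -> float:
    """Remaining gap to the unquantized network (assumed definition)."""
    return round(ideal - with_nc, 2)


if __name__ == "__main__":
    for net in ("Alexnet", "Resnet"):
        print(net,
              "recovered (3-bit):", recovered_acc(WITH_NC_3BIT[net], WITHOUT_NC_3BIT[net]),
              "drop (3-bit):", acc_drop(IDEAL[net], WITH_NC_3BIT[net]),
              "drop (4-bit):", acc_drop(IDEAL[net], WITH_NC_4BIT[net]))
```

With these definitions, the 3-bit recovery is 14.27 points for Alexnet and 62.38 points for Resnet, and the 4-bit drops reproduce the 2.2% and 1.72% figures stated above.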
Weight Clustering on Weights Quantization
We also evaluate the performance of the proposed Weight Clustering in recovering the accuracy loss caused by weight quantization. Table 3 shows the experimental results with and without the proposed method. Similarly, networks with 5-bit, 4-bit, and 3-bit fixed-point weights are evaluated, and the inter-layer signals are set to ideal floating-point values without quantization. The results indicate that our proposed 4-bit Lenet, Alexnet, and Resnet achieve 98.1%, 83.69%, and 91% on MNIST and CIFAR10, with only 0.06%, 1.76%, and 2.05% accuracy drop compared to the ideal accuracy.
Neuron Convergence and Weight Clustering on Data Quantization
In this experiment, the proposed Neuron Convergence and Weight Clustering are applied together to the three neural networks for overall performance evaluation. With the proposed method, the inter-layer signals and the weights are quantized to fixed integer values and fixed-point values, respectively, in 5-bit, 4-bit, and 3-bit precision. The accuracy of the networks with and without the proposed method is compared in Table 4. As in Sec. 4.2 and Sec. 4.3, "Recovered Acc." indicates the accuracy recovery ability of our proposed method and "Acc. Drop" is the accuracy loss compared with the ideal accuracy. Besides comparing with the ideal accuracy in Table 1, we also include the accuracy of the 8-bit dynamic fixed-point neural networks in [23] for comparison.
Compared with the 8-bit dynamic fixed-point networks in [23], our proposed networks with 5-bit integer inter-layer signals and fixed-point weights achieve almost the same accuracy: the same accuracy for Lenet on MNIST, only a 0.03% drop for Alexnet on CIFAR10, and a 0.72% drop for Resnet on CIFAR10. Our proposed 4-bit Lenet, Alexnet, and Resnet achieve accuracies of 98.14%, 83.05%, and 90.33% on MNIST and CIFAR10, with only 0.02%, 2.3%, and 2.72% accuracy loss compared with the ideal accuracy. Even with 3-bit data representation, our method achieves 97.46% on MNIST and 87.71% on CIFAR10. These results show that our proposed method can represent DNNs using 4-bit or even 3-bit data in the inter-layer signals and weights while keeping good accuracy.
Improvement on Computation Efficiency
In the SNC system implementation, the computation result of one DNN layer is transformed into spikes by integrate-and-fire circuits (IFCs), and counters generate the digitized outputs [12, 11]. Therefore, a reduced bit width of the signals between layers corresponds to fewer required spikes and thus improved speed, design cost, and energy efficiency. Weight quantization also helps to improve computing efficiency and reduce hardware design complexity by decreasing the utilization of synaptic crossbars and the programming cost.
In this work, the benefit of our proposed DNNs with M-bit fixed integer inter-layer signals and N-bit fixed-point weights in improving computation efficiency is evaluated on the memristor-based SNC. Based on the results in Sec. 4.4, two scenarios with (M, N) = (4, 4) and (3, 3) are implemented and analyzed. The 8-bit dynamic fixed-point design in [23] is also implemented for comparison. In the memristor-based SNC, each computation unit (i.e., neural network layer) includes four components: wordline (WL) drivers to generate robust input signals, memristor-based crossbars to complete the matrix computation, IFCs to convert the current results from the crossbar into spikes, and counters to generate the digitized output of each layer. The speed, energy, and area are obtained from circuit simulation in IBM 130nm technology, and the simulation parameter configuration is based on [12].
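The IFC-plus-counter stage can be pictured with a small behavioral model. The sketch below is an idealized, discrete-time integrate-and-fire neuron followed by a saturating M-bit counter; it is an assumption made for illustration and does not model the circuits of [12]. The threshold, the one-spike-per-step limit, and all names are hypothetical.

```python
def ifc_spike_count(current: float, threshold: float, steps: int, m_bits: int) -> int:
    """Count the spikes emitted by an idealized integrate-and-fire circuit.

    Behavioral assumptions (not taken from the paper):
      - a constant input current is integrated once per time step,
      - a spike fires when the accumulated value reaches `threshold`,
        after which `threshold` is subtracted (at most one spike per step),
      - the counter saturates at 2**m_bits - 1.
    """
    if threshold <= 0 or steps < 0 or m_bits < 1:
        raise ValueError("threshold and m_bits must be positive, steps non-negative")
    max_count = 2 ** m_bits - 1
    potential = 0.0
    count = 0
    for _ in range(steps):
        potential += current              # integrate the crossbar output current
        if potential >= threshold:        # fire
            potential -= threshold
            if count < max_count:         # saturating M-bit counter
                count += 1
    return count


if __name__ == "__main__":
    # A 4-bit output: 16 time steps, counter saturates at 15.
    for i in (0.0, 0.25, 0.5, 2.0):
        print(i, ifc_spike_count(i, threshold=1.0, steps=16, m_bits=4))
```

In this model the number of time steps needed to resolve the full output range grows with 2^M, which is why a smaller M shortens the spike window.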
The results in Table 5 show that our proposed method achieves significant computation efficiency improvement compared with the previous 8-bit dynamic fixed-point DNNs. Our systems have more than 9.8× speedup, 89.1% energy saving, and 29.7% area saving.
CONCLUSIONS
DNN quantization is important for achieving acceptable design complexity and computational efficiency when implementing spiking neuromorphic computing (SNC). However, direct quantization of weights and inter-layer signals causes heavy accuracy loss. In this work, we propose data quantization-aware DNNs with a neuron convergence method and a weight clustering method to recover the accuracy loss in neural network quantization. The obtained fixed integer signals and fixed-point weights particularly benefit the SNC in design cost and computation efficiency. We carefully deploy the quantized DNNs on the memristor-based SNC to study the system efficiency improvement that can be achieved by the proposed method. The system accuracy and performance are evaluated on three networks, Lenet, Alexnet, and Resnet, on MNIST and CIFAR10, and compared with the ideal DNNs and the previous 8-bit dynamic fixed-point DNNs. The results indicate that the design can achieve 98.14% and 90.33% accuracy on MNIST and CIFAR10 with 4-bit data representation, which is only 0.02% and 2.72% lower than the ideal DNNs. Compared with the 8-bit dynamic fixed-point framework, the proposed design demonstrates more than 9.8× speedup, 89.1% energy saving, and 30% area saving.
ACKNOWLEDGMENTS
This work is supported in part by AFRL ICA2017-UP-017. We would like to thank NVIDIA Corporation for their generous GPU donation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of AFRL or its contractors.
