Although deep neural networks (DNNs) are a revolutionary force opening up the AI era, their notoriously large hardware overhead has hindered their application. Recently, several binary and ternary networks, in which the costly multiply-accumulate operations are replaced by accumulations or even binary logic operations, have made on-chip training of DNNs quite promising. There is therefore a pressing need for an architecture that subsumes these networks under a unified framework achieving both higher performance and lower overhead. To this end, two fundamental issues have yet to be addressed. The first is how to implement back propagation when neuronal activations are discrete. The second is how to remove the full-precision hidden weights in the training phase to break the bottlenecks of memory and computation consumption. To address the first issue, we present a multi-step neuronal activation discretization method and a derivative approximation technique that enable the back propagation algorithm to be implemented on discrete DNNs. For the second issue, we propose a discrete state transition (DST) methodology that constrains the weights in a discrete space without saving the hidden weights. In this way, we build a unified framework that subsumes binary and ternary networks as special cases, and under which a heuristic algorithm is provided at https://github.com/AcrossV/Gated-XNOR. In particular, we find that when both the weights and activations are ternary, the DNNs reduce to sparse binary networks, termed gated XNOR networks (GXNOR-Nets), since only the event of a non-zero weight and a non-zero activation enables the control gate to start the XNOR logic operations of the original binary networks. This promises event-driven hardware design for efficient mobile intelligence. We achieve advanced performance compared with state-of-the-art algorithms.
Furthermore, the computational sparsity and the number of states in the discrete space can be flexibly modified to make it suitable for various hardware platforms.
INTRODUCTION
Deep neural networks (DNNs) are developing rapidly with the use of big data sets, powerful models/tricks and GPUs, and have been widely applied in various fields [1] - [8], such as vision, speech, natural language, the game of Go and multimodal tasks. However, their hardware overhead is notoriously large: the enormous memory/computation resources and high power consumption they require have greatly hindered their application. Most of the computing overhead of DNNs results from the costly multiplications of real-valued synaptic weights and real-valued neuronal activations, as well as the accumulation operations. Therefore, several compression methods and binary/ternary networks have emerged in recent years, aiming to put DNNs on efficient devices. The former [9] - [14] reduce the network parameters and connections, but most of them do not change the full-precision multiplications and accumulations. The latter [15] - [20] replace the original computations with only accumulations or even binary logic operations.
In particular, the binary weight networks (BWNs) [15] - [17] and ternary weight networks (TWNs) [17] [18] constrain the synaptic weights to the binary space {−1, 1} or the ternary space {−1, 0, 1}, respectively. In this way, the multiplication operations can be removed. The binary neural networks (BNNs) [19] [20] constrain both the synaptic weights and the neuronal activations to the binary space {−1, 1}, which allows the multiply-accumulate operations to be replaced directly by binary logic operations, i.e. XNOR. Networks of this kind are therefore also called XNOR networks. Even with these most advanced models, some issues remain unsolved. Firstly, the reported networks are based on specially designed discretization and training methods, and there is a pressing need for an architecture that subsumes these networks under a unified framework achieving both higher performance and lower overhead. To this end, how to implement back propagation for online training algorithms when the activations are constrained in a discrete space has yet to be addressed. Secondly, all these networks have to save the full-precision hidden weights in the training phase, which causes frequent data exchange between the external memory for parameter storage and the internal buffer for forward and backward computation.
arXiv:1705.09283v5 [cs.LG] 2 May 2018
In this paper, we propose a discretization framework: (1) A multi-step discretization function that constrains the neuronal activations in a discrete space, and a method to implement the back propagation by introducing an approximated derivative for the non-differentiable activation function; (2) A discrete state transition (DST) methodology with a probabilistic projection operator which constrains the synaptic weights in a discrete space without the storage of full-precision hidden weights in the whole training phase. Under such a discretization framework, a heuristic algorithm is provided at the website https://github.com/AcrossV/Gated-XNOR, where the state number of weights and activations are reconfigurable to make it suitable for various hardware platforms. In the extreme case, both the weights and activations can be constrained in the ternary space {−1, 0, 1} to form ternary neural networks (TNNs). For a multiplication operation, when one of the weight and activation is zero or both of them are zeros, the corresponding computation unit is resting, until the nonzero weight and non-zero activation enable and wake up the required computation unit. In other words, the computation trigger determined by the weight and activation acts as a control signal/gate or an event to start the computation. Therefore, in contrast to the existing XNOR networks, the TNNs proposed in this paper can be treated as gated XNOR networks (GXNOR-Nets). We test this network model over MNIST, CIFAR10 and SVHN datasets, and achieve comparable performance with state-of-the-art algorithms. The efficient hardware architecture is designed and compared with conventional ones. Furthermore, the sparsity of the neuronal activations can be flexibly modified to improve the recognition performance and hardware efficiency. 
In short, the GXNOR-Net promises the ultra efficient hardware for future mobile intelligence based on the reduced memory and computation, especially for the event-driven running paradigm.
We define several abbreviated terms that will be used in the following sections:
(1) CWS: continuous weight space;
(2) DWS: discrete weight space;
(3) TWS: ternary weight space;
(4) BWS: binary weight space;
(5) CAS: continuous activation space;
(6) DAS: discrete activation space;
(7) TAS: ternary activation space;
(8) BAS: binary activation space;
(9) DST: discrete state transition.
UNIFIED DISCRETIZATION FRAMEWORK WITH MULTI-LEVEL STATES OF SYNAPTIC WEIGHTS AND NEURONAL ACTIVATIONS IN DNNS
Suppose that there are $K$ training samples given by $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(\kappa)}, y^{(\kappa)}), \ldots, (x^{(K)}, y^{(K)})\}$, where $y^{(\kappa)}$ is the label of the $\kappa$th sample $x^{(\kappa)}$. In this work, we propose a general deep architecture to efficiently train DNNs in which both the synaptic weights and neuronal activations are restricted to a discrete space $Z_N$ defined as
$$Z_N = \left\{ z_N^n \;\middle|\; z_N^n = \frac{n}{2^{N-1}} - 1,\ n = 0, 1, \ldots, 2^N \right\} \tag{1}$$
where $N$ is a given non-negative integer, i.e., $N = 0, 1, 2, \ldots$, and $\Delta z_N = \frac{1}{2^{N-1}}$ is the distance between adjacent states.
Remark 1. Different values of $N$ in $Z_N$ denote different discrete spaces. Specifically, when $N = 0$, $Z_0 = \{-1, 1\}$ is the binary space and $\Delta z_0 = 2$. When $N = 1$, $Z_1 = \{-1, 0, 1\}$ is the ternary space and $\Delta z_1 = 1$. As seen in (1), the states in $Z_N$ are constrained to the interval $[-1, 1]$; without loss of generality, the range can easily be extended to $[-H, H]$ by multiplying by a scaling factor $H$.
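The state spaces described in Remark 1 can be enumerated directly. The following is a minimal sketch, assuming the states are evenly spaced in [−1, 1] with spacing 1/2^(N−1); the function name `discrete_space` is hypothetical and not taken from the authors' code.

```python
from fractions import Fraction


def discrete_space(N):
    """Return the states of Z_N as exact fractions, lowest first.

    Assumes 2^N + 1 evenly spaced states in [-1, 1].
    """
    if N < 0:
        raise ValueError("N must be a non-negative integer")
    # Spacing 1 / 2^(N-1), written as 2 / 2^N so it stays exact for N = 0.
    step = Fraction(2, 2 ** N)
    return [-1 + n * step for n in range(2 ** N + 1)]


if __name__ == "__main__":
    for N in range(3):
        print(N, [str(z) for z in discrete_space(N)])
```

With N = 0 and N = 1 this yields the binary and ternary spaces of Remark 1; larger N gives the multi-level spaces discussed later in the paper.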
In the following subsections, we first investigate the problem formulation for GXNOR-Nets, i.e, Z N is constrained in the ternary space {−1, 0, 1}. Later we will investigate how to implement back propagation in DNNs with ternary synaptic weights and neuronal activations. Finally a unified discretization framework by extending the weights and activations to multi-level states will be presented.
A. Problem formulation for GXNOR-Net
By constraining both the synaptic weights and neuronal activations to binary states {−1, 1} for the computation in both forward and backward passes, the complicated float multiplications and accumulations change to be very simple logic operations such as XNOR. However, different from XNOR networks, GXNOR-Net can be regarded as a sparse binary network due to the existence of the zero state, in which the number of zero state reflects the networks' sparsity. Only when both the pre-neuronal activation and synaptic weight are non-zero, the forward computation is required, marked as red as seen in Fig. 1 . This indicates that most of the computation resources can be switched off to reduce power consumption. The enable signal determined by the corresponding weight and activation acts as a control gate for the computation. Therefore, such a network is called the gated XNOR network (GXNOR-Net). Actually, the sparsity is also leveraged by other neural networks, such as in [21] [22] .
Suppose that there are L + 1 layers in a GXNOR-Net where both the synaptic weights and neuronal activations are restricted in a discrete space Z 1 = {−1, 0, 1} except the zeroth input layer and the activations of the Lth layer. As shown in Fig. 1 , the last layer, i.e., the Lth layer is followed by a L 2 -SVM output layer with the standard hinge loss, which has been shown to perform better than softmax on several benchmarks [23] [24] .
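For readers unfamiliar with the L2-SVM output layer, the following sketch shows one common form of the squared hinge loss. It is an illustration only: the one-vs-rest targets in {−1, +1}, the unit margin and the function name `squared_hinge_loss` are assumptions, not details confirmed by the text.

```python
import numpy as np


def squared_hinge_loss(scores, labels):
    """Mean squared hinge loss over a batch.

    scores: (batch, classes) real-valued outputs of the last layer.
    labels: (batch,) integer class indices.
    Assumes one-vs-rest targets in {-1, +1} and a margin of 1.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    targets = -np.ones_like(scores)
    targets[np.arange(len(labels)), labels] = 1.0
    margins = np.maximum(0.0, 1.0 - targets * scores)
    return float(np.mean(np.sum(margins ** 2, axis=1)))
```

Because the loss acts on real-valued scores, this is consistent with the statement that the activations of the Lth layer are not discretized.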
Denote $Y_i^l$ as the activation of neuron $i$ in layer $l$, given by
$$Y_i^l = \varphi\Big(\sum_j W_{ij}^l Y_j^{l-1}\Big) \tag{2}$$
for $1 \le l \le L-1$, where $\varphi(\cdot)$ denotes an activation function and $W_{ij}^l$ represents the synaptic weight between neuron $j$ in layer $l-1$ and neuron $i$ in layer $l$. For the $\kappa$th training sample, $Y_i^0$ represents the $i$th element of the input vector $x^{(\kappa)}$, i.e., $Y_i^0 = x_i^{(\kappa)}$.
For the $L$th layer of GXNOR-Net, which is connected to the L2-SVM output layer, the neuronal activation $Y_i^L \in \mathbb{R}$. The optimization model of GXNOR-Net is formulated as
$$\arg\min_{W, Y} E(W, Y) \quad \text{s.t.} \quad W_{ij}^l \in \{-1, 0, 1\}\ (1 \le l \le L), \quad Y_i^l \in \{-1, 0, 1\}\ (1 \le l \le L-1) \tag{3}$$
Here $E(W, Y)$ represents the cost function depending on all synaptic weights (denoted as $W$) and neuronal activations (denoted as $Y$) in all layers of the GXNOR-Net.
For the convenience of presentation, we denote the discrete space when describing the synaptic weight and the neuronal activation as the DWS and DAS, respectively. Then, the special ternary space for synaptic weight and neuronal activation become the respective TWS and TAS. Both TWS and TAS are the ternary space Z 1 = {−1, 0, 1} defined in (1) .
The objective is to minimize the cost function E(.) in GXNOR-Nets by constraining all the synaptic weights and neuronal activations in TWS and TAS for both forward and backward passes. In the forward pass, we will first investigate how to discretize the neuronal activations by introducing a quantized activation function. In the backward pass, we will discuss how to implement the back propagation with ternary neuronal activations through approximating the derivative of the non-differentiable activation function. After that, the DST methodology for weight update aiming to solve (3) will be presented.
B. Ternary neuronal activation discretization in the forward pass
We introduce a quantization function $\varphi_r(x)$ to discretize the neuronal activations $Y^l$ ($1 \le l \le L-1$) by setting
$$\varphi_r(x) = \begin{cases} 1, & x > r \\ 0, & |x| \le r \\ -1, & x < -r \end{cases} \tag{5}$$
As shown in Fig. 2, $\varphi_r(x)$ quantizes the neuronal activation to the TAS $Z_1$, and $r > 0$ is a window parameter that controls the excitability of the neuron and the sparsity of the computation.
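The effect of the window parameter r on sparsity can be illustrated with a short sketch. The treatment of the boundary |x| = r (mapped to 0 here), the default value of r and the function names are assumptions made for illustration.

```python
import numpy as np


def ternary_activation(x, r=0.5):
    """Quantize pre-activations to {-1, 0, 1} with a dead zone of half-width r.

    Values with |x| <= r map to 0 (assumed boundary convention).
    """
    x = np.asarray(x, dtype=float)
    return np.where(x > r, 1, np.where(x < -r, -1, 0)).astype(np.int8)


def activation_sparsity(y):
    """Fraction of zero activations."""
    return float(np.mean(np.asarray(y) == 0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    for r in (0.1, 0.5, 1.0):
        print(r, activation_sparsity(ternary_activation(x, r)))
```

A wider window sends more pre-activations to the zero state, so sparsity grows monotonically with r for a fixed input distribution.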
C. Back propagation with ternary neuronal activations through approximating the derivative of the quantized activation function
After the ternary neuronal activation discretization in the forward pass, model (3) is simplified to the following optimization model
$$\arg\min_{W} E(W) \quad \text{s.t.} \quad W_{ij}^l \in \{-1, 0, 1\}, \quad l = 1, 2, \ldots, L \tag{6}$$
As mentioned in the Introduction, to implement back propagation in the backward pass where the neuronal activations are discrete, we need the derivative of the quantization function $\varphi_r(x)$ in (5). However, $\varphi_r(x)$ is discontinuous, and hence non-differentiable, at its jump points, as shown in Fig. 2(a) and (b). This makes it difficult to implement back propagation in GXNOR-Net directly. To address this issue, we approximate the derivative of $\varphi_r(x)$ with respect to $x$ by the pulse in (7), where $a$ is a small positive parameter that sets the steepness of the derivative in the neighbourhood of each discontinuity. In practice, there are many other ways to approximate the derivative. For example, $\partial\varphi_r(x)/\partial x$ can also be approximated by the pulse in (8) for a small given parameter $a$. The two approximations are shown in Fig. 2(c) and (d), respectively. When $a \to 0$, $\partial\varphi_r(x)/\partial x$ approaches the impulse function in Fig. 2(b).
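As an illustration of the rectangular approximation in Fig. 2(c), the sketch below places a pulse of half-width a around each discontinuity |x| = r. The pulse height 1/(2a) is an assumption, chosen so that each pulse has unit area and therefore tends to an impulse as a → 0; the function name is hypothetical.

```python
import numpy as np


def rect_derivative(x, r=0.5, a=0.5):
    """Rectangular surrogate for the derivative of the ternary quantizer.

    Non-zero only where | |x| - r | <= a. The height 1/(2a) is an assumed
    normalization (unit area per pulse), not a value confirmed by the text.
    """
    x = np.asarray(x, dtype=float)
    inside = np.abs(np.abs(x) - r) <= a
    return np.where(inside, 1.0 / (2.0 * a), 0.0)
```

Shrinking a narrows the band of pre-activations that receive a gradient, which is one way to read the trade-off explored in the pulse-width experiment.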
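Note that the real-valued increment of the synaptic weight $W_{ij}^l$ at the $k$th iteration at layer $l$, denoted as $\Delta W_{ij}^l(k)$, can be obtained from the gradient information
$$\Delta W_{ij}^l(k) = -\eta \cdot \frac{\partial E(W(k), Y(k))}{\partial W_{ij}^l(k)} \tag{9}$$
where $\eta$ represents the learning rate parameter, $W(k)$ and $Y(k)$ denote the respective synaptic weights and neuronal activations of all layers at the current iteration, and
$$\frac{\partial E(W(k), Y(k))}{\partial W_{ij}^l(k)} = Y_j^{l-1} \cdot \frac{\partial \varphi_r(x_i^l)}{\partial x_i^l} \cdot e_i^l \tag{10}$$
where $x_i^l$ is the weighted sum of neuron $i$'s inputs from layer $l-1$:
$$x_i^l = \sum_j W_{ij}^l Y_j^{l-1} \tag{11}$$
and $e_i^l$ is the error signal of neuron $i$ propagated from layer $l+1$:
$$e_i^l = \sum_{\iota} W_{\iota i}^{l+1} \cdot e_\iota^{l+1} \cdot \frac{\partial \varphi_r(x_\iota^{l+1})}{\partial x_\iota^{l+1}} \tag{12}$$
and both $\partial \varphi_r(x_i^l)/\partial x_i^l$ and $\partial \varphi_r(x_\iota^{l+1})/\partial x_\iota^{l+1}$ are approximated through (7) or (8). As mentioned, the $L$th layer is followed by the L2-SVM output layer, and the hinge loss function [23] [24] is applied for training. The error then back-propagates from the output layer to the anterior layers, and the gradient information for each layer can be obtained accordingly.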
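The backward step for a single layer can be sketched as follows, under the usual chain rule: the error arriving at a neuron is scaled by the approximated derivative at its weighted input, then used both for the weight gradient and for the error passed to the previous layer. The rectangular surrogate with height 1/(2a) and all function names are assumptions for illustration.

```python
import numpy as np


def rect_window(x, r, a):
    """Assumed surrogate derivative: height 1/(2a) where | |x| - r | <= a."""
    return np.where(np.abs(np.abs(x) - r) <= a, 1.0 / (2.0 * a), 0.0)


def layer_backward(W, Y_prev, e, r=0.5, a=0.5):
    """Back-propagate through one layer with a ternary activation.

    W:      (n_out, n_in) weights of the layer.
    Y_prev: (n_in,) activations of the previous layer.
    e:      (n_out,) error signal with respect to this layer's activations.
    Returns the weight gradient and the error signal for the previous layer.
    """
    x = W @ Y_prev                      # weighted input of each neuron
    delta = e * rect_window(x, r, a)    # error scaled by surrogate derivative
    grad_W = np.outer(delta, Y_prev)    # gradient with respect to each weight
    e_prev = W.T @ delta                # error propagated to the previous layer
    return grad_W, e_prev
```

The real-valued increment for each weight is then the negative learning rate times `grad_W`.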
D. Weight update by discrete state transition in the ternary weight space
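Now we investigate how to solve (6) by constraining $W$ in the TWS through an iterative training process. Let $W_{ij}^l(k) \in Z_1$ be the weight state at the $k$th iteration step, and $\Delta W_{ij}^l(k)$ be the weight increment on $W_{ij}^l(k)$ derived from the gradient information (9). To guarantee that the next weight will not jump out of $[-1, 1]$, define $\varrho(\cdot)$ to establish a boundary restriction on $\Delta W_{ij}^l(k)$:
$$\varrho(\Delta W_{ij}^l(k)) = \begin{cases} \min\big(1 - W_{ij}^l(k),\ \Delta W_{ij}^l(k)\big), & \Delta W_{ij}^l(k) \ge 0 \\ \max\big(-1 - W_{ij}^l(k),\ \Delta W_{ij}^l(k)\big), & \text{otherwise} \end{cases} \tag{13}$$
and decompose $\varrho(\Delta W_{ij}^l(k))$ as
$$\varrho(\Delta W_{ij}^l(k)) = \kappa_{ij} \Delta z_1 + \nu_{ij} \tag{14}$$
such that
$$\kappa_{ij} = \mathrm{fix}\big(\varrho(\Delta W_{ij}^l(k)) / \Delta z_1\big) \tag{15}$$
and
$$\nu_{ij} = \mathrm{rem}\big(\varrho(\Delta W_{ij}^l(k)),\ \Delta z_1\big) \tag{16}$$
where $\Delta z_1 = 1$ in the ternary space, $\mathrm{fix}(\cdot)$ is a round operation towards zero, and $\mathrm{rem}(x, y)$ generates the remainder of the division between two numbers and keeps the same sign as $x$.
Then, we obtain a projected weight increment $\Delta w_{ij}(k)$ and update the weight by
$$W_{ij}^l(k+1) = W_{ij}^l(k) + \Delta w_{ij}(k) \tag{17}$$
Now we discuss how to project $\Delta w_{ij}(k)$ in the CWS so that the next state $W_{ij}^l(k) + P_{grad}(\varrho(\Delta W_{ij}^l(k)))$ lies in the TWS, i.e. $W_{ij}^l(k+1) \in Z_1$. We denote $\Delta w_{ij}(k) = P_{grad}(\cdot)$ as a probabilistic projection function given by
$$\begin{cases} P\big(\Delta w_{ij}(k) = \kappa_{ij}\Delta z_1 + \mathrm{sign}(\varrho(\Delta W_{ij}^l(k)))\,\Delta z_1\big) = \tau(\nu_{ij}) \\ P\big(\Delta w_{ij}(k) = \kappa_{ij}\Delta z_1\big) = 1 - \tau(\nu_{ij}) \end{cases} \tag{18}$$
where the sign function $\mathrm{sign}(x)$ is given by
$$\mathrm{sign}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & \text{otherwise} \end{cases} \tag{19}$$
and $\tau(\cdot)$ ($0 \le \tau(\cdot) \le 1$) is the state transition probability function of (20),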
where $m$ is a positive constant, the nonlinear factor, that adjusts the transition probability in the probabilistic projection. Formula (18) implies that $\Delta w_{ij}(k)$ is one of $\kappa_{ij} + 1$, $\kappa_{ij} - 1$ and $\kappa_{ij}$. For example, when $\mathrm{sign}(\varrho(\Delta W_{ij}^l(k))) = 1$, $\Delta w_{ij}(k) = \kappa_{ij} + 1$ happens with probability $\tau(\nu_{ij})$ and $\Delta w_{ij}(k) = \kappa_{ij}$ happens with probability $1 - \tau(\nu_{ij})$. Here $\varrho(\cdot)$ denotes the boundary-restricted increment of (13). Basically, $P_{grad}(\cdot)$ describes the transition operation among the discrete states in $Z_1$ defined in (1), i.e., $Z_1 = \{z_1^n \mid z_1^n = n - 1,\ n = 0, 1, 2\}$, where $z_1^0 = -1$, $z_1^1 = 0$ and $z_1^2 = 1$. Fig. 3 illustrates the transition process in the TWS. For example, at the current weight state $W_{ij}^l(k) = z_1^1 = 0$, if $\Delta W_{ij}^l(k) < 0$, then $W_{ij}^l(k+1)$ transfers to $z_1^0 = -1$ with probability $\tau(\nu_{ij})$ and stays at $z_1^1 = 0$ with probability $1 - \tau(\nu_{ij})$; if $\Delta W_{ij}^l(k) \ge 0$, then $W_{ij}^l(k+1)$ transfers to $z_1^2 = 1$ with probability $\tau(\nu_{ij})$ and stays at $z_1^1 = 0$ with probability $1 - \tau(\nu_{ij})$. At the boundary state $W_{ij}^l(k) = z_1^0 = -1$, if $\Delta W_{ij}^l(k) < 0$, then $\varrho(\Delta W_{ij}^l(k)) = 0$ and $P(\Delta w = 0) = 1$, which means that $W_{ij}^l(k+1)$ stays at $z_1^0 = -1$ with probability 1; if $\Delta W_{ij}^l(k) \ge 0$ and $\kappa_{ij} = 0$, then $P(\Delta w = 1) = \tau(\nu_{ij})$, so $W_{ij}^l(k+1)$ transfers to $z_1^1 = 0$ with probability $\tau(\nu_{ij})$ and stays at $z_1^0 = -1$ with probability $1 - \tau(\nu_{ij})$; if $\Delta W_{ij}^l(k) \ge 0$ and $\kappa_{ij} = 1$, then $P(\Delta w = 2) = \tau(\nu_{ij})$, so $W_{ij}^l(k+1)$ transfers to $z_1^2 = 1$ with probability $\tau(\nu_{ij})$ and to $z_1^1 = 0$ with probability $1 - \tau(\nu_{ij})$. A similar analysis holds for the other boundary state $W_{ij}^l(k) = z_1^2 = 1$.
Fig. 3. In DST, the weight can transit directly from the current discrete state (marked by a red circle) to the next discrete state when the weight is updated, without storing the full-precision hidden weight. Depending on the current weight state and on the direction and magnitude of the weight increment $\Delta W_{ij}^l(k)$, there are six transition cases in total when the discrete space is the TWS.
Based on the above results, we can now solve the optimization model (3) with the DST methodology. The main idea is to update the synaptic weight according to (17) in the ternary space $Z_1$ by exploiting the projected gradient information. The main difference between DST and the ideas in recent works such as BWNs [15] - [17], TWNs [17] [18], and BNNs or XNOR networks [19] [20] is illustrated in Fig. 4. In those works, frequent switching and data exchange between the CWS and the BWS or TWS are required during the training phase. The full-precision weights have to be saved at each iteration, and the gradient computation is based on the binary/ternary version of the stored full-precision weights, obtained by a "binarization" or "ternary discretization" step. In stark contrast, the weights in DST are always constrained in a DWS. A probabilistic gradient projection operator is introduced in (18) to transform a continuous weight increment directly into a discrete state transition.
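The DST update described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the tanh form of the transition probability τ is assumed (the text only says that τ lies in [0, 1] and is shaped by the nonlinear factor m), and the function name `dst_update` and its arguments are hypothetical.

```python
import numpy as np


def dst_update(W, dW, N=1, m=3.0, rng=None):
    """One discrete state transition for weights on the grid Z_N.

    W:  current weights, each already a state of Z_N (N = 1 is ternary).
    dW: real-valued increments from the gradient.
    The weights never leave the grid, so no full-precision copy is kept.
    """
    rng = np.random.default_rng() if rng is None else rng
    W = np.asarray(W, dtype=float)
    dW = np.asarray(dW, dtype=float)
    dz = 2.0 / 2 ** N                    # spacing between adjacent states
    # Boundary restriction: clip so that W + rho stays inside [-1, 1].
    rho = np.clip(dW, -1.0 - W, 1.0 - W)
    kappa = np.fix(rho / dz)             # whole state hops, rounded toward zero
    nu = rho - kappa * dz                # remainder, same sign as rho
    # ASSUMED form of the transition probability; tau = 0 when nu = 0.
    tau = np.tanh(m * np.abs(nu) / dz)
    hop = rng.random(W.shape) < tau      # one extra hop with probability tau
    return W + (kappa + np.sign(rho) * hop) * dz


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = np.array([-1.0, 0.0, 1.0, 0.0])
    dW = np.array([-0.7, 0.4, 0.9, -2.5])
    print(dst_update(W, dW, rng=rng))
```

In this sketch only the sign and the fractional remainder of the clipped increment decide whether the extra hop is taken, which is why the update needs no stored continuous weight between iterations.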
Remark 2. In the inference phase, since both the synaptic weights and neuronal activations are in the ternary space, only logic operations are required. In the training phase, the removal of the full-precision hidden weights drastically reduces the memory cost. The logic forward pass and additive backward pass (with only a few multiplications at each neuron node) also simplify the training computation to some extent. In addition, the number of zero states, i.e. the sparsity, can be controlled by adjusting $r$ in $\varphi_r(\cdot)$, which further makes our framework efficient in real applications through the event-driven paradigm.
E. Unified discretization framework: multi-level states of the synaptic weights and neuronal activations
Actually, the binary and ternary networks are not the whole story, since $N$ is not limited to 0 or 1 in $Z_N$ defined in (1) and can be any non-negative integer. There are many hardware platforms that support multi-level discrete spaces for more powerful processing ability [25] - [30].
The neuronal activations can be extended to multi-level cases. To this end, we introduce the following multi-step neuronal activation discretization function
where
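for $1 \le \omega \le 2^{N-1}$. The interval $[-H, H]$ is defined similarly to $Z_N$ in (1). To implement the back propagation algorithm, the derivative of $\varphi_r(x)$ can be approximated at each discontinuity, as illustrated in Fig. 5. Thus, both the forward pass and the backward pass of DNNs can be implemented. At the same time, the proposed DST for weight update can also be implemented in a discrete space with multi-level states. In this case, with $\varrho(\cdot)$ denoting the boundary restriction of (13), the decomposition of the weight increment is revisited as
$$\varrho(\Delta W_{ij}^l(k)) = \kappa_{ij} \Delta z_N + \nu_{ij}, \quad \kappa_{ij} = \mathrm{fix}\big(\varrho(\Delta W_{ij}^l(k)) / \Delta z_N\big), \quad \nu_{ij} = \mathrm{rem}\big(\varrho(\Delta W_{ij}^l(k)),\ \Delta z_N\big) \tag{23}$$
and the probabilistic projection function in (18) is revisited as
$$\begin{cases} P\big(\Delta w_{ij}(k) = \kappa_{ij}\Delta z_N + \mathrm{sign}(\varrho(\Delta W_{ij}^l(k)))\,\Delta z_N\big) = \tau(\nu_{ij}) \\ P\big(\Delta w_{ij}(k) = \kappa_{ij}\Delta z_N\big) = 1 - \tau(\nu_{ij}) \end{cases} \tag{24}$$
Compared with the ternary case in Fig. 3, $\kappa_{ij}$ can be larger than 1, so transitions to more distant states are allowed.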
-1 1 Fig. 6 . Discretization of synaptic weights in DWS with multi-level states.
RESULTS
We test the proposed GXNOR-Nets over the MNIST, CIFAR10 and SVHN datasets; the codes are available at https://github.com/AcrossV/Gated-XNOR. The results are shown in Table 1. The network structure for MNIST is "32C5-MP2-64C5-MP2-512FC-SVM", and that for CIFAR10 and SVHN is "2×(128C3)-MP2-2×(256C3)-MP2-2×(512C3)-MP2-1024FC-SVM". Here MP, C and FC stand for max pooling, convolution and full connection, respectively. Specifically, 2×(128C3) denotes 2 convolution layers with 3 × 3 kernels and 128 feature maps, MP2 means max pooling with window size 2 × 2 and stride 2, and 1024FC represents a fully connected layer with 1024 neurons. SVM is a classifier with squared hinge loss (L2-Support Vector Machine) placed right after the output layer. All the inputs are normalized into the range [-1, +1]. For CIFAR10 and SVHN, we adopt an augmentation similar to that in [24], i.e. 4 pixels are padded on each side of the training images, and a 32 × 32 crop is randomly sampled from the padded image and its horizontally flipped version. In the inference phase, we only test using the single view of the original 32 × 32 images.
Table 1.
Methods                    MNIST     CIFAR10   SVHN
BNNs [19]                  98.60%    89.85%    97.20%
TWNs [17]                  99.35%    92.56%    N.A
BWNs [16]                  98.82%    91.73%    97.70%
BWNs [17]                  99.05%    90.18%    N.A
Full-precision NNs [17]
In (20), the nonlinear factor is m = 3, and the derivative approximation uses the rectangular window in Fig. 2(c) with a = 0.5. The base algorithm for gradient descent is Adam, and the reported performance is the accuracy on the test set.
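The training-time augmentation described for CIFAR10 and SVHN (4-pixel padding, a random 32 × 32 crop, and horizontal flipping) can be sketched as below. The zero padding value, the flip probability of 0.5 and the function name `augment` are assumptions; the text does not specify them.

```python
import numpy as np


def augment(image, pad=4, rng=None):
    """Pad, randomly crop back to the original size, and maybe flip.

    image: (H, W, C) array. Zero padding and a 50% flip chance are assumed.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = int(rng.integers(0, 2 * pad + 1))    # crop offset in 0..2*pad
    left = int(rng.integers(0, 2 * pad + 1))
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                   # horizontal flip
    return crop
```

At inference time no augmentation is applied, matching the single-view evaluation on the original 32 × 32 images.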
A. Performance comparison
The networks for comparison in Table 1 are listed as follows: GXNOR-Nets in this paper (ternary synaptic weights and ternary neuronal activations), BNNs or XNOR networks (binary synaptic weights and binary neuronal activations), TWNs (ternary synaptic weights and full-precision neuronal activations), BWNs (binary synaptic weights and full-precision neuronal activations), and full-precision NNs (full-precision synaptic weights and full-precision neuronal activations). Over MNIST, BWNs [16] use fully connected networks with 3 hidden layers of 1024 neurons and an L2-SVM output layer, BNNs [19] use fully connected networks with 3 hidden layers of 4096 neurons and an L2-SVM output layer, while our paper adopts the same structure as BWNs [17]. Over CIFAR10 and SVHN, we remove the last fully connected layer in BWNs [16] and BNNs [19]. Compared with BWNs [17], we just replace the softmax output layer by an L2-SVM layer. It is seen that the proposed GXNOR-Nets achieve performance comparable with the state-of-the-art algorithms and networks. In fact, the accuracy of 99.32% (MNIST), 92.50% (CIFAR10) and 97.37% (SVHN) outperforms most of the existing binary or ternary methods. In GXNOR-Nets, the weights are always constrained in the TWS {−1, 0, 1} without saving the full-precision hidden weights, unlike the reported networks in Table 1, and the neuronal activations are further constrained in the TAS {−1, 0, 1}. The results indicate that it is possible to perform well even with this kind of extremely hardware-friendly network architecture. Furthermore, Fig. 7 shows how the error curve evolves as a function of the training epoch. The GXNOR-Net achieves comparable final accuracy, but converges more slowly than the full-precision continuous NN.
Fig. 7. Training curve. GXNOR-Net can achieve comparable final accuracy, but converges slower than full-precision continuous NN.
B. Influence of m, a and r
Fig. 8. Influence of the nonlinear factor for probabilistic projection. A properly larger m value obviously improves performance, while too large value further helps little.
We analyze the influence of several parameters in this section. Firstly, we study the nonlinear factor m in equation (20) for probabilistic projection. The results are shown in Fig. 8, in which a larger m indicates stronger nonlinearity. Properly increasing m clearly improves the network performance, while a too large m brings little further gain. Since m = 3 obtains the best accuracy, we use this value in the other experiments.
Secondly, we use the rectangular approximation in Fig.  2(c) as an example to explore the impact of pulse width on the recognition performance, as shown in Fig. 9 . Both too large and too small a value would cause worse performance and in our simulation, a = 0.5 achieves the highest testing accuracy. In other words, there exists a best configuration for approximating the derivative of non-linear discretized activation function.
Finally, we investigate the influence of the sparsity on the network performance; the results are presented in Fig. 10. Here the sparsity represents the fraction of zero activations. By controlling the width of the sparse window (determined by r) in Fig. 2(a), the sparsity of the neuronal activations can be flexibly modified. The network usually performs better when the state sparsity is properly increased. However, the performance degrades significantly when the sparsity increases further, and it approaches 0 as the sparsity approaches 1. This indicates that there exists a best sparse space for a specified network and data set, probably because a proper increase of zero neuronal activations reduces the network complexity, so that overfitting can be avoided to a great extent, as with the dropout technique [31]. If the network becomes too sparse, however, the valid neuronal information is reduced significantly, which causes the performance degradation. Based on this analysis, it is easy to understand why the GXNOR-Nets in this paper usually perform better than the BWNs, BNNs and TWNs. On the other hand, a sparser network is more hardware friendly, which means that higher accuracy and lower hardware overhead can be achieved at the same time by configuring the computational sparsity.
Fig. 9. Influence of the pulse width for derivative approximation. The pulse width for derivative approximation of non-differentiable discretized activation function affects the network performance. 'Not too wide & not too narrow pulse' achieves the best accuracy.
Fig. 10. Influence of the sparsity of neuronal activations (axis: Sparsity of Neuronal Activations). Here the sparsity represents the fraction of zero activations. By properly increasing the zero neuronal activations, i.e. computation sparsity, the recognition performance can be improved. There exists a best sparse space of neuronal activations for a specific network and dataset.
C. Event-driven hardware computing architecture
For the different networks in Table 1, the hardware computing architectures can be quite different. As illustrated in Fig. 11, we present typical hardware implementation examples for a triple-input-single-output neural network; the corresponding original network is shown in Fig. 11(a). The conventional hardware implementation of a full-precision NN is based on multipliers for the multiplications of activations and weights, and an accumulator for the dendritic integration, as shown in Fig. 11(b). Although a unit for the nonlinear activation function is required, we ignore it in all cases of Fig. 11, so that we can focus on how the different discrete spaces influence the implementation architecture. The recent BWN in Fig. 11(c) replaces the multiply-accumulate operations by a simple accumulation operation, with the help of multiplexers. When W_i = 1, the neuron accumulates X_i; otherwise, the neuron accumulates −X_i. In contrast, the TWN in Fig. 11(d) implements the accumulation under an event-driven paradigm by adding a zero state to the binary weight space. When W_i = 0, the neuron is regarded as resting; only when the weight W_i is non-zero, which is termed an event, is the neuron accumulation activated. In this sense, W_i acts as a control gate. By constraining both the synaptic weights and neuronal activations in the binary space, the BNN in Fig. 11(e) further simplifies the accumulation operations of the BWN to efficient binary logic XNOR and bitcount operations. Similar to the event control of the TWN, the TNN proposed in this paper further introduces the event-driven paradigm on top of the binary XNOR network. As shown in Fig. 11(f), only when both the weight W_i and the input X_i are non-zero are the XNOR and bitcount operations enabled and started. In other words, whether W_i or X_i equals zero plays the role of closing or opening the control gate, hence the name gated XNOR network (GXNOR-Net).
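The gating behaviour can be mimicked in software to check that it reproduces the ordinary dot product. The sketch below is a functional illustration only, not a hardware description; the function name and the way the bitcount is converted back to a signed sum are choices made for this example.

```python
import numpy as np


def gated_xnor_neuron(w, x):
    """Weighted sum of ternary inputs using gated XNOR and bitcount.

    w, x: sequences with entries in {-1, 0, 1}.
    Returns (weighted_sum, number_of_active_xnor_operations).
    """
    w = np.asarray(w, dtype=np.int8)
    x = np.asarray(x, dtype=np.int8)
    gate = (w != 0) & (x != 0)               # event: both operands non-zero
    # XNOR on the sign bits of the active pairs: True when the signs agree.
    agree = (w[gate] > 0) == (x[gate] > 0)
    active = int(gate.sum())
    popcount = int(agree.sum())
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * popcount - active, active


if __name__ == "__main__":
    total, active = gated_xnor_neuron([1, -1, 0, 1, -1], [1, 1, -1, 0, -1])
    print(total, active)
```

Pairs with a zero weight or a zero input never reach the XNOR stage, which is the software analogue of a resting computation unit.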
Table 2 shows the operations required by the typical networks in Fig. 11. Here we assume that the neuron has M inputs and one output. The BWN removes the multiplications of the original full-precision NN, and the BNN replaces the arithmetic operations with efficient XNOR logic operations. In full-precision NNs, BWNs (binary weight networks) and BNNs/XNOR networks (binary neural networks), most states of the activations and weights are non-zero, so their resting probability is ≈ 0.0%. The TWN and GXNOR-Net, in contrast, introduce the event-driven paradigm. If the states in the ternary space {−1, 0, 1} follow a uniform distribution, the resting probability of the accumulation operations in the TWN reaches 33.3%, and that of the XNOR and bitcount operations in the GXNOR-Net further reaches 55.6%. Specifically, in TWNs (ternary weight networks), the synaptic weight has three states {−1, 0, 1} while the neuronal activation is full-precision, so a resting computation only occurs when the synaptic weight is 0, with average probability 1/3 ≈ 33.3%. In GXNOR-Nets, both the neuronal activation and the synaptic weight have three states {−1, 0, 1}, so a resting computation occurs when either the neuronal activation or the synaptic weight is 0. The average probability is 1 − (2/3) × (2/3) = 5/9 ≈ 55.6%. Note that Table 2 is based on the assumption that the states of all the synaptic weights and neuronal activations follow a uniform distribution. The resting probability therefore varies across networks and data sets, and the reported values can only be used as rough guidelines.
Fig. 12 demonstrates an example hardware implementation of the GXNOR-Net from Fig. 1. The original 21 XNOR operations can be reduced to only 9 XNOR operations, and the bit width required for the bitcount operations can also be reduced. In other words, in a GXNOR-Net, most operations stay in the resting state until the valid gate control signals, determined by whether both the weight and the activation are non-zero, wake them up. This sparse property promises the design of ultra-efficient intelligent devices with the help of the event-driven paradigm, like the well-known event-driven TrueNorth neuromorphic chip from IBM [25], [26].
Fig. 12. Implementation of the GXNOR-Net example (panel: GXNOR-based Neuron Array). By introducing the event-driven paradigm, most of the operations are efficiently kept in the resting state until the valid gate control signals wake them up. The signal is determined by whether both the weight and activation are non-zero.
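The two resting probabilities quoted above follow from a simple enumeration over uniformly distributed ternary states, as this short check shows. The variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

STATES = (-1, 0, 1)

# TWN: the operation rests whenever the weight is zero.
twn_rest = Fraction(sum(w == 0 for w in STATES), len(STATES))

# GXNOR-Net: the operation rests whenever the weight or the activation is zero.
pairs = list(product(STATES, repeat=2))
gxnor_rest = Fraction(sum(w == 0 or x == 0 for w, x in pairs), len(pairs))

if __name__ == "__main__":
    print(twn_rest, float(twn_rest))
    print(gxnor_rest, float(gxnor_rest))
```

Replacing the uniform assumption with measured state frequencies from a trained network would give the resting probability for that specific network and data set.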
D. Multiple states in the discrete space
According to Fig. 5 and Fig. 6, the discrete spaces of synaptic weights and neuronal activations can have multi-level states. Similar to the definition of Z_N in (1), we denote the state parameters of DWS and DAS as N_1 and N_2, respectively. The numbers of available states for weights and activations are then 2^{N_1} + 1 and 2^{N_2} + 1, respectively. N_1 = 0 or N_1 = 1 corresponds to binary or ternary weights, and N_2 = 0 or N_2 = 1 corresponds to binary or ternary activations. We test the influence of N_1 and N_2 on the MNIST dataset, and Fig. 13 presents the results, where a larger circle denotes higher test accuracy. In the weight direction, the network performs best when N_1 = 6, while in the activation direction the best performance occurs at N_2 = 4. This indicates that there exists a best discrete space in either the weight direction or the activation direction, which is similar to the conclusions from the influence analysis of m in Fig. 8, a in Fig. 9, and sparsity in Fig. 10. In this sense, discretization is also an efficient way to avoid network overfitting, which improves the algorithm performance. The investigation in this section can serve as guidance for choosing the best discretization for a particular hardware platform given its computation and memory resources.
Fig. 13. Influence of the state number in the discrete space. The state number in the discrete spaces of weights and activations (DWS and DAS) can take multi-level values. The state parameters of the weight space and activation space are denoted as N_1 and N_2, respectively, similar to the definition of Z_N in (1). There exists a best discrete space with respect to either the weight direction or the activation direction, located at N_1 = 6 and N_2 = 4.
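To make the relation between a state parameter and the resulting discrete space concrete, the sketch below enumerates the states for a given parameter value. It assumes that the 2^N + 1 states are evenly spaced over [−1, 1], which is consistent with N = 0 giving binary and N = 1 giving ternary values. The helper names `state_count`, `discrete_states` and `bits_per_value` are hypothetical, and the bit count is just the minimum width of a plain binary encoding, not a statement about any particular hardware.

```python
from fractions import Fraction


def state_count(n_param):
    """Number of states for state parameter N: 2^N + 1."""
    return 2 ** n_param + 1


def discrete_states(n_param):
    """Evenly spaced states over [-1, 1] (assumed layout of the discrete space)."""
    count = state_count(n_param)
    step = Fraction(2, count - 1)
    return [-1 + k * step for k in range(count)]


def bits_per_value(n_param):
    """Minimum bits to index one state, i.e. ceil(log2(state_count))."""
    return (state_count(n_param) - 1).bit_length()


print([int(s) for s in discrete_states(0)])  # [-1, 1]     binary
print([int(s) for s in discrete_states(1)])  # [-1, 0, 1]  ternary
print(state_count(6), bits_per_value(6))     # 65 7   (best weight setting, N_1 = 6)
print(state_count(4), bits_per_value(4))     # 17 5   (best activation setting, N_2 = 4)
```

Under these assumptions, the best-performing setting reported above corresponds to 65 weight states and 17 activation states, which gives a rough feel for the memory cost that has to be weighed against accuracy on a given platform.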
CONCLUSION AND DISCUSSION
This work provides a unified discretization framework for both synaptic weights and neuronal activations in DNNs, in which the derivative of the multi-step activation function is approximated and the storage of full-precision hidden weights is avoided by using a probabilistic projection operator to realize DST directly. On this basis, the complete back propagation learning process can be implemented conveniently when both the weights and activations are discrete. In contrast to existing binary or ternary methods, our model can flexibly modify the state number of weights and activations to suit various hardware platforms and is not limited to the special cases of binary or ternary values. We test our model with ternary weights and activations (GXNOR-Nets) on the MNIST, CIFAR10 and SVHN datasets, and achieve performance comparable with state-of-the-art algorithms. The non-zero state of the weight and activation acts as a control signal that either enables the computation unit or keeps it resting. GXNOR-Nets can therefore be regarded as a kind of "sparse binary network" whose sparsity can be controlled by adjusting a pre-given parameter. Moreover, this "gated control" behaviour promises efficient hardware implementation using the event-driven paradigm, which we have compared with several typical neural networks and their hardware computing architectures. The computation sparsity and the number of states in the discrete space can be properly increased to further improve the recognition performance of the GXNOR-Nets.
We have also tested the performance of the two curves in Fig. 2 for derivative approximation. We find that the pulse shape (rectangle or triangle) affects the accuracy less than the pulse width (or steepness) does, as shown in Fig. 9. We therefore recommend the rectangular curve in Fig. 2(c), because it is simpler than the triangular curve in Fig. 2(d) and makes the approximation more hardware-friendly.
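The difference in hardware cost between the two pulse shapes can be seen in a small sketch. The code below is an illustration only and does not reproduce the exact curves of Fig. 2. It assumes a ternary activation with jump points at |x| = r, pulses of width a centered on each jump, and unit area per pulse. The names `rect_derivative` and `tri_derivative` and the default values of `r` and `a` are hypothetical.

```python
def rect_derivative(x, r=0.5, a=0.5):
    """Rectangular pulse: constant height 1/a within a/2 of a jump point |x| = r."""
    return 1.0 / a if abs(abs(x) - r) <= a / 2 else 0.0


def tri_derivative(x, r=0.5, a=0.5):
    """Triangular pulse: peak 2/a at |x| = r, falling linearly to 0 at distance a/2."""
    d = abs(abs(x) - r)  # distance to the nearest jump point
    return (2.0 / a) * (1.0 - d / (a / 2)) if d <= a / 2 else 0.0


for x in (0.0, 0.5, 0.625, -0.5):
    print(x, rect_derivative(x), tri_derivative(x))
# 0.0   0.0 0.0
# 0.5   2.0 4.0
# 0.625 2.0 2.0
# -0.5  2.0 4.0
```

The rectangular version needs only a range comparison and a constant, whereas the triangular version also needs a subtraction and a scaling per input, which is one way to read the hardware-friendliness argument above.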
From the above analysis, GXNOR-Net can dramatically simplify the computation in the inference phase and reduce the memory cost in both the training and inference phases. Regarding the training computation, however, although it removes the multiplications and additions in the forward pass and most multiplications in the backward pass, it causes slower convergence and incurs probabilistic sampling overhead. On a powerful GPU platform with huge computation resources, the savings from the reduced multiplications may offset the overhead of these two issues. On embedded platforms (e.g. FPGA/ASIC), however, they require elaborate architecture design.
Although GXNOR-Nets promise event-driven and efficient hardware implementation, the quantitative advantages are not so large on current digital technology alone, because generating the control gate signals also incurs extra overhead. The power consumption can nevertheless be reduced to a certain extent owing to fewer state flips in the digital circuits, and it can be further optimized by increasing the computation sparsity. Even more promising, some emerging nanodevices exhibit similar event-driven behaviour, such as gated-control memristive devices [32], [33]. With these devices, the multi-level multiply-accumulate operations can be implemented directly, and the computation is controlled by the event signal injected into the third terminal of a control gate. These characteristics naturally match our model, which offers multi-level weights and activations by modifying the number of states in the discrete space, as well as an event-driven paradigm with flexible computation sparsity.
