ABSTRACT Neuromorphic hardware inspired by the brain has attracted much attention for its advanced information processing concept. However, implementing online learning on a neuromorphic chip remains challenging. In this paper, we present a bio-plausible online-learning spiking neural network (SNN) model for hardware implementation. The SNN consists of an input layer, an excitatory layer, and an inhibitory layer. To reduce resource cost and accelerate information processing in hardware, online learning is realized by trace-based spike-timing-dependent plasticity (STDP). Neuron and synapse activities are digitized, and the decay of neuron and synapse parameters is realized by bit-shift operations. After learning from the training set of the Modified National Institute of Standards and Technology (MNIST) database, the model successfully recognizes the digits in the MNIST test set, demonstrating its feasibility and capability. The recognition accuracy increases from 90.0% to 94.5% as the number of excitatory/inhibitory neurons rises from 400 to 3,500, which provides guidance for trading off recognition accuracy against resource cost in hardware implementation. Encouragingly, compared with its floating-point counterpart, the proposed model reduces hardware resources and power consumption by 40.7% and 36.3%, respectively (in a 55-nm CMOS process).
I. INTRODUCTION
A neuromorphic computing platform, inspired by the advanced information processing scheme of the brain [1], is more efficient and bio-plausible than the traditional von Neumann computing platform when dealing with brain-like computation tasks (such as pattern recognition) [2]. Therefore, it has attracted significant attention in recent years. However, an effective method to implement online learning in a neuromorphic computing platform is still missing [3]. For example, SpiNNaker, one of the representative neuromorphic platforms, is based on multi-core ARM digital chips and may still suffer from the von Neumann bottleneck [4], [5]. TrueNorth, another of the most advanced platforms, does not support any synaptic plasticity mechanism and thus does not have online learning ability [6]-[8]. In these cases, neuron and synapse parameters can only be mapped onto the chip off-line to achieve neural network functions.
The associate editor coordinating the review of this manuscript and approving it for publication was Bora Onat.
Among the various spiking neural network (SNN) training methods [9], training an SNN with the traditional back-propagation (BP) algorithm [10]-[14] and converting a trained artificial neural network (ANN) into an SNN by mapping [15]-[18] are inefficient and biologically implausible. Recently, bio-plausible learning rules, such as spike-timing-dependent plasticity (STDP), have been combined with the BP algorithm to achieve supervised learning in SNNs [19]-[21]. For example, Tavanaei proposed a BP-STDP algorithm that approximates backpropagation using STDP and achieved a recognition accuracy of 97.2% on the Modified National Institute of Standards and Technology (MNIST) test set [21]. However, these models cannot realize unsupervised online learning and thus cannot emulate the brain's autonomous learning function well (since learning in the brain is unsupervised and event-based). More recently, unsupervised learning based on STDP and other biologically plausible learning rules was reported by several groups [22], [23]. Diehl proposed an SNN model that achieves unsupervised learning based on a bio-plausible STDP learning rule and reached a recognition accuracy of 95.0% on the MNIST test set [23]. However, this model is implemented on the Brian2 spiking neural network simulator [24] and involves a large number of floating-point calculations, which makes it extremely difficult to implement in hardware.
This paper presents a hardware-oriented online learning neuromorphic computing model. A biologically plausible STDP rule is used in the model to achieve unsupervised learning. Moreover, neuron and synapse activities are digitized, and the decay of the neuron and synapse parameters is realized by bit-shift operations, which can significantly reduce the hardware cost. With a model size of 784-3500-3500, a recognition accuracy of 94.5% is achieved on the MNIST test set. Moreover, in the 55 nm CMOS process, the hardware resources and power consumption are reduced by 40.7% and 36.3%, respectively, relative to the corresponding floating-point model (model size: 784-784-784). The model can be used as a reference model to design a neuromorphic computing platform with an online learning engine, and can also be used to verify the performance of such a platform in the future. Efficiently embedded online learning enables neuromorphic platforms to adapt to and learn new features from changing environments, which has various practical application scenarios, such as autonomous smart sensing in the Internet-of-Things (IoT) [25], autonomous embedded systems [26] and robots [27], brain-machine interfaces [28], and experimental neuroscience platforms [29].
II. MODEL DESCRIPTION
A. NETWORK ARCHITECTURE
As shown in Fig. 1, the SNN architecture consists of three layers; the black and blue arrows denote excitatory and inhibitory synapses, respectively. The first layer is an input layer that receives spikes from the outside. The second layer is an excitatory layer, and the synapses between the input neurons and excitatory neurons are excitatory synapses, whose weights are stored in the weight matrix $M_{xe}$. The delay time of each of these synapses is randomly assigned (e.g., 0∼10 ms) during initialization and is not changed afterwards. Moreover, the delay times should be integer multiples of the time step (e.g., 0.5 ms) for efficient hardware computation. The third layer is an inhibitory layer, and there is a one-to-one correspondence between the excitatory neurons and inhibitory neurons, i.e., the number of inhibitory neurons equals the number of excitatory neurons. A competitive learning mechanism, similar to the one used in the Self-Organizing Map (SOM) model [30] and the winner-take-all model [31], [32], is introduced among the excitatory neurons. When an excitatory neuron fires a spike, it excites its corresponding inhibitory neuron through an excitatory synapse (with a large fixed weight). The inhibitory neuron, in turn, inhibits all other excitatory neurons through inhibitory synapses. The weights of the excitatory and inhibitory synapses between the excitatory neurons and inhibitory neurons are stored in the weight matrices $M_{ei}$ and $M_{ie}$, respectively. Fig. 1 also shows the sizes of the weight matrices and the delay matrix. The activities of the neurons and synapses are digitized and represented by fixed-point numbers, and the sizes of the parameters are shown in Fig. 2.
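The connectivity described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_connectivity`, the initial weight range `w_init_max`, and the fixed weights `w_ei` and `w_ie` are hypothetical, and the matrices are stored as plain nested lists rather than in any hardware layout.

```python
import random


def build_connectivity(n_input, n_exc, w_ei, w_ie,
                       max_delay_ms=10.0, dt_ms=0.5,
                       w_init_max=1 << 16, seed=0):
    """Build the weight and delay matrices of the three-layer SNN (sketch)."""
    rng = random.Random(seed)
    max_steps = round(max_delay_ms / dt_ms)  # 10 ms / 0.5 ms = 20 time steps

    # M_xe: plastic input -> excitatory weights (initial range is hypothetical).
    m_xe = [[rng.randrange(w_init_max) for _ in range(n_exc)]
            for _ in range(n_input)]
    # Delays are drawn once, as integer multiples of the time step.
    delay_steps = [[rng.randint(0, max_steps) for _ in range(n_exc)]
                   for _ in range(n_input)]
    # M_ei: one-to-one excitatory -> inhibitory connections, fixed weight.
    m_ei = [[w_ei if i == j else 0 for j in range(n_exc)]
            for i in range(n_exc)]
    # M_ie: each inhibitory neuron inhibits every excitatory neuron
    # except its own partner.
    m_ie = [[0 if i == j else w_ie for j in range(n_exc)]
            for i in range(n_exc)]
    return m_xe, delay_steps, m_ei, m_ie
```

Storing the delays as step counts rather than milliseconds keeps every later operation on them in integer arithmetic.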
B. NEURON MODEL
Integrate-and-Fire (IF) neurons [33], together with excitatory and inhibitory synapses, are used as the basic building blocks of the three-layer SNN (i.e., an input layer, an excitatory layer, and an inhibitory layer). The excitatory synapses between the input neurons and excitatory neurons have a delay unit, and their weights are adjusted autonomously according to the STDP rule [34], whereas the excitatory synapses between the excitatory neurons and inhibitory neurons and the inhibitory synapses have no delay unit, and their weights $M_{ei}$ and $M_{ie}$ are fixed. The conductance of an excitatory neuron is updated at each time step following the rule:
$$g_j^{exc}(t) = g_j^{exc}(t-1) + \sum_{i=1}^{m} x_{ij}(t)\, w_{ij} \quad (1)$$
$$g_j^{inh}(t) = g_j^{inh}(t-1) + \sum_{l=1}^{n} x_{lj}(t)\, w_{lj} \quad (2)$$
where i, j, and l are the indexes of the input neuron, the excitatory neuron, and the inhibitory neuron, respectively; $g_j^{exc}$ and $g_j^{inh}$ are the excitatory conductance and the inhibitory conductance of excitatory neuron j, respectively; $x_{ij}$ is the input from input neuron i to excitatory neuron j, which can only be 1 or 0 (spike or no spike); $x_{lj}$ is the input from inhibitory neuron l to excitatory neuron j; $w_{ij}$ is the weight from input neuron i to excitatory neuron j; $w_{lj}$ is the weight from inhibitory neuron l to excitatory neuron j; m and n are the numbers of input neurons and inhibitory neurons, respectively. The excitatory and the inhibitory currents are then updated according to:
where $I_j^{exc}$ and $I_j^{inh}$ are the excitatory current and inhibitory current of neuron j, respectively; $V_j$ is the membrane potential of neuron j.
The membrane potential is updated according to the IF neuron model:
where $V_j^{th}$ and $V_j^{reset}$ are the threshold and reset membrane potential of neuron j, respectively. The membrane potential is updated by adding the excitatory current and inhibitory current simultaneously. If the updated membrane potential is higher than the threshold, the neuron fires a spike, and its membrane potential is then reset to $V_j^{reset}$; otherwise, the membrane potential remains unchanged. When a neuron fires a spike, its threshold is increased by a specific value (e.g., $5 \times 10^5$); otherwise, it decreases exponentially with time (discussed later). After firing a spike, the neuron enters the refractory period, during which its parameters are not changed and no spike is fired. Moreover, a timer and a counter are used to record the duration of the refractory period and the number of spikes, respectively.
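The firing, reset, threshold adaptation, and refractory behavior described above can be illustrated with the following Python sketch. It is not the authors' implementation: the class name `IFNeuron`, the refractory length in time steps, and the convention that the inhibitory current is passed as a signed (non-positive) integer are assumptions made for illustration, and the currents are taken as given inputs because their update equations are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class IFNeuron:
    """Integer integrate-and-fire neuron with adaptive threshold (sketch)."""
    v: int = 0                      # membrane potential
    v_reset: int = 0                # reset potential
    v_th: int = 200_000_000         # threshold (2e8, the initial value reported in the results)
    th_increment: int = 500_000     # threshold increase per spike (5e5)
    refractory_steps: int = 10      # hypothetical refractory length in time steps
    refractory_left: int = 0        # timer for the refractory period
    spike_count: int = 0            # counter for emitted spikes

    def step(self, i_exc: int, i_inh: int) -> bool:
        """Advance one time step; return True if the neuron fires."""
        if self.refractory_left > 0:
            # During the refractory period nothing is updated and no spike is fired.
            self.refractory_left -= 1
            return False
        # Assumption: i_inh is signed, so both currents are simply added.
        self.v += i_exc + i_inh
        if self.v > self.v_th:
            self.v = self.v_reset
            self.v_th += self.th_increment
            self.refractory_left = self.refractory_steps
            self.spike_count += 1
            return True
        return False
```

The exponential decay of the threshold between spikes is left out here because it is handled by the bit-shift decay discussed in the hardware implementation subsection.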
C. SYNAPSE MODEL
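The classic STDP is typically defined as [35]:
$$\Delta w = \sum_{t_{pre}} \sum_{t_{post}} F(t_{post} - t_{pre})$$
where $t_{pre}$ and $t_{post}$ are the pre- and post-synaptic spike times, respectively; $w$ is the weight of the synapse; $F$ is a function defined as:
$$F(\Delta t) = \begin{cases} A_{pre}\, e^{-\Delta t/\tau_{pre}}, & \Delta t > 0 \\ A_{post}\, e^{\Delta t/\tau_{post}}, & \Delta t < 0 \end{cases}$$
where $\Delta t = t_{post} - t_{pre}$; $\tau_{pre}$ and $\tau_{post}$ are time constants of the pre- and post-synaptic neurons; $A_{pre}$ and $A_{post}$ are amplitude constants.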
It is inefficient to use the classic STDP to update weights directly because all pairs of spikes have to be summed over, which is also biologically unrealistic because neurons cannot remember all previous spike times [24]. By contrast, the trace-based STDP, which has been proven equivalent to the classic STDP [34], is more efficient for hardware implementation. The pre- and post-synaptic traces are denoted pre and post, respectively, and are governed by:
$$\frac{d\,\mathrm{pre}}{dt} = -\frac{\mathrm{pre}}{\tau_{pre}}, \qquad \frac{d\,\mathrm{post}}{dt} = -\frac{\mathrm{post}}{\tau_{post}}$$
When a pre-synaptic spike occurs, the pre-synaptic trace and weight are modified according to:
When a post-synaptic spike occurs, the post-synaptic trace and weight are modified according to:
For example, with $\tau_{pre} = \tau_{post} = 20$ ms and $A_{pre} = A_{post} = 0.01$, weight modification with trace-based STDP (Fig. 3) yields the same effect as the classic STDP.
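The equivalence between the pairwise sum and the trace formulation can be checked numerically with the sketch below. It rests on assumptions that are not taken from this paper: the update convention follows the common textbook form (on a pre-synaptic spike, pre is incremented by `a_pre` and the weight changes by the current post trace; on a post-synaptic spike, post is incremented by `a_post` and the weight changes by the current pre trace), `a_post` is treated as a signed amplitude, traces decay exactly between events, and all spike times are distinct. The function names are hypothetical.

```python
import math


def classic_stdp(pre_times, post_times, a_pre, a_post, tau_pre, tau_post):
    """Total weight change from summing F over all pre/post spike pairs."""
    dw = 0.0
    for t_pre in pre_times:
        for t_post in post_times:
            dt = t_post - t_pre
            if dt > 0:
                dw += a_pre * math.exp(-dt / tau_pre)
            elif dt < 0:
                dw += a_post * math.exp(dt / tau_post)
    return dw


def trace_stdp(pre_times, post_times, a_pre, a_post, tau_pre, tau_post):
    """Same weight change computed with two decaying traces (event-driven)."""
    events = sorted([(t, "pre") for t in pre_times] +
                    [(t, "post") for t in post_times])
    pre = post = dw = 0.0
    last_t = events[0][0] if events else 0.0
    for t, kind in events:
        # Decay both traces exactly over the interval since the last event.
        pre *= math.exp(-(t - last_t) / tau_pre)
        post *= math.exp(-(t - last_t) / tau_post)
        last_t = t
        if kind == "pre":
            pre += a_pre
            dw += post   # post trace summarizes all earlier post spikes
        else:
            post += a_post
            dw += pre    # pre trace summarizes all earlier pre spikes
    return dw
```

Only two state variables per synapse pair are needed in the trace version, regardless of how many spikes have occurred, which is the property that makes it attractive for hardware.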
To further optimize the hardware implementation with the bit-shift operation (see below), pre is reset to an initial value (e.g., $10^4$) when the pre-synaptic neuron fires a spike, and a learning rate is introduced in the weight modification:
where $w_{ij}(t-1)$ and $w_{ij}(t)$ are the weights from input neuron i to excitatory neuron j at the previous time step and current time step, respectively; $\lambda_{pre}$ is the pre-synaptic learning rate; $post_j$ is the post-synaptic trace of excitatory neuron j.
Since λ pre is a constant coefficient, the decay of synaptic weight can be realized by the bit-shift operation.
When the post-synaptic neuron fires a spike, post is reset to its initial value, and the synaptic weight is updated according to:
where $\lambda_{post}$ is the post-synaptic learning rate; $pre_i$ is the pre-synaptic trace of input neuron i. The symbols for the variables above are summarized in Table 1.
The original STDP rule is unstable [36], i.e., the learning process may not converge to a stable state. Therefore, a weight constraint scheme is introduced into the STDP rule to prevent the weights from growing without bound. Specifically, the sum of the weights of all synapses connected to each neuron is restricted to the same magnitude, which guarantees that each excitatory neuron has a fair chance to win the competition. Moreover, the weights are clamped between upper and lower bounds. These strategies are executed together in each time step to keep the weights under control. The working steps of the weight normalization are shown in Algorithm 1.
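As an illustration of the weight constraint, the following Python sketch normalizes each column of an integer weight matrix to a common sum using one multiplication and one right shift per weight, and then clamps the result. The function name `normalize_columns`, the parameters `target_sum`, `shift`, `w_min`, and `w_max`, and the choice of a right shift as the final step are assumptions; the constants actually used in the model are not given in the text.

```python
def normalize_columns(w, target_sum, shift=16, w_min=0, w_max=None):
    """Rescale each column of integer matrix w so that it sums to ~target_sum.

    w is a list of rows (m inputs x n excitatory neurons). The scale factor is
    held in fixed point with `shift` fractional bits, so rescaling needs only
    an integer multiply followed by a right shift.
    """
    n_rows, n_cols = len(w), len(w[0])
    col_sums = [sum(w[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Fixed-point factor per column; an all-zero column is left untouched.
    factors = [round((target_sum << shift) / s) if s > 0 else (1 << shift)
               for s in col_sums]
    out = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            value = (w[i][j] * factors[j]) >> shift
            # Clamp between the lower and (optional) upper bound.
            value = max(w_min, value)
            if w_max is not None:
                value = min(w_max, value)
            row.append(value)
        out.append(row)
    return out
```

Because the shift truncates, each column sum can fall slightly below `target_sum`; the deviation is bounded by the number of rows plus the rounding error of the factor.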
Algorithm 1 Pseudo Code of the Weight Normalization
  # Initialization of input constants
  constant1 ← input1
  constant2 ← input2
  constant3 ← input3
  # Initialization of weight w (m × n)
  w ← rand(m, n) * constant1
  while i < training number do
    # Sum by column to w
    w_sum ← sum(w, 1)
    # Calculate normalization factor
    factor ← round(constant2 ./ w_sum)
    # Update weight
    w ← w .* factor
    # Shift operation to w
    w ← bitshift(w, constant3)
  end while

A delay connection matrix is required to obtain the summed weights (W) that should be added to the neurons' conductance at the current time step. Due to the presence of synaptic delays, synchronous pre-synaptic spikes may arrive at the target neurons at different times, whereas asynchronous pre-synaptic spikes may reach the neuron at the same time.
As shown in the top panel of Fig. 4, the number of rows (i.e., 21) in the delay connection matrix is the sum of the maximum delay divided by the time step (i.e., 10 ms / 0.5 ms = 20) and the initial state (i.e., 1). The number of columns is equal to the number of excitatory neurons, and each column of the delay connection matrix corresponds to an excitatory neuron. Here, a delay connection vector of a single neuron [37] is used to illustrate the delay mechanism, as shown in the bottom panel of Fig. 4. The maximum delay and time step are assumed to be 10 ms and 0.5 ms, respectively. The delay connection vector is updated every time step. The entry W indicated by the pointer denotes the summed weight that arrives at the excitatory neuron at the current time step. As an example, assume that the pointer points to W(0) at the current time step and one of the pre-synaptic neurons connected to this neuron fires a spike. The weight of this synapse ($w_0$) and the delay time (assumed to be 4.5 ms) are retrieved from the weight matrix $M_{xe}$ and the delay matrix (as shown in Fig. 1), respectively. Since $w_0$ will reach the neuron after 9 time steps (i.e., 4.5 ms / 0.5 ms = 9), $w_0$ is added to W(9). If there are any other pre-synaptic spikes, the delay connection vector is updated in a similar way. Subsequently, W(0) is added to this neuron's excitatory conductance and then reset to zero, and the pointer moves to the next cell and points to the next weight (W(1)). The above procedures are repeated until the end of the learning or inference process.
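The delay connection vector of a single neuron can be sketched as a circular buffer, as in the Python example below. The class name `DelayLine` and its method names are hypothetical, and the wrap-around of the pointer at the end of the vector is an assumption; the text only states that the pointer moves to the next cell at every time step.

```python
class DelayLine:
    """Delay connection vector of one excitatory neuron (circular buffer sketch)."""

    def __init__(self, max_delay_ms=10.0, dt_ms=0.5):
        self.dt_ms = dt_ms
        # max delay / time step + 1 initial cell = 21 cells for 10 ms and 0.5 ms.
        self.slots = [0] * (round(max_delay_ms / dt_ms) + 1)
        self.pointer = 0

    def schedule(self, weight, delay_ms):
        """Add a spike's weight to the cell that will be read after delay_ms."""
        steps = round(delay_ms / self.dt_ms)
        if not 0 <= steps < len(self.slots):
            raise ValueError("delay outside the supported range")
        self.slots[(self.pointer + steps) % len(self.slots)] += weight

    def advance(self):
        """Return the summed weight arriving now, clear the cell, move the pointer."""
        arrived = self.slots[self.pointer]
        self.slots[self.pointer] = 0
        self.pointer = (self.pointer + 1) % len(self.slots)
        return arrived
```

With the example from the text, a spike scheduled with a 4.5 ms delay while the pointer is at W(0) lands in W(9) and is returned by the tenth call to `advance()`.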
D. HARDWARE IMPLEMENTATION
To maintain the dynamic balance of the entire neural network, parameters such as the conductance, threshold, and trace always decay over time [38]. Here, all these decay processes are implemented by bit-shift operations. In a traditional SNN, these parameters normally decrease exponentially after reset. For example, the evolution of pre is governed by $d\,\mathrm{pre}/dt = -\mathrm{pre}/\tau$, where $\tau$ is a time constant. For hardware simplification, this equation is discretized with Euler's method into Eq. (17):
$$\mathrm{pre}(t) = \mathrm{pre}(t-1) - \frac{dt}{\tau}\,\mathrm{pre}(t-1) \quad (17)$$
To further simplify the hardware implementation, a bit-shift method is used to approximate Eq. (17). For example, with dt = 0.5 ms and $\tau_{pre}$ = 8 ms, the ratio $dt/\tau_{pre}$ equals $1/16 = 2^{-4}$, so the division can be replaced by a 4-bit right shift. As shown in Fig. 5, the bit-shift method (red line) approximates the equation (black line) well in the first 50 iterations. After 50 iterations, pre is reduced to less than 15 and then remains unchanged, because right-shifting such a small value by 4 bits yields zero. It is observed that this small error may result in non-convergence during the learning process. Fortunately, the non-convergence can be effectively addressed by forcing pre to 0 when it is less than 15, as shown in Fig. 5. Similar strategies are applied to the other parameters that decay exponentially. This bit-shift variable updating method is much more computationally efficient than conventional exponential calculation [39].
The schematic and corresponding working steps of the hardware implementation scheme of the STDP-based online learning SNN are given in Fig. 6 and Algorithm 2, respectively. To accelerate the learning process across all the connections, parallel computing can be used in each step of Algorithm 2. Moreover, to save memory space and improve memory access efficiency, input images may be placed in off-chip memory, whereas other parameters can be placed in on-chip memory.
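The bit-shift decay can be compared with the Euler update in a short Python sketch. The function names are hypothetical. One deliberate deviation is labeled here as an assumption: instead of a fixed cut-off of 15, the sketch forces the value to zero as soon as the shift would no longer change it (that is, when the value is below 2 to the power of `shift`), which guarantees that the sequence reaches zero.

```python
def shift_decay(value: int, shift: int = 4) -> int:
    """One decay step: value - value / 2**shift, using only shift and subtract."""
    decrement = value >> shift
    if decrement == 0:
        # The shift can no longer reduce the value; force it to zero
        # (assumption replacing the fixed cut-off mentioned in the text).
        return 0
    return value - decrement


def euler_decay(value: float, dt_ms: float = 0.5, tau_ms: float = 8.0) -> float:
    """One Euler step of d(value)/dt = -value / tau."""
    return value - (dt_ms / tau_ms) * value
```

For dt = 0.5 ms and a time constant of 8 ms the two functions coincide whenever the value is a multiple of 16; otherwise the integer version lags slightly because the shift truncates the decrement.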
III. RESULTS AND DISCUSSIONS

A. LEARNING PROCESS
The SNN is used to classify digit images after it learns from the 60,000 images of the MNIST training set. Each pixel of an image is encoded by a Poisson-distributed spike train whose firing rate is proportional to the pixel value, so that a larger pixel value corresponds to more spikes during the time window. The maximum input rate determines the maximum number of spikes during the time window. The numbers of excitatory neurons and inhibitory neurons are both 784. Fig. 7 shows the weight distribution of the 614,656 synapses (784 × 784) between the input layer and the excitatory layer after learning from 0, 10,000, and 60,000 images, respectively. The weights were uniformly distributed between 0 and $3 \times 10^7$ before learning. During the learning process, the weights gradually concentrated near 0, which is similar to the sparse connectivity concept in deep neural networks and is bio-plausible because each neuron connects to only a limited number of neurons in biological neural networks [40]. Sparse connectivity reduces the complexity of the wiring between neurons, enables more efficient information processing and storage, and can also improve pattern recognition accuracy [41]. Moreover, the comparison of weight matrices before and after learning is also presented in Fig. 7. All weights connected to an excitatory neuron are rearranged into a 28 × 28 matrix. As the learning process proceeds, the reconstructed matrices come to resemble specific digits, which means that the network tends to converge [42]. Fig. 8 shows the evolution of the thresholds of excitatory neurons during the learning process. Before learning, the thresholds of the 784 neurons were all set to $2 \times 10^8$. After learning, the thresholds were roughly distributed within $3.5 \times 10^8 \sim 5.5 \times 10^8$. Fig. 9 shows the membrane potential change during the learning process of a randomly selected excitatory neuron (Fig. 9(a)) and its corresponding inhibitory neuron (Fig. 9(b)). After the SNN learned from the 60,000 images, the excitatory neurons were divided into ten classes, and each neuron was labeled according to how it responds to the ten types of inputs. Fig. 10 shows the number of neurons with each label.

Algorithm 2 (steps 4 to 14)
  4: Weights normalization. # Loading weights and normalizing weights in the computation module, then returning updated weights to the on-chip memory.
  5: while t < time window do
  6:   Weights modification according to STDP on-pre. # Loading weights and pre from the on-chip memory, and spike1 from Poisson spike train generator; implementing the weights reduction, and then returning the updated weights.
  7:   Conductance updating. # Loading weights, delays, spike and conductance, and then calculating the conductance.
  8:   Current updating. # Loading conductance and membrane potential, and then calculating and returning the current.
  9:   Membrane potential updating. # Loading current and membrane potential, and then calculating and returning the updated membrane potential.
  10:   Spike2 updating. # Loading membrane potential and threshold, and then calculating and returning the spike2.
  11:   Weights modification according to STDP on-post. # Loading spike2, weights and post, and implementing the weights increment, and then returning the updated weights.
  12:   Parameters bit-shift decay. # Loading pre, post, conductance and threshold, and implementing the bit-shift decay (controlled by the decay controller), and then returning the updated parameters.
  13: end while
14: end while
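The rate encoding of input pixels used in this subsection can be illustrated with the Python sketch below. Several details are assumptions rather than values from the text: the function name `poisson_spike_train`, the 350 ms time window, the 0 to 255 pixel scale, and the use of an independent Bernoulli draw per time step as an approximation of a Poisson process. The 60 Hz default for the maximum input rate and the 0.5 ms time step are the values used in this work.

```python
import random


def poisson_spike_train(pixel, max_rate_hz=60.0, window_ms=350.0,
                        dt_ms=0.5, rng=None):
    """Encode one pixel (0..255) as a list of 0/1 spikes, one per time step."""
    rng = rng or random.Random()
    # Firing rate is proportional to the pixel value.
    rate_hz = max_rate_hz * pixel / 255.0
    # Probability of a spike in one time step (valid while rate * dt << 1).
    p_spike = rate_hz * dt_ms / 1000.0
    n_steps = int(round(window_ms / dt_ms))
    return [1 if rng.random() < p_spike else 0 for _ in range(n_steps)]
```

A black pixel never spikes, while a full-intensity pixel spikes with probability 0.03 per step under these settings, so lowering `max_rate_hz` directly lowers the expected spike count per image.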
B. IMAGE CLASSIFICATION
The 10,000 images of the MNIST test set were used to test the performance of the neural network after learning. The excitatory layer also acts as the output layer during inference, so an extra classifier is not needed. For the current image, the total number of spikes fired by each class of excitatory neurons is added up and divided by the number of excitatory neurons in that class. The class with the largest average is identified as the winner, and its label is taken as the predicted result. For example, assume that the current input is digit "0". As can be seen from Fig. 10, 110 excitatory neurons are assigned to digit "0". All spikes fired by these 110 neurons are added up and divided by 110 to get the average number of spikes per neuron. If this number is greater than those of the other nine classes, the output of the network is determined to be digit "0". As shown in Fig. 11, the recognition accuracy of the SNN (consisting of 784 input neurons, 784 excitatory neurons, and 784 inhibitory neurons) gradually increases to 92.1% with the number of training images. Fig. 12 shows the statistics of the predicted digits and the corresponding target digits during the test process. It is observed that some digits, such as "9", are prone to be erroneously predicted. The recognition accuracy as a function of the maximum input rate is shown in Fig. 13(a), which provides a guide for implementing the proposed model in hardware with fewer spikes while maintaining high accuracy (fewer spikes correspond to less energy consumption in neuromorphic hardware [7]). As Fig. 13(a) shows, the recognition accuracy drops sharply when the maximum input rate decreases from 60 Hz to 50 Hz, because below 60 Hz some neurons cannot receive enough input to cross their thresholds and fire spikes, which results in a significant loss of accuracy [15]. The maximum input rate is therefore set to 60 Hz in this work to balance accuracy and energy.
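The read-out rule described in this subsection, the average spike count per class followed by a winner selection, can be written as a short Python sketch. The function name `classify` and the tie-breaking rule (the lowest class index wins) are assumptions made for illustration.

```python
def classify(spike_counts, labels, n_classes=10):
    """Predict a digit from the spike counts of labeled excitatory neurons.

    spike_counts[k] is the number of spikes neuron k fired for the current
    image, and labels[k] is the class assigned to neuron k after learning.
    """
    totals = [0] * n_classes
    sizes = [0] * n_classes
    for count, label in zip(spike_counts, labels):
        totals[label] += count
        sizes[label] += 1
    # Average spikes per neuron in each class; empty classes score zero.
    averages = [totals[c] / sizes[c] if sizes[c] else 0.0
                for c in range(n_classes)]
    # The class with the highest average wins (first one on ties).
    return max(range(n_classes), key=averages.__getitem__)
```

Dividing by the class size matters because the classes are not equally populated after learning, so a raw spike total would favor digits that happen to own more neurons.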
Also, it is observed that the recognition accuracy increases significantly from 90.0% to 94.5% as the number of excitatory neurons increases from 400 to 3500, as shown in Fig. 13(b). For ASIC implementation, a trade-off between the recognition accuracy and hardware cost can be made according to this result. Although the maximum number of excitatory neurons in the ASIC is fixed by the hardware, the design can be made downward compatible, i.e., the number of excitatory neurons in use can be set by changing the synaptic connection states (e.g., disconnecting the unused synaptic connections). Moreover, the time consumption also increases greatly with the number of excitatory neurons, as shown in Table 2. Table 3 compares the training time and performance on the MNIST data set between the proposed model and pioneering ANN models with unsupervised learning methods. The proposed model achieves an accuracy of 92.1% with a relatively fast learning speed. The simulation in this work is carried out with MATLAB on an Intel i9-7940X platform. Table 4 compares the cost of multipliers/dividers and RAM between the proposed model and three pioneering SNNs obtained by supervised training, unsupervised learning, and ANN-to-SNN conversion, respectively. Here, Multiplier/Divider refers to the number of hardware calls per neuron per time step, and RAM denotes the on-chip memory, which stores the parameters except for the input data. The proposed model achieves accuracy comparable to that of the other models while consuming far fewer hardware resources (i.e., multipliers/dividers and RAM). Multiplication accounts for most of the computation in a neural network [49].
The proposed model requires only three 16-bit fixed-point multipliers per neuron per time step, which dramatically reduces hardware requirements [50] (a 32-bit floating-point multiplier consumes 6.1 times the chip area and 7.3 times the power of a 16-bit fixed-point multiplier in the 65 nm TSMC CMOS technology node [48], as given in Table 5). Because an MLP with bit operations runs seven times faster on the MNIST dataset than one with 32-bit floating-point multiplication on a GPU [51], the proposed model can be expected to greatly accelerate computation, since most of its calculations are realized by bit-shift operations rather than multiplication or division. Table 6 compares the resources and power of the 32-bit floating-point model and the proposed model (32 bits being the most commonly used precision in deep learning frameworks [52]). The hardware synthesis is carried out with Design Compiler (DC) from Synopsys Inc. in the 55 nm SMIC CMOS process. Encouragingly, the proposed model reduces the hardware resources and power consumption by 40.7% and 36.3%, respectively.
IV. CONCLUSION
In this work, we propose a biologically plausible SNN model for hardware implementation, which achieves a recognition accuracy of 94.5% on the MNIST database. The model uses a hardware-friendly STDP mechanism to achieve self-learning. The neuron and synapse activities are digitized and updated by fixed-point operations, and the decay of the neuron and synapse parameters is realized by bit-shift operations, which reduces hardware cost and power consumption by 40.7% and 36.3%, respectively (model size: 784-784-784). This model provides a meaningful reference for designing a neuromorphic platform that can efficiently realize online learning. However, some challenges remain for online learning in neuromorphic chips. The weights of the SNN require a large amount of on-chip storage space, which may be alleviated by a recently proposed weight quantization method [53]; meanwhile, the implementation of STDP involves a large number of time- and energy-consuming memory accesses, which may be eased by reducing the scale of the weight matrices with a pruning method [54]. These issues will be studied further in our future work.
