Bio-inspired neuromorphic hardware is a research direction that aims to approach the brain's computational power and energy efficiency. Spiking neural networks (SNNs) encode information as sparsely distributed spike trains and employ the spike-timing-dependent plasticity (STDP) mechanism for learning. Existing hardware implementations of SNNs are limited in scale or do not have in-hardware learning capability. In this work, we propose a low-cost scalable Network-on-Chip (NoC) based SNN hardware architecture with fully distributed in-hardware STDP learning capability. All hardware neurons work in parallel and communicate through the NoC. This enables the chip-level interconnection, scalability and reconfigurability necessary for deploying different applications. The hardware is applied to learn MNIST digits as an evaluation of its learning capability. We explore the design space to study the trade-offs between speed, area and energy. We also discuss how to use this procedure to find the optimal architecture configuration.
I. INTRODUCTION
In the field of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been developed to perform a series of human-level cognitive applications [1] [2]. However, their tremendous computation and memory requirements seriously challenge the processing efficiency of deep learning systems [3] [4]. The limitations of the Von Neumann architecture, coupled with increasing power demands due to the breakdown of Dennard scaling and the approaching end of Moore's Law, have motivated multiple research efforts into low-power, highly parallel and distributed computing architectures [5] [6] [7] [8] and brain-inspired computing architectures [9] [10]. That the brain serves as a source of inspiration is not surprising given its ability to process massive amounts of real-time information while consuming less than 20 W of power [11]. The goal of neuromorphic hardware design is to explore bio-inspired architectures that achieve cognitive functions in real time with lower power and a smaller footprint than traditional Von Neumann architectures.
The brain, in simplistic terms, is a collection of neurons interconnected in a vast network through links called synapses. The communication between neurons in this network gives the brain its processing abilities and lets it perform pattern recognition, classification, associative memory, reasoning, etc. The basis of this communication is short asynchronous electrical pulses (action potentials) called spikes. Spiking neural networks (SNNs), which use spikes as the basis for communication, are the third generation of neural networks [12]. As each neuron works asynchronously in an event-driven manner, SNNs have the potential to reach very low energy dissipation. When spiking activity in SNNs is stochastic, i.e. spikes are generated as a stochastic process, the information is carried by the statistics of a group of spikes instead of individual spikes. This makes the SNN more biologically plausible and also improves its fault tolerance and noise (delay) resilience.
The network of neurons in the brain can learn patterns by modifying the synapses linking the neurons based on their causal relative spike timings. This local and causal learning rule is called Spike-Timing Dependent Plasticity (STDP) [13].
As it is based only on local information of individual neurons, fully distributed learning [13] can be achieved on SNNs. Several challenges exist in implementing STDP learning in hardware. Firstly, the STDP rule is typically an exponential function, which is expensive to implement in hardware. Secondly, since the final value of a synaptic weight is unknown during learning, the hardware implementation should consider the worst case and be ready to provide a wide range and high precision for every synapse. Hence, hardware neurons that have learning capability require more memory than hardware neurons that perform inference only.
SNN hardware requires massive interconnection for parallelism and scalability. The Network-on-Chip (NoC) architecture has been used to provide on-chip communication for massively parallel systems. Traditional NoC design aims at minimizing communication latency and router congestion. As we will show later, due to the time-multiplexed nature of the neuron hardware implementation and the asynchronous and stochastic neuron behavior, the latency of inter-neuron communication is not a performance bottleneck in a hardware SNN. This property enables us to significantly simplify the router design and reduce hardware cost.
In this work, we adopt the Q2PS approximation of the STDP rule [14] to simplify the hardware exponential function. Different synaptic weights are encoded differently to provide both a wide range and a high precision without increasing the storage. A very low-cost NoC is designed to provide just enough communication capability for an SNN. Our main contributions include: 1) A low-cost time-multiplexed hardware spiking neuron design. Replacing multiplication and exponential functions with add and shift operations reduces hardware complexity as well as power consumption. The time-multiplexed physical neuron design improves resource utilization and neuron density. 2) A compact NoC design with a low-cost router, which is application specific and optimized for spiking neural networks. 3) Experimental demonstration of the STDP learning capability of the hardware by applying it to unsupervised learning of MNIST digits. 4) Design space exploration to study speed, area and energy trade-offs, and suggestions of design choices based on the analysis.
II. RELATED WORKS
There have been several existing research works on SNNs from the hardware perspective [6] [9] [10]. IBM's TrueNorth neurosynaptic processor [9] has achieved state-of-the-art performance with a minimal energy footprint on many tasks [15] [16], but it does not provide in-hardware learning capabilities. SpiNNaker [6] has also been popular in the research community as a testbed for SNN applications. The SpiNNaker Chip Multiprocessor integrates 18 ARM cores. It is capable of massively parallel simulations of spiking neural networks. [17] presents an analogue device to implement an artificial synapse with high energy efficiency, which shows 30 nJ energy consumption for an epoch of a classification task. [18] proposed a programmable CMOS neuromorphic chip. The architecture aims at implementing biologically plausible circuits and is limited in scalability. To address the scalability issue, some works adopt the NoC technique for SNNs. EMBRACE is an FPGA-based flexible and reconfigurable SNN architecture [19]. It uses an NoC to handle inter-neuron communication. EMBRACE also features genetic-algorithm-based on-chip training. It randomly initializes neuron configurations and performs fitness evaluation, crossover and selection until the optimal SNN configuration is obtained. [20] presents H-NoC, an architecture for spiking neural networks. The goal of H-NoC is to reduce packet delay, and it assumes that each neuron in the SNN has a dedicated port to the router, although the details of the neuron are not given. As we will show later, the time-multiplexed neuron core design and the asynchronous nature of the neuron activities relax the latency constraint. A simpler NoC design suffices for the SNN application.
Hardware implementations of STDP learning [21] [22] focus more on circuit and device level analysis to achieve variable synaptic plasticity than on scalability. [23] proposed a digital hardware neuron model for synaptic plasticity; it focuses on the design of individual neuron cores, and interconnection and scalability are not addressed. A few analog VLSI approaches to synaptic plasticity are proposed in [24] [25] [26], which focus on the design of individual synapses without addressing large-scale network implementation and architectural design. Emerging memristive devices have also been studied to realize artificial synapses and synaptic plasticity [27] [28] [29] [30] [31]. However, these studies are still at the proof-of-concept level and the fabrication technology of memristive devices is not yet mature.
III. NEURON MODEL AND LEARNING RULE
We utilize a simplified version [32] of the neuron model proposed in [33] . Here the membrane potential u(t) of neuron Z is computed as
where $w_i$ is the weight of the synapse connecting $Z$ to its $i$-th pre-synaptic neuron $y_i$, $y_i(t)$ is 1 if $y_i$ issues a spike at time $t$, and $w_0$ models the intrinsic excitability (bias) of the neuron $Z$. An integrate-and-fire neuron $Z$ spikes when the membrane potential crosses the threshold, and its membrane potential is then reset to 0. When the threshold is set to be random over a specified range, the stochastic integrate-and-fire (SIF) neuron approximates the Bayesian neuron in [32]. In order to aggregate or relay spike activities, we also introduce a spiking Rectified Linear Unit (ReLU) neuron. A ReLU neuron accumulates every weighted input spike and discharges it over time, resembling a burst firing pattern. After a spike, the membrane potential of a ReLU neuron is computed as:
STDP is the basis for learning in a spiking neuron model. Multiplicative STDP rules are stable but induce low competition, whereas additive rules are highly competitive but unstable. Both qualities, stability and competitiveness, are highly desirable. Most existing STDP rules utilize the exponential function, which is expensive for digital hardware implementation. Here, we utilize the low-cost Q2PS STDP rule proposed in [14] to approximate the exponential and multiplications using shifters, adders and a priority encoder. The analysis in [14] shows that the Q2PS rule is stable and highly competitive. The rule is given in Equation 3:
where $\hat{Q}$ is the quantization of $Q$ through priority encoding, and $Q$ is given as below.
If $t_{post} - t_{pre} < \tau_{LTP}$, then
If $t_{post} - t_{pre} > \tau_{LTP}$ or $t_{pre} - t_{post} < \tau_{LTD}$, then
where $\eta'_{LTP} = \log_2 \eta_{LTP}$ and $\eta'_{LTD} = \log_2 \eta_{LTD}$. Here $t_{post}$ and $t_{pre}$ are the time steps at which the post- and pre-synaptic neurons spike, $\tau_{LTP}$ and $\tau_{LTD}$ are the LTP and LTD windows, and $\eta_{LTP}$ and $\eta_{LTD}$ are the LTP and LTD learning rates, respectively.
The base-2 exponential function in Q2PS can be implemented using a barrel shifter at very low hardware cost. The weight learned by this rule has a limited range [14], which is exploited to reduce the storage requirement, as will be discussed in Section IV-E.
IV. HARDWARE ARCHITECTURE
The proposed architecture consists of a grid of homogeneous neurons. Each individual neuron's behavior is programmable and detailed neuron configuration will be discussed in Section IV-D. We adopt a globally asynchronous, locally synchronous (GALS) approach and avoid using a global clock. Neurons and routers work asynchronously in different clock domains. NoC is used as the global communication infrastructure to address massive interconnections of SNN. In this section, we will discuss the NoC design, the router architecture, the network interface and the hardware neuron design.
A. Network-on-Chip design
SNNs have massive numbers of interconnected neurons running simultaneously, with each neuron having a fan-out larger than $10^3$ [34]. Traditional on-chip communication solutions such as buses or point-to-point connections are limited in either scalability or flexibility [35]. NoCs have been widely used to provide inter-core communication for massively parallel on-chip systems. A typical NoC architecture consists of three components: routers, channels and PEs (processing elements). Routers are interconnected by channels. Each PE is attached to a router, and PEs communicate with each other via multi-hop packet transmission. Based on the destination address provided by the packet, routers make a routing decision to forward it either to the next router or to the local PE. In this way, an arbitrary network topology can be implemented.
Traditional NoC design aims at minimizing the communication latency and router congestion to ensure reliable communication. Large buffers, wide interconnects and a faster router clock (compared to the PE clock) are widely used techniques to achieve this goal. However, the proposed hardware SNN is highly tolerant of latency. In a typical SNN, the spiking activity is sparse and sporadic [36]. The sparsity is even more visible in the hardware design due to the time-multiplexed nature of the neuron cores. As we will explain in Section IV-D, for a neuron core of M logical neurons with N axons, each neuron is evaluated once every (M + C + 4)(N + 1) cycles. This interval is referred to as the neuron evaluation cycle (NEC). Assuming all neuron evaluations are randomly ordered, the average latency between spike generation and the required spike reception is $T_{NEC}/2$. Furthermore, because of the asynchronous behavior of neurons, it is not absolutely necessary for a spike generated in the current NEC to be received in the very next NEC. The STDP window is usually set to multiple NECs, so a communication delay of 1 NEC will hardly affect learning and inference at all. Finally, in-hardware learning automatically helps to adjust synaptic weights based on the hardware. Links that consistently have long latency or dropped packets will eventually have low synaptic weights and hence become less important. Therefore, in this work, our application-specific NoC design aims at minimum silicon area and low overhead.
The router consists of five ports, dual-clock FIFOs, a crossbar switch and an arbiter, as shown in Figure 2. Every port is independent of the other ports and all of them work in parallel. Each port has a controller and routing logic. The routing logic implements the routing algorithm and determines the next hop of a packet. The controller detects channel status and coordinates with the arbiter to make transmission decisions.
The arbiter handles crossbar conflicts when an output is requested by multiple inputs. Each router is connected to 4 neighbor nodes: north, south, east and west. Each router also connects to its local PE. Each port is attached to a FIFO buffer that can hold one packet. We set the FIFO size to the minimum to reduce hardware cost. Routers work at the same frequency as the hardware neurons. All routers form a 2-D mesh.
Physical link width is a key factor in NoC performance and hardware cost. A link is realized by a number of parallel wires connecting two neighboring routers. A wider physical link provides a larger bandwidth and reduces transmission latency. However, the area overhead of the router increases quadratically as the link width increases [37]. Since the proposed hardware SNN is tolerant of latency, we adopt the minimum-cost solution and set the link width to 4 bits.
B. Routing and flow control
We adopt X-Y routing because of its low hardware cost and because it is deadlock-free [38]. Each node holds its own coordinate (Xc, Yc) and each packet contains the destination node's coordinate (Xd, Yd). The router compares its own coordinate with the coordinate of the destination and decides where to forward the packet. The horizontal direction has higher priority than the vertical direction. If Xd > Xc, the packet is forwarded to the east neighbor; if Xd < Xc, it is forwarded to the west neighbor. If Xd = Xc, the packet is routed vertically based on the comparison between Yc and Yd. If Xd = Xc and Yd = Yc, the packet is forwarded to the local port.
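The routing decision described above can be sketched in a few lines of Python. This is only an illustrative model, not the RTL: the function names are hypothetical, and the choice that a larger Y coordinate lies to the north is an assumption, since the text does not state the vertical orientation.

```python
def xy_route(xc, yc, xd, yd):
    """Return the output port chosen by a router at (xc, yc) for destination (xd, yd)."""
    # The horizontal direction has priority over the vertical direction.
    if xd > xc:
        return "east"
    if xd < xc:
        return "west"
    # X coordinates match, so route vertically.
    if yd > yc:
        return "north"  # assumption: Y grows toward the north
    if yd < yc:
        return "south"
    return "local"


def trace_path(src, dst):
    """List every router a packet visits from src to dst under X-Y routing."""
    step = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}
    x, y = src
    path = [(x, y)]
    while True:
        port = xy_route(x, y, dst[0], dst[1])
        if port == "local":
            return path
        dx, dy = step[port]
        x, y = x + dx, y + dy
        path.append((x, y))
```

For example, `trace_path((0, 0), (2, 1))` first exhausts the X offset and only then moves vertically, which is the property that makes X-Y routing deadlock-free on a mesh.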
Each packet is an address event representation (AER) consisting of two fields: header and body. The header holds the H-bit absolute coordinates of the destination, where H is determined by the NoC size. The body is in turn divided into two parts. The first part is the L-bit axon index, where $2^L = N$ and N is the number of axons of a neuron core. The second part is reserved for debug and function extension.
The router employs wormhole switching as the flow control mechanism. A packet is split into a few flow control units (flits). Each flit has the same size as the physical link width, which is 4 bits. The H/4 header flits contain the routing information. As soon as the header is received, the router can forward the header flits to the next hop, and all subsequent payload flits follow the same path. In this way, the asynchronous buffer does not have to store entire packets, and both the depth and width of the buffer can be minimized, mitigating the high silicon area cost of asynchronous buffers. The router stalls transmission when its neighbor is busy.
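To make the packet and flit layout concrete, the following Python sketch packs one AER packet and splits it into 4-bit flits. The field order (X, Y, axon index, extension), the 4-bit extension width and the function names are assumptions made for illustration; the text only fixes the 4-bit flit size, the H-bit header and the L-bit axon index. The defaults correspond to a 4x4 mesh (H = 4) and N = 256 axons (L = 8).

```python
FLIT_BITS = 4  # physical link width stated in the text


def build_packet(dest_x, dest_y, axon, ext, coord_bits=2, axon_bits=8, ext_bits=4):
    """Pack the AER fields into one integer, header first.

    Returns (word, total_bits). Field order and ext_bits are assumptions.
    """
    fields = [(dest_x, coord_bits), (dest_y, coord_bits), (axon, axon_bits), (ext, ext_bits)]
    word, total_bits = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field does not fit in its width")
        word = (word << width) | value
        total_bits += width
    return word, total_bits


def to_flits(word, total_bits):
    """Split a packed packet into 4-bit flits, header flit(s) first."""
    if total_bits % FLIT_BITS:
        raise ValueError("packet length must be a multiple of the flit size")
    count = total_bits // FLIT_BITS
    mask = (1 << FLIT_BITS) - 1
    return [(word >> (FLIT_BITS * (count - 1 - i))) & mask for i in range(count)]
```

With these assumed widths a packet is 16 bits long, so it travels as one header flit followed by three body flits.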
C. Network Interface
The network interface (NI) is the bridge between the router and the neuron. The NI is also responsible for decoding packets and buffering incoming spikes. The NI has a register array of length N, in which each bit corresponds to an input of a hardware neuron. Once a packet is received, the axon ID field is decoded into an address and the specified bit of the register array is set to 1, indicating that a spike has been received. The NI also provides a local traffic bypass: all packets targeting the local neuron are decoded directly.
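A minimal behavioral sketch of the NI register array is given below. The class and method names are hypothetical, and clearing the register array when it is read out is an assumption; the text only states that a received packet sets the bit selected by its axon ID.

```python
class NetworkInterface:
    """Behavioral model of the N-bit spike register array in the NI."""

    def __init__(self, n_axons):
        self.n_axons = n_axons
        self.spike_bits = 0  # bit i is 1 if a spike arrived on axon i

    def receive(self, axon_id):
        """Decode the axon ID of an incoming packet and set the matching bit."""
        if not 0 <= axon_id < self.n_axons:
            raise ValueError("axon ID out of range")
        self.spike_bits |= 1 << axon_id

    def drain(self):
        """Return the buffered spikes as a list of bits and clear the array.

        Clearing on read is an assumption of this sketch.
        """
        bits = [(self.spike_bits >> i) & 1 for i in range(self.n_axons)]
        self.spike_bits = 0
        return bits
```

Because each axon is a single bit, two spikes arriving on the same axon before the array is read collapse into one event in this model.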
D. Neuron operations
Inspired by IBM TrueNorth [9] , we designed the hardware neuron architecture aiming at low overhead. In-hardware STDP learning capability is added. The proposed hardware supports two major functions, inference and learning. The inference function integrates weights, updates membrane potential and generates spikes. And the learning function updates weights and bias based on the rules proposed in section III.
In order to improve resource utilization and achieve a high density of neurons in the SNN, the hardware works in a time-multiplexed manner. The data path and control logic can be used for the computation of multiple neurons. We refer to the physical circuit that implements neuron behavior as a physical neuron. Each physical neuron can implement M logical neurons. As we will show in Section V-C, the value of M affects the speed, cost and energy efficiency of the system.
A physical neuron has N inputs called axons. The set of N axons is shared by all logical neurons in a neuron core. In this way, a logical neuron can connect to up to N logical neurons through a single spike packet, which reduces the NoC traffic. We refer to every connection between an axon and a logical neuron as a synapse. The connectivity between axons and logical neurons can be represented as a crossbar, as shown in Figure 3. Each dot in the figure is a synapse. Every synapse has a unique weight. If a neuron is not connected to an axon, the corresponding synaptic weight is 0. Each logical neuron performs inference followed by learning, if its learning function is enabled. Learning can only happen after inference, because it requires information such as the firing condition, pre-synaptic history and post-synaptic history, which are updated during inference. For performance efficiency, we parallelize the learning of the $i$-th neuron with the inference of the $(i+1)$-th neuron.
Fig. 4: Physical neuron structure
A global synchronization signal coordinates the computation of all physical neurons, and the interval between two synchronization signals is one NEC. Each neuron is evaluated once every NEC. One NEC is partitioned into M + 1 slots: one for each logical neuron, plus a last slot to complete the learning operation of the last logical neuron. Each slot has multiple clock cycles and its duration is determined by the learning and inference latencies. Consequently, by evenly distributing logical neurons over more physical neurons, fewer slots are required in an NEC and the whole network can work at a higher frequency.
E. Hardware neuron design
As shown in Figure 4 , a physical neuron has 5 parts. The neuron controller is responsible for scheduling the computation of logical neurons, generating addresses and control signals for learning and inference. Data path is the key component to implement neuron behaviors, including inference and learning functions. Spike buffer has a register array of length N . Each bit corresponds to an axon input. When the start signal is high, the content in spike buffer will be either cleared or overwritten by the output of NI.
There are two memory banks in a hardware neuron. The configuration memory stores every logical neuron's behavior parameters and learning parameters, such as logical neuron type, learning mode, LTP learning rate ($\eta_{LTP}$), LTD learning rate ($\eta_{LTD}$), etc. The status memory stores the logical neurons' status parameters, including membrane potential, bias, threshold, axon weights, pre-synaptic history and post-synaptic history. Every physical neuron has its own memory, which is located next to the data path.
The overlapped learning and inference both need to access the weight memory at the same time. To solve this issue, a FIFO is used. While the $i$-th neuron is performing inference and accessing the weight memory, each weight is pushed into the FIFO. When the inference of the $i$-th logical neuron is done, the $(i+1)$-th logical neuron starts inference. At the same time, the learning function of the $i$-th logical neuron starts, and all required weights of the $i$-th neuron are fetched from the FIFO and sent to the data path.
A physical neuron with M logical neurons and N axons has $M \times N$ synapses, and each synapse has a unique weight. Therefore, weights consume the most memory resources. The Q2PS STDP rule limits weights to a small range but requires high precision [14]. This enables reducing the integer bits without an accuracy penalty. However, some specialized networks such as the winner-take-all circuit [14] require a wide weight range while precision is not important. To mitigate this problem, each axon is associated with a scaling factor. By configuring the scaling factor, an axon can be switched between different precision and range levels, so that both a wide weight range and high precision can be satisfied.
Based on its inference function, a logical neuron can be configured in one of four modes: integrate and fire (IF), stochastic integrate and fire (SIF), spiking Rectified Linear Unit (ReLU) and learning mode. In IF mode, the neuron implements Equation 1; if the membrane potential exceeds its threshold, it forwards an AER to the router. In SIF mode, the threshold is a random number uniformly distributed in a given range, so that the neuron fires with a certain probability. In ReLU mode, the neuron implements Equation 2. If learning is enabled, the neuron performs the learning rules described in Section III. Separate hardware is used to implement the data paths for the inference and learning functions.
The inference data path is responsible for computing the membrane potential, issuing spikes and performing stochastic behavior. It takes one clock cycle to accumulate the weight of each synapse. Adding the bias and the previous NEC's membrane potential takes another 2 clock cycles. Comparing the current membrane potential to the threshold and determining whether to spike takes one more cycle. Finally, the new membrane potential is written back to the status memory. Assuming the axon number is configured as N, the inference evaluation of 1 logical neuron takes N + 4 clock cycles.
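The inference sequence and its N + 4 cycle budget can be illustrated with the behavioral Python sketch below. It is an assumption-laden model rather than the actual data path: floating-point values stand in for the fixed-point hardware format, the function and parameter names are hypothetical, and holding the potential unchanged when the neuron does not fire is an assumption. Only the IF and SIF modes are modeled.

```python
import random


def infer_step(weights, spikes, bias, potential, threshold, sif_range=None, rng=None):
    """Evaluate one logical neuron for one NEC.

    Returns (fired, new_potential, cycles).
    """
    cycles = 0
    acc = 0.0
    # One clock cycle per synapse: accumulate the weights of axons that spiked.
    for w, s in zip(weights, spikes):
        if s:
            acc += w
        cycles += 1
    # Two cycles: add the bias and the previous NEC's membrane potential.
    u = acc + bias + potential
    cycles += 2
    # SIF mode: draw the threshold uniformly from the configured range.
    if sif_range is not None:
        rng = rng or random.Random()
        threshold = rng.uniform(*sif_range)
    # One cycle: compare against the threshold.
    fired = u > threshold
    cycles += 1
    # One cycle: write back; the potential is reset to 0 after a spike.
    new_potential = 0.0 if fired else u
    cycles += 1
    return fired, new_potential, cycles
```

With N = 256 axons the returned cycle count is 260, which is the per-slot figure used in the NEC arithmetic later in the paper.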
The learning data path implements the Q2PS rule, which uses an adder, a shifter, a priority encoder and a look-up table (LUT) to approximate exponential STDP learning. The learning data path is pipelined and has four stages. In stage one, based on the spiking history and spiking condition, $\eta_{LTP}$ or $\eta_{LTD}$ is selected to perform LTP or LTD learning, and $Q$ is computed as in Equation 4 or Equation 5. In stage two, $\hat{Q}$ is obtained by priority encoding $Q$. Then Equation 3 is computed: 1 is shifted by $\hat{Q}$ positions to get the change of the synaptic weight. In stage three, the weight change is applied to the old weight. In stage four, the updated weight is written to the weight memory. The learning of the bias is also implemented in this stage. The difference between bias learning and weight learning is that bias learning is not a function of time; whether LTD or LTP is used depends only on the current spiking condition. Learning of a neuron with N axons also takes N + 4 clock cycles. Figure 5 shows the timing of the data path and pipeline.
In addition to the data paths for inference and learning, an STDP tracker is implemented to maintain the pre-synaptic and post-synaptic spike histories, which are critical to performing correct learning. The post-synaptic history tracker is a counter that is set to 0 in the NEC when the logical neuron generates a spike, and incremented by 1 in every other NEC. The pre-synaptic history tracker is also a counter, which is set to 0 in the NEC when a spike is received on that synapse, and incremented by 1 otherwise. The post-synaptic/pre-synaptic history valid flag is asserted if the tracker is less than the LTD/LTP window. When learning mode is enabled, the STDP tracker decides whether to expire the post-synaptic and pre-synaptic histories based on the firing condition, valid flag and incoming spike. When the logical neuron issues a spike and the pre-synaptic history is valid, the STDP tracker expires the pre-synaptic history. When the logical neuron receives a spike and the post-synaptic history is valid, the STDP tracker expires the post-synaptic history.
Fig. 5: Data path pipeline
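The counter behavior of a history tracker can be modeled as in the following sketch. The class name, the saturation of the counter at the window value, and the representation of an "expired" history as a counter equal to the window are all assumptions of this illustration; the text specifies only the reset, increment and valid-flag rules.

```python
class HistoryTracker:
    """Spike-history counter for one synapse (pre) or one neuron (post)."""

    def __init__(self, window):
        self.window = window  # LTP or LTD window, in NECs
        self.count = window   # assumption: history starts out invalid

    def tick(self, spiked):
        """Advance by one NEC: reset on a spike, otherwise count up."""
        if spiked:
            self.count = 0
        else:
            # Saturating here is an assumption that keeps the counter bounded.
            self.count = min(self.count + 1, self.window)

    @property
    def valid(self):
        """The valid flag is asserted while the counter is below the window."""
        return self.count < self.window

    def expire(self):
        """Invalidate the history once it has been consumed by a weight update."""
        self.count = self.window
```

In this model, a pre-synaptic tracker with a window of 3 stays valid for the NEC in which the spike arrives and the two NECs after it.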
The proposed hardware also supports multicast. In multicast mode, the most significant bit (MSB) of the extension field in a spike AER is a control bit. If the MSB is 1, the neuron controller increments the address pointer to read the next spike AER from the configuration memory, and keeps sending write requests to the NI until it reaches a spike AER whose MSB is 0. In this way, a logical neuron can have a flexible number of destinations, which allows the network to support more complex topologies, simplifies configuration and improves resource utilization.
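The multicast walk through the configuration memory can be sketched as follows. The dictionary representation of an AER entry, the 4-bit extension field and the function name are hypothetical choices for illustration; only the rule that the MSB of the extension field marks "more destinations follow" comes from the text.

```python
def multicast_destinations(config_mem, start, ext_bits=4):
    """Collect the spike AERs a firing neuron sends, starting at address `start`.

    Reading continues while the MSB of the extension field is 1 and stops
    after the first entry whose MSB is 0.
    """
    msb_mask = 1 << (ext_bits - 1)
    sent = []
    addr = start
    while True:
        aer = config_mem[addr]
        sent.append(aer)  # stands in for one write request to the NI
        if not aer["ext"] & msb_mask:
            break
        addr += 1
    return sent
```

A neuron with a single destination is simply an entry whose control bit is 0, so unicast needs no special case.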
V. RESULTS AND DISCUSSION
The proposed hardware design is implemented as a Verilog RTL model and synthesized on the Altera Arria 10 platform. In this section, we discuss the learning results and the performance of the NoC design, and explore the design space.
A. Unsupervised Learning of MNIST digits
Stochastic firing and STDP learning enable unsupervised feature learning in SNNs. To validate the functioning of the hardware design, we employ a simple pattern learning task. In this task, we utilize a simple winner-take-all (WTA) circuit to learn handwritten digits 0 and 1 [32] from the MNIST data set. The network is trained using 100 samples and each sample is exposed to the network for 100 NECs. In this experiment we set M and N to 128 and 256, respectively.
For convenience, given the size of the fan-in of a core (256 axons), we look to reduce the required number of inputs into any layer. As an MNIST image has 28x28 pixels, we employ an average-pooling-like mechanism for patches of 2x2 with a stride of 2 in the first layer. Thus the resultant input into the second layer is an average-pooled 14x14 MNIST image. The overall network consists of two layers. The first layer consists of 196 average pooling neurons whose fan-in is 2x2, and the second layer consists of 4 SIF neurons with STDP learning enabled and 4 ReLU neurons to relay the spikes. These 8 neurons form a winner-take-all (WTA) circuit as described in [32]. The 196 neurons in the first layer are mapped to 14 cores and all neurons in the second layer are mapped to 1 core. A 4x4 mesh network is configured to perform this experiment. The MNIST image is encoded into spike packets and a dedicated router is used to inject external packets.
Figure 6(a) shows the distribution of the weights before and after learning. The stability of the Q2PS STDP rule can be seen from this figure. As all learnt weights are limited to the range [-1.41, 1.07], fewer integer bits are needed to encode the weights, which lowers memory usage. The selectivity provided by the Q2PS STDP rule can also be observed. Before learning, the weights follow a uniform distribution in the range [-1, 1], while after learning the diverging weights of the network form a bimodal distribution, as expected in [14]. Figure 6(b) gives the weight map of the 4 classification neurons. As we can see, each of them has learned a specific pattern. Figure 7(a) gives the learning results. Initially, the spiking activity is random, but as time goes on, a spiking pattern emerges which represents the corresponding selectivity of the neurons. This selectivity is also visible from the firing rates of the learning neurons shown in Figure 7(b). Initially the spiking rate is high as all the neurons are firing randomly. As time goes on, the neurons start learning patterns and fire only selectively, which drastically reduces the firing rate.
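The input reduction from 28x28 pixels to 196 first-layer inputs follows from 2x2 pooling with a stride of 2. The short Python sketch below shows only this arithmetic on pixel values; it is not the spike-domain mechanism used by the pooling neurons in hardware, and the function name is hypothetical.

```python
def average_pool_2x2(image):
    """Average non-overlapping 2x2 patches (stride 2) of a 2-D list of numbers."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]
```

Applied to a 28x28 image this yields a 14x14 map, i.e. 196 values, which fits within the 256 axons of a single core.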
Most neurons remain idle through the entire training and the average firing probability is 11.472%. Because the computation is event-driven, the sparsity of neuron activation leads to low power consumption. The traffic pattern and spike pattern of the NoC are specific to the topology and neuron parameters of the SNN that is mapped to it. In order to ensure that the proposed architecture can satisfy the requirements of various applications, we perform a stress test to evaluate the performance of the NoC by increasing the firing rate, using the above MNIST recognition network as a baseline. We configured a 4x4 mesh network with random connectivity. Each physical neuron has M = 128 logical neurons and N = 256 input axons. 2048 logical neurons are distributed over 16 physical neurons. The router has a buffer depth of 2 packets. Table I shows NoC traffic statistics under different firing rates. The first row is the above MNIST network that is used as the baseline. Each experiment runs for 1000 NECs. At the operating frequency of 200 MHz, $5 \times 10^7 / (257 \times 260) = 1{,}489$ NECs can be executed in 1 second. Benefiting from the sparse activation of SNNs and the shared input axons, the traffic stays at a relatively low level. When the firing rate is 87.526%, the NoC achieves a throughput of 26 Mbyte/s. The fourth column in Table I shows the average latency of the network under different firing rates. The MNIST baseline has smaller latency because it has a shorter average packet route than the random topology. The MNIST baseline also has larger traffic because the external image input is injected into the network. The average latency remains stable under different traffic loads. All packets can be delivered in the current NEC, which is 33560 clock cycles in this case. We observe no packet drop due to congestion. The experiment shows that, although the small flit size causes a large delay, there is still sufficient time to deliver packets.
Two types of congestion are studied. The first type of congestion occurs inside the router and is caused by multiple incoming packets requesting the same destination port. In this case, only one port is granted permission to transmit while the other ports are stalled temporarily until the transmission is done. We refer to this as contention congestion. The second type is buffer congestion, which occurs when the destination router's input buffer is full and the router has to wait until the packets stored in the destination are transmitted. A packet will be dropped when buffer congestion occurs. We define the congestion rate as $T_{congestion}/T_{execution}$, where $T_{congestion}$ is the total number of clock cycles in which congestion has occurred, and $T_{execution}$ is the running time of the entire simulation. As shown in Figure 8, both types of congestion rate increase as the firing rate increases. However, due to the time-multiplexed design, a physical neuron can generate at most 1 spike every 260 clock cycles, so the traffic is spread over a long duration and routers remain idle most of the time. The sparsity of SNN activation makes the traffic even sparser. As a result, even in the worst case, where the network has a firing rate of 87.562%, congestion is still rare.
In SNNs, neuron firing rates are about 10% [36]. No significant performance degradation is observed even in the worst case; therefore, the proposed architecture is able to satisfy the requirements of various applications.
C. Design space exploration
The physical neuron capacity, NoC size and router buffer size can affect power consumption, parallelism and efficiency. Here, we study the impacts of these factors and provide guidelines for the design of hardware spiking neural networks. First, there is a trade-off between physical neuron capacity and parallelism. Assume that an SNN has 2048 neurons and a physical neuron can contain 256 logical neurons. Then 2048 / 256 = 8 physical neurons are required to map the SNN to hardware. In this case, an NEC should have at least (256 + 1) * 260 = 66,820 clock cycles. If the physical neuron can contain 32 neurons, 2048 / 32 = 64 physical neurons are required, and an NEC should contain (32 + 1) * 260 = 8,580 clock cycles. Therefore the second hardware SNN runs approximately 8 times faster than the first one, as it requires fewer clock cycles per NEC. However, this improvement is obtained at the cost of silicon area and power consumption. Table II shows the impact of physical neuron capacity on FPGA resource consumption, power and energy when mapping an SNN with 2048 neurons to hardware. The experiment is performed at the operating frequency of 200 MHz. LUT consumption has an approximately linear relation to the number of physical neurons. The logic resource consumption of a cell is almost constant. Memory consumption increases slightly as the number of physical neurons grows. The extra memory consumption is introduced by the routers' input buffers. The last column shows the energy consumption of executing 1000 NECs. Although power is significantly larger when more physical neurons are required, the length of an NEC becomes shorter. Hence more NECs can be executed in a given running time. For FPGA implementation, it is preferable to use small physical neurons to increase parallelism as well as energy efficiency.
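The arithmetic behind this trade-off can be reproduced with the short Python sketch below. It assumes, consistent with the figures in the paragraph above, that an NEC has M + 1 slots of N + 4 clock cycles each and that N = 256 axons (so each slot takes 260 cycles); the function names are hypothetical.

```python
import math


def nec_cycles(m, n=256):
    """Minimum clock cycles per NEC: (M + 1) slots of (N + 4) cycles each."""
    return (m + 1) * (n + 4)


def physical_neurons_needed(total_logical, m):
    """Number of physical neurons required to host all logical neurons."""
    return math.ceil(total_logical / m)


def speedup(m_large, m_small, n=256):
    """How much shorter an NEC becomes when moving from m_large to m_small."""
    return nec_cycles(m_large, n) / nec_cycles(m_small, n)
```

For the two configurations discussed, `speedup(256, 32)` evaluates to roughly 7.8, which is the "approximately 8 times" figure, while the number of physical neurons grows from 8 to 64.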
Another factor that affects performance is the router's buffer size. A larger buffer can reduce the congestion rate; however, the dual-clock buffer is expensive in area. It is desirable to reduce router area to improve neuron density. We studied the impact of buffer size on network performance. A 4x4 network is configured, which consists of 2048 neurons. Two sets of experiments are performed. In the first set, the SNN's firing rate is approximately 10%, which is close to the firing rate of realistic applications. In the second set, a stress test is performed in which the SNN has an approximately 100% firing rate. Network performance is shown in Table III. Compared with a buffer of depth 8, buffer congestion decreases considerably, and contention congestion also decreases due to less buffer congestion. Buffer congestion is rare when the buffer depth is 16. Increasing the buffer further does not bring significant performance improvement.
VI. CONCLUSIONS
In this paper, we presented a comprehensive system-level spiking neural network hardware implementation, which features scalability, flexibility and in-hardware STDP learning 
