Scalable NoC-based Neuromorphic Hardware Learning and Inference by Fang, Haowen et al.
Scalable NoC-based Neuromorphic Hardware
Learning and Inference
Haowen Fang1, Amar Shrestha1, De Ma2, Qinru Qiu1
1Department of Engineering and Computer Science, Syracuse University, Syracuse, New York
2College of Computer Science and Technology, Zhejiang University, China
Email: 1{hfang02,amshrest,qiqiu}@syr.edu,2made@hdu.edu.cn
Abstract—Bio-inspired neuromorphic hardware is a research
direction that aims to approach the brain’s computational power and energy
efficiency. Spiking neural networks (SNN) encode information
as sparsely distributed spike trains and employ spike-timing-
dependent plasticity (STDP) mechanism for learning. Existing
hardware implementations of SNN are limited in scale or do not
have in-hardware learning capability. In this work, we propose a
low-cost scalable Network-on-Chip (NoC) based SNN hardware
architecture with fully distributed in-hardware STDP learning
capability. All hardware neurons work in parallel and commu-
nicate through the NoC. This enables chip-level interconnection,
scalability and reconfigurability necessary for deploying different
applications. The hardware is applied to learn MNIST digits as an
evaluation of its learning capability. We explore the design space
to study the trade-offs between speed, area and energy. How to
use this exploration to find an optimal architecture configuration is
also discussed.
Index Terms—Spiking neural network, Network on chip, STDP
learning, Unsupervised learning
I. INTRODUCTION
In the field of deep learning, convolutional neural networks
(CNNs) and recurrent neural networks (RNNs) have been developed
to perform a range of human-level cognitive tasks
[1] [2]. However, their tremendous computation and memory
requirements seriously challenge the processing
efficiency of deep learning systems [3] [4]. The limitations
of the Von Neumann architecture, coupled with increasing power
demands due to the breakdown of Dennard scaling and the approaching
end of Moore’s Law, have motivated multiple research efforts into
low-power, highly parallel and distributed computing architectures
[5] [6] [7] [8] and brain-inspired computing architectures
[9] [10]. The brain as a source of inspiration is not surprising given
its ability to process massive amounts of real-time information
while consuming less than 20 W of power [11]. The goal of
neuromorphic hardware design is to explore bio-inspired
architectures that achieve cognitive functions in real time using
lower power and a smaller footprint than traditional Von
Neumann architectures.
The brain, in simplistic terms, is a collection of neurons
interconnected in a vast network through links called synapses.
The communication between neurons in this network
gives the brain its processing abilities, allowing it to perform pattern
recognition, classification, associative memory, reasoning, etc.
The basis of this communication is short asynchronous
electrical pulses (action potentials) called spikes. Spiking neural
networks (SNNs), which use spikes as the basis for communication,
are the third generation of neural networks [12]. As
each neuron works asynchronously in an event-driven manner,
SNNs have the potential to reach very low energy dissipation.
When spiking activity in SNNs is stochastic i.e. spikes are
generated as a stochastic process, the information is carried
by the statistics of a group of spikes instead of individual
spikes. This makes the SNN more biologically plausible and
also improves its fault tolerance and noise (delay) resilience.
The network of neurons in the brain can learn patterns by
modifying the synapses linking the neurons based on their
causal relative spike timings. This local and causal learning
rule is called Spike-Timing Dependent Plasticity (STDP) [13].
As it is based only on local information of individual neurons,
fully distributed learning [13] can be achieved on SNNs.
Several challenges exist in implementing STDP learning on
hardware. Firstly, the STDP rule is typically an exponential
function, which is expensive for hardware implementation.
Secondly, since the final value of synaptic weight is unknown
during the learning, the hardware implementation should con-
sider the worst case and be ready to provide a wide range
and high precision for every synapse. Hence, more memory
is required for hardware neurons that have learning capability
than for hardware neurons that perform inference only.
SNN hardware requires massive interconnection for parallelism
and scalability. The Network-on-Chip (NoC) architecture
has been used to provide on-chip communication for
massive parallel systems. Traditional NoC design aims at
minimizing communication latency and router congestion. As
we will show later, due to the time-multiplexed nature of the neuron
hardware implementation and the asynchronous and stochastic
behavior of neurons, the latency of inter-neuron communication is
not a performance bottleneck in hardware SNNs. This property
enables us to significantly simplify the router design and reduce
hardware cost.
In this work, we adopt Q2PS approximation of STDP rule
[14] to simplify the hardware exponential function. Different
synaptic weights are encoded differently to provide both a
wide range and a high precision without increasing the storage.
A very low-cost NoC is designed to provide just enough
communication capability for an SNN. Our main contributions
include:
1) A low-cost time-multiplexed hardware spiking neuron
design. Replacing multiplication and exponential func-
tions with add and shift operations reduces hardware
complexity as well as power consumption. The time-
multiplexed physical neuron design improves resource
utilization and neuron density.
2) A compact NoC design with a low-cost router, which
is application specific and optimized for spiking neural
network.
3) Experimental demonstration of STDP learning capability
of the hardware by applying it to unsupervised learning
of MNIST digits.
4) Design space exploration to study speed, area and energy
trade-offs and suggestions of design choices based on
the analysis.
arXiv:1810.09233v1 [cs.ET] 18 Sep 2018
II. RELATED WORKS
There have been several existing research works on SNN
from the hardware perspective [6] [9] [10]. IBM’s
TrueNorth neurosynaptic processor [9] has achieved state-of-the-art
performance with a minimal energy footprint on
many tasks [15][16], but it does not provide in-hardware
learning capabilities. SpiNNaker [6] has also been popular in
the research community as a testbed for SNN applications.
The SpiNNaker chip multiprocessor integrates 18 ARM cores.
It is capable of massively parallel simulation of spiking
neural networks. [17] presents an analogue device to implement
artificial synapse with high energy efficiency, which shows 30
nJ energy consumption for an epoch of classification task.
[18] proposed a programmable CMOS neuromorphic chip.
The architecture aims at implementing biologically plausible
circuits and is limited in scalability. To address the scalability
issue, some works combine NoC techniques with SNNs.
EMBRACE is an FPGA-based flexible and reconfigurable
SNN architecture [19]. It uses an NoC to handle inter-neuron
communication. EMBRACE also features genetic-algorithm-based
on-chip training. It randomly initializes neuron configurations
and performs fitness evaluation, crossover and selection until
the optimal SNN configuration is obtained. [20] presents H-
NoC, an architecture for spiking neural network. The goal of
H-NoC is to reduce packet delay and it assumes that each
neuron in the SNN has a dedicated port to the router although
the detail of the neuron is not given. As we will show later, the
time-multiplexed neuron core design and the asynchronous nature
of the neuron activities relax the latency constraint, so a
simpler NoC design suffices for the SNN application.
Hardware implementations of STDP learning [21] [22]
focus more on circuit and device level analysis to achieve
variable synaptic plasticity instead of scalability. [23] proposed
a digital hardware neuron model for synaptic plasticity. It focuses
on the design of individual neuron cores; interconnection
and scalability are not addressed. A few analog VLSI approaches
to synaptic plasticity are proposed in [24][25][26], which
focus on the design of individual synapses without addressing
large-scale network implementation and architectural design.
Emerging memristive devices have also been studied to realize
artificial synapses and synaptic plasticity [27][28][29][30][31].
However, this research is still at the proof-of-concept level
and the fabrication technology of memristive devices is not yet
mature.
III. NEURON MODEL AND LEARNING RULE
We utilize a simplified version [32] of the neuron model
proposed in [33]. Here the membrane potential u(t) of neuron
Z is computed as
\[ u(t) = w_0 + u(t-1) + \sum_{i=1}^{n} w_i \cdot y_i(t) \tag{1} \]
where w_i is the weight of the synapse connecting Z to its ith
pre-synaptic neuron y_i, y_i(t) is 1 if y_i issues a spike at time
t, and w_0 models the intrinsic excitability (bias) of the neuron
Z. An integrate-and-fire neuron Z spikes when the membrane
potential crosses the threshold, and its membrane potential
is then reset to 0. When the threshold is set to be random over a
specified range, the stochastic integrate-and-fire (SIF) neuron
approximates the Bayesian neuron in [32].
Fig. 1: Generic neuron model.
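As an illustration, Equation 1 and the threshold-and-reset behavior can be sketched in a few lines of Python. This is a minimal software model under stated assumptions, not the hardware data path: the function names `if_step` and `sif_step` and the bounds `th_low`/`th_high` are introduced here for illustration, "crosses the threshold" is read as a strict comparison, and the random threshold is drawn uniformly.

```python
import random


def if_step(u_prev, weights, spikes, bias, threshold):
    """One time step of an integrate-and-fire neuron (Equation 1).

    spikes[i] is 1 if pre-synaptic neuron i spiked at this step, else 0.
    Returns (new_potential, fired).
    """
    u = bias + u_prev + sum(w * y for w, y in zip(weights, spikes))
    if u > threshold:  # assumption: "crosses" means strictly exceeds
        return 0.0, True  # the potential is reset to 0 after a spike
    return u, False


def sif_step(u_prev, weights, spikes, bias, th_low, th_high, rng=random):
    """Stochastic IF: same update, with a threshold drawn anew each step."""
    threshold = rng.uniform(th_low, th_high)  # assumed uniform over the range
    return if_step(u_prev, weights, spikes, bias, threshold)


u, fired = if_step(0.0, [0.5, 0.25], [1, 0], bias=0.25, threshold=1.0)
print(u, fired)  # 0.75 False
```

With a fixed threshold the neuron is deterministic; widening the range between `th_low` and `th_high` makes the firing decision probabilistic for the same input.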
In order to aggregate or relay spike activities, we also
introduce spiking Rectified Linear Unit (ReLU) neuron. A
ReLU neuron accumulates every weighted input spike and
discharges it over time resembling a burst firing pattern. After
a spike, the membrane potential of a ReLU neuron is computed
as:
\[ u(t) = u(t-1) + \sum_{i=1}^{n} y_i(t) - U_{th} \tag{2} \]
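The burst-like discharge of the spiking ReLU neuron can be illustrated with a short Python sketch. It follows Equation 2 as written (input spikes are summed without weights), and it assumes that the neuron fires at most once per time step whenever its potential is at least U_th; the name `relu_step` is ours.

```python
def relu_step(u_prev, spikes, u_th):
    """One time step of the spiking ReLU neuron.

    Incoming spikes are accumulated. When the potential reaches u_th the
    neuron emits one spike and u_th is subtracted (Equation 2) instead of
    resetting to 0, so the stored charge is released over later steps.
    """
    u = u_prev + sum(spikes)
    if u >= u_th:  # assumption: firing condition is u >= u_th
        return u - u_th, True
    return u, False


# Five input spikes arrive at once, then the inputs go silent.
u, out = 0, []
for spikes in ([1, 1, 1, 1, 1], [0] * 5, [0] * 5, [0] * 5):
    u, fired = relu_step(u, spikes, u_th=2)
    out.append(fired)
print(out)  # [True, True, False, False]
```

The single burst of input is relayed as two output spikes on consecutive steps, after which the remaining potential stays below the threshold.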
STDP is the basis for learning in a spiking neuron model.
Multiplicative STDP rules are stable but induce low competition,
whereas additive rules are highly competitive but unstable.
Both qualities, stability and competitiveness, are highly desirable.
Most existing STDP rules utilize the exponential function,
which is expensive for digital hardware implementation.
Here, we utilize the low-cost Q2PS STDP rule proposed in
[14] to approximate the exponential and multiplications using
shifters, adders and a priority encoder. The analysis in [14]
shows that the Q2PS rule is stable and highly competitive.
The rule is given in Equation 3:
\[ \Delta w_i = \begin{cases} 1 \ll |\bar{Q}|, & \text{if } \bar{Q} > 0 \\ 1 \gg |\bar{Q}|, & \text{if } \bar{Q} < 0 \end{cases} \tag{3} \]
where Q̄ is the quantization of Q through priority encoding,
and Q is given as below.
If t_post − t_pre < τ_LTP, then
\[ Q = \eta'_{LTP} - w_i \tag{4} \]
If t_post − t_pre > τ_LTP or t_pre − t_post < τ_LTD, then
\[ Q = \eta'_{LTD} + w_i \tag{5} \]
where η′_LTP = log2 η_LTP and η′_LTD = log2 η_LTD,
t_post and t_pre are the time steps at which the post- and
pre-synaptic neurons spike, τ_LTP and τ_LTD are the LTP and LTD
windows, and η_LTP and η_LTD are the LTP and LTD learning
rates respectively.
The base 2 exponential function in Q2PS can be imple-
mented using a barrel shifter with very low hardware cost.
The weight learned by this rule has a limited range [14], which
will be exploited to reduce the storage requirement, as
discussed in Section IV-E.
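To make the shift-based update concrete, the following Python sketch evaluates Equations 3 to 5 in fixed point. Several details are assumptions made only for illustration and are not taken from [14]: weights use `FRAC_BITS = 8` fractional bits, the priority-encoder quantization is modelled as a floor of Q, the case Q̄ = 0 is treated as a shift by zero, and the LTD branch subtracts the computed change from the weight.

```python
import math

FRAC_BITS = 8          # assumption: number of fixed-point fractional bits
ONE = 1 << FRAC_BITS   # fixed-point representation of 1.0


def q2ps_delta(q_bar):
    """Equation 3: magnitude 2**q_bar of the weight change, using shifts only."""
    if q_bar >= 0:                 # assumption: q_bar == 0 means no shift
        return ONE << q_bar
    return ONE >> (-q_bar)


def q2ps_update(w, eta_log2, ltp):
    """Return the updated weight.

    w        -- current weight (float, for readability)
    eta_log2 -- log2 of the LTP or LTD learning rate (eta' in the text)
    ltp      -- True for potentiation (Equation 4), False for depression (5)
    """
    q = eta_log2 - w if ltp else eta_log2 + w
    q_bar = math.floor(q)          # assumption: stand-in for priority encoding
    delta = q2ps_delta(q_bar) / ONE
    return w + delta if ltp else w - delta  # assumption: LTD lowers the weight


print(q2ps_update(0.0, -4, ltp=True))  # 0.0625
```

Because Q decreases as the weight grows in the LTP case, the potentiation step shrinks for large weights, which is consistent with the bounded weight range described above.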
IV. HARDWARE ARCHITECTURE
The proposed architecture consists of a grid of homoge-
neous neurons. Each individual neuron’s behavior is pro-
grammable and detailed neuron configuration will be discussed
in Section IV-D. We adopt a globally asynchronous, locally
synchronous (GALS) approach and avoid using a global
clock. Neurons and routers work asynchronously in different
clock domains. NoC is used as the global communication
infrastructure to address massive interconnections of SNN.
In this section, we will discuss the NoC design, the router
architecture, the network interface and the hardware neuron
design.
A. Network-on-Chip design
SNNs have massive numbers of interconnected neurons running
simultaneously, with each neuron having a fan-out larger
than 10^3 [34]. Traditional on-chip communication solutions
such as bus or point-to-point connection are limited in either
scalability or flexibility [35]. NoC has been widely used to
provide inter-core communication for massively parallel on-chip
systems. A typical NoC architecture consists of three
components: router, channel and PE (processing element).
Routers are interconnected by channels. Each PE is attached
to a router, and PEs communicate with each other via multi-hop
packet transmission. Based on the destination address provided
by the packet, routers make routing decisions to forward it
either to the next router or to the local PE. In this way, arbitrary
network topology can be implemented.
Traditional NoC design aims at minimizing the communica-
tion latency and router congestion to ensure reliable communi-
cation. Large buffer, wide interconnects and faster router clock
(compared to PE clock) are widely used techniques to achieve
the goal. However, the proposed hardware SNN is highly
resistant to latency. In a typical SNN, the spiking activity is
sparse and sporadic [36]. The sparsity is even more visible in
the hardware design due to the time-multiplexed nature of the
neuron cores. As we will explain in Section IV-D, for a neuron
core of M logical neurons with N axons, each neuron will be
evaluated once every (M+C+4)(N+1) cycles. This interval
is referred to as the neuron evaluation cycle (NEC). Assuming all
neuron evaluations are randomly ordered, the average latency
between spike generation and required spike reception is
T_NEC/2. Furthermore, because of the asynchronous behavior
of neurons, it is not absolutely necessary for a spike generated
in the current NEC to be received in the very next NEC. The STDP
window is usually set to multiple NECs, so a communication
delay of 1 NEC will hardly affect learning and inference at
all. Finally, in-hardware learning automatically helps to adjust
synaptic weights based on the hardware. Links that consistently
have long latency or dropped packets will eventually have low
synaptic weight and hence become less important. Therefore,
in this work, our application specific NoC design will aim at
minimum silicon area and low overhead.
The router consists of five ports, dual-clock FIFOs, a crossbar
switch and an arbiter, as shown in Figure 2. Every port
is independent of the other ports and all of them work in
parallel. Each port has a controller and routing logic. The
routing logic implements the routing algorithm and determines the
next hop of a packet. The controller detects channel status
and coordinates with the arbiter to make transmission decisions.
The arbiter handles crossbar conflicts when
an output is requested by multiple inputs. Each router is
connected to 4 neighbor nodes (north, south, east and
west) and to its local PE.
Each port has a FIFO buffer that can hold one
packet. We set the FIFO size to the minimum to reduce hardware
cost. Routers work at the same frequency as the hardware neurons.
All routers form a 2-D mesh.
Physical link width is a key factor in NoC performance
and hardware cost. A link is realized by a number of
parallel wires connecting two neighboring routers. A wider
physical link can provide a larger bandwidth and reduce
transmission latency. However, the area overhead of router
increases quadratically as the link width increases [37]. Since
the proposed hardware SNN is resistant to latency, we adopt
the minimum cost solution and set the link width to be 4 bits.
Fig. 2: Router Structure
B. Routing and flow control
We adopt X-Y routing for its low hardware cost and because
it is deadlock-free [38]. Each node holds its own coordinate
(Xc, Yc) and each packet contains the destination node’s coordinate
(Xd, Yd). The router compares its own coordinate with the
coordinate of the destination and decides where to forward the
packet. The horizontal direction has higher priority than the vertical
direction. If Xd > Xc, the packet is forwarded to the east neighbor;
if Xd < Xc, to the west neighbor. If Xd = Xc, the packet is routed
vertically based on the comparison between Yc and Yd. If
Xd = Xc and Yd = Yc, the packet is forwarded to the local
port.
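The routing decision described above is small enough to sketch directly. The Python below is illustrative only: the function names are ours, and the mapping of a larger Y coordinate to the north port is an assumption, since the text only states that vertical routing depends on comparing Yc and Yd.

```python
def xy_route(xc, yc, xd, yd):
    """Return the output port chosen by a router at (xc, yc)."""
    if xd > xc:
        return "east"
    if xd < xc:
        return "west"
    if yd > yc:
        return "north"  # assumption: Y grows towards the north
    if yd < yc:
        return "south"
    return "local"


# Coordinate change caused by leaving through each port.
STEP = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}


def trace(src, dst):
    """List the ports taken hop by hop from src to dst."""
    x, y = src
    hops = []
    while True:
        port = xy_route(x, y, *dst)
        hops.append(port)
        if port == "local":
            return hops
        dx, dy = STEP[port]
        x, y = x + dx, y + dy


print(trace((0, 0), (2, 1)))  # ['east', 'east', 'north', 'local']
```

Because every packet finishes its horizontal movement before turning vertically, no cyclic dependency between channels can form, which is the usual argument for the deadlock freedom of X-Y routing on a mesh.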
Each packet is an address event representation (AER)
consisting of two fields: header and body. The header holds the H-bit
absolute coordinates of the destination, where H is determined by
the NoC size. The body is in turn divided into two parts. The
first part is an L-bit axon index, where 2^L = N and N is the
number of axons of a neuron core. The second part is reserved
for debug and function extension.
The router employs wormhole switching as the flow control
mechanism. A packet is split into a few flow control units
(flits). Each flit has the same size as the physical link width,
which is 4 bits. The H/4 header flits contain the routing
information. As soon as the header is received, the router can
forward the header flits to the next hop, and all
subsequent payload flits follow the same path. In this way,
the asynchronous buffer does not have to store entire packets,
and both the depth and width of the buffer can be minimized,
mitigating the high silicon area cost of asynchronous buffers.
The router stalls transmission when its neighbor is busy.
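A possible packet layout and its split into 4-bit flits can be sketched as follows. The field order (X then Y in the header, axon index then extension field in the body), the 4-bit extension width and the most-significant-flit-first order are assumptions chosen for illustration; the text fixes only the flit width, the H-bit header and the L-bit axon index. With a 4x4 mesh and N = 256 axons, the sketch uses 2 bits per coordinate (H = 4) and L = 8.

```python
FLIT_BITS = 4  # equal to the physical link width


def to_flits(value, width):
    """Split a width-bit value into 4-bit flits, most significant first."""
    if width % FLIT_BITS:
        raise ValueError("field width must be a multiple of the flit size")
    return [(value >> shift) & 0xF
            for shift in range(width - FLIT_BITS, -1, -FLIT_BITS)]


def pack_aer(xd, yd, axon, coord_bits=2, axon_bits=8, ext=0, ext_bits=4):
    """Build the flit sequence of one spike packet (hypothetical layout)."""
    header = (xd << coord_bits) | yd      # H = 2 * coord_bits
    body = (axon << ext_bits) | ext       # axon index, then extension field
    return to_flits(header, 2 * coord_bits) + to_flits(body, axon_bits + ext_bits)


print(pack_aer(2, 1, 0xAB, ext=0x3))  # [9, 10, 11, 3]
```

Under these assumptions a packet consists of H/4 = 1 header flit followed by three body flits, so a router can choose the output port after receiving the first flit alone.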
C. Network Interface
The network interface (NI) is the bridge between the router and
the neuron. The NI is also responsible for decoding packets and
buffering incoming spikes. The NI has a register array of length N.
Each bit corresponds to an input of a hardware neuron. Once a
packet is received, the axon ID field is decoded into an address.
The specified bit of the register array is set to 1, indicating that a spike
has been received. The NI also provides local traffic bypass. All packets
targeting the local neuron are decoded directly.
D. Neuron operations
Inspired by IBM TrueNorth [9], we designed the hardware
neuron architecture aiming at low overhead, and added in-hardware
STDP learning capability. The proposed hardware
supports two major functions, inference and learning. The
inference function integrates weights, updates the membrane
potential and generates spikes. The learning function updates
weights and bias based on the rules described in Section III.
In order to improve resource utilization and achieve high
density neurons in SNN, the hardware works in a time-
multiplexed manner. The data path and control logic can
be used for multiple neurons’ computation. We refer to the
physical circuit that implements neuron behavior as a phys-
ical neuron. Each physical neuron can implement M logical
neurons. As we will show in Section VI, the value of M will
affect the speed, cost and energy efficiency of the system.
A physical neuron has N inputs called axons. The set of N
axons are shared by all logical neurons in a neuron core. In
this way, a logical neuron can connect up to N logical neurons
through a single spike packet, which reduces the NoC traffic.
We refer to every connection between an axon and a logical
neuron as a synapse. The connectivity between axons and
logical neurons can be represented as a crossbar as shown in
Figure 3. Each dot in the figure is a synapse. Every synapse
has a unique weight. If a neuron is not connected to an axon,
the corresponding synaptic weight is 0.
Fig. 3: Logical neuron connectivity (N axons versus M logical neurons, slots 1 to M).
Each logical neuron performs inference followed by learning,
if its learning function is enabled. Learning can only
happen after inference, because it requires information such
as the firing condition, pre-synaptic history and post-synaptic
history, which are updated during the inference. For performance
efficiency, we parallelize the learning of the ith neuron
with the inference of the (i + 1)th neuron.
Fig. 4: Physical neuron structure (blocks: neuron controller, status memory, configuration memory, data path and spike buffer, connected to the network interface).
A global synchronization signal coordinates the computation of all
physical neurons, and the interval between two synchronization
signals is one NEC. Each neuron is evaluated once
every NEC. One NEC is partitioned into M + 1 slots: one
for each logical neuron, plus a last slot to complete the
learning operation of the last logical neuron. Each slot has
multiple clock cycles and its duration is determined by the
learning and inference latencies. Consequently, by evenly
distributing logical neurons across multiple physical neurons, fewer
slots are required in a NEC and the whole network can work at
a higher frequency.
E. Hardware neuron design
As shown in Figure 4, a physical neuron has 5 parts. The
neuron controller is responsible for scheduling the compu-
tation of logical neurons, generating addresses and control
signals for learning and inference. Data path is the key compo-
nent to implement neuron behaviors, including inference and
learning functions. Spike buffer has a register array of length
N . Each bit corresponds to an axon input. When the start
signal is high, the content in spike buffer will be either cleared
or overwritten by the output of NI.
There are two memory banks in a hardware neuron. The configuration
memory stores every logical neuron’s behavior
parameters and learning parameters, such as logical neuron
type, learning mode, LTP learning rate (ηLTP), LTD learning
rate (ηLTD), etc. The status memory stores each logical neuron’s status
parameters, including membrane potential, bias, threshold, axon
weights, pre-synaptic history and post-synaptic history. Every
physical neuron has its own memory, which is located next to
the data path.
The overlapped learning and inference both need to access
the weight memory at the same time. To solve this issue, a FIFO is
used. While the ith logical neuron is performing inference
and accessing the weight memory, each weight is pushed into
the FIFO. When the inference of the ith logical neuron is done,
the (i + 1)th logical neuron starts inference. At the same time,
the learning function of the ith logical neuron starts, and all required
weights of the ith neuron are fetched from the FIFO and sent to the data path.
A physical neuron with M logical neurons and N axons
has M × N synapses, and each synapse has a unique weight.
Therefore, weights consume the most memory resources. The
Q2PS STDP rule limits weights in a small range but requires
high precision[14]. This enables reducing integer bits without
accuracy penalty. However, some specialized networks such
as Winner-take-all circuit[14] require wide weight range while
the precision is not important. To mitigate this problem, each
axon is associated with a scaling factor. By configuring scaling
factor, axon can be switched between different precision and
range levels, so that wide weight range and high precision can
both be satisfied.
Based on its inference function, a logical neuron can be
configured in one of four modes: integrate and fire (IF),
stochastic integrate and fire (SIF), spiking Rectified Linear
Unit (ReLU) and learning mode. In IF mode, the neuron
implements Equation 1; if the membrane potential exceeds its
threshold, it forwards an AER to the router. In SIF mode,
the threshold is a random number uniformly distributed in a
given range, so that the neuron fires with a certain probability. In ReLU
mode, the neuron implements Equation 2. If learning is enabled,
the neuron applies the learning rules described in Section III.
Separate hardware is used to implement the data paths for
the inference and learning functions.
The inference data path is responsible for computing mem-
brane potential, issuing spikes and performing stochastic be-
havior. It takes one clock cycle to accumulate the weight of
each synapse. Adding the bias and the previous NEC’s membrane
potential takes another 2 clock cycles. Comparing current
membrane potential to the threshold and determining whether
to spike or not takes one more cycle. At last the new membrane
potential is written back to status memory. Assuming the axon
number is configured as N , the inference evaluation of 1
logical neuron takes N + 4 clocks.
The learning data path implements the Q2PS rule, which
uses an adder, a shifter, a priority encoder and a look-up table (LUT)
to approximate exponential STDP learning. The learning data
path is pipelined and has four stages. In stage one, based
on the spiking history and spiking condition, ηLTP or ηLTD
is selected to perform LTP or LTD learning, and Q is computed
according to Equation 4 or Equation 5. In stage two, Q̄ is obtained by
priority encoding Q. Equation 3 is then computed: 1 is shifted
by |Q̄| bits to get the change of synaptic weight. In stage
three, the weight change is applied to the old weight. In stage four,
the updated weight is written to weight memory. The learning of
the bias is also implemented in this stage. The difference between
bias learning and weight learning is that bias learning is
not a function of time; whether LTD or LTP is used depends
only on the current spiking condition. Learning of a neuron
with N axons also takes N + 4 clocks. Figure 5 shows the
timing of the data path and pipeline.
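The cycle counts quoted above can be combined into a rough estimate of the NEC length. The sketch below assumes that each of the M + 1 slots lasts exactly the N + 4 cycles needed for one inference (or one overlapped learning) pass and ignores any synchronization overhead, so it is an approximation rather than the exact hardware timing; the function names are ours.

```python
def slot_cycles(n_axons):
    """Cycles for one logical neuron: N accumulations plus 4 fixed cycles."""
    accumulate = n_axons        # one cycle per synaptic weight
    bias_and_potential = 2      # add bias and previous membrane potential
    compare = 1                 # threshold comparison and spike decision
    write_back = 1              # store the new membrane potential
    return accumulate + bias_and_potential + compare + write_back


def nec_cycles(m_logical, n_axons):
    """M slots for inference plus one slot to finish the last learning pass."""
    return (m_logical + 1) * slot_cycles(n_axons)


print(nec_cycles(128, 256))  # 33540
print(nec_cycles(64, 256))   # 16900
```

For M = 128 and N = 256 this estimate gives 33,540 cycles, close to the 33,560-cycle NEC reported in Section V-B. Halving M roughly halves the NEC, which illustrates why spreading logical neurons over more physical neurons speeds up the network.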
In addition to the data path for inference and learning, an
STDP tracker is implemented to maintain the pre-synaptic
spike history and post-synaptic spike history, which are critical
to performing correct learning activity. The post-synaptic
history tracker is a counter that is set to 0 in the NEC when
the logical neuron generates a spike, and incremented by 1
in every NEC otherwise. The pre-synaptic history tracker is
also a counter that is set to 0 in the NEC when a spike is
received on that synapse, and incremented by 1 otherwise.
The post-synaptic/pre-synaptic history valid flag is asserted if
the tracker is less than the LTD/LTP window.
Fig. 5: Data path pipeline. AW: accumulate weight; AB: accumulate bias; AP: accumulate potential; WB: write back; DC: delta compute; Q: quantization; WU: weight update.
When learning mode is enabled, the STDP tracker determines
whether to expire the post-synaptic and pre-synaptic histories based
on the firing condition, valid flags and incoming spikes. When
the logical neuron issues a spike and the pre-synaptic history is
valid, the STDP tracker expires the pre-synaptic history. When the logical
neuron receives a spike and the post-synaptic history is valid,
the STDP tracker expires the post-synaptic history.
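The counter-based history tracking can be modelled in software as below. This is a behavioral sketch only: the class name `HistoryTracker`, the saturation of the counter at the window length, the invalid start-up state and the implementation of "expire" as forcing the counter to the window length are all assumptions made for illustration.

```python
class HistoryTracker:
    """Counts NECs since the last spike; valid while count < window."""

    def __init__(self, window):
        self.window = window
        self.count = window  # assumption: no valid history at start-up

    def tick(self, spike):
        """Call once per NEC with True if a spike occurred in that NEC."""
        if spike:
            self.count = 0
        else:
            # assumption: the hardware counter saturates at the window length
            self.count = min(self.count + 1, self.window)

    @property
    def valid(self):
        return self.count < self.window

    def expire(self):
        """Invalidate the history once it has been used for learning."""
        self.count = self.window  # assumption: expiring forces the flag low


pre = HistoryTracker(window=3)
flags = []
for spike in (True, False, False, False):
    pre.tick(spike)
    flags.append(pre.valid)
print(flags)  # [True, True, True, False]
```

One such tracker per synapse would hold the pre-synaptic history and one per logical neuron the post-synaptic history, with the window set to the LTP or LTD window respectively.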
The proposed hardware also supports multicast. In
multicast mode, the most significant bit (MSB) of the extension field in
a spike AER is a control bit. If the MSB is 1, the neuron controller
increments the address pointer to read the next spike AER from
configuration memory and keeps sending write requests to the NI
until a spike AER’s MSB is 0. In this way, a logical neuron can
have a flexible number of destinations, which allows the network
to support more complex topologies, simplifies configuration
and improves resource utilization.
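The multicast loop can be sketched as a walk through consecutive configuration memory entries. In the Python below, each AER word is assumed to carry a 4-bit extension field in its lowest bits, and the list `config_memory` stands in for the configuration memory; both are illustrative assumptions, as is the bounds check at the end of the memory.

```python
def multicast_aers(config_memory, start, ext_bits=4):
    """Return the AER words sent for one spike, starting at address start."""
    more_mask = 1 << (ext_bits - 1)   # MSB of the extension field
    sent = []
    addr = start
    while addr < len(config_memory):
        aer = config_memory[addr]
        sent.append(aer)              # stands in for a write request to the NI
        if not aer & more_mask:       # MSB is 0: this was the last destination
            break
        addr += 1                     # MSB is 1: read the next spike AER
    return sent


# Three destinations chained by the control bit, followed by an unrelated entry.
memory = [0x128, 0x238, 0x340, 0x450]
print([hex(a) for a in multicast_aers(memory, 0)])  # ['0x128', '0x238', '0x340']
```

A neuron with a single destination simply stores one entry whose control bit is 0, so unicast needs no special handling.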
V. RESULTS AND DISCUSSION
The proposed hardware design is implemented as a Verilog
RTL model and synthesized on the Altera Arria 10 platform.
The learning results and the performance of the NoC
design are discussed, and the design space is explored, in this
section.
A. Unsupervised Learning of MNIST digits
The stochastic firing and STDP learning enable unsupervised
feature learning in SNNs. To validate the functioning of
the hardware design, we employ a simple pattern learning task.
In this task, we utilize a simple winner-take-all (WTA) circuit
to learn handwritten digits 0 and 1 [32] from the MNIST data set.
The network is trained using 100 samples and each sample is
exposed to the network for 100 NECs. In this experiment we
set M and N to be 128 and 256 respectively.
Fig. 6: Learnt weights. (a) Weight distribution of Neurons 0 to 3 at NEC 0 and NEC 10000. (b) Weight visualization of Neurons 0 to 3.
For convenience, given the fan-in of a core (256
axons), we reduce the required number of inputs into
any layer. As an MNIST image has 28x28 pixels, we employ
an average-pooling-like mechanism for patches of 2x2 with
a stride of 2 in the first layer. Thus the resultant input into
the second layer will be an average-pooled 14x14 MNIST
image. The overall network consists of two layers. The first
layer consists of 196 average pooling neurons whose fan-in
is 2x2, and the second layer consists of 4 SIF neurons
with STDP learning enabled and 4 ReLU neurons to relay
the spikes. These 8 neurons form a winner-take-all (WTA)
circuit as described in [32]. The 196 neurons in the first layer
are mapped to 14 cores and all neurons in the second layer are
mapped to 1 core. A 4x4 mesh network is configured to perform
this experiment. Each MNIST image is encoded into spike packets
and a dedicated router is used to inject external packets.
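The size reduction performed by the first layer can be checked with a plain average-pooling sketch. The code operates on pixel intensities rather than spikes, so it shows only the arithmetic equivalent of the pooling neurons, not their spiking implementation; the function name and the synthetic test image are ours.

```python
def average_pool_2x2(image):
    """Average non-overlapping 2x2 patches (stride 2) of a 2-D list."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Synthetic 28x28 image standing in for an MNIST digit.
image = [[(r + c) % 256 for c in range(28)] for r in range(28)]
pooled = average_pool_2x2(image)
print(len(pooled), len(pooled[0]), sum(len(row) for row in pooled))  # 14 14 196
```

The 196 outputs match the 196 pooling neurons of the first layer and fit within the 256 axons of the core that hosts the second layer.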
Figure 6(a) shows the distribution of the weights before and
after learning. The stability of the Q2PS STDP rule can be
seen from this figure. As all learnt weights are limited to the range
[-1.41, 1.07], fewer integer bits are needed to encode the weights,
which lowers memory usage. The selectivity provided by
the Q2PS STDP rule can also be observed. Before learning,
the weights follow a uniform distribution in the range [-1, 1],
while after learning the diverging weights of the network form
a bimodal distribution, as expected in [14]. Figure 6(b) gives
the weight map of the 4 classification neurons. As we can see,
each of them learned a specific pattern.
Figure 7(a) gives the learning results. Initially, the spiking
activity is random, but as time goes on, a spiking pattern
emerges that represents the corresponding selectivity of the
neurons. This selectivity is also visible from the firing rates
of the learning neurons shown in Figure 7(b). Initially the
spiking rate is high as all the neurons are randomly firing. As
time goes on, the neurons start learning patterns and only fire
selectively, drastically reducing the firing rate.
Most neurons remain idle through the entire training and
the average firing probability is 11.472%. Because the com-
putation is event-driven, the sparsity of neuron activation leads
to low power consumption.
Fig. 7: Learning neuron spike activity. (a) Spike train pattern and corresponding input (neuron number versus NEC). (b) Average firing rate of learning neurons.
B. Performance evaluation
The traffic pattern and spike pattern of the NoC are specific to
the topology and neuron parameters of the SNNs that are mapped
to it. To verify that the proposed architecture can
satisfy the requirements of various applications, we perform a
pressure test that evaluates the performance of the NoC under
increasing firing rates, using the above MNIST recognition network
as a baseline. We configured a 4x4 mesh network with random
connectivity. Each physical neuron has M = 128 logical
neurons and N = 256 input axons, so 2048 logical neurons are
distributed over 16 physical neurons. Each router has a buffer depth
of 2 packets. Table I shows NoC traffic statistics under different
firing rates. The first row is the above MNIST network, which
is used as the baseline. Each experiment runs 1000 NECs. At the
operating frequency of 200 MHz, 5 * 10^7 / (257 * 260) = 1,489
NECs can be executed in 1 second. Benefiting from the sparse
activation of SNNs and the shared input axons, the traffic stays at
a relatively low level. When the firing rate is 87.562%, the NoC
achieves a throughput of 26 Mbyte/s. The fourth column in Table
I shows the average latency of the network under different
firing rates. The MNIST baseline has a smaller latency because
its average packet route is shorter than in the random topology. The
MNIST baseline also has larger traffic because the external image
input is injected into the network. The average latency remains
stable under different traffic loads. All packets can be delivered
within the current NEC, which is 33560 clock cycles in this case. We
observe no packet drop due to congestion. The experiment
shows that, although the small flit size causes a large delay, there is
still sufficient time to deliver packets.

TABLE I: Traffic Statistic
Firing     Total      Traffic     Avg. Latency   Max        Min
Rate       Spikes     (byte)      (cycle)        Latency    Latency
11.513%    218196     9324660     33.328         151        23
10.723%    220495     2219028     50.780         99         23
15.911%    327165     3297500     50.906         96         23
39.940%    821245     8266052     51.369         110        23
46.995%    966316     9849392     51.806         101        23
54.633%    1123366    11080148    50.983         99         23
62.743%    1290114    12644556    50.964         108        23
72.609%    1492974    14988016    51.907         106        23
81.393%    1673593    16657080    51.876         108        23
87.562%    1800452    17906592    51.842         106        23

[Figure: Contention Congestion Rate and Buffer Congestion Rate (vertical axes) versus Firing Rate (horizontal axis), for firing rates from 10.723% to 87.562%]
Fig. 8: Congestion rate
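
The reported throughput can be checked against the last row of Table I. The following is a minimal sketch, assuming the 1,489 NECs per second and the 1000 NECs per experiment quoted in the text; the function and constant names are illustrative and not part of the original design.

```python
# Values quoted in the text (assumed here as given).
NECS_PER_SECOND = 1489      # NECs executed in one second
NECS_PER_EXPERIMENT = 1000  # each experiment runs 1000 NECs


def throughput_bytes_per_second(traffic_bytes,
                                necs=NECS_PER_EXPERIMENT,
                                necs_per_second=NECS_PER_SECOND):
    """Average NoC throughput over one experiment, in bytes per second."""
    duration_s = necs / necs_per_second  # wall-clock time of the experiment
    return traffic_bytes / duration_s


# Last row of Table I: firing rate 87.562%, 17906592 bytes of traffic.
worst_case = throughput_bytes_per_second(17906592)
print(f"{worst_case / 1e6:.1f} Mbyte/s")
```

Under these assumptions the result falls between 26 and 27 Mbyte/s, in line with the figure stated above.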
Two types of congestion are studied. The first type
occurs inside the router and is caused by multiple
incoming packets requesting the same destination port. In this
case, only one port is granted permission to transmit, while the other
ports are stalled temporarily until the transmission is done. We refer
to this as contention congestion. The second type is buffer
congestion, which occurs when the destination router's input
buffer is full and the router has to wait until the packets stored in the
destination are transmitted. A packet will be dropped when
buffer congestion occurs. We define the congestion rate as
T_congestion / T_execution, where T_congestion is the total number of
clock cycles in which congestion has occurred, and T_execution is
the running time of the entire simulation. As shown in Figure
8, both types of congestion rate increase as the firing
rate increases. However, due to the time-multiplexed design, a
physical neuron can generate at most 1 spike every 260 clock cycles, so
the traffic is spread over a long duration and the routers remain idle
most of the time. The sparsity of SNN activation makes the
traffic even sparser. As a result, even in the worst case, where
the network has a firing rate of 87.562%, congestion is
still rare.
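
The congestion-rate definition above can be written down directly. This is only a sketch under the assumption that a simulator records one boolean flag per clock cycle; the trace format and the function name are hypothetical and are not taken from the original simulation setup.

```python
def congestion_rate(congested_flags):
    """Return T_congestion / T_execution for a per-cycle trace.

    congested_flags: sequence of booleans, one per simulated clock cycle,
    True when the given congestion type occurred in that cycle.
    """
    t_execution = len(congested_flags)   # total simulated cycles
    if t_execution == 0:
        raise ValueError("trace must contain at least one cycle")
    t_congestion = sum(congested_flags)  # cycles with congestion
    return t_congestion / t_execution


# Hypothetical trace: 1 congested cycle out of 4.
print(congestion_rate([False, True, False, False]))
```

The same function would be applied once to the contention trace and once to the buffer trace to obtain the two curves of Figure 8.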
In SNNs, neuron firing rates are about 10% [36]. No
significant performance degradation is observed even in the worst
case; therefore, the proposed architecture is able to satisfy the
requirements of various applications.
C. Design space exploration
The physical neuron capacity, NoC size and router buffer
size can affect power consumption, parallelism and efficiency.
Here, we study the impacts of these factors and provide
guidelines for the design of hardware spiking neural networks.

TABLE II: Impact of Physical Neuron Capacity
Size   NEC       Neuron   LUT      Memory      Power     Energy
       (cycle)   number            (byte)      (mW)      (J)
32     8580      64       18.97%   1,394,688   5759.05   0.247
64     16900     32       9.56%    1,359,872   2144.83   0.351
128    33540     16       4.87%    1,342,464   1839.52   0.498
256    66820     8        2.46%    1,333,760   1679.96   0.783
First, there is a trade-off between physical neuron capacity
and parallelism. Assume that an SNN has 2048 neurons and a
physical neuron can contain 256 logical neurons. Then 2048 / 256 =
8 physical neurons are required to map the SNN to hardware.
In this case, a NEC should have at least (256 + 1) * 260
= 66820 clock cycles. If the physical neuron can contain 32
neurons, 2048 / 32 = 64 physical neurons are required, and a NEC
should contain (32 + 1) * 260 = 8580 clock cycles. Therefore,
the second hardware SNN runs approximately 8 times faster
than the first one, as it requires fewer
clock cycles per NEC. However, this improvement is obtained
at the cost of silicon area and power consumption.
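
The arithmetic in this paragraph can be reproduced with a short sketch. It assumes the NEC length follows the (capacity + 1) * 260 pattern used in the two worked cases above; the helper names are illustrative only.

```python
CYCLES_PER_SLOT = 260  # per-slot cycle count used in the text


def physical_neurons(total_logical, capacity):
    """Physical neurons needed to hold all logical neurons (ceiling division)."""
    return -(-total_logical // capacity)


def nec_cycles(capacity):
    """Minimum NEC length in clock cycles for a given physical neuron capacity."""
    return (capacity + 1) * CYCLES_PER_SLOT


for capacity in (32, 64, 128, 256):
    print(capacity, physical_neurons(2048, capacity), nec_cycles(capacity))

# Ratio of NEC lengths between capacity 256 and capacity 32.
speedup = nec_cycles(256) / nec_cycles(32)
print(f"speedup: {speedup:.2f}x")
```

The printed neuron counts and NEC lengths correspond to the "Neuron number" and "NEC (cycle)" columns of Table II, and the ratio of about 7.8 is the "approximately 8 times" speedup mentioned above.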
Table II shows the impact of physical neuron capacity
on FPGA resource consumption, power and energy when
mapping an SNN with 2048 neurons to hardware. The experiment
is performed at an operating frequency of 200 MHz.
LUT consumption is approximately linear in the
number of physical neurons. The logic resource consumption
of a cell is almost constant. Memory consumption
increases slightly as the number of physical neurons grows; the extra
memory consumption is introduced by the routers' input buffers.
The last column shows the energy consumption of executing
1000 NECs. Although power is significantly larger when more
physical neurons are required, each NEC becomes
shorter, so more NECs can be executed in a given running
time. For an FPGA implementation, it is preferable to use small
physical neurons to increase parallelism as well as energy
efficiency.
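
The two resource trends described above can be read off Table II numerically. The sketch below only re-uses the table values; the variable and function names are illustrative.

```python
# Rows of Table II: (physical neuron count, LUT utilization in %, memory in bytes)
table_ii = [
    (64, 18.97, 1_394_688),
    (32, 9.56, 1_359_872),
    (16, 4.87, 1_342_464),
    (8, 2.46, 1_333_760),
]

# LUT utilization per physical neuron; a near-constant value means a linear relation.
lut_per_neuron = [lut / n for n, lut, _ in table_ii]


def memory_slopes(rows):
    """Extra bytes per additional physical neuron between adjacent table rows."""
    rows = sorted(rows)  # ascending by physical neuron count
    return [(m2 - m1) / (n2 - n1)
            for (n1, _, m1), (n2, _, m2) in zip(rows, rows[1:])]


print([round(x, 3) for x in lut_per_neuron])
print(memory_slopes(table_ii))
```

The LUT share per physical neuron stays close to 0.3%, and the memory column grows by the same number of bytes for every added physical neuron, which is consistent with a fixed per-neuron overhead.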
Another factor that affects performance is the router's
buffer size. A larger buffer can reduce the congestion rate;
however, the dual-clock buffer is expensive in area, and it is
desirable to reduce router area to improve neuron density. We
studied the impact of buffer size on network performance. A
4x4 network consisting of 2048 neurons is configured, and
two sets of experiments are performed. In the first set, the
SNN's firing rate is approximately 10%, which is close to
the firing rate of realistic applications. In the second set,
a pressure test is performed in which the SNN has an approximately
100% firing rate. Network performance is shown in Table III.
Compared with a buffer of depth 8, a buffer of depth 16 reduces buffer
congestion considerably; contention congestion also decreases due to less
buffer congestion. Buffer congestion is rare when the buffer
depth is 16. Increasing the buffer size further does not bring significant
performance improvement.
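
One way to turn these measurements into a design choice is to pick the smallest buffer depth whose buffer congestion stays below a target. The sketch below uses the measured values of Table III; the threshold of 0.02% and the function name are assumptions made for illustration, not values from the paper.

```python
# Rows of Table III:
# (firing rate %, buffer depth, contention congestion %, buffer congestion %)
table_iii = [
    (10.251, 8, 0.0331, 0.037),
    (10.251, 16, 0.0328, 0.0),
    (10.251, 32, 0.0328, 0.0),
    (99.896, 8, 2.891, 4.002),
    (99.896, 16, 2.625, 0.015),
    (99.896, 32, 2.625, 0.0),
]


def smallest_depth(rows, firing_rate, max_buffer_congestion):
    """Smallest measured buffer depth meeting the buffer congestion target.

    Returns None if no measured depth satisfies the target.
    """
    candidates = [depth for rate, depth, _, buf in rows
                  if rate == firing_rate and buf <= max_buffer_congestion]
    return min(candidates) if candidates else None


# Hypothetical target: at most 0.02% buffer congestion.
print(smallest_depth(table_iii, 10.251, 0.02))
print(smallest_depth(table_iii, 99.896, 0.02))
```

With this illustrative target, a depth of 16 is selected for both firing rates, which matches the observation that deeper buffers bring no significant further improvement.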
TABLE III: Impact of router buffer size
Firing rate   Buffer size      Contention   Buffer
              (width*depth)    congestion   congestion
10.251%       4*8              0.0331%      0.037%
10.251%       4*16             0.0328%      0%
10.251%       4*32             0.0328%      0%
99.896%       4*8              2.891%       4.002%
99.896%       4*16             2.625%       0.015%
99.896%       4*32             2.625%       0%

VI. CONCLUSIONS
In this paper, we presented a comprehensive system-level
spiking neural network hardware implementation, which features
scalability, flexibility and in-hardware STDP learning
capability. The proposed design is validated by unsupervised
MNIST digits learning.
REFERENCES
[1] Y. Li, Z. Li, and Q. Qiu, “Assisting fuzzy offline handwriting recogni-
tion using recurrent belief propagation,” in Computational Intelligence
(SSCI), 2016 IEEE Symposium Series on. IEEE, 2016, pp. 1–8.
[2] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and
B. Yuan, “Sc-dcnn: highly-scalable deep convolutional neural network
using stochastic computing,” in Proceedings of the Twenty-Second
International Conference on Architectural Support for Programming
Languages and Operating Systems. ACM, 2017, pp. 405–418.
[3] Y. Wang, C. Ding, Z. Li, G. Yuan, S. Liao, X. Ma, B. Yuan, X. Qian,
J. Tang, Q. Qiu et al., “Towards ultra-high performance and energy effi-
ciency of deep learning systems: an algorithm-hardware co-optimization
framework,” arXiv preprint arXiv:1802.06402.
[4] S. Lin, N. Liu, M. Nazemi, H. Li, C. Ding, Y. Wang, and M. Pedram,
“Fft-based deep learning deployment in embedded systems,” arXiv
preprint arXiv:1712.04910, 2017.
[5] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, “C-
lstm: Enabling efficient lstm using structured compression techniques
on fpgas,” in Proceedings of the 2018 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays. ACM, 2018, pp. 11–20.
[6] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the spinnaker system
architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[7] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian,
Y. Bai, G. Yuan et al., “Circnn: accelerating and compressing deep neural
networks using block-circulant weight matrices,” in Proceedings of the
50th Annual IEEE/ACM International Symposium on Microarchitecture.
ACM, 2017, pp. 395–408.
[8] J. Li, Z. Yuan, Z. Li, C. Ding, A. Ren, Q. Qiu, J. Draper, and Y. Wang,
“Hardware-driven nonlinear activation for stochastic computing based
deep convolutional neural networks,” in Neural Networks (IJCNN), 2017
International Joint Conference on. IEEE, 2017, pp. 1230–1236.
[9] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada,
F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., “A
million spiking-neuron integrated circuit with a scalable communication
network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
[10] Z. Li, Y. Wang, and Q. Qiu, “Probabilistic inference in neuromorphic
architecture: Applications and implementations.”
[11] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha,
“The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13
synapses,” in High Performance Computing Networking, Storage and
Analysis, Proceedings of the Conference on. IEEE, 2009, pp. 1–12.
[12] W. Maass, “Networks of spiking neurons: the third generation of neural
network models,” Neural networks, vol. 10, no. 9, pp. 1659–1671, 1997.
[13] J. Sjöström and W. Gerstner, “Spike-timing dependent plasticity,”
Spike-timing dependent plasticity, vol. 35, 2010.
[14] A. Shrestha, K. Ahmed, Y. Wang, and Q. Qiu, “Stable spike-timing
dependent plasticity rule for multilayer unsupervised and supervised
learning,” in Neural Networks (IJCNN), 2017 International Joint Con-
ference on. IEEE, 2017, pp. 1999–2006.
[15] A. Shrestha, K. Ahmed, Y. Wang, D. P. Widemann, A. T. Moody, B. C.
Van Essen, and Q. Qiu, “A spike-based long short-term memory on
a neurosynaptic processor,” in Computer-Aided Design (ICCAD), 2017
IEEE/ACM International Conference on. IEEE, 2017, pp. 631–637.
[16] A. G. Andreou, A. A. Dykman, K. D. Fischl, G. Garreau, D. R. Mendat,
G. Orchard, A. S. Cassidy, P. Merolla, J. V. Arthur, R. Alvarez-Icaza
et al., “Real-time sensory information processing using the truenorth
neurosynaptic system.” in ISCAS, 2016, p. 2911.
[17] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang,
N. Deng, L. Shi, H.-S. P. Wong et al., “Face classification using
electronic synapses,” Nature Communications, vol. 8, 2017.
[18] M. R. Azghadi, S. Moradi, and G. Indiveri, “Programmable neuromor-
phic circuits for spike-based neural dynamics,” in New Circuits and
Systems Conference (NEWCAS), 2013 IEEE 11th International. IEEE,
2013, pp. 1–4.
[19] S. Cawley, F. Morgan, B. McGinley, S. Pande, L. McDaid, S. Carrillo,
and J. Harkin, “Hardware spiking neural network prototyping and
application,” Genetic Programming and Evolvable Machines, vol. 12,
no. 3, pp. 257–280, 2011.
[20] S. Carrillo, J. Harkin, L. J. McDaid, F. Morgan, S. Pande, S. Cawley, and
B. McGinley, “Scalable hierarchical network-on-chip architecture for
spiking neural network hardware implementations,” IEEE Transactions
on Parallel and Distributed Systems, vol. 24, no. 12, pp. 2451–2461,
2013.
[21] S. Park, A. Sheri, J. Kim, J. Noh, J. Jang, M. Jeon, B. Lee, B. Lee,
B. Lee, and H. Hwang, “Neuromorphic speech systems using advanced
reram-based synapse,” in Electron Devices Meeting (IEDM), 2013 IEEE
International. IEEE, 2013, pp. 25–6.
[22] J. M. Cruz-Albrecht, M. W. Yung, and N. Srinivasa, “Energy-efficient
neuron, synapse and stdp integrated circuits,” IEEE transactions on
biomedical circuits and systems, vol. 6, no. 3, pp. 246–256, 2012.
[23] K. Ahmed, A. Shrestha, Y. Wang, and Q. Qiu, “System design for in-
hardware stdp learning and spiking based probablistic inference,” in
VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on.
IEEE, 2016, pp. 272–277.
[24] J. Schemmel, A. Grubl, K. Meier, and E. Mueller, “Implementing
synaptic plasticity in a vlsi spiking neural network model,” in Neural
Networks, 2006. IJCNN’06. International Joint Conference on. IEEE,
2006, pp. 1–6.
[25] F. L. M. Huayaney and E. Chicca, “A vlsi implementation of a calcium-
based plasticity learning model,” in Circuits and Systems (ISCAS), 2016
IEEE International Symposium on. IEEE, 2016, pp. 373–376.
[26] G. Indiveri, E. Chicca, and R. Douglas, “A vlsi array of low-power
spiking neurons and bistable synapses with spike-timing dependent
plasticity,” IEEE transactions on neural networks, vol. 17, no. 1, pp.
211–221, 2006.
[27] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and
Y. Wang, “Memristor crossbar-based ultra-efficient next-generation base-
band processors,” in Circuits and Systems (MWSCAS), 2017 IEEE 60th
International Midwest Symposium on. IEEE, 2017, pp. 1121–1124.
[28] L. Deng, G. Li, N. Deng, D. Wang, Z. Zhang, W. He, H. Li, J. Pei,
and L. Shi, “Complex learning in bio-plausible memristive networks,”
Scientific reports, vol. 5, 2015.
[29] E. Covi, S. Brivio, A. Serb, T. Prodromakis, M. Fanciulli, and S. Spiga,
“Analog memristive synapse in spiking networks implementing unsu-
pervised learning,” Frontiers in neuroscience, vol. 10, 2016.
[30] M. Prezioso, F. M. Bayat, B. Hoskins, K. Likharev, and D. Strukov,
“Self-adaptive spike-time-dependent plasticity of metal-oxide memris-
tors,” Scientific reports, vol. 6, p. 21331, 2016.
[31] S. Saïghi, C. G. Mayr, T. Serrano-Gotarredona, H. Schmidt, G. Lecerf,
J. Tomas, J. Grollier, S. Boyn, A. F. Vincent, D. Querlioz et al.,
“Plasticity in memristive devices for spiking neural networks,” Frontiers
in neuroscience, vol. 9, 2015.
[32] K. Ahmed, A. Shrestha, Q. Qiu, and Q. Wu, “Probabilistic inference
using stochastic spiking neural networks on a neurosynaptic processor,”
in Neural Networks (IJCNN), 2016 International Joint Conference on.
IEEE, 2016, pp. 4286–4293.
[33] B. Nessler, M. Pfeiffer, L. Buesing, and W. Maass, “Bayesian compu-
tation emerges in generic cortical microcircuits through spike-timing-
dependent plasticity,” PLoS computational biology, vol. 9, no. 4, p.
e1003037, 2013.
[34] B. Pakkenberg, D. Pelvig, L. Marner, M. J. Bundgaard, H. J. G.
Gundersen, J. R. Nyengaard, and L. Regeur, “Aging and the human
neocortex,” Experimental gerontology, vol. 38, no. 1, pp. 95–99, 2003.
[35] A. Jantsch, H. Tenhunen et al., Networks on chip. Springer, 2003, vol.
396.
[36] J. A. Cardin, M. Carlén, K. Meletis, U. Knoblich, F. Zhang, K. Deisseroth,
L.-H. Tsai, and C. I. Moore, “Driving fast-spiking cells induces
gamma rhythm and controls sensory responses,” Nature, vol. 459, no.
7247, pp. 663–667, 2009.
[37] J. Lee, C. Nicopoulos, S. J. Park, M. Swaminathan, and J. Kim, “Do
we need wide flits in networks-on-chip?” in VLSI (ISVLSI), 2013 IEEE
Computer Society Annual Symposium on. IEEE, 2013, pp. 2–7.
[38] C. J. Glass and L. M. Ni, “The turn model for adaptive routing,” Journal
of the ACM (JACM), vol. 41, no. 5, pp. 874–902, 1994.
