SNEAP: A Fast and Efficient Toolchain for Mapping Large-Scale Spiking
  Neural Network onto NoC-based Neuromorphic Platform by Li, Shiming et al.
arXiv:2004.01639v1 [cs.DC] 31 Mar 2020
SNEAP: A Fast and Efficient Toolchain for Mapping Large-Scale
Spiking Neural Network onto NoC-based Neuromorphic
Platform
Shiming Li
National University of Defense
Technology
Changsha, China
lishiming15@nudt.edu.cn
Shasha Guo
National University of Defense
Technology
Changsha, China
guoshasha13@nudt.edu.cn
Limeng Zhang
National University of Defense
Technology
Changsha, China
zhanglimeng@nudt.edu.cn
Ziyang Kang
National University of Defense
Technology
Changsha, China
kangziyang14@nudt.edu.cn
Shiying Wang
National University of Defense
Technology
Changsha, China
wangshiying18@nudt.edu.cn
Wei Shi
National University of Defense
Technology
Changsha, China
shiwei@nudt.edu.cn
Lei Wang
National University of Defense
Technology
Changsha, China
leiwang@nudt.edu.cn
Weixia Xu
National University of Defense
Technology
Changsha, China
xuweixia@nudt.edu.cn
ABSTRACT
Spiking neural networks (SNNs), the third generation of artificial
neural networks, have been widely adopted in vision and audio
tasks. Many neuromorphic platforms now support SNN simulation
and adopt a Network-on-Chip (NoC) architecture to interconnect
multiple cores. However, the interconnection brings a large area
overhead to the platform. Moreover, run-time communication on
the interconnection has a significant effect on the total power
consumption and performance of the platform. In this paper, we
propose a toolchain called SNEAP (Spiking NEural network mAPping
toolchain) for mapping SNNs to multi-core neuromorphic platforms,
which aims to reduce the energy and latency caused by spike
communication on the interconnection.
SNEAP includes two key steps: partitioning the SNN to reduce
the spikes communicated between partitions, and mapping the
partitions of the SNN to the NoC to reduce the average hop of
spikes under the constraint of hardware resources. In the
partitioning phase, SNEAP removes more spikes from the NoC
interconnection and spends less time than other toolchains.
Moreover, SNEAP reduces the average hop of spikes further within
a given time period, which effectively reduces the energy and
latency on the NoC-based neuromorphic platform.
The experimental results show that SNEAP can achieve a 418×
reduction in end-to-end execution time, and reduce energy
consumption and spike latency, on average, by 23% and 51%
respectively, compared with SpiNeMap.
KEYWORDS
spiking neural network, toolchain, partitioning, mapping, neuro-
morphic platform
1 INTRODUCTION
Spiking neural networks (SNNs) [1] are the third generation of
artificial neural networks (ANNs), inspired by brain science. At
present, SNNs are widely adopted in image classification, pattern
recognition, and similar tasks [2]. A neuron in an SNN accepts
stimuli and generates a spike if its membrane potential exceeds the
firing threshold. Neurons communicate with each other by spikes.
Compared with currently popular ANNs, SNNs have more biological
characteristics and require less power when simulated on
neuromorphic platforms [3].
Neuromorphic platforms have been gaining more attention recently.
Typical examples are IBM's TrueNorth [4], Intel's Loihi [5],
ETH's Dynapse [6], and UM's SpiNNaker [7]. All of these
neuromorphic platforms use a Network-on-Chip (NoC) to connect
multiple neuromorphic cores. Each neuromorphic core holds a fixed
number of neurons.
Mapping SNNs to the various neuromorphic platforms is a key step
in applying these platforms. The general solution is to divide an
SNN into multiple partitions and then map these partitions to the
neuromorphic cores. The number of neurons in each partition must
not exceed the capacity of a single neuromorphic core. If the
partitions outnumber the cores and cannot all be mapped at once,
multiple rounds of mapping are required to ensure that all
partitions are mapped and executed.
There are several mapping methods for deploying SNNs to these
neuromorphic platforms, such as PACMAN [8], NEUTRAMS [9],
SCO [10], and SpiNeMap [11]. However, these mapping methods
have some problems. PACMAN only partitions the SNN model
and then sequentially maps the resulting partitions to the ARM
cores, which leads to spike congestion on the NoC. SCO adopts
a sequential mapping method that minimizes the number of
neuromorphic cores used, in order to reduce the overhead of
hardware resources. But this method does not optimize the spike
communication between cores, resulting in increased spike latency
and power consumption. Although SpiNeMap uses a two-stage
optimization method to reduce the power consumption and latency of
the neuromorphic platform, the entire process takes a huge amount
of time for large-scale SNNs. Moreover, limited by its algorithm,
SpiNeMap does not find the best mapping scheme.
There are two challenges for SNN mapping. The first comes
from the partitioning process: for larger SNNs, partitioning is
slow and it is hard to find the solution with the fewest spike
communications. The second comes from the mapping process. A fast
and efficient search algorithm is needed to find the mapping
scheme that minimizes the spike latency and energy of the
NoC-based neuromorphic platform. During the mapping process, the
search algorithm continuously evaluates metrics such as average
hop, latency, and energy. However, evaluating these metrics
often requires real hardware or a hardware simulator, which
consumes so much time that the entire optimization process
becomes impractical.
To confront these challenges, we propose a toolchain for mapping
a large-scale SNN onto a NoC-based neuromorphic platform,
called SNEAP (Spiking NEural network mAPping toolchain). The
toolchain includes four parts: profiling, partitioning, mapping, and
evaluation. We first profile the connection information and the
spike traces of an SNN from the software simulator. Then we use a
multi-level graph partitioning method to quickly reduce the
number of spikes communicated between partitions under the
constraints of the hardware structure. Subsequently, a heuristic
algorithm selected from three candidate algorithms is used to map
the partitions to the NoC architecture to optimize latency and
energy. Finally, the mapping scheme is evaluated by a NoC-based
hardware simulator, Noxim++ [11], to obtain key performance
statistics.
The contributions of this paper are as follows:
• We propose a toolchain to map an SNN to an underlying
NoC-based neuromorphic platform. During the mapping
process, the average neuron communication latency and
power consumption are minimized.
• For large-scale SNNs, we use an effective graph partitioning
method to improve the quality of partitioning while
reducing the partitioning time dramatically.
• We use an optimization algorithm selected from three
algorithms to minimize the average neuron communication
latency and power consumption during the mapping phase
of the toolchain.
• Average hop is used to estimate the average neuron
communication latency and power consumption instead of
using the simulator, which improves the search speed.
We evaluate SNEAP using several SNNs. The experimental results
show that SNEAP can achieve a 418× reduction in end-to-end
execution time, and reduce average energy consumption by 23% and
average spike latency by 51%, compared to SpiNeMap [11].
2 BACKGROUND & RELATED WORKS
2.1 NoC of Neuromorphic Platforms
Neuromorphic platforms aim at developing VLSI systems that
mimic the neuro-biological networks of the nervous system, that
is, SNNs. Such a platform is a large-scale parallel system composed
of a large number of computing units, called neuromorphic cores,
interconnected by a NoC. The NoC is responsible for managing
communication in the neuromorphic platform. The NoC generally
uses a dimension-order routing strategy to avoid deadlocks.
According to the topology, two types of NoC are commonly used:
NoC-tree and NoC-mesh. Examples include the NoC-mesh for TrueNorth
and Loihi, the multi-stage NoC-mesh for Dynapse [6], and the
NoC-tree for CxQuad.
SpiNNaker [7] simulates the brain by connecting 1 million ARM
processors together in real time. Eighteen ARM processors are
integrated into one chip multiprocessor (CMP), and 216 CMPs form
a complete system with a 2D toroidal mesh structure. Dynapse
[6] is an advanced mixed-signal multi-core neuromorphic
processor. Dynapse has 4 cores, and each core has 256 analog circuit
neurons. These 256 analog neurons are placed on a 16x16 2D mesh.
The maximum fan-in is 64 connections and the maximum fan-out is 4k
connections. TrueNorth [4] has 4096 cores, and each core includes
256 Leaky Integrate-and-Fire (LIF) model neurons. Synapses,
neurons, and axons are organized in the form of crossbars. The 4096
cores are connected together through a 2D-mesh NoC. Loihi [5] is a
digital neuromorphic chip developed by Intel. Each Loihi chip
has 128 neuromorphic cores and each core has 1024 neuromorphic
units. Each core can simulate 130,000 LIF neurons and 1.3 billion
synapses with a learning engine that supports on-chip training.
2.2 Mapping Tools of Neuromorphic Platforms
Since the architecture of each neuromorphic platform is different,
a dedicated toolchain is required to simulate SNNs efficiently
on each platform. SpiNNaker [7] has a 2D toroidal mesh structure,
and PACMAN [8] was proposed to address SNN mapping on
SpiNNaker. PACMAN uses a simulated annealing algorithm to
search for the best partitioning scheme. But PACMAN
only partitions the SNN model, which leads to spike congestion
on the NoC. TrueNorth [4] also has its own mapping tool,
Corelet [12]. It uses layout and routing optimization schemes from
the traditional VLSI field to map logical SNNs to physical
cores. SpiNeMap [11] is proposed for the 2D-mesh architecture of
Dynapse [6]. It divides the mapping process into two phases:
partitioning and placement. Its authors design a greedy
Kernighan-Lin algorithm for the partitioning phase and use the
particle swarm optimization algorithm in the placement phase. For
some neuromorphic platforms built from new devices, [10] and [13]
were proposed to enable SNNs to run effectively on these platforms.
3 TOOLCHAIN
3.1 Overview
The toolchain we propose, which maps SNNs onto NoC-based
neuromorphic platforms, is called SNEAP (Spiking NEural network
mAPping toolchain). As shown in Figure 1, SNEAP consists of 4
phases:
1. Profiling phase: The topological structure of the trained
SNN and the behavior of its neurons are extracted by the
SNN software simulator to form an undirected graph.
2. Partitioning phase: The graph is divided into multiple
partitions based on the capability of the target neuromorphic
platform. A multi-level partitioning algorithm is used to minimize
spike communication among partitions.
3. Mapping phase: A selected algorithm distributes these
partitions over the NoC of the target hardware so as to minimize
the average hop of all spikes on the NoC.
4. Evaluation phase: The mapping scheme is evaluated by a
NoC-based hardware simulator, Noxim++ [11], to obtain key
performance statistics.
[Figure 1 panels: Phase 1, Profiling (spiking neural network to SNN graph); Phase 2, Partitioning (multi-level graph partitioning into partitions 1 to 3); Phase 3, Mapping (multiple mapping algorithms: particle swarm optimization, simulated annealing, and tabu search with candidate set and tabu list; mapping scheme); Phase 4, Evaluation (hardware simulator Noxim++; hardware diagram of routers R and crossbars).]
Figure 1: Overview of SNEAP.
3.2 Profiling
SNN soware simulators (CARLsim [14], Nest [15], etc.) have been
widely used by neuroscientists to precisely simulate the behavior
of SNN. At present, most of SNN soware simulators provide pro-
gramming interfaces for developer to construct SNN. Aer con-
struction, the developer can use aributes of the SNN to configure
the SNN soware simulator. ese aributes include the number
of neurons, neuron dynamic model, network topology and etc.
In this paper, we use CARLsim [14] to extract the connection
information of the SNN and the behavior of spike. Aer we define
the structure and connection scheme of SNN, we set the program-
ming interface of CARLsim for simulation. When the simulation is
finished, the log files of CARLsim are analyzed to generate graph
with neurons as vertices and with synapses as edges between neu-
rons. e weights of the edges are the number of spikes commu-
nicated on synapses. In addition, spike trace file can be obtained
during the simulation. Each trace in spike trace file shows the spe-
cific behavior of each spike, and contains the ID of the source and
destination neurons and firing time. en we can perform parti-
tioning and mapping on the SNN through the obtained graph and
spike trace file.
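The graph construction described above can be sketched as follows. This is a minimal illustration, not the toolchain's actual parser: the record layout (firing time, source ID, destination ID) follows the trace description in this section, while the function name build_snn_graph and the in-memory list standing in for the CARLsim log are assumptions made for the example.

```python
from collections import defaultdict


def build_snn_graph(spike_trace):
    """Fold a spike trace into an undirected, spike-weighted graph.

    spike_trace: iterable of (firing_time, source_id, destination_id).
    Returns {(u, v): spike_count} with u < v.
    """
    weights = defaultdict(int)
    for _firing_time, src, dst in spike_trace:
        if src == dst:
            continue  # assumption: self-connections never cross partitions
        edge = (min(src, dst), max(src, dst))  # undirected edge key
        weights[edge] += 1
    return dict(weights)


# Hypothetical four-spike trace over neurons 1..4.
trace = [(0, 1, 2), (1, 2, 1), (1, 1, 3), (2, 3, 4)]
graph = build_snn_graph(trace)
```

Spikes in both directions over the same synapse pair are merged into one edge weight here, which matches the undirected graph mentioned in Section 3.1.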
3.3 Partitioning
In our work, we use the multi-level graph partitioning
paradigm [16] to construct our partitioning tool. This tool solves
the SNN partitioning problem with the goal of minimizing the number
of spikes between partitions.
The partitioning problem can be written as G(N, S) → P(V, E).
This is a classic graph partitioning problem, which is
NP-complete. Previous works use classic algorithms to solve the
problem, such as particle swarm optimization (PSO) [17] and
Kernighan-Lin (KL) [18]. However, these approaches take a lot of
time to find a good partitioning of the SNN. We use a multi-level
graph partitioning method to optimize the partitioning of
large-scale SNNs. For the purpose of SNN partitioning,
we introduce the following notation.
G(N, S) = SNN graph with a set N of vertices (neurons) and a
set S of edges (i, j) (synapses).
P(V, E) = Partitioned SNN graph with a set V of vertices
(partitions) and a set E of edges between partitions.
G_i(N_i, S_i) = The i-th level coarsened graph.
D_c[v] = Partitioning vector at the c-th uncoarsening level,
giving the partition to which vertex v belongs.
B(v) = The union of the partitions to which the vertices adjacent
to v belong.
ED[v]_b = External degree. For every b ∈ B(v), ED[v]_b is the
sum of the weights of edges (v, u) such that D_c[u] = b.
ID[v] = Internal degree. ID[v] is the sum of the weights of
edges (v, u) such that D_c[u] = D_c[v].
The multi-level graph partitioning paradigm [16] consists of three
steps (shown in Figure 2): coarsening, initial partitioning, and
uncoarsening.
The coarsening step is divided into multiple levels, and the original
graph G_0(N_0, S_0) is coarsened level by level. In the i-th level of
coarsening, sets of vertices of G_i are combined to form single
vertices of the next-level coarser graph G_{i+1}. The vertices of
graph G_i are visited in random order. If a vertex m is not folded
yet, we fold m with the unfolded adjacent vertex n for which the
weight of the edge (m, n) is maximum over all valid adjacent edges;
together they form a vertex v of graph G_{i+1}. We mark m and n as
folded, and then repeat the above process until no more vertices
can be folded.
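One coarsening level can be sketched as a heavy-edge matching pass. This is a simplified illustration under stated assumptions: the function name coarsen_once and the edge-dictionary format are hypothetical, every unfolded neighbor is treated as a valid fold partner, and vertex weights (the capacity check) are left out.

```python
import random


def coarsen_once(vertices, edges, seed=0):
    """One level of heavy-edge matching.

    vertices: iterable of vertex ids.
    edges: {(u, v): weight} with one entry per undirected edge.
    Returns (mapping, coarse_edges); mapping sends each fine vertex
    to the id of the coarse vertex it was folded into.
    """
    rng = random.Random(seed)
    adj = {v: {} for v in vertices}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    order = list(adj)
    rng.shuffle(order)  # vertices are visited in random order

    mapping = {}
    next_id = 0
    for m in order:
        if m in mapping:
            continue  # already folded into a coarse vertex
        unfolded = [(w, n) for n, w in adj[m].items() if n not in mapping]
        mapping[m] = next_id
        if unfolded:
            _, n = max(unfolded)  # heaviest edge to an unfolded neighbor
            mapping[n] = next_id
        next_id += 1

    # Edges inside a folded pair vanish; parallel edges add up.
    coarse_edges = {}
    for (u, v), w in edges.items():
        cu, cv = mapping[u], mapping[v]
        if cu != cv:
            key = (min(cu, cv), max(cu, cv))
            coarse_edges[key] = coarse_edges.get(key, 0) + w
    return mapping, coarse_edges


# Hypothetical path graph 0-1-2-3 with heavy outer edges.
mapping, coarse = coarsen_once(range(4), {(0, 1): 10, (1, 2): 1, (2, 3): 10})
```

For this path graph every visiting order folds the pairs (0, 1) and (2, 3), so only the light edge of weight 1 survives in the coarse graph.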
The initial partitioning step divides the graph G_c generated by the
coarsening step into k partitions. The upper bound on the total
vertex weight of each partition is determined by the number of
neurons that a neuromorphic core can accommodate. To build a
partition, a vertex m in graph G_c is randomly selected and inserted
into the partition. We then search the set of edges adjacent to the
partition for the edge (m, n) with the largest weight and insert
vertex n into the partition. Whenever a vertex is inserted, the set
of adjacent edges of the partition is updated. We stop growing the
partition when its total vertex weight reaches the upper bound. This
process is repeated until the graph is divided into k partitions.
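The greedy growing procedure can be sketched as below. It is an illustrative reading of the paragraph above rather than the toolchain's code: the name grow_partitions, the vertex_weight dictionary, and the rule of skipping neighbors that would overflow the capacity are assumptions, and every single vertex is assumed to fit into one core.

```python
import random


def grow_partitions(vertex_weight, edges, capacity, seed=0):
    """Greedy region growing: returns {vertex: partition_index}."""
    rng = random.Random(seed)
    adj = {v: {} for v in vertex_weight}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w

    unassigned = set(vertex_weight)
    part_of = {}
    k = 0
    while unassigned:
        current = rng.choice(sorted(unassigned))  # random seed vertex
        load = 0
        frontier = {}  # candidate vertex -> heaviest edge into the partition
        while current is not None:
            part_of[current] = k
            unassigned.discard(current)
            load += vertex_weight[current]
            frontier.pop(current, None)
            # Update the set of edges adjacent to the growing partition.
            for n, w in adj[current].items():
                if n in unassigned:
                    frontier[n] = max(frontier.get(n, 0), w)
            # Follow the heaviest adjacent edge whose endpoint still fits.
            fitting = [(w, n) for n, w in frontier.items()
                       if load + vertex_weight[n] <= capacity]
            current = max(fitting)[1] if fitting else None
        k += 1
    return part_of


# Hypothetical coarse graph: four unit-weight vertices, two per core.
parts = grow_partitions({0: 1, 1: 1, 2: 1, 3: 1},
                        {(0, 1): 10, (1, 2): 1, (2, 3): 10}, capacity=2)
```

In this sketch the number of partitions k is an outcome of the capacity bound rather than an input, which is one possible way to reconcile the fixed core capacity with the "k partitions" of the text.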
The uncoarsening step, like the coarsening step, is also divided
into multiple levels. The partitioning P_c of the coarser graph
G_c(N_c, S_c) is projected back, level by level, to the original
graph G_0. At each level we use a global priority queue that stores
the vertices according to their gains. Initially, all the vertices are
scanned, and those whose sum of ED is greater than or equal to their
ID are inserted into the priority queue. In particular, let v be such
a vertex and let b ∈ B(v) be the partition for which ED[v]_b is
maximum. We insert v into the priority queue with a gain
equal to ED[v]_b − ID[v]. The vertex v with the highest gain is then
selected from the global priority queue and moved to the partition
b for which ED[v]_b is maximum, provided that the capacity of the
neuromorphic core is still satisfied. We continue moving vertices
until x vertex moves in a row have failed to decrease the sum of
edge weights among partitions. In that case, the last x moves are
undone. After uncoarsening level by level in this way, the optimized
k partitions are finally obtained.
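The gain computation that drives the priority queue can be illustrated with a small sketch. The helper names degrees and best_move and the adjacency format are hypothetical; the sketch only computes ID[v], ED[v]_b, and the gain ED[v]_b − ID[v] for one vertex, and omits the queue, the capacity check, and the undo of the last x moves.

```python
def degrees(v, adj, part_of):
    """Return (ID[v], {b: ED[v]_b}) for vertex v.

    adj: {vertex: {neighbor: edge_weight}}
    part_of: {vertex: partition}, playing the role of D_c.
    """
    internal = 0
    external = {}
    for u, w in adj[v].items():
        if part_of[u] == part_of[v]:
            internal += w
        else:
            external[part_of[u]] = external.get(part_of[u], 0) + w
    return internal, external


def best_move(v, adj, part_of):
    """Return (target_partition, gain), or None if v is not a candidate.

    A vertex is a candidate only if the sum of its external degrees
    is greater than or equal to its internal degree.
    """
    internal, external = degrees(v, adj, part_of)
    if not external or sum(external.values()) < internal:
        return None
    b = max(external, key=external.get)  # partition with maximum ED[v]_b
    return b, external[b] - internal


# Hypothetical example: vertex 0 sits in partition 0 but talks mostly to 2.
adj = {0: {1: 1, 2: 5}, 1: {0: 1}, 2: {0: 5}}
part_of = {0: 0, 1: 0, 2: 1}
move = best_move(0, adj, part_of)
```

Moving vertex 0 to partition 1 would cut the edge of weight 1 and uncut the edge of weight 5, so the gain of 4 equals the decrease in inter-partition edge weight.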
Our method uses heuristics to quickly compress a large
graph in the coarsening step, so that the subsequent optimization
steps take less time because the graph is much smaller.
Furthermore, since in the uncoarsening step the single priority
queue contains only vertices whose sum of ED is greater than or
equal to their ID, this method has less powerful hill-climbing
capabilities than the generalized KL [18], which uses multiple
priority queues and considers all the vertices.
[Figure 2 panels: Step 1, Coarsening (vertex folding); Step 2, Initial Partitioning; Step 3, Uncoarsening (vertex moving); each panel shows partitions 1 to 3.]
Figure 2: Multi-level graph partitioning diagram.
3.4 Mapping
Aer the SNN is divided into multiple partitions, the placement
of partitions on the neuromorphic platform also influences the la-
tency and power consumption of the platform. As shown in Figure
3, different mapping schemes will change the communication be-
havior of spikes on the NoC, resulting in differences in power con-
sumption and latency. In this paper, we implement three heuristic-
based search algorithms to construct the mapping tool. e tool
can find out the best mapping scheme that minimizes the spike la-
tency and energy of the NoC-based neuromorphic platform. ese
search algorithms are Simulated Annealing algorithm (SA), Parti-
cle Swarm Optimization (PSO), and Tabu Search algorithm (Tabu)
respectively.
The optimization objective of mapping could be latency and/or
energy. However, evaluating these metrics often requires
real hardware or a hardware simulator, which leads to substantial
time overhead and makes the entire search process impractical.
As described in Section 3.4.2, average hop is used as a proxy for
the latency and power consumption on the NoC. Compared with the
above two metrics, average hop is easier to obtain [19]. Thus,
instead of minimizing latency and energy consumption directly, we
minimize the average hop. Since we adopt the XY routing algorithm
in our neuromorphic platform, we propose a method of evaluating the
average hop based on XY routing, which avoids the time overhead
caused by using real hardware or a hardware simulator.
3.4.1 Mapping Algorithms. We implement three heuristic-based
algorithms (SA, PSO, Tabu) for finding a mapping scheme with
the smallest average hop. As shown in Section 3.3, the partitioned
SNN can be represented as a graph P(V, E). The architecture of the
NoC-based neuromorphic platform can be considered as a graph
A(C, I), where C is the set of neuromorphic cores and I is the set of
connections among these cores for a given interconnect topology.
A mapping M can then be written as M : P(V, E) → A(C, I). Mapping
M is represented by a matrix (m_{ij}) \in \{0, 1\}^{|C| \times |V|}, where
m_{ij} is defined as:
m_{ij} = \begin{cases} 1 & \text{if partition } v_j \in V \text{ is mapped to core } c_i \in C \\ 0 & \text{otherwise} \end{cases}    (1)
[Figure 3 panels: SNN partitions, mapping scheme 1, and mapping scheme 2, with links shaded by spike traffic (high, medium, low).]
Figure 3: Congestion impact of different mapping schemes
on neuromorphic platform.
The optimization objective of our mapping phase is to find the
mapping with the minimum average hop count H, i.e.
H_{\min} = \min\{ H(M_i) \mid i \in \{1, 2, \ldots, N\} \}    (2)
where N is the number of evaluated mapping schemes.
The three algorithms use the same heuristic function (Section
3.4.2) to evaluate a candidate mapping scheme. The input and
output formats of the three algorithms are also the same. The input
is a randomly initialized scheme. The output is the best scheme the
algorithm can find within the given time limit. They differ in how
they choose the next scheme from the neighbors. Neighbors are
schemes derived from the current scheme. For example, in the
current scheme, every partition has a corresponding core. Swapping
the cores of any two partitions leads to a new scheme. The
algorithms adopt different search strategies to find the best of the
new schemes. SA allows the search to move to a worse scheme with
a certain probability, which helps it escape local optima. PSO is
a population-based algorithm. Every particle in the population
adapts according to both the best solution in the population's
history and the best solution in its own history. Tabu uses a list,
called the tabu list, to record every past move. Using this
history, Tabu can avoid cycling and escape local optima.
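As an illustration of the swap neighborhood and of how SA accepts worse schemes, a minimal simulated annealing sketch is given here. It is not the toolchain's implementation: the names anneal and total_hops, the geometric cooling schedule, and the parameters steps, t0, and alpha are assumptions chosen for the example, and the cost is the total hop count (average hop times the trace length) on a row-major mesh. Empty cores take part in swaps so that a partition can also move to a free core.

```python
import math
import random


def total_hops(placement, traffic, width):
    """Sum of Manhattan distance times spike count.

    placement: {partition: core_index}, cores numbered row by row.
    traffic: {(partition_a, partition_b): spike_count}.
    """
    total = 0
    for (a, b), spikes in traffic.items():
        ax, ay = placement[a] % width, placement[a] // width
        bx, by = placement[b] % width, placement[b] // width
        total += (abs(ax - bx) + abs(ay - by)) * spikes
    return total


def anneal(traffic, n_partitions, width, height,
           steps=5000, t0=10.0, alpha=0.999, seed=0):
    """Assumes n_partitions <= width * height. Returns (placement, cost)."""
    rng = random.Random(seed)
    # slots[core] holds a partition id; ids >= n_partitions mean "empty".
    slots = list(range(width * height))
    rng.shuffle(slots)  # random initial scheme

    def placement_of(s):
        return {p: core for core, p in enumerate(s) if p < n_partitions}

    current = total_hops(placement_of(slots), traffic, width)
    best, best_slots = current, slots[:]
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(slots)), 2)
        slots[i], slots[j] = slots[j], slots[i]  # neighbor: swap two cores
        cand = total_hops(placement_of(slots), traffic, width)
        # Always accept improvements; accept worse schemes with a
        # probability that shrinks as the temperature drops.
        if cand <= current or rng.random() < math.exp((current - cand) / temp):
            current = cand
            if cand < best:
                best, best_slots = cand, slots[:]
        else:
            slots[i], slots[j] = slots[j], slots[i]  # undo the swap
        temp *= alpha
    return placement_of(best_slots), best


# Hypothetical case: two partitions exchanging 100 spikes on a 3x3 mesh.
placement, cost = anneal({(0, 1): 100}, n_partitions=2, width=3, height=3)
```

With only two communicating partitions the optimum places them on adjacent cores, one hop apart.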
3.4.2 Algorithm for average hop evaluation. Current NoC-based
neuromorphic platforms mainly adopt the XY static routing
algorithm, which avoids deadlocks and is very simple to
implement in hardware. Thanks to the static nature of XY routing,
the hop distance that a spike traverses can be calculated directly
without hardware simulation. Based on this, we propose an
algorithm that directly calculates the average hop.
We formalize the algorithm for average hop evaluation as
Algorithm 1. First, we extract the communications between partitions
from the spike trace. Then, we traverse every pair of partitions
and calculate the distance between the cores to which the two
partitions are mapped. Finally, we multiply each distance by the
corresponding number of communications, sum the products, and
divide by the trace length to get the average hop.
Algorithm 1 Average Hop Evaluation Algorithm
Input: the partitions (p_1, p_2, ..., p_n), the number of cores m,
  the mapping option M, the spike trace.
Output: average hop H.
trace_length ← number of spikes in the spike trace
C_{n×n} ← zero matrix // communications between partitions
for spike in spike trace do
    time_step, neuron_source, neuron_destination ← spike
    find p_i, p_j such that neuron_source ∈ p_i, neuron_destination ∈ p_j
    C(p_i, p_j) ← C(p_i, p_j) + 1
end for
H ← 0
for a in partitions do
    for b in partitions do
        s ← M(a), d ← M(b) // source and destination cores
        (s_x, s_y) ← get_coordinate(s), (d_x, d_y) ← get_coordinate(d)
        hop_distance ← |s_x − d_x| + |s_y − d_y|
        H ← H + hop_distance ∗ C(a, b)
    end for
end for
H ← H ÷ trace_length
return H
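Algorithm 1 can be written as a short Python function. This is a sketch under assumptions that are not stated in the paper: cores are numbered row by row on the mesh (so a core index converts to coordinates by division and remainder), the lookup tables partition_of and core_of are plain dictionaries, and the trace is non-empty.

```python
from collections import Counter


def average_hop(spike_trace, partition_of, core_of, mesh_width):
    """Average XY-routing hop count per spike.

    spike_trace: list of (time_step, source_neuron, destination_neuron).
    partition_of: {neuron: partition}.
    core_of: {partition: core_index}, cores numbered row by row.
    """
    # Count communications between every ordered pair of partitions.
    traffic = Counter()
    for _time_step, src, dst in spike_trace:
        traffic[(partition_of[src], partition_of[dst])] += 1

    total = 0
    for (a, b), count in traffic.items():
        sx, sy = core_of[a] % mesh_width, core_of[a] // mesh_width
        dx, dy = core_of[b] % mesh_width, core_of[b] // mesh_width
        # XY routing always travels the Manhattan distance.
        total += (abs(sx - dx) + abs(sy - dy)) * count
    return total / len(spike_trace)


# Hypothetical example on a 5x5 mesh: three partitions on cores 0, 1, 5.
partition_of = {0: 0, 1: 0, 2: 1, 3: 2}
core_of = {0: 0, 1: 1, 2: 5}
trace = [(0, 0, 1), (0, 0, 2), (1, 2, 3), (1, 0, 3)]
hops = average_hop(trace, partition_of, core_of, mesh_width=5)
```

The four spikes travel 0, 1, 2, and 1 hops respectively, so the average is 4 / 4 = 1.0. Spikes inside one partition count as zero hops but still contribute to the trace length.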
4 EXPERIMENT SETUP
4.1 Experiment platform
The experimental platform is built from two simulators and two
tools.
The two simulators are the SNN software simulator CARLsim [14] and
the hardware simulator Noxim++ [11]. CARLsim is a GPU-accelerated
software SNN simulator that can be used to train and test
SNNs. The behavior of spikes can be analyzed from the log file
of CARLsim. Noxim++ is a trace-driven, cycle-accurate NoC
simulator that extends Noxim [20].
Noxim++ is used to simulate the execution of an SNN on real
NoC-based hardware, so as to evaluate key performance statistics of
the NoC, such as average hop, delay, and power consumption.
The two tools are the partitioning tool and the mapping tool. We
implement the partitioning tool with reference to Metis [16],
through a Python interface, including all key components of the
multi-level partitioning paradigm. The mapping tool contains three
heuristic algorithms (SA, PSO, Tabu) and a component for
evaluating average hop. Combined with the average hop evaluation
component, these algorithms are used to search for the best mapping
on the NoC-based neuromorphic platform.
Our experiments use a hardware configuration with a 5x5 2D-mesh
NoC in which each neuromorphic core adopts a crossbar structure.
Every crossbar can accommodate at most 256 neurons, meaning that
a crossbar sends at most 256 spikes per time step.
All experiments were performed on a machine with an Intel i7-7700
CPU, 16GB RAM, and an NVIDIA GTX1060 GPU, running Ubuntu 16.04.
4.2 Evaluated SNNs
Table 1 lists the SNNs used to evaluate our proposed toolchain.
These five SNNs have different topologies, with varying
depths and widths of SNN layers and different connectivity schemes.
Table 1: Evaluated SNNs. The number in the first column
represents the number of neurons of the SNN.
SNN Name          | Network Topology       | Spikes
Smooth_320 [14]   | Feedforward, 2 layers  | 175124
Smooth_1280 [14]  | Feedforward, 2 layers  | 981808
MLP_2048 [2]      | Feedforward, 2 layers  | 15905792
Edge_5120 [14]    | Feedforward, 3 layers  | 4570546
Random_6212 [14]  | Feedforward, 3 layers  | 51756245
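As a quick check that the evaluated SNNs fit the hardware configuration of Section 4.1, the neuron counts in Table 1 can be divided by the crossbar capacity. The sketch below only computes a lower bound on the number of partitions; the actual partitioning may use more, and the constant and variable names are chosen for this example.

```python
import math

CROSSBAR_CAPACITY = 256  # neurons per crossbar (Section 4.1)
MESH_CORES = 5 * 5       # cores in the 5x5 2D-mesh NoC

# Neuron counts taken from the SNN names in Table 1.
snn_neurons = {
    "Smooth_320": 320,
    "Smooth_1280": 1280,
    "MLP_2048": 2048,
    "Edge_5120": 5120,
    "Random_6212": 6212,
}

# Fewest crossbars that could hold each SNN if every one were full.
min_partitions = {name: math.ceil(n / CROSSBAR_CAPACITY)
                  for name, n in snn_neurons.items()}
fits_in_one_round = {name: k <= MESH_CORES
                     for name, k in min_partitions.items()}
```

Under this bound the largest network needs at least 25 crossbars, which is exactly the number of cores in the 5x5 mesh.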
4.3 Metrics for evaluation
We evaluate all three mapping methods in terms of the following
metrics for every SNN.
Energy consumption on the NoC: This is the overall energy
consumed by spike communication on the NoC.
Average latency: This is the delay experienced by spikes before
reaching their destination, averaged over all spikes.
Congestion count: Besides latency and energy consumption, one
essential metric that we get from the toolchain is the congestion
count, which reflects the degree of congestion on the NoC.
CongestionCount = \sum_{t=0}^{n} C_t    (3)
During each time step t, congestion is defined as the number of
spikes that exceed the load a mesh edge can carry. The spikes that
exceed the load cannot be transmitted in this time step; their
number is C_t.
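One possible reading of this definition is sketched here. The per-step link loads and the link capacity are taken as given inputs, the function name congestion_count is hypothetical, and spikes that are delayed in one time step are not carried over into the next, which a cycle-accurate simulator would model.

```python
def congestion_count(link_loads_per_step, link_capacity):
    """Sum over time steps of the spikes exceeding each link's capacity.

    link_loads_per_step: list with one {link: spikes_requested} per step.
    link_capacity: spikes one mesh edge can carry per time step (assumed
    identical for all edges).
    """
    total = 0
    for loads in link_loads_per_step:
        # C_t: spikes that cannot be transmitted in this time step.
        c_t = sum(max(0, spikes - link_capacity) for spikes in loads.values())
        total += c_t
    return total


# Hypothetical two-step example with a capacity of 2 spikes per link.
steps = [{("a", "b"): 3, ("b", "c"): 1}, {("a", "b"): 5}]
count = congestion_count(steps, link_capacity=2)
```

Here C_0 = 1 and C_1 = 3, so the congestion count is 4.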
Edge variance: Like the congestion count, edge variance is
used to reflect the degree of congestion and the load distribution
on the NoC. With the XY static routing algorithm, we can obtain the
total number of hops over every edge of the mesh network. Suppose
there are n edges in the mesh, and let e_i denote the total number
of hops over edge i after all time steps.
Edge = (e_1, e_2, \ldots, e_n)    (4)
EdgeVar = Var(Edge)    (5)
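The per-edge hop totals can be obtained by walking each flow along its XY route. The sketch below makes several assumptions that the text leaves open: mesh edges are treated as directed links, idle links count as zero in the variance, the population variance is used, and the names xy_links and edge_variance are chosen for the example.

```python
from collections import Counter
from statistics import pvariance


def xy_links(src, dst):
    """Directed links used by XY routing from src=(x, y) to dst=(x, y)."""
    x, y = src
    links = []
    while x != dst[0]:  # travel along X first
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:  # then along Y
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def edge_variance(flows, width, height):
    """flows: {(src_xy, dst_xy): spike_count} accumulated over all steps."""
    load = Counter()
    # Register every directed link so that idle links count as zero.
    for x in range(width):
        for y in range(height):
            if x + 1 < width:
                load[((x, y), (x + 1, y))] = 0
                load[((x + 1, y), (x, y))] = 0
            if y + 1 < height:
                load[((x, y), (x, y + 1))] = 0
                load[((x, y + 1), (x, y))] = 0
    for (src, dst), spikes in flows.items():
        for link in xy_links(src, dst):
            load[link] += spikes
    return pvariance(list(load.values()))


# Hypothetical 2x1 mesh with 4 spikes sent from core (0, 0) to core (1, 0).
route = xy_links((0, 0), (2, 1))
variance = edge_variance({((0, 0), (1, 0)): 4}, width=2, height=1)
```

In the 2x1 example the two directed links carry 4 and 0 hops, giving a variance of 4; a perfectly balanced load would give 0.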
5 RESULTS AND DISCUSSION
In this section, we compare SNEAP with two state-of-the-art
methods, SpiNeMap [11] and SCO [10]. SpiNeMap uses SpiNeCluster
to partition SNNs into clusters to minimize the total number of
spikes among the clusters, and SpiNePlacer to optimize the
placement of clusters on the crossbars of the neuromorphic hardware
to minimize energy consumption and latency. SCO uses its framework
to balance the utilization of crossbars in the hardware. We
summarize the improvements of our method over SpiNeMap and SCO,
and describe the results in detail below.
5.1 Partitioning Performance
In Figure 4, we compare the global traffic (the number of spikes
among partitions), normalized to SpiNeMap, and the execution time
for each SNN under the different methods. Compared with SpiNeMap,
SNEAP achieves an 890× reduction in execution time. The cause of
this reduction is that a heuristic algorithm compresses the
large graph quickly during the partitioning phase, so that the
subsequent optimization process takes less time.
On average, SNEAP has 8% fewer spikes among partitions
than SpiNeMap. This improvement results from the
optimization algorithm of SNEAP, which is good at escaping
local optima.
[Figure 4 panels, each per SNN (Smooth_320, Smooth_1280, MLP_2048, Edge_5120, Random_6212) for SNEAP and SpiNeMap: execution time in seconds on a logarithmic axis (10^0 to 10^6), and global traffic normalized to SpiNeMap (0% to 120%).]
Figure 4: Performance in partitioning phase.
5.2 Mapping Algorithms Comparison
As shown in Figure 5, we evaluate the convergence time of the three
algorithms (SA, PSO, Tabu) and obtain the relationship between
average hop and time consumed. We performed the same
analysis on the other four SNNs, and the results are similar.
Because SA finds the best results in the shortest time for this
type of optimization problem, we use SA in this paper to search
for the best mapping scheme.
[Figure 5 plots average hop (axis from 1.4 to 2.2) against optimization time (0 to 250 s) for PSO, SA, and Tabu.]
Figure 5: Comparison of convergence speed.
Figure 6 shows the average latency, dynamic energy, congestion
count, and edge variance in the mapping phase under the different
heuristic algorithms, normalized to the PSO used by SpiNeMap.
As can be seen from Figure 6, compared with the other algorithms SA
achieves a reduction of about 1% to 8% (3% on average) in average
latency, 2% to 33% (16% on average) in dynamic energy, 15% to 63%
(28% on average) in edge variance, and 12% to 61% (25% on
average) in congestion count. In conclusion, SA finds a mapping
with lower energy and latency than the other algorithms within a
given time period.
[Figure 6 panels, each per SNN for SA, Tabu, and PSO and normalized to PSO: (a) average latency, (b) dynamic energy, (c) edge variance, (d) congestion count.]
Figure 6: Evaluation of the algorithms in the mapping phase.
5.3 Overall Toolchain Results
5.3.1 Average latency. Figure 7(a) shows the average
latency of all spikes on the NoC under the different methods,
normalized to SpiNeMap. The statistics show that, compared with
SpiNeMap and SCO, SNEAP achieves a large reduction for all SNNs. On
average, the latency of SNEAP is 51% lower than that of SpiNeMap
and 88% lower than that of SCO. These improvements stem from the
optimization objective of SNEAP, which minimizes both the total
number of spikes among the partitions and the average hop. In
addition to the optimization objective, the optimization algorithms
are also better at escaping local optima.
For the largest SNN, Random_6212, SNEAP achieves
92% lower average latency than SpiNeMap, while for other
cases such as MLP_2048, SNEAP only achieves 8% lower average
latency. The cause of this difference is the different connectivity
schemes of MLP_2048 and Random_6212. Compared to random
connectivity (Random_6212), full connectivity (MLP_2048) leaves
less room for optimization in the whole toolchain.
5.3.2 Energy. Figure 7(b) gives the dynamic energy of the NoC
under the different methods, normalized to SpiNeMap. Since all
experiments are based on the same 5x5 2D-mesh structure, static
energy is constant. Consequently, we use the dynamic energy to
evaluate the energy consumption of the NoC. Compared with the other
methods, SNEAP has the lowest energy consumption: on average, it is
23% lower than that of SpiNeMap and 31% lower than that of SCO.
The improvement is due to the multi-level partitioning
algorithm, which outperforms the greedy KL algorithm used by
SpiNeMap. The fewer spikes communicated among the partitions, the
lower the dynamic energy consumption.
[Figure 7 panels, each per SNN plus the average for SNEAP, SpiNeMap, and SCO and normalized to SpiNeMap: (a) average latency, (b) dynamic energy, (c) edge variance, (d) congestion count.]
Figure 7: Overall Results.
5.3.3 Congestion. In Figure 7(c), we report the edge variance
of the NoCs under the different methods, normalized to SpiNeMap. As
shown in Figure 7(c), SNEAP has the lowest edge variance of all
evaluated methods. Relative to SpiNeMap, SNEAP achieves an average
reduction of 61%. Relative to SCO, the average reduction is 1×.
This reduction is due to the partitioning algorithm of SNEAP, which
may accept a non-optimal solution in order to jump out of a local
optimum, unlike the greedy KL used by SpiNeMap. This indirectly
leads to a balanced distribution of spikes on the NoCs.
Figure 7(d) presents the congestion count of the NoCs under the
different methods, normalized to SpiNeMap. The results for the
congestion count are similar to those for the edge variance. A more
balanced mapping of spikes can effectively reduce the congestion
count on the NoC.
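To make the notion of a balanced distribution of spikes concrete, the sketch below computes a per-link load and its variance. It rests on assumptions that are not taken from this section: XY routing, row-major core numbering, and edge variance taken as the population variance of spike counts over all directed mesh links. All names and the example traffic are hypothetical.

```python
from statistics import pvariance


def xy_route(src, dst, width):
    """Yield the directed links (core, next_core) visited by XY routing."""
    x, y = src % width, src // width
    dx, dy = dst % width, dst // width
    while x != dx:  # travel along X first
        nx = x + (1 if dx > x else -1)
        yield (y * width + x, y * width + nx)
        x = nx
    while y != dy:  # then along Y
        ny = y + (1 if dy > y else -1)
        yield (y * width + x, ny * width + x)
        y = ny


def link_loads(traffic, placement, width, height):
    """Spike count carried by every directed link of a width x height mesh."""
    loads = {}
    for y in range(height):
        for x in range(width):
            c = y * width + x
            if x + 1 < width:  # horizontal link pair
                loads[(c, c + 1)] = 0
                loads[(c + 1, c)] = 0
            if y + 1 < height:  # vertical link pair
                loads[(c, c + width)] = 0
                loads[(c + width, c)] = 0
    for (s, d), n in traffic.items():
        for link in xy_route(placement[s], placement[d], width):
            loads[link] += n
    return loads


def edge_variance(traffic, placement, width, height):
    """Population variance of the per-link spike counts (assumed definition)."""
    return pvariance(list(link_loads(traffic, placement, width, height).values()))


# Two partitions on opposite corners of a hypothetical 2x2 mesh.
placement = {0: 0, 1: 3}
one_way = {(0, 1): 4}               # all spikes share two links
two_way = {(0, 1): 2, (1, 0): 2}    # same volume spread over four links

print(edge_variance(one_way, placement, 2, 2))  # 3
print(edge_variance(two_way, placement, 2, 2))  # 1
```

Both cases move the same number of spike-hops, but the one that spreads them over more links has the lower variance, which is the sense in which a balanced mapping reduces pressure on individual links.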
5.3.4 Execution time of toolchains. Figure 8 illustrates the
end-to-end execution time of the different toolchains. SNEAP
achieves 418× lower average execution time than SpiNeMap. There are
two causes: during the partitioning phase SNEAP takes less
execution time than SpiNeMap, and in the mapping phase SA converges
faster than PSO.
[Figure 8 plot: execution time (s) on a logarithmic axis from 10^0 to 10^6 for Smooth_320, Smooth_1280, MLP_2048, Edge_5120, and Random_6212; toolchains compared: SNEAP, SpiNeMap.]
Figure 8: Execution time of toolchains.
6 CONCLUSION & FUTURE WORK
This paper presents SNEAP, a fast and efficient toolchain for
mapping large-scale SNNs onto NoC-based neuromorphic platforms.
SNEAP completes the entire mapping process in four phases:
profiling, partitioning, mapping, and evaluation. In the profiling
phase, we use an SNN software simulator to extract the essential
information of the SNN, such as its topology and spike behavior.
Using this information, we construct the undirected graph of the
SNN and generate spike trace files. In the partitioning phase, we
use a multi-level graph partitioning method to quickly divide the
graph of the SNN into multiple partitions. Our objective is to
minimize the number of spikes between partitions. In the mapping
phase, we use a heuristic-based algorithm (SA) to map the optimized
SNN partitions onto the physical processing units in hardware.
Combined with the optimization in the partitioning phase, the
heuristic-based mapping algorithm optimizes the energy consumption
and spike latency on the NoC-based neuromorphic platform. Using
five SNNs, we show that our toolchain achieves a 418× reduction in
end-to-end execution time and reduces average energy consumption by
23% and average spike latency by 51%, compared to SpiNeMap. In the
future, the toolchain is to support mapping optimization during the
learning process of SNNs. During learning, the topology of an SNN
changes dynamically, which brings challenges to the partitioning
and mapping tasks of the toolchain.
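The mapping phase summarized above can be illustrated with a minimal simulated annealing sketch. It is not the SNEAP implementation: the cost function (spikes times Manhattan hops), the swap move, the geometric cooling schedule, and every parameter value are assumptions chosen for brevity, and all names are hypothetical.

```python
import math
import random


def hop_cost(traffic, placement, width):
    """Total spikes x hops, with Manhattan distance on a 2D mesh."""
    cost = 0
    for (s, d), n in traffic.items():
        a, b = placement[s], placement[d]
        cost += n * (abs(a % width - b % width) + abs(a // width - b // width))
    return cost


def anneal_mapping(traffic, num_parts, width, height,
                   steps=5000, t_start=10.0, t_end=0.01, seed=0):
    """Place partitions 0..num_parts-1 on distinct cores of the mesh."""
    rng = random.Random(seed)
    # slots[i] is the core of partition i; slots beyond num_parts are idle cores.
    slots = list(range(width * height))
    rng.shuffle(slots)
    cost = hop_cost(traffic, slots, width)
    best, best_cost = slots[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(num_parts)
        j = rng.randrange(len(slots))
        if i == j:
            continue
        slots[i], slots[j] = slots[j], slots[i]  # swap two cores
        new_cost = hop_cost(traffic, slots, width)
        # Always accept improvements; accept a worse move with a probability
        # that shrinks as the temperature drops, to escape local optima.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / t):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = slots[:], cost
        else:
            slots[i], slots[j] = slots[j], slots[i]  # undo the swap
    return best[:num_parts], best_cost


# Hypothetical chain of four partitions on a 3x3 mesh.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10}
placement, cost = anneal_mapping(traffic, num_parts=4, width=3, height=3)
print(placement, cost)
```

Accepting an occasional worse placement is what lets the search leave a local optimum, which is the property the evaluation credits for the better results over a purely greedy approach.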
REFERENCES
[1] Wolfgang Maass. Networks of spiking neurons: the third generation of neural
network models. Neural networks, 10(9):1659–1671, 1997.
[2] Peter U Diehl and Matthew Cook. Unsupervised learning of digit recognition
using spike-timing-dependent plasticity. Frontiers in computational neuroscience,
9:99, 2015.
[3] Peter U Diehl, Guido Zarrella, Andrew Cassidy, Bruno U Pedroni, and Emre
Neftci. Conversion of artificial recurrent neural networks to spiking neural
networks for low-power neuromorphic hardware. In 2016 IEEE International
Conference on Rebooting Computing (ICRC), pages 1–8. IEEE, 2016.
[4] Filipp Akopyan, Jun Sawada, Andrew Cassidy, Rodrigo Alvarez-Icaza, John
Arthur, Paul Merolla, Nabil Imam, Yutaka Nakamura, Pallab Datta, Gi-Joon
Nam, et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron
programmable neurosynaptic chip. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 34(10):1537–1557, 2015.
[5] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang
Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta
Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning.
IEEE Micro, 38(1):82–99, 2018.
[6] Saber Moradi, Ning Qiao, Fabio Stefanini, and Giacomo Indiveri. A scalable
multicore architecture with heterogeneous memory structures for dynamic
neuromorphic asynchronous processors (dynaps). IEEE transactions on biomedical
circuits and systems, 12(1):106–122, 2017.
[7] Steve B Furber, David R Lester, Luis A Plana, Jim D Garside, Eustace Painkras,
Steve Temple, and Andrew D Brown. Overview of the spinnaker system archi-
tecture. IEEE Transactions on Computers, 62(12):2454–2467, 2012.
[8] Francesco Galluppi, Sergio Davies, Alexander Rast, Thomas Sharp, Luis A Plana,
and Steve Furber. A hierachical configuration system for a massively parallel
neural hardware platform. In Proceedings of the 9th conference on Computing
Frontiers, pages 183–192. ACM, 2012.
[9] Yu Ji, YouHui Zhang, ShuangChen Li, Ping Chi, CiHang Jiang, Peng Qu, Yuan
Xie, and WenGuang Chen. Neutrams: Neural network transformation and
co-design under neuromorphic hardware constraints. In The 49th Annual IEEE/ACM
International Symposium on Microarchitecture, page 21. IEEE Press, 2016.
[10] Matthew Kay Fei Lee, Yingnan Cui, Thannirmalai Somu, Tao Luo, Jun Zhou,
Wai Teng Tang, Weng-Fai Wong, and Rick Siow Mong Goh. A system-level
simulator for rram-based neuromorphic computing chips. ACM Transactions on
Architecture and Code Optimization (TACO), 15(4):64, 2019.
[11] Adarsha Balaji, Anup Das, Yuefeng Wu, Khanh Huynh, Francesco G Dell’Anna,
Giacomo Indiveri, Jeffrey L Krichmar, Nikil D Dutt, Siebren Schaafsma, and
Francky Catthoor. Mapping spiking neural networks to neuromorphic
hardware. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
[12] Arnon Amir, Pallab Datta, William P Risk, Andrew S Cassidy, Jeffrey A Kusnitz,
Steve K Esser, Alexander Andreopoulos, Theodore M Wong, Myron Flickner,
Rodrigo Alvarez-Icaza, et al. Cognitive computing programming paradigm: a
corelet language for composing networks of neurosynaptic cores. In The 2013
International Joint Conference on Neural Networks (IJCNN), pages 1–10. IEEE,
2013.
[13] Qiangfei Xia and J Joshua Yang. Memristive crossbar arrays for brain-inspired
computing. Nature materials, 18(4):309–323, 2019.
[14] Ting-Shuo Chou, Hirak J Kashyap, Jinwei Xing, Stanislav Listopad, Emily L
Rounds, Michael Beyeler, Nikil Dutt, and Jeffrey L Krichmar. Carlsim 4: an
open source library for large scale, biologically detailed spiking neural network
simulation using heterogeneous clusters. In 2018 International Joint Conference
on Neural Networks (IJCNN), pages 1–8. IEEE, 2018.
[15] Marc-Oliver Gewaltig and Markus Diesmann. Nest (neural simulation tool).
Scholarpedia, 2(4):1430, 2007.
[16] George Karypis and Vipin Kumar. Multilevel k-way partitioning scheme for
irregular graphs. Journal of Parallel and Distributed computing, 48(1):96–129, 1998.
[17] James Kennedy. Particle swarm optimization. Encyclopedia of machine learning,
pages 760–766, 2010.
[18] Brian W Kernighan and Shen Lin. An efficient heuristic procedure for
partitioning graphs. Bell system technical journal, 49(2):291–307, 1970.
[19] Hyung Gyu Lee, Naehyuck Chang, Umit Y Ogras, and Radu Marculescu. On-
chip communication architecture exploration: A quantitative evaluation of
point-to-point, bus, and network-on-chip approaches. ACM Transactions on De-
sign Automation of Electronic Systems (TODAES), 12(3):23, 2007.
[20] Vincenzo Catania, Andrea Mineo, Salvatore Monteleone, Maurizio Palesi, and
Davide Patti. Improving energy efficiency in wireless network-on-chip
architectures. ACM Journal on Emerging Technologies in Computing Systems (JETC),
14(1):9, 2018.
