Multi-casting mesh AER: A scalable assembly approach for reconfigurable neuromorphic structured AER systems. Application to ConvNets by Zamarreño Ramos, Carlos et al.
IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS, VOL. XX, NO. XX, XXXX 1
Multi-Casting Mesh AER: A Scalable Assembly
Approach for Reconfigurable Neuromorphic
Structured AER Systems. Application to ConvNets
C. Zamarreño-Ramos, A. Linares-Barranco, T. Serrano-Gotarredona, and B. Linares-Barranco
Abstract— This paper presents a modular, scalable approach
to assembling hierarchically structured neuromorphic AER (Ad-
dress Event Representation) systems. The method consists of
arranging modules in a 2D mesh, each communicating bidi-
rectionally with all four neighbors. Address events include a
module label. Each module includes an AER router which
decides how to route address events. Two routing approaches
have been proposed, analyzed and tested, using either destination
or source module labels. Our analyses reveal that depending
on traffic conditions and network topologies either one or the
other approach may result in better performance. Experimental
results are given after testing the approach using high-end Virtex-
6 FPGAs. The approach is proposed for both single and multiple
FPGAs, in which case a special bidirectional parallel-serial AER
link with flow control is exploited, using the FPGA Rocket-
I/O interfaces. Extensive test results are provided exploiting
convolution modules of 64 × 64 pixels with kernels with sizes
up to 11 × 11, which process real sensory data from a DVS
(Dynamic Vision Sensor) retina. One single Virtex-6 FPGA can
hold up to 64 of these convolution modules, which is equivalent to
a neural network with 262 × 10^3 neurons and almost 32 million
synapses.
I. INTRODUCTION
AER (Address-Event-Representation) is now a popular “virtual wiring”
technique for interconnecting spiking neuromorphic systems [1]–[33].
The high speed available for digital inter-chip communications is
exploited in AER to time-multiplex numerous synaptic connections
between neurons, which only need to be active during a spike (also
called event) transmission. In AER, whenever a spiking neuron in a
chip (or module; see footnote 1) generates a spike, its “address” (or
any given ID) is written on a high-speed digital bus and sent to the
receiving neuron(s) in one (or more) receiver module(s). In general,
AER processing modules require at least one AER input port
and one AER output port. As neuromorphic systems scale up
in size, complexity, and functionality, researchers have been
developing more complex and smarter AER “variations” to
maintain the efficiency, reconfigurability and reliability of the
ever growing target systems they want to build.
Authors are with the Instituto de Microelectrónica de Sevilla (IMSE-CNM-CSIC).
Av. Américo Vespucio s/n, 41092 Sevilla, Spain. Copyright (c) 2011
IEEE. Personal use of this material is permitted. However, permission to use
this material for any other purposes must be obtained from the IEEE by
sending an email to pubs-permissions@ieee.org.
Footnote 1: Throughout the paper we will be referring to generic AER modules, where
a module can be one chip, several chips grouped into a PCB, several PCBs, or
even a part of a chip or FPGA. One chip/FPGA could therefore hold several
AER modules as well as AER buses.
In the following Section we review a set of approaches
for large scale reconfigurable AER systems that have been
proposed by different researchers, ranging from the sharing
of a single AER bus by all AER modules, to the use of
multiple independent buses or mesh type arrangements of
modules which exploit communication techniques from the
NoC (Network-on-Chip) research community.
II. REVIEW OF AER APPROACHES FOR
LARGE SCALE SYSTEMS
Table I summarizes a set of AER assembly techniques
proposed in literature for tackling the growth of AER based
processing systems. The simplest form of a generic AER
concept for use in a large scale multi-module spiking neu-
romorphic system is illustrated in Fig. 1(a). Let us call it
“Flat-AER”. Each module can contain, for example, an array
of neurons. Each neuron is assigned a unique global address,
which identifies the module it belongs to and its position inside
the module. This way, the address space of all modules’ input
and output AER ports is the same. All modules share a single
external AER bus [3]–[9]. Connectivity between neurons is
configurable and set by a look-up-table in the external pro-
grammable “Mapper”. Multi fan-out can be programmed in the
mapper by repeating multiple destination addresses for each
incoming address. Similarly, synaptic weighting can also be
implemented by programming destination address repetitions.
However, event repetition (for either fan-out or weighting)
severely penalizes the AER bus communication bandwidth.
To overcome this, some reported neuron chips allow for
an additional synaptic weight parameter to be programmed
into the mapper together with the event address [7], [10].
Alternatively, some other reported neuron chips include a
built-in mechanism which will implement a given fan-out and
synaptic weighting from a single input event (as in pre-wired
diffusive networks [11], [12] or more elaborate computational
hardware [13], [18], [22], [24]). In this case, the mapper would
only need to repeat an event if it is destined for neurons
belonging to different modules. Normally, event addresses
represent neurons. However, in some reported neuron chips
that include a number of physical synapses per neuron [6],
the input address can represent one specific synaptic input. In
this case, the address spaces at the mapper input and output
would be different, as they represent different elements.
Flat-AER is simple and easy to build, configure, and use. It
requires a mapper memory with as many positions as there are
neurons in the system. However, the main limitation of flat-
AER is its communication bandwidth. Since every single event
produced by any neuron has to travel through the single AER
bus and Mapper, the system’s maximum total event traffic is
limited by the bus bandwidth. If Ntot is the total number of
neurons, fn is the mean spike rate per neuron and Fout the
average fan-out per neuron (spike repetitions introduced by
the mapper to emulate the projection fields and/or synaptic
weighting), the event arrival rate λ at the Mapper output
channel is [34] $\lambda = N_{tot} f_n F_{out}$. If $E_{flat}$ is the physical
channel bandwidth, then the channel service rate is $\mu = E_{flat}$
and the average time an event waits to be serviced is (assuming
an M/M/1 queue model [35]) $\bar{t}_q = 1/(\mu - \lambda)$. Consequently,
the absolute maximum communication bandwidth (number of
events per unit time) Ev this approach can handle is obtained
when $\lambda = E_{flat}$ and is
$$Ev_{max} = \left. N_{tot} f_n \right|_{max} = \frac{E_{flat}}{F_{out}} \qquad (1)$$
Given $Ev_{max}$, it is possible to estimate the maximum allowable
number of neurons for such a communication architecture:
$$N_{tot\,max} = \frac{Ev_{max}}{f_n} \qquad (2)$$
Note that, in general and especially for rate-encoded weights,
Fout = nRF × nW (where nRF is the projection field size and nW is
the synaptic weight dynamic range) can become significantly
large. For example, if the projection field has a size of nRF =
11 × 11 neurons and weights can have integer values ranging
from 1 to 32 (nW = 32), then Fout = 11 × 11 × 32 = 3872.
Reported AER-bus bandwidths are presently below
100 Meps (mega events per second) for point-to-point links
[3]–[7], [11]–[13], [18], [22], [24]–[28], [36]–[38], although
cases have been reported of high density channels and multiplexing
techniques being used to achieve higher event rates
[8], [33]. Flat-AER therefore allows for a total communication
bandwidth in the range of 10^8 eps down to 10^4 eps, depending
on fan-out.
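As a rough numerical illustration of eqs. (1) and (2), the following sketch evaluates the Flat-AER limits for the example figures quoted above (a 100 Meps channel, an 11 × 11 projection field and 32 weight levels). The function name and the 10 Hz mean spike rate are assumptions introduced only for this illustration; they are not values taken from the reported systems.

```python
def flat_aer_limits(e_flat_eps, n_rf, n_w, f_n_hz):
    """Evaluate eqs. (1) and (2) for a Flat-AER system.

    e_flat_eps: physical channel bandwidth E_flat, in events per second
    n_rf:       projection field size (number of destination neurons)
    n_w:        synaptic weight dynamic range (rate-encoded repetitions)
    f_n_hz:     mean spike rate per neuron (assumed value, illustration only)
    """
    f_out = n_rf * n_w            # fan-out per neuron, F_out = n_RF * n_W
    ev_max = e_flat_eps / f_out   # eq. (1): Ev_max = E_flat / F_out
    n_tot_max = ev_max / f_n_hz   # eq. (2): N_tot_max = Ev_max / f_n
    return f_out, ev_max, n_tot_max


f_out, ev_max, n_tot_max = flat_aer_limits(100e6, 11 * 11, 32, 10.0)
print(f"F_out = {f_out}, Ev_max = {ev_max:.0f} eps, N_tot_max = {n_tot_max:.0f}")
```

With the worst-case fan-out of 3872, the usable event rate falls to the lower end of the range quoted above, a few times 10^4 eps.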
Having several modules sharing the same physical lines
degrades speed proportionally to the number of modules [26].
This can be overcome by using Broadcast-Mesh-AER.
“Broadcast-Mesh-AER” uses multiple point-to-point AER
buses [25]–[27]. Fig. 1(b) shows its corresponding 1-D ver-
sion. Each neuron in each module also has a unique global
flat address. Each module has an AER input event path and
an AER output event path, each with an AER input port and an
AER output port. AER input events received at the input AERi
port are sent to the module neuron array but are also passed
through to the next module, via output port AERi’. This way,
input events “hop” from module to module via independent
AER point-to-point links. Consequently, the speed at each
AER link is optimum and events are copied more efficiently
in a pipeline fashion. Output events generated in each module
also “hop” from module to module through AERo and AERo’
ports until they reach the Mapper. In this scheme all input
events coming from the Mapper are broadcast to all modules,
and each module checks if the event is destined to the local
neural array [25], [26]. However, overall network connectivity
information is contained in the global mapper, and can be
totally reconfigured by reprogramming the mapper (as in the
Flat-AER approach).
One major claim of this approach is that the channel bandwidth
of each point-to-point link Epp improves proportionally
with the number of chips Nch in the network, with respect
to the Flat-AER case where Nch chips share the same AER
bus (see footnote 2) as in Fig. 1(a). The channel bandwidth of a
point-to-point link Epp is constant, while that of a multiple fan-out
link Eflat degrades with the number of destination chips Nch.
More precisely, for typical PCBs, the bandwidth improvement
of a point-to-point link is
$$E_{pp} \approx 2 (N_{ch} - 1) E_{flat} \qquad (3)$$
The network now has Nch point-to-point links, which allows
for a total communication bandwidth of Epp × Nch. However,
each event has to be copied to each link, so the maximum
event rate is
$$Ev_{max} = \frac{E_{pp} N_{ch}}{F_{out} N_{ch}} = \frac{E_{pp}}{F_{out}} \qquad (4)$$
Bandwidth is thus improved because Epp exceeds Eflat by a
factor proportional to the number of chips [26].
Both Flat-AER and Broadcast-Mesh-AER use a common
global flat address space and, in principle, allow for any
arbitrary interconnect topology. However, practical neuromor-
phic systems have a pre-established hierarchical structure,
depending on the functionality they implement. This has been
exploited by other researchers to assemble scalable multi-
module systems with independent AER-links, where each link
is a physical plugged-in point-to-point bus-wire [28]. This is
illustrated in Fig. 1(c), where AER splitters (blocks labeled
“S” in Fig. 1(c)) and mergers (blocks labeled “M” in Fig.
1(c)) are also used for branching or de-branching links. A
splitter block receives one input AER channel and replicates
the traffic for n different output channels, while a merger block
multiplexes n input AER channels into a single output channel.
All physical links are point-to-point. In this “Pre-Structured
AER” approach the address space is local to the neurons
writing to or reading from an AER link. Optionally, local
mappers can be inserted in a link to adapt address spaces from
an output to an input (for example, to perform subsampling,
address rotations, bit reallocations, etc.). In this approach no
global Mapper is required, as the connectivity is pre-wired, and
events do not need to travel through all the links. The number
of links scales with the number of modules, so communication
bandwidth saturation is much less likely to occur as systems
scale up. However, system reconfiguration is laborious as it
has to be done manually by re-plugging bus-wires, splitters,
mergers, mappers, and processing modules.
In this case each point-to-point channel receives events from
only a small fraction of the modules/chips. In general, we can
define an effective number of independent channels Meff as
a fraction of the total number of chip modules Meff = αNch.
The network maximum communication bandwidth would then
be
Footnote 2: For a more detailed explanation see [26].
TABLE I
MULTI-MODULE AER ADDRESSING SCHEMES
Columns: (1) Flat AER [7]; (2) Broadcast Mesh AER [25]–[27]; (3) Pre-Structured AER [28]; (4) Hierarchical Fractal AER [29]; (5) Router Mesh AER [30], [31]; (6) Cross-Point Interc. [32]; (7) Multi-Casting Mesh AER

Addressing                                      | (1)  | (2)  | (3)   | (4)   | (5)  | (6)   | (7)
Address space                                   | flat | flat | local | local | flat | local | flat
Broadcast to all modules                        | Yes  | Yes  | No    | No    | No   | No    | No
Local Routing Tables to define global network   | No   | No   | No    | Yes   | Yes  | Yes   | Yes
Single global mapper                            | Yes  | Yes  | No    | No    | No   | No    | No

Module Props.                                   | (1)  | (2)  | (3)   | (4)   | (5)  | (6)   | (7)
isolated neurons                                | Yes  | No   | No    | Yes   | No   | No    | No
neurons with synapses                           | No   | Yes  | Yes   | No    | Yes  | Yes   | Yes
events with synaptic weighting                  | Yes  | No   | No    | Yes   | No   | No    | No
projection fields with synaptic weighting       | No   | Yes  | Yes   | No    | Yes  | Yes   | Yes
physical synapses                               | No   | Yes  | No    | Yes   | No   | Yes   | No
[Fig. 1 graphic: panels (a) to (e). Blocks shown include Mapper, Arbiter, Decoder, Router, L1-mapper and L2-mapper (with L1 and L2 sections), splitters (S) and mergers (M). Links are annotated with Eflat, Fout and Epp. Panel (b) labels the ports AERi, AER'i, AERo and AER'o. Panel (c) shows modules 1 to 7 with event streams Ev1 to Ev7 merging into Ev2+Ev4, Ev2+Ev3+Ev7, Ev3+Ev4+Ev7 and Ev5+Ev6.]
Fig. 1. Illustration of different multi-module AER assembly options: (a) Flat-AER, (b) Broadcast-Mesh-AER, (c) Pre-Structured-AER, (d) Hierarchical-Fractal-AER, (e) Router-Mesh-AER.
$$Ev_{max} = M_{eff} E_{pp} = \alpha N_{ch} E_{pp} \qquad (5)$$
Parameter Fout does not appear anywhere, since the projec-
tion fields and synaptic weighting are now implemented inside
each event-processing module. As an illustrative example, Fig.
1(c) has Nch = 7 modules, each generating an output event
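rate Evi. The distribution of splitters and mergers determines
potential bottlenecks as
$$\begin{aligned}
Ev_{1max} &\le E_{pp}, \qquad Ev_{2max} + Ev_{4max} \le E_{pp} \\
Ev_{2max} + Ev_{3max} + Ev_{7max} &\le E_{pp} \\
Ev_{5max} + Ev_{6max} &\le E_{pp} \\
Ev_{3max} + Ev_{4max} + Ev_{7max} &\le E_{pp}
\end{aligned} \qquad (6)$$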
Under these constraints, the absolute maximum capacity of
this particular network can be obtained by applying constrained
optimization to eqs. (6), resulting in
$$Ev_{max} = \sum_i Ev_{i\,max} = \left(1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \frac{1}{4}\right) E_{pp} = 3.5\,E_{pp} \qquad (7)$$
Thus, in this case Meff = 3.5 and α = Meff/Nch = 0.5.
In this approach the optimum point-to-point bandwidth Epp
is therefore further improved, proportionally to the number of
chips (by a factor αNch = Meff = 3.5, in this case).
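The constrained optimization over eqs. (6) can be explored numerically. The following is a minimal sketch, assuming rates quantized in steps of Epp/4 and an exhaustive search over all combinations; it is a coarse grid search rather than a general linear-programming solver, and the function name is hypothetical.

```python
from itertools import product


def max_total_rate(steps=4):
    """Grid-search the largest total event rate allowed by eqs. (6).

    Rates are expressed as integer multiples of E_pp / steps, so a link
    constraint "sum <= E_pp" becomes "sum <= steps". The result is
    returned in units of E_pp.
    """
    best = 0
    for ev1, ev2, ev3, ev4, ev5, ev6, ev7 in product(range(steps + 1), repeat=7):
        feasible = (
            ev1 <= steps
            and ev2 + ev4 <= steps
            and ev2 + ev3 + ev7 <= steps
            and ev5 + ev6 <= steps
            and ev3 + ev4 + ev7 <= steps
        )
        if feasible:
            best = max(best, ev1 + ev2 + ev3 + ev4 + ev5 + ev6 + ev7)
    return best / steps


m_eff = max_total_rate()
print(f"M_eff = {m_eff}, alpha = {m_eff / 7:.2f}")
```

The search visits 5^7 candidate rate vectors, which is small enough to run instantly; a finer quantization would call for a proper linear-programming solver instead.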
Joshi et al. recently suggested a “Hierarchical-Fractal-
AER” approach [29], illustrated in Fig. 1(d), which extends
the basic Flat-AER concept of Fig. 1(a) in a hierarchical
fashion. It exploits the assumption that nearby neurons are
more heavily interconnected than more distant ones. Address
space is expanded as events need to climb up in the hierarchy.
This way, more intense local traffic is transferred very fast
in parallel at the numerous lower level modules, while longer
range but sparser traffic needs to traverse levels of hierarchy
and is slower. Disregarding the traffic at the higher hierarchies,
if ML1 is the number of lowest level L1 parallel sections and
NL1 the number of chips per L1 section (Nch = ML1×NL1),
then the absolute maximum network bandwidth would be
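$$Ev_{max} = M_{L1} \frac{E_{hier}}{F_{out}} \qquad (8)$$
with $E_{pp} \approx 2\left((N_{ch}/M_{L1}) - 1\right) E_{hier}$. This way, maximum
network bandwidth can be expressed as
$$Ev_{max} \approx \frac{1}{2} \frac{N_{ch}}{N_{L1}^{2}} \frac{E_{pp}}{F_{out}} = \frac{1}{2} \frac{M_{L1}}{N_{L1}} \frac{E_{pp}}{F_{out}} \qquad (9)$$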
Depending on the ratio ML1/NL1 a considerable improvement
can be achieved with respect to Flat-AER.
Another approach, illustrated in Fig. 1(e), is what we call
here “Router-Mesh-AER” [30]. Here again neurons have a
global flat address which identifies their module and their
address within the module, but there is no external Mapper
through which all events have to pass. Instead, the mapping ta-
ble is contained within a “Router” in each module. The router
also decides the ports through which an event is sent to reach
its destination module. Events are therefore not broadcast to all
modules, but optimum “hop” paths are established in a multi-
cast fashion. The main problem is that a very large mapping
table needs to be programmed in each module. However,
to simplify these tables, optimizations can be computed for
each module router depending on the system topology, and
default routing paths can be established for event addresses
not listed in the tables [30]. This mesh approach has been
traditionally used in NoC (Network on Chip) [39] topologies
to assemble high performance multi-core processor systems
for high performance computing applications [40]–[42]. The
maximum network bandwidth for mesh type approaches will
be computed in the next Section.
Another approach currently being developed, but at wafer
scale [32], exploits massive programmable cross-point in-
terconnects to reconfigure the network topology. We have
included this method in Table I for comparison, although
we are focusing more on multi-chip systems. Nonetheless,
this wafer scale approach also includes off-wafer re-routing
(and event re-timing) procedures for longer range and delay-
controlled interconnects [33].
In all of the previously listed methods that use a global
flat address space, each event includes a module ID that
corresponds to the module where the event was generated.
Let us call this “Source-driven” coding. However, it is also
possible to label the event with the ID of the destination
module instead. Let us call this “Destination-driven” coding.
In this paper we will consider a type of Pre-Structured-AER
approach. However, instead of having manually pluggable
links, modules are arranged in a 2D-Mesh while inter-module
links are configured through in-module routers. We call this
“Multi-Casting-Mesh-AER”. The approach is similar to the
“Router-Mesh-AER” except that the module router tables
only contain information on the module-to-module links, in-
stead of the full inter-neuron connectivity. We will analyze
both ‘source-driven’ and ‘destination-driven’ codings and will
show experimental results from example vision processing
systems implemented on FPGA prototyping boards, based on
Convolutional Neural Networks (ConvNets) [43]–[47].
Fig. 2 shows the 2D network topology we used for Multi-
Casting-Mesh-AER. The modules communicate bidirectionally
and orthogonally with their neighbors through point-to-point
AER links. The number of links in the network (and, thus,
the network bandwidth) depends on the number of links per
module. For the case shown in Fig. 2, each node has 4 links.
If the mesh has $N_{ch} = M_1 M_2$ modules, the total number
of inter-module links is (see footnote 3) $N_l = 2[(M_1 - 1) M_2 + (M_2 - 1) M_1]
= 4 M_1 M_2 - 2 (M_1 + M_2) \approx 4 N_{ch}$. Each event needs
to hop through a number of links to travel from its source
module to its destination module. Let us call nh the average
number of hops per traveling event. This average number nh
Footnote 3: For the specific Router-Mesh-AER shown in Fig. 1(e) there are 6 inter-module
links per module, thus $N_l = 2[(M_1 - 1) M_2 + (M_2 - 1) M_1 + (M_1 - 1)(M_2 - 1)]
= 6 M_1 M_2 - 4 (M_1 + M_2) + 2 \approx 6 N_{ch}$. In a 2D mesh, up to 8 links per
module are possible. In a 3D mesh, up to 26 links per module are possible.
is application specific, but would usually be on the order of
a fraction of M1 or M2. The network absolute maximum
bandwidth is
$$Ev_{max} = \frac{N_l}{n_h} \frac{E_{pp}}{F_{Mout}} \qquad (10)$$
where FMout is the module fan-out, representing the number
of routing paths a single event is sent through in order to reach
all its destination nodes.
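The link count and eq. (10) can be cross-checked by direct enumeration. The sketch below counts one link per node and per existing neighbor in a 4-neighbor mesh (so each bidirectional connection contributes two links, as in the expression for Nl) and then applies eq. (10). Function names and the example values of Epp, nh and FMout are illustrative assumptions.

```python
def count_mesh_links(m1, m2):
    """Count unidirectional links in an m1 x m2 mesh with 4 neighbors per node."""
    links = 0
    for x in range(m1):
        for y in range(m2):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                if 0 <= x + dx < m1 and 0 <= y + dy < m2:
                    links += 1  # one outgoing link towards this neighbor
    return links


def mesh_ev_max(m1, m2, e_pp, n_h, f_mout):
    """Eq. (10): Ev_max = (N_l / n_h) * (E_pp / F_Mout)."""
    return count_mesh_links(m1, m2) / n_h * e_pp / f_mout


m1, m2 = 4, 4
n_l = count_mesh_links(m1, m2)
# Closed form from the text: N_l = 4*M1*M2 - 2*(M1 + M2)
print(n_l, 4 * m1 * m2 - 2 * (m1 + m2))
print(mesh_ev_max(m1, m2, e_pp=100e6, n_h=2.0, f_mout=1.0))
```

For small meshes the boundary term 2(M1 + M2) is not negligible, so the approximation Nl ≈ 4Nch overestimates the link count noticeably (48 links versus 64 for the 4 × 4 mesh of Fig. 2).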
Table II summarizes how the absolute maximum communi-
cation bandwidth Evmax scales with number of chips Nch
for the different multi-chip approaches. From eq. (2), the
maximum allowable number of neurons Ntot max would also
scale proportionally to Evmax (for a fixed fn). We can see
that for Flat-AER, Evmax (and Ntot max) scales down with
Nch, while for Broadcast-Mesh-AER, Evmax (and Ntot max)
stays constant. Hierarchical-Fractal-AER can be made to scale
efficiently depending on the relative choice of ML1 and
NL1. For example, if NL1 is fixed, then Evmax would scale
linearly with Nch. All three approaches (Flat, Broadcast-
Mesh, Hierarchical-Fractal) are penalized by neuron fan-out
Fout. Pre-Structured-AER is the most efficient, as it scales
linearly with Nch and has no fan-out penalty. However, it is
not practical from a reconfigurability point of view. The 2D
mesh-based approaches scale linearly with Nch, although the term
nh might have a (square-root or log-type) dependence on Nch
depending on the specific application. They are also penalized by
the module fan-out FMout, but this penalty is usually relatively small.
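The scaling trends discussed above can be tabulated with a short script that evaluates the Evmax expressions of Table II for two system sizes. This is only a sketch: the parameter values (α, NL1, nh, FMout) are arbitrary illustrative assumptions, and nh is held constant even though, as noted above, it may grow with Nch in a real application.

```python
def ev_max_by_scheme(n_ch, e_pp, f_out, alpha, n_l1, n_h, f_mout):
    """Evaluate the Ev_max expressions summarized in Table II."""
    return {
        "Flat AER": e_pp / (2 * n_ch * f_out),
        "Broadcast Mesh AER": e_pp / f_out,
        "Pre-Structured AER": alpha * n_ch * e_pp,
        "Hierarchical Fractal AER": n_ch * e_pp / (2 * n_l1 ** 2 * f_out),
        "Router Mesh AER": 6 * n_ch * e_pp / (n_h * f_mout),
        "Multi-Casting Mesh AER": 4 * n_ch * e_pp / (n_h * f_mout),
    }


# Illustrative (assumed) parameters shared by both system sizes.
params = dict(e_pp=100e6, f_out=3872, alpha=0.5, n_l1=4, n_h=2.0, f_mout=2.0)
small = ev_max_by_scheme(n_ch=16, **params)
large = ev_max_by_scheme(n_ch=64, **params)

for scheme in small:
    ratio = large[scheme] / small[scheme]
    print(f"{scheme:26s} Ev_max(16) = {small[scheme]:.3g} eps, x{ratio:.2f} at 64 chips")
```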
III. ROUTING IN MULTI-CASTING-MESH-AER
In Fig. 2, each module in the mesh is identified by a 2D
index (xNODE, yNODE). From now on let us call each module
in this 2D mesh an AER-node. The internal structure of an
AER-node is shown in Fig. 3. It contains a Router, a local
Event Processor (or neuron/synapse array), and a Configura-
tion Processor (to set configurable parameters in the Event
Processor or Router). The Router receives external events
from the four neighbors and, based on its programmed routing
tables, decides whether to send them to the local processors
or to other neighbors. For events generated by the internal
Event Processor, the Router adds the corresponding node
index and sends them through the programmed ports. Thus,
the Router introduces a network layer between the processing
units’ logic layer and the physical layer implementation. A
heading bit distinguishes between configuration commands to
be handled by the Configuration Processor, and data events
to be handled by the Event Processor. The Configuration
Processor can also receive commands through an SPI (Serial
Peripheral Interface) connection. Other heading information
identifies the node 2D index (xNODE, yNODE) coded in the
event. This index identifies either the source node sending the
event to the mesh (in the case of a Source-Driven addressing
scheme), or the destination node in the mesh to which the event
is being sent (in the case of a Destination-Driven addressing
scheme). Each addressing mode has pros and cons which are
analyzed throughout the paper. For both cases, Fig. 4 shows
the proposed 32-bit event format containing two fields:
[Fig. 2 graphic: 4 × 4 mesh of AER nodes labeled (1,1) to (4,4), each connected to its neighbors through interface blocks, with an AER retina attached to the mesh.]
Fig. 2. 2D Network Topology for Multi-Casting-Mesh-AER. Each node is
identified by the address field (xNODE, yNODE).
[Fig. 3 graphic: AER node with a Router connected to North, South, East and West ports, an Event processor, and a Configuration processor with an SPI connection.]
Fig. 3. Example of AER node for a multi-node multi-link AER system.
[Fig. 4 graphic: (a) Event frame: routing header (leading bit 0, xADD address, yADD address) followed by AER data (AER address). (b) Configuration frame: routing header (leading bit 1, xADD address, yADD address) followed by config data (Checksum, Command ID, Command Pars).]
Fig. 4. Events format with headers added for routing purposes.
TABLE II
SCALING OF COMMUNICATION BANDWIDTHS IN MULTI-CHIP AER SCHEMES

Scheme                          | Evmax
Flat AER [7]                    | $E_{pp} / (2 N_{ch} F_{out})$
Broadcast Mesh AER [25]–[27]    | $E_{pp} / F_{out}$
Pre-Structured AER [28]         | $\alpha N_{ch} E_{pp}$
Hierarchical Fractal AER [29]   | $N_{ch} E_{pp} / (2 N_{L1}^{2} F_{out})$
Router Mesh AER [30]            | $6 N_{ch} E_{pp} / (n_h F_{Mout})$
Pre-Struc. Mesh AER             | $4 N_{ch} E_{pp} / (n_h F_{Mout})$
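[Fig. 5 graphic: flowchart. The router first tests xADD = xNODE. If the test fails, the event goes to the East or West link according to the test xADD > xNODE. If it succeeds, the router tests yADD = yNODE: on success the event goes to the local processor, otherwise it goes to the North or South link according to the test yADD > yNODE.]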
Fig. 5. Destination-driven routing algorithm for handling incoming events.
(xNODE, yNODE) is the local node address and (xADD, yADD) is the event
destination address.
• Routing header: the most significant bit distinguishes
between a data event (first bit is ‘0’) and a configuration
command (first bit is ‘1’). The next 8 bits code
the destination or source node ID. Coordinates xADD and
yADD are represented using 4 bits each.
• Upper layer data: the remaining 23 bits contain the
event/command data. If it is a configuration command,
it contains a command description, for example, a check-
sum, a command identifier and command parameters.
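The two fields listed above can be illustrated with a small packing and unpacking sketch for the 32-bit word. The bit ordering inside the 8-bit node ID (xADD in the upper four bits, yADD in the lower four) and the function names are assumptions made for illustration; the text only fixes the field widths (1 + 8 + 23 bits).

```python
PAYLOAD_BITS = 23
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1
Y_SHIFT = PAYLOAD_BITS          # bits 23..26 (assumed position of yADD)
X_SHIFT = PAYLOAD_BITS + 4      # bits 27..30 (assumed position of xADD)
TYPE_SHIFT = 31                 # most significant bit: 0 = event, 1 = config


def pack_frame(is_config, x_add, y_add, payload):
    """Build a 32-bit frame from the routing header and upper layer data."""
    if not (0 <= x_add < 16 and 0 <= y_add < 16):
        raise ValueError("node coordinates must fit in 4 bits each")
    if not 0 <= payload <= PAYLOAD_MASK:
        raise ValueError("payload must fit in 23 bits")
    return (int(is_config) << TYPE_SHIFT) | (x_add << X_SHIFT) | (y_add << Y_SHIFT) | payload


def unpack_frame(word):
    """Split a 32-bit frame into (is_config, x_add, y_add, payload)."""
    return (
        bool((word >> TYPE_SHIFT) & 1),
        (word >> X_SHIFT) & 0xF,
        (word >> Y_SHIFT) & 0xF,
        word & PAYLOAD_MASK,
    )


frame = pack_frame(False, x_add=3, y_add=2, payload=0x1234)
print(hex(frame), unpack_frame(frame))
```

Four bits per coordinate limit the mesh to 16 × 16 nodes, which is the same limit the range check above enforces.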
A. Destination-Driven Routing Algorithm
In this algorithm, the destination node address is written
in the routing header. When the event arrives at a network
node, the router analyzes the addressing header and decides
the output port to which the event is to be forwarded. If the
destination address corresponds to the node address, the event
is sent to the local processor. If this is not the case, the event
is routed in accordance with the algorithm represented in Fig.
5: it compares the event destination address xADD and yADD
with the present node address xNODE and yNODE to decide
the output port to which the event is to be forwarded. Using the
geographical information contained in the destination address,
the event is routed to the neighbor node with the shortest path
to the destination in terms of the number of hops. As the router
only has to compare two 4-bit digital words, the hardware
required can be very simple and the routing operation can be
performed on the fly. The algorithm in Fig. 5 gives priority to
xADD. This is called dimension-ordered routing and tends to
concentrate the traffic in one dimension (horizontal, in this
case). To avoid this, routers which give priority to yADD
can be alternated with those prioritizing xADD, balancing the
situation. However, this may lead to deadlock situations [48]
and should be analyzed carefully for each case. On the other
hand, dimension-ordered routing (with bi-directional links) is
known to be deadlock-free [49], [50].
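The decision procedure of Fig. 5 can be sketched in a few lines. The version below gives priority to xADD (dimension-ordered routing) and traces the sequence of nodes an event visits. The mesh orientation (x increasing towards East, y increasing towards North) and all function names are assumptions made for illustration.

```python
def route_destination_driven(x_add, y_add, x_node, y_node):
    """Return the output port chosen by a node for an incoming event."""
    if x_add != x_node:                       # x dimension is resolved first
        return "EAST" if x_add > x_node else "WEST"
    if y_add != y_node:                       # then the y dimension
        return "NORTH" if y_add > y_node else "SOUTH"
    return "LOCAL"                            # event has reached its destination


# Assumed orientation: East = +x, North = +y.
STEP = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}


def trace_route(src, dst):
    """List the nodes visited by an event travelling from src to dst."""
    x, y = src
    path = [(x, y)]
    while True:
        port = route_destination_driven(dst[0], dst[1], x, y)
        if port == "LOCAL":
            return path
        dx, dy = STEP[port]
        x, y = x + dx, y + dy
        path.append((x, y))


path = trace_route((1, 1), (3, 2))
print(path, "hops:", len(path) - 1)
```

Each hop reduces the distance to the destination by one, so the number of hops equals the Manhattan distance between source and destination nodes.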
Fig. 6. Routing table for cloning output events in the destination-driven
algorithm. Left: Example logic diagram (schematics) showing the logic
(virtual) connections between a source module AER1 and three destination
nodes AER2–AER4. Right: Routing table showing the output event header to
be added (xOUT, yOUT) and the port through which it is to be sent. VCi
(i=1,2,3) represents a virtual connection between the source node AER1 and
the destination node AERi+1. (xAERi, yAERi) (i=2,3,4) is the network
address of node AERi.
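[Fig. 7 graphic: flowchart. The router waits for new output events while the buffer is empty. When there are events in the buffer, it adds the event header (xOUT, yOUT), sends the event to the 'Out Port', and checks for the end of the routing table: on NO it repeats with the next entry, on YES it returns to waiting.]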
Fig. 7. Output event management in the destination-driven routing algorithm.
Values for “xOUT”, “yOUT” and “Out Port” are read from the routing table,
as in the Fig. 6 example.
Besides managing the traffic coming from the neighboring
chips, the router also inserts headers for the new events created
in the local processor. The local router clones each newly
created event once per destination node (or virtual connection
VCi) of that event. For each clone, or virtual connection VCi,
the router adds the destination node address and sends it to one
of the local output ports. This is organized in a routing table which is
read for every new event. Every entry in this table has an
output destination address and an output port that transmits the
event. The routing table organization is illustrated in Fig. 6, for
the case of one node AER1 having three virtual connections
VC1–VC3 to three other nodes AER2–AER4. Fig. 7 shows
the flow diagram for the output event management in the
destination-driven routing algorithm.
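The output event management just described can be sketched as a loop over a buffer of locally generated events and a routing table with one entry per virtual connection. The node addresses, port names and payload values below are hypothetical placeholders, not the configuration of Fig. 6.

```python
from collections import deque, namedtuple

# One routing-table entry per virtual connection: header to add and output port.
RouteEntry = namedtuple("RouteEntry", ["x_out", "y_out", "out_port"])


def manage_output_events(buffer, routing_table):
    """Clone each buffered event once per routing-table entry.

    Returns a list of (out_port, (x_out, y_out), payload) tuples in the
    order in which the clones would be sent.
    """
    sent = []
    while buffer:                        # events waiting in the buffer
        payload = buffer.popleft()
        for entry in routing_table:      # walk the table until its end
            header = (entry.x_out, entry.y_out)
            sent.append((entry.out_port, header, payload))
    return sent


# Hypothetical table for a source node with three virtual connections.
table = [
    RouteEntry(2, 1, "EAST"),
    RouteEntry(1, 2, "NORTH"),
    RouteEntry(2, 2, "EAST"),
]
clones = manage_output_events(deque([0x10, 0x11]), table)
print(len(clones), clones[0])
```

Every local event leaves the source node as many times as there are table entries, which is the cost that the module fan-out FMout captures in eq. (10).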
[Fig. 8 graphic: flowchart. The router waits for new input events. On a new event it accesses the routing table, reading from memory the action word stored for the source address (xS, yS). It then checks whether the event goes to the local processor, the west link, the east link, the north link and the south link, forwarding the event for each YES.]
Fig. 8. Source-driven routing algorithm.
B. Source-Driven Routing Algorithm
In the source-driven option, the router receives information
about the source address that generated the event. Hence, all
the nodes must store information about all the possible source
addresses that they can receive. When a new event is received,
the router searches for the source address in a local user
configurable connection memory. This memory codes all the
operations that should be performed when a source address is
received: forwarding the event to one or more output ports,
routing it to the local processor, or both at the same time.
The input routing algorithm for the source-driven solution is
shown in Fig. 8.
The user can program connectivity maps by setting the
elements of the connection memory. Each node locally stores
its connection memory with its own routing actions for all
the possible source addresses. Each position of this memory
corresponds to one of the possible 8-bit source addresses and
stores a 5-bit digital word. When one of those bits is at high
level, it indicates that the event must be transmitted through
the interface associated with that bit position.
This feature allows replication of events at intermediate
network nodes. This is done by activating several bits in the
connection memory position of the received source address.
The event will be transmitted through the selected output
interfaces. The programmed tasks can be performed in parallel
because they do not require any shared resources. Output
virtual connections from the same node can share the same
initial segments of a route and clone events closer to the
destination nodes, as opposed to the destination-driven case
where the full route is cloned.
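A minimal software model of the connection memory helps to make the mechanism concrete: 256 positions (one per 8-bit source address), each holding a 5-bit action word with one bit per output interface. The bit-to-interface assignment, the address packing (xS in the upper four bits) and the class and method names are assumptions made for illustration.

```python
# Assumed bit positions inside the 5-bit action word.
INTERFACES = ("LOCAL", "NORTH", "SOUTH", "EAST", "WEST")


class SourceDrivenRouter:
    """Toy model of one node's user-configurable connection memory."""

    def __init__(self):
        self.memory = [0] * 256          # one action word per source address

    @staticmethod
    def _index(x_s, y_s):
        return ((x_s & 0xF) << 4) | (y_s & 0xF)

    def program(self, x_s, y_s, interfaces):
        """Set the action word for events generated at node (x_s, y_s)."""
        word = 0
        for name in interfaces:
            word |= 1 << INTERFACES.index(name)
        self.memory[self._index(x_s, y_s)] = word

    def route(self, x_s, y_s):
        """Return every interface the event must be forwarded to."""
        word = self.memory[self._index(x_s, y_s)]
        return [name for bit, name in enumerate(INTERFACES) if (word >> bit) & 1]


router = SourceDrivenRouter()
router.program(1, 1, ["LOCAL", "EAST"])   # consume locally and replicate eastwards
print(router.route(1, 1), router.route(2, 3))
```

Setting several bits in one action word replicates the event at this intermediate node, while an all-zero word means the node simply ignores events from that source.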
Local processor output event stream management is greatly
simplified in the source-driven option. The source address
is added to all the events that must be transmitted through
the network. The only configuration parameter is the output
port or ports that must transmit this information for it to reach its
final destination. By sending the same event through different
output ports the user can balance the network traffic load and
improve overall latency, because this reduces the event rate on
critical physical links which may otherwise get saturated.
C. Comparison between both algorithms
Any node interconnection map can be implemented using
either of the proposed routing algorithms. However, each
solution offers certain advantages with respect to the other
in terms of connectivity features. We will focus on the impact
of each algorithm on parameters such as latency, network
event traffic (the number of events transmitted through the
network) and the hardware resources needed for the router
implementation.
In terms of hardware complexity, the source-driven implementation
needs a more sophisticated routing algorithm.
This increased complexity leads to longer delays in the router
event processing, increasing the latency associated with event
transmission. The destination-driven router, in contrast, takes the
routing decision on the fly, taking into account only the node address and
the information contained in the incoming event. The latency
penalty caused by the source-driven router is strongly dependent
on the shared connection memory implementation and its
arbitration mechanism. This memory block is large and results
in a significant area overhead.
The source driven algorithm provides the system designer
with more freedom to balance event traffic and design routes
through the networks. For any source-to-destination route, the
designer can insert detours, de-branchings, and local event
clonings at any intermediate node of the route to balance and
optimize overall traffic. On the other hand, the destination
driven algorithm creates pre-determined routes along the net-
work, and the designer can only change the output ports of the
source module of a route. Also, for the destination-driven case,
the events that have to reach several modules necessarily have
to be cloned at the source module. However, in the source-
driven case, multiple module destination events can be cloned
at intermediate route points. This alleviates overall traffic and
makes the average effective module fanout (FMout in eq. (10))
smaller.
D. Deadlock in NoC type systems
Deadlock in Network-on-chip (NoC) or Network-on-Board
(NoB) mesh-type systems is an issue of primary concern
among researchers and developers. A deadlock situation can
happen if routing paths form closed loops [48]. The result
is that all sender ports in the loop are requesting to send
an event, while at the same time all receiver ports in the
loop cannot acknowledge because they cannot take a new
event before their corresponding sender port drops an event.
This is a very well known and studied problem in mesh type
communicating structures. In order to avoid such situations
one solution is to route the paths in such a way that no closed
loops are formed. In the example systems we provide in this
paper we did not encounter any deadlock situation because
the ConvNet examples provided are all feed-forward systems.
However, in general one may encounter situations where feed-
back paths need to be implemented, thus increasing chances of
forming closed loops. Consequently, when assigning routing
paths and output ports within the routers, care must be taken
to avoid closed loops, thus eliminating the possibilities for
Fig. 9. Destination-driven router block diagram. (Diagram: one ROUTERIN block and one ARBITER for each of the north, south, east and west ports, plus a ROUTEROUT block with its routing table and a CHIP ARBITER serving the local processor.)
deadlock situations. Since the routers we are using have bi-
directional data paths between nearest neighbours and events
can be routed to four output ports per module, there is a lot
of flexibility for avoiding closed loops.
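A candidate set of routes can be checked for closed loops before the routers are programmed. The following Python sketch is only illustrative: representing a route as a list of node coordinates, the function names, and the use of a link dependency graph as the loop criterion are assumptions made here, not part of the router design described in this paper.

```python
def channel_dependencies(routes):
    """Map each directed link (a, b) to the set of links that follow it in some route.

    Each route is a list of node identifiers, e.g. (x, y) tuples (assumed format).
    """
    deps = {}
    for route in routes:
        links = list(zip(route, route[1:]))
        for link in links:
            deps.setdefault(link, set())
        for cur, nxt in zip(links, links[1:]):
            deps[cur].add(nxt)
    return deps


def has_closed_loop(routes):
    """Return True if the link dependencies of the given routes contain a cycle."""
    deps = channel_dependencies(routes)
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {link: WHITE for link in deps}

    def visit(link):
        color[link] = GREY
        for nxt in deps[link]:
            if color[nxt] == GREY:        # back edge: a closed loop exists
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[link] = BLACK
        return False

    return any(color[link] == WHITE and visit(link) for link in deps)


# Feed-forward routes on a 2D mesh: no loop is expected.
feed_forward = [[(1, 1), (2, 1), (3, 1)], [(1, 1), (1, 2)]]
# Four two-hop routes chasing each other around a square: a loop is expected.
a, b, c, d = (1, 1), (2, 1), (2, 2), (1, 2)
ring = [[a, b, c], [b, c, d], [c, d, a], [d, a, b]]
```

With these inputs, `has_closed_loop(feed_forward)` is false and `has_closed_loop(ring)` is true, which matches the observation that feedback paths are what create the risk.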
IV. ROUTER DESIGN DETAILS
The routing algorithms described above should be im-
plemented efficiently and with a minimum hardware cost.
This Section describes hardware implementations focusing on
design issues that must be faced to reduce routing processing
times, while keeping lightweight implementations.
A. Destination-Driven Router
Fig. 9 shows a block diagram of the destination-driven
router circuit. The traffic through any of the input channels
is processed by a ROUTERIN block that implements the
destination-driven routing algorithm. A highly parallel hard-
ware architecture has been chosen to reduce routing processing
times. The goal is to separate the event streams that do not
need to use the same channel. For example, a stream that
is being transmitted from the west to the east port, never
interferes with another stream that is being transmitted from
the north to the south port. This is only possible if the shared
routing resources are reduced to a minimum by replicating
them in the architecture.
Routing is defined by user-provided parameters. The first
parameter is the node address, which is used to identify
the node in the network topology. The second parameter
is the routing table, which contains the information needed to
communicate with the node’s target destinations. Every entry
in this routing table is a 10-bit word in which the 8 least
significant bits are used to code the destination address and the
2 most significant bits are used to specify the corresponding
output ports (as in Fig. 6).
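The 10-bit entry layout can be illustrated with a short packing and unpacking sketch. The bit positions follow the description above; the mapping of the 2-bit field to port names and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical coding of the 2-bit output port field (not specified in the text).
PORTS = ("north", "south", "east", "west")


def pack_entry(dest_addr, port):
    """Build a 10-bit routing table word: port in bits 9..8, destination in bits 7..0."""
    if not 0 <= dest_addr <= 0xFF:
        raise ValueError("destination address must fit in 8 bits")
    return (PORTS.index(port) << 8) | dest_addr


def unpack_entry(word):
    """Split a 10-bit routing table word into (destination address, port name)."""
    if not 0 <= word <= 0x3FF:
        raise ValueError("routing table word must fit in 10 bits")
    return word & 0xFF, PORTS[(word >> 8) & 0x3]


entry = pack_entry(0x2A, "east")   # destination 0x2A reached through the east port
```

Under the assumed port coding, `entry` equals `0x22A` and `unpack_entry(entry)` returns `(0x2A, "east")`.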
The basic building blocks of the destination-driven router
architecture are:
• ROUTERIN: this block receives the input stream coming
from an input channel and implements the routing algo-
rithm described in Fig. 5. It decides the output interface
to which the event is to be forwarded, by comparing
the input event address (xADD, yADD) with the user-
specified local node address (xNODE, yNODE). The
AER handshaking protocol is also used to transfer events
between different blocks internally. Handshaking is used
here for flow control purposes as the individual processor
operation is stopped when a network communication link
cannot transmit events. In these overflow situations, the
router does not send the acknowledge back until there
are hardware resources free to process the event. Each
ROUTERIN block has four independent output interfaces
connected to the output channel access arbiters or to the
local processor arbiter.
• ARBITER: this block manages access to an output
channel for the events coming from the ROUTERIN
blocks or from the local processor. The ARBITER scans
its four input AER interfaces to detect any new event. If
an event is detected, it takes control of the output channel.
If another event from any other interface arrives while the
output channel is busy, the resource is not assigned until
the current event releases the shared resource. If several
input interfaces want to take control of the shared re-
source at the same time and the block is busy transmitting
an event, the ARBITER gives lower priority to the last
interface attended. Therefore, the arbitration mechanism
prevents a fast input interface from monopolizing the
shared resource. Note that all arbiters are synchronous
circuits driven by the common FPGA clock.
• ROUTEROUT: this block implements the algorithm
described in Fig. 7 to manage events coming from the
local processor. This block reads the routing table row
by row to add the proper header to the event, forwarding
it through the specified output interface. For this pur-
pose, the ROUTEROUT block has four output interfaces
connected to the north, south, east and west arbiters. To
improve router parallelism, the routing table is divided
for every output port. This way, if an event has to be
transmitted through different output ports, this can be
done in parallel.
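The arbitration policy described for the ARBITER block, in which the last interface attended receives the lowest priority, behaves like a round-robin scheme. The sketch below is a behavioral model in Python under that reading; the class and method names are assumptions, and it does not model the synchronous hardware or the AER handshake timing.

```python
class RoundRobinArbiter:
    """Behavioral model: the most recently granted input gets the lowest priority."""

    def __init__(self, n_inputs=4):
        self.n_inputs = n_inputs
        self.last_granted = n_inputs - 1   # so that input 0 is scanned first

    def grant(self, requests):
        """requests: one bool per input. Returns the granted index, or None if idle."""
        for offset in range(1, self.n_inputs + 1):
            idx = (self.last_granted + offset) % self.n_inputs
            if requests[idx]:
                self.last_granted = idx
                return idx
        return None


arb = RoundRobinArbiter()
# Inputs 0 and 2 request continuously; neither can monopolize the output channel.
grants = [arb.grant([True, False, True, False]) for _ in range(4)]
```

In this example the grants alternate between inputs 0 and 2, so a fast input cannot starve a slower one.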
B. Source-Driven Router
Fig. 10 shows the block diagram of the source-driven router.
The building blocks used are very similar to those described
for the destination-driven router. However, the ROUTERIN and
ROUTEROUT blocks have very different internal structures
due to the differences in the routing algorithm. In the source-
driven option, all the blocks have to read a shared connection
memory containing the routing information. The implementa-
tion of an efficient shared access scheme for this memory can
dramatically reduce the delay time associated with the routing
event process. The router contains the following blocks:
Fig. 10. Source-driven router block diagram. (Diagram: one ROUTERIN block with a local CACHE and one ARBITER for each of the north, south, east and west ports, a ROUTEROUT block and a CHIP ARBITER serving the local processor, and a Shared Connection Memory with access arbitration.)
• ROUTERIN: this block analyzes events coming from
the input channel to extract the source address. This
address is used to access a shared RAM connection
memory of 256 positions containing the routing algorithm
information. The word read from the connection memory
codes the output port(s) that the event must be routed
to. When this task is done, the behavior of this block is
exactly the same as in the destination-driven router.
• Local Cache: every event that is routed in the source-
driven solution needs a memory access, and all the
ROUTERIN blocks may need to read the memory at the same
time. However, in a real network each input block will
process a limited number of flows. This means that the
addresses that every ROUTERIN block is going to read
will be repeated very often and will only constitute a
small part of the total address space. To speed up this
process, the ROUTERIN block first consults a dedicated
cache memory that stores the contents of the most commonly
accessed addresses. When the routing algorithm needs to
access a shared memory position, the ROUTERIN block
checks the cache memory and searches for this word. If
it is found in the cache, the event is routed and there is
no need to access the shared RAM. If the word is not
found, the block reads the shared RAM and stores the
word in its own cache.
• ROUTEROUT: this block is greatly simplified because
events do not have to be replicated and the same header is
added to all the events. In this case, every time an
event coming from the local processor is detected, the
router node address is added and the event is sent to the
corresponding output port(s).
• Shared Connection Memory: this block stores the con-
figuration words that code the routing actions to be taken
for each possible source address. All the ROUTERIN
blocks need to read this memory when they receive a
source address that they have not previously stored in
their local cache. An arbitration mechanism is needed to
avoid conflicts when several ROUTERIN blocks need to
access the shared memory. This mechanism ensures that
only one ROUTERIN block has control over the address
bus of the shared memory.
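The interplay between the local caches and the shared connection memory can be sketched behaviorally as follows. This is a minimal Python model under stated assumptions: the cache size, the least-recently-used replacement policy, and all class and method names are hypothetical, since the text does not specify them, and the access arbitration is not modelled.

```python
from collections import OrderedDict


class SharedConnectionMemory:
    """256-position memory holding one routing word per source address."""

    def __init__(self, size=256):
        self.words = [0] * size
        self.reads = 0                    # counts accesses that need arbitration

    def read(self, source_addr):
        self.reads += 1
        return self.words[source_addr]


class RouterInLookup:
    """Per-port lookup: consult the local cache first, then the shared memory."""

    def __init__(self, shared, cache_size=4):   # cache size is an assumption
        self.shared = shared
        self.cache_size = cache_size
        self.cache = OrderedDict()        # source address -> routing word

    def route_word(self, source_addr):
        if source_addr in self.cache:     # hit: no shared memory access needed
            self.cache.move_to_end(source_addr)
            return self.cache[source_addr]
        word = self.shared.read(source_addr)    # miss: read and store locally
        self.cache[source_addr] = word
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)      # evict least recently used (assumed)
        return word


memory = SharedConnectionMemory()
memory.words[17] = 0b00101                # example routing word for source 17
north_in = RouterInLookup(memory)
first = north_in.route_word(17)           # miss, goes to the shared memory
second = north_in.route_word(17)          # hit, served by the local cache
```

After the two lookups, `memory.reads` is 1, which reflects why repeated flows through the same port rarely compete for the shared memory.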
C. FPGA Implementations Comparison
In order to compare the two router implementations in a
multi-module system, we analyzed the impact of implement-
ing different size networks on a Virtex-6 FPGA prototyping
system. As unit-module we used a VHDL description of
an event-driven programmable-kernel 2D AER-Convolution-
processor for vision applications, capable of handling pro-
grammable kernels of size up to 11 × 11 on pixel arrays
of size 64 × 64 [51]. Each VHDL ConvModule uses register
RAM to store pixel states and kernel values. For each pixel
state and each kernel weight we use an 8-bit register. The
Virtex-6 could hold up to 64 of these Convolution modules,
programmed with any arbitrary interconnection map. Fig. 11
illustrates the case of a 3× 3 Convolution network. Inputs to
the network were provided through 3 input ports, connected
to an AER splitter receiving a unique external input AER
flow and replicating it over the three inputs. Every network
node included one of the previously described routers, the
convolution block, and a dedicated configuration processor
with an SPI. This interface was fed by a global configuration
controller that received all the configuration data (like router
tables, and convolution processor parameters) from a host
computer. The remaining peripheral modules could connect one of
their AER outputs to a multiplexer block connected in turn to
the external FPGA. This way, such peripheral modules were
able to send their outputs to any of the multiplexer inputs
to allow external monitoring of the AER flow. Fig. 12 shows
the FPGA occupation ratios for different network sizes (where
the number of nodes is Nn×Nn) in terms of occupied slices
and memory resources. The destination-driven routing solution
needed fewer hardware resources in terms of memory and slices
than the source-driven implementation. Note that a 7× 7 grid
of these Convolution modules implemented a neural network
with 7× 7× 64× 64 = 196k neurons (for k = 1024) and an
equivalent number of 196k×11×11 = 24.3 million synapses.
Since each convolution module only had to store one kernel of
11× 11 8-bit words, the total physical RAM memory needed
to store 7 × 7 kernels was 5929 bytes. The RAM required
to hold all 8-bit neural states was 196KB. Consequently, the
limiting factor in this particular case was the number of slices
available in the FPGA and not its memory.
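The figures quoted above follow directly from the grid, array and kernel sizes. The short Python sketch below reproduces the arithmetic; the function name and the returned dictionary keys are chosen here for illustration, and one byte per neuron state and per kernel weight is taken from the 8-bit registers mentioned in the text.

```python
def convnet_size(grid_side, array_side, kernel_side, bytes_per_word=1):
    """Return neuron, synapse and memory counts for a grid of convolution modules."""
    n_modules = grid_side * grid_side
    neurons = n_modules * array_side * array_side
    weights_per_kernel = kernel_side * kernel_side
    return {
        "neurons": neurons,
        "synapses": neurons * weights_per_kernel,
        "kernel_bytes": n_modules * weights_per_kernel * bytes_per_word,
        "state_bytes": neurons * bytes_per_word,
    }


size_7x7 = convnet_size(grid_side=7, array_side=64, kernel_side=11)
neurons_in_k = size_7x7["neurons"] // 1024   # 196 (k = 1024)
```

For the 7 × 7 grid this gives 200704 neurons (196k), 24285184 synapses (about 24.3 million), 5929 bytes of kernel storage and 200704 bytes (196 KB) of neural state.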
V. NETWORK EXTENSION TO MULTIPLE FPGAS
The examples illustrated in Section IV were synthesized
on a single Virtex-6 FPGA. To allow modular scalability to
arbitrary size multi-module networks, provisions for multi-
FPGAs (or multi-chips, in case of ASICs) needed to be
made. AER links inside the FPGA were made using parallel
Fig. 11. Block diagram for the network on chip implementation. (Diagram: 3 × 3 array of nodes, each with a Router, a CONV block and a CONF block with an SPI; an input SPLITTER fed by the AER input, an output MUX driving the AER output, and a Config block connected through SPI to the computer interface (serial port).)
Fig. 12. Slice and memory occupation in an Nn × Nn node ConvNet
implementation, for the destination-driven and source-driven routers
(panels: slices versus Nn, and memory in Kbits versus Nn, for Nn from 3 to 7).
The LX240 Virtex-6 FPGA used in the experiments has
37680 slices and 14976 Kb of block RAM.
AER buses. However, this was not realistic for a multi-FPGA
realization because of the excessive number of resulting pins.
Fortunately, high-end FPGAs include state-of-the-art serial
links, like the Rocket I/O. In this Section we describe a way
of extending each asynchronous bidirectional AER link to use
available Rocket I/O serial interfaces, using neither dedicated
handshaking lines (such as Ack and Rqst), nor an extra LVDS
pair [38]. When the event rate transmitted in one direction
exceeds the processing speed of the receiving module, a stop
command is transmitted in the opposite direction. This way,
flow control is implemented in both directions just by using
the two required LVDS cable pairs for bidirectional serial
transmission. For this purpose we used the 8b/10b encoding
scheme, as this allows for 12 special characters, commonly
called K-characters. We used these characters to implement
idle commas (to keep the link synchronized during the absence
of address events) and flow control commands. Note that this
is standard industry practice.
Fig. 13 shows the full duplex serial Rocket I/O AER link
Fig. 13. Full-duplex Rocket I/O based parallel-serial AER link with flow
control capability. (Diagram: Rocket I/O wrapper with serial pins TXp/TXn, RXp/RXn and REFCLKp/REFCLKn; FRAME_GEN block with AERin (32 bits), reqIN, ackIN, CLK, Ktx and DATAtx; FRAME_CAPT block with AERout (32 bits), reqOUT, ackOUT, Krx, DATArx, DISPrx and NTrx; flow control signals reqfc, ackfc and stop between the two blocks.)
with flow control capability. The Xilinx CORE Generator tool
provides a wrapper to interface with the FPGA dedicated
hardware. It defines the signals that the user must generate
in order to send and receive data over this link. The interface
data width, ND in Fig. 13, is user-configurable to 8, 16
or 32 bits. TXp-TXn and RXp-RXn represent the serial output
and input interfaces, and REFCLKp and REFCLKn the reference clock
used by the Rocket I/O circuitry to generate the transmission
frequency.
The FRAME_GEN block handshakes the input parallel AER
stream and sends 8ND bits at every rising edge of the master
clock CLK provided by the Rocket I/O circuitry. If there
is no user data to transmit, the interface sends a comma
character represented by a K-character of the 8b/10b code.
FRAME_CAPT receives the continuous data stream coming
out from the channel, analyzes it, discards the commas and
frames the parallel AER events, implementing the handshaking
with the next processing block. Signals Ktx and Krx
are activated when a K-character is transmitted or received,
respectively. The receiver also uses the information provided
by signals DISPrx and NTrx to detect possible transmission
errors. DISPrx indicates a disparity error in the 8b/10b words
received and NTrx is activated when the received word is not a
valid code character.
Fig. 14 illustrates the full-duplex flow control mechanism
implemented through signals reqfc, ackfc and stop in Fig. 13.
The figure illustrates the case of ROUTER1 sending events
to ROUTER2. The ROUTER2 RX2 FRAME_CAPT block
(see Fig. 13) writes each incoming event in a FIFO memory,
which is read by the router when there are new events to
be processed. If the ROUTER1 TX1 transmission event rate
is faster than the ROUTER2 RX2 handling capabilities, the
number of elements stored in the FIFO will increase. If
ROUTER1 TX1 kept sending new events, the FIFO would
overflow and information would be lost. The FRAME_CAPT
Fig. 14. Flow control mechanism in a full-duplex link. (Diagram: ROUTER1 with TX1 and RX1, ROUTER2 with RX2, TX2 and a FIFO with thresholds Nmax and Nmin; steps 1) saturation, 2) and 5) flow control message, 3) and 6) stop signal, 4) end of saturation.)
block detects when the number of elements is greater than
a user-defined threshold Nmax (see ‘1)’ in Fig. 14) and
sends a flow control message using a K-character via the
ROUTER2 transmitting channel TX2 (see ‘2)’ in Fig. 14).
This message request is made by activating reqfc and it is
processed with the highest priority by the ROUTER2 TX2
FRAME_GEN block. When the flow control message has been sent,
the ROUTER2 TX2 FRAME_GEN block acknowledges the
transmission using ackfc.
When the ROUTER1 RX1 FRAME_CAPT block detects the
flow control K-character, it automatically stops event transmission
by asserting signal stop (see ‘3)’ in Fig. 14). In an overflow
situation, the AER acknowledge signal ackIN is not activated
even when the request signal is asserted, so the AER data
flow is stopped. The ROUTER2 RX2 FRAME_CAPT block
monitors the receiving FIFO until the number of elements falls
below a second user-defined threshold Nmin (see ‘4)’ in Fig.
14). From this point on, the overflow situation is considered
finished and a new flow control message is sent to ROUTER1
RX1 (see ‘5)’ in Fig. 14). When it is received, the stop signal
is deasserted (see ‘6)’ in Fig. 14) and the transmission flow is
resumed.
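The two-threshold behavior described above is a hysteresis on the FIFO fill level. The following Python sketch models it in discrete time steps. It is a simplification under stated assumptions: detection, propagation and stop delays are taken as zero, time is counted in abstract steps, and the class and function names are hypothetical.

```python
class FlowControlledFifo:
    """Receiver FIFO that raises 'stop' above n_max and releases it below n_min."""

    def __init__(self, capacity, n_max, n_min):
        assert 0 <= n_min < n_max <= capacity
        self.capacity, self.n_max, self.n_min = capacity, n_max, n_min
        self.fill = 0
        self.stop = False                 # state of the transmitter stop signal

    def push(self):
        if self.fill >= self.capacity:
            raise OverflowError("event lost: FIFO full")
        self.fill += 1
        if not self.stop and self.fill > self.n_max:
            self.stop = True              # steps 1) to 3) in Fig. 14

    def pop(self):
        if self.fill:
            self.fill -= 1
        if self.stop and self.fill < self.n_min:
            self.stop = False             # steps 4) to 6) in Fig. 14


def simulate(steps, tx_period, rx_period, fifo):
    """Transmitter sends every tx_period steps unless stopped; receiver reads every rx_period."""
    trace = []
    for t in range(steps):
        if t % tx_period == 0 and not fifo.stop:
            fifo.push()
        if t % rx_period == 0:
            fifo.pop()
        trace.append((fifo.fill, fifo.stop))
    return trace


fifo = FlowControlledFifo(capacity=16, n_max=8, n_min=2)
trace = simulate(steps=200, tx_period=1, rx_period=2, fifo=fifo)
```

With a transmitter twice as fast as the receiver, the trace alternates between stopped and running phases and the fill level never exceeds `n_max + 1`, so no event is lost in this idealized zero-delay case.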
The ROUTER1 sender keeps transmitting events while the
flow control mechanism is in operation. To ensure that no
events are lost during traffic peaks, the time needed to stop
the transmitter when an overflow situation is detected must
fulfill the inequality
$$T_{DET1} + T_{PROP} + T_{STOP} \le (N_F - N_{max})\, T_{TX,EV} \qquad (11)$$
where $T_{DET1}$ is the time needed to detect the overflow
situation and generate the flow control message, $T_{PROP}$ is the
channel propagation time and $T_{STOP}$ is the time required to
stop the transmitter. The maximum number of FIFO elements
is represented by $N_F$ and $T_{TX,EV}$ is the transmission event
period that is causing the overflow.
It would be desirable for the transmission to begin as soon
as possible after a flow control pause. For this purpose, the
flow control mechanism has to ensure that there will be events
stored in the FIFO waiting to be processed by the router after
the recovery. To maximize the receiver event rate, we can
impose the following restriction on the Nmin value
$$T_{DET2} + T_{PROP} + T_{START} \le N_{min}\, T_{RX,EV} \qquad (12)$$
where $T_{DET2}$ is the time needed to detect the end of an
overflow situation, $T_{START}$ is the time required to start the
transmitter again and $T_{RX,EV}$ corresponds to the receiver
event processing time (or the inverse of the event rate).
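Inequalities (11) and (12) can be turned into threshold values directly: (11) gives the largest admissible Nmax and (12) the smallest admissible Nmin. The Python sketch below does this with integer times in nanoseconds. The function names and the numerical values in the example are hypothetical and are not measurements from this work.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def max_threshold(n_fifo, t_det1, t_prop, t_stop, t_tx_ev):
    """Largest Nmax satisfying eq. (11); all times in the same integer unit."""
    return n_fifo - ceil_div(t_det1 + t_prop + t_stop, t_tx_ev)


def min_threshold(t_det2, t_prop, t_start, t_rx_ev):
    """Smallest Nmin satisfying eq. (12); all times in the same integer unit."""
    return ceil_div(t_det2 + t_prop + t_start, t_rx_ev)


# Hypothetical example: 64-element FIFO, 270 ns total reaction time,
# 38 ns between transmitted events and 50 ns per event at the receiver.
n_max = max_threshold(n_fifo=64, t_det1=20, t_prop=230, t_stop=20, t_tx_ev=38)
n_min = min_threshold(t_det2=20, t_prop=230, t_start=20, t_rx_ev=50)
```

For these example numbers, `n_max` is 56 and `n_min` is 6. Choosing a smaller Nmax or a larger Nmin also satisfies the inequalities, at the cost of more frequent flow control messages.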
VI. SYSTEM LEVEL DESIGN CONSIDERATIONS
Several analysis and optimization methodologies for 2D
mesh connected networks have been proposed in the literature
[52], [53]. The 2D communication layer is analyzed from a
traffic management perspective using queuing theory to find
network parameters such as latency, queue delays or queue
occupation rates. All these parameters are strongly influenced
by the application, because they depend on the network
topology (physical and logical) and the traffic rates generated
by the processors. On the other hand, optimization procedures
for massively parallel architectures [54] have been successfully
applied to neuromorphic systems which integrate millions of
neurons [9]. Here we will rely on analysis techniques for
2D mesh networks, and use the results to suggest ways to
optimize the implementations. The study will be centered
on our convolution unit network implementations, but it can
be easily extended to other neuron array schemes. We will
analyze two points of view: hardware resource requirements
and event traffic.
A. Hardware Resource Requirements
AER convolution modules and network circuits employ a
certain amount of resources (logic area and memory) which
must fit within the selected implementation platform. This
resource consumption will be related to the number of AER
modules and the neuron array sizes. In general, an AER system
with $N_{units}$ modules requires an area of
$$A_{total} = N_{units}\,(A_{conv} + A_{router}) + A_{prog} \qquad (13)$$
where $A_{conv}$, $A_{router}$, and $A_{prog}$ are the areas used by
one convolution processor, one router, and the overall SPI
programming circuitry, respectively. Let every convolution module have a
maximum kernel size of $N_{ker} \times N_{ker}$ and every weight be
coded with $W$ bytes. If convolution modules integrate an array
of $N_{arr} \times N_{arr}$ neurons, the state of which is represented by
$S$ bytes, the area taken up by a ConvModule is given by
$$A_{conv} = N_{arr}^2\, A_{reg}(S) + N_{ker}^2\, A_{reg}(W) + A_{logic} \qquad (14)$$
where $A_{reg}(x)$ is the area of a register of $x$ bytes, and $A_{logic}$
is the area taken up by the additional logic of the convolution
module implementation, which depends on $N_{arr}$, $W$ and $S$.
For the routers, as discussed previously, the simpler
destination-driven algorithm needs fewer logic resources,
$A_{rlogic,dest}$, than the source-driven one, $A_{rlogic,sour}$. Furthermore,
the source-driven approach needs extra memory to store the
routing actions for all possible source addresses in the network
(each coded in 5 bits). This extra memory is implemented through
local registers and uses an area $A_{reg}(x)$, where $x$ is the number
of bytes. Since routing actions only require 5 bits, $x = (5/8)y$,
where $y$ is the number of memory positions. If $N_{add}$ bits are
used to code the network addresses, then $y = 2^{N_{add}}$ and the
router areas can be expressed as
Fig. 15. Example AER system to study system level characterization
methodology. (a) Logical description of the network and event rates for each
logical (virtual) channel. Parameter Eα has units of event rate (events per
second) and the traffic loads at the different nodes are expressed in terms
of Eα. By sweeping Eα we can analyze the impact of traffic scaling and
saturation in the network. (b) Physical implementation on the structured-grid-
AER infrastructure and corresponding mapping of the logical modules (A to
I) on the physical 2D grid: A (1,1), B (1,2), C (1,3), D (2,1), E (2,2), F (2,3), G (3,1), H (3,2), I (3,3).
$$A_{router,dest} = A_{rlogic,dest} \qquad (15)$$
$$A_{router,sour} = A_{rlogic,sour} + A_{reg}\!\left(\tfrac{5}{8} \times 2^{N_{add}}\right) \qquad (16)$$
The total number of neurons which can be integrated in the
system is $N_{units} N_{arr}^2$ and the maximum number of kernel
weights is $N_{units} N_{ker}^2$. Taking into consideration the previous
resource analysis, the total RAM memory bytes $M$ needed for
both routing algorithms can be written as
$$M_{dest} = N_{units}\left(N_{ker}^2 W + N_{arr}^2 S\right) \qquad (17)$$
$$M_{sour} = N_{units}\left(N_{ker}^2 W + N_{arr}^2 S + \tfrac{5}{8} \times 2^{N_{add}}\right) \qquad (18)$$
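As a numerical illustration, eqs. (17) and (18) can be evaluated for the 7 × 7 network of Section IV-C. The Python sketch below is a direct transcription of the two formulas; the function name is hypothetical, and Nadd = 8 is an assumption based on the 256-position connection memory described for the source-driven router.

```python
def memory_bytes(n_units, n_ker, w, n_arr, s, n_add=None):
    """Total RAM bytes per eq. (17); pass n_add to include the eq. (18) routing term."""
    per_unit = n_ker ** 2 * w + n_arr ** 2 * s
    if n_add is not None:
        per_unit += 5 * 2 ** n_add / 8    # 5-bit routing word per source address
    return n_units * per_unit


# 7 x 7 modules, 11 x 11 kernels, 64 x 64 arrays, one byte per weight and state.
m_dest = memory_bytes(n_units=49, n_ker=11, w=1, n_arr=64, s=1)
m_sour = memory_bytes(n_units=49, n_ker=11, w=1, n_arr=64, s=1, n_add=8)
```

Under these assumptions the destination-driven network needs 206633 bytes and the source-driven one 214473 bytes, that is, 160 additional bytes per node for the connection memory.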
B. Event Traffic Estimation
One of the most important parameters of any AER com-
munication scheme is the event transmission latency between
processing blocks. The AER mesh architecture used in this
paper can be studied using an analytical model for NoC
performance analysis [52], where routers are modeled as a
collection of FIFO buffers with five input/output channels
(north, south, east, west and local interfaces). This model
computes the network queue occupation at every interface of
each router assuming that event rates for all network channels
are known parameters. These rates are strongly dependent on
the specific application and can be easily estimated through a
behavioral level simulation.
For example, Fig. 15(a) shows the logic network topology
of a specific pre-structured AER system. This network can
be simulated behaviorally to obtain the average event rates
at each connection (virtual channel). The event rates at each
connection are expressed in terms of a reference event rate Eα,
which can be swept to study changing traffic conditions. To
map the logic network onto the physical 2D mesh of nodes, the
first step is the “placement” of modules, which is illustrated
in Fig. 15(b). The second step is to assign a route (or list of
nodes) for events going from a source node s to a destination
node d. Let us call this list of route nodes Πsd. Each route
corresponds to a virtual connection in the logic network. Once
the module placement and route lists Πsd are established we
know the event rates at the input and output router channels.
Let $l_{ijr}$ be the event rate at input channel $i$ routed to output
channel $j$ in router $r$. For each router, we can define a $5 \times 5$
forwarding probability matrix where element $f_{ijr}$ corresponds
to the probability that an event which arrives at interface $i$
will leave the router through interface $j$. These probabilities
can be computed for every router in the network as
$$f_{ijr} = \frac{l_{ijr}}{\lambda_{ri}}, \qquad \lambda_{ri} = \sum_{k=1}^{5} l_{ikr}, \qquad i, j \in [1, 5],\; r \in [1, N_{units}] \qquad (19)$$
In the network the events from different routes have to
share common resources to reach their final destination. If
two events want to use the same resource, arbiters grant access
and make some events wait in their queues until the resource
becomes available. The forwarding matrix can be used to
compute the contention probabilities cijr for each router, i.e.,
the probability that channels i and j compete for the same
output, as:
$$c_{ijr} = \sum_{k=1}^{5} f_{ikr} f_{jkr} \;\; \forall i \neq j, \qquad c_{ijr} = 1 \;\; \forall i = j \qquad (20)$$
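The forwarding and contention matrices of one router can be computed from its table of channel-to-channel event rates as sketched below in Python. Each row is normalized by the total rate entering through that input, so that every row with traffic sums to one, in line with the definition of the forwarding probability. The function names, the treatment of idle inputs as all-zero rows, and the example rates are assumptions for illustration.

```python
def forwarding_matrix(rates):
    """rates[i][j]: event rate from input i to output j. Returns probabilities f[i][j]."""
    forwarding = []
    for row in rates:
        total = sum(row)                  # total event rate entering through input i
        forwarding.append([x / total if total else 0.0 for x in row])
    return forwarding


def contention_matrix(forwarding):
    """c[i][j]: probability that inputs i and j compete for the same output, eq. (20)."""
    n = len(forwarding)
    return [
        [
            1.0 if i == j else sum(forwarding[i][k] * forwarding[j][k] for k in range(n))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical router: input 0 splits its traffic between outputs 1 and 2,
# input 1 sends everything to output 2, and the other three inputs are idle.
rates = [
    [0, 2, 2, 0, 0],
    [0, 0, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
F = forwarding_matrix(rates)
C = contention_matrix(F)
```

In this example inputs 0 and 1 contend for output 2 with probability 0.5, while idle inputs never contend with any other input.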
The router forwarding matrix $F_r = [f_{ijr}]_{5 \times 5}$ and the
contention matrix $C_r = [c_{ijr}]_{5 \times 5}$ describe the routers’ traffic
management. It can be demonstrated [52] that the average
number of events per queue $N_r = [N_{rj}]_{5 \times 1}$ at each router
can be computed as:
$$N_r = (I - t_r \Lambda_r C_r)^{-1} \Lambda_r \bar{R}_r \qquad (21)$$
where scalar $t_r$ is the mean event processing time in the
router and $\Lambda_r = \mathrm{diag}(\lambda_{rj})$ is a diagonal matrix made up of
the total event rates through the 5 input interfaces. $[\bar{R}_r]_{5 \times 1}$ is
the residual time matrix, which represents the amount of time
that a new event has to wait in the queue until the event which
occupied the shared resource at the moment the new event
arrived finishes its processing. Solving eq. (21) and applying
Little’s theorem [55], we can compute the mean waiting time
in channel $j$ of router $r$ as $W_{rj} = N_{rj}/\lambda_{rj}$. This way, the
total latency of one node-to-node hop is $W_{rj} + t_r + t_{tx}$, where
$t_{tx}$ is the transmission time through the inter-node physical
channel. The total latency of an event traveling from source
node $s$ to destination node $d$ is therefore
$$L_{sd} = \sum_{(r,j) \in \Pi_{sd}} (W_{rj} + t_r + t_{tx}) \qquad (22)$$
This analysis methodology will be applied to the example
system of Fig. 15 in Section VIII where the network traffic
will be estimated for a source driven and a destination driven
solution. Moreover, we will discuss how to use this analysis
procedure to improve network performance by varying some
of the implementation parameters. Note that, given a logical
network together with virtual connection event rates, it is only
necessary to establish a node “placement” and the route lists
{Πsd}. The other computations (from eqs. (19) to (22)) are
quite straightforward, given parameters $t_r$, $\Lambda_r$, $\bar{R}_r$ and $t_{tx}$.
The result is the set of route delays $\{L_{sd}\}$, the maximum of which
identifies the main timing bottleneck of the network:
$$L_{max} = \max\{L_{sd}\} \qquad (23)$$
The designer must then adapt the node “placement” and route
lists {Πsd} to minimize Lmax.
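Once the waiting times are known, eqs. (22) and (23) reduce to a sum over the hops of each route followed by a maximum over routes. The Python sketch below shows this last step. The data layout (a route as a list of (router, channel) pairs), the function names and all numerical values are hypothetical; the waiting times would in practice come from solving eq. (21).

```python
def route_latency(route, waiting, t_r, t_tx):
    """Eq. (22): sum of waiting, processing and transmission time over all hops."""
    return sum(waiting[hop] + t_r + t_tx for hop in route)


def worst_route(routes, waiting, t_r, t_tx):
    """Eq. (23): return (name, latency) of the slowest route."""
    latencies = {name: route_latency(hops, waiting, t_r, t_tx)
                 for name, hops in routes.items()}
    worst = max(latencies, key=latencies.get)
    return worst, latencies[worst]


# Hypothetical waiting times W_rj in ns, indexed by (router, output channel).
waiting = {("A", "east"): 5, ("D", "east"): 15, ("A", "south"): 0}
routes = {
    "A->G": [("A", "east"), ("D", "east")],
    "A->B": [("A", "south")],
}
bottleneck = worst_route(routes, waiting, t_r=10, t_tx=20)
```

Here the two-hop route from A to G is the bottleneck with 80 ns, so it would be the first candidate for a different placement or an alternative route list.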
VII. EXPERIMENTAL RESULTS
In this Section we provide experimental results by imple-
menting the above mentioned concepts on Virtex-6 hardware
using Xilinx ML-605 development boards. First we show the
characterization results of the Full-Duplex Rocket-I/O-Based
AER parallel-serial interface described in Section V. An exam-
ple of a multi-module AER processing system which consists
of an array of Gabor filters implemented on a single FPGA
is then described. After that, a second system, implementing
a multi-layer feed-forward Convolutional Neural Network on
a single FPGA, is described. Next, we check the maximum
capacity of a single FPGA, and finally we provide results for
a multi-layer feed-forward ConvNet for character recognition.
A. Full-Duplex Parallel-Serial AER Interface
The ML-605 development board provides twenty indepen-
dent full-duplex Rocket I/O serial ports, eight of which are
available through an 8x PCIe connector. We used a dedicated
board to adapt this connector to 16 independent SMA (Sub-
Miniature version A coaxial RF connector) pairs, thus making
it possible not only to test and characterize several transmitters
and receivers independently, but also to interconnect
them. Some extra test circuits were added inside the FPGA,
such as an event generator and an event consumer/analyzer,
both with independent programmable event rates. This allowed
us to force overflow situations and test the flow control
dynamics, while detecting errors between the sent and received
events.
The timing characteristics of the serial link are given by its
latency and maximum event rate. To characterize latency, two
independent Full-Duplex AER serial links were interconnected
(each with two high speed wires), and the delay between
the ‘reqIN’ (see Fig. 13) of the first one and the ‘reqOUT’
of the second one was measured. This latency included the
delay introduced by the 8b/10b encoders and decoders, phase
alignment buffers, comma detection circuits, etc. To mea-
sure this latency, a very low event rate was programmed,
so that consecutive events were sufficiently spaced in time.
The measured latency was 232ns for a 2.5Gbps bit rate, as
shown in Fig. 16. The maximum event rate supported by
the Rocket I/O could be characterized by analyzing the input
AER handshaking (reqIN, ackIN) cycle duration which is
Fig. 16. Interface input (reqIN, ackIN) and output request (reqOUT) when
the link operates with very sparse events to measure the input to output event
latency. (Plot: reqIN, ackIN and reqOUT voltages in V versus time in ns, with annotated intervals of 232 ns and 20 ns.)
highlighted in Fig. 16 as 20ns. This would result in 50Meps
(mega events per second) maximum possible event rate for
32-bit events.
Fig. 17 illustrates the flow control operation. The event
generator in the transmitter was set intentionally to 26Meps
event rate, while the event consumer in the receiver was
set to have an event consumption rate of 20Meps. This led
to overflow in the receiving FIFO. It can be seen how the
transmitter was stopped when the flow control message was
received and started over again when the overflow situation
was overcome. Two different overflow behaviors were possible
depending on the chosen Nmax or Nmin values. If these
were optimally programmed, there was no stop in the output
event flow, as shown in Fig. 17(a). On the other hand, pauses
appeared if the receiver processed all the events stored in
the FIFO before the flow control message arrived at the
transmitter. This situation is illustrated in Fig. 17(b).
B. Routers with Parallel-Serial Interfaces
The routers described in Section III were complemented
with four Full-Duplex Rocket-I/O-based AER parallel-serial
interfaces (see Fig. 13), to test the performance for multi-
FPGA event routing. Table III shows the Virtex-6 occupa-
tion statistics associated with both implementations. For the
destination-driven router, clock frequency could be set up to
250MHz resulting in 2.5Gbps serial bit rate. However, for the
source-driven router, clock frequency could only be set up
to 200MHz because of the higher complexity, resulting in a
2Gbps bit rate. The latency introduced by the routers can be
measured by looking at the delay between the reqOUT signal
of the input full-duplex Rocket-I/O serial-to-parallel interface
and the reqIN signal of the output full-duplex Rocket-I/O
parallel-to-serial interface. This latency was 12.5ns for the
destination-driven router and 20ns for the source-driven router,
in non-overflow situations. For the source-driven router, the
20ns figure applies when the received address was already stored
in the local cache. Otherwise, the latency became 30ns.
TABLE III
ROUTER IMPLEMENTATION STATISTICS FOR THE DD (DESTINATION-DRIVEN) AND THE SD (SOURCE-DRIVEN) ROUTERS.
Resources                   DD router   SD router   Total FPGA
Occupied slices             1121        1400        37680
Occupied RAMB18E1 blocks    0           1           832
Rocket I/O transceivers     4           4           20
The destination-driven router could handle a maximum 32-bit
event rate of $E_{pp} = 27$Meps, which corresponds to 37ns
for completion of the handshaking cycle. On the other hand,
the source-driven router could handle up to $E_{pp} = 17.5$Meps
maximum event rate, or 57ns handshaking cycle. Although
event transmission in destination-driven routing is faster, it is
also true that events have to be transmitted multiple times
when destinations are multiple. In this case, if the multiple
events are routed through the same port, there will be a
considerable delay penalty as shown in Fig. 18(a). However,
most of the time it is also possible to replicate events through
(up to four) different ports, as shown in Fig. 18(b), to avoid
this penalty.
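The event rates quoted in this section follow directly from the handshake cycle durations. The short sketch below recomputes them and adds an idealized estimate of the replication penalty. The helper names are hypothetical, and the replication model (copies spread evenly over ports that work fully in parallel) is an assumption made only for illustration.

```python
import math


def max_event_rate_meps(cycle_ns):
    """Maximum event rate (Meps) for a handshake cycle lasting cycle_ns."""
    return 1e3 / cycle_ns


def replication_time_ns(n_copies, cycle_ns, n_ports=1):
    """Time to send n_copies of one event when the copies are spread evenly
    over n_ports output ports working in parallel (idealized assumption)."""
    return math.ceil(n_copies / n_ports) * cycle_ns


if __name__ == "__main__":
    cycles_ns = [("serial link (Fig. 16)", 20.0),
                 ("destination-driven router", 37.0),
                 ("source-driven router", 57.0)]
    for name, cycle_ns in cycles_ns:
        print(f"{name}: {max_event_rate_meps(cycle_ns):.1f} Meps")
    # Four copies of one event through one port versus four ports.
    print(replication_time_ns(4, 37.0, n_ports=1), "ns through one port")
    print(replication_time_ns(4, 37.0, n_ports=4), "ns through four ports")
```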
[Fig. 17 plots: reqIN (top) and reqOUT (bottom) versus time (ns), for panels (a) and (b).]
Fig. 17. Interface input (reqIN) and output request (reqOUT) in overflow
with (a) optimum Nmax-Nmin selection and (b) non-optimum Nmax-Nmin
selection leading to pauses in the output flow.
C. Single-FPGA Implementation of Gabor Filter Array
As a first illustration of multi-module operation we implemented
a 3 × 3 array of orientation extraction 2D-Gabor
filters of different scales and angles on a single Virtex-6.
The kernels are shown in Fig. 19. The implemented structure
follows the diagram in Fig. 11 and received sensory input from
an AER DVS retina [56], [57]. The convolution filters used a
modified version of a previously reported VHDL ConvModule
[51], with its random Poisson distributed readout mechanism
replaced by a plain compare-and-fire mechanism. All filter
outputs were routed to one of the multiplexer inputs and
captured off-chip with an AER data logger [58]. Fig. 20 shows
the sensor and the nine filter output events collected during
the same period of 160ms, while the retina was observing two
walking persons.
[Fig. 18 plots: (a) input request and output request versus time (ns); (b) input request and output requests on links A and B versus time (ns).]
Fig. 18. (a) Output event replica when the same event must be transmitted
to different destinations using the same output port. (b) Parallel transmission
of events coming from the local processor that must be transmitted through
different output ports A and B.
[Fig. 19 panels: kernel images for Scales 1, 2, and 3 at Orientations 0º, 45º, and 90º.]
Fig. 19. Kernels in the bank of Gabor filters for the three chosen orientations
and scales.
[Fig. 20 images: (a) input scene; (b) grid of filter outputs arranged by scale and angle.]
Fig. 20. Event-driven Gabor filtering illustration. Background gray represents
zero activity pixels, brighter pixels are active pixels sending positively signed
events, darker pixels are active pixels sending negatively signed events. (a)
Input scene captured by the DVS retina, with pixel activity of both signs. (b)
Results of the 3x3 bank of Gabor filters and sign rectification, so that only
positive events result.
Fig. 21. Illustration of pseudo-simultaneity of event-driven convolutional
filtering with a Gabor filter for detection of −45° oriented edges. The two
subfigures represent the x/y projection of events captured during 40ms and
6ms. Convolution module input events are represented by stars and output
events by circles. One can see that input events representing −45° edges are
detected during the same 40ms or even 6ms they appear.
[Fig. 22 diagram: AER input, splitter, 3x3 ConvNet NoC in FPGA1, parallel-to-serial AER links, 3x3 ConvNet NoC in FPGA2, merger, and AER output; first-row nodes (1,1) to (6,1) carry request signals req1 to req6.]
Fig. 22. Diagram of 3x6 Gabor Filter Array Implementation in two FPGAs
Event-driven convolution processors present the “pseudo-simultaneity”
property [22], [24]: during a given time interval,
the input flow of events (representing the input scene) is
simultaneous to the output flow of events (representing the
filtered input scene). This is because events are processed
as they arrive with delays shorter than the average inter-event
time. For the convolution modules we were using, event
processing time for 11 × 11 kernels was about 3µs. The input
event flow provided by a 128 × 128 pixel DVS retina when
observing people walking was in the range of 10-50keps (kilo
events per second). Consequently, a Gabor filter output event
representing an angle at a given scale was available as soon as
sufficient input events representing this feature were received,
plus the extra 3µs for processing the last one. This is illustrated
in Fig. 21, where a −45° filter provides output events as
soon as enough input events are received which are aligned
in short −45° edges. Stars represent input events and circles
output events. The leftmost subfigure shows the input and
output events collected during 40ms. As can be seen, there
are −45° oriented input segments present during these 40ms
that are readily detected by output events during these same
40ms. Furthermore, the rightmost subfigure shows the same
for a time interval of only 6ms. Consequently, in
event-driven feature extraction, a given feature is detected as
soon as enough representative input events are received. Thus,
recognition delay would be determined mainly by the event
statistics provided by the sensor, not the processing delay of
the event-driven ConvNet.
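The condition behind pseudo-simultaneity, a per-event processing time well below the mean inter-event time, can be checked with the figures quoted above. This is a back-of-the-envelope sketch: the function names are hypothetical, and it treats the input as a single stream with a uniform mean rate, which is a simplification.

```python
PROCESSING_TIME_US = 3.0  # per-event processing time for 11 x 11 kernels


def mean_inter_event_time_us(rate_keps):
    """Mean time between input events (µs) for a rate given in keps."""
    return 1e3 / rate_keps


def busy_fraction(rate_keps, processing_time_us=PROCESSING_TIME_US):
    """Fraction of time a module spends processing at the given input rate."""
    return processing_time_us / mean_inter_event_time_us(rate_keps)


if __name__ == "__main__":
    for rate_keps in (10, 50):
        print(f"{rate_keps} keps: "
              f"inter-event time {mean_inter_event_time_us(rate_keps):.0f} µs, "
              f"busy fraction {busy_fraction(rate_keps):.2f}")
```

Even at the upper end of the quoted range the mean inter-event time is several times longer than the 3µs processing time, so under these assumptions a module is idle most of the time.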
D. Multi-FPGA Implementation of Gabor Filter Array
In order to illustrate the use and operation of the Full-
Duplex Rocket-I/O-Based Parallel-Serial Interfaces described
in Section V.A, we implemented a 3 × 6 array of Gabor
filters in 2 FPGAs. The corresponding diagram is shown in
Fig. 22. The retina events were fed through port ‘AER input’
and an in-FPGA splitter replicated them on three rows. Node
routers were programmed so that the retina events would be
copied horizontally from node to node inside each FPGA and
also from FPGA1 to FPGA2. To transfer the output events
produced by each node (or Gabor filter), the routers were also
programmed to copy all output events horizontally from node
to node and from FPGA1 to FPGA2. At the right end of all
three rows there was an in-FPGA merger block that merged
the three flows into a single output port, where an AER data
logger board [58] was used to capture and timestamp events.
The flow between the two FPGAs was fed through three Full-
Duplex Rocket-I/O-Based Parallel-Serial Interfaces.
To analyze the impact of event hopping from node to node
(either intra-FPGA or inter-FPGA) we programmed the same
Gabor filter into all nodes in the first row. The Gabor filter
output flow was routed to the west and north channels at each
node to observe the output and measure the latencies between
the first output node (1, 1) and the other nodes (j, 1) with
j = 2...6. These latencies could be measured by observing
the delays between request signals reqj (j = 2...6) and req1.
[Fig. 23 plot: latency (ns) versus node (2 to 6) for destination-driven and source-driven routing, with the intra-FPGA (FPGA1), inter-FPGA (serial link), and intra-FPGA (FPGA2) regions marked.]
Fig. 23. Latency between nodes for the experiment described in Fig. 22 for
the destination and source driven routing algorithms.
The measured delays of reqj with respect to req1 are
shown in Fig. 23, for destination and source driven algorithms.
Every in-FPGA network hop added a latency of 150ns for the
destination driven algorithm and 70ns for the source driven
algorithm. For the inter-FPGA hops an additional 350ns was
added to the routing delay, for both routing algorithms. Note
that here the source-driven case presents lower latency than
the destination-driven case. This is due to the massive
retina event cloning in the destination-driven case, where
each retina event has to be cloned for each destination Gabor
filter. Therefore, event traffic is much higher in the
destination-driven case for this particular arrangement.
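The per-hop figures quoted above can be combined into a simple additive model of the latency seen at each node of the first row. This is a sketch under stated assumptions, not a reproduction of the measured data: it assumes that every hop contributes the same quoted delay, that the single inter-FPGA hop lies between nodes 3 and 4 as suggested by Fig. 22, and that the quoted values are approximate, so the measured points in Fig. 23 need not fall exactly on these lines. All names are hypothetical.

```python
HOP_NS = {"destination-driven": 150, "source-driven": 70}  # per in-FPGA hop
INTER_FPGA_EXTRA_NS = 350   # added once, when crossing the serial link
FIRST_NODE_IN_FPGA2 = 4     # assumption: nodes 1-3 in FPGA1, 4-6 in FPGA2


def modeled_latency_ns(node, routing):
    """Additive estimate of the delay of req<node> with respect to req1."""
    hops = node - 1
    extra = INTER_FPGA_EXTRA_NS if node >= FIRST_NODE_IN_FPGA2 else 0
    return hops * HOP_NS[routing] + extra


if __name__ == "__main__":
    for routing in HOP_NS:
        estimates = [modeled_latency_ns(j, routing) for j in range(2, 7)]
        print(routing, estimates)
```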
E. Testing Single-FPGA Maximum Capacity
So far, in each FPGA we had programmed nine 64 × 64
pixel ConvModules to verify the operation of routers and
interfaces. In order to test the maximum capacity of one
Virtex-6 FPGA, we checked the maximum number of Gabor filters it
could hold, together with their routers, peripheral interfaces,
SPI configuration circuitry, and input splitter. We used the
destination-driven routers, as they are more efficient in terms
of FPGA resources. We were able to have the FPGA hold
a total of 64 Gabor filters (see footnote 4), each with 64 × 64 pixels and
kernels of size 11 × 11. The ConvModule array, together with
the 1x8 input splitter circuit, the configuration infrastructure
through the serial port and the output channels read-out circuitry
occupied 32720 slices, representing 86% of the Virtex-6
FPGA capacity. Internal memory occupation was 15% for the
36K RAM blocks and 18% for the 18K RAM blocks. We
programmed a Gabor filter array by sweeping four scales and
16 angles. Fig. 24 shows the collected output events for the 64
filters in the same 160ms time window, while the input DVS
retina was observing the same two persons walking, shown
in Fig. 20(a). Note that, in this case, one single FPGA was
emulating a system with $N_{neurons} = 64 \times 64 \times 64 = 2.62 \times 10^5$
neurons and $N_{synapses} = N_{neurons} \times 11 \times 11 = 3.17 \times 10^7$
synapses.
Footnote 4: The corresponding VHDL description is available upon request.
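The neuron and synapse counts, and the slice occupation, can be rechecked from the figures given in this section and in Table III. The variable names below are chosen for illustration only.

```python
MODULES = 64                   # Gabor filters in the FPGA
PIXELS_PER_MODULE = 64 * 64    # one neuron per pixel
KERNEL_WEIGHTS = 11 * 11       # synapses per neuron for an 11 x 11 kernel

n_neurons = MODULES * PIXELS_PER_MODULE   # 262144, about 2.62e5
n_synapses = n_neurons * KERNEL_WEIGHTS   # 31719424, about 3.17e7

SLICES_USED = 32720
SLICES_TOTAL = 37680           # "Total FPGA" column of Table III
slice_fraction = SLICES_USED / SLICES_TOTAL

print(n_neurons, n_synapses, f"{slice_fraction:.1%}")
```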
Since the network had a total of $N_l = 216$ inter-module
links, each with $E_{pp} = 27$Meps (as it is destination-driven),
the mesh could communicate a total of up to $N_l E_{pp} =
5.8$Geps of 32 bits each. However, this number needs to be
divided by the average number of hops per event $n_h$ and the
average module fan-out $F_{Mout}$, to determine the
effective (non-cloned) events traveling through the network.
Both numbers $n_h$ and $F_{Mout}$ are problem specific, and can
be optimized for each case. In this particular case average $n_h$
was around 4, while the average $F_{Mout}$ was about 3 (see footnote 5). For the
example in Section VII-F average $n_h$ is about 3 and average
$F_{Mout}$ about 2.
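The capacity argument can be written out as a short calculation. The sketch below uses only the numbers quoted in the text and in footnote 5. The effective rate it prints is an estimate derived here from the stated averages, not a figure reported in the paper, and the function names are hypothetical.

```python
def aggregate_capacity_eps(n_links, link_rate_eps):
    """Total events per second the mesh links can carry, clones included."""
    return n_links * link_rate_eps


def effective_capacity_eps(n_links, link_rate_eps, mean_hops, mean_fanout):
    """Non-cloned events per second, after dividing by hops and fan-out."""
    return aggregate_capacity_eps(n_links, link_rate_eps) / (mean_hops * mean_fanout)


def cloning_ratio_per_hop(n_filters=64):
    """Ratio E_max / (n_h * E_eff) from footnote 5, with the rate r cancelled."""
    cloned = 2 * n_filters + n_filters   # (2r x 64 + r x 64) / r
    uncloned = 2 + n_filters             # (2r + 64r) / r
    return cloned / uncloned


if __name__ == "__main__":
    total = aggregate_capacity_eps(216, 27e6)
    effective = effective_capacity_eps(216, 27e6, mean_hops=4, mean_fanout=3)
    print(f"aggregate: {total / 1e9:.2f} Geps")
    print(f"effective estimate: {effective / 1e6:.0f} Meps")
    print(f"fan-out factor: {cloning_ratio_per_hop():.2f}")
```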
F. Multi-Module Multi-Layer ConvNet Recognition Example
The previously described arrays of Gabor filters represent a
one-layer neural system, where all modules (filters) received
the same replica from the input sensor. The example illustrated
in this Section is a multi-layer Convolutional Neural Network
that performs a previously reported character recognition task
which has been verified using an AER event-driven simula-
tor with user-defined behaviorally-described event-processing
modules [59], [60]. It is loosely based on Fukushima’s neocog-
nitron [61] or Serre’s hierarchical network [62]. Here we
used the same 64 × 64 convolution module as above (a
modified version of the one in [51]) to assemble the 36-node
Convolution Neural Network shown in Fig. 25. For this, a 2D-
array of 6× 6 AER-nodes was synthesized in a single FPGA.
The heuristically chosen kernels [59], [60] are illustrated in
Fig. 26. Kernels k1 to k13 performed feature extraction for
the 1st layer. Kernels Ker1 to Ker8 were used for the 15
filters in the second layer. Convolution outputs were always
half-wave rectified (events were assigned a positive sign). The
layer 2 output virtual channels (labeled 19 to 41 in Fig. 25)
were fed to four modules labeled AGGRi in Fig. 25. These
were not ConvModules, but plain arrays of integrate-and-fire
neurons. Each AGGRi module included an AER-merger at
its input to merge the traffic from several virtual channels,
while forcing their sign bit. For example, module AGGRi
merged events from virtual channels {19, 20, 21, 23, 33} and
{25, 27, 32, 38, 40, 41}, while forcing a positive sign bit for
the first set and a negative sign bit for the second set. Finally,
the 4th layer performed 4 convolutions in parallel, all with the
same kernel KerC in Fig. 26.
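The behavior described for the AGGRi modules, merging several virtual channels while forcing the sign bit and then counting events per address, can be sketched as follows. This is a behavioral illustration only, not the hardware description: the class name, the threshold value, the reset-to-zero rule after firing, and the absence of a lower bound on the count are assumptions. The two channel sets are the ones given in the example above.

```python
from collections import defaultdict

POSITIVE_CHANNELS = {19, 20, 21, 23, 33}       # sign bit forced positive
NEGATIVE_CHANNELS = {25, 27, 32, 38, 40, 41}   # sign bit forced negative


class AggregatorSketch:
    """Array of integrate-and-fire counters fed by a sign-forcing merger."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.count = defaultdict(int)   # one counter per (x, y) address

    def process(self, channel, x, y):
        """Integrate one input event; return (x, y) if that neuron fires."""
        if channel in POSITIVE_CHANNELS:
            sign = 1
        elif channel in NEGATIVE_CHANNELS:
            sign = -1
        else:
            return None                 # channel not connected to this module
        self.count[(x, y)] += sign
        if self.count[(x, y)] >= self.threshold:
            self.count[(x, y)] = 0      # assumed reset after firing
            return (x, y)
        return None


if __name__ == "__main__":
    module = AggregatorSketch(threshold=3)
    for channel in (19, 25, 20, 21, 33):
        print(channel, module.process(channel, 5, 5))
```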
Footnote 5: Assuming a 2r retina event rate and an r average filter output event rate,
the total cloned event rate would be $E_{max} = n_h (2r \times 64 + r \times 64)$, while
the effective uncloned rate would be $E_{eff} = 2r + 64r$. This results in
$E_{max}/E_{eff} = F_{Mout} n_h = 2.91 n_h$.
[Fig. 24 image grid: 64 panels labeled by scale s = 1 to 4 and angle a = 1 to 16.]
Fig. 24. Output captured from the 64 Gabor filter array. Four scales (s = 1, ...4) and 16 angles (a = 1, ...16) were swept. Gray scale represents the number
of events integrated in every address position in a 160ms temporal window. Background gray is zero output, while bright pixels represent active pixels. The
rotated red bar in each subfigure indicates angle and scale (thickness) of the corresponding Gabor filter.
The system was stimulated with bursts of events representing
three different versions of letters A, C, H, and M. Bursts
had between 200 and 400 events and lasted from about 0.5 to
1ms. Fig. 27 shows the main timing properties in this set-up.
An input stimulus burst lasts for a time Tburst. At one of the
four output recognition channels (nodes ‘46’ to ‘49’ in Fig.
25) the first output event appears at time Tfirst and the output
burst lasts until time Tlast. Table IV summarizes the measured
timing results (Tburst, Tfirst, Tlast) and also the number of
events per output burst for each letter presentation. On average,
correct recognition output spikes (which start at time Tfirst)
appeared at about half-way through the input stimulus burst
(0.5 × Tburst) and lasted until shortly after the input stimulus
burst had finished.
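The statements about recognition and timing can be checked against the values reported in Table IV. The sketch below copies the table entries and takes the output channel with the most events as the recognized letter. That decision rule is an assumption made here for illustration, and the helper names are hypothetical.

```python
# Table IV: name -> (Tburst, Tfirst, Tlast in µs, events on outputs A, C, H, M)
TABLE_IV = {
    "A1": (897, 366, 941, (38, 2, 0, 0)),
    "A2": (777, 589, 812, (17, 0, 0, 0)),
    "A3": (867, 367, 899, (39, 1, 0, 0)),
    "C1": (596, 334, 619, (4, 65, 0, 0)),
    "C2": (656, 373, 690, (4, 87, 0, 0)),
    "C3": (476, 289, 495, (5, 26, 0, 0)),
    "H1": (897, 331, 928, (5, 0, 68, 8)),
    "H2": (777, 460, 776, (1, 0, 41, 1)),
    "H3": (746, 424, 778, (1, 0, 44, 1)),
    "M1": (927, 281, 885, (6, 0, 5, 47)),
    "M2": (897, 584, 892, (0, 0, 8, 21)),
    "M3": (837, 400, 786, (0, 0, 11, 28)),
}
LETTERS = "ACHM"


def recognized_letter(counts):
    """Letter whose output channel produced the most events (assumed rule)."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return LETTERS[best]


correct = sum(recognized_letter(counts) == name[0]
              for name, (_, _, _, counts) in TABLE_IV.items())
mean_first_fraction = sum(t_first / t_burst
                          for t_burst, t_first, _, _ in TABLE_IV.values()) / len(TABLE_IV)

print(f"{correct} of {len(TABLE_IV)} presentations recognized")
print(f"mean Tfirst/Tburst = {mean_first_fraction:.2f}")
```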
VIII. DISCUSSION
In this Section we will briefly illustrate the system level
analysis methodology presented in Section VI with the simple
example shown in Fig. 15. This example is not optimized to
achieve best performance, but is merely intended to show how
we can analyze the network using queuing theory and how
this can help us in making design decisions. Fig. 15(a) shows
the logical connections (virtual channels) between blocks
(nodes) and the event rate at each channel.
[Fig. 25 diagram: INPUT (channel 1) feeding a 1st layer of filters k1 to k13, a 2nd layer of filters Ker1 to Ker8, a 3rd layer of modules AGGRA, AGGRC, AGGRH, and AGGRM (each with + and − input channel sets), and a 4th layer of four C filters with output channels 46 to 49.]
Fig. 25. Logical network topology for the four letter recognition system.
Filters ki, Keri and KerC are kernels programmed in the event-driven
convolution modules of layers 1, 2, and 4, respectively. Modules AGGRi
are simple integrate-and-fire neurons which count events produced at every
address for any of their input channels, producing output events when the
count threshold is reached. Node numbers in the figure represent virtual AER
channels. Modules AGGRi include AER-mergers at their inputs which force
the sign bit of incoming events.
[Fig. 26 panels: Layer 1 kernels k1 to k13, Layer 2 kernels Ker1 to Ker8, and Layer 4 kernel KerC.]
Fig. 26. Kernels used in the letter recognition system through all the network
layers.
[Fig. 27 diagram: input channel burst of duration Tburst and output channel burst between Tfirst and Tlast, on a common time axis t starting at 0.]
Fig. 27. Timing diagram of the letter recognition process. Tburst is the
total duration of the input stream representing the input letter. The first output
event in the recognition channel appears at instant Tfirst, while the last one
is generated at Tlast.
TABLE IV
TEST RESULTS FOR THE LETTER RECOGNITION SYSTEM.
Input    Tburst   Tfirst   Tlast   # of       # of       # of       # of
Letter   (µs)     (µs)     (µs)    events A   events C   events H   events M
A1       897      366      941     38         2          0          0
A2       777      589      812     17         0          0          0
A3       867      367      899     39         1          0          0
C1       596      334      619     4          65         0          0
C2       656      373      690     4          87         0          0
C3       476      289      495     5          26         0          0
H1       897      331      928     5          0          68         8
H2       777      460      776     1          0          41         1
H3       746      424      778     1          0          44         1
M1       927      281      885     6          0          5          47
M2       897      584      892     0          0          8          21
M3       837      400      786     0          0          11         28
Average channel event rate is expressed in terms of a reference event rate
$E_\alpha$ (which has units of events per second) to study traffic
under different load conditions. Traffic information can be
obtained from behavioral simulations. The logical topology
can be mapped into the physical system as depicted in Fig.
15(b). The routing algorithm and tables determine the event
routes through the network and this information can be used
to estimate the total event rate through each physical channel.
Using this information, the forwarding $F_r$ and contention
$C_r$ matrices can be computed for each router using eqs.
(19) and (20). The model is fed with all these matrices and
with certain implementation parameters, such as the routers’
service time $t_r$ or the physical channel transmission times $t_{tx}$.
Knowing all these parameters enables us to solve the traffic
equation (eq. (21)) to estimate the mean number of packets
at each network queue $N_r$. This number provides information
about the network traffic distribution and makes it possible to
compute the queues’ mean waiting time $W_{rj}$. By applying eq.
(22) to every route it is possible to obtain the latency associated
with each route $L_{sd}$, and also the mean and worst case latency
for the whole network.
[Fig. 28 plots: average latency $L_{sd}$ (ns) versus reference rate $E_\alpha$ (eps, from $10^3$ to $10^6$) for destination-driven and source-driven routing, panels (a) and (b).]
Fig. 28. Mean network latency for different reference event rates $E_\alpha$ for
destination and source driven routing algorithms (a) for the example system
in Fig. 15 and (b) for a 3 × 3 array of Gabor filters. Service times were
$t_r = 50$ns for destination-driven routers and $t_r = 70$ns for source-driven
routers.
[Fig. 29 bar charts: number of events ($N_{rj}$) in the queues of routers RA, RD, RG, RB, RE, RH, RC, RF, and RI for (a) destination-driven and (b) source-driven routing.]
Fig. 29. Events in router queues for a reference event rate $E_\alpha = 9 \times 10^5$eps.
Vertical bars indicate number of events in the order north-east-south-west-local
interfaces for every router Ri.
Fig. 28(a) shows the resulting mean network latency versus
reference event rate $E_\alpha$ for the example in Fig. 15. For low
event rates, routers are fast enough to process events on the
fly and queues are empty most of the time. As a result, mean
latency is constant and depends only on the mean number
of hops and routing times. In this situation, and for this
particular example, the destination driven algorithm presents
lower mean latency than the source driven algorithm because
the routing algorithm is faster. Mean latency starts to increase
exponentially at a reference event rate of $E_\alpha \approx 8 \times 10^5$eps
because queues must handle more events and their waiting
times rise. For this particular system, the saturation point is
almost the same for both routing algorithms. However, Fig.
28(b) shows the same simulation for a 3 × 3 array of Gabor
filters where the input flow must be forwarded to all the nodes
in the network. In this case, the network saturates at $E_\alpha \approx
5.8 \times 10^5$eps for the destination driven algorithm, while for
the source driven algorithm it saturates at $E_\alpha \approx 7.4 \times 10^5$eps.
In this case, source driven routing is more efficient for very
high traffic. This is because in the destination driven routing
each input event to the network has to be cloned once per
destination module, while in the source driven routing each
node can perform local replication. In the source driven routing
the number of actual events traveling over the physical links
is therefore much lower for this particular application.
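The difference in link traffic between the two schemes can be illustrated by counting link traversals for one input event that must reach every node of a small mesh. This is a toy model, not the analysis of Section VI: it assumes that the event enters at a corner node, that destination-driven copies each follow a shortest (Manhattan) path from the entry node, and that source-driven routing replicates the event locally along a spanning tree. The function names are hypothetical.

```python
def destination_driven_link_traversals(width, height, entry=(0, 0)):
    """One copy per destination node, each travelling its own shortest path."""
    ex, ey = entry
    return sum(abs(x - ex) + abs(y - ey)
               for x in range(width) for y in range(height))


def source_driven_link_traversals(width, height):
    """Local replication: the event crosses each link of a spanning tree once."""
    return width * height - 1


if __name__ == "__main__":
    dd = destination_driven_link_traversals(3, 3)
    sd = source_driven_link_traversals(3, 3)
    print(f"3 x 3 mesh: destination-driven {dd}, source-driven {sd} link traversals")
```

Under these assumptions a single broadcast event crosses links 18 times with destination-driven routing and 8 times with source-driven routing in a 3 × 3 mesh, which is consistent in direction with the earlier saturation observed for the destination driven algorithm.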
For the example in Fig. 15, the number of events queued
in all routers when the network saturates at $E_\alpha = 9 \times 10^5$eps
can be obtained by solving eq. (21). Fig. 29 represents the 5
input interface queues for each router. This makes it possible to
find the network bottlenecks in each case. For the destination
driven system, the west interface of router A (RA in Fig.
29(a)) is the most loaded queue. As input events have to be
replicated to reach nodes A and D, the input event rate at
the RA west input interface is artificially increased, overloading
this channel. For the source driven algorithm represented in
Fig. 29(b), the west interfaces of routers RA and RD represent
the system bottleneck. Again, these two channels are the most
overloaded channels due to the multiplexing of several AER
streams.
[Fig. 30 plots: average latency $L_{sd}$ (ns) versus reference rate $E_\alpha$ (eps, $\times 10^5$), panels (a) to (d). Legends: (a) $t_r$ = 40, 50, 60, 80, 100 ns; (b) $t_{tx}$ = 0, 30, 70, 100 ns; (c) $t_r$ = 60, 70, 80, 90, 100 ns; (d) $t_{tx}$ = 0, 30, 70, 100 ns.]
Fig. 30. (a) Latency curves for the destination driven case for different
router service times $t_r$. (b) Latency curves for the destination driven case for
different transmission times $t_{tx}$. (c) Latency curves for the source driven case
for different router service times $t_r$. (d) Latency curves for the source driven
case for different transmission times $t_{tx}$.
Fig. 30(a) is an example of how the traffic analysis methodology
can be used to explore the network parameter design space.
The traffic simulations were repeated varying the routing time
$t_r$ for the destination driven system. For longer service times,
mean latency increases because each hop takes longer to route
events. Moreover, the network becomes saturated for lower
event rates if the routers’ service time $t_r$ is increased,
reducing the maximum achievable event rate. Fig. 30(b) performs
the same simulation, this time varying the transmission time
$t_{tx}$ through the channel. In this case, the saturation point remains
the same in all simulations, but overall latency increases for
low event rates. Figs. 30(c-d) show the same but for source
driven routing. As can be seen, the behavior is similar.
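The qualitative trends in Fig. 30 can be reproduced with a much simpler model than the one of Section VI. The sketch below replaces eqs. (19) to (22) with a textbook M/M/1 queue at every hop, so it is a stand-in and not the paper's method. It assumes that every hop sees the load of the busiest router, expressed as a multiple of $E_\alpha$. The hop count, the load factor, and the function names are hypothetical.

```python
import math


def mean_latency_ns(e_alpha_eps, t_r_ns, t_tx_ns, hops=3, load_factor=20.0):
    """Mean route latency (ns) with an M/M/1 router queue at each hop."""
    arrival_per_ns = load_factor * e_alpha_eps * 1e-9
    utilization = arrival_per_ns * t_r_ns
    if utilization >= 1.0:
        return math.inf                          # router queue saturates
    time_in_router = t_r_ns / (1.0 - utilization)  # M/M/1 mean time in system
    return hops * (time_in_router + t_tx_ns)


def saturation_rate_eps(t_r_ns, load_factor=20.0):
    """Reference rate at which the busiest router reaches full utilization."""
    return 1e9 / (load_factor * t_r_ns)


if __name__ == "__main__":
    for t_r in (40, 50, 100):
        print(f"t_r = {t_r} ns: saturation at {saturation_rate_eps(t_r):.2e} eps")
    for t_tx in (0, 30, 100):
        print(f"t_tx = {t_tx} ns: low-rate latency "
              f"{mean_latency_ns(1e3, 50, t_tx):.0f} ns")
```

In this simplified model the saturation rate depends only on the service time, while the transmission time adds a constant offset per hop, which matches the behavior described for Fig. 30.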
IX. CONCLUSIONS
We have presented a scalable method for assembling arbi-
trary modular AER neuromorphic systems, by arranging mod-
ules in a 2D mesh. Address events include a module label, and
modules include a simple router that routes events depending
on their module labels. The approach is generic for ASIC
and FPGA based hardware implementations, but was tested on
single and multiple Virtex-6 FPGAs. Experimental results are
provided for AER-based vision processing applications, such
as multiple Gabor filtering and character recognition based on
convolutional type neural networks.
The approach is scalable and robust to communication
bandwidth saturation. Analysis techniques to estimate resource
usage and traffic bottlenecks have been given, which also allow
the user to optimize mappings from logical descriptions to the
physical implementation. Two routing approaches have been
discussed, “destination-driven” and “source-driven”. Either
one can perform better depending on the traffic conditions
and the network connectivity. Latencies in event communications
have been measured to be in the microsecond
range or below. Implemented test network examples, based
on ConvNets, allow for neural systems with over 200,000
neurons, emulating almost 32 million synapses (using kernel-based
weight-sharing for RAM storage) in a single Virtex-6
FPGA. The mesh network uses 4 bi-directional links per
module and can communicate up to $5.8 \times 10^9$ events per second
(excluding module fan-out and event-cloning). This number
can be increased by also using diagonal links (resulting in up
to 8 links per module), or up to 26 links per module in a 3D
arrangement.
The presented approach is intended for the real-time pro-
cessing of visual sensory data provided by spiking event-driven
vision sensors, and does not include any on-line learning ca-
pability. Other researchers are developing hardware platforms
for emulating fine neural dynamics with synaptic learning
capabilities, aimed at studying complex brain functions. For
example, the FACETS and BrainScaleS projects (see the
summary flyer at [63]) are attempting to put 180,000 analog
neurons and 50 million synapses with an average firing rate of
$10^5$eps per neuron (e.g. with an acceleration factor of about
$10^4$ with respect to biological time) together on a wafer. The
SpiNNaker project is pursuing the development of an ARM-
core mesh of processors, using SpiNNaker-chips each with
about 20 ARM-cores [31]. A variety of neural and synaptic
models can be programmed capable of operating in biological
real time. Using realistic but low complexity neural models,
an ARM core can handle about 2000 neurons with about 1000
synapses each. Putting 18 SpiNNaker chips on a PCB would,
therefore, allow for a system with about 640,000 neurons and
640 million synapses. Other researchers are pursuing similar
goals using FPGA based implementations, capable of hosting
1 million neurons in one Virtex-6 FPGA [9].
The results presented in this paper using the Virtex-6
platform can be extrapolated to ASIC based designs by relying
on performance figures from already fabricated ConvModule
chips [13], [18], [22], [24]. For example, by re-using an earlier
0.35µm CMOS ConvChip pixel [24] in a 1cm$^2$ 40nm die, it
is realistic to consider achieving 1 million neurons per chip.
Using a 10 × 10 mesh of these chips would provide a $10^8$
neuron ConvNet, which is comparable (in terms of number of
neurons and synapses) to 1% of the human brain.
ACKNOWLEDGMENT
This work was supported in part by Andalusian grant TIC-
6091 (NANO-NEURO), and Spanish grants from the Ministry
of Economy and Competitivity (former Ministry of Science
and Innovation) TEC2009-106039-C04-01/02 (VULCANO)
(with support from the European Regional Development Fund)
and PRI-PIMCHI-2011-0768 (PNEUMA) coordinated with
the European CHIST-ERA program. CZR was supported by
an FPU scholarship. The authors would like to acknowledge
the highly constructive feedback received from the reviewers,
which helped to greatly improve the quality of the paper.
REFERENCES
[1] M. Sivilotti, “Wiring considerations in analog VLSI systems with
application to field-programmable networks,” PhD, Computation and
Neural Systems, Caltech, Pasadena California, 1991.
[2] M. Mahowald, “VLSI Analogs of Neuronal Visual Processing: a Syn-
thesis of Form and Function,” PhD, Computation and Neural Systems,
Caltech, Pasadena, California, 1992.
[3] A. Mortara, E. A. Vittoz, and P. Venier, “A Communication Scheme
for Analog VLSI Perceptive Systems,” IEEE J. of Solid-State Circuits,
vol. 30, no. 6, pp. 660–669, June 1995.
[4] K. Boahen, “Point-to-Point Connectivity Between Neuromorphic Chips
Using Address Events,” IEEE Trans. on Circ. and Syst. Part-II, vol. 47,
no. 5, pp. 416–434, May 2000.
[5] ——, “A Burst-Mode Word-Serial Address-Event Link-I,II,III,” IEEE
Trans. on Circ. and Syst. Part-II, vol. 51, no. 7, pp. 1269–1300, July
2004.
[6] E. Chicca, A. M. Whatley, P. Lichtsteiner, V. Dante, T. Delbruck,
P. D. Giudice, R. J. Douglas, and G. Indiveri, “A Multichip Pulse-
Based Neuromorphic Infrastructure and Its Application to a Model of
Orientation Selectivity,” IEEE Trans. Circ. Syst. Part I, vol. 54, no. 5,
pp. 981–993, May 2007.
[7] R. Vogelstein, U. Mallik, J. Vogelstein, and G. Cauwenberghs, “Dy-
namically Reconfigurable Silicon Array of Spiking Neurons with
Conductance-Based Synapses,” IEEE Trans. Neural Networks, vol. 18,
no. 1, pp. 253–265, Jan. 2007.
[8] C. Mayr, H. Eisenreich, S. Henker, and R. Schüffny, “Pulsed Multi-Layered
Image Filtering: A VLSI Implementation,” International Journal
of Applied Mathematics and Computer Sciences, vol. 1, pp. 60–65,
2005.
[9] A. Cassidy, A. Andreou, and J. Georgiou, “Design of a one million
neuron single fpga neuromorphic system for real-time multimodal scene
analysis,” 45th Annual Conf. on Inf. Sciences and Systems (CISS), pp.
1–6, 23-25 March 2011.
[10] Y. Wang and S.-C. Liu, “Programmable Synaptic Weights for an aVLSI
Network of Spiking Neurons,” Proc. of the 2007 IEEE Int. Symp. on
Circ. and Syst. (ISCAS), pp. 4531–4534, May 2006.
[11] P. Venier, A. Mortara, X. Arreguit, and E. A. Vittoz, “An Integrated
Cortical Layer for Orientation Enhancement,” IEEE J. Solid-State Circ.,
vol. 32, no. 2, pp. 177–186, Feb. 1997.
[12] T. Y. W. Choi, P. Merolla, J. Arthur, K. Boahen, and B. E. Shi,
“Neuromorphic Implementation of Orientation Hypercolumns,” IEEE
Trans. on Circ. and Systems (Part I), vol. 52, no. 6, pp. 1049–1060,
June 2005.
[13] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jiménez,
and B. Linares-Barranco, “A Neuromorphic Cortical-Layer Microchip
for Spike-Based Event Processing Vision Systems,” IEEE Trans. Circ.
and Syst., Part-I, vol. 53, no. 12, pp. 2548–2566, Dec. 2006.
[14] V. Chan, C. Jin, and A. van Schaik, “An address-event vision sensor for
multiple transient object detection,” IEEE Trans. on Biomedical Circuits
and Systems, pp. 278–288, Dec. 2007.
[15] Z. Fu, T. Delbrück, P. Lichtsteiner, and E. Culurciello, “An address-event
fall detector for assisted living applications,” IEEE Trans. on Biomedical
Circuits and Systems, pp. 88–96, June 2008.
[16] T. J. Hamilton, C. Jin, A. van Schaik, and J. Tapson, “An active 2-d
silicon cochlea,” IEEE Trans. on Biomedical Circuits and Systems, pp.
30–43, March 2008.
[17] S. A. Bamford, A. F. Murray, and D. J. Willshaw, “Spike-timing-
dependent plasticity with weight dependence evoked from physical
constraints,” IEEE Trans. on Biomedical Circuits and Systems, in Press.
[18] R. Serrano-Gotarredona, T. Serrano-Gotarredona, A. Acosta-Jiménez,
C. Serrano-Gotarredona, J. Pérez-Carrasco, B. Linares-Barranco,
A. Linares-Barranco, G. Jiménez-Moreno, and A. Civit-Ballcels, “On
Real-Time AER 2D Convolutions Hardware for Neuromorphic Spike
Based Cortical Processing,” IEEE Trans. on Neural Networks, vol. 19,
no. 7, pp. 1196–1219, July 2008.
[19] B. Wen and K. Boahen, “A Silicon Cochlea with Active Coupling,”
IEEE Trans. on Biomedical Circuits and Systems, pp. 444–455, Dec.
2009.
[20] S. Mitra, S. Fusi, and G. Indiveri, “Real-Time Classification of Complex
Patterns Using Spike-Based Learning in Neuromorphic VLSI,” IEEE
Trans. on Biomedical Circuits and Systems, pp. 32–42, Feb. 2009.
[21] R. Mill, S. Sheik, G. Indiveri, and S. L. Denham, “A Model of Stimulus-
Specific Adaptation in Neuromorphic Analog VLSI,” IEEE Trans. on
Biomedical Circuits and Systems, pp. 413–419, Oct. 2011.
[22] L. Camuñas-Mesa, A. Acosta-Jiménez, C. Zamarreño-Ramos,
T. Serrano-Gotarredona, and B. Linares-Barranco, “A 32 × 32
Convolution Processor Chip for Address Event Vision Sensors with
155ns Event Latency and 20Meps Throughput,” IEEE Trans. Circ. and
Syst., Part-I, vol. 58, no. 4, pp. 777–790, April 2011.
[23] D. G. Chen, D. Matolin, A. Bermak, and C. Posch, “Pulse-Modulation
Imaging – Review and Performance Analysis,” IEEE Trans. on Biomed-
ical Circuits and Systems, pp. 64–82, Feb. 2011.
[24] L. Camuñas-Mesa, C. Zamarreño-Ramos, A. Linares-Barranco,
A. Acosta-Jiménez, T. Serrano-Gotarredona, and B. Linares-Barranco,
“An Event-Driven Multi-Kernel Convolution Processor Module for
Event-Driven Vision Sensors,” IEEE J. of Solid-State Circ., Feb. 2012.
[25] J. Lin, P. Merolla, J. Arthur, and K. Boahen, “Programmable Connec-
tions in Neuromorphic Grids,” Proc. Int. Midwest Symp. on Circ. and
Syst. (MWSCAS), pp. 80–84, Aug. 2006.
[26] P. Merolla, J. Arthur, B. Shi, and K. Boahen, “Expandable Networks for
Neuromorphic Chips,” IEEE Trans. on Circ. and Syst., Part-I, vol. 54,
no. 2, pp. 301–311, Feb. 2007.
[27] S. A. Bamford, A. F. Murray, and D. J. Willshaw, “Large Developing
Receptive Fields Using a Distributed and Locally Reprogrammable
Address-Event Receiver,” IEEE Trans. Neural Networks, vol. 21, no. 2,
pp. 286–304, Feb. 2010.
[28] R. Serrano-Gotarredona, M. Oster, P. Lichtsteiner, A. Linares-Barranco,
R. Paz-Vicente, F. Gómez-Rodríguez, L. Camuñas-Mesa, R. Berner,
M. Rivas-Perez, T. Delbrück, S.-C. Liu, R. Douglas, P. Häfliger,
G. Jiménez-Moreno, A. Ballcels, T. Serrano-Gotarredona,
A. Acosta-Jiménez, and B. Linares-Barranco, “CAVIAR: A 45k Neuron, 5M
Synapse, 12G Connects/s AER Hardware Sensory-Processing-Learning-
Actuating System for High-Speed Visual Object Recognition and Track-
ing,” IEEE Trans. Neural Networks, vol. 20, no. 9, pp. 1417–1438, Sept.
2009.
[29] S. Joshi, S. Deiss, M. Arnold, Y. J. Park, and G. Cauwenberghs,
“Scalable Event Routing in Hierarchical Neural Array Architecture
with Global Synaptic Connectivity,” 12th Int. Workshop on Cellular
Nanoscale Networks and Their Applications (CNNA), Feb. 2010.
[30] M. Khan, D. Lester, L. Plana, A. Rast, X. Jin, E. Painkras, and S. Furber,
“SpiNNaker: Mapping Neural Networks onto a Massively-Parallel Chip
Multiprocessor,” IEEE Int. Joint Conf. on Neural Networks (IJCNN), pp.
2849–2856, June 2008.
[31] A. D. Rast, F. Galluppi, X. Jin, E. Painkras, and S. B. Furber, “The Leaky
Integrate-and-Fire Neuron: A Platform for Synaptic Model Exploration
on the SpiNNaker Chip,” IEEE Int. Joint Conf. on Neural Networks
(IJCNN), pp. 1–8, June 2010.
[32] J. Fieres, J. Schemmel, and K. Meier, “Realizing Biological Spiking
Network Models in a Configurable Wafer-Scale Hardware System,”
IEEE Int. Joint Conf. on Neural Networks (IJCNN), pp. 969–976, June
2008.
[33] S. Scholze, S. Schiefer, J. Partzsch, S. Hartmann, C. G. Mayr,
S. Höppner, H. Eisenreich, S. Henker, B. Vogginger, and R. Schüffny,
“VLSI Implementation of a 2.8 Gevent/s Packet Based AER Interface
with Routing and Event Sorting Functionality,” Frontiers in
Neuroscience (Frontiers in Neuromorphic Engineering), vol. 5, no. 117,
pp. 1–13, 2011. doi:10.3389/fnins.2011.00117.
[34] A. Cassidy, T. Murray, A. Andreou, and J. Georgiou, “Evaluating
On-Chip Interconnects for Low Operating Frequency Silicon Neuron
Arrays,” IEEE Int. Symp. on Circ. and Syst. (ISCAS), pp. 2437–2440,
May 2011.
[35] D. Gross and C. M. Harris, “Fundamentals of queueing theory,” 3rd ed.
John Wiley & Sons, Inc, 1998.
[36] C. Zamarreño-Ramos, T. Serrano-Gotarredona, and B. Linares-Barranco,
“An Instant-Startup Jitter-Tolerant Manchester-Encoding
Serializer/Deserializer Scheme for Event-Driven Bit-Serial LVDS Inter-Chip
AER Links,” IEEE Trans. on Circ. and Syst., Part I, in Press.
[37] H. Berge and P. Häfliger, “High-Speed Serial AER on FPGA,” Proc.
IEEE Int. Symp. on Circ. and Syst. (ISCAS), pp. 857–860, May 2007.
[38] D. B. Fasnacht, A. M. Whatley, and G. Indiveri, “A Serial Communica-
tion Infrastructure for Multi-Chip Address Event Systems,” Proc. IEEE
Int. Symp. on Circ. and Syst. (ISCAS), pp. 648–651, May 2008.
[39] L. Benini and G. D. Micheli, “Networks on Chips: A New SoC
Paradigm,” IEEE Comput., vol. 35, no. 1, pp. 70–78, Jan. 2002.
[40] P. Salihundam, S. Jain, T. Jacob, S. Kumar, V. Erraguntla, Y. Hoskote,
S. Vangal, G. Ruhl, and N. Borkar, “A 2 Tb/s 6×4 Mesh Network for
a Single-Chip Cloud Computer With DVFS in 45 nm CMOS,” IEEE J.
of Solid-State Circuits, vol. 46, no. 4, pp. 757–766, April 2011.
[41] J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erra-
guntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen,
S. Steibl, S. Borkar, V. De, and R. V. D. Wijngaart, “A 48-Core IA-32
Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS
for Performance and Power Scaling,” IEEE J. of Solid-State Circuits,
vol. 46, no. 1, pp. 173–183, Jan. 2011.
[42] S. Sarkar, G. Kulkarni, P. Pande, and A. Kalyanaraman, “Network-on-
Chip Hardware Accelerators for Biological Sequence Alignment,” IEEE
Trans. on Computers, vol. 59, no. 1, pp. 29–41, Jan. 2010.
[43] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
Learning Applied to Document Recognition,” Proc. of the IEEE, vol. 86,
no. 11, pp. 2278–2324, Nov. 1998.
[44] Y. LeCun and Y. Bengio, “Convolutional Networks for Images, Speech,
and Time Series,” in The Handbook of Brain Science and Neural
Networks, M. Arbib, Ed. Cambridge, MA: MIT Press, 1995, pp. 255–
258.
[45] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard,
W. Hubbard, and L. D. Jackel, “Backpropagation Applied to Handwritten
Zip Code Recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551,
1989.
[46] R. Vaillant, C. Monrocq, and Y. LeCun, “Original Approach for the
Localisation of Objects in Images,” IEE Proc on Vision, Image, and
Signal Processing, vol. 141, no. 4, pp. 245–250, August 1994.
[47] M. Osadchy, Y. LeCun, and M. Miller, “Synergistic Face Detection
and Pose Estimation with Energy-Based Models,” Journal of Machine
Learning Research, vol. 8, pp. 1197–1215, May 2007.
[48] S. Murali, Designing Reliable and Efficient Networks on Chips.
Springer, 2009.
[49] C. L. Seitz, W. C. Athas, C. M. Flaig, A. J. Martin, J. Seizovic,
C. S. Steele, and W.-K. Su, “The architecture and programming of the
Ametek Series 2010 multicomputer,” in Proc. 3rd Conf. on Hypercube
Concurrent Computers and Applications, Volume I, (Pasadena, Calif.,
Jan. 19-20). ACM, New York, pp. 33-36, 1988.
[50] Intel Corporation, A Touchstone DELTA System Description, 1991.
[51] A. Linares-Barranco, R. Paz-Vicente, F. Gómez-Rodríguez, A. Jiménez,
M. R. A., G. Jiménez, and A. Civit, “On the AER Convolution
Processors for FPGA,” Proc. IEEE Int. Symp. on Circ. and Syst. (ISCAS),
pp. 4237–4240, May 2010.
[52] U. Ogras, P. Bogdan, and R. Marculescu, “An Analytical Approach for
Network-on-Chip Performance Analysis,” IEEE Trans. on Comp.-Aided
Design of Int. Circ. and Syst., vol. 29, no. 12, pp. 2001–2013, Dec.
2010.
[53] J. Hu, U. Y. Ogras, and R. Marculescu, “System-Level Buffer Allocation
for Application-Specific Networks-on-Chip Router Design,” IEEE Trans.
on Comp.-Aided Design of Int. Circ. and Syst., vol. 25, no. 12, pp.
2919–2933, Dec. 2006.
[54] A. Cassidy and A. Andreou, “Beyond Amdahl’s Law: An Objective
Function That Links Multiprocessor Performance Gains to Delay and
Energy,” IEEE Transactions on Computers, 2011.
[55] F. Hillier and G. Lieberman, “Introduction to Operations Research,” 6th
Ed., New York, McGraw-Hill, pp. 631–732, 1995.
[56] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120dB 30mW
Asynchronous Vision Sensor that Responds to Relative Intensity
Change,” IEEE J. Solid-State Circuits, vol. 43, pp. 566–576, Feb. 2008.
[57] J. A. Leñero-Bardallo, T. Serrano-Gotarredona, and B. Linares-Barranco,
“A 3.6µs Asynchronous Frame-Free Event-Driven Dynamic-Vision-
Sensor,” IEEE J. Solid-State Circuits, vol. 46, no. 6, pp. 1443–1455,
June 2011.
[58] F. Gómez-Rodríguez, R. Paz-Vicente, A. Linares-Barranco, M. Rivas,
L. Miró, S. Vicente, G. Jiménez, and A. Civit, “AER Tools for
Communications and Debugging,” Proc. IEEE Int. Symp. on Circ. and
Syst. (ISCAS), pp. 3253–3256, May 2006.
[59] J. Pérez-Carrasco, T. Serrano-Gotarredona, C. Serrano, B. Acha, and
B. Linares-Barranco, “High-speed character recognition system based
on a complex hierarchical AER architecture,” Proc. IEEE Int. Symp. on
Circ. and Syst. (ISCAS), pp. 2150–2153, May 2008.
[60] J. Pérez-Carrasco, “Simulation tool for building and analyzing complex
and hierarchically structured AER-based visual processing systems,” Ph.D.
dissertation, Univ. of Sevilla, Spain, 2011.
[61] K. Fukushima and N. Wake, “Handwritten Alphanumeric Character
Recognition by the Neocognitron,” IEEE Trans. on Neural Networks,
vol. 2, no. 3, pp. 355–365, May 1991.
[62] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, “Robust
Object Recognition with Cortex-Like Mechanisms,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411–426,
March 2007.
[63] “www.facets-project.org.”
Carlos Zamarreño-Ramos received his B.S. degree
in Telecommunications Engineering in 2007,
his M.Sc. degree in Microelectronics in 2009, and
his PhD degree in December 2011 from the Uni-
versity of Seville, Sevilla, Spain. From 2007 until
December 2011 he was a PhD student at the Instituto
de Microelectrónica de Sevilla (IMSE-CNM-CSIC),
Sevilla, Spain. During June-July 2010, he was with
the Department of Electrical and Computer Engi-
neering, Texas A&M University, College Station,
as a Visiting Scholar. His research interests include
very large scale integration (VLSI) circuit design applied to bio-inspired
circuits and systems, bio-inspired signal processing, high-speed serial links,
modular assembly of reconfigurable AER (Address Event Representation)
processing systems, hardware implementations and simulation of spiking
neural networks, implementations of event-driven AER vision processing
systems, and power management. Since February 2012 he has held the position
of Junior Analog Designer at Dialog Semiconductor GmbH, Germering,
Germany.
Alejandro Linares-Barranco received the B.S. de-
gree in computer engineering, the M.S. degree in
industrial computer science, and the Ph.D. degree
in computer science (specializing in computer in-
terfaces for bio-inspired systems) from the Uni-
versity of Sevilla, Sevilla, Spain, in 1998, 2002,
and 2003, respectively. From January 1998 to June
1998, he was Second Lieutenant in the Spanish
Air Force working as System Administrator and
Software Developer. From 1998 to 2000, he was
a Member of the Technical Staff at the Sevilla
Microelectronics Institute (IMSE-CNM-CSIC). From 2000 to 2001, he was
a Development Engineer with the Research and Development Department,
at SAINCO Company, Sevilla, working on VHDL-based field-programmable
gate array (FPGA) systems for the INSONET European project on power
line communications. From 2001 to 2006, he was an Assistant Professor at
the Computer Architecture and Technology Department of the University of
Sevilla, Sevilla, Spain. In 2006 he was promoted to Associate Professor. His
Lab (Robotics and Computers Technology) developed a set of AER-tools for
debugging and connecting AER systems under the EU project CAVIAR. His
research interests include VLSI and FPGA digital design, neuro-inspired chip-
to-chip and chip-to-computer interfaces, spike based processing, motor control
and vision for FPGAs, wireless sensor networks and embedded applications
based on microcontrollers, bus emulation, and computer architectures. Since
2009 he has been a Review Committee Member of ISCAS. In 2010 he became
a member of the Technical Committee on Neural Systems and Applications
(NSATC) of the IEEE Circuits and Systems Society. In 2011 he became
Secretary of the NSATC.
Teresa Serrano-Gotarredona received the B.S. degree
in Electronic Physics and the Ph.D. degree
in VLSI neural categorizers from the University
of Sevilla, Sevilla, Spain, in 1992 and 1996,
respectively, and the M.S. degree in Electrical and
Computer Engineering from The Johns Hopkins
University, Baltimore, MD, in 1997. She was an
Assistant Professor in the Electronics and Electro-
magnetism Department, University of Sevilla from
1998 until September 2000. Since September 2000,
she has been a Tenured Scientist at the Sevilla
Microelectronics Institute (IMSE-CNM-CSIC), Sevilla, Spain, and in July
2008 she was promoted to Tenured Researcher. Since January 2006, she
has also been a part-time Professor with the University of Sevilla. She was on a
sabbatical stay at the Electrical Engineering Dept. of Texas A&M University
during Spring 2002. Her research interests include analog circuit design of
linear and nonlinear circuits, VLSI neural based pattern recognition systems,
VLSI implementations of neural computing and sensory systems, transistor
parameters mismatch characterization, Address-Event-Representation (AER)
VLSI, RF circuit design, nanoscale memristor-type AER, and Real-Time
Vision Sensing and Processing Chips. She is co-author of the book Adaptive
Resonance Theory Microchips (Kluwer 1998).
Dr. Serrano-Gotarredona was corecipient of the 1997 IEEE Transactions
on VLSI Systems Best Paper Award for the paper “A Real-Time Clustering
Microchip Neural Engine”. She was also a corecipient of the 2000 IEEE
Circuit and Systems-Part I Darlington Award for the paper “A General
Translinear Principle for Subthreshold MOS Transistors”. She is presently
Secretary of the IEEE CAS Society Sensory Systems Technical Committee,
Academic Editor of the PLoS ONE Journal, and Associate Editor for IEEE
Trans. on Circuits and Systems Part I.
Bernabé Linares-Barranco (F’10) received the
B.S. degree in Electronic Physics, the M.S. degree in
Microelectronics, and a first Ph.D. degree in
high-frequency OTA-C oscillator design
from the University of Sevilla, Sevilla, Spain, in
1986, 1987, and 1990, respectively, and a second
Ph.D. degree in analog neural network design from
Texas A&M University, College Station, USA, in
1991. Since September 1991, he has been a Tenured
Scientist at the Sevilla Microelectronics Institute
(IMSE-CNM-CSIC), which is one of the institutes
of the “National Microelectronics Center” (CNM) of the “Spanish Research
Council” (CSIC) of Spain. In January 2003, he was promoted to Tenured
Researcher and, in January 2004, to Full Professor of Research. Since March
2004, he has also been a part-time Professor with the University of Sevilla.
From September 1996 to August 1997, he was on a sabbatical stay at the
Department of Electrical and Computer Engineering, Johns Hopkins University,
Baltimore, MD, as a Postdoctoral Fellow. During Spring 2002, he was a
Visiting Associate Professor at the Electrical Engineering Department, Texas
A&M University. He is co-author of the book “Adaptive Resonance Theory
Microchips” (Kluwer, 1998). He was also the coordinator of the EU-funded
CAVIAR project. He has been involved with circuit design for telecommunica-
tion circuits, VLSI emulators of biological neurons, VLSI neural based pattern
recognition systems, hearing aids, precision circuit design for instrumentation
equipment, bio-inspired VLSI vision processing systems, transistor parameters
mismatch characterization, Address-Event-Representation VLSI, RF circuit
design, real-time AER vision sensing and processing chips, memristor circuits,
and extending AER to the nanoscale.
Dr. Linares-Barranco was corecipient of the 1997 IEEE Transactions on
VLSI Systems Best Paper Award for the paper “A Real-Time Clustering
Microchip Neural Engine”, and the 2000 IEEE CAS Darlington Award for the
paper “A General Translinear Principle for Subthreshold MOS Transistors”.
He has organized several Special Sessions and post-conference Workshops for
ISCAS and NIPS. From July 1997 to July 1999, he was an Associate Editor
of the IEEE Transactions on Circuits and Systems-Part II, and from January
1998 to December 2009 he was an Associate Editor for IEEE Transactions
on Neural Networks. He has been an Associate Editor of Frontiers in Neuromorphic
Engineering since May 2010. He was the Chief Guest Editor of the 2003
IEEE Transactions on Neural Networks Special Issue on Neural Hardware
Implementations. From June 2009 until May 2011 he was the Chair of the
Sensory Systems Technical Committee of the IEEE CAS Society. In March
2011 he became Chair of the IEEE CAS Society Spain Chapter. He is an
IEEE Fellow.
