A burst-mode word-serial address-event link--III: analysis and test results by Boahen, Kwabena A
University of Pennsylvania
ScholarlyCommons
Departmental Papers (BE) Department of Bioengineering
July 2004
A burst-mode word-serial address-event link--III:
analysis and test results
Kwabena A. Boahen
University of Pennsylvania, boahen@seas.upenn.edu
Copyright 2004 IEEE. Reprinted from IEEE Transactions on Circuits and Systems--I: Regular Papers, Volume 51, Issue 7, July 2004, pages 1292-1300.
Abstract
We present results for a scalable multiple-access inter-chip link that communicates binary activity between
two-dimensional arrays fabricated in deep submicrometer CMOS. Capacity scales with integration density
because an entire row is read and written in parallel. Row activity is encoded in a burst: The row address
followed by a column address for each active cell. We predict the distribution of burst lengths when
transmission is initiated by active cells and access is arbitered using a two-level queuing model. Agreement
with the experiment is excellent for loads over 50% but not for lighter loads, where our assumption that
service time is exponentially distributed breaks down. We also quantify the throughput–latency tradeoff. The
price of an n-fold increase in throughput is an n per Ncol timing error in a cell’s inter-event interval, where Ncol
is the number of cells per row. Links implemented in 0.6, 0.4, and 0.25 micrometer are compared; the highest
burst-rate achieved was 27.8 M events/s.
Keywords
asynchronous logic synthesis, event-driven communication, fair arbiter design, neuromorphic systems,
parallel readout, pixel-level quantization
1292 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 7, JULY 2004
I. PERFORMANCE CRITERIA
All-or-none voltage transitions in two-dimensional (2-D) arrays may be communicated to other arrays using
event-driven multiplexer-demultiplexer links. Event-driven
access, whereby the transition, or event, is transmitted as soon
as it occurs, has clear advantages over clock-driven access,
whereby each cell is polled regularly. In particular, perfor-
mance is better when activity is sparse (e.g., spatial or temporal
filtering occurs) and timing is critical (e.g., time encodes analog
information). Consequently, event-driven communication links
have been explored for silicon retina [1]–[4] and silicon cochlea
[5] chips. Interest in these links is increasing, driven by the
growing trend of quantizing signals inside the array (e.g., active
pixel sensors [6]–[8] and pulse-coded neural networks [9]).
Performance of event-driven communication links, or any
system that communicates all-or-none signals for that matter,
is measured by capacity, throughput, and latency [9]. Capacity
is defined as the reciprocal of the minimum transmission
time; this is the maximum rate at which events can be read,
multiplexed, demultiplexed, and written. Throughput is
defined as the usable fraction of capacity; the maximum rate is
rarely sustainable in practice. Latency is defined as the mean
delay; this wait time may be several transmission slots. Latency
depends on the fraction of transmission slots that are filled; this
fraction of the link capacity that is actually being used is called
the load.
Manuscript received January 3, 2002; revised November 2002. This work
was supported in part by the Whitaker Foundation and in part by the National
Science Foundation’s LIS/KDI and CAREER programs under Grant ECS98-74463
and Grant ECS00-93851. This paper was recommended by Associate
Editor G. Cauwenberghs.
The author is with the Department of Bioengineering, University of Pennsylvania,
Philadelphia, PA 19104-6392 USA (e-mail: boahen@seas.upenn.edu).
Digital Object Identifier 10.1109/TCSI.2004.830701
Link designers strive not only to maximize throughput, but to
minimize latency as well. High throughput allows large numbers
of event generators operating over a broad range of rates to be
serviced, whereas low latency preserves the timing of each individual
event, capturing information over and beyond that carried
by the generator’s mean event rate [10], [11]. Throughput
is optimized if collisions are prevented through arbitration; it is
then limited only by the increase in latency with activity due to
queuing [12]–[14]. To achieve a specified timing error, defined
as the percentage error in a cell’s interevent interval, throughput
must be capped at a level somewhat less than 100% of the link’s
capacity.
We have recently developed an event-driven link that boosts
capacity by reading (and writing) the state of an entire row of
cells in parallel, as shown in Fig. 1 [15], [16]. Communication
speed is not compromised when we convert from parallel to se-
rial (and back) because we use devices much larger than those
in the array. As capacity is boosted without sizing up devices
inside the array, our design can better exploit the high integration
densities that deep submicrometer processes offer. However, the
throughput attainable depends on the load, as the number of active
cells read from (or written to) a row increases with the average
event rate. This number goes up both because the probability
per unit time of an event occurring is higher and because
more time is spent waiting when there are more requests.
In this paper, we analyze the tradeoff between latency and
throughput in the parallel-read-write link and validate our an-
alytical results using measurements from fabricated chips. We
judge our design’s success by comparing it to previous event-
driven link designs, which read (and write) events from (and to)
the array serially (reviewed in [14]). A mismatch in the level at
which queuing occurs (entire rows versus single cells) leads to
rather different tradeoffs. In particular, we address the questions:
How much can you increase throughput by going parallel? What
price do you pay in latency? Given your timing-error specification,
is it worth it?
The paper is divided into six sections. In Section II, we identify
the timing parameters that determine the parallel-read-write
link’s performance and state the assumptions that underlie our
model. In Section III, we analyze the link’s ability to increase
its throughput with demand and quantify the longer latency
incurred in doing so. In Section IV, we present capacity measurements
for links fabricated in 0.6-, 0.4-, and 0.25-µm CMOS
technology and compare their signaling rates. In Section V, we
present experimental measurements of the mean number of
events read in parallel (i.e., burst length) at various load levels
and compare them with our theoretical predictions. Section VI
concludes the paper. The parallel-read-write link’s transmitter
and receiver designs are presented in companion papers [15],
[16].
Fig. 1. Link architecture. Transmitter: An interface circuit (H) relays requests to the row arbiter (A) and permits that row to output its address and its events (S)
when the arbiter acknowledges. The events’ column addresses are generated similarly, after latching the row’s state (L). A two-way multiplexer (T) sequentially
outputs row (Y) and column (X) addresses using separate request lines (Ry, Rx); they share a single acknowledge line (Ack). Meanwhile, a controller (C) cycles
the array to another row. Receiver: A two-way demultiplexer (U) directs row addresses to one latch (D) and column addresses to another (E). When the burst ends,
the row’s address and its decoded column addresses (P) are written to a second set of latches (E’ and M). As the row address is decoded and the column data is
written to that row (R), the next burst is received and its column addresses are decoded.
Fig. 2. Throughput on demand. (a) Number of active cells in a selected row
increases with load. Boxes represent a row with eight cells; active cells are gray.
The three rows are for three different loads. (b) Throughput increases with load
(unity-slope line) as an increasing number of active cells are read in parallel.
Boxes represent how long it takes to service each active cell. Again, the three
rows are for three different loads.
II. TIMING PARAMETERS
The parallel-read-write link’s performance is determined by
two timing parameters, $T_{\mathrm{cyc}}$ and $T_{\mathrm{brst}}$, as illustrated in Fig. 2.
$T_{\mathrm{cyc}}$ is the time it takes to read (or write) an event from (or to)
the array. We call $1/T_{\mathrm{cyc}}$ the cycle rate, because this is the rate
at which we cycle between rows. $T_{\mathrm{brst}}$ is the time it takes to read
(or write) an event from (or to) a latch on the array periphery. We
call $1/T_{\mathrm{brst}}$ the burst rate, because this is the rate at which we
transmit multiple events read from the same row. We presume
that the transmitter and the receiver have the same values for
$T_{\mathrm{cyc}}$ and $T_{\mathrm{brst}}$, and that they are pipelined such that the next
event can be read while the previous one is being written. Both
of these requirements are met by the architectures we proposed
previously (see Fig. 1).
Users are interested not in how much you boost the capacity
of the link but in how much you increase its absolute
throughput. The parallel-read-write link’s capacity is equal
to the burst rate, as this is the maximum rate at which it
transmits events, whereas the serial-read-write link’s capacity
is equal to the cycle rate, as this is the maximum rate at which
events can be read from (or written to) the array individually.
Therefore, the parallel link has
$$b \equiv T_{\mathrm{cyc}}/T_{\mathrm{brst}} - 1$$
times more capacity than the
serial link. The question is: Does this $b$-fold boost in capacity
translate to a $b$-fold increase in absolute throughput? We relate
the throughput gain $r$ to the boost factor $b$ in Section III by
analyzing queuing in the parallel link. For this analysis, we
make the simplifying assumption that event generation and
event service are Poisson.
Event generation and event service may be assumed to
be Poisson under most—but not all—conditions of interest.
For event generation, the independent behavior that underlies
Poisson statistics certainly applies to the array’s cells when they
are not responding to a stimulus. Independence holds for stim-
ulus-driven activity as well when decorrelation (i.e., whitening)
is performed to encode sensory information efficiently (e.g.,
silicon retinae [14]). For event service, Poisson statistics’
exponential event-interval distribution does not pertain for light
loads, where the transmission time is more or less fixed at $T_{\mathrm{cyc}}$.
Nevertheless, the exponential distribution is appropriate for
heavy loads because burst lengths are geometrically distributed,
as we show shortly.
III. QUEUING MODEL
We have developed a two-level queuing model for the parallel-read-write
link’s transmitter. The model is shown in Fig. 3.
Level 1 has as many queues as there are rows in the array. Each
one gives the number of events that a particular row has when
it is serviced. Level 2 has a single queue; it gives the number
of rows waiting for service. The event count $n$ depends on
how often events are read from each row ($\mu$), which in turn depends
on the row count $m$. The row count $m$ depends on how
rapidly rows are transmitted ($M$), which, in turn, depends on
the event count $n$. This circular behavior is analyzed in this section,
keeping in mind that the average event-transmission time
drops as $n$ rises.
Fig. 3. Queuing model. Events occur at the same rate $\lambda$ in each of $N$
rows; the overall rate at which rows make requests is $\Lambda$. Each row is serviced
at the same rate $\mu$; the overall rate at which rows are serviced is $M$. The
number of rows waiting to be selected is $m$ and the number of events a selected
row has is $n$. These events are transmitted in a burst of length $n$.
Under the Poisson assumption, the rate $\lambda$ at which events
occur in a row and the rate $\mu$ at which they are read from
a row give the probability of generating and servicing an event
in that row, respectively. These probabilities are simply $\lambda\,\Delta t$
and $\mu\,\Delta t$ for a short time interval $\Delta t$. Similarly, $\Lambda$ and
$M$ give the probabilities of generating and servicing row
requests in the array, respectively. Unlike $\lambda$ and $\mu$, which
are in events per second, $\Lambda$ and $M$ are in rows per second.
In equilibrium, the number of events $n$ in a row is obtained
by equating the rates at which events are generated and serviced
[17]. This count increases from $n$ to $n+1$ with probability
$P(n)\,\lambda\,\Delta t$, where $P(n)$ is the probability of being in
state $n$, while it decreases from $n+1$ to $n$ with probability
$P(n+1)\,\mu\,\Delta t$. Equating $P(n)\,\lambda\,\Delta t$ to $P(n+1)\,\mu\,\Delta t$ and assuming
that events are serviced faster than they are generated
(i.e., $\lambda < \mu$), we obtain
$$P(n) = (1 - p)\,p^{n} \qquad (1)$$
This is a geometric distribution with parameter
$p \equiv \lambda/\mu$. For intuition, think of $p$ as the fraction
of event-transmission slots that are filled at the level of a
single row. Similarly, the number of row requests $m$ also is
described by a geometric distribution, $P(m) = (1 - q)\,q^{m}$, with parameter
$q \equiv \Lambda/M$. For intuition, think of $q$ as the fraction of
row-transmission slots that are filled at the level of the entire
array.
To calculate $p$, we need to know the rate $\lambda$ at which events
occur in a row and the rate $\mu$ at which any particular row is
serviced. We can easily infer $\lambda$ from the number of cells the
row has, $N_{\mathrm{col}}$, and the average cell activity. But $\mu$ is more
involved because it depends not only on how many rows are
serviced per second, $M$, but also on how many rows request at
the same time. To calculate $M$, recall that the transmitter takes
$T_{\mathrm{cyc}}$ seconds to send the first event in a burst and $T_{\mathrm{brst}}$ seconds
to send each of the remaining ones [see Fig. 2(b)]. Therefore,
we have
$$M = \frac{1}{T_{\mathrm{cyc}} + (\bar{n} - 1)\,T_{\mathrm{brst}}} \qquad (2)$$
where $\bar{n}$ is the average number of events read from a row.
Dividing by $\bar{m}$, the average number of rows requesting,
and multiplying by $\bar{n}$, to convert from rows per second to
events per second, yields
$$\mu = \frac{\bar{n}}{\bar{m}}\,\frac{1}{T_{\mathrm{cyc}} + (\bar{n} - 1)\,T_{\mathrm{brst}}} \qquad (3)$$
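We are now left with the tasks of calculating $\bar{n}$, the average
number of events read from a row, and $\bar{m}$, the average
number of row requests. To calculate $\bar{n}$, we make use of the
geometric distribution
$$\bar{n} = \frac{\sum_{n=1}^{\infty} n\,P(n)}{\sum_{n=1}^{\infty} P(n)} = \frac{1}{1 - p} \qquad (4)$$
We do not include $n = 0$ because an empty row would not be
read. To calculate $\bar{m}$, we make use of the geometric distribution
$$\bar{m} = \frac{\sum_{m=1}^{\infty} m\,P(m)}{\sum_{m=1}^{\infty} P(m)} = \frac{1}{1 - q} \qquad (5)$$
We now have expressions for every variable we need to calculate
$\mu$, but we still need to calculate $q$.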
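The conditional mean behind the average burst length can be checked numerically. The sketch below assumes the geometric form $P(n) = (1 - p)p^{n}$ for the event count and averages over $n \ge 1$ only, since an empty row is never read. The function name and the truncation at 10,000 terms are illustrative choices, not taken from the paper.

```python
def mean_burst_length(p, terms=10_000):
    """Mean of n over n >= 1 for P(n) = (1 - p) * p**n (empty rows excluded)."""
    counts = range(1, terms)
    weights = [(1.0 - p) * p**n for n in counts]
    total = sum(weights)  # probability that the row is non-empty
    return sum(n * w for n, w in zip(counts, weights)) / total


if __name__ == "__main__":
    # Compare the truncated sum with the closed form 1 / (1 - p).
    for p in (0.1, 0.5, 0.9):
        print(p, mean_burst_length(p), 1.0 / (1.0 - p))
```

The truncated sum should agree with $1/(1 - p)$ closely for any $p$ that is not extremely close to one.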
To calculate $q$, which is required to calculate $\bar{m}$, we need
to know the total rate $\Lambda$ at which the rows make requests.
We already know the total rate $M$ at which they are serviced
[see (2)]. We obtain $\Lambda$ by observing that, on average, a row
makes a request every time it accumulates $\bar{n}$ events, as that is
the mean number of events read. Therefore, we divide the rate
$\lambda$ at which events occur in each row by $\bar{n}$ to obtain the
rate at which a single row makes requests. Then, we multiply
the result by $N$, the number of rows, to obtain
$$\Lambda = \frac{N\lambda}{\bar{n}} \qquad (6)$$
We are finally ready to derive a system of simultaneous equations
that can be solved for the geometric parameters $p$ and $q$.
For $q$, substituting the expression for $\bar{n}$ given in (4) into (6)
and (2) to obtain $\Lambda$ and $M$, respectively, gives
$$q = \frac{\Lambda}{M} = \frac{(1 - p)\,T_{\mathrm{cyc}} + p\,T_{\mathrm{brst}}}{T} \qquad (7)$$
where $T \equiv 1/(N\lambda)$ is the inter-event interval for the entire
array. For $p$, calculating $\lambda$ from (6), dividing the result by the
expression for $\mu$ in (3), and substituting the expression for $\bar{m}$
in (5) gives
$$p = \frac{\lambda}{\mu} = \frac{q\,\bar{m}}{N} = \frac{1}{N}\,\frac{q}{1 - q} \qquad (8)$$
We can solve these equations for $p$ and $q$ as a function of $T$,
the inter-event interval, given $T_{\mathrm{cyc}}$, $T_{\mathrm{brst}}$, and $N$. Instead of
presenting these expressions, which are messy and uninsightful,
we derive excellent approximations for light and heavy loads in
the next section.
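The pair of simultaneous equations can also be solved numerically rather than in closed form. The sketch below assumes that (7) and (8) take the forms $q = [(1-p)T_{\mathrm{cyc}} + pT_{\mathrm{brst}}]/T$ and $p = q/[N(1-q)]$, which is what the derivation steps described above produce. The function name and the bisection approach are illustrative, not the author's method.

```python
def solve_loads(T, T_cyc, T_brst, N, iters=200):
    """Return (p, q) for array inter-event interval T.

    Assumed forms (reconstructed from the text, not quoted from it):
      (7)  q = ((1 - p) * T_cyc + p * T_brst) / T
      (8)  p = q / (N * (1 - q))
    """
    def p_of(q):
        return q / (N * (1.0 - q))

    def residual(q):
        p = p_of(q)
        return ((1.0 - p) * T_cyc + p * T_brst) / T - q

    lo, hi = 0.0, N / (N + 1.0)  # p_of(hi) == 1, the largest valid p
    if residual(hi) >= 0.0:
        raise ValueError("load exceeds what the burst rate can carry")
    # residual() decreases monotonically in q, so bisection finds the root.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    q = 0.5 * (lo + hi)
    return p_of(q), q


if __name__ == "__main__":
    # Times in units of T_brst; T_cyc = 5 and N = 100 mirror the Fig. 4 setup.
    for T in (50.0, 10.0, 5.0, 2.0, 1.2):
        p, q = solve_loads(T, T_cyc=5.0, T_brst=1.0, N=100)
        print(f"T={T:5.1f}  p={p:.4f}  q={q:.4f}")
```

Bisection is used because the residual is monotonic in $q$ when $T_{\mathrm{cyc}} > T_{\mathrm{brst}}$, so the root is unique and no starting guess is needed.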
Fig. 4. Throughput versus load. Row load q, burst probability p, and attainable
throughput f , versus link load x (unity-slope line). Load and throughput are
expressed as a fraction of the link’s burst-rate, which is five times higher than its
cycle-rate. Light- and heavy-load approximations for p are plotted in hairlines.
There are 100 rows.
A. Light and Heavy Loads
To find approximations for the geometric parameters, we
draw on two insights from the preceding derivation. First, the
fraction of events that are transmitted at the burst rate (i.e.,
$(\bar{n} - 1)/\bar{n}$) is equal to the fraction $p$ of event-transmission
slots that are filled at the row level. This identity is stated
by (4). Second, the fraction of attainable throughput that is
used is equal to the fraction $q$ of row-transmission slots that are
filled at the array level. This identity is stated by (7), where
attainable throughput is precisely defined as the reciprocal of
the weighted average of $T_{\mathrm{cyc}}$ and $T_{\mathrm{brst}}$. Hence, we refer to $p$ as
the burst probability and to $q$ as the row load, which is not to
be confused with the link load defined earlier (see Section I).
Solutions for row load, burst probability, and attainable
throughput are plotted as a function of the link load in Fig. 4,
with $T_{\mathrm{cyc}} = 5\,T_{\mathrm{brst}}$ and $N = 100$. The row load starts
increasing early, even when the link load is a small fraction
of the cycle rate (i.e., $T \gg T_{\mathrm{cyc}}$). By the time the cycle rate is
exceeded, the row load is almost 100%, indicating that virtually
all row-transmission slots are filled. At this point the burst
probability shoots up, shifting the transmission time from
$T_{\mathrm{cyc}}$ to $T_{\mathrm{brst}}$. Hence, attainable throughput starts to increase,
a hair ahead of the load. The row load increases before the burst
probability does because the 1:1 load-to-service ratio for rows
is higher than the load-to-service ratio for
individual cells. Hence, the fraction $q$ of row-transmission slots
filled at the array level is $N/\bar{m}$ times higher than the
fraction $p$ of event-transmission slots filled at the row level.
Or $q = (N/\bar{m})\,p$ [this identity also follows from (3) and
(6)].
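The insight that burst probability is almost zero when the load
is light yields a light-load approximation for $p$. With $p \approx 0$, (7)
simplifies to $q \approx T_{\mathrm{cyc}}/T$. Substituting this value into (8) yields
$$p \approx \frac{1}{N}\,\frac{T_{\mathrm{cyc}}}{T - T_{\mathrm{cyc}}} \qquad (9)$$
This approximation is also plotted in Fig. 4. It is excellent for
loads lower than $1/(2T_{\mathrm{cyc}})$ (i.e., $T > 2T_{\mathrm{cyc}}$) because $p < 1/N$
for $T > 2T_{\mathrm{cyc}}$, which keeps $q \approx T_{\mathrm{cyc}}/T$. Thus, our light-load
approximation is good until half the attainable throughput is used
up.
The insight that row load is almost 100% when the load is
heavy yields a heavy-load approximation for $p$. With $q \approx 1$, (7)
gives
$$p \approx \frac{T_{\mathrm{cyc}} - T}{T_{\mathrm{cyc}} - T_{\mathrm{brst}}} \qquad (10)$$
This approximation is also plotted in Fig. 4. It is excellent for
loads higher than the cycle rate (i.e., $T < T_{\mathrm{cyc}}$), where the row
load stays close to 100% (i.e., $q \approx 1$).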
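The two regimes can be evaluated side by side with a few lines of code. The sketch below assumes the light-load form $p \approx (1/N)\,T_{\mathrm{cyc}}/(T - T_{\mathrm{cyc}})$ and the heavy-load form $p \approx (T_{\mathrm{cyc}} - T)/(T_{\mathrm{cyc}} - T_{\mathrm{brst}})$, both reconstructed from the substitutions described in the text. The function names are hypothetical.

```python
def p_light(T, T_cyc, N):
    """Light-load burst probability: q ~ T_cyc / T substituted into p = q / (N (1 - q))."""
    if T <= T_cyc:
        raise ValueError("light-load form needs T > T_cyc")
    return (T_cyc / (T - T_cyc)) / N


def p_heavy(T, T_cyc, T_brst):
    """Heavy-load burst probability: q ~ 1 in (1 - p) T_cyc + p T_brst = q T."""
    if not (T_brst <= T <= T_cyc):
        raise ValueError("heavy-load form needs T_brst <= T <= T_cyc")
    return (T_cyc - T) / (T_cyc - T_brst)


if __name__ == "__main__":
    # Times in units of T_brst, with T_cyc = 5 and N = 100 as in the Fig. 4 caption.
    print(p_light(10.0, 5.0, 100))  # load at half the cycle rate
    print(p_heavy(2.0, 5.0, 1.0))   # load at half the burst rate
```

At $T = 2T_{\mathrm{cyc}}$ the light-load expression evaluates to $1/N$, and the heavy-load expression runs from 0 at $T = T_{\mathrm{cyc}}$ to 1 at $T = T_{\mathrm{brst}}$.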
B. Wait Time
We now use our approximations for the geometric parameters
to estimate the mean wait time $\bar{w}$. The wait time is given by the
average amount of time required to transmit each event times
the number of events queued. To calculate the number of events
queued, we use (4) to obtain the expected number of events in
each row and solve the identity $q = (N/\bar{m})\,p$ to
obtain the expected number of rows waiting, $\bar{m} = Np/q$. Thus, we
obtain
$$\bar{n}\,\bar{m} = \frac{1}{1 - p}\,\frac{N p}{q} \qquad (11)$$
To find a light-load estimate for the wait time, we substitute
the appropriate approximations for $\bar{n}$ (i.e., $\bar{n} \approx 1$) and (9) into
(11). After multiplying by the mean transmission time, which is
close to $T_{\mathrm{cyc}}$ in this regime, we obtain
$$\bar{w} \approx \frac{T_{\mathrm{cyc}}}{1 - T_{\mathrm{cyc}}/T} \qquad (12)$$
assuming $q \approx T_{\mathrm{cyc}}/T$, which is very reasonable since $p \approx 0$
in the light-load regime.
To find a heavy-load estimate for the wait time, we substitute
the appropriate approximations for $q$ (i.e., $q \approx 1$) and (10)
into (11). Observing that the mean transmission time virtually
equals the mean inter-event interval $T$ in this regime, we obtain
$$\bar{w} \approx k\,\frac{T_{\mathrm{brst}}}{1 - T_{\mathrm{brst}}/T} \qquad (13)$$
where $k$ is given by
$$k = N\,\frac{T_{\mathrm{cyc}} - T}{T_{\mathrm{brst}}} \qquad (14)$$
$k$ attains its maximum value of $b$ times $N$
when $T = T_{\mathrm{brst}}$.
How do these wait times compare with the serial-read-write
link’s? In the light-load regime [see (12)], the wait time is
identical to that in a link with a fixed transmission time $T_{\mathrm{cyc}}$
[14], [17], hence the two links behave the same. We would not
expect any different, as bursts are rare when the load is light.
In the heavy-load regime [see (13)], the wait time is $k$ times
longer than in a link with a fixed transmission time of $T_{\mathrm{brst}}$. And
this queue-multiplier can be as large as the boost factor times
the number of rows. The reason is that virtually all the rows are
waiting when the burst probability is high (based on the identity
$\bar{m} = Np/q$ with $p \approx 1$ and $q \approx 1$). And the number of events
each row has is more or less equal to the boost factor, as that is
the degree of parallelism required to achieve the boost. It is this
escalation in the wait time that ultimately limits how much of the
parallel-read-write link’s capacity we can use, as we show next.
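A small numerical sketch makes the escalation in wait time concrete. It assumes the light-load estimate $\bar{w} \approx T_{\mathrm{cyc}}/(1 - T_{\mathrm{cyc}}/T)$ and the heavy-load estimate $\bar{w} \approx k\,T_{\mathrm{brst}}/(1 - T_{\mathrm{brst}}/T)$ with queue multiplier $k = N(T_{\mathrm{cyc}} - T)/T_{\mathrm{brst}}$. These forms are reconstructed from the substitutions described in the text, and the function names are hypothetical.

```python
def wait_light(T, T_cyc):
    """Light-load wait time: fixed transmission time T_cyc, array inter-event interval T."""
    return T_cyc / (1.0 - T_cyc / T)


def queue_multiplier(T, T_cyc, T_brst, N):
    """Factor by which the heavy-load wait exceeds that of a fixed-T_brst link."""
    return N * (T_cyc - T) / T_brst


def wait_heavy(T, T_cyc, T_brst, N):
    """Heavy-load wait time (valid for T_brst < T < T_cyc)."""
    return queue_multiplier(T, T_cyc, T_brst, N) * T_brst / (1.0 - T_brst / T)


if __name__ == "__main__":
    # Times in units of T_brst; T_cyc = 5 and N = 100 rows are illustrative values.
    print(wait_light(10.0, 5.0))
    print(queue_multiplier(1.0, 5.0, 1.0, 100))  # maximum: (T_cyc/T_brst - 1) * N
    print(wait_heavy(2.0, 5.0, 1.0, 100))
```

The multiplier reaches its largest value, the boost factor times the number of rows, as $T$ approaches $T_{\mathrm{brst}}$, which is why the wait time grows so quickly near full capacity.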
C. Throughput Gain
We are now in a position to answer the question: Given
your timing error specification, how much can you gain in
throughput? As wait time increases with load, we must ease off
to achieve a desired timing error, $\epsilon$, specified as a fraction of
the mean inter-event interval for a cell. Thus, we have
the constraint $\bar{w} < \epsilon\,N N_{\mathrm{col}}\,T$, since a cell’s mean inter-event
interval is $N N_{\mathrm{col}}\,T$ if the
inter-event interval for the entire array is $T$ and it has $N_{\mathrm{col}}$
columns and $N$ rows. Applying this timing constraint to
(13) and solving for $x \equiv T_{\mathrm{brst}}/T$ yields a throughput of
$$x < \frac{1 + \epsilon N_{\mathrm{col}}}{1 + b + \epsilon N_{\mathrm{col}}} \qquad (15)$$
This is the fraction of the link’s capacity (i.e., the burst rate
$1/T_{\mathrm{brst}}$) we are allowed to use if we wish to keep normalized
timing error less than $\epsilon$. Conversely, we must tolerate timing
errors of $(b - 1)/N_{\mathrm{col}}$ if we wish to use 50% of the capacity, for
example. Note that $x$ must be greater than $T_{\mathrm{brst}}/T_{\mathrm{cyc}}$,
or $1/(1 + b)$, to satisfy the heavy-load assumption used to
derive (13), which requires that $T < T_{\mathrm{cyc}}$.
The last result (15) implies that throughput gain, $r$ (see
Section II), cannot exceed the desired timing error times the
number of columns. To see why, assume that most of the
serial-read-write link’s capacity, $1/T_{\mathrm{cyc}}$, is usable, which is
reasonable enough [14]. In comparison, the usable amount
of the parallel-read-write link’s capacity is $x/T_{\mathrm{brst}}$. When we
substitute the expression for $x$ from (15), divide by $1/T_{\mathrm{cyc}}$, and
then subtract one, we obtain
$$r = \frac{b\,\epsilon N_{\mathrm{col}}}{1 + b + \epsilon N_{\mathrm{col}}} \qquad (16)$$
for the throughput gain, where $b \equiv T_{\mathrm{cyc}}/T_{\mathrm{brst}} - 1$ is the boost
factor.
Throughput gain $r$ is plotted as a function of boost factor $b$
in Fig. 5, for various levels of timing error multiplied by
the number of columns ($\epsilon N_{\mathrm{col}}$). It saturates at $\epsilon N_{\mathrm{col}}$ when
$b \gg 1 + \epsilon N_{\mathrm{col}}$. Therefore, we cannot increase throughput more
than $\epsilon N_{\mathrm{col}}$ times, no matter how small we make $T_{\mathrm{brst}}$.
This limitation arises because the timing error limits how many
events can be read in parallel. Basically, we have to read the
row no later than $\epsilon\,N N_{\mathrm{col}}\,T$ s after the first event occurs. We cannot
expect more than $\epsilon N_{\mathrm{col}}$ events to occur in that time, as the row
has only $N_{\mathrm{col}}$ cells.
In summary, throughput can be enhanced substantially by
going parallel if the acceptable timing error is on the order of a
percent, assuming that the array has several hundred columns.
Essentially, the timing specification sets how long we can wait
for other cells in that row to become active, and therefore it
limits how many events we can read in parallel. For example,
for and (which gives ) and
with and , we get [from (16)].
That is to say, even though the burst rate boosts the link’s
capacity nine-fold, the timing spec caps the gain in throughput
to three-fold. This is a substantial enhancement, though smaller
than expected. These conclusions follow from our assumption
of Poisson-like behavior, which is indeed validated by test
measurements (Section V). Before discussing them, we present
measurements of the capacity of links implemented in three
submicrometer technologies.
Fig. 5. Throughput increase. Throughput gain $r$ obtained by increasing
the boost factor $b$ for various values of $\epsilon N_{\mathrm{col}}$, the product of timing error
(expressed as a fraction of a cell’s inter-event interval) and column count.
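The nine-fold boost versus three-fold gain trade-off can be reproduced with a short calculation. The sketch below assumes the usable fraction of the burst rate is $(1 + \epsilon N_{\mathrm{col}})/(1 + b + \epsilon N_{\mathrm{col}})$ and the throughput gain is $r = b\,\epsilon N_{\mathrm{col}}/(1 + b + \epsilon N_{\mathrm{col}})$, forms reconstructed from the steps described in the text. The input values $b = 9$ and $\epsilon N_{\mathrm{col}} = 5$ are illustrative assumptions chosen to match that nine-fold and three-fold statement; they are not quoted from the paper.

```python
def usable_fraction(b, eps_ncol):
    """Fraction of the burst rate usable at timing error eps (eps_ncol = eps * N_col)."""
    return (1.0 + eps_ncol) / (1.0 + b + eps_ncol)


def throughput_gain(b, eps_ncol):
    """Gain over a serial link whose capacity 1/T_cyc is assumed fully usable."""
    # The parallel link's capacity is (1 + b) times the serial link's capacity.
    return (1.0 + b) * usable_fraction(b, eps_ncol) - 1.0


if __name__ == "__main__":
    b, eps_ncol = 9.0, 5.0  # assumed: nine-fold boost, eps * N_col = 5
    print(throughput_gain(b, eps_ncol))
    print(b * eps_ncol / (1.0 + b + eps_ncol))  # closed form of the same quantity
    print(throughput_gain(1e9, eps_ncol))        # saturates near eps * N_col
```

Raising the boost factor further brings little benefit: the gain approaches $\epsilon N_{\mathrm{col}}$ and stays there, which is the saturation the text describes.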
IV. SIGNALING RATES
The parallel-read-write link has thus far been used in five
parallel image-processing chips fabricated in 0.6-, 0.4-, and
0.25-µm CMOS technologies. The first two were 0.6- and
0.4-µm imagers whose pixels converted photosignals into
pulse frequency at the focal plane [3], [18]. The third was a
0.4-µm inverse imager whose pixels converted pulse frequency
back into analog current, which was video-encoded using an
analog multiplexer (i.e., scanner) [19]. The fourth and fifth
were 0.25-µm orientation-selective chips whose pixels received
pulse trains from a silicon retina [18] and encoded their outputs
also as pulse trains [20], [21].
In terms of the overhead in cell area, nine transistors are
required for transmission whereas the prior serial-readout tech-
nique requires just four (reviewed in [14]). At five transistors,
the number used for reception is unchanged. However, whereas
serial-readout requires two lines per column, parallel-readout
requires just one because cells are not selected individually.
Trading a metal line for five transistors is worth it when metal
lines are at a higher premium than transistors, which is the
scaling trend.
Before taking a look at signaling rates, we review the
communication protocol, which is based on a four-phase
handshake [22]. As a concrete example, logic-analyzer traces
captured from a 0.4-µm implementation are presented in Fig. 6.
In this example, a single event is followed by a burst with
two events. Communicating a single row-column address pair
involves a sequence of eight transitions in the row request (Ry,
active-high), column request (Rx, active-low), and acknowledge
(Ack, active-high for Ry but active-low for Rx) signals.
However, each additional column address requires only four
transitions, involving Rx and Ack.1 The address lines revert
back to the row address when Rx is inactive (i.e., high), so the
row address can be reread at any point during the burst. We
added this feature to support receivers with limited buffering
capability, such as microcontrollers, since a burst can be as
long as an entire row [15].
Fig. 6. Word-serial communication protocol. Transmitter outputs row address (Addr) and asserts row request (Ry, active-high). Receiver latches address (72) and
asserts acknowledge (Ack, active-high for Ry). Transmitter then outputs column address and asserts column request (Rx, active-low). Receiver latches address
(94) and asserts acknowledge (Ack, active-low for Rx). Now Rx goes high, followed by Ack, and Ry goes low, followed by Ack. A burst of two address-events is
transmitted next (row address 66 and column addresses 90 and 82). These logic analyzer traces were captured from a 0.4-µm link.
Fig. 7. Word-serial link signaling rates. (a) Eight transitions transmit a row–column address pair sequentially; it takes 70 ns. (b) Four transitions transmit each
additional column address; it takes just 36 ns. There were three events in this burst. These scope traces were captured from a 0.25-µm link.
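The transition counts in the handshake can be made explicit by listing the signal edges for one burst. The sketch below follows the order given in the Fig. 6 caption (Ry rises, Ack rises, Rx falls, Ack falls, Rx rises, Ack rises, Ry falls, Ack falls). How the sequence continues for bursts longer than one event is an assumption based on the statement that each extra column address costs four transitions on Rx and Ack. The function name is hypothetical.

```python
def burst_transitions(num_columns):
    """List the (signal, edge) transitions for one burst of num_columns events."""
    if num_columns < 1:
        raise ValueError("a burst carries at least one column address")
    seq = [("Ry", "rise"), ("Ack", "rise")]      # row address requested and latched
    for _ in range(num_columns):
        seq += [("Rx", "fall"), ("Ack", "fall")]  # column address requested and latched
        seq += [("Rx", "rise"), ("Ack", "rise")]  # column request withdrawn
    seq += [("Ry", "fall"), ("Ack", "fall")]      # row request withdrawn, burst ends
    return seq


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, len(burst_transitions(n)))
```

A single event costs eight transitions and every further column address adds four, so long bursts amortize the row-address overhead.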
To demonstrate the parallel-read-write link’s signaling rate,
we present scope traces captured from a 0.25-µm implementation
in Fig. 7. These traces indicate that it takes 70 ns to send a
row-column address pair but only 36 ns to send each additional
column address, which gives a capacity of 27.8 M events/s. The
pads and the ribbon cable (plus PCB trace) account for 6.7-ns
delay per transition, as we measured rise and fall times 7.4 ±
1.4 ns long [see Fig. 7(a)] and round-trip propagation delays
6.0-ns long.2 These off-chip sources would have added a total of
26.8 ns to the transmission time, but pipelining splits this delay
half-and-half between the transmitter and the receiver. Therefore,
these measurements lead us to conclude that the transmitter
can issue a new column address every 22.6 ns. This value agrees
with an estimate of 20.5 ns that we calculated using transistor
sizes and capacitances extracted from the layout of the transceiver’s
192 × 48 array, based on fabrication parameters supplied
by the MOSIS service for this 0.25-µm process. Hence,
we conclude that off-chip limitations increased $T_{\mathrm{brst}}$ from 22.6
to 36 ns.
1 The transmission times are 10–20 ns longer than those listed in Table I because
of off-chip logic added to decode cell type. There were four cell types
tiled in 2 × 2-pixel blocks.
2 Although we tried to match the 100-Ω impedance of the ribbon cable and
PCB trace, our pad design was conservative. We estimated its impedance to be
150 Ω from the 1.0-V reflection (Vdd = 2.5 V) we observed at the output pad.
TABLE I
BURST-MODE LINK PERFORMANCE
Measured link timing for transmitters and receivers fabricated
in three different technologies [transmitter array (rows
× columns) and pixel sizes are given]. λ is half the minimum
feature size given under Tech. $T_{\mathrm{cyc}}$ and $T_{\mathrm{brst}}$ are durations of
inter-row and intra-row transmissions, as defined earlier, while
Word Serial refers to multiplexing row and column addresses.
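The timing budget quoted for the 0.25-µm link can be retraced with simple arithmetic. The sketch below uses only figures stated in the text (36 ns per additional column address, 6.7 ns of off-chip delay per transition, four transitions per column address, delay split evenly between transmitter and receiver). The variable names are illustrative.

```python
T_BRST_MEASURED_NS = 36.0         # time per additional column address (Fig. 7)
OFF_CHIP_PER_TRANSITION_NS = 6.7  # pads plus ribbon cable and PCB trace
TRANSITIONS_PER_COLUMN = 4        # Rx and Ack each toggle twice

# Total off-chip delay over one column-address handshake.
off_chip_total_ns = OFF_CHIP_PER_TRANSITION_NS * TRANSITIONS_PER_COLUMN

# Pipelining splits that delay evenly between transmitter and receiver.
off_chip_share_ns = off_chip_total_ns / 2.0

# What remains is the interval at which the transmitter itself can issue addresses.
on_chip_interval_ns = T_BRST_MEASURED_NS - off_chip_share_ns

# Capacity is the reciprocal of the burst-mode transmission time.
capacity_mevents_per_s = 1e3 / T_BRST_MEASURED_NS

if __name__ == "__main__":
    print(round(off_chip_total_ns, 1), round(on_chip_interval_ns, 1),
          round(capacity_mevents_per_s, 1))
```

The numbers land on the 26.8 ns, 22.6 ns, and 27.8 M events/s reported in the text.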
Table I summarizes the timing parameters we measured for
the three submicrometer technologies that the link has been fab-
ricated in. Array sizes and pixel areas listed are for the trans-
mitter.3 The array size of the 0.4- m receiver was 176 132;
that of the 0.25- m receiver was identical to the 0.25- m trans-
mitter. Since we did not fabricate a 0.6- m receiver, we used the
0.4- m receiver’s delay as an estimate for the 0.6- m imple-
mentation.4 The 0.4- and 0.25- m implementations have sim-
ilar timing, which supports our conclusion that performance is
3The cell drove the column lines with a 31λ-wide p-type field-effect transistor
(pFET) in the 0.6-µm design and with a 21λ-wide n-type field-effect transistor
(nFET) (u-shaped) in the 0.4- and 0.25-µm designs.
4This delay was added to the timing measurement made by tying the 0.6-µm
transmitter’s request signal directly to its acknowledge signal. This transmitter
had a single request, as it output row and column addresses in parallel, unlike
the other two transmitters.
limited by off-chip signaling. This limitation also explains why
the intra-row transmission time is half the inter-row transmission
time, as the former handshake involves half as many
transitions. However, it is likely that doubling the number of
columns in the 0.25-µm design also contributed to the lack of
improvement; this is addressed in Section VI.
The transmission rate was also limited by an increase in
crosstalk with activity. The 0.25-µm transceiver chip sustained
a maximum rate of 22.7 M events/s between its transmitter-port
and its receiver-port. This load represents 81.6% of its 27.8-M
events/s capacity. Turning up the silicon neurons’ activity
further caused a step change from 22.7 to 25.0 M events/s, the
peak transmission-rate we observed. We had to turn down the
activity well below the trigger point to get the chip out of this
state. Crosstalk mediated by the supply rails, the substrate, or
the row and column lines can generate spurious events when
it is amplified by the event-generators, which have high gain,
achieved through positive feedback [3]. The mean number of
spurious events triggered per event increases as the activity
level increases, as it becomes more likely to find a generator
close enough to threshold. The event rate explodes when each
event triggers more than one event.
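The runaway condition can be illustrated with a simple branching-process sketch. It assumes, hypothetically, that every event independently triggers a mean of m spurious events, each of which can trigger further events in the same way. This is our simplification, not a model taken from the measurements. For m < 1 the expected total per original event is the geometric series 1/(1 - m), which diverges as m approaches one.

```python
def expected_events_per_original(m: float) -> float:
    """Expected total events (the original plus all spurious descendants).

    m is the assumed mean number of spurious events triggered per event.
    The series 1 + m + m**2 + ... converges to 1 / (1 - m) only for m < 1.
    """
    if m < 0:
        raise ValueError("m must be non-negative")
    if m >= 1:
        return float("inf")  # each event at least replaces itself: runaway
    return 1.0 / (1.0 - m)


for m in (0.0, 0.1, 0.5, 0.9, 0.99):
    total = expected_events_per_original(m)
    print(f"m = {m:4.2f} -> {total:6.1f} events per original event")
```

The steep growth near m = 1 is consistent with the abrupt step in event rate described above, although the sketch says nothing about how m depends on activity.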
In addition to minimizing crosstalk by isolating low-level
analog circuitry through circuit design and layout techniques
[14], we also packaged the dice carefully. We took advantage
of the 50% reduction in link-width (number of address lines)
realized by multiplexing row and column addresses to inter-
sperse power pads between both input and output pads in the
0.25-µm transceiver. Thus, every other digital pad was either
Vdd or Gnd. This practice, common in high-speed processor de-
sign, reduces L di/dt noise by dividing di/dt among several par-
allel paths. We also minimized L by using a thin-quad-flat-pack
(TQFP) package with just 3-nH lead inductance.5 These mea-
sures yielded the 81.6% utilization figure quoted above, a much
higher fraction of capacity than we have previously been able to
use [23].
V. THEORY VERSUS PRACTICE
We verified that the 0.25-µm link behaves as theoretically
predicted for loads up to 81.6% of its capacity, that is, 22.7 M
events/s. We calculated the fraction of events transmitted at
the intra-row interval to determine the burst probability p. We
obtained an excellent fit to this data when we solved (7) and (8)
for p, as shown in Fig. 8(a). The transmission times that gave the
best fit are almost identical to the times where peaks occur in the
event-interval histogram (see Fig. 9). These peaks appeared at
73 and 37 ns, whereas inter-row and intra-row transmission times
of 68 ns (−6.8%) and 37 ns (0%) yielded the best fit. Conversely,
the observed transmission times yielded a theoretical burst
probability that exceeded the
measured value by 3.8% at 22.7 MHz (0.834 versus 0.803).
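The measurement step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used: it approximates the burst probability as the fraction of inter-event intervals that fall below a cut placed between the two histogram peaks, and both the function name and the 55-ns threshold are our assumptions.

```python
from itertools import accumulate


def burst_probability(timestamps_ns, threshold_ns=55.0):
    """Fraction of event intervals shorter than threshold_ns.

    threshold_ns is a hypothetical cut placed midway between the two
    histogram peaks (37 ns and 73 ns); intervals below it are counted as
    intra-row (burst) transmissions, all others as inter-row or idle.
    """
    intervals = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    if not intervals:
        raise ValueError("need at least two timestamps")
    short = sum(1 for dt in intervals if dt < threshold_ns)
    return short / len(intervals)


# Synthetic intervals (ns), invented purely to exercise the function.
example_intervals_ns = [73, 37, 37, 37, 73, 37, 110, 37, 37, 37]
example_timestamps_ns = [0] + list(accumulate(example_intervals_ns))
print(burst_probability(example_timestamps_ns))
```

With real data the threshold would need to be chosen from the recorded histogram rather than fixed in advance.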
Although the theory fitted nicely at heavy loads, our measure-
ments were up to 65% less than it predicted for light loads, as
Fig. 8(b) shows. Lower than expected burst probability is not
surprising since we assumed an exponentially distributed ser-
vice time when, in actual fact, the service time is fixed. Such
deterministic service times can reduce queuing to 50% of that in
5This 120-pin package was 14 mm × 14 mm in size, with a 7 mm × 7 mm
paddle size. The 2.1 mm × 4.8 mm die was placed near the top of the paddle
and bonded on just three sides to minimize bond-wire length.
Fig. 8. Burst probability measurements. (a) Fit of the analytical solution for p
[(7) and (8)], with an inter-row transmission time of 68 ns, an intra-row
transmission time of 37 ns, and N = 48. (b) Zooming in on the light-load range
to show the poor fit at low burst probabilities. The data are from the 0.25-µm link.
Fig. 9. Event-interval histogram. Distribution of intervals between 107 317
address events recorded in 4.72 ms (a 22.7-MHz mean rate) and time stamped
at 0.5-ns resolution. Bin size is 1 ns. The first peak is at 37 ns; it corresponds
to the intra-row transmission time. The second is at 73 ns; it corresponds to the
inter-row transmission time. This data is from the 0.25-µm link.
the exponentially-distributed model (Poisson assumption) [17],
[24]. However, the row-service time distribution does become
exponential as we transition from sending a single event from
each row to sending bursts of several events, because burst-
Fig. 10. Ping-pong in arbiter tree. White and black rectangles delimit set and
clear handshake-phases for a cell (middle), its parent (top), and its two daughters
(bottom); requests travel up while acknowledges travel down. These signals are
set and cleared sequentially, as indicated by the solid arrows and the dashed
pendulums, respectively. (a) Cell ping pongs from one daughter to the other one
if that daughter makes another request while its sister is being serviced. (b) Cell
avoids ping ponging by waiting until its parent’s acknowledge clears before it
clears its own acknowledges.
lengths are distributed geometrically [see (1)]. These variable-
length bursts explain the excellent agreement between theory
and experiment in the heavy-load regime.
The highest burst probability we measured in the 0.25-µm
link was 0.9944, which occurred at a load of 25.0 M events/s.
This probability is close to the maximum possible, as it
corresponds to 180-event burst-lengths, whereas the chip has
only 192 cells per row. However, we did not include this data
point in Fig. 8 because crosstalk introduced significant correla-
tions between event-generators at such high levels of activity,
as explained earlier. These artificial correlations enhance the
burst probability as they make it more likely that neighboring
neurons will fire. Indeed, the measured value was significantly
higher than the theoretically predicted burst probability for this
load (0.9944 versus 0.9309). These high burst probabilities con-
firm that the fair arbiter design [15] introduced in our 0.25-µm
transceiver chip successfully eliminated load shedding, which
plagued previous link implementations [14].
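The correspondence between burst probability and burst length quoted above can be reproduced with a short sketch. It assumes a geometric burst-length distribution of the form P(n) = p^(n-1) (1 - p) for n = 1, 2, ..., whose mean is 1/(1 - p). The exact form of (1) is not restated here, so this parameterization is an assumption.

```python
def mean_burst_length(p: float) -> float:
    """Mean of an assumed geometric burst-length distribution.

    p is the burst probability; P(n) = p**(n - 1) * (1 - p) for n >= 1.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return 1.0 / (1.0 - p)


for p in (0.5, 0.803, 0.9309, 0.9944):
    print(f"p = {p:.4f} -> mean burst length {mean_burst_length(p):6.1f} events")
```

Under this assumption a burst probability of 0.5 gives two events per burst, and 0.9944 gives close to the 180-event figure cited in the text.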
A fair arbiter design is crucial to achieving the theoretically
predicted throughput-gain. As shown in Fig. 1, a 1-in-N arbiter
is built by connecting N − 1 1-in-2 arbiters in a binary tree. In
the greedy arbiter design presented in [14] and [25], the top cell
in the tree ignores one side completely when more than half of
the arbiter’s clients are waiting for service. This behavior first
occurs in the row arbiter, and it migrates down the tree as the
load is increased further, locking out more and more rows. Such
load shedding is detrimental in a parallel-read design because it
prevents the burst probability from exceeding 0.5, which cor-
responds to an average of just two active cells per row. This
value of follows from the identity with
the number of row requests and the row load
. When the burst probability fails to increase, the average
transmission time does not decrease, and hence empty transmis-
sion slots needed to send the locked-out rows’ events do not ma-
terialize.
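The link between burst probability and average transmission time can be made concrete with a rough sketch. It assumes, as our own simplification, that a fraction p of events is sent at the intra-row interval and the remainder at the inter-row interval, using the 37-ns and 73-ns histogram peaks as the two times. The function names are hypothetical.

```python
T_INTER_NS = 73.0  # inter-row peak in the event-interval histogram
T_INTRA_NS = 37.0  # intra-row peak in the event-interval histogram


def mean_transmission_time_ns(p, t_inter_ns=T_INTER_NS, t_intra_ns=T_INTRA_NS):
    """Average time per event when a fraction p of events is sent in bursts."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must satisfy 0 <= p <= 1")
    return p * t_intra_ns + (1.0 - p) * t_inter_ns


def saturated_rate_mevents_per_s(p):
    """Event rate (M events/s) if the link were busy all the time."""
    return 1e3 / mean_transmission_time_ns(p)


for p in (0.0, 0.5, 0.8, 1.0):
    print(f"p = {p:.1f}: {mean_transmission_time_ns(p):5.1f} ns per event, "
          f"{saturated_rate_mevents_per_s(p):5.1f} M events/s")
```

In this simplified picture, a burst probability stuck at 0.5 leaves the average time per event midway between the two intervals, so the slots that longer bursts would free up never appear.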
Load shedding happens when a daughter that has just been
serviced makes a new request while the other daughter is
being serviced—and gets selected again. Selection continues to
ping-pong back and forth, as illustrated in Fig. 10(a), where
a possible fix is also presented [Fig. 10(b)]: Do not clear your
acknowledge until your parent clears hers. This ordering,
which mirrors the set sequence, requires requests at all levels
of the tree to be cleared, starting at the bottom and propagating
all the way up, before the acknowledges are cleared, starting at
the top this time and propagating all the way down. Traversing
the entire tree like this makes this reshuffling painfully slow
[compare Fig. 10(a) and (b)]. An alternative sequencing that is
optimized for speed is presented in the companion paper [15].
The results presented here confirm that the fair arbiter design
can indeed achieve burst probabilities that exceed 0.5.
VI. SUMMARY AND CONCLUSION
We analyzed and tested a 2-D address-event communication
link that was designed to take advantage of high integration
densities available in deep submicrometer processes. For
scalability, it exploits a linear increase in the number of active
cells per row with increasing array size by reading and writing
events in parallel. Our analysis revealed that the resulting
throughput-gain can be severalfold if a timing error as large as
this gain divided by the number of columns is acceptable. For
instance, a two-fold throughput gain is achievable with subper-
cent timing errors in a 200-column array. This conclusion holds
if event generation and service are Poisson. This assumption
is validated by the excellent agreement between predicted and
measured burst probabilities in the heavy-load regime.
In practice, the parallel read-write technique did deliver on its
promise of boosting absolute throughput (breaking the array’s
row cycling limit), but we achieved only a twofold gain be-
cause off-chip signaling limited the burst-rate. Whereas it took
36 ns to transmit a column address, it appears that the 0.25-µm
transmitter can issue one every 23 ns; the pads added 7 ns
and the 15-cm cable added 6 ns. Thus, off-chip signaling lim-
itations cut capacity from 43 to 28 M events/s. Sophisticated
asynchronous off-chip signaling techniques will be required to
achieve substantial throughput gains, in addition to optimizing
critical timing paths on chip.
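The capacity figures quoted above follow directly from the per-address times. The sketch below simply inverts those times; the constant and function names are ours, and capacity is taken to be one event per column-address time.

```python
ON_CHIP_NS = 23.0  # time for the transmitter to issue a column address
PAD_NS = 7.0       # delay attributed to the pads
CABLE_NS = 6.0     # delay attributed to the 15-cm cable


def capacity_mevents_per_s(time_per_address_ns: float) -> float:
    """Capacity in M events/s, assuming one event per column-address time."""
    return 1e3 / time_per_address_ns


total_ns = ON_CHIP_NS + PAD_NS + CABLE_NS

print(f"on-chip only:  {capacity_mevents_per_s(ON_CHIP_NS):.1f} M events/s")
print(f"with off-chip: {capacity_mevents_per_s(total_ns):.1f} M events/s")
```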
The critical path in the transmitter (receiver) is now the
column encoder (decoder). The encoder drives capacitance
proportional to the column count when it outputs the column
address, and when it feeds the acknowledge signal back to the
column-arbiter interfaces. Similar considerations apply to the
decoder. Hence, the transmitter (receiver) takes more time to
encode (decode) events as the number of columns increases.
This linear scaling can be avoided by using a hierarchical
organization just like the arbiter does—its two-input cells are
connected in a binary tree. This approach yields a logarithmic
delay scaling; the price is a logarithmic increase in wiring [22].
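The difference between the two scaling laws can be illustrated by counting tree levels. This is a sketch in arbitrary units under our own assumptions: the flat encoder's load is taken as proportional to the column count, the hierarchical one as proportional to the depth of a binary tree of two-input cells, and wiring overhead is not modeled.

```python
def tree_levels(n_columns: int) -> int:
    """Levels in a binary tree of two-input cells serving n_columns leaves."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1.
    return (n_columns - 1).bit_length()


for n in (48, 96, 192, 384):
    print(f"{n:4d} columns: flat load ~ {n:4d} units, tree depth = {tree_levels(n)} levels")
```

Each doubling of the column count doubles the flat load but adds only one level to the tree.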
ACKNOWLEDGMENT
The author would like to thank B. Taba for his insightful
suggestions about analyzing and testing the link design and
P. Merolla for help in obtaining data from his 0.25-µm trans-
ceiver chip.
REFERENCES
[1] M. Mahowald, An Analog VLSI Stereoscopic Vision System. Boston,
MA: Kluwer, 1994.
[2] K. A. Boahen, “The retinomorphic approach: Pixel-parallel adaptive am-
plification, filtering, and quantization,” Analog Integr. Circuits Signal
Process., vol. 13, pp. 53–68, 1997.
[3] E. Culurciello, R. Etienne-Cummings, and K. A. Boahen, “A biomor-
phic digital image sensor,” IEEE J. Solid-State Circuits, vol. 38, pp.
281–294, Feb. 2003.
[4] J. Kramer, “An on/off transient imager with event-driven asynchronous
read-out,” in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, 2002,
pp. II-165–II-168.
[5] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gille-
spie, “Silicon auditory processors as computer peripherals,” IEEE Trans.
Neural Networks, vol. 4, pp. 523–528, May 1993.
[6] W. Yang, “A wide-dynamic range low-power photosensor array,” in
Proc. Int. Solid-State Circuits Conf., vol. 37, 1994, p. 230.
[7] B. Fowler, A. E. Gamal, and D. Yang, “A cmos area image sensor with
pixel-level A/D conversion,” in Proc. IEEE Int. Solid-State Circuits
Conf. (ISSCC’94), vol. 37, San Francisco, CA, 1994, pp. 226–227.
[8] L. G. McIlrath, “A low-power low-noise ultrawide-dynamic-range cmos
imager with pixel-parallel A/D conversion,” IEEE J. Solid-State Cir-
cuits, vol. 36, pp. 846–853, May 2001.
[9] A. Murray and L. Tarassenko, Analogue Neural VLSI: A Pulse Stream
Approach. London, U.K.: Chapman & Hall, 1994.
[10] A. Mortara, E. Vittoz, and P. Venier, “A communication scheme for
analog VLSI perceptive systems,” IEEE J. Solid-State Circuits, vol. 30,
pp. 660–669, June 1995.
[11] A. Abusland, T. S. Lande, and M. Hovin, “A VLSI communication archi-
tecture for stochastically pulse-encoded analog signals,” in Proc. IEEE
Int. Symp. Circuits and Systems, vol. 3, May 1996, pp. 401–404.
[12] M. Sivilotti, “Wiring considerations in analog vlsi systems, with applica-
tion to field-programmable networks,” Ph.D. dissertation, Dept. Comp.
Sci., California Inst. Technol., Pasadena, CA, 1991.
[13] S. R. Deiss, R. J. Douglas, and A. M. Whatley, “A pulse-coded commu-
nications infrastructure for neuromorphic systems,” in Pulsed Neural
Networks, W. Maass, Ed. Boston, MA: MIT Press, 1999, ch. 6, pp.
157–178.
[14] K. A. Boahen, “Point-to-point connectivity between neuromorphic
chips using address-events,” IEEE Trans. Circuits Syst. II, vol. 47, pp.
416–434, May 2000.
[15] K. A. Boahen, “A burst-mode word-serial address-event link—I: Transmitter de-
sign,” IEEE Trans. Circuits Syst. I, vol. 51, pp. 1269–1280, July 2004.
[16] K. A. Boahen, “A burst-mode word-serial address-event link—II: Receiver de-
sign,” IEEE Trans. Circuits Syst. I, vol. 51, pp. 1281–1291, July 2004.
[17] M. Schwartz, Telecommunication Networks: Protocols, Modeling, and
Analysis. Reading, MA: Addison-Wesley, 1987.
[18] K. Zaghloul and K. A. Boahen, “Optic nerve signals in a neuromorphic
chip—II: Testing and results,” IEEE Trans. Biomed. Eng., vol. 41, pp.
667–675, Apr. 2004.
[19] C. A. Mead and T. Delbruck, “Scanners for visualizing analog vlsi cir-
cuitry,” Analog Integr. Circuits Signal Process., vol. 1, pp. 93–106, 1991.
[20] T. Y. W. Choi, B. E. Shi, and K. Boahen, “An orientation-selective 2D
AER transceiver,” in Proc. IEEE Int. Symp. Circuits and Systems, vol.
4, May 2003, pp. 800–803.
[21] P. Merolla and K. Boahen, “A recurrent model of orientation maps with
simple and complex cells,” in Advances in Neural Information Pro-
cessing, S. Thrun and L. Saul, Eds. San Mateo, CA: Morgan Kaufman,
2003, vol. 15.
[22] C. A. Mead, Introduction to VLSI Systems. Reading, MA: Addison
Wesley, 1980.
[23] K. A. Boahen, “A retinomorphic chip with parallel pathways: encoding
ON, OFF, INCREASING, and DECREASING visual signals,” Analog
Integr. Circuits Signal Process., vol. 30, no. 2, pp. 121–135, 2002.
[24] L. Kleinrock, Queueing Systems. New York: Wiley, 1976.
[25] K. A. Boahen, “Retinomorphic vision systems II: communication
channel design,” in Proc. IEEE Int. Symp. Circuits and Systems, vol.
suppl., May 1996, pp. 14–17.
Kwabena A. Boahen received the B.S. and M.S.E.
degrees in electrical and computer engineering
from The Johns Hopkins University, Baltimore,
MD, in the concurrent masters-bachelors program,
both in 1989, and the Ph.D. degree in computation
and neural systems from the California Institute of
Technology, Pasadena, in 1997.
He is an Associate Professor in the Bio-
engineering Department at the University of
Pennsylvania, Philadelphia, where he holds a
secondary appointment in electrical engineering.
His current research interests include mixed-mode multichip VLSI models of
biological sensory and perceptual systems, and their epigenetic development,
and asynchronous digital interfaces for interchip connectivity.
Dr. Boahen was awarded a Packard Fellowship in 1999, a National Science
Foundation CAREER Grant in 2001, and an Office of Naval Research YIP Grant
in 2002. He is a member of Tau Beta Kappa and has held a Sloan Fellowship
for Theoretical Neurobiology at the California Institute of Technology.
