University of Pennsylvania

ScholarlyCommons
Departmental Papers (BE)

Department of Bioengineering

July 2004

A burst-mode word-serial address-event link--I: transmitter design
Kwabena A. Boahen
University of Pennsylvania, boahen@seas.upenn.edu

Follow this and additional works at: https://repository.upenn.edu/be_papers

Recommended Citation
Boahen, K. A. (2004). A burst-mode word-serial address-event link--I: transmitter design. Retrieved from
https://repository.upenn.edu/be_papers/3

Copyright 2004 IEEE. Reprinted from IEEE Transactions on Circuits and Systems--I: Regular Papers, Volume 51,
Issue 7, July 2004, pages 1269-1280.
Publisher URL: http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=29094&puNumber=8919
This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply
IEEE endorsement of any of the University of Pennsylvania's products or services. Internal or personal use of this
material is permitted. However, permission to reprint/republish this material for advertising or promotional
purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by writing
to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws
protecting it.
This paper is posted at ScholarlyCommons. https://repository.upenn.edu/be_papers/3
For more information, please contact repository@pobox.upenn.edu.

A burst-mode word-serial address-event link--I: transmitter design
Abstract
We present a transmitter for a scalable multiple-access inter-chip link that communicates binary activity
between two-dimensional arrays fabricated in deep submicrometer CMOS. Transmission is initiated by
active cells but cells are not read individually. An entire row is read in parallel; this increases
communication capacity with integration density. Access is random but not inequitable. A row is not
reread until all those waiting are serviced; this increases parallelism as more of its cells become active in
the mean time. Row and column addresses identify active cells but they are not transmitted
simultaneously. The row address is followed sequentially by a column address for each active cell; this
cuts pad count in half without sacrificing capacity. We synthesized an asynchronous implementation by
performing a series of program decompositions, starting from a high-level description. Links using this
design have been implemented successfully in three generations of submicrometer CMOS technology.

Keywords
asynchronous logic synthesis, event-driven communication, fair arbiter design, neuromorphic systems,
parallel readout, pixel-level quantization


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 7, JULY 2004


A Burst-Mode Word-Serial Address-Event
Link—I: Transmitter Design
Kwabena A. Boahen


I. SCALING TWO-DIMENSIONAL ARRAYS

MULTIPLE-ACCESS inter-chip communication links
were originally developed to read out analog signals
from sensor arrays. A clock switches the multiplexer from one
sensor to another, reading a value from each and every one at a
fixed interval, hence the nickname “scanner” [1]. Use of these
clock-driven multiplexers continued after quantizers were included in active pixel sensors [2]–[4] and in pulse-coded neural
networks [5] to discretize signals inside the array. However,
the all-or-none transitions so produced, called events, may be
output as soon as they occur. Such event-driven access has clear
advantages over clock-driven access when activity is sparse
(e.g., spatial or temporal filtering occurs) and timing is critical
(e.g., time encodes analog information). Consequently, this
scheme has been explored for silicon retinas [6]–[9] and silicon
cochleas [10].
Several ways of regulating access in event-driven communication systems have been proposed [5], [11]–[13] and their
efficiency compared [14], [15], but little attention has been paid
to their scaling properties. In all the architectures proposed so
far, a single active cell is read, its state is cleared, and then
the next cell is read. However, it takes longer to cycle the row
and column lines as feature sizes shrink because faster logic
(minimum-sized inverter chain) is neutralized by larger load
(cells per row or column). Hence, these existing designs cannot
accommodate the increase in cell count with integration density, unless the widths of transistors that drive the row and
column lines are increased drastically. However, some of these
devices actually reside in the cell, which must signal when it becomes active in an event-driven system.

Manuscript received January 3, 2002; revised November 2002. This work
was supported in part by the Whitaker Foundation and in part by the National
Science Foundation’s LIS/KDI and CAREER programs under Grant ECS9874463 and Grant ECS00-93851. This paper was recommended by Associate
Editor G. Cauwenberghs.
The author is with the Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104-6392 USA (e-mail: boahen@seas.upenn.edu).
Digital Object Identifier 10.1109/TCSI.2004.830703

In this paper, we describe a scalable event-driven transmitter
interface inspired by two-dimensional (2-D) scanners, which
read out an entire row of cells in parallel, over the column lines
[1]. The increase in parallelism as the array gets denser enables these analog multiplexers to increase their readout rate, despite the fact that the larger load neutralizes the faster logic. By
reading an entire row at once—instead of one cell at a time—we
also achieve a transmission rate that increases as the square-root
of the cell count, assuming a square array. Such scaling is the
best we can do without sizing-up devices inside the array—or
breaking it up into separate banks [16]. Our approach requires
large devices only in the periphery, where parallel-to-serial conversion occurs. Therefore, it allows designers to take better advantage of higher integration densities offered by advanced submicrometer processes.
Our design uses address-events to communicate between
cells in the same array or in different arrays, which need not
be on the same chip. In this respect, it is similar to previous
event-driven links, where the transmitter uses an encoder to
generate an address that uniquely identifies an event’s place of
origin while the receiver uses a decoder to recreate the event at
the destination [6], [10], [11]. However, whereas these previous
designs transmit row and column addresses in parallel, we
transmit them serially. There is no loss in speed because we
do not retransmit the row address if the next event is from the
same row.
In addition to providing a communication standard for
parallel distributed processing, address-events support virtual
point-to-point connectivity. These virtual wires can be routed
by using a look-up table to translate in-coming addresses into
one or more out-going addresses [17]–[19]. Furthermore, the
single-transmitter–single-receiver link may be extended to
support multiple transmitters and receivers using merges and
splits [20], or with a shared bus [18], [21]. Thus, the basic
link can serve a wide variety of purposes when augmented
appropriately. As in previous work, we implemented the link
asynchronously to facilitate its use in large heterogeneous
multichip systems.
The paper is divided into five sections. In Section II, we
briefly review three common multiplexing schemes: metered-,
free-, and arbitered-access (see [15] for an in-depth review).
In Section III, we present a high-level specification for the
transmitter, and decompose it into a hierarchy of concurrent
subprocesses. In Section IV, we present the final handshaking
sequences and the resulting asynchronous logic circuits;
intermediate synthesis steps can be found in the Appendix.
Section V concludes the paper. A preliminary report of this
work was presented in [22]. A parallel-write burst-mode receiver and analysis and test results are presented in companion
papers [23], [24].

II. MULTIPLEXING EVENTS
Metering access to each cell according to a fixed readout
sequence is the simplest solution. These clock-driven multiplexers, or scanners, are commonly used to read out analog currents from imagers, going from row to row, in sequence. In fact,
they read all a row’s pixels in parallel, and then scan them out
serially on the periphery. Thus, they achieve rates over ten million pixels per second; fast enough to scan arrays with hundreds
of thousands of pixels at video frame rates. However, these fast
analog signals require specialized input/output (I/O) pads, are
prone to clock feedthrough and to noise, cannot be easily interfaced with computers, and can be demultiplexed only when
array sizes and clock speeds match.
If activity is sparse, it is more efficient to transmit a cell’s
state only when it changes. It is easy to recognize when significant changes occur if cell-state is quantized. The cheapest
quantizers are one-bit analog-to-digital converters, such as integrate-and-fire neurons [2], [8], [25] or sigma-delta encoders [3],
[4]. Their fixed-width–fixed-height spikes or binary state-transitions constitute a sequence of events that encode information
only in their timing. When coincidences occur, the transmitter
may either delay the new event to prevent a collision or dump
the old event to preserve timing.
Transmitting events immediately by giving cells free access
shortens latency. As such an event-driven operation does not
follow a predetermined (i.e., clock-driven) readout sequence, we
must transmit information that uniquely identifies the event’s
location. These addressed events (abbreviated to address-events)
can be created simply by wiring the event-generators’ outputs
directly to the address encoder [12]. However, when events coincide, the encoder ORs their addresses together. These collisions
increase exponentially as activity increases, with the fraction of
events that get through unscathed maxing out at 18% when only
50% of the transmission slots are full [26].
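
The 18% figure is consistent with the classical throughput of an unslotted (pure ALOHA) channel. The following is a minimal sketch under that assumption, namely Poisson event arrivals and the loss of any event that overlaps another; the text above does not spell the model out, and the function name `aloha_throughput` is hypothetical.

```python
import math


def aloha_throughput(offered_load: float) -> float:
    """Collision-free throughput, as a fraction of channel capacity.

    offered_load is the mean number of events attempted per slot time.
    Under the assumed Poisson model, an event survives only if no other
    event starts within a window two slots wide, which happens with
    probability exp(-2 * offered_load).
    """
    return offered_load * math.exp(-2.0 * offered_load)


# Sweep the offered load from 0.1% to 200% and locate the peak numerically.
loads = [i / 1000 for i in range(1, 2001)]
best_load = max(loads, key=aloha_throughput)
peak = aloha_throughput(best_load)
print(f"peak throughput {peak:.3f} at offered load {best_load:.2f}")
```

Under this model the peak sits at an offered load of one half and equals 1/(2e), which is where the 18% and 50% figures come from.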
Preventing collisions by arbitered access increases
throughput. An arbiter grants only one request at a time, and
the encoder outputs that cell’s address. On average, the wait
equals the mean interval between empty transmission slots
[26]. When only 5% of the slots are empty, for example, the
wait is 20 slots long. For 10 000 cells, each transmission slot
must be shorter than 0.01% of the average inter-event interval
to handle that many cells. Hence, a 20-slot wait corresponds
to just 0.2% of the population’s average inter-event interval.
Therefore, arbitration can potentially achieve a five-fold
increase in throughput—from 18% to 95%—with negligible
timing error [15].
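
The arithmetic in this paragraph can be checked with a short sketch. It assumes, as the text does, that the mean wait equals the mean interval between empty slots and that the slot time is at most 1/n of the population's average inter-event interval for n cells; the function names are hypothetical.

```python
def mean_wait_slots(fraction_empty: float) -> float:
    """Mean wait in slots, taken as the mean interval between empty slots."""
    return 1.0 / fraction_empty


def wait_as_fraction_of_interval(n_cells: int, fraction_empty: float) -> float:
    """Wait relative to the population's average inter-event interval.

    Assumes the slot time is 1/n_cells of that interval, the upper bound
    quoted in the text, so the result is itself an upper bound.
    """
    slot_fraction = 1.0 / n_cells
    return mean_wait_slots(fraction_empty) * slot_fraction


wait = mean_wait_slots(0.05)
timing_error = wait_as_fraction_of_interval(10_000, 0.05)
print(f"wait = {wait:.0f} slots, {timing_error:.2%} of the inter-event interval")
```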

III. TRANSMITTER DESIGN
To achieve the five-fold increase in throughput arbitration
promises, we must ensure that it does not increase the transmission-cycle time. The first implementation of a 2-D arbitered
address-event transmitter, by Mahowald and Sivilotti, yielded
a disappointingly long cycle time of 2 μs [6]. This was because, for a
$\sqrt{N} \times \sqrt{N}$ array with $N$ cells, a 1-in-$\sqrt{N}$ arbiter was first used to pick a row
and then a second 1-in-$\sqrt{N}$ arbiter was used to pick a cell in that
row. As a 1-in-$n$ arbiter is built from $n - 1$ 1-in-2 arbiters, organized in a binary tree, this hierarchical scheme cuts the number
of 1-in-2 arbiters from $N - 1$ to $2(\sqrt{N} - 1)$. Unfortunately, the
cycle time suffered as all $\log_2 \sqrt{N}$ levels in each arbiter tree
were spanned for every event transmitted.
In previous work, we cut the arbitered address-event transmitter’s transmission cycle-time from 2 μs to 730 ns in the same
2-μm technology by exploiting locality [27]. That is, arbitrating
at the lowest level of the arbiter tree for inputs next to each other,
at the second level for inputs two to three places apart, and so
on—only two levels are spanned on an average. We went on to
reduce the cycle time to 420 ns by exploiting locality inside the
array as well [27]. That is, servicing all active cells in a selected
row before redoing the row arbitration.
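
The arbiter counts behind the hierarchical scheme can be sketched as follows. This assumes a square array whose side is a power of two and uses the fact that a binary tree with n inputs needs n - 1 two-input arbiter cells; the function names and the 4096-cell example are illustrative only.

```python
import math


def arbiter_cells(inputs: int) -> int:
    """Two-input arbiter cells in a binary tree serving `inputs` requesters."""
    return inputs - 1


def tree_levels(inputs: int) -> int:
    """Levels in a binary tree with `inputs` leaves (ceiling of log2)."""
    return (inputs - 1).bit_length()


def compare(n_cells: int) -> dict:
    """Compare one flat tree with separate row and column trees."""
    side = math.isqrt(n_cells)
    if side * side != n_cells:
        raise ValueError("this sketch assumes a square array")
    return {
        "flat_cells": arbiter_cells(n_cells),
        "hierarchical_cells": 2 * arbiter_cells(side),
        "levels_per_tree": tree_levels(side),
    }


result = compare(4096)
print(result)
```

The saving in arbiter cells is large, but every event still crosses all the levels of both trees unless locality is exploited as described above.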
Our present goal is to further optimize the arbitered address-event transmitter’s row–column architecture. Having exploited
locality in the arbiter and the array, transmission speed is now
primarily limited by the rate at which events are read out of the
array. Here, we break this array-cycling limit, realizing three
enhancements in all, as follows:
1) reading a row’s events in parallel boosts capacity by a factor that grows as the square root of the cell count;
2) bundling them into a single row-wide word eliminates the
column-select lines;
3) multiplexing row and column addresses cuts output pads
in half.
We alluded to Optimizations 1 and 3 in Section I. Optimization 2
is a direct consequence of Optimization 1—selecting an entire
row instead of a single cell.
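
The pad saving from Optimization 3 can be illustrated with a small count of address lines. This is a sketch only: it counts address bits, ignores the request and acknowledge lines, and uses hypothetical function names.

```python
def address_bits(count: int) -> int:
    """Bits needed to give each of `count` rows or columns a binary address."""
    return max(1, (count - 1).bit_length())


def address_pads(rows: int, cols: int) -> dict:
    """Address pads for parallel versus word-serial transmission."""
    row_bits = address_bits(rows)
    col_bits = address_bits(cols)
    return {
        # Row and column addresses driven off chip at the same time.
        "parallel": row_bits + col_bits,
        # Row and column addresses share one set of pads, sent in turn.
        "serial": max(row_bits, col_bits),
    }


pads = address_pads(64, 64)
print(pads)
```

For a square array the serial count is exactly half the parallel count; for a rectangular array the saving is smaller.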
A preview of the transmitter architecture we developed is
shown in Fig. 1. In this section, we derive programs that describe the behavior of each of these blocks by following a synthesis methodology for asynchronous digital VLSI systems developed by Martin [28] (tutorial examples are provided in [15]).
His methodology involves applying a series of program decompositions and transformations, starting from a high-level specification. As each step preserves the logic of the original program, the resulting circuit is correct by induction. Thus, it is
unnecessary to deduce how these hardware processes behave
when executed in parallel, which is extremely difficult. After decomposing the transmitter specification into a set of concurrent
one-line programs, we transform these one-liners into hardware
processes in Section IV.
A. High-Level Specification
We start by writing a high-level specification in the concurrent hardware processes (CHP) language, a hardware description language for asynchronous systems [28]. In CHP, logic circuits “execute” concurrent programs. For example

BOAHEN: BURST-MODE WORD-SERIAL ADDRESS-EVENT LINK—I


Fig. 2. Transmitter specification: when a communication occurs on dataless
port $P_j$, its address j is output on port A as an a-bit word.

Fig. 1. Transmitter architecture: an interface circuit (H) relays requests to the
row arbiter (A) and permits that row to output its address and its events (S)
when the arbiter acknowledges. The events’ column addresses are generated by
the same procedure, after latching the row’s state (L). A two-way multiplexer
(T) outputs row (Y) and column (X) addresses, using separate request lines
(Ry,Rx); they share a single acknowledge line (Ack). Meanwhile, a controller
(C) cycles the array to another row.

The program, or process, has a name and takes a named argument; process and argument names are always set in
upper and lower case sans-serif font, respectively. As we are
describing hardware here, you should think of a process invocation as a call
to a silicon compiler that lays out a circuit with, for instance,
a datapath of the requested bit width. The construct *[ ] denotes infinite repetition; this
demarcates the body of the program. Semicolons (;) denote
sequential execution. An input communication inputs data from a named port
and stores it in a named local variable; port and variable
names are always set in italicized upper and lower case roman
font, respectively. Similarly, an output communication outputs the data stored in
a variable on a named port. A port name on its own is a dataless communication on that port; its
only effect is to synchronize the two processes whose ports are
connected together. That is, this process waits until the other
one gets to the corresponding point in its program, or vice
versa. In the text, we will write “port” before the name to distinguish the port
itself from a communication performed on that port, which we
write simply as the name. There is no such ambiguity in the code, as
only communications can appear in the body of the program.
A high-level block diagram of the address-event transmitter
is shown in Fig. 2. We use arbitration
to choose an active cell. It picks a guard
that is true and executes the corresponding program segment.1
In this case, the guard is the probe of a dataless port $P_j$, which evaluates to
true when there is a communication pending on that port (i.e.,
the other process is waiting). Also, the program segment communicates on that port and outputs its address simultaneously;
parallel lines (∥) are used to denote this. Thus, we have

The address is returned by a function call that converts
a one-hot code (one bit per port) to a binary one ($a$-bit).

1 If all the guards are false, it waits for at least one to become true.

Fig. 3. Row-column organization. (a) The lth row services its event generators
through its P ports and communicates with the columns
through its C ports. It also communicates with the row arbiter and the
row encoder. (b) The kth column communicates with the rows through
its R ports and communicates with the column arbiter and the column
encoder as well.

Alternatively, the transmitter process may be described succinctly using the CHP replication construct,
which joins a series of program segments with
any operator that can be concatenated. As the arbitration operator
can be concatenated, we have
The next step in the synthesis procedure is to decompose
this high-level specification into a hierarchy of concurrent
processes. These processes’ ports are then connected together
by channels. We present this connectivity information pictorially. These figures also give the names of instances
and their ports’ data types (e.g., that a given port
outputs bytes). Ports that are not defined as either input or
output are dataless—by default. Port names that appear inside
a box are local to that instance; those outside are local to the
process within which that instance occurs.
B. Reorganizing Into Rows and Columns
Here, we decompose the transmitter into separate row, column,
arbiter, and encoder processes. These processes are
connected as shown in Fig. 3. This decomposition is accomplished through four program transformations. For the first
transformation, we reorganize the transmitter’s dataless ports into
rows and columns. With this array, we have to use
one arbiter to choose a row and then use a second arbiter to
choose one of the ports in that row. Hence, we must OR (denoted
by ∨) together all requests within each row to generate requests
for the row arbiter. We also need to provide separate output
ports for row and column addresses, respectively.
Thus, we have

For the second transformation, we implement arbitration in a
separate process

It performs the second communication to ensure that the dataless port it picked has been completely serviced before it picks
another. Two instances of the arbiter process, one sized for the rows and one for the columns, are used
for row and column arbitration, respectively. To communicate
with these processes, we provide the remaining array process
with one set of dataless ports for the rows and another for the columns.
Once the row arbiter picks a row, we can use concurrency
to service every port
in that row that has a communication pending, before picking
another row. This construct executes, concurrently, all program
segments whose guards are true.2 Thus, we have

Note that a row address is output only when a new row is selected.
For the third transformation, we implement address encoding
in a separate process

This process chooses one of its dataless
ports using selection. Selection requires
that only one guard is true at a time, which is indeed the case
here, as arbitration guarantees mutual exclusion. Two instances
of the encoder process, one sized for the rows and one for the columns, are used for row and column
encoding, respectively. We provide the array process with two further sets of dataless
ports to communicate with these
processes. Communications on these ports now replace the
corresponding address-output communications in its program above.
For the fourth and final transformation, we break up
the array process into row- and column-processes (see Fig. 3).
Dividing its program into subroutines yields the
following code for its row and column processes

We parallelize this serial-array-readout design, which was implemented in [27], in the next subsection.

2 This construct is not supported by CHP. Its use is discouraged because the
negated probes used to determine ineligibility can change from true to false at
any time. Concurrency waits for at least one guard to become true if necessary,
just like arbitration does.

Fig. 4. Parallel readout. (a) The row process’s C ports have been combined into a
single port (C) that outputs a row-wide word. (b) Column processes are replaced
by a bus that transfers these words to a mux.

C. Reading the Array in Parallel
Here, we transform the row and column processes to read all of a selected
row’s dataless ports in parallel, an innovation introduced
in this work (initially reported in [22]). This parallel read is accomplished
by merging the instances of the column process into a single
row-wide bus and modifying the row process
accordingly, as shown in Fig. 4. An integer with one bit per port can
represent the state of a row’s dataless ports. If we set a port’s bit
when its probe becomes true, the bit may serve as a proxy for
the probe in the previous
program. We just have to remember to clear all the
bits once the row is read. Thus, we have

One instance of the bus replaces the instances of the column process
we had before.
We transfer the row word to a latch whose cells execute the
code given for the column process previously, with each bit serving
as a proxy for the corresponding column’s port. Hence, we test the bit instead of probing the
column’s port and clear the bit instead of doing the
communication. Thus, we have

The second communication now signals that the latch is
empty. The latch is connected as shown in Fig. 5(a). It receives
the row’s data from the mux, a process that we shall describe
shortly. This completes our transformation of the row and column processes
to support parallel reading.
Our final transformation is to transmit row and column
addresses sequentially, over the same output; this innovation is revealed here for the first time. The addresses are
multiplexed by a separate process, the mux, whose argument
specifies the number of address
bits. As shown in Fig. 5(b), while it relays addresses from
the row encoder or the column encoder
to the transmitter’s output (port T), the mux also relays
row-data from the bus to the latch.
Coordinating these two procedures makes it possible to send
the empty signal, a reserved address, when the latch
becomes empty. Thus, we have

Recall that the second communication, which matches
the latch’s second communication above, signals that the
latch is empty. Column addresses are handled by a concurrent
process, under the assumption that they show up after the
row address does. This timing assumption can be avoided by
forcing the mux to wait for the new row address to appear,
ignoring all column addresses after it relays the empty signal.

Fig. 5. Row-word latch and burst generator. (a) A latch receives
row-words from the mux and communicates with the column arbiter
and the column encoder. (b) A mux receives row and column
addresses from the encoders and outputs them on the
transmitter’s T port; it also relays row-words from the bus to the latch.

In summary, we have decomposed the transmitter into six kinds of concurrent
process: multiple instances of the row process and one instance each of
the bus, the latch, and the mux. The remaining two are the arbiter and
the encoder; there are two instances of each, one for the rows and one for the columns.
The next step in the synthesis procedure is to compile these
CHP programs into hardware.
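
The burst format that results from these transformations can be sketched in software. This is a behavioral illustration, not the CHP: it replaces arbitration with a fixed row order, omits the reserved empty address, and uses hypothetical names (`encode_bursts`, `decode_bursts`, and the "Y"/"X" tags, which borrow the row and column labels of Fig. 1).

```python
from typing import Iterable, Iterator, Optional, Sequence, Tuple

Word = Tuple[str, int]  # ("Y", row address) or ("X", column address)


def encode_bursts(array: Sequence[Sequence[bool]]) -> Iterator[Word]:
    """Emit one row address followed by a column address per active cell."""
    for row_index, row in enumerate(array):
        latched = list(row)  # parallel read: snapshot the whole row at once
        if not any(latched):
            continue  # this row made no request
        yield ("Y", row_index)  # the row address is sent once per burst
        for col_index, active in enumerate(latched):
            if active:
                yield ("X", col_index)


def decode_bursts(words: Iterable[Word]) -> Iterator[Tuple[int, int]]:
    """Recover (row, column) events by remembering the last row address."""
    current_row: Optional[int] = None
    for kind, address in words:
        if kind == "Y":
            current_row = address
        elif current_row is None:
            raise ValueError("column address received before any row address")
        else:
            yield (current_row, address)


activity = [
    [False, True, True, False],
    [False, False, False, False],
    [True, False, False, True],
]
words = list(encode_bursts(activity))
events = list(decode_bursts(words))
print(words)
print(events)
```

Here four events cost six words on a bus half as wide; the overhead of the row address shrinks as more cells in a row are active at once.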
IV. TRANSMITTER IMPLEMENTATION
Electrically, processes set (x+) or clear (x-) an output
signal x, or wait for an input signal y to become true ([y])
or false ([~y]); tilde denotes logical complement. To
communicate, they must perform complementary four-phase
sequences of actions and waits: *[ao+;[ai];ao-;[~ai]] on an
active port A and *[[pi];po+;[~pi];po-] for its passive counterpart
P, where * denotes repetition, just like in CHP. We
always append i and o to the port’s name to indicate its input
and output signals, respectively. Such signal names are always
set in lower case typewriter font. The active port’s output signal
ao is commonly called Request; the passive port’s po is the
so-called Acknowledge. At the signal-level, we refer to A as
(ao, ai) and to P as (pi, po), with request first and acknowledge
second in both cases.
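
The four-phase protocol can be illustrated with two cooperating generators, one per end of a channel. This is a software sketch only: the shared dictionary stands in for the two wires, the round-robin scheduler is an assumption made for determinism, and the names `active_end`, `passive_end`, and `run` are hypothetical.

```python
from typing import Dict, Iterator, List


def active_end(wires: Dict[str, bool], trace: List[str]) -> Iterator[None]:
    """Active port: raise request, await acknowledge, then return to zero."""
    while True:
        wires["req"] = True
        trace.append("req+")
        while not wires["ack"]:
            yield  # wait for acknowledge to be set
        wires["req"] = False
        trace.append("req-")
        while wires["ack"]:
            yield  # wait for acknowledge to be cleared


def passive_end(wires: Dict[str, bool], trace: List[str]) -> Iterator[None]:
    """Passive port: await request, raise acknowledge, then return to zero."""
    while True:
        while not wires["req"]:
            yield  # wait for request to be set
        wires["ack"] = True
        trace.append("ack+")
        while wires["req"]:
            yield  # wait for request to be cleared
        wires["ack"] = False
        trace.append("ack-")


def run(rounds: int) -> List[str]:
    """Alternate between the two ends and record every signal transition."""
    wires = {"req": False, "ack": False}
    trace: List[str] = []
    ends = [active_end(wires, trace), passive_end(wires, trace)]
    for _ in range(rounds):
        for end in ends:
            next(end)
    return trace


print(run(2))
```

Each complete communication produces the same four transitions in the same order, whatever the relative speed of the two ends, because every action waits on the other side's previous one.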
We have three choices of signal representations for data. (1)
Bundled-data requires a single line per bit, in addition to the
request and acknowledge signals. The data is valid when the
request signal is set; otherwise it is invalid. (2) Straight-data
dispenses with the request signal. Instead, all zeroes signifies
invalid data; any other word is considered valid. Both representations require matched delays, for data as well as request.
(3) Dual-rail achieves delay-insensitive operation by encoding
each bit using two lines: bit-is-true and bit-is-false (denoted by
appending a suffix to the signal name). The data are invalid when both are cleared;
setting either transmits a one or a zero.
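
The validity rules for the second and third representations can be written down directly. This is a minimal sketch with hypothetical function names; it treats a word as a list of booleans and models each dual-rail bit as a (bit-is-true, bit-is-false) pair.

```python
from typing import List, Optional, Tuple

Rails = List[Tuple[bool, bool]]  # one (bit_is_true, bit_is_false) pair per bit


def straight_data_valid(word: List[bool]) -> bool:
    """Straight-data: any word other than all zeroes is valid."""
    return any(word)


def dual_rail_encode(bits: List[bool]) -> Rails:
    """Set exactly one of the two lines for every bit."""
    return [(bit, not bit) for bit in bits]


def dual_rail_decode(rails: Rails) -> Optional[List[bool]]:
    """Return the word once every bit is valid, or None while any is neutral."""
    if any(is_true and is_false for is_true, is_false in rails):
        raise ValueError("both lines of a bit are set")
    if not all(is_true or is_false for is_true, is_false in rails):
        return None  # at least one bit still has both lines cleared
    return [is_true for is_true, _ in rails]


word = [True, False, True]
print(dual_rail_decode(dual_rail_encode(word)))
print(dual_rail_decode([(False, False)] * 3))
print(straight_data_valid([False, False, False]))
```

Note the cost of each choice: straight-data cannot carry an all-zero word, while dual-rail doubles the wire count but lets the receiver detect validity from the data lines alone.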
Handshaking expansion (HSE) is the procedure whereby
each communication in our CHP programs is fleshed out into
a full four-phase request–acknowledge sequence. Following
Martin’s synthesis procedure [28], we make two choices when
we perform HSE. First, we make output ports active and input
ports passive.3 The only exception is that a port that is probed
must be passive, as the probe is implemented simply by reading the port’s input signal.
Symmetric links—dataless ports that are not probed on either
end—are dealt with on a case by case basis. Second, we use the
second half of the four-phase handshake to implement a second
communication on the same port—a two-phase handshake—if
these communications always occur in pairs. This optimization
is possible because the second half just returns the signals to
their initial state. So we are free to clear them whenever it is
convenient to do so, a process known as reshuffling.
The final step in Martin’s synthesis procedure is compiling
HSE sequences into production-rule sets (PRS), which are
straightforward to implement with CMOS transistors. A
production-rule, G -> x-, clears a bit, x, when a boolean
expression, G, becomes true. We write ~G -> x+ to set the bit
when the expression is false. An nFET implements the former
rule while a pFET implements the latter; the two rules together
correspond to an inverter. Logical AND and OR (denoted by
& and ∨, respectively, in PRS or HSE) are implemented by
connecting FETs in series and in parallel. If the pull-up and
pull-down chains may both be inactive at the same time, a weak
feedback-inverter must be added to overcome their leakage
currents. Such outputs are said to be state holding, as opposed
to combinational; the feedback inverter is called a staticizer.
Active low signals are allowed in PRS and at the circuit level;
their names have an underscore prepended (e.g., _x).
We present only the final HSE sequences and the synthesized
circuits in this section. Details of how we arrived at these reshufflings and how we compiled them into PRS are in the Appendix.
We recommend that you refer to Fig. 6 to see how these circuits
interact as you read their descriptions. To facilitate this, we include the block labels in this figure in HSE sequences and in
subsequent figure captions.
A. Reading
[see Section III-C and
The CHP program for
when the event-generFig. 4(a)] calls for us to set a bit
ator initiates a communication (on port
) and readout this
data onto the column lines (port ) when this particular row is
must communicate
selected by the arbiter (port ).
with the address encoder as well (port ). These operations
are implemented in this section, together with the CHP for
.
passive and ports , , and active and
We made port
we chose a straight-data representation for port , where all
zeroes indicates invalid data. We separated event-generator and
3Our choice is arbitrary—the direction that data flows is not necessarily restrictive.

1274

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 7, JULY 2004

Fig. 7. Event generator [S] and arbiter [H] interfaces. (a) Interfaces eventgenerator ( pi,po) with arbiter-interface (w,s) and latch-cell (co). (b) Interfaces row (p,s), or latch-cell, with arbiter (ro,ri), address encoder (ao), and
mux ( ci).

event-arbiter signals into two sequences, introducing two intermediaries, p and s (see Fig. 6), for them to communicate with: p goes high when an event occurs anywhere in this row while s goes high when the row is selected (see Appendix, part A). This partitioning resulted in the following reshuffled HSE sequences

# row (S,H) #
*[[pi];w+;[s];co+,po+;[~pi];w-;[~s];co-,po-]
‖*[[p];ro+;[ri&~ci];s+;[ci&~p];ro-;[~ri];s-]

where ‖ denotes parallel execution, just like in CHP. p is the OR of all the w bits. For brevity, the bit subscript has been suppressed and the address-encoder communication has been omitted (see Appendix, part A).
Compiling the first sequence into PRS (see Appendix, part A) yielded the circuit shown in Fig. 7(a). These two gates are asymmetric variations of the C-element, whose output is set when both inputs are high and cleared when both are low; they are called aC-elements. Following the sequence, s is low initially, so when an event arrives (pi), w goes high, which prompts s to go high. Consequently, co and po go high, which prompts the event generator to withdraw pi. As a result, w is cleared, which prompts s to go back low, thereby clearing co and po to terminate the cycle.
We eliminated the staticizers to save space, but this simplification produced a race condition when the first gate is disabled by s [see Fig. 7(a)]. If the event generator's request line has not discharged all the way to ground, the pull-down continues to pass current and clears w. Thus, it is important for the event-generator to produce a fast, clean, downward transition [8], [25] on this line. If not, the staticizers must be included. The absence of staticizers also makes the circuit susceptible to charge sharing, which can be largely avoided by placing the series-connected n and p transistors in the order shown.
Compiling the second sequence into PRS (see Appendix, part A) yielded the circuit shown in Fig. 7(b), which also consists of two aC-elements. Following the sequence, when p becomes high, ro goes high, which prompts the arbiter to take ri high. Provided no column communication is in progress (ci is inactive), s goes high, selecting the row. Once the column communication starts (ci) and the row's events have been cleared (p), ro goes low, causing ri to go back low, which clears s. A new cycle can now begin, but s cannot go high again until ci becomes inactive.
We relay requests from the row's multiple event-generator interfaces to this arbiter-interface using the circuit shown in Fig. 8. This circuit ORs together all w bits in that row (they are tied to l1i, l2i, etc.) to generate p (tied to ro) and broadcasts the s signal (tied to ri) (see footnote 4). This staticized design is more power-efficient and noise-immune than the nMOS-style wired-OR used previously [6], [10], [15]. The address encoder is implemented as described in [15].
The merging stage combines words from the rows onto a single port, which we chose to be active [see Section III-C and Fig. 4(b)]. We implemented this merge by feeding the rows' straight-data outputs to staticized wired-OR gates (see Fig. 8), connected in a column-wise fashion. Instead of steering the acknowledge signal to the row that was read, it is tied directly to all the rows' arbiter-interfaces (see Fig. 6). While simplifying the steering circuit to a single wire, this global acknowledge signal also blocks newly selected rows from proceeding until on-going column communication is completed, making the aggressive arbiter-interface reshuffling presented above safe (see Appendix, part A).

Fig. 6. Transmitter schematic. The row consists of event-generator interfaces (S) and an arbiter interface (H). The arbiter consists of 1-in-2 arbiter cells (A). The latch's bit cells consist of a memory (L) and an arbiter interface (H). The merge consists of a set of wires (from S to L). The encoder (two instances) consists of b address lines (the extra column-address line is the request) and combinational logic (represented by discs). Latch (F) stores the row address while staticizers (G) hold the column address. The mux consists of two controllers: one (T) switches the address-mux (J), the other (C) cycles the array.

4 The select signal is not restricted to the active cells because it is used to prevent inactive ones from becoming active. Thus, this broadcast deals with the negated probe instability that plagues concurrency.
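The C-element behaviour used throughout this section can be sketched in software. The following is a minimal behavioural model, not a description of the transistor-level circuits in Fig. 7: `c_element` follows the definition in the text (set when both inputs are high, cleared when both are low), while `ac_element` is one assumed asymmetric variant, in which clearing depends on a single input, mirroring the shape of the rule pair `pi&~s->w+`, `~pi->w-`. The function names are hypothetical.

```python
def c_element(a: bool, b: bool, prev: bool) -> bool:
    """Symmetric C-element: set when both inputs are high, cleared when both are low."""
    if a and b:
        return True
    if not a and not b:
        return False
    return prev  # inputs disagree: hold the previous output


def ac_element(a: bool, b: bool, prev: bool) -> bool:
    """Assumed asymmetric variant: setting needs both inputs, clearing needs only `a` low."""
    if a and b:
        return True
    if not a:
        return False
    return prev  # a high, b low: hold the previous output


if __name__ == "__main__":
    out = False
    for a, b in [(True, False), (True, True), (False, True), (False, False)]:
        out = c_element(a, b, out)
        print(f"a={a!s:5} b={b!s:5} -> out={out}")
```

The only difference between the two models is the input pattern (a low, b high): the symmetric element holds its state there, whereas the asymmetric one clears.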

BOAHEN: BURST-MODE WORD-SERIAL ADDRESS-EVENT LINK—I


Fig. 8. ORing request signals. ORs multiple requests, l1i, l2i, ..., or lni, together to create a single request, ro, and broadcasts the acknowledge, ri, to all n ports. ri clears ro, but not until all lki are cleared, because the pFET is not strong enough to overcome an nFET.

B. Choosing
The arbiter is built out of 1-in-2 arbiter cells, as shown in Fig. 1 (see [15] for the case that is not a power of two). Each 1-in-2 arbiter cell has two daughter-side ports, l1 and l2, and one parent-side port, r (see Fig. 9). Ports l1 and l2 are connected to its daughters' r ports while port r is connected to its parent's l1 or l2 port. Only after communicating on r does it perform a communication pending on l1 or l2, arbitrating between them if necessary. Thus, requests are relayed up the tree by probing the daughter-to-parent channels, while choices are steered down the tree by communicating on the same channels. A second pair of communications guarantees mutual exclusion.
We decomposed the cell into two processes by isolating its l1-to-r and l2-to-r communications, and provided a third process to arbitrate between these two. Ports in different processes with the same name are connected together. For the communication processes, we made ports l1 and l2 passive and ports a1, a2, and r active. After reshuffling to optimize performance (see Appendix, part B), we obtained the following HSE sequence for the first communication process:

# arb (A) #
*[[l1i&~ri];a1o+;ro+;[ri&a1i];l1o+;
[~l1i];a1o-;ro-;[~a1i];l1o-].

The second communication process is identical; just replace 1 with 2. Their two ro signals are ORed together to generate a single request signal.
Compiling the sequence above into PRS (see Appendix, part B) yielded the circuit shown in Fig. 9. Two cross-coupled NAND gates perform arbitration [16]. Their inputs are active-high and their outputs are active-low, complementary to a set-reset flip-flop. The aC-elements activate these inputs when requests are received, provided the parent's acknowledge _ri is inactive (i.e., high). The NAND gates' outputs drive a circuit that steers the parent's acknowledge to either daughter (l1o or l2o). This steering circuit NORs these active-low signals, _a1i and _a2i, with _ri to produce the outgoing acknowledges, l1o or l2o. To filter out metastability, the NOR-gates' pull-ups are not powered up unless _a1i and _a2i differ by more than the threshold voltage [16], [28].

Fig. 9. Two-input arbiter cell [A]: interfaces two daughter cells, (l1i, l1o) and (l2i, l2o), with a parent cell, (ro, ri), using two asymmetric C-elements (aC), a pair of cross-coupled NAND gates, a steering circuit, and an OR gate.

When these 1-in-2 arbiter cells are connected in a binary tree, requests are selected by a post-order traversal. That is, a node is visited, and then, its daughters are visited, and so on, recursively. However, a daughter that is not requesting is not visited and a daughter that makes another request is not revisited until the entire tree has been traversed. Each daughter is visited only once because the ~ri wait in the arb sequence above blocks a second request from being serviced with the same acknowledge signal, unlike the greedy arbiter design presented in [15], [27], which would revisit the same daughter over and over again. Complete traversal makes this new arbiter design a fair one, in that it will not service the same client again until all those waiting have been serviced. Fairness is critical if parallel readout is to be fully exploited, as discussed in the companion paper [24].
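The difference between greedy and fair service can be illustrated with a small software model. This is only a sketch of the service order, not of the arbiter tree or its circuits: the greedy policy is stood in for by "always pick the lowest-numbered requester", and the fair policy by "serve everyone waiting at the start of a traversal before serving anyone twice". The function names and the `rerequest` set (clients that request again immediately after being served) are assumptions made for illustration.

```python
def greedy_order(requesting, rerequest, grants):
    """Stand-in for a greedy arbiter: always grant the lowest-numbered requester."""
    pending = set(requesting)
    order = []
    while len(order) < grants and pending:
        client = min(pending)
        order.append(client)
        pending.discard(client)
        if client in rerequest:
            pending.add(client)  # immediately competes again
    return order


def fair_order(requesting, rerequest, grants):
    """Stand-in for a fair arbiter: a client is not served again until all waiting ones are."""
    pending = set(requesting)
    order = []
    while len(order) < grants and pending:
        sweep = sorted(pending)  # everyone waiting when this traversal starts
        for client in sweep:
            if len(order) == grants:
                break
            order.append(client)
            pending.discard(client)
            if client in rerequest:
                pending.add(client)  # served on the next traversal, not this one
    return order


if __name__ == "__main__":
    print(greedy_order([0, 1, 2, 3], {0}, 6))
    print(fair_order([0, 1, 2, 3], {0}, 6))
```

With client 0 re-requesting continuously, the greedy model never reaches clients 1 to 3, whereas the fair model serves each waiting client once before returning to client 0.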
This fair arbiter design is optimized for speed, in that new requests can propagate up the tree while old ones are still being serviced. It does not require requests at all levels of the tree to be cleared before the acknowledges are cleared (see Fig. 1). Traversing the entire tree like this (starting at the bottom and propagating all the way up to clear the requests, and then starting at the top and propagating all the way down to clear the acknowledges) would be painfully slow. Instead of waiting for its parent's acknowledge to clear before it clears its own acknowledge, the cell clears its acknowledge as soon as its daughter's request clears (see the arb sequence above). However, it blocks new requests until its parent's acknowledge clears, which just requires the cell to clear its own request. The cell does this once both of its daughters' requests are cleared. Thus, new requests propagate up until they encounter a cell whose other request is currently selected.

In fact, they can get all the way up to the top cell, even while it is servicing the other half of the tree.

C. Writing
The CHP program for the latch calls for us to store data read from a row and then communicate with the arbiter and the column encoder. These communications are performed for each bit that is set, and then the bit is cleared. When all the bits have been cleared, a second communication is performed to signal that the latch is empty. These operations are implemented in this section, together with the part of the mux that coordinates them.
The mux hands data from the array to the latch, converting it from a straight-data representation to a bundled-data one [see Section III-C and Fig. 5(b)]. We separated the read operation on its passive port and the write operation on its active port into two HSE sequences. We also introduced a pair of local variables, d and t, for them to communicate with (see Fig. 6). The read sequence takes d high when the data appears; the write sequence responds by taking t high when the data is latched. We present the read sequence below but we postpone implementing the write sequence till Section IV-D, in order to synchronize it with row address transmission (see Section III-C). The straight-to-bundled-data converter is implemented simply by performing a bit-wise OR on the column data to generate a request signal (di below).
We obtained the following reshuffling for the read sequence (see Appendix, part D):

# xfr (C) #
*[[di];d+;[t];do+;[~di];d-;do-;[~t]].

Compiling this sequence into PRS (see Appendix, part D) yielded the circuit shown in Fig. 10(a). Following the sequence, when column data arrives (di), d is set; once the write sequence signals that the data is latched (t), the circuit acknowledges the array (do). When the array withdraws its data, d and do are cleared, and once t clears, a new cycle may begin. However, new column data can show up immediately after the array is released, so this reshuffling allows us to cycle to the next row and present its data even before the latch becomes empty.

Fig. 10. Data-transfer [C] and latch [L] circuits. (a) Interfaces array (di, do) with mux (d, t). do is forced low during reset. (b) Stores (w) row cell's output (rx) under control of mux (ri, ro) and communicates with column arbiter interface (go, gi).

Now, we proceed with implementing the latch [see Section III-C and Fig. 5(a)]. First, we decompose its CHP program into two processes. A port of the first process, which stores the bit, is connected to a port of the second one, which interfaces with the arbiter. The arbiter interface turned out to be the same as that used by the rows [see Section IV-A and Fig. 7(b)]; we simply connect the memory cell to it in place of the row. For the memory cell, we used a bundled-data representation for port r and made it passive, and we made port g active. These choices yielded the following reshuffled HSE sequence:

# latch (L) #
*[[ri&rk];w+;go+;ro+;[~ri&gi];w-;go-;[~gi];
ro-].

Compiling the sequence above into PRS (see Appendix, part C) yielded the circuit shown in Fig. 10(b). Following the sequence, when the mux requests a write (ri) and this cell's bit is set (rk), w is set and go and ro go high. Once the mux withdraws its request and the arbiter interface responds (gi), w and go are cleared, but ro stays high until gi clears. When this happens, ro goes low, and now a new cycle can begin. We OR together the ro signals from all memory cells to generate a single write-acknowledge. Even though this OR-gate is triggered by the first bit that is set, the delay in clearing the request signal keeps the latch transparent for a while, giving tardy bits the chance to be written.
D. Bursting
The CHP program for the mux calls for us to write row-words to the latch, to multiplex row addresses and column addresses onto the transmitter's output, and to signal when the latch is empty [see Section III-C and Fig. 5(b)]. These operations are implemented in this section; reading the row's data was implemented in Section IV-C.
We use the following three-wire-handshake sequence for the transmitter's output port:
*[tro+;[ti];tro-;[~ti]]
‖*[[ti];tco+;[~ti];tco-].

This protocol allows us to transmit multiple column addresses by executing the second sequence as many times as desired, halfway through the first sequence. The first sequence's first half transmits the row address, while the second half terminates the burst. Fig. 6 shows the correspondence between these signals and the transmitter's Ry, Rx, and Ack signals mentioned earlier.
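The burst format implied by this protocol (a row address, then one column address per active cell, then a termination) can be sketched at the word level. The encoding below is an illustrative assumption: the tuple tags `"row"`, `"col"`, and `"end"`, and the function names, are not from the paper, and rows are emitted in sorted order here purely for determinism, whereas the actual transmitter services rows in the order the arbiter selects them.

```python
from itertools import groupby


def encode_bursts(events):
    """Turn (row, col) events into a word stream: row address, column addresses, terminator."""
    words = []
    for row, group in groupby(sorted(events), key=lambda e: e[0]):
        words.append(("row", row))                       # first half of the row handshake
        words.extend(("col", col) for _, col in group)   # one column handshake per active cell
        words.append(("end", None))                      # second half terminates the burst
    return words


def decode_bursts(words):
    """Rebuild (row, col) events from the word stream produced by encode_bursts."""
    events, row = [], None
    for kind, value in words:
        if kind == "row":
            row = value
        elif kind == "col":
            events.append((row, value))
        else:  # "end"
            row = None
    return events


if __name__ == "__main__":
    stream = encode_bursts([(2, 5), (0, 1), (2, 7)])
    print(stream)
    print(decode_bursts(stream))
```

Note that the row address appears once per burst rather than once per event, which is the property the word-serial format relies on.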
First, we implement the row communications. We will merge reading the row's address with reading its data, and thereby use the read sequence presented in Section IV-C for both. Hence, the (d, t) signals introduced above will serve as proxies for the address read as well.

That takes care of reading. For writing, we use the following reshuffled sequence (see Appendix, part D):

# yctl (T) #
*[[~ti&d&si];tro+,so-;[ti&~d&~si];tro-,so+]

which takes care of both the row-address transmission (tro) and the latch write (so); t transitions at the same time as tro. Compiling this sequence into PRS (see Appendix, part D) yielded the circuit shown in Fig. 11(a). Following the sequence, once the row's address and data have appeared (d) and been latched (si), and assuming ti is low, tro is set and so is cleared. These signals are restored when ti goes high and d and si are cleared.
Now we implement the column communication, using the following reshuffled sequence (see Appendix, part D):

# xctl (T) #
*[[ti&ci];co+,tco+;[~ci&~ti];co-,tco-].

Compiling this sequence (see Appendix, part D) yielded the circuit shown in Fig. 11(b). As tro is high for the burst duration, and ti is high initially, we only have to wait for ci to become high. When that happens, co and tco go high, which switches the mux to the column address (see Fig. 6). When both ci and ti become low, these signals are cleared. And when ci and ti go high in response, a new cycle can begin. Since the mux switches back to the row address when tco is low, this address can be reread at any time. If it so desires, the receiver can buy time to do this by taking its acknowledge high a little later.
We developed a reset strategy to recover when a row is selected but no data is delivered to the latch. This situation could arise if the event-generator's slew-rate is too slow (see Section IV-A). We activate the data-transfer circuit's array-acknowledge [do in Fig. 10(a)] to complete the stalled cycle. We also activate the row-address transmission circuit's request [so in Fig. 11(a)] to make the latch transparent, so that the next row can be written. Since the write failed, we do not bother to clear the contents of the latch. If the latch is not actually empty (e.g., during power-up), those bits will corrupt the next burst. However, after that one is sent, everything will be fine.

Fig. 11. Address transmission circuits [T]. (a) Synchronizes row-address transmission (tro, ti) with latch (so, si) and array (d, t). so and t are forced high during reset. (b) Synchronizes column-address transmission (tco, ti) with column arbiter interface (ci, co) and switches the address-mux (tco).

V. SUMMARY AND CONCLUSION

We have described an address-event transmitter that reads all
active cells in a selected row in parallel. Row activity is transmitted in a burst: the row address followed by a column address for each active cell, plus a termination signal. The array
is cycled to the next row while these events are being transmitted, so the next burst can start as soon as this one ends. Bursts
are communicated using a three-wire protocol: a row-request, a
column-request, and a common acknowledge. In return for the
extra request line, output pads are cut by 50%—without sacrificing throughput—as the row address is not repeated.
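
The 50% figure can be checked with a short calculation. The sketch below counts address pads only (request and acknowledge lines are left out) and assumes that the word-serial link shares one set of address pads between row and column words, while a word-parallel link needs separate pads for each; the function name and parameters are hypothetical.

```python
def address_pads(rows: int, cols: int, word_serial: bool) -> int:
    """Number of address pads, excluding request/acknowledge lines (illustrative count)."""
    row_bits = (rows - 1).bit_length()  # bits needed to number `rows` rows
    col_bits = (cols - 1).bit_length()  # bits needed to number `cols` columns
    if word_serial:
        return max(row_bits, col_bits)  # row and column words share the same pads
    return row_bits + col_bits          # row and column addresses sent side by side


if __name__ == "__main__":
    for n in (32, 64, 128):
        parallel = address_pads(n, n, word_serial=False)
        serial = address_pads(n, n, word_serial=True)
        print(f"{n}x{n}: parallel={parallel} pads, word-serial={serial} pads")
```

For a square array the shared pads come to exactly half the word-parallel count; for non-square arrays the saving under these assumptions is smaller.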
In terms of cell area, the cost of parallel-readout is minimal.
Whereas previous transmitter designs add four transistors to the
cell (reviewed in [15]), our design requires nine transistors. Both
of these counts include the transistor that pulls down the row-request line but do not include the one that resets the event-generator. On the other hand, our design requires just one line per
column whereas previous designs require two, since they select
columns individually. Trading a metal line for five transistors
is highly favorable when wires are at a higher premium than
transistors, which is increasingly the case. Thus, the increased
throughput—and scalability—parallelism offers [24] is attained
at little cost in hardware.
We also illustrated how to synthesize an asynchronous implementation starting from a high-level specification by way of
a concrete example. The result was eight logic circuits that, together, can be used to implement a burst-mode, word-serial, address-event transmitter of any desired size. These circuits include an arbiter design that allows parallelism to be exploited
fully by ensuring that a row is not reread until all those waiting
are serviced. We have laid out a library of cells (in MOSIS
DEEP_SUBM rules) for these circuits and written a silicon compiler to tile them to fit any desired pixel- or array-size.
Thus far, this tool has successfully compiled transmitters for
three generations of chips, fabricated in 0.6-, 0.4-, and 0.25-µm
technology [24].
APPENDIX
LOGIC SYNTHESIS
When compiling HSE into PRS, we perform two passes. On the first pass, we make the wait before an action the guard for its production rule. For example, *[[pi];po+;[~pi];po-], the passive port's sequence, is realized by the set pi->po+, ~pi->po-, which is implemented by a wire. On the second pass, we strengthen guards that can become true at some other point in the sequence by ANDing with another boolean variable. If all signals are in exactly the same state at these two points, we add a state variable to distinguish them, setting it after we pass the first point and clearing it after we pass the second point, or vice versa.


For example, the CHP process *[A;P], where A is active and P passive, as above, could be augmented with the state variable s as follows:
*[ao+;[ai];s+;ao-;[~ai];[pi];po+;[~pi];
s-;po-].

s's state now distinguishes the point where A ends and P begins from the point where P ends and A repeats. Alternatively, the ambiguous state can be eliminated if we begin P before A ends. For example
*[ao+;[ai];[pi];po+;ao-;[~ai];[~pi];po-]

is unambiguous. This reshuffling, if acceptable, is cheaper to implement, as it does not require a state variable.
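
The ambiguity described here can be checked mechanically. The sketch below is an illustrative simplification, not the synthesis procedure used in the paper: it walks one cycle of a handshaking expansion as a linear list of transitions (waits are written as the environment transition they wait for), assumes all signals start low, and reports any signal state that is reached twice with different obligations for the circuit. The function name and the `outputs` argument are hypothetical.

```python
def ambiguous_states(seq, outputs):
    """Return states that recur with different next output actions.

    seq     -- one cycle of transitions, e.g. ['ao+', 'ai+', 'ao-', ...]
    outputs -- names of the signals driven by the circuit (the rest are inputs)
    """
    signals = sorted({t[:-1] for t in seq})
    state = {name: False for name in signals}  # assumption: every signal starts low
    seen = {}
    conflicts = []
    for transition in seq:
        name, value = transition[:-1], transition.endswith("+")
        key = tuple(state[n] for n in signals)
        # What the circuit must do in this state: fire an output, or wait (None).
        action = transition if name in outputs else None
        if key in seen and seen[key] != action:
            conflicts.append((key, seen[key], action))
        seen.setdefault(key, action)
        state[name] = value
    return conflicts


if __name__ == "__main__":
    plain = ["ao+", "ai+", "ao-", "ai-", "pi+", "po+", "pi-", "po-"]
    with_s = ["ao+", "ai+", "s+", "ao-", "ai-", "pi+", "po+", "pi-", "s-", "po-"]
    reshuffled = ["ao+", "ai+", "pi+", "po+", "ao-", "ai-", "pi-", "po-"]
    print(ambiguous_states(plain, {"ao", "po"}))
    print(ambiguous_states(with_s, {"ao", "po", "s"}))
    print(ambiguous_states(reshuffled, {"ao", "po"}))
```

In this model the plain sequence has one conflict: the all-low state occurs both where ao+ must fire and where the circuit must wait for pi. The state-variable and reshuffled versions have none.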
We can often avoid adding state variables by reshuffling sequences in this way, provided the change in sequencing is benign. When compiling PRS for such sequences, multiple preceding waits are ANDed together and preceding actions become guards too. Another goal of reshuffling is symmetry: clearing signals in the same order that you set them. Such symmetry makes the signals that appear in the pull-up and the pull-down the same. This duplication usually results in a simpler implementation, as the pull-up is disabled when the pull-down is active, and vice versa.
It is sometimes possible to convert a state-holding gate into a combinational one, thereby avoiding the need for a staticizer. That is, to make the gate's pull-up and pull-down guards complementary. Such conversion is typically done by ORing terms with the pull-up and ANDing terms with the pull-down, or vice versa. For example,
requires a staticizer, since
is not an
is combinaidentity. However,
is an identity. In fact, that is a NAND
tional, since
gate. These added terms must have a benign effect, such that
at all points in the sequence, where is the original
guard and is the weakening term.
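
Whether a gate is combinational in this sense can be tested by brute force: the pull-up and pull-down guards must be complementary for every input combination. The sketch below is a minimal illustration using two generic guard pairs chosen for the example (a NAND gate and a C-element); they are not transcribed from the paper's lost formulas, and the function name is hypothetical.

```python
from itertools import product


def is_combinational(pull_up, pull_down, n_inputs):
    """True if exactly one of the two guards holds for every input combination."""
    return all(
        pull_up(*bits) != pull_down(*bits)
        for bits in product([False, True], repeat=n_inputs)
    )


if __name__ == "__main__":
    # NAND gate: output pulled down when both inputs are high, pulled up otherwise.
    nand = is_combinational(lambda a, b: not a or not b, lambda a, b: a and b, 2)
    # C-element: output set when both inputs are high, cleared when both are low;
    # neither guard holds when the inputs disagree, so the gate must hold state.
    c_elem = is_combinational(lambda a, b: a and b, lambda a, b: not a and not b, 2)
    print("NAND combinational:", nand)
    print("C-element combinational:", c_elem)
```

A gate that fails this check needs a staticizer (or equivalent) to hold its output in the input states where neither guard is true.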
A. Row
Making the row's p port passive, and its r, a, and c ports active (see Section III-C), yielded this single-bit HSE sequence
*[[pi];w+;ro+;[ri];co+,po+;[~pi];w-;
[ci];co-,po-;[~ci];ro-;[~ri]]

where the address-encoder communication has been omitted for the time being and the bit subscript is suppressed for brevity. If we move the second (two-phase) r communication ahead of the second c communication, the arbiter can start selecting the next row earlier. However, we must ensure that this newly selected row does not interrupt an on-going column communication. It can be blocked by advancing [~ci] forward to where [ri] occurs in the next cycle, which also provides more time to complete the column communication. Thus, the sequence becomes
*[[pi];w+;ro+;[ri&~ci];co+,po+;[~pi];w-;
[ci];ro-;[~ri];co-,po-]

where ci is broadcast to all the rows. Next, we augment the sequence with row-wide event (p) and selection (s) signals to support multiple bits. For p, which is the OR of all the bit-level w signals, we insert "p+;[p]" after "w+" and "p-;[~p]" after "w-". And for s, which mirrors ri, we insert "s+;[s]" after "[ri&~ci]" and "s-;[~s]" after "[~ri]". Thus, we obtain
*[[pi];w+;p+;[p];ro+;[ri&~ci];s+;[s];co+,po+;
[~pi];w-;p-;[~p];[ci];ro-;[~ri];s-;[~s];
co-,po-].

Finally, moving the row-level parts (i.e., from [p] to s+ and from [~p] to s-) into a separate (arbiter-interface) sequence yielded the reshuffling presented in Section IV-A (row).
We compiled our final reshufflings into the following PRS:
pi&~s->w+
~pi->w-
s&w->co+,po+
~s->co-,po-
p->ro+
~p&ci->ro-
ri&~ci->s+
~ri->s-.

We strengthened the guard of w+ with ~s to ensure that only those cells that were active when the row was selected participate, as required by concurrency. And we strengthened the guard for co+, po+ with w, to ensure that only active cells respond. The circuits are shown in Fig. 7.
We include the address-encoder communication by observing that it occurs simultaneously with the row's selection (see Section III-C). Hence, we can activate the encoder, as well as the row, with s, and combine the two acknowledges using a C-element (see footnote 5). Alternatively, since we activate the row and the encoder at the same time, we can assume the column bus's acknowledge indicates that both the row's state and its address have been latched. This timing assumption eliminates the C-element, but requires that we compensate for worst-case timing differences between the address-encoding and data-transfer processes.
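
To see how production rules of this kind reproduce the handshaking order, the single-bit rules for w, co, and po can be exercised in a toy simulator. This is a sketch under stated assumptions: the simulator fires one enabled rule at a time in list order (real gates fire concurrently), and the three environment rules, in which s simply follows w and the event generator withdraws pi once po is high, are hypothetical stand-ins for the arbiter path and the pixel. All names are illustrative.

```python
def simulate(rules, state, max_steps=100):
    """Fire the first enabled rule that changes a signal until none can; return the trace."""
    trace = []
    for _ in range(max_steps):
        for guard, var, value in rules:
            if state[var] != value and guard(state):
                state[var] = value
                trace.append(var + ("+" if value else "-"))
                break
        else:
            return trace  # no rule fired: the system is quiescent
    raise RuntimeError("did not settle within max_steps")


# Single-bit row rules, following the PRS for w, co and po given above.
ROW_RULES = [
    (lambda s: s["pi"] and not s["s"], "w", True),   # pi & ~s -> w+
    (lambda s: not s["pi"], "w", False),             # ~pi -> w-
    (lambda s: s["s"] and s["w"], "co", True),       # s & w -> co+
    (lambda s: s["s"] and s["w"], "po", True),       # s & w -> po+
    (lambda s: not s["s"], "co", False),             # ~s -> co-
    (lambda s: not s["s"], "po", False),             # ~s -> po-
]

# Hypothetical environment (assumption, not from the paper).
ENV_RULES = [
    (lambda s: s["w"], "s", True),       # row gets selected once it requests
    (lambda s: not s["w"], "s", False),  # selection is withdrawn after the request clears
    (lambda s: s["po"], "pi", False),    # event generator is reset by the acknowledge
]


def run_cycle():
    state = {"pi": True, "w": False, "s": False, "co": False, "po": False}
    return simulate(ROW_RULES + ENV_RULES, state)


if __name__ == "__main__":
    print(run_cycle())
```

Under these assumptions the trace follows the order of the first row sequence in Section IV-A: w is set, the row is selected, co and po are set, the event is withdrawn, and the signals clear in the same order.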
5 A two-input gate whose output is set when both inputs are high and cleared when both are low.

B. Arbiter
Making the cell's l1 and l2 ports passive and its r port active yielded this HSE sequence for the first communication process (see Section IV-B)
*[[l1i];a1o+;[a1i];ro+;[ri];l1o+;
[~l1i];l1o-;ro-;[~ri];a1o-;[~a1i]]

If we execute ro+ without waiting for [a1i], we will allow the upper levels to make decisions concurrently. We can maintain mutually exclusive access to r by delaying [~ri] until the next cycle. This so-called lazy-active reshuffling gave us
*[[l1i&~ri];a1o+;ro+;[ri&a1i];l1o+;
[~l1i];l1o-;ro-;a1o-;[~a1i]]

However, the other daughter is not excluded if her request (i.e., l2i) becomes active while ri is still false. In that case, we can kill two birds (service both daughters) with one stone (a single r communication). That is, once a1o- fires, the arbitration process will make a2i true, allowing the other communication process to get past [ri&a2i], where it is held up, and select the other daughter.
We can also get a1o- to happen faster by postponing l1o- till the end and clearing a1o and ro in the same order that we set them. These changes also make the sequence more symmetric, which simplifies the logic. Thus, we obtained the reshuffled HSE sequence presented in Section IV-B (arb). We compiled that sequence into this PRS
l1i&_ri->a1o+
~l1i->a1o-
~_a1i&~_ri->l1o+
_a1i|_ri->l1o-
~l1o->_l1o+
l1o->_l1o-
a1o|a2o->ro+
~a1o&~a2o->ro-

where we simply OR (AND) the two ro pull-ups (pull-downs) together. Weakening l1o-'s guard with _ri is safe since _a1i becomes true first. This is because, as you can see from the circuit (Fig. 9), the clearing request propagates through only two gates to set _a1i but it propagates through one gate in this cell plus three gates at the next level, and an inverter, to set _ri.
C. Latch
Making the latch's r port passive, and using a bundled-data representation, yielded this HSE sequence for its memory cell
*[[ri&rk];w+;ro+;go+;[gi];go-;[~gi];w-;
[~ri];ro-]

where port g is active (the bit subscript is suppressed). Moving the second two-phase g communication to the middle of the second (two-phase) r communication eliminates an ambiguous state. However, this reshuffling implies that the cell cannot start the second g communication (with the arbiter-interface) before the second r communication (with the mux) starts. The consequences of synchronizing these two-phase communications are dealt with in Appendix, part D. Swapping ro+ with go+ and w- with go- reduces asymmetry; both swaps are benign. These changes yielded the reshuffling given in Section IV-C (latch).
We compiled our final reshuffling into the following PRS:
ri&rk->w+
~ri&gi->w-
w->go+
~w->go-
go|gi->ro+
~go&~gi->ro-.

Weakening ro+'s guard is safe because gi+ happens later; strengthening ro-'s makes the gate combinational. The circuit is shown in Fig. 10(b).
D. Mux
The CHP for the mux's row communications calls for two of them to execute in parallel (see Section III-C). However, we allowed them to run in parallel only after the row's address and its data are latched, since we wished to merge the reads (see Appendix, part A). With two of its ports passive and two active, this strategy is realized by the HSE sequence
*[[di];so+;[si];(do+;[~di];do-)‖
(tro+;[ti];so-;[~si];tro-;[~ti])]

where di and do serve the merged read port. We broke this sequence up into two concurrent read and write sequences and synchronized them with two new variables, d and t
*[[di];d+;[t];do+;[~di];d-;do-;[~t]]
‖*[[d];so+;[si];t+;tro+;[ti];so-;
[~d&~si];tro-;[~ti];t-]

could have been executed anywhere between and
; we
went with symmetry. The first sequence is identical to the read
.
sequence given in Section IV-C
imWe reshuffled the write sequence further. Moving
mediately after the previous cycle’s
makes the latch transparent as soon as it becomes empty. This move allows us to
and
, and we can consolidate
as well by
merge
till after
using the lazy-active reshuffling. Postponing
allows the second (two-phase) communication to occur as
soon as we start transmitting the row’s address. These optimizations yielded the write sequence given in Section IV-D
.
This reshuffling deals with the consequences of synchro’s
and
communications. That is, we do not
nizing
communihamper the memory cell’s second (two-phase)
in Section IV-C), as
, which
cation (see
corresponds to
(see
in Section IV-D), becomes true at the beginning of the burst. Thus, the memory-cell
can complete its second communication with the arbiter-interface right away.
The read sequence above yielded the following PRS:
di&~t->d+
~di->d-
d&t->do+
~d|~t->do-.

Weakening do-'s guard is safe because t becomes false after d does. The circuit is shown in Fig. 10(a).
And we compiled the final reshuffling of the write sequence (see yctl in Section IV-D) into the following PRS:
~ti&d&si->t+,tro+,so-
ti&~d&~si&~tco->t-,tro-,so+.

To prevent the second rule from firing before the last column-address transmission is completed (see below), we strengthened its guard with ~tco. This precaution is necessary because when the column arbiter interface is acknowledged it clears its acknowledge to the memory cell, at which point si becomes false (see Fig. 6). Then, the second rule could fire while we are waiting for ti to become false in response to the last column address (see below). The circuit is shown in Fig. 11(a).
For the mux's column communications, making port c passive and port tc active yielded this HSE sequence
*[[ci];co+;[~ci];co-;[ti];tco+;[~ti];tco-].

Relocating the c communication's two halves a quarter and three-quarters of the way through the tc communication yielded the reshuffling presented in Section IV-D (xctl). Thus, reception is acknowledged (co+) at the same time transmission starts (tco+), which requires us to make the encoder's outputs state-holding (see Fig. 6). Therefore, we added staticizers to all its outputs, including the extra always-a-one line that serves as a request, and we tied a pFET to the request line, as in Fig. 8, to clear it.
We compiled the final reshuffling into the following PRS:
ti&ci&tro->co+,tco+
~ti&~ci->co-,tco-.

We have strengthened the guard of co+, tco+ with tro to block a column address from a newly loaded row from being transmitted while we are waiting for ti to clear, after tro goes low. The circuit is shown in Fig. 11(b).
ACKNOWLEDGMENT
The author would like to thank C. Higgins, T. Horiuchi,
B. Linares-Barranco, and T. Serrano-Gotarredona for their
invaluable help in beta-testing this interface, and fishing out and
documenting bugs. He would also like to thank P. Merolla for
helping with adding serial-address transmission to the design.
REFERENCES
[1] C. A. Mead and T. Delbruck, “Scanners for visualizing analog vlsi circuitry,” Analog Integ. Circuits Signal Process., vol. 1, pp. 93–106, 1991.
[2] W. Yang, “A wide-dynamic range low-power photosensor array,” in
Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC’94), vol. 37, San
Francisco, CA, 1994, p. 230.
[3] B. Fowler, A. E. Gamal, and D. Yang, “A CMOS area image sensor
with pixel-level A/D conversion,” in Proc. IEEE Int. Solid-State Circuits
Conf. (ISSCC’94), vol. 37, San Francisco, CA, 1994, pp. 226–227.
[4] L. G. McIlrath, “A low-power low-noise ultrawide-dynamic-range cmos
imager with pixel-parallel A/D conversion,” IEEE Trans. Solid-State
Circuits, vol. 36, pp. 846–853, May 2001.
[5] A. Murray and L. Tarassenko, Analogue Neural VLSI: A Pulse Stream
Approach. London, U.K.: Chapman and Hall, 1994.
[6] M. Mahowald, An Analog VLSI Stereoscopic Vision System. Boston,
MA: Kluwer Academic, 1994.
[7] K. A. Boahen, “The retinomorphic approach: pixel-parallel adaptive amplification, filtering, and quantization,” Analog Integr. Circuits Signal
Process., vol. 13, pp. 53–68, 1997.
[8] E. Culurciello, R. Etienne-Cummings, and K. Boahen, “Arbitrated
address event representation digital image sensor,” in Proc. IEEE Int.
Solid-State Circuits Conf. (ISSCC’01), Feb. 2001, pp. 92–93.
[9] J. Kramer, “An on/off transient imager with event-driven asynchronous
readout,” in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, 2002,
pp. II-165–II-168.
[10] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gillespie, “Silicon auditory processors as computer peripherals,” IEEE Trans.
Neural Networks, vol. 4, pp. 523–528, Mar. 1993.
[11] M. Sivilotti, “Wiring considerations in analog VLSI systems, with
application to field-programmable networks,” Ph.D. dissertation, Dept.
Comp. Sci., California Institute of Technology, Pasadena, CA, 1991.

[12] A. Mortara, E. Vittoz, and P. Venier, “A communication scheme for
analog VLSI perceptive systems,” IEEE J. Solid-State Circuits, vol. 30,
pp. 660–669, June 1995.
[13] A. Abusland, T. S. Lande, and M. Hovin, “A VLSI communication architecture for stochastically pulse-encoded analog signals,” in Proc. IEEE
Int. Symp. Circuits and Systems, vol. 3, May 1996, pp. 401–404.
[14] K.A. Boahen, “Communicating neuronal ensembles between neuromorphic chips,” in Neuromorphic Systems Engineering: Neural networks in
Silicon, T. S. Lande, Ed. Boston, MA: Kluwer Academic, 1998, ch.
11, pp. 229–262.
[15] K. A. Boahen, “Point-to-point connectivity between neuromorphic chips using
address-events,” IEEE Trans. Circuits Syst. II, vol. 47, pp. 416–434, May
2000.
[16] C. A. Mead, Introduction to VLSI Systems. Reading, MA: Addison
Wesley, 1980.
[17] J. G. Elias, “Artificial dendritic trees,” Neural Computation, vol. 5, pp.
648–663, 1993.
[18] S. R. Deiss, R. J. Douglas, and A. M. Whatley, “A pulse-coded communications infrastructure for neuromorphic systems,” in Pulsed Neural
Networks, W. Maass and C. M. Bishop, Eds. Boston, MA: MIT Press,
1999, ch. 6, pp. 157–178.
[19] C. M. Higgins and C. Koch, “Multi-chip motion processing,” in Proceedings of Conference on Advanced Research in VLSI. Los Alamitos,
CA: IEEE Comp. Soc. Press, 1999, vol. 20, pp. 309–322.
[20] S. P. DeWeerth, G. N. Patel, M. F. Simoni, D. E. Schimmel, and R.
L. Calabrese, “A VLSI architecture for modeling intersegmental coordination,” in Proc. 17th Conf. Advanced Research in VLSI, 1997, pp.
182–200.
[21] J. P. Lazzaro and J. Wawrzynek, “A multi-sender asynchronous extension to the address-event protocol,” in Proc. 16th Conf. Advanced Research in VLSI, 1995, pp. 158–169.
[22] K. A. Boahen, “A throughput-on-demand address-event transmitter for
neuromorphic chips,” in Proc. 20th Anniversary Conf. Advanced Research in VLSI, 1999, pp. 72–86.
[23] K. A. Boahen, “A burst-mode word-serial address-event link--II: Receiver design,” IEEE Trans. Circuits Syst. I, vol. 51, pp. 1281–1291, July 2004.
[24] K. A. Boahen, “A burst-mode word-serial address-event link--III: Analysis and
test results,” IEEE Trans. Circuits Syst. I, vol. 51, pp. 1292–1300, July
2004.
[25] C. A. Mead, Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley, 1989.
[26] M. Schwartz, Telecommunication Networks: Protocols, Modeling, and
Analysis. Reading, MA: Addison-Wesley, 1987.
[27] K. A. Boahen, “Retinomorphic vision systems II: communication
channel design,” in Proc. IEEE Int. Symp. Circuits and Systems, May
1996, pp. 14–17.
[28] A. Martin, “Programming in VLSI: From communicating processes to delay-insensitive circuits,” in Proceedings of UT Year of
Programming Institute on Concurrent Programming. Reading, MA:
Addison-Wesley, 1990, pp. 1–64.

Kwabena A. Boahen received the B.S. and M.S.E.
degrees in electrical and computer engineering
from The Johns Hopkins University, Baltimore,
MD, in the concurrent masters-bachelors program,
both in 1989, and the Ph.D. degree in computation
and neural systems from the California Institute of
Technology, Pasadena, in 1997.
He is an Associate Professor in the Bioengineering Department at the University of
Pennsylvania, Philadelphia, where he holds a
secondary appointment in electrical engineering.
His current research interests include mixed-mode multichip VLSI models of
biological sensory and perceptual systems, and their epigenetic development,
and asynchronous digital interfaces for interchip connectivity.
Dr. Boahen was awarded a Packard Fellowship in 1999, a National Science
Foundation CAREER Grant in 2001, and an Office of Naval Research YIP Grant
in 2002. He is a member of Tau Beta Kappa and has held a Sloan Fellowship
for Theoretical Neurobiology at the California Institute of Technology.

