A burst-mode word-serial address-event link--I: transmitter design by Boahen, Kwabena A
University of Pennsylvania
ScholarlyCommons
Departmental Papers (BE) Department of Bioengineering
July 2004
A burst-mode word-serial address-event link--I:
transmitter design
Kwabena A. Boahen
University of Pennsylvania, boahen@seas.upenn.edu
Copyright 2004 IEEE. Reprinted from IEEE Transactions on Circuits and Systems--I: Regular Papers, Volume 51, Issue 7, July 2004, pages 1269-1280.
Publisher URL: http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=29094&puNumber=8919
This material is posted here with permission of the IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of the
University of Pennsylvania's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this
material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by
writing to pubs-permissions@ieee.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
This paper is posted at ScholarlyCommons. http://repository.upenn.edu/be_papers/3
Recommended Citation
Boahen, K. A. (2004). A burst-mode word-serial address-event link--I: transmitter design. Retrieved from
http://repository.upenn.edu/be_papers/3
A burst-mode word-serial address-event link--I: transmitter design
Abstract
We present a transmitter for a scalable multiple-access inter-chip link that communicates binary activity
between two-dimensional arrays fabricated in deep submicrometer CMOS. Transmission is initiated by active
cells but cells are not read individually. An entire row is read in parallel; this increases communication capacity
with integration density. Access is random but not inequitable. A row is not reread until all those waiting are
serviced; this increases parallelism as more of its cells become active in the mean time. Row and column
addresses identify active cells but they are not transmitted simultaneously. The row address is followed
sequentially by a column address for each active cell; this cuts pad count in half without sacrificing capacity.
We synthesized an asynchronous implementation by performing a series of program decompositions, starting
from a high-level description. Links using this design have been implemented successfully in three generations
of submicrometer CMOS technology.
Keywords
asynchronous logic synthesis, event-driven communication, fair arbiter design, neuromorphic systems,
parallel readout, pixel-level quantization
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 7, JULY 2004 1269
A Burst-Mode Word-Serial Address-Event
Link—I: Transmitter Design
Kwabena A. Boahen
I. SCALING TWO-DIMENSIONAL ARRAYS
MULTIPLE-ACCESS inter-chip communication links were originally developed to read out analog signals
from sensor arrays. A clock switches the multiplexer from one
sensor to another, reading a value from each and every one at a
fixed interval, hence the nickname “scanner” [1]. Use of these
clock-driven multiplexers continued after quantizers were in-
cluded in active pixel sensors [2]–[4] and in pulse-coded neural
networks [5] to discretize signals inside the array. However,
the all-or-none transitions so produced, called events, may be
output as soon as they occur. Such event-driven access has clear
advantages over clock-driven access when activity is sparse
(e.g., spatial or temporal filtering occurs) and timing is critical
(e.g., time encodes analog information). Consequently, this
scheme has been explored for silicon retinas [6]–[9] and silicon
cochleas [10].
Several ways of regulating access in event-driven communication
systems have been proposed [5], [11]–[13] and their efficiency
compared [14], [15], but little attention has been paid
to their scaling properties. In all the architectures proposed so
far, a single active cell is read, its state is cleared, and then
the next cell is read. However, it takes longer to cycle the row
and column lines as feature sizes shrink because faster logic
(minimum-sized inverter chain) is neutralized by larger load
(cells per row or column). Hence, these existing designs cannot
accommodate the increase in cell count with integration density,
unless the widths of transistors that drive the row and
column lines are increased drastically. However, some of these
devices actually reside in the cell, which must signal when it
becomes active in an event-driven system.
Manuscript received January 3, 2002; revised November 2002. This work
was supported in part by the Whitaker Foundation and in part by the National
Science Foundation’s LIS/KDI and CAREER programs under Grant ECS98-74463
and Grant ECS00-93851. This paper was recommended by Associate
Editor G. Cauwenberghs.
The author is with the Department of Bioengineering, University of
Pennsylvania, Philadelphia, PA 19104-6392 USA (e-mail: boahen@seas.upenn.edu).
Digital Object Identifier 10.1109/TCSI.2004.830703
In this paper, we describe a scalable event-driven transmitter
interface inspired by two-dimensional (2-D) scanners, which
read out an entire row of cells in parallel, over the column lines
[1]. The increase in parallelism as the array gets denser en-
ables these analog multiplexers to increase their readout rate, de-
spite the fact that the larger load neutralizes the faster logic. By
reading an entire row at once—instead of one cell at a time—we
also achieve a transmission rate that increases as the square-root
of the cell count, assuming a square array. Such scaling is the
best we can do without sizing-up devices inside the array—or
breaking it up into separate banks [16]. Our approach requires
large devices only in the periphery, where parallel-to-serial con-
version occurs. Therefore, it allows designers to take better ad-
vantage of higher integration densities offered by advanced sub-
micrometer processes.
Our design uses address-events to communicate between
cells in the same array or in different arrays, which need not
be on the same chip. In this respect, it is similar to previous
event-driven links, where the transmitter uses an encoder to
generate an address that uniquely identifies an event’s place of
origin while the receiver uses a decoder to recreate the event at
the destination [6], [10], [11]. However, whereas these previous
designs transmit row and column addresses in parallel, we
transmit them serially. There is no loss in speed because we
do not retransmit the row address if the next event is from the
same row.
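The word-serial idea in the preceding paragraph can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `encode_burst` and the "Y"/"X" tags (marking a word as a row or a column address) are hypothetical, and the events are assumed to arrive already grouped in the order the transmitter services them.

```python
def encode_burst(events):
    """Turn (row, col) events into a word-serial stream.

    A row address is emitted only when the row changes; each event then
    contributes one column address. The "Y"/"X" tags are an assumption
    made here to tell the two kinds of word apart.
    """
    words = []
    current_row = None
    for row, col in events:
        if row != current_row:
            words.append(("Y", row))   # new burst: send the row address once
            current_row = row
        words.append(("X", col))       # one column address per active cell
    return words


events = [(3, 1), (3, 4), (3, 7), (5, 2)]
stream = encode_burst(events)
```

For these four events the stream holds six words: two row addresses and four column addresses. The longer a burst from one row, the closer the cost approaches one word per event.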
In addition to providing a communication standard for
parallel distributed processing, address-events support virtual
point-to-point connectivity. These virtual wires can be routed
by using a look-up table to translate in-coming addresses into
one or more out-going addresses [17]–[19]. Furthermore, the
single-transmitter–single-receiver link may be extended to
support multiple transmitters and receivers using merges and
splits [20], or with a shared bus [18], [21]. Thus, the basic
link can serve a wide variety of purposes when augmented
appropriately. As in previous work, we implemented the link
asynchronously to facilitate its use in large heterogeneous
multichip systems.
The paper is divided into five sections. In Section II, we
briefly review three common multiplexing schemes: metered-,
free-, and arbitered-access (see [15] for an in-depth review).
In Section III, we present a high-level specification for the
transmitter, and decompose it into a hierarchy of concurrent
subprocesses. In Section IV, we present the final handshaking
sequences and the resulting asynchronous logic circuits;
intermediate synthesis steps can be found in the Appendix.
Section V concludes the paper. A preliminary report of this
work was presented in [22]. A parallel-write burst-mode re-
ceiver and analysis and test results are presented in companion
papers [23], [24].
II. MULTIPLEXING EVENTS
Metering access to each cell according to a fixed readout
sequence is the simplest solution. These clock-driven multi-
plexers, or scanners, are commonly used to read out analog cur-
rents from imagers, going from row to row, in sequence. In fact,
they read all a row’s pixels in parallel, and then scan them out
serially on the periphery. Thus, they achieve rates over ten mil-
lion pixels per second; fast enough to scan arrays with hundreds
of thousands of pixels at video frame rates. However, these fast
analog signals require specialized input/output (I/O) pads, are
prone to clock feedthrough and to noise, cannot be easily in-
terfaced with computers, and can be demultiplexed only when
array sizes and clock speeds match.
If activity is sparse, it is more efficient to transmit a cell’s
state only when it changes. It is easy to recognize when sig-
nificant changes occur if cell-state is quantized. The cheapest
quantizers are one-bit analog-to-digital converters, such as inte-
grate-and-fire neurons [2], [8], [25] or sigma-delta encoders [3],
[4]. Their fixed-width–fixed-height spikes or binary state-tran-
sitions constitute a sequence of events that encode information
only in their timing. When coincidences occur, the transmitter
may either delay the new event to prevent a collision or dump
the old event to preserve timing.
Transmitting events immediately by giving cells free access
shortens latency. As such an event-driven operation does not
follow a predetermined (i.e., clock-driven) readout sequence, we
must transmit information that uniquely identifies the event’s
location. These addressed events (abbreviated to address-events)
can be created simply by wiring the event-generators’ outputs
directly to the address encoder [12]. However, when events coin-
cide, the encoder ORs their addresses together. These collisions
increase exponentially as activity increases, with the fraction of
events that get through unscathed maxing out at 18% when only
50% of the transmission slots are full [26].
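The 18% and 50% figures quoted above are consistent with the classic pure-ALOHA collision model, in which the fraction of slots carrying an uncollided event is G·exp(−2G) for an offered load G. The sketch below assumes that model (the source attributes the numbers to [26] without restating the formula); `free_access_throughput` is a hypothetical name.

```python
import math


def free_access_throughput(offered_load):
    """Uncollided events per slot under the pure-ALOHA model (an assumption here)."""
    return offered_load * math.exp(-2.0 * offered_load)


# The model peaks at an offered load of 0.5, where throughput is 1/(2e).
peak = free_access_throughput(0.5)
```

Under this assumed model the peak is about 0.18, and pushing the offered load above 0.5 only lowers the fraction of events that get through.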
Preventing collisions by arbitered access increases
throughput. An arbiter grants only one request at a time, and
the encoder outputs that cell’s address. On average, the wait
equals the mean interval between empty transmission slots
[26]. When only 5% of the slots are empty, for example, the
wait is 20 slots long. For 10 000 cells, each transmission slot
must be shorter than 0.01% of the average inter-event interval
to handle that many cells. Hence, a 20-slot wait corresponds
to just 0.2% of the population’s average inter-event interval.
Therefore, arbitration can potentially achieve a five-fold
increase in throughput—from 18% to 95%—with negligible
timing error [15].
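The arithmetic in this paragraph can be checked with a few lines. This is a sketch of the numbers quoted in the text only; the function name `arbitered_wait` and its parameters are hypothetical.

```python
def arbitered_wait(empty_fraction, n_cells):
    """Return (wait in slots, wait as a fraction of the mean inter-event interval).

    The wait is taken as the mean interval between empty slots, and a slot
    is taken to be at most 1/n_cells of the average inter-event interval,
    as stated in the text.
    """
    wait_slots = 1.0 / empty_fraction
    slot_fraction = 1.0 / n_cells          # upper bound on slot length
    return wait_slots, wait_slots * slot_fraction


wait_slots, wait_fraction = arbitered_wait(empty_fraction=0.05, n_cells=10_000)
throughput_gain = 0.95 / 0.18              # arbitered versus free access
```

With 5% of slots empty and 10 000 cells, the wait comes to 20 slots, or 0.2% of the average inter-event interval, and the throughput ratio is a little over five.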
III. TRANSMITTER DESIGN
To achieve the five-fold increase in throughput arbitration
promises, we must ensure that it does not increase the
transmission-cycle time. The first implementation of a 2-D arbitered
address-event transmitter, by Mahowald and Sivilotti, yielded
a disappointingly long cycle time of 2 µs [6]. In that design, for an
array with $N$ cells, a 1-in-$\sqrt{N}$ arbiter was first used to pick a row
and then a second 1-in-$\sqrt{N}$ arbiter was used to pick a cell in that
row. As a 1-in-$n$ arbiter is built from $n-1$ 1-in-2 arbiters,
organized in a binary tree, this hierarchical scheme cuts the number
of 1-in-2 arbiters from $N-1$ to $2(\sqrt{N}-1)$. Unfortunately, the
cycle time suffered as all $\log_2\sqrt{N}$ levels in each arbiter tree
were spanned for every event transmitted.
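To make the arbiter-count trade-off concrete, the sketch below counts two-input arbiters for a square array. It relies only on the fact that a binary tree with n leaves has n − 1 internal nodes; the symbol N for the cell count and the function name `arbiter_counts` are our own choices for illustration.

```python
import math


def arbiter_counts(n_cells):
    """Compare a flat 1-in-N arbiter tree with a row/column hierarchy.

    Returns (flat, hierarchical, levels_per_tree) for a square array,
    where each count is a number of 1-in-2 arbiters.
    """
    side = math.isqrt(n_cells)
    if side * side != n_cells:
        raise ValueError("n_cells must be a perfect square")
    flat = n_cells - 1                    # one tree over all N cells
    hierarchical = 2 * (side - 1)         # one row tree plus one column tree
    levels = (side - 1).bit_length()      # ceil(log2(side)) levels per tree
    return flat, hierarchical, levels


flat, hierarchical, levels = arbiter_counts(4096)   # a 64 x 64 array
```

For a 64 × 64 array this gives 4095 arbiters for the flat tree against 126 for the hierarchy, at the price of traversing six levels in each of the two trees when no locality is exploited.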
In previous work, we cut the arbitered address-event
transmitter’s transmission cycle-time from 2 µs to 730 ns in the same
2-µm technology by exploiting locality [27]. That is, we arbitrated
at the lowest level of the arbiter tree for inputs next to each other,
at the second level for inputs two to three places apart, and so
on, so that only two levels are spanned on average. We went on to
reduce the cycle time to 420 ns by exploiting locality inside the
array as well [27]; that is, by servicing all active cells in a selected
row before redoing the row arbitration.
Our present goal is to further optimize the arbitered address-
event transmitter’s row–column architecture. Having exploited
locality in the arbiter and the array, transmission speed is now
primarily limited by the rate at which events are read out of the
array. Here, we break this array-cycling limit, realizing three
enhancements in all, as follows:
1) reading a row’s events in parallel boosts capacity by $\sqrt{N}$, for a square array of $N$ cells;
2) bundling them into a single row-wide word eliminates the
column-select lines;
3) multiplexing row and column addresses cuts output pads
in half.
We alluded to Optimizations 1 and 3 in Section I. Optimization 2
is a direct consequence of Optimization 1—selecting an entire
row instead of a single cell.
A preview of the transmitter architecture we developed is
shown in Fig. 1. In this section, we derive programs that de-
scribe the behavior of each of these blocks by following a syn-
thesis methodology for asynchronous digital VLSI systems de-
veloped by Martin [28] (tutorial examples are provided in [15]).
His methodology involves applying a series of program decom-
positions and transformations, starting from a high-level spec-
ification. As each step preserves the logic of the original pro-
gram, the resulting circuit is correct by induction. Thus, it is
unnecessary to deduce how these hardware processes behave
when executed in parallel, which is extremely difficult. After de-
composing the transmitter specification into a set of concurrent
one-line programs, we transform these one-liners into hardware
processes in Section IV.
A. High-Level Specification
We start by writing a high-level specification in the concur-
rent hardware processes (CHP) language, a hardware descrip-
tion language for asynchronous systems [28]. In CHP, logic cir-
cuits “execute” concurrent programs. For example
BOAHEN: BURST-MODE WORD-SERIAL ADDRESS-EVENT LINK—I 1271
Fig. 1. Transmitter architecture: an interface circuit (H) relays requests to the
row arbiter (A) and permits that row to output its address and its events (S)
when the arbiter acknowledges. The events’ column addresses are generated by
the same procedure, after latching the row’s state (L). A two-way multiplexer
(T) outputs row (Y) and column (X) addresses, using separate request lines
(Ry, Rx); they share a single acknowledge line (Ack). Meanwhile, a controller
(C) cycles the array to another row.
The program, or process, is named and its argument
is named ; process and argument names are always set in
upper and lower case sans-serif font, respectively. As we are
describing hardware here, you should think of as a call
to a silicon compiler that lays out a circuit with, for instance,
an -bit wide datapath. denotes infinite repetition; this
demarcates the body of the program. Semicolons (;) denote
sequential execution. inputs data from a port named
and stores it in a local variable named ; port and variable
names are always set in italicized upper and lower case roman
font, respectively. Similarly, outputs the data stored in
on port . is a dataless communication on port ; its
only effect is to synchronize the two processes whose ports are
connected together. That is, this process waits until the other
one gets to the corresponding point in its program, or vice
versa. In the text, we will write “port ” to distinguish the port
itself from a communication performed on that port, which we
write simply as “ .” There is no such ambiguity in the code, as
only communications can appear in the body of the program.
A high-level block diagram of the address-event transmitter
is shown in Fig. 2. We use arbitration
to choose an active cell. It picks a guard
that is true and executes the corresponding program segment
.1 In this case, the guard is the probe which evaluates to
true when there is a communication pending on port (i.e.,
the other process is waiting). Also, the program segment com-
municates on that port and outputs its address simultaneously;
parallel lines are used to denote this. Thus, we have
The address is returned by a function call that converts
a one-hot code ( -bit) to a binary one ( -bit).
1If all the guards are false, it waits for at least one to become true.
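The one-hot to binary conversion performed by the encoder function can be sketched as follows. This is an illustration, not the paper's circuit; `one_hot_to_binary` is a hypothetical name, and the one-hot word is modeled as a Python integer with exactly one bit set.

```python
def one_hot_to_binary(one_hot):
    """Return the index of the single set bit in a one-hot word."""
    if one_hot <= 0 or one_hot & (one_hot - 1):
        raise ValueError("input must have exactly one bit set")
    return one_hot.bit_length() - 1


address = one_hot_to_binary(0b0000_1000)   # the cell at position 3 is selected
```

An n-bit one-hot word needs only ceil(log2 n) bits once encoded, which is why the address output is much narrower than the number of ports it identifies.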
Fig. 2. Transmitter specification: when a communication occurs on dataless
port P , its address j is output on port A as an a-bit word.
Fig. 3. Row-column organization. (a) The lth row services event generators
l + 1 to (l + 1) through its P -ports and communicates with columns
through its C ports. It also communicates with the row arbiter ( ) and the
row encoder ( ). (b) The kth column communicates with rows through
its R ports and communicates with the column arbiter ( ) and the column
encoder ( ) as well.
Alternatively, the transmitter process may be described suc-
cinctly using the CHP replication construct:
, where each is a program segment and is
any operator that can be concatenated. As the arbitration oper-
ator can be concatenated, we have
The next step in the synthesis procedure is to decompose
this high-level specification into a hierarchy of concurrent
processes. These processes’ ports are then connected together
by channels. We present this connectivity information picto-
rially. These figures also give the names of instances (e.g.,
specifies an instance of named )
and their ports’ data types (e.g., specifies that port
outputs bytes). Ports that are not defined as either input or
output are dataless—by default. Port names that appear inside
a box are local to that instance; those outside are local to the
process within which that instance occurs.
B. Reorganizing Into Rows and Columns
Here, we decompose into separate row, column,
arbiter, and encoder processes, named , ,
, and , respectively. These processes are
connected as shown in Fig. 3. This decomposition is accom-
plished through four program transformations. For the first
transformation, we reorganize ’s dataless ports into
rows and columns. With this array, we have to use a
1-in- arbiter to choose a row and then use a 1-in- arbiter to
choose one of the ports in that row. Hence, we must or (denoted
by ) together all requests within each row to generate requests
for the row arbiter. We also need to provide separate outputs,
1272 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 51, NO. 7, JULY 2004
ports and , for row and column addresses, respectively.
Thus, we have
where and .
For the second transformation, we implement arbitration in a
separate process
It performs the second communication to ensure that the data-
less port it picked has been completely serviced before it picks
another. Two instances of , with or , are used
for row and column arbitration, respectively. To communicate
with these processes, we provide the remaining array process,
called , with dataless ports and dataless ports
, respectively.
Once the row arbiter picks a row, we can use concurrency,
, to service every port
in that row that has a communication pending, before picking
another row. This construct executes, concurrently, all program
segments, , whose guards, , are true.2 Thus, we have
Note that a row address is output only when a new row is se-
lected.
For the third transformation, we implement address encoding
in a separate process
This process chooses one of its dataless ports using
selection, . Selection requires
that only one guard is true at a time, which is indeed the case
here, as arbitration guarantees mutual exclusion. Two instances
of , with or , are used for row and column
encoding, respectively. We provide with dataless
ports and dataless ports to communicate with these
processes. Communications on and , respectively, now re-
place the and communications in its pro-
gram above.
For the fourth and final transformation, we break up
into row- and column-processes (see Fig. 3).
Dividing ’s program into subroutines yields the
following code for its row and column processes
where and have replaced and , respectively,
instead of and . We parallelize this serial-array-readout
design, which was implemented in [27], in the next subsection.
2This construct is not supported by CHP. Its use is discouraged because the
negated probes used to determine ineligibility can change from true to false at
any time. Concurrency waits for at least one guard to become true if necessary,
just like arbitration does.
Fig. 4. Parallel readout. (a) ’s C ports have been combined into a
single port (C) that outputs an -bit word. (b) Column processes are replaced
by a bus ( ) that transfers -bit words to a mux ( ).
C. Reading the Array in Parallel
Here, we transform and to read all of a se-
lected row’s dataless ports in parallel, an innovation introduced
in this work (initially reported in [22]). This parallel read is ac-
complished by merging the instances of into a single
-bit-wide bus, called , and modifying ac-
cordingly, as shown in Fig. 4. An -bit integer, named , can
represent the state of a row’s dataless ports . We denote
’s th bit by . If we set this bit when the probe,
, becomes true, it may serve as a proxy for in the previous
program. We just have to remember to clear all the
bits ( or ) once the row is read. Thus, we have
One instance of replaces the instances of
we had before.
We transfer the row word to a latch whose cells execute the
code given for previously, with each bit, , serving
as a proxy for . Hence, we test instead of probing the
th column’s ports and clear instead of doing the
communication. Thus, we have
The second communication now signals that the latch is
empty. is connected as shown in Fig. 5(a). It receives
the row’s data from , a process that we shall describe
shortly. This completes our transformation of and
to support parallel reading.
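The parallel read described in this subsection can be mimicked in software. The sketch below is a behavioral illustration only: the class and function names are hypothetical, the row word is modeled as an integer bit mask, and the latch is drained lowest bit first, whereas the real column arbiter may grant requests in any order.

```python
class RowWord:
    """Holds one bit per cell in a row; a set bit stands for a pending event."""

    def __init__(self):
        self.bits = 0

    def event(self, column):
        self.bits |= 1 << column          # set the bit when the cell requests

    def read_and_clear(self):
        word, self.bits = self.bits, 0    # whole row is read at once, then cleared
        return word


def drain_latch(word):
    """Service each set bit of a latched row word, clearing it as we go."""
    columns = []
    while word:                            # latch is empty when no bits remain
        column = (word & -word).bit_length() - 1
        columns.append(column)
        word &= word - 1                   # clear the bit just serviced
    return columns


row = RowWord()
for column in (4, 1, 7):
    row.event(column)
latched = row.read_and_clear()
serviced = drain_latch(latched)
```

Because the row's bits are cleared as soon as the word is latched, cells in that row can raise new requests while the latch is still being drained.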
Our final transformation is to transmit row and column
addresses sequentially, over the same output; this innova-
tion is revealed here for the first time. The addresses are
multiplexed by a separate process called , where
specifies the number of address
bits. As shown in Fig. 5(b), while it relays addresses from
the row encoder (port ) or the column encoder (port )
to the transmitter’s output (port ), also relays
row-data from (port ) to (port ) as well.
Fig. 5. Row-word latch and burst generator: (a) A latch ( ) receives
row-words from the mux and communicates with the column arbiter ( )
and the column encoder ( ). (b) A mux ( ) receives row and column
addresses from the encoders ( and ) and outputs them on the
transmitter’s T port; it also relays row-words from the bus ( ) to the latch
( ).
Coordinating these two procedures makes it possible to send
the empty signal, a reserved address named , when
becomes empty. Thus, we have
Recall that the second communication, which matches
’s second communication above, signals that the
latch is empty. Column addresses are handled by a concurrent
process, under the assumption that they show up after the
row address does. This timing assumption can be avoided by
forcing to wait for the new row address to appear,
ignoring all column addresses after it relays .
In summary, we have decomposed into six concur-
rent processes: instances of , one of , one of
, and one of . The remaining two are
and ; there are two instances of each, with equal to
or . The next step in the synthesis procedure is to compile these
CHP programs into hardware.
IV. TRANSMITTER IMPLEMENTATION
Electrically, processes set or clear an output
signal , or wait for an input signal to become true
or false ; tilde denotes logical complement. To
communicate, they must perform complementary four-phase
sequences of actions and waits: on an
active port and for its passive coun-
terpart , where denotes repetition, just like in CHP. We
always append and to the port’s name to indicate its input
and output signals, respectively. Such signal names are always
set in lower case typewriter font. The active port’s output signal
is commonly called Request; the passive port’s is the
so-called Acknowledge. At the signal-level, we refer to as
( , ) and to as ( , )—request first and acknowledge
second in both cases.
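The complementary four-phase sequences can be illustrated with a small simulation. It is a sketch under stated assumptions: the active side is taken to follow request up, wait for acknowledge, request down, wait for acknowledge low, and the passive side the mirror image; the signal names `req` and `ack` and the round-robin scheduler are our own.

```python
def active(sig):
    """Active port: raises the request, then waits for the acknowledge."""
    while True:
        sig["req"] = True
        yield "req+"
        while not sig["ack"]:
            yield None
        sig["req"] = False
        yield "req-"
        while sig["ack"]:
            yield None


def passive(sig):
    """Passive port: waits for the request, then raises the acknowledge."""
    while True:
        while not sig["req"]:
            yield None
        sig["ack"] = True
        yield "ack+"
        while sig["req"]:
            yield None
        sig["ack"] = False
        yield "ack-"


def run(steps):
    """Step both processes in turn and record the signal transitions."""
    sig = {"req": False, "ack": False}
    processes = [active(sig), passive(sig)]
    trace = []
    for _ in range(steps):
        for process in processes:
            transition = next(process)
            if transition is not None:
                trace.append(transition)
    return trace


trace = run(4)
```

Each full communication produces the four transitions req+, ack+, req-, ack-; the last two merely return the wires to their initial state, which is what later makes reshuffling possible.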
We have three choices of signal representations for data. (1)
Bundled-data requires a single line per bit, in addition to the
request and acknowledge signals. The data is valid when the
request signal is set; otherwise it is invalid. (2) Straight-data
dispenses with the request signal. Instead, all zeroes signifies
invalid data; any other word is considered valid. Both repre-
sentations require matched delays—for data as well as request.
(3) Dual-rail achieves delay-insensitive operation by encoding
each bit using two lines: bit-is-true and bit-is-false (denoted by
appending or ). The data are invalid when both are cleared;
setting either transmits a one or a zero.
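The validity conventions of the straight-data and dual-rail representations can be sketched as below. The function names are hypothetical, and the treatment of a partly neutral dual-rail word (reported as not yet valid) is our own reading of the encoding rather than something the text spells out.

```python
def straight_data_valid(word):
    """Straight-data: the all-zeroes word means 'invalid', anything else is data."""
    return word != 0


def dual_rail_encode(bits):
    """Encode each bit as a (bit_is_true, bit_is_false) pair of lines."""
    return [(1, 0) if bit else (0, 1) for bit in bits]


def dual_rail_decode(rails):
    """Return the bits, or None while any pair still has both lines cleared."""
    if any(t and f for t, f in rails):
        raise ValueError("both lines of a pair are set")
    if not all(t or f for t, f in rails):
        return None                        # invalid (neutral) or still arriving
    return [1 if t else 0 for t, f in rails]


encoded = dual_rail_encode([1, 0, 1])
decoded = dual_rail_decode(encoded)
```

Dual-rail costs two lines per bit but lets the receiver detect validity from the data alone, whereas straight-data saves wires at the cost of a matched-delay assumption and a reserved all-zeroes word.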
Handshaking expansion (HSE) is the procedure whereby
each communication in our CHP programs is fleshed out into
a full four-phase request–acknowledge sequence. Following
Martin’s synthesis procedure [28], we make two choices when
we perform HSE. First, we make output ports active and input
ports passive.3 The only exception is that a port that is probed
must be passive, as the probe is implemented simply as .
Symmetric links—dataless ports that are not probed on either
end—are dealt with on a case by case basis. Second, we use the
second half of the four-phase handshake to implement a second
communication on the same port—a two-phase handshake—if
these communications always occur in pairs. This optimization
is possible because the second half just returns the signals to
their initial state. So we are free to clear them whenever it is
convenient to do so, a process known as reshuffling.
The final step in Martin’s synthesis procedure is compiling
HSE sequences into production-rule sets (PRS), which are
straightforward to implement with CMOS transistors. A
production-rule, , clears a bit, , when a boolean
expression, , becomes true. We write to set the bit
when the expression is false. An nFET implements the former
rule while a pFET implements the latter; the two rules together
correspond to an inverter. Logical and and or (denoted by
& and , respectively, in PRS, or HSE) are implemented by
connecting FETs in series and in parallel. If the pull-up and
pull-down chains may both be inactive at the same time, a weak
feedback-inverter must be added to overcome their leakage
currents. Such outputs are said to be state holding, as opposed
to combinational; the feedback inverter is called a staticizer.
Active low signals are allowed in PRS and at the circuit level;
their names have an underscore prepended (e.g., ).
We present only the final HSE sequences and the synthesized
circuits in this section. Details of how we arrived at these reshuf-
flings and how we compiled them into PRS are in the Appendix.
We recommend that you refer to Fig. 6 to see how these circuits
interact as you read their descriptions. To facilitate this, we in-
clude the block labels in this figure in HSE sequences and in
subsequent figure captions.
A. Reading
The CHP program for [see Section III-C and
Fig. 4(a)] calls for us to set a bit when the event-generator
initiates a communication (on port ) and read out this
data onto the column lines (port ) when this particular row is
selected by the arbiter (port ). must communicate
with the address encoder as well (port ). These operations
are implemented in this section, together with the CHP for
.
We made port passive and ports , , and active and
we chose a straight-data representation for port , where all
zeroes indicates invalid data. We separated event-generator and
3Our choice is arbitrary—the direction that data flows is not necessarily re-
strictive.
Fig. 6. Transmitter schematic. consists of event-generator interfaces
(S) and an arbiter interface (H). consists of 1-in-2 arbiter cells (A).
’s bit cells consist of a memory (L) and an arbiter interface (H).
consists of a set of wires (from S to L). (two instances) consists of b
address lines (extra column-address line is request) and combinational logic
(represented by discs). Latch (F) stores the row address while staticizers (G)
hold the column address. consists of two controllers: one (T) switches
the address-mux (J), the other (C) cycles the array.
event-arbiter signals into two sequences, introducing two inter-
mediaries, and (see Fig. 6), for them to communicate with:
goes high when an event occurs anywhere in this row while
goes high when the row is selected (see Appendix, part A). This
partitioning resulted in the following reshuffled HSE sequences
# row (S,H) #
[[pi];w+;[s];co+,po+;[~pi];w-;[~s];co-,po-]
k[[p];ro+;[ri&~ci];s+;[ci&~p];ro-;[~ri];s-]
where denotes parallel execution, just like in CHP. is the OR
of all the bits. For brevity, the subscript has been suppressed
and has been omitted: transitions at the same time does
and is combined with (see Appendix, part A).
Compiling the first sequence into PRS (see Appendix, part
A) yielded the circuit shown in Fig. 7(a). These two gates are
asymmetric variations of the C-element, whose output is set
when both inputs are high and cleared when both are low (i.e.,
); they are called aC-elements.
Initially, is low, so when becomes low, goes high, which
prompts to go high. Consequently, and go high, which
prompts to go high. As a result, is cleared, which prompts
to go back low, thereby clearing and to terminate the
cycle.
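The behavior of a C-element, as defined above (set when both inputs are high, cleared when both are low, otherwise holding its state), can be modeled in a few lines. This sketch covers only the symmetric element; the asymmetric variants used in the circuit are not modeled, and the class name is hypothetical.

```python
class CElement:
    """State-holding gate: output follows the inputs only when they agree."""

    def __init__(self, out=0):
        self.out = out

    def update(self, a, b):
        if a and b:
            self.out = 1                   # both inputs high: set
        elif not a and not b:
            self.out = 0                   # both inputs low: clear
        return self.out                    # inputs disagree: hold previous value


gate = CElement()
outputs = [gate.update(a, b) for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]]
```

The hold case is what makes the output state holding rather than combinational, which is why such gates normally carry a staticizer.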
Fig. 7. Event generator [S] and arbiter [H] interfaces. (a) Interfaces event-generator (pi, po) with arbiter-interface (w, s) and latch-cell (co). (b) Interfaces row (p, s), or latch-cell, with arbiter (ro, ri), address encoder (ao), and mux (ci).
We eliminated the staticizers to save space, but this simplifi-
cation produced a race condition when the first gate is disabled
by [see Fig. 7(a)]. If has not discharged all the way to
ground, the pull-down continues to pass current and clears .
Thus, it is important for the event-generator to produce a fast,
clean, downward transition [8], [25] at . If not, the staticizers
must be included. The absence of staticizers also makes the cir-
cuit susceptible to charge sharing, which can be largely avoided
by placing the series-connected n and p transistors in the order
shown.
Compiling the second sequence into PRS (see Appendix,
part A) yielded the circuit shown in Fig. 7(b), which also
consists of two aC-elements. When becomes high, goes
high, which prompts to go high. As is high initially,
and go high, which prompts and to go low. Thus,
goes low, causing to go back low, which clears and . A
new cycle can now begin, but cannot go high until goes
back high.
We relay requests from the row’s multiple event-generator in-
terfaces to this arbiter-interface using the circuit shown in Fig. 8.
This circuit ORs together all bits in that row (they are tied to
, etc.) to generate (tied to ) and broadcasts the signal
(tied to ).4 This staticized design is more power-efficient and
noise-immune than the nMOS-style wired-OR used previously
[6], [10], [15]. The address encoder is implemented as described
in [15].
merges words from the s onto its port,
which we chose to be active [see Section III-C and Fig. 4(b)].
We implemented this merge by feeding ’s straight-data
outputs to staticized wired-OR gates, with inputs each (see
Fig. 8), connected in a column-wise fashion. Instead of steering
the acknowledge signal to the row that was read, it is tied directly
to all the rows’ arbiter-interfaces (see Fig. 6). While simplifying
the steering circuit to a single wire, this global acknowledge
signal also blocks newly selected rows from proceeding until
on-going column communication is completed, making the ag-
gressive arbiter-interface reshuffling presented above safe (see
Appendix, part A).
4The select signal is not restricted to the active cells because it is used to
prevent inactive ones from becoming active. Thus, this broadcast deals with the
negated probe instability that plagues concurrency.
Fig. 8. ORing request signals. ORs multiple requests, l1i, l2i, ..., or lni,
together to create a single request, ro, and broadcasts the acknowledge, ri, to
all n ports. ri clears ro, but not until all lki are cleared, because the pFET
is not strong enough to overcome an nFET.
B. Choosing
is built out of cells, as shown in Fig. 1
(see [15] for not-a-power of two). The 1-in-2 arbiter cell is
described by
Ports and are connected to its daughters’ ports while
port is connected to its parent’s or port. Only after
communicating on does it perform a communication pending
on or , arbitrating between them if necessary. Thus, re-
quests are relayed up the tree by probing the -to- channels,
while choices are steered down the tree by communicating on
the same channels. The second pair of and communications
guarantees mutual exclusion.
We decomposed into two processes by isolating its
-to- and -to- communications, and provided a third
process to arbitrate between these two
Ports in different processes with the same name are connected
together. For the communication processes, we made ports
and passive and ports , , and active. After reshuf-
fling to optimize performance (see Appendix, part B), we ob-
tained the following HSE sequence for the first communication
process:
# arb (A) #
[[l1i&~ri];a1o+;ro+;[ri&a1i];l1o+;
[~l1i];a1o-;ro-;[~a1i];l1o-].
The second communication process is identical; just replace 1
with 2. Their two signals are ORed together to generate a
single request signal.
Compiling the sequence above into PRS (see Appendix, part
B) yielded the circuit shown in Fig. 9. Two cross-coupled NAND
gates perform arbitration [16]. Their inputs are active-high and
their outputs are active-low—complementary to a set–reset
flip-flop. The aC-elements activate these inputs when requests
are received, provided the parent’s acknowledge is
inactive (i.e., high). The NAND gates’ outputs drive a circuit that
steers the parent’s acknowledge to either daughter (
or ). This steering circuit NORs these active-low signals,
Fig. 9. Two-input arbiter cell [A]: interfaces two daughter cells, (l1i, l1o)
and (l2i, l2o), with a parent cell, (ro, ri), using two asymmetric C-elements
(aC), a pair of cross-coupled NAND gates, a steering circuit, and an OR gate.
and , with to produce the outgoing acknowledges,
or . To filter out metastability, the NOR-gates’ pull-ups
are not powered up unless and differ by more than
the threshold voltage [16], [28].
When these 1-in-2 arbiter cells are connected in a binary tree,
requests are selected by a post-order traversal. That is, a node
is visited, and then, its daughters are visited, and so on, recur-
sively. However, a daughter that is not requesting is not vis-
ited and a daughter that makes another request is not revisited
until the entire tree has been traversed. Each daughter is vis-
ited only once because the wait in above
blocks a second request from being serviced with the same
acknowledge signal, unlike the greedy arbiter design presented
in [15], [27], which would revisit the same daughter over and
over again. Complete traversal makes this new arbiter design a
fair one, in that it will not service the same client again until all
those waiting have been serviced. Fairness is critical if parallel
readout is to be fully exploited, as discussed in the companion
paper [24].
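The fairness property described above can be illustrated with a small behavioral sketch in Python. It is not a model of the circuit in Fig. 9. It assumes that the left daughter always wins a tie (the real arbiter resolves ties nondeterministically), that a client with more than one event reasserts its request immediately, and that the number of leaves is a power of two. `greedy_order` is a deliberately simplified stand-in for a greedy arbiter that keeps revisiting the same requester; both function names are hypothetical.

```python
def sweep(pending, lo, hi, grants):
    """Traverse the subtree covering leaves lo..hi-1, granting each requester once."""
    if not any(pending[lo:hi]):
        return                      # a daughter that is not requesting is not visited
    if hi - lo == 1:
        pending[lo] -= 1            # service one request from this leaf
        grants.append(lo)
        return
    mid = (lo + hi) // 2
    sweep(pending, lo, mid, grants)  # assumed tie-break: left daughter first
    sweep(pending, mid, hi, grants)


def fair_order(requests):
    """Grant order when a leaf is not revisited until the whole tree is traversed."""
    pending, grants = list(requests), []
    while any(pending):
        sweep(pending, 0, len(pending), grants)
    return grants


def greedy_order(requests):
    """Grant order when the first requesting leaf is revisited over and over."""
    pending, grants = list(requests), []
    while any(pending):
        leaf = next(i for i, n in enumerate(pending) if n)
        pending[leaf] -= 1
        grants.append(leaf)
    return grants
```

With three pending events at leaf 0 and one each at leaves 2 and 3, `fair_order([3, 0, 1, 1])` services leaves 2 and 3 after the first grant to leaf 0, whereas `greedy_order` makes them wait until leaf 0 is exhausted.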
This fair arbiter design is optimized for speed, in that new re-
quests can propagate up the tree while old ones are still being
serviced. It does not require requests at all levels of the tree
to be cleared before the acknowledges are cleared (see Fig. 1).
Traversing the entire tree like this—starting at the bottom and
propagating all the way up to clear the requests and then starting
at the top and propagating all the way down to clear the ac-
knowledges—would be painfully slow. Instead of waiting for its
parent’s acknowledge to clear before it clears its own acknowl-
edge, the cell clears its acknowledge as soon as its daughter’s re-
quest clears (see ). However, it blocks new requests
until its parent’s acknowledge clears, which just requires the cell
to clear its own request. The cell does this once both of its daugh-
ters’ requests are cleared. Thus, new requests propagate up until
they encounter a cell whose other request is currently selected.
Fig. 10. Data-transfer [C] and latch [L] circuits. (a) Interfaces array (di, do)
with mux (d, t). do is forced low during reset. (b) Stores (w) row cell’s output
(rx) under control of mux (ri, ro) and communicates with column arbiter
interface (go, gi).
In fact, they can get all the way up to the top cell, even while it
is servicing the other half of the tree.
C. Writing
The CHP program for calls for us to store data read
from a row (port ) and then communicate with the arbiter (port
) and the column encoder (port ). These communications
are performed for each bit that is set, and then the bit is
cleared. When all the bits have been cleared, a second com-
munication is performed to signal that the latch is empty. These
operations are implemented in this section, together with the
part of that coordinates them.
hands data from to ,
converting it from a straight-data representation to a bun-
dled-data one (see Section III-C and Fig. 5(b)). We separated
the read operation on ’s passive port and the write
operation on its active port into two HSE sequences. We
also introduced a pair of local variables and for them to
communicate with (see Fig. 6). The read sequence takes high
when the data appears; the write sequence responds by taking
high when the data is latched. We present the read sequence
below but we postpone implementing the write sequence till
Section IV-D, in order to synchronize it with row address
transmission (see Section III-C). The straight-to-bundled-data
converter is implemented simply by performing a bit-wise OR
on the column data to generate a request signal ( below).
We obtained the following reshuffling for the read sequence
(see Appendix, part D):
# xfr (C) #
[[di];d+;[t];do+;[~di];d-;do-;[~t]].
Compiling this sequence into PRS (see Appendix, part D)
yielded the circuit shown in Fig. 10(a). Initially is low, so
when becomes low, goes low, which prompts to
become high. Now, both of the OR gate’s inputs are low, so it
drives low, which prompts to go high, and hence
and go back up. Once goes low, a new cycle may begin.
However, new column data can show up immediately after
goes high, so this reshuffling allows us to cycle to the next row
and present its data even before the latch becomes empty.
Now, we proceed with implementing [see
Section III-C and Fig. 5(a)]. First, we decompose its CHP
program into two processes
The port of the first process, which stores the bit, is con-
nected to the port of the second one, which interfaces with
the arbiter. The arbiter interface turned out to be the same as that
used by the rows [see Section IV-A and Fig. 7(b)]. We simply
make the connections:
. For the memory cell, we used a bun-
dled-data representation for port and made it passive, and
we made port active. These choices yielded the following
reshuffled HSE sequence:
# latch (L) #
[[ri&rk];w+;go+;ro+;[~ri&gi];w-;go-;[~gi];
ro-].
Compiling the sequence above into PRS (see Appendix,
part C) yielded the circuit shown in Fig. 10(b). Initially is
high, so when becomes high, is set and and go high,
which prompts to go high and to go low. Thus, and
are cleared, but stays high until becomes low. When this
happens, goes high in response to going low, and now
a new cycle can begin. We OR together the signals from all
memory cells to generate a single write-acknowledge. Even
though this OR-gate is triggered by the first bit that is set, the
delay in clearing the request signal keeps the latch transparent
for a while, giving tardy bits the chance to be written.
D. Bursting
The CHP program for calls for us to write row-
words to the latch (port ), to multiplex row addresses (port
) and column addresses (port ) onto the transmitter’s output
(port ), and to send when the latch is empty [see Section III-C
and Fig. 5(b)]. These operations are implemented in this section;
reading the row’s data was implemented in Section IV-C.
We use the following three-wire-handshake sequence for
port :
[tro+;[ti];tro-;[~ti]]
k[[ti];tco+;[~ti];tco-].
This protocol allows us to transmit multiple column addresses
by executing the second sequence as many times as desired,
halfway through the first sequence. The first sequence’s first
half transmits the row address, while the second half terminates
the burst—it transmits . Fig. 6 shows the correspondence be-
tween these signals and the transmitter’s Ry, Rx, and Ack sig-
nals mentioned earlier.
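The interleaving of the two sequences can be made concrete with a short Python sketch that lists the signal transitions of one burst. It assumes an idealized receiver that answers each request transition with the matching acknowledge transition on ti, as the waits in the two sequences imply. The function name `burst_handshake` is hypothetical, and the sketch says nothing about timing.

```python
def burst_handshake(n_columns):
    """Return the ordered transitions on tro, tco, and ti for one burst."""
    trace = ["tro+", "ti+"]          # first half of the row sequence: row address sent
    for _ in range(n_columns):
        # One pass through the column sequence [[ti];tco+;[~ti];tco-],
        # followed by the receiver raising ti again.
        trace += ["tco+", "ti-", "tco-", "ti+"]
    trace += ["tro-", "ti-"]         # second half of the row sequence: burst terminated
    return trace
```

Under these assumptions a burst with n column addresses takes 4 + 4n transitions; the row request stays high for the whole burst.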
First, we implement the row communications, and
. We will merge reading the row’s address with reading
its data, and thereby use the read sequence presented in Sec-
tion IV-C for as well as . Hence, the signals intro-
duced above will serve as proxies for ( , ).
Fig. 11. Address transmission circuits [T]. (a) Synchronizes row-address
transmission (tro, ti) with latch (so, si) and array (d, t). so and t
are forced high during reset. (b) Synchronizes column-address transmission
(tco, ti) with column arbiter interface (ci, co) and switches the address-mux
(tco).
That takes care of reading. For writing, we use the following
reshuffled sequence (see Appendix, part D):
# yctl (T) #
[[~ti&d&si];tro+,so-;[ti&~d&~si];tro-,so+]
which takes care of both and ; transitions at the same time
as . Compiling this sequence into PRS (see Appendix, part
D) yielded the circuit shown in Fig. 11(a). Initially, , ,
and , are all high. When the row’s address and data appear,
goes low, and when they are latched, goes low. At this point,
assuming is low, , , and go low. These signals are
cleared when , , and , as well as , go high.
Now we implement the column communication, ,
using the following reshuffled sequence (see Appendix, part
D):
# xctl (T) #
[[ti&ci];co+,tco+;[~ci&~ti];co-,tco-].
Compiling this sequence (see Appendix, part D) yielded the
circuit shown in Fig. 11(b). As is high for the burst duration,
and is high initially, we only have to wait for to become
high. When that happens, and go high, which switches
the mux to the column address (see Fig. 6). When both and
become low, these signals are cleared. And when and go
high in response, a new cycle can begin. Since the mux switches
back to the row address when is low, this address can be
reread at any time. If it so desires, the receiver can buy time to
do this by taking its acknowledge high a little later.
We developed a reset strategy to recover when a row is
selected but no data is delivered to the latch. This situation
could arise if the event-generator’s slew-rate is too slow (see
Section IV-A). We activate the data-transfer circuit’s array-ac-
knowledge [ in Fig. 10(a)] to complete the stalled cycle.
We also activate the row-address transmission circuit’s request
[ in Fig. 11(a)] to make the latch transparent, so that the next
row can be written. Since the write failed, we do not bother to
clear the contents of the latch. If the latch is not actually empty
(e.g., during power-up), those bits will corrupt the next burst.
However, after that one is sent, everything will be fine.
V. SUMMARY AND CONCLUSION
We have described an address-event transmitter that reads all
active cells in a selected row in parallel. Row activity is trans-
mitted in a burst: the row address followed by a column ad-
dress for each active cell, plus a termination signal. The array
is cycled to the next row while these events are being trans-
mitted, so the next burst can start as soon as this one ends. Bursts
are communicated using a three-wire protocol: a row-request, a
column-request, and a common acknowledge. In return for the
extra request line, output pads are cut by 50%—without sacri-
ficing throughput—as the row address is not repeated.
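At the word level, the burst format summarized above can be sketched in Python as follows. The tagged tuples are purely illustrative: on the link, the word type is conveyed by which request line is active, not by a tag. The function names are hypothetical, and ascending column order is an assumption; in the transmitter the order is decided by the column arbiter.

```python
def encode_burst(row, row_word):
    """Words sent for one row: its address, one column address per set bit, a terminator."""
    words = [("row", row)]
    words += [("col", x) for x, bit in enumerate(row_word) if bit]
    words.append(("end", None))
    return words


def decode_bursts(words):
    """Rebuild (row, column) events from a stream of bursts."""
    events, row = [], None
    for kind, value in words:
        if kind == "row":
            row = value                    # remembered for the rest of the burst
        elif kind == "col":
            events.append((row, value))    # row address is not repeated per event
        else:
            row = None                     # terminator ends the burst
    return events
```

For example, a row word with bits 1 and 3 set in row 5 becomes four words, and decoding them yields the events (5, 1) and (5, 3).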
In terms of cell area, the cost of parallel-readout is minimal.
Whereas previous transmitter designs add four transistors to the
cell (reviewed in [15]), our design requires nine transistors. Both
of these counts include the transistor that pulls down the row-
request line but do not include the one that resets the event-
generator. On the other hand, our design requires just one line per
column whereas previous designs require two, since they select
columns individually. Trading a metal line for five transistors
is highly favorable when wires are at a higher premium than
transistors, which is increasingly the case. Thus, the increased
throughput—and scalability—parallelism offers [24] is attained
at little cost in hardware.
We also illustrated how to synthesize an asynchronous im-
plementation starting from a high-level specification by way of
a concrete example. The result was eight logic circuits that, to-
gether, can be used to implement a burst-mode, word-serial, ad-
dress-event transmitter of any desired size. These circuits in-
clude an arbiter design that allows parallelism to be exploited
fully by ensuring that a row is not reread until all those waiting
are serviced. We have laid out a library of cells (in MOSIS
DEEP_SUBM rules) for these circuits and written a silicon-
compiler to tile them to fit any desired pixel- or array-size.
Thus far, this tool has successfully compiled transmitters for
three generations of chips, fabricated in 0.6-, 0.4-, and 0.25-µm
technology [24].
APPENDIX
LOGIC SYNTHESIS
When compiling HSE into PRS, we perform two passes. On
the first pass, we make the wait before an action the guard for its
production rule. For example, , the pas-
sive port’s sequence, is realized by the set
, which is implemented by a wire. On the second pass, we
strengthen guards that can become true at some other point in
the sequence by ANDing with another boolean variable. If all
signals are in exactly the same state at these two points, we add
a state variable to distinguish them, setting it after we pass the
first point and clearing it after we pass the second point, or vice
versa.
For example, the CHP process , where is active and
passive, as above, could be augmented with the state variable
as follows:
[ao+;[ai];s+;ao-;[~ai];[pi];po+;[~pi];
s-;po-].
’s state now distinguishes the point where ends and begins
from the point where ends and repeats. Alternatively, the
ambiguous state can be eliminated if we begin before ends.
For example
[ao+;[ai];[pi];po+;ao-;[~ai];[~pi];po-]
is unambiguous. This reshuffling, if acceptable, is cheaper to
implement, as it does not require a state variable.
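The ambiguity test described above can be checked mechanically. The Python sketch below walks one cycle of an HSE sequence and reports any point where all signals are in the same state as at an earlier point but a different output transition is due. It is a minimal illustration under stated assumptions: every signal starts low, each wait names a single signal that the environment then drives, and the outer repetition brackets are left off the input string. The function names are hypothetical.

```python
def parse_hse(body):
    """Turn 'ao+;[ai];ao-' into (signal, new_value, is_output) steps."""
    steps = []
    for tok in body.split(";"):
        tok = tok.strip()
        if tok.startswith("["):                 # wait on an input, e.g. [ai] or [~ai]
            lit = tok[1:-1]
            if lit.startswith("~"):
                steps.append((lit[1:], 0, False))
            else:
                steps.append((lit, 1, False))
        else:                                    # output transition, e.g. ao+ or ao-
            steps.append((tok[:-1], 1 if tok[-1] == "+" else 0, True))
    return steps


def ambiguous_states(body):
    """List states reached twice in one cycle with different pending outputs."""
    steps = parse_hse(body)
    names = sorted({name for name, _, _ in steps})
    state = dict.fromkeys(names, 0)              # assumption: all signals start low
    seen, conflicts = {}, []
    for name, value, is_output in steps:
        key = tuple(state[n] for n in names)
        pending = (name, value) if is_output else None
        if key in seen and seen[key] != pending:
            conflicts.append((dict(zip(names, key)), seen[key], pending))
        seen.setdefault(key, pending)
        state[name] = value
    return conflicts


unshuffled = "ao+;[ai];ao-;[~ai];[pi];po+;[~pi];po-"
with_state_var = "ao+;[ai];s+;ao-;[~ai];[pi];po+;[~pi];s-;po-"
reshuffled = "ao+;[ai];[pi];po+;ao-;[~ai];[~pi];po-"
```

Under these assumptions, `ambiguous_states(unshuffled)` reports the all-low state, which occurs both where ao+ is due and where the process must wait for pi, while the other two sequences produce no conflicts.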
We can often avoid adding state variables by reshuffling sequences
in this way, provided the change in sequencing is benign.
When compiling PRS for such sequences, multiple preceding
waits are ANDed together (e.g., ) and
preceding actions become guards too (e.g., ). Another
goal of reshuffling is symmetry: clearing signals in the
same order that you set them. Such symmetry makes the signals
that appear in the pull-up and the pull-down the same. This
duplication usually results in a simpler implementation, as the
pull-up is disabled when the pull-down is active, and vice versa.
It is sometimes possible to convert a state-holding gate into
a combinational one, thereby avoiding the need for a staticizer.
That is, to make the gate’s pull-up ( ) and pull-down
( ) complementary . Such conversion is typ-
ically done by ORing terms with the pull-up and ANDing
terms with the pull-down , or vice versa. For example,
requires a staticizer, since is not an
identity. However, is combina-
tional, since is an identity. In fact, that is a NAND
gate. These added terms must have a benign effect, such that
at all points in the sequence, where is the original
guard and is the weakening term.
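Whether a pair of guards is complementary can be checked by brute force over the inputs. The Python sketch below is only an illustration of that check; `is_combinational` is a hypothetical helper, and the three examples are a NAND gate, a C-element as defined in this paper, and the latch's ro rules from part C (set when go or gi is high, cleared when both are low).

```python
from itertools import product


def is_combinational(inputs, pull_up, pull_down):
    """True if exactly one of the two guards holds for every input combination."""
    for values in product([False, True], repeat=len(inputs)):
        env = dict(zip(inputs, values))
        if bool(pull_up(env)) == bool(pull_down(env)):
            return False   # both off (state must be held) or both on (interference)
    return True


# NAND: output pulled up when either input is low, pulled down when both are high.
nand = is_combinational(
    ["a", "b"],
    lambda e: not e["a"] or not e["b"],
    lambda e: e["a"] and e["b"],
)

# C-element: set when both inputs are high, cleared when both are low.
c_elem = is_combinational(
    ["a", "b"],
    lambda e: e["a"] and e["b"],
    lambda e: not e["a"] and not e["b"],
)

# Latch request ro: set when go or gi is high, cleared when both are low.
ro_gate = is_combinational(
    ["go", "gi"],
    lambda e: e["go"] or e["gi"],
    lambda e: not e["go"] and not e["gi"],
)
```

The C-element fails the check because neither guard holds when its inputs disagree, which is exactly when a staticizer is needed to hold the output.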
A. Row
Making ’s port passive, and its , , and ports
active (see Section III-C), yielded this single-bit HSE sequence
[[pi];w+;ro+;[ri];co+,po+;[~pi];w-;
[ci];co-,po-;[~ci];ro-;[~ri]]
where has been omitted for the time being and the sub-
script is suppressed for brevity. If we move the second com-
munication (two-phase) ahead of the second communication,
the arbiter can start selecting the next row earlier. However,
we must ensure that this newly selected row does not interrupt
an on-going column communication. It can be blocked by ad-
vancing forward to where occurs in the next cycle,
which also provides more time to complete the column commu-
nication. Thus, the sequence becomes
[[pi];w+;ro+;[ri&~ci];co+,po+;[~pi];w-;
[ci];ro-;[~ri];co-,po-]
where is broadcast to all the rows. Next, we augment the
sequence with row-wide event and selection signals to
support multiple bits. For , which is the OR of all the bit-level
signals, we insert “ ” after “ ” and “ ” after
“ ”. And for , which mirrors , we insert “ ” after
“ ” and “ ” after “ ”. Thus, we obtain
[[pi];w+;p+;[p];ro+;[ri&~ci];s+;[s];co+,po+;
[~pi];w-;p-;[~p];[ci];ro-;[~ri];s-;[~s];
co-,po-].
Finally, moving the row-level parts (i.e., from to and
to ) into a separate (arbiter-interface) sequence yielded the
reshuffling presented in Section IV-A .
We compiled our final reshufflings into the following PRS:
pi&~s->w+     s&w->co+,po+
p->ro+        ri&~ci->s+
~pi->w-       ~s->co-,po-
~p&ci->ro-    ~ri->s-.
We strengthened the guard of with to ensure that only
those cells that were active when the row was selected partici-
pate, as required by concurrency. And we strengthened the guard
for , with , to ensure that only active cells respond.
The circuits are shown in Fig. 7.
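As a sanity check on these production rules, the following Python sketch fires them one at a time for a row containing a single active cell. Several things here are assumptions, not part of the design: the guard of s+ is read as ri&~ci, matching the wait [ri&~ci] in the HSE; p simply follows w, since the row has only one cell; the environment is idealized (the event generator lowers pi once po is high, the arbiter's ri follows ro, and the column side's ci follows co); and rules fire in a fixed order, whereas the real circuit is concurrent.

```python
RULES = [
    # (signal, new value, guard): row and arbiter-interface rules
    ("w", 1, lambda v: v["pi"] and not v["s"]),
    ("w", 0, lambda v: not v["pi"]),
    ("p", 1, lambda v: v["w"]),                   # assumed: single cell, p follows w
    ("p", 0, lambda v: not v["w"]),
    ("ro", 1, lambda v: v["p"]),
    ("ro", 0, lambda v: not v["p"] and v["ci"]),
    ("s", 1, lambda v: v["ri"] and not v["ci"]),
    ("s", 0, lambda v: not v["ri"]),
    ("co", 1, lambda v: v["s"] and v["w"]),
    ("po", 1, lambda v: v["s"] and v["w"]),
    ("co", 0, lambda v: not v["s"]),
    ("po", 0, lambda v: not v["s"]),
    # assumed environment
    ("pi", 0, lambda v: v["po"]),                 # event generator resets when acknowledged
    ("ri", 1, lambda v: v["ro"]),                 # arbiter grants the only request
    ("ri", 0, lambda v: not v["ro"]),
    ("ci", 1, lambda v: v["co"]),                 # column side acknowledges
    ("ci", 0, lambda v: not v["co"]),
]


def simulate(max_steps=100):
    """Fire the first enabled rule repeatedly, starting with one pending event."""
    v = dict.fromkeys(["pi", "w", "p", "ro", "ri", "s", "co", "po", "ci"], 0)
    v["pi"] = 1
    trace = []
    for _ in range(max_steps):
        for name, value, guard in RULES:
            if v[name] != value and guard(v):
                v[name] = value
                trace.append(name + ("+" if value else "-"))
                break
        else:
            return trace                          # no rule enabled: circuit is quiescent
    raise RuntimeError("rules did not settle")


def project(trace, names):
    """Keep only the transitions of the named signals."""
    return [event for event in trace if event[:-1] in names]
```

Projecting the resulting trace onto w, co, and po, and onto ro and s, reproduces the order of actions in the two reshuffled HSE sequences of Section IV-A.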
We include the communication by observing that it occurs
simultaneously with (see Section III-C). Hence, we can ac-
tivate , as well as , with , and combine with using a
C-element.5 Alternatively, since we activate the row and the en-
coder at the same time, we can assume the column bus’s acknowl-
edge indicates that both the row’s state and its address have
been latched. This timing assumption eliminates the C-element,
but requires that we compensate for worst-case timing-differ-
ences between the address-encoding and data-transfer processes.
B. Arbiter
Making ’s and ports passive and its port
active yielded this HSE sequence for the first communication
process (see Section IV-B)
[[l1i];a1o+;[a1i];ro+;[ri];l1o+;
[~l1i];l1o-;ro-;[~ri ];a1o-;[~a1i]]
where port (and ) is passive. If we execute without
waiting for , we will allow the upper levels to make de-
cisions concurrently. We can maintain mutually exclusive ac-
cess to by delaying until the next cycle. This so-called
lazy-active reshuffling (e.g., ) gave us
[[l1i&~ri];a1o+;ro+;[ri&a1i];l1o+;
[~l1i];l1o-;ro-;a1o-;[~a1i]]
However, the other daughter is not excluded if her request
(i.e., ) becomes active while is still false. In that case,
5A two-input gate whose output is set when both inputs are high and cleared
when both are low.
we can kill two birds (service both daughters) with one stone
(a single communication). That is, once fires, the arbi-
tration process will make true, allowing the other commu-
nication process to get past , where it is held up, and
select the other daughter.
We can also get to happen faster by postponing
till the end and clearing and in the same order that we
set them. These changes also make the sequence more sym-
metric, which simplifies the logic. Thus, we obtained the reshuf-
fled HSE sequence presented in Section IV-B . We
compiled that sequence into this PRS
l1i&_ri->a1o+       ~l1o->_l1o+
~l1i->a1o-          l1o->_l1o-
~_a1i&~_ri->l1o+    a1o|a2o->ro+
_a1i|_ri->l1o-      ~a1o&~a2o->ro-
where we simply OR (AND) the two pull-ups (pull-downs)
together. Weakening ’s guard with is since
becomes true first. Because, as you can see from the circuit
(Fig. 9), propagates through only two gates to set
but it propagates through one gate in this cell plus three gates
at the next level—and an inverter—to set .
C. Latch
Making ’s port passive, and using a bundled-data rep-
resentation, yielded this HSE sequence for its memory cell
[[ri&rk];w+;ro+;go+;[gi];go-;[~gi];w-;
[~ri];ro-]
where port is active (the subscript is suppressed). Moving
the second two-phase communication to the middle of the
second (two-phase) communication eliminates an ambiguous
state. However, this reshuffling implies that the cell cannot start
the second communication (with the arbiter-interface) before
the second communication (with the mux) starts. The con-
sequences of synchronizing these two-phase communications
are dealt with in Appendix , part D. Swapping with
and with reduces asymmetry; both swaps are benign.
These changes yielded the reshuffling given in Section IV-C
.
We compiled our final reshuffling into the following PRS:
ri&rk->w+     w->go+     go|gi->ro+
~ri&gi->w-    ~w->go-    ~go&~gi->ro-.
Weakening ’s guard is safe because happens later;
strengthening ’s makes the gate combinational. The circuit
is shown in Fig. 10(b).
D. Mux
The CHP for ’s row communications calls for par-
allel execution of and (see Section III-C). How-
ever, we allowed them to run in parallel only after the row’s ad-
dress and its data are latched, since we wished to merge the reads
(see Appendix, part A). With ports and passive and ports
and active, this strategy is realized by the HSE sequence
[[di];so+;[si];(do+;[~di];do-)||
(tro+;[ti];so-;[~si];tro-;[~ti])]
where serve the merged – port. We broke this se-
quence up into two concurrent read and write sequences and
synchronized them with two new variables, and
[[di];d+;[t];do+;[~di];d-;do-;[~t]]
k[[d];so+;[si];t+;tro+;[ti];so-;
[~d&~si];tro-;[~ti];t-]
could have been executed anywhere between and ; we
went with symmetry. The first sequence is identical to the read
sequence given in Section IV-C .
We reshuffled the write sequence further. Moving im-
mediately after the previous cycle’s makes the latch trans-
parent as soon as it becomes empty. This move allows us to
merge and , and we can consolidate as well by
using the lazy-active reshuffling. Postponing till after
allows the second (two-phase) communication to occur as
soon as we start transmitting the row’s address. These opti-
mizations yielded the write sequence given in Section IV-D
.
This reshuffling deals with the consequences of synchro-
nizing ’s and communications. That is, we do not
hamper the memory cell’s second (two-phase) communi-
cation (see in Section IV-C), as , which
corresponds to (see in Section IV-D), be-
comes true at the beginning of the burst. Thus, the memory-cell
can complete its second communication with the arbiter-in-
terface right away.
The read sequence above yielded the following PRS:
di&~t->d+     ~di->d-
d&t->do+      ~d|~t->do-.
Weakening ’s guard is safe because becomes false after
does. The circuit is shown in Fig. 10(a).
And we compiled the final reshuffling of the write sequence
(see in Section IV-D) into the following PRS:
~ti&d&si->t+,tro+,so-
ti&~d&~si&~tco->t-,tro-,so+.
To prevent from firing before the last column-address
transmission is completed (see below), we strengthened its
guard with . This precaution is necessary because when
the column arbiter interface is acknowledged it clears
its acknowledge to the memory cell, at which point becomes
false (see Fig. 6). Then, could fire while we are waiting
for to become false in response (see below). The
circuit is shown in Fig. 11(a).
For ’s column communications, , making
port passive and port active yielded this HSE sequence
[[ci];co+;[~ci];co-;[ti];tco+;[~ti];tco-].
Relocating the communication’s two halves a quarter and
three-quarters of the way through the communication yielded
the reshuffling presented in Section IV-D . Thus,
reception is acknowledged at the same time transmission
starts , which requires us to make the encoder’s outputs
state-holding (see Fig. 6). Therefore, we added staticizers to all
its outputs, including the extra always-a-one line that serves as a
request, and we tied a pFET to the request line—as in Fig. 8—to
clear it.
We compiled the final reshuffling into the following PRS:
ti&ci&tro->co+,tco+ ~ti&~ci->co-,tco-.
We have strengthened the guard of with to block
a column address from a newly loaded row from being trans-
mitted while we are waiting for to clear, after goes low.
The circuit is shown in Fig. 11(b).
ACKNOWLEDGMENT
The author would like to thank C. Higgins, T. Horiuchi,
B. Linares-Barranco, and T. Serrano-Gotarredona for their
invaluable help in beta-testing this interface, and fishing out and
documenting bugs. He would also like to thank P. Merolla for
helping with adding serial-address transmission to the design.
REFERENCES
[1] C. A. Mead and T. Delbruck, “Scanners for visualizing analog vlsi cir-
cuitry,” Analog Integ. Circuits Signal Process., vol. 1, pp. 93–106, 1991.
[2] W. Yang, “A wide-dynamic range low-power photosensor array,” in
Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC’94), vol. 37, San
Francisco, CA, 1994, p. 230.
[3] B. Fowler, A. E. Gamal, and D. Yang, “A CMOS area image sensor
with pixel-level A/D conversion,” in Proc. IEEE Int. Solid-State Circuits
Conf. (ISSCC’94), vol. 37, San Francisco, CA, 1994, pp. 226–227.
[4] L. G. McIlrath, “A low-power low-noise ultrawide-dynamic-range cmos
imager with pixel-parallel A/D conversion,” IEEE Trans. Solid-State
Circuits, vol. 36, pp. 846–853, May 2001.
[5] A. Murray and L. Tarassenko, Analogue Neural VLSI: A Pulse Stream
Approach. London, U.K.: Chapman and Hall, 1994.
[6] M. Mahowald, An Analog VLSI Stereoscopic Vision System. Boston,
MA: Kluwer Academic, 1994.
[7] K. A. Boahen, “The retinomorphic approach: pixel-parallel adaptive am-
plification, filtering, and quantization,” Analog Integr. Circuits Signal
Process., vol. 13, pp. 53–68, 1997.
[8] E. Culurciello, R. Etienne-Cummings, and K. Boahen, “Arbitrated
address event representation digital image sensor,” in Proc. IEEE Int.
Solid-State Circuits Conf. (ISSCC’01), Feb. 2001, pp. 92–93.
[9] J. Kramer, “An on/off transient imager with event-driven asynchronous
readout,” in Proc. IEEE Int. Symp. Circuits and Systems, vol. 2, 2002,
pp. II-165–II-168.
[10] J. Lazzaro, J. Wawrzynek, M. Mahowald, M. Sivilotti, and D. Gille-
spie, “Silicon auditory processors as computer peripherals,” IEEE Trans.
Neural Networks, vol. 4, pp. 523–528, Mar. 1993.
[11] M. Sivilotti, “Wiring considerations in analog VLSI systems, with
application to field-programmable networks,” Ph.D. dissertation, Dept.
Comp. Sci., California Institute of Technology, Pasadena, CA, 1991.
[12] A. Mortara, E. Vittoz, and P. Venier, “A communication scheme for
analog VLSI perceptive systems,” IEEE J. Solid-State Circuits, vol. 30,
pp. 660–669, June 1995.
[13] A. Abusland, T. S. Lande, and M. Hovin, “A VLSI communication archi-
tecture for stochastically pulse-encoded analog signals,” in Proc. IEEE
Int. Symp. Circuits and Systems, vol. 3, May 1996, pp. 401–404.
[14] K.A. Boahen, “Communicating neuronal ensembles between neuromor-
phic chips,” in Neuromorphic Systems Engineering: Neural networks in
Silicon, T. S. Lande, Ed. Boston, MA: Kluwer Academic, 1998, ch.
11, pp. 229–262.
[15] K. A. Boahen, “Point-to-point connectivity between neuromorphic chips using
address-events,” IEEE Trans. Circuits Syst. II, vol. 47, pp. 416–434, May
2000.
[16] C. A. Mead, Introduction to VLSI Systems. Reading, MA: Addison
Wesley, 1980.
[17] J. G. Elias, “Artificial dendritic trees,” Neural Computation, vol. 5, pp.
648–663, 1993.
[18] S. R. Deiss, R. J. Douglas, and A. M. Whatley, “A pulse-coded commu-
nications infrastructure for neuromorphic systems,” in Pulsed Neural
Networks, W. Maass and C. M. Bishop, Eds. Boston, MA: MIT Press,
1999, ch. 6, pp. 157–178.
[19] C. M. Higgins and C. Koch, “Multi-chip motion processing,” in Pro-
ceedings of Conference on Advanced Research in VLSI. Los Alamitos,
CA: IEEE Comp. Soc. Press, 1999, vol. 20, pp. 309–322.
[20] S. P. DeWeerth, G. N. Patel, M. F. Simoni, D. E. Schimmel, and R.
L. Calabrese, “A VLSI architecture for modeling intersegmental coor-
dination,” in Proc. 17th Conf. Advanced Research in VLSI, 1997, pp.
182–200.
[21] J. P. Lazzaro and J. Wawrzynek, “A multi-sender asynchronous exten-
sion to the address-event protocol,” in Proc. 16th Conf. Advanced Re-
search in VLSI, 1995, pp. 158–169.
[22] K. A. Boahen, “A throughput-on-demand address-event transmitter for
neuromorphic chips,” in Proc. 20th Anniversary Conf. Advanced Re-
search in VLSI, 1999, pp. 72–86.
[23] K. A. Boahen, “A burst-mode word-serial address-event link II: Receiver design,”
IEEE Trans. Circuits Syst. I, vol. 51, pp. 1281–1291, July 2004.
[24] K. A. Boahen, “A burst-mode word-serial address-event link—III: Analysis and
test results,” IEEE Trans. Circuits Syst. I, vol. 51, pp. 1292–1300, July
2004.
[25] C. A. Mead, Analog VLSI and Neural Systems. Reading, MA: Ad-
dison-Wesley, 1989.
[26] M. Schwartz, Telecommunication Networks: Protocols, Modeling, and
Analysis. Reading, MA: Addison-Wesley, 1987.
[27] K. A. Boahen, “Retinomorphic vision systems II: communication
channel design,” in Proc. IEEE Int. Symp. Circuits and Systems, May
1996, pp. 14–17.
[28] A. Martin, “Programming in VLSI: From communicating pro-
cesses to delay-insensitive circuits,” in Proceedings of UT Year of
Programming Institute on Concurrent Programming. Reading, MA:
Addison-Wesley, 1990, pp. 1–64.
Kwabena A. Boahen received the B.S. and M.S.E.
degrees in electrical and computer engineering
from The Johns Hopkins University, Baltimore,
MD, in the concurrent masters-bachelors program,
both in 1989, and the Ph.D. degree in computation
and neural systems from the California Institute of
Technology, Pasadena, in 1997.
He is an Associate Professor in the Bioengineering
Department at the University of
Pennsylvania, Philadelphia, where he holds a
secondary appointment in electrical engineering.
His current research interests include mixed-mode multichip VLSI models of
biological sensory and perceptual systems, and their epigenetic development,
and asynchronous digital interfaces for interchip connectivity.
Dr. Boahen was awarded a Packard Fellowship in 1999, a National Science
Foundation CAREER Grant in 2001, and an Office of Naval Research YIP Grant
in 2002. He is a member of Tau Beta Kappa and has held a Sloan Fellowship
for Theoretical Neurobiology at the California Institute of Technology.
