The post office experience: designing a large asynchronous chip by Stevens, Kenneth & Davis, Al
Palo Alto, CA 94019
Abstract
The Post Office is an asynchronous, 300,000 tran­
sistor, full-custom CMOS chip designed as the com­
munication component for the Mayfly scalable parallel 
processor. Performance requirements led to the devel­
opment of a design style which permits the design of 
sequential circuits operating under a restricted form 
of multiple input change signalling called burst-mode. 
The Post Office complexity forced us to develop a set 
of design tools capable of correctly synthesizing tran­
sistor circuits from state machine and equation specifi­
cations, and capable of verifying the correctness of the 
resultant circuitry using implementation specific tim­
ing assumptions. The paper provides a case study of 
this design experience.
1 Introduction
The Post Office was designed to support inter­
node communication for the Mayfly parallel processing 
system[8]. The Post Office handles all of the physical 
delivery aspects of packet communication. This in­
cludes local buffering, dynamic adaptive routing and 
congestion avoidance, deadlock avoidance, and virtual 
cut-through. The Mayfly topology was designed to 
be extensible and permits an unbounded number of 
PEs to be interconnected. This implies that the phys­
ical extent of the system is not fixed and poses serious 
problems when considering an implementation strat­
egy which uses a common global clock. Clock skew is 
a possible headache for any synchronous design style, 
and is magnified as technology progresses[1]. In the
case of extensible systems such as Mayfly, where the 
total number of boards is unbounded, the synchronous 
choice becomes intractable. We therefore chose an 
asynchronous design style for the Post Office imple­
mentation.
Ken Stevens
Computer Science Department 
University of Calgary 
Calgary, Alta T2N 1N4
Another critical design constraint was the need for a 
high performance implementation, since message pass­
ing performance would be critical to the success of the 
Mayfly system. Proponents often argue that asyn­
chronous circuits are inherently faster since they are 
controlled by locally adaptive timing rather than the 
usual global worst-case clock frequency constraints. 
While we believe that this claim has merit, we feel 
that in general it is misleading. Asynchronous circuits 
require more components to implement the same func­
tion. This may result in longer wires, increased area, 
and reduced performance. When compared to a very 
well tuned synchronous design, a functionally equiv­
alent asynchronous implementation may actually run 
slightly slower. The need for speed heavily influenced 
our particular asynchronous design style.
Notational Comment: We use the terms asynchronous
and self-timed synonymously. All asynchronous or
self-timed design styles are fundamentally concerned
with the synthesis of hazard free circuits under some
timing model. DI (delay-insensitive) circuits exhibit
hazard free behavior with arbitrary delays assigned to
both the gates and the wires, and SI (speed-independent)
circuits are hazard free with arbitrary gate delays but
assume zero wire delays.
There are a large number of rather different de­
sign styles in today’s asynchronous design commu­
nity. One partition of design styles can be based 
on the type of asynchronous circuit target: locally 
clocked[17,10,7], delay-insensitive[12,3,23,16], or var­
ious forms of single- and multiple- input change 
circuits[22]. Yet another distinction could be made 
on the nature of the control specification: graph 
based[15,4], programming language based[12,23,2], or 
finite state machine based[17,10]. For the finite state 
machine based styles, there is a further distinction 
that can be made based on the method by which state 
variables are assigned[11,21]. The design style space
is large and each design style has its own set of merits
and demerits. It is worthwhile to note that virtually
all of the design styles focus on the design of the
control path of the circuit.

0-8186-1060-3425/93 $03.00 © 1993 IEEE
Compiled implementations based on programming 
language like specifications[12,3,23,2], while elegant 
and robust, suffer in performance because they are 
presently compiled into intermediate library mod­
ules rather than into optimized transistor networks. 
A module of significant concern is the C-element. 
C-elements are common circuit modules in asyn­
chronous circuits and eliminating them completely is 
unlikely. C-elements are both latches and synchro­
nization points. Too much synchronization reduces 
parallelism and performance.
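To make the latching and synchronizing roles concrete, the following minimal Python sketch models the textbook behavior of a two-input Muller C-element. The function name `c_element` and the input sequence are illustrative assumptions and are not taken from the Post Office design.

```python
def c_element(inputs, previous_output):
    """Muller C-element: follow the inputs when they agree, otherwise hold."""
    if all(inputs):
        return 1
    if not any(inputs):
        return 0
    return previous_output


out = 0
trace = []
# One input rises early; the output waits until the second input catches up.
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    out = c_element((a, b), out)
    trace.append(out)
print(trace)
```

The output changes only once both inputs agree, which is exactly the synchronization that costs parallelism when it is used more often than the protocol requires.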
The methods which produce DI circuits, while not 
perfect[13], are the most tolerant of variations in de­
vice and wire delays. This tolerance improves the 
probability that a properly designed circuit will con­
tinue to function under variations in supply voltage, 
temperature, and process parameters. We chose to 
slightly expand the domain of timing assumptions 
which must remain valid to retain hazard free imple­
mentation since this permits higher performance im­
plementations at the expense of reduced operational 
tolerance. Our view is motivated by the reality that 
our designs have to meet certain performance require­
ments. For any given layout and fabrication process, 
we have models which predict the speeds of the wires 
and transistors for the desired operational window. 
We also know the percentage of error that can be tol­
erated in those predictions. We could not live with ar­
bitrary delays for performance reasons and therefore it 
seems impractical to assume arbitrary delays in order 
to ensure hazard free operation of the circuits. Our 
approach has therefore been to ensure hazard free
operation under sets of timing assumptions that can be
verified as being within acceptable windows of fabri­
cation and operational tolerance.
We chose to pursue a finite state machine based 
style for two reasons: 1) the finite state machine con­
cept is a familiar one for hardware designers like our­
selves, and 2) the graph and programming language 
based synthesis methods that we knew about were 
too slow for our purposes. The finite state machine 
based design style does not use C-elements, although 
C-elements are used sparingly and in stylized ways in 
interface circuits such as arbiters.
In order to achieve the necessary hazard free asynchronous
finite state machine (AFSM) implementations, it is
necessary to place constraints on how their inputs
are allowed to change. The most common is the
single input change or SIC constraint[22]. SIC circuits 
inherently require state transitions after each input 
variable transition. In cases where the next interest­
ing behavior is in response to multiple input changes, 
the circuit response will be artificially slow, either due 
to too many state transitions or due to the external ar­
biters required to sequence the multiple inputs. Multi­
ple input change or MIC circuit design methods have 
been developed[22,5] but either required input restric­
tions or involved implementation techniques that were 
unsuitable for our purposes. As a result we developed 
a design style that we call burst-mode which permits
a certain style of multiple input change. Our
burst-mode implementation method does not require 
performance inhibiting local clock generation or flip- 
flops.
As our Post Office state machines got too complex 
for hand synthesis, we decided to create a tool kit that 
was capable of automatically synthesizing the transis­
tor level circuits from burst-mode specifications. We 
call this tool kit MEAT. During the development of
MEAT, we were fortunate to have Steve Nowick spend 
two summers with us. He incorporated David Dill’s 
verifier[9] into the tool kit, and modified the verifier 
to accommodate our burst-mode timing model. Steve 
had considerable influence on our ideas and his locally 
clocked design style[17] is another outcome of these 
earlier interactions. We are indebted to Steve for his 
influence on our design style.
The Post Office was fabricated using MOSIS revi­
sion 6 design rules in a 1.2 micron CMOS process. The 
circuit contains 300,000 transistors and has an area of 
11 x 8.3 mm. There are 95 AFSMs, most of which 
operate concurrently and which account for 19% of 
the chip area. Datapath circuitry takes up 45%, pads 
cover 11%, wire routing requires 19%, and the other 
6% is unused space on the rectangular die.
Our view in the beginning of the Post Office de­
sign effort was that we could use existing tools for the 
datapath, but that since adequate performance ori­
ented synthesis tools did not exist for asynchronous 
controllers, we would have to create such a tool. Hence 
the focus of our design method and the MEAT tools 
was directed at the control path. The goal was to syn­
thesize optimized transistor level schematics that were 
fast. It was our mistaken impression that once MEAT 
worked, the design of such a large and complex de­
vice would be well supported by our CAD capability. 
What follows is a brief description of our design style, 
the MEAT tools, and a retrospective view of how well 
the design process went.
2 Design Style
The Post Office datapath circuitry was designed using
a style similar to that used for conventional synchronous
circuits. The differences are minor and are
a direct result of the need to cooperate with the
signalling protocols imposed by the AFSMs in the control
path. Data is transferred between elements using a
four-cycle self-timed bundled-data protocol[19]. This
is a weaker model than that of speed-independence
used for control signals inside state machines[14].
There are also some datapath circuits which are
controlled in a clocked domain, rather than in a
self-timed fashion. These datapath cells contain a
stoppable clock or are clocked by state machine outputs.
The stoppable or raw clock signals are usually
burst-mode generated in parallel with an associated set of
asynchronous hand-shake signals. The minimum delay
of these hand-shake signals must be greater than the
maximum delay required by the clocked circuitry. We
use conventional timing analysis tools on these paths
to ensure that the assumption is correct given our
expected process parameters and an error margin.
Making timing assumptions for datapath logic has
two consequences. First, the logic is smaller and simpler
because no completion signals are generated. Second,
there is an increased potential for errors, as correct
operation is now dependent on physical circuit
layout, process, and environmental parameters. The
result is that these timing assumptions must only be
made within a module which can be subsequently verified.
It is dangerous to export timing assumptions between
modules since the transitive nature of the path
inequalities rapidly leads to intractable analysis and
verification complexity.
The control path is specified as burst-mode AFSMs
and their implementation is synthesized by the MEAT
tool. Burst-mode permits a conjunctive input burst to
arrive prior to responding with an output burst. There
is an implicit fundamental mode assumption similar to
SIC AFSMs in that once an input burst is received,
the AFSM must be given time to respond with its
output burst and settle into a new stable state prior
to the arrival of a new input burst. Burst-mode is a
restricted form of MIC signalling best illustrated by a
simple example.
For any given exit arc from a state, the arc is labelled
with the input burst that will cause the transition,
and the associated output burst that will be the
AFSM's response. As such this looks like a traditional
Mealy FSM model. Let there be 4 input variables (A,
B, C, and D) and 3 output variables (R, S, and T). A
state transition may be labelled ↑A ↑B ↓C / ↓R ↑T. We
use a positive logic convention and hence the meaning
is that if signals A and B go high and C goes low
then the AFSM should respond with a low going transition
of R and a high going transition on T. The order of the
3 input transitions is unspecified and therefore they may
occur in any order including concurrently. The same
unspecified order applies to the output burst.
The MIC restriction is that for any set of arcs
leaving a single state, completion of their input
bursts must be mutually exclusive. For example,
given a state with the previous state transition,
another state transition from the same state could be
↑A ↑B ↓D / ↓R ↑S. However a state transition labelled
↑A ↑B / ↓R ↑S would be illegal. There is no way to
safely distinguish in the asynchronous world whether
C is still supposed to occur or whether the AFSM should
just respond to changes on A and B. It would also be
illegal for a transition to be labelled ↑A ↓B ↓C / ↓R ↑T
since each state inherently must look for a known
transition direction for each input variable that it must
respond to.
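The two illegal cases can be checked mechanically. The Python sketch below is only an illustration of the restriction, not the MEAT implementation: it assumes an ASCII encoding in which a `+` suffix means a rising and a `-` suffix a falling transition, and the names `parse_burst` and `mic_violations` are hypothetical. It reads "mutually exclusive completion" as "no input burst is contained in another burst leaving the same state", and the third example assumes the conflicting label is a rising A with falling B and C.

```python
def parse_burst(text):
    """Turn 'A+ B+ C-' into {'A': '+', 'B': '+', 'C': '-'}."""
    return {token[:-1]: token[-1] for token in text.split()}


def mic_violations(input_bursts):
    """Compare every pair of input bursts that leave one state."""
    bursts = [parse_burst(b) for b in input_bursts]
    problems = []
    for i, first in enumerate(bursts):
        for j in range(i + 1, len(bursts)):
            second = bursts[j]
            shared = set(first) & set(second)
            if any(first[s] != second[s] for s in shared):
                # The state would have to watch one input for both directions.
                problems.append((i, j, "direction conflict"))
            elif first.items() <= second.items() or second.items() <= first.items():
                # One burst completes before the other can be told apart.
                problems.append((i, j, "burst contained in another"))
    return problems


legal = mic_violations(["A+ B+ C-", "A+ B+ D-"])
contained = mic_violations(["A+ B+ C-", "A+ B+"])
conflict = mic_violations(["A+ B+ C-", "A+ B- C-"])
print(legal, contained, conflict)
```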
The other consistency requirement is that the transitions
on each input and output variable must strictly alternate
in direction, e.g. for any given directed path in the
AFSM and any input variable X, ↑X must follow a ↓X
and vice versa. Omitting X on an intervening transition
implies no change. The same must be true for output
variables. A corollary to this requirement is that any
circuit (cycle) in the AFSM description will have an even
number of alternating changes on each input and output
variable. The MEAT tools analyze the AFSM specification
and signal an error if these MIC restrictions are not met.
If the specification is valid then MEAT generates the
logic equations which are then folded into a complex
CMOS gate. A schematic is also produced.
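A check of the alternation requirement can be sketched as a walk over the state graph that tracks the level of every signal. The Python below is a minimal illustration under stated assumptions, not the MEAT algorithm: the function name, the `+`/`-` suffix encoding, and the starting levels are assumptions, with all inputs taken to start low. The arcs are those of the MP-Forward-Pkt machine specified later in the paper, with the first output burst read as a falling Alloc-Outbound and a rising RTS.

```python
from collections import deque


def check_alternation(arcs, start, initial_levels):
    """Return None if every signal alternates on all paths, else a message.

    arcs: list of (source, destination, transitions); each transition is a
    signal name followed by '+' (rising) or '-' (falling).
    """
    levels = {start: dict(initial_levels)}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for source, destination, transitions in arcs:
            if source != state:
                continue
            after = dict(levels[state])
            for transition in transitions:
                name = transition[:-1]
                target = 1 if transition.endswith("+") else 0
                if after[name] == target:
                    return f"{transition} on arc {source}->{destination} repeats a direction"
                after[name] = target
            if destination not in levels:
                levels[destination] = after
                queue.append(destination)
            elif levels[destination] != after:
                return f"state {destination} is reached with inconsistent levels"
    return None


# Each arc lists its input burst followed by its output burst.
ARCS = [
    (0, 1, ["Ack-Out+", "Alloc-Outbound-", "RTS+"]),
    (1, 2, ["Req+", "Ack-Out-", "Alloc-PB+", "RTS-"]),
    (2, 3, ["Ack-PB+", "Ack+", "Alloc-PB-"]),
    (3, 0, ["Ack-PB-", "Req-", "Ack-", "Alloc-Outbound+"]),
]
INITIAL = {"Ack-Out": 0, "Ack-PB": 0, "Req": 0,
           "Alloc-Outbound": 1, "RTS": 0, "Alloc-PB": 0, "Ack": 0}

print(check_alternation(ARCS, 0, INITIAL))
```

Replacing the input burst of the arc from state 2 with a second rising Req makes the walk report an error, since Req is already high in that state.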
3 MEAT - a Tool for Control Circuit 
Synthesis
The MEAT synthesis tool is fast enough that alter­
native design options can be explored. The designer 
is freed from the task of understanding the underlying 
transformations required to produce hazard-free asyn­
chronous circuits. The burst-mode specification has 
proven to be both a natural and efficient method for 
specifying AFSMs. Presently MEAT does not contain
a state graph editor, so the graphical state machine
description must be entered textually in the MEAT entry
format. Each arc in the state diagram is mapped to
a single statement in the text file, which indicates the
source and destination states along with the associated
input and output bursts.
The first automated task performed by MEAT is 
to generate a primitive flow table[22] from the textual 
AFSM specification. This is a two-dimensional array 
structure which captures the behavior represented by 
the state diagram. Each row of this table represents 
a node in the state diagram; each column represents 
a unique combination of input signals. Each entry 
in the table thus represents a position in the possible 
state-space of the AFSM. For each entry, the value 
of the output signals and the desired next state may 
be specified. If a next-state value is the same as that 
of the current row, the state machine is said to be 
in a stable state. If the next-state value specifies a
different row, the table entry represents an unstable
state.
All rows will have a stable entry where an input 
burst begins. Other entries in the same row may be 
visited when an input burst occurs. In order for MIC 
behavior to be correctly represented, it must be guar­
anteed that the circuit will remain stable in the initial 
row until the input burst is complete. At this point, 
an unstable state will be entered which will cause a 
transition to the target row specified and fire the out­
put burst. Any entry in the flow table not reachable 
by any allowed sequence of input bursts is labeled as 
a don’t care and can take on any value for the outputs 
or next-state values. As it is not immediately evident 
which values will lead to the simplest circuit, the as­
signment of specific values to the don’t care entries is 
deferred for as long as possible.
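How one input burst populates a row of the flow table can be sketched as follows. This Python fragment is an illustration under stated assumptions rather than MEAT's data structure: the function name `row_entries` and the dictionary layout are hypothetical, and the example row is state 1 of the MP-Forward-Pkt machine specified later in the paper, entered with Ack-Out high and the other inputs low. Every partial burst stays in the current row, and only the completed burst yields an unstable entry pointing at the target row.

```python
from itertools import combinations


def row_entries(row, stable_inputs, burst, target):
    """Map each reachable input vector of one row to its next state.

    stable_inputs: signal -> level at the stable entry where the burst begins.
    burst: signal -> level after its transition in this input burst.
    """
    names = sorted(stable_inputs)
    changed = sorted(burst)
    entries = {}
    for count in range(len(changed) + 1):
        for subset in combinations(changed, count):
            vector = dict(stable_inputs)
            for name in subset:
                vector[name] = burst[name]
            key = tuple(vector[n] for n in names)
            # Only the complete burst leaves the row (unstable entry).
            entries[key] = target if count == len(changed) else row
    return entries


# Columns are ordered (Ack-Out, Ack-PB, Req).
entries = row_entries(
    row=1,
    stable_inputs={"Ack-Out": 1, "Ack-PB": 0, "Req": 0},
    burst={"Req": 1, "Ack-Out": 0},
    target=2,
)
for vector, next_state in sorted(entries.items()):
    print(vector, "->", next_state)
```

Input vectors that never appear in any row's dictionary correspond to the don't care entries described above.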
The next step is to attempt to reduce the number 
of rows in the flow table by merging selected sets of 
two or more rows into one while retaining the speci­
fied behavior. After specifying the reduced flow table, 
MEAT calculates the set of maximal compatible
states. The set of maximal compatibles consists of the
largest sets of state rows which can be merged, which 
are not subsets of any other such set. There may be 
various valid combinations of the maximal compati­
bles that can be chosen to produce a reduced table 
with the same behavior.
The final choice of minimized states must be made
by the designer. There are three constraints on this
choice. First, and obviously, only compatible states
may be combined (compatibility constraint). Second,
each state in the original design must be contained
in at least one of the reduced states (completeness
constraint). Third, selecting certain sets of states to
be merged may imply that other states must also be
merged (closure constraint). If any of the above
constraints is not satisfied, MEAT will inform the user
that the covering is invalid.
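The three constraints can be expressed as a small checker. The Python sketch below is a hedged illustration, not MEAT's code: the function name `check_cover` and the `implied` argument (a map from a merged set of states to the sets that must then also be merged) are hypothetical, and the paper does not give the implied sets for any machine. The maximal compatibles used in the example are those MEAT reports for sbuf-send-ctl in Section 4.

```python
def check_cover(cover, maximal_compatibles, all_states, implied=None):
    """Return a list of violated constraints for a proposed state cover."""
    implied = implied or {}
    chosen = [frozenset(c) for c in cover]
    maximal = [frozenset(m) for m in maximal_compatibles]
    problems = []
    # Compatibility: every chosen class must lie inside a maximal compatible.
    for c in chosen:
        if not any(c <= m for m in maximal):
            problems.append(f"compatibility: {sorted(c)}")
    # Completeness: every original state must appear in some chosen class.
    missing = set(all_states) - set().union(*chosen)
    if missing:
        problems.append(f"completeness: {sorted(missing)}")
    # Closure: merging some states may force other states to be merged too.
    for c in chosen:
        for merged, required in implied.items():
            if merged <= c:
                for group in required:
                    if not any(frozenset(group) <= other for other in chosen):
                        problems.append(f"closure: {sorted(group)}")
    return problems


MAXIMAL = [(0, 5), (1, 2, 7), (3, 4), (6,)]
print(check_cover(MAXIMAL, MAXIMAL, range(8)))
print(check_cover([(0, 5), (1, 2), (3, 4), (6,)], MAXIMAL, range(8)))
```

The second call drops state 7 from the cover and is therefore reported as incomplete.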
A set of state variables must be assigned to uniquely
identify each row of the reduced flow table. In contrast
to synchronous control logic design, state codes may
not be arbitrarily assigned, but must be carefully chosen
to prevent critical races. The MEAT state assignment
algorithm is based on a method developed by Tracey[21].
The Tracey algorithm has the advantage that it produces
Single Transition Time (STT) state assignments.
In cases where two or more state variables
must change value when transitioning to a new state,
all variables involved are allowed to change concurrently,
or race. It must be guaranteed that the outcome
of the race is independent of the order in which
the state variables actually transition in order to produce
a non-critical race which exhibits correct asynchronous
operation. Several valid assignments may be
produced, and each will be passed to the next stage for
evaluation. This will result in unique implementations
for each state assignment.
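The race condition can be tested with the usual dichotomy argument: two transitions in the same input column that end in different states must be separated by at least one state variable that stays constant during each transition and differs between them. The Python sketch below checks that condition for a given assignment. It is an illustration of the criterion, not the MEAT assignment algorithm, and the state names, codes, and function name are hypothetical.

```python
def critical_race_free(codes, transitions_by_column):
    """codes: state -> tuple of bits; each column is a list of (source, destination)."""
    width = len(next(iter(codes.values())))
    for column in transitions_by_column:
        for i, (a, b) in enumerate(column):
            for c, d in column[i + 1:]:
                if b == d:
                    continue  # same destination, nothing to separate
                separated = any(
                    codes[a][k] == codes[b][k]
                    and codes[c][k] == codes[d][k]
                    and codes[a][k] != codes[c][k]
                    for k in range(width)
                )
                if not separated:
                    return False
    return True


column = [("A", "B"), ("C", "D")]
good = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (1, 0)}
bad = {"A": (0, 0), "B": (1, 1), "C": (0, 1), "D": (1, 0)}
print(critical_race_free(good, [column]), critical_race_free(bad, [column]))
```

In the `bad` assignment the transition from A to B changes both bits and can pass through the codes of C or D, so the outcome would depend on which variable switches first.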
After state codes are assigned, the next synthesis
stage computes a canonical sum of products (SOP)
boolean expression for each output and state variable.
A modified Quine-McCluskey minimization algorithm
is used. The resulting expression includes all essential
prime implicants, and possibly other prime implicants
and additional terms necessary to produce a
covering free of logic hazards. It may be possible for
each output or state variable to be specified using several
alternate minimal equations. The large number
of don't care entries typically present in the flow table
increases the likelihood that more than one minimal
expression will be found. Each equation is given
a heuristic "weight" that indicates the expected difficulty
of implementation and speed of operation using
complex CMOS gates. When multiple state assignments
have been produced in the previous step, the
total weight of each unique SOP equation is used to
choose between various instantiations.
The minimized equations produced in the previous 
step are then used to automatically generate tran­
sistor net lists, suitable for simulation, representing 
complex CMOS gates. A graphical schematic dia­
gram is also produced to help guide the layout pro­
cess. The complementary nature of CMOS n-type and 
p-type devices is exploited to generate a single, com­
plex, static gate through simple function preserving 
transformations. These transformations can increase 
performance while reducing the area and device count. 
As a sum-of-products equation is folded into a single 
complex gate, the number of logic levels required to  
generate the output can be reduced. If the function
is complex, it can easily be broken up into a tree of
complex gates with improved overall performance[20].
Typical state machines in the Post Office have an
input to output delay of 2 to 5 inverter delays.
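The complementary structure that makes a single static gate possible can be illustrated with a small model. In the Python sketch below, a sum-of-products equation maps to an n-type pull-down network of parallel branches, each a series chain, and to the dual p-type pull-up network. The function names are hypothetical, and the sketch assumes one transistor pair per literal, complemented literals available as signals, and an output inverter to restore the true polarity. It does not reproduce MEAT's transformations or the gate in Figure 2. The equation used is the Y0 expression from the MEAT session in Section 4.

```python
from itertools import product

# Y0 = BEGIN-SEND + Y0*ACK-SEND" + Y0*DELIVER as (signal, required level) literals.
Y0_SOP = [
    [("BEGIN-SEND", 1)],
    [("Y0", 1), ("ACK-SEND", 0)],
    [("Y0", 1), ("DELIVER", 1)],
]


def pull_down_conducts(sop, values):
    # Parallel branches of series n-type devices: a branch conducts when
    # every literal of its product term is true.
    return any(all(values[name] == level for name, level in term) for term in sop)


def pull_up_conducts(sop, values):
    # Dual p-type network: series groups of parallel devices; a group
    # conducts when at least one literal of its product term is false.
    return all(any(values[name] != level for name, level in term) for term in sop)


names = sorted({name for term in Y0_SOP for name, _ in term})
for bits in product((0, 1), repeat=len(names)):
    values = dict(zip(names, bits))
    # Exactly one network conducts, so the gate output is never floating
    # and never shorted for any static input combination.
    assert pull_down_conducts(Y0_SOP, values) != pull_up_conducts(Y0_SOP, values)

transistors = 2 * sum(len(term) for term in Y0_SOP)
print(transistors)
```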
4 Design Example
In order to illustrate the synthesis process from the
designer's point of view we will use a Post Office state
machine called sbuf-send-ctl as a design example. The
state machine is specified in Figure 1. We have made a
pool of Post Office AFSMs available to other
researchers, and implementations of this state machine
synthesized by other methods can be found in [18,17].
The specification of sbuf-send-ctl from Figure 1 is
textually entered for MEAT by describing the AFSM
name, input variables, output variables, and then each
state transition in 2 text lines. The first line
describes the current state and the input burst, while
the next line specifies the destination state and the
output burst. In the text, A corresponds to ↑A
and A" is equivalent to ↓A from the graphical
version of the state machine. The textual sbuf-send-ctl
specification is:
:fsm sbuf-send-ctl
:in (Deliver Begin-Send Ack-Send)
:out (Latch-Addr IdleBAR Send-Pkt)
state 0 (Deliver)
      1 (IdleBAR * Latch-Addr)
state 1 (Deliver")
      2 ()
state 2 (Begin-Send)
      3 (Latch-Addr")
state 3 (Begin-Send")
      4 (Send-Pkt)
state 4 (Ack-Send)
      5 (Send-Pkt")
state 5 (Ack-Send")
      0 (IdleBAR")
state 4 (Deliver)
      6 ()
state 6 (Deliver" * Ack-Send)
      7 (Send-Pkt" * Latch-Addr)
state 7 (Ack-Send")
      2 ()
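The two-line-per-arc format is simple enough to read with a few lines of code. The Python sketch below is a hypothetical reader, not part of MEAT: the function names are assumptions, and it assumes a trailing `"` marks a falling transition, items in a burst are separated by `*`, and an empty burst is written `()`. Only the first three arcs are included to keep the example short.

```python
import re

FALL_MARK = '"'  # assumption: trailing mark for a falling transition

SPEC = '''
:fsm sbuf-send-ctl
:in (Deliver Begin-Send Ack-Send)
:out (Latch-Addr IdleBAR Send-Pkt)
state 0 (Deliver)
      1 (IdleBAR * Latch-Addr)
state 1 (Deliver")
      2 ()
state 2 (Begin-Send)
      3 (Latch-Addr")
'''


def parse_burst(body):
    """'Deliver" * Ack-Send' -> {'Deliver': 0, 'Ack-Send': 1}."""
    burst = {}
    for item in body.split("*"):
        item = item.strip()
        if item:
            burst[item.rstrip(FALL_MARK)] = 0 if item.endswith(FALL_MARK) else 1
    return burst


def parse_meat(text):
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    spec = {"arcs": []}
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(":"):
            key, _, rest = line[1:].partition(" ")
            spec[key] = rest.strip("() ").split() if rest.startswith("(") else rest
            i += 1
        else:
            # An arc takes two lines: source with input burst, then
            # destination with output burst.
            src, in_burst = re.match(r"state\s+(\d+)\s+\((.*)\)", line).groups()
            dst, out_burst = re.match(r"(\d+)\s+\((.*)\)", lines[i + 1]).groups()
            spec["arcs"].append(
                (int(src), parse_burst(in_burst), int(dst), parse_burst(out_burst))
            )
            i += 2
    return spec


spec = parse_meat(SPEC)
for arc in spec["arcs"]:
    print(arc)
```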
The following is a transcript from a MEAT session.
The specification resulted in a single implementation
with two state variables.
> (meat "sbuf-send-ctl.data")
Max Compatibles: ((0 5) (1 2 7) (3 4) (6))
Enter State set: '((0 5) (1 2 7) (3 4) (6))
SOP for "Y1":
  18: DELIVER + Y1*BEGIN-SEND"
SOP for "Y0":
  28: BEGIN-SEND + Y0*ACK-SEND" + Y0*DELIVER
SOP for LATCH-ADDR:
  12: Y1*Y0"
SOP for IDLEBAR:
  30: ACK-SEND + BEGIN-SEND + Y0 + Y1
SOP for SEND-PKT:
  12: Y0*BEGIN-SEND"
HEURISTIC TOTAL FOR THIS ASSIGNMENT: 100
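As a sanity check, the equations can be exercised against the main cycle of the specification with a zero-delay functional model. The Python sketch below is an illustration only: it assumes `*` is AND, `+` is OR, a trailing `"` is complement, and that the second term of Y1 and the SEND-PKT term are products. It says nothing about hazards, which are the verifier's job. The function names are hypothetical.

```python
def settle(inputs, state):
    """Iterate the next-state equations until Y1 and Y0 stop changing."""
    d, b, a = inputs["DELIVER"], inputs["BEGIN-SEND"], inputs["ACK-SEND"]
    y1, y0 = state
    while True:
        nxt = (d | (y1 & (1 - b)), b | (y0 & (1 - a)) | (y0 & d))
        if nxt == (y1, y0):
            return nxt
        y1, y0 = nxt


def outputs(inputs, state):
    """Return (LATCH-ADDR, IDLEBAR, SEND-PKT) for the settled state."""
    b, a = inputs["BEGIN-SEND"], inputs["ACK-SEND"]
    y1, y0 = state
    return (y1 & (1 - y0), a | b | y0 | y1, y0 & (1 - b))


inputs = {"DELIVER": 0, "BEGIN-SEND": 0, "ACK-SEND": 0}
state = (0, 0)
trace = []
# Input bursts along states 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 0.
for name, value in [("DELIVER", 1), ("DELIVER", 0), ("BEGIN-SEND", 1),
                    ("BEGIN-SEND", 0), ("ACK-SEND", 1), ("ACK-SEND", 0)]:
    inputs[name] = value
    state = settle(inputs, state)
    trace.append(outputs(inputs, state))
print(trace)
```

Under these assumptions each step of the trace matches the output burst on the corresponding arc: IdleBAR and Latch-Addr rise, nothing changes, Latch-Addr falls, Send-Pkt rises, Send-Pkt falls, and IdleBAR falls.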
The implementation is then verified for hazard-free
operation by the verifier. The verifier reads the
specification and implementation. For this example, the
state variables and outputs generated by MEAT are
implemented as two-level AND/OR logic. Each signal
is generated independently of the others. Only
direct inputs are shared, so the same inverted signal
in different output logic blocks will use separate
inverters. Separate inverters will result in verification
errors in the burst-mode speed-independent analysis.
In this example, the begin-send signal is shared by Y1
and send-pkt. The two inverters are merged and the
output is forked to both logic blocks. This implementation
is then verified. The verifier points out a d-trio
hazard[22] which is removed by adding an inverter to
change the sequencing of begin-send into the Y0 logic.
A MEAT transcript of these verification steps follows:
> (verifier-read-fsm "sbuf-send-ctl.data")
Max Compatibles: ((0 5) (1 2 7) (3 4) (6))
Enter State set: '((0 5) (1 2 7) (3 4) (6))
> (setq *impl* (merge-gates '(1 11) *impl*))
> (verify-module *impl* *spec*)
10 20 30 40 50
Error: Implementation produces illegal output.
> (setq *impl* (connect-inverter 10 6 *impl*))
> (verify-module *impl* *spec*)
10 20 30 40 50 60 70 79 states.
T
Figure 2: Complex CMOS Gate for sbuf-send-ctl Y0
The canonical SOP equations generated by MEAT
are then transformed into complex gates for implementation.
The CMOS circuit for Y0 is shown in Figure 2.
The complex gates are then manually implemented
using the Electric layout editor. The physical layout
is then simulated with COSMOS to check for layout
errors. Some minor modifications to COSMOS were
required in order to simulate the entire chip.
5 In Retrospect
When we started the Post Office effort we had
all been actively designing reasonably complex
asynchronous hardware systems for at least 8 years, and
in one case since the early 1970's. With the exception
of the ISM chip[6], these systems, including full
scale computers[7], were board level designs rather
than chips. We had also designed a number of complex
synchronous widgets. The experience imposed
a common belief that while it was undeniably a
bit harder to design and implement the AFSMs correctly,
the inherent modularity of asynchronous subsystems
was a tremendous advantage at the system
level. What we did not appreciate was the fact that
all of our previous asynchronous designs were
proof-of-concept prototypes whose goal was to demonstrate
functional feasibility rather than performance. The
performance oriented Post Office project caused us to
reexamine much of what we considered to be standard
asynchronous design practice. We also failed to appreciate
some of the problems imposed by the improvement of
IC technology. Some of these problems were coupled
with a desire to use the scalable design rules and the
convenience of fabrication through MOSIS, and the
eventual reality that the project would span a period
of approximately 7 years (a research politics side effect
in our previous company). The net result was a feeling
that the only new challenge would be the architecture
of the Post Office. We were wrong!
The SIC AFSM methodology soon proved both to be
a performance bottleneck and to consume unreasonable
amounts of silicon real estate. The result
was the burst-mode, localized timing assumption, and
complex gate approach which has proven to be a
significant improvement. The next problem was that
too much of our limited manpower budget¹ was being
spent in the inherently error prone manual synthesis
of AFSMs. MEAT started as a quick and dirty hack
which would correctly synthesize the AFSMs. Since
we were not experienced CAD people, we grossly
underestimated the need to pay attention to algorithmic
complexity. On its first production test, MEAT was
given an AFSM to synthesize that had already been
implemented manually and the test chip was already
operational. It was a small design with 18 states, 10
input variables, and 6 outputs. After a weekend and
10's of thousands of garbage collections, we stopped it
to find that a few percent of the job was done. After
sleepless weeks, we finally had something acceptable.
We then found that our SIC based logic minimization
methods were not valid for burst-mode operation.
We then corrected this oversight². The MEAT
capability gave us the performance leverage that we
felt we needed. By using locally verifiable timing
assumptions to increase the performance within well
contained modules but by requiring modules to interact
with each other in a speed-independent fashion, a
reasonable design balance exists between performance
and modularity. The use of bundled datapath protocols
between modules reduces the wiring area budget
but precludes DI behavior.
¹ Four people did all of the hardware design and
implementation for the entire Mayfly system.
AFSM module speeds and area costs are excellent.
In order to illustrate the savings we compare a state
machine called MP-Forward-Pkt in our design style
with an equivalent implementation that compiles the
specification to library modules. In order to factor out
the additional benefits of the complex gate approach
we compare the implementation of these two variants
of the circuit using a straightforward AND/OR
implementation. The MEAT specification is:
in (Ack-Out Ack-PB Req)
out (Alloc-Outbound RTS Alloc-PB Ack)
init-out (Alloc-Outbound)
state 0 (Ack-Out)
      1 (Alloc-Outbound" * RTS)
state 1 (Req * Ack-Out")
      2 (Alloc-PB * RTS")
state 2 (Ack-PB)
      3 (Ack * Alloc-PB")
state 3 (Ack-PB" * Req")
      0 (Ack" * Alloc-Outbound)
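The implementation in logic gates is generated by
MEAT as:
[Equation block: only the right-hand-side terms below survive; the left-hand sides and complement bars are missing.]
Ack-Out + Y1 x Req
Y1 x Ack-PB x Req
Y1
Y0 x Ack-Out x Req
Y0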
The circuit verifies under our burst-mode and
fundamental mode timing assumptions but will fail the
test for SI behavior. Compiling the AFSM to a
speed-independent circuit containing C-elements, Merge
elements, and Toggle elements (indicated by triangular
elements) produces the circuit in Figure 3.
The MEAT version requires 6 gates and 5 inverters
² Thanks to Steve Nowick for his assistance in this discovery
and its eventual solution.
Figure 3: Speed-independent MP-Forward-Pkt Implementation
The worst case state response time of the MEAT circuit
is 2 gate delays plus an inverter delay as opposed
to the SI version which is 6 gate delays plus 3 inverter
delays. This 3:1 speed improvement ratio changes to
2.33:1 for the average case. The difference typically
increases with AFSM complexity and is substantial for
a complex subsystem such as the Post Office.
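The worst-case ratio can be written out explicitly. As a sketch, assume every gate contributes the same delay and every inverter the same delay; the symbols $t_g$, $t_i$, $T_{\mathrm{SI}}$ and $T_{\mathrm{MEAT}}$ are introduced here for illustration and do not appear in the original analysis.

```latex
\frac{T_{\mathrm{SI}}}{T_{\mathrm{MEAT}}}
  = \frac{6\,t_g + 3\,t_i}{2\,t_g + t_i}
  = \frac{3\,(2\,t_g + t_i)}{2\,t_g + t_i}
  = 3
```

Under that assumption the 3:1 figure holds whatever the relative speed of gates and inverters, because the SI path contains exactly three times as many of each.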
The Post Office datapath components are all controlled
by AFSMs. Most of these components (subtractors,
counters, comparators, RAM, latches, etc.)
are equivalent to synchronous designs. These circuits
are typically smaller and faster than asynchronous
datapath modules since they do not include circuitry
for generating completion signals. The datapath modules
are controlled by pulses generated as outputs by
their AFSM controllers. Although these signals are
not local to an AFSM, their extent is bounded by
the AFSM and datapath component pair. At times,
worst-case datapath delays were synthesized to handshake
with the AFSM. Certain datapath components
where there can be a wide variance in delays are
designed to sense completion and generate acknowledgments
(such as the RAM cells). The extra logic to
generate completion signals from each slave operation
usually will result in more logic but if carefully
designed should not reduce performance.
The decision to use synchronous clocked datapath
logic has not been particularly difficult nor error prone
in this large circuit. However, clocked dynamic circuits
potentially introduce additional failure modes.
One design flaw in an original version was discovered
only after it had been integrated into the completed
circuit. A dynamic counter was supposed to be reset
in the idle state. When this module was fabricated
and tested individually the circuit was not left idle for
large periods of time so the flaw was not detected. In
the full Post Office the charge on the internal nodes
of these counters dissipated during extended periods
where the Post Office was not needed by the rest of
the system. The result was lost state in the counter
and a functional failure.
While bundled data protocols save area, they cannot
be routed randomly between components. The
worst case delay must be analyzed to assure that the
data has arrived before it is utilized. In certain cells,
like the Post Office RAMs, arrival of the data can be
sensed by the discharge of a precharged line on the
slowest line. However, bundled data being driven directly
from one source to the next, such as between the
Post Office chips, relies on delay path analysis. If hardwired
delays are used, and they are not sufficiently
long, there is no way to repair the circuit without
another fabrication cycle. There's no such thing as
turning down the clock in asynchronous systems.
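The delay path analysis amounts to an inequality between the slowest data bit and the fastest possible arrival of the accompanying request. The Python sketch below shows one way such a check might be phrased. The function name, the symmetric tolerance `margin`, the optional setup time, and all delay values are hypothetical and do not come from the Post Office design.

```python
def bundled_data_ok(data_delays_ns, request_delay_ns, setup_ns=0.0, margin=0.2):
    """True if the request still trails the slowest data bit at the corners.

    The data path is assumed slow by `margin` and the request path fast by
    `margin`, which is the worst combination for a bundled-data transfer.
    """
    worst_data = max(data_delays_ns) * (1 + margin)
    best_request = request_delay_ns * (1 - margin)
    return best_request >= worst_data + setup_ns


data_bits = [1.0, 1.4, 1.2]  # hypothetical per-bit wire delays in ns
print(bundled_data_ok(data_bits, request_delay_ns=2.5))
print(bundled_data_ok(data_bits, request_delay_ns=2.0))
```

With a hardwired delay the second case cannot be repaired after fabrication, which is why the margin has to be chosen conservatively before layout is frozen.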
Bundled data can efficiently be used for slave data
components. The routing logic in the Post Office is a
successful example of pipelined control with bundled
data. Bundled data transfers which are latched (such
as between Post Office chips and internal buffers) are
also reasonable applications. However, a bundled protocol
should NEVER be used for encoding address
selection and control signals on a bus. Unfortunately
we had to learn this the hard way. Buses are prone to
be highly capacitive and as such are slow, inefficient,
and noisy. A glitch on a bused control line can easily
cause an AFSM to respond incorrectly. Setup time for
the bundled control signals before the enabling signal
arrives becomes very critical.
While we continually tried to focus on performance, we found
that performance is an elusive target. Simply counting
transistor delays is a false metric. A larger design with
many more devices can result in a faster circuit if the gains
and capacitances at each stage are balanced[20]. As device
size shrinks and doping increases, inter-node capacitance and
wire lengths become increasingly critical. Fast circuits can
only be achieved when the output to input load ratio is small
and the device gain is high. Point-to-point communication is
the best way to achieve speed. Asynchronous methods lend
themselves well to concurrent, pipelined architectures if
designed properly. However, the entire design philosophy
needs to avoid inefficient shared, capacitive structures. For
high performance systems buses should probably be avoided
altogether. This implies a shift in architectural design
styles. Driving large buses causes the vast majority of the
delay in the Post Office circuitry.
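The claim that more devices can give a faster circuit follows from the
logical effort model of [20], in which a path of N equally loaded
stages has a delay of about N * F^(1/N) + N * p, where F is the total
path effort and p the parasitic delay per stage, both in units of a
basic inverter delay. The Python sketch below evaluates that formula
for an inverter chain. The load ratio of 64 and the value p = 1 are
assumptions chosen for illustration, not figures from the Post Office.

```python
def path_delay(n_stages, path_effort, parasitic_per_stage=1.0):
    """Delay (in inverter delay units) of n equal-effort stages."""
    stage_effort = path_effort ** (1.0 / n_stages)
    return n_stages * (stage_effort + parasitic_per_stage)


def best_stage_count(path_effort, max_stages=10, parasitic_per_stage=1.0):
    """Stage count in 1..max_stages that minimizes path_delay."""
    return min(
        range(1, max_stages + 1),
        key=lambda n: path_delay(n, path_effort, parasitic_per_stage),
    )


# Assumed example: drive a load 64 times the input capacitance.
for n in (1, 2, 3, 4):
    print(n, round(path_delay(n, 64.0), 2))
print("best stage count:", best_stage_count(64.0))
```

Under these assumptions a single large driver takes about 65 delay
units while a three-stage chain takes about 15, even though it uses
three times as many gates. The same arithmetic explains why a heavily
loaded bus dominates delay: its large capacitance raises F for every
driver attached to it.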
The complex gates generated in MEAT have resulted in compact,
fast circuits. However, care must be taken to ensure that the
size of these gates is small to reduce the inter-node
capacitance and increase the gain. The design of such gates
can introduce parasitics between the power rails and the
output which can result in a large body effect. As devices
continue to shrink, this can nullify the advantage of large
complex gates, or even make designs that use them slower than
a series of smaller NAND gates. One complex gate advantage is
the use of an inverter on the output, which provides low
output loading and increased gain from the complex gate and
inverter pair.
Reducing intermodule capacitance and introducing buffers
between or within AFSMs will increase both reliability and
performance. Slowly rising signals are subject to device
threshold variances, which can cause failures when
isochronous forks are used. Eliminating these slow signals
will reduce instantaneous power consumption, resulting in
reduced noise, and neutralize the problem of isochronous
forks.
Many of our performance and reliability problems in the Post
Office design were due to our blind faith attitude that the
intrinsic asynchronous modularity capability would cure many
ills. While it is true that we have removed the global
clocks, global planning is still important. Simply cobbling
circuits together in a bottom-up fashion leads to trouble,
and only works well for adjacent, communicating control
circuits. This is difficult when the layout is done manually.
Manual layout may always be necessary for certain modules
that are either heavily replicated or are on the critical
performance path. However it should not be the general
practice. Manual layout of a 300,000 transistor circuit is
certainly insane and was a prime contributor to our
capacitive woes. In any large chip floor planning is critical
to the performance of the design. The problem is that the
floor plan is always done first, and the plan can seldom be
kept intact as the modules get implemented and do not quite
conform to the plan. The problem is exacerbated due to what
we will call implementation momentum. Namely if you just
spent a hundred hours manually laying out a complex component
that just misses the floor plan, there is an undeniable
tendency to compromise the floor plan in order to avoid
another hundred hours of layout. In designs like the Post
Office containing hundreds of modules there is ample
opportunity to make many such compromises. The negative
impact that these compromises make on the final design
performance is significant, and in the Post Office case the
resultant floor plan is poor and costs us at least a factor
of 2 in total performance.
In addition, power and ground signals retain their global
nature in asynchronous circuits. While metal migration is not
as big a problem in asynchronous circuits due to inherent
duty cycle variance, care must be taken to allow sufficient
current carrying capacity to prevent noise. We unfortunately
learned the hard way that as VLSI circuits are scaled down,
the global power and ground lines should increase in width
relative to the feature size. This change wreaks havoc yet
again with the floor plan. Even with the serious floor plan
flaws, and the poor judgment made using buses for interface
communication, the Post Office can sustain transfer rates up
to 200 MBytes per second.
Our design style virtually eliminates intra-module
C-elements. C-elements can be viewed as a simple AFSM and in
our design style their behavior is convolved into the
controller designs in a direct way. The performance benefits
of C-element removal are clearly substantial. However the
Post Office does contain 54 C-elements. It is interesting to
note that NONE of them are used alone as a protocol
preserving signal rendezvous. They are ALL used in pairs. The
standard arbiter circuit is a common example of this usage.
There are two sides, only one of which is active. The shared
resource acknowledge must be passed through the active side
in a protocol preserving fashion while the other side remains
inactive. Only recently have we realized that in this role
even these C-elements could be replaced by smaller and faster
circuitry.
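For readers unfamiliar with the component, the behavior of a two-input
Muller C-element can be captured in a few lines. The Python sketch
below is a behavioral model only: the output follows the inputs when
they agree and holds its previous value when they differ. The class
and method names are illustrative and do not correspond to any MEAT
construct.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Apply inputs a and b (0 or 1) and return the new output."""
        if a == b:
            # Inputs agree: the output follows them.
            self.out = a
        # Inputs differ: the output holds its previous value.
        return self.out


c = CElement()
print(c.update(1, 0))  # one input rises: output holds
print(c.update(1, 1))  # both inputs high: output rises
print(c.update(0, 1))  # one input falls: output holds
print(c.update(0, 0))  # both inputs low: output falls
```

The held state is what makes the C-element a small AFSM in its own
right, which is why its behavior can be folded into a larger
controller rather than built as a separate gate.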
6 Conclusions
Building a large, fully self-timed circuit has resulted in
many insights, the most important of which we have attempted
to pass on. Different design styles and varying design
targets will undoubtedly provide some new insights but many
will be similar. The need for integrated synthesis and
analysis tools that compare in quality and scope with those
available to the synchronous design community is of primary
importance. This is also a moving target. Some synchronous
tools will work just fine, others will need only minor
modifications, while still others will have to be created
specifically to handle asynchronous designs. MEAT is a step
in the right direction, but many more steps are necessary.
Some form of automatic layout is necessary unless we abandon
the complex gate approach in order to take advantage of
standard cell approaches. Automatic layout is a difficult
task, but some performance will be lost in the standard cell
approach. We are investigating both options.
There are a number of performance factors that should be
included in the tool set. As a circuit is passed down through
the different stages of the tool, some information is lost.
The complexity of the algorithms and simplicity of the
circuits could be improved by preserving some of this
information. State graphs lack the formalisms required to
analyze compositions of these circuits for safety, liveness,
deadlock, and other properties. We are currently
investigating a process calculus as a means of specifying and
generating MEAT state graphs as well as proving the correct
operation and construction of circuits composed of multiple
AFSM modules.
Approximately a fifth of the Post Office control path design
was done manually, and the rest was done using MEAT. The
automated part of the design took one-fourth the amount of
design time and was virtually error free. Our design style
has proven to be a very natural transition for existing
hardware designers, primarily since it is based on
traditional finite state machine control. Our synthesis
techniques have generated compact high-performance circuits
that work, and the complexity of the synthesis algorithms has
proven to be viable for large designs.
The inherent modularity of asynchronous designs and their
composability into larger modules make self-timed design of
large systems very attractive. The fact that they can then be
incrementally improved for performance makes them even more
so. A number of open challenges remain. While we have found
that testing large asynchronous designs is relatively simple
at the board level, it is difficult when the design is a
single chip. Since our controllers do not contain latches, it
is difficult to use scan paths to improve testability.
Electron-beam testers do not help much because it is
difficult to image hand-shake signals. We view these problems
as opportunities for future research. We also hope that our
experience and even our tool efforts will be of future
benefit to others embarking on the design of large
asynchronous system components.
References
[1] H. B. Bakoglu. Circuits, Interconnections, and
Packaging for VLSI. Addison-Wesley, 1990.
[2] Erik Brunvand and Robert Sproull. Translating
Concurrent Programs into Delay-Insensitive Circuits.
In IEEE International Conference on Computer Aided
Design: Digest of Technical Papers, pages 262-265.
IEEE Computer Society Press, 1989.
[3] Steven M. Burns and Alain J. Martin. The Fusion of
Hardware Design and Verification, chapter Synthesis of
Self-Timed Circuits by Program Transformation, pages
99-116. Elsevier Science Publishers, 1988.
[4] Tam-Anh Chu. On the models for designing VLSI
asynchronous digital systems. Technical Report
MIT-LCS-TR-393, MIT, 1987.
[5] Henry Y. H. Chuang and Santanu Das. Synthesis of
multiple-input change asynchronous machines using
controlled excitation and flip-flops. IEEE Transactions
on Computers, C-22(12):1103-1109, December 1973.
[6] William S. Coates. "The Design of an Instruction
Stream Memory Subsystem". Master's thesis, University
of Calgary, December 1985.
[7] A. L. Davis. The Architecture of DDM1: A Recursively
Structured Data-Driven Machine. Technical Report
UUCS-77-113, University of Utah, Computer Science Dept,
1977.
[8] A. L. Davis. Mayfly: A General-Purpose, Scalable,
Parallel Processing Architecture. Lisp and Symbolic
Computation, 5(1/2):7-47, May 1992.
[9] David Dill. Trace Theory for Automatic Hierarchical
Verification of Speed-Independent Circuits. An ACM
Distinguished Dissertation. MIT Press, 1989.
[10] A. B. Hayes. Stored State Asynchronous Sequential
Circuits. IEEE Transactions on Computers, C-30(8),
August 1981.
[11] Lee A. Hollaar. Direct implementation of
asynchronous control units. IEEE Transactions on
Computers, C-31(12):1133-1141, December 1982.
[12] Alain Martin. Compiling Communicating Processes
into Delay-Insensitive VLSI Circuits. Distributed
Computing, 1(1):226-234, 1986.
[13] Alain Martin. The Limitations to Delay-Insensitivity
in Asynchronous Circuits. In William J. Dally, editor,
Sixth MIT Conference on Advanced Research in VLSI,
pages 263-278. MIT Press, 1990.
[14] C. Mead and L. Conway. Introduction to VLSI Systems.
McGraw-Hill, 1979. Chapter 7.
[15] Teresa Meng. Synchronization Design for Digital
Systems. Kluwer Academic, 1990.
[16] Charles E. Molnar, Ting-Pien Fang, and Frederick U.
Rosenberger. Synthesis of Delay-Insensitive Modules. In
Henry Fuchs, editor, Chapel Hill Conference on Very
Large Scale Integration, pages 67-86. Computer Science
Press, 1985.
[17] Steven M. Nowick and David L. Dill. Automatic
synthesis of locally-clocked asynchronous state
machines. In 1991 IEEE International Conference on
Computer-Aided Design. IEEE Computer Society, 1991.
[18] L. Lavagno, K. Keutzer, and A. Sangiovanni-Vincentelli.
Synthesis of Verifiably Hazard-Free Asynchronous Control
Circuits. Technical Report UCB/ERL M90/99, Univ. of
California at Berkeley, November 1990.
[19] I. E. Sutherland, R. F. Sproull, C. E. Molnar, and
E. H. Frank. Asynchronous Systems, Volume I. Technical
report, Sutherland Sproull and Associates, Palo Alto,
CA, January 1985.
[20] Ivan E. Sutherland and Robert F. Sproull. Logical
effort: Designing for speed on the back of an envelope.
In Carlo H. Sequin, editor, Proceedings of the 13th
Conference on Advanced Research in VLSI, pages 1-16.
UC Santa Cruz, March 1991.
[21] J. H. Tracey. Internal state assignments for
asynchronous sequential machines. IEEE Transactions on
Electronic Computers, EC-15:551-560, August 1966.
[22] S. H. Unger. Asynchronous sequential switching
circuits. Wiley-Interscience, 1969.
[23] C. H. (Kees) van Berkel. Handshake circuits: an
intermediary between communicating processes and VLSI.
PhD thesis, Technical University of Eindhoven, May 1992.
