An introduction to asynchronous circuit design by Davis, Al & Nowick, Steven M.





Computer Science, University of Utah, Salt Lake City, UT 84112
Computer Science, Columbia University, New York, NY 10027
September 19, 1997 
Abstract
The purpose of this monograph is to provide both an introduction to the field of asynchronous digital
circuit design and an overview of the practical state of the art in 1997. In the early days of digital
circuit design, little distinction was made between synchronous and asynchronous circuits. However,
since the 1960’s, the mainstream of the digital circuit design enterprise has been primarily concerned
with synchronous circuits. Synchronous circuits may be simply defined as circuits which are sequenced
by one or more globally distributed periodic timing signals called clocks. Asynchronous circuits are an
inherently larger class of circuits, since there are many sequencing options other than global periodic
clock signals. Asynchronous circuits have been studied in one form or another since the early 1950’s [92],
when the focus was primarily on mechanical relay circuits. A number of theoretical issues were studied
in detail by Muller and Bartky as early as 1956 [138]. Since then, the field of asynchronous circuits
has gone through a number of high-interest cycles. In recent years there has been an unprecedented
level of interest in both academic and industrial settings [81]. Much of this recent research effort has
focused more on theory than practice. Nonetheless, the advance of practical asynchronous circuit design
techniques has also attracted an unusual level of interest. The focus of this document is on the aspects of the
asynchronous circuit design discipline which either have influenced, or are likely to influence, the practical
design of asynchronous circuits. The attempt is to provide an introduction to the basic concepts which
provide the foundation for today’s design techniques and to summarize the current practice. The text
contains an extensive set of bibliographical pointers to guide the more serious student of the field.
1 Introduction
The intent of this report is to provide an introductory yet comprehensive overview of the field of
asynchronous circuit design. The focus on design implies that a number of theoretical aspects of the discipline
which do not directly affect the practical design process will be ignored. Given the size of the field and
the number of design methods, it is impossible to cover all of them in depth. On
the other hand, little could be learned if all of the methods were just mentioned superficially. The result
is that there will be enough depth in this report to introduce the basic concepts and to highlight a
few of the design styles. Other design methodologies will be covered more cursorily, at the conceptual
level. Differences and similarities between methods will be discussed. The many citations and extensive
bibliography provide ample direction for an in-depth study of any particular method.
2 Motivation and Basic Concepts
Circuit design styles can be classified into two major categories, namely synchronous and asynchronous.
It is worthwhile to note that neither is independent of the other and that many designs
have been produced using a hybrid design style which mixes aspects of both categories. Synchronous
circuits may be simply defined as circuits which are sequenced by one or more globally distributed
periodic timing signals called clocks. Asynchronous circuits are an inherently larger class of circuits,
since there are many sequencing options other than global periodic clock signals. It may be difficult to
understand the motivation for asynchronous circuit techniques when the bulk of commercial practice,
together with considerable experience, artifact, and momentum, exists for the synchronous circuit design style. For
some, the motivation to pursue the study of asynchronous circuits is based on the simple fact that they
are different. Others find that asynchronous circuits have a particular modular elegance that is amenable
to theoretical analysis. However, for those interested in the practical aspects of asynchronous circuit
design, the motivation often comes from some concern with the basic nature of synchronous circuits.
Of common concern are the cost issues associated with the global, periodic, and common clock
that is the temporal basis for synchronous circuits. The fixed clock period of synchronous circuits is
chosen as a result of worst-case timing analysis. It is not adaptive and therefore does not take advantage
of average- or even best-case computational situations. Asynchronous circuit proponents view this as
an opportunity to achieve increased performance since asynchronous methods are inherently adaptive.
Arithmetic circuits provide a good example. Arithmetic circuit performance is typically dominated by
the propagation delay of carry or borrow signals. The worst-case propagation situation rarely occurs,
yet synchronous arithmetic circuits must be clocked in a manner that accommodates this rare worst-case
condition. Some asynchronous circuit designers have made the mistake of generalizing this observation
into a view that the inherent adaptivity of asynchronous circuits implies that they are capable of achieving
higher performance in general. However, this is not necessarily the case.
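The gap between worst-case and typical carry propagation can be illustrated with a short simulation. The sketch below is not taken from the cited literature. It assumes a simple ripple-carry adder model in which delay is proportional to the longest run of stages that a single carry travels through, and it assumes uniformly random operands; the function names are invented for the illustration.

```python
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Longest run of adder stages that a single carry ripples through."""
    longest = 0
    run = 0  # length of the carry chain currently in flight
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:              # generate: a new carry starts at this stage
            run = 1
        elif (x ^ y) and run:  # propagate: the incoming carry moves on
            run += 1
        else:                  # kill, or no carry in flight to propagate
            run = 0
        longest = max(longest, run)
    return longest


def average_carry_chain(width: int, trials: int, seed: int = 0) -> float:
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials


if __name__ == "__main__":
    width = 32
    # 1 + 0xFFFFFFFF forces the carry through every stage.
    print("worst case:", longest_carry_chain(1, (1 << width) - 1, width))
    print("average   :", average_carry_chain(width, 10_000))
```

Under these assumptions the average chain for random 32-bit operands is much shorter than the 32-stage worst case, which is the gap an adaptive circuit would try to exploit. Real operand distributions are not uniform, so the result is only indicative.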
All asynchronous circuits have additional operational constraints when compared to their
synchronous counterparts. Ideally, digital signals represent binary values and therefore model 2 distinct
voltage levels. For convenience let them be called 0 and 1. These signals then have the possibility of
either remaining constant or changing as a result of a circuit action or event. When signals change,
the change may not be a monotonic transition between one voltage level and the other. Such a
non-monotonic change is often called a glitch. A circuit producing an output which may glitch is said to
contain a hazard. There are many types of hazards, which are discussed in Section 4.3. A glitch on a
clock signal of a synchronous circuit will typically cause the circuit to malfunction. Glitches on
non-clock signals do not cause a malfunction as long as the signal is stable at its new value for a certain time
before and after a clock signal transition. This glitch issue is both the advantage and the disadvantage
of synchronous circuits. It implies that non-clock signals need not be designed to be hazard-free, which
often results in smaller circuits. However, the clock must be carefully controlled, and since it must
be globally distributed, this often proves to be difficult. All forms of asynchronous circuits are concerned
with providing hazard- or glitch-free outputs under some timing model.
In order to achieve hazard-free behavior, an asynchronous circuit will often contain more gates
than a functionally equivalent synchronous circuit. Therefore, in terms of the number of basic
components, asynchronous circuits are often somewhat larger than synchronous circuits. More gates imply
more wires, and this may result in slower rather than faster circuit latencies. Furthermore, in order to
achieve their inherently adaptive nature, asynchronous circuits must explicitly generate sequence control
signals such as a request and an acknowledge signal. The request signal can be used to signal
initiation of some action, and the corresponding acknowledge signal indicates completion of that action. In
synchronous circuits much of this type of control signaling is implicit in the common clock signal. The
generation of these explicit control signals further exacerbates the complexity of asynchronous circuits,
and may lead to a further performance degradation.
The adaptive potential remains where the worst-case situation is rare and when the difference
between the worst-case and average-case latencies is significant. However, synchronous circuit designers
are also well aware of this situation and take considerable care to create a clock model and circuit
structure that can take advantage of these differences. The most notable example of this tactic is in
the finely-grained pipeline structures of modern floating point units. Yet, for very large circuits, such
as microprocessors, balancing all the timing constraints of a large computational space to minimize the
difference between the worst and average case timing models is a difficult task. The work by Mark Dean
on the STRiP processor [54] provides an interesting example. Dean showed that even a well-balanced
and well-designed processor such as the MIPS-X CPU could be sped up if the instruction set were split
into three classes, and the clock period adjusted appropriately to match the temporal needs of each class.
Dean also demonstrated that an even greater performance enhancement could be achieved due
to the tighter margins which are possible with adaptive clocking. Synchronous systems usually rely on
an externally-generated clock signal which is distributed as the common timing reference to all of the
system components. The speed at which integrated circuits operate varies with the circuit fabrication
process and with fluctuations in operating temperature and supply voltage. In order to achieve a reasonable
shield against these variables, the clock period is extended by a certain margin. In current practice,
these margins are often 100% or more in high-speed systems. Adaptive clocking cannot be generated
externally, and therefore must be provided internally to each device. The fact that the clock generator
is affected by the same process, temperature, and supply variations as the rest of the chip permits the
safety margin to be reduced significantly.
Clock distribution is becoming an increasingly costly component of large modern designs. Today’s
microprocessors contain over ten million transistors and their clock rates are around 200 to 400 MHz.
The clock period is determined by adding the worst case propagation delay, the margin and the maximum
clock skew. Clock skew is simply the maximum difference in the clock arrival time as seen by all clocked points
in the circuit. The latency of the clock pulse to the reception points is not a concern. With today’s large
VLSI circuits exceeding 20 mm per side, several nanoseconds of skew is easily possible. However, with a
5 nanosecond clock period, several nanoseconds of skew is a disaster. Clock distribution and de-skewing
methods are abundant, but they share the common characteristic of being expensive in either power or
area, and they become more so as clock speeds increase. A common method is to distribute the clock via
a balanced H-tree configuration [9] with amplifying buffers placed at the fanout points. The problem
with this approach is that as more buffers are added to a clock path, larger skew results. The designers of
the DEC Alpha CPU [193] took the opposite approach. The Alpha contains 1.68 million transistors and
is fabricated in a 0.75 micron, 3.3 volt CMOS process. Even with three layers of metal, the chip is 16.8
mm by 13.9 mm. In order to keep clock skew to within 300 picoseconds, the Alpha’s designers localized
the clock buffering to minimize process induced variations and therefore the skew induced by the buffers.
Details of the method can be found in [60], but the result is a clock driver circuit that occupies about
10% of the chip area and consumes over 40% of the 30 watts of power dissipated by the chip. 19 mm²
of area and over 12 watts of power is a very high price to pay for keeping the skew under control. Power
concerns in particular will limit the use of this technique as circuit speeds and transistor counts increase.
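The clock period budget described above is simple arithmetic, sketched here in integer picoseconds. The 5 ns period comes from the text; the 2 ns skew is one reading of "several nanoseconds", and the 500 ps margin is a made-up value chosen only for illustration. The function names are likewise hypothetical.

```python
def clock_period_ps(worst_case_delay_ps: int, margin_ps: int, max_skew_ps: int) -> int:
    """Clock period = worst-case propagation delay + margin + maximum skew."""
    return worst_case_delay_ps + margin_ps + max_skew_ps


def logic_budget_ps(period_ps: int, margin_ps: int, max_skew_ps: int) -> int:
    """Time left for useful logic once margin and skew are subtracted."""
    return period_ps - margin_ps - max_skew_ps


if __name__ == "__main__":
    period = 5_000   # 5 ns clock period
    skew = 2_000     # assumed 2 ns of skew
    margin = 500     # hypothetical safety margin
    budget = logic_budget_ps(period, margin, skew)
    print(f"{budget} ps of each {period} ps cycle is available for logic")
```

With these illustrative numbers, skew alone takes 40% of the cycle, which is why the text calls several nanoseconds of skew a disaster at this clock rate.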
Another common modern synchronous technique for controlling clock skew is to use phase-locked
loops (PLL’s). A PLL is essentially an analog circuit that can be used to dynamically adjust the phase of
a signal to match it with the phase of another signal. For clock deskewing purposes the local clock is kept
in phase with an external reference clock. There are many PLL design variants. We describe a simple
voltage controlled version [9], even though in higher speed circuits a current controlled methodology is
more common. A phase detection circuit is used to decide whether the internal clock is behind or ahead
of the external clock and produces a ChargeAdd or ChargeRemove signal accordingly. These signals
are smoothed by a low pass filter to provide a signal that controls a voltage controlled delay line or a
voltage controlled oscillator (VCO). The VCO sits between the clock generator and buffering logic. It is
important to note that this PLL based technique is only capable of eliminating the components of clock
skew that are the result of the clock generator and clock buffering circuits. PLL’s cannot remove the
skew components that are caused by the clock distribution tree. Using multiple PLL’s at various points
in the distribution tree does not help, since the area penalty is potentially severe and minor differences
in the individual PLL operations will cause an increase in the amount of jitter in the resulting clock
system. More importantly, this technique exacerbates the basic problem, since now the problem shifts to
distribution of the reference clock. The result is that PLL’s are an important and useful clock distribution
technique which can be used to solve part of the deskewing problem. However, this approach will be
inadequate in the long run, since as circuit feature sizes shrink and die sizes grow, a larger fraction of
signal delay is due to wire delays and not the logic delays in the clock driver and buffer circuits. New
approaches will be necessary or performance will be adversely affected.
A similar skew problem exists for circuit boards as well as chips. The literature contains an
abundance of methods for de-skewing clocks [2, 36] on a board, but most of them are also costly in
either area or complexity, and some will probably not be robust enough for use in commercial circuits.
An interesting example is the Monarch [168] processor chip, which used active signal selection on each
input pad. In this instance, a five slot delay line was used to skew signals to match the clock skew. The
appropriate tap in the delay line was selected based on analyzing the clock vs. the incoming signal. While
the technique did work, its cost and complexity are probably more instructive in a pathological sense.
The bottom line is that clock management is a difficult problem and solving it in today’s high-speed
complex designs is costly. Asynchronous circuit proponents advocate a simple solution, namely throw
away the whole concept of a global clock. This is not a free solution, since global absolute timing must
be replaced with the relative and sequential mechanisms which lie at the heart of asynchronous circuit
signaling protocols. Chuck Seitz wrote an excellent introduction to this general topic in his chapter on
System Timing in the classic VLSI book by Mead and Conway [129]. The next section of this treatise
presents some of the more commonly used protocols and terminology.
Another common motivation for pursuing the asynchronous circuit option is the quest for
low-power circuit operation. The consumer market’s hunger for powerful yet portable digital systems which
run on lightweight battery packs is growing at a rapid rate. Hence there is a strong commercial interest
in low-power design methods which extend the operational life of a particular battery technology. CMOS
circuits have a particular appeal, since they consume negligible power when they are idle. This would not
be true, however, if the clock of a synchronous circuit were to continue running. Therefore, low-power
synchronous circuits usually involve some method of shutting down the clock to subsystems which are
not needed at a particular time. Clocks must still be continuously supplied to the subcomponent that must
monitor the environment for the next call to action. The result is that power must be consumed even
during idle periods. Furthermore, these clock switches exacerbate the clock skew issues which limit
performance and also reduce the circuit’s ability to provide maximum performance when it is needed.
Asynchronous circuits have the advantage that they go into idle mode for free since, by nature, when
there is nothing to do there are no transitions on any wire in the circuit. Another advantage is that even
for an active system, only the subsystems that are required for the computation at hand will dissipate
any power. Researchers such as Kees van Berkel [211] and Steve Furber [70] are pursuing asynchronous
circuit designs in an attempt to exploit this feature.
The final motivation of asynchronous design is the inherent ease of composing asynchronous
subsystems into larger asynchronous systems. While there is still room for doubt about whether
asynchronous circuits can generally achieve their potential advantages in terms of higher performance or
lower power operation than synchronous circuits, there is little doubt that asynchronous circuits do have
a definite advantage with respect to composability. Asynchronous circuits are functional modules in that
they contain both their timing and data requirements explicitly in their interfaces. In a sense, they “keep
time for themselves”, hence the term self-timed circuits. Synchronous circuit modules contain only data
requirements in their interfaces and share the clock. However, important temporal issues, such as when
data must be valid to avoid set-up and hold time violations between modules, are implicit at best. In
contrast, composing asynchronous modules is almost trivial. If the interfaces match and observe the
same signaling protocol then they can simply be connected. The same cannot be said for synchronous
circuits with their global timing requirements and clock-based sequencing. The result is that a more
detailed knowledge of module internals is required before synchronous subsystems can be connected.
The problem of combining synchronous systems is exacerbated when each module has a separate
clock, each running at a different frequency. The effects of this problem are numerous and
probabilistically involve some variant of metastability failure [34]. It is commonly accepted, although not definitively
proven to the authors’ knowledge, that it is impossible to build a perfect synchronizer. Many of the
subsystems in today’s computers run on clocks which are not synchronized with the CPU. A good example
is the I/O subsystem. Often these subsystems are confusingly called asynchronous or are considered to
have an asynchronous interface. In reality, they are synchronous systems which use some sort of
synchronizing scheme in their interface. Synchronizers, while imperfect, effectively trade increased latency
for more reliable synchronization. The reliability is adjusted to meet the MTBF (Mean Time Between
Failures) requirements of the system, and the resulting decreased performance is simply viewed as the
price that must be paid for the required reliability.
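The trade of latency for reliability can be made concrete with the commonly used first-order metastability model, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for the synchronizer to resolve. This model is a standard textbook approximation and is not taken from this report; the function name and all parameter values below are hypothetical.

```python
import math


def synchronizer_mtbf_s(resolve_time_s: float, tau_s: float, window_s: float,
                        f_clock_hz: float, f_data_hz: float) -> float:
    """First-order estimate of the mean time between synchronizer failures."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clock_hz * f_data_hz)


if __name__ == "__main__":
    # Hypothetical device and system parameters.
    tau = 100e-12     # metastability resolution time constant
    window = 50e-12   # metastability window
    f_clk = 200e6     # sampling clock frequency
    f_data = 10e6     # rate of asynchronous input transitions
    for cycles in (1, 2):  # clock periods allowed for the flip-flop to settle
        t_r = cycles / f_clk
        mtbf = synchronizer_mtbf_s(t_r, tau, window, f_clk, f_data)
        print(f"{cycles} cycle(s) of latency: MTBF of about {mtbf:.3g} s")
```

In this model each additional tau of waiting multiplies the MTBF by e, so a modest increase in latency buys a very large gain in reliability, but the failure probability never reaches zero.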
The ease of composing asynchronous subsystems is a clear advantage. It allows components from
previous designs to be reused, it allows modification of slower components which may result in
incremental performance improvements without impacting the overall design, and it facilitates behavioral
analysis by formal methods. However, asynchronous circuits are not presently the mainstay of
commercial practice. The definite advantage of composability is not a strong enough factor to counter the
significant synchronous circuit momentum, and the promises of improved performance and decreased
power consumption remain to be generally realized. There is also a clear gap in the quality of the design
infrastructure, e.g. CAD tools, libraries, etc. In addition, the level of synchronous design experience
dwarfs the small experience base in asynchronous circuit design. The subsequent sections on current
research are indications that this gap is narrowing. The asynchronous circuit discipline is becoming more
viable, even though much work remains to be done before asynchronous circuits will be competitive in the
commercial sector.
3 Controlling Asynchronous Circuits
3.1 Signaling Protocols
Most asynchronous circuit signaling schemes are based on some sort of protocol involving requests, which
are used to initiate an action, and corresponding acknowledgments, used to signal completion of that
action. These control signals provide all of the necessary sequence controls for computational events in
the system. Strictly speaking, these handshake signals are independent of any global system time and are
only concerned with the local relative temporal relationships between two subsystems sharing a common
interface. The resulting computational model is very much like the dataflow model [49, 1], where the
arrival of the necessary operand data triggers an operation. Similarly, there is a concept of a sender of
information and a corresponding receiver. From the circuit perspective, and ignoring data transmission
issues for now, these request and acknowledge control signals typically pass between two modules of an
asynchronous system. For example, let there be two modules, a sender A and a receiver B. A request
is sent from A to B to indicate that A is requesting some action by B. When B is either done with
the action or has stored the request, it acknowledges the request by asserting the acknowledge signal,
which is sent from B to A. Most asynchronous signaling protocols require a strict alternation of request
and acknowledge events. These ideas can be extended to interfaces shared by more than 2 subsystems,
although this is not the common case due to performance and circuit complexity issues.




Figure 1: 4-cycle Asynchronous Signaling Protocol (Request and Acknowledge waveforms, annotated: start event i, event i done, ready for next event, start event i+1, event i+1 done)
Figure 2: 2-cycle Asynchronous Signaling Protocol (Request and Acknowledge waveforms, annotated: start event i, event i done, start event i+1, event i+1 done)
Two choices have been so pervasive that they will be described here to illustrate the concept. One
common choice is the 4-cycle protocol shown in Figure 1. Other names for this protocol are also in
common use: RZ (return to zero), 4-phase, and level-signaling. In Figure 1, the waveforms appear
periodic for convenience but they do not need to be so in practice. The curved arrows indicate the
required before/after sequence of events. There is no implicit assumption about the delay between
successive events. Note that in this protocol there are typically 4 transitions (2 on the request and 2
on the acknowledge) required to complete a particular event transaction. Proponents of this scheme
argue that 4-cycle circuits are typically smaller than their 2-cycle counterparts, and that the time
required for the falling transitions on the request and acknowledge lines does not usually cause a performance
degradation. This is because falling transitions can happen in parallel with other circuit operations, or
can be used to control the transmission of the answer data back to the requester.
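The 4-cycle sequence described above can be written out as a trace of (request, acknowledge) wire states. This is a minimal sketch with invented function names; it models only the ordering of events and says nothing about the delays between them.

```python
def four_cycle_transaction():
    """(request, acknowledge) wire states for one 4-cycle transaction."""
    return [
        (0, 0),  # idle
        (1, 0),  # sender raises request: start event
        (1, 1),  # receiver raises acknowledge: event done
        (0, 1),  # sender lowers request (return to zero)
        (0, 0),  # receiver lowers acknowledge: ready for next event
    ]


def count_transitions(trace):
    """Number of single-wire transitions between consecutive states."""
    return sum((r0 != r1) + (a0 != a1)
               for (r0, a0), (r1, a1) in zip(trace, trace[1:]))


if __name__ == "__main__":
    trace = four_cycle_transaction()
    print(trace, "->", count_transitions(trace), "transitions")
```

Each transaction costs four wire transitions and leaves both wires low, so the link always starts the next transaction from the same state.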
The other common choice is 2-cycle signaling shown in Figure 2, also called transition, 2-phase,
or NRZ (non-return to zero) signaling. In this case the waveforms are the same as for 4-cycle signaling
with the exception that every transition on the request wire, both falling and rising, indicates a new
request. The same is true for transitions on the acknowledge wire. 2-cycle signaling is particularly useful
for high-speed micropipelines, as pointed out by Ivan Sutherland in his Turing Award paper [200].
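A corresponding sketch for 2-cycle signaling treats every transition, rising or falling, as an event. As before, the function name is invented and only the ordering of events is modeled.

```python
def two_cycle_trace(transactions: int):
    """(request, acknowledge) wire states for a run of 2-cycle transactions."""
    req = ack = 0
    trace = [(req, ack)]
    for _ in range(transactions):
        req ^= 1                  # any request transition starts an event
        trace.append((req, ack))
        ack ^= 1                  # any acknowledge transition completes it
        trace.append((req, ack))
    return trace


if __name__ == "__main__":
    for state in two_cycle_trace(2):
        print(state)
```

Each transaction costs two wire transitions, and the wires rest high after an odd number of transactions and low after an even number. A receiver therefore has to respond to changes rather than levels, which hints at why 2-cycle interface logic tends to be more complex.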
2-cycle proponents argue that 2-cycle signaling is better from both a power and a performance
standpoint, since every transition represents a meaningful event and, because the handshake link is never
reset, no transitions or power are consumed in returning to zero. While in principle this is true,
it is also the case that most 2-cycle interface implementations require more logic than their 4-cycle
equivalents. The increased logic complexity may consume more power than is saved by the reduced
control transitions. This was shown to be the case in the two versions of the low-power asynchronous
ARM processor produced by researchers at the University of Manchester. ARM1 [70] was a 2-cycle
design. The lack of a distinct low-power advantage in ARM1 led to an improved ARM2 [73] 4-cycle
design which demonstrated both a performance and low-power improvement over the ARM1. Some of
this improvement can certainly be attributed to increased design expertise, but the experience provides
compelling evidence that power and performance arguments cannot be based solely on counting the
number of control transitions per event.
4-cycle proponents argue that the falling (return to zero) transitions are often easily hidden by
overlapping them with other actions in the circuit. Another approach, called early acknowledge, is to
design 4-cycle circuits to indicate event completion with the reset transition on the acknowledge wire
rather than by acknowledge assertion. Since the sender can then deassert the request, the implication
is that the receiver must latch the incoming transaction prior to completing the requested action. The
result is an asynchronous pipeline structure similar to synchronous pipeline circuits. The goal of all
pipeline circuits is increased throughput. Still, most designers would agree that
2- and 4-cycle protocols each have advantages over the other in particular circuits. Certain design styles [48]
and designs [199] show that 2-cycle and 4-cycle protocols can coexist in the same system, albeit on different
interfaces. Numerous 2-cycle to 4-cycle (and vice versa) conversion circuits exist and can be used for
interfaces where performance is not critical, since the circuits do add some latency to the interface
operation.
Other interface protocols, based on similar sequencing rules, exist for 3 or more module interfaces.
A particularly common design requirement is to conjoin 2 or more requests to provide a single outgoing
request, or conversely to provide a conjunction of acknowledge signals. A commonly used asynchronous
element is the C-element, which can be viewed as a protocol-preserving conjunctive gate. Note that this
element is equally useful for both 2- and 4-cycle protocols. The description here will consider a 2-input
C-element for simplicity. The common logic symbol and a positive-logic, gate-level implementation are
shown in Figure 3. From an initial state where inputs x and y are both low, the output z is low. When
both x and y go high, then the output z will go high. Similarly when both inputs go low, then the
output will go low. While the inputs differ, the output holds its previous value. The C-element effectively
merges two requests into a single request and permits 3
subsystems to communicate in a protocol-preserving 2- or 4-cycle manner. Many consider C-elements to
be as fundamental as a NAND gate in asynchronous circuits, and they will appear repeatedly in many
of the basic circuits that will be subsequently presented here. The feedback signal from the output of
the C-element to two of the 2-input NAND gates indicates that the C-element is itself a form of latch. It
therefore acts as a synchronization point which is necessary for protocol preservation. However, excess
synchronization reduces performance, and numerous asynchronous circuits have been
designed with too many C-elements, to the detriment of their performance. An AND (or NAND) gate is
a conjunction of only the low to high input signal trajectories whereas the C-element is conjunctive for
both rising and falling trajectories. The key is to properly understand the conjunction requirements
of the circuit and not use C-elements where some form of AND gate will suffice. It is rare that large
asynchronous circuits can be built using no C-elements, but the existence of many C-elements in the
circuit is often an indication that the performance of the circuit will be reduced.
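The difference between a C-element and an AND gate can be checked with a small model. The behavioral function follows the description above. The gate-level function is an assumption: it uses one plausible NAND arrangement, z' = xy + xz + yz, built from three 2-input NAND gates (two of which receive the fed-back output) and one 3-input NAND gate, and it is not necessarily the exact circuit of Figure 3. All names are invented for the illustration.

```python
def nand(*inputs: int) -> int:
    """NAND of any number of 0/1 inputs."""
    return 0 if all(inputs) else 1


def c_element_next(x: int, y: int, z: int) -> int:
    """Behavioral model: follow the inputs when they agree, otherwise hold z."""
    return x if x == y else z


def c_element_gates(x: int, y: int, z: int) -> int:
    """Assumed gate-level form: z' = xy + xz + yz, with z fed back."""
    return nand(nand(x, y), nand(x, z), nand(y, z))


if __name__ == "__main__":
    z = 0
    print(" x y | AND C")
    # x rises, y rises, x falls, y falls.
    for x, y in [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
        z = c_element_gates(x, y, z)
        print(f" {x} {y} |  {x & y}  {z}")
```

In the printed sequence the AND output falls as soon as x falls, while the C-element output stays high until y has fallen too. This is what makes the element conjunctive for both rising and falling transitions.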
Figure 3: The C-element (logic symbol and gate-level implementation)
So far, the discussion has only addressed control signals. There are also choices for how to encode
data. A common choice is the use of a bundled protocol with either 2- or 4-cycle signaling. In this
case, for an n-bit data value to be passed from the sender to the receiver, n+2 wires will be required
(n bits of data, 1 request bit, and 1 acknowledge bit). While this choice is conservative in terms of
wires, it does contain an implied timing assumption. Namely, the assumption is that the propagation
times of the control and data lines are either equal or that the control propagates slower than the data
signals. A sending module will assert the data wires and when they are valid will assert the request. It
is important that the same relationship of data being valid prior to request assertion be observed at the
receiving side. If this were not the case, the receiver could initiate the requested action with incorrect
data values. This requirement is often simply called the bundling constraint. Most asynchronous circuits
have been designed with bundled data protocols because the logic and wires required to implement
bundled data circuits are significantly fewer than with non-bundled approaches. However, in order for
bundled data asynchronous circuits to work properly, the bundling constraint must be met. Antagonists
of this approach note that these timing assumptions, while local to a particular interface, are similar to
those made for synchronous circuit design.
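The bundling constraint amounts to a delay comparison, sketched below. The sketch assumes the worst case in which the sender asserts the request at the same instant the data becomes valid, so that only the wire propagation delays matter. The function names and the delay values are hypothetical.

```python
def bundled_wire_count(n_bits: int) -> int:
    """n data wires plus one request wire and one acknowledge wire."""
    return n_bits + 2


def bundling_constraint_met(data_delays_ps, request_delay_ps) -> bool:
    """True when every data bit arrives no later than the request."""
    return max(data_delays_ps) <= request_delay_ps


if __name__ == "__main__":
    data_delays = [310, 295, 340, 325]   # hypothetical per-wire delays in ps
    print(bundled_wire_count(len(data_delays)), "wires")
    print("request delay 350 ps:", bundling_constraint_met(data_delays, 350))
    print("request delay 330 ps:", bundling_constraint_met(data_delays, 330))
```

With a 330 ps request delay the slowest data wire (340 ps) arrives after the request, so the receiver could sample stale data. In practice the request path is deliberately slowed or the data paths are matched until the check passes.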
The common alternative to the bundled data approach is dual rail encoding. In this case, data
and control signals are not separated onto distinct wire paths. Instead, using the dual rail approach, a
bit of data is encoded with its own request onto 2 wires. A typical dual rail encoding has four states:
1. 00 - Idle, data is not valid
2. 10 - Valid 0
3. 01 - Valid 1
4. 11 - Illegal
In this case, for an n-bit data value, the link between sender and receiver must contain 3n wires:
2 wires for each bit of data and the associated request plus another wire for the acknowledge. An
improvement on this protocol is possible when n bits of data are considered to be associated in every
transaction, as is the case when the circuit operates on bytes or words. In this case it is convenient to
combine the acknowledges into a single wire. The resulting wiring complexity is then reduced to 2n+1
wires: 2n for the data and requests plus an additional acknowledge signal. In a four cycle variant of
this dual rail protocol, sending a bit requires the transition from the idle state to either the valid 0 or
valid 1 state and then, after receiving the acknowledge, the sender must transition back to the idle state. The
acknowledge wire must be reset prior to a subsequent assertion of a valid 0 or 1. The illegal state is not
used. If recognized by the receiver, it should cause an error.
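The four-state code, the 4-cycle sending sequence, and the two wire counts can be sketched as follows. The code words follow the list above (10 is a valid 0, 01 is a valid 1); the function names are invented, and representing a wire pair as a Python tuple is purely an illustrative choice.

```python
IDLE = (0, 0)


def encode_bit(bit: int):
    """Dual rail code word: 10 is a valid 0, 01 is a valid 1."""
    return (1, 0) if bit == 0 else (0, 1)


def decode_pair(pair):
    """Return 0 or 1 for a valid code word, None for idle; 11 is an error."""
    if pair == (1, 1):
        raise ValueError("illegal dual rail state 11")
    if pair == IDLE:
        return None
    return 0 if pair == (1, 0) else 1


def four_cycle_send(bit: int):
    """(data pair, acknowledge) states for sending one bit, 4-cycle style."""
    word = encode_bit(bit)
    return [(IDLE, 0), (word, 0), (word, 1), (IDLE, 1), (IDLE, 0)]


def wires_per_bit_ack(n_bits: int) -> int:
    """2 data/request wires plus 1 acknowledge wire for every bit."""
    return 3 * n_bits


def wires_shared_ack(n_bits: int) -> int:
    """2 data/request wires per bit plus a single combined acknowledge."""
    return 2 * n_bits + 1


if __name__ == "__main__":
    for state in four_cycle_send(1):
        print(state)
    print(wires_per_bit_ack(8), "wires vs", wires_shared_ack(8), "wires for a byte")
```

For a byte, combining the acknowledges reduces the link from 24 wires to 17, compared with 10 wires for the bundled approach.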
A 2-cycle dual-rail protocol would signal a valid 0 by a single transition of the left bit, while a
valid 1 would be signaled by a transition on the right bit. Concurrent transitions on both the left and
right bits are illegal. Sending a 0 or a 1 must be followed by a transition on the acknowledge wire before
another bit can be transmitted. Alternative encoding schemes have been proposed as well [218, 56]. Dual
rail signaling is insensitive to the delays on any wire and therefore is more robust when assumptions like
the bundling constraint cannot be guaranteed. The receiver will need to check for validity of all n bits
before using the data or asserting the acknowledge. The downside of the dual rail approach is often the
increased complexity in both wiring and logic.
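The four-state dual rail code and the wire counts above can be illustrated with a minimal Python
sketch. The function names (encode_bit, decode_word, wire_count) are hypothetical and chosen only
for illustration; the sketch models code words, not the timing of the handshake.

```python
# Dual-rail code words from the list above: (left wire, right wire).
IDLE, VALID_0, VALID_1, ILLEGAL = (0, 0), (1, 0), (0, 1), (1, 1)

def encode_bit(bit):
    """Return the dual-rail code word for one data bit."""
    return VALID_1 if bit else VALID_0

def decode_word(rails):
    """Decode a list of wire pairs.

    Returns None while any pair is still idle (data not yet valid),
    a list of bits once every pair is valid, and raises on the illegal state.
    """
    if any(pair == ILLEGAL for pair in rails):
        raise ValueError("illegal dual-rail state 11")
    if any(pair == IDLE for pair in rails):
        return None
    return [1 if pair == VALID_1 else 0 for pair in rails]

def wire_count(n, shared_ack=True):
    """Wires on an n-bit link: 2n+1 with one shared acknowledge, else 3n."""
    return 2 * n + 1 if shared_ack else 3 * n
```

The receiver-side check in decode_word mirrors the requirement that all n bits be valid before the
data is used or the acknowledge is asserted.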
3.2 Completion Signals
One of the added complexities of asynchronous circuits is the need to generate completion signals that
directly or indirectly control the acknowledge signal in a signaling protocol. There are many methods,
none of which is universally satisfactory. One approach is to design an asynchronous module in a manner
that is similar to a synchronous circuit. Namely, the arrival of the request starts the module's internal
clock generator and after a certain number of internal clocks, the circuit is done, the clock is stopped,
and an acknowledge is generated. The idea was originally suggested by Chuck Seitz and was used during
the construction of the first dataflow computer, DDM1 [49]. This technique works well when the size of
the module is large, but when the module is small, the additional logic required for the internal clock
generator represents an overhead that is too costly. Also, the technique does not lend itself well to high
performance designs, due to the increased circuit complexity and the delay associated with starting the
clock generator. The result is that this approach is seldom used today, since modules represent relatively
small pieces of an integrated circuit.
Another choice for completion signal generation is the use of a model delay. In this case, conventional
synchronous timing analysis of the datapath is used to determine how long the circuit will take to
compute a valid result after the request has been received. A delay element, such as an inverter chain, is
then used to turn the request into the appropriately delayed acknowledge signal. Note that this method
works equally well for both 2- and 4-cycle signaling protocols.
Special functions often have unique opportunities. For example, arithmetic circuits can be built to
generate completion signals based on carry propagation patterns [88]. Other functions can independently
compute both F and its complement F' and use the exclusive-OR of their outputs to generate the
acknowledge signal. Note that this technique will only work directly in a 4-cycle signaling protocol. If
used with a 2-cycle protocol, additional logic such as a T flip-flop will be required.
A novel technique was proposed by Mark Dean [55] where completion detection was performed
by observing the power consumption of the circuit. When activated the circuit consumes power, and
when it is done the power consumption falls below a particular threshold.
The study of completion signal generation methods in asynchronous circuits could be the topic
of an entire book. For now, it is only necessary to realize that some method must be chosen, and that
the need for completion signals and related signaling protocols is a necessary overhead of asynchronous
circuit design. Many modern memory chips are integrated into synchronous systems using this same
technique. For example, memories are not inherently synchronous, but do have specified access latencies.
These specifications are essentially model delays. Systems which use these chips assume that the access
is complete after a certain number of clock cycles which correspond to a delay that is no smaller than
the access latency of the device.
4 Delay Models and Hazards
4.1 Delay Models, Circuits and Environments
There is a wide spectrum of asynchronous designs. One way to distinguish among them is to understand
the different underlying models of delay and operation. Every physical circuit has inherent delay.
However, since synchronous circuits process inputs between fixed clock ticks, they can often be regarded
as instantaneous operators, computing a new result in each clock cycle. On the other hand, since
asynchronous circuits have no clock, they are best regarded as computing dynamically through time.
Therefore, a delay model is critical in defining the dynamic behavior of an asynchronous circuit.
There are two fundamental models of delay: the pure delay model and the inertial delay model [206,
182]. A pure delay can delay the propagation of a waveform, but does not otherwise alter it. An inertial
delay can alter the shape of a waveform by attenuating short glitches. More formally, an inertial delay
has a threshold period, δ. Pulses of duration less than δ are filtered out.
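The difference between the two delay models can be sketched on a simple event list. This is an
illustrative sketch only: the waveform representation (a time-ordered list of (time, new_value)
transitions with alternating values) and the function names are assumptions, not part of the models
cited above.

```python
def pure_delay(events, delay):
    """Shift every transition by `delay`; the waveform shape is unchanged."""
    return [(t + delay, v) for t, v in events]

def inertial_delay(events, delay, threshold):
    """Drop pulses shorter than `threshold`, then shift by `delay`.

    `events` is a time-ordered list of (time, new_value) transitions.
    """
    kept = []
    for t, v in events:
        if kept and t - kept[-1][0] < threshold:
            # The previous transition started a pulse that ended too soon:
            # remove it, and skip the current transition that ends the pulse.
            kept.pop()
        else:
            kept.append((t, v))
    return [(t + delay, v) for t, v in kept]

# A signal that rises at t=0, glitches low for 1 time unit at t=10, and falls at t=20.
wave = [(0, 1), (10, 0), (11, 1), (20, 0)]
delayed = pure_delay(wave, 2)            # glitch is preserved, only shifted
filtered = inertial_delay(wave, 2, 3)    # glitch shorter than threshold 3 is removed
```

With these inputs the pure delay reproduces all four transitions two time units later, while the
inertial delay passes only the rise at t=2 and the fall at t=22.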
Delays are also characterized by their timing models. In a fixed delay model, a delay is assumed
to have a fixed value. In a bounded delay model, a delay may have any value in a given time interval. In
an unbounded delay model, a delay may take on any finite value.
An entire circuit’s behavior can be modeled on the basis of its component models. In a simple-gate,
or gate-level, model, each gate and primitive component in the circuit has a corresponding delay. In
a complex-gate model, an entire sub-network of gates is modeled by a single delay; that is, the network is
assumed to behave as a single operator, with no internal delays. Wires between gates are also modeled
by delays. A circuit model is thus defined in terms of the delay models for the individual wires and
components. Typically, the functionality of a gate is modeled by an instantaneous operator with an
attached delay.
Given a circuit model, it is also important to characterize the interaction of the circuit with its
environment. The circuit and environment together form a closed system, called a complete circuit (see
Muller in [132]). If the environment is allowed to respond to a circuit’s outputs without any timing
constraints, the two interact in input/output mode [24]. Otherwise, environmental timing constraints are
assumed. The most common example is fundamental mode [126, 206] where the environment must wait
for a circuit to stabilize before responding to circuit outputs. Such a requirement can be seen as the
hold time for a simple latch or flipflop [127].
4.2 Classes of Asynchronous Circuits
Given these models for a circuit and its environment, asynchronous circuits can be classified into a
hierarchy.
A delay-insensitive (DI) circuit is one which is designed to operate correctly regardless of the
delays on its gates and wires. That is, an unbounded gate and wire delay model is assumed. The concept
of a delay-insensitive circuit grows out of work by Clark and Molnar in the 1960’s on Macromodules [42].¹
DI systems have been formalized by Udding [205] and Dill [58]. The class of DI circuits built out of simple
gates and operators is quite limited. In fact, it has been proven that almost no useful DI circuits can be
built if one is restricted to a class of simple gates and operators [121, 25]. However, many practical DI
circuits can be built if one allows more complex components [61, 91]. A complex component is constructed
out of several simple gates. Internal to the component, timing assumptions must be satisfied; externally,
the component operates in a delay-insensitive manner. A C-element is such a component and other
examples of DI designs using complex components are described in Section 6.1; see Figure 15.
A quasi-delay-insensitive (quasi-DI or QDI) circuit is delay-insensitive except that “isochronic
forks” are required [27]. An isochronic fork is a forked wire where all branches have exactly the same
delay. In other formulations, a bounded skew is allowed between the different branches of each fork. In
contrast, in a DI circuit, delays on the different fork branches are completely independent, and may vary
considerably. The motivation of QDI circuits is that they are the weakest compromise to pure
delay-insensitivity needed to build practical circuits using simple gates and operators. C-elements are somewhat
problematic since they inherently contain an output which is fed back internally to the C-element. This
case represents the worst form of isochronic fork, since one of the forks is contained within the C-element
circuit module while the other is exported to outside modules. Martin [122] and van Berkel [211] have
used QDI circuits extensively and have described their advantages and disadvantages [121, 214].
A speed-independent (SI) circuit is one which operates correctly regardless of gate delays; wires
are assumed to have zero or negligible delay. SI circuits were introduced by David Muller in the 1950’s
(see [132]). Muller’s formulation only considered deterministic input and output behavior. This class
has recently been extended to include circuits with a limited form of non-determinism [12, 100].
A self-timed circuit, described by Seitz [129], contains a group of self-timed “elements”. Each
element is contained in an “equipotential region”, where wires have negligible or well-bounded delay.
An element itself may be an SI circuit, or a circuit whose correct operation relies on use of local timing
assumptions. However, no timing assumptions are made on the communication between regions; that
is, communication between regions is delay-insensitive.
Each of the above circuits operates in input/output mode: there are no timing assumptions on
when the environment responds to the circuit. The most general category is an asynchronous circuit [206].
These circuits contain no global clock. However, they may make use of timing assumptions both within
the circuit and in the interaction between circuit and environment. Latches and flip-flops, with setup
and hold times, belong to this class. Other examples include timed circuits [141], where both internal
and environmental bounded-delay assumptions are used to optimize the designs.
¹ This article contains numerous references to the work of Charles Molnar. The asynchronous circuit discipline lost one
of its brightest lights when Charlie passed away in December, 1996. Charlie's influence on the field was profound. He
inspired many of the people who are today considered to be pioneers and senior statesmen of the field. His inventions are
numerous as both his publications and patents attest. The difficult aspect of Charlie's influence for people to grasp, with
the exception of the few who had the privilege to know and work with Charlie over the years, is the depth and creativity of
his thinking. Charlie’s work has provided both a solid foundation for the field as well as an inspiration to continue. At the
time of his death, he was one of the creative leaders of the asynchronous circuits group at Sun Microsystems Laboratories,
Inc. [195]. His influence will be sorely missed.
4.3 Hazards
A fundamental difference between synchronous and asynchronous circuits is in their treatment of hazards.
In a synchronous system, computation occurs between clock ticks. Glitches on wires during a clock cycle
are usually not a problem. The system operates correctly as long as a stable and valid result is produced
before the next clock tick, when the result is sampled. In contrast, in an asynchronous system, there is
no global clock; computation is no longer sampled at discrete intervals. As a result, any glitch may be
treated by the system as a real change in value, and may cause the system to malfunction.
The potential for a glitch in an asynchronous design is called a hazard [206]. Hazards were
first studied in the context of asynchronous state machines, and much of the original work focused on
combinational logic. Sequential hazards are also possible in asynchronous state machines; these are
called critical races or essential hazards, and will be discussed later.
Several approaches have been used to eliminate combinational hazards. First, inertial delays may
be used to attenuate undesired “spikes”; much of the early work in asynchronous synthesis relied on use of
inertial delays (see Unger [206]). Second, if a bounded delay model is assumed, hazards may be “fixed”
by adding appropriate delays to slow down certain paths in a circuit. Third, hazards are sometimes
tolerated where they will do no harm; this approach was also used in some early work. Finally, and
most importantly, synthesis methods can be used to produce circuits with no hazards, i.e., hazard-free
circuits.
In the remainder of this section, the basics of hazard-free combinational synthesis are presented.
Two traditional classes of combinational hazards are defined: SIC and MIC hazards. Classic techniques
to eliminate both SIC and MIC hazards in 2-level circuits are introduced, illustrated by some
simple examples. The section concludes with a description of recent work on hazard-free minimization.
Throughout this section, a conservative circuit model is used: the combinational circuit is assumed to
have unknown gate and wire delays. That is, an unbounded gate and wire delay model is assumed.
4.3.1 SIC Hazards
Hazards are temporal phenomena: they are manifest during the dynamic operation of a circuit. As
an example, consider the Karnaugh map (“K-map”) [127] in Figure 4(a), defining a Boolean function
with 3 inputs: A, B, C. A minimum-cost sum-of-products realization, or cover, is given by expression
f = A'B + AC; the corresponding AND-OR circuit is shown in the figure. Consider the behavior
of the circuit during the single-input change (SIC) from ABC = 011 to ABC = 111. In this transition,
only a single input, A, changes value. Initially, AND-gate A'B is 1, AND-gate AC is 0, and the OR-gate
output is 1. When A changes to 1, AND-gate A'B goes to 0 and AND-gate AC goes to 1. If AC
is slower than A'B, the result is a glitch on the OR-gate output: 1 → 0 → 1. Therefore, the circuit has
a hazard for this transition.
Figure 4: Combinational hazard example: SIC transition
The Karnaugh map in Figure 4(b) shows an alternative hazard-free realization of the function.
A third product, BC, has been added to the cover. For the same SIC transition, AND-gate BC holds
its value at 1, and the OR-gate output remains at 1 without any glitches. Therefore, the new circuit is
hazard-free for the transition. This new product, BC, is redundant in terms of function f, but is necessary
to eliminate the hazard. This product is used to cover the K-map transition, ABC : 011 → 111.
The original theory of combinational hazards for SIC transitions was developed by Huffman,
Unger and McCluskey (see [206]). The above example indicates how to eliminate an SIC static-1 hazard,
that is, for an input change where the function makes a 1 → 1 transition. In this case, some product
must cover (i.e., completely contain) the entire transition. There are 3 remaining types of transitions,
where the output makes a 0 → 0, 0 → 1, or 1 → 0 transition. It has been shown that, given an arbitrary
AND-OR implementation, no hazard will occur for any of these 3 transitions [206].² That is, only
static-1 SIC hazards must be avoided during synthesis of an AND-OR circuit; other SIC transitions will
be hazard-free.
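The covering condition for SIC static-1 hazards can be checked mechanically. The following Python
sketch enumerates every pair of adjacent input points where the function is 1 and reports those
not contained in a single product. The representation of a product as a dictionary from variable
name to required value, and all function names, are assumptions made for illustration.

```python
from itertools import product as cartesian

def contains(term, point):
    """True if the product term (dict: variable -> required value) is 1 at point."""
    return all(point[var] == val for var, val in term.items())

def evaluate(cover, point):
    """OR of all AND terms in the cover."""
    return any(contains(term, point) for term in cover)

def sic_static1_hazards(cover, variables):
    """List SIC 1 -> 1 transitions that no single product covers entirely."""
    hazards = []
    for values in cartesian((0, 1), repeat=len(variables)):
        start = dict(zip(variables, values))
        for var in variables:
            if start[var] == 1:
                continue  # visit each adjacent pair once, as a 0 -> 1 change
            end = dict(start, **{var: 1})
            if evaluate(cover, start) and evaluate(cover, end):
                if not any(contains(t, start) and contains(t, end) for t in cover):
                    hazards.append((values, var))
    return hazards

VARS = ("A", "B", "C")
minimal = [{"A": 0, "B": 1}, {"A": 1, "C": 1}]   # f = A'B + AC
fixed = minimal + [{"B": 1, "C": 1}]             # add the redundant product BC
```

For the minimal cover the only uncovered transition is the change of A at ABC = 011, which is the
example discussed above; adding BC leaves no uncovered static-1 SIC transition.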
4.3.2 MIC Hazards
The case of a multiple-input change (MIC) is much more complex: both static and dynamic hazards
must be eliminated. An MIC transition has a start input value, M, and a destination input value, N,
where several inputs change monotonically between M and N.
² More precisely, these realizations will be hazard-free as long as no AND gate contains a pair of complementary literals.
The first problem which arises when considering MIC transitions is that of function hazards [206].
Consider the Karnaugh map in Figure 5(a). An example of an MIC transition has start point ABCD =
0010 and end point ABCD = 0111, and two changing inputs: B and D. During this MIC transition,
the function itself is not monotonic; that is, it can change value several times. To see this, consider the
change in the function’s value if input D first goes to 1, followed by input B. Initially, at ABCD = 0010,
the function is 1. When D goes high, the function changes to 0. When B then goes high, the function
then changes to 1. Therefore, the function itself changes value more than once. Such an MIC transition
is said to have a function hazard.
It has been shown that, assuming gates and wires may have arbitrary delay, there is no guaranteed
method to synthesize a circuit which is hazard-free for a transition with a function hazard [206].
Intuitively, a function hazard is a glitch that is inherent in the Karnaugh map specification itself. If input
D changes much later than input B, there is no way to prevent the function output from glitching.³
In summary, function hazards cannot be avoided. Therefore, classic synthesis methods focus only
on MIC transitions which are already function-hazard-free. Examples of function-hazard-free transitions
are shown in Figure 5(a). The transition from ABCD = 0100 to ABCD = 0111 is static, since the
output remains at a single value (1) throughout the transition. A dynamic transition is shown from
ABCD = 0111 to ABCD = 1110; this transition is function-hazard-free, since the output changes
exactly once, from 1 to 0, on all direct paths from the start point to the end point.
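A function hazard can be detected by walking every direct path between the start and end points
and counting output changes. The sketch below is illustrative: since the Karnaugh map of Figure 5(a)
is not reproduced here, the function is assumed to be the one realized by the cover
f = C'D' + A'D' + BD, and the function names are hypothetical.

```python
from itertools import permutations

def f(a, b, c, d):
    """Assumed function: the one realized by the cover C'D' + A'D' + BD."""
    return int((not c and not d) or (not a and not d) or (b and d))

def has_function_hazard(func, start, end):
    """True if func changes more than once on some direct path start -> end."""
    changing = [i for i in range(len(start)) if start[i] != end[i]]
    for order in permutations(changing):
        point = list(start)
        value, changes = func(*point), 0
        for i in order:
            point[i] = end[i]          # flip one changing input at a time
            new_value = func(*point)
            changes += new_value != value
            value = new_value
        if changes > 1:
            return True
    return False
```

Under this assumed function, the transition 0010 to 0111 has a function hazard (D first, then B
gives 1, 0, 1), while 0100 to 0111 and 0111 to 1110 are function-hazard-free.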
Given a function-hazard-free MIC transition, the goal of hazard-free synthesis is to produce an
AND-OR circuit which is glitch-free for the transition. If a glitch can occur, the transition is said to
have a logic hazard. If no glitches are possible, the transition is logic-hazard-free.
³ Alternatively, even if B and D change simultaneously, there are always delay values for the given gates to force the
function to drop low before going high.
Figure 5: Combinational hazard example: static MIC transition
Figure 6: Combinational hazard example: dynamic MIC transition
Static-1 logic hazards (i.e., hazards during a 1 → 1 transition) can be avoided in an AND-OR
implementation by using an approach similar to the SIC case [63]. As an example, consider the Karnaugh
map in Figure 5(a). A minimum-cost sum-of-products realization is: f = C'D' + A'D' + BD. Consider
the MIC transition from ABCD = 0100 to ABCD = 0111, indicated by an arrow. AND-gates C'D' and
A'D' each make a 1 → 0 transition, and AND-gate BD makes a 0 → 1 transition. If BD is slow, the
result is a glitch on the OR-gate output: 1 → 0 → 1. Therefore, the implementation has a static logic
hazard. An alternative hazard-free implementation is shown in Figure 5(b). The hazard is eliminated by
adding a fourth product term, A'B, which holds its value at 1 throughout the transition. This product
covers the entire transition, ABCD : 0100 → 0111.
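The static-1 condition for an MIC transition amounts to asking whether one product contains the
whole transition cube. A minimal Python sketch follows; the encoding of a product as a tuple over
(A, B, C, D) with None for an absent literal, and the function names, are assumptions for illustration.

```python
def transition_cube(start, end):
    """Inputs that stay fixed keep their value; changing inputs become None."""
    return tuple(s if s == e else None for s, e in zip(start, end))

def covers(term, cube):
    """A product covers the cube if each of its literals is on a fixed input
    of the cube and agrees with that input's value."""
    return all(lit is None or lit == c for lit, c in zip(term, cube))

def static1_hazard_free(cover, start, end):
    """A 1 -> 1 transition is logic-hazard-free iff some product covers it."""
    cube = transition_cube(start, end)
    return any(covers(term, cube) for term in cover)

# Products over (A, B, C, D): 0 = complemented literal, 1 = true literal, None = absent.
C_D_ = (None, None, 0, 0)   # C'D'
A_D_ = (0, None, None, 0)   # A'D'
BD = (None, 1, None, 1)     # BD
A_B = (0, 1, None, None)    # A'B

minimal_cover = [C_D_, A_D_, BD]
fixed_cover = minimal_cover + [A_B]
start, end = (0, 1, 0, 0), (0, 1, 1, 1)
```

For the transition 0100 to 0111 the minimal cover fails the check, and the cover with the added
product A'B passes it.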
The next problem is to eliminate static-0 logic hazards. These hazards are easily handled. In
fact, it has been shown that, given any MIC 0 → 0 transition which is already function-hazard-free, the
transition is guaranteed to be free of logic hazards in any AND-OR implementation [63]. That is, no
special care need be taken during 2-level synthesis to avoid static-0 logic hazards.
A more difficult problem is to eliminate MIC dynamic logic hazards. Figure 6(a) contains the
same Karnaugh map as in Figure 5(b), but with a new MIC transition: from ABCD = 0111 to ABCD =
1110. This is a dynamic function-hazard-free transition; the function makes a 1 → 0 transition. The
implementation has a dynamic logic hazard. AND-gates BD and A'B each make a 1 → 0 transition. At
the same time, AND-gate A'D' has inputs changing from 10 to 01, and therefore may glitch: 0 → 1 → 0.
If A'D' is slow, this glitch will propagate to the OR-gate after the other AND-gates have gone to 0, and
the OR-gate output will glitch: 1 → 0 → 1 → 0.
To prevent a dynamic MIC hazard, no AND-gate may temporarily turn on during the transition.
In the example, A'D' becomes enabled, then disabled, as inputs A and D change value.
This phenomenon is apparent in the Karnaugh map: product A'D' intersects the transition from
ABCD : 0111 → 1110, but intersects neither the start point (ABCD = 0111) nor the end point
(ABCD = 1110) of the transition [20]. A solution to this problem was proposed by Beister[lo]: product
A'D' is reduced to a smaller product A'B'D' which no longer intersects the transition. Note that this
product is non-prime. The final cover is shown in Figure 6(b). AND-gate A'B'D' remains at 0 throughout
the transition, and the dynamic MIC transition is hazard-free.
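The dynamic condition can be checked by listing products that intersect the transition cube without
containing its start or end point. The following Python sketch is illustrative only; the tuple
encoding of products over (A, B, C, D), with None for an absent literal, and the function names are
assumptions.

```python
def intersects(term, start, end):
    """True if the product is 1 somewhere in the transition cube."""
    return all(lit is None or s != e or lit == s
               for lit, s, e in zip(term, start, end))

def contains_point(term, point):
    """True if the product is 1 at the given input point."""
    return all(lit is None or lit == p for lit, p in zip(term, point))

def illegal_intersections(cover, start, end):
    """Products that intersect the transition but hold neither endpoint."""
    return [t for t in cover
            if intersects(t, start, end)
            and not contains_point(t, start)
            and not contains_point(t, end)]

# Products over (A, B, C, D): 0 = complemented literal, 1 = true literal, None = absent.
C_D_ = (None, None, 0, 0)   # C'D'
A_D_ = (0, None, None, 0)   # A'D'
BD = (None, 1, None, 1)     # BD
A_B = (0, 1, None, None)    # A'B
A_B_D_ = (0, 0, None, 0)    # A'B'D', the reduced (non-prime) product

start, end = (0, 1, 1, 1), (1, 1, 1, 0)
before = illegal_intersections([C_D_, A_D_, BD, A_B], start, end)
after = illegal_intersections([C_D_, A_B_D_, BD, A_B], start, end)
```

For the transition 0111 to 1110, only A'D' is reported in the first cover, and no product is
reported once A'D' is reduced to A'B'D'.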
4.3.3 Hazard-Free Minimization
The above examples indicate how to eliminate hazards for any one MIC transition. Hazard elimination
can be viewed as a covering problem on a Karnaugh map. For the 1 → 1 case, the entire transition
must be covered by some product. For the 1 → 0 and 0 → 1 cases, every product which intersects
the transition must also contain its start or end point. For the remaining case, 0 → 0, no hazard will
occur in any AND-OR realization [206]. These conditions suffice to eliminate any single MIC hazard.
Unfortunately, when attempting to eliminate hazards for several MIC transitions simultaneously, these
covering conditions may be unsatisfiable. That is, for a given set of MIC transitions, a hazard-free cover
may not exist [206, 66, 153].
An exact hazard-free two-level minimization algorithm was developed by Nowick and Dill [153].
The algorithm finds an exactly minimum-cost cover which is hazard-free for a set of MIC transitions, if a
solution exists. A heuristic hazard-free two-level minimization algorithm has also been developed [202].
There is a rich literature on multi-level hazard-free circuits as well, and several synthesis methods
have been proposed. One approach is to start with a hazard-free circuit (for example, a two-level
circuit), and apply hazard-non-increasing multi-level transformations [206, 19, 103]. These transformations
transform a hazard-free two-level circuit into a hazard-free multi-level circuit. Alternatively, a
hazard-free multi-level circuit can be synthesized directly, using binary decision diagrams (BDDs) [110].
Other algorithms have been developed for the hazard-free technology mapping of circuits to arbitrary
cell libraries [191, 102, 14].
4.3.4 An Alternative View of Hazards
The above discussion follows a classical framework, focusing on combinational hazards separately from
sequential hazards. This distinction has been quite useful for synthesis of asynchronous state machines.
However, for other synthesis styles, a uniform treatment of hazards is more natural. In this latter
approach, each gate and sequential component is assigned a specified “legal” behavior, describing the
correct operation of the component. As components are combined into a circuit, their composite behavior
is formally determined. If a component may produce an output which cannot legally be accepted by
another component, then a violation occurs. This notion has been formalized, in different contexts, as:
computation interference [61], stability violation [122] and choking [58].
Figure 7: The Mutual Exclusion Element
5 Arbitration
In order to avoid non-deterministic behavior, asynchronous circuits must be hazard-free under some
circuit delay model. As discussed above, certain forms of MIC behavior can be tolerated; but the most
general forms of signal concurrency must be controlled by arbitration in order to avoid unrestricted MIC
behaviors that result in circuit hazards. For example, if some circuit is to react one way if it sees a
transition on signal A and react differently for a transition on signal B, then some guarantee must be
provided that this circuit will see mutually exclusive transitions on inputs A and B. Nondeterministic
behavior will occur if this guarantee cannot be provided. Such mutually exclusive signal conditioning is
usually provided by arbitration.
Latches and flip-flops cannot be used for arbitration due to the inherent possibility that they may
enter their metastable regions [34, 33]. Arbiter circuits are typically constructed to adhere to a particular
signaling protocol and therefore vary somewhat. However, all arbiters rely on a mutual exclusion, or ME,
element to separate possible concurrent signal transitions. The ME element is essentially a latch with an
analog metastability detector on its outputs. If sufficient signal separation exists between the two inputs
then the first one wins. However, if both inputs occur within a device-specific time, then the latch will go
metastable, but the metastability detector will prevent the outputs of the ME circuit from changing until
the metastability condition is resolved. The duration of metastability is unbounded but normally persists
for a very short time. It has been experimentally confirmed [34, 33] that the probability of metastability
persisting decays exponentially with time, at a rate which depends somewhat on the particular latch
properties. The result is that in the case of a tie, exactly one side will win the arbitration. The additional
implication is that the distinction of which side wins does not matter.
Chuck Seitz proposed an ME circuit that is particularly useful for MOS based designs, shown in
Figure 7. The cross-coupled inverters form the usual SR latch. The outputs of the latch are connected
to a pair of transistors which form the metastability detector. When the latch is in its metastable region,
V1 and V2 will differ by less than the threshold voltage of the N-type transistors. In this case, both T1
and T2 will be off, since the gate-to-source voltage will be less than the threshold. If T1 and T2 are off
then the outputs of the ME circuit will remain high. When V1 and V2 differ by more than the threshold
voltage, then the latch will stabilize into either of its stable states. At this point, either T1 or T2 will
turn on and the respective output will fall to its asserted level.
Once the mutually exclusive resolution of the input race has been provided by an ME element,
constructing the rest of an arbiter circuit to conform to a particular signaling protocol is relatively
straightforward. An example of a 4-cycle arbiter originally proposed by David Dill and Ed Clarke is
shown in Figure 8. Four-cycle arbitration is relatively simple since the input race must only be detected
for both signal trajectories going in the same single direction (typically a low to high transition). The
use of the C-elements in the arbiter prevents another pending request from passing through the arbiter
until after the active request cycle has cleared.
Figure 8: A 4-cycle Arbiter
Two cycle arbitration is somewhat more complex, since the inputs of the arbiter may race in
all possible combinations of signal trajectories. Ebergen [24], for example, has reported on a particular
2-cycle arbiter known as the RGD (Request, Grant, Done) arbiter.
Another interesting arbitration problem was posed by Davis and Stevens during the development
of the Post Office chip [52]. One potential performance difficulty with asynchronous signaling protocols
is that waiting for the next event is the normal mode of operation. Hence if two requesters want to
share some resource, the loser must wait until the winner is finished before access to that resource can
be granted. However, if the loser wants to do something else if it does not win arbitration, then the
previously discussed arbitration methods will be insufficient. The need is for a NACK’ing arbiter which
provides the requester with an acknowledge if the resource is available, and a negative acknowledge or
NACK if the resource is busy. Several versions of this NACK’ing arbiter have been designed. The
version used in the Post Office design used 4 ME elements [51]. Each ME element resolved one of the 4
possible race trajectories. The remaining protocol control was provided by an asynchronous finite state
machine.
Arbiters for more than 2 inputs allow numerous implementation options. The simplest case is to
create a binary tree of 2-input arbiters of the appropriate size (see [58]). The tree may be balanced or
unbalanced. Balanced trees are fair in that they give equal priority to all of the leaf inputs. Unbalanced
trees inherently provide higher priority for inputs which enter the structure closest to the root (in this
case the output) of the arbitration tree. The problem with tree-structured N-way arbiters is that they
contain many C-elements and therefore suffer from decreased performance. Another approach is to
use redundant ME elements to provide mutually exclusive assertion of 1 of the N input signals. This
approach was also used by Ken Stevens in the design of the Post Office chip [52], and several variants of
the multiple ME element theme have been investigated by Charles Molnar at Sun Laboratories for the
counterflow pipeline processor [195].
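The priority effect of tree shape can be illustrated with a small Python sketch. It rests on a
simplifying assumption that is not made in the text: every requester is continuously requesting, and
each 2-input arbiter serves its two inputs in strict alternation, so each node passes half of its
grants to each input. The tree encoding (nested pairs with string leaves) and the function name are
hypothetical.

```python
from fractions import Fraction

def grant_shares(tree, share=Fraction(1)):
    """Return {leaf: fraction of grants}, splitting each node's share in half."""
    if isinstance(tree, str):
        return {tree: share}
    left, right = tree
    shares = grant_shares(left, share / 2)
    shares.update(grant_shares(right, share / 2))
    return shares

balanced = (("r0", "r1"), ("r2", "r3"))
unbalanced = ("r0", ("r1", ("r2", "r3")))
```

Under this assumption the balanced tree gives each of the four requesters one quarter of the grants,
while in the unbalanced tree the requester entering at the root stage receives one half and the two
deepest requesters one eighth each.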
Perhaps some of the best work on cascaded arbiters and nacking arbiters has been performed
by Robert Shapiro and Hartmann Genrich [190, 189]. Sadly this work has not been published in an
available forum. Their work started as a formal effort to prove the cascaded arbiter properties of the
arbiter circuits used in the aforementioned Post Office chip, after a defect was discovered during testing
of an arbiter fragment chip. Shapiro and Genrich used Petri Net based models to create behavioral
traces of both the basic arbitration modules and their properties when cascaded. Their analysis found
what turned out to be a simple design flaw which had caused the problem. The more important aspect
of their work is that they then found that the arbitration circuits were overconstrained in terms of
C-element synchronization. They produced a series of 4-cycle designs which contained few C-elements.
The C-elements were replaced by NAND gates. Another interesting result of their work is that arbiters
can be made to be faster than those containing C-elements in either the forward (requesting) direction
or in the backward (acknowledge) direction but not both.
Another interesting approach to the low-latency cascaded arbitration problem has been taken
by Yakovlev, Petrov, and Lavagno [222]. Their circuits are speed-independent and have an improved
response delay at the input request-grant handshake link due to two factors. First, request propagation
is performed in parallel with the start of arbitration. The arrival of any request at a stage can trigger an
immediate request to the next higher stage, prior to arbitration resolution in the lower stage. Second,
resetting the request-grant handshakes is done concurrently in different cascades of the request-grant
propagation chain.
The field of arbiter design is as diverse as the circuits for which the arbiters are being designed.
While the methods are diverse, there is little doubt that the design of efficient arbitration structures is a
key aspect of any high-performance asynchronous system design. In fact, organizing the system so that
the arbitration requirements become minimal is viewed by many designers as the key factor in achieving
high performance.
6 An Overview of Prior Work
6.1 Pioneering Efforts
In the mid 1950’s asynchronous circuits were first studied by analyzing the nature of input restrictions
on sequential circuits. These efforts were part of the general interest in switching theory. Huffman
postulated [84, 85] that there must be a minimum time between input changes in order for a sequential
circuit to be able to recognize them as being distinct. There must then be two critical periods, $\delta_1$ and $\delta_2$,
where $\delta_1 < \delta_2$. Signals which occur within a time that is less than or equal to $\delta_1$ cannot be distinguished
as being separate events. Signals which are separated by a time of $\delta_2$ or greater are distinguishable as a
sequence of separate events. Signal events separated by a time between $\delta_1$ and $\delta_2$ cause nondeterministic
sequential circuit behavior. This led to a class of circuits that became known as Huffman circuits. This
work was extended in the 1950’s and 60’s by the fundamental contributions of Unger, McCluskey and
others.
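The three timing regimes in Huffman's postulate can be summarized in a few lines of Python. This is only an illustrative sketch: the function name and the parameter names `delta1` and `delta2` (standing for the two critical periods) are assumptions introduced here, not notation from the cited work.

```python
def classify_separation(dt, delta1, delta2):
    """Classify two input events separated by time dt.

    delta1 and delta2 are the two critical periods (hypothetical names),
    with delta1 < delta2.
    """
    if not 0 <= delta1 < delta2:
        raise ValueError("expected 0 <= delta1 < delta2")
    if dt <= delta1:
        return "indistinguishable"   # cannot be told apart as separate events
    if dt >= delta2:
        return "distinct"            # recognized as a sequence of separate events
    return "nondeterministic"        # strictly between the two critical periods
```

For instance, with critical periods of 2 and 5 time units, a separation of 3 units falls in the nondeterministic band.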
Muller [138, 139] proposed a different class of circuits which are more closely related to modern
asynchronous circuits. In particular, he proposed the use of a ready signal. Input signals to Muller
circuits were only permitted when the ready signal was asserted. In some sense, the concept is similar
to that of a simple 4-cycle circuit. The unasserted acknowledge serves as a ready indication. When the
circuit is not ready to accept additional input then it can merely hold its acknowledge to indicate that
no further requests can be tolerated.
The efforts of Muller and Huffman spurred considerable theoretical debate in the switching circuit
literature. The next notable event from a modern perspective was the seminal work by Stephen Unger
that resulted in the publication of his classic text [206]. In this book, Unger provided a detailed method
for synthesizing single-input change asynchronous sequential switching circuits. He provided a partial
view of what would be required for the larger domain of multiple-input change circuits. This textbook
had a significant influence on much of the practical work that followed in the next decade. For example,
the subsequent work of both of these authors was heavily influenced by Unger’s work. Additionally,
several early mainframe computers were constructed as entirely asynchronous systems, notably the MU-5
and Atlas computers.
Another noteworthy effort, the Macromodule Project [42], conducted at Washington University in
St. Louis, provided an early demonstration of the composition benefits of asynchronous circuit modules.
This project created a digital “Lego” kit of modules. These modules could be (and were) rapidly used to
configure special-purpose computing engines, as well as general-purpose computers. The project took
a significant step forward and provided a sound foundation for the numerous macromodular synthesis
approaches being investigated today [23, 210, 61].
Yet another noteworthy pioneer was Chuck Seitz, whose MIT dissertation [186] introduced a
Petri Net like formalism which proved to be extremely useful in the design and analysis of asynchronous
circuits. In his subsequent academic career, Prof. Seitz taught numerous courses at the University of
Utah and then later at CalTech where he infected a large number of students with what proved to
be an incurable interest in asynchronous circuits. His influence directly resulted in the asynchronous
implementation of the first operational dataflow computer [49] and the first commercial graphics system,
the Evans & Sutherland LDS-1. Professor Seitz's role as an educator is also significant in that his courses
on asynchronous circuits, starting as early as 1970, inspired many of the field’s current researchers.
The influence of these pioneering efforts is still seen in most of the asynchronous circuit work that
is in progress today.
6.2 Asynchronous Finite State Machine Synthesis
The most traditional approach to specifying and synthesizing asynchronous controllers is to view them
as finite state machines. This view of computation is state-based: a machine is in some state, it receives
inputs, generates outputs, and moves to a new state. Such specifications are naturally described by a
flow table or state table [206]. These tables define the behavior of outputs and next state as a function
of the inputs and current state. Current and next states are described symbolically (see Figure 9).
The earliest asynchronous state machine implementations were Huffman machines (see [206]).
[Flow table: rows are the current states; columns are the input combinations a b c in the order
000, 001, 011, 010, 110, 111, 101, 100; each entry gives the next state and the outputs X, Y.]
Figure 9: An asynchronous flow table
These machines consist of combinational logic, primary inputs, primary outputs and fed-back state
variables. No latches or flip-flops are used: state is stored on feedback loops, which may have added
delay elements. A block diagram of a Huffman machine is shown in Figure 10.
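To make the role of the feedback loop concrete, the following Python sketch steps a small flow table until the fed-back state stops changing. The table contents, the name `FLOW_TABLE` and the function `settle` are hypothetical, chosen only for illustration; they are not the table of Figure 9.

```python
# Hypothetical flow table: (current state, input) -> (next state, output).
# An entry is stable when its next state equals the current state.
FLOW_TABLE = {
    ("A", 0): ("A", 0),   # stable
    ("A", 1): ("B", 0),   # unstable: the machine moves toward B
    ("B", 1): ("B", 1),   # stable
    ("B", 0): ("A", 1),   # unstable: the machine moves toward A
}

def settle(state, inp, table=FLOW_TABLE, max_steps=10):
    """Follow the feedback path until a stable state is reached for inp."""
    for _ in range(max_steps):
        next_state, output = table[(state, inp)]
        if next_state == state:
            return state, output
        state = next_state            # the fed-back state variable changes
    raise RuntimeError("no stable state reached (oscillation)")
```

The sketch ignores delays and hazards entirely; it only shows that a Huffman-style machine "computes" by moving through unstable entries until it rests in a stable one.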
Synthesis methods for asynchronous state machines usually follow the same general outline as
synchronous methods [127]. A flow table is reduced through state minimization. Symbolic states are
assigned binary codes using state assignment. Finally, the resulting Boolean functions are implemented
in combinational logic using logic minimization.
There are several possible operating modes for an asynchronous state machine. Unger [207] proposed
a hierarchy, based on the kinds of input changes that a machine can accept. In a single-input
change (SIC) machine, only one input may change at a time. Once the input has changed, no further
inputs may change until the machine has stabilized. This operating mode is highly restrictive, but
simplifies the elimination of hazards. A summary of SIC asynchronous state machines can be found in
Unger [206].
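The structural part of the SIC restriction is easy to check mechanically. The sketch below, with a hypothetical function name, tests whether successive input vectors differ in exactly one bit; it does not model the timing requirement that the machine has stabilized between changes.

```python
def is_single_input_change(vectors):
    """True if each consecutive pair of input vectors differs in exactly one bit."""
    for prev, cur in zip(vectors, vectors[1:]):
        if len(prev) != len(cur):
            raise ValueError("input vectors must have equal width")
        if sum(p != c for p, c in zip(prev, cur)) != 1:
            return False
    return True
```

A sequence such as 000, 001, 011, 010 satisfies the check, while a direct step from 000 to 011 does not.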
A multiple-input change (MIC) machine allows several inputs to change concurrently. Once the
inputs change, no further inputs may change until the machine has stabilized. This approach allows
greater concurrency, but it is still quite restricted. In particular, MIC machines have the added constraint
that the multiple-input change is almost simultaneous. More formally, all inputs must change within
some narrow time period, $\delta$. This constraint helps to simplify hazard elimination, which is still more
complicated than in the SIC case.
MIC designs were proposed by Friedman and Menon [67] and Mago [115]. These designs require
the use of delays on inputs or outputs, special “delay boxes”, and careful timing requirements. The
usefulness of these designs in a concurrent environment is limited, since input changes are required to
be near-simultaneous.
Finally, an unrestricted-input change (UIC) machine allows arbitrary input changes, as long as
no one input changes more than once in some given time interval, $\delta$. This behavior is quite general, but
hazard elimination is problematic. UIC designs were first proposed by Unger [207]. These designs are
not currently practical: they require the use of large inertial delays and have not been proven to avoid
metastability problems.
Figure 10: Block diagram of a Huffman machine
In any asynchronous state machine, the problem of hazards must be addressed. First is the problem
of combinational hazards. The difficulty of combinational hazard elimination depends on whether
the machine operates in SIC or MIC mode. As mentioned earlier, SIC hazards are easier to eliminate.
Hazards are eliminated by hazard-free synthesis or by using inertial delays to filter out glitches. Alternatively,
many traditional synthesis methods ignore hazards on outputs, and only eliminate hazards in
the next-state logic. Such machines are called S-proper or properly-realizable [206].
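As an illustration of a combinational hazard in the SIC case, the sketch below applies the classical test for static-1 hazards in a two-level sum-of-products cover: a single-input transition between two points where the function is 1 is safe only if one product term covers both points. The data representation (cubes as dictionaries) and all function names are assumptions made for this example.

```python
from itertools import product

def covers(cube, point):
    """A cube (dict: variable -> 0/1) covers a point if all its literals agree."""
    return all(point[v] == val for v, val in cube.items())

def evaluate(cover, point):
    """Value of the sum-of-products cover at a point."""
    return any(covers(c, point) for c in cover)

def static1_hazards(cover, variables):
    """List single-input transitions between on-set points not held by one cube."""
    hazards = []
    for bits in product((0, 1), repeat=len(variables)):
        p = dict(zip(variables, bits))
        for v in variables:
            if p[v] == 1:
                continue                      # visit each adjacent pair once
            q = {**p, v: 1}                   # neighbor with variable v flipped
            if evaluate(cover, p) and evaluate(cover, q):
                if not any(covers(c, p) and covers(c, q) for c in cover):
                    hazards.append((bits, v))
    return hazards

# f = a.b + a'.c has a static-1 hazard when a changes with b = c = 1.
F = [{"a": 1, "b": 1}, {"a": 0, "c": 1}]
# Adding the redundant consensus term b.c removes it.
F_SAFE = F + [{"b": 1, "c": 1}]
```

Here `static1_hazards(F, ["a", "b", "c"])` reports the transition on `a` from the point (0, 1, 1), and the same call on `F_SAFE` reports none.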
Second, since asynchronous state machines have state, sequential hazards must be addressed.
When a state machine goes from one state to another, several state bits may change. If the machine
may stabilize incorrectly in a transient state, a critical race occurs. Critical races are eliminated using
specialized state encodings, such as one-hot [206], one-shot [206], Liu [112] or Tracey [204] critical race-free
codes. These codes often require extra bits. A second type of sequential problem is an essential
hazard [206]. Essential hazards arise if a machine has not fully absorbed an input change at the time the
next-state begins to change. In effect, the machine sees the new state before the combinational logic has
stabilized from the input change. Essential hazards are avoided by adding delays to the feedback path
or, in some cases, using special logic factoring [6].
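A race can only arise on a transition in which more than one state bit changes, so a first screening step is to list such transitions for a given encoding. The sketch below does only that; it flags race candidates and does not decide whether a race is critical, which depends on the flow table. The encoding and names are hypothetical.

```python
def racing_transitions(encoding, transitions):
    """Return (source, destination, bits changed) for multi-bit state changes."""
    races = []
    for src, dst in transitions:
        changed = sum(a != b for a, b in zip(encoding[src], encoding[dst]))
        if changed > 1:
            races.append((src, dst, changed))
    return races

# Hypothetical two-bit state assignment and transition list.
ENCODING = {"A": "00", "B": "01", "C": "11", "D": "10"}
TRANSITIONS = [("A", "B"), ("B", "C"), ("A", "C")]
```

With this assignment only the transition from A to C changes two bits and therefore needs further analysis or a different encoding.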
Because of the complexity of building correct Huffman machines, an alternative approach was
proposed, called self-synchronized machines. These machines are similar to Huffman machines, but have
a local self-synchronization unit which acts like a clock on the machine’s latches or flip-flops. Unlike a
synchronous design, the clock is aperiodic, being generated as needed for the given computations. A block
diagram of a self-synchronized machine is shown in Figure 11. Both SIC [82, 201] and MIC [3, 41, 169, 208]
self-synchronized machines have been proposed. In a related approach, the local clock is replaced by











Figure 11: Block diagram of a self-synchronized machine
mode machines [224, 37]. Self-synchronized machines tend to have a simpler construction but a greater
overhead than Huffman machines.
In general, asynchronous state machines offer a number of attractive features. First, input-to-output
latency is often low: if no delays are added to inputs or outputs, the delay is combinational.
Second, since the machines are state-based, many sequential and combinational optimization algorithms
can be used, similar to those which have been effective in the synchronous domain. However, asynchronous
state machine design is subtle: it is difficult to design hazard-free implementations which (i)
allow reasonable concurrency and (ii) have high-performance.
Much of the recent work on asynchronous state machines is centered on burst-mode machines.
These specifications were introduced to allow much more concurrency than traditional SIC machines,
and therefore to be more effective in building concurrent systems. At the same time, burst-mode
implementations are guaranteed hazard-free, while maintaining high-performance. Burst-mode specifications
are based on the work of Davis on the DDM Machine [50]. In this dataflow machine, Davis used
state machines which would wait for a collection of input changes (“input burst”), and then respond
with a collection of output changes (“output burst”). The key difference between this data-driven style
and MIC-mode is that, unlike MIC machines, inputs within a burst could be uncorrelated, arriving in
any order and at any time. As a result, these machines could operate more flexibly in a concurrent
environment.
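The burst-mode behavior described above, in which a machine waits for a complete input burst in any order and then responds with an output burst, can be sketched as a small Python class. This is a simplified model under stated assumptions: each state has a single outgoing transition, timing is ignored, and the class name, signal names and specification format are all hypothetical.

```python
class BurstModeMachine:
    """Toy burst-mode machine: one (input burst, output burst, next state) per state."""

    def __init__(self, spec, start):
        self.spec = spec          # state -> (input burst, output burst, next state)
        self.state = start
        self.pending = set()      # input edges of the current burst seen so far

    def input_event(self, edge):
        """Record one input edge; return the output burst once the burst is complete."""
        in_burst, out_burst, next_state = self.spec[self.state]
        if edge not in in_burst or edge in self.pending:
            raise ValueError(f"unexpected input {edge!r} in state {self.state!r}")
        self.pending.add(edge)
        if self.pending == in_burst:      # whole burst has arrived, in any order
            self.pending = set()
            self.state = next_state
            return sorted(out_burst)
        return []                         # still waiting for the rest of the burst

# Hypothetical two-state specification: a+ b+ / x+ and then a- b- / x-.
SPEC = {
    "s0": (frozenset({"a+", "b+"}), frozenset({"x+"}), "s1"),
    "s1": (frozenset({"a-", "b-"}), frozenset({"x-"}), "s0"),
}
```

Feeding `b+` and then `a+` produces `x+` only after the second edge, regardless of the order in which the two edges arrive.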
More recently, Davis, Coates and Stevens implemented this approach in the MEAT synthesis system
at Hewlett-Packard Laboratories [48]. The synthesis method was applied to the design of controllers
for the Post Office routing chip for the Mayfly project. However, although it produced high-performance
implementations, it relied on a verifier to insure hazard-free designs. An example of a burst-mode specification
is shown in Figure 12. Each transition is labeled with an input burst followed by an output
burst. Input and output bursts are separated by a slash, /. A rising transition is indicated by a “+”

Yun and Dill [229] later proposed an alternative implementation style for burst-mode machines,
called a 3D machine. These machines are named after the 3-dimensional flow table used in their synthesis.
Unlike locally-clocked machines, these are Huffman machines, with no local clock or latches. The
synthesis method has been fully automated into a CAD tool and applied to several large designs, including
an experimental SCSI controller at AMD Corporation [231]. A more recent unclocked burst-mode
method, UCLOCK, was developed by Nowick and Coates [150].
The burst-mode approach allows greater concurrency than MIC designs, but it still has two main
limitations. First, it requires strictly alternating bursts of inputs and outputs: concurrency occurs only
within a burst. Second, as in many asynchronous design styles, there is no notion of “sampling” signal
levels which may or may not change. Yun, Dill and Nowick [232, 230] introduced extended burst-mode
specifications to eliminate these two restrictions. These generalized specifications allow a limited form
of intermingled input and output changes, and provide greater concurrency. These designs also allow
the sampling of level signals. Yun has extended his 3D synthesis algorithms and tools to handle extended
burst-mode specifications [230]. His work includes performance-oriented optimizations targeted
to multi-level implementations [233]. A novelty of Yun’s method is that it can be used to synthesize
controllers for mixed synchronous/asynchronous systems, where the global clock is one of the controller
inputs [230].
A number of CAD optimization algorithms have been developed, which have been used in burst-mode
synthesis. These include: optimal state assignment [68]; hazard-free 2-level logic minimization,
both exact (hfmin [153, 68]) and heuristic (espresso-hf [202]); hazard-free multi-level logic optimization
[103, 110]; and hazard non-increasing technology mapping [191, 102, 14], which enables more modern
standard cell methodologies to be utilized.
Davis, Marshall, Coates and Siegel [117] have built a CAD framework to incorporate all of the
burst-mode synthesis methods. The framework includes tools for simulation and layout as well. Their
tools have been applied to several significant designs, including a low-power infrared communications
chip for portable communication, developed at Hewlett-Packard Laboratories and Stanford University.
An experimental chip has been fabricated; the measured current consumption of the core receiver (without
pads) is less than 1 mA at 5 volts when the receiver is actually receiving data, and less than 1 µA
when it is waiting for data.
Beerel and Yun have recently used burst-mode synthesis tools at Intel Corporation, including
3D [229, 230] and hfmin [153, 68], to design an experimental high-performance instruction decoder.
Gopalakrishnan et al. have developed a high-level asynchronous synthesis tool, called ACK [101], which
incorporates burst-mode CAD tools to synthesize controllers.
6.3 Petri-net and Graph-based Methods
Petri nets and other graphical notations are a widely-used alternative to specify and synthesize asynchronous
circuits. In this model, an asynchronous system is viewed not as state-based, but rather as a
partially-ordered sequence of events. A Petri net [163] is a directed bipartite graph which can describe
has also been developed [202].
both concurrency and choice. The net consists of two kinds of vertices: places and transitions. Tokens
are assigned to the various places in the net. An assignment of tokens is called a marking, which captures
the state of the concurrent system. Numerous semantics have been associated with Petri nets. A
useful introductory view is that a marked place is an indication that a condition is true, and a transition
specifies an action. When all of the conditions preceding a transition are true the action may fire which
removes the tokens from the preceding places and marks the successor places. Hence, starting from an
initial marking, tokens flow through the net, transforming the system from one marking to another. As
tokens flow, they fire transitions in their path according to certain firing rules. Since the firing of a
transition in a Petri net corresponds to the execution of an event, each such simulation or token game
describes a different possible interleaved execution of the system.
A Petri net is shown in Figure 13(a). Places are drawn as circles, and transitions as bars. The
initial marking is indicated by black dots in two of the places. If a place is connected by an arc to a
transition, the former is called an input place of the transition. Likewise, if a transition is connected by
an arc to a place, the latter is called an output place of the transition. In this example, transition X has
input place 1 and output place 2; transition Y has input place 3 and output place 4.
Two transitions are enabled in the figure: X and Y. Each transition is enabled because there is a
token in each input place. An enabled transition may fire at any time, removing a token from each input
place and moving one to each output place. The result of firing transition X is shown in Figure 13(b).
The firing of a transition corresponds to the occurrence of an event. In this example, events X and Y
can occur concurrently: both transitions are enabled and may fire in any order. Figure 13(c) indicates
the result of firing transition Y after X. After both events have fired, transition Z is enabled and may
fire.
Patil [157] proposed the synthesis of Petri nets into asynchronous logic arrays. In this approach,
the structure of the Petri net is mapped directly into hardware. Many modern synthesis methods use a
Petri net as a behavioral specification only, not as a structural specification. Using reachability analysis,
the Petri net is typically transformed into a state graph, which describes the explicit sequencing behavior
of the net. An asynchronous circuit is then derived from the state graph.
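The firing rule and the reachability analysis mentioned above can be sketched in Python as follows. The net is a hypothetical one that loosely follows the description of transitions X and Y in the text; the input and output places of Z are an assumption made here, and markings are modeled as sets of marked places, which presumes at most one token per place.

```python
from collections import deque

# Hypothetical net: transition -> (input places, output places).
# X and Y follow the text; the places attached to Z are assumed.
NET = {
    "X": ({1}, {2}),
    "Y": ({3}, {4}),
    "Z": ({2, 4}, {1, 3}),
}

def enabled(marking, net=NET):
    """Transitions whose input places all hold a token."""
    return sorted(t for t, (pre, _) in net.items() if pre <= marking)

def fire(marking, t, net=NET):
    """Remove a token from each input place and mark each output place."""
    pre, post = net[t]
    if not pre <= marking:
        raise ValueError(f"transition {t} is not enabled")
    return frozenset((marking - pre) | post)

def state_graph(initial, net=NET):
    """Breadth-first reachability analysis: returns (markings, labeled edges)."""
    start = frozenset(initial)
    seen, edges, todo = {start}, [], deque([start])
    while todo:
        m = todo.popleft()
        for t in enabled(m, net):
            m2 = fire(m, t, net)
            edges.append((m, t, m2))
            if m2 not in seen:
                seen.add(m2)
                todo.append(m2)
    return seen, edges
```

Starting from the marking with tokens in places 1 and 3, the exploration visits four markings. The two interleavings of X and Y appear as two distinct paths through the state graph, which is how concurrency in the net turns into explicit sequencing.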
Figure 13: Petri-net example
Several approaches use a constrained class of Petri net called a marked graph [43]. Marked graphs
are used to model concurrency, but not choice. That is, a marked graph cannot model that one of
several possible inputs (or outputs) may change in some state. Examples include Seitz's M-Nets [185]
and Rosenblum and Yakovlev’s Signal Graphs [179]. Vanbekbergen et al. [215] introduced the notion of
a lock class to synthesize designs from marked graphs.
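The structural restriction behind marked graphs, namely that every place has exactly one producing and one consuming transition, can be checked with a short Python function. The two example nets and all names are hypothetical, and the net format (transition mapped to its input and output places) is an assumption of this sketch.

```python
def is_marked_graph(net):
    """True if every place has exactly one input and one output transition."""
    producers, consumers = {}, {}
    for t, (pre, post) in net.items():
        for p in pre:
            consumers.setdefault(p, []).append(t)
        for p in post:
            producers.setdefault(p, []).append(t)
    places = set(producers) | set(consumers)
    return all(
        len(producers.get(p, [])) == 1 and len(consumers.get(p, [])) == 1
        for p in places
    )

# Concurrency only: X and Y run in parallel and Z joins them.
FORK_JOIN = {"X": ({1}, {2}), "Y": ({3}, {4}), "Z": ({2, 4}, {1, 3})}
# Choice: place 1 feeds both X and Y, so only one of them can fire.
CHOICE = {"X": ({1}, {2}), "Y": ({1}, {3}), "Z": ({2}, {1}), "W": ({3}, {1})}
```

The fork-join net passes the check, while the net with a shared input place fails it, reflecting the fact that choice cannot be expressed in a marked graph.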
More general classes of Petri nets include Molnar et al.’s I-Nets [134] and Chu’s Signal Transition
Graphs or STGs [38, 39]. These nets allow both concurrency and a limited form of choice. Chu developed
a synthesis method which transforms an STG into a speed-independent circuit, and applied the method
to a number of examples, such as an A-to-D controller and a resource locking module. This work was
extended by Meng [130], who produced an automated synthesis tool for speed-independent designs from
STGs. Meng also explored design tradeoffs to allow greater concurrency in the resulting circuits.
Recent work on Petri-net and graph-based asynchronous synthesis is proceeding in three major
directions: (i) extending specifications; (ii) optimizing synthesis algorithms; and (iii) improving hazard
elimination.
Several extensions have been proposed to describe more general behavior than is possible with
the original STGs. These include the use of “epsilon” and “dummy” transitions [38], “don’t-care” and
“toggle” transitions [136], OR-causality [223] and semaphore transitions [46]. Sutherland and Sproull
have introduced a notation for composite Petri nets called “snippets”. Others allow timing constraints
for specification and synthesis, using a related Event-Rule formalism [141].
In addition, some researchers are using state graphs for specifications, as an alternative to Petri
nets [217, 12, 100]. State graphs allow the direct specification of interleaved behavior, avoiding some
of the structural complexity of Petri nets. The target designs are usually speed-independent gate-level
implementations. Originally, this work focused on determinate specifications, having no input or output
choice, based on Muller’s semi-modular lattice formulations (see [132]). More recent research allows
generalized behavior with choice.
A number of optim ized synthesis algorithm s have been developed. Lavagno et al. [107], Van­
bekbergen el al. [216], Chu el al. [40] and Puri and Gu [165] have each developed algorithm s for state  
m inim ization and sta te  assignm ent from STG  specifications. A partitioning algorithm  for STG -based  
specifications was proposed by Puri and Gu [166], Lin and Lin [111] have developed algorithm s which 
avoid expensive interm ediate representations during synthesis, instead performing synthesis d irectly on 
an STG  representation, for a lim ited class of ST G s. More recently, the theory o f  regions has been used 
as a powerful tool in developing efficient STG  algorithm s, including sta te  m inim ization and assignm ent 
(see C ortadella et al. [44, 45]). A region is a set of states in the state graph corresponding to a place in 
the associated STG . T he theory of regions allow s synthesis steps to be performed directly on the STG , 
w ithout the need to generate a com plete state graph.
Recent STG methods are also addressing the problem of gate-level hazards. Early STG synthesis
methods typically assumed a complex-gate model, where an entire combinational circuit is treated as
a monolithic block, rather than a collection of separate gates with individual delays. These methods
could not be used to synthesize large circuits, where blocks are mapped to a network of gates, since
the resulting network could have hazards. Several recent methods address this problem, using a simple-gate
model which can model hazards due to actual delays in a collection of individual gates and wires.
Moon et al. [136] and Yu and Subrahmanyam [227] proposed heuristic techniques for gate-level hazard
elimination for speed-independent design. Lavagno et al. [106] used logic synthesis algorithms, hazard
analysis and added delays to avoid hazards, assuming bounded gate delays. Lavagno has developed
an influential CAD system for STG synthesis, which has been incorporated into the Berkeley SIS tool
package [108, 188].
Several speed-independent synthesis methods have been developed, which insure hazard-freedom
at the gate-level. Much of this work has been pursued by Kishinevsky, Kondratyev, Taubin and Varshavsky
[97, 217], by Beerel and Meng [12], and by Cortadella, Lavagno, Lin, Vanbekbergen, Yakovlev
and others [100, 44]. These methods have been effectively applied to a number of designs. The sustained
research effort of Kishinevsky et al., pursued over many years in Russia and Japan, has been especially
noteworthy, resulting in a collection of algorithms and tools which are making SI design practical. A
general asynchronous CAD system, including speed-independent tools, has also been developed at IMEC
Laboratory [225]. A comprehensive solution to the problem of hazard-free decomposition of complex gates
into simpler gates, under a speed-independent model, has been developed by Burns [28].
6.4 Transformation Methods
While STG-based methods view computation as partially-ordered sequences of events, a different approach
is to view an asynchronous system as a collection of communicating processes. A system is
specified as a program in a high-level language of concurrency. Typically, the program is based on a
variant of Hoare’s CSP [83], such as Occam or trace theory [167]. The program is then transformed, by a
series of steps, into a low-level program which maps directly to a circuit. Such transformation methods
use algebraic or compiler techniques to carry out the translation. Some of these methods treat datapath
and control uniformly during synthesis.
Ebergen [61] introduced a synthesis method for delay-insensitive circuits using specifications called
commands. A command is a concise program notation to describe concurrent computation based on trace
theory. Several operations are used to construct a complex command from simpler commands, such as
concatenation, repetition and weave.
Figure 14 illustrates commands for several basic DI components. A wire is a component with one
input, a?, and one output, b!. The symbol “?” indicates an input to the wire, and “!” indicates an
output of the wire. In a delay-insensitive system, a wire may have arbitrary finite delay. As a result,
if two successive changes occur on input a?, the output behavior is unpredictable: b! may glitch. To
insure correct operation, input and output events must strictly alternate: once input a? changes value,
no further change on a? is permitted until output event b! occurs. A command for wire is given in
Figure 14(a). The notation a?; b! indicates that input event a? must be followed by output event b!;
“;” is the concatenation operator. No distinction is made between a rising or falling event on a wire;
a? simply means a change in value on the wire. An asterisk (*) indicates repetition: a? and b! may
alternate any number of times. Finally, “pref” is the prefix-closure operator, indicating that any prefix
of a permitted behavior is also permitted. The final command describes the permitted interaction of a
wire and its environment when it is properly used.







Figure 14: Commands for some simple components
and two outputs, b! and c!. Each input event, a?, results in exactly one output event. Output events
alternate or toggle: the first input event a? results in output event b! (as indicated by the black dot); the
next input event results in output event c!; and so on. The resulting command is shown in the figure.
Another important component is a C-element, shown in Figure 14(c) (also known as a Muller
C-element, DI C-element, rendezvous, or join element). The component has two inputs, a? and b?, and
one output, c!. The component waits for events on both inputs. When both inputs arrive, the component
produces a single event on output c!. Each input may change only once between output events, but the
input events a? and b? may occur in any order. Such parallel behavior is described in a command by
the weave operator, a? || b?. The final command for a C-element, allowing repeated behavior, is shown
in the figure.
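At the level of signal values, a C-element is commonly modeled as a state-holding gate whose output follows the inputs when they agree and otherwise keeps its previous value. The Python class below is a minimal sketch of that level-based view (the class and method names are illustrative); an output event in the sense of the text corresponds to a change in the returned value.

```python
class CElement:
    """Level-based model of a two-input C-element."""

    def __init__(self, initial=0):
        self.c = initial

    def update(self, a, b):
        """Set the output to the common input value, or hold it if inputs differ."""
        if a == b:
            self.c = a
        return self.c
```

Starting from output 0, raising only one input leaves the output at 0. The output rises once both inputs are 1, and falls again only when both have returned to 0.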
A final component, called a merge, is shown in Figure 14(d). The component is basically an
exclusive-or gate, but its operation is restricted so that no glitching occurs. The component has two
inputs, a? and b?, and one output, c!. The component waits for exactly one input event: either a? or
b?. Once an input event occurs, the component responds with output event, c!. The component can
be thought of as “joining” two input streams to a single output stream, where only one input stream
is active at a time. Such an exclusive choice between inputs is described in a command by the union
operator, a? | b?. The final command for a merge element, allowing for repeated behavior, is shown in the
figure.
A command can be used to specify a complex circuit or system. The command is then decomposed
in a series of steps into an equivalent network of components, using a “calculus of decomposition”. As
an example, a modulo-3 counter can be specified by the following command [61]:
MOD3 = pref*[a?; q!; a?; q!; a?; p!]
This command describes a counter with one input, a?, and two outputs, p! and q!. The counter receives
events on input a?. Each input event must be acknowledged by one output event before the next input
event can occur. The first and second input events are acknowledged on q!, while the third input event
is acknowledged on p!. This behavior repeats, hence the command describes a modulo-3 counter. Using
techniques for delay-insensitive decomposition, this command can be decomposed into a network of 2
toggles and 1 merge which implements equivalent behavior, as shown in Figure 15. Ebergen has applied
his decomposition method to a number of designs, including modulo-n counters, stacks, committee
schedulers [17] and token ring arbiters.
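For a command built only from concatenation, repetition and prefix closure, such as the modulo-3 counter, checking whether a trace is permitted reduces to comparing it against a repeating event sequence. The Python sketch below does this; the list `MOD3` and the function `permitted` are names introduced here, and the approach does not extend to commands that use weave or union.

```python
# Body of the modulo-3 counter command: two acknowledgements on q!, one on p!.
MOD3 = ["a?", "q!", "a?", "q!", "a?", "p!"]

def permitted(trace, body=MOD3):
    """True if trace is a prefix of some repetition of body (prefix closure)."""
    return all(event == body[i % len(body)] for i, event in enumerate(trace))
```

The empty trace and any trace that stops partway through a cycle are permitted, whereas two consecutive a? events, or an acknowledgement on p! after the first input, are not.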
A related algebraic approach was proposed by Udding and Josephs [205, 91]. Their method
is based on a delay-insensitive algebra which formally characterizes a delay-insensitive system. Using
axioms and lemmas, a specification is transformed into a provably correct delay-insensitive circuit. An
alternative speed-independent algebra has also been proposed [89]. Proof methods for recursively-defined
DI specifications have been formally justified [116]. The DI synthesis method has been used to design a
stack, a routing chip, an up-down counter, and a polynomial divider [90]. Lucassen and Udding [113] have
used DI algebra to design, and prove correct, a stage in the Counterflow Pipeline Processor developed at
Sun Laboratories. In related work, Patra and Fussell [158] have proposed a “basis set” of DI components.
They have shown that any DI circuit can be constructed using only components from the set, and that
the set is minimal.
While the above methods use algebraic calculi to derive asynchronous circuits, other transformation
methods rely on compiler-oriented techniques. An elegant and influential method for QDI synthesis
has been developed by Martin and his students at Caltech [120, 122]. Martin specifies an asynchronous
system as a set of concurrent processes which communicate on channels, using a CSP-like language.
The language uses communication constructs from Hoare’s CSP, sequential constructs from Dijkstra’s
guarded command language, and new constructs such as the probe (see [122]). The specification is then
translated into a collection of gates and components which communicate on wires.
Figure 15: Ebergen’s modulo-3 counter
Martin’s translation process is accomplished in several steps: (i) in process decomposition, a
process is refined into an equivalent collection of interacting simpler processes; (ii) in handshaking expansion,
each “communication channel” between processes is replaced by a pair of wires, and each atomic
“communication action” is replaced by a handshaking protocol on the wires; (iii) in production-rule expansion,
each handshaking expansion is replaced by a set of “production rules (PRs)”, where each rule
has a “guard” that insures it is activated (i.e., “fires”) under the same semantics as specified by the
earlier handshaking expansion; and, finally, (iv) in operator reduction, PRs are grouped into clusters, and
each cluster is then mapped to a basic hardware component. These steps include several optimizations
and sub-steps, such as reshuffling, in step (ii); and state assignment, guard strengthening, and guard
symmetrization, in step (iii). In most designs, a four-phase handshaking protocol is used (step (ii)),
although two-phase handshaking can be used as well.
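The idea behind step (ii), replacing each atomic communication action by a handshake on a pair of wires, can be illustrated with a small Python sketch. It is a deliberate simplification: the wire names `req` and `ack`, the event strings and both function names are assumptions, and the sketch ignores the distinction between active and passive ends of a channel as well as reshuffling.

```python
def four_phase(channel):
    """Wire events of one four-phase (return-to-zero) handshake on a channel."""
    return [
        f"{channel}.req+",   # request raised
        f"{channel}.ack+",   # acknowledge raised
        f"{channel}.req-",   # request returned to zero
        f"{channel}.ack-",   # acknowledge returned to zero
    ]

def expand(actions):
    """Replace each communication action, in order, by its handshake events."""
    events = []
    for channel in actions:
        events.extend(four_phase(channel))
    return events
```

A process that communicates on a hypothetical channel `L` and then on `R` expands into eight wire events, four per communication action.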
Martin’s synthesis method has been automated by Burns [29, 26] and applied to many substantial
examples, including a distributed mutual exclusion element [119, 26], a stack [122], and a multiply-accumulate
unit [145]. The compiler includes algorithms for optimal transistor sizing [27]. (Designs
for other datapath components, and for a microprocessor, using this method are described in the next
two sections.) Martin’s work has been extended by Akella and Gopalakrishnan in a system called
SHILPA [4]. This method allows global shared variables, and uses flow analysis techniques to optimize
resource allocation.
A different compiler-based approach was developed by van Berkel, Rem and others [210, 211,
161, 212] at Philips Research Laboratories and Eindhoven University of Technology, using the Tangram
language. Tangram, based on CSP, is a specification language for concurrent systems. A system is
specified by a Tangram program, which is then compiled by syntax-directed translation into an
intermediate representation called a handshake circuit. A handshake circuit consists of a network of handshake
processes, or components, which communicate asynchronously on channels using handshaking protocols.
The circuit is then improved using peephole optimization and, finally, components are mapped to VLSI
implementations.
As an example, the following is a Tangram program for a 1-place buffer, BUF1:
(a?W & b!W) · |[ x : var W | #[ a?x ; b!x ] ]|
The buffer accepts input data on a and produces output data on b. The expression in parentheses is a
declaration of the external ports of the module. The buffer has an input port, a, and an output port,
b, handling data of some type, W. The remainder of the program, structured as a block, is called a
command. A local variable x is defined for internal storage of data. The statement #[ a?x ; b!x ] indicates
that data is received on port a and stored in internal variable x; this data is then sent out on port b.
The operator ; indicates sequencing, and # indicates infinite repetition.
This Tangram program is translated into the handshake circuit of Figure 16. Each circle
represents a handshake process or component. Each arc represents a channel, which connects an active port
(indicated by a black dot) to a passive port (indicated by a white dot). Communication on a channel is
by handshaking: an active port initiates a request and a passive port returns an acknowledgment.
In this example, port ▷ is the top-level port for the circuit, called go. The environment activates
the buffer by an initial request on this passive port. This port is connected to a repeater process, which
implements the repetition operator, #. This process repeatedly initiates handshaking on channel c.
Channel c is connected to a sequencer process, which implements the ; operator. The sequencer first
performs handshaking on channel d. When handshaking is complete, it then performs handshaking on
channel e.
Channels d and e in turn are each connected to transferrers, labelled T. When the sequencer
process initiates a request on channel d, the corresponding transferrer actively fetches data on input
channel a and then transfers it to storage element x. Once the transfer is complete, the sequencer
initiates a request on channel e, causing the second transferrer to fetch the data from x and transfer it
to output channel b.
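The structure of this handshake circuit can be mimicked in a few lines of Python. This is only a
behavioral sketch under stated assumptions: a function call stands in for a request on a passive
port and its return stands in for the acknowledgment, the repeater is bounded by a hypothetical
rounds parameter so that the demonstration terminates, and all function names are invented for
illustration.

```python
from collections import deque


class Variable:
    """Storage element x."""
    def __init__(self):
        self.value = None


def transferrer(fetch, store):
    """When activated, actively fetch a value and then store it."""
    def activate():
        store(fetch())
    return activate


def sequencer(first, second):
    """Implements ';': handshake on the first channel, then on the second."""
    def activate():
        first()
        second()
    return activate


def repeater(body, rounds):
    """Implements '#'. Tangram repeats forever; bounded here (assumption)."""
    def activate():
        for _ in range(rounds):
            body()
    return activate


def buf1(inputs, rounds):
    a = deque(inputs)        # environment supplying data on input channel a
    b = []                   # environment collecting data from output channel b
    x = Variable()

    def write_x(value):
        x.value = value

    d = transferrer(a.popleft, write_x)         # a -> x
    e = transferrer(lambda: x.value, b.append)  # x -> b
    c = sequencer(d, e)
    go = repeater(c, rounds)
    go()                     # initial request on the go port
    return b
```

Calling buf1([3, 1, 4], rounds=3) passes each value through x in order, one at a time, as a
1-place buffer should.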
A more complex example is a 2-place buffer, BUF2, which can be described in terms of two 1-place
buffers:
(a?W & c!W) · |[ b : chan W | ( BUF1(a, b) || BUF1(b, c) ) ]|
The program defines the buffer by the parallel composition, ||, of the two 1-place buffers, which are
connected by an internal channel, b. The corresponding handshake circuit is shown in Figure 17. A
parallel component implements the composition operator, ||. An initial request on its passive go port
results in parallel communication on channels l and r. These channels are both connected to a 1-place
buffer B (indicated by double circles). The two buffers communicate through a synchronizer process
(indicated by a black dot). If active requests arrive on both of its channels, lb and rb, the synchronizer
first performs handshaking on channel b, then returns parallel acknowledgments on channels lb and rb.
The attached run process is used to hide channel b; it simply acknowledges every request it receives.
The Tangram compiler has been successfully used at Philips for several experimental DSP designs
for portable electronics, including a systolic RSA converter, counters, decoders, image generators, and
an error corrector for a digital compact cassette player [213]. A major goal of this work is rapid turnaround
time and low-power implementation.
Even though some peephole optimizations have been developed, Tangram is basically a
syntax-directed translation method. Recently, two resynthesis methods have been proposed, by Peña and Cortadella [162]
and Kolks et al. [99], which use aggressive peephole techniques to further optimize the resulting
Tangram circuits. In each approach, handshaking components are clustered, formally specified as a single
block, then resynthesized using STG techniques. A different approach has been proposed, which uses
burst-mode techniques for the resynthesis step [78].
Figure 16: Handshake circuit for BUF1 example
Brunvand and Sproull [21, 23] introduced an alternative compiler using occam specifications.
Unlike the approaches of Martin and van Berkel, communication between processes is through
two-phase handshaking, or transition-signaling. In their method, an occam specification is first compiled
into an unoptimized circuit, using syntax-directed translation. Peephole optimization techniques are
then applied to improve the resulting circuits. The circuits are then mapped to a library of
transition-signaling components.
6.5 Timed Methods
While all asynchronous synthesis methods make some timing assumptions, much of the discipline is
focused on minimizing these timing assumptions or at least localizing them into low-level modules.
Myers [141, 142] contends that this approach often leads to additional time and space being spent in
the circuit to deal with contingencies which never occur. In his timed state space method, rather than
using timing analysis for post-synthesis-based optimizations to remove the unnecessary circuitry, timing
information is used during synthesis to avoid generating unnecessary circuitry. His method is based on
timed event-rule (ER) structures [140], which can be automatically generated from high-level language
representations such as CSP or VHDL. ER structures and Petri Nets use a similar representational
semantics, but ER structures have a more concise syntax.
A rule represents a causal dependence between events. It is notated as a 4-tuple, (e, f, l, u),
consisting of an enabling event e, an enabled event f, a lower timing bound l, and an upper timing
bound u. The rule is considered to be satisfied for the time between l and u after an enabling event has
occurred. The use of these timing bounds restricts the possible state space to events which can actually
occur. A disjunction operator is used to permit both choice and exclusive-OR causality behaviors,
although there is currently no provision for true OR behaviors. Myers’ gate-level synthesis method
permits larger circuits to be synthesized more efficiently as a result of the state space reduction.
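The 4-tuple can be written down directly as a small data structure. The sketch below is a literal
reading of the description above (a rule is satisfied between l and u time units after its enabling
event), not a reproduction of Myers’ algorithms; the class name TimedRule and the event names used
in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TimedRule:
    enabling: str   # e: the enabling event
    enabled: str    # f: the enabled event
    lower: float    # l: lower timing bound
    upper: float    # u: upper timing bound

    def window(self, t_enabling: float) -> Tuple[float, float]:
        """Absolute time interval during which the rule is satisfied."""
        return (t_enabling + self.lower, t_enabling + self.upper)

    def satisfied(self, t_enabling: float, now: float) -> bool:
        """True if 'now' lies between l and u after the enabling event."""
        earliest, latest = self.window(t_enabling)
        return earliest <= now <= latest
```

For example, TimedRule("req+", "ack+", 2, 5) with req+ occurring at time 10 is satisfied from time
12 to time 15. States in which ack+ would occur outside that window need not be explored, which is
the source of the state space reduction.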
A new set of timing analysis algorithms, based on a theory of geometric regions [173], allows a
large number of discrete timed states to be condensed into a single region. The worst-case complexity
of the algorithms is actually worse than discrete methods, but it has been shown that the region-based
approach works well in practice [174, 143, 16]. The method is automated in a tool called ATACS, which
has been used to design a number of practical circuits. The tool has also been used at Intel Corporation
in the design of an experimental high-performance instruction decoder.
Figure 17: Handshake circuit for BUF2 example
Results show that, in the best case, circuits can be up to 40% smaller and 50% faster than those
synthesized using methods which do not eliminate temporally unreachable states. Perhaps the most
interesting aspect of this work is that it treats synchronous circuits as a subset of timed circuits and
therefore provides a method for treating hybrid circuit structures consisting of both synchronous and
asynchronous modules.
6.6 Design of Asynchronous Datapaths
As in synchronous design, different techniques and structures are used when designing datapaths and
controllers.
Modern datapaths are often built using pipelines. However, the operation of synchronous and
asynchronous pipelines is fundamentally different. In a synchronous pipeline, data advances at a fixed
clock rate, in lock-step, through the pipe. Since function blocks in the stages may have different delays,
the clock cycle must be set to accommodate the slowest stage. Furthermore, a stage’s latency varies with the actual
temperature, voltage, process and data inputs; therefore, additional delay margin is typically added to
the clock cycle. Finally, the clock speed must be further reduced to avoid clock skew problems. As
a result, under typical operating conditions, a synchronous pipeline may operate far slower than its
potential performance.
In contrast, an asynchronous pipeline is not globally clocked. Therefore, in principle, each stage
may pass data to its neighbor whenever the stage is done and the neighbor is free. Such elastic pipelines
promise improved performance: different stages may operate at different speeds, and stages may complete
early depending on the actual data. Of course, new overhead may be introduced, since each stage must
now tell its neighbor when it is ready.
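The contrast between the two pipeline styles can be illustrated with a rough timing model. The
sketch below is an assumption-laden simplification, not a model taken from the literature cited
here: each stage holds one data item, a stage starts an item once its predecessor is done and its
successor has taken the previous item, the output environment is always ready, and handshake cost
is lumped into a constant overhead parameter. All function and parameter names are hypothetical.

```python
def sync_total_time(worst_case_delays, n_items, margin=0.0, skew=0.0):
    """Clock period is set by the slowest stage, plus margin and skew allowance."""
    period = max(worst_case_delays) * (1 + margin) + skew
    n_stages = len(worst_case_delays)
    return period * (n_items + n_stages - 1)   # fill the pipe, then one item per cycle


def elastic_total_time(delays, overhead=0.0):
    """delays[k][i] is the actual delay of item k in stage i."""
    n_items, n_stages = len(delays), len(delays[0])
    start = [[0.0] * n_stages for _ in range(n_items)]
    done = [[0.0] * n_stages for _ in range(n_items)]
    for k in range(n_items):
        for i in range(n_stages):
            # Item k is available once the previous stage has finished it.
            data_ready = done[k][i - 1] if i > 0 else 0.0
            # Stage i is free once the next stage has started on item k-1
            # (or, for the last stage, once item k-1 has been completed).
            if k == 0:
                stage_free = 0.0
            elif i + 1 < n_stages:
                stage_free = start[k - 1][i + 1]
            else:
                stage_free = done[k - 1][i]
            start[k][i] = max(data_ready, stage_free)
            done[k][i] = start[k][i] + delays[k][i] + overhead
    return done[-1][-1]
```

With three stages whose middle stage needs 3 time units for the first item only, two items finish
after 6 units in the elastic model, whereas a clock sized for the 3-unit worst case needs 12. A
nonzero overhead shrinks that advantage, which is the cost noted at the end of the paragraph above.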
Sutherland introduced an elegant and influential approach to building asynchronous pipelines,
which he called micropipelines [200]. A micropipeline has alternating computation stages separated by
storage elements and control circuitry. This approach uses transition-signaling for control along with
bundled data. Sutherland describes several designs for the storage elements, called “event-controlled
registers”, which respond symmetrically to rising and falling transitions on inputs. Such pipelines have
been used by several researchers in the design of asynchronous microprocessors. Sutherland, Sproull,
Molnar, and others at Sun Labs have recently designed a “counterflow microprocessor” based on
micropipelines [195]. Micropipelines also form the basis for the Manchester ARM microprocessors,
developed by Furber and the AMULET group [70, 71, 159].
Figure 18 illustrates the operation of a micropipeline with 4 stages. For simplicity, only the
control is indicated. In practice, a bundled datapath is also used, along with event-controlled registers
to store the data as it propagates down the pipe. A control stage of the pipeline consists of a C-element
(described above). A C-element with two inputs and one output behaves as follows. If both inputs are
1, the output is 1; if both inputs are 0, the output is 0. Otherwise, if inputs have different values, the
output holds its current value. The C-elements in the micropipeline behave similarly, except that each
has one inverted input.
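The C-element behavior just described is small enough to state as code. The following is a minimal
behavioral sketch in Python; the class name CElement and the invert_b flag (modeling the inverted
input used in the micropipeline) are naming assumptions made for illustration.

```python
class CElement:
    """Two-input Muller C-element with an optionally inverted second input."""

    def __init__(self, initial=0, invert_b=False):
        self.out = initial
        self.invert_b = invert_b

    def update(self, a, b):
        """Apply input values a and b (0 or 1) and return the output."""
        if self.invert_b:
            b = 1 - b
        if a == b:
            self.out = a      # inputs agree: output follows them
        return self.out       # inputs differ: output holds its value
```

With the default settings, update(1, 0) leaves the output at 0, update(1, 1) raises it to 1, and
update(0, 1) keeps it at 1 until both inputs return to 0.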
Initially, all wires in the micropipeline are at 0, as shown in Figure 18(a). When new data

Since the initial request was acknowledged at the leftmost interface, A(in), new data may now
arrive and a second request, R(in), can occur. Since R(in) is currently 1, a request is asserted by
changing R(in) to 0. This request propagates through the micropipeline as before. The left interface
is acknowledged (A(in) goes to 0), and the request is forwarded to the right interface (R(1) goes to 0).
This process repeats at the 2nd and 3rd stages. However, once the second stage is acknowledged (A(2)
goes to 0) and the request is made to the 4th stage (R(3) goes to 0), the propagation halts. Although
R(3) made a transition (request from stage 3), stage 4 still contains the earlier data that was entered
into the pipe. This data was not removed, since no A(out) transition has occurred. Since A(out) is still
0, no new transition can occur on R(out). Instead, the new data is held in the 3rd stage. The resulting
micropipeline configuration is shown in Figure 18(c). The micropipeline now contains data in the 3rd
and 4th stages.
Finally, the original data in the 4th stage can be removed from the right interface, which then
issues an acknowledge (A(out) goes to 1). At this point, the 4th stage issues a new R(out) transition
(since R(3) is 0 and A(out) is 1), as data in stage 3 is moved to stage 4. The 3rd stage is acknowledged as
well (A(3) goes to 0). The resulting configuration is shown in Figure 18(d). In practice, more complicated
scenarios are possible, since data may be added and removed from the pipeline concurrently.
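The scenario above can be replayed with a small control-only simulation. This Python sketch rests
on several assumptions: the output of control stage i serves both as R(i) to the right and as A(i-1)
to the left, each C-element sees the acknowledge from its right neighbor inverted, the circuit is
iterated until it settles, and the first request is taken to raise R(in) from 0 to 1. The function
names are hypothetical.

```python
def c_element(a, b, previous):
    """Output follows the inputs when they agree, otherwise holds."""
    return a if a == b else previous


def settle(r_in, a_out, outputs):
    """Re-evaluate the control C-elements until no output changes.

    outputs[i] is the output of control stage i+1, i.e. R(i+1) and A(i),
    with the last entry being R(out).
    """
    outputs = list(outputs)
    changed = True
    while changed:
        changed = False
        last = len(outputs) - 1
        for i in range(len(outputs)):
            request = r_in if i == 0 else outputs[i - 1]
            acknowledge = a_out if i == last else outputs[i + 1]
            new = c_element(request, 1 - acknowledge, outputs[i])
            if new != outputs[i]:
                outputs[i] = new
                changed = True
    return outputs


def replay():
    state = [0, 0, 0, 0]                        # all wires initially 0
    first = settle(r_in=1, a_out=0, outputs=state)
    second = settle(r_in=0, a_out=0, outputs=first)
    third = settle(r_in=0, a_out=1, outputs=second)
    return first, second, third
```

After the second request the state is [0, 0, 0, 1]: R(3) has fallen but R(out) is held, matching the
halted propagation described for Figure 18(c). Raising A(out) then releases the last stage.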
Although micropipelines use transition-signaling, other signaling conventions have been used in
asynchronous pipelines as well. Williams [220], Martin [123] and van Berkel [211] have used 4-phase
handshaking (the “return-to-zero” protocol) between stages. An alternative two-phase signaling scheme,
called LEDR (level-encoded dual-rail), was introduced to combine advantages of both transition-signaling
and four-phase [56].
Asynchronous pipelines have been designed for numerous applications: multiplication [200, 145], 
division [221], and DSP [211, 131]. The Williams and Horowitz self-timed divider [221] is especially 
impressive: the fabricated chip was twice as fast as comparable synchronous designs.
Research on asynchronous pipelines and datapaths is now proceeding in many directions.
Several new asynchronous pipelining schemes have been proposed. Some emphasize low-power [65, 72]
while others emphasize high-performance [53, 228, 72, 135]. In addition, several generalizations to
asynchronous pipeline structures have been proposed: rings [220], multi-rings [194] and 2-dimensional
micropipelines [76]. Techniques to reduce the communication overhead between stages have been
developed [220, 55, 79]. Liebchen and Gopalakrishnan have proposed a reordering pipeline [109] which allows
the freezing and dynamic reordering of data within the pipe using “LockC” elements. Finally, low-power
micropipeline structures have been introduced using adaptive scaling of supply voltage [146].
While pipelining is fundamental to high-performance systems, sequencing is the basic control
operation in low-performance non-pipelined systems. A number of sequencer designs have been
proposed [122, 211, 209, 161, 8, 64, 164].
There has been much recent research on asynchronous datapaths, beyond the above work on
pipelines and sequencers. Much of this work is focused on low-power design, including designs for a digital
compact cassette (DCC) error corrector [213], an infrared communications chip [117], an FFT [137], an
FIR filter bank [147], and cache [74], microprocessor [123, 69] and memory designs [203]. Others have
developed techniques which use novel low-power devices, such as RSFQ [114].
Nowick introduced a method for high-performance design, called speculative completion, which
uses a single-rail bundled datapath but also allows early completion [149]. The method uses a
multi-slotted matched delay, where several of the delays are faster than the worst-case. These speculative
delays allow early completion, and are disabled for worst-case data. The method has been applied to a
high-performance Brent-Kung adder [155]; SPICE results indicate a 19-29% performance improvement
over a comparable synchronous design.
Other datapath research has focused on architectures and protocols for chip-to-chip
communication, including recent methods by Greenstreet [80] and Roine [172]. An architecture for communication
between synchronous and asynchronous chips has been developed by Chappel et al. [35].
6.7 Asynchronous Processor Design
Perhaps the greatest challenge in large-scale asynchronous design to date has been to combine the
techniques for asynchronous controller and datapath synthesis, and build asynchronous processors.
Asynchrony has in fact been present in processors from the early days when there was little distinction
between synchronous and asynchronous circuits. Some level of persistent asynchrony has always been
present since memory systems have been typically asynchronous. With the advent of virtual memory
and cache memory systems, there is significant uncertainty about when a memory access will be resolved.
In today’s systems, a cache hit takes a few cycles, a cache miss requires approximately a hundred
cycles, and a page miss takes 500,000 or more cycles to resolve. The result is that the processor-memory
interface effectively uses a model delay and a 4-cycle protocol where the request is associated with the
presentation of an address and a read or write command. The acknowledge is a ready signal indicating
the requested operation has been completed. In a more direct fashion, machines such as the MU-5 and
DDM1 (previously cited) have been constructed along more purely asynchronous lines.
However, it is only recently that such techniques have been applied systematically to the design of
asynchronous microprocessors. The first asynchronous microprocessor-like device was developed in the
early 1980’s at Caltech by Chuck Seitz through class projects for his courses in VLSI and asynchronous circuit
design. This annual effort went through several iterations and eventually became the Caltech Mosaic
processor [184] that was subsequently used in the Cosmic Cube multiprocessor [183]. This binary N-cube
connected multicomputer spawned a significant amount of activity in the parallel processing industry and
was the forerunner of similarly architected machines vended by NCube Corporation and Intel [196]. The
Caltech tradition has subsequently been kept alive by Alain Martin and his students, who are pursuing
a more formally based design approach while continuing their efforts to use these techniques in the
design of asynchronous microprocessors. The first QDI asynchronous microprocessor was developed by
Martin et al. [123] in the late 1980’s. The 16-bit design is almost fully quasi-delay-insensitive except for
the memory interface. A 2µ CMOS version consumed 145mW at 5V and 6.7mW at 2V. A 1.6µ CMOS
version consumed 200mW at 5V and 7.6mW at 2V. The architecture was later re-implemented in GaAs.
These early efforts were not competitive with synchronous microprocessor designs of the day in an
architectural sense; however, they did exhibit some of the benefits of asynchronous design methodology
and provided realistic complexity for the design effort.
Subsequently, Martin and his students became engaged in a much more architecturally
realistic design effort to implement the asynchronous equivalent of a MIPS R3000 processor. While the
R3000 is somewhat antiquated in terms of 1997 complexity when compared to commercial processors
such as the MIPS R10000, DEC’s Alpha 21164, and the HP 8000 or 8200, the R3000 does provide
advanced architectural features such as on-chip caches, precise exceptions, register bypassing, branch
prediction, and branch delay slot issues. The implementation was not simply a refabrication of the
MIPS R3000 using asynchronous techniques but rather an attempt to implement the R3000 instruction
set while exploiting architecture and logic design opportunities that are available to the asynchronous
circuit designer. The result was a deeply pipelined design of a simplified R3000 which demonstrated an
interesting trade-off between performance and low-power operation. By varying the supply voltage, the
device could achieve better than expected performance at higher voltages, or low-power operation at lower
voltages. This flexibility is not inherently available from synchronous design methods but is essentially
a free side effect of the quasi-delay-insensitive Caltech design style. The inherent robustness of
asynchronous design was also exploited to demonstrate increased pipeline elasticity over what could be
expected from synchronous designs, and to extend the pipeline model into the cache design.
Of course, today’s microprocessors also utilize pipelined caches for CPUs which tolerate multiple
outstanding misses without stalling. The processor resulting from the Caltech R3000 “MiniMIPS” experiment [118] is
expected to run at 280 MIPS and dissipate 7 watts (at 3.3V and 75 degrees Celsius) or run at 150 MIPS
dissipating 1 watt (at 2.0V and 75 degrees Celsius).
Recently, Furber and the AMULET group at Manchester University have fabricated two
asynchronous implementations of the ARM microprocessor [70, 160, 71, 159]. The designs are based on
micropipelined datapaths, and are part of a large-scale investigation of low-power techniques. The
project addresses issues such as caching, exceptions and architectural optimization which are critical to
the development of production-quality asynchronous machines. The Amulet1 used 2-cycle protocols and
was disappointing in terms of both power and performance in comparison to its synchronous ARM
equivalent. This is not surprising since the Amulet1 was the first significant asynchronous design attempted
by the Manchester group and the design experience gave them a significant data point for analysis. The
result was that, even though 2-cycle signaling is conceptually elegant, it resulted in circuits which were
too large and slow, and which consumed too much power for their intended ARM application.
The Amulet1 results forced the design team to explore other protocol options for the subsequent
Amulet2 effort. A version of the Amulet2 called the Amulet2e [69] has been fabricated. The Amulet2e is
an Amulet2 processor core (93,000 transistors) coupled with 4K bytes of memory, in a 128-pin package
containing 454,000 transistors. It was fabricated in a 0.5 micron CMOS technology and operates at
3.3 volts. The Amulet2e is intended for embedded controller applications. A number of architectural
enhancements were made, including a jump target buffer and a flexible external interface called the
funnel. The funnel permits 8, 16, and 32 bit external devices to be attached to the controller, as
well as a DRAM-based main memory system. Another key difference is that the Amulet2e uses
4-cycle signaling protocols, which results in improved performance, power consumption, and circuit area.
While the power consumption data has not yet been released, the performance of the Amulet2e is 38
Dhrystone MIPS, which is faster than the synchronous ARM710 but about half that of the recently
announced ARM810 which uses the same process technology.
Sutherland, Sproull, Molnar, and others at Sun Labs have been developing an asynchronous
Counterflow Pipeline Processor [195]. The architecture is based on a novel looped micropipeline, which
synchronizes instructions and data flowing in opposite directions. The processor makes careful use of
arbiters to regulate the synchronization.
Brunvand developed the NSR RISC microprocessor [22] at the University of Utah, using
transition-signaling for control, bundled data, and a micropipelined datapath. The NSR was implemented using
commercially available FPGA technology. The NSR effort led to a more aggressive
architecture called FRED [171, 170] which was implemented to the level of structural VHDL and subsequently
analyzed. Fred is perhaps the closest attempt to create an equivalent to the modern microprocessor in
that it provided speculative execution and precise interrupts while utilizing a novel architecture that was
inspired, or perhaps constrained, by asynchronous design techniques. Fred was an architectural study
and therefore was not actually fabricated. Other micropipeline-based RISC designs have been proposed
by David et al. [47] and Ginosar and Michell [75].
A delay-insensitive microprocessor, TITAC, has been developed by Nanya et al. at Tokyo Institute
of Technology [144]. The designers introduce several optimizations to improve performance. A different
approach was proposed by Unger at Columbia University [209]. His “computers without clocks” use
traditional asynchronous state machines for control logic, and a building block approach to design rather
than compilation schemes. This approach requires a spectrum of timing assumptions to ensure correct
designs.
Finally, Dean’s STRiP (self-timed RISC) processor at Stanford University combines synchronous
and asynchronous features [54]. The design uses synchronous functional units in a globally-clocked
pipeline. However, the clock rate may change dynamically based on the current contents of the pipeline,
using a technique called dynamic clocking. The clock also is suspended during off-chip operations, such
as input/output or access to a second-level cache. Using careful simulation, the design was shown to be
almost twice as fast as a comparable synchronous design due solely to its asynchronous features.
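The idea behind dynamic clocking can be illustrated with a toy calculation. In the Python sketch
below, each cycle’s clock period is taken to be the longest delay among the operations currently in
the pipeline, rather than a fixed worst case. The delay table, the operation names, and the function
names are all hypothetical values chosen for illustration and do not describe the STRiP design.

```python
# Hypothetical per-operation delays in nanoseconds (illustrative values only).
OP_DELAY_NS = {"nop": 2.0, "add": 4.0, "load": 6.0, "mul": 10.0}


def dynamic_period(pipeline_ops):
    """Clock period needed for one cycle, given the operations in flight."""
    return max(OP_DELAY_NS[op] for op in pipeline_ops)


def total_time(schedule, dynamic=True):
    """Total time for a schedule, a list of per-cycle pipeline contents."""
    worst_case = max(OP_DELAY_NS.values())
    return sum(dynamic_period(ops) if dynamic else worst_case
               for ops in schedule)
```

For a three-cycle schedule whose slowest operations are an add, a load and a multiply, the dynamic
total is 4 + 6 + 10 = 20 ns, against 30 ns for a clock fixed at the 10 ns worst case.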
6.8 Formal Verification and Timing Analysis
The above survey indicates an impressive surge of activity in the design of asynchronous controllers,
datapaths and processors. However, design techniques alone cannot make asynchronous circuits
commercially viable. In synchronous design, many ancillary technology components are needed to ensure
the correctness of designs, including verification, timing analysis and testing. These techniques are
especially critical for asynchronous design because of its inherent subtlety. The following sections briefly
sketch some of the recent work on validation of asynchronous designs.
Due to the large variety of asynchronous design approaches, it is difficult to find a unified approach
to the analysis and verification of all asynchronous circuits. For speed-independent and delay-insensitive
systems, though, Hoare’s CSP [83] and Milner’s CCS [133] have been especially effective as formal
underpinnings.
Rem, Snepscheut and Udding’s trace theory [167], based on CSP, has been used both for
specification and formal verification. In trace theory, the behavior of a concurrent system is described by the
set of possible traces, or sequences of events, which may be observed. Each trace describes one possible
interleaved behavior of the system. The traces are combined into a set, which defines the observable
behavior of the system. Dill [58, 59] and Ebergen [62] have built effective verification tools for SI and DI
circuits based on trace theory. In Dill’s theory, an implementation and specification are each modeled by
trace sets. These sets are compared using a formal relation called conformance, which defines precisely
when an implementation meets its specification. Dill has uncovered bugs in published circuits using the
verifier. More efficient algorithms for approximate verification (allowing occasional false negatives) have
been developed by Beerel et al. [11].
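The flavor of trace-based checking can be conveyed with a deliberately simplified stand-in. The
Python sketch below models behavior as a prefix-closed set of event sequences and reports
implementation traces that the specification does not allow. Plain set containment is used here only
for illustration; Dill’s conformance relation is more refined and is not reproduced. The function
names and the event names in the usage note are hypothetical.

```python
def prefixes(trace):
    """All prefixes of a trace, including the empty trace and the trace itself."""
    return {trace[:i] for i in range(len(trace) + 1)}


def prefix_closure(traces):
    """Smallest prefix-closed set containing the given traces."""
    closed = set()
    for trace in traces:
        closed |= prefixes(tuple(trace))
    return closed


def disallowed(impl_traces, spec_traces):
    """Implementation traces (or prefixes) not permitted by the specification."""
    return prefix_closure(impl_traces) - prefix_closure(spec_traces)
```

Against a specification that alternates req and ack, an implementation trace (req, ack) yields no
disallowed traces, while (req, req) is reported as a violation.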
Dill’s verifier effectively checks for safety violations (where a design has incorrect behavior), but
does not check for liveness violations (where a design has deadlock or livelock). Dill also introduced
a theory of complete trace structures [58], based on Büchi automata, which can model general liveness
properties. Although these general verification algorithms may be too expensive to apply in practice, a
verifier has been developed for a constrained class of specifications [197]. Other methods use a restricted
notion of liveness that can be easily checked [62, 77, 226]. A method which uses Signal Graphs for
verification of properties of speed-independent circuits has been proposed by Kishinevsky et al. [96].
Another approach, by Kol, Ginosar and Samuel [98], uses state charts to verify both safety and liveness
properties.
An alternative verification method based on CCS has been proposed by Birtwistle, Stevens, et
al. [198, 197]. CCS has been successfully used for the specification of several asynchronous designs,
including a token ring arbiter and SCSI controller. Specifications can then be checked for deadlock,
safety and liveness properties using a modal logic. A substantial specification has been developed for
the AMULET processor [18], with detailed models for the different instruction classes.
The above verification techniques handle SI and DI circuits and protocols, and therefore are not
concerned with timing. However, timing is critical for the analysis and verification of many asynchronous
systems. A general model for timed systems was introduced by Alur and Dill [5]. Timing analysis and
verification methods for asynchronous state machines with bounded delays were developed by Devadas
et al. [57] and Chakraborty et al. [31, 30]. Methods using Timed Petri Nets have been developed by
Rokicki [173], Semenov et al. [187] and Verlind et al. [219]. Williams [220] and Burns [27] have introduced
methods to analyze the performance of systems. A notion of timing-reliability was proposed by Kuwako
and Nanya [104]. Timing and hazard analysis tools have been developed by Ashkinazy et al. [7]. Other
recent work has focused on timing analysis to determine minimum and maximum separation of events
in a concurrent circuit or system [128, 87, 86]. Such analysis can aid in both the optimization and
verification of asynchronous designs.
6.9 Testing and Synthesis-for-Testability
While formal verification is used to validate designs, testing is needed to validate the correctness of
fabricated implementations. Testing and synthesis-for-testability play a major role in the industrial
production of synchronous chips. However, the testing of asynchronous circuits is complicated by their
special design constraints. For example, asynchronous circuits may use redundant logic to eliminate
hazards, but redundant logic makes testing more difficult.
Initial results on the testing of speed-independent circuits include work by Beerel and Meng [10]
and Martin and Hazewindus [124]. These papers indicate that certain classes of speed-independent
circuits are “self-testing” with respect to stuck-at faults, where certain faults will cause the circuit to
halt. Beerel and Meng generalized their approach to handle stuck-at faults in timed control circuits [13].
A general synthesis-for-testability method was proposed by Keutzer, Lavagno and
Sangiovanni-Vincentelli [93] which considers both stuck-at and path-delay faults in combinational circuits. The
method uses algebraic transformations to produce hazard-free and fully-testable multi-level logic. This
work was extended by Nowick, Jha and Cheng [154], to include a richer set of transformations and to
handle a more general class of hazards.
Subsequent research has focused on testing of handshaking circuits and micropipelines. Roncken
et al. [178, 175] at Philips Research Laboratories have developed techniques and tools for partial scan
of handshaking circuits. The method is now used in the Tangram synthesis compiler. A novelty of
the approach is that testability is ensured at the highest level, i.e., by modifying the Tangram program
specification. The method was used in the design of a DCC error corrector, where it led to 99.9%
stuck-at output fault coverage at an expense of less than 3% additional area [175, 213]. More realistic fault
models, such as for IDDQ testing, have recently been addressed as well [177].
The most prevalent “design-for-test” technique in the synchronous domain has been the use of
a serial scan path, which effectively creates a shift register out of the storage components
on the chip. The external interface provides both read and write capability to this shift register. The
method works well for synchronous systems, since the concept of state corresponds to the contents of
the storage elements in the circuit after a particular clock cycle. However, the inherent temporally decoupled
nature of asynchronous circuits tends to make the concept of “total system state” counter-productive.
The implication is that the design-for-test methods developed for synchronous circuits are not appropriate
for asynchronous systems. However, a surprisingly analogous technique was developed by Khoche [95,
94]. This technique applies to macromodular, micropipelined self-timed circuits. The key idea is that,
while the circuits operate asynchronously in normal mode, the scan mode operation is synchronous
and the clock propagates in the backward direction along the micropipeline. Khoche demonstrated
that the overhead of adding scanability to asynchronous circuits is commensurate with the overhead
for synchronous circuits. More recent approaches to testing micropipelines have been developed as
well [181, 176].
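The synchronous scan path described above can be sketched in a few lines. This is a minimal behavioral model under stated assumptions, not Khoche's asynchronous method: the class `ScanChain`, the helper `capture`, and the logic under test `rotate_left` are all hypothetical names introduced for illustration.

```python
class ScanChain:
    """Storage elements of a chip stitched into one serial shift register."""

    def __init__(self, length):
        self.bits = [0] * length

    def shift(self, scan_in):
        """One scan clock: push scan_in at the head, return the bit leaving the tail."""
        scan_out = self.bits[-1]
        self.bits = [scan_in] + self.bits[:-1]
        return scan_out

    def load(self, pattern):
        """Write a full state through the serial input (last element first)."""
        for bit in reversed(pattern):
            self.shift(bit)

    def unload(self):
        """Read the full state through the serial output, restoring element order."""
        tail_first = [self.shift(0) for _ in range(len(self.bits))]
        return list(reversed(tail_first))


def capture(chain, next_state):
    """One functional clock: every storage element takes its next-state value."""
    chain.bits = next_state(chain.bits)


def rotate_left(bits):
    # Hypothetical logic under test.
    return bits[1:] + bits[:1]


chain = ScanChain(4)
chain.load([1, 0, 1, 1])      # write a known state through the scan input
capture(chain, rotate_left)   # apply one functional clock
observed = chain.unload()     # read the resulting state through the scan output
print(observed)               # [0, 1, 1, 1]
```

The model relies on a single well-defined state after each clock, which is the property the text notes is missing in a temporally decoupled asynchronous circuit.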
An issue related to testing is initializability, which is the process of driving a circuit at power-up
to a known state. Initializability is also often required by automatic test pattern generators. Two recent
methods for asynchronous initializability have been developed [32, 192].
One interesting area of asynchronous circuit testing that is just beginning to be studied is the
issue of hazards. Asynchronous circuits by nature often contain redundant logic to prevent hazards. This
is particularly problematic with respect to testing. Namely, if the circuit contains redundant logic
to prevent hazards, then how is this redundancy tested at the chip's external pins? Fundamentally, a
solution requires that redundancy path analysis connections be exported to the pad ring of the chip.
The increase in area, power consumption and packaging costs needed to support this capability
directly is a problem. However, integrating this analysis capability within the scan path is an interesting
option. The solution and its complexity remain open research issues.
7 Conclusions
This monograph has provided an introduction to the current (1997) state of the art of asynchronous
circuit design. The focus has been on motivations, fundamental concepts, design methods, and the
physical artifacts and results that these design styles have produced.
Asynchronous circuit design is currently a growing research area that has yet to
have significant impact on the design of commercial integrated circuits. However, there are significant
signs that asynchronous circuits may become more of a mainstream discipline in the future. Serious
asynchronous circuit efforts exist in corporate research labs, namely at Sun Microsystems, Philips, and
Intel. The goals of these groups vary from increased performance at Sun and Intel, to reduced power
consumption at Philips.
Another promising sign is the rapid restructuring of the semiconductor design and electronic
design automation (EDA) companies. The industry leaders have joined together to form the VSI
Alliance. VSI stands for virtual socket interface, and the purpose of the alliance is to create standards
that allow design technology to be exchanged and reused easily. The goal is to avoid redesigning
circuits that already exist in another company. Rapid changes in the marketplace result in rapidly
decreasing design cycle requirements. The result is that companies simply do not have the time to design
each new product from scratch. It is becoming more cost effective to buy macrocells and processor cores
and combine them to create a new design. The high level of competition, however, requires that new
products take full advantage of the latest integrated circuit technology. The new technology is faster
but creates a new set of timing problems which must be managed carefully in a synchronous design.
The inherent ease of asynchronous module composition nicely fits the requirements of this new industry
model. The composability advantage is directly due to the fact that asynchronous circuits explicitly
export their timing requirements at their interfaces via their signaling protocols.
More fundamental motivations come from basic integrated circuit technology trends. As transistor
sizes shrink and as chips become significantly larger, distributing increasingly fast clocks
with minimal skew becomes too expensive. The expense comes both in terms of reduced
performance and increased power consumption. The basis for this belief is that the speed improvement
of wires versus transistors is disparate. Gate delays improve by 150% while wire delays improve by
only 20% in each new integrated circuit process generation. Improvements in wire technology are to be
expected but the disparity is equally likely to remain. Given the trend of 10% growth
in the physical die size per generation, less than 10% of the die area will be reachable in a single clock
period when the feature size reaches 0.06 μm [125]. It is therefore highly unlikely that one billion
transistor chips will be both cost effective and synchronous. This monograph has provided a number of
options that may well be the basis for a future solution to this critical problem.
References
[1] W. B. Ackerman and J. B. Dennis. VAL - A Value-Oriented Algorithmic Language Preliminary
Reference Manual. Technical Report LCS/TR-218, Massachusetts Institute of Technology, Computer
Science Department, 1979.
[2] M. Afghahi and C. Svensson. Performance of Synchronous and Asynchronous Schemes for VLSI
Systems. IEEE Transactions on Computers, 41(7):858-872, July 1992.
[3] F. Aghdasi. Synthesis of asynchronous sequential machines for VLSI applications. In Proceedings of
the 1991 International Conference on Concurrent Engineering and Electronic Design Automation
(CEEDA), pages 55-59, March 1991.
[4] V. Akella and G. Gopalakrishnan. SHILPA: a high-level synthesis system for self-timed circuits. In
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 587-91.
IEEE Computer Society Press, November 1992.
[5] R. Alur and D.L. Dill. A theory of timed automata. Theoretical Computer Science, 1(126):183-235,
1994.
[6] D.B. Armstrong, A.D. Friedman, and P.R. Menon. Realization of asynchronous sequential circuits
without inserted delay elements. IEEE Transactions on Computers, C-17(2):129-134, February
1968.
[7] A. Ashkinazy, D. Edwards, C. Farnsworth, G. Gendel, and S. Sikand. Tools for validating asynchronous
digital circuits. In Proceedings of the International Symposium on Advanced Research
in Asynchronous Circuits and Systems (Async94), pages 12-21. IEEE Computer Society Press,
November 1994.
[8] A.M. Bailey and M.B. Josephs. Sequencer circuits for VLSI programming. In Proceedings of the
Working Conference on Asynchronous Design Methodologies, pages 82-90. IEEE Computer Society
Press, 1995.
[9] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[10] P. Beerel and T. Meng. Semi-Modularity and Self-Diagnostic Asynchronous Control Circuits. In
Carlo H. Sequin, editor, Proceedings of the 1991 University of California/Santa Cruz Conference,
pages 103-117. The MIT Press, 1991.
[11] P.A. Beerel, J. Burch, and T. Meng. Efficient verification of determinate speed-independent circuits.
In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design,
pages 261-267. IEEE Computer Society Press, November 1993.
[12] P. A. Beerel and T. Meng. Automatic gate-level synthesis of speed-independent circuits. In Proceedings
of the IEEE/ACM International Conference on Computer-Aided Design, pages 581-586.
IEEE Computer Society Press, November 1992.
[13] P.A. Beerel and T. H.-Y. Meng. Testability of asynchronous timed control circuits with delay
assumptions. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 446-451.
ACM, June 1991.
[14] P.A. Beerel, K.Y. Yun, and W.C. Chou. Optimizing average-case delay in technology mapping
of burst-mode circuits. In Proceedings of the International Symposium on Advanced Research in
Asynchronous Circuits and Systems (Async96), pages 244-260. IEEE Computer Society Press,
November 1996.
[15] J. Beister. A unified approach to combinational hazards. IEEE Transactions on Computers,
C-23(6):566-575, June 1974.
[16] W. Belluomini and C.J. Myers. Efficient timing analysis algorithms for timed state space exploration.
In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async97). IEEE Computer Society Press, April 1997.
[17] I. Benko and J.C. Ebergen. Delay-insensitive solutions to the committee problem. In Proceedings
of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async94), pages 228-237. IEEE Computer Society Press, November 1994.
[18] G. Birtwistle and Y. Liu. Specification of the Manchester Amulet 1: Top level Specification.
Computer Science Department Technical Report, University of Calgary, December 1994.
[19] J.G. Bredeson. Synthesis of multiple-input change hazard-free combinational switching circuits
without feedback. International Journal of Electronics (GB), 39(6):615-624, December 1975.
[20] J.G. Bredeson and P.T. Hulina. Elimination of static and dynamic hazards for multiple input
changes in combinational switching circuits. Information and Control, 20:114-224, 1972.
[21] E. Brunvand. Translating concurrent communicating programs into asynchronous circuits. Technical
Report CMU-CS-91-198, Carnegie Mellon University, 1991. Ph.D. Thesis.
[22] E. Brunvand. The NSR processor. In Proceedings of the Twenty-Sixth Annual Hawaii International
Conference on System Sciences, volume I, pages 428-435. IEEE Computer Society Press, January
1993.
[23] E. Brunvand and R. F. Sproull. Translating concurrent programs into delay-insensitive circuits.
In Proceedings of the IEEE International Conference on Computer-Aided Design, pages 262-265.
IEEE Computer Society Press, November 1989.
[24] J.A. Brzozowski and J.C. Ebergen. Recent developments in the design of asynchronous circuits.
Technical Report CS-89-18, University of Waterloo, Computer Science Department, 1989.
[25] J.A. Brzozowski and J.C. Ebergen. On the delay-sensitivity of gate networks. IEEE Transactions
on Computers, 41(11):1349-1360, November 1992.
[26] S. M. Burns. Automated compilation of concurrent programs into self-timed circuits. Technical
Report Caltech-CS-TR-88-2, California Institute of Technology, 1987. M.S. Thesis.
[27] S.M. Burns. Performance analysis and optimization of asynchronous circuits. Technical Report
Caltech-CS-TR-91-01, California Institute of Technology, 1991. Ph.D. Thesis.
[28] S.M. Burns. General condition for the decomposition of state holding elements. In Proceedings
of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async96), pages 48-57. IEEE Computer Society Press, November 1996.
[29] S.M. Burns and A.J. Martin. Syntax-directed translation of concurrent programs into self-timed
circuits. In J. Allen and T.F. Leighton, editors, Advanced Research in VLSI: Proceedings of the
Fifth MIT Conference, pages 35-50. MIT Press, Cambridge, MA, 1988.
[30] S. Chakraborty and D.L. Dill. More accurate polynomial-time min-max timing simulation. In
Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and
Systems (Async97). IEEE Computer Society Press, April 1997.
[31] S. Chakraborty, D.L. Dill, K.-Y. Chang, and K.Y. Yun. Timing analysis for extended burst-mode
circuits. In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async97). IEEE Computer Society Press, April 1997.
[32] S.T. Chakradhar, S. Banerjee, R.K. Roy, and D.K. Pradhan. Synthesis of initializable asynchronous
circuits. IEEE Transactions on VLSI Systems, 4(2):254-262, June 1996.
[33] T. J. Chaney and C. E. Molnar. Anomalous Behaviour of Synchronizer and Arbiter Circuits. IEEE
Transactions on Computers, C-22(4):421-422, 1973.
[34] T.J. Chaney, S.M. Ornstein, and W.M. Littlefield. Beware the synchronizer. In IEEE 6th International
Computer Conference, pages 317-319, 1972.
[35] J.F. Chappel and S.G. Zaky. A delay-controlled phase-locked loop to reduce timing errors in
synchronous/asynchronous communication links. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async94), pages 156-165. IEEE
Computer Society Press, November 1994.
[36] V. L. Chi. Salphasic Distribution of Clock Signals for Synchronous Systems. IEEE Transactions
on Computers, 43(5):597-602, May 1994.
[37] J.-S. Chiang and D. Radhakrishnan. Hazard-free design of mixed operating mode asynchronous
sequential circuits. International Journal of Electronics, 68(1):23-37, January 1990.
[38] T.-A. Chu. Synthesis of self-timed VLSI circuits from graph-theoretic specifications. Technical
Report MIT-LCS-TR-393, Massachusetts Institute of Technology, 1987. Ph.D. Thesis.
[39] T.-A. Chu. Automatic synthesis and verification of hazard-free control circuits from asynchronous
finite state machine specifications. In Proceedings of the IEEE International Conference on Computer
Design, pages 407-413. IEEE Computer Society Press, 1992.
[40] T.-A. Chu, N. Mani, and C.K.C. Leung. An efficient critical race-free state assignment technique
for asynchronous finite state machines. In Proceedings of the 30th ACM/IEEE Design Automation
Conference, pages 2-6. ACM, June 1993.
[41] H.Y.H. Chuang and S. Das. Synthesis of multiple-input change asynchronous machines using controlled
excitation and flip-flops. IEEE Transactions on Computers, C-22(12):1103-1109, December
1973.
[42] W.A. Clark. Macromodular computer systems. In Proceedings of the Spring Joint Computer
Conference, AFIPS, April 1967.
[43] F. Commoner, A. Holt, S. Even, and A. Pnueli. Marked directed graphs. Journal of Computer
and System Sciences, 5(5):511-523, October 1971.
[44] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev. Complete state encoding
based on the theory of regions. In Proceedings of the International Symposium on Advanced
Research in Asynchronous Circuits and Systems (Async96), pages 36-47. IEEE Computer Society
Press, November 1996.
[45] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev. Methodology and
tools for state encoding in asynchronous circuit synthesis. In 33rd ACM/IEEE Design Automation
Conference, June 1996.
[46] J. Cortadella, L. Lavagno, P. Vanbekbergen, and A. Yakovlev. Designing asynchronous circuits
from behavioural specifications with internal conflicts. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async94), pages 106-115.
IEEE Computer Society Press, November 1994.
[47] I. David, R. Ginosar, and M. Yoeli. Self-timed implementation of a reduced instruction set computer.
Technical Report 732, Technion and Israel Institute of Technology, October 1989.
[48] A. Davis, B. Coates, and K. Stevens. Automatic synthesis of fast compact self-timed control
circuits. In 1993 IFIP Working Conference on Asynchronous Design Methodologies (Manchester,
England), 1993.
[49] A.L. Davis. The architecture and system method of DDM-1: A recursively-structured data driven
machine. In Proc. Fifth Annual Symposium on Computer Architecture, 1978.
[50] A.L. Davis. A data-driven machine architecture suitable for VLSI implementation. In C.L. Seitz,
editor, Proceedings of the Caltech Conference on Very Large Scale Integration, pages 479-494,
January 1979.
[51] Al Davis. Synthesizing Asynchronous Circuits: Practice and Experience. In Asynchronous Digital
Circuit Design, pages 104-150, 1995.
[52] A.L. Davis, B. Coates, and K. Stevens. The post office experience: Designing a large asynchronous
chip. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System
Sciences, volume I, pages 409-418. IEEE Computer Society Press, January 1993.
[53] P. Day and J.V. Woods. Investigation into micropipeline latch design styles. IEEE Transactions
on VLSI Systems, 3(2):264-272, June 1995.
[54] M.E. Dean. STRiP: A self-timed RISC processor architecture. Technical report, Stanford University,
1992. Ph.D. Thesis.
[55] M.E. Dean, D.L. Dill, and M. Horowitz. Self-timed logic using current-sensing completion detection
(CSCD). In Proceedings of the IEEE International Conference on Computer Design. IEEE
Computer Society Press, October 1991.
[56] M.E. Dean, T.E. Williams, and D.L. Dill. Efficient self-timing with level-encoded 2-phase dual-rail
(LEDR). In Carlo Sequin, editor, Advanced Research in VLSI: Proceedings of the 1991 University
of California Santa Cruz Conference, pages 55-70. The MIT Press, 1991. ISBN 0-262-19308-6.
[57] S. Devadas, K. Keutzer, S. Malik, and A. Wang. Verification of asynchronous interface circuits with
bounded wire delays. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design, pages 188-195. IEEE Computer Society Press, November 1992.
[58] D.L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits.
MIT Press, Cambridge, MA, 1989.
[59] D.L. Dill, S.M. Nowick, and R.F. Sproull. Specification and automatic verification of self-timed
queues. Formal Methods in System Design, 1(1):29-60, July 1992.
[60] D. W. Dobberpuhl et al. A 200-MHz 64-bit Dual-issue CMOS Microprocessor. Digital Technical
Journal, 4(4):35-50, 1993.
[61] J.C. Ebergen. A formal approach to designing delay-insensitive circuits. Distributed Computing,
5(3):107-119, 1991.
[62] J.C. Ebergen. A verifier for network decompositions of command-based specifications. In Proceedings
of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, volume I,
pages 310-318. IEEE Computer Society Press, January 1993.
[63] E.B. Eichelberger. Hazard detection in combinational and sequential switching circuits. IBM
Journal of Research and Development, 9(2):90-99, 1965.
[64] C. Farnsworth, D.A. Edwards, J. Liu, and S.S. Sikand. A hybrid asynchronous system design
environment. In Proceedings of the Working Conference on Asynchronous Design Methodologies,
pages 91-98. IEEE Computer Society Press, 1995.
[65] C. Farnsworth, D.A. Edwards, and S.S. Sikand. Utilising dynamic logic for low power consumption
in asynchronous circuits. In Proceedings of the International Symposium on Advanced Research
in Asynchronous Circuits and Systems (Async94), pages 186-194. IEEE Computer Society Press,
November 1994.
[66] J. Frackowiak. Methoden der analyse und synthese von hasardarmen schaltnetzen mit minimalen
kosten I. Elektronische Informationsverarbeitung und Kybernetik, 10(2/3):149-187, 1974.
[67] A.D. Friedman and P.R. Menon. Synthesis of asynchronous sequential circuits with multiple-input
changes. IEEE Transactions on Computers, C-17(6):559-566, June 1968.
[68] R.M. Fuhrer, B. Lin, and S.M. Nowick. Symbolic hazard-free minimization and encoding of asynchronous
finite state machines. In IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pages 604-611, November 1995.
[69] S. Furber, P. Day, J. Garside, N. Paver, and S. Temple. Amulet2e. In EMSYS96 - OMI Sixth
Annual Conference. IOS Press, 1996. ISBN 90 5199 300 5.
[70] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In
Proceedings of VLSI 93, pages 5.4.1-5.4.10, September 1993.
[71] S.B. Furber, P. Day, J.D. Garside, N.C. Paver, S. Temple, and J.V. Woods. The design and
evaluation of an asynchronous microprocessor. In Proceedings of the IEEE International Conference
on Computer Design, pages 217-220. IEEE Computer Society Press, October 1994.
[72] S.B. Furber and J. Liu. Dynamic logic in four-phase micropipelines. In Proceedings of the International
Symposium on Advanced Research in Asynchronous Circuits and Systems (Async96), pages
11-16. IEEE Computer Society Press, November 1996.
[73] S.B. Furber and P. Woods. Four-phase micropipeline latch control circuits. IEEE Transactions on
VLSI Systems, 4(2):247-253, June 1996.
[74] J.D. Garside, S. Temple, and R. Mehra. The AMULET2e cache system. In Proceedings of the
International Symposium on Advanced Research in Asynchronous Circuits and Systems (Async96),
pages 208-217. IEEE Computer Society Press, November 1996.
[75] R. Ginosar and N. Michell. On the potential of asynchronous pipelined processors. Technical
Report UUCS-90-015, VLSI Systems Research Group, University of Utah, 1990.
[76] G. Gopalakrishnan. Micropipeline wavefront arbiters using lockable C-elements. IEEE Design and
Test, 11(4):55-64, Winter 1994.
[77] G. Gopalakrishnan, E. Brunvand, N. Michell, and S.M. Nowick. A correctness criterion for asynchronous
circuit validation and optimization. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, 13(11):1309-1318, November 1994.
[78] G. Gopalakrishnan, P. Kudva, and E. Brunvand. Peephole optimization of asynchronous macromodule
networks. In Proceedings of the IEEE International Conference on Computer Design, pages
442-446. IEEE Computer Society Press, October 1994.
[79] E. Grass, R.C.S. Morling, and I. Kale. Activity-monitoring completion-detection (AMCD): a
new single rail approach to achieve self timing. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async96), pages 143-149. IEEE
Computer Society Press, November 1996.
[80] M. Greenstreet. Implementing a STARI chip. In Proceedings of the IEEE International Conference
on Computer Design, pages 38-43. IEEE Computer Society Press, October 1995.
[81] S. Hauck. Asynchronous design methodologies: An overview. Proceedings of the IEEE, 83(1):69-93,
January 1995.
[82] A.B. Hayes. Stored state asynchronous sequential circuits. IEEE Transactions on Computers,
C-30(8):596-600, August 1981.
[83] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8):666-677,
August 1978.
[84] D. A. Huffman. The synthesis of sequential switching circuits. Journal of the Franklin Institute,
257(3):161-190, March 1954.
[85] D. A. Huffman. The synthesis of sequential switching circuits. Journal of the Franklin Institute,
257(4):275-303, April 1954.
[86] H. Hulgaard and S.M. Burns. Bounded delay timing analysis of a class of CSP programs with
choice. In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async94), pages 2-11. IEEE Computer Society Press, November 1994.
[87] H. Hulgaard, S.M. Burns, T. Amon, and G. Borriello. Practical applications of an efficient time
separation of events algorithm. In Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, pages 146-151. IEEE Computer Society Press, November 1993.
[88] K. Hwang. Computer Arithmetic: Principles, Architecture, and Design. John Wiley and Sons,
1979.
[89] M.B. Josephs. Receptive process theory. Acta Informatica, 29:17-31, 1992.
[90] M.B. Josephs, P.G. Lucassen, J.T. Udding, and T. Verhoeff. Formal design of an asynchronous
DSP counterflow pipeline: A case study in handshake algebra. In Proceedings of the International
Symposium on Advanced Research in Asynchronous Circuits and Systems (Async94), pages 208-215.
IEEE Computer Society Press, November 1994.
[91] M.B. Josephs and J.T. Udding. An overview of D-I algebra. In Proceedings of the Twenty-Sixth
Annual Hawaii International Conference on System Sciences, volume I, pages 329-338. IEEE
Computer Society Press, January 1993.
[92] W. Keister, A. E. Ritchie, and S. H. Washburn. The Design of Switching Circuits. Van Nostrand,
Princeton, New Jersey, 1951.
[93] K. Keutzer, L. Lavagno, and A. Sangiovanni-Vincentelli. Synthesis for testability techniques for
asynchronous circuits. In Proceedings of the IEEE International Conference on Computer-Aided
Design, pages 326-329. IEEE Computer Society Press, November 1991.
[94] A. Khoche. Testing Macro-Module Based Self-Timed Circuits. PhD thesis, University of Utah,
1996.
[95] A. Khoche and E. Brunvand. Testing micropipelines. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async94), pages 239-246.
IEEE Computer Society Press, November 1994.
[96] M. Kishinevsky, A. Kondratyev, A. Taubin, and V. Varshavsky. Analysis and identification of
speed-independent circuits on an event model. Formal Methods in System Design, 4(1):33-75,
January 1994.
[97] M.A. Kishinevsky, A.Y. Kondratyev, A.R. Taubin, and V.I. Varshavsky. Concurrent Hardware:
The Theory and Practice of Self-Timed Design. John Wiley and Sons Ltd., 1994.
[98] R. Kol, R. Ginosar, and G. Samuel. Statechart methodology for the design, validation, and
synthesis of large scale asynchronous systems. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async96), pages 164-174. IEEE
Computer Society Press, November 1996.
[99] T. Kolks, S. Vercauteren, and B. Lin. Control resynthesis for control-dominated asynchronous
designs. In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async96), pages 233-243. IEEE Computer Society Press, November 1996.
[100] A. Kondratyev, M. Kishinevsky, B. Lin, P. Vanbekbergen, and A. Yakovlev. Basic gate implementation
of speed-independent circuits. In Proceedings of the 31st ACM/IEEE Design Automation
Conference, pages 56-62. ACM, June 1994.
[101] P. Kudva, G. Gopalakrishnan, and H. Jacobson. A technique for synthesizing distributed burst-mode
circuits. In 33rd ACM/IEEE Design Automation Conference, June 1996.
[102] P. Kudva, G. Gopalakrishnan, H. Jacobson, and S.M. Nowick. Synthesis of hazard-free customized
CMOS complex-gate networks under multiple-input changes. In 33rd ACM/IEEE Design Automation
Conference, pages 77-82, June 1996.
[103] D.S. Kung. Hazard-non-increasing gate-level optimization algorithms. In Proceedings of the
IEEE/ACM International Conference on Computer-Aided Design, pages 631-634. IEEE Computer
Society Press, November 1992.
[104] M. Kuwako and T. Nanya. Timing-reliability evaluation of asynchronous circuits based on different
delay models. In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async94), pages 22-31. IEEE Computer Society Press, November
1994.
[105] M. Ladd and W. P. Birmingham. Synthesis of multiple-input change asynchronous finite state
machines. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages 309-314.
ACM, June 1991.
[106] L. Lavagno, K. Keutzer, and A. Sangiovanni-Vincentelli. Algorithms for synthesis of hazard-free
asynchronous circuits. In Proceedings of the 28th ACM/IEEE Design Automation Conference,
pages 302-308. ACM, June 1991.
[107] L. Lavagno, C.W. Moon, R.K. Brayton, and A. Sangiovanni-Vincentelli. Solving the state assignment
problem for signal transition graphs. In Proceedings of the 29th IEEE/ACM Design
Automation Conference, pages 568-572. IEEE Computer Society Press, June 1992.
[108] L. Lavagno and A. Sangiovanni-Vincentelli. Algorithms for synthesis and testing of asynchronous
circuits. Kluwer Academic, 1993.
[109] A. Liebchen and G. Gopalakrishnan. Dynamic reordering of high latency transactions using a
modified micropipeline. In Proceedings of the IEEE International Conference on Computer Design,
pages 336-340. IEEE Computer Society Press, 1992.
[110] B. Lin and S. Devadas. Synthesis of hazard-free multi-level logic under multiple-input changes
from binary decision diagrams. In Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design, pages 542-549. IEEE Computer Society Press, November 1994.
[111] K.-J. Lin and C.-S. Lin. Automatic synthesis of asynchronous circuits. In Proceedings of the 28th
ACM/IEEE Design Automation Conference, pages 296-301. ACM, June 1991.
[112] C.N. Liu. A state variable assignment method for asynchronous sequential switching circuits.
Journal of the ACM, 10:209-216, April 1963.
[1131 P-G. Lucassen and J .T . Udding. On the correctness of the sproull counterflow pipeline processor. 
In Proceedings o f  the International Sym posium  on Advanced Research in Asynchronous Circuits  
and S ystem s (A sync96),  pages 112-120. IEEE C om puter Society Press, Novem ber 1996.
[114] M. Maezawa, J. Kurosawa, Y. Kameda, and T. Nanya. Pulse-driven dual-rail logic gate family
based on rapid single-flux-quantum (RSFQ) devices for asynchronous circuits. In Proceedings of the
International Symposium on Advanced Research in Asynchronous Circuits and Systems (Async96),
pages 134-142. IEEE Computer Society Press, November 1996.
[115] G. Mago. Realization methods for asynchronous sequential circuits. IEEE Transactions on
Computers, C-20(3):290-297, March 1971.
[116] W.C. Mallon and J.T. Udding. Using metrics for proof rules for recursively defined delay-insensitive
specifications. In Proceedings of the International Symposium on Advanced Research in
Asynchronous Circuits and Systems (Async97). IEEE Computer Society Press, April 1997.
[117] A. Marshall, B. Coates, and P. Siegel. The design of an asynchronous communications chip. IEEE
Design and Test, 11(2):8-21, Summer 1994.
[118] A.J. Martin, A. Lines, R. Manohar, M. Nystroem, P. Penzes, R. Southworth, U. Cummings, and
T.K. Lee. The Design of an Asynchronous MIPS R3000 Processor. In Richard B. Brown and
Alexander T. Ishii, editors, 17th Conference on Advanced Research in VLSI, pages 164-181. IEEE,
IEEE Computer Society Press, 1997.
[119] A.J. Martin. The design of a self-timed circuit for distributed mutual exclusion. In Henry Fuchs,
editor, Proceedings of the 1985 Chapel Hill Conference on Very Large Scale Integration, pages
245-60. CSP, Inc., 1985.
[120] A.J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed
Computing, 1:226-234, 1986.
[121] A.J. Martin. The limitation to delay-insensitivity in asynchronous circuits. In W.J. Dally, editor,
Advanced Research in VLSI: Proceedings of the Sixth MIT Conference, pages 263-278. MIT Press,
Cambridge, MA, 1990.
[122] A.J. Martin. Programming in VLSI: From communicating processes to delay-insensitive circuits.
In C.A.R. Hoare, editor, Developments in Concurrency and Communication, UT Year of
Programming Institute on Concurrent Programming, pages 1-64. Addison-Wesley, Reading, MA, 1990.
[123] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus. The design of an
asynchronous microprocessor. In 1989 Caltech Conference on Very Large Scale Integration, 1989.
[124] A.J. Martin and P.J. Hazewindus. Testing delay-insensitive circuits. In Carlo H. Sequin, editor,
Advanced Research in VLSI: Proceedings of the 1991 UC Santa Cruz Conference, pages 118-132.
MIT Press, 1991.
[125] Doug Matzke. Will Physical Scalability Sabotage Performance Gains? IEEE Computer,
30(9):37-39, September 1997.
[126] E.J. McCluskey. Introduction to the Theory of Switching Circuits. McGraw-Hill, New York, NY,
1965.
[127] E.J. McCluskey. Logic Design Principles: with emphasis on testable semicustom circuits.
Prentice-Hall, Englewood Cliffs, NJ, 1986.
[128] K. McMillan and D.L. Dill. Algorithms for interface timing verification. In Proceedings of the
IEEE International Conference on Computer Design, pages 48-51. IEEE Computer Society Press,
October 1992.
[129] C. Mead and L. Conway. Introduction to VLSI Systems, chapter 7. Addison-Wesley, Reading, MA,
1980. C.L. Seitz. System Timing.
[130] T.H.-Y. Meng, R.W. Brodersen, and D.G. Messerschmitt. Automatic synthesis of asynchronous
circuits from high-level specifications. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 8(11):1185-1205, November 1989.
[131] T.H. Meng. Synchronization Design for Digital Systems. Kluwer Academic Publishers, Boston,
MA, 1991.
[132] R.E. Miller. Switching Theory. Volume II: Sequential Circuits and Machines. John Wiley and
Sons, New York, NY, 1965.
[133] R. Milner. Communication and Concurrency. Prentice Hall, London, 1989.
[134] C.E. Molnar, T.-P. Fang, and F.U. Rosenberger. Synthesis of delay-insensitive modules. In Henry
Fuchs, editor, Proceedings of the 1985 Chapel Hill Conference on Very Large Scale Integration,
pages 67-86. CSP, Inc., 1985.
[135] C.E. Molnar, I.W. Jones, B. Coates, and J. Lexau. A FIFO ring oscillator performance experiment.
In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits
and Systems (Async97). IEEE Computer Society Press, April 1997.
[136] C.W. Moon, P.R. Stephan, and R.K. Brayton. Synthesis of hazard-free asynchronous circuits from
graphical specifications. In Proceedings of the IEEE International Conference on Computer-Aided
Design, pages 322-325. IEEE Computer Society Press, November 1991.
[137] S.V. Morton, S.S. Appleton, and M.J. Liebelt. An event controlled reconfigurable multi-chip FFT.
In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits
and Systems (Async94), pages 144-153. IEEE Computer Society Press, November 1994.
[138] D.E. Muller and W.S. Bartky. A theory of asynchronous circuits I. Digital Computer
Laboratory 75, University of Illinois, November 1956.
[139] D.E. Muller and W.S. Bartky. A theory of asynchronous circuits II. Digital Computer
Laboratory 78, University of Illinois, March 1957.
[140] C. Myers. Computer-Aided Synthesis and Verification of Gate-Level Timed Circuits. PhD thesis,
Stanford University, 1995.
[141] C. Myers and T. Meng. Synthesis of timed asynchronous circuits. In Proceedings of the IEEE
International Conference on Computer Design, pages 279-284. IEEE Computer Society Press,
October 1992.
[142] C. Myers and T. Meng. Synthesis of Timed Asynchronous Circuits. IEEE Transactions on VLSI
Systems, 1(2):106-119, June 1993.
[143] C.J. Myers, T.G. Rokicki, and T.H.-Y. Meng. Automatic synthesis of gate-level timed circuits
with choice. In W.J. Dally, J.W. Poulton, and A.T. Ishii, editors, Advanced Research in VLSI:
Proceedings of the 1995 University of North Carolina Conference, pages 42-58. IEEE Computer
Society Press, 1995.
[144] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura. TITAC: design of a
quasi-delay-insensitive microprocessor. IEEE Design and Test, 11(2):50-63, Summer 1994.
[145] C.D. Nielsen and A. Martin. The design of a delay-insensitive multiply-accumulate unit. In
Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences,
volume I, pages 379-388. IEEE Computer Society Press, January 1993.
[146] L.S. Nielsen, C. Niessen, J. Sparso, and K. van Berkel. Low-Power Operation Using Self-Timed
Circuits and Adaptive Scaling of the Supply Voltage. IEEE Transactions on VLSI, 2(4):7, 1994.
[147] L.S. Nielsen and J. Sparso. A low-power asynchronous datapath for a FIR filter bank. In Proceedings
of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async96), pages 197-207. IEEE Computer Society Press, November 1996.
[148] S.M. Nowick. Automatic synthesis of burst-mode asynchronous controllers. Technical report,
Stanford University, March 1993. Ph.D. Thesis (available as Stanford University Computer Systems
Laboratory technical report, CSL-TR-95-686, Dec. 95).
[149] S.M. Nowick. Design of a low-latency asynchronous adder using speculative completion. IEE
Proceedings - Computers and Digital Techniques, 143(5):301-307, 1996.
[150] S.M. Nowick and B. Coates. UCLOCK: automated design of high-performance unclocked state
machines. In Proceedings of the IEEE International Conference on Computer Design, pages
434-441. IEEE Computer Society Press, October 1994.
[151] S.M. Nowick, M.E. Dean, D.L. Dill, and M. Horowitz. The design of a high-performance cache
controller: a case study in asynchronous synthesis. INTEGRATION, the VLSI journal,
15(3):241-262, October 1993.
[152] S.M. Nowick and D.L. Dill. Synthesis of asynchronous state machines using a local clock. In
Proceedings of the IEEE International Conference on Computer Design, pages 192-197. IEEE
Computer Society Press, October 1991.
[153] S.M. Nowick and D.L. Dill. Exact two-level minimization of hazard-free logic with
multiple-input changes. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
14(8):986-997, August 1995.
[154] S.M. Nowick, N.K. Jha, and F.-C. Cheng. Synthesis of asynchronous circuits for stuck-at and
robust path delay fault testability. In Proceedings of the Eighth International Conference on VLSI
Design (VLSI Design 95). IEEE Computer Society Press, January 1995.
[155] S.M. Nowick, K.Y. Yun, P.A. Beerel, and A.E. Dooply. Speculative completion for the design of
high-performance asynchronous dynamic adders. In Proceedings of the International Symposium
on Advanced Research in Asynchronous Circuits and Systems (Async97). IEEE Computer Society
Press, April 1997.
[156] S.M. Nowick, K.Y. Yun, and D.L. Dill. Practical asynchronous controller design. In Proceedings of
the IEEE International Conference on Computer Design, pages 341-345. IEEE Computer Society
Press, October 1992.
[157] S.S. Patil. An Asynchronous Logic Array. Technical Report Technical Memorandum 62,
Massachusetts Institute of Technology, Project MAC, 1975.
[158] P. Patra and D.S. Fussell. Efficient building blocks for delay insensitive circuits. In Proceedings
of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async94), pages 196-205. IEEE Computer Society Press, November 1994.
[159] N.C. Paver. The design and implementation of an asynchronous microprocessor. Technical report,
University of Manchester, June 1994. Ph.D. Thesis.
[160] N.C. Paver, P. Day, S.B. Furber, J.D. Garside, and J.V. Woods. Register locking in an asynchronous
microprocessor. In Proceedings of the IEEE International Conference on Computer Design, pages
351-355. IEEE Computer Society Press, October 1992.
[161] A. Peeters. Single-rail handshake circuits. Technical report, Eindhoven University of Technology,
June 1996. Ph.D. Thesis.
[162] M.A. Pena and J. Cortadella. Combining process algebras and Petri nets for the specification and
synthesis of asynchronous circuits. In Proceedings of the International Symposium on Advanced
Research in Asynchronous Circuits and Systems (Async96), pages 222-232. IEEE Computer Society
Press, November 1996.
[163] J.L. Peterson. Petri Net Theory and the Modeling of Systems. Prentice-Hall, Englewood Cliffs,
NJ, 1981.
[164] L. Plana and S.M. Nowick. Concurrency-oriented optimization for low-power asynchronous
systems. In IEEE International Symposium on Low-Power Electronics and Design, pages 151-156.
IEEE Computer Society Press, August 1996.
[165] R. Puri and J. Gu. Area efficient synthesis of asynchronous interface circuits. In Proceedings of
the IEEE International Conference on Computer Design, pages 212-216. IEEE Computer Society
Press, October 1994.
[166] R. Puri and J. Gu. A modular partitioning approach for asynchronous circuit synthesis. In
Proceedings of the 31st ACM/IEEE Design Automation Conference, pages 63-69. ACM, June
1994.
[167] M. Rem, J.L.A. van de Snepscheut, and J.T. Udding. Trace theory and the definition of hierarchical
components. In Randal Bryant, editor, Proceedings of the Third Caltech Conference on Very Large
Scale Integration, pages 225-239. CSP, Inc., 1983.
[168] R.D. Rettberg, W.R. Crowther, P.P. Carvey, and R.S. Tomlinson. The Monarch Parallel
Processor Hardware Design. Computer, 23(4):18-30, April 1990.
[169] C.A. Rey and J. Vaucher. Self-synchronized asynchronous sequential machines. IEEE Transactions
on Computers, C-23(12):1306-1311, December 1974.
[170] W. Richardson and E. Brunvand. Fred: An Architecture for a Self-Timed Decoupled Computer. In
Advanced Research in Asynchronous Circuits and Systems, pages 60-68. IEEE Computer Society
Press, 1996.
[171] W.F. Richardson. Architectural Considerations for a Self-Timed Decoupled Processor. PhD thesis,
University of Utah, March 1996.
[172] P.T. Roeine. A system for asynchronous high-speed chip to chip communication. In Proceedings
of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async96), pages 2-10. IEEE Computer Society Press, November 1996.
[173] T. Rokicki. Representing and modeling digital circuits. Technical report, Stanford University,
December 1993. Ph.D. Thesis.
[174] T. Rokicki and C. Myers. Automatic Verification of Timed Circuits. In International Conference
on Computer-Aided Verification, pages 468-480. Springer-Verlag, 1994.
[175] M. Roncken. Partial scan test for asynchronous circuits illustrated on a DCC error corrector. In
Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and
Systems (Async94), pages 247-256. IEEE Computer Society Press, November 1994.
[176] M. Roncken, E. Aarts, and W. Verhaegh. Optimal scan for pipelined testing: an asynchronous
foundation. In Proceedings of the IEEE International Test Conference. IEEE Computer Society
Press, October 1996.
[177] M. Roncken and E. Bruls. Test quality of asynchronous circuits: a defect-oriented evaluation. In
Proceedings of the IEEE International Test Conference. IEEE Computer Society Press, October
1996.
[178] Marly Roncken and Ronald Saeijs. Linear Test Times for Delay-Insensitive Circuits: a Compilation
Strategy. In S. Furber and M. Edwards, editors, Proceedings of the IFIP WG 10.5 Working
Conference on Asynchronous Design Methodologies, Manchester, pages 13-27. Elsevier Science
Publishers B.V., 1993.
[179] L.Y. Rosenblum and A.V. Yakovlev. Signal graphs: from self-timed to timed ones. In Proceedings
of International Workshop on Timed Petri Nets, Torino, Italy, pages 199-207. IEEE Computer
Society Press, July 1985.
[180] R. Rudell and A. Sangiovanni-Vincentelli. Multiple-valued optimization for PLA optimization.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 6(5):727-750,
September 1987.
[181] V. Schoeber and T. Kiel. An asynchronous scan path concept for micropipelines using the bundled
data convention. In Proceedings of the IEEE International Test Conference. IEEE Computer
Society Press, October 1996.
[182] C.-J. Seger. A bounded delay race model. In Proceedings of the IEEE International Conference
on Computer-Aided Design, pages 130-133. IEEE Computer Society Press, November 1989.
[183] Charles L. Seitz. "The Cosmic Cube". Communications of the ACM, 28(1):22-33, January 1984.
[184] Charles L. Seitz. The Mosaic C: An Experimental Fine Grain Multicomputer. In Alain Bensoussan
and Jean Pierre Verjus, editors, Future Tendencies in Computer Science. Springer Verlag Lecture
Notes in Computer Science #653, 1992.
[185] C.L. Seitz. Asynchronous machines exhibiting concurrency. In Conference Record of the Project
MAC Conference on Concurrent Systems and Parallel Computation, 1970.
[186] C.L. Seitz. Graph representations for logical machines. PhD thesis, MIT, Jan 1971.
[187] A. Semenov and A. Yakovlev. Verification of asynchronous circuits using time Petri net unfolding.
In 33rd ACM/IEEE Design Automation Conference, June 1996.
[188] E.M. Sentovich. SIS: a system for sequential circuit synthesis. Technical Report UCB/ERL
M92/41, UC Berkeley, May 1992. Dept. of EECS.
[189] Robert Shapiro and Hartmann Genrich. A Design of a Cascadable Nacking Arbiter. MetaSoftware,
Inc. Cambridge, MA 02140, February 1993.
[190] Robert Shapiro and Hartmann Genrich. Formal Verification of an Arbiter Cascade. MetaSoftware,
Inc. Cambridge, MA 02140, January 1992.
[191] P. Siegel, G. De Micheli, and D. Dill. Technology mapping for generalized fundamental-mode
asynchronous designs. In Proceedings of the 30th ACM/IEEE Design Automation Conference,
pages 61-67. ACM, June 1993.
[192] M. Singh and S.M. Nowick. Synthesis-for-testability of asynchronous sequential machines. In
Proceedings of the IEEE International Test Conference. IEEE Computer Society Press, October
1996.
[193] R.L. Sites. Alpha Architecture Reference Manual. Digital Equipment Corporation, 1992.
[194] J. Sparso and J. Staunstrup. Design and performance analysis of delay insensitive multi-ring
structures. In Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System
Sciences, volume I, pages 349-358. IEEE Computer Society Press, January 1993.
[195] R.F. Sproull, I.E. Sutherland, and C.E. Molnar. The counterflow pipeline processor architecture.
IEEE Design & Test of Computers, 11(3):48-59, 1994.
[196] staff. New Products. IEEE Computer, 25(1), 1992.
[197] K. Stevens. Practical Verification and Synthesis of Low Latency Asynchronous Systems. PhD
Thesis, Computer Science Department, University of Calgary, 1994.
[198] K. Stevens, J. Aldwinckle, G. Birtwistle, and Y. Liu. Designing parallel specifications in CCS. In
Proceedings of Canadian Conference on Electrical and Computer Engineering, Vancouver, 1993.
[199] K.S. Stevens, S.V. Robison, and A.L. Davis. The post office - communication support for
distributed ensemble architectures. In Sixth International Conference on Distributed Computing
Systems, 1986.
[200] I.E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720-738, June 1989.
[201] M.A. Tapia. Synthesis of asynchronous sequential systems using boolean calculus. In 14th Asilomar
Conference on Circuits, Systems and Computers, pages 205-209, November 1980.
[202] M. Theobald, S.M. Nowick, and T. Wu. Espresso-HF: a heuristic hazard-free minimizer for
two-level logic. In 33rd ACM/IEEE Design Automation Conference, pages 71-76, June 1996.
[203] J.A. Tierno and A.J. Martin. Low-energy asynchronous memory design. In Proceedings of the
International Symposium on Advanced Research in Asynchronous Circuits and Systems (Async94),
pages 176-185. IEEE Computer Society Press, November 1994.
[204] J.H. Tracey. Internal state assignments for asynchronous sequential machines. IEEE Transactions
on Electronic Computers, EC-15:551-560, August 1966.
[205] J.T. Udding. A formal model for defining and classifying delay-insensitive circuits and systems.
Distributed Computing, 1(4):197-204, 1986.
[206] S.H. Unger. Asynchronous Sequential Switching Circuits. Wiley-Interscience, New York, NY, 1969.
[207] S.H. Unger. Asynchronous sequential switching circuits with unrestricted input changes. IEEE
Transactions on Computers, C-20(12):1437-1444, December 1971.
[208] S.H. Unger. Self-synchronizing circuits and nonfundamental mode operation. IEEE Transactions
on Computers (Correspondence), C-26(3):278-281, March 1977.
[209] S.H. Unger. A building block approach to unclocked systems. In Proceedings of the Twenty-Sixth
Annual Hawaii International Conference on System Sciences, volume I, pages 339-348. IEEE
Computer Society Press, January 1993.
[210] C.H. van Berkel and R.W.J.J. Saeijs. Compilation of communicating processes into
delay-insensitive circuits. In Proceedings of the IEEE International Conference on Computer Design,
pages 157-162. IEEE Computer Society Press, 1988.
[211] K. van Berkel. Handshake Circuits. An asynchronous architecture for VLSI programming.
International Series on Parallel Computation 5. Cambridge University Press, 1993.
[212] K. van Berkel and A. Bink. Single-track handshake signaling with application to micropipelines.
In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits
and Systems (Async96), pages 122-133. IEEE Computer Society Press, November 1996.
[213] K. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij. Asynchronous
Circuits for Low Power: a DCC Error Corrector. IEEE Design & Test, 11(2):22-32, June 1994.
[214] Kees van Berkel. Beware the isochronic fork. Integration, the VLSI journal, 13(2):103-128, 1992.
[215] P. Vanbekbergen, F. Catthoor, G. Goossens, and H. De Man. Optimized synthesis of asynchronous
control circuits from graph-theoretic specifications. In Proceedings of the IEEE International
Conference on Computer-Aided Design, pages 184-187. IEEE Computer Society Press, November 1990.
[216] P. Vanbekbergen, B. Lin, G. Goossens, and H. De Man. A generalized state assignment theory
for transformations on signal transition graphs. In Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design, pages 112-117. IEEE Computer Society, November 1992.
[217] V.I. Varshavsky, M.A. Kishinevsky, V.B. Marakhovsky, V.A. Peschansky, L.Y. Rosenblum, A.R.
Taubin, and B.S. Tzirlin. Self-timed Control of Concurrent Processes. Kluwer Academic Publishers,
1990. Russian edition: 1986.
[218] Tom Verhoeff. Delay-insensitive codes - an overview. Distributed Computing, 3(1):1-8, 1988.
[219] E. Verlind, G. de Jong, and B. Lin. Efficient partial enumeration for timing analysis of asynchronous
systems. In 33rd ACM/IEEE Design Automation Conference, June 1996.
[220] T.E. Williams. Self-timed rings and their application to division. Technical Report CSL-TR-91-482,
Computer Systems Laboratory, Stanford University, 1991. Ph.D. Thesis.
[221] T.E. Williams and M.A. Horowitz. A zero-overhead self-timed 54b 160ns CMOS divider. IEEE
Journal of Solid-State Circuits, 26(11):1651-1661, November 1991.
[222] A. Yakovlev, A. Petrov, and L. Lavagno. A Low Latency Asynchronous Arbitration Circuit. IEEE
Trans. on VLSI Systems, 2(3):372-377, September 1994.
[223] A.V. Yakovlev. On limitations and extensions of STG model for designing asynchronous control
circuits. In Proceedings of the IEEE International Conference on Computer Design, pages 396-400.
IEEE Computer Society Press, October 1992.
[224] O. Yenersoy. Synthesis of asynchronous machines using mixed-operation mode. IEEE Transactions
on Computers, C-28(4):325-329, April 1979.
[225] C. Ykman-Couvreur, B. Lin, and H. De Man. ASSASSIN: a synthesis system for asynchronous
control circuits. Technical report, IMEC Laboratory, September 1994. User and tutorial manual.
[226] T. Yoneda and T. Yoshikawa. Using partial orders for trace theoretic verification of asynchronous
circuits. In Proceedings of the International Symposium on Advanced Research in Asynchronous
Circuits and Systems (Async96), pages 152-163. IEEE Computer Society Press, November 1996.
[227] M.L. Yu and P.A. Subrahmanyam. A path-oriented approach for reducing hazards in asynchronous
designs. In Proceedings of the 29th IEEE/ACM Design Automation Conference, pages 239-244.
IEEE Computer Society Press, June 1992.
[228] K.Y. Yun, P.A. Beerel, and J. Arceo. High-performance asynchronous pipeline circuits. In
Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems
(Async96), pages 17-28. IEEE Computer Society Press, November 1996.
[229] K.Y. Yun and D.L. Dill. Automatic synthesis of 3D asynchronous finite-state machines. In
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design. IEEE Computer
Society Press, November 1992.
[230] K.Y. Yun and D.L. Dill. Unifying synchronous/asynchronous state machine synthesis. In
Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pages 255-260.
IEEE Computer Society Press, November 1993.
[231] K.Y. Yun and D.L. Dill. A high-performance asynchronous SCSI controller. In Proceedings of the
IEEE International Conference on Computer Design, pages 44-49. IEEE Computer Society Press,
October 1995.
[232] K.Y. Yun, D.L. Dill, and S.M. Nowick. Practical generalizations of asynchronous state machines.
In The 1993 European Conference on Design Automation, pages 525-530. IEEE Computer Society
Press, February 1993.
[233] K.Y. Yun, B. Lin, D.L. Dill, and S. Devadas. Performance-driven synthesis of asynchronous
controllers. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November
1994.
