Asynchronous systems are being viewed as an increasingly viable alternative to purely synchronous systems. This paper gives an overview of the current state of the art in practical asynchronous circuit and system design in four areas: controllers, datapaths, processors, and the design of asynchronous/synchronous interfaces.
Since the early and mid-1980s, several controller synthesis methods have been developed to address these limitations. These methods fall into three general categories: (i) state machines; (ii) Petri-net and graph-based methods; and (iii) translation methods.
While this early work laid the foundations of asynchronous
Asynchronous State Machines
Much of the recent work on asynchronous state machine design is centered on burst-mode machines [50, 52].
Burst-mode specifications grew out of earlier informal specifications by Davis et al. [18, 17]. Davis proposed machines which would wait for a collection of input changes (an "input burst"), and then respond with a collection of output changes (an "output burst"). The key contribution was that, unlike in classical asynchronous machines, inputs within a burst could be uncorrelated: arriving in any order and at any time. Therefore, these machines could operate more flexibly in a concurrent environment. Unfortunately, their synthesis methods did not ensure hazard-free designs. Nowick and Dill [52, 50] modified and formalized these specifications into the final form called burst-mode (BM) [52, 50]. They also proposed a new self-synchronized design style called a locally-clocked state machine, which was the first burst-mode synthesis method to guarantee a hazard-free implementation [52, 50]. The method has been applied to large-scale designs such as a cache controller [51]. They also developed the first exact hazard-free 2-level logic minimization algorithm [53].
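The input-burst/output-burst behavior described above can be illustrated with a small behavioral model. This is only a sketch of the specification style, not of any synthesis method in [50, 52]: the class name, the signal names (a, b, x), and the edge notation ("a+" for a rising edge, "a-" for a falling edge) are assumptions made for illustration. The model also assumes that no input burst leaving a state is a subset of another burst leaving the same state, so that a completed burst is unambiguous.

```python
from dataclasses import dataclass, field


@dataclass
class BurstModeMachine:
    """Behavioral model of a burst-mode specification (illustrative only).

    transitions[state] is a list of (input_burst, output_burst, next_state);
    each burst is a frozenset of signal edges such as "a+" or "b-".
    """
    transitions: dict
    state: str
    pending: set = field(default_factory=set)

    def input_edge(self, edge):
        """Record one input change; return the output burst if a burst completed."""
        self.pending.add(edge)
        for in_burst, out_burst, next_state in self.transitions[self.state]:
            if self.pending == in_burst:
                # The whole input burst has arrived, in whatever order.
                self.state = next_state
                self.pending = set()
                return set(out_burst)
        return set()  # burst still incomplete: outputs do not change


# Hypothetical two-state specification: wait for a+ and b+, then raise x.
spec = {
    "S0": [(frozenset({"a+", "b+"}), frozenset({"x+"}), "S1")],
    "S1": [(frozenset({"a-", "b-"}), frozenset({"x-"}), "S0")],
}
machine = BurstModeMachine(spec, "S0")
first = machine.input_edge("b+")   # burst incomplete, no outputs
second = machine.input_edge("a+")  # burst complete, output burst fires
```

Here the edges of the first burst arrive as b+ then a+; the machine produces no output change until both have been seen, which is the order-independence that distinguishes a burst from a classical single-input-change transition.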
Datapath
This section describes some of the recent advances in self-timed datapath design, concentrating on performance issues only.
A datapath can be classified as pipelined or non-pipelined.
There has been a tremendous amount of work on asynchronous pipelines, starting with the classical micropipeline work by Sutherland. A goal is therefore to aim for performance comparable to that of a synchronous pipeline, but with the added benefits of "elasticity" (variable rate operation).
Our conjecture is that the average-case throughput (taking into account only data dependency, not operating conditions) of a deeply-pipelined asynchronous circuit would be close to the worst-case throughput. Shorter pipelines, though, tend to exhibit much better average-case behavior.
We describe below some recently introduced techniques to improve the average-case performance of self-timed datapaths.
Adders
In order to exploit variable data-dependent delays, self-timed datapath elements incorporate some form of completion detection mechanism. The most common form is based on dual-rail logic [43]. However, as the datapath becomes wider, the overhead for completion detection becomes significant. Yun et al. [77] observed that one way to tackle this problem is to parallelize the computation and completion detection as much as possible. Their techniques resulted in a 2.8ns average-case delay for a 32-bit carry bypass adder fabricated in a 0.6µm CMOS process, with only 20% completion sensing overhead on average. Another way to deal with wide datapaths is to perform bitwise completion detection. Martin et al. [44] showed an impressive throughput gain (at the expense of latency) using this technique.
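As a minimal illustration of dual-rail completion detection, the sketch below models each bit as a pair of rails and declares a word complete once every bit has exactly one rail asserted. The function names and the use of (0, 0) as the "not yet computed" spacer value are assumptions for illustration; the sketch models only the logical condition, not the circuits of [43], [77], or [44].

```python
SPACER = (0, 0)  # neither rail asserted: this bit has not been computed yet


def encode_dual_rail(bits):
    """Map each bit to a (true_rail, false_rail) pair: 1 -> (1, 0), 0 -> (0, 1)."""
    return [(b, 1 - b) for b in bits]


def bit_is_valid(pair):
    """A bit is valid when exactly one of its two rails is asserted."""
    true_rail, false_rail = pair
    return (true_rail ^ false_rail) == 1


def is_complete(word):
    """Word-level completion: every bit of the word is valid."""
    return all(bit_is_valid(pair) for pair in word)


word = encode_dual_rail([1, 0, 1])
done_all_valid = is_complete(word)

word[1] = SPACER  # model one bit that is still being computed
done_one_pending = is_complete(word)
```

The word-level check combines one validity signal per bit, which is why the detection cost grows with datapath width; bitwise completion detection instead acts on each `bit_is_valid` result separately.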
A somewhat different twist on completion detection is called speculative completion. This technique assumes that the circuit normally finishes its computation significantly faster than the worst case. If the circuit cannot complete the computation in time, it aborts the early completion report. This technique requires a special auxiliary circuit called an "abort detection circuit", which operates in parallel with the datapath element itself. Nowick et al. [55] applied this technique to a 32-bit Brent-Kung adder and obtained a simulated average-case delay of less than 2ns in a 0.6µm CMOS process.
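The idea of speculative completion can be sketched for an adder as follows. The abort condition used here (a long run of carry-propagate bits, where a bit propagates when its two operand bits differ) is a simplified assumption chosen for illustration and is not claimed to be the abort logic of [55]; the run-length threshold and the early and late delay values are likewise hypothetical.

```python
def longest_propagate_run(a, b, width=32):
    """Length of the longest run of consecutive carry-propagate bits (a_i XOR b_i)."""
    propagate = (a ^ b) & ((1 << width) - 1)
    longest = run = 0
    for i in range(width):
        if (propagate >> i) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def completion_delay(a, b, width=32, max_run=8, early=1.0, late=2.0):
    """Return the delay after which completion is reported.

    The abort check is computed from the operands alone, so it can run in
    parallel with the addition. If a long propagate run is found, the early
    completion report is aborted and the late (worst-case) delay is used.
    """
    abort = longest_propagate_run(a, b, width) >= max_run
    return late if abort else early


fast_case = completion_delay(5, 5)       # no propagate bits: early completion
slow_case = completion_delay(0xFFFF, 0)  # 16 propagate bits in a row: abort
```

Because most operand pairs contain only short propagate runs, the early delay applies in the common case, which is where the average-case benefit comes from.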
Iterative structures
The development of the zero-overhead self-timed ring technique by Williams [74] is clearly the most significant breakthrough in self-timed iterative structures. Williams showed that a self-timed ring can be designed in dual-rail domino logic with essentially zero overhead. He applied this technique to a self-timed 160ns 54-bit mantissa divider [75] as part of a floating-point divider. This design was incorporated in a commercial microprocessor design [73].
It can be shown that this technique is generally applicable to any iterative structure in which the latency needs to be optimized. Consequently, this technique has been applied to other academic and industrial designs, such as a division and square root unit design by Matsubara and Ide [45], a self-timed packet switch design by Yun et al. [78], and a Huffman decoder design by Benes et al. [3]. There have been other iterative structure designs that achieve high performance with data-dependent computation times, such as a bundled data multiplier design by Kearney and Bergmann [34].
Large scale examples
In certain applications in which there is a large variation in processing delays between common and rare cases, asynchronous designs tend to fare much better than synchronous designs. A research group at Intel demonstrated this with their asynchronous instruction length decoder design called RAPPID ("Revolving Asynchronous Pentium Processor Instruction Decoder") [60]. RAPPID's length decoding outperforms, by a factor of 3, the same function inside a 400MHz Pentium II fabricated in the identical 0.25µm CMOS process. This speedup is primarily attributed to optimizations for common, short-length instructions and to the self-timed techniques that enable these optimizations.
This is an exciting time for asynchronous processors. Recently, Philips Semiconductors commercially released pagers containing asynchronous chips (see below). In addition to the current academic interest in asynchronous systems, several companies such as Intel, Sharp, Sun, and HP have shown interest. The asynchronous circuits these companies have developed are showing some promise of making their way into products.
Processors are, in many ways, the most demanding application for asynchronous techniques. In addition to being extremely complex systems, processors are often the target of the most aggressive optimization that circuit designers can bring to bear. The optimization criterion may be raw speed, low power, noise and EMC (Electro-Magnetic Compatibility) properties, or some combination of these, but it is in a processor where such requirements are the most critical. It is also the case that most modern high-performance microprocessors are organized around a synchronous pipelined approach, and alternative architectures may be required to achieve comparable results with asynchronous processors. Still, it is the potential benefits of the asynchronous approach that are compelling in this world of highly-optimized systems. In terms of raw speed, lowered power, and improved EMC properties, asynchronous techniques may have much to offer.
The Fred Architecture. Fred is a self-timed, decoupled, concurrent, pipelined computer architecture [59, 58]. It dynamically reorders instructions to issue out of order, using an instruction window to organize the reordering, and allows out-of-order instruction completion. It handles exceptions and interrupts, and includes a novel functionally precise exception model that works well in the asynchronous, decoupled, out-of-order environment [57].
Caltech Asynchronous MIPS R3000. Subsequent to the success of their first small asynchronous processor, the asynchronous group at Caltech has built an asynchronous version of the MIPS R3000 processor. Their processor uses deep, fine-grained pipelining which is exploited naturally by the underlying asynchronous circuits. The asynchronous R3000 exhibits significantly improved MIPS/watt performance over the synchronous version when scaled to account for different processes and voltages [44] .
TITAC. A group at Tokyo Institute of Technology and Tokyo University has fabricated several versions of a new architecture they call TITAC [49]. The most recent version is a full-featured 32-bit architecture that uses delay-scaling techniques to improve performance by taking real circuit delays into account, rather than conservatively assuming unbounded gate delays [64].
Asynchronous/Synchronous Interfaces
It is clear that there are interesting applications that can take advantage of asynchronous techniques. However, a vast majority of systems are and will continue to be synchronous. The question then is how to utilize some of the proven benefits of asynchronous circuits in a largely synchronous environment.
Some have suggested that communication between modules should be asynchronous (although the modules themselves are synchronous) because the cost of global synchrony is prohibitively high in large-scale VLSI systems. Chapiro first suggested the idea of a globally asynchronous, locally synchronous (GALS) system in [11]. Yun and Donohue demonstrated a prototype GALS system with a mixture of asynchronous and synchronous modules in [84]. In this chip, synchronous modules were equipped with pausible clocking control to prevent synchronization failures.
Yet others have argued that maintaining a precise frequency reference in a globally synchronous environment is not too difficult. The real problem is the uncertainty in clock phases.
