Asynchronous systems are being viewed as an increasingly viable alternative to purely synchronous systems. This paper gives an overview of the current state of the art in practical asynchronous circuit and system design in four areas: controllers, datapaths, processors, and the design of asynchronous/synchronous interfaces.
Asynchronous State Machines
Much of the recent work on asynchronous state machine design is centered on burst-mode machines [50, 52].
Burst-mode specifications grew out of earlier informal specifications by Davis et al. [18, 17]. Davis proposed machines that wait for a collection of input changes (an "input burst") and then respond with a collection of output changes (an "output burst"). The key contribution is that, unlike in classical asynchronous machines, inputs within a burst may be uncorrelated, arriving in any order and at any time. These machines can therefore operate more flexibly in a concurrent environment. Unfortunately, the synthesis methods of Davis et al. did not ensure hazard-free designs.
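The input-burst/output-burst behavior can be illustrated with a minimal behavioral sketch in Python. The names `Transition` and `BurstModeMachine`, the signal names, and the two-state example specification are all hypothetical and chosen only for illustration. The sketch models nothing more than waiting for a complete input burst; it says nothing about hazards, timing, or synthesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: str
    input_burst: frozenset   # input signals that must all change
    output_burst: frozenset  # output signals changed in response
    target: str


class BurstModeMachine:
    """Behavioral model only: no hazards or gate delays are represented."""

    def __init__(self, transitions, initial_state):
        self.transitions = list(transitions)
        self.state = initial_state
        self.pending = set()  # input changes seen since the last state change

    def input_change(self, signal):
        """Record one input change; return the output burst if one fires."""
        self.pending.add(signal)
        for t in self.transitions:
            if t.source == self.state and t.input_burst == self.pending:
                self.state = t.target
                self.pending = set()
                return set(t.output_burst)
        return None  # burst not yet complete, keep waiting


# Hypothetical specification: in S0 wait for both a and b, then change x and y.
SPEC = [
    Transition("S0", frozenset({"a", "b"}), frozenset({"x", "y"}), "S1"),
    Transition("S1", frozenset({"a"}), frozenset({"x"}), "S0"),
]

machine = BurstModeMachine(SPEC, "S0")
first = machine.input_change("b")   # burst incomplete, no output yet
second = machine.input_change("a")  # burst complete, output burst fires
```

Feeding `a` before `b` instead produces the same output burst, which is the order independence that distinguishes these machines from classical asynchronous ones.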
Nowick and Dill [52, 50] modified and formalized these specifications into the final form, called burst-mode (BM). They also proposed a new self-synchronized design style called a locally-clocked state machine, which was the first burst-mode synthesis method to guarantee a hazard-free implementation [52, 50]. The method has been applied to large-scale designs such as a cache controller [51]. They also developed the first exact hazard-free 2-level logic minimization algorithm [53].
Yun and Dill proposed an alternative implementation method, called 3D [79, 83]. The specifications were also generalized into extended burst-mode (XBM), to allow greater concurrency and practicality [80, 82]. XBM specifications can be used to synthesize controllers for synchronous/asynchronous interfaces, where the global clock is treated as one of the controller's inputs.
A number of optimization algorithms and CAD tools have been developed for sequential and combinational logic synthesis [23, 53, 66, 36, 39], technology mapping [61], timing analysis [10], and synthesis for testability [54]. Burst-mode CAD tools have been applied to several industrial designs, including an experimental routing chip [17] and a low-power infrared communications chip [40] at HP Laboratories, an experimental SCSI controller at AMD [81], and a high-performance experimental instruction-length decoder at Intel [60].
Several synthesis methods use restricted Petri nets, called marked graphs, which model concurrency but not choice. More general Petri nets called Signal Transition Graphs (STGs), as well as state graphs which specify interleaved concurrency, are now commonly used [12, 46, 72, 2, 35].
A number of synthesis algorithms have been developed for state minimization and assignment [37, 16] and hazard-free logic decomposition [8] (see also [72, 2, 35]). Full-scale CAD packages are now available, including one incorporated into the Berkeley SIS package [38], as well as Petrify [16]. Another synthesis method, called ATACS, focuses on timed circuits [48].
Translation Methods
Translation methods specify an asynchronous system using a high-level concurrent programming language. Common languages include variants of Hoare's CSP, occam and trace theory. The program is then transformed, stepwise, into a low-level program which maps directly onto a circuit. These methods can be used to synthesize both datapath and control. A few methods use formal algebraic derivations [21, 33] . More commonly, though, compiler-oriented techniques are used.
At Caltech, Martin et al. [42] specify an asynchronous system using a CSP-like parallel language, augmented with sequential constructs based on Dijkstra's guarded commands. The specification describes a set of concurrent processes which communicate on channels. The specification is then automatically compiled into a collection of gates and components which communicate on wires [9]. An alternative approach was developed by Brunvand and Sproull based on occam specifications [7].
At Philips Research and Eindhoven University, van Berkel et al. [68, 69] have developed an industrial synthesis package, based on their Tangram language. The tool has been applied to both commercial and experimental designs, including a DCC error corrector and an 80C51 microcontroller (discussed in Section 3).
Datapath
This section describes some of the recent advances in self-timed datapath design, concentrating on performance issues only.
A datapath can be classified as pipelined or non-pipelined. There has been a tremendous amount of work on asynchronous pipelines, starting with the classical micropipeline work by Sutherland [63]. Pipeline control can be implemented using either a two-phase protocol [24, 76, 1] or a four-phase protocol [19, 28, 25, 27].
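The difference between the two handshake protocols can be sketched as wire-level event traces on a request/acknowledge pair. The sketch below assumes the usual conventions: two-phase means transition signalling, where every toggle of a wire is an event, and four-phase means return-to-zero signalling. The function names and the trace format are hypothetical, and data wires and delays are left out.

```python
def two_phase(n_transfers):
    """Transition signalling: one toggle of req and one of ack per transfer."""
    req = ack = 0
    trace = []
    for _ in range(n_transfers):
        req ^= 1
        trace.append(("req", req))  # sender signals new data
        ack ^= 1
        trace.append(("ack", ack))  # receiver signals data consumed
    return trace


def four_phase(n_transfers):
    """Return-to-zero signalling: both wires rise and fall once per transfer."""
    trace = []
    for _ in range(n_transfers):
        trace.append(("req", 1))  # sender signals new data
        trace.append(("ack", 1))  # receiver signals data consumed
        trace.append(("req", 0))  # sender returns to idle
        trace.append(("ack", 0))  # receiver returns to idle
    return trace


two_phase_events = len(two_phase(3))    # 2 wire events per transfer
four_phase_events = len(four_phase(3))  # 4 wire events per transfer
```

Under these assumptions a two-phase protocol needs half as many wire transitions per transfer, while a four-phase protocol leaves both wires in a known idle level between transfers.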
All of the asynchronous datapath designs strive to obtain higher average-case speed than the worst-case speed of comparable synchronous circuits. For non-pipelined datapaths, the performance advantage of asynchronous circuits is relatively clear. The latency, the only relevant metric in non-pipelined datapaths, is simply the sum of all datapath element delays in the critical path. Thus the average-case latency of asynchronous datapaths, determined roughly by the sum of the average-case delays of the individual elements, is in general much lower than that of synchronous counterparts, which must allow for the worst-case delay. Some examples of non-pipelined datapaths are Williams's divider ring [75], van Berkel et al.'s DCC error corrector [4], Yun et al.'s differential equation solver [77], and Benes et al.'s Huffman decoder [3].
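As a small worked illustration of how far average-case delay can fall below worst-case delay, consider a ripple-carry adder under the simplifying assumption that its delay is proportional to the longest carry chain. This model, the function names, and the 8-bit width are assumptions made for illustration and are not taken from any of the cited designs.

```python
def longest_carry_chain(a, b, width):
    """Length of the longest carry ripple when adding a and b."""
    longest = run = carry = 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        generate, propagate = ai & bi, ai ^ bi
        if generate:
            run = 1          # a new carry starts here
        elif propagate and carry:
            run += 1         # an incoming carry ripples one position further
        else:
            run = 0          # no carry leaves this position
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


def average_chain(width):
    """Exhaustive average over all operand pairs (feasible for small widths)."""
    total = 0
    for a in range(1 << width):
        for b in range(1 << width):
            total += longest_carry_chain(a, b, width)
    return total / (1 << (2 * width))


WIDTH = 8
worst = longest_carry_chain(1, (1 << WIDTH) - 1, WIDTH)  # carry ripples through every bit
average = average_chain(WIDTH)
```

A clocked design has to allow for `worst` on every addition, whereas a self-timed design that detects completion pays, on average, something closer to `average`.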
For pipelined datapaths, tradeoffs are more complex. Work at Sun Labs [15] shows that asynchronous pipelines, if designed properly, can approach the speed of synchronous shift registers. However, it is unclear if asynchronous pipelines, except in some special cases [60], can ever out-perform synchronous counterparts. A goal is therefore to aim for performance comparable to that of a synchronous pipeline, but with the added benefits of "elasticity" (variable rate operation).
Our conjecture is that the average-case throughput (taking into account only data dependency, not operating conditions) of a deeply-pipelined asynchronous circuit would be close to the worst-case throughput. Shorter pipelines, though, tend to exhibit much better average-case behavior.
We describe below some recently introduced techniques to improve the average-case performance of self-timed datapaths.
Adders
In order to exploit variable data-dependent delays, self-timed datapath elements incorporate some form of completion detection mechanism. The most common form is based on dual-rail logic [43]. However, as the datapath becomes wider, the overhead for completion detection becomes significant. Yun et al. [77] observed that one way to tackle this problem is to parallelize the computation and completion detection as much as possible. Their techniques resulted in a 2.8ns average-case delay for a 32-bit carry bypass adder fabricated in a 0.6µm CMOS process, with only 20% completion sensing overhead on average. Another way to deal with wide datapaths is to perform bitwise completion detection. Martin et al. [44] showed an impressive throughput gain (at the expense of latency) using this technique.
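Dual-rail completion detection can be sketched in a few lines. The sketch assumes one common encoding convention, in which each bit uses a pair of wires with (0, 0) as the empty "spacer", (1, 0) as logic 1, and (0, 1) as logic 0; the function names are hypothetical. A real completion detector is a tree of gates whose delay grows with the width of the word, which is the overhead mentioned above.

```python
NULL = (0, 0)  # spacer: this bit has not been computed yet


def encode_bit(bit):
    """Encode one bit on a (true_rail, false_rail) pair."""
    return (1, 0) if bit else (0, 1)


def bit_valid(pair):
    """A bit is valid when exactly one of its two rails is high."""
    true_rail, false_rail = pair
    return (true_rail ^ false_rail) == 1


def completion(word):
    """The word is complete once every bit position is valid."""
    return all(bit_valid(pair) for pair in word)


def decode(word):
    """Recover the integer value of a complete word (bit 0 first)."""
    if not completion(word):
        raise ValueError("word is not complete")
    return sum(true_rail << i for i, (true_rail, _) in enumerate(word))


empty = [NULL] * 4
full = [encode_bit((5 >> i) & 1) for i in range(4)]
partial = full[:2] + [NULL] + full[3:]  # one bit still being computed
```

Here `completion(full)` holds while `completion(partial)` does not, so a downstream stage using this signal would wait until the slowest bit has settled.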
A somewhat different approach to completion detection is speculative completion. This technique assumes that the circuit normally finishes its computation significantly faster than the worst case. If the circuit cannot complete the computation in time, the early report of completion is aborted. This requires a special auxiliary circuit, called an "abort detection circuit", which operates in parallel with the datapath element itself. Nowick et al. [55] applied this technique to a 32-bit Brent-Kung adder and obtained a simulated average-case delay of less than 2ns in a 0.6µm CMOS process.
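The idea of speculative completion can be sketched as a simple delay model. Everything specific in the sketch is an assumption made for illustration: the abort condition (a long run of carry-propagating bit positions), the threshold of 8, and the unit delays 1.0 and 2.0. It is not the abort detection circuit of [55].

```python
def longest_propagate_run(a, b, width):
    """Longest run of adjacent bit positions where exactly one operand bit is 1."""
    longest = run = 0
    for i in range(width):
        if ((a >> i) ^ (b >> i)) & 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def completion_delay(a, b, width=32, threshold=8, fast=1.0, slow=2.0):
    """Delay at which completion is reported under the hypothetical model.

    The abort check depends only on the operands, so it can be evaluated
    in parallel with the addition itself.
    """
    abort = longest_propagate_run(a, b, width) >= threshold
    return slow if abort else fast


common_case = completion_delay(0x0000_1234, 0x0000_0101)  # short runs: early completion
rare_case = completion_delay(0xFFFF_FFFF, 0x0000_0000)    # long run: early completion aborted
```

The average delay under this model is a weighted mix of `fast` and `slow`, so the benefit depends on how rarely the abort condition occurs for the operands actually seen.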
Iterative structures
The development of the zero-overhead self-timed ring technique by Williams [74] is clearly the most significant breakthrough in self-timed iterative structures. Williams showed that a self-timed ring can be designed in dual-rail domino logic with essentially zero overhead. He applied this technique to a self-timed 160ns 54-bit mantissa divider [75] as part of a floating-point divider. This design was incorporated in a commercial microprocessor design [73].
It can be shown that this technique is generally applicable to any iterative structure in which the latency needs to be optimized. Consequently, this technique has been applied to other academic and industrial designs, such as a division and square root unit design by Matsubara and Ide [45] , a self-timed packet switch design by Yun et al [78] , and a Huffman decoder design by Benes et al [3] . There have been other iterative structure designs that achieve high performance with data-dependent computation times, such as a bundled data multiplier design by Kearney and Bergmann [34] .
Large scale examples
In certain applications in which there is a large variation in processing delays between common and rare cases, asynchronous designs tend to fare much better than synchronous designs. A research group at Intel demonstrated this with their asynchronous instruction length decoder design called RAPPID ("Revolving Asynchronous Pentium Processor Instruction Decoder") [60]. RAPPID's length decoding out-performs, by a factor of 3, the same function inside a 400MHz Pentium II fabricated in the identical 0.25µm CMOS process. This speedup is primarily attributed to optimizations for common, short-length instructions and to the self-timed techniques that enable these optimizations.
In another application, an asynchronous Huffman decoder design by Benes et al. [3] exploits the large data-dependent variation in decoding time to achieve average-case performance similar to the worst-case performance of comparable synchronous designs, but with 5-10 times smaller area.
So far, we have only discussed techniques to exploit variable data-dependent delays. However, if the operating condition is taken into account, we can obtain much more significant performance benefits from asynchronous circuits. This speedup is essentially due to inherent margins that must be built into synchronous systems to accommodate worst-case timing behavior but are not required for asynchronous systems. Dean [20] proposed a self-timed processor architecture called STRiP based on this idea. Yun et al. [77] demonstrated a high-performance asynchronous differential equation solver chip, whose average-case speed (tested at 22°C and 3.3V) is 48% faster than comparable synchronous designs (designed to operate at 100°C and 3V for the slow process corner).
Asynchronous Processors
This is an exciting time for asynchronous processors. Recently, at Philips Semiconductors, pagers with asynchronous chips have been released commercially (see below). In addition to the current academic interest in asynchronous systems, several companies such as Intel, Sharp, Sun, and HP have shown interest. The asynchronous circuits these companies have developed are showing some promise of making their way into products.
Processors are, in many ways, the most demanding application for asynchronous techniques. In addition to being extremely complex systems, processors are often the target of the most aggressive optimization that the circuit designers can bring to bear. The optimization criterion may be raw speed, low power, noise and EMC (Electro-Magnetic Compatibility) properties, or some combination of these, but it is in a processor where such requirements are the most critical. It is also the case that the organization of most modern high-performance microprocessors uses a synchronous pipelined approach, and alternative architectures may be required to achieve comparable results with asynchronous processors. But, it is the potential benefits of the asynchronous approach that are compelling in this world of highly-optimized systems. In terms of raw speed, lowered power, and improved EMC properties, asynchronous techniques may have much to offer.
Until recently, there have been relatively few asynchronous processors reported in the literature. Early work in asynchronous computer architecture includes the Macromodule project during the early 70's at Washington University [13, 14] and the self-timed dataflow machines called DDM-1 and DDM-2 (Data Driven Machine) built at the University of Utah in the late 70's [18] .
More recent academic projects include the Caltech Asynchronous Microprocessor [41] which was the first asynchronous microprocessor of the VLSI era, the NSR [6] , fully decoupled and built from FPGAs, and the Rotary Pipeline Processor [47] which takes a circular ring approach to the pipeline. In addition, at Sun Labs, a new counterflow architecture has been proposed, with a fully asynchronous implementation [62] . Some recent asynchronous processors are highlighted below.
Philips Asynchronous 80C51. At Philips Labs, an asynchronous version of the venerable 80C51 controller has been developed that exhibits nearly four times lower power than a power-optimized synchronous version. It also has significantly reduced EM emissions. These properties have convinced Philips to develop a family of these asynchronous controllers for pagers, and commercial pagers using these chips are now on the market [70, 71] .
Sharp DDMP Signal Processor. Sharp Corporation has developed an experimental self-timed data driven multi-media processor aimed at digital television receivers and other applications. The fabricated processor exhibits impressive performance and power consumption, operating at a speed of 8600 Million Operations per Second and with power consumption less than 1 watt. The processor consists of 8 programmable, data-driven processing elements connected by an elastic router [65] .
The Amulet. A group at the University of Manchester has built a number of versions of a self-timed micropipelined VLSI implementation of the ARM processor [26], which is an extremely power-efficient commercial microprocessor. The first-generation Amulet design comes within a factor of two of the contemporary commercial ARM [56]. The second-generation Amulet 2e was targeted at embedded applications and demonstrated a modest improvement in power per MIPS over the commercial synchronous version [27, 29], as well as nearly immediate restart from full standby mode. The third-generation Amulet 3 promises further improvements in both performance and low power [30].
The Fred Architecture. Fred is a self-timed, decoupled, concurrent, pipelined computer architecture [59, 58]. It dynamically reorders instructions to issue out of order, using an instruction window to organize the reordering, and allows out-of-order instruction completion. It handles exceptions and interrupts, and includes a novel functionally precise exception model that works well in the asynchronous, decoupled, out-of-order environment [57].
Caltech Asynchronous MIPS R3000. Subsequent to the success of their first small asynchronous processor, the asynchronous group at Caltech has built an asynchronous version of the MIPS R3000 processor. Their processor uses deep, fine-grained pipelining which is exploited naturally by the underlying asynchronous circuits. The asynchronous R3000 exhibits significantly improved MIPS/watt performance over the synchronous version when scaled to account for different processes and voltages [44] .
TITAC. A group at Tokyo Institute of Technology and Tokyo University has fabricated several versions of a new architecture they call TITAC [49]. The most recent version is a full-featured 32-bit architecture that uses delay-scaling techniques to improve performance by taking real circuit delays into account, rather than conservatively assuming unbounded gate delays [64].
Asynchronous/Synchronous Interfaces
It is clear that there are interesting applications that can take advantage of asynchronous techniques. However, a vast majority of systems are and will continue to be synchronous. The question then is how to utilize some of the proven benefits of asynchronous circuits in a largely synchronous environment.
Some have suggested that communication between modules should be asynchronous (although the modules themselves are synchronous) because the cost of global synchrony is prohibitively high in large-scale VLSI systems. Chapiro first suggested the idea of a GALS (globally asynchronous, locally synchronous) system in [11]. Yun and Donohue demonstrated a prototype GALS system with a mixture of asynchronous and synchronous modules in [84]. In this chip, synchronous modules were equipped with pausible clocking control to prevent synchronization failures.
Yet others have argued that maintaining a precise frequency reference in a globally synchronous environment is not too difficult, and that the real problem is the uncertainty in clock phases. Ginosar and Kol [31] suggested an adaptive synchronization scheme to remedy this problem. Furthermore, some synchronous systems [32, 85] are moving closer to asynchronous operation by allowing significant time borrowing to overcome clock skew and jitter problems.
