As transistor switching speed improves, synchronizing a global clock increasingly degrades system performance. Therefore, self-timed asynchronous logic becomes potentially faster than synchronous logic. To realize this potential, however, it must exploit the techniques used in fast synchronous designs, including redundant logic, inverting logic, transistor size optimization, dynamic logic, and phase alignment. Most techniques can be applied equally well to asynchronous logic (indeed, phase alignment is easier), but combining dynamic and asynchronous logic is more difficult: we must guarantee minimum refresh intervals, together with race and hazard free operation. This paper describes an initial chip implementation that combines dynamic and asynchronous logic, running at 500 MHz in 2µm CMOS.
II. Introduction to Fast CMOS Logic Techniques
The techniques used in a design depend on the requirements, such as speed, area or power. This section briefly reviews six key techniques for fast CMOS designs. The first four (redundant logic, inverting logic, transistor size optimization, and dynamic logic) are known and accepted. The final two (asynchronous logic and phase alignment) are not as well known or accepted.
A. Redundant Logic
Concurrency uses more gates (area) or communication (wiring) than strictly necessary in order to improve speed. This use of redundant logic improves speed at the cost of area and design complexity. An example of redundant logic is the carry look-ahead tree used to reduce the ripple carry propagation delay in a fast adder design [9]. By increasing the number of gates and communication lines, the worst case adder delay is reduced from a linear to a logarithmic function of the number of bits in the sum.
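As a rough illustration of this trade-off, the following Python sketch compares a ripple-carry adder with a parallel-prefix carry computation. The prefix network used here (Kogge-Stone style) is chosen only for illustration and is not necessarily the look-ahead tree of [9]; the function names are hypothetical, and the stage counts are a crude stand-in for gate delay.

```python
def ripple_add(a, b, n):
    """Add two n-bit numbers; the carry passes through n stages in series."""
    carry, total, stages = 0, 0, 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        total |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
        stages += 1                      # one carry stage per bit
    return total | (carry << n), stages


def lookahead_add(a, b, n):
    """Add two n-bit numbers with a parallel-prefix (Kogge-Stone style) carry tree."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # propagate
    G, P = g[:], p[:]
    levels, dist = 0, 1
    while dist < n:
        new_G, new_P = G[:], P[:]
        for i in range(dist, n):
            # Redundant logic: every bit position combines in parallel.
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
        levels += 1                      # one tree level per doubling
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total | (G[n - 1] << n), levels


if __name__ == "__main__":
    print(ripple_add(0xFFFF, 1, 16))     # sum and number of serial carry stages
    print(lookahead_add(0xFFFF, 1, 16))  # same sum, number of tree levels
```

For 16 bits the ripple adder needs 16 serial carry stages while the prefix tree needs 4 levels, at the cost of many more generate/propagate cells and wires.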
B. Inverting Logic
A single stage CMOS gate is inherently inverting. Therefore any circuit with an odd number of gates between input and output requires an active-high input and an active-low output, or an active-low input and an active-high output. An extra inverter at the output removes the inversion but increases the delay.
So, while NAND gates require only one gate delay, AND gates require two gate delays. A faster design technique, which we call inverting logic, is to alternate between active-low inputs/active-high outputs and active-high inputs/active-low outputs.
An example of inverting logic is the carry propagation chain of a Manchester carry chain adder [9].
Here the odd stages in the carry propagate section expect to see active-low inputs and have active-high outputs; the even stages expect to see active-high inputs and have active-low outputs. If each stage ties its carry input to the carry output of the previous stage, each stage sees the correct active-low or active-high signals.
C. Dynamic Storage
It is unnecessary to build an explicit storage element in CMOS logic. Designs can exploit the temporary storage property of CMOS gates, provided they guarantee all storage nodes are refreshed at some minimum frequency (dependent on technology). Therefore, although reliable and easy to design, standard complementary static CMOS gates are not always optimal for speed. Exploiting the inherent dynamic storage in CMOS can reduce the amount of logic necessary to implement a function, reducing the capacitance loading, and increasing speed.
An example of dynamic storage is NORA logic [10]. NORA either replaces the p-transistor pull-up section of a gate with a pre-charge p-transistor, or replaces the n-transistor pull-down section of a gate with a pre-discharge n-transistor. The pre-charge (or pre-discharge) initializes the output of a gate to high [...]
[...] because the parameters cannot easily be isolated; but, even allowing only a modest doubling in area, the use of transistor size optimization can yield almost double the speed [4].
E. No Global Clock
Where propagation delay is a significant fraction of switching time, performance is less than would be expected based on switching speed. A major factor limiting the performance is the need to distribute a global clock. Typically, the logic delay is only half that of the safe clock period because of the:
• Large capacitive load on the clock buffer.
• Distance between the central clock and the switching elements.
• Skew between different clock phases.
• Need to allow for worst case fabrication and environmental conditions.
• Uneven current distribution when all cells switch at the clock edge.
Careful design reduces some of these problems. For example, increasing the number of power terminals, using wider power lines, and using a true single phase clock [3] show for our master-slave latch, this can be kept down to a minimum).
Asynchronous systems are potentially more reliable [11] and easier to design [6] ; but in this paper we concentrate on the increased speed made possible by removing the global clock. The first reason for this speed increase is that, provided the data transfer is kept local, delay is independent of the size of the system. Second, the speed is based on the average, rather than worst case speed of the elements.
Asynchronous systems, having no central choreography, run at the average logic speed of their switching elements, not the speed set by the clock. Third, by reducing current spikes, asynchronous logic reduces the "ground bounce" performance degradation.
F. Phase Alignment
Long distance communication requires a significant percentage of the cycle time for fast systems.
Today, this delay is noticeable in chip-to-chip communication; and, as switching speed improves, it is likely to be an increasing problem on-chip. The effect of communication delay can be eliminated by running the two communicating elements out of phase. The receiving element runs "behind" the sending element, so it does not require the data until it arrives. The effect of skew between different inputs can be overcome by buffering the signals at a cost of increased area dedicated to the buffers. The phase alignment buffers delay the closer signal more than the more distant signal.
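The skew-equalizing role of the buffers can be sketched numerically. The Python fragment below is only an illustration under assumed numbers: the signal names and arrival times are hypothetical, and the buffer is modeled as an ideal delay rather than as a chain of latches.

```python
def alignment_delays(arrival_ns):
    """Return the buffer delay each signal needs so that all arrive together.

    arrival_ns maps a (hypothetical) signal name to its wire delay in ns.
    The closest signal (smallest wire delay) receives the largest buffer delay.
    """
    latest = max(arrival_ns.values())
    return {name: latest - delay for name, delay in arrival_ns.items()}


if __name__ == "__main__":
    # Assumed wire delays: one nearby source and one distant source.
    print(alignment_delays({"near": 0.5, "far": 2.0}))
```

In this model the distant signal needs no added delay, while the nearby signal is held back by the difference in wire delay, so the receiver sees both inputs in phase.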
Phase alignment is inherently asynchronous. Unlike the guaranteed race and hazard free acknowledged asynchronous logic, phase alignment uses unacknowledged asynchronous logic. If new data were held until the old data was acknowledged, we would simply double the problem of long distance communication (round trip delay). Provided the two elements run at the same average rate, however, the phase alignment buffers do not need acknowledgment lines.
G. Summary
Table I summarizes the speed techniques and their costs in area and/or design complexity.
III. Synchronous and Asynchronous Master-Slave Latches
This section reviews the fundamental differences between synchronous and asynchronous systems, using an example of a two stage shift register built from master-slave latches.
A. Synchronous Shift Register
Figure 1 shows a pair of synchronous master-slave latches forming a 2-stage shift register [12]. As in all synchronous systems, a global clock choreographs the transfer of data (in most systems today the clock line consists of multiple phases of the clock). The latch loads the input into its master storage element when the clock is low, and loads the slave storage element (which is also the output) with the output of the master storage element when the clock is high.
The right part of figure 1 shows the timing waveform. Initially D0 (input to the first latch) has a value X, D1 (input to the second latch) has value Y, and D2 (output of the second latch) has value Z. When the clock goes from low to high (rising transition), this triggers all the data to be shifted down (D0 -> D1 -> D2) and a new value (W) can be safely placed on the input D0.
B. Asynchronous Shift Register
Figure 2 shows an equivalent pair of asynchronous master-slave latches forming a 2-stage shift register.
The acknowledge line, going from the output of one stage to the sample input of the previous stage, replaces the global clock. Also, the forward channel (D0->D1->D2) now includes additional information enabling the down-stream latch to decide when the new input data is valid. A latch passes the data from its input to its output without intermediate storage (we retain the name master-slave latch only to show equivalent functionality to the well known synchronous latch). It passes the data when the input contains new data and the sample input (acknowledge line) shows its output has been latched by the next stage.
The right part of figure 2 shows the timing waveform. For example, consider the first latch with output D1. When this latch sees its input (D0) has a new value (X), not equal to its current output (Y), and its sample input (D2) says the second latch has stored its current output (Y), it can pass X onto its output.
Again, when D1 sees W on its input and X on its sample input, it passes W onto its output.
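The firing rule described above can be tried out in a small behavioral model. The Python sketch below is an assumption-laden illustration, not the circuit of figure 2: each signal is modeled as a (phase, bit) pair, "new data" is taken to mean that the input phase differs from the output phase, and "latched by the next stage" is taken to mean that the next stage's phase equals the output phase. The source and sink behavior and all names are hypothetical.

```python
def simulate_async_shift_register(bits, n_stages=2, max_steps=1000):
    """Pass a bit stream through n_stages acknowledged latches; return sink data."""
    Q = 1                                   # phases: 0 = P, 1 = Q
    # d[0] is the source output (D0); d[1..n] are the latch outputs (D1..Dn).
    d = [(Q, 0)] * (n_stages + 1)           # assumed reset state: all in one phase
    sink_phase = Q                          # phase of the last item the sink took
    pending = list(bits)
    received = []
    for _ in range(max_steps):
        if len(received) == len(bits):
            break
        old, old_sink = list(d), sink_phase  # all elements react to the same snapshot
        # Source: offer the next item once the first latch has taken the last one.
        if pending and old[1][0] == old[0][0]:
            d[0] = (1 - old[0][0], pending.pop(0))
        # Latches: pass data when the input is new and the output is acknowledged.
        for i in range(1, n_stages + 1):
            ack_phase = old[i + 1][0] if i < n_stages else old_sink
            input_is_new = old[i - 1][0] != old[i][0]
            output_latched = ack_phase == old[i][0]
            if input_is_new and output_latched:
                d[i] = old[i - 1]
        # Sink: take the output of the last latch whenever its phase changes.
        if old[n_stages][0] != old_sink:
            received.append(old[n_stages][1])
            sink_phase = old[n_stages][0]
    return received


if __name__ == "__main__":
    print(simulate_async_shift_register([1, 0, 1, 1, 0]))
```

No global clock appears in the model: each latch decides locally when to pass data, and the stream still emerges in order.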
There are many methods of passing the state change information. In the next section, we describe one such method which we call 4-state coding.
IV. 4-State Logic
This section gives a brief tutorial on 4-state coding for asynchronous communication. We describe the 4-state code and give an example for the master-slave latch.
A. 4-State Code
4-state coding of asynchronous data is a good candidate for fast logic implementation [8]. Unlike 3-state coding [6] [13], it communicates without a null state, passing information as fast as possible.
Unlike micropipelines [7], it guarantees race and hazard free operation independent of the delays. Figure 3 shows the transition diagram for the 4-state code. As its name implies, 4-state code uses four states to represent a single binary bit of information. The 4-state code alternates between P and Q phase data. To guarantee a transition, it has two logic "1" states (P1 & Q1) and two logic "0" states (P0 & Q0). Thus, for example, encoding the data stream 1, 0, 1, 1, 0 we get P1, Q0, P1, Q1, P0. Note that the phases must alternate (i.e. PI, QJ, PK, QL, and so on).
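The encoding rule can be written down in a few lines. The following Python sketch is a behavioral illustration only; the function names and the string representation of the states are assumptions, and the starting phase is taken to be P as in the example above.

```python
def encode_4state(bits, first_phase="P"):
    """Encode a bit stream as 4-state symbols with strictly alternating phases."""
    symbols = []
    phase = first_phase
    for bit in bits:
        symbols.append(f"{phase}{bit}")
        phase = "Q" if phase == "P" else "P"   # phase flips on every data item
    return symbols


def decode_4state(symbols):
    """Recover the bit stream, rejecting any stream whose phases do not alternate."""
    bits = []
    previous_phase = None
    for symbol in symbols:
        phase, bit = symbol[0], int(symbol[1])
        if phase == previous_phase:
            raise ValueError(f"phase did not alternate at symbol {symbol!r}")
        bits.append(bit)
        previous_phase = phase
    return bits


if __name__ == "__main__":
    print(encode_4state([1, 0, 1, 1, 0]))
```

Because consecutive symbols always differ in phase, the receiver sees a transition even when the data value repeats, which is what removes the need for a null state.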
B. 4-State Master Slave Latch
The simplest 4-state circuit is an asynchronous master-slave latch. sample input (ki) goes to phase P; then the output goes to Q1: that is, it passes the data.
C. Redundant, Inverting 4-state Latch
Normally, the four states would be implemented using a two-bit Gray code [8]. To maximize speed, however, we incorporate redundant logic and inverting logic (two of the fast CMOS design techniques described in section II) into our 4-state master-slave latch.
First, we redundantly use four bits to represent each of the four states. changes the output to P0 (0100) when the acknowledge input is Q (1). Similarly, a P0 at the input to an even stage (0100) changes the output to P0 (0111) when the acknowledge input is Q (0).
V. Guaranteeing Refresh in 4-state Logic
In a synchronous system, the clock signal goes high and low every cycle, independent of the data.
Therefore, provided each node is set or reset by the clock, refresh is guaranteed. With an asynchronous signal we must find a similar signal that changes at a regular interval, independent of the data.
The guaranteed phase change in 4-state logic provides a simple way of combining dynamic logic with asynchronous logic. The acknowledge line (ki or ko from figure 4) indicates the phase (P or Q) of the data;
and because we know the phase changes for each change of data, independent of the data, the acknowledge line goes high and low every two data changes. Therefore, provided each node is set or reset by the acknowledge input (ki), refresh is guaranteed.
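To see why the phase signal is a suitable refresh source, the Python sketch below compares how long a data wire and the phase (acknowledge) wire can stay constant. It is a behavioral illustration with hypothetical names; the phase is modeled as a single bit that flips on every data item, as the 4-state code requires.

```python
def phase_levels(bits, first_phase=0):
    """Phase (acknowledge) level accompanying each data item: it flips every item."""
    return [(first_phase + index) % 2 for index, _ in enumerate(bits)]


def longest_constant_run(levels):
    """Length of the longest run of identical consecutive levels."""
    if not levels:
        return 0
    longest = run = 1
    for previous, current in zip(levels, levels[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    data = [0] * 8                                    # worst case: data never changes
    print(longest_constant_run(data))                 # data wire stays constant
    print(longest_constant_run(phase_levels(data)))   # phase wire keeps toggling
```

A node refreshed from the data value could sit idle for the whole stream, whereas a node set or reset by the phase signal is touched at least once every two data items regardless of the data.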
VI. Phase Alignment
This section describes the performance of a phase alignment buffer built from asynchronous master-slave latches.
A. Asynchronous FIFO
Figure 2 shows an asynchronous shift register. Because the output is acknowledged (by the sample input into the final master-slave latch) independent of the input changes, the shift register forms an asynchronous FIFO (First-In First-Out buffer).
Consider an n-stage asynchronous FIFO consisting of n asynchronous latches receiving data, on average, every c ns. Let the average latch delay from data in to data out be x ns and from data in to ack out be y ns. Using a synchronous analogy, the acknowledgment (ack) is the 'clock' with data latched on 
VII. Dynamic Master-Slave Latch
This section describes the transistor implementation of a 4-state asynchronous master-slave latch and compares it with an equivalent synchronous master-slave latch.
A. Synchronous Master-Slave Latch
Figure 1 shows a synchronous latch, storing the input (current state) in the master node when the clock is low and passing the contents of the master node onto the slave node and output (next state) when the clock is high. Figure 6 shows a high performance implementation of this master-slave latch [3], using three storage nodes (dm, ds & do). Again, when the clock is low, the data input (di) passes through to the master node (dm). But now, when the clock is high, the master node is passed onto two slave nodes (ds and do) and the output (do).
B. Asynchronous Master-Slave Latch
Figure 7 shows a high performance implementation of the 4-state asynchronous master slave latch, a P0 at the input (di=0111) this discharges do[1], causing the output to go into state P0 (0100). This in turn causes the acknowledge output to go into state P (ko=1).
This latch guarantees refresh (see section V) every two inputs because the nodes ki and ko change every time data changes, and the nodes do[3] and do[0] are set every time ki is low and the nodes do[2] and do[1] are reset every time ki is high.
The delays for this latch (see section VI) are given by:
y = x + Inverter delay
C. Asynchronous FIFO
Figure 8 shows an n-stage FIFO built from n of the dynamic asynchronous master-slave latches shown in figure 7. In addition, it shows the acknowledge set (pullup) and reset (pulldo) circuits, which set or reset the acknowledge lines when rst/rst_b are active. The redundancy encoder (tr24) and decoder (tr42) convert between the 4-bit redundant inverting 4-state representation (see section IV) and the standard 2-bit nonredundant logic representation. Figure 8 shows the reset state, with all the data lines reset to the same null state (d=0110) and the acknowledge inputs alternating between 1 and 0. With inverting logic all the acknowledge inputs represent the same Q-phase information, though it requires alternating pullup and pulldo cells to achieve the necessary logic values. Notice, however, that the latch remains the same despite the use of inverting logic, which greatly reduces the design complexity overhead.
VIII. Results
A chip was fabricated with four asynchronous FIFOs: two with n=4 and two with n=6. Table III summarizes the chip characteristics. The length differences between the FIFOs allowed testing the internal speed, circumventing our CMOS pad limitation. With the standard 12-µm p-transistors and 6-µm n-transistors, the internal nodes never fully switch to the rail voltage, improving speed and reducing power dissipation. For around twice the area (109 µm × 71 µm), the asynchronous circuit ran about 10% faster than the equivalent synchronous circuit: 500 MHz in 2µm CMOS.
The chip incorporates five of the six fast CMOS logic techniques. Incorporating the final technique, transistor size optimization, has been simulated but not fabricated. However, we believe that, because the original chip closely matched our initial circuit simulations (HSPICE), the figures for speed are accurate to within 10%. The HSPICE optimization routine [14] adjusted the widths of all the transistors extracted from the layout, based on the goal of minimizing the delay between the rise and fall times at the output nodes. The best results show the same circuit running at over 800MHz, with about double the overall area (the transistors were limited to 48µm).
IX. Conclusion
This paper describes a method of removing the global clock to improve system performance. We combine this idea with other high speed circuit techniques to design a dynamic asynchronous master-slave latch. Using redundant inverting 4-state logic, the latch has the equivalent of one gate delay and guarantees the refresh of every dynamic storage node after any two data samples have been processed.
All six high speed circuit techniques are needed to maximize speed. Redundant logic doubles the number of transistors and wires, and inverting logic increases the amount of circuit design by about 10%; but both were acceptable because, combined, they increased speed nearly 100%. Dynamic storage requires minimum refresh and increases the design complexity, but is tolerable in 4-state logic and improves speed by about 50%. Having no global clock doubles the area; although it increases speed only about 20%, the improvement will become more significant with faster switching technology and larger systems.
Phase alignment buffers have a fixed area and design complexity overhead; but the area overhead can be kept small if limited to a few signals (such as chip-to-chip signals), the design is a natural extension of the 4-state logic, and it gives a significant speed increase for long distance communication. A chip implementing the phase alignment buffer, using standard transistor sizes, runs at 500 MHz in 2µm CMOS.
Optimizing transistor sizes almost doubles the area, but was acceptable since it increased speed over 50%.
Simulations using optimized transistor sizes predict speeds of 800 MHz in 2µm CMOS.
More research is needed to extend the high speed circuit techniques to submicron technology. Also, the high speed circuit techniques need testing on a large design and their effect observed in a production environment. In addition, getting the high speed internal signals off-chip will require low voltage swing pads.
