The design of dynamic CMOS cells that can be used as building blocks in an asynchronous pipeline is discussed in this paper. The proposed circuit elements are variations on TSPC logic using Sutherland's micropipeline structure combined with dual rail logic to detect operation completion. The resulting cell occupies more area than a TSPC cell, but has higher functionality because of the built-in data flow control mechanisms provided by the handshake signaling of the micropipeline structure.
INTRODUCTION
Most of today's digital design uses synchronous techniques, in which all events are synchronized with a central clock. As clock frequencies increase, clock skew, which is the difference in propagation delay between the clock and data paths, increases relative to the clock period. This causes difficulties in clock distribution, and eventually limits the operating speed of a VLSI chip. Self-timed asynchronous circuits, which use handshake signaling to control data flow, are not subject to this limitation.
Asynchronous circuits also offer the important advantage of dissipating less power, because signal transitions occur only when and where data are being processed. A power reduction of as much as 80% has been demonstrated [4]. This is an increasingly important consideration given today's proliferation of portable applications.
Dynamic CMOS circuits depend on clock signals to repeatedly charge and discharge the capacitance associated with various circuit nodes. Circuit operation consists of alternating precharge and evaluate phases, controlled by the clock. In the absence of a clock signal, most previous work on asynchronous design used static CMOS circuits. This paper introduces a design technique for applying dynamic VLSI technology to asynchronous design. In such circuits, alternatives to a clock signal are needed to control the precharge and evaluate phases.
TSPC [1] is a clocking scheme that is commonly used in synchronous circuits implemented in dynamic CMOS.
A modification of TSPC is presented in which handshake signals between successive pipeline stages provide the timing information for the precharge and evaluate phases. The design is based on Sutherland's micropipeline structure [2] , using Muller c-elements to synchronize the operation of each pipeline stage with data arriving from different logic blocks.
The features of TSPC and Sutherland's micropipelines needed in the subsequent discussion are presented in Section 2, followed by a description of the proposed scheme in Section 3. An adder circuit has been used to demonstrate and test the proposed approach. Section 4 presents this circuit together with the test results obtained through simulation.
BACKGROUND
The ideas presented in this paper build on several design techniques that have been described in the literature. The essential features of these techniques are summarized in this section.
TRUE SINGLE PHASE LOGIC
The property of TSPC logic that sets it apart from other clocking schemes is that only one clock signal is needed to control the precharge and evaluate phases of the circuit's operation. Also, the logic evaluation and latching mechanisms are nicely integrated, making it suitable for use in pipelined circuits.
TSPC circuits use two types of logic blocks. N-type blocks are active when the clock is high and P-type blocks are active when the clock is low. Figure 1 shows an N-logic block followed by a P-latch stage.
CCECE'97 0-7803-3716-6/97/$5.00 © 1997 IEEE
FIGURE 1: Basic TSPC N-stage and P-latch
COMPLETION DETECTION
For asynchronous logic it is necessary to know when a logic stage has completed the evaluation phase. This can be achieved using the dual rail structure shown in Figure 2, known as Dual Cascoded Voltage Switch Logic (DCVSL) [3]. During the precharge phase both out and its complement are precharged high, setting the complete signal low. During the evaluate phase, one of the precharge nodes is pulled down through either F or its complement, setting the complete signal high at the end of evaluation.
FIGURE 2: Dynamic logic with completion detection
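The completion rule described above can be sketched behaviourally as follows. This is a minimal sketch, not the paper's circuit: the node names `node_f` and `node_fbar` are hypothetical, and modelling the completion gate as a NAND of the two precharge nodes is an assumption that is merely consistent with the text (both nodes high gives complete low, one node low gives complete high).

```python
def dual_rail_stage(evaluate: bool, f_true: bool) -> tuple[int, int, int]:
    """Return (node_f, node_fbar, complete) for one precharge or evaluate phase.

    node_f is the precharge node discharged by the F pull-down network and
    node_fbar the node discharged by the complementary network. Both names
    are hypothetical labels chosen for this sketch.
    """
    if not evaluate:
        # Precharge phase: both nodes are pulled high.
        node_f, node_fbar = 1, 1
    elif f_true:
        # Evaluate phase with F true: the F network discharges its node.
        node_f, node_fbar = 0, 1
    else:
        # Evaluate phase with F false: the complementary network discharges.
        node_f, node_fbar = 1, 0
    # Assumed completion gate: NAND of the two precharge nodes.
    complete = 0 if (node_f and node_fbar) else 1
    return node_f, node_fbar, complete


if __name__ == "__main__":
    for evaluate, f_true in [(False, True), (True, True), (True, False)]:
        print(evaluate, f_true, dual_rail_stage(evaluate, f_true))
```

Exactly one node falls in every evaluation regardless of the value of F, which is why complete can serve as a data-independent "done" indication.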
MICROPIPELINES
The Sutherland micropipeline has been selected because its handshake signals can be easily used to control the precharge and evaluate phases of a dynamic CMOS circuit. The control used in this paper is simpler than in other micropipelines, as only one control signal is used both to derive timing information and to clock the latches in each stage.
These signals are generated using the Muller c-element, which is shown in Figure 3.
FIGURE 3: C-element symbol and schematic
The output value of the c-element doesn't change until the inputs converge to the same logic value. Note that there is a very weak feedback inverter on the internal node. This is necessary to prevent the node from being in an indeterminate state.
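The hold-until-agreement behaviour of a 2-input c-element can be illustrated with a small behavioural model. This is only a logic-level sketch; the class name `CElement` is hypothetical, and the stored `out` attribute stands in for the state held on the internal node by the weak feedback inverter.

```python
class CElement:
    """Behavioural model of a 2-input Muller c-element."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial  # state held while the inputs disagree

    def update(self, a: int, b: int) -> int:
        # The output follows the inputs only when both agree;
        # otherwise the previous value is retained.
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, c.update(a, b))
```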
A micropipeline in which c-elements provide asynchronous handshake signals between successive stages is shown in Figure 4.
FIGURE 4: The Sutherland Micropipeline
Initially the outputs of the c-elements and the Rin (input request) and Aout (output acknowledge) inputs are low. Setting Rin high forces the first c-element high, which in turn sets the rest of the c-elements high, eventually setting Rout high. That is, the input request will propagate to the output, independently of further changes at Rin. This shows the elastic nature of the pipeline, where events get processed for either a full pipeline or just a single request. The operation of the micropipeline can be summarized by saying that in each stage a c-element copies the state of the previous c-element when the states of the previous and next c-elements differ [2].
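The summary rule above (a c-element copies its predecessor when predecessor and successor differ) can be sketched as a small settling loop over the control chain. This is a zero-delay, logic-level sketch under stated assumptions: the function name `settle` is hypothetical, Rin is treated as the predecessor of the first c-element, and Aout as the successor of the last one.

```python
def settle(states: list[int], r_in: int, a_out: int) -> list[int]:
    """Apply the micropipeline copy rule until no c-element changes.

    states[i] is the output of the i-th c-element; the output of the
    last element plays the role of Rout.
    """
    states = list(states)
    changed = True
    while changed:
        changed = False
        for i in range(len(states)):
            prev = r_in if i == 0 else states[i - 1]
            nxt = a_out if i == len(states) - 1 else states[i + 1]
            # Copy the predecessor only when it differs from the successor.
            if prev != nxt and states[i] != prev:
                states[i] = prev
                changed = True
    return states


if __name__ == "__main__":
    s = settle([0, 0, 0], r_in=1, a_out=0)   # first request ripples to Rout
    print(s)
    s = settle(s, r_in=0, a_out=0)           # second event queues behind it
    print(s)
    s = settle(s, r_in=0, a_out=1)           # acknowledge releases the first
    print(s)
```

In this sketch the second transition on Rin stops one stage short of the output until Aout acknowledges the first, which is the elastic behaviour the text describes.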
CIRCUIT DESIGN
The objective of this paper is to demonstrate the use of pipelined TSPC circuits in an asynchronous framework. It is necessary to integrate the TSPC cells within the micropipeline structure such that the precharge and evaluate phases of the dynamic logic function correctly. Since there is no clock, the precharge and evaluate phases must be controlled using local signals derived from the micropipeline protocol.
BASIC CELL STRUCTURE
The basic cell, shown in Figure 5, is a modified TSPC block which accommodates dual-rail logic.
FIGURE 5: TSPC block with completion detection
Each cell has one N-logic block with a completion signal, a P-latch, and a c-element. The output of the c-element is used to provide the clocking (ctrl) signal to the logic blocks. The inverted inputs of the c-elements originate from previous cells providing data, and the non-inverted inputs are from subsequent cells accepting new data. These signals are opposite in polarity to the corresponding signals in Figure 4 for reasons that will become apparent shortly. Both the N-logic and P-latch of a cell are controlled by the same ctrl signal.
When the c-element is waiting for the proper handshaking signals, ctrl is low. In this state the N-logic is precharging high and the P-latch is latching a value from the previous evaluation. The complete signal in the N-logic is low, because both precharge nodes are precharged high. Thus, the low state of the c-element results in a low complete signal.
Once the c-element gets activated by the handshaking signals, the ctrl signal goes high. At this point the input data is guaranteed to be ready. This starts the evaluation phase of the N-logic and the P-latch shuts off. As evaluation takes place, one of the internal precharge nodes will evaluate to low, setting the complete signal high. Thus, the high state of the c-element eventually leads to a high value on the complete signal.
In summary this circuit structure has the nice feature of being able to transfer the state of the c-element to the complete signal output. However, this state transfer only occurs after the logic evaluation or internal precharge has taken place.
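The relationship between ctrl, the complete signal, and the latch can be sketched with a purely behavioural model. This is a minimal sketch under assumptions: the class `BasicCell`, its attribute names, and the idea of passing the logic function as a Python callable are all hypothetical, and analog effects such as charge leakage are ignored.

```python
from typing import Callable, Optional


class BasicCell:
    """Behavioural model of one cell: N-logic, P-latch, completion signal."""

    def __init__(self, logic: Callable[..., int]) -> None:
        self.logic = logic                      # function computed by the N-logic
        self.complete = 0                       # low while precharged
        self.evaluated: Optional[int] = None    # value on the dynamic node
        self.latch_out: Optional[int] = None    # output of the P-latch

    def set_ctrl(self, ctrl: int, inputs: tuple = ()) -> int:
        if ctrl:
            # ctrl high: evaluate; the P-latch is shut off, so latch_out
            # keeps its previous value. complete rises after evaluation.
            self.evaluated = self.logic(*inputs)
            self.complete = 1
        else:
            # ctrl low: precharge; the P-latch passes on the value from the
            # previous evaluation, and complete falls.
            if self.evaluated is not None:
                self.latch_out = self.evaluated
            self.complete = 0
        return self.complete


if __name__ == "__main__":
    cell = BasicCell(lambda a, b: a ^ b)
    print(cell.set_ctrl(1, (1, 0)), cell.latch_out)  # evaluating, latch unchanged
    print(cell.set_ctrl(0), cell.latch_out)          # precharging, result latched
```

The model makes the state transfer explicit: complete follows ctrl, but only after the corresponding evaluate or precharge step has been carried out.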
MICROPIPELINE INTEGRATION
We will now discuss how the basic cell of Figure 5 can be used in the micropipeline structure of Figure 4. The basic cell is inserted in the path between a c-element and its neighbours as shown in Figure 6. Since the state of the c-element is transferred to the complete signal, the micropipeline behaves in the same manner as described previously, except for polarity.
FIGURE 6: Integration of logic into the micropipeline
In the new structure, in-ready signals when the data is valid and out-ready indicates when the next stage has finished processing data and is ready to accept new inputs. These two signals correspond to the complete signal from the previous and next stages respectively. The in-ready and out-ready can be multiple signals because the data may come from different sources or go to multiple outputs. In such cases, multiple-input c-elements would be needed.
Flow control in the pipeline of Figure 6 differs somewhat from Figure 4. The complete signal of the basic block becomes high at the end of the evaluate phase. However, the result is not available at the latch output until precharge takes place and complete becomes low. Thus, the evaluate phase for one stage must start during the precharge phase of the previous stage.
When the pipeline is idle, successive stages will be alternately in the evaluate and precharge phases of their operation. The complete signals will be alternately in the high and low states, respectively.
Assume that initially in-ready is low, the input data is valid, the ctrl input and complete output of the first basic cell of the pipeline are high, and the cell is in the evaluate phase. The second cell is in the precharge phase. When in-ready changes to high, the first cell enters the precharge phase. Meanwhile the result of the evaluation is transferred to the P-latch, becoming available for the next cell when complete goes low. The change in state of the complete signal causes the second stage to enter the evaluate phase, and so on.
C-ELEMENT AS THE LOCAL CLOCK
As noted earlier, the outputs of the c-elements in each basic cell replace the clock in synchronous TSPC. This enables a very good form of localized clock buffering.
The output inverter in the c-element can be appropriately sized to accommodate the load on the clock signal. If necessary, more buffering can be added to a given cell after the c-element without affecting the correctness of the circuit. Local buffering of the clock in a synchronous circuit would introduce unacceptable clock skew.
The c-element introduces the critical delay which determines the maximum operating speed of the circuit. Where multiple inputs are involved, we can use either a tree of 2-input c-elements or a single multi-input element. The tree structure is preferable because for the same latency, the layout area of the tree structure is smaller. At the same time, the input gate load for the tree structure can be made smaller when sizing these input transistors, helping speed up the circuit.
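The tree alternative can be sketched at the logic level as follows. This is a minimal sketch: the class names `CElement2` and `CElementTree` are hypothetical, delays and transistor sizing are not modelled, and an odd signal at any level is simply passed up to the next level unchanged.

```python
class CElement2:
    """2-input c-element: output changes only when both inputs agree."""

    def __init__(self) -> None:
        self.out = 0

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class CElementTree:
    """Multi-input c-element built as a tree of 2-input c-elements."""

    def __init__(self, n_inputs: int) -> None:
        if n_inputs < 2:
            raise ValueError("need at least two inputs")
        self.n_inputs = n_inputs
        self.levels: list[list[CElement2]] = []
        width = n_inputs
        while width > 1:
            self.levels.append([CElement2() for _ in range(width // 2)])
            width = width // 2 + width % 2  # an odd signal passes through

    def update(self, inputs: list[int]) -> int:
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        signals = list(inputs)
        for level in self.levels:
            nxt = [c.update(signals[2 * i], signals[2 * i + 1])
                   for i, c in enumerate(level)]
            if len(signals) % 2:
                nxt.append(signals[-1])
            signals = nxt
        return signals[0]


if __name__ == "__main__":
    tree = CElementTree(4)
    for pattern in ([1, 1, 1, 0], [1, 1, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]):
        print(pattern, tree.update(pattern))
```

For handshake inputs that all rise before any falls, the tree output matches that of a single multi-input element: it rises only after every input is high and falls only after every input is low.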
OPTIMIZATIONS
In general there may be multiple in-ready and out-ready signals entering each basic cell, one for each source or sink of data. This is needed to implement a system that is fully asynchronous across all paths, and can tolerate large variable delays anywhere. As the number of inputs of the c-element increases, so does its propagation delay. In many cases, however, the above methodology is too rigorous. If the relative timing of different in-ready or out-ready signals is known, the latest arriving one is sufficient to control the pipeline.
SIMULATION AND RESULTS
Asynchronous circuits based on the proposed micropipeline structure were designed in NORTEL's 0.8 µm BiCMOS technology. The design was carried out using CADENCE software and the resulting circuits were simulated using HSpice.
Two circuits were tested. The first is an asynchronous FIFO structure, consisting of a series of basic cells chained together as shown in Figure 6. It served to determine the upper bound on the speed of circuits implemented using this approach.
The second circuit tested was a 4-bit adder, which was used to test the functionality and performance of the proposed structure. Each pipeline stage consists of a 1-bit adder. It is worth noting that by using XOR functions, most of the logic for F and its complement that implement the Carry and Sum signals could be shared. In the first version of the adder, each pipeline stage used a 4-input tree of c-elements. Three of these inputs combine to form the in-ready signal, representing the three inputs of a 1-bit full adder.
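The logic function of one adder stage can be sketched as below, with the XOR term computed once and reused for both Sum and Carry. This is only a functional sketch under assumptions: the function names are hypothetical, each dual-rail output is represented as a (true, complement) pair, and the loop in `ripple_add4` stands in for the four pipeline stages without modelling handshaking or timing.

```python
def full_adder_dual_rail(a: int, b: int, cin: int):
    """Return ((sum, sum_bar), (carry, carry_bar)) for one 1-bit stage."""
    p = a ^ b                   # shared XOR term
    s = p ^ cin                 # Sum reuses p
    cout = cin if p else a      # Carry reuses p: propagate cin, else a (== b)
    return (s, 1 - s), (cout, 1 - cout)


def ripple_add4(x: int, y: int) -> int:
    """Add two 4-bit values by chaining four 1-bit stages along the carry."""
    carry = 0
    total = 0
    for i in range(4):
        (s, _), (carry, _) = full_adder_dual_rail((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total | (carry << 4)


if __name__ == "__main__":
    print(ripple_add4(9, 7))
    print(ripple_add4(15, 15))
```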
Critical delay in an adder circuit occurs along the carry chain. Thus, the handshake signals associated with the carry dominate the pipeline timing. Based on this observation, a second version of the adder circuit was implemented, using only one 2-input c-element for each pipeline stage.
The adder circuit was first tested by applying requests that arrived at random intervals. Then, the circuit was tested for speed by clocking the input request lines as fast as possible in a synchronous fashion. Inputs were provided to the circuit by registers clocked by the request signals.
Simulation results for the FIFO and the two versions of the adder are given in Table 1. The results for a 4-bit adder designed as a synchronous pipeline are also given. However, proper comparison is not possible in a small circuit as there are no clock skew problems. The limiting factor for the speed of the FIFO circuit is the delay in the handshake signalling that controls the operation of the micropipeline. This delay increases when multiple-input c-elements are needed, as in the case of the basic adder circuit. The speed of the optimized adder, which uses only 2-input c-elements, approaches that of the FIFO.
The proposed circuit is free from clock skew limitations, as all clocking signals are generated locally. The circuit also has the potential for low power dissipation, because no switching takes place when the circuit is idle.
