This paper presents new high-performance building blocks for two-phase micropipelines. We develop pseudo-static Svensson-style double edge-triggered D-flip-flops (DETDFF) for datapath storage in place of traditional capture-pass or transmission gate latches. We compare a DETDFF FIFO buffer implementation with the current state-of-the-art micropipeline implementation using four-phase controllers designed by Day and Woods for the AMULET-2 processor. We implemented both designs in the MOSIS 1.2 μm CMOS process and simulated them under the worst-case process corner with a 4.6V power supply and at 100°C. Our SPICE simulations show that the DETDFF design has 70% higher throughput. This higher throughput is due to latching the data on both edges of the latch control, which removes the need for a reset phase and simplifies the control structures. In addition, we present two commonly used micropipeline event-control structures, the select and toggle elements, implemented using the extended-burst-mode 3D synthesis system. Detailed simulations demonstrate that our implementations are up to 50% faster than traditional implementations. This speed advantage can be primarily attributed to careful application of generalized C-elements rather than discrete basic gates.
Introduction
Micropipelines is a popular design style for building asynchronous circuits, introduced by Ivan Sutherland in his Turing Award Lecture in 1988 [5]. It is a building-block approach that makes designing deeply pipelined circuits relatively easy and facilitates the reuse of combinational functional units previously designed for synchronous circuits. Micropipelines is based on a two-phase, or transition-signaling, event-driven communication protocol in which an event is either a low-to-high or a high-to-low transition on a control wire, with no distinction made between the two. Since traditional latches are level-sensitive, however, traditional designs required two-to-four and four-to-two phase converters, which hinder performance. The AMULET group recently designed a new set of control blocks using four-phase event-driven communication, which eliminates the need for phase converters and thereby significantly improves performance.
This paper introduces a simple but effective idea: replacing traditional latches (e.g., capture-pass and transparent) in two-phase micropipelines with pseudo-static double edge-triggered D-flip-flops (DETDFF). DETDFFs are fast and compact when designed with two Svensson-style latches placed in parallel [1] and can be easily made pseudo-static. They integrate seamlessly into the two-phase micropipeline methodology without the need for costly two-to-four phase converters. DET flip-flops, however, have higher input capacitance than level-sensitive latches and do not provide a "latch-done" signal, which means that additional timing assumptions must be analyzed and accounted for to ensure correct operation. We designed both a DETDFF FIFO buffer and the leading (four-phase) FIFO buffer designed by Day and Woods [2] in the MOSIS 1.2 μm CMOS process using the same transistor sizing. We simulated both designs under the worst-case process corner with a 4.6V power supply and at 100°C. Our design achieves a cycle time of 3.9ns, while Day and Woods's design achieves a cycle time of 6.7ns, which is 2.8ns slower. Thus, our design is 70% faster than their design. The cost of this improvement is that the DETDFF design requires approximately twice the area of conventional designs and consumes up to four times as much power.
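The 70% figure can be checked directly from the two reported cycle times. The sketch below is only illustrative arithmetic; the function names are ours, and throughput is taken to be the reciprocal of cycle time.

```python
def throughput_mhz(cycle_ns):
    """Throughput in millions of data items per second for a given cycle time."""
    return 1000.0 / cycle_ns


def throughput_gain_percent(slow_cycle_ns, fast_cycle_ns):
    """Relative throughput gain of the faster design, in percent."""
    return (slow_cycle_ns / fast_cycle_ns - 1.0) * 100.0


# Cycle times reported in the text.
detdff_cycle_ns = 3.9
four_phase_cycle_ns = 6.7

gain = throughput_gain_percent(four_phase_cycle_ns, detdff_cycle_ns)
print(f"DETDFF FIFO:     {throughput_mhz(detdff_cycle_ns):.0f} MHz")
print(f"Four-phase FIFO: {throughput_mhz(four_phase_cycle_ns):.0f} MHz")
print(f"Throughput gain: {gain:.0f}%")
```

The computed gain is slightly above 70%, which is consistent with the rounded figure quoted above.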
To extend the application of micropipelines to general pipeline circuits, Sutherland also introduced a set of event-control building blocks [5]. A C-element acts as the AND of two events, synchronizing two independent pipeline streams. An xor gate acts as the OR (or MERGE) of two independent events. A select element routes an input event to one of two outputs depending on a select input, facilitating data-dependent choice. A toggle element routes successive input events alternately to one of two outputs. We present improved generalized-C-element-based implementations of the more complicated select and toggle elements, obtained using the burst-mode 3D synthesis system [8, 7, 9].
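The behavior of the two simplest event-control blocks can be sketched in a few lines. This is a purely behavioral model under two-phase signaling (wire levels 0/1, where any change of level is an event); the class and function names are ours, and no circuit-level detail is implied.

```python
class CElement:
    """Muller C-element: the output copies the inputs only when they agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:          # both inputs have seen an event: fire
            self.out = a
        return self.out     # otherwise hold the previous value


def merge(a, b):
    """XOR as event MERGE: an event on either input produces an output event."""
    return a ^ b


c = CElement()
print(c.update(1, 0))  # only one input has changed, so the output holds
print(c.update(1, 1))  # both inputs have changed, so the output changes
print(merge(1, 0))     # an event on one input appears at the output
```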
We designed both the traditional implementations of these elements [4] and our implementations in the MOSIS 1.2 μm CMOS process and simulated them under the worst-case process corner with a 4.6V power supply and at 100°C. Our designs demonstrated significant improvements in latency and cycle time, as much as 1.8ns vs 2.5ns (40%) for latency and 2.2ns vs 3.3ns (50%) for cycle time for the select element and 1.3ns vs 1.9ns (50%) for latency and cycle time for the toggle element. The results demonstrate that although the synthesis paradigm is based on the fundamental-mode assumption [6, 7], the two synthesized designs have negligible settling time delay and thus realistic environments can easily adhere to the fundamental-mode assumption. Compared with traditional designs, our select element is approximately 7approximately 38designs is ...
The remainder of the paper is organized as follows. Section 2 reviews the background of micropipelines. Section 3 presents the DETDFF FIFO and our comparison with traditional pipeline structures. Section 4 describes the synthesis of our burst-mode implementations of the select and toggle elements and compares them with traditional designs. Section 5 presents our conclusions.
Background: Micropipelines
In a one-dimensional asynchronous FIFO, data is usually depicted moving from left to right, as illustrated in Figure 1. Once stage i receives data, it sends a request to stage i + 1 to pass the data on. If stage i + 1 is empty, it latches the new data and sends an acknowledgment to stage i. This tells stage i that the data transfer is complete and that stage i can accept new data.
Numerous implementations of this request/acknowledge communication have been suggested and analyzed. As depicted in Figure 1, Sutherland's original paper [5] suggests an implementation in which neighboring stages communicate by two wires, one wire for request and one wire for acknowledgment. The request wire is labeled R_out at the output of stage i (the source) and R_in at the input to stage i + 1 (the sink). Similarly, the acknowledge wire is labeled A_in at the input to stage i + 1 (the source) and A_out at the output of stage i (the sink). Sutherland's original paper suggests a two-phase handshaking protocol, illustrated in Figure 2(a), in which the low-to-high and high-to-low transitions of any control wire, both referred to as events, have the same meaning. For example, both kinds of R_in event indicate that new data is valid at the input to the stage.
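The two-phase protocol can be made concrete by listing the wire levels seen during successive transfers. The sketch below is a minimal model with a single sender and receiver; the function name and the trace format are ours.

```python
def two_phase_trace(n_transfers):
    """Return the (wire, new_level) events for n back-to-back transfers."""
    req = ack = 0
    trace = []
    for _ in range(n_transfers):
        req ^= 1                     # any transition on request: data valid
        trace.append(("req", req))
        ack ^= 1                     # any transition on acknowledge: data taken
        trace.append(("ack", ack))
    return trace


for wire, level in two_phase_trace(2):
    print(wire, level)
```

Each transfer costs exactly one event per wire: the first transfer uses rising transitions and the second uses falling transitions, with no reset phase in between.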
PUT IN DESCRIPTION OF CAPTURE PASS LATCHES HERE
Because capture-pass latches are large and consume significant power, the AMULET group proposed using transparent latches with level-sensitive enables. Building fast implementations using these latches is challenging because it requires designing a two-to-four-phase interface around the level-sensitive enables. A simple implementation using an xor and a toggle element is illustrated in Figure 3 [3, 4]. The cycle time of this two-phase design includes the delay overhead of both the xor and toggle elements, limiting its performance.
PAVER DEMONSTRATES THIS CAN BE ACHIEVED WITH a Xand a X

To improve the performance, the AMULET group suggested a number of optimizations [4, 2]. The fastest circuits they have developed abandon the two-phase paradigm and adopt the four-phase signaling depicted in Figure 2(b). In four-phase (or level-sensitive) signaling, the rising and falling transitions of the interface signals are distinguished. The rising transitions are the active transitions that indicate data valid and data consumed, and the falling transitions reset the control wires for the next communication.
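For comparison with two-phase signaling, the sketch below lists the transitions of a four-phase handshake and labels each as active or reset. The ordering of the two reset transitions shown here is one common return-to-zero ordering and is an assumption; the text notes that the placement of the reset phase is a design choice. The function name and trace format are ours.

```python
def four_phase_trace(n_transfers):
    """Return (wire, new_level, kind) events for n four-phase transfers."""
    trace = []
    for _ in range(n_transfers):
        trace.append(("req", 1, "active"))  # rising request: data valid
        trace.append(("ack", 1, "active"))  # rising acknowledge: data consumed
        trace.append(("req", 0, "reset"))   # falling request: return to zero
        trace.append(("ack", 0, "reset"))   # falling acknowledge: ready again
    return trace


events = four_phase_trace(3)
resets = [e for e in events if e[2] == "reset"]
print(len(events) // 3, "transitions per transfer,", len(resets) // 3, "of them resets")
```

Half of the transitions carry no information, which is the overhead that a two-phase design avoids.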
To implement four-phase micropipeline control circuits, careful consideration must be given to the placement of the reset phase of the request and acknowledge control lines. The fastest four-phase micropipeline control circuit for FIFO applications that they developed is the semi-decoupled circuit depicted in Figure 4.
Two-Phase Pipeline Circuit
We can build a simple two-phase asynchronous pipeline circuit using double edge-triggered D-flip-flops as storage elements and a C-element to control each stage of the pipeline, as shown in Figure 5.
This pipeline circuit is similar to Sutherland's micropipeline, except that it uses double edge-triggered flip-flops in place of capture-pass latches and relies on a simple set of timing constraints for correct operation.
To make this circuit simpler to understand, assume for now that the delays of the latch control ("clock") buffers for the DETDFFs are negligible. Consider stage i, whose inputs are R_in and A_out and whose outputs are A_in and R_out. When its left neighbor toggles R_in, signaling that the data D_in is valid, stage i's pipeline control toggles its output A_in (provided the previous request to the right neighbor has been acknowledged). Toggling of the control C-element acknowledges the receipt of data to the left neighbor, latches the data in stage i's storage (DETDFFs), and enables R_out to toggle after a bundling delay. The bundling delay is necessary to provide sufficient data setup time for stage i + 1, i.e., D_out of stage i must be valid before stage i + 1's latch control toggles.
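The control behavior described above can be exercised with a small untimed simulation. The sketch below assumes zero buffer and bundling delays, models each stage's control as a C-element whose acknowledge input is inverted, and updates stages from right to left so that a stage latches its input before the upstream stage overwrites it. All names are ours, and the model says nothing about real circuit timing.

```python
def run_fifo(items, n_stages=3):
    """Push items through an n-stage two-phase FIFO; return what the sink sees."""
    c = [0] * n_stages            # C-element outputs: A_in to the left, R_out to the right
    data = [None] * n_stages      # DETDFF contents of each stage
    src_req, src_data = 0, None   # request wire and data driven by the source
    sink_ack = 0                  # acknowledge wire driven by the sink
    pending = list(items)
    received = []
    while len(received) < len(items):
        # Source: issue a new request once the previous one is acknowledged.
        if pending and src_req == c[0]:
            src_data = pending.pop(0)
            src_req ^= 1
        # Stages, right to left, so a token advances one stage per pass.
        for i in reversed(range(n_stages)):
            r_in = src_req if i == 0 else c[i - 1]
            d_in = src_data if i == 0 else data[i - 1]
            a_out = sink_ack if i == n_stages - 1 else c[i + 1]
            # C-element with inverted acknowledge: fires when r_in != a_out.
            if r_in != a_out and c[i] != r_in:
                c[i] = r_in       # one toggle acknowledges left and requests right
                data[i] = d_in    # the same toggle clocks the DETDFFs
        # Sink: consume the output and acknowledge with a single transition.
        if c[-1] != sink_ack:
            received.append(data[-1])
            sink_ack ^= 1
    return received


print(run_fifo(["a", "b", "c", "d"]))
```

Items leave the FIFO in the order they entered, and every control wire makes exactly one transition per item.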
In practice, the latch control buffers incur significant delays (1.9ns delay to drive 32 DETDFFs in our simulation). Thus, the data can be delayed accordingly. In that case, toggling of the request line means that the data will be available at the next stage after some delay (roughly the same as the latch control buffer delay). In order for this pipeline circuit to function correctly, two timing constraints must be met:

This inequality is trivially satisfied because t_buf,i = t_buf,i+1 and t_ck→Q,i > t_h,i+1 in general.

Figure 1: An asynchronous pipeline.
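The inequality itself did not survive in the text. One form that is consistent with the remark about when it is trivially satisfied is a hold-time condition between neighboring stages, written here as an assumption rather than as the original formula. The symbols are read as follows: t_buf is the latch control buffer delay, t_ck→Q is the clock-to-output delay of the DETDFF, and t_h is its hold time.

```latex
% Assumed reconstruction: new data launched by stage i must not reach
% stage i+1 before stage i+1 has finished holding the previous data.
t_{buf,i} + t_{ck \to Q,\, i} \;>\; t_{buf,\, i+1} + t_{h,\, i+1}
```

With equal buffer delays in adjacent stages, this reduces to requiring that the clock-to-output delay exceed the hold time, which matches the two conditions quoted above.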
The minimum cycle time of this pipeline is the minimum time interval between successive toggling of a request signal:

In Afghahi and Yuan's original design, once the latch control signal rises, nodes x and y float, so the design must rely on the capacitance on those nodes to hold the charge. However, DETDFFs used in asynchronous pipelines cannot depend on the capacitance to hold the logic level, because the next transition of the latch control signal may arrive at an arbitrary time. Our design uses two weak PMOS transistors to maintain the voltage level when the latch control signal is high. When Q′ becomes low, y is pulled up by P6, which keeps the N-stack of the right inverting stage turned on, keeping Q′ low. On the other hand, to keep Q′ high after it becomes high, y must remain low. Our circuit accomplishes this by turning on P5 when y becomes low, which keeps the N-stack of the middle inverting stage turned on. The FETDFF functions similarly as the latch control signal switches from high to low.
Note that feeding back Q′ to both the RETDFF and FETDFF sections (which happens because the outputs are dotted to form the DETDFF) does not cause the inactive section to turn on inadvertently. For example, when the latch control signal is high, Q′ = 0 has no effect. Neither does Q′ = 1, because node z is already pulled down due to the latch control signal being high.
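At the behavioral level, the DETDFF simply captures its input on every transition of the latch control signal, with the rising-edge and falling-edge sections each handling one polarity. The sketch below models only that logical behavior; it does not represent the pseudo-static keeper transistors or any node-level detail, and the class and method names are ours.

```python
class DETDFF:
    """Behavioral double edge-triggered D flip-flop."""

    def __init__(self, q=0, clk=0):
        self.q = q
        self._clk = clk

    def tick(self, clk, d):
        """Apply new clock and data levels; return the stored value."""
        rising = self._clk == 0 and clk == 1    # captured by the RETDFF section
        falling = self._clk == 1 and clk == 0   # captured by the FETDFF section
        if rising or falling:
            self.q = d
        self._clk = clk                         # no edge: the output holds
        return self.q


ff = DETDFF()
for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0)]:
    print(clk, d, ff.tick(clk, d))
```

The stored value changes only on the second and fourth steps, which are the rising and falling edges of the control signal.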
Simulation Results
We implemented 5 stages of our DET pipeline circuit and Day and Woods's semi-decoupled pipeline circuit in the MOSIS 1.2 μm CMOS process. We used consistent transistor sizing for both designs: W/L of the minimum-size transistors for control circuits is 45 /2 for PMOS and 20 /2 for NMOS.
In addition, for both designs, we adjusted the capacitive loading on the latch control signals to drive 32-bit latches/flip-flops. In Day and Woods's design, the capacitive loading of 32-bit latches corresponds to 1pF. Because the capacitive loading of the latch control signal in our DETDFF is 4 times that of the n-type true single-phase latches used in Day and Woods's design (8 vs 2 transistors), we designed our latch control buffer to drive 4pF in each stage.
We simulated both designs with the worst-case process parameters using a 4.6V power supply at 100°C. The Mentor Graphics Accusim analog simulator was used for simulation.
The simulation traces are shown in Figure 7. A breakdown of the cycle times for both designs is given in

There are two key factors that contributed to the difference in cycle times. The first, obvious factor is that the DETDFF is capable of latching data on both edges of its enable signal. The second factor is that the acknowledgment of the receipt of data is made very early: as soon as the request from the left neighbor is detected (and, of course, the last request to the right neighbor has been acknowledged). We can afford to do that because the previous pipe stage can generate the next data no faster than the delay it takes for the acknowledgment signal to propagate through the latch enable buffer.
If the latch enable buffer delay is removed from Day and Woods's semi-decoupled four-phase micropipeline design, their design can also shave off somewhat less than 1.1ns from the minimum cycle time, as illustrated in Figure 8. Even so, our design would still be 50% faster than Day and Woods's design.
Area and power comparison
The main disadvantages of the pseudo-static DETDFFs are larger area and heavily loaded latch control lines. Each DETDFF is approximately twice the size of a standard edge-triggered flip-flop, and the capacitance on the DETDFF control lines is approximately four times greater than that of standard level-sensitive latch control lines, yielding four times more energy consumption in the control lines per transition. The overall effect on power consumption, however, is not clear, because the DETDFF control lines switch half as often as those of level-sensitive designs and the DETDFFs are naturally blocking [4], preventing glitches from propagating through multiple datapath stages.

1 We note that the fraction of cycle time consumed by every component in the simulated implementation of Day and Woods's circuit is virtually the same as in their published delays, supporting the claim that the implementation of their design is reasonable.
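A rough estimate of the control-line energy per data item illustrates the trade-off between a heavier load and fewer transitions. The sketch below assumes the simple switching-energy model of one half C V^2 per transition, the 1pF and 4pF loads and 4.6V supply quoted earlier, one control transition per item for the two-phase DETDFF design, and two for a four-phase design. It ignores the datapath, short-circuit current, and glitching, so it is not a power estimate for either circuit.

```python
def control_line_energy_pj(load_pf, vdd, transitions_per_item):
    """Switching energy per data item in pJ, using 0.5 * C * V^2 per transition."""
    return 0.5 * load_pf * vdd ** 2 * transitions_per_item


VDD = 4.6  # volts, as in the simulations

detdff_pj = control_line_energy_pj(4.0, VDD, transitions_per_item=1)
four_phase_pj = control_line_energy_pj(1.0, VDD, transitions_per_item=2)

print(f"DETDFF control line:     {detdff_pj:.1f} pJ per item")
print(f"Four-phase control line: {four_phase_pj:.1f} pJ per item")
print(f"Ratio: {detdff_pj / four_phase_pj:.1f}x")
```

Under these assumptions, the fourfold energy per transition is halved by the lower switching rate, leaving roughly a factor of two on the control lines alone.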
Two-Phase Micropipeline Control Circuits
In this section, we present two commonly used micropipeline event-control structures: select and toggle elements. These elements were specified in extended burst-mode and synthesized using the 3D synthesis system. Detailed simulations demonstrate that our implementations are up to 50% faster than traditional designs. Figures 9 and 10 show the extended burst-mode select circuit and a conventional one used in AMULET-1.
Select
Figure 9: Extended burst-mode select circuit. For all transistors except those in the weak inverters, L = 2λ.
We implemented both circuits in the MOSIS 1.2 μm CMOS process and simulated them under the same conditions as before. We simulated both circuits simultaneously using the same test inputs. The T and F traces are from the outputs of our select circuit, and T.p and F.p are from the outputs of the AMULET-1 select circuit. Tin and Fin are outputs of xor gates (see Figure 10). The AMULET-1 select circuit always propagates its input transition to both Tin and Fin; however, only one event is selected and propagated to the output of the associated transparent latch. For example, if sel is high, then the event on Tin propagates to T. When T changes, this change is fed back to the input of the xor whose output is Fin and cancels the event not selected. The environment of the circuit must wait for the feedback delay through the xor gates before changing the sel input.
The cycle time of our select circuit corresponds to the delay from toggling of in to a change in T′ or F′. The cycle time of the AMULET-1 select circuit is the delay from toggling of in to the cancellation of Tin or Fin, that is, not the first change in Tin or Fin after an input change but the change in Tin or Fin caused by the fed-back F or T transition. As shown in Table 2, our select circuit has a latency of 1.8ns and a cycle time of 2.2ns. The circuit from AMULET-1 has a latency of 2.5ns and a cycle time of 3.3ns. Thus our circuit is faster than the AMULET-1 circuit by 40 to 50%.
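The function that both select circuits implement can be stated as a short behavioral model: each input event toggles exactly one of the two outputs, chosen by the level of sel at that moment. The sketch below captures only this logical behavior, with names of our choosing; it does not model the xor feedback of the AMULET-1 circuit or the timing of either implementation.

```python
class Select:
    """Behavioral two-phase select element."""

    def __init__(self):
        self.t = 0   # level of the T output wire
        self.f = 0   # level of the F output wire

    def event(self, sel):
        """Consume one input event; toggle T if sel is high, otherwise F."""
        if sel:
            self.t ^= 1
            return "T"
        self.f ^= 1
        return "F"


s = Select()
print([s.event(sel) for sel in (1, 1, 0, 1)])
print(s.t, s.f)
```

After three events routed to T and one routed to F, both output wires sit at level 1, since only transitions carry meaning.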
Ignoring the small weak inverters, our select element design requires 
Toggle
The toggle is a circuit that routes input events to two outputs alternately. For example, if an input event causes the output dot to toggle, then the next event will cause the output blank to toggle, and the event after that will cause the output dot to toggle again. We implemented the burst-mode toggle circuit (see Figure 12) and the one used in AMULET-1 in the MOSIS 1.2 μm CMOS process and simulated them under the same conditions as before. Again, we simulated both circuits simultaneously using the same test inputs. Simulation traces are shown in Figure 13. The dot and blank traces are from the outputs of our toggle circuit, and dot.p and blank.p are from the outputs of the AMULET-1 toggle circuit. As shown in Table 3, in our circuit the delay from an event on in to a change in one of the outputs is 1.3ns, and in the AMULET-1 circuit it is 1.9ns. Thus our circuit is 50% faster than the AMULET-1 circuit. Note that both circuits are designed to be speed-independent. Our circuit was initially designed as a fundamental-mode circuit but was turned into a speed-independent one by feeding back blank from the internal node of the blank logic, instead of inverting the blank output and feeding the inverted signal back. This was done to make the comparison simpler.
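The alternating behavior of the toggle can be captured with one bit of internal state. The sketch below is a behavioral model only; the names are ours, and it assumes that the first event after reset goes to dot, which the text does not state explicitly.

```python
class Toggle:
    """Behavioral two-phase toggle element."""

    def __init__(self):
        self.dot = 0               # level of the dot output wire
        self.blank = 0             # level of the blank output wire
        self._next_is_dot = True   # assumption: the first event goes to dot

    def event(self):
        """Consume one input event and toggle the next output in turn."""
        if self._next_is_dot:
            self.dot ^= 1
            name = "dot"
        else:
            self.blank ^= 1
            name = "blank"
        self._next_is_dot = not self._next_is_dot
        return name


tg = Toggle()
print([tg.event() for _ in range(4)])
```

Each output therefore sees every other input event, so each output wire toggles at half the input event rate.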
This paper presented improved building blocks for two-phase micropipelines. The first improved building block is a pseudo-static double edge-triggered D-flip-flop, which updates data on both transitions of the latch control signal. As a result, no reset phase is needed and efficient two-phase control structures can be used. The second and third are burst-mode select and toggle elements, which are significantly faster than previous hand-designed implementations. The source of the increased speed is the application of generalized C-elements, which compactly implement the function of a collection of discrete gates. In spite of the fact that the synthesis method, 3D, is based on the fundamental-mode restriction, the synthesized circuits have looser environmental timing restrictions than the hand-designed circuits, demonstrating the versatility of the fundamental-mode assumption. The improved speed of these circuits also suggests that more complex two-phase control structures might benefit from being formally specified and automatically synthesized rather than decomposed by hand into basic event-control elements.
