A two-phase control wrapper for a micropipeline is presented. The wrapper is implemented in an Artisan 0.13µ standard cell library that has not been augmented with any special cells for asynchronous design. The wrapper supports early evaluation, which allows the output to be updated after only a subset of the inputs has arrived, thus improving the throughput of the micropipeline.
will either all transition low-to-high or all transition high-to-low. Once all Cin inputs and acknowledgements have transitioned, the C-element output transitions high-to-low or low-to-high. The XOR gate and the Cout loopback signal generate a high pulse on the GC signal when the C-element output changes state, latching the new outputs. The delay elements on the Cin inputs match the delay of the control path to that of the compute function path.

Figure 2 shows the wrapper of Figure 1 modified to support early evaluation. Early evaluation was used for performance enhancement of the microprocessor design presented in [5]. An early fire is defined as the EE_sel signal being a '1' after arrival of the early control inputs (the inputs to the trigger C-element). This causes the data (Dout) and control (Cout) signals to be updated after the trigger C-element toggles. The acknowledgement Aout is updated only after all inputs have arrived and the late C-element has toggled. After an early fire, the input delays on the late-arriving inputs (the inputs to the late C-element) should be short-circuited so that the acknowledgement (Aout) is produced as quickly as possible once all inputs have arrived.

Figure 3 shows the initial design of the DKill delay element. A single multiplexer cannot be used to bypass a long delay chain, because an input transition from the previous early fire may still be traversing the delay chain when the inputs for the next firing arrive, producing a hazard on the input to the late C-element.

A normal fire occurs when EE_sel is a '0' after arrival of the early control inputs. In this case, the Dout/Cout/Aout outputs are updated after all control inputs arrive and the late C-element toggles.
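The firing rules above can be illustrated with a small behavioral sketch. It is not the gate-level design of Figures 1 and 2: gate and wire delays are ignored, and the names `CElement`, `gc_pulse`, and `fire_times` are hypothetical helpers introduced only for illustration. The sketch assumes that GC is simply the XOR of the C-element output and the Cout loopback, and that the late C-element toggles once every early and late input has arrived.

```python
class CElement:
    """Muller C-element: the output changes only when all inputs agree."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        if all(v == 1 for v in inputs):
            self.out = 1
        elif all(v == 0 for v in inputs):
            self.out = 0
        return self.out  # inputs disagree: the previous value is held


def gc_pulse(c_out, cout_loopback):
    """GC is high while the C-element output and the Cout loopback differ."""
    return c_out ^ cout_loopback


def fire_times(early_arrivals, late_arrivals, ee_sel):
    """Return (t_dout_cout, t_aout) for one firing, ignoring gate delays.

    early_arrivals: arrival times of the inputs to the trigger C-element
    late_arrivals:  arrival times of the inputs to the late C-element
    ee_sel:         1 for an early fire, 0 for a normal fire
    """
    t_trigger = max(early_arrivals)
    t_all = max(list(early_arrivals) + list(late_arrivals))
    # Early fire: Dout/Cout update when the trigger C-element toggles.
    # Normal fire: Dout/Cout wait for the late C-element, like Aout.
    t_dout_cout = t_trigger if ee_sel else t_all
    return t_dout_cout, t_all


if __name__ == "__main__":
    c = CElement(0)
    cout = 0  # Cout loopback still holds the value from the previous firing
    for inputs in ([1, 0, 0], [1, 1, 0], [1, 1, 1]):
        print(inputs, "GC =", gc_pulse(c.update(inputs), cout))
    # Arrival times below are arbitrary illustrative values.
    print("early fire :", fire_times([2.0, 3.0], [5.0, 7.0], ee_sel=1))
    print("normal fire:", fire_times([2.0, 3.0], [5.0, 7.0], ee_sel=0))
```

In this model an early fire updates Dout/Cout as soon as the last early input arrives, while Aout waits for the last input of either kind in both firing modes.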
A Two-Phase Wrapper with Early Evaluation:
The delay block on the output of the late C-element is needed for a normal fire if the difference between the Aout delay and Dout/Cout delay paths is large, which can occur if the GC signal drives a large number of latch inputs. If Aout is provided too far in advance of Dout/Cout, a predecessor block can change the input value to this stage, corrupting the compute function output value before it has been latched by the GC signal.
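The race described above can be expressed as a simple timing check. The sketch below is an assumption-laden illustration, not the sizing method used in the design: the function `required_aout_delay` and its three parameters are hypothetical, all times are measured from the toggle of the late C-element, and the predecessor is modeled by a single minimum response time.

```python
def required_aout_delay(t_gc_latch, t_aout, t_pred_min):
    """Extra delay to add on the Aout path (same time unit as the inputs).

    t_gc_latch: time at which the GC pulse has latched the compute output
    t_aout:     time at which Aout would appear with no added delay
    t_pred_min: minimum time for a predecessor to react to Aout and for
                its new data to reach this stage's compute output
    """
    # The stage is safe when new data cannot arrive before latching is done.
    slack = (t_aout + t_pred_min) - t_gc_latch
    return max(0.0, -slack)


if __name__ == "__main__":
    # Hypothetical numbers in nanoseconds: a heavily loaded GC signal
    # makes latching slow relative to the Aout path.
    print(required_aout_delay(t_gc_latch=0.9, t_aout=0.2, t_pred_min=0.4))
    # A lightly loaded GC signal needs no added delay.
    print(required_aout_delay(t_gc_latch=0.5, t_aout=0.2, t_pred_min=0.4))
```

A positive result corresponds to the case in the text where GC drives many latch inputs and the delay block on the late C-element output is required.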
The asynchronous microprocessor design presented in [5] has subsequently been redesigned and synthesized to an Artisan 0.13µ standard cell library. C-elements were mapped to standard cells using the approach in [4]. Pre-layout gate-level Verilog simulations using back-annotated SDF timing indicated that the early evaluation wrapper design of Figure 2 was slow to produce an acknowledgement after an early fire, primarily because of excessive loading on the late control input signals by the DKill block. The wrapper was also slow to produce a new Dout output when an early fire followed a normal fire (EE_sel '0' → '1'), because of excessive loading by the DKill block on the EE_sel signal. Figure 4 shows a redesign of the early evaluation wrapper that has a dedicated C-element for producing a fast acknowledgement after an early fire.
The DKill block was also redesigned, as shown in Figure 5, to reduce loading on the Cin input signals and the EE_sel signal. The new DKill design uses two delay blocks; the toggling of the sel signal routes the a input between the two delay blocks so that one delay block is 'recovering' while the other delay block is 'active'. Normal operation is either a+ → N1+ (sel = 1) or a- → N0- (sel = 0), where the full delay chain penalty is incurred. An early fire can cause sel to change while the a transition is still within dly1 or dly0. A change in sel selects the opposite delay path, whose value is the normal arrival value for the previous delay path.

The Program Counter block in the redesigned asynchronous microprocessor has six late inputs and four early inputs; this block is used as an example in Table 1 to contrast the performance of the two wrapper designs. The maximum number of delay elements on a late control input was 9. Table 1 shows that the Cin to Aout delay after an early fire of the Version 2 wrapper is 34% less 
