# Self-Timed Design with Dynamic Domino Circuits

Jung-Lin Yang, Electrical and Computer Engineering, University of Utah, jyang@cs.utah.edu

Erik Brunvand, School of Computing, University of Utah, elb@cs.utah.edu

### Abstract

We introduce a simple hierarchical design technique for building high-performance self-timed components using dynamic domino-style circuits. This technique is useful for building handshaking-style functional blocks and self-timed data path components. We enclose the dynamic domino circuit in a wrapper that communicates using a request/acknowledge protocol and mediates the precharge/evaluate cycle of the dynamic logic. We apply standard bundled delay matching for completion detection but add an early completion feature that can signal completion if function validity can be determined from the output value. The circuit overhead required for this early-acknowledge feature is relatively small, and the feature can provide measurable speedup in some situations. We call this approach semi-bundled delay (SBD).

## 1. Self-Timed semi-bundled wrapper

We would like to leverage the advantages of dynamic data path circuits for use in self-timed systems by developing support circuits that provide the necessary completion detection and handshaking functionality. In any digital system, functional blocks perform arithmetic calculations, encoding and decoding, or some other special-purpose combinational data manipulation. If functional data path blocks are to be used in an asynchronous or self-timed system, they should also use handshaking techniques to communicate with the other circuit blocks in the system [1,2].

There are a variety of control protocols used in asynchronous or self-timed systems, but the most fundamental requirements for a data path element are that it know when to start executing its function, and that it report completion of the function and validity of the output values. To this end, our wrapper circuit provides three basic functions:

1. Communication with the environment: this entails completion detection on the dynamic function block and following the chosen communication handshake protocol with the rest of the system
2. Control of the pre-charge/evaluate cycle of the dynamic function block
3. Latching the output data as appropriate

Bundling means using a matched delay, one that models the delay of a functional unit, to generate an acknowledge signal [3,4]. Because our completion detection circuit falls somewhere between bundling and actual completion detection, we call our wrapper "semi-bundled." We will call our dynamic circuit with a self-timed semi-bundled wrapper an SBD (Semi-Bundled Delay) component. A block diagram of this wrapper is shown in Figure 1.



Figure 1. Semi-Bundled Delay (SBD) Component

## 2. Handshaking wrapper implementation

The wrapper circuit consists of five major sub-blocks: the pre-charge/evaluation signal generator (SBD\_PC), the worst-case matched delay (SBD\_MDelay), the pre-charge matched delay (MD\_PC), the asynchronous latch (SBD\_latch), and the completion signal generator (SBD\_ACK). Except for SBD\_latch, which is a standard TSPC latch with added completion detection, every block in the wrapper is implemented with either a generalized C-element (gC) [5] or NMOS domino logic. Due to space constraints we show circuits for only three of the major sub-blocks.

### 2.1. Pre-charge/evaluation signal (SBD\_PC)

SBD\_PC generates the control signal that tells the domino evaluation network when to evaluate and when to pre-charge (see Figure 2). The output of this block (PC) is driven high once the request signal from the environment has been pulled up to '1'. This puts the domino function block into evaluation mode so that it can start to compute outputs through its evaluation network. Once the evaluation network's result is latched, PC is pulled down to '0', putting the domino function block into pre-charge mode.
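The set and reset behavior described above can be sketched as a small state-update rule. This is a behavioral illustration only, not the transistor circuit of Figure 2. The function name `sbd_pc_next`, the `latched` input, and the hold-otherwise rule are assumptions made for the sketch, and `latched` is assumed to stay high until the request is withdrawn.

```python
def sbd_pc_next(pc, req, latched):
    """Next value of PC under an assumed set/reset/hold rule.

    pc      -- current PC value (1 = evaluate, 0 = pre-charge)
    req     -- request from the environment
    latched -- hypothetical flag: result has been captured by the latch
    """
    if latched:
        return 0   # result captured: return to pre-charge
    if req:
        return 1   # request pending and nothing latched yet: evaluate
    return pc      # otherwise hold, as a state-holding node would


def run_cycle():
    """Step PC through one assumed handshake cycle and record its values."""
    # (req, latched) pairs: idle, request, result latched, request withdrawn, idle
    steps = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
    pc = 0
    history = []
    for req, latched in steps:
        pc = sbd_pc_next(pc, req, latched)
        history.append(pc)
    return history


print(run_cycle())  # [0, 1, 0, 0, 0]
```

In this trace PC is high only between the arrival of the request and the latching of the result, which is the evaluation window described in the text.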



Figure 2. SBD\_PC transistor circuit

### 2.2. Worst-case matched delay (SBD\_MDelay)

This block postpones the rising edge of the request signal by the matched (worst-case) delay of the domino function block. The falling edge, however, should propagate to its output (MD\_ACK) without any additional delay. Figure 3 illustrates the transistor-level implementation of the SBD\_MDelay block. The "Worst-case Matched Delay" is a delay chain built from tunable-delay inverters or early-reset inverters.
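The asymmetric behavior of this block (slow rise, fast fall) can be illustrated with a short event-based model. It is a sketch under stated assumptions, not a description of the circuit in Figure 3: the function name `asymmetric_delay`, the zero-delay fall, the 700 ps value in the usage lines, and the rule that a request withdrawn before the delayed rise produces no output pulse are all hypothetical.

```python
def asymmetric_delay(edges, rise_delay_ps):
    """Model MD_ACK as a copy of the request with only rising edges delayed.

    edges         -- list of (time_ps, level) request transitions, in time order
    rise_delay_ps -- matched delay applied to rising edges only
    Returns the list of (time_ps, level) transitions seen on the output.
    """
    out = []
    for t, level in edges:
        if level == 1:
            # Rising edge: appears on the output after the matched delay.
            out.append((t + rise_delay_ps, 1))
        elif out and out[-1][1] == 1 and out[-1][0] > t:
            # Request fell before the delayed rise appeared: cancel that rise.
            out.pop()
        else:
            # Falling edge: passes through with no added delay.
            out.append((t, 0))
    return out


# Illustrative numbers only (700 ps is not a value from the paper).
print(asymmetric_delay([(0, 1), (2000, 0)], 700))  # [(700, 1), (2000, 0)]
print(asymmetric_delay([(0, 1), (300, 0)], 700))   # []
```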



Figure 3. SBD\_MDelay transistor circuit

An important feature of the completion detection circuit allows our wrapper to provide better completion times than always relying on the worst-case timing. If the output of the evaluation network makes a transition, the monotonic nature of the domino function block means that the final output value has been reached, and completion can be signaled even before the worst-case matched delay expires. The MD\_PC circuit is similar.
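The early-completion rule amounts to taking whichever event comes first: an output transition or the expiry of the matched delay. The sketch below expresses that rule for a single output. The function name `completion_time_ps` is hypothetical, and the numbers in the usage lines are borrowed from the core evaluation range reported in Section 3 purely for illustration; treating 680 ps as the matched delay is an assumption.

```python
def completion_time_ps(output_transition_ps, matched_delay_ps):
    """Time at which completion can be signaled for one evaluation.

    output_transition_ps -- time at which the domino output fires, or None
                            if the output stays at its pre-charged level
    matched_delay_ps     -- worst-case matched (bundled) delay
    """
    if output_transition_ps is None:
        # No transition to observe: fall back to the bundled delay.
        return matched_delay_ps
    # A monotonic output that has fired is final, so signal as soon as it
    # fires, but never later than the matched delay.
    return min(output_transition_ps, matched_delay_ps)


print(completion_time_ps(470, 680))   # 470 (early completion)
print(completion_time_ps(None, 680))  # 680 (bundled fallback)
```

The second call shows why the matched delay is still required: an output that legitimately stays at its pre-charged value never produces a transition to detect.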

### 2.3. Completion signal (SBD\_ACK)

SBD\_ACK is the final stage of our SBD handshaking wrapper. Its output ACK serves two important functions in the four-phase handshake with the rest of the self-timed system: the rising edge of ACK indicates that the evaluated result of the domino functional block is valid, and the falling edge indicates that a new request can be issued. The transistor-level implementation is shown in Figure 4.
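For readers less familiar with four-phase signaling, the event order implied by this description can be written down and checked mechanically. The sketch below assumes the conventional return-to-zero ordering (request up, acknowledge up, request down, acknowledge down). The names `FOUR_PHASE`, `MEANING`, and `is_four_phase` are illustrative and do not come from the paper.

```python
# One four-phase (return-to-zero) handshake cycle as (signal, new_level) events.
FOUR_PHASE = [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]

# Meaning of each event at the wrapper interface, following the text above.
MEANING = {
    ("req", 1): "environment asks for a computation",
    ("ack", 1): "result of the domino block is valid",
    ("req", 0): "environment has consumed the result",
    ("ack", 0): "wrapper is ready for a new request",
}


def is_four_phase(trace):
    """True if trace consists of whole repetitions of the four-phase cycle."""
    if len(trace) % len(FOUR_PHASE) != 0:
        return False
    return all(event == FOUR_PHASE[i % len(FOUR_PHASE)]
               for i, event in enumerate(trace))


print(is_four_phase(FOUR_PHASE * 2))                # True
print(is_four_phase([("req", 1), ("req", 0),
                     ("ack", 1), ("ack", 0)]))      # False
```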



Figure 4. SBD\_ACK transistor circuit

## 3. Design example

We use an extensible self-timed adder design as an example of using the SBD wrapper circuit with domino function blocks. Figure 5 shows a 12-bit self-timed adder built from three-bit adder blocks (ADD3) and three-bit plus-1 blocks (INC3). The ADD3 blocks compute a three-bit addition in a single domino gate. The INC3 blocks at the bottom of the figure use a chain of domino gates to generate the carry signal based on the results of the ADD3 blocks at the top. In this block diagram, we show a four-stage design but the circuit is easily extended to make larger adders.
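The arithmetic performed by this arrangement can be sketched functionally. The model below is one plausible reading of the block diagram, not a description of the authors' gates: it assumes each ADD3 adds its three-bit slice with no carry-in, and each INC3 adds the incoming group carry and forwards a carry to the next group. The function names, and the way the two carries are combined, are assumptions, and no timing or handshaking is modeled.

```python
def add3(a, b):
    """Three-bit addition with no carry-in; returns (sum_bits, carry_out)."""
    s = a + b
    return s & 0b111, s >> 3


def inc3(s, carry_in):
    """Add a one-bit carry to a three-bit value; returns (sum_bits, overflow)."""
    t = s + carry_in
    return t & 0b111, t >> 3


def sbd_add(a, b, groups=4, group_bits=3):
    """Add two (groups * group_bits)-bit numbers group by group."""
    mask = (1 << group_bits) - 1
    result, carry = 0, 0
    for i in range(groups):
        shift = i * group_bits
        s, c_add = add3((a >> shift) & mask, (b >> shift) & mask)
        s, c_inc = inc3(s, carry)
        # At most one of the two carries can be set: if add3 overflowed,
        # its sum bits are at most 6, so adding one cannot overflow again.
        carry = c_add | c_inc
        result |= s << shift
    return result, carry


print(sbd_add(0xABC, 0x123))  # (3039, 0), since 0xABC + 0x123 = 0xBDF
```

Changing `groups` extends the same loop to wider adders, mirroring the extensibility noted in the text.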

We simulated this 12-bit self-timed SBD adder in PSPICE using TSMC 0.25 µm 2.5 V CMOS models. Each ADD3 component has its own domino SBD and completes its task in 1465ps to 1741ps (~17 FO4). This delay consists of the domino function block evaluation time, register (latch) delay, and handshaking overhead. The delay variation is small because the ADD3's domino logic core takes only 470ps to 680ps of calculation time (~5 FO4). The overhead due specifically to the wrapper circuit (that is, additional overhead not found in a standard domino version) is around 450ps to 500ps, or around 5 FO4 delays in this technology. High-performance microprocessors use more than 20 FO4 delays between latches [6], indicating that our ADD3 functional block is perhaps too small to be compared directly with realistic data path circuits. Larger domino circuits would suffer less overhead penalty from the SBD wrapper. However, the ADD3 example does demonstrate that dynamic domino circuits can be used in a straightforward way in self-timed designs.
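The FO4 figures quoted above can be reproduced approximately with a one-line conversion. The paper does not state the FO4 delay of the process, so the value of 100 ps used below is an assumption, chosen only because it roughly matches the conversions given in the text. The names `FO4_PS` and `to_fo4` are illustrative.

```python
FO4_PS = 100.0  # ASSUMED fanout-of-4 inverter delay; not stated in the paper


def to_fo4(delay_ps, fo4_ps=FO4_PS):
    """Express a delay in picoseconds as a multiple of the FO4 delay."""
    return delay_ps / fo4_ps


# Delays reported in the text, in picoseconds.
reported = {
    "ADD3 component, slowest": 1741,
    "domino core, slowest": 680,
    "wrapper overhead, upper": 500,
}

for name, ps in reported.items():
    print(f"{name}: {to_fo4(ps):.1f} FO4")
```

Under this assumption the slowest ADD3 cycle comes to about 17 FO4 and the wrapper overhead to about 5 FO4, in line with the figures quoted in the text.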



Figure 5. 12-bit Self-timed adder building block

A purely static design for exactly the same circuit using a full P-type pull-up stack and N-type pull-down stack with the same output latch and the same bundling delay margins simulates at 1915ps. So, even with the overhead of the wrapper circuit and the relatively small size of the domino function block, the domino version runs about 10% faster than the static version. Larger differences would be expected for larger domino function blocks.
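The "about 10%" figure can be checked directly from the reported delays. The sketch below takes speedup to mean the ratio of static delay to domino delay minus one, which is an assumption about how the comparison was made. The helper name `speedup` is illustrative.

```python
STATIC_PS = 1915          # static version, from the text
DOMINO_SLOWEST_PS = 1741  # slowest reported domino ADD3 completion
DOMINO_FASTEST_PS = 1465  # fastest reported domino ADD3 completion


def speedup(reference_ps, improved_ps):
    """Fractional speedup of the improved design over the reference."""
    return reference_ps / improved_ps - 1.0


print(f"{speedup(STATIC_PS, DOMINO_SLOWEST_PS):.1%}")  # 10.0%
print(f"{speedup(STATIC_PS, DOMINO_FASTEST_PS):.1%}")  # 30.7%
```

On this reading the quoted 10% corresponds to the slowest domino completion time; the fastest reported completion would give a larger margin.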

## 6. Conclusions

We have shown a simple wrapper circuit that can make high-performance dynamic-logic function blocks usable in a self-timed system. The wrapper presents a standard self-timed req/ack protocol at its interface, and implements pre-charge/evaluate sequencing, variable-time completion detection, and possibly output latching for the dynamic function block inside the wrapper. Completion is detected either by directly sensing the completion of the dynamic function block or by a matched delay. Because this has the potential to signal completion earlier than a purely bundled delay, we call this approach Semi-Bundled Delay (SBD). For other approaches to early completion detection see [7,8].

An SBD wrapper is suitable for any dynamic logic family that includes a pre-charge phase that sets the outputs to a known level and has monotonic output behavior. Specifically, we have shown an example using wrapper circuits around dynamic domino function blocks. The wrapper circuits themselves are either generalized C-elements or domino circuits. For the simple ADD3 circuit the overhead due to the wrapper itself is about 5 FO4 delays. Including this overhead, the domino circuit is about 10% faster than a fully static version of the circuit. Larger differences can be expected for more complex domino function blocks.

Of course, there are many other possibilities for implementing the wrapper itself. We have also designed and simulated DCVSL-based circuits for the wrapper, and are exploring domino function blocks that look more like finite state machines than just simple combinational functions.

Semi-bundled wrapper circuits allow a designer to use high-speed dynamic data path circuits with any self-timed or asynchronous design style that relies on an explicit completion signal.

## 7. References

- [1] Scott Hauck. Asynchronous design methodologies: An overview. *Proceedings of the IEEE*, 83(1):69-93, January 1995.
- [2] Erik Brunvand, Steven Nowick, and Kenneth Yun. Practical advances in asynchronous design and in asynchronous/synchronous interfaces. In *Proc. ACM/IEEE Design Automation Conference*, pages 104-109, 1999.
- [3] Ivan E. Sutherland. Micropipelines. *Communications of the ACM*, 32(6):720-738, June 1989.
- [4] Jens Sparsø and Steve Furber, editors. *Principles of Asynchronous Circuit Design: A Systems Perspective*. Kluwer Academic Publishers, 2001.
- [5] Kenneth Y. Yun. Automatic synthesis of extended burst-mode circuits using generalized C-elements. In *Proc. European Design Automation Conference (EURO-DAC)*, pages 290-295, September 1996.
- [6] David Chinnery and Kurt Keutzer. *Closing the Gap Between ASIC & Custom*. Kluwer Academic Publishers, 2002.
- [7] Mark E. Dean. *STRIP: A Self-Timed RISC Processor Architecture*. PhD thesis, Stanford University, 1992.
- [8] S. M. Nowick. Design of a low-latency asynchronous adder using speculative completion. *IEE Proceedings, Computers and Digital Techniques*, 143(5):301-307, September 1996.

