Self-timed design with dynamic domino circuits by Brunvand, Erik L. & Yang, Jung-Lin
Self-Timed Design with Dynamic Domino Circuits
Jung-Lin Yang, Electrical and Computer Engineering, University of Utah, jyang@cs.utah.edu
Erik Brunvand, School of Computing, University of Utah, elb@cs.utah.edu
Abstract
We introduce a simple hierarchical design technique
for building high-performance self-timed components using
dynamic domino-style circuits. This technique is useful
for building handshaking-style functional blocks and
for self-timed data path components. We wrap the dynamic
domino circuit in a wrapper that communicates using
a request/acknowledge protocol and mediates the
pre-charge/evaluate cycle of the dynamic logic. We apply
standard bundled delay matching for completion detection
but add an early completion feature that can signal
completion if function validity can be determined from
the output value. The circuit overhead required for this
early-acknowledge feature is relatively small, but it can
provide measurable speedup in some situations. We call
this approach semi-bundled delay (SBD).
1. Self-Timed semi-bundled wrapper
We would like to leverage the advantages of dy­
namic data path circuits for use in self-timed systems by 
developing support circuits that provide the necessary 
completion detection and handshaking functionality. In 
any digital system, functional blocks perform arithmetic 
calculations, encoding and decoding, or some other spe­
cial purpose combinational data manipulation. If func­
tional data path blocks are to be used in an asynchronous 
or self-timed system they should also use handshaking 
techniques to communicate with the other circuit blocks 
in the system [1,2].
There are a variety of control protocols used in 
asynchronous or self-timed systems but the most funda­
mental requirements for a data path element are that it be 
able to know when to start executing its function, and that 
it report completion of the function and validity of the 
output values. Towards this end, our wrapper circuit pro­
vides three basic functions:
1. Communication with the environment - This
entails completion detection on the dynamic
function block and following the chosen
communication handshake protocol with the rest
of the system
2. Control of the pre-charge/evaluate cycle of the 
dynamic function block
3. Latching the output data as appropriate
Bundling is a term that means using a matched delay
that models the delay in a functional unit to generate
an acknowledge signal [3,4]. Because our completion
detection circuit falls somewhere between bundling and
actual completion detection, we call our wrapper
“semi-bundled.” We will call our dynamic circuit with a
self-timed semi-bundled wrapper an SBD (Semi-Bundled
Delay) component. A block diagram of this wrapper is
shown in Figure 1.
Figure 1. Semi-Bundled Delay (SBD) Component
2. Handshaking wrapper implementation
The wrapper circuit consists of five major sub-blocks:
pre-charge/evaluation signal generator (SBD_PC),
worst-case matched delay (SBD_MDelay), pre-charge matched
delay (MD_PC), asynchronous latch (SBD_latch), and
completion signal generator (SBD_ACK). Except for the
SBD_latch, which is a standard TSPC latch with added
completion detection, every block in the wrapper is
implemented by either a generalized C-element (gC) [5] or
NMOS domino logic. Due to space constraints we will
show circuits only for three of the major sub-blocks.
Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI'03)
0-7695-1904-0/03 $17.00 © 2003 IEEE
2.1. Pre-charge/evaluation signal (SBD_PC)
SBD_PC generates the control signal that tells the domino
evaluation network when to evaluate and when to
pre-charge (see Figure 2). The output of this block (PC)
is set high once the request signal from the
environment has been pulled up to ‘1’. This puts the
domino function block into evaluation mode so that it can
start to compute outputs through its evaluation network.
Once the evaluation network’s result is latched, PC is
pulled down to ‘0’, putting the domino function block
into pre-charge mode.
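The PC behaviour described above can be sketched as a small set/reset model. This is only a behavioural illustration, not the transistor circuit of Figure 2: the class name `SbdPc`, the input name `latched`, the hold behaviour when neither input is active, and the assumption that `latched` stays asserted until the request is withdrawn are all hypothetical.

```python
class SbdPc:
    """Behavioural sketch of the PC control signal (assumed set/reset with hold)."""

    def __init__(self):
        self.pc = 0  # 0 = pre-charge mode, 1 = evaluation mode

    def update(self, req, latched):
        """Return the new PC value for the given inputs.

        req     -- request from the environment (1 = asserted)
        latched -- hypothetical flag: the result has been captured by the latch
        """
        if latched:
            self.pc = 0   # result is latched: return to pre-charge
        elif req:
            self.pc = 1   # request pending: enter evaluation
        # neither input active: hold the previous value (assumption)
        return self.pc


pc = SbdPc()
inputs = [(0, 0), (1, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
pc_trace = [pc.update(req, latched) for req, latched in inputs]
print(pc_trace)
```

In this trace PC rises with the request, stays high during evaluation, and drops as soon as the result is latched.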
2.2. Worst-case matched delay (SBD_MDelay)
This block delays the rising edge of the request signal
by the matched (worst-case) delay of the domino function
block. The falling edge, however, should propagate to its
output (MD_ACK) without any additional delay. Figure 3
illustrates the transistor-level implementation of the
SBD_MDelay block. The “Worst-case Matched Delay” is
implemented as a delay chain.





An important feature of the completion detection
circuit allows our wrapper to provide better completion
times than always relying on the worst-case timing. If the
evaluation network makes a transition, the monotonic
nature of the domino function block means that the final
output has been reached, and completion can be signaled
even before the worst-case timing expires. The MD_PC
circuit is similar.
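A minimal timing sketch of this asymmetric delay with early completion follows. It is a behavioural model under stated assumptions, not the circuit of Figure 3: the function names, the picosecond time base, and the use of `None` for "the output never left its pre-charged value" are hypothetical.

```python
def md_ack_rise_time(req_rise, worst_case_delay, output_transition=None):
    """Time at which MD_ACK rises (all times in ps, hypothetical model).

    req_rise          -- time of the rising edge of the request
    worst_case_delay  -- matched (worst-case) delay of the function block
    output_transition -- time at which the domino output made its single
                         monotonic transition, or None if it never did
    """
    timeout = req_rise + worst_case_delay
    if output_transition is None:
        # No transition observed: only the matched delay can signal completion.
        return timeout
    # A transition means the final value is reached, so acknowledge early.
    return min(timeout, output_transition)


def md_ack_fall_time(req_fall):
    """The falling edge of the request propagates without added delay."""
    return req_fall


early = md_ack_rise_time(0, 680, output_transition=470)
late = md_ack_rise_time(0, 680)
print(early, late)
```

The values 470 and 680 are borrowed from the range of ADD3 core delays reported in Section 3 purely as illustrative inputs.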
2.3. Completion signal (SBD_ACK)
SBD_ACK is the final stage of our SBD handshaking
wrapper. Its output ACK provides two important
functions in the four-phase handshake with the rest of the
self-timed system: the rising edge of ACK tells when the
evaluated result of the domino functional block is valid,
and the falling edge indicates that a new request can be
issued. The transistor-level implementation is shown in
Figure 4.
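The event ordering of a four-phase (return-to-zero) handshake can be illustrated with a short checker. This is a sketch of the protocol ordering only; the function name `follows_four_phase` and the `(signal, value)` event encoding are assumptions made for illustration.

```python
# One complete four-phase cycle: request up, acknowledge up (data valid),
# request down, acknowledge down (a new request may be issued).
FOUR_PHASE_CYCLE = [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]


def follows_four_phase(events):
    """Return True if events form zero or more complete four-phase cycles."""
    for index, event in enumerate(events):
        if event != FOUR_PHASE_CYCLE[index % 4]:
            return False
    return len(events) % 4 == 0


good = FOUR_PHASE_CYCLE * 2
bad = [("req", 1), ("req", 0), ("ack", 1), ("ack", 0)]  # request withdrawn too early
print(follows_four_phase(good), follows_four_phase(bad))
```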
Figure 3. SBD_MDelay transistor circuit
Figure 4. SBD_ACK transistor circuit
3. Design example
We use an extensible self-timed adder design as an 
example of using the SBD wrapper circuit with domino 
function blocks. Figure 5 shows a 12-bit self-timed adder 
built from three-bit adder blocks (ADD3) and three-bit 
plus-1 blocks (INC3). The ADD3 blocks compute a 
three-bit addition in a single domino gate. The INC3 
blocks at the bottom of the figure use a chain of domino 
gates to generate the carry signal based on the results of 
the ADD3 blocks at the top. In this block diagram, we 
show a four-stage design but the circuit is easily extended 
to make larger adders.
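As a purely functional illustration of how three-bit blocks can be composed into a wider adder, the sketch below models one plausible reading of the structure: each ADD3 adds two three-bit operands, and each INC3 adds the incoming carry to that partial sum. The exact signals exchanged between the blocks in Figure 5 are not given in the text, so the carry combination used here, and all function names, are assumptions. Timing and handshaking are not modelled.

```python
def add3(a3, b3):
    """Three-bit addition: returns (3-bit partial sum, carry out)."""
    total = a3 + b3
    return total & 0b111, total >> 3


def inc3(partial, carry_in):
    """Three-bit plus-1: adds the incoming carry to a partial sum."""
    total = partial + carry_in
    return total & 0b111, total >> 3


def block_add(a, b, stages=4):
    """Add two (3 * stages)-bit numbers using ADD3/INC3 pairs (assumed wiring)."""
    result, carry = 0, 0
    for stage in range(stages):
        shift = 3 * stage
        partial, carry_add = add3((a >> shift) & 0b111, (b >> shift) & 0b111)
        out, carry_inc = inc3(partial, carry)
        # At most one of the two carries can be set for a given stage.
        carry = carry_add | carry_inc
        result |= out << shift
    return result, carry


print(block_add(1234, 2345))
print(block_add(0xFFF, 1))
```

Raising `stages` extends the same composition to wider operands, which mirrors the extensibility noted above.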
We simulated this 12-bit self-timed SBD adder in
PSPICE using TSMC 0.25 µm 2.5 V CMOS models. Each
ADD3 component has its own domino SBD and completes
its task in 1465 ps to 1741 ps (~17 FO4). This delay
consists of the domino function block evaluation time,
register (latch) delay, and handshaking overhead. The
delay variation is small because the ADD3’s domino logic
core takes only 470 ps to 680 ps of calculation time (~5
FO4). The overhead due specifically to the wrapper
circuit (that is, additional overhead not found in a standard
domino version) is around 450 ps to 500 ps, or around 5
FO4 delays in this technology. High-performance
microprocessors use more than 20 FO4 delays between latches
[6], indicating that our ADD3 functional block is perhaps
too small to be compared directly with realistic data path
circuits. Larger domino circuits would suffer less
overhead penalty from the SBD wrapper. However, the example
does demonstrate that dynamic domino circuits can be used in
a straightforward way in self-timed designs.
Figure 5. 12-bit self-timed adder building block
A purely static design for exactly the same circuit 
using a full P-type pull-up stack and N-type pull-down 
stack with the same output latch and the same bundling 
delay margins simulates at 1915ps. So, even with the 
overhead of the wrapper circuit and the relatively small 
size of the domino function block, the domino version 
runs about 10% faster than the static version. Larger dif­
ferences would be expected for larger domino function 
blocks.
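The quoted speedup can be checked with a short calculation on the reported delays. Which end of the ADD3 delay range the authors compared against the static design is not stated, so both ends are evaluated here; the helper name `percent_faster` is hypothetical.

```python
def percent_faster(static_ps, domino_ps):
    """Delay reduction of the domino version relative to the static one, in percent."""
    return 100.0 * (static_ps - domino_ps) / static_ps


STATIC_PS = 1915                      # static design delay reported above
DOMINO_SLOW_PS, DOMINO_FAST_PS = 1741, 1465   # ends of the reported ADD3 SBD range

slow_case = percent_faster(STATIC_PS, DOMINO_SLOW_PS)
fast_case = percent_faster(STATIC_PS, DOMINO_FAST_PS)
print(round(slow_case, 1), round(fast_case, 1))
```

The slower end of the domino range gives roughly a 9% reduction, which is consistent with the "about 10%" figure; the faster end gives a noticeably larger one.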
6. Conclusions
We have shown a simple wrapper circuit that can 
make high-performance dynamic-logic function blocks 
usable in a self-timed system. The wrapper presents a 
standard self-timed req/ack protocol at its interface, and 
implements pre-charge/evaluate sequencing, variable­
time completion detection, and possibly output latching 
for the dynamic function block inside the wrapper. The 
completion detection operates by either directly sensing 
the completion of the dynamic function block or by a 
matched delay. Because this has the potential to signal 
completion earlier than a purely bundled delay we call 
this approach Semi-Bundled Delay (SBD). For other ap­
proaches to early completion detection see [7,8].
An SBD wrapper is suitable for any dynamic logic
family that includes a pre-charge phase that sets the
outputs to a known level and has monotonic output behavior.
Specifically, we have shown an example using wrapper
circuits around dynamic domino function blocks. The
wrapper circuits themselves are either generalized
C-elements or domino circuits. For the simple ADD3 circuit
the overhead due to the wrapper itself is about 5 FO4
delays. Including this overhead, the domino circuit is about
10% faster than a fully static version of the circuit. Larger
differences can be expected for more complex domino
function blocks.
Of course, there are many other possibilities for im­
plementing the wrapper itself. We have also designed and 
simulated DCVSL-based circuits for the wrapper, and are 
exploring domino function blocks that look more like fi­
nite state machines than just simple combinational func­
tions.
Semi-bundled wrapper circuits allow a designer to 
take advantage of high-speed dynamic data path circuits 
that can be used with any self-timed or asynchronous de­
sign style that relies on an explicit completion signal.
7. References
[1] Scott Hauck. Asynchronous design methodologies: An
overview. Proceedings of the IEEE, 83(1):69-93, January
1995.
[2] Erik Brunvand, Steven Nowick, and Kenneth Yun. Practical
advances in asynchronous design and in
asynchronous/synchronous interfaces. In Proc. ACM/IEEE Design
Automation Conference, pages 104-109, 1999.
[3] Ivan E. Sutherland. Micropipelines. Communications of the
ACM, 32(6):720-738, June 1989.
[4] Jens Sparsø and Steve Furber, editors. Principles of
Asynchronous Circuit Design: A Systems Perspective. Kluwer
Academic Publishers, 2001.
[5] Kenneth Y. Yun. Automatic synthesis of extended
burst-mode circuits using generalized C-elements. In Proc.
European Design Automation Conference (EURO-DAC), pages
290-295, September 1996.
[6] David Chinnery and Kurt Keutzer. Closing the Gap
Between ASIC & Custom. Kluwer Academic Publishers, 2002.
[7] Mark E. Dean. STRiP: A Self-Timed RISC Processor
Architecture. PhD thesis, Stanford University, 1992.
[8] S. M. Nowick. Design of a low-latency asynchronous adder
using speculative completion. IEE Proceedings, Computers
and Digital Techniques, 143(5):301-307, September 1996.
