DOE based high-performance gate-level pipelines by Núñez, Juan et al.
DOE Based High-Performance Gate-Level Pipelines 
 
Juan Núñez, María J. Avedillo and Héctor J. Quintero 
Instituto de Microelectrónica de Sevilla, IMSE-CNM (CSIC/Universidad de Sevilla) 
Av. Américo Vespucio s/n 41092, Seville (Spain) 
{jnunez,avedillo,quintero}@imse-cnm.csic.es 
 
 
Abstract— Domino dynamic circuits are widely used in 
critical parts of high-performance systems. In this paper we show 
that, in addition to the functional limitation associated with the 
non-inverting behavior of domino gates, there are also performance 
disadvantages with respect to inverting dynamic gates, which 
can be traced to the same feature. These penalties arise from the 
fact that, in order to produce a logic one, a non-inverting gate 
requires one or more of its inputs to also be at logic one. We 
analyze the operation of gate-level pipelines implemented with 
domino and with Delayed Output Evaluation (DOE), an inverting 
dynamic gate we have recently proposed, and compare their 
performance. Using domino and DOE gates with similar delays, 
the DOE pipelines achieve improvements in operating frequency 
of around 50%. 
Keywords— Nanopipeline, Dynamic logic, Robust design 
techniques. 
I. INTRODUCTION 
The design of functional units implementing very fine-grained 
pipelining for high-performance applications is currently an 
area of active research. These solutions do not apply 
conventional pipeline techniques, which insert flip-flops to 
shorten signal propagation paths in combinational logic, 
but instead rely on logic circuit styles which naturally exhibit 
the capacity to block data propagation and are thus well 
suited to implementing pipeline architectures without memory 
elements. The potential of dynamic logic, with its precharge and 
evaluation phases, to implement this kind of pipelining was 
recognized long ago. In [1], the operation of the well-known 
dynamic-based domino logic in a pipelined fashion, 
using an overlapping multi-phase clock scheme and without 
latches between consecutive clock phases (superpipeline), was 
analyzed in depth. Many variations of this 
multi-phase solution have been developed, achieving high 
performance [2]-[8], and some of them have been applied to 
speed up critical parts of commercial microprocessors. In 
particular, architectures with a single gate per clock phase 
(nanopipeline) have been proposed and have demonstrated high 
operating frequency and throughput [4],[5],[7],[8]. In this 
context, the development of novel domino-based topologies for 
dynamic gates exhibiting a good performance-robustness 
tradeoff and/or tolerance to process variations is an area of 
active research [9],[10],[11]. 
In spite of their speed advantages, it is well known that 
domino-based gates have the limitation that only non-inverting 
blocks can be chained (a static inverter is added 
between every two dynamic stages to guarantee that all inputs 
to the next logic block are set to 0 after the precharge period). 
In addition to the functional limitation associated with the 
non-inverting behavior of domino gates, we claim that there are 
also speed limitations in pipelined networks which can be 
related to this feature. These penalties arise from the fact that, 
in order to produce a logic one, a non-inverting gate requires one 
or more of its inputs to also be at logic one. As a consequence, 
logic ones can degrade as they propagate through the 
logic network, eventually leading to a functional failure. 
Recently we have proposed an inverting dynamic gate 
topology called Delayed Output Evaluation (DOE) [12]. DOE 
exhibits good speed versus noise-tolerance tradeoffs, which are 
attractive for deep-submicron (DSM) technologies. In addition, we 
have identified that its inverting nature allows operating frequency 
improvements over the domino-based style. In this paper we focus 
on these advantages at the interconnection level. In particular, 
we analyze the operation of DOE-based nanopipelines and 
compare their performance to that of their domino counterparts. 
The paper is organized as follows: in Section II, the DOE 
logic style is described and the implications of the 
non-inverting behavior of domino for circuit operation are 
illustrated. In Section III, DOE and domino nanopipelines are 
evaluated through simulation experiments and compared. 
Finally, some conclusions are given in Section IV. 
 
II. ARCHITECTURE ANALYSIS 
A. DOE Topology 
Figure 1a shows a generic conventional dynamic gate, or 
domino gate. It operates in two phases called precharge (CLK 
= 0) and evaluation (CLK = 1). It is composed of a dynamic 
stage and a static output stage. The dynamic stage realizes the 
logic functionality, while the output stage is required to allow 
gates to be cascaded and to drive the fan-out. A keeper transistor 
is added to protect the dynamic node against leakage and noise. 
Figure 1b shows the schematic of the Delayed Output 
Evaluation (DOE) topology for a generic gate. Note that the 
static inverter is replaced by a NAND gate, whose inputs are 
the dynamic node and a delayed clock, VDCLK, followed by a static 
inverter. Clocks VCLK and VDCLK are also depicted. The rising 
edge of VDCLK is delayed with respect to the rising edge of VCLK 
by ∆CLK while, ideally, both falling edges are simultaneous. 
For VDCLK = 0, VNAND is pulled up independently of VDYN. The 
static inverter guarantees that the precharge value of 
the gate output (VOUT) is low, as in domino logic. For VDCLK = 1, 
the NAND gate evaluates its input. For those input 
combinations which discharge the dynamic node, the pull-down 
network of the NAND gate is off and the gate output remains low. 
For input combinations which do not discharge the dynamic node, 
the NAND output node is pulled down and VOUT is pulled up. 
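
The operation described above can be summarized as a zero-delay Boolean sketch. The Python fragment below is only an illustration, not the simulation setup used in this work: it assumes ideal logic levels and that the dynamic node has settled before the delayed clock rises. The function names and the flag pdn_conducts (true when the inputs turn on the pull-down network of the dynamic stage) are hypothetical.

```python
def dynamic_node(clk: int, pdn_conducts: bool) -> int:
    """Ideal dynamic node: precharged to 1 when clk = 0; during
    evaluation (clk = 1) it is discharged only if the pull-down
    network (PDN) conducts."""
    if clk == 0:
        return 1
    return 0 if pdn_conducts else 1


def domino_out(clk: int, pdn_conducts: bool) -> int:
    """Domino gate: static inverter after the dynamic node,
    so the gate is non-inverting with respect to the PDN function."""
    return 1 - dynamic_node(clk, pdn_conducts)


def doe_out(clk: int, dclk: int, pdn_conducts: bool) -> int:
    """DOE gate: NAND(dynamic node, delayed clock) followed by an inverter.
    ASSUMPTION: the dynamic node is settled when dclk rises."""
    dyn = dynamic_node(clk, pdn_conducts)
    nand = 0 if (dyn == 1 and dclk == 1) else 1
    return 1 - nand


# A 16-input NOR in DOE and a 16-input OR in domino share the same PDN:
# it conducts when any input is high.
inputs_all_low = [0] * 16
inputs_one_high = [1] + [0] * 15
print(doe_out(1, 1, any(inputs_all_low)))      # DOE NOR, no input high
print(domino_out(1, any(inputs_one_high)))     # domino OR, one input high
```

In this idealized view both gates precharge to a low output, but the DOE output goes high precisely for the input combinations that leave the dynamic node charged.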
The evaluation delay of a DOE gate is determined by the 
speed of the NAND-INV static stage and by the amount by 
which evaluation of the NAND is delayed. Gate delay is, to 
some extent, independent of how fast the dynamic node 
discharges. As a result, the achieved tradeoff between delay and 
noise tolerance is significantly better than in domino gates, as 
we showed in [12]. The design of a Kogge-Stone adder using DOE 
gates is also reported in [12] in order to validate their 
capability to build up logic networks. 
However, this is not the only advantage of the DOE 
topology. As already mentioned when describing its operation, 
DOE produces a logic one when an input combination which 
does not discharge the dynamic node is applied. Clearly, this 
is not the behavior of the domino gate. It is the result of 
adding an inverting stage to the domino gate. The next subsection 
illustrates how this translates into speed and robustness 
improvements when interconnecting gates. 
B. Interconnecting gates 
We compare a chain of ten 16-input NOR DOE gates with 
a chain of ten 16-input OR domino gates. Both circuits are 
operated as a gate-level pipeline with a three-phase overlapped 
clock scheme, as depicted in Fig. 2a. Gates are connected such 
that input changes propagate through the circuit and each gate 
is excited with the worst-case input combination. A sequence 
alternating “0” and “1” is applied. The dynamic stages, as well as 
the keeper transistors and the feedback and output inverters, have 
been identically sized in both circuits, and such that the delays 
of both gates are equal. However, the maximum operating frequency 
of the DOE network is higher than that of its domino counterpart. 
We have analyzed in depth the operation of both circuits. 
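
As an illustration of the clocking arrangement, the following Python sketch assigns clock phases to stages in the way suggested by the stage labels of Fig. 2a (STG1, STG4 and STG10 on VCLK,1; STG2 on VCLK,2; STG3 on VCLK,3). The waveform model is an assumption made only for this sketch: ideal square waves, phases shifted by one third of the period, and a common duty cycle of 50%. The actual duty cycle and overlap used in the simulations are not stated in the text, and all function names are hypothetical.

```python
def stage_phase(stage: int) -> int:
    """Clock phase (1, 2 or 3) driving a 1-indexed pipeline stage."""
    return (stage - 1) % 3 + 1


def phase_is_high(t: float, phase: int, period: float, duty: float = 0.5) -> bool:
    """Level of one ideal clock phase at time t.

    ASSUMPTION: phase k is delayed by (k - 1) * period / 3 with respect
    to phase 1, and all phases share the same duty cycle.
    """
    offset = (phase - 1) * period / 3.0
    return ((t - offset) % period) < duty * period


def overlap_time(period: float, duty: float = 0.5) -> float:
    """Time during which two consecutive phases are high simultaneously
    under the assumptions above (zero if the phases do not overlap)."""
    return max(0.0, duty * period - period / 3.0)


if __name__ == "__main__":
    for stage in range(1, 11):
        print(f"STG{stage} -> VCLK,{stage_phase(stage)}")
    print("overlap for a period of 600 time units:", overlap_time(600.0))
```

With these assumed waveforms, a stage and its successor are in evaluation simultaneously for one sixth of the period, which is the window in which data move from one gate to the next without latches.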
Figure 2 depicts results for both simulated circuits. DOE 
circuit (Fig. 2b) produces correct output at the simulated 
frequency (VOUT,DOE). Domino (Fig. 2c) does not work 
(VOUT,CONV). The outputs of intermediate stages for both 
CONV and DOE are also shown. It can be observed how 
domino outputs degrade with the number of stages but DOE 
does not.  
 
Fig. 1. (a) Domino gate. (b) DOE gate. 
 
Fig. 2. (a) Three-phase clock scheme. (b) Simulation results 
corresponding to a chain of ten 16-input DOE NOR gates. (c) 
Simulation results of its conventional domino counterpart. 
 
The differences can be explained on the basis of the 
input combination producing a zero-to-one transition of the 
gate output in each topology. In domino, being non-inverting, 
this output transition is associated with input combinations 
that discharge the dynamic node. Discharging the dynamic 
node requires one or more inputs to be at logic one. “Good” 
ones are required to fully discharge the dynamic node and produce 
a “good” output one. Otherwise, functional failures can occur 
after several stages. In contrast, in DOE, which implements 
inverting functionality, the zero-to-one output transition occurs 
for input combinations which do not discharge the dynamic node. 
Thus, how good the output logic one is does not depend on how 
good the input logic ones are. Unlike in domino, degraded logic 
ones do not propagate through the circuit, and so do not 
accumulate to the point of causing functional failures after 
several stages. 
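
The argument above can be made concrete with a deliberately crude toy model, sketched here in Python. It is not derived from the circuit simulations: the per-stage retention factor and the failure threshold are hypothetical parameters chosen only to show the qualitative difference between a level that is passed on from stage to stage and one that is regenerated at every stage.

```python
VDD = 1.2  # nominal supply voltage (V), as used in the simulations


def domino_one_levels(n_stages: int, retention: float = 0.9) -> list:
    """Toy model of a non-inverting chain: the output one of each stage
    is a fraction of the input one it received.
    ASSUMPTION: 'retention' is a hypothetical per-stage factor < 1."""
    levels, v = [], VDD
    for _ in range(n_stages):
        v = retention * v
        levels.append(v)
    return levels


def doe_one_levels(n_stages: int) -> list:
    """Toy model of an inverting chain: an output one is produced when
    the dynamic node is NOT discharged, so its level does not depend on
    the quality of the input ones."""
    return [VDD] * n_stages


def first_failing_stage(levels: list, threshold: float = VDD / 2):
    """1-indexed stage whose one falls below the (hypothetical)
    threshold, or None if every stage stays above it."""
    for stage, v in enumerate(levels, start=1):
        if v < threshold:
            return stage
    return None


if __name__ == "__main__":
    print("domino:", first_failing_stage(domino_one_levels(10)))
    print("DOE:   ", first_failing_stage(doe_one_levels(10)))
```

The point of the sketch is only structural: in the non-inverting chain any per-stage loss compounds with depth, whereas in the inverting chain there is nothing to compound.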
It is interesting to analyze the behavior of the dynamic 
node in each circuit. To this end, the voltage level to which each 
dynamic node discharges has been measured. Figure 3 depicts 
this voltage level versus stage number for the domino network (at 
VDD = 1.2 V) and the DOE network (at VDD = 1.2 V and VDD = 1 V). 
Voltage levels are identical for stage number one. It can be 
observed that, in domino, the discharge of the dynamic node 
degrades in consecutive stages and the complete chain does 
not operate correctly. The DOE behavior is completely different. 
The minimum voltage level increases slightly from the first stage 
to the second one due to non-ideal inputs, but then remains 
constant, even for the lower VDD value. 
III. EXPERIMENTAL RESULTS 
Experiments have been carried out in order to complete the 
comparison between the operation of domino and DOE pipelines. 
Different ten-stage chains like those described in the previous 
section have been simulated. In all the simulated chains, NOR 
(OR) gates are used for DOE (domino) circuits. Five different 
chain pairs are characterized in a commercial 1.2 V 130 nm 
technology. They differ only in the size of the keeper 
transistor. Table I reports gate delays for some of the keeper 
sizes. Keeper width increases from gate K1 to gate K5. Larger 
keeper transistor widths imply that the dynamic node 
discharges more slowly. As expected, domino delay increases 
while DOE delay remains constant, although the discharging rate 
of the dynamic node does impact the operation of the DOE pipelines 
as well. DOE and domino delays are similar for gate version 
K3. In both domino and DOE, noise tolerance increases 
when the keeper transistor is upsized. 
 
 
 
TABLE I.  GATE DELAYS 
 
          Delay (ps) 
       Domino    DOE 
K1      71.9    86.1 
K3      89.8    86.4 
K5     108.1    86.6 
 
The delayed clock required for DOE operation is generated 
inside each gate by means of a pair of inverters. 
Different frequencies have been measured to characterize 
the operation of each working (correctly behaving) pipelined 
chain. These frequencies are: 
• F1: Clock frequency up to which every gate behaves 
“ideally”. By “ideally” we mean that dynamic nodes 
are fully discharged and charged in worst case 
scenarios and “1” inputs to every stage are ready (90% 
of final value reached) when evaluation clock rises.  
• F2: Clock frequency up to which every dynamic node 
discharges under 100mV and charges over 1.1V. 
• F3: Maximum clock operating frequency. Correct 
output is obtained. 
• F4: Clock frequency up to which correct behavior is 
obtained in SS, FF, SF and FS corners. Supply voltage 
is reduced by 10% in SS, SF and FS corners and 
increased by same amount in FF corner.  
• F5: Clock frequency up to which correct operation is 
observed in 30 MC simulations (3-σ). 
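
Two of the criteria above are simple enough to be written down directly. The Python sketch below expresses the F2 thresholds and the supply voltages implied by the F4 definition for the nominal 1.2 V supply. It is a post-processing illustration only: the function names and the idea of feeding it per-stage minimum and maximum dynamic node voltages extracted from a simulator are assumptions, not a description of the flow actually used.

```python
VDD_NOM = 1.2  # nominal supply voltage of the technology (V)


def meets_f2(dyn_min_v, dyn_max_v, low_limit=0.100, high_limit=1.1):
    """F2 criterion: every dynamic node discharges under 100 mV and
    charges over 1.1 V.

    dyn_min_v / dyn_max_v: per-stage minimum / maximum dynamic node
    voltages (V), assumed to be extracted from a transient simulation.
    """
    discharged = all(v < low_limit for v in dyn_min_v)
    charged = all(v > high_limit for v in dyn_max_v)
    return discharged and charged


def corner_supply(corner: str, vdd_nom: float = VDD_NOM) -> float:
    """Supply used in the F4 corner runs: reduced by 10% in SS, SF and
    FS, increased by 10% in FF."""
    if corner == "FF":
        return vdd_nom * 1.10
    if corner in ("SS", "SF", "FS"):
        return vdd_nom * 0.90
    raise ValueError(f"unknown corner: {corner}")


if __name__ == "__main__":
    # Hypothetical per-stage extremes for a three-stage example.
    print(meets_f2([0.04, 0.07, 0.09], [1.18, 1.17, 1.15]))
    for corner in ("SS", "FF", "SF", "FS"):
        print(corner, round(corner_supply(corner), 3), "V")
```

F2 would then be the highest clock frequency at which such a check still passes for the whole chain.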
Table II summarizes obtained results. Frequencies have 
been normalized with respect to the smallest one measured in 
the experiment (F4 for domino K5). 
TABLE II.  CHARACTERIZATION OF THE NORMALIZED FREQUENCIES OF 
THE PIPELINES 
 
          F1            F2            F3            F4            F5 
      Dom.  DOE     Dom.  DOE     Dom.  DOE     Dom.  DOE     Dom.  DOE 
K1    3.13  2.50    3.31  3.75    3.25  4.06    2.00  2.56    3.06  3.19 
K2    2.69  2.50    2.94  3.50     -     -       -     -      2.63  3.06 
K3    2.25  2.50    2.63  3.38    2.63  3.81    1.56  2.38    2.13  3.00 
K4    1.75  2.50    2.13  3.19     -     -       -     -      1.81  2.88 
K5    1.38  2.50    1.63  3.00    1.88  3.50    1.00  2.00    1.44  2.75 
 
 
 
 
Fig 3. Voltage level to which VDYN of each stage discharges for both 
topologies. 
 
It is observed that F1 values are independent of the keeper 
size in DOE circuits. In these circuits, F1 is limited in all 
cases by the criterion on the readiness of the inputs and, since 
the gate delay is the same for all gate versions, so is F1. In 
domino, F1 decreases from K1 to K5, as expected, 
since gate delays increase. It is interesting to compare the K3 
results, as both gate delays and dynamic node behavior are 
similar for domino and DOE in this case. Note that the DOE F1 is 
slightly higher than the domino one (about 11%). This is due to 
the fact that, although evaluation delays (reported in Table I) 
are similar, precharge delays are not. They are larger in the DOE 
gates, which translates into a small advantage for pipeline 
operation. However, this advantage is limited. We expect 
larger differences on the basis of the distinct behaviors pointed 
out in the previous section. 
Concerning F2, it can be observed that the operating 
frequency of the DOE chain is higher than the operating 
frequency of the domino chain for all the keeper transistors, 
even for those for which the delay of the domino gate is 
smaller than the delay of its DOE counterpart (K1 and K2). 
Improvements increase from K1 to K5, as expected, ranging 
from 13% to 85%. As we anticipated, larger differences 
between domino and DOE are observed. For example, for K3, 
which corresponds to almost equal gate delays, the improvement 
rises from 11% (when comparing F1) to 29% (F2). In addition 
to the advantages associated with the larger precharge times 
previously pointed out, there are advantages which arise from 
the inverting feature of the DOE gates discussed in Section II. 
Since the discharge of the dynamic node does not degrade along 
the chain, the operating frequencies fulfilling the F2 criterion 
are higher. 
The F3 of the DOE chains is also higher than the F3 of the domino 
chains for all the keeper transistors. Improvements range from 25% 
(K1) to 87% (K5). For K3, the improvement rises from 29% 
(when comparing F2) to 45% (F3). No significant difference is 
observed between the F2 and F3 improvements for K5. This 
is probably the result of a suboptimal DOE design for the largest 
keeper, K5. 
Concerning F4, improvements range from 28% (K1) to 
100% (K5). It is interesting to compare the results obtained for 
F1 (a very conservative design criterion) and for F4. For example, 
for K3, F4 is 30% smaller than F1 for domino, but only 5% 
smaller for DOE. These results show the better performance of 
DOE with respect to domino, including its robustness to 
variability. Other experiments have been carried out in order to 
analyze in depth the performance of both architectures. 
Monte Carlo analysis has been applied. Clock skew and jitter 
have also been considered in this experiment. For 
this purpose, random variables with Gaussian distributions have 
been associated with the positions of the edges of the three clock 
phase signals. The frequency up to which correct operation is 
observed in 30 MC simulations has been measured. This 
frequency is reported as F5 in Table II. The better performance of 
DOE is clearly observed. 
Analyzing the results for K3 is relevant since, as we have 
already pointed out, DOE and domino gate delays 
are similar. For this gate version, DOE is better than domino 
for all measured frequencies. The achieved improvement goes 
from 11% for F1 to 52% for F4. Even larger differences can 
be obtained in cases for which noise tolerance constraints 
require keeper transistor sizes for which DOE gate 
delays are smaller than domino ones. 
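
The improvement figures quoted in this section follow directly from the normalized frequencies of Table II. The short Python sketch below recomputes them for the gate versions with complete rows (K1, K3 and K5); the dictionary layout and function name are only illustrative. Because the table entries are themselves rounded, recomputed percentages can differ by about one point from those quoted in the text.

```python
# Normalized frequencies from Table II as (domino, DOE) pairs.
TABLE_II = {
    "K1": {"F1": (3.13, 2.50), "F2": (3.31, 3.75), "F3": (3.25, 4.06),
           "F4": (2.00, 2.56), "F5": (3.06, 3.19)},
    "K3": {"F1": (2.25, 2.50), "F2": (2.63, 3.38), "F3": (2.63, 3.81),
           "F4": (1.56, 2.38), "F5": (2.13, 3.00)},
    "K5": {"F1": (1.38, 2.50), "F2": (1.63, 3.00), "F3": (1.88, 3.50),
           "F4": (1.00, 2.00), "F5": (1.44, 2.75)},
}


def improvement_pct(domino: float, doe: float) -> float:
    """Relative improvement of DOE over domino, in percent."""
    return 100.0 * (doe - domino) / domino


if __name__ == "__main__":
    for gate, row in TABLE_II.items():
        cells = ", ".join(
            f"{name}: {improvement_pct(dom, doe):+.1f}%"
            for name, (dom, doe) in row.items()
        )
        print(f"{gate}  {cells}")
```

A negative value, as for F1 with the smallest keeper K1, indicates a case in which the domino chain is faster under that criterion.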
IV. CONCLUSIONS 
We have analyzed the operation of gate-level pipelines 
implemented with domino and with DOE, and their 
performance in terms of operating frequency. The differences 
are explained on the basis of the input combination producing 
a zero-to-one transition of the gate output in each topology. In 
domino, being non-inverting, a non-ideal one degrades as it 
propagates through the logic network, eventually leading to a 
functional failure. Unlike in domino, degraded logic ones do not 
propagate in DOE circuits. Using domino and DOE gates 
with similar delays, improvements in operating 
frequency of around 50% have been obtained for ten-stage 
chains of high fan-in gates operated in a gate-level pipelined 
architecture with three clock phases. 
ACKNOWLEDGMENT 
This work has been funded by the Ministerio de Economía y 
Competitividad del Gobierno de España, with support from 
FEDER, under Projects TEC2010-18937 and TEC2011-28302. 
REFERENCES 
 
[1] D. Harris and M.A. Horowitz, "Skew-tolerant domino circuits", IEEE 
Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1702-1711, Nov. 
1997. 
[2] R. Hossain, "High Performance ASIC Design", Cambridge, 2008. 
[3] S. Horne, D. Glowka, S. McMahon, P. Nixon, M. Seningen and G. 
Vijayan, "Fast14 Technology: design technology for the automation of 
multi-gigahertz digital logic", International Conference on Integrated 
Circuit Design and Technology, pp. 165-173, 2004. 
[4] W. Belluomini, D. Jamsek, A. Martin, C. McDowell, R. Montoye, T. et 
al., "An 8 GHz floating point multiply", IEEE International Solid-State 
Circuits Conference, pp. 374-604, 2005. 
[5] J. Sivagnaname, H.C. Ngo, K.J. Nowka, R.K. Montoye and R.B. 
Brown, "Wide limited switch dynamic logic circuit implementations", 
IEEE International Conference on VLSI Design, 2006. 
[6] R.J. Sung and D.G. Elliot, "Clock-logic domino circuits for high-speed 
and energy-efficient microprocessor pipelines", IEEE Transactions on 
Circuits and Systems II: Express Briefs, vol. 54, no. 5, pp. 460-464, 
2007. 
[7] C.K. Jerry, W.-H. Ma, S. Kim and M. Papaefthymiou, "2.07 GHz 
floating-point unit with resonant-clock precharge logic", IEEE Asian 
Solid State Circuits Conference (A-SSCC), pp. 1-4, Nov. 2010. 
[8] Z. Owda, Y. Tsiatoushas and T. Haniotakis, "High Performance and 
Low Power Dynamic Circuit Design", IEEE New Circuits and System 
Conference, pp. 502-505, 2011. 
[9] A. Peiravi and M. Asyaei, "Current-Comparison-based Domino: new 
low-leakage high speed domino circuit for wide fan-in gates", IEEE 
Trans. on Very Large Scale Integration Systems, vol. 21, no. 5, 
pp. 934-943, 2012. 
[10] A. Alvandpour, R.K. Krishnamurthy, K. Soumyanath and Shekhar Y. 
Borkar, "A sub-130-nm conditional keeper technique", IEEE Journal of 
Solid-State Circuits, vol. 37, no. 5, pp. 633-638, May 2002. 
[11] H. F. Dadgour and K. Banerjee, "A novel variation-tolerant keeper 
architecture for high-performance low-power wide fan-in dynamic OR 
gates", IEEE Trans. on Very Large Scale Integration Systems, vol. 18, 
no. 11, pp. 1567-1577, 2010. 
[12] J. Núñez, M.J. Avedillo, J. M. Quintana and H. J. Quintero, "Novel 
Dynamic Gate Topology for Superpipelines in DSM Technologies", 
Proceedings Digital System Design 2013, pp. 280-28, 2013. 
