# **Energy Recovery Clocked Dynamic Logic**

Matthew Cooke, Hamid Mahmoodi, Qikai Chen, and Kaushik Roy

School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA

{cookem, mahmoodi, qikaichen, kaushik}@purdue.edu

# Abstract

Energy recovery clocking results in significant energy savings in clock distribution networks as compared to conventional square-wave clocking. However, since energy recovery clocks are sinusoidal in nature, standard dynamic logic styles do not work efficiently when used with energy recovery clocks. We propose novel dynamic logic styles that operate more efficiently with sinusoidal clocks, enabling energy recovery from their clock networks and resulting in significant energy savings. Based on simulation results using TSMC 0.25 µm CMOS process technology, at iso-performance, the proposed dynamic logic styles exhibit up to 53% power reduction.

## **Categories and Subject Descriptors**

B.6.1 [Logic Design]: Design Styles – *energy recovery clock, domino logic.* 

General Terms: Performance, Design, Theory.

**Keywords:** Energy recovery, clock, domino, logic

## 1. Introduction

With the continuing increase in the clock frequency and complexity of clock distribution networks, the resulting increase in power consumption has become the major obstacle to the realization of high-performance designs. The major fraction of the total power consumption in highly synchronous systems, such as microprocessors, is due to the clock network. In the Itanium<sup>TM</sup> microprocessor, more than 30% of the total chip power is due to the clock distribution network [1]. Thus, innovative clocking techniques for decreasing the power consumption of the clock networks are required for future designs.

Energy recovery clocking is a promising technique developed for reducing power dissipation over clock networks [2-7]. Energy recovery clocking achieves low energy dissipation by slowly charging and discharging the clock network capacitances and recycling the energy stored on the capacitors using a sinusoidal clock [2,3]. The energy recovery clocking scheme recycles the energy from the clock network capacitance in each cycle of the clock. The energy recovery clock does not replace the constant supply required for logic, as is done in adiabatic logic [8,9], but rather acts as a timing reference in the same way as a conventional clock signal. The rest of the system still operates with a constant supply voltage.
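The benefit of slow charging can be illustrated with a standard first-order estimate. This is a textbook sketch rather than a result of this paper, and the symbols $R$ (series resistance of the clock path), $C$ (clock load capacitance), $V$ (clock swing), and $T$ (charging time) are assumptions introduced only for illustration. Abruptly charging $C$ from a constant supply dissipates $\frac{1}{2}CV^2$ regardless of $R$, whereas charging it with a constant current $I = CV/T$ over a time $T$ dissipates

$$E_{\text{slow}} = I^2 R\,T = \left(\frac{CV}{T}\right)^2 R\,T = \frac{RC}{T}\,CV^2 ,$$

which becomes small compared with $\frac{1}{2}CV^2$ when $T \gg RC$. A sinusoidal clock approximates this gradual charging, and the resonant clock generator returns the stored energy rather than dumping it to ground.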

Besides efficient sinusoidal clock generation and distribution, for energy recovery schemes to work properly, logic circuits which require the clock as a timing reference must be re-designed to operate effectively with a sinusoidal clock. In [2] and [3], flip-flops that work efficiently with a single-phase sinusoidal clock have been proposed. Another major type of circuit that directly connects to the clock network is dynamic logic gates [4,14]. In dynamic logic gates, the clock signal is used for setting precharge and evaluation phases of operation. Due to the slow rising and falling transitions of the clock signal, applying an energy recovery clock (sinusoidal clock) to a dynamic logic gate may not result in efficient circuit operation. Therefore, the existing dynamic logic styles need to be modified so that they can operate efficiently with a sinusoidal clock signal.

GLSVLSI'05, April 17-19, 2005, Chicago, Illinois, USA.

Copyright 2005 ACM 1-59593-057-4/05/0004...\$5.00.

In this paper, we investigate several possible schemes for the dynamic logic to operate efficiently with a single-phase sinusoidal clock. We propose 5 solutions for the dynamic logic (specifically Domino logic) to operate with the sinusoidal clock. We evaluate the proposed schemes in terms of power consumption. When taking into account the power dissipated on their local clock network, the proposed schemes exhibit significant reduction in power consumption as compared to the standard square-wave clocked domino circuits.

## 2. Local Square-Wave Clocked Domino

An initial idea to address the issue of using dynamic logic with an energy recovery clock is to generate a local square-wave clock from the sinusoidal clock to drive the dynamic logic. Standard domino gates can then be used without any modifications. A sinusoidal signal can be converted to a square waveform by a chain of progressively upsized inverters. This scheme cannot exhibit a significant power advantage as compared to the standard square-wave clocked domino logic, since it is essentially the same logic with an added clock conversion overhead. However, it does provide a simple method for using conventional domino gates while keeping the power-saving benefits of the global energy recovery clock. Another benefit of this scheme is that simply skewing the inverters of the clock converter can significantly change the duty cycle of the local clock. This is sometimes desirable in order to allocate the majority of the clock period to evaluation.
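The effect of inverter skew on the local duty cycle can be sketched with an idealized model. The sketch below assumes a full-swing sinusoidal clock $V(\theta) = \frac{V_{dd}}{2}(1+\sin\theta)$ and treats the first inverter as an ideal comparator with trip point `v_trip`; both are simplifying assumptions, and the function name and example voltages are hypothetical, not values from the paper.

```python
import math


def local_clock_duty_cycle(v_trip: float, v_dd: float) -> float:
    """Fraction of the period a full-swing sinusoid spends above v_trip.

    Assumes V(theta) = v_dd/2 * (1 + sin(theta)) and an ideal comparator;
    real inverters have finite gain and delay, which this ignores.
    """
    if not 0.0 <= v_trip <= v_dd:
        raise ValueError("trip point must lie within the clock swing")
    s = 2.0 * v_trip / v_dd - 1.0          # sin(theta) at the trip point
    return 0.5 - math.asin(s) / math.pi    # time fraction with sin(theta) > s


if __name__ == "__main__":
    v_dd = 2.5  # assumed supply for illustration
    for v_trip in (0.75, 1.25, 1.75):
        duty = local_clock_duty_cycle(v_trip, v_dd)
        print(f"trip = {v_trip:.2f} V -> input above trip for {duty:.1%} of the period")
```

Whether the time above the trip point corresponds to the high or the low phase of the local clock depends on the number of inverters in the chain; the model only shows that moving the trip point away from $V_{dd}/2$ moves the duty cycle away from 50%.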

## 3. Energy Recovery Clocked Domino Schemes

In this section, three schemes of domino logic operating directly with an energy recovery (sinusoidal) clock are described.

### A. Energy Recovery (ER) Clocked Domino

In order to be able to recover energy from the clock capacitance of the domino logic, the domino logic needs to operate with the sinusoidal clock. When a sinusoidal clock is applied to the standard footed domino, the NMOS evaluation network is ON when the clock signal is above the threshold voltage of NMOS ($V_{tn}$), and the precharge transistor is ON when the clock signal is below $V_{dd} - |V_{tp}|$, where $V_{dd}$ is the supply voltage and $V_{tp}$ is the threshold voltage of PMOS. Therefore, there is a large fraction of the cycle time when both precharge and evaluation transistors are ON simultaneously (Fig. 1), resulting in short-circuit current. Nonetheless, in the case of footed domino, the reduction in local clock power, which is a greater fraction of total power, can mitigate the increased short-circuit power and result in an overall decrease in power.

Evaluation Time, O = Overlap Time of Precharge and Evaluation)

Figure 2: Currents of ER Clocked Domino: (a) Nominal $V_{dd}$ and (b) low $V_{dd}$
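The size of the overlap window for the footed ER clocked domino can be estimated with a simple threshold model. As an assumption for illustration (the paper does not give the clock equation), take a full-swing clock $V_{CLOCK}(\theta) = \frac{V_{dd}}{2}(1+\sin\theta)$ with $\theta = 2\pi t/T$. Both networks conduct while $V_{tn} < V_{CLOCK} < V_{dd} - |V_{tp}|$, and the sinusoid crosses this band once while rising and once while falling, so the overlap fraction of the period is

$$O = \frac{1}{\pi}\left[\arcsin\!\left(1 - \frac{2|V_{tp}|}{V_{dd}}\right) - \arcsin\!\left(\frac{2V_{tn}}{V_{dd}} - 1\right)\right].$$

For the hypothetical values $V_{dd} = 2.5\,\text{V}$ and $V_{tn} = |V_{tp}| = 0.5\,\text{V}$, this gives $O = \frac{2}{\pi}\arcsin(0.6) \approx 0.41$, i.e. roughly 41% of the cycle. The model ignores body effect and the finite strength of the devices near threshold, so it only indicates the order of magnitude.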

### B. Reduced Supply ER Clocked Domino

In order to further reduce power, short-circuit power must be reduced. One way to reduce this portion of the power is to reduce the duration of time over which both evaluation and precharge transistors are ON. This can be done by reducing the supply voltage of the logic while maintaining the clock swing at full V<sub>dd</sub>. If the supply voltage of the logic is reduced to V<sub>ddLOW</sub> less than V<sub>dd</sub>, then the precharge duration is when V<sub>CLOCK</sub> < V<sub>ddLOW</sub> – |V<sub>tp</sub>|. This duration is shorter than the original duration of V<sub>CLOCK</sub> < V<sub>dd</sub> – |V<sub>tp</sub>| as shown in Fig. 1. The shorter duration of precharging reduces the timing overlap of precharge and evaluation phases. This, in turn, results in less short-circuit current.
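The shrinking of the overlap window with a reduced logic supply can be sketched numerically. The code below assumes an ideal full-swing sinusoidal clock and sharp threshold turn-on, and the voltages in the usage section (2.5 V clock swing, 0.5 V thresholds) are illustrative assumptions rather than the paper's device parameters; the function names are hypothetical.

```python
import math


def fraction_between(v_lo: float, v_hi: float, v_swing: float) -> float:
    """Fraction of the period a sinusoid swinging 0..v_swing spends in (v_lo, v_hi)."""
    if v_hi <= v_lo:
        return 0.0
    # Map voltages to sin(theta) for V = v_swing/2 * (1 + sin(theta)), clamped to [-1, 1].
    a = max(-1.0, min(1.0, 2.0 * v_lo / v_swing - 1.0))
    b = max(-1.0, min(1.0, 2.0 * v_hi / v_swing - 1.0))
    # The band is crossed twice per period (rising and falling edge).
    return (math.asin(b) - math.asin(a)) / math.pi


def overlap_fraction(v_logic: float, v_clk: float, v_tn: float, v_tp_abs: float) -> float:
    """Time fraction with evaluation (V > v_tn) and precharge (V < v_logic - |v_tp|) both ON."""
    return fraction_between(v_tn, v_logic - v_tp_abs, v_clk)


def strong_eval_fraction(v_logic: float, v_clk: float, v_tp_abs: float) -> float:
    """Time fraction with the precharge transistor OFF (V > v_logic - |v_tp|)."""
    return fraction_between(v_logic - v_tp_abs, v_clk, v_clk)


if __name__ == "__main__":
    v_clk, v_tn, v_tp_abs = 2.5, 0.5, 0.5  # assumed values
    for v_logic in (2.5, 2.0):
        o = overlap_fraction(v_logic, v_clk, v_tn, v_tp_abs)
        se = strong_eval_fraction(v_logic, v_clk, v_tp_abs)
        print(f"logic supply {v_logic:.1f} V: overlap {o:.1%}, strong evaluation {se:.1%}")
```

Under these assumptions, lowering the logic supply while keeping the clock swing fixed reduces the overlap fraction and lengthens the window in which the precharge transistor is fully OFF, which is the qualitative behavior described above.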

A comparison of this overlap region and the corresponding short-circuit current between nominal-supply domino and low-$V_{dd}$ domino can be seen in Fig. 2. The currents shown in this figure are the total currents of all gates in a chain of 10 cascaded OR gates designed in TSMC 0.25 µm technology. It is clear that when low $V_{dd}$ is applied, precharge current does not appear until a later time in the clock cycle and thus does not overlap as much with evaluation current as when using a nominal $V_{dd}$. Both the short-circuit and switching power dissipation are reduced by lowering the supply voltage. In general, a reduced supply voltage slows down the evaluation. However, simulation verifies that the first 100 mV of reduction can actually speed up the evaluation by reducing the precharge/evaluate time overlap, as shown in Fig. 3.







Figure 5: QUAD Timing (P.P. = Precharging Pulse)



Figure 3: Effects of Reduced Supply ER Clocked Domino On (a) Total Power and Area, and (b) Short-Circuit Power

To verify the power saving of the reduced supply ER clocked domino logic, a chain of 10 OR gates is implemented in TSMC 0.25 µm. The transistors are sized to maintain iso-performance under different supply voltages. Fig. 3(a) shows total power versus the voltage used for low $V_{dd}$, and how changing $V_{dd}$ affects the circuit area required to maintain iso-performance. Fig. 3(b) shows how this affects short-circuit power specifically. When V<sub>ddLOW</sub> is scaled below some point, the short-circuit current begins to increase because the transistor sizes must increase significantly to maintain the same speed as with the nominal V<sub>dd</sub>. This results in more short-circuit power dissipation during the overlap time between precharge and evaluation. It can be seen by comparing Fig. 3(a) and Fig. 3(b) that the reduction of short-circuit power is only part of the power reduction between nominal and optimal V<sub>ddLOW</sub>. The increase of "Strong Evaluation" time, as seen in Fig. 1 (noted as "SE1" and "SE2"), allows for smaller transistor sizes than would otherwise be expected to maintain iso-performance with reduced voltage. Here, strong evaluation refers to the period when $V_{CLOCK} > V_{dd} - |V_{tp}|$, i.e. when the precharge transistors are OFF. As observed from Fig. 3, there is an optimum low supply voltage (V<sub>ddLOW</sub>), which is 2.0 V in this case.

### C. QUasi-Adiabatic Domino (QUAD)

For standard square-wave clocked domino logic, footless gates can be used to improve performance by reducing the transistor stack height in the evaluation network. Short-circuit power can be managed by delaying the square-wave clock using buffers. However, an energy recovery clock cannot be delayed while maintaining the energy recovery property unless an analog delay (RC delay) is used. This type of delay is not practical because it would reduce the voltage swing, integrity, and energy recovery efficiency of the clock. Although footless gates are possible without a delayed clock, our simulations show that the power penalty due to short-circuit current during precharging makes them very power-hungry and not competitive with footed gates. To avoid this problem, yet still benefit from the higher performance that can be obtained from removing the footer transistor, we propose the Quasi-Adiabatic Domino (QUAD) logic, a sinusoidal-clocked dynamic logic style which does not require a footer transistor.



Figure 6: QUAD Logic Gates: (a) All QUAD and (b) Mixed QUAD/ ER Clocked Domino (Footed)



Figure 7: Operational Waveforms For Single Gate Of QUAD Logic (X = Precharge Node)

A diagram of a generic QUAD gate is shown in Fig. 4. Although this gate does not strictly adhere to the adiabatic switching principles [8], a significant portion of energy can still be recovered from internal nodes. This is different from other dynamic logic styles, in which the capacitance seen by the clock is the gate capacitance of the precharge transistor and possibly the footer transistor. Here, a precharge pulse is generated to control the precharge transistor. Also energy is recovered from the precharge node (X in Fig. 4). More importantly, QUAD logic works without a footer transistor in the evaluation path, improving potential performance.

For this logic style, precharging occurs when the clock is high, and evaluation occurs when the clock is low. Specifically, precharging occurs when the "precharge pulse" (see Fig. 4 & 5) is low and clock is rising between  $V_{dd}/2$  and  $V_{dd}$ . Fig. 5 shows typical timing windows of QUAD logic. Realistically, the evaluation time will be slightly less than half the clock period. This is because during evaluation the precharge node cannot be discharged to a voltage lower than the clock voltage. Though the inverters are highly skewed for a fast low to high transition at their outputs, their trip point cannot be above  $V_{dd} - |V_{tp}|$ . Also, the evaluation path cannot turn on until  $V_{CLOCK} < V_{dd} - V_{tn}$ . Therefore, evaluation is slower when the CLK signal is at higher voltages.

For this logic to work properly, the "precharge pulse" must be properly timed and shaped by several gates, as shown in Fig. 6(a). This requires some amount of power, but it can be shared among many gates. Another issue that must be addressed with this logic is the potential for short-circuit power in the static inverter. When the sinusoidal clock is rising, it charges the input of the inverter for gates which have been evaluated to zero. Precharging begins when the clock reaches a voltage around $V_{dd}/2$, but the voltage of the precharge node cannot reach V<sub>dd</sub> faster than the clock charges to V<sub>dd</sub>. Therefore, in the worst case, for a gate which evaluates to low at the beginning of the evaluation phase ($V_{CLOCK} \sim V_{dd}/2$), the input to its inverter follows the clock voltage for about 75% of the clock period. Since the clock is slow to rise and fall, short-circuit power in the inverter can be more than that in the footed ER clocked domino case. Since this inverter is highly skewed for a fast low-to-high transition, the NMOS is typically minimum sized. However, to further reduce the short-circuit current, two minimum-sized transistors in series have been used in the pull-down network of the inverter, as shown in Fig. 4. In this way, the pull-down path of the inverter can be made very weak so as to decrease short-circuit power without affecting the gate performance, which depends on the strength of the pull-up path of the inverter. Simulations show that this yields short-circuit power consumption similar to the ER clocked domino.
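The "about 75%" worst-case figure can be checked with a short phase calculation. As an assumption for illustration (the clock equation is not given in the paper), take $V_{CLOCK}(\theta) = \frac{V_{dd}}{2}(1+\sin\theta)$ with $\theta = 2\pi t/T$. Evaluation begins as the clock falls through $V_{dd}/2$ at $\theta = \pi$, and a gate that evaluates to low at that instant has its precharge node follow the clock through the trough and back up to the next peak at $\theta = 2\pi + \pi/2$:

$$\frac{\Delta\theta}{2\pi} = \frac{\left(2\pi + \tfrac{\pi}{2}\right) - \pi}{2\pi} = \frac{3}{4}.$$

Under the same assumption, the precharge window in which the clock rises from $V_{dd}/2$ to $V_{dd}$ spans $\theta \in [0, \pi/2]$, i.e. one quarter of the period.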

Simulation setup for a chain of QUAD gates is shown in Fig. 6(a). Waveforms for a single gate in the middle of the chain are shown in Fig. 7. Notice that after a gate has evaluated to low, its precharge node (X) continues to follow the clock until it peaks at  $V_{dd}$ .

To improve the power and performance of QUAD gates, footed ER clocked domino gates can be used at the beginning and end of each stage (as shown in Fig. 6(b)), where QUAD gates are slow. This allows the QUAD gates in the middle of the stage to be sized smaller while maintaining the expected overall performance of the chain, reducing clock and switching power. Since QUAD gates evaluate when the clock is low, they must be mixed with PMOS domino gates, which pre-discharge to a low voltage and evaluate to a high voltage via a PMOS transistor evaluation network. When interfacing PMOS domino gates with QUAD gates, the inverter of the preceding gate must be removed. This is because PMOS domino gates output a high when pre-discharged, while QUAD gates expect a low input during precharging. Likewise, PMOS domino gates expect a high input during the precharge phase, though QUAD gates output a low when precharged. Though PMOS domino gates are somewhat slower than NMOS gates, they are faster than using QUAD gates at the beginning and end of the evaluation phase. Unlike standard P-N domino logic, QUAD logic should be more immune to noise propagation between P-type gates and N-type gates because it is impossible for one gate to evaluate to a high voltage while the next one evaluates to a low voltage, since every gate must be evaluated to CLK. With the introduction of QUAD logic, we have shown a way to get the performance benefits of footless logic while maintaining the energy recovery properties of the clock signal over not only the global clock interconnects, but also the local clock interconnects, the gate capacitances, and even some internal nodes of logic gates.

## 4. Comparisons and Discussions

To evaluate and compare the power consumption of the proposed dynamic logic schemes, simulations were performed on a simple circuit (a cascade of 10 OR gates). All the circuits are designed in TSMC 0.25 µm and sized for iso-performance. Power breakdown and comparisons of all proposed sinusoidal clocked domino schemes and the standard square-wave clocked domino schemes are shown in Fig. 8. They appear in the order presented in this paper. The power is broken down into clock, short-circuit, and switching power. For the logic styles that need pulse or local clock generators, the corresponding power overhead is shown separately. The clock power represents the power dissipated for switching the clock inputs of the gates. In [2] and [3], it is shown that using an energy recovery clock allows at least 90% of the energy associated with charging and discharging the clock network and associated capacitive loads to be recovered. The clock power that remains is, in effect, the power dissipated over the parasitic resistances in the clock interconnects.



Figure 8: Power Breakdown Of All Logic Types



Figure 9: Area Comparison

Taking these factors into account, the power that would be dissipated to drive the clock input of an ER clocked domino gate using a square-wave clock is multiplied by 10% to estimate the clock component of power dissipation when using an ER clock. For the case of QUAD logic, this power was added to the power dissipated over the PMOS and NMOS transistors used for charging and discharging the precharge node to get the total clock component of power dissipation. The short-circuit power is the power dissipation caused by direct paths from the supply to ground. Except for QUAD logic, this is mainly a result of the simultaneous conduction of the precharge transistor and the evaluation network. The remainder of the total power is the switching power, which is the power due to switching of internal nodes. Between the standard square-wave domino gates, footless domino consumes more power. Although switching power is lower for footless domino, the added power of delaying the clock for every stage leads to a dramatically increased clock power. The reduction in switching power is a result of the elimination of the diffusion capacitance of the footer transistor and of smaller transistors in the evaluation network for the given target frequency.
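The clock-power estimate described above amounts to a simple scaling, sketched here. The function and parameter names are hypothetical, the 90% recovery figure is the lower bound quoted from [2] and [3], and the numeric inputs in the usage section are placeholders in arbitrary units, not measurements from the paper.

```python
def er_clock_power(p_clock_square_wave: float, recovered_fraction: float = 0.9) -> float:
    """Clock power under an ER clock: the unrecovered share of the square-wave clock power."""
    if not 0.0 <= recovered_fraction <= 1.0:
        raise ValueError("recovered_fraction must be between 0 and 1")
    return p_clock_square_wave * (1.0 - recovered_fraction)


def quad_clock_power(p_clock_square_wave: float,
                     p_precharge_node_devices: float,
                     recovered_fraction: float = 0.9) -> float:
    """QUAD clock component: unrecovered clock power plus the power dissipated in the
    PMOS/NMOS devices that charge and discharge the precharge node."""
    return er_clock_power(p_clock_square_wave, recovered_fraction) + p_precharge_node_devices


if __name__ == "__main__":
    # Placeholder values in arbitrary units, for illustration only.
    print(er_clock_power(100.0))
    print(quad_clock_power(100.0, 15.0))
```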

The first proposed technique that utilizes an energy recovery clock, the local square-wave clocked domino, consumes slightly more power than conventionally clocked footed domino. This is due to the power required for converting the sinusoidal clock to a local square-wave clock. Clock power is insignificant because the clock only drives the first small inverter in the local clock generator. The ER clocked domino and the reduced supply ER clocked domino schemes operate directly with the sinusoidal clock, enabling energy recovery from the clock capacitance associated with the domino clock load. For ER clocked domino, switching power and short-circuit power are slightly higher than for footed square-wave clocked domino (due to the slow rise and fall times), but the energy saved by using the energy recovery clock more than makes up for it. By employing reduced supply ER clocked domino, the switching and short-circuit power are reduced, resulting in an overall power savings of 42% as compared to the footed square-wave domino, and 53% as compared to the footless version. For nominal V<sub>dd</sub> ER clocked domino, the increased short-circuit power did lead to an increase in the non-clock-related portion of the power. Employing the reduced voltage ER clocked domino scheme presented, we were able to reduce short-circuit power by 51%, and the total power by 27%, as compared to using nominal V<sub>dd</sub>.
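The quoted savings can be placed on a common scale with a little arithmetic. The sketch below normalizes the footed square-wave domino to 1.0 and derives the other totals from the 42%, 53%, and 27% figures; the derived ratios are an inference from those rounded percentages, assuming all three refer to the same total-power measure, and are not values reported in the paper.

```python
# Normalize footed square-wave domino total power to 1.0.
footed_sq = 1.0

# Reduced supply ER clocked domino saves 42% relative to footed square-wave domino.
reduced_er = footed_sq * (1.0 - 0.42)

# The same design saves 53% relative to footless square-wave domino.
footless_sq = reduced_er / (1.0 - 0.53)

# It also saves 27% relative to nominal-Vdd ER clocked domino.
nominal_er = reduced_er / (1.0 - 0.27)

if __name__ == "__main__":
    for name, value in [("footed square-wave", footed_sq),
                        ("footless square-wave", footless_sq),
                        ("ER clocked, nominal Vdd", nominal_er),
                        ("ER clocked, reduced supply", reduced_er)]:
        print(f"{name:28s} {value:.2f}")
```

Read this way, the nominal-$V_{dd}$ ER clocked domino would sit at roughly 0.79 of the footed square-wave total and the footless square-wave design at roughly 1.23, consistent with the qualitative ordering described in the text.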

QUAD logic consumes only slightly less power than ER Clocked Domino using nominal voltage. However, it has the advantage of a higher potential performance. Another important advantage is that the pulse generation logic can be shared among many more gates. Since the majority of the power associated with generating the pulse comes from the generation of the pulse as opposed to the actual driving of the gate capacitances, increasing the load on this pulse generation circuitry will only increase a fraction of its power. The fact that the power does not significantly decrease for QUAD logic as compared to nominal $V_{dd}$ ER Clocked Domino is similar to the situation with footed and footless square-wave domino gates. Although switching power decreases, the potential of higher performance is paid for by an increase in the overhead that makes the footless operation possible. However, unlike with the conventional footless delayed-clock domino, the pulse generation overhead can more effectively be shared among gates, and overhead power will not increase as drastically when frequency increases. Overall, 70% of the clock energy associated with QUAD logic was recovered. As mentioned previously, QUAD logic has non-adiabatic switching when the precharge node is discharged, and thus the efficiency of energy recovery from the precharge node was only 56%. However, the total energy recovered using QUAD was more than could be recovered using ER Clocked Domino, because the capacitance of the precharge node (which is composed of two gate capacitances and two diffusion capacitances) is much greater than the gate capacitance of the precharge transistor. When combining QUAD and ER Clocked Domino, power can be effectively reduced as compared to only using QUAD logic. Although power consumption here is not as low as with the reduced supply ER Clocked Domino technique, the power associated with the logic gates (the sum of switching, short-circuit, and clock power) is lower, and it does not require a reduced $V_{dd}$ or a dual $V_{dd}$. Depending on the clock rate, the chain length, and the technology, the number of QUAD gates and ER Domino gates may vary. We have replaced the first two and last two QUAD gates in the chain with ER Clocked Domino gates. Shorter chains may require replacing only two gates for optimal power dissipation, and longer chains may require replacing more than four.

Fig. 9 shows an area comparison of all optimized chains. The area is estimated based on total gate area. This area includes local clock drivers required for square wave clocking. There is not a large difference in area among logic types. Even with the added pulse generation circuitry, the mixed QUAD and ER Domino requires the smallest area. As expected, transistor sizing in the QUAD gates could be significantly reduced to maintain iso-performance when ER Domino gates were added to latch the input and output of each stage to make up for the decreased performance in these QUAD gates.

## 5. Conclusions

We have proposed novel dynamic logic styles enabling energy recovery clocking schemes in domino logic, resulting in significant total energy savings as compared to square-wave clocking. The proposed domino logic styles operate with a single-phase sinusoidal clock, which can be generated with high efficiency. By employing energy recovery clocking and using the proposed domino logic styles, the total energy associated with dynamic logic can be reduced by up to 53% as compared with using conventional square-wave clocked domino logic.

## References

- [1] S.D. Naffziger et al., ISSCC, pp. 344-472, 2002.
- [2] M. Cooke et al., ISLPED, Aug. 2003.
- [3] C.H. Ziesler et al., pp. 48-53, Aug. 2003.
- [4] A.J. Drake et al., IEEE JSSC, vol. 39, pp. 1520-1528, Sept. 2004.
- [5] S.C. Chan et al., ISSCC, pp. 342-343, 2004.
- [6] J. Wood et al., ISSCC Dig. Tech. Papers, pp. 400-401, Feb. 2001.
- [7] F. O'Mahony et al., ISSCC Dig. Tech. Papers, pp. 428-429, Feb. 2003.
- [8] W.C. Athas et al., IEEE Trans. VLSI Systems, vol. 2, no. 4, pp. 398-406, Dec. 1994.
- [9] S. Kim et al., vol. 9, no. 1, Feb. 2001.