Power Reductions with Energy Recovery Using Resonant Topologies by Bezzam, Ignatius S.A.
Santa Clara University
Scholar Commons
Engineering Ph.D. Theses Student Scholarship
5-5-2015




Part of the Electrical and Computer Engineering Commons
This Dissertation is brought to you for free and open access by the Student Scholarship at Scholar Commons. It has been accepted for inclusion in
Engineering Ph.D. Theses by an authorized administrator of Scholar Commons. For more information, please contact rscroggin@scu.edu.









Department of Electrical Engineering 
 
Date: May 5, 2015 
  
I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY 




POWER REDUCTIONS WITH ENERGY RECOVERY 
USING RESONANT TOPOLOGIES 
 
BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF 
 





































Submitted in partial fulfillment of the requirements for  
the Degree of Doctor of Philosophy  
in Electrical Engineering  
in the school of Engineering  
at Santa Clara University, May 5, 2015 
 









  Laura Maria Carcione  












"If I love, what business is it of yours?" – Johann Von Goethe 
This whole work has been about doing what I love, even if it is nobody’s 
profit making business; something I could not do even in my childhood, but got to 
finally in my 50’s. I am passionate about the electrical engineering fields of 
Analog/RF, High Speed Digital and Power management. This thesis is a statement of 
it. It is dedicated to my better half as that is the least I can do for all the love and 
support I received like nobody’s business. 
I want to sincerely thank my advisor Dr. Shoba Krishnan, professor in the
Department of Electrical Engineering, Santa Clara University, for supporting me with
the freedom to disagree, and that too with heart and resources. In addition, her ability
to provide clear-cut direction in the presentation of data has helped me tremendously
in my written communication success so far. So if this dissertation is indeed readable,
more than 90% of the credit goes to her.
I am also indebted to Prof. C. Mathiazhagan who has spurred, inspired and 
challenged me since we did undergrad in IIT Madras in 1993. I thank him immensely 
for taking the time to talk regularly, guide and encourage me to work towards the 
completion of this work. His brilliance, patience, friendship and attention to detail 
have helped me cross many barriers in this work.    
I would also like to thank my “low power guru” Dr. Tezaswi Raja for agreeing
to be on the Ph.D. committee and to co-author several papers with me, in the midst of
all his silicon and carbon tape-outs. I am indebted to Prof. Samiha Mourad for directly
guiding my research and for fruitful interactions with other doctoral students. I am also
thankful to Prof. Tokunbo Ogunfunmi for the intensely useful courses and continuous




with Dr. Ahmed Amer, on my Ph.D. committee. Once again, I thank my advisor and 
all the committee members for taking the time to decide on my courses, teach and 
keep track of the progress of this work. I am grateful to all my other instructors, from 
SCU faculty and outside, for various courses and probing questions answered. 
Last but not least, I am proudly grateful to my son Eric Francis, who is
completing his Bachelor of Science in Electrical & Computer Engineering in Germany,
for challenging me to give him more to follow and for competing to finish his





Table of Contents 
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Motivation for Wide Frequency Energy Recovery
1.2 Literature review
1.3 Top-down Clock Distribution
1.4 Bottom-up View of Non-Resonant (NR) Digital Circuits
1.5 Organization of the thesis
2 Low Power Design through Energy Reuse
2.1 Adiabatic Circuits with energy recovery
2.2 LC Resonance Energy Reuse
2.2.1 Continuous Parallel Resonance Driver (CPR) with Bias Supply
2.2.2 Parallel Resonance with decoupling Capacitor
3 Series Resonance for wide frequency clocking
3.1 Pulsed Series Resonance (PSR)
3.2 Generalized Series Resonance
3.3 GSR with decoupling capacitor (GSR-C)
3.4 GSR Transistor level configurations
3.5 Series Resonance Simulation Results
3.5.1 PSR Functionality
3.5.2 GSR Functionality and Performance
3.5.3 GSR Schematic Diagrams
4 Support circuitry
4.1 GSR Configuration
4.2 PSR Reconfiguration and Application
4.3 Flip-flops For Energy Recovery
4.3.1 Conventional Solutions
4.3.2 Dynamic Latch Solutions for PSR
4.4 PSR Flip-flop Functional Verification
4.5 GSR Functionality and Performance
4.6 Circuit Design Optimizations
5 Timing Performance of Driver solutions
5.1 Propagations Delays and Transition Times
5.1.1 Non-Resonant Driver
5.1.2 Continuous Parallel Resonance (CPR)
5.1.3 Pulsed Series Resonance (PSR)
5.1.4 Generalized Series Resonance (GSR)
5.2 Comparative Analysis
6 Data Path applications
6.1 Resonant Dynamic Logic (RDL)
6.2 RDL Power and Delay
6.3 RDL simulations
7 Area estimates
7.1 PSR Implementation in 45nm




7.3 Inductors
8 Performance Power Area (PPA) Trade off Analysis
8.1 Tradeoffs between NR, CPR, PSR and GSR
8.1.1 Power and Dynamic Voltage Scaling
8.1.2 Delays
8.1.3 Rise/Fall Times and Slew Rates
8.1.4 Skew and Jitter
8.1.5 Area of Driver
8.1.6 Predriver Overhead
8.2 Energy-Delay (E-D) Tradeoff
8.3 PPA Optimization
8.4 Applications
9 System Level Experimental Results
9.1 System Timing Closure
9.2 PSR vs. NR sub-system performance
9.3 GSR vs. NR sub system Performance
9.4 GSR, PSR, CPR and NR Comparative Analysis
10 Design methodology and Flow
11 Conclusions
11.1 Summary
11.2 Conclusion
11.3 Future Work
12 References
Nomenclature
Appendix A: MATLAB for solving ODE and Deriving Expressions
Appendix B: LTSPICE Schematic Diagrams
Appendix C: Test Benches for Simulations
Appendix C: Spread Sheet for Design Calculations






LIST OF TABLES 
TABLE 1 PERFORMANCE POWER AREA TRADEOFFS






LIST OF FIGURES 
FIGURE 1.1 SYSTEM EXPENSES AND ROOT CAUSES
FIGURE 1.2 A COMPREHENSIVE CLOCK DISTRIBUTION AND DATA CAPTURE SCHEME.
FIGURE 1.3 DYNAMIC VOLTAGE AND FREQUENCY SCALING.
FIGURE 1.4 CLOCK DRIVER TOPOLOGY FOR NR.
FIGURE 2.1 LOSSES IN CONVENTIONAL VS. ADIABATIC CHARGING.
FIGURE 2.2 CLOCK DRIVER TOPOLOGIES.
FIGURE 2.3 CLOCK DRIVER TOPOLOGY FOR CONTINUOUS PARALLEL RESONANCE (CPR).
FIGURE 2.4 CONVENTIONAL CONTINUOUS LC RESONANT CLOCKING DRIVER (CPR).
FIGURE 3.1 PULSED SERIES RESONANCE (PSR) (A) SWITCHING CIRCUIT (B) LINEAR MODEL
FIGURE 3.2 PSR OPERATION WITH LOSSES. (A) INPUT PULSE (B) OUTPUT PULSE.
FIGURE 3.3 GSR (A) SWITCHING CIRCUIT (B) EQUIVALENT CIRCUIT MODEL
FIGURE 3.4 TIMING DIAGRAM FOR GENERATING RAIL-TO-RAIL CLOCK OUTPUT.
FIGURE 3.5 GSR-C WITH ENERGY RECOVERY CAPACITANCE CER.
FIGURE 3.6 SAME AS FIGURE 3.4, REPEATED FOR CONVENIENCE.
FIGURE 3.7 GSR FULL CONFIGURATIONS.
FIGURE 3.8 GSR RECONFIGURATIONS.
FIGURE 3.9 PSR OPERATION TIMING WAVEFORMS.
FIGURE 3.10 SIMULATIONS OF GSR AND GSR-C SHOWING THE FUNCTIONALITY.
FIGURE 3.11 GSR VOLTAGE AND FREQUENCY SCALING OPERATION FOR DVFS.
FIGURE 3.12 GSR SCALABLE RECONFIGURABLE DRIVER SCHEMATIC AND MACRO CELL SYMBOL.
FIGURE 3.13 TYPICAL CONFIGURATION OF DRIVER FOR GSR RAIL TO RAIL OPERATION.
FIGURE 4.1 GENERATING CONTROL SIGNALS FOR GSR DRIVER.
FIGURE 4.2 PSR DRIVER CLOCKING A BANK OF N TSPC LATCHES.
FIGURE 4.3 EXPLICIT-PULSED FLIP-FLOP EPDCO
FIGURE 4.4 EPTSPC DRIVEN BY PSR.
FIGURE 4.5 DUAL EDGE TRIGGERED TSPC BASED FLIP FLOP (DETSPC).
FIGURE 4.6 DETSPC VS. EPTSPC DET FOR NEGATIVE SETUP.
FIGURE 4.7 MONTE CARLO SIMULATIONS OF GSR WITH PREDRIVER.
FIGURE 5.1 SIMULATED OUTPUT VOLTAGE WAVEFORM ON A 20PF LOAD CAPACITOR (VC).
FIGURE 6.1 CMOS IMPLEMENTATION OF RESONANT DYNAMIC LOGIC (RDL).
FIGURE 6.2 TIMING SIGNALS DERIVED FROM CLOCK SUPPORTING ENERGY RECOVERY SWITCHING.
FIGURE 6.3 OPERATION AT 1.8V SUPPLY AND 0.5GHZ.
FIGURE 7.1 DISTRIBUTED CLOCK TREE DRIVING 1024 FLIP-FLOPS.
FIGURE 7.2 LAYOUT FLOOR PLAN FOR COMPARING PSR AND NR CLOCKING.
FIGURE 7.3 GSR DISTRIBUTED AT FAR-END FOR HIGHEST Q AND MINIMUM POWER.
FIGURE 8.1 H-TREE ENERGY PER CYCLE WITH VOLTAGE SCALING AT 500MHZ.
FIGURE 8.2 DELAY VARIATIONS WITH SUPPLY VOLTAGE.
FIGURE 8.3 SKEW VARIATION WITH SUPPLY VOLTAGE.
FIGURE 8.4 DERIVING E-D PRODUCT CURVE.
FIGURE 8.5 E-D PRODUCT FOR NR, CPR AND GSR.
FIGURE 8.6 PARETO GRAPHS FOR ENERGY VS. DELAY.
FIGURE 9.1 TYPICAL ARCHITECTURE OF CDN.
FIGURE 9.2 BOTTOM-UP TIMING ERROR SOURCES.
FIGURE 9.3 IBM ISPD2010 SKEW GENERATION BENCHMARK.
FIGURE 9.4 GENERALIZED STATISTICAL TIMING SLACK CALCULATIONS.
FIGURE 9.5 PSR VS. NR WITH SAME TDCQ
FIGURE 9.6 POWER SAVINGS OVER DVFS RANGE.
FIGURE 9.7 PVT AND MC SKEW SIMULATIONS COMPARING PSR AND NR H-TREES.
FIGURE 9.8 POWER SAVINGS AND ENERGY.
FIGURE 9.9 PVT AND MC SKEW SIMULATIONS SHOWING PSR ADVANTAGE.
FIGURE 9.10 POWER SAVINGS OVER 10× CLOCKING FREQUENCY RANGE IN 45NM.
FIGURE 9.11 VARIATIONS IN THE DELAY CONTRIBUTING TO CLOCK SKEW.




FIGURE 9.13 SIMULATED SKEWS OF H-TREE ACROSS OPERATING FREQUENCIES.
FIGURE 9.14 GSR POWER SAVINGS COMPARED TO NR.
FIGURE 10.1 STANDARD IC DESIGN TOP DOWN FLOW.






Power reductions with energy recovery using resonant topologies 
 
Ignatius Bezzam 
Department of Electrical Engineering 
Santa Clara University  





The problem of power densities in system-on-chips (SoCs) and processors has
been exacerbated recently, resulting in high cooling costs and reliability
issues. One of the largest components of power consumption is the low-skew clock
distribution network (CDN), which drives a large load capacitance. It can consume as
much as 70% of the total dynamic power, which is lost as heat and requires elaborate
sensing and cooling mechanisms. To mitigate this, resonant clocking has been utilized in
several applications over the past decade. An improved energy recovering 
reconfigurable generalized series resonance (GSR) solution with all the critical 
support circuitry is developed in this work. This LC resonant clock driver is shown to 
save about 50% driver power (>40% overall), on a 22nm process node and has 50% 
less skew than a non-resonant driver at 2GHz. It can operate down to 0.2GHz to 
support other energy savings techniques like dynamic voltage and frequency scaling 
(DVFS).  
As an example, GSR can be configured for the simpler pulsed series resonance
(PSR) operation to enable further power savings in double data rate (DDR)
applications, by using de-skewing latches instead of flip-flop banks. A PSR based




is demonstrated. This new resonant driver generates tracking pulses at each transition 
of clock for dual edge operation across DVFS. PSR clocking is designed to drive 
explicit-pulsed latches with negative setup time. Simulations using 45nm IBM/PTM
device and interconnect technology models, clocking 1024 flip-flops, show the
reductions compared to non-resonant clocking. A DVFS range from 2GHz/1.3V to
200MHz/0.5V is obtained. The PSR frequency is set >3× the clock rate, needing only
1/10th the inductance of prior-art LC resonance schemes. The skew reductions are
achieved without needing to increase the interconnect widths, owing to negative setup
times.
Applications in data circuits are also shown with a 90nm example. Parallel
resonant and split-driver non-resonant configurations are likewise derived from GSR.
Tradeoffs in timing performance versus power, based on theoretical analysis, are 
compared for the first time and verified. This enables synthesis of an optimal topology 







There are fundamental electrical engineering principles underlying the severe 
problem of managing the power that produces heat dissipation in SoCs and processors 
operating at GHz clock rates. A literature survey of the current state of the art in
addressing this problem shows the limitations of the various solutions available now. The
energy usage from a top down and bottom up perspective is examined in order to 
understand the metrics to be maintained while power is reduced.  
1.1    Motivation for Wide Frequency Energy Recovery 
Laptops can no longer be operated on laps because of the intense heat
generated. To solve the same issue on a larger scale, cooling costs on the order of
$50 billion/year are needed for small businesses alone. Businesses use farms of
workstations for computing, which are made of ICs. As shown in Figure 1.1, these
costs are quickly outpacing the cost of hardware because ICs consume hundreds of
watts of power, and the resulting power densities of microchips create thermal
problems. There is an increase in both the power consumed per transistor and the
number of transistors on a single IC die.
Silicon chips built in deep sub-micron (DSM) nanometer-scale processes can now
reach the temperature of a rocket nozzle. They may soon have spots as hot as the
surface of the sun. To handle this and the consequent reliability concerns, elaborate
sensing and thermal management are required. Thus, power consumption is a key
issue in high performance systems based on processors (CPUs and GPUs), as they
consume hundreds of watts as shown in Figure 1.1. Higher IC power results in



















Figure 1.1 System expenses and root causes 
(courtesy Dr. T. Raja, NVIDIA Corporation, Lecture on Low Power Design). 




Thus, there is an urgent need for low power techniques for the following reasons, 
a) Increase battery lifetime and/or decrease number of solar cells 
b) Reduce Cooling Fixtures, Form factor and Costs 
c) Increase Reliability & Sustainability  
VLSI circuits operating in the GHz range typically have switching power dissipation
much larger than leakage losses. For example, high end GPUs can take over
300 watts. To meet stringent skew requirements (<8ps across a 64mm² chip from AMD
shown in [11]), synchronous clocking alone can take 24%-70% of the power in designs
ranging from processors to SoCs.
In clock power reduction, DVFS is a very important technique in runtime 
power management, and is extensively used by high performance processors.  Here 
part or all of the IC is dynamically scaled to run at the minimum frequency needed 
and the supply voltage scaled to the minimum needed to support the minimum 
frequency. All other energy recovery techniques need to incorporate wide frequency 
operation supporting DVFS. Prior-art resonant solutions inherently do not do that.  
Recovering continuous switching energy from clocking that is spread all over 
the chip not only saves power but can also eliminate cooling costs. An all-important 
performance metric to be maintained while achieving power reduction is the timing 
closure that involves a host of specifications like skew, jitter, delay variations etc. 
Some of the resonant schemes deteriorate these while achieving power savings and 
this may not be acceptable. So an additional requirement on any new resonant 
solution is to achieve lower skew while reducing power, especially at higher 
frequencies. 
Thus, the aim of this work is to arrive at energy recovering resonant 




performance in terms of lower skew and jitter for timing closure. This 
dissertation examines solutions to various limitations in using prior-art resonant clock 
drivers and the best way to use their energy recycling feature over a wide frequency 
range. A novel reconfigurable scheme called generalized series resonance (GSR) is 
proposed. This can be dynamically programmed into various series or parallel 
resonance modes of operation for optimal trade-off. Closed-form design equations 
determining the power consumption improvements are arrived at while analyzing the 
timing performance at the clock sink points to enable automatic design synthesis. 
Special flip-flops for ultra-low energy applications usually need to be 
designed to work with low amplitude signals from global clock grids from resonant 
clocks. The design of these is described here for applications where it is demonstrated 
to save further power. It is also desirable to have the new resonant schemes be able to 
directly drive standard flip-flops and gates to fit the standard design flow. Resonant 
techniques that can be used in the data path are also desirable to recover more energy, 
over and above the clock power reductions. 
The new LC resonance operation proposed in this work is engaged only for 
the rise and fall transitions, rather than the entire clock period, and thus is not tied to 
one clock frequency. Energy recovery is then achieved over a much wider frequency 
range enabling DVFS. Run time optimization of the operation, through pulse width
control, results in more savings of the clock power. CDN savings can total several
watts of power in current DSM processors, SoCs and ASICs.
1.2 Literature review 
Power dissipation considerations continue to dictate the use of multi-core 
architecture in processors and SoCs in technologies beyond 45nm [1], [2].  A full chip 




take 25% of total power in processors and sometimes as much as 70% in SoCs [3]. 
Transistor scaling using ‘More of Moore’s law’ reduces area and gives faster 
transistors. Power densities are significantly higher when the constant voltage scaling 
method is used [4]. Due to the cooling costs needed to contain the large power 
densities, there has been an abrupt halt in the clock frequency increase even though 
the transistors themselves can switch much faster [5]. This calls for improvements 
beyond Moore’s law scaling. 
The so-called ‘More than Moore’ solutions [6] can be applied to this dilemma
to reduce power using MEMS/NEMS resonators [7]. These technologies are not yet
mainstream and involve additional costs. Architectural choices, like the use of
multi-cores, give higher throughput using lower clock frequencies, resulting in lower
power densities [4].
A low-cost approach is to use passive components such as metal spiral inductors,
already available in standard process technologies [8], to consume less power in
switching. Even in multi-core processors, total energy can be further reduced by using
inductors. The energy used to charge the clock grid node capacitance (C) each period 
can be recovered and reused with an integrated inductor (L) in parallel, forming a 
resonant tank network [3]. The recovered energy would have been otherwise 
dissipated as heat. LC resonant circuit operation for reducing power consumption in 
high speed clocking applications has been extensively reported [11]–[14]. Since only 
losses need to be overcome at resonance, after the initial start-up, additional power 
savings can be realized by reducing the strength of clock buffers driving the LC load. 
Such recovery techniques are currently used in nanometer commercial processors for 
global clocking [11]. Even in multi-core processors, total energy can be further 




commercially viable on standard CMOS technology for reducing power consumption 
in the Clock Distribution Network (CDN) by energy recovery and reuse [1]. These 
continuous parallel resonance (CPR) solutions save 25% power or more, albeit over a
narrow range of clock speeds.
Integrated inductor based tuned circuits have long been used for efficient 
power transfer in small signal radio frequency (RF) amplifiers [15]. They are 
extensively used in integrated DC-DC converters at large voltages and currents, albeit 
at low frequencies [16], [17]. These inductors are now well characterized for 
commercial use. Their use for clocking presents unique challenges, as operation with 
large signals and at very high (GHz) frequencies is needed [3], [18]. 
In order to reduce power as much as possible, modern high performance
mobile designs are also using an increasing number of voltage domains and regional
clock trees [18]. Thus, it is beneficial to extend resonant solutions from global to
regional clocking, as shown in Figure 1.2 [19]-[22]. However, the smaller capacitance
values of local trees dictate larger inductance values for the same LC
resonance frequency [23].
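The inductance scaling noted above follows from the standard LC resonance relation f = 1/(2π√(LC)). The following is a minimal sketch of that relation; the function name, the 1 GHz target and the two load capacitances are illustrative assumptions, not values taken from this work.

```python
import math

def inductance_for_resonance(f_res_hz, c_farads):
    """Inductance (henries) that resonates with c_farads at f_res_hz.

    Rearranged from f = 1 / (2*pi*sqrt(L*C)), giving L = 1 / ((2*pi*f)^2 * C).
    """
    return 1.0 / ((2.0 * math.pi * f_res_hz) ** 2 * c_farads)

# Hypothetical loads: a large global grid versus a much smaller regional tree.
f_res = 1e9                                             # assumed 1 GHz resonance target
l_global = inductance_for_resonance(f_res, 1e-9)        # assumed 1 nF global load
l_regional = inductance_for_resonance(f_res, 20e-12)    # assumed 20 pF regional load

print(f"1 nF load:  L = {l_global * 1e12:.1f} pH")
print(f"20 pF load: L = {l_regional * 1e9:.2f} nH")

# For a fixed capacitance, tripling the resonance frequency divides L by 9.
ratio = inductance_for_resonance(3 * f_res, 20e-12) / l_regional
print(f"L(3f) / L(f) = {ratio:.4f}")
```

Because L is inversely proportional to C at a fixed frequency, a 50× smaller load needs a 50× larger inductor, which is why regional clocking is harder to resonate with on-chip spirals. The same relation shows that raising the resonance frequency reduces the required inductance quadratically.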
Resonant clock solutions extending the operating frequency range for DVFS
have been reported [1], [7], [23]-[26]. The parallel resonance structure, described in
Chapter 2, can switch in multiple inductors for different frequency ranges, as shown in
[1], [25].
Chapter 3 describes a series resonance topology that inherently gives wide
frequency operation [23], [24], [26]. The pulsed mode resonance described in [23], [24]
uses special latches to achieve the best savings of power and area. Series resonance driver



However, the supporting control signals need special circuits to generate them, which 
have not been published in detail. This thesis describes them in detail in Chapter 4. 
 
Figure 1.2  A Comprehensive Clock Distribution and Data Capture Scheme. 
A silicon validation of a simplified series resonance called Intermittent
Resonance (IR) is described in [24] and shows a promising future for this work, which
has not yet been realized in silicon.
1.3 Top-down Clock Distribution 
Figure 1.2 shows the integration of resonant and non-resonant clock drivers at 
various levels of CDN, which will be treated in detail in later chapters. The numerous 
active and distributed passive components involved are detailed later. Figure 1.2 is an 




be dissipated in the local buffer stages driving the flip-flops [21]. From a high level 
perspective, for real life clocking applications in high speed computing and 
communication, timing closure is of utmost importance for functionality, 
performance, and yield [14], [18]. Lowering power at the expense of timing 
parameters like insertion delay variations, slew rates, skew and jitter may not be 
acceptable [3], [18].  
 
Figure 1.3  Dynamic Voltage and Frequency Scaling. 
(courtesy Dr. T. Raja, NVIDIA Corporation, Lecture on Low Power Design) 
 
Another important system level requirement is the ability to operate the same 
chip at different frequencies in different parts as shown in Figure 1.3. At a system 
level the strategy to minimize power is to operate only as fast as necessary and at the 
lowest voltage supporting that clock speed. 
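The strategy above can be sketched as a simple table lookup. This is only an illustration: the function name and the two intermediate operating points are hypothetical, while the end points reuse the 200MHz/0.5V and 2GHz/1.3V DVFS limits quoted in the abstract.

```python
# Hypothetical DVFS table: (maximum clock frequency in Hz, supply voltage in V).
# Only the first and last entries correspond to the DVFS range quoted in the
# abstract; the middle entries are made up for illustration.
OPERATING_POINTS = [
    (200e6, 0.5),
    (500e6, 0.7),
    (1e9, 0.9),
    (2e9, 1.3),
]

def pick_operating_point(required_hz, points=OPERATING_POINTS):
    """Return the slowest (frequency, voltage) pair that still meets the demand."""
    for f_max, v_dd in sorted(points):
        if f_max >= required_hz:
            return f_max, v_dd
    raise ValueError("required frequency exceeds the fastest operating point")

# A workload that needs 300 MHz runs at the 500 MHz point, not at full speed.
print(pick_operating_point(300e6))
```

Choosing the slowest sufficient point matters because the supply voltage enters the dynamic power quadratically, so a lower voltage saves more than the frequency reduction alone.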
As will be seen, a sizable portion of the dynamic power is taken up by the 
clock distribution itself, to maintain the synchronous nature of the system. 
1.4 Bottom-up View of Non-Resonant (NR) Digital Circuits 
The root cause of power wasted in digital circuits and the reason for the
runaway in thermal issues are now examined. As a baseline for power calculations and
timing performance, equations for known drivers are considered first [27], [28].




large capacitive load CL. Various parasitic resistors and lumped interconnect parasitics 
that can affect the slew rates and delays are shown. Switch parasitic capacitances are 
neglected compared to CL. The output is near 50% duty cycle though the input pulses are not.
 
Figure 1.4 Clock Driver Topology for NR. 
The split pull up and pull down scheme in Figure 1.4 minimizes the short 
circuit currents and thus consumes minimum dynamic power [29], [30]. This is at the 
expense of more circuit area, which is an acceptable tradeoff in DSM regime. The 
actual width of the pulse is not critical as long as a minimum duty cycle is maintained 
across operation [21], [31]. Smaller pulse widths cause less static leakage power. The 
output voltage VC, when falling from VDD to 0, is given by [28],  
$V_C(t) = V_{DD}\, e^{-t/((R_d+R_w)C_L)}$ (1.1)
The corresponding capacitor discharge current flowing through the interconnect is
$i_C(t) = C_L \frac{dV_C}{dt} = -\frac{V_{DD}}{R_d+R_w}\, e^{-t/((R_d+R_w)C_L)}$ (1.2)
If the clock period TCLK is sufficiently large to accommodate the transit times, 
the output capacitor voltage VC swings rail to rail (0 to VDD). Energy in a cycle can be 
derived as $C_L V_{DD}^2$ by integrating the instantaneous power ($V_C(t) \times i_C(t)$) over one 
period TCLK [27]. Then EVDD, the energy drawn from the supply per cycle, is $C_L V_{DD}^2$. 
Similarly, EC, the energy stored in the capacitor, can be derived as $C_L V_{DD}^2/2$ [27]. EC is 
also the energy dissipated in the pull-down resistor. For large values of interconnect Rw, 
the output may not swing rail to rail within TCLK. In that case, the actual logic high 
VOH and logic low VOL values can be used, giving a more generalized equation [27] 
for average power at a clock frequency fCLK (= 1/TCLK) as, 
$P_{avg} = V_{DD}(V_{OH} - V_{OL})\, C_L f_{CLK}$ (1.3)
For rail-to-rail operation, the equation for NR operation, valid for all 
frequencies, is more commonly written as 
$P_{NR} = C_L V_{DD}^2 f_{CLK}$ (1.4)
Equation (1.4) is used as a base-line for comparison with other driver 
schemes. The output is a square wave and does not need special amplifiers to drive 
flip flops or logic, but may use local clock buffers shown in Figure 1.2. NR supports 
dynamic voltage and frequency scaling (DVFS) below the maximum operating 
frequency that the process technology is capable of.  
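As a quick numerical check of (1.3) and (1.4), the following minimal sketch evaluates the NR driver power. The function names and the supply, load, clock and reduced-swing values are illustrative assumptions, not taken from the thesis simulations.

```python
def nr_average_power(v_dd, v_oh, v_ol, c_load, f_clk):
    """Average NR driver power from (1.3): VDD * (VOH - VOL) * CL * fCLK."""
    return v_dd * (v_oh - v_ol) * c_load * f_clk


def nr_rail_to_rail_power(v_dd, c_load, f_clk):
    """Rail-to-rail special case (1.4), i.e. VOH = VDD and VOL = 0."""
    return nr_average_power(v_dd, v_dd, 0.0, c_load, f_clk)


# Illustrative values only: 1 V supply, 1 nF load, 1 GHz clock.
p_full_swing = nr_rail_to_rail_power(1.0, 1e-9, 1e9)

# Hypothetical reduced swing (VOH = 0.9 V, VOL = 0.1 V) on the same load.
p_reduced_swing = nr_average_power(1.0, 0.9, 0.1, 1e-9, 1e9)

print(f"rail-to-rail: {p_full_swing:.3f} W, reduced swing: {p_reduced_swing:.3f} W")
```

The rail-to-rail case reduces to $C_L V_{DD}^2 f_{CLK}$, which is the baseline that the resonant schemes are compared against.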
Using (1.4) at 1GHz clock rate, to achieve even a 1V swing on a 1nF 
capacitor, it takes at least 1W of power [26]. An LC resonant global CDN from IBM 
driving a large load (~6nF) at 4 GHz is integrated in the processor described in [14]. 
Full functionality over a 20% range in clock frequencies was demonstrated, while 




grid solution from AMD that saves 25% of the clock distribution power of another 
high performance processor was reported in [11]. 
 For a load capacitor CL, the total power dissipation is the frequency f times $C_L V_{DD}^2$ [6]. 
In these resonance schemes, for a given choice of inductor value L, the operating 
clock range is restricted around the resonance frequency $f = 1/(2\pi\sqrt{L C_L})$. The solution is 
thus tied to one operating clock frequency. It does not maintain the power savings 
across dynamic voltage and frequency scaling (DVFS).   
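To illustrate why a fixed inductor ties such a scheme to one clock frequency, the sketch below inverts the resonance relation $f = 1/(2\pi\sqrt{L C_L})$ for the inductance. The function names and the 1 nF / 1 GHz operating point are assumptions chosen only for illustration.

```python
import math


def resonance_frequency(l_henry, c_farad):
    """LC resonance frequency f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


def inductance_for_resonance(f_hz, c_farad):
    """Inductance that resonates with c_farad at f_hz (inverse of the above)."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * c_farad)


# Illustrative load and clock: 1 nF at 1 GHz needs roughly 25 pH.
l_needed = inductance_for_resonance(1e9, 1e-9)
f_check = resonance_frequency(l_needed, 1e-9)

# Halving the clock frequency with the same load needs 4x the inductance.
l_half_rate = inductance_for_resonance(0.5e9, 1e-9)

print(f"L at 1 GHz: {l_needed * 1e12:.2f} pH, L at 0.5 GHz: {l_half_rate * 1e12:.2f} pH")
```

Because the required inductance scales as $1/f^2$, scaling the clock under DVFS moves the operating point away from the resonance set by a fixed inductor.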
1.5  Organization of the thesis 
The thesis is organized as follows. In Chapter 2, prior-art low power design 
techniques through energy reuse are formulated, for base line comparisons. Series 
resonance is examined in Chapter 3, as opposed to more commonly used parallel 
resonance. The simpler pulsed series resonance (PSR) is detailed first. Simulation 
results in 45nm CMOS process for clocking operation are shown. Chapter 3 
introduces GSR, derived from PSR, as a general purpose solution that can be 
configured to all other solutions. Simulations validating the design on a 22nm process 
technology are shown.  
In Chapter 4, the support circuitry needed for top-down implementation of the 
clocking schemes using PSR is reviewed. The power losses from the support 
circuitry and receiving processor units are factored in to understand the true overall 
savings. Previous energy recovery flip-flops are reviewed and true single phase 
clocking (TSPC) is selected. Circuitry for adaptive pulse generation on both edges of 
the incoming clock is described. Design of critical circuits needed for the GSR 




Chapter 5 derives timing performance of all drivers. This thesis does a 
comparative tradeoff analysis of series, parallel and non-resonant topologies for the 
first time. The implementation details of resonant circuits in deep submicron nodes 
(DSM) can have implications on area and timing performance.  
Chapter 6 shows how the GSR principle can also be extended to data 
processing circuits using domino-style dynamic logic family with pre-charge 
mechanisms. Simulation results in 90nm illustrate the power savings achieved by 
these specialized circuits. 
Chapter 7 estimates the active and metal area required by various solutions 
and other costs of fabrication. No additional area is needed by PSR for dual edge data 
capture. Complete layout and parasitics are estimated for a 45nm process as an 
example. Chapter 8 looks at the Power, Performance and Area (PPA) together. The 
tradeoffs between these for various resonant clocking schemes are discussed. 
Theoretical performance and power relations of various resonant and non-resonant 
topologies that can be configured from GSR are tabulated. 
Chapter 9 shows system integration of PSR clock generation driving 1024 
flip-flops through an H-tree. A high performance processor benchmark from the ISPD 2010 
clock synthesis contest, drawn from IBM and Intel, in 45nm [32], is used as a test case 
to demonstrate power reductions. A complete clocking solution with PSR, to 
minimize power of regional clocks for leaf cells in high performance multi-GHz 
designs is shown. This novel resonant driver generates pulses at both edges of the 
square clock for operation in the dual edge mode. Details of simulation results in 
45nm CMOS process for clocking and flip-flops are compared. Skew comparison 
between various schemes shows the advantages of GSR in the performance/price 




Chapter 10 discusses a new design flow to incorporate GSR as part of clock 
tree synthesis to save power and minimize inductance while meeting the timing 
closure goals. Chapter 11 concludes the thesis with guidelines for extension of this 
work into the future.  
The appendix includes the MATLAB codes for verifying the mathematical 
derivations used in the chapters. The transistor level schematics of all the circuits 





2  LOW POWER DESIGN THROUGH ENERGY REUSE 
One way to reuse the energy $C_L V_{DD}^2/2$ stored on the capacitor, which is otherwise wasted 
as heat during discharge, is to store it on another storage component and 
recover it. However, the charging process itself takes $C_L V_{DD}^2/2$ energy, as seen in 
Chapter 1, so a better means of transfer must be used. One alternative is the so-called 
adiabatic charging using a time-varying supply voltage. Another method is to 
use an inductor to transfer the charge onto a capacitor and recover it. Both are 
explored here. 
2.1 Adiabatic Circuits with energy recovery 
The set of circuit design techniques targeted at the implementation of 
computations with minimal (asymptotically zero) power consumption during charge 
transfer is generally known as adiabatic switching or adiabatic charging. The use of 
the word adiabatic is suggestive of the thermodynamic principle of state change with 
no loss or gain of heat. The principle of adiabatic switching can be best explained by 
contrasting it with conventional dissipative switching.  
Figure 2.1(a) shows how energy is dissipated during a switching transition in 
static CMOS circuits by conventional charging. The transition of a circuit node from 
LOW to HIGH can be modeled as charging an RC tree through a switch, where C is 
the capacitance of the node and R is the resistance of the switch and interconnect. 
When the switch is closed, a high voltage (VDD) is applied across R and current starts 
flowing suddenly through R. After a short period of time, C is charged to a constant 
supply voltage VDD. The energy taken from the power supply is $C V_{DD}^2$, but only half of 
that, $C V_{DD}^2/2$,






Figure 2.1 Losses in Conventional vs. Adiabatic charging.  
Now, consider the circuit and current waveform for adiabatic charging shown 
in Figure 2.1(b). Notice that, in contrast to conventional charging, the transition has 
been slowed down by using a time varying voltage source (VPC) instead of a fixed 
supply. By spreading out the charge transfer more evenly over the entire time 
available, peak current is greatly reduced. The overall energy dissipated ER in the 
transition has been shown to have a proportional relationship [9], 
$E_R \propto (RC/T_S)\, C V_{DD}^2$ (2.1)
where R is the effective resistance of the driver device, C is the capacitance to 
be switched, TS is the time over which the switching occurs, and VDD is the voltage to 
be switched across. The constant of proportionality is related to the exact shape of the 
time-varying voltage source waveform and can be calculated by direct integration. 
Ideally, by increasing the time TS over which computation is performed, it 
should be possible to create a circuit which computes with vanishingly low energy 




the field as “asymptotically zero energy consumption,” practical circuit 
implementations of these logic elements have been demonstrated [9]. These circuits 
achieve low, but nonzero, dissipation for computations performed over fixed amounts 
of time.  
Because some of the energy in these circuits (in the form of charge stored on 
capacitances) was being recovered instead of dissipated, the terms charge recovery or 
energy recycling began to be used to describe these circuits. Broadly speaking, the 
term charge recovery is nowadays used to describe systems that reclaim some 
of the $C_L V_{DD}^2/2$ energy that is stored in their capacitors during a computation and 
reuse it in subsequent computations. 
It should be observed that whenever current experiences a voltage drop V, 
energy is dissipated at the rate of i×V (instantaneous dissipative power), where i is 
the current. Such energy dissipation can be greatly minimized by deploying adiabatic 
switching described, where the supply swings gradually from 0 to VDD. There is little 
voltage drop across the channel of PMOS/NMOS transistor, and hence only a small 
amount of energy is dissipated. Using the simple model of (2.1) to estimate the power 
dissipation [10], with RC < 1ns for a moderate fan-out, a switching time of 
TS ≈ 1/fCLK and an operating clock frequency fCLK ≈ 10 MHz, ER is reduced to a 
very small value of nearly 1/50th of conventional switching. At higher frequencies, 
of course, the savings are smaller. 
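The 1/50 figure can be reproduced from (2.1) with a short calculation. This is a minimal sketch that assumes a proportionality constant of 1, which corresponds to a slow linear voltage ramp (TS much larger than RC), and compares against the $C V_{DD}^2/2$ dissipated per conventional transition. The function names and the 1 pF node capacitance are hypothetical.

```python
def conventional_loss(c, v_dd):
    """Energy dissipated per transition with a fixed supply: C * VDD^2 / 2."""
    return 0.5 * c * v_dd ** 2


def adiabatic_loss(r, c, t_s, v_dd, k=1.0):
    """Energy dissipated per transition from (2.1).

    k is the waveform-dependent proportionality constant; k = 1 is assumed
    here, as for a linear ramp with t_s much larger than r * c.
    """
    return k * (r * c / t_s) * c * v_dd ** 2


# Values quoted in the text: RC = 1 ns, fCLK = 10 MHz, TS = 1 / fCLK.
rc = 1e-9
f_clk = 10e6
t_s = 1.0 / f_clk

c = 1e-12          # hypothetical 1 pF node capacitance
r = rc / c         # resistance chosen so that r * c equals the quoted RC
ratio = adiabatic_loss(r, c, t_s, 1.0) / conventional_loss(c, 1.0)

print(f"adiabatic / conventional loss = {ratio:.3f}")
```

The ratio depends only on RC/TS, so raising the clock frequency by 10x shrinks the savings by the same factor under this model.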
Thus, adiabatic charge recovery techniques are very useful in the lower 
frequency range, like in battery powered systems that need to minimize energy drain. 
But for clock operation in the GHz range, where the severe heat dissipation occurs in 




2.2 LC Resonance Energy Reuse 
In this work, the conventional LC resonant solutions are termed as CPR 
solutions since the resonating inductor and capacitor are connected to each other 
continuously in parallel. A pulsed mode resonant driver is used for driving pulsed 
flip-flops that can save area and energy. 
Figure 2.2 shows the topological comparisons between a non-resonant driver 
(NR), CPR driver and the new pulsed series resonance driver (PSR). The resonant 
clocking technique based on Fig 2.2(b) is currently the most commercially viable as it 
requires minimum change from conventional clock design [14].  
The global clock tree can be modified to enable resonant (sinusoidal) clocking 
with an additional metal layer added on top of the conventional tree to attach the 
inductors and decoupling capacitors [14].  
 







2.2.1 Continuous Parallel Resonance Driver (CPR) with Bias Supply 
Another way to minimize the capacitor energy discarded is through LC 
resonance. An inductor placed in parallel with the load capacitor minimizes the 
effective capacitance load at resonance frequency of the LC tank formed and can thus 
reduce the switching energy [22]. Figure 2.3(a) shows a simplified continuous 
parallel resonant driver (CPR) using an extra VDD/2 power supply for the inductor. 
Here the inductor is always connected to load capacitance and the output is a 
sinusoidal waveform with peak-to-peak reaching twice the bias supply VDD/2. In 
Figure 2.3(b) when the switch Sd is open, it reduces to a parallel RLC tank with Q 
given by inductor QL at resonance frequency. No PMOS pull up is needed as LC tank 
in resonance will swing twice the inductor voltage VDD/2 to give output high of VDD. 
This scheme has been shown capable of driving the entire clock network of a low 
power ARM processor [19].  
 
Figure 2.3 Clock Driver Topology for Continuous Parallel Resonance (CPR). 
As seen in Figure 2.3(b), a parallel RLC network is formed when the grounding 




inductor ($R_P = 2\pi f L_p Q_L$). The combined quality factor Q of the tank is determined by 
the parasitic resistance of the inductor and the equivalent series resistance ESR (rESR) 
of the capacitance CL, if significant [20], [22]. The ESR is ignored here with respect 
to RP, allowing the overall tank Q to be approximated as QL.  
 The general solution for the parallel RLC network is obtained from circuit 









 = 0. (2.2) 













.   (2.3) 
The initial conditions assumed are VC(0) = 0, with the corresponding initial 
current in the capacitor $C_L\, dV_C/dt = V_{DD}/(2R_p)$. Solving using these initial values for 














sin(2𝜋𝑓𝑅𝑡)]    
(2.4) 









) and tank Q = 
$R_p/\sqrt{L_p/C_L}$. This is also called the underdamped case of (2.3) with complex conjugate 
roots, when $L_p < 4R_p^2 C_L$. As can easily be seen, Q > 0.5 guarantees this 
condition of underdamped oscillation.  At higher values of Q (> ), fR is the well-




$2\pi f_{RES} R_p C_L$. Ignoring the ESR of the capacitor, the tank Q can be approximated as the 
inductor component $Q_L = R_P/(2\pi f L_p)$. 
At high enough frequencies (fR >> 1/RpCL), the capacitor voltage from (2.4) in 







cos(2𝜋𝑓𝑅𝐸𝑆𝑡).  (2.5) 
Since the average DC value is VDD/2 on both sides of the resistor Rp, the 
effective DC power is zero. Thus the CPR power is calculated using power consumed 
in Rp by the sinusoidal component in (2.5).  The average power over one clock period 
TRES can be obtained from (2.5) as $0.5V_{DD}^2/(4R_p)$. With $R_P = 2\pi f L_p Q = Q/(2\pi f_{RES} C_L)$ at 
resonance, this gives
$P_{CPR} = (\pi/4Q)\, C_L V_{DD}^2 f_{RES}$ (2.6)
Even for a low Q value of π, CPR power is only a quarter of NR power from 
(2.6). Note that (2.6) as derived is valid only at the resonance frequency of operation, 
when $T_{CLK} = T_{RES} = 2\pi\sqrt{L_p C_L}$. For DVFS applications, it is necessary to know how 
far the operation can be stretched. At clock frequencies above resonance (TCLK <TRES), 
only a portion of the sinusoid in (2.5) is captured. At frequencies below resonance 
(TCLK > TRES) more than one cycle of this sinusoid is captured. The voltage at end of 










) . (2.7) 
This evaluates to zero for TCLK = TRES and thus no extra power is consumed 
other than (2.6). For other frequencies, where output is still periodic and valid, the 
extra power, coming from discharging the energy stored in the capacitor at 
voltage 𝑉𝐶(𝑇𝐶𝐿𝐾), is 0.5𝐶𝐿𝑉𝐶




According to (2.7), for TCLK < 0.5TRES, less than half the resonant cycle will be 
captured, making the amplitude lower than VDD/2. Similarly, when TCLK > 1.25 TRES, 
the waveform will cross the midpoint VDD/2 and can cause an additional crossover. 
The corresponding DVFS frequency range for clock signals can thus be approximated 
to be from 0.8fRES to 2fRES. The average power for this range can again be obtained by 
integrating $V_C^2(t)/R_p$ from (2.4) over a clock period TCLK, giving a more general 



















At fCLK = fRES, (2.8) is the same as (2.6). More power is consumed when fCLK < fRES 
as well as when fCLK > fRES, forming a minimum at fRES. This behavior is later verified by 
simulations and also validated by several silicon realizations [11], [12]. The pulse 
width must be kept sufficiently wide to completely discharge the node through the 
switch Sd at the given frequency [33], [34]. This is an additional requirement 
compared to the input pulse stream of NR in Figure 1.4. 
Resonant solutions, with characteristic sine wave signals, were initially 
applied to lower speed systems. Special flip-flops for ultra-low energy applications 
were designed to work with these low amplitude signals from global clock grids [21]. 
These custom cells need to be incorporated into standard cell libraries for synthesis. 
2.2.2 Parallel Resonance with decoupling Capacitor 
The need to meet a high performance clock skew target necessitates the use of 
a mesh that connects all low skew sinks as shown in Figure 1.2.  The capacitance C of 




given by (1.4) as $P_{NR} = C V_{DD}^2 f_{CLK}$. This can be several tens of watts to meet the 
stringent skew requirements in high performance designs at GHz clock speeds. 
A different implementation of the Fig 2.2(b) CPR driver, with the inductor bias 
supply replaced by capacitors, is shown in Figure 2.4. The inductor bias voltage 
source is eliminated with the use of large capacitors, but 50% duty cycle inputs are 
needed and lower power savings are obtained. Within Figure 2.4 the parallel R, L and 
C network has a combined tank quality factor $Q = 2\pi f_{CLK} R_P C$ at resonance (i.e. 
$f_{CLK} = 1/(2\pi\sqrt{LC})$) with $R_P = Q/(2\pi f_{CLK} C)$. With the inductor biased at 0.5VDD and with 
VOUT(t) initially at VDD, the resonant clock output signal can be solved for, similar to 
(2.5), to give the equation, 
$V_{OUT}(t) = 0.5V_{DD} + 0.5V_{DD}\cos(2\pi f_{RES} t)$ (2.9)
 
Figure 2.4 Conventional Continuous LC Resonant Clocking Driver (CPR). 
Resonant clocks have also been synonymously referred to as sinusoidal clocks 
[11] due to the waveform from (2.9). The inductor current can be shown to be, 




where the peak current $I_o = 0.5V_{DD}/\sqrt{L/C}$.  
Power is dissipated only in the equivalent resistor RP. The DC component of 
power, from 0.5VDD across RP, is $V_{DD}^2/(4R_P)$. The AC power due to a sinusoidal 
component of 0.5VDD amplitude is $0.5(0.5V_{DD})^2/R_P$. Thus the total DC and AC power 
consumption is $1.5V_{DD}^2/(4R_P)$. Substituting $R_P = Q/(2\pi f_{RES} C)$, the power dissipation 
with decoupling capacitance can be expressed as, 
$P_{CPR\text{-}C} = 1.5 \times 2\pi f_{RES} C V_{DD}^2/(4Q) = (3\pi/4Q)\, C V_{DD}^2 f_{RES}$ (2.11)
This is assuming resonance at fCLK = fRES, resulting in pure sinusoidal outputs 
that would take minimum power. Q is the combined quality factor of inductor and 
load capacitor [24]. It accounts for the ESR of the capacitor and the DC resistance 
(DCR) of the inductor. Even for realizable low Q values like π, CPR power will only 
be ¾ of NR power for global CDNs. CPR-based global CDNs have been reported to 
yield 25% or more power reductions [11].  
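The two CPR power figures quoted in this chapter (one quarter of NR power with a bias supply, three quarters with decoupling capacitors, both at Q = π) can be checked directly from the resistor losses. The sketch below is illustrative only: the function names are hypothetical, and the bias-supply case assumes that only the sinusoidal component of amplitude 0.5VDD dissipates power in RP, as stated in Section 2.2.1.

```python
import math


def parallel_resistance(q, f_res, c):
    """Equivalent parallel tank resistance R_P = Q / (2*pi*f_RES*C)."""
    return q / (2.0 * math.pi * f_res * c)


def p_nr(c, v_dd, f):
    """Non-resonant baseline (1.4)."""
    return c * v_dd ** 2 * f


def p_cpr_bias(c, v_dd, f_res, q):
    """CPR with bias supply: only the AC term 0.5*(0.5*VDD)^2 / R_P."""
    r_p = parallel_resistance(q, f_res, c)
    return 0.5 * (0.5 * v_dd) ** 2 / r_p


def p_cpr_decoupled(c, v_dd, f_res, q):
    """CPR with decoupling capacitors: DC term plus AC term, as in (2.11)."""
    r_p = parallel_resistance(q, f_res, c)
    p_dc = (0.5 * v_dd) ** 2 / r_p
    p_ac = 0.5 * (0.5 * v_dd) ** 2 / r_p
    return p_dc + p_ac


# Illustrative operating point: 1 nF, 1 V, 1 GHz, Q = pi.
c, v_dd, f_res, q = 1e-9, 1.0, 1e9, math.pi
ratio_bias = p_cpr_bias(c, v_dd, f_res, q) / p_nr(c, v_dd, f_res)
ratio_decoupled = p_cpr_decoupled(c, v_dd, f_res, q) / p_nr(c, v_dd, f_res)

print(f"CPR/NR = {ratio_bias:.2f}, CPR-C/NR = {ratio_decoupled:.2f}")
```

Both ratios scale as 1/Q, so doubling the tank Q halves the resonant power in either configuration, provided the clock stays at resonance.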
The additional chip area occupied by the inductor may not be acceptable, 
especially for low load capacitance values of 1pF or less. As the resonance frequency 
is set by $f_{CLK} = 1/(2\pi\sqrt{LC})$, different inductor values are needed to operate at different 
frequencies. This makes it incompatible with DVFS unless the inductors are changed on 
the fly [1]. Moreover, at frequencies 2× lower than resonance, waveforms get warped 
and the skew suffers as well [24]. While the CPR can be disconnected at these 
frequencies, the power savings will not be available [11]. As Figure 2.4 shows, large 
decoupling capacitors are needed for CPR schemes to hold the VDD/2 center bias. This 
takes a couple of clock cycles before settling to the final value. Thus clock gating, to 
shut down switching power dynamically, is not possible with this scheme as th.. use e 




LC resonant circuit operation can reduce the buffer sizes as well. This reduces 
the total load capacitance and lowers the power further. Hence, in spite of several 
issues discussed above, CPR CDNs are attractive at global clock level. Usually, local 
gates and flip-flops in a sector are buffered by local clock buffers (LCB). The clock 
signal feeding the registers, as shown in the bottom of Figure 1.2, is a square (wave) 
clock. Inserting inverters in the clock path eliminates the energy recovery property. If 
the bulk of the CDN capacitance is in its leaves, then the largest power advantage will 
come by extending the resonance down to the flip-flops. In [21],[23], [24]  the clock 
buffers are removed to allow the clock energy to resonate between the inductor and 




3 SERIES RESONANCE FOR WIDE FREQUENCY CLOCKING 
This chapter arrives at the new configurable Generalized Series Resonance 
(GSR) and shows how various clock driver schemes to drive large capacitive loads 
can be derived from it. The theoretical tradeoffs between various resonance solutions 
are analyzed so that the optimum configuration may be selected for the given 
application.  
 
3.1 Pulsed Series Resonance (PSR) 
Another way to use an inductor to save energy stored on a large load 
capacitance is shown in the resonant topology of  
Figure 3.1(a), where the inductor is periodically connected to load capacitance with 
controlled input pulse width TPW. Output has a pulse of width TRES driving a higher 
capacitive load at resonance. For ideal inductor (QL >> 10), both input and output are 
from 0 to VDD. Figure 3.1 (b) shows series RLC model for analysis with bottom 
switch Sr closed and top switch Su open during time 0 to TPW. The implementation was 
presented in ISCAS2014 [23] and the theoretical analysis with performance trade-off 
equations is detailed here. Compared to CPR in Figure 2.3, the inductor is moved 
from the output to bottom of switch Sr. Controlled by the pulses of PLS_CLK signal, 
Sr closes when output needs to go low. The series inductor allows the energy stored 
on the load capacitor to be transferred to the VDD/2 node and then recovered back 
immediately to make the output go high. This creates a pulse of resonance period 
TRES. Energy can be recycled with the series LC resonant tank ($f_{RES} = 1/(2\pi\sqrt{L_S C_L})$) 
formed in Figure 3.1(b) when Sr is closed [23], [24]. Thus, the pull-up switch does not 
need to charge the output to VDD all the way from 0V. Such a pulsed series resonance 




The input stream PLS_CLK is required to have certain width (TPW), as shown 
in Figure 3.2(a), to generate a resonant pulse stream at the output [24]. Figure 3.2(b) 
shows the output timing waveforms for the PSR circuit. The energy recovery process 
is done through the inductor current in resonant mode. 
 
 
Figure 3.1 Pulsed Series Resonance (PSR) (a) Switching Circuit (b) Linear Model 
 





 When input signal PLS_CLK is high, the resonant tank is formed and when 
low, the driver is in non-resonant mode. Unlike in CPR, there is an extra requirement 
on keeping the incoming pulse width TPW exactly related to TRES, across all operating 
frequencies, for a given CL and LS. The resonance time is $T_{RES} = 2\pi\sqrt{L_S C_L} < T_{CLK}$. 
This inequality requirement, rather than equality in CPR, between CL, LS and TCLK 
values provides an extra degree of freedom. Several advantages result from this as 
described later in Chapter 8.1. 
When operating with narrow output pulses, TRES is always less than the period 
TCLK across DVFS. The PLS_CLK signal with required TPW can be derived from the 
regular clock using circuitry shown in Chapter 4. Analysis of  
Figure 3.1(b) is first done for a step input from the closing of the Sr (NMOS) switch.  
In Figure 3.1(b), the total resistance is the series combination RT = 
(Rr+RW+rS). Here $r_S = 2\pi f L_S/Q_L$ is from the finite QL of the inductor at frequency f, and 
can include the output impedance of the VDD/2 supply as well [23], [35]. The parasitic 
equivalent series resistance (ESR) of the load capacitance is ignored in this 
comparative analysis, but can be factored in as the component quality factor QC. Thus, 
the overall tank $Q = 2\pi f L_S/R_T$ is degraded, as RT is larger than rS.  
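The degradation of the tank Q by the switch and wire resistances can be illustrated numerically. In this minimal sketch the function names and all component values (1 nH inductor with QL = 10 at 1 GHz, 0.5 Ω switch, 0.2 Ω wire) are hypothetical and chosen only to show the effect.

```python
import math


def inductor_series_resistance(f, l_s, q_l):
    """Series loss resistance of the inductor: r_S = 2*pi*f*L_S / Q_L."""
    return 2.0 * math.pi * f * l_s / q_l


def series_tank_q(f, l_s, r_r, r_w, r_s):
    """Overall tank Q = 2*pi*f*L_S / R_T with R_T = R_r + R_W + r_S."""
    r_t = r_r + r_w + r_s
    return 2.0 * math.pi * f * l_s / r_t


# Hypothetical values for illustration only.
f = 1e9
l_s = 1e-9
q_l = 10.0

r_s = inductor_series_resistance(f, l_s, q_l)
q_tank = series_tank_q(f, l_s, r_r=0.5, r_w=0.2, r_s=r_s)

print(f"r_S = {r_s:.3f} ohm, tank Q = {q_tank:.2f} (inductor Q_L = {q_l})")
```

With these numbers the switch and wire roughly double the series resistance, so the tank Q falls to about half of the inductor's own QL.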











This leads to second order differential equation for inductor current iL (t) with 
initial condition iL (t) = 0 and  
𝑑𝑖𝐿
𝑑𝑡
















For the underdamped case having complex conjugate roots, the inductance needs 
to have a minimum value given by the condition $L_S > R_T^2 C_L/4$ [7], [24]. The solution for 






$e^{-tR_T/(2L_S)} \sin(2\pi f_R t)$ (3.3)
where the damped oscillation frequency fR is given by, 










$f_R = \frac{1}{2\pi}\sqrt{\frac{1}{L_S C_L} - \frac{R_T^2}{4L_S^2}} = f_{RES}\sqrt{1 - \frac{1}{4Q^2}}$ (3.4)
and the tank Q by $\sqrt{L_S/C_L}/R_T$. The current peaks lie between $\pm V_{DD}/(2\sqrt{L_S/C_L})$.  
Assuming 1/fR < TPW <TCLK, the capacitor output voltage can be 










sin(2𝜋𝑓𝑅𝑡)].   (3.5) 
For large tank Q values the two frequencies fR and 𝑓𝑅𝐸𝑆 can be taken as equal. 
At resonance, the RLC tank Q=2fRRTCL, is also large when underdamped case is 
met. The last term in (3.5) can also be neglected for large Q values. 
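How close the damped frequency fR is to fRES at on-chip Q values can be checked with a small helper. This sketch uses the relations given above, $Q = \sqrt{L_S/C_L}/R_T$ and $f_R = f_{RES}\sqrt{1 - 1/(4Q^2)}$, together with the underdamped condition $L_S > R_T^2 C_L/4$. The function name and the component values are hypothetical.

```python
import math


def series_tank(l_s, c_l, r_t):
    """Return (f_res, q, f_r) for a series RLC tank.

    f_r is None when the tank is not underdamped, i.e. when
    L_S <= R_T^2 * C_L / 4 (equivalently Q <= 0.5).
    """
    f_res = 1.0 / (2.0 * math.pi * math.sqrt(l_s * c_l))
    q = math.sqrt(l_s / c_l) / r_t
    if l_s > r_t ** 2 * c_l / 4.0:
        f_r = f_res * math.sqrt(1.0 - 1.0 / (4.0 * q ** 2))
    else:
        f_r = None
    return f_res, q, f_r


# Hypothetical tank: 1 nH, 25.33 pF (about 1 GHz), 2 ohm total resistance.
f_res, q, f_r = series_tank(1e-9, 25.33e-12, 2.0)
print(f"Q = {q:.2f}, f_R / f_RES = {f_r / f_res:.4f}")

# Ten times the resistance violates the underdamped condition.
_, q_lossy, f_r_lossy = series_tank(1e-9, 25.33e-12, 20.0)
print(f"Q = {q_lossy:.2f}, underdamped: {f_r_lossy is not None}")
```

Even at a Q near 3 the damped frequency is within about 1.3% of fRES, which supports treating the two as equal for moderate Q.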
Figure 3.2(a) shows the input pulse stream required at the clock frequency, with 
controlled pulse width TPW. Figure 3.2(b) shows the output pulse with a non-ideal inductor 
(QL < 10) when cycling through one clock period. The input pulse width TPW must be 
larger than the damped oscillation cycle TR. The voltage VC on the capacitor (QC > 30) does 
not swing rail-to-rail. Extra power is needed to restore VC to the VDD rail. If the width of 
input pulses (TPW) is sufficient to allow the inductor current waveform to go through a 
complete resonance cycle TR  = 1/fR, all the possible energy can be recovered. The 
output voltage rises to high by itself till a certain voltage recovery point, without 
drawing current from VDD power supply. The charging and discharging waveforms are 




The underdamped capacitor output will ring with minimum value at t = TR /2. 
The first maximum is at t = TR, giving rise to the waveform in Figure 3.2(b). 
Substituting from the RLC series resonance Q expression $R_T/L_S = 2\pi f/Q$, the first 




To reach 90% of VDD, as normally required, a Q ≥ 14 is needed. As this is 
generally too high to realize on chip, the output is pulled up to the rail using the Su 
(PMOS) switch, forcing the final VOH to VDD.  
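The Q requirement can be reproduced numerically. The sketch below assumes that the first recovery maximum has the form $V_{OH} = 0.5V_{DD}(1 + e^{-\pi/Q})$, which is consistent with the Q ≥ 14 figure quoted above; the function names are hypothetical.

```python
import math


def psr_recovered_high(q, v_dd=1.0):
    """Assumed first recovery maximum: 0.5 * VDD * (1 + exp(-pi / Q))."""
    return 0.5 * v_dd * (1.0 + math.exp(-math.pi / q))


def min_q_for_high(fraction_of_vdd):
    """Smallest Q whose recovery point reaches fraction_of_vdd * VDD.

    Valid for fractions strictly between 0.5 and 1.
    """
    return -math.pi / math.log(2.0 * fraction_of_vdd - 1.0)


q_for_90 = min_q_for_high(0.9)
v_oh_q4 = psr_recovered_high(4.0)

print(f"Q needed for 90% of VDD: {q_for_90:.1f}")
print(f"recovery point at Q = 4: {v_oh_q4:.2f} x VDD")
```

At an on-chip Q of about 4 the freewheeling recovery stops well short of 90% of VDD under this assumption, which is why the pull-up switch Su is needed to complete the swing.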
Similarly, the minimum voltage logic low VOL can be calculated from (3.5) at 
TR/2 as, 
$V_{OL} = 0.5V_{DD}(1 - e^{-\pi/(2Q)})$ (3.7)
To reach the standard 10% of VDD, a Q ≥ 7 is needed. This is less difficult to 
achieve than VOH requirement. Lower VOL can also be obtained by using an inductor 
bias lower than VDD/2. This will also change (3.1) and (3.5) giving a lower VOH than 
(3.6), but is taken care of by pull up switch Su. As shown in Figure 3.2(b), the highest 
voltage recovery point from freewheeling resonance oscillation is less than VDD. Thus 
power needed to pull it from this to full VDD swing on CL at frequency fCLK can be 
obtained similar to (1.3) as, 
$P_{PSR} = V_{DD}\,(V_{DD} - 0.5V_{DD}(1 + e^{-\pi/Q}))\, C_L f_{CLK}$
$\quad\;\; = 0.5\,(1 - e^{-\pi/Q})\, V_{DD}^2 C_L f_{CLK}$ (3.8)
This is valid for all frequencies where fCLK < fR and not just at resonance as in 
CPR. At Q = π, PSR takes about 1/3 of the power of NR. While the power savings are 





3.2 Generalized Series Resonance  
Figure 3.3 shows a series resonance scheme generalized from PSR [23], [26] 
and termed here as GSR. Figure 3.3(a) shows Generalized Series Resonance (GSR) 
with pull up and pull down switches for rail-to-rail operation. Figure 3.3(b) shows an 
equivalent series resonant circuit model for GSR with Sr closed, Su open and Sd open. 
The output of PSR is a narrow pulse stream rather than near 50% duty cycle of 
standard clocks. 
 Figure 3.4 shows the required timing diagram for generating rail-to-rail (0 to 
VDD) clock output pulses crucial for controlling the switching operation in GSR. The 
equal pulse widths of VSR generated from rising and falling edges of the clock input 
can be used to logically derive the switch control signals VUP and VDN to generate 
ideal 50% duty cycle output clock at VC.  
All voltage signals swing 0-VDD. The iL current peaks are ≃ $\pm V_{DD}/(2\sqrt{L_S/C_L})$. 
With the switch control timing shown in Figure 3.4, outputs with duty cycle close to 50% 
are obtained in GSR. As the values of Q are very low (< 4) on-chip, the VOH of PSR is 
improved from (3.6) by using a separate pull up switch Su in Figure 3.1.  
Additionally, the VOL can be improved from (3.7) with a pull down switch Sd. 
GSR has the extra pull down switch Sd to give rail-to-rail operation. This new GSR 
topology in Figure 3.3 has independent control nodes for switches Su and Sd, like NR 
of Figure 1.4. The active high control signal VSR is derived (as shown later in Chapter 






Figure 3.3 GSR (a) Switching circuit (b) Equivalent circuit model  
 





The switch Sr in series with inductor is closed twice in a cycle, first to store the 
discharging energy and later to recover it. The switch Sr control input pulse stream 
VSR needs to have a specific width (TR/2) for resonance. The active low 𝑉𝑈𝑃, after 
resonant recovery during VSR pulse, pulls up the output to VDD. The active high VDN 
signal pulls down the low going output signal all the way to ground, after the VSR 
pulse. As seen from (3.7), for low Q, the output does not go all the way to the bottom rail with 
resonant discharge. 
Adiabatic transfer of the energy between the inductor and load capacitor 
during the resonance periods effectively conserves dynamic energy. Compared to 
PSR, the inductor in GSR is switched at twice the rate (2fCLK) of the incoming clock 
and for half the duration (TPW  ≈ TR/2). The governing equations during Sr closure are 
same as (3.1) and (3.2) derived for PSR, but with different initial conditions. The 
inductor current is then given by (3.3) and capacitor voltage by (3.5). However, the 
waveforms last only for half the cycle. The energy recovery process can be seen from 
the ideal inductor current iL into the VC node, where the current during discharge is 
recovered back for charging. 
When the VSR pulse closes Sr for half the resonance period, VC is discharged to its 
lowest point, $0.5V_{DD}(1 - e^{-\pi/(2Q)})$ from (3.7). The switch is ideally opened when the 
current is zero and the charge is stored on the VDD/2 node. The VDN signal then closes, 
connecting output to ground and forcing the VOL to 0V rail. When the VSR pulse comes 
next in charging phase, it will follow (3.5) again with a half cycle time shift starting 
from 0V. It will not reach the PSR maximum recovery point VOH but will be shifted 
















When the VUP signal is active, it will pull up from the VCmax value to VDD. 
From (3.9), it can be seen that the voltage recovery point is lower than in PSR (3.6), 
requiring more energy to replenish, for the rail-to-rail operation. 
The power PGSR needed to pull VC from the value in (3.9) to VDD at 
frequency fCLK can be derived similar to (3.8) as, 
$P_{GSR} = (V_{DD} - V_{Cmax})\, V_{DD} C_L f_{CLK}$
$\quad\;\; = (V_{DD} - 0.5V_{DD}e^{-\pi/Q} - 0.5V_{DD}e^{-\pi/(2Q)})\, V_{DD} C_L f_{CLK}$







By connecting the inductor branch closer to the load, the series resonance total 
resistance can be reduced to RT = (Rr+rs). This will prevent significant Q 
degradations, improving the energy savings further. The same assumption is made, as 
in PSR, 4LS/RT > RTCL for underdamped condition, implying a minimum value of 
inductance and Q. 
The power is less than that taken by NR and, for a Q of π, nearly 50% savings 
are predicted. GSR savings are valid over the DVFS clock frequency range. The tank Q for 
GSR can be maximized as the inductor is free to be connected closer to CL. 
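The PSR and GSR power levels can be compared against the NR baseline with a short calculation. This is a sketch under stated assumptions: the PSR ratio follows (3.8) as $0.5(1 - e^{-\pi/Q})$, and the GSR recovery point is assumed to be $V_{Cmax} = 0.5V_{DD}(e^{-\pi/Q} + e^{-\pi/(2Q)})$, a form that reproduces the "nearly 50%" savings quoted above for Q = π. The function names are hypothetical.

```python
import math


def psr_power_ratio(q):
    """P_PSR / P_NR from (3.8): 0.5 * (1 - exp(-pi / Q))."""
    return 0.5 * (1.0 - math.exp(-math.pi / q))


def gsr_power_ratio(q):
    """P_GSR / P_NR = 1 - V_Cmax / VDD, with the assumed V_Cmax form."""
    v_cmax = 0.5 * (math.exp(-math.pi / q) + math.exp(-math.pi / (2.0 * q)))
    return 1.0 - v_cmax


for q_value in (2.0, math.pi, 4.0, 10.0):
    print(f"Q = {q_value:5.2f}: PSR/NR = {psr_power_ratio(q_value):.3f}, "
          f"GSR/NR = {gsr_power_ratio(q_value):.3f}")

psr_at_pi = psr_power_ratio(math.pi)
gsr_at_pi = gsr_power_ratio(math.pi)
```

Under these assumptions GSR draws more supply power than PSR at the same Q, which is the cost of the rail-to-rail, near 50% duty cycle output.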
3.3 GSR with decoupling capacitor (GSR-C) 
It is also possible to use GSR with a large decoupling capacitor instead of the 
extra inductor bias supply, like in CPR, as shown in Figure 3.5.  An energy recovery 
capacitor CER, is incorporated for electrical energy storage and initializing the logic 
operation as shown in Figure 3.5(a). Figure 3.5(b) shows an equivalent series resonant 




The resonant circuit incorporates a high-Q inductor LS connected in series with 
capacitors CER and CL, along with switching transistors. Rr is the ON resistance in the 
FET switch, when operating in the linear regime, and rs is the inductor series 
resistance in the resonant circuit. The equivalent circuit of series RLC resonator with 
Rr, rs and LS connected in series with capacitors CER and CL is shown in Figure 3.5(b). 
A virtual voltage source is created by adding the energy-recovery storage 
capacitor CER in the circuit. This capacitor is precharged to a voltage of VDD/2 to begin 
with. The restoring voltage VDD/2 in the storage capacitor CER is assumed to be stable 
during the charging and discharging of CL. This energy conserving resonant circuit is 
used for generating flat-topped (trapezoidal) clocking waveform with a very low 
energy loss.  
Figure 3.6 shows the timing diagram for generating the flat-topped output 
pulses by the energy recovery logic circuit. The period of clocked waveform can be 
determined by the designed values of inductance and capacitances in a circuit. Here, 
the energy recovery capacitor CER  is used as a reservoir, as energy moves back and 
forth to load capacitor CL. Current flows into the load capacitor and a voltage is 
generated in a series inductor, LS. When the output voltage VC  reaches the same 
potential (VDD/2) as the storage capacitor CER, the voltage across the inductor begins 
to collapse and the current is forced to flow in the same direction through the inductor 
LS, forcing the VC  to approach VDD. 
The output voltage VC reaches VDD at the point when the current iL in the series 
inductor becomes zero. At this time, the switch Sr is turned off, and the switch Su is 
turned on. This holds the output voltage at VDD for finite time, giving the flat-top of 
the output pulse. Energy is wasted through switch Su to bring the output to the full 






Figure 3.5 GSR-C with energy recovery capacitance CER. 
 
 
Figure 3.6 Same as Figure 3.4, repeated for convenience. 
In the discharge phase, the charge is returned to the energy-recovery biasing 
capacitor CER by current flowing through the inductor by turning on the switch Sr. 




circuit for energy transfer from CL to the inductor by current flowing out of CL through
LS. This causes a build-up of voltage in LS in the direction opposite to the charging 
phase, returning charge to the capacitor CER. When VC  decreases to VDD/2, the voltage 
of LS collapses forcing the current in same direction, making VC reach ground at logic 
‘0’. At this point, the current iL becomes zero and the switch Sr is turned off, and the 
switch Sd is turned on to hold the output voltage to ground (i.e., logic ‘0’). Through 
this resonant energy transfer mechanism, most of the energy is recovered. The charge 
is restored to the biasing capacitor CER, and a stable VDD/2 stored voltage is 
maintained during the circuit operation. For this, the designed value of CER is kept 
much larger than CL. In this way, the resonant driver with controlled switches 
generates a sequence of output voltage pulses with finite flat-tops. 
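The charging phase described above can be checked numerically. The following is a minimal sketch, not the thesis simulation: it integrates the series loop of Figure 3.5(b) from the moment Sr closes until the inductor current returns to zero. The component values (1nH, 1pF, 5 ohm total resistance, CER = 10×CL) and the function name are illustrative assumptions.

```python
def simulate_charge_phase(vdd=1.0, ls=1e-9, cl=1e-12, cer=10e-12, rt=5.0, dt=1e-14):
    """Integrate the series R-L-C loop from switch closure until iL returns to zero.

    All default component values are assumed for illustration only.
    """
    vc = 0.0           # load capacitor voltage, starts at logic '0'
    ver = vdd / 2.0    # CER is precharged to VDD/2
    il = 0.0           # inductor current
    t = 0.0
    while True:
        # KVL around the loop: LS * di/dt = V_CER - V_C - RT * iL
        il_next = il + dt * (ver - vc - rt * il) / ls
        if t > 0.0 and il_next <= 0.0:
            break      # current reversal: the instant at which Sr would open
        il = il_next
        vc += dt * il / cl     # charge flows into CL
        ver -= dt * il / cer   # the same charge leaves CER
        t += dt
    return t, vc, ver


t_half, vc_peak, ver_end = simulate_charge_phase()
print(f"half-cycle time = {t_half * 1e12:.1f} ps")
print(f"VC at current zero = {vc_peak:.3f} V, CER voltage = {ver_end:.3f} V")
```

With a finite Q and a finite CER, VC stops short of VDD at the current zero; the remainder is what the pull-up switch Su has to supply to complete the flat top.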
The control signals for the switches are almost identical to the ones in Figure 
3.4 and repeated here for convenience. The loop in Figure 3.5(b) from Kirchhoff’s 














This leads to second order differential equation for inductor current iL (t) with 















= 0 (3.12) 




underdamped case having complex conjugate roots, the inductance needs to have a
minimum value given by the condition $L_S > R_T^2 C_L/4$ [7], [24]. The solution for (3.2)












) 𝑒−𝑡𝑅𝑇/2𝐿𝑆 sin(2𝜋𝑓𝑅𝑡) (3.13) 
 
where the damped oscillation frequency fR is given by, 










$f_R = f_{RES}\sqrt{1 - \frac{1}{4Q^2}}$ (3.14)


















sin(2𝜋𝑓𝑅𝑡)].   (3.15) 
Energy dissipation occurs because of the resistive losses, and it can also be
obtained by integrating the instantaneous power $i_L^2(t) \times R_T$ over a cycle to yield the
dissipation and power as,




$\ldots\, V_{DD}^{2}\, f_{CLK} \left(1 + \frac{C_L}{C_{ER}}\right)\left(1 - e^{-2\pi/\sqrt{4Q^2-1}}\right)$ (3.16)
The power is less than that taken by NR and, for a Q of  nearly 80% savings 
is predicted with CER>10×CL. There is of course an area penalty for using this 
scheme. GSR-C savings are valid over DVFS clock frequency range. The tank Q for 
GSR-C can also be maximized as the inductor is free to be connected closer to CL. 
However, the CER capacitor needs to be placed close to the inductor, which creates
routing challenges during integration. Also, clock gating is not possible with
GSR-C.
 
3.4 GSR Transistor level configurations 
Figure 3.7 shows transistor level implementation of the GSR driver output 




inductor bias supply is used. Figure 3.7(b) uses a large capacitor. The clock input is 
buffered and filtered to pre-bias the line as needed. The capacitor is charged to mid 
voltage VDD/2 by filtering a buffered version of the input clock signal, that is usually 
50% duty cycle. Inductor LDC is kept 10-100 times LS as practical. Capacitors CER1  
and CER2 are taken to be 5 times CL. The input clock to this generator may be gated as 
needed to reduce the extra power consumption.  
 
Figure 3.7  GSR full configurations. 
Figure 3.8 shows three possible reconfigurations of the GSR to give NR, CPR
and PSR modes. The NR scheme does not need the Mr transistor, which can thus be turned
off. The CPR scheme similarly does not need Mu, which can be tied off. PSR does
not need Md, which can be disabled too. These reconfigurations can also be done






Figure 3.8 GSR Reconfigurations. 
3.5 Series Resonance Simulation Results 
Using the transistor level configurations of Figure 3.8, PSR and GSR 
configurations are simulated to verify the basic functionality and power savings 
derived in theory. 
3.5.1 PSR Functionality  
The resonance time, designated as TRES, is given by 2π√LC. TPW should thus
ideally be of TRES duration, basically the period of resonance for large Q. This period 
(TRES=1/fRES) is set at a third of maximum TCLK or even less. As an example, for a 1pF 
load at 1GHz clock rate, TRES can be set to 0.2ns using a 1nH inductor resulting in 
5GHz resonance frequency. Conventional CR would need 25nH to resonate with a 




global bias line VLB. Figure 3.9 shows the basic operation of PSR for a 1GHz clock in 
a 45nm IBM compatible process [36], [37]. 
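The numbers in the 1pF, 1GHz example above follow directly from the LC resonance relation. The short sketch below reproduces them; the function names are illustrative and not from the thesis.

```python
import math


def resonance_period(l_henry, c_farad):
    """Period of an ideal LC resonance: TRES = 2*pi*sqrt(L*C)."""
    return 2.0 * math.pi * math.sqrt(l_henry * c_farad)


def inductance_for_frequency(f_hz, c_farad):
    """Inductance that resonates with c_farad at f_hz."""
    return 1.0 / ((2.0 * math.pi * f_hz) ** 2 * c_farad)


t_res = resonance_period(1e-9, 1e-12)          # PSR: 1nH with a 1pF load
l_cr = inductance_for_frequency(1e9, 1e-12)    # resonating at the 1GHz clock itself

print(f"TRES = {t_res * 1e12:.0f} ps, fRES = {1.0 / t_res / 1e9:.2f} GHz")
print(f"Inductance for 1 GHz resonance with 1 pF = {l_cr * 1e9:.1f} nH")
```

The comparison shows why resonating well above the clock rate is attractive: the inductor needed shrinks by roughly the square of the frequency ratio.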
There is some ringing in the current that can be observed when the inductor is 
disconnected and left floating in the non-resonant portion as TPW is larger than TRES. 
This is actually necessary to conserve energy. The performance must be viewed along 
with data capture of flip-flops as shown in Chapter 9. 
 
 
Figure 3.9 PSR Operation Timing Waveforms. 
 
3.5.2 GSR Functionality and Performance 
 
The functionality and robustness of the new GSR and GSR-C drivers are
verified by 22nm SPICE simulations [36], [37]. The results plotted in Figure 3.10
show that the GSR (red) and GSR-C (blue) outputs VC are functional to drive standard
local buffers generating an output signal for flip flops and other parts of the digital 




that come from variations in load capacitance. The VSR , pull up VUP and pull down 
VDN signals are shown later in Chapter 4.  The bias voltages are shown for the two 
different schemes. Although GSR-C generates less than VDD/2 bias through the 
filtering, the functionality is on par with GSR. The supply current and the 




Figure 3.10  Simulations of GSR and GSR-C showing the functionality. 
Operation at multiple voltages is shown in Figure 3.11, plotting the power 
drawn for driving a 20pF load in the functional frequency range for DVFS.  Higher 
VDD supply voltages give large frequency sweep but take higher power. Power is 
saved by moving to an operating point of lowest VDD for a given frequency. No 
interconnect resistance is factored so that output swings rail-to-rail with a tank Q = 3. 
Lower supply voltages give lower maximum frequency but take less power at 
functional frequencies. The ability to scale voltage down to the minimum needed at 




the spacing between the curves in Figure 3.11. The GSR simulated power at 1V and 
1GHz is nearly half of $C_L V_{DD}^2 f_{CLK}$ as per (3.8). System level simulations with real life
clock trees are shown in Chapter 9. 
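As a rough numerical reading of the statement above, the sketch below evaluates the CV²f reference for the 20pF load of Figure 3.11 at 1V and 1GHz. The factor of 0.5 for GSR is simply the "nearly half" figure quoted in the text, used here as an assumption rather than a derived result.

```python
def clock_power(c_load, vdd, f_clk, factor=1.0):
    """Dynamic clock power, factor * C * VDD^2 * f (factor = 1 for non-resonant)."""
    return factor * c_load * vdd ** 2 * f_clk


C_LOAD = 20e-12   # load used for Figure 3.11
p_nr = clock_power(C_LOAD, 1.0, 1e9)                # conventional reference
p_gsr = clock_power(C_LOAD, 1.0, 1e9, factor=0.5)   # "nearly half", per the text

print(f"NR reference: {p_nr * 1e3:.1f} mW, GSR estimate: {p_gsr * 1e3:.1f} mW")
```

Because the supply enters quadratically, moving to the lowest VDD that is still functional at a given frequency compounds with the resonant saving.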
 
 Figure 3.11 GSR Voltage and Frequency scaling operation for DVFS. 
3.5.3 GSR Schematic Diagrams 
 
Scalable CMOS implementation of GSR is shown in Figure 3.12.  External 
connections of this macro cell determine the mode in which it is used. The macro 
symbol is shown at the bottom of Figure 3.12. In case of distributed inductance the
transistor Mr will be placed outside this macro. The device widths will be scaled 
based on the technology’s minimum channel length L used. The operation has been 
verified from 90nm to 22nm. The sizing of width also depends on the load 
capacitance driven. The sizes shown are for 1pF capacitance. For multiple pFs of 






Figure 3.12  GSR Scalable Reconfigurable Driver Schematic and Macro Cell Symbol. 
 
 
Figure 3.13  Typical Configuration of Driver for GSR rail to rail operation. 
 
Figure 3.13 shows a typical GSR configuration for a load capacitance of sCL×1pF
and inductor bias set to half the power supply used to generate the waveforms in 




4 SUPPORT CIRCUITRY 
This chapter describes the transistor level implementation of the different 
blocks shown so far. It details the important circuits for realization of the complete 
GSR solution in practice and how they can be used in other configurations as well. 
Low power implementations of one or more of the following functions are needed for
resonant and non-resonant operation:
1. Pulse Generators with controlled width 
2. Multiple non-overlapping pulse streams 
3. Voltage Doublers 
4. Extra supply voltage VDD/2 or bias generation 
4.1 GSR Configuration 
Figure 4.1 shows how the above 1, 2 and 3 may be realized. An optimum 
delay of 0.5TR is generated from the RLC and inverter in the input stage of Figure 4.1. 
The series inductor (LD) is a replica of LS (from Figure 3.7), and matching capacitance 
CM1 tracks the load CL. The pulse width, 0.5TR ≤ √𝐿𝑆𝐶𝐿 in Figure 4.1, is determined 
by √𝐿𝐷𝐶𝑀1. The inductor LPW is chosen large enough so that TPW = 
2√𝐿𝑃𝑊(𝐶𝑀𝑟 + 𝐶𝑀2) is slightly larger than 0.5TR. Matched delays create pulse 
widths that are replica of load capacitance resonance times. GSR inductor control 
output is at double the supply voltage to reduce switch on-resistance. Here CMr is the 
non-negligible gate capacitance of the inductor switching transistor Mr in GSR 
scheme shown in Figure 3.7. CM2 is also matched to CL like CM1. This replica timing 
eliminates the need for synchronization with conventional DLL/PLL circuitry that 




Repeated low going pulses are generated from both the edges of the input 
CLOCKin using an XNOR gate and the replica delayed signal. The XNOR output can 
be inverted to obtain the VSR signal that controls the GSR inductor switch. The other 
two signals VUP and VDN are readily obtained through logical operations of CLOCKin 
and the XNOR output. 
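The edge-to-pulse conversion described above can be illustrated with a small sampled-signal model. This is a behavioral sketch only: the clock is a list of 0/1 samples, the replica delay is an assumed whole number of samples, and the derivation of VUP and VDN is not modeled.

```python
def xnor_pulses(clock, delay):
    """XNOR of a sampled clock with a copy of itself delayed by `delay` samples."""
    delayed = [clock[0]] * delay + clock[:len(clock) - delay]
    return [1 if a == b else 0 for a, b in zip(clock, delayed)]


# Two clock periods of 16 samples each, 50% duty cycle (illustrative values).
clock_in = ([0] * 8 + [1] * 8) * 2
xnor_out = xnor_pulses(clock_in, delay=2)   # low-going pulse after every edge
v_sr = [1 - x for x in xnor_out]            # inverted XNOR output drives the inductor switch

print("CLOCKin:", "".join(map(str, clock_in)))
print("XNOR   :", "".join(map(str, xnor_out)))
print("VSR    :", "".join(map(str, v_sr)))
```

Each clock edge, rising or falling, produces one pulse whose width equals the replica delay, which is why no separate timing reference is needed.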
Thanks to the Miller gain around CM1 buffer, it is not necessary to have the 
entire load capacitance duplicated for replica delay. This saves power in charging and 
discharging this capacitor as well. For run-time tuning, accounting for inductor and 
load capacitance variations, the variable resistor Ropt can be tuned to adjust the RLC 
delay and change TR appropriately. CM1 and CM2 can be varied to match the loads 
used, during die to die calibrations.  
 




The switch on-resistance in GSR, for the same device size as NR, will be
higher due to the source bias voltage of 0.5VDD in the NMOS. The drain-source resistance
is inversely proportional to the gate overdrive (Vgs − Vt) and is given as $L / [2\mu C_{ox} W (V_{gs} - V_t)]$
[38], [39]. While Vgs is the full gate voltage of VDD in the NR case, in GSR it is only half that,
as the source is now biased at 0.5VDD. Transistor width (W) can be increased to
compensate for this, but doing so increases area and capacitance. Another alternative is to
drive the gate with a higher voltage [24]. Resonant techniques can also be used to drive
the VSR line itself [40].
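To make the overdrive argument concrete, the sketch below evaluates the on-resistance expression quoted above for three gate-drive cases. The threshold voltage of 0.3V, the device dimensions and the μCox value are assumed purely for illustration; only the ratios matter.

```python
def on_resistance(length, width, mu_cox, vgs, vt):
    """Linear-region switch resistance, L / (2 * mu*Cox * W * (Vgs - Vt))."""
    return length / (2.0 * mu_cox * width * (vgs - vt))


VDD, VT = 1.0, 0.3                       # assumed supply and threshold voltages
L_CH, W_CH, MU_COX = 22e-9, 13.2e-6, 300e-6   # assumed device parameters

r_nr = on_resistance(L_CH, W_CH, MU_COX, vgs=VDD, vt=VT)                 # source at ground
r_gsr = on_resistance(L_CH, W_CH, MU_COX, vgs=VDD - 0.5 * VDD, vt=VT)    # source at VDD/2
r_dbl = on_resistance(L_CH, W_CH, MU_COX, vgs=2 * VDD - 0.5 * VDD, vt=VT)  # doubled gate drive

print(f"GSR / NR resistance ratio: {r_gsr / r_nr:.2f}")
print(f"GSR with doubled gate voltage / NR: {r_dbl / r_nr:.2f}")
```

Under these assumptions the mid-rail source bias costs a factor of 3.5 in resistance at equal width, while a doubled gate voltage more than recovers it, which is the motivation for the doubler.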
A low power voltage doubler scheme for VSR is shown in Figure 4.1 that uses 
pulsed resonance technique. Pulse resonance based PMOS driver is used as a voltage 
doubler.  The GSR inductor control output (VSR) can swing at twice the supply voltage 
[15]. The circuit is actually a PMOS complement of PSR driver discussed. When the 
PMOS switch is closed, the inductor series resonates with the capacitance CM2 and 
𝐶𝑀𝑟. The series inductor (LPW) needs to be large enough to give the 0.5TR timing 
needed at VSR, with the additional load of GSR driver gate capacitance 𝐶𝑀𝑟.  
For large load capacitances (>10pF) the resonant inductance values are quite
small (<0.1nH) allowing the use of larger values of LPW to give lower area CM2. For
load capacitors a QC > 30 is assumed at 5GHz giving less than 1 Ω of series resistance
per 1pF. While the aspect ratio W/L is indeed large (> 600), the resulting gate capacitance
of 10fF increases the switching power of a 1pF load only by 1/33rd. The dominant
GSR predriver capacitance is 2CL  for dynamic power calculations and can thus be
effectively scaled to <0.2 CL  for large loads.
To estimate the power of this predriver, it can be seen equivalent to switching 




output capacitance (that absorbs the gate capacitance of Mr switch 𝐶𝑀𝑟 < 𝐶𝐿 /33). 
With 5× Miller gain and an inductor value 10× the driver inductance 𝐿𝑆, the
effective capacitance driven can be < 0.2CL. Each logic inverter (termed INV) too has
total input and output capacitance < 𝐶𝐿/33 across various processes from 90nm to
22nm. The minimum predriver power can thus be estimated as,




$\ldots\, V_{DD}^{2} f_{CLK} + 0.2\, C_L V_{DD}^{2} f_{CLK} \approx 0.5\, C_L V_{DD}^{2} f_{CLK}$ (4.1)
This is similar to NR overhead with tapered buffers. The signal generator of
Figure 4.1 can be shared among 3 or more GSRs with the same TR requirements to
reduce power and area overhead to less than $0.2\, C_L V_{DD}^{2} f_{CLK}$. The use of inductors in
pre-drivers as well lowers the power needed to drive capacitive loads in the support
circuitry while achieving the doubler function. While the doubled voltage means 4
times the power, the PSR structure reduces the power to 1/3rd.
The bias voltages needed by CPR, PSR and GSR are readily available in 
modern multi-voltage domain SoCs, especially in mobile processors. The VDD/2 bias 
line draws no effective power as more current is pushed into it than pulled out. The 
output impedance requirement of this, as a fraction of total resistance RT, can be 
calculated so that Q is not degraded to adversely affect the condition for underdamped 
oscillation and performance. For efficient energy savings, the output impedance of 
these is targeted to be less than 10% of the switch on-resistance. 
4.2 PSR Reconfiguration and Application 
PSR driver needs only a portion of the support circuits from GSR. It is well suited to 
drive level sensitive latches like true single phase latches (TSPC) [27]. A part of 
Figure 4.1 GSR pre-driver used for PSR is shown in Figure 4.2 along with data 




TSPC latches. The LC delay of pulse generator matches the resonance pulse width of 
PSR output. In the absence of the voltage doubler, inductor bias VLB as low as VDD/4 
may be used, to achieve lower VOL levels when effective Q value is very small. The 
pulse widths are programmed to full TR rather than 0.5TR. The pulses are available on 
both edges of clock to support DDR. 
  
Figure 4.2 PSR driver clocking a bank of n TSPC latches. 
To take advantage of the pulsed nature of the PSR driver output, the true 
single phased clocked latch (TSPC) shown in Figure 4.2 can be used instead of 
master-slave flip flops [23]. This latch is often called explicit-pulsed true single phase 
clocked flip flop (epTSPC) [21], [27], [31], [33]. TSPC latches also demand steep and 
controlled slopes of the enabling clock edge to prevent malfunctions from undefined 
values and race conditions. 
The predriver portion of PSR takes roughly half the power of GSR giving, 




$\ldots\, V_{DD}^{2} f_{CLK} + 0.1\, C_L V_{DD}^{2} f_{CLK} \approx 0.25\, C_L V_{DD}^{2} f_{CLK}$ (4.2)
The predriver can be shared among 3 or more GSRs with the same TR 
requirements to reduce power and area overhead to less than 0.1 𝐶𝐿𝑉𝐷𝐷




The PSR can create the controlled sharp falling edges needed to correctly 
trigger latches. The width TPW needs to be large enough to complete one cycle of LC 
resonance and meet the latch transparency window target. PSR enables extra power 
savings in DDR applications. An ideal dual edge-triggered (DET) flip-flop allows the 
same data throughput as a single edge-triggered flip-flop while operating at half the 
clock frequency by sampling DDR. The power in the CDN is reduced by a factor of 
two or more if voltage is scaled as well. 
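The factor-of-two claim follows from dynamic power scaling as C·VDD²·f. The sketch below expresses the clock distribution power of a dual-edge design relative to a single-edge design with the same data throughput; the 0.8 supply ratio is an assumed example, not a value from the thesis.

```python
def relative_cdn_power(freq_ratio, vdd_ratio=1.0):
    """Clock network power relative to the reference, assuming P ~ C * VDD^2 * f
    with unchanged clock load capacitance."""
    return freq_ratio * vdd_ratio ** 2


half_rate = relative_cdn_power(0.5)                      # same throughput at half the clock
half_rate_scaled = relative_cdn_power(0.5, vdd_ratio=0.8)  # plus an assumed 20% supply reduction

print(f"Dual-edge, same VDD: {half_rate:.2f}x")
print(f"Dual-edge, 0.8x VDD: {half_rate_scaled:.2f}x")
```

The first figure is the factor of two quoted above; the second shows how voltage scaling at the lower frequency pushes the reduction beyond that.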
 PSR can achieve dual edge operation with TSPC latches without having to 
double the circuitry [23]. By clocking explicit-pulsed latches no additional flip-flop 
area is needed for double data rate operation [23]. This reduces the frequency and 
voltage for operation giving 40% area and power reductions for 1024 flops in 45nm 
CMOS process as shown in [23], [41]. All the required transistor level topologies to
implement the solution are shown in Appendix B: LTSPICE Schematic
Diagrams.
4.3 Flip-flops For Energy Recovery 
The best ways to combine PSR with flip-flops to save local clocking power are
now examined. Flip–flops are the basic elements of synchronous designs. Their
choice and implementation can reduce the power consumption and provide more 
slack time for the timing budget. Various dual edge-triggered flip–flops compared in 
[7] have been extensively referenced and used [42], [43].  This includes implicit-
pulsed flip–flops and explicit-pulsed flip–flops.  Pulse-triggered flip–flops, 
characterized by a simple structure, negative setup time and soft edge, perform better 
than traditional master–slave flip–flops [43]. The pulse generator of the explicit-
pulsed flip–flop can be shared by neighboring flip–flops, contributing to less power 




to half that of the single-edge flip–flop while maintaining the same data throughput, 
so that power dissipation is decreased [42]. These are reviewed now for use with PSR.
4.3.1 Conventional Solutions 
At the leaf end of the tree, high-performance and low-power, energy recovery 
flip-flops that operate with resonant clocks have been proposed, exhibiting significant 
reduction in delay, power, and area [21], [31]. Another approach for energy recovery 
clocked flip-flops is to locally generate square-wave clocks from a sinusoidal clock. 
This technique has the advantage that existing square-wave flip-flops could be used 
with the energy recovery clock. However, extra energy is required in order to generate 
and possibly buffer the local square waves. 
One of the lowest energy and area flip-flops reported in [21] is the Single-
ended Conditional Capturing Energy Recovery (SCCER) flip-flop. This is 
representative of what are called implicit-pulsed dynamic flip-flops. It has differential 
circuitry to handle the special sine waves of CR drivers. With the PSR of Figure 4.2, 
these features may be redundant and so is the need to generate implicit pulses in every 
waveform. This pulse generator has the same function as the input stage TR pulse 
generator in Figure 4.1.  
Figure 4.3 from [43] shows an explicit-pulsed, hybrid semi-dynamic flop (epDCO)
that consumes extra energy for the explicit pulse generator. However, this
power consumption can be significantly reduced by sharing a single pulse generator 
among a group of flip-flops. Due to the dynamic nature of the circuit, back-to-back 
inverters are needed to hold the state of the intermediate output and the final output. 
The ipDCO and epDCO with shared pulse generators are the best among all 
semi-dynamic flip-flops considered for use in high speed critical paths.  The explicit-




of all the flops with time-borrowing (negative setup time) capability. The tradeoff is 
that the minimum delay of epSFF is larger than the minimum delay of epDCO. It is 
appropriate for the large number of paths on a chip which are speed-sensitive and can 
benefit from a fast delay and large amount of time-borrowing. 
 
Figure 4.3 Explicit-pulsed flip-flop epDCO. 
4.3.2 Dynamic Latch Solutions for PSR 
The true single phase clocked latch (TSPC) is a compromise between the
above two, with proven reliability, robustness and scaling advantages. Thus the choice
is to pair TSPC with the explicit pulse output of PSR. This is the combination shown
in Figure 4.4, termed the explicit-pulsed true single phase flip-flop (epTSPC). The
main advantage is the use of a single clock phase. Dynamic output nodes are isolated 
by static inverters to prevent charge sharing effects.  
Although simpler split output versions are possible, this topology allows for 
the targeted voltage scaling from 1.3V to 0.5V. Careful sizing on internal transistors is 
necessary to prevent glitching, even for static data [27]. TSPC latches also demand 




undefined values and race conditions. As simulated before and described later, the 
PSR naturally creates the controlled sharp falling edges from resonance, to trigger 
correctly the bank of TSPC latches and interconnect (CL). 
 
Figure 4.4 epTSPC driven by PSR. 
An ideal dual edge-triggered (DET) flip-flop allows the same data throughput 
as a single edge-triggered flip-flop while operating at half the clock frequency and 
sampling data on both edges of the clock. If the clock load of the DET flip-flop is not 
significantly larger than the single edge-triggered version, the power in the clock 
distribution network is reduced by a factor of two. Dual edge operation for epTSPC 
simply implies that the explicit pulse generator gives pulses at both edges of the 
clock. The epTSPC of Figure 4.4 works on negative pulses from the PSR of Figure 
3.2. For dual edge triggered TSPC (deTSPC), some of the circuit structure needs to be 
replicated with appropriate change in devices as shown in Figure 4.5. These are used 
with conventional clock drivers for power savings comparison. While epTSPC has
fewer transistors, the burden falls on the PSR to have additional circuitry to generate





Figure 4.5 Dual Edge Triggered TSPC based Flip Flop (deTSPC). 
4.4 PSR Flip-flop Functional Verification 
Figure 4.6 compares the data capture edges with the clock leading data at both 
the rising and falling edges. NR with deTSPC fails to capture data with no set-up 
time. 
 
Figure 4.6 DeTSPC vs. epTSPC DET for negative setup. 










































[Figure 4.6 panel labels: NR clocked deTSPC, data not captured; PR clocked epTSPC, data correctly captured; Clock]




PSR with epTSPC captures the data correctly even with the negative setup 
time. This can be used advantageously for clock de-skewing purposes. The hold time 
for epTSPC is well defined by the width of the resonance pulse and the clock to Q 
propagation (tdCQ ) is 4 inverter delays. Thus, the tdCQ can be kept larger than hold 
time to minimize hold time violations for timing closures. 
As an example of PSR, for a load of 1pF, a matching capacitance of less than 
0.2pF is sufficient for generating 200ps pulses with 1nH inductor. These component 
value choices are made at design time. For run-time adjustments, the variable resistor 
Ropt can be tuned to adjust the RLC filter delay and minimize dynamic power. The 
matching mechanism from design time ensures functionality over PVT corners and 
mismatches. Run-time tuning is more energy efficient. The GSR system simulations 
are shown in Chapter 9. 
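The sizing in the example above can be sanity-checked with the LC resonance relation, treating the Miller stage as a simple capacitance multiplier. This is a sketch under assumptions: the multiplier of 5 is taken from the 5× Miller gain mentioned for the predriver, and modeling it as an ideal multiplication of the matching capacitance is a simplification.

```python
import math


def replica_pulse_width(l_replica, c_match, miller_factor):
    """Pulse width set by the replica inductor and the Miller-multiplied capacitance."""
    c_effective = c_match * miller_factor   # idealized capacitance multiplication
    return 2.0 * math.pi * math.sqrt(l_replica * c_effective)


t_replica = replica_pulse_width(1e-9, 0.2e-12, miller_factor=5.0)
t_load = 2.0 * math.pi * math.sqrt(1e-9 * 1e-12)   # 1nH with the full 1pF load

print(f"Replica pulse width: {t_replica * 1e12:.0f} ps")
print(f"Load resonance time: {t_load * 1e12:.0f} ps")
```

With these values the 0.2pF matching capacitor reproduces the roughly 200ps resonance time of the 1pF load while switching only a fifth of the capacitance.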
4.5 GSR Functionality and Performance 
The functionality and robustness of the new GSR driver and pre-driver
circuitry are verified by 22nm SPICE simulations across 30% variation in LC
component values and transistor model parameters [36], [37]. The results plotted in
Figure 4.7 show that, in spite of some outliers, the GSR output VC is functional to 
drive standard local buffers generating a CLOCKout signal for flip flops and other 
parts of the digital system. Signals from Figure 4.1 are shown to check robustness 
over 30% variations in values of active devices and passive components.  




C. Signals correspond to Figure 3.4 and a 
standard inverting buffer giving CLOCKout. The pulse width of VSR varies to track 
the changes in the LC resonance time that come from variations in load capacitance. 





Figure 4.7 Monte Carlo simulations of GSR with predriver. 
4.6 Circuit Design Optimizations 
Figure 4.2 showed a novel PSR sub-system with an input delay generator for 
the required pulse width. The series input inductor with a Miller multiplier of 
matching capacitance generates an LC filter delay equal to one pulse width. This acts 
as a replica delay and tracks the PSR output resonance pulse width of TR. The width 
needs to be large enough to complete one cycle of LC resonance as discussed earlier. 
The width is also chosen to meet the latch transparency window target. Thanks to the 
Miller gain, it is not necessary to have the entire load capacitance duplicated for 
replica delay. For a given load capacitance the feedback capacitance can be just 20% 
or less of the load capacitance to minimize area overhead.  
Lower capacitance values can be used as well with higher series resistance. If 




resistor to ground instead of being across the inverter. This efficient circuit can drive 
epTSPC, meeting the requirements of robustness and controlled steep slew rates. The 
pulsed resonance naturally creates the controlled sharp falling edges. The input stage
that generates pulses can be shared among multiple PSRs if the TR requirements are
homogeneous among the drivers.
It is possible to use the CPR and GSR drivers, replacing extra supply voltage 
of VLB = VDD/2, with a large bias capacitor CB (≈10×CL) charged to 0.5VDD, as shown 
in Sections 2.2.2 and 3.3. The operating equation is similar to (3.10) but power 
savings may be less.  A pull up switch is required in CPR for this case. It takes several 
cycles for the output clock to be stable after turn on, so clock gating is not possible 
with this scheme [12], [14]. In GSR the total bias capacitance CER can be slightly 
smaller (≈8×CL), to build and hold a bias voltage on the inductor storage end. The 
power consumption is similar to GSR, but with an extra factor (1+CL/CB) in power 
equation (3.16). CPR, PSR and GSR described in Chapter sections 2.2.1, 3.1 and 3.2 






5 TIMING PERFORMANCE OF DRIVER SOLUTIONS 
Skew and jitter are very critical performance parameters as they directly affect 
timing closure at high speeds in the nanometer regime, taking significant design time 
resources. Slow slew rates affect skew and jitter, as well as cause short circuit 
currents. As insertion delays are used to match timing skews, large variations in 
propagation delay are also detrimental to achieving closure over process variations. 
Based on the circuit models and output voltage equations derived in Chapter 2  and  
Chapter 3, the propagation delays and slew rates of the various clock drivers 
discussed so far are now analyzed. The intrinsic gate delays are ignored as the CL is 
assumed to be much larger than device capacitances. 
5.1 Propagation Delays and Transition Times
Propagation delay tPD is the delay from the mid-rail of the input to the
mid-rail of the output. Transition times are the output rise and fall times between the 90%
VDD and 10%VDD points. The slew rate is the slope of transitions at mid-rail.
5.1.1 Non-Resonant Driver  
The delay to midpoint, averaging over rise and fall, can be obtained from (1.1) 
as shown in [27] as,  
𝑡PD = ln(2) [Rw+(Ru+Rd)/2]CL = 0.69[Rw+(Ru+Rd)/2]CL. (5.1) 
 This propagation delay does not include any predriver delay. To minimize 
overall delay tapered buffers are used as predrivers in practice [20], [28]. Tapered 




 ), where 𝑛  is the number of buffers. When n=3 the excess capacitance from 
predrivers is 0.5CL. Accordingly the excess power in NR predriver is given by, 
$P_{P\text{-}NR} = 0.5\, C_L V_{DD}^{2} f_{CLK}$ (5.2)




 The criterion for minimum delay implies that the delay in each stage is the
same [27]. Thus, the total insertion delay through the predrivers and drivers is given 
by, 
𝑡INS= (n +1) tPD = 0.69 (n +1)[Rw+(Ru+Rd)/2]CL. (5.3) 
From (1.1), the 90% to 10%VDD fall time can be calculated as [27], 
Tfall = 2.2(Rd+Rw)CL = Trise. (5.4)
The rise time is identical as it is governed by a similar equation. This is usually 
kept less than 10% of the clock period. An upper bound reduces the effect on 
setup/hold constraints and decreases short-circuit power. A lower bound is also needed 
to reduce peak supply currents and cross-coupling noise and electromagnetic 
interference (EMI).  
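Relations (5.1), (5.3) and (5.4) are easy to evaluate together. The sketch below does so for one set of resistances; the values (20 ohm switches, 10 ohm wire, 1pF load, three predrivers, 1GHz clock) are assumptions chosen only to show the arithmetic.

```python
import math


def nr_timing(r_u, r_d, r_w, c_l, n_predrivers):
    """Non-resonant driver timing from (5.1), (5.3) and (5.4)."""
    r_avg = r_w + (r_u + r_d) / 2.0
    t_pd = math.log(2) * r_avg * c_l          # (5.1) mid-rail propagation delay
    t_ins = (n_predrivers + 1) * t_pd         # (5.3) equal delay in every stage
    t_fall = 2.2 * (r_d + r_w) * c_l          # (5.4) 90% to 10% fall time
    return t_pd, t_ins, t_fall


t_pd, t_ins, t_fall = nr_timing(r_u=20.0, r_d=20.0, r_w=10.0, c_l=1e-12, n_predrivers=3)
t_clk = 1e-9   # assumed 1 GHz clock

print(f"tPD = {t_pd * 1e12:.1f} ps, tINS = {t_ins * 1e12:.1f} ps")
print(f"Tfall = {t_fall * 1e12:.1f} ps ({100 * t_fall / t_clk:.1f}% of TCLK)")
```

For these assumed values the fall time stays under the 10% guideline mentioned above.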
Skew between two clock lines can occur due to mismatch in routing lengths and
variation in the input threshold of the buffers due to device process variations and
supply/signal voltage differences. The equivalent offset voltage ΔV of the buffers is
proportional to supply voltage VDD by a proportionality ε, giving ΔV = ±εVDD. The







The worst case skews 𝑡skw in the buffer clock lines, assuming +ΔV offset on
















The value of ε is typically contained to be less than 10% through matching so
that skew 𝑡skw is more than half of the driver propagation delay 𝑡PD. Clock skew is 
typically budgeted to be 10% the minimum time period TCLK (at maximum operating 
frequency) so that rest of the timing budget can be allocated to logic path delays and 
setup/hold times. Since logic designers anyway consider this in their timing 
constraints, further reduction at the expense of power is unnecessary. This implies that 
the propagation delay 𝑡PD of the final NR driving long lines needs to be less than 20% of
the clock period, assuming that the predrivers are shared and only contribute to the 
total insertion delay. 
The supply/ground variations and cross-talk from other signals can be taken as
changing the threshold level in the buffers and causing variation in the delay from input to
output arrival. Assuming the peak-to-peak variation of the power supply is ΔV = ±βVDD, with
other cross-talk combined into it, the peak-to-peak time variation based on slew




 =  4εβ (𝑅𝑑/𝑢 + 𝑅𝑤)𝐶𝐿 (5.7) 
Again, supply variations can be contained within 10% by careful shielding,
decoupling and limiting of current spikes [39]. This would give 𝑡jit−pp to be only
1/10th of 𝑡skw, the skew for each buffer, which is less than 1% of TCLK. However, the
jitter is accumulated over the entire clock buffer chain inserted and not just the
final buffer. This chain may have up to 10-20 buffers. The peak-to-peak jitter values
from each buffer do not directly add but the standard deviations of the variation in the
Gaussian distribution can be summed. The final jitter number will depend on the




as the skew 𝑡skw. Clock jitter can often dominate the timing budgets and insertion 
delays are often minimized to handle this.  
Larger device sizes can decrease switching resistance and reduce basic delays. 
Wider interconnect lines can minimize the skew and jitter, but all at the expense of 
more power. NR has the lowest delays and transit times and fully supports DVFS but 
takes higher power than other drivers described below.  
5.1.2  Continuous Parallel Resonance (CPR)  
CPR waveform given by (2.5) takes a quarter of a cycle to reach the midpoint
voltage at resonance frequency, resulting in a propagation delay of TRES/4. Combining
with the underdamped condition of $L_p < 4R_p^2 C_L$, the delay of CPR for this maximum allowed
inductance can be approximated for high Q as,
$t_{PD} \le T_{RES}/4 = 2\pi\sqrt{4R_p^2 C_L \cdot C_L}/4 = \pi R_p C_L$ (5.8)
This is larger than NR case as Rp > (Ru+Rd)/2+Rw. Less delay with smaller Rp 
implies smaller Q and will have less energy saving efficiency as per (2.6). Thus, one 
sees a tradeoff between delay and energy across driver circuit topologies.  
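The delay tradeoff above can be put into numbers by evaluating the CPR quarter-period at its maximum allowed inductance and comparing it with the NR delay of (5.1). The resistance and capacitance values below are illustrative assumptions.

```python
import math


def nr_delay(r_u, r_d, r_w, c_l):
    """Non-resonant propagation delay from (5.1)."""
    return math.log(2) * (r_w + (r_u + r_d) / 2.0) * c_l


def cpr_delay_bound(r_p, c_l):
    """Quarter resonance period at the largest inductance allowed by Lp < 4*Rp^2*CL."""
    l_max = 4.0 * r_p ** 2 * c_l
    t_res = 2.0 * math.pi * math.sqrt(l_max * c_l)
    return t_res / 4.0   # equals pi * Rp * CL


t_nr = nr_delay(r_u=20.0, r_d=20.0, r_w=10.0, c_l=1e-12)   # assumed NR resistances
t_cpr = cpr_delay_bound(r_p=100.0, c_l=1e-12)              # assumed parallel resistance

print(f"NR delay: {t_nr * 1e12:.1f} ps, CPR delay bound: {t_cpr * 1e12:.1f} ps")
```

A larger Rp raises Q and the energy saving but stretches the bound in direct proportion, which is the tradeoff the text describes.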
Buffer sizes needed for CPR are much smaller than for NR and thus no 
significant predriver is needed. Tapered buffers are not necessary and the excess 
capacitance can be kept less than 1/10th of NR at 0.05CL. Thus the insertion delay 𝑡INS
is nearly same as the propagation delay 𝑡PD.  Excess power in predriver is thus only, 




$P_{P\text{-}CPR} = 0.05\, C_L V_{DD}^{2} f_{CLK}$ (5.9)
For CPR the 10% - 90% rise time (fall time) points of the sinusoidal signal in 
(2.5), can be shown as [33],  




 This is nearly 30% of the clock period rather than the desired 10%. When the
rise times are long, as is the case for low frequencies, it leads to power and delay 
performance degradation. This is one of the reasons CPR is still not widely adopted.  
Skew and jitter can be derived as in the NR case, with the slew rate obtained by
differentiating (2.5) and evaluating at TRES/4 as,
$$SR_{CPR} = \frac{V_{DD}}{2}\, 2\pi f_{RES} \sin\!\left(2\pi f_{RES} T_{RES}/4\right) = \pi V_{DD} f_{RES} \qquad (5.11)$$










 𝑇𝑅𝐸𝑆 (5.12) 
For the same values of ε  as NR, the skew from the driver itself is less than 
7%, though the rise/fall transitions are 30%. 
 The final skew to be considered includes the delay mismatch from 
interconnects, and this can be large in CPR as the 𝑡PD delay itself is larger. Overall, 
the skew tends to be larger for CPR. However, jitter is less as long buffer chains are 
avoided and most of it comes from the final driver itself, which can be derived similar 




 = (2εβ/π) TRES (5.13)
For the same ε and β as NR above, CPR has less than 1% of period as jitter 
from above, which is very beneficial. CPR also avoids EMI and jitter coming from 
multiple harmonics of clocks as the resonance operation by definition rejects all 
harmonics above the fundamental frequency. The problem of course is that, for clock 
frequencies away from the resonance frequency, the power increases non-linearly and 




5.1.3 Pulsed Series Resonance (PSR)  
Unlike CPR, the resonance frequency in PSR can be higher than the clock
frequency, as TRES = 2π√(LSCL) is less than TCLK. The propagation delay to VDD/2 of the
falling edge in Figure 3.2 is less than TR/4. For large Q, this can be taken as TRES/4. As
keeping TRES small implies using smaller inductors with higher Q, it is attractive to
use PSR. Combining this with the underdamped condition, which needs a minimum
inductance of LS > RT²CL/4, the delay relation for PSR can be approximated for high Q as,
$$t_{PD} \ge \frac{2\pi\sqrt{(R_T^2 C_L/4)\, C_L}}{4} = \frac{\pi R_T C_L}{4} \qquad (5.14)$$
 However, as RT is usually smaller than Rp, (5.14) can give a smaller delay




 from (3.4). 
The PSR predriver shown in Figure 4.2 adds a delay of TRES plus the propagation
delay of approximately three unit buffers given by (5.1). Thus the insertion delay 𝑡INS
is more than five times the propagation delay 𝑡PD, as shown by,












(𝑅𝑢 + 𝑅𝑑)𝐶𝐿 
(5.15) 
Similar to the sinusoidal waveform of CPR, the fall time from the 90% to 10%
points for PSR can be obtained from (3.5) as,
Tfall = 0.29TR. (5.16)
Trise is larger than Tfall for lower Q (<14) as it includes the RC based pull up 
time shown in Figure 3.2(b).  Tfall is only 6% of the clock period at the fastest rate, as 
series resonance frequency is typically set to at least five times the maximum clock 




operation. When the rise times are small, even in the case for low frequencies, it leads 
to lower power from short-circuit currents. This is one of the advantages of PSR over 
CPR and NR.  
Skew and jitter can be derived as in the NR/CPR case, with the slew rate obtained
by differentiating (3.5) and evaluating for high Q at t = TRES/4 as,
$$SR_{PSR} = \frac{V_{DD}}{2}\, 2\pi f_{RES}\, e^{-tR_T/2L_S} \sin(2\pi f_{RES} t) = \pi V_{DD} f_{RES}\, e^{-\pi/4Q} \qquad (5.17)$$


















$$t_{skw} \le \frac{2.5\,\varepsilon}{\pi}\, T_R \quad \text{for } Q \ge \pi \qquad (5.18)$$
For the same values of ε as NR and CPR, the skew from the driver itself is less
than 5% assuming TR < TCLK/5, since the series resonance frequency is typically 5× the
maximum fCLK. This is in addition to the timing budget savings from the rise/fall
transitions. The final skew to be considered includes the delay mismatch from 
interconnects, and this is small in PSR as the 𝑡PD delay itself is small. This is not 
counting the common predriver delay. The skew tends to be also frequency 
independent.  
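As a rough check of the "less than 5%" statement, the sketch below evaluates the driver skew bound 𝑡skw ≤ (2.5ε/π)TR, valid for Q ≥ π, with TR = TCLK/5. The value ε = 0.1 and the 1 GHz clock are assumptions made for illustration; the thesis defines ε in the NR analysis.

```python
import math

def series_skew_bound(eps, t_r):
    """Upper bound on driver skew for a series-resonant edge, valid for Q >= pi."""
    return 2.5 * eps / math.pi * t_r

T_CLK = 1e-9        # clock period in seconds (hypothetical 1 GHz clock)
T_R = T_CLK / 5     # resonance width set to one fifth of the clock period
EPS = 0.1           # assumed mismatch parameter, for illustration only

skew_fraction = series_skew_bound(EPS, T_R) / T_CLK
print(f"driver skew bound = {100 * skew_fraction:.2f}% of T_CLK")
```

With these assumed numbers the bound is roughly 1.6% of the clock period, comfortably inside the 5% figure, and it does not change when TCLK is lengthened because TR is fixed by the LC values.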
Jitter is less as long as buffer chains are avoided. Most of it comes from the 























 𝑇𝐶𝐿𝐾    𝑓𝑜𝑟  𝑄 ≥ 𝜋 
For the same ε and β as in NR and CPR, PSR has peak-to-peak jitter less than
1% of period TCLK, which is very beneficial. The predrivers and three buffers add 1%
each to give less than 2% total jitter. PSR, like CPR, also avoids EMI and jitter coming
from multiple harmonics of clocks by virtue of its resonance.
Another advantage over CPR is that the switch closure time TR set by LC 
resonance frequency is independent of the clock period TCLK. This gives the wide 
frequency operation feature of PSR, down to the lowest clocking frequency. PSR does 
not have the problem of CPR in supporting DVFS. Additionally, the slew rate is set 
by the faster TR time rather than the variable TCLK.  It is optimal to use PSR with level 
sensitive latches that only depend on controlled fall time. The pulse mode of operation 
can also save power downstream by replacing flip-flops with lower power latches 
[23], [24]. 
5.1.4 Generalized Series Resonance (GSR)  
GSR has the same advantages over CPR as PSR. The delay equations remain 
the same but the fall time is faster with extra pull down switch. With multiple timing 
signals, GSR can give rail-to-rail outputs and 50% duty cycle outputs. The ability to 
interface with standard logic makes it more attractive to use than PSR or CPR. It takes 
more area for the extra switches and needs more support circuitry discussed in Section 
4.1.  GSR is a general purpose resonant scheme that can be reconfigured as PSR or 
CPR as shown in Section 3.4. 
The delay equation for the driver alone, 𝑡PD ≥ πRTCL/4, is a valid




different from PSR predriver due to an additional series resonance doubler stage
embedded, giving the overall value as,












(𝑅𝑢 + 𝑅𝑑)𝐶𝐿  
(5.20) 
This equation is good for comparative analysis but it is based on simplified 
linear models assuming a fixed load capacitance. The actual values will be different 
due to voltage dependent non-linear capacitances.  
The slew rate is governed by the same equation (5.17) as PSR and can be
taken as SRGSR = πVDDfR·e^(−π/4Q) without loss of generality. The skew from the driver
alone can then be bounded by the relation 𝑡skw ≤ (2.5ε/π)TR for Q ≥ π.
For the same values of ε as NR and CPR, the skew from the driver itself is 
less than 5%, like PSR, as the series resonance frequency is typically 5× the 
maximum 𝑓𝐶𝐿𝐾. The final skew to be considered includes the delay mismatch from 
interconnects, and this is small in GSR too as the 𝑡PD delay itself is kept small. This is 
not counting the common predriver delay.  
The skew tends to be also frequency independent. Jitter is less as long as 
buffer chains are avoided. Most of it comes from the final driver and predriver itself, 




 𝑇𝐶𝐿𝐾 for 𝑄 ≥  𝜋.   
For the same ε and β as NR, CPR and PSR, GSR driver by itself has peak-to-
peak jitter less than 1% of the clock period TCLK . The predriver is a PSR stage having 




2.3% of TCLK.  GSR like PSR/CPR also avoids EMI and jitter coming from multiple 
harmonics of clocks by the virtue of its resonance.  
 
5.2 Comparative Analysis 
The timing for GSR is compared with NR and CPR in Figure 5.1. PSR has
similar results to GSR. The waveforms compared are from (1.1), (2.9) and (3.5) with
the simulated delays and transition times for a 20pF load and < 3 Ω of switch resistance
without any interconnect parasitics. The simulated delay values are within 10% of the
theoretical calculations using (5.1) - (5.16). The pre-driver delays are not factored in,
for simplicity, as they do not affect slew rates appreciably.
 
Figure 5.1 Simulated output voltage waveform on a 20pF load capacitor (VC).  
 
The resonant frequency of CPR is 1.0GHz. Propagation delays (tPD) to mid-
points at 50% marker are shown vertically on individual curves. The NR curve is the 




a rising sinusoidal wave, whose falling edge does not need a triggering input. Thus no 





6 DATA PATH APPLICATIONS 
While special latches were used for power savings with PSR clock, it is 
desirable to have logic blocks for computation with lower power than standard CMOS 
logic. One such promising logic family is the domino-style dynamic logic that is 
traditionally encumbered by the clocking power.  While several innovations like [44] 
have helped the use of dynamic logic in mainstream for higher speed at lower power, 
they have not shown energy recycling advantage presented here. By using the GSR 
principles in the clocking of standard dynamic logic, power can be saved in the 
switching required in every cycle irrespective of data. The refresh cycle of domino 
logic naturally performs the pull up function in the GSR. Thus the GSR predriver can 
readily generate the clocking signals for dynamic logic operation. 
6.1 Resonant Dynamic Logic (RDL)  
In dynamic logic gates, the output is pulled to VDD during refresh/pre-charge 
phase of the clock cycle TCLK [44]. Valid input is required only during the evaluation 
phase of the period.  Figure 6.1 shows a resonant version of domino-style dynamic 
logic [45]. Figure 6.2 shows the timing signals necessary for the correct logical 
operation of RDL. While the pre-charge (REF) and evaluate (EVAL) signals are also 
part of the resonant gate operation shown below, an additional phase is needed for 
energy recovery with the timing signal REC. When input IN is at logic 1, the inductor 
is disconnected from the output. When IN is at logic 0 it is connected to the output 
twice before the next clock cycle starts. M1 functions as the refresh switch. M2 is used 
to charge and discharge capacitor C through inductor. The preprocessing CMOS gate 
shown will generate the necessary control voltages to connect and disconnect the 




0.5TLC for resonance operation. TLC, like TRES before, is given by 2π√(LC) and is a
fraction of TCLK in order to fit two units of it in the Evaluate and Recover phases. The
logic expression for LON is given by,
$$LON = EVAL \cdot \overline{IN} + REC \cdot \overline{OUT}$$
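The control expression LON = EVAL·(NOT IN) + REC·(NOT OUT) can be exercised with a small truth-table sketch. All signals are modeled here as active-high booleans, which is an assumption of this illustration and not a statement about the polarity used in the actual circuit.

```python
def lon(eval_phase, in_bit, rec_phase, out_bit):
    """Inductor switch control: LON = EVAL * NOT(IN) + REC * NOT(OUT)."""
    return (eval_phase and not in_bit) or (rec_phase and not out_bit)

# Evaluate phase: the inductor is connected only when IN is 0.
evaluate_in0 = lon(True, False, False, True)
evaluate_in1 = lon(True, True, False, True)

# Recover phase: the inductor is connected again only if OUT was discharged.
recover_out0 = lon(False, False, True, False)
recover_out1 = lon(False, True, True, True)

print(evaluate_in0, evaluate_in1, recover_out0, recover_out1)
```

For IN = 0 the switch closes once in the evaluate phase and once in the recover phase, matching the two inductor connections per cycle described in the text; for IN = 1 it stays open in both phases.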
 
Figure 6.1  CMOS Implementation of Resonant Dynamic Logic (RDL). 
 
 




At the end of the recovery, the refresh switch M1 is momentarily closed by 
REF pulse to compensate for finite Q losses and bring OUT voltage fully back to Vdd. 
The refresh switch may also be closed during logic 1 to account for any charge leakage 
from the capacitor. Note that the inductor is only utilized during the transition times 
and otherwise free for rest of the cycle.  
For input IN = 0, LON is high and M2 connects the inductor to the output load
capacitor C. By lossless resonance given by (6), OUT goes to ground when the switch
is closed for a duration (TLON) of 0.5TLC. Thus the correct logical evaluation for the
driver with the energy stored in the inductor supply is achieved. For the OUT = 0 now,
LON evaluates to high (VDD) again with active low REC pulse for LON. The M2 switch 
is again closed for another short period of 0.5TLC. This will restore the output to the 
pre-charge value VDD, assuming ideal lossless transfer of energy from the inductor 
supply to output load capacitor. To compensate for finite Q losses, the refresh switch 
M1 is momentarily closed by REF pulse, at the end of the recovery, to bring the 
voltage fully back to VDD. This is the operation during which most power is consumed. 




2  𝐶𝐿𝑓𝐶𝐿𝐾. Figure 6.1 shows the preprocessing logic for a simple inverter, but it can 
be extended for an n input logic gate driving appreciable line capacitance C. 
6.2 RDL Power and Delay 
 The NR power for static logic is given by the standard expression, with data
switching at most at half the clock rate and with an activity factor α, as,












The second term accounts for the n-input logic processing. The activity factor α
indicates the fraction of times that the output signal goes high. For NR dynamic logic,
power is only consumed on low-going signals, but at twice the rate, as signals are pulled
high immediately after being pulled low. This would give the power for an n-bit
domino style dynamic logic as,





2 𝑓𝐶𝐿𝐾 (6.2) 
 This includes the second term for the extra power for the n input logic
preprocessing combined with the clock. Thus, while dynamic logic can give the fastest
data rates and smallest propagation delays possible for a given clock, it does not give
the lowest power possible for any data rate, as the data is toggled on the high
capacitance output node like a clock. In fact, the power is almost double for an even
case of α = 0.5.
 For RDL using the power savings from (3.8) of the PSR structure, the total 
power can be estimated for  comparative analysis as, 
PRDL =  
1
2





2 𝑓𝐶𝐿𝐾 (6.3) 
In comparison, for α = 0.5 and a realizable Q > π, RDL power is a third of
standard domino logic power and 50% less than standard static logic. Thus the
advantages of dynamic logic’s fastest processing are realized without the power
penalty, by using RDL. This is of course more practical for a large C that would
make the necessary inductor L value small enough. The propagation delay for fall
time 𝑡PD can be derived similar to PSR as,
𝑡PDR ≤ TR/4 + 3 ×0.69 (2 𝑅𝑢) 
𝑛+1
33













This is of course larger than standard domino NMOS only delays, but can still 
be kept less than the delay of standard CMOS logic. 
6.3 RDL simulations 
The W/L ratio for M2 is kept large enough to minimize the ON resistance and 
to maximize the effective quality factor (Q) of the LC tank. The charge/discharge time 
0.5TLC is a fraction of the main clock period set at 0.2 TCLK. The inductor needed is less 
than 5nH for a 1pF load at 1GHz for TLC =0.4ns. 
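The inductor value quoted for RDL follows directly from the resonance period. This is a minimal sketch, assuming TLC = 2π√(LC); the function name is illustrative.

```python
import math

def inductor_for_period(t_lc, c):
    """Inductance that gives a resonance period t_lc with capacitance c,
    from T_LC = 2*pi*sqrt(L*C)."""
    return (t_lc / (2 * math.pi)) ** 2 / c

T_CLK = 1e-9       # 1 GHz clock period
T_LC = 0.4e-9      # resonance period from the text
C_LOAD = 1e-12     # 1 pF load from the text

l_rdl = inductor_for_period(T_LC, C_LOAD)
half_period = 0.5 * T_LC            # charge/discharge time
print(f"L = {l_rdl * 1e9:.2f} nH, 0.5*T_LC = {half_period / T_CLK:.1f} T_CLK")
```

The computed inductance is about 4.05 nH, consistent with the "less than 5nH" statement, and the half period equals 0.2 TCLK at 1 GHz.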
Figure 6.3 shows simulation results using BSIM3 models for a 90nm standard
CMOS MOSIS process. An on-chip capacitor is assumed as the load, which is equivalent
to driving 800 unit area (.1 x .1) transistors for clock/data lines or 2mm long
interconnects. Power is compared with a non-resonant (NR) domino style circuit driving
the same load.
 




Simulation results show that at 0.5GHz rate they match well with the 
theoretical description of the resonant operation. The output voltage discharges in the 
evaluate cycle for IN=0, and charges up again in recover phase. The inductor current 
curve in Figure 6.3 shows the sinusoidal operation. An on-chip value of about 3 is 
targeted for the Q factor.  
When the inductor switches off, a certain amount of overshoot or ringing may 
be seen in the inductor current at a higher frequency. This is due to parasitic 
capacitances and the residual energy left in the inductor. While a smaller Q actually 
helps in reducing the ringing, it will also diminish the power savings. Keeping the 
switch closed for a slightly longer time helps to recover extra energy and can give 
more power savings. Note that the inductor is only utilized during the transition times
and is otherwise free for rest of the cycle. The same inductor may thus be shared 





7 AREA ESTIMATES  
This chapter deals with the layout and area considerations that affect the 
performance of clock drivers and the distribution. Clock skew is directly related to the 
clock tree and other interconnect topologies chosen. There are several choices to be 
made when trying to minimize mismatches and variations while keeping the power to 
a minimum. There is also the concern for the area of inductors and their proper 
placement. Though inductors in theory do not take active area but just metal, 
excessive metal usage can cause routing blockages and directly inhibit timing closure. 
Placement of inductors close to supply lines can cause eddy currents that cut down the 
value of inductance and quality factor as well. Thus the layout is an integral part of 
design, just as in analog design, when it comes to CDNs using inductors for resonant 
recycling of energy. In general it is assumed that reasonable increase in area is 
acceptable when reducing power consumption. The common H-tree is shown in 
Figure 7.1. 
 




The H-tree distributes the clock signal in a symmetric fashion to multiple sinks 
with minimum mismatch and skew. The implementation of a complete clock and data 
sub-system in the SoC is shown in [28] with scalable tapered buffers, also known as 
driver horn. This is used as a benchmark CDN in this thesis to compare the power 
dissipations and performance in Chapter 8. 
The total input capacitance for the local bank of flip-flops and the connecting 
wires shown as CL, may not be identical for each branch of the tree.  The gain ‘n’ is 
balanced evenly across the driver stages with the input capacitance of each stage 
being the output capacitance divided by ‘n’. Figure 7.1 represents the actual 
implementation of a 4-stage tapered buffering shown at the bottom of Figure 1.2 for 
NR clocking. Resonant clocks do not need to use such horns and directly drive the 
local CL loads. But they can have the inductors distributed along the horn or at the far 
end.   
All the driver schemes shown need additional circuitry for input pulse stream 
generation. NR and GSR need non-overlapping pulses. CPR needs a minimum timing 
pulse width for a given driver size for proper operation [34]. Keeping the pulse widths 
minimum will minimize the static leakage in large driver devices. The predriver 
requirements are also important in determining total power and silicon area. 
7.1 PSR Implementation in 45nm  
The area of the PSR output stage is equivalent to 5 medium-sized standard
inverters (INVs) which have a 10 μm NMOS and 14.6 μm PMOS in the IBM/PTM
45nm technology [32]. The rest of the active circuitry shown in PSR predriver takes
the equivalent of 6 INVs. In contrast, the NR buffer horn as represented in Figure 7.1
would take 64 such INVs. Thus there is a 4× reduction in active area with PSR. The




resistance and 0.2 fF/μm capacitance. Clock skew can be reduced by wires in parallel
at the expense of more power. With proper sizing and spacing of clock wires, the 
clock skew targets can be met [18]. Figure 4.1 replaces the entire chain of 64 inverters 
driving the clock tree. 
The layout plan of these cells is shown in Figure 7.2 as verified in Calibre.
The epTSPC takes less than 60% of the deTSPC area, as illustrated in Figure 7.2 (a) in a
cell to cell comparison of epTSPC vs deTSPC. The flip-flops, grouped into 32x32
registers, are distributed across 100 μm x 100 μm in Figure 7.2 (b). The complete PSR
with predriver and the 1024 epTSPCs can fit in the 100 μm x 100 μm area shown in
Figure 7.2 (b). PSR driver PRD includes the predriver.
(a) epTSPC vs deTSPC   (b) PSR driver PRD and 1024 epTSPCs   (c) NR Driver NRD and 1024 deTSPCs
Figure 7.2 Layout Floor Plan for comparing PSR and NR Clocking.
       Two 1nH inductors, needed for PSR and its predriver, can be best
implemented in the top metal layer well within the 100 μm x 100 μm area above the
active area of the flops. The deTSPC flip-flops, grouped into 32x32 registers, are
distributed across 100 μm x 100 μm in Figure 7.2(c). An additional 50% area is needed
for NR buffer horns as shown in Figure 7.2(c). The Non Resonant Driver NRD




than NR as the 1024 deTSPC flops alone take 10,000 μm² area and 50% more is
needed for NR buffer horns.
The complete leaf cell test bench of 1024 flip-flops clocked by PSR through 
an H-tree clocking network can be extracted. The extracted parasitics from layout 
affecting the performance are used in SPICE simulations. 
7.2 GSR Implementation  
GSR can also be implemented in similar fashion to PSR. The flip flop area 
will be similar to NR. The predriver of GSR takes roughly twice the size of PSR. 
When driving entire clock tree loads (>100pF), the matched capacitors in GSR 
predriver of Figure 4.1 can take excessive area. Making the inductors LD and LPW 10 
times or more can scale the capacitance area down by 10×. Inductors’ extra metal area 
is not usually considered as they can be stacked on top of the active area of the 
predriver. 
The GSR predriver takes an equivalent of 16 INVs compared to only 6 for
PSR. However, NR driver does need predrivers (nearly 5 INVs) to reduce delays in 
driving the large gate capacitance of clock drivers leading to tapered buffers [20].  In 
an NR H-tree clock distribution, the extra capacitance driven can be 50% of CL for 
optimal delays, leading to 50% more power [20]. CPR buffer sizes are small 
compared to other schemes. For comparison, NR needs 8 INVs to drive a load of 1pF 
with optimal delays; CPR takes less than 4 INVs; PSR takes 5 INVs and GSR 15. 
The inductor value for a given resonance frequency and capacitance is given
by 1/(4π²fRES²CL). For nominal load capacitance values of 1pF, an LP of more than 25nH
is needed for CPR at a clock speed of 1GHz. In GSR/PSR, giving some margin for pull
up/down time, the resonance width (TR=1/fR) is usually set at about 1/5th of nominal
TCLK, resulting in a 5× larger value for resonance frequency than the clock [24]. The
series inductor value is then smaller, given by LS=LP(fCLK/fR)². For the 1pF load at
1GHz clock rate, TR can be set to 0.2ns using a 1nH inductor resulting in a 5GHz fR.
Both PSR and GSR need less metal area for inductors in the driver compared to CPR. 
Inductor metal area for PSR and GSR can be on top of the driver active area and not 
encroach on other active areas as shown in Figure 7.2. The inductor metal usage can 
sometimes affect critical performance due to routing blockages in the clock tree 
synthesis. PSR can also use bond wire inductors or off-chip inductors, especially for 
low frequency operation [24]. In GSR implementation, distributed coils have the 
transistor Mr distributed at multiple locations as well, as shown in Figure 7.3. 
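The difference in inductor size between CPR and the series-resonant schemes can be reproduced from the two relations in this section, L = 1/(4π²fRES²CL) and LS = LP(fCLK/fR)². The sketch below uses the 1 pF load, 1 GHz clock and 5 GHz resonance quoted in the text; the function names are illustrative.

```python
import math

def resonant_inductance(f_res, c_l):
    """Inductance that resonates with c_l at f_res: L = 1 / (4*pi^2*f^2*C)."""
    return 1.0 / (4 * math.pi**2 * f_res**2 * c_l)

def series_inductance(l_p, f_clk, f_r):
    """Series inductor scaled from the parallel one: L_S = L_P * (f_clk/f_r)^2."""
    return l_p * (f_clk / f_r) ** 2

C_L = 1e-12      # 1 pF load
F_CLK = 1e9      # 1 GHz clock, also the CPR resonance frequency
F_R = 5e9        # series resonance at five times the clock

l_p = resonant_inductance(F_CLK, C_L)
l_s = series_inductance(l_p, F_CLK, F_R)
print(f"L_P = {l_p * 1e9:.1f} nH, L_S = {l_s * 1e9:.2f} nH")
```

The parallel inductor comes out slightly above 25 nH and the series inductor close to 1 nH, a 25× reduction that explains the smaller inductor metal area for PSR and GSR.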
 
Figure 7.3 GSR distributed at far-end for highest Q and minimum power. 
The extra capacitance of distributing VSR  is already factored in CM2 tuning 
capacitance shown. This can be as high as the load CL when distributed to far-end, 
next to the load. For multiple parallel lines, the widths can be decreased to lower the 




decreased to handle larger capacitance values. Optimal inductor location whether it is 
at the driver end, or far end, or at middle, involves trade-offs in power and skew. It is 
to be noted that the split-NR driver topology, commonly used, also needs two clock 
distribution lines one for NMOS and the other for PMOS.  
7.3 Inductors 
The inductor layout can be quite involved to achieve best possible quality 
factors as reported in [11]. As many as a 1000 inductors can be used in a processor 
design. In CPR their sizes can be prohibitive enough to block normal place and route. 







8 PERFORMANCE POWER AREA (PPA) TRADE OFF ANALYSIS 
In order to facilitate Performance Power and Area (PPA) analysis, all the 
information from previous chapters has been summarized in Table 1. It shows a 
transistor level implementation of GSR topology and reconfigurations for NR, CPR or 
PSR operation. When all devices are used, GSR operation is enabled with correct 
control signals as in Figure 3.4. Mode may be selected for the best performance or 
power at the frequency of operation. As an example, for low frequency wafer testing 
NR may be used. For DVFS, GSR or PSR may be used. For maximum clock speed 
and savings at a single frequency CPR may be optimal.  
CPR gives the lowest power if the resonance frequency can be set at or below 
the operating frequency. Parallel LC resonant circuit operation can operate in 
sinusoidal mode with reduced buffer sizes. This is because the on-resistance Ru or Rd 
being higher does not adversely affect power consumption, as long as the oscillations 
are underdamped. If the device sizes are made smaller for CPR, the on-resistances 
will be higher, but Rp determining the delay in (5.8) is not directly affected. This 
reduces the pre-driver overhead to drive load capacitance CL and lowers the total 
power further. Since only losses need to be overcome at resonance, after the initial 
start-up, additional power savings can be realized by reducing the strength of the 
clock buffers driving the LC load [3], [12], [14], [20]. More than 40% of power 
saving is predicted with optimal synthesis algorithms [3], [18]. In practice continuous 
resonant solutions with L always connected in parallel to C are shown to save 25% 






Table 1    PERFORMANCE POWER AREA TRADEOFFS

Application
  NR:  Low frequency testing; smallest delays
  CPR: Fixed fCLK for global CDN; lowest power at high fCLK
  PSR: Pulse mode DDR latches; lowest power DVFS
  GSR: General purpose low power
Tank Q
  CPR: QCPR = Rp/√(Lp/CL); QCPR ≈ QL = RP/(2πfLP) = 2πfLS/rS
  PSR: QPSR = √(LS/CL)/RT; QPSR = 2πfLS/(Rr+RW+rS) < QL
  GSR: QGSR = 2πfLS/(Rr+rS) < QL
Driver Area
  NR:  Proportional to CL and routing lengths
  CPR: < 0.25 NR active area; large inductor metal area
  PSR: Active area ≈ NR; inductor metal area < CPR
  GSR: Active area ≈ 1.25 NR
Predriver capacitance & inductance
  NR:  ≤ 0.5CL & (n/a)
  CPR: < 0.05CL & (n/a)
  PSR: CL & LS, or 0.1×CL & 10×LS
  GSR: 2CL & 2LS, or 0.2×CL & 20×LS
Predriver Power (PP) for n stages
  NR:  ≤ 0.5CLVDD²fCLK, n ≥ 3 & for min. delay
  CPR: < 0.05CLVDD²fCLK
  PSR: ≈ 0.1CLVDD²fCLK, shared over >4 drivers
  GSR: ≈ 0.2CLVDD²fCLK, shared over >4 drivers
(PD + PP) Total Power for Q > π
  NR:  < 1.5CLVDD²fCLK
Driver propagation delay
  NR:  0.69RNRCL, with RNR = (Ru+Rd)/2+Rw and RNR < RT < Rp
  CPR: πRpCL, with Rp > RT > RNR
  PSR: πRTPSRCL/4, with RTPSR = (Rr+RW+rS) and RNR < RTPSR < Rp
  GSR: πRTGSRCL/4, with RTGSR = (Rr+rS) and RNR < RTGSR < RTPSR < Rp
Predriver Delay
  NR:  n×0.69RNRCL
  CPR: (n-1)×0.69RNRCL
  PSR: TR + 0.69RNRCL
  GSR: TR + 3×0.69RNRCL
Rise/fall times
  NR:  2.2×(Ru/d+Rw)CL
  CPR: 0.29TCLK @ fCLK = fRES; (Ru+Rw)CL << TR < TCLK
  PSR: 0.29TR; (Ru+Rw)CL << TR < TCLK
  GSR: 0.29TR; (Ru+Rw)CL << TR < TCLK
Skew
  PSR, GSR: 𝑡skw ≤ (2.5ε/π)TR for Q ≥ π
Jitter
  PSR, GSR: ≤ (εβ/2π)TCLK for Q ≥ π
8.1 Tradeoffs between NR, CPR, PSR and GSR 
As shown in Table 1, following are the pros and cons in choosing a scheme for 
a given application [23]. 
8.1.1 Power and Dynamic Voltage Scaling 
Energy in all driver cases goes as the square of the supply voltage, as given by
CLVDD². Plotted on a logarithmic scale, this would present a straight line for all drivers
with the same slopes but different offsets, as shown in Figure 8.1. At higher voltages
though, the on resistance of switches is smaller, leading to larger tank Q and more 
energy savings for resonant schemes. This can be seen at higher voltages where the 
curves are below the linear extrapolation. 
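The straight-line behaviour on a logarithmic scale can be illustrated with a short sketch. It models the energy per cycle as k·CL·VDD², where k is a hypothetical per-topology scaling factor (the values 1.5 and 0.6 below are placeholders, not results from this work), and shows that the log-log slope is 2 regardless of k.

```python
import math

def energy_per_cycle(k, c_l, vdd):
    """Energy per cycle modeled as k * C_L * VDD^2 (k is a hypothetical factor)."""
    return k * c_l * vdd**2

def loglog_slope(k, c_l, v1, v2):
    """Slope of log10(energy) versus log10(VDD) between two supply voltages."""
    e1 = energy_per_cycle(k, c_l, v1)
    e2 = energy_per_cycle(k, c_l, v2)
    return (math.log10(e2) - math.log10(e1)) / (math.log10(v2) - math.log10(v1))

C_L = 20e-12                                      # hypothetical load
slope_a = loglog_slope(1.5, C_L, 0.5, 1.0)        # placeholder factor for one driver
slope_b = loglog_slope(0.6, C_L, 0.5, 1.0)        # placeholder factor for another
offset = math.log10(1.5) - math.log10(0.6)        # vertical separation of the lines
print(f"slopes: {slope_a:.2f}, {slope_b:.2f}; offset: {offset:.2f} decades")
```

In this idealized model the lines stay parallel; the departure from the straight line at higher voltages described above comes from Q improving with supply voltage, which a fixed k does not capture.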
 
Figure 8.1  H-tree Energy per cycle with voltage scaling at 500MHz. 
While NR needs no inductors, the resonance schemes need a characterized
inductor L that sets fRES = 1/(2π√(LCL)). For CPR, fRES = fCLK, so different inductor Lp







[Figure 8.1 plot: energy per cycle (nJ) for GSR and NR versus VDD (volts)]


















frequency range of power savings is only an octave or so. This is a severe limitation 
in DVFS systems that aggressively scale down frequencies and supply voltages to the 
minimum needed at run-time. With large variations in load capacitances over PVT 
corners, even the best choice of Lp may not be optimal in actual operation without 
run-time tuning. Power savings in CPR over NR are not uniform, but frequency 
dependent, as shown in Table 1. For GSR and PSR, the resonance time TRES need only 
be less than TCLK. This inequality requirement enables the DVFS support by PSR and 
GSR. It also has the benefit of providing an extra degree of freedom for handling 
variations in CL and LS. The component QL (for frequencies before the onset of skin-
effect [8]) is higher for PSR/GSR, than CPR, since resonance frequencies are higher.  
8.1.2 Delays 
 NR gives the shortest propagation delay. The propagation delay of the CPR
driver is much larger than that of NR. This adversely affects skew and jitter due to the
larger absolute variations and supply sensitivities. However, the insertion delay for CPR
can be comparable since the predriver requirements are much smaller. This can lead to
lower jitter for CPR than other schemes. PSR and GSR resonate at much higher
frequencies, at the edges of the clock rather than over the whole period like CPR, giving
a smaller propagation delay than CPR. The change of delay from supply variations is
important for several reasons.
During run time, the designer would like the circuit to operate at the lowest supply
voltage that meets the performance criterion, so that power is minimized. Knowing
which topology gives the lowest power for a given delay requirement helps to
determine which configuration to choose for GSR. Finally, it is to be noted that jitter is
directly determined by sensitivity of the delay to supply variations () and it is 




using parallelism allows for lower supply voltage and the corresponding power
reductions from CLVDD²fRES.
 
Figure 8.2 Delay variations with supply voltage. 
 
8.1.3 Rise/Fall Times and Slew Rates 
 In resonant schemes, the rise/fall times depend on the resonance period TRES
(Trise/fall = 0.29TRES). For CPR, this is nearly TCLK, so the rise/fall times are long for
lower frequencies, causing increased timing delays. This further leads to an increase in
power of the receiving gates due to short circuit currents. In contrast, since TR in PSR
and GSR is much smaller than the minimum TCLK, the slew rates are fast, well controlled
and fixed, resulting in low skew values. Again, by observing the change with supply
voltage, an optimum region of operation can be arrived at. Slew rates directly affect
the clock skew and more power is needed to achieve lower skew. While NR rise/fall
times and slew rates depend on supply voltages slightly, the resonant schemes have























8.1.4 Skew and Jitter 
Skew and jitter directly affect the timing budget. Figure 8.3 shows the skew 
variations over supply voltage. Once the minimum skew requirement is determined, 
the right topology and minimum supply voltage can be chosen. 
 
Figure 8.3 Skew variation with supply voltage. 
8.1.5 Area of Driver 
CPR drivers take less than 25% of the active area of an NR driver. PSR driver 
takes around the same active area as NR but needs extra metal area. GSR takes 25% 
more active area than NR and needs more metal area than PSR. 
The inductor value for a given resonance frequency and capacitance is given
by 1/(4π²fRES²CL). In GSR/PSR, giving some margin for pull up/down time, the
resonance width (TR=1/fR) is usually set at about 1/5th of nominal TCLK, resulting in a
5× larger value for resonance frequency than the clock [24]. The series inductor value is
then smaller, given by LS=LP(fCLK/fR)². Both PSR and GSR need less metal area for




















 Inductor metal area for PSR and GSR can be on top of the driver active area 
and not encroach on other active areas. The inductor metal usage can sometimes 
affect critical performance due to routing blockages in the clock tree synthesis. PSR 
can also use bond wire inductors or off-chip inductors, especially for low frequency 
operation [24]. For comparison, NR needs 8 INVs to drive a load of 1pF with optimal 
delays; CPR takes less than 4 INVs; PSR takes 5 INVs and GSR 15. 
8.1.6 Predriver Overhead 
All the driver schemes shown need additional circuitry for input pulse stream 
generation. NR and GSR need non-overlapping pulses. CPR needs a minimum timing 
pulse width for a given driver size for proper operation [34]. Keeping the pulse widths 
minimum will minimize the static leakage in large driver devices. The predriver 
requirements are also important in determining total power and silicon area. When 
driving entire clock tree loads (>100pF), the matched capacitors in the GSR predriver of Figure 4.1 can take
excessive area. Making the inductors LD and LPW 10 times or more can scale the 
capacitance area down by 10×. Inductors’ extra metal area is not considered as they 
can be stacked on top of the active area of the predriver. 
The PSR predriver takes an equivalent of only 6 INVs compared to 16 for 
GSR. However, NR driver does need predrivers (nearly 5 INVs) to reduce delays in 
driving the large gate capacitance of clock drivers leading to tapered buffer sizes. In 
an NR H-tree clock distribution, the extra capacitance driven can be 50% of CL for 
optimal delays, leading to 50% more power [20]. CPR buffer sizes are small 
compared to other schemes. 
8.2 Energy-Delay (E-D)  Tradeoff 
Modern low power designs employ quantitative Pareto analysis to arrive at




graphs into a combined metric of Energy-Delay product (or sometimes called 
speed/power metric) shown in Figure 8.4 allows for a holistic view of topology 
selection.  
 
Figure 8.4 Deriving E-D product curve. 
Figure 8.5 shows the Energy-Delay (E-D) product for NR, CPR and GSR, giving a
figure of merit for comparing the schemes. CPR has the lowest (best) values since
its insertion delays are the lowest, owing to little predriver delay overhead,
although the driver itself is slower than in the other schemes. However, CPR operation
is only valid over a small range of voltages over which frequencies around
the resonance are supported. GSR is a good balance between NR and CPR.
By plotting energy vs. delay as in Figure 8.6, Pareto analysis can be used more
effectively. Area can be factored into the Pareto chart as well to do a

Figure 8.5 E-D Product for NR, CPR and GSR. 
 
[Plot legend: NR Energy (nJ) × Delay (ns); GSR Energy (nJ) × Delay (ns)]

The energy-delay metric is usually improved through technology scaling (‘More
of Moore’). Figure 8.5 shows how it can also be improved through the use of inductors,
which is essentially a ‘More than Moore’ solution.
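The E-D product and the Pareto selection discussed in this section can be sketched as follows. This is a minimal illustration only: the function names and the three sample points (A, B, C) are hypothetical and do not correspond to simulated values for NR, CPR or GSR.

```python
def energy_delay_product(energy, delay):
    """Combined figure of merit: lower is better."""
    return energy * delay


def pareto_front(points):
    """Return the names of the design points that are not dominated.

    points maps a name to an (energy, delay) tuple. A point is dominated
    when another point is no worse in both energy and delay and strictly
    better in at least one of them.
    """
    front = []
    for name, (e, d) in points.items():
        dominated = any(
            e2 <= e and d2 <= d and (e2 < e or d2 < d)
            for other, (e2, d2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (energy, delay) pairs in arbitrary units, for illustration only.
candidates = {"A": (1.0, 1.0), "B": (0.5, 1.5), "C": (1.2, 1.2)}
for name, (e, d) in candidates.items():
    print(name, "E-D product =", energy_delay_product(e, d))
print("Pareto front:", pareto_front(candidates))
```

A single E-D number ranks the options, whereas the Pareto front keeps every option that is not strictly beaten, which is useful when area is added as a third axis.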
8.3 PPA Optimization 
In what is commonly termed as PPA optimization, power, performance and 
area estimates, as shown in Table 1, are considered simultaneously. An optimal 
configuration (indicated by bold text) may be selected for the best performance or 
lowest power, at the frequency of operation. As an example, for low frequency 
operation, NR may be used since dynamic power is small and acceptable. PSR needs 
minimum driver and buffer sizes and is ideal for single frequency operation like in 
global clock distribution. For DVFS in regional clocks, PSR or GSR provides power 
savings at all clock rates. For DDR operation, PSR is the best, operating on both the 
edges of the clock. PSR has Q degradation compared to GSR. GSR, like NR, can
drive standard gates without needing special buffers or latches and is thus preferred over
PSR for current automatic synthesis tools. GSR may also be used in the data path
with dynamic logic for power savings, as shown in Chapter 6.
8.4 Applications 
Power consumed in post processing resonant clock waveforms may need to be 
considered for the given application. Due to the sinusoidal nature of the distributed 
clock signal, special flip-flops [21], [31], [32] are often needed to capture data 
correctly for CPR. PSR, on the other hand, may give additional savings in flip-flops
with its pulsed outputs, as described in the next section. The pulsed output of PSR can
drive simpler latches, instead of full master-slave flip-flops, saving more power and




CPR driver clock distribution is employed at the global clock level, as it takes the
least power near the resonant frequency and the least active area. Power is thus reduced
with respect to NR, while the delay performance worsens and the area decreases, showing a
different tradeoff. PSR is well suited for double data rate (DDR) operation. PSR can
operate with a lower VLB of VDD/4 for low Q values. RT needs to be kept low for the series
RLC to keep the inductor size small. This requires large switches to keep the Rr
component small. Even for a low Q of 2, more than 60% of NR power can be saved using
PSR. Accordingly, the tradeoff obtained for PSR is better performance than CPR, but
with more power and active area.
Modern mobile and high performance designs use an increasing number of
voltage domains with regional clock trees and grids [18]. Thus, it is beneficial to
improve and extend the globally-resonant clock drivers to locally-square
non-resonant drivers in the CDN [14].
Resonant solutions, with characteristic sine wave signals, were initially 
applied to lower speed systems. Special flip-flops for ultra-low energy applications 
were designed to work with these low amplitude signals from global clock grids [21]. 
These custom cells need to be incorporated into standard cell libraries for synthesis. 
The power savings are further improved by dual edge-triggered (DET) operation 
wherein the clock speed itself can be halved and a lower supply voltage used. The 





9 SYSTEM LEVEL EXPERIMENTAL RESULTS 
From a top-down perspective, power needs to be saved while meeting the
timing requirements at system level for synchronous operation with a common clock.
A poor clock distribution network can result in:
• Limited speed due to setup timing violations
• Functional failures due to hold timing violations
• Large power consumption due to excessive loads
The objective of the CDN, shown in the top-down view of Figure 9.1, is to distribute a
clock signal to the sequential storage elements in such a manner that, for every pair of
flip-flops (i, j) through which there is a timing path, both the setup and the hold
constraints are satisfied for timing closure, as in equations (9.1) and (9.2).
  




$$D_{i,j} + t_{dCQ} \le T_{CLK} - t_{skw_{i,j}} - 2t_{jit} - t_{setup} \qquad (9.1)$$

$$d_{i,j} + t_{cCQ} \ge t_{skw_{i,j}} + 2t_{jit} + t_{hold} \qquad (9.2)$$
 
 In these equations, 𝑑𝑖,𝑗 (𝐷𝑖,𝑗) is the minimum (maximum) data path delay 
between the sequential elements i and j, 𝑇𝐶𝐿𝐾  is the clock period, 𝑡𝑠𝑒𝑡𝑢𝑝 (𝑡ℎ𝑜𝑙𝑑) is the 
setup (hold) time, 𝑡𝑑𝐶𝑄 (𝑡𝑐𝐶𝑄) is the clock to output delay (contamination delay) of a 
sequential element, 𝑡𝑠𝑘𝑤 is the skew and 𝑡𝑗𝑖𝑡 is the jitter. The local skew is 𝑡𝑠𝑘𝑤𝑖,𝑗= ti − 
tj from sequential element i to j where ti and tj are the delay of the clock signal to the 
sequential element clock pin, which is also called a clock sink. The maximum, 
minimum or average delay from the clock source to all sinks is also referred to as  the 
insertion delay of the clock tree. Jitter is the maximum variation in clock arrival time 
at a sink [18]. 
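The setup and hold constraints (9.1) and (9.2) can be checked per flip-flop pair as in the sketch below. The class and function names are hypothetical, and the numbers in the usage example are illustrative assumptions (only the 48ps clock-to-output delay is taken from this chapter); all times are in picoseconds.

```python
from dataclasses import dataclass


@dataclass
class PathTiming:
    """Timing numbers for one flip-flop pair (i, j), all in picoseconds."""
    d_min: float     # minimum data path delay d_ij
    d_max: float     # maximum data path delay D_ij
    t_dcq: float     # clock-to-output delay
    t_ccq: float     # clock-to-output contamination delay
    t_skew: float    # local skew t_skw(i,j)
    t_jitter: float  # jitter t_jit
    t_setup: float   # setup time
    t_hold: float    # hold time


def setup_slack(p, t_clk):
    """Right side minus left side of (9.1); a value >= 0 means setup is met."""
    return (t_clk - p.t_skew - 2 * p.t_jitter - p.t_setup) - (p.d_max + p.t_dcq)


def hold_slack(p):
    """Left side minus right side of (9.2); a value >= 0 means hold is met."""
    return (p.d_min + p.t_ccq) - (p.t_skew + 2 * p.t_jitter + p.t_hold)


def timing_closed(paths, t_clk):
    """Timing closure requires both constraints on every timing path."""
    return all(setup_slack(p, t_clk) >= 0 and hold_slack(p) >= 0 for p in paths)


# Illustrative path at a 2GHz clock (T_CLK = 500ps); values are assumptions.
example = PathTiming(d_min=60, d_max=300, t_dcq=48, t_ccq=40,
                     t_skew=10, t_jitter=5, t_setup=20, t_hold=30)
print("setup slack (ps):", setup_slack(example, 500))
print("hold slack (ps):", hold_slack(example))
print("closed:", timing_closed([example], 500))
```

Note that the setup slack depends on the clock period while the hold slack does not, which is why hold violations cannot be fixed by slowing the clock.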
As an example, Figure 9.2 shows the bottom-up view from flip-flops FF1 (i) 
and FF2 (j) with a common path delay from buffer predriver C1 and non-identical 
drivers C2 and C3 creating a skew. The data path delay 𝑑𝑖,𝑗 (𝐷𝑖,𝑗) comes from D1 and 
D2. The  𝑡𝑠𝑘𝑤 𝑖,𝑗 will also include interconnect mismatch effects from n1 and n2 
wires. The data path wires n3 and n4 contribute to 𝑑𝑖,𝑗 (𝐷𝑖,𝑗) as well. 
 
 




Figure 9.3 shows a standard benchmark recommended by IBM in ISPD2010 
to evaluate skew in clock synthesis [32]. The target is a balanced H-tree but actual 
implementation mismatches the nominal length of 1.25mm by as much as 32%. 
 
 
Figure 9.3 IBM ISPD2010 skew generation benchmark. 
9.1 System Timing Closure  
Combining Figure 9.2 and H-tree from Figure 9.3, one can visualize the C3 
and C2 skew coming from two different branches of the tree driven by different 
buffers and  interconnect that are of the same type but suffer from systematic and 
random on-chip variations (OCV). 
Static timing analysis (STA) evaluates the timing slack/margin of nodes and
edges based on the difference between actual arrival times and required times. STA
computes an upper bound on the delay of all paths from the primary inputs to the
primary outputs, irrespective of the input signal combination. STA is a highly
efficient method to characterize the timing performance of digital circuits, to
determine the critical path, and to obtain accurate delay information. In the example of
Figure 9.2, STA predicts the earliest time when FF2 can be clocked, while ensuring that




In Figure 9.2, for example, there are two choices to improve performance: 
speed up the clock to FF1 or slow down the clock to FF2. Without considering 
process variations, there are many options that have the same effect. For example, 
wire n2 can be made wider so that it presents more loading to gate C3; gate C3 can be 
made smaller so that it has larger delay; or wire n1 can be made narrower to increase
its resistance. These options only slow down the branch to FF2; similar
options exist to speed up the branch to FF1. Deterministic STA optimization would use
some combination of these moves to quickly converge to a solution.
Considering the process variations, however, some of these options are less 
attractive than others due to the correlation between the data-path delay from FF1 to 
FF2 and the clock tree skew between the clock nodes of FF1 and FF2. To make the 
design more robust, it is best that these two delays be correlated. If they are 
correlated, a process parameter will affect both the data-path delay and clock skew 
equally and, in turn, not impact performance. For example, if the data-path is gate 
delay dominated, one may wish to add extra delay in the clock tree by sizing C3. If, 
however, the data-path is metal interconnect dominated, one may wish to add delay in 
the clock tree by sizing a metal wire to improve the correlation. 
If C1 is powerful enough to drive FF1 and FF2 directly, only the interconnect
mismatch would matter. If the end points of C2 and C3 were shorted (as in a grid), the
mismatch would again be minimized. Both of these are utilized in resonant clocking to
control skew. A negative setup time in the FFs gives extra margin in the amount
of harmful skew that can be tolerated.
 Figure 9.4 shows the generalized model for statistical calculations. Typical 





Figure 9.4 Generalized Statistical Timing Slack Calculations. 
The setup check simplifies to
$$\text{setup:}\quad t_{\text{GDmax}} + t_{setup} + t_{\text{skw-max}} \le T_{CLK} \qquad (9.3)$$

where tGDmax is the maximum possible delay of the path GD, tsetup the setup
time of the receiving flop, 𝑇𝐶𝐿𝐾 the desired cycle time, and 𝑡𝑠𝑘𝑤−𝑚𝑎𝑥 is the estimated
variation in skew for the slow process corner. A negative setup time in the flip-flop or
latch element makes it easier to meet setup constraints at the highest clock speeds.
Similarly, the hold check is given by:
$$\text{hold:}\quad t_{\text{GDmin}} \ge t_{hold} + t_{\text{skw-min}} \qquad (9.4)$$

where tGDmin is the minimum possible delay of the path GD, thold is the hold 
time of the receiving flop, and 𝑡𝑠𝑘𝑤−𝑚𝑖𝑛 is the skew for the fast process corner. Most 
timing analysis flows account for process variations in calculating these skews by 
applying a process variation penalty in addition to the nominal clock skew. This 
penalty can be derived from approximate first-order formulas based on the clock path 





One can actually compute the margin for the check according to statistical 
theory [46]. The basic failure mechanism for a setup check is that the time it takes for a 
signal to reach the receiving flip-flop via path CGD is greater than that of the sampling 
CS augmented by a cycle delay. This difference called the margin is a quantity that 
should be analyzed statistically. The reason for statistical analysis of margin is that 
differences are treated differently for statistical quantities than for deterministic 
quantities. The deterministic margin at the receiving flop is given by:  
$$\text{margin} = t_{CS} + T_{CLK} - t_{GD} \qquad (9.5)$$

Equation (9.5) is also valid for computing the mean of the margin. The
variance of the margin according to statistical theory is given by:
$$\sigma^2_{\text{margin}} = \sigma^2_{\text{delay},CS} + \sigma^2_{\text{delay},CGD} - 2\,\mathrm{cov}(t_{CS}, t_{CGD}) \qquad (9.6)$$

where cov(tCS,tCGD) represents the covariance due to process variations in the
respective path delays. It can be seen from above that the variation of data path delay 
adds to the overall variation which is in contrast to the subtraction of mean delay of 
path CGD. Moreover, a component of the statistical variation represented by the 
common variations of both paths, the covariance term in (9.6), can be used to improve 
the margin.  
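The effect of the covariance term in (9.6) can be illustrated numerically. The sketch below assumes the covariance is expressed through a correlation coefficient rho between the clock and data path delays (cov = rho × σCS × σCGD); the function name and all numbers in the usage lines are hypothetical.

```python
import math


def margin_stats(mean_cs, mean_data, t_clk, sigma_cs, sigma_cgd, rho):
    """Mean and standard deviation of the setup margin.

    mean follows (9.5): clock path delay plus one cycle minus data path delay.
    variance follows (9.6), with cov = rho * sigma_cs * sigma_cgd (assumption).
    """
    mean = mean_cs + t_clk - mean_data
    cov = rho * sigma_cs * sigma_cgd
    variance = sigma_cs ** 2 + sigma_cgd ** 2 - 2.0 * cov
    # Guard against tiny negative values from floating point rounding.
    return mean, math.sqrt(max(variance, 0.0))


# Illustrative values in picoseconds: uncorrelated vs. partially correlated paths.
for rho in (0.0, 0.5, 1.0):
    mean, sigma = margin_stats(100.0, 400.0, 500.0, 10.0, 10.0, rho)
    print(f"rho={rho}: mean margin={mean} ps, sigma={sigma:.2f} ps")
```

With equal path sigmas, the spread of the margin shrinks as the correlation rises and vanishes for fully correlated paths, which is the margin recovery the text describes.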
This recovery of margin because of the correlation in the systematic component 
of clock and data path delay variation allows for a less pessimistic (and more accurate) 
estimate of setup and hold margins thereby expanding the design window. For resonant 
clocking no active buffers are necessary so that delay matching to the data delay can be 
added to increase the covariance term and eliminate excessive guard-bands in the 




9.2 PSR vs. NR sub-system performance 
PSR naturally creates controlled, sharp falling edges. This can be seen
from the PSR clock sampling in Figure 9.5. PSR can drive epTSPC latches, meeting the
requirements of robustness and controlled steep slew rates. At system level, the
predriver that generates pulses can be shared among multiple PSR drivers if the TR
requirements are homogeneous among the drivers. Figure 9.5 shows the results for NR
clocking with optimally sized tapered buffers driving the inputs of the 1024 flip-flops.
Skew can be reduced as needed with wider interconnect lines, but at the expense of
more power. The combined clocking and flip-flop operation is compared to
demonstrate the equivalent throughput of the PSR and NR schemes for the same latency and
skew. For PSR sized to drive the 1024 epTSPCs and interconnect with less than 10ps
skew, a power saving of 68% is seen when compared to NR with the same latency. This
agrees with the theoretical calculations. For a 𝑡𝑑𝐶𝑄 of 48ps, epTSPC takes only 5.9fJ
of energy per cycle driving a 5fF load at 1V supply, whereas the deTSPC needs 7.8fJ.
This is a saving of > 26% in FFs while the overall saving is 45% for 1024 flops.
           
Figure 9.5  PSR  vs. NR with same 𝑡𝑑𝐶𝑄 
 
[Waveform labels: NR CLK to Q Outputs; DATAin; PSR CLK to]

Figure 9.6 Power Savings over DVFS range. 
The DVFS operation of PSR is verified over a decade of frequencies in the 
system as shown in the transient simulation of Figure 9.6. For DVFS operation, the 
clock frequency is scaled down to 200MHz supporting 400Mbps peak data rate at 
0.5V. It is also scaled up to 2 GHz with 4Gbps at 1.3V. Figure 9.6 shows the 
functionality over the entire DVFS range and instantaneous NR power compared to 
PR power. Note that the horizontal/vertical scales are zoomed in for clarity for 
different signals with scaled voltages and frequencies. The PR dynamic power can be 
seen to be less than half of the NR power over the DVFS range.  
The PSR-epTSPC and NR-deTSPC transistors are sized and designed using
PTM 45nm devices. IBM test benches from the ISPD2010 clock synthesis benchmark are
used, which include interconnect parasitics [32]. A fan out of four (FO4) loading (5fF) is
used and the supply voltage is varied from 1.3V to 0.5V. Extensive SPICE simulations
with PTM 45nm devices verify operation of the PSR-epTSPC for power
savings and skew control. The complete leaf cell implementation in 45nm of the 1024
flops clocked by PR through an H-tree network was used for post-layout simulations.
Figure 9.7 shows the worst case of combined simulations of pulse generator and
latches. The top of Figure 9.7 shows the early clock and late data (150ps skew) stress test








C. Comparing the data capture operation at 
both the rising and falling edges, NR with DET FF fails to capture data in some 
corners when there is no set-up time before clock edge. PR with epTSPC captures the 
data correctly in all cases, even with negative setup time. This can be used as an 
advantage for clock de-skewing purposes. This reduces the width of interconnect lines 
needed to meet a given skew specification resulting in lower load capacitance and 
power. The hold time for epTSPC is well defined by the width of the resonance pulse 
and the clock to Q propagation (𝑡𝑑𝐶𝑄) is 4 inverter delays. This allows for predictable 
operation and timing closures.  
 
 
Figure 9.7  PVT and MC skew simulations comparing PSR and NR H-Trees. 
Power and energy curves are derived as shown in Figure 9.8. The top curve
shows the percentage power savings of the PSR driver (PRD) over NR for clocking. The
energy-delay product on the right vertical axis shows 300fJ.ps at 1V and 1GHz compares





Figure 9.8  Power Savings and Energy. 
Figure 9.9 compares the data capture edges, with the clock leading data at both
the rising and falling edges, for repeated Monte Carlo runs. NR with deTSPC fails to
capture data with no setup time. PR with epTSPC captures the data correctly even with
negative setup time. This can be used advantageously for clock de-skewing purposes.
The hold time for epTSPC is well defined by the width of the resonance pulse, and the
clock-to-Q propagation is 4 inverter delays. Thus, the clock-to-Q propagation can be
kept larger than the hold time to minimize hold time violations for timing closure.
 





A robust wide-frequency clock driver based on the pulsed resonance driver (PRD)
topology, which consumes 60% less power than a conventional driver horn, is
demonstrated for local buffering. The PRD can work with standard latches (epTSPC)
in DET applications, taking 40% less area and power for 1024 flops compared to
current schemes. The negative setup time of epTSPCs gives extra margin for skew
management.
PRD itself can drive lower-skew, wider interconnect lines with less power.
Small inductor values sufficient for pulsed resonance make this solution an attractive
option for multiple-voltage and multiple-frequency domain regional clocks. As with
CPR, issues can be progressively resolved in silicon for PSR. Though silicon
measurements are not available at this time, the simulation results match well with the
theory developed and corroborate previous silicon results re-simulated
under the same benchmark test benches.
9.3 GSR vs. NR sub system Performance 
Dynamic power evaluation on a 45nm IBM-compatible process from the ISPD2010
benchmarks is chosen as a test case. A CDN, scaled for 45nm, is simulated for
more than a frequency decade below the maximum operating frequency (Fmax) of
4GHz. Power savings over a 10× frequency range of the GSR, configured as a wide
frequency resonant driver, are compared to those of an NR driver in Figure 9.10:
(a) shows 2GHz GSR operation with power savings over NR, while (b) shows 200MHz
GSR operation with power savings over NR.
For a direct comparison, the NR and GSR are sized to drive a 1pF load.
Though power is needed for the pre-drivers of both GSR and NR, they in turn




energy per cycle of PGSR (<1.4mW) in a fixed interval for GSR is less than that of PNR 
(>2.5mW) of NR. This can be seen from comparing the total area under the PNRD and 
smaller PGSR curves in the bottom row of Figure 9.10. GSR does need current from 
VLB bias supply, but puts it back during discharge cycle, as seen in the negative 
excursions. GSR saves power for both the frequencies of 2GHz in Figure 9.10 (a) and 
200MHz in Figure 9.10 (b).   
  
(a) 2GHz operation    (b) 200MHz operation                       
Figure 9.10  Power Savings over 10× clocking frequency range in 45nm. 
The functionality and robustness of the new GSR driver and pre-driver 
circuitry is also verified by 22nm SPICE simulations across 30% variation in LC 
component values and transistor model parameters. The input drive of the resonant 
schemes can take power when large loads are being driven. The skew requirement 
between clock sinks often sets the drive strengths needed. Figure 9.11 shows the 
launched waveforms and the skew in arrival at the flip-flop clocking nodes for the 





Figure 9.11 Variations in the delay contributing to clock skew.  
Skew is minimized for NR, GSR and CPR with wide interconnects. A nominal
skew of less than +20ps is targeted for all, to compare the power required in 22nm. The
skew from unequal loads is made smaller for NR, CPR and PSR by proper
sizing and wire widths.
9.4 GSR, PSR, CPR and NR Comparative Analysis 
In order to verify the tradeoff presented, the various clock drivers are tested
under identical IC implementation parasitics from a symmetric H-tree benchmark [23],
[32]. The resonance inductance values are derived from a standard metal spiral
inductor of 0.5nH with rS < 10Ω and QL > 3 at 5GHz [8], [11]. The clock tree global
interconnect is distributed on a metal layer with wires that typically have 0.1Ω/μm
resistance and 0.2fF/μm capacitance. Clock distribution is done using 6 segments of
1.25mm each with 8 wires in parallel to reduce the nominal interconnect resistance to
less than 2Ω. A ±30% random variation in length is considered for determining the
clock skew. By keeping the effective series resistance RT < 0.2Ω, a tank Q > 1 is obtained,
which is sufficient for successful GSR operation. The effect of finite component QC




For 1V nominal operation, driving a distributed load totaling 160pF, Figure
9.12 compares NR, CPR and GSR power consumption calculated across frequencies
using SPICE simulations. GSR has LS = 6pH and rS < 0.1Ω at 5GHz, and CPR has LP =
160pH and rS < 0.3Ω at fRES = 1GHz for VDD = 1V. Dotted lines show theoretical
calculations. CPR is optimal at its resonance frequency fRES and is not operated below
0.8fRES. Inductor sizes are constant for CPR and GSR during the frequency sweep.
The predriver power is included in Figure 9.12 in order to see a direct comparison
between driver solution use-cases. Multiple unit inductors of 0.5nH are distributed in
parallel along the tree to get the low 6pH value required to resonate at 5GHz. In
Figure 9.12, the GSR trend follows (3.10), and NR and CPR track the theoretical
equations for PD from Table 1. NR takes the highest power (PD), GSR less, and CPR
takes the least.
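The inductor values quoted above can be cross-checked with the ideal LC resonance formula f = 1/(2π√(LC)). The sketch below assumes a lumped 160pF load, ideal inductors and no mutual coupling between the parallel unit inductors; the function and constant names are hypothetical.

```python
import math


def resonance_frequency(l_henry, c_farad):
    """Ideal LC resonance frequency in Hz: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


C_LOAD = 160e-12   # distributed load lumped into one capacitor (assumption)
L_CPR = 160e-12    # parallel inductor LP quoted for CPR
L_GSR = 6e-12      # series inductor LS quoted for GSR
L_UNIT = 0.5e-9    # unit spiral inductor

f_cpr = resonance_frequency(L_CPR, C_LOAD)
f_gsr = resonance_frequency(L_GSR, C_LOAD)

# LS = LP * (fCLK / fR)^2 with a 1GHz clock and a 5GHz resonance.
l_series = L_CPR * (1e9 / 5e9) ** 2

# Identical ideal inductors in parallel divide the inductance by their count.
n_units = L_UNIT / L_GSR

print(f"CPR resonance: {f_cpr / 1e9:.2f} GHz")
print(f"GSR resonance: {f_gsr / 1e9:.2f} GHz")
print(f"LS from scaling rule: {l_series * 1e12:.1f} pH")
print(f"Unit inductors in parallel: about {round(n_units)}")
```

Under these assumptions the CPR tank resonates near 1GHz and the GSR tank near 5GHz, and the scaling rule gives about 6.4pH, consistent with the 6pH value used.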
 
Figure 9.12 Power consumption versus frequency for NR, GSR and CPR. 
The global interconnect lines reduce the output swing at higher frequencies 




calculated. NR predrivers can improve the attenuated swing and minimize delays 
using tapered buffers, but at the expense of 50% more power. 
 Table 1 shows a GSR predriver power overhead (PP) of about $0.2\,C_L V_{DD}^2 f_{CLK}$.
The GSR driver takes about 50% of the NR driver power of $C_L V_{DD}^2 f_{CLK}$. At 2GHz, as
seen in Figure 9.12, total GSR simulated power (PD+PP) is about 57% of NR power,
compared to 47% from Table 1 calculations. While the lumped model analysis is only
accurate to 20%, it shows the comparative benefits of one topology over another.
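The 47% figure can be reproduced with a short calculation, under the assumption that the NR total includes the 50% predriver overhead mentioned for tapered buffers, so that NR takes 1.5 and GSR takes 0.5 + 0.2 in units of CL·VDD²·fCLK. This reading of Table 1 is an assumption, and the function names are hypothetical.

```python
def dynamic_power(c_load, v_dd, f_clk):
    """Conventional switching power CL * VDD^2 * fCLK in watts."""
    return c_load * v_dd ** 2 * f_clk


def gsr_to_nr_ratio(gsr_driver=0.5, gsr_predriver=0.2,
                    nr_driver=1.0, nr_predriver=0.5):
    """Total GSR power over total NR power, both normalized to CL*VDD^2*fCLK."""
    return (gsr_driver + gsr_predriver) / (nr_driver + nr_predriver)


base = dynamic_power(160e-12, 1.0, 2e9)  # 160pF load, 1V, 2GHz
ratio = gsr_to_nr_ratio()
print(f"CL*VDD^2*fCLK = {base:.2f} W")
print(f"GSR/NR from lumped model = {ratio:.0%}")
```

The gap between this lumped estimate and the simulated 57% is within the 20% accuracy stated for the lumped model.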
 The actual power values from simulations also differ due to voltage
dependent non-linear capacitances not accounted for in the theory. Short circuit 
currents in the NR predriver tapered buffers also cause deviation from the theory. It 
can be seen from Figure 9.12 that, as the propagation delays and rise/fall times get 
larger across topologies, less power is consumed by GSR and CPR, compared to NR, 
at higher frequencies. This is similar to the principle of adiabatic reversible logic, 
where slower transition times can give power savings [9]. 
Receiving local buffers will have varying logic thresholds that will cause 
appreciable skew for large slew rates. These thresholds will also vary due to dynamic 
supply variations causing jitter. For minimum skew, it is preferred to drive NR 
without distributed predrivers. Similarly, GSR and CPR with all inductors at source 
give minimum skew. However, due to Q degradation, this will consume more power 
than inductors distributed at sink points.  
Figure 9.13 shows skews extracted from simulations over the DVFS frequency 
range for 160pF H-tree for topologies at 1V operation. Skew is the highest for CPR 





Figure 9.13 Simulated skews of H-tree across operating frequencies. 
This is the true clock performance for a given power that needs to be 
considered. The GSR can give the lowest skew all the way to 2GHz, using the well-
controlled falling edge as the trigger. CPR shows the highest skew and, like NR, 
cannot achieve functional swing at 2GHz.  
With wider interconnects, target skew and functionality can be met in CPR, 
and NR as well, but at the expense of significant increase in the load capacitance and 
power [3], [18]. This again illustrates the fundamental trade-off between energy and 
delay, as one has to be increased to decrease the other. GSR gives low power 
performance below the resonance frequency fR. However, with run-time 
reconfiguration to CPR, using the same inductor, its operation can be extended to fR.  
Figure 1.2 is the basis for a high performance CDN mesh/grid with DVFS
operation from 2GHz @ 1V to 500MHz @ 0.5V. It saves more than 25% dynamic
power on a 45nm process from the ISPD2010 benchmarks. GSR based solutions have
run-time digital tuning capability for power and skew optimizations by varying resonance




metal area [23]. The inductors are placed in the bottom rail of resonant drivers. A fairly 
large clock mesh capacitance of 1nF is targeted. Figure 9.14 shows the power savings 
for both 1V and 0.5V operation for GSR implementation across a wide frequency 
range, shown in log scale. 
Figure 9.14 also compares simulated power savings of GSR with various 
conventional continuous resonant driver (CPR) solutions. Re-simulations of 
previously reported CPR solutions for global clocks in 90nm [14] and 32nm [11] are 
done under identical test conditions. The peak frequencies of CPR can be larger than 
fR of GSR even for a slower process like the 90nm shown. The 32nm CPR curve 
shows narrow band of operation but good power savings at the resonant frequency, as 
verified by silicon measurements [11]. 
 
Figure 9.14  GSR Power Savings compared to NR.  
As seen, GSR has an order of magnitude frequency range advantage over 
CPRs in maintaining power savings. The design has been verified over 90nm, 45nm 




Table 2 summarizes the advantages and constraints involved in various driver 
choices and system level trade-offs. These have been simulated and validated in this 
chapter. As shown in Table 2, each scheme has its own unique advantages depending 
on performance needs and power. But all the schemes can be dynamically 
reconfigured from GSR. Only NR and GSR can drive standard cells with their 
outputs. 
[Table 2: recovered cell text]
Extra Local Buffers or Sense Amp Flip-Flops [31]
Lower power with TSPC latches
for less delay but more skew
DVFS Yes No Yes Yes
Auto Place & Route
Larger power than NR at low frequencies.
Large power for low skews.
Large inductor sizes.
Timing Closure issues.
Pulsed output not 50% duty cycle.
to drive low power latches
Rail to rail output.
Lower skew for single un-buffered driver.

This generalized series resonance (GSR) technique achieves 50% less power
dissipation than NR drivers, while reducing the skew by 50% for meeting timing
requirements. This series resonance scheme supports DVFS operation and has




10 DESIGN METHODOLOGY AND FLOW 
The standard design flow shown in Figure 10.1 needs to be enhanced to
include resonant clocking with the best choice of configuration, inductors, driver
sizes and placement. As a baseline, an NR solution is computed first, as supported by
most clock tree synthesis (CTS) tools.
 




In a typical design flow, each design stage specifies certain characteristics that 
have to be implemented at the next level. Timing closure needs to be obtained in the 
final Physical Level stage.  
Power consumption can be reduced by the designer at every stage by trading 
off area and/or performance (PPA). This requires that the power be estimated 
accurately at each stage. The equations derived in this thesis enable that. The accuracy 
of estimation needs to increase as the design progresses down the stages. 
Resonant topologies will involve gate level, transistor level and physical level 
design stages. The split driver topology will need routing of symmetrical lines, in 
parallel, to the sink points of local buffers. This is used as the baseline solution to fall 
back on if the resonant schemes do not give appreciable power savings for the given 
skew and area limitations.  
The algorithm for CPR inductor design and placement is in Appendix D:
Design Synthesis. If DVFS of 5× or more is desired, PSR is the ideal solution along
with custom latches, especially if DDR is used. The overall algorithm for PSR is
similar and is shown below:
Algorithm P: Overall PSR Synthesis Methodology
Input: Near-zero skew routed tree with LCB at root & Grid nodes from C1,
fCLK & DVFS, Lmin-max, MA-max (inductor metal area);
Skew 𝑡skw constraint
Output: Inductor sizes and buffer locations
1. taperWires()
2. while |Vswing| < V(minSwing) do
3.     Vbest ← 0
4.     sizeLCTanks()
5.     sizeDriver()
6.     Run SPICE
7.     if |Vswing| > Vbest then
8.         Vbest ← |Vswing|
9.     end if
10. end while
11. sizeLCTanks()
12. for each node n do
13.     Place tank at n
14.     Run SPICE
15.     if min V(sinks) > V(minSwing) then
16.         maxSwingNode = n
17.         minSwing = min V(sinks)
18.     end if
19.     Remove tank from n
20. end for
21. Place tank at maxSwingNode
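The tank placement loop that ends Algorithm P (try the tank at each node, keep the node giving the largest minimum sink swing) can be sketched as below. This is not the thesis implementation: the function name is hypothetical, and the SPICE run is replaced by a caller-supplied callable that returns the minimum sink swing for a given tank position.

```python
def place_tank(candidate_nodes, simulate_min_sink_swing, min_swing):
    """Pick the node whose tank placement maximizes the minimum sink swing.

    candidate_nodes: iterable of node identifiers to try.
    simulate_min_sink_swing: callable(node) -> minimum swing over all sinks
        with the tank placed at that node (stands in for a SPICE run).
    min_swing: required minimum swing; also the starting threshold.

    Returns (best_node, best_swing). best_node is None when no placement
    exceeds the required swing.
    """
    best_node = None
    best_swing = min_swing
    for node in candidate_nodes:
        swing = simulate_min_sink_swing(node)  # tank is only placed virtually
        if swing > best_swing:
            best_node, best_swing = node, swing
    return best_node, best_swing


# Toy stand-in for simulation results, in volts; values are illustrative only.
toy_results = {"n1": 0.60, "n2": 0.90, "n3": 0.80}
node, swing = place_tank(toy_results, toy_results.__getitem__, min_swing=0.70)
print("place tank at", node, "with minimum sink swing", swing)
```

Passing the simulator as a callable keeps the search logic testable without a circuit simulator; in a real flow each call would be the costly step, so the number of candidate nodes matters.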
 
If custom latches are not feasible and DDR is not employed, GSR can be
chosen, and its algorithm follows the PSR algorithm. These are shown in the flow chart
of Figure 10.2 as integrated into the main IC design flow. This can be incorporated
into Automatic Place and Route (APR) software as a low power design flow.
The following appendices contain more information of the flow and design 
synthesis. 
Appendix C: Spread Sheet for Design  shows a spread sheet that determines 
the basic feasibility for the given specifications.  
[Flow chart: recovered box text]
Load CL, fCLK max-min, DVFS=yes/no, VDD max-min, Skew 𝑡skw, Jitter 𝑡jit−pp, DDR?, Interconnect Length Rw, Metal/Active Area-max
NR Baseline Solution: GRID Generation
PNR, n (horn length), Marea, Aarea # Sinks, # INVs, Placement 𝑡skw−max
DVFS?
CPR LC Tank placement/sizing, Buffer Resizing, Grid Buffer Reduction, Resonant GRID Generation
Yes
PCPR, Marea, Aarea # Sinks, # INVs, Placement 𝑡skw−max
PGSR < PNR, Areas < Max
PGSR, Marea, Aarea # Sinks, # INVs, Placement 𝑡skw−max
GSR LC Tank placement/sizing, Buffer Resizing, Resonant GRID
NR Baseline Solution
Timing Closure
PSR LC Tank placement/sizing, Buffer Resizing, Resonant GRID Generation, Inductor Placement






11 CONCLUSIONS  
As stated in the motivation section 1.1 of this dissertation, resonant solutions
that inherently work over the entire DVFS range have been demonstrated in the form of
PSR and GSR. The timing performance of PSR in terms of setup time, skew
and jitter is superior to other solutions. PSR also saves area using the TSPC
designs shown. GSR improves skew and jitter, but at the expense of area. GSR
can be used with standard library cells and reconfigured dynamically to other resonant
and non-resonant schemes.
11.1 Summary 
In summary, the GSR can be considered the equivalent of a general-purpose
operational amplifier for clock distribution applications. The GSR driver gives
rail-to-rail outputs that can directly interface to standard cell library flip-flops and
logic, and also allows clock gating. It has digitally controlled pulse width tuning for
inductor variations, fast slew rates and the lowest skew for a given power consumption.
GSR can be reconfigured to give other schemes like CPR, PSR and NR. The only downside,
if any, is the increase in area for GSR and the metal inductors used. In this era of ‘dark
silicon’ this is an acceptable compromise. In fact, increased area can reduce power
density.
All the important circuitry for realization of the drivers was described to 
enable the drivers’ deployment.  Design equations for delay and power based on 
theoretical analysis have been derived and listed in Table 1. These are verified to be 
accurate with simulations on 90nm, 45nm and 22nm process nodes. All the sources of 
power consumption and delays in implementing resonant and non-resonant schemes 
are accounted for and compared. The performance, power and area (PPA) tradeoff for 




solution for the given application. To the author’s knowledge such a comprehensive 
comparative analysis has not been attempted so far. 
Additional receiver circuitry is needed by the resonant clock waveforms in 
CPR and PSR. CPR, for example, needs specialized drivers or flip-flops that can 
handle non-square clock waveforms. The pre-driver of the series resonant schemes 
can take more power when the driver drives large loads. PSR actually takes 
less power than NR across the DVFS range, both for resonant clocking and for flip-flops. 
The skew reductions are achieved without needing to increase the interconnect widths, 
thanks to the negative set-up times. 
Validation of PSR and GSR is also shown on a 45nm process, with layout plans to 
illustrate the scalability of the design. A comprehensive top-down solution for 
applying resonance in clock and data timing is discussed. As the resonant inductor is 
used only during the rise and fall times, smaller inductor values are sufficient and 
a decade of operating frequency range is possible. This allows for seamless DVFS 
operation that runs at lower voltages and frequencies to dynamically scale power 
consumption in high-performance processors. The smaller inductor values of series 
resonance schemes make them an attractive option for multi-voltage and multi-frequency 
local clocking solutions. With sufficient unused area on the top metal layers, the 
inductors can be realized with little active area penalty. 
A resonant dynamic logic (RDL) circuit that uses the GSR principle is also shown. Other 
dynamic logic circuits can also be combined with GSR for power reductions at the 
functional level. This topology can also be used to drive the large capacitances of 
the word-lines and bit-lines of memory arrays. Inductors can also be shared 




This work does not necessitate the use of high-Q custom inductors that need 
more active area or specialty processes. With reasonable tank Q values (>3), 
practically realizable on-chip, GSR solutions presented here can recycle more than 
50% of driver energy over the entire DVFS range, reducing clocking power at system 
level by 40% on average. This LC resonant clock driver is shown to save power on a 
22nm process node and has 50% less skew than a non-resonant driver at 2GHz. It can 
operate down to 0.2GHz to support other energy-saving techniques such as DVFS. The 
area penalty on GSR drivers is less than 25%. 
Use of PSR and TSPC latches can further reduce the system power by another 
25%. As an example, GSR can be configured for the simpler pulse series resonance 
(PSR) operation to enable further power saving for double data rate (DDR) 
applications, by using de-skewing latches instead of flip-flop banks. A PSR based 
subsystem for 40% savings in clocking power with 40% driver active area reduction 
was demonstrated. Simulations using 45nm IBM/PTM device and interconnect 
technology models, clocking 1024 flip-flops show the reductions, compared to non-
resonant clocking. DVFS range from 2GHz/1.3V to 200MHz/0.5V is obtained. The 
PSR frequency is set >3× the clock rate, needing only 1/10th the inductance of 
prior-art LC resonance schemes.    
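The 1/10th figure follows from the ideal LC resonance relation f = 1/(2π√(LC)): for a fixed load capacitance, the required inductance falls as 1/f². The Python sketch below works this through under two assumptions that are not stated above: that the prior-art schemes resonate at the clock frequency itself, and that the load is a hypothetical 30 pF. Tripling the resonance frequency gives 1/9 of the inductance, and any ratio above about 3.16× gives less than 1/10.

```python
import math

def resonant_inductance(f_res_hz, c_load_f):
    """Inductance (H) that resonates with c_load_f at f_res_hz for an ideal LC tank."""
    return 1.0 / ((2.0 * math.pi * f_res_hz) ** 2 * c_load_f)

# Illustrative values only (assumptions, not taken from the thesis tables).
F_CLK = 1e9        # clock rate, Hz
C_LOAD = 30e-12    # clock load capacitance, F

l_at_clock = resonant_inductance(F_CLK, C_LOAD)     # tank tuned to the clock rate
l_psr = resonant_inductance(3.0 * F_CLK, C_LOAD)    # tank tuned to 3x the clock rate
ratio = l_psr / l_at_clock                          # depends only on the frequency ratio

print(f"L at f_clk   : {l_at_clock * 1e9:.3f} nH")
print(f"L at 3*f_clk : {l_psr * 1e9:.3f} nH")
print(f"ratio        : {ratio:.4f}")
```

The ratio is independent of the capacitance chosen, so the conclusion does not rely on the 30 pF placeholder.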
11.2 Conclusion 
The stated goal of this thesis was to arrive at energy recovering resonant 
solutions that inherently operate over wide frequencies and give better performance in 
terms of lower skew and jitter for timing closure. The dissertation has shown how to 
achieve that using GSR, with detailed theory and implementation.  
A typical processor benchmark has a 25% allocation for clocking and 20% for 




40% of power amounting to 18% of system power. This amounts to an 18% decrease in the 
temperature rise above ambient. Failures are accelerated by temperature, and 
this can amount to a 10% decrease in failure rate. It also allows a choice of more 
economical packaging and 10% lower cooling costs for the end customer. For the IC 
vendor, yield is improved by the decrease in area as well as by the improved margins in 
timing performance. A 40% decrease in clocking and flip-flop area gives effective 
die size savings of more than 10%. Die cost decreases in proportion to the 4th power of die 
area, giving a cost saving of 35% [27]. Adding the increased performance margin in 
timing can take this to 40% savings in die costs, including testing, when compared to 
NR-based DDR designs. Cost savings are thus realized along the whole chain, from the IC 
manufacturer to the end equipment user. 
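The percentages quoted above can be checked with a few lines of arithmetic. The Python sketch below assumes that the 25% and 20% allocations add up to a 45% share of system power to which the 40% saving applies, and that the die-area saving is exactly 10%. The variable names are illustrative only.

```python
# Assumed reading of the figures in the text: two power allocations that
# both benefit from the same fractional saving.
clock_share = 0.25       # fraction of system power spent on clocking
second_share = 0.20      # second allocation quoted in the text
saving_fraction = 0.40   # saving applied to both allocations

system_saving = saving_fraction * (clock_share + second_share)

# Die cost taken as proportional to the 4th power of die area [27].
area_saving = 0.10       # assumed: exactly 10% smaller die
cost_ratio = (1.0 - area_saving) ** 4
cost_saving = 1.0 - cost_ratio

print(f"system power saving: {system_saving:.1%}")
print(f"die cost saving    : {cost_saving:.1%}")
```

With exactly 10% less area the fourth-power rule gives a saving of about 34%, so the 35% quoted corresponds to an area saving slightly above 10%.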
A standard DSM CMOS implementation of GSR, a reconfigurable on-chip LC 
resonant clock distribution solution, was shown. This generalized series resonance 
(GSR) technique can achieve 50% driver power savings compared to non-resonant 
drivers, while reducing the skew by 50% (to below 10ps) to make it easier to achieve 
timing closure. Taking processor designs as a benchmark, 25% of power and 25% of area 
can be assumed to be consumed by the CDN in an NR design. A 25% reduction in clock 
power can result in more than 6% savings in overall power. In the worst case 
there can be a 5% increase in die costs, which can be compensated by the yield gain from 
timing margins. A decrease in hot spots can increase reliability and give a more than 5% 
increase in the lifetime of the ICs.  
Thus, recycling energy in this fashion reduces the hotspot occurrences that 
were discussed in the motivation section 1.1. All these can lead to much lower cooling 
costs for workstation and server farms increasing their reliability and leading to more 




The key performance index energy-delay product, which is usually lowered 
with ‘More of Moore’ technology scaling, is shown to be improved through a ‘More 
than Moore’ solution using inductors. 
The power reduction solutions presented in this thesis do entail an 
enhancement in the design flow and development of CAD software for automatic 
inductor synthesis. These are one-time costs that are far less than the typical 
development costs of current DSM SoCs and processors. 
11.3 Future Work 
Using the equations derived, further work is now possible to automatically 
synthesize GSR and PSR solutions with power and timing optimization, and 
to develop automatic place and route (APR) solutions for 
series resonance schemes, thus allowing their mainstream deployment. Various GSR 
configurations can be fabricated on a test chip to verify the theoretical predictions. 
Once these unit cells are characterized and incorporated into the standard cell library 
database, mainstream applications can be addressed. 
Future work will address optimal layout implementation of GSR with multiple 
inductors and distributed parasitics for power and delay optimization in asymmetric 
trees. An actual clock tree from a low-power processor such as an ARM core can be taken and 
converted into a resonance-based driver and distribution scheme. Various resonance 
schemes can be applied at multiple levels of the clock distribution hierarchy. Data 
paths can be converted to a dynamic logic scheme to save overall power. Most of the 
inductors in the data path can be shared between various lines.  
Statistical static timing analysis can be better applied to PSR and a far better 




opens the possibility of using injection locking techniques to improve the jitter in 
clocks [48]. 
This work further advances the cause of using energy saving resonance in 
future SoCs and processors by providing new topologies and a comprehensive trade-







[1] P. Restle, D. Shan, D. Hogenmiller, Y. Kim, A. Drake, J. Hibbeler, T. Bucelot, G. 
Still, K. Jenkins and J. Friedrich, “Wide-Frequency-Range Resonant Clock with 
On-the-Fly Mode Changing for the POWER8™ Microprocessor,” in IEEE International 
Solid-State Circuits Conference, 2014, pp. 100-101. 
[2]  T. N. Theis and P. M. Solomon, “In quest of the Next Switch: Prospects for 
greatly reduced power dissipation in a successor to the silicon field-effect transistor,” 
Proc. IEEE, vol. 98, no. 12, pp. 2005–2014, Dec. 2010. 
[3]  X. Hu and M. Guthaus, “Distributed LC Resonant Clock Grid Synthesis,” IEEE 
Transactions on Circuits And Systems—I: Regular Papers, Vol. 59, No. 11, pp. 2749- 
2760, November 2012. 
[4]  L. Chang, D. Frank, R. K. Montoye, S. J. Koester, B. L. Ji, P. W. Coteus, R. H. 
Dennard, and W. Haensch, “Practical strategies for power-efficient computing 
technologies,” Proc. IEEE, vol. 98, no. 2, pp. 215–236, Feb. 2010. 
[5]  Shien-Yang Wu, C.Y. Lin, S.H. Yang, J.J. Liaw and J.Y. Cheng, “Advancing 
foundry technology with scaling and innovations,” in Proc. International Symposium 
on VLSI-TSA, 2014, pp. 1-3. 
[6] H. A.  Moore, M. Dietrich, A. Herkersdorf, F. Miller, T. Wild, K. Hahn, A. 
Grunewald, R. Bruck, S. Krohnert and J. Reisinger, “System integration — The 
bridge between More than Moore and More,” in Proc. DATE, 2014, pp.  1-9. 
[7]  R. K. Jana, G. L. Snider and D. Jena, “Energy-Efficient Clocking Based on 
Resonant Switching for Low-Power Computation,” IEEE Transactions On Circuits 
And Systems—I: Regular Papers, Vol. 61, No. 5, pp. 1400-1408, May 2014. 
[8] A. Zolfaghari, A. Chan, and B. Razavi, “Stacked inductors and transformers in 
CMOS technology,” IEEE J. Solid-State Circuits, vol. 36, no. 4, pp. 620–628, Apr. 
2001. 
[9] K. Suhwan, C. Ziesler, M. Papaefthymiou, “Charge-recovery computing on 
silicon,” IEEE Transactions on Computers, vol. 54 , issue 6, pp. 651- 659, June 
2005.  
[10] Yibin Ye and K. Roy “Energy recovery circuits using reversible and partially 
reversible logic,” IEEE Transactions on Circuits and Systems I: Fundamental Theory 
and Applications, Volume: 43, Issue: 9, pp. 769-778, 1996. 
[11] V. S. Sathe, V. Arekapudi, C. Ouyang, M. Papaefthymiou, A. Ishii and S. 
Naffziger, “Resonant-Clock Design for a Power-Efficient, High-Volume x86-64 




[12] S. C. Chan, K. L. Shepard, and P. J. Restle, “Uniform-phase, uniform 
amplitude, resonant-load global clock distributions,” IEEE J. Solid-State Circuits, vol. 
40, no. 1, pp. 102–109, Jan. 2005. 
[13] J. Rosenfeld and E. Friedman, “Design methodology for global resonant H-
tree clock distribution networks,” IEEE Transactions on Very Large Scale Integration 
(VLSI) systems, vol. 15, no. 2, pp. 135–148, February 2007. 
[14] S. C. Chan, P. J. Restle, T. J. Bucelot, J.S. Liberty, S. Weitzel, J. M. Keaty, B. 
Flachs, R. Volant, P. Kapusta, and J. S. Zimmerman, “A Resonant Global Clock 
Distribution for the Cell Broadband Engine Processor,” IEEE Journal Of Solid-State 
Circuits, Vol. 44, No. 1,  pp. 64-72, January 2009. 
[15] T. Lee, “passive RLC networks,” in The design of CMOS RF integrated 
circuits, New York: Springer. 
[16] Yuhui Chen, F. Lee, L. Amoroso and H. Wu, “A  resonant MOSFET gate 
driver with efficient energy recovery,” IEEE Transactions On Power Electronics, 
Vol. 19, No. 2, pp. 470- 477, March 2004. 
[17] N. Kurd, M. Chowdhury, E. Burton, T.P. Thomas, C. Mozak, B. Boswell, M. 
Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T.M. Wilson, M. Merten, S. 
Chennupaty, W. Gomes, R. Kumar, “Haswell: A family of IA 22nm processors,” in 
IEEE International Solid-State Circuits Conference Digest of Technical Papers, 2014, 
pp. 112-113. 
[18] M. Guthaus, G. W. Silke and R. Reis, “Revisiting Automated Physical 
Synthesis of High-Performance Clock Networks,” ACM Transactions on Design 
Automation of Electronic Systems, Vol. 18, No. 2, Article 31, pp. 31:1-31:2, March 
2013. 
[19] A. Ishii, J. Kao, V. Sathe, and M. Papaefthymiou, “A resonant-clock 200MHz 
ARM926EJ-S™ microcontroller,” in Proc. IEEE Eur. Solid-State Circuits Conf. , 
Sep. 2009, pp. 356–359. 
[20] Alan J. Drake, Kevin J. Nowka, Tuyet Y. Nguyen, Jeffrey L. Burns, and 
Richard B. Brown, “Resonant Clocking Using Distributed Parasitic Capacitance,” 
IEEE Journal Of Solid-State Circuits, Vol. 39, No. 9, pp. 1520-1528, Sept 2004. 
[21] H Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, “Ultra Low-Power 
Clocking Scheme Using Energy Recovery and Clock Gating,” IEEE Transactions On 
Very Large Scale Integration Systems, Vol. 17, No. 1, pp. 33-44, January 2009.  
[22] V. Sathe, “Hybrid resonant-clocked digital design,” Ph.D. dissertation, Dept. 
Electr. Eng. Comput. Sci., Univ. of Michigan, Ann Arbor, MI, May 2007. 
[23] I. Bezzam and S. Krishnan, “A Pulsed Resonance Clocking for Energy 
Recovery,” in Proc. IEEE International Symposium on Circuits and Systems, 




[24] H. Fuketa, M. Nomura, M. Takamiya and T. Sakurai, “Intermittent Resonant 
Clocking Enabling Power Reduction at Any Clock Frequency for Near/Sub-Threshold 
Logic Circuits” IEEE J. Solid-State Circuits, vol. 49, no. 2, pp. 536– 544, Feb. 2014. 
[25] K. Ikeuchi, K. Sakaida, K. Ishida, T. Sakurai and M. Takamiya, “Switched 
Resonant Clocking (SRC) scheme enabling dynamic frequency scaling and low-speed 
test,” Custom Integrated Circuits Conference, 2009, pp. 33- 36. 
[26] I. Bezzam, C. Mathiazhagan, S. Krishnan and T. Raja, “Low Power Low 
Voltage Wide Frequency Resonant Clock and Data Circuits for SoC Power 
Reductions,” IEEE Latin American Symposium on Circuits and Systems, Peru, 
February 2013.  
[27] J. M. Rabaey, A. Chandrakasan and B. Nikolic, Digital Integrated Circuits: 
A Design Perspective, 2nd Ed. New Jersey: Prentice Hall, pp. 349-361, 2003. 
[28] J. Rabaey, “Optimizing Power @ Design Time- Circuit Level Techniques” in 
Low Power Design Essentials, 1st Ed. New York: Springer, 2009, pp. 86-88. 
[29] G. Wilke, R. Fonseca, C. Mezzomo, and R. Reis, “A novel scheme to reduce 
short-circuit power in mesh-based clock architectures,” in Proc. SBCCI, 2008, pp. 
117–122. 
[30] C. Yoo, “A CMOS buffer without short-circuit power consumption,” IEEE 
Trans. Circuits System II, Analog Digit. Signal Process, vol. 4, no. 9, pp. 935–937, 
Sep. 2000 
[31] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, “Low-Swing Differential 
Conditional Capturing Flip-Flop for LC Resonant Clock Distribution Networks,” 
IEEE Transactions On VLSI Systems, Vol. 20, No. 8, pp.1547-1551, August 2012. 
[32] C. N. Sze, P. Restle, G.-J. Nam, and C. J. Alpert, “Clocking and the ISPD’09 
clock synthesis contest,” in Proc. ISPD, 2009, pp. 149–150 
[33] S. Esmaeili, A. Al-Khalili, and G. Cowan, “Dual-edge triggered sense 
amplifier flip-flop for resonant clock distribution,” IET Computers & Digital 
Techniques, vol. 4, no. 6, pp. 499 - 514, 2010. 
[34] S. E. Esmaeili, A. J. Al-Khalili, and G. E. R. Cowan, “Estimating Required 
Driver Strength in the Resonant Clock Generator,” IEEE Transactions On VLSI 
Systems, Vol. 20, No. 8, pp.927-930, August 2012. 
[35] C. Yue and S. Wong, “On-chip spiral inductors with patterned ground shields 
for Si-based RF ICs,” IEEE J. Solid-State Circuits, vol. 33, no.5, pp. 743–752, May 
1998. 
[36] Arizona State University, Predictive technology models 




[37] W. Zhao, Y. Cao, "New generation of Predictive Technology Model for sub-
45nm early design exploration," IEEE Transactions on Electron Devices, vol. 53, no. 
11, pp. 2816-2823, November 2006. 
[38] B. Razavi, Chapter 2 in Design of Analog CMOS Integrated Circuits, 1st ed. 
New York, NY, USA: McGraw-Hill Higher Education, 2000.  
[39] W. Daly, Digital Systems Engineering, 2nd Ed. New Jersey: Prentice Hall, 
2003, pp. 349-361. 
[40] Domenico Campolo, Metin Sitti and Ronald S. Fearing, “Efficient Charge 
Recovery Method for Driving Piezoelectric Actuators with Quasi-Square Waves,” 
IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, vol. 50, 
no. 1, pp.1-9, January 2003. 
[41] I. Bezzam and S. Krishnan, “Minimizing Power and Skew in VLSI-SoC 
Clocking With Pulsed Resonance Driven De-skewing Latches,” in IEEE  27th 
International Conference on VLSI Design, 2014, pp. 157-161. 
[42] C. Kim and S. Kang (2002). A Low-Swing Clock Double-Edge Triggered 
Flip-Flop. IEEE Journal of Solid-State Circuits, Vol. 37, No. 5, May 2002, pp. 648-
652. 
[43] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, and M. Sachdev (2001). 
Comparative Delay and Energy of Single Edge-Triggered & Dual Edge-Triggered 
Pulsed Flip-Flops for High-Performance Microprocessors.  Proceedings of 2001 
ISLPED, pp. 147-152, August 6-7, 2001, USA. 
[44] Terence M. Potter and James Blomgren (2006). Null value propagation for 
FAST14 logic.  US patent  No. 7,053,664, May 2006. 
[45] I. Bezzam, S. Krishnan and C. Mathiazhagan (2012). Low power SoCs with 
Resonant Dynamic Logic using Inductors for Energy Recovery. VLSI-SoC. 
[46] Chirayu S. Amin, Noel Menezes, Kip Killpack, Florentin Dartu, Umakanta 
Choudhury, Nagib Hakim and Yehea I. Ismail, “Statistical Static Timing Analysis: How 
simple can we get?” DAC 2005, June 13-17, 2005, Anaheim, California, USA. 
[47] Hewlett-Packard, Intel, Microsoft, Phoenix, and Toshiba (2011). Advanced 
Configuration and Power Interface (ACPI) is an open industry specification 5.0:   
http://www.acpi.info 
 
[48] Zheng Xu and Kenneth L. Shepard “Design and Analysis of Actively-
Deskewed Resonant Clock Networks,” IEEE Journal Of Solid-State Circuits, Vol. 44, 










 C Capacitor 
CDN Clock Distribution Network 
CL Load Capacitor 
CMOS Complementary Metal Oxide Semiconductor 
COUT Output Capacitor 
CPR Continuous Parallel Resonance 
D Data input of a flip-flop 
DC Direct Current 
DCR DC resistance of inductor 
DDR Double Data Rate 
DET Dual Edge Triggering 
DVFS Dynamic Voltage Frequency Scaling  
EC Energy stored on capacitor C per cycle 
EMI Electro-Magnetic Interference 
ESR Equivalent Series Resistance of Capacitor  
EVDD Energy drawn from VDD supply per cycle 
fCLK  Clock Frequency 
fR Frequency of damped oscillations 
fRES ideal Frequency of  Resonance 
GSR Generalized Series Resonance  
IC Integrated Circuit 
iL Inductor Current  




IR Intermittent Resonance  
L Inductor 
LC Inductor (L) Capacitor (C) series/parallel combination 
LCB Local Clock Buffers  
MEMS Micro-Electro-Mechanical Systems   
MS Master Slave  
NEMS Nano-Electro-Mechanical Systems 
NMOS N-type Metal Oxide Semiconductor 
NR No Resonance 
Pavg Average Power per cycle 
PCPR CPR Power  
PGSR GSR Power  
PLS_CLK Clock Pulse Stream 
PMOS P-type Metal Oxide Semiconductor 
PNR Non Resonant Power  
PPA Power, Performance and Area  
PPSR PSR Power  
PSR Pulsed Series Resonance 
Q (italicized) Quality factor 
Q Output of flip-flop 
QC Component Quality factor of Capacitor C 
QL Component Quality factor of Inductor L 
Rd pull-Down switch Resistance 
RF Radio Frequency  




Rr Resonance on-off switch Resistance 
Ru pull-Up switch Resistance 
Rw Interconnect Wire Resistance 
SCB Sector Clock Buffers  
SoC System on Chip 
TCLK Clock Period 
TPW Pulse Width Time 
TSPC True Single Phase Clocking  
VC Capacitor Voltage  
VDD Power Supply voltage connected to Drain of PMOS 
Vin Input Voltage 
VLB Inductor Bias Voltage 
VOH logic Output High Voltage 
VOL logic Output Low Voltage 
VOUT Output Voltage 
µ micro meter units 





Appendix A:  MATLAB for solving ODE and Deriving Expressions 
 
A- 1 Power in CPR  
Integrating V^2/R averaged over period T 
>> syms Vdd t Tr Fr x 




 y = (3*pi*C*Fr*Q*Vdd^2)/4  
 Same as hand derivation 
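The closed form can also be cross-checked numerically without the symbolic toolbox. The Python sketch below assumes the integrand used for the off-resonance case in this appendix, (0.5*Vdd + 0.5*Vdd*sin(2*pi*t/Tr))^2, averaged over one period Tr and scaled by 2*pi*Q*Fr*C. It integrates with a midpoint rule; the component values are arbitrary placeholders.

```python
import math

def cpr_power_numeric(vdd, fr, q, c, steps=20000):
    """Midpoint-rule average of v(t)^2 over one period Tr, scaled by 2*pi*Q*Fr*C."""
    tr = 1.0 / fr
    dt = tr / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        v = 0.5 * vdd + 0.5 * vdd * math.sin(2.0 * math.pi * t / tr)
        total += v * v * dt
    mean_square = total / tr
    return mean_square * 2.0 * math.pi * q * fr * c

def cpr_power_closed_form(vdd, fr, q, c):
    """Closed form reported by MATLAB: 3*pi*C*Fr*Q*Vdd^2/4."""
    return 3.0 * math.pi * c * fr * q * vdd ** 2 / 4.0

# Placeholder values, chosen only to exercise the formula.
VDD, FR, Q, C = 1.0, 1e9, 3.14, 1e-12
numeric = cpr_power_numeric(VDD, FR, Q, C)
closed = cpr_power_closed_form(VDD, FR, Q, C)
print(numeric, closed)
```

The mean square of the waveform over a full period is 3*Vdd^2/8, which is where the factor 3/4 in the closed form comes from once the 2*pi*Q*Fr*C scaling is applied.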
 
Off resonance, with Fc = x*Fr 
y=int((.5*Vdd+.5*Vdd*sin(2*pi*t/Tr))^2,0,Tr/x)*pi*2*Q*x*Fc*C/(Tr/x) 
 y =(C*Fc*Q*x*(12*pi*Vdd^2 + 8*Vdd^2*x - 8*Vdd^2*x*cos((2*pi)/x) - 
Vdd^2*x*sin((4*pi)/x)))/16 
 
=C*Fc*Q*Vdd^2 - C*Fc*Q*Vdd^2*cos(pi/x)^2 + (3*pi*C*Fc*Q*Vdd^2)/(4*x) - 
(C*Fc*Q*Vdd^2*cos(pi/x)^3*sin(pi/x))/2 + (C*Fc*Q*Vdd^2*cos(pi/x)*sin(pi/x))/4 
 
>>> Example 
Fc =   1.0000e+09 
Q =    3.1400 
Vdd =     1 
>> eval(z) 
 ans =((3*pi)/10 + x/5 - (x*cos((2*pi)/x))/5 - (x*sin((4*pi)/x))/40)/(1256*x) 
 
 >> expand(z) 
 ans = (cos(pi/x)*sin(pi/x))/12560 - cos(pi/x)^2/3140 + (3*pi)/(12560*x) - 
(cos(pi/x)^3*sin(pi/x))/6280 + 1/3140 
zz=(3*pi)/(12560*x)  + 1/3140 - cos(pi/x)^2/3140 
 
>>> Comparing Results from SPICE 
 
>> hold off 
>> ezplot(zz, [0.5,2]) 
>> hold on 
>> ezplot(n, [0.5,2]) 






Approximation of closed form Power vs. Frequency close to sims 
 
A- 2  PSR Evaluating expressions for  VOL & VOH 
PSR VOL derivation 
>> vv(t)=.5*Vdd+.5*Vdd*exp(-t*pi/(Tr*Q))*cos(2*pi*t/Tr) 
>> eval(vv(Tr/2)) 
 
ans = Vdd/2 - (Vdd*exp(-pi/(2*Q)))/2 
 Same as hand derivation 
 PSR VOH derivation 
 >> eval(vv(Tr)) 
  
 ans = Vdd/2 + (Vdd*exp(-pi/Q))/2 
 
 Same as hand derivation 
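For a feel of the magnitudes, the two expressions can be evaluated directly. The Python sketch below uses Q = 3.14 and Vdd = 1, the values from the example in A-1, purely as an illustration. The function names are hypothetical.

```python
import math

def psr_waveform(t, tr, q, vdd):
    """Damped PSR output: Vdd/2 + Vdd/2 * exp(-pi*t/(Tr*Q)) * cos(2*pi*t/Tr)."""
    decay = math.exp(-math.pi * t / (tr * q))
    return 0.5 * vdd + 0.5 * vdd * decay * math.cos(2.0 * math.pi * t / tr)

def psr_vol(q, vdd):
    """Output low level, reached at t = Tr/2."""
    return 0.5 * vdd * (1.0 - math.exp(-math.pi / (2.0 * q)))

def psr_voh(q, vdd):
    """Output high level, reached at t = Tr."""
    return 0.5 * vdd * (1.0 + math.exp(-math.pi / q))

Q, VDD, TR = 3.14, 1.0, 1e-9   # illustrative values only
vol = psr_vol(Q, VDD)
voh = psr_voh(Q, VDD)
print(f"VOL = {vol:.3f} V, VOH = {voh:.3f} V")
```

Both levels move toward the rails as Q increases, since the exponential terms approach 1.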
 
  

















A- 3 Ordinary Differential Equations (ODE) Solving PSR  
Second-order example with initial conditions 
>> syms u(t) 
>> Du=diff(u); 
>> dsolve(diff(u,2)==u,u(0)==1,Du(0)==0) 
ans =exp(-t)/2 + exp(t)/2 
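The same initial-value problem can be checked numerically. The Python sketch below integrates u'' = u with u(0) = 1 and u'(0) = 0 using a classical fourth-order Runge-Kutta step, and compares the result with exp(-t)/2 + exp(t)/2, that is, cosh(t). The step count and end time are arbitrary choices.

```python
import math

def rk4_second_order(accel, u0, du0, t_end, steps):
    """Integrate u'' = accel(u, du) from t = 0 to t_end with classical RK4."""
    h = t_end / steps
    u, du = u0, du0
    for _ in range(steps):
        k1u, k1v = du, accel(u, du)
        k2u, k2v = du + 0.5 * h * k1v, accel(u + 0.5 * h * k1u, du + 0.5 * h * k1v)
        k3u, k3v = du + 0.5 * h * k2v, accel(u + 0.5 * h * k2u, du + 0.5 * h * k2v)
        k4u, k4v = du + h * k3v, accel(u + h * k3u, du + h * k3v)
        u += h * (k1u + 2.0 * k2u + 2.0 * k3u + k4u) / 6.0
        du += h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
    return u

T_END = 2.0
numeric = rk4_second_order(lambda u, du: u, 1.0, 0.0, T_END, 2000)
exact = math.exp(-T_END) / 2.0 + math.exp(T_END) / 2.0
print(numeric, exact)
```

The same routine accepts a damped right-hand side, so it can be reused to cross-check the PSR equation once numeric values for R, L and C are chosen.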
 
Solving PSR Differential Equation 
>> syms C L R Vdd V 
>> V=dsolve(diff(u,2)==-diff(u)/C*R -u/(L*C),u(0)==0,Du(0)==Vdd/2*R*C ) 
>> simplify(V) 
 ans =-(C^2*L*R*Vdd*exp(-(t*((L^2*R^2 - 4*C*L)^(1/2) + L*R))/(2*C*L)) - 
C^2*L*R*Vdd*exp((t*((L^2*R^2 - 4*C*L)^(1/2) - L*R))/(2*C*L)))/(2*(L^2*R^2 - 
4*C*L)^(1/2)) 
 
>> I=dsolve(diff(u,2)==-diff(u)/tLR -u/(tLC*tLC),u(0)==1) 
 I =C4*exp(-(t*(tLC + ((tLC - 2*tLR)*(tLC + 2*tLR))^(1/2)))/(2*tLC*tLR)) - exp(-
(t*(tLC - ((tLC - 2*tLR)*(tLC + 2*tLR))^(1/2)))/(2*tLC*tLR))*(C4 - 1) 
>> VL(t)=int(I) 
(tLC*exp(- t/(2*tLR) - (t*(tLC^2 - 4*tLR^2)^(1/2))/(2*tLC*tLR))*(2*C4*tLR^2 - 




 ans =(4*tLC*tLR^2 - CL*Vdd*tLR*(tLC^2 - 4*tLR^2)^(1/2) + 
CL*Vdd*tLC*tLR)/(8*tLC*tLR^2 - 2*tLC^3 + 2*tLC^2*(tLC^2 - 4*tLR^2)^(1/2)) 
 
>> Vo(t)=int(0.5*Vdd*((W0/Wd)^2)*exp(-a*t)*sin(Wd*t)) 











 ans = (pi*(C*L)^(1/2))/2  
 





Appendix B: LTSPICE Schematic Diagrams 
 
 
B 1 GSR Scalable Reconfigurable Driver Schematic and Macro Cell Symbol 
 
 






B 3 GSR Scalable Predriver and Symbol 
  




Appendix C: Test Benches for Simulations 
 
C- 1 Test bench of GSR configuration with Predriver and bias voltage for inductor 
 





C- 3  NR configuration with GSR macro cell 
 
C- 4 CPR Configuration with GSR macro cell 
 




Appendix C: Spread Sheet for Design Calculations 
 
Available as Design Aids at:  tinyurl.com/Bezzam  
 
 








Fr                     | 30 | 1   | 4.77 | 3.14 | 0.5   | 1.0  | 1   | 3.872 | 0.992 |
Fres, ESR, rS, Qt, Fr  | 30 | 1   | 0.95 | 3.14 | 1     | 10.1 | 5.0 | 2.870 | 4.956 | 0.2
Ls, rS, Qt, Fr, ESR    | 30 | 20  | 0.24 | 3.14 | 1.27  | 2.5  | 1   | 2.870 | 0.985 |
Fres, ESR, rS, Qt, Fr  | 30 | 20  | 0.24 | 3.14 | 1.25  | 2.5  | 1.0 | 2.870 | 0.991 |
Ls, rS, Qt, Fr, ESR    | 30 | 20  | 0.05 | 3.14 | 0.05  | 0.5  | 5   | 2.870 | 4.924 | 0.2
Ls, rS, Qt, Fr, ESR    | 30 | 160 | 0.03 | 3.14 | 0.16  | 0.3  | 1   | 2.870 | 0.985 |
Ls, rS, Qt, Fr, ESR    | 30 | 160 | 0.01 | 3.14 | 0.006 | 0.1  | 5   | 2.870 | 4.924 | 0.2
 
 
Highlighted items show derived values.  




Appendix D: Design Synthesis Algorithms 
 
Algorithm C1:  Resonant Grid Generation 
Input: Near-zero skew routed tree without buffers 
Output: Routed tree with resonant local sectors 
1: Insert min-size Local Clock Buffer (LCB) at root 
2: Place LC Tank at output of LCB 
3: Size LC tank (C2) 
4: Adjust LC Tank Placement (C3) 
5: while Voltage swing at sinks < 90% do 
6: Increase Buffer size 
7: Size LC tank  
8: Run Spice sims to verify swing 
9: end while 
 
Algorithm C2: inductor placement and sizing algorithm 
Input: Near-zero skew routed tree with LCB at root &  Grid nodes from A-I,  fCLK ; 
Lmin-max,  MA-max (inductor metal  area); Skew 𝑡skw constraint 
Output: Inductor sizes and locations 
Output: Correctly sized LC tank for resonance at the desired frequency 
1: Ltank = 1/2C 
2: Run Spice 
3: while | fdesired − fmin| > 10MHz do 
4: L = L− |fdesired−fmin|/ fmin 
5: Run Spice 
6: end while 
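A runnable sketch of this sizing loop is given below in Python. Several things in it are assumptions rather than the thesis flow: the Spice run is replaced by a stand-in function, `simulated_resonance_hz`, that evaluates an ideal tank with a hypothetical fixed parasitic capacitance; the starting value uses the ideal resonance relation L = 1/((2*pi*f)^2*C); and the update rescales L by (f_measured/f_desired)^2 instead of using the subtraction in step 4.

```python
import math

C_PARASITIC = 2e-12  # hypothetical extra capacitance seen only by the "simulator"

def simulated_resonance_hz(l_tank, c_load):
    """Stand-in for a Spice run: ideal resonance including a parasitic capacitance."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_tank * (c_load + C_PARASITIC)))

def size_tank(f_desired, c_load, tol_hz=10e6, max_iter=50):
    """Return an inductance whose simulated resonance is within tol_hz of f_desired."""
    l_tank = 1.0 / ((2.0 * math.pi * f_desired) ** 2 * c_load)  # ideal first guess
    for _ in range(max_iter):
        f_meas = simulated_resonance_hz(l_tank, c_load)
        if abs(f_desired - f_meas) <= tol_hz:
            return l_tank
        # Resonance scales as 1/sqrt(L), so rescale L by the squared frequency ratio.
        l_tank *= (f_meas / f_desired) ** 2
    raise RuntimeError("tank sizing did not converge")

l_sized = size_tank(1e9, 30e-12)
print(f"{l_sized * 1e9:.3f} nH")
```

The multiplicative update is used here because it keeps L in consistent units and converges quickly when the simulated tank behaves close to an ideal LC. A real flow would call the circuit simulator in place of the stand-in.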
 
Algorithm C3: inductor placement and sizing algorithm 
Input: Properly sized tank, topologically sorted list of nodes in tree 
Output: LC Tank placed at a node that provides good voltage swing 
1: maxSwingNode = Null 
2: minSwing = 0 
3: Remove tank from LCB output. 
4: for n ∈ first 10% of nodes in tree do 
5: Place tank at n. 
6: Run Spice 
7: if min V(sinks) > minSwing then 
8: maxSwingNode = n 
9: minSwing =min V(sinks) 
10: end if 
11: Remove tank from n. 
12: end for 
13: Place tank at maxSwingNode 
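The search in Algorithm C3 can be sketched in Python as follows. The Spice run is abstracted as a callback, `min_sink_swing`, which is a hypothetical function returning the minimum sink voltage swing with the tank placed at a given node. The toy swing table exists only to exercise the loop.

```python
def place_tank(sorted_nodes, min_sink_swing, percent=10):
    """Try the tank at the first `percent` of nodes and keep the best minimum swing."""
    count = max(1, len(sorted_nodes) * percent // 100)
    best_node, best_swing = None, 0.0
    for node in sorted_nodes[:count]:
        swing = min_sink_swing(node)   # stands in for "place tank, run Spice"
        if swing > best_swing:
            best_node, best_swing = node, swing
    return best_node, best_swing

# Toy data: 30 topologically sorted nodes, so the first 10% is nodes n0..n2.
nodes = [f"n{i}" for i in range(30)]
toy_swing = {"n0": 0.81, "n1": 0.93, "n2": 0.88}
node, swing = place_tank(nodes, lambda n: toy_swing.get(n, 0.0))
print(node, swing)
```

Only the candidate with the largest worst-case swing is kept, mirroring steps 7 to 9. The tank is never left in place between trials, because each evaluation is independent.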
  
