Design methodologies for robust low-power digital systems under static and dynamic variations by Chae, Kwanyeob
DESIGN METHODOLOGIES  
FOR ROBUST LOW-POWER DIGITAL SYSTEMS  
























In Partial Fulfillment 
of the Requirements for the Degree 
Doctor of Philosophy in the 












Copyright ©  2013 by Kwanyeob Chae 






















Approved by:

Dr. Saibal Mukhopadhyay, Advisor
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Arijit Raychowdhury
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Sudhakar Yalamanchili
School of Electrical and Computer Engineering
Georgia Institute of Technology

Dr. Hyesoon Kim
School of Computer Science
Georgia Institute of Technology

Dr. Sung Kyu Lim
School of Electrical and Computer Engineering
Georgia Institute of Technology
































“Fear not, for I am with you; be not dismayed, for I am your God; I will strengthen you,  
I will help you, I will uphold you with my righteous right hand.”  Isaiah 41:10 
 
First of all, I would like to thank God for giving me strength and wisdom in my 
life. I would also like to express my sincere gratitude to my research advisor, Dr. Saibal
Mukhopadhyay. His optimistic attitude was a great encouragement to me in completing
my research. As not only my advisor but also my friend, he inspired me to explore
technical challenges and encouraged me to overcome my difficulties.
I would like to thank Dr. Sudhakar Yalamanchili and Dr. Sung Kyu Lim for their
contributions and efforts in serving as members of my thesis reading committee. I am also
grateful to the other committee members, Dr. Arijit Raychowdhury and Dr. Hyesoon Kim,
for their insightful comments and suggestions. I would also like to express my gratitude to
my former research advisor, Dr. Joy Laskar. It was my great pleasure to work with him.
This work could not have been possible without the constant encouragement and 
support of the members of Gigascale Reliable Energy Efficient Nanosystem Laboratory 
at Georgia Tech. I am grateful to them for active discussions and collaborations.  
To my parents, Byungdae Chae and Myungsoon Yoo, my parents-in-law, 
Kyungsik Park and Okhun Park, my sister, Gilsun Chae, and my sister-in-law, Jiyoung 
Park, I show my gratitude for their unconditional love and prayers. Especially, I would 
like to express my deepest love and gratitude to my wife, Jihee Park, and my sons, Juho 
Chae and Juhyung Chae, for all their endless love, encouragement, and support.  
 
TABLE OF CONTENTS 
 
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
SUMMARY
CHAPTER 1 INTRODUCTION AND OBJECTIVES OF RESEARCH
1.1 Problem statement
1.2 Variations
1.3 Safety margin under variations
1.4 Thesis outline
CHAPTER 2 RESEARCH TRENDS
2.1 Introduction
2.2 Post-silicon tuning approach
2.3 Non-design-intrusive dynamic adaptation
2.4 Design-intrusive dynamic adaptation
CHAPTER 3 POST-SILICON TUNING APPROACH
3.1 Introduction
3.2 Variation challenges in 3D ICs
3.3 Tier-adaptive-voltage-scaling for 3D ICs
3.3.1 Effect of die-to-die variation in 3D design
3.3.2 Tier-adaptive-voltage-scaling methodology
3.3.3 Simulation results
3.4 Tier-adaptive-body-biasing for 3D ICs
3.4.1 Analysis of 3D clock network under variations
3.4.2 Tier-adaptive-body-biasing
3.4.3 Simulation results
3.5 Summary
CHAPTER 4 NON-DESIGN-INTRUSIVE APPROACH
4.1 Introduction
4.2 Adaptive clock modulation
4.2.1 Clock modulation methodology
4.2.2 Test chip and measurements
4.3 Adaptive bias-voltage generation
4.3.1 Target voltage generation methodology
4.3.2 Simulation results
4.4 Summary
CHAPTER 5 DESIGN-INTRUSIVE APPROACH
5.1 Introduction
5.2 Time-borrowing and clock-stretching
5.2.1 Methodology for prevention of timing error
5.2.2 Circuit-level implementation
5.2.3 Test chip and measurement results
5.3 Programmable-time-borrowing and delayed-clock-gating
5.3.1 Methodology
5.3.2 Test chip and measurement results
5.4 Case studies for overhead estimation
5.4.1 Case study for TB–CS
5.4.2 Case study for PTB–DCG
5.5 Summary







LIST OF TABLES 
 
Table 3.1: Power analysis.
Table 3.2: Analysis of power/area overhead.
Table 3.3: Parameters used in simulation.
Table 4.1: Comparison of prior works.
Table 5.1: Area and power of the components.
Table 5.2: Normalized performance and power of the chip6 at the minimum operating voltage.
Table 5.3: Total area and measured power of the chip6.
Table 5.4: Summary of implemented 3D graphic processing unit.
Table 5.5: Simulated maximum input clock frequency and PC.




LIST OF FIGURES 
 
Figure 1.1: CMOS technology trend: (a) Device scaling. (b) VTH variation.
Figure 1.2: (a) Frequency trend. (b) Thermal design power trend.
Figure 1.3: Time scale of aging effect, temperature, and voltage variations.
Figure 1.4: Increased safety margin under variations at a 45nm CMOS technology.
Figure 2.1: Speed and impact amount of variations.
Figure 2.2: Local and global variation components.
Figure 2.3: Concept of post-silicon tuning (AVS and ABB).
Figure 2.4: Concept of non-design-intrusive dynamic adaptation.
Figure 2.5: Concept of design-intrusive dynamic adaptation.
Figure 3.1: D2D variation issue in 3D ICs.
Figure 3.2: Path classification for 3D ICs.
Figure 3.3: Process variation impact in 2D paths in 3D ICs.
Figure 3.4: Standard deviation of delay variations in 3D paths considering different path division (a) neglecting TSV variations and assuming standard deviation (σ) for D2D variations in both dies same, (b) considering TSV variations (same D2D variations for two dies), and (c) considering different standard deviations of D2D variations for two dies.
Figure 3.5: The system architecture of tier-adaptive-voltage-scaling.
Figure 3.6: Circuit diagrams of level shifters.
Figure 3.7: Path-delay-based insertion of different types of level shifters.
Figure 3.8: The statistical analysis of the sensor output (left) and voltage assignments (right) for different tiers and different chips.
Figure 3.9: Simulation results with TAVS in different path types.
Figure 3.10: The impact of TAVS on leakage (left) and dynamic (right) power distribution.
Figure 3.11: Three different types of 3D-clock networks: (a) Type1 (1 TSV); (b) Type2 (10 TSVs); (c) Type3 (100 TSVs).
Figure 3.12: Skew histogram: clock skew base line without process variations of (a) the clock network Type1, (b) Type2, and (c) Type3.
Figure 3.13: Correlation coefficient (ρ) between the latencies of the die1 and the die2 for (a) the clock network Type1, (b) Type2, and (c) Type3 not considering process variation.
Figure 3.14: 2D and 3D skew distribution of specific points in the clock network according to different variations.
Figure 3.15: The tier-adaptive-body bias (TABB) system.
Figure 3.16: The modified RO-based (a) nMOS and (b) pMOS variation sensors.
Figure 3.17: The correlation between the normalized nMOS or pMOS delay impacted by D2D variation and the normalized output of (a) the nMOS sensor and (b) the pMOS sensor according to the channel length size and the transistor stack; and (c) the detail correlation analysis for nMOS sensor at points A and B.
Figure 3.18: The body-bias assignments according to the sensor outputs of the pMOS and the nMOS variation sensors considering 15% WID variation and 5% D2D variation with 50mV resolution with (a) FBB/RBB and (b) FBB/ZBB.
Figure 3.19: The histogram of the body bias assignments of the die1 and the die2 considering (a) FBB/RBB and (b) FBB/ZBB with 15% WID and 5% D2D variation.
Figure 3.20: Results of TABB on the clock network Type1 considering D2D and WID variations: (a) mean skew; (b) skew variation; (c) max skew.
Figure 3.21: Results of TABB on the clock network Type2 considering D2D and WID variations: (a) mean skew; (b) skew variation; (c) max skew.
Figure 3.22: Results of TABB on the clock network Type3 considering D2D and WID variations: (a) mean skew; (b) skew variation; (c) max skew.
Figure 3.23: (a) Clock slew rate without TABB; (b) clock slew rate with FBB/RBB; (c) clock slew rate with FBB/ZBB according to VTHN and VTHP skew.
Figure 3.24: Results of TABB (FBB/RBB or FBB/ZBB) of the data paths (two 2D paths and one 3D path) according to D2D and WID variations: (a) mean delay; (b) delay variation.
Figure 3.25: Results of TABB on the clock power considering D2D and WID variations: (a) mean power; (b) power variation.
Figure 3.26: Layout overhead considering adaptive body biasing.
Figure 4.1: Overview of the proposed clock modulation approach.
Figure 4.2: Block diagram of the global modulator.
Figure 4.3: Timing diagram of the global modulator.
Figure 4.4: Block diagram of the local modulator.
Figure 4.5: Timing diagram of the local modulator: (a) DM off and (b) DM on.
Figure 4.6: Clock gating circuits (A1 and A2 block).
Figure 4.7: Block diagram of the implemented system.
Figure 4.8: Measured frequency modulation results of the GM and the LM under DC voltage shift; (a) clock modulation of the GM; (b) the effective clock frequency of the GM at the frequency transition region; (c) the effective clock frequency of the LM at the frequency transition region.
Figure 4.9: Characteristics of the duty modulated output clock of LM with DM(1X) on at low-operating voltage: (a) measured output waveforms of the LM with DM(1X) on near the frequency transition region and (b) The measured maximum/minimum frequency and the effective frequency.
Figure 4.10: Measured waveforms of the modulated clock: (a) without noise; (b) only GM on; (c) only LM (DM off) on; (d) both GM and LM (DM on) on with noise.
Figure 4.11: Measured effective frequency under global and local supply noise: (a) only GM on; (b) only LM on; (c) GM and LM on.
Figure 4.12: The die-photo and characteristics of the chip.
Figure 4.13: Generation of a voltage at the given target frequency.
Figure 4.14: Block diagram of the adaptive voltage generator.
Figure 4.15: Block diagram of the delay comparator.
Figure 4.16: Operational waveform of the adaptive voltage generator; (a) voltage down; (b) voltage up.
Figure 4.17: The delay line architecture and the level shifter.
Figure 4.18: Concept of delay line with reset to prevent harmonics lock; (a) harmonics lock case (b) delay line with reset.
Figure 4.19: Delay compensation with a level shifter; (a) delay versus supply voltage; (b) delay mismatch according to supply voltage; (c) level shifter delay; (d) compensated mismatch.
Figure 4.20: Power stage architecture.
Figure 4.21: Performance variation under process variation.
Figure 4.22: Generation of a voltage at the given target frequency; (a) performance variation under aging; (b) adaptive voltage change; (c) compensated performance.
Figure 4.23: Temperature compensation.
Figure 4.24: DVFS simulation; (a) voltage change according to input frequency change; (b) automatic DVFS.
Figure 5.1: The conceptual operation of the pipeline with (a) the flip-flops, (b) the pulsed latches, and (c) the LTD with clock stretching.
Figure 5.2: (a) The clock stretching concept in time domain and (b) the control flow of the proposed methodology.
Figure 5.3: (a) The schematic of the proposed latch with the time-borrowing detection (LTD) and (b) its timing diagram.
Figure 5.4: The schematic of the time-borrowing detection collector.
Figure 5.5: The block diagram of the ring VCO.
Figure 5.6: The clock pulse generator for (a) the 4-phase clocks and (b) the 8-phase clocks. The operation of the clock pulse generator for (c) the 4-phase clocks and (d) the 8-phase clocks.
Figure 5.7: The block diagrams of the clock shifter for (a) the 4-phase clocks and (b) the 8-phase clocks. The operation of the clock shifter for (c) the 4-phase clocks (d) the 8-phase clocks.
Figure 5.8: The block diagram of the test pipeline.
Figure 5.9: The path delay distribution (FO4 delay=76.23ps).
Figure 5.10: The test environment and the die-photo of the test chip.
Figure 5.11: Measured maximum input clock frequency (No errors, VDD=1.8V).
Figure 5.12: The measured frequency and the power of chip6 (VDD=1.8V, PC=0.1).
Figure 5.13: Measured performance of chip6 according to PC.
Figure 5.14: Measured operating voltage ranges of test chips at 160MHz input clock frequency.
Figure 5.15: Measured system clock waveform for the 8P case.
Figure 5.16: Programmable-time-borrowing and delayed-clock-gating.
Figure 5.17: The overall architecture of a pipeline with the proposed programmable time borrowing with delayed clock gating.
Figure 5.18: The architecture of the test-chip.
Figure 5.19: The die-photo of the test-chip and key design features.
Figure 5.20: Measured operational waveforms of the proposed method.
Figure 5.21: Measured error rate and effective frequency of the test pipelines under DC voltage shift/noise.
Figure 5.22: Measured effective frequency of the test pipeline only with PTB under local noise injection.
Figure 5.23: Noise cancel-out effect of the PTB3.
Figure 5.24: Measured output frequency of the clock modulator with DC voltage variations.
Figure 5.25: Measured output frequency of the clock modulator and PTB3 with DC voltage variations.
Figure 5.26: Measured tolerable noise ranges under DC and AC noise injection.
Figure 5.27: Design flow for inserting pulsed-latches and LTDs.
Figure 5.28: The layout of the implemented graphic processing unit with 8P case with an 180nm CMOS technology.
Figure 5.29: The path delay analysis of the rasterizer with an 180nm CMOS technology: (a) distribution of all path delays; (b) the worst path delay per each flip-flop showing the flip-flops that are selected to be replaced by pulsed latches.
Figure 5.30: The distribution of the worst-case output path delays of all the pulsed latches for the (a) the 4P case and (b) the 8P case showing the ones that will be replaced by LTDs.
Figure 5.31: Distribution of the delay of the input path of one critical flip-flop showing the need for hold fixing in certain paths.
Figure 5.32: The automated layout of a rasterizer unit with programmable time borrowing (considering PTDN1) in 45nm node.







LIST OF ABBREVIATIONS 
 
CMOS  Complementary metal-oxide-semiconductor 
RDF  Random dopant fluctuation 
LER  Line-edge roughness 
OTF  Oxide thickness fluctuation 
VTH  Threshold voltage 
NBTI  Negative bias temperature instability 
nMOS  n-channel metal-oxide-semiconductor 
pMOS  p-channel metal-oxide-semiconductor 
ABB  Adaptive body biasing
AVS  Adaptive voltage scaling
IC  Integrated circuit
PVT  Process-voltage-temperature  
3D IC  Three-dimensional integrated circuit 
D2D  Die-to-die 
WID  Within-die 
TSV  Through-silicon-via
FF  Flip-flop 
MSFF  Master-slave flip-flop 
SAFF  Sense-amplifier-based flip-flop 
LS  Level shifter 
CVSL  Cascode voltage switch logic
RO  Ring oscillator
 
TAVS  Tier-adaptive-voltage-scaling 
TABB  Tier-adaptive-body-biasing 
FBB  Forward body bias 
RBB  Reverse body bias 
ZBB  Zero body bias 
MC  Monte-Carlo 
VCO  Voltage controlled oscillator 
PLL  Phase-locked loop 
DLL  Delay-locked loop 
GM  Global modulator 
LM  Local modulator 
DM  Duty modulation 
LTD  Latch with time-borrowing detection  
CS  Clock shifter  
CPG  Clock pulse generator  
TDC  Time-borrowing detection collector 
SPI  Serial peripheral interface 
PTB  Programmable-time-borrowing 
PLTD  Pulsed latch with time-borrowing detection 
PTDN  Programmable-time-borrowing detection network 
GPIB  General-purpose interface bus 
TB-CS  Time borrowing and clock stretching 







Variability affects the performance and power of a circuit. Along with static 
variations, dynamic variations, which occur during chip operation, necessitate a safety 
margin. The safety margin makes it difficult to meet the target performance within a 
limited power budget. This research explores methodologies to minimize the safety 
margin, thereby improving the energy efficiency of a system. The safety margin can be 
reduced by either minimizing the variation or adapting to the variation. This research 
explores three different methods to compensate for variations efficiently. First, post-
silicon tuning methods for minimizing variations in 3D ICs are presented. Design 
methodologies to apply adaptive voltage scaling and adaptive body biasing to 3D ICs and 
the associated circuit techniques are explored. Second, non-design-intrusive circuit 
techniques are proposed for adaptation to dynamic variations. This work includes 
adaptive clock modulation and bias-voltage generation techniques. Third, design-intrusive
methods to eliminate the safety margin are proposed. The proposed
methodologies can prevent timing errors in advance with a minimal performance
penalty. As a result, the methods presented in this thesis minimize static variations and






CHAPTER 1
INTRODUCTION AND OBJECTIVES OF RESEARCH
 
1.1 Problem statement 
Scaling of complementary metal-oxide-semiconductor (CMOS) devices has
enabled a high level of integration and a fast switching speed in integrated circuits. As the
feature size of a device reaches nanometer nodes, the variability of device parameters
inevitably increases [1]. The variability of CMOS devices is significantly affected by
random dopant fluctuation (RDF), line-edge roughness (LER), and oxide thickness
fluctuation (OTF). RDF, the variation in the number and location of dopants, becomes
dominant as the number of channel dopants decreases in scaled devices and results in
threshold voltage (VTH) variation. LER and OTF, which are caused by rough line edges
and a rough silicon-oxide interface, become pronounced in small devices. These
variations contribute to variability in VTH, as shown in Figure 1.1 [2]. The increasing VTH
variation results in increased variability in circuit performance and leakage.
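
A common first-order way to see why VTH variation grows as devices shrink is Pelgrom's area-scaling relation, sigma(VTH) ≈ A_VT / sqrt(W·L). The sketch below evaluates it for a few device sizes. The matching coefficient A_VT and the device dimensions are illustrative assumptions, not values from this thesis, and A_VT is held fixed across nodes only to isolate the area effect.

```python
import math

def sigma_vth_mv(a_vt_mv_um: float, width_um: float, length_um: float) -> float:
    """Pelgrom-style estimate of the VTH standard deviation in mV."""
    return a_vt_mv_um / math.sqrt(width_um * length_um)

# Hypothetical matching coefficient in mV*um, assumed for illustration only.
A_VT = 3.0

# (label, W in um, L in um): assumed near-minimum-size devices at shrinking nodes.
devices = [
    ("180nm-like", 0.36, 0.18),
    ("90nm-like", 0.18, 0.09),
    ("45nm-like", 0.09, 0.045),
]

for label, w, l in devices:
    print(f"{label}: sigma(VTH) ~ {sigma_vth_mv(A_VT, w, l):.1f} mV")
```

Under these assumptions, halving both W and L doubles the estimated sigma(VTH), which is the qualitative trend the paragraph above describes.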
 








































Figure 1.1: CMOS technology trend: (a) Device scaling. (b) VTH variation.
 
Furthermore, the growing number of integrated devices in a chip and the fast 
switching speed increase the power consumption of a digital system as shown in Figure 
1.2 [2]. The rapidly growing power consumption increases the dynamic variations, such 
as the voltage and the temperature variations. The dynamic variations added on top of the 
static variations, such as the process variations and the aging effects, significantly affect 
the speed of a circuit. As static and dynamic variations increase, it is becoming 
challenging to guarantee reliable operation of a digital system at the target frequency with 





































































Figure 1.2: (a) Frequency trend. (b) Thermal design power trend.
 
1.2 Variations 
The variations in the process parameters do not change after chip fabrication.
In other words, process variation is not a function of time, so it can be
classified as a static variation. On the other hand, some sources of variation
affect circuit performance as a function of time. Aging effects, temperature variations,
and voltage variations are time-dependent variation sources. Although the aging effect is
time-dependent, it can be classified as a near-static variation because its impact on
circuit performance develops very slowly. Temperature and voltage variations, which are
much faster than the aging effect, can be classified as dynamic variations.
Aging effects such as negative bias temperature instability (NBTI), which is caused by
interface traps in the oxide layer of a device, slowly shift the threshold voltage of a pMOS
transistor over a long time period. Aging effects therefore slowly degrade the performance
of a circuit, as shown in Figure 1.3. Temperature variations, on the other hand, affect the
chip on a time scale of milliseconds to seconds. Temperature variations affect not only
VTH but also the mobility of carriers. Increased temperature reduces VTH, which speeds up
a circuit, but it also reduces carrier mobility, which slows it down. Thus, increased
temperature can either increase or decrease a data-path delay depending on which effect
is dominant. The relation between temperature and delay depends on the technology and
the operating voltage. Assuming normal temperature dependence, in which delay
increases with temperature, the behavior of temperature and delay is shown in
Figure 1.3. The relation between the
operating voltage and a data-path delay is quite obvious and faster than other variation 


















Figure 1.3: Time scale of aging effect, temperature, and voltage variations. 
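
The competing effects of temperature on VTH and on carrier mobility can be illustrated with a simple alpha-power-law delay model. This is a minimal sketch under assumed parameters: the VTH temperature coefficient, the mobility exponent, the value of alpha, and the function name `gate_delay` are all hypothetical and do not come from this thesis.

```python
def gate_delay(vdd: float, temp_k: float, vth0: float = 0.35, k_vth: float = 1.0e-3,
               mob_exp: float = 1.5, alpha: float = 1.3, t0: float = 300.0) -> float:
    """Relative gate delay (arbitrary units) from an alpha-power-law drive model."""
    vth = vth0 - k_vth * (temp_k - t0)        # VTH falls as temperature rises
    if vdd <= vth:
        raise ValueError("model only covers operation above threshold")
    mobility = (temp_k / t0) ** (-mob_exp)    # carrier mobility also falls
    drive = mobility * (vdd - vth) ** alpha   # on-current ~ mobility * overdrive^alpha
    return vdd / drive                        # delay ~ C * VDD / I_on, with C = 1

for vdd in (1.0, 0.5):
    ratio = gate_delay(vdd, 400.0) / gate_delay(vdd, 300.0)
    trend = "slower" if ratio > 1.0 else "faster"
    print(f"VDD={vdd:.1f} V: delay at 400 K is {ratio:.2f}x the 300 K delay ({trend} when hot)")
```

With these assumed numbers the mobility loss dominates at the higher supply (normal temperature dependence), while the VTH reduction dominates at the lower supply, where the gate overdrive is small.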
 
1.3 Safety margin under variations 
Under variations such as those in Figure 1.4, the performance variability of a circuit
increases. Traditional design approaches that consider the worst-case corner necessarily
require an excessive safety margin to ensure error-free operation of a circuit. The safety
margin implies an increased operating voltage or a reduced operating frequency of a
circuit. However, the worst-case corners of all variation sources are unlikely to occur at
the same time. Therefore, a safety margin that covers every possible worst-corner
combination can lead to excessive power overhead or performance loss during
normal operation. To preserve the gain achieved by device scaling, minimizing the safety










































1.4 Thesis outline 
The goal of this thesis is to develop robust design methodologies for low-power
digital systems under static and dynamic variations. Increasing static and dynamic
variations lead to excessive safety margins, which increase power overhead or
performance loss. Thus, minimizing safety margins is a key challenge to achieving high
performance under a power constraint and increasing variations. However, simply
minimizing the safety margin without any technique for adapting to variations does not
guarantee error-free operation of digital systems. Safety margins can be minimized by
adaptive circuit techniques that compensate for static and dynamic variations.
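
The cost of margining for every worst corner at once can be illustrated with a small Monte Carlo sketch. It compares the delay obtained by stacking all worst corners with the 99.9th-percentile delay when the variation sources are sampled independently. The variation magnitudes, the distributions, and the independence assumption are all hypothetical choices made for illustration, not data from this thesis.

```python
import random

# Assumed delay penalties per variation source (fractions of nominal delay).
PROCESS_SIGMA = 0.05   # Gaussian process variation, worst corner taken at +3 sigma
VOLTAGE_MAX = 0.10     # worst-case slowdown from supply droop
TEMP_MAX = 0.05        # worst-case slowdown from temperature
AGING_MAX = 0.05       # worst-case end-of-life slowdown

def worst_corner_delay() -> float:
    """Normalized delay when every source sits at its worst corner simultaneously."""
    return (1 + 3 * PROCESS_SIGMA) * (1 + VOLTAGE_MAX) * (1 + TEMP_MAX) * (1 + AGING_MAX)

def sample_delay(rng: random.Random) -> float:
    """One random operating point; the sources are assumed to be independent."""
    process = rng.gauss(0.0, PROCESS_SIGMA)
    voltage = rng.uniform(0.0, VOLTAGE_MAX)
    temperature = rng.uniform(0.0, TEMP_MAX)
    aging = rng.uniform(0.0, AGING_MAX)
    return (1 + process) * (1 + voltage) * (1 + temperature) * (1 + aging)

def delay_quantile(samples: int, q: float, seed: int = 0) -> float:
    """Empirical q-quantile of the normalized delay over random operating points."""
    rng = random.Random(seed)
    delays = sorted(sample_delay(rng) for _ in range(samples))
    return delays[min(samples - 1, int(q * samples))]

corner = worst_corner_delay()
p999 = delay_quantile(50_000, 0.999)
print(f"worst-corner delay: {corner:.3f}")
print(f"99.9th percentile : {p999:.3f}")
print(f"margin beyond 99.9% coverage: {100 * (corner - p999) / corner:.1f}% of worst-corner delay")
```

The gap between the two numbers is the portion of the margin that is rarely exercised, which is what adaptive techniques aim to recover.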
There are three possible ways to reduce the safety margin under variations. First,
the safety margin can be reduced by minimizing the effect of variations on circuit
parameters. If the variation can be compensated for, the required safety margin can also
be reduced. Second, adaptively adjusting operating conditions helps a circuit tolerate
variations. Third, making a circuit error-tolerant under variations can also minimize the
safety margin. Without any safety margin, a digital system could have timing errors. If the
digital system can detect and manage timing errors, or prevent them, the system can
operate with minimal safety margins.
This thesis considers the three approaches above to overcome increasing static and
dynamic variations. First, this work considers post-silicon tuning methods, which have
been explored in prior works for 2D ICs, to compensate for static process variation in
three-dimensional integrated circuits (3D ICs). The 3D IC is a promising technology that
provides a high level of integration with high performance and low power by stacking
different dies using through-silicon-via (TSV) technology [9]-[18]. Although 3D
integration has shown the promise of improving the power and performance of a system with
a reduced footprint, a 3D IC can be significantly affected by variations [19], [27]. In
chapter 3, this thesis focuses on applying post-silicon tuning and developing design
methodologies to compensate for variations in 3D ICs.
 
Second, this thesis proposes adaptive clock and voltage generation techniques to 
compensate for dynamic variations. If a circuit can change the operating condition, such 
as clock frequency or supply voltage, it can overcome time-dependent variations. Thereby, 
safety margins for dynamic variations can be minimized. In chapter 4, methodologies for 
tolerating fast-changing variations are proposed.  
Third, this work explores error-tolerant design techniques to eliminate the safety 
margin. Design methodologies in chapter 3 and chapter 4 are based on replica circuits. In 
chapter 5, timing-violation detection and error-prevention methodologies in real paths are 
explored. Previous adaptation techniques focus on preventing errors utilizing replica 
circuits. On the other hand, this approach focuses on preventing errors in real data paths 
without replica circuits. The replica-based approaches have two major limitations:
mismatches between replica circuits and real data paths, and the limited control speed from
noise sensing to adaptation. If timing errors can be recovered from or prevented even after
the error condition has occurred, the safety margin is no longer required.
This thesis presents various solutions to minimize static variations and adapt to 
dynamic variations considering design types and variations types. As a result, the 
proposed approaches can minimize safety margins while maintaining robust operation, 








Overcoming static and dynamic variations is important for meeting a performance 
target with a reduced supply voltage. Without adaptive design techniques for
compensating for variations, excessive safety margins are required. Excessive safety
margins demand either considerable design effort to meet the target performance or a
significant performance loss to guarantee reliable operation. Because corner-based design
approaches consider the worst-case conditions, the size of a circuit must be increased to
meet the target performance under variations, at the expense of higher power and cost. However,
the worst-corner case occurs only when the process, the voltage, and the temperature are 
all in the worst corner at the same time. The probability that all the worst conditions 
occur simultaneously is very low. Therefore, under normal conditions, most chips are 
operating with excessive safety margins. Thus, minimizing safety margins is a key 
challenge to achieving high performance under increasing variations. Safety margins can 
be minimized by adaptive circuit techniques that compensate for static and dynamic 
variations. 
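The low likelihood of a simultaneous worst corner can be illustrated with a short calculation. The sketch below is only an illustration under assumptions that are not part of this work: each of the three parameters (process, voltage, temperature) is treated as an independent standard-normal quantity, and "worst corner" is taken to mean lying beyond +3 sigma. The function names are hypothetical.

```python
from statistics import NormalDist


def tail_probability(k_sigma: float) -> float:
    """Probability that one standard-normal parameter lies beyond +k_sigma."""
    return 1.0 - NormalDist().cdf(k_sigma)


def joint_worst_corner_probability(k_sigma: float, n_params: int = 3) -> float:
    """Probability that n independent parameters are all beyond +k_sigma.

    Independence is an assumption of this sketch; correlated parameters
    would give a different (typically larger) value.
    """
    return tail_probability(k_sigma) ** n_params


if __name__ == "__main__":
    p_single = tail_probability(3.0)
    p_joint = joint_worst_corner_probability(3.0)
    print(f"one parameter beyond 3 sigma:        {p_single:.3e}")
    print(f"P, V, and T all beyond 3 sigma:      {p_joint:.3e}")
```

Under these assumptions the joint probability is the cube of the single-parameter tail probability, which is why a margin sized for the simultaneous worst corner is rarely exercised in normal operation.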
Different variations occur on different time scales. First, process variation is
determined at integrated circuit (IC) fabrication. Thus, it can be assumed to be static,
since it does not change over time while ICs are operating. Second, the aging effect
degrades IC performance over a long time period, i.e., 1 to 3 years. Since this degradation
of a device is a very slow process, it can be classified as a near-static variation, as
shown in Figure 2.1. Third, the temperature variation is affected by the power consumption
of ICs. The power consumption is strongly dependent on the workloads of ICs. A significant
change in workloads necessarily leads to a variation in power consumption, which results in
a variation in heat dissipation. However, workloads change significantly only on the order
of milliseconds or seconds. In addition, the thermal time constant of silicon-based ICs is
on the order of milliseconds. As a result, temperature changes faster than the aging effect
and can be classified as a slow-dynamic variation, as shown in Figure 2.1. Fourth, the
voltage variations, which are affected by power demands and the quality of the power
delivery network, affect data-path delays significantly. Voltage variation is a faster
process than the other variations and occurs on the order of nanoseconds to microseconds.
Thus, the voltage variation is more difficult to tolerate than the other variations due to its






















Figure 2.1: Speed and impact amount of variations. 
 
 
The static and dynamic categories classify variations by their temporal behavior.
Variations can also be sub-categorized into local and global variations in
terms of spatial behavior, as shown in Figure 2.2. Local variations only affect local areas
near the variation sources. Examples of local variations are within-die (WID) variations, 
local hot spots, and local voltage droops. Unlike the local impact of local variations, 
global variations affect the whole chip. Examples of global variations are die-to-die (D2D) 








Figure 2.2: Local and global variation components. 
 
2.2 Post-silicon tuning approach 
Process-parameter variations result in variations in performance and power. WID 
variations cause mismatches in the transistor characteristics within a die. In addition, each 
die has different global device parameters. These parametric variations increase the 
performance variation. Because of the performance variation, some dies cannot be
accepted because of either low performance or excessive power. The number of
acceptable dies will decrease unless the frequency target is lowered. In other words,
increased process variations imply either yield loss or performance loss.
Two major techniques are widely used to minimize performance variations. One 
technique is to apply different supply voltages according to the speed of a die [3], [39], 
[40]. If the die is slow, a high supply voltage is applied to make the die fast. On the other 
hand, if the die is fast, a low supply voltage is applied to make the die slow. This 
technique is referred to as adaptive-voltage-scaling (AVS). The other technique is to 
apply a non-zero body-to-source bias to modulate the threshold voltage of a transistor as 
shown in Figure 2.3 [4], [5]. A reverse body bias increases the threshold voltage and
makes a die slow. Conversely, a forward body bias decreases the threshold voltage of
a transistor and makes a die fast. This technique is referred to as adaptive-body-biasing
(ABB). Thus, it is possible to control the transistor speed through AVS or ABB after chip
fabrication. These post-silicon-tuning methods help compensate for the delay spread and

























Figure 2.3: Concept of post-silicon tuning (AVS and ABB). 
 
Post-silicon-tuning methods are widely used in 2D chips to minimize static 
variations, such as process variations and aging effects, which change over a long time 
period [6]. However, limited studies have been performed to understand the effect of 
variability and to develop post-silicon-tuning methodologies in 3D chips. This work 
evaluates the effectiveness of post-silicon tuning in 3D ICs. Then, a methodology for 
post-silicon tuning for 3D ICs is proposed. 
 
2.3 Non-design-intrusive dynamic adaptation 
An attractive approach is to utilize critical-path replica circuits whose delays
are strongly correlated with the critical-path delays of a real logic block, as shown in
Figure 2.4 [52]-[54]. When timing errors are likely to occur in the replica circuits, the
supply voltage or the operating frequency can be dynamically changed to prevent a chip
failure. This replica-based approach increases the variation tolerance of a circuit since
it adaptively changes the operating condition of the circuit. However, the different
physical locations of the actual circuits and the replica circuits on a chip can result in
different process-voltage-temperature (PVT) variations [63]-[66]. Because some margin is
needed to cover this mismatch, an adaptive design based on replica circuits cannot totally
eliminate safety margins. On the other hand, since this replica-based method is not an
intrusive technique, no modification of the main circuit is required. The critical
challenge of this approach is to minimize the delay from sensing environmental variations
to changing operating conditions so that fast-changing variations can be handled. Because
of this challenge, this method is mainly useful for adapting a circuit to slow-changing
global noise, which affects the whole chip.
This thesis proposes a method for adaptation to fast-changing variations with the replica-
















Figure 2.4: Concept of non-design-intrusive dynamic adaptation. 
 
2.4 Design-intrusive dynamic adaptation 
An alternative approach is to use in-situ error-detection and error-correction
mechanisms to tolerate variations while operating without safety margins [63]-[71].
Without any safety margin, voltage and temperature variations (i.e., the dynamic
environmental variations) can cause timing errors. Thus, it is necessary to detect timing
errors in the real data paths. If a system can recover from the detected errors, it can
operate without any safety margin. Above all, the in-situ approach can tolerate fast
local noise because it detects and corrects errors in the real data paths. Thereby, this
approach can potentially eliminate the safety margin. However, the main circuit must be
modified to implement error-detection and recovery circuits in the real data paths.
Therefore, the in-situ approach is a design-intrusive technique. In addition, all error
recovery methods incur additional power and performance penalties. Thus, if errors occur
frequently, the penalty can become significant. If the penalty associated with error
recovery can be reduced, the performance and the power of pipelined systems can be
significantly enhanced [66]. Traditionally, architectural replay, i.e., re-executing
erroneous instructions, is used for error recovery. Thus, the error-recovery mechanism
becomes a platform-dependent solution if architectural replay, which is embedded in
a microprocessor, is used for error correction. Because of the penalty of error
correction and the platform dependency, this approach cannot be used in general circuits.
This research proposes a method for preventing timing errors under fast-changing
variations with a minimized performance penalty in general circuits, such as state-






















POST-SILICON TUNING APPROACH 
 
3.1 Introduction 
The methodologies to modify parameters of ICs after fabrication are called
post-silicon tuning. Post-silicon tuning has become more popular for digital and analog
circuits in deep-submicron technologies as a way to achieve the target yield or the target
performance. Post-silicon tuning in analog circuits has a long history since analog
circuits are very vulnerable to mismatches [6]. Even small mismatches in analog circuits
can cause significant output offsets. Post-silicon tuning for analog circuits focuses on
repairing device mismatches using fuses or tunable capacitor/resistor bank arrays.
In the digital domain, various solutions for post-silicon tuning, such as AVS, ABB, and
redundancy, have been explored. Prior work has focused on solving variation problems in 2D
ICs.
A three-dimensional integrated circuit (3D IC) is a promising technology that
provides a high level of integration with high performance and low power by stacking
different dies using through-silicon vias (TSVs) [7]-[18]. Although 3D integration has
shown the promise of improving the power and performance of a system with a reduced
footprint, a 3D IC can be significantly affected by variations [17]-[19], [27]. From the
process perspective, a 3D IC can be thought of as multiple separate chips fabricated on
different wafers. In other words, when different dies are stacked, the dies come from
different lots and wafers. This wafer-to-wafer or lot-to-lot variation can lead to a
significant performance mismatch between different dies in a 3D chip. Therefore, in a 3D IC,
both WID and D2D process variations contribute to within-chip variations, as shown in Figure
3.1. Moreover, variations in the resistance and capacitance of the TSVs also add to
the total path-delay variations in 3D ICs [19]. Therefore, it is necessary to develop
methodologies to minimize variations in a 3D IC technology. Otherwise, performance













Operating Frequency of a 3D IC = f2
 
Figure 3.1: D2D variation issue in 3D ICs. 
 
3.2 Variation challenges in 3D ICs 
The variation in process parameters is a key challenge to performance or leakage-
limited yield of designs in nanometer technologies [62]. Die-to-die (D2D) and within-die 
(WID) variations in process parameters can lead to significant chip-to-chip variations in 
the delay and the power of logic circuits. Post-silicon adaptation, such as voltage
scaling and body biasing, can be used to improve the parametric yield of a chip [3], [4]. The
objective of post-silicon tuning is to tune the supply voltage or the body bias of a chip
depending on the global process corner [3], [4]. The D2D variation in process is the
collective effect of lot-to-lot, wafer-to-wafer, and within-wafer variations caused by
different sources of manufacturing imperfections. For logic circuits where transistors are
normally larger than minimum size devices, D2D variations can dominate over WID 
variations.  
From the process perspective, a 3D chip can be thought of as multiple separate
chips fabricated on different wafers. In other words, when the different dies are stacked,
the dies come from different lots and wafers. This can lead to significant variations
between two dies in a 3D chip [9]. Therefore, in a 3D IC, both WID and D2D variations
contribute to within-chip variations [10]-[13]. This is unlike 2D ICs, where the
within-chip variation is determined only by WID variations. Further, variations in the RC
properties of the through-silicon vias (TSVs) also add to the total variations in 3D ICs
[10]-[13]. It is necessary to develop methodologies to reduce the delay and leakage spread
of 3D chips considering within-chip and chip-to-chip variations caused by D2D and WID
variations.
Understanding and mitigating the effect of process variations on 3D ICs are 
evolving fields. Limited studies have been performed to model the effect of variability on 
3D chips, to understand the effect of different physical-design choices on the variation 
characteristics, and to develop design-level methods to improve parametric yield [10]-
[13]. Methods for process-corner-aware bonding of dies have also been studied to 
optimize yield [14]. However, such bonding approaches are primarily limited to die-to-
die 3D bonding. In a wafer-to-wafer bonding 3D technology, it is difficult to achieve an 
optimal bonding configuration for all dies in a wafer [15]-[17]. This chapter explores two 
design methodologies for reducing variability in 3D ICs considering different design 
cases.  
This work considers two design scenarios. First, two tiers are implemented with
independent clock networks. This design case includes processor-memory and
processor-processor tiers with independent clock networks. This 3D integration scenario
is referred to as block-level 3D integration. Second, a functional block can be
implemented across separate tiers with one 3D clock network. This scenario can be classified
as logic-level 3D integration. For the 3D design scenarios above, adaptation techniques
that are widely used in 2D designs are considered in this chapter. Adaptive voltage
scaling (AVS) and adaptive body biasing (ABB) are widely used post-silicon tuning
methodologies for offsetting D2D variations [3], [4]. In AVS, a higher VDD is
assigned to a slower die (to improve speed), and a lower VDD is assigned to a faster die (to
save power) [3]. The effectiveness of AVS for reducing logic delay variability in
3D ICs has been studied [27]. However, AVS for clock networks with multiple clock TSVs is
challenging because all clock TSVs require level shifters, which introduce an additional
source of delay variation (i.e., skew) and power overhead. The second approach is to use ABB,
where a forward body bias is applied to slow dies and a reverse body bias is applied to fast
dies [3]. ABB has a significant advantage over AVS for 3D clock networks, as body
biasing does not require a different VDD for each die. Hence, the signals between different
dies can be interfaced without level shifters.
Considering block-level 3D integration, this thesis presents tier-adaptive-voltage-scaling
(TAVS) as a methodology for post-silicon tuning of data paths in 3D ICs
under D2D and WID process variations. TAVS reduces the data-path delay variability
of 3D ICs by independently tuning the supply voltages of different tiers. The circuit issues
associated with the design of TAVS, including level shifters, are discussed. This
methodology is developed for the case when
For logic-level 3D integration, this thesis presents tier-adaptive-body-biasing
(TABB) as a methodology for post-silicon tuning of 3D clock networks in 3D ICs under
D2D and WID process variations. TABB reduces the skew and slew variability of 3D ICs by
independently applying adaptive body biases to different tiers. Digital circuit techniques
to sense D2D variations of pMOS and nMOS transistors are discussed.
 
3.3 Tier-adaptive-voltage-scaling for 3D ICs 
This section evaluates the possibility of post-silicon adaptation to mitigate the
effect of within-chip and chip-to-chip variation and improve the parametric yield of 3D ICs.
Tier-adaptive-voltage-scaling (TAVS) is proposed as a post-silicon tuning technique to
reduce the spread in the delay distribution of 2D and 3D paths, and hence, that of 3D
chips. The TAVS methodology senses the process corner of each individual tier (i.e., die) in
a 3D chip and independently adapts the supply voltage of each tier to control the
delay spread. The concept and design of TAVS are presented, and the impact of TAVS on the
delay distribution of 3D ICs is shown. This section makes the following contributions: 1) this
study analyzes the effect of process variations on different types of critical paths in a 3D
design, namely, 2D critical paths contained in a single die and 3D critical paths that
traverse between dies; 2) this study discusses efficient methods for voltage-level
conversion for 2D and 3D paths to realize a TAVS-based 3D design; 3) this study evaluates
the effectiveness of TAVS considering different types of 2D and 3D critical paths. TAVS
simulation results show a 26-39% reduction in the delay variability of 2D and 3D paths in 3D
ICs.
 
3.3.1 Effect of die-to-die variation in 3D design 
In this section, the impact of D2D variation on the delay distribution of different
types of paths in a 3D design is analyzed. Note that the D2D variation in 3D ICs contributes
to within-chip variation, i.e., different logic blocks and/or different segments of a path
in a 3D design can come from different inter-die process corners. First, the different types
of paths that can exist in a 3D design are classified. The first type is the 2D path (or
non-shared path), which is contained in a single die and does not cross the die boundary, as
in a 2D design (Figure 3.2). The second type is the shared path, or 3D path, which crosses
the boundary between two dies as shown in Figure 3.2. The 3D paths are further classified
into the balanced (path2b) and the unbalanced (path2u) cases according to how the path is
partitioned between dies (i.e., the TSV insertion point). The extremely unbalanced (path2us)
case is a special case

























































Figure 3.2: Path classification for 3D ICs. 
3.3.1.1 Non-shared 2D paths 
There are paths that reside only in one die. In this case, the path will be affected 
only by the WID and D2D variations of one die. The impact of variations on the delay of 
a two-tier 3D IC with 2D critical paths in each die is analyzed. The maximum delay of 
each die is determined by the most critical path of that die. This section focuses on a 
typical condition in which the critical paths of the two dies are designed for equal delay.
Each die has an independent delay distribution for its critical path. If dies are selected
randomly for stacking, the operating speed distribution of the 3D chip will depend on the

































 
Figure 3.3: Process variation impact in 2D paths in 3D ICs. 
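The effect of random stacking on non-shared 2D paths can be sketched with a small Monte-Carlo experiment. This is a minimal sketch and not the SPICE-based analysis of this work: it assumes that the chip delay equals the larger of the two die critical-path delays, that both die delays are independent normal variables with the same mean and standard deviation, and that delays are in arbitrary units. All names and numbers are hypothetical.

```python
import math
import random
import statistics


def simulate_chip_delay(mean, sigma, n_chips=20000, seed=1):
    """Return (mean die delay, mean chip delay) for randomly stacked two-tier chips.

    Assumption of this sketch: the 3D chip runs at the speed of its slower
    die, so the chip delay is max(die1 delay, die2 delay).
    """
    rng = random.Random(seed)
    die_delays = []
    chip_delays = []
    for _ in range(n_chips):
        d1 = rng.gauss(mean, sigma)
        d2 = rng.gauss(mean, sigma)
        die_delays.append(d1)
        chip_delays.append(max(d1, d2))
    return statistics.mean(die_delays), statistics.mean(chip_delays)


if __name__ == "__main__":
    die_mean, chip_mean = simulate_chip_delay(mean=10.0, sigma=1.0)
    # For two independent, identically distributed normal delays, the
    # expected maximum is mean + sigma / sqrt(pi).
    analytic = 10.0 + 1.0 / math.sqrt(math.pi)
    print(f"mean die delay:  {die_mean:.3f}")
    print(f"mean chip delay: {chip_mean:.3f} (analytic {analytic:.3f})")
```

Under these assumptions the mean chip delay is larger than the mean delay of a single die, which illustrates why uncompensated D2D variation degrades the speed distribution of a randomly bonded stack.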
 
3.3.1.2 3D paths 
A 3D path is split across different dies, and the segments of the path are connected
through a TSV. The case where the path contains only one TSV is analyzed. Different
insertion points of the TSV in a 3D path are considered, as shown in Figure 3.2. First,
TSV variation is excluded, and an equal standard deviation is assumed for the
threshold-voltage variations of the two tiers. In Figure 3.2, the 0% division point
indicates the case where all logic gates in the path are placed in die2 and the signal
source is a flip-flop (FF) in die1. On the other hand, the 100% division point indicates the
case where all logic gates are in die1 and the signal sink is an FF in die2.
A reduced impact of D2D variation is observed when the logic gates in the path
are split across different dies. First, the process variations of the two dies are assumed
to follow independent normal distributions with the same standard deviation (Figure
3.4(a)). Since the uncorrelated D2D mismatches between the dies can offset each other,
splitting a path across different dies helps reduce the delay spread. Hence, when the
standard deviations of the process variations of the two dies are the same, (a) the 0% or
100% division point results in the highest standard deviation of path delay, and (b) the 50%
division point (i.e., the evenly balanced case) results in the minimum standard deviation of
path delay. When WID variations are also considered, they shift this
minimum-standard-deviation point.
In 3D chips, TSV variation is introduced on top of D2D and WID variation.
To observe the impact of TSV variation, the resistance and capacitance variations of TSVs
are included, and the variability of 3D paths is estimated (Figure 3.4(b)). The capacitance
and the resistance of a TSV are set to 35fF and 50mΩ, respectively [18], [19]. The σs of the
capacitance and the resistance of TSVs are assumed to be 15%. Further, the delay spread
is analyzed considering both device and TSV variability as a function of the TSV insertion
point. At every insertion point, the TSV variability increases the delay spread of 3D
paths (Figure 3.4(b)). In addition, the more balanced 3D paths show a lower delay
spread even with the TSV variation. With the TSV variability, the unbalanced 3D paths
can have higher variation than 2D paths. The most balanced 3D paths have the minimum
delay spread, which can be smaller than that of 2D paths even with TSV variation.
Next, the scenario in which the standard deviations of the process variations of the two
dies are different is studied. The estimated delay variation of the 3D path is compared with
the scenarios in which all logic is placed in either die 1 or die 2, as shown in Figure 3.4(c).
Higher D2D variation for die1 implies that variation of the 3D path delay is more 
strongly affected by D2D variation of the die1. As D2D variations in die 1 increase, the 
minimum standard deviation point of 3D path delay moves towards 0% division point. 
This is because a reduced portion of logic in the die1 decreases the influence of higher 
D2D variation of die 1 on the 3D path delay.  
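The dependence of the delay spread on the division point can be reproduced with a first-order model. The sketch below is a simplification introduced here for illustration, not the simulation setup of this work: it assumes that a fraction x of the path delay lies in die1 and 1 - x in die2, that each segment delay scales linearly with the D2D shift of its own die, that the two D2D shifts are independent, and that WID, TSV, and flip-flop contributions are neglected. Under these assumptions the relative path-delay sigma is sqrt((x*s1)^2 + ((1-x)*s2)^2), which is minimized at x = s2^2 / (s1^2 + s2^2).

```python
import math


def path_sigma(x, sigma_die1, sigma_die2):
    """Relative delay sigma of a 3D path with fraction x of its logic in die1.

    x corresponds to the division point divided by 100 (0.0 means all logic
    in die2, 1.0 means all logic in die1). First-order model only.
    """
    return math.sqrt((x * sigma_die1) ** 2 + ((1.0 - x) * sigma_die2) ** 2)


def best_division(sigma_die1, sigma_die2):
    """Division point (as a fraction) that minimizes path_sigma."""
    return sigma_die2 ** 2 / (sigma_die1 ** 2 + sigma_die2 ** 2)


if __name__ == "__main__":
    # Equal D2D sigma in both dies: the minimum is at the 50% division point.
    print(best_division(0.15, 0.15), path_sigma(0.5, 0.15, 0.15))
    # Larger D2D sigma in die1: the minimum moves toward the 0% division point.
    print(best_division(0.30, 0.15), path_sigma(0.2, 0.30, 0.15))
```

In this model, doubling the D2D sigma of die1 moves the minimum from the 50% division point to the 20% division point, which is consistent with the trend described above.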
From the above observations, it can be seen that variation-aware 3D
partitioning methods that aim to balance the nominal delay of the segments of critical
paths in the two dies help reduce the delay variability of 3D paths. Also, the optimal
division point is a strong function of the relation between the standard deviations of the D2D































[Figure 3.4, panels (a), (b), and (c). Horizontal axis: division point (%). Annotations: "Driver (FF) in die1 affects σ of this path"; "All logics are in die1 and the destination FF is in die2"; "Minimum sigma when evenly divided = path type2b".]
Figure 3.4: Standard deviation of delay variations in 3D paths considering different path
divisions: (a) neglecting TSV variations and assuming the same standard deviation (σ) for D2D
variations in both dies, (b) considering TSV variations (same D2D variations for





3.3.2 Tier-adaptive-voltage-scaling methodology 
In this section, the proposed tier-adaptive-voltage-scaling (TAVS) approach for
post-silicon tuning of 3D chips is discussed. In a 2D chip, adaptive voltage scaling
can offset the D2D variation of a chip by assigning a different voltage to each chip. For a
3D chip, adaptive voltage scaling needs to assign a different voltage to each tier
independently. This may cause different segments of a path (critical or non-critical) to
reside in different voltage domains (i.e., different tiers), requiring level shifters within
the path. The overall system architecture of the TAVS system with on-chip regulators is
shown in Figure 3.5. The proposed TAVS system considers independent supply domains for
individual tiers, controlled by on-chip voltage regulators. The high-level system components
include delay variation sensors, control logic, and voltage regulators. A delay-based sensor
detects the process corner of each die. After power-on, the initial regulator output is the
nominal voltage. Under this condition, an enable pulse is applied to the delay
sensor. Depending on the output of the delay sensor, the supply voltage is selected using
the on-chip power regulator. After the output voltage of the on-chip regulator has changed,
the system can start normal operation. In this section, the key circuit components of the














































Figure 3.5: The system architecture of tier-adaptive-voltage-scaling. 
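The start-up sequence described for the TAVS system can be summarized as a simple selection rule. The following is a minimal behavioral sketch, not the control logic implemented in this work: the three voltage levels are the regulator levels used in the simulations of Section 3.3.3, while the count thresholds and function names are hypothetical placeholders that would have to be calibrated for a real sensor.

```python
# Regulator levels used in the simulations of this chapter (in volts).
SLOW_DIE_VDD = 1.1     # higher supply to speed up a slow tier
NOMINAL_VDD = 1.0      # initial regulator output after power-on
FAST_DIE_VDD = 0.9     # lower supply to save power in a fast tier


def select_tier_vdd(sensor_count, slow_threshold, fast_threshold):
    """Map one tier's delay-sensor count (read at nominal VDD) to a supply level.

    A low count means slow gates, a high count means fast gates.
    The thresholds are hypothetical calibration values.
    """
    if sensor_count < slow_threshold:
        return SLOW_DIE_VDD
    if sensor_count > fast_threshold:
        return FAST_DIE_VDD
    return NOMINAL_VDD


def tune_chip(tier_counts, slow_threshold=235, fast_threshold=260):
    """Select the supply voltage of every tier independently."""
    return [select_tier_vdd(c, slow_threshold, fast_threshold) for c in tier_counts]


if __name__ == "__main__":
    # Hypothetical sensor counts for a slow, a typical, and a fast tier.
    print(tune_chip([225, 247, 275]))
```

Because each tier is mapped independently, two tiers of the same chip can end up in different voltage domains, which is the reason level shifters are needed on paths that cross tiers.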
 
3.3.2.1 Delay variation sensor 
The first circuit component is the sensor to characterize the D2D variation. In this 
chapter, a ring-oscillator-based delay sensor is used to sense the D2D variation of a die. 
The frequency of a ring oscillator changes with process variation, and this signature can be
used to detect the process corner of a die. The generated clock frequency differs
according to the speed of the gates. If a counter is used to count the number of edges of the
generated clock, the counter value indicates the speed of the logic gates. To detect D2D
variations, the effects of WID variations need to be minimized. The effect of WID 
variation on the D2D delay sensor can be reduced by increasing the number of chains in 
the ring oscillator [20]. 
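The averaging effect of a longer ring oscillator can be illustrated with a behavioral model. This sketch rests on assumptions that are not taken from this work: the "chains" are interpreted as the series stages of the oscillator, each stage delay is the nominal delay scaled by a common D2D shift plus an independent normal WID term, and the oscillation period is twice the sum of the stage delays. The nominal stage delay, the counting window, and all names are hypothetical.

```python
import random
import statistics

NOMINAL_STAGE_DELAY = 20e-12  # hypothetical per-stage delay in seconds
COUNT_WINDOW = 1e-6           # hypothetical counting window in seconds


def oscillator_period(n_stages, d2d_shift, wid_sigma, rng):
    """Period of a ring oscillator with a shared D2D shift and per-stage WID noise."""
    stage_delays = [
        NOMINAL_STAGE_DELAY * (1.0 + d2d_shift + rng.gauss(0.0, wid_sigma))
        for _ in range(n_stages)
    ]
    return 2.0 * sum(stage_delays)


def sensor_count(n_stages, d2d_shift, wid_sigma, rng):
    """Number of full oscillator periods that fit in the counting window."""
    return int(COUNT_WINDOW / oscillator_period(n_stages, d2d_shift, wid_sigma, rng))


def relative_wid_spread(n_stages, wid_sigma, trials=2000, seed=7):
    """Relative sigma of the period caused by WID variation alone (no D2D shift)."""
    rng = random.Random(seed)
    periods = [oscillator_period(n_stages, 0.0, wid_sigma, rng) for _ in range(trials)]
    return statistics.stdev(periods) / statistics.mean(periods)


if __name__ == "__main__":
    for n in (5, 25, 125):
        print(n, "stages, relative WID spread:", relative_wid_spread(n, 0.05))
    rng = random.Random(3)
    print("slow die count:", sensor_count(101, +0.10, 0.05, rng))
    print("fast die count:", sensor_count(101, -0.10, 0.05, rng))
```

In this model the WID contribution to the period shrinks roughly with the square root of the number of stages, while the D2D shift is common to all stages and is preserved, so the count of a long oscillator mainly reflects the D2D corner.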
3.3.2.2 Level shifter 
Level shifters are required within a 3D path to convert between the different voltage levels.
The level conversion is always performed on the receiver side of a TSV. The objective is
to ensure that only one supply voltage domain exists within a die. This creates an additional
challenge, as two different VDDs on one side of the TSV (either
driver or receiver) are not possible.
This chapter considers two different types of level shifters to achieve the above
goals (Figure 3.6). The level shifter Type1 (LS Type1) requires a single input and a single
supply voltage and can be applied without input or supply constraints [21]. The entire
circuit remains on the receiver side of the TSV, and no design modification is required on
the driver side. In the LS Type1, the diode-connected transistor M0 drops the voltage by
one threshold voltage (Vth). Thus, the supply voltage level of the input buffer (inv0)
is VDD2-Vth. Therefore, the propagation delay of the falling transition is slow when the
single-ended level shifter is used. The level shifter Type2 (LS Type2) is a modified cascode
voltage switch logic (CVSL) circuit. The propagation delay of the CVSL is small since it has
differential inputs. Thus, the LS Type2 is much faster than the LS Type1. However, since
the CVSL is a differential logic, the inverter should be placed in the other
voltage domain. This implies that two TSVs are required to convert voltage levels, incurring
additional area and power for the extra TSV.
In the SPICE simulation, the delay of the LS Type2 is less sensitive to different VDD
conditions than that of the LS Type1. The power dissipation associated with a level
shifter designed to achieve a target delay of ~50ps is estimated. Table 3.1
summarizes the estimated power of the driver (inverter) and the TSV. Since introducing
the LS Type2 incurs much higher power overhead, the LS Type2 should be carefully























Level shifter Type1 Level shifter Type2
 
Figure 3.6: Circuit diagrams of level shifters. 
 
Table 3.1: Power analysis.
Power overheads of level shifters (500MHz, 1V, and α=0.5).
Configuration              Total power   LS power   Overhead
Driver + TSV               11.77µW       -          -
Driver + TSV + LS Type1    11.30µW       2.06µW     1.59µW




3.3.2.3 Methodology for level shifter insertion 
The discussion in the previous sub-section shows that the LS Type2 introduces
more power and area overhead but less delay penalty. On the other hand, the LS Type1
reduces power/area overhead but incurs additional delay penalty. Not all 3D paths (i.e.,
paths with a TSV) have equal delay or are equally timing critical, but all of them
require level shifters for TAVS compatibility. The delay criticality of the 3D paths
can be used to choose between the LS Type1 and the LS Type2. If the 3D path is
not timing critical, the LS Type1 can be used to minimize the area and the power
overhead (Figure 3.7). However, if the 3D path is timing critical, the LS Type2 is
recommended, as it significantly reduces the delay overhead. Finally, if a path is totally
unbalanced (like the path Type2us), the receiver end of the TSV directly terminates in a
flip-flop (Figure 3.7). In this case, it is proposed to convert the flip-flop itself to a
level-converting flip-flop to reduce the delay overhead of level shifters. A
sense-amplifier-based flip-flop (SAFF) is a good solution for level-shifting flip-flops
[22]. With the SAFF,







































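The level-shifter insertion rules of Section 3.3.2.3 can be written as a short decision function. This is only a sketch of the selection rule as described in the text: the slack-based definition of "timing critical", the threshold argument, and all identifiers are hypothetical and are introduced here for illustration.

```python
from enum import Enum


class Shifter(Enum):
    LS_TYPE1 = "single-ended level shifter (lower power and area, slower)"
    LS_TYPE2 = "differential CVSL-based level shifter (faster, extra TSV)"
    LEVEL_CONVERTING_FF = "level-converting flip-flop, e.g., a SAFF"


def choose_level_shifter(slack_ps, critical_slack_ps, tsv_feeds_flip_flop):
    """Pick the level-conversion style for one 3D path.

    slack_ps:            timing slack of the path (hypothetical metric)
    critical_slack_ps:   paths with less slack than this count as timing critical
    tsv_feeds_flip_flop: True when the receiver end of the TSV terminates
                         directly in a flip-flop (totally unbalanced path)
    """
    if tsv_feeds_flip_flop:
        return Shifter.LEVEL_CONVERTING_FF
    if slack_ps < critical_slack_ps:
        return Shifter.LS_TYPE2
    return Shifter.LS_TYPE1


if __name__ == "__main__":
    print(choose_level_shifter(200.0, 50.0, False).name)  # non-critical path
    print(choose_level_shifter(10.0, 50.0, False).name)   # timing-critical path
    print(choose_level_shifter(10.0, 50.0, True).name)    # TSV ends in a flip-flop
```

Checking the flip-flop case first reflects the text's treatment of the totally unbalanced path as a special case in which the flip-flop itself performs the level conversion.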
3.3.3  Simulation results 
The test system was simulated in a predictive 45nm technology [23]. The 
behavior of the power regulator was modeled with Verilog-A. The outputs of the 
regulators were restricted to three voltage levels (0.9V, 1V, and 1.1V). For a generic
analysis of the proposed TAVS, a statistical analysis framework was considered instead
of focusing on a specific design example. First, different types of critical paths (i.e.,
2D critical paths and 3D critical paths) are assumed to exist in a 3D design. Further,
both single-ended (LS Type1) and differential (LS Type2) level shifters for 3D critical
paths are considered to study the delay overhead. Next, D2D and WID variations in the
threshold voltage of the devices are considered to study the overall functionality and
performance of the TAVS system. For the analysis, the standard deviations of the D2D and
WID variations in threshold voltage and of the RC variations in TSVs are all set to 15%.
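The variation setup described above can be sketched as follows. This is only a minimal
illustration and not the SPICE/Verilog-A flow used for the reported results: the
Gaussian model, the nominal VTH value, the number of devices per die, and all function
names are assumptions made for the example. Only the 15% standard deviations, the
two-tier stack, and the random bonding of dies come from the text.

```python
import random

NOMINAL_VTH = 0.471   # V, illustrative nominal value (assumption for this sketch)
SIGMA_D2D = 0.15      # D2D sigma as a fraction of nominal VTH (from the text)
SIGMA_WID = 0.15      # WID sigma as a fraction of nominal VTH (from the text)


def sample_die(rng, n_devices=8):
    """Return (d2d_shift, per-device VTH list) for one die.

    The D2D shift is shared by every device on the die; the WID term is
    drawn independently for each device.
    """
    d2d_shift = rng.gauss(0.0, SIGMA_D2D * NOMINAL_VTH)
    vths = [
        NOMINAL_VTH + d2d_shift + rng.gauss(0.0, SIGMA_WID * NOMINAL_VTH)
        for _ in range(n_devices)
    ]
    return d2d_shift, vths


def sample_stacks(n_stacks=1000, seed=0):
    """Sample tier-1 and tier-2 dies independently and bond them at random."""
    rng = random.Random(seed)
    tier1 = [sample_die(rng) for _ in range(n_stacks)]
    tier2 = [sample_die(rng) for _ in range(n_stacks)]
    rng.shuffle(tier2)  # random bonding of tier-2 dies onto tier-1 dies
    return list(zip(tier1, tier2))


stacks = sample_stacks()
```

Each element of `stacks` holds the two dies of one bonded chip, so the per-tier D2D
corner that a sensor would have to detect is available as the first entry of each die.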
3.3.3.1 Statistical analysis of voltage assignments 
First, the statistical behavior of the voltage assignments for different tiers and 
different chips are analyzed. 1000 Monte-Carlo simulations are performed, and D2D and 
WID variations are assigned to different dies for designs in tier 1 and tier 2. Then, they 
are randomly bonded. Simulation of the proposed TAVS architecture was performed 
using SPICE. First, the statistical variations in the sensor outputs are monitored. As 
shown in Figure 3.8, different sensor outputs of each tier, which indicate different D2D 
corners, were obtained. Next, the control logic uses this sensor output and interacts with 
Figure 3.8: The statistical analysis of the sensor output (left) and voltage assignments 
(right) for different tiers and different chips. 
 
3.3.3.2 Analysis of delay variation in different paths 
The impact of D2D and WID variations in VTH and of TSV variations on different types
of paths, as well as the effect of TAVS, is analyzed. The process variations in the level
shifters are also considered. Figure 3.9 summarizes the mean and the standard deviation
(+3σ) of the delay distributions of the different path types. The statistics were
generated considering 3D designs composed of 2 tiers with appropriate supply voltages
after TAVS as shown in Figure 3.8. With TAVS, the mean delay of 2D critical paths reduces
marginally, but the delay spread reduces significantly (~34%). For balanced 3D paths
with an intermediate level shifter, the mean delay increases due to the delay of the
level shifters. In particular, inserting the level shifters only (without TAVS) results
in a higher mean and standard deviation of delay. The mean increase is larger with
single-ended (LS Type1) than with differential (LS Type2) level shifters. However, the
delay spread reduces significantly in both cases with TAVS compared to the case without
level shifters. With level shifters and TAVS, the mean delay increase (compared to no
TAVS) is limited to 5.9% and 3.1% with LS Type1 and LS Type2, respectively. The
reductions in standard deviation are ~36% in both cases (compared to the no TAVS case).
For highly imbalanced 3D paths, TAVS and SAFF provide the maximum advantage. Without
TAVS, the imbalanced paths suffer significantly from D2D, WID, and TSV variations. TAVS
results in significant (~39%) reductions in the delay spread. A SAFF results in a lower
spread with almost no change in the mean delay for extremely unbalanced paths. Figure
3.9 summarizes the observations
Figure 3.9: Simulation results with TAVS in different path types. 
* P1=path1 (2D path), P2b=balanced path2 (3D path), P2u=unbalanced path2 (3D path), 
and P2us=extremely unbalanced path2 (3D path). 
* LS1 =LS Type1, LS2=LS Type2, and SA=SAFF. 
 
3.3.3.3 Power and area overhead analysis 
The effect of TAVS on leakage and dynamic power distribution is analyzed. A 
statistical analysis is performed considering a design with 100K NAND2 gates in each 
tier. The process corner and voltage assignments obtained from the MC simulations 
discussed in Figure 3.9 are used for power estimates. TAVS marginally reduces both 
leakage and dynamic power spread (Figure 3.10).  
The power and the overhead associated with TAVS and the level shifters are 
analyzed. Based on previous studies (e.g., [8]), it is considered that the two tiers of 100K 
NAND2 gates are connected using 1000, 2000, and 3000 TSVs. The maximum (when all 
TSV paths are replaced with LS Type2) and minimum (when all TSV paths are replaced 
with LS Type1) power and area overheads are computed for each case. The NAND2 gate 
power was estimated considering an average fanout-of-4 load and 10fF of interconnect
capacitance. The estimated power overheads are shown in Table 3.2. To estimate the area
overhead, the area of each die is evaluated considering 100K NAND2 gates (1.8772μm² per
gate based on [23] and the predictive 45nm node) and an additional 30% of area for
routing and the power network [26]. The area of the level shifters is estimated from the
layout. The estimated maximum and minimum area overheads (compared to a TSV and a
driver) due to the LS Type1 (2.964μm²) and the LS Type2 (1.976μm²) with an additional
TSV including keep-out zone (16μm²) are also shown in Table 3.2 [24]. The area overhead
caused by the delay sensor (107.47μm²) is 0.04%. Although the on-chip regulator incurs
an area overhead (depending on the current demand of a chip), it helps reduce the
voltage droop. This relaxes the constraints on the off-chip regulators and on-chip
decoupling capacitors. It also enables fast dynamic voltage-frequency-scaling due to fast
Figure 3.10: The impact of TAVS on leakage (left) and dynamic (right) power distribution. 
 
Table 3.2: Analysis of power/area overhead. 
Estimation of power and area overhead considering a 200K NAND2 gate design  
(100K in each tier) in 45nm nodes and different # of TSVs 
             Power overhead         Area overhead
# of TSVs    Minimum    Maximum     Minimum    Maximum
1000         0.11%      0.72%       1.14%      6.91%
2000         0.21%      1.44%       2.15%      13.024%
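
The area columns of Table 3.2 can be approximately reproduced from the per-cell areas
quoted in the text. The sketch below is a reconstruction under a stated assumption, not
the author's script: the baseline area is taken to be one die (100K NAND2 gates plus 30%
routing) plus the area of its TSVs, a choice made here because it matches the tabulated
values. The minimum case uses the LS Type1 on every 3D path, and the maximum case uses
the LS Type2 with its additional TSV. All names are hypothetical.

```python
GATES_PER_TIER = 100_000
NAND2_AREA = 1.8772      # um^2 per gate (from the text)
ROUTING_FACTOR = 1.30    # additional 30% for routing and power network (from the text)
LS1_AREA = 2.964         # um^2, LS Type1 (from the text)
LS2_AREA = 1.976         # um^2, LS Type2 (from the text)
TSV_AREA = 16.0          # um^2, TSV including keep-out zone (from the text)
SENSOR_AREA = 107.47     # um^2, delay sensor (from the text)


def die_area():
    """Area of one tier: NAND2 gates plus routing/power-network allowance."""
    return GATES_PER_TIER * NAND2_AREA * ROUTING_FACTOR


def area_overhead_percent(n_tsv):
    """Return (minimum, maximum) area overhead in percent for n_tsv 3D paths.

    Assumption of this sketch: baseline = one die plus its n_tsv TSVs.
    """
    baseline = die_area() + n_tsv * TSV_AREA
    minimum = n_tsv * LS1_AREA / baseline                 # all paths use LS Type1
    maximum = n_tsv * (LS2_AREA + TSV_AREA) / baseline    # all paths use LS Type2 + extra TSV
    return 100.0 * minimum, 100.0 * maximum


def sensor_overhead_percent():
    """Delay-sensor area relative to one die, in percent."""
    return 100.0 * SENSOR_AREA / die_area()
```

Under this assumption, `area_overhead_percent(1000)` and `area_overhead_percent(2000)`
evaluate to roughly (1.14, 6.91) and (2.15, 13.02), and `sensor_overhead_percent()` to
roughly 0.04, in line with the values reported above.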




3.4 Tier-adaptive-body-biasing for 3D ICs 
The performance and the functionality of digital circuits depend on variations in 
logic delays and clock skews. The clock skew is defined as the difference between arrival 
times of the clock signal at different flip-flops. A higher clock skew worsens performance 
and/or robustness of a design. In 2D ICs, WID variations change the delay difference 
between various branches of the clock tree, leading to increased clock skews. The D2D 
variation changes the delay of the entire clock tree and, hence, does not affect the clock 
skew significantly. On the contrary, clock skews in 3D ICs are affected by both D2D and 
WID variations as both of them lead to within-chip variations. 
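
The difference between 2D and 3D skew under a D2D shift can be illustrated with a toy
calculation. The arrival times below are hypothetical, the D2D shift is modeled simply
as a uniform scaling of every clock latency on one die, and 3D skew is taken here as the
skew over the sinks of both dies together; these are assumptions of the sketch rather
than the model used in this work.

```python
def skew(arrivals):
    """Clock skew: difference between the latest and earliest clock arrival."""
    return max(arrivals) - min(arrivals)


def scale(arrivals, factor):
    """Model a D2D shift as a uniform scaling of all latencies on one die."""
    return [t * factor for t in arrivals]


def skew_3d(die1_arrivals, die2_arrivals):
    """Skew across the sinks of both dies (assumed definition of 3D skew)."""
    return skew(list(die1_arrivals) + list(die2_arrivals))


# Hypothetical clock arrival times in ps.
die1 = [100.0, 104.0, 98.0]
die2 = [101.0, 97.0, 103.0]

slow_die1 = scale(die1, 1.10)  # die1 is 10% slower, die2 is nominal

skew_2d_nominal = skew(die1)              # 6 ps
skew_2d_slow = skew(slow_die1)            # about 6.6 ps: nearly unchanged
skew_3d_nominal = skew_3d(die1, die2)     # 7 ps
skew_3d_slow = skew_3d(slow_die1, die2)   # about 17.4 ps: dominated by the D2D shift
```

A D2D shift moves all arrival times of a die together, so the 2D skew changes only in
proportion to the shift, whereas the offset between the two dies adds directly to the
3D skew.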
The history of variation-aware 3D clock network design is short. Zhao et al. 
investigated the TSV random effects on clock skew uncertainties and analyzed the impact 
of WID and D2D process variations on 3D clock performance [34], [36]. The 
experiments indicated that a 3D clock network using multiple TSVs is able to decrease 
the clock skew variations by using fewer buffers and shorter interconnects. In addition, 
Xu et al. [38] proposed a statistical clock skew model for a regular 3D H-tree considering 
the WID and D2D variations in buffers. The use of clock TSV redundancy in a 3D clock 
network for fault-tolerant design has been explored [38]. 
This section analyzes the effects of D2D and WID variability on the clock skew in 
a 3D clock tree and presents tier-adaptive-body-biasing (TABB) – a post-silicon tuning 
method to reduce clock skew variations in 3D ICs. A system architecture is presented to 
independently sense the process variations in p-channel metal-oxide-semiconductor 
(pMOS) and n-channel metal-oxide-semiconductor (nMOS) devices using on-chip-delay-
based sensors and adapt the body bias of the nMOS/pMOS devices of each tier to 
mitigate the impact of process variations. The effectiveness of the approach is 
demonstrated through statistical simulations considering D2D and WID variations on 
example 3D clock trees with different numbers of TSVs in a predictive 45nm node. The
body bias tuning helps mitigate the effect of tier-to-tier process shifts and reduce
clock skew variations. The clock slew variation is also reduced, as the separate body
biasing for nMOS and pMOS transistors compensates for the VTH-skew between nMOS and
pMOS transistors. Moreover, it is shown that TABB helps reduce variations in the power
of the clock network and reduces the delay variability of logic paths. The application
of ABB to reduce clock skew/slew, dynamic/static power, and logic path delay variations
is a unique contribution of this research.
 
3.4.1 Analysis of 3D clock network under variations 
The 3D clock trees used in the study are generated using the synthesis method
presented by Zhao et al. [34]. Given a set of clock nodes (=clock inputs of flip-flops)
distributed over two dies and the clock source, the goal is to build a single tree that
connects all the nodes to the source so that the skew and the total power consumption
are minimized. TSVs are used to connect the nodes in different dies. The IBM r4
benchmark design, which has 1000 clock nodes, is used. The location and the input
capacitance of the clock nodes as well as the RC parasitics of clock wires, TSVs, and
buffers are given as input parameters. The input capacitance of the clock nodes in this
tree varies from 30 to 60 fF. All design and simulations are performed considering a
predictive 45nm technology [23]. The various design/simulation parameters for devices,
wires, and TSVs are shown in Table 3.3. Three different types of 3D-clock networks were
designed with 1 (Type1), 10 (Type2), and 100 (Type3) TSVs to observe the 3D clock skew
variations according to D2D variation, as illustrated in Figure 3.11(a), (b), and (c),
respectively. Each die size is 10mm x 10mm. In the clock network Type1, each die has a
complete clock network, and the two networks are connected at the clock source through
a single TSV. The clock network Type2 and Type3 have multiple TSVs and have a main
clock network in die1 (a complete 2D network) and sub-clock networks in die2. The
sub-clock networks in die2 are connected through the clock TSVs from the branches in
the middle of the main clock network in die1. The Type2 has 10 TSVs and Type3 has 100
TSVs, where the size of the sub-clock networks in Type3 is much smaller than in Type2.
Hence, the network latencies of the 100 sub-clock networks in Type3 are much shorter
than those of the 10 sub-clock networks in Type2; the clock latency in die2 is the
highest for Type1.
 
Table 3.3: Parameters used in simulation. 
Parameters                             Description
Process model                          45nm NCSU PTM model [23]
Threshold voltage (VTH)                nMOS: VTH = 0.471V; pMOS: VTH = 0.423V
Wire                                   r = 0.1 Ω/μm, c = 0.2 fF/μm
TSV                                    RC π model: RTSV = 50 mΩ, CTSV = 15fF
                                       (CTOP = 7.5fF, CBOTTOM = 7.5fF)
[D2D σ, WID σ] (VTH, wire, and TSV)

Figure 3.11: Three different types of 3D-clock networks: (a) Type1 (1 TSV); (b) Type2 




The baseline values of clock skew, which are computed under no process 
variations, are shown in Figure 3.12. From Figure 3.12(a), it can be found that 2D skews 
are independent of each other in the clock network Type1. On the other hand, 2D skews 
are similar to each other in the clock network Type2 and Type3. Since the sub-clock 
networks in die2 of Type2 and Type3 are connected from the branches of the main clock 
network of die1, the clock skew performances of sub-clock networks in die2 are affected 
Figure 3.12: Skew histogram: clock skew base line without process variations of (a) the 
clock network Type1, (b) Type2, and (c) Type3. 
 
The clock network latencies of various clock sinks in die1 and die2 are shown in
Figure 3.13 (not considering variation), and a correlation coefficient (ρ) between the
latencies of die1 and die2 is calculated. As expected from the preceding discussion, the
clock network Type1 has the lowest ρ (0.1727). On the other hand, in the clock network
Type2 and Type3, the sub-clock networks share a common path with the main clock
network. Hence, the correlation between the skew of die1 and die2 is much higher. The
correlation is the highest for Type3 as it shares the longest common paths with the main
clock network in die1. From this result, it can be conjectured that the skew performance
Figure 3.13: Correlation coefficient (ρ) between the latencies of the die1 and the die2 for 
(a) the clock network Type1, (b) Type2, and (c) Type3 not considering process variation. 
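
The correlation coefficient used above is the standard Pearson coefficient. A minimal
sketch of its computation is given below; how the sinks of die1 and die2 are paired is
not stated in the text, so pairing by list position is an assumption here, and the
latency values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical latencies (ps) of paired sinks. In the second case the die2
# latency is the die1 latency plus a fixed sub-network delay, mimicking a
# long shared path, so the correlation approaches 1.
die1_latency = [310.0, 325.0, 298.0, 340.0]
die2_shared_path = [t + 20.0 for t in die1_latency]
rho_shared = pearson(die1_latency, die2_shared_path)
```

A network in which die2 latencies are set mostly by the shared main tree gives ρ close
to 1, whereas independent trees in the two dies give a low ρ.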
 
Figure 3.14 illustrates the skew variability in the clock network Type1, Type2, and
Type3 for different D2D and WID variability. In terms of 2D skew variation, all clock
network types show the same trend: as the WID variation becomes stronger, the 2D skew
variation increases. From the results, it is concluded that the WID variation is the
dominant factor that decides the level of the 2D skew variation. The clock network Type1
showed extremely high 3D skew variation even under the low D2D variation (5% D2D
variation). Since the clock network Type1 in the die1 does not have a common path with
the clock network in the die2, it showed the worst 3D skew variation. In addition, as the
impact of D2D variation gets stronger, the 3D clock skew variations of the clock network
Type1 and Type2 increase. This implies that the D2D variation strongly impacts the skew
variations of the 3D clock network. However, as the number of clock TSVs increases (as
the common clock path gets longer), the impact of D2D variation on skew variation becomes
weaker, as illustrated in Figure 3.14: the 3D skew variation is maximum for Type1 and
minimum for Type3 clock networks. As the impact of D2D variation decreases, the impact of
the WID variation on 3D skew becomes observable. For example, the variations in 3D skew
and 2D skew are comparable for the clock network Type3 when the D2D variation is weak;
as the D2D variation increases, the 3D skew variation dominates the 2D skew variation. In
summary, an excessive number of clock TSVs reduces 3D skew variations, at the expense of
additional area overhead for TSVs and the test clock routing for separate die test. In
addition, it could also cause yield problems due to the TSV yield, since a larger number
of TSVs could lead to a higher possibility of failure in the clock network. If the D2D
variation can be compensated, possible performance loss can be minimized even
[Figure 3.14 plots: distributions of 2D skew (die1), 2D skew (die2), and 3D skew
(die1-die2) for three variation settings, with annotated skew σ values]
D2D σ = 5%, WID σ = 15%:   Type1 130.15ps, Type2 66.77ps, Type3 40.45ps
D2D σ = 10%, WID σ = 10%:  Type1 252.64ps, Type2 115.37ps, Type3 44.83ps
D2D σ = 15%, WID σ = 5%:   Type1 391.04ps, Type2 187.85ps, Type3 49.45ps
Figure 3.14: 2D and 3D skew distribution of specific points in the clock network 
according to different variations. 
 
3.4.2 Tier-adaptive-body-biasing 
Tier-adaptive-body-biasing (TABB) is proposed to compensate for the D2D
variation and reduce 3D clock skew. The basic approach is to detect the global variation
in the threshold voltage in each die. Forward body bias (FBB) is applied to a slow die to
reduce VTH and improve performance, while reverse body bias (RBB) is applied to a fast
die to increase VTH and make it slower. Independent body-bias levels are required to
compensate for the VTH shifts in nMOS and pMOS.
 
3.4.2.1 System architecture 
The system architecture of TABB is shown in Figure 3.15. Each tier includes
sensors to independently detect the threshold voltage shifts in nMOS and pMOS devices.
The variation sensors are enabled during power-up, and based on their outputs, a voltage
regulator (body-bias regulator) changes the body voltages for nMOS and pMOS
transistors in each tier separately. Note that all nMOS devices in a tier receive the
same body bias, and so do all pMOS devices in a tier. In this section, an off-chip power
management IC is assumed to generate the body-bias voltages. The body bias range for
nMOS and pMOS transistors is bounded within +0.3V and -0.3V, respectively. The
limiting factors of FBB are the increased sub-threshold leakage current and the
potential for forward-bias current through the body-to-source diode. The limiting
factors of RBB are the increase in the short channel effect and the higher junction
tunneling current in nanometer technologies. Further, two ABB options are explored.
First, both FBB and RBB are considered. However, RBB is only possible when the voltage
regulator can provide a negative voltage for nMOS transistors and a voltage higher than
VDD for pMOS transistors. Since generating a negative voltage or a voltage higher than
VDD is more complex (specifically, for on-chip generators), the option of using only FBB
and

Figure 3.15: The tier-adaptive-body bias (TABB) system. 
 
3.4.2.2 D2D variation sensor 
A D2D variation sensor based on the principle of ring-oscillator (RO) type sensors
is proposed in this research. The frequency of a ring oscillator changes due to process
variations, and this signature can be detected using a counter. An RO type sensor can be
easily implemented with digital components. Its outputs are also digital and hence can
be easily utilized in digital systems [42]-[47]. The effects of WID variations can be
minimized by increasing the number of stages in the ring oscillator, which helps average
out the random WID variation across the stages [27]. However, tier-adaptive-body-biasing
requires the D2D variations of nMOS and pMOS devices to be detected independently.
Since the delay of an RO is affected almost equally by nMOS and pMOS transistors, it is
difficult to determine the VTH shifts in nMOS and pMOS devices separately. This could
result in an incorrect assignment of nMOS and pMOS body-biases and hence a reduced
effectiveness. Further, in a clock network, a larger difference between the effective
strengths of nMOS and pMOS devices can worsen the clock slew rate. Iizuka et al. have
proposed an effective all-digital method to measure the performance variation of nMOS
and pMOS devices separately by counting the number of pulses vanishing to 0 or 1 in a
buffer ring [47], [48]. However, this method requires an additional calculation process
to solve equations for obtaining the final results. The method proposed by Zhang [49] for
characterizing the rise and fall times of standard cells includes analog circuits and a
complex measurement procedure.
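
The averaging effect of a longer ring oscillator on random WID variation can be
illustrated with a small Monte Carlo experiment. This is a simplified sketch, not the
sensor model used in this work: each stage delay is assumed to vary linearly and
independently with a Gaussian WID term, and the function names and parameter values are
hypothetical.

```python
import random
import statistics


def ro_period(rng, n_stages, d2d_shift, wid_sigma, t_stage=1.0):
    """Oscillation period (arbitrary units) as the sum of per-stage delays.

    Every stage sees the same D2D shift plus its own random WID term.
    """
    return sum(
        t_stage * (1.0 + d2d_shift + rng.gauss(0.0, wid_sigma))
        for _ in range(n_stages)
    )


def normalized_period_sigma(n_stages, wid_sigma=0.10, trials=2000, seed=1):
    """Std. deviation of the per-stage-normalized period caused by WID alone."""
    rng = random.Random(seed)
    periods = [
        ro_period(rng, n_stages, 0.0, wid_sigma) / n_stages
        for _ in range(trials)
    ]
    return statistics.pstdev(periods)


sigma_short = normalized_period_sigma(4)    # roughly wid_sigma / sqrt(4)
sigma_long = normalized_period_sigma(64)    # roughly wid_sigma / sqrt(64)
```

Because the WID terms are independent, their contribution to the normalized period
shrinks roughly as one over the square root of the number of stages, while a D2D shift,
common to all stages, is preserved in full.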
In this section, the RO type D2D sensor is modified to sense the delay variation of
nMOS and pMOS transistors separately without a post-calculation process, a complex
detection process, or sophisticated analog circuits (Figure 3.16). The nMOS variation
sensor is composed of inverters with a pull-down network of stacked long-channel
nMOS (Wn/Ln) transistors and a pull-up network with a single pMOS transistor
(Wp0/Lp0). When the enable signal (EN) is high, the nMOS variation sensor oscillates
with a frequency that is a strong function of the speed of the nMOS transistors. This is
because, owing to the higher stack height, the fall time through the pull-down network
dominates the rise time. For the pMOS variation sensor, the inverter is composed of a
pull-up network with stacked long-channel pMOS (Wp/Lp) transistors and a pull-down
network with a single nMOS transistor (Wn0/Ln0). In this case, the rising time
Figure 3.16: The modified RO-based (a) nMOS and (b) pMOS variation sensors. 
 
Figure 3.17(a) shows the correlation between the nMOS speed and the sensor
output, which increases with an increase in the nMOS channel length and the stack height.
As shown in Figure 3.17(c) for the nMOS sensor, at A, the measured correlation factor
was 0.280 with a short channel length (50nm) and one transistor stack. At B, with a long
channel length (250nm) and a 2-transistor nMOS stack, the correlation factor increases
to 0.915. Likewise, Figure 3.17(b) shows that a longer channel length and a higher stack
height of pMOS transistors increase the correlation between the pMOS process corner and
the sensor output. From A to B in Figure 3.17(b), the correlation factor increases from
0.69 to 0.918. The correlation factor can be further increased by increasing the number
of stages. Next, the size of the pull-up pMOS in the nMOS variability sensor and the size
of the pull-down nMOS in the pMOS variability sensor are optimized to improve the
correlation factor. For the nMOS variation sensor, if the pMOS transistor is too small,
the pull-up delay becomes high. Thus, the pMOS speed introduces noise at the sensor
output. On the other hand, when the pMOS transistor is too large, the contention between
the pull-down and pull-up networks becomes high. This also degrades the sensitivity of
the
Figure 3.17: The correlation between the normalized nMOS or pMOS delay impacted by 
D2D variation and the normalized output of (a) the nMOS sensor and (b) the pMOS 
sensor according to the channel length size and the transistor stack; and (c) the detail 




3.4.3 Simulation results 
This section presents statistical simulation results to demonstrate the effectiveness 
of TABB. Monte-Carlo (MC) simulations were conducted for the clock network Type1, 
Type2, and Type3. The simulations also include 3 different combinations of D2D 
variations and WID variations: 1) when [D2D σ,WID σ] are [5%, 15%], it indicates a 
process of higher WID variations than D2D variations; 2) when [D2D σ,WID σ] are 
[10%, 10%], it implies a process with equal WID variations and D2D variations; and 3) 
when [D2D σ,WID σ] are [15%, 5%], it indicates a process of higher D2D variations than 
WID variations. For each MC simulation point, nMOS and pMOS variation sensors 
generate digital codes for the global nMOS and pMOS process corners. The body-bias 
levels for each tier are selected accordingly. Figure 3.18 shows a summary of the outputs 
of pMOS and nMOS variation sensors for the case of 15% WID variation and 5% D2D 
variation.  
Figure 3.18: The body-bias assignments according to the sensor outputs of the pMOS and 
the nMOS variation sensors considering 15% WID variation and 5% D2D variation with 
50mV resolution with (a) FBB/RBB and (b) FBB/ZBB. 
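
The assignment of a body-bias level from a sensor reading can be sketched as a
quantize-and-clamp step. The mapping from sensor output to bias is not given in closed
form in the text, so the linear mapping, the `gain` parameter, and the use of a
normalized delay ratio as the sensor reading are assumptions of this example; only the
50mV resolution, the 0.3V bound, and the FBB/RBB and FBB/ZBB options come from the text.
Positive values denote forward bias and negative values reverse bias, as a signed
magnitude.

```python
STEP = 0.05    # V, body-bias resolution (from the text)
LIMIT = 0.30   # V, maximum forward or reverse bias magnitude (from the text)


def body_bias(delay_ratio, gain=1.0, allow_rbb=True):
    """Return a signed body-bias level in volts for one device type of one tier.

    delay_ratio: measured sensor delay divided by its nominal value
                 (> 1 means a slow die, < 1 a fast die).
    gain:        hypothetical volts of bias per unit of delay deviation.
    allow_rbb:   True for the FBB/RBB option, False for the FBB/ZBB option.
    """
    raw = gain * (delay_ratio - 1.0)        # slow die -> positive (forward) bias
    bias = round(raw / STEP) * STEP         # quantize to the 50mV grid
    bias = max(-LIMIT, min(LIMIT, bias))    # clamp to the allowed range
    if not allow_rbb:
        bias = max(0.0, bias)               # fast dies receive zero body bias
    return bias
```

With these assumed settings, a die that is 10% slow receives about 0.10V of forward
bias, a die that is 10% fast receives about 0.10V of reverse bias in the FBB/RBB option
and zero bias in the FBB/ZBB option, and extreme readings saturate at 0.3V.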
 
The scenarios of (i) applying both FBB and RBB (Figure 3.18(a)) and (ii) applying
only FBB and ZBB (Figure 3.18(b)) are considered. According to the sensor outputs,
different body biases are applied with 50mV resolution. Note that this resolution is
well within the capabilities of common voltage regulators (e.g., 6mV~12mV resolution
[25]). Figure 3.19 shows the histogram of body bias assignments of die1 and die2
Figure 3.19: The histogram of the body bias assignments of the die1 and the die2 




3.4.3.1 Effect of TABB on the clock skew variation 
With different body biasing conditions (without TABB, TABB with FBB/RBB, or
TABB with FBB/ZBB), the trends of the mean, maximum, and standard deviation (σ) of
clock skew in the clock networks are observed while changing D2D σ and WID σ.
Observations for the clock network Type1, Type2, and Type3 are summarized in Figure
3.20, Figure 3.21, and Figure 3.22, respectively.
3.4.3.1.1 Effect of TABB on 2D skew 
Higher WID variation increases the variability in 2D skew. However, even
without any TABB, the effect is generally weak. Note that the impact of WID variations
on 2D clock skew can be further reduced if the size of the clock driver transistors
increases. Generally, the buffers in a clock network are designed with transistors that
are larger than minimum-sized. When TABB is applied with FBB/RBB, a marginal reduction
in the mean skew but a comparatively larger reduction in the standard deviation of skew
and in the maximum skew are observed. TABB is more effective in reducing the mean, the
standard deviation, and the maximum 2D skew when the D2D variation becomes higher. This
is because the effect of the WID variation is more severe at worse global VTH corners,
and D2D variation compensation with TABB helps reduce 2D skew variations. Further, it
is observed that TABB with only FBB gives marginally better benefits for 2D skew than
TABB with FBB/RBB. This is because the effect of the WID variation on 2D skew is
stronger for slow (high VTH) dies (slow dies have higher delay sensitivity than fast
dies). Since FBB compensates for variations in slow dies, FBB can be more effective in
reducing 2D skew σ.
3.4.3.1.2 Effect of TABB on 3D skew 
Without TABB, the D2D variation strongly affects 3D skew. A higher D2D
variation results in a significant increase in the mean, the standard deviation, and the
maximum skew. As TABB reduces the D2D variation, it helps reduce the mean, the
maximum value, and the standard deviation of 3D skew significantly. As expected, the
effectiveness of TABB is stronger when the D2D variation is larger. TABB with
FBB/RBB is more effective in reducing 3D skew compared with TABB with only FBB.
This is because using both FBB and RBB results in a better compensation of the D2D
variation than FBB alone. The advantage of using both FBB/RBB is more pronounced
under higher D2D variations. However, this observation is reversed when the maximum
3D skew of clock network Type3 is considered [Figure 3.22(c)]. The reason is discussed
in the next section.
3.4.3.1.3 Effect of TABB on different types of clock network 
TABB shows a consistent effectiveness for different types of clock networks.
Different clock networks showed similar results for 2D skew performance. As explained
earlier, the characteristics of 2D skew in die1 and die2 are very different for the clock
network Type1. TABB has a similar impact on 2D skew for both dies in Type1. For the
clock network Type1, 3D skew variations due to D2D variations dominate 2D skew
variations in each die. TABB with RBB/FBB significantly reduces 3D skew variations,
and hence the overall skew variations in the network Type1. Due to this factor, TABB is
most effective for the clock network Type1, which has only one TSV. As the number of
TSVs in the clock network increases, however, the effectiveness of TABB reduces, and
the least impact is observed for the clock network Type3 (100 TSVs). This is because in
the network Type3, the sub-networks in the die2 have the longest common path with the
main clock network in the die1. This causes the clock skews in the two dies to become
increasingly correlated and primarily determined by the skew variations in the main
clock network in the die1. Therefore, the effectiveness of TABB reduces, as only the
adaptive body biasing for the die1 remains important. It is also observed that
variations in 2D skew and 3D skew become comparable. For the clock network Type3,
FBB/ZBB achieved a higher reduction in skew variations. Since the clock network Type3
has small sub-networks in the die2 (the clock sub-networks in the die2 have the maximum
shared clock path with the main clock network in the die1), it is affected less
significantly by D2D variations than the clock network Type1 and Type2. As the impact of
the D2D variation gets weaker, the WID variation shows a stronger impact on skew
performance. Thus, FBB/ZBB could achieve a higher gain than FBB/RBB since making the
path delay shorter helps reduce delay variation. In summary, FBB/RBB reduces skew
variations more when the skew variation is a strong function of the D2D variation. On
the other hand,
Figure 3.20: Results of TABB on the clock network Type1 considering D2D and WID 
[Figure legend: cases 1, 2, and 3 correspond to [D2D σ, WID σ] = [5%, 15%], [10%, 10%],
and [15%, 5%]; series are no TABB, TABB (FBB/RBB), and TABB (FBB/ZBB).]
Figure 3.21: Results of TABB on the clock network Type2 considering D2D and WID 
[Figure legend: cases 1, 2, and 3 correspond to [D2D σ, WID σ] = [5%, 15%], [10%, 10%],
and [15%, 5%]; series are no TABB, TABB (FBB/RBB), and TABB (FBB/ZBB); panels show 2D
skew (die1), 2D skew (die2), and 3D skew (die1/die2).]
Figure 3.22: Results of TABB on the clock network Type3 considering D2D and WID 





3.4.3.1.4 Effect of TABB on clock-slew rate 
The effect of TABB on the variability in clock slew rate is studied. Figure 3.23(a)
shows the clock slew rate according to different threshold voltage variations (∆VTHN and
∆VTHP) of nMOS and pMOS transistors. It can be observed that there exist significant
variations in the clock slew rate depending on the process shifts, even when opposite
VTH shifts in pMOS and nMOS devices result in similar clock network latency (i.e.,
minimal skew). As shown in Figure 3.23(b)-(c), FBB/RBB or FBB/ZBB can effectively
reduce the variations in the clock slew rate. This implies that applying separate body
biases to nMOS and pMOS transistors helps better compensate for variations in circuit
parameters such as the clock slew rate, which are sensitive to VTH-skew. Reducing the
clock slew rate variation is important as slew can significantly impact the timing
characteristics (i.e., setup time
Figure 3.23: (a) Clock slew rate without TABB; (b) clock slew rate with FBB/RBB; (c) 
clock slew rate with FBB/ZBB according to VTHN and VTHP skew. 
 
3.4.3.2 Effect of TABB on overall performance 
The results in the previous sections show that TABB reduces the mean, the
standard deviation, and the maximum values of 2D and 3D skew under D2D and WID
variations. However, as the bodies of all devices in clock buffers and logic gates are
shared, TABB also affects the delays of data paths. This is particularly true for nMOS
devices (assuming a non-triple-well process). Hence, it is necessary to consider the
impact of TABB on logic paths as well. This research evaluates the effect of TABB on two
2D data paths (the whole path is within a single die) and one 3D data path (the data
path occupies two dies and uses 5 TSVs) of the 3D design. For a data path, the absolute
delay is important. Thus, D2D variation increases the delay variation of both 2D and 3D
logic paths (Figure 3.24).
[Figure 3.24 plot residue: paths 2D path0, 2D path1, and 3D path; annotated values
-6.99% and 2.18%; panel (a)]
Figure 3.24: Results of TABB (FBB/RBB or FBB/ZBB) of the data paths (two 2D paths 
and one 3D path) according to D2D and WID variations: (a) mean delay; (b) delay 
variation. 
 
Further, it is observed that the delay σ/μ of the 3D path is smaller than that of the 2D
paths. This is because the independent D2D variations of the two dies can partially offset
each other, thereby reducing the overall delay variation [27]. TABB with FBB/RBB
significantly reduces delay variation but has a marginal impact on the mean delay. The
reduction in the delay spread is smaller when TABB with only FBB is considered. Both 2D and
3D data paths experience a significant reduction in delay variation with TABB. In summary,
TABB reduces variability in both clock skews and logic path delays, thereby significantly
reducing the chip-to-chip variability in the performance of 3D ICs.
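
The averaging argument for the 3D path can be checked with a small Monte Carlo sketch. It is
an illustrative model only, not the simulation setup used in this research: each die receives
an independent Gaussian D2D delay scale factor (the 5% sigma is an assumed value), a 2D path
lies entirely on one die, and a 3D path is split equally across two dies. TSV delay and WID
variation are ignored, and all function names are hypothetical.

```python
import random
import statistics


def sigma_over_mu(samples):
    """Return the coefficient of variation (sigma/mu) of a list of delays."""
    return statistics.pstdev(samples) / statistics.fmean(samples)


def simulate(n_chips=20000, nominal_delay=1.0, d2d_sigma=0.05, seed=1):
    """Compare a 2D path and a 3D path under independent D2D variation.

    Assumed model: every die has its own Gaussian delay scale factor,
    and the 3D path spends half of its delay on each of two dies.
    """
    rng = random.Random(seed)
    delays_2d, delays_3d = [], []
    for _ in range(n_chips):
        die0 = rng.gauss(1.0, d2d_sigma)  # D2D factor of the first die
        die1 = rng.gauss(1.0, d2d_sigma)  # D2D factor of the second die
        delays_2d.append(nominal_delay * die0)
        delays_3d.append(nominal_delay * 0.5 * (die0 + die1))
    return sigma_over_mu(delays_2d), sigma_over_mu(delays_3d)


cv_2d, cv_3d = simulate()
print(f"2D path sigma/mu: {cv_2d:.4f}")
print(f"3D path sigma/mu: {cv_3d:.4f}")
```

Under these assumptions the 3D ratio approaches 1/sqrt(2) of the 2D ratio, which is the
partial cancellation described above; correlated D2D variation between dies would reduce
the benefit.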
3.4.3.3 Impact of TABB on area and power of clock network 
In the TABB architecture, the power overhead of the sensors can be neglected since the
nMOS and pMOS variation sensors are activated only once during an initial boot-up
sequence. However, the impact of TABB on the power overhead of the clock and logic paths
must be analyzed carefully. FBB/ZBB causes slow logic gates to switch faster, which could
help reduce the short-circuit current that flows when both the pMOS transistor and the
nMOS transistor are on, because a faster transition shortens the time during which both
transistors conduct. On the other hand, FBB increases the sub-threshold leakage current as
well as the potential for forward-bias current through the body-to-source diode. Overall,
the mean power overhead with FBB/ZBB was 0.47% ~ 0.49% of the total clock network power.
With FBB/RBB, the average power consumption was reduced by 1.45% for all clock network
types as shown in Figure 3.25(a). Although FBB could increase the average power, RBB helps
reduce the excessive leakage current. Thus, in terms of the total power (dynamic and
leakage power), FBB/RBB slightly reduced the mean total power of the clock networks. The
variation in total power, on the other hand, reduces significantly (~40.59%) if TABB with
FBB/RBB is used. This is because FBB increases the total power for slow dies, and RBB
decreases the total power for fast and leaky dies. In contrast, FBB/ZBB decreases the total
power variation by only 9.62%. Since FBB/ZBB works only for slow dies while FBB/RBB works
for both slow and fast dies, FBB/RBB reduces the power variation more. Further, as the total
power variation is significantly affected by the D2D variation, the reduction is higher
























































Figure 3.25: Results of TABB on the clock power considering D2D and WID variations: 
(a) mean power; (b) power variation. 
 





, respectively. The size of the sensors becomes negligible as the chip size gets
bigger. Assuming a local sensor in a 1 mm² local area (1000 μm x 1000 μm), the area
overhead from sensors becomes 0.033%. Because the current in the transistor body is at
least two orders of magnitude smaller than the supply current, the cost of body bias
routing is significantly less than that of the power grid [3]. Previous works have reported
that the area overhead of body bias routing is less than 2% of the total chip area. The area
overhead was estimated from a test layout as shown in Figure 3.26. TAP cells for separate
body contacts (substrate and n-well contacts) and routing were inserted every 30 µm.
The feasible width of a TAP cell under the 45 nm design rules (DRC) is 0.35 µm, from
which the area overhead can be estimated considering body contacts and routing. The
estimated overhead is 1.17%.
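
The area figures above can be reproduced with simple arithmetic. The sketch below assumes
that the TAP overhead is approximated by the ratio of TAP cell width to insertion pitch,
which matches the reported 1.17%; the actual estimate was taken from the test layout. The
sensor footprint printed here is only implied by the 0.033% figure and is not stated in the
text. Function names are hypothetical.

```python
def tap_area_overhead(tap_width_um, tap_pitch_um):
    """Fraction of a cell row occupied by TAP cells placed at a fixed pitch."""
    return tap_width_um / tap_pitch_um


def sensor_area_um2(overhead_fraction, region_side_um):
    """Sensor area implied by an overhead fraction of a square local region."""
    return overhead_fraction * region_side_um ** 2


tap = tap_area_overhead(0.35, 30.0)        # 0.35 um TAP cell every 30 um
sensor = sensor_area_um2(0.00033, 1000.0)  # 0.033% of a 1 mm x 1 mm region

print(f"TAP area overhead: {tap * 100:.2f}%")
print(f"Implied sensor area: {sensor:.0f} um^2")
```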
The measured power consumptions of the nMOS and the pMOS sensors are
24.78 µW and 26.93 µW, respectively, at typical conditions (1.0V supply and 27ºC
temperature). The power overhead is 0.49% of the clock network
Type1 power at 1.0V supply, 27ºC temperature, and 100MHz clock input. Considering
logic power, this overhead becomes much smaller. In addition, this power overhead






[Figure 3.26: test layout used for the area estimate; layout labels: NWELL contact + Via1, SUB contact + Via1, NWELL]


















As 3D technology matures, variation-tolerant design methodologies for 3D
ICs will continue to be an important challenge. This research explores design
methodologies for applying post-silicon tuning techniques to 3D ICs. This thesis
presented TAVS as a post-silicon tuning methodology that reduces the delay variability of
3D ICs by tuning the supply voltages of different tiers. The analysis results show that TAVS
can improve the delay distribution of both 2D and 3D critical paths. TAVS can be
beneficial for 3D ICs that have separate functional blocks with different clock
networks in each tier. For the case in which a function is split across different
tiers with a single clock network, design methodologies for TABB are proposed. TABB
performs post-silicon tuning of 3D clock trees to reduce variability. TABB can
minimize the skew and slew variability of 3D ICs by applying adaptive body biases to
different tiers. The analysis results show that TABB can improve system performance
by reducing variability in clock skew and slew rate as well as logic path delay. TABB is
effective in reducing the clock skew variability in all types of 3D clock networks, but the
effectiveness varies mainly with the number of TSVs used. The maximum








With increasing process and environmental variations in deep-submicron
technologies, meeting performance specifications with a limited power budget becomes a
critical challenge [52]-[54]. The traditional worst-case corner based design introduces a
“safety margin” (e.g., operating at a lower frequency or a higher voltage) to tolerate
variations. Although the safety margin helps tolerate variations, it leads to appreciable
performance loss or power overhead [52]. This has led to investigations into adaptive
design methods that compensate for detected variations [56]-[61]. An attractive approach is
to utilize critical path replica circuits whose delays are strongly correlated with the
critical path delays of the actual logic block [54]. Based on the delay of the replica
circuits, the operating condition, i.e., clock frequency and supply voltage, can be changed
adaptively. The main purpose of this dynamic adaptation is to minimize the safety
margin considering time-variant variations. To change the clock frequency or supply
voltage adaptively in response to dynamic environmental noise, detection of dynamic
variations must precede the change of operating condition. However, if any noise is
detected, the variation has already occurred. Thus, under dynamic variations, some
safety margin is still required for a circuit to operate without any error. The extent of
the required safety margin depends on the speed of the noise and the control time from
noise detection to adaptation. As a result, fast detection and adaptation is the main
focus for minimizing the safety margin under dynamic variations.
This chapter presents two circuit techniques based on replica circuits for
adaptation to time-dependent dynamic variations, such as supply voltage variations,
temperature variations, and aging effects. First, this chapter presents a method to adapt
to fast transient supply noise by modulating the system clock and the local clock. The
proposed method enables direct clock modulation from the replica circuits and hence allows
within-a-cycle frequency modulation. As a result, it enables fast clock adaptation to fast
transient noise and minimizes the safety margin for supply noise. Second, a methodology
for generating a supply voltage level that is used as the reference voltage for on-chip
regulators is presented. The voltage reference is the target level at which a circuit can
operate without timing errors. The proposed adaptive voltage generator provides a
cost-effective on-chip solution to minimize the voltage guard band under process
variations, aging effects, and temperature variations. Additionally, the proposed work can
generate a target voltage at a given target frequency. This implies that automated
dynamic-voltage-frequency-scaling (DVFS) can be achieved considering dynamic variations.
 
4.2 Adaptive clock modulation 
In digital systems, power supply noises (IR and Ldi/dt) affect logic path delays 
requiring careful characterization of the timing uncertainties [55]. Higher timing and/or 
voltage margins are required to ensure error free operation considering the worst-case 
power supply noise [56]. But higher margins reduce average performance and limit 
voltage scalability. Therefore, tolerance to transient supply noise is a major challenge to 
achieving higher performance at a low voltage while ensuring error-free operation [57]-
[59].  
Modulation of the clock frequency only during noise events can provide better
average performance while preventing timing errors [58]-[61]. Kurd et al. have proposed
sensing voltage droops and modulating the clock frequency by controlling the
voltage-controlled oscillator (VCO) [58]. However, the response time is limited by the
round-trip delay from noise sources to the VCO. Further, as the adaptation is performed at
the clock source only, the clock is not adapted against local droops. Alternative approaches
employ clock stretching mechanisms in the clock buffers, without changing the clock
source, to exploit the clock-data compensation phenomenon [59]. Jiao et al. have proposed
phase modulation in a VCO to effectively match the clock phase and the data path delay
change [60]. However, if the clock source is not changed, stretching is followed by clock
period contraction, which can have a negative impact unless the changes of the clock phase
and the data path delay are matched. In addition, modulation is performed only in
response to the first-droop noise, not to DC and local noise, since the modulation is
coupled with global AC noise. In general, modulating the clock frequency by sensing the
voltage droop is an indirect approach, as the correlation between the critical path delay
and the transient voltage droop is not accurately captured. Therefore, modulating the clock
frequency using a delay sensor based on critical path replica circuits is a more direct and
accurate method to track the impact of noise on timing [61]. However, the delay through
the sense-and-correct control loop makes it difficult to adapt to fast (cycle-by-cycle)
local noise.
To address the preceding challenges, this research presents an all-digital adaptive 
clocking method to prevent timing errors under global/local supply noise. The proposed 
all-digital global modulator (GM) and local modulators (LM) integrate the critical path 
replica within the modulator, instead of using the replica as a sensor, to directly generate 
global/local clocks with periods determined by the replica delay (Figure 4.1). The GM 
modulates the input (PLL) clock to generate the system clock (CKG) with a range of 
discrete frequencies; and the LM modulates CKG in a limited but continuous frequency 
range to generate the local clock (CKL). The replica based direct approach ensures the 
clock period accurately tracks delay variations in the critical path due to supply noise. 
The integration of the replica within the modulator reduces the delay of the sense-and-
adapt loop facilitating adaptation to both fast and slow transient noise.  The key features 
of the proposed approach are: (i) fast (within a cycle) adaptation of clock to protect 
 






 droops); and (ii) adaptation against both 







































Figure 4.1: Overview of the proposed clock modulation approach. 
 
4.2.1 Clock modulation methodology 
4.2.1.1 Global modulation 
The GM uses two identical replicas that are activated alternately (Figure 4.2).
The basic idea is to generate CKG directly from the replica with a period (TCKG)
determined by the replica delay (TP). The timing diagram of the operation is illustrated in
Figure 4.3 with the key events marked with numbers (1 to 6). AR is set only after the
delay (TP) of one activated replica (S01 or S11 is set). (1) If AR is set and CKPLL is low,
EN is set to generate a clock pulse (CKS). (2) The CKS pulse triggers the rising edge of CKG
and, in turn, toggles S00 and S10 (de-activates one replica and activates the other
replica). (3) This de-activation of one replica clears AR. The activated replica starts
propagating the high signal. (4) S0h or S1h, the intermediate node of the replica, is set
first after a propagation delay of TH (< TP). The rising edge of S0h or S1h generates a
pulse (DP) which clears CKG, i.e., the rising edge of S0h or S1h results in the falling edge
of CKG and determines the duty of CKG. (5) After the replica delay (TP), AR is set again and
the rising edge of CKG is generated. Thus, the clock period (TCKG) tracks TP. (6) If TP
increases due to supply noise, TCKG increases as well. The control granularity of TCKG
depends on the clock period of CKPLL (TPLL) as it defines the sampling time points for
AR. A higher input clock frequency results in finer-grained control at the expense of
increased power dissipation in the modulator. The duty of CKG can be changed by
programming the delay (TDT) in the DP path. TP is also variable, to compensate for
mismatches between the replica and the real critical path delay, by changing the delay





































































4.2.1.2 Local modulation 
The basic structure of the LM is similar to the GM (Figure 4.4). The operation of 
the LM depends on the duty modulation window (TDM). TDM is the time difference 
between CKG and CKD. TDM can be changed by programming the delay chain in CKD 
path. If TDM is 0 [clock gating mode or duty modulation (DM) off mode], the LM gates 
the input clock pulse when the increased replica delay is greater than the input clock 
period (Figure 4.5(a)). If TDM is not 0 [duty modulation (DM) on mode], the local clock 
pulse (CKL) is generated only when the critical path delay is smaller than TCKG+TDM 
(Figure 4.5(b)). Otherwise, the input clock pulse (CKG) is gated. However, if TP is 
between TCKG and TCKG + TDM, the clock duty is modulated and the effective clock period 
is changed. To prevent a false glitch at CKL, the propagation delay from CKL to AR reset 
(CKG-to-CKL delay, the toggle flip-flop delay, and-or gate delay, and the delay of the 






















[Figure 4.4: local modulator; TDM is the programmable duty modulation window size: TDM = 0 in DM off mode, and 0 < TDM at DM(1X) < TDM at DM(2X) in DM on mode]




































Figure 4.5: Timing diagram of the local modulator: (a) DM off and (b) DM on. 
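
The three LM outcomes described in this section (pass, duty modulation, and gating) can be
summarized in a behavioral sketch. This is a simplified per-cycle classification, not the
circuit itself: the function name, the string labels, and the handling of the exact
boundary cases are assumptions, and the cycle-to-cycle accumulation of edge delay is not
modeled.

```python
def lm_action(t_p, t_ckg, t_dm):
    """Classify the local modulator response for one input clock cycle.

    t_p   : replica (critical path) delay
    t_ckg : period of the input clock CKG
    t_dm  : duty modulation window; 0 means DM off (clock gating mode)
    All arguments use the same time unit.
    """
    if t_p <= t_ckg:
        return "pass"            # replica finishes within the clock period
    if t_p <= t_ckg + t_dm:
        return "duty_modulated"  # rising edge delayed inside the DM window
    return "gated"               # replica too slow: input pulse is gated


# Illustrative values in ns; the 0.5 ns window is an assumed number.
print(lm_action(9.5, 10.0, 0.0))
print(lm_action(10.3, 10.0, 0.0))
print(lm_action(10.3, 10.0, 0.5))
print(lm_action(10.8, 10.0, 0.5))
```

With TDM = 0 the middle branch can never be taken, so the same function covers the clock
gating mode.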
 
4.2.1.3 Design considerations 
As TP approaches n∙TPLL in the GM and n∙TCKG (DM off) or n∙TCKG+TDM (DM on)
in the LM, the latches used in the clock gating circuits of the modulators could enter a
metastable condition. The metastability can incur timing variations in the rising edge of
the output clock (CKG and CKL). Although it is impossible to eliminate metastability, the
probability of metastable conditions is reduced by using Schmitt-trigger and
sense-amplifier based latches, both of which have high gains, in the GM and the LM (Figure
4.6). The metastable point of the sense-amplifier is designed to be lower than the logic
threshold of the Schmitt-trigger. Even if metastability occurs in the sense-amplifier,
the Schmitt-trigger filters out any signal lower than its logic threshold. The
intermediate level (near the metastable voltage) of the clock is therefore unlikely to
propagate through the clock network. Nonetheless, the rising edge of CKG and CKL can be
delayed due to metastability in A1 and A2 when the critical path delay is close to the
generated clock period. Although this delay introduces jitter in the current cycle, it
effectively extends the current clock period, which is beneficial for the timing
margin in data paths. The next clock period is also controlled to be longer than the
replica delay in that period, as the replica circuits are activated based on the current
clock edge and the next rising edge of the clock is determined by the delay of that
activated replica circuit. In summary, even if the current clock edge is delayed due to
metastability, timing errors are prevented in both the current and the next clock cycle.
However, the clock edge uncertainty can lead to challenges in synchronizing different
blocks operating with





















A1 in the GM
A2 in the LM
 
Figure 4.6: Clock gating circuits (A1 and A2 block). 
 
4.2.2 Test chip and measurements 
A test chip is implemented in 130nm CMOS to validate the adaptive clocking 
scheme as shown in Figure 4.7. The design includes a voltage controlled oscillator 
(VCO), a serial peripheral interface (SPI), control registers, a frequency counter, an error 
counter, 5-stage test pipelines, and a noise injector. The noise injector is comprised of 
nMOS transistors, which draw instantaneous current from the power supply and incur 
 
voltage noises in the supply when an enable pulse is applied as shown in Figure 4.7. The 
VCO generates CKPLL, which is used for the input clock of the GM. The frequency of the 
VCO is controlled by the external VCO control voltage (VCTRL). The output of the GM is 
used for the input clock of the LM. The LM generates the local clock for the test pipeline. 
The different modes, namely, DM(1X) and DM(2X), represent different duty modulation
windows (TDM) and are determined by the programmable delay chain in the CKD path, which
sets the delay between CKG and CKD. In this design, TDM for DM(1X) corresponds
to a 6-inverter-chain delay and TDM for DM(2X) corresponds to a 10-inverter-chain delay for
duty modulation. The performance of the test pipeline with clock modulation was








































Figure 4.7: Block diagram of the implemented system. 
 
4.2.2.1 Effect of DC voltage variation 
Figure 4.8(a) shows the measured clock frequency generated by the GM at
different supply voltages (with the LM turned off) to observe its ability to adapt the
frequency to DC voltage changes. It can be observed that the GM automatically generates the
output clock frequency at which the test pipeline can operate without timing errors over a
wide voltage range (0.74V~1.3V). Within the voltage range of a frequency transition, i.e.,
when TP is almost equal to n·TPLL, two different clock frequencies are observed [Figure
4.8(b)]. As shown in the frequency transition region in Figure 4.8(b) (the zone marked A in
Figure 4.8(a)), a small variation in TP can make TCKG either 10ns or 11ns.
Since two different clock frequencies are found, the effective clock frequency is defined
as FEFF = NCNT × FREF, where FREF is the external input clock (100kHz) and
NCNT is the counter value of the frequency counter. Thus, FEFF can be calculated with
0.1MHz resolution.
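
The behavior in the transition region can be illustrated with a short sketch. It rests on
two assumptions that go beyond the text: the GM period is idealized as the smallest
multiple of TPLL that is not shorter than TP (with TPLL = 1 ns inferred from the 10 ns and
11 ns steps), and the frequency counter is taken to count clock cycles over one period of
the 100 kHz reference. Function names are hypothetical.

```python
import math


def gm_period_ns(t_p_ns, t_pll_ns=1.0):
    """Idealized GM output period: smallest multiple of T_PLL that is >= T_P."""
    return math.ceil(t_p_ns / t_pll_ns) * t_pll_ns


def effective_frequency_mhz(periods_ns):
    """Average clock frequency over a sequence of clock periods."""
    return 1e3 * len(periods_ns) / sum(periods_ns)


def counter_value(f_eff_mhz, f_ref_mhz=0.1):
    """Whole cycles counted during one reference period (assumed counter model)."""
    return int(f_eff_mhz / f_ref_mhz)


# Near the transition, T_P jitters around 10 ns, so both periods appear.
periods = [gm_period_ns(9.8)] * 50 + [gm_period_ns(10.2)] * 50
f_eff = effective_frequency_mhz(periods)
n_cnt = counter_value(f_eff)

print(f"T_CKG values: {sorted(set(periods))} ns")
print(f"F_EFF = {f_eff:.3f} MHz, N_CNT = {n_cnt}")
```

An equal mix of 10 ns and 11 ns periods gives an average between 90.9 MHz and 100 MHz,
which is why a single effective frequency is more informative than either observed value.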
Figure 4.8(c) shows the measured effective clock frequency generated by the LM,
with the GM off, as the DC voltage changes with a direct input clock (100MHz).
When DM is off, two different frequencies (100MHz or 50MHz) are observed in the
frequency transition region. When DM(1X) is on, the effective frequency slowly reduces
to 50MHz as the supply voltage reduces. Although the replica circuit delay is higher than
the input clock period, the LM with DM on modulates the clock period by reducing the
clock duty within the duty modulation window (TDM). Since the rising clock edge is
determined by the delay of the replica circuits, the rising edge of the clock keeps being
delayed within the DM window. Eventually, the clock pulse is gated. Because of this
behavior, the measured effective clock frequency changes in discrete steps according
to the number of gated clock pulses. When DM(2X) is on, the effective frequency








[Figure 4.8 in-plot label: measured frequency change points (16 chips)]






































































































                       (a)                                            (b)                                          (c) 
Figure 4.8: Measured frequency modulation results of the GM and the LM under DC 
voltage shift; (a) clock modulation of the GM; (b) the effective clock frequency of the 
GM at the frequency transition region; (c) the effective clock frequency of the LM at the 
frequency transition region. 
 
The above observation is further clarified in Figure 4.9, which shows the
characteristics of the duty modulated clock. Figure 4.9(a) shows waveforms of the duty
modulated clock [DM(1X) mode] at voltages around the frequency transition region. At an
812 mV supply, the output clock frequency is the same as the input clock frequency of
the LM. As the supply voltage decreases further (~800mV), the clock frequency is
modulated by modulating the clock duty. Reducing the voltage further (~790mV)
eventually causes the AR signal to be set outside the duty modulation window,
resulting in gating of a clock pulse. The duty modulation amount is limited by TDM, and if
the duty keeps reducing in successive cycles, the clock is eventually gated as in Figure
4.9(a). If the voltage is reduced further (~784mV), clock gating becomes more frequent.
The frequency characteristics of the duty modulated output clock are summarized in
Figure 4.9(b). The 100MHz clock frequency can be observed around the frequency transition
region (790mV~810mV). Below 780mV, the 100MHz frequency disappears and the duty
modulated clock determines the maximum frequency. The minimum clock frequency is
higher than 50MHz in the frequency transition region as the clock is gated after clock






















































                               (a)                                                                      (b) 
Figure 4.9: Characteristics of the duty modulated output clock of LM with DM(1X) on at 
low-operating voltage: (a) measured output waveforms of the LM with DM(1X) on near 
the frequency transition region and (b) The measured maximum/minimum frequency and 
the effective frequency.  
 
4.2.2.2 Effect of transient supply noise 
Figure 4.10 shows the measured clock and supply waveforms, which demonstrate
the operation of the GM and the LM under transient supply noise. Figure 4.10(a) shows the
measured waveform of CKL, which is modulated to 100MHz by the GM and the LM at a
0.81V supply without noise injection. The effective frequency (Feff) is calculated from the
frequency counter value. Figure 4.10(b) shows the frequency modulation of the GM
under transient noise. The output clock period changes from 10ns to 11ns in the
presence of the supply noise. The modulation results of the LM (DM off) are shown in
Figure 4.10(c). With DM off, the LM modulates the output clock period from 10ns to
20ns. When both the GM and the LM (DM on) are on, the frequency modulation by
the GM and the duty modulation by the LM can be observed as shown in Figure
4.10(d). When DM is on, the duty modulation looks like increased clock jitter; unlike
normal jitter, however, it is not random but controllable and correlated with the replica
path delay variation. Since TDM(2X) is larger than TDM(1X), a wider duty modulation window
is observed in DM(2X) mode than in DM(1X) mode. The result of reduced clock duty with
 





























(c)                                          (d) 
Figure 4.10: Measured waveforms of the modulated clock: (a) without noise; (b) only 
GM on; (c) only LM (DM off) on; (d) both GM and LM (DM on) on with noise. 
 
Figure 4.11 shows the measured performance of the test pipelines under noise
injection. In this measurement, the noise injection per cycle is varied. Timing errors are
measured using the error counter to detect failures of the pipelines. Although the
transient noise is injected locally, the injected noise results in transient global noise as
well. Since the GM is close to the power source and the test pipeline is close to the noise
source, the power noise at the GM is lower than that at the LM (Figure 4.11(a) inset). Hence,
although the GM modulates the system clock, it could not prevent timing errors in the
pipeline due to local noise, as shown in Figure 4.11(a). As the LM is placed near the
pipeline, it modulates the clock in response to local noise and prevents timing errors even
at the highest noise injection frequency (Figure 4.10(c) and Figure 4.11(b)). However,
with the DM off (clock gating mode), the pipeline operates at a reduced Feff (Figure 4.11(b),
the GM off). Turning the DM on significantly increases Feff, as illustrated in Figure
4.11(b). With both the GM and the LM (DM on) on, Feff is improved further, as the GM
modulates the system clock in response to global noise, reducing the probability of
local clock gating (Figure 4.10(d) and Figure 4.11(c)). Note that with both the GM and the
LM on, timing errors are prevented even under transient noise. With a direct clock
input, the measured maximum frequency was 93.3MHz under noise injection. The
measured Feff with the proposed modulation [GM+LM(DM on)] reaches 100MHz when
the injection frequency is equal to or less than 1/5. Thus, the proposed clock modulation
method increases performance by 7.2% while maintaining correct operation of the
pipelines under supply noise. The measured power consumptions of the GM and the LM
at 0.81V are 56µW and 46µW, respectively. The die photo and the characteristics of the
implemented test chip are shown in Figure 4.12. The key feature of the proposed design
in comparison to existing works in the area of clock modulation in response to supply










[Figure 4.11 plots (a)-(c): horizontal axis ticks 1/1, 1/3, 1/5, 1/7, 1/9, 1/11, 1/13, 1/15; in-plot labels: "Modulation in response to global droop", "Max. frequency without clock modulation"]
Figure 4.11: Measured effective frequency under global and local supply noise: (a) only 

















Technology  130 nm
Voltage  1.30 ~ 0.74V




















Figure 4.12: The die-photo and characteristics of the chip. 
 
Table 4.1: Comparison of prior works. 
                     [58]     [59]               [60]               [61]        This work
Global power noise   AC, DC   AC (first droop)   AC (first droop)   AC, DC      AC, DC
Local power noise    no       no                 no                 partially   yes






4.3 Adaptive bias-voltage generation 
Minimizing power consumption of digital systems in mobile applications is quite 
challenging due to increasing variations, such as process, voltage, and temperature. In 
addition, a recent research trend is moving towards on-chip voltage regulation to explore 
benefits of multiple voltage domains, fast dynamic voltage frequency scaling (DVFS), 
good supply noise reduction, and cost effectiveness [89]. This section proposes a feasible 
solution considering above-mentioned technical challenges.  
A system requires a voltage guard-band to operate at the target frequency without
any errors in the presence of variations. An increased voltage guard-band that considers
worst-corner cases can lead to unnecessary power overhead. If the extent of the variations
is unknown, there is no option but to add the maximum guard-band covering all possible
worst-case scenarios, even though this is the least feasible option. If the extent of static
variations can be detected, a supply voltage level that compensates for the detected
variation can be applied, and the target performance can be achieved. In addition, if the
supply voltage is adjusted only upon detection of dynamic variations, the additional voltage
overhead for dynamic variations can be removed. This technique is called adaptive voltage
scaling (AVS). AVS is becoming popular since it reduces the system power by applying not a
fixed voltage but the adaptive voltage required for a circuit to operate at a
target performance. However, voltage assignment is still a challenging problem. AVS
works on the assumption that the supply voltage for a target performance is known
considering all variations. In addition, providing an on-chip voltage reference is an
important issue that needs to be solved for on-chip voltage regulators. Providing external
voltage references for multiple on-chip voltage regulators is costly and inefficient for
variation compensation. Deciding an appropriate voltage level for a target frequency
considering DVFS is also an important feature for on-chip voltage regulators.
Previous works explored methodologies to decide a supply voltage for a target
performance considering variations [64], [86], [88]. There are two ways, open-loop and
closed-loop, to find the target voltage [86]. First, the open-loop method requires a
delay detector circuit and a pre-characterization process. After the delay of a circuit is
evaluated, the proper voltage to compensate for the delay variation is pre-characterized
and stored in a table. Depending on the detected delay variation, the proper voltage can be
looked up in the already constructed table. Since the target voltage is obtained directly
from the sensor output value, it can be found quickly. However, this method requires a
matching process between the delay sensor and the real critical paths, as well as the
pre-characterization process to construct the voltage table. In addition, the number of
tables increases with the number of different frequencies at which a circuit can operate
for dynamic voltage frequency scaling (DVFS). Second, the closed-loop method requires a
feedback loop from delay sensing to voltage control. By adjusting the voltage until the
target delay is met, the target voltage can be found through the feedback [64], [86]-[88].
The main benefit of this method is that the pre-characterization process is not required
and it works at different frequency targets without additional circuits, though it requires
iteration to reach the target voltage. However, finding the target voltage based on the
behavior of real circuits [64] could interrupt the real operation of the circuits. In
addition, the closed-loop system requires careful design considerations for loop
stability. Furthermore, fast acquisition and area efficiency are also important
design factors.
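
The open-loop method described above amounts to a table lookup. The sketch below shows the
idea for a single target frequency; the table contents, the notion of an integer "delay
code" from the sensor, and all names are hypothetical placeholders, not characterization
data from this work.

```python
import bisect

# Hypothetical pre-characterized table for ONE target frequency.
# Each entry: (largest sensed delay code covered, supply voltage in volts).
VOLTAGE_TABLE_100MHZ = [(10, 0.80), (20, 0.85), (30, 0.90), (40, 0.95)]


def open_loop_voltage(delay_code, table):
    """Return the pre-characterized voltage for a sensed delay code."""
    bounds = [upper for upper, _ in table]
    index = bisect.bisect_left(bounds, delay_code)
    if index == len(table):
        raise ValueError("delay code outside the characterized range")
    return table[index][1]


print(open_loop_voltage(7, VOLTAGE_TABLE_100MHZ))
print(open_loop_voltage(25, VOLTAGE_TABLE_100MHZ))
```

Supporting DVFS in this style requires one such table per operating frequency, which is the
table growth noted above; the closed-loop method avoids the tables at the cost of iteration.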
The main objective of this research is to develop a cost-effective stable bias-
voltage generator that can provide fast generation of a target supply voltage at a given 
target frequency while not interrupting the operation of real circuits considering on-chip 
voltage regulators.  
4.3.1 Target voltage generation methodology 
The objective of this research is to implement a closed-loop solution that can
find the operating voltage of a circuit for a target frequency under static and dynamic
variations, as shown in Figure 4.13. The target voltage is used as a reference voltage for
on-chip regulators [89] to apply the adaptive supply voltage to the load circuits. The
direct generation of the bias-voltage can be beneficial compared to all-digital
methodologies. First, designers do not need to consider quantization errors. Since the
output is not digital but analog, the unnecessary voltage margins due to quantization
errors need not be included in the supply voltage level. Second, the proposed approach can
reduce the design burden of an on-chip voltage regulator. Quantization errors necessitate a
voltage regulator with multiple offsets and fine granularity (i.e., 12.5mV voltage
resolution). The design overhead considering the programmability of the on-chip voltage
regulators can be quite high. Third, on-chip voltage reference generation makes external
voltage references unnecessary.
























































4.3.1.1 Adaptive voltage generation methodology 
The overall adaptive voltage reference generator comprises a delay comparator, a
tunable delay line, a charge pump, a loop filter, and a power stage, as shown in Figure
4.14. The system generates a reference voltage (VREF) at which the circuit can operate at
the input reference clock frequency (CKREF). The output delay of the tunable delay line is
compared with the input reference clock (CKREF). If the delay of the delay line (TD) is
larger than the target delay (TREF), a DN pulse decreases the voltage at the VC node, as
shown in Figure 4.14. As a result, VDL and VREF increase, and the increased VDL reduces
TD. On the other hand, if TD is smaller than TREF, a UPn pulse increases the voltage at the
VC node, and hence VDL and VREF decrease. This iterative process finds the target
voltage, VREF, that makes TD equal to TREF. Total operation









































Figure 4.14: Block diagram of the adaptive voltage generator. 
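
The feedback behavior described above can be illustrated with a small behavioral sketch. It is not a model of the circuit in Figure 4.14: the alpha-power-style delay model, all of its constants, the fixed voltage step per comparison, and the function names are assumptions chosen only to show how repeated delay comparisons settle at a voltage where TD is close to TREF.

```python
# Behavioral sketch of a delay-locked voltage search (illustrative only).
# The delay model and every constant below are assumptions, not thesis values.

def delay_line_delay(v_dl, k=1.0e-9, v_th=0.35, alpha=1.5):
    """Alpha-power-style delay model: the delay falls as the supply rises."""
    return k * v_dl / (v_dl - v_th) ** alpha


def find_reference_voltage(t_ref, v_start=0.6, step=0.001,
                           v_min=0.5, v_max=1.3, max_iter=5000):
    """Move the supply by one step after each delay comparison."""
    v = v_start
    history = [v]
    for _ in range(max_iter):
        t_d = delay_line_delay(v)
        if t_d > t_ref:
            v = min(v + step, v_max)   # delay line too slow: raise the voltage
        else:
            v = max(v - step, v_min)   # delay line fast enough: lower the voltage
        history.append(v)
    return v, history


if __name__ == "__main__":
    t_ref = 2.0e-9                     # assumed target delay (one CKREF period)
    v_ref, _ = find_reference_voltage(t_ref)
    t_d_ns = delay_line_delay(v_ref) * 1e9
    print(f"settled near {v_ref:.3f} V with a delay of {t_d_ns:.3f} ns")
```

In this sketch the voltage dithers by one step around the target once it has settled; the loop filter in the actual design smooths the control voltage, which the sketch does not attempt to model.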
 
4.3.1.2 Circuit Implementation 




4.3.1.2.1 Delay Comparison 
The block diagram of the delay comparison block is shown in Figure 4.15. EN is
the signal that starts the delay measurement of the tunable delay line. The rising edge of
CKREF sets ST, which sets FF1 and FF2, and consequently PL1 and PL2 go high. PL1 is
the input of the tunable delay line, as shown in Figure 4.16, and activates it. The delay
line then propagates the high signal, which is fed back to the DLin pin. Since the delay
line operates at a lower voltage than the delay comparator block, the feedback path
requires a level shifter. DLin generates RP1 to clear PL1. As a result, the pulse width of
PL1 is set by the delays of the delay line and the level shifter. After one clock period of
CKREF, EN is cleared, and hence ST is also cleared. Consequently, the pulse generator
(PG2) generates a reset pulse (PR2), and PL2




































































Figure 4.16: Operational waveform of the adaptive voltage generator; (a) voltage down; 
(b) voltage up. 
 
The pulse width of PL2 (PW2) is determined by the clock period of CKREF, and the
pulse width of PL1 (PW1) is determined by the delay of the delay line. If PW1 is smaller
than PW2, the delay of the delay line needs to be increased to match PW2.
 
4.3.1.2.2 Tunable delay line 
Implementing a real critical path is not a practical method since it is not a scalable
solution. A tunable delay line, on the other hand, is a programmable method with good
scalability, although it requires a matching process between the real critical paths and the
tunable delay line. This research uses the tunable delay line shown in Figure 4.17. It is a
special delay line with a clear input (CLR). When the CLR input is activated, the internal
node values of the delay line are all cleared. The benefit of this reset scheme is shown in
Figure 4.18. The main technical issue of a DLL is harmonics locking. As shown in Figure
4.18, when the delay of the delay line is too high, the DLL architecture cannot detect the
harmonics-locking problem, and complex control circuits are required to prevent it. This
work instead proposes a delay line with a reset scheme. If the input to the delay line is
cleared, the internal nodes of the delay line are all cleared after a fixed delay
(TCK+∆T). Thus, a possible erroneous propagation greater than or equal to 2TCK can be
cleared.
The level shifter in Figure 4.17 is used for voltage level conversion and delay 

































Figure 4.18: Concept of delay line with reset to prevent harmonics lock; (a) harmonics 




If VDL (the supply voltage of the delay line) is low, the delay mismatch between
the delay line and the real critical paths can increase, as shown in Figure 4.19. At low
voltage, the delay difference between the delay line and the critical paths can become
negative, as shown in Figure 4.19(b). However, the propagation delay of the level shifter
also increases as the voltage of the delay line decreases. As a result, the increased delay
of the level shifter at low voltage adds to the delay of the delay line, and the mismatch at














































[Figure 4.19 legend: tDL, delay of the delay line; tLV, delay of the level shifter; tDL1~tDL4, delay of load Type1~4; plotted curves: tDL + tLV and tDL; horizontal axis: Supply Voltage (V)]





























Figure 4.19: Delay compensation with a level shifter; (a) delay versus supply voltage; (b) 





4.3.1.2.3 Power stage 
The proposed system includes two identical power stages. One power stage
provides the voltage for the tunable delay line (TDL). The load capacitance of the VDL
node is small in this design, which moves the dominant pole at the power stage towards
higher frequency and allows better stability and higher loop bandwidth. However, the
small capacitance can create voltage noise at VDL, which is not appropriate for a voltage
reference. Thus, this work proposes a shadow power stage, which creates a clean voltage
reference as shown in Figure 4.20. Furthermore, it is possible to drive a heavy capacitive
load (CEXT)












[Figure 4.20 annotation: CL is a small capacitance for high loop bandwidth]
























Figure 4.20: Power stage architecture. 
 
An additional benefit of the shadow power stage is an automatic safety margin in
VREF generation. Since the power stage that drives the delay line also carries the current
of the delay line, VREF has an offset that grows in proportion to that current. Because the
current of the delay line increases in proportion to the input reference clock frequency, a
higher clock frequency produces a higher offset voltage in VREF, that is, a higher safety
margin. At higher clock frequencies, the voltage droop in the power network can also
increase, so a higher safety margin is required. The shadow power stage therefore
automatically compensates for voltage droops at high operating frequencies.
 
4.3.2 Simulation results 
This section presents simulation results of the adaptive voltage generator under
static and dynamic variations. The design is implemented in a 130nm CMOS technology.
In addition, automatic voltage generation in response to changes in the input clock is
shown, which implies that the proposed circuit enables a system to operate with fast and
automatic DVFS.
  
4.3.2.1 Process variation 
The Monte Carlo (MC) simulation results with static process variations are shown in
Figure 4.21. With a fixed supply voltage (0.95V), the normalized frequency ranges from
0.8 to 1.2 due to process variations. Thus, considering the yield, the operating frequency
of the chips has to be set to 0.8, which means a performance loss for most chips. On the
other hand, the performance of a circuit with the adaptive voltage generator varied only
from 1.01 to 1.08. Without the adaptive supply, meeting the performance target (higher
than 1) requires increasing the operating voltage from 0.95V up to 1.1V. The average
voltage of the adaptive supply was 0.974V. Hence, without an adaptive supply voltage, a
voltage guard band of 0.126V (1.1V-0.974V) is required on average to cover process
variations. The proposed work achieved an 85.58% reduction in the standard deviation.
From the simulation results, it can be found that the































Figure 4.21: Performance variation under process variation. 
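
The average guard band quoted above is plain arithmetic on the two reported voltages. The short sketch below only restates that calculation; the function and constant names are hypothetical.

```python
# Guard-band arithmetic using the two voltages reported in the text.
V_FIXED_WORST_CASE = 1.100   # V, fixed supply at which every sample meets the target
V_ADAPTIVE_AVERAGE = 0.974   # V, average of the adaptively generated supply


def average_guard_band_mv(v_fixed, v_adaptive_avg):
    """Average voltage margin (in mV) carried by a fixed worst-case supply."""
    return round((v_fixed - v_adaptive_avg) * 1000.0, 1)


print(average_guard_band_mv(V_FIXED_WORST_CASE, V_ADAPTIVE_AVERAGE))
```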
 
4.3.2.2 Aging effect 
To emulate the NBTI aging effect, the body bias of the pMOS transistors is
increased gradually. A pMOS body bias higher than the supply voltage creates a
reverse body-bias condition and increases the VTH of the pMOS transistors. As a result,
the delay of the critical paths degrades slowly. The maximum frequencies of different
critical paths are shown in Figure 4.22(a). Depending on the path type, the degradation
ranged from 20.13% to 24.13%. The adaptive voltage level as a function of the delay
degradation is shown in Figure 4.22(b). With the adaptive voltage generator, the delay
variation of the critical paths ranged from 0.99% to 3.12%. The results imply that the
aging impact on delay of up to 24.14% can be minimized since the adaptive voltage
generator continuously finds the voltage that compensates for the delay degradation. As a
result, the voltage guard band for the aging effect can be reduced. In this test case, the
voltage guard band required for a delay degradation of up to 24.14% is 60mV.
 












































































Figure 4.22: Generation of a voltage at the given target frequency; (a) performance 
variation under aging; (b) adaptive voltage change; (c) compensated performance. 
 
4.3.2.3 Temperature variation 
The simulation was conducted while varying the temperature from -40ºC to 120ºC, as
shown in Figure 4.23. The performance variation under the temperature variation at a
0.95V supply voltage was 20.62%. With the adaptive voltage reference, the performance
variation was within 2.13%. The generated voltage reference at 120ºC was 1.042V. It
implies that a 95mV voltage guard band is required without the adaptive voltage
reference considering the worst temperature corner.
 



































4.3.2.4 Dynamic voltage frequency scaling 
Automatic DVFS is one of the main benefits of the proposed adaptive voltage
generator. As shown in Figure 4.24, VREF changes adaptively so that the critical paths
operate above the target frequency. In addition, at high operating frequencies, the
different types of load circuits have a higher frequency margin, which is equivalent to a
higher voltage margin. Considering voltage droops at high clock





































































Figure 4.24: DVFS simulation; (a) voltage change according to input frequency change; 
(b) automatic DVFS. 
 
4.4 Summary 
This chapter presented two replica-based approaches to adapt to static and
dynamic variations. The first is an effective way to prevent timing errors by modulating
the system clock and the local clock in response to DC and transient supply noise. Direct
clock modulation from the replica circuits allows within-a-cycle frequency modulation,
which enables fast clock adaptation to fast transient noise. The measurement results
demonstrate that a pipeline employing the proposed all-digital clock modulation can
operate reliably over a wide operating voltage range even under transient supply noise.
The second is a DLL-based method to find a target operating voltage for a target
frequency. The benefit of the DLL-based approach is simple and stable loop control. The
possible harmonics-locking problem associated with the DLL scheme is solved by the
proposed delay line with a reset signal. The delay line with a level shifter provides
automatic delay-mismatch compensation at low voltages. The proposed shadow power
stage provides a stable feedback loop even with a heavy capacitive load. In addition, the
shadow power stage provides an adaptive safety margin that accounts for the higher
voltage droops at high input frequencies. The proposed adaptive voltage generator
provides a solution to minimize the voltage guard band under process variations, aging
effects, and temperature variations. An additional key benefit is











A digital system without safety margins can have functional failures under
dynamic variations. However, if the system can detect the errors and recover from them,
it can operate without any safety margin. Since this methodology detects the errors in situ
(in the real data paths), it obviates the need for replica circuits and makes safety margins
unnecessary. Instead, it requires modifying the real circuits to implement error detection
and correction. There are two approaches for resolving timing errors. One approach is to
stall the pipeline by gating the clock to allow propagation of the correct data when errors
are detected [63]. However, this method has a strict control-time requirement (i.e., the
control time from the error detection to the clock gating circuit) [64]. Hence, it is not
applicable to high-performance microprocessors. An alternative approach is architectural
replay [65]-[70]. In microprocessors, the erroneous operation can be re-performed by
flushing the pipeline and re-executing the instruction (architectural replay). Since
architectural replay is an embedded function in microprocessors (i.e., it is similar to the
operation performed on a branch misprediction), it places no limitation on the operating
frequency of microprocessors. However, architectural replay can be used only in
microprocessors and not in generic circuits. In addition, this error recovery method can
incur a significant power and performance penalty for flushing the pipeline and
re-executing instructions when erroneous operations are detected. Furthermore, the
operating condition has to be changed (i.e., by increasing the voltage or reducing the
frequency) to prevent errors in the next try since the same instructions are executed again.
 
This chapter presents two error-prevention techniques that minimize the
performance penalty associated with managing timing errors. Both techniques are
platform-independent solutions that are applicable to general circuits. The first solution is
an error-prevention technique with a performance penalty of less than a clock cycle. It
utilizes time-borrowing and clock-stretching (TB-CS) techniques. However, this solution
has a strict control-time requirement, so it is difficult to apply to high-performance
applications (i.e., those operating at high frequencies). The second solution, which
utilizes programmable time borrowing and delayed clock gating (PTB-DCG), offers a
better trade-off between the control-time requirement and the performance penalty. It
relaxes the control-time requirement, thereby enabling high-frequency operation.
  
5.2 Time-borrowing and clock-stretching 
This section presents a method that prevents timing errors in advance to improve
tolerance to delay variations in the logic stages of a pipelined system with a minimized
performance penalty (less than a clock cycle) while operating the system at a clock period
less than the critical path delay. The approach couples the concept of time borrowing with
innovative circuit techniques to prevent timing errors [71]. Time borrowing is a
well-known concept in pipelines designed with pulsed latches or soft-edge flip-flops,
where a valid signal transition is allowed even after the clock edge (during the limited
transparent time period), resulting in propagation of correct values to the next stage
[72]-[74], [76]-[81].
Timing error prevention using time borrowing and clock stretching enables designs
with a low safety margin. For a target operating voltage, the pipeline with the proposed
approach can operate at a clock period less than the critical path delay without causing
any timing error. Hence, the pipeline can have better performance for a given power. The
critical paths are the ones that are most likely to fail under PVT variations or aging. The
proposed approach prevents errors in the critical paths when they are activated and hence
helps tolerate dynamic variations in the delay of a logic stage for a given input frequency.
Compared to architectural replay, the proposed approach is more general and can also be
applied to non-microprocessor pipelines. Compared to clock-gating-based error recovery,
the proposed approach has a lower performance penalty (a fraction of a clock cycle),
which can be significant when the critical path activation probability is high. Finally,
unlike error detection and correction, the proposed approach guarantees a minimum
system performance.
 
5.2.1 Methodology for prevention of timing error 
In this section, the underlying methodology of the proposed approach is discussed.
Consider a two-stage pipeline as shown in Figure 5.1. In Figure 5.1(a), if the critical path
delay [clock-to-Q delay (TCK-Q) + logic delay (TP1) + setup time (TSETUP)] is greater than
the clock period (TCK), the negative time slack causes timing failures in the pipeline. If
pulsed latches are used in the pipeline as shown in Figure 5.1(b), the pipeline has a more
flexible timing budget [17]. A pulsed latch has a time-borrowing characteristic, which
relaxes the timing requirement: the pulsed latch samples the correct data as long as the
path delay is less than the sum of the clock period and the time-borrowing window
(TCK + TBW). However, the borrowed time (i.e., TCK-Q + TP1 - TCK) is added to the path
delay of the next stage. In the next stage, if the increased total path delay
[=(TCK-Q + TP1 - TCK) + TD-Q + TP2 + TSETUP, where TD-Q is the data-to-Q delay of the
latch and TP2 is the logic delay of stage 2] is greater than the sum of the clock period and
the time-borrowing window (TCK + TBW), the pulsed latch in the next stage cannot sample
the correct data D3, as shown in Figure 5.1(b). Therefore, the pulsed latch with a limited
time-borrowing window can prevent timing











































TCK-Q + TP1 + TSETUP < TCK + TBW 




























TCK-Q + TP1 + TSETUP < TCK + TBW 
(TCK-Q + TP1) – TCK + TD-Q  + TP2 + TSETUP < TCK + TBW + TST
 
(c) 
Figure 5.1: The conceptual operation of the pipeline with (a) the flip-flops, (b) the pulsed 
latches, and (c) the LTD with clock stretching. 
 
The preceding scenario can be avoided if the increased path delay is absorbed by
stretching the clock period to prevent possible timing errors in the next pipeline stage.
This approach is illustrated in Figure 5.1(c). In the clock-period-stretching process, the
input clock frequency to the system is not changed, but the period of the internally
generated clock is changed dynamically on a cycle-by-cycle basis. The proposed approach
uses a special latch, hereafter referred to as the pulsed latch with time-borrowing
detection (LTD), only at the timing-critical paths of all pipeline stages. The LTD operates
as a pulsed latch whose pulse width defines the time-borrowing window (TBW). The
occurrence of a critical-path transition within TBW is defined as a time-borrowing event.
The LTD allows time borrowing in the current stage when critical paths are activated and,
in the presence of time borrowing, generates a time-borrowing detection (TD) signal.
Based on the TD signal, the clock period is stretched by TST. Thus, the pipeline
guarantees correct operation using clock stretching as long as the increased total path
delay [=(TCK-Q + TP1 - TCK) + TD-Q + TP2 + TSETUP] is less than TCK + TBW + TST. As a
result, the design with the LTDs and the clock-stretching concept can prevent possible
timing errors with elastic clock control.
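
The three cases of Figure 5.1 can be compared with a short sketch that evaluates the inequalities stated above. The function names and the numeric values are hypothetical, and the clamp of the borrowed time at zero is an added assumption for paths that arrive early; it is not part of the expressions in the text.

```python
# Sketch of the timing checks discussed for Figure 5.1 (all times in ps).
# Function names and example numbers are illustrative assumptions.

def flip_flop_ok(t_ckq, t_p1, t_setup, t_ck):
    """Flip-flop pipeline: the whole path must fit in one clock period."""
    return t_ckq + t_p1 + t_setup <= t_ck


def pulsed_latch_stage1_ok(t_ckq, t_p1, t_setup, t_ck, t_bw):
    """Pulsed latch: the path may extend into the time-borrowing window."""
    return t_ckq + t_p1 + t_setup < t_ck + t_bw


def borrowed_time(t_ckq, t_p1, t_ck):
    """Time taken from the next stage (assumed zero when the data is early)."""
    return max(0, t_ckq + t_p1 - t_ck)


def stage2_ok(t_ckq, t_p1, t_dq, t_p2, t_setup, t_ck, t_bw, t_st=0):
    """Next stage: borrowed time adds to its delay; t_st > 0 models stretching."""
    total = borrowed_time(t_ckq, t_p1, t_ck) + t_dq + t_p2 + t_setup
    return total < t_ck + t_bw + t_st


if __name__ == "__main__":
    args = dict(t_ckq=100, t_p1=1050, t_dq=80, t_p2=1050, t_setup=50,
                t_ck=1000, t_bw=250)
    print("pulsed latch only :", stage2_ok(**args))
    print("with stretching   :", stage2_ok(**args, t_st=250))
```

With these example numbers the first stage borrows 150 ps, the second stage fails with pulsed latches alone, and it passes once the clock is stretched by TST.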
To guarantee correct operation of a pipeline with the proposed timing error 
prevention scheme, the timing requirement for the worst case is given by:  
                                 ,                  (5.1) 
where max(TPn) is the maximum logic delay considering all pipelined stages. The 
input clock period and the stretched clock period are defined as Tmin and Tmax, 
respectively. In this work, the time-borrowing window (TBW) and the clock-stretching
amount (TST) are given by:
    $T_{BW} = T_{ST} = T_{max} - T_{min}$.                                                (5.2)
TBW and TST are generated from the multiple clock phases. TBW and TST are 
defined as 1/4 or 1/8 of the input clock period (Tmin) depending on whether TBW and TST 
are generated from 4-phase or 8-phase clocks, respectively.  
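
As a reading aid, the two-stage condition of Figure 5.1(c) can be reduced to a per-stage bound. The sketch below is derived only from the inequalities stated earlier in this section and is not claimed to be the exact form of Eq. (5.1). It introduces the shorthand $T_B$ for the borrowed time and assumes that $T_B$ already satisfies the first-stage condition.

```latex
\begin{align*}
T_B &= T_{CK\text{-}Q} + T_{P1} - T_{CK}
  && \text{(borrowed time, shorthand introduced here)} \\
T_B + T_{SETUP} &< T_{BW}
  && \text{(first-stage condition, rearranged)} \\
T_B + T_{D\text{-}Q} + T_{P2} + T_{SETUP} &< T_{CK} + T_{BW} + T_{ST}
  && \text{(second-stage condition with stretching)} \\
T_{D\text{-}Q} + T_{P2} &\le T_{CK} + T_{ST}
  && \text{(sufficient for the second-stage condition)}
\end{align*}
```

Adding the second line to the last line gives the third line, so under these assumptions any stage whose data-to-Q delay plus logic delay fits within the stretched period $T_{CK} + T_{ST}$ is covered for every admissible amount of borrowing.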
The proposed method enables a circuit to operate elastically at a clock period of
Tmin or Tmax according to the detection of time borrowing [Figure 5.2(a)]. If the critical
paths (the paths that cause time borrowing) are not activated, the pipeline with the
proposed approach operates at the minimum clock period, Tmin. When the critical paths
are activated, the design operates at the maximum clock period, Tmax. As the clock period
changes adaptively depending on the activation statistics of the critical paths, the system
with the proposed method will have different clock periods, as shown in Figure 5.2(b).
Hence,
the effective operating frequency (FEFF) is defined to estimate the effective performance
of the pipelined system as follows:
    $F_{EFF} = \frac{1}{T_{EFF}} = \frac{1}{(1 - P_C)\,T_{min} + P_C\,T_{max}}$,       (5.3)
where TEFF is the effective clock period and PC is the probability of time-borrowing
events. PC depends on the design. From the equation, it can be concluded that as PC gets
smaller, TEFF gets smaller (i.e., higher performance).
Hence, for a target operating voltage, the pipeline can operate at a clock period less than 
the critical path delay without causing any timing error. The effective performance 
depends on the activation probability of the critical path and can be better than a 



































Figure 5.2: (a) The clock stretching concept in time domain and (b) the control flow of 




5.2.2 Circuit-level implementation 
5.2.2.1 Latch with time-borrowing detection 
The LTD shown in Figure 5.3 comprises a latch and a time-borrowing detection
circuit. The basic operation of the LTD used in this work is similar to that of the latches
with error detection circuits used in [66]-[70]. The time-borrowing window of the latch L0
in the data path is determined by the clock high pulse width generated by a clock pulse
generator in the clock control circuit. If the input data D arrives before the rising clock
edge, the output Q has a valid transition like a flip-flop. If the input data D arrives after
the rising clock edge, the output Q still has a valid transition like a latch as long as the
input changes during the clock high pulse, i.e., within the time-borrowing window. In this
case, the data sampled in the latch L1 will be different from the output of the buffer
BUF0. Hence, the time-borrowing pulse (TBP) signal (i.e., the output of the XOR gate)
will be high. Once the TBP signal is set during the high pulse of the clock, the pre-charged
node (PRE) in the TBP detection circuit is discharged, and eventually the TD signal is set.
An input change while the clock is high indicates time borrowing from the next pipeline
stage, and the TD signal is set to flag this event. Therefore, the LTD allows a valid output
transition as long as the input arrives within TBW. In the TBP detection circuit, static
current could flow if TBP became high while the clock is low. During clock low, transistor
M1 should therefore be turned off to prevent both M0 and M1 from being on. In the
proposed design, the buffer BUF0 is used to match the input-to-output propagation delay
(TD-Q) of the latch L1. The latch L1 is in transparent mode and behaves like a buffer while
the clock is low. Hence, even if the input changes while the clock is low, unnecessary
transitions of TBP are prevented, as illustrated at points A and B in Figure 5.3(b). As a
result, the TBP detection circuit can be implemented without a footer transistor in the
evaluation path, which improves the evaluation speed and reduces area. The purpose of
the buffer BUF1 is to turn the M0 transistor off before TBP is set or on after TBP is
cleared. This is achieved by giving the buffer a fast rise time and a slow fall time. It also
helps extend the time zone in which time borrowing can be detected (i.e., it extends the
pulse width of the TD signal). However, the keeper circuit on the PRE node needs to be
very weak considering the contention during pre-charge and evaluation. The
time-borrowing window makes the setup-time requirement more flexible but makes the
hold-time requirement more stringent. The minimum delay requirement of the logic paths
increases as the time-borrowing window in the LTD increases. Therefore, more buffers
have to be inserted to fix hold-time violations if an LTD is used with a short delay path.
However, as the LTDs are used only for long delay paths (critical paths) in the proposed
design, the required number


































Figure 5.3: (a) The schematic of the proposed latch with the time-borrowing detection 
(LTD) and (b) its timing diagram. 
 
5.2.2.2 Time-borrowing detection collector 
The time-borrowing detection signals from the LTDs are combined to generate one
clock shift signal that is used to stretch the clock period. The TD collecting (TDC) circuit
shown in Figure 5.4 gathers the TD signals from the LTDs and generates the clock stretch
signal (SHIFT) for the clock shifter. The TDC circuit functions as wired-OR logic and is
implemented using dynamic logic. However, time-borrowing detection cannot be
evaluated correctly in the TDC circuit if the pulse width of the TD signal is too short.
This can happen when the time-borrowing event occurs very close to the falling edge of
the clock, and it can be a limiting factor for the error-prevention range of the proposed
technique. When the clock is high, the TDC circuit pre-charges the PRE node. After the
clock becomes low, the PRE node is evaluated according to the values of the TD signals
generated by the LTDs. However, the propagation time from the PRE node in an LTD to
the TDC should be longer than the clock skew between the LTD and the TDC to avoid a
race condition. A Schmitt-trigger inverter (SINV) is used in the TDC circuit to ensure a
sharp transition and increase noise immunity. The TBEN signal is used to turn off









Figure 5.4: The schematic of the time-borrowing detection collector. 
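
The detection and collection behavior described in the two preceding subsections can be summarized with a purely behavioral sketch. It is not a circuit model: the function names are hypothetical, arrival times are measured from the rising clock edge, and the minimum TD pulse width and the clock-skew constraint mentioned above are ignored.

```python
# Behavioral sketch of time-borrowing detection and collection.
# Times are in ps relative to the rising clock edge; values are illustrative.

def ltd_sample(arrival, t_bw):
    """Return (captured, td) for one LTD.

    arrival <= 0       : data is early, the LTD behaves like a flip-flop
    0 < arrival < t_bw : data lands in the transparent window (time borrowing)
    arrival >= t_bw    : data misses the window, outside what the scheme covers
    """
    if arrival <= 0:
        return True, False
    if arrival < t_bw:
        return True, True
    return False, False   # assumed: no valid capture, TD not modeled here


def collect_shift(td_signals):
    """TD collector modeled as a logical OR of all TD signals."""
    return any(td_signals)


if __name__ == "__main__":
    arrivals = [-120, 80, -30]                      # one entry per critical path
    results = [ltd_sample(a, t_bw=250) for a in arrivals]
    shift = collect_shift(td for _, td in results)
    print(results, "SHIFT =", shift)
```

A single path that borrows time is enough to raise SHIFT, which matches the wired-OR role of the collector.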
 
5.2.2.3 Clock stretching circuit 
5.2.2.3.1 Clock pulse generator 
A phase-locked loop (PLL) is a common building block for frequency synthesis in
digital systems. The PLL includes a voltage-controlled oscillator (VCO) that generates
the clock with the target frequency. The VCO comprises multiple stages, like a
differential ring oscillator. Hence, it generates multiple clocks with the same frequency
but different phases. For example, when the VCO has 4 differential delay cells as in
Figure 5.5, there are 8 output clocks with different phases. The time difference between
adjacent clock phases is TCK/8, assuming the clock period is TCK.
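
The phase spacing in this example can be reproduced with a few lines. The sketch assumes ideal, evenly spaced phases; the function name and the 800 ps period are hypothetical.

```python
# Ideal phase offsets of a differential ring-oscillator VCO (illustrative).

def vco_phase_offsets(n_diff_cells, t_ck):
    """Each differential cell gives two complementary outputs: 2*N phases."""
    n_phases = 2 * n_diff_cells
    return [i * t_ck / n_phases for i in range(n_phases)]


if __name__ == "__main__":
    # 4 differential cells and an assumed 800 ps period: 8 phases, 100 ps apart
    print(vco_phase_offsets(4, 800))
```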
A clock pulse generator [Figure 5.6(a)] is designed to generate multiple clock
pulses from the VCO clocks. The high pulse width and the phase difference of each clock
pulse are determined by how the different phase clocks are paired to create the clock
pulses. The primary clock inputs (the pins marked in block in Figure 5.6(a) and (b))
determine the rising edge, and the secondary clock inputs determine the falling edge of
the generated clock pulse (CLK4P<3:0> or CLK8P<7:0>). Hence, the clock high pulse
width (TBW) is determined by the selection of clock phases for the primary and secondary
clock inputs (the matching marked A in Figure 5.6(a)). The selection of clock phases for
the primary clock inputs decides the time difference (TST) between clock pulses (the
matching marked B in Figure 5.6(a)). As mentioned earlier, TBW and TST are considered
to be equal in this thesis. 4-phase (4P) or 8-phase (8P) clocks are chosen to generate the
clock pulses (Figure 5.6). Simulations in a 180nm CMOS technology show that the clock
pulse generators in Figure 5.6 can have 4 clock pulses with TCK/4 high pulse width and
time difference or 8






























































































Figure 5.6: The clock pulse generator for (a) the 4-phase clocks and (b) the 8-phase 
clocks. The operation of the clock pulse generator for (c) the 4-phase clocks and (d) the 
8-phase clocks. 
 
5.2.2.3.2 Clock shifter 
The phase-shifting circuit shown in Figure 5.7 is designed to implement clock
stretching. The clock shifter comprises shift registers and a clock gating circuit. The
purpose of the shift registers is to generate the selection signals (SEL<3:0> or SEL<7:0>)
that are used to select one clock pulse among multiple clock pulses that have the same
frequency but different phases. The initial value of the shift registers is one-hot. Hence,
only one of the multiple clock pulses is enabled at each clock cycle and the others are
gated in the





























































































































































Figure 5.7: The block diagrams of the clock shifter for (a) the 4-phase clocks and (b) the 
8-phase clocks. The operation of the clock shifter for (c) the 4-phase clocks (d) the 8-
phase clocks. 
 
When the SHIFT signal transitions from low to high, the shift registers shift the
selection signals, which switches the output to a different clock phase. By shifting from
one clock phase to the next, the clock period is effectively stretched, as shown in Figure
5.7. Hence, when the selection signals are not shifted, a clock pulse with the clock period
Tmin is delivered to the system. When the selection signals are shifted, which means that
time borrowing has been detected, a clock pulse with the synthesized clock period Tmax is
generated by the clock shifter, as shown in Figure 5.7. Thus, the clock shifter switches
between the two clock periods, Tmin and Tmax, depending on the presence of time
borrowing.
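
The effect of rotating the one-hot selection can be sketched as follows. This is a timing-only illustration with hypothetical names: it assumes ideal phases spaced Tmin/N apart and that each SHIFT event moves the selection to the next later phase, and it does not model the clock gating logic itself.

```python
# Timing-only sketch of clock stretching by phase selection (illustrative).

def clock_edges(shift_flags, t_min, n_phases):
    """Return (rising-edge times, final selection index) of the output clock.

    shift_flags[i] is True when time borrowing was detected in cycle i.
    """
    t_st = t_min / n_phases          # spacing between adjacent clock phases
    sel = 0                          # position of the one-hot selection bit
    edges = [0.0]
    for shift in shift_flags:
        period = t_min
        if shift:
            sel = (sel + 1) % n_phases   # rotate the one-hot selection
            period += t_st               # next edge comes from the later phase
        edges.append(edges[-1] + period)
    return edges, sel


if __name__ == "__main__":
    edges, sel = clock_edges([False, True, False], t_min=1000, n_phases=4)
    periods = [b - a for a, b in zip(edges, edges[1:])]
    print(periods, "selected phase:", sel)
```

Only the cycle in which a shift occurs is lengthened to Tmin + TST; the following cycles return to Tmin on the newly selected phase.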
 
5.2.3 Test chip and measurement results 
The test circuit shown in Figure 5.8 was fabricated in a 180nm CMOS
technology to verify the proposed approach. A simple 3-stage pipeline with the proposed
methodology was designed with master-slave flip-flops (MSFFs) in the non-critical paths
and LTDs in the critical paths. The combinational logic paths were implemented with
inverter chains to emulate a path delay distribution like that in Figure 5.9. The MSFFs are
used in the non-critical paths (path0~path47), as shown in Figure 5.8, and LTDs are used
in the 16 critical paths (path48~path63). The critical paths of the successive pipeline
stages are assumed to be cascaded. This scenario is chosen in the design as it defines the
worst-case performance penalty of the proposed approach. The TDC, CPG, and CS
circuits are designed. The CPG and CS circuits differ according to the selection of phase
(4P or 8P),






































































































































































































Figure 5.9: The path delay distribution (FO4 delay=76.23ps). 
 
The test circuit includes a serial peripheral interface (SPI) slave controlled by a test 
program running on a computer. The SPI block contains a 128-bit register file, which is 
accessible by the test program. The non-critical path group and the critical path group are 
activated separately so that the activation probabilities of the non-critical paths and the 
critical paths can be controlled independently. This is achieved by setting two different 
8-bit toggling-rate-control registers, each of which is gated by a corresponding 
toggling-enable bit. Each path group is connected to one toggle flip-flop, which is used as 
the input to the test pipeline as shown in Figure 5.8. This implies that the paths in one 
group are controlled together, i.e., all non-critical paths in a pipeline stage toggle at 
the same time, and all critical paths also toggle at the same time. The toggling rates of 
the flip-flops are programmable from 0.39% to 100% depending on the value of the 
toggling-rate-control register. The toggling occurs when the toggling-enable bit is set. If 
the toggling-enable bit is cleared, the toggling rate becomes 0%. During the measurement, the activation 
probability of the non-critical paths is fixed at 20%. This implies that the toggling 
flip-flop associated with the non-critical paths changes value, on average, once every 5 
clock cycles. Note that a 20% activation probability means that there is a 20% chance that 
any non-critical path is activated in a given cycle. The activation rate of the toggling 
flip-flop associated with the critical paths, i.e., the critical path activation 
probability (PC), was varied from 0% to 100% through the SPI slave. 
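As a rough illustration of how an 8-bit register could span the stated 0.39% to 100% 
range, the sketch below assumes the mapping rate = (value + 1)/256, so that value 0 gives 
1/256 (about 0.39%) and value 255 gives 100%. This mapping is an assumption made for the 
example; the actual encoding of the toggling-rate-control register is not given in the 
text, and the function names are hypothetical. 

```python
def toggling_rate(reg_value, enable=True):
    """Fraction of cycles in which the toggle flip-flop changes value.

    Assumed mapping (not from the thesis): rate = (reg_value + 1) / 256.
    """
    if not 0 <= reg_value <= 255:
        raise ValueError("reg_value must fit in 8 bits")
    if not enable:
        return 0.0  # toggling-enable bit cleared
    return (reg_value + 1) / 256.0


def register_for_rate(target_rate):
    """Nearest 8-bit register setting for a target toggling rate."""
    value = round(target_rate * 256) - 1
    return min(255, max(0, value))


setting = register_for_rate(0.20)  # setting used for a 20% activation target
print(setting, toggling_rate(setting))
```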
To detect malfunctions of the pipelines, reference pipelines with MSFFs and short delay 
paths are also implemented. The reference pipelines generate the correct output values for 
all (non-critical and critical) paths. The final values of the test pipelines are compared 
with the reference values, and the number of errors is counted by the error counter, whose 
value is accessible from the control program through the SPI slave. 
To measure the maximum operating frequency or the minimum operating voltage, the error 
counter is monitored while changing the input clock frequency or the supply voltage. The 
time-borrowing detection (TD) counter counts the number of clock-stretching events, which 
is used to calculate the effective frequency and to observe the time-borrowing probability. 
The time-borrowing probability is the same as the critical path activation probability 
(PC). To calculate the effective frequency, the pipeline operates in one-time operation 
mode (i.e., only for a pre-determined number of clock cycles). After the test starts, the 
pipeline operates for 2^16 clock cycles. The effective frequency and PC can be calculated 




The test environment is shown in Figure 5.10. The measurements were done in two 
scenarios. The first scenario fixes the supply voltage and varies the input clock 
frequency. From this scenario, the achievable maximum operating frequency of the test 
circuit with the proposed methodology can be measured. The second scenario fixes the 
input clock frequency and varies the supply voltage. From this measurement, it can be 
observed whether the proposed method extends the operating voltage range of the 
circuits. All the measurements were done for both the reference pipeline and the test 
pipeline with MSFFs and LTDs. In addition, the test pipeline was verified with the 4-phase 
clock pulses (4P) and the 8-phase clock pulses (8P). Table 5.1 summarizes the area and the power of 


















[Figure 5.10 annotations: CPG: Clock Pulse Generator; voltage range 1.49 ~ 1.9V; frequency 
range 125MHz ~ 202MHz; total die size 1900 um x 670 um] 
 
Figure 5.10: The test environment and the die-photo of the test chip. 
 
Table 5.1: Area and power of the components. 
Cell         Size           Power 
MSFF         22um x 7um     35.68 uW 
LTD          26um x 7um     36.8 uW 
TDC          15um x 7um     4.5 uW 
8P CPG+CS    41um x 63um    295.4 uW 
4P CPG+CS    41um x 35um    177.3 uW 
* Simulation @ 1.8V and 200MHz 
 
5.2.3.1 Increased clock frequency at the fixed supply voltage 
Figure 5.11 shows the measured maximum input clock frequency that can be 
applied to the test system without causing any functional failure. To perform this 
measurement, the frequency of the input clock is increased until the first timing failure 
appears. This experiment measures the TCK defined in equation (1) for a given 
voltage with different values of TBW. Note that TCK is independent of the critical path 
activation probability (as long as PC > 0), since a timing failure occurs if the critical 
path is activated even once. During the measurements, PC was kept at 0.1. It is 
crucial to note that the clock frequency in Figure 5.11 is not the effective frequency 
defined in equation (3), which depends on the activation probability of the critical 
paths. The measured average maximum frequency of the conventional circuit (the pipeline 
with MSFFs) was 166.04MHz. The proposed methodology with the 8-phase clock and the 4-phase 
clock increased the maximum input clock frequency to 182.71MHz (a 10.04% increase) 
and 199.17MHz (a 19.95% increase), respectively. Theoretically, the 4-phase (4P) case 
should allow the system to operate at a 25% higher input clock frequency, as TBW=0.25TCK. 
Likewise, with the 8-phase (8P) case, the input frequency can ideally be increased by 12.5% 
over the conventional case without causing any timing error. The deviation of the 
measurement results from the ideal values is due to non-idealities in the duty cycle of the 
pulsed clock, which resulted in a reduced pulse width and hence a reduced TBW and lower 












































Figure 5.11: Measured maximum input clock frequency (No errors, VDD=1.8V). 
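 
The reported gains can be checked against the ideal values with a few lines of Python. 
The measured frequencies are the averages quoted in the text; the ideal gain of 1/N for an 
N-phase clock follows the 25% (4P) and 12.5% (8P) figures stated above. The variable and 
function names are illustrative only. 

```python
CONVENTIONAL_MHZ = 166.04                      # measured average, conventional pipeline
MEASURED_MHZ = {"8P": 182.71, "4P": 199.17}    # measured averages quoted in the text
N_PHASES = {"8P": 8, "4P": 4}


def percent_gain(f_new, f_ref):
    """Relative frequency increase of f_new over f_ref, in percent."""
    return 100.0 * (f_new / f_ref - 1.0)


for name, f_mhz in MEASURED_MHZ.items():
    measured = percent_gain(f_mhz, CONVENTIONAL_MHZ)
    ideal = 100.0 / N_PHASES[name]  # ideal gain when TBW is one phase step
    print(f"{name}: measured {measured:.2f}%, ideal {ideal:.1f}%")
```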
 
With sample chip6, the power consumption was measured for different input clock 
frequencies, as shown in Figure 5.12. The supply voltage was set to 1.8V and PC was 
programmed to 0.1 (as measured by the TD counter). At an input clock of around 156MHz, the 
frequency safety margin of the conventional circuit is exhausted and the circuit fails. 
Beyond this point, time borrowing and clock stretching start working, so the proposed 
circuit operates with a faster clock but consumes more power. The CPG and the CS for the 4P 
case are implemented with smaller area than those for the 8P case. Consequently, the 8P 
case has higher power overhead than the 4P case at the same supply voltage and clock 
frequency. In addition, the 4P case operates at a higher frequency than the 8P case due to 
its wider time-borrowing window. Therefore, the proposed method can improve the performance 
beyond what is achievable by only removing the safety margin in a conventional design 
(i.e., it provides a wider operating frequency range). 
 
[Figure 5.12 plot: x-axis, input clock frequency (MHz); annotated region, time borrowing 
and clock stretching] 
Figure 5.12: The measured frequency and the power of chip6 (VDD=1.8V, PC=0.1). 
 
Figure 5.13 shows the measured effective frequency for different values of PC. 
During the measurements, the same activation probability is used for the 4P and the 8P 
cases to directly compare their performance at each PC. Further, the input clock 
frequencies for the 4P, the 8P, and the conventional cases were fixed at their respective 
maximum input clock frequencies, as noted in Figure 5.13. The effective frequency is 
calculated from the input clock frequency and the value of the TD counter. The trend shows 
that the proposed design outperforms the conventional design for PC < 0.9 and that the 
benefit increases at lower PC. Theoretically, the proposed method should have the same 
performance as the conventional design at PC=1. However, due to the non-ideal clock duty 
cycle, the proposed design is slower than the conventional case at PC=1. During the 
preceding measurements, the input clock frequency for the 4P case was higher than that for 
the 8P case, i.e., Tmin_8P = (1.25/1.125)Tmin_4P = 1.11Tmin_4P. But the stretched clock 
period (i.e., Tmax) after time borrowing is the same for the 4P and the 8P cases, as 
1.25Tmin_4P = (1.25/1.11)Tmin_8P = 1.125Tmin_8P. Therefore, at PC=1 the two designs ideally 
have the same effective frequency, but for PC < 1 the 4P design has a higher effective 
frequency than the 8P design. 
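The dependence of the effective frequency on PC can be illustrated with a simple 
idealized model. The sketch assumes that a fraction PC of the cycles is stretched from 
Tmin to Tmax = Tmin(1 + 1/N) for an N-phase clock, so the average period is 
Tmin(1 + PC/N). This is an assumption consistent with the Tmin/Tmax relations above, not a 
reproduction of equation (3), and it ignores the non-ideal clock duty cycle. The function 
names and the example counter value are hypothetical. 

```python
def effective_frequency(f_in_mhz, p_c, n_phases):
    """Ideal effective frequency when a fraction p_c of cycles is stretched by 1/N."""
    return f_in_mhz / (1.0 + p_c / n_phases)


def effective_frequency_from_counter(f_in_mhz, td_count, total_cycles, n_phases):
    """Same model, with PC estimated as TD counter value / number of cycles run."""
    p_c = td_count / total_cycles
    return effective_frequency(f_in_mhz, p_c, n_phases)


# Input clock frequencies noted in Figure 5.13; 2**16 cycles per one-time run.
for name, f_in, n in (("4P", 196.25, 4), ("8P", 177.5, 8)):
    for p_c in (0.0, 0.1, 0.5, 1.0):
        print(name, p_c, round(effective_frequency(f_in, p_c, n), 2))

example = effective_frequency_from_counter(196.25, 6554, 2 ** 16, 4)  # PC close to 0.1
```

In this idealized model both designs converge to roughly 157MHz at PC=1, while the 4P 
design stays ahead of the 8P design for any smaller PC. 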
The relative difference between the effective frequencies of the 4P and the 8P cases 
can differ from the observations in Figure 5.13. For example, when the same input clock 
frequency is applied to both cases (i.e., Tmin_4P = Tmin_8P), the 4P case will have a lower 
effective frequency than the 8P case as long as PC > 0. This is because the clock period 
for non-critical transitions remains the same, but the increase (penalty) in the clock 
period for each time-borrowing event is larger for the 4P case than for the 8P case. The 
difference in the effective frequencies of the 4P and the 8P cases can also be due to 
differences in the critical path activation probability. Note that for a given design and 
input patterns, the number of ‘critical’ paths (i.e., paths with time-borrowing detection 
latches) in the 8P case will be smaller than in the 4P case. Consequently, time-borrowing 
events will be less frequent in the 8P case than in the 4P case, potentially resulting in a 
higher effective frequency for the 8P case. For example, based on Figure 5.13, the 
effective frequency of the 8P design operating at PC=0.1 can be higher than the effective 
frequency of the 4P design operating at PC>0.5. Such scenarios are a strong function of the 
path delay distribution of 

































[Figure 5.13 plot: input clock frequency, 4P: 196.25 MHz, 8P: 177.5 MHz, Conv.: 160 MHz; 
curves: Conv., 4P measured, 4P ideal, 8P measured, 8P ideal] 
 
Figure 5.13: Measured performance of chip6 according to PC. 
 
5.2.3.2 Reduced supply voltage at the fixed input frequency 
With the same clock frequency, the minimum operating voltage of the test circuits 
was measured by reducing the supply voltage until the first timing error was found 
(Figure 5.14). As expected, the 4P case could operate at a lower supply voltage (~1.51V) 
than the 8P case (~1.65V). The average minimum error-free voltage for the conventional 
design was 1.76V. The measurement results show that the proposed method could operate 
at a lower supply voltage (i.e., over a wider supply range). The effective performance and 
power of the design at this first failure point (i.e., at the minimum operating voltage) 
are summarized in Table 5.2 (for chip6 only, with PC=0.1). The performance penalties 
of the 8P case and the 4P case are observed to be 1.25% and 2.4%, respectively. The 8P 
case has a lower performance penalty than the 4P case because its clock stretching amount 
is smaller. The measured power at the minimum operating voltage is reduced by 13% and 
25% in the 8P and 4P cases, respectively. Finally, compared to the conventional design, 
 










































[Figure 5.14 plot: x-axis, chip1 to chip6] 
Figure 5.14: Measured operating voltage ranges of test chips at 160MHz input clock 
frequency. 
 
















Design   Min. voltage (V)   Norm. performance   Norm. power*   Norm. performance/power 
Conv.    1.80               1                   1              1 
8P       1.63               0.99                0.87           1.14 
4P       1.53               0.98                0.75           1.31 
* The power measurements include the overheads. 
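 
Reading the last three columns of these rows as normalized performance, normalized power, 
and their ratio (an interpretation that agrees with the 1.25%/2.4% performance penalties 
and the 13%/25% power reductions quoted in the text), the final column can be checked as 
follows. The names are illustrative. 

```python
# (normalized performance, normalized power, listed ratio) for chip6, PC = 0.1
ROWS = {
    "8P": (0.99, 0.87, 1.14),
    "4P": (0.98, 0.75, 1.31),
}


def performance_per_power(performance, power):
    """Normalized performance divided by normalized power."""
    return performance / power


for name, (perf, power, listed) in ROWS.items():
    print(name, round(performance_per_power(perf, power), 2), listed)
```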
 
5.2.3.3 Increasing tolerance to dynamic delay variations 
The ability of the design to tolerate dynamic variations in physical effects is 
demonstrated in the test chip by reducing the supply voltage at a constant input 
frequency, i.e., the chip was tested under a dc voltage reduction. Since the test circuit 
did not include a supply noise injector, the case of fast voltage droops was not measured. 
Note that the dc voltage reduction is the more stringent case, because a time-borrowing 
event happens when a critical path activation and a dynamic variation occur at the same 
time. Slow or dc variations lead to a higher probability of time borrowing, as they affect 
the system over a large number of cycles, increasing the number of critical path 
activation events that coincide with them and hence the performance penalty. On the other 
hand, fast variations affect the delay only over a much shorter time interval. The 
measurement results can be generalized to suggest that the proposed technique provides 
better tolerance to dynamic variations in delay, whether due to critical path activation 
or due to environmental variations, with a minimized performance penalty. 
5.2.3.4 Waveforms of operation 
Figure 5.15 demonstrates the dynamic change of the clock period. The measured 
waveform is the system clock output of the proposed circuit. The waveform shows that 










5.2.3.5 Power and area overheads 
The area overheads of the proposed method for the 4P and 8P cases were 1.1% and 
1.7%, respectively (Table 5.3). The measured power overheads of the proposed method for 
the 4P and 8P cases were 4.56% and 7.38%, respectively, when the activation probabilities 
of the non-critical paths and the critical paths were 0.2 and 0.1. Note that the power 
values in Table 5.1 are simulated results (referred to as the simulated power), which 
describe the power of individual blocks. On the other hand, Table 5.3 reports the measured 
total circuit power (the CPG, the CS, the TDC, the LTDs, the MSFFs, and the logic) for the 
4P, 8P, and conventional cases. The measured area and power overheads are primarily due to 
the clock pulse generator and the clock shifter. As the area, complexity, and power of the 
pipeline increase, the relative overhead will shrink, because the power of the clock pulse 
generator and the clock shifter remains the same. 
 
Table 5.3: Total area and measured power of chip6. 
Design         Phase        Total area (mm^2)   Total power (mW) 
Conventional   -            0.198               4.61 
Proposed       8P (Total)   0.201               4.95 
Proposed       4P (Total)   0.200               4.82 
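 
The quoted power overheads follow directly from the totals in Table 5.3, as the short 
check below shows. The dictionary and function names are illustrative. Note that the areas 
in the table are rounded to three decimals, so the area ratios computed from them only 
approximate the 1.1% and 1.7% quoted in the text. 

```python
CONVENTIONAL = {"area_mm2": 0.198, "power_mw": 4.61}
PROPOSED = {
    "8P": {"area_mm2": 0.201, "power_mw": 4.95},
    "4P": {"area_mm2": 0.200, "power_mw": 4.82},
}


def overhead_percent(value, reference):
    """Overhead of value relative to reference, in percent."""
    return 100.0 * (value / reference - 1.0)


for name, totals in PROPOSED.items():
    power = overhead_percent(totals["power_mw"], CONVENTIONAL["power_mw"])
    area = overhead_percent(totals["area_mm2"], CONVENTIONAL["area_mm2"])
    print(f"{name}: power overhead {power:.2f}%, area overhead {area:.2f}%")
```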





5.3 Programmable-time-borrowing and delayed-clock-gating 
In the previous section, error prevention (instead of correction) was explored to reduce 
the overhead. Timing errors in a pipeline due to supply noise can be prevented by 
borrowing time from the following stage and recovering the borrowed time by stretching or 
gating the next clock cycle. Although time borrowing with clock stretching allows error 
prevention with a minimized performance penalty, the control delay from the error 
detection circuit to the correction circuit limits the size and frequency of the pipeline. 
This section presents a flexible error-prevention circuit, programmable time borrowing 
(PTB), to prevent errors under supply noise in pipelines. The key novelty of the system is 
that it can be programmed to enable time borrowing over multiple pipeline stages and clock 
gating multiple cycles after the time-borrowing detection point. In addition, the proposed 
approach is a platform-independent solution, which is applicable to any type of circuit, 
such as control circuits, state machines, and circuits with feedback data paths. A test 
chip is designed in a 130nm CMOS technology to verify the effectiveness of the proposed 
approach, considering time borrowing over 1, 2, and 3 pipeline stages with 1-, 2-, and 
3-cycle delayed clock gating, respectively. The potential of on-line programming of PTB to 
trade off noise tolerance against performance penalty is demonstrated. 
 
5.3.1 Methodology 
The proposed idea utilizes the concept of time borrowing, like the work in the previous 
section. The pulsed latch with time-borrowing detection (PLTD [67]) behaves like a normal 
flip-flop when the input data D settles before the rising edge of the pulsed clock, as 
shown in Figure 5.16. If the input data D changes during the high period of the pulsed 
clock, referred to as the time-borrowing window, the PLTD allows the signal to propagate to 
the next stage. This event is referred to as a time-borrowing event. 
 
Time borrowing is an inherent characteristic of pipelines designed with pulsed latches. 
The noise tolerance is determined by the time-borrowing window. The PLTD includes 
additional circuits to detect whether a signal transition occurred during the 
time-borrowing window and to generate a time-borrowing detection (TB) signal. Assuming the 
cascaded pipeline stages have critical paths, the TB signal is used to gate the clock to 
resolve possible timing errors in the cascaded stages. In the prior works on error 
prevention using time borrowing, the clock cycle immediately following the time-borrowing 
detection is always the one that is controlled, resulting in a constant performance 
penalty and noise tolerance for a given pulse width [71], [82]. Moreover, the delay of the 
feedback clock-control path between the time-borrowing detection and the clock gating 
circuit is constrained to 1 clock cycle, making the approach less suitable for large 
circuit blocks. The proposed system addresses this limitation by removing the n-th clock 
pulse after the time-borrowing detection (n = 1, 2, or 3 in this work). Hence, the signal 
is allowed to propagate over multiple stages after the time-borrowing detection before the 
clock gating is initiated. Figure 5.16 shows the case when 




















































































Figure 5.16: Programmable-time-borrowing and delayed-clock-gating. 
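 
The PLTD behavior described above can be summarized by classifying the arrival time of the 
input data relative to the rising edge of the pulsed clock. The following is a behavioral 
sketch only: it ignores setup and hold times, treats the boundaries as inclusive by 
assumption, and uses hypothetical names. 

```python
def classify_arrival(arrival, t_bw):
    """Classify a data arrival time measured from the rising edge of the pulsed clock.

    arrival <= 0        : data settled before the edge (flip-flop-like behavior)
    0 < arrival <= t_bw : data changed inside the time-borrowing window (TB is set)
    arrival > t_bw      : data arrived after the pulse closed (timing error)
    """
    if arrival <= 0:
        return "normal"
    if arrival <= t_bw:
        return "time_borrowing"
    return "timing_error"


# Example with a 0.5 ns time-borrowing window.
for t in (-0.2, 0.3, 0.8):
    print(t, classify_arrival(t, 0.5))
```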
 
To implement the PLTD, a pulsed latch with an additional flip-flop to detect time 
borrowing is used, as shown in Figure 5.17 [68]. If a disparity between the pulsed latch 
and the flip-flop occurs, TB is set to flag a time-borrowing event. If a time-borrowing 
event occurs, it propagates through the cascaded critical paths until the borrowed time is 
resolved by the clock gating. The time-borrowing detection (TB) signals are propagated 
to a clock gating (CG) circuit through a PTB detection network (PTDNn, where ‘n’ 
indicates that the clock is gated ‘n’ cycles after a time-borrowing detection). In PTDN1, 
the clock is gated in the cycle following the time-borrowing detection. In PTDN3, the 
time-borrowing window is shared across 3 pipeline stages, and CKEN becomes low 3 cycles 
after the detection. The CG circuit gates the clock when CKEN is low and re-activates the 
next clock pulse, i.e., the clock is gated after 3 cycles. In PTDN3, once CKEN becomes 
low, the time-borrowing detection signals are masked for 2 cycles to cancel the pending 
time-borrowing detections in the PTDN, since the clock gating resolves the borrowed time in 
the pipelines. Metastability does not occur in the data paths, since pulsed latches are 
used there. The time-borrowing detection signal (TB) can become metastable, but 
the multiple pipeline stages (1~3 stages) and the convergence time (1~3 cycles) in the PTDN 
and CG circuits help reduce the possibility of metastability in the control path. 
Fundamentally, metastability cannot be eliminated; it can only be reduced by adding more 
flip-flops or increasing the convergence time. 
The proposed methodology has four key benefits. First, a system with PTDN1, 
PTDN2, or PTDN3 (i.e., PTB mode: the PTB method with PTDNn) allows an on-line trade-off 
between noise tolerance and performance penalty even with a single time-borrowing 
window. A higher value of n implies less performance penalty per time-borrowing event 
but also less noise tolerance, as the single time-borrowing window is shared across 
multiple stages. Hence, each PTB mode can have a different performance penalty and noise 
tolerance. Second, PTB relaxes the control-delay requirement: PTDN1 has a 1-cycle 
control-delay requirement, while PTDN3 has a 3-cycle constraint. The system can be 
designed with the appropriate PTB mode depending on the size and frequency of the pipeline. 
Third, a system with PTDN1, PTDN2, or PTDN3 guarantees a minimum performance, because 
the performance penalty saturates in the worst-case scenario, since pending time-borrowing 
events can be canceled for multiple cycles. Fourth, the proposed solution is applicable to 































































PCT<1:0>  PAS1  PAS0  RST1  RST0  MODE 
11        1     1     1     1     PTDN1 
01        0     1     0     1     PTDN2 























































Figure 5.17: The overall architecture of a pipeline with the proposed programmable time 
borrowing with delayed clock gating. 
 
5.3.2 Test chip and measurement results 
To verify the proposed method, a test chip including test pipelines, control 
circuits, and a noise injector circuit is designed in 130nm CMOS, as in Figure 5.18. In 
addition, a clock modulator that changes the input clock frequency (CKp) in response to the 
global noise [83] is integrated. Simple five-stage pipelines (Figure 5.17) are implemented 
with cascaded critical paths, considering the worst-case timing scenario (i.e., the input 
changes every cycle). The noise injector is comprised of nMOS transistors, which draw 
instantaneous current from the power supply and induce voltage noise on the supply when 
an enable pulse is applied, as shown in Figure 5.18. The VCO generates CKIN, which is 
used as the input clock to the clock pulse generator for the test system clock. The 
frequency of the VCO is controlled by the external VCO control voltage (VREF). Figure 
5.19 shows the die-photo of the test chip. Figure 5.20 shows the operational waveforms 
of the proposed methodology. When the supply noise is injected, the clock pulse is gated 
after one, two, or three cycles for the PTB1, PTB2, or PTB3 case, respectively. The clock 














































































[Figure annotations: 598.4 μW @ 1.1V, 200MHz; 11.87% (PTB path only)*] 
* As the design size increases, the overheads decrease, since PLTDs and the PTDN are used 
only in cascaded critical paths. 










































Figure 5.20: Measured operational waveforms of the proposed method. 
 
5.3.2.1 Effect of DC voltage noise 
Figure 5.21 shows the DC operating voltage range of the test pipeline with 
master-slave flip-flops (PIPE1) and with PTB at reduced supply voltage (DC noise). The 
measurements of the point of first failure (PoFF) indicate that PTB1 (the PTB method 
with PTDN1) has the highest DC noise tolerance. The error probability was measured 
with an error counter implemented in the test block while reducing the supply voltage. 
The effective frequency was measured with a frequency counter, which counts the 
number of clock edges during a pre-determined time period (the time period is 
controlled by an external test clock). The effective frequency with PTB1 drops to 
133MHz under a DC noise shift, whereas PTB3 improves the effective frequency to 160MHz. 
Hence, Figure 5.21 demonstrates the ability of PTB to trade off noise tolerance and 
effective frequency. As shown in Figure 5.21, PTB3 can tolerate less DC noise than PTB1 
or PTB2, since the time-borrowing window is shared across three pipeline stages. On the 
other hand, PTB3 shows a lower performance penalty than PTB1 or PTB2. PTB3 removes a 
clock pulse three cycles after the time-borrowing detection, as shown in Figure 5.20. Thus, 
pending time-borrowing detections in the PTDN3 can be neglected after the clock gating 
(i.e., all timing slacks in the pipelines can be resolved by clock gating). As a result, 
PTB3 can have a lower performance penalty. The measurement results were obtained with the 
activation probability of the critical paths (PC) at 100%, considering the worst-case 
scenario. The performance penalty will decrease as PC decreases. The measurement results 
show that the proposed method always guarantees the minimum performance of the pipeline system 
















[Figure 5.21 plot: x-axis, supply voltage (V), 0.9 to 1.1; annotations: as PC approaches 0, 
minimum performance of PTB1, minimum performance of PTB2] 










5.3.2.2 Effect of AC voltage noise 
Figure 5.22 shows the operation of PTB under transient local noise. When the 
noise is injected every clock cycle, the measured effective frequencies for PTB1, PTB2, 
and PTB3 are 133, 150, and 160 MHz, respectively. For the PTDN3 case, the 
performance penalty is constant for noise injection ratios from 1 to 1/4, as the clock is 
gated once in every 5 cycles, causing pending time-borrowing events to be canceled by the 
clock gating. This implies that PTB3 has a lower performance penalty for frequent noise but 
less noise tolerance than PTB1 or PTB2. At points A and B in Figure 5.22, the effective 
noise injection ratios become 1/6 and 1/4, although the controlled noise injection ratios 
are 1/3 and 1/2, respectively. This happens because noise during the clock gating period 
does not cause time borrowing, as shown in Figure 5.23. In other words, the noise injected 
during a clock gating event is invalidated. As a result, the proposed system can show 
better performance than expected at certain noise injection frequencies (i.e., the 
effective noise frequency is lower than the injected noise frequency). 
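 
The gating behavior discussed above can be explored with a small slot-level simulation. 
The model below is a simplification built on assumptions that are not stated explicitly in 
the text: noise events are counted in delivered (non-gated) clock pulses; a detection 
schedules the gating of the n-th following clock slot; detections are ignored while a 
gating is pending and for the first pulse after the gated slot (which has extra time and 
therefore does not borrow); and the input clock is 200MHz. The function names are 
hypothetical. Under these assumptions the model reproduces the reported 133, 150, and 
160 MHz as well as the gating intervals of 5 to 7 cycles for PTB3. 

```python
def simulate_ptb(n, noise_period, slots):
    """Return one character per input-clock slot: 'P' (pulse) or 'G' (gated)."""
    pattern = []
    real_pulses = 0     # pulses actually delivered so far
    gate_slot = None    # slot scheduled to be gated, if any
    blank_until = -1    # detections are ignored up to and including this slot
    for slot in range(slots):
        if slot == gate_slot:
            pattern.append("G")  # the clock pulse of this slot is removed
            gate_slot = None
            continue
        noisy = real_pulses % noise_period == 0
        if noisy and gate_slot is None and slot > blank_until:
            gate_slot = slot + n        # gate the n-th slot after the detection
            blank_until = slot + n + 1  # the pulse after the gated slot cannot borrow
        pattern.append("P")
        real_pulses += 1
    return "".join(pattern)


def effective_frequency(f_in_mhz, pattern):
    """Input frequency scaled by the fraction of slots that deliver a pulse."""
    return f_in_mhz * pattern.count("P") / len(pattern)


for n in (1, 2, 3):
    pattern = simulate_ptb(n, noise_period=1, slots=600)
    print(f"PTB{n}: {effective_frequency(200.0, pattern):.1f} MHz")
```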
 































[Figure 5.22 plot: x-axis, noise injection frequency (1/cycles)] 























[Figure 5.23 panels (waveforms: noise, gating, CLKi, VDD): 
 noise frequency = 1/1: 1 clock gating per 5 cycles 
 noise frequency = 1/3: injected noise during clock gating is invalidated; effective noise 
 frequency = 1/6; 1 clock gating per 7 cycles 
 noise frequency = 1/4: 1 clock gating per 5 cycles 
 noise frequency = 1/5: 1 clock gating per 6 cycles 
 noise frequency = 1/6: 1 clock gating per 7 cycles] 
 
Figure 5.23: Noise cancel-out effect of the PTB3. 
 
The noise tolerance of PTB was measured with DC noise and AC noise injection, 
as in Figure 5.26. While injecting AC noise of up to 130mV, the DC noise level was changed 
from the power supply using a general purpose interface bus (GPIB). In the measurement, DC 
noise represents a possible IR drop or voltage regulator offset, and AC noise emulates the 
transient local voltage noise due to instantaneous current. The measurement results 
demonstrate that the pipeline with the proposed method provides good noise tolerance with a 
minimized performance penalty. In the presence of 130mV local noise, 125mV~196mV of global 
noise can be tolerated using PTB alone. However, the main contribution of this 
work is to relax the control time requirement to up to three clock cycles while achieving 
supply noise tolerance. The measured area and power overheads of the test chip were 
11.87% and 12.32%, respectively. Note, however, that the implemented test case 
includes only 5-stage pipelines with cascaded critical paths. 
 
5.3.2.3 Integration of PTB-DCG with adaptive clocking 
The noise tolerance of the pipeline can be further improved by integrating PTB 
with an adaptive-clocking method [83]. The adaptive clocking uses a global clock 
modulator, as shown in Figure 5.24, to change the clock frequency in response to global 
supply noise [83]. The measured output clock frequency of the clock modulator in 











[Plot of the clock modulator output: x-axis ticks 0.75 to 1.25; legend, Power (CM) @ 
Output; annotations: reduce clock frequency in response to voltage droop, 800MHz] 
 




To verify the operation of the combined system, a significant voltage droop was 
generated, as shown in Figure 5.25. Figure 5.26 summarizes the noise tolerance of PTB 
without and with clock modulation. The global noise tolerance was 
improved to 405mV ~ 442mV using the combined system. Adaptive clocking helps 
improve the noise tolerance beyond what is achievable with PTB alone. Moreover, adaptive 
clocking proactively changes the clock frequency in response to a DC voltage drop. Hence, 
if there is sustained global noise, instead of continuously employing time borrowing 
and clock gating, the combined system moves to a reduced frequency to prevent errors 










































[Noise tolerance plot annotations: minimum voltage for PIPE1 (@200MHz) with local noise 
only; legend: tolerable global voltage droop, minimum voltage without errors] 
 





5.4 Case studies for overhead estimation 
An automated design flow was developed to apply the two proposed approaches to a 
large design, and a case study was performed to analyze the overhead and the performance 
gain. A rasterizer was chosen for this case study [84]. The rasterizer is an essential 
graphics processing unit, which converts a vector 
graphic format into a raster image on a video display or printer, or in a bitmap file. This 
section presents realistic area and power overheads based on a real hardware 
implementation. 
5.4.1 Case study for TB–CS 
An automated design flow was developed and applied to implement the rasterizer 
with the proposed TB-CS technique. The rasterizer was implemented in a 180nm 
CMOS technology. To reduce the performance overhead, LTDs are used only in the 
cascaded critical timing paths. If the next stage has enough timing slack to resolve the 
time borrowed from the previous stage, plain pulsed latches can be used. The developed 
design flow is shown in Figure 5.27. First, after the place and route of the synthesized 
design and the timing analysis, the list of FFs in critical paths is generated, and the FFs 
in critical paths are replaced with pulsed latches. Second, from the timing report of the 
design with the replaced pulsed latches, the list of cascaded critical paths is generated, 
and the pulsed latches in cascaded critical paths are replaced with LTDs. The automated 
layout shown in Figure 5.28 is generated using the design flow, and the locations of the LTDs and pulsed 






[Figure 5.27 flow chart, recovered steps: list critical paths for pulsed latches; replace 
FFs with pulsed-latches; list cascaded critical paths for LTDs (Tpath1+Tpath2 > 
2(TCK-TBW)); replace pulsed-latches with LTDs; TDC insertion; hold-fix with the modified 
netlist] 
 
Figure 5.27: Design flow for inserting pulsed-latches and LTDs. 
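 
The two selection steps of this flow can be sketched on a toy timing graph. The sketch 
assumes that a flip-flop becomes a pulsed latch when its worst input path exceeds 
TCK-TBW, and that a pulsed latch becomes an LTD when its worst input path plus one of its 
output paths exceeds 2(TCK-TBW), following the condition shown in the flow chart. The data 
structure, the function name, and the example delays are hypothetical; a real flow would 
read these values from static timing reports. 

```python
def select_cells(paths, t_ck, t_bw):
    """Split flip-flops into pulsed-latch and LTD candidates.

    paths: iterable of (start_ff, end_ff, delay) tuples.
    Returns (pulsed_latches, ltds) as two disjoint sets of flip-flop names.
    """
    t_min = t_ck - t_bw

    # Step 1: flip-flops that capture a critical path become pulsed latches.
    worst_input = {}
    for _start, end, delay in paths:
        worst_input[end] = max(worst_input.get(end, 0.0), delay)
    pulsed = {ff for ff, delay in worst_input.items() if delay > t_min}

    # Step 2: pulsed latches that also launch a cascaded critical path become LTDs.
    ltds = set()
    for start, _end, delay in paths:
        if start in pulsed and worst_input[start] + delay > 2 * t_min:
            ltds.add(start)
    return pulsed - ltds, ltds


example_paths = [
    ("A", "B", 9.0), ("B", "C", 8.0), ("C", "D", 3.0),
    ("A", "E", 8.0), ("E", "F", 2.0), ("X", "Y", 4.0),
]
pulsed_latches, ltds = select_cells(example_paths, t_ck=10.0, t_bw=2.5)
print(sorted(pulsed_latches), sorted(ltds))
```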
 
[Figure 5.28 panels: Full Chip Layout; Clock Network; annotations: LTD: 11 cells, TDC: 3 
cells] 
Figure 5.28: The layout of the implemented graphic processing unit for the 8P case in a 
180nm CMOS technology. 
 
Figure 5.29(a) shows the path-delay distribution of the rasterizer. Critical paths 
are highly populated close to the target delay. To find the flip-flops that are candidates 
for replacement, the delay distribution of the most critical path of each flip-flop is 
evaluated. Depending on the time-borrowing window, the flip-flops that have long 
delay paths at their inputs are selected to be replaced with pulsed latches. As illustrated 
in Figure 5.29(b), 93 and 155 flip-flops are chosen to be replaced by pulsed latches for 
the 8P and the 4P case, respectively. However, only a fraction of these pulsed latches are 
replaced by LTDs, depending on whether any path originating from that pulsed latch 
terminates in another pulsed latch. This is determined by enumerating all the output paths 
from each flip-flop, evaluating their delays, and finding whether any output path has 
a delay higher than the critical delay. Figure 5.30(a) shows the distribution of the 
worst-case output path delays from all of the pulsed latches selected in Figure 5.29(b) for 
the 4P case. Only the pulsed latches with a worst-case output path delay higher than Tmin 
are selected to be replaced by LTDs. Figure 5.30(b) shows the same analysis for the 8P case. 
Figure 5.29 clearly illustrates that only a small fraction of the flip-flops are replaced 
by pulsed latches and LTDs. The major overhead of the proposed approach comes from 
the need for hold fixing. Since the critical path includes convergent short-delay paths, 
hold-fix buffers should be inserted in the short-delay paths to avoid hold-timing 
violations. This is illustrated in Figure 5.31, which shows the distribution of the delays 
of all input paths of an example critical flip-flop (a pulsed latch or an LTD). The paths 
whose delay is less than the time-borrowing window require hold fixing and hence incur additional 















































(Path delay) / (FO4 delay)






























total number of paths = 856total number of paths = 155.2K
 
(a)                                                           (b) 
Figure 5.29: The path delay analysis of the rasterizer with a 180nm CMOS technology: 
(a) distribution of all path delays; (b) the worst path delay of each flip-flop, showing the 
flip-flops that are selected to be replaced by pulsed latches. 
 
[Figure 5.30, panels (a) and (b): axis ticks 0 to 70; annotations "Tmin = 0.75xTCK = TCK-TBW" and "Tmin = 0.875xTCK = TCK-TBW"]
 
Figure 5.30: The distribution of the worst-case output path delays of all the pulsed latches 

[Figure 5.31: x-axis (Path delay) / (FO4 delay); annotation "Hold-fix paths (4P case)"]
 
Figure 5.31: Distribution of the delay of the input path of one critical flip-flop showing 
the need for hold fixing in certain paths. 
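
The selection steps described for Figures 5.29 to 5.31 can be sketched in a few lines of Python. This is a minimal illustration, not the automated design flow used in this work: the `FlipFlop` record, the function names, and the example delays are hypothetical, and the sketch assumes that input paths are compared against the same threshold Tmin = TCK - TBW that Figure 5.30 annotates for output paths.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FlipFlop:
    """Hypothetical record of one flip-flop; delays are normalized to TCK."""
    name: str
    input_delays: List[float] = field(default_factory=list)   # paths ending here
    output_delays: List[float] = field(default_factory=list)  # paths starting here


def classify(ff: FlipFlop, t_ck: float, t_bw: float) -> str:
    """Return 'FF', 'PL' (pulsed latch), or 'LTD' for one flip-flop."""
    t_min = t_ck - t_bw  # Tmin = TCK - TBW, as annotated in Figure 5.30
    if max(ff.input_delays, default=0.0) <= t_min:
        return "FF"      # assumption: input paths are compared against Tmin too
    if max(ff.output_delays, default=0.0) > t_min:
        return "LTD"     # a long output path follows the borrowing stage
    return "PL"


def hold_fix_paths(ff: FlipFlop, t_bw: float) -> List[float]:
    """Input paths shorter than the time-borrowing window need hold-fix buffers."""
    return [d for d in ff.input_delays if d < t_bw]


# Made-up example with TBW = 0.25 x TCK, so Tmin = 0.75 x TCK
# (one of the two values annotated in Figure 5.30).
T_CK, T_BW = 1.0, 0.25
flops = [
    FlipFlop("a", [0.50, 0.60], [0.90]),
    FlipFlop("b", [0.10, 0.80], [0.50]),
    FlipFlop("c", [0.20, 0.95], [0.80]),
]
kinds = {ff.name: classify(ff, T_CK, T_BW) for ff in flops}
```

In this toy example, flip-flop `a` stays a flip-flop, `b` becomes a pulsed latch, and `c` becomes an LTD. `hold_fix_paths` would be applied only to the replaced elements (`b` and `c`), whose short input paths (0.10 and 0.20) fall inside the time-borrowing window.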
 
Table 5.4 summarizes the results of the proposed design. In the implemented rasterizer, 
12.7% and 9.6% of the FFs were replaced with pulsed latches, and 5.4% and 1.3% were 
replaced with LTDs, for the 4P and the 8P case, respectively. The overhead due to the 
additional circuits of the proposed technique decreased as the design size increased, and 
the overhead due to the hold-fix buffers became dominant (Table 5.4). The 4P case and 
the 8P case improved the performance per power by 35% and 19%, respectively. The 
proposed technique also extended the operating frequency range of the rasterizer, as 
shown in Table 5.5. PC was measured with different test vectors (vector1~vector3) at the 
maximum possible input clock frequency. The effective frequency was calculated from PC 
and the input clock frequency. With these different test vectors, PC was measured at the 






 rows of  
Table 5.5. The effective frequencies (FEFF) are smallest with vector1 and vector2 
for the 8P case and the 4P case, respectively. With the same vector, the 8P case showed 
a much smaller probability of clock-stretching events than the 4P case, since the number 
of LTDs in the 8P case is smaller than that in the 4P case. The maximum input clock 




Table 5.4: Summary of implemented 3D graphic processing unit.  
(@ 1.8V and 201.6 MHz) 
               MSFF         8P            4P
FF             856 (100%)   763 (89.1%)   701 (81.9%)
Pulsed-latch   0 (0%)       82 (9.6%)     109 (12.7%)
LTD            0 (0%)       11 (1.3%)     46 (5.4%)
CPG+CS         0            1 (8P)        1 (4P)
TDC            0            3             12
(*) estimated from spice simulation considering the vector1. 
 
Table 5.5: Simulated maximum input clock frequency and PC. 
(@ 1.8V and 201.6 MHz) 
               MSFF        8P                   4P
PC (vector1)   -           0.68%                7.83%
PC (vector2)   -           0.04%                9.02%
PC (vector3)   -           0.02%                6.18%
Min. Feff      201.6 MHz   225.4 MHz (+11.8%)   243.5 MHz (+20.8%)
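
The percentages in Table 5.4 and the frequency gains in Table 5.5 follow directly from the raw counts and frequencies listed there. The short Python check below re-derives them; the helper names `share` and `gain` are introduced here only for illustration.

```python
TOTAL_FFS = 856          # number of flip-flops in the MSFF baseline (Table 5.4)
BASE_FREQ_MHZ = 201.6    # baseline clock frequency (Tables 5.4 and 5.5)

# Element counts per design case, copied from Table 5.4.
counts = {
    "8P": {"FF": 763, "Pulsed-latch": 82, "LTD": 11},
    "4P": {"FF": 701, "Pulsed-latch": 109, "LTD": 46},
}


def share(count: int, total: int = TOTAL_FFS) -> float:
    """Percentage of the baseline flip-flops, rounded to one decimal."""
    return round(100.0 * count / total, 1)


def gain(f_eff_mhz: float, f_base_mhz: float = BASE_FREQ_MHZ) -> float:
    """Relative frequency improvement over the baseline, in percent."""
    return round(100.0 * (f_eff_mhz / f_base_mhz - 1.0), 1)


shares = {case: {k: share(v) for k, v in row.items()} for case, row in counts.items()}
gains = {"8P": gain(225.4), "4P": gain(243.5)}  # minimum Feff values from Table 5.5
```

The counts of each case add up to the 856 baseline flip-flops, and the pulsed-latch and LTD counts together give the 93 (8P) and 155 (4P) flip-flops selected in Figure 5.29(b).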
 
The proposed technique is expected to benefit graphic processing units, since graphic 
accelerators in mobile processors are not heavily pipelined and run at relatively slow 
clock frequencies (i.e., 200MHz~400MHz). In addition, a low performance penalty can be 
expected in graphic processing units since the probability of critical patterns in 
arithmetic is generally low. Interestingly, the test-chip measurement shows a lower 
power overhead for the 4P case than for the 8P case (Table 5.3), whereas Table 5.4 
shows the opposite. This is because 
the test-chip was implemented with inverter chains to verify the basic concepts. In the 
test-chip, the critical paths do not include convergent short-delay paths, and hence, there 
were no LTDs (or pulsed latches) that received input from the short paths. Consequently, 
hold-fix buffers were not required in the test-chip. The higher overhead for the 8P case 
was due to the higher complexity of the pulse generator and the clock shifter. On the 
other hand, the path delay distribution shown in Figure 5.31 illustrates that hold fixing 
was required in the rasterizer design. A larger time-borrowing window in the 4P case 
resulted in an increased number of required buffers (Table 5.4 and Figure 5.31). Hence, 
the area/power overhead of the 4P case is observed to be larger than that of the 8P case in Table 5.4. 
In general, for a moderate-size design, the overhead can be higher for the 4P case due to 
the need for more hold-fix buffers. However, the power overhead will also depend on the 
switching activity in these short paths (i.e., activity of the additional buffers). 
5.4.2 Case study for PTB–DCG 
The PTB-DCG technique is also applied to the rasterizer with a Nangate 45nm 
technology model [85] (Figure 5.32). The PTB1, PTB2, and PTB3 cases are implemented 
with the programmable PTDNn (n is 1, 2, or 3). Table 5.6 compares the baseline design 
(all MSFFs) with the PTDN1, PTDN2, and PTDN3 cases. To implement PTDN1, the 
MSFFs in the baseline design are selectively replaced by pulsed latches and PLTDs. 
The basic approach is to keep MSFFs in the non-critical paths. When a critical path in 
one stage is followed by non-critical paths in the successive stage, the MSFF is replaced 
by a pulsed latch to allow time-borrowing. The borrowed time in the current stage is 
resolved by the inherent slack of the connected non-critical path(s) in the following stage. 
When a critical path is connected to another critical path in the following stage, the 
corresponding MSFF is replaced by a PLTD. The PTDN modes are programmable 
(PTDN1, PTDN2, or PTDN3 mode), and the PTB1 design can operate in the 
PTDN1~PTDN3 modes. However, a programmable PTB1 design requires more PLTDs 
than a fixed PTB2 or PTB3 design. To understand the overhead 
incurred to enable programmability, the PTB2 and PTB3 designs were also implemented 
directly. For the PTB2 and PTB3 cases, PLTDs are inserted only if three and four critical 
paths are connected in successive stages, respectively. Hence, the number of PLTDs is 
smaller than in the PTB1 design. However, PTB2 can only work in the 
PTDN2 or PTDN3 mode, and PTB3 can only work in the PTDN3 mode. The overheads are 
primarily determined by the number of hold-fix buffers. The choice between the different 
design cases in Table 5.6 depends on the design size, the number of critical paths, and 
the clock frequency. 
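
The replacement rules above can be illustrated on a simple linear pipeline. The sketch below is a hypothetical simplification rather than the automated flow used for the rasterizer: it assumes one register between consecutive stages, a single criticality flag per stage, and that the PLTD is placed in front of the last stage of a run of n+1 critical paths (the text does not specify where in the run the PLTD sits).

```python
from typing import List


def classify_registers(critical: List[bool], n: int) -> List[str]:
    """Classify the register after each stage of a linear pipeline.

    critical[i] says whether stage i holds a critical path; register i sits
    between stage i and stage i+1. n = 1, 2, 3 selects PTB1, PTB2, PTB3.
    """
    kinds = []
    for i in range(len(critical) - 1):
        if not critical[i]:
            kinds.append("MSFF")  # non-critical input path: keep the MSFF
            continue
        # Count consecutive critical stages ending at stage i+1.
        run, j = 0, i + 1
        while j >= 0 and critical[j]:
            run += 1
            j -= 1
        # Assumption: the PLTD guards the last stage of a run of n+1 critical paths.
        kinds.append("PLTD" if run >= n + 1 else "PL")
    return kinds


stages = [True, True, True, False, True, False]  # made-up criticality pattern
ptb1 = classify_registers(stages, 1)
ptb2 = classify_registers(stages, 2)
ptb3 = classify_registers(stages, 3)
```

For this made-up pattern, the PTB1 design needs two PLTDs, the PTB2 design one, and the PTB3 design none, which mirrors the observation that a programmable PTB1 design requires more PLTDs than a fixed PTB2 or PTB3 design.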
 
[Figure 5.32 layout labels: "Clock network and flip-flops" (MSFFs: 690, Pulsed-latches: 135, PLTDs: 31); "PTDNn Network"]

Figure 5.32: The automated layout of a rasterizer unit with programmable time borrowing 




Table 5.6: Overhead summary of the implemented rasterizer in a 45nm technology.  
                           MSFF   CASE1       CASE2     CASE3
Design                     -      PTB1        PTB2      PTB3
PTDN mode                  -      PTDN1/2/3   PTDN2/3   PTDN3
Operating modes                   PTB1/2/3    PTB2/3    PTB3
Control time requirement
      * 454.5MHz, 1.1V, and 300ps clock pulse width 
 
5.5 Summary 
This chapter presented effective methods for preventing timing failures. First, the 
TB-CS approach not only eliminated the safety margin in the design but also 
improved the power-performance trade-off of the design. As timing errors are prevented 
using time-borrowing and clock stretching, the time/energy overhead of error recovery is 
significantly reduced. The design and measurement of the test chip demonstrate that a 
system employing the proposed method can operate at a higher frequency and/or at a 
lower supply voltage than the conventional design. The proposed design can 
efficiently tolerate timing variations due to process, voltage, and temperature fluctuations 
with minimal performance penalty, even at a high activation probability of critical paths. In 
other words, as the proposed circuit enables operation at a lower supply voltage under the 
same environmental variation, it helps reduce the voltage safety margin required for the 
conventional design. Hence, the proposed approach allows a system to operate over a 
wide voltage and frequency range while maintaining system reliability. Second, the 
platform-independent PTB-DCG solution was proposed. It allows a trade-off between noise 
tolerance and performance penalty while using a fixed time-borrowing window, and it 
relaxes the required control time. The measurement results from a test chip in 130nm 
CMOS under DC and AC supply noise demonstrate the effectiveness of the 
programmable time-borrowing. The proposed circuit enables operation at a lower supply 
voltage even under transient supply noise and can help reduce the voltage safety margin 
required in conventional pipelines. In addition, the analysis results for the graphic 
processing unit are presented to show the real area and power overhead. For this case 
study, an automated design flow was developed. The implemented designs with the 










The goal of this thesis is to develop methodologies for robust low-power digital 
systems under static and dynamic variations. Increasing static and dynamic variations due 
to device scaling and increased operating frequency limit system performance. 
To improve system performance within a limited power budget, the safety margins for 
worst-case corners should be minimized. Although minimizing safety margins can 
reduce the performance or power loss, it also increases the risk of functional failures 
under variations. This thesis explored design methodologies that reduce the safety margin 
while maintaining robust operation of digital systems. In this research, three different 
methodologies are proposed to compensate efficiently for different types of variations.  
In Chapter 3, two post-silicon tuning methods that apply AVS and ABB to 3D ICs 
are explored to compensate for process variations, considering the implementation types 
of 3D ICs. First, considering block-level 3D integration (i.e., a 3D system with separate 
clock networks), TAVS is proposed to apply adaptive supply voltage to 3D ICs. In 
addition, the design flows for level shifters are developed. The proposed methodology 
improved not only the average performance but also the variability of the power 
consumption of 3D ICs. Second, the clock skew issue in 3D ICs is investigated considering 
logic-level 3D integration. In this design case, 3D clock skew can be the critical 
factor limiting the performance of 3D ICs. TABB is proposed to adaptively minimize the 
3D clock skew and thereby improve the variability of 3D systems. In addition, circuit 
techniques for variation sensors are developed. TABB with the proposed sensor technique 
reduced clock skew and slew variation and improved the overall performance of 3D ICs 
implemented at the logic level with a single clock network. 
In Chapter 4, two non-design-intrusive circuit techniques are proposed for 
adaptation to increasing dynamic variations. Minimizing dynamic variations is a key 
challenge in minimizing the safety margins. Since dynamic variations are time-dependent, 
faster adaptation to fast-changing variations can lead to lower safety margins. 
The two proposed circuit techniques are based on replica circuits. Thus, modification of 
the real data paths is not required. First, fast clock modulation techniques for global and 
local variations are proposed. The proposed adaptation techniques achieved fast clock 
modulation in response to global and local transient supply noise within a clock cycle. 
The proposed techniques improved the noise tolerance of a system while maintaining 
reliable operation with minimized safety margins. Second, the adaptive bias-voltage 
generation technique for on-chip regulators is proposed to reduce the safety margin 
associated with static and dynamic variations. The adaptive voltage generator allows 
seamless adaptation to static and dynamic variations with a reduced design cost. In 
addition, it enables automatic DVFS. 
In Chapter 5, design-intrusive methods to eliminate the safety margin are proposed. 
Even though the design-intrusive method requires modification of the real data paths to 
embed the timing-error-detection and error-management schemes into them, it does 
not require the safety margins that cannot be eliminated in a replica-based technique. In 
addition, timing errors occur only when the critical paths are activated. Not all 
dynamic variations necessarily lead to timing errors. Thus, timing errors happen only 
when the dynamic variations and the critical-path activation occur at the same 
time. As a result, the joint probability of dynamic variations and critical-path activation 
can be very low. Compared to the replica-based approaches, which do not consider the 
probability of critical-path activation, this design-intrusive method can be more efficient. 
However, if the activation probability of critical paths is high, the performance penalty 
associated with the error management can overshadow the benefits achieved by this 
methodology. In addition, the benefits come with the high design effort to modify the 
core circuits. This thesis presented two design-intrusive methodologies that minimize 
the penalty for error management with reduced design effort, thereby improving the 
effectiveness of the design-intrusive method even at a high activation probability of 
critical paths. Furthermore, the proposed techniques are platform-independent solutions 
and hence applicable to general circuits. First, this work presented time-borrowing and 
clock-stretching (TB-CS) methods. The proposed methodology prevented timing errors 
in advance with a minimized performance penalty. Thus, a minimum 
performance can be guaranteed even at a high activation probability of critical paths. 
However, this solution has a limitation in the operating frequency: the control time 
required from timing-error detection to clock control limits the operating 
frequency. Second, a programmable-time-borrowing and delay-clock-gating 
(PTB-DCG) method was proposed to eliminate the safety margin with a relaxed 
control-time requirement. By allowing time-borrowing over multiple pipeline stages 
and delaying clock gating, the proposed technique relaxes the control-time 
requirement to up to 3 cycles. This makes the solution applicable to high-performance 
processors. In addition, the performance penalty can be minimized even in the worst case 
without complex controls. 
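
The joint-probability argument in this chapter summary can be written out as a short worked equation. The symbols V (a dynamic variation large enough to violate timing) and A (critical-path activation) are introduced here for illustration, the independence of the two events is an assumption, and the numbers are hypothetical rather than measured.

```latex
% V: a dynamic variation large enough to violate timing occurs in a cycle
% A: a critical path is activated in the same cycle
\begin{align}
  P_{\mathrm{err}} &= P(V \cap A) = P(V)\,P(A \mid V) \\
  P_{\mathrm{err}} &= P(V)\,P(A) \quad \text{(assuming V and A are independent)} \\
  % hypothetical values: P(V) = 0.10, P(A) = 0.05
  P_{\mathrm{err}} &= 0.10 \times 0.05 = 0.005
\end{align}
```

A replica-based technique must react to every occurrence of V, whereas a design-intrusive technique pays the error-management penalty only with probability P_err. The advantage shrinks as P(A) grows, which is the high-activation case discussed above.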
This thesis presented effective solutions to minimize static variations and adapt to 
dynamic variations considering design types and variation types. As a result, the proposed 
approaches help minimize safety margins while maintaining robust operations, thereby 
achieving robust low-power digital systems.  
Furthermore, this thesis can lead to combined solutions in future work. The 
non-design-intrusive approach has advantages in terms of design effort and wide operation 
range. However, it cannot eliminate the safety margin associated with adaptation speed 
and mismatches. On the other hand, the design-intrusive approach can eliminate the safety 
margin, but it has a limited variation-tolerance range due to the limited transparency 
window. Future work may involve a combined solution of the non-design-intrusive and 
design-intrusive approaches presented in this thesis. The non-design-intrusive approach can 
determine the operating condition without considering the safety margin if it is 
combined with the design-intrusive approach, as shown in Figure 6.1. Without any safety 
margin, there can be timing errors due to mismatches and local variations. If these 
possible errors can be managed by the design-intrusive approach, the combined solution 
can provide a wide variation-tolerance range without any safety margin. As a result, the 
integrated work, which includes adaptive bias-voltage generation, clock modulation, and 
the error-prevention technique, can provide total solutions at the system level for 

[Figure 6.1 labels: "Bias-Voltage Generation + Error Prevention"; "Bias-Voltage Generation + Clock Modulation + Error Prevention"; "Off-chip Regulator Case"; "Compensate for static and slow dynamic variations"]
 





[1] S. Mukhopadhyay and K. Roy, "Modeling and estimation of total leakage current in 
nano-scaled CMOS devices considering the effect of parameter variation," in Proc. 
International Symposium on Low Power Electronics and Design, 2003, pp. 172-175. 
[2] http://cpudb.stanford.edu. 
[3] J. Tschanz, S. Narendra, R. Nair, and V. De, “Effectiveness of adaptive supply 
voltage and body bias for reducing impact of parameter variations in low power and 
high performance microprocessors,” IEEE Journal of Solid-State Circuits, May 
2003, pp. 826-829.  
[4] J. W. Tschanz, J. T. Kao, S. G. Narendra, R. Nair, D. A. Antoniadis, A. P. 
Chandrakasan, and V. De, “Adaptive body bias for reducing impacts of die-to-die 
and within-die parameter variations on microprocessor frequency and leakage,” 
IEEE Journal of Solid-State Circuits, vol. 37, no. 11, pp. 1396–1402, Nov. 2002. 
[5] J. Tschanz, N. Kim, S. Dighe, J. Howard, G. Ruhl, S. Vanga, S. Narendra, Y. 
Hoskote, H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D. Somasekhar, S. Tang, D. 
Finan, T. Karnik, N. Borkar, N. Kurd, and V. De, “Adaptive frequency and biasing 
techniques for tolerance to dynamic temperature-voltage variations and aging,” in 
IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 292-293. 
[6] W. Yao, Y. Shi, L. He, S. Pamarti, “Joint design-time and post-silicon optimization 
for digitally tuned analog circuits,” in Proc. IEEE/ACM International Conference on 
Computer-Aided Design, 2009, pp. 725-730.  
[7] J. Van Olmen, A. Mercha, G. Katti, C. Huyghebaert, J. Van Aelst, E. Seppala, C. 
Zhao, S. Armini, J. Vaes, R. C. Teixeira, M. Van Cauwenberghe, P. Verdonck, K. 
Verhemeldonck, A. Jourdain, W. Ruythooren, M. de Potter de ten Broeck, A. 
Opdebeeck, T. Chiarella, B. Parvais, I. Debusschere, T. Y. Hoffmann, B. De Wachter, 
W. Dehaene, M. Stucchi, M. Rakowski, P. Soussan, R. Cartuyvels, E. Beyne, S. 
Biesemans, and B. Swinnen, “3D stacked IC demonstration using a through silicon 
via first approach,” in Proc. IEEE International Electron Devices Meeting, 2008. 
[8] D. H. Kim, K. Athikulwongse, and S. K. Lim, "A study of through-silicon-via 
impact on the 3D stacked IC layout," in Proc. IEEE/ACM International Conference 
on Computer-Aided Design, 2009, pp. 674-680.  
[9] F. Akopyan, C. Otero, D. Fang, S. J. Jackson, and R. Manohar, “Variability in 3-D 
integrated circuits,” in Proc. IEEE Custom Integrated Circuits Conference, 2008. 
[10] S. Garg and D. Marculescu, “3D-GCP: An analytical model for the impact of 
process variations on the critical path delay distribution of 3D ICs,” in Proc. 
International Symposium on Quality Electronic Design, 2009. 
 
[11] S. Reda, A. Si, and R. I. Bahar, “Reducing the leakage and timing variability of 2D 
ICs using 3D ICs,” in Proc. International Symposium on Low Power Electronics and 
Design, 2009. 
[12] S. Garg and D. Marculescu, “System-level process variability analysis and 
mitigation for 3D MPSoCs,” in Proc. IEEE/ACM Design, Automation, and Test in 
Europe, 2009.  
[13] S. Ozdemir, G. Memik, Y. Pan, G. Loh, A. Das, and A. Choudhary, “Quantifying and 
coping with parametric variations in 3D-stacked microarchitectures,” in Proc. 
IEEE/ACM Design Automation Conference, 2010. 
[14] C. Ferri, S. Reda, and R. I. Bahar, “Strategies for improving the parametric yield and 
profits of 3D ICs,” in Proc. IEEE/ACM International Conference on Computer-
Aided Design, 2007.  
[15] G. Smith, L. Smith, S. Hosali, and S. Arkalgud, “Yield considerations in the choice 
of 3D technology,” in Proc. IEEE International Symposium on Semiconductor 
Manufacturing, 2007. 
[16] S. Reda, G. Smith, and L. Smith, “Maximizing the functional yield of wafer-to-
wafer 3-D integration,” IEEE Transactions on VLSI Systems, Sept. 2009. 
[17] J. Verbree, E. J. Marinissen, P. Roussel, and D. Velenis, “On the cost-effectiveness of 
matching repositories of pre-tested wafers for wafer-to-wafer 3D chip stacking,” in 
Proc. IEEE European Test Symposium, 2010. 
[18] D. H. Kim, S. Mukhopadhyay, and S. K. Lim, “TSV-aware interconnect length and 
power prediction for 3D stacked ICs,” in Proc. IEEE International Interconnect 
Technology Conference, 2009. 
[19] X. Zhao, S. Mukhopadhyay, and S. K. Lim, “Variation-tolerant and low-power clock 
network design for 3D ICs,” in Proc. IEEE Electronic Components and Technology 
Conference, 2011, pp. 2007–2014. 
[20] S. Mukhopadhyay, K. Kang, H. Mahmoodi, and K. Roy, “Reliable and self-repairing 
SRAM in nano-scale technologies using leakage and delay monitoring,” in Proc. 
IEEE International Test Conference, 2006. 
[21] B. Zhang, L. Liang, and X. Wang, “A new level shifter with low power in multi-
voltage system,” in Proc. IEEE International Conference on Solid-State and 
Integrated Circuit Technology, Shanghai, pp. 1857-1859, 2006. 
[22] B. Nikolic, V. Stojanovic, V. G. Oklobdzija, W. Jia, J. Chiu, and M. Leung, “Sense 





[25] W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis of fast, per-
core DVFS using on-chip switching regulators,” in Proc. IEEE International 
Symposium on High Performance Computer Architecture, 2008. 
[26] M. Cho, C. Liu, D. H. Kim, S. K. Lim, and S. Mukhopadhyay, “Design method and 
test structure to characterize and repair TSV defect induced signal degradation in 3D 
system,” in Proc. IEEE/ACM International Conference on Computer-Aided Design, 
2010. 
[27] K. Chae and S. Mukhopadhyay, “Tier-adaptive-voltage-scaling (TAVS): a 
methodology for post-silicon tuning of 3D ICs,” in Proc. IEEE/ACM Asia and South 
Pacific Design Automation Conference, Jan. 2012, pp. 277-282. 
[28] S. Sauter, D. Schmitt-Landsiedel, R. Thewes, and W. Weber, “Effect of parameter 
variations at chip and wafer level on clock skews,” IEEE Transactions on 
Semiconductor Manufacturing, vol. 13, no. 4, pp. 395–400, Nov. 2000.  
[29] A. Narasimhan and R. Sridhar, “Impact of variability on clock skew in H-tree clock 
networks,” in Proc. IEEE International Symposium on Quality Electronic Design, 
2007, pp. 458–466.  
[30] J. Minz, X. Zhao, and S. K. Lim, “Buffered clock tree synthesis for 3D ICs under 
thermal variations,” in Proc. IEEE/ACM Asia and South Pacific Design Automation 
Conference, 2008, pp. 504–509.  
[31] X. Zhao, D. L. Lewis, H. H. S. Lee, and S. K. Lim, “Pre-bond testable low-power 
clock tree design for 3D stacked ICs,” in Proc. IEEE/ACM International Conference 
on Computer-Aided Design, 2009, pp. 184–190.  
[32] T.-Y. Kim and T. Kim, “Clock tree synthesis with pre-bond testability for 3D stacked 
IC designs,” in Proc. IEEE/ACM Design Automation Conference, 2010, pp. 723–728.  
[33] X. Zhao and S. K. Lim, “Power and slew-aware clock network design for through-
silicon-via (TSV) based 3D ICs,” in Proc. IEEE/ACM Asia and South Pacific 
Design Automation Conference, 2010, pp. 175–180.  
[34] X. Zhao, J. Minz, and S. K. Lim, “Low-power and reliable clock network design for 
through-silicon via (TSV) based 3D ICs,” IEEE Transactions on Components, 
Packaging and Manufacturing Technology, vol. 1, no. 2, pp. 247–259, 2011.  
[35] T.-Y. Kim and T. Kim, “Clock tree embedding for 3D ICs,” in Proc. IEEE/ACM Asia 
and South Pacific Design Automation Conference, 2010, pp. 486–491.  
[36] X. Zhao and S. K. Lim, “TSV array utilization in low-power 3D clock network 
design,” in Proc. IEEE International Symposium on Low Power Electronics and 
Design, 2012.  
 
[37] C.-L. Lung, Y.-S. Su, S.-H. Huang, Y. Shi, and S.-C. Chang, “Fault-tolerant 3D 
clock network,” in Proc. IEEE/ACM Design Automation Conference, 2011, pp. 645–
651.  
[38] H. Xu, V. F. Pavlidis, and G. De Micheli, “Process-induced skew variation for scaled 
2-D and 3-D ICs,” in Proc. International workshop on System level Interconnect 
Prediction, 2010, pp. 17–24.  
[39] S. Herbert and D. Marculescu, “Variation-aware dynamic voltage/frequency scaling,” 
in Proc. IEEE International Symposium on High Performance Computer 
Architecture, Raleigh, NC, Feb. 2009. 
[40] S. Dighe, S. R. Vangal, P. Aseron, S. Kumar, T. Jacob, K. A. Bowman, J. Howard, J. 
Tschanz, V. Erraguntla, N. Borkar, V. K. De, and S. Borkar, “Within-die variation-
aware dynamic-voltage-frequency-scaling with optimal core allocation and thread 
hopping for the 80-Core TeraFLOPS processor,” IEEE Journal of Solid-State 
Circuits, vol. 46, no. 1, pp. 184–193, 2011. 
[41] S. H. Kulkarni, D. M. Sylvester, and D. T. Blaauw, “Design-time optimization of 
post-silicon tuned circuits using adaptive body bias,” IEEE Transactions on 
Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, 
pp. 481-494, Mar. 2008. 
[42] M. Bhushan, M. Ketchen, S. Polonsky, and A. Gattiker, “Ring oscillator based 
technique for measuring variability statistics,” in Proc. IEEE International 
Conference on Microelectronic Test Structures, 2006, pp. 87–92. 
[43] B. Wan, J. Wang, G. Keskin, and L. T. Pileggi, “Ring oscillators for single process-
parameter monitoring,” in Workshop on Test Structure Design for Variability 
Characterization, 2008. 
[44] L. Pang and B. Nikolic, “Measurements and analysis of process variability in 90 nm 
CMOS,” IEEE Journal of Solid-State Circuits, vol. 44, no. 5, pp. 1655–1663, 2009. 
[45] I. A. K. M. Mahfuzul, A. Tsuchiya, K. Kobayashi, and H. Onodera, “Process-
sensitive monitor circuits for estimation of die-to-die process variability,” in Proc. 
ACM/ IEEE International Workshop on Timing Issues in the Specification and 
Synthesis of Digital Systems, 2010, pp. 83–88. 
[46] K. Shinkai and M. Hashimoto, "Device-parameter estimation with on chip variation 
sensors considering random variability," in Proc. IEEE/ACM Asia and South Pacific 
Design Automation Conference, 2011, pp. 683-688. 
[47] T. Iizuka, J. Jeong, T. Nakura, M. Ikeda, and K. Asada, “All-digital on-chip monitor 
for pMOS and nMOS process variability measurement utilizing buffer ring with 
pulse counter,” in Proc. IEEE European Solid-State Circuits Conference, Sep. 2010, 
pp. 182–185. 
 
[48] J. Jeong, T. Iizuka, T. Nakura, M. Ikeda, and K. Asada, “All-digital pMOS and 
nMOS process variability monitor utilizing buffer ring with pulse counter,” in Proc. 
IEEE/ACM Asia and South Pacific Design Automation Conference, 2011, pp. 79-80. 
[49] X. Zhang, K. Ishida, M. Takamiya, and T. Sakurai, “An on-chip characterizing 
system for within-die delay variation measurement of individual standard cells in 
65-nm CMOS,” in Proc. IEEE/ACM Asia and South Pacific Design Automation 
Conference, 2011, pp. 109-110. 
[50] A. Hokazono, S. Balasubramanian, K. Ishimaru, H. Ishiuchi, C. Hu, and T. K. Liu, 
“Forward body biasing as a bulk-Si CMOS technology scaling strategy,” IEEE 
Transactions on Electron Devices, vol. 55, no. 10, pp. 2657-2664, Oct. 2008. 
[51] S. G. Narendra and A. Chandrakasan, Leakage in Nanometer CMOS Technologies, 
Springer, November 2005. 
[52] T. Sato and Y. Kunitake, “A simple flip-flop circuit for typical-case designs for 
DFM,” in Proc. International Symposium on Quality Electronic Design, 2007, 
pp.539–544. 
[53] A. K. Uht, “Going beyond worst-case specs with TEAtime,” IEEE Computer, vol. 37, 
no. 3, pp. 51-56, Mar. 2004. 
[54] A. Drake, R. Senger, H. Deogun, G. Carpenter, S. Ghiasi, T. Nguyen, N. James, M. 
Floyd, and V. Pokala, “A distributed critical-path timing monitor for a 65 nm high-
performance microprocessor,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, pp. 
398–399. 
[55] R. Franch, P. Restle, N. James, W. Huott, J. Friedrich, R. Dixon, S. Weitzel, K. Van 
Goor, and G. Salem, “On-chip timing uncertainty measurements on IBM 
microprocessors,” in Proc. IEEE International Test Conference, Oct. 2007, pp. 1–7. 
[56] J. Xu, P. Hazucha, P. Huang, M. Huan, P. Aseron, F. Paillet, G. Schrom, J. Tschanz, 
C. Zhao, V. De, T. Karnik, and G. Taylor, “On-die supply-resonance suppression 
using band-limited active damping,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2007, 
pp. 286–603. 
[57] J. Gu, R. Harjani, and C. Kim, “Distributed active decoupling capacitors for on-chip 
supply noise cancellation in digital VLSI circuits,” in IEEE Symp. VLSI Circuits 
Dig., Jun. 2006, pp. 216–217. 
[58] N. Kurd, P. Mosalikanti, M. Neidengard, J. Douglas, and R. Kumar, “Next 
generation Intel® Core™ micro-architecture (Nehalem) clocking,” IEEE Journal of 
Solid-State Circuits, vol. 44, no. 4, pp. 1121–1129, Apr. 2009. 
[59] K. L. Wong, T. Rahal-Arabi, M. Ma, and G. Taylor, “Enhancing microprocessor 
immunity to power supply noise with clock-data compensation,” IEEE Journal of 
Solid-State Circuits, vol. 41, no. 4, pp. 749–758, Apr. 2006. 
 
[60] D. Jiao and C. Kim, "A programmable adaptive phase-shifting PLL for clock data 
compensation under resonant supply noise," in IEEE ISSCC Dig. Tech. Papers, Feb. 
2011, pp. 272-274. 
[61] T. Fischer, J. Desai, B. Doyle, S. Naffziger, and B. Patella, “A 90-nm variable 
frequency clock system for a power-managed Itanium architecture processor,” IEEE 
Journal of Solid-State Circuits, vol. 41, no. 1, pp.218–228, Jan. 2006. 
[62] S. Borkar, “Designing reliable systems from unreliable components: the challenges 
of transistor variability and degradation,” in Proc. IEEE/ACM International 
Symposium on Microarchitecture, Nov-Dec 2005, pp. 10-16.  
[63] D. Ernst, N. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, 
K. Flautner, and T. Mudge, “Razor: A low-power pipeline based on circuit-level 
timing speculation,” in Proc. IEEE/ACM International Symposium on 
Microarchitecture, Dec. 2003, pp. 7–18. 
[64] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, 
“A self-tuning DVS processor using delay-error detection and correction,” IEEE 
Journal of Solid-State Circuits, vol. 41, no. 4, pp.792–804, Apr. 2006. 
[65] S. Das, C. Tokunaga, S. Pant, W. H. Ma, S. Kalaiselvan, K. Lai, D. M. Bull, and D. 
T. Blaauw, “RazorII: In situ error detection and correction for PVT and SER 
tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp.32–48, Jan. 2009. 
[66] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S. L. L. Lu, 
T. Karnik, and V. K. De, “Energy-efficient and metastability-immune resilient 
circuits for dynamic variation tolerance,” IEEE Journal of Solid-State Circuits, vol. 
44, no. 1, pp. 49-63, Jan. 2009. 
[67] K. A. Bowman and J. W. Tschanz, "Resilient microprocessor design for improving 
performance and energy efficiency," in Proc. IEEE/ACM International Conference 
on Computer-Aided Design, 2010, pp. 85-88. 
[68] J. Tschanz, K. Bowman, S.-L. Lu, P. Aseron, M. Khellah, A. Raychowdhury, B. 
Geuskens, C. Tokunaga, C. Wilkerson, T. Karnik, and V. De, "A 45nm resilient and 
adaptive microprocessor core for dynamic variation tolerance," in IEEE ISSCC Dig. 
Tech. Papers, 2010, pp. 282-283. 
[69] K. A. Bowman, J. W. Tschanz, S. L. Lu, P. A. Aseron, M. M. Khellah, A. 
Raychowdhury, B. M. Geuskens, C. Tokunaga, C. B. Wilkerson, T. Karnik, and V. K. 
De, "A 45 nm resilient microprocessor core for dynamic variation tolerance," IEEE 
Journal of Solid-State Circuits, vol. 46, pp. 194-208, 2011. 
[70] D. Bull, S. Das, K. Shivashankar, G. S. Dasika, K. Flautner, and D. Blaauw, "A 
power-efficient 32 bit ARM processor using timing-error detection and correction 
for transient-error tolerance and adaptation to PVT variation," IEEE Journal of 
Solid-State Circuits, vol. 46, pp. 18-31, 2011. 
 
[71] K. Chae, S. Mukhopadhyay, C.-H. Lee, and J. Laskar, “A dynamic timing control 
technique utilizing time borrowing and clock stretching,” in Proc. IEEE Custom 
Integrated Circuits Conference, Sept. 2010, pp. 1-4.  
[72] V. G. Oklobdzija, V. Stojanovic, D. Markovic, and N. Nedovic, Digital System 
Clocking, High-Performance and Low-Power Aspects, John Wiley, January 2003. 
[73] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design 
Perspective, Prentice Hall, January 2003. 
[74] N. Weste and D. Harris, Eds., CMOS VLSI Design: A Circuits and Systems 
Perspective, Addison Wesley, 2005.  
[75] L. T. Clark, E. J. Hoffman, J. Miller, M. Biyani, L. Luyun, S. Strazdus, M. Morrow, 
K. E. Velarde, and M. A. Yarch, “An embedded 32-b microprocessor core for low-
power and high-performance applications,” IEEE Journal of Solid-State Circuits, 
vol. 36, no. 11, pp. 1599-1608, Nov. 2001. 
[76] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. J. Sullivan, and T. 
Grutkowski, “The implementation of the Itanium 2 microprocessor,” IEEE Journal 
of Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, Nov. 2002. 
[77] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, “Flow-
through latch and edge-triggered flip-flop hybrid elements,” in IEEE ISSCC Dig. 
Tech. Papers, 1996, pp. 138-139. 
[78] S. Kozu, M. Daito, Y. Sugiyama, H. Suzuki, H. Morita, M. Nomura, K. Nadehara, S. 
Ishibuchi, M. Tokuda, Y. Inoue, T. Nakayama, H. Harigai, and Y. Yano, “A 100 MHz 
0.4W RISC processor with 200 MHz multiply-adder, using pulse-register technique,” 
in IEEE ISSCC Dig. Tech. Papers, 1996, pp. 140-141. 
[79] S. Paik, L. Yu, and Y. Shin, “Statistical time borrowing for pulsed-latch circuit 
designs,” in Proc. IEEE/ACM Asia and South Pacific Design Automation 
Conference, 2010, pp. 675-680. 
[80] V. Joshi, D. Blaauw, and D. Sylvester, “Soft-edge flip-flops for improved timing 
yield: design and optimization,” in Proc. IEEE/ACM International Conference on 
Computer-Aided Design, Nov. 2007, pp. 667-673. 
[81] K. Bowman, J. Tschanz, M. Khellah, M. Ghoneima, Y. Ismail, and V. De, “Time-
borrowing multi-cycle on-chip interconnects for delay variation tolerance,” in Proc. 
International Symposium on Low Power Electronics and Design, Oct. 2006, pp. 79-84. 
[82] M. Fojtik, D. Fick, K. Yejoong, N. Pinckney, D. Harris, D. Blaauw, and D. Sylvester, 
"Bubble Razor: An architecture-independent approach to timing-error detection and 
correction," in IEEE ISSCC Dig. Tech. Papers, 2012, pp. 488-490. 
 
[83] K. Chae and S. Mukhopadhyay, “All-digital adaptive clocking to tolerate transient 
supply noise in low voltage operation,” IEEE Trans. Circuits and Systems II, vol. 59, 
no. 12, pp. 893-897, Dec. 2012. 
[84] OpenCores: http://opencores.org/ 
[85] Nangate: http://www.nangate.com/ 
[86] S. Dhar, D. Maksimović, and B. Kranzen, “Closed-loop adaptive voltage scaling 
controller for standard-cell ASICs,” in Proc. International Symposium on Low 
Power Electronics and Design, Oct. 2002, pp. 103-107. 
[87] J. Kim and M. A. Horowitz, “An efficient digital sliding controller for adaptive 
power-supply regulation,” IEEE J. Solid-State Circuits, vol. 37, no. 5, pp. 639–647, 
May 2002. 
[88] H. Okano, T. Shiota, Y. Kawabe, W. Shibamoto, T. Hashimoto, and A. Inoue, 
“Supply voltage adjustment technique for low power consumption and its 
application to SOCs with multiple threshold voltage CMOS,” in IEEE Symp. VLSI 
Circuits Dig., Jun. 2006, pp. 208–209. 
[89] A. Raychowdhury, D. Somasekhar, J. Tschanz, V. De, “A fully-digital phase-locked 
low dropout regulator in 32nm CMOS,” in IEEE Symp. VLSI Circuits Dig., Jun. 





[1] Kwanyeob Chae and Saibal Mukhopadhyay, "Resilient pipeline under supply noise 
with programmable-time-borrowing and delayed-clock-gating," IEEE Trans. 
Circuits and Systems II, under review. 
[2] Kwanyeob Chae and Saibal Mukhopadhyay, "A dynamic timing error prevention 
technique in pipelines with time borrowing and clock stretching," IEEE Trans. 
Circuits and Systems I, accepted. 
[3] Kwanyeob Chae and Saibal Mukhopadhyay, “All-digital adaptive clocking to 
tolerate transient supply noise in low voltage operation,” IEEE Trans. Circuits and 
Systems II, vol. 59, no. 12, pp. 893-897, Dec. 2012. 
[4] Kwanyeob Chae, Xin Zhao, Sung Kyu Lim, and Saibal Mukhopadhyay, “Tier 
adaptive body biasing: a post-silicon tuning method to minimize clock skew 
variations in 3-D ICs,” IEEE Trans. Components, Packaging, and Manufacturing 
Technology, accepted. 
[5] Kwanyeob Chae, Xin Zhao, Sung Kyu Lim, and Saibal Mukhopadhyay, “Post-
silicon tuning method for clock network to minimize clock skews in 3D ICs,” 
TECHCON, Sep. 2012.   
[6] Kwanyeob Chae and Saibal Mukhopadhyay, “Tier-adaptive-voltage-scaling 
(TAVS): a methodology for post-silicon tuning of 3D ICs,” IEEE ASP-DAC, Feb. 
2012. 
[7] [Invited] Kwanyeob Chae, Minki Cho, and Saibal Mukhopadhyay, “Low-power 
design under variation using error prevention and error tolerance,” LATW, 2012. 
[8] Kwanyeob Chae, Mitchelle Rasquinha, Syed Minhaj Hassan, Sudhakar 
Yalamanchili, and Saibal Mukhopadhyay, “Statistical analysis of the effect of 
network on performance of many-core platform with 3D-stacked DRAM,” 
TECHCON, Sep. 2011. 
[9] [Invited] Kwanyeob Chae, Chang-Ho Lee, and Saibal Mukhopadhyay, “Timing 
error prevention using elastic clocking,” IEEE International Conference on IC 
Design and Technology, May 2011. 
[10] Kwanyeob Chae, Saibal Mukhopadhyay, Chang-Ho Lee, and Joy Laskar, “A 
dynamic timing control technique utilizing time borrowing and clock stretching,” 






Kwanyeob Chae was born in Haenam, South Korea. He received his B.S. and M.S. 
degrees in electronics engineering from Korea University, Seoul, Korea, in 1998 and 
2000, respectively. He is currently pursuing the Ph.D. degree in electrical and computer 
engineering with the Georgia Institute of Technology, Atlanta, GA.  
  He joined Samsung Electronics Co., Ltd., in 2000, where he was engaged in the 
development of digital circuits. His research interests include self-adaptive circuits, low-
power circuits and systems, variation-tolerant design, non-volatile memories, and 3D ICs. 
He was a recipient of the 2007 Samsung LSI Presidential Award and the 1998 LG 
Semiconductor Contest Award. 
