Scalable Analysis, Verification and Design of IC Power Delivery by Zeng, Zhiyu
SCALABLE ANALYSIS, VERIFICATION AND DESIGN OF IC POWER
DELIVERY
A Dissertation
by
ZHIYU ZENG
Submitted to the Office of Graduate Studies of
Texas A&M University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
December 2011
Major Subject: Computer Engineering
Approved by:
Chair of Committee, Peng Li
Committee Members, Weiping Shi
Duncan M. Walker
Byung-Jun Yoon
Sani R. Nassif
Head of Department, Costas N. Georghiades
ABSTRACT
Scalable Analysis, Veriﬁcation and Design of IC Power Delivery. (December 2011)
Zhiyu Zeng, B.S., Zhejiang University
Chair of Advisory Committee: Dr. Peng Li
Aggressive process scaling into the nanometer regime has created many challenges
for power delivery network design, and these challenges place more stringent and
specific requirements on EDA tools. From the perspective of analysis, simulation
efficiency for large grids must be improved, and it must be possible to analyze the
entire network together with off-chip models and nonlinear devices. Gated power
delivery networks have multiple on/off operating conditions, all of which need to be
verified against the design requirements. Good power delivery network designs must
not only save wiring resources for signal routing but also assign optimal parameters
to system components such as decaps, voltage regulators and converters. This
dissertation presents new methodologies to address these challenging problems.
First, a novel parallel partitioning-based approach, which provides a flexible
network partitioning scheme using locality, is proposed for power grid static analysis.
In addition, a fast CPU-GPU combined analysis engine, which adopts a
boundary-relaxation method to encompass several simulation strategies, is developed
to simulate power delivery networks with off-chip models and active circuits. Both
proposed analysis approaches achieve scalable simulation runtime.
Then, for gated power delivery networks, the challenge posed by the large
verification space is addressed by a strategy that efficiently identifies a number
of candidates for the worst-case operating condition. The computational complexity
is reduced from O(2^N) to O(N).
Finally, motivated by a proposed two-level hierarchical optimization, this
dissertation presents a novel locality-driven partitioning scheme to facilitate
divide-and-conquer-based scalable wire sizing for large power delivery networks.
Multiple partitions can be sized simultaneously, which leads to substantial runtime
improvement. Moreover, the electrical interactions between active
regulators/converters and passive networks, and their influence on key system design
specifications, are analyzed comprehensively. With the derived design insights, the
system-level co-design of a complete power delivery network is facilitated by an
automatic optimization flow. Results show significant performance enhancement
brought by the co-design.
To my wife Jing and my parents.
ACKNOWLEDGMENTS
Looking back, I am very grateful for all I have received throughout these years.
My sincere gratitude goes to my advisor, Prof. Peng Li. Without his guidance
and support, most of the work presented in this dissertation would not have been
accomplished. He has enlightened me through his wide knowledge of the Electronic
Design Automation area and his deep intuition about where my research should go and
what is necessary to get there. During the more than four years I have known him, he
has taught me how to analyze problems and how to build concrete steps towards
solving them. Prof. Li's valuable feedback from numerous one-on-one meetings,
informal discussions and group meeting presentations has been a continuous
source of motivation and inspiration. He has always encouraged me to dream big
and to think for the long term, which has helped to shape my personality.
I would like to thank Dr. Sani Nassif from IBM ARL and Prof. Vivek Sarin
from the CSE Department. Discussing problems with Dr. Nassif has always been a
pleasure. It was he who encouraged me to look at the problems of on-chip
voltage regulation. He also invited me to give a presentation about my research at
IBM ARL and introduced me to Dr. Frank Liu, Dr. Zhuo Li, Dr. Howard Smith
and Mr. Dave Thomas for helpful technical interactions. It was my pleasure to work
with Prof. Sarin on building a performance model for the parallel power grid static
analysis approach. He taught me a lot about parallel computing and helped me migrate
my code to the supercomputer, Hydra. I also owe a lot to Prof. Weiping Shi, Prof.
Duncan Walker, Prof. Byung-Jun Yoon and Prof. Jiang Hu for serving on my PhD
committee and providing insightful feedback on my preliminary exam.
I would also like to thank many other people in the Computer Engineering Group
who contributed a lot to my dissertation by providing reference code and technical
discussions. Dr. Zhuo Feng has been a big brother to me whose passion and diligence
keep driving me forward. His GPU-based multigrid power grid solver is essential to
the successful implementation of my CPU-GPU combined simulation engine, GSim.
Dr. Wei Dong taught me a lot about general circuit simulation. Dr. Guo Yu guided me
in designing and drawing the layout for the incrementer residing in a digital PLL.
Dr. Xiaoji Ye provided me with the modified SPICE code for GSim and his experience
in using APPS. Suming Lai's voltage regulator designs and solid analog circuit
background were critical in my research on on-chip voltage regulation. Tong Xu's
fast implementation skills guaranteed our success in the 2011 TAU Power Grid
Simulation Contest. Dr. Boyuan Yan was always there when I needed help. I also
thank Leyi Yin, Yong Zhang, Amandeep Singh, and Haokai Lu for helpful discussions.
I thank Prof. Jiming Song and Prof. Kangsheng Chen, who served as my advisors
when I was at Iowa State University and Zhejiang University, respectively. They were
the first two people who led me into research and made me enjoy doing it.
Dr. Rajesh Gupta, Mr. Mandeep Singh, Mr. Zach Coombes and Mr. Lam
Nguyen from Samsung Austin R&D Center helped me understand the importance of
power delivery in real circuit designs and narrow the gap between industrial designs
and academic research.
Thanks to my family, especially my grandparents, my parents, my uncles and
aunts, and my cousins, for their help, support, comfort and encouragement through
all these years. Last but not least, I deeply appreciate the spiritual
and emotional support and help from my wife Jing. Without her I would be a very
different person today, and it would certainly have been much harder to finish my
PhD degree. Still today, her unselfish love makes me a better person.
I finish with a silence of gratitude for my life.
TABLE OF CONTENTS
CHAPTER
I INTRODUCTION
A. Preliminaries
1. Design Perspectives
2. Modeling
3. Design Trends and Challenges
B. Survey of Previous Work
1. Survey of PDN Analysis
2. Survey of PDN Verification
3. Survey of PDN Design
C. Proposed Solutions
1. Proposed Solutions on PDN Analysis
2. Proposed Solutions on Power-Gated PDN Verification
3. Proposed Solutions on PDN Wire Sizing and On-Chip Voltage Regulation Design
II POWER DELIVERY NETWORK ANALYSIS AND VERIFICATION
A. Locality-Driven Parallel Static Analysis for Power Delivery Networks
1. Background
2. Overview of the Proposed Approach
3. Parallel Locality-Driven Static Analysis
a. Parallel Boundary Current Approximation Using Locality
b. Parallel Partition Simulation Using Approximated Boundary Currents
c. Block-Based Iterative Error Reduction
d. Algorithm Flow and Implementation
e. Computational Cost Analysis
f. Performance Modeling for Parallel Implementation
4. Experimental Results
5. Summary
B. GSim: A Fast CPU-GPU Combined Parallel Simulator for Power Delivery Networks
with On-Chip Voltage Regulation
1. Background and Overview
2. CPU-GPU Combined Transient Simulation
a. GPU-Based Multigrid Method
b. Boundary Relaxation
c. Convergence
d. Simulation Flow
3. Experimental Results
4. Summary
C. Transient Verification of Power-Gated Power Delivery Networks
1. Background
a. Modeling for Power-Gated PDNs
b. Verification Metrics
2. Overview of the Transient Verification
a. Verification Tasks
b. Overview of the Proposed Transient Verification Methodology
3. Stable-Mode EM Verification for Global Grids
a. Challenges for Transient EM Verification
b. Superposition Approximation
c. Worst Case Validation
4. Other Verifications
a. Stable-Mode EM Verification for Local Grids
b. Stable-Mode Peak Dynamic Voltage Drop Verification
c. Power-On Peak Dynamic Voltage Drop Verification
5. Experimental Results
a. Stable-Mode Verification
b. Power-On Verification
6. Summary
III POWER DELIVERY NETWORK DESIGN
A. Locality-Driven Parallel Power Grid Wire Sizing
1. Background
a. Problem Formulation
b. Constrained Nonlinear Optimization
2. Overview of the Proposed Parallel Optimization
a. Key Issues in Parallelizable Power Grid Optimization
b. Two-Level Hierarchical Optimization
c. Proposed Parallel Two-Step Optimization Formulation
3. Parallel Solution of Optimal Boundary Conditions
4. Parallel Optimization of Partitioned Sub Power Grids
a. Optimization of Partitioned Grids Using Relaxed Boundary Constraints
b. Maintenance of IR Drop Constraints
c. Fixing EM Violation
d. Algorithm Flow for the Proposed Locality Driven Parallel Optimization
5. Experimental Results
a. Partition Optimization
b. Overall Optimization
6. Summary
B. System-Level Co-Design of Power Delivery Networks with On-Chip Voltage Regulation
1. Background
a. System Modeling
b. Benefits of On-Chip Voltage Regulation
c. Low-Dropout Regulator Design
d. Decoupling Capacitor Sizing and Placement
e. Buck Converter Design
2. System-Level Co-Design
a. LDO-Decap System Co-Design
b. LDO-Decap-BC System Co-Design
3. Co-Optimization Formulation and Methodology
a. Co-Optimization for LDO-Decap System
b. Co-Optimization for LDO-Decap-BC System
4. Experiment Results
a. LDO-Decap System Co-Optimization
b. LDO-Decap-BC System Co-Optimization
5. Summary
IV CONCLUSION
REFERENCES
VITA
LIST OF TABLES
TABLE
I Comparison of boundary IR drops and currents obtained in boundary window
simulations using different window sizes with the exact solutions. Voltage drop
is in mV. Current is in mA.
II Simulation runtime for flat simulation and the proposed parallel simulation.
All the grids are divided into 16 partitions with 24 windows. 16 processors are
used. Chol: Cholesky decomposition time; Sol: triangular solve time; Tot: total
runtime; Mem: memory; Iter: number of iterations; Sp: speedup; Em: maximum node
voltage error; Ea: average node voltage error. Runtime is in seconds; error is
in mV; and memory is in GB.
III Speedups of the parallel implementation of the algorithm over the sequential
implementation of the algorithm. 16 processors are used for the parallel
simulation. Runtime is in seconds.
IV Runtime analysis with various numbers of partitions for 9M, 12.96M, and 16M
grids. Chol: Cholesky decomposition time; Sol: triangular solve time; Iter:
number of iterations; Tot: total runtime. Runtime is in seconds.
V Number of iterations required by using various window sizes. Iter: number of
iterations; Tot: total runtime. Runtime is in seconds.
VI Transient simulation runtime and numbers of iterations of GSim for PDNs with
on-chip LDOs. 1200 time steps are simulated. CPU%: the percentage of runtime
spent on CPU. The runtime is in seconds.
VII EM stable mode verification for gated PDN. The global VDD grid is examined.
# Con.: number of configurations; T: total runtime; |EM|max: maximum absolute
average current; P: number of worst case candidates. Runtime is in hrs. EM is
in mA. Runtimes of the enumeration methods for larger circuits are estimated
times (∼ time value). The transient verifications are run for 200 clock cycles
(2000 time steps).
VIII Peak DVD stable mode verification for gated PDN. For grids with 8 and 16
grids, two different local grids are examined. Gcor: corner grid; Gcen: center
grid; # Con.: number of configurations; T: total runtime; DVDp,max: maximum
peak DVD; P: number of worst case candidates. Runtime is in hrs. DVD is in mV.
Runtimes of the enumeration methods for larger circuits are estimated times
(∼ time value). The transient verifications are run for at least 200 clock
cycles (2000 time steps).
IX Peak DVD transition verification for gated PDN. The transition grid is at the
center. For grids with 8 and 16 grids, two nodes in two different grids are
examined. Gclo: the grid close to the transition grid; Gfar: the grid far away
from the transition grid; # Con.: number of configurations; T: total runtime;
DVDp,max: maximum peak DVD; P: number of worst case candidates. Runtime is in
hrs. DVD is in mV. Runtimes of the enumeration methods for larger circuits are
estimated times (∼ time value). The transient verifications are run for at
least 200 clock cycles (2000 time steps).
X Optimized boundary voltage and current as functions of window size in the
window-based optimization. WS: window size. C4S: C4 ring size. NV: voltage for
a boundary node in V. NC: current for a boundary branch in mA. AV: average
voltage for the boundary in V. AC: average current for the boundary in mA.
OPT: optimal value.
XI Optimization runtime for flat optimization, serial locality-driven
optimization, and parallel locality-driven optimization. N: number of nodes.
P: number of partitions. SIM: flat simulation runtime in sec. OPT: optimization
runtime in sec. BWO: boundary window optimization runtime in sec. PO: partition
optimization runtime in sec. TOT: total runtime in sec. AR: area reduction in %.
WSb: beginning window size. WSt: maximum terminating window size. IT: number of
iterations for window size determination. NVio: number of EM violations.
Viomax: maximum EM violation in mA/um.
XII Multi-level optimization vs. straight optimization.
LIST OF FIGURES
FIGURE
1 Diagram of a complete power delivery network.
2 Diagram of an on-chip power delivery network.
3 A complete model of the power delivery network.
4 VDD grid model.
5 Design trends and challenges for power delivery networks.
6 Work towards scalable analysis, verification and design for IC power delivery.
7 Power grid partitions and partition boundaries.
8 Spatial locality in power grids. Red dots represent C4 connections.
9 Finding the near-exact boundary currents via locality. Red dots represent C4
connections. Black dash lines represent partition boundaries. Blue dash lines
represent window boundaries.
10 Impact of window size on boundary current approximation. Red dots represent
C4 connections. Black dash lines represent partition boundaries. Blue dash
lines represent window boundaries.
11 Power grid partition simulation by setting boundary currents.
12 Algorithm flow of the proposed parallel locality-driven method.
13 Accuracy comparison of boundary current approximations using different
numbers of partitions. The power grid in (a) is divided into 16 partitions, and
the power grid in (b) is divided into 32 partitions. The values are node
voltage errors after first iteration. The unit is Volt.
14 Accuracy comparison of boundary current approximations using different window
sizes. The power grid in (a) uses window size 40 (C4 ring size 1), and the
power grid in (b) uses window size 80 (C4 ring size 2). The values are node
voltage errors after first iteration. The unit is Volt.
15 Runtime comparison between the flat simulation and the proposed parallel
simulation.
16 The model for a power delivery network with on-chip voltage regulators.
17 GSim simulation diagram.
18 Scalability of runtime and memory consumption for the GPU solver.
19 Boundary relaxation for a single LDO.
20 GSim simulation flow.
21 A power-gated power delivery network model.
22 EM and DVD metrics.
23 Stable and transition configurations.
24 Proposed transient verification tasks.
25 The proposed transient verification methodology.
26 A simple power-gated PDN in [26]. For simplicity, only four local grids are
shown. Each local grid has a single sleep transistor and two DC current loads.
27 The PDN model diagram. For simplicity, only four local grids are shown. Each
local grid has two sleep transistors, one current load and one decoupling
capacitor.
28 Switchable current source model for local grids. For simplicity, the package
is not shown and there are only three local grids.
29 A simple RLC model for the power delivery network.
30 Switchable current source values for the global basic configuration (a) and
the full-decap basic configuration (b). For simplicity, the package is not
shown and there are only three local grids.
31 A simple example for switchable current source approximation. The package is
not shown and there are only two local grids.
32 Superposition for global grid verification.
33 Flow for worst case validation.
34 Superposition for local grid verification.
35 Dynamic voltage drop for circuit block Tn. For simplicity, the package is not
shown.
36 Drain-source conductance of a PMOS sleep transistor during power-on time.
37 Verification of power-on peak dynamic voltage drop.
38 A naive divide-and-conquer optimization approach.
39 Power grid partitioning by setting boundary voltages and currents.
40 Optimization of the partitioned grid under relaxed boundary conditions.
41 Merging process analysis.
42 Impact of boundary conditions on the quality of final optimized power grid.
(a): node voltage distribution before optimization. (b): EM distribution of
horizontal wires before optimization. (c): node voltage distribution after
optimization using window size 40. (d): EM distribution of horizontal wires
after optimization using window size 40. (e): node voltage distribution after
optimization using window size 65. (f): EM distribution of horizontal wires
after optimization using window size 65.
43 Speedup of parallel over serial locality-driven optimization.
44 Threads runtime (in sec) of boundary window optimizations and partition
optimizations for 640K-node power grid.
45 System structure and model of a power delivery network with on-chip voltage
regulation.
46 Voltage drops for a power domain with LDOs and without LDOs.
47 LDO topology.
48 Buck converter topology.
49 Power consumption of the PDN with on-chip LDOs.
50 Stability check reasoning procedure.
51 Phase margin vs. Rs.
52 LDO-decap system co-optimization flow.
53 Multigrid-based optimization.
54 Optimization results of decap uni-optimization, LDO uni-optimization, and
co-optimization. (a) LDO and decap area in mm^2. (b) System power efficiency.
(c) Ground power in mW. (d) Number of LDO blocks.
55 Optimization results of favoring different design aspects. (a) LDO and decap
area in mm^2. (b) System power efficiency. (c) Ground power in mW. (d) Number
of LDO blocks.
56 Optimization results for using planar gate-oxide decoupling capacitance vs.
deep trench decoupling capacitance. (a) LDO and decap area in mm^2. (b) System
power efficiency. (c) Ground power in mW. (d) Number of LDO blocks.
57 Optimization results of LDO-decap-BC system co-optimization vs. LDO-decap
system co-optimization. (a) LDO and decap area in mm^2. (b) System power
efficiency. (c) Ground power in mW. (d) Number of LDO blocks.
CHAPTER I
INTRODUCTION
A. Preliminaries
Very-Large-Scale Integration (VLSI) Power Delivery Networks (PDNs), also known
as power grids, play the critical role of reliably powering all on-chip devices [1].
The main functions of a power delivery network include [1] [2] [3]:
• Maintaining stable voltage supply levels to all on-chip devices under all possible
chip activities.
• Supplying the average and peak power demands of the entire chip.
• Providing a current return path for signals.
The entire power delivery network mainly consists of an on-board voltage
regulation/conversion module, off-chip PDNs (power and ground networks) and on-chip
PDNs (power and ground networks). A diagram of a complete power delivery
network is presented in Figure 1 [3]. The current flows from the on-board Voltage
Regulator Module (VRM), through the Printed Circuit Board (PCB) PDN, the socket,
the package and the C4 bumps to the chip. On the chip, as shown in Figure 2 [1],
the current goes from the top metal layers all the way down to the transistors. The
current returns to ground along the path from the on-chip PDN to the PCB. Off-chip
and on-chip decoupling capacitors (decaps) are placed in the network to reduce the
noise caused by fast-switching on-chip circuits.
The journal model is IEEE Transactions on Very Large Scale Integration (VLSI)
Systems.
Fig. 1. Diagram of a complete power delivery network. (Diagram labels: PCB,
socket, package, C4 bumps, chip.)
Fig. 2. Diagram of an on-chip power delivery network.
1. Design Perspectives
To design a power delivery network, the following five perspectives should be
considered:
• Power: While delivering power to on-chip devices, the power delivery network
itself dissipates power. The dissipated power mainly consists of the power loss
of the voltage regulation/conversion module (caused by the regulator/converter
circuit dynamics and the quiescent current) and the leakage power of the on-
chip decoupling capacitance (caused by gate leakage for MOS-based decaps)
[2]. Therefore, the power eﬃciency and the quiescent power of the voltage
regulation/conversion module as well as the total amount of on-chip decaps
and the on-chip decap technology have to be considered.
• Noise: The power supply noise in the PDN has two components: IR drop and
Ldi/dt noise [4]. The IR drop is caused by the resistance between the on-board
voltage supply and the on-chip nodes. The Ldi/dt noise is introduced by the
inductive parasitics of the network. Since all the on-chip devices reside between
the power network and the ground network, both the voltage drop of the power
network and voltage overshoot of the ground network have to be considered.
• Reliability: Electromigration (EM) is the transport of metal ions caused by a
direct current flowing through a metal wire over a substantial time period [5].
If this effect accumulates over a long time, it eventually causes the wire
to break or to short-circuit to another wire. The rate of electromigration
depends strongly on the average current density. Thus, the average current density
on each wire segment must be checked during design.
• Routing: A sufficient number of metal wires for the on-chip PDN should be
allocated in all metal layers in order to reduce current densities and IR drops [2].
However, on-chip PDNs may use up a lot of signal routing resources. Therefore,
the total number and area of the metal wires that are used by the PDN on each
layer have to be considered.
• Area: The area cost of the PDN mainly consists of the area used by the voltage
regulation/conversion module and the die area taken by on-chip decoupling
capacitance. Therefore, the total area for the PDN is constrained by the on-
board and on-chip available area.
All the design specifications associated with the above design perspectives must be
satisfied in the PDN design. The exact values of these design specifications are
determined by the circuit functions, CMOS technology, system budgets, etc.
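As a rough illustration of the noise and reliability perspectives above, the
following Python sketch estimates supply noise as the sum of an IR drop and an
Ldi/dt term, and checks the average current of a wire segment against an EM limit.
The function names, the lumped single-path model, the use of current per unit wire
width (mA/um, which assumes a fixed metal thickness) and all numeric values are
illustrative assumptions, not taken from this dissertation.

```python
def supply_noise(i_avg, resistance, inductance, di, dt):
    """Lumped supply-noise estimate in volts: IR drop plus L*di/dt.

    Assumption: a single series R-L path between the supply and the node.
    """
    ir_drop = i_avg * resistance        # resistive component (V)
    ldidt = inductance * (di / dt)      # inductive component (V)
    return ir_drop + ldidt


def em_violation(i_avg, width_um, limit_ma_per_um):
    """Return True if the average current per unit wire width exceeds the limit.

    i_avg is in amperes, width_um in micrometers, the limit in mA/um.
    """
    density = abs(i_avg) * 1e3 / width_um   # mA per um of wire width
    return density > limit_ma_per_um


if __name__ == "__main__":
    # Hypothetical values: 0.5 A through 20 mOhm, 0.2 A swing in 100 ps over 10 pH.
    print(supply_noise(0.5, 0.02, 10e-12, 0.2, 100e-12))
    # Hypothetical 2 um wide wire carrying 4 mA on average, with a 1 mA/um limit.
    print(em_violation(0.004, 2.0, 1.0))
```

In a real flow both checks would be applied per node and per wire segment of the
extracted network rather than to a single lumped path.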
2. Modeling
To check the performance of PDN designs and analyze the electrical characteristics
of the network efficiently and accurately, a complete model of the power delivery
network is built, as presented in Figure 3. The PDN model consists of an off-chip
model and an on-chip model [2] [6] [7].
The oﬀ-chip model captures the decoupling capacitors and the parasitics of the
package and the PCB that reside between the on-chip PDN and the on-board voltage
regulator. A variety of distributed models, lumped models and macromodels can be
used for the PCB and the package. In this dissertation, a ladder RLC model [6], as
shown in Figure 3, is used. The on-board voltage regulator is modeled as an ideal
DC voltage source.
Fig. 3. A complete model of the power delivery network. (Diagram labels: DC
source, PCB, package, off-chip model, on-chip model, C4 bump, VDD grid, GND grid,
decap, switching circuit.)
Fig. 4. VDD grid model. (Diagram labels: C4 bump, decap, switching circuit.)
The on-chip power delivery network has the following major components: C4
bumps, a VDD grid, a GND grid and on-chip decoupling capacitors. The C4 bumps
that connect the power/ground grid with the off-chip network are modeled as RL pairs
[6]. The VDD and GND grids are purely resistive meshes as shown in Figure 4. The
decoupling capacitors reside between the VDD grid and the GND grid. The switching
circuits are replaced by linear current sources, also called the current loadings, that
mimic the current consumption of the circuits [7].
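The on-chip model described above can be exercised with a small static (DC) nodal
analysis. The sketch below is a minimal pure-Python illustration, not the solver
used in this dissertation: it builds the conductance matrix of an n x n resistive
VDD mesh, ties selected nodes to an ideal supply through a bump conductance, draws
load currents from selected nodes and solves for the IR drops by Gaussian
elimination. It assumes DC conditions, so the bump inductance acts as a short and
the decoupling capacitors as open circuits. The function name, parameters and
values are hypothetical.

```python
def solve_vdd_grid(n, g_wire, g_bump, bumps, loads, vdd=1.0):
    """Static nodal analysis of an n x n resistive VDD mesh.

    bumps: set of (row, col) nodes tied to the ideal supply through g_bump.
    loads: dict mapping (row, col) -> load current in amperes.
    Returns a dict mapping (row, col) -> node voltage in volts.
    At least one bump is required, otherwise the system is singular.
    """
    idx = {(r, c): r * n + c for r in range(n) for c in range(n)}
    size = n * n
    G = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size  # unknowns are IR drops x = vdd - v, so G x = I_load

    for (r, c), k in idx.items():
        # Stamp the wire to the node below and the node to the right.
        for nr, nc in ((r + 1, c), (r, c + 1)):
            if nr < n and nc < n:
                m = idx[(nr, nc)]
                G[k][k] += g_wire
                G[m][m] += g_wire
                G[k][m] -= g_wire
                G[m][k] -= g_wire
        if (r, c) in bumps:
            G[k][k] += g_bump  # path to the ideal supply
        rhs[k] = loads.get((r, c), 0.0)

    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda i: abs(G[i][col]))
        G[col], G[pivot] = G[pivot], G[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, size):
            f = G[row][col] / G[col][col]
            if f != 0.0:
                for j in range(col, size):
                    G[row][j] -= f * G[col][j]
                rhs[row] -= f * rhs[col]

    # Back substitution.
    x = [0.0] * size
    for row in reversed(range(size)):
        s = sum(G[row][j] * x[j] for j in range(row + 1, size))
        x[row] = (rhs[row] - s) / G[row][row]

    return {node: vdd - x[k] for node, k in idx.items()}
```

For example, `solve_vdd_grid(2, 1.0, 1.0, {(0, 0)}, {(1, 1): 0.1})` models a 2 x 2
mesh with one bump in a corner and a 0.1 A load in the opposite corner. Dense
elimination is only practical for toy grids; grids with millions of nodes need
sparse or partitioned solvers.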
3. Design Trends and Challenges
Due to the recent aggressive process scaling into the nanometer regime and the design
trend of pushing the high-performance vs. low-power envelope, the power delivery
network has been impacted in a number of ways, as presented in Figure 5:
• First, the circuit clock frequency has increased while the supply voltage
has been significantly scaled down, which leads to larger power supply noise
and a higher voltage drop as a percentage of the supply voltage.
• Second, with the increased gate density and the reduced chip area, the network
complexity becomes much larger and fewer metal wire resources are available for
the power delivery network.
• Finally, due to the significant increase of the leakage power and power density
in the nanometer regime, many fine-grain power management techniques have
been proposed, such as adding sleep transistors to cut the leakage power of
unused circuit blocks and employing multiple power supplies for circuit
blocks with different timing and power requirements.
All these impacts have raised more stringent and specific requirements on Electronic
Design Automation (EDA) tools for power delivery networks, as shown in Figure 5.
Fig. 5. Design trends and challenges for power delivery networks. (Process scaling and circuit design trends: clock frequency, supply voltage, gate density, chip area, leakage power, power density. Resulting effects: supply noise, voltage drop percentage, network complexity, metal wire resources, power gating, multiple voltage islands. Challenges: analysis of large networks with voltage regulators/converters; verification of multiple power on/off configurations; design through wire size optimization and system-level co-design.)
• From the perspective of PDN analysis, simulation techniques must be developed
that can analyze an on-chip power grid with multi-million nodes or more
efficiently and accurately. In addition, techniques capable of
tackling the entire power delivery network, including an off-chip model and integrated
sophisticated/nonlinear devices such as on-chip voltage regulators, are also
needed.
• For a gated PDN, turning gated grids on and off creates many power gating
configurations. Therefore, in terms of performance verification, a scheme needs
to be developed to verify whether the gated power delivery network works
properly under all possible on/off configurations.
• In terms of PDN design, on one hand, power grid wire sizes must be optimized
to save wiring resources for signal routing. On the other hand, because
traditional on-board voltage regulators with large inductors or capacitors are bulky, there
is significant interest in developing fully integrated on-chip voltage regulators
to facilitate fine-grain multiple power domains. Hence, systematic analysis
of the electric interactions between the passive network and active voltage
regulators/converters, detailed tradeoff analysis over different design specifications,
and a system-level co-design scheme that automatically optimizes key design
parameters for the entire power delivery network must be provided for PDNs
employing on-chip voltage regulation.
B. Survey of Previous Work
In the past several years, considerable effort and progress have been made on power
delivery network analysis, verification and design, with each approach making different tradeoffs.
1. Survey of PDN Analysis
Many approaches have been proposed to address the on-chip power grid analysis
problem [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21]. Among these,
ideas of employing Cholesky decomposition [9] [15], the preconditioned conjugate gradient
method [8], multigrid techniques [9] [10] [11] [12] [13] [14], random walks [18],
locality [19], the relaxation iterative method [20], and a Poisson solver optimized for GPU
platforms [21] have emerged. However, all these methods, called flat methods, can only
be applied to the power grid as a whole (non-partitioned). As a result,
when tackling modern power grid designs with many millions of nodes, these methods
may run out of memory or require prohibitive runtime. At the same time, some
of the algorithms are not parallelizable, so they cannot fully utilize the increasingly
available parallel computing resources to improve efficiency. To overcome these
limitations, the macromodeling method [15] [16] and the non-overlapping domain decomposition
method [17], called partitioning methods, have been proposed. These methods employ
a divide-and-conquer strategy that is realized through grid partitioning. In [15],
[16] and [17], the power grids are divided into several partitions or subdomains whose
electric properties are represented by the circuit responses on the ports or interface
nodes, obtained by applying matrix transformation and substitution to each partition or
subdomain. However, for the macromodeling method, the final system matrix for the
global grid (including the ports) is dense; and for the non-overlapping domain decomposition
method, the dominant time is spent on forming the Schur complements.
Therefore, although these two methods are naturally parallelizable, they may become
inefficient in runtime when the number of boundary nodes increases.
Moreover, for power delivery networks with active components, such as voltage
converters and regulators, existing power grid simulation tools cannot handle them
since they can only solve passive networks. On the other hand, if
traditional SPICE-like simulators are used, the multi-million-node complexity
of the passive network could easily cause the simulation to run out of memory or take days
to produce results.
2. Survey of PDN Veriﬁcation
PDN verification is an important but challenging task for chip designers. It can be
defined as verifying that, under all possible conditions, the PDN satisfies the given
electromigration (EM) and voltage drop specifications. Traditionally, considering the
variations of the current loadings (due to executing different instructions over time), several
works have sought a current loading distribution (current profile) that produces
the largest voltage drop [22] [23] [24] [25]. In these works, the loading currents
are treated as unknown variables, and the worst-case voltage drops are obtained
by formulating a formal optimization problem, which is then solved with existing
optimization techniques. Although sometimes limited by the capacity of the underlying
optimization packages, particularly for large power grid designs, these approaches
provide a valuable methodology for addressing current loading variation.
Currently, due to the wide adoption of power gating, it is equally
important to examine whether, for a PDN with a given set of current loadings,
the EM and voltage drop constraints are satisfied under all possible on/off configurations
and on/off transitions. In this kind of verification, called power gating
verification, the current waveforms are treated as known. In [26], a useful DC EM
analysis approach for the global grid is proposed which can efficiently compute exact
DC currents for all possible power gating configurations. However, in the more complicated
transient verification of gated power grid networks, a number of complications
arise, such as new transient noise behaviors, the handling of transient superposition
under multiple sleep transistors per local grid, the need to verify both the global and
local grids, and the handling of the decoupling sharing effect between local grids. In
addition, since the verification space consists of all possible
power gating configurations and transitions and is therefore very large, a brute-force
exhaustive enumeration over all possible conditions is impractical. For example, a multi-core design with 16
local power grids has 2^16 = 65,536 possible power gating configurations.
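To make the size of this verification space concrete, the following minimal sketch enumerates the stable on/off configurations of K independently gated local grids. The function name `gating_configurations` and the cost of one minute per transient simulation are assumptions made purely for illustration; they do not come from the cited works, and on/off transitions are not counted.

```python
from itertools import product

def gating_configurations(k):
    """Yield every on/off assignment (True = powered on) for k gated local grids."""
    return product((False, True), repeat=k)

K = 16                                   # local power grids, as in the example above
n_cfg = sum(1 for _ in gating_configurations(K))

MINUTES_PER_SIM = 1.0                    # assumed cost of one transient simulation
days = n_cfg * MINUTES_PER_SIM / (60 * 24)
print(f"{n_cfg} configurations, about {days:.1f} days of brute-force simulation")
```

Even under this optimistic assumed cost, exhaustive simulation takes weeks, and every additional gated grid doubles the count.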
3. Survey of PDN Design
Despite the progress made on analysis and verification, it is equally important to
address the design and optimization of such large networks, which is even more
challenging. In [27], a sequential linear programming based approach is proposed to
improve the efficiency of flat power grid wire size optimization. A multigrid-like heuristic
is proposed in [28] to reduce the complexity of large power grid optimization. In [29],
the macromodeling idea of [15] and [16] is adopted to facilitate partitioning-based
optimization. In principle, flat optimization is limited to small or medium-sized
power grids. It may be impractical for a complete large grid due to excessive runtime
and memory requirements. The multigrid-like heuristic improves the optimization efficiency
by operating on significantly simplified coarse grids. However, design constraints
may not be exactly satisfied during the approximation step [28]. The macromodeling
based approach provides the useful feature of incremental optimization, which allows
one partition to be optimized at a time [29]. However, partitioning
a mesh-like power grid structure creates a large number of interface nodes at
the partitioning boundaries, which produces large and dense macromodels that are expensive to
compute. Furthermore, the optimization of the entire grid requires a sequential optimization
of all partitions. It is important to note that the above approaches are not ready
for simultaneous optimization of multiple partitions and hence cannot be immediately
parallelized.
Moreover, a great amount of effort has been geared towards developing fully
integrated on-chip voltage regulators/converters. DC-DC converters are considered
to be power efficient even when the input-to-output voltage difference is large [30] [31].
Therefore, they are widely used for Dynamic Voltage Scaling (DVS). However, to fully
integrate DC-DC converters on the chip, designing an area-efficient inductor at
the converter output becomes the major obstacle. On the other hand, Low-Dropout
regulators (LDOs) are more amenable to on-chip integration due to their small area
overhead, low standby current, low dropout voltage, improved power efficiency and
superior transient response to fast load current variation [32] [33] [34] [35]. Fully
integrated LDOs are very attractive for regulating large and fast local voltage fluctuations
and for providing multiple levels of supply voltage [32] [33] [34] [35]. They can
be used as post-regulators following switching converters (with high power conversion
efficiency) to provide low-noise supply voltage while maintaining good overall power
efficiency [36] [37]. Separately, passive power delivery network design, for
example by means of decoupling capacitance insertion, has also been
studied by many researchers [28] [38]. However, so far these two threads of work are
disjoint. Voltage regulator/converter design is typically done in isolation with an
assumed simple capacitive load; existing passive PDN optimization work does not
consider active regulator/converter circuits. Little work has been geared towards
understanding the detailed electric characteristics of having multiple on-chip voltage
regulators operate inside a large power delivery network.
Fig. 6. Work towards scalable analysis, verification and design for IC power delivery. (Analysis: locality-driven parallel static analysis for on-chip power grids; GSim, a fast CPU-GPU combined simulation engine for PDNs with on-chip voltage regulation. Verification: simulation-based verification for gated PDNs using fast superposition approximation. Design: locality-driven parallel wire size optimization; system-level analysis and co-optimization for PDNs with on-chip voltage regulation.)
C. Proposed Solutions
In this dissertation, to overcome the limitations of existing work on power delivery
network analysis, verification and design, new methodologies and approaches are
proposed, as presented in Figure 6.
1. Proposed Solutions on PDN Analysis
For on-chip power grid static analysis, focusing on C4 flip-chip type power
grids, a novel parallel partitioning-based power grid simulation method is proposed.
In this approach, the power grid is divided into several partitions. For each partition,
the impact of the rest of the grid is modeled as the currents flowing into that partition.
Using these currents, called the boundary currents, each partition is solved
independently. Thus the circuit responses of the entire grid can be obtained in parallel. An
efficient and effective boundary current approximation scheme using the spatial locality
of flip-chip power grids is introduced to provide near-exact current values on the
boundary. This scheme only requires solving several small power grids to get
the approximations, and thus does not jeopardize performance when the number
of boundary nodes increases. Errors can be reduced quickly by using a block-based
iterative process which employs the same boundary current approximation scheme.
As a result, the proposed approach not only is naturally parallelizable
and able to tackle large power grids, but also achieves excellent runtime
efficiency and partitioning flexibility. In addition, by looking into the main factors
that affect the parallel performance and conducting extensive experimental studies,
detailed computational cost analysis and performance modeling are provided. We also
propose a strategy that helps users determine the optimal (or near-optimal)
values of some key parameters to achieve the lowest parallel runtime.
On the other hand, we address the significant challenge of simulating complex
PDNs with a large number of integrated LDOs at SPICE-level accuracy by developing
an integrated CPU-GPU analysis engine: GSim. Our engine achieves its efficiency
through circuit partitioning and the integration of linear iterative, linear direct and
nonlinear solvers running on the Graphics Processing Unit (GPU) and the Central Processing
Unit (CPU). These solvers are optimized for large on-chip power grids,
off-chip models and transistor-level LDO models, respectively.
2. Proposed Solutions on Power-Gated PDN Veriﬁcation
In this dissertation, focusing on power gating verification, we propose a practical
simulation-based approach that verifies the complete power delivery hierarchy (local
and global grids), under all possible stable power gating configurations and core
power-on noise injection, in terms of EM and voltage drops. The proposed approach
achieves efficient verification by using fast equivalent circuit modeling and superposition
methods for approximate and conservative identification of worst-case violations in the
large verification space. A few selective full simulations are carried out for validation.
3. Proposed Solutions on PDN Wire Sizing and On-Chip Voltage Regulation
Design
In terms of wire size optimization, focusing on C4 flip-chip type power grids,
we adopt the same basic partitioning philosophy to achieve scalability for large power
grid optimization, but via a rather different avenue. Although applying partitioning
seems natural for attacking large mesh-like circuits such as power grids,
its proper use in the context of constrained optimization is nontrivial.
Simply neglecting the coupling along the partitioning boundaries can easily lead to
a large number of IR drop and EM violations, preventing any effective partitioning-based
optimization. We address this challenge by taking a different route to
partitioning-based optimization. We first reformulate the original (flat) constrained
power grid optimization problem as a two-level optimization problem. This new
two-level hierarchical formulation is built upon the essential idea of partitioning-based
optimization and has an appealing form that seemingly enables divide-and-conquer-based
scalable optimization. Motivated by this hierarchical view, we develop practical
solutions to address the fundamental limitations of the hierarchical approach in terms
of convergence and efficiency, which lead to a locality-driven two-step optimization.
One key feature of the proposed approach is that it is fully parallelizable, since our
algorithm construction permits simultaneous sizing of an arbitrary selection of partitions,
including those that are adjacent to each other. This offers the important
ability to utilize the increasingly available parallel computing hardware to address the
challenges in power grid design. The performance of the proposed approach is largely
independent of the choice of partitioning boundaries and hence is not constrained by the
design hierarchy. As a result, the partitioning boundaries and the size and number of partitions
can be flexibly chosen to trade off runtime against memory requirements as well
as to facilitate load balancing in parallelization. It should also be noted that our power
grid optimization approach is general and does not depend on a specific choice of
underlying numerical optimization package.
Moreover, effort is devoted to understanding the benefits and detailed electric
characteristics of on-chip LDOs in the context of a large power delivery network.
An attempt is also made to link the regulator/converter design with passive
decoupling capacitance insertion, targeting the joint co-optimization
of active and passive components so as to achieve the best performance for the
entire PDN design. To achieve this goal, we first conduct systematic design analysis
to describe the analog characteristics of voltage regulators in a network context
and use it as the basis for understanding the interactions between active voltage
regulation and passive decoupling. The derived design insights are employed to
facilitate a system-level co-design in which key regulator/converter parameters, the
number of on-chip regulators, and the amount of decap inserted are considered as design
variables. To make the optimization of large PDNs feasible, we leverage a custom fast simulation
environment, a multi-level optimization strategy and design knowledge
to develop an automatic optimization flow. Using our optimization approach, we
demonstrate substantial benefits of system-level co-optimization that involves both active
voltage regulation/conversion optimization and passive power grid optimization. The
tradeoffs between different design specifications, such as area, power, placement and
routing, and noise, are presented. We also analyze the impact of decoupling technologies
(such as deep-trench decaps) on the design of power delivery.
CHAPTER II
POWER DELIVERY NETWORK ANALYSIS AND VERIFICATION ∗
As stated in Chapter I, PDN analysis faces the challenges of large network complexity
(large runtime and memory consumption) and the integration of nonlinear
devices such as on-chip voltage regulators with the passive network (existing power
grid solvers are not applicable and general-purpose SPICE is too slow), while the verification
of gated PDNs has a large verification space (exhaustive simulation-based
verification is impractical). To address these challenges, in this chapter, a novel parallel
partitioning-based static analysis approach for large on-chip power grids is first
presented. Then, a fast CPU-GPU combined simulation engine that can efficiently analyze
complex power delivery networks with sophisticated on-chip voltage regulators
is introduced. Finally, an efficient simulation-based verification methodology using
effective circuit modeling and a fast superposition approximation is described.
A. Locality-Driven Parallel Static Analysis for Power Delivery Networks
In this section, by employing the divide-and-conquer methodology and the locality
property of the ﬂip-chip type power grids, a partitioning-based parallel power grid
static analysis approach is proposed to reduce the excessive runtime and heavy work-
load caused by the large power delivery network complexity.
∗Part of this chapter is reprinted with permission from “Locality-driven parallel static
analysis for power delivery networks” by Z. Zeng, Z. Feng, P. Li and V. Sarin, 2011.
ACM Trans. on Design Automation of Electronic Systems, Vol. 16, pp. 28:1-28:17,
Copyright [2011] by ACM.
1. Background
Static analysis is the most fundamental analysis for power delivery networks.
It is widely used to detect potential EM failures (wires with large average currents)
and hot spots (nodes with large voltage drops). For static analysis, the PDN is
simply a resistive network, which can be further divided into two completely separate
grids by splitting the current sources [39]. For simplicity, only the VDD grid static
analysis is discussed in this dissertation. The GND grid can be solved in the same
way. Assume the power grid has $N$ nodes. Using Modified Nodal Analysis (MNA),
the system equation can be represented as [9] [15]
\[ GV = I, \tag{2.1} \]
where $G \in \mathbb{R}^{N \times N}$ is the conductance matrix, $V \in \mathbb{R}^{N \times 1}$ is the vector of node
voltages, and $I \in \mathbb{R}^{N \times 1}$ is the vector of current sources and voltage supplies. For
modern power grid designs, $N$ can be in the millions. Such a system can be solved by
direct methods, e.g., LU or Cholesky factorization [9] [15]. Other methods, such as the
preconditioned conjugate gradient method [8], multigrid techniques [9] [10] [11] [12]
[13] [14], the macromodeling method [15] [16], the non-overlapping domain decomposition
method [17], random walks [18], locality [19], the relaxation iterative method [20], and
a Poisson solver optimized for GPU platforms [21], have also been proposed to solve
(2.1).
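As a minimal sketch of how (2.1) can be assembled and solved by a direct method, the following Python code builds the conductance matrix of a small square VDD mesh and solves it with a Cholesky factorization. The grid size, wire and pad conductances, load current and pad pitch are illustrative assumptions, and each C4 pad is folded into $G$ and $I$ as a Norton equivalent (a pad conductance to an ideal supply), which simplifies the model used in this chapter. Dense NumPy arrays are used for brevity; a realistic grid would need sparse storage.

```python
import numpy as np

def build_grid(n, g_wire=1.0, g_pad=10.0, vdd=1.0, load=1e-3, pitch=4):
    """Return (G, I) of GV = I for an n x n resistive VDD mesh (toy model).

    Assumptions: every wire segment has conductance g_wire, every node draws
    `load` amps, and every pitch-th node in both directions is tied to VDD
    through a pad conductance g_pad.
    """
    N = n * n
    G = np.zeros((N, N))
    I = np.full(N, -load)                      # loads pull current out of each node
    for r in range(n):
        for c in range(n):
            i = r * n + c
            for rr, cc in ((r + 1, c), (r, c + 1)):   # neighbours below and to the right
                if rr < n and cc < n:
                    j = rr * n + cc
                    G[i, i] += g_wire
                    G[j, j] += g_wire
                    G[i, j] -= g_wire
                    G[j, i] -= g_wire
            if r % pitch == 0 and c % pitch == 0:     # C4 pad location
                G[i, i] += g_pad
                I[i] += g_pad * vdd
    return G, I

def solve_cholesky(G, I):
    """Solve GV = I through G = L L^T (G is symmetric positive definite)."""
    L = np.linalg.cholesky(G)
    y = np.linalg.solve(L, I)                  # forward substitution step
    return np.linalg.solve(L.T, y)             # backward substitution step

G, I = build_grid(8)
V = solve_cholesky(G, I)
print(f"worst IR drop: {(1.0 - V.min()) * 1e3:.3f} mV")
```

The factorization succeeds because the mesh is connected and has at least one pad, which makes $G$ symmetric positive definite in this toy model.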
2. Overview of the Proposed Approach
To apply the divide-and-conquer methodology to the power grid static analysis
in (2.1), the power grid $\Omega$, represented by the system matrix $G$, is divided
into $K$ partitions $\Omega_1, \ldots, \Omega_K$ (as shown in Figure 7, $K = 16$ in this case), and the
system equation can be expressed as
\[
GV =
\begin{bmatrix}
G_1 & G_{12}^T & \cdots & G_{1K}^T \\
G_{12} & G_2 & \cdots & G_{2K}^T \\
\vdots & \vdots & \ddots & \vdots \\
G_{1K} & G_{2K} & \cdots & G_K
\end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_K \end{bmatrix}
=
\begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_K \end{bmatrix},
\tag{2.2}
\]
where $G_1, \ldots, G_K$ are the conductance matrices for partitions $\Omega_1, \ldots, \Omega_K$. $G_{ij}$
represents the connections between partition $\Omega_i$ and partition $\Omega_j$ ($i, j = 1, \ldots, K$; $i \neq j$).
$V_i$ and $I_i$ are the node voltage vector and current loading vector for partition $\Omega_i$
($i = 1, \ldots, K$).
Fig. 7. Power grid partitions and partition boundaries (partitions Ω1 to Ω16, with boundary currents IBa, IBb and IBc marked on one partition boundary).
By moving the off-diagonal terms to the right-hand side, (2.2) becomes
\[
\begin{bmatrix}
G_1 & & & \\
 & G_2 & & \\
 & & \ddots & \\
 & & & G_K
\end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_K \end{bmatrix}
=
\begin{bmatrix}
I_1 - G_{12}^T V_2 - \cdots - G_{1K}^T V_K \\
I_2 - G_{12} V_1 - \cdots - G_{2K}^T V_K \\
\vdots \\
I_K - G_{1K} V_1 - \cdots - G_{(K-1)K} V_{K-1}
\end{bmatrix}
=
\begin{bmatrix} I_1 - I_{B1} \\ I_2 - I_{B2} \\ \vdots \\ I_K - I_{BK} \end{bmatrix},
\tag{2.3}
\]
where $I_{Bi} = \sum_{k=1}^{i-1} G_{ki} V_k + \sum_{k=i+1}^{K} G_{ik}^T V_k$, $i = 1, \ldots, K$. $I_{Bi}$ is the vector of currents
flowing from $\Omega_i$ to other partitions, called the boundary current. As shown in Figure 7,
the currents $I_{Ba}$, $I_{Bb}$, and $I_{Bc}$ are the boundary currents for $\Omega_7$ ($-I_{Ba}$, $-I_{Bb}$, and $-I_{Bc}$
are the boundary currents for $\Omega_8$). Without explicit boundary currents, the partition
simulations are tightly coupled with each other and are difficult
to parallelize. However, once the boundary currents are obtained, each $V_i$ can
be calculated by solving $G_i V_i = I_i - I_{Bi}$ independently, which leads to straightforward
parallelization. The methods in [15], [16] and [17] spend a significant amount
of computation to obtain the exact boundary currents through dense matrix factorization
and Schur complement formation. Thus their runtime efficiency is limited
by expensive boundary current calculations.
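The decoupling expressed by (2.3) can be checked numerically on a toy circuit. The sketch below is an illustration under assumed values (a six-node resistor chain split into two partitions, with pads in Norton form); the helper name `boundary_current` is hypothetical. It computes the exact $I_{Bi}$ from a flat solve of (2.1), then solves each partition on its own and compares the result with the flat solution.

```python
import numpy as np

# Toy 6-node resistor chain (all values assumed): 1 S wires, a 2 S pad to a
# 1 V supply at each end (Norton form), and a 10 mA load on every node.
N = 6
G = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
G[0, 0] = G[-1, -1] = 1.0 + 2.0          # end nodes: one wire plus one pad
I = np.full(N, -0.01)
I[[0, -1]] += 2.0 * 1.0                  # pad current sources g_pad * VDD

parts = [np.arange(0, 3), np.arange(3, 6)]   # two partitions of three nodes
V = np.linalg.solve(G, I)                    # reference: flat solve of (2.1)

def boundary_current(G, V, parts, i):
    """I_Bi of (2.3): off-diagonal blocks of block-row i times the other V_k."""
    rows = parts[i]
    others = np.concatenate([p for k, p in enumerate(parts) if k != i])
    return G[np.ix_(rows, others)] @ V[others]

V_parts = []
for i, rows in enumerate(parts):
    G_i = G[np.ix_(rows, rows)]              # diagonal block of partition i
    I_Bi = boundary_current(G, V, parts, i)
    V_parts.append(np.linalg.solve(G_i, I[rows] - I_Bi))   # G_i V_i = I_i - I_Bi

mismatch = max(np.abs(V_i - V[rows]).max() for V_i, rows in zip(V_parts, parts))
print(f"largest difference from the flat solution: {mismatch:.2e}")
```

The two partition solves never exchange data once the boundary currents are known, which is what makes the formulation attractive for parallel execution; here the exact currents come from the flat solve, so the sketch only verifies the algebra.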
We address the limitations of the existing partitioning-based simulation methods,
namely runtime efficiency and partitioning flexibility, by adopting a novel approach that provides
near-exact approximations to the boundary currents. For clarity, assuming the power
grid is partitioned in the way illustrated in Figure 7, the flow of the proposed approach
is as follows.
• Step 1: Obtain near-exact approximate boundary currents $I^*_{B1}, \ldots, I^*_{BK}$ in
parallel.
• Step 2: Use $I^*_{B1}, \ldots, I^*_{BK}$ to solve (2.3) in parallel.
• Step 3: Compute the residues to form the new right-hand side of the matrix
equation for the full grid. Repeat step 1, step 2 and step 3 until convergence
is reached.
In step 1, called the boundary current approximation step, it would be ideal
to find the exact boundary currents in the first place, but they are only available after
the full system, or at least the system consisting of all the boundary nodes, is solved.
However, by exploiting the strong locality behavior of C4 flip-chip-type power grids
[19], it can be shown that near-exact approximations can be efficiently obtained
without solving large complex systems. According to the same locality property, the $I^*_{Bi}$
can be computed by solving a set of uncoupled local grid simulation problems, leading
to an immediate parallel implementation, as detailed in Section II.A.3.a. In step 2
(called the partition simulation step), the partition grids are analyzed in parallel,
since they are shielded from the rest of the system by the boundary settings that use
the approximate boundary currents $I^*_{Bi}$, as illustrated in Section II.A.3.b. Errors
occur due to the inaccuracy of the $I^*_{Bi}$. In step 3 (called the error reduction step),
those errors are computed and fed back to the original grid to solve for the correction
components. Instead of using the block Jacobi iteration process in [40], the analysis
schemes of step 1 and step 2 are employed to correct the circuit responses. It can be
shown that including step 1 in this error reduction step to update the boundary
information significantly improves convergence.
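The three steps can be sketched on a toy problem as follows. This is an illustration under stated assumptions, not the implementation used in this work: the grid is a 1-D resistor chain rather than a mesh, pads are modeled in Norton form, all numerical values and the helper names `chain` and `one_pass` are hypothetical, a window is truncated by simply leaving out everything outside it, and the window and partition solves run in plain sequential loops even though they are mutually independent. The approximate boundary currents follow the algebraic definition of $I_{Bi}$ given below (2.3), with the neighbouring voltages replaced by window estimates.

```python
import numpy as np

G_WIRE, G_PAD, VDD, LOAD, PITCH = 1.0, 2.0, 1.0, 1e-3, 4   # assumed toy values

def chain(lo, hi):
    """(G, I) of the circuit made only of nodes lo..hi-1 of a 1-D resistor chain."""
    n = hi - lo
    G = G_WIRE * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    G[0, 0] -= G_WIRE                       # the two end nodes touch one wire only
    G[-1, -1] -= G_WIRE
    I = np.full(n, -LOAD)                   # every node draws LOAD amps
    pads = np.flatnonzero(np.arange(lo, hi) % PITCH == 0)
    G[pads, pads] += G_PAD                  # C4 pads in Norton form
    I[pads] += G_PAD * VDD
    return G, I

def one_pass(G, rhs, bounds, s):
    """Steps 1 and 2 for right-hand side rhs; returns the approximate solution."""
    N = len(rhs)
    v_win = np.zeros(N)                     # window estimates at boundary nodes
    for b in bounds[1:-1]:                  # step 1: one window per boundary
        lo, hi = max(0, b - s), min(N, b + s)
        G_w, _ = chain(lo, hi)
        x = np.linalg.solve(G_w, rhs[lo:hi])
        v_win[b - 1], v_win[b] = x[b - 1 - lo], x[b - lo]
    v = np.zeros(N)
    for lo, hi in zip(bounds[:-1], bounds[1:]):   # step 2: independent partitions
        G_i = G[lo:hi, lo:hi]
        I_B = G[lo:hi, :] @ v_win - G_i @ v_win[lo:hi]   # off-diagonal blocks only
        v[lo:hi] = np.linalg.solve(G_i, rhs[lo:hi] - I_B)
    return v

N, K, s = 64, 4, 8
bounds = [k * N // K for k in range(K + 1)]       # partition edges 0, 16, 32, 48, 64
G, I = chain(0, N)
V_exact = np.linalg.solve(G, I)                   # flat reference solution

V = np.zeros(N)
passes = 0
for _ in range(50):
    residue = I - G @ V                           # step 3: residue of the full grid
    if np.abs(residue).max() <= 1e-10 * np.abs(I).max():
        break
    V += one_pass(G, residue, bounds, s)          # steps 1 and 2 applied to the residue
    passes += 1

print(f"{passes} passes, max error {np.abs(V - V_exact).max():.2e} V")
```

Because step 1 is repeated on every residue, the boundary information is refreshed in each pass instead of being fixed after the first one, which is the design choice the paragraph above argues for.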
In summary, the convergence and runtime efficiency of the proposed approach are
achieved by finding near-exact boundary currents efficiently. In this dissertation, we
restrict our discussion to using the Cholesky factorization method for each grid solve.
However, it should be noted that in principle any power grid analysis method can be
applied to carry out the boundary current approximations and partition simulations.
Hence, the proposed approach is generic and formulated purely based upon the nature
of the application.
3. Parallel Locality-Driven Static Analysis
In this section, we first describe in detail the parallel boundary current approximation, the
parallel partition simulation, and the block-based iterative error reduction scheme
using spatial locality. Then, the overall flow of the proposed method is
presented. Finally, a computational cost analysis is provided to identify the key factors
that affect the solution process, and parallel performance modeling is carried out to
determine the optimal (or near-optimal) values of the key parameters for the lowest
runtime.
a. Parallel Boundary Current Approximation Using Locality
In modern chip designs, C4 flip-chip packaging technology is commonly used, which
provides a large number of VDD/VGND connections spread evenly across the on-chip
power delivery network. In a local region, due to the existence of many C4 bumps,
the majority of currents are supplied through the low-impedance paths from/to the
nearby VDD/VGND pads. Hence, the local voltage responses are largely dependent
on the VDD/VGND connections, wire resistances, and current loadings in the neighborhood,
exhibiting strong spatial locality. For example, in Figure 8, a power grid is
divided into nine partitions, and node A is at the center of partition 5. The
voltage response at A is primarily determined by the C4 bumps (red dots), current loadings,
and wire resistances in the same partition, and is less influenced by the elements
in other partitions. When partition 5 is large enough that the impacts
of other partitions are negligible, the circuit response at node A obtained by
simulating only partition 5 (with the circuit elements in other partitions neglected) would
be close to the exact circuit response obtained by analyzing the entire power grid.
An overlapping power grid shell-based partitioning method has employed this spatial
locality to accelerate power grid analysis and has been shown to be effective
for solving industrial power grid designs [19].
Fig. 8. Spatial locality in power grids. Red dots represent C4 connections.
In this work, the spatial locality is employed to find near-exact boundary
currents for individual partition simulations. As shown in Figure 9, the basic idea
is to introduce a window that encloses each partition boundary (the black dashed line)
at the window center. The window is made large enough to include a
ring of C4 bumps around the partition boundary. Then, we focus only on
the circuit elements in the window and neglect all circuit elements outside
the window. After solving the truncated circuit in that window, the currents on
the partition boundary are retained as the approximate boundary currents. Usually,
near-exact boundary currents can be obtained when the window is made
large enough to include a sufficient number of C4 bumps such that the grid outside
the window has negligible influence on the circuit responses of the boundary nodes.
As expected, by using these near-exact boundary currents, the errors introduced in
the partition simulations are small, so only a few iterative correction
steps are required to reduce them to an acceptable level. Since the approximation for each set of
boundary currents is determined by solving an independent partial grid problem, the
entire procedure can easily be parallelized.
Fig. 9. Finding the near-exact boundary currents via locality. Red dots represent C4
connections. Black dashed lines represent partition boundaries. Blue dashed lines
represent window boundaries.
The window size is defined as the number of nodes between the partition boundary
and the window boundary, as shown in Figure 9. Assume the partition boundary
has $l$ nodes and the window size is $s$; then the window encloses $2s \times (l + 2s)$ nodes.
Since the approximate boundary current values are largely dependent on the number
of C4 connections in the window, we introduce another term, the C4 ring size, to describe
the size of the window. The C4 ring size is the number of C4 connections
between the partition boundary and the window boundary. For example, in Figure 10,
the C4 ring sizes for the three windows are 1, 2 and 3, respectively. A 1-million-node
power grid with C4 flip-chip packaging is used as an example to illustrate the spatial
locality. The C4 bumps are evenly distributed in the grid, 40 nodes apart
from each other, as shown in Figure 10. The power grid is divided into four equally
sized partitions, and one boundary is examined as an example. The IR drops and branch
currents of a boundary node obtained via the boundary window simulation using
different window sizes are shown in Table I, and they are compared with the exact
IR drop and current solutions. The average IR drops and branch currents for the
same boundary are also examined. Note the quick convergence of the results:
convergence is reached at a window size of 120, corresponding to a C4 ring size
of 3.
Fig. 10. Impact of window size on boundary current approximation (window sizes 40, 80 and 120 correspond to C4 ring sizes 1, 2 and 3). Red dots represent
C4 connections. Black dashed lines represent partition boundaries. Blue
dashed lines represent window boundaries.
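The effect of the window size can be reproduced qualitatively on a toy problem. The sketch below is only an illustration under assumed values: it uses a 1-D resistor chain with a pad every four nodes instead of the 1-million-node mesh, so its numbers are not those of Table I, and the "ring" count is defined by analogy as the window size divided by the pad pitch. The helper name `chain` is hypothetical.

```python
import numpy as np

G_WIRE, G_PAD, VDD, LOAD, PITCH = 1.0, 2.0, 1.0, 1e-3, 4   # assumed toy values

def chain(lo, hi):
    """(G, I) of the circuit made only of nodes lo..hi-1 of a 1-D resistor chain."""
    n = hi - lo
    G = G_WIRE * (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1))
    G[0, 0] -= G_WIRE                       # the two end nodes touch one wire only
    G[-1, -1] -= G_WIRE
    I = np.full(n, -LOAD)                   # every node draws LOAD amps
    pads = np.flatnonzero(np.arange(lo, hi) % PITCH == 0)
    G[pads, pads] += G_PAD                  # C4 pads in Norton form
    I[pads] += G_PAD * VDD
    return G, I

N, b = 64, 32                               # the boundary lies between nodes b-1 and b
G, I = chain(0, N)
V = np.linalg.solve(G, I)
exact = (VDD - V[b], G_WIRE * (V[b - 1] - V[b]))   # (IR drop, branch current)

results = []
for ring in (1, 2, 3):
    s = ring * PITCH                        # window size in nodes on each side
    lo = b - s
    G_w, I_w = chain(lo, b + s)             # truncated circuit: nothing outside the window
    V_w = np.linalg.solve(G_w, I_w)
    drop = VDD - V_w[b - lo]
    current = G_WIRE * (V_w[b - 1 - lo] - V_w[b - lo])
    results.append((s, ring, drop, current))

for s, ring, drop, current in results:
    print(f"window {s:2d}  ring {ring}  drop {drop * 1e3:.4f} mV  current {current * 1e3:.4f} mA")
print(f"exact              drop {exact[0] * 1e3:.4f} mV  current {exact[1] * 1e3:.4f} mA")
```

Each additional ring of pads screens the window interior further from the truncated exterior, so the window estimate of the boundary node's IR drop should move towards the flat solution as the window grows.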
Table I. Comparison of boundary IR drops and currents obtained in boundary window simulations using different window
sizes with the exact solutions. Voltage drops are in mV; currents are in mA.
Window Size   C4 Ring Size   Node Drop   Node Current   Avg. Drop   Avg. Current
40            1              55.693      0.9712         60.850      0.470
80            2              51.366      1.195          54.287      0.513
120           3              50.474      1.245          53.195      0.490
Exact         4              50.469      1.278          53.280      0.490
b. Parallel Partition Simulation Using Approximated Boundary Currents
As presented in Section II.A.3.a, the approximate boundary currents $I^*_{Bi}$ are computed
in step 1 of the proposed approach. In step 2, each power grid partition is
solved using $I^*_{Bi}$, which is added to the boundary nodes as extra current loadings.
Using the same notation as in Section II.A.3.a, the system equation for the
partition simulation can be expressed as
\[
\begin{bmatrix}
G_1 & & & \\
 & G_2 & & \\
 & & \ddots & \\
 & & & G_K
\end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_K \end{bmatrix}
=
\begin{bmatrix} I_1 - I^*_{B1} \\ I_2 - I^*_{B2} \\ \vdots \\ I_K - I^*_{BK} \end{bmatrix},
\tag{2.4}
\]
where $G_1, \ldots, G_K$ are the conductance matrices of the partitions $\Omega_1, \ldots, \Omega_K$;
$V_1, \ldots, V_K$ are the internal partition node voltage vectors; $I_1, \ldots, I_K$ are the internal
partition current loading vectors; and $I^*_{B1}, \ldots, I^*_{BK}$ are the approximate partition
boundary currents.
In Figure 11, the partitioning of a grid along a vertical line is illustrated as an
example. Starting from the initial grid, the approximated boundary currents
($I^*_{Bi}$) are computed, as shown on the left of the figure. The grid is then split
into two partitions. For each partition, the boundary currents are set to $I^*_{Bi}$
by attaching ideal current sources to the corresponding boundary nodes, as shown in
the middle of the figure, and the partition is simulated with these additional
current sources on its boundary. Finally, the node voltages of the partitions are
assembled to form the complete solution, as shown on the right of the figure.
Because the approximated boundary currents are generally not exact, the KCL equations
at the boundary nodes may not be satisfied once the partition results (node voltages)
are assembled. These residues on the boundary nodes induce errors at the internal
nodes of each partition and hence global errors throughout the grid. Since the
boundary currents shield each partition from the rest of the system, the partition
simulations are not coupled with one another (there are no off-diagonal matrix blocks
in (2.4)) and can therefore be easily parallelized.
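The role of the boundary currents in (2.4) can be illustrated with a minimal numerical
sketch. It assumes a hypothetical eight-node resistor chain with supply pads at both
ends instead of a real power grid, and the names `chain_matrix`, `solve_partitions`,
and all element values are illustrative only. With the exact boundary current the two
independent partition solves reproduce the flat solution, while a perturbed boundary
current leaves an error of the kind that the later error reduction has to remove.

```python
import numpy as np

def chain_matrix(n, g, pad_nodes, g_pad):
    """Conductance matrix of an n-node resistor chain with supply pads."""
    G = np.zeros((n, n))
    for i in range(n - 1):          # branch between node i and node i + 1
        G[i, i] += g
        G[i + 1, i + 1] += g
        G[i, i + 1] -= g
        G[i + 1, i] -= g
    for node in pad_nodes:          # pad conductance to the ideal supply
        G[node, node] += g_pad
    return G

n, g, g_pad, cut = 8, 1.0, 2.0, 4   # the cut removes the branch between nodes 3 and 4
G = chain_matrix(n, g, [0, n - 1], g_pad)
I = np.linspace(0.05, 0.2, n)       # load currents; V holds the IR drops
V_exact = np.linalg.solve(G, I)     # flat solution used as the reference

# Partition matrices G_1 and G_2: the two sub-chains without the cut branch.
G_left = chain_matrix(cut, g, [0], g_pad)
G_right = chain_matrix(n - cut, g, [n - cut - 1], g_pad)

def solve_partitions(i_b):
    """Solve (2.4) for two partitions; i_b is the current leaving the left
    partition through the cut branch, so I*_B1 = i_b and I*_B2 = -i_b."""
    rhs_left, rhs_right = I[:cut].copy(), I[cut:].copy()
    rhs_left[-1] -= i_b             # I_1 - I*_B1 at the left boundary node
    rhs_right[0] += i_b             # I_2 - I*_B2 at the right boundary node
    return np.concatenate([np.linalg.solve(G_left, rhs_left),
                           np.linalg.solve(G_right, rhs_right)])

i_b_exact = g * (V_exact[cut - 1] - V_exact[cut])
err_exact = np.max(np.abs(solve_partitions(i_b_exact) - V_exact))
err_rough = np.max(np.abs(solve_partitions(i_b_exact + 0.05) - V_exact))
print(f"error with exact boundary current:     {err_exact:.2e}")
print(f"error with perturbed boundary current: {err_rough:.2e}")
```

The two calls to `np.linalg.solve` inside `solve_partitions` share no data, which is
the property that allows the partitions to be dispatched to separate processors.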
Fig. 11. Power grid partition simulation by setting boundary currents.
c. Block-Based Iterative Error Reduction
As stated in the previous sections, inaccurate boundary currents cause errors in the
solution of the entire system. Although a larger window size can always be chosen to
obtain more accurate boundary currents, it lengthens the simulation time and consumes
more computing resources such as memory. Therefore, an effective and efficient error
reduction scheme is indispensable to keep the errors at a reasonable level while
maintaining runtime efficiency.
A block-based iterative scheme using locality is proposed to reduce errors. After the
full system solution is obtained, the residues are computed and applied as the new
current loadings of the original grid to form a correction grid. Unlike the
traditional block Jacobi iteration scheme [40], which always sets the boundary
currents to zero in the error correction grid, the proposed process employs the
boundary current approximation scheme in step 1, which provides a large coupling
region, to obtain more accurate boundary currents and thus reduce the errors more
rapidly. The correction components of the node voltages are then computed in step 2.
These newly obtained correction components are added to the previous solution, and
the residues are computed again for the next error reduction iteration, until
convergence is reached.
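The system matrix equation for the $k$th iteration, the residue expression after the
$(k-1)$th iteration, and the updated node voltages after the $k$th iteration are
expressed as
\[
G V^k_{res} = I^k_{res} \;\Longrightarrow\;
\begin{bmatrix}
G_1 & G_{12}^T & \cdots & G_{1K}^T \\
G_{12} & G_2 & \cdots & G_{2K}^T \\
\vdots & \vdots & \ddots & \vdots \\
G_{1K} & G_{2K} & \cdots & G_K
\end{bmatrix}
\begin{bmatrix} V^k_{1,res} \\ V^k_{2,res} \\ \vdots \\ V^k_{K,res} \end{bmatrix}
=
\begin{bmatrix} I^k_{1,res} \\ I^k_{2,res} \\ \vdots \\ I^k_{K,res} \end{bmatrix},
\tag{2.5}
\]
\[
I^k_{res} =
\begin{bmatrix} I^k_{1,res} \\ I^k_{2,res} \\ \vdots \\ I^k_{K,res} \end{bmatrix}
=
\begin{bmatrix}
I_1 - G_1 V^{k-1}_1 - \cdots - G_{1K}^T V^{k-1}_K \\
I_2 - G_{12} V^{k-1}_1 - \cdots - G_{2K}^T V^{k-1}_K \\
\vdots \\
I_K - G_{1K} V^{k-1}_1 - \cdots - G_K V^{k-1}_K
\end{bmatrix},
\tag{2.6}
\]
\[
V^k =
\begin{bmatrix} V^k_1 \\ V^k_2 \\ \vdots \\ V^k_K \end{bmatrix}
=
\begin{bmatrix}
V^{k-1}_1 + V^k_{1,res} \\
V^{k-1}_2 + V^k_{2,res} \\
\vdots \\
V^{k-1}_K + V^k_{K,res}
\end{bmatrix}.
\tag{2.7}
\]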
Note that $V^0 = 0$ and $I^0_{res} = I$. The circuit conductance matrices of the
boundary windows and partitions remain the same throughout the iterations; only the
right-hand sides of the system equations are updated. Therefore, these matrices can
be formulated in advance and stored locally for fast re-solving at each iteration.
From the viewpoint of classic iterative methods, step 1 and step 2 form a block
Gauss-Seidel iteration: step 1 updates the solutions for the boundary currents, and
step 2 updates the solutions for the internal nodes. Therefore, the convergence of
this iterative error reduction scheme can be guaranteed.
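The residue and update expressions (2.6) and (2.7) can be sketched as a short
correction loop. The sketch below is deliberately limited to the block Jacobi special
case, in which every correction solve uses zero boundary currents (no boundary
windows), so it is not the windowed scheme proposed here. The eight-node resistor
chain and the names `chain_matrix` and `block_jacobi_refine` are assumptions made for
illustration.

```python
import numpy as np

def chain_matrix(n, g=1.0, g_pad=2.0):
    """Conductance matrix of an n-node resistor chain with pads at both ends."""
    G = np.zeros((n, n))
    for i in range(n - 1):
        G[i, i] += g
        G[i + 1, i + 1] += g
        G[i, i + 1] -= g
        G[i + 1, i] -= g
    G[0, 0] += g_pad
    G[-1, -1] += g_pad
    return G

def block_jacobi_refine(G, I, parts, tol=1e-10, max_iter=500):
    """Residue-correction loop in the spirit of (2.5)-(2.7).

    Every correction solve uses zero boundary currents, i.e. only the
    diagonal blocks G_i (the block Jacobi case, not the windowed scheme).
    """
    V = np.zeros_like(I)                                   # V^0 = 0
    # The blocks never change, so they are inverted once and reused; a stored
    # sparse Cholesky factor would play this role at realistic sizes.
    inv_blocks = [np.linalg.inv(G[np.ix_(p, p)]) for p in parts]
    for k in range(max_iter):
        residue = I - G @ V                                # as in (2.6)
        if np.linalg.norm(residue) <= tol:
            return V, k
        for p, G_inv in zip(parts, inv_blocks):
            V[p] += G_inv @ residue[p]                     # as in (2.7)
    return V, max_iter

G = chain_matrix(8)
I = np.linspace(0.05, 0.2, 8)
parts = [np.arange(0, 4), np.arange(4, 8)]
V, iterations = block_jacobi_refine(G, I, parts)
print(iterations, np.max(np.abs(V - np.linalg.solve(G, I))))
```

Only the right-hand side changes between iterations, which is why the block inverses
are computed once outside the loop. Replacing the zero boundary currents with
window-based approximations is what distinguishes the proposed scheme from this
baseline.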
d. Algorithm Flow and Implementation
Finally, we summarize the entire flow of the proposed approach in Algorithm 1. Assume
the original power grid $\Omega$ is divided into $K$ partitions
$\Omega_1, \ldots, \Omega_K$ with $L$ partition boundaries $B_1, \ldots, B_L$, and
that $L$ boundary windows $\Omega_{W1}, \ldots, \Omega_{WL}$ are created with window
size $s$. The conductance matrices of the partitions are $G_1, \ldots, G_K$, and
those of the boundary windows are $G_{W1}, \ldots, G_{WL}$. The error tolerance
$\varepsilon$ is used to check convergence. The algorithm flow is presented in
Figure 12.
Algorithm 1 Parallel Locality-Driven Method
Input: Partition conductance matrices $G_1, \ldots, G_K$; window conductance matrices
$G_{W1}, \ldots, G_{WL}$; partition current loadings $I_1, \ldots, I_K$; window
current loadings $I_{W1}, \ldots, I_{WL}$; error tolerance $\varepsilon$.
Output: Node voltage $V$.
 1: $k = 0$;
 2: $V^0 \leftarrow 0$, $I^0_{res} \leftarrow I$;
 3: while $\|I^k_{res}\|_2 > \varepsilon$ do
 4:   for $i \leftarrow 1$ to $L$ par do
 5:     Solve $G_{Wi} V^k_{Wi,res} = I^k_{Wi,res}$;
 6:   end for
 7:   Store $I^{*k}_{B1,res}, \ldots, I^{*k}_{BK,res}$;
 8:   for $i \leftarrow 1$ to $K$ par do
 9:     Solve $G_i V^k_{i,res} = I^k_{i,res} - I^{*k}_{Bi,res}$;
10:     $V^k_i \leftarrow V^{k-1}_i + V^k_{i,res}$;
11:     $I^{k+1}_{i,res} \leftarrow I_i - \left( \sum_{j=1}^{i-1} G_{ji} V^k_j + G_i V^k_i + \sum_{j=i+1}^{K} G^T_{ij} V^k_j \right)$;
12:   end for
13:   $k \leftarrow k + 1$;
14: end while
15: $V \leftarrow \{V^{k+1}_1, \ldots, V^{k+1}_K\}$;
Fig. 12. Algorithm flow of the proposed parallel locality-driven method.
The divide-and-conquer strategy of this approach allows it to take advantage of
increasingly available parallel computing resources such as multicore machines and
distributed computing networks. For multimillion-node grids (fewer than 10 million
nodes), which can be stored in the shared memory of a single multicore machine, a
significant runtime improvement can be obtained by running the proposed approach in
parallel. For many-million-node grids (over 10 million nodes), which cannot fit into
the memory of a single machine, it is preferable to use a distributed computing
network. The original grid is first divided into window grids and partition grids,
each small enough to be solved efficiently on a single machine. These grids are then
stored and analyzed locally on different machines of the distributed network. The
only communication through the network is to feed the approximated boundary currents
to the partition simulations and to update the voltages of adjacent boundary nodes
for residue computation. Although we have not implemented the proposed approach on a
distributed computing network, its performance is expected to be promising when
handling many-million-node grids.
e. Computational Cost Analysis
In this section, we present the computational cost analysis for the proposed approach
and the trade-oﬀs between total runtime, the number of partitions, and the window
size.
Assume the power grid $\Omega$ (with $N$ nodes) is divided into $K$ partitions
$\Omega_1, \ldots, \Omega_K$ and $L$ boundary windows
$\Omega_{W1}, \ldots, \Omega_{WL}$. The numbers of nodes in the partitions and
windows are $n_1, \ldots, n_K$ and $m_1, \ldots, m_L$, respectively. Suppose the cost
of Cholesky factorization for a sparse $n \times n$ matrix is $C_1(n)$, the cost of
solving the system with the Cholesky factor is $C_2(n)$, and the cost of matrix
formulation is $C_3(n)$. If $d$ iterations are carried out, the overall cost of the
sequential implementation of the algorithm is
\[
C_s = \sum_{i=1}^{K} C_1(n_i) + \sum_{i=1}^{L} C_1(m_i)
+ d \left( \sum_{i=1}^{K} C_2(n_i) + \sum_{i=1}^{L} C_2(m_i) \right)
+ \sum_{i=1}^{K} C_3(n_i) + \sum_{i=1}^{L} C_3(m_i),
\tag{2.8}
\]
where $C_s$ is the overall sequential cost; $N = \sum_{i=1}^{K} n_i$;
$\sum_{i=1}^{K} C_1(n_i)$ is the Cholesky factorization time for the partitions;
$\sum_{i=1}^{L} C_1(m_i)$ is the Cholesky factorization time for the windows;
$\sum_{i=1}^{K} C_2(n_i)$ is the triangular solve time for the partitions;
$\sum_{i=1}^{L} C_2(m_i)$ is the triangular solve time for the windows;
$\sum_{i=1}^{K} C_3(n_i)$ is the matrix formulation time for the partitions; and
$\sum_{i=1}^{L} C_3(m_i)$ is the matrix formulation time for the windows. If no
partitioning is applied to the power grid, the cost of solving the entire grid is
$C_1(N) + C_2(N) + C_3(N)$. For a sparse matrix arising from a grid with $n$ nodes,
it is known that $C_1(n) = O(n^{1.5})$, $C_2(n) = O(n \log n)$, and $C_3(n) = O(n)$
[41], so the factorization and solve costs are superlinear in $n$. For extremely
large power grids, with a careful selection of the window size, the number of
partitions, and the number of iterations, the sequential implementation of the
proposed approach can be much faster than the flat simulation.
One of the most promising features of this method is that the power grid can be
simulated in parallel, as illustrated in Algorithm 1. For the parallel
implementation, assuming that enough processors are available for concurrent
execution and that the parallelization overhead (such as communication cost and
system overhead) is negligible, the computational cost can be estimated by
\[
C_p = \max_{1 \le i \le K} \{C_1(n_i)\} + \max_{1 \le i \le L} \{C_1(m_i)\}
+ d \left( \max_{1 \le i \le K} \{C_2(n_i)\} + \max_{1 \le i \le L} \{C_2(m_i)\} \right)
+ \max_{1 \le i \le K} \{C_3(n_i)\} + \max_{1 \le i \le L} \{C_3(m_i)\},
\tag{2.9}
\]
where $C_p$ is the overall cost of the parallel simulation. When the partitions or
windows have different sizes, the workloads of the processors vary, which causes a
load imbalance problem. In this case, the largest partition or window determines the
overall runtime. Therefore, to improve the parallel efficiency, it is desirable for
the partitions, and likewise the windows, to have similar sizes. Note that the
parallel time can be lowered by allowing processors to proceed to the next step as
soon as their work is finished, as long as the computation remains error-free.
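The effect of load imbalance on (2.9) can be made concrete with a small cost-model
sketch. It assumes unit constants in the complexity expressions, that is
$C_1(n) = n^{1.5}$, $C_2(n) = n \log n$, and $C_3(n) = n$, so the numbers are
relative costs rather than runtimes. The partition and window sizes below are
hypothetical.

```python
import math

def c1(n):
    return n ** 1.5            # Cholesky factorization, O(n^1.5)

def c2(n):
    return n * math.log(n)     # triangular solves, O(n log n)

def c3(n):
    return float(n)            # matrix formulation, O(n)

def sequential_cost(part_sizes, win_sizes, d):
    """Eq. (2.8): every partition and window is processed one after another."""
    sizes = list(part_sizes) + list(win_sizes)
    return sum(c1(s) + d * c2(s) + c3(s) for s in sizes)

def parallel_cost(part_sizes, win_sizes, d):
    """Eq. (2.9): each stage lasts as long as its largest partition/window."""
    def stage(cost):
        return max(map(cost, part_sizes)) + max(map(cost, win_sizes))
    return stage(c1) + d * stage(c2) + stage(c3)

windows = [200_000] * 24
balanced = [1_000_000] * 16
skewed = [2_000_000] + [14_000_000 // 15] * 15   # roughly the same total

for name, parts in (("balanced", balanced), ("skewed", skewed)):
    cs = sequential_cost(parts, windows, d=3)
    cp = parallel_cost(parts, windows, d=3)
    print(f"{name}: Cs = {cs:.3e}, Cp = {cp:.3e}, Cs/Cp = {cs / cp:.1f}")
```

In this model a single oversized partition raises the parallel cost even though the
total number of nodes is essentially unchanged, which is the motivation for
equal-sized partitions.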
Assume the power grid is divided into $K$ partitions with an equal number of nodes
$n$, all $L$ windows have $m$ nodes and the same size $s$, and $p$ processors are
available for concurrent execution. Then, the computational cost of the parallel
implementation can be approximated as
\[
C_{pe} = \left\lceil \frac{K}{p} \right\rceil C_1(n) + \left\lceil \frac{L}{p} \right\rceil C_1(m)
+ d \left( \left\lceil \frac{K}{p} \right\rceil C_2(n) + \left\lceil \frac{L}{p} \right\rceil C_2(m) \right)
+ \left\lceil \frac{K}{p} \right\rceil C_3(n) + \left\lceil \frac{L}{p} \right\rceil C_3(m),
\tag{2.10}
\]
where $C_{pe}$ is the overall parallel simulation cost for power grids with
equal-size partitions and $n = N/K$. Since the window $\Omega_{Wi}$ on the boundary
of partition $\Omega_j$ is a subgrid of $2s \times (2s + \sqrt{n})$ nodes, it can be
shown that $m \propto s/\sqrt{K}$. Moreover, the number of windows is proportional to
the number of partitions: $L \propto K$.
The number of partitions $K$ and the window size $s$ are the two most important
factors affecting the computational cost, and we have the following important
observations.
• Due to the superlinear complexity of Cholesky decomposition, an increase in the
number of partitions $K$ decreases the decomposition cost for the partitions.
However, since $\frac{L}{p} C_1(m) \propto K \, C_1\!\left(\frac{s}{\sqrt{K}}\right)$,
the cost of window decomposition increases with $K$ for a fixed $s$.
• An increase in the number of partitions $K$ may increase the number of iterations
$d$, since more regions are affected by the errors of the boundary current
approximations. As shown in Figure 13, when $K$ increases from 16 (Figure 13(a)) to
32 (Figure 13(b)), there are more nodes with large errors after the first iteration.
Therefore, the case with 32 partitions may require a larger number of iterations to
reduce the errors.
• An increase in the window size $s$ may decrease the number of iterations $d$, since
the boundary current approximations are more accurate. As shown in Figure 14, when
$s$ increases from 40 (C4 ring size 1, Figure 14(a)) to 80 (C4 ring size 2,
Figure 14(b)), a significant reduction of the errors after the first iteration is
observed. However, since $m \propto s$, an increase in $s$ increases the time for
window matrix decomposition and triangular solves.
Therefore, to maximize the runtime efficiency, the number of partitions and the
window size must be carefully chosen. The coefficients of the dominant terms in
(2.10) can be fitted and used to guide the parallel implementation, as demonstrated
in the next section.
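The opposing trends of the partition and window terms in (2.10) can be tabulated with
a short sketch. It assumes unit constants in $C_1$, $C_2$, and $C_3$, square
partitions so that a window holds $2s(2s + \sqrt{n})$ nodes, and $L = 2K$ windows.
The function name `cpe_terms` and the parameter values are illustrative, and the
outputs are relative costs, not measured runtimes.

```python
import math

def cpe_terms(N, K, L, s, p, d):
    """Partition and window contributions to C_pe in (2.10), unit coefficients."""
    def cost(n):   # C1 + d*C2 + C3 with all constants set to 1 (assumption)
        return n ** 1.5 + d * n * math.log(n) + n
    n = N / K                                  # nodes per partition
    m = 2 * s * (2 * s + math.sqrt(n))         # nodes per window
    part = math.ceil(K / p) * cost(n)
    win = math.ceil(L / p) * cost(m)
    return part, win

N, s, p, d = 16_000_000, 80, 16, 3
for K in (8, 16, 32, 64):
    part, win = cpe_terms(N, K, L=2 * K, s=s, p=p, d=d)
    print(f"K = {K:2d}: partitions {part:.2e}, windows {win:.2e}, total {part + win:.2e}")
```

The ceiling factors make the cost jump whenever $K$ or $L$ exceeds a multiple of the
processor count $p$, so in this model the window term keeps growing once $K$ passes
$p$ while the partition term shrinks more slowly.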
f. Performance Modeling for Parallel Implementation
From the observations in Section II.A.3.e, there is a trade-off between the number of
partitions and the overall simulation cost. Hence, there exists an optimal number of
partitions for which the simulation cost is minimized. For extremely large power
grids, a single simulation run may take several hours; it is therefore necessary to
find the optimal (or near-optimal) number of partitions to save time. A simple
strategy for determining the optimal number of partitions uses an approximation to
the parallel execution time. Using the notation introduced in Section II.A.3.e, we
approximate $C_1$, $C_2$, and $C_3$ as
\[ C_1(n) \approx \alpha_1 n^{1.5} + \beta_1, \tag{2.11} \]
\[ C_2(n) \approx \alpha_2 n \log n + \beta_2, \tag{2.12} \]
\[ C_3(n) \approx \alpha_3 n + \beta_3, \tag{2.13} \]
where $\alpha_i$ and $\beta_i$ ($i = 1, 2, 3$) are constants used to obtain a fit
with the experimental observations.
Fig. 13. Accuracy comparison of boundary current approximations using different
numbers of partitions. The power grid in (a) is divided into 16 partitions,
and the power grid in (b) is divided into 32 partitions. The values are node
voltage errors after the first iteration. The unit is Volt.
Fig. 14. Accuracy comparison of boundary current approximations using different
window sizes. The power grid in (a) uses window size 40 (C4 ring size 1), and the
power grid in (b) uses window size 80 (C4 ring size 2). The values are node
voltage errors after the first iteration. The unit is Volt.
Based on our experimental setups and the results shown in Section II.A.4, we make
the following observations (using different direct solvers may lead to different
observations).
• $s$ is usually chosen to correspond to a C4 ring size of 2 so that the number of
iterations is small.
• $C_2(n)$ is a very small term compared with $C_1(n)$ and $C_3(n)$, since the
decomposition time and matrix formulation time are dominant, as shown in Section
II.A.4.
Notice that $m \approx 2s(\sqrt{n} + 2s) \approx 2s\sqrt{n} = 2s\sqrt{N/K}$ and
$L \approx 2K$. Therefore, the parallel computational cost in (2.10) can be
approximated as
\[
C_{pe} \approx \frac{K}{p} C_1(n) + \frac{L}{p} C_1(m) + \frac{K}{p} C_3(n) + \frac{L}{p} C_3(m)
\tag{2.14}
\]
\[
= aK + b (NK)^{1/2} + c \left( N^3 K \right)^{1/4} + dN + e \left( \frac{N^3}{K} \right)^{1/2},
\tag{2.15}
\]
where $a = \frac{3\beta_1 + 3\beta_3}{p}$, $b = \frac{4 s \alpha_3}{p}$,
$c = \frac{4\sqrt{2}\, s \sqrt{s}\, \alpha_1}{p}$, $d = \frac{\alpha_3}{p}$, and
$e = \frac{\alpha_1}{p}$ (here $d$ denotes a fitted coefficient, not the number of
iterations). Using the experimental runtime data and linear regression, we can
obtain the values of $a, b, c, d, e$. Then, by solving
\[
\frac{dC_{pe}}{dK} = f(K) = a + \frac{b}{2} \left( \frac{N}{K} \right)^{1/2}
+ \frac{c}{4} \left( \frac{N}{K} \right)^{3/4}
- \frac{e}{2} \left( \frac{N}{K} \right)^{3/2} = 0,
\tag{2.16}
\]
we can obtain the optimal (or near-optimal) number of partitions $K_o$.
4. Experimental Results
The proposed parallel partitioning-based simulation method has been implemented in
C. Parallelization is implemented with multiple threads created using Pthreads.
Experimental results for the flat simulation and for the sequential and parallel
simulations using the proposed approach are obtained on an IBM p5-575 processing
node with 16 Power5+ processors at 1.9 GHz and 32 GB of RAM (25 GB available for
computing) running 64-bit AIX 5L (5.3).
The proposed approach has been tested on seven large-scale power grids of varying
sizes: 2.56M, 4M, 5.76M, 7.84M, 9M, 12.96M, and 16M nodes. All the power grids use
C4 bump power supply pads that are evenly distributed (40 nodes away from each
other). The current loadings of the grid differ from block to block but are regular
inside each block. A direct solver using Cholesky decomposition [42] is chosen to
carry out the circuit simulations. The L factors of the boundary window circuits and
partition circuits are stored for reuse in the iterative error reduction process.
The matrix formulation time, data transfer time, Cholesky decomposition time, and
triangular solve time are included in the total runtime. The Cholesky decomposition
time for the parallel simulation includes the time for window decompositions and the
time for partition decompositions. Likewise, the triangular solve time for the
parallel simulation consists of the time for the window matrix solves and the
partition matrix solves.
First, all the grids are divided into 16 equal-sized partitions with 24 windows, and
the window size is 80 (C4 ring size of 2). The runtimes of the flat simulation and
the proposed parallel simulation are presented in Table II. As shown, the significant
advantage of the proposed approach is its scalability. The standard flat simulation
does not scale well with circuit complexity (especially the Cholesky decomposition).
Table II. Simulation runtime for flat simulation and the proposed parallel
simulation. All the grids are divided into 16 partitions with 24 windows. 16
processors are used. Chol: Cholesky decomposition time; Sol: triangular solve time;
Tot: total runtime; Mem: memory; Iter: number of iterations; Sp: speedup; Em: maximum
node voltage error; Ea: average node voltage error. Runtime is in seconds; error is
in mV; and memory is in GB.
              Flat Simulation                 Parallel Simulation
Num. Nodes    Chol    Sol     Tot     Mem     Chol   Sol    Iter   Tot    Sp      Mem    Em       Ea
2.56M         21.0    1.5     33.2    1.8     1.6    0.15   3      3.9    8.5X    2.6    5.8e-2   2.7e-3
4M            40.7    2.4     59.1    2.9     2.2    0.22   3      5.3    11.2X   3.5    4.8e-2   1.0e-3
5.76M         69.4    3.5     97.2    4.4     3.0    0.32   3      7.0    13.9X   4.8    6.5e-2   1.1e-3
7.84M         101.0   4.8     138.1   6.0     4.1    0.43   3      9.4    14.7X   6.6    6.6e-2   9.4e-4
9M            121.5   5.6     164.3   6.9     4.4    0.49   3      10.6   15.5X   7.4    5.2e-2   1.1e-3
12.96M        203.3   8.1     266.1   10.3    6.9    0.70   3      14.9   17.8X   10.5   4.3e-2   1.3e-3
16M           267.9   9.9     346.6   12.8    9.5    0.97   3      18.4   18.8X   12.8   5.0e-2   1.1e-3
Table III. Speedups of the parallel implementation of the algorithm over the
sequential implementation of the algorithm. 16 processors are used for the parallel
simulation. Runtime is in seconds.
Num. Nodes   Window Size   Num. Partitions   Sequential   Parallel   Speedup
2.56M        80            16                36.4         3.9        9.3X
4M           80            16                53.6         5.3        10.1X
5.76M        80            16                76.7         7.0        10.9X
7.84M        80            16                104.0        9.4        11.1X
9M           80            16                118.1        10.6       11.1X
12.96M       80            16                173.7        14.9       11.6X
16M          80            16                216.9        18.4       11.8X
In contrast, the divide-and-conquer nature of the proposed method makes it highly
scalable (the Cholesky decomposition time is almost linear). As shown in Figure 15,
the runtime of the proposed method grows almost linearly with a very small slope, so
the runtime speedup over the flat simulation keeps increasing across all the cases.
For the largest 16M-node grid, the speedup already reaches 18.8X, and larger
speedups are expected for even larger grids. The superlinear speedups come from the
proposed method itself and the parallel implementation on the 16-core machine.
Moreover, Table II also shows that the proposed block-based iterative process is
very effective and efficient: a maximum node voltage error of less than 0.07 mV is
reached after only three iterations. For small power grids, our method consumes more
memory than the flat method because the numbers of nodes in the windows are
comparable to the numbers of nodes in the partitions. However, as the size of the
power grid increases, the memory consumption of the proposed approach grows more
slowly than that of the flat method. We can expect favorable memory consumption of
our method for extremely large power grids.
Fig. 15. Runtime comparison between the flat simulation and the proposed parallel
simulation.
[Plot: runtime (s) versus number of nodes (x 10^6) for the flat simulation and the
parallel simulation.]
Next, we analyze the parallel eﬃciency of the proposed algorithm. Table III
compares the execution time of the proposed method on a single processor with that
on 16 processors. As presented, the parallel implementation can bring about 11X
speedups. The ideal 16X speedup is not achieved due to system overheads in the
present implementation.
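The gap to the ideal 16X speedup can be quantified as parallel efficiency, taken here
as speedup divided by the number of processors (a common definition that is assumed
rather than stated in the text). The short script below only recomputes the ratios
from the runtimes listed in Table III.

```python
# Runtimes in seconds, copied from Table III (16 partitions, window size 80).
sequential = {"2.56M": 36.4, "4M": 53.6, "5.76M": 76.7, "7.84M": 104.0,
              "9M": 118.1, "12.96M": 173.7, "16M": 216.9}
parallel = {"2.56M": 3.9, "4M": 5.3, "5.76M": 7.0, "7.84M": 9.4,
            "9M": 10.6, "12.96M": 14.9, "16M": 18.4}
processors = 16

speedup = {k: sequential[k] / parallel[k] for k in sequential}
efficiency = {k: speedup[k] / processors for k in sequential}
for k in sequential:
    print(f"{k:>7}: speedup {speedup[k]:.1f}X, efficiency {efficiency[k]:.0%}")
```

Under this definition the efficiency rises from roughly 58% for the 2.56M grid to
roughly 74% for the 16M grid, so the fixed overheads weigh less as the grids grow.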
As discussed in Section II.A.3.e, the number of partitions and the window size
determine the number of iterations and thus affect the total runtime. As shown in
Table IV, when we increase the number of partitions from 8 to 16 so that all the
available cores are fully utilized for the parallel simulation, the total runtime
decreases. However, when we further divide the grid into more partitions, the
runtime efficiency degrades. There are three reasons for this behavior. The first is
the hard limit on the number of available cores, 16 in our experiment. The second is
that increasing the number of partitions increases the simulation cost for the
windows.
Table IV. Runtime analysis with various numbers of partitions for 9M, 12.96M, and
16M grids. Chol: Cholesky decomposition time; Sol: triangular solve time; Iter:
number of iterations; Tot: total runtime. Runtime is in seconds.
              8 Partitions                 16 Partitions                32 Partitions
Num. Nodes    Chol   Sol    Iter   Tot     Chol   Sol    Iter   Tot     Chol   Sol    Iter   Tot
9M            10.7   0.94   3      19.6    4.4    0.49   3      10.6    4.1    0.54   4      12.0
12.96M        15.7   1.3    3      28.3    6.9    0.70   3      14.9    6.3    0.80   4      16.3
16M           18.5   1.6    3      33.5    9.5    0.97   3      18.4    8.9    1.3    3      19.6
Table V. Number of iterations required by using various window sizes. Iter: number
of iterations; Tot: total runtime. Runtime is in seconds.
Window   C4 Ring   4M Grid        5.76M Grid     7.84M Grid     9M Grid
Size     Size      Iter   Tot     Iter   Tot     Iter   Tot     Iter   Tot
40       1         5      5.4     5      7.1     5      9.7     6      11.3
80       2         3      5.3     3      7.0     3      9.4     3      10.6
120      3         3      6.6     3      8.6     3      10.9    3      12.2
The third is that using small partitions may lead to more iterations because small
partitions tend to have more coupling to the rest of the grid. In addition, as shown
in Table V, the more accurate boundary currents obtained from a larger boundary
window simulation reduce the number of iterations needed for the error correction,
but may cost more time in the parallel boundary current approximation. Our
experiments indicate that a window size of 80 (C4 ring size of 2) is sufficient for
the test power grids, which is consistent with [19]. The block Jacobi method
corresponds to the case with a window size of 0, whose convergence can be expected
to be much slower than that of our method.
By applying linear regression to the data in Table II, we obtain for our
experimental setup
\[
[a, b, c, d, e]^T = [1.10,\; -1.35 \times 10^{-2},\; 9.18 \times 10^{-4},\; 1.86 \times 10^{-5},\; 3.18 \times 10^{-9}]^T.
\tag{2.17}
\]
For the 16M power grid, by solving (2.16), the optimal (or near-optimal) number of
partitions $K_o$ is found to be 16, which is consistent with the result shown in
Table IV. It can be expected that $K_o$ increases with $N$.
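Solving (2.16) numerically is straightforward. The sketch below evaluates $f(K)$ with
the coefficients of (2.17) and locates its sign change by bisection. The helper names
`dcpe_dk` and `optimal_partitions` and the search interval $[1, 1024]$ are
assumptions made for illustration. The coefficient $d$ is not needed because the
term $dN$ in (2.15) does not depend on $K$.

```python
def dcpe_dk(K, N, a, b, c, e):
    """f(K) from (2.16); the coefficient d drops out of the derivative."""
    x = N / K
    return a + 0.5 * b * x ** 0.5 + 0.25 * c * x ** 0.75 - 0.5 * e * x ** 1.5

def optimal_partitions(N, a, b, c, e, lo=1.0, hi=1024.0, steps=60):
    """Bisection for f(K) = 0, assuming f(lo) < 0 < f(hi)."""
    if not dcpe_dk(lo, N, a, b, c, e) < 0.0 < dcpe_dk(hi, N, a, b, c, e):
        raise ValueError("f(K) does not change sign on [lo, hi]")
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if dcpe_dk(mid, N, a, b, c, e) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a, b, c, e = 1.10, -1.35e-2, 9.18e-4, 3.18e-9   # from (2.17); d is not needed
K_o = optimal_partitions(16e6, a, b, c, e)
print(f"stationary point of C_pe: K = {K_o:.2f}")
```

With these rounded coefficients the root falls between 15 and 16, in line with the
reported $K_o$ of 16. Because $f$ changes from negative to positive there, the
stationary point is a minimum of $C_{pe}$.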
5. Summary
In this section, we have presented a parallel partitioning-based power grid analysis
approach that exploits spatial locality. We have identified the main factors that
affect the solving process: the number of partitions and the window size. The
interdependence of these parameters and their influence on the execution time have
been analyzed. We have also suggested a strategy that helps users determine the
optimal (or near-optimal) values of these parameters to achieve the lowest parallel
runtime. The proposed approach is shown to have excellent parallel efficiency, fast
convergence, flexible partitioning, and favorable scalability. With distributed
computing networks, it is expected to handle extremely large power grids (with many
millions of nodes) efficiently.
B. GSim: A Fast CPU-GPU Combined Parallel Simulator for Power Delivery Networks with On-Chip Voltage Regulation
Detailed and accurate analysis of PDNs with on-chip voltage regulators is hindered
by the lack of efficient simulation techniques for such PDNs. In this section, these
simulation challenges are addressed by the proposed partitioning relaxation method.
Using this method, an existing fast GPU multigrid solver for on-chip power grids and
a general SPICE simulator for nonlinear regulators are integrated to achieve
excellent efficiency and accuracy.
1. Background and Overview
The detailed model of a multiple-domain power delivery network with on-chip low-
dropout voltage regulators is presented in Figure 16. The on-chip PDN has a global
VDD grid, several on-chip LDOs, a number of local grids and a global GND grid.
Fig. 16. The model for a power delivery network with on-chip voltage regulators.
The global VDD grid distributes the input voltage to the on-chip LDOs through metal
wires. Each local grid corresponds to a power domain, and its voltage is provided by
LDOs.
Given the need to accelerate the simulation of large linear power grids, the
difficulty of simulating multiple nonlinear voltage regulators, and the hierarchical
nature of the network structure, we adopt a black-box Gauss-Seidel relaxation
algorithm [43] to develop GSim, a GPU-accelerated simulation engine that can solve
extremely large PDNs with good runtime and memory efficiency. GSim utilizes an
efficient iterative partitioning relaxation method to analyze the LDOs, the off-chip
circuit, and extremely large on-chip power grids. As presented on the left of
Figure 17, the entire PDN is partitioned into five major circuit blocks: the
off-chip circuit, the LDOs, the global VDD grid, the local grids, and the global GND
grid. For the transient simulation, at each time point, GSim solves each circuit
block individually and updates the solutions through the partition boundaries until
convergence is reached. To solve each block in the most efficient way, the off-chip
circuit is solved by a passive network solver on the CPU, the transistor-level LDOs
are analyzed by a SPICE solver on the CPU, and all the power grids (the global VDD
grid, the local grids, and the global GND grid) are handled by a GPU multigrid solver
[13], which is over 50X faster than the state-of-the-art direct solver CHOLMOD [42].
The updated results for the partition boundaries are exchanged between the CPU and
GPU through PCI-E. As the experimental results will show, most of the simulation
time is spent on solving the power grids. Thus, our partitioning-based simulation
scheme, which places the power grid simulations on a fast GPU engine, is very
efficient in terms of runtime.
Fig. 17. GSim simulation diagram.
2. CPU-GPU Combined Transient Simulation
a. GPU-Based Multigrid Method
In this work, we solve the on-chip power grids on the GPU by adopting the hybrid
multigrid (HMD) method [13]. The idea of the HMD method is to set the original 3D
irregular power grid as the finest grid in the multigrid hierarchy and to define a
set of topologically regularized 2D grids, obtained from the original 3D irregular
PDN, as the coarser to coarsest level grids. The most time-consuming smoothing
steps, as well as the other multigrid operations, are accelerated efficiently on the
GPU. More specifically, PDN simulation can be performed efficiently on the GPU with
no explicit sparse matrix-vector operations. The HMD algorithm shares the common
property of multigrid methods that runtime and memory consumption are linear in the
problem size. Our experiments on a variety of industrial power grid designs show
that after only a few HMD iterations, the power grid error components are damped out
quickly (with maximum errors smaller than 1 mV and average errors smaller than
0.1 mV).
Fig. 18. Scalability of runtime and memory consumption for the GPU solver.
[Two plots: run time (seconds) and memory usage (MB) versus regular grid size
(millions), for the GPU solver and CHOLMOD.]
The GPU-based HMD solver [13] has been shown to be very efficient for solving
on-chip power grids. For instance, for synthetic power grids with one million to
nine million nodes, as shown in Figure 18 [13], the GPU simulation engine solves
them at a rate of three million nodes per second, which is 50X to 180X faster than
the state-of-the-art direct solver CHOLMOD [42] executed on a quad-core 2.6 GHz
computer. Additionally, the GPU solver consumes 20X less memory than the CHOLMOD
solver. For instance, less than 500 MB of memory is required for solving the
nine-million-node grid.
Fig. 19. Boundary relaxation for a single LDO.
b. Boundary Relaxation
The simulation of an individual circuit block and the boundary relaxation scheme are
described in detail for a single LDO; all other circuit blocks and partition
boundaries can be handled in the same way. As shown in Figure 19, an LDO is
connected to a local grid (through a resistor between nodes A and B) and to the VDD
grid (through a resistor between nodes C and D). A and C are contained in the LDO
block, B is on the local grid, and D is on the VDD grid. The LDO is analyzed together
with these two resistors and two ideal voltage sources at B and D whose values are
$V_B$ and $V_D$, respectively. After the simulation finishes, the new node voltages
of A ($V'_A$) and C ($V'_C$) are obtained and used to update their values. It should
be noted that the decoupling capacitors between the local grids and the GND grid can
be converted to a resistor and a current source using the Norton companion model.
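The alternating update of boundary voltages can be sketched on a two-node DC example.
The sketch makes several simplifying assumptions that are not part of GSim: the LDO
is replaced by a linear source `v_src` with output resistance `r_out`, the local
grid is reduced to a single node B with resistance `r_grid` to ground and a load
current `i_load`, only the A-B boundary is modeled, and the coupling resistor `r_c`
is used in both block solves. All names and values are hypothetical.

```python
import numpy as np

def relax_boundary(v_src, r_out, r_c, r_grid, i_load, tol=1e-9, max_iter=200):
    """Gauss-Seidel relaxation between an LDO stand-in (node A) and a grid node B."""
    v_a, v_b = v_src, v_src          # initial guess for both boundary nodes
    for it in range(1, max_iter + 1):
        # Block 1 (LDO stand-in): V_B is held by an ideal source, solve for V_A.
        v_a_new = (v_src / r_out + v_b / r_c) / (1.0 / r_out + 1.0 / r_c)
        # Block 2 (local grid): uses the freshly updated V_A (Gauss-Seidel).
        v_b_new = (v_a_new / r_c - i_load) / (1.0 / r_c + 1.0 / r_grid)
        change = max(abs(v_a_new - v_a), abs(v_b_new - v_b))
        v_a, v_b = v_a_new, v_b_new
        if change < tol:             # voltage change at the boundary is small
            return v_a, v_b, it
    raise RuntimeError("boundary relaxation did not converge")

v_src, r_out, r_c, r_grid, i_load = 1.0, 0.1, 0.05, 10.0, 0.5
v_a, v_b, iterations = relax_boundary(v_src, r_out, r_c, r_grid, i_load)

# Reference: solve both nodal equations simultaneously.
A = np.array([[1 / r_out + 1 / r_c, -1 / r_c],
              [-1 / r_c, 1 / r_c + 1 / r_grid]])
rhs = np.array([v_src / r_out, -i_load])
v_a_ref, v_b_ref = np.linalg.solve(A, rhs)
print(iterations, abs(v_a - v_a_ref), abs(v_b - v_b_ref))
```

For this linear stand-in the relaxation contracts by a fixed factor per sweep, so it
always converges. With a transistor-level LDO the block 1 solve would be a nonlinear
SPICE analysis, and the stopping test on boundary voltage changes would be the same.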
c. Convergence
Convergence is examined by checking the average and maximum voltage changes at the
partition boundaries. Although the LDO is a nonlinear device, it automatically
regulates its output voltage, so the voltage change at the boundary between the LDO
and the local grids is small between consecutive time steps. The other partition
boundaries also exhibit small voltage changes from one time step to the next.
Therefore, convergence can be reached quickly, which is consistent with our
experimental results. The convergence can be further improved by employing a
multi-level Newton method.
d. Simulation Flow
The simulation flow of GSim at time step k is summarized in Figure 20. The circuit
blocks are solved separately, and the solutions at the interfacing nodes are updated
using the boundary relaxation scheme presented in Figure 19. However, considering
the strong interactions between the on-chip global and local PDNs (caused by the
decoupling capacitors), a naive implementation of the block circuit iteration scheme
may lead to slow convergence. In this work, we propose to first solve the local
grids and the global GND grid through a number of inner iterations and then solve
the rest of the circuit blocks in an outer iteration loop. As shown in our
experimental results, only a few (two to four) outer iterations are needed at each
time step.
3. Experimental Results
The PDN simulator GSim has been implemented in CUDA [44] and C++. The GPU program is
executed on a single GPU of an NVIDIA GeForce 9800 GX2 card (which includes two
GPUs), with a total on-board memory of 512 MB. All the C++ programs are executed on
a workstation with an Intel Xeon CPU at 2.33 GHz and 4 GB of RAM running a 64-bit
Linux OS.
Fig. 20. GSim simulation flow.
Table VI. Transient simulation runtime and numbers of iterations of GSim for PDNs
with on-chip LDOs. 1200 time steps are simulated. CPU%: the percentage of runtime
spent on CPU. The runtime is in seconds.
                          Runtime                   Num. Iterations
Num. Nodes   Num. LDOs    Total   /Step   CPU%      Total   /Step
2.25M        36           1810    1.6     22        2274    1.9
2.25M        144          1768    1.5     23        2000    1.7
9M           64           7398    6.2     24        2864    2.4
9M           256          4500    3.7     27        1900    1.4
The transient simulation runtime is examined by analyzing PDNs with 2.25M or
9M on-chip nodes. Each PDN has a different number of LDOs. The results are shown
in Table VI. For PDNs with more LDOs, the voltage changes are smoother than in
the PDNs with fewer LDOs, so they require fewer iterations to converge. All four
cases converge in fewer than three outer iterations per time step on average.
Notice that two inner iterations are forced for each outer iteration. As can be seen,
the cost of analyzing on-chip power grids is dominant, and the overhead introduced
by simulating LDOs is not significant because of the proposed partitioning scheme.
Therefore, moving the on-chip power grid simulation to the fast GPU engine is extremely
effective and noticeably speeds up the entire simulation. In summary, GSim is an efficient
solver to tackle multi-million-node PDNs with on-chip LDO regulators.
4. Summary
A fast CPU-GPU combined simulation engine, GSim, has been developed to provide
SPICE-level accuracy for simulating complex PDNs employing on-chip voltage regu-
lation techniques. GSim identiﬁes the simulation diﬃculties for diﬀerent circuit blocks
and achieves its eﬃciency by using a block-based Gauss-Seidel relaxation scheme to
integrate several fast simulation strategies together. These simulation strategies are
speciﬁcally designed for three types of circuit blocks in the PDN. Most importantly,
GSim provides a foundation for comprehensively analyzing the electrical characteristics
and various design tradeoffs of PDNs with on-chip voltage regulation.
C. Transient Veriﬁcation of Power-Gated Power Delivery Networks
The on/off states and transitions of gated power grids can generate a large number of
power gating configurations, which makes the verification of gated PDNs very difficult.
By combining an effective circuit modeling method with a fast superposition approximation
flow, the methodology presented in this section makes the verification of
gated PDNs feasible.
1. Background
In this work, we focus on the transient power gating verification problem, which can
be defined as follows: for a given loading current distribution for each core, verify
whether the required EM and voltage drop specifications are satisfied under all possible
on/off configurations and on/off transitions of the power-gated PDN.
[Figure 21: schematic. Off-chip model: DC source, PCB and package. On-chip model:
C4 bumps, global VDD grid, sleep transistors, local grids 1 and 2, decaps and
global GND grid.]
Fig. 21. A power-gated power delivery network model.
a. Modeling for Power-Gated PDNs
The model for a power-gated PDN is shown in Figure 21. On the chip, the glob-
al VDD grid delivers power to gated local grids through multiple sleep transistors.
The decoupling capacitors and switch circuits reside between the local grids and the
common global GND grid. When a sleep transistor is completely turned on, it is
modeled as a resistor. Otherwise, it is treated as an open circuit. However, during
the power-on process, due to the charging eﬀect of the decoupling capacitors, sleep
transistors work in the saturation region and the linear region in diﬀerent phases.
Hence, sleep transistors are modeled as time-varying resistors during power-on.
b. Veriﬁcation Metrics
We define important electromigration (EM) and Dynamic Voltage Drop (DVD) metrics
for interconnects and switch circuits, respectively, in our transient verification tasks.
Since the EM effect is proportional to the average current flowing through the wire
[5], we use the average current to define the EM metric. As shown in Figure 22(a),
for a wire $w_m$ with transient current $I_m(t)$, if the time period for verification starts
at $t_1$ and ends at $t_2$, the EM metric is given as
\[ |EM_m| = \left| \frac{\int_{t_1}^{t_2} I_m(t)\,dt}{t_2 - t_1} \right|. \tag{2.18} \]
[Figure 22: (a) transient current $I_m(t)$ of wire $w_m$ between $t_1$ and $t_2$, with the
level $EM_m$ marked; (b) circuit block $T_n$ connected between $V_{DDn}$ and $V_{SSn}$, with
$V_{DDn} - V_{SSn}$ plotted between $t_1$ and $t_2$ against $V_{DD}$ and the peak drop $DVDP_n$
marked.]
Fig. 22. EM and DVD metrics.
On the other hand, the peak DVD is important to the operation and timing
performance of circuit designs [4]. Correspondingly, it is adopted as a veriﬁcation
metric. As shown in Figure 22(b), for a circuit block Tn connected to VDDn and VSSn,
during the veriﬁcation time period t1 to t2, the peak DVD is deﬁned as,
\[ DVDP_n = \max_{t_1 \le t \le t_2} \left\{ V_{DD} - V_{DDn}(t) + V_{SSn}(t) \right\}. \tag{2.19} \]
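As a small illustration of the two metrics, the sketch below evaluates (2.18) and (2.19) on sampled waveforms. It assumes the waveforms are given as plain Python lists and uses trapezoidal integration; the function names `em_metric` and `peak_dvd` and the sample values are hypothetical.

```python
def em_metric(t, i_m):
    """Magnitude of the average current over [t[0], t[-1]], as in (2.18).

    t   : sample times (strictly increasing)
    i_m : wire current samples at those times
    The integral is approximated with the trapezoidal rule (an assumption).
    """
    area = sum(0.5 * (i_m[k] + i_m[k + 1]) * (t[k + 1] - t[k])
               for k in range(len(t) - 1))
    return abs(area / (t[-1] - t[0]))


def peak_dvd(vdd, v_ddn, v_ssn):
    """Peak dynamic voltage drop of a block over the sampled window, as in (2.19)."""
    return max(vdd - a + b for a, b in zip(v_ddn, v_ssn))


# Hypothetical sample data.
avg_current = em_metric([0.0, 1.0, 2.0], [0.0, 2.0, 0.0])
worst_drop = peak_dvd(1.0, [0.95, 0.90, 0.97], [0.01, 0.03, 0.00])
```

Note that the EM metric averages before taking the magnitude, so currents of opposite sign cancel, whereas the peak DVD is a pointwise maximum.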
2. Overview of the Transient Veriﬁcation
a. Veriﬁcation Tasks
In a power-gated power delivery network, each local grid has four possible states:
sleep, active, transition, and idle. In the sleep state, the sleep transistors are com-
pletely turned oﬀ. In the active state, the sleep transistors are completely turned
on and the switch circuits work in steady state. In the transition state, the sleep
transistors are gradually turned on while the decoupling capacitors are under charg-
ing, but the switch circuits are idle. Finally, in the idle state, the sleep transistors
are completely turned on while the switch circuits are still idle. In this work, the
idle state is not explicitly considered, since it can be considered active by modeling
leakage currents. For a local grid Gk, its state is represented by bk,
\[ b_k = \begin{cases} 0 & \text{sleep state,} \\ 1 & \text{active state,} \\ 2 & \text{transition state,} \\ 3 & \text{idle state.} \end{cases} \tag{2.20} \]
A power gating conﬁguration is a combination of the above possible states of
gated local grids. Two kinds of power gating conﬁgurations are examined in this
work:
• Stable configuration: as shown in Figure 23, each gated local grid is either in the
active state or the sleep state. In this configuration, the voltage drops and current
distributions across the entire PDN are only caused by switch circuits in the
steady state (modeled as current sources). Assume there are N independent
gated local grids $G_1, \ldots, G_N$. A stable configuration $\mathcal{C}_i$ can be represented as
\[ \mathcal{C}_i = \{b_{1i}, b_{2i}, \ldots, b_{Ni}\}, \quad b_{ki} \in \{0, 1\}; \; k = 1, \ldots, N, \tag{2.21} \]
where $b_{ki}$ is the state of grid $G_k$. The corresponding state of $b_{ki}$ is defined in
(2.20).
• Transition configuration: some gated local grids are in the transition state while
others are either in the active or sleep state. In this configuration, the voltage
drops and current distributions are influenced by the sleep transistors' power-on
processes (charging effect). Similar to the stable configuration, if there are N
independent gated local grids $G_1, \ldots, G_N$, a transition configuration $\mathcal{D}_j$ can be
represented by
\[ \mathcal{D}_j = \{b_{1j}, b_{2j}, \ldots, b_{Nj}\}, \quad b_{kj} \in \{0, 1, 2\}; \; k = 1, \ldots, N, \tag{2.22} \]
where $b_{kj}$ is the state of grid $G_k$.
[Figure 23: local grids marked by state (active, sleep, transition), illustrating a
stable configuration and a transition configuration.]
Fig. 23. Stable and transition configurations.
[Figure 24: tree diagram. Transient Power Gating Verification splits into Stable Mode
Verification (EM, DVDP) and Power-on Verification (DVDP); further labels: Global
Grids, Local Grids.]
Fig. 24. Proposed transient verification tasks.
As summarized in Figure 24, the proposed verification tasks are categorized into
stable mode verification and power-on verification.
• In the stable mode verification, the mission is to find the worst or near-worst
dynamic performance (the largest EM and peak DVD values) of the PDN under
all possible stable configurations. This helps designers ensure that when their
chip works in steady state under any possible stable gating configuration, the
gated PDN satisfies the power integrity requirements. We capture both time-average
and temporal effects by checking the values of the EM and peak DVD
metrics. Moreover, interconnect EM values in both the global and the local
grids are checked. However, it is only necessary to perform peak DVD checks
for local grids, where the switch circuits reside.
• In the power-on verification, the worst or near-worst dynamic performance under
all possible transition configurations is identified. The goal is to determine
whether the power-on current transients, when coupled with stable workloads
from other cores, would jeopardize power delivery integrity.
b. Overview of the Proposed Transient Veriﬁcation Methodology
With current loadings given for each local grid, according to (2.21) and (2.22), a
PDN with N gated local grids, or cores, has $2^N$ stable configurations and $3^N - 2^N$
transition configurations. A brute-force exhaustive verification would require at least
$2^N$ lengthy die-package PDN transient simulations, each covering at least hundreds of
clock cycles to capture the dynamics of the network. Hence, the brute-force approach
is simply infeasible.
To achieve feasibility and efficiency, in this work we propose a novel technique
that drastically reduces the number of full simulations from $2^N$ or more to O(N).
Our method develops an equivalent circuit modeling scheme that allows superposition
to be used to efficiently approximate the metrics of all power gating configurations.
These approximations are obtained by fully simulating O(N) configurations,
whose contributions are superimposed to find several configurations (also
O(N) in number) that can potentially cause the worst-case circuit performance.
These configurations are called worst-case candidates. Then, this set of
worst-case candidates is fully simulated and the worst or near-worst
performance can be found. The high-level flow of the proposed methodology is
summarized in Figure 25. As can be seen, the proposed process significantly reduces
the computation complexity from $O(2^N)$ to O(N). Overall, the proposed approach
makes the transient verification for a power-gated PDN with several million nodes
feasible.
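The configuration counts quoted above can be confirmed by direct enumeration for small N. The sketch below assumes the state encoding of (2.20) with the idle state excluded; the function name `count_configurations` is hypothetical.

```python
from itertools import product


def count_configurations(n):
    """Count stable and transition configurations of n gated local grids.

    Stable configurations use states {0: sleep, 1: active} only.
    Transition configurations use states {0, 1, 2} with at least one grid
    in the transition state (2).
    """
    stable = sum(1 for _ in product((0, 1), repeat=n))
    transition = sum(1 for cfg in product((0, 1, 2), repeat=n) if 2 in cfg)
    return stable, transition


stable_4, transition_4 = count_configurations(4)
```

For four grids this gives 16 stable and 65 transition configurations, matching $2^N$ and $3^N - 2^N$. The enumeration itself is cheap; what makes brute force infeasible is running a full transient simulation for every entry.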
[Figure 25: flowchart with four steps: (1) approximate the dynamic metrics for all
possible configurations; (2) rank the approximated metric values and determine O(N)
worst-case candidates; (3) full simulation validation for the candidate configurations;
(4) return the worst metric value.]
Fig. 25. The proposed transient veriﬁcation methodology.
3. Stable-Mode EM Veriﬁcation for Global Grids
The stable-mode EM veriﬁcation is to ﬁnd the largest average current, for a wire in
the on-chip PDN, under all possible stable conﬁgurations.
a. Challenges for Transient EM Veriﬁcation
First, consider the DC EM analysis for a simple gated power grid used in [26], as
shown in Figure 26. For illustration purposes, assume there are only two current loads
for each gated local grid. Each gated local grid is connected to the global grid through
a single sleep transistor. Since only the DC condition is examined, the current flowing
through the sleep transistor is equal to the total current load of its connected local
grid. For example, for local grid 1 in Figure 26, we have
\[ I_T = I_{s1} + I_{s2}, \tag{2.23} \]
where $I_T$ is the DC current through the sleep transistor and $I_{s1}$ and $I_{s2}$ are the DC
current loads for local grid 1.
[Figure 26: global VDD grid feeding local grids 1 to 4; local grid 1 is annotated with
the sleep transistor current $I_T$ and the load currents $I_{s1}$ and $I_{s2}$.]
Fig. 26. A simple power-gated PDN in [26]. For simplicity, only four local grids are
shown. Each local grid has a single sleep transistor and two DC current loads.
Therefore, turning on/oﬀ a gated local grid is equivalent to connecting or dis-
connecting a current source to the global grid. The value for this current source can
be obtained by summing up all the current loads under its corresponding local grid.
Note that each local grid can be treated as an independent current input to the global
grid. Thus superposition theory can be easily applied in this case to ﬁnd the exact
branch current for any conﬁguration.
In this work, a more general and complex PDN model is used (as shown in Figure
27). Meanwhile, we consider the more meaningful transient dynamics of gated PDNs.
However, this leads to three key difficulties in employing superposition theory.
• The existence of decoupling capacitance as well as the package makes the entire
PDN a strongly coupled system. Therefore, finding a clear cut for the inputs
to the system is difficult.
• Since multiple sleep transistors are used for each local grid, a change of gating
configuration causes a redistribution of the currents through these transistors.
Thus, the value of the current source model for each local grid is not fixed.
For example, for local grid 1 in Figure 27, even though the current load $I_s$ is
given, the currents through the two sleep transistors, $I_{T1}$ and $I_{T2}$, may change when
other local grids' states change (e.g. local grid 2 changes from the active mode
to the sleep mode).
• The current flowing through the decoupling capacitance is subject to change
when the power gating configuration changes. For example, for local grid
1 in Figure 27, the current through the decoupling capacitance $I_c(t)$ would
change when the configuration changes from {1, 1, 1, 1} (all grids are active) to
{1, 1, 1, 0} (grid 4 is in sleep, others are active), because the change of total
capacitance in the system leads to a variation of the frequency response.
[Figure 27: package, global VDD grid, local grids 1 to 4 and global GND grid; local
grid 1 is annotated with the sleep transistor currents $I_{T1}(t)$ and $I_{T2}(t)$, the load
current $I_s(t)$ and the decap current $I_c(t)$.]
Fig. 27. The PDN model diagram. For simplicity, only four local grids are shown.
Each local grid has two sleep transistors, one current load and one decoupling
capacitor.
Although the transient EM analysis for PDNs shown in Figure 27 is challenged by
the above difficulties, we develop the following concepts of equivalent circuit modeling
and superposition approximation to speed up the verification in the large space of
possible gating configurations.
[Figure 28: left, local grids 1 to 3 between the global VDD grid and the global GND
grid; right, after current source modeling, the local grids are replaced by the
current sources $I_1(t)$, $I_2(t)$ and $I_3(t)$.]
Fig. 28. Switchable current source model for local grids. For simplicity, the package is
not shown and there are only three local grids.
b. Superposition Approximation
Identifying the circuit to which superposition is applied and the corresponding inputs
is crucial in applying the superposition technique. If we treat the current loads in each
local grid as the inputs and the remaining PDN as the circuit, then turning the local
grids on or off changes not only the inputs but also the circuit. Therefore,
in order to keep the circuit unchanged while turning gated local grids on or off, the
local grids should be treated as the inputs, and the global grids, C4 bumps and the off-chip
circuit are included in the circuit. When a local grid is active, it draws currents from
or provides currents (through decaps) to the global grids. Hence, each local grid
can be modeled as a set of switchable time-varying current sources attached to the
global grids. As a simple example, in Figure 28 local grids 1, 2 and 3 are modeled as
switchable time-varying current sources I1(t), I2(t) and I3(t), respectively.
Before determining the value for each switchable current source, we define a
global basic stable configuration $\mathcal{C}_i^g$ and a full-decap basic stable configuration $\mathcal{C}_i^f$ as
below (assume there are N independent gated local grids),
\[ \mathcal{C}_i^g = \{b_{1i}, b_{2i}, \ldots, b_{Ni}\}, \quad b_{ii} = 1, \; \text{others} = 0, \tag{2.24} \]
\[ \mathcal{C}_i^f = \{b_{1i}, b_{2i}, \ldots, b_{Ni}\}, \quad b_{ii} = 1, \; \text{others} = 3, \tag{2.25} \]
where in $\mathcal{C}_i^g$, only grid $G_i$ is active and the others are in sleep; in $\mathcal{C}_i^f$, only grid $G_i$ is
active and the others are idle (acting as decoupling capacitance). Moreover, we assume
the decoupling capacitances for local grids $G_1, \ldots, G_N$ are $C_1, \ldots, C_N$ respectively.
For the global basic stable configurations, the decoupling capacitance of all other grids
is neglected: in $\mathcal{C}_i^g$, there is only the decap $C_i$ in the PDN. In contrast, the full amount
of decoupling capacitance is considered in the full-decap basic configurations: in $\mathcal{C}_i^f$,
the total amount of decoupling capacitance is $\sum_{n=1}^{N} C_n$.
As we know, the value of each switchable current source $I_i(t)$ changes with the
configuration, since the amount of decoupling capacitance, which depends
on the configuration, directly impacts the dynamic currents and voltages. Let us look
at the simple RLC circuit in Figure 29, which models the power delivery network in the
simplest way, and assume that only $G_i$ is active and all other local grids are in sleep.
$L_g$, $R_p$, $R_g$, $R_i$, $C_i$ and $I_{si}$ represent the off-chip inductance, the resistance of the
global VDD grid, the resistance of the global GND grid, the resistance of the local
grid $G_i$, the decap of local grid $G_i$ and the current loading of $G_i$ respectively. By
solving this circuit in AC, the current flowing through $G_i$ (from X to Y) is
\[ I_i(s) = I_{si}(s) \left( 1 - \frac{2sL_g + R_p + R_g}{2sL_g + R_p + R_g + R_i + \frac{1}{sC_i}} \right). \tag{2.26} \]
We can obtain
\[ I_i(s) \approx I_{si}(s), \quad \text{at low frequency,} \tag{2.27} \]
\[ I_i(s) \approx 0, \quad \text{at high frequency.} \tag{2.28} \]
Having some local grids in idle is equivalent to adding some resistor-capacitor branches
between X and Y. Therefore, for a general power gating configuration $\mathcal{C}_k$,
\[ I_i(s) \approx I_{si}(s), \quad \text{at low frequency,} \tag{2.29} \]
\[ I_i(s) \approx I_{si}(s) \left( 1 - \frac{C_i}{\sum_{n=1}^{N} b_{nk} C_n} \right), \quad \text{at high frequency.} \tag{2.30} \]
[Figure 29: circuit with source $V_{in}$, series elements $L_g$, $R_p$ on the supply side and
$L_g$, $R_g$ on the return side, and the local grid elements $R_i$, $C_i$ and $I_{si}$ between
nodes X and Y.]
Fig. 29. A simple RLC model for the power delivery network.
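The limiting behaviour stated in (2.27) and (2.28) can be checked numerically by evaluating the bracketed factor of (2.26) at a very low and a very high frequency. The sketch below is only a sanity check; the function name `current_ratio` and all component values are hypothetical and are not taken from the experiments.

```python
import math


def current_ratio(freq_hz, l_g, r_p, r_g, r_i, c_i):
    """Return I_i(s) / I_si(s) from (2.26) evaluated at s = j*2*pi*f."""
    s = 2j * math.pi * freq_hz
    z_path = 2 * s * l_g + r_p + r_g          # off-chip inductance plus global grid resistance
    z_local = r_i + 1 / (s * c_i)             # local grid resistance in series with its decap
    return 1 - z_path / (z_path + z_local)


# Hypothetical element values: 1 nH, 10 mOhm resistances, 100 nF decap.
params = dict(l_g=1e-9, r_p=0.01, r_g=0.01, r_i=0.01, c_i=1e-7)
low = current_ratio(1.0, **params)       # close to 1: the load current is drawn through the grid
high = current_ratio(1e12, **params)     # close to 0: the decap supplies the load current
```

At low frequency the decap impedance dominates and the ratio approaches one; at high frequency the inductive path dominates and the ratio approaches zero.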
This means that $I_i(t)$ can be approximated as a linear function of the loading
current of $G_i$, $I_{si}(t)$. The coefficient for the high-frequency component is determined
by the total capacitance in the system. Therefore, we approximate $I_i(t)$ using the
currents flowing in or out of the global grids under its global basic configuration $\mathcal{C}_i^g$
($I_i^g(t)$) and full-decap basic configuration $\mathcal{C}_i^f$ ($I_i^f(t)$). For the power delivery network
in Figure 30, current sources $I_1^g(t)$ and $I_1^f(t)$ are
\[ I_1^g(t) = \{ I_{T1}^g(t), I_{T2}^g(t), I_s(t), I_c^g(t) \}, \tag{2.31} \]
\[ I_1^f(t) = \{ I_{T1}^f(t), I_{T2}^f(t), I_s(t), I_c^f(t) \}, \tag{2.32} \]
where $I_{T1}^g(t)$, $I_{T2}^g(t)$ are the currents through the sleep transistors and $I_c^g(t)$ is the
current through the decoupling capacitor, all under the global basic configuration
$\mathcal{C}_1^g = \{1, 0, 0\}$; $I_{T1}^f(t)$, $I_{T2}^f(t)$ are the currents through the sleep transistors and $I_c^f(t)$ is
the current through the decoupling capacitor, all under the full-decap basic configuration
$\mathcal{C}_1^f = \{1, 3, 3\}$; and $I_s(t)$ is the current load. $I_{T1}^g(t)$, $I_{T2}^g(t)$ and $I_c^g(t)$ can be obtained
by simulating the basic configuration $\mathcal{C}_1^g$ as shown in Figure 30(a), and $I_{T1}^f(t)$, $I_{T2}^f(t)$
and $I_c^f(t)$ can be obtained by simulating the basic configuration $\mathcal{C}_1^f$, as shown in
Figure 30(b).
[Figure 30: (a) global VDD and GND grids with only local grid 1 attached, annotated with
$I_{T1}^g(t)$, $I_{T2}^g(t)$, $I_s(t)$ and $I_c^g(t)$; (b) global VDD and GND grids with local grids
1 to 3 attached, annotated with $I_{T1}^f(t)$, $I_{T2}^f(t)$, $I_s(t)$ and $I_c^f(t)$.]
Fig. 30. Switchable current source values for the global basic configuration (a) and the
full-decap basic configuration (b). For simplicity, the package is not shown
and there are only three local grids.
[Figure 31: (a) local grids 1 and 2 between the global VDD and GND grids, annotated with
the sleep transistor, load and decap currents of configuration 3; (b) the corresponding
switchable current source model.]
Fig. 31. A simple example for switchable current source approximation. The package
is not shown and there are only two local grids.
For a simple case of configuration $\mathcal{C}_3 = \{1, 1, 0\}$ shown in Figure 31(a), the value
for the switchable current source $I_1^3(t)$ (shown in Figure 31(b); the upper index 3 is
the power gating configuration index) is approximated as a linear function of $I_1^g(t)$
and $I_1^f(t)$ in the form of
\[ I_1^3(t) \approx I_1^{3,apx}(t) = \alpha \left( I_1^f(t) - I_1^g(t) \right) + I_1^g(t), \tag{2.33} \]
\[ \alpha = \frac{(C_1 + C_2 + C_3)\, C_2}{(C_2 + C_3) \cdot (C_1 + C_2)}. \tag{2.34} \]
(2.34) is obtained based on (2.29-2.30).
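One way to read (2.34), sketched here as an interpretation rather than the author's stated derivation, is as the ratio of two high-frequency coefficients from (2.30): the coefficient under the configuration of interest divided by the coefficient under the full-decap basic configuration, with idle grids counted as contributing their full decap. The helper below (hypothetical name `alpha`, 0-based grid index, hypothetical capacitance values) computes that ratio for any stable configuration and reproduces (2.34) for three grids.

```python
def alpha(caps, config, i):
    """Interpolation coefficient for grid i under a stable configuration.

    caps   : decoupling capacitances C_1..C_N of the local grids
    config : stable configuration, 1 for active and 0 for sleep
    i      : 0-based index of the grid whose current source is approximated
    """
    total = sum(caps)                                   # all decaps (full-decap case)
    total_wo_i = total - caps[i]
    on = sum(b * c for b, c in zip(config, caps))       # decaps of the active grids
    on_wo_i = on - config[i] * caps[i]
    if total_wo_i == 0 or on == 0:
        return 0.0                                      # degenerate case: nothing to interpolate
    return (total * on_wo_i) / (total_wo_i * on)


caps = [1.0, 2.0, 3.0]                   # hypothetical C1, C2, C3
a1 = alpha(caps, (1, 1, 0), 0)           # grid 1 under configuration {1, 1, 0}
a2 = alpha(caps, (1, 1, 0), 1)           # grid 2 under the same configuration
```

Two limiting cases are worth noting: when only grid i is on, the coefficient is 0 and the approximation reduces to the global basic result; when all grids are on, it is 1 and the approximation reduces to the full-decap result.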
The immediate benefit of this approximation is that, with each local grid modeled
as a set of switchable currents whose values can be approximated as in (2.33),
the circuit responses on the global grids in any stable configuration $\mathcal{C}_i$ can be efficiently
estimated using the principle of superposition [45]. As shown in Figure 32, for a wire
$w_m$ on the global grid under the stable configuration $\mathcal{C}_3 = \{1, 1, 0\}$, its current $I_m^3(t)$
can be approximated as
\[ I_m^3(t) = I_{m1}^3(t) + I_{m2}^3(t) \approx I_{m1}^{3,apx}(t) + I_{m2}^{3,apx}(t), \tag{2.35} \]
where $I_{m1}^{3,apx}(t)$ and $I_{m2}^{3,apx}(t)$ are the contributions from the two current sources $I_1^{3,apx}(t)$
and $I_2^{3,apx}(t)$. Assume that with current sources $I_1^g(t)$ and $I_1^f(t)$, the currents on $w_m$ are
$I_{m1}^g(t)$ and $I_{m1}^f(t)$ respectively; then
\[ I_{m1}^{3,apx}(t) = \alpha \left( I_{m1}^f(t) - I_{m1}^g(t) \right) + I_{m1}^g(t), \tag{2.36} \]
where $\alpha$ is the same as in (2.34). $I_{m1}^g(t)$ and $I_{m1}^f(t)$ are obtained by simulations of the
global basic configuration $\mathcal{C}_1^g = \{1, 0, 0\}$ and the full-decap basic configuration
$\mathcal{C}_1^f = \{1, 3, 3\}$ respectively. The contribution $I_{m2}^{3,apx}(t)$ from current source $I_2^{3,apx}(t)$ can be
obtained in a similar way but with a different value for $\alpha$
($\alpha = (C_1 + C_2 + C_3)\, C_1 / \left( (C_1 + C_3) \cdot (C_1 + C_2) \right)$ in this case).
[Figure 32: the PDN with local grids 1 and 2 replaced by the current sources $I_1^3(t)$
and $I_2^3(t)$ is decomposed by superposition into two circuits, each driven by one
source, so that $I_m(t) = I_{m1} + I_{m2}$.]
Fig. 32. Superposition for global grid verification.
In a more general PDN, assume there are N gated local grids (with indices from
1 to N). Under the configuration $\mathcal{C}_k = \{b_{1k}, \ldots, b_{Nk}\}$, for wire $w_m$ in the global
grid, the current contribution from local grid $G_i$ can be represented as a time-varying
variable $I_{mi}^{k,apx}(t)$. Applying superposition, the average current (from time $t_1$ to $t_2$)
flowing through the wire $w_m$ can be approximated as
\[ |EM_{mk}| \approx |EM_{mk,apx}| = \left| \sum_{i=1}^{N} b_{ik} \frac{\int_{t_1}^{t_2} I_{mi}^{k,apx}(t)\,dt}{t_2 - t_1} \right|, \tag{2.37} \]
\[ I_{mi}^{k,apx}(t) = \alpha_{ki} \left( I_{mi}^f(t) - I_{mi}^g(t) \right) + I_{mi}^g(t), \tag{2.38} \]
\[ \alpha_{ki} = \frac{\sum_{n=1}^{N} C_n \cdot \sum_{n=1,\, n \neq i}^{N} b_{nk} C_n}{\sum_{n=1,\, n \neq i}^{N} C_n \cdot \sum_{n=1}^{N} b_{nk} C_n}, \tag{2.39} \]
where $I_{mi}^g(t)$ and $I_{mi}^f(t)$ can be obtained by simulating the global basic configuration
$\mathcal{C}_i^g$ and the full-decap basic configuration $\mathcal{C}_i^f$ respectively.
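A compact sketch of (2.37)-(2.39) follows. Because the coefficient $\alpha_{ki}$ does not depend on time, the time average of $I_{mi}^{k,apx}(t)$ is the same linear blend of the time averages of $I_{mi}^f(t)$ and $I_{mi}^g(t)$, so the sketch works directly on per-grid average currents. The function names and the numerical values are hypothetical, and the inputs are assumed to come from the 2N basic-configuration simulations.

```python
def alpha(caps, config, i):
    """Coefficient alpha_ki of (2.39) for a 0/1 stable configuration."""
    total = sum(caps)
    total_wo_i = total - caps[i]
    on = sum(b * c for b, c in zip(config, caps))
    on_wo_i = on - config[i] * caps[i]
    if total_wo_i == 0 or on == 0:
        return 0.0
    return (total * on_wo_i) / (total_wo_i * on)


def approx_em(config, caps, avg_g, avg_f):
    """Approximate |EM| of one wire under `config`, following (2.37)-(2.38).

    avg_g[i] : average wire current from the global basic configuration of grid i
    avg_f[i] : average wire current from the full-decap basic configuration of grid i
    """
    total = 0.0
    for i, b in enumerate(config):
        if b:  # only active grids contribute
            a = alpha(caps, config, i)
            total += a * (avg_f[i] - avg_g[i]) + avg_g[i]
    return abs(total)


caps = [1.0, 2.0, 3.0]                 # hypothetical decaps
avg_g = [1.0, 2.0, 0.5]                # hypothetical averages, global basic configurations
avg_f = [0.8, 1.5, 0.4]                # hypothetical averages, full-decap basic configurations
em_110 = approx_em((1, 1, 0), caps, avg_g, avg_f)
```

Evaluating `approx_em` for one configuration costs O(N) arithmetic operations and no circuit simulation, which is what makes scanning every configuration affordable for moderate N.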
c. Worst Case Validation
As stated above, errors are introduced by approximating the local grids using
independent current sources. The configuration that has the largest approximate value
$|EM_{m,apx}|_{max}$ may not be the worst-case configuration. Therefore, a validation scheme,
as shown in Figure 33, is needed. Instead of only selecting the approximate worst-case
configuration, a set of configurations, $\mathcal{C}_{t1}, \ldots, \mathcal{C}_{tP}$, that correspond to the P largest
approximate average currents, $|EM_{mt1,apx}|, \ldots, |EM_{mtP,apx}|$, are selected as the top-P
worst-case EM configuration candidates. The top-P cases can be found by going through
all $2^N$ possible configurations (ranking them according to their approximate average
currents) and picking the P configurations that have the P largest approximate average
currents. According to the experimental results, the runtime for identifying the top-P
worst cases is insignificant compared to the lengthy die-package transient simulations,
and thus does not affect the overall complexity of our approach. Next, full simulations
are applied to all these P candidate configurations. Finally, the real $|EM_m|_{max}$ is
obtained by choosing the largest validated $|EM_{mti}|$ ($i = 1 \ldots P$) ($|EM_{mv1}|$ in Figure
33), and its corresponding stable configuration can also be found. The number P can
be increased to cover more possible configurations to assure that the actual $|EM_m|_{max}$
can be identified in the final validation simulations. Usually P is on the order of O(N)
and much smaller than $2^N$. In contrast to exhaustive enumeration, the presented
approach reduces the complexity from $O(2^N)$ to O(N), leading to significant efficiency
improvement. The entire algorithm for stable-mode EM verification is summarized
in Algorithm 2.
[Figure 33: flowchart: superposition approximation for all possible $\mathcal{C}_k$; rank the
approximate $|EM_{mk,apx}|$ values; full simulation validation for the P configurations
$\mathcal{C}_{t1}, \ldots, \mathcal{C}_{tP}$ with the top-P largest $|EM_{mti,apx}|$; rank the P validated
$|EM_m|$ values ($|EM_{mv1}|, \ldots, |EM_{mvP}|$ for $\mathcal{C}_{v1}, \ldots, \mathcal{C}_{vP}$); return the largest
value as $|EM_m|_{max}$.]
Fig. 33. Flow for worst case validation.
Algorithm 2 Stable-mode EM verification for wire $w_m$ in global grids
Input: Global basic stable configurations $\mathcal{C}_1^g \cdots \mathcal{C}_N^g$, full-decap basic stable
configurations $\mathcal{C}_1^f \cdots \mathcal{C}_N^f$ and the number of worst-case candidates P.
Output: $|EM_m|_{max}$ and its corresponding stable configuration $\mathcal{C}_{max}$.
1: for $i \leftarrow 1$ to $N$ do
2:    Full simulation for the global basic stable configuration $\mathcal{C}_i^g$.
3:    Full simulation for the full-decap basic stable configuration $\mathcal{C}_i^f$.
4: end for
5: for $k \leftarrow 1$ to $2^N$ do
6:    Obtain $|EM_{mk,apx}|$ by (2.37).
7: end for
8: Rank the $|EM_{mk,apx}|$ values and obtain P worst-case candidates $\mathcal{C}_{t1}, \ldots, \mathcal{C}_{tP}$.
9: for $i \leftarrow 1$ to $P$ do
10:   Full simulation for worst-case candidate $\mathcal{C}_{ti}$.
11:   $|EM_{mti}| = \left| \frac{\int_{t_1}^{t_2} I_m^{ti}(t)\,dt}{t_2 - t_1} \right|$.
12: end for
13: $|EM_m|_{max} \leftarrow \max_{i=1}^{P} |EM_{mti}|$.
14: return $|EM_m|_{max}$ and its configuration as $\mathcal{C}_{max}$.
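The ranking and validation steps of Algorithm 2 (lines 5 to 14) can be sketched as follows. The two callbacks are assumptions made for illustration: `approx_metric(config)` stands for the superposition approximation built from the basic-configuration simulations of lines 1 to 4, and `full_simulation(config)` stands for a full transient simulation returning the exact metric. Neither corresponds to an actual simulator interface, and the toy callbacks below are invented solely to exercise the flow.

```python
from itertools import product


def stable_mode_verification(n, p, approx_metric, full_simulation):
    """Rank all 2^n stable configurations by the approximate metric, then
    validate the top-p candidates with full simulation.

    Returns (worst validated metric, corresponding configuration).
    """
    configs = list(product((0, 1), repeat=n))
    ranked = sorted(configs, key=approx_metric, reverse=True)
    candidates = ranked[:p]                                  # top-p worst-case candidates
    validated = [(full_simulation(cfg), cfg) for cfg in candidates]
    return max(validated, key=lambda item: item[0])


# Toy stand-ins: the approximation ignores an interaction term that the
# "exact" model includes, so the two rankings differ.
weights = (1.0, 2.0, 0.5)


def toy_approx(cfg):
    return sum(w * b for w, b in zip(weights, cfg))


def toy_exact(cfg):
    return toy_approx(cfg) * (1 - 0.2 * sum(cfg))


worst_p1 = stable_mode_verification(3, 1, toy_approx, toy_exact)
worst_p3 = stable_mode_verification(3, 3, toy_approx, toy_exact)
```

In this toy setting the top-ranked approximate configuration is not the true worst case, which is found only once P is raised to 3; this mirrors the reason for validating several candidates rather than one.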
We have discussed how to identify the worst-case EM condition for a single
wire. When it is required to identify the worst-case EM condition among all the
wires in the global grid, for each conﬁguration, the maximum approximate average
current among all the wires in the global grid can be obtained from fast superposition
approximation. Then, all the conﬁgurations are ranked according to their maximum
approximate average current values, and the top-P worst case candidates can be
found for validation. The extra handling required is just to ﬁnd the maximum EM
metric among all wires for each conﬁguration, which does not require any additional
full circuit simulations. This applies to other types of veriﬁcation presented in the
following subsections.
4. Other Veriﬁcations
a. Stable-Mode EM Veriﬁcation for Local Grids
A modified approach is taken to perform the EM verification of a wire on a local
grid. In the context of power gating, this also implies that this local grid is
always powered on. Similar to the stable-mode EM verification for global grids, in
order to apply the superposition approach, the circuit and the inputs should be identified.
Different from the setup in Figure 32, since the target of verification is a particular
local grid, this local grid is always included in the 'global' PDN circuit as shown in
Figure 34, where the wire under verification is assumed to be in local grid 1. Therefore,
the circuit includes the global grids, the off-chip circuit, the targeted local grid and
its decoupling capacitors, and resistor models for the turned-on sleep transistors. As
stated in the previous verification, all other local grids are modeled as switchable
current sources to the circuit. These current source models as well as the intrinsic
current loads at the targeted local grid are treated as inputs. In Figure 34, when
analyzing the impact of I2(t) on local grid 1, grid G1 should be considered to be in the
idle state. The contribution from its own current loads Is1(t) is computed separately by
keeping only grid G1 active.
Similar to the global basic stable configuration, for a PDN with N independent
gated local grids, assume the targeted wire is in local grid $G_j$; a local basic stable
configuration $\mathcal{C}_i^{lj}$ can be defined as
\[ \mathcal{C}_i^{lj} = \{b_{1i}, b_{2i}, \ldots, b_{Ni}\}, \quad b_{ji} = 3, \; b_{ii} = 1, \; \text{others} = 0, \tag{2.40} \]
where grid $G_j$ is idle, grid $G_i$ is active and the others are in sleep.
[Figure 34: the PDN containing local grid 1 (with load $I_{s1}(t)$) and the current source
$I_2(t)$ is decomposed by superposition into one circuit driven by $I_2(t)$ and one driven
by $I_{s1}(t)$, so that $I_m(t) = I_{m2}(t) + I_{m1}(t)$.]
Fig. 34. Superposition for local grid verification.
As shown in Figure 34, for a wire $w_m$ on local grid 1 under the stable configuration
{1, 1, 0}, its current $I_m^3(t)$ can be approximated by using the superposition approach,
\[ I_m^3(t) = I_{m1}^3(t) + I_{m2}^3(t) \approx I_{m1}^{3,apx}(t) + I_{m2}^{3,apx}(t), \tag{2.41} \]
where $I_{m1}^{3,apx}(t)$ is obtained from the global basic configuration $\mathcal{C}_1^g = \{1, 0, 0\}$ and
the full-decap basic configuration $\mathcal{C}_1^f = \{1, 3, 3\}$; $I_{m2}^{3,apx}(t)$ is obtained by a linear
combination of the contributions from the local basic configuration $\mathcal{C}_2^{l1} = \{3, 1, 0\}$
and the full-decap basic configuration $\mathcal{C}_2^f = \{3, 1, 3\}$.
More generally, assuming the targeted wire $w_m$ is on local grid $G_j$, the average
current (from $t_1$ to $t_2$) under the configuration $\mathcal{C}_k = \{b_{1k}, \ldots, b_{Nk}; b_{jk} = 1\}$ is
approximated as follows,
\[ |EM_{mk,apx}| = \left| \frac{\sum_{i=1,\, i \neq j}^{N} b_{ik} \int_{t_1}^{t_2} I_{mi}^{k,apx}(t)\,dt + \int_{t_1}^{t_2} I_{mj}^{k,apx}(t)\,dt}{t_2 - t_1} \right|, \tag{2.42} \]
\[ I_{mi}^{k,apx}(t) = \alpha_{ki} \left( I_{mi}^f(t) - I_{mi}^{lj}(t) \right) + I_{mi}^{lj}(t), \tag{2.43} \]
\[ \alpha_{ki} = \frac{(C_i + C_j) \cdot \sum_{n=1}^{N} C_n \cdot \sum_{n=1,\, n \neq i}^{N} b_{nk} C_n}{C_i \cdot \sum_{n=1,\, n \neq i,j}^{N} C_n \cdot \sum_{n=1}^{N} b_{nk} C_n}, \tag{2.44} \]
\[ I_{mj}^{k,apx}(t) = \alpha_{kj} \left( I_{mj}^f(t) - I_{mj}^g(t) \right) + I_{mj}^g(t), \tag{2.45} \]
\[ \alpha_{kj} = \frac{\sum_{n=1}^{N} C_n \cdot \sum_{n=1,\, n \neq j}^{N} b_{nk} C_n}{\sum_{n=1,\, n \neq j}^{N} C_n \cdot \sum_{n=1}^{N} b_{nk} C_n}, \tag{2.46} \]
where $i = 1 \ldots N$ and $i \neq j$; $I_{mi}^{lj}(t)$ is obtained by simulating the local basic
configuration $\mathcal{C}_i^{lj}$ and $I_{mi}^f(t)$ is obtained by simulating the full-decap basic configuration $\mathcal{C}_i^f$;
whereas $I_{mj}^g(t)$ is obtained by simulating the global basic configuration $\mathcal{C}_j^g$ and $I_{mj}^f(t)$
is obtained by simulating the full-decap basic configuration $\mathcal{C}_j^f$.
The procedures of identifying top-P worst-case candidates and the following
full simulation validation illustrated in Algorithm 2 can be applied here to ﬁnd the
maximum average current |EMm|max for the wire wm in a local grid.
b. Stable-Mode Peak Dynamic Voltage Drop Veriﬁcation
Here, the task is to ﬁnd the largest peak dynamic voltage drop for a circuit block in
the on-chip PDN under all possible stable power gating conﬁgurations. It should be
noted that the current loadings are assumed to be given. Therefore, the uncertainty
of the current proﬁle is not considered under this scope.
For a simple PDN circuit shown in Figure 35, a circuit block $T_n$ is connected to
local grid 1 at node a and to the global GND at node b. Assume the voltages at a
and b are $V_a(t)$ and $V_b(t)$ respectively. The dynamic voltage drops of nodes a and b
can be expressed as
\[ DVD_a(t) = V_{DD} - V_a(t), \quad \text{for VDD grid nodes,} \tag{2.47} \]
\[ DVD_b(t) = -V_b(t), \quad \text{for GND grid nodes.} \tag{2.48} \]
Therefore, the dynamic voltage drop for circuit $T_n$ is
\[ DVD_n(t) = V_{DD} - \left( V_a(t) - V_b(t) \right) = DVD_a(t) - DVD_b(t). \tag{2.49} \]
[Figure 35: circuit block $T_n$ connected between node a on local grid 1 and node b on the
global GND grid; the global VDD grid and local grid 2 are also shown.]
Fig. 35. Dynamic voltage drop for circuit block Tn. For simplicity, the package is not
shown.
According to the PDN model in Figure 21, all the circuit blocks reside between
local grids and the global GND grid. Therefore, the voltage of the nodes in local grids
must be considered in the DVD veriﬁcation, which implies that the targeted local
grids should be always on. Since in this case the circuit and inputs categorization
is the same as in stable-mode EM veriﬁcation for the local grids, the current source
modeling, approximation method, worst-case identiﬁcation scheme and validation
procedure presented in Section II.C.4.a can be applied here. It should be noted that
in order to use the superposition theorem, voltage drop (deﬁned in (2.47) and (2.48))
is used here instead of the actual node voltage.
In a general PDN with N independent gated local grids, assume that the peak DVD
(from $t_1$ to $t_2$) of a circuit block $T_n$, which is connected to node a on local grid
$G_j$ and node b in the global GND grid, is examined. Under the configuration
$\mathcal{C}_k = \{b_{1k}, \ldots, b_{Nk}; b_{jk} = 1\}$, the voltage drops at nodes a and b can be approximated
by
\[ DVD_{a,apx}^k(t) = \sum_{i=1,\, i \neq j}^{N} b_{ik}\, DVD_{ai,apx}^k(t) + DVD_{aj,apx}^k(t), \tag{2.50} \]
\[ DVD_{b,apx}^k(t) = \sum_{i=1,\, i \neq j}^{N} b_{ik}\, DVD_{bi,apx}^k(t) + DVD_{bj,apx}^k(t), \tag{2.51} \]
where $DVD_{aj}(t)$ and $DVD_{bj}(t)$ are defined in (2.47) and (2.48). Similar to the
transient current on a wire in the local grids, we have
\[ DVD_{ai,apx}^k(t) = \alpha_{ki} \left( DVD_{ai}^f(t) - DVD_{ai}^{lj}(t) \right) + DVD_{ai}^{lj}(t), \tag{2.52} \]
\[ \alpha_{ki} = \frac{(C_i + C_j) \cdot \sum_{n=1}^{N} C_n \cdot \sum_{n=1,\, n \neq i}^{N} b_{nk} C_n}{C_i \cdot \sum_{n=1,\, n \neq i,j}^{N} C_n \cdot \sum_{n=1}^{N} b_{nk} C_n}, \tag{2.53} \]
\[ DVD_{aj,apx}^k(t) = \alpha_{kj} \left( DVD_{aj}^f(t) - DVD_{aj}^g(t) \right) + DVD_{aj}^g(t), \tag{2.54} \]
\[ \alpha_{kj} = \frac{\sum_{n=1}^{N} C_n \cdot \sum_{n=1,\, n \neq j}^{N} b_{nk} C_n}{\sum_{n=1,\, n \neq j}^{N} C_n \cdot \sum_{n=1}^{N} b_{nk} C_n}, \tag{2.55} \]
where $i = 1 \ldots N$, $i \neq j$; $DVD_{ai}^{lj}(t)$ is obtained by simulating the local basic
configuration $\mathcal{C}_i^{lj}$ and $DVD_{ai}^f(t)$ is obtained by simulating the full-decap basic
configuration $\mathcal{C}_i^f$; whereas $DVD_{aj}^g(t)$ is obtained by simulating the global basic configuration
$\mathcal{C}_j^g$ and $DVD_{aj}^f(t)$ is obtained by simulating the full-decap basic configuration $\mathcal{C}_j^f$.
$DVD_{bi,apx}^k(t)$ ($i = 1 \ldots N$; $i \neq j$) and $DVD_{bj,apx}^k(t)$ can be obtained in the same way.
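Therefore, the peak dynamic voltage drop for Tn under configuration Ck is
approximated as
$$\mathrm{DVD}^{k}_{Pn,\mathrm{apx}} = \max_{t_1 \le t \le t_2} \left\{ \mathrm{DVD}^{k}_{a,\mathrm{apx}}(t) - \mathrm{DVD}^{k}_{b,\mathrm{apx}}(t) \right\}. \tag{2.56}$$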
As in the stable-mode EM verification, the top-P worst-case configuration candidates
for the peak DVD verification can be identified from the approximate values. Full
simulations are then carried out for these candidates to find the maximum DVDPn.
Unlike the average current (a time-average effect), dynamic voltage drop (a transient
phenomenon) is sensitive to the total decoupling capacitance in the configuration, so
larger errors (relative to the exact peak DVD) are expected from the linear
approximation in (2.50)-(2.51). However, the main purpose of the superposition
approximation is not to obtain the exact peak DVD values for all the configurations,
but to explore the relative rankings among different configurations. The approximate
peak DVD values ($\mathrm{DVD}^{k}_{Pn,\mathrm{apx}}$) are only used to rank their corresponding power gating
configurations (the Ck's) and to select the top-P worst-case candidates. As long as the
true worst-case (or near worst-case for small P) configuration is included in the top
P of this ranking, or the ranking trend is preserved, the proposed approach is
effective, which is validated by our experimental results. Since the entire dynamic
waveform has to be stored for each node under each basic configuration, only a small
number of nodes are chosen to verify the entire grid. These nodes have large dynamic
voltage drops in the basic configurations and are expected to have large DVD in other
configurations as well.
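The ranking step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the layout of the stored waveforms and the toy numbers are all hypothetical; the weights follow (2.53) and (2.55) as printed, and the sketch needs N ≥ 3 so that the denominator of (2.53) is nonzero.

```python
from itertools import product

def alpha_j(C, b, j):
    """Weight of (2.55) for the target grid j (C: decaps, b: on/off bits)."""
    total = sum(C)
    on_total = sum(bn * cn for bn, cn in zip(b, C))
    on_without_j = on_total - b[j] * C[j]
    return total * on_without_j / ((total - C[j]) * on_total)

def alpha_i(C, b, i, j):
    """Weight of (2.53) for an active grid i other than the target j."""
    total = sum(C)
    on_total = sum(bn * cn for bn, cn in zip(b, C))
    on_without_i = on_total - b[i] * C[i]
    rest = total - C[i] - C[j]  # nonzero only if N >= 3
    return (C[i] + C[j]) * total * on_without_i / (C[i] * rest * on_total)

def node_drop(C, b, j, waves):
    """Superpose per-grid contributions at one node, as in (2.50)-(2.55).

    waves[i] = (full, base): the full-decap waveform and the local (i != j)
    or global (i == j) basic-configuration waveform caused by grid i.
    """
    steps = len(waves[j][0])
    total = [0.0] * steps
    for i, bit in enumerate(b):
        if not bit:
            continue  # grids that are off contribute nothing
        a = alpha_j(C, b, j) if i == j else alpha_i(C, b, i, j)
        full, base = waves[i]
        for t in range(steps):
            total[t] += a * (full[t] - base[t]) + base[t]
    return total

def approx_peak(C, b, j, waves_a, waves_b):
    """Approximate peak DVD of (2.56): max over time of drop(a) - drop(b)."""
    da = node_drop(C, b, j, waves_a)
    db = node_drop(C, b, j, waves_b)
    return max(x - y for x, y in zip(da, db))

def top_p_configs(C, j, waves_a, waves_b, P):
    """Rank all configurations with grid j on; return the P worst candidates."""
    configs = [b for b in product((0, 1), repeat=len(C)) if b[j] == 1]
    configs.sort(key=lambda b: approx_peak(C, b, j, waves_a, waves_b),
                 reverse=True)
    return configs[:P]

# Toy data (hypothetical): three grids, two time steps, target grid j = 0.
C = [1.0, 2.0, 3.0]
waves_a = [([0.10, 0.12], [0.08, 0.09]),
           ([0.02, 0.03], [0.01, 0.02]),
           ([0.03, 0.02], [0.02, 0.01])]
waves_b = [([0.01, 0.01], [0.01, 0.01])] * 3
candidates = top_p_configs(C, 0, waves_a, waves_b, P=2)
```

Only the configurations returned in `candidates` would then be passed to full simulation; the approximate peaks themselves are used for ranking only.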
c. Power-On Peak Dynamic Voltage Drop Veriﬁcation
When a local grid is in transition, the devices powered by the grid have not yet
started switching, but large rush currents may be drawn from the global grids to
charge the local grid's decoupling capacitors, which are discharged in the sleep state
due to leakage. Such current disturbances may propagate through the global grids and
cause drops on the power and ground lines of other local grids. The turn-on time, a
critical design variable, can be chosen to control the amount of generated power-on
noise. While a longer turn-on time is beneficial from the noise point of view, it
increases the timing overhead and prevents more effective use of power gating. When a
sleep transistor is turned on, it first works in the saturation region and then enters
the linear region, so its drain-source conductance varies over time (as shown in
Figure 36). Therefore, during the turn-on procedure, the sleep transistor can be
modeled simply as a time-varying resistor.

Fig. 36. Drain-source conductance of a PMOS sleep transistor during power-on time.
[Plot: conductance (E-3) versus time (ns), for 50 ns and 100 ns power-on.]
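The effect of the turn-on time on the rush current can be illustrated with a single-capacitor sketch of the time-varying resistor model. Everything below is an assumption made for illustration: the conductance is ramped linearly to a hypothetical 5 mS (the actual curve in Figure 36 is not linear), the local decap is lumped into one hypothetical 100 pF capacitor, and the rest of the PDN is ignored.

```python
def rush_current_peak(t_on, g_on=5e-3, c_decap=1e-10, vdd=1.0,
                      t_end=200e-9, steps=2000):
    """Peak rush current while a discharged decap charges through the switch.

    Backward Euler on  C dv/dt = g(t) (VDD - v),  with g(t) ramping linearly
    from 0 to g_on over t_on (the linear ramp is an illustrative assumption).
    """
    h = t_end / steps
    v = 0.0      # decap fully discharged in the sleep state
    peak = 0.0
    for n in range(1, steps + 1):
        t = n * h
        g = g_on * min(t / t_on, 1.0)          # time-varying conductance
        v = (c_decap * v + h * g * vdd) / (c_decap + h * g)
        peak = max(peak, g * (vdd - v))        # current through the switch
    return peak

peak_50 = rush_current_peak(50e-9)    # 50 ns power-on
peak_100 = rush_current_peak(100e-9)  # 100 ns power-on
```

In this toy model the slower ramp gives the smaller peak, which is the noise benefit of a longer turn-on time mentioned above; the price is the longer wake-up latency.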
For a given turn-on time, the task of power-on peak dynamic voltage drop verification
is to identify the worst-case peak voltage drop caused by the turn-on noise in
conjunction with the noise contributions from all possible stable workloads. For the
purpose of illustration, the case in which only one local grid is powered on at a time
is examined, as shown in Figure 37. The cases with multiple local grids powering on
can be handled in a similar way. The grid in transition can be modeled as rush current
sources that are part of the inputs to the circuit; thus the superposition technique
can be applied for peak DVD approximation.

Fig. 37. Verification of power-on peak dynamic voltage drop.
[Diagram labels: Global VDD Grid, Global GND Grid, Local Grid 1, Local Grid 2, charging,
rush current I2(t), Approximation.]
For a general PDN with N independent gated local grids, assume that the targeted
grid is Gj and that Gs is in transition. We introduce a basic transition configuration
$D^{tj}_{s}$, which is defined as
$$D^{tj}_{s} = \{b_{1s}, b_{2s}, \ldots, b_{Ns}\}, \quad b_{js} = 3,\; b_{ss} = 2,\; \text{others} = 0, \tag{2.57}$$
where grid Gs is in transition, Gj is idle, and the others are in sleep.
Suppose that the peak DVD (from t1 to t2) of a circuit block Tn connected to the
local grid Gj, as shown in Figure 35, is examined while local grid Gs is in transition.
Under the transition configuration Dk = {b1k, . . . , bNk; bjk = 1, bsk = 2}, the peak
voltage drops at nodes a and b can be approximated by
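$$\mathrm{DVD}^{k}_{Pa,\mathrm{apx}}(t) = \sum_{i=1,\, i \neq j,s}^{N} b_{ik}\, \mathrm{DVD}^{k}_{ai,\mathrm{apx}}(t) + \mathrm{DVD}^{k}_{aj,\mathrm{apx}}(t) + \mathrm{DVD}^{k}_{as,\mathrm{apx}}(t), \tag{2.58}$$
$$\mathrm{DVD}^{k}_{Pb,\mathrm{apx}}(t) = \sum_{i=1,\, i \neq j,s}^{N} b_{ik}\, \mathrm{DVD}^{k}_{bi,\mathrm{apx}}(t) + \mathrm{DVD}^{k}_{bj,\mathrm{apx}}(t) + \mathrm{DVD}^{k}_{bs,\mathrm{apx}}(t), \tag{2.59}$$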
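where we have
$$\mathrm{DVD}^{k}_{ai,\mathrm{apx}}(t) = \alpha^{k}_{i} \left( \mathrm{DVD}^{f}_{ai}(t) - \mathrm{DVD}^{lj}_{ai}(t) \right) + \mathrm{DVD}^{lj}_{ai}(t), \tag{2.60}$$
$$\alpha^{k}_{i} = \frac{(C_i + C_j) \cdot \sum_{n=1}^{N} C_n \cdot \left( \sum_{n=1,\, n \neq i,s}^{N} b_{nk} C_n + C_s \right)}{C_i \cdot \sum_{n=1,\, n \neq i,j}^{N} C_n \cdot \left( \sum_{n=1,\, n \neq s}^{N} b_{nk} C_n + C_s \right)}, \tag{2.61}$$
$$\mathrm{DVD}^{k}_{aj,\mathrm{apx}}(t) = \alpha^{k}_{j} \left( \mathrm{DVD}^{f}_{aj}(t) - \mathrm{DVD}^{g}_{aj}(t) \right) + \mathrm{DVD}^{g}_{aj}(t), \tag{2.62}$$
$$\alpha^{k}_{j} = \frac{\sum_{n=1}^{N} C_n \cdot \left( \sum_{n=1,\, n \neq j,s}^{N} b_{nk} C_n + C_s \right)}{\sum_{n=1,\, n \neq j}^{N} C_n \cdot \left( \sum_{n=1,\, n \neq s}^{N} b_{nk} C_n + C_s \right)}, \tag{2.63}$$
$$\mathrm{DVD}^{k}_{as,\mathrm{apx}}(t) = \mathrm{DVD}^{tj}_{as}(t), \tag{2.64}$$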
where $i = 1, \ldots, N$, $i \neq j, s$; $\mathrm{DVD}^{lj}_{ai}(t)$ is obtained by simulating the local basic
configuration $C^{lj}_{i}$ and $\mathrm{DVD}^{f}_{ai}(t)$ is obtained by simulating the full-decap basic
configuration $C^{f}_{i}$; $\mathrm{DVD}^{g}_{aj}(t)$ is obtained by simulating the global basic configuration
$C^{g}_{j}$ and $\mathrm{DVD}^{f}_{aj}(t)$ is obtained by simulating the full-decap basic configuration $C^{f}_{j}$;
$\mathrm{DVD}^{tj}_{as}(t)$ is obtained by simulating the basic transition configuration $D^{tj}_{s}$.
$\mathrm{DVD}^{k}_{bi,\mathrm{apx}}(t)$ ($i = 1, \ldots, N$; $i \neq j, s$), $\mathrm{DVD}^{k}_{bj,\mathrm{apx}}(t)$ and $\mathrm{DVD}^{k}_{bs,\mathrm{apx}}(t)$ can be obtained
in the same way.
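Therefore, the peak dynamic voltage drop for Tn under transition configuration Dk
is approximated as
$$\mathrm{DVD}^{k}_{Pn,\mathrm{apx}} = \max_{t_1 \le t \le t_2} \left\{ \mathrm{DVD}^{k}_{Pa,\mathrm{apx}}(t) - \mathrm{DVD}^{k}_{Pb,\mathrm{apx}}(t) \right\}. \tag{2.65}$$
Following the scheme shown in Algorithm 2, $\mathrm{DVD}_{Pn,\max}$ can then be found.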
5. Experimental Results
The PDN simulator GSim and the transient power gating verification flow have been
implemented in CUDA [44] and C++, respectively. The GPU program is executed on a
single GPU of the NVIDIA GeForce 9800 GX2 card (which includes two GPUs), with a
total on-board memory of 512 MB. All the C++ programs are executed on a workstation
with an Intel Xeon CPU at 2.33 GHz and 4 GB of RAM running a 64-bit Linux OS. The
on-chip power grids of the PDNs are generated according to the typical current
loadings and wire conductances of the IBM power grid benchmarks [39], while the
package-level model parameters, such as inductance and capacitance values, as well as
the total on-chip capacitance are adopted from [6].
Three power-gated multi-million-node PDNs with N gated local grids (N = 4, 8, 16)
are employed for the transient verification presented in this section. Each PDN has
millions of on-chip nodes and a few hundred chip-to-package pins. Each local grid has
several blocks with different current loadings to represent different function modules.
200 clock cycles (2000 time steps) are simulated to capture the time-averaging and
temporal effects (the waveforms are in steady state).
a. Stable-Mode Veriﬁcation
The stable-mode EM verifications have been carried out for three multi-million-node
PDNs. The runtime and the largest |EM| obtained by choosing P as 4, 8 and 12 are
shown in Table VII. The largest EM among all the wires in the global VDD grid is
examined. As expected, even for small P, the EM worst-case configuration can be
captured very effectively. Moreover, for the PDNs with a large number of gated grids,
the number of possible stable configurations is very large, which leads to excessive
runtime for brute-force enumeration (an estimated 107,000 hours for the largest case).
With the proposed verification scheme, only a small number of full simulations need
to be carried out, so the runtime is greatly reduced (between roughly 16 and 21 hours
for the largest grid, see Table VII).
The stable-mode peak DVD verification results shown in Table VIII demonstrate a
similar behavior to the EM verifications: the largest peak DVD is again captured well
by the proposed approach. 20 nodes (the ones that have large DVD in the basic
configurations) are chosen to represent all the nodes in a local grid, and the largest
DVD among them is examined. The runtime saving over brute-force enumeration for
PDNs with a large number of gated grids is huge, estimated as over 2000X for the
largest case when P = 12.

Table VII. EM stable mode verification for gated PDN. The global VDD grid is examined.
# Con.: number of configurations; T: total runtime; |EM|max: maximum absolute average
current; P: number of worst case candidates. Runtime is in hrs. EM is in mA. Runtimes
of the enumeration method for the larger circuits are estimates (∼ time value). The
transient verifications are run for 200 clock cycles (2000 time steps).

Gated   # Con.  # Nodes  Enumeration         P=4              P=8              P=12
Grids                    T         |EM|max   T      |EM|max   T      |EM|max   T      |EM|max
4       15      2.25M    6.69      4.75      4.72   4.75      6.41   4.75      9.20   4.75
8       255     4.25M    ∼214.8    NA        8.31   2.40      11.77  2.40      14.0   2.40
16      65535   8.25M    ∼1.07e5   NA        16.17  2.33      17.75  2.33      21.31  2.33

Table VIII. Peak DVD stable mode verification for gated PDN. For PDNs with 8 and 16
gated grids, two different local grids are examined. Gcor: corner grid; Gcen: center
grid; # Con.: number of configurations; T: total runtime; DVDp,max: maximum peak DVD;
P: number of worst case candidates. Runtime is in hrs. DVD is in mV. Runtimes of the
enumeration method for the larger circuits are estimates (∼ time value). The transient
verifications are run for at least 200 clock cycles (2000 time steps).

Gated   # Con.  # Nodes  Node   Enumeration          P=4               P=8               P=12
Grids                           T         DVDp,max   T      DVDp,max   T      DVDp,max   T      DVDp,max
4       8       2.25M    -      4.82      131.7      4.92   131.7      7.00   131.7      NA     NA
8       128     4.25M    Gcor   ∼141.9    NA         9.6    124.3      13.25  124.3      16.54  124.3
                         Gcen   ∼141.9    NA         9.47   128.4      13.12  128.4      16.57  128.4
16      32768   8.25M    Gcor   ∼7.06e4   NA         19.21  117.1      27.54  117.1      35.01  117.1
                         Gcen   ∼7.06e4   NA         19.23  121.9      26.45  121.9      34.22  121.9

Table IX. Peak DVD transition verification for gated PDN. The transition grid is at the
center. For PDNs with 8 and 16 gated grids, two nodes in two different grids are
examined. Gclo: the grid close to the transition grid; Gfar: the grid far away from the
transition grid; # Con.: number of configurations; T: total runtime; DVDp,max: maximum
peak DVD; P: number of worst case candidates. Runtime is in hrs. DVD is in mV. Runtimes
of the enumeration method for the larger circuits are estimates (∼ time value). The
transient verifications are run for at least 200 clock cycles (2000 time steps).

Gated   # Con.  # Nodes  Node   Enumeration          P=4               P=8               P=12
Grids                           T         DVDp,max   T      DVDp,max   T      DVDp,max   T      DVDp,max
4       4       2.25M    -      2.62      125.4      5.01   125.4      NA     NA         NA     NA
8       64      4.25M    Gclo   ∼79.28    NA         9.54   122.7      13.03  122.7      17.0   122.7
                         Gfar   ∼79.28    NA         9.69   122.3      13.10  122.3      17.32  122.3
16      16384   8.25M    Gclo   ∼3.94e4   NA         19.40  128.6      27.0   128.7      35.9   128.7
                         Gfar   ∼3.94e4   NA         19.04  117.8      26.48  117.9      34.57  117.9
b. Power-On Veriﬁcation
For power-on verification, without loss of generality, a gated grid at the center is
chosen to be in the transition state. A grid near the transition grid and a grid far
away from it are chosen to be examined. As in stable-mode peak DVD verification, for
each grid we use 20 nodes (the ones that have large DVD in the basic configurations)
to represent all the nodes. As can be seen from the results shown in Table IX, similar
to the stable-mode verification, the proposed method can effectively capture the
largest or near-largest peak DVD by simulating only a small number of configurations.
Note that for the PDN with 4 gated grids, the total number of configurations is only
4; therefore, the results for P = 8 and P = 12 are marked as NA.
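The "# Con." columns and the quoted speedup can be cross-checked with a few lines of arithmetic. The counting rules below are an interpretation inferred from the table values (they reproduce every entry) rather than formulas stated in the text, and the function names are hypothetical.

```python
def stable_em_configs(n):
    """Stable configurations with at least one of n gated grids on."""
    return 2 ** n - 1

def stable_dvd_configs(n):
    """Stable configurations with the targeted grid forced on."""
    return 2 ** (n - 1)

def transition_configs(n):
    """Configurations with the target on and one fixed grid in transition."""
    return 2 ** (n - 2)

sizes = (4, 8, 16)
em_counts = [stable_em_configs(n) for n in sizes]     # Table VII
dvd_counts = [stable_dvd_configs(n) for n in sizes]   # Table VIII
tr_counts = [transition_configs(n) for n in sizes]    # Table IX

# Largest stable-mode peak DVD case, P = 12 (Table VIII, Gcor row):
# estimated enumeration time over measured runtime, both in hours.
speedup = 7.06e4 / 35.01
```

With the Table VIII values the ratio comes out slightly above 2000, in line with the "over 2000X" figure quoted for the largest case.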
6. Summary
In this section, we propose a simulation-based transient verification approach.
Specific circuit modeling techniques have been developed to individually verify each
of the on-chip global and local power grids against given electromigration and voltage
drop constraints. The proposed approach uses fast superposition approximation methods
to identify the top worst-case conditions, which are then validated by a small number
of full simulations; this keeps the verification feasible.
CHAPTER III
POWER DELIVERY NETWORK DESIGN ∗
As stated in Chapter I, PDN design faces the challenges of saving metal wires for
signal routing and choosing the optimal parameters for the various components of the
network. To address these challenges, this chapter first proposes a novel
partitioning-based two-step power grid wire sizing approach that can utilize parallel
computing resources to improve efficiency. Then, a systematic analysis is conducted to
investigate the important electrical interactions between active regulators/converters
and passive networks in the context of the entire power delivery system. Based on the
insights obtained from the analysis, a system-level co-design scheme that can
automatically find the optimal parameters for important network components is
presented.
A. Locality-Driven Parallel Power Grid Wire Sizing
Power grid wire sizing can hardly be applied to large grids because of its
inefficiency. In this section, the wire sizing problem is reformulated as a two-step
optimization problem, and an efficient parallel optimization methodology that exploits
the locality of flip-chip type power grids is presented.
∗Part of the chapter is reprinted with permission from “Locality-driven parallel power
grid optimization” by Z. Zeng and P. Li, 2009. IEEE Trans. on Computer-Aided
Design of Integrated Circuits and Systems, Vol. 28, pp. 1190-1200, Copyright [2009]
by IEEE.
1. Background
a. Problem Formulation
In a power grid Ω consisting of nodes N = {n1, ..., nk} and branches R = {r1, ..., rl},
the voltage of node ni is Vi. Each branch ri, whose width and length are wi and li
respectively, connects two nodes ni1 and ni2. The branch current of ri is Ii, and the
voltages of its two end nodes are Vi1 and Vi2. There are m metal layers; Rk
(k = 1, · · · , m) contains all the branches on metal layer k, and ρk is the sheet
resistance of metal layer k. The branch conductance can be expressed as
gki = wki/(lkiρk) = Iki/(Vki1 − Vki2). Therefore, the area of the power grid, which is
the objective function to be minimized, can be expressed in terms of the branch
conductances [46]:
$$f(G) = \sum_{i \in R} w_i l_i = \sum_{k=1}^{m} \sum_{i \in R_k} \rho_k\, g_{ki}\, l_{ki}^{2}. \tag{3.1}$$
Since computing the node voltages V directly from the branch conductances G consumes
too much time and computing resources, the node voltages are introduced as additional
variables in the formulation to describe the electrical behavior of the circuit.
Similar to [27], the constraints are as follows.
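1. IR drop constraint:
$$V_i \ge V_{\min}. \tag{3.2}$$
2. Minimum width constraint:
$$w_{ki} = \rho_k\, g_{ki}\, l_{ki} \ge w_{k,\min}. \tag{3.3}$$
3. Current density constraint (electromigration): For a layer k, the electromigration
constraint is $|I_{ki}| \le \sigma_k w_{ki}$, where $\sigma_k$ is the electromigration constant.
Substituting $I_{ki} = g_{ki}(V_{ki1} - V_{ki2})$ and $w_{ki} = \rho_k g_{ki} l_{ki}$ gives
$$-\rho_k \sigma_k l_{ki} \le V_{ki1} - V_{ki2} \le \rho_k \sigma_k l_{ki}. \tag{3.4}$$
4. Kirchhoff's current law: Assume that the branches connected to node nj form the
set Rj, and that the current loading at nj is Isj. Then
$$\sum_{r_i \in R_j} (V_{i1} - V_{i2})\, g_i + I_{sj} = 0. \tag{3.5}$$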
The area optimization is to minimize (3.1) subject to constraints deﬁned in (3.2)-
(3.5). This formulation leads to a constrained nonlinear optimization problem.
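To make the formulation concrete, the objective (3.1) and the constraints (3.2)-(3.5) can be evaluated for a candidate sizing as in the sketch below. The data layout (a list of branch dictionaries), the function names and the one-branch toy grid are assumptions made for illustration; KCL is checked only at loaded nodes, with each branch term oriented as (voltage of the node minus voltage of its neighbor), which is one reading of the sign convention in (3.5).

```python
def wire_area(branches, rho):
    """Objective (3.1): sum of rho_k * g * l^2 over all branches."""
    return sum(rho[b["layer"]] * b["g"] * b["l"] ** 2 for b in branches)

def violations(branches, V, loads, rho, sigma, w_min, v_min, tol=1e-9):
    """Return a list of (constraint, node-or-branch) pairs that are violated."""
    found = []
    for node, v in V.items():                          # (3.2) IR drop
        if v < v_min - tol:
            found.append(("IR drop", node))
    for idx, b in enumerate(branches):
        k = b["layer"]
        if rho[k] * b["g"] * b["l"] < w_min[k] - tol:  # (3.3) minimum width
            found.append(("min width", idx))
        drop = abs(V[b["n1"]] - V[b["n2"]])
        if drop > rho[k] * sigma[k] * b["l"] + tol:    # (3.4) electromigration
            found.append(("EM", idx))
    for node, load in loads.items():                   # (3.5) KCL at loads
        out = 0.0
        for b in branches:
            if b["n1"] == node:
                out += (V[node] - V[b["n2"]]) * b["g"]
            elif b["n2"] == node:
                out += (V[node] - V[b["n1"]]) * b["g"]
        if abs(out + load) > tol:
            found.append(("KCL", node))
    return found

# Toy grid (hypothetical numbers): one pad at 1.8 V feeding one loaded node.
branches = [{"n1": "pad", "n2": "a", "layer": 0, "g": 1.0, "l": 100.0}]
V = {"pad": 1.8, "a": 1.75}
loads = {"a": 0.05}
rho, sigma, w_min = [0.02], [0.05], [1.0]

area = wire_area(branches, rho)
ok = violations(branches, V, loads, rho, sigma, w_min, v_min=1.7)
tight = violations(branches, V, loads, rho, sigma, w_min, v_min=1.76)
```

Here the single wire has width ρgl = 2 and length 100, so the area is 200; raising the voltage floor to 1.76 V makes node `a` violate (3.2).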
b. Constrained Nonlinear Optimization
A large body of general nonlinear optimization algorithms can be applied to solve our
power grid sizing problem. In recent years, interior point methods have been
particularly popular for large-scale nonlinear programming and have found application
in circuit optimization [47] [48]. In [48], a state-of-the-art interior point (or
barrier) algorithm is shown to be more robust and efficient than the augmented
Lagrangian active-set method.
By introducing slack variables, a general nonlinear optimization problem can be
formulated as
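$$\min_{x \in \mathbb{R}^n} f(x) \tag{3.6a}$$
$$\text{s.t.}\quad c(x) = 0 \tag{3.6b}$$
$$x \ge 0, \tag{3.6c}$$
where the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the equality constraint functions
$c : \mathbb{R}^n \to \mathbb{R}^m$, with $m < n$, are all assumed to be twice continuously differentiable.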
The optimum is found by solving a sequence of barrier problems (3.7a-b) with a set of
decreasing barrier parameters $\mu_l$ whose limit is zero ($\lim_{l \to \infty} \mu_l = 0$) [47]:
$$\min_{x \in \mathbb{R}^n} \varphi_{\mu_l}(x) = f(x) - \mu_l \sum_{i=1}^{n} \ln(x_i) \tag{3.7a}$$
$$\text{s.t.}\quad c(x) = 0. \tag{3.7b}$$
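The idea of following a sequence of barrier problems can be seen on a two-variable toy problem, sketched below. The example is purely illustrative and is not the algorithm of [47] or [48]: the problem (minimize x1 + 2·x2 subject to x1 + x2 = 1, x ≥ 0) is made up, the equality constraint is eliminated by substitution, and each barrier subproblem is solved by bisection on the derivative rather than by Newton steps.

```python
def barrier_minimizer(mu, iters=200):
    """Minimize phi_mu(x1) = (2 - x1) - mu*(ln x1 + ln(1 - x1)) on (0, 1).

    This is (3.7a) for f(x) = x1 + 2*x2 after substituting x2 = 1 - x1.
    phi_mu is convex, so its derivative has a single root found by bisection.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        slope = -1.0 - mu / mid + mu / (1.0 - mid)
        if slope > 0.0:
            hi = mid   # minimizer lies to the left
        else:
            lo = mid   # minimizer lies to the right
    return 0.5 * (lo + hi)

# Decreasing barrier parameters; the minimizers approach x1 = 1, x2 = 0.
path = []
for mu in (1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    x1 = barrier_minimizer(mu)
    path.append((mu, x1, x1 + 2.0 * (1.0 - x1)))  # (mu, x1, objective value)
```

As the barrier parameter shrinks, the stored objective values approach the constrained optimum of 1, which is attained on the boundary x2 = 0 that the barrier term keeps the iterates away from.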
In addition to constrained nonlinear optimization algorithms, power grid optimization
may also be achieved by solving a sequence of approximated linear problems, as
demonstrated in [27], or a set of unconstrained Lagrangian penalty problems [49]. It
is important to note that, in practice, any of these flat optimization methods is
difficult to apply to real-life power grid problems.
2. Overview of the Proposed Parallel Optimization
Today’s on-chip power distribution networks can reach a complexity of millions of
nodes or even greater. The design of such large networks via ﬂat optimization is
simply infeasible even if state-of-the-art optimization methods are employed. Hence,
a partitioning based strategy, in which a large network is divided into manageable
pieces that are optimized individually, is desirable to provide a scalable solution as
well as to utilize the increasing parallel computing resources to improve eﬃciency.
a. Key Issues in Parallelizable Power Grid Optimization
Nevertheless, it is nontrivial to devise a systematic divide-and-conquer methodology.
To see this, consider a naive approach as shown in Figure 38. A large power grid is
directly cut into smaller partitions. Each partition is then wire sized to minimize the
wiring area. Finally, the optimized partitions are merged to form a complete solution.
Although cutting the wires between partitions leads to multiple independent smaller
optimization problems, neglecting the coupling between partitions may lead to two key
difficulties for the merged solution:
• Even if each partition is sized optimally, merging several locally optimized
partitions does not necessarily yield an overall optimal solution, because the
sizing optimization is confined to each cut partition and lacks a global view.
• The merged solution is not even guaranteed to be feasible. Although the IR drop
and EM constraints can be strictly enforced while independently optimizing
each partition, the interconnections between the partitions can alter the voltage
and current distributions when the partitions are merged to form a complete
solution.
This may create constraint violations not only around the partitioning boundaries but
also throughout the entire grid, which is messy to handle in practice.

Fig. 38. A naive divide-and-conquer optimization approach (divide and optimize, then
merge).
Consider another divide-and-conquer approach illustrated in [29]. In this approach,
a partition of a large power grid is optimized and the rest of the grid is represented
by a compact macromodel [16]. This approach shows another potential difficulty: it
does not immediately permit simultaneous sizing of multiple partitions while
guaranteeing the feasibility and optimality of the complete solution.
It is clear that, in the context of this work, a desirable partition-based
optimization shall have three essential characteristics: it is parallelizable, and it
maintains the feasibility and optimality of the final solution.

Fig. 39. Power grid partitioning by setting boundary voltages and currents.
[Diagram labels: boundary voltages VB1, VB2, VB3 and boundary currents IB1, IB2, IB3.]
b. Two-Level Hierarchical Optimization
In the following, we present a two-level hierarchical optimization formulation which
possesses key characteristics relevant to parallel optimization of large power grids.
However, as will be discussed shortly, this formulation has limitations that prevent
its practical application. Nevertheless, its hierarchical perspective underpins the
proposed parallel locality-driven optimization approach that will be presented in the
next subsection.
Instead of creating independent partitions by cutting wires, a different approach is
taken, which is based upon setting voltages and currents along the partitioning
boundaries as shown in Figure 39. Each partition is wire sized by considering these
additional voltage and current constraints at the boundary. The usefulness of this
partitioning scheme is twofold:
• First, attaching ideal voltage sources along the partitioning boundary electrically
isolates each partition, making simultaneous independent optimization of
partitions possible.
• Second, constraining the boundary voltages and currents for each partition
helps the merging operation keep the circuit responses in the partitions
unchanged, thus leading to a feasible solution for the entire grid.
The above procedure optimizes the grid under a given boundary condition (i.e., the
$V_{B,i}$'s and $I_{B,i}$'s). As a result, the solution reached may not be globally optimal.
To address this problem, a two-level hierarchical optimization is introduced in which
the boundary conditions are the first-level optimization variables. It is assumed
that the power grid is divided into N partitions. The problem formulation for the
first level is
Level 1:
$$\min_{V_B,\, I_B \in \mathbb{R}^n} A(V_B, I_B) \tag{3.8a}$$
$$\text{s.t.}\quad A = \sum_{i=1}^{N} A_{i,\min} \tag{3.8b}$$
$$V_B \ge V_{\min} \tag{3.8c}$$
$$I_{kB} \le \sigma_k w_{kB}, \quad k = 1, \cdots, m, \tag{3.8d}$$
where the boundary conditions between the N partitions are defined by n boundary
voltages and currents, $V_B, I_B \in \mathbb{R}^n$; $A$ is the total wire area of the power grid, which
is an implicit function of $V_B$ and $I_B$; $A_{i,\min}$ is the minimal wire area for partition
i achievable under its boundary condition $V_{B,i}$ and $I_{B,i}$, which are subsets of $V_B$ and
$I_B$, respectively; the wire widths on the boundaries are denoted as $w_B$; and (3.8c) and
(3.8d) enforce the IR drop and EM constraints at the partitioning boundaries. Each
$A_{i,\min}$ is obtained by solving one of N second-level optimization problems
Level 2:
for each $i$, $i = 1, \cdots, N$
$$\min_{w_i \in \mathbb{R}^m} A_i(w_i) \tag{3.9a}$$
$$\text{s.t.}\quad (3.2),\ (3.3),\ (3.4),\ (3.5) \tag{3.9b}$$
$$V_{B,i} = V_{B,i}, \quad I_{B,i} = I_{B,i} \tag{3.9c}$$
where (3.9b) sets the standard constraints for the interior of partition i; $V_{B,i}$ and
$I_{B,i}$ are partition i's boundary conditions, which are set to their counterparts passed
from the first-level problem (the right-hand sides of (3.9c)).
As can be seen, this hierarchical optimization and the flat optimization share the
same set of constraints, and the two objective functions are effectively identical.
Therefore, any feasible solution of the hierarchical optimization is also a feasible
solution of the flat optimization, and vice versa. If both formulations are solved to
global optimality, they must reach the same optimal objective function value.
Essentially, the first-level problem seeks the optimal $V_B^{*}$ and $I_B^{*}$ that lead to the
overall area minimization while enforcing the IR drop and EM constraints at the
partitioning boundaries. To evaluate the total area achieved at given $V_B$ and $I_B$, a
set of N second-level optimization problems are solved and the resulting areas of all
the partitions are summed up. Mathematically, this two-level hierarchical problem
formulation has appealing characteristics for scalable parallel optimization, since
the problem dimension is significantly reduced at each level. Furthermore, the
partitioning based on boundary voltages and currents makes the N second-level
optimization problems completely independent of each other, allowing straightforward
parallelization.
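The structure of the two-level formulation can be illustrated on a deliberately tiny example, sketched below under strong assumptions: a single pad feeds a single load through two wires in series, the node between them plays the role of the partition boundary, the boundary current is fixed by the load (so level 1 searches over the boundary voltage only), and the minimum width and EM constraints are ignored. The function names and numbers are hypothetical.

```python
def partition_min_area(rho, length, current, v_in, v_out):
    """Level 2 for a one-wire partition: area rho*g*l^2 with g = I/(v_in - v_out)."""
    g = current / (v_in - v_out)
    return rho * g * length ** 2

def total_area(v_b, rho, l1, l2, current, v_pad, v_min):
    """Level-1 objective: sum of the independent level-2 minima at boundary v_b.

    Partition 2 minimizes its area by letting the load node sit exactly at
    v_min, the largest voltage drop allowed by the IR drop constraint.
    """
    a1 = partition_min_area(rho, l1, current, v_pad, v_b)
    a2 = partition_min_area(rho, l2, current, v_b, v_min)
    return a1 + a2

def level1_search(rho, l1, l2, current, v_pad, v_min, iters=200):
    """Ternary search over the boundary voltage (the objective is convex)."""
    lo, hi = v_min + 1e-9, v_pad - 1e-9
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        f1 = total_area(m1, rho, l1, l2, current, v_pad, v_min)
        f2 = total_area(m2, rho, l1, l2, current, v_pad, v_min)
        if f1 < f2:
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

v_b_opt = level1_search(rho=0.02, l1=1.0, l2=3.0, current=0.05,
                        v_pad=1.8, v_min=1.6)
```

For this chain the optimum splits the allowed drop in proportion to the wire lengths, so with lengths 1 and 3 and a 0.2 V budget the search settles at a boundary voltage of 1.75 V. Every evaluation of the level-1 objective triggers one level-2 solve per partition, which is the source of the runtime concern discussed next.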
Unfortunately, there exist two limitations that prevent a practical application of
this formulation:
• The hierarchical nature of the formulation implies that a large number of the
second level optimization problems need to be solved. Therefore, a brute-force
application of the two-level optimization can lead to excessive runtime.
• In each second-level optimization problem, setting the boundary voltages and
currents as in (3.9c) may make the problem overly constrained, which is partic-
ularly troublesome as the unknown optimal boundary conditions are searched
by the ﬁrst-level problem.
c. Proposed Parallel Two-Step Optimization Formulation
We address the two issues with the two-level hierarchical formulation, runtime
efficiency and convergence, by adopting a much more practical parallel two-step
formulation. For clarity, the problem formulation is stated as
Step 1:
$$\text{Solve for the optimal } V_B^{*} \text{ and } I_B^{*} \text{ in parallel.} \tag{3.10}$$
Step 2:
for each $i$, $i = 1, \cdots, N$
$$\min_{w_i \in \mathbb{R}^m} A_i(w_i) \tag{3.11a}$$
$$\text{s.t.}\quad (3.2),\ (3.3),\ (3.4),\ (3.5) \tag{3.11b}$$
$$C_R(V_{B,i},\, I_{B,i},\, V^{*}_{B,i},\, I^{*}_{B,i}) \tag{3.11c}$$
$$A_{tot,\min} = \sum_{i=1}^{N} A_{i,\min}, \tag{3.11d}$$
where $C_R(V_{B,i}, I_{B,i}, V^{*}_{B,i}, I^{*}_{B,i})$ is the constraint on the boundary conditions $V_{B,i}$ and
$I_{B,i}$ with respect to $V^{*}_{B,i}$ and $I^{*}_{B,i}$.
Compared to the two-level hierarchical formulation of (3.8) and (3.9), the two steps
above are executed only once, in sequence. In step 1, the optimal voltages and
currents along the partitioning boundaries, i.e., $V_B^{*}$ and $I_B^{*}$ corresponding to the
(unknown) optimal power grid design, are solved for. Conceptually, this seemingly
presents a chicken-and-egg dilemma: $V_B^{*}$ and $I_B^{*}$ may seem obtainable only after finding
the optimal power grid solution. However, by exploiting the strong locality behavior
in C4-type power grids, it can be shown that $V_B^{*}$ and $I_B^{*}$ (or near-optimal boundary
values) can be obtained rather efficiently without solving the entire power grid
optimization problem. The same locality property allows $V_B^{*}$ and $I_B^{*}$ to be determined
by solving a set of independent local optimization problems, leading to an immediate
parallelization of step 1, as detailed in Section III.A.3.
In step 2, a set of N optimization problems are solved to optimize the N power grid
partitions based on the $V_B^{*}$ and $I_B^{*}$ computed in step 1. In comparison with (3.9c),
where the boundary condition for partition i is constrained by fixing both the
boundary voltages and currents, a relaxed boundary constraint is adopted in (3.11c).
As detailed in Section III.A.4, the use of the relaxed boundary constraint makes the
step-2 optimization problems significantly easier to solve, enhancing the convergence.
Feeding step 2 with $V_B^{*}$ and $I_B^{*}$ (or near-optimal boundary values) also significantly
improves the convergence of these optimization problems. This is in contrast with the
two-level hierarchical optimization, where arbitrary $V_B$ and $I_B$ values can make the
level-2 optimization problems very difficult to solve or even infeasible.
In summary, the convergence of the proposed approach, and consequently its runtime
efficiency, is achieved by feeding $V_B^{*}$ and $I_B^{*}$ into the second step and adopting
relaxed boundary constraints. The runtime efficiency is further enhanced by converting
from the two-level hierarchical formulation to the sequential two-step formulation, in
which both steps are parallelizable. Note that, in principle, any robust nonlinear
constrained optimization method can be applied to solve the optimization problems in
the two steps. Hence, the proposed approach is generic and formulated purely based
upon the nature of the application.
3. Parallel Solution of Optimal Boundary Conditions
The locality property of the ﬂip-chip type power delivery networks and its application
in static analysis are presented in Section II.A. For notation simplicity, we refer to
the locality exhibited under the context of the circuit analysis as analysis locality. In
this work, the spatial locality is exploited for ﬁnding the optimal boundary conditions
in power grid optimization. Similar to the idea shown in Figure 8, an optimization
window is used to enclose a partition boundary at the window center. The size
of each window is made large enough to include a ring of C4 bumps around the
partition boundary such that the circuit responses along the boundary are inﬂuenced
in a negligible way by the part of the power grid outside of the window.
In optimization, we further exploit optimization locality, described as follows.
Each truncated window is treated as an independent grid and optimized using the
standard optimization formulation (3.1)-(3.5). The optimized window is analyzed to
compute the voltage and branch current responses on the partitioning boundary
(located at the center of the window). Optimization locality exists if the voltage and
current responses obtained in the above procedure are very close to those under the
flat optimization of the entire power grid. Intuitively, based on analysis locality,
it is expected that optimization locality can also be achieved by choosing a proper
window size. Again, in a large enough window, the nodes at the center will not be
influenced much by the part of the grid outside the window. This spatial locality can
propagate into the optimization stage. That is, towards the center, the optimal wire
widths obtained via the window-based optimization are increasingly close to the true
optimal values. Further considering analysis locality, the circuit responses of this
optimized window shall match closely with those of the true optimal power grid
solution at the center of the window.
We adopt the window size and C4 ring size definitions of Section II.A. A 160K-node
power grid with C4 bumps is used as an example to illustrate the observed optimization
locality. The C4 bumps are 25 nodes away from each other and are evenly distributed in
the grid. The power grid is divided into four equally sized partitions, and one
boundary is examined as an example. The node voltage and branch current obtained for a
single node on the boundary via the window-based optimization are shown in Table X as
functions of the window size. The average node voltage and the average branch current
for the boundary are also examined. All the voltages and currents obtained from window
optimizations are compared with the corresponding optimal values. Note the quick
convergence of the results. The window size of 65, corresponding to a C4 ring size of
3, is already good enough for practical use.
In practice, the best choice of the window size is problem dependent and not known
a priori. In this work, starting from a relatively small initial value, the window
size is gradually increased until convergence of the partition boundary responses is
observed. This procedure avoids spending unnecessary runtime due to an overly
conservative choice of the window size. It is found that for most cases, the window
size converges at a quite manageable value. Since each optimal boundary condition is
determined by solving an independent window-based optimization, the entire procedure
can easily be parallelized, as summarized in Algorithm 3, where the original power
grid Ω is assumed to be divided into K partitions Ω1, · · ·, ΩK with L partition
boundaries B1, · · ·, BL. L boundary windows W1, · · ·, WL are created with an initial
window size Sini. Each of these window-based optimizations can be solved in parallel
using one thread on a multi-core (or shared memory) machine.

Table X. Optimized boundary voltage and current as functions of window size in the
window-based optimization. WS: window size. C4S: C4 ring size. NV: voltage for a
boundary node in V. NC: current for a boundary branch in mA. AV: average voltage for
the boundary in V. AC: average current for the boundary in mA. OPT: optimal value.

WS    C4S   NV        NC         AV        AC
15    1     1.91007   0.281127   1.89857   0.109043
40    2     1.91208   0.292751   1.89802   0.368246
65    3     1.91218   0.293240   1.89798   0.341391
90    4     1.91218   0.293262   1.89798   0.338898
OPT   NA    1.91219   0.293271   1.89798   0.339042
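The convergence reported in Table X can be quantified by the relative deviation of each window-based value from the optimal (OPT) value. The numbers below are computed directly from the table entries for the window size of 65; the choice of relative deviation as the measure is an illustrative assumption.

```latex
\begin{align*}
\text{NV:}\quad & \frac{|1.91218 - 1.91219|}{1.91219} \approx 5.2 \times 10^{-6}, \\
\text{NC:}\quad & \frac{|0.293240 - 0.293271|}{0.293271} \approx 1.1 \times 10^{-4}, \\
\text{AV:}\quad & \frac{|1.89798 - 1.89798|}{1.89798} = 0 \ \text{(to the printed precision)}, \\
\text{AC:}\quad & \frac{|0.341391 - 0.339042|}{0.339042} \approx 6.9 \times 10^{-3}.
\end{align*}
```

The voltages agree to within about 0.001% and the currents to within about 0.7%, whereas at a window size of 15 the average current is still off by roughly 68%.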
4. Parallel Optimization of Partitioned Sub Power Grids
As presented in Section III.A.2.c, the optimal (or near-optimal) boundary conditions
$V_B^{*}$ and $I_B^{*}$ are computed in step 1 of the proposed approach. In step 2, each
partitioned sub power grid is optimized in parallel using $V_B^{*}$ and $I_B^{*}$. In practice,
setting the boundary constraints exactly to $V_B^{*}$ and $I_B^{*}$ can still make the step-2
optimization problems very difficult to solve numerically. As such, relaxed boundary
constraints as in (3.11c) are adopted. On the other hand, the relaxation of the
boundary conditions must be handled with care, since inconsistent boundary conditions
may alter the
circuit responses in each partition, leading to an infeasible solution after all the opti-
mized partitions are merged to form the complete grid. We show a relaxed boundary
constraint scheme, which is provable to guarantee the IR drop constraints for the
merged grid. In practice, the scheme may produce a small amount of EM violations,
Algorithm 3 Parallel optimization for the optimal boundary conditions
Input: Original power grid Ω, partition boundaries B1, · · ·, BL, initial window size Sini,
the convergence tolerance Ctol.
Output: Optimized boundary conditions B1,opt, · · ·, BL,opt, maximum terminating window
size Smax,t.
1: for i← 1 to L par do
2: Bi,opt ← Bi
3: Si ← Sini
4: while NOT CONVERGED do
5: Form the boundary optimization window Wi with Bi,opt and Si.
6: Create a sub-optimization problem Oi using Wi subject to (3.2)-(3.5).
7: Solve Oi. The optimized boundary Bi,temp is extracted.
8: if |Bi,temp − Bi,opt| / |Bi,temp| < Ctol then
9: CONVERGED
10: else
11: NOT CONVERGED
12: end if
13: Bi,opt ← Bi,temp
14: if NOT CONVERGED then
15: Increase window size Si.
16: end if
17: end while
18: end for
19: Smax,t ← maxi=1,···,L{Si}.
20: return B1,opt, · · ·, BL,opt, Smax,t
which can be reduced via ﬁxing techniques. It should be noted that there is a tradeoﬀ
between the number of sub-grids to be optimized and the size of each sub-grid. Hav-
ing smaller sub-grid size makes the optimization of each one faster; however, there
might be a large number of them to be processed.
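The window-growing loop of Algorithm 3 can be sketched as follows. This is only an illustrative outline, not the implementation used in this work (which is written in C with Pthreads): the callback `solve_window`, the growth rule `grow`, the entry-wise relative-change measure and the tolerance value are all assumptions, and the toy solver simply replays the NV and NC columns of Table X for a single boundary.

```python
from concurrent.futures import ThreadPoolExecutor

def relative_change(new, old, eps=1e-12):
    """Largest entry-wise relative change between two boundary responses."""
    return max(abs(n - o) / max(abs(n), eps) for n, o in zip(new, old))

def converge_boundary(i, b_init, solve_window, s_ini, c_tol, grow, max_iters=20):
    """Window-growing loop for boundary i (inner loop of Algorithm 3)."""
    b_opt, size = list(b_init), s_ini
    for _ in range(max_iters):
        b_temp = solve_window(i, size)        # solve O_i on window W_i
        converged = relative_change(b_temp, b_opt) < c_tol
        b_opt = b_temp
        if converged:
            break
        size = grow(size)                     # enlarge the window and retry
    return b_opt, size

def optimize_boundaries(b_inits, solve_window, s_ini, c_tol, grow, workers=4):
    """Run one window-growing loop per boundary, each in its own thread."""
    def job(item):
        i, b_init = item
        return converge_boundary(i, b_init, solve_window, s_ini, c_tol, grow)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(job, enumerate(b_inits)))
    s_max = max(size for _, size in results)  # maximum terminating window size
    return [b for b, _ in results], s_max

# Toy stand-in for the window solver: replays (NV, NC) from Table X.
TABLE_X = {15: [1.91007, 0.281127], 40: [1.91208, 0.292751],
           65: [1.91218, 0.293240], 90: [1.91218, 0.293262]}

def toy_solver(i, size):
    return TABLE_X[size]

b_opt, s_max = optimize_boundaries([[2.0, 0.0]], toy_solver, s_ini=15,
                                   c_tol=5e-3, grow=lambda s: s + 25)
```

With the hypothetical tolerance of 5e-3, the loop stops at a window size of 65, the size at which the Table X entries stop changing appreciably.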
a. Optimization of Partitioned Grids Using Relaxed Boundary Constraints
The basic idea is to maintain the exact boundary voltage conditions while constraining
(relaxing) boundary currents in a way such that each partition is optimized under
a potentially worse current loading condition. As will be shown later, this ensures
that the merged power grid sees a possible reduction in current loading if it ever
changes, which increases or at least maintains the same voltage level for every node,
thereby keeping the IR drop constraints satisfied after the merging. Although there is
no theoretical guarantee for zero EM constraint violation, if optimal or near optimal
boundary conditions are used, the non-degraded overall current loading in the merged
grid tends not to alter the current distributions to jeopardize the EM constraints in
a signiﬁcant way, which is consistent with empirical observations.
Consider the illustrative example shown in Figure 40, where a power grid is
partitioned into two pieces P and P ′ at node interface B = {B1, B2, B3}. The
optimization procedure is illustrated in three steps:
• Step I: Before partitioning, the optimal (or near optimal) boundary voltages
V ∗B = {V ∗B1, V ∗B2, V ∗B3} as well as the optimal (or near optimal) boundary currents
I∗B = {I∗B1, I∗B2, I∗B3} are computed in the ﬁrst step of the proposed two-step
optimization approach. It is assumed the directions of I∗B are as shown in the
ﬁgure.
• Step II: The grid is partitioned into two parts: P and P ′. The left part still
[Figure 40 graphic omitted. Step I: the unpartitioned grid with boundary nodes B1, B2, B3 and the optimal boundary voltages V∗Bi and currents I∗Bi. Step II: partitions P and P′ with split nodes B1-B3 and B1′-B3′, each attached to a voltage source of value V∗Bi. Step III: the merged grid with boundary responses VBi,o and IBi,o.]
Fig. 40. Optimization of the partitioned grid under relaxed boundary conditions.
possesses the original nodes B, and new nodes B′ = {B1′, B2′, B3′} are created
for the right partition. Ideal voltage sources VB = {VB1, VB2, VB3} and V ′B =
{V ′B1, V ′B2, V ′B3} with values V ∗B are attached to the split nodes B and B′ in
each partition. Equality and inequality constraints are set for the voltages and
branch currents of the voltage sources, respectively:
\[
\begin{cases}
V_{B_i} = V^{*}_{B_i}, \quad V'_{B_i} = V^{*}_{B_i} \\
I_{B_i} \ge I^{*}_{B_i}, \quad I'_{B_i} \le I^{*}_{B_i}
\end{cases}
\qquad i = 1, 2, 3. \tag{3.12}
\]
Note that the reference directions of the two branch currents are consistent with
that of the boundary branch current in Step I. The current constraints are such
that the currents drawn out of P by VB are at least I∗B, while the currents
provided by V′B into P′ are not greater than I∗B. This leads to the potentially worse
loading conditions under which the two partitions are independently optimized.
• Step III: After the optimization, P and P′ are merged to form the complete grid,
in which the ideal voltage sources used to set the boundary conditions are removed.
b. Maintenance of IR Drop Constraints
We prove the maintenance of IR drop constraints in the proposed optimization with
relaxed boundary conditions. First, several theoretical results relevant to power grid
analysis are presented.
Deﬁnition 1. A nonnegative matrix is a matrix with all the elements being nonneg-
ative.
To perform a DC analysis for a power grid, the standard modiﬁed nodal analysis
(MNA) can be used to generate a system matrix. If the voltages of the VDD pads are
substituted by the known supply level in the system of equations, the system matrix
becomes a so-called M-matrix [50], as shown in [10]. The following results exist for
M-matrices.
Lemma 1. The inverse of an M-matrix is a nonnegative matrix [50].
The monotonicity of power grids is suggested in [23].
Lemma 2. (monotonicity) None of the node voltages of a VDD distribution network
decreases if the current loading to the network decreases.
Proof. The variations of node voltages due to any change in the current loading are
determined by the following linear system
\[ A \Delta V = \Delta I_{in}, \tag{3.13} \]
where ΔV is the vector of node voltage changes; ΔIin is the vector of the corresponding
changes in the current loading; A is the system matrix, which is an M-matrix. If the
current loading decreases, all the entries in ΔIin are nonnegative. Further considering
that A^{-1} is nonnegative (A^{-1} ≥ 0) leads to
\[ \Delta V = A^{-1} \Delta I_{in} \ge 0. \tag{3.14} \]
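Lemmas 1 and 2 can be checked numerically on a very small example. The sketch below is only an illustration under assumed values: a three-node resistive chain tied to a 2 V supply pad at both ends, with hypothetical branch conductances of 10 S and hypothetical load currents. It builds the system matrix after substituting the pad voltages, verifies that its inverse is entry-wise nonnegative, and confirms that reducing one load does not lower any node voltage.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def inverse(A):
    """Invert A column by column using solve()."""
    n = len(A)
    cols = [solve(A, [1.0 if r == c else 0.0 for r in range(n)]) for c in range(n)]
    return [[cols[c][r] for c in range(n)] for r in range(n)]

VDD, G = 2.0, 10.0                       # assumed supply (V) and conductance (S)
# Chain pad - n0 - n1 - n2 - pad after substituting the known pad voltages.
A = [[2 * G, -G, 0.0],
     [-G, 2 * G, -G],
     [0.0, -G, 2 * G]]

def node_voltages(loads):
    """Node voltages for the given load currents (A) drawn from n0, n1, n2."""
    rhs = [G * VDD - loads[0], -loads[1], G * VDD - loads[2]]
    return solve(A, rhs)

A_inv = inverse(A)
v_heavy = node_voltages([0.3, 0.5, 0.3])
v_light = node_voltages([0.3, 0.2, 0.3])   # load at n1 reduced
```

In this example the node voltages rise from about 1.945, 1.92, 1.945 V to about 1.96, 1.95, 1.96 V when the middle load is reduced, in line with the monotonicity property.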
Now, we show the following theorem.
Theorem 1. Merging individually optimized grid partitions under the relaxed bound-
ary conditions (e.g. (3.12)) maintains the IR drop constraints.
Proof. Since the IR drop constraints are enforced when each partition is individually
optimized, Theorem 1 amounts to showing that after the merging none of the node
voltages in the entire grid decreases. Without loss of generality, the example in Figure
40 is used to show the result, and for simplicity, only node B1 is analyzed. In Figure
41, a sequence of analysis steps is followed to analyze the voltage response change
after the merging.
[Figure 41 graphic omitted. Analysis Step I: the optimized partitions POPT and P′OPT joined at B1/B1′ with the voltage source branch currents IB1,OPT and I′B1,OPT. Analysis Step II: the interface source carrying IB1,new ≥ 0. Analysis Step III: superposition of the Step II network and a network driven only by IB1,new, giving the merged partitions POPT,M and P′OPT,M.]
Fig. 41. Merging process analysis
• Analysis step I: Assume that after the optimization, the actual voltage source
branch currents are IB1,OPT and I′B1,OPT for partitions P and P′, respectively.
Now the two partitions are merged at the interface node B1 with a single ideal
voltage source of VB1 attached. Since constraints (3.12) are enforced in the
optimization (IB1,OPT ≥ I∗B1 ≥ I′B1,OPT), it is easy to see that
IB1,OPT − I′B1,OPT ≥ 0, which is equal to the branch current through the single
ideal voltage source. Note that the reference direction of the branch current is
going into the voltage source as shown in the figure. Furthermore, all the voltage
and current responses in the grid remain unaltered in this step.
• Analysis step II: The ideal voltage source at the interface is replaced by an ideal
current source with a value IB1,NEW = IB1,OPT − I′B1,OPT ≥ 0. According to the
substitution theorem, no change is made to any circuit response.
• Analysis step III: The voltage responses of the merged grid are now analyzed.
Note that the merged grid is the result of removing the ideal current source in
Step II from the network, which is equivalent to keeping a zero-valued current
source. By the principle of superposition, it implies that the circuit responses
of the merged network can be computed by summing up the responses of two
networks. The ﬁrst network is identical to what is in the previous analysis step.
The second network is constructed by modifying the network in Analysis Step
II: reversing the current direction of the ideal current source and zeroing all
other independent sources, that is, grounding all the VDD connections. Since
the reversed ideal current source provides current into the grid, by Lemma 2, the
voltage responses of the second network are all nonnegative.
Compared to analysis step II, the net changes of voltage responses in the merged
grid are equal to the responses of the second network. Hence, the theorem is proved.
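The superposition argument of analysis step III can be illustrated on the same kind of toy network. The sketch below is an assumption-laden example, not part of the proof: a three-node chain with hypothetical conductances and loads, where the middle node plays the role of B1 and a current of 0.2 A stands in for IB1,NEW. It checks that the merged-grid voltages equal the step-II voltages plus the (nonnegative) response of the second network.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

VDD, G = 2.0, 10.0          # assumed supply (V) and branch conductance (S)
I_NEW = 0.2                 # assumed value of the interface current source (A)
A = [[2 * G, -G, 0.0],
     [-G, 2 * G, -G],
     [0.0, -G, 2 * G]]      # chain pad - n0 - B1 - n2 - pad

# Step II network: loads at n0 and n2, plus I_NEW drawn out of B1.
v_step2 = solve(A, [G * VDD - 0.3, -I_NEW, G * VDD - 0.3])
# Second network: VDD grounded, only the reversed source injecting I_NEW at B1.
v_second = solve(A, [0.0, I_NEW, 0.0])
# Merged grid: the interface current source removed.
v_merged = solve(A, [G * VDD - 0.3, 0.0, G * VDD - 0.3])

v_sum = [a + b for a, b in zip(v_step2, v_second)]
```

Here `v_second` is about 0.01, 0.02, 0.01 V, so every node voltage of the merged grid is at least as high as in step II.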
c. Fixing EM Violation
Due to the adopted relaxed boundary current constraints, the boundary currents
seen by adjacent partitions may not be completely identical. Theoretically, the
merging step can alter the current distributions of each partition, contributing to a small
amount of EM violations. It has been observed in extensive experiments that if the
boundary voltages and boundary currents computed in the first step of the two-step
optimization method are close to the optimal solutions, no significant EM violation
(less than 0.1%) is generated after merging. Therefore, if the merged grid has
large EM violations, the boundary conditions can be recomputed by tightening the
convergence tolerance in Algorithm 3. An alternative fixing strategy is to guard-band
the small amount of merging-induced EM violations by slightly tightening the EM
constraints when optimizing each partition.
d. Algorithm Flow for the Proposed Locality Driven Parallel Optimization
Finally, we summarize the overall ﬂow of the proposed locality driven parallel opti-
mization method in Algorithm 4. Assume we have obtained the optimal boundary
conditions B1,opt, · · ·, BL,opt from the boundary window optimization using Algorithm
3. If the maximum EM violation Viomax found in the obtained optimal global grid
exceeds the EM violation tolerance Viotol (e.g. 1%), the convergence tolerance
Ctol is halved and the procedure is repeated. In practice, once Ctol is well chosen,
the maximum EM violation is always negligible. All other notations are the same as in Algorithm 3.
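The outer loop described above (Algorithm 4) can be outlined as follows. This is a sketch only: the three callbacks are hypothetical placeholders for the boundary window optimization, the partition optimization with merging, and the grid simulation, and the toy stand-ins used at the bottom (an EM violation proportional to the tolerance) are invented purely to exercise the control flow.

```python
def locality_driven_optimize(boundary_opt, partition_opt, simulate_em,
                             c_tol, vio_tol, max_rounds=10):
    """Halve c_tol until the merged grid meets the EM violation tolerance.

    boundary_opt(c_tol)    -> boundary conditions (Algorithm 3)      [hypothetical]
    partition_opt(bounds)  -> merged grid after partition optimization [hypothetical]
    simulate_em(grid)      -> maximum EM violation of the merged grid  [hypothetical]
    """
    for _ in range(max_rounds):
        bounds = boundary_opt(c_tol)
        grid = partition_opt(bounds)
        vio_max = simulate_em(grid)
        if vio_max <= vio_tol:
            return grid, vio_max, c_tol
        c_tol /= 2.0                      # tighten the convergence tolerance
    raise RuntimeError("EM violation tolerance not met within max_rounds")

# Toy stand-ins: the "grid" is just the tolerance it was built with, and the
# EM violation is assumed to be ten times that tolerance.
grid, vio_max, c_tol_final = locality_driven_optimize(
    boundary_opt=lambda tol: tol,
    partition_opt=lambda bounds: bounds,
    simulate_em=lambda g: 10.0 * g,
    c_tol=0.01, vio_tol=0.03)
```

The `max_rounds` guard is an addition of this sketch so that the loop cannot run indefinitely.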
5. Experimental Results
The proposed parallel locality-driven optimization method has been implemented in
C and integrated with IPOPT (Interior Point OPTimizer) [47]. Parallelization is
implemented by creating multiple threads using Pthreads. Experimental results for the
ﬂat optimization (optimizing the power grid as a whole without partitioning), the
serial locality-driven optimization (running the boundary window optimization and
individual partition optimization in serial), and the parallel locality-driven optimization
are obtained on a workstation with two quad-core Intel Xeon CPUs at 2.33 GHz
and 8 GB of RAM running a 64-bit Linux OS.
Algorithm 4 Locality-driven parallel optimization algorithm
Input: Original power grid Ω, K partitions Ω1, · · ·, ΩK, convergence tolerance Ctol, EM
violation tolerance Viotol.
Output: Optimized power grid Ωopt, maximum EM violation Viomax.
1: NOT OPTIMUM
2: while NOT OPTIMUM do
3:   Get L optimal boundaries B1,opt, · · ·, BL,opt from the parallel boundary window
     optimization in Algorithm 3.
4:   Create K partition optimization problems Op1, · · ·, OpK using B1,opt, · · ·, BL,opt
     subject to (3.2)-(3.5) and (3.12).
5:   Solve Op1, · · ·, OpK simultaneously. The optimized partitions Ω1,opt, · · ·, ΩK,opt are
     merged to form the optimal global grid Ωopt.
6:   Simulate the grid Ωopt, and find the maximum EM violation Viomax.
7:   if Viomax > Viotol then
8:     Ctol ← Ctol/2
9:     NOT OPTIMUM
10:  else
11:    OPTIMUM
12:    return Ωopt and Viomax
13:  end if
14: end while
a. Partition Optimization
As stated in previous sections, the feasibility and convergence of each partition
optimization greatly depend on the choice of boundary conditions V∗B and I∗B, and thus
on the size of the window chosen for the boundary window optimization. We use the
160K-node power grid mentioned in Section III.A.3 as an example. The power supply is 2 V.
The IR drop and EM constraints are set as follows: for each node voltage Vi, Vi ≥ 1.8 V;
and for each branch current Ib and width wb, Ib/wb ≤ 1 mA/µm.
Figure 42 shows how the quality of the ﬁnal optimized power grid, in terms of
IR drop violations and EM violations, depends on the optimality of the boundary
conditions we employ for the partition optimizations. The original power grid exhibits
IR drop violations, in Figure 42(a), and EM violations, in Figure 42(b), in some regions
(only the EM distribution for horizontal grid wires is shown in the figure; the
distribution for vertical wires has a similar pattern). As shown in Table X, if the
boundary window size is chosen to be 40 (C4 ring size of 2), the resulting boundary
voltages and currents are not completely accurate. This creates convergence problems
in the subsequent individual partition optimizations, leading to long runtime and
low optimization quality. In accordance with Theorem 1, there is no IR drop
violation in the final optimized grid in this case, as shown in Figure 42(c). However,
EM violations do exist, as shown in Figure 42(d) (the maximum Ib/wb is over 1.2 mA/µm).
After including more C4 bumps in the boundary window optimization and using 65 as the
window size (C4 ring size of 3), the boundary conditions obtained for all partitioned
sub-grids are much closer to the optimal values. With those near-optimal boundary
conditions, all the partition optimizations are able to converge. As before,
there are no IR drop violations (the largest IR drop is less than 140 mV), and
all the current densities are within the specified EM constraints, as shown in
Figure 42(e) and Figure 42(f). The results presented in the figure also confirm the
previous claim that if the boundary conditions are sufficiently close to the optimum,
no significant EM violations are generated after merging the partitions
(less than 0.03% in this case).
b. Overall Optimization
The proposed locality-driven parallel power grid optimization algorithm has been
tested under six large-scale power grids with varying sizes: 40K-node, 90K-node,
160K-node, 360K-node, 640K-node, and 1M-node. All the power grids use C4 bump
power supply pads. Among the six initial designs, the starting wire widths are chosen
Fig. 42. Impact of boundary conditions on the quality of ﬁnal optimized power grid. (a): node voltage distribution before
optimization. (b): EM distribution of horizontal wires before optimization. (c): node voltage distribution after
optimization using window size 40. (d): EM distribution of horizontal wires after optimization using window size
40. (e): node voltage distribution after optimization using window size 65. (f): EM distribution of horizontal
wires after optimization using window size 65.
Table XI. Optimization runtime for flat optimization, serial locality-driven optimization, and parallel locality-driven
optimization. N: number of nodes. P: number of partitions. SIM: flat simulation runtime in sec. OPT: optimization
runtime in sec. BWO: boundary window optimization runtime in sec. PO: partition optimization runtime in
sec. TOT: total runtime in sec. AR: area reduction in %. WSb: beginning window size. WSt: maximum
terminating window size. IT: number of iterations for window size determination. NVio: number of EM violations.
Viomax: maximum EM violation in mA/um.
               | Flat Optimization   | Serial Optimization      | Parallel Optimization
N     P   SIM  | OPT    TOT    AR    | BWO   PO    TOT    AR    | BWO   PO   TOT   AR     WSb  WSt  IT  NVio   Viomax
40K   4   1    | 110    111    52.87 | 381   63    446    52.86 | 134   24   160   52.86  40   65   2   6156   4e-4
90K   9   2    | 890    892    53.96 | 1055  195   1254   53.95 | 257   39   300   53.95  40   65   2   8306   2e-4
160K  4   3    | 2122   2126   61.08 | 539   670   1218   61.06 | 366   223  598   61.06  40   90   3   9946   3e-4
360K  9   10   | 8057   8068   75.50 | 1707  1267  3001   75.49 | 333   336  698   75.49  40   65   2   10478  3e-4
640K  16  18   | 20243  20264  75.58 | 3500  2588  6143   75.57 | 686   652  1398  75.57  40   65   2   17503  3e-4
1M    25  43   | NA     NA     NA    | 6827  3283  10234  59.97 | 1264  802  2185  59.97  40   65   2   1094   2e-4
such that some of them do not have any constraint violation (e.g. over-designed),
while others do. A fast multigrid-like power grid simulator is employed for the grid
simulation. Since the simulation only provides the initial values for the optimization,
accurate power grid simulation through expensive analysis (e.g. direct solve) is not
needed. The flat simulation runtime, the optimization runtime and the area reduction for
all the cases under the three optimization methods (flat, serial and parallel) are shown in
Table XI. The simulation runtime is negligible compared to the optimization runtime.
Since the serial and the parallel methods follow the same algorithm, the beginning
window size, the maximum terminating window size, and the number of iterations
of the window size determination scheme are shown only for the parallel method in Table XI.
Likewise, the number of EM violations and the maximum EM violation are presented only
for the parallel method. It should be noted that the area reduction is the percentage of
the wiring area saved relative to the original wiring area, and it is used to reflect the
quality of the optimum obtained by each of the three optimization methods.
As can be seen from the table, the significant advantage of the proposed approach
is its scalability. Due to excessive memory and runtime requirements, the standard
flat optimization does not scale well with the circuit complexity. In our case, it
takes the flat optimization 20,264 seconds to size the 640K-node grid. The 1-million-node
grid cannot be optimized in flat due to memory overflow. Given the large size of
practical power grids, this presents a severe limitation. In contrast, the
divide-and-conquer nature of the proposed method makes it highly scalable. The serial version
of our locality-driven approach can successfully size all the benchmarks. Even for
the grids that can be optimized in flat, our serial method can achieve good runtime
speedups in some cases. In practice, the amount of speedup may depend on the
grid size, structure, current loading and initial design. Furthermore, our method is
naturally parallelizable. This allows the use of parallel processing to gain further
runtime efficiency. For example, the parallel locality-driven method can size the
largest one-million-node grid in only 2,185 seconds on the machine with two
quad-core CPUs. We expect that much larger grids can also be successfully optimized by our
approach. Moreover, our methods generate almost the same optimal results as the
flat optimization. Although EM violations exist in the final merged power grid, their
values are insignificant (on the order of 1e-4 compared to a constraint bound on the
order of 1) and thus can be neglected.
For our method, the parallel version brings about a 2X speedup for grids with four
partitions and about a 4X speedup for grids with more than eight partitions.
The achieved parallel speedups are shown in Figure 43. As can be seen, the achieved
parallel processing efficiencies are less than one, which is primarily due to the load
imbalance caused by the partitioning we use, as shown in Figure 44. For example, some
boundary optimizations are strongly coupled to the rest of the grid and may
need a large window size to obtain converged boundary conditions (the window size may
need to be increased three times), whereas for the boundary windows with weak coupling
to other regions of the grid, the near-optimal boundary conditions can easily be
reached. Moreover, the optimizations for some partitions converge easily while
others take longer to reach the optimum. Therefore, the work of some
threads (in terms of the number of optimization iterations) may be significantly larger
than the work of others. Those heavily loaded threads have a great impact on the
parallel runtime. In our future work, load balancing techniques will be developed
to further improve the efficiency of the proposed parallel optimization method.
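The speedups quoted above can be recomputed directly from the total runtimes (TOT) in Table XI. The short sketch below does this; the parallel efficiency it reports relies on an assumption that is not stated in the text, namely that the number of utilized cores equals the smaller of the partition count and the eight available cores.

```python
# Total runtimes in seconds, copied from Table XI.
SERIAL_TOT = {"40K": 446, "90K": 1254, "160K": 1218,
              "360K": 3001, "640K": 6143, "1M": 10234}
PARALLEL_TOT = {"40K": 160, "90K": 300, "160K": 598,
                "360K": 698, "640K": 1398, "1M": 2185}
PARTITIONS = {"40K": 4, "90K": 9, "160K": 4, "360K": 9, "640K": 16, "1M": 25}
CORES = 8  # two quad-core CPUs

def speedup(grid):
    """Serial total runtime divided by parallel total runtime."""
    return SERIAL_TOT[grid] / PARALLEL_TOT[grid]

def efficiency(grid):
    """Speedup per utilized core (assumed: min(partitions, cores))."""
    return speedup(grid) / min(PARTITIONS[grid], CORES)

for grid in SERIAL_TOT:
    print(f"{grid:>5}: speedup {speedup(grid):.2f}, efficiency {efficiency(grid):.2f}")
```

Under this assumption the efficiencies fall between roughly 0.5 and 0.7, consistent with the load imbalance discussed above.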
[Figure 43 graphic omitted. Axes: speedup versus utilized cores (4 and 8), for the 40K, 160K, 90K, 360K, 640K and 1M grids.]
Fig. 43. Speedup of parallel over serial locality-driven optimization.
[Figure 44 graphic omitted. Axes: per-thread runtime for the boundary window optimizations and the partition optimizations.]
Fig. 44. Threads runtime (in sec) of boundary window optimizations and partition
optimizations for 640K-node power grid.
6. Summary
In this section, we proposed a novel partitioning-based locality-driven two-step
optimization scheme for the power grid wire size optimization problem. This scheme
exploits the locality of the power grids for scalable optimization of large grids. In the
first step of the proposed approach, optimal (or near optimal) partitioning boundary
voltages and currents are obtained via localized window-based optimization. In the
following step, partitions are individually optimized under the obtained boundary
conditions. The divide-and-conquer nature of the proposed method not only leads
to its favorable scalability but also makes it possible to employ increasing parallel
computing resources to facilitate the optimization of large power grids.
B. System-Level Co-Design of Power Delivery Networks with On-Chip Voltage Regulation
Integrating a large number of on-chip voltage regulators has the appeal of facilitating
ﬁne-grain multiple voltage domains on chip. However, how to choose the optimal
parameters for the voltage regulators/converters designs and the passive network
design so that the overall optimal performance can be achieved becomes a critical
problem. In this section, using the fast GSim engine, detailed systematic analysis for
the entire network with on-chip regulators is carried out. With the obtained design
insights, a system-level co-design ﬂow is proposed to automatically optimize the entire
network.
1. Background
a. System Modeling
The system-level components of a power delivery network with on-chip voltage
regulation, as well as their detailed models, are presented in Figure 45. Low-dropout
regulators (LDOs) are integrated on the chip to provide local voltage supplies and regulation
to different power domains (modeled as local grids). The input voltage to the LDOs is
provided by an efficient on-board Buck Converter (BC), which converts the external
power supply to a level close to the preset LDO output voltage so as to
improve the overall power efficiency.
[Figure 45 graphic omitted. Off-chip model: DC source, buck converter, PCB and package. On-chip model: C4 bumps, global VDD and GND grids, LDOs, local grids 1 and 2, and decaps.]
Fig. 45. System structure and model of a power delivery network with on-chip voltage
regulation.
b. Beneﬁts of On-Chip Voltage Regulation
High-frequency local voltage drops due to the fast switching circuits, lower-frequency
global resonance caused by oﬀ-chip inductive parasitics and IR drop due to the resis-
tance between the voltage supply and the on-chip nodes are three major contributors
to the voltage ﬂuctuation [6]. Suppressing or remedying these eﬀects would signiﬁ-
cantly improve the performance of PDN.
With the accurate and powerful analysis engine GSim described in Section II.B,
we give a quantitative analysis of the effects of having on-chip voltage regulation. The
voltage drops at a randomly chosen node before and after integrating on-chip LDOs are
examined, and the voltage drop waveforms are shown in Figure 46. For the PDN that does not
have LDOs, local grids are connected to the VDD grid through vias. Otherwise the
local grids are connected to LDOs as shown in Figure 45.
Fig. 46. Voltage drops for a power domain with LDOs and without LDOs.
Figure 46 shows that without the regulation of on-chip LDOs, on top of the
high-frequency voltage drop, there is a large mid-frequency swing due to chip-package
resonance. Even though the voltage drop caused by switching currents is about 40 mV,
the resonance introduces another 20 mV drop, making the total voltage drop much
larger. In contrast, the benefits of having on-chip voltage regulators are
threefold:
1. Suppressing high-frequency local drop: On-chip LDOs provide strong local
regulation. Along with on-chip decoupling capacitors, the LDOs can respond quickly
to local current fluctuations and automatically maintain the output voltage
level. Hence large local voltage swings are suppressed significantly.
2. Remedying mid-frequency global resonance: On-chip LDOs do not suffer from the
global resonance since the resonance is blocked at the input of the LDOs. LDOs
have weak input-to-output transfer functions, so as long as they work in the regulation
region they are not sensitive to input changes. Therefore, large voltage fluctuations
at the off-chip circuits and VDD grids cannot propagate to local grids.
3. Reducing IR drops: On-chip LDOs shorten the distance between current loads
and voltage regulators, and thus reduce the IR drops due to the wire
resistance. It should be noted that the DC shift of the waveform with voltage
regulation is caused by the line and load regulation of the LDOs.
c. Low-Dropout Regulator Design
In this work we adopt a novel multi-loop topology for enhanced on-chip regulation
performance. The analysis of this LDO is somewhat complex. Without loss of gener-
ality, a simpliﬁed topology (as shown in Figure 47) is employed to demonstrate major
LDO design considerations, such as dropout voltage, maximum load current, power
eﬃciency, LDO output impedance, power supply rejection (PSR) and stability. The
same analysis approach can be applied to other LDOs.
In this topology, M1 is the pass transistor that delivers the load current. M2 and M3
work as a current sensor to detect load current variation and generate a signal (Vctrl2)
to drive M1. The Error Amplifier (EA) works as the output voltage (Vout) sensor
that compares Vout with the reference voltage (Vref) and generates Vctrl1 to adjust
Vout through M2.
The dropout voltage is the input-to-output diﬀerential voltage at which the circuit
ceases to regulate against further reductions in input voltage. If input-to-output
voltage diﬀerence is less than the dropout voltage, the regulator is in the dropout region
and the output voltage decreases in proportion to the decreasing input voltage. In
contrast, if input-to-output voltage diﬀerence is larger than the dropout voltage, the
regulator is in the regulation region and the output voltage maintains a stable level.
The maximum load current as well as dropout voltage is determined by the dimension
ratio (w/l) and the maximum allowable gate-to-source voltage (Vgs) of M1 in Figure
47.
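[Figure 47 graphic omitted. Labeled elements: input Vin, output node with Iout, transistors M1, M2 and M3, error amplifier (EA) with reference Vref, control signals Vctrl1 and Vctrl2, capacitors C1 and C2, bias sources, currents Ip and Iq, and the load Cd, Rs and ILOAD.]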
Fig. 47. LDO topology.
The power efficiency of an LDO is limited by the quiescent current and the
output-to-input voltage ratio, and is defined as
\[ \varepsilon_{LDO} = \frac{I_{out} V_{out}}{I_{in} V_{in}} = \frac{I_{out} V_{out}}{(I_{out} + I_q) V_{in}}, \tag{3.15} \]
where Iq is the quiescent current flowing to the ground. Obviously, increasing the
input-to-output voltage difference would reduce power efficiency.
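As a quick numerical illustration of (3.15), the following sketch evaluates the LDO efficiency for two input voltages. The operating point (100 mA load, 1 mA quiescent current, 1.0 V output) is an assumed example, not a value from this work.

```python
def ldo_efficiency(i_out, i_q, v_out, v_in):
    """LDO power efficiency per (3.15): Iout*Vout / ((Iout + Iq) * Vin)."""
    return (i_out * v_out) / ((i_out + i_q) * v_in)

# Assumed operating point: 100 mA load, 1 mA quiescent current, 1.0 V output.
eff_high_vin = ldo_efficiency(i_out=0.1, i_q=0.001, v_out=1.0, v_in=1.2)
eff_low_vin = ldo_efficiency(i_out=0.1, i_q=0.001, v_out=1.0, v_in=1.1)
```

Lowering the input voltage from 1.2 V to 1.1 V raises the efficiency from about 0.825 to about 0.900 in this example, which is the trend stated above.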
The LDO output noise caused by the load current variations depends on the
small-signal output impedance (Zout), which is given by
\[ Z_{out} \approx \frac{(s C_d R_s + 1)(s C_{gs1} + g_{ds2})}{(1 + H_{ol}) \cdot D}, \tag{3.16} \]
\[ H_{ol} \approx \frac{g_{ma1} g_{m2} (g_{m1} + s C_{gs1})(g_{ma2} - s C_1)(1 + s C_d R_s)}{\left[ g_{oa1} g_{oa2} + (C_1 g_{ma2} + C_2 g_{oa1} s + C_1 C_2 s^2) \right] \cdot D}, \tag{3.17} \]
\[ D = C_d C_{gs1} s^2 + \left[ C_d R_s (g_{m1} g_{m2} + g_{ds1} g_{ds2} + g_{ds2} g_{m1}) + C_d g_{ds2} + C_{gs1} g_{m2} \right] s + g_{m1} g_{m2}, \tag{3.18} \]
where Hol is the loop gain; gmx, gdsx, and Cgsx represent the transconductance,
drain-source conductance and gate-source capacitance of the device Mx (x = 1, 2, 3); gma1
and gma2 are the equivalent transconductances of the first and second stages of the EA,
respectively; goa1 and goa2 are the equivalent DC output conductances of the two stages
of the EA, respectively; C1 is the frequency compensation capacitor and C2 is the capacitive
load of the EA; Cd and Rs are the load capacitance and parasitic resistance of the power grids,
respectively.
The power supply rejection of the LDO measures how well the output is isolated
from the supply noise. It can be expressed as
\[ PSR \approx \frac{\left[ s C_{gs1} (g_{ds1} + g_{ds2}) + (g_{m1} g_{ds1})(g_{ds2} + g_{ds3}) \right] (1 + s C_d R_s)}{(1 + H_{ol}) \cdot D}. \tag{3.19} \]
For good noise performance, both Zout and PSR are desired to be small.
As can be observed from (3.17) and (3.18), the system has four poles. Two of
them are contributed by the EA, namely p1 ≈ goa1/(A2C1) and p2 ≈ gma2/C2. The
other two are associated with the output stage, namely p3 and p4. The system also
has three zeros, two of which are contributed by the output stage and the other by
the Miller capacitor C1. The relative positions of the poles and zeros determine the
stability of the regulator.
d. Decoupling Capacitor Sizing and Placement
According to the analysis presented in [38], high-frequency noise introduced by fast
transient loads can be effectively reduced by increasing decoupling capacitance or by
placing capacitance closer to the load. However, increasing decoupling capacitance
comes at the cost of occupying more precious chip area and aggravating the decap
leakage power. Moreover, more effort on circuit floorplanning, placement and routing
may be required.
e. Buck Converter Design
A typical Pulse-Width Modulation (PWM) buck converter [51] is shown in Figure
48. The converter operates as follows. The switches in
the power stage are turned on and off under the control of the PWM block, generating
a square waveform at VX. The DC component of VX is then passed to the output
through a second-order low-pass LC filter (Lf and Cf).
[Figure 48 graphic omitted. Schematic: input Vin, PWM block, switch driver, power stage, node VX, filter Lf and Cf with series resistances RL and Rc, currents ILf, ICf and Iload, output Vbc with voltage sensing. Waveforms: VX(t), ILf(t) and Vbc(t) over the intervals DTs and (1-D)Ts, with ripples ∆I and ∆V around ILf_avg and Vbc_avg.]
Fig. 48. Buck converter topology.
In principle, the design of a transistor-level buck converter can be very complex.
Without loss of generality, in this work, a behavioral model of the buck converter
is used to capture the key design aspects that influence the performance of power
delivery systems [51]. In this model, the average output voltage (Vbc_avg), the voltage
ripple (ΔV), the power loss (Pb) and the power efficiency (εb) of the buck converter are
expressed as follows:
\[ V_{bc\_avg} = D V_{in} - R_L I_{load}, \tag{3.20} \]
\[ \Delta V = \frac{D V_{in} (1 - D)}{8 L_f C_f f_s^2}, \tag{3.21} \]
\[ P_b = \frac{1}{2} C_{MOS} V_{in}^2 f_s + \left[ D I_o^2 R_{PMOS} + (1 - D) I_o^2 R_{NMOS} \right] + \frac{1}{2} R_L I_o^2 + \frac{1}{3} \left( \frac{\Delta I}{2} \right)^2 R_C, \tag{3.22} \]
\[ \varepsilon_b = \frac{P_{load}}{P_{load} + P_b}, \tag{3.23} \]
where CMOS is the total capacitance to be charged/discharged for turning on and off
the switches in the power stage; RPMOS and RNMOS are the equivalent resistances
of the PMOS and NMOS switches in the power stage; RL and RC are the equivalent
series resistances of Lf and Cf, respectively; and the peak-to-peak variation of the
inductor current ΔI, the current Io and the load power Pload are given by
\[ \Delta I = \frac{V_{in} D (1 - D)}{L_f f_s}, \tag{3.24} \]
\[ I_o = \sqrt{ I_{load}^2 + \frac{1}{3} \left( \frac{\Delta I}{2} \right)^2 }, \tag{3.25} \]
\[ P_{load} = I_{load} V_{bc\_avg}. \tag{3.26} \]
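[Figure 49 graphic omitted. Labeled quantities: Vin and Iin at the LDO input, Vout and Iout at the LDO output, currents Iq, Ic and It, and power terms P, Pv, Pq, Pc and Pt.]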
Fig. 49. Power consumption of the PDN with on-chip LDOs.
The zero-load power loss Pb0 can be obtained from (3.22) by having Iload = 0.
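The behavioral model (3.20)-(3.26) is simple enough to evaluate directly. The sketch below is an illustration under stated assumptions: D is taken to be the PWM duty cycle and fs the switching frequency (as suggested by the DTs and (1-D)Ts intervals in Figure 48), and every component value is a made-up example rather than a design from this work.

```python
import math
from dataclasses import dataclass

@dataclass
class BuckModel:
    v_in: float    # input voltage Vin (V)
    duty: float    # D, assumed to be the PWM duty cycle
    f_s: float     # fs, assumed to be the switching frequency (Hz)
    l_f: float     # filter inductance Lf (H)
    c_f: float     # filter capacitance Cf (F)
    r_l: float     # series resistance of Lf (ohm)
    r_c: float     # series resistance of Cf (ohm)
    r_pmos: float  # PMOS switch resistance (ohm)
    r_nmos: float  # NMOS switch resistance (ohm)
    c_mos: float   # switched capacitance CMOS (F)

    def ripple_current(self):                      # (3.24)
        return self.v_in * self.duty * (1 - self.duty) / (self.l_f * self.f_s)

    def ripple_voltage(self):                      # (3.21)
        return (self.duty * self.v_in * (1 - self.duty)
                / (8 * self.l_f * self.c_f * self.f_s ** 2))

    def v_avg(self, i_load):                       # (3.20)
        return self.duty * self.v_in - self.r_l * i_load

    def i_o(self, i_load):                         # (3.25)
        return math.sqrt(i_load ** 2 + (self.ripple_current() / 2) ** 2 / 3)

    def power_loss(self, i_load):                  # (3.22)
        io2 = self.i_o(i_load) ** 2
        switching = 0.5 * self.c_mos * self.v_in ** 2 * self.f_s
        conduction = (self.duty * self.r_pmos + (1 - self.duty) * self.r_nmos) * io2
        inductor = 0.5 * self.r_l * io2
        capacitor = (self.ripple_current() / 2) ** 2 / 3 * self.r_c
        return switching + conduction + inductor + capacitor

    def efficiency(self, i_load):                  # (3.23) with (3.26)
        p_load = i_load * self.v_avg(i_load)
        return p_load / (p_load + self.power_loss(i_load))

# Hypothetical component values, chosen only to exercise the formulas.
buck = BuckModel(v_in=3.3, duty=0.5, f_s=1e6, l_f=1e-6, c_f=1e-6,
                 r_l=0.01, r_c=0.01, r_pmos=0.05, r_nmos=0.05, c_mos=1e-9)
p_b0 = buck.power_loss(0.0)    # zero-load power loss Pb0 (Iload = 0)
```

Calling `buck.power_loss(0.0)` gives the zero-load loss Pb0 mentioned above; it is nonzero because the switching term and the ripple-current terms do not depend on the load.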
2. System-Level Co-Design
Since the entire power delivery system is very complex and there are many aspects
to be considered in the system-level co-design, for illustration purposes we first
discuss the LDO-decap system co-design, in which the buck converter is assumed to
be an ideal power supply (100% power efficiency, zero power loss and a fixed output
voltage). Then, we present how to introduce the buck converter design into the system
co-design framework.
a. LDO-Decap System Co-Design
To simultaneously design low-dropout regulators and decoupling capacitors for a large
power delivery network, speciﬁc design considerations as well as the electric interac-
tions between LDOs and decaps in each major design aspect must be well understood.
Note that the decaps are the capacitive loads to the LDOs in the network.
Power: The power consumed by the entire chip (P) consists of the power
consumed by the transistors (Pt) (noted as useful power), the leakage power of the
decoupling capacitors (Pc), the LDO voltage conversion power loss (Pv), and the LDO
quiescent power (Pq). As shown in Figure 49, these power terms can be expressed as
\[ P = V_{in} \cdot I_{in} = V_{in} \cdot (I_q + I_{out}), \tag{3.27} \]
\[ P_t = V_{out} \cdot I_t, \tag{3.28} \]
\[ P_v = (V_{in} - V_{out}) \cdot I_{out}, \tag{3.29} \]
\[ P_q = V_{in} \cdot I_q, \tag{3.30} \]
\[ P_c = V_{out} \cdot I_c, \tag{3.31} \]
where Iout = Ic + It.
Two system-level metrics, the system power efficiency εs and the ground power
Pg, are introduced to measure the power performance of the power delivery
network in the heavy-load situation (the activity of the load circuits is high) and in
the zero/low-load situation (the load circuits are idle), respectively. The system power
efficiency is defined as the ratio of the useful power to the total input power,
as expressed in (3.32); the ground power is the sum of the LDO quiescent power and
the decoupling capacitor leakage power, as expressed in (3.33). Note that the LDO
power efficiency is partially reflected in the system power efficiency.
\[ \varepsilon_s = \frac{P_t}{P} = \frac{P_t}{P_t + P_v + P_q + P_c}, \tag{3.32} \]
\[ P_g = P_q + P_c. \tag{3.33} \]
When the decap leakage current and the LDO quiescent current are small compared
with the load current, the system power efficiency is bounded by the ratio of the
output voltage to the input voltage, Vout/Vin. Therefore, lowering the input
voltage (i.e., reducing the dropout voltage of the LDO) is the most effective way to
enhance the system power efficiency. However, with a very low input voltage,
LDOs are in danger of operating in the dropout region and losing regulation of the
output voltage, which degrades the system noise performance. Reducing the decap
amount or the number and sizes of LDOs can cut down the ground power, but may
increase noise as well.
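The power bookkeeping in (3.27)-(3.33) and the Vout/Vin bound on the efficiency can be checked numerically. The sketch below is illustrative only: the 1.2 V input, 1 V output and 2 A load follow the test PDN described in the experiments, while the decap leakage current and LDO quiescent current are hypothetical values, as are the function and key names.

```python
def ldo_decap_power(v_in, v_out, i_t, i_c, i_q):
    """Evaluate (3.27)-(3.33) for a single power domain."""
    i_out = i_c + i_t
    p_total = v_in * (i_q + i_out)     # (3.27) total input power P
    p_t = v_out * i_t                  # (3.28) useful power
    p_v = (v_in - v_out) * i_out       # (3.29) LDO conversion loss
    p_q = v_in * i_q                   # (3.30) LDO quiescent power
    p_c = v_out * i_c                  # (3.31) decap leakage power
    return {
        "P": p_total, "Pt": p_t, "Pv": p_v, "Pq": p_q, "Pc": p_c,
        "eff": p_t / p_total,          # (3.32) system power efficiency
        "Pg": p_q + p_c,               # (3.33) ground power
    }


# i_c and i_q are hypothetical currents chosen for illustration.
power = ldo_decap_power(v_in=1.2, v_out=1.0, i_t=2.0, i_c=0.05, i_q=0.02)
bound = 1.0 / 1.2                      # Vout / Vin, the upper bound on eff
```

Because Pq and Pc are strictly positive here, the computed efficiency stays below Vout/Vin and approaches it only as the leakage and quiescent currents become negligible relative to the load current.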
Noise: The noise of the entire system consists of the noise caused by load current
variation and the noise induced by supply voltage fluctuation. The former can be
reduced by lowering the LDO output impedance, while the latter can be suppressed
by a good LDO PSR. The output node of the regulator is analyzed as an example
of a node in the PDN.
To achieve low voltage noise on a node, the impedance at that node should
be low over the frequency range where the majority of the power spectrum of the load
current variations lies. In today's high-performance ICs, the rise/fall time of the load
current can be shorter than 1ns. Hence the node impedance from DC to very high
frequencies should be considered. The DC impedance of the LDO output node can be
derived from (3.16) as
\[ Z_{out\_DC} \approx \frac{1}{g_{m1} \cdot \dfrac{g_{m2}}{g_{ds2} + g_{ds3}} \cdot (1 + A_1 A_2)}, \tag{3.34} \]
where A1 and A2 are the DC gains of the two EA stages. The impedance can be
approximated as 1/(gm1 ACS Hol_DC), in which Hol_DC is the DC loop gain and ACS
is the DC gain of the current sensor at the output stage. Therefore, increasing gm1
or Hol_DC lowers the DC output impedance and hence better suppresses slow
variations of the load current. gm1 can be increased by increasing either w1 or Ids1,
which increases the size or the quiescent power of the LDO. ACS and the DC loop gain
Hol_DC can be enhanced by enlarging w2, l2, A1 or A2, which either increases the area
or puts more stress on stability.
At very high frequencies (VHF), (3.16) can be approximated as
\[ Z_{out\_VHF} \approx R_s + \frac{1}{s C_d}, \tag{3.35} \]
where Rs is on the order of tens of milliohms. Hence increasing Cd (decap) can
significantly reduce Zout_VHF and thus the VHF noise, but at the cost of larger area
and leakage current.
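The effect of Cd on the VHF impedance in (3.35) can be illustrated by evaluating its magnitude at s = jω. This is a minimal sketch; the function name, the 20 mΩ series resistance and the 1 GHz evaluation frequency are assumptions made only for the example, while the 18nF and 40nF decap amounts are taken from the experiments.

```python
import math


def z_out_vhf_mag(freq_hz, r_s, c_d):
    """Magnitude of Rs + 1/(j*omega*Cd), the VHF approximation in (3.35)."""
    reactance = 1.0 / (2.0 * math.pi * freq_hz * c_d)
    return math.hypot(r_s, reactance)


# Hypothetical operating point: Rs = 20 mOhm, evaluated at 1 GHz.
z_small_decap = z_out_vhf_mag(1e9, 0.02, 18e-9)
z_large_decap = z_out_vhf_mag(1e9, 0.02, 40e-9)
```

The larger decap gives the smaller magnitude, but the impedance can never fall below Rs, which is why the series resistance sets the floor of the VHF noise.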
In the mid-frequency range, (3.16) shows that Zout has two zeros corresponding
to the two poles of the EA. For stability, one of these poles must be the
dominant pole of the loop gain, and it can be approximated as p1 ≈ goa1/(A2C1). The
loop gain and Zout start to degrade noticeably beyond p1. To push this turning point
up to a higher frequency (i.e., to increase gma1 or decrease C1) while maintaining
the LDO stability, the non-dominant poles have to be moved to higher frequencies
at the same time. For the output stage shown in Fig. 47, this can only be done by
increasing gm1 and gm2 without increasing Cgs1 (assuming Cd cannot be decreased
because it is needed to suppress VHF noise). As a result, the quiescent current grows.
Area: The total area overhead (A) includes the decoupling capacitor area (Ac) and
the LDO area (Al). In general, the area overhead is directly proportional to the amount
of on-chip decoupling capacitance (Cd) and to the number of LDOs (N), provided the
area of each LDO remains constant.
Placement & Routing: LDOs are placed at selected locations, termed LDO
blocks. Each block can contain multiple LDOs. For good noise performance, it
is desirable to spread the LDO circuit blocks out over the chip. However, each
LDO circuit block placed on chip not only has its own placement and routing overhead
but may also reduce the placement freedom of other circuit blocks and cause extra wire
routing effort. As a rough estimate, the LDO placement & routing overhead is
proportional to the number of LDO circuit blocks on chip (M). It should be noted
that, due to the limited scope of this work, the placement & routing overhead of
decoupling capacitance is not considered.
Stability: A strict approach to checking load regulation stability is to examine the
poles of the whole network's impedance under a wide range of load current conditions.
However, the sheer size of the network makes this approach prohibitively expensive.
To gain insight into a feasible stability check that is empirical yet safe, first
imagine an extreme case in which the interconnect parasitic resistances are set to
zero. Then all the nodes in the power grid collapse into a single node. As a result,
the output pins of all LDOs are tied to one node, and so are the decaps. The circuit
becomes multiple LDOs connected in parallel that together drive one large capacitor
whose value is the total amount of decap (illustrated in Figure 50(b)). Since all the
LDOs are identical, it is fair to say that each LDO drives a load capacitor equal to
Ct/N, where N is the number of LDOs and Ct is the total decoupling capacitance. The
stability of this system is checked by checking the stability of each LDO-capacitor
pair (illustrated in Figure 50(c)). It is well known in LDO design that, if each LDO
is well designed, adding back a resistor of up to several hundred milliohms (Rs) that
represents the interconnect parasitic resistance between the LDO output and the
capacitor (as shown in Figure 50(d)) improves the stability of the LDO-capacitor pair.
This improvement is confirmed by the circuit simulation results shown in Figure 51:
as the resistor value increases from zero to a typical upper bound of the interconnect
resistance, the phase margin of the open-loop transfer function of the pair improves.
Therefore, the setup in Figure 50(c) is used as a conservative stability check for the
entire network. If the circuit in Figure 50(c) has a right-half-plane pole, the entire
system is treated as unstable to be safe. Our experiments have empirically demonstrated
the robustness of this approach.
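The reduction to a single LDO-capacitor pair can be sketched as follows. The dissertation performs a full pole-zero analysis on the pair; the sketch below substitutes a much simpler Routh-Hurwitz test on a hypothetical third-order characteristic polynomial, so it only illustrates the idea of rejecting a design when a right-half-plane pole exists. All function names and the example coefficients are assumptions, while the 40nF and 42-LDO figures are the initial design from the experiments.

```python
def per_ldo_load(c_total, n_ldos):
    """Load capacitance seen by each identical LDO when interconnect
    resistance is neglected: Ct / N."""
    return c_total / n_ldos


def cubic_is_stable(a3, a2, a1, a0):
    """Routh-Hurwitz test for a3*s^3 + a2*s^2 + a1*s + a0 with a3 > 0.

    All roots lie strictly in the left half-plane exactly when every
    coefficient is positive and a2*a1 > a3*a0."""
    return a3 > 0 and a2 > 0 and a1 > 0 and a0 > 0 and a2 * a1 > a3 * a0


# Initial design from the experiments: 40nF total decap shared by 42 LDOs.
c_load = per_ldo_load(40e-9, 42)

# Hypothetical characteristic polynomials, for illustration only.
stable_case = cubic_is_stable(1.0, 6.0, 11.0, 6.0)    # (s+1)(s+2)(s+3)
unstable_case = cubic_is_stable(1.0, 1.0, 1.0, 6.0)   # (s+2)(s^2 - s + 3)
```

A real check would rebuild the pair's transfer function for each load condition from zero to maximum and flag the design as unstable if any of those conditions fails.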
A single LDO-decap pair as shown in Figure 50(c) is used to illustrate how the
capacitive load (Ct/N) affects the stability of the system. The pole p1 from the EA is
designed as the dominant pole. p3 and p4 are non-dominant poles located at high
frequencies. The detailed reasoning is well known in LDO design and is omitted
here for conciseness. There are then two possible positions for p2: one is between
p1 and p3, p4; the other is beyond p3, p4. If p2 is at the first position, p3 and p4
should be sufficiently higher than p2 to prevent a severe phase drop near p2 that
endangers stability. When Ct/N is increased, according to the root locus of p3, p4 with
respect to Ct/N, p3 and p4 slide down to lower frequencies. Then, either increasing Iq
to raise p3 and p4, or enlarging C2 or even C1 to lower p1 and p2, helps to increase
the phase margin. If p2 is at the second position, a high DC loop gain should
be avoided so that the unity-gain frequency stays below p3, p4. Otherwise, if Ct/N
is small enough, p3 and p4 may cluster with p2, which degrades stability. In
this case, enlarging C1 or reducing the DC loop gain Hol_DC helps. However, this
degrades the DC, low- and mid-frequency Zout and PSR.
Fig. 50. Stability check reasoning procedure.
Fig. 51. Phase margin (degrees) vs. Rs.
b. LDO-Decap-BC System Co-Design
Once the buck converter is introduced into the power delivery system co-design
framework, the following aspects are affected:
• Power: In this case, the buck converter is no longer assumed to be ideal and
its power eﬃciency is εb. Therefore, the entire system power eﬃciency εs is
expressed as
εs = εbεld, (3.36)
where εld is the power eﬃciency for the LDO-decap system which can be ob-
tained by (3.32). Moreover, the ground power Pg has to include the power loss
of the buck converter at zero-load state Pb0.
Pg = Pq + Pc + Pb0. (3.37)
• Noise: The output voltage of the buck converter (the input voltage to the
LDO-decap system) has intrinsic voltage ripple due to the charge and discharge
operations. This ripple would inﬂuence the output voltage of the LDO. The
power supply rejection of LDO determines how well the network driven by the
LDO is isolated from the buck converter output voltage ripple and is important
to suppressing the mid-frequency resonance induced by the package parasitic
inductance and on-chip decap. As can be seen from (3.19), PSR can be improved
by increasing loop gain Hol in which the decap Cd plays an important part.
• On-board component cost: The inductor (Lf) and capacitor (Cf) at the output
stage of the buck converter mainly determine the accuracy and ripple of the
BC output voltage. However, their values cannot be too large since there are
costs associated with them. Although the cost function is complex, in general
we assume the costs are proportional to the Lf and Cf values.
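The product form of (3.36) can be made explicit with a short derivation. It rests on one assumption that the text does not state directly: the load power of the buck converter in (3.23) is taken to be the input power P of the LDO-decap system in (3.27), i.e., any loss between the converter output and the LDO inputs is neglected.

```latex
% Assumption: P_{load} = P (buck converter load power equals LDO-decap input power).
\[
\varepsilon_s
  = \frac{P_t}{P + P_b}
  = \frac{P}{P + P_b} \cdot \frac{P_t}{P}
  = \varepsilon_b \, \varepsilon_{ld},
\qquad
\varepsilon_b = \frac{P_{load}}{P_{load} + P_b}, \quad
\varepsilon_{ld} = \frac{P_t}{P}.
\]
```

As a numerical illustration, the converter efficiency of 87% quoted for the initial design in the experiments, combined with a hypothetical LDO-decap efficiency of 80%, would give a system efficiency of roughly 0.87 × 0.80 ≈ 70%.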
3. Co-Optimization Formulation and Methodology
In order to design a PDN with the optimal performance, it is desired to use the optimal
parameters for diﬀerent network components, which naturally leads to a system-level
co-optimization problem. We ﬁrst introduce the co-optimization formulation and
methodology for the LDO-decap system and then bring the buck converter into the
picture.
a. Co-Optimization for LDO-Decap System
As the analysis in Section III.B.2.a shows, improving one design aspect does not
guarantee an improvement in the others. For a power delivery network design, in order
to achieve the most desired system performance (i.e., the best overall system
performance under a presumed weighting of noise performance, power, area, etc.),
optimal tradeoffs between the LDO and decap designs must be reached. This task leads
to the co-optimization of active on-chip regulator circuits and passive decoupling
capacitors.
In this part, the optimization formulation is introduced first, followed by an
illustration of the entire optimization flow. Finally, a multi-level optimization
strategy is proposed to efficiently handle large PDNs.
Objective Function: Assume there are K power domains {G1, . . . , GK} on chip.
Domain Gi has Mi LDO blocks and a total of Ni LDOs. The total decoupling
capacitance in domain Gi is Cti. The area overhead (A), power measurements
(εs and Pg), placement & routing overhead (R) and noise (n) of the entire system can
be expressed as:
\[ A = \sum_{i=1}^{K} \left( \alpha C_{ti} + N_i A_{li} \right), \tag{3.38} \]
\[ \varepsilon_s = \frac{\sum_{i=1}^{K} P_{ti}}{\sum_{i=1}^{K} P_i}, \tag{3.39} \]
\[ P_g = \sum_{i=1}^{K} \left( P_{qi} + P_{ci} \right), \tag{3.40} \]
\[ R = \sum_{i=1}^{K} \gamma M_i, \tag{3.41} \]
\[ n = \sum_{i=1}^{K} \sum_{j=1}^{L_i} n_{ji}, \tag{3.42} \]
where α is the capacitance area ratio; Ali is the area of one LDO in domain Gi; γ is
the placement & routing overhead coefficient; Li is the number of nodes in Gi; and nji
is the noise violation integral for node j defined in [38] [52]:
\[ n_{ji} = \int_0^T \max \left\{ v_l - v_{ji}(t),\; v_{ji}(t) - v_u,\; 0 \right\} dt, \tag{3.43} \]
where vu and vl are the upper and lower bounds of the supply voltage, respectively. A
boolean variable S is introduced to represent the stability of the system (1 for stable
and 0 for unstable).
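For a sampled transient waveform, the noise violation integral in (3.43) can be approximated numerically. The sketch below applies the trapezoidal rule to the sampled violation; this discretization, the function name and the sample waveforms are assumptions for illustration, with the ±50mV bound around 1 V taken from the experimental setup.

```python
def noise_violation_integral(times, volts, v_l, v_u):
    """Trapezoidal approximation of (3.43) for one node.

    times and volts are equally long sequences of sample times (s)
    and node voltages (V); v_l and v_u are the lower and upper bounds."""
    violation = [max(v_l - v, v - v_u, 0.0) for v in volts]
    total = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        total += 0.5 * (violation[k - 1] + violation[k]) * dt
    return total


t = [0.0, 0.5e-9, 1.0e-9]
inside = noise_violation_integral(t, [1.00, 0.97, 1.03], v_l=0.95, v_u=1.05)
droop = noise_violation_integral(t, [0.94, 0.94, 0.94], v_l=0.95, v_u=1.05)
```

A waveform that stays within the bounds contributes exactly zero, while a constant 10 mV undershoot lasting 1 ns contributes about 1e-11 V·s; the total noise n in (3.42) sums this quantity over all nodes and domains.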
Since these terms have different units and nominal values, normalization is
applied. The optimization then minimizes the objective function f, which can be
expressed as
\[ f = a_1 \frac{A}{A'} + a_2 \frac{1 - \varepsilon_s}{1 - \varepsilon_s'} + a_3 \frac{P_g}{P_g'} + a_4 \frac{R}{R'} + a_5 \frac{n}{n'} + a_6 \frac{S}{S'}, \tag{3.44} \]
where {a1, . . . , a6} and {A′, ε′s, P′g, R′, n′, S′} are the weights and nominal values
for area, power efficiency, ground power, placement & routing overhead, noise, and
stability, respectively. Assigning a large weight to a term makes the optimizer favor
that term. Since stability must be assured and the noise violation
must be zero, their weights a5 and a6 are assigned large values.
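The normalized weighted sum of (3.44) can be sketched as below. One deliberate deviation is labeled in the code: instead of adding the a6 S/S′ term, the sketch treats stability as a hard gate and returns infinity for an unstable design, which mirrors the optimization flow where unstable candidates are not simulated. That choice, the dictionary keys and all numeric values are assumptions for illustration, not the dissertation's implementation.

```python
import math


def objective(metrics, nominal, weights, stable=True):
    """Weighted, normalized objective in the spirit of (3.44).

    Assumption: stability is handled as a hard gate rather than
    through the a6 * S / S' term of the original formulation."""
    if not stable:
        return math.inf
    normalized = {
        "A": metrics["A"] / nominal["A"],
        "eff": (1.0 - metrics["eff"]) / (1.0 - nominal["eff"]),
        "Pg": metrics["Pg"] / nominal["Pg"],
        "R": metrics["R"] / nominal["R"],
        "n": metrics["n"] / nominal["n"],
    }
    return sum(weights[name] * normalized[name] for name in normalized)


# Hypothetical nominal values and weights; noise gets a large weight.
nominal = {"A": 3.0, "eff": 0.78, "Pg": 100.0, "R": 42.0, "n": 1e-12}
weights = {"A": 1.0, "eff": 1.0, "Pg": 1.0, "R": 1.0, "n": 100.0}
at_nominal = objective({**nominal, "n": 0.0}, nominal, weights)
rejected = objective(nominal, nominal, weights, stable=False)
```

A design that matches every nominal value and has zero noise violation scores 4.0 with these weights, and any nonzero noise violation is amplified by the large noise weight.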
Optimization Variables: In principle, any design parameter associated with the
LDOs and decaps can be considered as an optimization variable, which would make the
optimization space huge and the runtime cost unbearably high. Therefore, without
loss of generality, a few assumptions are made to reduce the optimization
complexity:
• The LDO blocks are evenly distributed in the power delivery network, and each
block contains the same number of LDOs: Mi = Xi × Yi and Ni = Xi × Yi × Zi.
• The LDOs in the same domain are identical.
• The locations of decaps are fixed, and the amount of decap at each
location can be scaled up or down proportionally within a certain range.
In summary, for each power domain Gi, the optimization variables are the number
of LDO circuit blocks (Mi, namely Xi and Yi), the number of LDOs in each LDO
block (Zi), and the total decoupling capacitance (Cti). On the LDO side, according to
the design analysis presented in Section III.B.1, we choose the following subset of
electrically important design parameters: the width of the LDO pass transistor (w1i),
which determines the maximum load current and the dropout voltage; the transistor
widths and length at the output stage of the LDOs (w2i, w3i, l2i), which decide the
quiescent current and the output impedance in the light-load situation; and the
compensation capacitors of the LDOs (C1i, C2i), which ensure stability. All the
optimization variables are bounded. Therefore, the co-optimization problem is a
bound-constrained optimization problem.
Fig. 52. LDO-decap system co-optimization flow.
Optimization Flow: Asynchronous Parallel Pattern Search (APPS) [53], which
solves unconstrained or bound-constrained optimization problems, is employed to solve
the above co-optimization problem. APPS is a search-based optimization approach
that requires no derivative information. For our problem, each search (iteration)
consists of three steps to evaluate the objective function, as shown in Figure 52:
1. Stability check: Based on the observations in Section III.B.2.a, we propose an
efficient stability check approach. First, the LDO-capacitor pair is generated
from the new set of variable values chosen by the optimizer. Then, a complete
pole-zero analysis of the LDO-capacitor pair is performed under various load
conditions (from zero to maximum). If a right-half-plane pole is detected,
the DC and transient simulations are skipped and the optimizer moves to the next
search.
2. DC simulation: The DC simulation is carried out by GSim, an efficient CPU-GPU
combined simulator. The system power efficiency εs and ground power
Pg are obtained.
3. Transient simulation: The noise performance n is computed from the transient
simulation, which is also performed by GSim.
The area overhead A and the placement & routing overhead R can be determined
right after the optimizer chooses a new set of values for the optimization variables.
Then, based on the value of the objective function, a new set of optimization variable
values is chosen for evaluation.
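The three-step evaluation with its early exit can be sketched as a single function that the optimizer calls for each candidate. Every callable passed in below is a hypothetical stand-in: the stubs do not model GSim or a pole-zero analysis, and the way the terms are combined is a placeholder for the normalized objective.

```python
import math


def evaluate_design(x, cheap_overheads, is_stable, dc_sim, transient_sim, combine):
    """One objective evaluation following the flow of Figure 52."""
    area, routing = cheap_overheads(x)   # A and R need no simulation
    if not is_stable(x):                 # step 1: stability check
        return math.inf                  # skip DC and transient simulation
    eff, p_g = dc_sim(x)                 # step 2: DC simulation
    noise = transient_sim(x)             # step 3: transient simulation
    return combine(area, eff, p_g, routing, noise)


# Hypothetical stubs that record which simulations actually ran.
sim_calls = []

def fake_overheads(x):
    return 2.0, 36

def fake_is_stable(x):
    return not x["rhp_pole"]

def fake_dc(x):
    sim_calls.append("dc")
    return 0.80, 74.0

def fake_transient(x):
    sim_calls.append("transient")
    return 0.0

def fake_combine(area, eff, p_g, routing, noise):
    return area + (1.0 - eff) + p_g / 100.0 + routing / 42.0 + 100.0 * noise

stubs = (fake_overheads, fake_is_stable, fake_dc, fake_transient, fake_combine)
f_unstable = evaluate_design({"rhp_pole": True}, *stubs)
calls_after_unstable = list(sim_calls)
f_stable = evaluate_design({"rhp_pole": False}, *stubs)
```

Ordering the steps from cheapest to most expensive means an unstable candidate costs only the stability check, which matters when the transient simulation dominates the runtime.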
Multigrid-Based Optimization: For the proposed optimization scheme, hundreds
of iterations may be needed to reach the optimal result. In each optimization
iteration, a costly transient simulation may be carried out, and most of the
simulation time is spent on solving the large on-chip grids. Therefore, when
optimizing a large power delivery network with multi-million on-chip nodes, the
excessive runtime becomes a bottleneck. To address this problem, a multigrid-based
optimization strategy is proposed to significantly reduce the optimization runtime
for large PDNs.
The multigrid method is a powerful and effective method for power grid
problems of very high complexity, for example power grid simulation
[9] [10] [11] [12] [13] [14] and wire sizing optimization [28]. The basic idea of
this method is to reduce the original large-scale problem to a much smaller problem
through a coarsening process. The small-scale problem is then solved. Finally, the
solution of the original problem is computed from the solution of the small-scale
problem, which is mapped back to the original problem. The benefits of this method are
significant runtime and memory savings. This idea is adopted for the proposed
active-passive co-optimization to reduce the optimization runtime for large power
delivery networks.
Fig. 53. Multigrid-based optimization.
As illustrated in Figure 53, the proposed multigrid method consists of three
steps:
1. Generate a coarse PDN using the methods presented in [13].
2. Carry out the optimization for the coarse PDN.
3. Use the optimum variable values for the coarse PDN as the initial values for
the optimization of the original PDN.
Since the coarse PDN has the same basic electric properties as the original PDN, such
as load current, total decoupling capacitance and branch resistance, the optimization
results for the coarse PDN are close to the results for the original PDN. Hence, using
the results of the coarse PDN as the starting point, only a small number of
optimization iterations may be required to reach the final optimum of the original PDN.
The experimental results show that this multigrid-based optimization
strategy can significantly reduce the optimization runtime.
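The warm-start idea behind the three steps can be demonstrated on a toy problem. The sketch below uses a simple bound-constrained compass search as a stand-in for APPS, and two slightly different quadratic functions as stand-ins for the coarse and original PDN objectives; no PDN reduction or circuit simulation is modeled, and every name and number is hypothetical.

```python
def compass_search(f, x0, lower, upper, step=1.0, tol=1e-3, max_iter=10000):
    """Derivative-free compass search with bound clipping.

    Returns the best point, its objective value and the iteration count."""
    x, fx, iters = list(x0), f(x0), 0
    while step > tol and iters < max_iter:
        iters += 1
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                y = list(x)
                y[i] = min(max(y[i] + delta, lower[i]), upper[i])
                fy = f(y)
                if fy < fx:
                    x, fx, improved = y, fy, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5    # shrink the pattern when no poll point improves
    return x, fx, iters


def coarse_objective(x):   # toy stand-in for the coarse PDN objective
    return (x[0] - 3.0) ** 2 + (x[1] - 1.0) ** 2

def fine_objective(x):     # toy stand-in for the original PDN objective
    return (x[0] - 3.1) ** 2 + (x[1] - 0.9) ** 2

lower, upper, start = [0.0, 0.0], [20.0, 20.0], [10.0, 10.0]
x_coarse, _, coarse_iters = compass_search(coarse_objective, start, lower, upper)
x_warm, _, warm_iters = compass_search(fine_objective, x_coarse, lower, upper)
x_cold, _, cold_iters = compass_search(fine_objective, start, lower, upper)
```

Because the coarse optimum is already close to the fine optimum, the warm-started run needs fewer iterations on the expensive objective than the cold start, and the extra coarse iterations are cheap by construction.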
b. Co-Optimization for LDO-Decap-BC System
After introducing the buck converter into the co-optimization framework, the
following modifications to the optimization must be made (the notation is the same
as in the LDO-decap co-optimization):
• Objective function: According to (3.36) and (3.37), the system power efficiency
and ground power should include the effect of the buck converter:
\[ \varepsilon_s = \varepsilon_b \frac{\sum_{i=1}^{K} P_{ti}}{\sum_{i=1}^{K} P_i}, \tag{3.45} \]
\[ P_g = P_{b0} + \sum_{i=1}^{K} \left( P_{qi} + P_{ci} \right). \tag{3.46} \]
Moreover, the term a7 Lf/L′f + a8 Cf/C′f, which reflects the on-board component cost,
should be added to the objective function f. Here a7 and a8 are the weight
coefficients, while L′f and C′f are the nominal values for the BC inductor and
capacitor.
• Optimization variables: Besides the existing optimization variables listed in
Section III.B.3.a, four new key parameters of the buck converter, namely the switching
frequency fs, duty cycle D, output filter inductor Lf and capacitor Cf, are
introduced as optimization variables. Furthermore, four transistor size parameters
of the LDO that impact the PSR are also included as optimization variables.
• Optimization flow: The basic optimization flow is almost the same as presented
in Section III.B.3.a, except that at the beginning of each optimization search
(iteration), the input DC voltage Vbc_avg and the input voltage waveform
(with ripple ΔV) to the LDO-decap system, the buck converter power efficiency εb
and the zero-load power loss Pb0 are computed from the behavioral model of the
buck converter given in (3.20)-(3.23).
It should be noted that the multi-level optimization strategy can still be applied here.
4. Experiment Results
The LDO and decoupling capacitors are implemented using a commercial 90nm
technology. The PDN simulator GSim has been implemented in CUDA [44] and C++.
The optimization is carried out using APPSPACK 5.0.1 [53]. The GPU program is
executed on an NVIDIA Tesla C1060 with a total on-board memory of 4GB. All
the C++ programs are executed on a workstation with an Intel Xeon CPU at 2.33GHz
and 4GB RAM running a 64-bit Linux OS.
A test PDN with 1M on-chip nodes is used to illustrate the co-optimization
benefits and various kinds of design tradeoffs. The test PDN is generated for a chip
with 26mm² area, 2W power consumption, 1.2V supply voltage from the on-board buck
converter, 1V on-chip voltage, 2A average current and 4A peak current. The transition
time for each current load is 500ps. The noise bound is set to ±50mV. In the
initial design, there are 40nF of decoupling capacitance and 42 LDO blocks (1 LDO
in each block). Gate-oxide capacitors are used for decaps and LDO compensation
capacitors. Without loss of generality, the entire chip is treated as a single power
domain. It should be noted that for all the optimizations, stability and zero noise
violation are strictly enforced.
a. LDO-Decap System Co-Optimization
The following experimental results are obtained only for the LDO-decap system co-
optimization (in Section III.B.3.a) which assumes an ideal on-board buck converter.
Co-Optimization vs. Uni-Optimization: Starting from the same initial design,
three optimizations are carried out: decap uni-optimization, LDO uni-optimization,
and decap & LDO co-optimization. The results are presented in Figure 54.
• Decap uni-optimization: Blindly optimizing decaps without any stability checking
can easily lead to instability. With the proposed stability check in place,
since the LDO design cannot be adjusted to follow the change in capacitive load,
the total decap can only be reduced to 30nF before stability is lost.
• LDO uni-optimization: Although the number of LDO blocks is reduced to 36
and each LDO is sized to have the total quiescent power as low as 9.3mW, due
to the initial large decap, the area is almost the same as the initial design and
the leakage power remains at 76.6mW.
• LDO & decap co-optimization: On one hand, since the LDO design parameters
can be adjusted to avoid stability problem, the decap is reduced to 18nF and
the decap leakage power is cut down to 46mW. On the other hand, the number
of LDO blocks is reduced to 36. Although compared with the results of LDO
uni-optimization, the quiescent power is as large as 27.8mW, with the decap
leakage power reduction, the total ground power is as low as 73.8mW.
As can be observed, compared with uni-optimization, co-optimization has the
advantage of mutually and simultaneously adjusting the decap and LDO designs, and its
optimum result is significantly better (or no worse) in every design aspect. Therefore,
co-optimization is indispensable for achieving the system-level optimum.
Fig. 54. Optimization results of decap uni-optimization, LDO uni-optimization, and co-optimization. (a) LDO and decap
area in mm². (b) System power efficiency. (c) Ground power in mW. (d) Number of LDO blocks.
Design Tradeoffs: Different sets of weights are used to drive the optimization
towards different design aspects: area, power, placement & routing, and noise. These
optimization results are compared with the result obtained using balanced
weights. For the noise optimization, since the initial noise bound is very tight, the
noise bound is relaxed to ±100mV to observe the trend. The results are presented
in Figure 55.
• Area optimization: The total decap is reduced to 8nF, which significantly cuts
down the area and leakage power (65.1% reduction). However, the number of LDO
blocks increases (16.7% increase) and the LDOs are sized up to suppress
the noise, which leads to more quiescent power (107% increase).
• Power optimization: The decap amount (12nF) and the LDO sizes are reduced,
but the number of LDO blocks increases (16.7% increase). The ground power
is as small as 58.3mW (21% reduction).
• Placement & routing optimization: The number of LDO blocks is cut down to
20 (40% reduction, 2 LDOs in each block). However, more decap and larger
LDOs are required to keep the noise down. The area overhead is increased by 65%
and the ground power is increased by 62%.
• Noise optimization: With the relaxed noise constraint, noticeable area, power and
placement & routing reductions (41.4%, 32.9% and 16.7%, respectively) are
observed. As can be inferred, when the noise bound becomes tighter, the area,
power and placement & routing overheads increase.
As can be seen, reducing the overhead in one aspect may increase the overheads in
other aspects. Hence, based on the design specifications, the weights for area, power,
and placement & routing must be set carefully.
Fig. 55. Optimization results of favoring different design aspects. (a) LDO and decap area in mm². (b) System power
efficiency. (c) Ground power in mW. (d) Number of LDO blocks.
Deep Trench Decoupling Capacitor: The deep trench decoupling capacitor is a novel
decap that is fabricated in a similar manner to the bulk eDRAM deep trench capacitor
[54]. The advantages of this technology over the typical planar gate-oxide decoupling
capacitor are its high capacitance per unit area and its low leakage current. For
example, in the commercial 90nm technology we use, the planar gate-oxide capacitor
provides only 16fF of capacitance and draws 30nA of leakage current per µm².
In contrast, the deep trench decoupling capacitor provides 112fF of
capacitance per µm² with negligible leakage current (less than 0.05pA).
The typical gate-oxide technology and the deep trench technology are compared
by using the same set of weights for the optimizations. The results
are presented in Figure 56. Without considering the manufacturing costs, the electric
benefits of using deep trench decoupling capacitors are obvious. Since the deep trench
decap is not as area-costly as the gate-oxide decap, more decap can be placed on chip
with much less area (36nF of decap in only 0.32mm²). Moreover,
with the strong decap to suppress the noise, the number of LDOs and the LDO sizes can
be reduced, which results in less quiescent power and placement & routing overhead.
There are only 30 LDO blocks (1 LDO per block), and the quiescent power is as low
as 23.2mW. Notice that there is almost no leakage power consumption from the deep
trench decap.
In summary, using deep trench decoupling capacitors would significantly enhance
the power delivery system performance, but the hardware cost may be high.
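The area figures above follow directly from the capacitance densities. The short sketch below converts a decap amount into silicon area using the quoted densities of 16fF and 112fF per square micrometer; the function name is hypothetical and the calculation ignores any layout overhead.

```python
def decap_area_mm2(cap_farads, density_f_per_um2):
    """Area in mm^2 needed for a given capacitance at a given density."""
    area_um2 = cap_farads / density_f_per_um2
    return area_um2 * 1e-6    # 1 mm^2 = 1e6 um^2


deep_trench_area = decap_area_mm2(36e-9, 112e-15)   # 36nF of deep trench decap
gate_oxide_area = decap_area_mm2(36e-9, 16e-15)     # same amount in gate oxide
```

The deep trench case evaluates to about 0.32 mm², which matches the area reported above for 36nF, whereas the same capacitance in planar gate oxide would need seven times that area.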
Multi-Level Optimization Strategy: The multi-level optimization strategy is
applied to a PDN with 5.3M on-chip nodes. The coarse PDN is generated from the
original PDN, and the total number of on-chip nodes is reduced to 270K.
The optimization iterations and runtime for the coarse PDN and the original
PDN are shown in Table XII. It can be seen that by using the optimization results
of the coarse PDN as the initial values, the number of optimization iterations for
the original PDN is significantly reduced. Compared with the straight optimization
(without multigrid), a speedup of about 4.3X is reached. This shows
the effectiveness of the multi-level method for solving very
large PDN co-optimization problems.
Fig. 56. Optimization results for using planar gate-oxide decoupling capacitance vs. deep trench decoupling capacitance.
(a) LDO and decap area in mm². (b) System power efficiency. (c) Ground power in mW. (d) Number of LDO
blocks.
Table XII. Multi-level optimization vs. straight optimization.
                           Multigrid-Based Optimization
Straight Optimization      Coarse PDN         Original PDN
# Iter.    Time            # Iter.   Time     # Iter.   Time      SP
290        64.7h           274       8.7h     34        6.4h      4.3X
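The speedup (SP) in Table XII is consistent with comparing the straight runtime against the sum of the coarse and original PDN runtimes of the multigrid-based flow. That the two multigrid runtimes are simply added is an interpretation of the table rather than something stated explicitly:

```latex
\[
\mathrm{SP} = \frac{T_{\text{straight}}}{T_{\text{coarse}} + T_{\text{original}}}
            = \frac{64.7\,\mathrm{h}}{8.7\,\mathrm{h} + 6.4\,\mathrm{h}}
            = \frac{64.7}{15.1} \approx 4.3 .
\]
```

Most of the saving comes from the original PDN needing only 34 iterations instead of 290, while the 274 coarse iterations are cheap because the coarse PDN has far fewer nodes.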
b. LDO-Decap-BC System Co-Optimization
From the previous optimization results, it is observed that the system power
efficiency does not vary noticeably. This is because the on-board buck converter, which
supplies the LDO-decap system, is not considered in the co-optimization
framework, and optimizing only the LDO-decap system brings little improvement
in power efficiency. Therefore, we bring the buck converter into the
system co-optimization as described in Section III.B.3.b.
Assume the initial buck converter design has Vin = 3.6V, fs = 5MHz, D =
0.389, Lf = 2µH, Cf = 2µF, a power efficiency εb = 87%, an average output
voltage Vbc_avg = 1.2V, and a ripple ΔV = 1mV. The optimization that considers only
the LDO-decap system and the optimization that considers the LDO-decap system
together with the buck converter are run with the same set of weights. As shown in
Figure 57, after introducing the on-board buck converter into the co-optimization,
Vbc_avg is reduced to as low as 1.096V, and the system power efficiency (79.7%) is
significantly better than that of the LDO-decap system co-optimization (69.8%).
Moreover, the LDO PSR is strengthened to tolerate a larger ripple ΔV = 6mV, so that
Lf and Cf are reduced to 1µH and 0.94µF, respectively. However, the gain in power
efficiency does not come for free. By lowering the input voltage, the LDO moves towards
the dropout region, so it has weaker regulation capability against load variation.
Hence, with a low input voltage, slightly more decap and LDO resources are needed
to maintain good noise performance. In summary, including the buck converter in the
system co-optimization can significantly improve the overall power efficiency.
Fig. 57. Optimization results of LDO-decap-BC system co-optimization vs. LDO-decap system co-optimization. (a) LDO
and decap area in mm². (b) System power efficiency. (c) Ground power in mW. (d) Number of LDO blocks.
5. Summary
In this section, a comprehensive analysis of the electric interactions
between voltage regulators and decoupling capacitance is given, and their mutual
influences and design tradeoffs on various system design aspects, such as power, noise,
area, placement and routing, as well as stability, are studied in detail. A thorough
system-level co-optimization scheme that can simultaneously optimize active regulators
and decaps is proposed. The results demonstrate the significant performance
improvements brought by the holistic system co-optimization.
CHAPTER IV
CONCLUSION
This dissertation presents new methodologies and approaches to address a series of
challenging issues in power delivery network analysis, veriﬁcation and design.
On the analysis aspect, a parallel partitioning-based power grid analysis approach
using spatial locality is presented. The main factors that affect the solution
process have been identified, namely the number of partitions and the window size.
The interdependence of these parameters and their influence on the execution time
have been analyzed. A strategy that helps users determine the optimal (or
near-optimal) values of these parameters to achieve the lowest parallel runtime has
also been suggested. The proposed approach is shown to have excellent parallel efficiency,
fast convergence, ﬂexible partitioning, and favorable scalability. By using distributed
computing networks, it is believed to be able to handle extremely large power grids
(with many-million nodes) in an eﬃcient way. Moreover, a fast CPU-GPU combined
simulation engine, GSim, has been developed to provide SPICE-level accuracy for
simulating complex PDNs employing on-chip voltage regulation techniques. GSim
identiﬁes the simulation diﬃculties for diﬀerent circuit blocks and achieves its eﬃ-
ciency by using a block-based Gauss-Seidel relaxation scheme to integrate three fast
simulation strategies together. These simulation strategies are speciﬁcally designed
for three types of circuit blocks in the PDN. Most importantly, GSim provides a
foundation to comprehensively analyze various design tradeoﬀs for PDNs with on-
chip voltage regulation.
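The block-based Gauss-Seidel relaxation idea can be illustrated with a minimal
sketch. It is not the GSim implementation: it assumes a toy one-dimensional
resistive grid whose two ends are tied to VDD, and every name and parameter
(block_gauss_seidel, solve_tridiagonal, g, i_load, block_size) is hypothetical.
Each block is solved directly while its neighbors are held at their most
recently computed voltages, and sweeps repeat until the voltages stop changing.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: direct solve of one tridiagonal block.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def block_gauss_seidel(n, g, vdd, i_load, block_size, tol=1e-9, max_sweeps=10000):
    """Solve the toy grid by block Gauss-Seidel relaxation.

    Assumed toy model: n nodes in a chain, conductance g between neighbors
    and from each end node to VDD, and a load current i_load at every node.
    Returns (node voltages, number of sweeps used).
    """
    v = [vdd] * n  # initial guess: no voltage drop anywhere
    blocks = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        for start, end in blocks:
            m = end - start
            rhs = [-i_load] * m
            # Neighbor voltages come from the latest iterate (Gauss-Seidel).
            left = vdd if start == 0 else v[start - 1]
            right = vdd if end == n else v[end]
            rhs[0] += g * left
            rhs[-1] += g * right
            new = solve_tridiagonal([-g] * m, [2.0 * g] * m, [-g] * m, rhs)
            for k in range(m):
                max_change = max(max_change, abs(new[k] - v[start + k]))
                v[start + k] = new[k]
        if max_change < tol:
            return v, sweep
    raise RuntimeError("block Gauss-Seidel did not converge")
```

In this toy setting a single block covering the whole grid reduces to one direct
solve, while smaller blocks trade more sweeps for cheaper per-block solves.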
In terms of verification, a simulation-based transient verification approach has
been proposed. Specific circuit modeling techniques have been developed to
individually verify each of the on-chip global and local power grids against given
electromigration and voltage drop constraints. To make verification feasible, the
proposed approach allows the use of fast superposition approximation methods to
identify the top worst-case candidates. These candidates are later validated by a
small number of full simulations.
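The screen-then-validate flow described above can be sketched as follows. This
is only an illustration under stated assumptions, not the verification engine
of this work: the per-block voltage drops in UNIT_DROPS are made-up numbers,
toy_full_simulation is a hypothetical stand-in for a real transient simulation,
and the exhaustive enumeration of on/off configurations is practical only for a
handful of gated blocks.

```python
from itertools import product

# Hypothetical drop (in volts) at one observed node when each gated block
# is switched on alone; in practice these would come from simulation.
UNIT_DROPS = [0.020, 0.035, 0.015, 0.030]


def superposition_estimate(config, unit_drops):
    """Approximate the drop as the sum of the drops of the active blocks."""
    return sum(drop for on, drop in zip(config, unit_drops) if on)


def toy_full_simulation(config):
    """Stand-in for a full simulation (assumption, not a real simulator).

    Adds a made-up 2% interaction penalty per additional active block so
    that the result differs from the pure superposition estimate.
    """
    base = superposition_estimate(config, UNIT_DROPS)
    return base * (1.0 + 0.02 * max(sum(config) - 1, 0))


def screen_and_validate(unit_drops, full_simulation, limit, top_k=3):
    """Rank all on/off configurations cheaply, then validate the top few."""
    configs = list(product((False, True), repeat=len(unit_drops)))
    ranked = sorted(
        configs,
        key=lambda c: superposition_estimate(c, unit_drops),
        reverse=True,
    )
    candidates = ranked[:top_k]
    # Only the shortlisted candidates pay for a full simulation.
    results = {c: full_simulation(c) for c in candidates}
    worst = max(results, key=results.get)
    return worst, results[worst], results[worst] <= limit
```

A call such as screen_and_validate(UNIT_DROPS, toy_full_simulation, limit=0.1)
returns the worst validated configuration, its simulated drop, and whether that
drop meets the limit.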
Finally, for PDN design, a novel partition-based, locality-driven two-step
optimization scheme has been developed for power grid wire sizing. This
scheme exploits the locality of power grids for scalable optimization of large grids.
In the first step of the proposed approach, optimal (or near optimal) partition
boundary voltages and currents are obtained via localized window-based optimization.
In the second step, partitions are individually optimized under the obtained boundary
conditions. The divide-and-conquer nature of the proposed method not only leads
to its favorable scalability but also makes it possible to employ the growing parallel
computing resources to facilitate the optimization of large power grids. In addition,
a comprehensive analysis of the electrical interaction between voltage
regulators/converters and on-chip decoupling capacitors has been given. The mutual
influences and design tradeoffs of the active regulators/converters and the passive
decoupling capacitors on various system design aspects, such as power, noise, area,
placement and routing, as well as stability, are studied in detail. A thorough
system-level simulation-based co-optimization scheme that simultaneously optimizes
regulators and decaps has been proposed. The results demonstrate the significant
performance improvements brought by the holistic system co-optimization.
REFERENCES
[1] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective.
Boston, MA: Addison Wesley, 2010.
[2] Q. K. Zhu, Power Distribution Network Design for VLSI. Hoboken, NJ: John
Wiley & Sons, 2004.
[3] J. Shah, “Floorplanning a power delivery network with spice”, 2008
[Online]. Available: http://electronicdesign.com/article/eda/floorplanning-a-power-delivery-network-with-spice1.aspx.
[4] K. Arabi, R. Saleh, and X. Meng, “Power supply noise in SoCs: metrics,
management, and measurement”, IEEE Design & Test of Computers, vol. 24, no. 3,
pp. 236-244, 2007.
[5] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A
Design Perspective. Upper Saddle River, NJ: Prentice Hall, 2004.
[6] M. S. Gupta, J. L. Oatley, R. Joseph, G. Wei, and D. M. Brooks, “Understanding
voltage variations in chip multiprocessors using a distributed power-delivery
network”, In Proc. IEEE/ACM DATE, 2007, pp. 783-790.
[7] H. H. Chen and D. D. Ling, “Power supply noise analysis methodology for
deep-submicron VLSI chip design”, In Proc. IEEE/ACM DAC, 1997, pp. 638-643.
[8] T.-H. Chen and C. C.-P. Chen, “Eﬃcient large-scale power grid analysis based
on preconditioned Krylov-subspace iterative methods”, In Proc. IEEE/ACM
DAC, 2001, pp. 559-562.
[9] S. R. Nassif and J. N. Kozhaya, “Fast power grid simulation”, In Proc.
IEEE/ACM DAC, 2000, pp. 156-161.
[10] J. Kozhaya, S. Nassif, and F. Najm, “A multigrid-like technique for power grid
analysis”, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no.
10, pp. 1148-1160, 2002.
[11] H. Su, E. Acar, and S. Nassif, “Power grid reduction based on algebraic multigrid
principles”, In Proc. IEEE/ACM DAC, 2003, pp. 109-112.
[12] C. Zhuo, J. Hu, M. Zhao, and K. Chen, “Power grid analysis and optimization
using algebraic multigrid”, IEEE Trans. Comput.-Aided Design Integr. Circuits
Syst., vol. 27, no. 4, pp. 738-751, 2008.
[13] Z. Feng and P. Li, “Multigrid on GPU: tackling power grid analysis on parallel
SIMT platforms”, In Proc. IEEE/ACM ICCAD, 2008, pp. 647-654.
[14] Z. Feng, Z. Zeng, and P. Li, “Parallel on-chip power distribution network
analysis on multi-core-multi-GPU platforms”, IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 19, no. 10, pp. 1823-1836, 2011.
[15] M. Zhao, R. V. Panda, S. S. Sapatnekar, T. Edwards, R. Chaudhry and D.
Blaauw, “Hierarchical analysis of power distribution networks”, In Proc.
IEEE/ACM DAC, 2000, pp. 150-155.
[16] M. Zhao, R. Panda, S. Sapatnekar, and D. Blaauw, “Hierarchical analysis of
power distribution networks”, IEEE Trans. Comput.-Aided Design Integr. Circuits
Syst., vol. 21, no. 2, pp. 159-168, 2002.
[17] K. Sun, Q. Zhou, K. Mohanram, and D. Sorensen, “Parallel domain decomposition
for simulation of large-scale power grids”, In Proc. IEEE/ACM ICCAD,
2007, pp. 54-59.
[18] H. Qian, S. R. Nassif, and S. S. Sapatnekar, “Power grid analysis using random
walks”, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no.
8, pp. 1204-1224, 2005.
[19] E. Chiprout, “Fast ﬂip-chip power grid analysis via locality and grid shells”, In
Proc. IEEE/ACM ICCAD, 2004, pp. 485-488.
[20] Y. Zhong and M. D. F. Wong, “Fast algorithms for IR drop analysis in large
power grid”, In Proc. IEEE/ACM ICCAD, 2005, pp. 351-357.
[21] J. Shi, Y. Cai, W. Hou, L. Ma, S.-D. Tan, P.-H. Ho, and X. Wang, “GPU
friendly fast Poisson solver for structured power grid network analysis”, In Proc.
IEEE/ACM DAC, 2009, pp. 178-183.
[22] F. Najm, M. Nizam, and A. Devgan, “Power grid voltage integrity veriﬁcation”,
In Proc. ACM/IEEE ISLPED, 2005, pp. 239-244.
[23] I. A. Ferzli, F. N. Najm and L. Kruse, “A geometric approach for early power
grid verification using current constraints”, In Proc. IEEE/ACM ICCAD, 2007,
pp. 40-47.
[24] I. Ferzli, F. Najm, and L. Kruse, “Early power grid veriﬁcation under circuit
current uncertainties”, In Proc. ACM/IEEE ISLPED, 2007, pp. 116-121.
[25] I. Ferzli, E. Chiprout, and F. Najm, “Veriﬁcation and co-design of the package
and die power delivery system using wavelets”, In Proc. IEEE EPEP, 2008, pp.
7-10.
[26] A. Todri, M. Marek-Sadowska, and S. Chang, “Analysis and optimization
of power-gated ICs with multiple power gating conﬁgurations”, In Proc.
IEEE/ACM ICCAD, 2007, pp. 783-790.
[27] X. D. S. Tan, C. J. R. Shi, and J. C. Lee, “Reliability-constrained area
optimization of VLSI power/ground networks via sequence of linear programmings”,
IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 22, no. 12, pp.
1678-1684, 2003.
[28] K. Wang and M. Marek-Sadowska, “On-chip power supply network optimization
using multigrid-based technique”, IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 24, no. 3, pp. 407-417, 2005.
[29] J. Singh and S. S. Sapatnekar, “Partition-based algorithm for power grid design
using locality”, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol.
25, no. 4, pp. 664-677, 2007.
[30] P. Hazucha, G. Schrom, J. Hahn, B. Bloechel, P. Hack, G. Dermer, S. Narendra,
D. Gardner, T. Karnik, V. De, and S. Borkar, “A 233-MHz 80%-87% eﬃcient
four-phase DC-DC converter utilizing air-core inductors on package”, IEEE J.
Solid-State Circuits, vol. 40, no. 4, pp. 838-845, 2005.
[31] J. Wibben and R. Harjani, “A high efficiency DC-DC converter using 2nH
on-chip inductors”, In Proc. IEEE VLSIC, 2007, pp. 22-23.
[32] K. N. Leung and P. Mok, “A capacitor-free CMOS low-dropout regulator with
damping-factor-control frequency compensation”, IEEE J. Solid-State Circuits,
vol. 38, no. 10, pp. 1691-1702, 2003.
[33] P. Hazucha, T. Karnik, B. Bloechel, C. Parsons, D. Finan, and S. Borkar,
“Area-efficient linear regulator with ultra-fast load regulation”, IEEE J. Solid-State
Circuits, vol. 40, no. 4, pp. 933-940, 2005.
[34] R. Milliken, J. Silva-Martinez, and E. Sanchez-Sinencio, “Full on-chip CMOS
low-dropout voltage regulator”, IEEE Trans. Circuits Syst. I, vol. 54, no. 9, pp.
1879-1890, 2007.
[35] J. Guo and K. N. Leung, “A 6uW chip-area-eﬃcient output-capacitorless LDO
in 90nm CMOS technology”, IEEE J. Solid-State Circuits, vol. 45, no. 9, pp.
1896-1905, 2010.
[36] C. Shi, B. Walker, E. Zeisel, B. Hu, and G. McAllister, “A highly integrated
power management IC for advanced mobile applications”, IEEE J. Solid-State
Circuits, vol. 42, no. 8, pp. 1723-1731, 2007.
[37] M. Hammes, C. Kranz, D. Seippel, J. Kissing, and A. Leyk, “Evolution on SoC
integration: GSM baseband-radio in 0.13um CMOS extended by fully integrated
power management unit”, IEEE J. Solid-State Circuits, vol. 43, no. 1,
pp. 236-245, 2008.
[38] H. Su, S. Sapatnekar, and S. Nassif, “Optimal decoupling capacitor sizing and
placement for standard-cell layout designs”, IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., vol. 22, no. 4, pp. 428-436, 2003.
[39] S. R. Nassif, “Power grid analysis benchmarks”, In Proc. IEEE/ACM ASPDAC,
2008, pp. 376-381.
[40] J. Shi, Y. Cai, S.-D. Tan, J. Fan, and X. Hong, “Pattern-based iterative method
for extreme large power/ground analysis”, IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., vol. 26, no. 4, pp. 680-692, 2007.
[41] A. George, “Nested dissection of a regular ﬁnite element mesh”, SIAM J. on
Numer. Anal., vol. 10, no. 2, pp. 345-363, 1973.
[42] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, “Algorithm 887:
CHOLMOD, supernodal sparse Cholesky factorization and update/downdate”,
ACM Trans. Math. Softw., vol. 35, no. 3, pp. 1-14, 2008.
[43] X. Cai, H. Yie, P. Osterberg, J. Gilbert, S. Senturia, and J. White,
“A relaxation/multipole-accelerated scheme for self-consistent electromechanical
analysis of complex 3-D microelectromechanical structures”, In Proc.
IEEE/ACM ICCAD, 1993, pp. 283-286.
[44] NVIDIA CUDA programming guide, 2009 [Online]. Available:
http://www.nvidia.com/object/cuda.html.
[45] C. Alexander and M. Sadiku, Fundamentals of Electric Circuits. NY: McGraw
Hill, 2000.
[46] R. Dutta and M. Marek-Sadowska, “Automatic sizing of power/ground (P/G)
networks in VLSI”, In Proc. IEEE/ACM DAC, 1989, pp. 783-786.
[47] A. Wächter and L. T. Biegler, “On the implementation of an interior point filter
line search algorithm for large-scale nonlinear programming”, Math. Program.,
vol. 106, no. 1, pp. 25-57, 2006.
[48] A. Wächter, C. Visweswariah and A. R. Conn, “Large-scale nonlinear
optimization in circuit tuning”, Future Gener. Comput. Syst., vol. 21, no. 8, pp.
1251-1262, 2005.
[49] X. Wu, X. Hong, Y. Cai, Z. Luo, C. K. Cheng, J. Gu and W. W. M. Dai, “Area
minimization of power distribution network using efficient nonlinear programming
techniques”, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.,
vol. 23, no. 7, pp. 1086-1094, 2004.
[50] A. Berman and R. J. Plemmons, Nonnegative Matrices in the Mathematical
Sciences. Philadelphia, PA: SIAM, 1994.
[51] V. Kursun, S. Narendra, V. De and E. Friedman, “Analysis of buck converters
for on-chip integration with a dual supply voltage microprocessor”, IEEE Trans.
Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 3, pp. 514-522, 2003.
[52] A. Conn, R. Haring and C. Visweswariah, “Noise considerations in circuit
optimization”, In Proc. IEEE/ACM ICCAD, 1998, pp. 220-227.
[53] G. A. Gray and T. G. Kolda, “Algorithm 856: APPSPACK 4.0: asynchronous
parallel pattern search for derivative-free optimization”, ACM Trans. Math.
Softw., vol. 32, no. 3, pp. 485-507, 2006.
[54] C. Pei, R. Booth, H. Ho, N. Kusaba, X. Li, M.-J. Brodsky, P. Parries, H. Shang,
R. Divakaruni and S. Iyer, “A novel, low-cost deep trench decoupling capacitor
for high-performance, low-power bulk CMOS applications”, In Proc. IEEE
ICSICT, 2008, pp. 1146-1149.
VITA
Zhiyu Zeng received the B.S. degree in electronic and information engineering
from Zhejiang University, Hangzhou, China, in 2006. In fall 2007, he enrolled
in the Department of Electrical and Computer Engineering at Texas A&M University
as a Ph.D. student. He graduated in December 2011. His research interests
include general circuit simulation, parallel optimization of power delivery networks
and system-level design of on-chip voltage regulation.
His permanent address is: Department of Electrical and Computer Engineering,
Texas A&M University, 214 Zachry Engineering Center, TAMU 3128 College Station,
Texas 77843-3128. His email address is albertzeng0308@gmail.com.
The typist for this thesis was Zhiyu Zeng.
