CAD methodologies for low power and reliable 3D ICs by Lee, Young-Joon
CAD METHODOLOGIES FOR LOW POWER
AND RELIABLE 3D ICS

of the Requirements for the Degree
Doctor of Philosophy in the
School of Electrical and Computer Engineering
Georgia Institute of Technology
May 2013
Copyright © Young-Joon Lee 2013
CAD METHODOLOGIES FOR LOW POWER
AND RELIABLE 3D ICS
Approved by:
Dr. Sung Kyu Lim, Advisor
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Muhannad S. Bakir
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Hsien-Hsin S. Lee
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Hyesoon Kim
College of Computing
Georgia Institute of Technology
Dr. Saibal Mukhopadhyay
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Date Approved: March 18, 2013
Dedicated to my wife, my son, my daughter,
my parents, and my parents-in-law,
for their love and support.
ACKNOWLEDGEMENTS
Until I came to the completion of my doctoral study, many people helped me get through
this six-year-long endeavor. I am deeply grateful to have all these people
around me, so I feel obliged to mention all of them here.
First of all, I would like to thank my advisor, Professor Sung Kyu Lim, for his insightful
and sincere guidance on my research as well as my life. He gave me the chance to study 3D
ICs at Georgia Tech, one of the greatest schools and a research leader in the 3D IC field. I was
delighted to join his group. It was one of the major turning points in my life. I hope he
continues his success with great people.
I would like to thank Professor Hsien-Hsin S. Lee and Professor Saibal Mukhopadhyay
for their insightful suggestions and delightful comments on my research. Also, I would
like to thank Professor Muhannad S. Bakir and Professor Hyesoon Kim for serving as my
dissertation committee members. I am also grateful to Dr. Gabriel H. Loh for inspiring me
during the early period of my doctoral study.
I would like to express thanks to all my colleagues. All previous and current GTCAD
group members: Dr. Faik Baskaya, Dr. Michael Healy, Mohit Pathak, Dr. Dae Hyun Kim,
Ye Tao, Dr. Xin Zhao, Dr. Krit Athikulwongse, Moongon Jung, Chang Liu, Taigon Song,
Shreepad Panth, Hemant Sane, Dr. Daniel Limbrick, Woongrae Kim, Yarui Peng, Sandeep
Samal, Yang Wan, and Steven Zhang. I am thankful to have great friends in my everyday
life: Dae Hyun, Moongon, and Taigon. Especially, Dae Hyun has been kind and sincere
to me. And MARS and STING group members: Dr. Dong Hyuk Woo, Dr. Dean Lewis,
Tzu-Wei Wells Lin, Mohammad Hossain, Ilya Khorosh, and Guanhao Shen. We had the
glory and agony together during our 3D-MAPS projects. I am also thankful for a GREEN
group member, Kwanyeob Chae, for sharing industry experiences and research ideas.
At IFC meetings, I met the following people, to whom I am thankful: Professor Paul Kohl,
Professor Yogendra Joshi, Professor Azad Naeemi, Professor Andrei Fedorov, Dr. Kevin
Martin, and Professor Yoon Jo Kim. I am thankful to Dr. Inki Hong, who gave me a
chance to work at Cadence as an intern. During the projects with Intel, I met the following
people online and offline whom I am thankful for: Dr. Paul Fischer, Dr. Patrick Morrow,
Dr. Hong Wang, Dr. Greg Taylor, Clair Webb, Dr. Vijay Pitchumani, Dr. Debabrata
Mohapatra, and Dr. Devangkumar Jariwala.
During the study at Georgia Tech, I also met the following people whom I am thankful
for: Dr. Myunghwan Lee, Dr. Kwanghun Jung, Dr. Youngchang Yoon, Dr. Suhwan Kim,
Dr. Youngdo Jung, Dr. Hyungwook Kim, Dr. Hyunwoong Kim, Dr. Hamhee Jeon, Dr.
Nak Hee Seong, Ilseo Kim, Sungkap Yeo, and Seungbae Lee.
My lovely wife, Ji In Song, you are the one that I feel most grateful to for my doctoral
study and life. You came into my life when it was dark, lifted up my soul when I was down,
and shared my sunny and rainy days. Your presence energized me so I could get through
this long study. We together brought two beautiful human beings into this world, Matthew
and Allyson. I hope to see greater joys of life with you in the future. Ji In, Matthew, and
Allyson, I love you all.
I would like to thank my sister, Young Mi Lee, who positively affected my doctoral
study and life. I hope she finds what she is seeking. I look forward to seeing you again.
Lastly, I am thankful for my parents, Sang Hyun Lee and Ock Yeon Kim, for their love
and endless support. And I am also thankful for my parents-in-law, Dr. Chae Hyun Song
and Young Hee Kwon, for their support and understanding. There is nothing like love from
parents. We will try to return your love. I wish you long, healthy, and happy lives.
To all who were not mentioned above but had interactions with me during my doctoral
study: I am sorry to have missed you here. There is an old Korean saying: "Even a person
who brushes past your sleeve becomes your karma." I hope we encounter each other again in the
future.
After all, life is a long journey. As I have received so much from the people in my life,
I promise to return the favor wholeheartedly to those around me.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS OR ABBREVIATIONS
SUMMARY
CHAPTER I INTRODUCTION
1.1 Contributions
1.2 Organization
CHAPTER II ORIGIN AND HISTORY OF THE PROBLEM
2.1 Power Distribution Network and Thermal Interconnect Designs
2.2 Circuit Partitioning and Floorplanning for 3D ICs
2.3 Timing Optimization with Buffer Insertion
2.4 Monolithic 3D IC Designs
CHAPTER III CO-OPTIMIZATION AND ANALYSIS OF SIGNAL, POWER, AND THERMAL INTERCONNECTS IN 3D ICS
3.1 Introduction
3.2 Design and Analysis Flow
3.2.1 Signal Interconnects
3.2.2 Power Interconnects
3.2.3 Thermal Interconnects
3.2.4 Overview of Physical Design for 3D ICs
3.3 Design of Experiments and Response Surface Methodology
3.3.1 Classical DOE
3.3.2 Advanced DOE
3.3.3 Finding Best Response Models
3.3.4 Optimization with Response Surface Models
3.4 Experimental Results
3.4.1 Comparison of 2D and 3D IC Designs
3.4.2 Comparison of T-TSV and MFC Based Cooling
3.4.3 Varying One Input Factor at a Time
3.4.4 Advanced DOE - T-TSV Case
3.4.5 Advanced DOE - MFC Case
3.5 Summary
CHAPTER IV TIMING ANALYSIS AND OPTIMIZATION FOR 3D STACKED MULTI-CORE MICROPROCESSORS
4.1 Introduction
4.2 Target System
4.2.1 3D Structure
4.2.2 Architecture
4.3 Design Options
4.4 3D Timing Analysis and Optimization
4.4.1 3D Static Timing Analysis
4.4.2 3D Timing Optimization
4.5 Experimental Results
4.5.1 Initial Design Results
4.5.2 Timing Optimization
4.5.3 Impact of TSV Parasitics
4.5.4 Sub-Optimality in 3D IC Design
4.6 Summary
CHAPTER V SLEW-AWARE BUFFER INSERTION FOR THROUGH-SILICON-VIA-BASED 3D ICS
5.1 Introduction
5.2 Backgrounds
5.2.1 Structural Assumptions
5.2.2 Motivational Example
5.2.3 Delay and Slew Models
5.3 Buffer Insertion
5.3.1 Problem Definition
5.3.2 Ginneken-3D Algorithm
5.3.3 Bottom-Up Slew Propagation DP
5.4 Design Flow
5.5 Experimental Results
5.5.1 Full-Chip Results
5.5.2 Critical Path Analysis
5.5.3 Endpoint Slack Histograms
5.6 Summary
CHAPTER VI ULTRA-HIGH-DENSITY LOGIC DESIGNS USING MONOLITHIC 3D INTEGRATION
6.1 Backgrounds
6.1.1 Fabrication Process
6.1.2 Design Styles of Monolithic 3D ICs
6.2 Design Methodologies
6.2.1 Overall Design and Analysis Flow
6.2.2 Monolithic 3D Cell Design
6.2.3 Full-Chip Physical Layout
6.3 Exploration of Metal Layer Options
6.3.1 Routing Congestions in T-MI Designs
6.3.2 Impact of Additional Metal Layers
6.3.3 Impact of Reduced Metal Dimensions
6.4 Power Benefit Study
6.4.1 Benchmark Circuits and Synthesis Results
6.4.2 Layout Simulation Results
6.4.3 Circuit Characteristics Study
6.4.4 Impact of Target Clock Period
6.5 Comparison with G-MI and TSV-based 3D
6.5.1 Design Flow and Its Limitation
6.5.2 Layout Simulation Results
6.6 Summary
CHAPTER VII CONCLUSIONS
REFERENCES
PUBLICATIONS
VITA
LIST OF TABLES
Table 1 Abbreviations used in this chapter.
Table 2 Input factors used in this chapter.
Table 3 Responses used in this chapter.
Table 4 The technology and default setting parameters. The baseline only uses
top-mounted heat-sink, not T-TSVs or MFCs.
Table 5 Comparison of 2D and 3D IC designs. Congestion means number of
routing edges with 100% utilization.
Table 6 Comparison of baseline, T-TSV case, and MFC case.
Table 7 Candidate models for maximum silicon temperature in T-TSV case. Only
the best five models are shown. The numbers in the parenthesis after Poly
means the polynomial order of (T-TSV ratio, P/G TSV diameter, P/G
thin wire ratio / interaction), and the name in the parenthesis after RBF
means the RBF kernel type. '+stepwise' means stepwise regression was
performed.
Table 8 Summary of models for T-TSV case with advanced DOE.
Table 9 Parameters for total wirelength model of T-TSV case with advanced
DOE. TTSV rat, PGdia, and PGthin means maximum T-TSV ratio,
P/G TSV diameter, and P/G thin wire ratio.
Table 10 Optimization results for Scenario 1 and 2 in T-TSV case.
Table 11 Summary of models for MFC case with advanced DOE. In model type
column, the numbers in the parenthesis after Poly means the polynomial
order of (MFC width, MFC pressure drop, P/G TSV diameter, P/G thin
wire ratio / interaction), and the name in the parenthesis after RBF
means the RBF kernel type.
Table 12 Optimization results for Scenario 1, 2, and 3 in MFC case.
Table 13 Architecture configuration of the LEON3 design.
Table 14 Summary of the synthesis of the quad-core LEON3 design.
Table 15 Summary of the memory macro blocks.
Table 16 Experimental settings of this chapter.
Table 17 Initial layout results for the design options. Utilization means area
utilization including standard cells and memory blocks, and wirelength means
total wirelength.
Table 18 Timing optimization results of LEON3.
Table 19 Delay and runtime of SPDP with varied maxS for critical nets in a 3D
IC design.
Table 20 Delay and runtime with varied bin sizes for critical nets in a 3D IC design.
Table 21 Percentage of merged solutions, delay, and runtime with varied dS for
critical multi-pin nets in a 3D IC design.
Table 22 Parameters used in this chapter. The Cm and Rm mean unit length
capacitance and resistance of metal5. The CTSV and RTSV mean TSV
parasitic capacitance and resistance, respectively. The maxS and minS
are the maximum/minimum allowed slew in the bottom-up traversal.
Table 23 Summary of target design information. The '#nets(critical)' means the
number of nets in the whole design and the critical nets selected for buffer
insertion. Die size is in µm, and the 'clock' means target clock period in
ns.
Table 24 Comparison of buffer insertion results. The '#bufs' means the number of
buffers in the design, and the fmax stands for maximum achievable clock
frequency. Runtime values of Ginneken-3D and SPDP include bottom-up
and top-down traversals in DP. The WNS, TNS, fmax, and runtime are
in ps, ns, MHz, and s respectively.
Table 25 Information of the nets on the critical path with Encounter-3D for
design ckt3 and the comparison of buffer insertion results. The '#TSVs'
and '#cand. buf loc' stand for the number of TSVs and the number of
candidate buffer locations in the net. The '#bufs' means the number of
buffers/inverters inserted on the net. The 'delay' is measured from the
source input to the critical sink input of the net, and 'slew' is the Si of
the critical sink. Delay and slew are in ps.
Table 26 Cell internal parasitic RC values. The 3D-c means 3D with top tier silicon
modeled as a conductor.
Table 27 Delay and internal power consumption of cells with various input slew and
load capacitance conditions. The library uses different input slew settings
for DFF. The values in the parentheses mean the percentage ratio of 3D
to 2D.
Table 28 Benchmark circuits used for metal layer option exploration.
Table 29 Pin density of the benchmark circuits. Cell area and pin density (= #cell
pins / cell area) are shown in µm² and pins/µm², respectively.
Table 30 Summary of metal layers in the 2D design option. Eight out of ten metal
layers in the Nangate 45nm library are used. Unit is nm.
Table 31 Comparison of timing and power of a cell with and without via stack RC.
The values are from the timing/power tables of the characterized libraries.
Table 32 Comparison between 2D and monolithic 3D designs. #routing MIVs
means the number of MIVs used in net routing, excluding the MIVs used
inside the monolithic cells. The WL, LPD, and TNS mean wirelength,
longest path delay, and total negative slack, respectively. Total power
includes cell internal, switching, and leakage power. Clock power includes
the power of clock buffers and wires. The values in parentheses show the
percentage ratio to the 2D designs.
Table 33 Minimum width/spacing of metal layers with varied metal dimension
reduction ratio. First metal means the lowest metal layer of the top/bottom
tier. Unit is nm.
Table 34 Unit length resistance and capacitance of local metals with varied metal
dimension reduction ratio. The Chigh and Clow are the max/min total
wire capacitance per unit length, depending on the surrounding wires.
Table 35 Total wirelength, longest path delay, and total power of AES, VGA, DES,
and FFT with reduced metal dimensions.
Table 36 Benchmark circuits and synthesis results.
Table 37 Summary of layout results. The values represent the percentage difference
of T-MI over 2D.
Table 38 Layout results of 2D and 3D designs. The 3D means the T-MI with 3TM
metal layer option. The #cells mean total number of cells, and #buffers
mean the number of inverting/non-inverting buffers. The #cells include
#buffers. The utilization means final cell placement density, after all
optimizations. The WL and WNS mean wirelength and worst negative
slack, respectively. Positive WNS value means timing is met with a
positive slack. The values in parentheses show the percentage ratio to the 2D
designs.
Table 39 Wire vs. pin capacitance breakdown of LDPC and DES in 45nm node.
The values are for the entire circuit.
Table 40 Layout results of G-MI and TSV-3D designs. The values in parentheses
show the percentage ratio to the 2D designs in Table 38.
LIST OF FIGURES
Figure 1 Illustration of a die in a 3D IC with signal TSVs, P/G TSVs, and MFCs.
These interconnects all compete for layout space. Transistors and signal
wires are not shown for simplicity.
Figure 2 Side view of a die in 3D ICs (a) with T-TSVs, and (b) with MFCs. In
(b), bonding layer also seals MFCs and the thickness is larger. Dies are
flipped over, and devices are facing down. Shapes are drawn to scale
based on the default settings, except for gates. Unit is µm.
Figure 3 Top view of global placement and routing tiles with MFCs. Only a part
of the chip is shown. Objects are drawn to scale based on the default
settings. P/G thin wires are not shown for simplicity.
Figure 4 Top view of the P/G network.
Figure 5 Side view of the thermal grid structure used for a 3D IC with MFCs.
Figure 6 Overall design flow with DOE and RSM.
Figure 7 Initial design results with baseline settings. Power density unit is W/cm²
in power map.
Figure 8 Temperature profiles for baseline, T-TSV case and MFC case. Dotted
lines in MFC case show MFCs.
Figure 9 Results of preliminary experiments for (a) MFC case and (b) T-TSV case.
Figure 10 Response surfaces for T-TSV case with advanced DOE. For each metric,
the two significant input factors are shown.
Figure 11 Response surfaces for MFC case with advanced DOE.
Figure 12 Target 3D structure of this chapter. (a) Dies are flipped over and facing
down. TSV pin pad (PP) and landing pad (LP) are shown. (b) The TSV
occupies two standard cell rows. Unit is µm.
Figure 13 Four design options. Blocks highlighted in orange denote Core 0. inst $
and data $ denote instruction and data cache, while RF and TLB
represent register file and address translation buffer.
Figure 14 Design flow with timing scaling and timing budgeting.
Figure 15 Top-die layouts of the four partition styles. The relative sizes of layouts
are preserved.
Figure 16 Screen shots of the GDSII images in Cadence Virtuoso. Left: TSVs and
gates. Right: routing to TSVs.
Figure 17 Wirelength distribution of design options before timing optimization. The
x-axis is wirelength in µm and the y-axis is net count.
Figure 18 WNS values for 3D-core, 3D-block, and 3D-gate cases with timing
budgeting.
Figure 19 The impact of TSV parasitics on various metrics. CTSV = 0fF means
ignoring the parasitics of TSVs. Timing budgeting was used for optimization.
Figure 20 Layout snapshots of dies for 3D-gate, with timing critical path highlighted
in white. Numbers in bright yellow represent the path sequence. Small
blue squares are TSV PPs on M1, and orange squares are TSV LPs on M6.
Figure 21 (a) Side view of the 3D IC, (b) top view of a TSV, and (c) TSV RC model.
TSV PP (M1) and TSV LP (M8) represent TSV pin pad on metal1 and
TSV landing pad on metal8, respectively. Dashed lines in (b) denote
standard cell row boundaries. Dimensions are in µm.
Figure 22 A motivational example. Numbers shown in blue represent the distance
from source gate in µm. (a) target 3D net, and buffer insertion solutions
with (b) VGDP, (c) SPDP, and (d) timing-constraint-based 2D
optimization by Cadence Encounter.
Figure 23 Gate and net slew calculations in (a) top-down and (b) bottom-up traversal.
Figure 24 Solution merge rule for VGDP and SPDP.
Figure 25 Slew matching technique. The q' and S' are determined as in Figure 24.
Figure 26 Different buffer insertion scheme for (a) VGDP and (b) SPDP.
Figure 27 Overall full-chip design flow for the buffer insertion methods. The ECO
means engineering change order.
Figure 28 Endpoint slack histograms for ckt2 with (a) Encounter-3D,
(b) Ginneken-3D, and (c) the proposed SPDP.
Figure 29 Side view of a two-tier monolithic 3D IC. The MIV and ILD stand for
monolithic inter-tier via and inter-layer dielectric. On the top tier, only
the first two metal layers (M1, M2) are shown. Objects are drawn to
scale. Unit is nm.
Figure 30 Monolithic 3D fabrication process flow of CEA/LETI.
Figure 31 Design styles of monolithic 3D ICs: (a) T-MI, (b) G-MI.
Figure 32 Overall design and analysis flow for T-MI. Shaded boxes highlight
differences in T-MI. The WLM means wire load model.
Figure 33 The layout of an inverter from (a) Nangate 45nm library, and (b) the T-MI
library. P, M, and CT represent poly, metal, and contact. The suffix 'B'
means the bottom tier. MIV means monolithic inter-tier via. Top/bottom
tier silicon substrate and p/nwells are not shown for simplicity. The
numbers in parentheses mean thickness in nm.
Figure 34 Layout snapshots of the T-MI cells. The S/D means source/drain. The
p/nwell and implants are not shown for simplicity.
Figure 35 Illustration of net routing cases in T-MI. This net connects pin Z of Cell1
to pin A of Cell2.
Figure 36 Layout snapshots of the benchmark circuit AES. On the right, zoom-in
shots of the top and the bottom tier are shown. Black and purple squares
indicate the MIVs used for net routing and cell internal connections,
respectively.
Figure 37 Routing congestion map of VGA with (a) 2D and (b) T-MI. Black X
marks show design rule violations due to routing congestions.
Figure 38 Metal layer stack options. (a) 2D, (b) baseline T-MI. (c) 3 local metal
layers added to the top tier, (d) 3 local metal layers added to the bottom
tier. ILD stands for inter-layer dielectric between the top and the bottom
tier. The bottom tier substrate and ILD for metal layers are not shown
for simplicity. Objects are drawn to scale.
Figure 39 Raphael simulation structure for a via stack and its surrounding objects.
The dimensions are shown in µm.
Figure 40 SPICE netlist of a standard cell: (a) original netlist, (b) with via stack
RC. The dotted line in (a) is the tier boundary, and the values denote
internal parasitic resistances in Ω.
Figure 41 Various results of JPEG with reduced metal dimensions.
Figure 42 The placement and routing snapshots of AES designs. The figures reflect
the relative sizes of 2D vs. T-MI designs.
Figure 43 Snapshots of routing results for LDPC and DES.
Figure 44 Power reduction rate (T-MI over 2D) under various target clock periods.
Figure 45 Layer structures of (a) G-MI and (b) TSV-3D ICs. For simplicity, in (b),
only the top metal layer of the bottom tier is shown.
Figure 46 Design and analysis flow for G-MI and TSV-3D ICs.
Figure 47 Examples of limitations in die-by-die optimizations: (a) buffer pair to
inverter pair, (b) AND to NAND and an inverter, and (c) gate cloning.
LIST OF SYMBOLS OR ABBREVIATIONS
CAD Computer-aided design.
CTS Clock tree synthesis.
DOE Design of experiments.
DP Dynamic programming.
EDA Electronic design automation.
LPD Longest path delay.
MFC Micro-fluidic channel.
MIV Monolithic inter-tier via.
RSM Response surface methodology.
STA Static timing analysis.
T-TSV Thermal-through-silicon via.
TNS Total negative slack.
TSV Through-silicon via.
WNS Worst negative slack.
SUMMARY
The main objective of this dissertation is to explore and develop computer-aided-design
(CAD) methodologies and optimization techniques for the reliability, timing performance,
and power consumption of through-silicon-via (TSV)-based and monolithic 3D IC
designs. 3D IC technology is a promising answer to the device scaling and interconnect
problems that the industry faces today. Yet, since multiple dies are stacked vertically in 3D
ICs, new problems arise, such as heat removal and power delivery. New physical design
methodologies and optimization techniques should be developed to address these problems
and exploit the design freedom in 3D ICs. Toward this objective, this dissertation includes
four research projects.
The first project is on the co-optimization of traditional design metrics and reliability
metrics for 3D ICs. It is well known that heat removal and power delivery are two major
reliability concerns in 3D ICs. To alleviate the thermal problem, two possible solutions have
been proposed: thermal-through-silicon-vias (T-TSVs) and micro-fluidic-channel (MFC)
based cooling. For power delivery, a complex power distribution network is required to
deliver currents reliably to all parts of the 3D IC while suppressing the power supply noise
to an acceptable level. However, these thermal and power networks pose major challenges
in signal routability and congestion. In this project, a co-optimization methodology for
signal, power, and thermal interconnects in 3D ICs is presented. The goal of the proposed
approach is to improve signal, thermal, and power noise metrics and to provide fast and
accurate design space explorations for early design stages.
The second project is a study on 3D IC partitioning. For a 3D IC, the target circuit needs
to be partitioned into multiple parts and then mapped onto the dies. The partition style impacts
design quality metrics such as footprint, wirelength, and timing. In this project, the design
methodologies of 3D ICs with different partition styles are demonstrated. For the LEON3
multi-core microprocessor, three partitioning styles are compared: core-level, block-level,
and gate-level. The design methodologies for such partitioning styles and their implications
on the physical layout are discussed. Then, to perform timing optimizations for 3D ICs,
two timing constraint generation methods are demonstrated that lead to different design
quality.
The third project is on buffer insertion for timing optimization of 3D ICs. For
high-performance 3D ICs, it is crucial to perform thorough timing optimizations. Among
timing optimization techniques, buffer insertion is known to be the most effective. TSVs
have a large parasitic capacitance that increases the signal slew and the downstream
delay. In this project, a slew-aware buffer insertion algorithm is developed that
handles full 3D nets and considers TSV parasitics and slew effects on delay. Compared with
the well-known van Ginneken algorithm and a commercial tool, the proposed algorithm finds
buffering solutions with lower delay values and acceptable runtime overhead.
The last project is on ultra-high-density logic designs for monolithic 3D ICs. The
nano-scale 3D interconnects available in monolithic 3D IC technology enable ultra-high-density
device integration at the individual transistor level. The benefits and challenges of
monolithic 3D integration technology for logic designs are investigated. First, a 3D standard
cell library for transistor-level monolithic 3D ICs is built, and its timing and power
behavior is characterized. Then, various interconnect options for monolithic 3D ICs that
improve design quality are explored. Next, timing-closed, full-chip GDSII layouts are built
and iso-performance power comparisons with 2D IC designs are performed. Important design
metrics such as area, wirelength, timing, and power consumption are compared among





For more than half a century, semiconductor devices and circuits have been developed by
numerous brilliant minds and served people in various fields. The device scaling trend
that arduously followed Moore’s law brought prosperity into semiconductor businesses as
well as end users. Unfortunately, the scaling trend brought hardships as well, which were
overcome successfully until recently. As semiconductor devices and metal wires become
nano-scale, the physical limitations in manufacturing and material behaviors pose
unprecedentedly great hurdles to the semiconductor industry and academia. It is expected that
the progress in device node scaling will soon slow down noticeably, mainly because
next-generation lithography methods (e.g., extreme ultraviolet (EUV) lithography and electron beam
lithography) are being pushed back. Currently, the next-generation device node (14nm) is
still based on 193nm-wavelength immersion lithography with multiple patterning techniques.
The complexity of multiple patterning incurs intricate design rules and design efficiency
problems, leading to higher manufacturing costs. Furthermore, multiple patterning is
not expected to provide scaling below the 10nm node.
As devices become smaller, interconnect (or net) dimensions need to be shrunk accord-
ingly to connect the devices. With nano-scale interconnect dimensions, the resistivity of
wires shoots up, as discussed in the International Technology Roadmap for Semiconductors
(ITRS) projection [1]. For short, local nets, the increased resistivity is not a huge problem
because the device output resistance dominates the wire resistance. The real problem is
on the medium-long nets that connect medium-large design blocks, including intellectual
property (IP) blocks. As devices become smaller and more functionalities are brought onto
the chip, the design tends to contain more medium-long nets. The delay of these nets
dominates the delay of devices and largely determines the overall performance of the design.
Therefore, reducing interconnect length is crucial to improve overall circuit quality such as
timing performance and power consumption.
The 3D IC technology is a promising answer to the aforementioned device scaling and
interconnect problems. With 3D IC technologies, more devices can be integrated within a
given footprint. At the end of device roadmap where the physical limit or manufacturing
costs prohibit further scaling, the 3D IC technology is the most viable way to extend Moore’s
law and keep the semiconductor business prosperous. In addition, by stacking devices in
3D, the average distance among devices could be reduced, leading to a shorter average
interconnect length. For successful adoptions, it is crucial that 3D IC technologies bring
promised benefits. During the recent decade or so, industry and academia have been working
hard to enable and adopt 3D IC technologies.
The main objective of this thesis is to explore and develop computer-aided-design (CAD)
methodologies and optimization techniques for reliability, timing performance, and power
consumption of through-silicon-via (TSV)-based and monolithic 3D IC designs. Because
multiple dies are stacked vertically in 3D ICs, new problems arise in areas such as thermal
management, power delivery, clock distribution, and testability. New physical design
methodologies and optimization techniques should be developed to address these problems and
exploit the design freedom in 3D ICs. These methodologies and techniques
should reflect the technological details of today and the future as much as possible. In this
dissertation, four projects are presented that partially address the aforementioned problems.
1.1 Contributions
The contributions of this dissertation are summarized as follows.
• A co-optimization method for signal, power, and thermal interconnects: It
is well known that heat removal and power delivery are two major reliability concerns
in 3D ICs. To alleviate the thermal problem, two possible solutions have been pro-
posed: thermal-TSVs (T-TSVs) and micro-fluidic-channel (MFC) based cooling. For
power delivery, a complex power distribution network is required to deliver currents
reliably to all parts of the 3D IC while suppressing the power supply noise to an
acceptable level. However, these thermal and power networks pose major challenges
in signal routability and congestion. This is because signal, power, and thermal in-
terconnects are all competing for routing space, and the related TSVs interfere with
gates and wires in each die. In this dissertation, a co-optimization method for signal,
power, and thermal interconnects in 3D ICs based on design of experiments (DOE)
and response surface methodology (RSM) is presented. In early design stages, the pro-
posed method can provide quick and reasonably accurate design space explorations.
First, the design characteristics of a digital signal processing core in 2D and 3D ICs
are compared, and the need for more powerful thermal management techniques for
3D ICs is justified. The two thermal solutions, T-TSV and MFC-based cooling, are
modeled into the framework, as well as power distribution network and signal nets.
With signal, power, and thermal analysis results, the strengths and weaknesses of
these two thermal management techniques are discussed. Then, the co-optimization
of signal, power, and thermal interconnects using DOE and RSM is demonstrated
for 3D ICs with MFCs and T-TSVs. The strengths and limitations of the proposed
co-optimization method are discussed.
• A study on the impact of partition styles on design quality of a multi-core
processor: To implement a target design on multiple dies, the circuit needs to be
partitioned into multiple parts then mapped onto the dies. Different partition styles
lead to different design quality in terms of footprint, wirelength, and timing. The
presented work is the first to compare 2D and 3D IC designs of a commercial-grade
multi-core processor at GDSII level. The design methodologies for circuit partition
options and the implications on the physical layout are discussed. Based on GDSII-
level details, the 3D IC implementations in different partition styles as well as the 2D
IC implementation are compared. For the 3D IC implementation, three partitioning
styles are compared: core-level, block-level, and gate-level. These partitioning styles
represent the three most relevant 3D IC implementation choices. The design methodologies
for such partitioning styles, their implications on the physical layout, and the
impact of TSVs on the 3D design quality in terms of chip area, wirelength, and per-
formance are discussed. In addition, two timing constraint generation methods for
timing optimizations of 3D ICs are presented: timing scaling and timing budgeting.
Finally, the challenges and opportunities in 3D IC optimizations are discussed.
• A slew-aware buffer insertion algorithm that minimizes delay by consid-
ering slew effect on delay: For high performance 3D ICs, it is crucial to perform
a thorough timing optimization. Among timing optimization techniques, buffer insertion
is known to be the most effective. TSVs have a large parasitic
capacitance that increases the signal slew and the downstream delay. From
a layout experiment, the impact of slew caused by TSVs on gate and net delays is
demonstrated. With a buffered 3D net, the severity of TSV-induced slew degradation
and the idea for buffer solution improvement are discussed. The presented work is the
first to incorporate a reasonably accurate slew model into the van Ginneken dynamic
programming (DP) framework for delay minimization. With the proposed slew bin-
ning idea, the slew-aware delay is considered explicitly and efficiently during solution
search. In addition, using the slew information, several efficient pruning rules are pro-
posed, which limit search space and reduce runtime. Compared with the well-known
van Ginneken algorithm and a commercial electronic design automation (EDA) tool,
the proposed algorithm finds buffering solutions with lower delay values and accept-
able runtime overhead.
• Interconnect options and power benefit study for ultra-high-density mono-
lithic 3D ICs: To better exploit the benefits from 3D die stacking, monolithic 3D
technology is currently being investigated as a next generation technology. In a mono-
lithic 3D IC, the device layers are fabricated sequentially, rather than bonding two
fabricated dies together using bumps and/or TSVs. The nano-scale 3D interconnects
available in monolithic 3D IC technology enable ultra-high-density device integration
at the individual transistor-level. In this work, the benefits and challenges of mono-
lithic 3D technology for ultra-high-density logic designs are investigated. First, a 3D
standard cell library for transistor-level monolithic 3D ICs is built and their timing
and power characteristics are modeled. Then, various interconnect options for mono-
lithic 3D ICs that improve design quality are explored. Next, timing-closed, full-chip
GDSII layouts are built and sign-off iso-performance power comparisons with 2D IC
designs are performed. Important design metrics such as area, wirelength, timing, and
power consumption are compared for transistor-level monolithic 3D designs, gate-level
monolithic 3D, TSV-based 3D, and traditional 2D designs.
1.2 Organization
The rest of this dissertation is organized as follows:
• In Chapter 2, the origin of the problems and the related works are discussed. This
chapter also provides some background knowledge.
• In Chapter 3, the co-optimization methodology for signal, power, and thermal inter-
connects in 3D ICs based on design of experiments and response surface methodology
is presented.
• In Chapter 4, the partitioning study of a multi-core microprocessor is presented.
• In Chapter 5, the slew-aware buffer insertion for timing optimization of TSV-based
3D ICs is presented.
• In Chapter 6, the ultra-high-density monolithic 3D IC study is presented.
• In Chapter 7, the conclusions of this dissertation are mentioned, as well as the remarks
on the covered topics and possible future works.
CHAPTER II
ORIGIN AND HISTORY OF THE PROBLEM
Four broad categories of works are related to this dissertation. Related considerations
include power distribution network and thermal interconnect designs, circuit partitioning
and floorplanning for 3D ICs, timing optimization with buffer insertion, and monolithic 3D
IC designs.
2.1 Power Distribution Network and Thermal Interconnect Designs
Many efforts have been made to solve heat removal and power delivery problems in 3D IC
technology. Thermal management using thermal-TSVs (T-TSVs) has been proposed as a
solution to the heat problem. Several previous works considered T-TSV insertion during
floorplanning [2], placement [3], and routing [4]. The thermal effectiveness of T-TSV heavily
depends on the number and location of T-TSVs. These T-TSVs pierce through the device
and wire areas vertically and interfere with other objects. Thus, it is important to plan
T-TSVs carefully to balance thermal and other metrics.
D. B. Tuckerman and R. F. W. Pease presented the pioneering work on micro-fluidic
channels (MFCs) [5] which includes compact thermal models for MFCs and measurements
of the actual performance of MFCs. For 3D ICs, the MFC-based cooling has been proposed
as a possible solution to dramatically lower the operating temperatures of 3D ICs with high
power densities [6]. M. Bakir, B. Dang, and J. Meindl measured the thermal resistance of
a micro-channel heat sink for a single chip [7]. When de-ionized water was used as coolant,
the junction-to-ambient thermal resistance of the heat sink was 0.24°C/W at a flow-rate
of about 65mL/min. Recently, the performance of MFC-based cooling for 3D ICs was
thoroughly investigated in [8].
Regarding power distribution, G. Huang et al. presented a compact modeling of power
delivery network for 3D ICs and ideas to suppress power supply noise to an acceptable
level [9]. The hierarchical nature of the power distribution network in 3D ICs complicates
the routing with other (signal and thermal) interconnects. It is customary to define a
maximum power noise level and temperature then optimize designs towards higher timing
performance or lower power consumption. In this scenario, it is essential to consider the
signal, power, and thermal problem in the same design framework.
2.2 Circuit Partitioning and Floorplanning for 3D ICs
Traditionally, partitioning [10,11] was used to divide a target circuit into smaller partitions
while minimizing the number of nets connecting multiple partitions. Partitioning effectively
reduces the problem sizes for the rest of the physical design steps, increasing efficiency. For
3D ICs, the partitioning may be used to split the target circuit into different dies to minimize
inter-die connections (and number of TSVs). Traditional floorplanning [12–14] determines
the locations (and shapes) of partitioned blocks to minimize total area and wirelength. For
3D ICs, floorplanning is used to determine not only the locations of the blocks in x-y plane
but also in z axis (or die number). Thus, partitioning and floorplanning are closely related
in the 3D IC design flow.
Bryan Black et al. demonstrated two different approaches for implementing high-
performance 3D processors [15]. The first approach is stacking memory on logic (Mem-
ory+Logic), and the second is implementing a microarchitecture across two or more dies
(Logic+Logic). For an Intel Pentium 4-based microprocessor, with Logic+Logic stacking,
about 25% of pipeline stages were eliminated, leading to about 15% performance improvement.
In addition, fewer repeaters, a smaller clock grid, and significantly less global wire
yielded a 15% power reduction.
Eun Chu Oh and Paul D. Franzon explored the 3D IC design options for Ternary Content
Addressable Memory (TCAM) [16]. By replacing matchlines with inter-tier 3D vias and
using the inter-cell partitioning method, for a three die implementation, 40% matchline
capacitance reduction and 21% power reduction were achieved, compared with a TCAM in
a conventional single-tier process.
Yuh-Fang Tsai et al. explored 3D design options for partitioning a cache [17]. This
paper examines possible partitioning approaches for caches designed using 3D structures
and presents a delay and energy model to explore different options of partitioning a cache
across different device layers. Because of the size of 3D vias (or TSVs), SRAM cell level
partitioning is not feasible. Thus, their focus is on sub-array-level partitioning, namely the 3D
divided wordline (3-DWL) approach and the 3D divided bitline (3-DBL) approach. For four
active device layers, energy savings of 31.38% are claimed for a 4MB cache with 25nm
technology.
Yu Cheng Hu et al. proposed a multilevel multilayer partitioning algorithm for 3D IC
applications [18]. The objective is to minimize the total number of TSVs while observing the
area constraint for each layer. First, a multilevel coarsening technique is applied to reduce
the number of modules. With an initial K-layer partition, a K-layer FM-like partitioning
refinement process is applied to minimize the number of TSVs under the area constraint.
Then, an uncoarsening process is performed to restore the modules to the previous levels. With
the proposed method, a small number of TSVs is used and die areas are evenly distributed
across the 3D stack.
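The two quantities that such a partitioner trades off, the TSV count and the per-die area balance, can be illustrated with a small sketch. This is not the algorithm of [18]; it only evaluates a given die assignment. The module names, areas, and nets are hypothetical, and it is assumed that a net spanning dies d_min to d_max needs one TSV per die boundary it crosses.

```python
from typing import Dict, List


def count_tsvs(nets: List[List[str]], die_of: Dict[str, int]) -> int:
    """Count TSVs for a die assignment.

    Assumption: a net touching dies d_min..d_max needs (d_max - d_min) TSVs,
    i.e., one per die boundary crossed.
    """
    total = 0
    for net in nets:
        dies = [die_of[module] for module in net]
        total += max(dies) - min(dies)
    return total


def die_areas(area_of: Dict[str, float], die_of: Dict[str, int],
              num_dies: int) -> List[float]:
    """Sum the module areas assigned to each die."""
    areas = [0.0] * num_dies
    for module, die in die_of.items():
        areas[die] += area_of[module]
    return areas


# Hypothetical 3-die example (all names and numbers are illustrative only).
die_of = {"a": 0, "b": 0, "c": 1, "d": 2}
area_of = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 3.0}
nets = [["a", "b"], ["a", "c"], ["b", "c", "d"], ["c", "d"]]

tsvs = count_tsvs(nets, die_of)
areas = die_areas(area_of, die_of, 3)
```

A refinement pass in the FM style would repeatedly move one module to another die, accept the move if `count_tsvs` decreases, and reject it if any entry of `die_areas` violates the area constraint.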
2.3 Timing Optimization with Buffer Insertion
Decades ago, the buffer insertion problem for 2D ICs was studied with closed analytical
formulations [19]. However, these analytical formulations were based on many assumptions
that do not hold in practical designs. After the pioneering work of van Ginneken [20] which
adopted dynamic programming (VGDP), efforts for generalization [21], speed-up [22], and
higher accuracy [23] were made. The essence of the van Ginneken buffer insertion lies in
practicality and efficiency; the algorithm greatly influenced commercial EDA tools.
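For readers unfamiliar with the dynamic programming idea, the following is a minimal sketch of delay-only van Ginneken-style buffer insertion on a single two-pin wire, using the Elmore delay model. It is not the slew-aware algorithm developed in this dissertation, it ignores TSVs, and all parameter names and values are hypothetical. Candidates are (downstream capacitance, required arrival time) pairs that are propagated from the sink toward the driver and pruned when dominated.

```python
from typing import List, Tuple

Candidate = Tuple[float, float]  # (downstream capacitance, required arrival time)


def prune(cands: List[Candidate]) -> List[Candidate]:
    """Keep non-dominated candidates: lower capacitance and higher required time win."""
    kept: List[Candidate] = []
    for cap, q in sorted(cands, key=lambda cq: (cq[0], -cq[1])):
        if not kept or q > kept[-1][1]:
            kept.append((cap, q))
    return kept


def best_required_time(seg_lengths: List[float], r_wire: float, c_wire: float,
                       c_sink: float, rat_sink: float, r_buf: float,
                       c_buf: float, d_buf: float, r_drv: float) -> float:
    """Return the best required arrival time at the driver input.

    seg_lengths lists wire segments from sink to driver. A buffer may be
    inserted at the upstream end of every segment except the last one.
    """
    cands: List[Candidate] = [(c_sink, rat_sink)]
    last = len(seg_lengths) - 1
    for i, length in enumerate(seg_lengths):
        r, c = r_wire * length, c_wire * length
        # Elmore delay of the segment: r * (c / 2 + downstream capacitance).
        cands = [(cap + c, q - r * (c / 2.0 + cap)) for cap, q in cands]
        if i < last:
            # All buffered options present the same load c_buf, so only the
            # one with the best required time needs to be kept.
            buffered_q = max(q - d_buf - r_buf * cap for cap, q in cands)
            cands = prune(cands + [(c_buf, buffered_q)])
    return max(q - r_drv * cap for cap, q in cands)


# Hypothetical normalized parameters (illustrative only).
PARAMS = dict(r_wire=1.0, c_wire=1.0, c_sink=1.0, rat_sink=100.0,
              r_buf=1.0, c_buf=1.0, d_buf=1.0, r_drv=1.0)
unbuffered = best_required_time([4.0], **PARAMS)      # no buffer location
buffered = best_required_time([2.0, 2.0], **PARAMS)   # one location at mid-wire
```

With these values the unbuffered wire yields a required time of 83 at the driver, and allowing one mid-wire buffer improves it to 85. The pruning step is what keeps the candidate list small as the number of buffer locations grows.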
For 3D ICs, S. Dong et al. proposed a buffer planning algorithm in the floorplanning
stage [24]. However, their results may not correlate to the final timing optimization re-
sults because the algorithm is performed in an early design stage. Meanwhile, in [25], the
post-route timing optimization was performed using existing 2D EDA tools with timing
constraints on die boundaries. Since the optimization engines in 2D EDA tools handle each
die separately, they cannot consider the whole 3D path, which compromises the quality of
timing optimization.
Because TSVs are large and exhibit considerable parasitic capacitances, the signal slew
through TSVs degrades, which in turn degrades timing performance. Y. Peng and X. Liu
presented a buffer insertion algorithm with slew consideration [26]. However, their delay
models could not adopt effective capacitance [27]. Because TSVs strongly affect effective
capacitance, this algorithm is not suitable for 3D ICs. Also, their
framework relies on nonlinear optimization, which would incur runtime issues for large net
instances. In contrast, the VGDP framework is known for efficiency and flexibility, yet there
has been no work that considers realistic signal slew in the VGDP framework. J. Lillis,
C.-K. Cheng, and T.-T. Y. Lin considered slew in the VGDP framework [21]; however, their
slew model is not realistic and the implementation is complicated because of piecewise linear
functions.
2.4 Monolithic 3D IC Designs
The shortcomings of TSV-based 3D ICs are the overhead area consumed by TSVs and
the minimum required pitch between TSVs because of manufacturing issues (such as die
alignment accuracy and mechanical stress). Monolithic 3D technology, in contrast, provides
much higher-density vertical connections because of very high alignment precision [28].
Since monolithic 3D technology enables high-density vertical connections, the first major
application was high-density SRAM designs. Soon-Moon Jung et al. demonstrated the
single-crystal thin-film-based process for their SRAM design [29], which reduced the SRAM
cell area by 46.4%. Recently, Negin Golshani et al. demonstrated the monolithic 3D
integration of SRAM and image sensor [30]. Also, T. Naito et al. demonstrated the first
3D FPGA design implementation based on a monolithic 3D technology [31], by stacking
an amorphous silicon thin-film transistor (TFT) layer on top of a bulk silicon CMOS logic
layer.
P. Batude et al. enabled high-quality top silicon layer using a molecular bonding tech-
nique and a low thermal budget process [28]. Based on their monolithic 3D IC process, two
kinds of monolithic 3D standard cell libraries were demonstrated in [32]. The first method,
Intra-Cell stacking, places NMOS transistors on one tier and PMOS transistors on another.
The second method, Cell-on-Cell stacking, places complete CMOS cells on each tier, which
is similar to TSV-based designs. A special physical design flow for Cell-on-Cell stacking was
proposed to reuse an existing commercial 2D placer for placing cells in 3D. Compared with
a traditional 2D design flow, the proposed design flow with Cell-on-Cell stacking provided
wirelength, critical path delay, and area reduction of 15%, 6.1%, and 37.5%, respectively.
Recently, logic design methodologies for monolithic 3D technology were demonstrated
in [32,33]. Yet, the presented design techniques and interconnect options did not resolve the
routing congestion problem in transistor-level monolithic 3D designs, which may significantly
degrade the design quality. The routing congestion problem was addressed in a recent work [34].
However, timing was not closed in these works [32–34], which makes the timing and power
comparisons impractical and unfair. Since better timing can be traded for lower power
consumption, it is essential that all the design options under consideration are timing-
closed to allow iso-performance power comparison. In addition, these works assume that
the timing and power characteristics of 3D monolithic gates are the same as 2D gates
and did not demonstrate why that is a reasonable assumption. The authors also did not
provide in-depth analyses and discussions on why monolithic 3D technology reduces power
consumption and what factors affect the power reduction margin. This knowledge is crucial
to maximize the benefit and justify on-going and future research on fabrication and design
technologies for monolithic 3D ICs.
CHAPTER III
CO-OPTIMIZATION AND ANALYSIS OF SIGNAL, POWER, AND
THERMAL INTERCONNECTS IN 3D ICS
3.1 Introduction
The substantially smaller footprint area of 3D ICs inevitably leads to increased power
density and chip temperatures. In addition, the thermal conductivity of the material used
between dies of 3D ICs is low. Elevated temperatures may lead to inefficiency in performance
and power. Furthermore, in 3D ICs the power is fed through TSVs which have significant
parasitics. As more dies are stacked together, the power noise is more prominent in 3D ICs.
Many efforts have been made to solve heat removal and power delivery concerns in the 3D
IC technology. Thermal management using thermal-TSVs (T-TSVs) has been proposed as a
solution to the heat problem [4]. Also, liquid cooling based on micro-fluidic channels (MFCs)
has been proposed as a possible solution to dramatically lower the operating temperatures of
3D ICs with high power densities [6]. Regarding power supply noise, designers use a highly
complex hierarchical power distribution network to deliver currents to all parts of the 3D
IC while suppressing the power supply noise to an acceptable level [9]. These so-called
silicon ancillary technologies, however, pose major challenges to routing completion and
congestion, because the routing space is shared by these interconnects. As shown in Figure
1, the power and the thermal interconnects are relatively large. Since these interconnects
interact in a complex manner, optimizing one interconnect after another may lead to a local
optimum. Thus, co-optimization of these interconnects with a holistic approach is
needed. Most of the existing studies on signal, power, and thermal interconnects for 3D
ICs are done in isolation, thereby lacking a system-level perspective.
In this chapter, the co-optimization of signal, power, and thermal interconnects for 3D
ICs is presented, which is based on design of experiments (DOE) and response surface
methodology (RSM). In early
design stages, this method can provide a quick and accurate design space exploration, and
the obtained response models provide insights on the system and are flexible so as to be
reused for different optimization goals.
Figure 1: Illustration of a die in a 3D IC with signal TSVs, P/G TSVs, and MFCs. These
interconnects all compete for layout space. Transistors and signal wires are not shown for
simplicity.
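As a rough illustration of how a designed experiment yields a reusable response model, the sketch below builds a two-level full-factorial design in coded units (-1, +1) and fits a first-order model. The experimental design and model order actually used in this chapter are not restated here; response surface models are often second-order, so this first-order version is a simplifying assumption, and the factor meanings and response values are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def full_factorial(k: int) -> List[Tuple[int, ...]]:
    """All 2^k runs of a two-level design in coded units (-1, +1)."""
    return list(product((-1, 1), repeat=k))


def fit_first_order(design: Sequence[Sequence[int]],
                    responses: Sequence[float]) -> List[float]:
    """Least-squares fit of y = b0 + sum(b_i * x_i).

    For an orthogonal two-level design the least-squares solution reduces to
    simple averages: b0 is the mean response and b_i = sum(x_i * y) / N.
    """
    n = len(design)
    b0 = sum(responses) / n
    slopes = [sum(run[i] * y for run, y in zip(design, responses)) / n
              for i in range(len(design[0]))]
    return [b0] + slopes


# Hypothetical example: two factors (e.g., a TSV pitch and an MFC width, coded).
design = full_factorial(2)
# Stand-in for expensive analysis runs: a made-up linear response.
responses = [3.0 + 2.0 * x1 - 1.0 * x2 for x1, x2 in design]
coeffs = fit_first_order(design, responses)
```

Once the coefficients are known, the fitted model can be evaluated for any setting inside the region of interest without rerunning the underlying analysis, which is what makes the response models reusable for different optimization goals.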
Since its invention [35], DOE has been used for various scientific and engineering ap-
plications. DOE has also been used in VLSI and CAD areas. In [36] the DOE framework
for CAD was discussed. A robust interconnect model based on DOE was presented in [37].
In [38] DOE was used to identify performance-critical buses in microarchitectures.
The abbreviations used in this chapter are shown in Table 1.
Table 1: Abbreviations used in this chapter.
DOE    Design of experiments    RSM     Response surface methodology
ROI    Region of interest       RMSE    Root mean square error
TSV    Through-silicon-via      T-TSV   Thermal-TSV
P/G    Power/ground             MFC     Micro-fluidic channel
3.2 Design and Analysis Flow
3.2.1 Signal Interconnects
In this chapter, the metal interconnect dimensions are set similar to the ones in the North
Carolina State University 45nm technology library [39]. Because no industry data or feedback
is available, this free technology data is used with assumptions and modifications. In total,
eight of the ten metal layers in [39] are utilized. The TSV integration scheme is assumed to be
via-first. Via-first TSVs interfere only with the device layer and not with the metal layers, so
they are less intrusive than via-last TSVs. It is also assumed that the TSV aspect ratio (=
TSV height : TSV diameter) is 10:1 for the baseline case (no T-TSV or MFC) and the T-TSV
case, and 30:1 for the MFC case. The MFC case uses a higher TSV aspect ratio because
dies with MFCs cannot be thinned as much as dies without MFCs and are therefore thicker. If the
same TSV aspect ratio were assumed for the MFC case, the TSV diameter would be larger, leading to
a larger silicon area for signal TSVs. Thus, it is important for MFC-based cooling to have
a high TSV aspect ratio. A TSV aspect ratio of more than 30:1 was demonstrated
in [40].
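The relation between die thickness, aspect ratio, and TSV footprint can be made concrete with a short calculation. The die thicknesses below (50µm without MFCs, 150µm with MFCs) are hypothetical values chosen only for illustration, and the TSV height is assumed to equal the die thickness.

```python
import math


def tsv_diameter_um(die_thickness_um: float, aspect_ratio: float) -> float:
    """TSV diameter from aspect ratio = TSV height : TSV diameter.

    Assumption: the TSV height equals the die thickness.
    """
    return die_thickness_um / aspect_ratio


def tsv_area_um2(diameter_um: float) -> float:
    """Circular cross-section area of one TSV."""
    return math.pi * (diameter_um / 2.0) ** 2


# Hypothetical die thicknesses (not taken from the dissertation).
thin_die_um = 50.0    # die without MFCs
thick_die_um = 150.0  # die with MFCs, which cannot be thinned as much

d_baseline = tsv_diameter_um(thin_die_um, 10.0)    # 10:1 on the thin die
d_mfc_same = tsv_diameter_um(thick_die_um, 10.0)   # 10:1 on the thick die
d_mfc_high = tsv_diameter_um(thick_die_um, 30.0)   # 30:1 on the thick die

area_penalty = tsv_area_um2(d_mfc_same) / tsv_area_um2(d_baseline)
```

Under these assumed thicknesses, keeping a 10:1 aspect ratio on the thicker die triples the TSV diameter and increases the area per TSV ninefold, whereas a 30:1 aspect ratio restores the baseline diameter.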
The side views of dies with T-TSVs and MFCs are shown in Figure 2. In both cases,
the diameter of signal TSVs is set to a minimum to accommodate as many connections
as possible. In contrast, the diameter of P/G TSVs is around 10µm, because within the
same area a big TSV gives lower resistance than a bundle of minimum-sized TSVs. Note,
however, that because of manufacturing issues, it could be mandatory to use a bundle of
small TSVs, which would increase aggregate resistance.
Each global routing tile has x-, y-, and z-direction routing capacity values. x- and
y-direction capacity represents available routing space on metal layers, while z-direction
capacity is for signal TSVs. It is assumed that Metal 1/3/5/7 are for x-direction and Metal
2/4/6/8 are for y-direction. The cell occupancy ratio (COR) of a placement tile at (x, y, z)
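is defined as:
$$\mathrm{COR}(x, y, z) = \frac{\sum_{cell \in p\_tile(x, y, z)} S_{cell}}{S_{p\_tile}}$$
where $S_{cell}$ is the area of each cell in the placement tile $p\_tile(x, y, z)$, and
$S_{p\_tile}$ is the area of a placement tile.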
For the x- and y-directions, the default capacity of each metal layer is calculated by dividing
the routing tile size by the minimum wire pitch of the metal layer. Since Metal 1 is heavily
used in standard cells, when a placement tile has COR = α%, the Metal 1 routing capacity
of the corresponding routing tile is reduced by α%. Metal 2-6 are dedicated to signal routing.
On Metal 7 and 8, because of the P/G nets, only part of the space is available for signal


























Figure 2: Side view of a die in 3D ICs (a) with T-TSVs, and (b) with MFCs. In (b),
bonding layer also seals MFCs and the thickness is larger. Dies are flipped over, and
devices are facing down. Shapes are drawn to scale based on the default settings, except
for gates. Unit is µm.
are decreased correspondingly. For each routing tile, the routing capacity of each metal
layer is calculated then accumulated to obtain the total routing capacity of the routing tile.
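The per-layer bookkeeping described above can be sketched as follows. The wire pitches are hypothetical placeholders rather than the values of [39], capacities are kept as fractional track counts, and only two of the adjustments mentioned in this chapter are modeled: the Metal 1 reduction by the COR and the share of Metal 7 and 8 taken by P/G thin wires. Reductions caused by P/G TSVs and thick P/G wires are deliberately left out.

```python
from typing import Dict


def routing_capacity(tile_um: float, pitch_um: Dict[int, float],
                     cor: float, pg_ratio: float) -> Dict[str, float]:
    """Accumulate x/y routing capacity (in tracks) of one routing tile.

    Assumptions: odd metal layers route in x and even layers in y; Metal 1
    loses a fraction `cor` of its tracks to standard cells; Metal 7 and 8
    lose a fraction `pg_ratio` to P/G thin wires.
    """
    capacity = {"x": 0.0, "y": 0.0}
    for layer, pitch in pitch_um.items():
        tracks = tile_um / pitch          # default capacity of the layer
        if layer == 1:
            tracks *= 1.0 - cor
        elif layer in (7, 8):
            tracks *= 1.0 - pg_ratio
        capacity["x" if layer % 2 == 1 else "y"] += tracks
    return capacity


# Hypothetical minimum wire pitches in um (not the values of the 45nm library).
PITCH_UM = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.4, 5: 0.4, 6: 0.4, 7: 1.0, 8: 1.0}

cap = routing_capacity(tile_um=20.0, pitch_um=PITCH_UM, cor=0.5, pg_ratio=0.4)
```

With a 20µm tile, a COR of 50%, and a P/G ratio of 0.4, this gives about 212 tracks in each direction under the assumed pitches.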
For z-direction capacity, both the available silicon surface area and the MFC area of
the routing tile are checked. The area occupied by cells (given by the COR) and the P/G TSV
area are subtracted from the routing tile area to obtain the remaining area. Then the remaining
area is further reduced according to the area covered by MFCs. For instance, if the MFC covers
50% of the routing tile, the remaining area is multiplied by 0.5 to obtain the final remaining
area. Then, this area is divided by the area of a minimum-pitch signal TSV to obtain the
z-direction capacity.
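A minimal sketch of this z-direction calculation is given below. It assumes that the placement tile and the routing tile cover the same area, that a minimum-pitch signal TSV occupies a square of pitch by pitch, and that the result is rounded down to a whole number of TSVs. The 5µm signal TSV pitch is a hypothetical value.

```python
def z_capacity(tile_um: float, cor: float, pg_tsv_area_um2: float,
               mfc_cover: float, tsv_pitch_um: float) -> int:
    """Number of signal TSVs that fit in one routing tile.

    Assumptions: `cor` and `mfc_cover` are fractions in [0, 1]; a signal TSV
    at minimum pitch occupies tsv_pitch_um ** 2 of silicon area.
    """
    tile_area = tile_um ** 2
    # Subtract the cell area and the P/G TSV area from the tile area.
    remaining = tile_area - cor * tile_area - pg_tsv_area_um2
    # Scale by the fraction of the tile that is not covered by MFCs.
    remaining = max(remaining, 0.0) * (1.0 - mfc_cover)
    return int(remaining // (tsv_pitch_um ** 2))


# 20um tile, 50% cell occupancy, no P/G TSV, MFC covering half of the tile.
half_covered = z_capacity(20.0, 0.5, 0.0, 0.5, 5.0)
# A tile fully covered by an MFC has no room for signal TSVs.
fully_covered = z_capacity(20.0, 0.5, 0.0, 1.0, 5.0)
```

In the first case, 400µm² of tile area shrinks to 200µm² after cells and to 100µm² after the MFC adjustment, which leaves room for four signal TSVs at the assumed pitch.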
The global routing tile objects are shown in Figure 3. The width of the routing tile
is fixed at 20µm. Note that some tiles are fully covered by MFCs and thus have zero z-
direction capacity. Since the size of a P/G TSV is comparable to that of a routing tile,
the tiles that contain P/G TSVs have significantly lower x/y/z-direction routing capacities.
Also note that TSV diameter is much larger than the global wire width, making them










Figure 3: Top view of global placement and routing tiles with MFCs. Only a part of the
chip is shown. Objects are drawn to scale based on the default settings. P/G thin wires
are not shown for simplicity.
3.2.2 Power Interconnects
In 3D ICs, power is delivered to all devices through a power interconnect hierarchy. The
global power distribution network on each die uses grids made of orthogonal interconnects on
the top wiring levels. Power is fed from the package through power I/O bumps distributed
over the bottom-most die, and travels to the upper dies via P/G TSVs.
The top view of the P/G network is shown in Figure 4. It is assumed that P/G TSVs
are placed regularly in a dual mesh structure, and each P/G TSV has a co-located P/G I/O
bump on the bottom side of the chip to reduce the parasitic effects of connecting a P/G
TSV to a P/G I/O bump. The pitch between two power TSVs is predefined as 200µm which
is used for all dies. The diameter of P/G TSVs is around 10µm. P/G wires are globally
distributed on Metal 7 and 8. Thick wires of 10µm width connect P/G TSVs. Between two
thick wires, 10 thin wires are placed and the remaining space is used for signal wires. The
area ratio between P/G thin wires and signal wires can be varied; if the ratio is 0.4, P/G
thin wires occupy 40% (= 20% each) of the routing tile area on Metal 7 and 8, and the rest
(= 60%) is for signal routing. Since P/G TSVs provide currents to dies, more TSVs with
lower resistance usually decrease power noise. In the 3D IC structure for this project, each
P/G TSV pierces through the entire stack for efficient vertical power delivery (see Figure
2). Thus, no gates can be placed and no wires can be routed at the P/G TSV locations.
For the global placement tiles with pre-placed P/G TSVs, the placement tile capacity is
decreased by a large amount, and the corresponding global routing tile has decreased signal












Figure 4: Top view of the P/G network.
From the design, a resistive mesh structure is built with a current source at each grid
node that represents power consumption. Then the IR-drop analysis is performed with
the modified nodal analysis technique [41] for faster analysis. The domain decomposition
technique [42] is also applied, which decomposes the circuit into several parts and uses a
mathematical technique to reduce the time needed for matrix inversion. After the simulation, the
IR-drop values of all grid nodes are obtained, and the maximum is used as the response.
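The idea of a resistive mesh with a current sink at each node can be illustrated with a very small solver. This sketch is not the flow used here: instead of modified nodal analysis with domain decomposition, it solves the nodal equations with plain Gauss-Seidel iteration, which is adequate only for tiny grids. The grid size, conductance, currents, and pad locations are all hypothetical, and at least one pad node is required.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]


def max_ir_drop(rows: int, cols: int, pads: Set[Node], g_seg: float,
                i_node: float, vdd: float, sweeps: int = 5000) -> float:
    """Maximum IR drop on a uniform resistive mesh.

    Every mesh segment has conductance g_seg (siemens), every non-pad node
    draws i_node (amperes), and pad nodes are held at vdd (volts).
    """
    v: List[List[float]] = [[vdd] * cols for _ in range(rows)]
    for _ in range(sweeps):
        for r in range(rows):
            for c in range(cols):
                if (r, c) in pads:
                    continue
                nbrs = [(r + dr, c + dc)
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= r + dr < rows and 0 <= c + dc < cols]
                # Kirchhoff's current law at the node, solved for its voltage.
                v[r][c] = (g_seg * sum(v[a][b] for a, b in nbrs) - i_node) \
                    / (g_seg * len(nbrs))
    return max(vdd - v[r][c] for r in range(rows) for c in range(cols))


# Hypothetical 1x3 rail: pad on the left, 1 ohm segments, 1 mA per node.
drop = max_ir_drop(rows=1, cols=3, pads={(0, 0)}, g_seg=1.0,
                   i_node=1e-3, vdd=1.0)
```

For this rail, the first segment carries 2mA and the second 1mA, so the far node sees a 3mV drop, which the solver reproduces. As in the text, the maximum over all nodes would serve as the response value.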
3.2.3 Thermal Interconnects
3D ICs bring several challenges in thermal management. By stacking layers, the power
consumption per unit horizontal footprint area is significantly increased. In addition, the
interior layers of 3D ICs are thermally detached from the heat sink. Heat transfer is further
restricted by interlayer dielectric and oxide-based bonding layers with low thermal conduc-
tivity. In this chapter, two possible solutions to the thermal problem are discussed: T-TSV
and MFC.
One way of dissipating heat is to insert T-TSVs in the white spaces of 3D ICs. T-TSVs
do not provide any electrical functionality. They help decrease the on-chip temperature
by lowering the inter-layer thermal resistance, hence providing more thermally conductive
paths to the heat sink. Moreover, T-TSVs help distribute the heat more evenly throughout
the entire chip, thus reducing the negative impact of high temperature areas (hot spots).
T-TSVs go through the entire die (so-called via-last TSVs), whereas signal TSVs do not,
as shown in Figure 2(a). To avoid electrical shorts, signal wires, signal TSVs, and P/G TSVs
should not make contact with T-TSVs.
Unlike conventional air-cooled heat sinks or T-TSVs, liquid cooling using MFCs offers
a much larger heat transfer coefficient and a chip-scale cooling solution. MFCs can be
fabricated on the back side of silicon dies, enabling efficient rejection of heat from every
layer. The thermal resistance of a micro-channel heat sink for a single chip was previously
measured [7]. When de-ionized water was used as the coolant, the junction-to-ambient thermal
resistance of the heat sink was 0.24 °C/W at a flow rate of about 65 mL/min without TSVs
(the impact of copper TSVs on the thermal conductivity of the silicon micro-channel wall is
negligible), which is significantly better than current state-of-the-art air-cooled heat
sinks [7]. The smallest resistance possible for an air-cooled heat sink is around 0.5 °C/W.
MFCs capped with a thin polymer (Avatrel 2000 P) coating (∼30 µm) were tested up to 2.5 atm
pressure with no leakage observed during continuous operation [43].
The on-chip thermal network is composed of fluidic TSVs, manifolds, and MFCs. It is
assumed that all the fluidic TSVs and manifolds are located outside the core region in which
all gates and metal wires reside. Thus, only MFCs are considered for the analysis. Also, the
coolant pump and heat exchanger are assumed to be off-chip.
The geometries of MFCs — depth, width, and pitch — have impacts on thermal and
routability objectives. By increasing MFC depth, the mass flow rate of fluid and thus
cooling capability can be improved. However, it also increases die thickness, and for a fixed
aspect ratio of TSVs, the diameter of signal TSVs increases proportionally. Since larger
TSVs consume more silicon space and have higher parasitics, it is not desirable to have deep
MFCs. In contrast, MFC width can be increased without hurting silicon space. However,
wide MFCs decrease z-direction routing capacity considerably. Since MFCs should not
touch P/G TSVs, MFC pitch should be decided along with P/G TSV pitch. Thus, in this
chapter, the MFC depth and the pitch are fixed and only the MFC width is varied.
For the T-TSV case, the thermal analyzer of this project is based on finite element analysis,
where the entire 3D IC is mapped onto a 3D thermal mesh structure. To calculate the
thermal conductivity of each thermal tile, the material composition of the tile is checked.
For thermal tiles that correspond to a silicon layer, the z-direction thermal conductivity of
the thermal tile is calculated as follows:
$$k_{tile,z} = AR_{TSV} \times k_{Cu} + (1 - AR_{TSV}) \times k_{Si}$$
where $AR_{TSV}$ is the area ratio of total TSVs (signal, P/G, and thermal) in the tile, and
$k_{Cu}$ and $k_{Si}$ are the thermal conductivities of copper and silicon. Note that $k_{Cu}$
and $k_{Si}$ are about 400 and 150 W/(m·K), which suggests that replacing silicon with copper
may reduce thermal resistance by about 62%. The boundary conditions are as follows: the four
lateral sides of the chip are in contact with ambient air, while the top side has a heat sink
with a thermal resistance of 0.25 K/W. The bottom side is assumed to be adiabatic. Then,
the following matrix equation is solved: G · T = P, where G is the thermal conductance
matrix calculated from $k_{tile}$, T is the temperature vector, and P is the power vector. The
temperature distribution can be directly found from T.
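The area-weighted conductivity formula and the "about 62%" figure can be checked with a short sketch. The function names are hypothetical, and treating thermal resistance as inversely proportional to conductivity for a fixed tile geometry is an assumption made here for the arithmetic.

```python
K_CU = 400.0  # W/(m*K), approximate value quoted in the text
K_SI = 150.0  # W/(m*K), approximate value quoted in the text


def tile_kz(area_ratio_tsv, k_cu=K_CU, k_si=K_SI):
    """z-direction conductivity of a silicon-layer tile (area-weighted mix)."""
    if not 0.0 <= area_ratio_tsv <= 1.0:
        raise ValueError("area ratio must lie in [0, 1]")
    return area_ratio_tsv * k_cu + (1.0 - area_ratio_tsv) * k_si


def resistance_reduction(area_ratio_tsv):
    """Fractional drop in z-direction thermal resistance versus pure silicon.

    Assumes resistance is proportional to 1/k for a fixed tile geometry.
    """
    return 1.0 - tile_kz(0.0) / tile_kz(area_ratio_tsv)


full_copper = resistance_reduction(1.0)  # limiting case: tile entirely copper
ten_percent = resistance_reduction(0.1)  # tile with a 10% TSV area ratio
```

Under these assumptions the limiting case gives 1 - 150/400 = 0.625, while a tile with a 10% TSV area ratio sees a much smaller reduction, which is consistent with the modest benefit of T-TSVs in bulk silicon reported later in the chapter.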
T-TSV insertion is performed with a predefined maximum T-TSV ratio, which is the
maximum allowed area ratio of T-TSVs per placement tile. For example, if the maximum
T-TSV ratio is 0.1, up to 10% of the silicon area of the placement tile may be used for
T-TSVs. After global routing is completed, a thermal analysis is run to obtain the
temperature distribution without T-TSVs. Thermal tiles with higher temperatures are assigned
higher temperature severity. Then, according to temperature severity, a target T-TSV ratio,
which is less than or equal to the maximum T-TSV ratio, is assigned to the routing tile. The
target T-TSV area ($S_{target\_ttsv}$) is the target T-TSV ratio multiplied by the placement
tile area. To see if the target T-TSV area can be accommodated, the white space of the
placement tile is calculated as follows:
$$S_{white} = S_{p\_tile} - S_{pgtsv} - S_{gate} \quad (1)$$
where $S_{white}$ is the white space of the placement tile, $S_{p\_tile}$ is the area of the
placement tile, $S_{pgtsv}$ is the P/G TSV area in the placement tile, and $S_{gate}$ is the
total gate area placed in the placement tile. The final T-TSV area assigned to the placement
tile is the minimum of $S_{white}$ and $S_{target\_ttsv}$. Note that signal wire and signal TSV
areas are not considered in Equation 1. Instead, the routing capacity in the x/y/z-directions
is decreased after T-TSV insertion. After T-TSV insertion, signal nets that are not routable
with the updated routing capacity are ripped up and rerouted. Then a thermal analysis is
performed again to observe the impact of T-TSVs. Note that other thermal insertion algorithms
may be used with the proposed DOE and RSM based optimization, as long as they do not change
the design space characteristics dramatically. As will be discussed in Section 3.3, if the
design space changes much during the design flow, it may not be possible to find response
models.
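The per-tile area assignment described above can be sketched as follows. The function name, argument names, and example areas are hypothetical; only the min-of-white-space-and-target rule and Equation 1 come from the text. How temperature severity maps to a target ratio is not specified here, so the target ratio is taken as an input.

```python
def ttsv_area(tile_area, pgtsv_area, gate_area, target_ratio, max_ratio):
    """T-TSV area assigned to one placement tile (all areas in um^2)."""
    ratio = min(target_ratio, max_ratio)           # target never exceeds the maximum
    target_area = ratio * tile_area                # S_target_ttsv
    white_space = max(0.0, tile_area - pgtsv_area - gate_area)  # Equation 1
    return min(white_space, target_area)


TILE_AREA = 20.0 * 20.0  # 20 um x 20 um placement tile, as assumed in this chapter

# Plenty of white space: the target area is granted in full
roomy = ttsv_area(TILE_AREA, pgtsv_area=0.0, gate_area=250.0,
                  target_ratio=0.1, max_ratio=0.2)
# Crowded tile: the white space limits the T-TSV area
crowded = ttsv_area(TILE_AREA, pgtsv_area=50.0, gate_area=340.0,
                    target_ratio=0.1, max_ratio=0.2)
```

Clamping the white space at zero is a defensive choice made for this sketch, so that an over-full tile simply receives no T-TSVs.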
To analyze the thermal performance of MFCs in 3D ICs, numerical simulations are
performed. A three-dimensional thermal model [44] is modified to consider the lateral
temperature and fluid flow rate distribution caused by non-uniform power/heat flux distri-
bution. The side view of the 3D IC with MFCs is shown in Figure 5. It is assumed that
the temperatures of the fluid and the solid domains are different but uniform at each cross
section within each control volume. Thermal and fluid flow in MFCs are described by the




























) + q̇g + q̇c = 0 (4)
Tw and Tf represent the temperatures of the solid and the fluid, respectively, and ṁ, i, and hc
are the mass flow rate, enthalpy, and convective heat transfer coefficient, respectively. For each MFC,
heat is directly supplied only to the channel base, and the channel wall is analyzed as a fin
attached to the base (η0 is the overall surface efficiency for heat transfer, including an array
of fins and the base surface). MFC geometry is described by the channel perimeter P̃ and
the width w. Equation (2) represents the fluid enthalpy change because of the convective
heat transfer owing to the temperature difference between the solid and fluid, as well as
fluid convective motion. The pressure drop along the MFC is obtained by solving the fluid
momentum balance equation, (3), wherein P, G, and ρ are the pressure, mass flux, and density
of the fluid, respectively, f is the fluid friction factor, and dh is the hydraulic diameter of
an MFC. Equation (4) is the three-dimensional thermal transport equation for the solid. It
has two source/sink terms owing to heat generated (q̇g) from the active and oxide-metal
layers and convective heat transfer (q̇c) to the fluid (k denotes the thermal conductivity of
the solid).







Figure 5: Side view of the thermal grid structure used for a 3D IC with MFCs.
Deionized water is considered as the working fluid, and the fluid temperature at the inlet
is set to 20 °C. The governing equations, (2), (3), and (4), are integrated over a control
volume and then discretized using the upwind scheme [45]. The resulting system of linear
algebraic equations is solved iteratively using the successive under-relaxation method.
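As an illustration of the iterative solution step, a minimal sketch of successive under-relaxation on a generic linear system is given below. The matrix, right-hand side, and relaxation factor are hypothetical stand-ins; the actual discretized MFC equations are not reproduced here.

```python
def under_relaxed_solve(a, b, omega=0.8, tol=1e-12, max_iters=10000):
    """Solve a x = b by Gauss-Seidel sweeps with relaxation factor omega < 1."""
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iters):
        change = 0.0
        for i in range(n):
            sigma = sum(a[i][j] * x[j] for j in range(n) if j != i)
            gs_value = (b[i] - sigma) / a[i][i]
            # Blend the old value with the Gauss-Seidel update (under-relaxation)
            new_x = (1.0 - omega) * x[i] + omega * gs_value
            change = max(change, abs(new_x - x[i]))
            x[i] = new_x
        if change < tol:
            break
    return x


# Hypothetical diagonally dominant system standing in for the discretized equations
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
B = [15.0, 10.0, 10.0]
solution = under_relaxed_solve(A, B)
```

A relaxation factor below 1 slows each update, which is commonly used to stabilize coupled fluid and thermal iterations at the cost of more sweeps.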
3.2.4 Overview of Physical Design for 3D ICs
The design package of this project works with standard cell based circuits and consists of
several major steps. The overall design flow with DOE and RSM is summarized in Figure 6.
The input factors (or design knobs) and the responses (or assessing metrics) are defined. A
single experimental run is equivalent to performing gate-level global placement and routing.
After reading in the input circuit, partitioning is performed to divide the input circuit into
dies. The partitioning step not only relates closely to the routing quality but also determines
the signal TSV distribution. A min-cut based partitioner [46] is used to minimize signal
TSV counts. Note that minimizing signal TSV count does not always lead to optimal design






















Figure 6: Overall design flow with DOE and RSM.
In the placement step, global placement of cells is performed onto the Np × Np × Ndie
placement grid, where Np is the number of placement tiles along the x- and y-axes and Ndie is
the number of dies in the 3D IC. A force-directed placement algorithm for 3D ICs [47] is used.
Note that the same placement is used for all the experimental runs to limit the solution space
change. Even though a smaller P/G TSV diameter may allow more gates in the same placement
tile, negligible differences in total wirelength are observed among placements with different
P/G TSV diameter settings.
Next, a global routing is performed on the Nr×Nr×Ndie routing grid. Nr is the number
of routing tiles in x/y-axis. In this chapter, it is assumed that the placement and routing
tiles are of the same size, 20µm. The reason for global routing, instead of detailed routing,
is to obtain quick but relatively accurate pictures of routing congestion. In the routing flow,
signal nets are first routed without any MFCs or P/G TSVs. Then, MFCs (for MFC case),
P/G TSVs, and wires are routed. After that, T-TSVs may be inserted (for T-TSV case).
Since these power and thermal interconnects incur routing congestion, rip-up and reroute
is performed for signal nets with routing capacity violations. Signal routing is performed
before the other interconnects because a congestion-aware 3D maze router is used: if other
objects already exist, the routing results differ considerably, which changes the solution
space significantly. This design flow limitation may be lifted if a 3D Steiner router [48] is
used. In all experiments, at most 1% of nets needed rip-up and reroute. To ensure routability,
it is checked whether global routing fails because of insufficient routing capacity.
After all the routings are finished, a power map is generated based on the placement
and routing results and a power noise and a thermal analysis are performed. The metrics
are evaluated and the experimental run completes. Once all the experiments are performed,
response surfaces are constructed and used to obtain optimal design solutions. The response
surface models are only applicable to the chip that the model was built for. The input factors
and their trade-offs are summarized in Table 2, and the responses are summarized in Table
3. Note that some of the input factors and the responses are only for either MFC case or
T-TSV case, while some are for both.
Table 2: Input factors used in this chapter.
T-TSV ratio (T-TSV case only): The maximum T-TSV area ratio per placement tile. This provides
a trade-off between thermal and signal.
MFC width (MFC case only): Width of an MFC. All MFCs have the same width. A wider MFC
means higher mass flow rate and better cooling capability. This …
MFC pressure drop (MFC case only): The pressure drop between the inlet and outlet of an MFC.
All MFCs have the same pressure drop. This also affects the mass flow rate …
P/G TSV diameter (both cases): The diameter of a P/G TSV, which affects the parasitics of P/G
TSVs. This provides a trade-off between power noise and signal.
P/G thin wire ratio (both cases): The ratio between P/G thin wires and signal wires on Metal 7
and 8. This also provides a trade-off between power noise and signal.
Table 3: Responses used in this chapter.
Total wirelength (both cases): Sum of all the wirelengths of signal nets. This value represents
the quality of signal interconnect.
Max. IR-drop (both cases): The maximum IR-drop of the entire power grid. This represents
the quality of power interconnect.
Max. Si temp. (both cases): Maximum silicon temperature of the die stack. This represents the
quality of thermal interconnect.
Pump power (MFC case only): The coolant pump power to provide fluid through MFCs. This
value may be considered during system power budget planning.
3.3 Design of Experiments and Response Surface Methodology
The main goal of design of experiments is to statistically control the experiments so that the
output responses can be used for drawing meaningful conclusions on the system. It involves
designing the experiments, performing the experiments, and analyzing the responses. After
running the experiments and gathering the responses, the fitted model per response is found
to understand and optimize the system. This is called response surface methodology [49].
DOE and RSM based optimization is suitable for this problem for the following
reasons: (1) Knowledge of the target system, such as the relationship between input
factors and responses, is gained during the process. (2) Compared with Monte Carlo or
random search methods, this method can characterize the system with far fewer
experimental runs, providing a faster yet reasonably accurate solution. (3) The response
models can be reused if the optimization goal is changed without affecting the whole design
settings, which suggests that this method is flexible and suitable for early design stages.
3.3.1 Classical DOE
First, a classical design, Box-Behnken [50], is performed. In this design, each input factor
is assigned three levels (minimum, center, maximum). This design can sufficiently fit a
quadratic model with fewer experimental runs than a full factorial design.
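The run-count advantage can be seen from a minimal sketch of the textbook Box-Behnken construction in coded units (-1, 0, +1). The function name is hypothetical, the pairwise construction shown is the standard one only for three to five factors, and the single center point is an assumption (the number of center runs used in this project is not stated).

```python
from itertools import combinations, product


def box_behnken(k, center_points=1):
    """Box-Behnken runs in coded units for k = 3, 4, or 5 factors."""
    if not 3 <= k <= 5:
        raise ValueError("pairwise construction shown only for 3 to 5 factors")
    runs = []
    for i, j in combinations(range(k), 2):
        # Vary one pair of factors over (-1, +1) while the others stay at 0
        for a, b in product((-1, 1), repeat=2):
            point = [0] * k
            point[i], point[j] = a, b
            runs.append(tuple(point))
    runs.extend([(0,) * k] * center_points)
    return runs


ttsv_design = box_behnken(3)  # three input factors, as in the T-TSV case
mfc_design = box_behnken(4)   # four input factors, as in the MFC case
full_factorial_runs = 3 ** 4  # three-level full factorial with four factors
```

With one center point this gives 13 and 25 runs for three and four factors, compared with 27 and 81 runs for the corresponding three-level full factorial designs.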
Since MFCs should not contact P/G TSVs, the following constraint is applied to
the region of interest (ROI) for the MFC case:
$$w_{mfc} + d_{pgtsv} + 2 \cdot msp_{mfc-pgtsv} \le p_{pgtsv}/2 \quad (5)$$
Here, $w_{mfc}$ is the MFC width, $d_{pgtsv}$ is the P/G TSV diameter, $msp_{mfc-pgtsv}$ is the
minimum spacing between an MFC and a P/G TSV, and $p_{pgtsv}$ is the P/G TSV pitch. The P/G TSV
pitch is divided by 2 to obtain the distance between a power and a ground TSV (see Figure 3).
$msp_{mfc-pgtsv}$ is set to 5 µm in this chapter. For the designs that satisfy (5), MFCs can be
placed so that they do not touch P/G TSVs.
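Constraint (5) is simple enough to check directly. In the sketch below the function name is hypothetical; the 200 µm P/G TSV pitch is the default from Table 4, the 5 µm spacing is the value stated above, and the widths and diameters tried are taken from the input factor ranges used later in this chapter.

```python
def mfc_fits(w_mfc_um, d_pgtsv_um, p_pgtsv_um=200.0, msp_um=5.0):
    """True if an MFC of the given width fits between a power and a ground TSV."""
    return w_mfc_um + d_pgtsv_um + 2.0 * msp_um <= p_pgtsv_um / 2.0


# Candidate (MFC width, P/G TSV diameter) pairs in um
candidates = [(30.0, 5.0), (80.0, 10.0), (80.0, 15.0)]
feasible = [pair for pair in candidates if mfc_fits(*pair)]
```

With these values the widest MFC (80 µm) is feasible with a 10 µm P/G TSV (80 + 10 + 10 = 100 ≤ 100) but not with a 15 µm one, so one corner of the factor ranges falls outside the ROI.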
In addition to the designed data set, a validation data set with four design points was
generated for each of the T-TSV and MFC cases to see how well the models predict unseen
design points. The classical designs did not provide models accurate enough for optimization.
3.3.2 Advanced DOE
To increase model accuracy, a more complex DOE is performed. Since the design package
of this project does not involve randomized algorithms, there are no random error effects in
the experimental results. That is, if an experiment is repeated with the same settings, it
will produce the same response. Thus, randomization and blocking [51] are not performed
in the DOE in this chapter.
Since it is hard to theoretically derive optimal design points because of the complex
structure of the system, a space-filling design style is adopted. Latin Hypercube sampling
distributes N design points at N different levels for each input factor. In this sampling,
the number of design points needs to be carefully determined based on the number of input
factors and their ranges as well as the response model accuracy. Meanwhile, the same
validation data sets as in the classical DOE are used to check the model prediction capability.
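A minimal sketch of Latin Hypercube sampling is shown below. It is not the generator used in this project: the function name, the fixed seed, and the choice to place each point at the midpoint of its level are assumptions for illustration. The factor ranges are those of the T-TSV case given later in this chapter.

```python
import random


def latin_hypercube(n_points, ranges, seed=0):
    """Return n_points samples; every factor uses each of its n_points levels once."""
    rng = random.Random(seed)
    columns = []
    for low, high in ranges:
        levels = list(range(n_points))
        rng.shuffle(levels)  # independent random pairing of levels across factors
        step = (high - low) / n_points
        # Place each sample at the midpoint of its level (an assumption)
        columns.append([low + (level + 0.5) * step for level in levels])
    return [tuple(col[i] for col in columns) for i in range(n_points)]


# T-TSV ratio, P/G TSV diameter (um), P/G thin wire ratio
factor_ranges = [(0.0, 0.2), (5.0, 15.0), (0.2, 0.8)]
design = latin_hypercube(32, factor_ranges)
```

Because each level of each factor is used exactly once, the projection of the design onto any single factor is evenly spread, which is what makes the design space-filling with relatively few runs.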
3.3.3 Finding Best Response Models
The accuracy of the response models is important because the models are used in the
optimization process. Determining the parameters of response models is based on regression
analysis. Polynomial response models with n input factors can be expressed in multivariate
polynomial equations:











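$$y = a_0 + \sum_{i=1}^{n} a_i x_i + \sum_{i=1}^{n} \sum_{j>i} a_{ij} x_i x_j + \sum_{i=1}^{n} a_{ii} x_i^2 + \cdots$$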
Here, $x_i, x_i^2, \cdots$ are called 'main factors', while $x_i x_j, x_i^2 x_j, \cdots$ are
called 'interaction factors'. By RSM, the parameters ($a_0, a_i, \cdots$) are estimated such
that the response equation fits the data optimally. The goodness-of-fit of the model can be
checked with the root mean square error (RMSE) and the coefficient of multiple determination
($R^2$). When $R^2$ is closer to 1, the model explains the observed data better.
For the models to predict unseen design points well, it is important to avoid the overfitting
problem, in which the response curve follows not only the underlying truth but also the
unwanted noise in the data. With N design points in the data set, each data point is removed
from the data set in turn, the remaining N − 1 runs are used to fit the prediction model
equation, and the fitted model is used to predict the removed point. The sum of squares of
these prediction residuals is called PRESS. Overfitting can be checked by comparing
PRESS RMSE to RMSE: when PRESS RMSE is much higher than RMSE, the model is overfitted.
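The leave-one-out procedure can be sketched for a single-factor polynomial model as follows. Everything here is illustrative: the data are synthetic, the helper names are hypothetical, RMSE is taken as the square root of the mean squared residual (other normalizations exist), and a small normal-equation solver stands in for a regression library.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_poly(xs, ys, order):
    """Least-squares polynomial coefficients via the normal equations."""
    rows = [[x ** p for p in range(order + 1)] for x in xs]
    k = order + 1
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(ata, aty)


def predict(coef, x):
    return sum(c * x ** p for p, c in enumerate(coef))


def rmse(xs, ys, order):
    coef = fit_poly(xs, ys, order)
    sse = sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))


def press_rmse(xs, ys, order):
    press = 0.0
    for i in range(len(xs)):
        # Refit without point i, then predict the held-out point
        coef = fit_poly(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], order)
        press += (ys[i] - predict(coef, xs[i])) ** 2
    return math.sqrt(press / len(xs))


# Synthetic quadratic response with a small alternating disturbance
xs = [-1.0 + 0.2 * i for i in range(11)]
noisy = [1.0 + 2.0 * x + 3.0 * x * x + 0.05 * (-1) ** i for i, x in enumerate(xs)]
ratio_low = press_rmse(xs, noisy, 2) / rmse(xs, noisy, 2)
ratio_high = press_rmse(xs, noisy, 6) / rmse(xs, noisy, 6)
```

Comparing `ratio_low` and `ratio_high` shows how the gap between PRESS RMSE and RMSE can be used as an overfitting indicator when the polynomial order is raised.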
It is also observed that increasing polynomial order of main factors may improve the
model accuracy. Increasing model order generally increased R2 and decreased RMSE. How-
ever, PRESS RMSE did not decrease monotonically with increased model order. Each
response had a different optimal polynomial order. Additionally, stepwise regression [51]
may be performed to determine which polynomial term should be included in the model to
minimize PRESS RMSE.
To increase model accuracy further, hybrid radial basis functions (RBFs) are tried. A
hybrid RBF model has a polynomial model described above and an RBF network model.
The polynomial model determines global shape, while RBF network handles local variations.





Here, the j-th RBF is centered at µj with the weight βj . The profile functions (Φ) of RBF
kernels tried in this project are as follows:
$\Phi_{multiquadric}(r) = \sqrt{r^2 + \beta^2}, \quad \beta > 0$
$\Phi_{recmultiquadric}(r) = 1/\sqrt{r^2 + \beta^2}$





To find the best model, candidates of models (different polynomial order per input factor
and RBF kernels) are exhaustively generated and scores for each candidate are calculated.
The Score function is defined as:
Score =
R2




By this function, the RMSE and validation RMSE are minimized and R2 is maximized,
while the difference between RMSE and PRESS RMSE is suppressed. Then, the five
top-scoring models are further compared by response surface shape. The model with the fewest
needless curvatures on the surface was chosen as the best.
3.3.4 Optimization with Response Surface Models
With multiple responses and design constraints, there can be several optimization scenarios.
To consider multiple responses together, each of the responses under consideration is nor-
malized to [0, 1] and forms a partial cost. Then, they are combined into a single desirability
function [52] which is called a Cost function. The Cost function can be considered as a
new response surface. Then, using optimization algorithms such as nonlinear programming
or genetic algorithm, the optimal design point with the minimum Cost is found. Note
that the optimization is fast because it is performed on the Cost response surface, not the
actual experimental space. That is, no additional experimental run is needed during the
optimization process. Since some errors are inevitable in the response models, the actual
design point with minimum Cost could be different from the optimal design point found by
this optimization.
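The optimization on the Cost response surface can be sketched as below. This is a toy setup, not the project's optimizer: the two response models are hypothetical closed-form surfaces over two coded factors in [0, 1], the weights are arbitrary, partial costs are combined by a weighted sum, and an exhaustive grid search stands in for nonlinear programming or a genetic algorithm.

```python
import itertools


def normalize(values):
    """Map a list of response values linearly onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]


def minimize_cost(models, weights, levels, feasible=lambda p: True):
    """Grid search for the design point with the minimum weighted Cost."""
    points = [p for p in itertools.product(levels, repeat=2) if feasible(p)]
    # One normalized partial cost per response, evaluated on the models only
    partial = [normalize([model(p) for p in points]) for model in models]
    costs = [sum(w * col[i] for w, col in zip(weights, partial))
             for i in range(len(points))]
    best = min(range(len(points)), key=costs.__getitem__)
    return points[best], costs[best]


# Hypothetical response surfaces over two coded factors (x0, x1)
def wirelength_model(p):
    return p[0]  # grows with the first factor


def temperature_model(p):
    return (1.0 - p[0]) ** 2 + 0.1 * p[1]  # falls as the first factor grows


levels = [i / 10 for i in range(11)]
best_point, best_cost = minimize_cost(
    [wirelength_model, temperature_model], weights=(0.3, 0.7), levels=levels)
```

No experimental run is needed inside the search loop, since only the fitted models are evaluated; a design constraint such as an IR-drop limit could be passed through the `feasible` argument.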
Although accurate response models that can be used for optimization were found in this
project, this may not always be possible. In particular, when the design space characteristics
fluctuated significantly with varied input factors, it was not possible to fit a single
response model to the data with a sufficiently low error.
3.4 Experimental Results
The design package is implemented in C++/STL and MATLAB. The simulations were
executed on a 64-bit Linux server with two quadcore Intel Xeon 2.5GHz CPUs and 16GB
main memory. The target circuit (fft) is synthesized using Synopsys Design Compiler and
the North Carolina State University 45nm technology library [39]. The synthesized circuit
has about 370 thousand gates and nets. The technology and default setting parameters
are shown in Table 4. Note that the technology is solely based on the assumptions of this
project. Die size is 700 µm × 700 µm, the number of dies is four, the die bonding style is
face-to-back, the clock frequency is 1 GHz, and the power supply voltage is 1.1 V, unless
stated otherwise. Each signal TSV has a keep-out zone of 1 µm around it.
Table 4: The technology and default setting parameters. The baseline uses only a
top-mounted heat sink, not T-TSVs or MFCs.
Parameter                          baseline & T-TSV    MFC case
Si layer thickness (µm)            20                  60
Metal layer thickness (µm)         6                   6
Bonding layer thickness (µm)       2                   10
Signal TSV aspect ratio            10:1                30:1
Signal TSV diameter (µm)           2                   2
MFC depth (µm)                     0                   30
MFC pitch (µm)                     0                   200
P/G TSV pitch (µm)                 200                 200
Resistance of package pins (mΩ)    3                   3
To construct a power map, switching activities of gates were obtained from a commercial
tool with proper input stimuli. After routing is finished, dynamic power consumption of each
gate is calculated with the parasitic capacitances of the net driven by the gate. Combined
with leakage power, the power consumption of each gate is determined, which contributes to
the power value of each power map tile. The power map is used in the thermal analyzer and
the power noise analyzer. For T-TSV case, the thermal analyzer was written in C++/STL
and the runtime was about three minutes. For MFC case, the thermal analyzer was written
in MATLAB and the runtime was about two minutes. The runtime of a power noise (IR-
drop) simulation was less than 10 seconds.
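The power map construction can be sketched as follows. The text does not give the power formula, so the common switching-power model (activity × capacitance × Vdd² × frequency) is assumed here; the function names, the gate tuple layout, the example gates, and the 20 µm power map tile size are also assumptions. The 1.1 V supply, 1 GHz clock, and 700 µm die size are the defaults stated above.

```python
def gate_power(alpha, c_load_f, p_leak_w, vdd=1.1, f_clk=1e9):
    """Per-gate power in watts: assumed switching model plus leakage."""
    return alpha * c_load_f * vdd ** 2 * f_clk + p_leak_w


def build_power_map(gates, die_um=700.0, tile_um=20.0):
    """Accumulate gate power into square tiles of one die.

    gates: iterable of (x_um, y_um, alpha, c_load_f, p_leak_w) tuples.
    """
    n = int(die_um // tile_um)
    grid = [[0.0] * n for _ in range(n)]
    for x, y, alpha, c_load, leak in gates:
        col = min(int(x // tile_um), n - 1)
        row = min(int(y // tile_um), n - 1)
        grid[row][col] += gate_power(alpha, c_load, leak)
    return grid


# Two hypothetical gates in the same tile and one in another tile
example_gates = [
    (5.0, 5.0, 0.1, 10e-15, 1e-8),
    (15.0, 8.0, 0.2, 5e-15, 1e-8),
    (650.0, 300.0, 0.1, 10e-15, 0.0),
]
power_map = build_power_map(example_gates)
```

The resulting grid is the kind of input that both the thermal analyzer and the power noise analyzer would consume, one power value per tile.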
The initial design results are shown in Figure 7. The placement result in Figure 7(a)
reveals that cells are clustered more densely in some regions. Average utilization is about
62% including only cells, and about 70% including signal and P/G TSV area as well.
Overall, x- and z-direction congestion is moderate, yet in some regions z-direction
congestion is severe. The power map indicates that several power hotspots exist.
Figure 7: Initial design results with baseline settings: (a) placement tile utilization,
(b) x-direction routing utilization, (c) z-direction routing utilization, and (d) power map.
Power density unit is W/cm2 in the power map.
3.4.1 Comparison of 2D and 3D IC Designs
First, the designs in 2D and 3D ICs are compared. One 2D and two 3D IC designs (two and
four dies) are compared in Table 5. The die size is set so that each design case has about the
same total silicon area. With an increased number of dies, total wirelength becomes shorter
and total power consumption decreases. Congestion increases mainly because of z-direction
congestion. Longest path delay decreases, although it depends on more complex factors such as
how the gates on the path are partitioned into dies and placed. Signal TSVs occupy about
5.38% and 9.34% of the silicon area in the two and four die cases, respectively. Maximum
IR-drop increases with the number of dies, mainly because fewer P/G TSVs are placed in a
smaller footprint, and the resistance of P/G TSVs contributes to IR-drop.
Table 5: Comparison of 2D and 3D IC designs. Congestion means the number of routing edges
with 100% utilization.
                           2D - 1 die    3D - 2 dies    3D - 4 dies
Footprint (µm2)            1,960,000     1,000,000      490,000
Total wirelength (µm)      16,543,560    15,410,160     14,609,760
# signal TSVs              0             3,360          8,569
Congestion                 39            329            673
Longest path delay (ns)    2.031         1.910          1.796
Power consumption (W)      1.427         1.398          1.304
Max. IR-drop (mV)          4.092         4.330          6.831
Max. Si temp. (°C)         49.681        80.389         131.485
A big problem with 3D designs is elevated maximum silicon temperature. The main
reasons are that the power density of the four die design is very high (about 277 W/cm2) and
the bottom die has a long heat dissipation path to the heat sink located on top of the chip.
Hence a thermal solution is crucial for the four die design to be practical.
3.4.2 Comparison of T-TSV and MFC Based Cooling
The experimental results for baseline, T-TSV case, and MFC case are shown in Table 6.
There are no T-TSVs or MFCs in the baseline case. All three cases have almost the same
total wirelength. Congestion in T-TSV case is a little higher than the one in baseline,
because T-TSVs consume routing space. Total power consumption is higher in MFC case,
because of increased total wirelength and larger signal TSV capacitance. Maximum IR-drop
is also higher in MFC case, mainly because P/G TSVs in MFC case are longer and have
higher resistance.
Table 6: Comparison of baseline, T-TSV case, and MFC case.
                         baseline      T-TSV case    MFC case
Total wirelength (µm)    14,609,760    14,675,180    14,770,840
Congestion               673           793           698
Total power (W)          1.304         1.306         1.366
Max. IR-drop (mV)        6.831         6.838         9.965
Max. Si temp. (°C)       131.485       119.169       34.550
Compared with the baseline, T-TSVs decreased the maximum silicon temperature by
only 9.4%, which is small compared with an existing work [3], where silicon-on-insulator
technology was assumed. Since the insulator has very low thermal conductivity, inserting
T-TSVs would dramatically increase the thermal conductivity of the insulator layers.
However, in this chapter a bulk silicon technology is assumed, and the thermal conductivity
of silicon is good (about one third of that of copper). Thus, inserting T-TSVs did not
decrease the maximum silicon temperature much. In contrast, MFCs reduced
the maximum silicon temperature by 74%. This shows that T-TSV-based cooling is not as
efficient as MFC-based cooling.
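The quoted reductions follow directly from the maximum silicon temperatures in Table 6, taken relative to the baseline value in °C. A short check (the function name is hypothetical):

```python
def percent_reduction(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


# Maximum silicon temperatures from Table 6, in degrees Celsius
BASELINE_C = 131.485
TTSV_C = 119.169
MFC_C = 34.550

ttsv_reduction = percent_reduction(BASELINE_C, TTSV_C)
mfc_reduction = percent_reduction(BASELINE_C, MFC_C)
```

This gives roughly 9.4% for the T-TSV case and roughly 74% for the MFC case.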
Temperature profiles for the three cases are shown in Figure 8. The heat sink is attached
on top of the chip for baseline and T-TSV case (the top of the chip in MFC case contacts air
directly). Hence, the temperatures of lower dies are higher than those of upper dies, as they
are farther away from the heat sink. Comparing T-TSV case profile to baseline profile, the
temperature difference between dies is smaller in T-TSV case. Thus, T-TSVs were helpful
for the heat transfer in z-direction. In MFC case, the regions where MFCs are placed have
relatively lower temperature than their neighbors, and the temperature along the MFC flow
direction increases because the fluid absorbs the heat as it travels.
Figure 8: Temperature profiles for (a) baseline, (b) T-TSV case, and (c) MFC case. Dotted
lines in MFC case show MFCs.
3.4.3 Varying One Input Factor at a Time
As a preliminary experiment, one input factor is varied at a time to investigate its impact
on the responses. Through these preliminary experiments, the ranges of input factors that may
satisfy overall target performances are found and the possible trade-offs are checked. Each
input factor is set to three levels: (-, 0, +). ’0’ is the default value, ’-’ is the minimum
value, and ’+’ is the maximum value. While one input factor is varied, the other factors
remain at the ’0’ level. For each factor, the system is assumed to respond monotonically,
so that the maximum or minimum response does not occur in the middle of the specified
input range.
For MFC case, the input factor settings are determined as follows (-, 0, +): 1) MFC
width (µm) = (30, 55, 80), 2) MFC pressure drop (kPa) = (30, 50, 70), 3) P/G TSV
diameter (µm) = (5, 10, 15), 4) P/G thin wire ratio = (0.2, 0.5, 0.8).
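The one-factor-at-a-time run list for the MFC case can be generated as in the sketch below. The dictionary keys and function name are hypothetical; the level values are those listed above. Counting the all-default run once is an assumption about how the runs were organized.

```python
# (-, 0, +) levels for the MFC case, as listed in the text
MFC_LEVELS = {
    "mfc_width_um": (30, 55, 80),
    "mfc_pressure_drop_kpa": (30, 50, 70),
    "pg_tsv_diameter_um": (5, 10, 15),
    "pg_thin_wire_ratio": (0.2, 0.5, 0.8),
}


def one_factor_at_a_time(levels):
    """Baseline run plus a '-' and a '+' run for each factor."""
    baseline = {name: lv[1] for name, lv in levels.items()}
    runs = [dict(baseline)]
    for name, (low, _, high) in levels.items():
        for value in (low, high):
            run = dict(baseline)
            run[name] = value  # only this factor leaves its '0' level
            runs.append(run)
    return runs


mfc_runs = one_factor_at_a_time(MFC_LEVELS)
```

With four factors this yields nine runs: the shared baseline and two excursions per factor.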
The results of preliminary experiments for MFC case are shown in Figure 9(a). Total
wirelength and longest path delay are dependent on MFC width and P/G TSV diameter,
whereas IR-drop is mostly dependent on P/G TSV diameter and P/G thin wire ratio. Com-
paring total wirelength and IR-drop graphs, it is clear that increasing P/G TSV diameter
can decrease IR-drop by a lot at the cost of a little increased total wirelength. Meanwhile,
maximum silicon temperature and pump power are mostly dependent on MFC width and
MFC pressure drop. For maximum silicon temperature, MFC width affects more than MFC
pressure drop, while for pump power MFC pressure drop does more than MFC width.
For T-TSV case, the input factor settings are determined as follows (-, 0, +): 1) T-TSV
ratio = (0, 0.1, 0.2), 2) P/G TSV diameter (µm) = (5, 10, 15), 3) P/G thin wire ratio = (0.2,
0.5, 0.8). Note that the maximum T-TSV ratio is restricted to 20% or less. Inserting too
many T-TSVs may consume too much silicon and routing space and may incur reliability
issues because of thermal expansion coefficient mismatch.
The results of preliminary experiments for T-TSV case are shown in Figure 9(b). Total
wirelength is dependent on all three input factors, of which the most influential one is T-
TSV ratio. Longest path delay is affected by T-TSV ratio and P/G TSV diameter. IR-drop
is dependent solely on P/G TSV diameter and P/G thin wire ratio. Note that compared
with MFC case results, the IR-drop range is smaller. This is mainly because in T-TSV case
die thickness is smaller and P/G TSVs have lower resistance. Maximum silicon temperature
drops sharply when the T-TSV ratio is changed from ’-’ to ’0’, however the drop is greatly









































































































































Figure 9: Results of preliminary experiments for (a) MFC case and (b) T-TSV case.
3.4.4 Advanced DOE - T-TSV Case
To perform an advanced DOE for T-TSV case, 32 Latin Hypercube design points are gen-
erated, then eight corner design points are added manually. The reason for adding corner
design points is to reduce the error on the boundaries of the ROI. The input factor ranges
are: 1) T-TSV ratio = [0, 0.2], 2) P/G TSV diameter = [5, 15](µm), and 3) P/G thin wire
ratio = [0.2, 0.8].
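The eight manually added corner points are the 2^3 combinations of the range endpoints of the three input factors. A minimal sketch (dictionary keys and function name are hypothetical):

```python
from itertools import product

# Input factor ranges for the T-TSV case, as listed above
TTSV_RANGES = {
    "ttsv_ratio": (0.0, 0.2),
    "pg_tsv_diameter_um": (5.0, 15.0),
    "pg_thin_wire_ratio": (0.2, 0.8),
}


def corner_points(ranges):
    """All combinations of range endpoints, one design point per ROI corner."""
    names = list(ranges)
    return [dict(zip(names, combo))
            for combo in product(*(ranges[name] for name in names))]


corners = corner_points(TTSV_RANGES)
total_design_points = 32 + len(corners)  # Latin Hypercube points plus corners
```

Together with the 32 Latin Hypercube points this gives 40 design points for the T-TSV case.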
The candidate models for maximum silicon temperature in the T-TSV case are summarized
in Table 7. The scoring method presented in Section 3.3 is applied. Although the model in
the last row had a higher Score, the model in the fourth row was chosen as the best, because
it had fewer unwanted curvatures on the response surface.
Table 7: Candidate models for maximum silicon temperature in T-TSV case. Only the best
five models are shown. The numbers in the parentheses after Poly mean the polynomial
order of (T-TSV ratio, P/G TSV diameter, P/G thin wire ratio / interaction), and the
name in the parentheses after RBF means the RBF kernel type. ’+stepwise’ means stepwise
regression was performed.
Model type                        # parameters   RMSE    PRESS RMSE   Validation RMSE   R2      Score
Poly(8,8,8/2)+stepwise            13             0.253   0.338        0.411             0.998   0.3953
Poly(8,8,8/3)+stepwise            23             0.117   0.155        0.463             1.000   0.4565
Poly(10,7,2/3)+stepwise           18             0.133   0.268        0.210             0.999   0.3887
Poly(11,3,3/2)+stepwise           16             0.143   0.278        0.169             0.999   0.4574
Poly(10,5,0/3)+RBF(thin-plate)    27             0.033   0.081        0.154             1.000   0.4660
The best models for the T-TSV case are summarized in Table 8. For every model, PRESS RMSE
is less than twice RMSE, so overfitting is unlikely. Validation RMSE is not too far from
PRESS RMSE, suggesting that the models can predict unseen design points well.
Table 8: Summary of models for T-TSV case with advanced DOE.
Response           Model type                # parameters   Average      RMSE       PRESS RMSE   Validation RMSE   R2
Total wirelength   Poly(3,5,2/2)+stepwise    9              14,699,694   4305.241   4989.319     9643.771          0.997
Max. IR-drop       Poly(2,7,7/4)+stepwise    21             7.893        0.003      0.004        0.004             1.000
Max. Si temp.      Poly(11,3,3/2)+stepwise   16             121.212      0.143      0.278        0.169             0.999
The parameters for total wirelength model of T-TSV case are shown in Table 9. T-TSV
ratio and P/G TSV diameter are strong main factors, and also significant interaction is
observed between them.
Table 9: Parameters for total wirelength model of T-TSV case with advanced DOE.
TTSV rat, PGdia, and PGthin mean maximum T-TSV ratio, P/G TSV diameter, and
P/G thin wire ratio.
Term                 Parameter
Constant             1.467E+07
PGdia                4.217E+04
TTSV rat             8.926E+04
PGthin               1.082E+04
PGdia^2              3.158E+04
PGdia × TTSV rat     1.197E+04
TTSV rat^2           2.837E+04
TTSV rat × PGthin    4.499E+03
PGdia^4              -1.942E+04
The response surfaces of all metrics for the T-TSV case are shown in Figure 10. For total
wirelength and maximum silicon temperature, the most significant input factor was T-TSV
ratio. Total wirelength increases with higher maximum T-TSV ratio because of the
congestion incurred by T-TSVs. P/G TSV diameter affected the response in the same
manner, yet its impact is smaller. In the case of maximum IR-drop, P/G TSV diameter and
P/G thin wire ratio were the major factors. Maximum IR-drop decreased with higher P/G
thin wire ratio and larger P/G TSV diameter. Maximum silicon temperature drops sharply
when maximum T-TSV ratio increases from 0 to 0.05; after 0.05 the slope becomes gentle.
Note that P/G TSV diameter interacts with maximum T-TSV ratio in the maximum silicon
temperature model. This is because P/G TSVs occupy silicon space, which affects the actual
amount of T-TSVs inserted; this amount can differ from the maximum T-TSV ratio.
Figure 10: Response surfaces for T-TSV case with advanced DOE. For each metric, the
two significant input factors are shown.
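With the response models, the optimization is performed. Two optimization scenarios
are considered:
• Scenario 1: Minimize total wirelength, maximum IR-drop, and maximum silicon temperature.
$Cost_1 = \sqrt[3]{Cost^*_{wl} \cdot Cost^*_{ir} \cdot Cost^*_{st}}$
where $Cost^*_{wl}$, $Cost^*_{ir}$, and $Cost^*_{st}$ denote normalized total wirelength, maximum IR-
drop, and maximum silicon temperature costs, respectively.
• Scenario 2: Minimize total wirelength and maximum silicon temperature under a maximum
IR-drop constraint.
$Cost_2 = \sqrt{Cost^*_{wl} \cdot Cost^*_{st}}$
Constraint: (maximum IR-drop) ≤ 10 mV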
In Table 10, the input factor settings and their responses are shown for three cases:
• baseline: The baseline settings and the responses from an actual experiment.
• DOE-predicted: The optimal input factor settings found by the optimization, and the
response values predicted by the response models.
• DOE-actual: The actual response values obtained by running the experiment with the same
optimal settings as DOE-predicted.
Comparing DOE-predicted with DOE-actual reveals the accuracy of the response models: all
the models are quite accurate (error < 1%). Comparing DOE-actual with baseline, DOE found
better solutions, with about 49% less Cost1 for Scenario 1 and about 1.8% less Cost2 for
Scenario 2. In Scenario 1, maximum T-TSV ratio decreased a little because this helped
reduce total wirelength without increasing maximum silicon temperature too much. P/G TSV
diameter and P/G thin wire ratio increased to their maximum values because this decreased
power noise and did not exacerbate the other metrics too much. In Scenario 2, maximum
T-TSV ratio reached its maximum to minimize maximum silicon temperature. P/G TSV diameter
and P/G thin wire ratio decreased from the baseline, because maximum IR-drop is not
minimized but only constrained under the target value. The maximum IR-drop value of
DOE-actual meets the constraint.
Table 10: Optimization results for Scenario 1 and 2 in T-TSV case.
Scenario 1
                         baseline     DOE-pred.    DOE-actual
Max. T-TSV ratio         0.1          0.094        0.094
P/G TSV diameter (µm)    10           15           15
P/G thin wire ratio      0.5          0.8          0.8
Total wirelength (µm)    14,675,180   14,736,797   14,741,100
Max. IR-drop (mV)        6.838        5.494        5.497
Max. Si temp. (°C)       119.169      118.940      119.585
Cost1                    0.174        0.071        0.089
Scenario 2
                         baseline     DOE-pred.    DOE-actual
Max. T-TSV ratio         0.1          0.2          0.2
P/G TSV diameter (µm)    10           6.766        6.766
P/G thin wire ratio      0.5          0.388        0.388
Total wirelength (µm)    14,675,180   14,764,854   14,767,680
Max. IR-drop (mV)        6.838        9.065        9.061
Max. Si temp. (°C)       119.169      118.330      118.817
Cost2                    0.169        0.076        0.166
3.4.5 Advanced DOE - MFC Case
For MFC case, 48 Latin Hypercube design points are generated, then 16 corner design
points are added manually. The input factor ranges are: 1) MFC width = [30, 85] (µm), 2)
MFC pressure drop = [30, 70] (kPa), 3) P/G TSV diameter = [5, 15] (µm), and 4) P/G thin
wire ratio = [0.2, 0.8]. Again, the input factor constraint defined by (5) is applied.
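The sampling plan described for the MFC case (48 Latin Hypercube points plus the 16 corners of the four-factor box) can be sketched as follows. This is a minimal illustration using only the standard library; the dictionary keys and function names are hypothetical, the random seed is arbitrary, and the input factor constraint (5) is not applied because its form is not given in this section.

```python
import itertools
import random

# Input factor ranges quoted in the text (keys are hypothetical names).
RANGES = {
    "mfc_width_um": (30.0, 85.0),
    "mfc_pressure_kpa": (30.0, 70.0),
    "pg_tsv_dia_um": (5.0, 15.0),
    "pg_thin_ratio": (0.2, 0.8),
}

def latin_hypercube(n, ranges, rng):
    """One sample per stratum in every factor, with strata shuffled per factor."""
    columns = []
    for low, high in ranges.values():
        strata = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(strata)
        columns.append([low + s * (high - low) for s in strata])
    return [dict(zip(ranges, point)) for point in zip(*columns)]

def corner_points(ranges):
    """All combinations of the low and high ends of each factor (2^k points)."""
    return [dict(zip(ranges, combo))
            for combo in itertools.product(*ranges.values())]

lhs_points = latin_hypercube(48, RANGES, random.Random(0))
corners = corner_points(RANGES)
design = lhs_points + corners
print(len(lhs_points), len(corners), len(design))
```

Adding the corners by hand makes sure that the extremes of the region of interest are sampled, which a Latin Hypercube alone does not guarantee.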
The best models for MFC case, chosen by the same scoring method as in T-TSV case, are
summarized in Table 11. PRESS RMSE is close to RMSE, meaning that the models are not
overfitted. Validation RMSE is not too far away from PRESS RMSE, except for maximum
silicon temperature model. Still, the validation RMSE of maximum silicon temperature
model is small compared with the average value.
Table 11: Summary of models for MFC case with advanced DOE. In the model type column,
the numbers in parentheses after Poly are the polynomial orders of (MFC width, MFC
pressure drop, P/G TSV diameter, P/G thin wire ratio / interaction), and the name in
parentheses after RBF is the RBF kernel type.
Response          Model type                         # parameters  Average     RMSE      PRESS RMSE  Validation RMSE  R2
Total wirelength  Poly(6,0,6,6/2)+stepwise           12            14,798,019  8344.520  9061.613    8086.943         0.995
Max. IR-drop      Poly(6,0,6,6/3)+stepwise           21            12.656      0.013     0.016       0.021            1.000
Max. Si temp.     Poly(7,7,0,0/4)+RBF(multiquadric)  30            35.314      0.036     0.045       0.103            1.000
Pump power        Poly(6,6,2,0/4)+stepwise           15            5.188       0.078     0.085       0.092            1.000
The response surfaces of all metrics are shown in Figure 11. Total wirelength is mostly
dependent on MFC width and P/G TSV diameter. Wider MFCs incur z-direction congestion
around them, so some signal nets have to take detours, which leads to longer wirelength.
Maximum IR-drop is mainly dependent on P/G TSV diameter and P/G thin wire ratio, while
maximum silicon temperature and pump power are mostly dependent on MFC width and MFC
pressure drop. Increasing MFC width lowers maximum silicon temperature throughout the
whole input factor range, whereas MFC pressure drop affects maximum silicon temperature
more when MFC width is small. That is, narrower MFCs are more sensitive to MFC pressure
drop. Wider MFCs with higher pressure drop require higher pump power.
Figure 11: Response surfaces for MFC case with advanced DOE.
Three optimization scenarios are considered for MFC case:
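• Scenario 1: Minimize total wirelength, maximum IR-drop, maximum silicon temperature,
and pump power.
$Cost_1 = \sqrt[4]{Cost^*_{wl} \cdot Cost^*_{ir} \cdot Cost^*_{st} \cdot Cost^*_{pp}}$
where $Cost^*_{wl}$, $Cost^*_{ir}$, $Cost^*_{st}$, and $Cost^*_{pp}$ denote normalized total wirelength,
maximum IR-drop, maximum silicon temperature, and pump power costs, respectively.
• Scenario 2: Minimize total wirelength, maximum IR-drop, and pump power under a maximum
silicon temperature constraint.
$Cost_2 = \sqrt[3]{Cost^*_{wl} \cdot Cost^*_{ir} \cdot Cost^*_{pp}}$
Constraint: (maximum silicon temperature) ≤ 40°C
• Scenario 3: Minimize total wirelength, IR drop, and maximum silicon temperature
under a pump power constraint.
$Cost_3 = \sqrt[3]{Cost^*_{wl} \cdot Cost^*_{ir} \cdot Cost^*_{st}}$
Constraint: (pump power) ≤ 10 mW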
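The scenario costs are root-of-product expressions, that is, geometric means of the normalized costs being minimized, optionally subject to a constraint on a response that is left out of the product. A minimal sketch of that selection rule is given below. The candidate design points and their values are hypothetical, and the normalization of each cost is assumed to have been done beforehand, since its definition is not given in this section.

```python
import math

def scenario_cost(normalized_costs):
    """Geometric mean of the normalized costs that a scenario minimizes."""
    return math.prod(normalized_costs) ** (1.0 / len(normalized_costs))

def best_feasible(candidates, cost_keys, constraint_key, limit):
    """Lowest-cost candidate among those that satisfy the constraint."""
    feasible = [c for c in candidates if c[constraint_key] <= limit]
    return min(feasible, key=lambda c: scenario_cost([c[k] for k in cost_keys]))

# Hypothetical design points: normalized costs plus raw pump power in mW.
candidates = [
    {"name": "A", "wl": 0.2, "ir": 0.1, "st": 0.4, "pump_mw": 12.0},
    {"name": "B", "wl": 0.3, "ir": 0.2, "st": 0.3, "pump_mw": 4.0},
    {"name": "C", "wl": 0.5, "ir": 0.4, "st": 0.2, "pump_mw": 2.0},
]

# Scenario 3 style: minimize wl, ir, and st with pump power at most 10 mW.
best = best_feasible(candidates, ["wl", "ir", "st"], "pump_mw", 10.0)
print(best["name"], scenario_cost([best[k] for k in ("wl", "ir", "st")]))
```

Point A has the lowest cost but violates the pump power limit, so the sketch returns the best of the remaining points. In the actual flow the candidates would come from the response models rather than from a fixed list.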
The input factor settings as well as the responses are shown in Table 12. Comparing
DOE-actual with baseline, DOE consistently found better solutions, with about 85% less
Cost1 for Scenario 1, about 85% less Cost2 for Scenario 2, and about 59% less Cost3 for
Scenario 3. The error between DOE-predicted and DOE-actual was less than 1% for all
responses except pump power, which had errors of around 5.2% and 3.7% in Scenarios 1 and
3, respectively. In Scenario 1, MFC width and MFC pressure drop were almost minimized,
while P/G TSV diameter and P/G thin wire ratio were almost maximized. This is because
smaller MFC width and lower MFC pressure drop lowered $Cost^*_{wl}$ and $Cost^*_{pp}$ at the
expense of increased $Cost^*_{st}$, which decreased Cost1 overall. Also, increasing P/G TSV
diameter and P/G thin wire ratio decreased maximum IR-drop without increasing total
wirelength too much. In Scenario 2, the optimal design point is similar to the one in
Scenario 1, except for MFC width, which was determined by the maximum silicon temperature
constraint. The maximum silicon temperature in DOE-actual is close to the constraint
value (40°C). In Scenario 3, compared with baseline, MFC pressure drop increased to its
maximum to minimize maximum silicon temperature, while MFC width decreased to almost the
minimum. The pump power of this solution is not close to the constraint, because the
optimum point did not occur along the pump power constraint boundary.
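The percentages quoted for the MFC case can be reproduced directly from the Table 12 entries. The short check below is only arithmetic on the published values; the function names are hypothetical, and the prediction error is taken relative to the predicted value, which is the convention that reproduces the quoted 5.2% and 3.7%.

```python
def percent_reduction(baseline, new):
    """Reduction of a cost relative to the baseline, in percent."""
    return 100.0 * (baseline - new) / baseline

def percent_error(predicted, actual):
    """Prediction error relative to the predicted value, in percent."""
    return 100.0 * abs(predicted - actual) / predicted

# (baseline cost, DOE-actual cost) per scenario, from Table 12.
costs = {"Scenario 1": (0.270, 0.040),
         "Scenario 2": (0.254, 0.039),
         "Scenario 3": (0.269, 0.111)}
for name, (base, actual) in costs.items():
    print(name, round(percent_reduction(base, actual), 1))

# (DOE-predicted, DOE-actual) pump power in mW, from Table 12.
pump = {"Scenario 1": (0.729, 0.691), "Scenario 3": (3.587, 3.453)}
for name, (pred, actual) in pump.items():
    print(name, round(percent_error(pred, actual), 1))
```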
Table 12: Optimization results for Scenario 1, 2, and 3 in MFC case.
Scenario 1
                           baseline     DOE-pred.    DOE-actual
MFC width (µm)             55           30.009       30.009
MFC pressure drop (kPa)    50           30.063       30.063
P/G TSV diameter (µm)      10           14.993       14.993
P/G thin wire ratio        0.5          0.78         0.78
Total wirelength (µm)      14,770,840   14,712,205   14,714,760
Max. IR-drop (mV)          9.965        6.908        6.950
Max. Si temp. (°C)         34.550       43.618       43.690
Pump power (mW)            4.849        0.729        0.691
Cost1                      0.270        0.048        0.040
Scenario 2
                           baseline     DOE-pred.    DOE-actual
MFC width (µm)             55           39.248       39.248
MFC pressure drop (kPa)    50           30.294       30.294
P/G TSV diameter (µm)      10           14.981       14.981
P/G thin wire ratio        0.5          0.781        0.781
Total wirelength (µm)      14,770,840   14,736,293   14,735,440
Max. IR-drop (mV)          9.965        6.930        6.954
Max. Si temp. (°C)         34.550       39.979       39.954
Pump power (mW)            4.849        1.071        1.066
Cost2                      0.254        0.037        0.039
Scenario 3
                           baseline     DOE-pred.    DOE-actual
MFC width (µm)             55           30.614       30.614
MFC pressure drop (kPa)    50           70           70
P/G TSV diameter (µm)      10           15           15
P/G thin wire ratio        0.5          0.778        0.778
Total wirelength (µm)      14,770,840   14,714,508   14,721,560
Max. IR-drop (mV)          9.965        6.905        6.952
Max. Si temp. (°C)         34.550       39.047       39.171
Pump power (mW)            4.849        3.587        3.453
Cost3                      0.269        0.096        0.111
3.5 Summary
In this chapter, a co-optimization study of signal, power, and thermal interconnects in 3D
ICs is presented. The effectiveness of design space exploration based on DOE and RSM was
demonstrated for 3D IC designs with T-TSVs and MFCs. Response surfaces covering the
entire ROI help find the optimal solution in a global scope. However, because the models
are not perfectly accurate, the global optimum is not guaranteed. It is important to find
accurate response models to obtain reliable optimization results. When some of the models
had rather high errors, the optimization led to suboptimal solutions that had higher costs
than some other design points did. Also, models with too high a polynomial order and too
many parameters usually suffered from overfitting.
For high performance circuits with high power density, the proposed optimization method
reveals that inserting T-TSVs might not solve the thermal problem effectively. On the other
hand, MFCs could bring down the die temperature to an acceptable level; however, a high
TSV aspect ratio is required to avoid a chip size increase.
CHAPTER IV
TIMING ANALYSIS AND OPTIMIZATION FOR 3D STACKED
MULTI-CORE MICROPROCESSORS
4.1 Introduction
As the complexity and cost of continuing Moore’s law in 2D ICs increase rapidly, 3D
ICs attract more and more attention from both academia and industry. To make 3D ICs
practical and profitable, much research has been done in various fields (material,
chemical, fabrication, integration [53], etc.), not to mention EDA. In the EDA field,
various algorithms for design steps such as circuit partitioning, placement [54], routing
[48], and timing optimization have been proposed, yet many of them neglected the impact of
through-silicon-vias (TSVs) on the physical layout. Depending on the fabrication
technology, TSVs can be so large that the aforementioned algorithms may not work as
intended.
Today’s 3D IC market is dominated by memory chips [55] and image sensor chips [56], which
are designed in a full-custom fashion. In the near future, many-core processors or
core-memory stacked 3D ICs are expected in the market. However, currently there is no
fully-integrated 3D IC design EDA software that can perform the full design flow from the
register transfer level code to the GDSII layout. Thus, if a complex digital system is to
be designed as a 3D IC, existing 2D EDA tools must be combined with custom tweaks to
handle 3D-specific issues. In this chapter, the design and optimization methodologies are
presented for a multi-core microprocessor in 3D ICs with varied design options, using
existing 2D EDA tools and several in-house tools. The impact of TSVs on the 3D designs
is shown in terms of chip area, wirelength, and performance. For this purpose, 3D analyses
are performed using existing 2D analysis tools with some modifications. In addition, two
timing constraint generation methods for the optimization of 3D stacked ICs are presented:
timing scaling and timing budgeting. The timing scaling method scales the input/output
delay timing constraints at each boundary point, whereas the timing budgeting method
distributes the timing slack of a path to each net on the path.
4.2 Target System
4.2.1 3D Structure
The target 3D structure of this project is illustrated in Figure 12. The overall stack structure
is shown on the left, where all four dies are bonded in a face-to-back fashion. It is assumed
that TSVs are via-first, which occupy the device layer and Metal 1 and 6 (M1 and M6). As
shown on the right side in the stack diagram, when a net spans more than two dies, it is
routed through TSVs as well as local vias. Note that there is no TSV on Die 3. The top
view of a TSV is shown on the right. A TSV pin pad (PP) on M1 or a landing pad (LP) on
M6 occupies two standard cell rows (denoted by the dotted lines), which is not negligible
Figure 12: Target 3D structure of this chapter. (a) Dies are flipped over and facing down.
TSV pin pad (PP) and landing pad (LP) are shown. (b) The TSV occupies two standard
cell rows. Unit is µm.
Depending on the TSV dimensions, the capacitance (CTSV) and the resistance (RTSV)
of TSVs vary. Since the timing through these TSVs depends on the parasitic values,
the values are varied to see their impact on timing. TSV resistance consists of ohmic
resistance and contact resistance, with the contact resistance being dominant. In this
chapter, CTSV = 25 fF and RTSV = 1 Ω are used for experiments.
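To get a feel for these values, a back-of-the-envelope estimate is sketched below. It treats the TSV as a lumped RC and uses the standard 0.69·RC estimate for the 50% delay of a single-pole RC stage. The driver output resistance is a hypothetical value chosen only for illustration; it does not come from the target library.

```python
import math

C_TSV = 25e-15      # TSV capacitance used in this chapter, in farads
R_TSV = 1.0         # TSV resistance used in this chapter, in ohms
R_DRIVER = 1.0e3    # hypothetical driver output resistance, in ohms

# Time constant of the TSV by itself, in picoseconds.
tsv_rc_ps = R_TSV * C_TSV * 1e12

# Extra 50% delay seen by a gate that has to charge the TSV capacitance.
added_delay_ps = math.log(2) * R_DRIVER * C_TSV * 1e12

print(tsv_rc_ps, added_delay_ps)
```

Under these assumptions the TSV resistance contributes very little, and the delay impact comes mainly from the TSV capacitance loading the driving gate.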
4.2.2 Architecture
As the target design, the LEON3 processor [57] is used, which is a 32-bit processor compliant
with the SPARC V8 architecture. It contains an advanced 7-stage pipeline with a hardware
multiplier and divider. The LEON3 design also has configurable caches and local instruction
and data scratch memories. It is configurable as a multi-core processor on AMBA bus. For
this project, a quad-core processor is configured with a single configuration for all the cores,
which is described in Table 13.
Table 13: Architecture configuration of the LEON3 design.
Instruction cache 16 KB, 2 way
Data cache 16 KB, 2 way
Register file 32 32-bit registers, 8 windows
Multiplier 32 x 32bits
Divider iterative
The synthesis results are summarized in Table 14. Synopsys Design Compiler is used
with the physical libraries for the target technology. Memory macro blocks were generated
using a memory compiler for the target technology. The original HDL source code was mod-
ified to include the memory blocks. The synthesized circuit met the timing goal, excluding
interconnect delay.
Table 14: Summary of the synthesis of the quad-core LEON3 design.
Technology 130nm
# memory blocks 44
# standard cells 82,461
# nets 87,451
Average fanout 2.46
Total cell area (µm²) 6,101,542
Target clock period (ns) 3.333
Slack (ns) 0
The memory macro blocks in the core are summarized in Table 15. A total of 11 memory
macro blocks are used per core, listed here in decreasing size: two banks each for the
instruction and data caches, two dual-port memory blocks for a three-port register file,
two banks each for the instruction and data cache tags, and an address translation table.
Since these macro blocks are large, their locations and orientations strongly affect the
overall design quality. Thus, they should be placed carefully, considering their
connections to other parts.
Table 15: Summary of the memory macro blocks.
Name                   Capacity (words x bits)  Dimension (µm)   I/O pins
Instr./Data cache      2048x32                  427.205x544.295  78
Register file          256x32                   401.29x193.265   84
Instr./Data cache tag  256x40                   269.035x178.805  91
Address translation    32x32                    131.915x108.805  72
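The block counts and capacities in Tables 13 to 15 can be cross-checked with a few lines of arithmetic. The sketch below reads each capacity entry as words x bits, which is an interpretation of the table rather than something stated in the text; the tuple layout and names are hypothetical.

```python
# (name, blocks per core, words, bits per word), following the text and Table 15.
BLOCKS = [
    ("Instr. cache bank", 2, 2048, 32),
    ("Data cache bank", 2, 2048, 32),
    ("Register file block", 2, 256, 32),
    ("Instr. cache tag bank", 2, 256, 40),
    ("Data cache tag bank", 2, 256, 40),
    ("Address translation table", 1, 32, 32),
]
NUM_CORES = 4

blocks_per_core = sum(count for _, count, _, _ in BLOCKS)
total_blocks = blocks_per_core * NUM_CORES

# Capacity of one cache (two banks) in kilobytes.
_, banks, words, bits = BLOCKS[0]
cache_kbytes = banks * words * bits / 8 / 1024

print(blocks_per_core, total_blocks, cache_kbytes)
```

With this reading, the 11 blocks per core add up to the 44 memory blocks of Table 14, and two 2048x32 banks give the 16 KB cache size of Table 13.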
4.3 Design Options
There are three 3D partitioning options in addition to the traditional 2D. The four design
options are shown in Figure 13.
Figure 13: Four design options: (a) 2D, (b) 3D core, (c) 3D block, and (d) 3D gate. Blocks
highlighted in orange denote Core 0. inst $ and data $ denote instruction and data cache,
while RF and TLB represent register file and address translation buffer.
• 2D (2D): This is the traditional design on a single die. The bus controller in the
center connects the cores. No TSV exists in this design option. For the 2D IC design
style, a conventional 2D design flow is applied. Starting with the synthesized netlist,
floorplanning, placement, routing, and timing optimization are performed. The clock
tree is out of scope; an ideal clock is assumed. In the floorplanning step, the whole
chip area is divided into four core regions and a bus region, and the memory macro
block locations are decided inside the core regions. An identical macro block placement
is used for all cores, with proper rotations to face the cores towards the bus. The
AMBA bus controller logic is placed in the center region.
• 3D core-level partition (3D-core): One core is placed on each die, and the bus
controller is placed on Die 1. TSVs are used to connect the cores on Die 0/2/3 to the
bus controller. A minimal number of TSVs is used. The main idea of this design option
is to reuse the existing 2D core design and expand it in 3D with minimum effort. The
same macro block placement per core as in the 2D case is used. The bus controller
logic on Die 1 connects to the core on Die 1 as well as to the cores on Die 0, 2, and
3 using TSVs. All the TSVs are manually placed outside the core region, in a clustered
fashion.
• 3D block-level partition (3D-block): The processor core is partitioned in a
core+memory style and stacked. A single core is divided into logic cells and memory
blocks. All the logic cells and the bus controller are placed on Die 1, while the
memory blocks are placed on Die 0, 2, and 3. A moderate number of TSVs is used to
connect the memory blocks to the core logic. The reason for placing the logic cells on
Die 1 is to minimize the total distance to the memory blocks. Note that the ordering of
the logic cell die and the memory die is important, because the relationship is
asymmetric. For instance, if the memory blocks are on Die 0 and the logic cells are on
Die 1, the TSVs that connect the memory blocks to the logic cells are on Die 0, which
means that the active device space on Die 0 is consumed by TSVs while no active device
space is needed on Die 1. On the other hand, if the logic cells are on Die 0 and the
memory blocks are on Die 1, the TSVs consume the active device space on the logic cell
die. Thus, with the configuration of this project, TSV connections to the memory blocks
on Die 2 and 3 occupy the active device space on the logic cell die (Die 1). Due to the
shape of the biggest memory blocks (instruction and data cache banks), the core region
is rectangular. All TSVs are manually placed around the pins of the memory blocks. Since
the pin pitch of the memory blocks is smaller than the minimum pitch of the TSVs, the
TSVs are placed in multiple rows to reduce the distances between memory pins and TSVs.
Since connections between the core logic and the memory blocks on Die 3 have to go
through Die 2, pass-through TSVs are placed on Die 2, avoiding contact with the memory
blocks on Die 2. Four of these four-die stacks are put together on the x-y plane to
form the quad-core processor.
• 3D gate-level partition (3D-gate): The whole circuit is partitioned into four parts
and mapped onto four dies. The memory blocks are also distributed over the four dies.
This design uses the largest number of TSVs. First, the input netlist is partitioned
into four parts. The memory blocks in the netlist are very large compared with the
standard cells; therefore, they are processed first. For each core, each die has a bank
of either the instruction cache or the data cache, and its cache tag. In addition, Die 0
has the address translation table, whereas Die 1 and 2 have a bank of the register file
each. The locations of these memory blocks are manually determined considering pin
locations as well as net connectivity. Then, the standard cells are placed in 3D by the
recursive partitioning technique [25].
4.4 3D Timing Analysis and Optimization
4.4.1 3D Static Timing Analysis
The 3D static timing analysis (STA) is performed using Synopsys PrimeTime. First, the
Verilog netlist files of all four dies and the SPEF files containing extracted parasitic values
for all the nets of the dies are prepared. Then, a top-level Verilog netlist is created that
instantiates each die design and connects the 3D nets using TSV connections. In addition,
a top-level SPEF file is created that contains parasitic models of the TSVs. After that,
PrimeTime is run to obtain the 3D timing analysis results. The worst negative slack (WNS)
and the total negative slack (TNS) are reported to demonstrate the timing quality of the
design. Meanwhile, timing constraints are generated from the timing analysis results to
perform 3D timing optimizations.
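The top-level netlist step can be sketched as a small generator script. The example below is only an illustration of the idea: the module names, port names, and net names are hypothetical, primary I/Os and the clock are omitted, and the real flow would also have to emit the matching top-level SPEF with the TSV parasitics.

```python
def make_top_netlist(top_name, die_modules, tsv_nets):
    """Emit a Verilog wrapper that instantiates each die and wires the TSV nets.

    die_modules: module name of each die, indexed by die number.
    tsv_nets: {net_name: [(die_index, port_name), ...]} for every 3D net.
    """
    lines = [f"module {top_name} ();"]
    for net in tsv_nets:
        lines.append(f"  wire {net};")
    for die, module in enumerate(die_modules):
        conns = [f".{port}({net})"
                 for net, pins in tsv_nets.items()
                 for pin_die, port in pins if pin_die == die]
        lines.append(f"  {module} u_die{die} ({', '.join(conns)});")
    lines.append("endmodule")
    return "\n".join(lines)

# Hypothetical 3D nets: each connects a port on one die to a port on its neighbor.
tsv_nets = {
    "tsv_n0": [(0, "lp_0"), (1, "pp_0")],
    "tsv_n1": [(1, "lp_7"), (2, "pp_7")],
}
netlist = make_top_netlist("top_3d", ["die0", "die1", "die2", "die3"], tsv_nets)
print(netlist)
```

The resulting wrapper is what lets a 2D timing tool see the four dies as one hierarchical design.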
4.4.2 3D Timing Optimization
Considering that each die design is a subdesign of the entire design, 3D IC designs are
essentially hierarchical. Thus, the 3D timing optimization is performed in a hierarchical
manner. Compared with a non-hierarchical design flow, in a hierarchical design flow the
timing constraints on the boundary are important because they are the key information
that the timing optimization engine uses. The timing optimization of each die is performed
with timing constraints on the die boundaries (TSV PP and LP ports). Two methods are
demonstrated to generate the timing constraints: timing scaling and timing budgeting.
The timing scaling method scales the input/output delay timing constraints at each
boundary point. Consider a 3D path from a source F/F in a die through a die boundary
to a sink F/F in a neighboring die. After the 3D timing analysis is done, the longest path
delay from the source to the sink ($T_{LPD}$) as well as the delay up to the die boundary
($T_{boundary}$) are obtained. To achieve the target clock period $T_{CLK}$, ideally $T_{LPD}$
should be the same as $T_{CLK}$. Thus, the scaling factor is set as $SF = T_{CLK}/T_{LPD}$. Then,
the scaled boundary constraints are calculated as follows:
$T_{boundary,scaled} = T_{boundary} \times SF$
The updated timing constraint file is used in the timing optimization. By this method,
all the 3D paths are constrained so as to meet the target clock period. This method is
implemented in PrimeTime Tcl and Perl.
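The timing scaling rule can be written in a few lines. The sketch below is a minimal illustration in Python rather than the PrimeTime Tcl and Perl implementation mentioned above; the boundary port names and delay values are hypothetical.

```python
def scale_boundary_constraints(t_clk, boundary_paths):
    """Scale each boundary delay by SF = T_CLK / T_LPD of its longest path.

    boundary_paths: {port_name: (t_lpd, t_boundary)} in the same time unit as t_clk.
    """
    scaled = {}
    for port, (t_lpd, t_boundary) in boundary_paths.items():
        sf = t_clk / t_lpd               # scaling factor for this 3D path
        scaled[port] = t_boundary * sf   # scaled boundary constraint
    return scaled

# Hypothetical 3D paths crossing two TSV ports, delays in ns.
paths = {
    "tsv_pp_12": (4.000, 2.000),   # violating path: SF < 1 tightens the budget
    "tsv_lp_40": (3.000, 1.200),   # path with slack: SF > 1 relaxes the budget
}
constraints = scale_boundary_constraints(3.333, paths)
print(constraints)
```

Scaling keeps the ratio between the two sides of the boundary as measured by the last timing analysis, so a die that currently consumes more of the path delay also receives more of the clock period.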
Timing budgeting [58] distributes the timing slack of a path to each net on the path.
This method analyzes the timing graph of the entire circuit to find out where the critical
paths are. Nets on non-critical paths can be given a positive timing budget, which can be
used for other circuit optimizations such as area and power minimization. On the other
hand, nets on critical paths are given negative timing budgets, which means that the
delays of these nets should be reduced by timing optimization. Synopsys Design Compiler is
used to perform timing budgeting.
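The idea of handing a path's slack to its nets can be illustrated on a single path. The sketch below distributes the slack in proportion to each net's current delay; this proportional rule is an assumption made for illustration only and is not necessarily what [58] or Design Compiler does, and the net names and delays are hypothetical.

```python
def budget_path(net_delays, cell_delay, t_clk):
    """Split the slack of one path among its nets, proportionally to net delay.

    net_delays: {net_name: delay}; cell_delay: total gate delay on the path.
    Returns {net_name: budget}; budgets are negative on a violating path.
    """
    total_net_delay = sum(net_delays.values())
    slack = t_clk - (total_net_delay + cell_delay)
    return {net: slack * delay / total_net_delay
            for net, delay in net_delays.items()}

nets = {"n1": 0.5, "n2": 1.0, "n3": 0.5}   # hypothetical net delays in ns

relaxed = budget_path(nets, cell_delay=1.0, t_clk=3.333)    # path meets timing
critical = budget_path(nets, cell_delay=2.0, t_clk=3.333)   # path violates timing
print(relaxed)
print(critical)
```

A full budgeting pass works on the whole timing graph, since a net usually lies on many paths and must receive a budget that is safe for all of them.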
The overall design flow is shown in Figure 14. With the generated timing constraints,
timing optimization is performed by Encounter. The optimization loop is iterated several
times.
Figure 14: Design flow with timing scaling and timing budgeting. Both flows begin with
the same steps: perform initial placement and circuit extraction; make top level netlist
and TSV model; initial 3D STA. The following steps are then iterated.
(a) With timing scaling: calculate scaling factor; generate timing constraints; timing
optimization per die; circuit extraction; 3D STA.
(b) With timing budgeting: run timing budgeting; generate timing constraints; timing
optimization per die; circuit extraction; 3D STA.
4.5 Experimental Results
Experimental settings of this chapter are shown in Table 16. All 3D cases use four dies. The
target clock period was set to 3.333 ns. For the 2D case, the chip area was chosen so that
the initial utilization is around 80%, whereas for the 3D cases the chip area was expanded
considering the TSV impact.
Table 16: Experimental settings of this chapter.
                   2D         3D-core    3D-block   3D-gate
Die size (µm)      2900x2600  1500x1300  1709x1151  1500x1400
Total area (µm2)   7.54E6     7.80E6     7.87E6     8.40E6
Footprint (µm2)    7.54E6     1.95E6     1.97E6     2.10E6
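The area rows of Table 16 follow from the die dimensions and the die count, as the short check below shows. It only restates the table; the dictionary layout is hypothetical.

```python
# Design option -> (die width in um, die height in um, number of dies).
DIES = {
    "2D": (2900, 2600, 1),
    "3D-core": (1500, 1300, 4),
    "3D-block": (1709, 1151, 4),
    "3D-gate": (1500, 1400, 4),
}

areas = {}
for name, (width, height, num_dies) in DIES.items():
    footprint = width * height       # area of one die, in um^2
    total = footprint * num_dies     # silicon area summed over all dies
    areas[name] = (footprint, total)
    print(name, f"{total / 1e6:.2f}E6", f"{footprint / 1e6:.2f}E6")
```

The 3D options reduce the footprint to roughly a quarter of the 2D die, while the total silicon area grows with the number of TSVs that must be accommodated.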
4.5.1 Initial Design Results
The Cadence Encounter screen shots of the top dies (Die 0) for the four partition styles
are shown in Figure 15. The zoom-in shots of the GDSII images in Cadence Virtuoso are shown
in Figure 16. The initial design results of the design options before timing optimization
are shown in Table 17. Due to the pre-place optimization, the total number of placed
instances differs for each design. In the 3D-gate case, the utilization is set to a lower
value than in the other cases, because when the utilization was set at the same level, the
design had severe congestion and too high a utilization after timing optimization. The
number of TSVs is the smallest in the 3D-core case, while 3D-block uses about 9.6 times as
many TSVs as 3D-core. The 3D-gate case uses about 82% more TSVs than the 3D-block case. As
the designs use more TSVs, the total wirelength decreases. In particular, the total
wirelength in the 3D-gate case is 22.6% shorter than in the 2D case. However, shorter total
wirelengths do not always
Figure 15: Top-die layouts of the four partition styles. The relative sizes of layouts are
preserved.
Figure 16: Screen shots of the GDSII images in Cadence Virtuoso. Left: TSVs and gates.
Right: routing to TSVs.
Table 17: Initial layout results for the design options. Utilization means area utilization
including standard cells and memory blocks, and wirelength means total wirelength.
Design    Die    # instances  Utilization (%)  # TSVs  Wirelength (m)
2D        Die 0  84,562       80.97            0       6.405
3D-core   Die 0  21,093       78.15            112     1.476
          Die 1  24,368       79.85            222     1.587
          Die 2  21,672       78.39            111     1.462
          Die 3  21,048       78.15            0       1.483
          Total  88,181       78.64            445     6.008
3D-block  Die 0  8            94.61            624     0.027
          Die 1  86,618       48.46            3,040   5.087
          Die 2  28           73.58            624     0.074
          Die 3  8            94.61            0       0.027
          Total  86,662       77.81            4,288   5.215
3D-gate   Die 0  26,179       70.29            2,345   1.352
          Die 1  18,630       78.32            2,388   0.963
          Die 2  20,711       78.11            3,089   0.981
          Die 3  28,271       67.62            0       1.663
          Total  93,791       73.59            7,822   4.959
The wirelength distributions of the design options are shown in Figure 17. Compared
with 2D, 3D-core contains fewer nets with wirelengths around 1,000 µm because the distances
between the cores and the bus controller have been reduced by TSVs. Compared with 3D-core,
3D-block has more nets with very short wirelengths (< 4 µm), yet there are still
several nets with long wirelengths. Compared with the other cases, the overall distribution
of 3D-gate is shifted towards the left, and there is no net with a very long wirelength.
The nets in the 3D-gate case generally have shorter wirelengths than in the other cases.
4.5.2 Timing Optimization
The changes of WNS values for 3D-core, 3D-block, and 3D-gate during the timing opti-
mization iterations are shown in Figure 18. The biggest reduction of WNS was observed in
the first optimization. In 3D-core case, WNS converged fast after the first iteration, while
in 3D-block and 3D-gate cases WNS kept decreasing during the four iterations.
Figure 17: Wirelength distribution of design options before timing optimization: (a) 2D,
(b) 3D-core, (c) 3D-block, and (d) 3D-gate. The x-axis is wirelength in µm and the y-axis
is net count.
The results of the 3D timing optimization are shown in Table 18. Comparing the results
with those in Table 17, the total wirelengths increased by 2.4%, 12.9%, and 47.7% in the
3D-core, 3D-block, and 3D-gate cases with timing budgeting. Compared with the increase of
1.6% in the 2D case, the wirelength increase is severe in the 3D-gate case. The utilization
values also increased after the optimization, because of the gate sizing and buffer
insertion performed by the optimization engine to meet the timing goal. From the WNS
values, the 3D-core design with timing budgeting has about 13% less negative slack than the
2D design (-0.310 ns versus -0.357 ns). The other designs are slower than 2D, especially
the 3D-gate case. In terms of TNS, 3D-core is better than 3D-block. In the 3D-gate case,
although the average utilization was around 80%, the designs had very densely packed
placement regions around the center, which prevented further timing optimizations. In
sum, 3D-core with timing budgeting resulted in the best quality circuit in terms of timing.
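The WNS values in Table 18 can be related to operating speed with a short calculation. The sketch below assumes the common relation that the achievable clock period equals the target period minus the WNS; this relation is not stated in the text, and the variable names are hypothetical.

```python
T_CLK = 3.333  # target clock period in ns

# WNS in ns from Table 18 (timing budgeting results for the 3D options).
WNS = {"2D": -0.357, "3D-core": -0.310, "3D-block": -0.543, "3D-gate": -1.884}

def achievable_period(wns, t_clk=T_CLK):
    """Smallest clock period at which the worst path would just meet timing."""
    return t_clk - wns

periods = {name: achievable_period(wns) for name, wns in WNS.items()}

# Relative reduction of WNS magnitude of 3D-core versus 2D, in percent.
wns_gain = 100.0 * (abs(WNS["2D"]) - abs(WNS["3D-core"])) / abs(WNS["2D"])
# Relative clock frequency gain of 3D-core versus 2D, in percent.
freq_gain = 100.0 * (periods["2D"] / periods["3D-core"] - 1.0)

print(periods)
print(round(wns_gain, 1), round(freq_gain, 1))
```

Under this assumption, the gap of about 13% between 2D and 3D-core applies to the WNS magnitude, while the corresponding difference in achievable clock frequency is close to 1.3%.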
4.5.3 Impact of TSV parasitics
To investigate the impact of TSV parasitics on timing, the 3D designs are optimized with
Figure 18: WNS values for 3D-core, 3D-block, and 3D-gate cases with timing budgeting.
Table 18: Timing optimization results of LEON3.
                    2D      3D-core             3D-block            3D-gate
                            scaling  budgeting  scaling  budgeting  scaling  budgeting
# inserted buffers  9,177   8,758    8,516      11,310   11,223     13,557   13,528
Utilization (%)     85.59   83.22    83.14      82.88    82.98      80.07    79.56
Wirelength (m)      6.51    6.176    6.15       5.728    5.89       7.346    7.323
WNS (ns)            -0.357  -0.478   -0.310     -0.659   -0.543     -1.694   -1.884
Figure 19: The impact of TSV parasitics on various metrics. CTSV = 0fF means ignoring
the parasitics of TSVs. Timing budgeting was used for optimization.
the total wirelength slightly increased with higher CTSV. This is because the optimization
engine tends to insert more buffers and upsize gates with higher CTSV. The WNS of
3D-block was lower when CTSV is 25 fF than when CTSV is 0 fF. Checking the timing
constraints showed that when CTSV is 0 fF, the timing constraints were not tight enough,
so the optimization engine did not perform enough optimization. The WNS and the TNS of
3D-gate degraded quickly with increased CTSV. That is because the TSV count is rather high
in the 3D-gate case, so more timing paths are affected by increased CTSV. The TNS of
3D-block also degrades with higher CTSV. In contrast, the 3D-core case was not much
affected by CTSV variation. When the TSV count is high, the overall timing quality is more
likely to be affected by TSV parasitics.
4.5.4 Sub-Optimality in 3D IC Design
(a) Die 0 (b) Die 1
Figure 20: Layout snapshots of dies for 3D-gate, with timing critical path highlighted in
white. Numbers in bright yellow represent the path sequence. Small blue squares are TSV
PPs on M1, and orange squares are TSV LPs on M6.
The timing critical path after the timing optimization for the 3D-gate design is shown in
Figure 20. The path starts on Die 2, goes down to Die 1, comes back to Die 2, goes down
deeper to Die 1 and 0 and comes back to Die 1 and 2, then goes up to Die 3 and comes
back to Die 2, goes down to Die 1 and comes back to Die 2, and finally goes up to Die 3,
where the path ends. This path snakes through the entire stack, involving many TSVs.
Looking at the path from (1) to (8), it is observed that the path goes back and forth
through the dies. The path from (8) to (10) as well as the path from (22) to (24) could be
shorter if the gates at (9) and (23) were placed closer to (8) and (22), although this may
affect other nets that are connected to this path. Since this path is the critical path,
these gates and TSVs could be moved to make the entire critical path shorter. The delay of
the entire path may also be reduced by making the path span fewer dies and use fewer
TSVs. This sub-optimal design demonstrates the need for a real 3D-aware placer and
optimizer.
4.6 Summary
In this chapter, the timing analysis and optimization of a quad-core microprocessor in 3D
ICs were presented. Three different partition styles for 3D ICs were explored at the layout
level, and the timing results were analyzed. Current commercial 2D EDA tools cannot fully
utilize the benefits of 3D. The timing optimization did not lead to an optimal design,
because the partitioner and placer were not 3D-timing-aware, and the optimization could
not be aggressive enough. In addition, the placement of gates and TSVs should be
3D-timing-aware. These shortcomings could get worse when more dies are stacked together;
thus, true 3D EDA tool development is required to enable a higher level of integration.
TSV parasitics affected the overall quality of the design in terms of utilization, wire-
length, and timing. With high TSV parasitics, it is better not to use too many TSVs,
because of buffering cost and timing degradation by TSVs. Furthermore, the target circuit
size also correlates to the benefit of 3D IC, because the relative size of the capacitance of
a TSV and a metal wire matters. In large circuits, 3D ICs may lead to better chances
of reducing the delay along the timing paths with long wires. Conversely, with small and
simple circuits, the chance of improving design quality with 3D ICs would be low.
CHAPTER V
SLEW-AWARE BUFFER INSERTION FOR
THROUGH-SILICON-VIA-BASED 3D ICS
5.1 Introduction
For high-performance 3D ICs, it is crucial to perform thorough timing optimization,
especially when the 3D nets are on timing-critical paths. Among timing optimization
techniques, buffer insertion is known to be the most effective. However, there is currently
no commercial design software that performs buffer insertion on multiple dies
simultaneously. Through-silicon vias (TSVs) have a large parasitic capacitance that increases
the signal slew and the delay downstream. Even for 2D ICs, today’s advanced technology
nodes experience high slew degradation along nets, which in turn increases gate delay.
The pioneering work of van Ginneken [20] adopted dynamic programming for buffer insertion
(VGDP). VGDP has been used in slew buffering [59], which fixes slew design rule violations
but does not optimize timing. In this chapter, the bottom-up slew propagation DP (SPDP),
a modified version of VGDP, is proposed to perform delay optimization with the
consideration of slew for TSV-based 3D ICs. By considering slew in the DP framework, the
algorithm achieves lower buffered delay compared with the original VGDP.
There is a common belief in the 3D IC area that timing optimization can be handled with
existing 2D EDA tools, with minor modifications for TSV handling. However, since a
2D EDA tool handles each die separately, it cannot consider the whole 3D path, and the timing
optimization quality is worse than that of true 3D buffer insertion methods. From layout
experiments, the impact of slew caused by TSVs in 3D nets on gate and net delays is
demonstrated. With a buffered 3D net, the severity of TSV-induced slew degradation and
the improvement ideas on the buffer solutions are discussed. Then, a reasonably accurate
slew model is incorporated into the van Ginneken DP framework for delay minimization.
In addition, a slew binning idea is proposed to explicitly and efficiently consider
slew-aware delay during solution search. Using the slew information, several efficient
pruning rules are also proposed, which limit the search space and reduce runtime. For various
2D and 3D nets as well as full-chip designs in detailed layout experiments, the buffer
insertion solutions of the proposed SPDP are compared with VGDP and the
timing-constraint-based 2D buffer insertion using commercial EDA software. With full-chip
3D IC designs, it is
demonstrated how much timing could be improved if 3D buffer insertion is applied instead

Figure 21: (a) Side view of the 3D IC, (b) top view of a TSV, and (c) TSV RC model.
TSV PP (M1) and TSV LP (M8) represent TSV pin pad on metal1 and TSV landing pad on
metal8, respectively. Dashed lines in (b) denote standard cell row boundaries. Dimensions
are in µm.
In this study, as shown in Figure 21(a), it is assumed that four dies are stacked, and TSVs
go through Die 0, 1, and 2. The TSV macro of this project occupies six standard cell rows,
as shown in Figure 21(b). Because of reliability issues and performance variation, gates
and buffers should be placed outside the TSV keep-out zone. The TSVs have large parasitics
that affect timing. Each TSV has a parasitic capacitance (CTSV ) and a resistance (RTSV ),
and is represented by a π-model with two capacitors and a resistor, as shown in Figure 21(c).
Based on physical assumptions such as the TSV liner thickness and doping concentration,
CTSV and RTSV are calculated using the formula in [60] and are listed in Table 22. The
inductance of the TSV is ignored because it is not dominant at signal frequencies below a
few GHz. It is assumed that the unit-length capacitance and resistance of net wires are Cm
and Rm. Because of the TSVs, 3D nets no longer have uniform RC characteristics, which needs to

Figure 22: A motivational example. Numbers shown in blue represent the distance from
source gate in µm. (a) target 3D net, and buffer insertion solutions with (b) VGDP, (c)
SPDP, and (d) timing-constraint-based 2D optimization by Cadence Encounter.
In my 3D IC design experience, it is observed that the slew degradation due to the
TSVs is quite severe, even after buffer insertion. As a motivational example, a 3D net
with two TSVs is shown in Figure 22(a). The source gate is NAND2_X1 and the sink1/2
gates are AOI21_X1. The buffer insertion is performed using (a 3D-aware version of) VGDP,
SPDP, and timing-constraint-based 2D optimization by Cadence Encounter. For simplicity
of demonstration, it is assumed that the buffer library consists of a single buffer (BUF_X4)
and a single inverter (INV_X4), the input slew (Si) at the source gate is 40ps, and the load
capacitance at sink1/2 is 20fF. The delay, slew, and arrival time (AT) values in Figure 22
are obtained by layout and PrimeTime 3D static timing analysis (STA). VGDP places
one buffer right after the source (for boosting) and another at 750µm. Because of the large
TSV capacitance and the long wirelength driven by buf1, the slew degradation from buf1 to
buf2 is large; thus the Si at buf2 is quite high, which increases the delay of buf2. Also, the
Si at the critical sink gate is large. Note that the AT values at sink1 and sink2 are almost
the same; the delay difference caused by the TSV before sink1/2 is very small because RTSV
is very small.
Since the proposed SPDP considers slew during DP, inv2 is placed much closer to TSV1 than
buf1 is in the VGDP solution, and inv3 is also closer to TSV1 than buf2 is in the VGDP
solution. As a result, the Si at inv3 is only 35ps, which reduces the delay of inv3. Also, the
Si at the critical sink is lower with SPDP, which reduces the sink gate delay. This lower slew
is especially helpful because the sink gate delay is sensitive to Si. Comparing the AT values
at the sink1 output, it is observed that SPDP achieves a 4.3% delay reduction compared with
VGDP. The timing-constraint-based 2D buffer insertion with commercial design software
does not produce a good result. It inserted buffers, which usually have higher intrinsic delay
than inverters. Moreover, during Die 1 optimization the tool does not know exactly where
buf2 is, thus buf3 is placed needlessly close to TSV1. Even though the Si at the
critical sink is the lowest, the AT is the worst among the three buffer insertion solutions.
This clearly demonstrates why timing-constraint-based 2D buffer insertion is not thorough
enough for timing-critical nets.
5.2.3 Delay and Slew Models
The linear gate delay model has been extensively used in timing optimization works. Given the
lumped load capacitance (CL) at the output pin of the gate g, the linear gate delay (Dg,lin)
and the output slew (So,lin) are expressed as follows:
\[ D_{g,lin} = K_g + R_g \cdot C_L, \qquad S_{o,lin} = SK_g + SR_g \cdot C_L \quad (6) \]
where Kg and Rg are intrinsic delay and output resistance of gate g, and SKg and SRg are
intrinsic slew and slew resistance.
As discussed in [23], the linear gate delay model is inaccurate because 1) owing to
resistive shielding [61], the lumped load capacitance is an overestimate of the effective
capacitance [27] seen at the gate output, and 2) gate delay is not a linear function of
load capacitance. The first problem can be solved by adopting the effective capacitance
model, while the second one is handled by a k-factor-equation-based model [27]. In the
effective capacitance calculation, the RC network is reduced to a π-model (Cn, Rπ, Cf ), in
which Rπ models the resistive shielding effect. Then, the effective capacitance (Cef ) at the
gate output is computed as in [27]. Using the effective capacitance model is essential for
3D IC buffer insertion because TSVs have high capacitance, which causes the lumped
capacitance (CL) to significantly overestimate the gate delay, which would discourage buffer
usage.
The original k-factor equations for gate delay and output slew are:
\[ D_{g,orgk} = (k_{d1} + k_{d2} C_{ef})\, S_i + k_{d3} C_{ef}^{3} + k_{d4} C_{ef} + k_{d5} \quad (7) \]
\[ S_{o,orgk} = (k_{s1} + k_{s2} C_{L})\, S_i + k_{s3} C_{L}^{2} + k_{s4} C_{L} + k_{s5} \quad (8) \]
where Si is the input slew at the gate g, and kd1–kd5 and ks1–ks5 are curve-fitting
parameters. Note that the parameter values differ for each signal sense (rise/fall). In
addition, Cg is the input capacitance of the gate. Also note that CL is used for the So
calculation, because Cef tends to underestimate So [27]. The library defines the maximum
allowed CL and Si for each gate.
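The original k-factor models of Equations (7) and (8) can be evaluated directly. The following is a minimal Python sketch, not the implementation used in this project; the class name `KFactors`, the function names, and all coefficient values are hypothetical placeholders rather than characterized library data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KFactors:
    """Five curve-fitting parameters (k1..k5) of one original k-factor equation."""
    k1: float
    k2: float
    k3: float
    k4: float
    k5: float


def gate_delay_orgk(kd: KFactors, c_ef: float, s_i: float) -> float:
    """Equation (7): gate delay from effective capacitance and input slew."""
    return (kd.k1 + kd.k2 * c_ef) * s_i + kd.k3 * c_ef ** 3 + kd.k4 * c_ef + kd.k5


def output_slew_orgk(ks: KFactors, c_l: float, s_i: float) -> float:
    """Equation (8): output slew from lumped load capacitance and input slew."""
    return (ks.k1 + ks.k2 * c_l) * s_i + ks.k3 * c_l ** 2 + ks.k4 * c_l + ks.k5


# Hypothetical coefficients, chosen only to make the sketch runnable.
KD_EXAMPLE = KFactors(0.2, 0.001, 1e-6, 0.5, 10.0)
KS_EXAMPLE = KFactors(0.1, 0.002, 1e-4, 0.8, 5.0)

if __name__ == "__main__":
    print(gate_delay_orgk(KD_EXAMPLE, c_ef=20.0, s_i=40.0))
    print(output_slew_orgk(KS_EXAMPLE, c_l=20.0, s_i=40.0))
```

Both functions are linear in Si for a fixed capacitance, which is the limitation of the original equations discussed next.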
The problem with the original k-factor equations is that the models depend only linearly
on Si. However, most of the gates require higher-order polynomial equations for
accuracy. Thus, new k-factor equations are adopted for gate delay and output slew:

Dg,newk is a third-order polynomial in both Cef and Si, and So,newk is a third-order
polynomial in CL and second-order in Si.
From the library characterization experiments, it is found that the new k-factor equations
fit the library data better than the original ones. Thus, the new k-factor-equation-based
delay and slew models are used in this project.
The net delay calculator of this project uses the Elmore delay model. It is easy to compute
and the delay is additive [23], which helps pruning during DP traversal. The shortcoming
of the Elmore delay is that it may deviate from the actual delay by orders of magnitude [23].
For higher accuracy, a moment-matching-based delay metric such as WED [62] can be used.
Such a metric assumes step input signals, yet in real circuits input signals have finite
slews, thus it tends to underestimate the actual delay. The PERI method [63] converts
the delay from delay metrics for step inputs to the delay with ramp inputs. From layout
simulations for various 3D nets, it is observed that the WED model combined with
PERI is quite accurate compared with PrimeTime results.
However, with moment-based net delay models, the optimality of the DP solution is not
guaranteed, because a dominance relation cannot be defined, as discussed in
[23]. For two solutions, a1 :(q1, Cef1, m1) and a2 :(q2, Cef2, m2), even if q1 ≥ q2 and
Cef1 ≤ Cef2, depending on the upstream solutions, the seemingly inferior a2 may give a
better solution on the upstream side. No correct pruning scheme for VGDP with
moment-based delay models currently exists. It is observed that buffer insertion with a
moment-based net delay model quite often produces worse solutions than buffer insertion
with the Elmore net delay model. Thus, the moment-based net delay model is not used in SPDP.
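Since SPDP relies on the Elmore model for net delay, a short Python sketch of the Elmore delay along a 3D path may be helpful. It uses the Cm, Rm, CTSV, and RTSV values listed in Table 22 and treats each wire segment and each TSV as a π-model. The helper names and the example net are hypothetical, and the driver output resistance is deliberately left out because the gate delay model accounts for it.

```python
# Parameters from Table 22 (units: fF, ohm, um).
CM_FF_PER_UM = 0.102
RM_OHM_PER_UM = 1.5
C_TSV_FF = 59.0
R_TSV_OHM = 0.1


def wire(length_um):
    """Lumped (R, C) of a wire segment of the given length."""
    return (RM_OHM_PER_UM * length_um, CM_FF_PER_UM * length_um)


TSV = (R_TSV_OHM, C_TSV_FF)


def elmore_delay_ps(elements, c_load_ff):
    """Elmore delay of a driver-to-sink chain of (R, C) elements.

    Each element is modeled as a pi-model (C/2, R, C/2). The driver's
    output resistance is not included.
    """
    delay_fs = 0.0       # ohm * fF = fs
    c_down = c_load_ff   # capacitance downstream of the current element
    for r, c in reversed(elements):
        delay_fs += r * (c / 2.0 + c_down)
        c_down += c
    return delay_fs / 1000.0


# Hypothetical nets: the same total wirelength with and without a TSV.
net_3d = [wire(300.0), TSV, wire(300.0)]
net_2d = [wire(600.0)]

if __name__ == "__main__":
    print("2D net:", elmore_delay_ps(net_2d, 20.0), "ps")
    print("3D net:", elmore_delay_ps(net_3d, 20.0), "ps")
```

Splitting a wire into two segments does not change the result, which reflects the additivity that makes Elmore delay convenient for DP pruning. With these parameters, one TSV adds about as much capacitance as 578µm of wire (59fF / 0.102fF/µm).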
As shown in Figure 23, for slew degradation (Sd) on nets, the Bakoglu’s metric [64] is




The buffer insertion problem of this project is defined as follows: after the placement and
routing stage, on a given routed net topology with placed TSV pin pads and landing pads,
buffers (from a given buffer library) are inserted at candidate locations to maximize the
required arrival time (RAT) at the source gate. This is equivalent to minimizing the delay
from the source gate to the critical sink gate. It is assumed that the input slew at the source
gate, the load capacitance at the sink output, and the RAT at the sink output are given.
Since the Si of the sink gate affects the sink gate delay, the sink gate delay is included in
the RAT calculation during DP. Thus, the delay from the input of the source gate to the
output of the critical sink gate is minimized.
This problem is different from the delay-constrained buffer insertion problem, where a
buffer insertion solution that minimizes resource usage (e.g., area, power) under a delay
target is sought. For such delay-constrained net instances, a fast buffer insertion algorithm
with reasonable delay quality is needed. In contrast, for nets on critical paths, a buffer
insertion solution that provides the lowest delay to the critical sink should be found. In
this project, the target nets for buffer insertion are these ’hard’ instances, for which
finding a better solution is more important than finding a reasonably good solution faster.
Also note that the problem of this project differs from slew minimization, which does not
produce the lowest delay to critical sinks.
5.3.2 Ginneken-3D Algorithm
First, the original van Ginneken algorithm designed for 2D ICs [20] is extended to 3D,
yielding Ginneken-3D. The Ginneken-3D algorithm of this project is similar to VGG in [23],
with extensions for 3D IC handling. From the layout of all dies, a binary tree T = (V,E) is
built for each target 2D/3D net, where V is a set of nodes and E is a set of edges. A TSV
is represented by an edge connecting nodes in different dies. The net wires are segmented
every 20µm to generate internal nodes for candidate buffer locations [65]. The TSV-related
information, such as the keep-out zone, should be considered in generating these candidate
locations. In addition, a set of buffers, B, is given. The VGDP comprises two steps: a
bottom-up traversal followed by a top-down traversal. During the bottom-up traversal,
candidate solutions at the leaf vertices are generated and propagated bottom-up. A candidate
solution (or a solution) a is a data tuple (q, C, b, al, ar) associated with a node v ∈ V ,
where q is the RAT, C is the load capacitance, b is an inserted buffer if any, and al and ar
are the left and the right child solutions from which a is generated. With the effective
capacitance model [27], the C of the solution is replaced by a tuple (Cn, Rπ, Cf ), which
represents the π-model. Thus, a solution becomes (q, (Cn, Rπ, Cf ), b, al, ar). Each node has
its own solution list, and the solutions are propagated bottom-up. The VGDP assumes a
default input slew for the delay computation of gates, including buffers and inverters [23].
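The generation of candidate buffer locations by wire segmentation can be sketched as follows. This is only an illustration under simplifying assumptions: the wire is treated as one-dimensional, and the TSV keep-out zones are given as hypothetical (start, end) intervals along the wire, which is not necessarily how the layout information is represented in this project.

```python
def candidate_locations(wire_length_um, step_um=20.0, keep_out=()):
    """Return candidate buffer positions (um from the upstream end of a wire).

    keep_out is an iterable of (start_um, end_um) intervals blocked by TSV
    keep-out zones (a hypothetical 1-D representation for this sketch).
    """
    locations = []
    count = int(wire_length_um // step_um)
    for i in range(1, count + 1):
        x = i * step_um
        if x >= wire_length_um:
            break  # the wire end is an existing node, not an internal one
        if any(lo <= x <= hi for lo, hi in keep_out):
            continue  # no buffer may be placed inside a keep-out zone
        locations.append(x)
    return locations


if __name__ == "__main__":
    print(candidate_locations(100.0))
    print(candidate_locations(100.0, keep_out=[(30.0, 50.0)]))
```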
The efficiency of the VGDP comes from the pruning of solutions at each node during the
bottom-up traversal. The pruning scheme presented in the original work [20] is simple yet
accurate because dominance relationship can be defined clearly for linear gate delay model
with lumped capacitance and Elmore net delay. For example, for two solutions, a1 :(q1,
C1) and a2 :(q2, C2), if q1 ≥ q2 and C1 ≤ C2, then a1 always produces a better solution
than a2 on the upstream side. However, with slew consideration the pruning should be
performed more intelligently, because the dominance relation no longer holds, as will be
discussed later in this section.
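The (q, C) dominance pruning of the original VGDP described above can be sketched in a few lines of Python. The `Solution` type and the function name are illustrative choices for this sketch; only the dominance rule itself (q1 ≥ q2 and C1 ≤ C2) is taken from the text.

```python
from typing import List, NamedTuple, Optional


class Solution(NamedTuple):
    q: float                             # required arrival time (RAT)
    c: float                             # load capacitance
    buf: Optional[str] = None            # inserted buffer, if any
    left: Optional["Solution"] = None    # left child solution
    right: Optional["Solution"] = None   # right child solution


def prune_dominated(solutions: List[Solution]) -> List[Solution]:
    """Drop every solution dominated by another (q1 >= q2 and C1 <= C2).

    The survivors come out sorted in ascending order of both C and q.
    """
    kept: List[Solution] = []
    # Ascending C; among equal C, the largest q comes first.
    for s in sorted(solutions, key=lambda s: (s.c, -s.q)):
        if not kept or s.q > kept[-1].q:
            kept.append(s)
    return kept


if __name__ == "__main__":
    sols = [Solution(100, 10), Solution(90, 12), Solution(120, 15), Solution(100, 8)]
    for s in prune_dominated(sols):
        print(s.q, s.c)
```

A solution with a larger C survives only if it also offers a strictly larger q, so a single sorted pass is sufficient.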
After bottom-up traversal is finished, only one solution survives at the root node, because
the C at the root node is the same (=input capacitance of source gate) for all solutions and
after pruning only the solution with largest q survives. From the best solution, the top-
down traversal is performed. The best solutions at child nodes are obtained by following the
child solution pointers (al and ar) stored during the bottom-up traversal. In this top-down
traversal, the best buffer insertion solution is obtained by checking if the best solution at a
node has an inserted buffer.
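The top-down step of VGDP amounts to following the stored child pointers from the best root solution and recording every inserted buffer. A minimal Python sketch is given below; the `Sol` class, its field names, and the node labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Sol:
    node: str                       # node the solution belongs to
    buf: Optional[str] = None       # inserted buffer type, if any
    left: Optional["Sol"] = None    # child solution pointer (al)
    right: Optional["Sol"] = None   # child solution pointer (ar)


def collect_buffers(root: Sol) -> List[Tuple[str, str]]:
    """Follow child pointers from the best root solution and list the buffers."""
    buffers = []
    stack = [root]
    while stack:
        s = stack.pop()
        if s.buf is not None:
            buffers.append((s.node, s.buf))
        # Push right first so that the left subtree is visited first.
        if s.right is not None:
            stack.append(s.right)
        if s.left is not None:
            stack.append(s.left)
    return buffers


if __name__ == "__main__":
    best = Sol("n0", left=Sol("n1", buf="BUF_X4", left=Sol("n2")),
               right=Sol("n3", buf="INV_X4"))
    print(collect_buffers(best))
```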
5.3.3 Bottom-Up Slew Propagation DP
It is well known that slew affects delay. Physically, slew is determined top-down, as shown
in Figure 23(a). The function F is a third-order polynomial slew model obtained from
the timing library. Note that propagated slew is not additive. During the bottom-up traversal
of VGDP, the slew at the current node is unknown until the buffer (or the gate) on the
upstream side is determined. This is why slew consideration in the DP framework is hard. To
overcome this hurdle, the slew at each node is guessed during the bottom-up traversal.
Each solution has an additional entry for propagated slew, S = (Sb, Sd), which consists of
the slew base (Sb) and the slew degradation (Sd). The top-down slew equations in
Figure 23(a) can be solved to get the bottom-up slew equations in Figure 23(b). Since slew
is not additive, the slew at a node needs to be calculated from the slew base and the slew
degradation. The function G can be found by solving F for Si. Note that because of the
direction of slew calculation, Sb is defined differently for the top-down and the bottom-up
traversals; in the top-down traversal Sb is defined as the output slew of the gate on the
upstream side, whereas

Figure 23: Gate and net slew calculations in (a) top-down and (b) bottom-up traversal.
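To illustrate why propagated slew is not additive, the following Python sketch combines a slew base and a slew degradation. It rests on two assumptions that are not stated explicitly in the surviving text: that Bakoglu's metric takes its common form, ln 9 times the Elmore delay, and that the slew base and the degradation are combined by a root-sum-of-squares rule as in PERI-style ramp-input extensions. The function names are hypothetical.

```python
import math

LN9 = math.log(9.0)


def slew_degradation_ps(elmore_delay_ps):
    """Step-input slew degradation of a net, assumed to be ln(9) * Elmore delay."""
    return LN9 * elmore_delay_ps


def propagated_slew_ps(slew_base_ps, slew_degradation_ps_value):
    """Slew at a node from slew base Sb and degradation Sd (assumed RSS rule)."""
    return math.hypot(slew_base_ps, slew_degradation_ps_value)


if __name__ == "__main__":
    sb, sd = 30.0, 40.0
    # The combined slew is smaller than the plain sum Sb + Sd.
    print(propagated_slew_ps(sb, sd), "ps versus", sb + sd, "ps")
```

Under this assumed rule, two solutions with the same slew at a node but different (Sb, Sd) pairs can yield different slews further upstream, which is why both components are stored in each solution.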
The bottom-up traversal and the top-down traversal algorithms of the proposed SPDP
are outlined in Algorithms 1 and 2. Compared with VGDP, the new or modified ideas are
highlighted in blue. The algorithm is now explained in detail.
For each sink node, a set of solutions is created, each with S set to a different trial
Si (tSi), as shown in Line 3, Algorithm 1. Since it is observed that in good buffer insertion
Algorithm 1: Bottom-up traversal of SPDP.
Input: a graph G=(V,E) with topologically sorted node list Vlist, a buffer library B
Output: list of solutions for each node v
1  foreach node v of Vlist in reverse order do
2      if v has no child then
3          make sink solutions with varied tSi values for different slew bins and add them to v;
4      end
5      else if v has one child vc then
6          propagate the solutions of vc to v;
7      end
8      else if v has two children vcl, vcr then
9          merge solutions of vcl and vcr with slew consideration and add it to v;
10     end
11     if v is a feasible buffer location then
12         for each solution, make a buffered solution if possible and add it to v;
13     end
14     for all solutions, calculate net delay of the parent wire or TSV of v and update q;
15     for all solutions, calculate slew degradation along the parent wire or TSV and update S;
16     prune solutions of v with slew consideration;
17 end
solutions the Si at the sink is in the [10, 50]ps range, the solutions are generated for this
range. One might consider starting from a single solution with a single tSi, performing the
buffer insertion, and then varying tSi until the best result is found. However, finding the
best solution by scanning tSi is not efficient, because the buffer insertion results change
unpredictably with different tSi, mainly because of the discreteness of buffer candidate
locations and buffer strengths. Furthermore, this approach cannot handle multi-pin nets
efficiently because of the numerous possible slew combinations for the sinks. Thus, slew
binning is proposed to find the best solution more efficiently.
The proposed slew binning is different from [59]. The allowed slew range is divided into
multiple slew bins with a predefined bin size. A slew value is associated with a corresponding
slew bin, and each bin has its own ID (bin). Now a solution is represented as (q, (Cn, Rπ,
Cf ), (Sb, Sd), bin, b, al, ar). If the bin size is small, the difference in slew among
solutions in the same bin is small. This property provides a good pruning opportunity
(Line 16, Algorithm 1); for the solutions in the same bin, q and C can be compared as in VGDP to
Algorithm 2: Top-down traversal of SPDP.
Input: a graph G=(V,E) with topologically sorted node list Vlist with solutions from bottom-up traversal
Output: list Blist of buffer locations and types
1  foreach solution a of the root node do
2      compute the gate delay and q at the source gate input with propagated slew;
3  end
4  find the top Nbest solutions with the highest q’s;
5  foreach top Nbest solution at the root node do
6      mark the current solution at the root node;
7      foreach node v of Vlist do
8          if marked solution at v has an inserted buffer b then
9              calculate gate delay and output slew;
10         end
11         if v has a left child vcl then
12             mark the solution al of a at vcl;
13             propagate top-down delay and slew to vcl;
14         end
15         if v has a right child vcr then
16             mark the solution ar of a at vcr;
17             propagate top-down delay and slew to vcr;
18         end
19     end
20 end
21 mark the best solution at the root node with lowest Dtop−down to critical sink;
22 foreach node v of Vlist do
23     if v has the best solution a with an inserted buffer b then
24         add the location and type of b in Blist;
25     end
26     if v has a left child vcl then
27         mark the best solution al of a at vcl;
28     end
29     if v has a right child vcr then
30         mark the best solution ar of a at vcr;
31     end
32 end
check dominance relation. Note that this is an approximation; even if two solutions at a
node have the same slew value, depending on their Sb and Sd, the slew on the upstream
may differ because slew is not additive. However, it is observed that this pruning works well
in practice; pruning only when two solutions have similar Sb and Sd produces a buffering
solution of almost the same quality with more than 20% runtime overhead.
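The per-bin pruning described above can be sketched in Python as follows. The sketch assumes that bin IDs are obtained by dividing the slew by the bin size starting from a bin origin (`min_s_ps`), which is an assumption about how bins are aligned; solutions are reduced to hypothetical (q, C, slew) tuples for brevity.

```python
import math
from collections import defaultdict


def slew_bin(slew_ps, bin_size_ps=2.0, min_s_ps=0.0):
    """Bin ID of a slew value (the bin origin min_s_ps is an assumption)."""
    return int(math.floor((slew_ps - min_s_ps) / bin_size_ps))


def prune_per_bin(solutions, bin_size_ps=2.0):
    """Apply (q, C) dominance pruning only among solutions of the same slew bin.

    solutions: iterable of (q, c, slew_ps) tuples.
    """
    by_bin = defaultdict(list)
    for q, c, s in solutions:
        by_bin[slew_bin(s, bin_size_ps)].append((q, c, s))
    kept = []
    for b in sorted(by_bin):
        best_q = -math.inf
        # Ascending C; a solution survives only if it improves q.
        for q, c, s in sorted(by_bin[b], key=lambda t: (t[1], -t[0])):
            if q > best_q:
                kept.append((q, c, s))
                best_q = q
    return kept


if __name__ == "__main__":
    sols = [(100.0, 10.0, 20.5), (90.0, 12.0, 21.0), (95.0, 12.0, 30.0)]
    print(prune_per_bin(sols))
```

In this example the second solution is dominated by the first within the same bin, whereas the third solution survives because it lies in a different slew bin even though its q and C look inferior.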
At sink nodes, when the bin size is 2ps, a total of 20 solutions are generated in the
[10, 50]ps range. Compared with the single solution generated at sinks in VGDP, this
multiple-solution generation at sinks increases the runtime of SPDP. Thus, it is crucial that
the pruning scheme is efficient. During the bottom-up traversal, the propagated slew is
limited to a maximum value, maxS, so that the search space is limited. The maxS may be set
larger than the maximum Si at sink gates, because buffers can recover degraded slews very
well and sometimes wires towards non-critical sinks may have larger slew between buffers.
The maxS effectively limits the maximum number of slew bins at each node, and the runtime
complexity depends on the number of allowed slew bins. As shown in Table 19, maxS
effectively limits the search space and runtime. Considering the delay and runtime trade-off,
maxS is set to 70ps. The minimum slew during DP, minS, is 1ps.
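The number of sink solutions follows directly from the slew range and the bin size. The short sketch below generates one trial Si per bin; using the bin center as the trial value is an assumption made for illustration, and the function name is hypothetical.

```python
def sink_trial_slews(lo_ps=10.0, hi_ps=50.0, bin_size_ps=2.0):
    """One trial input slew (tSi) per slew bin in [lo_ps, hi_ps].

    The bin center is used as the trial value (an assumption of this sketch).
    """
    n_bins = round((hi_ps - lo_ps) / bin_size_ps)
    return [lo_ps + (i + 0.5) * bin_size_ps for i in range(n_bins)]


if __name__ == "__main__":
    trial = sink_trial_slews()
    print(len(trial), trial[0], trial[-1])
```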
Table 19: Delay and runtime of SPDP with varied maxS for critical nets in a 3D IC
design.
maxS (ps)                 60       70       80       90
maximum Dtop−down (ps)    440.54   440.54   440.54   440.54
average Dtop−down (ps)    155.94   155.88   155.85   155.85
total runtime (s)         7.81     10.25    13.93    17.75
In the proposed slew binning scheme, a single slew value is propagated in each solution.
Propagating bins that have ranges of slew (i.e., [min, max]) as in [59] cannot be applied
to SPDP because (1) the delay and the slew calculation using a [min, max] slew range
complicates the pruning, and (2) the slew range expands quickly as So-to-Si conversion is
performed for buffers. Usually, buffers have very low slopes in Si-So graphs; a narrow range
on the So side corresponds to a wide range on the Si side. After going through three buffers,
the propagated slew range usually covers the entire useful slew range (1–70ps), rendering the
propagation pointless.
The buffered delay and the runtime with varied bin sizes are shown in Table 20. With
a larger bin size, runtime decreases because fewer solutions are generated at the sink gates
and pruning applies to more solutions. Yet too large a bin size degrades solution quality. The
bin size is set to 2ps for the delay and runtime trade-off.
For multi-pin nets, during the bottom-up traversal the child solutions are merged at the
merging node, as shown in Figure 24. The q3 and C3 of the merged solution a3 are
Table 20: Delay and runtime with varied bin sizes for critical nets in a 3D IC design.
bin size (ps)             1.0      2.0      3.0
maximum Dtop−down (ps)    433.96   440.54   456.43
average Dtop−down (ps)    155.44   155.88   156.28
total runtime (s)         32.27    10.25    5.44
calculated as in VGDP. Yet, the slew values need to be merged carefully. Physically, S1
and S2 should be the same as S3. In the previous slew buffering work [59] the authors
used the max operation, S3 = max(S1, S2), because they propagated a maximum slew
constraint. If this max operation is adopted in SPDP, it may propagate wrong slew values,
which in turn makes delay calculations on the downstream inaccurate and pruning on the
upstream incorrect. Thus, the child solutions are merged only when S1 and S2 are very
close to each other. However, Sb1 and Sd1 may be different from Sb2 and Sd2. Since slew is
not additive, depending on the upstream slew degradation, the propagated slew calculated
from S1 and S2 may differ, which incurs inevitable slew calculation error on the upstream.
Since the delay to the critical sink has to be more accurate than those to other sinks, the

if |bin1 − bin2| ≤ dS:
q3 = min(q1, q2)
C3 = C1 + C2
if (q1 ≤ q2) S3 = S1
else S3 = S2
Figure 24: Solution merge rule for VGDP and SPDP.
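A minimal Python sketch of a slew-aware merge is given below. It makes two assumptions for illustration: the merged RAT is the smaller of the two child RATs, as in the standard van Ginneken merge, and the slew of that timing-critical child is the one propagated. Solutions are represented as hypothetical dictionaries with keys `q`, `c`, `s`, and `bin`.

```python
def merge(left, right, d_s=0):
    """Merge two child solutions if their slew bins differ by at most d_s.

    Returns the merged solution, or None when the slews do not match.
    """
    if abs(left["bin"] - right["bin"]) > d_s:
        return None
    # The child with the smaller RAT is the timing-critical one (assumption).
    critical = left if left["q"] <= right["q"] else right
    return {
        "q": min(left["q"], right["q"]),
        "c": left["c"] + right["c"],
        "s": critical["s"],
        "bin": critical["bin"],
    }


if __name__ == "__main__":
    a1 = {"q": 100.0, "c": 10.0, "s": 20.0, "bin": 10}
    a2 = {"q": 90.0, "c": 5.0, "s": 21.0, "bin": 10}
    print(merge(a1, a2))
```

With d_s = 0, only child solutions in the same bin are merged, which corresponds to the setting chosen in Table 22.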
As shown in Figure 25, the data structure for solutions at a node is a list, in which the
solutions are sorted in ascending order of both q and C, but not of S. In the merging process
of VGDP, a left/right pointer points to the left/right child solution to be merged (refer
to [21]). After creating a solution by merging, only the timing-critical side pointer moves
to the right (=larger q and C). However, since SPDP merges solutions only when the slews of
the left and right solutions match, if the solutions on the left side are all timing-critical
and none of them has a slew that matches the slew of the current right solution, no further
solution is merged (i.e., the right pointer is stuck). To avoid this, on the non-timing-critical
side, the solution that matches the slew of the solution on the timing-critical side is
actively searched for, using a separate pointer. Since each list usually contains many
solutions with different slews, finding a solution with a matching slew does not take many
steps from the current pointer. This technique improved solution quality without much
runtime overhead.
(q1,C1,S1) (q2,C2,S2) (q3,C3,S3)
sol. from left child sol. from right child
search





Figure 25: Slew matching technique. The q’ and S’ are determined as in Figure 24.
In the proposed implementation, the solutions are merged when the bin numbers of S1
and S2 differ by at most a threshold, dS (Line 9, Algorithm 1). Allowing this small
difference is investigated to check whether the slew calculation error during the bottom-up
traversal might cause wrong pruning at merge. In Table 21, the percentage of merged
solutions, buffered delay, and runtime with various dS are shown. With a larger dS, more
child solutions are merged and the runtime increases. The percentage of merged solutions is
high because of the above slew matching technique. From the results, it is clear that dS = 0
produces the best delay with the lowest runtime. Thus, dS is set to 0, meaning that the
solutions are merged only when they have the same bin ID.
Table 21: Percentage of merged solutions, delay, and runtime with varied dS for critical
multi-pin nets in a 3D IC design.
dS (bins)                 0        1        2        3
merged sols (%)           82.3     91.9     93.5     94.4
maximum Dtop−down (ps)    413.00   440.54   486.45   543.38
average Dtop−down (ps)    209.55   216.33   221.24   226.88
total runtime (s)         9.55     10.25    11.71    12.09
The propagated slew provides a very efficient pruning mechanism in the buffer insertion
step (Line 12, Algorithm 1). For any solution at a node during the bottom-up traversal,
the C is known, and for a buffer b with the given C, So(g) can be converted to Si(g).
If the converted Si is out of the [minS, maxS] range, the solution is pruned. Instead of
pruning out the solution, if the solution is kept with a default Si value, say 40ps, the delay
calculations on the downstream side of the solution become incorrect because the slew at the
current node has been changed. This delay calculation error may cause a better solution
to be pruned by the default-slew solution, leading to worse buffer insertion results.
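The So-to-Si conversion and the range-based pruning can be illustrated with a short Python sketch. For simplicity it inverts the original k-factor slew equation (8), which is linear in Si and therefore has a closed-form inverse; the higher-order model actually adopted in this project would need a numerical solve instead. The function names and coefficient values are hypothetical.

```python
def input_slew_from_output(ks, c_l, s_o):
    """Solve Equation (8) for Si, given the required output slew So.

    ks is the tuple (ks1, ks2, ks3, ks4, ks5). Returns None if the slew
    slope is not positive, in which case no meaningful inverse exists.
    """
    ks1, ks2, ks3, ks4, ks5 = ks
    slope = ks1 + ks2 * c_l
    if slope <= 0.0:
        return None
    return (s_o - (ks3 * c_l ** 2 + ks4 * c_l + ks5)) / slope


def buffered_input_slew(ks, c_l, s_o, min_s=1.0, max_s=70.0):
    """Converted Si of a buffered solution, or None if it must be pruned."""
    s_i = input_slew_from_output(ks, c_l, s_o)
    if s_i is None or not (min_s <= s_i <= max_s):
        return None  # out of [minS, maxS]: the solution is pruned
    return s_i


if __name__ == "__main__":
    ks = (0.1, 0.0, 0.0, 0.5, 5.0)  # hypothetical coefficients
    print(buffered_input_slew(ks, c_l=20.0, s_o=19.0))  # kept
    print(buffered_input_slew(ks, c_l=20.0, s_o=30.0))  # pruned
```

Because the slope of So with respect to Si is small for typical buffers, a modest change in the required So maps to a large change in the converted Si, so many buffered candidates fall outside the allowed range and are discarded early.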
Another issue in the buffer insertion step is the dominance relation. As shown in Figure
26(a), in VGDP when the buffered solutions a1’–a3’ are generated from the propagated
solutions a1–a3, the Cs of a1’–a3’ are the same, Cb. Thus the buffered solution with the
highest q dominates all the other buffered solutions for the buffer b, and only one new
buffered solution is added for each buffer. However, in SPDP, since S1′ and S2′ also affect
the upstream solutions, it is not correct to just compare q for pruning. For example, even
if q1′ ≥ q2′ and S1′ ≤ S2′, it is possible that solution a1’ gets pruned out on the upstream
side because of slew-based pruning, while solution a2’ can survive up to the root node
and possibly become the best solution. This means that for each buffer type, buffered
solutions from all candidate solutions should be created (Line 12, Algorithm 1), increasing
the algorithm complexity from O(|B|^2 |V|^2) to O(|B|^(|V|+1) |V|^2), where |B| is the size
of the buffer library and |V| is the number of nodes. Thanks to the slew-based pruning rules
discussed

a1: (q1, C1, S1)
a2: (q2, C2, S2)
a3: (q3, C3, S3)
b1: (q1, Cb, S1)
b2: (q2, Cb, S2)
b3: (q3, Cb, S3)
b
x
(a) VGDP (b) SPDP
Figure 26: Different buffer insertion scheme for (a) VGDP and (b) SPDP.
After the bottom-up traversal, multiple solutions exist at the root node. After So-Si
conversion, the Si of a solution at the root node may not match the input slew assumption.
This mismatch is intentionally allowed because more accurate delay and slew of solutions will
be evaluated in the top-down traversal. In the top-down traversal, the slew is propagated
top-down and the gate and net delays are calculated with the propagated slew. The top-down
delay calculation
(Dtop−down, calculated by the internal delay/slew models) may differ slightly from
the bottom-up delay calculation (Dbottom−up, also calculated internally) because the
top-down slew values may differ from the bottom-up ones, and thus the effective capacitance
and gate/net delays change. Because of the slew calculations at solution merge and the
aforementioned input slew condition at the root node, Dbottom−up has an inherent error. Thus,
the solutions are sorted based on q and the top Nbest solutions are picked. Then, the top-down
solution tracking is performed from each of these solutions (Lines 5–20, Algorithm 2), and
Dtop−down is checked. The one with the lowest Dtop−down is chosen as the final buffer
insertion solution. It is observed that the buffer insertion quality generally improves with
higher Nbest and saturates beyond Nbest = 30. Since the top-down delay calculation is
straightforward, the runtime overhead of tracking multiple solutions is negligible.
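The final selection among the root solutions can be sketched as follows. Root solutions are represented as hypothetical dictionaries, and `top_down_delay` stands for whatever routine re-evaluates a solution with top-down slew propagation; both are illustrative names, not part of the actual implementation.

```python
def pick_final_solution(root_solutions, top_down_delay, n_best=30):
    """Re-evaluate the n_best root solutions with the highest q top-down.

    top_down_delay is a callable that returns Dtop-down for a root solution.
    The solution with the lowest Dtop-down is returned.
    """
    candidates = sorted(root_solutions, key=lambda a: a["q"], reverse=True)[:n_best]
    return min(candidates, key=top_down_delay)


if __name__ == "__main__":
    # Hypothetical root solutions: bottom-up q and a precomputed top-down delay.
    roots = [
        {"q": 100.0, "d_top_down": 160.0},
        {"q": 99.0, "d_top_down": 155.0},
        {"q": 90.0, "d_top_down": 150.0},
    ]
    best = pick_final_solution(roots, lambda a: a["d_top_down"], n_best=2)
    print(best)
```

In this toy example the solution with the highest bottom-up q is not the one with the lowest top-down delay, which is the situation the Nbest re-evaluation is meant to catch.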
5.4 Design Flow
In this project, design methodologies are developed for the following buffer insertion
methods: (1) Encounter-3D: the timing-constraint-based 2D buffer insertion for 3D ICs with
Cadence Encounter; (2) Ginneken-3D: the original VGDP with extensions for handling 3D
ICs, with a fixed input slew of 40ps for all gates; (3) SPDP: the proposed SPDP algorithm,
with the parameters in Table 22. The overall full-chip design flow for the three buffer
insertion methods is shown in Figure 27. Starting from a partitioned and placed design, a
preliminary optimization is run in Cadence Encounter for 2D nets without timing constraints
on die boundaries (TSV ports) to fix design rule violations (DRVs) within dies. Then, with
netlists and RC parasitic files extracted by Cadence QRC for all dies plus the top-level
netlist and the RC parasitic file that models TSVs, Synopsys PrimeTime is run to perform
true 3D STA and generate timing constraints on die boundaries, as is normally done for
hierarchical designs. With the timing constraints, die-by-die 2D optimization in pre-route
mode is performed first in Encounter. For fair comparisons, only buffer/inverter insertion is
allowed in the optimization. Then, die-by-die routing is performed, followed by RC
extraction. A 3D STA is performed to obtain updated timing constraints on die boundaries.
With the timing constraints, in
Encounter post-route 2D optimization is performed, which is the final design for Encounter-
3D. Finally, a 3D STA is performed to obtain timing results such as worst negative slack





ENCOUNTER-3D flow: 3D STA & generate timing constraints on die boundary; die-by-die 2D buffer insertion in pre-route mode; routing; 3D STA & generate updated timing constraints; die-by-die 2D buffer insertion in post-route mode; 3D STA.
GINNEKEN-3D/SPDP flow: gather critical nets & get RAT at net sinks; rip up buffers on critical nets, run ECO routing; extract layout information of critical nets; run VGDP / SPDP with layout info & RAT; apply ECO buffer insertion; refine place & ECO routing; 3D STA.
Figure 27: Overall full-chip design flow for the buffer insertion methods. The ECO means
engineering change order.
Starting from the final design of Encounter-3D, the nets on the top 5% critical
paths and the RAT of the sink gates of these nets are gathered in PrimeTime. In Encounter,
the buffers on the critical nets are ripped up, and ECO routing is performed to repair the
routing broken by the buffer deletions. Then, the layout information of the critical nets is
extracted from Encounter. Ginneken-3D or SPDP is run to find buffer insertion solutions,
which are fed back to Encounter using ECO buffer insertion commands. Then, placement
legalization and ECO routing are performed, which produces the final design for Ginneken-3D
and SPDP. Finally, a 3D STA is performed.
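The critical-net gathering step can be illustrated with a small sketch. It is not the script
used in this work: the path records, the function name critical_nets, and the reading of
"top 5%" as the 5% of timing paths with the smallest slack are assumptions made only for
illustration.

```python
def critical_nets(paths, percent=5):
    """Return the set of nets on the most critical `percent` of paths.

    paths: list of (slack_ps, [net names]) tuples (hypothetical record format).
    Paths are ranked by slack, most negative first.
    """
    ranked = sorted(paths, key=lambda p: p[0])
    # Integer ceiling of percent% of the path count, with at least one path kept.
    keep = max(1, (len(ranked) * percent + 99) // 100)
    nets = set()
    for _, path_nets in ranked[:keep]:
        nets.update(path_nets)
    return nets


# Hypothetical example: 40 paths, each with one private net and one shared net.
example_paths = [(-10.0 * i, [f"n{i}", "shared"]) for i in range(40)]
selected = critical_nets(example_paths)
print(sorted(selected))
```

A net that lies on several critical paths is collected once, which matches the idea of
ripping up and re-buffering each critical net a single time.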
The parameters used in the experiments, including those of SPDP, are summarized in Table 22.
The wire parasitics extracted by Cadence QRC from a moderately congested layout match Cm
and Rm with less than 5% and 1% error, respectively.
Table 22: Parameters used in this chapter. The Cm and Rm mean unit length capacitance
and resistance of metal5. The CTSV and RTSV mean TSV parasitic capacitance and resistance,
respectively. The maxS and minS are the maximum/minimum allowed slew in the
bottom-up traversal.
parameter   value         parameter   value
Cm          0.102fF/µm    Rm          1.5Ω/µm
CTSV        59fF          RTSV        0.1Ω
bin size    2.0ps         dS          0 bin
maxS        70ps          minS        1.0ps
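To give a feel for the values in Table 22, the following sketch computes the Elmore delay of
a wire interrupted by one TSV. It is an illustration only: the lumped pi-model per segment,
the 200µm segment lengths, and the 1fF sink capacitance are assumptions, and the driver
resistance is ignored. It is not the delay model used by SPDP.

```python
# Table 22 values, converted to ohm, farad, and micrometre.
R_M = 1.5          # ohm per um of metal5
C_M = 0.102e-15    # farad per um of metal5
R_TSV = 0.1        # ohm
C_TSV = 59e-15     # farad


def wire(length_um):
    """Lumped (R, C) of a metal5 wire segment."""
    return (R_M * length_um, C_M * length_um)


TSV = (R_TSV, C_TSV)


def elmore_delay(segments, c_sink):
    """Elmore delay (s) of a chain of (R, C) segments driving c_sink.

    Each segment is a pi-model: half of its C sits on each side of its R,
    so its R sees C/2 plus everything downstream.
    """
    delay = 0.0
    downstream = c_sink
    for r, c in reversed(segments):
        delay += r * (c / 2.0 + downstream)
        downstream += c
    return delay


C_SINK = 1e-15  # hypothetical sink pin capacitance
d_2d = elmore_delay([wire(400)], C_SINK)
d_3d = elmore_delay([wire(200), TSV, wire(200)], C_SINK)
print(f"400um wire: {d_2d * 1e12:.2f} ps, with one TSV: {d_3d * 1e12:.2f} ps")
```

In this example the TSV resistance contributes almost nothing, while the 59fF TSV
capacitance loads the upstream wire resistance and dominates the added delay.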
5.5 Experimental Results
To demonstrate the effectiveness of the proposed buffer insertion algorithm, buffer insertion
is performed on various nets and full-chip designs. The experiments are performed on a
Linux server with Intel Xeon processors running at 2.5GHz and 48GB of main memory. In
this study, the Nangate 45nm standard cell library [66] is used. The buffer set in the Nangate
45nm standard cell library consists of six non-inverting buffers (BUF X1/2/4/8/16/32) and
six inverting buffers (INV X1/2/4/8/16/32). Each has its own parameters such as Cb,
kd1–kd10, etc. The maximum CL allowed at the buffer output is defined by the library. It is
assumed that four dies are stacked in the 3D IC. The diameter and height of a TSV are 5µm
and 30µm, and the TSV macro occupies six standard cell rows. The TSV RC parasitics are
shown in Table 22. The inductance of the TSV is ignored because it is not dominant at signal
frequencies below a few GHz.
The five target designs are summarized in Table 23, and the buffer insertion results
are shown in Table 24. Note that the runtime of Encounter-3D is not reported, because
Encounter performs many internal steps during buffer insertion, and thus the runtime of
buffer insertion alone cannot be measured. Compared with Encounter-3D, Ginneken-3D improves
WNS and TNS by 31.4% and 41.0% on average, which means that applying 3D buffer insertion is
advantageous over timing-constraint-based 2D optimization. Compared with Ginneken-3D,
SPDP further improves WNS and TNS by 8.7% and 10.9%, and the maximum achievable
clock frequency is 3.2% higher, while using 4% fewer buffers. The reason why
Ginneken-3D used more buffers than SPDP is that Ginneken-3D inserted offloading buffers
whenever possible to shave small amounts of delay to the critical sink, while SPDP avoided
this because of the slew-aware merging.
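The fmax values reported in Table 24 are consistent with the usual relation
fmax = 1 / (target clock period − WNS), using the clock periods of Table 23. The sketch
below checks this for a few entries; the function name fmax_mhz is a label chosen here, and
the relation is inferred from the reported numbers rather than stated in the text.

```python
def fmax_mhz(clock_ns, wns_ps):
    """Maximum clock frequency (MHz) from target period (ns) and WNS (ps).

    A negative WNS lengthens the minimum achievable period.
    """
    min_period_ps = clock_ns * 1000.0 - wns_ps
    return 1.0e6 / min_period_ps


# (design, method, clock period in ns, WNS in ps, reported fmax in MHz)
samples = [
    ("ckt1", "Encounter-3D", 1.00, -528.54, 654.22),
    ("ckt4", "Ginneken-3D", 1.50, -665.62, 461.76),
    ("ckt5", "SPDP", 2.00, -423.778, 412.58),
]
for name, method, clock_ns, wns_ps, reported in samples:
    print(f"{name} {method}: {fmax_mhz(clock_ns, wns_ps):.2f} MHz (reported {reported})")
```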
Table 23: Summary of target design information. The '#nets (critical)' means the number
of nets in the whole design and the critical nets selected for buffer insertion. Die size is in
µm, and the 'clock' means target clock period in ns.
name   #gates   #nets (critical)   die size    #TSVs   clock
ckt1   12924    13256 (455)        350x350     1203    1.00
ckt2   46677    48426 (3408)       500x500     3102    1.00
ckt3   50375    55454 (1607)       700x700     8596    1.00
ckt4   253554   331177 (7405)      1300x1300   22303   1.50
ckt5   546460   714782 (14102)     1900x1900   42325   2.00
5.5.1 Full-Chip Results
Table 24: Comparison of buffer insertion results. The '#bufs' means the number of buffers
in the design, and the fmax stands for maximum achievable clock frequency. Runtime values
of Ginneken-3D and SPDP include bottom-up and top-down traversals in DP. The WNS,
TNS, fmax, and runtime are in ps, ns, MHz, and s respectively.
method         metric    ckt1       ckt2       ckt3       ckt4       ckt5       ratio
Encounter-3D   #bufs     5134       12587      30812      79510      188014     0.894
               WNS       -528.54    -1466.49   -1367.94   -1213.83   -604.58    1.458
               TNS       -69.42     -973.07    -4576.33   -1941.03   -631.20    1.694
               fmax      654.22     405.43     422.31     368.48     383.94     0.863
Ginneken-3D    #bufs     5338       15596      31459      88070      213212     1.000
               WNS       -392.62    -832.77    -1156.31   -665.62    -507.43    1.000
               TNS       -56.41     -695.42    -3885.07   -151.64    -47.62     1.000
               fmax      718.07     545.62     463.76     461.76     398.82     1.000
               runtime   1.307      11.934     23.332     88.033     294.522    1.000
SPDP           #bufs     5183       13297      31177      84312      205420     0.960
               WNS       -353.711   -740.974   -1106.94   -620.617   -423.778   0.913
               TNS       -53.60     -607.34    -3483.79   -130.73    -35.54     0.891
               fmax      738.71     574.39     474.62     471.56     412.58     1.032
               runtime   2.441      24.186     97.339     233.674    929.970    3.072
The cumulative runtime of SPDP for all five designs is about 3.1 times that of Ginneken-3D,
which is acceptable because (1) the number of critical nets to be buffered is small compared
with the total net count, and (2) buffering is one of many optimization steps and usually
consumes around 10% of the total optimization time. Thus, for these "hard" net instances,
it is acceptable to spend more time on buffering to improve timing. Compared with
Encounter-3D, SPDP produces 37.4% and 47.4% better WNS and TNS, and 19.6% higher maximum
clock frequency, yet uses 7.4% more buffers because the algorithm does not minimize the
number of buffers. Note that it is possible to run buffer usage reduction (area reclamation)
on non-timing-critical side paths of the target nets as a post-step, which is outside the
scope of this work. These full-chip results clearly demonstrate that the SPDP algorithm is
superior to Ginneken-3D and Encounter-3D in WNS, TNS, and maximum achievable clock
frequency.
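The 'ratio' column of Table 24 can be reproduced as the sum of a metric over the five
designs divided by the same sum for Ginneken-3D. This is inferred from the reported
numbers; the dictionary layout and the function name ratio are choices made for this sketch.

```python
# Selected rows of Table 24 (ckt1..ckt5). WNS in ps, runtime in s.
TABLE_24 = {
    "Encounter-3D": {
        "#bufs": [5134, 12587, 30812, 79510, 188014],
        "WNS": [-528.54, -1466.49, -1367.94, -1213.83, -604.58],
    },
    "Ginneken-3D": {
        "#bufs": [5338, 15596, 31459, 88070, 213212],
        "WNS": [-392.62, -832.77, -1156.31, -665.62, -507.43],
        "runtime": [1.307, 11.934, 23.332, 88.033, 294.522],
    },
    "SPDP": {
        "#bufs": [5183, 13297, 31177, 84312, 205420],
        "WNS": [-353.711, -740.974, -1106.94, -620.617, -423.778],
        "runtime": [2.441, 24.186, 97.339, 233.674, 929.970],
    },
}


def ratio(method, metric, base="Ginneken-3D"):
    """Sum of `metric` over all designs, normalized to the baseline method."""
    return sum(TABLE_24[method][metric]) / sum(TABLE_24[base][metric])


# Improvement of SPDP over Encounter-3D in WNS, as a fraction.
wns_gain = 1.0 - ratio("SPDP", "WNS") / ratio("Encounter-3D", "WNS")
print(f"SPDP vs Encounter-3D WNS improvement: {wns_gain:.1%}")
print(f"SPDP runtime ratio: {ratio('SPDP', 'runtime'):.3f}")
```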
5.5.2 Critical Path Analysis
The buffer insertion results for the nets on the critical path of design ckt3 are analyzed.
The target net information and the buffer insertion results are summarized in Table
25. In columns 2 and 3, the instance, pin, and cell names of the source gate and the
critical sink gate of each net are shown. In the second-to-last row, the setup time at the
timing endpoint flip-flop is shown. Note that some nets are 2D (#TSVs=0). Compared with
Ginneken-3D, SPDP reduces the path delay by 6.8% while using 33% fewer buffers.
The biggest difference between SPDP and Ginneken-3D is observed for net n4.
SPDP inserted fewer buffers than Ginneken-3D yet produced lower delay and slew at the
sink gate input, which is helpful because the delay of the sink gate OAI22 X1 is sensitive to
Si. Although Encounter-3D inserted a similar number of buffers to Ginneken-3D, it inserted
too many buffers on the critical path, which increases path delay due to the buffer intrinsic
delay. Ginneken-3D does not consider slew, thus its slew varies over a wide
range. On the other hand, SPDP produces low slew values in most cases to reduce sink
gate delay, except for n9, where the delay of the sink gate (NAND2 X4) is not very sensitive
to Si. With Ginneken-3D, the Si at the timing endpoint is quite high, which increases
the setup time. Encounter-3D produced the minimum Si among the three methods;
however, the overall delay was not the minimum.
5.5.3 Endpoint Slack Histograms
To visualize the timing quality of the buffer insertion results, the timing endpoint slack
histograms for design ckt2 with Encounter-3D, Ginneken-3D, and SPDP are compared in Figure 28.
In the Encounter-3D result, the long tail toward the left (slack < −1.2ns) arises because
Encounter-3D could not optimize several critical 3D nets effectively. Compared with those of
Ginneken-3D or SPDP, the histogram bars of Encounter-3D lie further to the left overall,
meaning that its buffer insertion quality is the worst among the three methods. Compared
with the Ginneken-3D graph, in the SPDP graph the leftmost bar (=WNS) as well as the overall
distribution are further to the right, meaning a better timing result.
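An endpoint slack histogram of this kind can be built as sketched below. The slack values,
the 100ps bin width, and the function name slack_histogram are hypothetical; the real
histograms in Figure 28 come from the PrimeTime endpoint reports.

```python
import math
from collections import Counter


def slack_histogram(slacks_ps, bin_ps=100):
    """Count endpoints per slack bin; keys are the left edges of the bins (ps)."""
    counts = Counter(math.floor(s / bin_ps) * bin_ps for s in slacks_ps)
    return dict(sorted(counts.items()))


# Hypothetical endpoint slacks in ps (negative = violating).
slacks = [-740.9, -735.2, -120.5, 35.0]
hist = slack_histogram(slacks)
print(hist)

# The leftmost non-empty bin is the one that contains the WNS.
leftmost_bin = min(hist)
wns = min(slacks)
print(leftmost_bin, wns)
```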
5.6 Summary
In this chapter, a slew-aware buffer insertion algorithm in the van Ginneken dynamic
programming framework was presented for timing optimization of 3D ICs. Compared with the
original (fixed-slew) van Ginneken algorithm, the proposed algorithm reduced delay with a
reasonable runtime increase. In addition, it outperformed the timing-constraint-based 2D

Figure 28: Endpoint slack histograms for ckt2 with (a) Encounter-3D, (b) Ginneken-3D,
and (c) the proposed SPDP.
CHAPTER VI
ULTRA-HIGH-DENSITY LOGIC DESIGNS USING MONOLITHIC 3D
INTEGRATION
In today's logic designs, interconnects are widely believed to dominate the timing and power
of circuits; reducing the interconnect length may therefore improve both.
By stacking device layers in 3D using through-silicon-vias (TSVs), not only is the
footprint reduced but the average distance among devices is also reduced, leading to a
shorter total wirelength and better performance. However, the shortcomings of TSV-based
3D ICs are the area overhead [47] and the minimum keep-out-zone of TSVs [67], which arise
from manufacturing issues such as die alignment precision [68] and mechanical stress [69].
In addition, the parasitic capacitance of TSVs is large (tens to hundreds of fF), which may
degrade the timing and power of circuits.
To better exploit the benefits of 3D die stacking, monolithic 3D technology is currently
being investigated as a next-generation technology. In a monolithic 3D IC, the device layers
are fabricated sequentially, rather than bonding two fabricated dies together using bumps
and/or TSVs. When the top layer is attached to the bottom layer, the top layer is
blank silicon. Alignment precision is therefore determined by lithography stepper accuracy,
which is around 10nm today. Also, the top layer can be made very thin, around 30nm [28]. Thus,
monolithic inter-tier vias (MIVs) for vertical connections are very small, about two orders
of magnitude smaller than TSVs, with a negligibly small parasitic
capacitance (< 0.1fF). A side view of a typical monolithic 3D IC is shown in Figure 29.
With these small MIVs, designers can truly exploit the benefit of the vertical dimension.
As discussed in [32, 33], monolithic 3D technology enables very fine-grained 3D circuit
partitioning. Standard cells can be divided into PMOS and NMOS parts, placed in
different layers, and connected using MIVs, which is called transistor-level monolithic 3D

Figure 29: Side view of a two-tier monolithic 3D IC. The MIV and ILD stand for monolithic
inter-tier via and inter-layer dielectric. On the top tier, only the first two metal layers (M1,
M2) are shown. Objects are drawn to scale. Unit is nm.
in different layers and connected using MIVs, which is named gate-level monolithic 3D
integration (G-MI). In this project, the major focus is on T-MI, which allows the highest
integration density possible. Comparisons among T-MI, G-MI, TSV-based 3D, and conventional
2D designs are provided. In addition, the power benefit of T-MI is studied based
on timing-closed, detailed-routed GDSII-level layouts and sign-off analysis of
timing and power. The research in this chapter encompasses device- and interconnect-level
study, gate-level modeling and optimization, and full-chip layout construction, optimization,
and timing/power analysis. The layout-based simulations and in-depth analyses
demonstrate how to maximize the power benefit of T-MI technology. For fair comparisons
between T-MI and 2D designs, timing is closed on all designs (iso-performance), and
power consumption is compared.
6.1 Background
6.1.1 Fabrication Process
In this chapter, the monolithic 3D IC fabrication process from CEA/LETI [28,70] is assumed.
Key features of their monolithic 3D process flow are wafer-level molecular bonding with a
thin interlayer dielectric and a special salicidation process, under a specific thermal budget.
Based on their monolithic 3D process, they fabricated a test chip and measured the performance
of top/bottom tier transistors as well as simple circuit structures such as inverter

Figure 30: Monolithic 3D fabrication process flow of CEA/LETI. Panels: (a), (b), (c), (d).
The monolithic 3D fabrication process flow of CEA/LETI is summarized in Figure 30:
(a) They process the bottom layer, where transistors are fabricated with a classical thermal
budget. Their bottom tier transistors are based on fully-depleted silicon-on-insulator
(FDSOI) with an HfO2 and TiN/Poly-Si N+ doped gate stack. Rapid thermal annealing (RTA) at
1050°C is used for dopant activation. A specific Ni salicidation process with platinum
incorporation and fluorine and tungsten implantation is applied, which enables silicide
stabilization under the thermal budget of top tier fabrication. (b) On top of the bottom
tier, a thin inter-layer dielectric (ILD) is deposited and planarized. The thickness of the
ILD is 110nm, which allows dense 3D interconnects. Then, the top silicon wafer is attached
on top of the ILD using low temperature (200°C) molecular bonding. At this moment,
the top layer is blank; no patterned object exists. Thus, there is no alignment issue during
the wafer bonding.
(c) They process the top layer, where transistors and other structures are fabricated
under a 600°C thermal budget. Solid phase epitaxy (SPE) at 600°C is used for the dopant
activation of top tier transistors. For the gate dielectric, they deposit HfO2 using atomic
layer deposition (ALD) at 350°C, followed by thermal annealing at 515°C for 5 minutes. For
spacers and passivation layers, they use low temperature deposited oxides. In addition, they
developed a low temperature (650°C) epitaxial growth method to apply raised source-drain
to top tier transistors, which is required for advanced FDSOI technology nodes (22nm and
below). (d) The contacts to the top/bottom layer transistors are fabricated. They use a
single lithography step for top/bottom contacts, for which a highly selective etch is needed
to open contacts down to the bottom layer transistors.
For successful fabrication, several major issues need to be overcome, such as the realization
of a high-quality top silicon layer, high-stability bottom transistors, and low-thermal-budget
top transistors. They claimed that their low temperature molecular bonding of the top silicon
wafer allowed a high-quality top layer. By implanting fluorine into NiSi, they implemented
a morphologically robust salicide with a low sheet resistance. As a result, the
characteristics of bottom tier transistors were maintained after top tier transistor
fabrication. They also claimed that their SPE was efficient for the top silicon layer and led
to high dopant activation levels.
One of the major benefits of monolithic 3D ICs compared with TSV-based 3D ICs is the
alignment precision between layers. In monolithic 3D ICs, this alignment
depends only on lithographic alignment capability [70]. In [71], the authors demonstrated
high alignment precision in monolithic 3D ICs (σ≈10nm) compared with TSV-based 3D
integration (σ≈0.5µm) [68]. The nano-scale alignment precision and the ultra-thin silicon
and ILD layers enable nano-scale 3D interconnects.
6.1.2 Design Styles of Monolithic 3D ICs
As shown in Figure 31, the design styles of monolithic 3D ICs fall into two categories:
gate-level (G-MI) and transistor-level (T-MI). As in TSV-based 3D ICs, in G-MI designs,
standard cells are planar (2D) and each layer contains multiple metal layers. However, in
G-MI, device layers are fabricated sequentially, and MIVs are much smaller than TSVs.
The T-MI designs are different from G-MI: (1) Most of the 3D interconnects are em-

Figure 31: Design styles of monolithic 3D ICs: (a) T-MI, (b) G-MI. (Labels recovered from
the figure: multiple metal layers, M1, MB1.)
manufacturing processes can be optimized separately per die. (3) Physical layout (placement,
routing, optimization, etc.) can be performed using existing 2D electronic design
automation (EDA) tools with minor modifications. In contrast, G-MI or TSV-based 3D
ICs require 3D-aware physical layout engines. Currently, no commercial EDA tool can handle
multiple dies together, especially for optimizations. Thus, previous works [25,72] rely on
die-by-die optimizations with timing constraints on the die boundary. However, the design
quality with this approach is sub-optimal, because the optimization engine cannot see the
whole 3D paths.1
6.2 Design Methodologies
In this section, the proposed design methods for T-MI technology are explained in detail.
Various practical considerations for high density and high performance T-MI designs are
discussed.
6.2.1 Overall Design and Analysis Flow
One of the major benefits of T-MI is that existing 2D EDA tools can be used, with simple
modifications if needed. Commercial EDA tools are extensively used in this study. The
design and analysis flow of this project, summarized in Figure 32, consists of four parts:
1The optimization limitations are presented in Section 6.5.1.

Figure 32: Overall design and analysis flow for T-MI. Shaded boxes highlight differences
in T-MI. The WLM means wire load model. (Box labels recovered from the figure: create
physical cell library & interconnect RC library; 2D timing & power library.)
(1) library preparation, (2) synthesis, (3) layout, and (4) analysis. In the library
preparation part, T-MI-specific library files are prepared. The RTL codes of benchmark circuits
are synthesized using Synopsys Design Compiler. In the layout part, placement, routing,
and optimizations are performed using Cadence Encounter (v10.12). Finally, static timing
analysis and static power analysis are performed.
The major efforts in the T-MI design flow are spent on T-MI cell library construction and
characterization, T-MI interconnect structure modeling, and T-MI wire load modeling. The
technology files and design rules are modified to account for additional layers on the bottom
tier as well as additional metal layers on the top tier (see Section 6.3.2). Using Cadence
Virtuoso, the T-MI cells are created by modifying existing 2D cells. The cells are then
abstracted to create the T-MI physical cell library. In addition, interconnect RC libraries
are built using Cadence capTable generator and QRC Techgen. For synthesis, T-MI wire
load models are created that reflect the reduced wirelengths of T-MI. The T-MI wire load
models guide synthesis optimizations; with shorter (estimated) wirelengths, the synthesized
netlist of T-MI contains weaker cells and fewer buffers than that of 2D, under the
same clock period.
During layout construction, the Encounter placer is run first. The tool recognizes T-MI
cells as cells with pins on multiple layers. For routing, Encounter is set up to utilize
the additional metal layers on the bottom and top tiers. Since the T-MI cells contain routing
blockages on the MIV layer, the router does not place MIVs for 3D routing through the top
tier part of the cells. Using the T-MI interconnect library that reflects the T-MI metal layer
structures and materials, RC extraction is performed on all the nets in the layout. The
full-chip timing/power optimizations and analyses for T-MI and 2D are the same, because
the entire T-MI design (top/bottom tiers) is captured in a single Encounter session. Static
power analysis is performed with the switching activity of the primary inputs and sequential
cell outputs set to 0.2 and 0.1, respectively.

Figure 33: The layout of an inverter from (a) Nangate 45nm library, and (b) the T-MI
library. P, M, and CT represent poly, metal, and contact. The suffix 'B' means the bottom
tier. MIV means monolithic inter-tier via. Top/bottom tier silicon substrate and p/nwells
are not shown for simplicity. The numbers in parentheses mean thickness in nm. (Labels
recovered from the figure: "(b) our T-MI cell", "fold".)
The T-MI 3D cells are designed using the (2D) standard cells in the Nangate 45nm library
[66] as the baseline. As shown in Figure 33, the 2D standard cells are folded into
3D to create T-MI 3D cells. The thicknesses of top/bottom tier silicon substrates and
inter-layer dielectric (ILD) are 30nm and 110nm, respectively. The diameter of an MIV is
70nm. Note that by folding, cell pins (A, Z) are on both tiers. In this project, the PMOS
transistors are placed on the bottom tier and the NMOS on the top tier. In the Nangate 45nm
library, P/NMOS transistors show hole/electron mobility skew. To compensate for the
difference, a PMOS is larger than the corresponding NMOS. Since
extra silicon space on the top tier is required for MIVs (not on the bottom tier; see Figure
33(b)), placing PMOS transistors on the bottom tier balances top/bottom silicon area usage.
However, manufacturing aspects should also be considered in deciding the P/NMOS
layer assignment.2
Figure 34: Layout snapshots of the T-MI cells. The S/D means source/drain. The p/nwell
and implants are not shown for simplicity.
After folding the cell, the VDD and VSS strips overlap, as shown in Figure 33. The
power to VDD on the bottom tier can be delivered down through arrays of MIVs, placed
apart from the VSS strip. Extra space may be needed for these VDD MIVs. Yet, power
delivery network design and IR-drop analysis are outside the scope of this work. Also, since
the VDD and VSS strips overlap, they may act as a small decoupling capacitor. However, in the
extracted cell internal RC data for the T-MI inverter cell, the coupling capacitance (or cap)
between the VDD and VSS strips is around 0.01fF, which is small compared with other cell
internal parasitic capacitances.
2In sub-32nm nodes, thanks to advanced channel engineering techniques, the hole/electron mobility is
about the same.
The transistor model in the Nangate 45nm library is PTM 45nm with bulk silicon technology
[73]. In monolithic 3D technology, because of the structure, top tier transistors are similar
to silicon-on-insulator (SOI) devices [28]. However, in this study the same transistor model
is assumed for T-MI and 2D cells, because (1) the original Nangate 45nm library is based
on bulk silicon technology, and (2) if both devices and interconnect structures in T-MI are
assumed to be different from 2D, it becomes harder to understand which factor contributes
to power reduction, and by how much.
The proposed standard cell design method differs from Intra-Cell Stacking in [32] for
three major reasons:
• The PMOS transistors are placed on the bottom tier and NMOS transistors on the
top. If PMOS is on the top tier as in [32], extra space may be needed for MIVs, which
increases the cell footprint.
• The proposed cell folding technique is applied on the original 2D standard cell layouts.
Compared with the Intra-Cell Stacking technique in [32] that requires a complete
redesign of internal connections, the proposed method is straightforward and provides
opportunities for reducing internal RC parasitics.
• The VDD/VSS strips of standard cells are placed on the bottom side in different tiers.
Compared with the Intra-Cell Stacking in [32] which places power/ground rails on the
top/bottom side of the standard cells, the proposed method further reduces the cell
footprint because metal1 routing space is even for top and bottom tiers.
The T-MI cells preserve the same transistor sizes as in the original 2D cells. GDSII
layouts of some of the T-MI cells are shown in Figure 34. The T-MI cell height is 0.84µm,
which is 40% smaller than the original 2D cell height (1.4µm). Thus, cell footprint reduces
by 40%3, which is more than the reported values in [32] (about 30%).
3The reasons why it is not 50% are (1) P/NMOS size mismatch incurs extra space on NMOS side, and
(2) MIVs require extra space on the top tier.
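The footprint figures can be checked with simple arithmetic. The sketch below assumes that
folding leaves the cell width unchanged, so the footprint scales with the cell height; the
function name is a label chosen here.

```python
# Cell heights from the text, in um. Cell width is assumed unchanged by folding.
H_2D = 1.4
H_TMI = 0.84


def footprint_reduction(h_2d, h_3d):
    """Fractional footprint reduction when only the cell height changes."""
    return 1.0 - h_3d / h_2d


ideal = footprint_reduction(H_2D, H_2D / 2.0)   # perfect fold into two equal halves
actual = footprint_reduction(H_2D, H_TMI)       # reported T-MI cell height
print(f"ideal fold: {ideal:.0%}, T-MI cells: {actual:.0%}")
```

The gap between the ideal 50% and the achieved 40% corresponds to the extra 0.14µm of cell
height attributed in the footnote to the P/NMOS size mismatch and the MIV space.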
When designing T-MI cells, care should be taken to reduce cell internal RC parasitics. As
shown in Figure 33(b), the path from the PMOS on the bottom tier to the NMOS on the top
tier consists of CTB, MB1, MIV, CT, M1, then CT to diffusion. This 3D path may become
longer than the original 2D path and may increase the cell internal parasitic RC. Similarly,
the path from the PB on the bottom tier to the P on the top tier consists of multiple layers.
To reduce cell internal RC parasitics, it is important to minimize the lengths of 3D paths.
To achieve shorter 3D paths, MIVs should be placed close to the connecting transistors.
In addition, direct source/drain (S/D) contacts need to be utilized (see Figure 34(c)). The
direct S/D contacts reduce the detour in the 3D paths and unnecessary RC parasitics.
The cell internal RC parasitics of 3D and 2D cells and the impact on timing/power are
examined. In previous works [32–34], the authors assumed that the delay and power of
3D cells are the same as 2D cells and used 2D timing/power library. In [28], the authors
fabricated a transistor-level monolithic 3D IC and measured the top/bottom transistor
performances. They reported that the differences between 3D transistors and baseline
2D transistors were negligible. Yet, the delay and power of cells are also affected by cell
internal RC parasitics. From Figure 33(b), it can be conjectured that there are coupling
capacitances among PB, CTB, MB1, MIV, CT, and M1. Using Mentor Graphics Calibre
XRC with EM-simulation-based extraction rules, these capacitance values are extracted as
well as resistances and transistors from the T-MI cell layout. Then, a SPICE netlist of the
cell is generated that consists of transistors and parasitic RC components.
Since Calibre XRC is designed for 2D ICs, it can only model one diffusion layer. Due to
this tool limitation, the top tier diffusion layer can be modeled as either dielectric or
conductor. Even though the top tier silicon is doped (low resistivity) and the bodies of top
tier transistors are tied to the ground, it is expected that some amount of electric field may
penetrate the top tier silicon and that coupling among top and bottom tier objects (M1, MB1,
P, PB, etc.) may exist. When it is assumed that the top tier silicon is dielectric, the
coupling between top and bottom tier objects would be overestimated; when it is conductor,
the coupling would be underestimated. The real case would be between these two extreme cases.
The total cell internal RC values, extracted from the original 2D cells and the 3D
(T-MI) cells, are shown in Table 26. For the 3D case, the results with the top tier silicon
modeled as both dielectric (3D) and conductor (3D-c) are shown. From the results, the
following is observed: (1) For INV, NAND2, and MUX2, the R values of 3D are noticeably
smaller than their 2D counterparts, because the lengths of poly and metal lines inside the
cells are reduced by using 3D interconnects. (2) The C values of 3D are comparable with
those of 2D; for these three cells, the 2D value is between 3D and 3D-c. (3) For DFF, both R
and C of 3D are larger than their 2D counterparts. Due to the complex internal connections,
a 3D cell layout that matches the RC parasitics of 2D could not be created. In summary,
depending on the cell layout complexity, the internal RC ratio between 3D and 2D may vary.
Table 26: Cell internal parasitic RC values. The 3D-c means 3D with top tier silicon
modeled as a conductor.
        R (kΩ)                   C (fF)
cell    2D      3D      3D-c     2D      3D      3D-c
INV     0.186   0.107   0.107    0.363   0.368   0.349
NAND2   0.372   0.237   0.237    0.561   0.586   0.547
MUX2    1.133   0.975   0.975    1.823   1.938   1.796
DFF     2.876   3.045   3.045    4.108   5.101   4.740
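The observations on Table 26 can be checked directly from the tabulated values. The sketch
below only re-reads the table; the data layout and function names are choices made here.

```python
# Table 26: (2D, 3D with dielectric top silicon, 3D-c with conductor top silicon).
# R in kOhm, C in fF.
TABLE_26 = {
    "INV":   {"R": (0.186, 0.107, 0.107), "C": (0.363, 0.368, 0.349)},
    "NAND2": {"R": (0.372, 0.237, 0.237), "C": (0.561, 0.586, 0.547)},
    "MUX2":  {"R": (1.133, 0.975, 0.975), "C": (1.823, 1.938, 1.796)},
    "DFF":   {"R": (2.876, 3.045, 3.045), "C": (4.108, 5.101, 4.740)},
}


def r_ratio(cell):
    """Ratio of 3D to 2D cell internal resistance."""
    r_2d, r_3d, _ = TABLE_26[cell]["R"]
    return r_3d / r_2d


def c_2d_is_bracketed(cell):
    """True if the 2D capacitance lies between the 3D and 3D-c estimates."""
    c_2d, c_3d, c_3dc = TABLE_26[cell]["C"]
    return min(c_3d, c_3dc) <= c_2d <= max(c_3d, c_3dc)


for cell in TABLE_26:
    print(f"{cell:6s} R(3D)/R(2D) = {r_ratio(cell):.2f}  "
          f"2D C between 3D and 3D-c: {c_2d_is_bracketed(cell)}")
```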
Table 27: Delay and internal power consumption of cells with various input slew and load
capacitance conditions. The library uses different input slew settings for DFF. The values
in the parentheses mean the percentage ratio of 3D to 2D.
        delay (ps)                power (fJ)
cell    2D      3D                2D      3D
fast case: input slew=7.5ps (5ps for DFF), load cap.=0.8fF
INV     17.2    16.9 (98.3%)      0.383   0.351 (91.6%)
NAND2   21.2    20.9 (98.6%)      0.616   0.583 (94.6%)
MUX2    59.8    58.2 (97.3%)      2.113   2.060 (97.5%)
DFF     108.8   113.4 (104.2%)    6.341   6.735 (106.2%)
medium case: input slew=37.5ps (28.1ps for DFF), load cap.=3.2fF
INV     51.1    50.8 (99.4%)      0.362   0.343 (94.8%)
NAND2   56.2    55.9 (99.5%)      0.604   0.581 (96.2%)
MUX2    97.0    95.3 (98.2%)      2.239   2.168 (96.8%)
DFF     142.6   147.0 (103.1%)    6.358   6.756 (106.3%)
slow case: input slew=150ps (112.5ps for DFF), load cap.=12.8fF
INV     188.3   188.0 (99.8%)     0.449   0.431 (96.0%)
NAND2   195.9   195.5 (99.8%)     0.698   0.675 (96.7%)
MUX2    215.1   212.5 (98.8%)     2.555   2.487 (97.3%)
DFF     237.4   243.3 (102.5%)    7.303   7.659 (104.9%)
Yet, the delay and power of the cells are more important metrics. Cell timing/power
characterization is performed using commercial software. The SPICE netlists obtained
from the previous RC extractions are fed into Cadence Encounter Library Characterizer,
which runs SPICE simulations to characterize the delay and power of cells under various input
slew and load capacitance conditions. The delay/power of 3D and 2D cells are shown in
Table 27. The values are obtained from the data tables in the characterized Liberty library.
The delay is the cell internal delay including load effect, and the power is the dynamic power
consumed within the cell boundary (including short circuit power and power for gate/parasitic
capacitances). It is observed that for INV, NAND2, and MUX2, the delay and power of
3D are slightly better than 2D, whereas for DFF, they are a little worse. In addition, as
the input slew and load capacitance condition changes from the fast to the slow case, the
delay difference between T-MI and 2D becomes smaller. Note that depending on cell design
quality and manufacturing technology, the results may change. With proper cell designs, the
delay and power of 3D cells could be similar to their 2D counterparts.
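The trend from the fast to the slow case can be read off Table 27 by recomputing the 3D/2D
delay ratios. The sketch below uses only the tabulated INV and DFF delays; the data layout
and function name are choices made here.

```python
# Table 27 delays in ps as (2D, 3D) for the fast, medium, and slow cases.
DELAY_PS = {
    "INV": [(17.2, 16.9), (51.1, 50.8), (188.3, 188.0)],
    "DFF": [(108.8, 113.4), (142.6, 147.0), (237.4, 243.3)],
}


def ratio_percent(cell):
    """3D/2D delay ratio in percent for each case, rounded as in Table 27."""
    return [round(100.0 * d3 / d2, 1) for d2, d3 in DELAY_PS[cell]]


for cell in DELAY_PS:
    ratios = ratio_percent(cell)
    gaps = [round(abs(r - 100.0), 1) for r in ratios]
    print(f"{cell}: ratios {ratios} %, gap to 2D {gaps} %")
```

For both cells the gap to 2D shrinks from the fast to the slow case, because the fixed
difference in cell internal parasitics becomes a smaller share of a larger total delay.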
6.2.3 Full-Chip Physical Layout
With the libraries built for T-MI, full-chip layout experiments are performed. Using
Synopsys Design Compiler, the benchmark circuits are synthesized based on the T-MI standard
cells and benchmark design constraints. These benchmark circuits are summarized in Table
28. Next, physical layouts of the circuits are built using Cadence Encounter. Starting
from floorplanning, power delivery network planning, timing-driven placement of cells, clock
synthesis, and timing-driven routing are performed. Since a T-MI cell contains both the
top and the bottom tier parts and MIVs as a single unit, the placer places the cells in a 2D
fashion without any overlap between cells. The T-MI cells have pins on the first metal of
both the bottom and the top tiers (MB1 and M1 in Figure 38(b)).
Unlike the metal layer assumption in [32], the router is allowed to use the metal layer
on the bottom tier (MB1 in Figure 38(b)) for routing as well. In this setup, the timing-
driven router in Encounter chooses which pin on which layer to connect to, based on routing

Figure 35: Illustration of net routing cases in T-MI. This net connects pin Z of Cell1 to
pin A of Cell2.
M1 only, or all 3: MB1, MIV, and M1. Note that the router should not place MIVs inside
standard cells because these MIVs may touch the internal objects of the cell.
After routing is finished, RC extraction of nets is performed, which is required for timing
and power analysis. Once the RC information and the netlist are available, the static timing
analysis (STA) engine handles the entire top and bottom tiers at once, providing true 3D
STA results. Using Synopsys PrimeTime PX, static power analysis is performed. Switching
activity values are assumed at the primary input pins and the flip-flop outputs
(0.2 and 0.1, respectively). Then, the tool propagates switching activity information to
the rest of the circuit. Based on the switching activity and library information, power
calculation is performed.
Layout snapshots of AES (see Table 28) are shown in Figure 36. In the zoom-in shots,
cells, signal nets, and power rails are shown. For the top tier, only the first two metals (M1
and M2) are shown. It is observed that Encounter places and routes T-MI cells without
any problem. Note that MIVs used in net routing are placed in the white spaces between
cells, avoiding any contact. Since state-of-the-art EDA software is used for layout, the
quality of placement and routing is very good.
6.3 Exploration of Metal Layer Options
As shown in Figure 38, the metal layer structure of T-MI is dramatically different from











Figure 36: Layout snapshots of the benchmark circuit AES. On the right, zoom-in shots
of the top and the bottom tier are shown. Black and purple squares indicate the MIVs used
for net routing and cell internal connections, respectively.
Table 28: Benchmark circuits used for metal layer option exploration.
                     AES      VGA      DES      JPEG      FFT
#cells               19,719   68,318   76,088   297,028   582,621
#nets                20,146   74,696   78,608   381,548   751,399
average fanout       2.131    2.307    2.034    1.850     2.130
clock period (ns)    0.5      0.5      0.5      3.0       0.6
explored that enable ultra-high-density integration. For this exploration, the benchmark
circuits in Table 28 are used. Note that in this section, layout optimizations are not per-
formed yet, to highlight the timing/power differences between interconnect options. Also,
the same synthesized netlist is used for all design options.
6.3.1 Routing Congestions in T-MI Designs
Table 29: Pin density of the benchmark circuits. Cell area is shown in µm2 and pin density
(= #cell pins / cell area) in pins/µm2.
                          AES      VGA       DES       JPEG        FFT
#cell pins                63,068   247,015   238,488   1,087,390   2,351,692
cell area     2D          20,964   129,977   102,840   639,677     1,357,493
cell area     T-MI        12,578   77,986    61,704    383,806     814,496
pin density   2D          3.01     1.90      2.32      1.70        1.73
pin density   T-MI        5.01     3.17      3.87      2.83        2.89
A preliminary study reveals that routing congestion is a major problem in T-MI designs.
Since the T-MI cells occupy 40% smaller footprints than the original 2D cells, the overall
chip footprint is reduced by about 40%. Yet, the number of cell pins to connect stays the
same. As shown in Table 29, the pin density of T-MI becomes much higher than that of
2D. For instance, the pin density of the T-MI design for AES is 66% higher than that of
the 2D design. The nets need to be routed within a 40% smaller footprint, which means
increased routing demand per unit area (or routing tile). The additional metal layer on the
bottom tier of T-MI (MB1) can be used only for local interconnects because the MB1 strips
inside cells (internal wires and pins) block cell-to-cell routing. Thus, the routing capacity
(#routing tracks per routing tile, where a routing tile is a tile in the N×N grid for global
routing) of T-MI is almost the same as that of 2D and cannot satisfy the much increased
routing demand. To satisfy the high routing demand, the routing capacity needs to be increased.
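The pin density argument above can be checked with a short calculation. The following is a minimal sketch that recomputes the 2D pin densities from the Table 29 values and derives the T-MI densities under the assumption that every cell footprint shrinks to exactly 60% of its 2D size; the names `pin_density` and `density_increase` are illustrative and not part of the original flow.

```python
# Values copied from Table 29; cell areas are in um^2.
CELL_PINS = {"AES": 63_068, "VGA": 247_015, "DES": 238_488,
             "JPEG": 1_087_390, "FFT": 2_351_692}
CELL_AREA_2D = {"AES": 20_964, "VGA": 129_977, "DES": 102_840,
                "JPEG": 639_677, "FFT": 1_357_493}

# Assumption: every T-MI cell footprint is exactly 60% of its 2D footprint.
TMI_AREA_SCALE = 0.6


def pin_density(num_pins, area_um2):
    """Pin density in pins/um^2 (= #cell pins / cell area)."""
    return num_pins / area_um2


def density_increase(area_scale=TMI_AREA_SCALE):
    """Relative pin density increase when the area shrinks and the pin count is fixed."""
    return 1.0 / area_scale - 1.0


if __name__ == "__main__":
    for name, pins in CELL_PINS.items():
        d_2d = pin_density(pins, CELL_AREA_2D[name])
        d_tmi = pin_density(pins, CELL_AREA_2D[name] * TMI_AREA_SCALE)
        print(f"{name}: 2D {d_2d:.2f}, T-MI {d_tmi:.2f} pins/um^2")
    print(f"density increase: {density_increase():.0%}")
```

With a fixed pin count, a 40% smaller area raises the pin density by a factor of 1/0.6, which is in line with the roughly 66% increase quoted for AES from the rounded table entries.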
Figure 37: Routing congestion map of VGA with (a) 2D and (b) T-MI (1BM). The color scale
indicates routing track shortage, from 0 to 7+. Black X marks show design rule violations
due to routing congestion.
Routing congestion maps of the 2D and the T-MI design for a benchmark circuit are
shown in Figure 37. It is evident that T-MI (= the 1BM case defined in Section 6.3.2) shows
more severe routing congestion than 2D (see footnote 4). Because of metal layer changes and
detours to deal with routing congestion, the timing and power quality of T-MI is also
degraded. In addition, it is observed that the routing congestion becomes more severe with
circuit optimization because the optimizer inserts buffers and breaks a complex cell into a
group of simpler cells to improve timing, which in turn increases pin density considerably.
This routing congestion problem is unique to T-MI technology; it does not happen when
the technology node is scaled down, because local metal dimensions and cells shrink at about
the same rate. It does not happen for G-MI or TSV-based 3D ICs either, because enough
metal layers are available on each tier and the routing demand is satisfied.
To enable high density and high performance designs in T-MI technology, the routing
congestion problem needs to be mitigated. Increasing the footprint of T-MI designs to
reduce routing congestion is not a good idea because this reduces device density. In this
study, two kinds of metal interconnect modifications are considered: (1) adding more metal
layers and (2) reducing metal dimensions.
6.3.2 Impact of Additional Metal Layers
Table 30: Summary of metal layers in the 2D design option. Eight out of ten metal layers
in the Nangate 45nm library are used. Unit is nm.
level          metal layers   width   spacing   thickness
global         2D: M7-8       400     400       800
intermediate   2D: M4-6       140     140       280
local          2D: M2-3       70      70        140
first          2D: M1         70      65        130
Adding more metal layers is an effective way to increase routing capacity and reduce
congestion. The most area-efficient way is to add local metal layers, because of their small
pitch. More investment will be made to allow additional metal layers on the top and/or the
bottom tier of monolithic 3D ICs if there is clear evidence that they improve the design
quality of T-MI significantly. The baseline metal layer dimensions are summarized in Table
30. As shown in Figure 38, three metal layer stack options are considered for T-MI:
• 1BM: This is the baseline T-MI layer stack with 1 bottom tier metal layer.
• 3TM: Three additional (local) metal layers are added to the top tier. As a result,
a total of six local metal layers exist on the top tier.
• 4BM: Three metal layers are added to the bottom tier. As a result, a total of four local
metal layers exist on the bottom tier.
4 The overall over-congestion rate (reported by Encounter, calculated from metal layers with
maximum shortage) is 0.30% for the 2D case and 4.36% for T-MI.
Figure 38: Metal layer stack options. (a) 2D, (b) baseline T-MI, (c) 3 local metal layers
added to the top tier, (d) 3 local metal layers added to the bottom tier. ILD stands for
inter-layer dielectric between the top and the bottom tier. The bottom tier substrate and
ILD for metal layers are not shown for simplicity. Objects are drawn to scale.
Due to manufacturing issues (low thermal budget), the authors of [32] suggest that tungsten
is suitable for the bottom tier metal. However, in this project copper is assumed, because a
copper-based manufacturing process may be developed. Besides, MB1 is mostly used for
short interconnects such as within cells or short nets. In the benchmark circuit M256, the
wirelength of MB1 (for net routing) is only 0.3% of the total wirelength. Thus, the impact
of MB1 material on the timing and power of a whole circuit is minimal. When tungsten is
used, IR-drop on the VDD strips could be an issue, which is outside the scope of this study.
In the 4BM case, as shown in Figure 38(d), the connections from a PMOS on the bottom
tier to an NMOS on the top tier are made through metal and via layers on the bottom tier
(MB1-4, VB1-3) and MIVs, which is called a via stack in this project. The physical size of
a via stack is considerably larger than that of a single MIV. In addition, there could be
metal interconnects surrounding a via stack, which may increase its coupling capacitance.





















Figure 39: Raphael simulation structure for a via stack and its surrounding objects. The
dimensions are shown in µm.
Using Synopsys Raphael, the capacitance of a via stack is extracted. The structure for
the Raphael simulation is shown in Figure 39, where the target via stack is surrounded by
neighboring via stacks and metal wires. The capacitance of a via stack (Cvs) reported by
Raphael is 0.123 fF. The resistance of a via stack (Rvs) is dominated by the resistances
of local vias (VB1-3) and the MIV. From the values in the technology definition file, the
calculated Rvs is 20 Ω, which includes contact resistances.












Figure 40: SPICE netlist of a standard cell: (a) original netlist, (b) with via stack RC. The
dotted line in (a) is the tier boundary, and the values denote internal parasitic resistances
in Ω.
cell to characterize its timing/power behavior. In Figure 40(a), the original SPICE netlist
of a buffer cell with internal parasitic RC is shown. The Cvs and Rvs of via stacks are
inserted at the cut locations as shown in Figure 40(b). Then, Cadence Encounter Library
Characterizer is run to characterize the timing and power of the modified standard cell for
the 4BM case.
Table 31: Comparison of timing and power of a cell with and without via stack RC. The
values are from the timing/power tables of the characterized libraries.
load cap   delay (ps)                          power (fW)
(fF)       without RC   with RC   diff. (%)    without RC   with RC   diff. (%)
0.4        28.4         31.2      9.86         1.15         1.33      15.65
0.8        33.1         35.8      8.16         1.40         1.52      8.57
1.6        42.8         45.4      6.07         1.86         1.98      6.45
3.2        62.4         64.9      4.01         2.81         2.99      6.41
6.4        100.3        103.0     2.69         4.78         4.93      3.14
12.8       175.8        179.9     2.33         8.54         8.74      2.34
25.6       330.0        330.6     0.18         16.17        16.33     0.99
In Table 31, the timing and power of a buffer cell with or without via stack RC are
compared. The delay includes both the cell intrinsic delay and load-dependent delay, and
the power is the cell internal power, excluding wire switching and leakage power. In general,
when the load capacitance of a cell is small, the impact of via stack RC on timing and power
is large; the impact becomes smaller with larger load capacitance. This trend is observed
in most of the cells. If a driving net is very short and has a small load capacitance, the
timing and power of the driver may degrade by about 10% or more (see the 0.4 fF row of
Table 31). Since the timing and power of the circuit also depend on the net delay and net
switching power, the overall degradation of timing and power at the entire circuit level is
lower (about 2-3%), which is still significant.
Thus, via stack RC is incorporated in all of the 4BM-based designs.
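The load-capacitance trend in Table 31 can be reproduced directly from the tabulated delay and power values. The sketch below recomputes the percentage differences; the helper name `pct_increase` is illustrative and is not taken from the characterization flow.

```python
# Rows of Table 31:
# (load cap in fF, delay without/with via stack RC in ps, power without/with RC in fW)
TABLE_31 = [
    (0.4, 28.4, 31.2, 1.15, 1.33),
    (0.8, 33.1, 35.8, 1.40, 1.52),
    (1.6, 42.8, 45.4, 1.86, 1.98),
    (3.2, 62.4, 64.9, 2.81, 2.99),
    (6.4, 100.3, 103.0, 4.78, 4.93),
    (12.8, 175.8, 179.9, 8.54, 8.74),
    (25.6, 330.0, 330.6, 16.17, 16.33),
]


def pct_increase(before, after):
    """Relative increase of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Percentage delay degradation per load capacitance, in table order.
DELAY_DIFFS = [pct_increase(d0, d1) for _, d0, d1, _, _ in TABLE_31]

if __name__ == "__main__":
    for load, d0, d1, p0, p1 in TABLE_31:
        print(f"{load:5.1f} fF: delay +{pct_increase(d0, d1):.2f}%, "
              f"power +{pct_increase(p0, p1):.2f}%")
```

The via stack adds a nearly constant delay of a few picoseconds, so its relative impact shrinks as the load-dependent delay grows.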
For a cell driving a net and the sink cells on the net, the delay (D) is:
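\begin{align}
D_{\mathrm{total}} &= D_{\mathrm{cell}} + D_{\mathrm{net}} \tag{11} \\
D_{\mathrm{cell}} &= D_{\mathrm{intrinsic}} + D_{\mathrm{load\text{-}dependent}} \tag{12} \\
D_{\mathrm{load\text{-}dependent}} &= f_d(C_{\mathrm{load}}, \text{input slew}) \tag{13} \\
C_{\mathrm{load}} &= C_{\mathrm{wire}} + C_{\mathrm{pin}} \tag{14}
\end{align}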
The Dintrinsic is the intrinsic delay of the cell. The Dload−dependent is a function of Cload and
the signal slew at the cell input pin. Compared with 2D designs, wires are shorter in T-MI
designs, which in turn reduces Cwire, Cload, and Dload−dependent. The Dnet also reduces
as wires become shorter. However, the overall delay improvement may not keep up with
wirelength reduction. If Cpin is larger than Cwire, the Cload may not decrease significantly
because Cpin is not reduced. Moreover, Dintrinsic also contributes to Dcell. Thus, depending
on the circuit characteristics and layouts, the delay improvement of T-MI may vary.
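To make the dependence on Cpin and Dintrinsic concrete, the following sketch evaluates Eqs. (11)-(14) for two hypothetical nets. The linear form of `f_d`, its coefficients, and all numeric inputs are assumptions chosen for illustration only; the actual libraries use characterized lookup tables. It is also assumed that Cwire and Dnet both scale linearly with wirelength.

```python
def load_cap(c_wire, c_pin):
    """Eq. (14): total load capacitance (fF)."""
    return c_wire + c_pin


def f_d(c_load, input_slew, k_cap=10.0, k_slew=0.1):
    """Hypothetical linear stand-in for the library delay table of Eq. (13), in ps."""
    return k_cap * c_load + k_slew * input_slew


def total_delay(d_intrinsic, c_wire, c_pin, input_slew, d_net):
    """Eqs. (11)-(12): D_total = D_intrinsic + D_load-dependent + D_net."""
    d_cell = d_intrinsic + f_d(load_cap(c_wire, c_pin), input_slew)
    return d_cell + d_net


def delay_improvement(c_wire, c_pin, d_net, wire_scale=0.8,
                      d_intrinsic=20.0, input_slew=50.0):
    """Fractional delay reduction when wirelength is scaled by `wire_scale`.

    Simplifying assumption: C_wire and D_net scale linearly with wirelength,
    while C_pin, D_intrinsic, and the input slew stay unchanged.
    """
    before = total_delay(d_intrinsic, c_wire, c_pin, input_slew, d_net)
    after = total_delay(d_intrinsic, c_wire * wire_scale, c_pin,
                        input_slew, d_net * wire_scale)
    return 1.0 - after / before


# Two made-up nets with the same total load but different wire/pin split.
wire_dominated = delay_improvement(c_wire=8.0, c_pin=2.0, d_net=4.0)
pin_dominated = delay_improvement(c_wire=2.0, c_pin=8.0, d_net=1.0)
print(f"wire-dominated net: {wire_dominated:.1%} delay reduction")
print(f"pin-dominated net:  {pin_dominated:.1%} delay reduction")
```

In both cases the delay reduction stays below the assumed 20% wirelength reduction, and it is much smaller for the pin-dominated net, which mirrors the qualitative argument in the text.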
Meanwhile, the power consumption (P ) of a cell is:
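\begin{align}
P_{\mathrm{total}} &= P_{\mathrm{internal}} + P_{\mathrm{switching}} + P_{\mathrm{leakage}} \tag{15} \\
P_{\mathrm{internal}} &= f_p(C_{\mathrm{load}}, \text{input slew}) \tag{16} \\
P_{\mathrm{switching}} &\propto \text{switching activity} \times C_{\mathrm{load}} \tag{17}
\end{align}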
The Pinternal is the power consumed for the objects within the cell boundary, which weakly
depends on Cload and the cell input slew. When the input slew is larger, Pinternal increases.
With the standard cell library (based on Nangate 45nm library), Pleakage is usually much
smaller than Pinternal and Pswitching. The Pswitching is proportional to both the switching
activity and Cload. Assuming that the switching activity is the same for 2D and T-MI
designs, the reduction of Cload in T-MI designs is the main reason for the total power
reduction. Note that if (a) Cpin is more dominant than Cwire, or (b) Pinternal is more
dominant than Pswitching, the total power reduction of T-MI designs caused by wirelength
reduction may not be significant.
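Conditions (a) and (b) can be illustrated with a small calculation based on Eqs. (15)-(17). This is a sketch under simplifying assumptions: Pinternal and Pleakage are treated as unchanged, Pswitching is taken as exactly proportional to Cload, and all power values and capacitance shares below are made-up inputs, not results from the benchmark circuits.

```python
def total_power_reduction(p_internal, p_switching, p_leakage,
                          wire_share, wire_cap_reduction):
    """Fractional reduction of P_total when only C_wire shrinks.

    wire_share: fraction of C_load contributed by C_wire (the rest is C_pin).
    wire_cap_reduction: fractional reduction of C_wire (e.g. 0.3 for 30%).
    Assumptions: P_internal and P_leakage stay constant (the weak dependence
    of P_internal on C_load is ignored); P_switching scales with C_load, Eq. (17).
    """
    c_load_scale = 1.0 - wire_share * wire_cap_reduction
    old_total = p_internal + p_switching + p_leakage
    new_total = p_internal + p_switching * c_load_scale + p_leakage
    return 1.0 - new_total / old_total


# Hypothetical cases, each with a 30% reduction of wire capacitance.
favourable = total_power_reduction(3.0, 6.0, 1.0, wire_share=0.8,
                                   wire_cap_reduction=0.3)
pin_dominant = total_power_reduction(3.0, 6.0, 1.0, wire_share=0.2,
                                     wire_cap_reduction=0.3)      # case (a)
internal_dominant = total_power_reduction(6.0, 3.0, 1.0, wire_share=0.8,
                                          wire_cap_reduction=0.3)  # case (b)

for label, value in [("wire- and switching-dominated", favourable),
                     ("(a) pin-dominated load", pin_dominant),
                     ("(b) internal-power-dominated", internal_dominant)]:
    print(f"{label}: {value:.1%} total power reduction")
```

Under these assumptions, the same wire capacitance reduction yields a much smaller total power reduction in cases (a) and (b) than in the wire- and switching-dominated case.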
The design and analysis results for 2D and T-MI design options are summarized in
Table 32 (see footnote 5). Placement utilization of all designs is 70%. Compared with 2D
designs, the footprints of T-MI designs are 40% smaller, while the total silicon areas are 20%
larger because two tiers are used. Compared with 2D, the total wirelength and clock
wirelength of all three T-MI design types are reduced by about 20%. The total number of
MIVs used in routing is about the same for 1BM and 3TM, while 4BM utilizes considerably
more MIVs because the bottom tier metals are highly utilized for routing.
The timing improvement of 3TM is the best among the T-MI design types. For the
largest circuit (FFT), the longest path delay improvement of 3TM over 2D is 39.7%. Note
that this timing improvement can be used towards power reduction during the timing/power
optimization; for the same target clock speed, 3TM may use more power-efficient (slower)
cells to reduce power. However, the total power reduction of T-MI designs is less significant
than timing improvement. The power reduction of T-MI designs over 2D design is mostly
from reduced wire power. However, wire power is only a small fraction of the total power.
For instance, the wire power of JPEG for 3TM is 39.2mW , which is only 13.2% of the total
power. Depending on the quality of Encounter clock tree synthesis (CTS) results, the clock
tree power may decrease. It is observed that CTS usually produces the best results for 3TM
among T-MI designs, because the CTS quality is related to the routing quality. The timing
and power of 4BM designs are generally worse than 1BM and 3TM designs mainly because
of the RC effect of via stacks inside cells.
6.3.3 Impact of Reduced Metal Dimensions
Another interconnect modification option to mitigate the routing congestion problem is to
reduce the width, spacing, and thickness of metal layers. The local metal width/spacing
is close to the minimum feature size of the technology node. However, if scaling down
the metal dimensions brings large benefits in design quality, process engineers are willing
to invest efforts towards it. Thus, the purpose of this metal dimension reduction study
is to explore the interconnect design space for maximizing the benefit of T-MI; extreme
scalings (> 20%) may not be manufacturable with the technology node due to lithography
limitations, chemical mechanical polishing issues, etc. For all T-MI cases (1BM, 3TM, and
4BM), the minimum metal width, spacing, and thickness of all metal layers are reduced
by up to 40% in 10% steps. The diameters of vias and MIVs are also reduced to match the
corresponding metal layers. The reduced metal width/spacing are summarized in Table 33.
Note that to keep the aspect ratio, the thickness of metal layers is also reduced, which is
not shown in Table 33. For each reduced metal dimension setting, the interconnect-related
libraries such as the capacitance table are rebuilt. Note that cell internal wires are not modified.
5 For fair comparisons between 2D and T-MI, supplemental simulations have been performed
with 3 more metal layers for 2D (we call it 2D+M). It was found that although the additional
metal layers slightly improved the design metrics of 2D, the improvement of T-MI over 2D+M
was still significant.
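The width/spacing values of Table 33 follow from scaling the baseline dimensions of Table 30 by the reduction ratio. The sketch below regenerates them using integer round-half-up arithmetic; the rounding rule is an assumption that happens to reproduce the tabulated values, and the names `scaled_nm` and `width_spacing_table` are illustrative.

```python
# Baseline (width, spacing) in nm per metal level, taken from Table 30.
BASE_NM = {
    "global": (400, 400),
    "intermediate": (140, 140),
    "local": (70, 70),
    "first": (70, 65),
}


def scaled_nm(value_nm, reduction_pct):
    """Scale a dimension by (100 - reduction_pct)% with integer round-half-up.

    Integer arithmetic avoids floating-point surprises at exact .5 cases
    (e.g. 65 nm reduced by 10% is 58.5 nm, which is rounded up to 59 nm).
    """
    return (value_nm * (100 - reduction_pct) + 50) // 100


def width_spacing_table(reductions=(0, 10, 20, 30, 40)):
    """Return {level: [(width, spacing) for each reduction ratio]} in nm."""
    return {
        level: [(scaled_nm(w, r), scaled_nm(s, r)) for r in reductions]
        for level, (w, s) in BASE_NM.items()
    }


if __name__ == "__main__":
    for level, row in width_spacing_table().items():
        print(level, " ".join(f"{w}/{s}" for w, s in row))
```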
The unit length resistance and capacitance of local metal layers with reduced metal
dimensions are summarized in Table 34. As the width and thickness of a metal layer
decrease, the unit length resistance of the metal layer increases. In contrast, the unit
length capacitance of the metal layer does not change much. Note that depending on the
surrounding wires, the unit length capacitance changes significantly (Chigh vs. Clow), mainly
due to the difference in coupling capacitance. With reduced metal dimensions, more routing
tracks are available. Thus, the router has a better chance of improving timing by carefully
routing metal wires to reduce coupling capacitance. However, if the reduction ratio is too
high, the metal resistance may increase the net delay and signal slew considerably.
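The resistance row of Table 34 is consistent with ideal geometric scaling. Since the unit length resistance is inversely proportional to the wire cross-section (width × thickness) and both dimensions shrink by the same ratio, R grows with the inverse square of the scale factor. The sketch below checks this against the tabulated values; it assumes a constant resistivity and ignores any size-dependent effects, so it is only a first-order model.

```python
# Unit length resistance of local metal (ohm/um) from Table 34.
TABLE_34_R = {0: 3.57, 10: 4.41, 20: 5.59, 30: 7.29, 40: 9.93}


def unit_resistance(reduction_pct, r0=TABLE_34_R[0]):
    """First-order R per unit length when width and thickness both shrink.

    Assumption: constant resistivity, so R scales as 1 / (width * thickness).
    """
    scale = 1.0 - reduction_pct / 100.0
    return r0 / (scale * scale)


if __name__ == "__main__":
    for pct, tabulated in TABLE_34_R.items():
        model = unit_resistance(pct)
        print(f"{pct:2d}%: model {model:.2f}, Table 34 {tabulated:.2f} ohm/um")
```

Under this model the computed values agree with Table 34 to within about 0.02 Ω/µm, which shows that a 40% dimension reduction nearly triples the unit length resistance.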
Various design metrics of the JPEG circuit with varied metal dimension reduction ratio
are shown in Figure 41. The wirelength generally reduces as metal dimensions reduce,
because of less routing congestion and detour. The number of clock buffers generally increases
slowly when the reduction ratio increases. The reason is that as the metal dimensions
decrease, the metal unit length RC increases, and the clock signal slew degrades. To meet the
clock skew/slew specifications, the CTS engine inserts more buffers. For the longest path
delay (LPD), the sweet spot of 1BM and 4BM cases is at the 30% reduction, while that of
3TM is 10%. Moreover, the LPD improvement of 4BM at the sweet spot over the default
setting (=0% reduction) is larger than that of the 1BM and 3TM cases. The wire power
generally decreases with the reduced metal dimensions. However, it is observed that the cell
internal power increases, which is also related to the signal slew degradation with reduced
metal dimensions. As a result, the total power of 3TM and 4BM is minimum when the
reduction ratio is 30%.
Figure 41: Various results of JPEG with reduced metal dimensions.
The total wirelength, longest path delay, and total power of the other benchmark circuits
are shown in Table 35. For total wirelength, the same trend as with JPEG is observed.
The maximum wirelength reduction is 27.7% for AES with 3TM and 40% reduced metal
dimensions. However, depending on the circuit characteristics, reducing metal dimensions
may not translate to longest path delay reduction (see VGA and FFT results). In general,
3TM provides the most power improvement over 2D designs. It is observed that the
maximum power reduction is 9.7% with 3TM and 40% reduced metal dimensions for the FFT
circuit. Note that depending on the benchmark circuit, the sweet spot changes.
From the simulation results in this section, the conclusion is that 3TM (=T-MI with
3 additional metal layers on the top tier) is the best option for T-MI. The reduced metal
dimensions may further improve the design quality; however, considering the increased cost
and difficulties for manufacturing, it may not be a good option. Thus, in the following
sections, 3TM without metal dimension reduction is considered.
6.4 Power Benefit Study
In this section, the power benefit of T-MI is studied. Iso-performance comparisons are
performed: under the same target clock period, the timing is closed for all design options
and the power consumption is compared.
6.4.1 Benchmark Circuits and Synthesis Results
The benchmark circuits and synthesis results are summarized in Table 36. The FPU is a
double precision floating point unit. The AES and the DES are encryption engines. The
LDPC is a low-density parity-check engine for the IEEE 802.3an standard. The M256 is
a simple partial-sum-add-based 256-bit integer multiplier. The circuits vary in size.
Synopsys Design Compiler (ver. F-2011.09) is used for synthesis. The synthesis results are
from the 2D designs. All synthesized designs (2D and T-MI) met target clock periods.
6.4.2 Layout Simulation Results
The layout simulation results are summarized in Table 37. The GDSII layouts of the
timing-closed, routing-completed AES design are shown in Figure 42. With T-MI, the footprint
reduces by 40.9-43.4%, which is larger than the cell footprint reduction rate, 40%. With
T-MI, timing is better because of shorter wirelengths, and the optimizer may downsize
cells and use fewer buffers while still meeting the target clock period. Thus,
the footprint of the whole T-MI design could be reduced further than the individual cell
footprint reduction rate. With T-MI, total wirelength reduces by 21.5-33.6%. Depending






Figure 42: The placement and routing snapshots of AES designs. The figures reflect the
relative sizes of 2D vs. T-MI designs.
circuit with a larger wirelength reduction rate tends to show a larger power reduction rate.
All designs met the timing. The power reduction was the largest in LDPC, 32.1%, whereas
in DES, it was only 4.1%. In LDPC, the net power is much larger than the cell power, thus a
large net power reduction with T-MI leads to a large total power reduction. In addition, it
is observed that with T-MI, not only net power but also cell power reduces; with better
timing, cells are downsized and fewer buffers are used, which reduces cell power.
The detailed layout simulation results are shown in Table 38, which supplements Table
37. The final utilization (after all optimizations) is set to around 80%, which is a common








Figure 43: Snapshots of routing results for LDPC and DES.
Figure 43(a)), the target utilization was lowered to about 33%; the 2D design was barely
routable with this setting. Also, significant wire congestions were observed in M256, thus
the target utilization was lowered to 68%. All designs met the timing (WNS≥0).
6.4.3 Circuit Characteristics Study
As shown in Table 37, LDPC and DES showed very different power reduction rates with
T-MI. Contrasting these two designs helps explain for which kinds of circuits T-MI provides
a large power benefit. With T-MI, the buffer count reduces by 48.6% (in LDPC) vs. 3.2% (in
DES), total wirelength reduces by 33.6% vs. 21.5%, total power reduces by 32.1% vs. 4.1%,
cell power reduces by 12.8% vs. 1.6%, and net power reduces by 39.2% vs. 7.7%. Compared
with LDPC, the buffer count reduction for DES is very small, which leads to very small cell
power reduction. Although the wirelength reduction in DES is not so small, the net power
reduction rate is significantly smaller than that of LDPC. The net capacitance/power consists
of wire and (cell input) pin parts. For most nets in DES, wires are very short (see footnote 6).
This difference is also observed in Figure 43. In the DES layout, there are many small regions
where cells are tightly connected inside but not so much to the outside. For these short nets,
pin capacitances dominate wire capacitances, thus reducing wirelength does not reduce net
power as much. Although these two circuits are similar in size (#cells, nets) and average
fanout, because of the inherent difference in circuit characteristics, the power benefit of T-MI
differs considerably.
6 The average wirelengths of DES-2D and LDPC-2D are 10.5µm and 72.0µm, respectively.
Net power is broken into wire and pin power components (net = wire + pin). Wire
means metal wires and vias used for routing outside cells, and pin means input pins of
cells. As shown in Table 39, in LDPC, wire cap is much larger than pin cap, and so is wire
power. Most of the net power reduction is from reduced wirelengths, as seen by the wire
power reduction. In contrast, in DES, pin cap is much larger than wire cap. Thus, the reduced
wirelengths and wire power remove only a small portion of the net power.
6.4.4 Impact of Target Clock Period
Figure 44: Power reduction rate (T-MI over 2D) under various target clock periods for
(a) AES and (b) M256.
The power benefit of T-MI also depends on the target clock period. For AES and
M256, the target clock period is varied and full designs are performed, from synthesis to
layout optimizations. The power reduction rate is shown in Figure 44. The trend is clear;
when the target clock is faster, the power benefit of T-MI becomes larger. This is because
at faster clock speeds, the timing of the 2D design becomes harder to meet than that of T-MI,
because of longer wires. The optimization engine uses more buffers and larger cells, leading
to a steep increase in cell power. Thus, the cell power reduction rate increases noticeably as
the clock becomes faster. With faster clock speeds, core footprint and wirelengths also become





































Figure 45: Layer structures of (a) G-MI and (b) TSV-3D ICs. For simplicity, in (b), only
the top metal layer of the bottom tier is shown.
6.5 Comparison with G-MI and TSV-based 3D
In this section, the design quality of T-MI designs is compared with G-MI and TSV-based
3D designs (TSV-3D). The layer structures of G-MI and TSV-3D are shown in Figure
45. Note that two layers are assumed for G-MI and TSV-3D designs. For G-MI designs,
six metal layers are used on the bottom tier and eight on the top. The reason why only six
metal layers are used on the bottom tier is that the MIV pitch is determined by the top
metal pitch on the bottom tier. If all eight metal layers are used, because the minimum
pitch of metal 8 wires is large, the MIV density becomes low. For TSV-3D designs,
eight metal layers are used on both top and bottom tiers, because TSVs are large. The
diameter and height of the TSV are 3µm and 30µm, respectively. Based on the physical
assumptions such as TSV oxide liner thickness and doping concentration, using the parasitic
RC models for TSVs [60], the resistance and capacitance of the TSVs are determined to be
1 Ω and 31.1 fF, respectively.
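To put the TSV capacitance in perspective, it can be expressed as an equivalent length of local metal wire. The sketch below uses the TSV capacitance from this section, the via stack capacitance from Section 6.3.2, and the Chigh value at 0% reduction from Table 34. This comparison is an illustration added here and is not part of the original analysis; in particular, the via stack value is from the T-MI 4BM case and is used only as an order-of-magnitude reference for a monolithic inter-tier connection.

```python
C_TSV_FF = 31.1            # TSV capacitance from this section (fF)
C_VIA_STACK_FF = 0.123     # via stack capacitance from Section 6.3.2 (fF)
C_WIRE_FF_PER_UM = 0.163   # C_high of local metal at 0% reduction, Table 34


def equivalent_wire_length_um(cap_ff, c_per_um=C_WIRE_FF_PER_UM):
    """Length of local wire (um) whose capacitance equals `cap_ff`."""
    return cap_ff / c_per_um


tsv_equiv_um = equivalent_wire_length_um(C_TSV_FF)
via_stack_equiv_um = equivalent_wire_length_um(C_VIA_STACK_FF)

print(f"TSV       ~ {tsv_equiv_um:.0f} um of local wire")
print(f"via stack ~ {via_stack_equiv_um:.2f} um of local wire")
print(f"capacitance ratio TSV / via stack: {C_TSV_FF / C_VIA_STACK_FF:.0f}x")
```

Under these assumptions a single TSV loads a net about as much as a couple of hundred micrometers of worst-case local wiring, whereas a via stack is comparable to less than one micrometer.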












Figure 46: Design and analysis flow for G-MI and TSV-3D ICs (steps include initial 3D STA
and timing constraint generation, pre-route optimization, and 3D STA and timing constraint
generation).
The design flows of this project for G-MI and TSV-3D ICs are summarized in Figure 46.
Since today’s commercial EDA tools cannot handle multiple dies together, the in-house 3D
partitioner/placer [25] and timing-constraint-based iterative optimization method [72] are
used. After the synthesis, circuit partitioning is performed (see footnote 7). The gates are
placed on Die 0/1 and MIVs/TSVs on Die 0 (= top tier), followed by a 3D STA to generate
the timing constraints on the die boundary ports (MIVs or TSVs). Then, for each die,
pre-route optimizations are performed, followed by a 3D STA and timing constraint
generation. As suggested in [72], several iterations of optimizations are performed to improve
timing. After routing, post-route optimizations are performed in multiple iterations. Lastly,
the final 3D





Figure 47: Examples of limitations in die-by-die optimizations: (a) buffer pair to inverter
pair, (b) AND to NAND and an inverter, and (c) gate cloning.
7 As suggested in [25], XY/Z-cut sequences are varied to find the best layout results in terms
of final timing and power.
The most serious problem with die-by-die optimizations is the optimization quality. As
shown in Figure 47, die-by-die optimizations cannot perform many effective optimizations.
The main reasons are that (1) the optimization engine cannot see the whole path, (2) it is
not allowed to violate the logic equivalence at die boundary ports (MIVs or TSVs), (3) it is
not allowed to move gates across the die boundary, and (4) it is not allowed to add/remove
die boundary ports. In Figure 47(a), Encounter cannot convert the buffer on Die 0 to an
inverter because it would violate the logic equivalence check at the die boundary port.
Although an inverter pair produces lower delay than a buffer pair, Encounter cannot perform
this conversion. Also, in Figure 47(b), when the net driven by the AND gate is long, breaking
the AND gate into a NAND and an inverter and placing them apart may reduce the delay.
However, due to the logic equivalence check, it is not possible. In Figure 47(c), when
the net driven by the AND gate is a high-fanout net, gate cloning helps reduce the delay.
However, since it is not allowed to move gates across the die boundary, it is not possible.
Although not shown in Figure 47, there are other optimizations that are not possible with
die-by-die optimizations. In addition, the timing-constraint-based die-by-die optimization
tends to use more buffers/inverters than necessary [74]. These limitations in optimizations
degrade the timing and power of G-MI and TSV-3D designs.
6.5.2 Layout Simulation Results
The detailed layout simulation results for G-MI and TSV-3D designs are shown in Table
40. The footprints are determined so that each design is routable. Note that for TSV-3D
cases, the footprints need to be increased significantly to accommodate TSVs. Comparing
G-MI and TSV-3D results, it is clear that in all aspects (wirelength, #buffers, timing, and
power) G-MI is better than TSV-3D. This is mainly because MIVs are much smaller than
TSVs in terms of physical dimensions and RC parasitics.
Comparing the G-MI and TSV-3D results with the T-MI results in Table 38, it is observed
that the design quality of G-MI and TSV-3D is worse than that of T-MI. Possible
reasons for this trend are: (1) Placement quality of the 3D placer is not as good as that of
the commercial 2D EDA tool. Note that the wirelength of G-MI is much longer than that of
T-MI. (2) As mentioned in Section 6.5.1, layout optimization quality in the G-MI and TSV-3D
design flow is not as good as in the T-MI or 2D design flow. Note that for many cases, the
timing could not be closed. Especially when there are many long 3D nets, the timing of
G-MI or TSV-3D became worse than that of T-MI or 2D. These two reasons support the
claim that T-MI produces better designs than G-MI or TSV-3D. In addition, for G-MI or
TSV-based 3D designs, true 3D placement and optimization engines are needed that can
handle multiple dies together.
6.6 Summary
In this chapter, the benefits and challenges of monolithic 3D IC technology were investi-
gated. It was demonstrated that monolithic 3D technology provides various benefits over
traditional 2D technology. Routing congestion issues were identified that may hinder the
benefit of monolithic 3D technology and several interconnect options to overcome the prob-
lem were investigated.
In transistor-level monolithic 3D ICs, reduced footprints lead to shorter wirelengths,
better performance, and lower power consumption. With carefully designed transistor-
level monolithic 3D cells, layout simulations were performed for the benchmark circuits
and up to 32.1% total power reduction was demonstrated. In contrast, because of the
limitations in 3D net optimizations, gate-level monolithic 3D and TSV-based 3D designs

Table 33: Minimum width/spacing of metal layers with varied metal dimension reduction
ratio. First metal means the lowest metal layer of the top/bottom tier. Unit is nm.
reduction ratio (%) 0 10 20 30 40
global 400/400 360/360 320/320 280/280 240/240
intermediate 140/140 126/126 112/112 98/98 84/84
local 70/70 63/63 56/56 49/49 42/42
first 70/65 63/59 56/52 49/46 42/39
Table 34: Unit length resistance and capacitance of local metals with varied metal dimension
reduction ratio. Chigh and Clow are the maximum and minimum total wire capacitance per
unit length, depending on the surrounding wires.
reduction ratio (%)   0       10      20      30      40
R (Ω/µm)              3.57    4.41    5.59    7.29    9.93
Chigh (fF/µm)         0.163   0.175   0.153   0.166   0.173
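The resistance row of Table 34 can be compared against a simple scaling rule. The Python sketch below rests on an assumption that is not stated in the source: wire width and thickness both shrink by the reduction ratio while resistivity stays constant, so resistance per unit length grows as 1/(1 - ratio)^2. The function name and the variable names are illustrative only.

```python
# Tabulated local-metal resistance from Table 34 (ohm per micrometer).
reduction_pct = [0, 10, 20, 30, 40]
r_table = [3.57, 4.41, 5.59, 7.29, 9.93]


def scaled_resistance(r0: float, pct: float) -> float:
    """Resistance per unit length after shrinking width and thickness by pct percent.

    Assumption (not from the source): constant resistivity, so R ~ 1 / (w * t).
    """
    scale = 1.0 - pct / 100.0
    return r0 / (scale * scale)


# Predict every column from the 0% entry and compare with the table.
predicted = [scaled_resistance(r_table[0], p) for p in reduction_pct]
rel_error = [abs(p - t) / t for p, t in zip(predicted, r_table)]

for pct, p, t, e in zip(reduction_pct, predicted, r_table, rel_error):
    print(f"{pct:>2}%: predicted {p:.2f}, tabulated {t:.2f}, relative error {e:.2%}")
```

Under this assumption the predictions agree with the tabulated values to within a fraction of a percent, which suggests the resistance increase is mostly a cross-section effect. The small residual may come from effects the rule ignores.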









































































































































































































































































































































































































































































































































































































































































































































































































































































































































Table 36: Benchmark circuits and synthesis results.
                           FPU      AES      LDPC     DES      M256
target clock period (ns)   1.8      0.8      2.4      1.0      2.4
#cells                     9,694    13,891   38,289   51,162   202,877
cell area (µm²)            19,123   16,756   60,590   85,526   293,636
#nets                      11,345   14,218   44,153   54,724   222,569
average fanout             2.35     2.40     2.38     2.33     2.23
Table 37: Summary of layout results. The values represent the percentage difference of
T-MI over 2D.
circuit   footprint   total      power
name                  wirelen.   total    cell     net      leakage
FPU       -41.7%      -26.3%     -14.5%   -9.4%    -19.5%   -11.1%
AES       -42.4%      -23.6%     -10.9%   -7.6%    -13.9%   -9.5%
LDPC      -43.2%      -33.6%     -32.1%   -12.8%   -39.2%   -21.7%
DES       -40.9%      -21.5%     -4.1%    -1.6%    -7.7%    -1.4%














































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































































Table 39: Wire vs. pin capacitance breakdown of LDPC and DES in the 45 nm node. The
values are for the entire circuit.
design     total cap. (pF)     power (mW)
           wire      pin       wire     pin
LDPC-2D    558.0     134.4     30.73    9.04
LDPC-3D    310.3     123.6     15.88    8.32
DES-2D     64.4      127.4     8.88     17.80
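To make the breakdown in Table 39 easier to read, the following Python sketch computes the share of total capacitance contributed by wires for each listed design. It uses only the values shown in the table; the dictionary layout and the helper name `wire_share` are illustrative choices, not part of the original work.

```python
# Values transcribed from Table 39 (entire-circuit totals, in pF).
breakdown = {
    "LDPC-2D": {"wire_cap_pf": 558.0, "pin_cap_pf": 134.4},
    "LDPC-3D": {"wire_cap_pf": 310.3, "pin_cap_pf": 123.6},
    "DES-2D": {"wire_cap_pf": 64.4, "pin_cap_pf": 127.4},
}


def wire_share(entry: dict) -> float:
    """Fraction of the total (wire + pin) capacitance that comes from wires."""
    total = entry["wire_cap_pf"] + entry["pin_cap_pf"]
    return entry["wire_cap_pf"] / total


shares = {name: wire_share(entry) for name, entry in breakdown.items()}

for name, share in shares.items():
    print(f"{name}: wire share of capacitance = {share:.1%}")
```

By this measure, wires contribute roughly four fifths of the capacitance in LDPC-2D but only about one third in DES-2D. This is consistent with the much larger net power reduction reported for LDPC than for DES in Table 37, since shortening wires helps most where wire capacitance dominates.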


































































































































































































































































































































































































































































































































































































































































































































































































As demonstrated in this dissertation and other works, 3D ICs provide significant benefits
over traditional 2D ICs in important metrics such as footprint, wirelength, timing, power,
and so on. Currently, industry is taking slow steps towards 3D ICs because of various
issues such as manufacturing cost, yield, logistics, and lack of standards. However, with the
physical limits in devices and interconnects approaching fast, industry will eventually move
towards 3D IC technologies. To successfully adopt 3D IC technologies, it is essential (1)
to study the benefits of 3D IC designs based on today’s and future technology settings, as
well as (2) to develop the design methodologies for 3D ICs that resolve reliability problems
(thermal, power delivery, etc.) and optimize design quality (timing, power consumption,
etc.). Towards these objectives, the following four projects have been presented in this
dissertation:
• A co-optimization method for signal, power, and thermal interconnects.
• A study on the impact of partition styles on the design quality of a multi-core processor.
• A slew-aware buffer insertion algorithm that minimizes delay by accounting for the effect
of slew on delay.
• A study of interconnect options and power benefits for ultra-high-density monolithic 3D
ICs.
The proposed co-optimization method for signal, power, and thermal interconnects provides
a quick and reasonably accurate design space exploration in early design stages so that
designers can make informed decisions on power delivery and thermal interconnects
(T-TSVs and MFCs). The response models can be reused for multiple optimization scenarios
to facilitate early-design-stage decisions. The congestion of interconnects (signal TSVs,
P/G TSVs, T-TSVs, and MFCs) affects how much trade-off the input factors provide.
For instance, when the target design was less congested, MFC width or T-TSV ratio did not
affect total wirelength much, and the optimization always favored maximizing MFC width
or T-TSV ratio. One major limitation of this method is that the responses must change in
predictable ways with respect to changes in the input factors. If a response (e.g., longest
path delay) changes abruptly with small changes in the input factors, the method cannot
find a reasonably accurate model and the optimization fails to find a good solution. As
follow-up work, it would be worthwhile to compare the proposed method against iterative,
multi-step (e.g., power-thermal-signal) optimization approaches.
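The limitation regarding abrupt responses can be illustrated with a small Python sketch. It is not the dissertation's implementation: it assumes a single input factor coded to the levels -1, 0, and +1, fits a quadratic response model through those three points, and compares the model error for a smooth response and for a step-like one. Both response functions and all names are hypothetical.

```python
def fit_quadratic_three_level(y_low: float, y_mid: float, y_high: float):
    """Coefficients (b0, b1, b2) of y = b0 + b1*x + b2*x**2 through x = -1, 0, +1."""
    b0 = y_mid
    b1 = (y_high - y_low) / 2.0
    b2 = (y_high + y_low) / 2.0 - y_mid
    return b0, b1, b2


def predict(coeffs, x: float) -> float:
    """Evaluate the fitted quadratic response model at coded factor value x."""
    b0, b1, b2 = coeffs
    return b0 + b1 * x + b2 * x * x


def max_model_error(response, coeffs, samples: int = 41) -> float:
    """Largest absolute gap between the true response and the model on [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(response(x) - predict(coeffs, x)) for x in xs)


def smooth_response(x: float) -> float:
    """Hypothetical response that varies predictably with the factor."""
    return 10.0 + 2.0 * x + 1.5 * x * x


def abrupt_response(x: float) -> float:
    """Hypothetical response that jumps when the factor crosses 0.5."""
    return 10.0 if x < 0.5 else 14.0


smooth_fit = fit_quadratic_three_level(
    smooth_response(-1.0), smooth_response(0.0), smooth_response(1.0))
abrupt_fit = fit_quadratic_three_level(
    abrupt_response(-1.0), abrupt_response(0.0), abrupt_response(1.0))

smooth_err = max_model_error(smooth_response, smooth_fit)
abrupt_err = max_model_error(abrupt_response, abrupt_fit)
print(f"smooth response: max model error = {smooth_err:.3f}")
print(f"abrupt response: max model error = {abrupt_err:.3f}")
```

The quadratic model reproduces the smooth response essentially exactly, but it misses the step by a large margin. An optimizer working on the fitted model would therefore be steered by values the real design never produces.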
In the partition style study, it was found that the 3D partition style greatly affects
design quality. Note that the optimal partition style may differ depending on the target
circuit characteristics (circuit size, number/size of macro blocks, connectivity, etc.) and
the technology setup (technology node, TSV dimensions, die bonding style, etc.). It is also
worth noting that the impact of TSV parasitics on design quality is significant.
With today’s TSV technologies, it may not be a good idea to use too many TSVs, simply
because TSVs are physically and electrically very large. One limitation of this study is
that the design flow for each partition style could not exploit the full benefit of 3D ICs.
Although it is possible to perform timing optimizations using existing 2D EDA tools with
timing constraints on the die boundary ports, the 2D EDA tools do not capture the whole
3D design, so various powerful optimization techniques cannot be applied. Another
limitation is that the 3D floorplanning was performed manually, and once the TSVs were
placed, they could not be moved, added, or deleted. Future research addressing
these limitations would be practical and valuable.
The proposed slew-aware buffer insertion algorithm was able to improve critical path delays
compared with an existing non-slew-aware buffer insertion algorithm as well as
timing-constraint-based optimizations by a 2D EDA tool. With its various tuning parameters,
the algorithm can flexibly trade off solution quality against runtime. One limitation of the
proposed algorithm is that the slew-aware algorithm occasionally finds a worse solution than
the non-slew-aware algorithm; the slew-aware pruning should be made more intelligent to
address this sub-optimality. Another limitation is that the algorithm does not consider
signal-integrity (SI)-induced delay. Because SI-induced delay may degrade buffering
solutions, SI-delay-aware buffer insertion would be a good follow-up work. In addition,
it would be interesting to apply the algorithm multiple times and see how much further
improvement is possible, since after the first application of the proposed algorithm,
paths that were not critical at first may emerge as new critical paths.
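The remark on pruning can be made concrete with a generic dominance test. The Python sketch below is not the dissertation's pruning rule. It assumes that each partial buffering solution at a node is summarized by a hypothetical triple of downstream capacitance, required arrival time, and worst downstream slew, and it discards a candidate only when another candidate is at least as good in all three metrics.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical summary of one partial buffering solution at a node."""
    cap: float   # downstream capacitance (fF); smaller is better
    rat: float   # required arrival time (ps); larger is better
    slew: float  # worst downstream slew (ps); smaller is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in every metric and better in at least one."""
    no_worse = a.cap <= b.cap and a.rat >= b.rat and a.slew <= b.slew
    better = a.cap < b.cap or a.rat > b.rat or a.slew < b.slew
    return no_worse and better


def prune(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [
    Candidate(cap=10.0, rat=100.0, slew=30.0),
    Candidate(cap=12.0, rat=95.0, slew=35.0),   # worse than the first in all metrics
    Candidate(cap=8.0, rat=90.0, slew=40.0),    # survives: smallest capacitance
    Candidate(cap=10.0, rat=98.0, slew=20.0),   # survives only because of its slew
]
kept = prune(candidates)
for c in kept:
    print(c)
```

In this example the last candidate would be discarded by a rule that compares only capacitance and required arrival time, yet its sharper slew could lead to a better solution upstream. Tracking slew therefore keeps more candidates alive, which is one plausible source of the quality versus runtime trade-off mentioned above.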
To enable ultra-high-density transistor-level monolithic 3D ICs, it is necessary to modify
interconnect structures to satisfy the increased routing demand. Based on the CEA/LETI
monolithic 3D fabrication technology, it was demonstrated that transistor-level monolithic
3D ICs provide significant benefits over traditional 2D ICs. One limitation of this project
is that the device model used for 3D cell characterization is not based on the monolithic
3D technology. The device model used in this project is planar bulk silicon [73], whereas,
given the monolithic 3D IC structure, the device characteristics would be closer to those of
silicon-on-insulator devices. Another limitation is the inaccuracy of parasitic RC extraction
for monolithic 3D cells: today’s 2D EDA tools cannot handle multiple device layers together
during RC extraction. As follow-up work, the remaining challenges in monolithic 3D ICs
could be studied, such as the power delivery (IR-drop) problem and the thermal impact on
design quality. In addition, to facilitate the adoption of transistor-level monolithic 3D
technology, the cost-effectiveness of the technology needs to be evaluated. Splitting PMOS
and NMOS into two layers requires more masks. Furthermore, resolving routing congestion
requires more metal layers, which further increases the mask cost. Appropriate cost
modeling needs to be supported by foundry data; therefore, it is advised that industry and
academia cooperate to weigh the cost against the benefit.
REFERENCES
[1] International Technology Roadmap for Semiconductors, “ITRS 2011 Edi-
tion.”
[2] Wong, E. and Lim, S., “3D Floorplanning with Thermal Vias,” in Proc. Design,
Automation and Test in Europe, vol. 1, pp. 1–6, Mar. 2006.
[3] Goplen, B. and Sapatnekar, S., “Thermal Via Placement in 3D ICs,” in Proc. Int.
Symp. on Physical Design, pp. 167–174, Apr. 2005.
[4] Cong, J. and Zhang, Y., “Thermal-driven multilevel routing for 3-D ICs,” in Proc.
Asia and South Pacific Design Automation Conf., vol. 1, pp. 121–126, Jan. 2005.
[5] Tuckerman, D. B. and Pease, R. F. W., “High-performance heat sinking for VLSI,”
IEEE Electron Device Letters, vol. 2, pp. 126–129, 1981.
[6] Sekar, D., King, C., Dang, B., Spencer, T., Thacker, H., Joseph, P., Bakir,
M., and Meindl, J., “A 3D-IC Technology with Integrated Microchannel Cooling,”
in Proc. IEEE Int. Interconnect Technology Conference, 2008.
[7] Bakir, M., Dang, B., and Meindl, J., “Revolutionary nanosilicon ancillary tech-
nologies for ultimate-performance gigascale systems,” in Proc. IEEE Custom Integrated
Circuits Conf., pp. 421–428, 2007.
[8] Kim, Y. J., Joshi, Y. K., Fedorov, A. G., Lee, Y.-J., and Lim, S.-K., “Thermal
Characterization of Interlayer Microfluidic Cooling of Three-Dimensional Integrated
Circuits With Nonuniform Heat Flux,” Journal of Heat Transfer, vol. 132, pp. 214–
219, Apr. 2010.
[9] Huang, G., Sekar, D. C., Naeemi, A., Shakeri, K., and Meindl, J. D., “Com-
pact Physical Models for Power Supply Noise and Chip/Package Co-Design of Gigascale
Integration,” in IEEE Electronic Components and Technology Conf., pp. 1659–1666,
2007.
[10] Kernighan, B. W. and Lin, S., “An Efficient Heuristic Procedure for Partitioning
Graphs,” Bell System Technical Journal, vol. 49, pp. 291–307, 1970.
[11] Fiduccia, C. M. and Mattheyses, R. M., “A Linear-Time Heuristic for Improving
Network Partitions,” in Proc. ACM Design Automation Conf., pp. 175–181, 1982.
[12] Otten, R. H., “Automatic Floorplan Design,” in Proc. ACM Design Automation
Conf., pp. 261–267, 1982.
[13] Stockmeyer, L., “Optimal Orientation of Cells in Slicing Floorplan Designs,” Infor-
mation and Control, vol. 57, pp. 91–101, 1983.
[14] Wong, D. F. and Liu, C. L., “A New Algorithm for Floorplan Design,” in Proc.
ACM Design Automation Conf., pp. 101–107, 1986.
[15] Black, B., Annavaram, M., Brekelbaum, N., DeVale, J., Jiang, L., Loh,
G. H., McCauley, D., Morrow, P., Nelson, D. W., Pantuso, D., Reed, P.,
Rupley, J., Shankar, S., Shen, J., and Webb, C., “Die Stacking (3D) Microarchi-
tecture,” in Proc. Annual Int. Symp. Microarchitecture, pp. 469–479, 2006.
[16] Oh, E. C. and Franzon, P. D., “Design Considerations and Benefits of Three-
Dimensional Ternary Content Addressable Memory,” in Proc. IEEE Int. Interconnect
Technology Conference, pp. 591–594, 2007.
[17] Tsai, Y.-F., Wang, F., Xie, Y., Vijaykrishnan, N., and Irwin, M. J., “Design
Space Exploration for 3-D Cache,” IEEE Trans. on VLSI Systems, vol. 16, no. 4,
pp. 444–455, 2008.
[18] Hu, Y. C., Chung, Y. L., and Chi, M. C., “A Multilevel Multilayer Partitioning
Algorithm for Three Dimensional Integrated Circuits,” in Proc. Int. Symp. on Quality
Electronic Design, pp. 483–487, 2010.
[19] Bakoglu, H. B. and Meindl, J. D., “Optimal Interconnection Circuits for VLSI,”
IEEE Trans. on Electron Devices, vol. 32, pp. 903–909, May 1985.
[20] van Ginneken, L. P., “Buffer Placement in Distributed RC-tree Networks for Mini-
mal Elmore Delay,” in Proc. IEEE Int. Symp. on Circuits and Systems, pp. 865–868,
1990.
[21] Lillis, J., Cheng, C.-K., and Lin, T.-T. Y., “Optimal Wire Sizing and Buffer
Insertion for Low Power and a Generalized Delay Model,” IEEE Journal of Solid-State
Circuits, vol. 31, no. 3, pp. 437–447, 1996.
[22] Shi, W., Li, Z., and Alpert, C. J., “Complexity Analysis and Speedup Techniques
for Optimal Buffer Insertion with Minimum Cost,” in Proc. Asia and South Pacific
Design Automation Conf., pp. 609–614, 2004.
[23] Alpert, C. J., Devgan, A., and Quay, S. T., “Buffer Insertion With Accurate
Gate and Interconnect Delay Computation,” in Proc. ACM Design Automation Conf.,
pp. 479–484, 1999.
[24] Dong, S., Bai, H., Hong, X., and Goto, S., “Buffer Planning for 3D ICs,” in Proc.
IEEE Int. Symp. on Circuits and Systems, pp. 1735–1738, 2009.
[25] Pathak, M., Lee, Y.-J., Moon, T., and Lim, S. K., “Through Silicon Via Manage-
ment during 3D Physical Design: When to Add and How Many?,” in Proc. IEEE Int.
Conf. on Computer-Aided Design, pp. 387–394, 2010.
[26] Peng, Y. and Liu, X., “Low-Power Repeater Insertion With Both Delay and Slew
Rate Constraints,” in Proc. ACM Design Automation Conf., pp. 302–307, 2006.
[27] Qian, J., Pullela, S., and Pillage, L., “Modeling the Effective Capacitance for
the RC Interconnect of CMOS Gates,” IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 13, no. 12, pp. 1526–1535, 1994.
[28] Batude, P., Vinet, M., Pouydebasque, A., Royer, C. L., Previtali, B.,
Tabone, C., Hartmann, J.-M., Sanchez, L., Baud, L., Carron, V., Toffoli,
A., Allain, F., Mazzocchi, V., Lafond, D., Thomas, O., Cueto, O., Bouzaida,
N., Fleury, D., Amara, A., Deleonibus, S., and Faynot, O., “Advances in 3D
CMOS Sequential Integration,” in Proc. IEEE Int. Electron Devices Meeting, pp. 1–4,
2009.
[29] Jung, S.-M., Jang, J., Cho, W., Moon, J., Kwak, K., Choi, B., Hwang,
B., Lim, H., Jeong, J., Kim, J., and Kim, K., “The Revolutionary and Truly 3-
Dimensional 25F² SRAM Technology with the smallest S³ (Stacked Single-crystal Si)
Cell, 0.16µm², and SSTFT (Stacked Single-crystal Thin Film Transistor) for Ultra
High Density SRAM,” in Proc. Symposium on VLSI Technology, pp. 228–229, 2004.
[30] Golshani, N., Derakhshandeh, J., Ishihara, R., Beenakker, C., Robertson,
M., and Morrison, T., “Monolithic 3D Integration of SRAM and Image Sensor
Using Two Layers of Single Grain Silicon,” in Proc. IEEE Int. Conf. on 3D System
Integration, pp. 1–4, 2010.
[31] Naito, T., Ishida, T., Onoduka, T., Nishigoori, M., Nakayama, T., Ueno,
Y., Ishimoto, Y., Suzuki, A., Chung, W., Madurawe, R., Wu, S., Ikeda, S.,
and Oyamatsu, H., “World’s first monolithic 3D-FPGA with TFT SRAM over 90nm
9 layer Cu CMOS,” in Proc. Symposium on VLSI Technology, pp. 219–220, 2010.
[32] Bobba, S., Chakraborty, A., Thomas, O., Batude, P., Ernst, T., Faynot,
O., Pan, D. Z., and Micheli, G. D., “CELONCEL: Effective Design Technique for
3-D Monolithic Integration targeting High Performance Integrated Circuits,” in Proc.
Asia and South Pacific Design Automation Conf., pp. 336–343, 2011.
[33] Liu, C. and Lim, S. K., “A Design Tradeoff Study with Monolithic 3D Integration,”
in Proc. Int. Symp. on Quality Electronic Design, pp. 531–538, 2012.
[34] Lee, Y.-J., Morrow, P., and Lim, S. K., “Ultra High Density Logic Designs Using
Transistor-Level Monolithic 3D Integration,” in Proc. IEEE Int. Conf. on Computer-
Aided Design, pp. 539–546, 2012.
[35] Fisher, R. A., The Design of Experiments. London: Oliver and Boyd, 1935.
[36] Brglez, F. and Drechsler, R., “Design of Experiments in CAD: Context and New
Data Sets for ISCAS’99,” in Proc. IEEE Int. Symp. on Circuits and Systems, vol. 6,
pp. 424–427, 1999.
[37] Zhang, Q., Liou, J. J., McMacken, J., Thomson, J., and Layman, P., “Devel-
opment of Robust Interconnect Model Based on Design of Experiments and Multiob-
jective Optimization,” IEEE Trans. on Electron Devices, vol. 48, pp. 1885–1891, Sep.
2001.
[38] Nookala, V., Chen, Y., Lilja, D. J., and Sapatnekar, S. S., “Microarchitecture-
Aware Floorplanning Using a Statistical Design of Experiments Approach,” in Proc.
ACM Design Automation Conf., pp. 579–584, 2005.
[39] North Carolina State University, “NCSU FreePDK.”
[40] Joseph, A. J., Gillis, J. D., Doherty, M., Lindgren, P. J., Previti-Kelly,
R. A., Malladi, R. M., Wang, P.-C., Erturk, M., Ding, H., Gebreselasie,
E. G., McPartlin, M. J., and Dunn, J., “Through-silicon vias enable next-
generation SiGe power amplifiers for wireless communications,” IBM J. Res. & Dev.,
vol. 52, pp. 635–648, Nov. 2008.
[41] Ho, C.-W., Ruehli, A. E., and Brennan, P. A., “The Modified Nodal Approach to
Network Analysis,” IEEE Transactions on Circuits and Systems, vol. 22, pp. 504–509,
June 1975.
[42] Zhou, Q., Sun, K., Mohanram, K., and Sorensen, D. C., “Large power grid anal-
ysis using domain decomposition,” in Proc. Design, Automation and Test in Europe,
vol. 1, pp. 1–6, 2006.
[43] Dang, B., Bakir, M. S., and Meindl, J. D., “Integrated thermal-fluidic I/O in-
terconnect for an on-chip microchannel heat sink,” IEEE Electron Device Letters,
vol. 27, no. 2, pp. 117–119, 2006.
[44] Koo, J.-M., Im, S., Jiang, L., and Goodson, K. E., “Integrated microchannel cool-
ing for three-dimensional electronic architecture,” Journal of Heat Transfer, vol. 127,
pp. 49–58, 2005.
[45] Patankar, S. V., Numerical Heat Transfer and Fluid Flow. Washington, DC, Hemi-
sphere Publishing Corp., 1980.
[46] Cong, J. and Lim, S. K., “Edge Separability based Circuit Clustering With Appli-
cation to Circuit Partitioning,” in Proc. Asia and South Pacific Design Automation
Conf., pp. 429–434, 2000.
[47] Kim, D. H., Athikulwongse, K., and Lim, S. K., “A Study of Through-Silicon-Via
Impact on the 3D Stacked IC Layout,” in Proc. IEEE Int. Conf. on Computer-Aided
Design, pp. 674–680, 2009.
[48] Pathak, M. and Lim, S. K., “Thermal-aware Steiner Routing for 3D Stacked ICs,”
in Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 205–211, 2007.
[49] Myers, R. H. and Montgomery, D. C., Response Surface Methodology: Process
and Product Optimization Using Designed Experiments. John Wiley and Sons Inc.,
1995.
[50] Box, G. and Behnken, D., “Some new three level designs for the study of quantitative
variables,” Technometrics, vol. 2, pp. 455–475, 1960.
[51] Mason, R. L., Gunst, R. F., and Hess, J. L., Statistical Design and Analysis
of Experiments - With Applications to Engineering and Science (2nd Edition). John
Wiley & Sons, 2003.
[52] Derringer, G. and Suich, R., “Simultaneous Optimization of Several Response
Variables,” Journal of Quality Technology, vol. 12, no. 4, pp. 214–219, 1980.
[53] Koester, S. J., Young, A. M., Yu, R. R., Purushothaman, S., Chen, K.-
N., La Tulipe, D. C., Jr., Rana, N., Shi, L., Wordeman, M. R., and Sprogis,
E. J., “Wafer-level 3D integration technology,” IBM J. Res. & Dev., vol. 52, no. 6,
pp. 583–597, 2008.
[54] Minz, J., Wong, E., Pathak, M., and Lim, S. K., “Placement and Routing for
3D System-On-Package Designs,” IEEE Transactions on Components and Packaging
Technologies, vol. 29, no. 3, pp. 644–657, 2006.
[55] Jung, S.-M., Jang, J., Cho, W., Cho, H., Jeong, J., Chang, Y., Kim, J., Rah,
Y., Son, Y., Park, J., Song, M.-S., Kim, K.-H., Lim, J.-S., and Kim, K., “Three
Dimensionally Stacked NAND Flash Memory Technology Using Stacking Single Crystal
Si Layers on ILD and TANOS Structure for Beyond 30nm Node,” in Proc. IEEE Int.
Electron Devices Meeting, pp. 37–40, 2006.
[56] Suntharalingam, V., Berger, R., Burns, J. A., Chen, C. K., Keast, C. L.,
Knecht, J. M., Lambert, R. D., Newcomb, K. L., O’Mara, D. M., Rathman,
D. D., Shaver, D. C., Soares, A. M., Stevenson, C. N., Tyrrell, B. M.,
Warner, K., Wheeler, B. D., Yost, D.-R. W., and Young, D. J., “Megapixel
CMOS Image Sensor Fabricated in Three-Dimensional Integrated Circuit Technology,”
in IEEE International Solid-State Circuits Conf., 2005.
[57] Aeroflex Gaisler AB, “Leon3 Processor.”
[58] Nair, R., Berman, C. L., Hauge, P. S., and Yoffa, E. J., “Generation of Perfor-
mance Constraints for Layout,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, vol. 8, pp. 860–874, Aug. 1989.
[59] Hu, S., Alpert, C. J., Hu, J., Karandikar, S. K., Li, Z., Shi, W., and Sze,
C. Z., “Fast Algorithm for Slew-Constrained Minimum Cost Buffering,” IEEE Trans.
on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, pp. 2009–2022,
Nov. 2007.
[60] Katti, G., Stucchi, M., Meyer, K. D., and Dehaene, W., “Electrical Modeling
and Characterization of Through Silicon via for Three-Dimensional ICs,” IEEE Trans.
on Electron Devices, vol. 57, pp. 256–262, Jan. 2010.
[61] O’Brien, P. R. and Savarino, T. L., “Modeling the Driving-Point Characteristic of
Resistive Interconnect for Accurate Delay Estimation,” in Proc. IEEE Int. Conf. on
Computer-Aided Design, pp. 512–515, 1989.
[62] Liu, F., Kashyap, C., and Alpert, C. J., “A Delay Metric for RC Circuits Based
on the Weibull Distribution,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, vol. 23, pp. 443–447, Mar. 2004.
[63] Kashyap, C. V., Alpert, C. J., Liu, F., and Devgan, A., “Closed-Form Ex-
pressions for Extending Step Delay and Slew Metrics to Ramp Inputs for RC Trees,”
IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 23,
pp. 509–516, Apr. 2004.
[64] Bakoglu, H. B., Circuits, Interconnects, and Packaging for VLSI. Addison-Wesley,
1990.
[65] Alpert, C. and Devgan, A., “Wire Segmenting for Improved Buffer Insertion,” in
Proc. ACM Design Automation Conf., pp. 588–593, 1997.
[66] Nangate, “Nangate 45nm Open Cell Library.”
[67] Mercha, A., der Plas, G. V., Moroz, V., Wolf, I. D., Asimakopoulos, P.,
Minas, N., Domae, S., Perry, D., Choi, M., Redolfi, A., Okoro, C., Yang, Y.,
Olmen, J. V., Thangaraju, S., Tezcan, D. S., Soussan, P., Cho, J., Yakovlev,
A., Marchal, P., Travaly, Y., Beyne, E., Biesemans, S., and Swinnen, B.,
“Comprehensive Analysis of the Impact of Single and Arrays of Through Silicon Vias
Induced Stress on High-k / Metal Gate CMOS Performance,” in Proc. IEEE Int.
Electron Devices Meeting, pp. 2.2.1–2.2.4, 2010.
[68] Topol, A. W., Tulipe, D. C. L., Shi, L., Alam, S. M., Frank, D. J., Steen,
S. E., Vichiconti, J., Posillico, D., Cobb, M., Medd, S., Patel, J., Goma, S.,
DiMilia, D., Robson, M. T., Duch, E., Farinelli, M., Wang, C., Conti, R. A.,
Canaperi, D. M., Deligianni, L., Kumar, A., Kwietniak, K. T., D’Emic, C.,
Ott, J., Young, A. M., Guarini, K. W., and Ieong, M., “Enabling SOI-Based
Assembly Technology for Three-Dimensional (3D) Integrated Circuits (ICs),” in Proc.
IEEE Int. Electron Devices Meeting, pp. 352–355, 2005.
[69] Yu, C., Chang, C., Wang, H., Chang, J., Huang, L., Kuo, C., Tai, S., Hou,
S., Lin, W., Liao, E., Yang, K., Wu, T., Chiou, W., Tung, C., Jeng, S., and
Yu, C., “TSV Process Optimization for Reduced Device Impact on 28nm CMOS,” in
Proc. Symposium on VLSI Technology, pp. 138–139, 2011.
[70] Batude, P., Vinet, M., Pouydebasque, A., Royer, C. L., Previtali, B.,
Tabone, C., Hartmann, J.-M., Sanchez, L., Baud, L., Carron, V., Toffoli,
A., Allain, F., Mazzocchi, V., Lafond, D., Deleonibus, S., and Faynot, O.,
“3D Monolithic Integration,” in Proc. IEEE Int. Symp. on Circuits and Systems, 2011.
[71] Batude, P., Vinet, M., Pouydebasque, A., and Clavelier, L., “Enabling 3D
Monolithic Integration,” in ECS Transactions, 2008.
[72] Lee, Y.-J. and Lim, S. K., “Timing Analysis and Optimization for 3D Stacked Multi-
Core Microprocessors,” in Proc. IEEE Int. Conf. on 3D System Integration, pp. 1–7,
2010.
[73] NIMO Group at ASU, “Predictive Technology Model.”
[74] Lee, Y.-J., Hong, I., and Lim, S. K., “Slew-Aware Buffer Insertion for Through-




This dissertation is based on and/or related to the works and results presented in the
following publications in print:
[1] Young-Joon Lee and Sung Kyu Lim, “Co-Optimization of Signal, Power, and Thermal
Distribution Networks for 3D ICs”, in IEEE Electrical Design of Advanced Packaging
and Systems Symposium, 2008, pp. 163-166.
[2] Yoon Jo Kim, Yogendra K. Joshi, Andrei G. Fedorov, Young-Joon Lee, and Sung
Kyu Lim, “Thermal Characterization of Interlayer Microfluidic Cooling of Three-
Dimensional IC with Non-Uniform Heat Flux”, in ASME International Conference
on Nanochannels, Microchannels and Minichannels, 2009, pp. 1249-1258.
[3] Young-Joon Lee, Yoon Jo Kim, Gang Huang, Muhannad Bakir, Yogendra Joshi,
Andrei Fedorov, and Sung Kyu Lim, “Co-Design of Signal, Power, and Thermal Dis-
tribution Networks for 3D ICs”, in Design, Automation & Test in Europe Conference
& Exhibition, 2009, pp. 610-615.
[4] Young-Joon Lee and Sung Kyu Lim, “Routing Optimization of Multi-modal Inter-
connects In 3D ICs”, in Electronic Components and Technology Conference, 2009, pp.
32-39.
[5] Young-Joon Lee, Michael Healy, and Sung Kyu Lim, “Co-design of Reliable Signal
and Power Interconnects in 3D Stacked ICs”, in IEEE International Interconnect
Technology Conference, 2009, pp. 56-58.
[6] Young-Joon Lee, Michael Healy, Dae Hyun Kim, and Sung Kyu Lim, “Efficient
On-Chip Power, Clock, Thermal, and Signal Delivery for 3D ICs”, Three Dimensional
System Integration: IC Stacking Process and Design, edited by Antonis Papanikolaou,
Dimitrios Soudris and Riko Radojcic, Springer, 2009.
[7] Young-Joon Lee, Rohan Goel, and Sung Kyu Lim, “Multi-functional Interconnect
Co-optimization for Fast and Reliable 3D Stacked ICs”, in IEEE/ACM International
Conference on Computer-Aided Design, 2009, pp. 645-651.
[8] Yoon Jo Kim, Yogendra K. Joshi, Andrei G. Fedorov, Young-Joon Lee, and Sung
Kyu Lim, “Thermal Characterization of Interlayer Microfluidic Cooling of Three-
Dimensional Integrated Circuits With Nonuniform Heat Flux”, ASME Journal of
Heat Transfer, vol. 132, no. 4, Apr. 2010.
[9] Young-Joon Lee, Mohit Pathak, Chang Liu, Moongon Jung, and Sung Kyu Lim,
“Design and Timing Optimization of a 3D Stacked Microprocessor”, in ACM Inter-
national Workshop on Timing Issues in the Specification and Synthesis of Digital
Systems, 2010.
[10] Young-Joon Lee and Sung Kyu Lim, “Timing Analysis and Optimization for
Many-Tier 3D ICs”, in SRC Techcon Conference, 2010.
[11] Mohit Pathak, Young-Joon Lee, Thomas Moon, and Sung Kyu Lim, “Through
Silicon Via Management during 3D Physical Design: When to Add and How Many?”,
in IEEE/ACM International Conference on Computer-Aided Design, 2010, pp. 387-
394.
[12] Young-Joon Lee and Sung Kyu Lim, “Timing Analysis and Optimization for 3D
Stacked Multi-Core Microprocessors”, in IEEE International 3D System Integration
Conference, 2010, pp. 1-7.
[13] Young-Joon Lee and Sung Kyu Lim, “Co-Optimization and Analysis of Signal,
Power, and Thermal Interconnects in 3D ICs”, IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, Vol. 30, No. 11, pp. 1635-1648,
2011.
[14] Young-Joon Lee, Shreepad Panth, and Sung Kyu Lim, “Enabling High Density
Logic Designs for Monolithic 3D ICs”, in SRC Techcon Conference, 2012.
[15] Young-Joon Lee, Inki Hong, and Sung Kyu Lim, “Slew-Aware Buffer Insertion for
Through-Silicon-Via-Based 3D ICs”, in IEEE Custom Integrated Circuits Conference,
2012, pp. 1-8. (Invited Paper)
[16] Young-Joon Lee, Patrick Morrow, and Sung Kyu Lim, “Ultra High Density Logic
Designs Using Transistor-Level Monolithic 3D Integration”, in IEEE/ACM Interna-
tional Conference on Computer-Aided Design, 2012, pp. 539-546.
[17] Young-Joon Lee, Daniel Limbrick, and Sung Kyu Lim, “Power Benefit Study for
Ultra-High Density Transistor-Level Monolithic 3D ICs”, in ACM Design Automation
Conference, 2013, to appear.
In addition, the author has completed work unrelated to this dissertation, presented in
the following publications in print:
[1] Michael B. Healy, Krit Athikulwongse, Rohan Goel, Mohammad M. Hossain, Dae
Hyun Kim, Young-Joon Lee, Dean L. Lewis, Tzu-Wei Lin, Chang Liu, Moongon
Jung, Brian Ouellette, Mohit Pathak, Hemant Sane, Guanhao Shen, Dong Hyuk
Woo, Xin Zhao, Gabriel H. Loh, Hsien-Hsin S. Lee, and Sung Kyu Lim, “Design and
Analysis of 3D-MAPS: A Many-Core 3D Processor with Stacked Memory”, in IEEE
Custom Integrated Circuits Conference, 2010, pp. 1-4.
[2] Jae-Seok Yang, Krit Athikulwongse, Young-Joon Lee, Sung Kyu Lim, and David
Z. Pan, “TSV Stress Aware Timing Analysis with Applications to 3D-IC Layout
Optimization”, in ACM Design Automation Conference, 2010, pp. 803-806.
[3] Young-Joon Lee and Sung Kyu Lim, “Fast Delay Estimation with Buffer Insertion
for Through-Silicon-Via-Based 3D Interconnects”, in IEEE International Symposium
on Quality Electronic Design, 2012, pp. 228-235.
[4] Dae Hyun Kim, Krit Athikulwongse, Michael B. Healy, Mohammad M. Hossain,
Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean L. Lewis,
Tzu-Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao
Shen, Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H.
Loh, Hsien-Hsin S. Lee, and Sung Kyu Lim, “3D-MAPS: 3D Massively Parallel Pro-




Young-Joon Lee was born in Busan, Republic of Korea, in 1979. He received the BS and
MS degrees from Seoul National University in 2002 and 2007, respectively. He is currently
a PhD candidate in the School of Electrical and Computer Engineering at the Georgia
Institute of Technology. From 2007 to 2013, he conducted research in the Georgia Tech
Computer Aided Design (GTCAD) Laboratory led by Professor Sung Kyu Lim. He made major
contributions to the 3D-MAPS project, the world’s first 3D many-core processor from
academia. During
summer 2011, he worked at Cadence Design Systems as an intern. His research interests
include monolithic 3D IC design automation, low-power design techniques for TSV-based
3D ICs, timing optimizations for TSV-based 3D ICs, and co-optimization of traditional
metrics and reliability metrics on 3D ICs.
