Physical design methodologies for monolithic 3D ICs by Panth, Shreepad Amar








of the Requirements for the Degree
Doctor of Philosophy in the
School of Electrical and Computer Engineering
Georgia Institute of Technology
May 2015
Copyright © 2015 by Shreepad Amar Panth
PHYSICAL DESIGN METHODOLOGIES FOR
MONOLITHIC 3D ICS
Approved by:
Dr. Sung Kyu Lim, Advisor
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Azad Naeemi
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Saibal Mukhopadhyay
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Hyesoon Kim
College of Computing
Georgia Institute of Technology
Dr. Arijit Raychowdhury
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Date Approved: March 13, 2015
ACKNOWLEDGEMENTS
Finishing my Ph.D. has been a long journey, and it wouldn’t have been possible without the
assistance of many people. I would like to thank all those who helped me along the way.
Firstly, I would like to thank my advisor, Dr. Sung Kyu Lim, for guiding me and shaping
my research. He gave me the chance to pursue the highest academic degree possible, in
one of the best research institutions in the world.
I would like to thank Dr. Saibal Mukhopadhyay and Dr. Arijit Raychowdhury for
suggestions and guidance on my research. In addition, I thank Dr. Azad Naeemi and Dr.
Hyesoon Kim for serving on my dissertation defence committee.
The bulk of my research was spent working under a project with Qualcomm, and I
would like to thank Dr. Kambiz Samadi, Dr. Yang Du, and Pratyush Kamal for providing
valuable feedback and an industrial viewpoint to guide my research.
I would like to thank the past and current members of the GTCAD lab: Dr. Michael
Healy, Mohit Pathak, Dr. Dae Hyun Kim, Dr. Xin Zhao, Dr. Krit Athikulwongse, Dr.
Young-Joon Lee, Dr. Moongon Jung, Taigon Song, Yarui Peng, Sandeep Samal, Neela
Lohith, Kyung Wook Chang, Bon Woong Ku and Kartik Acharya for providing expertise
in areas I was unfamiliar with, tools and scripts, as well as their time for me to bounce ideas
off of. I also thank David Webb and Keith May, from the school of ECE’s IT department,
for responding to hundreds of my requests over the years.
Lastly, I would like to thank my family – my parents, grandparents, and sister for their
support not just through my Ph.D., but throughout my life.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY
I INTRODUCTION
1.1 Overview of Monolithic 3D ICs
1.1.1 Fabrication Techniques
1.1.2 Design Styles
1.2 Organization and Contributions
II DESIGN-FOR-TEST FOR TSV-BASED 3D ICS
2.1 Scan-Chain Design for 3D ICs
2.1.1 3D Scan Chain Construction
2.1.2 Reuse of Signal TSVs
2.1.3 Broken Scan Chains
2.1.4 Experimental Results
2.2 Transition-delay-fault Testing for 3D ICs with IR-drop Study
2.2.1 Transition-delay-fault Architecture
2.2.2 Probe-pad Placement and PDN Design
2.2.3 Design and Analysis Flow
2.2.4 Experimental Results
2.3 Test-time Estimation for 3D ICs
2.3.1 Die-level partitioning
2.3.2 Block-level partitioning
2.3.3 Case Studies
2.4 Summary
III PHYSICAL DESIGN FOR BLOCK-LEVEL MONOLITHIC 3D ICS
3.1 3D Floorplanning with Monolithic Inter-tier Vias
3.1.1 Problem Formulation and Overview
3.1.2 Floorplanning Engine
3.1.3 Post-Floorplan Refinement (PFPR)
3.1.4 MIV Planning Algorithm
3.2 Floorplan Quality Evaluation
3.2.1 Experimental Setup
3.2.2 Floorplanner Validation
3.2.3 Monolithic 3D vs. TSV-based 3D
3.3 Inter-Tier Performance Differences
3.3.1 Source of Inter-Tier Performance Differences
3.3.2 Degraded Interconnects on the Bottom Tier
3.3.3 Degraded Transistors on the Top Tier
3.4 Performance-Difference-Aware Design and Analysis Flow
3.4.1 Performance-Difference-Aware Floorplanner
3.4.2 Performance-Difference-Aware Analysis
3.5 Power-Performance Study
3.5.1 Identical Performance on Both Tiers
3.5.2 Impact of Inter-Tier Performance Differences
3.5.3 Overall Comparisons
3.5.4 Block Folding
3.6 Summary
IV PHYSICAL DESIGN FOR GATE-LEVEL MONOLITHIC 3D ICS
4.1 Congestion-Aware Placement for Gate-level Monolithic 3D ICs
4.1.1 Overall Design Flow
4.1.2 Monolithic 3D IC Placement
4.1.3 Routability-Driven Partitioning
4.1.4 Router-based 3D-Via Insertion
4.1.5 Experimental Results
4.1.6 Comparison with Existing 3D Placers
4.2 Monolithic 3D IC Design With Commercial 2D IC Tools
4.2.1 CAD Methodology
4.2.2 Power Benefit Study
4.3 IR-drop Aware Partitioning for Monolithic 3D ICs
4.3.1 Motivation and Objectives
4.3.2 Design and Analysis Flow
4.3.3 Experimental Results
4.4 Summary
V CONCLUSIONS AND FUTURE DIRECTIONS
REFERENCES
PUBLICATIONS
VITA
LIST OF TABLES
1 Statistics for different scan chain configurations.
2 Design Statistics for two designs, split by die.
3 Post-bond test time results. All test times are in cycles.
4 The optimal test times (in cycles) achieved for a two-die circuit, along with
the TSV usage at which this optimum time is reached.
5 The test times for die-level partitioning of a three-die 3D IC, considering
both uniform and tapered TSV constraints.
6 The test times for die-level partitioning of a four-die 3D IC, considering
both uniform and tapered TSV constraints.
7 Details of benchmark circuits used, showing the average and standard
deviation of the test data volume among all modules.
8 Design Statistics for All Benchmarks
9 Comparison between the proposed floorplanner and Cadence Encounter.
10 A comparison of wirelength, timing and top net power of 2D versus 3D
11 Various interconnect parameters.
12 The change in resistivity values of different metal layers in the Nangate
45nm library due to Tungsten interconnects.
13 Minimum size (X1) std. cell average delay (in ps), assuming worst loading,
at different corners.
14 Benchmarks used for evaluation.
15 Basic floorplan comparisons assuming both tiers have same performance.
16 Basic floorplan comparisons for different degraded 3D options. The numbers
are normalized to the respective 2D numbers in Table 15.
17 Impact of performance difference aware floorplanning (PDAFP). ‘-’ indicates
that point is not achievable within ±10% VDD.
18 Iso-power performance and iso-performance power results for all
implementation flavors.
19 Placement results for the 128 × 4 multiplier block.
20 The various benchmarks considered in this section.
21 The impact of partition bin size on solution quality.
22 The impact of router-based MIV insertion. Entries marked with a * are
unroutable.
23 The impact of routability-driven partitioning on monolithic 3D IC designs.
24 The impact of routability-driven partitioning for face-to-face designs.
25 Overall Comparisons.
26 Comparison between 3D-Craft and Our Placer.
27 Comparison of single vs. multiple MIV/F2F insertion. Power values are
reported in mW, and wirelength in meters.
28 Comparison of two different types of 3D CTS. Power values are reported
in mW, and wirelength in meters.
29 Overall comparisons between 2D and different 3D implementation styles.
Power numbers are in mW.
30 Dual-Vt comparisons between 2D and different 3D implementation styles.
Power is in mW.
31 Material properties used in a mobile package.
32 Benchmarks used.
33 Design statistics of baseline 2D and 3D designs.
34 The impact of IR-drop-aware partitioning. The PDN utilization is kept the
same as the baseline designs.
35 The impact of PDN optimization such that the IR-drop falls within the
45mV target.
LIST OF FIGURES
1 The fabrication process of monolithic 3D ICs [2]. (a) The bottom tier is
created the same way as 2D ICs. (b,c,d) Attachment of thin layer of silicon
to the top of the bottom tier. (e) FEOL of top tier and creation of MIVs and
top-tier contacts, and (f) BEOL processing of top-tier.
2 Various design styles available for monolithic 3D ICs.
3 Scan chain grown from (a) one direction, and (b) two directions
4 Re-use of existing signal TSVs for scan chain
5 (a) A 3D scan chain, and (b) multiple fragments connected together
6 The impact of scan configuration on wirelength
7 The Structure of a 3D Integrated Circuit
8 An IEEE 1500 Wrapper Boundary Register capable of launching a transition
on CFO. The abbreviations used are S-shiftWR, C-captureWR, T-transferWR,
U-updateWR
9 The DfT Architecture for Transition Delay Fault Testing of 3D ICs, showing
only the data path and serial operation
10 (a) A 0 to 1 Transition launched from WBR on Top Die (no TSV testing),
(b) An equivalent 0 to 1 Transition launched from WBR on Bottom Die
(with TSV testing)
11 (a) Post-bond test of bottom die, (b) Post-bond test of top die with TSV
test. Solid red lines indicate flow of scanned data, and dashed blue lines
indicate flow of data to and from WBRs in the launch-capture window
12 Damage caused to the probe pad after a single probe touchdown [27].
13 Layout images of (a) probe pads and TSVs, (b) P/G TSVs and P/G wire
detours, (c) signal TSVs and P/G wires. P/G wires can be routed over
signal TSVs.
14 (a) Candidate locations for probe pads, (b) Sample horizontal and vertical
power/ground pads, as well as signal pads, (c) 4 power probe pads placed
in a 2×2 horizontal configuration, and (d) in a vertical configuration
15 The overall design Flow. Yellow indicates inputs to the flow, green boxes
are custom scripts, blue indicates use of Synopsys tools, and red the use of
Cadence tools
16 A sample waveform obtained during testing, designed with four scan chains
17 GDSII images. (a) A close up of a TSV and its WBR, (b) IEEE 1500
Instruction Register Chain, (c) zoom out shot of the top metal layer of the
top die, showing TSV landing pads and probe pads
18 Various overheads involved in adding wrappers for (a) FFT and (b) Jpeg
19 Total Power comparison among (1) pre-bond, (2) post-bond without TSV
test, and (3) post-bond with TSV test under five different test vectors
20 Pre-bond IR-drop under different probe pad configurations and test vectors
for FFT (a, b, c) and Jpeg (d, e)
21 IR-drop maps before (= a, c) and after (= b, d) probe pad optimization
22 Comparison between pre-bond and post-bond IR-drop. (a) FFT, bottom
die, (b) FFT, top die, (c) Jpeg bottom die, (d) Jpeg, top die
23 (a) GDSII screen shot of a single die of a block-level 3D IC (b) Zoom in
shot of the boxed TSV block in (a)
24 Three different circuits considered for die-level partitioning of a two-die
stack. (a) A homogeneous stack, (b & c) Two different partitions of a
heterogeneous stack. A larger number implies the die is more complex.
25 Circuits considered for die-level partitioning of multi-die stacks. (a - c)
three die stack, (d - f) four die stack. A larger number implies the die is
more complex
26 The variation in test time observed for a two-die stack starting with ckt2 p2
and performing 1000 different random moves. 50 test-pins and 2 different
test-TSV constraints are assumed.
27 Comparison between the measured test time and approximate lower bound
of test time (= Equation 13) for a 2 die stack. The number of test pins is 50.
28 Variation in test time observed while performing 1000 random moves,
starting with ckt3 p1. The test time is computed assuming 50 test-pins, and 2
different uniform TSV constraints (20 vs 50 per-die).
29 Comparison between the measured test time and approximate lower bound
for a four-die stack. The test pin constraint is assumed to be 100.
30 Comparison of the variation in test time observed between moves involving
the bottom die (= D1 moves), and all other moves. The numbers are
reported for four-die implementations of (a,b) b19, (c,d) desperf
31 Comparison of theoretical and experimental threshold complexity factors
under various TSV and pin constraints. (a,b) Two-die stack, (c,d) Four-die
stack
32 The variation in TSVt,po observed while performing 1000 different random
moves, assuming 50 test-pins. (a) b19 two-dies, (b) b19 four-dies, (c)
desperf two-dies and (d) desperf four-dies.
33 The design flow to obtain a 3D floorplan, assuming hard blocks
34 Histogram of the longest path delay through inter-block nets of a benchmark.
35 Iterative MIV planning algorithm for soft blocks
36 Illustration of MIV planning for soft blocks. (a) Initial estimated MIV
locations (b) After one iteration of MIV planning.
37 Our design flow used to get post-layout simulation results
38 Sample layouts for cfrca 16 testcase, along with select block designs, and
zoomed in shots of TSVs and MIVs
39 Copper vs. Tungsten resistivity at different wire widths.
40 IV curves of nominal and degraded transistors.
41 Synthesis results of “des3” benchmark for different degradations.
42 The proposed inter-tier performance difference aware floorplanner.
43 Floorplan screenshots of “des3” when the top tier is at the TTm20p corner.
(a) Without performance difference aware floorplanning, and (b) With
performance difference aware floorplanning.
44 Power-performance trade-off curves assuming that both the tiers have
identical transistors and interconnects.
45 Power-performance trade-off curves assuming degraded transistors and
interconnects. Dashed lines represent non performance difference aware
floorplanning and solid lines represent performance difference aware
floorplanning
46 3D placement layout snapshots of one 128 × 4 multiplier block within the
“mul128” benchmark
47 Power-performance trade-off curves for the 128 × 4 multiplier block.
48 The design flow used for gate-level M3D placement.
49 Placement-aware partitioning. A modified 2D engine is used to place all
the gates into half the area, and then partitioned with area balance in each
bin
50 Handling pre-placed memory macros (a) Initial pre-placed locations, (b)
Projection of both tiers onto the same plane, and (c) Modifying the target
density to represent memory locations. t′d is the target density in the
modified 2D placement and td is the required target density in the final M3D
design.
51 Construction of a 3D RST. (a) The points to be routed. (b) Project to 2D
and construct a 2D RSMT. (c) Expand the 2D RSMT to a 3D RST. (d) If a
cell changes tier, the 2D RSMT can be re-used.
52 A legal route from A to B in a 4 × 3 × 2 grid. The top-view is limited to
two bends, while the unfurled view can have unlimited bends.
53 A view of the top metal layer that contains MIV landing pads. (a) A 2D
wire on the top metal layer blocks potential MIV landing pad slots. (b) If
MIVs connect to cells outside the current bin (external), they block other
MIVs. If MIVs connect to cells within the current bin (internal), they do
not block other potential MIV slots.
54 An overview of the router-based MIV insertion methodology. (a) The
technology and macro LEF are modified to represent a two-tier monolithic
3D IC. (b) The structure that is fed into the commercial router, which is
then routed. The MIV locations are extracted and separate verilog/DEF
files are created for each tier.
55 Screenshots of router-based MIV insertion (a) All the gates are placed
in the same placement layer, but no overlap exists in the routing layers.
(b) The result after routing. The MIV locations are highlighted in red
56 Manual partitioning of the memories in the OST2 benchmark. The memories
belonging to each sub-module are partitioned, and placed in a configuration
similar to that in 2D
57 Supply, demand, and overflow maps of the mul64 benchmark for min-cut
based partitioning solution. If interdependent supply/demand is considered,
a significant reduction in supply in densely wired areas is observed,
leading to more overflow
58 The impact of reducing the metal layer count. “Tm1” (“Bm1”) stands for
one metal layer removed from the top (bottom) tier.
59 (a) Monolithic 3D integration, and (b) Face-to-face 3D integration. MIVs
are limited to whitespace, while F2F vias are not.
60 Comparison of 2D, partition-then-place, and placement-aware partitioning
methods.
61 The overall CAD methodology flow used in this paper.
62 Isolating the memory pins by shrinking the memory footprint. (a) Initial
memory footprint, and (b) Memory footprint reduced to size of filler cell
63 Pre-placed memory is flattened to get a shrunk 2D footprint, on which 2D
P&R is performed. This is then partitioned to get a monolithic 3D solution.
64 Two different types of 3D CTS possible (a) One clock tree per tier for each
gating group (source-level), and (b) The entire backbone is fixed onto tier
0 (leaf-level)
65 The proposed CTS methodology (a) The clock backbone in tier 0, and (b)
Zoom-in shot of leaf-level flip-flops in both tiers connected to a leaf clock
buffer in tier 0
66 Two types of MIV insertion for a 3D net (a) Single, (b) Multiple
67 Resistive equivalent circuits for IR-drop and thermal in a conventional and
mobile package. Moving high power cells to the tier close to package helps
alleviate IR-drop. In a mobile package, the temperature increase is much
smaller than in a conventional package. Resistance is in mΩ, and thermal
resistance in °C/W
68 The design flow used for IR-drop-aware partitioning.
69 (a) A PDN structure in monolithic 3D. Red wires represent VDD and blue
wires represent VSS, (b) The power mesh showing the top and intermediate
metal layers, (c) Zoom-in shot of PDN MIV arrays showing only the
intermediate mesh layer and local cell rails.
70 A structure of a mobile package in 3D VLSI [1].
71 Sensitivity of tier IR-drop to change in tier power for (a) crossbar, and (b)
jpeg
72 IR-drop maps for crossbar benchmark. (a) baseline, (b) our IR-drop-aware
partition, where tier 0 has 60% of the chip power.
73 The impact of PDN optimization on the crossbar benchmark. IR-drop
aware partitioning is able to achieve the same IR-drop target as the baseline
partition while using significantly fewer PDN resources.
74 The impact of changing the target power of the bottom tier on the temperature
of the crossbar benchmark. Even if the bottom tier has 70% of the chip
power, the temperature increase is < 1°C
SUMMARY
The objective of this research is to develop physical design methodologies for
monolithic 3D ICs and use them to evaluate the improvements in the power-performance
envelope offered over 2D ICs. In addition, design-for-test (DfT) techniques essential for the
adoption of shorter-term through-silicon-via (TSV) based 3D ICs are explored.
Testing of TSV-based 3D ICs is one of the last challenges facing their commercialization.
First, a pre-bond testable 3D scan chain construction technique is developed. Next, a
transition-delay-fault test architecture is presented, along with a study on how to mitigate
IR-drop. Finally, to facilitate partitioning, a quick and accurate framework for test-TSV
estimation is developed.
Block-level monolithic 3D ICs will be the first to emerge, as significant IP can be
reused. However, no physical design flows exist, and hence a monolithic 3D floorplanning
framework is developed. Next, inter-tier performance differences that arise due to
the not yet mature fabrication process are investigated and modeled. Finally, an inter-tier
performance-difference aware floorplanner is presented, and it is demonstrated that high
quality 3D floorplans are achievable even under these inter-tier differences.
Monolithic 3D offers sufficient integration density to place individual gates in three
dimensions and connect them together. However, no tools or techniques exist that can
take advantage of the high integration density offered. Therefore, a gate-level framework
that leverages existing 2D IC tools is presented. This framework also provides congestion
modeling and produces results that minimize routing congestion. Next, this framework
is extended to commercial 2D IC tools, so that steps such as timing optimization and
clock tree synthesis can be applied. Finally, a voltage-drop-aware partitioning technique
is presented that can alleviate IR-drop issues, without any impact on the performance or




Technology scaling has been the fundamental driver of the semiconductor industry over the
last few decades. Each new technology generation delivers chips that are not only smaller
and faster, but also cheaper. However, scaling brings with it an exponential increase in
fabrication complexity. Devices today are no longer planar, and finFET structures have
become mainstream. Today’s extremely small geometries ideally require advancements
in lithography such as extreme-ultraviolet lithography. However, delays in its deployment
have led to the necessity of stop-gap solutions such as double and triple patterning. This
not only increases mask and fabrication cost, but also increases design cycle time. All this
additional complexity has led to speculation that cost no longer scales below 28nm.
These issues have led the industry to rethink the direction of technology scaling. Typi-
cally, as chips shrink, the devices get smaller and faster, but the interconnects become more
resistive and slower. In older nodes, the interconnect delay was such a small portion of the
total delay that this could be neglected. Today, however, the interconnect delay is dominant.
This has led to three dimensional integrated circuits (3D ICs) being proposed as a solution
to the interconnect bottleneck. In 3D ICs, devices are placed on multiple layers, instead
of just one, and connected together. This reduces the length of the on-chip interconnect,
squeezing additional performance out of the same device generation.
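As a rough illustration of why shorter wires help (a minimal sketch under stated assumptions, not a model taken from this work): if an unbuffered wire is treated as a distributed RC line, both its total resistance and its total capacitance grow linearly with length, so its delay grows roughly quadratically. The 0.38 coefficient is the commonly quoted 50% step-response factor for a distributed RC line. The per-micron resistance and capacitance values and the function name `wire_delay_ps` are hypothetical placeholders chosen only for illustration.

```python
# Illustrative only: delay of an unbuffered wire modeled as a distributed RC line.
# The per-unit-length values used below are hypothetical, not data from this work.

def wire_delay_ps(length_um, r_ohm_per_um, c_ff_per_um):
    """Approximate 50% delay, in picoseconds, as 0.38 * R_total * C_total."""
    r_total = r_ohm_per_um * length_um      # total wire resistance in ohms
    c_total = c_ff_per_um * length_um       # total wire capacitance in femtofarads
    # ohm * fF = 1e-15 s, which is 1e-3 ps
    return 0.38 * r_total * c_total * 1e-3


if __name__ == "__main__":
    r, c = 2.0, 0.2  # hypothetical values: 2 ohm/um and 0.2 fF/um
    for length in (1000.0, 500.0):
        print(f"{length:6.0f} um -> {wire_delay_ps(length, r, c):.1f} ps")
```

Under these assumptions, halving the wire length reduces the unbuffered wire delay by about a factor of four. Repeater insertion makes the dependence closer to linear, so this sketch overstates the gain for buffered global wires.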
One of the first techniques developed to enable 3D ICs was through-silicon-vias (TSVs).
Two or more layers of devices are fabricated, TSVs created on the dies, and then each die is
aligned and bonded. This technology is relatively close to market, and design-for-test (DfT)
is one of the last challenges facing its adoption. However, the quality of TSV-based 3D ICs
strongly depends on the TSV dimensions and parasitics, and they do not solve all
interconnect issues. Their relatively large pitch and parasitics limit them to memory-on-logic or
large logic-on-logic designs with a relatively small number of global interconnects [66, 15].
An emerging alternative is monolithic 3D integration (M3D), where the tiers are fabricated
sequentially, one on top of another, and connected together using monolithic inter-tier
vias (MIVs). Since no die alignment is required, these MIVs are roughly the same size as
local vias. Overall, monolithic 3D ICs offer several advantages over TSV-based 3D ICs:
(1) the small size of MIVs enables ultra-high integration density, considerably reducing
silicon area and cost, (2) the significantly reduced MIV parasitics help improve the
power-performance envelope, and (3) the manufacturing process is entirely foundry-driven, and
does not involve a packaging house for the processing of backside redistribution layers and
micro-bumps. This enables tighter process control, potentially leading to a faster ramp-up
once the technology is mature.
This section now presents an overview of the fabrication techniques and design styles
available for monolithic 3D ICs, and then outlines the contributions and the structure of the
rest of this dissertation.
1.1 Overview of Monolithic 3D ICs
1.1.1 Fabrication Techniques
The first technique developed to fabricate monolithic 3D ICs was to fabricate the bottom
tier as usual, and then to deposit a thin-film of amorphous silicon on top of it. Existing
know-how was then used to fabricate thin-film-transistors (TFT) on the top tier [25, 48].
However, the problem with this technique is that amorphous silicon leads to severely
degraded transistors. Next, attempts were made to crystallize the amorphous silicon on the
top tier using lasers [26, 18]. This, however, leads to islands of crystalline silicon with
unpredictable device behaviour at these island boundaries. Batude et al. were the first to
propose a process that produces extremely high quality crystalline silicon on the top tier,
which allows the fabrication of general logic [3, 2]. The rest of the dissertation assumes














Figure 1: The fabrication process of monolithic 3D ICs [2]. (a) The bottom tier is created
the same way as 2D ICs. (b,c,d) Attachment of thin layer of silicon to the top of the
bottom tier. (e) FEOL of top tier and creation of MIVs and top-tier contacts, and (f) BEOL
processing of top-tier.
This is a silicon-on-insulator (SOI) process, and the bottom tier is fabricated similar to
a 2D IC (Figure 1(a)). Next, a thermal oxide is grown on an empty wafer, and H+ ions are
implanted just below the silicon surface at a constant depth (Figure 1(b)). The thickness of
the oxide determines the buried oxide (BOX) thickness, and the depth of ion-implantation
determines the active silicon thickness. This new wafer is then flipped and bonded to the top
of the bottom tier using a low temperature oxide bonding process (Figure 1(c)). The excess
silicon is then sheared off at the implant line, and polished using
chemical-mechanical-polishing (CMP) to give an extremely high quality single-crystal silicon layer. The gates
are formed on the top tier, and the MIVs are created with the contact mask of the top tier
(Figure 1(e)). Finally, the metallization of the top tier is created (Figure 1(f)).
1.1.2 Design Styles
Monolithic 3D ICs were first applied to SRAM and FPGA design, where the masks are
extremely regular, and full-custom design techniques are easily applied. Jung et al.
demonstrated a 3D SRAM fabricated using a TFT layer on the top tier [25]. Naito et al.
presented a monolithic 3D FPGA design using a TFT configuration SRAM over bulk CMOS
logic [48]. Jung et al. also demonstrated a high-performance cost-effective DDR3 SRAM
using epitaxial growth [26]. This technology also allows heterogeneous integration, such
as that demonstrated by Golshani et al., where a photodiode array was stacked onto SRAM
for image sensing applications [18]. With respect to design, Liu and Lim evaluated several
design options for 3D SRAM including separating the PMOS and NMOS into different
tiers, and changing transistor and metal layer counts [40]. However, none of these works
considered general logic, where physical design techniques become essential. In general,









Figure 2: Various design styles available for monolithic 3D ICs.
Transistor-level integration is the most fine-grained technique [4, 33, 39, 34], where the
PMOS and NMOS within standard cells are placed on different tiers. It has the advantage
that the PMOS and NMOS fabrication process can be optimized separately. However,
this style requires redesign and re-characterization of the standard cells themselves, which
takes significant effort. In addition, the standard cell footprint does not reduce by 50%
in 3D due to the mismatch in the PMOS and NMOS sizes, as well as because MIVs are
required within the cell itself. Lee, Morrow, and Lim have demonstrated that one of the
main advantages of this design style is that the redesigned standard cells can be directly fed
into existing 2D tools [34].
Since re-designing existing logic, memory and IP blocks for 3D incurs significant design
overhead and cost, near-term 3D ICs will focus on reusing existing 2D blocks. In
block-level monolithic 3D ICs, functional blocks are floorplanned onto different tiers. This
style has the benefit of IP reuse, but does not fully take advantage of the fine-grained nature
of MIVs. There has been no prior work in designing block-level monolithic 3D ICs.
The last design style is gate-level monolithic 3D ICs, where existing standard cells
and memory can be placed on multiple tiers, and connected together using MIVs. The
advantage of this style is that it offers the reuse of existing cells, zero total silicon area
overhead (unlike transistor-level), and a sufficiently high integration density to obtain
significant power benefits (unlike block-level). The only prior work in this design style is [4],
where the authors provide a rudimentary design flow that is not capable of handling any
hard macros such as memory, and therefore cannot be applied to real designs.
Therefore, for general logic designs, physical design for transistor-level monolithic
3D ICs has been explored, while there is a complete lack of CAD tools and methodologies
to design real-world block-level and gate-level monolithic 3D ICs.
1.2 Organization and Contributions
This research first explores design-for-test (DfT) techniques crucial to the commercializa-
tion of short term TSV-based 3D ICs. Next, it presents a complete sign-off physical design
framework to take an RTL description of a circuit, and implement it in either a block-level
or gate-level monolithic 3D IC. Each of these is organized into a self-contained chapter,
and the contributions of this dissertation are as follows:
Design-for-Test for TSV-based 3D ICs is presented in Chapter 2. This chapter first
presents a technique to construct 3D scan chains that, unlike previous works, are pre-bond
testable. Next, as 3D ICs need to be tested at the rated frequency, this work presents the
first transition-delay-fault capable test architecture for 3D ICs. In addition, since IR-drop
is an issue during transition testing, techniques to mitigate IR-drop are presented. Finally,
this chapter presents techniques to quickly and accurately estimate the test time of a given
3D IC partition. This estimate can be used during the partitioning process to assess the total
number of test TSVs required by the partition under consideration.
Physical Design for Block-level Monolithic 3D ICs is discussed in Chapter 3. First, a
floorplanning framework is presented, and it is demonstrated that this engine produces
results comparable to commercial 2D engines. Inter-tier performance differences that arise
due to an immature fabrication process are discussed, and two options to mitigate these
differences are presented and modeled. A performance-difference aware floorplanner that
uses these models to produce high quality monolithic 3D floorplans is also presented.
Physical Design for Gate-level Monolithic 3D ICs is covered in Chapter 4. This chapter
first presents a technique to modify existing academic 2D engines, and couple them with
a placement-aware partitioning step to obtain high-quality monolithic 3D IC placement
solutions. It also discusses a technique to use commercial routers for MIV insertion. In
addition, it presents a technique to utilize commercial 2D engines instead of academic ones.
Finally, an IR-drop-aware partitioner that reduces the power and IR-drop of a monolithic
3D IC without increasing the maximum operating temperature of the chip is developed.
Conclusions and Future Directions are discussed in Chapter 5. This chapter summarises
all the work presented in this dissertation and goes over future research directions that will
help in designing better quality industrial-sized monolithic 3D systems-on-a-chip.
CHAPTER II
DESIGN-FOR-TEST FOR TSV-BASED 3D ICS
TSV-based 3D ICs are manufactured by fabricating each die separately, thinning the dies
containing TSVs, and stacking them all together. Due to the additional manufacturing steps
of thinning and stacking, additional defects could be introduced into the circuit. There-
fore, these 3D ICs need to be tested both before stacking (pre-bond), and after stacking
(post-bond). Testing of TSV-based 3D ICs is one of the last EDA challenges facing their
widespread adoption [62], and some of the challenges facing 3D test were enumerated
in [32].
Wu et al. [63] compare several scan-chain schemes, and provide genetic and ILP-based
algorithms for post-bond test. Zhao et al. [67] provide a scheme for clock tree synthesis
to facilitate pre-bond test. At the architectural level, Lewis and Lee [35] proposed a scan
island based methodology to test incomplete circuits during pre-bond test. This architecture
is similar to IEEE 1500, and a pre-bond testable architecture based on extensions to IEEE
1500 was formalized in [44, 46, 47]. The authors of [22, 23] provide test architecture
design for 3D SoCs, and Lee et al. provide an architecture that supports different
test-access-mechanism (TAM) widths for pre-bond and post-bond test [36].
In this chapter, three different aspects of DfT for TSV-based 3D ICs are presented. First,
a pre-bond and post-bond testable scan chain design scheme is discussed. Next, a
transition-delay-fault capable test architecture that can test 3D ICs at the rated functional
frequency is presented. Since voltage-drop (IR-drop) becomes an issue at the functional
frequency, this chapter also discusses power-delivery issues and IR-drop mitigation during
test. Finally, a theoretical framework to quickly estimate the test time of a given 3D IC
partition is included. Typical use cases and benefits of such a framework are also demonstrated.
2.1 Scan-Chain Design for 3D ICs
Constructing a 3D scan chain (i.e., one that goes across tiers) has several advantages over
constructing one 2D scan chain per tier and stitching them together. However, since a 3D scan
chain relies on the use of TSVs, and since TSVs occupy significant silicon area, the number of
scan TSVs that can be used is limited. Wu et al. [63] have demonstrated that 3D scan
chains give up to a 40% reduction in the scan wirelength. This can significantly improve
the speed of the scan chain, and reduce the test time of the circuit. However, the approach
presented in their work does not support pre-bond test, and assumes that the dies will be
tested only after bonding. This project demonstrates a scan-chain construction approach
that makes use of 3D scan chains, and is also pre-bond testable.
2.1.1 3D Scan Chain Construction
This section presents a greedy heuristic to construct a 3D scan chain while minimizing
its wirelength. The input constraints are the maximum number of scan TSVs that can be
used, the location of all the flip-flops, and a fixed scan-in and scan-out pin. The heuristic is
presented in Algorithm 1.
Algorithm 1: Greedy algorithm to construct a 3D scan chain
C ← {c1, c2, . . . , ck−1} ;
X ← {x0, x1, x2, . . . , xm, xm+1} ;
∀i, j Initialize (Cost (i,j)) ;
M ← {x0, xm+1} ;
u ← x0 ; v ← xm+1 ;
while M ∩ X ≠ X do
    u′ ← the cell j ∉ M that minimizes Cost (u,j) ;
    M ← M ∪ {u′} ;
    u ← u′ ;
    ∀i, j Update (Cost (i,j)) ;
    v′ ← the cell k ∉ M that minimizes Cost (v,k) ;
    M ← M ∪ {v′} ;
    v ← v′ ;
    ∀i, j Update (Cost (i,j)) ;
end
Here, C represents the TSV constraint for each die, and there are k dies. Assuming
face-to-back (F2B) bonding, TSVs are absent on the last die, and there are k − 1 constraints.
X represents the set of all scan cells, which has size m. x0 represents the scan-in pin, and
xm+1 represents the scan-out pin. Next, the cost function between two cells is initialized.
This cost function is given by Equation (1), where z represents all dies between xi and
xj, and Rz represents the remaining number of TSVs that can be used in that die without
















Set M represents the set of marked cells, and the scan-in and scan-out pins are initially
marked. Next, the scan chain is grown from two sides, both from the scan-in and the
scan-out pins. Each iteration picks the cell with minimum cost, and this process continues
until all cells are marked. The cost function is dynamically updated, and TSVs become
more expensive as the TSV constraint is approached. Eventually, the cost of using a TSV
becomes infinity once the TSV constraint is reached. It is important to note that when this
happens, it may not be possible to stitch all the scan cells without using more TSVs due to
the presence of isolated chains. In this case, extra TSVs may be used, which is guaranteed
to not exceed two TSVs per die, and the constraints can be adjusted appropriately. Although
it is possible to grow the scan chain from one direction only, growing it from two directions
usually results in smaller scan wirelength, as shown in Figure 3.
Figure 3: Scan chain grown from (a) one direction, and (b) two directions.
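The two-sided growth of Algorithm 1 can be illustrated with a minimal Python sketch. Since Equation (1) is not reproduced here, the cost function below is an assumed stand-in: Manhattan distance plus a penalty of alpha / Rz for every TSV layer z that is crossed, which becomes infinite once Rz reaches zero. The names greedy_scan_chain, crossed, cost, and alpha are hypothetical and do not come from the original implementation. When every remaining choice has infinite cost, the sketch falls back to plain distance and lets the remaining budget go negative, which records the extra TSVs.

```python
import math

def crossed(die_a, die_b):
    """Indices z of the TSV layers crossed when going from one die to another."""
    lo, hi = sorted((die_a, die_b))
    return range(lo, hi)

def distance(a, b):
    """Manhattan distance between two (x, y, die) points, ignoring the die."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def cost(a, b, remaining, alpha=10.0):
    """Assumed cost: distance plus alpha / Rz per crossed layer, infinite if Rz is 0."""
    total = distance(a, b)
    for z in crossed(a[2], b[2]):
        if remaining[z] <= 0:
            return math.inf
        total += alpha / remaining[z]
    return total

def greedy_scan_chain(cells, scan_in, scan_out, tsv_budget):
    """Grow the chain from both pins; return (node order, remaining TSV budget)."""
    nodes = [scan_in] + list(cells) + [scan_out]
    remaining = list(tsv_budget)
    unmarked = set(range(1, len(nodes) - 1))
    head, tail = [0], [len(nodes) - 1]

    def extend(side):
        cur = nodes[side[-1]]
        # Costs are recomputed from the live budget, which plays the role of Update(Cost).
        nxt = min(unmarked, key=lambda j: (cost(cur, nodes[j], remaining),
                                           distance(cur, nodes[j]), j))
        for z in crossed(cur[2], nodes[nxt][2]):
            remaining[z] -= 1  # a negative value means extra TSVs were needed
        unmarked.remove(nxt)
        side.append(nxt)

    while unmarked:
        extend(head)
        if unmarked:
            extend(tail)
    # Joining the two half-chains may itself cross dies.
    for z in crossed(nodes[head[-1]][2], nodes[tail[-1]][2]):
        remaining[z] -= 1
    return head + tail[::-1], remaining

cells = [(1, 0, 0), (2, 0, 0), (8, 0, 1), (9, 0, 1)]
chain, left = greedy_scan_chain(cells, (0, 0, 0), (10, 0, 1), [2])
print(chain, left)
```

In this toy case the chain stays on die 0 from the scan-in side, stays on die 1 from the scan-out side, and spends a single TSV where the two halves meet.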
2.1.2 Reuse of Signal TSVs
So far, the assumption has been that a dedicated scan TSV is required when a scan chain
goes from one die to another. In a scan chain, the output of a flip-flop is connected to
the scan input of the next flip-flop, as well as to some combinational logic that is of no
consequence during the test mode. It might be possible that a flip-flop drives some
combinational logic on another die through an existing signal TSV. In such a case, an additional
dedicated scan TSV is not required, and the existing signal TSV can be reused. A careful
choice of scan ordering can make use of several existing signal TSVs, thereby reducing the
overall scan chain wirelength, without suffering the penalty of inserting a large TSV into
the layout. An example of signal-TSV reuse is shown in Figure 4.
Figure 4: Re-use of existing signal TSVs for scan chain
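The reuse check described above can be sketched as a simple count over a given scan order. This is only an illustration under assumed data structures: die_of maps each flip-flop to its die, and drives_tsv_to maps a flip-flop to the set of dies it already reaches through an existing signal TSV. Both names, and the function count_scan_tsvs, are hypothetical.

```python
def count_scan_tsvs(chain, die_of, drives_tsv_to):
    """Return (reused, dedicated) TSV counts for the die crossings in a scan order."""
    reused = dedicated = 0
    for src, dst in zip(chain, chain[1:]):
        if die_of[src] == die_of[dst]:
            continue  # same die: no TSV needed for this hop
        if die_of[dst] in drives_tsv_to.get(src, set()):
            reused += 1      # the flip-flop output already crosses to that die
        else:
            dedicated += 1   # a new scan TSV has to be inserted
    return reused, dedicated

die_of = {"a": 0, "b": 1, "c": 1, "d": 0}
print(count_scan_tsvs(["a", "b", "c", "d"], die_of, {"a": {1}}))
```

Here the hop from a to b reuses an existing signal TSV, while the hop from c to d needs a dedicated one.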
2.1.3 Broken Scan Chains
Once a 3D scan chain is inserted into the design, it is used during post-bond test, and its
scan-in and scan-out pins are accessed through solder bumps. However, if pre-bond test is
to be performed, the scan chains on each die are broken into a number of fragments, and
cannot be used as-is. It is not feasible to probe all these fragments as probe needles are
usually large and their number is quite small. Thus, it becomes necessary to stitch together
different fragments as shown in Figure 5 so that the pre-bond test-pin count is reduced.
This can be achieved using tri-state buffers to stitch together the broken fragments, and
enabling them using a pre-bond test signal.
Figure 5: (a) A 3D scan chain, and (b) multiple fragments connected together
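As a rough illustration of how a 3D chain breaks up before bonding, the following sketch splits a chain into per-die fragments and then concatenates the fragments of one die in order, which is the effect the tri-state buffers are meant to achieve. The input format (a list holding the die index of each scan cell in chain order) and the function names are assumptions made for this example.

```python
def die_fragments(chain_dies):
    """Map each die to the (start, end) index runs of the chain that lie on it."""
    fragments = {}
    start = 0
    for i in range(1, len(chain_dies) + 1):
        if i == len(chain_dies) or chain_dies[i] != chain_dies[start]:
            fragments.setdefault(chain_dies[start], []).append((start, i - 1))
            start = i
    return fragments

def prebond_order(fragments_on_die):
    """Scan order on one die once its fragments are stitched end to start."""
    return [idx for (s, e) in fragments_on_die for idx in range(s, e + 1)]

frags = die_fragments([0, 0, 1, 1, 0, 1, 1, 0])
print(frags)
print(prebond_order(frags[1]))
```

A die with n fragments needs n − 1 stitching connections to form a single pre-bond chain.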
2.1.4 Experimental Results
Initially scan cells are inserted into the 2D netlist, either during or after synthesis. Next,
the original netlist is partitioned into as many dies as required, and individual netlists are
obtained for each die. Each die is then placed individually using Cadence Encounter to get
initial rough locations of scan flip-flops. Scan chains are then stitched together using the
greedy algorithm discussed earlier. This process introduces additional scan TSVs into the
design, and placement is again carried out to accommodate them.
The greedy heuristic for scan chain insertion was implemented in C++, and an FFT
circuit from [51] was chosen for analysis. Synthesis was carried out in Synopsys Design
Compiler using the NCSU 45nm technology. The design was placed in two dies, and the
number of signal TSVs was chosen to occupy around 20% of the entire die area. Statistics
for the design are as follows: the number of gates is 400,213, the number of signal TSVs is
2,953, and the number of flip-flops is 75,723. The TSV is assumed to have a diameter of 6µm
and a height of 50µm. The inserted wrapper scan elements occupy 1.96% of the total die area,
and have a total stitched wirelength of 75,054µm.
In order to study the impact of the number of scan TSVs on the wirelength of the design,
three different scan chains are constructed, as shown in Table 1. Since a 3D scan chain
cannot be constructed without any TSVs, the “scan0” case has two test TSVs inserted,
which is the minimum number required. Column 3 shows that even without using any
specific algorithm to re-use existing signal TSVs, it is possible to re-use around 2% of
the TSVs required for the scan chain. The number of scan-chain fragments formed per
die is exactly half of the number of TSVs, and Column 4 gives the amount of additional
wirelength that is required to stitch all of them together into a single chain. With an increase
in the number of fragments, the wirelength required to stitch them together also increases.
Table 1: Statistics for different scan chain configurations.
Name      No. TSVs   No. TSVs reused   Stitch WL (µm)
scan0     2          0                 4.75
scan100   100        2                 26595
scan200   200        4                 34296
The impact of the scan chain TSV count on the scan wirelength and the total wirelength
of the 3D design is plotted in Figure 6(a). First, it is observed that an increase in the number
of scan TSVs always helps reduce the scan wirelength. However, adding more scan TSVs
does not always reduce the signal and total wirelength. Beyond a certain point they start
to worsen. The initial improvement is achieved because the lower scan wirelength reduces
the routing congestion. With a further increase in the number of TSVs, either the die area
or standard cell density increases. If the die area increases, the average distance between
gates increases, increasing the overall wirelength. An increase in the cell density increases
routing congestion, and hence wirelength.
Figure 6: The impact of scan configuration on wirelength
2.2 Transition-delay-fault Testing for 3D ICs with IR-drop Study
One of the reasons 3D ICs are being explored is because they are expected to be faster than
2D ICs. Therefore, it becomes essential to test them at the rated functional frequency, and
make sure that they work. While there exists literature that supports transition delay fault
testing of 2D SoCs [41, 7], no prior work has looked at transition delay fault testing for
3D ICs. This section first presents a DfT architecture that supports transition delay fault
testing of 3D ICs. It supports both pre-bond and post-bond transition testing. In addition,
it supports transition testing of TSVs after bonding.
In a 3D IC, only one die has C4 bumps, and all other dies have no direct test access.
During pre-bond test of these dies, no wire bond pads exist, and it becomes necessary to add
large probe pads into the layout to facilitate probe needle touchdown, as shown in Figure 7.
This section discusses how these probe pads can be added into the layout, and how they fit















Figure 7: The Structure of a 3D Integrated Circuit
Finally, since transition test is carried out at the rated frequency of the chip, excessive
voltage drop (IR-drop) may occur. This is because test pattern generation tools aim to test
as many faults as possible with each pattern, leading to large portions of the chip switching
at the same time. This section also discusses creating a power delivery network (PDN)
that can support transition test, including the addition of power/ground probe pads, and
techniques to mitigate IR-drop.
2.2.1 Transition-delay-fault Architecture
The application of a transition fault vector to a circuit requires two cycles. The first cycle
triggers a transition (launch) at the location to be tested, and the second cycle (capture)
captures the response to this transition. The IEEE 1500 Wrapper Boundary Register (WBR)
specified in [44] cannot directly be used, as it only supports the application of a single bit to
a primary input, while two bits are required to launch a transition. Instead, a three flip-flop
IEEE 1500 WBR specified in [41] is used. Such a register is shown in Figure 8. This figure
also explains abbreviations that will be used in the remainder of this section. Each flip-flop
is sensitive to a different combination of IEEE 1500 control signals, which are indicated
above the clock. To apply a transition test, one bit is scanned into each of the SC and ST




















Figure 8: An IEEE 1500 Wrapper Boundary Register capable of launching a transition on
CFO. The abbreviations used are S-shiftWR, C-captureWR, T-transferWR, U-updateWR
The overall transition fault DfT architecture is shown in Figure 9. This figure is
simplified for illustration, and only the data path and a serial scan chain are shown. Parallel
testing is essentially the same idea, but with a larger number of scan chains. Each TSV is
equipped with a WBR, so that values can be scanned into it during test. Once the values are
scanned in, the launch and capture clocks are applied, and the responses are scanned out.
Each die is tested independently of the other, during both the pre-bond and post-bond tests.
Each unwrapped die is equipped with an internal bypass, so that the internal scan chains
can be bypassed, if desired. In order to transport data to and from the top die, the bottom
die is equipped with a multiplexer (elevator enable) to select the data from the top die. The
various control signals are generated by the IEEE 1149.1 TAP controller.
























































































Figure 9: The DfT Architecture for Transition Delay Fault Testing of 3D ICs, showing
only the data path and serial operation. In the figure, B denotes a bypass register.
This architecture is similar to that presented in [44], but with a few notable differences.
The first one is that a transition fault capable WBR is used. The second is that this system
has to support the transfer operation, in order to transfer data between the SC and ST
registers. Therefore, an extra transfer signal is to be routed between the dies. However,
the IEEE 1149.1 TAP controller does not natively support the application of delay tests,
and two approaches to modify it exist in the literature. The first one uses the exit1-DR,
exit2-DR and pause-DR states of the IEEE 1149.1 FSM to generate update, transfer and
capture signals, while in delay test mode [41]. The second approach utilizes an additional
TMS bit to change the state from update-DR to capture-DR within a single clock cycle [7].
The first approach is used, because additional package pins are undesirable.
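To make the state names above concrete, the sketch below walks the data-register column of the standard IEEE 1149.1 TAP state machine under a given TMS sequence. It only shows which states are visited between shifting and updating; how [41] derives the update, transfer and capture signals from these states is not modeled here, and the names NEXT_STATE and walk are introduced purely for illustration.

```python
# (next state for TMS=0, next state for TMS=1) in the data-register column of
# the IEEE 1149.1 TAP controller. The instruction-register column is omitted.
NEXT_STATE = {
    "Run-Test/Idle":  ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":     ("Shift-DR", "Exit1-DR"),
    "Shift-DR":       ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":       ("Pause-DR", "Update-DR"),
    "Pause-DR":       ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":       ("Shift-DR", "Update-DR"),
    "Update-DR":      ("Run-Test/Idle", "Select-DR-Scan"),
}

def walk(state, tms_bits):
    """Return the list of states visited when the given TMS bits are applied."""
    visited = []
    for tms in tms_bits:
        state = NEXT_STATE[state][tms]
        visited.append(state)
    return visited

# Leaving Shift-DR through the pause loop visits all three states named above.
print(walk("Shift-DR", [1, 0, 1, 1]))
# prints ['Exit1-DR', 'Pause-DR', 'Exit2-DR', 'Update-DR']
```

Because this path uses only existing TAP states, it needs no extra pin, which is consistent with the stated reason for preferring the first approach.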
TSVs also need to be tested at-speed in a 3D IC. For stuck-at fault testing, TSV testing
is trivial. Each TSV has a WBR on either side, and TSVs can be tested by placing both
dies in their respective extest modes. However, for transition testing, the time between the
launch and capture pulses has to be of the order of the TSV delay. This is a few tens of
picoseconds, and it is unreasonable to assume that the clock can be applied with such a
high speed.
This section presents an alternate approach to test the TSVs after the dies have been
bonded. Consider Figure 10(a). This represents the post-bond testing of the top die, with a
transition launched from the WBR on the top die. Figure 10(b) shows the identical
transition on the top die, but launched from the WBR on the bottom die. This transition would
also occur on the TSV, and would hence test the TSV also. This implies that a test vector
generated for the top die, but launched from the bottom die, will also test TSVs. If, after
bonding, the testing of the top die is performed exclusively through the WBRs of the
bottom die, no additional patterns will be required, and all TSVs between the top and bottom
die will be tested.
Figure 10: (a) A 0 to 1 Transition launched from WBR on Top Die (no TSV testing), (b)
An equivalent 0 to 1 Transition launched from WBR on Bottom Die (with TSV testing)
In order to support TSV test, an additional mode of operation, called TSVtest, is required
to configure the WBRs as shown in Figure 10. The default modes presented
by [44] are serial/parallel, pre-bond/post-bond, intest/extest/bypass and turn/elevator. If
a die is placed into TSVtest, all WBRs facing the bottom die are made transparent. TSV
testing can then be performed by placing the bottom die into extest, and the top die in the
intest-TSVtest mode.
Two example modes of operation are shown in Figure 11. Figure 11(a) shows the
post-bond test of the bottom die. The instruction used is post-bond-intest-serial-turn.
Figure 11(b) shows the post-bond testing of the top die with TSV test. Here the bottom die is
programmed with post-bond-extest-serial-elevator, and the top die with
post-bond-intest-serial-turn-TSVtest. The solid red lines show the flow of data scanned in, and
the dashed blue lines show the data flow to and from the WBRs in the launch-capture window.
Figure 11: (a) Post-bond test of bottom die, (b) Post-bond test of top die with TSV test.
Solid red lines indicate flow of scanned data, and dashed blue lines indicate flow of data to
and from WBRs in the launch-capture window
2.2.2 Probe-pad Placement and PDN Design
Fine-grained probe needles are unlikely to be available for at least another decade [53].
Today’s probe pads are limited by available technology [43] to a minimum pitch of 35 to
40µm for cantilever probing, and 100µm for vertical probing, with a minimum pad size
of around 25µm. As seen from Figure 7, not only do these probe pads occupy significant area
on the die in which they are placed, but any TSVs in the previous die also cannot be placed
in the same location as the probe pad, in order to avoid overlap with its landing pad. In
addition, when the probe needle makes contact with the probe pad, it creates a scrub mark, which
significantly affects its planarity, as shown in Figure 12 [27]. Therefore, several layout
implications exist while adding probe pads, and their locations need to be chosen carefully.
Figure 12: Damage caused to the probe pad after a single probe touchdown [27].
Probe pads can be divided into two categories – signal and power/ground (PG). Signal
probe pads are needed as the top die requires test access during pre-bond test. Each IEEE
1500 data and control signal needs to be provided with its ownprobe pad. In addition to
these, the die needs to be powered during test. Ideally, eachPG TSV chosen for touchdown
would have a PG probe pad directly on top of its landing pad. This would minimize the
area overhead, as well as provide a low resistance connection for power delivery. However,
the scrub mark will affect the TSV bonding process, and for the sake of reliability, a certain
distance has to be maintained between the probe pad and the TSV landing pad. Figure 13
shows such an arrangement. This figure also shows how PG TSVs are created in the layout,
and since they are quite large, how the thin PG rails detour around them.
This study focuses on circuits that have a regular power and ground TSV placement as
shown in Figure 14(a). Since the power and ground TSVs form a regular array, the spaces
in between them are candidate locations for probe pads. PG and signal probe pads can be
placed in a subset of these candidate locations. An example is shown in Figure 14(b). Two
choices exist when connecting a power probe pad to a power TSV: either a horizontal or a
vertical configuration. This figure also shows two signal probe pads. To simplify the design
process and reduce the search space, PG probe pads are placed in either the horizontal or
the vertical configuration, but not both. Figure 14(c) shows how 4 power probe pads are
placed in a 2 × 2 array in a horizontal configuration, and Figure 14(d) shows the same for
a vertical configuration.
Figure 13: Layout images of (a) probe pads and TSVs, (b) P/G TSVs and P/G wire
detours, (c) signal TSVs and P/G wires. P/G wires can be routed over signal TSVs.
Figure 14: (a) Candidate locations for probe pads, (b) Sample horizontal and vertical
power/ground pads, as well as signal pads, (c) 4 power probe pads placed in a 2×2
horizontal configuration, and (d) in a vertical configuration
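The idea of candidate locations on a regular PG TSV array can be sketched as follows. This is a simplified geometric illustration, not the placement procedure used in this work: it assumes that a candidate sits at the midpoint between two neighbouring TSVs (horizontally or vertically adjacent, depending on the configuration), and it uses the pitch, pad size, and TSV cell size quoted in the experimental setup of this section only as default values. The function names are hypothetical.

```python
def candidate_pads(n_cols, n_rows, pitch=130.0, orientation="horizontal"):
    """Midpoints between neighbouring PG TSVs on an n_cols x n_rows array (in um)."""
    pads = []
    for r in range(n_rows):
        for c in range(n_cols):
            x, y = c * pitch, r * pitch
            if orientation == "horizontal" and c + 1 < n_cols:
                pads.append((x + pitch / 2, y))
            elif orientation == "vertical" and r + 1 < n_rows:
                pads.append((x, y + pitch / 2))
    return pads

def clearance(pitch=130.0, pad_size=40.0, tsv_cell=8.4):
    """Edge-to-edge gap between a midpoint pad and the adjacent TSV cell (in um)."""
    return pitch / 2 - pad_size / 2 - tsv_cell / 2

pads = candidate_pads(3, 3, orientation="horizontal")
print(len(pads), round(clearance(), 1))
```

Under these assumptions neighbouring candidates are one TSV pitch apart, so a 130µm TSV pitch would satisfy a 100µm minimum probe pitch, and the gap returned by clearance is the spacing available to keep the scrub mark away from the TSV landing pad.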
2.2.3 Design and Analysis Flow
The design flow used in this section is shown in Figure 15. It can be broadly divided
into two categories. The left column represents physical design, and the right column
represents test related steps. Finally, IR-drop analysis is performed. Each step is explained
individually below.
Figure 15: The overall design flow. Yellow indicates inputs to the flow, green boxes are
custom scripts, blue indicates use of Synopsys tools, and red the use of Cadence tools
With respect to physical design, the starting point is an initial 3D gate-level Verilog
netlist, generated by partitioning a 2D netlist. Synopsys Design Compiler is then used to
insert as many scan chains per die as required. Custom scripts then take this netlist with
scan chains, and generate the RTL for the IEEE 1500 wrapper. This is then re-synthesized
using Synopsys Design Compiler. Probe pads are then inserted into the layout and treated
as locations where other TSVs cannot be placed. The design is then placed and routed
using Cadence Encounter.
The test related steps start with pin constraints, which are any pins that need to be
constrained to a certain logic value during the test mode (such as reset). Logic simulation
is then performed on the bottom die to get the pin constraints on the top die. Using
this information, automatic test pattern generation (ATPG) is performed on both dies
using Synopsys TetraMAX. The output is a set of STIL files containing pattern information. These
are parsed, and using the information about the wrapper chain ordering from the physical
design stage, the bits in the test patterns are reordered. A testbench is generated, and
simulated using Synopsys VCS. Using the routed result, and the VCD file generated from the
testbench, IR-drop analysis is performed as described below.
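The bit reordering step mentioned above amounts to a permutation of each pattern from the cell order assumed by ATPG to the order in which the cells are physically stitched. A minimal sketch, assuming that both orders are available as lists of cell names and that a pattern is a string of bits (the function name and the cell names are hypothetical, and no STIL parsing is shown):

```python
def reorder_pattern(bits, atpg_order, physical_order):
    """Permute one pattern from the ATPG cell order to the physical chain order."""
    value = dict(zip(atpg_order, bits))
    return "".join(value[name] for name in physical_order)

atpg_order = ["w0", "w1", "w2", "w3"]
physical_order = ["w2", "w0", "w3", "w1"]
print(reorder_pattern("1010", atpg_order, physical_order))
# prints 1100
```

The same mapping would be applied in reverse to the expected responses before comparing them with the values scanned out.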
For 2D IR-drop analysis, as is the case with all pre-bond testing, existing tools can
simply be used. However, 3D IR-drop analysis is required to measure the post-bond voltage
drop. Power simulations are first performed on a per-die basis using the switching activity
from the VCD file, after annotating each die with TSV parasitics. The DEF files from
both the dies are then combined into a single DEF file, treating the TSV as a via. This
tricks the tool into believing that it is dealing with a 2D design, but with a higher number of
metal layers. The power numbers generated earlier can then be used to perform 3D IR-drop
analysis using Cadence Encounter.
2.2.4 Experimental Results
All required scripts were implemented in C++. The designs used are synthesized using the
Nangate 45nm technology library. The TSV diameter is assumed to be 4µm, and its height
to be 40µm. The TSV landing pad size is assumed to be 7µm, and the total TSV cell size
including keep-out zone is 8.4µm. Power and ground TSVs are placed in a regular fashion,
with a pitch of 130µm. The TSV resistance, including contact resistance, is considered
to be 50mΩ. The probe pads are assumed to have a size of 40µm × 40µm, and a
minimum pitch of 100µm.
Figure 16 shows a sample testing waveform of a design with four scan chains. During
capture, the responses from the circuit are stored into the SC register, and the value of ST
is don’t care. Only the first vector scanned out exhibits this don’t care, and all subsequent










Figure 16: A sample waveform obtained during testing, designed with four scan chains.
The waveform is annotated with the scan-in, launch, capture, and scan-out phases, and marks
where the ST register is don’t care.
Figure 17: GDSII images. (a) A close up of a TSV and its WBR, (b) IEEE 1500 Instruc-
tion Register Chain, (c) zoom out shot of the top metal layer of the top die, showing TSV
landing pads and probe pads
Two designs are picked from the OpenCores benchmark suite and implemented in two
dies. Design statistics are shown in Table 2. This table splits up the statistics on a die by die
basis. The top die does not have any TSVs, and hence that particular entry is blank. This
table also reports the results of ATPG for both stuck-at faults as well as transition faults.
In all the following experiments, each die is assumed to have five scan chains. Since
the power consumption of stuck-at tests can be controlled by reducing the frequency, all
power numbers and IR-drop results focus on transition tests. Five transition test vectors are
picked from each die, and used as representative vectors. The test vectors of the bottom die
are prefixed with “BD”, and those of the top die with “TD”. Since ATPG runs in a greedy
fashion, the first few vectors test a larger number of faults per vector than later vectors.
Therefore, choosing five vectors at random out of the first few generated gives patterns
with high switching activity. Since only a single die at a time is tested in the following
experiments, the clocks to all the scan flip-flops of the die not being tested are gated off,
which helps reduce power consumption.
Table 2: Design statistics for two designs, split by die.
                     Jpeg                       FFT
                     Bottom Die   Top Die       Bottom Die   Top Die
Gate Count           214,641      197,187       328,512      296,929
# Scan F.F.          15,828       22,219        87,681       78,503
# Signal TSV         2,164        -             2,879        -
S-A Coverage (%)     99.77        99.61         99.99        99.99
S-A Patterns         2,012        2,217         12,180       11,610
Tr Coverage (%)      98.93        97.74         99.92        99.90
Tr Patterns          3,892        5,200         61,798       55,656
2.2.4.1 Overhead Study
This section discusses the overhead involved in adding the IEEE 1500 wrappers to different
designs. The overhead is computed with respect to wirelength, gate count, area, and power,
and plotted in Figure 18.
Figure 18: Various overheads involved in adding wrappers for (a) FFT and (b) Jpeg
From this graph, it is observed that there is around a 10% increase in gate area for Jpeg,
but this reduces to 5% in the case of FFT. This is because FFT has a smaller TSV to gate
ratio. For both designs, the wirelength and gate count increase by less than 5%. In addition,
only a small increase in the power consumption is observed in both circuits. This is because
the test related elements do not switch during normal operation, and any power increase
comes only from the small increase in the wirelength.
2.2.4.2 Test Time Study
This section discusses the change in test time for different test types and test configurations.
A summary of results is shown in Table3.
Table 3: Post-bond test time results. All test times are in cycles
                 Stuck-at test                      Transition test
Design   Die     [44]      This Work   % Inc.      w/o TSV   with TSV   % Inc.
                 (×10^6)   (×10^6)                 (×10^6)   (×10^6)
FFT      Bot.    220.6     227.6       3.17        1155.0    -          -
         Top     189.0     195.7       3.53        938.2     1002.0     6.83
Jpeg     Bot.    7.2       8.1         12.02       15.7      -          -
         Top     10.8      11.8        8.87        27.6      32.1       16.29
The test time is reported for post-bond test only, as the number of vectors is identical in
the pre-bond case. The third and fourth columns refer to the test time obtained by running
stuck-at tests only. The test times are compared against [44], which implements a
stuck-at architecture only. Since the proposed architecture has one additional flip-flop per
WBR, the test time is expected to increase. It is observed that this increase reduces with an
increase in the circuit size. Columns 6 and 7 compare test times of the top die, when tested
through its own WBR, as opposed to through that of the bottom die. This corresponds to
testing of the top die without, and with TSV test. Since the latter case has a longer chain
length, the test time increases. Again, this increase is observed to be proportional to the
circuit size. If this increase is found to be unacceptable, the WBR chain in the top die can
be bypassed, incurring some additional area and wirelength costs due to extra multiplexers.
2.2.4.3 Power Study
This section evaluates the change in power consumption from pre-bond to post-bond test,
as well as across different test patterns. In the case of the top die, post-bond without and
with TSV test is also compared. These results are plotted in Figure 19.
Figure 19: Total Power comparison among (1) pre-bond, (2) post-bond without TSV test,
and (3) post-bond with TSV test under five different test vectors.
The total power consumed in each case is split into the contribution by each die. From
these graphs, it is observed that the power consumed by a particular die changes very little
when moving from pre-bond to post-bond test. However, the other die consumes some
additional power due to leakage and switching in the test circuitry, leading to an increase
in the overall power. Furthermore, when the top die is tested in conjunction with TSVs, the
power consumed by both dies increases, compared to the case when TSVs are not tested.
This is because the logic driving TSVs in each die now consumes more power.
2.2.4.4 Pre-bond IR-drop
Here, the impact of different configurations of power probe pads on the voltage drop during
the pre-bond test is studied. Since the bottom die receives power from solder bumps, it is of
no interest in this study, and hence results focus on the top die only. As mentioned earlier,
the probe pads are placed in a regular grid-like fashion, at different pitches, and different
configurations. The results are shown in Figure 20.
[Figure 20 plot: panels (d) Jpeg, 2 x 2 pads, pitch = 390 um and (e) Jpeg, 3 x 3 pads, pitch = 260 um; legend: Horizontal, Vertical]
Figure 20: Pre-bond IR-drop under different probe pad configurations and test vectors for
FFT (a, b, c) and Jpeg (d, e)
As expected, the IR-drop goes down if the pitch of probe pads goes down. It is interesting
to note that the vertical configuration almost always outperforms the horizontal configuration.
This is because the standard cells receive power from horizontal metal stripes, and
placing pads in a horizontal configuration would simply mean that the same stripes get
power at two locations. However, in the vertical configuration, more of these stripes will
get a direct connection to power, and hence the IR-drop reduces.
As observed for the 2 × 2 configuration of probe pads of the circuit Jpeg, the IR-drop
can be quite high. One obvious solution would be to go back to ATPG, and constrain the
power budget. This would increase the total number of vectors, and hence the test time.
Instead, this project investigates whether any improvement in the IR-drop can be achieved
by cleverly placing probe pads. A manually optimized configuration, along with IR-drop
maps, is shown in Figure 21. These maps show that a careful choice of probe pad locations
can reduce the IR-drop.
Figure 21: IR-drop maps before (= a, c) and after (= b, d) probe pad optimization.
2.2.4.5 Pre-bond vs. Post-bond IR-drop
This section studies how the voltage drop of a particular die changes depending on the
stage in the bonding process. These results are plotted in Figure 22. In the case of the top
die, the lowest pre-bond voltage drop achieved among all possible combinations is plotted.
Not surprisingly, the post-bond IR-drop of the top die is much lower than the pre-bond
case. This is because in the post-bond case, the top die receives power through TSVs at a
much finer pitch than the probe pads in the pre-bond case. The small increase in the power
consumption when tested with TSVs is not sufficient to cause any change in the IR-drop.
It is interesting to note, however, that the IR-drop of the bottom die also reduces slightly
during post-bond test, even though it still receives power from the same locations, and has a
slightly higher power consumption. This is because during the post-bond test of the bottom
die, the top die consumes very little power, yet attaches its entire power grid in parallel to
that of the bottom die. This reduces the equivalent resistance of the power grid, and hence
the IR-drop is lower.
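As a first-order illustration of this argument, each power grid can be lumped into a single effective resistance. The lumped model and the symbols R_bot, R_top, and I_bot are assumptions introduced here for illustration only; they are not quantities extracted from the designs.

```latex
% Lumped sketch (assumption): R_{bot}, R_{top} are effective grid resistances,
% I_{bot} is the current drawn by the bottom die under test; the top die draws ~0.
\begin{align*}
V_{drop}^{pre}  &\approx I_{bot} \, R_{bot} \\
R_{eq}          &= \frac{R_{bot} \, R_{top}}{R_{bot} + R_{top}} < R_{bot} \\
V_{drop}^{post} &\approx I_{bot} \, R_{eq} < V_{drop}^{pre}
\end{align*}
```

Since the real grids are distributed networks, this sketch only indicates the direction of the change, not its magnitude.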



































[Figure 22 plot legend: Postbond - no TSV Test; Postbond - TSV Test]
Figure 22: Comparison between pre-bond and post-bond IR-drop. (a) FFT, bottom die,
(b) FFT, top die, (c) Jpeg bottom die, (d) Jpeg, top die
2.2.4.6 Normal vs. Test Mode
Since transition fault testing aims to switch as many nets as possible with one vector, the
IR-drop during the test mode is expected to be much higher than the IR-drop during the
normal mode. The normal mode IR-drop of Jpeg was found to be 10 mV, and that of FFT
was found to be 6 mV. When compared with the post-bond numbers from Figure 22, it is
clear that test mode has much higher IR-drop.
2.3 Test-time Estimation for 3D ICs
During early design space exploration, a large number of possible partitioning solutions
are evaluated w.r.t. power, performance, area, TSV count, etc. The TSV count includes the
number of signal TSVs, as well as estimates of TSVs for power delivery, clock, thermal,
and test. The number of test-TSVs depends on the test architecture, and includes TSVs
required for control, as well as those required to pump data. If test-TSVs are not accounted
for during partition evaluation, downstream design steps may have insufficient area to add
these TSVs. One such example is shown in Figure 23, where floorplanning was carried out
considering only signal TSV count. Insufficient area remains to add other TSVs such as
clock, power and test. The only solution is to expand die area, which increases cost, and






Figure 23: (a) GDSII screen shot of a single die of a block-level 3D IC. (b) Zoom-in shot
of the boxed TSV block in (a).
The chosen test architecture determines the number of control test-TSVs, while the
number of TSVs required to pump data is variable, and left up to the design engineer.
Only the latter is of interest, as the former remains constant irrespective of partition. In the
remainder of this section, test-TSVs refer only to those TSVs used to carry test vectors and
responses, and control test-TSVs can be treated as a separate, fixed constant.
If a fixed number of test-TSVs (TSVt,f ) are allocated during partitioning, there is the
possibility of overestimating the real total TSV count of a partition. It has been shown [50]
that pareto-optimality exists in the test-TSV count. If TSVt,po is the pareto-optimal number
of test-TSVs, any TSVs allocated beyond this will not yield a reduction in test time. The
actual number of test-TSVs used during scheduling is given by
TSVt = min(TSVt,f , TSVt,po) (2)
In area critical designs, when TSVt,f is small, it is usually the smaller of the two, so it
serves as a reasonable estimate. However, if TSVt,f is large, and it was used as an estimate
for TSVt, several candidate partitioning solutions would be discarded for having too many
TSVs. Therefore, an accurate estimate of TSVt,po is required, and it needs to be quickly
computed to be incorporated into automatic partitioning.
Existing test-scheduling algorithms such as [49, 50] focus on determining the test time
given a fixed test-pin and test-TSV constraint [49]. Using such algorithms to determine
TSVt,po would require repeatedly applying them for different test-TSV constraints, and
finding the point where there is no reduction in test time. While this process will work
if the partition is fixed, it is too slow to be used during early design space exploration.
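The sweep described above can be sketched as follows. This is a minimal illustration, not the scheduling algorithm of [49, 50]: `test_time` is a hypothetical callable standing in for one full scheduler run at a given test-TSV constraint, and `toy_test_time` is an invented saturating curve used only to make the example runnable.

```python
from typing import Callable


def pareto_optimal_tsv_by_sweep(test_time: Callable[[int], float], max_tsv: int) -> int:
    """Smallest TSV constraint in 1..max_tsv that reaches the minimum test time."""
    # One scheduler call per candidate constraint: this is the slow part.
    times = {tsv: test_time(tsv) for tsv in range(1, max_tsv + 1)}
    best = min(times.values())
    return min(tsv for tsv, t in times.items() if t == best)


def toy_test_time(tsv: int) -> int:
    """Hypothetical stand-in for a scheduler: improves with TSVs, then saturates."""
    return max(1000, 20000 // tsv)


if __name__ == "__main__":
    print(pareto_optimal_tsv_by_sweep(toy_test_time, 50))
```

Each candidate constraint costs one complete scheduling run, which is why repeating this sweep for every partition visited during design space exploration is impractical.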
In this section, a fast and accurate estimate of the pareto-optimal number of test-TSVs
required for a given 3D partition is derived. Since the test time estimate is meant to be used
during design space exploration, block-level designs are assumed, where the blocks are all
soft, and top-level interconnect tests are ignored. To validate results, the ILP-based test
scheduling algorithm presented in [49] is used to compute test time for a given partition.
2.3.1 Die-level partitioning
Die-level partitioning is studied first, where partitioning implies die ordering. While the so-
lution space is small, and exhaustive search methods can easily be applied, insights gained
in this section are used to explain block-level partitioning later.
2.3.1.1 Two-die stack
A two-tier die-level stack is the simplest form of a 3D IC, and there are only two partitions
possible. Furthermore, only two test scheduling options exist, serial or parallel test. In
serial test, each die is tested one at a time, the bottom die with all the test-pins, and the top
die with all the test-TSVs. In parallel test, the test-pins are divided between the bottom and
the top die. Three circuits are considered, and shown in Figure 24. The first circuit is a
homogeneous stack, and the next two are different die-level partitions of a heterogeneous
stack.












Figure 24: Three different circuits considered for die-level partitioning of a two-die stack.
(a) A homogeneous stack, (b & c) Two different partitions of a heterogeneous stack. A
larger number implies the die is more complex.
Since the solution space is small, all possible test scheduling options are tried, and the
pareto-optimal TSV counts for both serial and parallel test are tabulated in Table 4. Fifty
test-pins are assumed, and the test-TSV count is swept to obtain the minimum test time
and TSVt,po. The parallel schedule offers lower test time, and would be chosen by any test
scheduling algorithm. For the homogeneous stack, an equal division of test-pins is optimal,
which implies that TSVt,po is half of the number of test-pins, or 25. For the heterogeneous
stack, however, it is observed that both partitioning options give the same minimum test
time, but TSVt,po is different. As expected, the partition with the more complex die on top
requires more test-TSVs to obtain minimum test time.
2.3.1.2 Multi-die stack
This section tabulates the test time for a given set of partitions under fixed test-pin and TSV
constraints, and then uses this information to identify the characteristics of the partition that
affect the test time. The different multi-die circuits considered are shown in Figure 25.
Table 4: The optimal test times (in cycles) achieved for a two-die circuit, along with the
TSV usage at which this optimum time is reached.
            Serial Test             Parallel Test
Circuit     Tmin        TSVt,po     Tmin        TSVt,po
ckt1        2,447,767   47          2,363,730   25
ckt2 p1     1,931,750   47          1,899,170   19

























Figure 25: Circuits considered for die-level partitioning of multi-die stacks. (a - c) three
die stack, (d - f) four die stack. A larger number implies the die is more complex.
TSV constraints can be assigned in two ways. The first method is uniform TSV constraints,
which allocates an equal TSV budget to all the dies. The second method is tapering
TSV constraints, which allocates more TSVs for the lower dies (closer to the package), and
fewer TSVs for the upper dies. The test time is computed using ILP-based scheduling. The
test time difference for both types of constraints is studied, and tabulated for three and four
dies in Tables 5 and 6, respectively.
Table 5: The test times for die-level partitioning of a three-die 3D IC, considering both
uniform and tapered TSV constraints.
Pmax
TSVmax Test time (cycles)










Table 6: The test times for die-level partitioning of a four-die 3D IC, considering both
uniform and tapered TSV constraints.
        TSVmax                    Test time (cycles)
Pmax    D2-D1   D3-D2   D4-D3     ckt4 p1     ckt4 p2     ckt4 p3
50      50      50      50        2,225,765   2,225,765   2,225,765
50      30      30      30        2,300,851   2,597,776   2,597,776
50      30      20      10        2,418,438   2,971,786   7,021,398
70      70      70      70        1,561,751   1,561,751   1,561,751
70      30      30      30        1,802,068   2,597,776   2,597,776
70      30      20      10        1,919,655   2,971,786   7,021,398
It is clear from these tables that, as expected, the test time of a partition with the most
complex dies closest to the package is least. However, under uniform TSV constraints, the
test time changes only when the bottom die changes. Any permutation of the upper dies
without changing the bottom die does not affect the test time. Furthermore, if the pin and
TSV constraints are equal, partitioning has no impact on the test time. If two partitions
have the same test time when tested with the same number of TSVs, it follows that they
both also have the same TSVt,po. This implies that TSVt,po only needs to be updated if the
complexity of the bottom die changes. These results are not restricted to these particular
simulation settings, and a formal proof is given below.
Lemma 1. Assume that TSVmax is a uniform TSV constraint to test the set of dies D.
Let Dp ⊆ D be a subset of the dies tested in parallel within a single test session. Let
pd = (p1, · · · , p|Dp|) be a division of pins within this test session. If two dies Di and Dj,
i ≠ j ≠ 1, are swapped, then p′d obtained from pd by swapping pi and pj does not violate
Pmax and TSVmax constraints.








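Proof. Swapping pi and pj leaves the total number of pins in use unchanged, so the Pmax
constraint remains satisfied. The TSV constraint requires that the pins assigned to die k
and to all dies above it fit within the TSV budget:
$$\sum_{m=k}^{|D|} p_m \le TSV_{max} \quad \forall k > 1 \qquad (3)$$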
Since the set of dies D is known to be tested with pd, Equation (3) is satisfied. It needs to
be proved that this equation is also satisfied if D′ is tested with p′d. Clearly, the greatest
term in Equation (3) occurs when k = 2, or at the die immediately above the bottom die.
Therefore $\sum_{m=2}^{|D|} p_m$ satisfies the TSVmax constraint. If D′ is tested with p′d, this sum does
not change, and therefore p′d also satisfies the TSVmax constraint.
This lemma proves that if two dies are tested in parallel, and then interchanged in the
stack, they can still be tested in parallel with the same division of pins. It does not claim
that the same old division of pins will be optimal for the new partitioning, just that it is
possible without violating TSV and pin constraints.
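The claim of Lemma 1 can be checked numerically on small examples. The sketch below takes the TSV constraint to mean that, for every die k above the bottom die, the pins assigned to die k and all dies above it must not exceed a uniform TSVmax; the function names and the example pin divisions are illustrative assumptions.

```python
from itertools import combinations


def satisfies_constraints(pins, p_max, tsv_max):
    """pins[0] belongs to the bottom die; pins[k] to the (k+1)-th die from the bottom."""
    if sum(pins) > p_max:
        return False
    # For each die above the bottom one, everything from that die upwards
    # must be fed through the TSVs directly below it.
    return all(sum(pins[k:]) <= tsv_max for k in range(1, len(pins)))


def swap_preserves_feasibility(pins, p_max, tsv_max):
    """Swap the pin counts of every pair of upper dies and re-check feasibility."""
    for i, j in combinations(range(1, len(pins)), 2):
        swapped = list(pins)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        if not satisfies_constraints(swapped, p_max, tsv_max):
            return False
    return True


if __name__ == "__main__":
    division = [20, 5, 25]  # hypothetical pin division for a three-die stack
    print(satisfies_constraints(division, p_max=50, tsv_max=30))
    print(swap_preserves_feasibility(division, p_max=50, tsv_max=30))
```

The check succeeds for any feasible division because, with non-negative pin counts and a uniform budget, the binding term is always the sum over all upper dies, which a swap does not change.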
Lemma 2. If the set of dies D is tested with a certain test schedule (with uniform TSVmax
constraints), then any different partition D′ with the same bottom die D1 can be tested
with the same test schedule.
Proof. A test schedule is merely a series of test sessions with dies tested in parallel within
the same test session. Since TSVs are multiplexed between two different sessions, it is
enough to show that a single test session can be repeated for D′. From the previous lemma,
the test session can be repeated for a different partition with two dies interchanged. It is
clear that D′ can be obtained from D with a series of two-die exchanges. Therefore D′ can
also be tested with the same test schedule.
Again, this lemma does not claim that the same test schedule is optimal for the new
partition, but simply that it is possible. Finally, it is proved that the test time is independent
of the partition of upper dies.
Theorem 1. All partitions of a set of dies D with the same bottom die D1 have the same test
time under a uniform TSVmax constraint.
Proof. Let Dall be the set of all partitions of D with the same bottom die D1. Using
identical TSVmax constraints, find the partition with the minimum test time, say Dmin.
Then, from the previous lemma, any other partition D′ ∈ Dall can be tested with the same
test schedule as Dmin, and hence also has minimum test time.
Tables 5 and 6 also show that if the number of test pins is equal to the number of test
TSVs, then all partitioning results have the same test time. The proof of this follows from
the fact that if Pmax = TSVmax, Lemma 1 holds for interchanging any two dies, including
the bottom die.
2.3.2 Block-level partitioning
Block-level partitioning is the more general case of die-level partitioning. This section
studies the change in test time for different partitions under fixed test-TSV constraints,
derives lower bounds on the test time, and uses these lower bounds to derive equations for
TSVt,po. This section assumes uniform TSV constraints.
2.3.2.1 Two-die stack
Ckt2 p2 is taken as a starting solution, and modules are moved across the tiers. Each move
results in a new partition. Two types of module moves are performed. The first is moving a
module from one die to another, and the other is swapping two modules from different dies.
A total of 1000 moves are performed, and test scheduling is carried out for each partition
assuming 50 test-pins and different TSV constraints. The results are plotted in Figure 26.


























Figure 26: The variation in test time observed for a two-die stack starting with ckt2 p2 and
performing 1000 different random moves. 50 test-pins and 2 different test-TSV constraints
are assumed.
As observed in the previous section, if the test-TSV constraint is high enough, all partitions
have similar test time. With lower test-TSV constraints (= 20), it is observed that a
significant number of partitions have much higher test time, indicating that their TSVt,po
is higher. There are also partitions, however (Moves 650-80), that have close to the minimum
test time, indicating that their TSVt,po is close to 20. These results are explained on
the basis of the lower bounds derived below.
Lower bound on test time. For a module m, let im, om, and bm be the number of input,
output, and bi-directional ports, respectively. Further, let pm be the number of patterns
required to test that module. Let fm be the number of flip flops in that module. In the
case of hard modules, fm is simply the sum of the lengths of the internal scan chains. The
number of stimulus bits (tsm) is the sum of im, bm, and fm, and the number of response
bits (trm) is the sum of om, bm, and fm. The complexity of a module m is then defined as:
cm = max(tsm, trm) · pm + min(tsm, trm) (4)
Note that this is simply the test data volume of that particular module, neglecting the
one cycle required to run the test. Given a set of modules M, the complexity of that set,
CM, is defined as the sum of the complexities of all its constituent modules, i.e.,
$C_M = \sum_{m \in M} c_m$.
Although similar to the ITC’02 [45] definition of complexity, this formulation is linear.
This implies that irrespective of any partition of the modules M into M1 and M2, the sum
of CM1 and CM2 will always result in CM.
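The complexity definition of Equation (4) and its linearity can be sketched in a few lines. The `Module` record, its field names, and the example port and pattern counts below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    inputs: int    # i_m
    outputs: int   # o_m
    bidirs: int    # b_m
    flops: int     # f_m (for hard modules: total internal scan chain length)
    patterns: int  # p_m


def complexity(m: Module) -> int:
    """Equation (4): c_m = max(ts_m, tr_m) * p_m + min(ts_m, tr_m)."""
    ts = m.inputs + m.bidirs + m.flops   # stimulus bits per pattern
    tr = m.outputs + m.bidirs + m.flops  # response bits per pattern
    return max(ts, tr) * m.patterns + min(ts, tr)


def set_complexity(modules) -> int:
    """C_M: sum of the complexities of all modules in the set."""
    return sum(complexity(m) for m in modules)


if __name__ == "__main__":
    a = Module(inputs=10, outputs=6, bidirs=2, flops=100, patterns=50)
    b = Module(inputs=4, outputs=8, bidirs=0, flops=20, patterns=10)
    # Linearity: any split of {a, b} over two dies sums to the same total.
    print(set_complexity([a]) + set_complexity([b]) == set_complexity([a, b]))
```

Because the set complexity is a plain sum, moving a module between dies only shifts its contribution from one die to the other and leaves the stack total unchanged.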
Given a set of modules M and P pins with which to test them, a lower bound on the
test time of a 2D design based on the amount of data that needs to be pumped into it was

























Let M3D be the set of all modules in the 3D stack. M1 is the set of modules in the
bottom die, and M2 the set of modules in the top die. Let LBMi denote the lower bound
of the test time of the set of modules Mi. First, the lower bounds induced by both TSV
and pin constraints are considered. It is assumed that TSVmax ≤ Pmax, as any additional
TSVs will simply be wasted. The maximum test-pins available to the bottom and top dies
are Pmax and TSVmax, respectively. Therefore, a partition-dependent lower bound is given
by:
LBdep = max{LB2D(M1, Pmax), LB2D(M2, TSVmax)} (6)
This lower bound can be improved by considering that every module in the 3D stack can
be tested with no more than Pmax pins. Such a lower bound is partition independent, and is
given by:
LBindep = LB2D(M3D, Pmax) (7)
This lower bound holds irrespective of the partition or the TSV count. The overall lower
bound is then given by the maximum of the partition independent and dependent lower
bounds, and it can be reduced to:
LB3D = max{LB2D(M3D, Pmax), LB2D(M2, TSVmax)} (8)
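The reduction to Equation (8) can be spelled out as follows, under the assumption (not stated explicitly above) that LB2D does not decrease when modules are added to the set being tested with a fixed number of pins.

```latex
\begin{align*}
\max\{LB_{indep},\, LB_{dep}\}
  &= \max\{LB_{2D}(M_{3D}, P_{max}),\; LB_{2D}(M_1, P_{max}),\; LB_{2D}(M_2, TSV_{max})\} \\
  &= \max\{LB_{2D}(M_{3D}, P_{max}),\; LB_{2D}(M_2, TSV_{max})\} \\
  &\quad \text{since } M_1 \subseteq M_{3D} \;\Rightarrow\; LB_{2D}(M_1, P_{max}) \le LB_{2D}(M_{3D}, P_{max})
\end{align*}
```

The bottom-die term therefore never dominates, and only the top-die term carries the dependence on the partition.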
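Once the lower bound is defined, its behaviour w.r.t. change in the partitioning solution
needs to be captured. A partition-dependent metric, the complexity factor (CF) for a
two-die stack, is defined as the fraction of the total complexity that lies in the bottom die:
$$CF = \frac{C_{M_1}}{C_{M_{3D}}} = \frac{C_{M_1}}{C_{M_1} + C_{M_2}} \qquad (9)$$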







Varying CF from 0 to 1 captures all types of partitions. A CF of 0 means that all modules
are in the top die, and a CF of 1 means that all modules are in the bottom die. There exists
a CF beyond which the lower bound becomes constant, as proved below.
Theorem 2. LB2D(M2, TSVmax) decreases with increasing CF, and intersects LBindep
for all values of TSVmax < Pmax.
Proof. The first statement is trivial. If CF increases, it implies that CM2 reduces, and this
will reduce the lower bound on M2. Next, when CF = 0, all the modules are in Die 2, and
M2 becomes M3D. Since TSVmax < Pmax, LB2D(M3D, TSVmax) > LB2D(M3D, Pmax).
When CF = 1, the top die is empty with lower bound zero, and therefore,
LB2D(M2, TSVmax) < LBindep. This shows that somewhere in between a CF of 0 and 1, they
intersect.
To calculate the value of this threshold, a linear approximation of Equation (8) is de-
veloped. It is assumed that the scan unload and scan load of successive modules are not
overlapped. In addition, the third term in Equation (5) is neglected, as it is small when
compared with the first. Equation (8) can then be approximated as:
LB′2D(M,P ) ≈ 2 · CM/P (10)
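The lower bound then becomes
$$LB'_{3D} \approx \max\left\{ \frac{2 \cdot C_{M_{3D}}}{P_{max}},\; \frac{2 \cdot C_{M_2}}{TSV_{max}} \right\} \qquad (11)$$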









The threshold complexity factor is the complexity factor when both terms are equal,
$$CF_{th} = 1 - \frac{TSV_{max}}{P_{max}} \qquad (12)$$





Note that this threshold value only depends on the TSV and pin constraints and not on the
actual design or partition.
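With these simplifications, the approximate lower bound on the 3D test time can be
written as
$$LB'_{3D} \approx 2 \cdot C_{M_{3D}} \times \begin{cases} (1 - CF)/TSV_{max} & 0 \le CF \le CF_{th} \\ 1/P_{max} & CF_{th} \le CF \le 1 \end{cases} \qquad (13)$$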
This gives a linear model for the lower bound, with both design dependent and independent
terms. The shape of the lower bound curve is independent of design, and it is simply shifted
up or down depending on the particular design. This linear model gives a way to predict the
lower bounds on the test time without having any real partition information. The converse
of Equation (12) can be used to find out the pareto-optimal number of TSVs for a given
partition. Given a partition P with complexity factor CFP, TSVt,po can be written as:
TSVt,po = Pmax × (1 − CFP) (14)
This equation essentially finds the TSV count for which this partition is at the threshold
complexity factor. Increasing the TSV count beyond this value implies that the first term in
Equation (11) is greater than the second term, and since it is a constant, the test time does
not reduce. This is the definition of TSVt,po.
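The linear model can be sketched as a handful of helper functions. This is an illustration under stated assumptions: CF is taken as the bottom-die share of the total complexity, the 2D bound uses the approximation of Equation (10), and the threshold is written as 1 − TSVmax/Pmax, which is what equating the pin-limited and TSV-limited terms gives. The function names and numbers are hypothetical.

```python
def complexity_factor(c_bottom: float, c_total: float) -> float:
    """Share of the total complexity placed in the bottom die (0 = none, 1 = all)."""
    return c_bottom / c_total


def threshold_cf(tsv_max: int, p_max: int) -> float:
    """CF at which the TSV-limited and pin-limited terms are equal."""
    return 1.0 - tsv_max / p_max


def approx_lower_bound(c_total: float, cf: float, tsv_max: int, p_max: int) -> float:
    """Approximate 3D lower bound built from LB'_2D(M, P) ~ 2 * C_M / P."""
    pin_limited = 1.0 / p_max          # partition independent
    tsv_limited = (1.0 - cf) / tsv_max  # depends on what sits above the bottom die
    return 2.0 * c_total * max(pin_limited, tsv_limited)


def pareto_optimal_tsv(cf: float, p_max: int) -> float:
    """Equation (14): TSV count at which this partition sits at the threshold."""
    return p_max * (1.0 - cf)


if __name__ == "__main__":
    print(threshold_cf(tsv_max=20, p_max=50))
    print(approx_lower_bound(1_000_000, cf=0.2, tsv_max=20, p_max=50))
    print(approx_lower_bound(1_000_000, cf=0.8, tsv_max=20, p_max=50))
    print(pareto_optimal_tsv(cf=0.6, p_max=50))
```

Only the total complexity scales the result; the threshold itself depends on the TSV-to-pin ratio alone, which is what allows the estimate to be evaluated without running a scheduler.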
Test time vs. lower bound. This section plots the test time versus the CF, to observe how
different partitions affect the test time. In addition, the approximate lower bound is plotted
on the same scale to investigate how the test time curve compares to the lower bound curve,
and is shown in Figure 27.

























[Figure 27 plot legend: TSV=20, TSV=25, TSV=30, TSV=50]
Figure 27: Comparison between the measured test time and approximate lower bound of
test time (Equation 13) for a two-die stack. The number of test pins is 50.
As expected, the test time curve follows the general shape of the lower bound, but is
shifted upwards by some amount. Most importantly, the threshold complexity factor CFth
for both the test time and the lower bounds is similar. Therefore, the lower bound gives
the designer a very good estimate of what the shape of the test time curve is. Consequently,
TSVt,po is well estimated by Equation (14).
2.3.2.2 Multi-die stack
Similar to the experiment done with two dies, ckt3 p1 is used as the initial design. Then,
1000 random moves are made and the variation in test time is observed. Although the
moves are random, specific kinds of moves are made. The first 1/3 of the moves are performed
only between Die 1 and Die 2. The next 1/3 are only between Die 1 and Die 3. The third and
final 1/3 are made between Die 2 and Die 3. The test time is computed using ILP with a
test-pin constraint of 50 and 2 different uniform TSV constraints, and the results obtained
are plotted in Figure 28.
































Figure 28: Variation in test time observed while performing 1000 random moves, starting
with ckt3 p1. The test time is computed assuming 50 test-pins, and 2 different uniform
TSV constraints (20 vs 50 per-die).
From these results, it is again observed that if sufficient TSVs are available, the test
time does not vary much, indicating that all partitions have at least TSVt,po TSVs. If,
however, sufficient TSVs are not available, there is significant variation in the test time.
Most interestingly, however, similar to the die-level partitioning, moves between the upper
dies do not change the test time. These results are explained on the basis of lower bounds
on test time, as described next.
Lower bound on test time. This section generalizes the results obtained for the two-tier
case. The lower bound on the top die can be written as:
LBM|D| = LB2D(M|D|, TSVmax,|D|) (15)
For the die |D| − 1, the lower bound can be written as
LBM|D|−1 = LB2D(M|D|−1, TSVmax,|D|−1) (16)
Now, all the modules in the upper two dies can be tested with at most TSVmax,|D|−1
TSVs. Therefore
LBM|D|,|D|−1 = LB2D(M|D| ∪ M|D|−1, TSVmax,|D|−1) (17)
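The true lower bound on the test time of the upper two dies is simply the maximum
of Equations (15), (16), and (17). Similar lower bounds on all dies can be obtained by
inductively working backwards from the top die. The lower bound of test time to test all
dies above the bottom die is then
$$LB_{3D-D_1} = \max_{2 \le k \le |D|} LB_{2D}\left(\bigcup_{i=k}^{|D|} M_i,\; TSV_{max,k}\right) \qquad (18)$$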
This is the time to test the upper die with TSVmax,|D| TSVs, the upper two dies with
TSVmax,|D|−1 and so on. The lower bound on the test time of the entire 3D stack can then
be given by
LB3D = max(LB3D−D1 , LB2D(M3D, Pmax)) (19)
This is a general equation, for arbitrary TSV constraints. However, for the special case
when all the TSV constraints are equal, say TSVmax, this can be reduced to
$$LB_{3D,eq} = \max\left(LB_{2D}\left(\bigcup_{i=2}^{|D|} M_i,\; TSV_{max}\right),\; LB_{2D}(M_{3D}, P_{max})\right) \qquad (20)$$


























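This shows that the lower bound is independent of the partition of the upper dies. For
a multi-die stack, the complexity factor is therefore defined with respect to the bottom die only:
$$CF = \frac{C_{M_1}}{C_{M_{3D}}} \qquad (21)$$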









Note that this CF has a slightly different meaning from that of the two-die case. Here,
if CF = 1, then all modules are in the bottom die as usual, but a CF of 0 simply means
that no modules exist in the bottom die. With this definition of CF, the definitions of the
threshold complexity factor CFth, and TSVt,po are identical to the two-die case.
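The multi-die bound of Equations (15) to (20) can be sketched by working down from the top die, as described above: the top die is limited by its own TSV budget, the top two dies together by the budget below them, and so on, with the whole stack limited by the pins. The sketch uses the approximation of Equation (10) for the 2D bound; the function names, list layout, and die complexities are illustrative assumptions.

```python
def lb2d_approx(complexity: float, pins: int) -> float:
    """Approximate 2D lower bound, LB'_2D(M, P) ~ 2 * C_M / P."""
    return 2.0 * complexity / pins


def lb3d_approx(complexities, tsv_max, p_max: int) -> float:
    """complexities[0] is the bottom die; tsv_max[k - 1] is the TSV budget
    feeding the die at index k (and everything stacked above it)."""
    n = len(complexities)
    if len(tsv_max) != n - 1:
        raise ValueError("need one TSV budget per die above the bottom die")
    bound = lb2d_approx(sum(complexities), p_max)  # whole stack through the pins
    for k in range(1, n):
        # Dies k..top share the TSVs directly below die k.
        bound = max(bound, lb2d_approx(sum(complexities[k:]), tsv_max[k - 1]))
    return bound


if __name__ == "__main__":
    stack = [400, 300, 200, 100]     # hypothetical die complexities, bottom first
    permuted = [400, 100, 200, 300]  # same bottom die, upper dies reordered
    for budgets in ([30, 30, 30], [30, 20, 10]):
        print(budgets, lb3d_approx(stack, budgets, 50), lb3d_approx(permuted, budgets, 50))
```

With a uniform budget the two orderings give the same bound, while with a tapered budget the ordering that places the more complex dies higher in the stack gives a larger one, in line with the trends in Table 6.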
Test time vs. lower bound. The test time vs. CF for a four-die circuit is plotted by starting
with ckt4 p1, and performing 1000 different moves. The test-pin constraint is assumed to
be 100, and a uniform TSV constraint is assumed. The TSV numbers are chosen such
that the TSV-to-pin ratio is the same as that of the two-die case. This would imply that
the shape of the approximate lower bounds is exactly the same, but the curve will have
a different magnitude. The purpose of this is to demonstrate that different circuits tested
under the same TSV-to-pin ratio indeed have similar test time curves. This is plotted in
Figure 29.
As observed from this figure, the slope of the test time curve as well as the threshold
complexity values are dependent only on the TSV and pin constraints, and not on the circuit
being tested. This implies that Equation (14) gives us a good estimate of TSVt,po, even for
more than two tiers.
2.3.3 Case Studies
In this section, benchmark circuits are chosen from the IWLS’05 benchmark suite, and
the developed theory is applied to them. Two circuits are chosen, and details are tabulated in




















[Figure 29 plot legend: TSV=40, TSV=50, TSV=60, TSV=100; curves: approximate lower bound, test time]
Figure 29: Comparison between the measured test time and approximate lower bound for
a four-die stack. The test pin constraint is assumed to be 100.
Table 7. ATPG for each module is performed using Synopsys Tetramax, and this table lists
the average and standard deviation of test data volume (TDV) among all modules. Uniform
TSV constraints are assumed in all experiments involving more than two dies.
Table 7: Details of benchmark circuits used, showing the average and standard deviation
of the test data volume among all modules.
Circuit     #Modules    Average TDV    Std. Dev. TDV
b19         57          141,489        168,833
desperf     51          18,820         18,857
2.3.3.1 Test time variation
The objective of this experiment is to confirm that different partitions with the same bottom
die have similar test time. This will justify the definition of complexity factor, which in turn
translates to a more accurate TSVt,po. Four-die implementations of the two benchmarks are
taken as the baseline, and 500 moves that change the complexity of the bottom die are
performed. Next, an additional 500 moves that change the complexity of the upper dies,
but maintain the bottom die constant, are performed. The variation observed for each type
of move is plotted in Figure 30. The variation is computed as (tmax − tmin)/tmin, where
tmax and tmin are the maximum and minimum test times, respectively.


































































































































































Figure 30: Comparison of the variation in test time observed between moves involving
the bottom die (= D1 moves), and all other moves. The numbers are reported for four-die
implementations of (a,b) b19, (c,d) desperf.
when compared with moves that do not, validating the assumptions made. It is also observed
that if the test-TSV constraint is increased, the variation in the moves involving
the bottom die decreases. This is because with increased test-TSV constraints, a greater
fraction of all possible partitions already meet TSVt,po.
2.3.3.2 Threshold complexity factor prediction
Correct prediction of CFth is important, as it directly translates into the correct prediction
of TSVt,po. Theoretically, it is computed by Equation (12). This section validates the fact
that CFth is independent of design and only depends on the ratio between TSV and pin
constraints.
The experimental CFth is computed as follows. One thousand partitions of a design are
taken, and the CF and test time of each one is computed. Bins are created with respect to
CF, with a bin size of 0.005. For each bin, the average test time of all the partitions (using
ILP) that lie within that bin is computed. The threshold CF is computed as the first bin for
which the test time is within 10% of the minimum test time observed for that particular pin
and TSV constraint.
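The binning procedure can be sketched as below. Two details are assumptions made here rather than statements from the text: bins are scanned in order of increasing CF, and "within 10%" is read as a bin average no larger than 1.1 times the minimum observed test time. The sample data are invented.

```python
import math
from collections import defaultdict


def experimental_cf_threshold(samples, bin_size=0.005, tolerance=0.10):
    """samples: iterable of (cf, test_time) pairs, one per partition.
    Returns the lower edge of the first CF bin (scanning upwards) whose
    average test time is within `tolerance` of the minimum observed test time,
    or None if no bin qualifies."""
    samples = list(samples)
    t_min = min(t for _, t in samples)
    bins = defaultdict(list)
    for cf, t in samples:
        # Small epsilon guards against values like 0.6 / 0.005 landing just below 120.
        bins[math.floor(cf / bin_size + 1e-9)].append(t)
    for idx in sorted(bins):
        average = sum(bins[idx]) / len(bins[idx])
        if average <= (1.0 + tolerance) * t_min:
            return idx * bin_size
    return None


if __name__ == "__main__":
    data = [(0.1, 100), (0.3, 60), (0.5, 42), (0.501, 44), (0.7, 40), (0.9, 40)]
    print(experimental_cf_threshold(data))
```

Averaging within each bin smooths out partitions that share nearly the same CF but differ in test time, so the reported threshold reflects the trend rather than a single outlier.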
The theoretical and experimental results observed are plotted in Figure 31. Different
TSV and pin constraints that lead to the same CFth are considered. In addition, both
two and four die implementations of both designs are plotted. This figure shows that the
theoretical formula does indeed give results close to the experimentally observed ones,

























































































































































































Figure 31: Comparison of theoretical and experimental threshold complexity factors un-
der various TSV and pin constraints. (a,b) Two-die stack, (c,d) Four-die stack.
2.3.3.3 Over-design reduction
In this experiment, TSVt,po is computed during a simulated partitioning process, and its
variation is observed. The partitioning process is simulated by taking an initial circuit,
and performing 1000 different random moves on it. The results are plotted assuming 50
test-pins in Figure 32.
Figure 32: The variation in TSVt,po observed while performing 1000 different random
moves, assuming 50 test-pins. (a) b19 two-dies, (b) b19 four-dies, (c) des_perf two-dies
and (d) des_perf four-dies.
From this figure, it is observed that if a fixed TSV constraint is used, there is a
possibility of over-design, depending on what that constraint is. If it is quite low (e.g., 10),
then TSVt,po is always greater than this, and no resources are wasted. If, however, the
fixed TSV constraint is high (e.g., 40), then the actual number of TSVs required can be
much lower than this, and correct prediction of TSVt,po helps eliminate resource wastage.
It is also observed that increasing the number of tiers increases TSVt,po. This is expected,
as more tiers require more TSVs to test them with minimum test time.
2.4 Summary
This chapter presented various techniques for design-for-test for TSV-based 3D ICs, which
is one of the last challenges facing their adoption. First, a technique to create 3D scan
chains was developed. Unlike previous approaches, this technique is pre-bond testable.
The impact of the number of scan TSVs on the scan wirelength was also presented.
Next, an architecture for transition delay fault testing of 3D ICs was presented. This
architecture supports pre-bond as well as post-bond test of the logic, as well as post-bond
test of all the TSVs. In addition, since IR-drop is an issue during transition testing,
techniques to mitigate IR-drop were presented. Adding probe pads into the layout
for pre-bond test access was also discussed.
Finally, this chapter presented techniques to quickly and accurately estimate the test
time of a given 3D IC partition. This estimate can be used during the partitioning process
to assess the total number of test-TSVs required by a given partition.
CHAPTER III
PHYSICAL DESIGN FOR BLOCK-LEVEL MONOLITHIC 3D ICS
Since re-designing existing logic, memory and IP blocks for 3D incurs significant design
overhead and cost, 3D ICs will first focus on reusing existing 2D blocks [29, 61, 31]. These
2D blocks are placed in a 3D space and connected together using MIVs. However, since
block-level designs have only a few inter-block wires, this design style is also a prime
target for TSV-based 3D ICs. A few works have considered adding TSVs into existing
whitespace blocks at the floorplanning stage. Simultaneous buffering and TSV planning
was carried out in [20], but the authors reported inaccurate 3D half-perimeter wirelength
(HPWL) and timing metrics. An improved algorithm was presented in [61], but the same
inaccurate HPWL metric was used. Results based on an improved BB-2D-HPWL metric
were presented in [31], and the most accurate HPWL metric, based on subnets, was presented
in [29]. However, none of these papers compared the quality of their engine with that of a
commercially available tool, or took the obtained floorplans all the way through place and
route to obtain GDSII layouts.
This chapter first presents a 3D floorplanning framework that is capable of handling
monolithic 3D ICs as well as TSV-based 3D ICs. The quality of the floorplanning results
is validated against a commercial tool. It is shown that, even in coarse-grained block-level
designs, monolithic 3D can lead to better designs than TSV-based 3D. Next, due to the
fabrication process, some tiers suffer from degraded performance. This chapter models this
performance degradation, and provides a floorplanning technique to make designs more
resilient to it.
3.1 3D Floorplanning with Monolithic Inter-tier Vias
3.1.1 Problem Formulation and Overview
A general form of the 3D floorplanning problem can be stated as follows: Given the number
of desired tiers, and a set of blocks along with their corresponding widths and heights,
determine the (x, y, z) locations of each of the blocks and all MIVs/TSVs.
The overall design flow, assuming hard blocks (i.e., blocks with a fixed aspect ratio), is shown
in Figure 33. Floorplanning is performed to determine the location of all the blocks assuming
the pins are placed at the center. Once the locations of all the blocks are determined,
the blocks are updated with the locations of the pins, and a refinement step (i.e., PFPR) is
performed to further minimize wirelength. Note that this refinement is unnecessary for
soft blocks, as the pin locations are determined based on the floorplanning result. Different
via planning engines are used depending on whether TSVs or MIVs need to be inserted.
Finally, separate Verilog and DEF files are created for each die/tier with the corresponding
connectivity information and location of blocks and TSVs/MIVs, respectively. Each of the
above steps is further explained in the following subsections.
[Flowchart: Center-to-center based annealing -> Update with pin locations -> Annealing-based
refinement -> Monolithic? If yes: Create Verilog and DEF files with pins -> Route with
Encounter -> Extract MIV location and connectivity -> Create Verilog/DEF file for each die.
If no: TSV planning -> Create Verilog/DEF file for each die. Legend: existing work, custom
program, Cadence Encounter.]
Figure 33: The design flow to obtain a 3D floorplan, assuming hard blocks.
3.1.2 Floorplanning Engine
This step takes the description of all the blocks as well as the connectivity information and
generates an output floorplan that minimizes a certain cost function. This cost function
is different for TSV-based and monolithic 3D ICs. A simulated annealing engine similar
to [29] is used, which maintains a separate sequence pair for each die. The following
moves are performed during the annealing process: (1) change the aspect ratio of a
block (or rotate it in the case of hard blocks), (2) swap two blocks in either the positive
sequence, the negative sequence, or both, and (3) move or swap two blocks between a pair
of dies/tiers.
In TSV-based 3D, the number of TSVs needs to be limited, as they each occupy significant
silicon area. Hence, the TSV-based 3D cost function is given as follows:
$C_{TSV} = \alpha WL + \beta A + \gamma N_{TSV}$ (24)
In the above equation, $WL$ represents the inter-block wirelength, $A$ represents the chip
area, and $N_{TSV}$ represents the number of TSVs. Since the MIV size is negligible in
monolithic 3D, the floorplanner doesn’t need to artificially control their count. The monolithic
3D cost function is given as follows:
$C_{MIV} = \alpha' WL + \beta' A$ (25)
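As a minimal sketch, Equations (24) and (25) can be evaluated as shown here. The function and parameter names are illustrative, the weights are left to the caller, and the acceptance rule is the textbook Metropolis criterion, which is assumed here rather than taken from the engine of [29].

```python
import math
import random


def tsv_cost(wirelength, area, num_tsv, alpha, beta, gamma):
    """Equation (24): weighted sum of wirelength, chip area and TSV count."""
    return alpha * wirelength + beta * area + gamma * num_tsv


def miv_cost(wirelength, area, alpha_prime, beta_prime):
    """Equation (25): no via-count term, since the MIV size is negligible."""
    return alpha_prime * wirelength + beta_prime * area


def accept_move(delta_cost, temperature, rng=random):
    """Standard Metropolis rule (an assumption about the annealer):
    always accept improvements, accept a worse floorplan with
    probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)
```

With the same wirelength and area, the two costs differ only by the `gamma * num_tsv` penalty, which is what steers the TSV-based floorplanner toward fewer inter-die connections.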
Now, in a given block-level netlist, not all the nets are timing critical. More effort should
be spent minimizing the length of the most critical nets, at the expense of non-critical nets.
A histogram of the longest path delays (LPD) through each inter-block net for a benchmark
is shown in Figure 34.
From this figure, it is observed that this distribution resembles a Gaussian curve for the
nets with LPD greater than 0.35ns. There are very few nets that are the most critical, and
the most effort should be spent trying to minimize their length.
Weighting each net by the LPD through it makes the floorplanner timing aware.
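The LPD weighting can be illustrated with a short sketch. It uses a plain 2D half-perimeter wirelength as a stand-in for the more accurate 3D metrics cited in this chapter, and the data layout (a list of (LPD, pin list) pairs) is hypothetical.

```python
def hpwl(pins):
    """Half-perimeter wirelength of the bounding box of a list of (x, y) pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def timing_weighted_wirelength(nets):
    """Sum of per-net HPWL, each weighted by the longest path delay (LPD)
    through that net, so that critical nets dominate the cost.

    nets: list of (lpd_ns, [(x, y), ...]) tuples (illustrative format).
    """
    return sum(lpd * hpwl(pins) for lpd, pins in nets)
```

A net with twice the LPD contributes twice as much per unit length, so the annealer gains more by shortening it than by shortening a non-critical net.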
In the case of soft blocks, the pin locations are determined after floorplanning, and
measuring the wirelength from the center of the block is adequate. However, for hard blocks,
considering the pin locations of the blocks during floorplanning will require an extra step to
compute the physical location of all block-pins. Since the number of block-pins is quite
large, this will lead to a large runtime overhead. Instead, a post-floorplanning refinement
(PFPR) step is proposed to consider block-pin locations once the block locations have been
determined.
Figure 34: Histogram of the longest path delay through inter-block nets of a benchmark.
3.1.3 Post-Floorplan Refinement (PFPR)
After determining the relative locations of all the blocks, each block is assumed to have
a random orientation, and is updated with its block-pin locations. Each block has eight
possible orientations: 0°, 90°, 180°, 270°, and their flipped counterparts. Without changing the
relative locations of the blocks in the floorplan, each block can only have four possible
orientations. For example, if the pins are in the center of a block, 0° and 180° (or 90° and
270°) and their flipped counterparts are all the same. However, if the pins are placed along the
periphery, each of these four orientations gives a different wirelength result. The
objective of this step is to determine the orientation of each block such that the wirelength is
minimized. Simulated annealing is used for this purpose, where the only operation allowed
is to change the block orientation. The block orientation can only be changed among the
four allowed scenarios. No sequence pair is necessary, as the relative locations of blocks do
not change. Furthermore, wirelength computation can be done incrementally, as only one
block is changed at a time.
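The effect of orientation on pin position can be sketched as follows. The four orientation labels (R0, R180, MY, MX) are illustrative names for the outline-preserving orientations, the pin offsets are measured from the lower-left corner of the block, and a greedy per-block choice replaces the annealing loop for brevity; none of this is taken from the actual implementation.

```python
# The four orientations that keep a block's width and height unchanged
# (labels are illustrative): identity, 180-degree rotation, mirror about
# the vertical axis, mirror about the horizontal axis.
ORIENTATIONS = ("R0", "R180", "MY", "MX")


def oriented_pin(offset, width, height, orientation):
    """Pin offset from the block's lower-left corner after re-orientation."""
    px, py = offset
    if orientation == "R0":
        return (px, py)
    if orientation == "R180":
        return (width - px, height - py)
    if orientation == "MY":
        return (width - px, py)
    if orientation == "MX":
        return (px, height - py)
    raise ValueError("unknown orientation: " + orientation)


def block_wirelength(origin, width, height, orientation, connections):
    """Manhattan length from each block pin to its fixed target point.

    connections: list of (pin_offset, target_xy) pairs. Only the nets of
    this one block are evaluated, which is what makes the update incremental.
    """
    ox, oy = origin
    total = 0
    for offset, (tx, ty) in connections:
        px, py = oriented_pin(offset, width, height, orientation)
        total += abs(ox + px - tx) + abs(oy + py - ty)
    return total


def best_orientation(origin, width, height, connections):
    """Greedy stand-in for one refinement move: pick the best orientation."""
    return min(
        ORIENTATIONS,
        key=lambda o: block_wirelength(origin, width, height, o, connections),
    )
```

For a 10 x 4 block at the origin with a single pin at offset (1, 1) connected to a point at (12, 3), the 180-degree rotation moves the pin to (9, 3) and gives the shortest connection.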
3.1.4 MIV Planning Algorithm
Once the 3D floorplanning result is obtained, TSVs or MIVs need to be inserted into the
whitespace between blocks. Since TSVs are big (around 5µm to 10µm), and there may
not be enough whitespace in the dies, a whitespace manipulation step is required. TSV
planners exist, and this project uses the planner from [29], which constructs a 3D rectilinear
Steiner tree (RST) from a 2D rectilinear Steiner minimum tree (RSMT). It then moves
TSVs to nearby whitespace based on a network-flow formulation. If there is insufficient
whitespace, it also inserts whitespace between blocks, at the cost of increased area.
However, in the case of monolithic 3D, MIVs are very small (around 70nm), and it can
be assumed that there is always whitespace available for MIV insertion. Since MIVs are
also the same size as local vias, existing obstacle-avoiding routers can be used to perform
MIV insertion. Commercial tools, such as the 2D IC router in Cadence SOC Encounter,
can therefore be used. However, this router is limited to handling only 15 metal layers. In
order to maximize the number of dies that can be supported, three metal layers are used to
represent the inter-block nets of a tier. For example, if a block is in tier 2, metal layer 4 is
used to place block-pins, and metal layers 5 and 6 are used to represent inter-block routing
on that tier. Vias between metal 6 and 7 represent MIVs between tiers 2 and 3. The choice
of the number of metal layers used is justified because only the inter-block nets are considered
during MIV planning, and they are usually routed in the top 2 or 3 metal layers of each tier.
Now, an MIV planning algorithm is presented assuming that the blocks are hard (block pin
locations are known). Next, this is extended for soft blocks.
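The tier-to-layer mapping described above can be written down as a small sketch. The function names are hypothetical; the constants follow the text (three router layers per tier, 15 layers in total), from which a limit of five tiers follows by simple division.

```python
LAYERS_PER_TIER = 3      # one pin layer plus two routing layers per tier
MAX_ROUTER_LAYERS = 15   # metal-layer limit of the 2D router, as stated in the text


def tier_layers(tier):
    """Router metal layers that represent a given tier (tiers count from 1)."""
    base = LAYERS_PER_TIER * (tier - 1)
    return {"pin": base + 1, "routing": (base + 2, base + 3)}


def miv_via_layers(lower_tier):
    """Pair of metal layers whose vias stand for MIVs between
    `lower_tier` and the tier above it."""
    top = LAYERS_PER_TIER * lower_tier
    return (top, top + 1)


def max_tiers():
    """Number of tiers that fit within the router's layer limit."""
    return MAX_ROUTER_LAYERS // LAYERS_PER_TIER
```

For tier 2 this reproduces the example in the text: pins on metal 4, routing on metals 5 and 6, and MIVs to tier 3 as vias between metals 6 and 7.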
3.1.4.1 MIV Planning for Hard Blocks
The MIV planning heuristic starts with creating a netlist that contains the connectivity
information of the pins of all the 3D nets, as shown in Lines 1–3 of Algorithm 2, where
$N_{net}$ denotes the total number of 3D nets. A DEF file that contains the physical location
of every pin of each block is then created. $x^p_{b_i}$ and $y^p_{b_i}$ denote the x and y
coordinates of pin p of block $b_i$, respectively, and $l_{b_i}$ denotes the metal layer that
block $b_i$ is assigned to. In addition, routing blockages for each block are added to account
for: (1) the fact that MIVs cannot be placed within the blocks and (2) the internal wiring of
each block (Lines 4–9). Next, the Verilog and DEF files are fed to SOC Encounter, which routes
all the 3D nets simultaneously (Lines 10 and 11). Simultaneous routing of all 3D nets avoids
any possible congestion issues due to the small size of MIVs. The routed DEF is parsed, and
the routing topology of each net is traced to determine (1) the net that each MIV belongs to,
and (2) the block-pin that each MIV connects to (Lines 12 and 13). Finally, Verilog and DEF
files for each tier that contain the block/MIV locations are generated (Lines 14 and 15).
Algorithm 2: MIV planning algorithm for hard blocks.
Input : Location of all blocks in $B$, block orientation, block-pin locations, and
connectivity information
Output : Number, location, and connectivity information of MIVs
1 for n ← 1 to $N_{net}$ do
2     add connectivity information into a Verilog file;
3 end
4 for i ← 1 to $|B|$ do
5     for p ← 1 to $N^{b_i}_{pin}$ do
6         add pin physical location $(x^p_{b_i}, y^p_{b_i}, l_{b_i})$ in the DEF;
7     end
8     add routing blockages for block $b_i$ in the DEF;
9 end
10 read the above Verilog and DEF files into SOC Encounter;
11 route the design and save the routed DEF file;
12 read the routed DEF file and reconstruct the routing graphs;
13 extract corresponding subnets in each die/tier from the routing graphs;
14 create Verilog file for each die/tier with subnet connectivity;
15 create DEF file for each die/tier with MIV locations;
3.1.4.2 MIV Planning for Soft Blocks
In the case of soft blocks, the block-pin locations are determined only after floorplanning is
finished. These block-pin locations are determined based on the inter-block connectivity,
as well as the locations of any MIVs present. From the discussion on hard blocks, the MIV
locations depend on the block-pin locations as well. This is a chicken-and-egg problem,
and an iterative method to determine both the MIV and the block-pin locations is presented
in Figure 35.
Figure 35: Iterative MIV planning algorithm for soft blocks.
Given the block outlines from the floorplanner, the block pins are first assumed to be in
the center of the block. Next, for each 3D net, the optimal MIV location can be roughly
given as the center of its 3D bounding box. However, this approach will lead to overlap
between blocks and MIVs, as well as between MIVs themselves, as shown in Figure 36(a).
Figure 36: Illustration of MIV planning for soft blocks. (a) Initial estimated MIV locations.
(b) After one iteration of MIV planning.
With these initial MIV locations, Verilog and DEF files are created for each tier. Cadence
Encounter is then used to open each tier separately, and to determine the block-pin
locations based on the estimated MIV locations. These block-pin locations can then be
fed into the MIV planner for hard blocks to determine updated MIV locations, as shown
in Figure 36(b). This entire process can be repeated until the MIV locations stabilize. In
practice, only one or two iterations are required, as the locations converge quickly. Once
the MIV locations are finalized, each block and tier can be placed and routed separately in
Cadence Encounter.
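The initial estimate used to start this iteration can be sketched as follows. The pin format (x, y, tier) is hypothetical, and the sketch assumes one MIV per tier boundary crossed by the net, all placed at the center of the bounding box; later iterations would replace these estimates with the routed locations.

```python
def initial_miv_estimates(pins):
    """Rough starting MIV locations for one 3D net.

    pins: list of (x, y, tier) tuples, with pins initially taken at block
    centers. Returns one (x, y, lower_tier) entry per tier boundary that
    the net spans (an assumption of this sketch), all located at the
    center of the net's bounding box.
    """
    xs = [x for x, _, _ in pins]
    ys = [y for _, y, _ in pins]
    tiers = [t for _, _, t in pins]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    return [(cx, cy, t) for t in range(min(tiers), max(tiers))]
```

Because every MIV of a net starts at the same (x, y) point, overlaps of the kind shown in Figure 36(a) are expected at this stage.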
3.2 Floorplan Quality Evaluation
This section evaluates the quality of the floorplan engine, as well as the quality of
monolithic 3D vs. TSV-based floorplans. All required code and scripts are implemented in C/C++
and Python, and all experiments are carried out on a 2.5 GHz 64-bit Linux system. The
45nm Nangate open source standard cell library is used in the experiments. The TSV
diameter, landing pad size, pitch, and thickness are assumed to be 6µm, 7µm, 10µm, and 50µm,
respectively. The MIV diameter, pitch and thickness are 0.07µm, 0.28µm and 0.31µm,
respectively. The TSV resistance and capacitance are 50mΩ and 122fF, respectively. These
parasitics are measured values, taken from [64]. The MIV resistance and capacitance are
similar to those of local vias and are 4Ω and 1fF, respectively, and six metal layers per tier
are assumed.
3.2.1 Experimental Setup
Four benchmarks are considered, and their statistics are shown in Table 8. The first three
are taken from the OpenCores benchmark suite [51], and the fourth is a custom-built 256-
bit integer multiplier. This multiplier is built out of 256×4-bit multiplier and 512-bit adder
blocks, arranged into an adder tree. Each multiplier block has 3 pipeline stages and each
adder block has 4 pipeline stages.
Table 8: Design statistics for all benchmarks.
Design         # Gates     # Blocks   # Inter-block nets   Intra-block WL (µm)   Target period (ns)
des_perf       33,024      38         2,378                210,488               0.9
cf_rca_16      146,542     95         3,135                1,210,618             1.3
cf_fft_256_8   288,145     49         1,402                4,490,813             1.5
mult_256_256   1,639,050   127        49,471               12,354,340            0.845
In this section, evaluation is carried out with hard blocks, and the design flow
used to obtain all results is shown in Figure 37. It consists of roughly two steps: block
design, and top-level design and analysis.
Figure 37: Our design flow used to get post-layout simulation results.
Each block is first designed separately in Cadence SOC Encounter. The netlist for each
block is obtained by grouping modules bottom up along the hierarchy, until they reach a
certain area threshold. Timing constraints for each block depend on the overall system
frequency, and are determined by context characterization. Each block is then placed,
routed and timing optimized in SOC Encounter. This step finalizes the pin locations within
each block.
These blocks are then fed into the floorplanner to obtain block and MIV locations. After
each die is routed separately in SOC Encounter, parasitic extraction is performed to obtain
the SPEF files for each die. In addition, a top-level Verilog file and SPEF file are created,
which contain inter-die connectivity and TSV/MIV parasitics, respectively. All netlist and
parasitic information is then fed into Synopsys PrimeTime to obtain true 3D timing and
power numbers. Sample layouts of block designs as well as 2D floorplanning and 2-die
implementations of cf_rca_16 are shown in Figure 38.
Figure 38: Sample layouts for the cf_rca_16 testcase, along with select block designs, and
zoomed-in shots of TSVs and MIVs.
3.2.2 Floorplanner Validation
The proposed floorplanner is run in 2D mode, and compared with the results obtained from
wirelength-driven floorplanning in Cadence Encounter. The Encounter footprint area is
obtained by gradually increasing the area and running floorplanning until no block overlap
is observed. The results are summarized in Table 9.
Table 9: Comparison between the proposed floorplanner and Cadence Encounter.
               Footprint (mm2)                  Inter-block WL (m)
Design         Encounter       This Project     Encounter       This Project
des_perf       0.0655 (1.00)   0.0604 (0.92)    0.352 (1.00)    0.356 (1.01)
cf_rca_16      0.445 (1.00)    0.413 (0.93)     0.361 (1.00)    0.368 (1.02)
cf_fft_256_8   1.690 (1.00)    1.141 (0.68)     0.414 (1.00)    0.437 (1.06)
mult_256_256   5.198 (1.00)    4.896 (0.94)     17.01 (1.00)    17.87 (1.05)
Average        1.00            0.87             1.00            1.035
As seen from this table, the proposed floorplanner produces results comparable to those of
SOC Encounter. The large area reduction in the cf_fft_256_8 design is due to the fact that
Cadence Encounter repeatedly produces module overlaps when provided with a smaller area.
This is presumably due to some bug in the legalization stage of SOC Encounter. It can still
provide comparable wirelength to our floorplanner, however, as this particular testcase is
only locally connected, and each block communicates with only one or two neighbours.
3.2.3 Monolithic 3D vs. TSV-based 3D
This section compares the intra-block as well as inter-block wirelength for each design
implemented in 2D as well as monolithic or TSV-based 3D. These results are summarized
in Table 10. From this table, it is observed that with respect to the inter-block
wirelength, monolithic 3D gives a significant advantage over 2D. The total wirelength reduction
depends upon the ratio of inter-block wirelength to intra-block wirelength, and varies
depending on the circuit. TSV-based 3D design, however, does not give any improvement
in wirelength for the small design des_perf, and small improvements begin to be seen in the
cf_rca_16 and cf_fft_256_8 testcases. However, with the largest design, no improvement is
visible, mainly because a large distance needs to be traversed from the block boundary to
the nearest whitespace block to insert TSVs.
Therefore, monolithic 3D can provide significant benefits over 2D even in the case of
small designs, while TSV-based 3D is suitable for designs with a large number of long
interconnections or memory-on-logic stacking applications.
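The normalized silicon area column of Table 10 appears to be consistent with the total silicon area of all tiers divided by the 2D Encounter footprint. The sketch below reproduces a few des_perf entries under that reading; the function name is illustrative, and the formula is inferred from the tabulated numbers rather than stated in the text.

```python
def normalized_silicon_area(width_um, height_um, num_tiers, ref_width_um, ref_height_um):
    """Total silicon area over all tiers, relative to a 2D reference footprint.

    Assumption: every tier/die has the same footprint, so the total silicon
    area is num_tiers * width * height.
    """
    return (num_tiers * width_um * height_um) / (ref_width_um * ref_height_um)
```

For des_perf, with the 256 x 256 Encounter footprint as the reference, the two-tier MIV design (146 x 211) gives about 0.94 and the two-die TSV design (215 x 323) about 2.12, matching the table and showing how much silicon the TSVs and their whitespace consume in a small design.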
3.3 Inter-Tier Performance Differences
Although it has been demonstrated that monolithic 3D ICs offer significant advantages, it
has so far been assumed that both tiers have identical performance. In reality, one or more
of the tiers suffers from degraded performance, due to limitations in the current fabrication
process. This section discusses the source of these differences and how to model them.
3.3.1 Source of Inter-Tier Performance Differences
The fabrication process was shown in Figure 1. During the fabrication process of the top
tier, a low-temperature transistor process is key to prevent damage to the devices of the
bottom tier. It has been demonstrated [65] that transistors can be fabricated at temperatures
down to 625°C without any loss of performance. While this is sufficient to prevent damage
to the devices, this temperature is still too high to prevent damage to the copper BEOL.
Table 10: A comparison of wirelength, timing and top net power of 2D versus 3D.
Type             Footprint       Norm.      #MIV/     Inter-block            Total routed
                 (µm × µm)       Si. Area   #TSV      routed WL (µm)         WL (µm)
des_perf
2D   Encounter   256x256         1          -         352,805 (1.00)         563,293 (1.00)
2D   Ours        251x241         0.92       -         356,489 (1.01)         566,977 (1.01)
MIV  2 Tiers     146x211         0.94       1,800     267,678 (0.76)         478,166 (0.85)
MIV  3 Tiers     127x179         1.04       2,738     222,240 (0.63)         432,728 (0.77)
MIV  4 Tiers     111x149         1.01       3,823     204,868 (0.58)         415,356 (0.74)
TSV  2 Dies      215x323         2.12       120       473,092 (1.34)         683,580 (1.21)
TSV  3 Dies      320x235         3.44       456       515,267 (1.46)         725,755 (1.29)
TSV  4 Dies      359x402         8.81       984       734,739 (2.08)         945,227 (1.68)
cf_rca_16
2D   Encounter   667x667         1          -         361,673 (1.00)         1,572,291 (1.00)
2D   Ours        555x744         0.93       -         367,542 (1.02)         1,578,160 (1.00)
MIV  2 Tiers     416x477         0.89       1,747     289,156 (0.80)         1,499,774 (0.95)
MIV  3 Tiers     367x370         0.92       2,925     255,910 (0.71)         1,466,258 (0.93)
MIV  4 Tiers     273x384         0.94       3,936     240,583 (0.67)         1,451,201 (0.92)
TSV  2 Dies      484x418         0.91       156       354,347 (1.07)         1,564,965 (1.00)
TSV  3 Dies      377x370         0.94       334       401,425 (1.11)         1,612,043 (1.03)
TSV  4 Dies      350x349         1.10       477       345,090 (0.95)         1,555,708 (0.99)
cf_fft_256_8
2D   Encounter   1,300x1,300     1.00       -         413,674 (1.00)         4,904,487 (1.00)
2D   Ours        1,142x999       0.68       -         436,933 (1.06)         4,927,746 (1.00)
MIV  2 Tiers     819x718         0.70       1,050     263,787 (0.64)         4,754,600 (0.97)
MIV  3 Tiers     581x799         0.82       1,921     254,256 (0.61)         4,745,069 (0.97)
MIV  4 Tiers     595x594         0.84       2,475     269,049 (0.65)         4,759,862 (0.97)
TSV  2 Dies      679x932         0.75       75        369,166 (0.89)         4,859,979 (0.99)
TSV  3 Dies      653x674         0.78       147       357,592 (0.86)         4,848,405 (0.99)
TSV  4 Dies      584x527         0.73       377       422,216 (1.02)         4,913,029 (1.00)
mult_256_256
2D   Encounter   2,280x2,280     1.00       -         17,089,968 (1.00)      29,444,308 (1.00)
2D   Ours        2,144x2,284     0.94       -         17,870,346 (1.05)      30,224,686 (1.03)
MIV  2 Tiers     1,506x1,718     1.00       48,513    13,815,376 (0.81)      26,169,716 (0.89)
MIV  3 Tiers     1,286x1,295     0.96       79,682    11,392,196 (0.67)      23,746,536 (0.81)
MIV  4 Tiers     1,177x1,131     1.02       102,994   10,116,222 (0.59)      22,470,562 (0.76)
TSV  2 Dies      1,608x1,616     1.00       1,683     18,825,744 (1.10)      31,180,084 (1.06)
TSV  3 Dies      1,508x1,236     1.08       3,599     21,184,404 (1.24)      33,538,744 (1.14)
TSV  4 Dies      1,240x1,190     1.14       4,232     20,890,062 (1.22)      33,244,402 (1.13)
This problem can be avoided by using tungsten as the interconnect material on the
bottom tier [2], which degrades the interconnects. If, however, copper must be used in
the bottom tier, the top tier needs an alternate manufacturing process such as laser scan
anneal [55], which degrades the top-tier transistors. Therefore, the choice is between
degraded interconnects on the bottom tier or degraded transistors on the top tier. This section
discusses the modelling of these performance degradations.
3.3.2 Degraded Interconnects on the Bottom Tier
Tungsten has several attractive properties that make it a suitable choice for nano-scale
interconnects. It has a much higher melting point than copper (3422°C vs. 1085°C), so no
low-temperature process is needed for the top tier. It also does not diffuse into silicon,
eliminating the need for a diffusion barrier and preventing any copper contamination issues
during FEOL processing of the top tier. It also has much higher electromigration resistance
and can be etched similarly to aluminium, eliminating the need for a damascene process.
However, tungsten has a bulk resistivity 3.1× that of copper, which has so far prevented its
widespread use.
When interconnects shrink, the bulk resistivity no longer applies, and resistivity goes
up due to effects such as line edge roughness, sidewall scattering, and grain boundary

































Most of these quantities are empirically fitted, and an explanation of the various parameters
and a choice of their values for both copper and tungsten are listed in Table 11.
Using this equation, the resistivity of both tungsten and copper interconnects is plotted
in Figure 39. This curve is in close agreement with measured data from IBM [6]. It is
observed that the degradation of resistivity due to tungsten is significantly lower at lower
widths. It should be noted that a 3nm diffusion barrier was assumed for both tungsten and
copper. In reality, tungsten does not diffuse into the ILD, and a diffusion barrier is not
strictly necessary. This makes the tungsten numbers pessimistic, and its resistivity will be
lower in practice.
Table 11: Various interconnect parameters.
Parameter   Description                       Copper     Tungsten
w0          Width
ρ0          Bulk Resistivity (µΩ-cm)          1.68       5.28
u           Line Edge Roughness               0.4w0      0.4w0
h0          Height (Thickness)                1.8w0      1.8w0
d           Dist. Between Grain Boundaries    w0         w0
λ           Electron Mean Free Path (nm)      39 [54]    19.1 [9]
p           Sidewall Specularity              0.2 [54]   0.3 [60]
R           Grain Boundary Reflectivity       0.3 [54]   0.25 [60]
α           (λ/d) · R/(1−R)
Figure 39: Copper vs. Tungsten resistivity at different wire widths.
Using these resistivity values, the change in the interconnect resistivity for the Nangate
45nm library is tabulated in Table 12. From this table, it is observed that the local metal
lines degrade less than the global metal lines. These modified resistivity values are used
to generate an interconnect technology file (.ict), which is fed into Cadence QRC Techgen to
re-characterize the interconnect extraction libraries.
Table 12: The change in resistivity values of different metal layers in the Nangate 45nm
library due to Tungsten interconnects.
Layer              Width (nm)   Thickness (nm)   ρ(W) / ρ(Cu)
Metal1 – Metal3    70           140              2.38
Metal4 – Metal6    140          280              2.67
Metal7 – Metal8    400          800              2.94
Metal9 – Metal10   800          2000             3.04
3.3.3 Degraded Transistors on the Top Tier
If copper is to be used on the bottom tier, laser-scan anneal has been proposed for the
dopant activation on the top tier [55]. This results in localized heating in the source/drain
regions, thereby preventing damage to the underlying devices and interconnects. However,
the process is not yet mature, and transistor performance identical to that of a high-temperature
anneal has not yet been obtained. The PMOS and NMOS performance degrade by 27.8%
and 16.2%, respectively [55]. This is referred to as the TTm20p corner, as on average, the
performance is worse by roughly 20%. However, this work is from several years ago,
and improvements in the process are bound to be made. Therefore, to represent fabrication
progress, another corner, TTm10p, is defined, which has a PMOS and NMOS degradation
of 13.9% and 8.1%, respectively, which is exactly half that of the TTm20p corner. The
transistor parameters in the Nangate 45nm library are modified to represent this degradation,
and the IV curves of the nominal and degraded transistors are plotted in Figure 40.
Figure 40: IV curves of nominal and degraded transistors.
These modified transistor models are used to re-characterize all the std. cell libraries
using Encounter Library Characterizer. The resulting performance of select std. cells at
maximum loading is tabulated in Table 13. In addition to re-characterization at different
transistor corners, tungsten interconnects also increase the internal parasitics of std. cells.
The std. cells are also re-characterized under this condition, and this corner is named TT_W.
Table 13: Minimum size (X1) std. cell average delay (in ps), assuming worst loading, at
different corners.
Std. Cell    TT              TTm10p          TTm20p          TT_W
NAND2        221.8 (1.00)    243.9 (1.10)    265.2 (1.19)    222.35 (1.00)
AOI211       154.5 (1.00)    173.8 (1.12)    192.9 (1.25)    154.97 (1.00)
XOR2         163.42 (1.00)   187.6 (1.14)    210.85 (1.28)   163.86 (1.00)
DFF Clk-Q    213.1 (1.00)    243.8 (1.14)    277.7 (1.30)    214.05 (1.00)
DFF Setup    40.29 (1.00)    50.95 (1.26)    58.11 (1.44)    43.86 (1.08)
From this table, it is observed that the cell delays for simple gates such as NAND
roughly follow the average of NMOS and PMOS degradation, while complex gates are
more or less dominated by PMOS degradation. In addition, it is observed that the setup
time of the flip-flops degrades at a much higher rate than either the NMOS or the PMOS.
Tungsten interconnects have only a minimal impact on the gate performance, as the wires within
the std. cells are very small, and the resistance is dominated by the RON of the transistor.
In summary, two choices exist: (1) Use tungsten on the bottom tier and deal with
degraded interconnects and marginally worse std. cells, or (2) Use copper on the bottom tier
and deal with significantly degraded std. cells on the top tier. This chapter studies both
options and compares and contrasts them.
3.4 Performance-Difference-Aware Design and Analysis Flow
This section first describes how the floorplanner is modified such that designs become less
sensitive to inter-tier performance differences. It then describes how timing and power




In most designs, not every block is timing critical. Although non-timing critical blocks
can operate faster, they are synthesized at the frequency of the critical block to save area
and power. Therefore, even with degraded transistors, these blocks can be synthesized to
operate at the frequency of the critical block, albeit with a larger area. As long as the
critical blocks do not operate with slower transistors or interconnects, the chip can still
meet timing.
Given the block RTL and timing constraints, four different versions of each block are
synthesized: one for the nominal corner, and one for each of the degraded libraries. In the
case of tungsten interconnects, in addition to the modified standard cell libraries, the
resistivity of the wire load models is modified to accurately drive synthesis. For each version of
the block, the area and longest path delay (LPD) through it are noted. An illustration of this
synthesis is shown in Figure 41, where all the blocks in a particular design are synthesized
at all four corners.








[Figure 41 plot annotations: "TT_W only slightly larger than TT"; "Small blocks are most critical".]
Figure 41: Synthesis results of “des3” benchmark for different degradations.
This figure plots the block area vs. the longest path delay through it. Each point on this
plot is a single block. As seen from this graph, the largest blocks in this benchmark are not
timing critical. For all of the degraded transistor and interconnect options, they have the
same frequency and area. However, the smallest blocks seem to be the most timing critical.
They require a much larger area (buffers) to try to meet timing, and it is still not possible.
Given that the design has inter-tier performance differences, each block will have a
different area and LPD depending on the tier in which it lies. The premise of the modified
floorplanner is to move the timing critical blocks to the tier that is not degraded. The
non-timing critical blocks, although on a slower tier, can be optimized to meet the
system frequency. An overview of the inter-tier performance-difference-aware floorplanner is
shown in Figure 42.
Figure 42: The proposed inter-tier performance difference aware floorplanner.
If $LPD(b_i)$ is the tier-dependent longest path delay of a block $b_i$, the modified cost
function of the floorplanner is defined as:





In the above equation, WL refers to the wirelength. The area of a block is also dependent
on its tier. Therefore, whenever a 3D move is made, the area of all the blocks that have
changed their tier is also updated. The third term in the above equation will try to place the
timing critical blocks in the faster tier, and the non-timing critical blocks in the slower tier.
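A minimal sketch of how tier-dependent area and LPD could enter a cost evaluation is given below. The data layout, the function name, the use of summed block area in place of the chip footprint, and in particular the form of the timing term (the worst block LPD) are all assumptions made for illustration; they are not the cost function of this chapter.

```python
def tier_aware_cost(blocks, assignment, tier_corner, wirelength, w_wl, w_area, w_lpd):
    """Illustrative cost with tier-dependent block area and LPD.

    blocks:      {block_name: {corner_name: (area, lpd)}}, one entry per
                 synthesized version of the block
    assignment:  {block_name: tier}
    tier_corner: {tier: corner_name}
    The timing term is assumed to be the worst block LPD (hypothetical).
    """
    total_area = 0.0
    worst_lpd = 0.0
    for name, versions in blocks.items():
        # Look up the version of the block that matches the corner of its tier.
        area, lpd = versions[tier_corner[assignment[name]]]
        total_area += area
        worst_lpd = max(worst_lpd, lpd)
    return w_wl * wirelength + w_area * total_area + w_lpd * worst_lpd
```

With such a term, moving a timing-critical block from a degraded tier to the nominal tier lowers the cost even when wirelength and total area are unchanged, which is the behaviour described above.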
An illustration of the modified floorplanner is shown in Figure 43. This figure assumes
that the top tier is at the TTm20p corner, and the floorplanning is carried out for the same
benchmark shown in Figure 41. In this figure, it is observed that the performance-difference-aware
floorplanner moves the smaller, more timing critical blocks to the bottom tier, so they















Figure 43: Floorplan screenshots of “des3” when the top tier is at the TTm20p corner. (a)
Without performance difference aware floorplanning, and (b) With performance difference
aware floorplanning.
3.4.2 Performance-Difference-Aware Analysis
The floorplanner gives the corner in which each block operates. Once the placed and routed
netlists of all the blocks and tiers are available, they are loaded into Synopsys PrimeTime.
The appropriate standard cell library is chosen for each cell depending on the tier in which
it lies. The extraction tech file for each block and tier is also modified depending on the
interconnect material, and the appropriate parasitics are loaded into Synopsys PrimeTime.
A top-level netlist and parasitic file is created to represent the MIV connectivity and
parasitics. According to [2], if the inter-tier oxide thickness is greater than or equal to 100 nm,
there is negligible inter-tier coupling. Therefore, any such coupling is ignored. Once all
the netlists and parasitics are loaded, 3D timing and statistical power analysis is performed.
3.5 Power-Performance Study
One benchmark is chosen from the OpenCores benchmark suite (des3), one from the IWLS
benchmark suite (b19), and one custom 128-bit integer multiplier is designed. These
benchmarks are designed using the Nangate 45 nm library, and their statistics are tabulated in
Table 14. The cell counts shown are the synthesis results without any wire load models. In
all the 3D implementations considered, the diameter of an MIV is assumed to be 100 nm,
with a resistance of 2 Ω and a capacitance of 0.1 fF [34].
Table 14: Benchmarks used for evaluation.
Benchmark   #Blocks   #Gates    #Inter-Block Nets
des3        55        63,194    6,138
b19         55        78,852    14,223
mul128      63        253,867   12,447
3.5.1 Identical Performance on Both Tiers
This section discusses the case where both tiers in 3D have identical transistors and
interconnects. This represents an ideal manufacturing process, and represents the best possible
case for monolithic 3D. Initial floorplanning is first performed to derive wire load models
for each benchmark. Floorplanning is carried out again, and basic floorplan comparisons
for 2D and 3D are tabulated in Table 15. In addition to these two flavors, an “ideal”
block-level implementation is defined. This implementation is obtained by assuming that all the
inter-block nets have zero length and parasitics. During the block implementation, the
output load of the blocks is set to be zero and the inputs are assumed to be driven by ideal
drivers. This is the lower bound on any block-level implementation of this design, given
the same set of blocks, and the constraint that each block is implemented in 2D.
Table 15: Basic floorplan comparisons assuming both tiers have the same performance.
Ckt.     Flavor   #Gates (×10³)   Footprint (mm²)   Total WL (m)    # MIV (×10³)
des3     2D       68.9 (1.00)     0.328 (1.00)      1.514 (1.00)    -
         3D       66.2 (0.96)     0.156 (0.48)      1.287 (0.85)    3.75
         Ideal    64.4 (0.94)     -                 0.938 (0.62)    -
b19      2D       82.3 (1.00)     0.398 (1.00)      3.341 (1.00)    -
         3D       80.62 (0.98)    0.204 (0.51)      2.847 (0.85)    13.46
         Ideal    79.35 (0.96)    -                 1.838 (0.55)    -
mul128   2D       251 (1.00)      1.096 (1.00)      4.693 (1.00)    -
         3D       245 (0.97)      0.550 (0.50)      4.447 (0.95)    7.261
         Ideal    235 (0.93)      -                 3.271 (0.70)    -
From this table, it is observed that monolithic 3D leads to significantly shorter
wirelength. Although the inter-block wirelength is always significantly reduced, the total
wirelength reduction depends on the intra-block wirelength as well. Benchmarks such as
“mul128” have most of the wirelength within the block, so there is not much total
wirelength reduction. In addition, shorter wires lead to fewer gates (buffers) being required.
Next, the power-performance trade-off for each of the three different implementations
is studied. In order to get the numbers for the ideal implementation, the parasitics of all
inter-block nets are forced to zero in Synopsys PrimeTime. In addition to the nominal VDD
of 1.1 V, the standard cell libraries are characterized at four additional VDD values covering
±10% of nominal VDD (1.00 V, 1.05 V, 1.10 V, 1.15 V, 1.20 V). The power and frequency are
measured at each of these VDD values and the resulting curves are plotted in Figure 44.
From this figure, it is observed that 3D usually offers a performance advantage (at the
same power) over 2D, and it closes the gap to ideal by up to 50%. This additional
performance can be traded for power savings to meet the 2D frequency, and up to a 16.1%
reduction in power is observed. In these curves, the ideal implementation of “b19” requires
extrapolation to make iso-performance power comparisons at the nominal 2D frequency.
Such a comparison is avoided due to inaccuracies that are bound to be introduced by
extrapolation.
Figure 44: Power-performance trade-off curves assuming that both the tiers have identical
transistors and interconnects.
The absolute gains in the “mul128” benchmark are small because the critical path always
lies within a single block. Since the inter-block nets are not timing critical, shortening them
does not make the design faster, and there is no additional performance to trade for power.
Making this design faster will require architectural modifications such as block folding.
3.5.2 Impact of Inter-Tier Performance Differences
The performance difference aware floorplanner (PDAFP) is run on all benchmarks for each
degraded option, and the basic floorplan comparisons are tabulated in Table 16. The
numbers are normalized to the respective 2D numbers in Table 15.
Table 16: Basic floorplan comparisons for different degraded 3D options. The numbers
are normalized to the respective 2D numbers in Table 15.
Ckt.     Flavor       #Gates (×10³)   Footprint (mm²)   Total WL (m)   # MIV (×10³)
des3     Top=TTm10p   68.1 (0.99)     0.159 (0.49)      1.29 (0.85)    3.92
         Top=TTm20p   67.2 (0.98)     0.177 (0.54)      1.44 (0.95)    5.67
         Bot=TT_W     66.8 (0.97)     0.153 (0.47)      1.31 (0.87)    3.11
b19      Top=TTm10p   80.8 (0.98)     0.212 (0.53)      2.84 (0.85)    11.6
         Top=TTm20p   82.0 (1.00)     0.222 (0.56)      2.90 (0.87)    11.3
         Bot=TT_W     80.8 (0.98)     0.208 (0.52)      2.91 (0.87)    12.9
mul128   Top=TTm10p   247 (0.98)      0.574 (0.52)      4.35 (0.93)    4.48
         Top=TTm20p   249 (0.99)      0.575 (0.52)      4.38 (0.94)    4.48
         Bot=TT_W     246 (0.98)      0.568 (0.52)      4.29 (0.91)    4.48
As observed from this table, all of the degraded options use more gates than the case
when both tiers have identical performance. However, the gate counts are still less than 2D.
Similarly, both the footprint area and the wirelength are increased from the non-degraded
case, but are still less than 2D. The only exception is the “mul128” benchmark, when the
bottom tier is at the TT_W corner. This has a slightly lower wirelength than the
non-degraded option, but this is due to the trade-off with footprint area.
Next, the power-performance trade-off curves for the degraded transistors and
interconnects are plotted in Figure 45. For the sake of comparison, degraded transistors and
interconnects on top of a non-PDAFP floorplanning solution are also plotted.
As observed from this figure, the performance difference aware floorplanner (PDAFP)
always outperforms the non-PDAFP one. The top tier having TTm20p transistors is always
worse than 2D, except in the case of “mul128”. After PDAFP, the top tier with TTm10p
transistors always becomes better than 2D. Finally, tungsten interconnects on the bottom
tier are by far the best option, and although there is negligible timing degradation compared
to the identical tiers case, some power overhead exists.
Figure 45: Power-performance trade-off curves assuming degraded transistors and
interconnects. Dashed lines represent non performance difference aware floorplanning and
solid lines represent performance difference aware floorplanning.
To summarize the impact of PDAFP, the iso-power frequency and iso-performance
power for different benchmarks are tabulated in Table 17. The comparison point for each of
the three benchmarks is the respective 2D power and frequency at nominal VDD. If a
particular point is not achievable within ±10% of nominal VDD, and extrapolation is required,
it is marked with a ‘-’.
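As an illustration only, the iso-power frequency and iso-performance power readings described above can be sketched as linear interpolation along a measured power-frequency curve. This is not the flow used in this work: the function names and the choice of linear interpolation are assumptions, and a point that would need extrapolation is returned as None, mirroring the ‘-’ entries.

```python
from typing import Optional, Sequence, Tuple

# One measured point per VDD value: (frequency in GHz, power in mW).
Point = Tuple[float, float]


def interpolate(points: Sequence[Point], target: float, key: int) -> Optional[float]:
    """Linearly interpolate along a power-frequency curve.

    key=0: `target` is a frequency, the power at that frequency is returned.
    key=1: `target` is a power, the frequency at that power is returned.
    None is returned when `target` lies outside the measured range, i.e.
    when extrapolation would be required.
    """
    pts = sorted(points, key=lambda p: p[key])
    other = 1 - key
    for lo, hi in zip(pts, pts[1:]):
        if lo[key] <= target <= hi[key]:
            if hi[key] == lo[key]:
                return lo[other]
            t = (target - lo[key]) / (hi[key] - lo[key])
            return lo[other] + t * (hi[other] - lo[other])
    return None


def iso_performance_power(curve: Sequence[Point], f_ref: float) -> Optional[float]:
    """Power needed by `curve` to run at the reference (2D) frequency."""
    return interpolate(curve, f_ref, key=0)


def iso_power_frequency(curve: Sequence[Point], p_ref: float) -> Optional[float]:
    """Frequency reached by `curve` at the reference (2D) power."""
    return interpolate(curve, p_ref, key=1)
```

Each curve would hold the five (frequency, power) points measured at the five VDD values, and the reference point would be the 2D frequency and power at nominal VDD.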
From this table, PDAFP improves the iso-power performance by up to 12.6% and the
iso-performance power by up to 10.6%. The non-PDAFP floorplan results are often not
able to meet the 2D frequency even with a 10% VDD boost. If the VDD was increased
further so that they could meet timing, PDAFP would show even more benefit.
Table 17: Impact of performance difference aware floorplanning (PDAFP). ‘-’ indicates that the point is not achievable within ±10% VDD.
Ckt.     Parameter               Top=TTm10p                   Top=TTm20p                    Bot=TT_W
                                 Non-PDAFP   PDAFP            Non-PDAFP   PDAFP             Non-PDAFP   PDAFP
des3     iso-power freq. (GHz)   1.233       1.259 (+2.1%)    1.14        1.19 (+4.4%)      1.222       1.28 (+4.7%)
         iso-freq. power (mW)    507.746     479.1 (-5.6%)    -           547.65 (-)        519.48      464.55 (-10.6%)
b19      iso-power freq. (GHz)   0.417       0.424 (+1.7%)    0.396       0.396 (+0%)       0.432       0.439 (+1.6%)
         iso-freq. power (mW)    151.723     144.58 (-4.7%)   173.14      172.828 (-0.2%)   135.06      135.06 (+0%)
mul128   iso-power freq. (GHz)   0.737       0.793 (+7.6%)    0.692       0.779 (+12.6%)    -           0.793 (-)
         iso-freq. power (mW)    -           892.95 (-)       -           922.53 (-)        -           887.37 (-)
Table 18: Iso-power performance and iso-performance power results for all implementation flavors.
Ckt.     Parameter               2D       Ideal             3D
                                                            Both=TT           Top=TTm10p       Top=TTm20p         Bot=TT_W
des3     iso-power freq. (GHz)   1.222    1.411 (+15.5%)    1.293 (+5.8%)     1.259 (+3.0%)    1.19 (-2.6%)       1.28 (+4.7%)
         iso-freq. power (mW)    519.48   372.06 (-28.4%)   458.45 (-11.7%)   479.1 (-7.8%)    547.65 (+5.4%)     464.55 (-10.6%)
b19      iso-power freq. (GHz)   0.408    0.5 (+22.5%)      0.439 (+7.6%)     0.424 (+3.9%)    0.396 (-2.9%)      0.439 (+7.6%)
         iso-freq. power (mW)    157.05   - (-)             131.81 (-16.1%)   144.58 (-7.9%)   172.828 (+10.0%)   135.06 (-14.0%)
mul128   iso-power freq. (GHz)   0.779    0.807 (+3.6%)     0.793 (+1.8%)     0.793 (+1.8%)    0.779              0.793 (+1.8%)
         iso-freq. power (mW)    922.53   810.56 (-12.1%)   859.15 (-6.9%)    892.95 (-3.2%)   922.53             887.37 (-3.8%)
3.5.3 Overall Comparisons
The iso-power performance and iso-performance power for 2D, ideal, the non-degraded
monolithic 3D, as well as the PDAFP results for degraded monolithic 3D are tabulated in
Table 18.
From this table, it is clearly seen that tungsten interconnects on the bottom tier
outperform degraded transistors on the top tier. This option is preferable from the manufacturing
perspective as well, as the process is already available. Even with tungsten interconnects
on the bottom tier, the gap to the ideal block-level implementation can be closed by up to
50% w.r.t. performance and 36% w.r.t. power.
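One plausible reading of "closing the gap to ideal" is the fraction of the 2D-to-ideal difference that a 3D flavor recovers. The helper below is a sketch under that assumption; the name `gap_closed` is hypothetical, and the sample values are the “mul128” iso-power frequencies from Table 18.

```python
def gap_closed(value_2d: float, value_3d: float, value_ideal: float) -> float:
    """Fraction of the 2D-to-ideal gap recovered by a 3D implementation.

    Assumed definition: (3D - 2D) / (ideal - 2D). It works for frequency
    (ideal > 2D) and for power (ideal < 2D), since both differences then
    carry the same sign.
    """
    return (value_3d - value_2d) / (value_ideal - value_2d)


# "mul128", iso-power frequency in GHz: 2D, Bot=TT_W, and ideal.
mul128_fraction = gap_closed(0.779, 0.793, 0.807)
print(f"{100 * mul128_fraction:.0f}% of the performance gap is closed")
```

Under this definition the “mul128” frequencies recover about half of the gap, which is consistent with the 50% performance figure quoted above.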
3.5.4 Block Folding
As mentioned in Subsection 3.5.1, the “mul128” benchmark has very limited benefit in
block-level 3D due to the fact that the critical path is within a single block. This block is a
128 × 4 multiplier. In this benchmark, there are 32 such blocks. Each of these blocks has
only 4,906 gates when synthesized without any wire load models, and is too small to be
folded using other 3D technologies such as TSV-based 3D. This section demonstrates how
monolithic 3D can help to increase the chip performance and decrease the chip power by
folding this one block.
In order to perform 3D block folding, the gate-level 3D placer presented in [28] is used.
Once the locations of all gates are determined, MIV insertion is performed by tricking the
2D router, similar to the method presented for block-level designs.
This block is first synthesized without any wire load models, implemented in 2D and
3D, and then re-synthesized using the derived wire load models. This is then placed, and the
resulting footprint and wirelength comparisons are shown in Table 19. The corresponding
screenshots are shown in Figure 46.
Table 19: Placement results for the 128 × 4 multiplier block.
Flavor   #Gates         Footprint (µm²)   WL (µm)         # MIV
2D       5,398 (1.00)   13,225 (1.00)     61,045 (1.00)   -
Figure 46: 3D placement layout snapshots of one 128 × 4 multiplier block within the
“mul128” benchmark.
From this table, block folding offers a 26% wirelength reduction, even for extremely
small blocks. The MIV density is approximately 50,000 per mm², which is significantly
higher than that offered by any other 3D integration technology. In addition, this comes at
zero area overhead.
Finally, similar to the block-level designs, the power-performance curves for 2D and
3D designs are plotted. In addition, since it has already been demonstrated that tungsten
interconnects are preferable to degraded transistors, the power-performance curves are also
plotted assuming that the bottom tier uses tungsten interconnects. These curves are shown
in Figure 47.
Figure 47: Power-performance trade-off curves for the 128 × 4 multiplier block.
As seen from this figure, even with degraded interconnects, a 5.7% performance boost
and a 12.6% power saving are obtained. The impact due to tungsten is minimal, as such
small blocks are almost always transistor dominated. The above results suggest an
alternate design methodology for monolithic 3D ICs. Every block is folded using tungsten
interconnects on the bottom tier. This comes at a negligible performance hit, as the blocks
are gate dominated. Next, since each block has a reduced footprint, assembling these 3D
blocks together will reduce the chip footprint, leading to shorter wires between blocks. The
timing critical buses between the blocks can then be routed using the global metal layers of
the top tier, using copper interconnects, at no performance loss.
3.6 Summary
This chapter presented physical design techniques for block-level monolithic 3D ICs
under real world considerations. First, a floorplanning framework was presented, and it was
demonstrated that this engine produces results comparable to commercial engines. Next,
it was demonstrated that even in coarse-grained integration such as block level, monolithic
3D significantly outperforms other 3D styles such as TSV-based 3D.
Inter-tier performance differences that arise due to an immature fabrication process
were described, and two options for monolithic 3D ICs were discussed and modeled. A
performance difference aware floorplanner was presented, and it was demonstrated that
using this floorplanner, monolithic 3D still shows significant benefits compared to 2D ICs.
Finally, it was demonstrated that tungsten interconnects on the bottom tier are preferable to
degraded transistors on the top tier.
CHAPTER IV
PHYSICAL DESIGN FOR GATE-LEVEL MONOLITHIC 3D ICS
So far, block-level monolithic 3D ICs have been discussed. However, the potential benefit
offered is limited, as this style does not fully take advantage of the high integration density
offered. In contrast, the gate-level design style naturally lends itself to monolithic 3D ICs.
Existing standard cells and memory can simply be reused, placed onto multiple tiers, and
MIVs used to connect them together. In addition, there is no silicon area overhead of doing
this. Out of the three design styles available for monolithic 3D ICs, gate-level offers the
greatest balance between integration density and reuse of existing libraries. The authors
of [4] provided a rudimentary design flow that is not capable of handling any hard macros
such as memory, and therefore cannot be applied to real designs.
The gate-level design style can also be applied to other stacking technologies such as
TSV-based 3D ICs and face-to-face 3D ICs. In TSV-based 3D ICs, the via size is so large
compared to the gate size that the power benefit is limited. However, face-to-face 3D ICs
offer only slightly larger via sizes than monolithic 3D, and can also be considered
fine-grained. Therefore, this chapter provides results on both face-to-face and monolithic 3D
integration.
This chapter first provides a routing congestion aware physical design framework that
modifies existing 2D placement engines for M3D placement, and also inserts MIVs into
the layout. Next, it discusses how commercial 2D engines can be used for M3D placement,
taking full advantage of state-of-the-art power and timing optimization techniques. Finally,
it discusses how to partition the gates in the design such that voltage-drop is minimized,
with a minimal impact on the temperature of the 3D chip.
4.1 Congestion-Aware Placement for Gate-level Monolithic 3D ICs
This section first formulates the problem, and then discusses how existing 2D placers can
be minimally modified for M3D placement. It then presents a congestion model, and uses it
to derive a congestion-driven placement algorithm. Finally, it presents results that demon-
strate the effectiveness and benefits of the proposed techniques.
4.1.1 Overall Design Flow
4.1.1.1 Problem Formulation
The “Projected 2D HPWL” is defined as the half perimeter wirelength (HPWL) of a
monolithic 3D IC if all the gates are projected onto a single placement layer. The total routing
overflow is defined as the sum of routing demand minus routing supply on all global
routing edges that are congested. The problem to be solved can then be stated as: Given an
initial monolithic 3D placement, repartition the gates with minimal change to the projected
2D HPWL, such that the total routing overflow is minimized.
However, this formulation still requires an initial monolithic 3D placement. Therefore,
the following problem is also solved: Generate a 2D design, using minimally modified 2D
tools, such that it represents a monolithic 3D IC with all the gates projected to a single tier.
If such a design is generated, then tier partitioning can directly be applied on top of it.
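The two quantities in this formulation can be written down directly. The following is a minimal sketch, not the implementation used in this work: the data layout (nets as lists of (x, y, tier) pins, and routing edges as dictionary keys) and the function names are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, Sequence, Tuple

# A pin location: (x, y, tier). The tier is carried along but ignored by
# the projected HPWL, which is exactly what "projection" means here.
Pin = Tuple[float, float, int]


def projected_2d_hpwl(nets: Iterable[Sequence[Pin]]) -> float:
    """Sum of half-perimeter wirelengths with every pin projected to one layer."""
    total = 0.0
    for net in nets:
        if len(net) < 2:
            continue  # a single-pin net has no wirelength
        xs = [p[0] for p in net]
        ys = [p[1] for p in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def total_routing_overflow(demand: Dict[Hashable, float],
                           supply: Dict[Hashable, float]) -> float:
    """Sum of (demand - supply) over the congested global routing edges only."""
    return sum(max(0.0, d - supply.get(edge, 0.0)) for edge, d in demand.items())
```

Because the projected HPWL ignores the tier, moving a gate between tiers at a fixed (x, y) location leaves it unchanged, which is why repartitioning can target overflow without disturbing it.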
4.1.1.2 Design Flow
An overview of the proposed flow is shown in Figure 48. In this figure, the red boxes








Figure 48: The design flow used for gate-level M3D placement.
From the synthesized netlist, an initial monolithic 3D IC placement result is obtained.
Next, routability-driven partitioning is performed, which takes the initial placement
solution and re-partitions the gates to improve the routed wirelength of the design. A top-off
placement step is then performed to make sure that each tier in the monolithic 3D IC meets
target density requirements. The last step in the placement process is legalization, which
snaps the cells to the placement grid. Once the locations of cells are determined, MIVs
need to be inserted into the whitespace between cells. MIVs can then simply be treated
as I/Os in each tier, and a tier-by-tier route can be carried out using commercial tools
(Cadence Encounter). Finally, parasitics are extracted tier-by-tier, and a separate parasitic file
to represent MIV parasitics is created. All this information is fed into Synopsys PrimeTime
to obtain 3D timing and power numbers.
4.1.2 Monolithic 3D IC Placement
This section first presents prior work in TSV-based 3D IC placement, and discusses why
those approaches are not applicable to monolithic 3D ICs. Next, a methodology is proposed
based on modifications to 2D IC tools. Finally, handling pre-placed memory macros in a
3D design while still using 2D IC tools is discussed.
The monolithic 3D gate-level placement problem is similar to the TSV-based problem,
except that the via count need not be minimized. The first approach to TSV-based 3D
placement is folding-based [13]. This takes an existing legal 2D placement, and transforms it to
3D by several folding operations. This approach generates inferior quality solutions [12],
and is also not capable of handling pre-placed memory. The next method is
partitioning-based [28], where the netlist is first partitioned and all tiers are placed simultaneously.
Lastly, true 3D placement approaches exist [12, 21], where the half-perimeter wirelength
(HPWL) is minimized in the x, y, and z dimensions. However, in monolithic 3D ICs, the
z dimension is so small (1-2 µm) that attempting to minimize the z HPWL is not really
necessary. In addition, all of these engines are geared towards TSV-based 3D, and try to
minimize the via count. This section demonstrates that, since monolithic vias are so
small, a minimally modified 2D placement engine suffices, and separate 3D placement
engines are not required.
4.1.2.1 Placement-Aware Partitioning
An illustration of the proposed method for a two-tier monolithic 3D IC is shown in
Figure 49. If the width and height of a 2D IC are W_2D and H_2D respectively, the M3D outline
is defined such that the width and height of the 2D chip are each divided by √2. This modification
leads to exactly half the footprint of a 2D IC. All 2D placement engines have the concept of
chip capacity (or target density), which is the maximum number of standard cells that can
be placed in a given area. Since all the gates need to fit into half the area, simply doubling
the capacity of the chip will work. Any existing 2D placer can be modified for this purpose,
and this section uses a custom implementation of KraftWerk2 [59]. Clearly, the
HPWL obtained after such a placement represents the HPWL of a monolithic 3D IC where











Figure 49: Placement-aware partitioning. A modified 2D engine is used to place all the
gates into half the area, and then partitioned with area balance in each bin.
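The outline and capacity change described above amounts to a few lines of arithmetic. The sketch below only illustrates that arithmetic for the two-tier case; the function name and argument names are hypothetical and do not correspond to any placer's interface.

```python
import math
from typing import Tuple


def m3d_outline(w_2d: float, h_2d: float, target_density: float) -> Tuple[float, float, float]:
    """Return (width, height, modified target density) for a two-tier M3D design.

    Both dimensions of the 2D outline are divided by sqrt(2), which halves
    the footprint, and the target density is doubled so that all gates
    still fit into the halved area during the modified 2D placement.
    """
    scale = math.sqrt(2.0)
    return w_2d / scale, h_2d / scale, 2.0 * target_density
```

For example, a 1000 × 800 outline at a target density of 0.7 becomes an outline of roughly 707 × 566 with a modified target density of 1.4.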
The next step is to partition the gates such that each tier has an equal number of gates,
and the deviation from the initial (x, y) location is minimized. An obvious approach to
partitioning the gates is a min-cut approach, and modifying the Fiduccia-Mattheyses [16]
(FM) min-cut partitioner is straightforward. An overview is given below.
First, partition bins are defined in a regular fashion. Next, the design is partitioned
such that the cells in a given bin in the modified 2D result remain in the same bin after
splitting. As will be discussed in Section 4.1.5.1, the choice of bin size affects solution
quality greatly. This is because after partitioning, although each bin in each tier will contain
the correct number of cells, these cells may not be distributed uniformly throughout the
bin. If the partitioning bin size is much larger than the global placement bin size, there
could potentially be large areas of extra-dense cell placement and large areas of whitespace.
Therefore, top-off placement becomes necessary to obtain an acceptable placement solution
that meets target density within each global bin.
Initially, a random, area-balanced (within each partition bin) solution is created. The
gain of a cell is defined as the reduction in the cutsize if the cell’s tier is changed. A cell
is “legal” if moving it does not violate the area-balance constraints within its partition bin.
While moving a single cell from one tier to another will not affect the area balance too
much, this condition ensures that too many cells are not moved from one tier to another.
Initially, all the cell gains are computed and stored in a bucket structure. All the cells
are also marked as “unlocked”. Among all legal cells, the one with the highest gain is picked,
moved to the other tier, and locked. Once a cell is moved, only the gains of its neighbors
(connected by a net) need to be updated. This process is continued until all the cells are
locked. This is termed a pass. Several passes are performed until no more cutsize gains are
achieved. Due to the nature of the incremental gain update, this algorithm runs in O(C)
time, where C is the number of cells. While the min-cut approach is straightforward, MIVs are
extremely small and there is no real need to perform a min-cut on the netlist. Additional MIVs
can be tolerated, if there is good reason to use them. A routability-driven partitioner is
presented in Section 4.1.3, where additional MIVs are utilized to reduce routing congestion,
and hence, routed WL.
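The pass structure described above can be sketched as follows. This is a simplified illustration, not the partitioner used in this work: it recomputes gains naively instead of using a bucket structure (so it does not have the linear runtime quoted above), it rolls back to the best prefix of moves at the end of each pass as in the standard FM algorithm, and `slack`, the largest allowed area difference between the two tiers of a bin, is an assumed parameter.

```python
from collections import defaultdict


def cut_size(nets, tier):
    """Number of nets that have cells on both tiers."""
    return sum(1 for net in nets if len({tier[c] for c in net}) > 1)


def gain(cell, nets_of, tier):
    """Reduction in cut size if `cell` alone moves to the other tier."""
    g = 0
    for net in nets_of[cell]:
        same = sum(1 for c in net if tier[c] == tier[cell])
        if same == len(net):
            g -= 1  # net is currently uncut, the move would cut it
        elif same == 1:
            g += 1  # cell is alone on its side, the move would uncut the net
    return g


def fm_pass(cells, nets, tier, area, bin_of, slack):
    """One pass: move each legal cell at most once, keep the best prefix."""
    nets_of = defaultdict(list)
    for net in nets:
        if len(net) > 1:
            for c in net:
                nets_of[c].append(net)
    bin_area = defaultdict(lambda: [0.0, 0.0])  # per-bin area on tier 0 / tier 1
    for c in cells:
        bin_area[bin_of[c]][tier[c]] += area[c]

    unlocked, moves = set(cells), []
    total = best_total = best_len = 0
    while unlocked:
        legal = []
        for c in unlocked:
            a, src, dst = bin_area[bin_of[c]], tier[c], 1 - tier[c]
            if abs((a[dst] + area[c]) - (a[src] - area[c])) <= slack:
                legal.append(c)
        if not legal:
            break
        # Highest gain first; the cell name makes tie-breaking deterministic.
        c = max(legal, key=lambda x: (gain(x, nets_of, tier), str(x)))
        total += gain(c, nets_of, tier)
        a = bin_area[bin_of[c]]
        a[tier[c]] -= area[c]
        tier[c] = 1 - tier[c]
        a[tier[c]] += area[c]
        unlocked.remove(c)  # lock the moved cell for the rest of the pass
        moves.append(c)
        if total > best_total:
            best_total, best_len = total, len(moves)
    for c in moves[best_len:]:  # undo the moves made after the best prefix
        tier[c] = 1 - tier[c]
    return best_total


def partition(cells, nets, tier, area, bin_of, slack):
    """Repeat passes until a pass no longer reduces the cut size."""
    while fm_pass(cells, nets, tier, area, bin_of, slack) > 0:
        pass
    return tier
```

Since every accepted prefix consists of legal moves only, the per-bin balance holds in the returned solution as long as the initial solution satisfies it.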
Note that while this approach may appear somewhat similar to the local stacking
transformation (LST) presented in [13], it is superior in one major aspect: the handling of
pre-placed memory macros. The LST method obtains the initial (x, y) locations of all the cells
by scaling them from a legal 2D placement, and hence has no way to handle pre-placed
memory macros in a 3D space. Handling them in the proposed method is straightforward,
and will be discussed in the following subsection.
4.1.2.2 Handling Memory Macros
In an M3D design, hard macros such as memory are bound to be pre-placed. This section
discusses how to handle these memory macros while still leveraging 2D IC tools. Let t_d
be the target density required in the final, post-partitioned M3D design, and t′_d be the target
density in the modified 2D placement.













Figure 50: Handling pre-placed memory macros. (a) Initial pre-placed locations, (b)
Projection of both tiers onto the same plane, and (c) Modifying the target density to represent
memory locations. t′_d is the target density in the modified 2D placement and t_d is the
required target density in the final M3D design.
First, both these tiers are projected onto the same plane as shown in Figure 50(b). Those
regions that have two memories overlapping cannot contain cells in any tier, and hence will
have t′_d = 0. Those regions that have only one memory can contain cells in the tier where
the memory is not placed. To reflect this fact, the target density in those regions will not be
doubled, or t′_d = t_d, as shown in Figure 50(c). Finally, the regions not containing memory
will have cells of both tiers placed, and hence t′_d = 2t_d.
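The three cases above reduce to scaling t_d by the number of tiers that are free of memory at a given location. A minimal sketch on a tile grid follows; the grid representation (one boolean per tile and tier) and the function name are assumptions made for illustration.

```python
from typing import List


def modified_target_density(mem_bottom: List[List[bool]],
                            mem_top: List[List[bool]],
                            t_d: float) -> List[List[float]]:
    """Per-tile target density t'_d for the modified 2D placement.

    mem_bottom[r][c] / mem_top[r][c] are True where a pre-placed memory
    covers tile (r, c) on that tier. Each tile gets t_d times the number
    of memory-free tiers: 0, t_d, or 2 * t_d.
    """
    return [
        [(2 - int(b) - int(t)) * t_d for b, t in zip(row_b, row_t)]
        for row_b, row_t in zip(mem_bottom, mem_top)
    ]
```

A tile covered by memory on both tiers gets 0, a tile covered on one tier gets t_d, and a memory-free tile gets 2t_d.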
Handling these region-specific target density constraints is straightforward in the Kraftwerk
placement system. In order to remove overlap between cells, it maintains a supply/demand
system of placement space. The chip is divided into fine mesh tiles, and each mesh tile has
a supply t_d. Each cell has demand 1 on each mesh tile that it occupies. Solving the Poisson
equation of supply minus demand gives the direction and amount to move each cell in order
to equalize supply and demand. In this system, the supply of each fine mesh tile is set to t_d
or 2t_d depending on requirements.
The partitioning process can also be modified easily. The regions with memory overlap
in both tiers do not have cells, and need not be partitioned. Those cells placed in the
regions with a single memory macro are moved to the tier not containing memory. Finally,
the regions with cell overlap are partitioned as usual.
4.1.3 Routability-Driven Partitioning
The first step in building a routability-driven partitioner is to estimate the routing congestion
in the monolithic 3D IC. The routing congestion is measured as the total routing overflow,
which is the routing demand minus routing supply on all the global routing edges in the
chip. The routing supply is determined from the number and pitch of metal layers, and this
section discusses how to determine the 3D routing demand. This section then describes
how to re-partition the monolithic 3D IC to reduce routing congestion.
4.1.3.1 Prior Work
While this is the first work to discuss a monolithic 3D routing demand model, this topic has
been explored extensively for 2D ICs. The first approach is a grid-less approach [58] where
the demand of a net is assumed to be distributed evenly along all possible Steiner tree
combinations. This was extended to consider the differences between horizontal and vertical
segments in [24]. These approaches are more suitable for routability-driven placement, not
partitioning, as both these papers try to minimize the overlap of the net bounding boxes.
The other approach is to first decompose multi-pin nets into two-pin nets, and add each
two-pin net into the demand estimate. The demand of each two-pin net can be estimated either
by maze routing [30], rough global (LZ) routing [37], or probabilistically [5]. This project
chooses a probabilistic demand model because (1) it is extremely fast, unlike maze routing,
and (2) the predicted demand numbers are independent of net ordering, unlike LZ routing.
The first property is necessary as several solutions will be evaluated during partitioning,
and the second property is essential for a partitioner as each re-compute of the demand of
the same two-pin net must yield the same result.
4.1.3.2 Decomposing Multi-Pin Nets into Two-Pin Nets
This section presents a method of decomposing multi-pin nets into two-pin nets by
constructing 3D rectilinear Steiner trees (RSTs). Currently, no tool exists to efficiently
compute a 3D RST, so the net is projected to 2D, a 2D rectilinear Steiner minimum tree (RSMT)
constructed, and then expanded back to 3D.
Sample points to be routed are shown in Figure 51(a). The points are first projected
to a 2D plane, and a 2D RSMT is constructed using FLUTE [10] (Figure 51(b)). Now,
while expanding this 2D RSMT to a 3D RST, the tiers of all the fixed points are already
known. The tier of each Steiner point is determined by a majority vote of the tier of all
of its neighbors. Any ties are broken in an arbitrary, deterministic manner. A neighbor
is defined as any point (Steiner or fixed) that the current Steiner point is connected to. If a
neighbor does not have a tier determined yet, it is ignored during the current iteration of the
majority vote operation. For example, when the 2D RSMT of Figure 51(b) is expanded, the
tiers of the three Steiner points that are connected to the fixed points are determined first.
They each have two neighbors in one tier, and one undetermined neighbor. Therefore, they
all lie in the same tier as the fixed points that they are connected to. Next, the tier of the
middle Steiner point can be determined as the top tier as it has two neighbors in the top tier










Figure 51: Construction of a 3D RST. (a) The points to be routed. (b) Project to 2D and
construct a 2D RSMT. (c) Expand the 2D RSMT to a 3D RST. (d) If a cell changes tier, the
2D RSMT can be re-used.
Since the target is move-based partitioning, the change in topology needs to be quickly
evaluated if the tier of a given cell is changed. Since such a change does not change the x and
y co-ordinates of the cell, the same 2D RSMT can be reused. The tier of one cell is changed
and the resulting 3D RST is shown in Figure 51(d). The expansion from Figure 51(b) is
redone, and only the quick majority vote operation needs to be performed on the Steiner
points. Note that the Steiner point connected to the cell that has changed tier now has an
equal number of neighbors in each tier. This tie can be broken in any deterministic manner,
and this project always goes with the lower tier. Since the middle Steiner point now has two
neighbors in the bottom tier and one in the top tier, it is also assigned to the bottom tier.
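The majority-vote expansion can be sketched for the two-tier case as below. The topology in the usage note is a hypothetical six-pin net chosen to mirror the description above, not the exact net of Figure 51, and the function name and round-based processing order are assumptions.

```python
from typing import Dict, List


def expand_to_3d(fixed_tier: Dict[str, int],
                 steiner_neighbors: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign a tier (0 = bottom, 1 = top) to every Steiner point.

    Points are processed in rounds. In each round, every undecided Steiner
    point with at least one decided neighbor takes the majority tier of its
    decided neighbors; undecided neighbors are ignored. Ties go to the
    lower tier, which keeps the result deterministic.
    """
    tier = dict(fixed_tier)
    pending = set(steiner_neighbors)
    while pending:
        decided = {}
        for s in pending:
            votes = [tier[n] for n in steiner_neighbors[s] if n in tier]
            if votes:
                top = sum(votes)
                bottom = len(votes) - top
                decided[s] = 1 if top > bottom else 0
        if not decided:
            raise ValueError("a Steiner point is not connected to any fixed point")
        tier.update(decided)  # votes in a round only see the previous round
        pending.difference_update(decided)
    return tier
```

With pins a, b, c, d on the top tier, pins e, f on the bottom tier, Steiner points s1 = {a, b, s0}, s2 = {c, d, s0}, s3 = {e, f, s0}, and a middle point s0 = {s1, s2, s3}, the three outer points are decided in the first round and s0 lands on the top tier in the second. Moving pin a to the bottom tier and re-running the vote on the same 2D topology sends s1 to the bottom tier by the tie rule, and s0 follows it.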
As seen from this figure, a lot of the routing demand on the top tier is offloaded to the
bottom tier, with an unchanged 3D bounding-box. Therefore, to evaluate the change in
demand if the tier of a given cell is changed, the following steps need to be performed:
(1) Redo the majority vote operation for all nets connected to that cell, (2) Delete the old
topology (rip-up) of the changed two-pin nets from the demand estimate, and (3) Add the
new topology (re-route) of the changed two-pin nets into the demand estimate. Handling
each two-pin net is described next.
4.1.3.3 3D Demand Model for Two-Pin Nets
A 3D routing graph is maintained for the entire chip. This section considers only the
sub-graph that a given two-pin net spans. Although the focus is only two-tier monolithic
3D ICs, the model presented in this section is general, and is applicable to any number of








Figure 52: A legal route from A to B in a 4 × 3 × 2 grid. The top-view is limited to two
bends, while the unfurled view can have unlimited bends.
Assume that the net (A-B) spans an l × m × n routing sub-graph. The probabilistic
routing demand contributed by this two-pin net on each edge within this sub-graph needs
to be computed. One possible route from A to B is highlighted in red. Many such legal
routes exist, and a probabilistic demand model assumes that each legal route is equally
probable. Therefore, the key to such a demand model is to correctly identify which routes
are legal.
Two key observations that help derive the demand model are: (1) Looking at the 3D
demand graph from the top view, each bend represents the usage of a local via. Since
current global routers try to minimize the usage of local vias, a route is limited to at most
two bends (or local vias) in the top view [5, 37]. (2) A new view called the unfurled view is
defined, which unfurls the routing graph along a legal route (refer to Figure 52). In such a
view, movement along either the x or the y direction looks the same. In this view, irrespective
of the number of bends, the number of MIVs is always the same and equal to exactly n − 1.
For example, in Figure 52, two MIVs always connect A and B, irrespective of the number
of bends in the route. Therefore, there are no limits to the number of bends in the unfurled
view.
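Assuming the above constraints, the total number of routes from A to B is
$(l+m) \times {}^{(l+m+n)}C_{n}$. First, given the top-view constraint, the sum of all the probabilities along all the edges
[the beginning of Equation (28) is missing from the source text; the surviving case terms are]
\[
\begin{cases}
(l - x), & \text{if } y = 0 \\
(x + 1), & \text{if } y = m \\
1, & \text{otherwise}
\end{cases}
\tag{28}
\]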
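The case terms of Equation (28) can be checked by brute-force enumeration. The sketch below assumes that A sits at (0, 0) and B at (l, m) in the top view, that routes are monotone, and that a legal top-view route has at most two bends; under these assumptions there are l + m such routes, which matches the first factor of the route count above. The helper names are hypothetical, and the check covers only the number of routes crossing each x edge, not the normalized probability.

```python
def two_bend_routes(l, m):
    """Enumerate monotone top-view routes from (0, 0) to (l, m) with at most two bends.

    Each route is returned as the set of unit x edges it uses, where an x edge
    is identified by its left endpoint (x, y). Assumes l >= 1 and m >= 1.
    """
    routes = []
    # Horizontal-vertical-horizontal shapes: climb in column c (c = l is an L-shape).
    for c in range(1, l + 1):
        edges = {(x, 0) for x in range(c)} | {(x, m) for x in range(c, l)}
        routes.append(edges)
    # Vertical-horizontal-vertical shapes: cross in row r (r = m is an L-shape).
    for r in range(1, m + 1):
        routes.append({(x, r) for x in range(l)})
    return routes


def x_edge_usage(l, m, x, y):
    """Number of legal top-view routes that use the x edge starting at (x, y)."""
    return sum((x, y) in edges for edges in two_bend_routes(l, m))


def case_term(l, m, x, y):
    """Case terms of Equation (28)."""
    if y == 0:
        return l - x
    if y == m:
        return x + 1
    return 1


# Check the 4 x 3 top view used in Figure 52.
l, m = 4, 3
assert len(two_bend_routes(l, m)) == l + m
assert all(x_edge_usage(l, m, x, y) == case_term(l, m, x, y)
           for x in range(l) for y in range(m + 1))
```

Enumerating the routes explicitly keeps the check independent of the closed-form terms, at the cost of being practical only for small grids.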
A similar expression can also be written for all the y edges. Next, in the unfurled view,
all edges with the same (x + y) look the same. Therefore, let i represent (x + y). Since
there is no limit to the number of bends, the routing probability on any horizontal edge is
[Equations (29) and (30) are missing from the source text; only the fragment "(l − x), if y = 0" survives.]
A similar expression can also be computed for all the y edges. Once the probabilities
of the x and y edges have been computed, the probability on each z edge can be computed
by visiting them in turn, and setting the probability to be the sum of the probabilities on all
incoming edges (towards A) minus the sum of the probabilities on all the outgoing edges
(towards B).
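The conservation rule for the z edges can be illustrated in the unfurled view. The sketch below is a stand-in under stated assumptions, not the expressions of this section: it treats the unfurled view as a grid with `H` horizontal steps and `V` vertical steps, assumes every monotone route is equally likely, and uses the standard lattice-path count for the horizontal edges. The names `horizontal_prob` and `vertical_probs` are hypothetical.

```python
from fractions import Fraction
from math import comb


def horizontal_prob(i, z, H, V):
    """Probability that a uniformly random monotone route uses the
    horizontal edge (i, z) -> (i + 1, z) of an H x V unfurled grid."""
    before = comb(i + z, z)                      # routes from A to the edge tail
    after = comb((H - i - 1) + (V - z), V - z)   # routes from the edge head to B
    return Fraction(before * after, comb(H + V, V))


def vertical_probs(H, V):
    """Derive every z-edge probability (i, z) -> (i, z + 1) from conservation:
    incoming probability minus outgoing horizontal probability."""
    v = {}
    for z in range(V):
        for i in range(H + 1):
            # The source node A at (0, 0) injects a total probability of 1.
            inflow = Fraction(1) if (i, z) == (0, 0) else Fraction(0)
            if i > 0:
                inflow += horizontal_prob(i - 1, z, H, V)
            if z > 0:
                inflow += v[(i, z - 1)]
            outflow = horizontal_prob(i, z, H, V) if i < H else Fraction(0)
            v[(i, z)] = inflow - outflow
    return v


H, V = 7, 2
v = vertical_probs(H, V)
# Every route crosses each level exactly once, so each level's z edges sum to 1.
assert all(sum(v[(i, z)] for i in range(H + 1)) == 1 for z in range(V))
```

Exact fractions are used so that the conservation check holds without floating-point tolerance.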
4.1.3.4 Interdependent Supply/Demand Model
In 2D ICs, there are two types of tracks: x and y. Using an x track does not affect the
supply of y tracks, and vice versa. In monolithic 3D ICs, the number of z tracks available
also needs to be taken into account. These z tracks, however, are not independent of the x
and y track usage. Assuming that the top metal layer is vertical, this fact is illustrated in
Figure 53. This figure shows the top view of the top metal layer of one global routing bin.
The green squares represent potential MIV landing pad sites whose pitch is determined by
[Figure 53 legend: internally used MIV slot; externally used MIV slot; global routing bin; 2D route on top metal. Panels: (a), (b).]
Figure 53: A view of the top metal layer that contains MIV landing pads. (a) A 2D wire
on the top metal layer blocks potential MIV landing pad slots. (b) If MIVs connect to cells
outside the current bin (external), they block other MIVs. If MIVs connect to cells within
the current bin (internal), they do not block other potential MIV slots.
There are three effects that need to be modelled. First, assume that a 2D wire on the top
metal layer crosses this bin. As shown in Figure 53(a), this 2D route blocks potential MIV
landing pad sites, and hence reduces the 3D supply. Next, as shown in Figure 53(b), if an
MIV lands on the top metal layer (from the other die), and continues onto a different global
routing bin, this is termed an externally used MIV slot. Such connections use one MIV
slot, but also block others. Finally, if an MIV lands on the top metal layer but connects
to a gate within the same bin itself, it is termed an internally used MIV slot. As seen from this
figure, it uses one MIV slot but does not block other MIV slots. However, this requires an
entire via stack from the top metal to the lowest metal to connect to the cell. This via stack
causes via blockages [8], which reduce the 2D supply in the lower metal layers.
Let $W_B$ and $H_B$ be the width and height of the global routing bin. $N_{MH}$ and $N_{MV}$ are
the number of horizontal and vertical metal layers, respectively. Let $P_{Hi}$ and $P_{Vi}$ be the
pitch of the ith horizontal and vertical metal layer, respectively. Note that M1 is ignored as
it is usually used for within-cell routing. Therefore, the “first” metal layer is actually M2.
Also, this section assumes that the top metal layer has a preferred vertical direction. The
derivation can also easily be carried out if it is horizontal.
If the top metal pitch is assumed to be the only factor determining the number of MIV
slots, then the number of vertical and horizontal MIV slots are: $N_H = W_B / P_{N_{MV}}$ and
$N_V = H_B / P_{N_{MH}}$. However, not all these slots are accessible. This is because each metal
layer only contributes a finite number of tracks that can connect to MIVs in this bin. The
number of MIV slots can then be given as
\[
N_{MIV} = 2 N_H N_{MV} + 2 N_V N_{MH} - 4 N_{MV} N_{MH} \tag{31}
\]
This can then be divided into a matrix with $N'_H$ and $N'_V$ effective horizontal and vertical
slots. It should be noted that this routing-based constraint on the number of MIVs is far
more restrictive than computing the number of MIV slots available by simply looking at
the whitespace available for MIV insertion. It can be shown that even if all the above MIV
slots are utilized, they will occupy only 2-3% of the area of a given placement bin.
Next, to determine the number of blocked MIV slots, the number of 2D and 3D routes
that use the top metal layer needs to be determined. This requires metal layer assignment,
which is a complicated problem. Instead, the routes are assumed to be assigned to metal
layers based on the inverse ratio of pitch, i.e., a larger-pitch metal layer will have fewer wires.
Let $N_{N,2D,i}$ be the number of 2D routes that cross the north edge on metal layer i. Similar
definitions can be made for 3D routes and the east, west, and south edges. Let $N_{N,2D}$ be the
total number of 2D routes crossing the north edge, and $P_i$ be the pitch of the ith metal layer.
For each vertical metal layer i, $N_{N,2D,i} = N_{N,2D} / (P_i \cdot \sum_j (1/P_j))$. It is pessimistically
assumed that any 2D or 3D wire crossing an edge goes all the way to the center of the bin.
The number of blocked MIV slots (assuming the top metal is vertical) can then be given as
\[
N_{MIV,Blk} = 0.5 N'_V (N_{N,2D,N_{MV}} + N_{S,2D,N_{MV}}) + (0.5 N'_V - 1)(N_{N,3D,N_{MV}} + N_{S,3D,N_{MV}}) \tag{32}
\]
The first term in the above equation represents the number of MIV slots blocked by 2D
wires and the second term represents the number of MIV slots blocked by external MIV
connections. The actual number of MIV slots can be obtained by subtracting Equation (32)
from Equation (31).
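Equations (31) and (32) and the inverse-pitch split can be evaluated directly. The sketch below is only an illustration: the function and parameter names are hypothetical, the effective slot count $N'_V$ is taken as an input because its derivation is not spelled out here, and the numbers in the usage lines are made up.

```python
def miv_slots(n_h, n_v, n_mh, n_mv):
    """Equation (31): routing-limited number of MIV slots in one bin."""
    return 2 * n_h * n_mv + 2 * n_v * n_mh - 4 * n_mv * n_mh


def routes_on_layer(total_routes, pitches, i):
    """Inverse-pitch split: routes assigned to layer i out of total_routes."""
    return total_routes / (pitches[i] * sum(1.0 / p for p in pitches))


def blocked_slots(n_v_eff, n2d_north, n2d_south, n3d_north, n3d_south):
    """Equation (32); all route counts refer to the top vertical metal layer."""
    return (0.5 * n_v_eff * (n2d_north + n2d_south)
            + (0.5 * n_v_eff - 1) * (n3d_north + n3d_south))


# Hypothetical bin: 10 x 10 top-metal slots, 2 horizontal and 2 vertical layers.
print(miv_slots(n_h=10, n_v=10, n_mh=2, n_mv=2))
# Hypothetical pitches of two vertical layers that share 6 routes.
pitches = [0.1, 0.2]
print([routes_on_layer(6, pitches, i) for i in range(len(pitches))])
# Hypothetical route counts on the top vertical layer, with 8 effective slots.
print(miv_slots(10, 10, 2, 2) - blocked_slots(8, 1, 1, 1, 0))
```

The per-layer shares always add up to the total route count, so the split only redistributes routes and never creates or drops any.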
The next step is to calculate the 2D supply reduction due to the via blockages introduced
by MIV connections. Let $N_{int,3D}$ be the number of internal MIV connections in this bin.
Each bin is divided into four quadrants, numbered one through four, in the usual naming
convention. The number of vias in the first quadrant, on metal layer i, can then be given as
[Equation (33) is missing from the source text.]
If $W_{via,i}$ is the width of the via on metal layer i, then the fraction of metal layer i in the first
[the rest of this sentence, Equation (34), and the beginning of Equation (35) are missing from the source text; the surviving part of Equation (35) is]
\[
\cdots \, (1 - 0.5 (B_{via,1,i} + B_{via,2,i})) / P_i \tag{35}
\]
Similar expressions can then be derived for all the other edges as well.
4.1.3.5 Min-Overflow Partitioning
Routability-driven (min-overflow) partitioning can now make use of the 3D demand model.
First, min-cut partitioning as described in Section 4.1.2 is performed. Min-overflow
partitioning is then performed on top of this solution. Total overflow is used as the metric
to be minimized, which is defined as the summation of the overflow on all the 2D and 3D
edges in the chip that are congested. The overflow gain of a cell is then the reduction in
the total overflow when its tier is changed, and it is computed by the procedure outlined in
Subsection 4.1.3.2.
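The total-overflow metric and the overflow gain of a single move can be sketched as follows. This is an illustration under assumptions: demand and supply are stored in plain dictionaries keyed by a hypothetical edge identifier, and the rip-up and re-route demands of the affected two-pin nets are passed in precomputed.

```python
def total_overflow(demand, supply):
    """Sum of (demand - supply) over all congested routing edges."""
    return sum(max(0.0, d - supply.get(edge, 0.0)) for edge, d in demand.items())


def overflow_gain(demand, supply, rip_up, re_route):
    """Reduction in total overflow if a cell changes tier.

    rip_up and re_route map edges to the probabilistic demand of the old and
    new two-pin-net topologies of the nets connected to the moved cell.
    """
    trial = dict(demand)
    for edge, p in rip_up.items():
        trial[edge] = trial.get(edge, 0.0) - p
    for edge, p in re_route.items():
        trial[edge] = trial.get(edge, 0.0) + p
    return total_overflow(demand, supply) - total_overflow(trial, supply)


# Hypothetical example: moving one unit of demand off a congested edge.
demand = {"e_top": 3.0, "e_bot": 1.0}
supply = {"e_top": 2.0, "e_bot": 2.0}
print(overflow_gain(demand, supply, rip_up={"e_top": 1.0}, re_route={"e_bot": 1.0}))
```

Working on a copy of the demand keeps the evaluation side-effect free, so the same estimate can be reused to score several candidate moves.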
Let C be the set of all cells and N be the set of all the nets in the design. In the min-cut
partitioner, once a cell is moved, only the gains of its neighbors need to be updated.
However, the overflow depends on all nets that use a particular routing edge, not just those
connected to this cell. If a cell is moved, it affects several routing edges. Any other net
that uses the affected routing edges will now need to have its overflow updated. Since the
gain is defined for moving a cell, all cells connected to such nets will also need to have
their gain updated. For cells connected to nets with large bounding boxes, up to C cells
will need to be updated every time such a cell is moved. This means that maintaining a priority
queue with all cells, such as in the default FM algorithm, would lead to a time complexity
of $O(C^2)$. This neglects the time necessary to rebuild the queue, which adds a further
$O(\log(C))$ complexity. Overall, this would lead to excessively large runtime, making it
infeasible. A heuristic that reduces the time complexity significantly is now presented, and
shown in Algorithm 3.
The top-level function in this algorithm is MinOverflow(). Initially, the demand estimate
is cleared, i.e., all nets are removed, and the utilization on each routing edge is set to 0.
Next, there are two stages, build and refine, both of which are similar, and handled by the
Stage() function. In the build phase, all the nets are initially set to invalid. In both stages,
the nets are then sorted by bounding box. This is because nets with a larger bounding box
have a greater impact on the routing graph, and are therefore processed first. During the build
phase, the 3D-RST of the net currently being processed is added into the demand estimate,
and the net is set to valid. Next, irrespective of stage, the FM() function (to be described
later) is performed on the cells of the current net. Note that in the build phase, the demand
estimate does not have all the nets included, only the ones that have been processed
so far. This is to avoid any noise introduced by a bad initial random partitioning of the
unprocessed nets.
[Lines 1-6 of Algorithm 3 are missing from the source text; the surviving lines of Stage() follow.]
    if (type == build) then
        ∀n ∈ N : n→valid = false ;
    end
    Sort N in descending order of bounding-box ;
    foreach n ∈ N do
        if (type == build) then
            demandEstimate→AddRST(n→rst) ;
            n→valid = true ;
        end
        FM(n→cn) ;
    end
end
The FM() function is similar to the basic algorithm described in Section 4.1.2, with a
few differences: (1) A heap is used instead of a bucket, as the gains are not integer values,
(2) Only a subset of cells that belong to a given net are considered, (3) When a cell is moved
to another tier, the gains of all cells within the current subset are updated, and (4) The gain
function is the global max-overflow gain, considering all “valid” nets in the design, not just
the current net being processed.
The above heuristic adds one net at a time into the demand estimate, maintaining a
local optimum of the global total overflow after each net is added. Once all the nets are
added, each net is processed again to further reduce the overflow. This approach leads to
a time complexity of $O(N \cdot (\mathrm{rms}N_d)^2)$, where $\mathrm{rms}N_d$ is the root-mean-square of the net
degrees. This value does not scale much with circuit size, and therefore, the heuristic is
approximately linear in runtime.
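The build and refine ordering of the heuristic can be sketched as below. This is a simplified stand-in, not the algorithm as implemented: the FM() pass is replaced by a greedy flip that keeps a move only if it lowers the cost, area balance is ignored, and the usage example plugs in a toy cost (number of cut nets) where a real run would evaluate the total overflow of the 3D demand model. All names are hypothetical.

```python
def min_overflow(nets, tier, cost):
    """Build/refine ordering of the heuristic with a simplified inner pass.

    nets: dict mapping net name -> (bounding_box_size, list_of_cells)
    tier: dict mapping cell -> 0 or 1, modified in place
    cost: callable (tier, valid_nets) -> cost over the valid nets only
    """
    valid = set()                      # nets currently in the demand estimate
    order = sorted(nets, key=lambda n: nets[n][0], reverse=True)
    for stage in ("build", "refine"):
        for n in order:
            if stage == "build":
                valid.add(n)           # stands in for adding the net's 3D RST
            greedy_pass(nets[n][1], tier, valid, cost)
    return tier


def greedy_pass(cells, tier, valid, cost):
    """Stand-in for FM(): keep a tier flip only if it lowers the cost."""
    for c in cells:
        before = cost(tier, valid)
        tier[c] ^= 1
        if cost(tier, valid) >= before:
            tier[c] ^= 1               # no improvement, so undo the move


def cut_nets(nets):
    """Toy cost for the usage example: number of valid nets spanning both tiers."""
    def cost(tier, valid):
        return sum(len({tier[c] for c in nets[n][1]}) == 2 for n in valid)
    return cost


nets = {"a": (10, ["x", "y"]), "b": (5, ["y", "z"])}
tier = min_overflow(nets, {"x": 0, "y": 1, "z": 1}, cut_nets(nets))
print(tier)
```

Because the cost only sees the valid nets, unprocessed nets cannot influence early moves during the build stage, which mirrors the motivation given above.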
4.1.4 Router-based 3D-Via Insertion
To continue with the P&R flow, routing and then parasitic extraction need to be performed.
However, current routers can only handle 2D ICs, and the usual approach is to split the 3D
design into separate designs for each tier, each of which can be routed independently. This
requires the locations of the MIVs to be known, so that they can be represented as I/O pins
within each tier.
Once the partition of all cells is finalized, current TSV-based placers perform a TSV
and cell co-placement step [28, 12] to determine the via locations. However, MIVs are so
small that they can actually be handled by the router, and the only hurdle is the lack of an
existing 3D commercial router. However, 2D commercial routers are capable of routing to
pins on different metal layers, and a method to trick existing 2D commercial routers into
[Figure 54 labels: 2D tech LEF; 2D macro LEF; 3D tech LEF; 3D macro LEF; DEF for each tier. Panels: (a), (b).]
Figure 54: An overview of the router-based MIV insertion methodology. (a) The technology
and macro LEF are modified to represent a two-tier monolithic 3D IC. (b) The
structure that is fed into the commercial router, which is then routed. The MIV locations
are extracted and separate Verilog/DEF files are created for each tier.
First, all the metal layers in the technology LEF are duplicated to yield a new 3D LEF
with twice the number of metal layers. Next, for each standard cell in the LEF file, two
flavors are defined, one for each tier. The only difference between the two flavors is that
their pins are mapped onto different metal layers depending on the tier. Next, each cell in the
3D space is mapped to its appropriate flavor, and forced onto the same placement layer.
Note that this will lead to cell overlap in the placement layer, but there will be no overlap
in the routing layers (Figure 55). Routing blockages are placed in the via layer between the
two tiers, to prevent MIVs over cells. This structure is then fed into an existing commercial
router (Cadence Encounter). Once routed, the routing topology is traced to extract the MIV
locations, and separate Verilog/DEF files are generated for each tier.
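The layer duplication, the per-tier cell flavors, and the MIV extraction can be sketched on plain Python data. This is only an illustration: it does not read or write LEF/DEF, the `<layer>_<tier>` naming scheme and all function names are hypothetical, and routed vias are assumed to be available as `(x, y, lower_layer, upper_layer)` tuples.

```python
def build_3d_stack(metal_layers, num_tiers=2):
    """Duplicate the 2D metal stack once per tier, bottom tier first."""
    return [f"{layer}_{tier}" for tier in range(num_tiers) for layer in metal_layers]


def cell_flavor(pin_layers, tier):
    """Map each pin of a 2D cell onto the copy of its metal layer in the given tier."""
    return {pin: f"{layer}_{tier}" for pin, layer in pin_layers.items()}


def extract_mivs(vias, metal_layers):
    """Keep the routed vias that join the top metal of tier 0 to the first metal of tier 1."""
    below, above = f"{metal_layers[-1]}_0", f"{metal_layers[0]}_1"
    return [(x, y) for (x, y, lower, upper) in vias if (lower, upper) == (below, above)]


layers = ["M1", "M2", "M3"]
print(build_3d_stack(layers))
print(cell_flavor({"A": "M1", "Y": "M1"}, tier=1))
print(extract_mivs([(10, 20, "M3_0", "M1_1"), (5, 5, "M1_0", "M2_0")], layers))
```

Since both flavors of a cell share the same footprint and differ only in pin layers, the 2D router sees a single design whose inter-tier vias are ordinary vias between two adjacent layers.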
[Figure 55 labels: Tier 0 gate. Panels: (a) input to commercial router; (b) output from commercial router.]
Figure 55: Screenshots of router-based MIV insertion. (a) All the gates are placed in
the same placement layer, but no overlap exists in the routing layers. (b) The result after
routing. The MIV locations are highlighted in red.
4.1.5 Experimental Results
Eight benchmarks are chosen, six of which are from the OpenCores benchmark suite. In
addition, two processor designs, the OpenSPARC (OS_T2) and LEON3 cores, are chosen.
These designs vary in size from a few tens of thousands of gates to half a million gates.
They are synthesized with a 28nm cell library, and their statistics are tabulated in Table 20.
Of these eight designs, three have memory macros, as listed under the memory area column
in Table 20.
[Table 20; the caption and part of the header are missing from the source text.]
Circuit   Clock period (ns)  #Cells    #Nets     Std. cell area  Memory area  #Metal layers
mul_64    1.2                21,671    22,399    0.078           0            4
LEON3     0.9                17,419    19,069    0.051           0.034        4
nova      2.3                57,339    60,867    0.179           0.028        6
rca_16    0.4                67,086    75,786    0.263           0            4
aes_128   0.5                133,944   138,861   0.349           0            5
jpeg      1.5                193,988   238,496   0.739           0            4
OS_T2     1.5                316,573   334,374   1.110           0.468        6
fft_256   1.0                488,508   492,499   1.833           0            5
In addition to the clock period, number of cells, and number of nets, this table also
shows the minimum number of metal layers with which the 2D placement is routable. This
is used as the number of metal layers for both 2D and monolithic 3D versions of each
design. The footprint area of each design is chosen such that the standard cells have a
target density of 70%. All monolithic 3D designs are implemented such that they have
exactly 0% area overhead compared to their corresponding 2D version, i.e., exactly 50%
footprint area, irrespective of MIV count. This condition also ensures that the standard cells
in the M3D design have a target density of 70%. The diameter of each MIV is assumed to
be 100nm, with a resistance of 2Ω and a capacitance of 0.1fF [33].
In order to obtain pre-placed memory macro locations for 3D, the memory macros are
partitioned architecturally. An example of this for the OS_T2 benchmark is shown in
Figure 56. The 2D design contains several modules such as the load-store unit (lsu), the
instruction-fetch unit (ifu), etc. Roughly half the memories in each module are allocated to each tier,
and the memories are manually placed to mimic the 2D placement as closely as possible.
[Figure 56 panel label: 3D - Tier 0.]
Figure 56: Manual partitioning of the memories in the OS_T2 benchmark. The memories
belonging to each sub-module are partitioned, and placed in a configuration similar to that
in 2D.
4.1.5.1 The Impact of Partitioning Bin Size
As discussed in Subsection 4.1.2.1, the choice of partition-bin size affects the solution
quality greatly. From the perspective of cell displacement, smaller bin sizes are better.
However, smaller bin sizes mean more partitioning bins, which leads to more area-balance
constraints that the partitioner needs to satisfy. More constraints imply a worse objective
function, which means a larger cutsize in the min-cut partitioner. Since routed WL is more
important than 3D HPWL, more 3D vias mean that an appropriate whitespace location
needs to be found for more MIVs, which may not always be feasible. Therefore, a smaller
bin size may not always lead to lower wirelength. To quantify this effect, the min-cut
partitioner is run on all benchmarks with varying bin sizes, and results are tabulated in
Table 21.
For each benchmark, five different bin sizes are evaluated. The MIV count after
router-based MIV insertion and the projected 2D HPWL, which is the objective function of the
top-off placement, are tabulated. As expected, increasing the bin size always reduces the MIV
count due to the partitioner having more freedom, and also generally increases the projected
2D HPWL as the final (x, y) location of cells deviates more. However, the impact on routed
Table 21: The impact of partition bin size on solution quality.
Bin W (µm)  #MIV (×10^3)    Proj. 2D HPWL (m)  Routed WL (m)   PDP (mW-ns)
mul_64
5           15.41 (1.00)    0.31 (1.00)        0.46 (1.00)     35.61 (1.00)
10          8.35 (0.54)     0.31 (1.00)        0.45 (0.97)     34.99 (0.98)
20          5.67 (0.36)     0.32 (1.01)        0.44 (0.96)     34.63 (0.97)
40          4.73 (0.30)     0.32 (1.02)        0.45 (0.98)     35.22 (0.98)
80          3.50 (0.22)     0.34 (1.08)        0.47 (1.02)     35.34 (0.99)
LEON3
5           12.50 (1.00)    0.36 (1.00)        0.54 (1.00)     25.92 (1.00)
10          6.79 (0.54)     0.37 (1.00)        0.53 (0.97)     25.60 (0.98)
20          5.77 (0.46)     0.37 (1.01)        0.52 (0.96)     25.51 (0.98)
40          5.44 (0.43)     0.37 (1.02)        0.53 (0.97)     25.62 (0.98)
80          4.19 (0.33)     0.38 (1.03)        0.53 (0.97)     26.04 (1.00)
nova
5           44.81 (1.00)    1.27 (1.00)        2.09 (1.00)     68.84 (1.00)
10          25.66 (0.57)    1.27 (1.00)        2.01 (0.96)     68.08 (0.98)
20          22.25 (0.49)    1.29 (1.01)        1.98 (0.94)     68.07 (0.98)
40          17.07 (0.38)    1.30 (1.02)        1.99 (0.95)     67.38 (0.97)
80          14.34 (0.32)    1.35 (1.06)        1.99 (0.95)     68.44 (0.99)
rca_16
5           53.38 (1.00)    0.79 (1.00)        1.52 (1.00)     23.91 (1.00)
10          31.83 (0.59)    0.82 (1.03)        1.50 (0.98)     23.76 (0.99)
20          19.34 (0.36)    0.86 (1.08)        1.53 (1.00)     24.07 (1.00)
40          14.16 (0.26)    0.90 (1.13)        1.54 (1.01)     24.56 (1.02)
80          11.25 (0.21)    0.93 (1.16)        1.56 (1.02)     24.75 (1.03)
aes_128
5           95.43 (1.00)    1.94 (1.00)        3.00 (1.00)     105.16 (1.00)
10          63.75 (0.66)    1.97 (1.01)        2.95 (0.98)     105.05 (0.99)
20          56.63 (0.59)    2.02 (1.04)        2.99 (0.99)     105.37 (1.00)
40          35.96 (0.37)    2.27 (1.17)        3.19 (1.06)     107.04 (1.01)
80          16.76 (0.17)    2.43 (1.25)        3.34 (1.11)     108.48 (1.03)
jpeg
5           161.06 (1.00)   3.79 (1.00)        5.40 (1.00)     359.20 (1.00)
10          88.84 (0.55)    3.78 (0.99)        5.32 (0.98)     352.72 (0.98)
20          56.79 (0.35)    3.83 (1.01)        5.27 (0.97)     350.51 (0.97)
40          47.29 (0.29)    3.90 (1.02)        5.30 (0.98)     351.06 (0.97)
80          35.47 (0.22)    4.14 (1.09)        5.48 (1.01)     355.50 (0.99)
OS_T2
5           270.77 (1.00)   11.44 (1.00)       -               -
10          149.36 (0.55)   11.62 (1.01)       17.41 (1.00)    520.20 (1.00)
20          129.30 (0.47)   11.64 (1.01)       17.36 (0.99)    517.50 (0.99)
40          108.17 (0.39)   11.72 (1.02)       17.40 (0.99)    518.10 (0.99)
80          102.42 (0.37)   11.79 (1.03)       17.44 (1.00)    519.90 (0.99)
fft_256
5           368.22 (1.00)   14.10 (1.00)       -               -
10          227.62 (0.61)   14.11 (1.00)       24.76 (1.00)    775.32 (1.00)
20          164.78 (0.44)   14.34 (1.01)       24.71 (0.99)    767.55 (0.99)
40          145.87 (0.39)   14.48 (1.02)       24.58 (0.99)    755.23 (0.97)
80          130.14 (0.35)   14.49 (1.02)       24.17 (0.97)    752.00 (0.97)
wirelength is mixed, which is due to the trade-off mentioned earlier. There is a clear sweet
spot in terms of bin size. Increasing the bin size reduces the MIV count, which means
that MIV insertion is easier, which reduces the routed wirelength. However, increasing the
bin size too much means that the increase in projected 2D HPWL outweighs any benefits
obtained from fewer MIVs. This sweet spot is different for different benchmarks, but
Table 21 suggests that a bin size of 10-20µm works well across a wide range of designs,
for this technology. Note that with a different technology, this bin size will need to change
to keep the number of cells per bin constant. Since sweeping the bin size is not feasible
for each new benchmark, a partitioning bin size of 20µm is chosen for all benchmarks, and
all subsequent results presented in this section assume this bin size.
4.1.5.2 Impact of Router-based MIV Insertion
The conventional method for 3D via insertion is to perform a post-placement cell and 3D via
co-placement [28, 12]. This section compares the router-based MIV insertion scheme against this
conventional technique. For reasons that will be given in Subsection 4.1.5.4, it is assumed
that monolithic 3D has one metal layer removed from the top tier. Both placement-driven
MIV insertion, as well as the proposed router-driven MIV insertion are performed, and
results are tabulated in Table 22.
[Table 22; the caption is missing from the source text.]
            Placement-driven           Router-based
Circuit     WL (m)     #MIV (×10^3)    WL (m)    #MIV (×10^3)
mul_64      0.530      3.723           0.473     5.677
LEON3       0.628      3.907           0.549     5.772
nova        2.170      13.687          2.031     22.256
rca_16      1.575      11.749          1.535     19.344
aes_128     3.213      35.026          2.988     56.632
jpeg        6.233      24.010          5.304     56.791
OS_T2       21.740*    73.805*         17.469    129.308
fft_256     31.829*    71.272*         25.133    164.784
Geo-Mean    3.348      17.859          2.943     31.489
Norm.       1.000      1.000           0.879     1.763
In this table, entries marked with a * indicate that that particular flavor is unroutable,
and the wirelength reported is on designs with many thousands of DRC violations. Since
reliable parasitic extraction cannot be performed on such designs, only wirelength and MIV
count are compared. As observed from this table, the placement-driven MIV insertion
often produces results that are unroutable. In those cases that are routable, router-based
MIV insertion improves the routed WL by up to 15%. This is because the placement-based
method tends to cluster vias together, leading to large clumps of vias, and large areas
without any vias. When routing the placement-based method with the commercial router,
no significant congestion is observed during the trial route/global route phase. However,
the vias are so small that it becomes difficult to route to them, causing huge issues during
detailed routing. The router-based method, although it has more MIVs (due to multiple
vias inserted per net), spreads them out over the area of the chip, increasing the routability.
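The Geo-Mean and Norm. rows of Table 22 can be approximately reproduced from the per-design wirelengths. The sketch below copies the two WL columns from the table; the helper name `geo_mean` is hypothetical, and small differences in the last digit are expected because the table entries are already rounded.

```python
from math import exp, log


def geo_mean(values):
    """Geometric mean, computed in the log domain."""
    return exp(sum(log(v) for v in values) / len(values))


# Routed WL (m) per design, copied from Table 22.
placement_wl = [0.530, 0.628, 2.170, 1.575, 3.213, 6.233, 21.740, 31.829]
router_wl = [0.473, 0.549, 2.031, 1.535, 2.988, 5.304, 17.469, 25.133]

gm_placement = geo_mean(placement_wl)
gm_router = geo_mean(router_wl)
print(f"placement-driven geo-mean WL: {gm_placement:.3f} m")
print(f"router-based geo-mean WL:     {gm_router:.3f} m")
print(f"normalized router-based WL:   {gm_router / gm_placement:.3f}")
```

The geometric mean is used so that the largest designs (OS_T2 and fft_256) do not dominate the average.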
4.1.5.3 Impact of Routability-Driven Partitioning
Starting with the min-cut solution, routability-driven partitioning is performed with and
without the interdependent supply/demand (IdS) model proposed in Subsection 4.1.3.4. It is also
assumed that one metal layer is removed from the top tier in M3D. The supply, demand, and
overflow of the min-cut partition of mul_64 with and without IdS are plotted in Figure 57.
From this figure, it is seen that in the case of IdS, the supply of the MIV layer is reduced
due to the demand in the tier 1 top metal, and vice versa. Clearly, not considering IdS
during min-overflow partitioning significantly overestimates the MIV supply. The results
are tabulated in Table 23.
When compared with the min-cut solution, the min-overflow partitioner without IdS can
reduce the routed WL by up to 4.30% (mul_64) and the PDP by up to 3.14% (fft_256). On
average, the min-overflow partitioner without IdS gives 1.8% and 0.9% better wirelength
[Figure 57 panel labels: Tier 0, MIV, Tier 1 (shown for both cases).]
Figure 57: Supply, demand, and overflow maps of the mul_64 benchmark for the min-cut
based partitioning solution. If interdependent supply/demand is considered, a significant
reduction in supply in densely wired areas is observed, leading to more overflow.
Table 23: The impact of routability-driven partitioning on monolithic 3D IC designs.
            Min-cut                              Min-overflow (w/o IdS)               Min-overflow (with IdS)
Circuit     WL (m)  PDP (mW-ns)  #MIV (×10^3)    WL (m)  PDP (mW-ns)  #MIV (×10^3)    WL (m)  PDP (mW-ns)  #MIV (×10^3)
mul_64      0.47    35.52        5.67            0.45    34.91        7.24            0.45    33.11        6.32
LEON3       0.54    25.95        5.77            0.55    26.26        6.69            0.53    25.86        5.84
nova        2.03    69.82        22.25           1.98    68.15        27.51           1.98    67.94        25.05
rca_16      1.53    23.75        19.34           1.50    23.58        25.82           1.49    23.44        23.14
aes_128     2.98    105.47       56.63           2.98    104.05       63.73           2.96    103.97       61.95
jpeg        5.30    351.93       56.79           5.27    357.75       72.43           5.18    349.53       63.10
OS_T2       17.46   522.30       129.30          17.16   517.95       164.86          16.61   509.55       134.98
fft_256     25.13   791.64       164.78          24.18   766.72       222.05          23.26   758.79       180.66
Geo-Mean    2.94    111.25       31.48           2.89    110.22       39.41           2.84    108.47       34.58
Norm.       1.00    1.00         1.00            0.98    0.99         1.25            0.96    0.97         1.09
3.8% and 2.65% boost in the WL and PDP is obtained, respectively. In this case, the min-cut
solution can be improved by up to 7.44% w.r.t. WL and 4.31% w.r.t. PDP. This takes the
average WL and PDP gain over min-cut to 3.4% and 2.2%, respectively. In addition, the
min-overflow solution without IdS underestimates the congestion in the MIV layer, and,
on average, uses 25.2% more MIVs than the min-cut solution. If IdS is considered during
partitioning, the MIV count increase over min-cut goes down to 9.8%.
4.1.5.4 Reducing Metal Layers in Monolithic-3D
Cost is one of the primary concerns that need to be addressed before 3D ICs can be widely
adopted. If each tier in a monolithic 3D IC uses the same number of metal layers as 2D,
the additional cost over 2D is the bonding of the empty silicon wafer. One method to offset
the increased cost is to reduce the number of metal layers in 3D, reducing the total cost of
the chip.
Reducing the number of metal layers in monolithic 3D is now explored. The default case
is when both tiers have the same number of metal layers as 2D (Table 20). Reducing one
metal layer from the top tier alone is termed “Tm1”, and reducing one metal layer from
each of the top and bottom tiers is termed “Tm1Bm1”. For each of these cases, min-cut
partitioning, as well as min-overflow partitioning with and without IdS, is performed. The
wirelength and PDP for all these cases are plotted in Figure 58. The curves for 2D are also
plotted as a comparison.
The first thing observed is that even with a reduced metal count, all designs in monolithic
3D can be routed with zero DRC violations. These designs were not routable
with fewer metal layers in 2D, so the fact that they are now routable indicates that monolithic
3D reduces the routing demand significantly. The next thing to note is that, as expected,
reducing the metal layer count increases the wirelength and PDP. The magnitude of
this increase depends on how congested the initial design is to begin with. In addition, the
min-overflow partitioner helps both wirelength and PDP significantly. In many cases, the
“Tm1” min-overflow (without IdS) result is better than the min-cut with all metal layers.
Similarly, the addition of IdS into the partitioner gives a further WL and PDP benefit. In
several cases, designs can have two metal layers removed and still have lower WL than the
[Figure 58 panels: OS_T2, aes_128, jpeg, rca_16. Legend: 2D; 3D: Min-cut; 3D: Min-overflow; 3D: Min-overflow + IdS.]
Figure 58: The impact of reducing the metal layer count. “Tm1” (“Bm1”) stands for one
metal layer removed from the top (bottom) tier.
4.1.5.5 Application to Face-to-Face Bonding
So far, the proposed approach has been applied to monolithic 3D ICs only. However,
this approach is general and is applicable to any 3D technology where the via size is so
small that the placement need not be aware of the vias. This section now discusses how this
methodology applies to face-to-face (F2F) technology, which has a different stack-up than
the face-to-back style discussed so far. The placement engine itself need not change. This
is because a design can be placed as if it were face-to-back, and then the mask of the top
tier is mirrored with the center of the die as the axis of symmetry. Modifications to the
min-overflow algorithm and the router-based via insertion step are now discussed.
For the min-overflow partitioner, F2F without IdS is identical to the monolithic 3D
partitioner without IdS. With IdS, only a few changes need to be made. First, the supply in
the F2F layer depends on the top metal layer of both tiers, not just the top tier. Therefore,
Equation (32) is computed for both tiers separately, and the number of F2F via blockages
is the maximum of the two. Next, to calculate the 2D supply reduction, Equation (35) is
applied to each tier independently.
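The F2F change to the blockage count amounts to evaluating Equation (32) once per tier and keeping the larger result. A minimal sketch follows, assuming the same effective slot count for both tiers; the function names and the example route counts are hypothetical.

```python
def blocked_slots(n_v_eff, n2d_north, n2d_south, n3d_north, n3d_south):
    """Equation (32) evaluated for the top metal layer of one tier."""
    return (0.5 * n_v_eff * (n2d_north + n2d_south)
            + (0.5 * n_v_eff - 1) * (n3d_north + n3d_south))


def f2f_blocked_slots(tier0_counts, tier1_counts, n_v_eff):
    """F2F blockage: the larger of the two per-tier results of Equation (32).

    Each *_counts tuple holds (n2d_north, n2d_south, n3d_north, n3d_south).
    """
    return max(blocked_slots(n_v_eff, *tier0_counts),
               blocked_slots(n_v_eff, *tier1_counts))


# Hypothetical route counts on the top metal layer of each tier.
print(f2f_blocked_slots((1, 1, 0, 0), (2, 0, 1, 0), n_v_eff=8))
```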
For router-based F2F insertion, consider the modified technology LEF file as shown in
Figure 54(a). To represent face-to-face, the order of the metal layers of the top die simply
needs to be reversed. The stack-up will now be M1_1, ..., MN_1, MN_0, ..., M1_0. Note
that no additional modifications are made to the macro LEF file. In addition, no routing
blockages are placed over cells, as F2F vias do not occupy silicon space. Finally, while
tracing the routing topology, the F2F landing pads are created on the top metal layers of
each tier. Each tier can then be routed, and the mask of the top tier will be mirrored before
Figure 59: (a) Monolithic 3D integration, and (b) Face-to-face 3D integration. MIVs are
limited to whitespace, while F2F vias are not.
F2F vias are assumed to have a width of 0.5µm, a resistance of 0.1Ω, and a capacitance
of 0.2fF. The “Tm1” case is assumed, and the routed WL and PDP for the min-cut and
min-overflow (with and without IdS) partitioners are tabulated in Table 24. Although the
min-overflow partitioner without IdS gives an average WL reduction of 2%, the PDP actually goes up
slightly. This is due to overestimation of the available F2F supply, and the more accurate
partitioner with IdS corrects this issue.
Table 24: The impact of routability-driven partitioning for face-to-face designs.
            Min-cut                              Min-overflow (w/o IdS)               Min-overflow (with IdS)
Circuit     WL (m)  PDP (mW-ns)  #F2F (×10^3)    WL (m)  PDP (mW-ns)  #F2F (×10^3)    WL (m)  PDP (mW-ns)  #F2F (×10^3)
mul_64      0.49    35.87        5.29            0.47    35.67        6.94            0.46    35.45        6.44
LEON3       0.59    27.29        5.32            0.58    27.92        6.36            0.58    27.06        5.90
nova        2.09    73.57        20.21           2.03    73.89        25.58           2.02    71.30        23.82
rca_16      1.53    23.75        19.34           1.50    23.68        24.36           1.47    23.38        21.97
aes_128     3.01    111.72       52.81           3.05    112.38       60.33           2.97    108.36       62.20
jpeg        5.37    351.63       52.58           5.30    348.76       68.12           5.19    344.97       60.00
OS_T2       17.78   533.25       115.73          17.48   530.55       153.96          17.02   521.85       130.31
fft_256     25.31   762.76       145.04          24.35   757.78       204.96          23.43   735.29       168.82
Geo-Mean    3.01    113.39       29.09           2.95    113.46       37.08           2.90    110.94       33.62
Norm.       1.00    1.00         1.00            0.98    1.00         1.27            0.96    0.97         1.15
4.1.5.6 Overall Comparisons
The WL and PDP numbers of 2D, and the monolithic 3D and face-to-face designs obtained
after partitioning with IdS are now compared. The results are tabulated in Table 25. From
this table, M3D offers up to a 25.6% WL benefit and a 16.6% PDP benefit. On average, M3D
offers 19.9% and 11.8% WL and PDP benefit, respectively. In contrast, F2F offers up to a
23.8% WL benefit and a 14.6% PDP benefit. On average, 18.2% and 10.1% WL and PDP
benefit is seen, respectively.
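The per-design benefit over 2D is a simple relative reduction, which can be recomputed from Table 25. The sketch below uses the OS_T2 row, whose values land close to the quoted maxima; the helper name `benefit_percent` is hypothetical, and the table entries are already rounded, so the last digit may differ from the quoted figures.

```python
def benefit_percent(value_2d, value_3d):
    """Percentage reduction of a 3D metric relative to its 2D baseline."""
    return 100.0 * (1.0 - value_3d / value_2d)


# OS_T2 row of Table 25: 2D, 3D-MIV, and 3D-F2F values.
wl_2d, wl_miv, wl_f2f = 22.352, 16.615, 17.024
pdp_2d, pdp_miv, pdp_f2f = 611.400, 509.550, 521.850

print(f"MIV: WL {benefit_percent(wl_2d, wl_miv):.2f}%, "
      f"PDP {benefit_percent(pdp_2d, pdp_miv):.2f}%")
print(f"F2F: WL {benefit_percent(wl_2d, wl_f2f):.2f}%, "
      f"PDP {benefit_percent(pdp_2d, pdp_f2f):.2f}%")
```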
Table 25: Overall Comparisons
            2D                     3D – MIV               3D – F2F
Circuit     WL (m)   PDP (mW-ns)   WL (m)   PDP (mW-ns)   WL (m)   PDP (mW-ns)
mul_64      0.584    39.432        0.452    33.119        0.468    35.454
LEON3       0.638    28.088        0.537    25.863        0.582    27.060
nova        2.447    75.420        1.982    67.947        2.028    71.308
rca_16      1.727    26.010        1.491    23.443        1.474    23.384
aes_128     3.632    117.148       2.961    103.978       2.979    108.365
jpeg        6.769    399.339       5.183    349.531       5.193    344.972
OS_T2       22.352   611.400       16.615   509.550       17.024   521.850
fft_256     28.922   861.750       23.263   758.793       23.436   735.297
Geo-Mean    3.547    123.338       2.842    108.476       2.901    110.941
Norm.       1.000    1.000         0.801    0.880         0.818    0.899
In general, F2F has slightly worse numbers than monolithic 3D. This is because of the
larger via sizes (necessitated by die alignment) and the fact that connecting two gates in
3D requires a stacked via through both tiers. F2F also has other issues not considered here,
such as the requirement of being in a regular array, through-silicon vias required for I/O
connections to the chip, and the non-availability of flip-chip style packaging.
4.1.6 Comparison with Existing 3D Placers
The proposed placer is compared against two existing techniques, which were primarily
developed for TSV-based 3D placement. The first technique is 3D-Craft [12], which performs
true 3D placement, and the other is the partition-then-place approach [28]. No comparison
is made against another TSV-specific 3D placer [21], because the binary is not publicly
available. In addition, [21] only presents absolute 3D WL numbers without providing any
2D baseline number. It is therefore unclear how much of the improvement comes from
their 2D engine, and how much from their 3D-specific approach. Since the proposed 3D
approach can easily incorporate any 2D engine, any engine-specific gains in [21] will also
carry over.
4.1.6.1 Comparison with 3D-Craft [12]
Only the binary version of this tool is available, and it does not support a target-density-driven
mode. The cells are preset to always be placed with a target density of 1, i.e., without
any whitespace in between them. Such a placement solution will not have any space for
router-driven MIV insertion, and hence is inherently not routable. For this reason, only the
3D half-perimeter wirelength (HPWL) is compared in this section. In addition, the binary
provided is not capable of handling pre-placed hard macros such as memory. Therefore, in
this subsection, only the pure-logic designs are compared.
Both the proposed placer and 3D-Craft are run with the number of dies set to one to
give a 2D placement. Next, both placers are run with the number of dies set to two, which
gives a 3D placement. Only the improvement in HPWL when going to 3D is compared.
The proposed placer is run with a target density of 1 to match the preset setting of 3D-Craft.
3D-Craft also has a via weight parameter in the cost function (as it is TSV-based), which
controls the number of 3D vias. This is set to 0 to make the cost function purely 3D HPWL
driven. The results of both placers are tabulated in Table 26.
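As a minimal illustration of the metric being compared, the following Python sketch computes a net's HPWL from its pin coordinates. It assumes that, with the via weight set to 0, the tier of a pin contributes nothing to the cost, so only x and y are used; the pin tuple layout and function names are hypothetical.

```python
from typing import Iterable, Tuple

# (x, y, tier); the tier field is carried along only for bookkeeping.
Pin = Tuple[float, float, int]

def hpwl_3d(pins: Iterable[Pin]) -> float:
    """Half-perimeter of the x/y bounding box over all pins of one net.

    Assumption: with zero via weight, the tier index adds no cost.
    """
    pins = list(pins)
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def ratio_3d_over_2d(hpwl_2d: float, hpwl_3d_value: float) -> float:
    """The 3D/2D ratio; values below 1 mean 3D shortens the wirelength."""
    return hpwl_3d_value / hpwl_2d

example_net = [(0.0, 0.0, 0), (3.0, 4.0, 1), (1.0, 1.0, 0)]
print(hpwl_3d(example_net))
```

Summing `hpwl_3d` over all nets for the one-die and two-die runs of the same placer gives the two totals whose ratio is compared.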
Table 26: Comparison between 3D-Craft and Our Placer
                 Our HPWL (m)            3D-Craft HPWL (m)
Circuit       2D      3D    3D/2D       2D      3D    3D/2D
mul 64       0.39    0.30    0.77      0.34    0.27    0.79
rca 16       1.15    0.92    0.79      1.22    0.97    0.80
aes128       2.61    1.93    0.74      2.52    1.87    0.74
jpeg         4.96    3.70    0.74      5.09    3.78    0.74
fft 256     18.95   13.63    0.72     19.57   13.31    0.68
Geo-Mean     2.56    1.93    0.75      2.54    1.90    0.75
Norm.        1.00    1.00    1.00      0.99    0.99    1.00
From this table, both placement approaches produce comparable wirelength improve-
ments when going to 3D. Since the proposed placer takes some steps to minimize the MIV
count such as min-cut partitioning, the MIV counts are not compared. The benefit of the
proposed approach comes not just from comparable improvements in HPWL, but from the
fact that any 2D placer can be easily modified and coupled with our partitioner to give
high-quality results.
4.1.6.2 Comparison with Partition-then-Place [28]
This technique of 3D placement first performs partitioning, and then simultaneous 2D
placement of all the tiers while minimizing 3D HPWL. During placement, it looks at all
gates in the 3D space, but does not move gates between tiers. Therefore, the initial partition
solution is very important, as it greatly affects solution quality. The same KraftWerk engine
is used for both types of placement, so they have identical 2D numbers. The utilization of
each circuit is set to 70%, and both placement solutions are taken through router-based MIV
insertion to obtain routed WL. To generate initial partitions for the partition-then-place ap-










































































































































































































[Figure 60 plot legend: Partition, then place; Placement-aware partitioning]
Figure 60: Comparison of 2D, partition-then-place, and placement-aware partitioning
methods.
First, the placement-aware partitioning approach is run, and the number of nets used
is computed. Partitions are generated starting from this cutsize, in increments of ±5% of
the number of nets. The wirelength and PDP for all approaches are plotted in Figure 60.
From these graphs, it is clear that choosing an appropriate cutsize is very important to the
solution quality. In addition, the proposed approach gives the best wirelength, without the
need to sweep the cutsize.
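The cutsize sweep described above can be sketched as follows. This is a hedged illustration only: the function name, the `steps` parameter, and the choice to drop non-positive candidates are assumptions, and the text does not state how many steps were actually swept.

```python
def cutsize_sweep(base_cutsize, num_nets, steps=3, step_frac=0.05):
    """Candidate cutsizes centred on the placement-aware cutsize.

    Each step changes the cutsize by step_frac * num_nets (5% of the
    number of nets by default). `steps` is a hypothetical parameter.
    """
    delta = step_frac * num_nets
    candidates = [round(base_cutsize + k * delta) for k in range(-steps, steps + 1)]
    # A cutsize of zero or less is not a meaningful partitioning target.
    return [c for c in candidates if c > 0]

print(cutsize_sweep(base_cutsize=1000, num_nets=10000, steps=2))
```

Each candidate would then be handed to the partitioner of the partition-then-place flow as its target cutsize, which is exactly the sweep that the placement-aware approach avoids.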
4.2 Monolithic 3D IC Design With Commercial 2D IC Tools
The previous section has described how to modify an academic 2D placer to obtain M3D
designs. However, this technique has several limitations. Academic placers usually target
wirelength as the objective function, and not timing, which is more critical. In addition, the
techniques in Section 4.1 do not consider timing optimization, while real M3D designs need
to be timing closed. Finally, commercial engines include state-of-the-art power optimization
techniques such as Vt swaps for gates not on the critical path. For a fair comparison
with commercial-quality 2D results, M3D needs these optimizations as well. Therefore,
this section presents a methodology to utilize commercial 2D engines, along with all
state-of-the-art optimization steps, to obtain M3D results. The OpenSPARC T2 [52] core is used as
a case study throughout this section.
4.2.1 CAD Methodology
This section discusses how the techniques presented in Section 4.1 can be modified to use
commercial 2D engines instead of academic ones.
4.2.1.1 Overall Methodology
The overall design flow is shown in Figure 61. First, in order to utilize the 2D tool to
handle all the standard cells in a reduced footprint, several technology files are scaled;
this process is described in detail in Subsection 4.2.1.2. Next, memory handling requires
several steps such as memory scaling, memory placement, and memory flattening,
which are described in detail in Subsection 4.2.1.3. Once this is done, the commercial
2D engine (Cadence Encounter) can be run on this “shrunk 2D” design (described
in Subsection 4.2.1.4). This result is then split into multiple tiers to obtain a DRC-clean
sign-off design as described in Subsection 4.2.1.5, and finally timing and power analysis is
performed as before.
[Figure 61 flow stages: Technology Scaling; Memory Scaling; Memory Placement; Memory Flattening]
Figure 61: The overall CAD methodology flow used in this paper.
4.2.1.2 Scaling Technology Files
The goal of this step is twofold. The commercial 2D tool first needs to be tricked into
placing all the gates in half the footprint area, and it also needs to be able to extract the wire
parasitics such that the shrunk 2D design reflects the final geometries in a 3D design. Note
that this subsection assumes a gate-only design, and handling memory will be introduced
in Subsection 4.2.1.3.
Placing all the gates into half the area can be achieved by shrinking the area of each
standard cell by 50%. The width, height, and the location of all the pins within the cell
are scaled by 1/√2 (≈0.707). In addition, the chip width and height are scaled by 0.707 to
reduce the 2D footprint area by half. This will also be the footprint of each tier in the final
M3D design. Note that since the x- and y-axis equations in an analytical placer are linear,
scaling all the dimensions by 0.707 will simply make the cell locations 0.707 of what they
used to be in the 2D placement solution. This leads to a theoretical HPWL improvement of
1 − 0.707 = 29.3%.
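The scaling argument can be checked numerically. The sketch below uses hypothetical pin coordinates; it scales them by 1/√2 and confirms that areas halve while the HPWL of any net shrinks by about 29.3%.

```python
import math

SCALE = 1 / math.sqrt(2)  # about 0.707; applying it to both axes halves any area

def hpwl(pins):
    """Half-perimeter wirelength of one net given (x, y) pin locations."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def shrink(pins, scale=SCALE):
    """Scale every pin location, as a linear analytical placer would."""
    return [(x * scale, y * scale) for x, y in pins]

net = [(0.0, 0.0), (120.0, 40.0), (60.0, 200.0)]  # hypothetical pins, in um
before = hpwl(net)
after = hpwl(shrink(net))

print(f"area scale : {SCALE ** 2:.3f}")
print(f"HPWL change: {100 * (1 - after / before):.1f}% shorter")
```

Because the scaling is uniform, the same 29.3% figure holds for every net regardless of the pin locations chosen.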
Next, in order to make the routing in the shrunk 2D accurately represent the routing
in monolithic 3D, both the metal width and pitch of each metal layer are shrunk by 0.707.
Since the chip width and height are also shrunk by the same amount, the total routing track
length does not change between 2D and shrunk 2D. The total track length will also be the
same in 3D, and hence this method gives a good estimate of wirelength. Note that the wire
RC per unit length is not changed, even though the wire width is smaller. Therefore, the
extracted RC values from the tool do not reflect the geometry of shrunk 2D, but that of an
M3D wire of equivalent length using the original metal geometries.
4.2.1.3 Handling Memory Macros
While standard cells can be handled by shrinking their footprint, this is not the case for
memory. This is because standard cells can be moved by the placer, while memory is
pre-placed. Since no standard cell can be placed in the location where a memory is pre-placed,
simply shrinking the memory is not an option. A pre-placed memory can be thought of
as a combination of its pins, which serve as anchors for standard cell placement, and a
placement blockage over its footprint, which prevents cells from being placed over it. Each
component is described separately.
In order to isolate the memory pin portion, the footprint of the memory is shrunk to
the minimum size possible (that of a filler cell). However, the relative locations of its pins
are not scaled. This is shown in Figure62. This will lead to memory pins that are placed
outside the memory footprint. These pins will be in the same location they would have
been if the memory was its original size. Therefore, from a memory pin perspective, the








Figure 62: Isolating the memory pins by shrinking the memory footprint. (a) Initial
memory footprint, and (b) Memory footprint reduced to size of filler cell.
Handling the placement blockage portion of the memory is similar to what was described
in Figure 50. Those regions that have two memories overlapping cannot contain
cells in any tier, and hence will become full placement blockages in the shrunk 2D footprint.
Those regions that have only one memory can contain cells in the tier where the
memory is not placed. In the shrunk 2D design, the maximum placement density of these
regions needs to be reduced to reflect this fact. This can be achieved by using partial
placement blockages. For example, if the target density of the final 3D design is 70%, then the
maximum placement density of the partial placement blockages is set to 35%.
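The blockage rule above amounts to scaling the target density by the fraction of tiers that remain free in a region. A minimal Python sketch, with a hypothetical function name and a two-tier default:

```python
def max_density(num_tiers_blocked, target_density, num_tiers=2):
    """Maximum placement density of a region in the shrunk 2D footprint.

    num_tiers_blocked: number of tiers in which a memory covers the region.
    target_density:    target density of the final 3D design (0..1).
    """
    if not 0 <= num_tiers_blocked <= num_tiers:
        raise ValueError("blocked tier count out of range")
    free_tiers = num_tiers - num_tiers_blocked
    return target_density * free_tiers / num_tiers

for blocked in (0, 1, 2):
    print(blocked, max_density(blocked, 0.70))
```

With a 70% target, a region covered by one memory is limited to 35% (a partial blockage) and a region covered by memories in both tiers is limited to 0% (a full blockage).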
4.2.1.4 Shrunk 2D Place and Route
The shrunk technology and standard cell libraries are fed along with the memory-related
pins and blockages into Cadence Encounter. This commercial 2D IC tool is then used to
run through all the design stages such as placement, post-placement optimization, CTS,
routing, and post-route optimization. Unlike conventional 3D flows, this approach avoids
the problem of tier-by-tier timing optimization. The advantage of this is that the tool can
see the entire 3D path, and will insert the minimum buffers required to meet timing.
4.2.1.5 Obtaining a 3D Design
Once the shrunk 2D place and route is done, the cells and memories are expanded back
to their original areas. This directly corresponds to results from modified 2D academic
placers, and the existing partitioning approaches can be applied to this result. A snapshot











[Figure 63 panels: Memory Flattening; Shrunk 2D P&R; Tier Partitioning]
Figure 63: Pre-placed memory is flattened to get a shrunk 2D footprint, on which 2D
P&R is performed. This is then partitioned to get a monolithic 3D solution.
In addition to splitting the logic, the commercial flow enables the building of a clock
tree in the shrunk 2D design. The conventional approach for 3D ICs (using commercial
tools) is to create one separate clock tree per tier, and tie them together using a single











Figure 64: Two different types of 3D CTS possible: (a) one clock tree per tier for each
gating group (source-level), and (b) the entire backbone is fixed onto tier 0 (leaf-level).
to use the conventional approach, all the clock gating cells are fixed onto tier 0 (as shown
in Figure 64(a)), and one clock tree per tier is constructed for each gating group. This is
termed source-level CTS, as MIVs are inserted close to the clock source. This approach
does not use the clock tree from shrunk 2D at all, so if this approach is to be used, no
clock tree is constructed in shrunk 2D, and instead a fixed clock uncertainty is set during
optimization.
This section proposes a new CTS methodology that will help reduce the clock power.
Since MIVs are very small, it can be assumed that any number of them can be inserted.
In this case, the existing CTS result of shrunk 2D can be reused. This clock tree contains
several levels of logic as shown in Figure 64(b). During the logic splitting process, the
entire clock backbone (clock buffers and clock gates) is fixed onto tier 0. Only the
leaf-level flip-flops are free to be partitioned to maintain area balance. Therefore, MIVs will
be inserted following all leaf clock buffers that drive flip-flops in both tiers. This approach
is termed leaf-level CTS, and an example of this approach for the OpenSPARC T2 core is
shown in Figure 65.
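The leaf-level scheme can be illustrated by counting the clock MIVs it implies. The sketch below is an assumption-laden simplification: it supposes one MIV per leaf clock net that reaches tier 1, with the leaf buffer itself on tier 0, and the dictionary-based netlist representation is hypothetical.

```python
from typing import Dict, List

def clock_mivs_leaf_level(leaf_fanout: Dict[str, List[str]],
                          ff_tier: Dict[str, int]) -> int:
    """Count clock MIVs when the whole clock backbone is fixed on tier 0.

    leaf_fanout: leaf clock buffer -> flip-flops it drives.
    ff_tier:     flip-flop -> tier assigned by the partitioner (0 or 1).
    Assumption: a leaf net needs exactly one MIV if any sink is on tier 1.
    """
    return sum(
        1 for flops in leaf_fanout.values()
        if any(ff_tier[ff] == 1 for ff in flops)
    )

fanout = {"buf0": ["ff0", "ff1"], "buf1": ["ff2"], "buf2": ["ff3", "ff4"]}
tiers = {"ff0": 0, "ff1": 1, "ff2": 0, "ff3": 1, "ff4": 1}
print(clock_mivs_leaf_level(fanout, tiers))
```

In this toy case, `buf0` and `buf2` each need an MIV while `buf1` does not, which illustrates why the MIV count of leaf-level CTS grows with the number of leaf buffers rather than with the number of clock gates.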
Next, the same gate-level MIV insertion scheme can be used. However, for certain
nets, the router is bound to insert multiple MIVs. Since existing 3D tool flows use tier-by-
tier optimization, timing constraints need to be derived for each tier. In each tier, MIVs










Figure 65: The proposed CTS methodology: (a) the clock backbone in tier 0, and (b)
zoom-in shot of leaf-level flip-flops in both tiers connected to a leaf clock buffer in tier 0.
However, if a single net contains multiple MIVs, then it becomes very difficult to capture
multiple input/output delays on a single net, as such conditions do not arise in 2D ICs
(which current tools are designed for). Therefore, multiple MIV insertion is converted to
single MIV insertion by picking the best MIV (in terms of HPWL) from those inserted,
and re-routing the net. This could potentially increase the wirelength, but is unavoidable
for conventional 3D flows. In the proposed flow, since the optimization is performed in
the shrunk 2D design and not tier-by-tier, multiple MIV insertion can be used, which will
reduce wirelength and power. Routing topologies for single and multiple MIV insertion
for a given net are shown in Figure 66. Once the 3D design is obtained, timing and power
analysis can be performed as usual.
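The conversion from multiple to single MIV insertion can be sketched as a selection over the MIVs the router already inserted. The cost used below (the sum of the per-tier HPWL with the MIV treated as an extra pin in each tier) is an assumed reading of "best in terms of HPWL"; the function names are hypothetical.

```python
def hpwl(points):
    """Half-perimeter wirelength of a set of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def pick_single_miv(tier0_pins, tier1_pins, candidate_mivs):
    """Return the inserted MIV that minimizes the summed per-tier HPWL."""
    def cost(miv):
        # The MIV acts as a pin of the sub-net on each tier.
        return hpwl(tier0_pins + [miv]) + hpwl(tier1_pins + [miv])
    return min(candidate_mivs, key=cost)

tier0 = [(0, 0), (2, 0)]
tier1 = [(10, 10), (12, 10)]
mivs = [(6, 5), (20, 20), (-5, -5)]
print(pick_single_miv(tier0, tier1, mivs))
```

After the choice is made, the net would be re-routed through the single remaining MIV, which is where the potential wirelength increase of conventional flows comes from.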
4.2.2 Power Benefit Study
The OpenSPARC T2 core is chosen as a case study, and implemented in a 28nm technology
library. The power benefit that monolithic 3D ICs offer when compared to a
commercial-quality sign-off 2D design is investigated. All the numbers presented in this
section are for timing-closed designs, with a frequency of 1 GHz. This is the maximum
frequency at which the 2D version could be designed while using a high-effort
timing-driven flow in Cadence Encounter. The footprint area of the monolithic 3D IC design
is exactly half that of the 2D design, and therefore, all 3D designs presented here have zero
total silicon area overhead when compared to 2D.
[Figure 66 labels: Tier0; Multiple MIVs]
Figure 66: Two types of MIV insertion for a 3D net: (a) single, (b) multiple.
The MIV diameter is assumed to be 100nm, and its resistance and capacitance are
assumed to be 2Ω and 0.1fF, respectively. Comparisons with face-to-face integration are
also provided, and the F2F via diameter, resistance, and capacitance are assumed to be
500nm, 0.5Ω, and 0.2fF, respectively. All required scripts are implemented in C/C++,
Python, and Tcl.
4.2.2.1 Single vs. Multiple MIV Insertion
The power benefit offered by using multiple MIVs (or F2F vias) for each 3D net is first
investigated. A summary of results for both single and multiple MIV insertion is tabulated
in Table 27.
From this table, it is observed that using multiple vias offers 8.4% and 10.04% wirelength
reduction for M3D and F2F, respectively. In addition, the number of 3D vias roughly doubles.
This means that each net is, on average, using approximately two MIV/F2F vias. This
wirelength reduction does not reduce leakage power, but it does reduce some cell power. The
biggest reduction is in net power, which reduces by 3.81% and 4.53% for M3D and F2F,
which translates to 2.25% and 2.66% total power reduction, respectively.
Table 27: Comparison of single vs. multiple MIV/F2F insertion. Power values are reported
in mW, and wirelength in meters.
                    Monolithic 3D                   Face-to-face
             Single  Multiple  Diff (%)     Single  Multiple  Diff (%)
Total WL      15.61    14.29     -8.43       15.44    13.89    -10.05
#MIV/F2F       106k     235k   +120.44        106k     202k    +89.72
Total Pwr    534.10   522.10     -2.25      538.30   524.00     -2.66
Cell Pwr     126.90   126.10     -0.63      127.30   126.40     -0.71
Net Pwr      293.90   282.70     -3.81      297.80   284.30     -4.53
Lkg Pwr      113.30   113.30      0.00      113.30   113.30      0.00
Table 28: Comparison of two different types of 3D CTS. Power values are reported in
mW, and wirelength in meters.
                        Monolithic 3D                          Face-to-face
              Source-level  Leaf-level  Diff (%)    Source-level  Leaf-level  Diff (%)
#MIV/F2F            871       11,376     +1.2k            871       11,376     +1.2k
Skew (ps)        197.42       103.00    -47.83         172.90       117.07    -32.29
Clock Pwr         68.40        48.00    -29.82          69.00        48.50    -29.71
Tier0 WL           0.55         0.62    +11.89           0.53         0.62    +16.61
Tier1 WL           0.48         0.19    -60.50           0.48         0.17    -64.85
Total WL           1.03         0.80    -21.67           1.01         0.79    -21.91
#Tier0 Buf       14,610       21,687    +48.44         14,958       21,687    +44.99
#Tier1 Buf       12,444            0      -100         12,691            0      -100
#Total Buf       27,054       21,687    -19.84         27,649       21,687    -21.56
4.2.2.2 CTS: Source-level vs. Leaf-level
This section discusses the power benefit that the proposed CTS methodology (leaf-level)
offers over existing 3D techniques (source-level). A summary of results is tabulated in
Table 28. Leaf-level CTS reduces the clock skew by 47.83% for M3D (32.29% for F2F), and
the clock tree power by 29.82% (29.71% for F2F). There are 871 clock-gating related cells in
the design, which is why source-level CTS uses that number of MIV/F2F vias. In addition,
leaf-level uses far more 3D vias, which helps reduce the clock power.
These power reduction numbers can be explained on the basis of per-tier wirelength and
buffer count. Leaf-level CTS uses far more buffers and has a longer WL on tier 0, which
is the tier with the clock-backbone. On the other hand, the number of buffers is zero in
tier 1 and the WL is much smaller. In comparison, source-level has a more balanced clock
WL and buffer count between the tiers, but this comes at the cost of an increase in the total
clock WL and buffer count.
4.2.2.3 Overall Comparisons: 2D vs. 3D
Using the techniques that give the best power reduction (i.e., multiple MIV insertion and
leaf-level CTS), M3D and F2F are compared with a 2D IC designed using Cadence Encounter.
A summary of results is tabulated in Table 29. From this table, shrunk 2D reduces
the wirelength by 27.05% compared to 2D. This is very close to the 29.3% HPWL bound
predicted in Section 4.2.1. The improvement number goes down for both M3D and F2F,
which is to be expected. In addition, M3D has slightly higher WL compared to F2F because
the MIVs are limited to whitespace, while F2F vias are not. Next, the 3D implementations
reduce the buffer count by 22.3%, which translates to an 8.03% reduction in total gate count.
Since the MIV and F2F designs are obtained by simply splitting the shrunk 2D design, all three
have the same gate counts. The reduced wirelength and gate count lead to a total power
reduction of 15.57% and 15.27% for M3D and F2F, respectively. Finally, F2F has a higher
power consumption than M3D even though it has lower WL, which is due to the increased
parasitics of F2F vias. Also, both M3D and F2F power numbers are quite close to the shrunk
2D numbers, which shows that the shrunk 2D design is a very good estimate of M3D and
other fine-grained 3D technologies.
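The percentages quoted above follow directly from the absolute numbers in the overall comparison table. A small Python sketch (the helper name is hypothetical; the values are copied from the table) makes the arithmetic explicit:

```python
def reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

# Total power (mW) and total wirelength (m) from the overall comparison table.
TOTAL_PWR = {"2D": 618.40, "M3D": 522.10, "F2F": 524.00}
TOTAL_WL = {"2D": 17.96, "Shrunk 2D": 13.10}

for style in ("M3D", "F2F"):
    pct = reduction_pct(TOTAL_PWR["2D"], TOTAL_PWR[style])
    print(f"{style} total power reduction: {pct:.2f}%")

wl_pct = reduction_pct(TOTAL_WL["2D"], TOTAL_WL["Shrunk 2D"])
print(f"Shrunk 2D wirelength reduction: {wl_pct:.2f}%")
```

The wirelength figure computed from the rounded table entries comes out at roughly 27%, which is consistent with the quoted value to within the rounding of the tabulated wirelengths.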
The total power is divided into cell, net, and leakage power. The cell power reduces by
an amount roughly equal to the total gate count reduction. The net power reduces roughly
in proportion to wirelength, and finally, the leakage reduction is slightly larger than the cell
count reduction due to smaller buffer sizes. The total power can also be split up by lumping
the internal, net, and leakage power of certain classes of gates/memory together. This is also
tabulated in Table 29. It is observed that the flip-flop clock pin power and register power
are virtually unchanged in 3D. The biggest savings in power come from combinational
Table 29: Overall comparisons between 2D and different 3D implementation styles.
Power numbers are in mW.
               Enc. 2D    Shrunk 2D            Monolithic 3D        Face-to-face
Total WL (m)     17.96      13.10 (-27.0%)       14.29 (-20.4%)       13.89 (-22.6%)
# MIV/F2F            -          -              235,394              235,394
# Buffers      164,917    128,098 (-22.3%)     128,098 (-22.3%)     128,098 (-22.3%)
# Tot. Gates   458,824    421,959 (-8.0%)      421,959 (-8.0%)      421,959 (-8.0%)
Total Pwr       618.40     514.40 (-16.8%)      522.10 (-15.5%)      524.00 (-15.2%)
Cell Pwr        135.60     126.80 (-6.4%)       126.10 (-7.0%)       126.40 (-6.7%)
Net Pwr         356.30     274.30 (-23.0%)      282.70 (-20.6%)      284.30 (-20.2%)
Leak. Pwr       126.50     113.30 (-10.4%)      113.30 (-10.4%)      113.30 (-10.4%)
Mem. Pwr         49.00      45.10 (-7.9%)        45.10 (-7.9%)        45.00 (-8.1%)
Comb. Pwr       385.10     300.00 (-22.1%)      305.30 (-20.7%)      306.80 (-20.3%)
Clk Tr. Pwr      62.50      46.90 (-24.9%)       48.00 (-23.2%)       48.50 (-22.4%)
FF Clk Pwr        9.70       9.90 (+2.0%)         9.60 (-1.0%)         9.70 (0.0%)
Reg. Pwr        112.10     112.50 (+0.3%)       114.00 (+1.6%)       114.00 (+1.6%)
logic (20.72% savings), and from the clock tree (23.20% savings). There also exist some
memory power savings due to the reduction in the output net length that the memory drives.
4.2.2.4 Impact of Dual-Vt Gates
All the results discussed so far have used only the regular Vt standard cell library for both
2D and 3D designs. However, it is known that converting cells on non-critical paths to a
high-Vt flavor can help reduce leakage power. In this section, dual-Vt designs (DVT) are
implemented, and their power benefit versus single-Vt designs (SVT) is evaluated. For both
2D and 3D (shrunk 2D), Encounter is used to perform leakage optimization during the P&R
flow. In addition, leakage optimizations are performed in PrimeTime using a script similar
to [19], and the results are tabulated in Table 30.
It is observed that dual-Vt M3D designs reduce the total power of 2D designs by 16.08%.
This is a slightly better improvement number than in the SVT case alone. This is due to
the fact that there are more paths that become non-critical in 3D. The F2F improvement
numbers are also better than in the SVT case. Therefore, the 3D power benefit not only
carries over to dual-Vt designs, it actually improves.
Table 30: Dual-Vt comparisons between 2D and different 3D implementation styles.
Power is in mW.
                  Enc. 2D    Monolithic 3D         Face-to-face
Total WL (m)        17.94      14.29 (-20.33%)       13.89 (-22.59%)
#MIV/F2F                -    235,394               202,593
Total Pwr          572.10     480.10 (-16.08%)      482.20 (-15.71%)
Cell Pwr           131.80     123.00 (-6.68%)       123.30 (-6.45%)
Net Pwr            356.60     282.70 (-20.72%)      284.30 (-20.27%)
Leak. Pwr           83.60      74.40 (-11.00%)       74.60 (-10.77%)
Mem. Pwr            48.80      45.10 (-7.58%)        45.00 (-7.79%)
Comb. Pwr          361.60     283.00 (-21.74%)      284.30 (-21.38%)
Clk Tree Pwr        62.50      48.00 (-23.20%)       48.50 (-22.40%)
FF Clk Pin Pwr       9.10       9.20 (+1.10%)         9.20 (+1.10%)
Reg. Pwr            90.00      94.90 (+5.44%)        94.80 (+5.33%)
4.3 IR-drop Aware Partitioning for Monolithic 3D ICs
The previous two sections have presented techniques to design gate-level monolithic 3D ICs
with either academic or commercial 2D engines. Partitioning techniques such as min-cut
and min-overflow were also presented. Although sign-off quality designs can be obtained,
real design issues such as power delivery and IR-drop were not considered. In
three-dimensional integration, power delivery to the tier farther away from the package is a
problem [38]. This is especially true in monolithic 3D as the vias are very small and hence more
resistive than TSVs. The power thus has to traverse the tier closer to the package first, and
then pass through a highly resistive stack before it can reach the farther tier. This leads
to significant IR-drop in the farther tier. One solution to this problem is moving
power-hungry cells close to the package. However, in a conventional package, this causes thermal
issues, as the majority of the heat is conducted away through the heatsink, which is close to
the tier farther away from the package. In fact, several thermal optimization works exist that try
to solve the temperature issue by moving power-hungry cells and modules closer to the
heatsink [14]. However, this usually worsens the IR-drop problem, which most works do
not consider. Only a handful of works co-optimize thermal and IR-drop in 3D ICs [38].
The approach usually taken to improve IR-drop is to strengthen the power delivery network
(PDN). This has other consequences such as increasing the signal wirelength, the total power
of the chip, and so on. This section presents a partitioning technique that can reduce IR-drop,
while also reducing the PDN resource demand.
4.3.1 Motivation and Objectives
In a conventional package, moving power-hungry cells closer to the package usually alleviates
the IR-drop problem, but increases the temperature. However, in a mobile package,
heat is conducted away from both sides of the chip in equal proportions [1]. Using the
simple resistive equivalent circuit of Figure 67, it is demonstrated that the temperature increase
is much less of a problem in a mobile package. Note that the resistance values are
for illustrative purposes only. The absolute thermal resistance in the mobile package has
also been increased to represent the fact that each side conducts heat more poorly than a full heat
sink. Two partitioning cases are considered – one where the tiers are equally balanced in















































[Figure 67 panels: (a) Non-IR-drop-aware partitioning (Tier 0 = 50% power); (b) IR-drop-aware partitioning (Tier 0 = 70% power). Columns: IR-Drop; Thermal (Regular); Thermal (Mobile). Annotations: Better IR-drop; Comparable thermal]
Figure 67: Resistive equivalent circuits for IR-drop and thermal in a conventional and
mobile package. Moving high-power cells to the tier close to the package helps alleviate
IR-drop. In a mobile package, the temperature increase is much smaller than in a conventional
package. Resistance is in mΩ, and thermal resistance in °C/W.
It is observed that the IR-drop in the non-optimized partition is quite severe in the
farther tier, and that the optimized partition can help reduce the IR-drop by 25%. Next, for
the conventional package, moving power close to the package and away from the heat sink
leads to a temperature increase of 4°C. In a mobile package, however, heat is conducted
away from both the top and bottom of the chip in roughly equal proportions (details are
given in Section 4.3.2.3). In such a scenario, the temperature increases only by 1.7°C,
while still maintaining the same IR-drop benefit.
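The qualitative IR-drop trend can be reproduced with a lumped two-resistor model. The sketch below is not the equivalent circuit of the figure: the topology is a simplifying assumption, and the resistance and current values are hypothetical, chosen only to show the direction of the effect.

```python
def ir_drop(total_current_a, tier0_fraction, r_tier0_ohm, r_stack_ohm):
    """Voltage drops (V) seen at tier 0 and tier 1 in a lumped model.

    Assumed topology: all current crosses the tier-0 PDN resistance;
    only the tier-1 current also crosses the inter-tier via stack.
    """
    i0 = total_current_a * tier0_fraction
    i1 = total_current_a - i0
    drop_t0 = r_tier0_ohm * (i0 + i1)
    drop_t1 = drop_t0 + r_stack_ohm * i1
    return drop_t0, drop_t1

# Hypothetical values: 1 A total, 20 mOhm tier-0 PDN, 40 mOhm via stack.
balanced = ir_drop(1.0, 0.5, 0.020, 0.040)  # tier 0 holds 50% of the power
aware = ir_drop(1.0, 0.7, 0.020, 0.040)     # tier 0 holds 70% of the power

print(f"tier-1 drop, balanced : {balanced[1] * 1e3:.1f} mV")
print(f"tier-1 drop, IR-aware : {aware[1] * 1e3:.1f} mV")
```

Shifting power toward tier 0 lowers the current through the resistive stack, so the worst-case drop in the farther tier falls even though the total current is unchanged.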
In addition, it has been demonstrated [57] that increasing the PDN in the farther tier
(tier 1) has a significant impact on solution quality. This is because the PDN on the top
metal interferes with MIV insertion, which leads to sub-optimal MIV locations, and this
increases the wirelength and degrades solution quality. Therefore, IR-drop-aware partitioning
will also help reduce the PDN burden on tier 1, thereby improving design quality. Thus,
the objective of this section is to obtain a gate-level partition such that the tier closer to the
package has more power than the tier farther away from the package, without degrading
solution quality.
4.3.2 Design and Analysis Flow
An overview of the proposed design flow is shown in Figure 68. “Shrunk 2D” design is
first performed on the netlist as in the previous section. An initial power analysis is performed
on this design to get power numbers for each standard cell. These are kept constant
during the partitioning process. Next, this design is partitioned (described in Subsection
4.3.2.1) such that a given power target is met (e.g., 70% power in tier 0, 30% power in
tier 1). This solution is legalized, and a PDN is designed for each tier (described in Subsection
4.3.2.2). MIV planning is performed, with a similar flow as before. After obtaining









[Figure 68 labels: 3D IR-drop / Thermal Analysis; PDN Optimization; Netlist; MIV Planning]
Figure 68: The design flow used for IR-drop-aware partitioning.
4.3.2.1 IR-drop-aware Tier Partitioning
This subsection describes how placement-aware-partitioning is modified such that the end
result meets a certain power target for each tier. In the original partitioning technique, the
first step is to create a random, area-balanced partition. A heuristic that generates an initial
partition that already satisfies the power targets is proposed in Algorithm 4.
Algorithm 4: Power-aware initial solution generation.
Input : Power targets of each tier, target(t0), target(t1)
Output : An area-balanced solution that meets the targets
1  areaBalance() ;
2  tiermax ← max(power(t0), power(t1)) ;
3  tiermin ← min(power(t0), power(t1)) ;
4  unbalance ← 0 ;
5  while power(tiermax) < target(tiermax) do
6      cmax = max. power cell from tiermin ;
7      cmin = min. power cell from tiermax ;
8      if power(cmin) ≥ power(cmax) then break ;
9      if unbalance == 0 then
10         swap cmax and cmin ;
11         unbalance += area(cmin) − area(cmax) ;
12     else if unbalance > 0 then
13         move cmax to tiermax ;
14         unbalance −= area(cmax) ;
15     else
16         move cmin to tiermin ;
17         unbalance += area(cmin) ;




The first step, areaBalance, creates a random, area-balanced partition as before (line
1). Next, the tiers that have the larger and smaller power targets are identified (lines 2–3). The
next step is to move power from the tier with the smaller power target to the tier with the
larger power target without hurting area balance. The cell with maximum power from the
tier with the smaller power target (cmax), and the cell with minimum power from the tier with
the larger power target (cmin) are identified (lines 6–7). If all cells had equal area, these two
could simply be swapped, and this process repeated until the power target was achieved.
However, since cells have unequal area, the area unbalance is tracked using an unbalance
variable. In essence, when the area is unbalanced, only one of the two chosen cells is moved,
in the direction that restores the area balance (lines 12–17). Cell swaps are terminated if cmin
has at least as much power as cmax, as no further power optimization is possible (line 8).
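A runnable sketch of this heuristic is given below. It is an interpretation rather than the original implementation: the tier roles are chosen from the power targets (as the prose describes), a greedy fill stands in for areaBalance, the unbalance variable is read as the net area gained by the tier with the smaller target, and each single-cell move is made in the direction that restores area balance. The data layout and names are hypothetical.

```python
import random

def power_aware_initial_partition(cells, target_t0, target_t1, seed=0):
    """cells: name -> (area, power). Returns name -> tier (0 or 1)."""
    rng = random.Random(seed)
    names = list(cells)
    rng.shuffle(names)

    # Random, area-balanced start: greedily fill the emptier tier.
    tier_of, area = {}, [0.0, 0.0]
    for name in names:
        t = 0 if area[0] <= area[1] else 1
        tier_of[name] = t
        area[t] += cells[name][0]

    def power(t):
        return sum(cells[n][1] for n, tt in tier_of.items() if tt == t)

    # Tiers with the larger and smaller power target.
    targets = [target_t0, target_t1]
    t_max = 0 if target_t0 >= target_t1 else 1
    t_min = 1 - t_max

    unbalance = 0.0  # net area gained by t_min so far
    while power(t_max) < targets[t_max]:
        in_min = [n for n in names if tier_of[n] == t_min]
        in_max = [n for n in names if tier_of[n] == t_max]
        if not in_min or not in_max:
            break
        c_max = max(in_min, key=lambda n: cells[n][1])  # hungriest cell in t_min
        c_min = min(in_max, key=lambda n: cells[n][1])  # leanest cell in t_max
        if cells[c_min][1] >= cells[c_max][1]:
            break  # no move can raise the power of t_max any further
        if unbalance == 0:
            tier_of[c_max], tier_of[c_min] = t_max, t_min  # plain swap
            unbalance += cells[c_min][0] - cells[c_max][0]
        elif unbalance > 0:
            tier_of[c_max] = t_max  # t_min too full: move one cell out of it
            unbalance -= cells[c_max][0]
        else:
            tier_of[c_min] = t_min  # t_max too full: move one cell out of it
            unbalance += cells[c_min][0]
    return tier_of

cells = {f"c{i}": (1, float(i)) for i in range(1, 9)}  # equal areas, powers 1..8
part = power_aware_initial_partition(cells, 25.2, 10.8, seed=1)
print(sorted(n for n, t in part.items() if t == 0))
```

Every iteration removes at least one pair in which a tier-min cell is hungrier than a tier-max cell, so the loop terminates even when the target cannot be reached.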
With this initial solution, the objective is to perform a min-cut as before, without harming
the target power distributions. In addition to the area balance condition of the min-cut,
a power unbalance condition is defined. If moving a cell from one tier to another makes the
power distribution deviate from the target distribution by more than a couple of percent,
then that move is illegal. Essentially, a global min-cut subject to both area balance and
power distribution targets is performed.
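The legality test for a single min-cut move can be sketched as follows. The tolerances are assumptions: the text only says "a couple of percent" for the power deviation, and the area tolerance shown here is a placeholder; the function signature is hypothetical.

```python
def is_legal_move(cell_area, cell_power, src, dst, tier_area, tier_power,
                  target_power_frac, area_tol=0.05, power_tol=0.02):
    """Check one cell move against area-balance and power-distribution limits.

    tier_area, tier_power: two-element lists of current per-tier totals.
    target_power_frac:     desired power fraction per tier, e.g. [0.7, 0.3].
    area_tol, power_tol:   hypothetical tolerances, as fractions of the total.
    """
    new_area = list(tier_area)
    new_power = list(tier_power)
    new_area[src] -= cell_area
    new_area[dst] += cell_area
    new_power[src] -= cell_power
    new_power[dst] += cell_power

    total_area = sum(new_area)
    total_power = sum(new_power)
    area_ok = abs(new_area[0] - new_area[1]) <= area_tol * total_area
    power_ok = all(
        abs(new_power[t] / total_power - target_power_frac[t]) <= power_tol
        for t in (0, 1)
    )
    return area_ok and power_ok

small = is_legal_move(2.0, 1.0, 0, 1, [100.0, 100.0], [70.0, 30.0], [0.7, 0.3])
large = is_legal_move(2.0, 5.0, 0, 1, [100.0, 100.0], [70.0, 30.0], [0.7, 0.3])
print(small, large)
```

A min-cut pass would call such a check before accepting each candidate move, rejecting moves that reduce the cut but push the power split away from the target.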
4.3.2.2 PDN Design and Analysis
An overview of the PDN structure used is shown in Figure 69(a). First, the power is fed
from the C4 bumps to a power mesh on the tier closer to the package (tier 0). This power
mesh consists of thick stripes on the top metal layer, and thinner stripes on an intermediate
metal layer. These thinner stripes also have a finer pitch than the top metal layer
(Figure 69(b)). This is representative of PDN design for mobile chips [1]. This mesh then
connects to local cell rails that feed power to standard cells.
The PDN structure of the tier farther away from the package (tier 1) is quite similar to




















[Figure 69 labels: Horizontal Wires = Top Metal (M6); Vertical Wires = Intermediate Metal (M3); MIV Array]
Figure 69: (a) A PDN structure in monolithic 3D. Red wires represent VDD and blue
wires represent VSS. (b) The power mesh showing the top and intermediate metal layers.
(c) Zoom-in shot of PDN MIV arrays showing only the intermediate mesh layer and local
cell rails.
connect the C4 bumps to the PDN mesh on tier 1. While adding these MIV arrays, care
must be taken to not short VDD arrays with the thin VSS cell rails. This is achieved by
providing a break in the array, as shown in Figures 69(a) and (c).
In order to perform 3D IR-drop analysis, an interconnect technology file that contains
all the metal layers and their associated resistivity is created. This is then fed to Cadence
Techgen to generate an extraction techfile that can be used for IR-drop analysis. Once
the design is completed, two flavors of standard cells are defined, with rails on different
metal layers (similar to MIV planning). This is fed along with the power numbers and the
extraction techfile to Cadence VoltageStorm to get 3D IR-drop numbers.
4.3.2.3 Thermal Analysis
The structure of a mobile package is shown in Figure 70 [1]. The thickness and thermal
properties of the various materials used are tabulated in Table 31. The structure of the chip

















Figure 70: A structure of a mobile package in 3D VLSI [1].




Material          Thickness   Vertical conductivity   Lateral conductivity
PCB               1200        4.5                     60
Tier Active       0.1         141                     141
Inter-tier ILD    0.1         1.38                    1.38
Handle Bulk       75          141                     141
TIM               650         5                       5
EMI Shield        250         120                     120
Graphite Sheet    25          4.5                     500
It is observed that the embedded graphite on the top of the chip, as well as the PCB at the
bottom, have much higher thermal conductivities in the lateral direction than in the vertical
direction. This is because graphite is composed of layers of graphene sheets, each of which
is highly conductive, and there is very little inter-sheet thermal conduction. Similarly, the
majority of the heat conduction in a PCB is through the lateral conduction of the metal
planes present in it. There is limited inter-plane heat conduction. Therefore, both act as
heat spreaders. Although the PCB has a lower conductivity than graphite, it is thicker and
also closer to the chip. Therefore, heat is conducted away in roughly equal proportions
from both sides of the chip.
In order to perform thermal analysis, each layer of the 3D structure is meshed into
tiles of size 20 µm × 20 µm. The thermal resistance of each tile is computed based on the
material within it, and set up as a thermal resistor. In addition, if a tile is in one of the
active layers, then the power in that tile is set up as a current sink in that tile. Boundary
conditions are set up as voltage sources at room temperature (27 °C) on the sides of the PCB
and the graphite layer, as well as the top of the graphite layer and the bottom of the PCB. This
entire resistive structure along with voltage sources and current sinks is fed into HSPICE to
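The per-tile thermal resistors described above can be illustrated with a short sketch. It assumes that the thicknesses in Table 31 are in µm and the conductivities in W/(m·K); these units, the function names, and the choice of layers are assumptions made for illustration, not the dissertation's implementation. Vertical resistance uses t / (k · A), and lateral resistance between adjacent tile centres uses L / (k · t · W).

```python
# Per-tile thermal resistances for a meshed layer. In the electrical analogy,
# temperature maps to voltage, heat flow to current, and each tile to a resistor.
# Assumed units (not stated in the extracted table): um and W/(m*K).

TILE_UM = 20.0  # tile edge length from the text (20 um x 20 um)


def vertical_resistance(thickness_um, k_vertical, tile_um=TILE_UM):
    """Resistance (K/W) for heat crossing the layer top to bottom: t / (k * A)."""
    area_m2 = (tile_um * 1e-6) ** 2
    return (thickness_um * 1e-6) / (k_vertical * area_m2)


def lateral_resistance(thickness_um, k_lateral, tile_um=TILE_UM):
    """Resistance (K/W) between centres of adjacent tiles: L / (k * t * W)."""
    length_m = tile_um * 1e-6
    cross_section_m2 = (thickness_um * 1e-6) * (tile_um * 1e-6)
    return length_m / (k_lateral * cross_section_m2)


# (thickness, vertical k, lateral k) as read from Table 31
layers = {
    "TIM": (650, 5, 5),
    "Graphite Sheet": (25, 4.5, 500),
    "PCB": (1200, 4.5, 60),
}

for name, (t, kv, kl) in layers.items():
    print(f"{name:15s} R_vert = {vertical_resistance(t, kv):12.1f} K/W, "
          f"R_lat = {lateral_resistance(t, kl):8.2f} K/W")
```

Under these assumptions the graphite sheet and the PCB both show a lateral resistance far below their vertical resistance, which is consistent with their role as heat spreaders.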




Two benchmarks are chosen, and their statistics are tabulated in Table 32. The first one is
a crossbar taken from the OpenSPARC T2 multi-processor SoC. It is a full 8 × 8 crossbar
that can connect any of 8 cores to any of 8 cache blocks, and vice versa. The second design
is a jpeg encoder taken from the OpenCores benchmark suite.




Circuit    Clock period   # Gates    WxH (µm × µm)        # VDD C4
           (ns)                      2D        3D         2D    3D
crossbar   1              121,142    600x600   400x400    16    9
jpeg       1.5            255,842    650x650   450x450    16    9
This table shows the clock period at which each design is closed. It also shows the number
of gates for 2D and 3D implementations of each design. The gate counts are different
as 3D requires fewer buffers for optimization and timing closure. Since no optimization is
performed after partitioning, the gate count remains the same for all 3D implementations.
Note that both benchmarks have similar footprints, although the gate counts are very different.
This is because jpeg contains a lot of small gates and is more locally connected,
whereas the crossbar contains fewer but larger gates, and is an interconnect-dominated design.
A C4 bump pitch of 100 µm is assumed, which corresponds to a pitch of 200 µm for
each of VDD and VSS.
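The VDD bump counts in Table 32 can be reproduced with a small sketch. It assumes that VDD bumps sit on a regular grid at twice the C4 pitch, starting at one edge of the die and including a bump on the opposite edge when the pitch allows it. This placement rule and the function name are assumptions chosen because they match the tabulated counts; the dissertation does not state the rule.

```python
import math


def vdd_bump_count(width_um, height_um, c4_pitch_um=100.0):
    """Estimate the number of VDD C4 bumps on a rectangular footprint.

    Assumption: VDD and VSS bumps alternate, so bumps of the same net are
    spaced at twice the C4 pitch, with the first bump placed at the die edge.
    """
    net_pitch_um = 2.0 * c4_pitch_um
    cols = math.floor(width_um / net_pitch_um) + 1
    rows = math.floor(height_um / net_pitch_um) + 1
    return cols * rows


# Footprints from Table 32: (2D side, 3D side) in um
for name, (side_2d, side_3d) in {"crossbar": (600, 400), "jpeg": (650, 450)}.items():
    print(name, vdd_bump_count(side_2d, side_2d), vdd_bump_count(side_3d, side_3d))
```

The smaller 3D footprint therefore leaves room for fewer bumps of each net, which is one of the two reasons given later for the higher 3D IR-drop.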
While designing the power delivery network, the widths of M6 and M3 wires are assumed
to be 4 µm and 1 µm, respectively. Only the pitch of these wires is changed to
strengthen or weaken the PDN. In addition, in all experiments, the PDN utilization of M3
tracks is assumed to be roughly half the PDN utilization of M6 tracks. This is because M3
is an intermediate metal layer, and is also needed for signal routing. The diameter of each
MIV is assumed to be 100 nm, with a resistance of 2 Ω and a capacitance of 0.1 fF [33].
As depicted in Figure 69(c), each C4 bump has two sets of MIV arrays that carry power
to tier 1. Each MIV array has 56 MIVs arranged in an 8 × 7 array. A foundry 28 nm SOI
library with a supply voltage of 0.9 V is used for design and analysis. The IR-drop
target is set to 5% for each of VDD and VSS so that the IR-drop and ground bounce together
are within 10%. This corresponds to an IR-drop target of 45 mV.
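The numbers in this setup can be checked with a few lines of arithmetic. The sketch below computes the 45 mV target from the supply voltage and the 5% budget, and the resistance of one 8 × 7 MIV array. It assumes that the MIVs in an array share current equally as ideal parallel resistors; the 10 mA current is a hypothetical value chosen only to illustrate the scale.

```python
SUPPLY_V = 0.9                 # supply voltage from the text
DROP_FRACTION_PER_RAIL = 0.05  # 5% budget for each of VDD and VSS
MIV_RESISTANCE_OHM = 2.0       # per-MIV resistance from the text
MIV_ROWS, MIV_COLS = 8, 7      # 56 MIVs per array


def ir_drop_target_mv(supply_v, fraction):
    """IR-drop budget for one rail, in millivolts."""
    return supply_v * fraction * 1000.0


def parallel_array_resistance(unit_ohm, count):
    """Resistance of `count` identical resistors in parallel (idealized)."""
    return unit_ohm / count


def array_drop_mv(current_ma, unit_ohm, count):
    """Voltage across the array for a given current (mA * ohm = mV)."""
    return current_ma * parallel_array_resistance(unit_ohm, count)


count = MIV_ROWS * MIV_COLS
print(f"IR-drop target: {ir_drop_target_mv(SUPPLY_V, DROP_FRACTION_PER_RAIL):.1f} mV")
print(f"Array resistance: {parallel_array_resistance(MIV_RESISTANCE_OHM, count) * 1000:.1f} mOhm")
# Hypothetical 10 mA through one array, for illustration only
print(f"Drop at 10 mA: {array_drop_mv(10.0, MIV_RESISTANCE_OHM, count):.3f} mV")
```

Under the equal-sharing assumption, one array contributes only a few tens of milliohms, small relative to the 45 mV budget at currents of this order.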
4.3.3.2 Baseline Designs
The PDN utilization for a 2D IC is chosen by determining the minimum percentage of metal
layers that is required to meet the IR-drop target. Next, 3D ICs are designed assuming the
same PDN utilization as 2D to obtain baseline designs. Their statistics are tabulated in
Table 33. Note that a smaller reduction in wirelength (WL) in the crossbar leads to a larger
total power reduction compared to jpeg. This is because it is interconnect dominated. It
is also observed that jpeg has a higher power consumption, and therefore requires more
PDN resources. As expected, the 3D design does not meet the IR-drop targets with the
same PDN utilization as 2D. This is because of both fewer C4 bumps and the fact that
tier 1 suffers from higher IR-drop. Finally, because a mobile package has heat conduction
on both sides, the temperature increase from 2D to 3D is only around 10 °C, even
though the power density doubles in 3D. Reducing the 3D IR-drop to acceptable levels is
now explored.
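The statement that the power density roughly doubles in 3D can be checked from the footprints in Table 32 and the total power values in Table 33. The sketch below is only this arithmetic; the function name is illustrative.

```python
def power_density_mw_per_mm2(power_mw, side_um):
    """Total power divided by the footprint of a square die."""
    area_mm2 = (side_um / 1000.0) ** 2
    return power_mw / area_mm2


# (2D power mW, 2D side um, 3D power mW, 3D side um) from Tables 32 and 33
designs = {
    "crossbar": (137.5, 600, 125.8, 400),
    "jpeg": (222.9, 650, 213.6, 450),
}

for name, (p2d, s2d, p3d, s3d) in designs.items():
    d2d = power_density_mw_per_mm2(p2d, s2d)
    d3d = power_density_mw_per_mm2(p3d, s3d)
    print(f"{name}: 2D {d2d:.0f} mW/mm^2, 3D {d3d:.0f} mW/mm^2, ratio {d3d / d2d:.2f}")
```

Both benchmarks give a ratio close to 2, since the footprint shrinks to a little under half while the total power drops by only a few percent.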
Table 33: Design statistics of baseline 2D and 3D designs.
Circuit    WL (m)         Power (mW)       M6/M3    IR drop (mV)   Temp. (°C)
           2D     3D      2D      3D       PDN %    2D     3D      2D      3D
crossbar   3.68   3.12    137.5   125.8    15/8     45     79      64.2    71.6
jpeg       3.26   2.53    222.9   213.6    30/15    39     73      80.25   92.25
4.3.3.3 PDN Sensitivity Analysis
As discussed in Section 4.3.2, the objective is to partition the design such that tier 0 has
more power than tier 1. This will lead to reduced PDN demand, improving solution quality.
Now suppose x% of power is moved from tier 1 to tier 0, and y% of PDN resources in tier 1
are freed up. The additional x% of power in tier 0 should require less than y% additional
PDN in tier 0 in order to get a net benefit. In order to validate this assumption, the power
consumed in each tier is scaled, and the resulting change in that tier’s IR-drop is plotted in
Figure 71.
















































Figure 71: Sensitivity of tier IR-drop to change in tier power for (a) crossbar, and (b)
jpeg.
From this figure, it is seen that a transfer of 30% power from tier 1 to tier 0 reduces the
tier 1 IR-drop by a much greater margin than the tier 0 IR-drop is increased. For example,
removing 30% power from tier 1 in the crossbar benchmark reduces the tier 1 IR-drop by
30 mV. This power is added to tier 0, but the graph shows that this increases the tier 0
IR-drop by only 15 mV. This makes it much easier to fix any remaining IR-drop violations.
In addition, the reduced PDN demand will reduce chip power and improve design quality.
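The example above can be turned into a short worked calculation. The sketch assumes the crossbar baseline tier drops of 50 mV (tier 0) and 79 mV (tier 1) reported for the baseline design, and simply superimposes the 15 mV rise and 30 mV fall quoted in the text. It is an idealized estimate, not a result from the dissertation; the partitioned designs reported later differ because the actual partitions differ.

```python
def worst_case_after_shift(tier0_mv, tier1_mv, tier0_rise_mv, tier1_fall_mv):
    """Apply the quoted sensitivity numbers and return the new tier drops
    together with the new chip-level maximum IR-drop."""
    new_tier0 = tier0_mv + tier0_rise_mv
    new_tier1 = tier1_mv - tier1_fall_mv
    return new_tier0, new_tier1, max(new_tier0, new_tier1)


baseline_max = max(50, 79)
t0, t1, shifted_max = worst_case_after_shift(50, 79, tier0_rise_mv=15, tier1_fall_mv=30)
print(f"tier 0: {t0} mV, tier 1: {t1} mV, max: {shifted_max} mV (baseline max {baseline_max} mV)")
```

Because tier 1 gains more than tier 0 loses, the maximum drop on the chip falls even though one tier gets worse.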
4.3.3.4 IR-drop-aware Partitioning Results
This section maintains the same PDN density as the baseline designs, applies the IR-drop-aware
partitioning technique, and demonstrates that under the same PDN, a significant reduction
in IR-drop can be achieved. Different target power distributions are given to the
partitioner, starting with 30% power on tier 0 (30/70) and increasing in increments of 10%
up to 70% power on tier 0 (70/30). The resulting statistics of each design are tabulated
in Table 34. From this table, the 70/30 and 30/70 targets do not give the required
distributions exactly. Therefore, it is concluded that roughly 65% power on one tier is the largest
power imbalance achievable in these designs. This is reasonable given that the tiers need to
be area balanced. It is unlikely that half the cells (w.r.t. area) will consume more than 70%
of the power.
Next, it is observed that providing a power target impacts the cutsize of the partitioner.
This is because an additional power constraint is added on top of the existing area balance
constraints. The MIV planner inserts more than one MIV per 3D net when appropriate, so
the MIV count is larger than the cutsize. The cutsize increase is also reflected in the MIV count.
In general, since MIVs are small, more of them can be tolerated. This is observed in the
fact that, except for a few outliers, the WL increase is quite small. This leads to only a
minor increase in the total power of the design.
However, the impact on the IR-drop is dramatic. Up to a 24.66% reduction in the
maximum IR-drop of the chip can be achieved, with a thermal impact of < 1 °C. The
30/70 and 40/60 partitions are also tabulated as they are the conventional “thermal-aware”
partitions, where power is moved towards the heat sink. Although the temperature reduces
in these partitions, the IR-drop increases significantly. The IR-drop benefit is also plotted
in Figure 72, which clearly shows the IR-drop reduction obtained by clever partitioning.
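The percentage changes quoted for the IR-drop can be reproduced from the per-tier values in Table 34 by comparing the chip-level maximum (the larger of the two tier drops) against the baseline maximum. The sketch below shows this arithmetic; the function name is illustrative.

```python
def max_drop_change_percent(baseline_mv, candidate_mv):
    """Percentage change in the chip-level maximum IR-drop.

    Each argument is a (tier 0, tier 1) pair of IR-drop values in mV.
    """
    base = max(baseline_mv)
    cand = max(candidate_mv)
    return 100.0 * (cand - base) / base


# (tier 0, tier 1) IR-drop values in mV taken from Table 34
print(f"jpeg 60/40:     {max_drop_change_percent((41, 73), (53, 55)):+.2f}%")
print(f"crossbar 60/40: {max_drop_change_percent((50, 79), (68, 67)):+.2f}%")
print(f"crossbar 30/70: {max_drop_change_percent((50, 79), (40, 105)):+.2f}%")
```

The jpeg 60/40 partition gives the 24.66% reduction cited above, while the thermal-aware 30/70 crossbar partition increases the maximum drop by about a third.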
Table 34: The impact of IR-drop-aware partitioning. The PDN utilization is kept the same as the baseline designs.
Power (T0/T1 %)           Cutsize            #MIV               WL (m)           Total power (mW)   IR drop (mV)         Temp. (°C)
Target    Actual                                                                                    Tier 0 / Tier 1      Tier 0 / Tier 1
crossbar
Baseline  47.1 / 52.9     17,868 -           30,772 -           3.124 -          125.8 -            50 / 79 -            71.65 / 70.84 -
30 / 70   33.5 / 66.5     23,419 (+31.1%)    34,764 (+12.9%)    3.60 (+15.3%)    128.1 (+1.83%)     40 / 105 (+32.9%)    71.59 / 71.61 (-0.06%)
40 / 60   40.2 / 59.8     18,242 (+2.1%)     31,552 (+2.5%)     3.14 (+0.54%)    125.9 (+0.08%)     40 / 87 (+10.1%)     71.16 / 70.81 (-0.68%)
50 / 50   50.9 / 49.1     17,968 (+0.6%)     30,836 (+0.2%)     3.13 (+0.32%)    125.9 (+0.08%)     58 / 82 (+3.80%)     71.62 / 70.76 (-0.04%)
60 / 40   59.3 / 40.7     15,840 (-11.4%)    26,993 (-12.3%)    3.16 (+1.03%)    126.1 (+0.24%)     68 / 67 (-13.9%)     72.32 / 70.81 (+0.94%)
70 / 30   65.8 / 34.2     21,282 (+19.1%)    30,313 (-1.5%)     3.12 (-0.11%)    125.9 (+0.08%)     75 / 56 (-5.06%)     72.5 / 70.69 (+1.19%)
jpeg
Baseline  44.6 / 55.4     34,834 -           41,122 -           2.53 -           213.6 -            41 / 73 -            92.25 / 91.63 -
30 / 70   35.2 / 64.8     51,772 (+48.6%)    56,982 (+38.6%)    2.58 (+1.89%)    214.1 (+0.23%)     29 / 85 (+16.4%)     91.83 / 91.87 (-0.41%)
40 / 60   39.9 / 60.1     41,528 (+19.2%)    47,666 (+15.9%)    2.56 (+1.17%)    213.8 (+0.09%)     37 / 79 (+8.22%)     91.93 / 91.71 (+0.07%)
50 / 50   49.9 / 50.1     34,527 (-0.9%)     40,452 (-1.6%)     2.53 (+0.21%)    213.7 (+0.05%)     46 / 66 (-9.59%)     92.07 / 91.58 (+0.15%)
60 / 40   58.1 / 41.9     35,540 (+2.0%)     40,695 (-1.1%)     2.53 (+0.11%)    213.6 (+0.00%)     53 / 55 (-24.6%)     92.53 / 91.48 (+0.50%)
70 / 30   64.8 / 35.2     58,859 (+68.9%)    62,798 (+52.7%)    2.58 (+2.05%)    214.4 (+0.37%)     57 / 45 (-21.9%)     92.69 / 91.62 (+0.17%)
[Figure 72 panel labels, IR-drop color scale 0 mV to 79 mV:
(a) Tier 0: power = 47.16%, IR drop = 50 mV; Tier 1: power = 52.84%, IR drop = 79 mV.
(b) Tier 0: power = 59.49%, IR drop = 68 mV; Tier 1: power = 40.51%, IR drop = 67 mV.]
Figure 72: IR-drop maps for the crossbar benchmark. (a) Baseline, (b) our IR-drop-aware
partition, where tier 0 has 60% of the chip power.
4.3.3.5 PDN Resource Optimization
The previous section demonstrated that under the same PDN utilization, significant IR-drop
reduction can be achieved. However, the IR-drop numbers for many designs were
significantly over the budget, and need to be fixed. This section explores optimizing the
PDN of each tier such that the IR-drop target (45 mV) is met. To do this, the results of the
previous section are taken, and the PDN resources required to meet the IR-drop target are
estimated. The 3D IC is redesigned with this estimate, and if it still does not meet the target,
the estimate is revised. This is repeated until the target is met. For the sake of simplicity,
the ratio between the utilization of M6 and M3 is kept the same. In addition, the maximum
utilization of M6 is set to 75%. If a design still does not meet the IR-drop target with 75%
M6 utilization, IR-drop is not optimized further. The results of these optimizations are
tabulated in Table 35.
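The estimate-redesign-revise loop described above can be sketched as follows. In this sketch the `evaluate` callable stands in for a full redesign followed by IR-drop analysis, and `toy_ir_drop` is a hypothetical model in which the drop is inversely proportional to the M6 utilization. The 5% safety margin, the scaling rule, and all names are assumptions for illustration; the dissertation does not describe how each estimate is computed.

```python
M6_CAP_PERCENT = 75.0   # maximum M6 utilization from the text
TARGET_MV = 45.0        # IR-drop target from the text


def toy_ir_drop(m6_util_percent, k=1200.0):
    """Hypothetical stand-in for redesign + analysis: drop ~ k / utilization."""
    return k / m6_util_percent


def size_pdn(evaluate, m6_start, m3_over_m6, target_mv=TARGET_MV,
             cap=M6_CAP_PERCENT, max_iters=20):
    """Grow the M6 utilization until the target is met or the cap is reached.

    Returns (m6 %, m3 %, IR-drop mV, target_met). The M3 utilization is
    always derived from M6 so that their ratio stays fixed.
    """
    m6 = m6_start
    drop = evaluate(m6)
    for _ in range(max_iters):
        if drop <= target_mv:
            return m6, m6 * m3_over_m6, drop, True
        if m6 >= cap:
            break  # cap reached: IR-drop is not optimized further
        # Revise the estimate in proportion to the overshoot, with a 5% margin.
        m6 = min(cap, m6 * (drop / target_mv) * 1.05)
        drop = evaluate(m6)
    return m6, m6 * m3_over_m6, drop, drop <= target_mv


print(size_pdn(toy_ir_drop, m6_start=15.0, m3_over_m6=0.5))
print(size_pdn(lambda u: toy_ir_drop(u, k=4000.0), m6_start=15.0, m3_over_m6=0.5))
```

The second call illustrates the case where the cap is hit before the target is met, mirroring the rows in which the drop remains above 45 mV at 75% M6 utilization.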
Table 35: The impact of PDN optimization such that the IR-drop falls within the 45 mV target.
Pow. Dist.   PDN M6/M3 %                      #MIV               WL (m)           Power (mW)        IR drop T0/T1   Temp. (°C)
(T0/T1)      Tier 0     Tier 1     Change                                         Total             (mV)            Tier 0 / Tier 1
crossbar
Baseline     24 / 12    68 / 36    -          27,265 -           3.25 -           128.1 -           37 / 41         72.34 / 71.68 -
30 / 70      15 / 8     75 / 40    0.00%      31,540 (+15.68%)   3.71 (+13.95%)   131 (+2.26%)      36 / 52         72.66 / 72.67 (+0.46%)
40 / 60      15 / 8     75 / 40    0.00%      26,594 (-2.46%)    3.31 (+1.80%)    129.2 (+0.86%)    40 / 45         72.32 / 72.02 (-0.03%)
50 / 50      15 / 8     60 / 32    -8.33%     27,675 (+1.50%)    3.23 (-0.63%)    127.6 (-0.39%)    44 / 44         72.28 / 71.45 (-0.08%)
60 / 40      30 / 16    45 / 24    -16.67%    25,526 (-6.38%)    3.22 (-1.11%)    127.1 (-0.78%)    41 / 40         72.64 / 71.14 (+0.41%)
70 / 30      38 / 20    30 / 16    -25.00%    29,828 (+9.40%)    3.19 (-1.85%)    126.9 (-0.94%)    40 / 39         72.91 / 71.02 (+0.79%)
jpeg
Baseline     30 / 15    75 / 38    -          38,264 -           2.68 -           215.5 -           38 / 48         92.71 / 92.24 -
30 / 70      22 / 11    75 / 38    -7.14%     54,772 (+43.14%)   2.81 (+4.70%)    217.2 (+0.79%)    37 / 49         92.76 / 92.79 (+0.09%)
40 / 60      22 / 11    75 / 38    -7.14%     45,278 (+18.33%)   2.72 (+1.34%)    216 (+0.23%)      46 / 52         92.71 / 92.43 (+0.00%)
50 / 50      38 / 19    75 / 38    7.14%      37,829 (-1.14%)    2.69 (+0.25%)    215.7 (+0.09%)    39 / 43         92.82 / 92.23 (+0.12%)
60 / 40      45 / 18    45 / 18    -14.29%    39,979 (+4.48%)    2.58 (-4.03%)    214.4 (-0.51%)    41 / 46         92.6 / 91.63 (-0.12%)
70 / 30      45 / 18    30 / 15    -28.57%    62,809 (+64.15%)   2.58 (-3.74%)    214.7 (-0.37%)    45 / 45         92.57 / 91.64 (-0.15%)
From this table, it is observed that the PDN utilization can be reduced by up to 28.57%
from the baseline, and still meet IR-drop targets. In some cases, the baseline is not able to
meet the IR-drop target even with PDN optimization, as the initial IR-drop is too severe.
This reduction in the PDN utilization, especially in tier 1, frees up additional resources for
signal routing and MIV insertion. This gives up to a 4% reduction in the total WL of the
design. This helps reduce the chip power, which limits the temperature increase.
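One reading of the "Change" column in Table 35 is the relative change in the summed M6 utilization of both tiers. The sketch below applies this reading to the jpeg rows, where it reproduces the 60/40 and 70/30 entries. This interpretation is an assumption inferred from the tabulated numbers rather than a definition given in the text, and not every row reproduces exactly from the rounded utilizations listed.

```python
def pdn_change_percent(baseline_m6, candidate_m6):
    """Relative change in total M6 PDN utilization across both tiers.

    Each argument is a (tier 0 %, tier 1 %) pair of M6 utilizations.
    """
    base = sum(baseline_m6)
    cand = sum(candidate_m6)
    return 100.0 * (cand - base) / base


jpeg_baseline = (30, 75)  # M6 utilization of tier 0 and tier 1, from Table 35
print(f"jpeg 60/40: {pdn_change_percent(jpeg_baseline, (45, 45)):+.2f}%")
print(f"jpeg 70/30: {pdn_change_percent(jpeg_baseline, (45, 30)):+.2f}%")
```

Under this reading, the 28.57% figure corresponds to the jpeg 70/30 partition, where the extra M6 needed on tier 0 is much smaller than the M6 freed on tier 1.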
The PDN as well as the IR-drop for both the baseline and the 70/30 implementation of
the crossbar are plotted in Figure 73. It is clearly seen that there is a large reduction in the
PDN utilization, while the same IR-drop is maintained.














[Figure 73 panel labels: IR drop = 37 mV, 41 mV, 40 mV]
Figure 73: The impact of PDN optimization on the crossbar benchmark. IR-drop-aware
partitioning is able to achieve the same IR-drop target as the baseline partition while using
significantly fewer PDN resources.
The temperature maps for various partition solutions of the crossbar are shown in
Figure 74. Even though an additional 30% power is moved to the bottom tier, the power















Figure 74: The impact of changing the target power of the bottom tier on the temperature
of the crossbar benchmark. Even if the bottom tier has 70% of the chip power, the
temperature increase is < 1 °C.
4.4 Summary
This chapter first demonstrated that modified 2D placement coupled with a placement-aware
partitioning step is sufficient to produce high quality monolithic 3D IC placement results.
A router-based MIV insertion algorithm that makes previously unroutable designs routable
was presented. A monolithic 3D demand model was used to build a min-overflow partitioning
heuristic, and it was demonstrated that this helps to reduce the routed wirelength.
Next, a technique to utilize commercial 2D engines instead of academic ones was presented.
This enables gate-level monolithic 3D IC designs to be taken all the way through
place, route, CTS, and timing optimization. Finally, this chapter demonstrated that in mobile
applications, power can be moved to the tier closer to the package to reduce IR-drop,
while not hurting temperature. An IR-drop-aware partitioner was developed that can reduce
the power and IR-drop of a monolithic 3D IC, without increasing the maximum operating
temperature of the chip.
CHAPTER V
CONCLUSIONS AND FUTURE DIRECTIONS
As discussed in this dissertation, testability for TSV-based 3D ICs remains one of the last
challenges facing their adoption. While TSV-based 3D ICs solve some interconnect issues,
they do not fully exploit the flexibility of the third dimension. In addition, it was demonstrated
that monolithic 3D ICs offer significant benefits over both 2D ICs and TSV-based
3D ICs. Although this is a longer term technology, and the fabrication process is not yet
completely mature, physical design techniques are needed to evaluate the benefits of monolithic
3D. In general, before significant resources can be diverted to ramp up monolithic 3D,
studies of its efficacy are necessary. To carry out reasonable and meaningful studies, the
following are crucial: (1) Physical design techniques for different design styles of monolithic
3D ICs, (2) An understanding of how the fabrication process affects the potential
benefits and how to overcome any potential degradation, and (3) An understanding of real
world reliability issues such as thermal, IR-drop, etc. that affect monolithic 3D ICs.
Towards these objectives of overcoming the last hurdles of short term TSV-based 3D ICs
and developing tools and techniques for evaluating longer term monolithic 3D ICs, the following
projects have been presented in this dissertation.
• Design for Test for TSV-based 3D ICs including scan chain construction techniques,
a transition delay fault test architecture, IR-drop studies, and test time estimation
during 3D IC partitioning.
• Physical design for block-level monolithic 3D ICs, where a floorplanning framework
was presented, and extended to consider inter-tier performance differences arising
because of an immature fabrication process.
• Physical design for gate-level monolithic 3D ICs, where placement techniques were
developed for monolithic 3D ICs. This was extended to utilize commercial tools
for placement, timing optimization and CTS. In addition, IR-drop-aware partitioning
was presented.
The DfT research carried out in this dissertation addressed some of the testability concerns
of TSV-based 3D ICs. However, several more hurdles need to be surmounted before
TSV-based 3D IC testing can mature. For example, at-speed testing of TSVs needs to be performed
before bonding, and the test architecture presented in this dissertation does not support this.
In addition, it is as yet unclear under what conditions pre-bond test will be necessary and
cost effective.
The floorplanner presented in this dissertation provides a good framework to design
block-level monolithic 3D ICs. It was also demonstrated that tungsten interconnects are
preferable to degraded transistors. However, it is unclear how this will change at future
nodes, where the interconnect is expected to become more of a bottleneck. Additional
research needs to be carried out to determine the most effective technology stackup at
future nodes.
An efficient gate-level framework was also presented that provides commercial-quality
monolithic 3D IC designs. However, it still relies on tricking 2D tools into designing
3D ICs. There are bound to be inaccuracies introduced due to this abstraction, and
future research needs to look into the development of true 3D tools.
Finally, although physical design was presented for block and gate-level monolithic
3D ICs, today’s industrial SoCs are bound to require a mix of the two. For example, large
blocks can be implemented in 3D using the gate-level framework, and these 3D blocks can
then be assembled together. Additional physical design tools are needed to develop a mixed




[1] “Personal communication with industry partner.”
[2] BATUDE, P., ERNST, T., ARCAMONE, J., ARNDT, G., COUDRAIN, P., and GAILLARDON,
P.-E., “3-D Sequential Integration: A Key Enabling Technology for Heterogeneous
Co-Integration of New Function With CMOS,” IEEE Journal on Emerging
and Selected Topics in Circuits and Systems, vol. 2, pp. 714–722, Dec 2012.
[3] BATUDE, P., VINET, M., POUYDEBASQUE, A., LE ROYER, C., PREVITALI, B.,
TABONE, C., HARTMANN, J.-M., SANCHEZ, L., BAUD, L., CARRON, V., TOFFOLI,
A., ALLAIN, F., MAZZOCCHI, V., LAFOND, D., THOMAS, O., CUETO, O.,
BOUZAIDA, N., FLEURY, D., AMARA, A., DELEONIBUS, S., and FAYNOT, O.,
“Advances in 3D CMOS sequential integration,” in Proc. IEEE Int. Electron Devices
Meeting, pp. 1–4, Dec 2009.
[4] BOBBA, S., CHAKRABORTY, A., THOMAS, O., BATUDE, P., ERNST, T., FAYNOT,
O., PAN, D., and DE MICHELI, G., “CELONCEL: Effective design technique for
3-D monolithic integration targeting high performance integrated circuits,” in Proc.
Asia and South Pacific Design Automation Conf., pp. 336–343, Jan 2011.
[5] BRENNER, U. and ROHE, A., “An effective congestion-driven placement framework,”
IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems,
vol. 22, pp. 387–394, April 2003.
[6] C. CABRAL, J., FLETCHER, B., ROSSNAGEL, S., HU, C.-K., BAKER-ONEAL, B.,
HUANG, Q., DER STRATEN, O. V., NITTA, S., RODBELL, K., and EDELSTEIN,
D., “Metallization Opportunities and Challenges for Future Back-End-of-the-Line
Technology,” in Advanced Metallization Conference, pp. 136–137, October 2010.
[7] CHEN, P.-L., LIN, J.-W., and CHANG, T.-Y., “IEEE Standard 1500 Compatible
Delay Test Framework,” IEEE Trans. on VLSI Systems, vol. 17, pp. 1152–1156, Aug
2009.
[8] CHEN, Q., DAVIS, J., ZARKESH-HA, P., and MEINDL, J., “A compact physical via
blockage model,” IEEE Trans. on VLSI Systems, vol. 8, pp. 689–692, Dec 2000.
[9] CHOI, D., KIM, C. S., NAVEH, D., CHUNG, S., WARREN, A. P., NUHFER, N. T.,
TONEY, M. F., COFFEY, K. R., and BARMAK, K., “Electron mean free path of
tungsten and the electrical resistivity of epitaxial (110) tungsten films,” Phys. Rev. B,
vol. 86, p. 045432, Jul 2012.
[10] CHU, C. and WONG, Y.-C., “FLUTE: Fast Lookup Table Based Rectilinear Steiner
Minimal Tree Algorithm for VLSI Design,” IEEE Trans. on Computer-Aided Design
of Integrated Circuits and Systems, vol. 27, pp. 70–83, Jan 2008.
[11] CONG, J. and LIM, S. K., “Edge separability-based circuit clustering with application
to multilevel circuit partitioning,” IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 23, pp. 346–357, March 2004.
[12] CONG, J. and LUO, G., “A multilevel analytical placement for 3D ICs,” in Proc. Asia
and South Pacific Design Automation Conf., pp. 361–366, Jan 2009.
[13] CONG, J., LUO, G., WEI, J., and ZHANG, Y., “Thermal-Aware 3D IC Placement Via
Transformation,” in Proc. Asia and South Pacific Design Automation Conf., pp. 780–785,
Jan 2007.
[14] CONG, J., WEI, J., and ZHANG, Y., “A thermal-driven floorplanning algorithm for
3D ICs,” in Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 306–313, Nov
2004.
[15] DONG, X., ZHAO, J., and XIE, Y., “Fabrication Cost Analysis and Cost-Aware Design
Space Exploration for 3-D ICs,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, vol. 29, pp. 1959–1972, Dec 2010.
[16] FIDUCCIA, C. and MATTHEYSES, R., “A Linear-Time Heuristic for Improving Network
Partitions,” in Proc. ACM Design Automation Conf., pp. 175–181, June 1982.
[17] GOEL, S. K. and MARINISSEN, E. J., “SOC Test Architecture Design for Efficient
Utilization of Test Bandwidth,” ACM Trans. on Design Automation of Electronics
Systems, vol. 8, pp. 399–429, Oct. 2003.
[18] GOLSHANI, N., DERAKHSHANDEH, J., ISHIHARA, R., BEENAKKER, C. I. M.,
ROBERTSON, M., and MORRISON, J., “Monolithic 3D integration of SRAM and image
sensor using two layers of single grain silicon,” in IEEE International 3D System
Integration Conference, pp. 1–4, Nov 2010.
[19] GUPTA, P., KAHNG, A., SHARMA, P., and SYLVESTER, D., “Gate-length biasing
for runtime-leakage control,” IEEE Trans. on Computer-Aided Design of Integrated
Circuits and Systems, vol. 25, pp. 1475–1485, Aug 2006.
[20] HE, X., DONG, S., MA, Y., and HONG, X., “Simultaneous buffer and interlayer
via planning for 3D floorplanning,” in Proc. Int. Symp. on Quality Electronic Design,
pp. 740–745, March 2009.
[21] HSU, M.-K., CHANG, Y.-W., and BALABANOV, V., “TSV-aware analytical placement
for 3D IC designs,” in Proc. ACM Design Automation Conf., pp. 664–669, June
2011.
[22] JIANG, L., HUANG, L., and XU, Q., “Test architecture design and optimization for
three-dimensional SoCs,” in Proc. Design, Automation and Test in Europe, pp. 220–225,
April 2009.
[23] JIANG, L., XU, Q., CHAKRABARTY, K., and MAK, T., “Layout-driven test-architecture
design and optimization for 3D SoCs under pre-bond test-pin-count constraint,”
in Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 191–196, Nov 2009.
[24] JIANG, Z.-W., SU, B.-Y., and CHANG, Y.-W., “Routability-driven analytical placement
by net overlapping removal for large-scale mixed-size designs,” in Proc. ACM
Design Automation Conf., pp. 167–172, June 2008.
[25] JUNG, S.-M., JANG, J., CHO, W., MOON, J., KWAK, K., CHOI, B., HWANG,
B., LIM, H., JEONG, J., KIM, J., and KIM, K., “The revolutionary and truly 3-dimensional
25F2 SRAM technology with the smallest S3 (stacked single-crystal Si)
cell, 0.16um2, and SSTFT (stacked single-crystal thin film transistor) for ultra high
density SRAM,” in IEEE Int. Symposium on VLSI Technology, pp. 228–229, June
2004.
[26] JUNG, S.-M., LIM, H., KWAK, K., and KIM, K., “A 500-MHz DDR High-Performance
72-Mb 3-D SRAM Fabricated With Laser-Induced Epitaxial c-Si
Growth Technology for a Stand-Alone and Embedded Memory Application,” IEEE
Trans. on Electron Devices, vol. 57, pp. 474–481, Feb 2010.
[27] KARKLIN, K., BROZ, J., and MANN, B., “Bond Pad Damage Tutorial,” in IEEE
Semiconductor Wafer Test Workshop, June 2008.
[28] KIM, D. H., ATHIKULWONGSE, K., and LIM, S. K., “A study of Through-Silicon-Via
impact on the 3D stacked IC layout,” in Proc. IEEE Int. Conf. on Computer-Aided
Design, pp. 674–680, Nov 2009.
[29] KIM, D. H., TOPALOGLU, R., and LIM, S. K., “Block-level 3D IC design with
through-silicon-via planning,” in Proc. Asia and South Pacific Design Automation
Conf., pp. 335–340, Jan 2012.
[30] KIM, M.-C., HU, J., LEE, D.-J., and MARKOV, I., “A SimPLR method for
routability-driven placement,” in Proc. IEEE Int. Conf. on Computer-Aided Design,
pp. 67–73, Nov 2011.
[31] KNECHTEL, J., MARKOV, I., and LIENIG, J., “Assembling 2-D Blocks Into 3-D
Chips,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems,
vol. 31, pp. 228–241, Feb 2012.
[32] LEE, H.-H. and CHAKRABARTY, K., “Test Challenges for 3D Integrated Circuits,”
IEEE Design and Test of Computers, vol. 26, pp. 26–35, Sept 2009.
[33] LEE, Y.-J., LIMBRICK, D., and LIM, S. K., “Power benefit study for ultra-high
density transistor-level monolithic 3D ICs,” in Proc. ACM Design Automation Conf.,
pp. 1–10, May 2013.
[34] LEE, Y.-J., MORROW, P., and LIM, S. K., “Ultra high density logic designs using
transistor-level monolithic 3D integration,” in Proc. IEEE Int. Conf. on Computer-Aided
Design, pp. 539–546, Nov 2012.
[35] LEWIS, D. and LEE, H., “A scan island based design enabling prebond testability in
die-stacked microprocessors,” in Proc. IEEE Int. Test Conference, pp. 1–8, Oct 2007.
[36] LEWIS, D., PANTH, S., ZHAO, X., LIM, S. K., and LEE, H.-H., “Designing 3D test
wrappers for pre-bond and post-bond test of 3D embedded cores,” in Proc. IEEE Int.
Conf. on Computer Design, pp. 90–95, Oct 2011.
[37] LI, C., XIE, M., KOH, C.-K., CONG, J., and MADDEN, P., “Routability-Driven
Placement and White Space Allocation,” IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, vol. 26, pp. 858–871, May 2007.
[38] LI, Z., MA, Y., ZHOU, Q., CAI, Y., WANG, Y., HUANG, T., and XIE, Y., “Thermal-aware
power network design for IR drop reduction in 3D ICs,” in Proc. Asia and
South Pacific Design Automation Conf., pp. 47–52, Jan 2012.
[39] LIU, C. and LIM, S. K., “A Design Tradeoff Study with Monolithic 3D Integration,”
in Proc. Int. Symp. on Quality Electronic Design, pp. 529–536, March 2012.
[40] LIU, C. and LIM, S. K., “Ultra-high density 3D SRAM cell designs for monolithic
3D integration,” in Proc. IEEE Int. Interconnect Technology Conference, pp. 1–3,
June 2012.
[41] LO, C.-Y., WANG, C.-H., CHENG, K.-L., HUANG, J.-R., WANG, C.-W., WANG,
S.-M., and WU, C.-W., “STEAC: A Platform for Automatic SOC Test Integration,”
IEEE Trans. on VLSI Systems, vol. 15, pp. 541–545, May 2007.
[42] LOPEZ, G., The impact of interconnect process variations and size effects for gigascale
integration. PhD thesis, Georgia Institute of Technology, 2009.
[43] MANN, W., TABER, F., SEITZER, P., and BROZ, J., “The leading edge of production
wafer probe test technology,” in Proc. IEEE Int. Test Conference, pp. 1168–1195, Oct
2004.
[44] MARINISSEN, E., CHI, C.-C., VERBREE, J., and KONIJNENBURG, M., “3D DfT
architecture for pre-bond and post-bond testing,” in IEEE International 3D System
Integration Conference, pp. 1–8, Nov 2010.
[45] MARINISSEN, E., IYENGAR, V., and CHAKRABARTY, K., “A set of benchmarks for
modular testing of SOCs,” in Proc. IEEE Int. Test Conference, pp. 519–528, 2002.
[46] MARINISSEN, E., VERBREE, J., and KONIJNENBURG, M., “A structured and scalable
test access architecture for TSV-based 3D stacked ICs,” in IEEE VLSI Test Symposium,
pp. 269–274, April 2010.
[47] MARINISSEN, E. J. and ZORIAN, Y., “Testing 3D chips containing through-silicon
vias,” in Proc. IEEE Int. Test Conference, pp. 1–11, Nov 2009.
[48] NAITO, T., ISHIDA, T., ONODUKA, T., NISHIGOORI, M., NAKAYAMA, T., UENO,
Y., ISHIMOTO, Y., SUZUKI, A., CHUNG, W., MADURAWE, R., WU, S., IKEDA,
S., and OYAMATSU, H., “World’s first monolithic 3D-FPGA with TFT SRAM over
90nm 9 layer Cu CMOS,” in IEEE Int. Symposium on VLSI Technology, pp. 219–220,
June 2010.
[49] NOIA, B., CHAKRABARTY, K., GOEL, S., MARINISSEN, E., and VERBREE, J.,
“Test-Architecture Optimization and Test Scheduling for TSV-Based 3-D Stacked
ICs,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems,
vol. 30, pp. 1705–1718, Nov 2011.
[50] NOIA, B., GOEL, S., CHAKRABARTY, K., MARINISSEN, E., and VERBREE, J.,
“Test-architecture optimization for TSV-based 3D stacked ICs,” in Proc. European
Test Symposium, pp. 24–29, May 2010.
[51] [ONLINE], “OpenCores Benchmark Suite.” http://www.opencores.org/.
[52] [ONLINE], “Oracle OpenSPARC T2.”
[53] [ONLINE], “International Technology Roadmap for Semiconductors 2009.”
http://www.itrs.net/, 2009.
[54] PLOMBON, J. J., ANDIDEH, E., DUBIN, V. M., and MAIZ, J., “Influence of phonon,
geometry, impurity, and grain size on Copper line resistivity,” Applied Physics Letters,
vol. 89, no. 11, pp. –, 2006.
[55] RAJENDRAN, B., SHENOY, R., WITTE, D., CHOKSHI, N., DE LEON, R., TOMPA,
G., and FABIAN, R., “Low Thermal Budget Processing for Sequential 3-D IC Fabrication,”
IEEE Trans. on Electron Devices, vol. 54, pp. 707–714, April 2007.
[56] SAMAL, S., PANTH, S., SAMADI, K., SAEDI, M., DU, Y., and LIM, S. K., “Fast and
accurate thermal modeling and optimization for monolithic 3D ICs,” in Proc. ACM
Design Automation Conf., pp. 1–6, June 2014.
[57] SAMAL, S., SAMADI, K., KAMAL, P., DU, Y., and LIM, S. K., “Full chip impact
study of power delivery network designs in monolithic 3D ICs,” in Proc. IEEE Int.
Conf. on Computer-Aided Design, pp. 565–572, Nov 2014.
[58] SPINDLER, P. and JOHANNES, F., “Fast and Accurate Routing Demand Estimation
for Efficient Routability-driven Placement,” in Proc. Design, Automation and Test in
Europe, pp. 1–6, April 2007.
[59] SPINDLER, P., SCHLICHTMANN, U., and JOHANNES, F., “Kraftwerk2: A Fast
Force-Directed Quadratic Placement Approach Using an Accurate Net Model,” IEEE
Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 27,
pp. 1398–1411, Aug 2008.
[60] STEINHOGL, W., STEINLESBERGER, G., PERRIN, M., SCHEINBACHER, G.,
SCHINDLER, G., TRAVING , M., and ENGELHARDT, M., “Tungsten interconnects in
the nano-scale regime ,”Microelectronic Engineering, vol. 82, pp. 266 – 272, 2005.
[61] TSAI, M.-C., WANG, T.-C., and HWANG, T., “Through-Silicon Via Planning in 3-D
Floorplanning,”IEEE Trans. on VLSI Systems, vol. 19, pp. 1448–1457, Aug 2011.
[62] VUCUREVICH, T., “The Long Road to 3D Integration: Are we there yet?.” Key note
speech at the 3D Architecture Conference, 2007.
[63] WU, X., FALKENSTERN, P., CHAKRABARTY, K., and XIE, Y., “Scan-chain De-
sign and Optimization for Three-dimensional Integrated Circuits,” ACM Journal on
Emerging Technologies in Computing Systems, vol. 5, pp. 9:1–9:26, July 2009.
[64] WU, X., ZHAO, W., NAKAMOTO , M., NIMMAGADDA , C., LISK, D., GU, S.,
RADOJCIC, R., NOWAK , M., and XIE, Y., “Electrical Characterization for Intertier
Connections and Timing Analysis for 3-D ICs,”IEEE Trans. on VLSI Systems, vol. 20,
pp. 186–191, Jan 2012.
[65] XU, C., BATUDE, P., VINET, M., MOUIS, M., CASSE, M., SKLENARD, B.,
COLOMBEAU, B., RAFHAY, Q., TABONE, C., BERTHOZ, J., PREVITALI, B.,
MAZURIER, J., BRUNET, L., BREVARD, L., KHAJA, F., HARTMANN, J., ALLAIN,
F., TOFFOLI, A., KIES, R., LE ROYER, C., MORVAN, S., POUYDEBASQUE, A.,
GARROS, X., PAKFAR, A., TAVERNIER, C., FAYNOT, O., and POIROUX, T., “Im-
provements in low temperature (<625°C) FDSOI devices down to 30nm gate length,”
in IEEE Int. Symposium on VLSI Technology, Systems, and Applications, pp. 1–2,
April 2012.
[66] YANG, K., KIM, D. H., and LIM, S. K., “Design quality tradeoff studies for 3D ICs
built with nano-scale TSVs and devices,” in Proc. Int. Symp. on Quality Electronic
Design, pp. 740–746, March 2012.
[67] ZHAO, X., LEWIS, D., LEE, H.-H., and LIM, S. K., “Pre-bond testable low-power
clock tree design for 3D stacked ICs,” in Proc. IEEE Int. Conf. on Computer-Aided
Design, pp. 184–190, Nov 2009.
PUBLICATIONS
This dissertation is based on and/or related to the works and results presented in the fol-
lowing publications in print:
[1] Shreepad Panth and Sung Kyu Lim, “Scan Chain and Power Delivery Network Syn-
thesis for Pre-Bond Test of 3D ICs”, in IEEE VLSI Test Symposium, pp. 26–31, 2011.
[2] Dean Lewis, Shreepad Panth, Xin Zhao, Sung Kyu Lim, and Hsien-Hsin Lee, “De-
signing 3D Test Wrappers for Pre-bond and Post-bond Test of 3D Embedded Cores”,
in IEEE International Conference on Computer Design, pp. 90–95, 2011.
[3] Shreepad Panth and Sung Kyu Lim, “Transition Delay Fault Testing of 3D ICs with
IR-Drop Study”, in IEEE VLSI Test Symposium, pp. 270–275, 2012.
[4] Young-Joon Lee, Shreepad Panth, and Sung Kyu Lim, “Enabling High Density Logic
Designs for Monolithic 3D ICs”, in SRC TECHCON Conference, 2012.
[5] Brandon Noia, Shreepad Panth, Krishnendu Chakrabarty, and Sung Kyu Lim, “Scan
Test of Die Logic in 3D ICs Using TSV Probing”, in IEEE International Test Confer-
ence, pp. 1–8, 2012.
[6] Sergej Deutsch, Krishnendu Chakrabarty, Shreepad Panth, and Sung Kyu Lim, “TSV
Stress-Aware ATPG for 3D Stacked ICs”, in IEEE Asian Test Symposium, pp. 31–36,
2012.
[7] Sergej Deutsch, Krishnendu Chakrabarty, Shreepad Panth, and Sung Kyu Lim, “TSV
Stress-Aware ATPG for 3D Stacked ICs”, in IEEE International Workshop on Testing
Three-Dimensional Stacked Integrated Circuits, 2012.
[8] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “High-Density Inte-
gration of Functional Modules Using Monolithic 3D-IC Technology”, in IEEE/ACM
Asia South Pacific Design Automation Conference, pp. 681–686, 2013.
[9] Shreepad Panth, Kambiz Samadi, and Sung Kyu Lim, “Test-TSV Estimation During
3D-IC Partitioning”, in IEEE International 3D Systems Integration Conference, pp.
1–7, 2013.
[10] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “Placement-Driven
Partitioning for Congestion Mitigation in Monolithic 3D IC Designs”, in ACM Inter-
national Symposium on Physical Design, pp. 47–54, 2014.
[11] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “Power-Performance
Study of Block-Level Monolithic 3D-ICs Considering Inter-Tier Performance Varia-
tions”, in ACM Design Automation Conference, pp. 1–6, 2014.
[12] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “Design and CAD
Methodologies for Low Power Gate-level Monolithic 3D ICs”, in IEEE International
Symposium on Low Power Electronics and Design, pp. 171–176, 2014.
[13] Shreepad Panth, Sandeep Samal, Yun Seop Yu, and Sung Kyu Lim, “Design Chal-
lenges and Solutions for Ultra-High-Density Monolithic 3D ICs”, in IEEE SOI-3D-
Subthreshold Microelectronics Technology Unified Conference, pp. 1–2, 2014.
[14] Shreepad Panth, Sandeep Samal, Yun Seop Yu, and Sung Kyu Lim, “Design Chal-
lenges and Solutions for Ultra-High-Density Monolithic 3D ICs”, in Journal of Infor-
mation and Communication Convergence Engineering, Vol. 12, No. 3, pp. 186–192,
2014.
[15] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “Tier-Partitioning
for Power Delivery vs Cooling Tradeoff in 3D VLSI for Mobile Applications”, in ACM
Design Automation Conference, 2015, to appear.
[16] Shreepad Panth, Kambiz Samadi, Yang Du, and Sung Kyu Lim, “Placement-Driven
Partitioning for Congestion Mitigation in Monolithic 3D IC Designs”, in IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems, to appear.
[17] Brandon Noia, Shreepad Panth, Krishnendu Chakrabarty, and Sung Kyu Lim, “Scan
Test of Die Logic in 3D ICs Using TSV Probing”, in IEEE Transactions on Very Large
Scale Integration Systems, to appear.
In addition, the author has completed works unrelated to this dissertation presented in
the following publications in print:
[1] Moongon Jung, Shreepad Panth, and Sung Kyu Lim, “A Study of TSV Variation
Impact on Power Supply Noise”, in IEEE International Interconnect Technology Con-
ference, pp. 8–12, 2011.
[2] Dae Hyun Kim, Krit Athikulwongse, Michael B. Healy, Mohammad M. Hossain,
Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean L. Lewis, Tzu-
Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao Shen,
Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H. Loh,
Hsien-Hsin S. Lee, and Sung Kyu Lim, “3D-MAPS: 3D Massively Parallel Proces-
sor with Stacked Memory”, in IEEE International Solid-State Circuits Conference, pp.
188–190, 2012.
[3] Junghee Lee, Chrysostomos Nicopoulos, Hyung Gyu Lee, Shreepad Panth, Sung Kyu
Lim, and Jongman Kim, “IsoNet: Hardware-Based Job Queue Management for Many-
Core Architectures”, in IEEE Transactions on Very Large Scale Integration Systems,
Vol. 21, No. 6, pp. 1080–1093, 2013.
[4] Sandeep Samal, Shreepad Panth, Kambiz Samadi, Mehdi Saeidi, Yang Du, and Sung
Kyu Lim, “Fast and Accurate Thermal Modeling and Optimization for Monolithic 3D
ICs”, in ACM Design Automation Conference, pp. 1–6, 2014.
[5] Ahmet Ceyhan, Moongon Jung, Shreepad Panth, Sung Kyu Lim, and Azad Naeemi,
“Impact of Size Effects in Local Interconnects for Future Technology Nodes: A Study
Based on Full-Chip Layouts”, in IEEE International Interconnect Technology Confer-
ence, pp. 345–348, 2014.
[6] Dae Hyun Kim, Krit Athikulwongse, Michael B. Healy, Mohammad M. Hossain,
Moongon Jung, Ilya Khorosh, Gokul Kumar, Young-Joon Lee, Dean L. Lewis, Tzu-
Wei Lin, Chang Liu, Shreepad Panth, Mohit Pathak, Minzhen Ren, Guanhao Shen,
Taigon Song, Dong Hyuk Woo, Xin Zhao, Joungho Kim, Ho Choi, Gabriel H. Loh,
Hsien-Hsin S. Lee, and Sung Kyu Lim, “Design and Analysis of 3D-MAPS (3D Mas-
sively Parallel Processor with Stacked Memory)”, in IEEE Transactions on Computers,
Vol. 64, No. 1, pp. 112–125, 2015.
[7] Ahmet Ceyhan, Moongon Jung, Shreepad Panth, Sung Kyu Lim, and Azad Naeemi,
“Evaluating Chip-Level Impact of Cu/low-k Performance Degradation on Circuit Per-




Shreepad Panth was born in Pune, India, in 1988. He received his B.E. from Anna Uni-
versity, India, in 2009, in Electrical and Electronics Engineering. He also received an M.S.
from the School of Electrical and Computer Engineering at Georgia Institute of Technology
in 2011, where he is currently a Ph.D. candidate under the supervision of Dr. Sung Kyu
Lim. His research interests lie in physical design methodologies for monolithic 3D ICs. He
is the author of more than 20 publications in top conferences and journals, and has received
the best paper award at ATS’12 and nominations for best paper awards at ISPD’14 and
DAC’14.
