Clock Network Design for 2.5D Heterogeneous Systems by Murali, Gauthaman







of the Requirements for the Degree
Master of Science in the
School of ECE
Georgia Institute of Technology
May 2020
Copyright © Gauthaman Murali 2020
CLOCK NETWORK DESIGN FOR 2.5D HETEROGENEOUS SYSTEMS
Approved by:
Dr. Sung Kyu Lim
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Madhavan Swaminathan
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Dr. Saibal Mukhopadhyay
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Date Approved: Apr 21, 2020
ACKNOWLEDGEMENTS
I express my sincere gratitude to my advisor, Dr. Sung Kyu Lim, for imparting his
knowledge and expertise in this work. His continuous inputs and support played a major
role in performing this research as intended.
I thank the distinguished members of the reading committee, Dr. Madhavan Swaminathan
and Dr. Saibal Mukhopadhyay, for the approval of my work.
I would like to acknowledge the academic and technical support provided by my institution,
Georgia Institute of Technology, and the financial support provided by DARPA to
accomplish this work.
My appreciation also extends to my colleagues Dr. Heechun Park, Eric Qin, Hakki Mert
Torun and Majid Ahadi Dolatsara for willingly helping me in accomplishing this work.
Finally, I am grateful to my parents and friends for their continued encouragement and
support, which helped me in the completion of this thesis.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Chapter 1: Introduction and Background
Chapter 2: Literature Review
Chapter 3: Benchmark Architecture
3.1 64 RISC-V Core Architecture
3.2 Interposer Technology
3.3 Chiplet Protocol
3.4 Architectural differences between 2D and 2.5D Rocket-64
Chapter 4: 2.5D Clock Network Synthesis
4.1 2.5D Clocking Architecture
4.2 Reference Clock Routing
4.2.1 Homogeneous Rocket-64
4.2.2 Heterogeneous Rocket-64
4.3 RLGC Models of Interposer Interconnects
4.4 Functional Clock Generation
4.5 AIB Clock Forwarding Technique
4.5.1 Homogeneous Rocket-64
4.5.2 Heterogeneous Rocket-64
4.6 Functional Clock Routing in Slave Chiplets
4.6.1 Homogeneous Rocket-64
4.6.2 Heterogeneous Rocket-64
Chapter 5: Results
5.1 Benefits of the proposed 2.5D Clocking Architecture over 2D Counterpart
5.1.1 Homogeneous Rocket-64
5.1.2 Heterogeneous Rocket-64
Chapter 6: Discussion
6.1 Hierarchical vs. Flat Approach
Chapter 7: Conclusion
Appendix A: Experimental Setup
Appendix B: Why Rocket-64?
References
LIST OF TABLES
3.1 Material properties and design rule of TSMC 65nm silicon interposer.
3.2 Architectural features: 2D vs homogeneous 2.5D Rocket-64 design.
3.3 Architectural features: 2D vs heterogeneous 2.5D Rocket-64 design.
4.1 Properties of AIB buffer
4.2 Chiplet clock metrics of Homogeneous Rocket-64.
4.3 Chiplet clock metrics of Heterogeneous Rocket-64.
5.1 Clock power comparison: 2D vs 2.5D homogeneous Rocket-64
5.2 Clock power comparison: 2D vs 2.5D heterogeneous Rocket-64
6.1 Hierarchical vs. flat interposer clock routing. The latency here denotes the maximum delay from the clock C4 to clock micro-bump.
LIST OF FIGURES
1.1 Chiplet integration using an interposer-based 2.5D system [2]
2.1 2.5D Clocking Architecture proposed in [8]
3.1 Internal architecture of RocketCore and L2 chiplets [9].
3.2 Floorplan of 2.5D heterogeneous Rocket-64 that consists of 27 chiplets implemented in four different commercial technology nodes and one inductor. TSMC CoWoS 65nm interposer technology is used. The chiplet layouts are shown in Fig. 4.13 and 4.14
3.3 Cross sectional view of interposer.
3.4 Intel AIB [12] clock forwarding architecture.
3.5 Layout of the digitally-synthesized AIB transceiver.
4.1 The hierarchical clocking architecture for homogeneous Rocket-64.
4.2 The hierarchical clocking architecture for heterogeneous Rocket-64.
4.3 PLL architecture.
4.4 PLL layout using TSMC 28nm.
4.5 100 MHz crystal/reference interposer clock tree on interposer of homogeneous Rocket-64
4.6 100 MHz crystal/reference clock tree on interposer of heterogeneous Rocket-64.
4.7 Eye diagram of 100 MHz reference clock on interposer.
4.8 Microstrip model of multi-conductor transmission line.
4.9 1.2 GHz functional clock tree on the interposer of heterogeneous Rocket-64.
4.10 HSPICE clock waveforms of 1 GHz clock through the interposer and AIB transceiver.
4.11 Homogeneous Rocket-64: Full-chip design and clock tree of (a) L2 cache (TSMC 28nm), (b) Rocket-8 (TSMC 28nm) chiplets. Not drawn to scale.
4.12 Homogeneous Rocket-64: Full-chip design and clock tree of (a) NoC (TSMC 28nm), (b) Memory controller (TSMC 28nm) chiplets. Not drawn to scale.
4.13 Heterogeneous Rocket-64: Full-chip design and clock tree of (a) L2 cache (TSMC 28nm), (b) Rocket-8 (TSMC 16nm) chiplets. Not drawn to scale.
4.14 Heterogeneous Rocket-64: Full-chip design and clock tree of (a) NoC (TSMC 16nm), (b) Memory controller (TSMC 40nm) chiplets. Not drawn to scale.
5.1 Final layout of 2D monolithic SoC design of modified Rocket-64 processor. The 2.5D design is shown in Fig. 3.2.
5.2 Clock tree of 2D monolithic SoC design of modified Rocket-64 processor.
6.1 (a) Flat 1 GHz interposer clock tree, (b) eye diagram.
SUMMARY
CMOS process technology scaling may have reached its pinnacle, yet not all elements
of computing can be manufactured at lower technology nodes. This has led to the
development of a new branch of chip design that allows chiplets built on different technology
nodes to be integrated onto a single package using interposers, which are passive
interconnection media. However, establishing high-frequency communication over an entirely
passive layer is one of the significant design challenges of 2.5D systems. My research will
focus on building a robust clocking architecture for 2.5D systems, using a 64-core processor
benchmark. The clocking scheme of any 2.5D design consists of two major components,
viz., interposer clocking and on-chiplet clocking. The interposer clocking consists of
clocks used to achieve global synchronicity and clocks for inter-chiplet communication
established using the AIB protocol. These clocking components will be built using commercial
EDA tools and analyzed using standard tools and package/interconnect models. I will also
compare these results against a 2D design of the same benchmark and against a different
2.5D clocking architecture to study whether the 2.5D clock network can be designed to offer
better power performance than its 2D counterpart.
CHAPTER 1
INTRODUCTION AND BACKGROUND
Though the 2D IC process technology has been scaling down continuously, there are cer-
tain circuit modules like memory and analog modules that cannot be scaled down to the
lowest technology node used in a particular manufacturing process. Also, there are digital
elements in a chip, which, when scaled down to a lower technology node, provide minimal
performance improvement, and this may not be worth the cost incurred on scaling down the
technology node. As the monolithic 2D design does not support integrating heterogeneous
technologies, the entire design must be scaled down to a lower technology node, which
complicates the whole design process. Moreover, this increases the chance of rendering
multiple dies on a wafer unusable. This is where 2.5D designs help by allowing the integration
of heterogeneous modules (chiplets) together on an interposer. This process not only
helps in improving the yield but also makes the entire manufacturing process time-efficient
by allowing reuse of past chiplet designs [1].
The 2.5D technique enables system designers to design any SoC by choosing off-the-shelf
chiplets and heterogeneously integrating them into the target SoC, thereby drastically
reducing the design time and design complexity through reuse of pre-designed chiplets
as plug-and-play modules. Fig. 1.1 shows an example of an interposer-based 2.5D design
and its cross-sectional view illustrating the inter-chiplet connections and the package con-
nections. Similar to a ball grid array (BGA) package, micro bumps are created across the
surface of chiplets to establish connections with the interposer. Further, the inter-chiplet
routing is done by connecting the corresponding micro bumps using wires routed across
the interposer over different metal layers. The external signals are routed across the metal
layers through the through-silicon vias (TSVs) before they exit the package via C4 bumps.












Figure 1.1: Chiplet integration using an interposer-based 2.5D system [2]. Panel (b) shows the cross-section view of the 2.5D IC.
2.5D designs integrating CPU, GPU, and high bandwidth memory (HBM) have started
hitting the commercial markets. As the commercialization of 2.5D designs has begun, it
is essential to analyze whether the system performance improves in proportion to the effort
involved in switching to a new design model. It is also necessary to compare and analyze different
aspects of 2.5D designs against existing 2D designs to have a clear idea of what the new
technology entails. So far, studies have compared the performance
of different interposer technologies and methodologies to improve the signal and power
integrity of signals on the interposer. The other major component that affects the performance
of a design is the clocking behavior, and it is mandatory to ensure that any new
design methodology provides a better clock performance than the current methodologies,




CHAPTER 2
LITERATURE REVIEW
Several works [3, 4, 5, 6, 7] have been performed related to improving different aspects
of clock trees in 3D ICs. However, very limited work is found on 2.5D clock networks.
One such work is performed by Huang and Zheng [8], who propose a global 2.5D clocking
architecture with one chiplet acting as the clock source to all the dies, as shown in Fig. 2.1.
They try to achieve synchronicity by using a one-driver-per-relay architecture, which min-
imizes the clock skew in 2.5D designs by dynamically tuning the delay of the clock driver
in the source chiplet to match the clock delays across the interposer to various chiplets.
However, this clocking architecture has several disadvantages.
1. The source chiplet distributes high-frequency clock signals across the interposer.
Because the interposer is passive, reconstructing the clock signals at the destination chiplets
becomes a herculean task.
2. When the number of chiplets in the system is high, routing all clock signals from
a single source chiplet may cause crosstalk on the interposer data signals. This, in
turn, makes it difficult to reconstruct the data signals at the destination chiplets.
3. In a multi-clock-domain system, all PLLs must be placed within the source
chiplet, leading to heating issues in the source die.
To overcome these disadvantages, this thesis proposes a hierarchical clock network that
improves the performance and reduces the clock power consumption of 2.5D designs.
A hierarchical clocking architecture minimizes the routing of high-frequency clocks
on the passive interposer. Low-frequency signals degrade less on the interposer
and can be easily reconstructed using double-inverter-based buffers. In addition,
a hierarchical clock network also works well for multi-clock-domain systems.
Figure 2.1: 2.5D Clocking Architecture proposed in [8]
Thus, this research focuses on proposing a scalable clocking architecture for
homogeneous/heterogeneous 2.5D systems. Also, for the first time, a comparison between 2D
and 2.5D clock networks is performed to estimate if 2.5D designs can provide better clock




CHAPTER 3
BENCHMARK ARCHITECTURE
3.1 64 RISC-V Core Architecture
The Rocket-64 [2] architecture, a 64-core processor architecture based on the RISC-V
RocketCore [9] and implemented in TSMC 28nm, is used for this study. This architecture
contains around 8 million gates. The 2.5D design of the Rocket-64 architecture consists of
8 octa-core RocketCore processor chiplets, each containing an L1 cache (0.25MB);
8 L2 cache chiplets (1MB each); a 4-channel Memory Controller (MC) chiplet [10]
(both logical and physical layers) for the 64 cores to interact with external DRAMs; 1
Integrated Voltage Regulator (IVR) chiplet and 8 Digital Low Drop Out (DLDO) voltage
regulators to power the entire 2.5D system; and a Network-on-Chip (NoC) chiplet (with
eight routers) to arbitrate among the 8 RocketCore - L2 cache chiplet pairs and the memory
controller chiplet. The internal architecture of the RocketCore and L2 Cache chiplet pair is
shown in Fig. 3.1. In this work, the homogeneous Rocket-64 described in [2] and a heterogeneous
version of the same are used to understand the benefits of the proposed clocking
architecture. In the heterogeneous version of Rocket-64, the chiplets are re-implemented
as follows: the RocketCore and NoC chiplets in TSMC 16nm and the Memory Controller chiplet
in TSMC 40nm, while the L2 Cache and Digital Low Drop Out (DLDO) voltage
regulator chiplets are retained at the TSMC 28nm node and the IVR chiplet at the GF 130nm technology node.
Fig. 3.2 shows the 2.5D floorplan of the modified Rocket-64 architecture.
3.2 Interposer Technology
TSMC CoWoS [11] 65nm silicon interposer with 0.4 µm fine pitch RDLs and 45 µm-pitch


























































Figure 3.2: Floorplan of 2.5D heterogeneous Rocket-64 that consists of 27 chiplets implemented
in four different commercial technology nodes and one inductor. TSMC CoWoS
65nm interposer technology is used. The chiplet layouts are shown in Fig. 4.13 and 4.14.
















Figure 3.3: Cross sectional view of interposer.
Table 3.1: Material properties and design rule of TSMC 65nm silicon interposer.
Parameter                  Value
Number of metal layers     4
Metal thickness            1 µm
Dielectric thickness       1 µm
Min. line width/spacing    0.4 µm/0.4 µm
Via size                   0.7 µm
Through via size/depth     10 µm/100 µm
Die-to-die spacing         100 µm
Micro bump pitch           45 µm
C4 bump pitch              180 µm
PDN width/spacing          40 µm/90 µm
interposer design used in this paper are shown in Table 3.1 and the cross section view of a
TSMC CoWoS®-based interposer is shown in Fig. 3.3.
3.3 Chiplet Protocol
In this architecture, the Intel Advanced Interface Bus (AIB) protocol [12], a chiplet standard
that uses special AIB drivers and a clock forwarding architecture, is used to establish
inter-chiplet communication. These AIB drivers help in regenerating the degraded
interposer signals. Fig. 3.4 shows the AIB clock forwarding
architecture and Fig. 3.5 shows the layout of the AIB driver.
Figure 3.4: Intel AIB [12] clock forwarding architecture.
3.4 Architectural differences between 2D and 2.5D Rocket-64
The monolithic 2D counterpart of this benchmark contains the same components as that of
the 2.5D design except for the NoC, IVR, DLDO, and AIB protocol logic. In the 2.5D sys-
tem, eight routers are used to arbitrate the transactions between the eight Rocket-8 chiplets
and the memory controller chiplet, whereas, in the 2D design, 12 routers are used. The
difference in the router count between the two designs is due to the use of AIB protocol for
inter-chiplet communication in 2.5D design. The AIB protocol restricts the number of data
I/O signal bumps to 40 to reduce the number of signals routed on the interposer. Therefore,
the entire I/O bus of each chiplet is streamlined into a 40-bit bus using an appropriate FIFO
synchronization mechanism. However, this adds latency in inter-chiplet communication.
In the 2D design, there is no restriction on the number of connections between modules.
Therefore, twelve routers are used for arbitrating signals between the Rocket-8 and the
Memory Controller. Instead of increasing the interface width of the router, the same 40-bit
router is used in the 2D design. Hence, more routers are needed to route the increased
interconnections between the NoC and Memory Controller modules in the 2D design. Fig. 5.1 shows
the single-chip 2D design of the Rocket-64 [2] architecture. Tables 3.2 and 3.3 tabulate the
architectural features of the 2D and 2.5D designs of Rocket-64.
Figure 3.5: Layout of the digitally-synthesized AIB transceiver.
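The latency cost of funnelling a chiplet's I/O bus through the 40-bit AIB data channel can be illustrated with a small sketch. It assumes the wide bus is time-multiplexed into 40-bit slices, which is one plausible reading of the FIFO mechanism described in this section. The 128-bit bus width and the function names are hypothetical and are not taken from the Rocket-64 design.

```python
import math

AIB_DATA_WIDTH = 40  # data I/O bumps per AIB channel, as stated in the text


def beats_needed(bus_width_bits: int, channel_width: int = AIB_DATA_WIDTH) -> int:
    """Number of channel transfers needed to move one word of the given width."""
    if bus_width_bits <= 0:
        raise ValueError("bus width must be positive")
    return math.ceil(bus_width_bits / channel_width)


def serialize(word: int, bus_width_bits: int,
              channel_width: int = AIB_DATA_WIDTH) -> list:
    """Split a word into channel-width slices, least significant slice first."""
    mask = (1 << channel_width) - 1
    return [(word >> (i * channel_width)) & mask
            for i in range(beats_needed(bus_width_bits, channel_width))]


def deserialize(slices: list, channel_width: int = AIB_DATA_WIDTH) -> int:
    """Reassemble the original word from its slices on the receiving chiplet."""
    word = 0
    for i, chunk in enumerate(slices):
        word |= chunk << (i * channel_width)
    return word


if __name__ == "__main__":
    # Hypothetical 128-bit chiplet bus: four 40-bit transfers per word.
    print(beats_needed(128))
```

Under these assumptions, each word on the hypothetical 128-bit bus occupies the AIB channel for four transfers instead of one, which is the kind of added inter-chiplet latency the text refers to.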
Table 3.2: Architectural features: 2D vs homogeneous 2.5D Rocket-64 design.
Module               2D Design    2.5D Design    Technology Node
Rocket Core          8            8              28nm
L2 Cache             8            8              28nm
Memory Controller    4            4              28nm
Routers              12           8              28nm
IVR                  0            4              28nm
DLDO                 0            8              130nm
PLL                  8            1              28nm
Table 3.3: Architectural features: 2D vs heterogeneous 2.5D Rocket-64 design.
Module               2D Design    2.5D Design    Technology Node
Rocket Core          8            8              16nm
L2 Cache             8            8              28nm
Memory Controller    4            4              40nm
Routers              12           8              16nm
IVR                  0            4              28nm
DLDO                 0            8              130nm
PLL                  20           24             28nm
CHAPTER 4
2.5D CLOCK NETWORK SYNTHESIS
Any clock network in a 2.5D design consists of two components: interposer clocking and
on-chiplet clocking. In this work, Cadence SiP Layout is used for interposer clock
routing, and the clock tree synthesizer (CTS) in the Cadence Innovus Implementation System
is used for on-chiplet clock routing.
4.1 2.5D Clocking Architecture
Passive interposers prohibit the optimization of high-frequency clocks using buffer insertion
techniques. This makes it necessary to lower the clock frequency if the clock
signal is routed over a long distance. Taking this limitation into account, the clocking
architecture shown in Fig. 4.1 is used for the homogeneous 2.5D Rocket-64 and the clocking
architecture shown in Fig. 4.2 for the heterogeneous 2.5D Rocket-64 design. The homogeneous
Rocket-64 uses a single clock domain (1 GHz) for its functional operations, whereas
the heterogeneous variant uses multiple clock domains (1.2 GHz, 1 GHz, and 600 MHz).
The crystal/reference clock is scaled to high-frequency clocks using phase-locked loop
(PLL) circuits within different chiplets, depending on the variant of Rocket-64. A typical analog
PLL [13] on the TSMC 28nm node is used to scale the clock frequency. It consists of a
ring-oscillator-based voltage-controlled oscillator (VCO), a digital phase-frequency detector (PFD),
a digital frequency divider, a charge pump, and a loop filter. The loop filter
consists of PMOS and NMOS capacitors and poly resistors, and generates the
control signal from the charge pump output that sets the oscillation frequency of the VCO. The
architecture and layout of the PLL are shown in Fig. 4.3 and 4.4, respectively. The area of
the PLL is 42,433 µm². The lock time of the PLL is approximately 110 ns.
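As a rough illustration of the frequency scaling, the sketch below computes the multiplication ratio that would take the 100 MHz reference to each functional frequency. It assumes an integer-N PLL in which the output frequency equals the divider ratio times the reference frequency. The thesis does not state the divider settings, so the ratios are derived values, and the function name is hypothetical.

```python
from fractions import Fraction

F_REF_HZ = 100_000_000              # 100 MHz crystal/reference clock (from the text)
LOCK_TIME_S = Fraction(110, 10**9)  # approximate PLL lock time of 110 ns (from the text)


def feedback_divider(f_out_hz: int, f_ref_hz: int = F_REF_HZ) -> int:
    """Divider ratio N such that f_out = N * f_ref (integer-N PLL assumption)."""
    ratio = Fraction(f_out_hz, f_ref_hz)
    if ratio.denominator != 1:
        raise ValueError("output is not an integer multiple of the reference")
    return ratio.numerator


# Lock time expressed in reference clock periods.
ref_cycles_to_lock = LOCK_TIME_S * F_REF_HZ

if __name__ == "__main__":
    for f_out in (1_200_000_000, 1_000_000_000, 600_000_000):
        print(f_out, feedback_divider(f_out))
    print(ref_cycles_to_lock)
```

Because all three functional frequencies are integer multiples of the reference, each generated clock completes a whole number of cycles per reference period.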
Figure 4.1: The hierarchical clocking architecture for homogeneous Rocket-64. [Figure labels: crystal/reference clock, 100 MHz; functional/AIB clock, 1 GHz.]
Figure 4.2: The hierarchical clocking architecture for heterogeneous Rocket-64. [Figure labels: crystal/reference clock; functional/AIB clocks, 1 GHz and 600 MHz.]
4.2 Reference Clock Routing
The interposer clock network forms the base of clocking in any 2.5D system. A crys-
tal/reference clock from an external 100 MHz crystal oscillator is routed into the 2.5D
system through the C4 bumps, TSVs, and via stack and metal layer up to the clock micro
bumps of the chiplets consisting of PLLs using Allegro signal router in the Cadence SiP
Layout tool. The clock C4 bump is placed as close as possible to the PLLs, so that the ref-
erence clock does not undergo much degradation over the passive interposer layer. Owing
to the passive nature of the interposer, there cannot be equalizers on the interposer layer
to reduce the effect of crosstalk. The crosstalk on clock signals is reduced by making sure
that the clock C4 and micro bumps are surrounded by either power, ground, or semi-static












Figure 4.3: PLL architecture.
noise-free clock, and hence, AIB I/O drivers, which are capable of reconstructing cleaner
clock signals from the degraded ones are used.
In a hierarchical clock routing architecture, the structure of the reference clock plays an
important role in achieving global synchronicity in the system. The PLLs are placed within
different chiplets in the homogeneous and heterogeneous Rocket-64 designs to check whether the
hierarchical clock routing performs well irrespective of the PLL's location. The reference
clock tree structures in the two variants of Rocket-64 are explained below.
4.2.1 Homogeneous Rocket-64
In the case of homogeneous Rocket-64, the reference clock is manually routed on the interposer
layer in an H-tree fashion to all the Rocket-8 and NoC chiplets. Within each of these
chiplets, a PLL scales the reference clock to 1 GHz. The reference clock tree is shown in
Fig. 4.5.
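The balancing property of an H-tree can be illustrated with a minimal sketch that builds an idealized symmetric H-tree and checks that every leaf tap lies at the same routed distance from the root. This is a generic textbook construction, not the manually routed tree of Fig. 4.5. The dimensions, the level count, and the function name are hypothetical.

```python
def h_tree(cx, cy, half_w, half_h, levels, path_len=0.0):
    """Return (x, y, path_length) for each leaf tap of an idealized H-tree.

    Each level routes from the centre along a horizontal trunk of length
    half_w and then a vertical arm of length half_h to reach four sub-centres,
    and the pattern repeats at half the size.
    """
    if levels == 0:
        return [(cx, cy, path_len)]
    leaves = []
    for dx in (-half_w, half_w):
        for dy in (-half_h, half_h):
            leaves += h_tree(cx + dx, cy + dy,
                             half_w / 2, half_h / 2,
                             levels - 1,
                             path_len + half_w + half_h)
    return leaves


if __name__ == "__main__":
    # Hypothetical dimensions in millimetres, two levels of recursion.
    taps = h_tree(0.0, 0.0, 8.0, 4.0, levels=2)
    lengths = {length for _, _, length in taps}
    print(len(taps), lengths)
```

Equal routed length to every tap is what keeps the reference clock arrival times matched at the PLL inputs, provided the wire geometry and loading are also matched along each branch.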
4.2.2 Heterogeneous Rocket-64
In case of heterogeneous Rocket-64, the reference clock is manually routed on the inter-
poser layer in an H-Tree fashion to all the L2 cache chiplets. Within each L2 cache chiplets,








Figure 4.4: PLL layout using TSMC 28nm.
erence clock tree is shown in Fig. 4.6.
A completely balanced reference clock H-tree is necessary to make sure that all
PLLs are synchronous. When the PLLs are synchronous with each other, the high-frequency
clock signals generated within the Rocket-8, L2 cache, and NoC chiplets align on every reference
clock edge to ensure global synchronicity. The low frequency of the reference clock
makes it less prone to interposer degradation, and this can be observed in the eye diagram






Figure 4.5: 100 MHz crystal/reference interposer clock tree on interposer of homogeneous
Rocket-64
4.3 RLGC Models of Interposer Interconnects
The clock metric analysis of different types of clocks is performed in this design using
RLGC models of C4 bumps, TSVs [14] [15], vias, and interposer transmission line models
[16]. High-speed interposer interconnect models are designed using a Bayesian framework
coupled with machine learning techniques, and these models are used to perform the simulations.
Also, multi-conductor transmission line models, as shown in Fig. 4.8, are used to




Figure 4.6: 100 MHz crystal/reference clock tree on interposer of heterogeneous Rocket-64.
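To show how per-unit-length RLGC values translate into quantities used in clock analysis, the following sketch applies the standard transmission-line relations for characteristic impedance, Z0 = sqrt((R + jωL)/(G + jωC)), and propagation constant, γ = sqrt((R + jωL)(G + jωC)). The numeric values are placeholders chosen for illustration and are not the extracted interposer parameters. The function names are hypothetical.

```python
import cmath
import math


def line_parameters(r, l, g, c, freq_hz):
    """Characteristic impedance and propagation constant from per-metre RLGC.

    r in ohm/m, l in H/m, g in S/m, c in F/m.
    Returns (z0, gamma); gamma.real is the attenuation in Np/m and
    gamma.imag is the phase constant in rad/m.
    """
    omega = 2 * math.pi * freq_hz
    series = complex(r, omega * l)  # R + jwL
    shunt = complex(g, omega * c)   # G + jwC
    z0 = cmath.sqrt(series / shunt)
    gamma = cmath.sqrt(series * shunt)
    return z0, gamma


def lossless_delay(l, c, length_m):
    """Propagation delay of a lossless line: length * sqrt(L * C)."""
    return length_m * math.sqrt(l * c)


if __name__ == "__main__":
    # Placeholder values only (NOT the thesis data): a lossy line at 1 GHz.
    z0, gamma = line_parameters(r=5e4, l=250e-9, g=0.0, c=100e-12, freq_hz=1e9)
    print(abs(z0), gamma.real, gamma.imag)
    print(lossless_delay(250e-9, 100e-12, 3e-3))
```

The attenuation term grows with the series resistance, which is consistent with the observation in this chapter that thin, unbuffered interposer wires degrade high-frequency clocks more than the 100 MHz reference.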
4.4 Functional Clock Generation
After the reference clock routing is completed, the propagation delay of the clocks from
the crystal clock C4 bump to various chiplets is calculated based on the RLGC models of
C4 bumps, TSVs, and interposer wires as described in Section 4.3. This delay is used
as the source latency to the PLLs within each NoC, Rocket-8, and L2 cache chiplet for
on-chiplet clock tree synthesis. These clock delays are modelled using a Synopsys Design
Constraints (SDC) file, which is a Synopsys file format to model clock/reset related con-























Figure 4.7: Eye diagram of 100 MHz reference clock on interposer.
and optimize the clock trees of 1 GHz within the Rocket-8 and NoC chiplets in the case of
homogeneous Rocket-64 and the clock trees of 1.2 GHz, 1 GHz, and 600 MHz within
the L2 cache chiplets in the case of heterogeneous Rocket-64. In heterogeneous Rocket-64,
the clocks generated within the L2 cache chiplets are forwarded to the Rocket-8 and NoC
chiplets for their operation. The 1.2 GHz clock forwarded from the L2 cache chiplets to the
corresponding Rocket-8, NoC, and memory controller chiplets in heterogeneous Rocket-64
is shown in Fig. 4.9.
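One way the interposer delay could be handed to on-chiplet clock tree synthesis is sketched below: a small generator that emits SDC commands declaring the reference clock, its source latency, and a PLL-generated functional clock. The commands `create_clock`, `set_clock_latency -source`, and `create_generated_clock -multiply_by` are standard SDC, but the port name, pin name, clock names, and the 0.150 ns latency are hypothetical, and the time unit is assumed to be nanoseconds. This is not the constraint file used in the thesis.

```python
def pll_clock_sdc(chiplet, ref_port, pll_out_pin, f_out_mhz,
                  source_latency_ns, f_ref_mhz=100):
    """Return SDC lines for a reference clock and its PLL-generated clock.

    source_latency_ns models the delay from the clock C4 bump to the
    chiplet's clock micro bump over the passive interposer.
    """
    if f_out_mhz % f_ref_mhz:
        raise ValueError("output must be an integer multiple of the reference")
    multiply_by = f_out_mhz // f_ref_mhz
    ref_period_ns = 1000.0 / f_ref_mhz
    ref_clk = f"{chiplet}_ref_clk"
    return [
        f"create_clock -name {ref_clk} -period {ref_period_ns:.3f} "
        f"[get_ports {ref_port}]",
        f"set_clock_latency -source {source_latency_ns:.3f} "
        f"[get_clocks {ref_clk}]",
        f"create_generated_clock -name {chiplet}_func_clk "
        f"-source [get_ports {ref_port}] -multiply_by {multiply_by} "
        f"[get_pins {pll_out_pin}]",
    ]


if __name__ == "__main__":
    # Hypothetical names and latency, for illustration only.
    for line in pll_clock_sdc("rocket8", "ref_clk_in", "u_pll/clk_out",
                              f_out_mhz=1200, source_latency_ns=0.150):
        print(line)
```

Keeping the interposer delay as a source latency leaves the on-chiplet tool free to balance only the part of the tree it actually builds.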
4.5 AIB Clock Forwarding Technique
Adding a PLL within every chiplet, especially in cases where a chiplet operates fully
synchronously with another chiplet, is not an efficient clocking technique. In a case where two
chiplets communicate as a master-slave pair, the slave chiplet should derive its clock signal
from the master's clock. The following master-slave communications are implemented in
this design using the AIB clock forwarding architecture: Rocket-8 and L2 Cache, L2 Cache
and NoC, and NoC and Memory Controller. The buffers shown
in Fig. 3.4 are special AIB buffers, which can be configured to act as either clock buffers or
data buffers and also aid in the reconstruction of high-speed signals that get degraded
even over short distances on the interposer. Fig. 4.10 shows a 1 GHz clock signal
through a 3 µm wire on a passive silicon interposer and the corresponding clock signal
reconstructed by the AIB buffer.
Figure 4.8: Microstrip model of multi-conductor transmission line. [Figure labels: interposer substrate; line width = 0.4 µm; line spacing = 0.4 µm.]
The clock forwarded from the master chiplet is routed back to the master by the slave
chiplet when the slave responds to the master. Duty cycle corrector (DCC) circuits are
used to correct duty cycle variations if the clock signal’s duty cycle is affected on transmis-
sion over the interposer. The performance metrics, power, and area of AIB buffers for an
operating frequency of 1 GHz are given in Table 4.1.
Using these features of AIB protocol, the following clocking style is designed for
master-slave chiplet pair communication in both variants of Rocket-64.
Table 4.1: Properties of AIB buffer
Metric           Value
Op. Frequency    1 GHz
Area             56 µm²
Gate Count       69
Total Power      19 µW
Clock Power      6.1 µW
Clock Latency    4 ps
4.5.1 Homogeneous Rocket-64
The 1 GHz functional clock generated in each Rocket-8 chiplet is forwarded along with
the data in a 40 bit AIB bus to the corresponding L2 Cache chiplet. When the L2 cache
chiplet responds to the Rocket-8 chiplet, the same clock is rerouted internally within the
L2 Cache chiplet and forwarded along with the L2 cache data back to Rocket-8 chiplet,
similar to the structure shown in Fig. 3.4. Also, the 1 GHz clock from NoC chiplet is
routed with the data signals between the L2 Cache and NoC chiplet pairs, and NoC and
Memory Controller chiplet pairs in a similar fashion.
4.5.2 Heterogeneous Rocket-64
The 1.2 GHz functional clock forwarded to Rocket-8 chiplet from the L2 cache chiplet is
forwarded back along with the data, in a 40 bit AIB bus, to the corresponding L2 Cache
chiplet. When the L2 cache chiplet responds to the Rocket-8 chiplet, the same clock is
rerouted internally within the L2 Cache chiplet and forwarded along with the L2 cache
data back to Rocket-8 chiplet, similar to the structure shown in Fig. 3.4. Also, the 600
MHz clock and data signals between the L2 Cache and NoC chiplet pairs, and NoC and
Memory Controller chiplet pairs are routed in a similar fashion.
It is necessary to ensure that when the clock is forwarded along with the data, the signals
are not skewed in a way that breaks synchronicity in the communication. The Cadence
SiP tool's Allegro signal router is used to perform the AIB clock routing by constraining the
skew limits. The routing lengths are kept within a safe limit of 3.5 µm. The clock
and the data signals in a bus are routed such that the maximum skew between the clock and
any data signal is 3.96 ps. Unlike the reference clock, these AIB clocks pass through the
interposer only for a short length. They are also regenerated within the chiplets as they
pass from one chiplet to the other, making them robust to degradation despite their high
frequencies. Also, the clock signals are surrounded by semi-static signals in an AIB bus to
reduce the effect of crosstalk.
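The skew requirement on a forwarded-clock bus can be expressed as a simple check, sketched below. It treats the 3.96 ps figure from the text as an upper bound on the clock-to-data skew, which is an assumption made for illustration. The per-wire delays and the function names are hypothetical.

```python
MAX_SKEW_PS = 3.96  # maximum clock-to-data skew reported in the text


def max_clock_to_data_skew(clock_delay_ps, data_delays_ps):
    """Largest absolute arrival-time difference between the clock and any data wire."""
    if not data_delays_ps:
        raise ValueError("bus has no data wires")
    return max(abs(d - clock_delay_ps) for d in data_delays_ps)


def bus_meets_skew(clock_delay_ps, data_delays_ps, limit_ps=MAX_SKEW_PS):
    """True when every data wire arrives within limit_ps of the forwarded clock."""
    return max_clock_to_data_skew(clock_delay_ps, data_delays_ps) <= limit_ps


if __name__ == "__main__":
    # Hypothetical interposer wire delays in picoseconds.
    clock = 10.0
    data = [9.0, 11.5, 12.0]
    print(max_clock_to_data_skew(clock, data), bus_meets_skew(clock, data))
```

A check of this form is applied per AIB bus, since the clock and its data wires travel together and only their relative arrival times matter at the receiving chiplet.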
4.6 Functional Clock Routing in Slave Chiplets
4.6.1 Homogeneous Rocket-64
Once the AIB interposer clock routing is done, we calculate the propagation delays of 1
GHz clocks on the interposer and clock skew between the clock and data signals in the
AIB bus to generate SDC constraints for functional clock tree synthesis of L2 cache, and
memory controller chiplets.
4.6.2 Heterogeneous Rocket-64
In heterogeneous Rocket-64, the propagation delays of 1.2 GHz, and 600 MHz clocks on
the interposer and clock skew between the clock and data signals in the AIB bus are calcu-
lated and used to generate SDC constraints for functional clock tree synthesis of Rocket-8,
NoC, and memory controller chiplets.
The high-frequency clock signals are more degraded compared to the 100 MHz ref-
erence clock signal, and hence it is necessary to reconstruct a clean clock signal from the
degraded interposer clock. However, as mentioned earlier, the AIB buffer used as a part of
clock forwarding architecture takes care of reconstructing the degraded clock signals.
Fig. 4.11 and 4.12 show the chiplet layouts and their corresponding clock trees of
homogeneous Rocket-64 and Fig. 4.13 and 4.14 show that of heterogeneous Rocket-64.
Tables 4.2 and 4.3 provide the corresponding clock metrics of homogeneous and hetero-
geneous Rocket-64, respectively. It is observed that Rocket-8 chiplet contains the most
Table 4.2: Chiplet clock metrics of Homogeneous Rocket-64.
Metric                          L2 Cache    Rocket-8    NoC      Mem-Ctr
Target Clock Frequency (GHz)    1           1           1        1
Technology node (nm)            28          28          28       28
Clock Latency (ps)              264         566         216      273
Clock Skew (ps)                 7           12          34       41
Clock Jitter (ps)               21          18          21       45
Clock Power (mW)                5           170         47       19
Clock Buffer Count              420         409         1,230    328
Clock Wire Length (mm)          35          428         110      39
Table 4.3: Chiplet clock metrics of Heterogeneous Rocket-64.
Metric                          L2 Cache    Rocket-8    NoC      Mem-Ctr
Target Clock Frequency (GHz)    1           1.2         1.2      0.6
Technology node (nm)            28          16          16       40
Clock Latency (ps)              152         239         309      526
Clock Skew (ps)                 2           2           43       144
Clock Jitter (ps)               11          7           9        22
Clock Power (mW)                38          139         24       20
Clock Buffer Count              369         5,866       1,215    610
Clock Wire Length (mm)          41          313         97       55
























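Figure 4.11: Homogeneous Rocket-64: Full-chip design and clock tree of (a) L2 cache
(TSMC 28nm), (b) Rocket-8 (TSMC 28nm) chiplets. Not drawn to scale.
Figure 4.12: Homogeneous Rocket-64: Full-chip design and clock tree of (a) NoC (TSMC
28nm), (b) Memory controller (TSMC 28nm) chiplets. Not drawn to scale.
Figure 4.13: Heterogeneous Rocket-64: Full-chip design and clock tree of (a) L2 cache
(TSMC 28nm), (b) Rocket-8 (TSMC 16nm) chiplets. Not drawn to scale.
Figure 4.14: Heterogeneous Rocket-64: Full-chip design and clock tree of (a) NoC (TSMC
16nm), (b) Memory controller (TSMC 40nm) chiplets. Not drawn to scale.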




5.1 Benefits of the proposed 2.5D Clocking Architecture over 2D Counterpart
For the single-chip monolithic 2D design of Rocket-64, a hierarchical design is performed
using the TSMC 28nm technology node. Unlike the 2.5D design, the 2D design cannot involve
multiple technology nodes. The 2D design does not require IVR and DLDO modules, as
power delivery in a 2D system is less stringent than in a 2.5D system. For a fair
comparison, a 100 MHz clock is used as the bus clock, and a PLL for each group of 8 Rocket
cores scales it to the functional frequency. Similar to the 2.5D design, two variants of
2D Rocket-64 are designed: (1) Rocket-8, NoC, and DDR-PHY modules operating at 1.2 GHz,
L2 cache at 1 GHz, and memory controller at 600 MHz; (2) all modules operating at 1 GHz.
Figs. 5.1 and 5.2 show the overall design and the multi-domain clock network of the 2D
design, respectively. Tables 5.1 and 5.2 compare the clock power consumption of the two
variants of the 2D and 2.5D designs. The following observations are made:
5.1.1 Homogeneous Rocket-64
• RocketCore: The large capacitance contributed by long high-frequency nets and the
large number of buffers added on these nets cause the 2D design to dissipate more clock
power than the 2.5D design.
• L2 Cache: The 2.5D design includes additional logic to support the AIB protocol. This
logic contains a significant number of sequential circuits, which increases the overall
clock power.
Figure 5.1: Final layout of 2D monolithic SoC design of modified Rocket-64 processor.
The 2.5D design is shown in Fig. 3.2.
• Memory Controller: The clock power of the 2.5D 4-channel memory controller is
slightly higher than that of the 2D design due to the presence of AIB logic.
• Router: The 2D design has 12 routers for arbitration, whereas the 2.5D design has
only eight routers, which explains the significantly higher clock power of the 2D router
design.
• PLL: Both the 2D and 2.5D designs have one PLL for each Rocket-8 and NoC module.
The additional power seen in the 2D design is due to the large capacitance contributed by




Figure 5.2: Clock tree of 2D monolithic SoC design of modified Rocket-64 processor.
Labeled regions: L2 Cache, NoC Router, Memory Controller.
these long wires pass through the passive interposer layer, thereby reducing the effective
capacitance seen by the PLL.
• Overall: The overall clock power of the 2.5D design is 12% lower than that of the
2D design. This reduction arises because the long clock nets in the 2.5D design are
low-frequency interposer nets, as opposed to long high-frequency clock nets in the 2D
design.
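The frequency argument in these observations can be illustrated with the standard switching-power relation for a net that toggles every cycle, P = C · Vdd² · f. The sketch below is only a back-of-the-envelope example: the 10 pF net capacitance and the 0.9 V supply are hypothetical values chosen for illustration, not figures extracted from the designs.

```python
def switching_power_mw(cap_pf, vdd_v, freq_mhz, activity=1.0):
    """Dynamic switching power P = activity * C * Vdd^2 * f, returned in mW."""
    cap_f = cap_pf * 1e-12      # pF -> F
    freq_hz = freq_mhz * 1e6    # MHz -> Hz
    return activity * cap_f * vdd_v ** 2 * freq_hz * 1e3  # W -> mW


# Hypothetical long clock net: same capacitance, driven at two frequencies.
CAP_PF, VDD_V = 10.0, 0.9
p_high = switching_power_mw(CAP_PF, VDD_V, 1000)  # 1 GHz net, as in the 2D design
p_low = switching_power_mw(CAP_PF, VDD_V, 100)    # 100 MHz reference net on the interposer
print(f"1 GHz: {p_high:.2f} mW, 100 MHz: {p_low:.2f} mW, ratio: {p_high / p_low:.1f}x")
```

For equal capacitance and supply voltage, the power scales linearly with frequency, so a long net carrying the 100 MHz reference clock dissipates one tenth of the power of the same net carrying a 1 GHz clock.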
Thus, we have demonstrated that clock delivery network optimization is manageable in
2.5D designs and can even outperform the 2D counterpart. However, this requires rigorous
Table 5.1: Clock power comparison: 2D vs 2.5D homogeneous Rocket-64
Module (2D) or Chiplet (2.5D)   2D Design   2.5D Design
Eight Rocket-8                  1,580 mW    1,260 mW
Eight L2 Cache                  8.8 mW      40 mW
Memory Controller               10 mW       19 mW
NoC Router                      81 mW       47 mW
PLL                             110.3 mW    101.25 mW
Overall Power                   1.79 W      1.57 W
Table 5.2: Clock power comparison: 2D vs 2.5D heterogeneous Rocket-64
Module (2D) or Chiplet (2.5D)   2D Design   2.5D Design
Eight Rocket-8                  1,640 mW    1,110 mW (16nm)
Eight L2 Cache                  8.8 mW      32 mW (28nm)
Memory Controller               13 mW       20 mW (40nm)
NoC Router                      60 mW       23 mW (16nm)
PLL                             241 mW      270 mW (28nm)
Total Clock Power               1.98 W      1.65 W
co-optimization of chiplet and interposer portions.
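The relative savings implied by the totals in Tables 5.1 and 5.2 can be checked with a few lines of arithmetic. The snippet below is a small sketch that uses only the overall clock power values from the two tables; the function and variable names are illustrative.

```python
def percent_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Overall clock power in W as (2D, 2.5D), taken from Tables 5.1 and 5.2.
totals_w = {
    "homogeneous": (1.79, 1.57),
    "heterogeneous": (1.98, 1.65),
}
for variant, (p_2d, p_25d) in totals_w.items():
    saving = percent_reduction(p_2d, p_25d)
    print(f"{variant}: 2.5D clock power is {saving:.1f}% lower than 2D")
```

This gives roughly 12.3% for the homogeneous variant and 16.7% for the heterogeneous variant.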
5.1.2 Heterogeneous Rocket-64
• RocketCore: The power of the 2D Rocket-8 design is higher than that of the 2.5D
counterpart, as the 2D Rocket-8 uses a larger technology node (28nm) than the 2.5D
chiplet (16nm).
• L2 Cache: Similar to the homogeneous variant, the 2.5D design includes additional
logic to support the AIB protocol. This logic contains a significant number of sequential
circuits, which increases the overall clock power.
• Memory Controller: The clock power of the 2.5D 4-channel memory controller is
slightly higher than that of the 2D design due to the presence of AIB logic and the
larger technology node (40nm).
• Router: In addition to the smaller number of routers in the 2.5D design, the routers are
designed at the 16nm node. This further lowers the power consumption of the 2.5D routers.
• PLL: To reduce the number of high-frequency signals on the interposer, we placed
all the PLLs within the L2 cache chiplets in the 2.5D design. There are 24 PLLs (3
in each L2 cache chiplet) in the 2.5D design. The 2D design has 20 PLLs (one PLL per
partition), so the PLLs consume less power in the 2D design.
• Overall: The overall clock power of the 2.5D design is 16.7% lower than that of the 2D
design. This further reduction in power is due to the presence of chiplets at smaller
technology nodes in the 2.5D design.
Thus, it is demonstrated that clock delivery network optimization is manageable in 2.5D
designs and can even outperform the 2D counterpart, irrespective of homogeneous
or heterogeneous chiplets, single or multiple clock domains, and PLL location. However, this




6.1 Hierarchical vs. Flat Approach
The proposed clocking architecture for 2.5D Rocket-64 involves hierarchical routing, AIB
clock forwarding, and on-chip PLLs. However, what would happen if high-frequency clock
signals were delivered directly to all chiplets? To answer this question, a 1 GHz clock
tree is routed in the interposer for comparison. In this case, all the chiplets are assumed
to operate at 1 GHz so that PLLs are not necessary. Fig. 6.1 shows the routing
topology used for this 1 GHz clock and its eye diagram. First, excessive degradation
of the flat 1 GHz clock signal can be observed in the eye diagram. This demonstrates
the difficulty of delivering a high-frequency clock signal through a long passive
interposer interconnect without any clock buffers. Table 6.1 compares the proposed
heterogeneous hierarchical clock tree with the flat one. The hierarchical clock tree
performs better in all metrics except NoC and memory controller latency. This
is because the clock µ-bumps of the NoC and memory controller chiplets are closer to the
crystal clock C4 bump, and the clock frequency of these chiplets is lower in the flat clock










Figure 6.1: (a) Flat 1 GHz interposer clock tree, (b) eye diagram.
Table 6.1: Hierarchical vs. flat interposer clock routing. The latency here denotes the
maximum delay from the clock C4 to clock micro-bump.
Clock Metric    Chiplet          Hierarchical   Flat
Clock latency   L2 cache         71 ps          74 ps
                Rocket           50 ps          89 ps
                NoC              62 ps          48 ps
                Mem controller   66 ps          58 ps
Clock skew      L2 cache         0.25 ps        8 ps
                Rocket           0 ps           2 ps
Clock Jitter    -                1.2 ps         1.4 ps
Eye Height      -                899 mV         607 mV
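The comparison in Table 6.1 can also be tabulated programmatically. The following sketch encodes the table rows and lists the entries where the flat clock tree is better; the data layout and function name are illustrative choices, with lower values treated as better for every metric except eye height.

```python
# Rows of Table 6.1: (metric, chiplet, hierarchical, flat, unit, higher_is_better)
TABLE_6_1 = [
    ("Clock latency", "L2 cache", 71, 74, "ps", False),
    ("Clock latency", "Rocket", 50, 89, "ps", False),
    ("Clock latency", "NoC", 62, 48, "ps", False),
    ("Clock latency", "Mem controller", 66, 58, "ps", False),
    ("Clock skew", "L2 cache", 0.25, 8, "ps", False),
    ("Clock skew", "Rocket", 0, 2, "ps", False),
    ("Clock jitter", "-", 1.2, 1.4, "ps", False),
    ("Eye height", "-", 899, 607, "mV", True),
]


def flat_wins(rows):
    """Return (metric, chiplet) pairs for which the flat tree beats the hierarchical one."""
    wins = []
    for metric, chiplet, hier, flat, _unit, higher_is_better in rows:
        flat_better = flat > hier if higher_is_better else flat < hier
        if flat_better:
            wins.append((metric, chiplet))
    return wins


eye_hier, eye_flat = 899, 607  # mV, from Table 6.1
eye_loss_pct = 100.0 * (eye_hier - eye_flat) / eye_hier
print("Flat is better for:", flat_wins(TABLE_6_1))
print(f"Eye height loss of the flat tree: {eye_loss_pct:.1f}%")
```

The flat tree is better only for NoC and memory controller latency, and its eye height is about 32.5% lower than that of the hierarchical tree.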




In this work, a robust clock architecture is proposed for many-core 2.5D processor design.
This architecture relies on a hierarchical clock distribution network that utilizes a novel
clock forwarding scheme and on-chip PLLs for frequency conversion. Using this architecture,
for the first time, a 2D vs. 2.5D clocking architecture comparison is presented using GDS
layouts of all chiplets and the interposer and sign-off quality power, performance, and
clock reliability metrics. Contrary to the common belief that clock delivery is much more challenging in
2.5D designs, it is demonstrated that with rigorous co-optimization of chiplet and inter-







The RTL code for the main module of the benchmark used in this research, RocketCore,
was provided by the sponsor of the project, DARPA. The RTL code for the other modules in
the benchmark was taken from open-source websites such as GitHub and OpenCores. These
modules were integrated into the Rocket-64 benchmark. The specifications of the
special AIB driver architecture used to drive signals through the interposer were provided
by Intel. Based on these specifications, RTL, timing, and power models of the AIB buffers
were created and used in the design process. Different modules of the benchmark were
synthesized using different TSMC technology nodes as described in the thesis. Several
proprietary physical design tools like Cadence Innovus, Cadence SiP Layout tool, and Syn-





J. Kim et al. compared the power, performance, and area of the 2D and 2.5D designs
of a 64-core processor, Rocket-64, in [2]. This benchmark has 27 chiplets, viz., 8 Rocket-8
(octa-core) chiplets, 8 L2 cache chiplets, 1 Network-on-Chip (NoC) chiplet, 1 memory
controller chiplet, 8 DLDO chiplets, and 1 IVR chiplet. The same benchmark is used
for studying the clocking behavior because the variety of chiplets in this design enables
a multi-domain clock tree structure, thereby matching the clock tree structure of
many real-world Systems-on-Chip (SoCs). The other advantage of using this benchmark is




[1] D. Stow, I. Akgun, R. Barnes, P. Gu, and Y. Xie, “Cost analysis and cost-driven IP
reuse methodology for SoC design based on 2.5D/3D integration,” in 2016 IEEE/ACM
International Conference on Computer-Aided Design (ICCAD), 2016, pp. 1–6.
[2] J. Kim et al., “Architecture, Chip, and Package Co-design Flow for 2.5D IC Design
Enabling Heterogeneous IP Reuse,” in ACM Design Automation Conference, 2019.
[3] V. F. Pavlidis, I. Savidis, and E. G. Friedman, “Clock distribution networks in 3-D
integrated systems,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 19, no. 12, pp. 2256–2266, 2011.
[4] F.-W. Chen and T. Hwang, “Clock tree synthesis with methodology of re-use in 3D
IC,” Jun. 2012.
[5] H. Xu, V. Pavlidis, and G. Micheli, “Effect of process variations in 3D global clock
distribution networks,” ACM Journal on Emerging Technologies in Computing
Systems, vol. 8, Aug. 2012.
[6] T. Lu and A. Srivastava, “Gated low-power clock tree synthesis for 3D-ICs,” in 2014
IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED),
2014, pp. 319–322.
[7] F. Chen and T. Hwang, “Clock tree synthesis with methodology of re-use in 3D IC,”
in DAC Design Automation Conference 2012, 2012, pp. 1094–1099.
[8] S. Huang and C. Zheng, “Die-to-die clock skew characterization and tuning for 2.5D
ICs,” in 2016 IEEE 25th Asian Test Symposium (ATS), 2016, pp. 221–226.
[9] K. Asanović et al., “The Rocket Chip Generator,” EECS Department, University
of California, Berkeley, Tech. Rep., 2016.
[10] JEDEC, “Double Data Rate (DDR3) DRAM Standard,” 2007.
[11] R. Chaware, K. Nagarajan, and S. Ramalingam, “Assembly and reliability
challenges in 3D integration of 28nm FPGA die on a large high density 65nm passive
interposer,” in 2012 IEEE 62nd Electronic Components and Technology Conference,
2012, pp. 279–283.
[12] D. Kehlet et al., Accelerating Innovation Through a Standard Chiplet Interface: The
Advanced Interface Bus (AIB), https://www.intel.com.
[13] Y.-L. Hsueh et al., “A 0.29mm² frequency synthesizer in 40nm CMOS with 0.19ps rms
jitter and <-100dBc reference spur for 802.11ac,” IEEE Int. Solid-State Circuits
Conf., 2014.
[14] S. Choi et al., “Signal Integrity Analysis of Silicon/Glass/Organic Interposers for
2.5D/3D Interconnects,” IEEE Electronic Components and Technology Conf., 2017.
[15] A. E. Engin and S. R. Narasimhan, “Modeling of Crosstalk in Through Silicon Vias,”
IEEE Transactions on Electromagnetic Compatibility, vol. 55, pp. 149–158, 1 2013.
[16] H. M. Torun, M. Larbi, and M. Swaminathan, “A Bayesian Framework for Optimiz-
ing Interconnects in High-Speed Channels,” IEEE Int. Conf. on Numerical Electro-
magnetic and Multiphysics Modeling and Optimization, pp. 1–4, 2018.
