Inter-Tier Process Variation-Aware Monolithic 3D NoC Architectures by Musavvir, Shouvik et al.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 
Abstract— Monolithic 3D (M3D) technology enables high 
density integration, performance, and energy-efficiency by 
sequentially stacking tiers on top of each other. M3D-based 
network-on-chip (NoC) architectures can exploit these 
benefits by adopting tier partitioning for intra-router 
stages. However, conventional fabrication methods are 
infeasible for M3D-enabled designs due to 
temperature-related issues. This has necessitated lower-temperature and 
temperature-resilient techniques for M3D fabrication, 
leading to inferior performance of transistors in the top tier 
and interconnects in the bottom tier. The resulting inter-tier 
process variation leads to performance degradation of 
M3D-enabled NoCs. In this work, we demonstrate that 
without considering inter-tier process variation, an M3D-
enabled NoC architecture overestimates the energy-delay-
product (EDP) on average by 50.8% for a set of SPLASH-2 
and PARSEC benchmarks. As a countermeasure, we adopt 
a process variation-aware design approach. The proposed 
design and optimization method distributes the intra-router 
stages and inter-router links among the tiers to mitigate the 
adverse effects of process variation. Experimental results 
show that the NoC architecture under consideration 
improves the EDP by 27.4% on average across all 
benchmarks compared to the process-oblivious design. 
 
Index Terms—Monolithic 3D, NoC, process variation, EDP.  
I. INTRODUCTION 
Three-dimensional (3D) integrated circuits (ICs) have been 
shown to enable the design of high-performance and 
energy-efficient systems [1]. In particular, the network-on-
chip (NoC) can heavily benefit from 3D integration as the 
communication backbone of manycore systems. By taking 
advantage of a third dimension, 3D NoCs provide a scalable 
communication fabric with lower hop-count, lower energy, and 
higher performance compared to their 2D counterparts [2]. 
Modern 3D integration processes have widely adopted 
through-silicon-via (TSV) technology to connect planar dies 
together. However, there are several significant challenges to 
TSV-based 3D integration. First, TSVs require additional 
fabrication steps like creating landing pads, wafer-thinning, and 
bonding [3]. These fabrication and packaging related 
challenges lead to lower yield rates and higher production costs 
for TSV-based designs [4]. Secondly, TSVs require a minimum 
keep-out-zone (KOZ) to reduce stress and coupling noise, 
introducing additional area overheads while undermining 
achievable integration density [5]. Thirdly, even though TSV-
based vertical links can improve communication in 3D NoCs, 
they may fail due to crosstalk and electromigration [6]. 
 
The manuscript was submitted on June 10, 2019. This work was supported 
in part by the US National Science Foundation (NSF) grants CNS-1564014, 
CCF 1514269, and USA Army Research Office grant W911NF-17-1-0485. 
S. Musavvir, A. Chatterjee, D.H. Kim, and P.P. Pande are with the School 
of Electrical Engineering & Computer Science, Washington State University, 
 
Recently, monolithic 3D (M3D) integration has been 
proposed as an alternative to TSV-based designs. In M3D 
designs, multiple tiers are processed sequentially on the same 
die [7] and monolithic inter-tier vias (MIVs) are used as vertical 
links instead of TSVs. The physical dimensions of MIVs 
(~50nm × 100nm) are several orders of magnitude smaller than 
TSVs (1-3μm × 10-30μm) and are comparable to standard 
copper vias [8]. Similarly, the contact dimensions of M3D are 
much smaller (~100nm [9]) while TSV-based systems require 
contacts of 2-5μm. This allows us to achieve nanoscale contact 
pitch using M3D and attain the true benefit of vertical system 
integration. By facilitating nanoscale pitch, M3D enables us to 
examine gate- and block-level partitioning in circuits [7]. As a 
result, M3D offers much higher integration density and large 
reductions in total wire length over TSV-based counterparts. In 
addition, the direct wafer bonding technique in M3D achieves 
higher yields and lower costs compared to TSV-based 
integration [7], [10]. 
Naturally, NoC architectures can exploit the benefits of gate-
/block-level partitioning in M3D integration by fabricating 
routers that span multiple tiers. In a recent study, M3D-enabled 
NoCs were shown to achieve 28% better energy efficiency 
compared to their TSV-based counterparts [11]. However, the 
investigations in [11] do not consider any M3D fabrication-
related challenges or the benefit of lower interconnect 
capacitance from the reduced wire lengths in M3D designs [12].  
While M3D architectures offer significant design flexibility 
and better energy efficiency compared to TSV-based designs, 
there are technology- and fabrication-related challenges that 
need to be addressed. M3D’s sequential integration requires: 1) 
a low-temperature top-tier annealing process to prevent 
degradation in bottom-tier transistors [13]; and 2) temperature-
resilient tungsten interconnects in the bottom tier to withstand 
high top-tier fabrication temperatures [7]. Unfortunately, these 
requirements degrade the transistors in the top tier and the 
interconnects in the bottom tier. If these process-related 
effects are not considered, performance and energy efficiency 
are likely to be overestimated at design time. 
Pullman, WA 99163 USA (e-mail: shouvik.musavvir@wsu.edu; 
anwesha.chatterjee@wsu.edu; daehyun.kim@wsu.edu; pande@wsu.edu).  
R.G. Kim is with the Department of Electrical and Computer Engineering, 
Colorado State University, Fort Collins, CO 80524 USA (e-mail: 
Ryan.G.Kim@colostate.edu). 
Shouvik Musavvir, Student Member, IEEE, Anwesha Chatterjee, Student Member, IEEE, Ryan Gary 
Kim, Member, IEEE, Dae Hyun Kim, Member, IEEE, Partha Pratim Pande, Senior Member, IEEE 
 
In this paper, we demonstrate, for the first time, the 
importance of including these two M3D design requirements 
(tungsten interconnects in the bottom tier and a low-
temperature top-tier annealing process) and the resulting inter-
tier performance variation on the design, optimization, and 
efficiency of an M3D-enabled NoC architecture. The main 
contributions of this paper are: 
1) We demonstrate the necessity of including M3D process-
related effects in the design of NoCs. We show that a small-
world NoC designed with M3D process parameters in mind 
lowers the energy-delay-product (EDP) with respect to the 
process-oblivious counterpart on average by 27.4% across 
all benchmarks under consideration. 
2) We perform a detailed analysis of the effects of these M3D 
process-related parameters and the benefits of partitioning 
intra-router stages across tiers on the design of process-
aware NoCs (i.e., intra-router stage placement and inter-
router link distribution).  
3) We find that the distribution of intra-router pipelined 
stages and inter-router links among the M3D tiers is 
strongly dependent on the values of the process variation 
parameters. We also find that the distribution of the intra-
router stages and inter-router links depends on the 
benchmarks under consideration. All these observations 
show and justify the need for the M3D process-aware 3D 
NoC design and optimization we propose in this paper. 
The rest of the paper is organized as follows. The related 
work is discussed in Section II. Section III presents the 
challenges of M3D NoC design. In Section IV, we describe the 
problem setup and the proposed solution for the process 
variation induced performance variation. The experimental 
results are presented in Section V, and finally, we conclude the 
paper in Section VI. 
II. RELATED WORK 
The merits of M3D-based designs have been explored in 
several works [14], [15]. M3D circuits provide better power, 
performance, and area than their 2D counterparts. 
Motivated by the promise of monolithic integration, the 
CELONCEL design framework was proposed to explore the 
advantage of transistor/gate-level partitioning and cell-on-cell 
stacking design for M3D integration [15]. It was found that the 
footprint and wirelength of M3D-based designs are reduced by 
37.5% and 16.2% respectively over their planar counterparts at 
the 45 nm technology node. This results in a 6.2% reduction of 
overall delay for M3D designs for the same technology node. 
Moreover, performance improves for more advanced 
technology nodes such as 22 nm [16], 14 nm [8], and 
7 nm [17]. The effects of the number of planar tiers, tier-level 
partitioning, and MIV insertion methodology on the 
performance of M3D-based ICs were analyzed in [18]. As the 
number of MIVs in the design increases, the power saving 
improves as well. M3D systems enabled by nanotechnologies 
(N3XT) are proposed in [19]. N3XT employs recent 
nanotechnologies such as carbon nanotubes and M3D 
integration and achieves high performance and energy 
efficiency. 
Several works address the application of M3D technology 
to different types of circuits, e.g., 3D FPGA [20], 3D 
DRAM [21], and 3D SRAM [22]. It is demonstrated that by 
using M3D technology, we can reduce the total area, path delay, 
and power consumption of the circuit. Similarly, the authors 
in [11] explored the design space of 3D NoCs and demonstrated 
the efficacy of M3D integration. However, all these studies 
considered ideal characteristics for the transistors and 
interconnects (i.e., uniform delay and power consumption) 
across different tiers in the M3D ICs and ignored the effects of 
inter-tier process variation during design and evaluation.  
The authors in [23] examined the effects of a realistic M3D 
fabrication process on the performance of M3D-enabled ICs. 
They found that the energy consumption and delay can increase 
significantly compared to an ideal M3D process. This study 
also evaluated the performance of M3D ICs that incorporate 
low-temperature annealing for top-tier transistors and tungsten 
interconnects in the bottom tier. Although M3D is a promising 
emerging interconnect technology, there is very little work on 
exploiting this to design a 3D NoC. The work in [11] 
incorporates router partitioning in the design of an M3D NoC 
but it does not consider the performance benefit of multitier 
router stages or inter-tier performance variation between tiers. 
Therefore, our aim is to create a design process that accounts for 
inter-tier process variation and to examine how including 
inter-tier process variation (or neglecting it) impacts the design and 
optimization of an M3D NoC. 
III. CHALLENGES OF M3D-ENABLED DESIGNS 
Although M3D enables higher integration density and better 
performance than TSV-based designs, fabrication challenges 
need to be addressed. For example, fabricating upper tiers 
becomes a major challenge due to the sequential tier synthesis 
in M3D integration and the very thin inter-tier dielectric between 
the tiers. If ion implantation and annealing during top-tier 
fabrication use the standard thermal budget (≈1050°C [7]), the 
high temperature can damage the underlying interconnects and 
transistors in the bottom tier. As a result, two techniques have 
been proposed to realize top-tier transistors without damaging 
lower tiers: solid phase epitaxy regrowth (SPER) [24] and laser 
annealing [25].  
In [7], it has been shown that the temperature must be kept below 
650°C to prevent any damage to the lower-tier transistors. Both 
techniques satisfy this limit: with SPER, the ion implantation step is 
done at temperatures as low as 600°C [24], and with laser annealing, 
upper-tier transistors can be fabricated while heating 
the bottom tier to only 500°C [25]. However, both procedures have 
disadvantages. Transistors created using SPER have three times 
higher source-drain resistance compared to conventional 
transistors [24]. Similarly, transistors fabricated using laser-
based annealing have 16-28% lower on-current [13]. As a 
result, both processes introduce performance degradation for 
the top-tier transistors.  
Unfortunately, although these temperatures are acceptable for 
lower-tier transistors, neither SPER nor laser annealing 
reduces the temperature enough to permit copper back-end-of-
line (BEOL) interconnects, for which the temperature must not exceed 
400°C [7]. Therefore, a metal that can withstand higher 
temperatures, such as tungsten, is needed for BEOL 
interconnects in the bottom tier. However, tungsten has a higher 
resistivity than copper, which leads to inferior performance of 
bottom-tier interconnects [23]. 
These effects, degraded transistors in the upper tiers and 
higher resistivity interconnects in the lower tiers, can affect the 
design of NoCs. In particular, these inter-tier process variations 
cause non-uniform performance and energy consumption 
across the tiers. If these effects are not considered during design 
time, we may obtain overly optimistic latency and energy 
estimates and more importantly, sub-optimal NoC 
configurations. This motivates our work into formulating a new 
M3D-variation-aware NoC design problem and optimization. 
IV. PROBLEM FORMULATION AND OPTIMIZATION 
In a manycore architecture, our goal is to place cores, routers, 
and links efficiently and design the intra-router stages and inter-
router links optimally to achieve the best NoC performance. We 
begin by discussing the optimization goals for a standard 2D 
NoC. Then, we extend this discussion to include M3D process 
variations and design considerations. Fig. 1 shows the overall 
flow of process variation-aware design of M3D NoCs. Finally, 
we present a framework that utilizes these optimization goals to 
design realistic M3D NoC architectures. 
A. Latency and Energy Modeling of NoCs  
In this section, we define general models for NoC latency and 
energy that can be applied to both 2D and M3D systems. Later, 
we will provide specific details for 2D (Section IV.B) and M3D 
(Section IV.C) systems. We first model traffic-weighted NoC 
latency for a system with 𝑁 cores and 𝑁 routers as follows: 
$$\mathit{Latency} = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{r,u \in p(i,j)} \left( \sum_{m=1}^{S} t_{r,m} + l_u t_u \right) f_{i,j} \tag{1}$$
$p(i,j)$ gives the path between cores $i$ and $j$ (routers and links), 
where $r$ and $u$ are a router and a link along that path, 
respectively. The parameter $t_{r,m}$ is the intra-router stage delay 
for the $m^{th}$ router stage of router $r$ with $S$ router stages. It is 
important to note that $t_{r,m}$ depends on the number of ports and 
virtual channels associated with the particular router. The 
parameters $l_u$ and $t_u$ are the length of the interconnect and unit 
length delay of the inter-router link $u$, respectively. Lastly, $f_{i,j}$ 
is the frequency of interaction between cores $i$ and $j$. The 
latency captures the weighted sum of communication cost 
between every pair of cores. 
 Similarly, we model traffic-weighted NoC energy as follows: 
$$\mathit{Energy} = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{r,u \in p(i,j)} \left( \sum_{m=1}^{S} e_{r,m} + l_u e_u \right) f_{i,j} \tag{2}$$
Here, for the network path between any two cores $i$ and $j$, $e_{r,m}$ 
is the intra-router stage energy for the $m^{th}$ router stage of 
router $r$ with $S$ router stages. The parameter $e_u$ is the unit length 
energy of the inter-router link $u$.  
In this work, our goal is to design energy-efficient and high-
performance NoCs by minimizing NoC latency and energy 
simultaneously. Therefore, we need a unified optimization 
metric for NoC latency and energy. To consider the effect of 
both network latency and energy together, we use EDP as the 
relevant performance metric for NoC optimization. Using Eqs. 
(1) and (2), we can represent the EDP for NoC designs as 
follows: 
$$\mathit{EDP} = \mathit{Energy} \cdot \mathit{Latency} \tag{3}$$
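As an illustration of how Eqs. (1)-(3) combine, the following Python sketch evaluates the traffic-weighted latency, energy, and EDP for a set of precomputed paths. It is a minimal sketch under stated assumptions, not the evaluation code used in this work: the function name `noc_edp`, the dictionary-based data layout, and the reading of the inner sum as "all routers plus all links on the path" are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical data layout (an assumption for illustration):
# paths[(i, j)] = (router ids on p(i, j), link ids on p(i, j))
Path = Tuple[List[int], List[int]]


def noc_edp(
    paths: Dict[Tuple[int, int], Path],
    traffic: Dict[Tuple[int, int], float],      # f_{i,j}
    stage_delay: Dict[int, Sequence[float]],    # t_{r,m} for m = 1..S
    stage_energy: Dict[int, Sequence[float]],   # e_{r,m} for m = 1..S
    link_length: Dict[int, float],              # l_u
    link_unit_delay: Dict[int, float],          # t_u
    link_unit_energy: Dict[int, float],         # e_u
) -> Tuple[float, float, float]:
    """Return (latency, energy, EDP) following Eqs. (1)-(3)."""
    latency = 0.0
    energy = 0.0
    for (i, j), (routers, links) in paths.items():
        f_ij = traffic.get((i, j), 0.0)
        # Router contribution: all S stages of every router on the path.
        path_delay = sum(sum(stage_delay[r]) for r in routers)
        path_energy = sum(sum(stage_energy[r]) for r in routers)
        # Link contribution: length times per-unit-length delay or energy.
        path_delay += sum(link_length[u] * link_unit_delay[u] for u in links)
        path_energy += sum(link_length[u] * link_unit_energy[u] for u in links)
        latency += f_ij * path_delay
        energy += f_ij * path_energy
    return latency, energy, latency * energy


# Toy example: two three-stage routers joined by one link of length 2.
lat, en, edp = noc_edp(
    paths={(0, 1): ([0, 1], [0])},
    traffic={(0, 1): 2.0},
    stage_delay={0: [1.0, 2.0, 3.0], 1: [1.0, 2.0, 3.0]},
    stage_energy={0: [0.5, 0.5, 1.0], 1: [0.5, 0.5, 1.0]},
    link_length={0: 2.0},
    link_unit_delay={0: 0.5},
    link_unit_energy={0: 0.25},
)
print(lat, en, edp)
```

Keeping the per-stage delays and energies as inputs means the same routine can be reused unchanged once tier-dependent values are substituted for the 2D ones.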
B. 2D Router Delay and Energy Models 
To determine the effects of M3D integration on the 
performance of an NoC, we first need to adopt an appropriate 
router model. In this work, we follow the virtual-channel router 
model proposed in [26]. However, it should be noted that any 
other router model can be adopted for this analysis (Eqs. (1) and (2) 
are general in the number of stages). The virtual-channel router 
has three pipelined stages, viz., virtual channel allocator, switch 
allocator, and crossbar traversal [26].  
The delay of each stage depends on the number of router 
ports (𝑝), flit size (𝑤), and the number of virtual channels (𝑣). 
All these parameters in turn depend on the adopted NoC 
architecture. Their relationship is given in Table I [26]. For a 
regular NoC architecture like mesh, the delay of a particular 
pipelined stage will be the same in each router since each router 
has the same number of ports (except the routers at the edge). 
However, small-world networks have already been shown to 
achieve significantly lower latency and energy consumption 
over mesh-based NoCs [11]. For these irregular small-world 
architectures, each router may have a different number of ports. 
Hence, the delay of each pipelined stage varies depending on the 
router configuration. Using the model given in Table I, we 
determine the delay ($t_{r,m}$) of each pipelined stage ($m$) in every 
router ($r$) in terms of fanout-of-four (FO4) delay. 
 
Fig. 1. (a) Illustration of the process variation-aware design methodology and an example small-world network-enabled 16-node M3D NoC architecture. The legend 
indicates different components of the NoC. (b) Illustration of the implementation of the STAGE algorithm for M3D NoC optimization. 
TABLE I 
PARAMETERIZED DELAY EQUATIONS FOR INTRA-ROUTER STAGES [26] 
Intra-router stage (m)           Delay (FO4) 
Virtual channel allocator (1)    33 log_4(pv) + 125/6 
Switch allocator (2)             28 log_4(p) + 35/2 
Crossbar traversal (3)           9 log_8(w⌊p/2⌋) + 6⌈log_2(p)⌉ + 6 
 
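The delay expressions of Table I can be evaluated directly for a given router configuration. The short Python sketch below does so; the function name `stage_delays_fo4` and the dictionary keys are illustrative assumptions, while the formulas are transcribed from Table I.

```python
import math


def stage_delays_fo4(p: int, v: int, w: int) -> dict:
    """Per-stage router delay in FO4 units, transcribed from Table I.

    p: number of router ports, v: virtual channels per port, w: flit width in bits.
    """
    return {
        "virtual_channel_allocator": 33 * math.log(p * v, 4) + 125 / 6,
        "switch_allocator": 28 * math.log(p, 4) + 35 / 2,
        "crossbar_traversal": (
            9 * math.log(w * (p // 2), 8) + 6 * math.ceil(math.log2(p)) + 6
        ),
    }


# Example: a 5-port router with 4 virtual channels and 32-bit flits.
delays = stage_delays_fo4(p=5, v=4, w=32)
for stage, fo4 in delays.items():
    print(f"{stage}: {fo4:.2f} FO4")
```

Because every expression grows with the port count, routers with more ports in an irregular topology have longer stage delays than low-degree routers.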
Energy consumption of the routers ($e_{r,m}$) depends on the 
capacitance of the logic cells and the interconnect of each 
stage [27]. For ease of notation, we will denote $t_{r,m}$ and $e_{r,m}$ as 
$t_{2D-r,m}$ and $e_{2D-r,m}$, respectively, for 2D planar designs. Also, 
since 2D systems use copper interconnects, $t_u = t_{Cu}$ and $e_u = e_{Cu}$ 
are the unit length delay and energy of transferring data 
through standard copper links, respectively. Using these 
models, we examine the effects of M3D integration on NoC 
design.  
C. Process Variation-Aware M3D NoC Design 
Although M3D integration allows us to build multitier 
routers, process variation from the M3D fabrication process 
causes the NoC to suffer from latency and energy penalties. As 
discussed in Section III, any router logic components in the top 
tier and inter-router links in the bottom tier will experience 
slowdowns due to process-related transistor degradation and the 
higher resistivity of tungsten, respectively. These effects could 
dominate the benefits obtained from tier partitioning, which 
would reduce the overall performance gain compared to its 2D 
counterpart. Therefore, in addition to placing the links between 
routers, it is our objective to choose the tiers for each router 
stage and link of the M3D-based NoC to minimize the EDP. 
Hence, we consider the following design choices separately for 
each router stage and link (Fig. 1(a)). Please note that this 
is a high-level discussion; the methods for determining 
delay, capacitance, etc. are discussed later in the results 
section (Section V.C). 
1) Bottom-tier-only intra-router stage (BT): The performance 
of the router logic in BT is the same as the 2D counterpart 
because there is no degradation in the bottom-tier 
transistors. Therefore, $t_{r,m} = t_{2D-r,m}$ and $e_{r,m} = e_{2D-r,m}$.  
2) Top-tier-only intra-router stage (TT): As the transistors in 
the top tier are degraded, the FO4 delay of the router logic 
in TT will be larger than that in BT. To determine the delay 
of the router stage in TT, we need to determine the FO4 
delay in TT in the presence of transistor degradation. Hence, 
the parameters $t_{r,m}$ and $e_{r,m}$ will vary depending on the 
degradation of the transistor on-current ($\alpha$). The intra-
router delay for the router stage can be expressed as 
follows: 
$$t_{r,m} = \frac{\mathit{FO4}_{TT}}{\mathit{FO4}_{2D}} \cdot t_{2D-r,m} \tag{4}$$
where $\mathit{FO4}_{TT}$ is the degraded FO4 delay in the presence of $\alpha$ 
and $\mathit{FO4}_{2D}$ is the ideal FO4 delay for 2D design (Table I).  
Transistor degradation will also increase the logic 
capacitance in the TT stage [28]. As energy is proportional to 
the capacitance in the router stage, the stage energy is 
expressed as follows: 
$$e_{r,m} = \frac{C_{TT,m}}{C_{2D,m}} \cdot e_{2D-r,m} \tag{5}$$
where $C_{TT,m}$ and $C_{2D,m}$ are the total capacitance of the TT and 
2D stage, respectively. $C_{TT,m}$ comprises $C_{2D,m}$ and the 
incremental logic capacitance due to $\alpha$. It should be noted 
that the interconnect capacitance of the intra-router stage 
remains the same as the 2D counterpart. 
3) Multitier intra-router stage (MT): By using block- and 
gate-level M3D integration, the interconnect length can be 
reduced to a factor of $1/\sqrt{T}$ for $T$-tier systems [12]. 
Therefore, although the change of logic capacitance is 
insignificant, the interconnect capacitance decreases by 
approximately $1 - 1/\sqrt{T}$ [12]. This results in an 
improvement of FO4 delay (denoted as $\gamma$) for a multitier 
design compared to the single-tier counterpart. In this 
work, for simplicity, we assume that the MT stages are 
equally distributed among the bottom and top tiers [29]. 
Hence, the delay and energy for the router stages are: 
$$t_{r,m} = \gamma \cdot \left( \frac{1}{2} \cdot t_{2D-r,m} + \frac{1}{2} \cdot \frac{\mathit{FO4}_{TT}}{\mathit{FO4}_{2D}} \cdot t_{2D-r,m} \right) \tag{6}$$
$$e_{r,m} = \frac{C_{MT,m}}{C_{2D,m}} \cdot e_{2D-r,m} \tag{7}$$
Here, $C_{MT,m}$ is the total capacitance of the MT stage. The 
capacitance of the top-tier logic of the MT stage will increase 
as described for TT, whereas the interconnect capacitance 
will decrease compared to the 2D counterpart.  
4) Inter-Router Link Placement: The delay and energy 
incurred by the inter-router links depend on their tier 
placement. The inter-router links in the top tier use copper 
and do not suffer from any performance degradation (i.e., 
$t_u = t_{Cu}$ and $e_u = e_{Cu}$). On the other hand, the bottom-tier 
tungsten interconnects exhibit higher resistance than 
their copper-based counterparts. Hence, the inter-router 
links in the bottom tier suffer from higher delay and energy 
consumption: 
$$t_u = t_W, \qquad e_u = e_W \tag{8}$$
where $t_W$ and $e_W$ are the delay and energy of tungsten links 
per unit length, respectively. Since tungsten has higher 
resistivity than copper, $t_W > t_{Cu}$ and $e_W > e_{Cu}$. We define 
the interconnect slowdown factor for tungsten 
interconnects as $\beta$, where $\beta$ is the degradation in the 
propagation delay of the tungsten interconnects compared 
to their copper counterparts. As the resistivity of nanoscale 
tungsten wires changes with their dimensions and 
geometry [23], $\beta$ will vary too, which is captured in 
Eqs. (1) and (2) by $t_W$ and $e_W$, respectively. Since these 
inter-router links are connected to the input and output 
stages of routers, a link and its respective router stages must 
be on the same tier. Therefore, during optimization (see 
Section IV.D), we require that top-tier links be 
connected to TT or MT router stages and bottom-tier links 
be connected to BT or MT router stages only.  
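The four design choices above can be summarized in a few lines of Python. This is a minimal sketch under stated assumptions rather than the implementation used in this work: all function and parameter names are hypothetical, the FO4 and capacitance ratios are taken as inputs, and `gamma_scale` is the multiplicative factor exactly as it enters Eq. (6), with its numerical mapping from a reported improvement left to the caller.

```python
import math


def interconnect_length_scale(tiers: int) -> float:
    """Interconnect length factor 1/sqrt(T) for a T-tier design."""
    return 1.0 / math.sqrt(tiers)


def stage_delay(t_2d: float, placement: str, fo4_ratio_tt: float,
                gamma_scale: float = 1.0) -> float:
    """Stage delay for BT, TT (Eq. 4), or MT (Eq. 6) placement.

    fo4_ratio_tt is FO4_TT / FO4_2D; gamma_scale multiplies the MT delay.
    """
    if placement == "BT":
        return t_2d
    if placement == "TT":
        return fo4_ratio_tt * t_2d
    if placement == "MT":
        # Half of the stage sits in each tier (equal split assumption).
        return gamma_scale * (0.5 * t_2d + 0.5 * fo4_ratio_tt * t_2d)
    raise ValueError(f"unknown placement: {placement}")


def stage_energy(e_2d: float, cap_ratio: float) -> float:
    """Stage energy scaled by a capacitance ratio (Eqs. 5 and 7); use 1.0 for BT."""
    return cap_ratio * e_2d


def link_placement_allowed(link_tier: str, stage_placement: str) -> bool:
    """Top-tier links attach to TT or MT stages; bottom-tier links to BT or MT."""
    allowed = {"top": {"TT", "MT"}, "bottom": {"BT", "MT"}}
    return stage_placement in allowed[link_tier]


# Example with hypothetical numbers: a 100-FO4 stage and an 18% slower top tier.
for choice in ("BT", "TT", "MT"):
    print(choice, stage_delay(100.0, choice, fo4_ratio_tt=1.18, gamma_scale=0.9))
print(interconnect_length_scale(2))
```

For a two-tier design, `interconnect_length_scale(2)` is about 0.707, which corresponds to an interconnect capacitance reduction of roughly 29%.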
D. M3D NoC Design Optimization 
Since each router stage can be on different tiers (TT, BT, or 
MT) with either top- or bottom-tier links, the design space 
complexity increases dramatically from a 2D or TSV-based 
NoC to its M3D counterpart.   
Such large search spaces make it difficult to use traditional 
optimization methods that rely on stochastic local exploration 
to reach the minima, e.g., simulated annealing (SA) or genetic 
algorithms (GA). Therefore, intelligent search methods are 
necessary to reduce the run time and enhance scalability. We 
apply a machine-learning-based search technique, STAGE [30], 
to guide the search process. Prior works have already shown the 
efficacy of STAGE over SA and GA for different NoC 
architecture optimizations [11], [31]. 
STAGE works by utilizing past search trajectories to find 
better starting points. To accomplish this, STAGE iterates over 
two steps: (1) Hill climbing (or some other local search) to 
optimize 𝑂𝑏𝑗, the primary design objective; and (2) Hill 
climbing to optimize 𝐸, a learned evaluation function that 
predicts how promising a design is as a starting point for Step 1. 
We show these steps in Fig. 1(b).  
1) STAGE Step 1: Similar to simple hill climbing or SA, the 
first step attempts to minimize the target function $Obj$ by 
making small steps (using a perturbation function $S$) and 
accepting a new design only if it reduces $Obj$, i.e., simple hill 
climbing. STAGE keeps track of all accepted designs in 
this search trajectory ($d_0, d_1, \ldots, d_T$) and adds each design $d$ 
to a dataset $\mathcal{D}$ as a pair ($\phi(d)$, $Obj(d_T)$). Here, $\phi(\cdot)$ is a 
function that extracts pertinent features from the design.  
2) STAGE Step 2: Using a regression learner, $R$, Step 2 learns 
the evaluation function $E(\phi(d)) = R(\mathcal{D})$. This evaluation 
function tries to predict the EDP of the final design of 
Step 1 starting from a particular design $d$. Ultimately, $E$ 
can be used to predict the next best starting point for the 
search. Starting from the final design in Step 1, we use 
simple hill climbing to minimize $E$. The resulting design is provided 
as the starting point to the next run of Step 1.  
3) STAGE Iteration: STAGE iterates over Step 1 and Step 2 
until the maximum number of iterations allowed ($Iter_{max}$) 
has been reached. After each iteration, we accumulate more 
data points in $\mathcal{D}$ and learn a more accurate $E$, which results 
in the search finding better designs. Our final output is the 
best design $d^*$ with minimum EDP.  
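The three steps above can be sketched as a short, self-contained Python loop. This is an illustrative sketch only: the toy objective, the bit-flip perturbation, and the single scalar feature are hypothetical, and a one-dimensional least-squares fit stands in for the regression learner $R$ so that the example needs no external libraries.

```python
import random


def hill_climb(start, score, perturb, rng, max_steps=200):
    """Accept a perturbed design only if it lowers `score`.

    Returns the final design and the list of accepted designs (the trajectory).
    """
    current, trajectory = start, [start]
    for _ in range(max_steps):
        candidate = perturb(current, rng)
        if score(candidate) < score(current):
            current = candidate
            trajectory.append(current)
    return current, trajectory


def fit_linear(samples):
    """Least-squares fit y ~ a*x + b on (x, y) pairs; stand-in for the learner R."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0.0:
        return lambda x: mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    return lambda x: slope * (x - mean_x) + mean_y


def stage_search(start, obj, phi, perturb, iter_max=5, seed=0):
    """Alternate between minimizing Obj (Step 1) and the learned E (Step 2)."""
    rng = random.Random(seed)
    dataset = []                      # pairs (phi(d), Obj(d_T))
    best, current = start, start
    for _ in range(iter_max):
        # Step 1: hill climbing on the real objective.
        final, trajectory = hill_climb(current, obj, perturb, rng)
        final_obj = obj(final)
        dataset.extend((phi(d), final_obj) for d in trajectory)
        if final_obj < obj(best):
            best = final
        # Step 2: learn E from the dataset, then hill climb on E
        # to pick the next starting point.
        evaluate = fit_linear(dataset)
        current, _ = hill_climb(final, lambda d: evaluate(phi(d)), perturb, rng)
    return best


def flip_one(design, rng):
    """Toy perturbation: flip one randomly chosen bit of the design."""
    k = rng.randrange(len(design))
    return design[:k] + (1 - design[k],) + design[k + 1:]


def toy_obj(design):
    """Hypothetical cost standing in for EDP."""
    return (sum(design) - 3) ** 2 + sum(a == b for a, b in zip(design, design[1:]))


start = (0,) * 8
best = stage_search(start, toy_obj, phi=lambda d: float(sum(d)), perturb=flip_one)
print(best, toy_obj(best))
```

The value of Step 2 lies in the restart: instead of restarting from a random design, the next local search begins where the learned function predicts a good final outcome.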
In this work, we consider two different types of M3D NoCs: 
a process variation-oblivious M3D NoC that uses MT for all 
router stages and uniformly distributes the links among 
layers [29], and our proposed process variation-aware M3D 
NoC. To utilize STAGE for M3D NoC design, we use the 
following definitions. For the perturbation function, S, we 
consider two types of perturbations to move to neighboring 
designs in all M3D architectures: (i) swapping the position of 
two cores; and (ii) replacing a link between a pair of routers with 
another of the same length between two other routers. For process-
aware M3D designs, we use two additional perturbations 
where: (iii) a router stage is switched among the three different 
stage types (TT, BT, and MT); or (iv) a link is switched between 
the top and bottom tier. For each call of the perturbation 
function S, the function chooses one among the available 
perturbations described above. We show these perturbation 
choices in Fig. 1(b). 
The feature selection (𝜙(⋅)) is an important aspect of STAGE 
as relevant features allow us to learn an accurate evaluation 
function 𝐸. We use random forest regression to learn 𝐸. Since 
the EDP depends directly on the average network hops and 
traffic-weighted hop-count, we take account of these features. 
In addition, the clustering coefficient (𝐶௖) is a measure of a 
router’s connectivity with neighbors that can indicate better 
local communication. Hence, we consider these features, i.e., 
average network hops, traffic-weighted hop-count, and the 
clustering coefficient. For process-aware M3D NoC designs we 
use two additional features: bottom-tier inter-router and top-tier 
intra-router performance penalty to account for the process 
variation effect on design.  
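The three topology features can be computed from the router graph and the traffic matrix as sketched below in Python. The sketch rests on assumptions that the text does not fix: the network is connected, the traffic-weighted hop-count is taken as the sum of $f_{i,j}$ times the hop distance, and the clustering coefficient is the average local clustering coefficient. The two process-aware penalty features are not covered here.

```python
from collections import deque
from typing import Dict, Set, Tuple


def bfs_hops(adj: Dict[int, Set[int]], src: int) -> Dict[int, int]:
    """Hop distance from src to every reachable router (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def clustering_coefficient(adj: Dict[int, Set[int]]) -> float:
    """Average local clustering coefficient over all routers."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # routers with fewer than two neighbors contribute 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def extract_features(adj: Dict[int, Set[int]],
                     traffic: Dict[Tuple[int, int], float]):
    """Return (average hops, traffic-weighted hop-count, clustering coefficient).

    Assumes a connected, undirected router graph given as adjacency sets.
    """
    hops = {s: bfs_hops(adj, s) for s in adj}
    pairs = [(i, j) for i in adj for j in adj if i != j]
    avg_hops = sum(hops[i][j] for i, j in pairs) / len(pairs)
    weighted_hops = sum(f * hops[i][j] for (i, j), f in traffic.items())
    return avg_hops, weighted_hops, clustering_coefficient(adj)


# Toy network: a triangle of routers 0-1-2 with router 3 attached to router 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
traffic = {(0, 3): 2.0, (1, 2): 1.0}
print(extract_features(adj, traffic))
```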
V. RESULTS AND ANALYSIS 
In this section, we describe the experimental setup followed 
by the design considerations and architectural adjustments for 
our process-aware architecture. Then, we present a detailed 
EDP analysis of the M3D NoCs with process variation. 
A. Experimental Setup  
In this work, we consider a 64-core system where each core 
is associated with a dedicated router. We use gem5 [32], a full-
system simulator, to obtain detailed processor- and network-
level information. Using gem5's full-system mode, we 
simulate x86 cores running Linux. We use the 
MOESI_CMP_directory cache coherence protocol. Each core 
has private 64 KB L1 instruction and data caches and uses a 
shared 8 MB L2 cache. We consider four SPLASH-2 
benchmarks (FFT, RADIX, LU, and WATER) [33] and four 
PARSEC benchmarks (DEDUP, VIPS, CANNEAL (CAN), 
and FLUIDANIMATE (FLUID)) [34]. These benchmarks are 
selected because they vary widely in communication and 
computation patterns.  
B. NoC Design and Baseline  
For each router, we use the three-stage model shown in 
Table I [26]. Each router has four virtual channels (𝑣=4) per 
port. Each packet contains six flits and each flit consists of 
32 bits (𝑤=32). These routers are synthesized from an RTL-
level design using a TSMC 28 nm CMOS process in 
Synopsys™ Design Vision. The MIV physical dimensions are 
50 nm × 50 nm × 44 nm (diameter × depth × pitch) [8]. It 
should be noted that, as the dimensions of an MIV are comparable 
to those of a local via, using standard 2D cells in Synopsys to 
synthesize the NoC routers does not add any 
overhead [8]. Energy consumption of the NoC links is 
determined using HSPICE simulations, taking their lengths and 
resistivity (the bottom tier uses tungsten while the top tier uses 
copper) into consideration.  
For the NoC topology, we consider two cases: small-world 
network-enabled 3D NoC (SWNoC) and traditional 3D mesh. 
We create the SWNoC architecture by following a power-law-
based link distribution as elaborated in [31]. It should be noted 
that our prior works have already demonstrated that 
SWNoC outperforms other regular and application-specific 
3D NoC architectures [11], [31]. For 3D mesh, we use standard 
XYZ-dimension-order routing. Since small-world networks are 
irregular topologies, we adopt the topology-agnostic adaptive 
layered shortest path routing (ALASH) for SWNoCs [35]. 
C. M3D Transistor and Interconnect Characteristics 
In this work, we consider a two-tier M3D process. In order 
to deliver a thorough analysis of M3D NoCs, we must first 
provide a practical range for the degradation of transistor on-
current (𝛼), slowdown factor due to tungsten interconnects (𝛽), 
and M3D improvement in FO4 delay (𝛾) parameters (see 
Section IV.C) and determine their effects on the design and 
optimization of M3D NoCs.  
For $\alpha$, prior work has reported a maximum of 20% 
degradation in the top-tier transistors [36]. Hence, we 
consider $\alpha \in [0.05, 0.20]$ in 0.05 increments. The parameter $\alpha$ 
affects $\mathit{FO4}_{TT}$, $C_{TT,m}$, and $C_{MT,m}$, and hence, the energy and 
delay of TT and MT intra-router stages. We used Cadence 
Virtuoso to determine $\mathit{FO4}_{TT}$ in the presence of $\alpha$. Fig. 2 shows the 
ratio of $\mathit{FO4}_{TT}$ to $\mathit{FO4}_{2D}$ for different values of $\alpha$. Here, $\alpha = 0$ 
is the case for nominal 2D transistors. For every 5% increment 
in $\alpha$, the FO4 delay degrades by approximately 9%. The 
increment of top-tier logic capacitance for TT ($C_{TT,m}$ in Eq. (5)) 
and MT ($C_{MT,m}$ in Eq. (7)) is estimated from the relationship 
between transistor on-current and capacitance presented in [28]. 
In Fig. 3, the normalized capacitance for transistors is plotted 
for different values of $\alpha$. Here, the capacitance is normalized 
with respect to a nominal 2D transistor ($\alpha = 0$). As expected, 
the capacitance increases almost linearly with $\alpha$. 
For 𝛽, the impact of tungsten interconnects in the bottom tier 
depends on the metal layer and technology node [23]. We 
consider tungsten interconnects in metal layers 3 through 10 for 
a TSMC 28nm process. For each layer we characterize the delay 
of the tungsten interconnect by changing the resistivity [23] in 
Cadence Virtuoso. Our experimental analysis shows that 
tungsten interconnects introduce 10-30% additional 
propagation delay per unit length. Hence, we use 𝛽 = [0.10, 0.30] 
in 0.10 increments. 
For 𝛾, M3D designs achieve up to 25% improvement in clock 
frequency compared to their 2D counterpart [37]. Thus, we use 
two values for 𝛾 (0.10 and 0.20) in our work. To determine the 
MT stage energy in Eq. (7), we find $C_{MT,m}$, which must consider 
the increased logic capacitance at the top tier due to 𝛼 [28] and 
the lower interconnect capacitance due to tier partitioning. 
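The parameter ranges above define the design space explored in the rest of this section. A minimal sketch of enumerating that sweep is given below; the constant and function names are illustrative and not taken from the authors' tooling.

```python
from itertools import product

ALPHAS = (0.05, 0.10, 0.15, 0.20)  # top-tier on-current degradation
BETAS = (0.10, 0.20, 0.30)         # bottom-tier tungsten interconnect slowdown
GAMMAS = (0.10, 0.20)              # M3D improvement in FO4 delay


def parameter_grid():
    """Return every (alpha, beta, gamma) combination in the sweep."""
    return list(product(ALPHAS, BETAS, GAMMAS))


grid = parameter_grid()
print(len(grid), "configurations")
```

The sweep contains 4 x 3 x 2 = 24 configurations, which matches the 24 groups on the horizontal axes of the tier-distribution figures.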
D. Router Stage and Link Distribution with Process Variation 
Under ideal conditions (𝛼=0, 𝛽=0), naturally, all intra-router 
stages in an M3D NoC will be multitier and the links will be 
placed evenly in both tiers. Unfortunately, as discussed in 
Section IV.C, top-tier transistor degradation (𝛼 > 0) and 
bottom-tier interconnect degradation (𝛽 > 0) cause significant 
slowdowns in the M3D NoC. Moreover, these slowdowns are 
not uniform across different tiers. In addition to these non-
uniform effects of M3D process variation, the delay and 
capacitance of each intra-router stage varies with the number of 
bits per flit, ports, and VCs in the router (see Table I). Thus, to 
minimize the overall EDP for a given system configuration 
(number of bits per flit, number of ports, number of VCs, packet 
size, 𝛼, 𝛽, and 𝛾), we should optimize the distribution of the 
intra-router stages and inter-router links. We present the 
analysis for SWNoCs followed by mesh based NoCs. 
1) M3D SWNoC Architecture Optimization  
In SWNoCs, the distribution of intra-router stages and links 
depends on the process variation effects and traffic distribution. 
Across a range of 𝛼, 𝛽, and 𝛾, Fig. 4 shows the tier-wise 
distribution of all intra-router stages optimized for the CAN 
benchmark as an example. Since there is a trade-off between 
BT and MT in terms of link versus logic degradation, we see 
different distributions of BT and MT. Although the logic in the 
top tier of an MT stage has longer delay and larger capacitance 
than the nominal values, its overall wirelength is shorter than 
that of a BT-only stage. Therefore, some of the router stages can benefit 
from tier partitioning depending on 𝛼, 𝛽, and 𝛾. For example, 
Fig. 4 shows that approximately 80% of the router stages are 
MT and 20% are BT when (𝛼, 𝛽, 𝛾) is (0.05, 0.1, 0.1). However, 
if 𝛼 increases, more stages are chosen to be BT to avoid the 
intra-router performance penalty originating from the top-tier 
transistor degradation as explained in Eqs. (4) and (5). On the 
other hand, since 𝛾 improves the MT intra-router stage delay 
(Eq. (6)), there are more MT stages at higher values of 𝛾. 
Notably, the result shows that the SWNoCs do not have any TT 
stages. In MT, only half of the logic cells suffer from transistor 
degradation, whereas all logic cells in TT experience transistor 
slowdown. Moreover, the speedup due to multitier logic (𝛾) 
results in superior performance of MT compared to TT. As a 
result, the optimization always chooses MT over TT for all 
intra-router stages. 
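The preference for MT over TT can be illustrated with a toy delay model. The sketch assumes, hypothetically, that half of an MT stage's logic runs at the degraded top-tier speed, that all of a TT stage's logic does, and that 𝛾 scales the MT delay by (1 - 𝛾). It also assumes a linear FO4 fit of about 9% per 0.05 of 𝛼. These formulas are stand-ins for illustration and are not Eqs. (4) to (6).

```python
def fo4_ratio(alpha):
    # Assumed linear fit: about 9% slower FO4 per 0.05 of alpha.
    return 1.0 + 0.09 * (alpha / 0.05)


def tt_delay(alpha, nominal=1.0):
    # Hypothetical: every logic cell of a TT stage sits in the degraded top tier.
    return nominal * fo4_ratio(alpha)


def mt_delay(alpha, gamma, nominal=1.0):
    # Hypothetical: half the cells are degraded, and gamma speeds up the whole stage.
    blended = 0.5 * nominal + 0.5 * nominal * fo4_ratio(alpha)
    return (1.0 - gamma) * blended


for alpha in (0.05, 0.10, 0.15, 0.20):
    for gamma in (0.10, 0.20):
        assert mt_delay(alpha, gamma) < tt_delay(alpha)
print("MT is faster than TT at every swept (alpha, gamma) in this toy model")
```

In this simplified model, MT wins for any 𝛼 > 0 and 𝛾 ≥ 0, because it pays only half of the transistor penalty and additionally receives the multitier speedup.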
It should be noted that the distribution of different types of 
intra-router stages depends on their circuit composition. Since 
the crossbar stage is dominated by the interconnect 
capacitance [27] and interconnects are heavily reduced in MT 
(interconnect capacitance decreases by 29.3% for the two-tier 
system), the energy saving in the interconnects offsets the 
transistor slowdown. In Fig. 5, it can be seen that besides a few 
crossbar stages at low values of 𝛽, all crossbar stages are MT 
for every combination of 𝛼, 𝛽, and 𝛾.  

Fig. 2. $FO4_{TT}/FO4_{2D}$ for different values of 𝛼. [Plot: $FO4_{TT}/FO4_{2D}$ (y-axis, 0.8 to 1.4) versus 𝛼 (x-axis, 0.00 to 0.20).]
Fig. 3. Normalized transistor capacitance for different values of 𝛼. Capacitance is normalized with respect to the nominal transistor. [Plot: normalized capacitance (y-axis, 0.8 to 1.4) versus 𝛼 (x-axis, 0.00 to 0.20).]
Fig. 4. Tier distribution of all intra-router stages for SWNoC (CAN). [Plot: percentage of stages (BT, MT) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]
Fig. 5. Tier distribution of the crossbar stages for SWNoC (CAN). [Plot: percentage of stages (BT, MT) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]
As mentioned in Section IV.C, there is no effect of 𝛽 on the 
placement of the crossbar stage (it is not connected directly to 
an inter-router link). However, the switch allocator (SWA) and 
virtual channel allocator (VCA) stages are connected to the 
router ports, which in turn are connected to the inter-router 
links. As discussed in Section IV.C, the 𝛼 and 𝛾 parameters 
affect the intra-router stages and 𝛽 slows down the inter-router 
links in the bottom tier. Thus, all three parameters 𝛼, 𝛽, and 𝛾 
influence the distribution of the SWA and VCA stages and the 
inter-router links associated with them. Figs. 6 and 7 show the 
distribution of the SWA and VCA stages and of the inter-router links, 
respectively, for different values of 𝛼, 𝛽, and 𝛾 for the CAN 
benchmark. As 𝛽 increases, the number of BT stages decreases 
(Fig. 6) and more links are placed at the top tier (Fig. 7) to avoid 
the interconnect performance penalty. For the links in the top 
tier, the associated stages must become MT or TT (TT is never 
chosen since MT is always better as discussed earlier). So, in 
Fig. 6, the number of MT stages increases with the rise of 𝛽. 
Conversely, as 𝛼 increases, MT stages experience more delay 
and energy degradation. Hence, more stages (Fig. 6) and their 
respective links (Fig. 7) are placed at the bottom tier to avoid 
the transistor degradation. For 𝛾 =0.1, 95.9% of the stages and 
97.8% of the links are placed in the bottom tier when we 
consider the highest value of 𝛼 (𝛼 = 0.2) and the lowest value 
of 𝛽 (𝛽 = 0.1). Alternatively, all the intra-router stages are MT 
and all the links are placed in the top tier for the lowest value of 
𝛼 (𝛼 = 0.05) and the highest value of 𝛽 (𝛽 = 0.3). Overall, the 
router stages and the inter-router links are distributed to reduce 
the effects of 𝛼 and 𝛽 as much as possible. We can also observe 
the effect of 𝛾 in Figs. 6 and 7. On average (considering 
different 𝛼 and 𝛽), the numbers of MT stages and top-tier links 
increase by 9.6% and 8%, respectively, when 𝛾 increases from 
0.1 to 0.2. Hence, considering the effects of various process 
variation parameters, the NoC router stages can be placed on a 
single tier (BT) or can be distributed over multiple tiers (MT) 
to optimize EDP. 
In Fig. 8, the tier-wise distribution for the stages connected 
to a link of particular length is plotted for the CAN benchmark 
as an example. The link length is expressed in terms of 
Manhattan distance. As we can see, the placement of SWA and 
VCA stages is associated with the length of the links. The inter-
router traversal penalty at the bottom tier is proportional to the 
link length (Eq. (8)). Hence, the long-range links are placed 
mostly in the top tier to avoid the performance penalty (𝛽), while 
the shorter links, along with their respective SWA and VCA stages, 
are placed in the bottom tier. Here, the stages connected to 
longer links favor MT over BT. 
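Because the bottom-tier penalty scales with link length, a simple break-even check captures why long links migrate to the top tier. The sketch below is a hypothetical simplification: `unit_link_delay` and `mt_stage_penalty` are made-up normalized quantities, and the actual decision is made by the EDP optimization over Eq. (8) and the stage models rather than by this threshold rule.

```python
def bottom_tier_link_penalty(beta, length, unit_link_delay=1.0):
    # Extra delay from tungsten wires grows linearly with Manhattan link length.
    return beta * length * unit_link_delay


def prefer_top_tier(beta, length, mt_stage_penalty, unit_link_delay=1.0):
    # Move the link to the top tier (and its SWA/VCA stages to MT) only when the
    # wire penalty avoided exceeds the assumed cost of making those stages multitier.
    return bottom_tier_link_penalty(beta, length, unit_link_delay) > mt_stage_penalty


# Hypothetical numbers: MT stages cost 0.5 normalized delay units.
for length in range(1, 7):
    tier = "top" if prefer_top_tier(0.2, length, mt_stage_penalty=0.5) else "bottom"
    print(f"length={length}: {tier} tier")
```

With these illustrative numbers, links of length one and two stay in the bottom tier and longer links move to the top tier. Raising 𝛽 lowers the break-even length, which is consistent with the trend described for Fig. 8.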
We also found similar trends of the stage and link distribution 
in all the benchmarks. We plot the input/output stage (SWA and 
VCA) and link distribution of SWNoCs in Figs. 9 and 10 for 
representative values 𝛼=0.1 and 𝛾=0.1 with varying 𝛽. As 
previously discussed, there are zero TT stages, and the numbers 
of BT stages and bottom-tier links decrease with increasing 𝛽 
across all benchmarks. The placement of intra-router stages 
connected to links (SWA and VCA) and inter-router links 
varies depending on the traffic characteristics of the specific 
benchmark. To understand this benchmark dependence, we 
analyzed RADIX (high percentage of BT and bottom-tier links) 
and WATER (low percentage of BT and bottom-tier links), two 
benchmarks that exhibit distinct link distribution trends. In 
Fig. 11, we show the percentage of traffic exchanged between 
two routers separated by a particular Manhattan distance. For 
two routers separated by a Manhattan distance of one, the traffic 
exchanged is significantly higher for RADIX than for 
WATER (18.6% on average). Moreover, RADIX has almost no 
traffic between routers separated by a Manhattan distance 
greater than three. Since traffic in RADIX does not have to travel 
physically far, much of the RADIX traffic takes short-distance 
links. As we found in the analysis of Fig. 8, this influences the 
network to have short bottom-tier interconnects. On the other 
hand, WATER has more traffic that needs to travel farther, 
resulting in fewer bottom-tier links, especially for higher values of 𝛽. 
In fact, all the intra-router stages are BT and all the inter-router 
links are placed at the bottom tier for RADIX (𝛼 = 0.1, 𝛾 = 0.1, 
and 𝛽 = 0.1) as 77.6% of the total traffic is exchanged between 
routers separated by a Manhattan distance of one (Fig. 11). Hence, 
the link and stage distribution depends on the degree of process 
variation and the traffic pattern of the respective benchmark. 

Fig. 6. Tier distribution of the SWA and VCA stages for SWNoC (CAN). [Plot: percentage of stages (BT, MT) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]
Fig. 7. Tier distribution of links for SWNoC (CAN). [Plot: percentage of links (bottom tier, top tier) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]
Fig. 8. Tier distribution of the SWA and VCA stages connected to a link of particular length for SWNoC (CAN). [Plot: percentage of stages (BT, MT) versus link length (Manhattan distance, 1 to 6); panels: 𝛾=0.2, 𝛼=0.1, with 𝛽=0.1, 0.2, and 0.3.]
Fig. 9. Tier distribution of the SWA and VCA stages for SWNoC considering all benchmarks (𝛼=0.1, 𝛾=0.1). [Plot: percentage of stages (BT, MT) for 𝛽 = 0.1, 0.2, 0.3 for each of CAN, DEDUP, FFT, FLUID, LU, RADIX, VIPS, and WATER.]
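The traffic profile used in this comparison (share of traffic versus Manhattan distance, as in Fig. 11) can be derived from a pairwise traffic matrix and the router coordinates. The following sketch shows one way to compute it; the data layout, function names, and the 2x2 example values are assumptions for illustration and are not RADIX or WATER measurements.

```python
from collections import defaultdict


def manhattan(a, b):
    """Manhattan distance between two router coordinates (tuples of ints)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def traffic_share_by_distance(coords, traffic):
    """Map each Manhattan distance to its percentage of total traffic.

    coords:  {router_id: (x, y)} placement of routers (hypothetical layout)
    traffic: {(src, dst): volume} pairwise traffic counts
    """
    by_distance = defaultdict(float)
    for (src, dst), volume in traffic.items():
        by_distance[manhattan(coords[src], coords[dst])] += volume
    total = sum(by_distance.values())
    return {d: 100.0 * v / total for d, v in sorted(by_distance.items())}


# Made-up 2x2 example, not data from RADIX or WATER.
coords = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
traffic = {(0, 1): 30, (0, 2): 30, (0, 3): 20, (1, 2): 20}
print(traffic_share_by_distance(coords, traffic))
```

A benchmark whose profile is concentrated at distance one, as reported for RADIX, gains little from top-tier links, whereas a flatter profile gives the optimizer more reason to move long links out of the bottom tier.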
2) Optimization of Mesh-based NoC 
So far, we have considered the SWNoC architecture to 
thoroughly study the effects of process variation on the NoC 
router configuration. To study the effects of the process 
variation on a regular NoC architecture, we undertake the same 
analysis on an equivalent mesh NoC of the same size. Although 
the crossbar stage distribution of mesh NoCs is similar to that 
of SWNoCs, the tier placement of the input/output stages and links 
is different. We show the distribution of stages (SWA and 
VCA) and links for a mesh NoC in Figs. 12 and 13, respectively, 
considering the CAN benchmark. Compared to the stage 
(Fig. 6) and link (Fig. 7) distributions in SWNoC, the SWA and 
VCA stages, and links in mesh NoC favor the bottom tier. This 
is attributed to the fact that a mesh NoC consists of only short-
range links between adjacent routers causing the router logic to 
have greater influence on delay and energy. Therefore, in order 
to avoid the top-tier transistor penalty (except for low values of 
𝛼), links are placed in the bottom tier. On the other hand, 
SWNoCs contain both short- and long-range links. The long-
range links placed between non-adjacent routers incur more 
delay and energy overhead if they are placed in the bottom tier. 
The general characteristics of process-aware design in 
SWNoCs are also present in their mesh counterparts. As 𝛾 
increases, the number of MT stages also increases as seen in 
Fig. 12. Similarly, the number of BT stages (Fig. 12) and 
bottom-tier links (Fig. 13) increase with lower values of 𝛽 or 
higher values of 𝛼. Hence, our process-aware design accounts 
for the effects of 𝛼 and 𝛽 for both SWNoC and mesh NoC. 
E. Process-Oblivious vs. Process-Aware M3D NoC 
In this section, our aim is to demonstrate the advantages of 
designing the M3D NoC when considering the effects of 
process variation. As explained above, when we consider the 
effects of process variation, the router stages and inter-router 
links need to be distributed suitably among the tiers. We call 
this M3D NoC the process-aware architecture (M3D-PA). On 
the contrary, if we do not consider the effects of process 
variation (by assuming 𝛼=0, 𝛽=0) while designing the M3D 
NoC, every router would be multitier to take advantage of the 
performance benefit due to 𝛾. We call this M3D NoC the 
process-oblivious M3D NoC (M3D-PO). Our aim is to quantify 
the benefits of our process-aware design compared to its 
process-oblivious counterpart.  
Figs. 14 and 15 show the EDP of the SWNoC and mesh NoC, 
respectively for the CAN benchmark. The EDP is normalized 
with respect to the M3D-PO design under ideal conditions (α =
0, β = 0). For SWNoCs at the lowest value of 𝛼 (α = 0.05), the 
M3D-PA design improves the EDP by 10.7% and 9.1% on 
average considering all values of 𝛽 over its M3D-PO 
counterpart for 𝛾 = 0.1 and 𝛾 = 0.2, respectively. Since the 
stage distribution of M3D-PO and M3D-PA design is similar 
for 𝛼=0.05, the difference in EDP between these two design 
approaches is low. In addition, at low values of 𝛼, most of the 
stages are MT (Fig. 4). This allows the optimization to utilize 
the top tier for links, reducing the effects of 𝛽 on EDP. 
As the value of α increases, the MT router stages get 
increasingly penalized, so the M3D-PA designs use fewer MT 
and more BT stages as shown in Fig. 4. Therefore, there is more 
opportunity to make better decisions by establishing the 
tradeoff between 𝛼 and 𝛽. However, since the M3D-PO designs 
do not consider the process variation, the EDP difference 
between the M3D-PA and M3D-PO designs becomes more 
prominent. For the most severe process variation 
(α=0.2, β=0.3), the M3D-PA SWNoCs outperform the M3D-
PO counterparts by 43.9% and 37.6% for 𝛾=0.1 (Fig. 14(a)) and 
𝛾=0.2 (Fig. 14(b)), respectively.  
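For reference, the comparison metric can be written out explicitly. The sketch below computes the EDP and the relative saving of a process-aware design over a process-oblivious one. It assumes the reported percentages are relative to the M3D-PO EDP, and the numeric inputs are hypothetical normalized values rather than results from Figs. 14 or 15.

```python
def edp(energy, delay):
    """Energy-delay product."""
    return energy * delay


def edp_saving_percent(edp_po, edp_pa):
    """Percentage by which the process-aware EDP undercuts the process-oblivious EDP."""
    return 100.0 * (edp_po - edp_pa) / edp_po


# Hypothetical normalized values (baseline: M3D-PO at alpha = 0, beta = 0 equals 1.0).
edp_po = edp(energy=1.25, delay=1.6)
edp_pa = edp(energy=1.0, delay=1.5)
print(f"{edp_saving_percent(edp_po, edp_pa):.1f}% EDP saving")
```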
 
Fig. 10. Tier distribution of links for SWNoC considering all benchmarks (𝛼=0.1, 𝛾=0.1). [Plot: percentage of links (bottom tier, top tier) for 𝛽 = 0.1, 0.2, 0.3 for each of CAN, DEDUP, FFT, FLUID, LU, RADIX, VIPS, and WATER.]
Fig. 11. Percentage of traffic exchanged between any two routers as a function of Manhattan distance for RADIX and WATER. [Plot: percentage of traffic (RADIX, WATER) versus link length (Manhattan distance, 1 to 6); panels: 𝛾=0.1, 𝛼=0.1, with 𝛽=0.1, 0.2, and 0.3.]
Fig. 12. Tier distribution of stages for mesh NoC (CAN). [Plot: percentage of stages (BT, MT) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]
Fig. 13. Tier distribution of links for mesh NoC (CAN). [Plot: percentage of links (bottom tier, top tier) for 𝛽 = 0.1, 0.2, 0.3 at each 𝛼 = 0.05, 0.1, 0.15, 0.2, for 𝛾 = 0.1 and 𝛾 = 0.2.]

We observe similar trends in the EDP distribution of the 
mesh NoCs in Fig. 15. In fact, we save more EDP for the M3D-
PA mesh NoCs with respect to the M3D-PO counterpart than 
the SWNoC. For the maximum process variation, the process-
aware designs outperform the process-oblivious counterpart by 
69.6% and 64.2%, respectively, for 𝛾=0.1 (Fig. 15(a)) and 
𝛾=0.2 (Fig. 15(b)). In the mesh NoCs, we need more hops to 
traverse between a pair of source and destination routers on 
average. This results in passing through more intra-router stages 
and inter-router links, which creates more opportunities to 
make the appropriate tradeoffs between the process variation 
parameters (𝛼 and 𝛽).  
We show the EDP of all benchmarks in Fig. 16 for the 
SWNoC and the mesh NoC. We chose three different process 
variation combinations that cover the range of possible values: 
α=0.1, β=0.1 (LOW), α=0.15, β=0.2 (MED), and α=0.2, β=0.3 
(HIGH) to observe the effects of different levels of process 
variations while keeping 𝛾 at 0.1. As discussed above, the EDP 
of the M3D-PO design deteriorates as the values of 𝛼 and 𝛽 
increase. For the SWNoCs (Fig. 16(a)), on average, the M3D-
PA saves 19.6%, 33.1%, and 48.7% EDP compared to its M3D-
PO counterpart for LOW, MED, and HIGH, respectively. For 
the mesh NoCs (Fig. 16(b)), the M3D-PA design saves 27.5%, 
47.9%, and 70.2% EDP on average compared to its M3D-PO 
counterpart for LOW, MED, and HIGH, respectively.  
F. EDP Comparison Between TSV- and M3D-Based SWNoCs 
To complete the analysis, we compare the M3D-PA SWNoC 
with respect to the TSV-based counterpart of the same size [31]. 
In Fig. 17, we present the EDP of TSV and M3D-PA SWNoCs. 
The EDP is normalized with respect to the TSV-based design. 
Here we consider the maximum effect of process variation 
(α=0.2, β=0.3) and the lowest value of 𝛾 (𝛾=0.1). It is evident 
from the figure that even in the worst case for process 
variations, M3D-based designs still reduce the EDP by 11.7% 
on average compared to TSV-based designs across all benchmarks.  
Fig. 14. EDP for SWNoCs for (a) 𝛾=0.1 and (b) 𝛾=0.2 (CAN). EDP is normalized with respect to that of the process-oblivious design under ideal conditions (𝛼=0, 𝛽=0).
Fig. 15. EDP for mesh NoCs for (a) 𝛾=0.1 and (b) 𝛾=0.2 (CAN). EDP is normalized with respect to that of the process-oblivious design under ideal conditions (𝛼=0, 𝛽=0).
Fig. 16. EDP for (a) SWNoCs and (b) mesh NoCs considering all benchmarks (𝛾=0.1). EDP is normalized with respect to that of the process-oblivious design under ideal conditions (𝛼=0, 𝛽=0).

VI. CONCLUSION 
While M3D integration offers high performance and energy 
efficiency, fabrication-related challenges make it difficult 
to achieve the desired performance levels. The process-induced 
performance degradation of the transistors and interconnects 
introduces significant performance and energy overheads for 
M3D-enabled NoCs. Our analysis shows that the SWNoC 
designed without considering the process variation 
underestimates the EDP by at least 18.8% and at most 83.7% 
depending on the process variation parameters. Thus, process 
variation aware design is a must for realistic M3D NoC 
architectures. In this work, we incorporate both top-tier 
transistor slowdown and bottom-tier interconnect degradation 
in the NoC design process. Our proposed design reduces the 
performance degradation of the M3D NoC by suitably 
distributing the intra-router stages and inter-router links among 
the M3D tiers. The process-aware design improves the EDP of 
SWNoC by at least 7.2% and up to 48.7% compared to the 
process-oblivious design approach for the best- and worst-case 
of process variation, respectively. 
VII. REFERENCES 
[1]  W. R. Davis et al., "Demystifying 3D ICs: the pros and cons of going 
vertical," IEEE Design & Test of Computers, vol. 22, no. 6, pp. 498-510, 
Nov.-Dec. 2005. 
[2]  B. S. Feero and P. P. Pande, "Networks-on-Chip in a Three-Dimensional 
Environment: A Performance Evaluation," IEEE Transactions on 
Computers, vol. 58, no. 1, pp. 32-45, Jan. 2009. 
[3]  X. Dong and Y. Xie, "System-level cost analysis and design exploration 
for three-dimensional integrated circuits (3D ICs)," in Asia and South 
Pacific Design Automation Conference (ASPDAC), Yokohama, 2009, 
pp. 234-241. 
[4]  J. H. Lau, "TSV manufacturing yield and hidden costs for 3D IC 
integration," in Proceedings 60th Electronic Components and 
Technology Conference (ECTC), Las Vegas, NV, 2010, pp. 1031-1042.  
[5]  K. Athikulwongse, A. Chakraborty, J. S. Yang, D. Z. Pan and S. K. Lim, 
"Stress-driven 3D-IC placement with TSV keep-out zone and regularity 
study," in IEEE/ACM International Conference on Computer-Aided 
Design (ICCAD), San Jose, CA, 2010, pp. 669-674.  
[6]  S. Das, J. R. Doppa, P. P. Pande and K. Chakrabarty, "Robust TSV-based 
3D NoC design to counteract electromigration and crosstalk noise," in 
Design, Automation & Test in Europe Conference & Exhibition (DATE), 
Lausanne, 2017, pp. 1366-1371.  
[7]  P. Batude et al., "3-D Sequential Integration: A Key Enabling 
Technology for Heterogeneous Co-Integration of New Function With 
CMOS," IEEE Journal on Emerging and Selected Topics in Circuits and 
Systems, vol. 2, no. 4, pp. 714-722, Dec. 2012.  
[8]  S. K. Samal, D. Nayak, M. Ichihashi, S. Banna and S. K. Lim, 
"Monolithic 3D IC vs. TSV-based 3D IC in 14nm FinFET technology," 
in IEEE SOI-3D-Subthreshold Microelectronics Technology Unified 
Conference (S3S), Burlingame, CA, 2016, pp. 1-2.  
[9]  S. M. Jung et al., "High Speed and Highly Cost effective 72M bit density 
S3 SRAM Technology with Doubly Stacked Si Layers, Peripheral only 
CoSix layers and Tungsten Shunt W/L Scheme for Standalone and 
Embedded Memory," in IEEE Symposium on VLSI Technology, Kyoto, 
2007, pp. 82-83. 
[10] T. Uhrmann, T. Wagenleitner, T. Glinsner, M. Wimplinger and P. 
Lindner, "Monolithic IC integration key alignment aspects for high 
process yield," in SOI-3D-Subthreshold Microelectronics Technology 
Unified Conference (S3S), Millbrae, CA, 2014, pp. 1-2.  
[11] S. Das, J. R. Doppa, P. P. Pande and K. Chakrabarty, "Monolithic 3D-
Enabled High Performance and Energy Efficient Network-on-Chip," in 
IEEE International Conference on Computer Design (ICCD), Boston, 
MA, 2017, pp. 233-240.  
[12] S. D. Lin and D. H. Kim, "Detailed-Placement-Enabled Dynamic Power 
Optimization of Multitier Gate-Level Monolithic 3-D ICs," IEEE 
Transactions on Computer-Aided Design of Integrated Circuits and 
Systems, vol. 37, no. 4, pp. 845-854, 2018. 
[13] B. Rajendran et al., "Low thermal budget processing for sequential 3-D 
IC fabrication," IEEE Transactions on Electron Devices, vol. 54,  no. 4, 
pp. 707–714, Apr. 2007.  
[14] M. Lin, A. El Gamal, Y. C. Lu and S. Wong, "Performance Benefits of 
Monolithically Stacked 3-D FPGA," IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 216-
229, Feb. 2007.  
[15] S. Bobba, A. Chakraborty, O. Thomas, P. Batude, and G. D.  Micheli, 
"Cell transformations and physical design techniques for 3D monolithic 
integrated circuits," ACM Journal on Emerging Technologies in 
Computing Systems, vol. 9, no. 3, pp. 19:1–19:28, Sep. 2013.   
[16] Y. J. Lee, D. Limbrick and S. K. Lim, "Ultra high density logic designs 
using transistor-level monolithic 3D integration," in IEEE/ACM 
International Conference on Computer-Aided Design (ICCAD), San 
Jose, CA, 2012, pp. 539-546. 
[17] K. Acharya et al., "Monolithic 3D IC design: Power, performance, and 
area impact at 7nm," in 17th International Symposium on Quality 
Electronic Design (ISQED), Santa Clara, 2016, pp. 41-48.  
[18] K. M. Kim, S. Sinha, B. Cline, G. Yeric, and S. K. Lim, "Four-Tier 
Monolithic 3D ICs: Tier Partitioning Methodology and Power Benefit 
Study," in Proceedings of the 2016 International Symposium on Low 
Power Electronics and Design, San Francisco, CA, Aug. 2016, pp. 70–
75. 
[19] M. M. S. Aly et al., "Energy-Efficient Abundant-Data Computing: The 
N3XT 1,000x," Computer, vol. 48, no. 12, pp. 24-33, Dec. 2015.  
[20] T. Naito et al., "World's first monolithic 3D-FPGA with TFT SRAM over 
90nm 9 layer Cu CMOS," in Symposium on VLSI Technology, Honolulu, 
HI, 2010, pp. 219-220.  
[21] Y. Yu and N. K. Jha, "Energy-Efficient Monolithic Three-Dimensional 
On-Chip Memory Architectures," in IEEE Transactions on 
Nanotechnology, vol. 17, no. 4, pp. 620-633, July 2018.     
[22] N. Golshani et al., "Monolithic 3D integration of SRAM and image 
sensor using two layers of single grain silicon," in IEEE International 3D 
Systems Integration Conference (3DIC), Munich, 2010, pp. 1-4. 
[23] S. Panth, S.K. Samal, K. Samadi, Y. Du, and S.K. Lim, "Tier 
Degradation of Monolithic 3-D ICs: A Power Performance Study at 
Different Technology Nodes," IEEE Transactions on Computer-Aided 
Design of Integrated Circuits and Systems, vol. 36, no. 8, pp. 1265-1273, 
Aug. 2017.  
[24] L. Pasini et al., "nFET FDSOI activated by low temperature solid phase 
epitaxial regrowth: Optimization guidelines," in SOI-3D-Subthreshold 
Microelectronics Technology Unified Conference (S3S), Millbrae, CA, 
2014, pp. 1-2.   
[25] C. Fenouillet-Beranger et al., "New insights on bottom layer thermal 
stability and laser annealing promises for high performance 3D VLSI," 
in IEEE International Electron Devices Meeting, San Francisco, CA, 
2014, pp. 27.5.1-27.5.4.  
[26] Li-Shiuan Peh and W. J. Dally, "A delay model for router 
microarchitectures," IEEE Micro, vol. 21, no. 1, pp. 26-34, Jan.-Feb. 
2001.  
[27] Hang-Sheng Wang, Li-Shiuan Peh and S. Malik, "A power model for 
routers: modeling Alpha 21364 and InfiniBand routers," in Proceedings 
10th Symposium on High Performance Interconnects, Stanford, CA, 
USA, 2002.  
[28] P. Batude et al., "3D monolithic integration," in IEEE International 
Symposium of Circuits and Systems (ISCAS), Rio de Janeiro, 2011, pp. 
2233-2236. 
[29] S. K. Samal, D. Nayak, M. Ichihashi, S. Banna and S. K. Lim, "Tier 
partitioning strategy to mitigate BEOL degradation and cost issues in 
monolithic 3D ICs," in IEEE/ACM International Conference on 
Computer-Aided Design (ICCAD), Austin, TX, 2016, pp. 1-7. 
[30] J. A. Boyan and A. W. Moore, "Learning evaluation functions to improve 
optimization by local search," The Journal of Machine Learning 
Research, vol. 1, pp. 77–112, Nov. 2000.  
[31] S. Das, J. R. Doppa, P. P. Pande and K. Chakrabarty, "Design-Space 
Exploration and Optimization of an Energy-Efficient and Reliable 3-D 
Small-World Network-on-Chip," IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 36, no. 5, pp. 719-
732, May 2017.   
[32] N. Binkert et al., "The GEM5 Simulator," SIGARCH Computer 
Architecture News, vol. 39, no. 2, pp. 1-7, 2011.  
[33] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh and A. Gupta, "The SPLASH-
2 programs: characterization and methodological considerations," in 
Proceedings 22nd Annual International Symposium on Computer 
Architecture, Santa Margherita Ligure, Italy, 1995, pp. 24-36. 
[34] C. Bienia, "Benchmarking Modern Multiprocessors," Ph.D. Dissertation, 
Dept. Comp. Sci., Princeton Univ., Princeton, NJ, USA, 2011. 
[35] O. Lysne, T. Skeie, S. A. Reinemo and I. Theiss, "Layered routing in 
irregular networks," IEEE Transactions on Parallel and Distributed 
Systems, vol. 17, no. 1, pp. 51-65, Jan. 2006.  
[36] A. Mallik et al., "The impact of sequential-3D integration on 
semiconductor scaling roadmap," in IEEE International Electron 
Devices Meeting (IEDM), San Francisco, CA, 2017, pp. 32.1.1-32.1.4.  
[37] K. Chang et al., "Cascade2D: A design-aware partitioning approach to 
monolithic 3D IC with 2D commercial tools," in IEEE/ACM 
International Conference on Computer-Aided Design (ICCAD), Austin, 
TX, 2016, pp. 1-8. 

Fig. 17. EDP of TSV- and M3D-enabled (𝛾=0.1, 𝛼=0.2, 𝛽=0.3) SWNoCs. EDP is normalized with respect to that of the TSV-based design. [Plot: normalized EDP (y-axis, 0.80 to 1.05) of TSV and M3D-PA for CAN, DEDUP, FFT, FLUID, LU, RADIX, VIPS, and WATER.]