Learning-based Application-Agnostic 3D NoC Design for Heterogeneous
  Manycore Systems by Joardar, Biresh Kumar et al.
UNDER REVIEW AT IEEE TRANSACTIONS ON COMPUTERS
 
Learning-based Application-Agnostic 3D NoC 
Design for Heterogeneous Manycore Systems  
Biresh Kumar Joardar, Student Member, IEEE, Ryan Gary Kim, Member, IEEE, Janardhan Rao Doppa, 
Member, IEEE, Partha Pratim Pande, Senior Member, IEEE, Diana Marculescu, Fellow, IEEE, and 
Radu Marculescu, Fellow, IEEE 
Abstract— The rising use of deep learning and other big-data algorithms has led to an increasing demand for hardware platforms that 
are computationally powerful, yet energy-efficient. Due to the amount of data parallelism in these algorithms, high-performance three-
dimensional (3D) manycore platforms that incorporate both CPUs and GPUs present a promising direction. However, as systems use 
heterogeneity (e.g., a combination of CPUs, GPUs, and accelerators) to improve performance and efficiency, it becomes more pertinent 
to address the distinct and likely conflicting communication requirements (e.g., CPU memory access latency or GPU network 
throughput) that arise from such heterogeneity. Unfortunately, it is difficult to quickly explore the hardware design space and choose 
appropriate tradeoffs between these heterogeneous requirements. To address these challenges, we propose the design of a 3D Network-
on-Chip (NoC) for heterogeneous manycore platforms that considers the appropriate design objectives for a 3D heterogeneous system 
and explores various tradeoffs using an efficient machine learning (ML)-based multi-objective optimization (MOO) technique. The 
proposed design space exploration considers the various requirements of its heterogeneous components and generates a set of 3D NoC 
architectures that efficiently trade off these design objectives. Our findings show that by jointly considering these requirements (latency, 
throughput, temperature, and energy), we can achieve 9.6% better Energy-Delay Product on average at nearly iso-temperature 
conditions when compared to a thermally-optimized design for 3D heterogeneous NoCs. More importantly, our results suggest that our 
3D NoCs optimized for a few applications can be generalized for unknown applications as well. Our results show that these generalized 
3D NoCs only incur a 1.8% (36-tile system) and 1.1% (64-tile system) average performance loss compared to application-specific NoCs.  
Index Terms—Heterogeneous architectures, Manycore systems, Multi-objective optimization, Network-on-Chip  
1 INTRODUCTION
Neural networks, graph analytics, and other big-data 
applications have become vastly important for many 
domains. This has led to a search for proper computing sys-
tems that can efficiently utilize the tremendous amount of 
data parallelism that is associated with these applications. 
Recently, platforms using both CPUs and GPUs have signif-
icantly improved the execution time for such applications 
[1]. However, existing discrete GPU systems use off-chip in-
terconnects (e.g., PCIe) to communicate with the CPUs. 
These interconnects give rise to high data transfer latency 
and become performance bottlenecks for applications that 
involve high volumes of data transfers between the CPUs 
and GPUs.  
A heterogeneous manycore system that integrates 
many CPUs and GPUs on a single chip can solve this prob-
lem and avoid such expensive off-chip data transfers [2], 
[3]. In addition, these single-chip systems require a scalable 
interconnection backbone (Networks-on-Chip (NoCs)) to 
facilitate more efficient communication.  
To further reduce data transfer costs, three-dimensional 
(3D) integrated circuits (ICs) have been investigated as a 
possible solution and have made significant strides towards 
improving communication efficiency [4], [5]. By connecting 
planar dies stacked on top of each other with through-sili-
con vias (TSVs), the communication latency, throughput, 
and energy consumption can be further improved [6].  
3D ICs, together with NoCs, enable the design of highly 
integrated heterogeneous (e.g., CPUs, GPUs, accelerators) 
manycore platforms for big-data applications. However, 
the design of 3D NoC-based manycore systems poses 
unique challenges. Due to the heterogeneity of the cores 
integrated on a single chip, the communication require-
ments for each core can vary significantly. For example, in 
a CPU-GPU based heterogeneous system, CPUs require low 
memory latency while GPUs need high-throughput data 
transfers [7]. In addition to the individual core require-
ments, 3D ICs allow dense circuit integration but have 
much higher power density than their 2D counterparts. 
Therefore, the design process must consider reducing tem-
perature hotspots as an additional objective. Overall, the 
design of a 3D heterogeneous manycore architecture 
needs to consider each of these objectives and satisfy all of 
them simultaneously [8]. Hence, 3D heterogeneous many-
core design can be formulated as a multi-objective optimi-
zation (MOO) problem.  
In this work, we incorporate appropriate analytical mod-
els for each of the relevant objectives (i.e., throughput, la-
tency, temperature, and energy). We also demonstrate that 
it is necessary to consider all objectives to achieve the op-
timal trade-off between temperature and performance. We 
examine the differences between performance-only and 
performance-thermal-joint optimization as an example. 
However, the complexity of the design space and the high 
number of objectives make this design optimization problem 
difficult. Widely-used MOO techniques (e.g., NSGA-II 
[9] or simulated annealing based AMOSA [10]) can require 
significant amounts of time due to their exploratory nature. 
Therefore, more efficient and scalable optimization techniques 
are required.  

• Biresh Kumar Joardar, Janardhan Rao Doppa, and Partha Pratim 
Pande are with Washington State University, Pullman, WA, 99164. 
Email: {biresh.joardar, jana.doppa, pande}@wsu.edu 
• Ryan Gary Kim is with Colorado State University, Fort Collins, CO, 
80523. Email: Ryan.G.Kim@colostate.edu 
• Diana Marculescu and Radu Marculescu are with Carnegie Mellon 
University, Pittsburgh, PA, 15213. Email: {dianam, radum}@cmu.edu 

To this end, in this work, we propose a new MOO algo-
rithm, MOO-STAGE, which extends the machine learning 
framework STAGE [11]. As opposed to traditional MOO al-
gorithms that only consider the current solution set when 
making search decisions, MOO-STAGE learns from the 
knowledge of previous search trajectories to guide the 
search towards more promising parts of the design space. 
This significantly reduces the optimization time without 
sacrificing the solution quality. Using MOO-STAGE, we can 
take advantage of the traffic characteristics of different ap-
plications and incorporate appropriate design objectives to 
enable quick design space exploration of 3D heterogene-
ous systems. In addition, through careful analysis, we notice 
that several applications on heterogeneous platforms ex-
hibit similar traffic patterns. Subsequently, we propose that 
an application-agnostic heterogeneous 3D NoC can be de-
signed to achieve similar performance as designs that are 
optimized for a specific application. We evaluate the feasi-
bility and performance of these application-agnostic de-
signs across all considered benchmarks. 
Below we summarize our main contributions in this 
work: 
1. We undertake a comprehensive study of the traffic 
patterns from multiple applications taken from var-
ious domains running on 3D heterogeneous sys-
tems. 
2. Based on the observed traffic patterns, we propose 
a generalized application-agnostic heterogeneous 
3D NoC design that achieves similar levels of per-
formance (latency, throughput, energy, and tem-
perature) as application-specific designs. 
3. We propose a new MOO framework MOO-STAGE 
and apply it to the problem of manycore 3D heter-
ogeneous NoC design. Our findings show that 
MOO-STAGE can find the same quality of solutions 
as AMOSA and a branch-and-bound based algo-
rithm (PCBB [12]) while significantly reducing opti-
mization time and improving scalability. 
2 RELATED WORK 
In this section, we present some of the most relevant prior 
works on 3D heterogeneous NoC design and related MOO 
algorithms.  
2.1 3D Heterogeneous NoCs 
Due to their heterogeneity, CPU-GPU based systems exhibit 
several interesting traffic characteristics. For instance, GPUs 
typically communicate with only a few shared last level 
caches (LLCs), which results in many-to-few traffic patterns 
(i.e., many GPUs communicate with a few LLCs) with negligible 
inter-GPU communication [7], [13], [14]. This can 
cause the LLCs to become bandwidth bottlenecks under 
heavy network loads and lead to significant performance 
degradation [7]. In addition, since heterogeneous systems 
share the memory resources, the GPUs can monopolize the 
memory and cause high CPU memory access latency [15]. 
Conventional 2D architectures, such as mesh NoCs, cannot 
efficiently handle this many-to-few traffic or fulfill the qual-
ity of service (QoS) requirements for both CPU and GPU 
communication [7]. 
In recent years, designers have taken advantage of the 
higher packing density and lower interconnect latency of 3D ICs 
to improve the performance of manycore systems [4], [5]. 
The advantages of 3D integration for CPU and GPU based 
manycore systems have been demonstrated in [16], [17] 
where the authors have principally focused on improving 
the throughput and energy efficiency by using the benefits 
of 3D integration for homogeneous systems (all CPUs or all 
GPUs) only.  
Due to the differences in the thread-level parallelism of 
CPUs and GPUs, the NoC designed for heterogeneous sys-
tems should satisfy both CPU and GPU communication 
constraints [18]. Hence, designing the 3D NoC for heterogeneous 
systems is more complicated than for homogeneous 
systems; this aspect has not been explored adequately. On 
top of this, 3D ICs suffer from thermal issues due to higher 
power density [8], [19]. One of the common methodologies 
for reducing the peak temperature in a 3D architecture in-
cludes proper core placement to prevent high power con-
suming cores from being placed on top of each other [19]. 
However, it is not possible to implement such a strategy for 
heterogeneous systems with many GPU cores [8]. Other 
techniques to reduce temperature include suitable floor-
planning [20] and temperature-aware task scheduling [21]. 
In contrast to these prior works, we propose a MOO algo-
rithm to intelligently place the cores and links within a 3D 
heterogeneous system that jointly considers all relevant 
design metrics, e.g., latency, throughput, energy, and tem-
perature. 
For a given workload, application-specific NoCs are 
known to outperform conventional architectures, e.g., 
mesh NoCs [7]. A MOO formulation for 3D NoCs is pre-
sented in [8] for accelerating deep learning workloads. In 
[22], the authors have explored heterogeneous NoC design 
for multimedia applications. However, these works have 
only focused on one class of workloads to design the NoC 
and ignored the correlation in the traffic patterns of other 
applications. 
2.2 Multi-Objective Optimization Algorithms  
Basic MOO algorithms such as genetic algorithms (GA), 
e.g., NSGA-II [9], or simulated annealing-based algorithms, 
e.g., AMOSA [10], have been used in different optimization 
problems. AMOSA has been demonstrated to be superior 
to GAs or simulated annealing [10] and has been applied 
to the problem of heterogeneous NoC design in [7], [8]. 
However, since AMOSA is based on simulated annealing, it 
needs to be annealed slowly to ensure a good solution, 
which does not scale well with the size of the search space.   
In [23], the authors have used a heuristic-based MOO 
for multicore designs. However, they focus mainly on opti-
mizing individual cores in smaller systems with up to 16 
processors. Latency and area have been optimized using 
GAs to design NoC architectures in [24]. The authors in [25] 
have used machine learning techniques like linear regres-
sion and neural networks for MOO on different platforms. 
A learning-based fuzzy algorithm has been proposed to re-
duce the search time in [26]. However, this methodology 
requires a threshold to be decided for each application 
separately. A recent work [12] proposed a branch-and-bound-based 
algorithm, priority and compensation factor-oriented 
branch and bound (PCBB), for task mapping on an 
NoC-based platform. However, this work only considers 
task mapping on relatively small systems, where 
calculating the bound for each node is significantly easier. 
These works have mainly considered homogeneous platforms 
with smaller system sizes and fewer objectives.   
3D heterogeneous NoC design is far more complex 
since the design must consider the requirements for each 
component. With additional constraints such as tempera-
ture and energy, the required optimization time can be-
come tremendously high. Therefore, as systems become 
more complex, algorithms that are scalable with the size of 
the search space and can reduce optimization time without 
sacrificing solution quality will be needed.  
 In this work, we show that multiple applications exhibit 
similar traffic patterns on heterogeneous platforms.  Lever-
aging this observation, we investigate the design of appli-
cation-agnostic NoC architectures and propose a machine-
learning inspired algorithm MOO-STAGE for 3D heteroge-
neous NoC design. Together, using MOO-STAGE and our 
observations of application traffic characteristics, we signif-
icantly reduce the design time of 3D heterogeneous NoCs 
and create optimized, application-agnostic architectures. 
3 TRAFFIC PATTERN ANALYSIS 
In this section, we present an in-depth study of the charac-
teristics of the traffic patterns generated by a variety of ap-
plications that run on a heterogeneous platform. To this 
end, applications from multiple domains were selected, 
e.g., physics, data mining, and bio-informatics. Two of these 
benchmarks, LeNet [27] and CDBNet [28], are commonly 
used neural networks for image classification while the rest 
of these applications come from the Rodinia benchmark 
suite [29]. This allows us to study the traffic patterns and 
the corresponding communication requirements of com-
monly used applications from different fields. Table 1 lists 
the applications along with their corresponding do-
mains/usages. To obtain accurate traffic characteristics, we 
run each application on a detailed architecture simulator, 
Gem5-GPU [30]. The traffic characteristics are measured in 
the number of flits per cycle. Full experimental details are 
elaborated in Section 6.1.  
Fig. 1 shows the traffic heat map for BP, BFS, NW, and PF 
applications running on a generic 64-tile heterogeneous 
system (8 CPUs, 16 LLCs, and 40 GPUs). Each row represents 
a different source, while each column represents a different 
destination. Since CPU and GPU cores have different requirements 
for delivering high performance, we show their 
respective traffic patterns separately: CPU-LLC communication 
in the top section and GPU-LLC communication in the 
bottom section (of note, CPU-GPU communication is negligible). 

TABLE 1 
LIST OF APPLICATIONS AND THEIR RESPECTIVE DOMAINS 
Applications                      Domain/Usage 
Back Propagation (BP)             Pattern Recognition 
Breadth-First Search (BFS)        Graph Algorithm 
CNN for CIFAR-10 (CDN) [28]       Image Classification (RGB) 
Gaussian Elimination (GAU)        Linear Algebra 
HotSpot (HS)                      Physics Simulation 
CNN for MNIST (LEN) [27]          Image Classification (Grayscale) 
LU Decomposition (LUD)            Linear Algebra 
Needleman-Wunsch (NW)             Bio-Informatics 
k-Nearest Neighbors (KNN)         Data Mining 
PathFinder (PF)                   Grid Traversal 

Fig. 1. Traffic pattern heat map for four applications (BP, BFS, NW, and PF) running on a 64-tile heterogeneous manycore system. The numbers indicate 
percentage of total traffic contributed by that section (e.g., CPU-LLC communication results in 2.6% of total traffic for BP). 
[Figure: one heat-map panel per application (BP, BFS, NW, PF). Rows are sources and columns are destinations, grouped by tile type (CPU, LLC, GPU). The top section of each panel shows CPU-LLC traffic and the bottom section shows GPU-LLC traffic. Color scale: amount of communication, from low to high.] 

We observe that these heterogeneous systems exhibit 
several interesting traffic patterns: 
• In every application, we observe that one CPU (the 
master core) exhibits higher traffic intensity compared 
to the other CPUs. The master core is easy to spot since 
it contributes a majority of the CPU traffic.  
• Contrary to CPUs, GPU-LLC pairs exhibit nearly uniform 
high traffic due to well distributed and parallelized GPU 
workloads. The large number of GPUs can cause the 
GPU traffic to significantly congest the network. Com-
munication between the other pairs of cores, e.g., GPU-
GPU is much lower. 
• The majority of all traffic is associated with the LLCs. 
Fig. 2 shows percentage of total traffic going to/from 
the LLCs. On average, more than 80% of the total traffic 
is associated with the LLCs. Since heterogeneous sys-
tems typically have a small number of LLCs, this gener-
ates many-to-few communication patterns, especially 
between the GPUs and LLCs [7]. Without proper archi-
tectural support, LLCs can easily become network 
hotspots.  
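
The LLC share reported above can be computed directly from a source-destination traffic matrix. The following Python sketch is illustrative only: the function name `llc_traffic_share`, the toy 4-tile matrix, and its values are assumptions made for this example and are not measurements from the paper.

```python
def llc_traffic_share(traffic, llc_tiles):
    """Fraction of total traffic whose source or destination is an LLC tile.

    traffic[src][dst] is the traffic from tile src to tile dst
    (e.g., in flits per cycle); llc_tiles is the set of LLC tile indices.
    """
    total = 0.0
    llc_related = 0.0
    for src, row in enumerate(traffic):
        for dst, flits in enumerate(row):
            total += flits
            if src in llc_tiles or dst in llc_tiles:
                llc_related += flits
    return llc_related / total if total > 0 else 0.0


# Hypothetical toy system: tile 0 = CPU, tile 1 = LLC, tiles 2 and 3 = GPUs.
toy_traffic = [
    [0.0, 0.2, 0.0, 0.0],
    [0.1, 0.0, 2.0, 2.0],
    [0.0, 3.0, 0.0, 0.1],
    [0.0, 3.0, 0.1, 0.0],
]
share = llc_traffic_share(toy_traffic, llc_tiles={1})
print(f"LLC-related share of traffic: {share:.1%}")
```

In this toy matrix, almost all traffic touches the single LLC tile, which mirrors the many-to-few pattern described in the list above.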
All applications considered (Table 1) exhibit such traffic 
behaviors and have similar traffic heat maps. Based on 
these observations, we conjecture that these characteristics 
are more dependent on the heterogeneous architecture 
than any specific application. Even though there are 
some application-specific variations among the 
cores, these differences are relatively insignificant 
compared to the heavy many-to-few communication going 
to/from the LLC blocks. As a result, the traffic patterns of 
any new application can be expected to exhibit similar 
features as those in Fig. 1. Therefore, an NoC optimized for 
any of these applications can potentially be re-used for 
other applications without significant loss of performance.  
To demonstrate that the above-mentioned traffic pat-
terns are not specific to any particular system size, we con-
sider a different system size of 36 tiles (4 CPUs, 8 LLCs, and 
24 GPUs). The traffic patterns generated by this system size 
exhibit the same characteristics as the 64-tile system across 
all applications, i.e., a highly active master core, little CPU-
GPU communication, nearly uniform GPU-LLC communica-
tion, and most of the communication is based around the 
LLCs (Fig. 2). For brevity, we do not replicate Fig. 1 for the 36-tile 
system. This reinforces our previous observation 
that the traffic characteristics depend more on the 
elements of the heterogeneous architecture and are not limited 
to one system size or configuration. Hence, we can design 
the 3D NoC architecture by primarily considering the 
constituents of the heterogeneous system rather than any 
specific traffic patterns.  
4 MULTI-OBJECTIVE OPTIMIZATION FORMULATION 
4.1 Drawbacks of Mesh NoCs  
The mesh NoC is the preferred design for on-chip communication 
due to its simplicity. Intel’s Xeon Phi and Tilera’s TILE 
processors are examples of architectures with a mesh NoC. 
However, as the number of cores on a single chip increases, 
mesh NoCs inevitably require more hops for each network 
traversal. These added hops lead to increased network latency 
and energy consumption. Therefore, despite their simplicity, 
mesh NoCs do not scale well with system size.  
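
To make the scaling argument concrete, the short Python sketch below computes the average hop count between distinct tiles of a planar k x k mesh under minimal (Manhattan) routing. The uniform all-to-all assumption, the planar topology, and the function name `mesh_average_hops` are choices made for this illustration; they are not the traffic model or topology evaluated in this work.

```python
from itertools import product


def mesh_average_hops(k):
    """Average minimal hop count between distinct tiles of a k x k 2D mesh."""
    tiles = list(product(range(k), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(tiles, repeat=2):
        if (x1, y1) != (x2, y2):
            # Minimal routing on a mesh needs one hop per unit of Manhattan distance.
            total += abs(x1 - x2) + abs(y1 - y2)
            pairs += 1
    return total / pairs


# Average hops for 16-, 36-, and 64-tile planar meshes.
hops_by_size = {k * k: mesh_average_hops(k) for k in (4, 6, 8)}
for size, hops in hops_by_size.items():
    print(f"{size}-tile mesh: {hops:.2f} hops on average")
```

Under these assumptions the average grows linearly with the mesh side length (2k/3 hops), so each traversal becomes more expensive as the tile count increases.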
Mesh NoCs are especially ill-suited for heterogeneous 
systems. In [7], the authors have shown that links closer to 
the LLCs are highly over-utilized due to the many-to-few 
communication in mesh NoCs. Even an optimized 3D mesh 
can have links carrying 3X the average link traffic [8]. This 
can lead to network congestion, which results in higher la-
tency and decreased throughput, negatively affecting the 
overall system performance. To combat these issues, we 
look to define a general methodology for designing 3D 
NoC-based heterogeneous systems.  
4.2 MOO formulation for 3D heterogeneous NoCs 
In this section, we discuss the necessary objectives to de-
sign an efficient 3D heterogeneous system. Fig. 3 illustrates 
an example 3D heterogeneous architecture with two layers. 
For these systems, it is important that we 1) optimize both 
CPU and GPU communication; 2) efficiently balance the 
load of the 3D NoC under many-to-few traffic patterns 
seen in Section 3; 3) minimize the network energy; and 4) 
minimize the peak temperature of the system. The design 
methodology should optimize the system for individual 
core requirements along with other design constraints for 
high-performance NoC architectures. There may be addi-
tional design objectives based on specific design cases 
which can be similarly included in the design process. In 
this work, the design methodology focuses on the placement 
of the CPUs, GPUs, LLCs, and planar links. We elaborate 
on how the methodology satisfies each objective next. 

Fig. 2. Traffic breakdown showing the percentage of traffic between (in either direction) LLC and either CPU or GPU (CORE-LLC) and between CPUs 
and GPUs (CORE-CORE) for a (a) 36-tile and (b) 64-tile manycore system. 
[Figure: two stacked bar charts, (a) and (b). Horizontal axis: applications BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, PF. Vertical axis: percentage of traffic, 0% to 100%. Series: CORE-LLC and CORE-CORE.] 

4.2.1 CPU Communication Objective 
CPU cores use instruction-level parallelism to achieve high 
performance on a limited number of threads. If any of these 
threads stall, CPUs incur a large penalty. Therefore, memory 
access latency is a primary concern for CPUs. For 𝐶 CPUs 
and 𝑀 LLCs, we model the average CPU-LLC latency using 
the following equation [5]: 
$$ Lat = \frac{1}{C \cdot M} \sum_{i=1}^{C} \sum_{j=1}^{M} \left( r \cdot h_{ij} + d_{ij} \right) \cdot f_{ij} \tag{1} $$
Here, 𝑟 is the number of router stages, ℎ𝑖𝑗 is the number of 
hops from CPU 𝑖 to LLC 𝑗, 𝑑𝑖𝑗 indicates the total link delay, 
and 𝑓𝑖𝑗 represents the amount of interaction between CPU 
𝑖 and LLC 𝑗. The path from CPU 𝑖 to LLC 𝑗 is determined 
by the routing algorithm (given in Section 6.1). It should be 
noted here that the above equation is not limited to our 
specific routing technique and can be used with other rout-
ing algorithms as well.  
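
As a worked illustration of Eq. (1), the following Python sketch evaluates the latency objective for a hypothetical system with two CPUs and two LLCs. The function name, the argument layout (one row per CPU, one column per LLC), and all numeric values are assumptions made for this example; in practice the hop counts and link delays would come from the routing algorithm and the physical layout.

```python
def average_cpu_llc_latency(hops, link_delay, traffic, router_stages):
    """Eq. (1): traffic-weighted latency averaged over all C x M CPU-LLC pairs.

    hops[i][j]       : number of hops from CPU i to LLC j (h_ij)
    link_delay[i][j] : total link delay on that path (d_ij)
    traffic[i][j]    : amount of interaction between CPU i and LLC j (f_ij)
    router_stages    : number of router stages (r)
    """
    num_cpus = len(hops)
    num_llcs = len(hops[0])
    total = 0.0
    for i in range(num_cpus):
        for j in range(num_llcs):
            total += (router_stages * hops[i][j] + link_delay[i][j]) * traffic[i][j]
    return total / (num_cpus * num_llcs)


# Hypothetical values for C = 2 CPUs and M = 2 LLCs.
hops = [[1, 2], [3, 1]]
link_delay = [[1, 2], [4, 1]]
traffic = [[0.5, 0.1], [0.2, 0.2]]
lat = average_cpu_llc_latency(hops, link_delay, traffic, router_stages=3)
print(f"Lat = {lat:.2f}")
```

Because each term is weighted by 𝑓𝑖𝑗, shortening the path of a heavily communicating CPU-LLC pair reduces the objective more than shortening a rarely used path.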
4.2.2 GPU Communication Objective 
Unlike the CPUs, GPUs rely on high levels of data parallel-
ism. Massive amounts of parallelism coupled with quick 
context switching allow the GPU to hide most of its 
memory access latency. However, to do so, GPUs need lots 
of data and rely on high throughput memory accesses.  
We maximize the throughput of GPU-related traffic by 
load-balancing the network to allow more messages to uti-
lize the network at a time. In other words, given a frequency 
of traffic interaction and the routing paths, we want to bal-
ance the expected link utilization across all links. This does 
not change the total number of packets to be communi-
cated. Instead, it reduces the number of heavily congested 
links by redistributing traffic flows. This reduces the amount 
of contention for heavily utilized links. As a result, links are 
more readily available, there is less network congestion, 
and hence, network throughput is improved.  
For more intuition, load-balancing the network by adjusting 
link and tile placement tries to bring highly communicating 
tiles closer together and to place links such that 
path diversity is created between highly communicating pairs. 
In other words, this load-balancing approach attempts 
to improve throughput by utilizing the given resources 
more efficiently. To balance the expected link utilization 
(load-balance the network), we consider minimizing 
both the mean (Ū) and standard deviation (𝜎) of expected 
link utilization as suitable objectives. 
The expected utilization of link 𝑘 (𝑈𝑘) can be obtained 
by the following equation: 
$$ U_k = \sum_{i=1}^{R} \sum_{j=1}^{R} \left( f_{ij} \cdot p_{ijk} \right) \tag{2} $$
Here 𝑅 is the total number of tiles and 𝑝𝑖𝑗𝑘 indicates 
whether a planar/vertical link 𝑘 is used to communicate 
between core 𝑖 and core 𝑗, i.e., 
$$ p_{ijk} = \begin{cases} 1, & \text{if cores } i, j \text{ communicate along planar/vertical link } k \\ 0, & \text{otherwise} \end{cases} $$
𝑝𝑖𝑗𝑘 can be determined by using the network connectivity 
and routing protocols. 
Then, the mean (Ū) and standard deviation (𝜎) of link 
utilization can be determined from the following equations: 
$$ \bar{U} = \frac{1}{L} \sum_{k=1}^{L} U_k \tag{3} $$
$$ \sigma = \sqrt{ \frac{1}{L} \sum_{k=1}^{L} \left( U_k - \bar{U} \right)^2 } \tag{4} $$

Fig. 3. Overview of the TSV-based 3D system considered in this work. The 
system is divided into CPU, GPU, and LLC tiles. Tiles are interconnected 
via a planar link (intra-layer) or a TSV (inter-layer). This figure is for illustration 
purpose only; it is not optimized for any metric. 
[Figure: tile types shown are the CPU tile (CPU, L1, router), the GPU tile (GPU, L1, router), and the LLC tile (memory controller, L2 bank, router), together with DDR SDRAM.] 

Fig. 4. Throughput with respect to mean (Eq. 3) and standard deviation (Eq. 4) of link utilization for (a) BFS and (b) HS. The plots have been generated 
by NoCs that were visited while optimizing for throughput only (Section 6.2, Case 1).  

Model Validation: Throughput can be accurately measured 
from network simulations. However, repeated simulations 
require significant amounts of time and increase total opti-
mization time [26]. Existing network throughput models 
have only considered regular networks [31] and hence can-
not be applied to the networks we generate (there are no 
regularity constraints). In this work, we have modeled max-
imizing throughput as minimizing Eqs. (3) and (4). We vali-
date our proposed throughput model using detailed cycle-
accurate network simulations. Fig. 4(a) and Fig. 4(b) show 
the throughput trend for different values of mean (Eqn. 3) 
and standard deviation (Eqn. 4) of link utilization for BFS 
and HS. Similar behavior is observed for all other applica-
tions. The plots have been restricted to regions which had 
enough data points for a faithful representation. It is clear 
from these figures that network throughput has an inverse 
relation with the mean and standard deviation of link utili-
zation. Reducing the mean and standard deviation simulta-
neously leads to a monotonic increase in throughput. 
Therefore, increasing throughput can alternatively be ex-
pressed as minimizing mean and standard deviation of the 
expected link utilization, validating our throughput model. 
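
The load-balancing objectives of Eqs. (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in this work: the route table format (a dictionary from source-destination pairs to the list of link indices on the path), the function names, and the toy 3-tile network are all assumptions made for the example.

```python
import math


def link_utilization(traffic, routes, num_links):
    """Eq. (2): expected utilization U_k of every link k."""
    util = [0.0] * num_links
    for (src, dst), links in routes.items():
        for k in set(links):  # p_ijk = 1 exactly for links on the i -> j path
            util[k] += traffic[src][dst]
    return util


def mean_and_std(util):
    """Eqs. (3) and (4): mean and population standard deviation of U_k."""
    mean = sum(util) / len(util)
    variance = sum((u - mean) ** 2 for u in util) / len(util)
    return mean, math.sqrt(variance)


# Hypothetical 3-tile network with 3 links and fixed traffic f_ij.
traffic = [[0, 2, 1], [0, 0, 4], [0, 0, 0]]

# Design A routes 0 -> 2 over links 0 and 1 and leaves link 2 idle.
routes_a = {(0, 1): [0], (0, 2): [0, 1], (1, 2): [1]}
# Design B gives 0 -> 2 a direct link (link 2).
routes_b = {(0, 1): [0], (0, 2): [2], (1, 2): [1]}

util_a = link_utilization(traffic, routes_a, num_links=3)
util_b = link_utilization(traffic, routes_b, num_links=3)
mean_a, std_a = mean_and_std(util_a)
mean_b, std_b = mean_and_std(util_b)
print(f"A: U = {util_a}, mean = {mean_a:.2f}, std = {std_a:.2f}")
print(f"B: U = {util_b}, mean = {mean_b:.2f}, std = {std_b:.2f}")
```

In this toy case, design B lowers both the mean and the standard deviation with the same number of links and the same traffic, which is the direction the optimization favors.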
4.2.3 Thermal requirements 
One of the key challenges in 3D integration is the high 
power density and resulting temperature hotspots. High 
temperature not only affects performance but also the life-
time of the device. Cores that are further away from the 
sink tend to have higher temperatures than those close to 
the sink. Therefore, cores must be properly placed, e.g., 
high power consuming cores should be placed close to the 
sink to reduce temperature.  
To estimate the temperature of a core, we use the fast 
approximation model presented in [32]. It considers both 
horizontal and vertical heat flow to accurately estimate the 
temperature. A manycore system can be divided into 𝑁 sin-
gle-tile stacks, each with 𝐾 layers, where 𝑁 is the number 
of tiles on a single layer and 𝐾 is the total number of layers. 
The temperature of a core within a single-tile stack 𝑛 lo-
cated at layer 𝑘 from the sink (𝑇𝑛,𝑘) due to the vertical heat 
flow is given by: 
$$ T_{n,k} = \sum_{i=1}^{k} \left( P_{n,i} \sum_{j=1}^{i} R_j \right) + R_b \sum_{i=1}^{k} P_{n,i} \tag{5} $$
This represents the vertical heat flow in a manycore system 
[32]. Here, 𝑃𝑛,𝑖 is the power consumption of the core 𝑖 layers 
away from the sink in single-tile stack 𝑛, 𝑅𝑗 is the vertical 
thermal resistance, and 𝑅𝑏 is the thermal resistance of the 
base layer on which the dies are placed. The values of 𝑅𝑗 
and 𝑅𝑏 are obtained using 3D-ICE [33]. The horizontal heat 
flow is represented through the maximum temperature dif-
ference in the same layer 𝑘 (Δ𝑇(𝑘)):  
$$ \Delta T(k) = \max_{n} T_{n,k} - \min_{n} T_{n,k} \tag{6} $$
The overall thermal prediction model includes both ver-
tical and horizontal heat flow equations. Following [32], we 
use 𝑇 as our comparative temperature metric for any given 
3D architecture:  
$$ T = \left( \max_{n,k} T_{n,k} \right) \left( \max_{k} \Delta T(k) \right) \tag{7} $$
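
The thermal metric of Eqs. (5) to (7) can be illustrated with the Python sketch below. The function names, the list layout (index 0 is the layer closest to the sink), and all power and resistance values are hypothetical and chosen only to keep the arithmetic easy to follow; real values of 𝑅𝑗 and 𝑅𝑏 would come from a thermal tool as described above.

```python
def stack_temperatures(power, r_vertical, r_base):
    """Eq. (5) for one single-tile stack; index 0 is the layer closest to the sink."""
    temps = []
    for k in range(1, len(power) + 1):
        # Each core i <= k contributes its power times the resistance below it.
        conduction = sum(power[i] * sum(r_vertical[: i + 1]) for i in range(k))
        temps.append(conduction + r_base * sum(power[:k]))
    return temps


def thermal_metric(stacks, r_vertical, r_base):
    """Eq. (7): peak temperature times the largest same-layer spread (Eq. 6)."""
    temps = [stack_temperatures(p, r_vertical, r_base) for p in stacks]
    layers = list(zip(*temps))  # layers[k] holds T_{n,k} for every stack n
    peak = max(max(layer) for layer in layers)
    spread = max(max(layer) - min(layer) for layer in layers)
    return peak * spread


# Two stacks, two layers, hypothetical powers and thermal resistances.
r_vertical, r_base = [1.0, 1.0], 0.5
hot_near_sink = [[4.0, 1.0], [2.0, 1.0]]  # high-power cores next to the sink
hot_far_from_sink = [[1.0, 4.0], [1.0, 2.0]]  # same cores, layers swapped

t_near = thermal_metric(hot_near_sink, r_vertical, r_base)
t_far = thermal_metric(hot_far_from_sink, r_vertical, r_base)
print(f"T (hot cores near sink) = {t_near}")
print(f"T (hot cores far from sink) = {t_far}")
```

With these numbers, placing the high-power cores next to the sink yields the lower value of 𝑇, consistent with the placement guideline stated earlier in this subsection.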
4.2.4 Energy requirements 
A few long-range links added to the NoC can improve per-
formance [5]. However, these long-range links are costlier 
in terms of energy. Routers with a higher number of ports 
can improve path diversity and throughput, however, larger 
routers are difficult to design and are power hungry. There-
fore, router size and link length must be optimized during 
design time to deliver high performance without consum-
ing high amounts of energy. For a system with 𝑁 tiles, 𝑅 
routers, 𝐿 planar links, and 𝑉 vertical links, the approximate 
network energy consumed is obtained using the following 
equation. 
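$$ E_{router} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \cdot \sum_{k=1}^{R} r_{ijk} \cdot \left( E_r \cdot P_k \right) \tag{8} $$
$$ E_{link} = \sum_{i=1}^{N} \sum_{j=1}^{N} f_{ij} \cdot \left( \sum_{k=1}^{L} p_{ijk} \cdot d_k \cdot E_{planar} + \sum_{k=1}^{V} q_{ijk} \cdot E_{vertical} \right) \tag{9} $$
$$ E = E_{router} + E_{link} \tag{10} $$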
Here 𝐸𝑟 denotes the average router logic energy per port 
and 𝑃𝑘 denotes the number of ports available at router 𝑘. 
The total link energy can be divided into two parts due to 
the different physical characteristics of planar and vertical 
links. 𝑓𝑖𝑗  represents the frequency of communication be-
tween core 𝑖 and core 𝑗 and can be extracted from Gem5-
GPU simulations while 𝑑𝑘 represents the physical link 
length of link 𝑘. Here, 𝑞𝑖𝑗𝑘 and 𝑟𝑖𝑗𝑘 are defined similarly to 
𝑝𝑖𝑗𝑘 (Eqn. 2) to indicate whether a vertical link or router 𝑘, respectively, is utilized 
to communicate between core 𝑖 and core 𝑗. 
𝐸𝑝𝑙𝑎𝑛𝑎𝑟 and 𝐸𝑣𝑒𝑟𝑡𝑖𝑐𝑎𝑙 denote the energy consumed per flit 
by planar metal wires and vertical links respectively. All the 
required power numbers were obtained using Synopsys 
Prime Power for 28nm nodes. The total network energy 𝐸 
is the sum of router logic and link energy.  
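
Eqs. (8) to (10) can be evaluated per traffic flow once the path of each flow is known. The Python sketch below is a minimal illustration under stated assumptions: the path dictionary format (`routers`, `planar_links`, `vertical_links`), the function name, and every energy, length, and traffic value are hypothetical and in arbitrary units, not the 28nm numbers used in this work.

```python
def network_energy(traffic, routes, ports, link_length,
                   e_router_port, e_planar, e_vertical):
    """Eqs. (8)-(10): router energy plus link energy, weighted by traffic f_ij."""
    e_router = 0.0
    e_link = 0.0
    for (src, dst), path in routes.items():
        f = traffic[src][dst]
        # Eq. (8): every router k on the path costs E_r * P_k.
        e_router += f * sum(e_router_port * ports[k] for k in path["routers"])
        # Eq. (9): planar links scale with length d_k; vertical links do not.
        planar = sum(link_length[k] * e_planar for k in path["planar_links"])
        vertical = len(path["vertical_links"]) * e_vertical
        e_link += f * (planar + vertical)
    return e_router + e_link  # Eq. (10)


# Hypothetical 3-tile example with two flows.
traffic = [[0.0, 2.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
routes = {
    (0, 1): {"routers": [0, 1], "planar_links": [0], "vertical_links": []},
    (0, 2): {"routers": [0, 1, 2], "planar_links": [0], "vertical_links": [0]},
}
energy = network_energy(
    traffic, routes,
    ports=[3, 4, 2],          # P_k for each router
    link_length=[1.0, 2.0],   # d_k for each planar link
    e_router_port=0.5, e_planar=0.2, e_vertical=0.1,
)
print(f"E = {energy:.2f} (arbitrary units)")
```

The sketch makes the trade-off explicit: a longer planar link raises the 𝑑𝑘 term, while a router with more ports raises the 𝑃𝑘 term for every flow that passes through it.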
4.2.5 Overall MOO formulation 
In the end, our aim is to find a 3D heterogeneous manycore 
design that minimizes the mean link utilization (Ū), standard 
deviation of individual link utilizations (𝜎), average latency 
between CPUs and LLCs (𝐿𝑎𝑡), temperature (𝑇), and energy 
(𝐸). It is important to note that the analytical models 
for these objectives only need to be accurate in determin-
ing which designs are better relative to one another, e.g., 
lower values of 𝑇 result in better temperatures. This allows 
us to quickly compare designs without performing detailed 
simulations during the optimization search. Since optimiz-
ing one objective may negatively affect another, it is im-
portant that these objectives are optimized simultaneously. 
For example, a thermal-only aware placement would move 
high-power cores closer to the sink [8] and possibly further 
away from cores they highly communicate with, negatively 
affecting performance and energy. We write our combined 
objective as follows: 
𝐷∗ = 𝑀𝑂𝑂(𝑂𝐵𝐽 = {Ū(𝑑), 𝜎(𝑑), 𝐿𝑎𝑡(𝑑), 𝑇(𝑑), 𝐸(𝑑)}) (11) 
where 𝐷∗ is the set of Pareto optimal designs among all 
possible 3D heterogeneous manycore system configurations 
𝐷, i.e., 𝐷∗ ⊆ 𝐷, 𝑀𝑂𝑂 is a multi-objective solver, and 
𝑂𝐵𝐽 is the set of all objectives to evaluate a candidate design 
𝑑 ∈ 𝐷. A candidate design 𝑑 consists of an adjacency 
matrix for the links (designates which pair of tiles are con-
nected via a link) and a tile placement vector (designates 
which core is placed at which tile). We also ensure that for 
all 𝑑 ∈ 𝐷, all source-destination pairs have at least one path 
between them. Since mesh is the most commonly used
NoC architecture, every design 𝑑 is constrained to have the
same number of links as a 3D mesh NoC.
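As an illustration of the design representation described above, the following minimal Python sketch checks the two constraints on a candidate design: connectivity and a link budget equal to that of a 3D mesh. The function names and the list-of-lists adjacency encoding are assumptions made for this example and are not taken from the authors' implementation.

```python
from collections import deque


def mesh_link_count(x, y, z):
    """Number of links in an x-by-y-by-z 3D mesh (planar plus vertical)."""
    planar = z * ((x - 1) * y + x * (y - 1))
    vertical = (z - 1) * x * y
    return planar + vertical


def is_connected(adj):
    """True if every tile can reach every other tile (BFS from tile 0)."""
    n = len(adj)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


def is_valid_design(adj, placement, dims):
    """Check the constraints stated in the text for a candidate design.

    adj       -- symmetric 0/1 adjacency matrix over tiles (illustrative encoding)
    placement -- placement[t] is the index of the core assigned to tile t
    dims      -- (x, y, z) dimensions of the reference 3D mesh
    """
    n = len(adj)
    links = sum(adj[u][v] for u in range(n) for v in range(u + 1, n))
    one_core_per_tile = sorted(placement) == list(range(n))
    return (one_core_per_tile
            and links == mesh_link_count(*dims)
            and is_connected(adj))
```

For a 4x4x4 system, `mesh_link_count(4, 4, 4)` evaluates to 96 planar plus 48 vertical links, i.e., 144 links in total.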
In the following section, we describe the machine learning
based MOO-STAGE algorithm that we use as the multi-objective
problem solver. However, any other MOO algorithm can
also be used.
5 DESIGN OPTIMIZATION USING MACHINE LEARNING 
In this section, we present a machine learning based opti-
mization algorithm called MOO-STAGE that is scalable with 
the size of the search space. STAGE [11] is an online learning 
algorithm originally developed to improve the performance
of local search algorithms (e.g., hill climbing) for single
objective optimization problems. The authors of [5] have
shown that STAGE can significantly outperform traditional
optimization techniques, namely simulated annealing (SA)
and genetic algorithms (GA), for NoC design optimization
with homogeneous cores.
Inspired by this success of STAGE for single objective 
NoC design optimization, we extend it to a multi-objective 
optimization setting. In this work, we propose MOO-
STAGE, a multi-objective optimization algorithm, and apply 
it to 3D manycore heterogeneous NoC design. The key idea 
behind MOO-STAGE is to intelligently explore the search 
space such that the MOO problem is efficiently solved. 
More precisely, MOO-STAGE utilizes a supervised learning 
algorithm that leverages past search experience (Local 
search) to learn an evaluation function that can then esti-
mate the outcome of performing a local search from any 
given state in the design space (Meta search). In practice, 
the MOO-STAGE algorithm iteratively executes Local and 
Meta searches in a sequence as shown in Fig. 5.  
Fig. 5 shows a high-level overview of how MOO-STAGE 
works. The first stage (Local search) performs a search from 
a given starting state, guided by a cost function considering
all objectives. Then, the search trajectories collected
from the Local search are used in the next stage (Meta
search) to learn an evaluation function. This evaluation
function attempts to learn the potential (quantified using 
the cost function) of performing a Local search starting 
from a particular state. This allows the algorithm to prune 
away bad starting states to reduce the number of local 
search calls needed to find (near-) optimal designs in the 
given design space. Unlike MOO-STAGE, other MOO algo-
rithms based on random restarts do not leverage any such 
knowledge and spend significant time searching from 
states that would otherwise be rejected by MOO-STAGE. 
Therefore, MOO-STAGE explicitly guides the search to-
wards promising areas of the search space much faster than 
conventional MOO algorithms. Below we describe the de-
tails of the MOO-STAGE algorithm. 
5.1 MOO-STAGE: Local Search  
Given an objective, the goal of a local search algorithm 
(e.g., greedy search or SA) is to traverse through a sequence 
of neighboring states to find a solution that minimizes the 
objective. To accommodate multiple objectives, we employ 
the Pareto hypervolume (PHV) [34] metric to evaluate the 
quality of a set of solutions (higher is better). The PHV is
the hypervolume of the portion of the objective space that is
dominated by a set of solutions, and it serves as a measure of
the quality of Pareto set approximations [34]. With all
objectives minimized, a design 𝑃 dominates a design 𝑄 (𝑃 ≺ 𝑄) when
∀𝑖: 𝑂𝑏𝑗𝑖(𝑃) ≤ 𝑂𝑏𝑗𝑖(𝑄) ∧ ∃𝑖: 𝑂𝑏𝑗𝑖(𝑃) < 𝑂𝑏𝑗𝑖(𝑄)
Local search guided by the PHV heuristic has two strong 
advantages over other metrics for comparing solutions 
[35]: 1) The PHV captures the improvement in any objec-
tive. If a new set of solutions has a better PHV than the cur-
rent set of solutions, then the new set of solutions covers 
more of the objective space and better captures the trade-
offs between objectives. 2) PHV allows the handling of any 
number of objectives as part of the MOO problem (i.e., 
generality) since PHV maps to a single output (cost). This is 
particularly useful for learning the evaluation function via a 
regression learning algorithm.  
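The dominance relation and the non-dominated filtering step used in Algorithms 1 and 2 can be sketched in a few lines of Python. The sketch assumes that all objectives are minimized and follows the convention of Algorithm 1, in which a design 𝑑 is discarded whenever some 𝑘 satisfies 𝑘 ≺ 𝑑; the function names are chosen for this example only.

```python
def dominates(p, q):
    """True if objective vector p dominates q (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def non_dominated(points):
    """Keep only the points that no other point in the set dominates."""
    return [p for p in points
            if not any(dominates(k, p) for k in points)]
```

For example, among the objective vectors (1, 3), (2, 2), and (3, 3), the last one is dominated by (2, 2) and is removed, while the first two represent a genuine tradeoff and are both kept.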
Algorithm 1. Local Search: 𝑙𝑜𝑐𝑎𝑙(𝑂𝑏𝑗, 𝑑𝑠𝑡𝑎𝑟𝑡)
Input: 𝑂𝑏𝑗 (Set of optimization objectives),
       𝑑𝑠𝑡𝑎𝑟𝑡 (Starting design)
Output: 𝑆𝑙𝑜𝑐𝑎𝑙 (Non-dominated set of designs),
        𝑆𝑡𝑟𝑎𝑗 (Trajectory set), 𝑑𝑙𝑎𝑠𝑡 (Last design)
1: Initialize: 𝑆𝑙𝑜𝑐𝑎𝑙 ← {𝑑𝑠𝑡𝑎𝑟𝑡}, 𝑆𝑡𝑟𝑎𝑗 ← {𝑑𝑠𝑡𝑎𝑟𝑡}, 𝑑𝑐𝑢𝑟𝑟 ← 𝑑𝑠𝑡𝑎𝑟𝑡
2: While 1:
3:   𝑑𝑛𝑒𝑥𝑡 ← arg max_{𝑑 ∈ 𝑛𝑒𝑖𝑔ℎ(𝑑𝑐𝑢𝑟𝑟)} 𝑃𝐻𝑉𝑂𝑏𝑗(𝑆𝑙𝑜𝑐𝑎𝑙 ∪ {𝑑})
4:   If 𝑃𝐻𝑉𝑂𝑏𝑗(𝑆𝑙𝑜𝑐𝑎𝑙 ∪ {𝑑𝑛𝑒𝑥𝑡}) > 𝑃𝐻𝑉𝑂𝑏𝑗(𝑆𝑙𝑜𝑐𝑎𝑙):
5:     𝑆𝑙𝑜𝑐𝑎𝑙 ← 𝑆𝑙𝑜𝑐𝑎𝑙 ∪ {𝑑𝑛𝑒𝑥𝑡}
       𝑆𝑙𝑜𝑐𝑎𝑙 ← {𝑑 ∈ 𝑆𝑙𝑜𝑐𝑎𝑙 | (∄𝑘 ∈ 𝑆𝑙𝑜𝑐𝑎𝑙)[𝑘 ≺ 𝑑]}
6:   Else:
7:     Return (𝑆𝑙𝑜𝑐𝑎𝑙, 𝑆𝑡𝑟𝑎𝑗, 𝑑𝑙𝑎𝑠𝑡 ← 𝑑𝑐𝑢𝑟𝑟)
8:   𝑑𝑐𝑢𝑟𝑟 ← 𝑑𝑛𝑒𝑥𝑡
9:   𝑆𝑡𝑟𝑎𝑗 ← 𝑆𝑡𝑟𝑎𝑗 ∪ {𝑑𝑐𝑢𝑟𝑟}

[Fig. 5 diagram omitted: the Local Search block (search guided by 𝑂1, 𝑂2, …, 𝑂𝑛 from a starting state) passes new search trajectories to the Meta Search block to improve the evaluation function; the Meta Search block (learn the evaluation function and search for promising starting states) returns good starting states to find better solutions. Inset: the evaluation function labels potential starting states as good or poor.]
Fig. 5. Overview of the MOO-STAGE algorithm

To compute the PHV, we employ a fast and scalable PHV
algorithm called hypervolume by slicing objectives [36]. It
employs the divide-and-conquer principle to achieve
efficiency: it repeatedly divides the PHV computation into
simpler problems with fewer objectives and aggregates the
solutions of simpler problems to compute the total
hypervolume.
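The slicing idea can be illustrated with a short recursive Python sketch. It is a simplified illustration in the spirit of [36], not the optimized algorithm itself; it assumes that all objectives are minimized and that a reference point `ref`, worse than every design in every objective, bounds the dominated region. The reference point is an assumption of this example, since the text does not specify one.

```python
def hypervolume(points, ref):
    """Hypervolume dominated by `points` and bounded by `ref` (minimization).

    The first objective is sliced into slabs; inside each slab the problem
    reduces to a hypervolume over the remaining objectives.
    """
    # Only points strictly better than the reference point contribute.
    pts = [p for p in points if all(a < b for a, b in zip(p, ref))]
    if not pts:
        return 0.0
    if len(ref) == 1:
        return float(ref[0] - min(p[0] for p in pts))
    pts.sort(key=lambda p: p[0])
    volume = 0.0
    for i, p in enumerate(pts):
        upper = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        width = upper - p[0]
        if width > 0:
            # Points with a first objective <= p[0] are active in this slab.
            active = [q[1:] for q in pts[:i + 1]]
            volume += width * hypervolume(active, ref[1:])
    return volume
```

With the two designs (1, 2) and (2, 1) and the reference point (3, 3), the two dominated rectangles have area 2 each and overlap in a unit square, so the sketch returns 3.0.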
In this work, we use a simple greedy search with the ob-
jective of maximizing PHV with respect to the input objec-
tive set (𝑃𝐻𝑉𝑂𝑏𝑗) as the local search procedure (Algorithm 
1). However, it should be noted that greedy search has been 
employed as an example case only. Any other local search 
method, e.g., SA, can be used to similar effect. Starting from 
the initial state 𝑑𝑠𝑡𝑎𝑟𝑡, we find the best neighboring state 
(𝑛𝑒𝑖𝑔ℎ(𝑑𝑐𝑢𝑟𝑟)) that improves the PHV heuristic at each 
greedy search step (Algorithm 1, line 3). In the context of 
designing 3D heterogeneous systems, a neighboring state 
is where exactly one planar link is repositioned or two tiles 
are swapped (both irrespective of layers). If this best neigh-
boring state improves the PHV value, we add this state to 
the set of local optima (𝑆𝑙𝑜𝑐𝑎𝑙) while ensuring that all designs 
in 𝑆𝑙𝑜𝑐𝑎𝑙 are non-dominated (Algorithm 1, lines 4-5). This is 
repeated until the best neighboring state does not improve 
the PHV value, at which point we return the local optima
set, search trajectory (𝑆𝑡𝑟𝑎𝑗 = {𝑑𝑠𝑡𝑎𝑟𝑡, …, 𝑑𝑙𝑎𝑠𝑡}), and the final
search state (𝑑𝑙𝑎𝑠𝑡). Essentially, the local search explores the
neighborhood of the current solutions to expand the Pa-
reto front to dominate as much of the objective space as 
possible.  
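The greedy procedure described above can be sketched in Python as follows. The sketch mirrors the steps of Algorithm 1 but is not the authors' code: the callables `evaluate`, `neigh`, and `phv`, the two-objective helper `hv2d` with its reference point, and the toy design space at the end are all assumptions made for illustration.

```python
def dominates(p, q):
    """p dominates q when all objectives are minimized."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def hv2d(points, ref=(10.0, 10.0)):
    """Hypervolume for two minimized objectives (simple sweep over x)."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def local_search(evaluate, neigh, phv, d_start):
    """Greedy PHV-guided local search following the steps of Algorithm 1.

    evaluate -- maps a design to its tuple of minimized objective values
    neigh    -- maps a design to a non-empty list of neighboring designs
    phv      -- maps a list of objective tuples to a hypervolume
    """
    def score(designs):
        return phv([evaluate(d) for d in designs])

    s_local, s_traj, d_curr = [d_start], [d_start], d_start
    while True:
        # Line 3: neighbor whose addition maximizes the PHV of the set.
        d_next = max(neigh(d_curr), key=lambda d: score(s_local + [d]))
        if score(s_local + [d_next]) > score(s_local):
            # Lines 4-5: accept it and keep only non-dominated designs.
            s_local = s_local + [d_next]
            s_local = [d for d in s_local
                       if not any(dominates(evaluate(k), evaluate(d))
                                  for k in s_local)]
        else:
            # Lines 6-7: no neighbor improves the PHV, so stop.
            return s_local, s_traj, d_curr
        d_curr = d_next          # line 8
        s_traj.append(d_curr)    # line 9


# Toy demo (hypothetical): a design is an integer x in [0, 6]
# with the two conflicting objectives (x, 6 - x).
toy_eval = lambda x: (float(x), float(6 - x))
toy_neigh = lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 6]
s_local, s_traj, d_last = local_search(toy_eval, toy_neigh, hv2d, 3)
```

In the toy problem every design is Pareto optimal, so each accepted step enlarges the dominated region and the search stops only when it reaches the boundary of the design space.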
5.2 MOO-STAGE: Meta Search  
The second and key component of MOO-STAGE is the 
learning phase, also known as the meta-search. For stand-
ard local search procedures, one of the key limitations is 
that the quality of the local search critically depends on the 
starting point of the search process (𝑑𝑠𝑡𝑎𝑟𝑡). Although algo-
rithms like SA try to mitigate this effect by incorporating 
some random exploration, they are still limited by the local 
nature of the search. If the search repeatedly begins near 
poor local minima, it is possible that the search will never 
find a high-quality solution. MOO-STAGE attempts to solve 
this problem by learning a function approximator (evalua-
tion function) using previous local search data that can pre-
dict the outcome of a local search procedure from a partic-
ular starting point. Using this evaluation function MOO-
STAGE intelligently selects starting states with a high po-
tential to lead to better quality solutions and subsequently, 
significantly reduces the computation time. We discuss the 
details of this procedure in the following paragraphs.  
After completing the local search, we add the local op-
tima set to the global optima set (𝑆𝑔𝑙𝑜𝑏𝑎𝑙) ensuring that all 
states in the global optima set are non-dominated (Algorithm
2, lines 3-4). If the local optima set did not add any new
entries to the global optima set, MOO-STAGE completes
and returns the global optima set (Algorithm 2, lines 5-6).
Otherwise, we add the local search trajectory (𝑆𝑡𝑟𝑎𝑗) and 
PHV of this trajectory (𝑃𝐻𝑉𝑂𝑏j(𝑆𝑡𝑟𝑎𝑗)) as a training example 
to the aggregated training set (𝑆𝑡𝑟𝑎𝑖𝑛) and learn the evalua-
tion function 𝐸𝑣𝑎𝑙 using 𝑆𝑡𝑟𝑎𝑖𝑛 (Algorithm 2, lines 7-8). In 
this work, we employ Regression Forest as the base learner 
for creating 𝐸𝑣𝑎𝑙. Regression Forest is only used as an ex-
ample here and other regression learners that are quick to 
evaluate and sufficiently expressive to fit the training data 
can be used to similar effect.  
Given the function 𝐸𝑣𝑎𝑙, we use a standard greedy 
search to optimize 𝐸𝑣𝑎𝑙 beginning at the last state of the 
local search (𝑑𝑙𝑎𝑠𝑡) to find the starting state for the next local 
search iteration (𝑑𝑟𝑒𝑠𝑡𝑎𝑟𝑡). If 𝑑𝑙𝑎𝑠𝑡=𝑑𝑟𝑒𝑠𝑡𝑎𝑟𝑡, we choose a ran-
dom design from the design space instead (𝑟𝑎𝑛𝑑(𝐷)) (Al-
gorithm 2, lines 9-13). Using these two computational 
search processes (Local search and Meta search), MOO-
STAGE progressively learns the structure of the solution 
space and improves Eval. Essentially, the algorithm at-
tempts to learn a regressor that can predict the PHV of the 
local optima from any starting design and explicitly guides 
the search towards predicted high-quality starting designs.  
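The interplay of Local and Meta search can be sketched end to end in Python. This is a hedged, dependency-free illustration rather than the authors' implementation: a 1-nearest-neighbor regressor stands in for the Regression Forest, convergence is tested as "the local search added no new entries to the global set" (the condition stated in the text), and the callables, the `hv2d` helper, and the one-dimensional toy design space are assumptions of this example.

```python
import random


def dominates(p, q):
    """p dominates q when all objectives are minimized."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto(designs, evaluate):
    """Non-dominated subset of a list of designs."""
    return [d for d in designs
            if not any(dominates(evaluate(k), evaluate(d)) for k in designs)]


def hv2d(points, ref=(20.0, 50.0)):
    """Hypervolume for two minimized objectives (reference point assumed)."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def local_search(d_start, evaluate, neigh, phv):
    """Greedy PHV-guided local search (same steps as Algorithm 1)."""
    score = lambda ds: phv([evaluate(d) for d in ds])
    s_local, s_traj, d_curr = [d_start], [d_start], d_start
    while True:
        d_next = max(neigh(d_curr), key=lambda d: score(s_local + [d]))
        if score(s_local + [d_next]) <= score(s_local):
            return s_local, s_traj, d_curr
        s_local = pareto(s_local + [d_next], evaluate)
        d_curr = d_next
        s_traj.append(d_curr)


def train_nearest(examples):
    """Stand-in for the Regression Forest: 1-nearest-neighbor regression."""
    data = list(examples)

    def predict(x):
        nearest = min(data, key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], x)))
        return nearest[1]
    return predict


def greedy(score, neigh, d):
    """Hill-climb on the learned score until no neighbor is strictly better."""
    while True:
        best = max(neigh(d), key=score)
        if score(best) <= score(d):
            return d
        d = best


def moo_stage(evaluate, neigh, phv, featurize, rand_design, iter_max):
    s_global, s_train = [], []
    d_start = rand_design()
    for _ in range(iter_max):
        s_local, s_traj, d_last = local_search(d_start, evaluate, neigh, phv)
        new = [d for d in s_local if d not in s_global]
        merged = pareto(s_global + new, evaluate)            # lines 3-4
        converged = all(d in s_global for d in merged)        # nothing new added
        s_global = merged
        if converged:                                         # lines 5-6
            return s_global
        target = phv([evaluate(d) for d in s_traj])           # line 7
        s_train += [(featurize(d), target) for d in s_traj]
        model = train_nearest(s_train)                        # line 8
        d_restart = greedy(lambda d: model(featurize(d)), neigh, d_last)  # line 9
        d_start = rand_design() if d_restart == d_last else d_restart     # lines 10-13
    return s_global


# Toy demo (hypothetical): integer designs in [0, 12], objectives (x, (x - 6)^2).
rng = random.Random(0)
front = moo_stage(
    evaluate=lambda x: (float(x), float((x - 6) ** 2)),
    neigh=lambda x: [n for n in (x - 1, x + 1) if 0 <= n <= 12],
    phv=hv2d,
    featurize=lambda x: (float(x),),
    rand_design=lambda: rng.randint(0, 12),
    iter_max=20,
)
```

In this toy problem only designs with 𝑥 ≤ 6 can be Pareto optimal, so the returned set is a subset of those designs regardless of the random restarts.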
6 EXPERIMENTAL RESULTS 
6.1 Experimental Setup 
To obtain network- and processor-level information, we use 
the Gem5-GPU full-system simulator [30]. The CPU cores 
are based on the x86 architecture while the GPUs are based 
on the NVIDIA Maxwell architecture. Here, each GPU core 
is analogous to a Streaming Multiprocessor (SM) in NVIDIA
terminology. Within each GPU core, we have 32 shader pro-
cessors. The architecture of an individual GPU core is similar 
to a GPU Compute Unit (CU) described in [30]. The CPUs 
operate at 2.5 GHz while the GPUs operate at 0.7 GHz. The 
core power profiles have been extracted using GPUWattch 
[37] and McPAT [38]. The core temperatures have been
obtained using 3D-ICE [33]. Due to the high power densities
in 3D ICs, we incorporate microfluid-based cooling
techniques to reduce core temperatures. In this work, we also
adopt Reciprocal Design Symmetry (RDS) floor-planning 
[20] to reduce the direct overlap of core areas as much as 
possible.  
Algorithm 2. MOO-STAGE
Input: 𝑂𝑏𝑗 (Set of optimization objectives),
       𝑖𝑡𝑒𝑟𝑚𝑎𝑥 (Maximum iterations), 𝐷 (Design space)
Output: 𝑆𝑔𝑙𝑜𝑏𝑎𝑙 (Non-dominated set of designs)
1: Initialize: 𝑆𝑔𝑙𝑜𝑏𝑎𝑙 ← ∅, 𝑆𝑡𝑟𝑎𝑖𝑛 ← ∅, 𝑑𝑠𝑡𝑎𝑟𝑡 ← 𝑟𝑎𝑛𝑑(𝐷)
2: For 𝑖 = 0 to 𝑖𝑡𝑒𝑟𝑚𝑎𝑥:
3:   (𝑆𝑙𝑜𝑐𝑎𝑙, 𝑆𝑡𝑟𝑎𝑗, 𝑑𝑙𝑎𝑠𝑡) ← 𝑙𝑜𝑐𝑎𝑙(𝑂𝑏𝑗, 𝑑𝑠𝑡𝑎𝑟𝑡)
4:   Maintain non-dominated global set:
       𝑆𝑔𝑙𝑜𝑏𝑎𝑙 ← 𝑆𝑔𝑙𝑜𝑏𝑎𝑙 ∪ 𝑆𝑙𝑜𝑐𝑎𝑙
       𝑆𝑔𝑙𝑜𝑏𝑎𝑙 ← {𝑑 ∈ 𝑆𝑔𝑙𝑜𝑏𝑎𝑙 | (∄𝑘 ∈ 𝑆𝑔𝑙𝑜𝑏𝑎𝑙)[𝑘 ≺ 𝑑]}
5:   If 𝑆𝑔𝑙𝑜𝑏𝑎𝑙 ∩ 𝑆𝑙𝑜𝑐𝑎𝑙 = ∅: [If algorithm converged]
6:     Return 𝑆𝑔𝑙𝑜𝑏𝑎𝑙
7:   Add training example for each design 𝑑 ∈ 𝑆𝑡𝑟𝑎𝑗:
       𝑆𝑡𝑟𝑎𝑖𝑛 ← 𝑆𝑡𝑟𝑎𝑖𝑛 ∪ {(𝑑, 𝑃𝐻𝑉𝑂𝑏𝑗(𝑆𝑡𝑟𝑎𝑗))}
8:   Train evaluation function: 𝐸𝑣𝑎𝑙 ← 𝑡𝑟𝑎𝑖𝑛(𝑆𝑡𝑟𝑎𝑖𝑛)
9:   Greedy Search: 𝑑𝑟𝑒𝑠𝑡𝑎𝑟𝑡 ← 𝑔𝑟𝑒𝑒𝑑𝑦(𝐸𝑣𝑎𝑙, 𝑑𝑙𝑎𝑠𝑡)
10:  If 𝑑𝑙𝑎𝑠𝑡 = 𝑑𝑟𝑒𝑠𝑡𝑎𝑟𝑡:
11:    𝑑𝑠𝑡𝑎𝑟𝑡 ← 𝑟𝑎𝑛𝑑(𝐷)
12:  Else
13:    𝑑𝑠𝑡𝑎𝑟𝑡 ← 𝑑𝑟𝑒𝑠𝑡𝑎𝑟𝑡
14: Return 𝑆𝑔𝑙𝑜𝑏𝑎𝑙

To implement different NoC topologies, we modified the
Garnet network in Gem5-GPU. In this work, we use a
standard three-stage router; however, the proposed design
methodology is independent of the number of router
stages. The 3D mesh NoCs use XYZ-dimension order routing
while the proposed architectures use ALASH routing
[39]. It should be noted here that the proposed architectures
do not have a regularity constraint and hence XYZ-dimension
order routing cannot be employed as in the case
of 3D Mesh NoCs. The memory system uses a MESI Two-Level
cache coherence protocol. Each CPU and GPU core has
private L1 data and instruction caches of 32 KB each. Each
LLC consists of 256 KB of memory.
To evaluate our proposed MOO-STAGE, we consider two 
reference algorithms, AMOSA [10] and PCBB [12]. AMOSA 
is a widely employed algorithm for multi-objective optimi-
zation due to its ability to achieve near optimal solu-
tions [10]. On the other hand, PCBB is a recently proposed 
branch-and-bound based technique used for task mapping 
in an NoC-based system considering multiple objec-
tives [12]. PCBB outperforms standard branch and bound 
techniques due to two key features: a) a prioritization strat-
egy that prioritizes more prominent tasks to help prune 
branches earlier in the process and reduce computational 
complexity; and b) a compensation factor that allows 
tradeoffs between bound computational overhead and ac-
curacy. We have adapted PCBB for heterogeneous 3D NoC 
design as follows. First, we divide the branching decisions 
into two stages, node placement followed by link place-
ment. Second, we estimate the bound of a branch using a 
roll-out procedure by virtually placing the remaining un-
placed cores and links following several well-known map-
ping strategies (i.e., greedy, random, and small-world). 
Lastly, similar to [12], we combine the objectives into a sin-
gle metric. We prune a branch only if the bounds are worse 
even after being adjusted by a compensation factor, indi-
cating that the branch is unlikely to produce a good solu-
tion even after accounting for bound estimation error [12]. 
We evaluate the algorithms based on runtime and qual-
ity of solutions. Given the set of Pareto-optimal solutions 
𝐷∗ specified by (11) for each MOO solver considered here 
(i.e., AMOSA, PCBB, and MOO-STAGE), we run detailed sim-
ulations on this subset of solutions to get absolute values 
for energy, performance, and temperature. Here, as an
example, each NoC solution is characterized by its network
energy-delay product (EDP), a combined metric for
performance and energy that is defined as the product of
network latency and energy consumption. All experiments
have been run on an Intel® Xeon® CPU E5-2620 @ 2GHz
machine with 16 GB RAM running CentOS 6. The code for
MOO-STAGE, AMOSA, and PCBB has been made available
on GitHub [40].
6.2 Optimization Parameters 
System designers often have many different and perhaps 
conflicting objectives. Therefore, we look at several cases 
with different numbers of objectives for the proposed 3D
heterogeneous architecture. As an example, we consider 
three different cases:  
Case 1 – {𝑈, 𝜎} We consider mean (Eq. 3) and standard 
deviation of link utilization (Eq. 4).  
Case 2 – {𝑈, 𝜎, 𝐿𝑎𝑡} We add CPU-LLC latency (Eq. 1) to 
Case 1.  
Case 3 – {𝑈, 𝜎, 𝐿𝑎𝑡, 𝐸} We add energy (Eq. 10) to Case 2.  
However, new objectives can be judiciously added to fit the
designer's goals and constraints.
Also, since both AMOSA and MOO-STAGE rely on the 
structure of the design space, we define what constitutes a 
neighboring design. In the context of designing 3D heter-
ogeneous systems, a neighboring state is where exactly 
one planar link is repositioned or two tiles are swapped 
(can be between tiles in the same or different layers). On 
the other hand, since PCBB is based on branch-and-bound, 
it does a systematic enumeration of the candidate solu-
tions. Finally, our goal here is to optimize the placement of 
CPUs, LLCs, GPUs, and planar links such that they improve 
the design objectives.  
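The neighborhood definition above can be made concrete with a small Python sketch that draws one random neighbor of a design. The encoding is an assumption of this example: planar links are stored as pairs (𝑢, 𝑣) with 𝑢 < 𝑣, tile 𝑡 is taken to lie in layer 𝑡 // `tiles_per_layer`, and the two move types are chosen with equal probability. The sketch does not re-check the connectivity constraint of Section 4.2.5, which is left to the caller, and it assumes at least one unused planar link position exists.

```python
import random


def random_neighbor(planar_links, placement, tiles_per_layer, rng):
    """Return one neighboring design as (links, placement).

    Either exactly one planar link is repositioned, or the cores on two
    tiles are swapped (the tiles may be in the same or different layers).
    """
    links = set(planar_links)
    new_placement = list(placement)
    n = len(new_placement)
    if rng.random() < 0.5:
        # Unused planar positions: both endpoints in the same layer.
        free = [(u, v) for u in range(n) for v in range(u + 1, n)
                if u // tiles_per_layer == v // tiles_per_layer
                and (u, v) not in links]
        links.remove(rng.choice(sorted(links)))
        links.add(rng.choice(free))
    else:
        a, b = rng.sample(range(n), 2)
        new_placement[a], new_placement[b] = new_placement[b], new_placement[a]
    return frozenset(links), tuple(new_placement)
```

Each call changes exactly one aspect of the design, so the number of planar links is preserved by construction.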
6.3 Finding Better Solutions Using Machine Learning 
In this section, we investigate the performance of PCBB,
AMOSA, and MOO-STAGE for the problem of 3D heterogeneous
NoC design. More specifically, we investigate their ability
to optimize the core and planar link placement for 3D
heterogeneous architectures. Here, we consider a 64-tile
system with 8 CPUs, 16 LLCs, and 40 GPUs. The numbers of
planar links and TSVs are kept the same as in a 3D mesh NoC
of the same size. When comparing the algorithms,
we present the average EDP of multiple runs from the same
starting NoC configuration.
For brevity, in Fig. 6 we present the BFS results averaged 
over multiple runs as a representative example. Similar ob-
servations are made for all other applications as well. Fig. 6 
shows the evolution of the best solution’s EDP over time for 
AMOSA and MOO-STAGE for all three optimization cases. 
Since PCBB is not an anytime algorithm, we can only show 
the total run-time needed to complete the branch-and-
bound enumeration. This is discussed later in Table 2.  
[Fig. 6 plots omitted: three panels, (a), (b), and (c), of normalized EDP versus time (hrs) for AMOSA and MOO-STAGE. Annotated times: (a) TSTAGE = 2 hrs, TAMOSA = 4 hrs; (b) TSTAGE = 7 hrs, TAMOSA = 35 hrs; (c) TSTAGE = 9 hrs, TAMOSA = 85 hrs.]
Fig. 6. Normalized quality of NoC solutions (EDP) obtained using AMOSA and MOO-STAGE for (a) two objectives ({Ū, 𝜎}), (b) three objectives ({Ū, 𝜎, 𝐿𝑎𝑡}), and (c) four objectives ({Ū, 𝜎, 𝐿𝑎𝑡, 𝐸}) for the BFS benchmark.

It is evident from Fig. 6 that MOO-STAGE achieves lower
EDP values significantly faster than AMOSA. To further
demonstrate this, we define two metrics: 𝑇𝑀𝑂𝑂−𝑆𝑇𝐴𝐺𝐸, which
is the time required for MOO-STAGE to converge, and
𝑇𝐴𝑀𝑂𝑆𝐴, which is the time needed for AMOSA to generate
solutions of similar quality. However, AMOSA never finds
the best solution that MOO-STAGE obtains, even after
significantly longer durations, for the three- and four-objective
optimization. For these cases, 𝑇𝐴𝑀𝑂𝑆𝐴 is defined as the time
AMOSA takes to reach within 3% of the best solution
quality of MOO-STAGE in terms of EDP. It is clear from Fig. 6 that
the speed-up MOO-STAGE achieves increases as
the number of objectives increases. With four objectives,
MOO-STAGE converges after approximately 𝑇𝑀𝑂𝑂−𝑆𝑇𝐴𝐺𝐸 =
9 hours while AMOSA takes approximately 𝑇𝐴𝑀𝑂𝑆𝐴 =
85 hours to come within 3% of MOO-STAGE’s solution
quality. Therefore, MOO-STAGE achieves an approximately
9.4x speed-up in optimization time compared to
AMOSA.
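The two metrics and the resulting speed-up can be expressed as a short Python sketch. The function names and the trace format, a time-ordered list of (time in hours, best EDP so far) pairs, are assumptions of this example; only the 3% tolerance and the 9-hour and 85-hour figures come from the text.

```python
def time_to_within(trace, target_edp, tolerance=0.03):
    """First time at which the best-so-far EDP is within `tolerance` of target.

    trace -- list of (time_hours, best_edp) pairs sorted by time
    Returns None if the target quality is never reached.
    """
    for t, edp in trace:
        if edp <= target_edp * (1.0 + tolerance):
            return t
    return None


def speed_up(t_amosa, t_moo_stage):
    """Ratio of AMOSA's time to MOO-STAGE's convergence time."""
    return t_amosa / t_moo_stage


# Four-objective BFS case from the text: 85 hours versus 9 hours.
four_objective_speed_up = speed_up(85, 9)
```

Here 85 / 9 ≈ 9.44, which is the value reported as 9.4 for the four-objective BFS case.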
The significant improvement in optimization time can be 
attributed to the fact that MOO-STAGE performs active 
learning. In machine learning literature, it is well-known 
that the active learning paradigm is exponentially more ef-
ficient than passive supervised learning [41]. Similar to 
other active learning algorithms, e.g., DAgger [42],
MOO-STAGE aggregates learning examples over multiple
iterations to reduce the amount of training data needed to learn
a target concept. This guarantees that only a small number
of trajectories are needed to achieve good generalization 
behavior with the learned function [42] and accurately eval-
uate the entire input design space. As a result, after a few 
iterations MOO-STAGE achieves a near-accurate evaluation 
function to speed-up the optimization process.  
Table 2 shows the speed-up with MOO-STAGE com-
pared to AMOSA and PCBB for all applications under Cases 
1, 2, and 3 (Section 6.2). MOO-STAGE achieves significant 
gains in convergence time for all applications and number 
of objectives. Note that due to the large execution time for 
PCBB, we only show the two-objective optimization case 
(Case 1) for PCBB. However, increasing the number of ob-
jectives will reduce the number of branches that are pruned 
and exponentially increase the run-time. This would result 
in even worse three- and four-objective run-times for PCBB.  
As seen from Table 2, even for the simpler two-objective
optimization, PCBB takes 141x longer on average to find
a solution of similar quality to MOO-STAGE. This is mainly
due to the sheer size of the design space of 3D NoCs. For
more intuition, in a 4x4x4 (64-tile) system with 144 links (96
planar + 48 vertical), the total number of possible tile
placements is 64 factorial. Then, each of these tile placements
has 𝐶(𝐶(16,2) ∗ 4, 96) different ways to place the planar
links. Although PCBB manages to prune well over
99.99% of this solution space, the tiny fraction that is left
consists of several millions of possible solutions. This is
significantly more than MOO-STAGE or AMOSA explore, leading to
worse execution times.
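The size estimate above can be reproduced with Python's arbitrary-precision integers. The sketch simply evaluates the two factors given in the text, 64! tile placements and 𝐶(𝐶(16,2) ∗ 4, 96) planar link placements, treating every tile as distinguishable as the text does; the variable names are chosen for this example.

```python
import math

tiles, layers = 64, 4
tiles_per_layer = tiles // layers                         # 16 tiles per layer
planar_slots = math.comb(tiles_per_layer, 2) * layers     # C(16, 2) * 4 = 480
planar_links = 96                                         # same as a 4x4x4 mesh

tile_placements = math.factorial(tiles)                   # 64!
link_placements = math.comb(planar_slots, planar_links)   # C(480, 96)
total = tile_placements * link_placements

print(f"candidate planar link positions: {planar_slots}")
print(f"design space size has {len(str(total))} decimal digits")
```

Even a pruning rate well above 99.99% leaves an enormous number of candidates when the starting count has this many digits.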
On the other hand, MOO-STAGE reduces the optimiza-
tion time over AMOSA by 1.5X, 5.8X, and 10.7X on average 
for the two-, three-, and four-objective cases respectively. 
Table 2 also shows that MOO-STAGE can obtain high-quality
solutions in a shorter amount of time irrespective of the
application.
To demonstrate MOO-STAGE’s ability to learn a function 
that maps the design space to the objective space, we show 
the prediction error of the evaluation function 𝐸𝑣𝑎𝑙 (Algo-
rithm 2, lines 7-8) as a function of time for all considered 
cases (Section 6.2, Cases 1-3) in Fig. 7 considering BFS as 
an example. The prediction error (in %) represents the dif-
ference between the estimated PHV value obtained by 𝐸𝑣𝑎𝑙 
and the actual PHV value obtained by the subsequent local 
search. From Fig. 7, we note that irrespective of the number 
of objectives, after only a few hours, the prediction error is 
less than 5%. This low error rate indicates that the evalua-
tion function 𝐸𝑣𝑎𝑙 can accurately predict good starting 
points for the local search. Hence, MOO-STAGE continu-
ously improves its search by choosing promising starting 
points. As seen in Fig. 6 and Table 2, this allows MOO-
STAGE to reduce the total number of searches necessary 
and find solutions much more quickly than AMOSA’s rela-
tively random explorations and PCBB’s systematic enumer-
ation of the entire candidate set.  
In Section 3, we studied the traffic patterns generated 
by different applications. We found that the traffic patterns 
of different applications on heterogeneous platforms ex-
hibit a set of similar characteristics. Therefore, we conjec-
tured that we could utilize a heterogeneous platform opti-
mized for one application to run other applications. Taking 
advantage of the similarities in application traffic character-
istics seen in Section 3, we undertake the design of appli-
cation-agnostic NoC architectures using MOO-STAGE in 
the next sections. 
[Fig. 7 plot omitted: prediction error (%) versus time (hrs) for the two-, three-, and four-objective cases.]
Fig. 7. Prediction error for MOO-STAGE (64-tile system, BFS).

TABLE 2
MOO-STAGE SPEED-UP OVER PCBB AND AMOSA
              Two-obj   Two-obj   Three-obj   Four-obj
Application   PCBB      AMOSA     AMOSA       AMOSA
BP            130       1.5       6.4         12.5
BFS           135       2.0       5.0         9.4
CDN           146       1.5       5.8         13.7
GAU           134       1.3       6.0         7.2
HS            144       1.5       8.0         10.0
LEN           145       2.0       5.8         14.2
LUD           140       1.3       5.0         10.0
NW            150       1.5       5.0         11.4
KNN           148       1.2       6.4         7.5
PF            142       1.2       5.0         11.4
Average       141.4     1.5       5.8         10.7

6.4 Application-Agnostic NoC Design
In this section, we validate our observations and show that
NoCs optimized for one application can show similar
performance for other applications as well. Here, we consider
the four-objective optimization problem (Section 6.2, Case
3) as an example to reduce network energy and CPU-LLC
latency, while improving the GPU-LLC throughput. So,
principally we focus on enhancing the network efficiency
(performance) here. To show our approach’s applicability to
different system sizes, we consider the optimization of a
36-tile (4 CPUs, 8 LLCs, and 24 GPUs arranged in four layers
of 3x3 cores) and a 64-tile system (8 CPUs, 16 LLCs, and 40
GPUs arranged in four layers of 4x4 cores).
To design an application-agnostic NoC, we consider two 
cases: generate an NoC optimized for a) each application 
(denoted by its application) and b) a set of several applica-
tions, using an aggregated traffic profile (AVG). For each of 
the 𝑁 applications, we create a different AVG NoC (a total 
of 𝑁 AVG NoCs) using the set of remaining 𝑁 − 1 applica-
tions (leave-one-out).  
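The leave-one-out construction can be sketched in Python as follows. The text only states that an aggregated traffic profile is used; taking the element-wise arithmetic mean of the per-application traffic matrices is an assumption of this example, as are the function name and the dictionary-of-matrices input format.

```python
def leave_one_out_profiles(traffic):
    """For each application, average the traffic matrices of all the others.

    traffic -- dict mapping an application name to its n-by-n traffic matrix
    Returns a dict mapping each held-out application to the AVG profile
    built from the remaining N - 1 applications.
    """
    profiles = {}
    for held_out in traffic:
        others = [m for app, m in traffic.items() if app != held_out]
        n = len(others[0])
        profiles[held_out] = [
            [sum(m[i][j] for m in others) / len(others) for j in range(n)]
            for i in range(n)
        ]
    return profiles
```

The NoC optimized on `profiles[app]` is then evaluated on `app` itself, which the optimizer has never seen.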
The optimized NoCs are then used to execute all appli-
cations, e.g., an NoC optimized for BFS is used to execute 
all ten applications, and the performance is normalized to 
the application’s respective application-specific NoC. For 
example, the EDP of an NoC optimized for BFS running BP 
is normalized with respect to the EDP of the NoC optimized 
for BP, running BP. Each AVG NoC executes the application 
that was left-out during optimization (otherwise unknown 
to the optimization). Fig. 8(a) shows the normalized EDP of the
64-tile NoCs. From Fig. 8(a), we note that, on average, only 3.2%
degradation is observed across all applications when compared
to the application-specific NoCs, with the worst case reaching
9.8%. The averaged NoC (AVG) performs even better, showing
only a 1.1% average degradation compared to the
application-specific NoC architectures.
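The normalization and the degradation statistics can be sketched in Python as follows. The nested-dictionary layout and the function name are assumptions of this example, and the sketch excludes each application running on its own NoC (normalized EDP of exactly 1) from the average, which the text does not state explicitly.

```python
def degradation_stats(edp):
    """Average and worst-case EDP degradation (%) under cross-application reuse.

    edp[opt][run] -- EDP of the NoC optimized for `opt` when executing `run`
    Each value is normalized to edp[run][run], the application-specific NoC.
    """
    losses = []
    for opt in edp:
        for run in edp[opt]:
            if opt != run:
                normalized = edp[opt][run] / edp[run][run]
                losses.append((normalized - 1.0) * 100.0)
    return sum(losses) / len(losses), max(losses)
```

With hypothetical values where the BP-optimized NoC runs BFS at 210 against 200 for the BFS-specific NoC, and the BFS-optimized NoC runs BP at 104 against 100, the degradations are 5% and 4%, giving an average of 4.5% and a worst case of 5%.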
In Fig. 8(b), we also provide a comparative study with a
36-tile system, for which we see similar results. From
Fig. 8(b), we note that even
for a different system size, the performance degradation is
only 3.8% on average, with the worst-case difference going up
to 11% for NoCs optimized with a single application. Similar
to the previous case, AVG performs better, with an average
degradation of 1.8%.
By aggregating the characteristics from multiple appli-
cations, AVG can better generalize to the unknown appli-
cation. Therefore, an NoC optimized for a subset of appli-
cations can be reused for a new application on 3D hetero-
geneous systems without significant performance penalty. 
These implications can be helpful for future applications 
and NoC designs. For example, an NoC optimized using 
BFS and GAU can be used to execute neural network
architectures like LeNet or CDBNet as shown in Fig. 8(a).
Similarly, other neural network architectures, e.g., AlexNet [43],
are likely to exhibit similar performance improvements.
Hence, irrespective of system size, it is possible to design 
high-performance 3D heterogeneous NoCs without prior 
knowledge of the application we intend to run.   
6.5 Thermal Aware Application-Agnostic NoC Design 
Up until this point, we have only optimized 3D heteroge-
neous systems for network efficiency (network perfor-
mance). However, 3D ICs have higher packaging densities, 
resulting in higher temperature. High on-chip temperature 
is detrimental to the performance of the IC. Hence, it is es-
sential to include the thermal characteristics in the optimi-
zation process. In this section, we extend our evaluation of 
application-agnostic 3D heterogeneous NoC design by in-
cluding temperature into the optimization process as well. 
We introduce two new optimization cases (extending from 
the cases in Section 6.2) for this purpose:  
Case 4 – {𝑇} Thermal only optimization. We consider
peak core temperature (Eq. 7) only.
Case 5 – {𝑈, 𝜎, 𝐿𝑎𝑡, 𝑇, 𝐸} Joint performance-thermal
optimization. We add temperature (Eq. 7) to Case 3.

[Fig. 8 plots omitted: two bar charts, (a) and (b), of normalized EDP for each application being executed (BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, PF), with one bar per NoC optimized for BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, PF, and AVG.]
Fig. 8: Normalized EDP of (a) 64-tile and (b) 36-tile NoCs optimized for network efficiency only (Section 6.2, Case 3) to study the performance degradation with respect to application specific designs.
Like the previous section, we consider two system sizes
with single-application optimized NoCs and the averaged
NoC. However, optimizing only for the thermal profile can
lead to performance degradation since it does not consider
any performance objectives during the design process. We
show the performance-thermal trade-offs in Fig. 9.
In Fig. 9, we compare the results of NoCs optimized for
Case 3 (network efficiency/performance), Case 4 (thermal-only),
and Case 5 (joint network performance-thermal), normalized
to the Case 3 NoC. Figs. 9(a) and 9(b) show the
full-system execution time and EDP respectively. Fig. 9(c)
shows the temperature of the 3D NoC configuration in all
three NoC cases. Here, Full-System EDP (FS-EDP) is defined
as the product of full-system execution time and energy.
The full-system execution time is obtained via detailed
Gem5-GPU simulations. It is clear from Fig. 9 that incorporating
only thermal objectives in the optimization process leads to the
best temperature profile but a significant degradation of
more than 7% in full-system execution time on average.
Similarly, the NoC optimized for network efficiency (Case 3)
achieves the best EDP but is on average 20°C hotter than
the thermal-only optimized NoC (Case 4). On the other hand,
the jointly optimized NoC exhibits temperature improvements
of 18°C on average while sacrificing only 2.3% in overall
execution time. Therefore, it is important that we jointly
optimize both performance and temperature to reduce on-chip
temperature while delivering high performance.
[Fig. 9 plots omitted: three bar charts over the applications BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, and PF for the Perf, Joint, and Therm NoCs, showing (a) normalized execution time, (b) normalized FS-EDP, and (c) temperature (°C).]
Fig. 9: Performance-thermal trade-offs for 64-tile NoCs: (a) Full-System Execution time, (b) Full-System EDP, (c) Temperature comparison for three optimization cases: network efficiency/performance-only (Perf), joint performance-thermal (Joint) and thermal-only (Therm).

Next, we show that it is also possible to design
application-agnostic NoC architectures for the jointly optimized
thermal-performance case. To this end, we perform experiments
similar to those in Fig. 8. Fig. 10(a) (64-tile system) and Fig.
10(b) (36-tile system) show the normalized EDP for
applications executed on different application-specific and
traffic-averaged NoCs. As in Fig. 8, the application-specific
NoC for each application has been chosen as the baseline
for comparison. On average, only 2.8% degradation is
observed when an application-specific NoC runs other
applications, compared to each application's own
application-specific NoC, with a worst case of 8.5%. Similarly,
for the 36-tile NoCs, the average EDP degradation is
4.5% and the worst case is 11%. As in the previous case,
the traffic-averaged NoCs perform better, with an average
degradation of 2% and 2.1% for the 64-tile and 36-tile NoCs
respectively.

[Fig. 10 plots omitted: two bar charts, (a) and (b), of normalized EDP for each application being executed (BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, PF), with one bar per NoC optimized for BP, BFS, CDN, GAU, HS, LEN, LUD, NW, KNN, PF, and AVG.]
Fig. 10: Normalized EDP of (a) 64-tile and (b) 36-tile NoCs optimized jointly for performance-thermal to study the performance degradation with respect to application-specific designs.
From the above observations, we find that, due to the 
similarities in the traffic patterns of applications on a 
heterogeneous platform, it is possible to optimize the NoCs 
for any known application(s) and have them perform well with 
unknown applications. We have also seen that optimizing on a 
small set of applications reduces both the average and 
worst-case degradation even further. Looking deeper, we 
study the physical core and link distributions for each of the 
application-specific NoCs. In Section 3, we noted that the 
traffic patterns are similar for multiple applications. As a 
result, the optimized NoCs are expected to be similar as well.    
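One way to picture the traffic-averaged designs mentioned above is as NoCs optimized against a single aggregated traffic matrix. The sketch below assumes that each application's tile-to-tile traffic is first normalized by its total volume and then averaged element-wise; this normalization, the 3-tile matrices, and the names `normalize` and `average_traffic` are illustrative assumptions, not the exact aggregation used in this work.

```python
def normalize(matrix):
    """Scale a square traffic matrix so that its entries sum to 1."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("traffic matrix has no traffic")
    return [[value / total for value in row] for row in matrix]


def average_traffic(matrices):
    """Element-wise mean of the normalized traffic matrices."""
    normalized = [normalize(m) for m in matrices]
    size = len(normalized[0])
    count = len(normalized)
    return [
        [sum(m[i][j] for m in normalized) / count for j in range(size)]
        for i in range(size)
    ]


# Hypothetical 3-tile traffic counts (row = source tile, column = destination tile).
app_a = [[0, 8, 2], [6, 0, 0], [4, 0, 0]]
app_b = [[0, 30, 10], [40, 0, 0], [20, 0, 0]]

avg = average_traffic([app_a, app_b])
for row in avg:
    print(["%.2f" % value for value in row])
```

Normalizing before averaging prevents an application with a larger absolute traffic volume from dominating the aggregate pattern.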
 To this end, we evaluate the link distribution among the 
four layers and the associated tile placements for the 
heterogeneous NoCs. Fig. 11 shows the distribution of tiles 
and links in the performance-only optimized Het-perf 
(Section 6.4), the joint performance-thermal optimized 
Het-joint (Section 6.5), and Mesh-perf (a baseline 3D mesh 
NoC whose tile placement has been performance-only optimized 
in the same way as Het-perf). Due to the uniform link 
distribution across all layers, mesh NoCs cannot handle 
many-to-few traffic efficiently (Section 4.1). On the other 
hand, both heterogeneous NoCs designed following the 
framework presented in Section 4 produce an irregular 
topology with more links near the LLCs. This allows greater 
path diversity and reduces traffic congestion. The LLCs also 
tend to remain in the middle layers, which lets them access 
the vertical links in both directions and reduces the 
average hop count to the other tiles. We also look at the 
performance-thermal joint optimized NoCs, Het-joint (i.e., 
Case 5 described in Section 6.5). The placement of cores 
and links is greatly affected by temperature-aware 
optimization (Fig. 11). To reduce core temperatures, 
high-power cores, i.e., GPUs, are placed closer to the 
sink. As a result, the LLCs and CPUs are placed mostly in 
the upper layers. Also, similar to the previous case, more 
links are observed in the layers with a higher number of 
LLCs. In both cases, the physical distribution of cores and 
links is observed to be similar for all considered 
applications. Hence, the optimized NoCs share similar 
characteristics in the physical placement of cores and 
links as well.  
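The contrast between the uniform mesh and the irregular heterogeneous topologies can be quantified from the per-layer numbers shown in Fig. 11. The sketch below assumes those numbers are per-layer link counts and lists them in the order they appear in the figure; the mapping of list positions to layers L1-L4 is not asserted here, and the helper names are hypothetical. Each set sums to 96, which would be consistent with 24 planar links in each of four 4x4 mesh layers.

```python
# Per-layer values as they appear in Fig. 11 (assumed to be link counts;
# the position-to-layer mapping is left unspecified on purpose).
links_per_layer = {
    "Het-perf": [15, 29, 31, 21],
    "Het-joint": [34, 28, 30, 4],
    "Mesh-perf": [24, 24, 24, 24],
}


def layer_shares(counts):
    """Fraction of the total link budget placed in each layer."""
    total = sum(counts)
    return [count / total for count in counts]


def max_deviation_from_uniform(counts):
    """Largest absolute gap between a layer's share and an even split."""
    uniform = 1 / len(counts)
    return max(abs(share - uniform) for share in layer_shares(counts))


for name, counts in links_per_layer.items():
    shares = ", ".join(f"{share:.0%}" for share in layer_shares(counts))
    deviation = max_deviation_from_uniform(counts)
    print(f"{name}: total={sum(counts)}, shares=[{shares}], "
          f"max deviation from uniform={deviation:.1%}")
```

Under these assumptions, all three designs use the same link budget and differ only in how it is spread across layers, with Het-joint departing furthest from the uniform mesh.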
7 CONCLUSIONS 
3D NoC-enabled CPU-GPU-based heterogeneous architectures 
provide an opportunity to design high-performance, 
energy-efficient computing platforms to meet the growing 
computational needs of deep learning and big-data 
applications. However, 3D heterogeneous architectures present 
several new design challenges: a) multiple, potentially 
conflicting design requirements; b) thermal hotspots induced 
by 3D integration; and c) significantly larger design spaces.  
In this work, we have shown that we can generate thermally 
efficient, high-performance 3D NoCs that are 
application-agnostic by analyzing the on-chip traffic, 
designing suitable objectives, and using efficient MOO 
techniques. Our study shows that applications on 
heterogeneous systems with many GPUs and few LLCs generate 
similar traffic patterns. Experiments demonstrate that, by 
considering an aggregated traffic pattern of several 
applications, our design framework can generate generic 3D 
NoC configurations that experience an average performance 
loss of 1.1% for 64-tile systems and 1.8% for 36-tile systems 
compared to application-specific NoCs. Similar observations 
were made for the performance-thermal joint optimized case. 
These observations held irrespective of system size, system 
configuration, and available training application sets, 
demonstrating that we can create NoCs that generalize well 
to unknown applications using a small subset of available 
applications.  
Fig. 11: Distribution of tiles and links in different architectures considered 
in this work. [Figure omitted. For Het-perf, Het-joint, and Mesh-perf, the figure 
shows the core placement (CPU, LLC, GPU) and the link distribution across layers 
L1-L4, with the heat sink marked. Per-layer values in figure order: Het-perf 15, 
29, 31, 21; Het-joint 34, 28, 30, 4; Mesh-perf 24, 24, 24, 24.] 
REFERENCES 
[1] D. Strigl, K. Kofler, and S. Podlipnig, “Performance and Scalability of 
GPU-Based Convolutional Neural Networks,” Euromicro Conf. on Parallel, 
Distributed and Network-based Processing, Pisa, pp. 317-324, 2010. 
[2] J. Power et al., “Heterogeneous system coherence for integrated 
CPU-GPU systems,” Proc. of IEEE/ACM Int’l Symp. on 
Microarchitecture (MICRO), Davis, pp. 457-467, 2013. 
[3] J. Hestness, S.W. Keckler, D.A. Wood, “c”. IEEE Int’l Symp. on 
Workload Characterization (IISWC), Atlanta, 87-97, 2015. 
[4] W. R. Davis et al., “Demystifying 3D ICs: The Pros and Cons of 
Going Vertical,” IEEE D&T of Computers, vol. 22, no. 6, pp. 498-
510, 2005. 
[5] S. Das, J. R. Doppa, P. P. Pande and K. Chakrabarty, “Design-Space 
Exploration and Optimization of an Energy-Efficient and Reliable 
3-D Small-World Network-on-Chip,” IEEE TCAD, vol. 36, no. 5, pp. 
719-732, 2017. 
[6] B.S. Feero and P.P. Pande, “Networks-on-Chip in a 
Three-Dimensional Environment: A Performance Evaluation,” IEEE TC, 
vol. 53, no. 1, pp. 32-45, 2008. 
[7] W. Choi et al., “On-Chip Communication Network for Efficient 
Training of Deep Convolutional Networks on Heterogeneous 
Manycore Systems,” in IEEE TC, vol. 67, no. 5, pp. 672-686, 2018. 
[8] B. K. Joardar et al., “3D NoC-Enabled Heterogeneous Manycore 
Architectures for Accelerating CNN Training: Performance and 
Thermal Trade-offs,” Procs. of IEEE/ACM NOCS, Seoul, 2017. 
[9] K. Deb, A. Pratap, and S. Agarwal, “A fast and elitist multiobjective 
genetic algorithm: NSGA-II,” IEEE TEVC, vol. 6, no. 2, pp. 182-197, 
2002. 
[10] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb, “A simulated 
annealing-based multi-objective optimization algorithm: 
AMOSA,” IEEE TEVC, vol. 12, no. 3, 269–283, 2008. 
[11] J.A. Boyan and A.W. Moore, “Learning Evaluation Functions to Im-
prove Optimization by Local Search,” JMLR, pp. 77-112, 2001. 
[12] C. Wu et al., "A Multi-Objective Model Oriented Mapping Ap-
proach for NoC-based Computing Systems," in IEEE TPDS, vol. 28, 
no. 3, pp. 662-676, March 2017. 
[13] A. Bakhoda, J. Kim, and T.M. Aamodt, “Throughput-Effective On-
Chip Networks for Manycore Accelerators,” Proc. of Int’l Symp. 
Microarchitecture (MICRO), 457–467, Atlanta, 2013. 
[14] H. Jang et al., “Bandwidth-efficient on-chip interconnect designs 
for GPGPUs,” in DAC, San Francisco, 1-6, 2015. 
[15] O. Kayiran et al., “Managing GPU concurrency in heterogeneous 
architectures,” Int’l Symp. Microarchitecture, Cambridge, 2014. 
[16] C. C. Liu, I. Ganusov, M. Burtscher, S. Tiwari, “Bridging the Proces-
sor-Memory Performance Gap with 3D IC Technology,” IEEE D&T 
of Computers, vol. 22, no. 6, pp. 556-564, 2005. 
[17] A. Al Maashri, G. Sun, X. Dong, V. Narayanan, Y. Xie, “3D GPU ar-
chitecture using cache stacking: performance, cost, power and 
thermal analysis,” ICCD, Lake Tahoe, pp. 254-259, 2009. 
[18] J. Lee, S. Li, H. Kim, and S. Yalamanchili, “Design Space 
Exploration of on chip Ring Interconnection for a CPU-GPU 
Heterogeneous Architecture,” in JPDC, pp. 1525-1538, 2013. 
[19] F. Li et. al., “Design and Management of 3D Chip Multiprocessors 
Using Network-in-Memory,” ISCA, pp. 130-141, 2006. 
[20] S. M. Alam, R. E. Jones, S. Pozder, A. Jain, “Die/wafer stacking with 
reciprocal design symmetry (RDS) for mask reuse in three-dimen-
sional (3D) integration technology,” in ISQED, San Jose, pp. 569-
575, 2009. 
[21] X. Zhou, Y. Xu, Y. Du, Y. Zhang, and J. Yang, “Thermal Management 
for 3D Processors via Task Scheduling,” Int’l Conf. on Parallel Pro-
cessing, Portland, pp. 115-122, 2008. 
[22] A. Jarrah and M. M. Jamali, "Energy analysis and NoC design for 
heterogeneous MPSoC platform for a video application," in IEEE 
MWSCAS, Columbus, pp. 437-440, 2013 
[23] G. Mariani, G. Palermo, V. Zaccaria, C. Silvano, "OSCAR: An Opti-
mization Methodology Exploiting Spatial Correlation in Multicore 
Design Spaces," in IEEE TCAD, vol. 31, pp. 740-753, 2012. 
[24] A. A. Morgan, H. Elmiligi, M. W. El-Kharashi and F. Gebali, "Multi-
objective optimization for Networks-on-Chip architectures using 
Genetic Algorithms," Proc. of IEEE Int’l. Symp. on Circuits and Sys-
tems, Paris, pp. 3725-3728, 2010. 
[25] B. Ozisikyilmaz, G. Memik and A. Choudhary, "Efficient system de-
sign space exploration using machine learning techniques," in 
ACM/IEEE DAC, Anaheim, CA, pp. 966-969, 2008. 
[26] G. Ascia, V. Catania, A. G. Di Nuovo, M. Palesi, D. Patti, “Efficient 
design space exploration for application specific systems-on-a-
chip,” J. of Systems Architecture, Vol. 53, Pg. 733-750, 2007. 
[27] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning 
applied to document recognition,” Proc. of IEEE, vol. 86, no. 11, 
pp. 2278-2324, 1998. 
[28] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Im-
ages,” MSc Thesis, University of Toronto, 2009. 
[29] S. Che et al., “Rodinia: A benchmark suite for heterogeneous 
computing,” in IISWC’09, Austin, pp. 44-54, 2009. 
[30] J. Power, J. Hestness, M. Orr, M. Hill, D. Wood, “gem5-gpu: A Het-
erogeneous CPU-GPU Simulator,” in Computer Architecture Let-
ters, vol. 13, no. 1, 2014. 
[31] S. Koohi, M. Mirza-Aghatabar, S. Hessabi, M. Pedram, "High-Level 
Modeling Approach for Analyzing the Effects of Traffic Models on 
Power and Throughput in Mesh-Based NoCs," in VLSID, Hydera-
bad, pp. 415-420, 2008. 
[32] J. Cong, J. Wei and Y. Zhang, “A thermal-driven floorplanning al-
gorithm for 3D ICs,” in ICCAD, San Jose, pp.306-313, 2004. 
[33] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, D. Atienza, 
“3D-ICE: Fast compact transient thermal modeling for 3D ICs with 
inter-tier liquid cooling,” Proc. of ICCAD, San Jose, pp. 463-470, 
2010. 
[34] E. Zitzler, D. Brockhoff, L. Thiele, “The Hypervolume Indicator Re-
visited: On the Design of Pareto-compliant Indicators Via 
Weighted Integration,” in Int’l Conf on Evolutionary Multi-Crite-
rion Optimization, pp. 862-876, 2007. 
[35] A. Auger, J. Bader, D. Brockhoff, E. Zitzler, “Theory of the hypervol-
ume indicator: optimal mu-distributions and the choice of the 
reference point,” in FOGA, pp. 87-102, 2009. 
[36] L. While, P. Hingston, L. Barone, S. Huband, “A faster algorithm for 
calculating hypervolume,” IEEE TEVC, vol. 10, pp. 29-38, 2006. 
[37] J. Leng et. al., “GPUWattch: enabling energy optimizations in 
GPGPUs,” in ISCA, Tel-Aviv, pp. 487-498, 2013. 
[38] S. Li et. al., "McPAT: An integrated power, area, and timing mod-
eling framework for multicore and manycore architectures," Int’l 
Symp on Microarchitecture, New York, pp. 469-480, 2009. 
[39] O. Lysne, T. Skeie, S.A. Reinemo, and I. Theiss, “Layered routing in 
irregular networks,” in IEEE Trans. On Parallel and Distributed Sys-
tem, vol. 17, no. 1, pp. 51-65, 2006. 
[40] GitHub: https://github.com/CSU-rgkim/TC_2018_code 
[41] B. Settles, “Active Learning. Synthesis Lectures on Artificial 
Intelligence and Machine Learning,” Morgan & Claypool Publishers, 
2012. 
[42] S. Ross, G. J. Gordon, D. Bagnell, “A Reduction of Imitation Learn-
ing and Structured Prediction to No-Regret Online Learning,” in 
AISTATS, pp. 627-635, 2011. 
[43] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet classification 
with deep convolutional neural networks,” In NIPS, pp. 1106–
1114, 2012. 
