Optimization-based power and thermal management for dark silicon aware 3D chip multiprocessors using heterogeneous cache hierarchy by Asad A. et al.
Microprocessors and Microsystems 51 (2017) 76–98 
Optimization-based power and thermal management for dark silicon 
aware 3D chip multiprocessors using heterogeneous cache hierarchy 
Arghavan Asad a,∗, Ozcan Ozturk b, Mahmood Fathy a, Mohammad Reza Jahed-Motlagh a
a Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran 
b Computer Engineering Department, Bilkent University, Ankara, Turkey 
Article info
Article history: 
Received 5 January 2016 
Revised 29 December 2016 
Accepted 27 March 2017 
Available online 14 April 2017 
Keywords: 
Hybrid cache hierarchy 
Reconfigurable cache 
Non-volatile memory (NVM) 





Abstract
Managing the problem known as “dark silicon” is a new challenge in multicore design.
Prior studies have addressed dark silicon through power-efficient core design. However,
addressing dark silicon in uncore components such as the cache hierarchy and on-chip
interconnect, which consume a significant portion of on-chip power, is largely
unexplored. In this paper, for the first time, we propose an integrated approach that
considers the power consumption of core and uncore components simultaneously to improve
multi/many-core performance in the dark silicon era. The proposed approach dynamically
(1) predicts the changing program behavior on each core and (2) re-determines the
frequency/voltage, cache capacity, and technology in each level of the cache hierarchy
based on the program’s scalability in order to satisfy the power and temperature
constraints. In the proposed architecture for future chip multiprocessors (CMPs), we
exploit emerging technologies such as non-volatile memories (NVMs) and 3D techniques to
combat dark silicon. Also, for the first time, we propose a detailed power model that is
useful for power modeling of future dark silicon CMPs. Experimental results on SPEC
2000/2006 benchmarks show that the proposed method improves throughput by about 54.3% and
energy-delay product by about 61% on average in comparison with a conventional CMP
architecture with a homogeneous cache system.
(A preliminary short version of this work was presented at the 18th Euromicro Conference
on Digital System Design (DSD), 2015.)





























1. Introduction
Even though semiconductor technology continues to provide increasing on-chip transistor
densities, enabling the integration of many cores on a single die, Dennard scaling [1],
which offered near-constant chip power as the number of transistors doubled, has come to
an end.
Due to the breakdown of Dennard scaling, the fraction of transistors that can be
simultaneously powered on within the peak power and temperature budgets is dropping
exponentially with each process generation. This phenomenon has been termed the dark
silicon era [2]. Predictions in the literature indicate that if dark silicon challenges
are not addressed properly, more than 90% of a chip will effectively be dark silicon
(dark, idle, dim, or under-clocked) within 6 years [3]. Therefore, it is extremely
important to provide next-generation architectural techniques, design tools, and
analytical models for future many-core CMPs in the presence of dark silicon [4].
∗ Corresponding author.
0141-9331/© 2017 Elsevier B.V. All rights reserved.
In the nanometer era, leakage power depletes the power budget and contributes
substantially to overall power consumption. One study has shown that over 50% of the
overall power dissipation in the 65 nm generation is due to leakage power [5], and this
percentage is expected to increase in the next process generations [6,7]. Research also
shows that increasing leakage power consumption is a major driver of the unusable
portion, or dark silicon, of future many-core CMPs [2].
In recent years, more and more applications have been shifting from compute-bound to
data-bound, so a hierarchy of cache levels is required to efficiently store and
manipulate large amounts of data. In this context, an increasing percentage of on-chip
transistors is invested in the cache hierarchy, and architects have dramatically
increased the size of the cache levels in an attempt to bridge the gap between fast cores
and slow off-chip memory accesses in multi/many-core CMPs. Because the cache hierarchy
occupies as much as 50% of the chip area, it is the dominant leakage consumer in future
multi/many-core systems. Also, since leakage power has become a significant factor in the
overall chip power budget in the nanoscale era, cache hierarchies have become substantial
power consumers in future many-core CMPs.
Most prior research on power management techniques for multicore processors focuses on
core design to control power consumption. The only knob used to manage the power of
multicore systems is at the core level [35–48,69–73]. In this work, we show that uncore
components such as the cache hierarchy and on-chip interconnect are significant
contributors to the overall chip power budget in the nanoscale era and play important
roles in the dark silicon era. Uncore components, especially those in the cache
hierarchy, are the dominant leakage consumers in multi/many-core CMPs. Therefore, besides
energy-efficient core design, the design of the uncore components is essential to
tackling the challenges of multicore scaling in the dark silicon era.
Since the dark silicon phenomenon stems from the only slight improvement in the power
density of CMOS devices, emerging power-saving materials manufactured with nanotechnology
might be useful for illuminating the dark area of future CMPs. The long switching delay
and high switching energy of such emerging low-power materials are the main drawbacks
that prevent manufacturers from completely replacing traditional CMOS in future
processor manufacturing [8]. Therefore, architecting heterogeneous CMPs and integrating
cores and a cache hierarchy made of different materials on the same die emerges as an
attractive design option to alleviate the power constraint. In this work, we use emerging
technologies such as three-dimensional integrated circuits (3D ICs) [9,10] and
non-volatile memories (NVMs) [11,12] to exploit device heterogeneity in the design of
dark-silicon-aware multi/many-core systems.
With the increasing parallelism of new applications (from emerging domains such as
recognition, mining, synthesis, and especially mobile applications), which can
efficiently use 100–1000 cores, recent years have seen a shift to multi/many-core
designs. Due to the scalability limitations and performance degradation of 2D CMPs,
especially in future many-cores, in this work we focus on 3D integration to reduce global
wire lengths and improve the performance of future CMPs. Among the benefits offered by 3D
integration over 2D technologies, mixed-technology stacking is especially attractive: NVM
can be stacked on top of CMOS logic, and designers can take full advantage of the
benefits that NVM provides.
Providing analytical models for future multi/many-core CMPs in the presence of dark
silicon is essential [4]. None of the previous studies has presented analytical models
for future CMPs. To the best of our knowledge, this is the first work to propose an
accurate power model that formulates the power consumption of future many-core CMPs.
With increasing core counts and the parallelization of applications, workloads in future
many-core architectures are expected to be multithreaded applications. Most power
budgeting and performance optimization techniques proposed so far for multicore systems
[35–48,69–71] focus only on multiprogrammed workloads, where each thread is a separate
application. These techniques are inappropriate for multithreaded applications. Our
proposed analytical power model formulates, for the first time, the power consumption of
CMPs with stacked cache layers under both multiprogrammed and multithreaded workloads.
Unlike previous research on dark silicon, which considers only the portion of power
consumption related to on-chip cores [2,13–24], the proposed model treats uncore
components as important contributors to the total CMP power consumption. Moreover, the
latest works on performance/energy optimization in multicore systems [69–71] do not
support multicores with more than eight cores. They do not scale to many-core CMPs, and
they do not support non-uniform cache architecture (NUCA), the main cache organization
for future many-cores. To the best of our knowledge, this is the first study that
presents an accurate power model for many-core CMPs with a stacked cache hierarchy. This
analytical power model is useful for future dark-silicon modeling. The proposed power
model considers microarchitectural features and workload behavior. A recent study from
industry [76] has revealed that, under the same power budget, the best power allocation
strategy depends strongly on application characteristics.
In particular, a dark-silicon-aware CMP with different components, including cores and
uncore components, needs an integrated runtime reconfiguration approach to maximize
performance under power and thermal constraints as well as under dynamically changing
program behavior and execution parameters. To this end, based on the accurate power model
derived for the proposed heterogeneous 3D CMP, two optimization problems are formulated.
These optimization problems are applied at runtime to reconfigure the cache hierarchy and
assign an appropriate frequency/voltage to each core of the 3D CMP based on online
monitoring of workload characteristics. Since the proposed reconfiguration scheme must be
aware of runtime application variability, such as program phase changes, it needs to be
efficient enough to be invoked at runtime.
Fig. 1 shows an overview of the proposed 3D CMP architecture, with a heterogeneous cache
hierarchy stacked on the core layer, and the runtime flow used in this paper. As shown in
this figure, the memory technology can differ between the levels of the cache hierarchy,
while it is homogeneous within each level.
Next, based on the hardware prepared for the proposed reconfiguration approach, we
propose a mapping technique for the target 3D CMP that considers workload behavior to
improve the thermal distribution at runtime, an essential need in future many-core 3D
CMPs.
The contributions of this paper are as follows:
• We propose an analytical power model that formulates the power consumption of future
many-core CMPs. Specifically, this power model is useful for dark silicon modeling and
can help researchers propose new power management techniques for future CMPs.
• For the first time in such a model, we target CMPs with a large number of cores (more
than eight, i.e., many-core), which require building a non-uniform cache architecture
(NUCA) over a scalable network-on-chip (NoC) in order to reduce cache access latency.
• For the first time in such a model, we consider the power consumption of core and
uncore components together.
• We consider both heterogeneous and non-heterogeneous CMPs under both multiprogrammed
and multithreaded workloads in this model.























































Fig. 2. Power breakdown of (a) a 4-core, (b) an 8-core, (c) a 16-core, and (d) a 
32-core system under limited power budget. 





















• We consider core and uncore leakage power as an important contributor to the overall
CMP power consumption in the nanoscale era in this power model.
• We propose an optimization-based, power- and thermal-aware reconfiguration technique
for the target dark-silicon-aware 3D CMP based on the derived power model.
• We consider microarchitectural features (core microarchitectures, cache organization,
interconnection network, and chip organization) and workload behavior (memory access
pattern, dynamically changing program behavior, and execution parameters) in the proposed
reconfiguration technique.
• We propose a low-overhead mapping technique that balances the thermal distribution by
considering the behavior of applications/threads at runtime, using the hardware prepared
for the proposed reconfiguration approach.
The rest of this paper is organized as follows. Section 2 analyzes the power consumption
of cores and uncore components in CMPs. Section 3 reviews background and prior related
work. Section 4 explains the proposed heterogeneous 3D CMP architecture and the
motivation behind the proposed technique. Section 5 presents the power model of the
proposed 3D CMP with the stacked cache hierarchy. Section 6 describes the
optimization-based runtime reconfiguration technique, which consists of two phases,
online and off-line. Section 7 presents the proposed runtime application-aware mapping
technique. Section 8 presents experimental results, and Section 9 concludes the paper.
2. Analyzing the contribution of cores and uncore components 
in total multicore processors power consumption 
In this section, we analyze the power consumption of cores and uncore components in
multicore systems. We first illustrate that uncore components contribute significantly to
on-chip power consumption, so their impact on the total power budget of future chips
cannot be ignored. We then show that the share of leakage power, a major fraction of
total power consumption in uncore components, grows relative to dynamic power as
technology scales, and considerably outweighs dynamic power in future nanoscale designs.
To better understand the power distribution of a multicore processor, we use McPAT [60]
to evaluate the power dissipation of cores and uncore components, including the L2/L3
cache levels, the routers and links of the NoC, integrated memory controllers, and
integrated I/O controllers.
Fig. 2 illustrates the power breakdown of a multicore system with an increasing number of
cores under a limited power budget. We use the 32 nm technology node in this figure. As
shown, the power consumption of uncore components becomes more critical as the number of
cores in a multicore system increases while the power budget remains a design constraint.
In this work, we assume idle cores can be gated off (dark silicon) while other on-chip
resources stay active or idle under the limited power budget. In fact, the uncore
components remain active and consume power as long as there is an active core on the
chip. As illustrated in Fig. 2, more than half of the power consumption is due to the
uncore components in the 16-core and 32-core systems. Fig. 2 also shows that the cache
hierarchy and NoC account for a large portion of uncore power consumption. Therefore,
besides energy-efficient core design, how to architect the uncore components is essential
to tackling the challenges of multicore scaling in the dark silicon era.
As shown in Fig. 3, when technology scales from 32 nm to 22 nm, the ratio of leakage
power increases and is expected to exceed the dynamic power in future generations. For
Fig. 3, we use a 1 GHz frequency and a 0.9 V supply voltage for an 8-core system in the
32 nm and 22 nm technologies. This figure shows that leakage power dominates the power
budget in nanoscale technologies and is a major driver of the unusable portion, or dark
silicon, of future many-core CMPs. Thus, emerging technologies such as NVMs, with
near-zero leakage power, and three-dimensional integrated circuits (3D ICs), which allow
different technologies to be stacked onto CMOS circuits, bring new opportunities to the
design of multi/many-core systems in the dark silicon era.
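The trend described above can be illustrated with the textbook first-order CMOS power
model, in which switching power is alpha * C * V^2 * f and static power is V * I_leak.
The sketch below is not the McPAT model used for Fig. 3. Only the 1 GHz and 0.9 V
operating point comes from the text; the activity factor, switched capacitance, and
leakage currents are placeholder values chosen purely to show how the leakage share grows
when leakage current rises at a fixed voltage and frequency.

```python
def dynamic_power(alpha, c_eff, vdd, freq_hz):
    """First-order switching power in watts: alpha * C * V^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq_hz


def leakage_power(vdd, i_leak):
    """First-order static power in watts: V * I_leak."""
    return vdd * i_leak


def leakage_share(alpha, c_eff, vdd, freq_hz, i_leak):
    """Fraction of total power that is due to leakage."""
    p_dyn = dynamic_power(alpha, c_eff, vdd, freq_hz)
    p_leak = leakage_power(vdd, i_leak)
    return p_leak / (p_dyn + p_leak)


# Operating point from the text: 1 GHz, 0.9 V. All other numbers are assumed.
VDD, FREQ_HZ = 0.9, 1e9
older_node = leakage_share(alpha=0.2, c_eff=20e-9, vdd=VDD, freq_hz=FREQ_HZ, i_leak=2.0)
newer_node = leakage_share(alpha=0.2, c_eff=14e-9, vdd=VDD, freq_hz=FREQ_HZ, i_leak=3.0)

if __name__ == "__main__":
    print(f"leakage share, older node (assumed parameters): {older_node:.2f}")
    print(f"leakage share, newer node (assumed parameters): {newer_node:.2f}")
```

With these placeholder parameters the leakage share rises from the older to the newer
node, which is the qualitative behavior the text attributes to scaling from 32 nm to
22 nm.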
3. Background and related work
3.1. Background
Compared with traditional memory technologies such as SRAM and DRAM, NVM technologies
commonly offer many desirable features, such as near-zero leakage power due to their
non-volatility, high cell density, and high resilience against soft errors. Nevertheless,
they suffer from obstacles such as a limited number of write operations and long write
latency and high write energy. Table 1 gives a brief comparison of the SRAM, STT-RAM,
eDRAM, and PCRAM technologies at 32 nm. The estimates are given by NVSim [58], a
performance, energy, and area model based on CACTI [59]. Table 1 shows that STT-RAM is
around four times denser than SRAM. In addition, as shown in Table 1, STT-RAM has much
lower leakage power than SRAM. However, STT-RAM has significantly higher write latency
and write power consumption than SRAM.

Table 1
Comparison of different memory technologies at 32 nm.
Technology    Area      Read latency  Write latency  Leakage power at 80 °C  Read energy  Write energy
1 MB SRAM     3.03 mm²  0.702 ns      0.702 ns       444.6 mW                0.168 nJ     0.168 nJ
4 MB eDRAM    3.31 mm²  1.26 ns       1.26 ns        386.8 mW                0.142 nJ     0.142 nJ
4 MB STT-RAM  3.39 mm²  0.880 ns      10.67 ns       190.5 mW                0.278 nJ     0.765 nJ
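The density and leakage claims can be checked directly against the rows of Table 1. The
short sketch below normalizes each technology to capacity (MB per mm² and leakage mW per
MB) and compares it with SRAM. The numbers are copied from the table; the function and
dictionary names are illustrative only.

```python
# Values copied from Table 1 (32 nm, leakage at 80 °C); capacities are in MB.
TABLE_1 = {
    "SRAM":    {"capacity_mb": 1, "area_mm2": 3.03, "leakage_mw": 444.6, "write_ns": 0.702},
    "eDRAM":   {"capacity_mb": 4, "area_mm2": 3.31, "leakage_mw": 386.8, "write_ns": 1.26},
    "STT-RAM": {"capacity_mb": 4, "area_mm2": 3.39, "leakage_mw": 190.5, "write_ns": 10.67},
}


def density_mb_per_mm2(tech):
    """Storage density: capacity divided by area."""
    row = TABLE_1[tech]
    return row["capacity_mb"] / row["area_mm2"]


def leakage_mw_per_mb(tech):
    """Leakage power normalized to capacity."""
    row = TABLE_1[tech]
    return row["leakage_mw"] / row["capacity_mb"]


def ratio_vs_sram(metric, tech):
    """Value of a metric for `tech` relative to the same metric for SRAM."""
    return metric(tech) / metric("SRAM")


if __name__ == "__main__":
    for tech in TABLE_1:
        print(f"{tech:8s} density x{ratio_vs_sram(density_mb_per_mm2, tech):.2f}, "
              f"leakage per MB x{ratio_vs_sram(leakage_mw_per_mb, tech):.2f}")
```

On these figures STT-RAM stores about 3.6 times as many megabytes per mm² as SRAM, in
line with the "around four times denser" statement, while its write latency is about 15
times longer.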
3.2. Related work
A number of studies over the past five years have addressed the dark silicon phenomenon
[2,13–24,66].
To combat dark silicon, Esmailzadeh et al. [2] focused on using only general-purpose
cores. They evaluated homogeneous dark silicon CMPs and showed that fundamental
performance limitations stem from the processor core. They ignored the power impact of
“uncore” components such as the cache hierarchy, memory subsystem, and on-chip
interconnect. Their paper notes that with technology scaling and an increasing number of
cores per chip, the number of these “uncore” components will increase, so they will eat
further into the power budget and reduce speedups. Ignoring the power impact of uncore
components is mentioned as one of the limitations of that work.
The studies in [14–17] address architectural synthesis of heterogeneous dark silicon CMPs
from the performance and reliability perspectives under power/area constraints. Turakhia
et al. [15] proposed a framework for architectural synthesis of dark-silicon CMPs.
Raghunathan et al. [14] proposed a framework to evaluate the benefits of selecting a more
suitable subset of cores for an application in a dark silicon multi-core system to
maximize performance within the power budget. Like Esmailzadeh et al. [2], the works in
[14–17] proposed design-time solutions.
There have been recent efforts to mitigate the impact of dark silicon using device-level
heterogeneity for processing elements. For example, variable symmetric multiprocessing
(vSMP) [20] is an energy-efficient methodology presented by NVIDIA in which cores with
the same architecture, but fabricated in different silicon processes, are integrated.
Some cores use a special low-power silicon process, while others use a standard silicon
process. In another effort in the same category, the authors of [21] combine steep-slope
devices (e.g., interband tunnel field-effect transistors (TFETs)) and CMOS devices in the
design of heterogeneous multicores. The main idea in these studies is to switch
dynamically between the processors based on the system workload, since they have
different performance and energy consumption behaviors.
In [22,23], near-threshold computing allows cores to operate at a supply voltage near the
threshold voltage, allowing several otherwise dark cores to be turned on.
Venkatesh et al. [24] introduce the concept of “conservation cores”: specialized
processors that focus on reducing energy instead of increasing performance, used for
computations that cannot take advantage of hardware acceleration.
The work in [66] targets architectural synthesis of heterogeneous dark silicon processors
from the performance perspective under a power constraint. To maximize performance, [66]
exploits cores that are homogeneous but synthesized with different power/performance
targets.
All of these prior works on dark silicon [2,13–24,66] are characterization studies and
focus on cores rather than uncore components. One of the newest papers [73] reviews
recent work on multicore systems that addresses cores or uncore components separately,
and advises researchers to consider core and uncore power consumption simultaneously in
future many-core designs.
In this work, we consider the power impact of uncore components together with that of
cores, for the first time.
To improve the performance of CMPs and reduce their power consumption, a number of
researchers have proposed 3D CMP architectures with 3D stacked memory/cache layers on top
of a core layer [9,10,25–27,65,72,77]. In these studies, stacking a large SRAM cache or
DRAM memory on a core layer increased performance and reduced power consumption. The
study in [10] demonstrated that up to sixteen layers of DRAM could be stacked on a
quad-core processor without exceeding the maximum thermal limit. Cheng et al. [77]
proposed an energy-efficient SRAM-based last-level cache without considering power and
thermal limits. Stacking traditional memories such as SRAM or DRAM on the core layer may
cause a drastic increase in power density and temperature-related problems such as
negative bias temperature instability (NBTI) [28]. For example, when eDRAM/DRAM is
stacked on top of cores as cache/main memory, the heat generated by the core layer can
significantly increase the refresh power of the DRAM layers, and the designer needs to
account for refresh power when designing the power management policy for stacked DRAM
memory or cache. Recently, nonvolatile memory (NVM) technologies have emerged as
candidates for future universal memory and cache subsystems due to advantages such as
high density, near-zero leakage power, scalability, and 3D integration with CMOS circuits
[11,12,29–32]. Despite these advantages, shortcomings such as high write energy, long
write latency, and limited write endurance prevent NVMs from being used as a direct
replacement for traditional memories. To tackle these issues, recent studies [33–35] have
proposed hybrid architectures in which traditional memories are integrated with NVMs to
exploit the advantages of both technologies. However, none of the aforementioned studies
has explored these emerging memory technologies in the dark silicon context.
A number of studies have proposed proactive techniques to reduce power consumption in
multicore systems, such as dynamic voltage and frequency scaling (DVFS), thread
scheduling, thread mapping, shutting-down schemes, and migration policies [36–43]. These
methods were not designed for the dark silicon era. They cannot handle a situation in
which the system does not have enough power budget to keep running in its current
setting. Also, these approaches limit their scope to cores. In a power emergency, a
well-designed power management method should not only decrease power consumption to meet
the new power constraint but also reduce the impact on performance as much as possible.
In addition, prior works [36–42] provide power management for platforms implemented in
technology nodes in which the dark silicon issue does not practically exist (e.g., 45 nm
CMOS technology) and leakage power is not so problematic.
Although some efforts have been made in recent years, they yield only relatively small
improvements in mitigating the dark silicon issue, because the dark/dim area is dynamic:
it grows and shrinks at runtime. A broad open question that is unaddressed in the
literature is the runtime optimization of the subset of cores to be kept dark and the
selection of the ideal set of on-chip resources to power on, based on the available power
budget, workload characteristics, and the thermal profile of the chip. This motivates us
to provide a comprehensive dark-silicon-aware runtime power and thermal management
platform that reacts to power emergencies in future CMPs. Specifically, in this paper, we
focus on the power consumption of core and uncore components that interact with each
other during application execution. Exploiting NVM technologies and 3D techniques in the
uncore components of the proposed CMP brings new opportunities to combat the dark silicon
challenge.

Fig. 5. Comparison of temperature of each homogenous cache hierarchy with respect to the
AMAT shown in Table 2.
4. Proposed architecture 
The architecture model assumed in this work is a 3D CMP with a multi-level hybrid cache
hierarchy stacked on the core layer, as in Fig. 1. As shown in that figure, each cache
level is assumed to be implemented in a different memory technology. To motivate the
proposed architecture, we design two scenarios. In the first scenario, we consider a 3D
CMP with a homogeneous cache hierarchy, with one layer per level stacked on the core
layer as in Fig. 4(a). We also assume four cores in the core layer, each running the art
application [49]. Table 2 gives the average memory access time (AMAT), the performance
metric used to evaluate cache systems, and the system power consumption when the stacked
cache levels in the homogeneous hierarchy are made from SRAM, eDRAM, STT-RAM, or PRAM.
Note that the normalization reported in Table 2 is done based on the best case. That is,
power consumption is normalized with respect to SRAM, whereas AMAT is normalized with
respect to PRAM. On this basis, SRAM is the fastest and most power-hungry option, and it
is better used in the lower levels of the cache hierarchy because of its faster accesses.
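For readers unfamiliar with the metric, the sketch below computes AMAT with the standard
textbook recurrence for a multi-level hierarchy: each level contributes its hit time plus
its local miss rate times the cost of going one level further. This is an illustration
under assumptions, not necessarily the exact AMAT model behind Table 2, and the latencies
and miss rates in the example are hypothetical.

```python
def amat(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate) tuples, ordered from the level
            closest to the core to the last-level cache.
    memory_latency: access time of main memory, in the same unit as hit_time.
    """
    penalty = memory_latency
    # Fold from the last level back to the first:
    # cost(level i) = hit_time_i + miss_rate_i * cost(level i + 1)
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


if __name__ == "__main__":
    # Hypothetical two-level hierarchy with latencies in cycles.
    example_levels = [(1, 0.1), (10, 0.5)]
    print(amat(example_levels, memory_latency=100))
```

The recurrence makes the trade-off in the text visible: a faster technology lowers the
hit-time terms, while a denser technology can lower the miss-rate terms by providing more
capacity in the same area.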
In this context, we introduce the steady-state temperature of the single layer of each
cache level in the homogeneous cache hierarchy shown in Fig. 4(a) as another measure.
Fig. 5 depicts the temperature of the top layer of each cache level in the homogeneous
hierarchy of Fig. 4(a) versus the AMAT shown in Table 2. As shown in Fig. 4(a), since
each cache level has one layer, the top layer is the only layer.

Table 2
Comparison of AMAT and system power consumption.
Technology  AMAT  Power consumption
SRAM        0.09  1
eDRAM       0.16  0.62
STT-RAM     0.3   0.37
According to Fig. 5, if T_max is set to 80 °C, only the PRAM-based cache hierarchy
satisfies the maximum temperature constraint, although it has the highest AMAT of all the
options. If T_max is set to 90 °C, STT-RAM and PRAM both become feasible, and STT-RAM is
chosen to minimize AMAT. In some high-speed applications with T_max between 90 °C and
98 °C, the memory technology and the number of cache layers are selected between SRAM and
eDRAM to minimize AMAT. If T_max is set to more than 100 °C, the best option to minimize
AMAT is SRAM.
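The selection rule in this paragraph amounts to choosing, among the technologies whose
temperature stays within T_max, the one with the lowest AMAT. A minimal sketch follows.
The AMAT values for SRAM, eDRAM, and STT-RAM are taken from Table 2, and PRAM is set to
1.0 on the assumption that it is the normalization base. The steady-state temperatures
are placeholders chosen only to be consistent with the thresholds quoted above; they are
not read from Fig. 5.

```python
from typing import Optional

# Normalized AMAT: first three values from Table 2; PRAM = 1.0 is an assumption
# based on the statement that AMAT is normalized with respect to PRAM.
AMAT = {"SRAM": 0.09, "eDRAM": 0.16, "STT-RAM": 0.3, "PRAM": 1.0}

# Placeholder steady-state temperatures in °C (assumed, not taken from Fig. 5).
ASSUMED_TEMP_C = {"SRAM": 100.0, "eDRAM": 95.0, "STT-RAM": 88.0, "PRAM": 78.0}


def pick_technology(t_max_c: float) -> Optional[str]:
    """Return the lowest-AMAT technology whose temperature is within t_max_c."""
    feasible = [tech for tech, temp in ASSUMED_TEMP_C.items() if temp <= t_max_c]
    if not feasible:
        return None  # no homogeneous hierarchy meets the thermal limit
    return min(feasible, key=AMAT.__getitem__)


if __name__ == "__main__":
    for t_max in (80, 90, 98, 105):
        print(f"T_max = {t_max} °C -> {pick_technology(t_max)}")
```

Because the winner changes as T_max moves, a fixed homogeneous hierarchy is optimal only
within a narrow temperature range.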
According to Table 2 and Fig. 5, among the homogeneous cache hierarchies there exists an
SRAM configuration that minimizes AMAT under the maximum temperature constraint. Thus, no
single memory technology in this study has the best performance for all temperature
ranges. This motivates us to study temperature-aware reconfigurable heterogeneous cache
hierarchies, which combine the advantages of all these memory technologies to minimize
power consumption and improve overall performance.
Based on the observations in Table 2 and Fig. 5, we decided to use SRAM in the L2 cache
level, eDRAM in the L3 cache level, STT-RAM in the L4 cache level, and PRAM in the L5
cache level.
In the second scenario, illustrated in Fig. 4, we consider three different
implementations of our architecture, each with three cache levels stacked on the core
layer, along with further details used in this paper. Fig. 4(c) illustrates an example of
the proposed architecture shown in Fig. 1. We assume that art, gzip, mpeg2dec, and mcf
[49] run on Cores 1 to 4, respectively. To give more detail about these four
applications, Fig. 6 shows how the cache miss rate falls as the amount of cache assigned
to each program increases. According to this figure, mcf and art are memory-intensive
benchmarks, since they show a large reduction in cache miss rate as the cache capacity
increases. In contrast, mpeg2dec and gzip are computation-intensive benchmarks, because
they show very little reduction in cache miss rate with increasing cache capacity.
In the second scenario, the maximum temperature limit and power budget for the chip,
T_max and P_budget, are set to 80 °C and [?]00 W, respectively. Power gating of cache
banks and per-core DVFS are also assumed for the 3D CMP. Because of the strong thermal
correlation between a core and the cache banks stacked directly on it, the core and the
cache banks in the same stack are called a core-stack in our architecture.

Fig. 7. An example for a 3D CMP with more than one layer in each level of the
In the three configurations shown in Fig. 4, cache banks in each level of the hierarchy
are allocated to cores such that IPS (instructions per second) is maximized without
violating the maximum temperature limit and power budget. We also assume that the core
clock frequency varies from 2 GHz to 3 GHz. In this motivational example, we assume one
layer per cache level. As shown in Fig. 4(a), the high leakage power of SRAM raises the
temperature of the layers of the cache hierarchy. To keep the temperature within the
given limit, the L3 and L4 levels are turned off. In Fig. 4(b), L3 cache banks are turned
on in addition to L2 without violating the maximum temperature and power budget, because
of the lower leakage power of eDRAM-based cache banks. In Fig. 4(c), more cache banks are
allocated to each core in the upper levels of the hierarchy by analytically determining
the voltage/frequency of the cores: we allocate more cache banks of different
technologies to the cores while lowering their frequencies and voltages in order to
maximize IPS and satisfy the temperature limit. Since multiprogrammed applications have
high bandwidth demand, the cache banks allocated to the cores in Fig. 4(a) and (b) are
few, given the maximum temperature limit and power budget. We can therefore use the
allocated cache banks in each level in a shared manner, which has its own problems. For
example, there is contention between the working sets of different applications in the
shared space. The proposed technique in Fig. 4(c) partitions the shared cache space
according to the specific demands of each individual application in a workload set. The
cache banks allocated to each core in each level are labeled with the core numbers in
Fig. 4(c). Increasing cache capacity yields significantly different performance
improvements for different applications, because some applications need only a small
amount of cache, while others benefit from larger caches and use as much cache capacity
as they are given. Our proposed approach is therefore useful because it reconfigures the
cache hierarchy based on the needs of the application mapped to each core. For example,
since the applications mapped to Core1 and Core4 are memory-bound, more cache banks are
allocated to Core1 and Core4 in Fig. 4(c).
In Fig. 7, we provide another CMP example with a stacked cache hierarchy that has more than one layer in each level. In this motivational example, similar to Fig. 4, we use four cores in the core layer of the 3D CMP. The applications mapped on the cores in this example are art, gzip, mpeg2dec, and mcf running on Cores 1 to 4, respectively, the same as in the previous example. The stacked cache hierarchy on the core layer of this 3D CMP is composed of L2 with two layers, L3 with three layers, and L4 and L5 with four layers each. $T_{max}$ and $P_{max}$ are the same as in Fig. 4. In Fig. 7, for simplicity, we report the number of cache layers in each level allocated to the core layer as a whole. Also, we report the results for the core layer at two temperatures, 0 °C and 80 °C. As shown in Fig. 7(a), L2, L3, and L4 are homogeneous and made of SRAM, and L5 is eDRAM. In Fig. 7(b) and (c), the stacked cache hierarchy is hybrid. These figures show that the frequency and voltage of the cores depend on the number of active cache layers stacked directly on the core layer. As shown in Fig. 7(c), by lowering the frequency and voltage of some cores in the core layer, the number of active cache layers stacked directly on the cores can be increased to maximize the IPS without violating $T_{max}$ and $P_{max}$. In Fig. 7, the IPS result for each core layer temperature is normalized with respect to Fig. 7(a). Details of the estimations and the experimental setup used in this motivational example are given in Section 7.
By saving leakage power in the heterogeneous cache hierarchy, the architectural management technique proposed in this work makes the CMP more power-efficient, and the saved power can be used to power on darkened cores for performance improvement.
5. Power modeling for NoC-based many core CMPs
In this section, we present an analytical power model for future many-core chip multiprocessors with a multi-level cache hierarchy. For the first time, the proposed model covers the various types of on-chip resources such as cores, memory system, and interconnection. The model can be very useful for future dark silicon aware power modeling in many-core systems. Table 3 lists the parameters used in this model.
The total power consumption of a CMP mainly comes from three on-chip resources: cores, cache hierarchy, and interconnection network. Chip multiprocessors with a large number of cores (more than eight) require architectures built on a scalable network-on-chip (NoC). As widely used in the literature [13–24,35–48], we also adopt a mesh-based NoC in this modeling.
5.1. Components of the total power consumption of a 3D chip-multiprocessor
The total power of a 3D chip multiprocessor can be calculated as the sum of the power of the individual on-chip resources (cores and uncore components).
$$P_{Total} = P_{cores} + P_{uncores} \tag{1}$$
$$P_{Total} = P_{cores} + P_{cache\_hierarchy} + P_{interconnection} \tag{2}$$
5.1.1. Modeling core power consumption
We denote the power consumption of core i as $P_i^{core}$.
$$P_{cores} = \sum_{i=1}^{n} P_i^{core} \tag{3}$$
The power consumption of core i is comprised of dynamic and leakage power components. The total power consumption of core i is written as:
$$P_i^{core} = P_{D,i} + P_{L,i}, \quad \forall i \tag{4}$$
$$P_{D,i} = [\ldots], \quad \forall i \tag{5}$$
Table 3
Parameters used in the power model.

| Parameter | Description |
| --- | --- |
| $n$ | Number of cores in the core layer |
| $f_i$ | Operating frequency of core i |
| $P_i^{core}$ | Power consumption of core i |
| $P_{D,i}$ | Dynamic power consumption of core i |
| $P_{L,i}$ | Leakage power consumption of core i |
| $P_i^{cache\_hierarchy}$ | Sum of power consumption related to the dedicated cache banks in each level of the cache hierarchy to core i from the 1st to the kth level |
| $P_k^{static}(T)$ | Static power consumed by each layer of the kth cache level ($L_k$) at temperature T °C |
| $N$ | Number of cache levels ($L_1, L_2, \ldots, L_N$) |
| $C_k$ | Capacity of the kth cache level ($L_k$) |
| $b_{i,k}$ | Number of active cache layers in the region-set bank i stacked on core i at the kth cache level |
| $B_{i,k}$ | Accumulated cache capacity in the region-set bank i stacked on core i at the kth cache level |
| $regn_k$ | Total number of regions at the kth cache level ($L_k$) |
| $a_r$, $a_w$ | Number of read and write accesses of an application |
| $APPH_k$ | Average Power consumption Per Hit access |
| $x_{j,k}$ | Indicator of j regions at the kth cache level |
| $\gamma$ | Number of accesses per second |
| $\alpha$ | Sensitivity coefficient from the cache misses power-law |
| $E_n$ | Data sharing factor of an application with n threads |
| $T_s$ | Maximum execution time of the mapped applications |
| $E^s_{interconnection}$ | Energy consumption of the interconnection network between nodes in $T_s$ |
| $P_{interconnection}$ | Power consumption of the interconnection between nodes |
| $P^q_{n,n',n''}$ | Static power consumption of an interconnection network based on mesh topology with $n$ nodes in dimension 1, $n'$ nodes in dimension 2 and $n''$ nodes in dimension 3 |
| $E^s_{NP}$ | Average total energy dissipated in the on-chip interconnection network for transferring NP packets in $T_s$ |
| $P^{qC}_R$ | Static power consumption of a router (without any packet) |
| $P^c_R$ | Static power consumption of a router with one virtual channel (without any packet) |
| $E^s_1$ | Average total energy dissipated for transferring one packet from the source to the destination in the on-chip interconnection network |
| $E^P_R$ | Average energy dissipated in a router and the related link for a packet transfer |
| $E^f_R$ | Average energy dissipated in a router and the related link for a flit transfer |
| $D_{mesh}$ | The average distance of the mesh topology (the average number of links that a packet traverses from the source to reach the destination) |
| $v$ | Number of virtual channels per link |

Since the operating voltage of a core depends on the operating frequency, it is assumed that the square of the voltage scales linearly with the frequency of operation [44]. In Eq. (5), $P_{max}$ is the maximum power budget and $f_{max}$ is the maximum frequency of the core.
The leakage power dissipation depends on temperature. The leakage power of core i can be written as Eq. (6), where $T_t$ is the ambient temperature at time t and $h_i$ is an empirical coefficient for temperature-dependent leakage power dissipation. Cores with the same microarchitecture have the same $h_i$ coefficient.
$$P_{L,i} = h_i \cdot T_t, \quad \forall i, t \tag{6}$$
In this work, we consider peak leakage power for core power modeling, as in other works [14,15]. Therefore, the model uses the maximum sustainable temperature of the chip:
$$P_{L,i} = h_i \cdot T_{max}, \quad \forall i \tag{7}$$
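The core power model of Eqs. (3), (4) and (7) can be illustrated with a minimal Python sketch. The dynamic term is an assumption: since the square of the voltage is taken to scale linearly with frequency, dynamic power (proportional to $V^2 f$) is modeled here as quadratic in frequency and normalized so that a core running at $f_{max}$ draws $P_{max}$. The function names and numeric values are hypothetical.

```python
def core_power(f_i, f_max, p_max, h_i, t_max):
    """Power of one core: dynamic part plus peak leakage (Eqs. (4) and (7))."""
    # Assumption: V^2 scales linearly with f, so dynamic power ~ V^2 * f ~ f^2,
    # normalized so that a core at f_max draws p_max.
    p_dynamic = p_max * (f_i / f_max) ** 2
    # Eq. (7): peak leakage evaluated at the maximum sustainable temperature.
    p_leakage = h_i * t_max
    return p_dynamic + p_leakage


def cores_power(freqs, f_max, p_max, h, t_max):
    """Eq. (3): sum of the per-core power over all cores in the core layer."""
    return sum(core_power(f, f_max, p_max, h_i, t_max)
               for f, h_i in zip(freqs, h))
```

Lowering a core from $f_{max}$ to half of it cuts the assumed dynamic term to a quarter, while the leakage term stays fixed by $T_{max}$.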
5.1.2. Modeling cache hierarchy power consumption 
a) Cache hierarchy power consumption modeling for multiprogramed workloads
As shown in Fig. 1, the number of cache levels is N, and each cache level is denoted $L_k$ ($k = 1, 2, 3, \ldots, N$). In the kth cache level, $L_k$, there are $M_k$ layers, and the lth cache layer in $L_k$ is denoted $A_{k,l}$ ($l = 1, 2, 3, \ldots, M_k$).
We assume that in multiprogramed workloads, each application mapped on a core effectively sees only its own slice of the dedicated cache banks in the cache hierarchy.
$$P^{cache\_hierarchy} = \sum_{i=1}^{n} P_i^{cache\_hierarchy} \tag{8}$$
$$P_i^{cache\_hierarchy} = P^{cache\_hierarchy}_{dynamic_i} + P^{cache\_hierarchy}_{static_i} \tag{9}$$
$$P_i^{cache\_hierarchy} = N_{access} \cdot \sum_{k=1}^{N} [\ldots] \; P_{static}(T_t)_k \tag{10}$$
where h and m are hit and miss rates, respectively. Cache hit and miss rates depend on cache capacity: increasing the cache capacity allocated to a core reduces the cache miss rate. $E^{dyn}_k$ denotes the dynamic energy consumed by the kth cache level per access, $N_{access}$ is the number of accesses per second, and $P_{static}(T_t)_k$ is the static power consumed by the kth cache level, $L_k$, with capacity $C_k$ at temperature $T_t$.
The first part of Eq. (9), $P^{cache\_hierarchy}_{dynamic_i}$, depends on dynamic energy. The dynamic energy consumed by the cache depends on the Average Memory Access Time (AMAT): reducing AMAT leads to lower cache dynamic energy.
Fig. 8. The style of using the cache hierarchy in (a) a multiprogramed workload and (b) a multithreaded workload.

To formulate the first part of Eq. (10) in terms of the accessible variables of the model, we first compute the average power per access (APPA) as:
$$APPA = APPH_1 + \sum_{k=1}^{N-1} APPH_{k+1} \cdot R_k^{miss} \tag{11}$$
where $R_k^{miss}$ is the product of the cache miss rates from the 1st to the kth cache level. Since the read and write access times in emerging non-volatile memories (i.e., STT-RAM-based or PRAM-based caches) differ, $APPH_k$ is expressed as:
$$APPH_k = \frac{a_r \cdot \tau_k^r \cdot p_k^r + a_w \cdot \tau_k^w \cdot p_k^w}{a_r + a_w} \tag{12}$$
where $a_r$ and $a_w$ are the number of read and write accesses of the application, $\tau_k^r$ and $\tau_k^w$ are the latencies of read and write at the kth cache level, and $p_k^r$ and $p_k^w$ are the power consumption of read and write at the kth cache level, respectively. We can rewrite Eq. (12) as:
$$APPH_k = \frac{a_r \cdot E_k^{read} + a_w \cdot E_k^{write}}{a_r + a_w} \tag{13}$$










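Following the power law of cache misses, $R_k^{miss}$ is modeled as:
$$R_k^{miss} = \mu \cdot \left( \frac{B_k}{\sigma} \right)^{-\alpha} \tag{14}$$
where $\sigma$ is the baseline cache size, $\mu$ is the baseline cache miss rate, and $\alpha$ is the power law exponent, which typically lies between 0.3 and 0.7 [50]. $B_k$ is the sum of the allocated cache capacity from the 1st to the kth cache level and is obtained by:
$$B_k = \sum_{m=1}^{k} c_m \cdot b_m \tag{15}$$
where $c_m$ and $b_m$ are the capacity of each cache layer and the number of active cache layers at the mth cache level, respectively.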
We can rewrite the first part of Eq. (10), $P^{cache\_hierarchy}_{dynamic_i}$, based on the accessible variables as:
$$P^{cache\_hierarchy}_{dynamic_i} = \gamma \cdot \left( APPH_1 + \sum_{k=1}^{N-1} APPH_{k+1} \cdot \mu \cdot \left( \frac{B_k}{\sigma} \right)^{-\alpha} \right) \tag{16}$$
where $\gamma$ is the number of accesses per second. In Eq. (17), $d_i$ is the time-to-deadline constraint of the program allocated to core i.
$$N_{access} = \gamma = \frac{a_r + a_w}{d_i} \tag{17}$$
As a worst case, we can assume that all accesses of the mapped application go to the Nth cache level of the hierarchy, which has the largest latency. Therefore, we can set $d_i$ as:
$$d_i = a_r \cdot \tau_N^r + a_w \cdot \tau_N^w \tag{18}$$
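The dynamic part of the cache power can be sketched in Python by chaining Eqs. (11), (13), (15), (17) and (18). This is a minimal sketch, not the authors' implementation: the miss-rate expression $\mu (B_k/\sigma)^{-\alpha}$ is taken from the form visible in Eq. (21), and the dictionary keys (`e_read`, `e_write`, `tau_r`, `tau_w`, `capacity`) are hypothetical names, with `capacity` standing for $c_k \cdot b_k$ of a level.

```python
def apph(a_r, a_w, e_read, e_write):
    """Eq. (13): average power per hit access at one cache level."""
    return (a_r * e_read + a_w * e_write) / (a_r + a_w)


def miss_rate(b_k, sigma, mu, alpha):
    """Power law of cache misses: mu * (B_k / sigma) ** (-alpha)."""
    return mu * (b_k / sigma) ** (-alpha)


def dynamic_cache_power(a_r, a_w, levels, sigma, mu, alpha):
    """Dynamic cache hierarchy power for one core.

    levels: one dict per cache level, ordered from the first to the Nth level.
    """
    last = levels[-1]
    # Eq. (18): worst case, every access pays the latency of the Nth level.
    d_i = a_r * last["tau_r"] + a_w * last["tau_w"]
    # Eq. (17): accesses per second.
    gamma = (a_r + a_w) / d_i
    h = [apph(a_r, a_w, lv["e_read"], lv["e_write"]) for lv in levels]
    appa = h[0]
    accumulated = 0.0
    for k in range(len(levels) - 1):
        accumulated += levels[k]["capacity"]  # Eq. (15): B_k
        appa += h[k + 1] * miss_rate(accumulated, sigma, mu, alpha)  # Eq. (11)
    return gamma * appa
```

Allocating more capacity to the lower levels shrinks the miss-rate factor and with it the contribution of the slower, higher levels.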
The second part of Eq. (10), $P^{cache\_hierarchy}_{static_i}$, is the total leakage power consumption of the cache banks dedicated to core i, which is the main contributor to the total power consumption.
$$P^{cache\_hierarchy}_{static_i} = \sum_{k=1}^{N} b_{i,k} \cdot P_k^{static}(T_{max}) \tag{19}$$
$$b_{i,k} = \frac{B_{i,k} - B_{i,k-1}}{c_k} \tag{20}$$
$P_k^{static}(T_{max})$ is the static power consumed by each layer of the kth cache level ($L_k$) at temperature $T_{max}$.
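Eqs. (19) and (20) translate directly into a short Python sketch. It assumes $B_{i,0} = 0$ for the first level, which the text does not state explicitly; the function names are hypothetical.

```python
def active_layers(B, c):
    """Eq. (20): b_{i,k} = (B_{i,k} - B_{i,k-1}) / c_k, assuming B_{i,0} = 0.

    B: accumulated capacity per level; c: capacity of one layer per level.
    """
    layers = []
    previous = 0.0
    for B_k, c_k in zip(B, c):
        layers.append((B_k - previous) / c_k)
        previous = B_k
    return layers


def static_cache_power(B, c, p_static_tmax):
    """Eq. (19): leakage of the cache banks dedicated to one core at T_max."""
    return sum(b * p for b, p in zip(active_layers(B, c), p_static_tmax))
```

A level whose accumulated capacity equals that of the level below it has no active layers and therefore adds no leakage.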
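Eqs. (10)–(20) model cache hierarchy power consumption in [...] is used as the per-level power consumption in the region-[...]
$$[\ldots] = P_{dynamic} + P_{static}, \quad \forall i$$
$$[\ldots] \left( APPH_k + APPH_{k+1} \cdot \mu \cdot \left( \frac{B_{i,k}}{\sigma} \right)^{-\alpha} \right) + b_{i,k} \cdot P_k^{static}(T_{max}) \tag{21}$$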
b) Cache hierarchy power consumption modeling for multithreaded workloads
In the previous sub-section, we modeled cache power consumption for multiprogramed workloads, in which each program uses only the dedicated cache banks in its own core-set privately, as shown in Fig. 8(a).
A large class of multithreaded applications is based on barrier synchronization and consists of two phases of execution (shown in Fig. 9): a sequential phase, which consists of a single thread of execution, and a parallel phase, in which multiple threads process data in parallel. The parallel threads of execution in a parallel phase typically synchronize on a barrier. In the parallel phase, all threads must finish execution before the application can proceed to the next phase. In multithreaded workloads, cache levels are shared across the threads. In the parallel phase, threads share regions at each layer of the cache levels in the hierarchy, as shown in Fig. 8(b). First, we dedicate region 1 in each level to the threads. Then, based on the power budget and performance constraints in the optimization technique, we can increase the number of regions or keep it fixed in each level.
Since multithreaded applications use the cache hierarchy in a shared style, we can rewrite Eq. (9) for them as follows:
$$P^{cache\_hierarchy} = P^{cache\_hierarchy}_{dynamic} + P^{cache\_hierarchy}_{static} \tag{22}$$
We can rewrite Eq. (11) for a multithreaded program in more detail as follows:
$$APPA = APPH_1 + \sum_{k=1}^{N-1} \sum_{j=1}^{regn_k} APPH_{j+1,k} \cdot R^{miss}_{j,k} \tag{23}$$
Fig. 10. Proposed temperature aware reconfiguration mechanism.

Note that in Eq. (23), $APPH_{regn_k+1,k} = APPH_{1,k+1}$. This means that after a miss in the last region of the kth cache level, the search continues in the first region of the next level. In this equation, $R^{miss}_{j,k}$ is the product of the cache miss rates from the 1st region of the 1st cache level to the jth region of the kth cache level. $APPH_{j,k}$ is the average power per hit access to the jth region of the kth cache level and is expressed as:
$$APPH_{j,k} = \frac{a_r \cdot \tau^r_{j,k} \cdot p^r_{j,k} + a_w \cdot \tau^w_{j,k} \cdot p^w_{j,k}}{a_r + a_w} \tag{24}$$
Note that for all of the regions in each cache level, the read latencies, write latencies, read energies and write energies are the same: $\tau^r_{j,k} = \tau^r_{j+1,k}$, $\tau^w_{j,k} = \tau^w_{j+1,k}$, $p^r_{j,k} = p^r_{j+1,k}$, $p^w_{j,k} = p^w_{j+1,k}$, $\forall k, j$.
We can write Eq. (24) as:
$$APPH_{j,k} = \frac{a_r \cdot E^{read}_{j,k} + a_w \cdot E^{write}_{j,k}}{a_r + a_w} \tag{25}$$
In Eq. (23), $R^{miss}_k$ for a multithreaded program can be modeled as:
$$R^{miss}_k = [\ldots] \cdot E_n \tag{26}$$
where n is the number of cores, $\sigma$ is the baseline cache size and $\mu$ is the baseline cache miss rate.
$$[\ldots] \cdot E_n \tag{27}$$
$$B_{j,k} = \sum_{m=1}^{k} \sum_{j=1}^{regn_m} j \cdot x_{j,k} \cdot \frac{c_m}{regn_m} \tag{28}$$
$$\sum_{j=1}^{regn_k} x_{j,k} = 1, \quad \forall k \tag{29}$$
Let $x_{j,k} \in \{0, 1\}$, $j \in [1, regn_k]$, $k \in [1, N]$, be a binary variable. If it is 1, the multithreaded application uses region 0, region 1, …, region j − 1 and region j at the kth cache level. Note that $regn_k$ represents the total number of regions in the kth cache level of the hierarchy. It is fixed for each cache level.
The first part of Eq. (22), $P^{cache\_hierarchy}_{dynamic}$, based on the accessible variables is as follows:
$$P^{cache\_hierarchy}_{dynamic} = \gamma \cdot \left( APPH_1 + \sum_{k=1}^{N-1} \sum_{j=1}^{regn_k} [\ldots] \cdot E_n \right) \tag{30}$$
where $\gamma$ is modeled as in Eq. (17).
The second part of Eq. (22), $P^{cache\_hierarchy}_{static}$, is the total leakage power consumption related to the dedicated cache banks to core i, which is the main contributor to the total power consumption.
$$P^{cache\_hierarchy}_{static} = \sum_{k=1}^{N} \sum_{j=1}^{regn_k} j \cdot x_{j,k} \cdot P_k^{static}(T_{max}) \tag{31}$$
where $P_k^{static}(T_{max})$ is the static power consumed by each region of the kth cache level, $L_k$, at temperature $T_{max}$. The number of active cache layers in the kth cache level is $b_k = regn_k / 4$, $k = 1, 2, \ldots, N$. In this model, we assume that each layer of the cache levels in the hierarchy includes four regions.
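The role of the binary indicator $x_{j,k}$ can be illustrated with a short Python sketch of the shared-cache leakage term. It assumes that the double sum runs over all N levels and that exactly one indicator is set per level, as required by Eq. (29); the function name and the list layout are hypothetical.

```python
def shared_static_power(x, p_static_region):
    """Leakage of the regions used by a multithreaded application.

    x[k][j-1] is the binary indicator x_{j,k}: 1 means regions 1..j of
    level k are in use. p_static_region[k] is the static power of one
    region of level k at T_max.
    """
    total = 0.0
    for x_k, p_k in zip(x, p_static_region):
        if sum(x_k) != 1:
            # Eq. (29): exactly one indicator is set per cache level.
            raise ValueError("exactly one x_{j,k} must be 1 per level")
        # sum_j j * x_{j,k} picks out the number of regions in use.
        regions_used = sum(j * x_jk for j, x_jk in enumerate(x_k, start=1))
        total += regions_used * p_k
    return total
```

With one-hot indicators, the inner sum simply selects j, so the leakage of a level grows linearly with the number of regions switched on.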
5.1.3. Modeling on-chip interconnection power consumption 
The energy consumption of the on-chip interconnection network in $T_s$ is calculated as Eq. (32):
$$E^s_{interconnection} = E_{static} + E_{dynamic} = P^q_n \cdot T_s + E^s_{NP} \tag{32}$$
$$E^s_{NP} = NP \cdot E^s_1 = NP \cdot (D_{mesh} + 1) \cdot E^P_R = NP \cdot (D_{mesh} + 1) \cdot l \cdot E^f_R \tag{33}$$
In a mesh topology with d dimensions, in which there are $k_i$ nodes in the ith dimension, the average distance that a packet must traverse to reach the destination can be calculated as Eq. (34):












In a 2D mesh with n nodes in each dimension, the average distance between two nodes can be calculated as follows:






In a many-core platform based on a 2D mesh topology (n ≥ 32), the average distance is:




Finally, the power consumption of the on-chip interconnection between nodes can be calculated as:
$$P_{interconnection} = P^q_{n,n',n''} + \frac{E^s_{NP}}{T_s} = n \cdot n' \cdot n'' \cdot P^{qC}_R + \frac{E^s_{NP}}{T_s} = n \cdot n' \cdot n'' \cdot v \cdot P^c_R + \frac{E^s_{NP}}{T_s} \tag{37}$$
where $T_s = \max(d_i)$, $i = 1, 2, \ldots, n$, with $d_i$ from Eq. (18).
Since Eq. (37) is a function of the maximum execution time of the mapped applications, $T_s$, and $T_s$ is large compared to $E^s_{NP}$, the second term of Eq. (37) can be ignored. Therefore,
$$P_{interconnection} = n \cdot n' \cdot n'' \cdot v \cdot P^c_R \tag{38}$$
As described in [55] and shown in Fig. 3, leakage power, which is dissipated regardless of communication activity, is particularly problematic for NoC structures. At high network utilization, static power may comprise more than 75% of the total NoC power at the 22 nm technology node, and this percentage is expected to increase in future technology generations. This fact is captured by Eq. (38).
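The interconnection model can be sketched in Python as follows. The sketch covers Eqs. (33), (37) and (38) and takes the average mesh distance $D_{mesh}$ as an input. Reading the factor $l$ in Eq. (33) as the number of flits per packet is an assumption, and all function and parameter names are hypothetical.

```python
def packet_energy(num_packets, d_mesh, flits_per_packet, e_flit):
    """Eq. (33): energy for transferring NP packets over the mesh.

    Assumption: l in Eq. (33) is the number of flits per packet.
    """
    return num_packets * (d_mesh + 1) * flits_per_packet * e_flit


def interconnect_power(n1, n2, n3, v, p_router_vc, e_np=0.0, t_s=None):
    """Interconnection power of an n1 x n2 x n3 mesh.

    Without t_s, only the static term of Eq. (38) is returned; with t_s,
    the dynamic term E_NP / T_s of Eq. (37) is added.
    """
    static = n1 * n2 * n3 * v * p_router_vc
    if t_s is None:
        return static
    return static + e_np / t_s
```

Comparing the two results for a given workload shows how small the dynamic term becomes as $T_s$ grows, which is the argument used to drop it in Eq. (38).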
6. Proposed power and temperature aware reconfigurable technique
6.1. High level system description
Power and thermal management techniques are receiving a lot of attention for future many-core systems. The proposed technique, which dynamically adapts to the system, is based on an optimization problem. We model this optimization problem based on the power model presented in Section 5. The optimization problem predicts the future thermal state of the system, improves performance, and minimizes power consumption while fully satisfying the thermal constraints. Fig. 10 shows the block diagram of the proposed technique.


































Fig. 12. A look up table provided at Phase 1. 
Fig. 13. Collecting information by the central monitoring tile 6 in the core layer 








































The regulator, which is based on the optimization problem mentioned above, monitors the 3D CMP state determined by temperature and working frequencies, and decides the configuration of the cache hierarchy and the voltage/frequency for the system in the next time interval. Temperature is monitored by on-die thermal sensors. The proposed technique contains two phases: a design time phase (Phase 1) and a runtime phase (Phase 2). The regulator is designed by solving the optimization problem at design time in Phase 1. Monitoring the system status and deciding the cache configuration and core voltage/frequency in each interval are done at runtime in Phase 2. In the next subsection, we explain the operation of the proposed method in detail.
6.2. Proposed system operation
System operation can be divided into two phases: an off-line design time phase and a run-time control phase. An overview of these two phases is presented in Fig. 11.
6.2.1. Phase 1: design time
The inputs of this phase are the maximum power budget and the floorplan of the chip, the maximum and minimum operating frequencies of the cores, the time period at which DVFS and cache hierarchy reconfiguration need to be applied, and the thermal models that are obtained based on the packaging and the heat spreader, as shown in Fig. 11.
Fig. 12 shows a sample of the look-up table that is the output of Phase 1. In this table, for each starting temperature value and required average frequency of the running workload, a frequency vector and a cache hierarchy configuration (the capacity and technology of each level) are computed by solving an optimization problem described in Section 7.
6.2.2. Phase 2: run time
In typical CMP designs [51] and future many-core systems, one or more Power Control Units (PCUs) are embedded. The PCU can be a dedicated small processor for chip power management, as in Intel's Nehalem architecture, or specific low-overhead hardware [51].
In the proposed architecture, we assume a single monitor tile that collects the statistical data from the whole network. Furthermore, in order to reduce the interconnection overhead between the PCU and the monitor unit, the monitor tile should be placed near the PCU. The proposed scheme is scalable to CMPs with a large number of cores and more than one PCU.
In the proposed architecture, the PCU is in charge of conducting the reconfiguration process in a centralized manner in the on-line phase. The PCU uses the table obtained in the off-line phase to set the frequencies of the cores and to assign the capacity and technology of each level of cache in the hierarchy to each core (the reconfiguration process). The reconfiguration procedure is applied periodically, at a pre-defined time period. A sample core layer with tile 6 as the location of the monitor tile is illustrated in Fig. 13. Tile 6 is chosen as the location of the monitoring tile in this core layer since it is close to the other nodes, which helps in collecting the desired statistics. Once the statistical data from the whole network arrive at the monitor tile, they are sent to the PCU. The PCU has adequate storage to accommodate the gathered statistical data.
It is assumed that each core is equipped with a temperature sensor that reports the current temperature of the core to the central monitoring tile in the proposed architecture. It should be noted that many of today's platforms (e.g., the Versatile Express Development Platform [53], which includes an ARM big.LITTLE chip) are equipped with sensors to measure the frequency, voltage, temperature, power, and energy consumption of each core or cluster [52]. Xu et al. [45] allocated a type of TSV for thermal monitoring, which transfers the temperature information of the temperature sensors, in addition to the data and control TSVs for future 3D NoCs. Similarly, Bakker et al. [46] presented a power measurement technique based on power/thermal monitoring using on-core sensors for the Intel SCC [54].
At the end of each time interval, each core running a program monitors its current temperature and sends a control flit containing the temperature value to the monitoring tile. The control flit is shown in Fig. 14.
Fig. 14. Fields of a control flit containing temperature information.
Fig. 15. Architecture of a router preparing VIP connections [47].

The communication between the monitor tile and the other cores in the core layer is handled by virtual point-to-point (VIP) connections as a separate control network. The architecture of the routers used in this work to provide VIP connections is shown in Fig. 15.
The VIPs between the monitor tile and other cores are con-
structed on demand at run-time over the virtual channels. VIPs by-
pass the intermediate routers, and VIP connections are constructed
by borrowing one of the packet-switched virtual channels on top
of a packet-switched network. As shown in Fig. 15 , a register with
the capacity of one flit replaces the regular buffers in one vir-
tual channel (e.g. virtual channel 0) in each physical channel of
the routers in a VIP-enabled NoC. The flit in the VIP register (vir-
tual channel 0) is prioritized over regular VCs and is directed to
the crossbar input when the register has incoming flits to service.
Otherwise, a virtual channel is selected based on the outcome of
the routing function like traditional packet-switched networks. VIP
connections are not allowed to share the same links. At most one
VIP connection can be used per each router port. In contrast to the
dedicated point-to-point links which are physically established be-
tween the communicating cores and are fixed during the system
life-time, VIP connections are dynamically reconfigurable and can
be established based on the workload traffic pattern on the sys-
tem. Modarressi et al. [47] and Asad et al. [48] provide additional
details about VIP connections. The entire proposed reconfiguration procedure (deciding the cache hierarchy architecture and the V/F of each core at each interval) is carried out quickly over VIPs and does not degrade the system performance considerably. The monitor tile sends the reconfiguration commands to the cores over the VIPs.
As shown in Fig. 16, the command flit specifies the voltage/frequency of core i and the number of activated layers and banks in each level of the cache hierarchy, in shared or private use according to the P/S bit. If this bit is 1, the cache levels are in private use (multiprogramed); if it is 0, the cache levels are in shared use (multithreaded).
At the beginning of each time interval, based on the core temperature information gathered by the monitor tile, the PCU finds the maximum temperature across the cores and sets it as the starting temperature value, $T_{start}$.
Also, based on the power budget, $P_{budget}$, the average frequency, $f_{avg}(T)$, and the power dissipated by the cores and the cache hierarchy in the current interval, $P_{current}$, the PCU calculates the required average operating frequency across all cores for the next interval:
$$f_{avg}(T + 1) = f_{avg}(T) + K \cdot \left( P_{budget} - P_{current} \right) \tag{39}$$
In Eq. (39), $f_{avg}(T)$ is the frequency at the Tth time interval, $f_{avg}(T + 1)$ is the frequency at the following time interval, K is a constant computed based on the system power behavior, $P_{budget}$ is the specified power budget and $P_{current}$ is the current power.
According to $T_{start}$ and the predicted required average frequency of the cores for the next time interval, the PCU chooses the frequency assignment and the cache hierarchy configuration for the cores from the look-up table filled in Phase 1. If the required average frequency point is not supported in the table, the PCU chooses the next lower frequency row in the look-up table.
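The run-time step of the PCU can be sketched in Python as an update following Eq. (39) and a table lookup. This is only an illustration under stated assumptions: the table layout (a dictionary keyed by starting temperature, holding rows sorted by average frequency), the row contents, and the fallback to the lowest row when no lower frequency exists are all hypothetical and not taken from the paper.

```python
import bisect


def next_avg_frequency(f_avg, k, p_budget, p_current):
    """Eq. (39): required average frequency for the next interval."""
    return f_avg + k * (p_budget - p_current)


def lookup(table, t_start, f_required):
    """Pick the row for t_start whose frequency does not exceed f_required.

    table: {t_start: [(avg_freq, config), ...]} with rows sorted by avg_freq
    (hypothetical layout). If f_required is not in the table, the next lower
    frequency row is used; below the lowest row, the lowest row is returned
    (an assumption of this sketch).
    """
    rows = table[t_start]
    freqs = [f for f, _ in rows]
    idx = bisect.bisect_right(freqs, f_required) - 1
    return rows[max(idx, 0)]
```

For example, with rows at 2.0, 2.5 and 3.0 GHz, a required average frequency of about 2.7 GHz selects the 2.5 GHz row.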
6.3. Problem formulation
In this section, we investigate the details of the optimization-based reconfiguration approach and formulate its fundamental goals during run-time. We use the power model obtained in Section 5 to formulate the optimization problem used by the reconfiguration regulator. As mentioned before, future dark silicon CMPs consist of many cores, of which only a few can be simultaneously powered on or utilized within the peak power and temperature budget. Assume that the peak power budget and temperature limit are given by designer-specified values, $P_{budget}$ and $T_{max}$.
Unlike much of the prior work [2,13–24], we consider temperature as a fundamental constraint in the dark silicon estimations. Dark silicon modeling under a TDP constraint may lead to either underestimation or overestimation of dark silicon [2,66]. To provide a more accurate analysis of dark silicon, temperature needs to be considered in estimating dark silicon. Therefore, we present a coarse-grained thermal model of a 3D CMP in this section.
6.3.1. Heat propagation model
In the thermal model of a 3D CMP used in this design, each block (e.g., core and cache bank) is represented by a set of thermal model elements (i.e., thermal resistance, heat capacitance, and current source) [63]. In Fig. 17, the heat sink is located at the bottom of the chip stack and there are thermal resistances between horizontally adjacent blocks, $R_{intra}$. Also, there are thermal resistances between vertically adjacent blocks, $R_{inter}$. The power consumption of a block influences its temperature as well as the temperature of other blocks.
The steady-state temperature of each block (e.g., core and cache bank) can be calculated using this thermal model. For example, we can calculate the steady-state temperature of the thermal elements (X, Y, 0) and (X, Y, 1) in Fig. 17 as follows:
$$T^{ss}_{(X,Y,1)} = P_{(X,Y,1)} \cdot R_{inter} + T^{ss}_{(X,Y,0)} \tag{40}$$
$$T^{ss}_{(X,Y,0)} = \left( P_{(X,Y,0)} + P_{(X,Y,1)} \right) \cdot R_{hs} + T_{start} \tag{41}$$
where $T^{ss}_{(X,Y,0)}$ and $T^{ss}_{(X,Y,1)}$ are the steady-state temperatures, and $P_{(X,Y,0)}$ and $P_{(X,Y,1)}$ are the power consumption of the blocks (X, Y, 0) and (X, Y, 1), respectively. $T_{start}$ denotes the ambient temperature, and $R_{hs}$ is the thermal resistance from thermal element (X, Y, 0) to the ambient through the cooling structure. In the target CMP, we call the core and the cache banks in the same stack a core-stack. Therefore, based on Eqs. (40) and (41), the steady-state temperature of core i and the cache banks stacked directly on it, stack i, is:
$$T_{stack_i} = T_{start} + R_{hs} \cdot P_i^{core} + \sum_{k=1}^{N} R_{inter} \cdot P^{cache}_{i,k} \cdot b_{i,k}, \quad \forall i \tag{42}$$
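The core-stack temperature of Eq. (42) can be evaluated with a few lines of Python. This is a minimal sketch with hypothetical function names and illustrative values; it assumes a single $R_{inter}$ for all levels, as in Eq. (42).

```python
def stack_temperature(t_start, r_hs, r_inter, p_core, p_cache, b):
    """Eq. (42): steady-state temperature of a core-stack.

    p_cache[k] is the per-level cache power P_{i,k}^{cache} and b[k] the
    number of active layers b_{i,k} at that level.
    """
    cache_term = sum(r_inter * p_k * b_k for p_k, b_k in zip(p_cache, b))
    return t_start + r_hs * p_core + cache_term


def within_limit(temperature, t_max):
    """Thermal constraint check: the stack must stay strictly below T_max."""
    return temperature < t_max
```

Turning on an additional layer at level k raises the estimate by $R_{inter} \cdot P^{cache}_{i,k}$, which is the headroom that lowering the core power has to create.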
6.3.2. Objective functions and constraints
a) Optimization problem based on the multiprogramed workloads:
We now propose an optimization strategy to determine the optimal heterogeneous dark silicon 3D CMP architecture for multiprogramed workloads. The outputs of the optimization problem are: 1) the optimal number of active cores, 2) which cores are turned on and which cores are left dark, 3) the frequencies (and the corresponding voltages) assigned to each core, and 4) the optimal number of SRAM cache banks in L2, eDRAM cache banks in L3, STT-RAM cache banks in L4, and PRAM cache banks in L5 allocated to each core, with unassigned cache banks in the hierarchy turned off. The goal of the proposed optimization is to minimize power consumption under the temperature limit and performance constraint.
In our model, the optimal frequency of each core, $f_{cores} = \{ f_1, f_2, \ldots, f_n \}$, and the optimal number of activated SRAM, eDRAM, STT-RAM and PRAM cache banks in each level directly stacked on core i, $L_i = \{ L_1, L_2, \ldots, L_N \}$, are the optimization variables. The power optimization problem $J_1$ is presented below:
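$$\text{minimize } J_1 = \sum_{i=1}^{n} [\ldots] + P_{interconnection}, \quad \forall t \tag{43}$$
subject to:
$$f_{min} \le (f_i)_t \le f_{max}, \quad \forall t, \forall i \tag{44}$$
$$P_{max} \cdot (f_i)^{\alpha} \, [\ldots] + h_i \cdot (T_i^{core})_{init} \le (P_i^{core})_t, \quad \forall t, \forall i \tag{45}$$
$$(T_i^{core})_t = T_{start} + R_{hs} \cdot (P_i^{core})_t, \quad \forall t, \forall i \tag{46}$$
$$(T_i^{top})_t = (T_i^{core})_t + \sum_{k=1}^{N} R_{inter} \cdot (P^{cache}_{i,k})_t \cdot b_{i,k} < T_{max}, \quad \forall t, \forall i \tag{47}$$
$$(T_i^{core})_{t+1} = (T_i^{core})_t + \sum_{\forall j \in Adj_i} a_{i,j} \left( (T_j^{core})_t - (T_i^{core})_t \right) + R_{hs} \cdot (P_i^{core})_t < T_{max}, \quad \forall t, \forall i \tag{48}$$
$$[\ldots] = \gamma \cdot \left( APPH_1 + \sum_{k=1}^{N-1} [\ldots] \right) \tag{49}$$
$$[\ldots] \, (f_i)_t \ge n \times f_{avg}, \quad \forall t, \forall i \tag{50}$$
$$b_{i,k} = \frac{B_{i,k} - B_{i,k-1}}{c_k}, \quad \forall i, \forall k \tag{51}$$
$$P_{interconnection} = n \cdot N \cdot v \cdot P^c_R \tag{52}$$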
We assign $(T_i^{core})_{init} = T_{start}$ for $t = 1$, the first time frame of each interval. If the scheme is applied every 100 ms and each time frame is 0.5 ms, then the total number of time frames is 200, $1 \le t \le 200$.
In Eqs. (45) and (49), $(T_i^{core})_{init}$ is the temperature of the core layer, which is obtained by solving the optimization problem $J_1$ time frame by time frame, $(T_i^{core})_{init} = (T_i^{core})_{t-1}$.
[...] is a constant value since we record the static power at different important temperatures in a table in the off-line phase.
Eq. (44) states that the working frequencies of the cores are assumed to be continuous, ranging from $f_{min}$ to $f_{max}$. Note that if the temperature of a core in a core-stack exceeds the critical point and does not decrease when its frequency is reduced to the minimum level, the PCU turns the core off at the beginning of the next time interval. In this case, to save power and remain under $T_{max}$, the PCU groups low instructions per cycle (IPC) applications that may be running on two or more cores onto one core. IPC, an appropriate performance parameter, can be obtained from the hardware performance counters that are provided in most modern processors [51].
The heterogeneity in the operating frequency assigned to the cores, and the heterogeneity in the capacity and technology of each cache level in the hierarchy dedicated to each core, lead to heterogeneous CMPs, which are envisioned to be a promising design paradigm to combat today's dark silicon challenge.
The proposed optimization model prevents unpredictable temperature variation in each time interval during runtime. Temperature variation causes extensive temperature-related reliability problems, especially in nano-scale designs. In the proposed model, Eq. (48) prevents operating temperature variation. Eqs. (45), (49) and (52) address the leakage power as an important factor in the total power consumption of nanometer designs. More specifically, in Eqs. (45) and (49), at the start of each time interval, the leakage power is calculated based on the starting temperature of the core layer.
In Eq. (50), $f_{avg}$ is predicted based on Eq. (39) in each time interval. In Eq. (52), the on-chip interconnection power consumption is modeled as a function of the number of cache levels in the hierarchy. This is the first work which considers the on-chip interconnection power together with the cache hierarchy analytically.
NoC-Sprinting [67] is one of the newest topics in designing power-efficient NoCs in the dark silicon era. The sprinting technique proposed in [67] can activate a subset of network components (routers and links) to connect a certain number of cores during workload execution. Since the focus of [67] is on how to design the interconnect, it does not consider any problems related to execution and cache systems. An efficient sprinting mechanism should be able to provide the different levels of parallelism desired by different applications. Also, depending on workload characteristics, an optimal amount of cache and an optimal number of cores are required to provide maximal performance speedup. Based on our proposed optimization model, we can turn on network components in the regions related to the turned-on cache levels and turn off the others in the gated-off regions. Also, our model finds the optimal amount of cache and the optimal number of cores. This framework, as the first work of its kind, can help researchers who want to consider sprinting in their proposed techniques and models.
An alternate formulation of the energy efficiency optimization problem is to maximize performance under a fixed power budget. Here, instead of letting large parts of the multicore remain unused because of dark silicon, we adopt a dim silicon approach. In this approach, more cores share the available power budget, and to find an energy-efficient runtime configuration, we exploit the applications' characteristics when distributing resources. The optimization variables of the optimization problem in Eqs. (53)–(62) are the same as those of the optimization problem above, Eqs. (43)–(52). Since in the latter version the power budget is fixed by the designer in advance, it is very useful for future dark-silicon aware CMP design. This approach is presented below:
$$\text{maximize } J_2 = \sum_{i=1}^{n} (f_i)_t, \quad \forall t \qquad (53)$$
$$\text{subject to: } f_{min} \le (f_i)_t \le f_{max}, \quad \forall t, \forall i \qquad (54)$$
n ∑ 
i =1 








+ P interconnection ≤ P budget , ∀ t 
(55)
P max . 
(
( f i ) 
α




+ h i . (T core i ) init ≤ (P core i ) t , ∀ t, ∀ i (56)
$$(T_i^{core})_t = T_{start} + R_{hs} \cdot (P_i^{core})_t, \quad \forall t, \forall i \qquad (57)$$
$$(T_i^{top})_t = (T_i^{core})_t + \sum_{k=1}^{N} R_{inter} \cdot (P_{i,k}^{cache})_t \cdot b_{i,k} < T_{max}, \quad \forall t, \forall i \qquad (58)$$
(T core 
i 
) t+1 = (T core i ) t + 
∑ 











+ R hs . (P core i ) t < T max , ∀ t, ∀ i 
(59)
(




= γ . 
(
AP P A 1 + 
N−1 ∑ 
K=1 


















, ∀ t, ∀ i, ∀ k 
(60)
$$b_{i,k} = \frac{B_{i,k} - B_{i,k-1}}{c_k}, \quad \forall i, \forall k \qquad (61)$$
$$P_{interconnection} = n \cdot N \cdot v \cdot P_{cR} \qquad (62)$$
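The two thermal relations that survive intact in this formulation, Eqs. (57) and (58), can be evaluated directly. The following is a minimal sketch under stated assumptions: the function names and all numeric values in the usage note are hypothetical, and only the algebraic form is taken from the equations.

```python
def core_temperature(t_start, r_hs, p_core):
    """Eq. (57): core-layer temperature from the heat-sink resistance and core power."""
    return t_start + r_hs * p_core


def top_layer_temperature(t_core, r_inter, p_cache, b):
    """Eq. (58): temperature of the top layer of one core-stack.

    p_cache[k] and b[k] stand for (P_cache_{i,k})_t and b_{i,k} of cache level k.
    """
    return t_core + sum(r_inter * p * n for p, n in zip(p_cache, b))


def satisfies_thermal_limit(t_top, t_max):
    """Strict inequality used in Eq. (58)."""
    return t_top < t_max
```

For example, with the assumed values T_start = 45 °C, R_hs = 0.5 K/W and a 10 W core, `core_temperature` gives 50 °C; three cache levels with assumed powers of 2 W, 1 W and 1 W, `b = [4, 1, 1]` and R_inter = 0.1 K/W add 1 °C on top of that.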
Eq. (55) represents the dark silicon constraint. This equation states that the peak power dissipation while the applications are running must be less than the maximum power budget, $P_{budget}$.
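To illustrate how a budget constraint of this kind shapes the frequency assignment, the sketch below maximizes the sum of core frequencies under a single power budget. It is not the paper's Algorithm 1 and not the full model of Eqs. (53)–(62): it assumes a simplified dynamic power model `c[i] * f**alpha` with `alpha > 1`, ignores the leakage, cache and interconnect terms, and the coefficient names are hypothetical.

```python
def allocate_frequencies(c, alpha, f_min, f_max, budget, iters=100):
    """Maximize sum(f_i) s.t. sum(c_i * f_i**alpha) <= budget, f_min <= f_i <= f_max.

    Assumed (simplified) power model; requires alpha > 1 so the problem is convex.
    Solved by bisection on the Lagrange multiplier of the budget constraint.
    """
    def freqs(lam):
        # Stationarity: 1 = lam * c_i * alpha * f_i**(alpha - 1), then clip to bounds.
        out = []
        for ci in c:
            f = (1.0 / (lam * ci * alpha)) ** (1.0 / (alpha - 1.0))
            out.append(min(f_max, max(f_min, f)))
        return out

    def power(fs):
        return sum(ci * f ** alpha for ci, f in zip(c, fs))

    if power([f_min] * len(c)) > budget:
        raise ValueError("budget is infeasible even at f_min")
    if power([f_max] * len(c)) <= budget:
        return [f_max] * len(c)

    lo, hi = 1e-12, 1e12  # multiplier bracket: lo violates the budget, hi satisfies it
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric midpoint, the bracket spans many decades
        if power(freqs(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return freqs(hi)
```

With two identical cores, `alpha = 3` and a budget of 16 (arbitrary units), the sketch returns a frequency of about 2 for each core, which uses the whole budget.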
b) Optimization problem based on the Multithreaded Workloads: 
To apply these two optimization problems to multithreaded applications, we replace the term $\sum_{i=1}^{n} (P_i^{cache\_hierarchy})_t$ in Eqs. (49) and (60) with the term $P_{cache\_hierarchy}$ presented in Eq. (63), because multithreaded applications use the cache levels of the hierarchy in a shared manner. Therefore, Eqs. (49) and (60) are replaced with Eq. (63).
 
cache _ hierarchy = γ
( 
AP P H 1 + 
N−1 ∑ 
k =1 
reg n k ∑ 
j=1 
(










reg n k ∑ 
j=1 
j. x j,k . P stati c k ( T max ) (63)
where D is the degree of parallelism (DOP) of a multithreaded application (Fig. 9). We consider multithreaded applications with a fixed number of threads in this work. In future work, we are going to apply this model to multithreaded applications with variable DOP.














R inter . 
P cache _ hierarchy 
n 
≤ T max , ∀ i, t (64)
where n is the number of cores in the core layer or the number of tiles (spots) at the k-th ($k = 1, 2, \ldots, N$) cache level.
In addition, Eq. (65) is added to the two optimization problems for multithreaded applications.
$$\sum_{j=1}^{reg\_n_k} x_{j,k} = 1, \quad \forall k \qquad (65)$$
Eq. (65) identifies the number of active regions in each cache level. $x_{j,k}$ is the optimization variable which shows the number of active regions in each level. The other equations in the presented optimization problems are the same for the multithreaded version. In the multithreaded version, the frequency of each core, $f_{cores} = \{f_1, f_2, \ldots, f_n\}$, and the number of activated SRAM, eDRAM, STT-RAM and PRAM cache regions in each level directly stacked on the core layer, $reg\_n_{caches} = \{x_{j,1}, x_{j,2}, \ldots, x_{j,N}\}$, are the optimization variables.
6.3.3. Architectural overhead
The regulator (shown in Fig. 10) needs some hardware support. As described earlier in Fig. 11, the reconfiguration time period at which the proposed reconfiguration policy needs to be applied is obtained as an input. Therefore, at each tile, one counter is needed to frame the control interval. In our case, we assume
00 ms as the maximum predetermined time interval by the de-
igner, so 30 bits for the counter at each tile is sufficient. We as-
ume the PCU is co-located with the monitor tile. Since the PCU
already exists in typical CMP designs [51], all computations in our approach can be performed by the PCU without any extra hardware overhead.
The search operation in the look-up table for finding the frequency of the cores and the configuration of the cache hierarchy for the next interval, based on temperature and average frequency, consists of a few simple calculations, which can easily be handled by the PCU. Therefore, the overall hardware overhead is 30 bits per tile.
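The counter width can be sanity-checked with a short calculation. The sketch below assumes, purely for illustration, that the counter ticks at the 3 GHz core clock listed in Table 4 and uses the 100 ms interval given earlier as an example; neither assumption is stated for the counter itself, and the function name is hypothetical.

```python
def counter_bits(interval_s, clock_hz):
    """Bits needed to count every clock cycle of one control interval."""
    cycles = int(round(interval_s * clock_hz))
    return max(1, cycles.bit_length())


# Assumed example: 100 ms interval, 3 GHz tick rate (illustrative values).
bits = counter_bits(0.1, 3e9)
```

Under these assumptions the interval spans 3 × 10^8 cycles, which fits in 29 bits, so a 30-bit counter per tile leaves headroom.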
6.3.4. Solving the proposed optimization problems
In the optimization problems presented in Eqs. (43)–(52) and Eqs. (53)–(62), the objective functions and constraints, except Eqs. (49) and (60), are linear. As discussed in [78], linear functions are convex. Eqs. (49) and (60) appear to be non-linear. Because the convexity proof [78] requires all the constraints in the optimization problem to be convex functions, we show that Eqs. (49) and (60) are convex.



























































Fig. 18. Overview of dynamically adaptive mapping approach for a 3D CMP. 

































In Eqs. (49) and (60), the only term whose convexity we need to prove is:






Note that $x^{\beta}$ is convex on $\mathbb{R}_{++}$ when $\beta \ge 1$ or $\beta \le 0$ [78]. Based on this point, Eqs. (49) and (60), which consist of sums and products of convex functions, are convex. Therefore, the objective function and all the constraints in the proposed optimization problems are convex. In the same way, we can prove that Eq. (63) is convex.
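The convexity condition on the power function quoted from [78] can be checked with the standard second-derivative test. The short derivation below is included only as a reminder of that textbook fact and is not part of the original proof:

$$\frac{d^2}{dx^2}\, x^{\beta} = \beta(\beta - 1)\, x^{\beta - 2}, \qquad x > 0.$$

Since $x^{\beta-2} > 0$ on $\mathbb{R}_{++}$, the sign of the second derivative is the sign of $\beta(\beta-1)$, which is non-negative exactly when $\beta \ge 1$ or $\beta \le 0$. A non-negative weighted sum of such terms remains convex.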
To solve the optimization models, we use Maple [57], an efficient optimization solver. As the optimization models are solved for each temperature and frequency point (as presented earlier in Fig. 11), the total time taken to perform phase 1 of the method is a few hours. Note that phase 1 is performed only once for a system at design time, and the timing overhead for this is negligible.
Furthermore, efficient algorithms can be proposed to solve the proposed optimization problems. In Appendix A, we present Algorithm 1 as a formal description of a polynomial time solution for the proposed optimization problem in Eqs. (53)–(62).
7. Proposed application-aware mapping technique
Based on the hardware (performance counters and PCU) required for the runtime reconfiguration approaches proposed in Section 6, we propose a mapping technique for the target 3D CMP to improve the thermal distribution at runtime, an essential need in future many-core CMPs.
Empirically, we have observed that if a memory-bounded application/thread has a smaller amount of cache than needed, it will incur more cache misses and a lower IPC. Therefore, IPC can be a suitable parameter to distinguish between memory- and computation-bounded applications/threads [74].
Since computation-bounded applications/threads create hotspots on the chip [75], in the proposed runtime mapping technique, we place the memory- and computation-bounded applications/threads on the core layer uniformly to balance the temperature. The proposed mapping technique predicts the CPI (or IPC) of different applications/threads for the next time interval by collecting performance counter data at runtime and places tasks with complementary characteristics on adjacent neighbors to balance the temperature. The proposed mapping technique includes two stages, inter-region and intra-region mapping. In this technique, there is a CPI predictor component in each core tile, which has an important role. An overview of the proposed technique is shown in Fig. 18. The PCU is responsible for performing this mapping technique at the end of each predefined time interval.
7.1. Proposed CPI predictor
We equip each core-tile in the core layer with a CPI predictor. The CPI predictors use the performance counters available in each core-tile. The goal of the CPI predictor, shown in Fig. 18, is to determine $CPI_{T+1}(i, j)$, for all $j \in [1, n]$, for the next time interval. Let $CPI_{T+1}(i, j)$ be the CPI of application/thread i on core j in the next time interval in Eq. (67). To predict $CPI_{T+1}(i, j)$, the CPI predictor component uses CPI information measured by the hardware counter, broken down into two components: compute CPI (base CPI in the absence of miss events) and memory CPI (cycles lost due to misses in the cache hierarchy). With these measurements on core j, we predict the CPI for the next time interval using a linear predictor as follows:





are fixed parameters that are computed off-line for each core configuration. At the end of each time interval, each core in the core layer sends the CPI predicted by its CPI predictor in a control flit over VIPs to the monitoring tile, as shown in Fig.
3 . After gathering the CPI control flits, the PCU starts the mapping
algorithm.
According to the predicted CPI of each application/thread for the next epoch, if the predicted CPI is greater than a threshold introduced and set up in [74], that application/thread is classified as a memory-intensive load. If the predicted CPI is less than the threshold, that application/thread is classified as a computation-intensive load. Therefore, based on the predicted CPI for each core, the PCU allocates applications/threads to cores in two stages.
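The prediction and classification steps can be sketched as follows. The exact form of Eq. (67) is not legible here, so the linear combination below, its coefficients `a`, `b`, `c`, and the handling of a CPI exactly equal to the threshold are all assumptions made for illustration.

```python
def predict_cpi(compute_cpi, memory_cpi, a, b, c=0.0):
    """Assumed linear predictor over the two measured CPI components.

    a, b, c are hypothetical stand-ins for the off-line fitted parameters.
    """
    return a * compute_cpi + b * memory_cpi + c


def classify(predicted_cpi, threshold):
    """Above the threshold: memory-intensive; otherwise computation-intensive.

    Treating equality as computation-intensive is an assumption.
    """
    if predicted_cpi > threshold:
        return "memory-intensive"
    return "computation-intensive"
```

For instance, with assumed measurements of 1.0 (compute CPI) and 2.0 (memory CPI) and assumed coefficients `a = 1.0`, `b = 0.5`, the predicted CPI is 2.0, which a threshold of 1.5 classifies as memory-intensive.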
7.2. Stage 1: inter-region mapping
As shown in Fig. 19, we set up five thresholds, $\{t_1, t_2, t_3, t_4, t_5\}$, and based on the predicted CPI, the applications/threads are classified into five types and a weight is assigned to each core: Heavy memory-intensive (HM), Medium Heavy memory-intensive (MM), Medium (M), Heavy computation-intensive (HC), and Medium Heavy computation-intensive (MC). Then, the PCU calculates the average weight for each region ($WC_{avg_i}$) and for the whole core layer ($WC_{avg_{total}}$), as shown in Fig. 20.
Based on the algorithm shown in Fig. 21, the PCU compares each $WC_{avg_i}$ with $WC_{avg_{total}}$. If the difference between $WC_{avg_i}$ and $WC_{avg_{total}}$ is more than a threshold, the highest-weight application/thread in the region with the largest $WC_{avg_i}$ and the lowest-weight application/thread in the region with the smallest $WC_{avg_i}$ are swapped to balance the thermal distribution. This process is iterated until the difference between each $WC_{avg_i}$ and $WC_{avg_{total}}$ is under the threshold. The PCU conducts application/thread migration, if needed, after the iteration converges.
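The iteration described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of the algorithm in Fig. 21: the weight values, the iteration cap `max_iters` and the early exit when no swap can help are assumptions.

```python
def balance_regions(regions, threshold, max_iters=100):
    """Swap tasks between the heaviest and lightest regions until balanced.

    regions: list of lists of per-task weights (one inner list per region).
    Returns new lists; the input is not modified.
    """
    regions = [list(r) for r in regions]
    total_avg = sum(sum(r) for r in regions) / sum(len(r) for r in regions)
    for _ in range(max_iters):  # assumed cap to guarantee termination
        avgs = [sum(r) / len(r) for r in regions]
        if all(abs(a - total_avg) <= threshold for a in avgs):
            break
        hi = max(range(len(regions)), key=avgs.__getitem__)
        lo = min(range(len(regions)), key=avgs.__getitem__)
        i = regions[hi].index(max(regions[hi]))  # highest-weight task, heaviest region
        j = regions[lo].index(min(regions[lo]))  # lowest-weight task, lightest region
        if regions[hi][i] <= regions[lo][j]:
            break  # swapping would not reduce the imbalance
        regions[hi][i], regions[lo][j] = regions[lo][j], regions[hi][i]
    return regions
```

As a usage example with hypothetical weights (5 for a high-weight task, 1 for a low-weight task), four regions holding 1, 4, 1 and 3 high-weight tasks are balanced after two swaps when the threshold is 1.0.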
7.3. Stage 2: intra-region mapping
In the proposed mapping technique, we assume that each region on the core layer includes four cores. In this stage, the PCU sorts the applications/threads with respect to their predicted CPI and groups them in pairs by repeatedly selecting the highest and lowest ones from the remaining sorted list, e.g., ($CPI_1 > CPI_2 > CPI_3 > CPI_4$). For this ordering, for example, the algorithm places the pairs (task1, task4) and (task2, task3) on adjacent neighbors in this stage.
Fig. 20. Computing of $WC_{avg_i}$, $\{i = 1, 2, 3, 4\}$, and $WC_{avg_{total}}$ for a core layer with 16 cores.
Fig. 21. Scalable inter-region mapping algorithm.
Fig. 23. A 16-core CMP architecture with three level heterogeneous cache hierarchy
Since computation-bounded applications/threads create hotspots on the chip [75], this two-stage mapping technique places memory- and computation-bounded applications/threads on adjacent neighbors to balance the temperature on the chip.
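The pairing rule of the intra-region stage is simple enough to sketch directly. The function name and the dictionary input are hypothetical; an unpaired task left over from an odd-sized region is simply ignored here, which is an assumption, since the text only considers four-core regions.

```python
def pair_tasks(cpi_by_task):
    """Pair the highest-CPI task with the lowest-CPI task, repeatedly."""
    order = sorted(cpi_by_task, key=cpi_by_task.get, reverse=True)
    pairs = []
    while len(order) >= 2:
        highest = order.pop(0)  # most memory-intensive remaining task
        lowest = order.pop()    # most computation-intensive remaining task
        pairs.append((highest, lowest))
    return pairs
```

For predicted CPIs ordered as task1 > task2 > task3 > task4, the sketch returns the pairs (task1, task4) and (task2, task3), which matches the example in the text.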
Fig. 22 shows a simple example of the two-stage application mapping in a core layer of a CMP system with 16 cores. In this example, there are 4 regions, named R1, R2, R3, and R4. Regions R2 and R4 initially have 4 and 3 high-weight tasks, respectively, while regions R1 and R3 only have 1 high-weight task each. In this example, $WC_{avg_2} > WC_{avg_4} > WC_{avg_3} > WC_{avg_1}$. According to the algorithm in Fig. 21, the task with the highest weight in R2 is swapped with the task with the lowest weight in R1, and we get $WC_{avg_2} > WC_{avg_4} > WC_{avg_1} > WC_{avg_3}$. Similarly, the algorithm swaps the new highest-weight task in R2 with the lowest-weight task in R3 and brings the difference between each $WC_{avg_i}$ and $WC_{avg_{total}}$ under the threshold. After the inter-region reallocation, the algorithm performs the intra-region application mapping to finalize the location of each task. In this stage, for example in region R2, $CPI_{21} > CPI_{23} > CPI_{22} > CPI_{24}$, so the algorithm pairs task1 with task4 and task2 with task3 on adjacent neighbors, as shown in Fig. 22(c), to balance the temperature.
As shown in Fig. 22(c), all of the high-weight and low-weight tasks have been distributed uniformly on the core layer and, in contrast to Fig. 22(a), the center of the core layer will not be able to create a hot-spot region.
7.4. Applying the two-stage mapping technique at runtime
In this sub-section, we integrate each of the optimization-based reconfiguration schemes proposed in Section 6 with the proposed runtime mapping technique. As a first step, we assign n applications/threads to n cores in the core layer in a random manner and assign a limited number of cache banks in each level of the cache hierarchy to each core, then start running the applications/threads for a fixed time (e.g., 20 ms). Based on the data collected in the performance counters and the feedback provided to the predictor component, the CPI parameter for each core is predicted and the proposed two-stage mapping technique, shown in Fig. 18, is started by the PCU. After the applications/threads are placed on the cores as decided by the two-stage mapping in the PCU, the reconfiguration scheme proposed in Section 6 is applied every reconfiguration time interval, as shown in Fig. 11.
In order to improve the temperature distribution of the 3D CMP in the presence of workload changes, the PCU applies the proposed runtime mapping technique every 20 × reconfiguration time interval.
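The relation between the two periods can be sketched as a simple schedule. The action labels and the assumption that reconfiguration still runs in the intervals where remapping occurs are illustrative and not stated in the text.

```python
def schedule(num_intervals, mapping_period=20):
    """Action taken at the end of each reconfiguration interval (1-based)."""
    actions = []
    for k in range(1, num_intervals + 1):
        if k % mapping_period == 0:
            # Every 20th interval: remap tasks, then reconfigure (assumed order).
            actions.append("map+reconfigure")
        else:
            actions.append("reconfigure")
    return actions
```

Over 40 reconfiguration intervals, this sketch triggers the mapping technique twice, at the 20th and the 40th interval.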
8. Experimental evaluation
8.1. Experimental platform
In order to validate the efficacy of the 3D CMP architectures in this work, we employ a detailed simulation framework driven by traces extracted from real application workloads running on a full-system simulator. The traces have been extracted from the GEM5 full-system simulator [62]. For simulating a 3D CMP architecture, the traces extracted from GEM5 are interfaced with 3D Noxim, a 3D NoC simulator [61].
A full-system simulation of a 16-core CMP architecture at 32 nm technology, as shown in Fig. 23, is performed for evaluation in this work. Fig. 23 shows more detail than Fig. 1 for a 16-core CMP. As shown in this figure, in the core layer, each core is connected to a router. The area of the core tile (consisting of a processing core, private 32KB L1 instruction and data caches, and the cache controller with cache tags) is 3.5 mm², as estimated by McPAT [60] and CACTI 6.0 [59]. The detailed system configurations are given in Table 4.
GEM5 is augmented with McPAT and 3D Noxim with ORION [64] to calculate the power consumption in this paper. The cache capacities and energy consumption of SRAM and NVMs are estimated from CACTI and NVSIM [58], respectively. The simulation platform for the evaluation of our proposed and other 3D CMP architectures in this work is illustrated in Fig. 24.
For the experimental evaluation, the maximum temperature limit and the dark-silicon peak power budget, $T_{max}$ and $P_{budget}$, are assumed to be 80 °C and 100 W, respectively.
We assume that the area of each cache layer is equal to the area of the core layer. As shown in Fig. 23, in a core-stack, at the first level of the cache hierarchy, there are four SRAM cache banks, each of which has 256KB capacity, placed directly on top of one core. Based on estimates from CACTI 6.0, a 1MB SRAM cache bank and its associated router/controller have an area roughly equal to the area of one core. At the second and third levels of the cache hierarchy, there is a 4MB eDRAM and a 4MB STT-RAM cache bank placed directly on top of one core in each layer, respectively. We consider three layers in each cache level of the stacked hierarchy for the 16-core CMP, as shown in Fig. 23. Estimates from CACTI and NVSIM show that eDRAM and STT-RAM are about four times denser than SRAM in the same area. The detailed properties of the heterogeneous cache memories in different technologies are listed in Table 1.
For 3D temperature estimation, we employed HotSpot [63] version 5.0 as a grid-based thermal modeling tool. The thermal resistance and capacitance of the inter-layer material (i.e., $R_{inter}$ and C in Fig. 17) are 0.1 K/W and 140.4 J/K, respectively. The interface material between two adjacent silicon layers is connected by homogeneous TSVs.
We use multiprogramed workloads consisting of 16 applications for performing our experiments. The applications are selected from the SPEC2000/2006 benchmark suites [49]. Based on the memory demand intensity of the benchmark applications, we classified them into three groups: memory bounded, medium, and computation bounded benchmarks. From this classification, we generated a range of workloads (combinations of 16 benchmarks), as summarized in Table 5. Note that the number in parentheses is the number of instances. Moreover, we use multithreaded workloads selected from PARSEC [68] for performing the multithreaded experiments.
In our evaluation, in Section 8.2, we first evaluate the proposed power model and the runtime optimization-based reconfiguration techniques using a random mapping (Figs. 25–39). Under a random mapping approach, applications/threads are randomly mapped to the cores. In Section 8.2.2, we present a sensitivity analysis (Figs. 35–36 and Figs. 38–39). Finally, in Section 8.2.3, we integrate the proposed runtime mapping technique and the optimization-based reconfiguration approaches and study the effect of applying the mapping algorithm in parallel with the reconfiguration techniques on the temperature (Figs. 40 and 41). Based on this trend, we study the techniques proposed in Sections 6 and 7, step by step.
8.2. Experimental results
8.2.1. Experimental results for the proposed runtime optimization-based reconfiguration techniques
In this sub-section, we evaluate a 3D CMP with a stacked cache hierarchy, as shown in Fig. 23, in three different cases: a CMP with fixed core frequencies in the core layer and SRAM-only cache levels in the stacked hierarchy with fixed capacity (Baseline), the CMP with fixed core frequencies and fixed maximum available capacity at each level of the heterogeneous cache hierarchy (Hybrid-fix), and the proposed CMP with runtime heterogeneous cache hierarchy management and per-core DVFS (Hybrid-Proposed). As described in Section 6, since the power budget and temperature are limited in modern processors, for better performance in the Baseline and Hybrid-fix we use the allocated cache banks in each level in a shared manner. In the Hybrid-Proposed scheme, we partition the shared cache space according to the specific demand of each individual application in a workload set.
Based on Fig. 23, in Hybrid-Proposed the capacity of the L2 cache is 16 × 4 × 256KB SRAM, the capacity of the L3 cache is 16 × 4MB eDRAM and the capacity of the L4 cache is 16 × 4MB STT-RAM. Hybrid-fix is similar to Hybrid-Proposed in cache level capacity and number of layers in each level. Since STT-RAM and eDRAM are four times denser than SRAM, the Baseline architecture with 16 × 4 × 256KB SRAM in the L2 level, 16 × 1MB SRAM in the L3 level and 16 × 1MB SRAM in the L4 level takes up a similar amount of area to Hybrid-fix and Hybrid-Proposed.
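The capacities quoted above can be totalled with a few lines of arithmetic. The dictionary layout and helper name are hypothetical; the bank counts and sizes are the ones given in the text.

```python
KB = 1024
MB = 1024 * KB
CORES = 16

# Per-level capacity in bytes: cores x banks per core x bank size.
hybrid = {
    "L2 (SRAM)": CORES * 4 * 256 * KB,
    "L3 (eDRAM)": CORES * 4 * MB,
    "L4 (STT-RAM)": CORES * 4 * MB,
}
baseline = {
    "L2 (SRAM)": CORES * 4 * 256 * KB,
    "L3 (SRAM)": CORES * 1 * MB,
    "L4 (SRAM)": CORES * 1 * MB,
}


def total_mb(levels):
    """Total stacked cache capacity in MB."""
    return sum(levels.values()) // MB
```

The hybrid hierarchy therefore offers 16 MB + 64 MB + 64 MB = 144 MB of stacked cache, against 16 MB + 16 MB + 16 MB = 48 MB for the SRAM-only Baseline in a similar area.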
Depending on whether the look-up table (shown in Fig. 12) used in the Hybrid-Proposed scheme is filled by solving the optimization problem in Eqs. (43)–(52) or the optimization problem in Eqs. (53)–(62), we categorize it into two different schemes: Hybrid-Proposed1 and Hybrid-Proposed2, respectively. In Hybrid-Proposed1, the optimization variables result from solving the optimization problem in Eqs. (43)–(52), and in Hybrid-Proposed2, the optimization variables result from solving the optimization problem in Eqs. (53)–(62).
According to our results obtained with McPAT and Synopsys Design Compiler, the designed controller has a latency of merely 74.88 ns and consumes less than 1.63 mW for the reconfiguration operation in each time interval. Furthermore, our controller has a negligible area overhead, even though in the dark silicon era, area overhead is of less importance.
In addition to comparing Hybrid-Proposed1 and Hybrid-Proposed2 with Baseline and Hybrid-fix, we compare them with the newest proposed 3D stacked cache architecture, 3D-CRP [65]. Recently, Meng et al. [65] proposed a 3D cache resource pooling (3D-CRP) architecture and a runtime management policy. They designed an integrated cache management and job allocation policy that maximizes the energy efficiency of 3D systems for dynamically changing workloads [65]. Their policy predicts the resource requirements by collecting the performance characteristics of each workload at runtime. The 3D stacked cache system in their work is based on homogeneous traditional memory technologies, and they do not consider the maximum power budget and temperature limit as the most important constraints in the proposed technique. Also, their proposed architecture only considers multiprogramed workloads. The specification of the CRP architecture used in our work is the same as in Table 4. The characteristics of the 3D stacked cache system in the CRP are assumed to be the same as in the Baseline architecture.

Table 4
Specification of CMP configurations evaluated in this work.
| Component | Description |
|---|---|
| Number of cores | 16, 4 × 4 mesh |
| Core configuration | Alpha21164, 3 GHz, area 3.5 mm², 32 nm |
| L1 cache | SRAM, 4 way, 32B line, size 32KB per core |
| L2/L3/L4 caches | Baseline: L2: SRAM, L3: SRAM, L4: SRAM. Hybrid: L2: SRAM, L3: eDRAM, L4: STT-RAM |
| Main memory | 4GB, 320 cycle access, 4 on-chip memory controllers at each corner node |
| Network router | 2-stage wormhole switched, XYZ routing, virtual channel flow control, 2 VCs per port, a buffer with depth of 5 flits per VC, 8 flits per data packet, 1 flit per address packet, each flit 16 bytes long |
| Network topology | 3D network, each layer is a 4 × 4 mesh, each node in layer 1 has a router, 16 TSV links which are 128b bi-directional in each layer |

Fig. 26. Validation of the power consumption model under executing multithreaded workloads.
Fig. 27. Comparison of throughput results of multiprogramed workloads normal-
Fig. 28. Comparison of throughput results of multithreaded workloads normalized
First, in Figs. 25 and 26, we validate the power model proposed in Section 5 for a homogeneous (Baseline) and a heterogeneous architecture (Hybrid-fix). These figures show that the proposed power model estimates the power consumption of heterogeneous and homogeneous 3D CMPs with a good degree of accuracy when running both multiprogramed and multithreaded workloads. For the computation-intensive workloads, our model and the simulation are close, while for the memory-intensive workloads, they differ somewhat.
Figs. 27 and 28 show the results of normalized throughput for multiprogramed and multithreaded workloads, respectively, where throughput is the number of executed instructions per second (IPS). As shown in Fig. 27, Hybrid-Proposed1 yields up to 0.39% throughput improvement compared with Hybrid-fix. Similarly, Hybrid-Proposed2 yields up to 25.2% throughput improvement compared with Hybrid-fix. This is because the lower static power consumption and larger cell density of NVM technologies allow Hybrid-Proposed1 and Hybrid-Proposed2 to allocate a larger cache capacity in each level of the hierarchy and a suitable clock frequency to the cores for memory-intensive workloads under the temperature constraint. In this regard, Hybrid-Proposed1 yields up to 16% throughput improvement compared with Hybrid-Proposed2. This is because Hybrid-Proposed2 maximizes performance under the fixed power budget constraint. Because the Hybrid-Proposed methods allocate a suitable frequency/voltage to the cores under power and temperature constraints, and because of the lower static power consumption and larger cell density of NVM technologies, they work better than the CRP technique. Hybrid-Proposed1 and Hybrid-Proposed2 improve throughput by about 40% and 32.3% on average, respectively, in comparison with the CRP architecture. Moreover, Hybrid-Proposed1 and Hybrid-Proposed2 improve throughput by about 54.3% and 48.2% on average, respectively, in comparison with the Baseline architecture.
Since the CRP technique is applied to multiprogramed workloads [65], we modified it and created a new version, named CRPt, for multithreaded workloads. As shown in Fig. 28, Hybrid-Proposed1 and Hybrid-Proposed2 improve throughput by about 20% and 12%, on average, compared with the CRPt technique. Further, Hybrid-Proposed1 and Hybrid-Proposed2 improve throughput by about
5% and 16%, on average, compared with the Baseline architecture.
Fig. 29. Comparison of Energy-Delay Product (EDP) results under multiprogramed workloads execution normalized to the Baseline.
Fig. 30. Comparison of Energy-Delay Product (EDP) results under multithreaded
Fig. 31. Comparison of percentage time spent on average by the up layer of each core-stack at different temperature points, (a) under executing MB2 test program
Figs. 29 and 30 show the results of normalized energy efficiency for multiprogramed and multithreaded workloads, respectively, where energy efficiency is measured by the energy-delay product (EDP). Since Hybrid-Proposed1 and Hybrid-Proposed2 regulate the frequency/voltage of the cores and activate a larger number of cache levels due to the lower static power and larger cell density of NVM technologies, they finish execution earlier than the Hybrid-fix and Baseline schemes. As shown in Fig. 29, Hybrid-Proposed1 and Hybrid-Proposed2 yield about 26.8% and 20.2% EDP reduction on average, respectively, in comparison with Hybrid-fix. Moreover, Hybrid-Proposed1 and Hybrid-Proposed2 reduce EDP by about 52% and
8.5% on average, respectively, compared with the CRP. In conclu-
ion, Hybrid-Proposed1 and Hybrid-Proposed2 improves 61% and
8% EDP reduction on average, respectively, in comparison with the
Baseline architecture.
Fig. 30 shows that Hybrid-Proposed1 and Hybrid-Proposed2 reduce EDP by about 46% and 35% on average, respectively, compared with the CRPt. Moreover, Hybrid-Proposed1 and Hybrid-Proposed2 achieve 52.2% and 43% EDP reduction on average, respectively, compared with the Baseline architecture.
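The EDP metric and the normalized reductions reported above can be computed as follows. This is a minimal sketch: the function names are hypothetical and the energy and delay values in the usage note are made up for illustration, not measurements from the paper.

```python
def edp(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def edp_reduction(edp_new, edp_ref):
    """Fractional EDP reduction of a scheme relative to a reference scheme."""
    return 1.0 - edp_new / edp_ref
```

For example, a reference run with an assumed 10 J and 2 s has an EDP of 20 J·s; a scheme that needs 8 J and 1.5 s has an EDP of 12 J·s, a 40% reduction.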
Fig. 31(a) shows the percentage of time that the top layer of the 3D CMP system in each case, on average, spent at different temperature points when executing the MB2 test program suite. As shown in the figure, the Hybrid-Proposed1 and Hybrid-Proposed2 methods always ensure that the top layer of the proposed 3D CMP is below the maximum temperature of 80 °C, while the Baseline, CRP and Hybrid-fix spend a significant amount of time above the maximum temperature. As shown in Fig. 31(b), for the CB2 test program suite, one of the computation-intensive suites, the Baseline, CRP and Hybrid-fix schemes spend up to 57.6%, 16% and 40% of the time above the maximum temperature, respectively.
Fig. 32(a) and (b) show the comparison of the maximum temperature of the top layer of each core-stack of the 3D CMP system in each case when executing the MB2 and CB2 test program suites, respectively. As shown in the figures, the maximum temperature of the top layer of each core-stack in the Hybrid-Proposed1 and Hybrid-Proposed2 methods always remains below the maximum temperature of 80 °C, while the Baseline, CRP and Hybrid-fix schemes violate the maximum temperature constraint.
Fig. 33 plots the runtime profile of the power consumption of the 3D CMP system in each case when executing the MB2 test program suite under the power budget constraint.
As shown in Fig. 33, the purple line illustrates the maximum power budget for the system (i.e., TDP). As can be observed in this figure, the power consumption in some cases exceeds the TDP in the Baseline and Hybrid-fix schemes. As shown in this figure, Hybrid-Proposed2 never allows the total power consumption to violate the TDP.
Due to space limitations, we only present Figs. 31–33 for the MB2 and CB2 test program suites. All the test program suites used in this paper, multiprogramed and multithreaded, were evaluated, and in all of them, Hybrid-Proposed1 and Hybrid-Proposed2 did not violate the temperature and power budget constraints.
In Fig. 34, the lifetime results of each test program suite for the Baseline and Hybrid-Proposed2 schemes are presented. This calculation has been done without considering aging mechanisms such as Negative Bias Temperature Instability (NBTI) and other temperature-related reliability problems, which are beyond the scope of the current work. Due to the higher leakage power consumption of SRAM cache, which leads to increased temperature, SRAM cache in the 3D architecture suffers from NBTI degradation significantly more than the non-volatile memories. In a 3D architecture, a higher temperature in the core layer makes the cache layers hotter and thus imposes more degradation in the upper levels. The temperature variation aggravates the NBTI effects in the upper layer.
To evaluate lifetime, we assumed that the programs in a test program suite run continuously until one of the cache blocks exceeds the maximum number of endurable writes in each cache level. We assumed the maximum number of endurable writes for SRAM, DRAM, STT-RAM and PRAM based on Table 6 [56]. As shown in Fig. 34, the lifetime of the Baseline architecture is up to 7.5X and
.85X on average in comparison to the Hybrid-Proposed2, because
f the low endurance problem of NVM technologies. 
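The lifetime criterion described above (run until the most heavily written block in some cache level reaches its endurance limit) can be illustrated with a minimal sketch. The endurance values are those of Table 6; the per-block write rates and the two example hierarchies are hypothetical placeholders chosen only to show the arithmetic, not measurements from this work.

```python
# Endurance limits (maximum number of writes) as listed in Table 6.
ENDURANCE = {"SRAM": 1e16, "eDRAM": 1e16, "STT-RAM": 4e12, "PRAM": 1e9}


def lifetime_seconds(levels):
    """Time until the first cache block wears out.

    levels: list of (technology, peak_writes_per_second) pairs, one per
    cache level, where the rate is that of the most-written block in the level.
    """
    return min(ENDURANCE[tech] / rate for tech, rate in levels if rate > 0)


# Hypothetical peak per-block write rates (writes/s), for illustration only.
baseline = [("SRAM", 1e6), ("SRAM", 1e5)]
hybrid = [("SRAM", 1e6), ("STT-RAM", 1e5), ("PRAM", 1e3)]

if __name__ == "__main__":
    print("baseline lifetime (s):", lifetime_seconds(baseline))
    print("hybrid lifetime (s):  ", lifetime_seconds(hybrid))
```

In this toy setting the level with the lowest endurance-to-write-rate ratio determines the lifetime, which is why a low-endurance NVM level can dominate even when it receives few writes.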
8.2.2. Sensitivity analysis
In this sub-section, we evaluate the scalability of our proposed schemes using the Mix2 multiprogramed workload, one of the moderate workloads in Table 5, under a random mapping. Note that in all of the sensitivity studies, the architecture of the stacked cache hierarchy (number of levels, layers and technology of each cache level) is the same as in Fig. 23.
a) Sensitivity to the different number of cores: 
Table 5 
Multiprogramed workloads used in the experiment. 
Test program suite Benchmarks 
Memory Bounded set1 (MB1) zeusmp(2), libquantum(2), lbm(2), GemsFDTD(2), art(4), swim(4) 
Memory Bounded set2 (MB2) zeusmp(3), libquantum(3), lbm(3), GemsFDTD(3), art(2), swim(2) 
Memory Bounded set3 (MB3) zeusmp(4), libquantum(4), lbm(4), GemsFDTD(4) 
Medium set1 (MD1) mcf(2), sphinx3(2), bzip2(2), calculix(2), leslie3d(2), gcc(2), cactusADM, milc, omnetpp, wupwise 
Medium set2 (MD2) mcf(3), sphinx3(3), leslie3d(2), gcc(2), cactusADM(2), milc(2), omnetpp(2) 
Medium set3 (MD3) mcf(2), sphinx3(2), bzip2, calculix, leslie3d(2), gcc(2), cactusADM(2), milc(2), omnetpp(2) 
Computation Bounded set1 (CB1) parser(2), applu(2), face_rec(2), equake(2), astar(2), hmmer(2), bzip2(2), calculix(2) 
Computation Bounded set2 (CB2) parser(2), applu(2), face_rec(2), equake(2), astar(2), hmmer(2), bzip2, calculix, mpeg_dec(2) 
Computation Bounded set3 (CB3) parser(2), applu(2), face_rec(2), equake(2), astar(2), hmmer(2), mpeg_dec(4) 
Mixed set1 (Mix1) sphinx3, mcf, astar(2), hmmer(2), gamess(2),perlbench(2), gromacs(2), tonto(2), gcc, leslie3d 
Mixed set2 (Mix2) sphinx3(2), mcf, astar(2), hmmer, gamess(2), perlbench(2), soplex, gromacs, gcc(2), leslie3d(2) 
Mixed set3 (Mix3) sphinx3(2), mcf(2), astar, hmmer, gamess, perlbench, soplex(2), gcc(3), leslie3d(3) 
Fig. 32. Comparison of maximum temperature of the up layer of each core-stack 
under temperature limit constraint, (a) under executing MB2 test program suite, (b) 
under executing CB2 test program suite. 
Fig. 33. A runtime profile of the power consumption, executing MB2 test program 
suite, under power budget constraint. 
Fig. 34. Comparison of the life time results of each test program suite for Baseline 
and Hybrid-Proposed2 schemes. 
Table 6 
The endurable maximum number of writes for various 
memory technologies. 
Technology | SRAM | eDRAM | STT-RAM | PRAM
Endurance | 10^16 | 10^16 | 4 × 10^12 | 10^9
Fig. 35. Comparison of throughput results normalized to the Baseline. 






In this section, we analyze the sensitivity of our proposals to the number of cores. Figs. 35 and 36 show the results of normalized throughput and energy efficiency, respectively, for 16, 32, 48, 64, 80, 96, 112 and 128-core CMPs. As shown in Fig. 35, Hybrid-Proposed1 and Hybrid-Proposed2 have better throughput than the Baseline, CRP and Hybrid-fix for each platform. Furthermore, Figs. 35 and 36 show that the throughput and EDP improvements of the proposed methods over the Baseline increase as the number of cores increases.
Fig. 37. Illustration of the various placement of TSVs and the sub-division of core
Fig. 38. Sensitivity to placement and number of TSVs.
In our simulation, to keep the mapped workloads easy to describe, the number of cores in each platform is a multiple of 16. For example, in a 64-core system, we ran four Mix2 sets in a random distribution.
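As a small illustration of this setup, the sketch below replicates the 16-program Mix2 set (composition taken from Table 5) to fill a platform whose core count is a multiple of 16 and shuffles the result to obtain a random mapping. The function name, the seed parameter and the use of Python's `random` module are assumptions made for the example; the actual simulation infrastructure is not shown in the paper.

```python
import random

# Mix2 composition as listed in Table 5 (16 program instances in total).
MIX2 = {"sphinx3": 2, "mcf": 1, "astar": 2, "hmmer": 1, "gamess": 2,
        "perlbench": 2, "soplex": 1, "gromacs": 1, "gcc": 2, "leslie3d": 2}


def random_mapping(num_cores, seed=0):
    """Return a list in which entry i is the program assigned to core i."""
    set_size = sum(MIX2.values())
    if num_cores % set_size != 0:
        raise ValueError("number of cores must be a multiple of %d" % set_size)
    one_set = [name for name, count in MIX2.items() for _ in range(count)]
    workload = one_set * (num_cores // set_size)  # e.g. four sets for 64 cores
    random.Random(seed).shuffle(workload)         # random distribution over cores
    return workload


if __name__ == "__main__":
    print(random_mapping(64)[:8])
```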
b) Sensitivity to the placement and the number of TSVs: 
In this section, we analyze the sensitivity of our proposals to the number and placement of TSVs. For this sensitivity study, we first experiment with the placement of the TSVs in a 64-core CMP. Fig. 37 shows the three different placements that we experimented with: corner, staggering and All-TSVs. In the corner placement, TSVs are placed at the corner of the regions, while in the staggering placement, TSVs are placed staggered. Additionally, we sub-divided the core layer into 4, 8 and 16 regions by using different numbers of TSVs in the corner and staggering placements. In the corner and staggering placements, based on the XYZ routing used in the proposed architecture, packets are serialized to routers in each region. Hence, a packet routed from the core layer to the cache layer is first routed using XY routing to a particular router in the core layer, followed by a TSV traversal in the vertical direction to a router in the related cache region. Finally, the packet follows XY routing again in the cache layer to the destination cache bank. This scheme repeats when communicating from the cache layers to the core layer. In the All-TSVs option, we designate one TSV in each core tile in the core layer (64 TSVs) through which all cores can communicate their request packets with any cache bank in the corresponding region in the cache layer.
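The three-phase route described above (XY in the core layer to the TSV router, one vertical traversal, XY in the cache layer) can be sketched as a simple hop count. The coordinates, the single vertical hop and the example tile positions are assumptions made for illustration; the real architecture has several cache layers and region-specific TSV locations.

```python
def xy_hops(src, dst):
    """Hop count of dimension-ordered XY routing on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def core_to_cache_hops(core, tsv, bank):
    """XY to the TSV router, one vertical TSV hop, then XY to the cache bank.

    core, tsv and bank are (x, y) tile coordinates; the TSV connects the same
    (x, y) position in the core layer and in the cache layer (an assumption).
    """
    return xy_hops(core, tsv) + 1 + xy_hops(tsv, bank)


if __name__ == "__main__":
    core, bank = (3, 2), (1, 3)
    # Corner-style placement: the region's TSV sits at a hypothetical corner (0, 0).
    print("corner TSV:", core_to_cache_hops(core, (0, 0), bank))
    # All-TSVs: every core tile has its own TSV, so no detour in the core layer.
    print("All-TSVs:  ", core_to_cache_hops(core, core, bank))
```

The example only counts hops; it does not model the congestion on shared TSV routers, which is the effect the sensitivity study measures.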
Fig. 38 shows the result of this study. In Hybrid-Proposed1, the staggering placement of TSVs improves throughput by 5% on average compared to the corner placement. This is because, in the staggering placement, when using XY routing in the core layer, the Y-direction flows going to the TSVs do not overlap with each other, since the TSVs are now placed along different columns. Furthermore, Fig. 38 shows that throughput improves as the number of TSVs increases, because congestion is reduced. In this figure, throughput results have been normalized to All-TSVs. Note that the experimental evaluation in Section 8.2.1 has been done based on All-TSVs, as shown in Fig. 23.
c) Sensitivity to the different number of DVFS levels:
In this section, we study the effect of the number of DVFS levels, used for filling the look up table shown in Fig. 12, on the performance of our proposals. To reduce the distance between the frequency/voltage estimated by the proposed optimization problems and the values supported in the look up table, we can increase the number of DVFS levels. As shown in Fig. 39, experiments have been done for five different numbers of levels. Fig. 39 shows that the throughput improvement increases from 30-Levels to 75-Levels. As shown in this figure, the improvement from 75-Levels to 120-Levels is marginal and comes at the cost of larger lookup tables. In this figure, throughput results have been normalized to 120-Levels. Note that the experimental evaluation in Section 8.2.1 has been done based on 50-Levels.
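The effect of the number of DVFS levels on the distance between an estimated frequency and the nearest supported value can be illustrated as follows. The sketch assumes uniformly spaced levels and a hypothetical 1.0–3.0 GHz range; the actual contents of the look up table in Fig. 12 are not reproduced here.

```python
def dvfs_levels(f_min, f_max, n):
    """n uniformly spaced frequency levels between f_min and f_max (inclusive)."""
    step = (f_max - f_min) / (n - 1)
    return [f_min + i * step for i in range(n)]


def nearest_level(f_est, levels):
    """Supported frequency closest to the estimate produced by the optimizer."""
    return min(levels, key=lambda f: abs(f - f_est))


def max_quantization_gap(f_min, f_max, n):
    """Worst-case distance between an estimate and its nearest level."""
    return (f_max - f_min) / (n - 1) / 2


if __name__ == "__main__":
    # Hypothetical frequency range in GHz; level counts are those named in the text.
    for n in (30, 50, 75, 120):
        gap_mhz = 1000 * max_quantization_gap(1.0, 3.0, n)
        print(n, "levels -> worst-case gap of about", round(gap_mhz, 1), "MHz")
```

Under these assumptions the worst-case gap shrinks roughly in proportion to 1/(n − 1), so each additional level buys less than the previous one, which is consistent with the diminishing improvement observed beyond 75 levels.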
8.2.3. Experimental results for the integration of the proposed mapping technique and optimization-based reconfiguration schemes
In this subsection, we integrate the proposed runtime mapping technique of Section 6 and the optimization-based reconfiguration approaches of Section 7, and study the effect on temperature of applying the mapping algorithm in parallel with the reconfiguration techniques. For this study, we evaluate the temperature in a 64-core CMP with a 9-layer heterogeneous cache hierarchy, based on the Mix2 multiprogramed workload as in Section 8.2.2. Figs. 40 and 41 illustrate the spatial distribution of the temperature of the 5th layer (middle layer) of the cache hierarchy under the random and the proposed mapping technique, respectively. As shown in Fig. 40, the distribution of the temperature is non-uniform under random mapping. Fig. 41 shows that the temperature distribution is more uniform than in Fig. 40. This is because the proposed mapping technique places hotspot computation-intensive applications in the neighborhood of memory-intensive applications and alleviates the clustered hotspots that arise under random mapping. If two or more hotspots come close, this produces thermal coupling and therefore locally raises the temperature. In our proposed mapping, interleaving hot and cold blocks effectively helps to provide lower global power density (and therefore lower temperature). In Fig. 41, the hotspots are 3 °C colder than in Fig. 40(b).
Fig. 40. Thermal maps of the middle layer of the cache hierarchy in a 64-core CMP with random mapping, (a) Baseline, (b) Hybrid-Proposed1.
Fig. 41. Thermal map of the middle layer of the cache hierarchy in a 64-core CMP
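The interleaving idea can be illustrated with a checkerboard placement in which computation-intensive (hot) and memory-intensive (cold) applications alternate on the mesh, so that no two hot tiles are direct neighbors. This is only a simple sketch of the principle under assumed names and a 2D mesh; it is not the application-aware mapping algorithm of Section 6.

```python
def interleaved_mapping(hot_apps, cold_apps, width, height):
    """Place hot apps on 'even' tiles and cold apps on 'odd' tiles of a mesh.

    Returns a dict {(x, y): application name}. Tiles with (x + y) even never
    share an edge, so hot applications are never adjacent to each other.
    """
    tiles = [(x, y) for y in range(height) for x in range(width)]
    even = [t for t in tiles if (t[0] + t[1]) % 2 == 0]
    odd = [t for t in tiles if (t[0] + t[1]) % 2 == 1]
    if len(hot_apps) > len(even) or len(cold_apps) > len(odd):
        raise ValueError("not enough tiles for a strict checkerboard placement")
    mapping = dict(zip(even, hot_apps))
    mapping.update(zip(odd, cold_apps))
    return mapping


if __name__ == "__main__":
    # Hypothetical application labels on a 4 x 4 mesh.
    hot = ["cpu%d" % i for i in range(8)]
    cold = ["mem%d" % i for i in range(8)]
    placement = interleaved_mapping(hot, cold, 4, 4)
    for y in range(4):
        print(" ".join(placement[(x, y)] for x in range(4)))
```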
9. Conclusion
In this paper, we proposed an integrated solution of runtime cache way assignment (i.e., cache capacity allocation and power gating of unnecessary cache ways) and per-core DVFS for CMPs with a 3D stacked hybrid cache hierarchy, in order to maximize the system performance in an energy-efficient and temperature-aware manner for both multiprogramed and multithreaded workloads. We proposed analytical formulations and applied them as a runtime solution for future 3D CMPs. In addition, we proposed an application-aware mapping technique, which was integrated with the proposed runtime optimization-based reconfiguration mechanisms to balance the temperature distribution efficiently. Experimental results show that the proposed method (i.e., Hybrid-Proposed1) improves the instruction throughput and energy-delay product by about 54.3% and 61% on average, respectively, in comparison with the conventional method where homogeneous cache technology is used.
Appendix A
Algorithm 1 is a formal description of a polynomial time solution for the proposed optimization problem in Eqs. (53)–(62). In this algorithm, $L^{max}_{i,k}$ is the maximum number of layers at the k-th cache level on the core layer.
The order of this algorithm is $O(m \cdot n)$ in a CMP with n cores and m steps in each time interval. Searching can be exhaustively done (in polynomial time) to determine the best number of cores, core selection, cache configuration and frequency assignment that maximizes performance within the dark silicon peak power budget. The overall runtime overhead for this polynomial computation is about 1.16 ms in our experiment.
In this work, since the proposed technique is applied every 100 ms and each time frame is 0.5 ms, the total number of steps is 200 in each reconfiguration time interval.
Algorithm 1 Optimal frequency assignment, cache hierarchy reconfiguration and core selection.
1. $J_1^* \leftarrow \infty$, $L^{max}_{i,k} \leftarrow L^{max}_k$
2. for $i \in [1, n]$ do: for $k \in [1, N]$ do: $L_{i,k} \leftarrow 0$
3. for $t \in [1, 100]$ and $m = 200$ do:
4.   for $i \in [1, n]$ do:
5.     if ($t = 1$) then: $(T^{core}_i)_{init} \leftarrow T_{start}$
6.     else: $(T^{core}_i)_t \leftarrow T_{start} + R_{hs} \cdot (P^{core}_i)_t$
7.       $(f_i)^*_t \leftarrow f_{max}$ and $\theta \leftarrow 1$
8.       $\Delta \leftarrow P^{budget}_i - \left( P_{max} \cdot \left( (f_i)^{\alpha}_t / f^{\alpha}_{max} \right) + h_i \cdot (T^{core}_i)_t \right) - (P^{cache\_hierarchy}_i)_t$
9.     end if
10.    while ($\Delta < 0$)
11.      $\theta \leftarrow \Delta \,/\, \left( P_{max} \left( (f_i)^{\alpha}_t / f^{\alpha}_{max} \right) \right)$
12.      $(f_i)_t = \theta \cdot (f_i)^*_t$
13.      Calculate $J_1^{new}$ and $(T^{core}_i)_{t+1}$
14.      if ($J_1^{new} < J_1^*$ and $(T^{core}_i)_{t+1} > T_{max}$ and $(f_i)_t > f_{min}$)
15.      then
16.        $(f_i)^*_t \leftarrow (f_i)_t$
17.      end if
18.      for $k \in [1, N]$ do:
20.          $L_{i,k}{+}{+}$
21.          Calculate $J_1^{new}$
22.          if ($J_1^{new} < J_1^*$ and $(T^{top}_i)_t < T_{max}$) then:
23.            $J_1^* = J_1^{new}$
24.          else:
25.            $L_{i,k}{-}{-}$
26.          end if
27.        end if
28.      end for
30.    end while
31.  end for
32. end for
33. return $\{ f^*, L^* \}$
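To make the power-budget check of Algorithm 1 more concrete, the following sketch evaluates the per-core budget slack used in step 8 (budget minus frequency-dependent power, temperature-dependent leakage and cache hierarchy power) and solves it in closed form for the highest frequency that still fits the budget. It is a simplified illustration, not Algorithm 1 itself: the closed-form solve, the function names and all numeric values are assumptions, and cache reconfiguration and the objective $J_1$ are omitted.

```python
# Number of steps per reconfiguration interval: 100 ms / 0.5 ms = 200.
RECONFIG_INTERVAL_MS = 100
TIME_FRAME_MS = 0.5
STEPS_PER_INTERVAL = round(RECONFIG_INTERVAL_MS / TIME_FRAME_MS)


def power_slack(f, f_max, p_max, alpha, h, temp, p_cache, budget):
    """Budget minus (dynamic power + temperature-dependent term) minus cache power."""
    return budget - (p_max * (f / f_max) ** alpha + h * temp) - p_cache


def max_feasible_frequency(f_min, f_max, p_max, alpha, h, temp, p_cache, budget):
    """Highest frequency in [f_min, f_max] with non-negative slack, or None.

    Solves p_max * (f / f_max) ** alpha = budget - h * temp - p_cache for f.
    """
    headroom = budget - h * temp - p_cache
    if headroom <= 0:
        return None  # the core cannot be powered within this budget
    f = f_max * min(1.0, headroom / p_max) ** (1.0 / alpha)
    return f if f >= f_min else None


if __name__ == "__main__":
    # Hypothetical parameters (GHz, W, degrees C), for illustration only.
    f = max_feasible_frequency(f_min=0.5, f_max=3.0, p_max=8.0, alpha=3.0,
                               h=0.01, temp=60.0, p_cache=1.4, budget=3.0)
    print("steps per interval:", STEPS_PER_INTERVAL)
    print("max feasible frequency (GHz):", f)
```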
References
[1] R.H. Dennard , V.L. Rideout , E. Bassous , A.R. Leblanc , Design of ion-implanted
MOSFET’s with very small physical dimensions, IEEE J. Solid-State Circuits 9
(5) (1974) 256–268 . 
[2] H. Esmaeilzadeh , E. Blem , R.S. Amant , K. Sankaralingam , D. Burger , Dark silicon
and the end of multicore scaling, in: Proceedings of the International Sympo-
sium in Computer Architecture (ISCA), 2011, pp. 365–376 . 
[3] M.B. Taylor , Is dark silicon useful? Harnessing the four horsemen of the com-
ing dark silicon apocalypse, in: Proceedings of the 49th Annual Design Au-
tomation Conference (DAC), 2012, pp. 1131–1136 . 
[4] P. Bose , Is dark silicon real?: Technical perspective, Commun. ACM Mag. 56 (2)
(2013) 92 . 
[5] J. Kao , S. Narendra , A. Chandrakasan , Subthreshold leakage modeling and re-
duction techniques, in: Proceedings of the IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), 2002, pp. 141–148 . 
[6] N.S. Kim , T. Austin , D. Blaauw , T. Mudge , K. Flautner , J.S. Hu , et al. , Leakage
current: Moore’s law meets static power, Computer 36 (12) (2003) 68–75 . 
[7] W. Wang , P. Mishra , System-wide leakage-aware energy minimization using
dynamic voltage scaling and cache reconfiguration in multitasking systems,
IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20 (5) (2012) 902–910 . 
[8] R. Jammy , Materials, process and integration options for emerging technolo-
gies, SEMATECH/ISMI Symposium, 2009 . 
[9] D.H. Woo , N.H. Seong , D.L. Lewis , H.-H.S. Lee , An optimized 3D-stacked mem-
ory architecture by exploiting excessive, high-density TSV bandwidth, in: Pro-
ceedings of High-Performance Computer Architecture (HPCA), 2010, pp. 1–12 . 
[10] G. Loh , H Gabriel , 3D-stacked memory architectures for multi-core proces-
sors, in: IEEE Computer Society ACM SIGARCH Computer Architecture News,
36, 2008, pp. 453–464 . 
[11] E. Kultursay , M.T. Kandemir , A. Sivasubramaniam , O. Mutlu , Evaluating
STT-RAM as an energy efficient main memory alternative, in: Proceed-
ings of the Performance Analysis of Systems and Software (ISPASS), 2013,
pp. 256–267 . 
[12] B.C. Lee , P. Zhou , J. Yang , Y. Zhang , B. Zhao , E. Ipek , et al. , Phase-change tech-
nology and the future of main memory, IEEE Micro 30 (1) (2010) 131–141 . 
[13] G. Venkatesh , J. Sampson , N. Goulding-Hotta , S.K. Venkata , M.B. Taylor ,
S. Swanson , QsCores: Trading dark silicon for scalable energy efficiency with
quasi-specific cores, in: Proceedings of the 44th Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture (Micro), 2011, pp. 163–174 . 
[14] B. Raghunathan , Y. Turakhia , S. Garg , D. Marculescu , Cherry-picking: Exploiting
process variations in dark-silicon homogeneous chip multi-processors, in: Pro-
ceedings of the Conference on Design, Automation and Test in Europe (DATE),
2013, pp. 39–44 . 
[15] Y. Turakhia , B. Raghunathan , S. Garg , D. Marculescu , HaDeS: Architectural syn-
thesis for heterogeneous dark silicon chip multi-processors, in: Proceedings of
the 50th Annual Design Automation Conference (DAC), 2013, pp. 1–7 . 
[16] J. Allred , S. Roy , K. Chakraborty , Designing for dark silicon: a methodological
perspective on energy efficient systems, in: Proceedings of the ACM/IEEE In-
ternational Symposium on Low Power Electronics and Design (ISLPED), 2012,
pp. 255–260 . 
[17] F. Kriebel , S. Rehman , D. Sun , M. Shafique , J. Henkel , Aser: Adaptive soft error
resilience for reliability heterogeneous processors in the dark silicon era, in:
Proceedings of the ACM/EDAC/IEEE Design Automation Conference (DAC), 2014,
pp. 1–6 . 
[18] H. Esmaeilzadeh , A. Sampson , M. Ringenburg , L. Ceze , D. Grossman , D. Burger ,
Addressing dark silicon challenges with disciplined approximate computing,
in: Proceedings of the 4th Workshop on Energy-Efficient Design, 2012, pp. 1–2 .
[19] H. Esmaeilzadeh , A. Sampson , L. Ceze , D. Burger , Neural acceleration for
general-purpose approximate programs, in: Proceedings of the 45th An-
nual IEEE/ACM International Symposium on Microarchitecture (Micro), 2012,
pp. 449–460 . 
[20] SMP variable, a multi-core CPU architecture for low power and high performance, Whitepaper, http://www.nvidia.com, 2011.
[21] K. Swaminathan, E. Kultursay, V. Saripalli, V. Narayanan, M. Kandemir, S. Datta, Steep slope devices: From dark to dim silicon, IEEE Micro 33 (5) (2013) 50–59.
[22] L. Wang, K. Skadron, Dark vs. Dim Silicon and Near-Threshold Computing Extended Results, University of Virginia Department of Computer Science Technical Report TR-2013, 1.
[23] U.R. Karpuzcu, A. Sinkar, N.S. Kim, J. Torrellas, Energy smart: Toward energy-efficient many cores for near-threshold computing, in: Proceedings of the High Performance Computer Architecture (HPCA), 2013, pp. 542–553.
[24] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, et al., Conservation cores: Reducing the energy of mature computations, ACM SIGARCH Comput. Archit. News 38 (1) (2010) 205–218.
[25] D. Zhao, H. Homayoun, A.V. Veidenbaum, Temperature aware thread migration in 3D architecture with stacked DRAM, in: Proceedings of the ISQED, 2013, pp. 80–87.
[26] U. Kang, H.J. Chung, S. Heo, D.H. Park, H. Lee, J.H. Kim, et al., 8 Gb 3-D DDR3 DRAM using through-silicon-via technology, IEEE J. Solid-State Circuits 45 (1) (2010) 111–119.
[27] J. Meng , A.K. Coskun , Analysis and runtime management of 3D systems with
stacked DRAM for boosting energy efficiency, in: Proceedings of the De-
sign, Automation & Test in Europe Conference & Exhibition (DATE), 2012,
pp. 611–616 . 
[28] H. Tajik, H. Homayoun, N. Dutt, VAWOM: Temperature and process variation aware wearout management in 3D multicore architecture, in: Proceedings of the 50th Annual Design Automation Conference (DAC), 2013, pp. 1–8.
[29] B. Lee, E. Ipek, O. Mutlu, D. Burger, Architecting phase change memory as a scalable DRAM alternative, ACM SIGARCH Comput. Archit. News 37 (3) (2009) 2–13.
[30] Q. Li, Y. He, J. Li, L. Shi, Y. Chen, C.J. Xue, Compiler-assisted refresh minimization for volatile STT-RAM cache, IEEE Trans. Comput. 64 (8) (2015) 2169–2181.
[31] G. Sun , X. Dong , Y. Xie , J. Li , Y. Chen , A novel architecture of the 3D stacked
MRAM L2 cache for CMPs, in: Proceedings of the 15th International Sympo-
sium on High Performance Computer Architecture (HPCA), 2009, pp. 239–249 .
[32] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, S. Yalamanchili, An energy efficient cache design using Spin Torque Transfer (STT) RAM, in: Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), 2010, pp. 389–394.
[33] Z. Wang , D.A. Jimenez , C. Xu , G. Sun , Y. Xie , Adaptive Placement and Migra-
tion Policy for an STT-RAM-Based Hybrid Cache, in: Proceedings of the High
Performance Computer Architecture (HPCA), 2014, pp. 13–24 . 
[34] S.M. Syu, Y.H. Shao, I.C. Lin, High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy, in: Proceedings of the 23rd ACM International Conference on Great Lakes Symposium on VLSI (GLSVLSI), 2013, pp. 19–24.
[35] Q. Li, J. Li, L. Shi, M. Zhao, C.J. Xue, Y. He, Compiler-assisted STT-RAM-based hybrid cache for energy efficient embedded systems, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22 (8) (2014) 1829–1840.
[36] W. Kim, M.S. Gupta, G.-Y. Wei, D. Brooks, System level analysis of fast, per-core DVFS using on-chip switching regulators, in: Proceedings of the International Symposium on High Performance Computer Architecture (HPCA), 2008, pp. 123–134.
[37] E. Rotem , A. Mendelson , R. Ginosar , U. Weiser , Multiple clock and voltage do-
mains for chip multi processors, in: Proceedings of the 42nd Annual IEEE/ACM
International Symposium on Microarchitecture (Micro), 2009, pp. 459–468 . 
[38] T. Chantem, R.P. Dick, X.S. Hu, Temperature-aware scheduling and assignment for hard real-time applications on MPSoCs, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 19 (10) (2011) 1884–1897.
[39] J. Donald, M. Martonosi, Techniques for multicore thermal management: Classification and new exploration, ACM SIGARCH Comput. Archit. News 34 (2) (2006) 78–88.
[40] M. Arora, S. Manne, Y. Eckert, I. Paul, N. Jayasena, D. Tullsen, A comparison of core power gating strategies implemented in modern hardware, ACM SIGMETRICS Perform. Eval. Rev. 42 (1) (2014) 559–560.
[41] A. Kumar , S. Li , P. Li-Shiuan , N.K. Jha , System-level dynamic thermal manage-
ment for high-performance microprocessors, IEEE Trans. Comput.-Aided Des.
Integr. Circuits Syst. 27 (1) (2008) 96–108 . 
[42] S. Heo, K. Barr, K. Asanović, Reducing power density through activity migration, in: Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), 2003, pp. 217–222.
[43] J. Henkel, H. Khdr, S. Pagani, M. Shafique, New trends in dark silicon, in: Proceedings of the ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp. 1–6.
[44] S. Murali, M. Coenen, A. Radulescu, K. Goossens, G. De Micheli, Mapping and configuration methods for multi-use-case networks on chips, in: Proceedings of Asia and South Pacific Conference on Design Automation (ASP-DAC), 2006, pp. 146–151.
[45] T.C. Xu, G. Schley, P. Liljeberg, M. Radetzki, J. Plosila, H. Tenhunen, Optimal placement of vertical connections in 3D network-on-chip, J. Syst. Archit. 59 (7) (2013) 441–454.
[46] R. Bakker, M.W. Tol, A.D. Pimentel, Emulating asymmetric MPSoCs on the Intel SCC many-core processor, in: Proceedings of 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2014, pp. 520–527.
[47] M. Modarressi , A. Tavakkol , H. Sarbazi-Azad , Virtual point-to-point connections
for NoCs, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 29 (6) (2010)
855–868 . 
[48] A. Asad, M. Seyrafi, A.E. Zonouz, M. Soryani, M. Fathy, A predominant routing for on-chip networks, in: Proceedings of the 4th International Design and Test Workshop (IDT), 2009, pp. 1–6.
[49] Standard performance evaluation corporation [Online], Available: http://www.specbench.org.
[50] A. Hartstein, V. Srinivasan, T.R. Puzak, P.G. Emma, Cache miss behavior: is it √2?, in: Proceedings of the 3rd Conference on Computing Frontiers, 2006, pp. 313–320.
[51] R. Kumar , G. Hinton , A family of 45 nm IA processors, in: Proceedings of the
International Solid-State Circuits Conference-Digest of Technical Papers (ISSCC),
2009, pp. 58–59 . 
























































[52] T.S. Muthukaruppan , M. Pricopi , V. Venkataramani , T. Mitra , S. Vishin , Hierar-
chical power management for asymmetric multi-core in dark silicon era, in:
Proceedings of the 50th Annual Design Automation Conference (DAC), 2013,
pp. 1–9 . 
[53] ARM Ltd., http://www.arm.com/products/tools/developmentboards/versatile-express/index.php.
[54] J. Howard , et al. , A 48-Core IA-32 message-passing processor with DVFS in
45 nm CMOS, in: Proceedings of the IEEE International Solid-State Circuits
Conference (ISSCC), 2010, pp. 108–109 . 
[55] C. Sun , C.H. Chen , G. Kurian , L. Wei , J. Miller , A. Agarwal , et al. , DSENT- a tool
connecting emerging photonics with electronics for opto-electronic network-
s-on-chip modeling, in: Proceedings of the 6th IEEE/ACM International Sym-
posium on Networks on Chip (NoCS), 2012, pp. 201–210 . 
[56] M. Chang , P. Rosenfeld , S.L. Lu , B. Jacob , Technology comparison for Large
Last-Level Caches (L3Cs): Low-leakage SRAM, Low write-energy STT-RAM, and
refresh-optimized eDRAM, in: Proceedings of the 19th International Sympo-
sium on High Performance Computer Architecture (HPCA), 2013, pp. 143–154 . 
[57] Waterloo Maple Software Inc. Maple 9.5 released: 2004. 
[58] X. Dong , C. Xu , N. Jouppi , Y. Xie , NVSim: A circuit-level performance, en-
ergy, and area model for emerging non-volatile memory, in: Emerging Memory
Technologies, Springer, 2012, pp. 15–50 . 
[59] N. Muralimanohar , R. Balasubramonian , N.P. Jouppi , CACTI 6.0: A Tool to Model
Large Caches, HP Laboratories, 2009 Technical Report . 
[60] S. Li , J.H. Ahn , R.D. Strong , J.B. Brockman , D.M. Tullsen , N.P. Jouppi , McPAT:
an integrated power, area, and timing modeling framework for multicore and
manycore architectures, in: Proceedings of the 42nd Annual IEEE/ACM Interna-
tional Symposium on Microarchitecture (Micro), 2009, pp. 469–480 . 
[61] M. Palesi, S. Kumar, D. Patti, Noxim: Network-on-chip simulator, 2010. http://noxim.sourceforge.net.
[62] N. Binkert , B. Beckmann , G. Black , S.K. Reinhardt , A. Saidi , A. Basu , et al. , The
gem5 simulator, ACM SIGARCH Comput. Archit. News 39 (2) (2011) 1–7 . 
[63] W. Huang , S. Ghosh , S. Velusamy , K. Sankaranarayanan , K. Skadron , M.R. Stan ,
HotSpot: A compact thermal modeling methodology for early-stage VLSI de-
sign, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 14 (5) (2006) 501–513 . 
[64] A.B. Kahng , B. Li , L.S. Peh , K. Samadi , Orion 2.0: A power-area simulator for
interconnection networks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 20
(1) (2011) 191–196 . 
[65] T. Zhang , J. Meng , A.K. Coskun , Dynamic cache pooling in 3D multicore pro-
cessors, ACM J. Emerging Technol. Comput. Syst. (JETC) 12 (2) (2015) 1–20 . 
[66] J.M. Allred, S. Roy, K. Chakraborty, Dark silicon aware multicore systems: Employing design automation with architectural insight, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 22 (5) (2014) 1192–1196.
[67] J. Zhan, Y. Xie, G. Sun, NoC-Sprinting: Interconnect for fine-grained sprinting in the dark silicon era, in: Proceedings of the 51st ACM/EDAC/IEEE Design Automation Conference (DAC), 2014, pp. 1–6.
[68] C. Bienia , S. Kumar , J.P. Singh , K. Li , The PARSEC benchmark suite: Characteri-
zation and architectural implications, in: Proceedings of the 17th International
Conference on Parallel Architectures and Compilation Techniques (PACT), 2008,
pp. 72–81 . 
[69] S. Lee , K. Kang , J. Jung , C.M. Kyung , Runtime 3-D stacked cache data man-
agement for energy minimization of 3-D chip-multiprocessors, in: Proceedings
of the International Symposium on Quality Electronic Design (ISQED), 2014,
pp. 197–204 . 
[70] K. Kang , J. Jung , S. Yoo , C.M. Kyung , Maximizing throughput of temperature–
constrained multi-core systems with 3D-stacked cache memory, in: Proceed-
ings of the International Symposium on Quality Electronic Design (ISQED),
2011, pp. 1–6 . 
[71] K. Kang , G. De Micheli , S. Lee , C.M. Kyung , Temperature-aware runtime power
management for chip-multiprocessors with 3-D stacked cache, in: Proceedings
of the International Symposium on Quality Electronic Design (ISQED), 2014,
pp. 163–170 . 
[72] N. Madan , L. Zhao , N. Muralimanohar , A. Udipi , R. Balasubramonian , R. Iyer ,
S. Makineni , D. Newell , Optimizing communication and capacity in a 3D
stacked reconfigurable cache hierarchy, in: Proceedings of the 15th Interna-
tional Symposium on High Performance Computer Architecture (HPCA), 2009,
pp. 262–274 . 
[73] H.Y. Cheng, J. Zhan, J. Zhao, Y. Xie, J. Sampson, M.J. Irwin, Core vs. uncore: the heart of darkness, in: Proceedings of the ACM/EDAC/IEEE Design Automation Conference (DAC), 2015, pp. 1–6.
[74] A. Sadeghi , K. Raahemifar , M. Fathy , A. Asad , Lighting the dark-silicon 3D chip
multi-processors by exploiting heterogeneity in cache hierarchy, in: Proceed-
ings of the International Symposium on Embedded Multicore/Many-core Sys-
tems-on-Chip (MCSoC), 2015, pp. 182–186 . 
[75] M. Monchiero , R. Canal , A. Gonzalez , Power/performance/thermal design-space
exploration for multicore architectures, IEEE Trans. Parallel Distrib. Syst. 19 (5)
(2008) 666–681 . 
[76] A. Sodani , Race to exascale: Opportunities and challenges, MICRO 2011 (2011)
Keynote talk . 
[77] H.Y. Cheng , M. Poremba , N. Shahidi , I. Stalev , M.J. Irwin , M. Kandemir , et al. ,
EECache: A comprehensive study on the architectural design for energy-effi-
cient last-level caches in chip multiprocessors, ACM Trans. Archit. Code Optim.
(TACO) 17 (2) (2015) 1–22 . 
[78] S. Boyd , L. Vandenberghe , Convex Optimization, Cambridge University Press,
2004 . 
