TRINITY: Coordinated Performance, Energy and Temperature Management in
  3D Processor-Memory Stacks by Rao, Karthik et al.
Karthik Rao1, William Song2, Yorai Wardi3, and Sudhakar Yalamanchili4
1Georgia Institute of Technology, Atlanta, USA , raokart@gatech.edu
2Yonsei University, Seoul, South Korea , wjhsong@yonsei.ac.kr
1Georgia Institute of Technology, Atlanta, USA , ywardi@ece.gatech.edu
1Georgia Institute of Technology, Atlanta, USA , sudha@gatech.edu
ABSTRACT
The consistent demand for better performance has led to
innovations at hardware and microarchitectural levels. 3D
stacking of memory and logic dies delivers an order of magnitude
improvement in available memory bandwidth. The
price paid, however, is tight thermal constraints.
In this paper, we study the complex multiphysics interac-
tions between performance, energy and temperature. Using
a cache coherent multicore processor cycle level simulator
coupled with power and thermal estimation tools, we in-
vestigate the interactions between (a) thermal behaviors (b)
compute and memory microarchitecture and (c) application
workloads. The key insights from this exploration reveal the
need to manage performance, energy and temperature in a
coordinated fashion. Furthermore, we identify the concept of
“effective heat capacity”, i.e., the heat generated beyond which
no further gains in performance are observed with increases
in the voltage-frequency of the compute logic. Subsequently, a
real-time, numerical optimization based, application agnostic
controller (TRINITY) is developed which intelligently manages
the three parameters of interest. We observe up to 30%
improvement in Energy Delay² Product (ED2P) and up to 8 Kelvin
lower core temperatures as compared to fixed frequencies.
Compared to the ondemand Linux CPU DVFS governor, for
similar energy efficiency, TRINITY keeps the cores cooler
by 6 Kelvin which increases the lifetime reliability by up to
59%.
1. INTRODUCTION
The performance of data intensive computing systems that
process terabytes of data is increasingly limited by data movement
and the corresponding energy overheads. 3D packaging
technologies, enabled by advances such as Through-Silicon-Via
(TSV) technology [1], have led to the stacking of silicon dies,
thereby enabling the integration of memory and logic in a
small footprint with significant reductions in data movement
latency and energy. Further, the package provides an order
of magnitude increase in memory bandwidth. For example,
[Figure 1 plot: performance (×10 MIPS) and energy efficiency (ops/Joule) on one axis and temperature (Kelvin, 310 to 322) on the other, compared for 1.5GHz, TRINITY and 0.5GHz. Series: ops/Joule, Perf., Temp.]
Figure 1: TRINITY balancing performance, energy and
temperature by effectively utilizing the heat capacity.
commercial standards like DDR3-1333 [2], DDR4-2667 [3],
HBM2 [4] and HMC2 [5] realize 10.66 GB/s, 21.34 GB/s,
256 GB/s and 320 GB/s, respectively. To effectively exploit
the high bandwidth provided by 3D die stacked DRAM, mul-
tiple efforts have explored moving compute logic inside the
package as part of the die stack revisiting the early efforts
at architecting Processing-In-Memory (PIM) designs [6, 7,
8, 9, 10, 11, 12]. The compute logic layer in the 3D stack
can range from simple atomic operations to multiple Out-
of-Order (OoO) cores to general purpose low power GPUs.
However, stacking memory and logic dies in this manner
exacerbates thermal effects which, if left unchecked, will preclude
any performance gains from co-locating compute and
memory. In particular, the exponential relationship between
temperature and leakage current diminishes the performance
that can be achieved for the heat capacity of the package
thereby limiting the ability to exploit the order of magnitude
increase in available memory bandwidth [13, 14].
There is a rich body of work on managing thermal effects
in processors. Software-based efforts [15, 16, 17, 18, 19]
typically seek to redistribute heat to avoid peak temperature
violations. Hardware based efforts employ dynamic voltage
frequency scaling (DVFS) to manage the thermal fields [20,
21, 22]. Detailed thermal modeling using software packages
such as HotSpot [23] and 3D-ICE [24] enables the study of
[arXiv:1808.09087v2 [cs.AR] 9 Sep 2018]
microarchitectural effects on temperature. Although the bulk of
the work has been pursued for 2D packages, the understanding
is still relevant to 3D packages. For example, researchers
have explored the thermal coupling between cores on the
same layer and between cores on different vertically stacked
layers [25]. In general these approaches have dealt with
temperature as a constraint. We argue that temperature is a
resource that has to be managed, like memory or compute
cycles. This approach is rooted in a different view of the
relationship between performance and heat capacity.
The heat capacity of the package is established based on
the thermal design power (TDP) which is set independent of
the application characteristics. However, some applications
such as sparse matrix computations have components that
are memory bound rather than compute bound. Temperature-based
approaches that try to improve the performance of such
applications by boosting voltage-frequency to utilize
the thermal headroom [26] will simply waste power with
little or no performance gain and significant reductions in
energy efficiency. An example is shown in Figure 1, where the
energy efficiency of a memory bound benchmark decreases, and
its temperature increases, with increasing clock frequency. On
the other hand, compute intensive applications such as dense
matrix algebra may extract performance benefits from DVFS
schemes but can exceed the temperature bounds. Further-
more, thermal coupling between adjacent cores can increase
leakage current (and therefore static power) and accelerate
temperature rise leading to premature throttling [27] and
therefore performance and energy efficiency loss.
Our goal is to ensure that for the amount of heat generated
by the compute logic for an application, the maximum per-
formance (throughput) is delivered. Doing so must implicitly
improve energy efficiency. A key insight is that applications,
and some application phases, simply cannot utilize the pack-
age thermal headroom even when operating at the highest
voltage-frequency state. We attempt to capture this observation
by noting that for a specific application or phase there is
an effective heat capacity (EHC): the heat generated
beyond which further gains in performance are infeasible
with further increases in the voltage-frequency of the compute
logic. For example, an application may be operating in a
memory bound phase in which increases in compute logic
frequency have little effect on performance but may consume
the thermal headroom. Accordingly, we note that the EHC
is application-specific and time-varying. Consequently, our
goal is to maximize the performance that can be extracted
from the time-varying EHC. The solution must be online,
adaptive, and robust to modeling errors. The EHC corre-
sponds to a value of temperature which we will refer to as the
effective maximum temperature (EMT). Practical implemen-
tations will seek to operate at the EMT and minimize thermal
coupling induced leakage power.
Therefore, we propose TRINITY, a DVFS controller that
implements an on-line optimization technique that continu-
ously balances performance, energy efficiency and thermal
behaviors to fully utilize the EHC (See Fig. 1). Each voltage
island implements an independently operated TRINITY
controller that is (i) based on numerical optimization, (ii)
computationally inexpensive to implement, (iii) self-tuning,
(iv) distributed (per-core), and (v) application agnostic. The
behavior of spatially adjacent controllers is implicitly coupled
via temperature. Thus a network of interacting controllers
locally seeks to maximize throughput from the locally available
EHC, and their coordinated actions indirectly make the
most efficient use of the package heat capacity. Our vehicle for
exploration and demonstration is a cache-coherent multicore
processor integrated as the bottom die in a 3D DRAM stack,
as shown in Figure 16. Cores operate on distinct voltage
islands and are capable of operating in independent power
states each with a TRINITY controller. The controller is
designed considering practical implementation challenges
such as (i) measurement and actuation delays (ii) computa-
tion delays and (iii) hardware limitations such as having a
discrete set of voltage-frequency states. In cycle-level simulations
of a 16 core architecture, for a 10% increase in Energy
Delay Product, TRINITY keeps the temperature lower by 6
Kelvin while achieving similar energy efficiency as compared
to the ondemand Linux CPU governor. An added benefit of the
reduced temperature is the increase in lifetime reliability of
the 3D stack by up to 59%.
This paper seeks to make the following contributions:
1. The introduction of the concept of effective heat ca-
pacity as a thermal resource to be managed. (Section
2)
2. Development of TRINITY, an on-line DVFS controller
(i) based on numerical optimization, (ii) computation-
ally inexpensive to implement, (iii) self-tuning, (iv)
distributed (per-core), and (v) application agnostic for
cooperatively balancing the performance, energy effi-
ciency, and thermal behaviors of applications. (Section
4.2)
3. A comprehensive simulation-based characterization of
intra- and inter-die thermal coupling effects demon-
strating the need to maximally utilize the effective heat
capacity. (Section 3)
2. MOTIVATION
Figure 2: Heat map of the core layer showing reduction
in thermal headroom for neighboring cores.
Consider a 3D architecture as described in Section 5.1 and
illustrated in Figures 15 and 16, where 16 cores are integrated
at the bottom of a 3D DRAM stack. When only one of the
cores is executing an application thread while the rest are
idle, the resulting thermal gradient from the ‘hot’ core to the
neighboring ‘cool’ cores is shown in Figure 2. We note that
the program thread executing on a core can increase the temperature
of neighboring cores by more than 7 Kelvin. A further
observation, not shown here, is that migrating this thread from a
location next to the package boundary to a location in the center
of the core die decreases the temperature by up to 10 Kelvin
(these are computed as steady state temperatures). Ideally,
we would like the thermal gradients to be zero, performance
to be maximum, and the temperature to be the local EMT at
every core.
Temperature regulation techniques are of limited utility in
achieving this goal. For example, consider the use of a
temperature regulator [28] at each core. The objective of the
regulator is to maintain a fixed temperature. We ran a graph
benchmark and set the target temperature to 340 Kelvin for
each core. In Figure 3 we observe that none of the cores can
reach the target temperature. For cores which are idle, i.e.,
whose threads are waiting to be woken up, the controller tries to
raise the temperature of the core by increasing the corresponding
voltage-frequency, but ends up wasting energy through
increases in leakage power due to the rise in temperature, with
no improvement in the performance of the core. Temperature
regulation in this form is therefore inefficient in 3D stacks
because (1) the target tracking temperature (which is the EMT)
has to be known a priori, (2) the target temperature will be different
for different cores and will vary at run-time, and (3)
temperature dynamics are rather slow (100s of milliseconds)
in comparison to application characteristics, which can
vary rapidly (microseconds). Therefore control techniques
must be on-line, adaptive, and application agnostic.
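The failure mode described above can be illustrated with a deliberately naive tracking rule. The sketch below is not the regulator of [28]: the step-up/step-down rule, the function names and the temperature readings are hypothetical, and the three frequency states are simply the fixed frequencies used elsewhere in this paper. It only shows that a core which never reaches the 340 Kelvin target is driven to, and held at, the highest voltage-frequency state regardless of whether it is doing useful work.

```python
# Hypothetical illustration only; this is NOT the regulator from [28].
FREQ_STATES_GHZ = [0.5, 1.0, 1.5]  # discrete voltage-frequency states (assumed)


def regulator_step(state_idx, temp_kelvin, target_kelvin=340.0):
    """Naive tracking rule: step up when below target, step down when above."""
    if temp_kelvin < target_kelvin:
        return min(state_idx + 1, len(FREQ_STATES_GHZ) - 1)
    if temp_kelvin > target_kelvin:
        return max(state_idx - 1, 0)
    return state_idx


def track_target(temps_kelvin, start_idx=0):
    """Return the frequency chosen after each temperature reading."""
    idx = start_idx
    chosen = []
    for temp in temps_kelvin:
        idx = regulator_step(idx, temp)
        chosen.append(FREQ_STATES_GHZ[idx])
    return chosen


# Hypothetical readings for an idle core that stays well below the target.
idle_core_temps = [305.0, 308.0, 311.0, 313.0, 314.0]
print(track_target(idle_core_temps))
```

Because every reading is below the target, the rule saturates at 1.5GHz after two samples and stays there, which is the wasteful behavior the text attributes to fixed-target regulation on idle cores.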
The preceding example with temperature regulation illus-
trates an important point - for certain applications and dur-
ing certain application phases, package heat capacity cannot
be utilized completely. This points to the existence of an
effective heat capacity which roughly corresponds to the
temperature of the cores beyond which there is little or no
increase in performance. We refer to this temperature as the
effective maximum temperature. It also represents an energy
efficient (ops/joule) operating point. Note that the heat ca-
pacity of the entire package is established independent of
the specific workload and that effective heat capacity of an
application can be time varying. A thread currently in a mem-
ory intensive phase with EMT of X Kelvin, may transition
into a compute intensive phase where its EMT is Y Kelvin
(Y > X). Without profiling an application extensively, tracking
the EMT is a challenge. To further emphasize the effect
of EHC, we present data from two benchmark applications,
blackscholes (PARSEC [29]) and tc (GraphBig [30]), in
Table 1. The average temperature of the cores in Kelvin
and the average performance in Million-Instructions-Per-Second
(MIPS) for the two benchmarks are listed for three different
fixed frequencies.
Performance and temperature characteristics of both applications
vary widely. In order to demonstrate that there
is room to improve performance, energy and temperature in
these systems, we also compute the Energy Delay² Product
(ED2P) at the three fixed frequencies. For compute intensive
applications like blackscholes, the best ED2P is achieved at
the highest frequency. But for memory intensive benchmarks
like tc, there is no appreciable improvement in ED2P beyond
1.0GHz. The goal of TRINITY is to dynamically track these
behaviors with distributed on-line control.
[Figure 3 plots: core temperature in Kelvin (y-axis) versus time in ms (x-axis, 1 to 993) for Core4 to Core7 (first panel) and Core9, Core10, Core11 and Core14 (second panel); target temperature = 340K. Inset core numbering, top row first:
10 11 14 15
 8  9 12 13
 2  3  6  7
 0  1  4  5]
Figure 3: Temperature Regulation Inefficiency: Except
for Core5 and Core11, the rest of the cores are idle. At the 400ms
mark Core11 becomes idle as well.
Table 1: Table demonstrating variable application heat
capacities and room for improving balance between performance,
temperature and energy.
Metric         Bench.         0.5GHz    1.0GHz    1.5GHz
Temp. (K)      blackscholes   318.18    329.68    340.93
Temp. (K)      tc             313.91    318.93    323.85
Perf. (MIPS)   blackscholes   12378.3   23710.7   33065.6
Perf. (MIPS)   tc             2741.3    3633.9    4026.7
ED2P           blackscholes   0.67      0.33      0.26
ED2P           tc             0.76      0.58      0.58
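The contrast between the two benchmarks can be made concrete by computing, for each step between adjacent fixed frequencies, how much performance is gained and how much temperature is spent. The sketch below uses only the temperature and MIPS values from Table 1; the function name and the "MIPS per Kelvin" figure of merit are illustrative choices, not metrics defined in this paper.

```python
# Values copied from Table 1; helper names are illustrative assumptions.
FREQS_GHZ = [0.5, 1.0, 1.5]

TABLE1 = {
    "blackscholes": {
        "temp_K": [318.18, 329.68, 340.93],
        "mips": [12378.3, 23710.7, 33065.6],
    },
    "tc": {
        "temp_K": [313.91, 318.93, 323.85],
        "mips": [2741.3, 3633.9, 4026.7],
    },
}


def marginal_gains(bench):
    """For each step between adjacent frequencies, return a tuple of
    (f_low, f_high, speedup, temperature rise in K, extra MIPS per extra K)."""
    temps = TABLE1[bench]["temp_K"]
    mips = TABLE1[bench]["mips"]
    steps = []
    for lo in range(len(FREQS_GHZ) - 1):
        hi = lo + 1
        speedup = mips[hi] / mips[lo]
        delta_temp = temps[hi] - temps[lo]
        mips_per_kelvin = (mips[hi] - mips[lo]) / delta_temp
        steps.append((FREQS_GHZ[lo], FREQS_GHZ[hi], speedup, delta_temp, mips_per_kelvin))
    return steps


for bench in TABLE1:
    for f_lo, f_hi, speedup, d_temp, per_k in marginal_gains(bench):
        print(f"{bench}: {f_lo}->{f_hi} GHz  speedup x{speedup:.3f}  "
              f"+{d_temp:.2f} K  {per_k:.1f} MIPS/K")
```

For the 1.0GHz to 1.5GHz step, tc gains roughly 11% in MIPS for about 4.9 Kelvin, while blackscholes gains roughly 39% for about 11.3 Kelvin, which is consistent with the ED2P column flattening for tc but not for blackscholes.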
We observe that it is important to make a distinction be-
tween peak temperature and effective heat capacity. The
former is a constraint that all thermal management schemes
seek to observe. Heat capacity reflects the net amount of heat
that can be generated. Observing only the former will not
maximize performance for the corresponding amount of heat.
Our goal is to extract as much performance as possible from
the heat generated by the application. Thread scheduling
techniques that seek to redistribute heat can be re-purposed
towards this end. In this sense, effective heat capacity is a
resource which the proposed on-line distributed controller
network is designed to exploit efficiently.
In light of the observations made in the previous para-
graphs, we first present a microbenchmark characterization
of the 3D stacked architecture in Section 3. We list the key
insights, which lead us to the development of the optimization
problem and its solution, described in Section 4. We then
proceed to evaluate the proposed controller over a set of
benchmark applications in Section 5. Finally, we list relevant
works in Section 6 and conclude the paper in Section 7.
3. CHARACTERIZATION
In this section we seek to find answers to the following
questions:
(1) What is the thermal impact of a hot core on neighboring
cool cores? What are the performance implications for both
the hot core and the cool cores?
(2) What is the thermal and performance behavior of a pro-
gram thread executing at different physical locations on the
core layer?
(3) How do memory addressing patterns in the L2 Cache
layer affect the temperature of the core layer and vice versa?
We therefore proceed by characterizing temperature and
performance of the 3D stack (See Section 5.1, Fig. 15 and
16) under a variety of microbenchmark workloads. Temper-
ature is measured in Kelvin and performance in MIPS. The
temperature numbers reported are steady state values. The
microbenchmarks are designed such that they (i) exhibit vari-
able ops/byte ratio, (ii) access specific memory locations, and
(iii) execute on specific physical cores.
3.1 Nomenclature
To better represent the characterization results, we first
describe a naming convention in Figure 4 which is used
throughout the characterization section of this paper. All
the microbenchmarks are single threaded programs. Most of
the results that follow have a single thread running on a single
fixed core (source core) accessing data from a single fixed
L2 Cache bank (source/remote bank). We make a distinction
when two cores are running independent microbenchmark
applications as and when required. While a microbenchmark
is running on a single core, the rest of the cores are pow-
ered up (Vdd and CLK are supplied) but idle. The 1-hop and
2-hop neighbors of the source core are termed SC+1 and
SC+2, respectively. Similarly, for the L2 Cache banks we
have SB+1 and SB+2. In what follows, a “memory intensive
benchmark” continuously performs load operations on
sequential memory locations, whereas a “compute intensive
benchmark” repeats the following two steps: (i) load a block
of data from memory, (ii) perform integer and floating point
operations for a fixed number of iterations.
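The naming convention can be sketched in a few lines. The grid below is the core numbering that appears in the floorplan insets of Figures 3 and 4; treating a "hop" as one step of Manhattan distance on that grid is an assumption, as are the helper names. Note that under this assumption diagonal neighbors fall at distance 2, whereas the figure legends carry a separate "SC+d" series whose definition is not given in the text.

```python
# Core numbering as read from the floorplan insets (top row first).
FLOORPLAN = [
    [10, 11, 14, 15],
    [8, 9, 12, 13],
    [2, 3, 6, 7],
    [0, 1, 4, 5],
]
SIZE = len(FLOORPLAN)


def position(core):
    """Return the (row, col) of a core id in FLOORPLAN."""
    for row_idx, row in enumerate(FLOORPLAN):
        if core in row:
            return row_idx, row.index(core)
    raise ValueError(f"unknown core id {core}")


def neighbors_at(core, hops):
    """Cores at exactly `hops` Manhattan steps from `core` (assumed hop metric)."""
    row0, col0 = position(core)
    return sorted(
        FLOORPLAN[r][c]
        for r in range(SIZE)
        for c in range(SIZE)
        if abs(r - row0) + abs(c - col0) == hops
    )


def location_class(core):
    """Classify a core as 'corner', 'boundary' or 'center' of the die."""
    row, col = position(core)
    on_row_edge = row in (0, SIZE - 1)
    on_col_edge = col in (0, SIZE - 1)
    if on_row_edge and on_col_edge:
        return "corner"
    if on_row_edge or on_col_edge:
        return "boundary"
    return "center"


for sc in (0, 2, 3):
    print(sc, location_class(sc), "SC+1:", neighbors_at(sc, 1), "SC+2:", neighbors_at(sc, 2))
```

With this layout, Core0, Core2 and Core3 classify as corner, boundary and center, matching the three source-core locations used in the characterization.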
Among the many cases of thermal coupling, we discuss 5
types in detail as shown in Figure 4:
(a) Thermal coupling between adjacent cores.
(b) Thermal coupling between a core and an L2 Cache bank
directly on top.
(c) Thermal coupling between an L2 Cache bank and an
idle core below it.
(d) Same as (c) but with a non-idle core.
(e) Thermal coupling variation when the computation is
moved from the package boundary to the center of the
die.
3.2 Thermal Coupling Analysis
Case (a): In Figure 5b, the temperature of the source core SC
and the average temperatures of its 1-hop and 2-hop neighbors,
SC+1 and SC+2, respectively, are plotted for a memory
bound microbenchmark at three different clock frequencies
with the SC accessing data from SB and RB. Figure 5c is similar
to Figure 5b except that the microbenchmark is compute
bound. Out of the 16 possible locations for the SC, accounting
for symmetry, three locations, viz. Core0, Core2 and Core3, are
chosen. The temperature and performance trends for Core0,
Core2 and Core3 are plotted in Figures 5, 6 and 7, respectively.
We note that regardless of whether the benchmark running on
SC is memory intensive or compute intensive, although SC+1
is idle, due to thermal coupling, steady state temperature of
SC+1 can go as high as 325 Kelvin. Thermal coupling effects
are negligible beyond a 2-hop neighborhood concurring with
prior work [31] (albeit [31] is for a 2D architecture). The
extent of thermal coupling in a 3D architecture however, is
more pronounced within the 1-hop neighborhood due to heat
shielding from upper layers. The key observation we make
is:
Observation 1: A ‘hot’ core reduces the EHC of neighboring
‘cool’ cores by up to 7 Kelvin.
A second, more subtle observation is obtained by analyzing
the steady state temperature and performance of the SC when
accessing SB and RB (See Fig. 5a and 5b). By addressing
an RB, the SC temperature can be reduced by up to 8 Kelvin.
This, however, comes at the price of a 30% reduction in
performance. Therefore,
Observation 2: Memory address re-mapping has the poten-
tial to trade-off performance for reduction in temperature.
Case (e): Carrying forward from Case (a), we repeat the
same set of experiments but choose to run the microbenchmark
on an SC that is physically located at three specific
locations: (1) Corner (Core0), (2) Boundary (Core2) and (3)
Center (Core3). Using ‘Corner’ as the reference, we compute
the differences in temperature and performance for the other
two locations. Specifically, we annotate the differences as
follows: Corner - Boundary (C-B) and Corner - Center (C-C).
The trend of the data obtained is plotted in Figure 8. The
difference between Figures 8a, 8b and 8c is only in the
memory location addressed: SB, RB and SB+1, respectively.
Temperature difference in Kelvin and performance difference
in MIPS are plotted on the y-axis. In general, moving
the application thread from the corner to the boundary or center
reduces the temperature of the SC by 1 to 10 Kelvin
with negligible loss in performance. The greatest difference
is seen for the C-C case. Not only does the SC experience a
reduction in temperature, but its neighbors SC+1 also benefit by
up to 4 Kelvin due to the relocation. Note, however, that this
phenomenon does not nullify Case (a). Only the magnitude
of thermal coupling is mitigated to a small extent.
Observation 3: Package boundaries become increasingly
important in 3D stacked environments. OS level thread
scheduling in cooperation with DVFS schemes can lead to
better utilization of the EHC.
To completely understand the thermal coupling between
the compute and memory layers, we divide this inter-layer
thermal coupling into Cases (b), (c) and (d). For Cases (b)
and (c) we refer to Figure 9 and for Case (d) we refer to Fig-
ure 10. Before we present the analysis, it is essential to note
that for the 3D architecture under consideration, in steady
state, the core layer always has the highest temperature when
compared to upper layers.
[Figure 4 diagram: 4×4 core floorplan (top row first: 10 11 14 15; 8 9 12 13; 2 3 6 7; 0 1 4 5) with one panel each for Case (a) to Case (e). Legend: Source Core (SC), Source L2$ Bank (SB), Remote Core (RC), Remote L2$ Bank (RB), Idle core, Data Access, Thermal Coupling.]
Figure 4: Microbenchmark characterization nomenclature.
[Figure 5 panels, each grouped by accessed bank (SB, SB+1, RB, SB+d) at 0.5GHz, 1GHz and 1.5GHz:
(a) Normalized performance of the Mem Bench and CPU Bench; performance on the y-axis is normalized w.r.t. SB.
(b) Memory bound benchmark: temperature of the source core and its neighbors (SC, SC+1, SC+d, SC+2) in Kelvin on the y-axis.
(c) Compute bound benchmark: temperature of the source core and its neighbors (SC, SC+1, SC+d, SC+2) in Kelvin on the y-axis.]
Figure 5: Performance and temperature variation when running mem bound and compute bound benchmarks on a
source core accessing source and remote cache banks at different core frequencies. SC is Core0.
Case (b): The heat flow between the SC and the SB is influenced
by whether the SB is ‘active’ or ‘idle’. The temperature
trends for the SC and SB are presented in Figure 9a. When
the SB is idle, the average SC temperatures are 312.7, 319.3
and 327.1 Kelvin at 0.5, 1.0 and 1.5GHz, respectively. But
when the SB is active, the same SC temperatures increase by
about 1, 2 and 4 Kelvin for 0.5, 1.0 and 1.5GHz, respectively.
This clearly demonstrates the influence of memory addressing
on the core layer temperatures. Not only the average
temperature but also the variance rises with increase in
frequency. At higher clock frequencies, thermal ramifications due
to memory addressing patterns are more pronounced. The
performance trend seen in Fig. 9b is in accordance with our
expectation, in that instruction throughput benefits directly
from the clock frequency increase.
Case (c): Moving along the same analysis path as before, for
this case of thermal coupling, we wish to understand the vari-
ations in temperature of an ‘idle’ core directly underneath an
‘active’ L2 Cache bank. The temperature plots of the remote
core (RC) and remote cache bank (RB) in Figure 9a illustrate
this situation. Analogous to the previous case, the bulk of the
power dissipated by the idle RC underneath the active RB is
on account of static power. Furthermore, as clock frequency
increases, the idle RC temperature can rise to as much as 5 Kelvin
above the lowest temperature on the core layer.
Case (d): This case is essentially a superposition of Cases (b)
and (c). The experiments here attempt to replicate a scenario
where multiple cores can access a single L2 Cache bank.
As described in Fig. 4, both the SC and the RC access the
RB. Since RC is not idle anymore, we observe an increasing
trend in its temperature with clock frequency. The slope of
this increase however, is slightly steeper when compared to
SC temperature (SB active) in Figure 9a. Furthermore, the
increase in the clock frequency of the {RC - RB} voltage
island causes the performance of RC and SC to improve (See
Fig. 10b). Due to the difference in network delays, however,
the slope of the performance curve for the SC is much smaller
than that for the RC.
In summary, microbenchmark characterization of the 3D
stack sheds light on subtle yet key insights. The EHC of an
application thread is affected not only by its own phases but
also by memory addressing patterns of neighboring cores.
A greedy approach to maximizing performance can indeed
utilize the thermal headroom of the package but may not
deliver the best energy efficiency (ops/Joule). Consequently,
the higher temperatures especially in thermally constrained
environments such as the one under consideration, can in-
crease thermal stresses and localized hotspots in turn reducing
lifetime reliability of the device. Nevertheless, maximizing
performance in the face of unavoidable thermal coupling, ne-
cessitates a strategy that cooperatively balances performance,
energy and temperature.
4. TRINITY
In this section, we present our proposed approach, TRIN-
ITY, an online DVFS controller that dynamically balances the
three parameters: performance, energy and temperature to
completely utilize the EHC in a 3D stack. TRINITY is, (i) ap-
plication agnostic, (ii) self-tuning, (iii) distributed (per-core),
(iv) based on numerical optimization, and (v) computationally
inexpensive to implement. The TRINITY controllers on each
core are implicitly coupled via temperature. Therefore, the
individual actions taken by the network of controllers work
towards making the best use of the EHC. We now present
a detailed description of the system models, optimization
problem and the solution approach.
[Figures 6 and 7 repeat the three panels of Figure 5, each grouped by accessed bank (SB, SB+1, RB, SB+d) at 0.5GHz, 1GHz and 1.5GHz: (a) normalized performance of the Mem Bench and CPU Bench, normalized w.r.t. SB; (b) memory bound benchmark, temperature of SC, SC+1, SC+d and SC+2 in Kelvin; (c) compute bound benchmark, temperature of SC, SC+1, SC+d and SC+2 in Kelvin.]
Figure 6: Performance and temperature variation when running mem bound and compute bound benchmarks on a
source core accessing source and remote cache banks at different core frequencies. SC is Core2.
Figure 7: Performance and temperature variation when running mem bound and compute bound benchmarks on a
source core accessing source and remote cache banks at different core frequencies. SC is Core3.
[Figure 8 panels: temperature difference for SC, SC+1, SC+d and SC+2, and performance difference, for the C-B and C-C comparisons at 0.5GHz, 1.0GHz and 1.5GHz. Each panel gives the performance and spatial temperature comparison of the source core at Corner vs. Center vs. Boundary:
(a) Source core accessing source bank (SB).
(b) Source core accessing remote bank (RB).
(c) Source core accessing 1-hop neighbor of source bank (SB+1).]
Figure 8: Influence of package boundaries on thermal coupling and performance.
[Figure 9 panels:
(a) Temperature variation of (i) source core and cache (ii) remote core and cache.
(b) Performance variation of the source core when source bank is ‘idle’ and ‘active’.]
Figure 9: Thermal coupling Cases (b) and (c). The error
bars are variances in temperature due to different
ops/byte and physical locations of the source core.
[Figure 10 panels:
(a) Temperature variation of source core and remote core.
(b) Performance variation of source core and remote core.]
Figure 10: Thermal coupling Case (d). The error bars
are variances in temperature due to different ops/byte
and physical locations of the source core.
4.1 System Models
Power Model: We linearize the power model in [28] and
arrive at a third order polynomial which captures both leakage
and dynamic power of the core. The equation is as follows:
$$P_k = \alpha f_k^{3} + \beta f_k + \gamma T_k + \delta f_k T_k + \varepsilon \tag{1}$$
where k represents the sample time instant, and $f_k$ and $T_k$ are the
clock frequency and core temperature, respectively. The first
term models the dynamic power and the last four terms of
the equation represent leakage power. Since leakage power is
strongly correlated with the technology node and packaging
parameters, we calculate β, γ, δ and ε offline via non-linear
regression (See Table 2). To enable TRINITY to be application
agnostic, the constant α, which represents the activity factor,
is determined online. Figure 11a shows that our
approximation for the leakage power is within ±5mW of the
value measured on the simulator.
Table 2: Parameters estimated offline.
Power model (Eqn. 1)        Temperature model (Eqn. 3)
β    -426.7×10^-3           a1    0.9998
γ     0.674×10^-3           b1    8.46
δ     1.618×10^-3           c1    37
ε    -90.38×10^-3           ∆t    1ms
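As a reading aid, Eqn. 1 can be evaluated directly with the offline constants of Table 2. The sketch below makes two assumptions that the text does not state: frequency is expressed in GHz and temperature in Kelvin, with power in Watts. The helper that solves Eqn. 1 for α from a single power sample is also an assumption, shown only to make the phrase "determined online" concrete; it is not necessarily how TRINITY estimates α.

```python
# Offline constants from Table 2. Units are an assumption:
# f in GHz, temperature in Kelvin, power in Watts.
BETA = -426.7e-3
GAMMA = 0.674e-3
DELTA = 1.618e-3
EPSILON = -90.38e-3


def leakage_power(freq_ghz, temp_kelvin):
    """Last four terms of Eqn. 1 (the leakage part)."""
    return (BETA * freq_ghz + GAMMA * temp_kelvin
            + DELTA * freq_ghz * temp_kelvin + EPSILON)


def total_power(alpha, freq_ghz, temp_kelvin):
    """Eqn. 1: cubic dynamic term plus the leakage terms."""
    return alpha * freq_ghz ** 3 + leakage_power(freq_ghz, temp_kelvin)


def estimate_alpha(measured_power, freq_ghz, temp_kelvin):
    """Hypothetical online step: solve Eqn. 1 for alpha from one power sample."""
    return (measured_power - leakage_power(freq_ghz, temp_kelvin)) / freq_ghz ** 3


# Example with a made-up activity factor, purely to exercise the functions.
alpha_example = 0.5
sample = total_power(alpha_example, 1.5, 330.0)
print(leakage_power(1.0, 320.0), sample, estimate_alpha(sample, 1.5, 330.0))
```

Under the assumed units, the leakage terms evaluate to a few hundred milliwatts at the frequencies and temperatures reported in this paper.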
Temperature Model: Temperature at any given point in
the 3D stack at any given time t is given by the dynamical
equation
$$\dot{T}(t) = A\,T(t) + B\,P(t) \tag{2}$$
where $T(t), P(t) \in \mathbb{R}^{M}$ are the temperature and power vectors,
respectively and the matrices A and B consist of the thermal
resistance and capacitance of the 3D stack [32]. In each of
the 6 layers in the 3D stack, broadly, we have 16 power dis-
sipating elements, therefore, M = 16×6 = 96. The large A
and B matrices capture the inter-layer and intra-layer thermal
coupling allowing for an accurate estimation of the tempera-
ture trajectory. At this juncture it is relevant to note that for
the 3D stack under consideration, we observe a time constant
for the rise in temperature of approximately 40ms and thus
the settling time is around 200ms. These numbers are in
agreement with practical observations [27]. Solving an opti-
mization problem as described in the Introduction becomes
increasingly computationally intensive as the dimensionality
of the model increases (typically O(M^3)). Instead of using
Eqn. 2 directly, we observe that discretizing and linearizing
Eqn. 2 over a short duration ∆t reduces the
model complexity significantly, from O(96^3) to O(1). The
price paid for this reduction in complexity is in the ability
to accurately predict future temperature. Nonetheless, the
temperature of a core can now be estimated ∆t seconds into
the future using the following scalar equation:
$$T_{k+1} = a_1 T_k + \Delta t\,(b_1 P_k + c_1) \qquad (3)$$
where we observe that up to ∆t = 1ms, the simplified tem-
perature model is within 1 Kelvin as compared with values
obtained from the simulator (See Fig. 11b). Analogous to the
power model, the constants a1,b1 and c1 are dependent on
the technology node and packaging design choices. Therefore,
they are estimated offline via non-linear regression (see
Table 2). The temperature estimate T_{k+1} depends on the measured
values at time sample k and thus does not accumulate
modeling errors at each time step.
(a) Leakage power, simulator vs. model: P_L (Simulator) and P_L (Model), power in W against time in ms.
(b) Temperature, simulator vs. model: T (Simulator) and T (Model), temperature in Kelvin against time in ms.
Figure 11: Leakage power and temperature model.
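The point about error accumulation can be illustrated with a short sketch of Eqn. 3. It assumes ∆t is expressed in seconds and power in Watts (units the text does not state), and the function names are hypothetical. The first helper restarts every prediction from a measured temperature, as described above; the second feeds predictions back into the model and is shown only for contrast.

```python
# Constants from Table 2; DT = 1 ms, assumed here to be expressed in seconds.
A1, B1, C1 = 0.9998, 8.46, 37.0
DT = 1e-3


def predict_next(temp_k, power_w):
    """Eqn. 3: temperature one control cycle ahead."""
    return A1 * temp_k + DT * (B1 * power_w + C1)


def one_step_predictions(measured_temps, powers):
    """Each prediction starts from a sensor reading, so model error does not compound."""
    return [predict_next(t, p) for t, p in zip(measured_temps, powers)]


def open_loop_predictions(initial_temp, powers):
    """Contrast case: predictions are fed back, so model error can build up."""
    temps, t = [], initial_temp
    for p in powers:
        t = predict_next(t, p)
        temps.append(t)
    return temps
```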
Performance Model: We choose instruction throughput
i.e. MIPS as the metric. Performance is related to the clock
frequency of the core via the following equation:
$$\chi_k(u_k) = \mathrm{IPC}_k \cdot f_k \qquad (4)$$
where IPCk is Instructions-Per-Cycle and fk is the core clock
frequency at sample time k. As shall be described shortly,
this linear approximation counterbalances temperature rise
and is therefore sufficient for the purposes of the optimization
problem under consideration.
4.2 Solution Strategy
The objective of the problem we wish to solve is encoded
mathematically as follows:
$$\max_{f_{k+1}} \; Q\,z_{k+1}^2 + R\,\chi_k(f_{k+1}) \qquad (5)$$
subject to
$$\underline{f} \le f_{k+1} \le \bar{f} \qquad (6)$$
$$0 < z_{k+1} \qquad (7)$$
where $f_{k+1}$ is the core frequency, which lies within the bounds
$\underline{f}$ = 0.5GHz and $\bar{f}$ = 1.5GHz. The term $z_{k+1} = T_{MAX} - T_{k+1}$,
where $T_{k+1}$ is the temperature of the core in Kelvin (Eqn. 3)
and $T_{MAX}$ is an upper bound for a core's temperature, which
we set to 355 Kelvin (≈ 85°C). The cost function described by
Eqn. 5 consists of two parts: the former penalizes an increase
in temperature and the latter rewards performance.
The weights Q and R (Q, R > 0 for problem feasibility) are
tuning parameters that can be modified at run-time to give
variable importance to performance and temperature. We set
Q = 1 and allow R to tune itself at run-time. Equations 5 - 7
are solved periodically, every T seconds, by each core
independently to determine $f^*_{k+1}$, the clock frequency that
maximizes the cost.
The intuition behind the problem definition is as follows:
Consider an application whose performance saturates beyond
a particular clock frequency and does not vary with time.
The periodic calculation of f ∗k+1 drives the system eventually
towards a point where the temperature of the core reaches
steady state. This steady state temperature is nothing but the
EMT and any further increase in the clock frequency will
reduce the cost thereby satisfying our original goal of max-
imally utilizing the EHC. For a particular choice of R, the
behavior of the objective function is illustrated in Figure 12.
The value T, referred to as the control cycle, is a design parameter that must be greater than the (i) measurement,
(ii) actuation, and (iii) computation delays. On processors
currently available in the market, measurement and actuation
delays are on the order of tens of microseconds [33]. The
control cycle is also bounded by the model accuracy since, as
observed in the previous section, the simplified temperature
model is sufficiently accurate only up to a duration of 1ms. Therefore, we choose T = 1ms in our experiments. We assign
21 clock frequencies between 0.5 and 1.5GHz, spaced 50MHz
apart. The three steps of the algorithm that solves the problem
described in Eqn. 5 are listed in Fig. 13. Since each core
solves the optimization problem independently, computing
$f^*_{k+1}$ requires finding the maximum element in a 21-element
array.
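The per-core search over the 21 candidate frequencies can be sketched as follows. This is an illustrative reading of Eqns. 1, 3 and 5 - 7 under assumed units (GHz, Kelvin, Watts, seconds), not the authors' implementation. The function names, the value of `alpha` supplied by the caller, and the fallback to the lowest frequency when no candidate satisfies Eqn. 7 are all assumptions.

```python
# Table 2 constants; units (GHz, K, W, s) are assumed, not stated in the text.
BETA, GAMMA, DELTA, EPSILON = -426.7e-3, 0.674e-3, 1.618e-3, -90.38e-3
A1, B1, C1, DT = 0.9998, 8.46, 37.0, 1e-3
T_MAX = 355.0  # Kelvin
# 21 candidate frequencies, 0.5 to 1.5 GHz in 50 MHz steps.
FREQS_GHZ = [round(0.5 + 0.05 * i, 2) for i in range(21)]


def predicted_temperature(f_ghz, temp_k, alpha):
    """Eqn. 1 for power at the candidate frequency, then Eqn. 3 for T_{k+1}."""
    power = (alpha * f_ghz ** 3 + BETA * f_ghz + GAMMA * temp_k
             + DELTA * f_ghz * temp_k + EPSILON)
    return A1 * temp_k + DT * (B1 * power + C1)


def choose_frequency(temp_k, ipc, alpha, r_weight, q_weight=1.0):
    """Return the candidate that maximizes Q*z^2 + R*IPC_k*f subject to z > 0."""
    # Assumed fallback: stay at the lowest frequency if nothing is feasible.
    best_f, best_cost = FREQS_GHZ[0], float("-inf")
    for f in FREQS_GHZ:
        z = T_MAX - predicted_temperature(f, temp_k, alpha)
        if z <= 0:  # violates Eqn. 7
            continue
        cost = q_weight * z * z + r_weight * ipc * f
        if cost > best_cost:
            best_f, best_cost = f, cost
    return best_f
```

With a very small R the temperature term dominates and the lowest frequency is selected; with a very large R the performance term dominates and the highest feasible frequency is selected, which matches the role of R described in the text.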
Figure 12: Behavior of the optimization cost.
The tuning parameter R influences all three variables: temperature, performance and leakage energy. In fact, R ∈
[R_min, R_max] such that for R < R_min, $f^*_{k+1} = \underline{f}$ and for R >
R_max, $f^*_{k+1} = \bar{f}$. Emphasizing temperature over performance
leads to lower leakage energy, whereas giving greater importance
to performance could potentially lead to wasted leakage energy. Therefore, it is essential to choose an appropriate value
in order to obtain the desired behavior. Fixing the value of
R is one approach. However, we observe that such a strategy (i) makes the solution application specific, (ii) requires
extensive, time-consuming offline analysis, and (iii) could
easily push the controller into saturation, where $f^*_{k+1}$ will
remain at either $\underline{f}$ or $\bar{f}$ for prolonged periods of time. In
order to adapt to dynamically varying application phases, we
allow R to re-calibrate itself periodically. We call this period
TR ≥ T. The pseudo code for the re-calibration is described
in Figure 14. The re-calibration step determines the
bounds for R, i.e. [R_min, R_max], and calculates the next value
as R = R_min + η(R_max − R_min), where η ∈ [0,1]. For the OoO
core under consideration, we set $\eta = \sqrt{\mathrm{IPC}_k / \mathrm{IPC}_{max}}$ with
IPC_max = Issue Width = 4. The IPC ratio heuristic is a means
to obtain information about the compute or memory boundedness of the application. The square root of the ratio is chosen
to push R towards R_max (and thus better performance).
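The interpolation step can be sketched as below. The function `r_bounds` is a swapped-in closed form rather than the iterative increase/decrease search of Figure 14: it solves directly for the values of R at which the first and last cost differences change sign. All function names are hypothetical, and `z` and `chi` are assumed to be lists of the predicted thermal headroom and performance for the candidate frequencies in ascending order.

```python
import math

IPC_MAX = 4.0  # issue width of the OoO core


def eta(ipc):
    """Square root of the IPC ratio, clipped to [0, 1]."""
    clipped = min(max(ipc, 0.0), IPC_MAX)
    return math.sqrt(clipped / IPC_MAX)


def recalibrated_r(r_min, r_max, ipc):
    """R = R_min + eta * (R_max - R_min)."""
    return r_min + eta(ipc) * (r_max - r_min)


def r_bounds(z, chi):
    """Closed-form stand-in for the search in Figure 14 (an assumption).

    R_min is where J(1) - J(0) changes sign and R_max is where
    J(last) - J(last - 1) changes sign, with J(i) = z[i]**2 + R * chi[i].
    """
    r_min = (z[0] ** 2 - z[1] ** 2) / (chi[1] - chi[0])
    r_max = (z[-2] ** 2 - z[-1] ** 2) / (chi[-1] - chi[-2])
    return r_min, r_max
```

For example, an IPC of 1 gives η = 0.5, which places R midway between the two bounds, whereas a linear ratio would have given 0.25. This is the push towards R_max that the square root is meant to provide.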
Figure 13: TRINITY Algorithm. (1) Inputs: T_MAX, R, f_k, P_k, T_k, χ_k. (2) Compute the cost for all feasible f_{k+1}, find f* and apply it for T seconds. (3) Re-calibrate R after TR seconds.
R_init = 1×10^(?)
J(i) = z_{k+1}(i)^2 + R_init * χ_{k+1}(i);  i = {0, 1, 2, ..., 20}
ΔJ_0 = J(1) − J(0),  ΔJ_20 = J(20) − J(19)
if (ΔJ_0 < 0, ΔJ_20 < 0)
    Increase R until ΔJ_0 > 0 & ΔJ_20 < 0
    R_min = R
    foo1(R_min, 1)
if (ΔJ_0 > 0, ΔJ_20 > 0)
    Decrease R until ΔJ_0 > 0 & ΔJ_20 < 0
    R_max = R
    foo1(R_max, 0)
foo1(R_new, flag) {
    if (flag == 1)
        Increase R_new until ΔJ_0 > 0 & ΔJ_20 > 0
        R_max = R_new
    if (flag == 0)
        Decrease R_new until ΔJ_0 < 0 & ΔJ_20 < 0
        R_min = R_new
}
R = R_init
Figure 14: Pseudo code for re-calibrating R.
5. RESULTS
In this section we evaluate TRINITY. We first describe the
simulation environment and list the benchmark applications
used. Next, we present a detailed evaluation of the proposed control
scheme. We implement a DVFS strategy similar
to the ondemand Linux CPU governor on the simulator and
compare it against TRINITY. We also present results obtained by fixing
the core frequencies to 0.5, 1.0 and 1.5GHz. Since TRINITY
attempts to balance performance, energy and temperature,
we use Energy Delay^2 Product (ED2P) along with temperature as the primary comparison metric. In what follows,
we also use Energy Delay Product (EDP), Energy Efficiency
(ops/Joule), performance (MIPS) and lifetime reliability measured as Mean Time To Failure (MTTF) to understand the
implications of using TRINITY.
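For reference, the comparison metrics named above follow their standard definitions, sketched here in Python. The function names and the sample numbers in the usage note are illustrative only and do not come from the experiments.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP = energy x delay."""
    return energy_j * delay_s


def energy_delay2_product(energy_j, delay_s):
    """ED2P = energy x delay^2; weights run-time more heavily than EDP."""
    return energy_j * delay_s ** 2


def ops_per_joule(operations, energy_j):
    """Energy efficiency."""
    return operations / energy_j


def mips(instructions, delay_s):
    """Millions of instructions per second."""
    return instructions / delay_s / 1e6
```

As a made-up illustration, a run that takes 2 s and 10 J and a run that takes 2.5 s and 8 J have the same EDP (20 J·s), but the slower run has the worse ED2P (50 J·s² against 40 J·s²). This is why ED2P penalizes a controller that gives up too much performance.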
5.1 Experimental Framework
The physical layout is shown in Fig. 15, with the dimensions listed in Table 3; Figure 16 shows the functional
diagram of the 3D stacked architecture. The 3D stacked architecture consists of 16 Out-of-Order (OoO) cores with two
levels of cache hierarchy [34], interfacing with an HMC-style [5]
4GB DRAM via an interconnection network. The simulator is
also equipped with a power and thermal estimation framework
called Kitfox [?]. Kitfox internally models power consumption based on McPAT [35], and the thermal calculations are
done using 3D-ICE [24], both scaled to 16nm. The front-end
for the cycle level simulator [?] is a multicore emulator called
Qsim [36] that boots a Linux kernel and executes the applications
of interest. The x86 instruction streams thus generated are fed
into the OoO core timing model. We use DRAMSim2 [37]
as the DRAM timing simulator, whose voltage and timing
numbers are modified based on the work in [38].
Table 3: Simulation framework parameters. Technology node is 16nm.
Component                  Parameters and Values
Processor                  Out-of-Order, 6-stage pipeline, 4-wide issue/commit, 0.5-1.5GHz
L1 Cache per core (16KB)   Private, 8-way, LRU replacement, 32 MSHRs, 64B lines, 1-cyc hit & lookup time
L2 Cache per bank (2MB)    16 banks in total, shared, 8-way, LRU replacement, 128 MSHRs, 64B lines, 24-cyc hit & lookup time
Network (1GHz)             4×4 torus ring, 6 port router, baseline x-y routing
Memory Controller          16 MCs in total, rank then bank round robin, close page, Addr-map-chan:row:bank:rank:col
DRAM config per vault      256MB, 1-channel, 4 ranks, 2 banks per rank, 64 bit bus @ 1600MHz
Heat Sink                  Conventional heat sink, heat transfer co-eff = 2.8×10^-8 W/(µm^2 K)
Per-Layer                  TOP LAYER = BEOL: 25µm; SOURCE LAYER = SILICON: 10µm; BOTTOM LAYER = SILICON: 25µm
Figure 15: Physical layout of the 3D stacked architecture. (The drawing shows the L2 cache layer, the core layer and the stacked DRAM layers; a 4×4 core layout with cores numbered 0 to 15 on an 8.4mm × 8.4mm die; a 2.1mm × 2.1mm tile labeled with core i, L2 cache bank i, voltage island i and a 6 port router; and a vault with a memory controller (MC), ranks 0 to 3 on DRAM layers 0 to 3, and banks 0 and 1.)
Figure 16: Functional description of the 3D stack. (Blocks: L2 cache, OoO cores and memory.)
Various microarchitectural design choices for placement of
cores, cache and DRAM exist as described in [39]. The archi-
tecture described in this work is similar to the one proposed in
[40]. We do however note that our microarchitectural design
choice does not restrict the scope of the problem definition.
The thermal and performance characterization insights can
be extended to other 3D architectures as well.
5.2 Benchmarks
We evaluate the proposed optimization technique on 6
benchmark applications from the PARSEC, SPLASH2x [29]
and GraphBig [30] suites. Specifically, we choose blackscholes, streamcluster, barnes (PARSEC and SPLASH2x)
and kcore, pagerank and tc (GraphBig). Each of the benchmark applications is executed with 16 threads. The GraphBig applications stress the memory, whereas the PARSEC and
SPLASH2x applications stress the compute units, thus giving a range of
application behavior.
5.3 Analyzing TRINITY Performance
Energy efficiency and ED2P results are plotted in Figures
18 and 17, respectively, comparing TRINITY against the
three fixed frequencies and ondemand. The right y-axis in
both the figures represents the spatial average temperature
of the core layer. The control cycle T is set to 1ms. Re-
calibrating R every control cycle increases the amount of
computations performed by the controller and hence we set
TR = 5ms.
The trend of ED2P is not the same for every benchmark.
Consequently, the strategy to balance performance, energy
and temperature (PET) should be different. In general, for
compute intensive benchmarks (blackscholes, barnes and
streamcluster), the highest clock frequency delivers the
best performance and also results in the highest temperature.
TRINITY on the other hand, trades performance for bene-
fits in temperature (lower by 8 Kelvin w.r.t 1.5GHz and 6
Kelvin w.r.t ondemand). In fact, the reduction in ED2P for
the compute intensive benchmarks is due to the reduction
in performance. Energy efficiency results however reveal
that TRINITY and ondemand perform equivalently (< 5%
difference). We can therefore conclude that in the process of
balancing PET for compute intensive benchmarks, TRINITY
achieves similar energy efficiency as ondemand but keeps the
temperature 6 Kelvin lower.
Analyzing memory intensive benchmarks (kcore, pager-
ank and tc), there is no appreciable improvement in perfor-
mance (MIPS). For example, average MIPS for kcore at 0.5,
1.0 and 1.5GHz is 4360.3, 5074.2 and 5265.2, respectively.
Possessing a priori knowledge that the application to be executed is memory intensive could lead to choosing the lowest
clock frequency as a possible strategy. While this certainly
keeps the entire 3D stack at a lower temperature, ED2P suffers significantly. In these situations, TRINITY tunes R in
such a way that the lower half of the clock frequencies (0.5 -
1.0GHz) are chosen in the memory intensive phases. For all
three benchmarks, the temperature of the core layer is at most
as high as that at 1.0GHz. Performance and ED2P, while certainly
better than at 0.5GHz, are 5% and 13% worse than at 1.0GHz,
respectively. On closer analysis of the controller data, we
observe that predicting performance for the upcoming con-
trol cycle based on the previous control cycle, sometimes
causes TRINITY to choose a clock frequency that does not
maximize performance for the EHC. These mispredictions
are mitigated to a certain extent by re-calibrating R every TR
seconds but it is one of the limitations of TRINITY.
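The weak frequency scaling of kcore can be made concrete with a small calculation on the MIPS values quoted above. The helper name `scaling` is hypothetical; the numbers are those reported in the text.

```python
# Average MIPS for kcore at each fixed frequency (GHz), as quoted in the text.
KCORE_MIPS = {0.5: 4360.3, 1.0: 5074.2, 1.5: 5265.2}


def scaling(mips_by_freq, f_low, f_high):
    """Return (speedup, speedup relative to the frequency ratio)."""
    speedup = mips_by_freq[f_high] / mips_by_freq[f_low]
    return speedup, speedup / (f_high / f_low)


speedup, relative = scaling(KCORE_MIPS, 0.5, 1.5)
print(f"speedup {speedup:.2f}x for a 3x frequency increase ({relative:.0%} of ideal)")
```

Tripling the clock frequency yields only about a 1.21x gain in MIPS for kcore, which is why higher frequencies mostly add heat for this workload.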
Energy efficiency, similar to ED2P, has different trends
for different benchmarks as shown in Figure 18. While it in-
creases for barnes as clock frequency is increased, it reduces
for kcore. A detailed breakdown of different comparison
metrics is shown in Figure 20. Note that trading off per-
formance (MIPS), affects EDP and ED2P directly. Energy
efficiency of TRINITY however, is very similar to ondemand
(3% better). This is on account of reduced temperature.
To understand the source of temperature reduction, we plot
the average power dissipated at individual layers in Figure 19.
The x-axis represents different DVFS options for each bench-
mark and the y-axis shows average power in Watts. Total
power for each DVFS setting is broken down into dynamic
and leakage power for the core, L2 Cache and DRAM layers. This distribution of power helps understand the primary
source of power consumption for each benchmark application. blackscholes and barnes, both compute intensive,
consume the majority of their power in the core layer. kcore and
pagerank, being memory intensive, consume greater power
in the L2 Cache layer, specifically dynamic power in the L2
Cache. Additionally, DRAM dynamic power is higher as
well due to increased L2 Cache misses. streamcluster, unlike blackscholes and barnes, shows increasing L2 Cache
power for increasing frequencies. Finally, although tc consumes approximately the same amount of power in both the core
and cache layers, analyzing the average MIPS for the three fixed
frequencies reveals that tc is indeed memory intensive
(2741.3, 3633.8 and 4026.7 MIPS at 0.5, 1.0 and 1.5GHz,
respectively).
As seen in Fig. 19, the bulk of the power reduction (conse-
quently temperature) comes from reducing dynamic power
consumption of the core and cache layers. This is intuitive
since DVFS implemented by TRINITY directly affects only
the core and the corresponding L2 Cache bank. As compared
to ondemand, the dynamic power of the core and cache layers
reduces by 38.5% and 28.2%, respectively. Furthermore, with
respect to ondemand, TRINITY is also able to reduce the leakage power of the core and cache layers by 18% each. We
attribute the power reduction to the on-line adaptation of R.
In memory intensive parts of the application, IPC is low (< 1) and η is correspondingly small,
thus guiding the controller to choose the lower end of the
clock frequencies. In compute intensive regions, IPC is high
(> 2) and η is correspondingly large, allowing for higher clock frequencies to be chosen.
5.4 Impact on Lifetime Reliability
Changes in operating temperatures and voltages incur sig-
nificant impacts on reliability. Two dominant reliability mech-
anisms, electromigration (EM) and time-dependent dielectric
breakdown (TDDB), are used to evaluate the reliability im-
plications of TRINITY, compared to that of other execution
modes. We used the reliability models and parameters from
the work in [41] and references therein. Equations 8 and 9
show the reliability models of EM and TDDB, expressed as
mean-time-to-failure (MTTF).
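$$\mathrm{MTTF}_{EM} = A_{EM} \times \frac{1}{t_{act}} \times V^{-n} \times e^{\frac{E_a}{kT}} \qquad (8)$$
$$\mathrm{MTTF}_{TDDB} = A_{TDDB} \times \frac{1}{t_{act}} \times V^{-c(a+bT)} \times e^{\frac{x + y/T + zT}{kT}} \qquad (9)$$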
In the reliability equations, V and T are the operating voltage
and temperature. t_act (0 ≤ t_act ≤ 1) is the active-state residency
obtained from the execution time of each workload. For
instance, t_act = 0.5 means that a workload utilizes the computing system for 50% of the time. It is assumed that the system
can be ideally power-gated for the remaining period and thus
incurs no reliability impact; the system may be used to process
other workloads, but the resulting reliability impacts are attributed
to those workloads. k is Boltzmann's constant, and the other
parameters are model-dependent scaling parameters [41]. As
shown in Eqn. 8, EM is primarily accelerated by temperature,
and voltage has a secondary effect. In fact, a few degrees of
average temperature change throughout the lifetime can easily produce several months to years of variation in EM lifetime. On the
other hand, TDDB is more sensitive to voltage changes, but
temperature also has a non-negligible effect. The results show
that TRINITY achieves 59%, 88%, and 6% better reliability
than ondemand and the two fixed frequency executions at
1.5GHz and 1.0GHz, respectively. However, it trades a 24%
reduction in reliability for improved performance when
compared with the 0.5GHz fixed frequency execution.
Figure 17: Controller performance compared against fixed frequencies and ondemand. Controller Parameters: T = 1ms
and TR = 5ms. Left y-axis and right y-axis units are ED2P and Kelvin, respectively. (One panel each for barnes, blackscholes, streamcluster, kcore, pagerank and tc; series: ED2P and Temperature.)
Figure 18: Controller performance compared against fixed frequencies and ondemand. Controller Parameters: T = 1ms
and TR = 5ms. Left y-axis and right y-axis units are ops/Joule and Kelvin, respectively. (Groups: barnes, blackscholes, streamcluster, kcore, pagerank and tc; series: ops/Joule and Temperature.)
Figure 19: Average power consumption by TRINITY compared against fixed frequencies and ondemand. Controller
Parameters OPT1: T = 1ms and TR = 5ms. (Power in W for barnes, blackscholes, streamcluster, kcore, pagerank, tc and AVERAGE; stacked series: CORE_Dyn, CORE_Leak, L2CACHE_Dyn, L2CACHE_Leak, DRAM_Dyn, DRAM_Leak.)
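To illustrate how sensitive Eqn. 8 is to temperature, the sketch below computes the ratio of EM lifetimes between two operating points, so that the scaling constant A_EM cancels. The activation energy `e_a_ev` and the exponent `n` are placeholder values chosen for illustration; the parameters actually used come from [41] and are not given in the text. The function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def em_mttf_ratio(temp_a_k, temp_b_k, volt_a=1.0, volt_b=1.0,
                  t_act_a=1.0, t_act_b=1.0, n=1.0, e_a_ev=0.9):
    """MTTF_EM at operating point a divided by MTTF_EM at point b (Eqn. 8).

    n and e_a_ev are placeholders, not the values from the paper.
    """
    residency = t_act_b / t_act_a
    voltage = (volt_a / volt_b) ** (-n)
    thermal = math.exp(e_a_ev / BOLTZMANN_EV_PER_K
                       * (1.0 / temp_a_k - 1.0 / temp_b_k))
    return residency * voltage * thermal


# A core running 6 K cooler at the same voltage and residency:
print(em_mttf_ratio(324.0, 330.0))
```

A ratio above 1 means that operating point a outlives point b. Because the exponent depends on 1/T, even a few Kelvin shift the ratio noticeably, consistent with the observation above that EM is primarily accelerated by temperature.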
5.5 Effect of TRINITY Parameter Variations
As mentioned in the previous sections, one of our goals
with TRINITY is to design it with practical implementation
in mind. Simplifying the model and reducing computational
complexity reduces the number of parameters that can be
manually tuned. We study the sensitivity of TRINITY to
variations in T and TR, which are the only manually tuned
parameters. Reducing the control cycle duration and TR has
the benefit of capturing rapidly varying application phases,
but it also increases the amount of controller computation
per unit time. We compare three different cases: (1) OPT1
(T = 1ms, TR = 5ms), (2) OPT2 (T = 1ms, TR = 10ms),
and (3) OPT3 (T = 0.5ms, TR = 5ms).
Figure 20: TRINITY compared against fixed frequencies and ondemand. Controller Parameters OPT1: T = 1ms and
TR = 5ms. (Panels: ops/Joule, EDP, ED2P, ops/s and Reliability as % difference, and Temperature as difference in Kelvin, each against 0.5GHz, 1.0GHz, 1.5GHz and OnD for the PARSEC and GRAPH groups. ops/Joule, ops/s, Reliability, Temperature: higher is better; EDP, ED2P: lower is better.)
On the y-axis in Figure 21 we plot the % difference be-
tween TRINITY, ondemand and the three fixed frequencies
over all six benchmark applications tested. We also compare
temperatures of the three controller variations in the same
figure. A notable feature from Fig. 21 is that with respect to
1.0GHz, TRINITY achieves 30%, 5% and 17% better ED2P
at approximately the same temperature. A second feature is that,
as compared to 1.5GHz and ondemand, TRINITY trades
performance for temperature; on average, TRINITY keeps
the cores at least 6 Kelvin cooler. The large improvement
in EDP and ED2P in comparison with 0.5GHz is due to the
significant reduction in the total run-time of the application.
The third feature concurs with intuition: increasing TR
implies tuning R less frequently, thereby making the controller less responsive to changes in application phases. This
increases the effect of performance mispredictions and thus
degrades EDP, ED2P and ops/Joule. Consequently, the average
temperature for OPT2 is higher than for OPT1. Finally, decreasing T (OPT3) does not show appreciable improvement in
EDP or ED2P as compared to OPT1.
Akin to any practical thermal/power/energy management
approach, TRINITY too faces the challenge of modeling pre-
cision vs. controller performance. Applications that would
benefit from TRINITY are those that have a mixture of com-
pute and memory bound phases because of the ability to adapt
itself at run-time to maximally utilize the EHC. However, if
those phases are shorter than the control interval T, they
might end up being overlooked. TRINITY works particu-
larly well for memory intensive applications like GraphBig
because at the same EDP, the average temperature and volt-
age is lower than ondemand which improves MTTF by 68%.
If the objective is to maximize performance alone, we get
limited returns in EDP and ED2P from TRINITY. Figure 21
shows that as compared to 1.5GHz, EDP, ED2P and ops/Joule
are the lowest for TRINITY.
5.6 Effect of Starting Temperature
The analysis presented in Section 5.3 considers a starting
temperature of 300K for every experiment. This however
does not expose the secondary effects due to thermal throt-
tling. To simulate practical runtime conditions, in this section
we initialize the starting temperature to 323K. The corre-
sponding EDP results are plotted in Figure 22 comparing
TRINITY against the ondemand heuristic.
The trend of EDP is not the same for every benchmark.
Consequently, the strategy to balance performance, energy
and temperature should be different. For compute inten-
sive workloads like blackscholes and barnes, the highest
clock frequency (1.5GHz) delivers the best performance but
also results in the highest temperature. This causes thermal
throttling which significantly reduces performance. TRIN-
ITY on the other hand, tries to trade performance for benefits
in temperature, 4 K on average with respect to ondemand.
Energy efficiency results however reveal that TRINITY and
ondemand perform equally well; TRINITY is 6.4% better,
arguably within simulation error bounds. The implication is
indeed in line with the definition of EMT. TRINITY chooses
the clock frequencies such that same or better performance
can be achieved at a much lower temperature.
Analyzing memory intensive benchmarks (kcore, pager-
ank, connectedcomponent and tc), there is no appreciable
improvement in performance (MIPS) as core frequencies are
increased. For example, average MIPS for kcore at 0.5, 1.0
and 1.5GHz is 4360.3, 5074.2 and 5265.2, respectively. Possessing a priori knowledge that the application to be executed
is memory intensive could lead to choosing a lower clock
frequency as a possible strategy. While this certainly keeps
the entire 3D stack at a lower temperature, EDP can suffer
considerably. Although the average power is small, the ap-
plication takes much longer to complete. In these situations,
TRINITY tunes R in such a way that the lower half of the
clock frequencies (0.5 - 1.0GHz) are chosen in the memory
intensive phases. Except for connectedcomponent, the temperature of the core layer for the remaining three workloads is
lower by about 6 K. For connectedcomponent, ondemand
and TRINITY perform equally well, and no appreciable
temperature or EDP difference is observed.
Figure 21: EDP, ED2P, temperature and energy efficiency comparison for different TRINITY parameters. (Panels: EDP and ED2P as % difference, temperature difference in Kelvin, and ops/Joule as % difference, for OPT1, OPT2 and OPT3; series: TRINITY vs. 0.5GHz, 1.0GHz, 1.5GHz and OnD.)
Again, to understand the source of temperature reduction,
the average power dissipated at individual layers is plotted
in Figure 23. The x-axis represents different DVFS options
for each benchmark and the y-axis shows average power in
Watts. Total power for each DVFS setting is broken down
into dynamic and leakage power for the core, L2 Cache and
DRAM layers. This distribution of power helps understand
the primary source of power consumption for each benchmark
application. blackscholes and barnes, both compute in-
tensive, consume majority of the power in the core layer.
kcore, pagerank and tc being memory intensive consume
greater power in the L2 Cache layer, specifically dynamic
power. Additionally, DRAM dynamic power is higher as well
due to increased L2 Cache misses. connectedcomponent,
unlike other memory bound workloads shows much higher
power consumed in the core die. However, power consumed
in the L2 Cache and DRAM is larger as compared to compute
intensive benchmarks.
As seen in Fig. 23, the bulk of the power reduction (con-
sequently reduction in temperature) comes from reducing
dynamic power consumption of the core and cache layers.
This is intuitive since DVFS implemented by TRINITY di-
rectly affects only the core and the corresponding L2 Cache
bank. As compared to ondemand, the dynamic power of the core
and cache layers reduces by 11.7% and 18%, respectively.
Furthermore, with respect to ondemand, TRINITY is also
able to reduce the leakage power of the core and cache layers by
15.5% and 16.5%, respectively. The power reduction can be
attributed to the on-line adaptation of R. In memory intensive parts of the application, IPC is low (< 1) and η is correspondingly small, thus guiding the
controller to choose the lower end of the clock frequencies.
In compute intensive regions, IPC is high (> 2) and η is correspondingly large, allowing for
higher clock frequencies to be chosen.
TRINITY is designed so that it can be implemented on a
real physical system. Simplifying the model and reducing
computational complexity reduces the number of parameters
that can be manually tuned. In this section, the sensitivity of
TRINITY to variations in T and TR, which are the only man-
ually tuned parameters, is discussed. Reducing the control
cycle duration and TR has the benefit of capturing rapidly
varying application phases. But it could also increase the
amount of controller computations per unit time. Two cases
are compared here: (1) OPT1 (T= 1ms, TR = 1ms) and (2)
OPT2 (T= 1ms, TR = 5ms).
The y-axis in Figure 24b represents the absolute differ-
ence between TRINITY and ondemand whereas the y-axis
in Figures 24a, 24d and 24c represent % difference. Overall,
OPT2 fares slightly worse than OPT1 when using the metrics
EDP, ED2P and ops/Joule. OPT2 keeps the temperature of
the cores about a degree cooler than OPT1 by trading off
performance. This aspect is clearly observed in Figures 24a
and 24b.
The notable feature concurs with intuition: increasing TR
implies tuning R less frequently, thereby making the controller less responsive to changes in application phases. This
increases the effect of performance mispredictions and thus
degrades EDP, ED2P and ops/Joule. Consequently, the average
temperature for OPT2 is higher than for OPT1.
Akin to any practical thermal/power/energy management
approach, TRINITY too faces the challenge of modeling pre-
cision vs. controller performance. Applications that would
benefit from TRINITY are those that have a mixture of com-
pute and memory bound phases because of the ability to adapt
itself at run-time to maximally utilize the EHC. However, if
those phases are shorter than the control interval T, they
might end up being overlooked. TRINITY works particularly
well for memory intensive applications like GraphBig be-
cause at the same EDP, the average temperature and voltage
is lower than ondemand which improves MTTF by 10%.
6. RELATED WORK
Dynamic Thermal Management (DTM) of multicore pro-
cessors (2D and 3D) has evolved from heuristic based ap-
proaches to formal feedback control based techniques. They
can be divided into the following categories: (i) Hardware
level and (ii) Software level. At the hardware level, DVFS
of independent voltage islands has been explored in great
detail, for example [20, 25, 28, 31, 42, 43, 44, 45, 46, 47,
48]. Recently, [49] discussed use of thermal TSVs to extract
heat from the different layers in an architecture similar to the
one described in this work. They boost the performance of
applications by exploiting the improved cooling efficiency.
Some other approaches to mitigate thermal effects are instruc-
tion fetch throttling, clock gating [50, 51] and moving the
hottest datapaths closer to the heat sink (thermal herding)
[39]. Finally, at the software level, techniques include migrating threads
from hot cores to cool cores [52, 53], data compression at
the memory controller [16], two-level prefetching with throttling of off-chip memory links [15], dynamic page allocation
[17], and data block reallocation with heterogeneous memory
architectures [19].
Figure 22: Controller performance compared against the ondemand heuristic. Controller Parameters: T = 1ms and TR = 1ms.
Left y-axis and right y-axis units are EDP and Kelvin, respectively. (Groups: ba, bs, cc, kc, pr, tc and Avg; bars for OnD and TRINITY; series: EDP and Temperature.)
Figure 23: Average power consumption by TRINITY compared against the ondemand heuristic. Controller Parameters: T = 1ms
and TR = 1ms. (Power in W for ba, bs, cc, kc, pr, tc and Avg; stacked series: CORE_Dyn, CORE_Leak, L2CACHE_Dyn, L2CACHE_Leak, DRAM_Dyn, DRAM_Leak.)
All the works listed in the previous paragraph, except [28]
and [47], treat core or die temperature as an upper limit,
i.e., a constraint. Some policies are triggered only under
emergencies [50, 51], while the rest optimize a cost while
staying within the thermal threshold. TRINITY does not wait
until an emergency is triggered. On the contrary, it tries to
avoid such scenarios by trading off some performance.
References [31, 43, 45] formulate an MPC based optimal
control problem, while [46, 48] present a numerical
optimization based approach. The objectives of the problems
considered vary: minimizing power [43, 46], maximizing
performance [25, 31, 45] and maximizing power efficiency
[48]. In contrast to optimization, references [42] and [28]
design closed loop feedback controllers to maintain a fixed
temperature. TRINITY is similar to the optimization based
DVFS approaches, but it handles temperature differently and
is computationally cheap to implement. Instead of treating
temperature as a constraint, TRINITY optimizes a cost based
on it, which lets it balance parameters that were otherwise
dealt with in isolation. The MPC and optimization based
approaches referenced here are computationally intensive due
to their large model dimensions and do not consider the
dependence of leakage power on temperature. Although [31]
proposes a distributed
Figure 24: TRINITY Parameter Variation. Each panel compares ondemand (OnD) against two TRINITY parameter settings (OPT1 and OPT2), per benchmark and on average (Avg).
(a) Comparison of EDP (%) against ondemand for different TRINITY parameters. Lower is better.
(b) Comparison of Temperature difference (K) against ondemand for different TRINITY parameters.
(c) Comparison of ED2P (%) against ondemand for different TRINITY parameters. Lower is better.
(d) Comparison of ops/Joule (%) against ondemand for different TRINITY parameters. Higher is better.
approach to MPC in an attempt to make it practically feasible,
their policy is ultimately heuristic based (also noted by [48]).
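The distinction drawn above, between treating temperature as a constraint and including it in the cost, can be illustrated with a minimal Python sketch. The operating points, the linear penalty on temperature above a reference and the weight are all assumptions made purely for illustration; they are not TRINITY's actual cost function, model or measured data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OperatingPoint:
    """Hypothetical predictions for one voltage-frequency level."""
    freq_ghz: float
    perf: float    # predicted throughput, arbitrary units (assumed)
    temp_k: float  # predicted core temperature in Kelvin (assumed)


def pick_constrained(points: List[OperatingPoint],
                     t_max_k: float) -> Optional[OperatingPoint]:
    """Temperature as a constraint: fastest point not exceeding t_max_k."""
    feasible = [p for p in points if p.temp_k <= t_max_k]
    return max(feasible, key=lambda p: p.perf) if feasible else None


def pick_by_cost(points: List[OperatingPoint],
                 t_ref_k: float, weight: float) -> OperatingPoint:
    """Temperature in the cost: penalize heat above t_ref_k (assumed form)."""
    def cost(p: OperatingPoint) -> float:
        return -p.perf + weight * max(0.0, p.temp_k - t_ref_k)
    return min(points, key=cost)


# Illustrative values only: performance saturates between 2 and 3 GHz
# while temperature keeps rising.
POINTS = [
    OperatingPoint(1.0, perf=1.00, temp_k=318.0),
    OperatingPoint(2.0, perf=1.80, temp_k=330.0),
    OperatingPoint(3.0, perf=1.90, temp_k=348.0),
]

constrained = pick_constrained(POINTS, t_max_k=350.0)
balanced = pick_by_cost(POINTS, t_ref_k=318.0, weight=0.02)
print(constrained.freq_ghz, balanced.freq_ghz)
```

With these assumed numbers, the constraint-based policy runs at the highest level because it is still under the threshold, whereas the cost-based policy settles on the middle level, where the small remaining performance gain no longer justifies the extra heat.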
The work in [47] claims that minimizing thermal impact
extends the sustainability of desired Quality-of-Service levels
on mobile devices. In a similar vein, our work points to
extending the lifetime reliability of 3D stacks by keeping
temperatures lower. Unlike prior works employing DVFS,
we present an analysis of lifetime reliability and demonstrate
the additional benefits obtained by efficiently managing the
heat capacity.
7. CONCLUSIONS
This work presents an approach to the coordinated control
of performance, energy efficiency and temperature on
3D processor-memory stacks. It introduces the concept of
effective heat capacity as a thermal resource to be managed.
Through a comprehensive simulation-based characterization
of intra- and inter-die thermal coupling effects, the ability
to maximally utilize the effective heat capacity is illustrated.
An on-line DVFS controller called TRINITY is developed
for this purpose. Unlike prior research efforts, which (i) consider
power, performance and temperature in isolation or in pairs,
(ii) do not explicitly model static power, or (iii) are heuristic
based, this work acknowledges the complex interplay between
performance, energy, temperature, microarchitectural
parameters and package physical constraints. An analysis of
EDP, ED2P, energy efficiency, temperature and lifetime
reliability is presented, demonstrating the benefits of
intelligently managing temperature as a resource and not just a
constraint.
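For reference, the efficiency metrics named above can be computed as in the following minimal Python sketch. It uses the standard definitions (EDP = energy x delay, ED2P = energy x delay^2, ops/Joule = operations / energy); the function names, the percent-change normalization against a baseline and the sample numbers are assumptions for illustration and are not taken from the evaluation.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-Delay Product: energy (J) times delay (s)."""
    return energy_j * delay_s


def ed2p(energy_j: float, delay_s: float) -> float:
    """Energy-Delay^2 Product: weights delay more heavily than EDP."""
    return energy_j * delay_s ** 2


def ops_per_joule(ops: float, energy_j: float) -> float:
    """Energy efficiency: operations completed per Joule."""
    return ops / energy_j


def percent_change(value: float, baseline: float) -> float:
    """Relative change against a baseline, in percent (assumed normalization)."""
    return 100.0 * (value - baseline) / baseline


# Made-up numbers: the candidate saves energy but runs longer.
base_energy, base_delay = 10.0, 2.0
cand_energy, cand_delay = 8.0, 2.5

edp_change = percent_change(edp(cand_energy, cand_delay),
                            edp(base_energy, base_delay))
ed2p_change = percent_change(ed2p(cand_energy, cand_delay),
                             ed2p(base_energy, base_delay))
print(edp_change, ed2p_change)
```

In this made-up case the two configurations tie on EDP while the slower one is worse on ED2P, which shows why the choice of metric matters when performance is traded for lower energy or temperature.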
8. ACKNOWLEDGEMENTS
This work was supported in part by the National Science
Foundation under Grant CNS 0855110 and the Oak Ridge
National Laboratory.
9. REFERENCES
[1] M. Motoyoshi, “Through-silicon via (tsv),” Proceedings of the IEEE,
vol. 97, no. 1, pp. 43–48, 2009.
[2] “Ddr3 sdram standard.”
https://www.jedec.org/standards-documents/docs/jesd-79-3d, Jul 2012.
[3] “Ddr4 sdram standard.”
https://www.jedec.org/standards-documents/docs/jesd79-4a, Jun 2017.
[4] “High bandwidth memory (hbm) dram.”
https://www.jedec.org/standards-documents/docs/jesd235a, Nov 2015.
[5] “Hybrid memory cube consortium.”
http://www.hybridmemorycube.org/.
[6] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-d ics: A
novel chip design for improving deep-submicrometer interconnect
performance and systems-on-chip integration,” Proceedings of the
IEEE, vol. 89, no. 5, pp. 602–633, 2001.
[7] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, “Bridging the
processor-memory performance gap with 3d ic technology,” IEEE
Design & Test of Computers, vol. 22, no. 6, pp. 556–564, 2005.
[8] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H.
Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso, et al., “Die
stacking (3d) microarchitecture,” in Proceedings of the 39th Annual
IEEE/ACM International Symposium on Microarchitecture,
pp. 469–479, IEEE Computer Society, 2006.
[9] G. H. Loh, “3d-stacked memory architectures for multi-core
processors,” in Computer Architecture, 2008. ISCA’08. 35th
International Symposium on, pp. 453–464, IEEE, 2008.
[10] D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee, “An
optimized 3d-stacked memory architecture by exploiting excessive,
high-density tsv bandwidth,” in High Performance Computer
Architecture (HPCA), 2010 IEEE 16th International Symposium on,
pp. 1–12, IEEE, 2010.
[11] S. Borkar, “3d integration for energy efficient system design,” in
Proceedings of the 48th Design Automation Conference, pp. 214–219,
ACM, 2011.
[12] P. Emma, A. Buyuktosunoglu, M. Healy, K. Kailas, V. Puente, R. Yu,
A. Hartstein, P. Bose, and J. Moreno, “3d stacking of
high-performance processors,” in High Performance Computer
Architecture (HPCA), 2014 IEEE 20th International Symposium on,
pp. 500–511, IEEE, 2014.
[13] Y. Eckert, N. Jayasena, and G. H. Loh, “Thermal feasibility of
die-stacked processing in memory,” 2014.
[14] D. Milojevic, S. Idgunji, D. Jevdjic, E. Ozer, P. Lotfi-Kamran,
A. Panteli, A. Prodromou, C. Nicopoulos, D. Hardy, B. Falsafi, et al.,
“Thermal characterization of cloud workloads on a power-efficient
server-on-chip,” in Computer Design (ICCD), 2012 IEEE 30th
International Conference on, pp. 175–182, IEEE, 2012.
[15] J. Ahn, S. Yoo, and K. Choi, “Dynamic power management of off-chip
links for hybrid memory cubes,” in Proceedings of the 51st Annual
Design Automation Conference, pp. 1–6, ACM, 2014.
[16] M. J. Khurshid and M. Lipasti, “Data compression for thermal
mitigation in the hybrid memory cube,” in 2013 IEEE 31st
International Conference on Computer Design (ICCD), pp. 185–192,
IEEE, 2013.
[17] W.-H. Lo, K.-z. Liang, and T. Hwang, “Thermal-aware dynamic page
allocation policy by future access patterns for hybrid memory cube
(hmc),” in 2016 Design, Automation & Test in Europe Conference &
Exhibition (DATE), pp. 1084–1089, IEEE, 2016.
[18] D. Zhao, H. Homayoun, and A. V. Veidenbaum, “Temperature aware
thread migration in 3d architecture with stacked dram,” in Quality
Electronic Design (ISQED), 2013 14th International Symposium on,
pp. 80–87, IEEE, 2013.
[19] L.-N. Tran, F. J. Kurdahi, A. M. Eltawil, and H. Homayoun,
“Heterogeneous memory management for 3d-dram and external dram
with qos,” in Design Automation Conference (ASP-DAC), 2013 18th
Asia and South Pacific, pp. 663–668, IEEE, 2013.
[20] J. Meng, K. Kawakami, and A. K. Coskun, “Optimizing energy
efficiency of 3-d multicore systems with stacked dram under power
and thermal constraints,” in Proceedings of the 49th Annual Design
Automation Conference, pp. 648–655, ACM, 2012.
[21] Y.-J. Chen, C.-L. Yang, P.-S. Lin, and Y.-C. Lu,
“Thermal/performance characterization of cmps with 3d-stacked drams
under synergistic voltage-frequency control of cores and drams,” in
Proceedings of the 2015 Conference on research in adaptive and
convergent systems, pp. 430–436, ACM, 2015.
[22] K. Kang, J. Jung, S. Yoo, and C.-M. Kyung, “Maximizing throughput
of temperature-constrained multi-core systems with 3d-stacked cache
memory,” in Quality Electronic Design (ISQED), 2011 12th
International Symposium on, pp. 1–6, IEEE, 2011.
[23] R. Zhang, M. R. Stan, and K. Skadron, “Hotspot 6.0: Validation,
acceleration and extension,” University of Virginia, Tech. Rep, 2015.
[24] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and
D. Atienza, “3d-ice: Fast compact transient thermal modeling for 3d
ics with inter-tier liquid cooling,” in Proceedings of the International
Conference on Computer-Aided Design, pp. 463–470, IEEE Press,
2010.
[25] C. Zhu, Z. Gu, L. Shang, R. P. Dick, and R. Joseph,
“Three-dimensional chip-multiprocessor run-time thermal
management,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, no. 8, pp. 1479–1492, 2008.
[26] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and
D. Rajwan, “Power-management architecture of the intel
microarchitecture code-named sandy bridge,” IEEE Micro, vol. 32,
no. 2, pp. 20–27, 2012.
[27] I. Paul, S. Manne, M. Arora, W. L. Bircher, and S. Yalamanchili,
“Cooperative boosting: needy versus greedy power management,” in
ACM SIGARCH Computer Architecture News, vol. 41, pp. 285–296,
ACM, 2013.
[28] K. Rao, W. Song, S. Yalamanchili, and Y. Wardi, “Temperature
regulation in multicore processors using adjustable-gain integral
controllers,” in Control Applications (CCA), 2015 IEEE Conference
on, pp. 810–815, IEEE, 2015.
[29] C. Bienia, S. Kumar, and K. Li, “Parsec vs. splash-2: A quantitative
comparison of two multithreaded benchmark suites on
chip-multiprocessors,” in Workload Characterization, 2008. IISWC
2008. IEEE International Symposium on, pp. 47–56, IEEE, 2008.
[30] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, “Graphbig:
understanding graph computing in the context of industrial solutions,”
in High Performance Computing, Networking, Storage and Analysis,
2015 SC-International Conference for, pp. 1–12, IEEE, 2015.
[31] A. Bartolini, M. Cacciari, A. Tilli, and L. Benini, “Thermal and energy
management of high-performance multicores: Distributed and
self-calibrating model-predictive controller,” IEEE Transactions on
Parallel and Distributed Systems, vol. 24, no. 1, pp. 170–183, 2013.
[32] Y. Han, I. Koren, and C. M. Krishna, “Tilts: A fast architectural-level
transient thermal simulation method,” Journal of Low Power
Electronics, vol. 3, no. 1, pp. 13–21, 2007.
[33] A. Mazouz, A. Laurent, B. Pradelle, and W. Jalby, “Evaluation of cpu
frequency transition latency,” Computer Science-Research and
Development, vol. 29, no. 3-4, pp. 187–195, 2014.
[34] J. G. Beu, M. C. Rosier, and T. M. Conte, “Manager-client pairing: a
framework for implementing coherence hierarchies,” in Proceedings
of the 44th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 226–236, ACM, 2011.
[35] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and
N. P. Jouppi, “Mcpat: an integrated power, area, and timing modeling
framework for multicore and manycore architectures,” in
Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM
International Symposium on, pp. 469–480, IEEE, 2009.
[36] C. D. Kersey, A. Rodrigues, and S. Yalamanchili, “A universal parallel
front-end for execution driven microarchitecture simulation,” in
Proceedings of the 2012 Workshop on Rapid Simulation and
Performance Evaluation: Methods and Tools, pp. 25–32, ACM, 2012.
[37] P. Rosenfeld, E. Cooper-Balis, and B. Jacob, “Dramsim2: A cycle
accurate memory system simulator,” IEEE Computer Architecture
Letters, vol. 10, no. 1, pp. 16–19, 2011.
[38] S. M. Hassan and S. Yalamanchili, “Understanding the impact of air
and microfluidics cooling on performance of 3d stacked memory
systems,” in Proceedings of the Second International Symposium on
Memory Systems, pp. 387–394, ACM, 2016.
[39] K. Puttaswamy and G. H. Loh, “Thermal herding: Microarchitecture
techniques for controlling hotspots in high-performance 3d-integrated
processors,” in High Performance Computer Architecture, 2007.
HPCA 2007. IEEE 13th International Symposium on, pp. 193–204,
IEEE, 2007.
[40] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and
M. Kandemir, “Design and management of 3d chip multiprocessors
using network-in-memory,” in ACM SIGARCH Computer Architecture
News, vol. 34, pp. 130–141, IEEE Computer Society, 2006.
[41] W. J. Song, S. Mukhopadhyay, and S. Yalamanchili, “Managing
performance-reliability tradeoffs in multicore processors,” in
Reliability Physics Symposium (IRPS), 2015 IEEE International,
pp. 3C–1–12, IEEE, 2015.
[42] J. Donald and M. Martonosi, “Techniques for multicore thermal
management: Classification and new exploration,” in ACM SIGARCH
Computer Architecture News, vol. 34, pp. 78–88, IEEE Computer
Society, 2006.
[43] Y. Wang, K. Ma, and X. Wang, “Temperature-constrained power
control for chip multiprocessors with online model estimation,” in
ACM SIGARCH computer architecture news, vol. 37, pp. 314–324,
ACM, 2009.
[44] V. Pallipadi and A. Starikovskiy, “The ondemand governor: Past,
present, and future,” in Linux Symposium, pp. 215–230, 2006.
[45] F. Zanini, D. Atienza, G. De Micheli, and S. P. Boyd, “Online convex
optimization-based algorithm for thermal management of mpsocs,” in
Proceedings of the 20th symposium on Great lakes symposium on
VLSI, pp. 203–208, ACM, 2010.
[46] S. Murali, A. Mutapcic, D. Atienza, R. Gupta, S. Boyd, L. Benini, and
G. De Micheli, “Temperature control of high-performance multi-core
platforms using convex optimization,” in Design, Automation and Test
in Europe, 2008. DATE’08, pp. 110–115, IEEE, 2008.
[47] O. Sahin, P. T. Varghese, and A. K. Coskun, “Just enough is more:
Achieving sustainable performance in mobile devices under thermal
limitations,” in Computer-Aided Design (ICCAD), 2015 IEEE/ACM
International Conference on, pp. 839–846, IEEE, 2015.
[48] V. Hanumaiah, D. Desai, B. Gaudette, C.-J. Wu, and S. Vrudhula,
“Steam: a smart temperature and energy aware multicore controller,”
ACM Transactions on Embedded Computing Systems (TECS), vol. 13,
no. 5s, p. 151, 2014.
[49] A. Agrawal, J. Torrellas, and S. Idgunji, “Xylem: enhancing vertical
thermal conduction in 3d processor-memory stacks,” in Proceedings of
the 50th Annual IEEE/ACM International Symposium on
Microarchitecture, pp. 546–559, ACM, 2017.
[50] K. Skadron, T. Abdelzaher, and M. R. Stan, “Control-theoretic
techniques and thermal-rc modeling for accurate and localized
dynamic thermal management,” in High-Performance Computer
Architecture, 2002. Proceedings. Eighth International Symposium on,
pp. 17–28, IEEE, 2002.
[51] K. Skadron, M. R. Stan, W. Huang, S. Velusamy,
K. Sankaranarayanan, and D. Tarjan, “Temperature-aware
microarchitecture,” in ACM SIGARCH Computer Architecture News,
vol. 31, pp. 2–13, ACM, 2003.
[52] I. Yeo, C. C. Liu, and E. J. Kim, “Predictive dynamic thermal
management for multicore systems,” in Proceedings of the 45th
annual Design Automation Conference, pp. 734–739, ACM, 2008.
[53] G. Liu, M. Fan, and G. Quan, “Neighbor-aware dynamic thermal
management for multi-core platform,” in Proceedings of the
Conference on Design, Automation and Test in Europe, pp. 187–192,
EDA Consortium, 2012.
