Design and Analysis of Dynamic Thermal Management in Chip Multiprocessors (CMPs) by Yeo, In Choon
DESIGN AND ANALYSIS OF
DYNAMIC THERMAL MANAGEMENT
IN CHIP MULTIPROCESSORS (CMPS)
A Dissertation
by
IN CHOON YEO
Submitted to the Office of Graduate Studies of
Texas A&M University
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
December 2009
Major Subject: Computer Science
Approved by:
Chair of Committee, Eun Jung Kim
Committee Members, Valerie E. Taylor
Hank Walker
Peng Li
Head of Department, Valerie E. Taylor
ABSTRACT
Design and Analysis of
Dynamic Thermal Management in Chip Multiprocessors (CMPs). (December 2009)
In Choon Yeo, B.S., Dongguk University;
M.S., Dongguk University
Chair of Advisory Committee: Dr. Eun Jung Kim
Chip Multiprocessors (CMPs) prevail in the modern microprocessor market. Because
ever-increasing power density and leakage current generate significant heat, the
elevated operating temperature of a chip already threatens system reliability and
has made thermal control one of the most pressing issues in chip design. Due to the
cost and complexity of designing thermal packaging, Dynamic Thermal Management
(DTM) schemes have been widely adopted in modern processors.
In this study, we focus on developing a simple and accurate thermal model that
informs scheduling decisions for running tasks, and we show how to design an
efficient DTM scheme with negligible performance overhead. First, we propose an
efficient DTM scheme for multimedia applications that tackles the thermal control
problem in a unified manner. The scheme makes soft real-time scheduling decisions
based on the statistical characteristics of multimedia applications. Specifically,
we model application execution characteristics as the probability distribution of
the number of cycles required to decode frames. The scheme has been implemented on
Linux on two mobile processors that provide variable clock frequencies: an Intel
Pentium-M processor and an Intel Atom processor. To evaluate the proposed scheme,
we use two major codecs, MPEG-4 and H.264/AVC, at various frame resolutions. Our
results show that, compared to existing DTM schemes for multimedia applications,
our scheme lowers the overall temperature by 4 °C and the peak temperature by 6 °C
(up to 10 °C) while keeping the frame drop ratio under 5%. Second, we propose a
lightweight online workload estimation that uses the cumulative distribution
function and architectural information obtained via Performance Monitoring Counters
(PMC) to observe the processes' dynamic workload behaviors. We also present an
accurate thermal model for CMP architectures that analyzes thermal correlation
effects by profiling the thermal impact of neighboring cores under a specific
workload. Using the estimated workload characteristics and thermal correlation
effects, we can estimate the future temperature of each core more accurately.
We implement a DTM scheme that considers workload characteristics and thermal
correlation effects on real machines, an Intel Quad-Core Q6600 system and a Dell
PowerEdge 2950 (dual Intel Xeon E5310 Quad-Core) system, running applications
ranging from multimedia applications to several benchmarks. Experimental results
show that our DTM scheme reduces the peak temperature by 8% with 0.54% performance
overhead compared to the Linux standard scheduler, while existing DTM schemes
reduce the peak temperature by 4% with up to 50% performance overhead.
To my family
ACKNOWLEDGMENTS
I am sincerely grateful to my advisor, Dr. Eun Jung Kim, for allowing me to
conduct research with her. I am constantly amazed by her extraordinary ability to
transform seemingly unsolvable problems into a tractable form, her infinite
knowledge of the subject matter, and her relentless attention to detail. Her
exceptional commitment to research and strong demand for excellence have guided me
this far. I am truly grateful for her insightful advice, her encouragement, and her
constant motivation throughout this work.
I would also like to thank professors Valerie E. Taylor, Hank Walker, and Peng
Li for their service on my advisory committee. Their insightful comments and con-
structive criticism helped me improve my research. In addition, I am deeply grateful
to Dr. Kihwan Yum for the valuable advice given to me during this study.
Furthermore, I would like to thank my friends and fellow students at Texas A&M
University for numerous discussions about various issues related to research and life.
I sincerely thank current and former members of the High Performance Computer Lab
for being supportive of me during this work. I also want to thank Seung-Ryong Kang,
Young-Woo Ahn, and Seung-Jin Sul for being great friends and for always being
available whenever I needed their assistance and help. I also thank Chih-Chun Liu
for being a great friend and for always being available whenever I needed his
assistance and help.
All my friends at Texas A&M University have helped me in various ways during the
years of my Ph.D. program. I thank them all, especially Sun-Hwan Jang, Jae-Woo
Seo, Yoon-Jin Kim, Gun-Hee Jo, Ja-Ryeong Koo, Baik-Song Ahn, Heung-Ki Lee, and
Ju-Young Jung.
Last, but not least, I would like to thank my parents and my family members
for their continuous support and encouragement. I am especially grateful to my wife
for her endless support and love. Without her dedication and belief in me, this work
would have been impossible.
TABLE OF CONTENTS
CHAPTER
I INTRODUCTION
II BACKGROUND AND RELATED WORK
A. Dynamic Thermal Management in Single Core Architecture
B. Dynamic Thermal Management in Multicore Architecture
III EFFECTIVE DYNAMIC THERMAL MANAGEMENT FOR MPEG-4 DECODING
A. Thermal Issues in Multimedia Applications
B. Overview of a Feedback Control
C. Advanced Feedback Controller Using GOP Information
D. Thermal Control Using GOP Information
E. DTM with the Advanced Feedback Controller
F. Experimental Results and Analysis
G. Conclusions
IV THERMAL-AWARE SCHEDULING BASED ON STATISTICAL CHARACTERISTICS OF MULTIMEDIA APPLICATIONS
A. The Problems of Multimedia Applications Processing
B. The Workload of Multimedia Applications
C. Thermal-Aware Scheduling for Multimedia Applications
D. Application Characteristics Profiler for Multimedia Applications
E. Experimental Environments
F. Experimental Results and Analysis
1. The effect on performance overhead and thermal managements in Intel Pentium-M processor
2. The effect on performance overhead and thermal managements in Intel Atom processor
G. Conclusions
V PREDICTIVE DYNAMIC THERMAL MANAGEMENT FOR MULTICORE SYSTEMS
A. Predictive Thermal Model
1. The application-based thermal model in CMP systems
2. The core-based thermal model in CMP systems
3. The predictive thermal model
B. PDTM Scheduler
C. Experimental Results and Analysis
1. Digital thermal sensor for Intel quad-core
2. Experimental results and analysis
D. Conclusions
VI TEMPERATURE-AWARE SCHEDULER BASED ON THERMAL BEHAVIOR GROUPING IN MULTICORE SYSTEMS
A. Thermal Behavior Group
1. Thermal behavior groups based on the applications’ thermal pattern
2. The region of the thermal behavior group
B. Temperature-Aware Scheduler for Multicore Systems
C. Experimental Results and Analysis
1. 4-core system
2. 8-core system
D. Conclusions
VII A THERMAL MODEL BASED ON WORKLOAD CHARACTERISTICS USING CDF
A. A Representative Workload Estimation Based on CDF
1. The definition of workload
2. The statistical representative to estimate workload
3. Thermal parameters in CMP systems
B. Thermal Mode Based on Workload
1. Prior thermal model of a single core
2. The thermal impacts contributed by different workloads
3. New T′ss according to thermal correlation
4. New b′ according to thermal correlation
5. Future temperature estimation model
C. A Proactive Correlation-aware Thermal Management
1. System overview
2. Thermal-aware thread scheduler (TATS)
D. Experimental Results and Analysis
E. Conclusions
VIII A THERMAL MODEL FOR CMPS CAPTURING WORKLOAD CHARACTERISTICS AND NEIGHBORING CORE EFFECTS
A. The Lumped Thermal RC Model
B. Workload-aware Thermal Model
C. Thermal Correlation Effects
D. Conclusions
IX CONCLUSIONS AND FUTURE WORK
A. Conclusions
B. Future Work
REFERENCES
VITA
LIST OF TABLES
TABLE
I Experimental systems description
II The experimental multimedia data (Standard Definition)
III The experimental multimedia data (High Definition)
IV Environments parameters
V A set of benchmarks list
VI The result of thermal behavior group using K-means clustering on 4-core system
VII Experimental systems descriptions
VIII Each core’s respective Tss and thermal parameter b for a generated example process with 100% workload running in the Intel Quad Core Q6600 system
IX The thermal parameter b and Tss according to workload in 4-core system
X Ttc and b according to thermal correlation profiled for core 1
XI The ratio of Tss for cores in an Intel Quad-Core processor
LIST OF FIGURES
FIGURE
1 MPEG process with display buffers
2 Low complexity vs. high complexity scenes
3 Comparison with DTM and without DTM for frequency
4 Temperature comparison with and without DTM
5 Variance of temperature of high-complexity movies
6 Variance of temperature of mid- and low-complexity movies
7 The timing gap between decoding and displaying data using the buffer management
8 The workload of decoding and displaying multimedia data according to several codecs and frame resolutions
9 TAS overview
10 The cumulative distribution function (cdf) of decoding frames in the multimedia application
11 The frame drop of standard definition multimedia data encoded by MPEG-4 and H.264/AVC in Intel Pentium-M processor
12 Resulting temperatures with feedback, frame, cycle counter, and TAS in the standard definition multimedia data in Intel Pentium-M processor (frame resolution: 800 x 600)
13 Resulting temperatures with feedback, frame, cycle counter, and TAS in the high definition multimedia data in Intel Pentium-M processor (frame resolution: 1280 x 720)
14 The frame drop of standard definition multimedia data encoded by MPEG-4 and H.264/AVC in Intel Atom processor
15 Resulting temperatures with feedback, frame, cycle counter, and TAS in the standard definition multimedia data in Intel Atom processor (frame resolution: 800 x 600)
16 Resulting temperatures with feedback, frame, cycle counter, and TAS in the high definition multimedia data in Intel Atom processor (frame resolution: 1280 x 720)
17 Real temperature of one core on running bzip2 benchmark
18 The calculation of ∆t (migration time) using ABTM
19 System overview
20 PDTM utilizes ABTM and CBTM simultaneously to predict both short-term and long-term future temperature for multicore
21 Comparisons among without DTM, HRTM, and PDTM using libquantum benchmarks
22 Comparisons among without DTM, HRTM, and PDTM using bzip2 and libquantum benchmarks
23 Performance overhead: PDTM incurs only under 1% performance overhead in average while running single benchmark
24 Tss according to SPEC CPU 2006 benchmark suite
25 Thermal behavior for Group A
26 The application thermal behavior according to applications and cores
27 Slopes for the thermal pattern at runtime
28 DTM evaluations in 4-core system using test group 2 (bzip2 + libquantum)
29 The representative workload is 58% when the probability (ρ) is 0.7 in dynamic workload behavior
30 Thermal effects by different workloads
31 The thermal range (∆T) using Twss and Ttc to calculate T′ss for core 1
32 Validation of improved thermal model with workload estimation and thermal correlation in static application. (Only core 1’s temperature is drawn)
33 Validation of new thermal model with fluctuant workload: while playing the Transformer movie, the Mplayer software would generate two threads. One is the X windows daemon with stable workload, and the other one is for decoding with fluctuant workload as shown above.
34 The difference of thermal control based on current temperature and future temperature
35 ProCATM system architecture
36 DTM evaluation in Intel Quad Core Q6600 system for stable workload behaviors: libquantum + vacation
37 DTM evaluation in Intel Quad Core Q6600 for dynamic workload behaviors: Multimedia
38 An extended lumped thermal RC circuit model for a single core in a CMP architecture
39 Tss of SPEC CPU 2006 benchmarks
40 Temperature tracking using architectural information in SPEC CPU integer benchmarks
41 Temperature tracking using architectural information in SPEC CPU floating-point benchmarks
42 The extended thermal model for CMP architecture
43 The proposed platform
44 The comparisons between the estimated temperature considering workload characteristics and thermal correlation effects and the measured temperature in a CMP architecture
CHAPTER I
INTRODUCTION
Chip Multiprocessors (CMPs) have become the main trend in the design of
new-generation processors. CMP architectures place several cores on a single die to
increase performance. Instead of pushing the limits of a processor’s frequency, the
demand for more capable microprocessors must be satisfied by other methods.
However, decreased chip size and increased power density produce a significant
amount of heat, threaten system performance and reliability, and even increase
leakage power. This heat dissipation is pushing the limits of current packaging
technology and cooling solutions. Packages are designed for the worst typical
behaviors and rely on Dynamic Thermal Management (DTM) techniques to control
temperature at runtime. Therefore, over the past decade, chip design trends have
shifted toward providing more effective thermal management.
Brooks and Martonosi [1] state the key goals of DTM as: (1) providing inexpensive
hardware or software responses, (2) reliably reducing power, and (3) impacting
performance as little as possible. Although many hardware-based temperature control
techniques, such as Dynamic Voltage and Frequency Scaling (DVFS) and clock gating,
have been proposed and applied in modern processors, the demand for more efficient
techniques persists [1, 2, 3, 4, 5, 6].
Although modern microprocessors can meet the computation requirements of
multimedia data, decoding high-definition multimedia data is computationally
intensive and generates a large amount of heat in modern embedded systems.
Therefore, it is critical to keep the temperature of microprocessors under safe
limits at runtime.
The journal model is IEEE Transactions on Networking.
In DTM schemes for modern processors, DVFS is a common method to control
temperature. However, because multimedia applications process data with different
frame sizes and types, it is not easy to meet their QoS requirements while keeping
temperature under control. There have been a handful of studies on thermal
management for multimedia systems [7, 8, 9, 10, 4, 11]. However, according to our
observations, these schemes tend to overestimate or underestimate multimedia
application requirements. As a result, they either keep the operating temperature
unnecessarily high or degrade performance. To balance the tradeoff between QoS and
temperature control, we derive application execution characteristics for various
multimedia codecs such as MPEG-4 and H.264/AVC. The application execution
characteristics can be represented by the probability distribution of cycle demand,
that is, the number of cycles required to decode a frame. Using this
representation, we estimate an adequate processor speed for decoding frames at
runtime. We then develop the Thermal-Aware Scheduler (TAS), which selects
frequencies that avoid thermal emergencies while minimizing performance degradation
in real environments. We experiment on Intel’s Pentium-M and Atom processors using
various multimedia data encoded by MPEG-4 and H.264/AVC. Compared to the feedback
control scheme [10], the frame-based scheme [12], and the cycle counter-based
scheduler [13], TAS lowers average temperature by 6 °C and peak temperature by
10 °C or more, with a maximum frame drop ratio of 5%.
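The idea of choosing a processor speed from the distribution of cycle demand can be illustrated with a minimal sketch. It is not the TAS implementation: the function names, the use of an empirical quantile, the probability parameter rho, and the example frequency table are all assumptions made for illustration. The sketch picks the cycle count that covers a fraction rho of observed frames and then selects the lowest available frequency that can deliver that many cycles within one frame period.

```python
import math


def cycle_demand_quantile(cycle_samples, rho):
    """Smallest observed cycle count that covers at least a fraction rho of frames.

    Hypothetical helper: the empirical quantile stands in for the cdf of cycle demand.
    """
    if not cycle_samples:
        raise ValueError("need at least one cycle-demand sample")
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    ordered = sorted(cycle_samples)
    index = math.ceil(rho * len(ordered)) - 1
    return ordered[index]


def select_frequency(cycle_samples, rho, frames_per_second, available_hz):
    """Lowest available frequency that decodes a rho-quantile frame within one frame period.

    Falls back to the highest frequency when no setting is fast enough.
    """
    demand_cycles = cycle_demand_quantile(cycle_samples, rho)
    required_hz = demand_cycles * frames_per_second
    for freq in sorted(available_hz):
        if freq >= required_hz:
            return freq
    return max(available_hz)


# Illustrative values only (not measurements from this work).
samples = [10_000_000, 20_000_000, 30_000_000, 40_000_000]
levels = [600_000_000, 800_000_000, 1_000_000_000]
chosen = select_frequency(samples, 0.75, 25, levels)
```

A larger rho trades higher frequency (and therefore more heat) for fewer late frames, which is the QoS versus temperature tradeoff discussed above.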
In DTM schemes for CMP architectures, thread migration has been proposed to
achieve thermal balance among cores without throttling computation performance.
Some of these schemes are reactive to increased chip temperature, while others,
such as [4, 5], are proactive and act on the predicted future temperature.
Proactive DTM schemes are more effective at controlling temperature and preventing
thermal emergencies because they trigger the control scheme before the core
temperature reaches the desired threshold. Since applications use different
functional units, which affects operating temperature, the temperature difference
among applications can be up to 9 °C [2, 3]. In fact, the temperature difference
between on-chip components can be as much as 10 to 15 °C [3]. Moreover, not all
applications produce the same heat dissipation pattern. In other words, there are
significant variations in the thermal characteristics of different applications
[3, 4] and of different cores in the same chip.
Also, according to our observations, the temperature of a core varies by 2 °C to
16 °C depending on the level of thermal correlation in a 4-core CMP system. The
temperature of a component is highly correlated with those of other components in
the same chip [14, 15, 16, 3]. A temperature model that captures correlation
effects in a uniprocessor cannot be directly applied to a CMP because of the CMP’s
potential heterogeneity, where each core has an independent task to run.
Furthermore, [14, 4] have already reported significant variations in the thermal
behaviors of different applications. Although there have been a handful of studies
using simple workload models, such as average workload and Instructions Per Cycle
(IPC), these studies measured the workload information offline for temperature
management [17, 18]. We believe that it is critical to develop an efficient, online,
application-based thermal model for DTM that is applicable to real-world
applications with dynamic workload behaviors and distinct thermal contributions to
the chip temperature. To demonstrate the scalability and efficiency of the proposed
DTM scheme, especially for the thermal control demands of recent server
environments, we implement and evaluate it on real machines, an Intel Quad-Core
Q6600 system and a Dell PowerEdge 2950 (dual Intel Xeon E5310 Quad-Core) system,
running applications ranging from multimedia applications to several benchmarks.
For several applications with dynamic workload behaviors, experimental results show
that our DTM scheme reduces the peak temperature by 8% with 0.54% performance
overhead compared to the Linux standard scheduler, while existing DTM schemes
reduce the peak temperature by 4% with up to 50% performance overhead.
In summary, this thesis focuses on dynamic thermal management in CMP
architectures with negligible performance overhead. Specifically, the contributions
of this thesis are as follows:
• A near-optimal thermal management in various multimedia systems. We es-
timate multimedia applications’ thermal characteristics using statistical ap-
proaches to be suitable for various multimedia codecs. A thermal-aware sched-
uler for multimedia systems provides soft realtime performance guarantees with
statistical processor allocations.
• A better understanding of how the workload for running applications affects
operating temperature in CMP architectures. We define the dynamic workload for
running applications as a statistical function (cdf) and a workload characteristics
function w(t) over a given time interval [t − 1, t]. The workload characteristics
function consists of a positive factor and negative factors obtained
by Performance Monitoring Counters (PMC). These analytical results provide
a statistical approach to understand the temperature variations influenced by
applications.
• Thermal correlation effects can explain how the heat transfer works in real CMP
products. Thermal models for CMP architectures should consider the heat trans-
fer among cores, which is defined as the thermal correlation effects. The ther-
mal model, capturing thermal correlation effects in a uniprocessor, cannot be
directly applied to the thermal model of CMP architectures due to the potential
heterogeneity where each core has an independent task to run. We provide an
extended thermal model based on thermal correlation effects in CMP architectures.
• A measurement of temperature via Digital Thermal Sensor (DTS). In order to
measure temperature through the Digital Thermal Sensors (DTS) in a CMP
architecture, we develop a specific device driver to access them at runtime. In a
silicon die of a CMP, each core has a unique thermal sensor that triggers
independently. The trigger point of these thermal sensors is not programmable by
software since it is set during the fabrication of the processor [19].
• Application-Based Thermal Management. Basically, an Application-Based Ther-
mal Management (ABTM) consists of three major components: an application-
based thermal model, a future temperature estimation, and a thermal-aware
scheduler. We used a specific device driver for Linux to access the Digital Ther-
mal Sensor (DTS) and measure each core’s real temperature, and then used the
temperature information in the future temperature estimation. Also, we used an
application-based thermal model to exploit the thermal model according to each
application’s execution behavior. At the same time, the future temperature es-
timation utilizes the workload characteristics and thermal correlation effects to
estimate the future temperature and the time left until the temperature reaches
the migration threshold. Hence, the thermal-aware scheduler is able to react to
the thermal emergency appropriately using the estimated information.
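The future temperature estimation and the "time left until the migration threshold" mentioned in the last contribution can be sketched as follows. This is a minimal illustration, not the model developed in later chapters: it assumes a first-order exponential approach toward a steady-state temperature, with hypothetical parameter names t_ss (steady-state temperature) and b (thermal rate constant) chosen to echo the symbols Tss and b that appear in the list of tables.

```python
import math


def predict_temperature(t_now, t_ss, b, dt):
    """Temperature after dt seconds, assuming T(t) = t_ss - (t_ss - t_now) * exp(-b * t).

    Assumption: a first-order lumped model with rate constant b > 0.
    """
    if b <= 0:
        raise ValueError("b must be positive")
    return t_ss - (t_ss - t_now) * math.exp(-b * dt)


def time_to_threshold(t_now, t_ss, b, threshold):
    """Seconds until the core reaches the threshold under the same assumed model.

    Returns 0.0 if the threshold is already reached and math.inf if the
    steady-state temperature never reaches it.
    """
    if b <= 0:
        raise ValueError("b must be positive")
    if t_now >= threshold:
        return 0.0
    if t_ss <= threshold:
        return math.inf
    return math.log((t_ss - t_now) / (t_ss - threshold)) / b


# Illustrative values only: a core at 50 °C heading toward 70 °C, threshold 60 °C.
remaining = time_to_threshold(50.0, 70.0, 0.1, 60.0)
```

A scheduler built on such an estimate could migrate a thread shortly before the remaining time reaches zero rather than reacting after the threshold has been crossed.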
The rest of the thesis is organized as follows. In Chapter II, we describe the
background and related work of this thesis. In Chapter III, we propose an efficient
thermal management scheme for multimedia applications. We consider how the
performance of a multimedia system is affected by the complexity of scenes, and
then we find an appropriate frequency based on the scene complexity information. In
Chapter IV, we first derive the applications’ characteristics for various
multimedia applications from MPEG-4 and H.264/AVC data encoded at two different
frame resolutions. Using those characteristics, we estimate a frequency at which to
execute multimedia applications that decode frames at runtime. In Chapter V, we
present an advanced future temperature prediction model for each core. This allows
us to estimate the thermal behavior considering both core temperature and
application temperature variations, and to take appropriate measures to avoid
thermal emergencies. In Chapter VI, the proposed thermal-aware scheduler uses an
advanced future temperature prediction model for each core to estimate different
thermal behaviors and measures the amount of time it takes for each core to reach
the desired temperature threshold. In Chapter VII, we further model thermal
correlation effects by profiling the thermal impact of neighboring cores under a
specific workload. In Chapter VIII, we propose a thermal model based on thermal
correlation effects and online workload estimation using architectural information.
Finally, in Chapter IX, we conclude this thesis and discuss future work.
CHAPTER II
BACKGROUND AND RELATED WORK
In this chapter, we review prior work in the area of dynamic thermal management.
A. Dynamic Thermal Management in Single Core Architecture
Several schemes using architecture adaptation provide Dynamic Thermal Management
(DTM) solutions [1, 2]. Brooks and Martonosi suggested fetch toggling, which stalls
instruction fetching to stay under the thermal limit [1]. Heo et al. moved
computation to a duplicated unit while the overheated unit cools down [2]. However,
these schemes cannot guarantee that workload deadlines are met in real time. In
particular, missed deadlines result in low performance in multimedia systems.
Skadron et al. suggested several thermal management schemes, including temperature
tracking [16], a hybrid scheme [3], and feedback control [20, 10]. The
temperature-tracking scheme in [16] manages temperature through frequency scaling,
localized toggling, and computation migration. The hybrid scheme in [3] combines
fetch gating and DVFS. The feedback control schemes adjust temperature based on
feedback information [20, 10]. Since these approaches do not take into account the
complexity of scenes in multimedia data, they cannot avoid performance degradation
in multimedia applications with radical picture changes.
In [20], Mircea et al. designed a thermal model using the thermal behavior,
thermal resistances, and thermal capacitances of functional blocks at the
architectural level. Many thermal studies have adopted this model. Although thermal
behavior can be detected at runtime, doing so requires specific hardware, such as
the built-in Performance Monitoring Counter (PMC) [21].
DTM schemes can be roughly grouped into two approaches: proactive schemes
[4, 22, 8, 9, 23, 7] and reactive schemes [20, 24, 10]. In proactive schemes, the
results of the previous task determine the speed of the multimedia system.
Pouwelse et al. estimated the decoding time per frame based on offline information
on decoding time and frame size [22, 8]. In [9], the Frame Data Computation Aware
(FDCA) scheme estimated the decoding time for incoming frames based on information
on decoding macroblocks. Unlike the approach of Pouwelse et al., this method did not
require any pre-profiled information. [4] proposed predictive thermal management
using profiled information, which showed the maximum performance under the thermal
constraints. [23] presented an offline scheduling algorithm that saves power at the
cost of quality degradation.
In contrast, reactive schemes determine the speed of the system based on
historical information. [24] designed a user-level power management scheme in which
daemons configure the speed setting of the CPU using the characteristics of
applications, such as soft real-time, interactive, and batch programs. Also,
[20, 10] designed feedback control for multimedia systems, in which control modules
change the frequency level based on the occupancy of the display buffer. In [7],
Son et al. suggested two schemes, one proactive and one reactive. Their reactive
scheme configured the frequency based on delay and frame drop rate, while the
proactive scheme determined the frequency using the predicted decoding time based
on the future GOP size. However, this scheme needs a profiling process before
decoding a GOP.
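The buffer-occupancy feedback idea described above can be illustrated with a small sketch. It is not the controller from [20, 10]: the function name, the two watermarks, and the single-step adjustment are assumptions made for illustration. When the display buffer runs low the decoder is falling behind, so the frequency level is raised; when the buffer is nearly full the decoder is ahead, so the level can be lowered to reduce heat.

```python
def next_frequency_level(level, occupancy, num_levels, low=0.25, high=0.75):
    """Return the next frequency level index (0 = slowest) from display-buffer occupancy.

    Hypothetical policy: occupancy is the filled fraction of the display buffer,
    and low/high are assumed watermarks, not values taken from the cited work.
    """
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be in [0, 1]")
    if not 0 <= level < num_levels:
        raise ValueError("level out of range")
    if occupancy < low:
        # Buffer is draining: decoding is too slow, step the frequency up.
        return min(level + 1, num_levels - 1)
    if occupancy > high:
        # Buffer is nearly full: decoding is ahead, step the frequency down.
        return max(level - 1, 0)
    return level


# Illustrative trace of occupancy readings starting from the middle of 5 levels.
level = 2
for reading in (0.10, 0.20, 0.50, 0.90):
    level = next_frequency_level(level, reading, num_levels=5)
```

Because such a controller reacts only after the buffer has already drifted, it illustrates why the text classifies feedback control as a reactive scheme.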
B. Dynamic Thermal Management in Multicore Architecture
Nowadays, several thermal control techniques [1, 3] via hardware-based mechanisms,
such as Dynamic Frequency Scaling (DFS), Dynamic Voltage Scaling (DVS), and
clock gating, have been proposed and applied in modern processors. However, these
DTM mechanisms belong to temporal thermal control and incur high execution performance
overhead. Therefore, as multicore systems become more popular, some
software-based spatial thermal control mechanisms, such as [6, 17, 5], have been
studied in a CMP system.
In [17], the proposed mechanism, called heat-and-run, has two key components:
SMT thread assignment and CMP thread migration. Within heat-and-run the SMT
thread assignment attempts to increase processor-resource utilization by co-scheduling
threads which use complementary resources. The CMP thread migration moves
threads away from overheated cores and assigns them to free SMT contexts on al-
ternate cores. This maintains throughput while cooling the overheated cores. They
evaluated their experiments in an extended Wattch simulator by running five threads
within four cores. Heat-and-run thread assignment (HRTA) and heat-and-run thread
migration (HRTM) achieve 9% higher average throughput than stop-go and 6% higher
average throughput than DVS. Moreover, [25] confirmed the performance gains
brought by thread migration when performance is constrained by temperature, as well as
the importance of limiting the migration frequency to reduce performance overhead.
[17] proposed a new migration method for temperature-constrained multicores that
exchanges threads whenever the simultaneous occurrence of a cold and a hot core
is detected. The authors demonstrate that their method yields the same throughput
as HRTM, but requires far fewer migrations. However, the performance overhead
caused by migrations is not analyzed further with respect to application memory usage.
Mulas et al. show that threads with lower memory usage tend to migrate more
easily than other threads, thus reducing the performance overhead caused by
migrations [6]. However, this study ignores application workload behaviors.
This implies that sets of running threads will migrate without considering the different
thermal effects caused by various threads when the core temperature reaches the
upper/lower temperature threshold. Furthermore, these studies are based on simulated
results and neglect thermal correlation among cores. The power dissipated by
the rest of the chip is assumed to be negligible. Moreover, in these studies the migration
action is triggered by the current temperature (when the temperature is higher than
the maximum allowed temperature).
Kumar et al. propose HybDTM, a methodology for fine-grained, coordinated
thermal management using both software (priority scheduling) and hardware
(clock gating) techniques. HybDTM estimates temperature by using a regression-based
thermal model based on Performance Monitoring Counters [26]. However,
HybDTM cannot effectively reduce overheating without noticeable performance
overhead (9.9% performance overhead compared to cases without any DTM).
This is because Performance Monitoring Counters alone cannot estimate the real
temperature. Also, both priority scheduling and clock gating generate high performance
overhead. Most importantly, these proposed thermal management mechanisms make
impractical assumptions and have only been evaluated by running benchmarks, which
have stable application workloads and thermal behaviors. For example, HRTA cannot
co-schedule threads without knowing the thread characteristics, such as Instructions
Per Cycle (IPC) information and execution resources.
Liu et al. propose an application-level power management scheme, called Chameleon, for
real-world user applications [24]. Chameleon consists of three components: (1) an OS
interface that can be used by power-aware applications to measure their CPU usage,
(2) a CPU scheduler that supports per-process CPU power settings and application
isolation, and (3) a speed adapter that maps the CPU speed settings to the nearest
speed supported by the hardware. However, the need for power-aware applications
is impractical, since each application's source code needs to be modified. Otherwise,
LongRun is used for the legacy applications. Even though this work deals with
power management, it inspired us to develop a scalable DTM for the real-world case
and, more specifically, to satisfy the demand for thermal control in recent server
environments.
Michaud et al. confirm the performance gains from thread migration when
performance is constrained by temperature [25]. They also demonstrate the importance
of limiting the migration frequency to reduce performance overhead. Hence,
several advanced DTM studies [17, 18, 6] advocate providing thermal fairness and
reducing the peak temperature through temperature-aware thread migration schemes.
However, as presented in [3], an accurate and practical dynamic model of temperature
is needed to accurately characterize current and future thermal stress and
application-dependent behavior, as well as to evaluate architectural techniques for managing
thermal effects. Moreover, estimating thermal behavior from the average power
dissipation is unreliable. Therefore, we propose to estimate each application's thermal
behavior by characterizing its workload through a statistical probability distribution
and online workload estimation based on architectural information from Performance
Monitoring Counters (PMC). Most importantly, Shang et al. introduce the
thermal correlation effects in on-chip networks [27]. We are motivated to model
the thermal correlation effects from neighboring cores in CMP systems at the
architecture level and further present an adaptive and scalable DTM based on the
thermal correlation effects for the CMP system.
CHAPTER III
EFFECTIVE DYNAMIC THERMAL MANAGEMENT FOR MPEG-4
DECODING
In this work, we present Dynamic Thermal Management (DTM) based on a Dy-
namic Voltage and Frequency Scaling (DVFS) technique for MPEG-4 decoding to
guarantee thermal safety while maintaining a quality of service (QoS) constraint. Al-
though many low-power and low-temperature multimedia playback techniques have
been proposed, most of them are impractical in real-time and have several restricting
assumptions. Multimedia data consists of several frames requiring different decoding
efforts. Since both temperature and performance of a multimedia system are affected
by the complexity of scenes, our main idea is to use the information on scene com-
plexity to find an appropriate frequency. In order to predict the complexity of the
current scene, we extract information from the previous group of pictures (GOP) using
feedback control with a display buffer. Experimental results with twelve movies
show that our DTM scheme keeps the temperature under the threshold (70 °C) while
maintaining a 0% frame miss ratio. Also, the proposed DTM scheme decreases the
average temperature by up to 13% without any additional hardware or playback
latency.
The main contributions of this research are summarized as follows:
• Our DTM scheme estimates a multimedia application's characteristics according
to the complexity of the scene using GOP information and frame drop ratio.
• DTM with the advanced feedback controller provides soft real-time performance
guarantees under thermal safety.
• Compared to prior DTMs for multimedia applications, such as feedback
control [10] and the frame-based scheme [12], the proposed DTM lowers temperature by 13% on
average when running MPEG-4 data under a 5% frame drop ratio.
A. Thermal Issues in Multimedia Applications
Thermal issues are becoming critical in multimedia systems to achieve high reliability.
Although the speed of a modern microprocessor supports processing of the multimedia
data in real-time, a multimedia system consumes lots of power for computation and
cooling. General purpose computer systems consume over 25% of the power for energy
management such as air conditioning, backup cooling and power delivery systems [1].
However, portable battery-operated devices cannot afford such high cooling power.
Without sufficient cooling, embedded systems suffer from prolonged overheating,
which eventually causes the system to crash. However, reducing the voltage level
slows down the overall performance. Therefore, the best way to reduce energy
dissipation with dynamic voltage and frequency scaling (DVFS) techniques is to
dynamically adjust the voltage levels, while maintaining the minimum required circuitry
to accommodate workloads within appropriate computation time and throughput
constraints [12]. Multimedia data consist of different frames with different deadlines to
be displayed. MPEG frames are classified into three coding types, intra (I),
predictive (P), and bi-directional (B), which consume different amounts of power/energy
and therefore raise the temperature by different amounts during decoding. In
addition, the wide variety of frame sizes makes it difficult to predict power consumption
and to control temperature. Furthermore, since DVFS reduces the overall computation
speed, some frames are likely to miss their deadlines. Therefore, it is
challenging to find the right speed to control system temperature without quality
degradation. In [10, 20], the authors suggested a feedback control from a display buffer
to find an adequate speed without quality degradation and to reduce the power
consumption of MPEG decoding. However, they did not consider thermal problems and
exploited only the buffer occupancy information, which is not sufficient to control the
speed for both performance and temperature. We observe that the required decoding
time depends on the complexity of scenes that can be measured with the number of
frames in a group of pictures (GOP), and those frames in a GOP require similar com-
putation time for decoding. Therefore, the previous GOP information can be used to
predict the computation power of a current frame. We propose an efficient Dynamic
Thermal Management (DTM) scheme for a multimedia system to find an appropri-
ate frequency for the available decoding and display buffer based on an advanced
feedback control. Our scheme exploits the previous GOP information considering the
trade-offs between the quality of data and thermal safety using a frequency efficient
factor.
B. Overview of a Feedback Control
[Figure 1: block diagram of the MPEG decoding process. Decoder thread: Read headers, Read Blocks, Reconstruct MB, IDCT, Merge MB, Store frame. Display buffer for decoded stream. Display thread: Dither frame, Display frame. Control thread. Legend: data flow, procedure flow.]
Fig. 1. MPEG process with display buffers
Fig. 1 shows the details of the decoding procedure used in [20, 10]. After reading
a stream in the ‘Read headers’ and ‘Read Blocks’ steps, a decoder thread decodes macro
blocks at ‘Reconstruct MB’, ‘IDCT’, and ‘Merge MB.’ Finally, the decoder thread
puts the decoded frame into a buffer for display. Then a display thread performs the
steps of ‘Dither’ and ‘Display’ with the frames in the display buffer. The decoder
thread executes CPU-intensive operations, while the display thread just displays the
decoded frames. To obtain an adequate frequency for the decoding stream, a control
thread checks the state of the display buffer. When the occupancy of the display
buffer is high, the control thread decreases the frequency, since decoding at the
current frequency is much faster than display. Conversely, when the occupancy is
low, the control thread increases the frequency to meet
the deadline of each frame. Therefore, performance (i.e., the deadline of frames)
should be considered at low occupancy, while energy efficiency and thermal safety
should also be considered at high occupancy of the display buffer. To fulfill these
two considerations, the control thread has to determine an optimal frequency without
QoS degradation. Frames should be decoded sequentially and displayed on the display
device with a constant playback interval denoted by t_interval. Although each frame can
be decoded in a different elapsed time due to computation variations, each decoded
frame in the display buffer is displayed at a uniform speed. To support the QoS
requirement, the adjusted frequency should satisfy the following Equation (3.1) [10]:
\[
\frac{\sum_{k=1}^{i} D_k}{i} \le t_{interval}, \quad 1 \le i \le n, \tag{3.1}
\]
where D_k is the decoding time for frame k, i is the number of frames in the display
buffer, and n is the size of the display buffer. Note that for the next frame, we
do not have any information about the required decoding time. Without the display
buffer, it is difficult to estimate the optimal frequency for decoding the frame. Also,
a display buffer with enough space for several frames allows the system to determine
the optimal frequency without frame misses. Lu et al. use a feedback controller to
adjust the frequency with the number of decoded frames in the display buffer within a
region specified by {Bl, Bh}, where Bl is the lower threshold for the number of frames
in the buffer and Bh is the higher threshold [10]. Using the feedback controller for
decoding frames in the display buffer assumes that the decoder speed is adequate
for decoding the frames as long as the number of the frames in the display buffer is
within the specified region. However, there are two serious problems in the feedback
controller with the display buffer. The first problem is that the feedback controller
using the display buffer does not satisfy the deadline of all frames. The frequency is
adjusted by the value based on the number of the frames in the display buffer within a
region specified by {Bl, Bh}. This problem occurs with the movies containing several
complicated scenes, such as Star Wars 3 and Terminator 3. For example, a high
frequency may be required even though the occupancy of the display buffer seems
to be sufficient to decode upcoming frames. In such cases, frames will be dropped if
the optimal frequency is derived only from the display buffer occupancy. Before
decoding the next frame, we do not know how much time will be required for it.
Without a buffer, we must use the most conservative estimate to set the decoding
speed. But if there are some decoded frames already in the buffer between the decoder
and the display device, we can apply a predictive operating frequency to be close to
the optimum. Although the actual decoding time of the frames may vary greatly, its
effect on the real-time constraint is hidden by the display buffer. In our approach,
we always try to control the number of decoded frames in the buffer within a region
specified by {Bl, Bh}, where Bl is the lower threshold for the number of frames in the
buffer and Bh is the higher threshold. As long as the number of frames in the buffer
is within the specified region, we assume that the current decoder speed is the right
choice for decoding the frames; i.e. the average decode rate is equal to the display
rate. However, if the actual number of frames in the buffer becomes higher or lower
than the respective threshold, this means the current decoding speed is too fast or too
slow, respectively. We apply a formal feedback controller to pull the number of frames
back into the specified region by adjusting the decoding speed. The second problem is
that the feedback controller using the display buffer cannot control the temperature to
guarantee thermal safety. Without considering temperature constraints, the display
buffer decides the optimal frequency using only its occupancy. This is a very critical
problem in embedded systems, since most embedded systems do not have cooling
systems such as a fan.
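The buffer-based feedback control described in this section can be illustrated with a minimal Python sketch. It is not the formal controller of [10]; it is a simplified step controller written only for illustration, and the frequency levels, the thresholds, and all function names are assumptions. The first function checks the QoS condition of Equation (3.1); the second lowers or raises the frequency level when the buffer occupancy leaves the region {Bl, Bh}.

```python
# Illustrative sketch only: a simple step controller, not the formal
# feedback controller of [10]. All names and values are hypothetical.

FREQ_LEVELS_MHZ = [800, 1000, 1200, 1400, 1600]  # assumed DVFS levels, ascending


def satisfies_qos(decode_times, t_interval):
    """Equation (3.1): for every prefix of i buffered frames, the average
    decoding time must not exceed the display interval."""
    total = 0.0
    for i, d_k in enumerate(decode_times, start=1):
        total += d_k
        if total / i > t_interval:
            return False
    return True


def next_level(level, occupancy, b_low, b_high):
    """Return the new index into FREQ_LEVELS_MHZ for a given buffer occupancy."""
    if occupancy > b_high:  # decoding runs ahead of display: slow down
        return max(level - 1, 0)
    if occupancy < b_low:  # buffer is draining: speed up
        return min(level + 1, len(FREQ_LEVELS_MHZ) - 1)
    return level  # occupancy within {Bl, Bh}: keep the current speed


level = 2
for occupancy in [5, 9, 9, 2]:
    level = next_level(level, occupancy, b_low=3, b_high=8)
    print(occupancy, FREQ_LEVELS_MHZ[level])
```

A controller of this kind sees only the occupancy, which is why it can react too late to a sudden increase in scene complexity and has no notion of temperature.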
C. Advanced Feedback Controller Using GOP Information
The previous feedback control scheme uses the occupancy of the display buffer between
the decoder thread and the display thread to adjust the frequency of a processor to
avoid buffer underflow and overflow. However, even when the occupancy of the
display buffer is high, a frame may miss its deadline when the decoding time of the current
frame is longer than the total display time of the decoded frames in the display buffer.
The relationship among the decoding time for each frame, the display interval, and
the occupancy of the display buffer is defined in Equation (3.2). Let D_i be the decoding
time of frame i. Decoding of frame i should finish before all previous frames
in the display buffer have been presented. Otherwise, frame i will miss its deadline.
Therefore,
\[
D_i < n \cdot t_{interval}, \tag{3.2}
\]
where tinterval is the periodic display time in the display buffer and n is the occupancy
of the display buffer.
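As a small illustration of Equation (3.2), the check below flags a frame whose decoding time is not covered by the frames already buffered. The function name and the millisecond values in the usage lines are assumptions chosen for illustration.

```python
def will_miss_deadline(decode_time, occupancy, t_interval):
    """Equation (3.2): frame i meets its deadline only if
    D_i < n * t_interval, where n is the display-buffer occupancy."""
    return not (decode_time < occupancy * t_interval)


# Hypothetical values: four buffered frames, 33 ms display interval.
print(will_miss_deadline(100.0, 4, 33.0))  # 100 < 132, so the deadline is met
print(will_miss_deadline(140.0, 4, 33.0))  # 140 >= 132, so the deadline is missed
```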
[Figure 2: GOP sequences along the time axis. Panels: Low-Complexity Scenes, High-Complexity Scenes. Legend: I frame; B or P frame.]
Fig. 2. Low complexity vs. high complexity scenes
As shown in Fig. 2, the number of frames in a GOP decreases when the pictures of
scenes change rapidly. A single GOP has several frames, which consist of I, B, and
P frames. For more complex scenes, the number of B and P frames decreases,
while only a single I frame is allowed in any GOP. Since the display interval
depends on frames per second (fps) and has a value between 25 msec and 30 msec,
we record D_i for each frame whose decoding time exceeds the display interval, t_interval.
The scene complexity can then be estimated by adding these D_i values
in a GOP.
For example, assume that frame f_n should be displayed at time t_n and f_{n+1}
should be displayed at t_{n+1}. When frame f_n is ready to be displayed, there are four
frames, from f_{n−3} to f_n, in the display buffer at time t_n. Also assume that the decoding
times D_{n+1} of f_{n+1} and D_{n+2} of f_{n+2} each take three times longer than t_interval. Under
these circumstances, all frames (f_{n−3} to f_{n+1}) in the display buffer are displayed and
the buffer becomes empty, since the next frame (f_{n+2}) has not been decoded. In this
case, frame f_{n+2} is dropped. In order to avoid dropping future frames, the optimal
frequency for f_{n+2} should be determined at time t_n. However, it is difficult to
estimate the frequency for a future frame.
We propose a prediction scheme to find the adequate frequency using the in-
formation on decoding the previous GOP. According to our experiments, the frame
decoding time depends on the complexity of scenes, which continues to exist in several
consecutive frames. It means that frames in each GOP have similar complexity of
scenes. Therefore, since the GOP consists of several frames, we can predict the opti-
mal frequency of the current GOP using the information on the complexity of scenes
in the previous GOP. If the complexity of the previous GOP is high, the complexity
of the current GOP will also be high. Therefore, the complexity of the previous GOP
can be used as a weight factor to determine the frequency of the current GOP. The
weight factor (α) is calculated as follows:
\[
\alpha = \frac{\sum_{i=1}^{k} X_i D_i}{\sum_{i=1}^{k} D_i}, \quad 1 \le i \le k, \tag{3.3}
\]
where D_i is the decoding time of frame i, k is the number of frames in the previous
GOP, and X_i is an indicator that is 1 if D_i exceeds the threshold time and 0 otherwise.
With α, the new frequency can be calculated as in Equation (3.4). Let freq_i be the
frequency used for decoding frame i. The frequency of the current frame
should be configured based on the number of previous frames that have taken less
time than the threshold time for displaying that frame.
\[
freq_i = (1 - \alpha) \cdot freq_{buf(i-1)} + \alpha \cdot freq_{max}, \tag{3.4}
\]
where freq_buf(i−1) is the frequency calculated by the feedback control based
on the occupancy of the display buffer, and freq_max is the maximum frequency of the
processor. Hence, the complexity of scenes can be estimated by the weighting factor,
α. For example, if the previous GOP has twelve frames and three frames among
them have a larger decoding time than the selected threshold value, α is calculated as
0.25. It implies that 75% of the decoding time of frames in the previous GOP falls
within the threshold and 25% of the decoding time exceeds the threshold. According
to Equation (3.4), the frequency of the current frame is adjusted to handle the frames
with higher complexity based on the occupancy of the display buffer. Therefore,
the frequency for decoding the current frame is selected using both the
occupancy of the display buffer and the information from the previous GOP. The
higher the complexity of the previous GOP, the higher the frequency needed for the
current frame. Using this scheme, we can avoid missed frames when the complexity
of scenes increases suddenly.
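Equations (3.3) and (3.4) can be sketched in a few lines of Python. The function names, the reading of X_i as "decoding time exceeds the threshold", and every numeric value below are assumptions made for illustration. Because Equation (3.3) weights each frame by its decoding time, the α computed here can differ from the plain fraction of slow frames in the GOP.

```python
def weight_factor(decode_times, threshold):
    """Equation (3.3): share of the previous GOP's total decoding time
    spent on frames whose decoding time exceeds the threshold.
    X_i = 1 for such frames and 0 otherwise (assumed reading of X_i)."""
    total = sum(decode_times)
    if total == 0:
        return 0.0
    slow = sum(d for d in decode_times if d > threshold)
    return slow / total


def next_frequency(alpha, freq_buf, freq_max):
    """Equation (3.4): blend the buffer-based frequency with freq_max."""
    return (1.0 - alpha) * freq_buf + alpha * freq_max


# Hypothetical previous GOP: nine fast frames and three slow ones (ms).
prev_gop = [20.0] * 9 + [40.0] * 3
alpha = weight_factor(prev_gop, threshold=30.0)
freq = next_frequency(alpha, freq_buf=1000.0, freq_max=1600.0)
print(alpha, freq)
```

With α = 0 the decoder follows the buffer-based frequency unchanged, and with α = 1 it runs at freq_max, so α acts as a dial between the two.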
D. Thermal Control Using GOP Information
Although many studies have focused on the relationship between frequency and
power consumption, the relationship between frequency and temperature has to be
formulated to find the optimal frequency within thermal safety. Therefore, we
consider a simple thermal model of the processor [28, 29], since the relationship
between a processor's frequency and temperature is the basis for any frequency scaling
scheme. By modeling the power dissipation or by increasing the input power, more
precise models can be derived from this simple model [30].
We start from Fourier's Law of heat conduction, in which the rate of heating or cooling
is proportional to the difference in temperature between the object and the
environment [30]. We define T(t) and P(t) as the temperature and the power at time t,
respectively. Then Fourier's Law can be written as follows [28, 29]:
\[
T'(t) = P(t) - b\,T(t), \tag{3.5}
\]
where b is a positive constant representing the power dissipation rate. Now, we define
freq(t) as the processor frequency at time t. The power consumption of a processor
is an increasing convex function of the frequency [28]. Most work assumes that power
and processor frequency are related as follows [28]:
\[
P(t) = a \cdot freq^{\gamma}(t) \tag{3.6}
\]
for some constants a and γ > 1. Assuming that T_0 = 0 (the initial
temperature is the ambient one) and substituting Equation (3.6) into Equation
(3.5), the solution can be presented as [30]:
\[
T(t) = \int_{t_0}^{t} a \cdot freq^{\gamma}(\tau)\, e^{-b(t-\tau)}\, d\tau + T_0\, e^{-b(t-t_0)}. \tag{3.7}
\]
Next, we consider two cases for the variation of the temperature
at any point t [30]. In the first case, when the temperature is non-decreasing,
Equations (3.5) and (3.6) give:
\[
freq(t) \ge \left( \frac{b\,T(t)}{a} \right)^{1/\gamma}. \tag{3.8}
\]
The other case, when temperature is non-increasing, can be expressed as follows:
\[
freq(t) \le \left( \frac{b\,T(t)}{a} \right)^{1/\gamma}. \tag{3.9}
\]
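The step from Equations (3.5) and (3.6) to these two bounds can be written out explicitly. The following is a sketch of that reasoning, assuming a > 0, b > 0, γ > 1, freq(t) ≥ 0, and T(t) ≥ 0, so that the γ-th root is well defined and preserves the inequality:

\[
T'(t) = a \cdot freq^{\gamma}(t) - b\,T(t) \ge 0
\;\Longleftrightarrow\;
freq^{\gamma}(t) \ge \frac{b\,T(t)}{a}
\;\Longleftrightarrow\;
freq(t) \ge \left( \frac{b\,T(t)}{a} \right)^{1/\gamma}.
\]

Starting from T'(t) ≤ 0 instead gives the bound for the non-increasing case in the same way. The boundary value (b T(t)/a)^{1/γ} is therefore the frequency at which the temperature T(t) is held constant.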
Therefore, we observe that the frequency can be scaled to move the temperature in
the desired direction. Finally, we derive the following equations if we
keep the frequency constant at freq(t) = freq_c during the time interval [t_0, t]:
\[
T(t) = \frac{a \cdot freq_c^{\gamma}}{b} + \left( T(t_0) - \frac{a \cdot freq_c^{\gamma}}{b} \right) e^{-b(t-t_0)}, \tag{3.10}
\]
\[
\frac{dT}{dt} = -b \left( T(t) - \frac{a \cdot freq_c^{\gamma}}{b} \right). \tag{3.11}
\]
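The constant-frequency model of Equations (3.10) and (3.11) can be explored numerically with a short Python sketch. The function names are hypothetical, the values of a, b, and γ are the ones reported in this section, and expressing the frequency in Hz is an assumption of the sketch; time is measured in whatever unit b is expressed in.

```python
import math

# Parameter values reported in this section; treating frequency in Hz
# is an assumption of this sketch.
A = 3.0e-28
B = 0.016
GAMMA = 3.0


def steady_state_temp(freq_hz):
    """Limit of Equation (3.10) as t grows: a * freq_c**gamma / b."""
    return A * freq_hz ** GAMMA / B


def temp_after(elapsed, temp_start, freq_hz):
    """Equation (3.10) for a constant frequency held over an interval of
    length `elapsed` (in the time unit in which b is expressed)."""
    t_s = steady_state_temp(freq_hz)
    return t_s + (temp_start - t_s) * math.exp(-B * elapsed)


def holding_frequency(temp):
    """Boundary of Equations (3.8)/(3.9): frequency at which dT/dt = 0."""
    return (B * temp / A) ** (1.0 / GAMMA)


print(steady_state_temp(1.6e9))
print(temp_after(100.0, 0.0, 1.6e9))
print(holding_frequency(70.0))
```

Running at any frequency below `holding_frequency(T)` cools a processor that is currently at temperature T, which is the property the DTM scheme relies on when it lowers freq_max.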
In addition, the temperature variation with frequency is based on the equations
above (the initial temperature is 0; during [0, t], the workload is executed at some
frequency level, and there is no workload to be executed in the interval [t, t′]). Assuming
γ = 3.0, we can obtain the thermal parameter values for a and b. The values of a and b
are processor-specific but application-independent constants. We can determine
the thermal parameters by observing the heating and cooling curves when we run
an application that fully occupies the processor. After a long execution of the
application, the steady-state temperature value T(∞) = T_s can be observed.
Setting T(t) = T and a·freq_c^γ/b = T_s, Equations (3.5) and (3.6) are transformed as
follows:
\[
T = T_s + (T_{init} - T_s)\, e^{-bt}, \tag{3.12}
\]
where T_init is the initial temperature. Using T_s and sampling the temperature every
millisecond, the rate of increase is plotted against (T − T_s) at each point. The resulting
set of points is fitted to a straight line using least mean square error fitting. From
Equation (3.12), the slope of this straight line gives the value of b. We obtain
b = 0.016. By applying this value of b to the relation for T_s, the value of a is also obtained as
a = 3.0 × 10^−28. We used the thermal model above for the simulation. With the
plotted temperature variations, we can see the decrease in temperature achieved by the
suggested DTM for MPEG-4 decoding. All the plots are based on real experimental
data, including the decoding times of each frame. The computed temperature may be
overestimated, because there can be many short processor idle periods
even in the interval between the start and the finish of decoding, whereas we assume that
there is no processor idle period while a frame is decoded. However, even
with this assumption, the relative temperature comparison among the four cases shows
that the GOP-based DTM can decrease the processor temperature overall. Although
the temperature variation for decoding at a static 1.0 GHz frequency is shown for
comparison, that setting performs poorly in terms of frame miss rate. We show the comparison of
the frame miss rate of each method in a later section.
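The parameter fitting described above can be sketched in Python on synthetic data. This is not the measurement procedure used in the experiments: the samples are generated from Equation (3.12) rather than read from a sensor, the fit is a least-squares line through the origin rather than a general line, and all names and values are assumptions. With dT/dt plotted against (T − T_s), Equation (3.11) gives a slope of −b, so the sketch returns the negated slope.

```python
import math


def fit_b(temps, dt, t_s):
    """Estimate b from evenly spaced temperature samples.

    Uses central differences for dT/dt and a least-squares line through
    the origin against (T - T_s). By Equation (3.11) that line has slope
    -b, so the estimate is the negated slope."""
    xs, ys = [], []
    for k in range(1, len(temps) - 1):
        xs.append(temps[k] - t_s)
        ys.append((temps[k + 1] - temps[k - 1]) / (2.0 * dt))
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return -slope


# Synthetic heating curve generated from Equation (3.12); all values
# below are made up for the demonstration.
true_b, t_s, t_init, dt = 0.016, 76.8, 40.0, 1.0
samples = [t_s + (t_init - t_s) * math.exp(-true_b * k * dt) for k in range(200)]
print(fit_b(samples, dt, t_s))
```

Once b is known, a follows from the observed steady-state temperature through a = b · T_s / freq_c^γ.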
E. DTM with the Advanced Feedback Controller
To keep the temperature within the thermal safety limit, we use a DTM scheme
for the multimedia decoder. The new DTM scheme uses the frequency derived from
the previous GOP. In our scheme, we set a temperature threshold
to control the overall temperature while the decoder is running. In order to decide the
temperature threshold, we need the occupancy of the display buffer, which indicates
the efficiency of the frequency used for decoding the previous GOP:
\[
e = \frac{\sum_{i=1}^{n} D_i}{n \cdot t_{interval}}, \quad 0 < e \le 1, \tag{3.13}
\]
where n is the total number of frames in the display buffer, t_interval is the period of
displaying the frames, and e is the frequency efficient factor for decoding frames in the
previous GOP, which applies only when D_i is equal to or less than t_interval. T_emergency is
the maximum allowable temperature and is defined as 80 °C in our experiments, and
T_threshold is the software temperature threshold during decoding of the MPEG-4 stream.
ΔT is defined as the difference between the emergency temperature and the software
temperature threshold. The software temperature threshold is the factor that guarantees
thermal safety. With the factor e, we decide the new software temperature threshold
as shown in Algorithm 1. If T_current exceeds T_threshold, T′_threshold
replaces T_threshold, and freq_max is also replaced by freq[T′_threshold]. Therefore, freq_i
is determined by Equation (3.4) so as to keep the temperature within the thermal safety
limit, because the newly determined freq_max is smaller than the previous freq_max.
Algorithm 1 DTM algorithm
Require: Define Table[ ] for frequency according to threshold temperature
Determine a threshold temperature (T_threshold).
i ← GOP_i
for i = 1 to GOP_max do
    Calculate e in GOP_{i−1}
    Estimate the current temperature (T_current).
    if T_current > T_threshold then
        ΔT ← T_emergency − T_threshold
        T′_threshold ← T_threshold + (1 − e)·ΔT
        index ← index + 1
        freq_max ← Table[index]
        freq_i ← (1 − α)·freq_{i−1} + α·freq_max
    else if T_current < (T_threshold − MIN) then
        index ← index − 1
        freq_max ← Table[index]
        freq_i ← (1 − α)·freq_{i−1} + α·freq_max
    end if
end for
[Figure 3: processor frequency (MHz) versus frame number, with and without DTM.]
Fig. 3. Comparison with DTM and without DTM for frequency
For example, assuming T_emergency to be 80 °C and T_threshold to be 60 °C, ΔT is
calculated as 20 °C. If e is 0.75 and the current temperature is over the current software
temperature threshold, the new software temperature threshold is adjusted
to 60 + (1 − 0.75) · 20 = 65 °C. As a result, freq_max is decreased to the next lower frequency level.
In this example, freq_max is adjusted from 1600 MHz to 1400 MHz when T_threshold
is changed. The lower freq_max in turn decreases the
overall temperature. With this scheme, adjusting the temperature threshold
keeps the overall system temperature under control. Also, the processor frequency
can be efficiently adjusted at runtime, while taking into account the current thermal
condition and the previous frequency. Fig. 3 shows the difference in freq_max between the
cases with and without DTM. The DTM scheme can select a lower frequency
than other DVFS schemes that do not manage temperature, because freq_max is
changed through the temperature threshold. Fig. 4 shows the effect on temperature
with and without DTM. The proposed scheme prevents
the system temperature from reaching a dangerous level by controlling freq_max and
maintaining the temperature within the steady state.
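One iteration of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the implementation used in the experiments: the frequency table below the 1600 MHz and 1400 MHz levels, the value standing in for MIN, the clamping of the table index, and the clamping of e to 1 are all assumptions, as are the names `DtmState`, `dtm_step`, and `efficiency_factor`.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; only the 80 degree emergency
# temperature and the 1600/1400 MHz levels appear in the text.
FREQ_TABLE_MHZ = [1600.0, 1400.0, 1200.0, 1000.0, 800.0]  # Table[index]
T_EMERGENCY = 80.0
MIN_MARGIN = 5.0  # stands in for MIN in Algorithm 1


def efficiency_factor(decode_times, t_interval):
    """Equation (3.13), clamped to 1.0 (the clamp is an assumption)."""
    e = sum(decode_times) / (len(decode_times) * t_interval)
    return min(e, 1.0)


@dataclass
class DtmState:
    t_threshold: float
    index: int = 0
    freq: float = FREQ_TABLE_MHZ[0]


def dtm_step(state, t_current, e, alpha):
    """One loop iteration of Algorithm 1, applied once per GOP."""
    if t_current > state.t_threshold:
        delta_t = T_EMERGENCY - state.t_threshold
        # T'_threshold replaces T_threshold.
        state.t_threshold += (1.0 - e) * delta_t
        state.index = min(state.index + 1, len(FREQ_TABLE_MHZ) - 1)
    elif t_current < state.t_threshold - MIN_MARGIN:
        state.index = max(state.index - 1, 0)
    else:
        return state  # temperature in the dead band: keep the current setting
    freq_max = FREQ_TABLE_MHZ[state.index]
    state.freq = (1.0 - alpha) * state.freq + alpha * freq_max
    return state


state = DtmState(t_threshold=60.0)
dtm_step(state, t_current=62.0, e=0.75, alpha=0.25)
print(state)
```

Keeping the table index and the threshold in one small state object makes it clear that the controller has memory: each GOP starts from the frequency and threshold left by the previous one.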
[Figure 4: temperature (Celsius) versus time (sec), with and without DTM.]
Fig. 4. Temperature comparison with and without DTM
F. Experimental Results and Analysis
To demonstrate the benefits of our control algorithm, we compare three schemes
in terms of the number of missed frames, frames per second (fps), and the variance
of temperature. The DYN-MB scheme is the feedback controller with the display
buffer, and the DYN-GOP scheme is the feedback controller with the display buffer and the
GOP information. Finally, DYN-DTM is the feedback controller based on the previous
GOP information with DTM. Although DYN-GOP and DYN-DTM both use the information
of the previous GOP, only DYN-DTM supports dynamic thermal management.
Fig. 5(a) and 5(b) show the temperature variance for two movies, Star Wars 3
and Terminator 3, which have higher-complexity data than the other movies. It
is observed that the DYN-DTM scheme controls the temperature more precisely than
the other two schemes. This is because the thermal control in DYN-DTM uses
the efficiency of the frequency and the temperature threshold. Another noticeable result is
that DYN-DTM keeps the peak temperature at least 12% lower than the other
schemes. As shown in Fig. 5(a), there are high-complexity scenes in
the first part and the last part of this movie, while the middle part has relatively
lower complexity. Therefore, DYN-DTM performs efficiently in the first and the last
parts while maintaining the software temperature threshold at 70 °C. Fig. 5(b) also
shows that DYN-DTM outperforms the other two schemes even with multiple high-complexity
scenes located in the middle of the movie. As a result, the proposed DYN-DTM
scheme reduces the overall temperature by up to 13% by using information from the
previously decoded GOP together with dynamic thermal management. The most noticeable
merit of this scheme is that it keeps the temperature under the threshold for all frames
without dropping any frames at all.
G. Conclusions
In this work, we proposed a method to find a proper frequency using an advanced
feedback controller for the available decoding and display buffer based on the
information of the previous GOP. Also, our scheme efficiently adjusts the frequency
using a frequency efficient factor, while keeping all frames from being dropped and
maintaining thermal safety. We have implemented the proposed scheme on Linux and
conducted benchmark tests. Experimental results show that the proposed method
does not drop any frames while the temperature is kept under the threshold. In other
words, the proposed scheme offers a solution for thermal constraints without any
quality degradation for MPEG-4 decoding.
(a) Star Wars 3 (b) Terminator 3
Fig. 5. Variance of temperature of high-complexity movies
(a) Under World 1 (b) Gilmore Girls
Fig. 6. Variance of temperature of mid- and low-complexity movies
CHAPTER IV
THERMAL-AWARE SCHEDULING BASED ON STATISTICAL
CHARACTERISTICS OF MULTIMEDIA APPLICATIONS
Dynamic Voltage and Frequency Scaling (DVFS) is a common method to control
temperature in microprocessors. However, because multimedia applications
handle data with different frame sizes and types, it is not easy to meet their QoS
requirements while keeping temperature under control. There have been a handful of studies on
temperature management for multimedia applications [7, 8, 9, 10, 4, 11]. However,
according to our observations, these schemes tend to overestimate or underestimate
multimedia application requirements, which inevitably leads
to high operating temperature or performance degradation. In order to balance
the tradeoff between QoS and thermal control, in this work, we first derive the
characteristics of various multimedia applications by transmitting MPEG-4 and
H.264/AVC streams encoded at two different frame resolutions. The application
characteristics can be represented by cycle demand, which is the number of cycles required
to decode a frame. Using this representation, we estimate an adequate processor
speed for a multimedia application to decode frames at runtime. Then, we
propose the Thermal-Aware Scheduler (TAS), which selects optimal frequencies to avoid
thermal emergencies while minimizing performance degradation in embedded
environments. To achieve this goal, TAS integrates the DVFS feature into a traditional
soft real-time scheduler.
TAS can also be classified as a hybrid scheme that integrates proactive and
reactive approaches. On the proactive side, TAS estimates temperature parameters
according to the applications' workload before running multimedia applications.
Since those temperature parameters are tied to the processor and act as
architecture-specific factors, future temperature can be predicted more accurately
using them. On the reactive side, TAS uses the cycle demand distribution as the
application characteristic of multimedia applications, and this historical
information helps decide the optimal frequency for decoding the next frames. As a
result, TAS finds an adequate frequency based on the cycle demand distribution and
predicts the future temperature using thermal parameters profiled according to
workload, which is a better solution than using only a reactive or only a
proactive scheme.
We experimented on an Intel Pentium-M processor and an Intel Atom processor using
various multimedia data encoded by MPEG-4 and H.264/AVC. Compared to feedback
control DTM [10], frame-based DTM [12], and the cycle counter-based scheduler [31],
TAS lowers average temperature by 6 °C and peak temperature by 10 °C or more, with
a maximum frame drop ratio of 5%. Moreover, we also compare the temperature
predicted from application thermal behavior to the temperature reported by thermal
sensors in Linux while playing movies.
The main contributions of this research are summarized as follows:
• We estimate multimedia applications' thermal characteristics using statistical
approaches that suit various multimedia codecs, with only 2.5% error on average.
• TAS provides soft real-time performance guarantees with statistical processor
allocations. It meets almost all deadlines for decoding and displaying frames in a
lightly loaded real environment, and bounds the deadline miss ratio under the
application-specific performance requirement (e.g., meeting 95% of deadlines) in a
heavily loaded environment.
• Compared to previous DTMs such as feedback control [10], frame-based [12],
and the cycle counter scheduler [31], our proposed TAS lowers temperature by 10 °C
on average when decoding and displaying MPEG-4 and H.264/AVC data under a 5% frame
drop ratio.
• Although other statistical DVFS algorithms [32, 33, 23] assume that the task
decoding frames dominates the computation of a multimedia application and that the
task displaying decoded frames is negligible, TAS considers all tasks related to
multimedia applications at runtime and further reduces operating temperature and
frame drops by considering both task-based and system-based characteristics.
• We provide a hybrid estimation scheme that combines a reactive approach using
statistical cycle demand information with a proactive approach for system
temperature behavior under a given workload.
A. The Problems of Multimedia Applications Processing
A display buffer with enough space for several frames lets a system determine the
optimal frequency without frame misses. Lu et al. use a feedback controller to
adjust the frequency so that the number of decoded frames in the display buffer
stays within a region specified by {Bl, Bh}, where Bl is the lower threshold for
the number of frames in the buffer and Bh is the higher threshold. Using the
feedback controller for decoding frames into the display buffer assumes that the
decoder speed is adequate as long as the number of frames in the display buffer is
within the specified region.
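The buffer-occupancy rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the controller of Lu et al.: it uses a simple one-step-up, one-step-down rule, and the function name, the level indexing, and the threshold values are all hypothetical.

```python
def buffer_feedback_step(occupancy, freq_index, b_low, b_high, num_levels):
    """Return the next DVFS level index for one control step.

    occupancy    : decoded frames currently waiting in the display buffer
    freq_index   : current level (0 = slowest, num_levels - 1 = fastest)
    b_low, b_high: the {Bl, Bh} occupancy thresholds
    All names and the step-by-one policy are assumptions for illustration.
    """
    if occupancy < b_low:
        # Buffer is draining: decode faster.
        return min(freq_index + 1, num_levels - 1)
    if occupancy > b_high:
        # Buffer is filling up: decode slower.
        return max(freq_index - 1, 0)
    # Occupancy inside [Bl, Bh]: the current speed is assumed adequate.
    return freq_index
```

Note that the rule reacts only to occupancy; it has no input for the cycle demand of upcoming frames or for temperature.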
However, there are two serious problems in the feedback controller with the
display buffer. The first problem is that the feedback controller using the display
Fig. 7. The timing gap between decoding and displaying data using the buffer management
buffer does not satisfy the deadline of all frames. The frequency is adjusted
according to the number of frames in the display buffer within the region specified
by {Bl, Bh}. This problem occurs with movies containing highly complex scenes in
succession, such as Star Wars 3 and Terminator 3. For example, a high frequency may
be required even though the occupancy of the display buffer seems sufficient for
decoding the upcoming frames. In such cases, frames will be dropped if the optimal
frequency is derived only from the display buffer occupancy. Also, since the
occupancy of the display buffer is not tied to accurate information such as the
executed cycle demand or workload, the feedback control scheme cannot provide a
solution for decoding and displaying frames under thermal control. With feedback
control, the frequency can be adjusted by buffer management, but the required
frequency may be overestimated. As a result, the operating temperature can rise
higher than expected.
As shown in Fig. 7, even if the input for frame6 arrives at time t6, the display
time for frame6 is t9. Hence, the decoding time for frame6 is larger than the
difference between t6 and t7. This means that the decoding task for frame6 needs
high computation power from time t6 onward. With buffer management, however, the
high frequency for frame6 is determined only at time t8, because the decision
depends on buffer occupancy. Therefore, adjusting the frequency using only the
buffer management scheme is infeasible. To determine the frequency more precisely
for multimedia applications, we use the cycle demand, C, to denote the minimum
number of cycles required to meet a frame deadline (which is determined by the
frame rate in frames per second). The parameter C represents application
characteristics and reflects the workload at runtime. In this work, we estimate
the parameter C from statistical information on Instructions Per Cycle (IPC) and
the number of instructions executed for each frame.
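As a small worked illustration of the cycle demand C, the sketch below derives the cycles for one frame from an instruction count and an IPC value, and then the lowest clock rate that would finish those cycles within one frame period. The function names and the sample numbers (12 million instructions, IPC of 1.5, 25 fps) are assumptions chosen for the example, not measurements from this work.

```python
def cycle_demand(instructions, ipc):
    """Cycles consumed by one frame: instructions divided by IPC."""
    if ipc <= 0:
        raise ValueError("IPC must be positive")
    return instructions / ipc


def min_frequency_hz(cycles, fps):
    """Lowest clock rate that completes `cycles` within one frame period (1/fps s)."""
    return cycles * fps


# Hypothetical frame: 12 million instructions at an IPC of 1.5.
c = cycle_demand(instructions=12_000_000, ipc=1.5)   # 8,000,000 cycles
f = min_frequency_hz(c, fps=25)                      # 200,000,000 Hz (200 MHz)
```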
The second problem is that the feedback controller using the display buffer cannot
control the temperature to guarantee thermal safety. The display buffer scheme
decides the optimal frequency using only its occupancy, without considering
temperature constraints. This is a critical problem in embedded systems, since most
embedded systems do not have cooling systems such as a fan. Therefore, a new
approach is required to control temperature without quality degradation. Although
the frequency can be adjusted by buffer management, it may overestimate the
required frequency. Moreover, since the response time of buffer management to the
decoding task is too long, it cannot provide an immediate solution for adjusting
frequency.
B. The Workload of Multimedia Applications
Statistical approaches using DVFS have been proposed to deal with demand
variations by considering the probability distribution of the CPU demands of
individual tasks. Each task makes a scheduling decision to change the CPU frequency
using statistical information when a new task starts, and the frequency is held at
the same
[Figure: four panels titled "CPU utilization for multimedia application", each plotting workload (CPU utilization, 0 to 100) against time (sec, 50 to 70) with curves for overall workload, decoder workload, and display workload: (a) MPEG-4 data (800X600), (b) H.264/AVC data (800X600), (c) MPEG-4 data (1280X720), (d) H.264/AVC data (1280X720)]
Fig. 8. The workload of decoding and displaying multimedia data according to several codecs and frame resolutions
speed for the whole job [33, 32, 23]. The main approach of TAS is based on these
statistical DVFS schemes, but differs from them in three respects. First, TAS uses
simple calculation-based online profiling to estimate the demand distribution from
Performance Monitoring Counters (PMC), while the other approaches use complex
estimation approaches or cycle counters to measure the cycle demand of each task.
Second, TAS estimates the overall frequency for the multiple tasks of a multimedia
application, which consists of at least three tasks, each with its own cycle
demand. Therefore, all related tasks in multimedia applications should be
considered when estimating the proper frequency at runtime. In contrast, PACE
assumes a single task or treats all concurrent tasks as a joint workload [32].
Also, estimation based on the cycle counter in the process control block (PCB)
cannot provide accurate characteristics of multimedia applications.
Finally, TAS supports the latest multimedia data formats across various codecs and
frame resolutions. Other research has focused on the MPEG-4 codec and small frame
resolutions such as 320 X 240 pixels or 640 X 272 pixels. However, the latest
multimedia data formats, encoded by the H.264/AVC codec at the high-definition (HD)
video frame resolutions supported by HDTV and Blu-ray technology, require much more
complex computations for decoding frames, and their QoS must be guaranteed to
higher standards. Decoding MPEG-4 multimedia data with a small frame resolution, as
shown in Fig. 8(a), requires little CPU computation, but MPEG-4 data with a large
frame resolution, as shown in Fig. 8(c), requires relatively heavy CPU computation.
Multimedia data encoded by H.264/AVC requires even more CPU computation, as shown
in Fig. 8(b) and 8(d). Since heavy CPU computation may raise operating temperature
and cause many frame drops, a more accurate estimation of the cycles demanded by
multimedia applications across codecs and frame resolutions is required to
guarantee QoS under thermal safety.
C. Thermal-Aware Scheduling for Multimedia Applications
Although most thermal management schemes have been based on a coarse-grained
approach using feedback control of the display buffer, a fine-grained approach
using more accurate frame and GOP information should be considered to find the
optimal frequency under thermal safety. Since reactive schemes rely on history
information, they cannot provide an immediate solution to avoid critical thermal
conditions. Proactive schemes, in turn, incur some overhead to profile the workload
of multimedia applications before their execution, even though they can predict the
future temperature. Moreover, unless temperature is managed while it is still low,
it is too late to control it once overheating happens. Therefore, we need a more
effective scheme that combines reactive and proactive schemes.
In this work, we propose the Thermal-Aware Scheduler (TAS) to integrate both
proactive and reactive schemes. With the proactive scheme, TAS estimates system
thermal characteristics according to workload before running multimedia
applications. Since the system thermal characteristics, expressed through thermal
parameters, depend on a specific processor or architecture, they can be measured
with a thermal model that includes the effect of workload. Moreover, since these
temperature parameters are determined by processor- and architecture-specific
factors, the future temperature can be predicted more accurately using them. With
the reactive scheme, we obtain at runtime the probability distribution of cycle
demand, which is the number of cycles required to decode a frame. The probability
distribution helps to decide an accurate frequency for decoding and displaying
frames in multimedia applications.
Fig. 9. TAS overview
As a result, these proactive and reactive schemes are used to determine an optimal
frequency for multimedia applications with negligible performance overhead while
controlling temperature.
As shown in Fig. 9, TAS comprises three components: an application characteristics
profiler as the reactive scheme, a thermal characteristics predictor as the
proactive scheme, and an optimal frequency adaptor. The application characteristics
profiler uses Instructions Per Cycle (IPC) and the number of instructions for each
frame, and automatically derives the probability distribution of their cycle
demands. The thermal characteristics predictor, based on application workload,
determines the temperature parameters used to predict the future temperature
dynamically. The optimal frequency adaptor adjusts the frequency based on the
application characteristics and the workload-dependent thermal parameters. Our
framework provides an efficient temperature management solution by integrating
cycle demand estimation, thermal prediction based on statistical characteristics,
and DVFS, which are performed by the application characteristics profiler, the
thermal characteristics predictor, and the frequency adaptor, respectively. We
describe the operations of each component in the following sections.
D. Application Characteristics Profiler for Multimedia Applications
The application characteristics profiler estimates the probability distribution of
cycle demands for decoding frames at runtime. We estimate the cycle demand
distribution to obtain more accurate multimedia computation requirements.
Therefore, compared to other thermal management schemes, we are able to guarantee
thermal safety under the desired temperature without overestimation. Moreover,
performance degradation is minimized by avoiding underestimation. With these
advantages, the cycle demand distribution provides statistical performance
guarantees [31], which are sufficient for MPEG-4 and H.264/AVC with various frame
resolutions under thermal control.
To estimate the cycle demand distribution of decoding frames at runtime, we need
two steps: the first step is to measure cycle usage from the Instructions Per
Cycle (IPC) and the number of instructions in a fixed window size, and the second
step is to derive the probability distribution at runtime. Although estimation
through offline profiling can be more accurate, offline profiling adds system
overhead, and it is not feasible for multimedia applications, whose workload
changes from frame to frame. To measure the cycle usage for decoding frames at
runtime, we exploit Performance Monitoring Counters (PMC) on Intel's Pentium-M and
Atom processors, and implement a monitoring module for instruction counter and IPC
measurement [34]. As a decoding step executes, the executed cycles are calculated
by Equation (4.1).
\[ C_i = \frac{I_i}{IPC_i}, \tag{4.1} \]
where Ci is the number of cycles used, Ii is the number of instructions for
decoding the ith frame, and IPCi is the IPC for decoding the ith frame obtained
from the PMC. Next, we derive the probability distribution of cycle demands in a
fixed window whose size equals the number of frames per second (fps). To do this,
we use a profiling window to keep track of the number of cycles consumed by n
frames. Even though the parameter n can be specified by the application, we set n
to the fps value. Let Cmin and Cmax be the minimum and maximum numbers of cycles,
respectively, in the window. In our environments, Cmin and Cmax are assumed to be
1 million cycles and 10 million cycles, because most multimedia applications meet
96% of frame decoding demands with no more than 9 million cycles, so 9 million
cycles per frame is taken as the maximum requirement for decoding in multimedia
applications [31]. We obtain a probability density function (pdf) and a cumulative
distribution function (cdf) using the following steps:
1. Let X be a continuous random variable. Then a probability distribution or
probability density function (pdf) of X is a function f(x) such that for any two
numbers a and b with a ≤ b,
\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx. \tag{4.2} \]
That is, the probability that X takes on a value in the interval [a, b] is the area
under the graph of the density function.
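2. When the cycle demands are grouped into discrete values, the probability p(x) of
each value plays the role of the pdf in Equation (4.2), and the cumulative
distribution function (cdf) F(x) of the discrete random variable X is defined for
every number x by
\[ F(x) = P(X \le x) = \sum_{y \le x} p(y). \tag{4.3} \]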
[Figure: "Cycle demand distribution", plotting cumulative probability (0 to 1) against cycles (millions, 1 to 10) with curves for MPEG-4 and H.264/AVC]
Fig. 10. The cumulative distribution function (cdf) of decoding frames in the multimedia application
3. Let the random variable X be the number of cycles for decoding a frame, with
pdf f(x), probability p, and cdf F(x). Using Equations (4.3) and (4.4),
\[ P(C_{min} \le X \le C_{max}) = \int_{C_{min}}^{C_{max}} f(x)\,dx, \tag{4.4} \]
where the interval [Cmin, Cmax] is split into equal-sized groups and
F(x) = P(X ≤ x) = \sum_{y \le x} p(y). We refer to c0, c1, · · · , cn, spaced
1 million cycles apart, as the group boundaries.
4. For decoding frames in multimedia applications, we estimate the probability
P(Cmin ≤ X ≤ Cmax) for MPEG-4 and H.264/AVC, as shown in Fig. 10.
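The steps above can be sketched as an empirical distribution over one profiling window. This is a minimal sketch under stated assumptions: the bounds of 1 and 10 million cycles and the 1 million cycle group size follow the text, while the function name, the clamping of out-of-range demands, and the sample window are hypothetical.

```python
import math

GROUP_SIZE = 1_000_000                  # group width, 1 million cycles
C_MIN, C_MAX = 1_000_000, 10_000_000    # assumed bounds from the text


def cycle_demand_cdf(cycles_per_frame):
    """Return (boundaries, pmf, cdf) for the frames of one profiling window.

    boundaries[k] is c_k; cdf[k] estimates P(X <= c_k). Demands outside
    [C_MIN, C_MAX] are clamped to the nearest bound (an assumption).
    """
    if not cycles_per_frame:
        raise ValueError("profiling window is empty")
    n = (C_MAX - C_MIN) // GROUP_SIZE
    boundaries = [C_MIN + k * GROUP_SIZE for k in range(n + 1)]
    counts = [0] * (n + 1)
    for x in cycles_per_frame:
        x = min(max(x, C_MIN), C_MAX)
        # Smallest k such that x <= c_k.
        k = math.ceil((x - C_MIN) / GROUP_SIZE)
        counts[k] += 1
    total = len(cycles_per_frame)
    pmf = [c / total for c in counts]
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return boundaries, pmf, cdf


# Hypothetical window of four frames (cycles per frame).
window = [1_500_000, 2_000_000, 2_200_000, 9_500_000]
boundaries, pmf, cdf = cycle_demand_cdf(window)
```

In a real profiler the window would hold one entry per frame of the last second, each computed from the instruction count and IPC of that frame.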
The decoding time and computational requirements differ across multimedia
applications and codecs such as MPEG-4 and H.264/AVC with various frame
resolutions. To satisfy these varying computational requirements, the frequency for
decoding frames should be decided by the cycle demand, based on the probability
requirement for decoding frames in the window. Specifically, let ρ be the
probability required for decoding frames in a window, so that every decoding task
for a frame must meet its deadline with probability ρ. To support this requirement,
Ck cycles should be allocated to all decoding tasks in the same window, i.e.,
\[ F(C_k) = P[X \le C_k] \ge \rho. \tag{4.5} \]
To determine the parameter Cρ for a task, we find the Cx whose cumulative
distribution is at least ρ, i.e., F(Cx) = P[X ≤ Cx] ≥ ρ. Since we assume the
probability ρ is 0.96, we take this Cx as the parameter Cρ. The demanded
probability, ρb, differs from the allocated probability, ρa. Even though ρb is what
is required to decode the frames in a window, the allocated cycles must fall on a
boundary of the equal-sized groups into which the range [Cmin, Cmax] is split.
Therefore, by Equation (4.5), the allocated cycles (Ca) and the allocated
probability (ρa) are slightly larger than demanded. The frequency fd for decoding
frames in a given window is obtained by Equation (4.6).
\[ f_d = \frac{C_\rho \times fps}{\Delta t}, \tag{4.6} \]
where fd is the frequency that covers the cycle demand for decoding frames in the
window, fps is frames per second, and the time interval ∆t is 1 sec. The number of
instructions demanded differs from frame to frame. Based on this observation, we
calculate the frequency (fd) by Equations (4.5) and (4.6). We determine an optimal
frequency from the number of instructions for decoding frames of real multimedia
data, and then calculate the allocated instructions based on the cdf F(x).
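Equations (4.5) and (4.6) can be illustrated with a short sketch that picks the smallest group boundary whose cumulative probability reaches ρ and converts it to a frequency. The function names, the sample cdf values, and the frame rate of 25 fps are assumptions made for the example; only ρ = 0.96 and ∆t = 1 sec come from the text.

```python
def allocate_cycles(boundaries, cdf, rho):
    """Return (C_rho, rho_a): the smallest boundary c_k with F(c_k) >= rho,
    and the cumulative probability actually covered at that boundary."""
    for c_k, f_k in zip(boundaries, cdf):
        if f_k >= rho:
            return c_k, f_k
    raise ValueError("cdf never reaches rho")


def decoding_frequency_hz(c_rho, fps, delta_t=1.0):
    """Equation (4.6): f_d = C_rho * fps / delta_t."""
    return c_rho * fps / delta_t


# Hypothetical cdf sampled at boundaries of 1 to 10 million cycles.
boundaries = [k * 1_000_000 for k in range(1, 11)]
cdf = [0.0, 0.10, 0.35, 0.60, 0.80, 0.90, 0.95, 0.98, 1.0, 1.0]

c_rho, rho_a = allocate_cycles(boundaries, cdf, rho=0.96)   # 8,000,000 and 0.98
f_d = decoding_frequency_hz(c_rho, fps=25)                  # 200,000,000.0 Hz
```

In this example the demanded probability is 0.96 but the allocated probability is 0.98, because the allocation must land on a group boundary.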
Although we find an optimal frequency for decoding frames, the additional system
workload generated by the operating system, such as scheduling overhead, file I/O
handling, and network monitoring, should be considered to guarantee performance in
real systems. The system workload occupies between 5% and 50% of the CPU, depending
on the assigned frequency. Therefore, the optimal frequency (fd) for decoding
frames should be adjusted to include the system workload.
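One possible way to fold the system workload into the decoding frequency is sketched below. The text does not give the adjustment formula, so the scaling by 1 / (1 - system share) is an assumption, as are the function names and the evenly spaced list of six frequency levels between 0.6 GHz and 1.6 GHz.

```python
# Assumed DVFS levels in MHz (evenly spaced; not taken from the source).
LEVELS_MHZ = [600, 800, 1000, 1200, 1400, 1600]


def adjust_for_system_load(f_d_mhz, system_share):
    """Scale f_d so the decoder still receives f_d after the system takes
    `system_share` of the CPU (assumed model: f = f_d / (1 - share))."""
    if not 0.0 <= system_share < 1.0:
        raise ValueError("system_share must be in [0, 1)")
    return f_d_mhz / (1.0 - system_share)


def pick_level(f_required_mhz, levels=LEVELS_MHZ):
    """Return the lowest available level that covers the requirement,
    or the highest level if none does."""
    for level in sorted(levels):
        if level >= f_required_mhz:
            return level
    return max(levels)


f_total = adjust_for_system_load(660, 0.25)   # 880.0 MHz
level = pick_level(f_total)                   # 1000 MHz
```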
E. Experimental Environments
Since thermal management is difficult to simulate, we have implemented and
evaluated TAS on real-world mobile products. To evaluate scalability, we conduct
our experiments on two different systems, as shown in Table I. For our experiments,
we have implemented Process Monitoring (PMON) and measured temperature using ACPI
on Linux. We used 16 multimedia data items encoded by MPEG-4 and H.264/AVC with
various frame resolutions, as shown in Table II and Table III.
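As an illustration of reading a temperature sensor from user space on Linux, the sketch below parses a thermal zone reading. It assumes the sysfs thermal interface found on current kernels, which reports millidegrees Celsius; the path and the function names are assumptions, and this is not necessarily the ACPI interface used in these experiments.

```python
from pathlib import Path


def parse_millidegrees(text):
    """Convert a sysfs reading such as '45000\n' to degrees Celsius."""
    return int(text.strip()) / 1000.0


def read_temperature_c(zone_path="/sys/class/thermal/thermal_zone0/temp"):
    """Read one thermal zone; the default path is an assumption and may
    differ or be absent on a given system."""
    return parse_millidegrees(Path(zone_path).read_text())
```

A monitoring loop would call read_temperature_c() at a fixed interval and log the value next to the current frequency.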
Also, we measured the number of instructions and IPC using the Performance API
(PAPI), which builds on the performance counters found in most major
microprocessors [34]. These counters exist as a small set of registers that count
events, that is, occurrences of specific signals related to the processor's
function. Monitoring these events facilitates correlation between the structure of
source/object code and the efficiency of the mapping of that code to the
underlying architecture.
To demonstrate the flexibility of TAS, we conduct experiments
Table I. Experimental systems description
                      System I                   System II
Processor             Intel Pentium-M 730        Intel Atom N270
Memory size           1 GB                       1 GB
LCD resolution        1600 X 1200                1280 X 800
Maximum frequency     1.6 GHz                    1.6 GHz
Minimum frequency     0.6 GHz                    0.8 GHz
Scaling level         6 levels                   4 levels
Operating system      SUSE 11.1                  SUSE 11.1
                      (Kernel Version: 2.6.27)   (Kernel Version: 2.6.27)
Table II. The experimental multimedia data (Standard Definition)
Name    Encoded codec   Title                Frame resolution   Number of frames
SD-M1   MPEG-4          Star Wars III        800 X 600          10,000
SD-M2   MPEG-4          Terminator 3         800 X 600          10,000
SD-M3   MPEG-4          24 hours             800 X 600          10,000
SD-M4   MPEG-4          Eragon               800 X 600          10,000
SD-H1   H.264/AVC       The heartbreak kid   800 X 600          4,000
SD-H2   H.264/AVC       300                  800 X 600          4,000
SD-H3   H.264/AVC       Apocalypto           800 X 600          4,000
SD-H4   H.264/AVC       Beowulf              800 X 600          4,000
Table III. The experimental multimedia data (High Definition)
Name    Encoded codec   Title                Frame resolution   Number of frames
HD-M1   MPEG-4          Star Wars III        1280 X 720         10,000
HD-M2   MPEG-4          Terminator 3         1280 X 720         10,000
HD-M3   MPEG-4          24 hours             1280 X 720         10,000
HD-M4   MPEG-4          Eragon               1280 X 720         10,000
HD-H1   H.264/AVC       The heartbreak kid   1280 X 720         4,000
HD-H2   H.264/AVC       300                  1280 X 720         4,000
HD-H3   H.264/AVC       Apocalypto           1280 X 720         4,000
HD-H4   H.264/AVC       Beowulf              1280 X 720         4,000
based on various multimedia data with different frame resolutions on two different
platforms. For an application with a fluctuating workload, we use Mplayer to play
the "Transformers" video clip. Note that Mplayer generates two threads during
execution: one is the X windows daemon, which maintains a workload of about 30%,
and the other decodes frames, with a workload that fluctuates between 40% and 70%.
F. Experimental Results and Analysis
1. The effect on performance overhead and thermal managements in Intel
Pentium-M processor
Fig. 11 shows the frame drop ratio of multimedia data encoded by MPEG-4 and
H.264/AVC under different thermal management schemes on the Pentium-M processor.
The feedback control and frame-based control schemes are labeled feedback and
frame. The cycle counter-based scheduler and TAS are labeled cycle counter and
TAS. The feedback control scheme uses a PI controller based on monitoring the
occupancy of the display buffer [10]. It also assumes that the frequency is
linearly subdivided into 40 discrete levels, which is not true in real systems.
Because of these linear frequency decisions, the actual frequency can be set
higher, and the operating temperature is then held at higher levels than with the
other schemes. The frame drop ratio, however, can be reduced, as shown in Fig. 11,
which implies that the feedback control scheme overestimates the required
frequency compared to the other schemes.
In contrast to the feedback control scheme, the frame-based scheme decides the
optimal frequency by considering the frame-dependent (FD) part of the decoding
process and the frame-independent (FI) part, which covers the dithering and
display steps for
[Figure: four bar-chart panels of normalized frame drops (0 to 10) per multimedia data item for feedback, frame, cycle counter, and TAS: (a) MPEG-4 (800 X 600), SD-M1 to SD-M4; (b) H.264/AVC (800 X 600), SD-H1 to SD-H4; (c) MPEG-4 (1280 X 720), HD-M1 to HD-M4; (d) H.264/AVC (1280 X 720), HD-H1 to HD-H4]
Fig. 11. The frame drop of standard definition multimedia data encoded by MPEG-4 and H.264/AVC in Intel Pentium-M processor
decoded frames [12]. Although the FD time varies considerably depending on the
frame type, the FI time is nearly constant across the given frames. This implies
that the FI time depends on the frame resolution of the given movie stream, which
is constant for the same movie. Therefore, the frame-based control scheme has a
higher frame drop ratio in movies with the high definition frame resolution
(1280 X 720) than with the standard definition frame resolution (800 X 600), as
shown in Fig. 11(c) and Fig. 11(d). Also, the scheme assumes that there is no
display buffer, i.e., a frame must be decoded and displayed within a given time
determined by the frame rate. However, it does not consider the system workload,
such as the workload generated by the operating system, file I/O, and several
daemon processes, even though the system workload should be one of the factors
that determine the optimal frequency in real systems. With frequency decisions
based on FD and FI time information, operating temperature can be held at lower
levels than with the other schemes, as shown in Fig. 12(e), Fig. 12(f),
Fig. 12(g), Fig. 12(h), Fig. 13(e), and Fig. 13(g).
The cycle counter-based scheduler determines the proper frequency based on the
statistical information of several frames [31]. In particular, thermal control
based on statistical information overcomes the disadvantages of the feedback
control and frame-based control schemes by estimating the cycle demand for
decoding frames relatively accurately. Moreover, unlike the frame-based control
scheme, the cycle counter-based scheduler avoids the potential overhead of
frequent frequency changes. This is why the cycle counter-based scheduler has
superior thermal management compared to the feedback control scheme. However, it
shows a higher frame drop ratio than the other schemes, as shown in Fig. 11(b) and
Fig. 11(d), because the cycle counter scheme takes no account of the display task
and the system workload. Therefore, it underestimates the optimal frequency, since
the cycle counter of the decoding task alone provides insufficient information.
TAS is also based on the statistical information of several frames, but it takes
into account all tasks related to multimedia applications. With this approach, TAS
keeps temperature lower while reducing frame drops in multimedia applications. In
particular, the display task as well as the decoding task should be treated as an
important factor in the frequency decision for multimedia data with the high
definition frame resolution (1280 X 720), as shown in Fig. 11(c) and Fig. 11(d).
Therefore, TAS overcomes the disadvantage of the cycle counter-based scheduler and
maintains better performance than the other schemes. Also, since TAS determines
the frequency more accurately, its temperature is lower than that of the other
schemes in almost all results, as shown in Fig. 12 and Fig. 13.
Compared to the other schemes, TAS reduces the peak temperature by 6 °C while
reducing frame drops by 22.9% on average. Although the temperature under TAS is
slightly higher than under the frame-based control scheme, the frame drop ratio is
reduced by up to 82.6% compared to the frame-based control scheme.
2. The effect on performance overhead and thermal managements in Intel Atom
processor
The Intel Atom processor is based on an entirely new microarchitecture developed
specifically to target performance and low power for embedded systems [35]. Also,
since data transfer through its low-power-optimized front side bus is faster than
on Intel's Pentium-M processor, its operating temperature can be lower than that
of the Intel Pentium-M processor. Although temperature on the Atom processor can
be kept lower than on the Pentium-M processor, the overall frame drop ratio across
all experiments on the Atom processor is higher than on the Intel Pentium-M
processor, as shown in Fig. 14. There are two reasons. First, the scaling range
for DVFS in the Atom processor is smaller than in the Pentium-
48
0 50 100 150 200 250 300 350 400 450
55
60
65
70
time (sec)
te
m
pe
ra
tu
re
 (C
eli
su
s)
 
 
feedback control
frame control
cycle counter
TAS
(a) Star Wars 3 (MPEG-4)
0 50 100 150 200 250 300 350 400 450
55
60
65
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(b) Terminator 3 (MPEG-4)
0 50 100 150 200 250 300 350 400 450
54
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(c) 24 hours (MPEG-4)
0 50 100 150 200 250 300 350 400 450
54
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(d) Eragon (MPEG-4)
0 50 100 150 200 250 300 350 400 450
54
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(e) Star Wars 3
(H.264/AVC)
0 50 100 150 200 250 300 350 400 450
54
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(f) Terminator 3
(H.264/AVC)
0 50 100 150 200 250 300 350 400 450
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
 
 
feedback control
frame control
cycle counter
TAS
(g) 24 hours (H.264/AVC)
0 50 100 150 200 250 300 350 400 450
56
58
60
62
64
66
68
70
time (sec)
te
m
pe
ra
tu
re
 (C
els
ius
)
feedback control
frame control
cycle counter
TAS
(h) Eragon (H.264/AVC)
Fig. 12. Resulting temperatures with feedback, frame, cycle counter, and TAS in the
standard definition multimedia data in Intel Pentium-M processor (frame res-
olution : 800 X 600)
49
[Figure 13: eight line plots of temperature (Celsius) versus time (sec), each comparing
feedback control, frame control, cycle counter, and TAS. Panels: (a) The heartbreak kid
(MPEG-4), (b) Apocalypto (MPEG-4), (c) 300 (MPEG-4), (d) Beowulf (MPEG-4),
(e) The heartbreak kid (H.264/AVC), (f) Apocalypto (H.264/AVC), (g) 300 (H.264/AVC),
(h) Beowulf (H.264/AVC).]
Fig. 13. Resulting temperatures with feedback, frame, cycle counter, and TAS in the
high definition multimedia data in Intel Pentium-M processor (frame resolution:
1280 X 720)
[Figure 14: four bar charts of normalized frame drops (scale 0 to 10) per multimedia data
item on the Atom processor, each comparing feedback, frame, cycle counter, and TAS.
Panels: (a) MPEG-4 (800 X 600), items SD-M1 to SD-M4; (b) H.264/AVC (800 X 600), items
SD-H1 to SD-H4; (c) MPEG-4 (1280 X 720), items HD-M1 to HD-M4; (d) H.264/AVC
(1280 X 720), items HD-H1 to HD-H4.]
Fig. 14. The frame drop of standard definition multimedia data encoded by MPEG-4
and H.264/AVC in Intel Atom processor
M processor. Therefore, the effects of overestimation or underestimation can be much
more serious than on the Pentium-M processor. The second is that the overall performance
of the Atom processor is lower than that of the Pentium-M processor, even at the same
frequency.
In the experiments for high definition multimedia data in Intel Pentium-M, temperature
control using the cycle counter-based scheduler and TAS performs best in almost all
results, as shown in Fig. 15 and Fig. 16. This is because the cycle counter-based
scheduler and TAS determine a more accurate frequency based on statistical information
about the cycle demand for decoding frames. However, the cycle counter-based scheduler
can underestimate the required frequency for multimedia applications whenever a long
decoding time is required, as in high definition multimedia data. Because it lacks an
overall estimate that covers both the decoding and the display tasks, the underestimated
frequency increases the performance overhead of the overall system. Therefore, the frame
drop ratio of the cycle counter-based scheme is higher than that of the other schemes,
as shown in Fig. 14(c) and Fig. 14(d). Moreover, estimating the frequency without
accounting for the workload of displaying frames makes the frame drop ratio larger than
in the other schemes. Also, since the cycle counter-based scheme depends on the decoding
time without any temperature information, it is unable to predict the future temperature
or the thermal characteristics.
Compared to other schemes, TAS reduces the peak temperature by 6 °C and reduces frame
drops by 27.3% on average. Although the temperature under TAS is slightly higher than
under the frame-based control scheme, the frame drop ratio is reduced by up to 72.2%
compared to the frame-based control scheme.
In summary, TAS derives statistical information from the cycle demand, which is obtained
from the IPC and the number of instructions. Based on the statistical information for
the previous frames, this scheme calculates the currently required frequency and also
looks ahead to see whether the future frequency should be increased or decreased. This
scheme can also adjust the frequency accordingly for movies with rapid changes. The
appropriate frequencies for different decoding times can be predicted precisely with the
IPC and the number of instructions, which depend on the processor. This also allows TAS
to operate at lower temperature levels than other thermal management schemes. TAS, based
on the application thermal characteristics, lowers the temperature by about 4 °C on
average and reduces the peak temperature by about 6 °C compared to previous thermal
management schemes. Since TAS also keeps the frame drop ratio to at most 5% in multimedia
applications, TAS outperforms the previous thermal management schemes for multimedia
applications, such as the feedback control scheme, the frame-based scheme, and the cycle
counter scheduler.
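The frequency selection summarized above can be illustrated with a minimal sketch. It is not the TAS implementation: it assumes that the cycle demand of a frame is the instruction count divided by the IPC, that the required frequency is a quantile of past cycle demands divided by the frame period, and that the processor offers a discrete set of frequency levels. The function names, the 0.95 quantile, the counter samples, and the frequency levels are all hypothetical.

```python
import math


def cycle_demand(instructions, ipc):
    """Cycles needed to decode one frame: instruction count divided by IPC."""
    return instructions / ipc


def required_frequency_hz(demands, frame_period_s, quantile=0.95):
    """Frequency that decodes the chosen share of past frames within one frame period."""
    ordered = sorted(demands)
    # Nearest-rank quantile: smallest demand that covers the requested share of frames.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1] / frame_period_s


def pick_level(levels_hz, needed_hz):
    """Lowest available frequency level that meets the requirement, else the highest."""
    for level in sorted(levels_hz):
        if level >= needed_hz:
            return level
    return max(levels_hz)


# Hypothetical per-frame counter samples: (instructions retired, IPC).
samples = [(12e6, 1.2), (15e6, 1.0), (9e6, 1.5), (18e6, 0.9)]
demands = [cycle_demand(n, ipc) for n, ipc in samples]
needed = required_frequency_hz(demands, frame_period_s=1 / 25)  # 25 frames per second
chosen = pick_level([600e6, 800e6, 1000e6, 1200e6, 1600e6], needed)
```

Choosing the lowest level that still meets the demand is what keeps both the temperature and the frame drop ratio low in this sketch; a lower quantile would trade more dropped frames for a cooler processor.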
G. Conclusions
In this work, we propose the Thermal-Aware Scheduler (TAS), which uses both application
characteristics, represented by the probability distribution of the cycle demand to
decode a frame, and the system thermal model augmented by the effect of workload. Our
experimental results show that the distribution of cycle demands in various codecs
affects temperature directly as an application workload. This implies that the overall
temperature can be predicted and controlled by the optimal frequency to decode
frames for any type of multimedia data. Also, the TAS scheme uses the application
thermal characteristics based on statistical information of cycle demands, which allows
it to estimate the future temperature within a 2.5% prediction error on average compared
to the temperature measured by a thermal sensor. Therefore, TAS provides more
accurate estimation and more efficient temperature management than other
schemes such as the feedback control scheme, the frame-based scheme, and the cycle
counter scheme.
[Figure 15: eight line plots of temperature (Celsius) versus time (sec), each comparing
feedback control, frame control, cycle counter, and TAS. Panels: (a) Star Wars 3 (MPEG-4),
(b) Terminator 3 (MPEG-4), (c) 24 hours (MPEG-4), (d) Eragon (MPEG-4),
(e) Star Wars 3 (H.264/AVC), (f) Terminator 3 (H.264/AVC), (g) 24 hours (H.264/AVC),
(h) Eragon (H.264/AVC).]
Fig. 15. Resulting temperatures with feedback, frame, cycle counter, and TAS in the
standard definition multimedia data in Intel Atom processor (frame resolution:
800 X 600)
[Figure 16: eight line plots of temperature (Celsius) versus time (sec), each comparing
feedback control, frame control, cycle counter, and TAS. Panels: (a) The heartbreak kid
(MPEG-4), (b) Apocalypto (MPEG-4), (c) 300 (MPEG-4), (d) Beowulf (MPEG-4),
(e) The heartbreak kid (H.264/AVC), (f) Apocalypto (H.264/AVC), (g) 300 (H.264/AVC),
(h) Beowulf (H.264/AVC).]
Fig. 16. Resulting temperatures with feedback, frame, cycle counter, and TAS in the
high definition multimedia data in Intel Atom processor (frame resolution:
1280 X 720)
CHAPTER V
PREDICTIVE DYNAMIC THERMAL MANAGEMENT FOR MULTICORE
SYSTEMS
Starting with this chapter, we describe how to manage temperature in Chip Multiprocessors
(CMPs). CMPs have already become the main trend in new-generation processors. A CMP
includes multiple cores within a single die to increase the microprocessor's
performance. However, the increased complexity and decreased feature sizes have caused
very high power density in modern processors. The dissipated power is converted into
heat, and processors are pushing the limits of packaging and cooling solutions. The
increased operating temperature potentially affects system reliability. Moreover,
leakage power increases exponentially with operating temperature. Increasing leakage
power can further raise the temperature, resulting in thermal runaway [1]. Hence, there
is a need to control temperature at all levels of system design.
Recently, many hardware- and software-based Dynamic Thermal Management (DTM) techniques
[1, 3] have been proposed. With the exception of [4], they are reactive in the sense
that they start to control the temperature only after the current temperature reaches
the critical temperature threshold. DTM schemes can be characterized as temporal or
spatial. Temporal management schemes, such as Dynamic Frequency Scaling (DFS), Dynamic
Voltage Scaling (DVS), and clock gating, slow down the CPU computation to reduce heat
dissipation. Although they can effectively reduce temperature, they incur significant
performance overhead. On the other hand, spatial management schemes, such as
thread migration, can reduce the temperature without throttling the computation
[17]. However, the thermal effect of neighboring cores and the thermal behavior of
applications are not considered in prior works. Due to the packaging technology in CMPs,
the temperature of each core is affected by the other cores. The temperature differential
between cores can be as much as 10 to 15 °C [3]. There are also significant variations
in thermal behavior among different applications [3, 4].
Motivated by these facts, we propose Predictive Dynamic Thermal Management (PDTM) in
the context of multicore systems. Our PDTM scheme utilizes an advanced future temperature
prediction model for each core to estimate the thermal behavior, considering both core
temperature and application temperature variations, and then takes appropriate measures
to avoid thermal emergencies. To the authors' best knowledge, no prior attempt has been
made to implement a temperature prediction model along with thermal-aware scheduling on
a real four-core product under a Linux environment. The experimental results on Intel's
Quad-Core system running two SPEC CPU 2006 benchmarks simultaneously show that PDTM
lowers temperature by about 5% on average and reduces the peak temperature by up to
3 °C, with at most 8% performance overhead compared to the Linux standard scheduler
without DTM. Moreover, to validate the presented PDTM, we also rebuilt HRTM [17]; when
running a single benchmark, our PDTM outperforms HRTM, reducing average temperature by
about 7%, performance overhead by 0.15%, and peak temperature by about 3 °C.
The main contributions of this work are summarized as follows:
• We propose an advanced future temperature prediction model for multicore
systems with only 1.6% error on average.
• We demonstrate that our scheme outperforms the existing DTM schemes (HRTM
and HybDTM) and provides thermal fairness among cores in a CMP system.
• The proposed PDTM incurs low performance overhead: only 1% when running a
single benchmark, and 8% when running two benchmarks simultaneously.
• Most importantly, no additional hardware unit is required for our prediction
model and thermal-aware scheme. This means that our model and scheme are
scalable to all multicore systems and can be applied to real-world CMP
products.
A. Predictive Thermal Model
In this section, we present a thermal model to predict the future temperature at
any point during the execution of a specific application. The model is based on our
observation that the rate of change in temperature during the execution of an
application depends on the difference between the current temperature and the steady
state temperature of the application¹. Moreover, the thermal behavior differs among
applications. Since the system temperature is affected by both each application's
thermal behavior and each processor's thermal pattern, we define the application-based
thermal model and the processor-based thermal model in this work.
1. The application-based thermal model in CMP systems
The Application-based Thermal Model (ABTM) accommodates the short-term thermal
behavior in order to predict the future temperature at a fine granularity. As shown
in Fig. 17, there are rapid temperature changes even when the workload is constant at
100%. Specifically, this model first derives the thermal behavior from local intervals
(short-term temperature reactions) and then predicts the future temperature by
incorporating this behavior into a regression-based approach known as the Recursive
Least Square Method (RLSM).
¹ The steady state temperature of an application is defined as the temperature the
system would reach if the application were executed indefinitely.
[Figure 17: line plot of temperature (Celsius) versus time (sec) from 0 to 300 s.]
Fig. 17. Real temperature of one core on running bzip2 benchmark
In the general least-squares problem, the output of a linear model y is given by the
linearly parameterized expression
$$y = \theta_1 f_1(u) + \theta_2 f_2(u) + \cdots + \theta_n f_n(u), \tag{5.1}$$
where $u = [u_1, u_2, \cdots, u_n]$ is the model's input vector, $f_1, \ldots, f_n$ are
known functions of $u$, and $\theta_1, \theta_2, \ldots, \theta_n$ are unknown parameters
to be estimated. In our study, the input $u$ is time and the output $y$ is the working
temperature. To identify the unknown parameters $\theta_i$, experiments usually have to
be performed to obtain a training data set composed of data pairs
$\{(u_i ; y_i),\ i = 1, \cdots, m\}$. Expressed in matrix notation, the following
equation can be obtained: $Y = X\theta$, where $X$ is an $m \times n$ matrix:
$$X = \begin{bmatrix} f_1(u_1) & \cdots & f_n(u_1) \\ \vdots & \ddots & \vdots \\ f_1(u_m) & \cdots & f_n(u_m) \end{bmatrix}, \tag{5.2}$$
$\theta$ is an $n \times 1$ unknown parameter vector:
$$\theta = [\theta_1, \theta_2, \ldots, \theta_n]^T, \tag{5.3}$$
and $Y$ is an $m \times 1$ output vector:
$$Y = [Y_1, Y_2, \ldots, Y_m]^T. \tag{5.4}$$
If $X^T X$ is nonsingular, the least square estimator can be derived as
$$\theta = (X^T X)^{-1} X^T Y. \tag{5.5}$$
Denote the $i$th row of the joint data matrix $[X : Y]$ by $[X_i^T : Y_i]$. Suppose that
a new data pair $[X_{k+1}^T : Y_{k+1}]$ becomes available as the $(k+1)$th entry in the
data set. To avoid recalculating the least squares estimator using all input and output
data samples, let $P_k = (X^T X)^{-1}$ for the first $k$ entries in Equation (5.5). The
recursive least square update for the $(k+1)$th entry can then be developed as
$$P_{k+1} = P_k - \frac{P_k x_{k+1} x_{k+1}^T P_k}{1 + x_{k+1}^T P_k x_{k+1}}, \tag{5.6}$$
where $y_{k+1}$ is the new output $Y_{k+1}$ and $x_{k+1}$ is the corresponding input
vector $X_{k+1}$, and
$$\theta_{k+1} = \theta_k + P_{k+1} x_{k+1} (y_{k+1} - x_{k+1}^T \theta_k), \tag{5.7}$$
[Figure 18: diagram of the $\Delta t$ calculation with ABTM.]
Fig. 18. The calculation of $\Delta t$ (migration time) using ABTM
where the matrix P is an intermediate variable in the algorithm. Eventually, we obtain
the future temperature, yn, from the application's thermal behavior using the current
θ vector. Detailed descriptions of the Least Square Method and the Recursive Least
Square Method can be found in the literature [36]. With Equation (5.1), ABTM can predict
the future temperature for an application, as shown in Fig. 18.
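The recursive update of the parameter vector and of the matrix P can be sketched in a few lines. This is an illustration under stated assumptions, not the dissertation's implementation: the basis functions f1(u) = 1 and f2(u) = u (a local linear trend), the large initial value of P, and the synthetic temperature samples are all hypothetical. The denominator uses the standard recursive least-squares form, 1 + xᵀPx.

```python
def basis(t):
    """Hypothetical basis functions f1(u) = 1 and f2(u) = u (local linear trend)."""
    return [1.0, t]


def rls_update(theta, P, x, y):
    """One recursive least-squares step: update P first, then theta."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    # P stays symmetric, so P x x^T P equals the outer product of Px with itself.
    P_next = [[P[i][j] - Px[i] * Px[j] / denom for j in range(n)] for i in range(n)]
    error = y - sum(x[i] * theta[i] for i in range(n))
    gain = [sum(P_next[i][j] * x[j] for j in range(n)) for i in range(n)]
    theta_next = [theta[i] + gain[i] * error for i in range(n)]
    return theta_next, P_next


def predict(theta, t):
    """Evaluate the fitted linear model at time t."""
    return sum(th * f for th, f in zip(theta, basis(t)))


# Synthetic samples: temperature rising by 0.5 degrees per second from 60 degrees.
theta = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]  # large initial P, meaning little trust in the initial theta
for t in range(10):
    theta, P = rls_update(theta, P, basis(float(t)), 60.0 + 0.5 * t)

predicted = predict(theta, 12.0)  # short-term prediction 2 s past the last sample
```

Each update costs a fixed amount of work regardless of how many samples have been seen, which is why the recursive form suits a prediction that runs inside a scheduler.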
2. The core-based thermal model in CMP systems
The Core-based Thermal Model (CBTM) builds on the heat transfer equations that model
the steady state temperature of systems with heat sources [37]. It has been observed in
those models that the temperature approaches the steady state exponentially, starting
from any initial temperature. In other words, the rate of temperature change is
proportional to the difference between the current temperature and the steady state
temperature [37]. We initially assume that the steady state temperature of the
application is known. Later we will relax this constraint.
Let $T_{ss}$ be the steady state temperature of an application. Let $T(t)$ represent
the temperature at time $t$ and let $T_{init}$ be the temperature when an application
starts execution ($T(0) = T_{init}$). The prediction model assumes that the rate of
change of temperature is proportional to the difference between the current temperature
and the steady state temperature of the application [30]. Thus
$$\frac{dT}{dt} = b \times (T_{ss} - T). \tag{5.8}$$
Solving Equation (5.8) with $T(0) = T_{init}$ and $T(\infty) = T_{ss}$, we get
$$T(t) = T_{ss} - (T_{ss} - T_{init}) \times e^{-bt}, \tag{5.9}$$
where $b$ is a processor-specific constant. The value of $b$ is determined using
Equation (5.8) by observing the heating and cooling curves of all SPEC CPU 2006
benchmarks on the core. Also, since the value of $b$ varies with the amount of
workload, $b$ should be determined according to the workload on each processor. Running
several benchmarks, we obtained $b = 0.009$ when the workload is 100%. We precompute the
steady state temperature of an application offline. By rearranging Equation (5.9), we
get the steady state temperature $T_{ss}$ of the application:
$$T_{ss} = \frac{T(t) - T_{init} \times e^{-bt}}{1 - e^{-bt}}. \tag{5.10}$$
Therefore, with Equations (5.9) and (5.10), we get the future temperature after time
$t$ and the steady state temperature, $T_{ss}$, of each core.
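Equations (5.9) and (5.10) translate directly into code. The sketch below uses the reported value b = 0.009 for a 100% workload; the function names and the example temperatures (54 °C initial, 80 °C steady state) are hypothetical and chosen only to show that the two equations are inverses of each other.

```python
import math

B_FULL_LOAD = 0.009  # value of b reported in the text for a 100% workload


def cbtm_temperature(t, t_init, t_ss, b=B_FULL_LOAD):
    """Equation (5.9): temperature t seconds after starting from t_init."""
    return t_ss - (t_ss - t_init) * math.exp(-b * t)


def steady_state(temp_at_t, t, t_init, b=B_FULL_LOAD):
    """Equation (5.10): steady state temperature from one observation at t > 0."""
    decay = math.exp(-b * t)
    return (temp_at_t - t_init * decay) / (1.0 - decay)


# Hypothetical core: starts at 54 degrees and would settle at 80 degrees.
after_100s = cbtm_temperature(100.0, t_init=54.0, t_ss=80.0)
recovered_ss = steady_state(after_100s, 100.0, t_init=54.0)
```

Note that Equation (5.10) is undefined at t = 0, where no heating has been observed yet, so the observation must be taken some time after the application starts.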
3. The predictive thermal model
Our approach, which aims to characterize the thermal contribution of each individual
processor, uses ABTM and CBTM at runtime as the inputs to the overall thermal
model to directly estimate the future temperature. For each application, we exploit
both short-term (ABTM) and long-term (CBTM) future temperature values to prevent the
Ping-Pong effect². The application-based temperature, $T_{app}$, predicts the
transient variations in application temperature, which include the temperature
contribution during the period the application runs on the core before being migrated
to another core. On the other hand, the core-based temperature, $T_{core}$, is
calculated as the temperature aggregated by workload. The overall predictive
temperature is then given as:
$$T_{predict} = w_s T_{app} + w_l T_{core}, \tag{5.11}$$
where $T_{predict}$ is the overall predictive temperature, $w_s$ is the weighting
factor of ABTM, and $w_l$ is the weighting factor of CBTM. Note that $w_s$ and $w_l$
should be adjusted according to the application workload. Since the benchmarks we used
in this study maintain a 100% workload most of the time, we find that the optimal values
for $w_s$ and $w_l$ are 0.7 and 0.3, respectively, based on our experimental results.
Instead of calculating $T_{predict}$ continuously, the calculation is triggered by a
predefined temperature trigger threshold. After a core's temperature exceeds the
temperature trigger threshold, PDTM detects the processes that exceed the workload
threshold and applies the appropriate ABTM. This approach reduces the overhead of the
prediction calculations. In order to properly predict any future temperature, we need to
know $\Delta t$ beforehand. This value is the time interval required for the current
ABTM to reach the temperature threshold of the next stage in Fig. 18.
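A small sketch of the weighted prediction and of the time interval to the next threshold follows. The weights 0.7 and 0.3 are the values reported above; everything else is an assumption made for illustration: the ABTM is taken to be a linear fit with a hypothetical intercept and slope, the 60 °C trigger is the value used later in the experiments, and the function names are invented.

```python
W_S, W_L = 0.7, 0.3  # weights reported in the text for a near-100% workload
T_TRIGGER = 60.0     # trigger threshold in degrees Celsius (from the experiments)


def predict_temperature(t_app, t_core, w_s=W_S, w_l=W_L):
    """Equation (5.11): weighted sum of the ABTM and CBTM predictions."""
    return w_s * t_app + w_l * t_core


def time_to_threshold(intercept, slope, now, threshold):
    """Interval until a linear ABTM fit T(t) = intercept + slope * t reaches threshold.

    Returns None when the fitted temperature is not rising.
    """
    if slope <= 0:
        return None
    return max(0.0, (threshold - intercept) / slope - now)


def maybe_predict(current_temp, t_app, t_core):
    """Run the prediction only once the trigger threshold has been reached."""
    if current_temp < T_TRIGGER:
        return None
    return predict_temperature(t_app, t_core)


# Hypothetical fit: 60 degrees at t = 0, rising by 0.5 degrees per second.
delta_t = time_to_threshold(intercept=60.0, slope=0.5, now=10.0, threshold=70.0)
t_predict = maybe_predict(current_temp=65.0, t_app=70.0, t_core=66.0)
```

Gating the calculation on the trigger threshold keeps the prediction off the common path, which is the overhead reduction the text describes.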
B. PDTM Scheduler
The Linux standard scheduler is designed to balance two opposing goals: response
time and throughput. Interactive processes, such as shells, must achieve a satisfactory
response time. On the other hand, CPU-intensive programs need to ensure throughput. To
meet these goals on multicore systems, the Linux standard scheduler rarely migrates a
process to another core. This is mainly because an active process benefits from state
cached on its current core, such as TLB entries and cache contents [38]. However, when
the workload is noticeably unbalanced, the Linux standard scheduler initiates process
migrations despite the migration overhead. Even then, the Linux standard scheduler does
not take the temperature behavior into account. To resolve this issue, PDTM enables the
scheduling policy to accommodate the temperature behavior as well as workloads in a
multicore environment.
² A process can be migrated among several cores very frequently.
[Figure 19: block diagram of Predictive DTM with three components: monitoring, future
temperature prediction model, and thermal-aware scheduler.]
Fig. 19. System overview
Our PDTM consists of three main components, as shown in Fig. 19. In the monitoring
part, the application workload (CPU utilization) is monitored so that the Linux standard
scheduler can migrate applications to balance the workload. However, the Linux scheduler
is not aware of temperature. PDTM uses the Digital Thermal Sensor (DTS) to read the
temperature at runtime. The measured temperature is then used in the future temperature
prediction model.
As shown in Algorithm 2, PDTM determines that migration is necessary when the predicted
temperature exceeds the migration threshold ($T_{tmt}$). When the current temperature
($T_{cur}$) reaches the temperature trigger threshold ($T_{ttt}$), ABTM calculates
$\Delta t_m$, the time remaining until the migration threshold is reached. PDTM then
calculates the future temperature after $\Delta t_m$ for the other cores via ABTM and
CBTM. The core with the minimum future temperature in T[] is selected as the new core
for migration. As shown in Fig. 20, the goal is to find the core that our prediction
indicates will be coolest after $\Delta t_m$. If the predicted temperature, $T_{pred}$,
also reaches the priority scheduling temperature ($T_{pst}$), the priority of the
application is lowered in addition to the migration.
Algorithm 2 PDTM scheduler algorithm
  T_cur ← CalcT(process_i)
  for T_cur ≥ T_ttt do
    ∆t_m ← ABTM^{-1}(T_tmt)
    for j = 1 to MAX_cores do
      T_cbtm ← CBTM(∆t_m)
      T_abtm ← ABTM(∆t_m)
      T[j] ← ω_s · T_abtm + ω_l · T_cbtm
    end for
    Migrated_Core ← MIN_CORE(T[])
    T_pred ← MIN_TEMP(T[])

    if Current_Core ≠ Migrated_Core then
      MIGRATION(process_i → Migrated_Core)
    end if

    if T_pred ≥ T_pst then
      Decrement_priority(process_i)
    else
      Increment_priority(process_i) until priority = 0
    end if
  end for
[Figure 20: diagram of the short-term (ABTM) and long-term (CBTM) predictions.]
Fig. 20. PDTM utilizes ABTM and CBTM simultaneously to predict both short-term
and long-term future temperature for multicore
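One pass of the scheduler decision in Algorithm 2 can be sketched as follows. This is a simplified illustration rather than the kernel implementation: the ABTM is assumed to be a linear fit, each core is described by a hypothetical pair of current and steady state temperatures, and the function returns a decision instead of performing the migration or changing the priority. The thresholds are those of Table IV, and the weights and b are the values reported earlier.

```python
import math

# Thresholds from Table IV (degrees Celsius) and constants reported in the text.
T_TRIGGER, T_MIGRATE, T_PRIORITY = 60.0, 70.0, 82.0
W_S, W_L, B = 0.7, 0.3, 0.009


def abtm(theta, t):
    """Short-term model; a linear fit T(t) = theta[0] + theta[1] * t is assumed."""
    return theta[0] + theta[1] * t


def abtm_inverse(theta, now, threshold):
    """Seconds until the linear fit reaches the threshold (0 if already there)."""
    return max(0.0, (threshold - theta[0]) / theta[1] - now)


def cbtm(temp_now, t_ss, dt):
    """Long-term model, Equation (5.9), evaluated dt seconds ahead."""
    return t_ss - (t_ss - temp_now) * math.exp(-B * dt)


def pdtm_step(current, cores, theta, now):
    """Return (target core index, priority action), or None if no action is needed.

    cores is a list of (current temperature, steady state temperature) pairs.
    """
    if cores[current][0] < T_TRIGGER or theta[1] <= 0:
        return None  # below the trigger threshold, or the application is not heating
    dt = abtm_inverse(theta, now, T_MIGRATE)
    t_app = abtm(theta, now + dt)
    predicted = [W_S * t_app + W_L * cbtm(temp, t_ss, dt) for temp, t_ss in cores]
    target = min(range(len(cores)), key=lambda j: predicted[j])
    action = "lower" if predicted[target] >= T_PRIORITY else "raise"
    return target, action


# Hypothetical snapshot of four cores; the process currently runs on core 0.
cores = [(66.0, 85.0), (58.0, 80.0), (56.0, 72.0), (57.0, 78.0)]
decision = pdtm_step(0, cores, theta=(56.0, 0.5), now=20.0)
```

A caller would migrate the process only when the returned target differs from the current core, mirroring the check in Algorithm 2.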
ABTM is capable of predicting the future temperature in the short term by tracking the
application's thermal behavior, and is regarded as a fine-grained scheme. However, since
frequent local temperature differences significantly affect its temperature predictions,
ABTM is less able to predict thermal behavior in the long run. On the other hand,
although CBTM lacks the ability to track each application's thermal behavior, this model
is capable of foreseeing long-term thermal behavior by using the workload constant and
the current working temperature of a core. CBTM is therefore regarded as a
coarse-grained scheme. In summary, PDTM takes advantage of both ABTM and CBTM in order
to predict local application thermal behavior as well as to track the core behavior.
The accuracy of our prediction model helps determine the future coolest core for
migration.
[Figure 21: three line plots of temperature (Celsius) versus time (sec) for Core 1 to
Core 4. Panels: (a) Without DTM, (b) HRTM, (c) PDTM.]
Fig. 21. Comparisons among without DTM, HRTM, and PDTM using libquantum
benchmarks
[Figure 22: three line plots of temperature (Celsius) versus time (sec) for Core 1 to
Core 4. Panels: (a) Without DTM, (b) HRTM, (c) PDTM.]
Fig. 22. Comparisons among without DTM, HRTM, and PDTM using bzip2 and
libquantum benchmarks
C. Experimental Results and Analysis
To measure the working temperature of multicore systems through the Digital Thermal
Sensor (DTS), we developed a dedicated driver that accesses the sensors at runtime. On a
CMP silicon die, each core has its own thermal sensor that triggers independently. The
trigger point of these thermal sensors is not programmable by software since it is set
during the fabrication of the processor [19]. In the experiments, we set the temperature
trigger threshold to 60 °C to start PDTM, and the migration threshold to 70 °C to
migrate applications when the predicted temperature exceeds the migration threshold.
The priority scheduling threshold is 82 °C. When the predicted temperature reaches the
priority scheduling threshold, the priority of the application is lowered. Our
implementation parameters are provided in Table IV. All experiments are run under
ambient temperature control and a fixed fan speed.
Table IV. Environment parameters
Parameter                        Value (°C)
Initial Temperature              54
Trigger Threshold                60
Migration Threshold              70
Priority Scheduling Threshold    82
Table V. List of benchmark sets
Benchmarks           Temperature    Memory Usage
perlbench+hmmer      Low            Low
perlbench+bzip2      Low            High
libquantum+hmmer     High           Low
libquantum+bzip2     High           High
1. Digital thermal sensor for Intel quad-core
In Intel's Core architecture, the DTS can be accessed through a Machine Specific
Register (MSR). The value in the MSR is an unsigned number in units of degrees Celsius
(°C). We use the IA32_THERM_STATUS register to read the temperature of each core;
within this register, the DTS value is stored in a 7-bit field. We obtain the
temperature of each of the four cores by Equation (5.12):
$$T_{core} = T_{junction} - DTS_{value}, \tag{5.12}$$
where $T_{junction}$ is a value specified by the manufacturer, Intel.
[Figure 23: bar chart titled "Performance overhead" showing execution time (sec, 0 to
1600) for the benchmarks perlbmk, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libqt, h264ref,
and astar under w/o DTM, HybDTM, HRTM, and PDTM.]
Fig. 23. Performance overhead: PDTM incurs under 1% performance overhead
on average while running a single benchmark
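Equation (5.12) can be illustrated by decoding a raw register value. The sketch below operates on a value that has already been read; it does not access the hardware, and it is not the driver used in this work. It relies on Intel's documented layout, in which the digital readout occupies bits 22:16 of IA32_THERM_STATUS, which matches the 7-bit field mentioned above. The junction temperature of 100 °C and the raw value are hypothetical.

```python
DTS_SHIFT = 16   # the digital readout occupies bits 22:16 of IA32_THERM_STATUS
DTS_MASK = 0x7F  # 7-bit field, as stated in the text


def core_temperature(msr_value, t_junction):
    """Equation (5.12): core temperature from a raw IA32_THERM_STATUS value."""
    dts_value = (msr_value >> DTS_SHIFT) & DTS_MASK
    return t_junction - dts_value


# Hypothetical raw value: readout of 30 with the "reading valid" bit (bit 31) set.
raw = (1 << 31) | (30 << DTS_SHIFT)
temperature = core_temperature(raw, t_junction=100)
```

Because the sensor reports the distance below the junction temperature, a larger readout means a cooler core.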
2. Experimental results and analysis
To demonstrate PDTM, we conduct experiments with a single SPEC2006 benchmark and with
sets of two SPEC2006 benchmarks, as shown in Table V. When running a single benchmark,
the presented PDTM decreases the temperature by 8% on average (Fig. 21) and reduces the
peak temperature by up to 5 °C, with under 1% performance overhead compared to the
Linux standard scheduler without DTM, as shown in Fig. 23. When running two benchmarks
simultaneously, PDTM lowers the temperature by about 10% on average and reduces the
peak temperature by up to 3 °C, with under 8% performance overhead compared to the
Linux standard scheduler without DTM, as shown in Fig. 22. This means that PDTM controls
temperature more effectively than the Linux standard scheduler when temperature and
workload are higher.
In order to verify our scheme, we also rebuilt HybDTM [26] (its software scheme, which
changes process priority) and HRTM [17] on the Quad-Core system. HybDTM uses a
priority-based scheme and HRTM uses a migration-based scheme. The HybDTM scheme relies
on the hardware performance counters, while HRTM relies on the current temperature
information. The experimental results show that PDTM outperforms HRTM, reducing average
temperature by about 7%, performance overhead by 0.15%, and the peak temperature by
about 3 °C. In addition, the future temperature prediction model provides more accurate
prediction, with less than 1.6% error, whereas the estimation model introduced in
HybDTM has at most 5% average error. The main reason for the accuracy of the prediction
model is that we consider not only the core-based temperature at each core, but also the
application thermal behavior. Therefore, PDTM is able to maintain thermal fairness and
keep the overall temperature lower than other schemes, even in CPU-intensive situations.
D. Conclusions
In this work, we propose the Predictive Dynamic Thermal Management (PDTM) with
an advanced future temperature prediction model for multicore systems, and imple-
ment PDTM on Intel Quad-Core with a specific device driver to access the Digital
Thermal Sensor. We demonstrate that our scheme is able to reduce the overall tem-
perature and provide thermal fairness among four cores. The proposed temperature
prediction model can provide more accurate prediction and more efficient temperature
management by using ABTM and CBTM with lower performance overhead compared
to other schemes (HRTM and HybDTM). Most importantly, there is no additional
hardware unit required for our prediction models and scheduler.
CHAPTER VI
TEMPERATURE-AWARE SCHEDULER BASED ON THERMAL BEHAVIOR
GROUPING IN MULTICORE SYSTEMS
While manufacturing technology continues to improve, reducing the size of packages,
the physical limits of semiconductor-based microelectronics have become a major design
concern. Due to the demand for more capable microprocessors, methods such as
instruction-level parallelism (ILP) and thread-level parallelism (TLP) have been
proposed and employed in modern processors. Moreover, multiple independent CPUs have
become a common solution to increase the system's overall TLP in the current market.
The combination of increased available space, due to refined manufacturing processes,
and the demand for increased TLP is the logic behind the creation of Chip
Multiprocessors (CMPs).
Instead of pushing the limits of the processor's frequency, the demand for more capable
microprocessors must be satisfied by other methods. However, due to the decreased chip
size and increased power density, the dissipated power is converted into significant
heat, which threatens system performance and reliability and even increases leakage
power. The heat dissipation is pushing the limits of current packaging technology and
cooling solutions. Packages are designed for worst typical behavior and rely on Dynamic
Thermal Management (DTM) techniques to control the temperature. Therefore, the chip
design trend has shifted in the recent decade toward better power efficiency, lower
power density, and more effective thermal management.
In this work, we propose a proactive thermal-aware scheduler (TAS) that exploits this
variability in the context of multicore systems. The TAS scheme utilizes an advanced
future temperature prediction model for each core to estimate different thermal
behaviors and to measure the time remaining before each core reaches the desired
temperature threshold. Appropriate measures are then triggered to avoid thermal
emergencies based on the measured results. Although proactive schemes have been
proposed in [4, 5], the scheme in [4] is only applicable to multimedia applications,
since it predicts the temperature of the next frame based on the profile of the past
frames. The scheme in [5], on the other hand, fails to consider the differences in the
temperature increase pattern of different cores, because the steady state temperature
and the thermal parameter b are impractically assumed to be the same for every core
within a single chip. Most importantly, in [5], the authors propose to migrate tasks
from a potentially overheated core to the future coolest core based on the temperature
prediction results. However, we believe that the target core of task migration should
be the core that needs the longest time to reach the predefined temperature threshold,
because the temperature of the coolest core may increase faster than that of the others
due to thermal correlation effects and its own thermal increase pattern.
Therefore, we propose a simple and accurate prediction model that profiles applications’ thermal behavior and classifies them into several groups offline, and then measures the time remaining before each core reaches the desired temperature threshold. The proposed temperature-aware scheduler is scalable to any current multicore model and architecture with on-chip thermal sensors that can be accessed at the software level. We implement the advanced future temperature prediction model with the TAS strategy on an Intel Quad-Core Q6600 system. In experiments with the CPU-intensive SPEC CPU 2006 benchmarks, TAS maintains the system temperature below a given threshold using the proposed prediction model. Moreover, we demonstrate that the TAS scheme, based on simple parameters, can control the tradeoffs between throughput and thermal fairness. Compared to traditional schedulers employed in conjunction with DTM techniques, the temperature-aware scheduler can achieve higher throughput while maintaining QoS guarantees for soft real-time tasks with marginal loss in fairness among the best-effort tasks.
[Figure: steady state temperature (Celsius) of each core for the SPEC CPU 2006 benchmarks perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, astar, and xalancbmk]
Fig. 24. Tss according to SPEC CPU 2006 benchmark suite
The main contributions of this work are summarized as follows:
• We classify applications into thermal behavior groups using the K-means clustering method with the steady state temperature.
• We propose an efficient temperature-aware scheduler for multicore systems and implement it on an Intel Quad-Core Q6600 system and on a system with two Quad-Core Intel Xeon E5310 processors. We demonstrate that our scheme successfully reduces the overall temperature and provides thermal fairness among cores.
• Most importantly, no additional hardware unit is required for our temperature-aware scheduler. Our scheme is applicable to any multicore environment in real-world CMP products seamlessly.
A. Thermal Behavior Group
In this section, we describe how to classify applications into thermal behavior groups by Tss. We also introduce how to predict the future temperature and the time remaining before the predefined threshold is reached, using the thermal parameter b and the thermal behavior groups. We discuss the advanced future temperature prediction model, which estimates the different application thermal behaviors for each core and measures the time remaining before each core reaches the desired temperature threshold. Based on the prediction results, appropriate measures are triggered to avoid a thermal emergency. Instead of reacting to the current temperature, temperature control techniques should be triggered when a core is predicted to overheat in the near future, in order to keep the temperature under the desired threshold more effectively. Although proactive schemes have been proposed in [4, 5], the scheme in [4] is only applicable to multimedia applications, since it predicts the temperature of the next frame based on the profile of past frames. The scheme in [5], in contrast, fails to consider the difference in temperature-increase patterns among cores, because the steady state temperature and the thermal parameter b are impractically assumed to be the same for every core within a single chip. Most importantly, the authors of [5] propose to migrate tasks from a potentially overheated core to the future coolest core based on the temperature prediction results. However, we believe that the target core of task migration should be the core that needs the longest time to reach the predefined temperature threshold, because the temperature of the coolest core can increase faster than that of others due to thermal correlation effects and its own temperature-increase pattern.
[Figure: temperature (Celsius) versus time (sec) for 400.perlbench, 401.bzip2, 403.gcc, and 456.hmmer]
Fig. 25. Thermal behavior for Group A
1. Thermal behavior groups based on the applications’ thermal pattern
As shown in Fig. 24, Tss differs from benchmark to benchmark, although the CPU utilization of every benchmark is almost 100%. In order to manage temperature at runtime, accurate knowledge of applications’ thermal behavior is necessary. Comparing Tss and the thermal parameter b, we observe that Tss is more sensitive than b to the different thermal behaviors of applications. As shown in Fig. 25, applications’ thermal patterns are similar if their Tss values are similar. In this research, we classify the SPEC CPU 2006 benchmark applications by their Tss values into several thermal behavior groups using the K-means clustering method. K-means clustering is an algorithm that partitions n objects into k clusters based on their attributes, with k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, or squared error function, as follows:
V = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2    (6.1)
where there are k clusters S_i, i = 1, 2, ..., k, and \mu_i is the centroid or mean point of all the points x_j ∈ S_i. In our preliminary experiments with eleven SPEC CPU 2006 benchmarks, k = 5 is the optimal value for classifying applications into thermal behavior groups, as shown in Table VI.
[Figure: (a) temperature (Celsius) versus time (sec) for 400.perlbench and 462.libquantum running on the same core; (b) temperature (Celsius) versus time (sec) for one application running on core 1 and on core 3]
Fig. 26. The application thermal behavior according to applications and cores
For example, the 400.perlbench, 401.bzip2, 403.gcc, and 456.hmmer applications can be classified into the same group (Group A), which has a similar thermal pattern and Tss, as shown in Fig. 25. In our preliminary results, Tinit differs among the four applications in Group A, but Tinit does not affect an application’s thermal pattern or its Tss.
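The clustering step can be illustrated with a minimal sketch of K-means (Lloyd's algorithm) applied to the per-core Tss values listed in Table VI. The function names, the deterministic farthest-point seeding, and the iteration limit are assumptions made for illustration and are not taken from the experiments above, so the cluster labels need not match the k columns of Table VI.

```python
# Sketch of K-means on per-core Tss vectors. Seeding and iteration limit are
# assumptions for illustration, not the setup used in the experiments.

TSS = {  # steady state temperature (deg C) on cores 1..4, copied from Table VI
    "400.perlbench":  (83, 77, 74, 77),
    "401.bzip2":      (83, 77, 73, 77),
    "403.gcc":        (84, 76, 74, 77),
    "429.mcf":        (84, 80, 76, 78),
    "445.gobmk":      (82, 77, 73, 76),
    "456.hmmer":      (84, 77, 73, 77),
    "458.sjeng":      (83, 76, 72, 76),
    "462.libquantum": (92, 84, 81, 84),
    "464.h264ref":    (83, 74, 72, 74),
    "473.astar":      (84, 79, 74, 77),
    "483.xalancbmk":  (83, 74, 73, 76),
}


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def init_centroids(points, k):
    """Deterministic farthest-point seeding (an assumed choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids) after Lloyd iterations."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda i: sq_dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[i] = mean(members)
    return labels, centroids


def total_variance(points, labels, centroids):
    """Objective V of Equation (6.1)."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


if __name__ == "__main__":
    names = list(TSS)
    points = [TSS[n] for n in names]
    for k in (2, 5):
        labels, centroids = kmeans(points, k)
        print(k, total_variance(points, labels, centroids), dict(zip(names, labels)))
```

With k = 2, this sketch separates 462.libquantum from the other ten benchmarks, which agrees with the k = 2 column of Table VI.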
Table VI. The result of thermal behavior group using K-means clustering on 4-core system
SPEC CPU         Core 1  Core 2  Core 3  Core 4
Applications     Tss     Tss     Tss     Tss     k = 2  k = 3  k = 4  k = 5  k = 6  GROUP
400.perlbench    83 °C   77 °C   74 °C   77 °C   1      3      2      5      3      A
401.bzip2        83 °C   77 °C   73 °C   77 °C   1      1      2      5      3      A
403.gcc          84 °C   76 °C   74 °C   77 °C   1      1      2      5      3      A
429.mcf          84 °C   80 °C   76 °C   78 °C   1      1      1      3      4      D
445.gobmk        82 °C   77 °C   73 °C   76 °C   1      1      2      1      6      C
456.hmmer        84 °C   77 °C   73 °C   77 °C   1      1      2      5      3      A
458.sjeng        83 °C   76 °C   72 °C   76 °C   1      3      2      1      6      C
462.libquantum   92 °C   84 °C   81 °C   84 °C   2      3      2      4      1      E
464.h264ref      83 °C   74 °C   72 °C   74 °C   1      3      4      2      5      B
473.astar        84 °C   79 °C   74 °C   77 °C   1      1      1      3      4      D
483.xalancbmk    83 °C   74 °C   73 °C   76 °C   1      4      2      2      2      B
2. The region of the thermal behavior group
From the previous thermal equations [30], we obtain
T(t) = T_{ss} - (T_{ss} - T_{init}) \cdot e^{-bt}    (6.2)
Using Equation (6.2) and our measurements, we obtain Tss and b with the following steps:
1. We first run each SPEC CPU 2006 benchmark on each core until the temperature no longer changes, to obtain the respective steady state temperature.
2. Then, we calculate the thermal parameter b from Equation (6.2), using the real temperature read from the Digital Thermal Sensor (DTS) within a core.
Solving Equation (6.2) for the thermal parameter b gives Equation (6.3).
b = -\frac{1}{t} \log \frac{T_{ss} - T(t)}{T_{ss} - T_{init}},    (6.3)
where Tss is the steady state temperature, Tinit is the initial temperature, T(t) is the current temperature at time t, and log denotes the natural logarithm. As shown in Fig. 26(a), the thermal curve differs according to which application is running. Moreover, even when the same application is running, the thermal pattern also differs according to which core is used, as in Fig. 26(b).
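As an illustration of Equations (6.2) and (6.3), the following sketch predicts a temperature, recovers b from a single measurement, and computes the time remaining before a threshold is reached by solving Equation (6.2) for t. The function names and the numeric values in the usage example are hypothetical and are not measurements from our systems.

```python
import math


def predict_temperature(t, t_ss, t_init, b):
    """Equation (6.2): temperature after t seconds."""
    return t_ss - (t_ss - t_init) * math.exp(-b * t)


def thermal_parameter_b(t, temp_at_t, t_ss, t_init):
    """Equation (6.3): b from one measurement taken t seconds after the start."""
    return -math.log((t_ss - temp_at_t) / (t_ss - t_init)) / t


def time_to_threshold(t_threshold, t_ss, t_cur, b):
    """Seconds until the threshold is reached, from Equation (6.2) solved for t.

    Returns 0.0 if the core is already at or above the threshold, and
    math.inf if the steady state temperature does not exceed the threshold.
    """
    if t_cur >= t_threshold:
        return 0.0
    if t_ss <= t_threshold:
        return math.inf
    return -math.log((t_ss - t_threshold) / (t_ss - t_cur)) / b


if __name__ == "__main__":
    # Hypothetical values: Tss = 83 C, Tinit = 50 C, b = 0.02, threshold = 80 C.
    temp = predict_temperature(60.0, 83.0, 50.0, 0.02)
    print(temp, thermal_parameter_b(60.0, temp, 83.0, 50.0))
    print(time_to_threshold(80.0, 83.0, 50.0, 0.02))
```

The time to threshold, rather than the current temperature, is the quantity that the scheduler in the next section compares across cores.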
While we use the steady state temperature, Tss, for clustering, we need another metric to classify a new application at runtime. Since Tss is not available before the steady state is reached, we use only the temperatures measured while the application is running. In our observation, the thermal pattern can be divided into three regions, as shown in Fig. 27. Each region has a different thermal slope, which affects the rate of temperature increase at a given time.
[Figure: temperature versus time, rising from Tinit to Tss through a steep region with slope (a), a gentle region with slope (b), and a flat region with slope (c)]
Fig. 27. Slopes for the thermal pattern at runtime
As shown in Fig. 27, the slopes (a), (b), and (c) differ according to the time t. To calculate the slope of the operating temperature, we use a simple equation:
S_i = \frac{T(i + \Delta t) - T(i)}{\Delta t},    (6.4)
where S_i is the slope of the application’s thermal pattern in the ith region, T(i) is the previous temperature, T(i + ∆t) is the current temperature, and ∆t is a predefined time interval. Using the current temperature and the slope value, we can estimate an application’s current region at runtime. As mentioned above, applications in the same thermal group have similar thermal patterns, and the slopes of their regions are also similar. Based on these slope values and the thermal behavior groups, a temperature-aware scheduler estimates the future temperature more accurately and provides more effective dynamic thermal management.
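A minimal sketch of the slope calculation in Equation (6.4) follows. The region boundaries (0.5 and 0.05 °C per second) are hypothetical placeholders chosen only to make the example concrete; the actual boundaries would have to come from profiling each thermal behavior group.

```python
def slope(t_prev, t_cur, dt):
    """Equation (6.4): temperature slope over the interval dt (seconds)."""
    return (t_cur - t_prev) / dt


def region(s, steep_limit=0.5, gentle_limit=0.05):
    """Map a slope to a region of Fig. 27 (both limits are hypothetical)."""
    if s >= steep_limit:
        return "steep"
    if s >= gentle_limit:
        return "gentle"
    return "flat"


if __name__ == "__main__":
    # Hypothetical temperature pairs sampled 5 seconds apart.
    for prev, cur in ((60.0, 65.0), (80.0, 80.5), (83.0, 83.0)):
        s = slope(prev, cur, 5.0)
        print(prev, cur, s, region(s))
```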
B. Temperature-Aware Scheduler for Multicore Systems
Since the standard Linux scheduler is not aware of the operating temperature of cores in multicore environments, we propose a temperature-aware scheduler that exploits this variability in the context of multicore systems. Although a proactive scheme has been proposed in [5], it fails to consider the difference in temperature-increase patterns among cores, because Tss and the thermal parameter b are impractically assumed to be the same for every core within a single chip. Most importantly, the authors of [5] propose to migrate tasks from a potentially overheated core to the future coolest core based on the temperature prediction results. However, we believe that the target core of task migration should be the core that needs the longest time to reach the predefined temperature threshold, because the temperature of the coolest core can increase faster than that of others due to thermal correlation effects and applications’ thermal behaviors. The thermal behavior of each core and each application differs in Tss and the thermal parameter b. Therefore, migrating a task from a potentially overheated core to the coolest core can be unnecessary and improper.
As shown in Algorithm 3, we first profile the Tss of applications offline and then classify them into thermal behavior groups using the K-means clustering method. Whenever an application starts to run, its slope is calculated once a start threshold is triggered. From the slope and the execution time, t, the application can be classified into one of the thermal behavior groups. As a result, we acquire Tss and the thermal parameter b for the application from this thermal behavior group at runtime. Based on these values, the temperature-aware scheduler predicts each core’s future temperature and the time period before it reaches the desired temperature threshold. If the estimated time period is less than 2 seconds, the core is going to be overheated in the near future, and task migration should be triggered. The migration target is the other core that needs the longest time to reach the desired temperature threshold. Then, all the tasks on the potentially overheated core are migrated to the target core to balance the heat within the multicore environment.
Algorithm 3 Temperature-Aware Scheduler based on Grouping for Multicore Systems
Classify Tss into thermal behavior groups by K-means clustering
Tcur ← Access(DTS_Temperature)
slope_t ← Calculate(Tcur, Tprev) at time t
ThermalGroup_i ← Find(slope_t, t) for application_i
Tss ← Get(ThermalGroup_i)
b ← Get(ThermalGroup_i)
while Tcur ≥ Tgate do
    for j = 1 to MAXcores do
        Calculate time_period_est in current core
        if time_period_est ≤ 2 sec then
            TargetCore ← LongestTimeCore(T[])
            MIGRATION(process_i → TargetCore)
        end if
    end for
end while
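The target selection rule can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in our systems: the per-core state dictionary, the function names, and all numeric values are hypothetical, and the sketch ignores the fact that Tss and b of both cores change once a task has actually been migrated.

```python
import math


def time_to_threshold(t_threshold, t_ss, t_cur, b):
    """Seconds until the threshold is reached, from T(t) = Tss - (Tss - Tcur) e^{-bt}."""
    if t_cur >= t_threshold:
        return 0.0
    if t_ss <= t_threshold:
        return math.inf  # the core settles below the threshold
    return -math.log((t_ss - t_threshold) / (t_ss - t_cur)) / b


def schedule_step(cores, t_threshold, limit_sec=2.0):
    """Return (source, target) pairs for cores expected to overheat within limit_sec.

    cores maps a core id to {"t_cur": ..., "t_ss": ..., "b": ...} (hypothetical layout).
    The target is the other core with the longest time to reach the threshold.
    """
    remaining = {
        cid: time_to_threshold(t_threshold, c["t_ss"], c["t_cur"], c["b"])
        for cid, c in cores.items()
    }
    migrations = []
    for cid, secs in remaining.items():
        if secs <= limit_sec:
            others = {o: s for o, s in remaining.items() if o != cid}
            if others:
                migrations.append((cid, max(others, key=others.get)))
    return migrations


if __name__ == "__main__":
    # Hypothetical snapshot: core 2 is the coolest, but core 4 never reaches 85 C.
    cores = {
        1: {"t_cur": 84.8, "t_ss": 92.0, "b": 0.02},
        2: {"t_cur": 70.0, "t_ss": 90.0, "b": 0.03},
        3: {"t_cur": 74.0, "t_ss": 86.0, "b": 0.01},
        4: {"t_cur": 72.0, "t_ss": 84.0, "b": 0.02},
    }
    print(schedule_step(cores, t_threshold=85.0))
```

In this hypothetical snapshot the coolest core (core 2) is not chosen, because its steady state temperature and b imply that it would reach the threshold sooner than core 4.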
[Figure: temperature (Celsius) versus time (sec) of Core 1 to Core 4 under (a) the Linux Standard Scheduler, (b) the Thermal-Balancing Policy, (c) Predictive DTM (PDTM), and (d) the Temperature-Aware Scheduler]
Fig. 28. DTM evaluations in 4-core system using test group 2 (bzip2 + libquantum)
Table VII. Experimental systems descriptions
            System I                System II
Cores       4 cores                 8 cores
Processor   Intel Quad-Core Q6600   two Intel Quad-Core Xeon E5310
Memory      1 GB                    1 GB
OS          SUSE 10.3               RedHat Enterprise 4
C. Experimental Results and Analysis
In order to demonstrate the applicability of our temperature-aware scheduler to various applications, we utilize several thermal behavior groups classified by the K-means clustering method. In this work, we used twelve applications in the SPEC CPU 2006 benchmark suite for profiling. In our experiments, we choose bzip2 and libquantum from the SPEC CPU 2006 benchmark and vacation from the STAMP benchmark [39]. We select bzip2 and libquantum because they are CPU-intensive. vacation is a client/server travel reservation system benchmark that is appropriate for representing the demand for thermal control in server systems.
For comparison with other schemes, we also reimplement Predictive Dynamic Thermal Management (PDTM) [5] and the Thermal Balancing Policy (TBP) [6] on our systems. All experiments in this work are conducted under ambient temperature control, and the speed of the cooling fan is fixed.
1. 4-core system
To verify the temperature-aware scheduler on the 4-core system, two applications run simultaneously. As shown in Fig. 28, compared to the Linux Standard Scheduler, the temperature-aware scheduler reduces the peak temperature by up to 8 °C, whereas PDTM reduces the peak temperature by 2 °C and TBP increases it by 2 °C. In terms of performance overhead relative to the Linux Standard Scheduler, the temperature-aware scheduler incurs less than 12%, PDTM incurs 8%, and TBP incurs 35%. Since the temperature-aware scheduler selects the core that takes the longest time to reach the thermal threshold instead of the coolest core, our scheme reduces the number of migrations while providing better thermal balancing among cores compared to the other DTM schemes. Therefore, the temperature-aware scheduler based on thermal behavior grouping provides more effective temperature control for multicore systems.
2. 8-core system
In the 8-core system, the temperature-aware scheduler outperforms PDTM and TBP in both the effectiveness and the efficiency of temperature control. The temperature-aware scheduler reduces the peak temperature by 5 °C with 7.52% performance overhead compared to the Linux Standard Scheduler, while PDTM reduces the peak temperature by 4 °C with 13.2% performance overhead and TBP reduces it by 2 °C with 35% performance overhead. Although TBP also decreases the peak temperature and presents a smoother thermal pattern than the Linux Standard Scheduler, it causes an impractically large performance overhead. Moreover, the exchanged threads cannot effectively reduce core 1’s temperature. Since PDTM is not aware of the different thermal effects contributed by running applications, it cannot accurately predict the future temperature and react in time. Therefore, PDTM fails to keep the temperature under the desired level.
D. Conclusions
In this work, we propose a temperature-aware scheduler based on thermal behavior grouping in multicore systems. To classify applications according to their thermal behavior, we use the Tss value as the classification feature in the K-means clustering method. We observe that, between the thermal parameter b and Tss, Tss better explains an application’s thermal pattern. For process migration, the proposed temperature-aware scheduler selects the core that takes the longest time to reach a temperature threshold instead of the coolest core. To verify the temperature-aware scheduler, we implement it on two multicore systems: a 4-core system (Intel Quad-Core Q6600) and an 8-core system (two Quad-Core Intel Xeon E5310 processors). We demonstrate that the temperature-aware scheduler is able to reduce the overall temperature and provide thermal fairness among cores. By using thermal behavior grouping and selecting the core with the longest time to reach the threshold, the temperature-aware scheduler also provides more accurate prediction and more efficient temperature management with lower performance overhead compared to other schemes such as the Linux Standard Scheduler, Thermal-Balancing Policy, and Predictive DTM.
CHAPTER VII
A THERMAL MODEL BASED ON WORKLOAD CHARACTERISTICS USING
CDF
In this work, we propose Proactive Correlation-Aware Thermal Management (ProCATM), which incorporates three main components: a representative workload estimation, a future temperature estimation model, and a thermal-aware thread scheduler. The representative workload estimation utilizes the workload probability distribution to measure the workload behavior of each running thread locally and the workload behavior of each core globally. The representative workload is estimated using the cumulative distribution function (cdf) at runtime. Thus, the thermal impacts contributed by various threads are distinguished by the estimated representative workload. We further model the thermal correlation by profiling the thermal impacts from neighboring cores under a specific workload. Once the thermal behavior of each running thread is obtained and the thermal correlation is modeled for the neighboring cores, the future temperature estimation model estimates each core’s future temperature by taking both the thermal behaviors and the thermal correlation into account. Based on the estimated future temperatures, the thermal-aware thread scheduler moves the running thread from a potentially overheated core to the future coolest core (migration), or reduces the processor resources (priority scheduling) when migration is not possible, in order to avoid thermal emergencies and provide thermal fairness with negligible performance overhead in multicore systems.
A. A Representative Workload Estimation Based on CDF
In this section, we introduce a statistical model to estimate workload. To capture dynamic workload changes, we first define workload in terms of the execution time in a given time interval, and then we model a representative workload through a cumulative distribution function (cdf) and standard deviation based on workload history information.
1. The definition of workload
An application consists of a sequence of instructions to be executed. The execution time (tapp) of the application can be represented in terms of Cycles Per Instruction (CPI), the number of instructions executed, and the CPU frequency as follows [40]:
t_{app} = \frac{IC \cdot CPI}{f_c},    (7.1)
where IC is the dynamic instruction count, CPI is the average number of cycles per instruction, and fc is the operating frequency. Therefore, we can define workload (Wapp) by the execution time of the application, tapp. Although the Linux kernel provides CPU utilization, we exploit Performance Monitoring Counters (PMC) [34] to measure tapp more accurately in our experiments.
2. The statistical representative to estimate workload
Instead of using a simple average of Wapp, we use a representative workload that can capture the system dynamics at runtime. In this study, we derive the representative workload from the cumulative distribution function (cdf) of Wapp and its standard deviation. We treat Wapp as a random variable X with probability density function (pdf) f(x) and cdf F(x), so that
P(W_{min} \le X \le W_{max}) = \int_{W_{min}}^{W_{max}} f(x)\,dx,    (7.2)
where X lies in the interval [Wmin, Wmax] and F(x) = P(X ≤ x). Here Wmin means that there is no workload in the system, and Wmax means 100% workload. To satisfy various computational requirements, the representative should be determined by the probability requirement for the application workload, Wapp, in a given time period. Specifically, let ρ be the probability required for the application workload in a given time period. In our observations, even the dynamic workload of an application can be summarized by a representative workload defined by the probability ρ as follows:
P[X \le W_{app}] \ge \rho.    (7.3)
As shown in Fig. 29, we can obtain the representative workload from the cdf when playing a multimedia application. Moreover, in order to distinguish threads with stable workload behaviors from those with highly unstable workload behaviors, we consider the standard deviation, denoted as σ. In this study, we classify threads with σ less than 7.0 as threads with stable workload behaviors in our systems. We use ρ = 0.5 to represent the workload of these stable threads and, for thermal safety, ρ = 0.7 to represent threads with highly unstable workload behaviors:
\rho = \begin{cases} 0.5 & \text{if } \sigma < 7.0 \text{ (stable workload)}, \\ 0.7 & \text{if } \sigma \ge 7.0 \text{ (dynamic workload)}. \end{cases}    (7.4)
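The selection of the representative workload can be sketched with an empirical cdf over a window of workload samples. The sketch assumes that the representative is the smallest observed workload w whose empirical cdf value reaches ρ, in line with Equation (7.3), and that σ is the population standard deviation of the window; both choices, as well as the sample values, are assumptions made for illustration.

```python
import math
import statistics

SIGMA_LIMIT = 7.0  # stability boundary from Equation (7.4)


def choose_rho(samples):
    """Equation (7.4): 0.5 for stable workloads, 0.7 for dynamic ones."""
    sigma = statistics.pstdev(samples)  # population std. dev. (assumed choice)
    return 0.5 if sigma < SIGMA_LIMIT else 0.7


def representative_workload(samples):
    """Smallest sample w with empirical cdf F(w) >= rho, as in Equation (7.3)."""
    rho = choose_rho(samples)
    ordered = sorted(samples)
    # rank is the number of samples that must be <= w; the small epsilon
    # guards against floating-point error in rho * n.
    rank = math.ceil(rho * len(ordered) - 1e-9)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical workload histories in percent.
    stable = [50, 52, 51, 49, 50, 51, 50, 52, 49, 50]
    dynamic = [30, 40, 50, 60, 70, 80, 90, 35, 55, 58]
    print(choose_rho(stable), representative_workload(stable))
    print(choose_rho(dynamic), representative_workload(dynamic))
```

Using a higher ρ for unstable threads biases the representative toward the upper part of the observed workload range, which is the conservative choice for thermal safety.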
Here, we regard thread-level workload estimation as local and core-level workload estimation as global. For global workload estimation, the overall workload of a single core is also monitored at runtime. As at the thread level, the concepts of σ and ρ are adopted at the core level. Thus, the representative workload of each core is used to estimate the future temperature, as explained in the next section, and the different thermal effects contributed by different threads can also be distinguished by their representative workloads if multiple threads are running on a single core. Moreover, to control the temperature effectively with less performance overhead, we set 30% as the workload threshold. That is, ProCATM only controls threads with workloads higher than 30%, because threads with workloads under 30% affect the temperature by at most 2 °C in our systems.
[Figure: (a) Dynamic workload behaviors: workload (%) versus time (sec); (b) The representative workload calculated by cdf: probability versus workload (%)]
Fig. 29. The representative workload is 58% when the probability (ρ) is 0.7 in dynamic workload behavior
3. Thermal parameters in CMP systems
In order to provide a thermal model of a processor, we should consider the relationship between the temperature and the workload of applications on the processor. By modeling the power dissipation, more precise models can be derived from a simple model [30]. We start from Fourier’s Law of heat conduction, which states that the rate of heating or cooling is proportional to the difference in temperature between the object and the environment [30]. We define T(t) and P(t) as the temperature and power at time t, respectively. Then Fourier’s Law can be written as follows [41, 29]:
T'(t) = P(t) - b\,T(t),    (7.5)
where b is a positive constant representing the power dissipation rate. Now, we define f(t) as the processor frequency at time t. Since the power consumption of a processor is an increasing convex function of the frequency, power consumption can be represented in terms of frequency [41]. Most studies assume that power and processor frequency are related as follows:
P(t) = a\,f^{\alpha}(t),    (7.6)
for some constant a and α > 1. Assuming that T0 = 0 (the initial temperature is the ambient one), the solution of Equation (7.5) using Equation (7.6) can be presented as follows:
T(t) = \int_{t_0}^{t} P(\tau)\,e^{-b(t-\tau)}\,d\tau + T_0\,e^{-b(t-t_0)},    (7.7)
T(t) = \int_{t_0}^{t} a\,f^{\alpha}(\tau)\,e^{-b(t-\tau)}\,d\tau + T_0\,e^{-b(t-t_0)}.    (7.8)
Next, we consider two cases for the variation of temperature at any point t [30]. First, the case in which the temperature is non-decreasing can be derived from Equation (7.5) and Equation (7.6) as follows:
f(t) \ge \left( \frac{b\,T(t)}{a} \right)^{1/\alpha}.    (7.9)
Then, the case in which the temperature is non-increasing can be expressed as follows:
f(t) \le \left( \frac{b\,T(t)}{a} \right)^{1/\alpha}.    (7.10)
Therefore, we observe that the frequency can be scaled to change the temperature in the desired direction. Finally, we can derive the following equations if we keep the frequency constant at f(t) = fc during the time interval [t0, t]:
T(t) = \frac{a f_c^{\alpha}}{b} + \left( T(t_0) - \frac{a f_c^{\alpha}}{b} \right) e^{-b(t-t_0)},    (7.11)
\frac{dT}{dt} = -b \left( T(t) - \frac{a f_c^{\alpha}}{b} \right),    (7.12)
where fc is the current frequency of the processor. In order to determine the thermal parameters a and b, we assume α = 3.0 [41]. Although the values of a and b are processor-specific, b is more closely related to the application’s workload at runtime, because b can be affected by the cycles executed for applications. When we run an application indefinitely at the maximum CPU utilization and observe the heating and cooling curves, the thermal parameters can be determined using Equation (7.11). After a sufficiently long execution at the maximum CPU utilization, the steady-state temperature T(∞) = Tss can be observed. By setting T(t) = T, t0 = 0, T(t0) = Tinit, and a f_c^{\alpha} / b = Tss, Equations (7.11) and (7.12) are transformed as follows:
T = T_{ss} + (T_{init} - T_{ss})\,e^{-bt},    (7.13)
\frac{dT}{dt} = -b\,(T - T_{ss}),    (7.14)
where Tinit is the initial temperature.
Using Tss and sampling the temperature every millisecond, the rate of increase, dT/dt, is plotted against (T − Tss) at each point, following Equation (7.14). The resulting set of points is fitted to a straight line using least mean square error fitting. From Equation (7.14), the slope of this straight line is −b, which gives the value of b. In order to measure a and b more accurately, we should understand the meaning of these values. The change in temperature depends on the thermal resistance and capacitance of the individual components in a specific processor [20]. To obtain current and future temperatures, we take into account the thermal resistance Rth and the thermal capacitance Cth when the temperature changes from Told to Tnew over a time interval ∆t, as in Equation (7.15):
T_{new} = P \cdot R_{th} + (T_{old} - P \cdot R_{th})\,e^{-\Delta t / (R_{th} \cdot C_{th})},    (7.15)
where Rth is the thermal resistance and Cth is the thermal capacitance. With Equations (7.15) and (7.11), we can derive the thermal parameters a and b as follows:
a = \frac{1}{C_{th}}, \quad b = \frac{1}{R_{th} \cdot C_{th}}.    (7.16)
By Equation (7.16), the thermal parameter a is determined by the thermal capacitance Cth. Thermal capacitance is defined as the amount of thermal energy required to raise the temperature of one mole of material by 1 Kelvin and can be measured at constant volume or at constant pressure [29]. Therefore, this value is practically constant for the same material. In contrast, the thermal parameter b is related to an application’s workload, because the thermal resistance is inversely proportional to the power consumption. Hence, characterizing the workload behavior is critical for distinguishing the thermal effects of different threads. As shown in Fig. 30, the workload dominates the temperature change in a core. Therefore, it is important to characterize an application’s workload in thermal control.
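The line fit used in this section to determine b from Equation (7.14) can be sketched as follows. The sketch assumes a least-squares line through the origin, approximates dT/dt by forward differences, and runs on synthetic samples generated from Equation (7.13) instead of sensor readings; the function names and the sampling interval are hypothetical.

```python
import math


def fit_b(temps, t_ss, dt):
    """Estimate b from Equation (7.14): dT/dt = -b (T - Tss).

    temps are temperatures sampled every dt seconds. dT/dt is approximated by
    forward differences and fitted against (T - Tss) with a least-squares line
    through the origin (an assumed simplification); the slope is -b.
    """
    xs = [t - t_ss for t in temps[:-1]]
    ys = [(t1 - t0) / dt for t0, t1 in zip(temps, temps[1:])]
    slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return -slope


def heating_curve(t_ss, t_init, b, dt, n):
    """Synthetic samples of Equation (7.13), standing in for sensor readings."""
    return [t_ss + (t_init - t_ss) * math.exp(-b * i * dt) for i in range(n)]


if __name__ == "__main__":
    # Tss and b of core 1 from Table VIII; Tinit = 50 C is a hypothetical value.
    temps = heating_curve(t_ss=78.0, t_init=50.0, b=0.0199, dt=1.0, n=300)
    print(fit_b(temps, t_ss=78.0, dt=1.0))
```

With a coarse sampling interval the forward difference slightly underestimates b; a shorter interval reduces this bias.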
B. Thermal Model Based on Workload
In this section, we propose a thermal model for CMPs that estimates the future temperature while considering the thermal correlation among neighboring cores.
[Figure: temperature (Celsius) versus time (sec) for workloads from 10% to 100% in steps of 10%]
Fig. 30. Thermal effects by different workloads
Table VIII. Each core’s respective Tss and thermal parameter b for a generated example process with 100% workload running in the Intel Quad Core Q6600 system
        Core 1   Core 2   Core 3   Core 4
b       0.0199   0.0175   0.0169   0.0181
Tss     78 °C    72 °C    68 °C    71 °C
Table IX. The thermal parameter b and Tss according to workload in 4-core system
               Core 1          Core 2          Core 3          Core 4
Workload (%)   b       Tss     b       Tss     b       Tss     b       Tss
20%            0.0139  59 °C   0.0092  58 °C   0.0053  52 °C   0.0065  57 °C
40%            0.015   64 °C   0.0058  62 °C   0.0085  57 °C   0.0065  58 °C
60%            0.0187  68 °C   0.0092  65 °C   0.0078  61 °C   0.0113  63 °C
80%            0.0179  73 °C   0.0164  70 °C   0.0165  67 °C   0.0138  68 °C
100%           0.0199  78 °C   0.0175  72 °C   0.0169  68 °C   0.0181  71 °C
1. Prior thermal model of a single core
Heat transfer equations are introduced in [37] to model the steady state temperature of systems with heat sources. Using those heat transfer equations, Wang and Bettati show in [30] that the rate of temperature change is proportional to the difference between the current temperature and the steady state temperature. Let Tss be the steady state temperature of an application. We denote T(t) as the temperature at time t and Tinit as the initial temperature when the application starts execution (T(0) = Tinit). Thus,
\frac{dT}{dt} = b \times (T_{ss} - T),    (7.17)
where b is a thermal parameter. Solving Equation (7.17) with T(0) = Tinit and T(∞) = Tss, we obtain
T(t) = T_{ss} - (T_{ss} - T_{init}) \times e^{-bt}.    (7.18)
Using Equation (7.18) and our measurements, we obtain Tss and b with the following steps:
1. We first run an application with 100% workload for a long time and then measure the steady state temperature (Tss) once the temperature no longer changes.
2. We calculate the thermal parameter b from Equation (7.18), using the temperature measured through the Digital Thermal Sensor (DTS) within the core.
As a result, we obtain each core’s respective b and Tss in Table VIII by executing a generated process with 100% workload on each core individually. Once the thermal parameter b and the steady state temperature are obtained, we can estimate the core’s future temperature T(t) after time t by Equation (7.18). Note that each core’s thermal parameter b and Tss are different even though the cores are within the same package, as shown in Table VIII. Moreover, we have observed that Tss and the thermal parameter b vary with the workload on each core, as well as with the thermal correlation effect among neighboring cores in CMP systems. Therefore, we are motivated to improve the prior thermal model by including the workload behavior and thermal correlation concepts.
Fig. 31. The thermal range (∆T) using Twss and Ttc to calculate T′ss for core 1
2. The thermal impacts contributed by different workloads
In real-world applications, the workload fluctuates, and each core's Tss and b
change with the workload at runtime. Therefore, by running processes with several
different workloads on each core, we observe the relationship between workload
and the thermal parameter b, as well as Tss, in Table IX.
3. New T ′ss according to thermal correlation
We divide the new T′ss into two parts: T^w_ss (determined by the core's own
workload) and Ttc (the thermal correlation contributed by neighboring cores'
temperature). Thus, we calculate the new T′ss by the following Equation (7.19):
\[ T'_{ss} = T^{w}_{ss} + T_{tc}. \tag{7.19} \]
First, we obtain T^w_ss from Table IX. Since neighboring cores' temperatures
depend on their own workloads, Ttc should account for each core's workload as
well as its temperature. As shown in Fig. 31, the thermal range of core 1 is
determined by the thermal correlation effect from core 2, core 3, and core 4 in
our 4-core system. To calculate T′ss for core 1, we develop Equation (7.20),
which obtains Ttc by modeling the thermal correlation impact from the other
cores:
\[ T_{tc} = \sum_{i=2}^{n} \Delta T \times W_i, \tag{7.20} \]
where ∆T is the thermal range between core 1's temperature with and without
thermal correlation from a neighboring core, and Wi is the estimated
representative workload of core i. For example, suppose 4 threads with different
workloads run in the 4-core system, one per core (Core 1: 100%, Core 2: 50%,
Core 3: 30%, Core 4: 20%). We first obtain T^w_ss as 78 ◦C and b as 0.0199 from
Table IX. Then, we calculate the thermal correlation effect from each neighboring
core with 100% workload, as shown in Table X.

Table X. Ttc and b according to thermal correlation profiled for core 1
Configuration                   Ttc      b
Only Core1 (100%)               78 ◦C    0.0199
Core1 (100%) + Core2 (50%)      85 ◦C    0.0246
Core1 (100%) + Core3 (30%)      84 ◦C    0.0195
Core1 (100%) + Core4 (20%)      83 ◦C    0.0176

Therefore, by Equations (7.19) and (7.20), T′ss is obtained as follows:
\[ \begin{aligned} T'_{ss} &= 78 + (85 - 78) \times 50\% + (84 - 78) \times 30\% + (83 - 78) \times 20\% \\ &= 84.3\ ^{\circ}\mathrm{C} \end{aligned} \]
In the above example, the calculated T′ss (84.3 ◦C) is higher than the original
Tss (78 ◦C). The difference between these values, 6.3 ◦C, is the thermal
correlation effect Ttc among neighboring cores.
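The worked example can be reproduced with a short Python sketch of Equations (7.19) and (7.20). The function name and the data layout (one pair of profiled temperature and workload per neighbor) are assumptions for illustration; the numbers are those of Table X and the example workloads.

```python
def correlated_steady_state(t_ss_own, neighbors):
    """Eq. (7.19)/(7.20): T'_ss = T^w_ss + sum(dT_i * W_i).

    neighbors: (profiled steady state with that neighbor active in C,
                neighbor workload as a fraction) pairs.
    """
    t_tc = sum((t_with - t_ss_own) * w for t_with, w in neighbors)
    return t_ss_own + t_tc, t_tc


T_SS_OWN = 78.0                                           # core 1 alone, 100% workload
NEIGHBORS = [(85.0, 0.50), (84.0, 0.30), (83.0, 0.20)]    # cores 2, 3, 4

t_ss_new, t_tc = correlated_steady_state(T_SS_OWN, NEIGHBORS)
print(f"T_tc = {t_tc:.1f} C, T'_ss = {t_ss_new:.1f} C")
# T_tc = 6.3 C, T'_ss = 84.3 C
```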
4. New b′ according to thermal correlation
To derive the new b′ under the thermal correlation effect as well, we start from
b′ = b^w + btc and develop the following Equations (7.21) and (7.22):
\[ b_{tc} = \sum_{i=2}^{n} \Delta b \times W_i \tag{7.21} \]
\[ b' = b^{w} + \left( b_{tc} \times \frac{T'_{ss} - T_{cur}}{T'_{ss} - T_{init}} \right), \tag{7.22} \]
where b^w is determined by the core's own workload and btc is the contribution
to the thermal parameter from neighboring cores. Tcur is the current temperature
and Tinit is the initial temperature. In Equation (7.21), ∆b is the difference
between core 1's thermal parameter b with and without thermal correlation from a
neighboring core. In contrast with T′ss, the thermal parameter b′ changes with
the current temperature (Tcur). Nevertheless, although b′ varies with the current
temperature and thermal correlation, it determines only the rate of temperature
increase.
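As an illustration, Equations (7.21) and (7.22) can be evaluated with the b values of Table X. The sketch below is an assumption-laden example rather than the dissertation's implementation: the function name, the 50 ◦C initial temperature, and the pairing of each profiled b with the neighbor's workload are chosen here for demonstration.

```python
def correlated_rate(b_own, neighbors, t_ss_new, t_cur, t_init):
    """Eq. (7.21)/(7.22): return (b', b_tc).

    neighbors: (profiled b with that neighbor active,
                neighbor workload as a fraction) pairs.
    """
    b_tc = sum((b_with - b_own) * w for b_with, w in neighbors)
    scale = (t_ss_new - t_cur) / (t_ss_new - t_init)
    return b_own + b_tc * scale, b_tc


B_OWN = 0.0199                                                 # core 1 alone (Table X)
NEIGHBORS = [(0.0246, 0.50), (0.0195, 0.30), (0.0176, 0.20)]   # cores 2, 3, 4
T_SS_NEW, T_INIT = 84.3, 50.0                                  # T_INIT is hypothetical

b_start, b_tc = correlated_rate(B_OWN, NEIGHBORS, T_SS_NEW, T_INIT, T_INIT)
b_end, _ = correlated_rate(B_OWN, NEIGHBORS, T_SS_NEW, T_SS_NEW, T_INIT)
```

Under these inputs the correlation term is fully applied when Tcur equals Tinit and vanishes as Tcur approaches T′ss, so b′ falls back to b^w near the steady state.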
5. Future temperature estimation model
In this section, we propose a new thermal model to estimate future temperature for
each application in CMPs. We focus on obtaining both new T ′ss and new thermal
parameter b′ according to the estimated workload and profiled thermal correlation
impacts.
The original thermal model for estimating the future temperature at time t,
Equation (7.18), is extended to the following Equation (7.23) for a specific
core, incorporating workload estimation and thermal correlation from neighboring
cores:
\[ \begin{aligned} T'(t) &= T'_{ss} - (T'_{ss} - T_{init}) \times e^{-b't} \\ &= T^{w}_{ss} + T_{tc} - (T^{w}_{ss} + T_{tc} - T_{init}) \times e^{-\left( b^{w} + b_{tc} \times \frac{T'_{ss} - T_{cur}}{T'_{ss} - T_{init}} \right) \times t} \end{aligned} \tag{7.23} \]
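A minimal Python sketch of how Equation (7.23) might be used follows. It evaluates T′(t) for a given b′ and, as an assumption of this example (the text does not give a formula for it), obtains the time until a threshold is reached by inverting the same expression with b′ held fixed. The threshold, initial temperature, and rate below are hypothetical.

```python
import math


def future_temperature(t_ss_new, t_init, b_new, elapsed_s):
    """First line of Eq. (7.23) with b' supplied by the caller."""
    return t_ss_new - (t_ss_new - t_init) * math.exp(-b_new * elapsed_s)


def time_to_threshold(t_ss_new, t_start, threshold, b_new):
    """Seconds until `threshold` is reached, with b' held fixed (assumption)."""
    if threshold >= t_ss_new:
        return math.inf          # the curve never climbs that high
    if t_start >= threshold:
        return 0.0               # already at or above the threshold
    return -math.log((t_ss_new - threshold) / (t_ss_new - t_start)) / b_new


# Hypothetical inputs: T'_ss = 84.3 C, start at 50 C, threshold 80 C, b' = 0.02.
dt = time_to_threshold(84.3, 50.0, 80.0, 0.02)
```

Because b′ depends on Tcur in Equation (7.22), a scheduler would presumably recompute b′ and the remaining time at every monitoring interval instead of relying on one closed-form value.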
To validate our new thermal model, we conduct several experiments running
applications with different workloads. The future temperature estimated for core
1 by our new thermal model is compared with the temperature monitored by the
Digital Thermal Sensor in Fig. 32. As shown in Fig. 32, the future temperature
estimated by the improved thermal model closely tracks the measured temperature,
especially within the first 200 seconds, which is more than enough time to react
to the increasing temperature.
Moreover, to demonstrate that the improved thermal model remains effective under
a fluctuating workload, we also evaluate it by playing multimedia data, which
generates two individual threads. We first calculate the representative workload
through the cumulative distribution function (cdf), and then estimate the
temperature by considering both the representative workload and the thermal
correlation in the equations above. The result of workload estimation by cdf is
shown in Fig. 33(a), and the estimated temperature is compared with the monitored
real temperature in Fig. 33(b). The results demonstrate the accuracy of our
improved thermal model under fluctuating workloads when both workload and thermal
correlation from neighboring cores are considered. In addition, thermal control
based on the current temperature may still overheat because of the physical
nature of temperature, as shown in Fig. 34(a). To overcome this problem, we
propose thermal control based on the future temperature, as shown in Fig. 34(b).
The proposed Future Temperature Estimation Model therefore estimates each core's
future temperature (Test) toward its individual steady state temperature
according to the application and core representative workloads (Wapp_rep,
Wcore_rep) estimated by Representative Workload Estimation (RWE). The estimated
future temperature is validated against the temperature measured on actual
processors with Digital Thermal Sensors (DTS), with an average error of 2.4%.
Finally, the time duration (∆t) before the temperature reaches the migration
threshold can be calculated and passed to TATS for thread control along with
Test. (Detailed explanations are given in the following sections.) Therefore,
instead of blindly migrating all the running threads or rescheduling all their
priorities, the proposed ProCATM handles threads adaptively according to their
different thermal effects, based on their representative workloads and
neighboring thermal correlation effects. Consequently, ProCATM can control the
temperature at a desired level with negligible performance overhead.

Fig. 32. Validation of improved thermal model with workload estimation and thermal
correlation in static application. (Only core 1's temperature is drawn.) Each
panel plots measured and estimated temperature (◦C) over time (sec):
(a) 100% (Core 1) + 50% (Core 2); (b) 30% (Core 1) + 70% (Core 3);
(c) variable workloads on all cores: 40% (Core 1) + 60% (Core 2) + 70% (Core 3) +
50% (Core 4); (d) variable workloads on all cores: 80% (Core 1) + 20% (Core 2) +
50% (Core 3) + 90% (Core 4).

Fig. 33. Validation of new thermal model with fluctuant workload: while playing
the Transformer movie, the Mplayer software generates two threads. One is the X
windows daemon with stable workload, and the other is for decoding with fluctuant
workload. (a) Workload: measured workload and workload estimated by cdf (%) over
time (sec); (b) Temperature: measured and estimated temperature (◦C) over time
(sec).
C. A Proactive Correlation-aware Thermal Management
In this section, we introduce the system design and architecture of the proposed
ProCATM. Moreover, we present how Thermal-Aware Thread Scheduler (TATS) uti-
lizes the workload behavior and thermal correlation information to achieve thermal
balancing and lower the peak temperature.
1. System overview
Proactive Correlation-Aware Thermal Management (ProCATM) consists of three major
components: Representative Workload Estimation (RWE), the Future Temperature
Estimation Model (FTEM), and the Thermal-Aware Thread Scheduler (TATS). Fig. 35
depicts the system architecture on a 4-core (Intel Quad Core Q6600 processor)
machine. We developed a specific device driver for Linux to access the Digital
Thermal Sensor (DTS) for monitoring each core's temperature; this temperature
information is used in the FTEM. As mentioned before, RWE derives the
representative workload at both the thread and core levels to capture each
application's workload behavior, while FTEM uses the representative workload and
thermal correlation information to estimate the future temperature (Test) and the
time duration (∆t) before the temperature reaches the migration threshold. Hence,
the TATS is able to react appropriately to a thermal emergency using the
estimated information. In the following section, we discuss the TATS in detail.

Fig. 34. The difference between thermal control based on current temperature and
on future temperature: (a) based on current temperature; (b) based on future
temperature
Fig. 35. ProCATM system architecture
2. Thermal-aware thread scheduler (TATS)
To guarantee thermal safety, the Thermal-Aware Thread Scheduler (TATS) consists
of two schedulers: the priority scheduler and the migration scheduler. When the
current temperature reaches the trigger threshold, the RWE starts to monitor the
application's workload behavior and calculates the representative workloads
through cdf for the running thread and core. The core representative workload
(Wcore_rep) and application representative workload (Wapp_rep) are then used in
the FTEM. In FTEM, the time duration (∆t) before reaching the migration threshold
is estimated based on the profiled T′ss and b′ for different workloads. According
to ∆t, TATS migrates the running threads from the core that is likely to overheat
to another core. Since a thread under 30% workload affects the core temperature
by at most 2 ◦C in our observations, TATS handles only threads with workload
higher than 30% to reduce the performance overhead. In TATS, migration is adopted
in most cases, unless all the cores' temperatures reach the priority scheduling
threshold. In that case, TATS uses the priority scheduler to adjust the nice
value in the Linux process scheduler, reducing the thread's priority and
increasing the cooling time, because migration cannot effectively reduce the core
temperature if all the core temperatures are near the maximum allowable
temperature. Also, we ignore the difference in performance overhead caused by
migrating threads with different memory usage, because we observe that the
migration overhead is dominated by suspending and restarting the thread in the
Linux kernel rather than by memory usage. For example, comparing the libquantum
benchmark and a generated transaction thread, the difference in migration
overhead is just 0.0346 millisecond: both maintain almost 100% workload, but the
generated transaction thread has about 51% memory usage in the Linux kernel,
while libquantum has only around 3%.
Therefore, by considering the thermal effect of different workloads and the
thermal correlation, TATS is able to effectively reduce the peak temperature of
each core and achieve thermal balancing with negligible performance overhead.
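The decision rule described above can be sketched as a small Python function. This is a hedged illustration, not the TATS implementation: the names, the response-time parameter, the threshold values in the usage lines, and the choice of the coolest core as migration target are all assumptions; only the 30% workload cut-off and the migrate-unless-all-cores-are-hot rule come from the text.

```python
from dataclasses import dataclass

WORKLOAD_THRESHOLD = 0.30   # from the text: lighter threads change a core by at most 2 C


@dataclass
class Core:
    core_id: int
    temperature_c: float


def choose_action(cores, hot_core_id, thread_workload, dt_s,
                  response_time_s, priority_threshold_c):
    """Return (action, target_core_id) for one thread on the hot core."""
    # Light threads, or a threshold that is still far away, need no action.
    if thread_workload <= WORKLOAD_THRESHOLD or dt_s > response_time_s:
        return ("none", None)
    # Migration only helps if some other core is still below the threshold.
    candidates = [c for c in cores
                  if c.core_id != hot_core_id
                  and c.temperature_c < priority_threshold_c]
    if not candidates:
        return ("priority", None)        # fall back to lowering the nice value
    coolest = min(candidates, key=lambda c: c.temperature_c)
    return ("migrate", coolest.core_id)  # coolest-core choice is an assumption


cores = [Core(1, 82.0), Core(2, 70.0), Core(3, 65.0), Core(4, 75.0)]
action = choose_action(cores, hot_core_id=1, thread_workload=0.9, dt_s=5.0,
                       response_time_s=10.0, priority_threshold_c=80.0)
```

On Linux, a user-space prototype could carry out the two actions with `os.sched_setaffinity` and `os.setpriority`, although the scheduler described here works inside the kernel.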
Fig. 36. DTM evaluation in Intel Quad Core Q6600 system for stable workload
behaviors: libquantum + vacation. Each panel plots the temperature (◦C) of cores
1 to 4 over time (sec): (a) Standard Scheduler; (b) Thermal-Balancing Policy;
(c) Predictive DTM; (d) ProCATM.
D. Experimental Results and Analysis
In this section, we present the experimental environment and results, along with
an analysis of the efficiency and effectiveness of the proposed ProCATM.
For comparison, we also reimplement Predictive Dynamic Thermal Management (PDTM)
[5] and the Thermal Balancing Policy (TBP) [6] in our systems. All the
experiments in this work are run under ambient temperature control, and the speed
of the cooling fan is fixed.
As shown in Fig. 36, all the DTMs have a lower peak temperature than the Linux
Standard Scheduler in the 4-core system. Compared to the Linux Standard
Scheduler, both ProCATM and PDTM reduce the peak temperature by 3.13%, while TBP
reduces it by 2.08%. In terms of performance overhead, ProCATM and PDTM incur
less than 0.46% compared to the Linux Standard Scheduler, while TBP incurs 6.64%.
This is also the reason why the temperature in TBP does not decrease noticeably
after 600 seconds of execution. Since only two threads are running in the system,
the thermal correlation effect is minor. Moreover, both threads stably maintain
100% workload, so the difference in their thermal behaviors can be ignored.
Therefore, PDTM is about as effective in thermal control as ProCATM.
As shown in Fig. 37, ProCATM presents a smoother temperature pattern and provides
better thermal fairness, with narrower temperature gaps among all cores, in
multimedia applications. ProCATM reduces the peak temperature by 1.35% compared
to the Linux Standard Scheduler, while both TBP and PDTM increase the peak
temperature by 1.35%. Since only one multimedia application, which is not
CPU-intensive, is executed, the temperature decrease under ProCATM is minor. In
PDTM, the temperature pattern is similar to that of the Linux Standard Scheduler;
however, the temperatures of core 3 and core 4 are higher than under the Linux
Standard Scheduler, because PDTM tends to migrate threads to core 3 and core 4.
Although PDTM rarely migrates threads to core 1, some system threads can be
assigned to core 1. Since core 1's Tss and thermal parameter b are higher, core 1
is more sensitive to temperature changes. Therefore, the system threads still
keep core 1 at a higher temperature, although the multimedia threads are running
on core 3 and core 4. In TBP, by contrast, core 1's temperature is even higher
than under the Standard Scheduler. Besides the higher Tss and b of core 1, TBP
triggers a thread exchange when the thresholds are reached.
Fig. 37. DTM evaluation in Intel Quad Core Q6600 for dynamic workload behaviors:
Multimedia. Each panel plots the temperature (◦C) of cores 1 to 4 over time
(sec): (a) Standard Scheduler; (b) Thermal-Balancing Policy; (c) Predictive DTM;
(d) ProCATM.
Therefore, even though the thread on core 1 is exchanged out to avoid increasing
core 1's temperature, the thread exchanged into core 1 can still keep increasing
core 1's temperature. Thus, thermal safety cannot be guaranteed in TBP.
E. Conclusions
In this work, to avoid thermal emergencies and provide thermal fairness in CMP
systems, we propose and implement an adaptive and scalable run-time thermal
management scheme, called Proactive Correlation-Aware Thermal Management
(ProCATM), on real-world CMP products. Prior DTM works ignore the significant
variations in thermal behavior among different applications and the severe
thermal correlation effect among multiple cores. We therefore characterize each
application's distinct thermal behavior by applying a cumulative distribution
function to the application workload, and use a thermal model for CMP systems
that analyzes the thermal correlation effect by profiling the thermal impacts
from neighboring cores under specific workloads. Thus, the future temperature of
each core can be estimated more accurately, allowing ProCATM to react
appropriately to a thermal emergency. To demonstrate scalability and
effectiveness, we implement and evaluate the proposed ProCATM on an Intel Quad
Core Q6600 processor system running grouped multimedia applications and
benchmarks. According to the experimental results, ProCATM reduces the peak
temperature by up to 9.09% in our 4-core system with only 2.28% performance
overhead compared to the Linux standard scheduler.
Algorithm 4 SWETM algorithm
while Tcurr < Tthreshold do
    Calculate CDF(Wapp) {according to the S.D.}
    Calculate CDF(Wcore) {according to the S.D.}
    T′ss ← CDF(Wcore)
    ∆t ← Calculate_WhenOverheated(Tlimit, T′ss)
    for i = 1 to App_Num (running within that core) do
        if ∆t ≤ Response_Time and Wapp_i > Workload_Threshold then
            for j = 1 to Core_Num do
                Wf_j ← Wapp_i + Wcore_j
                T′ss_j ← CDF(Wf_j)
                Tf_j ← FTEM(T′ss_j, ∆t)
                if (Tf_j − Tlimit) < 1 then
                    Action ← PRIORITY_SCHEDULING
                else
                    Action ← MIGRATION
                end if
            end for
            Trigger_Control(Action)
        end if
    end for
end while
CHAPTER VIII
A THERMAL MODEL FOR CMPS CAPTURING WORKLOAD
CHARACTERISTICS AND NEIGHBORING CORE EFFECTS
Skadron et al. first proposed an architectural thermal model for microprocessors,
HotSpot [16], which constructs a multi-layer lumped thermal RC network to model
the heat dissipation path from the silicon die through the cooling package to the am-
bient. In HotSpot, the silicon die is partitioned into functional blocks based on the
floorplan of the microprocessor, with a thermal RC network connecting the blocks.
However, due to the complexity of component block level thermal models and insuf-
ficient physical information extracted from floorplans in modern processors, it is not
feasible to design a thermal model according to the hottest chip block. Therefore, we
need a high-level thermal model that captures the thermal effect incurred by
application behavior and can be managed by operating systems according to
applications' runtime behavior.
A. The Lumped Thermal RC Model
In this section, we explain a lumped thermal RC model that captures the thermal
characteristics of processors as well as external cooling effects such as fans or
cooling packages. We assume that the speed of external fans is fixed, so that it
can be treated as a constant in the thermal model. We apply Fourier's Law of heat
conduction, which states that the cooling rate is proportional to the difference
in temperature between the object and the environment. Likewise, the heating rate
is proportional to the difference between the current temperature and the steady
state temperature reachable under the input power, which is the heat source for
processors. We define T(t) and P(t) to be the temperature and the power
consumption at time t, respectively. Then, we formulate Fourier's Law as follows
[41, 28]:
\[ T'(t) = \frac{P(t)}{C} - b \cdot T(t), \tag{8.1} \]
where b is a positive constant that represents the power dissipation rate and is
the inverse of the thermal time constant τ = R · C. The parameters R and C are
the thermal resistance and capacitance, respectively, and represent thermal
characteristics of the chip. The heat transfer equations introduced in [37] model
the steady state temperature of systems with heat sources. Building on those
equations, Wang and Bettati show in [30] that the rate of temperature change is
proportional to the difference between the steady state temperature and the
current temperature. Let Tss be the steady state temperature of an application,
T(t) the temperature at time t, and Tinit the initial temperature when the
application starts execution (T(0) = Tinit). Then,
\[ \frac{dT}{dt} = b \times (T_{ss} - T), \tag{8.2} \]
where b is a thermal parameter. Solving Equation (8.2) with T(0) = Tinit and
T(∞) = Tss, we obtain
\[ T(t) = T_{ss} - (T_{ss} - T_{init}) \times e^{-b \cdot t}. \tag{8.3} \]
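The link between Equations (8.1) and (8.2) is not spelled out in the text. One way to connect them, assuming a constant power P, a single lumped R and C, and temperatures measured relative to the ambient, is the following short derivation:

```latex
\begin{align*}
T'(t) &= \frac{P}{C} - b\,T(t)
       = b\left(\frac{P}{b\,C} - T(t)\right)
       = b\,\bigl(R\,P - T(t)\bigr),
       \qquad b = \frac{1}{R\,C}, \\
T_{ss} &= R\,P \qquad \text{(setting } T'(t) = 0\text{)}, \\
\frac{d}{dt}\bigl(T_{ss} - T(t)\bigr) &= -b\,\bigl(T_{ss} - T(t)\bigr)
  \;\Longrightarrow\;
  T_{ss} - T(t) = (T_{ss} - T_{init})\,e^{-b t}.
\end{align*}
```

Under these assumptions the steady state temperature is the product of thermal resistance and power, and the last line rearranges to Equation (8.3).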
Fig. 38 shows the thermal RC circuit model for a single core in a CMP
architecture. We assume that the initial temperature is Tinit, i.e., T(t0) =
Tinit. The core's temperature at time t (Tc(t)) is calculated using a lumped
thermal RC model [16, 42] and expressed as follows.
\[ T_c(t) = R_c \cdot P_c \cdot (1 - e^{-b_c \cdot t}) + R_p \cdot P_p \cdot (1 - e^{-b_p \cdot t}) + T_{init,c} \tag{8.4} \]
where Tinit,c and Tinit,p are the initial temperatures of the core and the
package, respectively, Rc and Rp are the thermal resistances of the core and the
package, respectively, and bc and bp are the thermal parameters of the core and
the package, respectively. Also, Pc and Pp are the average power consumption of
the core and the package over the time interval t, respectively. The above
approximation is derived from the fact that the thermal time constant of the
core, τc = Rc · Cc, is much smaller than that of the package, τp = Rp · Cp [42].
The thermal parameter values (Rc, Rp, bc, bp) in Equation (8.4) are determined
from the temperature curve of a SPEC CPU 2006 benchmark. We obtain the thermal
parameters using nonlinear regression analysis provided by the SPSS statistics
tool.
Fig. 38. An extended lumped thermal RC circuit model for a single core in a CMP
architecture
B. Workload-aware Thermal Model
Although the temperature of a core can be calculated by a lumped thermal RC
model as shown in Equation (8.4), this model cannot capture the temperature
variations caused by running applications. The temperature variations can be
affected by individual functional blocks as mentioned in HotSpot [16], but it is
difficult to obtain detailed information about the cores at runtime. Therefore,
we need new metrics to explain the thermal effects of the workload in running
applications. We propose a workload-aware thermal model that captures the
workload characteristics of applications more accurately based on architectural
information. First, we approximate the steady state temperature (Tss) using only
the thermal parameters of the cores. Then, a workload estimation factor that
describes workload characteristics is calculated from a regression analysis using
the variance of Tss when running eleven SPEC CPU 2006 benchmarks. Using the
workload estimation factor, we estimate the temperature as affected by the
workload characteristics of applications. In Equation (8.4), the steady state
temperature (Tss) is obtained as t approaches ∞:
\[ T_c(\infty) = R_c \cdot P_c + R_p \cdot P_p, \qquad T_{ss} \approx R_c \cdot P_c. \tag{8.5} \]
Since Rp · Pp for the package is much smaller than Rc · Pc for the core, and Rp
and Pp depend on processor specifications, we approximate Tss considering only
the effect of the cores, as shown in Equation (8.5).
Fig. 39. Tss of SPEC CPU 2006 benchmarks (temperature in ◦C on cores 1 to 4 for
perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, astar, and
xalancbmk)
We first run the SPEC CPU 2006 benchmarks on each core until the temperature no
longer changes, to obtain the respective steady state temperatures. As shown in
Fig. 39, Tss differs from one SPEC CPU 2006 benchmark to another, although their
CPU utilizations are all almost 100%. To represent the various workload
characteristics that affect Tss, we add a workload estimation factor (wx)
describing the workload characteristics of an application x, as shown in Equation
(8.6):
\[ T_{ss}(x) = w_x \cdot R_c \cdot P_c, \tag{8.6} \]
where Tss(x) is the steady state temperature of an application x. To estimate wx,
we use the architectural information provided by Performance Monitoring Counters
(PMC), a set of special-purpose registers built into modern microprocessors to
store the counts of hardware-related activities within computer systems [19]. We
use two PMC counters: the number of accumulated cycles of functional blocks,
which monitors active clock cycles, and the number of completed memory
transactions, which monitors memory transactions at runtime. The "Unhalted CPU
Cycle" counter is used as an indicator of CPU-intensive work and the "Bus
Transactions Memory" counter as an indicator of memory-intensive work. However,
the "Bus Transactions Memory" counter does not indicate how much memory the
running applications use. To capture the memory characteristics of applications
more fully, the memory usage measured by the kernel is also used. We define the
workload estimation factor wx(t) for a running application x during a time
interval [t − 1, t] as shown in Equation (8.7):
\[ w_x(t) = \alpha \cdot C(t) + \beta \cdot M(t) + \gamma \cdot N(t), \tag{8.7} \]
where C(t) is the unhalted clock cycle count, M(t) is the number of completed
memory transactions, and N(t) is the memory usage during the time interval
[t − 1, t]. We observed that C(t) works as a positive factor in temperature
variations, while M(t) and N(t) work as negative factors. After running four
integer and floating-point benchmarks from SPEC CPU 2006, we obtain α, β, and γ
for each core using linear regression. For example, for the SPEC CPU 2006 integer
benchmarks, α, β, and γ of core 1 are 3.28E−005, −4.28E−007, and −8.36E−006,
respectively. For the floating-point benchmarks, α, β, and γ of core 1 are
2.58E−005, −2.35E−009, and −9.85E−006, respectively. Compared to the integer
benchmarks, the temperature variation of the floating-point benchmarks is less
affected by architectural information. Since α, β, and γ of each core are
obtained from Tss in Equation (8.6) before running applications, the workload
characteristics can be described by the variation of C(t), M(t), and N(t) at
runtime, as shown in Equation (8.7). Using these three parameters, we derive the
thermal model as follows:
\[ T_c(t) = w_x(t) \cdot R_c \cdot P_c \cdot (1 - e^{-b_c \cdot t}) + R_p \cdot P_p \cdot (1 - e^{-b_p \cdot t}) + T_{init,c}, \tag{8.8} \]
where Tc(t) is the core temperature at time t and Tinit,c is the initial
temperature of the core. Also, bc and bp are the thermal parameters of the core
and the package, and wx(t) is the workload estimation factor of a running
application x. As shown in Equation (8.8), we estimate temperature using a
workload-aware thermal model that includes the workload characteristics of each
application and the thermal parameters of the package as well as those of the
cores.
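Equations (8.7) and (8.8) translate directly into code. The Python sketch below uses the core 1 integer-benchmark coefficients quoted above; everything else (function names, counter readings and their units, and the R, P, b values in the usage lines) is hypothetical and chosen only so that the example runs.

```python
import math

# Core 1 coefficients for SPEC CPU 2006 integer benchmarks, as quoted in the text.
ALPHA, BETA, GAMMA = 3.28e-5, -4.28e-7, -8.36e-6


def workload_factor(cycles, mem_transactions, mem_usage,
                    alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Eq. (8.7): w_x(t) from the counters sampled over [t-1, t]."""
    return alpha * cycles + beta * mem_transactions + gamma * mem_usage


def core_temperature(t, w, r_c, p_c, b_c, r_p, p_p, b_p, t_init_c):
    """Eq. (8.8): core term scaled by w, plus package term and initial temperature."""
    core = w * r_c * p_c * (1.0 - math.exp(-b_c * t))
    package = r_p * p_p * (1.0 - math.exp(-b_p * t))
    return core + package + t_init_c


# Hypothetical counter readings and thermal parameters (units are assumptions).
w = workload_factor(cycles=32000, mem_transactions=50000, mem_usage=3.0)
temp_after_60s = core_temperature(60.0, w, r_c=30.0, p_c=1.0, b_c=0.02,
                                  r_p=5.0, p_p=1.0, b_p=0.002, t_init_c=45.0)
```

Setting w to 1 reduces the function to Equation (8.4), which makes it easy to compare the workload-aware estimate with the plain RC model on the same trace.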
To investigate how the three factors affect temperature in running applications,
we compare the measured temperature and the estimated temperature using the bzip2
integer benchmark and the lbm floating-point benchmark from SPEC CPU 2006. The
temperature estimated using only active CPU cycles shows a large estimation error
compared to the measured temperature, as shown in Fig. 40(a) and 41(a). The
thermal model using active clock cycles, memory transactions, and memory usage
gives more accurate temperature estimates, as shown in Fig. 40(d) and 41(d).
Since our workload-aware thermal model considers a positive factor (active clock
cycles) and negative factors (memory transactions and memory usage), it yields
more accurate temperature estimates regardless of workload type, such as
CPU-intensive or memory-intensive applications.
C. Thermal Correlation Effects
In this section, we develop a thermal correlation model for a CMP architecture
based on the lumped thermal RC model described above. The thermal model of a CMP
architecture is composed of multiple nodes, one for each core in the package.
Heat conduction between neighboring cores is modeled by connecting them with a
thermal resistance, as shown in Fig. 42. Each node is also connected to a current
source, which models its power consumption, and this power is dissipated as heat
with uniform power consumption. To calculate the thermal correlation effects
among neighboring cores, we define the thermal correlation factor (ψ), which is
used to estimate the operating temperature as affected by each core's own Tss and
the amount of heat transferred from neighboring cores. To that end, we first
calculate the temperature increase ratio (Γ) of Tss as shown in Equation (8.9):
\[ \Gamma_{i(j)} = \frac{T^{E}_{ss,i(j)} - T_{init,i}}{T_{ss,i} - T_{init,i}} = \frac{\Delta T^{E}_{ss,i(j)}}{\Delta T_{ss,i}}, \tag{8.9} \]
where Γi(j) is the ratio of the temperature increase up to the overall steady
state temperature (T^E_ss,i(j)) of core i, which includes the heat transfer from
core j, to the temperature increase up to the steady state temperature (Tss,i) of
core i alone, and Tinit,i is the initial temperature of core i. In Table XI,
Γ1(2) is 1.207, which implies that the rise of core 1 to its overall Tss is 20.7%
larger than the rise to Tss for core 1 alone without any heat transfer.
Therefore, we define the thermal correlation factor of core i (ψi(j)) due to heat
transfer from core j as follows:
\[ \psi_{i(j)} = \Gamma_{i(j)} - 1, \tag{8.10} \]
where ψi(j) is the thermal correlation factor between cores i and j.

Fig. 40. Temperature tracking using architectural information in SPEC CPU integer
benchmarks. Each panel plots measured and estimated temperature (◦C) over time
(sec): (a) using only active CPU cycles; (b) using cycle and memory usage;
(c) using cycle and memory transactions; (d) using active CPU cycles and the
memory transactions.
Fig. 41. Temperature tracking using architectural information in SPEC CPU
floating-point benchmarks. Each panel plots measured and estimated temperature
(◦C) over time (sec): (a) using only active CPU cycles; (b) using cycle and
memory usage; (c) using cycle and memory transactions; (d) using active CPU
cycles and the memory transactions.
Fig. 42. The extended thermal model for CMP architecture
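To make Equations (8.9) and (8.10) concrete, the following Python sketch converts the Γ ratios that the text reports in Table XI into correlation factors ψ. The dictionary layout and function names are assumptions of this example; the ratios themselves are the table's values.

```python
# Gamma_i(j) from Table XI: outer key is core i, inner key is neighbor j.
GAMMA_TABLE = {
    1: {2: 1.207, 3: 1.034, 4: 1.17},
    2: {1: 1.080, 3: 1.040, 4: 1.000},
    3: {1: 1.000, 2: 1.000, 4: 1.037},
    4: {1: 1.034, 2: 1.034, 3: 1.069},
}


def gamma_ratio(t_ss_with_neighbor, t_ss_alone, t_init):
    """Eq. (8.9): ratio of the temperature rises with and without neighbor j."""
    return (t_ss_with_neighbor - t_init) / (t_ss_alone - t_init)


def correlation_factors(gamma_table):
    """Eq. (8.10): psi_i(j) = Gamma_i(j) - 1 for every (i, j) pair."""
    return {i: {j: g - 1.0 for j, g in row.items()}
            for i, row in gamma_table.items()}


PSI = correlation_factors(GAMMA_TABLE)
# Pair (i, j) with the largest influence of core j on core i.
strongest = max(((i, j) for i in PSI for j in PSI[i]),
                key=lambda ij: PSI[ij[0]][ij[1]])
```

With these values the strongest coupling is the effect of core 2 on core 1, and the table is not symmetric: Γ1(2) is 1.207 while Γ2(1) is 1.080.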
To derive the thermal model for a CMP architecture, we express each core's thermal
variations with the same approximation suggested for the single core. As shown in
Equation (8.11), the temperature of core i can be approximated using core i's thermal
parameters, its workload estimation factor (w(t)), and the thermal correlation factors
(ψ) that capture the thermal correlation effects from the other cores:
T_i(t) = w(t) \cdot R_i \cdot P_i \cdot (1 - e^{-b_i \cdot t}) + R_p \cdot P_p \cdot (1 - e^{-b_p \cdot t}) + T_{init,i} + \sum_{j=1, j \neq i}^{n} \psi_i(j) \cdot T_j(t), (8.11)
where Ri and bi are the thermal resistance and thermal parameter for core i, respectively,
Rp and bp are the thermal resistance and thermal parameter for the package, respectively,
ψ is the thermal correlation factor, w(t) is the workload estimation factor of core i,
and Ti(t) is the temperature of core i at time t. This final model combines the
workload-aware thermal model with the thermal correlation effects in a CMP architecture.
Unlike prior studies that evaluate their models in simulation, we implement and evaluate
our thermal model on a 4-core (Intel Quad Core Q6600) system. To evaluate how well the
model captures workload characteristics and the thermal correlation effects, we developed
a Linux device driver that accesses the Digital Thermal Sensor (DTS) of the Intel
microarchitecture to monitor each core's temperature. We also developed monitoring and
estimation tasks that capture architectural information via Performance Monitoring
Counters (PMC) and estimate workload characteristics at runtime. Fig. 43 shows the overall
prototype implementation and measurement setup for our experiments. The diagram depicts
the parts of our implementation that determine workload characteristics and perform
online thermal tracking on a real product.
Table XI. The ratio of Tss for cores in an Intel Quad-Core processor
          Γ1      Γ1(2)   Γ1(3)   Γ1(4)
core 1    1.000   1.207   1.034   1.17
          Γ2      Γ2(1)   Γ2(3)   Γ2(4)
core 2    1.000   1.080   1.040   1.000
          Γ3      Γ3(1)   Γ3(2)   Γ3(4)
core 3    1.000   1.000   1.000   1.037
          Γ4      Γ4(1)   Γ4(2)   Γ4(3)
core 4    1.000   1.034   1.034   1.069
Fig. 43. The proposed platform
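Equation (8.11) can be evaluated directly once the parameters are profiled. The following is a minimal sketch, not the driver or estimation task used in the experiments. It assumes that w(t) is supplied as a single value for the current time step and that the neighboring temperatures Tj(t) are available as inputs (for example, from sensor readings). All parameter names and numbers in the usage lines are hypothetical.

```python
import math


def core_temperature(t, w, r_core, p_core, b_core,
                     r_pkg, p_pkg, b_pkg, t_init, neighbors):
    """Evaluate Equation (8.11) for one core at time t (seconds).

    neighbors is a list of (psi, temp) pairs holding the correlation
    factor ψ_i(j) and the temperature T_j(t) of each other core j.
    """
    core_term = w * r_core * p_core * (1.0 - math.exp(-b_core * t))
    package_term = r_pkg * p_pkg * (1.0 - math.exp(-b_pkg * t))
    coupling = sum(psi * temp for psi, temp in neighbors)
    return core_term + package_term + t_init + coupling


# Hypothetical parameter values, chosen only to exercise the formula.
params = dict(w=0.8, r_core=0.5, p_core=20.0, b_core=0.05,
              r_pkg=0.2, p_pkg=30.0, b_pkg=0.01, t_init=45.0)

start = core_temperature(0.0, neighbors=[], **params)
later = core_temperature(100.0, neighbors=[], **params)
print(start, later)
```

With no neighbors, the estimate starts at Tinit,i and rises toward w · Ri · Pi + Rp · Pp + Tinit,i as both exponential terms decay. Each neighbor adds ψi(j) · Tj(t) on top of that.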
To verify the thermal model, we profile the thermal parameters and the workload
estimation factor of each core using four integer and floating-point benchmarks from
SPEC CPU 2006. After profiling, we compare the temperature predicted by our thermal model
with the measured temperature, as shown in Fig. 44. When two instances of the libquantum
integer benchmark or of the lbm floating-point benchmark run simultaneously on cores 1
and 2, the estimated temperature is above 97% accurate compared to the measured
temperature of each core, as shown in Fig. 44(a) and 44(c), respectively. When three
instances of the libquantum integer benchmark or of the lbm floating-point benchmark run
on cores 1, 3, and 4, the thermal model estimates the temperature more accurately, as
shown in Fig. 44(b) and 44(d), respectively. Although we show only two results due to
space limitations, the average temperature prediction accuracy of the thermal model is
above 92% across the seven integer and floating-point SPEC CPU 2006 benchmarks we tested.
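The text does not spell out how the prediction accuracy percentage is computed. One plausible reading, shown here purely as an assumption, is one minus the mean relative error between the estimated and measured traces. The function name and the sample traces are hypothetical.

```python
def prediction_accuracy(measured, estimated):
    """Return 100 * (1 - mean relative error), in percent.

    This metric is an assumed definition, not one stated in the text.
    """
    if len(measured) != len(estimated) or not measured:
        raise ValueError("traces must be non-empty and of equal length")
    rel_errors = [abs(e - m) / m for m, e in zip(measured, estimated)]
    return 100.0 * (1.0 - sum(rel_errors) / len(rel_errors))


# Hypothetical temperature traces in Celsius, sampled at the same instants.
measured_trace = [60.0, 62.0, 65.0, 67.0]
estimated_trace = [59.0, 63.0, 65.0, 66.0]
print(f"{prediction_accuracy(measured_trace, estimated_trace):.1f}%")
```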
[Figure 44: four panels, each plotting measurement and estimation temperature (Celsius)
against time (sec): (a) Running libquantum SPEC CPU 2006 integer benchmark on Core 1 and
Core 2; (b) Running libquantum SPEC CPU 2006 integer benchmark on Core 1, Core 3, and
Core 4; (c) Running lbm SPEC CPU 2006 floating-point benchmark on Core 1 and Core 2;
(d) Running lbm SPEC CPU 2006 floating-point benchmark on Core 1, Core 3, and Core 4]
Fig. 44. The comparisons between the estimated temperature considering workload
characteristics and thermal correlation effects and the measured temperature
in a CMP architecture
D. Conclusions
As power density increases with technology advances, thermal control in CMPs has become
a critical issue for improving system reliability. In this work, we propose a more
accurate thermal model that captures workload characteristics and the thermal correlations
among neighboring cores in a CMP architecture. The thermal model estimates temperature
using a lumped thermal RC model enhanced with a workload estimation factor, approximated
by regression analysis, and thermal correlation factors that describe the amount of heat
transfer from neighboring cores. To estimate the workload characteristics of running
applications, we use the active CPU cycles and the number of completed memory transactions
provided by Performance Monitoring Counters (PMC), and the memory usage reported by the
kernel.
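To illustrate what approximating the workload estimation factor by regression can look like, the sketch below fits a two-feature linear model by ordinary least squares. The model form (no intercept, two normalized features), the function name, and the sample data are assumptions made for illustration; they are not the regression actually used in this work.

```python
def fit_workload_factor(samples):
    """Least-squares fit of w ≈ a * cycles + b * mem_tx (no intercept).

    samples is a list of (cycles, mem_tx, w_observed) tuples, where both
    features are assumed to be normalized to comparable ranges.
    """
    sxx = sum(c * c for c, _, _ in samples)
    sxy = sum(c * m for c, m, _ in samples)
    syy = sum(m * m for _, m, _ in samples)
    sxw = sum(c * w for c, _, w in samples)
    syw = sum(m * w for _, m, w in samples)
    # Solve the 2x2 normal equations with Cramer's rule.
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("features are collinear; cannot fit")
    a = (sxw * syy - syw * sxy) / det
    b = (syw * sxx - sxw * sxy) / det
    return a, b


# Hypothetical profiling samples: (active cycles, memory transactions, w).
profile = [(1.0, 0.0, 0.6), (0.0, 1.0, 0.3), (1.0, 1.0, 0.9)]
a, b = fit_workload_factor(profile)
print(f"w(t) ~ {a:.2f} * cycles + {b:.2f} * mem_tx")
```

At runtime, the fitted coefficients would be applied to fresh counter readings to obtain w(t) for the thermal model.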
To demonstrate the scalability and effectiveness of our thermal model, we implement and
evaluate it on a 4-core (Intel Quad Core Q6600) system running integer and floating-point
SPEC CPU 2006 benchmarks. In our experiments, the thermal model achieves above 92%
temperature prediction accuracy compared to the measured temperature of each core. In
future work, we will develop Dynamic Thermal Management (DTM) for CMPs that exploits the
accurate temperature prediction provided by our thermal model.
CHAPTER IX
CONCLUSIONS AND FUTURE WORK
In this chapter, we summarize the major results of this thesis and discuss future
directions of this work.
A. Conclusions
In Chapter III, we propose an efficient thermal management scheme for multimedia
applications that considers how scene complexity affects the performance of a multimedia
system, and we then find an appropriate frequency based on the scene complexity
information.
In Chapter IV, we first derive the characteristics of various multimedia applications
by transmitting MPEG-4 and H.264/AVC streams encoded at two different frame resolutions.
Using these application characteristics, we estimate the processor speed needed to
decode frames at runtime.
In Chapter V, we present an advanced future temperature prediction model for each
core that estimates thermal behavior considering both core temperature and application
temperature variations, and we take appropriate measures to avoid thermal emergencies.
In Chapter VI, the proposed TAS scheme utilizes an advanced future temperature
prediction model for each core to estimate different thermal behaviors and measure
the time left until each core reaches the desired temperature threshold.
In Chapter VII, to avoid thermal emergencies and provide thermal fairness in
CMP systems, we propose and implement an adaptive and scalable run-time thermal
management scheme, called Proactive Correlation-Aware Thermal Management
(ProCATM), on real-world CMP products.
Chapter VIII presents a thermal model based on the workload characteristics of running
applications. We propose a thermal model based on thermal correlation effects and
online workload estimation using architectural information obtained via Performance
Monitoring Counters (PMC).
B. Future Work
Chip Multiprocessors (CMPs) have become pervasive in modern processor designs, and the
demand for managing temperature within thermal safety limits in a CMP architecture is
likely to remain very high. To resolve the tradeoff between efficient thermal management
and performance degradation in a CMP architecture, we would like to expand this work to
cover real CMP products such as a 4-core system (Intel Q6600 Quad-Core) and an 8-core
system (two Intel Xeon E5310 Quad-Core processors), which are increasingly used in
high-performance systems. This involves addressing issues associated with efficient
overlay self-reconfiguration and maintenance, load balancing among servers in the network,
and optimal selection of servers or paths for streaming services. We are also interested
in studying online workload estimation of running applications in real environments.
In addition, we are further interested in applications' execution behaviors, which
can directly affect temperature in a CMP architecture. As processor speed increases and
chip complexity grows, thermal management in a CMP architecture can become very
susceptible to application execution behavior, which is difficult to capture at runtime.
Therefore, we need a metric to represent applications' execution behaviors. Although our
work succeeded at designing accurate online workload estimation and modeling the thermal
correlation effects for Dynamic Thermal Management (DTM), new technologies such as CPU
hotplugging and individual Dynamic Voltage and Frequency Scaling of each core in a CMP
architecture have been introduced, and integrating them into DTM is left to future work.
126
REFERENCES
[1] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-
Performance Microprocessors,” in Proc. IEEE HPCA, pp. 171–182, 2001.
[2] S. Heo, K. Barr, and K. Asanovic, “Reducing Power Density through Activity
Migration,” in Proc. IEEE ISLPED, pp. 217–222, 2003.
[3] K. Skadron, “Hybrid Architectural Dynamic Thermal Management,” in Proc.
IEEE DATE, pp. 10–15, 2004.
[4] J. Srinivasan and S. Adve, “Predictive Dynamic Thermal Management for Mul-
timedia Applications,” in Proc. ACM ICS, pp. 109–120, 2003.
[5] I. Yeo, C. C. Liu, and E. J. Kim, “Predictive Dynamic Thermal Management
for Multicore Systems,” in Proc. ACM DAC, pp. 734–739, 2008.
[6] F. Mulas, M. Buttu, M. Pittau, S. Carta, D. Atienza, A. Acquaviva, L. Benini,
and G. D. Micheli, “Thermal Balancing Policy for Streaming Computing on
Multiprocessor Architectures,” in Proc. IEEE DATE, pp. 734–739, 2008.
[7] D. Son, C. Yu, and H. N. Kim, “Dynamic Voltage Scaling on MPEG Decoding,”
in Proc. IEEE ICPADS, pp. 633–640, 2001.
[8] J. Pouwelse, K. Langendoen, and H. Sips, “Dynamic Voltage Scaling on a Low-
Power Microprocessor,” in Proc. ACM MOBICOM, pp. 251–259, 2001.
[9] B. Lee, E. Nurvitadhi, R. Dixit, C. Yu, and M. Kim, “Dynamic Voltage Scaling
Techniques for Power Efficient Video Decoding,” the EUROMICRO Journal,
vol. 51, no. 10-11, pp. 633–652, 2005.
127
[10] Z. Lu, J. Lach, M. Stan, and K. Skadron, “Reducing Multimedia Decode Power
using Feedback Control,” in Proc. IEEE ICCD, pp. 489–496, 2003.
[11] I. Yeo, H. K. Lee, K. H. Yum, and E. J. Kim, “Effective Dynamic Thermal
Management for MPEG-4 Decoding,” in Proc. IEEE ICCD, pp. 623–628, 2007.
[12] K. Choi, K. Dantu, W. C. Cheng, and M. Pedram, “Frame-based Dynamic
Voltage and Frequency Scaling for a MPEG Decoder,” in Proc. IEEE ICCAD,
pp. 732–737, 2002.
[13] W. Yuan and K. Nahrstedt, “Reduced energy decoding of MPEG streams,”
ACM Trans. Computer System, vol. 24, no. 3, pp. 292–331, 2006.
[14] K. Skadron, M.Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tar-
jan, “Temperature-Aware Microarchitecture: Modeling and Implementation,”
ACM Trans. Architecture and Code Optimization, vol. 1, no. 1, 2004.
[15] M. R. Stan, K. Skadron, M. Barcella, W. Huang, K. Sankaranarayanan, and
S. Velusamy, “HotSpot: a Dynamic Compact Thermal Model at the Processor
Architecture Level,” Microelectronics Journal: Circuit and Systems, vol. 1, no.
1, 2003.
[16] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tar-
jan, “Temperature-Aware Microarchitecture,” in Proc. IEEE ISCA, pp. 2–13,
2003.
[17] M. D. Powell, M. Gomaa, and T. N. Vijaykumar, “Heat-and-Run: Leveraging
SMT and CMP to Manage Power Density Through the Operating System,” in
Proc. ACM ASPLOS-XI, pp. 260–270, 2004.
128
[18] J. Choi, C.-Y. Cher, H. Franke, H. Hamann, A. Weger, and P. Bose, “Thermal-
Aware Task Scheduling at the System Software Level,” in Proc. IEEE ISLPED,
pp. 213–218, 2007.
[19] “Intel 64 and IA-32 Architectures Software Developer’s Manual,”
http://www.intel.com/products/processor/manuals/ (Accessed on 01-07-10).
[20] K. Skadron, T. Abdelzaher, and M. R. Stan, “Control-Theoretic Techniques and
Thermal-RC Modeling for Accurate and Localized Dynamic Thermal Manage-
ment,” in Proc. IEEE HPCA, pp. 17–28, 2002.
[21] K.-J. Lee and K. Skadron, “Using Performance Counters for Runtime Tem-
perature Sensing in High-Performance Processors,” in Proc. IEEE IPDPS, pp.
232–237, 2005.
[22] J. Pouwelse, K. Langendoen, R. Lagendijk, and H. Sips, “Power-Aware Video
Decoding,” in 22nd Picture Coding Symposium, 2001.
[23] M. Mesarina and Y. Turner, “Reduced energy decoding of MPEG streams,”
Multimedia System, vol. 9, no. 2, pp. 202–213, 2003.
[24] X. Liu, P. Shenoy, and M. Corner, “Chameleon: Application Level Power Man-
agement with Performance Isolation,” in Proc. ACM MULTIMEDIA, pp. 839–
848, 2005.
[25] P. Michaud, A. Seznec, D. Fetis, Y. Sazeides, and T. Constantinou, “A Study
of Thread Migration in Temperature-Constrained Multicores,” ACM Trans. Ar-
chitecture and Code Optimization, vol. 4, no. 2, 2007.
[26] A. Kumar, L. Shang, L.-S. Peh, and N. K. Jha, “HybDTM: A Coordinated
Hardware-Software Approach for Dynamic Thermal Management,” in Proc.
129
ACM DAC, pp. 548–553, 2006.
[27] L. Shang, L.-S. Peh, A. Kumar, and N. K. Jha, “Thermal Modeling, Charac-
terization and Management of On-Chip Networks,” in Proc. IEEE MICRO, pp.
67–78, 2004.
[28] N. Bansal, T. Kimbrel, and K. Pruhs, “Speed Scaling to Manage Energy and
Temperature,” Journal of ACM, vol. 54, no. 1, pp. 1–39, 2007.
[29] J. Sergent and A. Krum, Thermal Management Handbook, Columbus, USA,
McGraw-Hill, 1998.
[30] S. Wang and R. Bettati, “Reactive Speed Control in Temperature-Constrained
Real-Time Systems,” in Proc. IEEE ECRTS, pp. 73–95, 2006.
[31] W. Yuan and K. Nahrstedt, “Energy-Efficient Soft Real-Time CPU Scheduling
for Mobile Multimedia Systems,” in Proc. ACM SOSP, pp. 149–163, 2003.
[32] J. R. Lorch and A. J. Smith, “Improving Dynamic Voltage Scaling Algorithms
with PACE,” in Proc. ACM SIGMETRICS, pp. 50–61, 2001.
[33] F. Gruian, “Hard real-time scheduling for low-energy using stochastic data and
DVS processors,” in Proc. IEEE ISLPED, pp. 46–51, 2001.
[34] PAPI, “Performance API,” Available from http://icl.cs.utk.edu/papi (Accessed
on 01-07-10).
[35] Intel, “Intel Atom Processor,” http://www.intel.com/products/processor/atom
(Accessed on 01-07-10).
[36] X. Chen, “Recursive Least-Squares Method with Membership Functions,” in
Proc. IEEE Machine learning and Cybernetics, pp. 1962–1966, 2004.
130
[37] F. Kreith and M. S. Bohn, Principles of Heat Transfer, Monterey, USA,
Brooks/Cole Publishing Company, 2000.
[38] D. Bovet and M. Cesati, Understaning the Linux Kernel, Sebastopol, USA,
O’Reilly Media, Inc, 2005.
[39] C. C. Minh, “STAMP - Stanford Transactional Applications for Multi-
Processing,” Available from http://stamp.stanford.edu/ (Accessed on 01-07-10).
[40] J. Li and J. F. Martinez, “Power-Performance Implications of Thread-level Par-
allelism on Chip Multiprocessors,” in Proc. IEEE ISPASS, pp. 124–134, 2005.
[41] N. Bansal, T. Kimbrel, and K. Pruhs, “Dynamic Speed Scaling to Manage
Energy and Temperature,” in Proc. IEEE FOCS, pp. 520–529, 2004.
[42] R. Rao and S. Vrudhula, “Performance Optimal Processor Throttling under
Thermal Constraints,” in Proc. IEEE CASES, pp. 257–266, 2007.
131
VITA
In Choon Yeo received his B.S. in computer engineering from Dongguk University,
Korea, in 1995 and his M.S. in computer engineering from Dongguk University, Korea,
in 1997. He graduated with a Ph.D. in computer science and engineering from Texas
A&M University in December 2009.
During 1997-2004, he worked as a system engineer for Sindoricoh in Korea. His
research interests include high-performance energy-efficient computer architectures,
Dynamic Thermal Management (DTM) and Dynamic Power Management (DPM)
on Multicore, compiler and hardware support for dynamic optimizations, virtual ma-
chines, and binary instrumentation. He may be contacted at:
Department of Computer Science and Engineering
Texas A&M University
College Station, TX 77843-3112
