Dimetrodon: Processor-level Preventive Thermal Management via Idle Cycle Injection by Reddi, Vijay Janapa et al.
 
Citation Bailis, Peter, Vijay Janapa Reddi, Sanjay Gandhi, David Brooks,
and Margo Seltzer. 2011. Dimetrodon: processor-level preventive
thermal management via idle cycle injection. In Proceedings of the
48th Design Automation Conference (DAC 2011), San Diego,
California, June 5-10, 2011, ed. Leon Stok, Nikil D. Dutt, and
Soha Hassoun.
Accessed February 19, 2015 9:01:22 AM EST
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:8739093
Terms of Use This article was downloaded from Harvard University's DASH
repository, and is made available under the terms and conditions
applicable to Open Access Policy Articles, as set forth at
http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#OAP

Peter Bailis, Vijay Janapa Reddiz, Sanjay Gandhi, David Brooks, and Margo Seltzer
Harvard University, zAMD Research
Abstract
Processor-level dynamic thermal management techniques have long
targeted worst-case thermal margins. We examine the
thermal-performance trade-offs in average-case, preventive thermal
management by actively degrading application performance to achieve
long-term thermal control. We propose Dimetrodon, the use of idle
cycle injection, a flexible, per-thread technique, as a preventive
thermal management mechanism and demonstrate its efficiency compared
to hardware techniques in a commodity operating system on real
hardware under throughput and latency-sensitive real-world workloads.
Compared to hardware techniques that also lack flexibility,
Dimetrodon achieves favorable trade-offs for temperature reductions
up to 30% due to rapid heat dissipation during short idle intervals.
Categories and Subject Descriptors
C.0 [Computer Systems Organization]:
General—Hardware/Software interfaces and System architectures
General Terms
Performance, Reliability
Keywords
Thermal management, Average-case design, Idle injection
1. INTRODUCTION
Thermal management is increasingly important across several domains.
Increased operating temperatures can result in exponentially reduced
mean-time-to-failure (MTTF) values [25], while processor leakage
power increases exponentially with temperature [23, 24]. Power costs
have begun to eclipse the cost of physical hardware [6], and the
power required to cool a processor is nearly equivalent to the
electricity required to power it [17]. Up to 80% of data center
construction cost is attributable to power and cooling infrastructure
[5], and chiller power, a historically dominant data center energy
overhead, scales quadratically with the amount of heat extracted
[18]. Processor cooling is also a significant problem for mobile
devices, as thermal conditions can affect user experience through
both heat dissipation and potentially intrusive cooling solutions
(e.g., noisy fans) [22].
Traditional dynamic thermal management (DTM) techniques focus on
reducing worst-case thermal emergencies but do not contribute to
lowering overall temperatures. While these techniques have many
benefits such as increased reliability [23] and decreased chip
packaging requirements [24], they are not designed to operate under
normal thermal conditions. In practice, these DTM mechanisms are not
activated except under extreme thermal conditions that are likely
caused by some other catastrophic failure (e.g., cooling system
problems).

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
DAC ’11 Jun 05-10 2011, San Diego, CA, USA
Copyright 2011 ACM 978-1-4503-0636-2/11/06 ...$10.00.

This work focuses on reducing average-case processor operating
temperatures, exploring the trade-offs between application perfor-
mance and long-term thermal behavior through preventive thermal
management. Our focus is on thread-level thermal management;
once a thread is executing on a particular core, we want to control
its thermal impact. Multicore-aware strategies, such as core migra-
tion [11] as well as more complex thermal-aware thread schedule
placement [9], are orthogonal to the problem we consider here but
are potentially complementary to our goals. We focus solely on
reducing temperature but also ensure that additional energy is not
consumed by the CPU as a result.
Dimetrodon¹ is a software-level thermal management technique designed
to assist in application-level proactive thermal management. We
employ idle cycle injection, a scheduler-level mechanism that injects
idle cycles of variable length into process execution. This provides
responsive, fine-grained control: individual threads absorb
substantial portions of the cooling burden while performance
reductions are carefully mitigated. Per-thread policy control allows
us to target only key heat-producing workloads, as opposed to
system-wide policies such as current dynamic voltage and frequency
scaling (DVFS) mechanisms, which may unfairly penalize heterogeneous
workloads [12].
We implemented Dimetrodon in a commodity operating system and
evaluated its efficiency in reducing temperatures while providing
predictive throughput and latency models. We evaluate our techniques
on modern commodity hardware using a mix of industry-standardized
throughput and latency-sensitive workloads and derive quantitative
models for the trade-off between application performance and
temperature reduction (over the idle temperature). Dimetrodon
achieves favorable temperature to throughput trade-offs for many of
our workloads (over 16:1 for small temperature reductions and close
to 1:1 for larger reductions) and outperforms similar techniques such
as voltage and frequency scaling for temperature reductions up to
30%. Dimetrodon’s strength comes from injecting short idle periods
during which the processor is able to cool quickly; however, larger
idle periods exhibit diminishing marginal benefits, decreasing its
efficiency for large temperature reductions.
The main contributions of this work are:
• We propose the use of idle cycle injection, a flexible software
technique, for per-thread preventive thermal control.
• We present an implementation and evaluation of several preventive
thermal management techniques on real hardware in a commodity
operating system.
• We characterize the application-level impact across worst-case
thermal load and real-world workloads.

¹ The Dimetrodon genus of large prehistoric reptiles were the dominant
carnivores of their time. Dimetrodon possessed a large sail attached to
its back, which enabled efficient thermal regulation and was likely used
to control its body temperature [7].

[Figure 1: plot of power (W, 0 to 80) versus time (0 to 3.8 seconds)
for race-to-idle and Dimetrodon.]
Figure 1: Race-to-idle versus Dimetrodon power consumption.
The scheduler injected idle cycles into a multi-threaded CPU-bound
process, lowering average power consumption during execution; the
four power levels correspond to periods during which a varying number
of the four processor cores idled.
2. DESIGN AND MODEL
Typical race-to-idle scheduling, under which jobs are run to
completion and the processor subsequently idles, incurs twofold
energy costs: it requires energy to power the CPU and energy to power
the cooling system to dissipate the heat generated by this
unrestricted execution. If, however, we can run the job more slowly,
incorporating some of the idle cycles into the job execution, the
processor will produce less heat on average during execution. Figure
1 shows how we can achieve lower power on average during computation
by slowing execution. We can redistribute computation in order to
lower the average power dissipated while the processor is active.
Our goal is to maintain a lower average temperature over time. To
target this average-case execution, Dimetrodon periodically idles the
processor, injecting idle cycles at the scheduler level. When the
processor idles for periods of time in between regular program
execution (at the level of the scheduler quantum, or timeslice), it
briefly enters low-power states and cools. Idle cycle injection can
be implemented as a scheduler policy that can be adjusted online
according to the thermal profile and performance constraints of the
application. The timescale of quanta is measured in milliseconds,
which allows the processor long-term cooling opportunities.
From the perspective of a single thread, Dimetrodon moves idle cycles
from idle periods after it completes to periodic intervals during
process execution. We can vary both the proportion and length of
these idle periods. Assuming that we can enter similar idle states
during these injected periods as after execution completes, we
consume the same total energy while using less average power (§2.2).
We express the proportion of idle periods as a probability; this is
not the only possible injection model, but it simplifies our analysis
and implementation.
2.1 Per-thread Software-level Control
Dimetrodon provides high ﬂexibility. At the software level (par-
ticularly at the operating system or hypervisor level), one can con-
trol which threads are affected with arbitrary precision. Based
on software system-speciﬁc information such as a process’s user-
granted priority level, thermal characteristics, voluntary and invol-
untary preemption patterns, and overall system condition (temper-
ature, power consumption, etc.), the thermal management system
can make more informed decisions about when to idle the proces-
sor and which threads to slow. Similarly, one can override policy
control for high-priority threads or in times of high system load.
Software-level control allows ﬁne-grained policies independent
of the hardware platform. DVFS is not yet available for individual
cores on commodity hardware [14] and may be limited by mini-
mum voltage constraints, while it and other hardware techniques
such as clock throttling typically allow only coarse, system-level
policies with limited conﬁgurability [8]. Operating system control
provides a wide range of possibilities: one can easily select the de-
sired trade-off between throughput and heat with high resolution,
allowing changes in throughput on the magnitude of fractions of a
percent, as opposed to tens of percent. On processors that do not
support low power idle states or clock gating, Dimetrodon is still
useful as executing an idle loop of nop equivalents allows many
functional units within the processor to cool. The decision whether
to idle can be made efﬁciently and effectively at the scheduler level.
2.2 Analytical Model: Throughput and Power
Under Dimetrodon, each time the scheduler is about to sched-
ule a thread, with user-deﬁned probability p, it instead runs the idle
thread for a quantum of length L. By varying p and L, we achieve
different trade-offs between cooling and latency. For example, if p
is 50% and L is the same length as a scheduling quantum, then we
double the length of time for the job to run, but we lower the av-
erage heat produced by the job. Overall, the processor will use the
same total amount of energy to complete the computation, provided
it can enter similar idle states in-between computation intervals as
following it. By increasing p, we increase the job’s latency, but can
reduce temperature by more. Decreasing L can gain back some of
the latency loss at a possibly reduced cooling beneﬁt.
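
The decision described above can be sketched as a small simulation. This is only an illustrative Python model, not the kernel implementation: the function name `simulate_injection`, the quantum-counting loop, and the seeded random generator are assumptions made for the example.

```python
import random


def simulate_injection(exec_quanta, p, rng):
    """Count idle quanta injected while a thread receives `exec_quanta` quanta.

    Each time the (simulated) scheduler is about to run the thread, it runs
    the idle thread instead with probability p, then tries again.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    idle_quanta = 0
    executed = 0
    while executed < exec_quanta:
        if rng.random() < p:
            idle_quanta += 1   # an idle quantum of length L is injected
        else:
            executed += 1      # the thread runs for one quantum
    return idle_quanta


rng = random.Random(0)  # fixed seed so the illustration is repeatable
idle = simulate_injection(100_000, 0.5, rng)
# The ratio of idle to executed quanta should be close to p / (1 - p).
print(idle / 100_000)
```

With p = 50% the simulated thread sees roughly one idle quantum per executed quantum, which matches the doubling of runtime described above when L equals the scheduling quantum.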
Understanding the impact of the p and L parameters and select-
ing appropriate conﬁgurations is subtle. The analysis presented
here and the validation and evaluation in Section 3 provide insight
as to how Dimetrodon can be used in practice.
Throughput and runtime. We can consider a CPU-bound thread t that
spends its entire real-world runtime of R seconds on the CPU with an
average quantum length of q milliseconds. Suppose t is scheduled S
times before it completes. If we idle the processor for a period of
time L with probability p each time t is scheduled, then we can
predict t’s runtime under Dimetrodon, D(t):

$$D(t) = R + S \cdot \frac{p}{1 - p} \cdot L$$

The thread must run for at least R seconds in order to complete, but
each time t is scheduled it may be preempted. $\frac{p}{1-p}$ is the
number of idle quanta per execution quantum of t. For example, if we
idle with probability 75%, then 3 out of 4 times t is scheduled we
will idle instead, so there will be 3 idle quanta for every 1
executed quantum. Therefore, there are 3S idle quanta. In practice we
can determine S by dividing R by the average quantum length of t, q.
We do not affect the fixed, non-CPU-related runtime overheads of
thread execution.
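
As a worked illustration of the runtime model, the following Python sketch evaluates D(t) with S = R / q. The function name `predicted_runtime` is hypothetical, and for simplicity the sketch assumes all times (R, q, and L) are given in seconds rather than mixing seconds and milliseconds.

```python
def predicted_runtime(R, q, p, L):
    """Predicted runtime D(t) = R + S * p / (1 - p) * L, with S = R / q.

    R: unmodified CPU-bound runtime (seconds)
    q: average quantum length of the thread (seconds)
    p: idle injection probability, 0 <= p < 1
    L: idle quantum length (seconds)
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if q <= 0:
        raise ValueError("q must be positive")
    S = R / q  # number of times the thread is scheduled
    return R + S * (p / (1.0 - p)) * L


# p = 50% with L equal to the scheduling quantum roughly doubles the runtime.
print(predicted_runtime(R=7.0, q=0.1, p=0.5, L=0.1))
```

Holding p fixed and shortening L reduces the predicted runtime penalty proportionally, which is the latency trade-off noted earlier in this section.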
Power. Varying the number of idle cycles injected results in lower
average processor power consumption and therefore less heat
dissipation by the processor while maintaining the same total energy,
provided we can enter the same idle states. Transition times in the
tens of µs [15] are negligible at quanta lengths measured in ms;
however, microarchitectural state may play a larger role (e.g., if a
low power state flushes cache lines).

We analytically compare Dimetrodon’s power consumption to a typical
race-to-idle scenario. As above, consider a thread t with runtime R,
average quantum length q, and idle cycle length L. For race-to-idle,
we consider a window of time of length D(t) such that the processor
idles for time $t_{idle} = D(t) - R$. Assuming an active processor
power consumption of u watts and idle power consumption of m watts,
then under race-to-idle the processor will consume

$$uR + t_{idle}\, m$$

joules. Under Dimetrodon, the processor will consume u watts during
the R seconds that t is running and m watts for the
$\frac{L}{q} \cdot \frac{p}{1-p} \cdot R$ seconds it is idling.
Therefore, with Dimetrodon, the processor will consume

$$uR + \frac{L}{q} \cdot \frac{p}{1 - p} \cdot m R$$

joules. Because S = R/q, the race-to-idle idle time
$t_{idle} = D(t) - R = S \cdot \frac{p}{1-p} \cdot L$ equals
$\frac{L}{q} \cdot \frac{p}{1-p} \cdot R$, so the two policies consume
the same amount of total energy. Intuitively, this is the case
because we simply shift the idle cycles from after the computation
ends to between compute quanta. The average power while executing t
is lower for 0 < p < 1.
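
The energy equivalence can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, all times are assumed to be in seconds, and the power values (u = 80 W, m = 20 W) are made-up inputs rather than measurements from this work.

```python
import math


def idle_seconds(R, q, p, L):
    """Injected idle time (L / q) * p / (1 - p) * R, all times in seconds."""
    return (L / q) * (p / (1.0 - p)) * R


def race_to_idle_energy(R, q, p, L, u, m):
    """Energy over a window of length D(t): run for R, then idle for D(t) - R."""
    S = R / q
    D = R + S * (p / (1.0 - p)) * L
    return u * R + (D - R) * m


def dimetrodon_energy(R, q, p, L, u, m):
    """Energy when the same idle time is interleaved with execution."""
    return u * R + idle_seconds(R, q, p, L) * m


def average_power(R, q, p, L, u, m):
    """Average power over the stretched execution window under injection."""
    window = R + idle_seconds(R, q, p, L)
    return dimetrodon_energy(R, q, p, L, u, m) / window


# Hypothetical inputs: 7 s job, 100 ms quanta, 50 ms idle periods.
args = dict(R=7.0, q=0.1, p=0.5, L=0.05, u=80.0, m=20.0)
print(math.isclose(race_to_idle_energy(**args), dimetrodon_energy(**args)))
print(average_power(**args))
```

Under these assumptions the total energy matches while the average power during execution falls below u, which is the property the thermal argument relies on.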
3. EVALUATION
We evaluated Dimetrodon in a commodity operating system on a
modern server in order to characterize its effects on processor heat
and the role of parameters p and L. We ﬁrst validate our analytical
model presented in Section 2.2, then evaluate Dimetrodon using a
worst-case heat generation stress test, cpuburn. We next examine
real-world workloads from the SPEC CPU2006 benchmark suite
and demonstrate per-thread control. We also characterize the effect
on latency-sensitive workloads using SPEC Web.
3.1 Implementation
We implemented Dimetrodon in the FreeBSD 7.2 kernel². When the
scheduler selects the next thread to run, we decide whether to run
the thread or whether to run the idle thread. If we decide to idle,
we pin the thread that would have run on the runqueue (so it is not
run by another processor) and schedule the kernel idle thread
instead, which causes the processor to enter the idle state. Once the
idle quantum is over, the preempted thread is unpinned and is made
runnable again. While we could have avoided context switching
overheads by trapping the thread in the kernel and issuing hlt
instructions, choosing to run the idle thread greatly simplified our
implementation. We control Dimetrodon using system calls.
We always schedule kernel-level threads. This is a policy decision
that could easily be changed; however, care should be taken to avoid
preempting certain critical kernel threads. For example, when
servicing an interrupt from the network as in a web server, a kernel
thread will first run to handle the interrupt and then notify a user
thread. If we preempt kernel threads, then the processing of the
network event may be delayed twice: once in the kernel and again in
the user thread.
3.2 Experimental Setup
We tested Dimetrodon on a representative 1U rackmount server. Our
server had an Intel Nehalem-based Xeon E5520 quad-core processor
running at 2.26 GHz, rated at 80 watts, within a Supermicro
SYS-6016T-MTLF chassis. It had 4 GB of DDR3 ECC RAM, a 500 GB 7200
RPM hard drive, and four five-watt case fans. The processor supported
the C1E low power state (which does not flush the processor cache)
and had DVFS scaling settings every 133 MHz with a minimum frequency
of 1.6 GHz (71% of maximum). To measure processor power consumption,
we connected a Fluke i410 current clamp to the processor power leads
and used a Keithley 2701 Ethernet-enabled multimeter to collect
measurements. We maintained a thermostat setpoint at 25.2 °C and
fixed the system fan speed at full using an external controller. We
used the FreeBSD coretemp module to measure temperature for each core
(reported results) and used external temperature sensors at the
server’s rear air vents.

² FreeBSD 7.2 supports two schedulers: the 4.4 BSD scheduler, a
traditional multi-level feedback queue with a fixed timeslice of 100 ms,
and the ULE scheduler, a modern scheduler designed to better support
multiprocessor and low latency systems. The 4.4 BSD scheduler was the
default scheduler through FreeBSD 7.0 [26]. For simplicity of
implementation, we modified the 4.4 BSD scheduler; however, the
mechanism generalizes to ULE and other schedulers.

In order to simplify our analysis, we disabled simultaneous mul-
tithreading (SMT) on our processor, which allows multiple thread
contexts to execute on a single core. In order to cause the entire
core to enter the C1E low power state we need to halt all thread
contexts on the core. This is feasible but requires additional care in
co-scheduling idle quanta.
For each configuration, we executed four instances of each benchmark
in parallel (one per core). To evaluate the trade-offs, we compare
the reduction in temperature (over the idle temperature) with the
reduction in application performance. We refer to 1:1 trade-offs
solely as a baseline for comparison.
3.3 Model Validation
We validated both our power and throughput models using the
cpuburn package [20] (speciﬁcally burnP6), which contains a
single-threaded inﬁnite loop containing a compact sequence of x86
instructions designed to thermally stress test processors.
We compared our analytical model for throughput to our
implementation. We ran our finite cpuburn under a variety of
configurations and measured nominal deviation from the predictive
model. For 100 trials per configuration (p ∈ {.25, .5, .75},
L ∈ {25, 50, 75, 100} ms), our implementation resulted in throughputs
that were on average 1.0% lower than expected. This throughput
reduction is due mostly to configurations with higher p; we believe
the deviation from our model increased as p increased largely due to
context switching and state monitoring overheads.
We measured the energy consumed by Dimetrodon versus race-to-idle for
equivalent periods of time (for p ∈ {.25, .5, .75}, L ∈ {50, 100}
ms). We recorded the power consumed by the processor three times per
millisecond throughout the execution of a finite loop of cpuburn
instructions with a runtime of 7 seconds. Averaged over five trials
for each benchmark, Dimetrodon consumed between 97.6% and 103.7% of
the energy of race-to-idle, with an average deviation of -0.37% and
an average absolute deviation of 1.67%. Given the clamp accuracy
(approximately 3.5%), these measurements validate our power model.
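
For readers reproducing this kind of comparison, the following Python sketch shows one simple way to turn fixed-rate power samples into an energy estimate and a relative deviation. It is an assumption about the post-processing, not a description of the tooling used here; the function names and the constant 50 W trace are hypothetical.

```python
def energy_joules(samples_watts, sample_rate_hz):
    """Approximate energy as the sum of power samples times the sample period."""
    dt = 1.0 / sample_rate_hz
    return sum(samples_watts) * dt


def relative_deviation(measured, reference):
    """Signed deviation of `measured` from `reference`, as a fraction."""
    return (measured - reference) / reference


# Three samples per millisecond corresponds to a 3000 Hz sample rate.
SAMPLE_RATE_HZ = 3000
# Hypothetical trace: a constant 50 W for 7 seconds.
trace = [50.0] * (7 * SAMPLE_RATE_HZ)
print(energy_joules(trace, SAMPLE_RATE_HZ))
print(relative_deviation(103.7, 100.0))
```

A deviation computed this way can then be compared against the measurement accuracy of the current clamp, as in the validation above.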
3.4 System Characterization
We next characterized Dimetrodon behavior for static policies
under a worst-case thermal load. We executed many instances of
cpuburn concurrently, fully burdening the processor. Core tem-
peratures stabilized after approximately 300 seconds of cpuburn,
and the average relative rise above the idle temperature was ap-
proximately equivalent across fan speed conﬁgurations. As shown
in Figure 2, the temperature regularly ﬂuctuates but is signiﬁcantly
reduced from the conﬁguration where no injection occurs. These
ﬂuctuations are due to our probabilistic implementation; a more de-
terministic model would likely result in smoother curves but with
similar overall temperature trends.
[Figure 2: plot of core temperature rise over idle (°C, 0 to 20)
versus time (0 to 300 s) for p = 0, .25, .5, and .75.]
Figure 2: Average core temperature increase over the idle temperature
during five minutes of cpuburn execution for different idle
proportions (p). L=100 ms.

[Figure 3: plot of efficiency (temperature:throughput, 0 to 16)
versus quanta length (L, 1 to 100 ms) for p = .1, .25, .5, and .75.]
Figure 3: Efficiency of Dimetrodon for cpuburn varying idle quanta
length (L) and proportion (p). There are diminishing marginal
benefits to increasing the quanta length. Higher p curves are
smoother due to the probabilistic implementation.

We measured the average temperature over the last 30 seconds of a 300
second execution and calculated the reduction in temperature as a
fraction of the temperature rise over idle produced by unconstrained
operation. For example, an idle temperature of 40 °C, an
unconstrained temperature of 60 °C, and a resulting temperature of
50 °C would constitute a 50% reduction in temperature over idle.
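
The metric in the example above can be written out as a short Python sketch. The function names are hypothetical; `efficiency` simply expresses the temperature:throughput ratio used when discussing trade-offs such as 1:1.

```python
def temperature_reduction(idle_c, unconstrained_c, observed_c):
    """Reduction in temperature over idle, as a fraction of the unconstrained rise."""
    rise = unconstrained_c - idle_c
    if rise <= 0:
        raise ValueError("unconstrained temperature must exceed idle temperature")
    return (unconstrained_c - observed_c) / rise


def efficiency(temp_reduction, throughput_reduction):
    """Temperature:throughput trade-off ratio (1.0 corresponds to 1:1)."""
    if throughput_reduction <= 0:
        raise ValueError("throughput reduction must be positive")
    return temp_reduction / throughput_reduction


# The example from the text: 40 C idle, 60 C unconstrained, 50 C observed.
print(temperature_reduction(40.0, 60.0, 50.0))  # 0.5, i.e. a 50% reduction
```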
Dimetrodon achieved at least a 1:1 trade-off between cpuburn
throughput and temperature decrease compared to the idle temperature
but typically achieved better. Efficiency was correlated with cycle
length; as shown in Figure 3, short idle quanta lengths are
particularly efficient, but there are diminishing marginal returns
for longer quanta lengths. Fundamentally, each core was able to cool
(exponentially) quickly within a short time window, but this
efficiency decreased in longer time windows. Accordingly, we obtained
smaller temperature decreases more efficiently than large temperature
decreases. For a particular reduction in throughput, preempting
cpuburn for a shorter idle cycle duration (decreasing L) but at a
more frequent interval (increasing p) allowed for a better
temperature to throughput trade-off than using a longer idle quantum
length. $\frac{100p}{L} > 1$ holds for Pareto boundary
configurations. We achieved a 16:1 temperature to throughput
trade-off at a temperature reduction of 4.4%, but only a 1:1
trade-off at a temperature reduction of 90%. For large reduction
targets, we could not use the more efficient short idle quanta and
efficiency dropped.
Figure 4: Wide-range parameter sweeps of Dimetrodon compared to other
thermal management techniques. The Pareto boundary is darkened.

We developed a quantitative metric in order to better characterize
the trade-off between throughput and temperature. By curve-fitting
the Pareto boundary between temperature and throughput, we quantify
the trade-off between desired temperature reduction r and throughput
reduction T(r) as

$$T(r) = \alpha r^{\beta}$$

where $\alpha$ and $\beta$ are constants for $r \in [0, .75]$. For
cpuburn, $\alpha = 1.092$ and $\beta = 1.541$. For $r > .75$,
$T(r) \approx r$.
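
To make the fitted trade-off concrete, the following Python sketch evaluates the throughput reduction for a desired temperature reduction using the cpuburn constants 1.092 and 1.541. It assumes a power-law form (constant times r raised to a constant exponent) with r and T(r) expressed as fractions; the function name and the handling of the region above .75 are illustrative choices.

```python
def throughput_reduction(r, alpha=1.092, beta=1.541, fit_limit=0.75):
    """Estimated throughput reduction T(r) for temperature reduction r.

    Assumes T(r) = alpha * r**beta inside the fitted range [0, fit_limit]
    and T(r) ~= r above it. Both r and the result are fractions in [0, 1].
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a fraction in [0, 1]")
    if r > fit_limit:
        return r  # roughly 1:1 beyond the fitted range
    return alpha * r ** beta


for r in (0.1, 0.25, 0.5, 0.9):
    print(r, round(throughput_reduction(r), 3))
```

Inside the fitted range the exponent above 1 means that small temperature reductions cost proportionally less throughput than large ones, consistent with the efficiency trend in Figure 3.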
Finally, we compared Dimetrodon to several other comparable
techniques. We ran cpuburn under static policies using voltage and
frequency scaling (VFS)³ and the FreeBSD p4tcc driver, which controls
the processor’s thermal control circuit as a fine-grained clock
gating technique [4], exhaustively sweeping setpoints for each
mechanism. As shown in Figure 4, VFS allowed good trade-offs between
throughput and temperature (for example, a 30% throughput reduction
produced a 50% temperature reduction) due in large part to its
quadratic reduction in power utilization as voltage scales down.
Additionally, unlike Dimetrodon, VFS actually reduces total power
consumed by the system.
However, Dimetrodon outperformed VFS for temperature reduc-
tions up to 30%. These data points correspond to the very short, ef-
ﬁcient idle quanta lengths previously shown in Figure 3. However,
the diminishing marginal utility of idle cycle lengths limited idle
cycle injection’s effectiveness for large temperature reductions, at
which point the quadratic power beneﬁts of VFS became more ef-
fective. In cases where system-wide policies are tolerable, it is ad-
vantageous to use VFS for preventive thermal management when
temperature reductions greater than 30% are necessary.
While small idle quanta allowed the best trade-offs, p4tcc, which
activated fine-grained, clock-level duty cycling, performed
significantly worse, failing to achieve even 1:1 temperature to
throughput trade-offs at high temperature reductions. This suggests
that the optimal idle cycle length is longer than the length of
several clock signals and that the benefit of reducing cycle length
decreases at extremely short time periods. Based on these results,
the optimal idle period appears closer to the order of one ms, but
may be shorter.
3.5 CPU-Bound Workloads
We subsequently evaluated Dimetrodon impact on several real-
world workloads from the SPEC CPU2006 benchmarking suite [1],
an industry-standard benchmark designed to test system processor
and memory performance.
We first determined the thermal profiles of each of the SPEC CPU
benchmarks. Based on this characterization, we selected six
benchmarks that spanned a range of thermal profiles to examine in
further detail. We tracked the average quantum length for each of the
benchmarks (q) and found that the workloads were entirely CPU-bound.
Consequently, our throughput model is also applicable to these
workloads.

Workload   Rise (%)   α       β
cpuburn    100        1.092   1.541
calculix   99.3       1.282   1.697
namd       87.2       1.248   1.546
dealII     84.4       1.324   1.688
bzip2      84.4       1.529   1.811
gcc        80.3       1.425   1.848
astar      71.7       1.351   1.416
Table 1: Real workload results. Average per-core temperature increase
over the idle temperature for benchmarks from SPEC CPU2006 expressed
as a percentage of the temperature increase for cpuburn when run
unmodified (race-to-idle). Best-fit parameter estimation is shown for
throughput reduction T(r) for r ∈ [0, .5].

³ Because FreeBSD did not support DVFS for our motherboard and
processor, we ran our VFS tests under Linux 2.6.32 (Ubuntu 10.04 LTS)
using the same binary and number of processes. We verified that the
cpuburn temperature increases are equivalent on both Linux and FreeBSD
with default processor settings.

We then examined Dimetrodon’s effectiveness across various idle
quantum lengths and probabilities as in our characterizations and
developed predictive models for each workload, shown in Table 1.
Despite the observed differences in absolute temperature increases,
the differences in Pareto optimal trade-offs between throughput and
temperature were negligible, except at low throughput reductions. The
absolute amount of heat being dissipated was different across
workloads, but the relative efficiency curves did not substantially
change. All workloads achieved better than 1:1 temperature to
throughput trade-offs until at least 50% temperature reductions. Most
benchmarks behaved similarly to cpuburn except astar, which was less
effectively modulated; this is because it was significantly
cooler-running than the other benchmarks and therefore benefited less
from aggressive thermal modulation.
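
One way to use the per-workload fits in Table 1 is to invert them: given a tolerable throughput reduction, estimate the largest temperature reduction the fit predicts. The Python sketch below assumes the two fitted constants per workload enter a power law of the form constant times r raised to an exponent, with fractions for r and T(r). The dictionary, the function name, and the clamp at the fitted limit of .5 are illustrative choices rather than part of the evaluation.

```python
# (multiplier, exponent) per workload, copied from Table 1.
TABLE_1 = {
    "cpuburn": (1.092, 1.541),
    "calculix": (1.282, 1.697),
    "namd": (1.248, 1.546),
    "dealII": (1.324, 1.688),
    "bzip2": (1.529, 1.811),
    "gcc": (1.425, 1.848),
    "astar": (1.351, 1.416),
}


def max_temperature_reduction(workload, throughput_budget, fit_limit=0.5):
    """Invert T(r) = a * r**b to get r for a given throughput reduction.

    The result is clamped to `fit_limit` because the fit is only reported
    for r in [0, .5]; the sketch does not extrapolate beyond that range.
    """
    if not 0.0 <= throughput_budget <= 1.0:
        raise ValueError("throughput_budget must be a fraction in [0, 1]")
    a, b = TABLE_1[workload]
    r = (throughput_budget / a) ** (1.0 / b)
    return min(r, fit_limit)


# Estimated temperature reduction for a 10% throughput budget.
for name in TABLE_1:
    print(name, round(max_temperature_reduction(name, 0.10), 3))
```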
3.6 Thread-Speciﬁc Control
We now demonstrate Dimetrodon’s usefulness in per-thread control.
Degrading total system performance to limit a single process’s heat
output is inefficient; instead, as discussed in Section 2.1,
per-thread control is desirable for thermally heterogeneous workload
combinations. In this demonstration, we consider a periodic,
short-running process, the “cool” process (a loop that executed
cpuburn for six seconds, slept for one minute, and repeated),
executing concurrently with a CPU-bound application, the “hot”
process (four instances of calculix). As shown in Figure 5, degrading
the cool process’s performance because it is co-located with the hot
process is undesirable if we want to optimize for per-process
throughput while lowering temperature. Under global,
non-thread-specific (system-wide) thermal actuation, the “cool”
process is unfairly penalized for the “hot” process’s heat
generation. With per-thread control, the “cool” process can run
uninterrupted while system-level temperatures are lowered. For
per-chip techniques such as DVFS, the only solution to this problem
is to intelligently schedule jobs across machines in the datacenter
[16], which is possibly expensive if performed online, or to migrate
threads between cores [11], which may be ineffective on
fully-burdened machines [9]. Instead, we can use per-thread control.
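
A per-thread policy of this kind can be sketched as a lookup table consulted at scheduling time. The Python model below is hypothetical: the `InjectionPolicy` type, the thread names used as keys, and the parameter values are assumptions for illustration and do not reflect the system call interface of the implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionPolicy:
    p: float     # probability of injecting an idle quantum
    L_ms: float  # idle quantum length in milliseconds


# Threads without an explicit policy are never delayed.
NO_INJECTION = InjectionPolicy(p=0.0, L_ms=0.0)


def idle_before_run(thread_name, policies, rng):
    """Return the idle time (ms) to inject before running the thread, or 0."""
    policy = policies.get(thread_name, NO_INJECTION)
    return policy.L_ms if rng.random() < policy.p else 0.0


# Hypothetical setup: only the "hot" workload is targeted.
policies = {"calculix": InjectionPolicy(p=0.5, L_ms=10.0)}
rng = random.Random(0)
print(idle_before_run("calculix", policies, rng))
print(idle_before_run("cool-loop", policies, rng))
```

Because the cool process has no entry in the table, it is always scheduled immediately, while the hot process absorbs the throughput cost of cooling.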
3.7 Web Server Workload
In order to demonstrate Dimetrodon’s impact on quality of service
(QoS) sensitive applications, in addition to our throughput-based
benchmarking, we also considered a latency-sensitive workload: web
serving. While the impact on a QoS-sensitive workload is dependent on
the particular QoS metric, we ran SPECWeb⁴ [2], an industry standard
web benchmark testing web server loads such as banking and eCommerce
applications. This workload is significantly different from the prior
workloads we have considered: latency is a critical factor and the
load generated by a request is much quicker to fulfill.

[Figure 5: plot of cool process throughput (%, 0 to 100) versus
temperature reduction over idle (%, 0 to 100) for per-thread and
global control.]
Figure 5: Global versus thread-specific control in Dimetrodon using
idle quanta injection. With thread-specific control, the lower-heat
“cool” process can execute without interruption while the system
temperature is lowered by degrading “hot” process performance. With
system-wide policies, cool processes are unfairly penalized. The
Pareto boundary is darkened.

We ran SPECWeb with 440 simultaneous connections split across two
physical clients on a private network. Our server experienced
approximately 15-25% load per core throughout the benchmark (the
highest load possible for our configuration), and the overall
temperature rise was approximately 6 °C with no thermal actuation.
Several competing factors inﬂuence Dimetrodon’s efﬁciency for
SPECWeb. The workload allows for idle periods in-between un-
interrupted processor execution due to time between requests, pro-
viding natural thermal modulation. Dimetrodon alters the distri-
bution of these idle periods, and, in delaying execution, can lead
to increases in the total number of outstanding requests. Because
SPECWeb continually issues requests, if we defer a request, then
when it is eventually processed there may be higher load on the
system, possibly leading to increased heat generation. Therefore,
injection efﬁciency depends on balancing the heat-reducing idle cy-
cle injection and deferring idle cycles, which increases processor
load and heat. Under heavy load, the processor is closer to satura-
tion; fewer natural idle periods exist, and injecting idle cycles does
not induce the same load-increasing behavior.
Overall, however, Dimetrodon is useful in reducing heat gen-
eration in a web serving context. SpecWEB performance is de-
termined by three QoS thresholds: “good” (three second response
or less), “tolerable” (ﬁve second response time or less), and “fail”
(longer than ﬁve seconds) [2], each providing a range for allowable
performance degradation. We show Dimetrodon’s efﬁciency across
a range of idle cycle amounts and lengths in Figure 6. At the lower,
“tolerable” QoS threshold, we allowed up to 20% temperature re-
ductions with virtually no drop-off in performance, and tempera-
turereductionsupto50%incurredcorrespondinglysmallercoststo
performance. Even under tighter requirements (“good” metric), we
allowed at least 1:1 and often better trade-offs until temperature re-
ductions of 30% or more, at which point performance quickly falls
below the acceptable range. Again, shorter quanta lengths were
more efﬁcient in reducing temperature than longer quanta lengths.
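The three QoS categories can be expressed as a small classifier. The following Python sketch is illustrative only: the names `classify_response` and `fraction_meeting` and the sample response times are hypothetical and are not taken from SPECWeb or from the Dimetrodon prototype. Only the three and five second bounds come from the text.

```python
GOOD_LIMIT_S = 3.0       # "good": response within three seconds
TOLERABLE_LIMIT_S = 5.0  # "tolerable": response within five seconds

def classify_response(seconds):
    """Return the QoS category for one response time in seconds."""
    if seconds <= GOOD_LIMIT_S:
        return "good"
    if seconds <= TOLERABLE_LIMIT_S:
        return "tolerable"
    return "fail"

def fraction_meeting(times, limit_s):
    """Fraction of responses completing within limit_s seconds."""
    if not times:
        return 0.0
    return sum(1 for t in times if t <= limit_s) / len(times)

# Hypothetical response times in seconds, for illustration only.
sample = [0.8, 2.9, 3.4, 4.9, 6.1]
categories = [classify_response(t) for t in sample]
```

Because every “good” response also satisfies the “tolerable” bound, the fraction meeting the tolerable threshold is never smaller than the fraction meeting the good threshold. This is why the tolerable metric leaves more room for performance degradation.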
4 Particularly, we used the SPECWeb2005 eCommerce workload, also found
in SPECWeb2009 [3].
[Figure 6 plot. x-axis: Temperature Reduction over Idle (%), 0 to 100; y-axis: Relative QoS (%), 0 to 100; series: Good, Tolerable.]
Figure 6: QoS and temperature reductions for web workload
according to both “good” and “tolerable” metrics. Dimetrodon
efﬁciency depends on the QoS threshold placement. The pareto
boundary is darkened.
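A pareto boundary of this kind can be computed from a set of measured configurations. The following Python sketch assumes each configuration is summarized as a (temperature reduction, relative QoS) pair with larger values preferred on both axes. The function name `pareto_front` and the sample points are hypothetical and are not data from Figure 6.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (temperature_reduction_pct, relative_qos_pct);
    larger is better on both axes.
    """
    front = []
    best_qos = float("-inf")
    # Visit the highest temperature reduction first; break ties by higher QoS.
    for temp, qos in sorted(points, key=lambda p: (-p[0], -p[1])):
        # A point is kept only if it beats the QoS of every point that
        # offers at least as much temperature reduction.
        if qos > best_qos:
            front.append((temp, qos))
            best_qos = qos
    return sorted(front)

# Hypothetical configurations, for illustration only.
points = [(10, 100), (20, 98), (20, 90), (30, 70), (25, 60), (50, 40)]
front = pareto_front(points)
```

Points that fall off the boundary, such as (25, 60) in this example, correspond to configurations for which another setting yields both a larger temperature reduction and better QoS.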
These results suggest that under latency-sensitive workloads, Dimetrodon’s
effect on performance will largely be determined by both the
QoS metric and the workload distribution.
4. RELATED WORK
Many dynamic thermal management techniques target worst-case
thermal bounds. There is a wide spectrum of methods [8] rang-
ing from the microarchitectural level [24] to the operating system
level [21]. Techniques such as DVFS, fetch throttling, and clock
gating are standard features on many existing microprocessors and
reduce the burden of worst-case temperature management [8]. Several
software-level scheduling schemes [9, 11] explicitly consider proactive
thermal management and the temperature-application performance
trade-offs involved. The key distinction between Dimetrodon
and previous work is its focus on preventive, average-case
cooling; our success metric is based on limiting overall tempera-
ture instead of bounding temperatures below a critical threshold,
which leads to a different approach to thermal management.
Dimetrodon’s idle cycle injection is similar to other techniques.
Rohou and Smith [21] targeted thermal reductions by restricting the
allowed CPU utilization of hot processes. Choi et al. [9] proposed
several reactive techniques for operating system level dynamic thermal
management, including a “cool loop” that would run when
the system became overburdened. We explore the effects of both
idle cycle length and proportion in a proactive thermal management
context. Gandhi et al. [10] proposed the use of a similar scheduler-
level idling technique for power-capping in data centers; Google
recently introduced this mechanism into the Linux kernel [19]. Dimetrodon
and this ﬁnal technique target different domains (heat
and power), but rearchitecting the power-capping mechanism to use
shorter idle quanta would provide thermally-beneﬁcial side-effects.
Dimetrodon is complementary to several existing thermal man-
agement techniques. While we have focused on idle cycle injec-
tion due to its ﬂexibility, many hardware methods such as DVFS
and throttling are applicable in a preventive thermal management
context. Dimetrodon can be used in conjunction with multi-server
thermal management solutions such as multi-core thermal manage-
ment [11] and thermal-aware job placement [16]. Dimetrodon acts
on a per-thread level, but can be combined with these techniques,
especially when temperature predictions are inaccurate, when a
data center experiences uniform temperature increases, or when
there are no “cool” machines available. The substantial amount of
literature targeting power reductions [13] may also prove useful in
preventive thermal management as power reductions can translate
to heat reductions.
5. CONCLUSIONS
We have presented a novel approach to preventively reducing
average-case processor temperatures, Dimetrodon, which injects
brief periods of inactivity into application execution, allowing the
processor to cool while entering a low-power idle state. Using a
prototype implementation and a real-world server-class platform,
we examined the trade-offs between application performance and
temperature reduction across both worst-case thermal load and a
range of real-world throughput and latency-sensitive benchmarks.
Dimetrodon is particularly effective for short idle periods but exhibits
diminishing beneﬁts with longer idle periods. Software idle
cycle injection can be applied on a per-thread basis and outperforms
voltage and frequency scaling (which allows quadratic reductions
in power) for temperature reductions up to 30%. We conclude that
Dimetrodon provides predictable performance trade-offs while al-
lowing ﬂexible operation.
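As background for the parenthetical on voltage and frequency scaling, the standard first-order model of CMOS dynamic power is sketched below. It is a textbook relation, not a result derived in this paper. The symbols α (activity factor), C (switched capacitance), V (supply voltage), f (clock frequency), and the scaling factor s are notation introduced here for illustration.

```latex
P_{\mathrm{dyn}} = \alpha C V^{2} f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(sV, f)}{P_{\mathrm{dyn}}(V, f)} = s^{2}, \quad 0 < s \le 1 .
```

Under this model, lowering the supply voltage by a factor s at a fixed frequency reduces dynamic power by a factor of s squared, which is the quadratic dependence referred to above.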
6. REFERENCES
[1] SPEC CPU2006. http://www.spec.org/cpu2006/.
[2] SPECWeb2005. http://www.spec.org/web2005/.
[3] SPECWeb2009. http://www.spec.org/web2009/.
[4] Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A:
System Programming Guide, March 2010.
[5] BARROSO, L. A., AND HÖLZLE, U. The Datacenter as a Computer: An
Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool
Publishers, 2009.
[6] BELADY, C. In the data center, power and cooling costs more than the IT
equipment it supports. Electronics Cooling 23, 1 (February 2007).
[7] BRAMWELL, C. D., AND FELLGETT, P. B. Thermal regulation in sail lizards.
Nature 242 (1973), 203–205.
[8] BROOKS, D., AND MARTONOSI, M. Dynamic thermal management for
high-performance microprocessors. In HPCA ’01.
[9] CHOI, J., CHER, C.-Y., FRANKE, H., HAMANN, H., WEGER, A., AND
BOSE, P. Thermal-aware task scheduling at the system software level. In
ISLPED ’07.
[10] GANDHI, A., HARCHOL-BALTER, M., DAS, R., KEPHART, J., AND
LEFURGY, C. Power capping via forced idleness. In WEED ’09.
[11] GOMAA, M., POWELL, M. D., AND VIJAYKUMAR, T. N. Heat-and-run:
leveraging SMT and CMP to manage power density through the operating
system. In ASPLOS ’04.
[12] HASAN, J., JALOTE, A., VIJAYKUMAR, T. N., AND BRODLEY, C. E. Heat
stroke: Power-density-based denial of service in SMT. In HPCA ’05.
[13] ISCI, C., BUYUKTOSUNOGLU, A., CHER, C.-Y., BOSE, P., AND
MARTONOSI, M. An analysis of efﬁcient multi-core global power management
policies: Maximizing performance for a given power budget. In MICRO ’06.
[14] KIM, W., GUPTA, M., WEI, G. Y., AND BROOKS, D. System level analysis of
fast, per-core DVFS using on-chip switching regulators. In ISCA ’08.
[15] MEISNER, D., GOLD, B. T., AND WENISCH, T. F. Powernap: eliminating
server idle power. In ASPLOS ’09.
[16] MOORE, J., CHASE, J., RANGANATHAN, P., AND SHARMA, R. Making
scheduling “cool”: temperature-aware workload placement in data centers. In
USENIX ATC ’05.
[17] PATEL, C. D., AND SHAH, A. J. Cost model for planning, development and
operation of a data center. Hewlett Packard Technical Report HPL-2005-107R1.
[18] PELLEY, S., MEISNER, D., WENISCH, T. F., AND VANGILDER, J. W.
Understanding and abstracting total data center power. In WEED ’09.
[19] QAZI, S. Idle cycle injection in Linux. Presented at the 2010 Linux
Collaboration Summit.
[20] REDELMEIER, R. cpuburn. http://pages.sbcglobal.net/redelm/.
[21] ROHOU, E., AND SMITH, M. D. Dynamically managing processor temperature
and power. In 2nd Workshop on Feedback-Directed Optimization (1999).
[22] ROTEM, E., COHEN, A., AND CAIN, H. Temperature measurement in the Intel
Core Duo processor. In THERMINIC ’06.
[23] SANTARINI, M. Thermal integrity: A must for low-power IC digital design.
EDN (September 2005), 37–42.
[24] SKADRON, K., STAN, M. R., HUANG, W., VELUSAMY, S.,
SANKARANARAYANAN, K., AND TARJAN, D. Temperature-aware
microarchitecture. In ISCA ’03.
[25] SRINIVASAN, J., ADVE, S. V., BOSE, P., AND RIVERS, J. A. The case for
lifetime reliability-aware microprocessors. In ISCA ’04.
[26] THE FREEBSD PROJECT. Generic kernel conﬁguration ﬁle for FreeBSD/i386
in FreeBSD 7.0. http://fxr.watson.org/fxr/source/i386/conf/GENERIC?v=FREEBSD70.