Reliability-Aware Power Management Of Multi-Core Systems (MPSoCs) by Waldschmidt, Klaus et al.
Reliability-Aware Power Management Of Multi-Core
Systems (MPSoCs)
Klaus Waldschmidt, Jan Haase, Andreas Hofmann, Markus Damm, and Dennis Hauser
Technische Informatik, J.W.Goethe Universität
Post Box 11 19 32, 60054 Frankfurt a.M., Germany
{waldsch|haase|ahofmann|damm|dhauser}@ti.cs.uni-frankfurt.de
Abstract
Long-term reliability of processors in embedded
systems is experiencing growing attention since de-
creasing feature sizes and increasing power con-
sumption have a negative influence on the lifespan.
Among other measures, the reliability can be
influenced significantly by Dynamic Power
Management (DPM), since it affects the processor’s
temperature. Compared to single-core systems,
reconfigurable multi-core SoCs offer many more
possibilities to optimize power and reliability.
The impact of different DPM-strategies on the
lifespan of multi-core processors is the focus of
this presentation. It is shown that the long-term
reliability of a multi-core system can be influ-
enced deliberately with different DPM strategies
and that temperature cycling greatly influences
the estimated lifespan. In this presentation, a
new reliability-aware dynamic power management
(RADPM) policy is explained.
1 Introduction
Systems-on-Chip (SoCs) and Networks-on-Chip
(NoCs) will be based on multi-core platforms in
the future. Besides an increase in performance,
the reliability and lifespan of these processor
cores will become an important issue. Because the
processing cores are integrated with other
components into a System-on-Chip, they cannot be
replaced. Moreover, smaller feature sizes and
increasing power densities further shorten the
lifetime, as they lead to a higher vulnerability
to wear-out-based failure mechanisms like
electromigration or stress migration. Reliability
therefore joins performance and power management
as an essential problem that must be addressed
during the design of an embedded system.
The approaches to tackle this new problem are
mostly design-centric. RAMP [9], for example, is
a model to determine lifespan estimates depending
on the architecture of a processor. In a subsequent
paper, however, the authors extend their approach
to so-called Dynamic Reliability Management [8],
whose idea is to adjust a processor at runtime
(e.g. by voltage scaling) to meet a certain
reliability target, though no algorithms for this
purpose are proposed. Apart from this, no further
concepts or algorithms for dynamic reliability
management exist yet.
It has been noted in [7] that DPM schemes affect
a processor’s reliability, since they directly influ-
ence a processor’s temperature. The essential fail-
ure mechanisms like electromigration, corrosion,
time-dependent dielectric breakdown (TDDB), hot
carrier injection (HCI), surface inversion, and
stress migration are more or less temperature de-
pendent [6]. While DPM tends to lower a proces-
sor’s temperature, which is beneficial, it also leads
to the unfavorable effect of temperature cycling,
i.e. frequent heating up and cooling down.
Our remedy to balance the trade-off between power
consumption and reliability is therefore
Reliability-Aware Dynamic Power Management
(RADPM). Compared to single-core systems, DPM on
multi-core SoCs has many more ways to scale the
energy consumption of a chip if tasks can migrate
between cores: aside from clock frequency
reduction, whole cores can be switched off without
disrupting the execution of applications.
The workload of a parallel computing environment
depends on the parallelizability of the
applications. Since DPM reacts to this workload,
it can be realized efficiently only with a dynamic
approach.
Dagstuhl Seminar Proceedings 06141
Dynamically Reconfigurable Architectures
http://drops.dagstuhl.de/opus/volltexte/2006/745
The SDVM (Self Distributing Virtual Ma-
chine) [3] as a middleware for dynamic, automatic
distribution of code and data over any network of
computing resources seems to be an ideal choice
to be run on multi-core processors. In particu-
lar, it supports adding and removing computing
resources at runtime, making the implementation
of the aforementioned dynamic power management
on multi-core processors possible.
To permit DPM on an SDVM-driven multi-core
SoC, an appropriate power managing mechanism
has been drafted and implemented, which scales
the performance of the cores according to the cur-
rent workload. The reliability-awareness is then
achieved by a new power management policy.
The simulations described in this presentation
show the potential of reliability-aware power man-
agement strategies for multi-core SoCs. In sec-
tion 2, the concept of the SDVM and its appli-
cation to reconfigurable multi-core SoCs is pre-
sented. Power management, its relevance for re-
liability, and the modeling thereof is illustrated in
section 3. After some simulation results in sec-
tion 4, this presentation closes with a conclusion
in section 5.
2 SDVM - A middleware for
MPSoCs
The Self Distributing Virtual Machine (SDVM) [4]
is a dataflow-driven parallel computing middle-
ware. It was designed to feature undisturbed par-
allel computation flow while adding and removing
processing units from computing clusters.
Applications for the SDVM must be partitioned into
suitable application fragments, which are spread
throughout the cluster depending on the data
distribution. Each processing unit which is encapsulated
by the SDVM virtual layer and thus acting as an
autonomous member of the cluster is called a site.
The sites communicate by message passing.
The SDVM has several distinguishing features
which support the aforementioned reliability-aware
power management for SoCs. These features in-
clude:
• undisturbed parallel computation while resiz-
ing the cluster
• distributed dynamic scheduling and thereby
automatic load balancing
• participating computing resources may have
different architectures and processing speeds
• support for any connection network topology
Due to the above-mentioned features, the SDVM
offers convenient mechanisms to support different
power states of processing units in a SoC.
The SDVM is currently implemented as a proto-
type [1] running as Linux daemons on a worksta-
tion cluster creating a site on each system. For this
project it has been used to simulate a multi-core
processor system.
2.1 Integration of power management
The aforementioned power management capabilities
were integrated into the SDVM by implementing a
new module, the energy manager, which controls the
energy state of the local site. If every site
decided its energy state on its own, all sites
might decide to shut down simultaneously;
therefore the energy managers use an election
algorithm to define a master, which is then the
only one to decide.
The master regularly defines the new power con-
figuration of each core based upon the tempera-
ture and the mean workload of each core. This
information is distributed through the cluster by
the SDVM’s message mechanism. The master may
even decide to shut down its own site or to quit be-
ing the master; then the election is simply started
again among the remaining sites. The main task
of the energy managers in slave mode is to listen
to the master core and to implement its orders,
setting the local site to the desired PM-state.
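The master/slave arrangement described above can be illustrated with a small sketch. The text does not say which election algorithm the energy managers use, so the rule below (the active site with the lowest id wins) is purely an assumption, and the names `Site` and `elect_master` are hypothetical and not part of the SDVM.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical stand-in for an SDVM site with an energy manager."""
    site_id: int
    active: bool = True


def elect_master(sites):
    """Return the id of the elected master, or None if no site is active.

    Assumption: the active site with the lowest id wins. The actual
    election algorithm is not specified in the text.
    """
    active_ids = [s.site_id for s in sites if s.active]
    return min(active_ids) if active_ids else None


sites = [Site(0), Site(1), Site(2), Site(3)]
master = elect_master(sites)

# The master shuts down its own site, so the election is simply
# repeated among the remaining active sites.
sites[0].active = False
new_master = elect_master(sites)
```

Any deterministic rule that all sites can evaluate on the same membership information would serve the same purpose here: exactly one site decides the power configuration at a time.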
3 Reliability Aware Power Management
Sophisticated power management can scale the
power consumption of a system using a variety of
measures like frequency scaling, dynamic voltage
scaling (affecting dynamic power), adaptive body
biasing (affecting leakage power), and clock-gating.
Unlike static power management, dynamic power
management reacts dynamically and continuously
to workload variation at runtime.
The temperature of a processor depends on
its power consumption, and since dynamic power
management lowers the average temperature, it
should contribute to the chip’s lifespan. But, as
it was noted in [7], the switching between differ-
ent power consumption levels leads to thermal cy-
cling, which can cause various types of failures like
lifted bonds, solder fatigue or even a cracked die [6].
Hence, it is not obvious how a certain power man-
agement policy affects the lifetime, if all influences
are taken into account. Therefore a model is re-
quired to evaluate power management policies.
3.1 Modeling Reliability Aware Power
Management
The long-term reliability of a processor is affected
by its operating temperature as well as thermal cy-
cling. The effect of the temperature can be mod-
eled by the Arrhenius equation, which describes the
influence of the temperature on the rate of chemi-
cal reactions. The MTTF (Mean-Time-To-Failure)
can be estimated by the following formula:
\[ \mathrm{MTTF} \sim e^{E_a / (kT)} \tag{1} \]
where T is the operating temperature in Kelvin, k
is Boltzmann’s constant, and E_a is the activation
energy in electron volts of the specific failure
mechanism considered. The Arrhenius equation is
the basis for modeling the temperature dependence
of several failure mechanisms.
With the knowledge of the physical and struc-
tural construction of a chip, the models for differ-
ent failure mechanisms can be combined to get a
model (like RAMP [9]) for the processor’s reliabil-
ity. As no assumptions on the internal structure
of the processors or the materials used are made
in this presentation, it would make no sense to
use those detailed models for our purposes.
Instead, equation 1 is used as a generic
temperature-dependent reliability measure for
processors. For E_a, a value of 0.9 eV is used.
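As a minimal sketch of how equation 1 is used as a relative measure, the following Python snippet compares the MTTF at two temperatures. Only the activation energy of 0.9 eV comes from the text; the function names and the example temperatures of 60 °C and 85 °C are hypothetical. Since equation 1 is a proportionality, only ratios are meaningful.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K
E_A = 0.9                            # activation energy in eV, as in the text


def relative_mttf(temp_celsius):
    """Unnormalized MTTF according to equation 1 (proportionality only)."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(E_A / (BOLTZMANN_EV_PER_K * temp_kelvin))


def mttf_ratio(temp_low_c, temp_high_c):
    """How much longer the estimated MTTF is at the lower temperature."""
    return relative_mttf(temp_low_c) / relative_mttf(temp_high_c)


# Hypothetical example: a core running at 60 °C instead of 85 °C.
ratio = mttf_ratio(60.0, 85.0)
print(f"MTTF(60 °C) / MTTF(85 °C) = {ratio:.2f}")
```

With these example values the estimated MTTF differs by roughly a factor of nine, which illustrates how strongly the exponential term reacts to a temperature difference of 25 K.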
The effect of thermal cycling on the reliability
of a chip can be modeled by the Coffin-Manson
relation, which computes the number of cycles to
failure, Nf , as [6]
\[ N_f = C_0 \cdot (\Delta T)^{-q} \tag{2} \]
where ∆T is the magnitude of thermal cycling,
C_0 is a material-dependent constant, and q is the
empirically determined Coffin-Manson exponent.
This exponent depends on the failure mechanism
considered; important mechanisms are lifted wire
bonds, solder fatigue, or package failures. In
this presentation, the reliability of the package
is considered; therefore a value of q = 1.9 [9]
is used.
For our purposes, equations 1 and 2 are used for
a comparative analysis of each of the PM-strategies
described below against the non-power-managed case.
The acceleration factors AF_T and AF_TC represent
the acceleration of the time to failure due to
temperature and temperature cycling, respectively.
Thus lower values are better; values below 1
denote a prolonged lifespan compared to the
non-power-managed case. The acceleration factor of
the temperature-dependent MTTF is calculated using
equation 3.
\[ AF_T = \frac{\mathrm{MTTF}_{noPM}}{\mathrm{MTTF}} \tag{3} \]
To calculate the acceleration factor of the
Coffin-Manson relation, no value for C_0 in
equation 2 needs to be chosen, since it cancels
out in the ratio. As different PM-strategies alter
not only the magnitude of thermal cycling but also
the cycle frequency f, the ratio thereof has to be
taken into account, too. Hence, equation 4 is used
to calculate AF_TC.
\[ AF_{TC} = \frac{f}{f_{noPM}} \cdot \left( \frac{\Delta T}{\Delta T_{noPM}} \right)^{1.9} \tag{4} \]
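The following sketch shows how equation 4 follows from equation 2 and checks numerically that the constant C_0 cancels. It assumes that the time to failure due to cycling is the number of cycles to failure divided by the cycle frequency; the function names and the example numbers are hypothetical.

```python
import math

Q = 1.9  # Coffin-Manson exponent used in the text


def cycles_to_failure(delta_t, c0):
    """Equation 2: number of thermal cycles to failure."""
    return c0 * delta_t ** (-Q)


def af_tc(freq, delta_t, freq_no_pm, delta_t_no_pm):
    """Equation 4: acceleration factor due to thermal cycling."""
    return (freq / freq_no_pm) * (delta_t / delta_t_no_pm) ** Q


def af_tc_from_lifetimes(freq, delta_t, freq_no_pm, delta_t_no_pm, c0):
    """Same factor, derived as a ratio of times to failure (cycles / frequency)."""
    ttf = cycles_to_failure(delta_t, c0) / freq
    ttf_no_pm = cycles_to_failure(delta_t_no_pm, c0) / freq_no_pm
    return ttf_no_pm / ttf


# Hypothetical numbers: twice as many cycles per unit time, half the magnitude.
direct = af_tc(freq=2.0, delta_t=20.0, freq_no_pm=1.0, delta_t_no_pm=40.0)

# The result does not depend on the value chosen for C_0.
for c0 in (1.0, 1e6):
    derived = af_tc_from_lifetimes(2.0, 20.0, 1.0, 40.0, c0)
    assert math.isclose(direct, derived)
```

In this example the factor stays below 1 even though the cycle frequency doubles, because the magnitude enters with an exponent of 1.9 while the frequency enters only linearly.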
3.2 Power management policies
In view of the previous section, a power
management strategy which is aware of reliability
issues should limit the temperature as well as
temperature changes. While the first is a side
effect of usual power management strategies, the
latter might involve keeping a processor "powered
up", although this might not be necessary
regarding performance, and is definitely not
desirable regarding power consumption. Hence there
is a trade-off between power consumption,
performance, and reliability.
In this simulation, two reliability-aware dynamic
power management strategies are considered: the
low-temperature policy, which tries to keep the
temperature as low as possible, and the
smooth-temperature policy, whose goal is to
restrict thermal cycling. Furthermore, the
low-temperature policy aims to limit the
temperature of a core to a given maximum t_max.
These policies are compared to the
(reliability-unaware) fast-upgrade policy, which
tries to optimize performance and serves as a
representative of usual power management
strategies. Figure 1 shows a diagram describing
the smooth-temperature policy.
The simulated computing environment is a
homogeneous multi-core processor with four cores.
Each core has four different power management
states (PM-states), as shown in Table 1.
Table 1: Simulated PM-states
PM-state   clock frequency   supply voltage   time to HFM
HFM        max.              max.             -
LFM        reduced           reduced          short
SLEEP      stopped           reduced          long
OFF        stopped           off              very long
For a more detailed description of the three
power management policies and the PM-states of
the cores see [2].
4 Results
The test setup simulates a homogeneous multi-core
processor with four cores. To each PM-state, a
typical power consumption value is assigned (see
Table 2). These values serve as examples and are
based on the power consumption of an Intel
Pentium M processor [5].
The hypothetical temperature T_J of a core is
determined from its power consumption by the
formula
\[ T_J = T_A + \theta_{AJ} \cdot P_{DISS} \tag{5} \]
where T_A is the environmental temperature, θ_AJ
is the thermal resistance of the core and its
cooling system, and P_DISS is the power
consumption. For θ_AJ, a value of 4.5 °C/W is
used. However, the real temperature of the core,
T_real, furthermore depends on the heat capacity
of the package including the cooling system. This
is modeled using the formula
\[ T_{real} = T_{realold} + \frac{T_J - T_{realold}}{15} \tag{6} \]
which is evaluated every second by the simulation
environment for each core.
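The thermal model of equations 5 and 6 can be sketched in a few lines of Python. The thermal resistance of 4.5 °C/W and the divisor 15 are taken from the text, and the power values of 15 W and 0.2 W correspond to the HFM and OFF states in Table 2. The ambient temperature of 20 °C, the function names, and the power trace are assumptions made only for this illustration.

```python
THETA_AJ = 4.5     # thermal resistance in °C/W, as in the text
DIVISOR = 15       # divisor of equation 6, as in the text
T_AMBIENT = 20.0   # assumed ambient temperature in °C (not given in the text)


def hypothetical_temp(power_w, t_ambient=T_AMBIENT):
    """Equation 5: temperature T_J the core would settle at for a given power."""
    return t_ambient + THETA_AJ * power_w


def step(t_real_old, power_w, t_ambient=T_AMBIENT):
    """Equation 6: one update of the real core temperature (one per second)."""
    t_j = hypothetical_temp(power_w, t_ambient)
    return t_real_old + (t_j - t_real_old) / DIVISOR


def simulate(power_trace, t_start=T_AMBIENT):
    """Apply equation 6 once per entry of a per-second power trace."""
    temps = []
    t_real = t_start
    for power_w in power_trace:
        t_real = step(t_real, power_w)
        temps.append(t_real)
    return temps


# Assumed trace: 60 s in HFM under load (15 W), then 60 s in OFF (0.2 W).
temps = simulate([15.0] * 60 + [0.2] * 60)
print(f"after heating: {temps[59]:.1f} °C, after cooling: {temps[-1]:.1f} °C")
```

Equation 6 acts as a first-order low-pass filter on T_J, so the real temperature approaches the hypothetical temperature gradually; this lag is what turns changes between PM-states into thermal cycles of limited magnitude.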
Table 2: PM-states and their power consumption
PM-state            HFM                LFM                SLEEP   OFF
power consumption   15 W (10 W idle)   7.5 W (4 W idle)   3 W     0.2 W
With this setup, each PM-strategy (and the "no-PM
strategy" as a reference) was simulated using
identical workloads composed of multiple instances
of a parallelized application. As an example, an
algorithm performing the Romberg integration was
used. The results of the simulations of the
PM-strategies are given in Figures 2, 3, and 4.
The figures show the workload (area chart) and the
temperature (black line) of each of the four cores.
Figure 1: The smooth-temperature policy.
Figure 2: Fast-upgrade policy
Figure 3: Smooth-temperature policy
Figure 4: Low-temperature policy
Figures 2, 3, and 4 show a clear difference
between the three policies. The low-temperature
policy restricts the maximum temperature to 61 °C,
while with the other two policies a maximum
temperature of 86 °C is obtained. The higher mean
temperature of core 1 in Figure 2 is caused by the
fact that the fast-upgrade policy always leaves one
core in HFM.
Regarding thermal cycling, the smooth-temperature
policy reduces both frequency and magnitude
compared to the fast-upgrade policy. Because of
the temperature limitation, the low-temperature
policy causes thermal cycling of lower magnitude,
but with significantly higher frequency,
especially when the temperature limit is reached.
Using the models described in section 3.1, the
acceleration factors AF_T and AF_TC are computed
for each policy. Table 3 gives the means of these
values over all cores, together with the mean CPU
time and the mean power consumption for the given
workload. The acceleration factors of the
non-power-managed case are 1, since this case is
the reference. All PM-strategies are beneficial
regarding failure due to temperature, especially
the low-temperature and the fast-upgrade policy.
The low-temperature policy is not much better than
the fast-upgrade policy regarding AF_T, despite
its lower maximum temperature, because its
computations take longer, whereas the shorter
computations of the fast-upgrade policy leave more
time for cooling down. In terms of
thermal-cycling-related failure, the
smooth-temperature policy is the most reliable
power management policy. Its acceleration factor
is more than 2 times smaller than that of either
of the other two policies.
Table 3: AF_T, AF_TC, mean runtime, and mean power
consumption of all cores
PM-policy            AF_T   AF_TC   mean runtime   mean power consumption
no PM                1      1       32.7 s         48.15 W
fast-upgrade         0.12   3.27    35.0 s         31.55 W
smooth-temperature   0.19   1.2     35.9 s         37.29 W
low-temperature      0.1    3.28    64.8 s         23.18 W
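The comparison of the thermal-cycling factors can be reproduced directly from the values in Table 3. The sketch below also multiplies mean runtime by mean power to obtain a rough energy figure per policy; this product is not reported in the text and rests on the assumption that the mean power is an average over the mean runtime. All function names are hypothetical.

```python
# Values copied from Table 3: (AF_T, AF_TC, mean runtime in s, mean power in W)
TABLE_3 = {
    "no PM": (1.0, 1.0, 32.7, 48.15),
    "fast-upgrade": (0.12, 3.27, 35.0, 31.55),
    "smooth-temperature": (0.19, 1.2, 35.9, 37.29),
    "low-temperature": (0.1, 3.28, 64.8, 23.18),
}


def af_tc_ratio(policy, reference="smooth-temperature"):
    """AF_TC of a policy relative to the reference policy."""
    return TABLE_3[policy][1] / TABLE_3[reference][1]


def energy_joules(policy):
    """Rough energy estimate: mean runtime times mean power (an assumption)."""
    _, _, runtime_s, power_w = TABLE_3[policy]
    return runtime_s * power_w


for name in TABLE_3:
    print(f"{name:20s} AF_TC ratio {af_tc_ratio(name):5.2f}  "
          f"energy {energy_joules(name):8.1f} J")
```

Under that assumption, the policy with the lowest mean power does not have the lowest energy estimate, because its runtime is almost twice as long; this is one way to see the trade-off between power consumption, performance, and reliability in numbers.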
The results clearly show that the reliability of
a multi-core chip can be influenced actively with
PM-strategies. It should be pointed out that such
an approach is only possible using dynamic power
management, which in turn can be implemented
only within a system which distributes the work-
load dynamically. Incorporating reliability aware-
ness into compile-time power management schemes
seems to be almost infeasible.
5 Conclusion
In this presentation, reliability-aware dynamic
power management (RADPM) is described, which
targets lifespan-controlling goals. The usability
of RADPM to extend system lifetime was
demonstrated by simulating a multi-core chip on
the SDVM. The SDVM was augmented for this purpose
to implement different PM-policies. The basic
approach, however, could be implemented on any
multi-core system which distributes the workload
dynamically.
The PM-policies presented are not final solutions
for RADPM, but rather serve as a proof of concept
that the long-term reliability of a multi-core
chip can actually be improved. Further
implementations of RADPM on multi-core chips could
include a "reliability account" for each core or
consider the chip’s topology to optimize the
temperature distribution. In combination with the
smooth-temperature policy, reliability-aware
scheduling could further reduce thermal cycling.
One of our
next goals is to measure the impact of dynamic re-
configurations on the reliability. Furthermore, it
would be beneficial to explore the impact and op-
timization of power-management strategies on the
reliability of the external power supply circuits.
A new insight, however, is that parallelism may
not only be used to improve performance, but to
improve reliability as well.
References
[1] The SDVM homepage, 2006. http://sdvm.ti.
cs.uni-frankfurt.de.
[2] Jan Haase, Markus Damm, Dennis Hauser,
and Klaus Waldschmidt. Reliability-aware
power management of multi-core processors. In
5th IFIP Working Conference on Distributed
and Parallel Embedded Systems (DIPES 2006),
Braga, Portugal, October 2006. To be pub-
lished.
[3] Jan Haase, Frank Eschmann, Bernd Klauer,
and Klaus Waldschmidt. The SDVM: A Self
Distributing Virtual Machine. In Organic and
Pervasive Computing – ARCS 2004: Interna-
tional Conference on Architecture of Comput-
ing Systems, volume 2981 of Lecture Notes in
Computer Science, Heidelberg, 2004. Springer
Verlag.
[4] Jan Haase, Frank Eschmann, and Klaus Wald-
schmidt. The SDVM - an Approach for Future
Adaptive Computer Clusters. In 10th IEEE
Workshop on Dependable Parallel, Distributed
and Network-Centric Systems (DPDNS05),
Denver, Colorado, USA, April 2005.
[5] Intel. Pentium M Processor Datasheet,
April 2004. http://www.intel.com/design/
mobile/datashts/252612.htm.
[6] JEDEC. Failure mechanisms and models for
semiconductor devices, 2003. JEDEC Publication
JEP122-B, JEDEC Solid State Technology
Association.
[7] K. Mihic, T. Simunic, and G. De Micheli. Re-
liability and power management of integrated
systems. In DSD - Euromicro Symposium on
Digital System Design, pages 5–11, 2004.
[8] J. Srinivasan et al. The case for lifetime
reliability-aware microprocessors. In Proc. of
the 31st Annual Intl. Symp. on Comp.
Architecture, 2004.
[9] Jayanth Srinivasan, Sarita V. Adve, Pradip
Bose, Jude Rivers, and Chao-Kun Hu. RAMP:
A model for reliability aware microprocessor
design. IBM Research Report RC23048
(W0312-122), December 2003.
