Optimization of Reliability and Power Consumption in Systems on a Chip by Simunic, Tajana et al.
Optimization of Reliability and
Power Consumption in SoCs
Tajana Šimunić Rosing, UC San Diego
Kresimir Mihic, Cypress Semiconductors
Giovanni De Micheli, EPF Lausanne
Tajana Simunic Rosing & Giovanni De Micheli
Integrated System Technology Issues
 Extremely small size
 Thinner interconnect -> more chance of EM failure
 Thinner dielectric ->  more chance of TDDB failure
 Narrower design margins
 Extremely large scale
 High transistor density
• Causes more failures
• Enables redundancy
 Energy consumption
 Increased energy consumption is a hurdle to modular redundancy
 Power and thermal management are critical
• Reliability exponentially related to temperature
Designing reliable integrated systems requires techniques 
that integrate with power management 
and tie to the underlying technology
[Figure: power density (W/cm², logarithmic axis from 1 to 1000) of Intel processors: i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, P4 (P4 @ 1.4 GHz, 75 W) and P5, compared with a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface. Courtesy of Fred Pollack, Intel, keynote speech, MICRO-32.]
Reliability
 Reliability is the probability function R(t) that a system
works correctly in [0, t] without repairs
 The mean time to failure is MTTF = E[t] = ∫₀^∞ R(t) dt
 Given that a unit works correctly in [0, t], the failure rate
λ(t) is the conditional probability per unit time that the unit
fails in [t, t+Δt]
 It depends on temperature, environmental exposure, mechanical
and thermal stress
 The component failure rate is often assumed to be constant
during useful lifetime of device:
• R(t) = exp (– λt) and MTTF = 1/ λ
 Two types of failures can be defined in integrated systems:
 Soft failures – transient malfunctions
 Hard failures - permanent malfunctions
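The constant-failure-rate relations above, R(t) = exp(–λt) and MTTF = 1/λ, can be checked numerically. The following is a minimal sketch; the failure rate value and the integration settings are hypothetical, chosen only for illustration.

```python
import math


def reliability(t, lam):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-lam * t)


def mttf_numeric(lam, horizon_factor=50.0, steps=200_000):
    """Approximate MTTF = integral of R(t) dt with the trapezoid rule.

    The infinite upper limit is truncated at horizon_factor / lam,
    where R(t) is already negligible.
    """
    t_end = horizon_factor / lam
    dt = t_end / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_end, lam))
    for k in range(1, steps):
        total += reliability(k * dt, lam)
    return total * dt


lam = 0.1  # hypothetical failure rate, in failures per year
mttf_closed_form = 1.0 / lam
mttf_integrated = mttf_numeric(lam)
print(f"closed form: {mttf_closed_form:.4f} years, integrated: {mttf_integrated:.4f} years")
```

The two values agree closely, which is simply the statement that the integral of exp(–λt) over [0, ∞) equals 1/λ.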
[Figure: failure rate λ(t) plotted against time t]
Related Work – Reliability for SoCs
 Reliability at the architecture level
 Integrated simulation of power and reliability at microarchitecture level
RAMP [Srinivasan’03]
 Redundancy tradeoffs [Shivakumar’03]
 Dynamic Thermal Management (DTM)
 HotSpot [Skadron’03], ThermalHerd [Shang’04]
• Simulate and reduce thermal hotspots
 Thermal management for multimedia [Srinivasan’03]
 Dynamic Voltage Scaling (DVS) as related to reliability
 Routing and DVS for reduction of hotspots [Shang’04]
 Dynamic Power Management (DPM)
 Primarily focused on lowering energy consumption
 Soft errors studied by many, e.g.:
 Ultra-low power systems [Maheshwari’02]
 Sensing systems [Marculescu’03]
 Hard failure mechanisms studied at length in the past, e.g.:
 Temperature cycling [Huang’00]
 TDDB [Degraeve’98]
Reliable low-power design
 Simulate system-level reliability
 Model three sources of hard errors as a function of the power
management policy:
• Electromigration (EM), time-dependent dielectric breakdown (TDDB), and
temperature cycling (TC)
 Design and optimize a system management policy
 Maximize reliability and minimize energy consumption
 Combined dynamic reliability management (DRM) with dynamic
power management (DPM) optimization
• Markov, semi-Markov and TISM models
[Figure: MTTF [years] (axis from 10 to 130) versus power savings [%] (axis from 0 to 50), with curves for EM, TDDB, TC and the overall system]
Hard failures
 Defects in silicon or package, permanent once present
 Expected lifetime decreases with hard error rate
 Extrinsic
• Caused by process and manufacturing defects
• Usually screened out before shipping a product
 Intrinsic
• Occur during operation
• Depends on materials, process parameters, system design and
operating conditions
• Should occur after device passes its useful lifetime
• Examples: electromigration, time dependent dielectric
breakdown, thermal cycling
Electromigration (EM)
 Result of momentum transfer from electrons to the ions that
make up the interconnect lattice
 Leads to opens in metal lines/contacts, shorts between
adjacent metal lines, shorts between metal levels,
increased resistance of metal lines/contacts, or junction
shorts
 Described by Black's model, where Ao is an empirically
determined constant, J is the current density in the
interconnect, Jcrit is the threshold current density, k is
Boltzmann's constant, T is the temperature, Ea = 0.7 eV and n = 2:
$$\mathrm{MTTF}_{EM} = A_o (J - J_{crit})^{-n} e^{E_a / kT}$$
 Failure rate due to EM is modeled only in the active and idle states, because in the
sleep state the leakage current is not large enough to cause migration:
$$\lambda^{EM}_{core,s} = A_o (J_s - J_{crit})^{n} e^{-E_a / kT_s} = \lambda'^{EM}_{m,s}\, e^{-E_a / kT_s}; \quad \forall s = active, idle$$
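To give a feel for how strongly Black's model reacts to temperature and current density, here is a minimal sketch using Ea = 0.7 eV and n = 2 from the slide. The prefactor (set to 1, so results are in arbitrary units), the current densities and the temperatures are hypothetical, and treating the MTTF as infinite at or below Jcrit is an assumption of this sketch.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def mttf_em(j, j_crit, temp_k, a_o=1.0, ea_ev=0.7, n=2.0):
    """Black's model: MTTF = A_o * (J - J_crit)^(-n) * exp(Ea / (k*T)).

    a_o = 1.0 is a placeholder, so the result is in arbitrary units.
    Assumption: at or below the threshold current density no
    electromigration occurs, so the MTTF is reported as infinite.
    """
    if j <= j_crit:
        return math.inf
    return a_o * (j - j_crit) ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))


# Hypothetical operating points (current density in arbitrary units).
mttf_cool = mttf_em(j=3.0, j_crit=1.0, temp_k=333.15)  # 60 C
mttf_hot = mttf_em(j=3.0, j_crit=1.0, temp_k=363.15)   # 90 C
thermal_acceleration = mttf_cool / mttf_hot
current_acceleration = mttf_cool / mttf_em(j=5.0, j_crit=1.0, temp_k=333.15)

print(f"MTTF(60 C) / MTTF(90 C) = {thermal_acceleration:.2f}")
print(f"MTTF(J=3) / MTTF(J=5)   = {current_acceleration:.2f}")
```

Because only ratios are printed, the unknown prefactor cancels; doubling (J – Jcrit) with n = 2 reduces the MTTF by a factor of four.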
Time Dependent Dielectric Breakdown (TDDB)
 Wear-out mechanism of the dielectric due to electric field and
temperature; causes formation of conductive paths
through dielectrics
 MTTF is a function of the empirically determined constant
Ao, the field acceleration parameter γ, the electric field
across the dielectric Eox, the activation energy Ea and the temperature T:
$$\mathrm{MTTF}_{TDDB} = A_o e^{-\gamma E_{ox}} e^{E_a / kT}$$
 Failure rate due to TDDB:
$$\lambda^{TDDB}_{core,s} = A_o e^{\gamma E_{ox,s}} e^{-E_a / kT_s} = \lambda'^{TDDB}_{m,s}\, e^{-E_a / kT_s}; \quad \forall s = active, idle, sleep$$
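Since TDDB is modeled in all three power states, a per-state comparison is a natural illustration. The sketch below takes the failure rate as the reciprocal of the TDDB MTTF expression; the values of γ and Ea, the prefactor, and the per-state fields and temperatures are all hypothetical placeholders, not values from the slides.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def tddb_failure_rate(e_ox, temp_k, a_o=1.0, gamma=1.0, ea_ev=0.75):
    """Failure rate taken as 1 / MTTF_TDDB.

    MTTF_TDDB = A_o * exp(-gamma * E_ox) * exp(Ea / (k*T)).
    a_o, gamma and ea_ev are illustrative placeholders.
    """
    mttf = a_o * math.exp(-gamma * e_ox) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))
    return 1.0 / mttf


# Hypothetical (E_ox in arbitrary units, temperature in kelvin) per power state.
states = {
    "active": (5.0, 358.15),
    "idle": (5.0, 338.15),
    "sleep": (3.0, 318.15),
}
rates = {s: tddb_failure_rate(e_ox, temp) for s, (e_ox, temp) in states.items()}
relative = {s: rate / rates["sleep"] for s, rate in rates.items()}

for state, ratio in relative.items():
    print(f"{state:6s} failure rate relative to sleep: {ratio:.1f}x")
```

With these placeholder inputs the active state dominates, because both the field term and the temperature term grow exponentially.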
Temperature Cycling (TC)
 Caused by thermal cycles that occur during power state changes
 Slow and fast thermal cycles
 Induces plastic deformations in materials - leads to cracks, short circuits
and other failures of metal films and interlayer dielectrics
 Number of cycles to failure Nf depends on temperature range and average temperature:
$$N_f = C_o \left[ C_1 (T_{max} - T_{min}) - C_2 (T_{avg} - T_{mold}) \right]^{-q}$$
 Failure rate due to TC:
$$\lambda^{TC}_{core,s} = C_o' \left[ C_1 (T_{active} - T_{sleep}) - C_2 (T_{avg,s} - T_{mold}) \right]^{q} f_s; \quad \forall s = sleep$$
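The dependence of temperature cycling on both the temperature swing and the average temperature can be sketched with a Coffin-Manson style expression of the form N_f = C_o [C_1 (T_max – T_min) – C_2 (T_avg – T_mold)]^(–q). All constants and temperatures below are hypothetical, and converting cycles-to-failure into a failure rate by dividing the cycling frequency by N_f is an assumption of this sketch.

```python
def cycles_to_failure(t_max, t_min, t_avg, t_mold,
                      c_o=1.0e5, c_1=1.0, c_2=0.1, q=2.0):
    """Cycles to failure; all constants are illustrative placeholders."""
    stress = c_1 * (t_max - t_min) - c_2 * (t_avg - t_mold)
    if stress <= 0:
        raise ValueError("effective thermal stress must be positive")
    return c_o * stress ** (-q)


def tc_failure_rate(cycles_per_hour, **temps):
    """Assumed conversion: rate = cycling frequency / cycles to failure."""
    return cycles_per_hour / cycles_to_failure(**temps)


# Hypothetical thermal profiles in degrees Celsius.
gentle = dict(t_max=70.0, t_min=50.0, t_avg=60.0, t_mold=175.0)
aggressive = dict(t_max=90.0, t_min=40.0, t_avg=65.0, t_mold=175.0)

n_gentle = cycles_to_failure(**gentle)
n_aggressive = cycles_to_failure(**aggressive)
rate_slow = tc_failure_rate(1.0, **aggressive)
rate_fast = tc_failure_rate(2.0, **aggressive)

print(f"cycles to failure: gentle {n_gentle:.1f}, aggressive {n_aggressive:.1f}")
print(f"failure rate at 1 and 2 cycles/hour: {rate_slow:.4f}, {rate_fast:.4f}")
```

A policy that sleeps more often raises the cycling frequency and, if the core cools further while asleep, also the swing, which is why aggressive power management can hurt TC lifetime.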
Reliability of complex systems
 A system is a connection of components
 System reliability depends on the topology
 Series/parallel configurations
 N out of K configurations
 General topologies
 Examples:
 CPU, memory and interconnect form a series reliability network as all
three are necessary for the correct functioning of the system
 Dual CPU system could be viewed as a parallel reliability combination
as only one CPU is needed in order for the system to function
Series:
$$R_{system}(t) = \prod_{i=0}^{n} R_i(t) \;\Rightarrow\; R_{system}(t) = e^{-t \sum_{i=0}^{n} \lambda_{f_i}}$$
Parallel:
$$R_{system}(t) = 1 - \prod_{i=0}^{n} \left(1 - R_i(t)\right)$$
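The two examples on this slide, a CPU/memory/interconnect series network and a dual-CPU parallel pair, can be combined in a short sketch. The failure rates and the mission time are hypothetical, and components are assumed to fail independently with constant rates.

```python
import math


def component_reliability(lam, t):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-lam * t)


def series_reliability(reliabilities):
    """All components are required: product of the R_i."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """One working component suffices: 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


t_years = 5.0
lam_cpu, lam_mem, lam_bus = 0.02, 0.01, 0.005  # hypothetical failures/year

r_cpu = component_reliability(lam_cpu, t_years)
r_mem = component_reliability(lam_mem, t_years)
r_bus = component_reliability(lam_bus, t_years)

single_cpu_system = series_reliability([r_cpu, r_mem, r_bus])
dual_cpu = parallel_reliability([r_cpu, r_cpu])
dual_cpu_system = series_reliability([dual_cpu, r_mem, r_bus])

print(f"single CPU system R(5y) = {single_cpu_system:.4f}")
print(f"dual CPU system   R(5y) = {dual_cpu_system:.4f}")
```

Nesting the parallel pair inside the series chain is how general topologies built from series and parallel blocks are evaluated.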
Basic Reliability Configurations
 Active parallel configuration has all redundant components working
concurrently
 Energy consumption is high
 Time to transition on failure is very low
 Failure rate is higher than standby parallel
 E.g. identical controllers for aircraft guidance
 Standby parallel configuration has redundant components in low-power
mode until failure of the active component
 Energy consumption lower
 Time to transition on failure higher
 Low failure rate
 E.g. dual CPU platform
 Series combination has the highest failure rate
 E.g. CPU, memory, interconnect
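Standby parallel (M units):
$$\lambda_{sby} = \frac{\lambda_f}{M}$$
Series (N cores):
$$\lambda = \sum_{i=1}^{N} \lambda_{core,i}$$
Active parallel (M units):
$$\lambda_{ap} = \left[ \sum_{i=1}^{M} C_i^M \frac{(-1)^{i-1}}{i\,\lambda_f} \right]^{-1}$$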
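The ordering claimed on this slide (series worst, standby parallel best) can be reproduced with textbook MTTF results for identical components with a constant failure rate. This is a sketch under stated assumptions: components are identical and independent, switching to a standby unit is perfect and instantaneous, and standby units do not fail while waiting. The effective failure rate is taken as 1/MTTF.

```python
import math


def mttf_series(lam, n):
    """N identical components in series: rates add."""
    return 1.0 / (n * lam)


def mttf_active_parallel(lam, m):
    """M identical active units: (1/lambda) * sum_{i=1..M} 1/i."""
    return sum(1.0 / i for i in range(1, m + 1)) / lam


def mttf_active_parallel_binomial(lam, m):
    """Equivalent alternating binomial form of the same MTTF."""
    return sum(math.comb(m, i) * (-1) ** (i - 1) / (i * lam)
               for i in range(1, m + 1))


def mttf_standby(lam, m):
    """M units with perfect switching and no failures in standby."""
    return m / lam


lam = 0.1  # hypothetical failures per year
units = 2
effective_rates = {
    "series": 1.0 / mttf_series(lam, units),
    "active parallel": 1.0 / mttf_active_parallel(lam, units),
    "standby parallel": 1.0 / mttf_standby(lam, units),
}
for name, rate in effective_rates.items():
    print(f"{name:16s} effective failure rate = {rate:.4f} per year")
```

The harmonic-sum and alternating-binomial forms give the same number; the second is the form obtained by integrating 1 – (1 – e^(–λt))^M term by term.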
DPM&DRM - Dependability modeling
[Figure: Markov dependability models. Simple model: Good and Faulty states. Typical model: Good, Degraded and Faulty states. Transitions are labeled with failure rates; repairs are allowed.]
Markov processes model memoryless
systems with constant failure rates
[Figure: reliability topology of an example system with blocks ARM CPU, Cache, WLAN, WAN, WPAN and IrDA]
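The simple Good/Faulty model with repairs can be written as a two-state continuous-time Markov chain. The sketch below integrates the probability of being in the Good state with forward Euler steps; the failure rate, the repair rate `mu` and the step size are hypothetical, and the repair rate in particular is a name introduced here for illustration.

```python
def probability_good(lam, mu, t_end, dt=1e-3):
    """P(Good at t_end) for a two-state Markov chain starting in Good.

    dP/dt = -lam * P + mu * (1 - P), integrated with forward Euler.
    lam: failure rate (Good -> Faulty); mu: repair rate (Faulty -> Good).
    """
    p_good = 1.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        p_good += dt * (-lam * p_good + mu * (1.0 - p_good))
    return p_good


lam, mu = 0.1, 1.0  # hypothetical rates, per unit time
long_run = probability_good(lam, mu, t_end=50.0)
steady_state = mu / (lam + mu)
no_repair = probability_good(lam, 0.0, t_end=10.0)

print(f"P(Good) after a long time: {long_run:.5f} (steady state {steady_state:.5f})")
print(f"P(Good) at t = 10 without repair: {no_repair:.5f}")
```

With the repair rate set to zero the chain reduces to R(t) = exp(–λt), the memoryless constant-failure-rate case the slide refers to.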
DPM&DRM - Power management modeling
[Figure: power management model with blocks Power Manager (with Policy), Queue, System (with Active, Idle and Sleep states) and Environment]
DPM&DRM System Model Details
 Combine:
 Power-state machine model - TISMDP
 Reliability model - Markov process
 Represent overall system as combination of components’  PSMs where failure
rates depend on system state
 System control aims to increase energy efficiency and enhance reliability
[Figure: power-state machine of a core with Active, Idle and Sleep states plus Transition-to-Active and Transition-to-Sleep states. Edges are labeled with workload arrivals, departures and the "go to sleep" command. The states are annotated with failure rates (λcore,active, λcore,idle, λcore,sleep, λcore,ta, λcore,ts); the active state with f0, V0 and Pa0; the transition states with powers Pta, Pts and times tta, tts.]
DPM&DRM Policy Optimization
Minimize average energy consumed under reliability and
performance constraints – get randomized policy
Obtain globally optimal policy using linear programming
 Policy is obtained from state-action frequencies f(s,a) as a table
of probabilities of issuing command a when system is in state s
$$\min \sum_{c=1}^{N} cost_{energy,c}$$
subject to
$$\sum_{a \in A} f(s,a) - \sum_{s' \in S} \sum_{a \in A} M(s' \mid s,a)\, f(s',a) = 0; \quad \forall s \in S, \forall c$$
$$\sum_{a \in A} \sum_{s \in S} T(s,a)\, f(s,a) = 1; \quad \forall c$$
$$\sum_{c=1}^{N} cost_{perf,c} < Perf_{const}$$
$$Tpl(\lambda_c) \le Rel_{const}$$
$$\lambda_{core,c} = \sum_{i \in F} \sum_{a \in A} \sum_{s \in S} \lambda_i(s,a)\, y(s,a)\, f(s,a)$$
Variable definitions:
cost(s,a): average cost incurred while in state s given action a
f(s,a): frequency of executing action a while in state s
M(s’|t,s,a): probability of arriving to state s’ given action a taken in state s
T(s,a): expected time spent in state s given action a
Tpl(λc): reliability constraint as a function of network topology Tpl
 λc core reliabili
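Turning state-action frequencies f(s,a) into a randomized policy table amounts to normalizing the frequencies within each state. The sketch below shows that step only, not the linear program itself; the frequencies are hypothetical stand-ins for an LP solution, and the function name is introduced here for illustration.

```python
def policy_from_frequencies(freq):
    """Return P(action | state) = f(s, a) / sum over a' of f(s, a').

    freq maps (state, action) pairs to non-negative frequencies.
    States whose frequencies sum to zero are never visited and are skipped.
    """
    totals = {}
    for (state, _action), value in freq.items():
        totals[state] = totals.get(state, 0.0) + value

    policy = {}
    for (state, action), value in freq.items():
        if totals[state] > 0:
            policy[(state, action)] = value / totals[state]
    return policy


# Hypothetical state-action frequencies for one core.
freq = {
    ("active", "continue"): 0.50,
    ("idle", "continue"): 0.15,
    ("idle", "go_to_sleep"): 0.05,
    ("sleep", "continue"): 0.30,
}
policy = policy_from_frequencies(freq)
for (state, action), prob in sorted(policy.items()):
    print(f"P({action} | {state}) = {prob:.2f}")
```

In this example the power manager would issue "go to sleep" a quarter of the time it finds the core idle, which is what makes the policy randomized rather than deterministic.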
DPM Constraint Formulation
 Energy and performance cost:
 k(si, ai) - lump sum cost
 c(si+1,si,ai) - cost rate (e.g. power or performance penalty)
 F(ti | si, ai) - probability distribution of next event occurrence
 p(si+1| ti, si, ai) – probability of transition into next state si+1
 Expected time spent in each state:
 Probability of arrival into each state:
Reliability Constraint Formulation
 Failure rate of each state is a sum of the failure rates due to
all mechanisms (EM, TDDB, TC) acting in that state
 Expected temperature in a state needs to be calculated:
$$T_{state} = (T_{active} - T_{state,ss})\, e^{-y(s,a)/\tau} + T_{state,ss}$$
$$T_{active} \propto P_{active} (R_{th,die} + R_{th,package})$$
 Total failure rate of a core is a weighted sum of state failure
rates, for example:
 Core has three power states: active (A), idle (I) and sleep (S)
 Two actions: “go to sleep” (S) and “continue” (C)
$$\lambda_A\, y(A,C) f(A,C) + \lambda_I\, y(I,C) f(I,C) + \lambda_I\, y(I,S) f(I,S) + \lambda_S\, y(S,C) f(S,C) \le Rel_{const}$$
 System failure rate is calculated based on system topology
as a function of series and parallel combinations
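The weighted-sum idea for the three-state, two-action example can be sketched as follows. Two things are assumptions of this sketch rather than statements from the slides: the weights are taken to be y(s,a) f(s,a), with y(s,a) read as the expected time spent in state s under action a, and the temperature in a state is modeled as an exponential relaxation from the active temperature toward a steady-state value with time constant `tau`. All numbers are hypothetical.

```python
import math


def state_temperature(t_active, t_steady, time_in_state, tau):
    """Exponential relaxation from t_active toward t_steady."""
    return (t_active - t_steady) * math.exp(-time_in_state / tau) + t_steady


def core_failure_rate(state_rates, y, f):
    """Sum of lambda_s * y(s, a) * f(s, a) over all (state, action) pairs."""
    return sum(state_rates[s] * y[(s, a)] * f[(s, a)] for (s, a) in f)


# States: A (active), I (idle), S (sleep). Actions: C (continue), S (go to sleep).
state_rates = {"A": 0.08, "I": 0.04, "S": 0.01}  # hypothetical failure rates
y = {("A", "C"): 1.0, ("I", "C"): 0.5, ("I", "S"): 0.2, ("S", "C"): 2.0}
f = {("A", "C"): 0.4, ("I", "C"): 0.2, ("I", "S"): 0.1, ("S", "C"): 0.15}

lam_core = core_failure_rate(state_rates, y, f)
rel_const = 0.05  # hypothetical bound on the core failure rate
temp_after_sleep = state_temperature(t_active=360.0, t_steady=320.0,
                                     time_in_state=2.0, tau=5.0)

print(f"core failure rate = {lam_core:.4f}, constraint met: {lam_core <= rel_const}")
print(f"temperature after 2 time units asleep = {temp_after_sleep:.1f} K")
```

Because the core failure rate is linear in f(s,a), a bound on it can be added to a linear program as one more linear constraint.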
Optimization example
 95nm technology
 Five cores; standard workloads (audio, video, www, email)
 MTTF constraint set to 10 years; minimized power consumption
IP block                     Pactive [W]  Pidle [W]  Psleep [W]  t_ts [s]  t_ta [s]
DSP (TMS6211) [22]           1.1          0.5        0.01        250u      100n
Video (SAF7113H) [23]        0.44         N/A        0.07        110m      0.9
Audio (SST-Melody-DAA) [24]  0.11         0.03       3.00E-03    6u        0.13
I/O (MSP43011x2) [25]        1.00E-03     N/A        6.00E-06    100n      6u
DRAM (Rambus 512M) [26]      1.58         0.37       1.00E-02    16n       16n
Single Core Design
 Maximum power savings
achievable given MTTF of
10 years are at 90% for
all cores and temperature
ranges except for DSP,
Video and Audio at 90 C
due to TC mechanism
[Figure: power savings (%) for the DSP, Video, Audio, I/O and RAM cores at 25C, 50C and 90C]
[Figure: MTTF [years] (axis from 0 to 200) versus power savings [%] (axis from 0 to 50) for the initial and updated designs, with curves for EM, TDDB, TC and the overall design in each case]
 Design change effect: widening metal lines
• Current density down by 20%, core area up by 5%,
temperature down by 2%, but TC up by 10%
Design with redundancy
 Standby-off and standby-sleep redundancy model
 Power savings with MTTF set to 10 years
[Figures: two bar charts of power savings (%) for the DSP, Video, Audio, I/O and RAM cores at 50C and 90C]
 System meets MTTF of 10 years when one more redundant
core in standby off mode is added to DSP, Audio and I/O;
power savings of 40% are achieved
Redundancy
 Using redundancy helps improve reliability, but at the cost of
increased area and power consumption
 Instead of spare cores, use functional redundancy and dynamic
reconfiguration
[Figure: power (W) versus MTTF (factor, from 0.91 to 1.87). Legend: Tr=80C, Tr=100C, Tr=120C, T=80C, T=100C, T=120C. Annotations: No Redundancy, With Redundancy.]
DVS, DPM and Reliability
 Simulate using a “typical day”
workload, consisting of video,
audio, www and telnet traffic
interspersed throughout the day
 95nm technology,
power/performance properties of
XScale PXA270
 Aggressive DPM:
 Large power savings, but reliability
loss due to TC
 DVS only:
 Smaller power savings, but longer
MTTF due to EM/TDDB
 Combining DVS and DPM gives the best tradeoff
State Active (mW) Idle (mW) Freq (MHz)
P1 925 260 624
P2 747 222 520
P3 279 129 208
P4 116 64 104
Psleep 0.163 0.163 0
[Figure: MTTF (yr), axis from 0 to 250, versus policy, axis from 0 to 1, with curves for EM, TDDB, TC and overall MTTF]
Power and MTTF with DVS/DPM
 DVS/DPM improves MTTF by 45%, with 61% power savings
Policy / Power savings / MTTF change
None 0% 0%
DVS 35% 42%
DPM (Rmax) 16% 6%
DPM (ave) 47% -12%
DPM (Pmax) 99% -34%
both (Rmax) 46% 47%
both (ave) 61% 45%
both (Pmax) 99% 34%
[Figure: MTTF (%), axis from -40% to 60%, versus power (%), axis from 0% to 100%, with series DPM&DVS, DPM Pmax and DPM Pave]
Summary
 Reliability is strongly affected by both DVS and DPM
 Integrated methodology for analysis, optimization and
management of reliability and power consumption:
 Simulator gives fast feedback on topology design and system
characteristics for a wide range of operating conditions
 Optimizer provides a policy capable of giving an optimal
implementation of reliability and power management control
 Results obtained for a number of integrated systems
implemented in 95nm technology show:
 Large dependence between power management policy and
reliability due to tradeoff between EM, TDDB and TC effects
 40% power savings on top of meeting MTTF of 10 years for an
integrated system consisting of five cores with redundancy
