Designing reliable systems with unreliable components by De Micheli, Giovanni
DESIGNING RELIABLE SYSTEMS
with
UNRELIABLE COMPONENTS
Giovanni De Micheli
Centre Systèmes Intégrés
De Micheli 2
Outline
• Introduction and motivation
• Variability and robustness
– Self-calibrating circuits
• Soft failures
– Error detection and correction
• Hard failures
– Redundancy
– System-level management
• Hardware system integration issues
– On-chip networks
– Self-healing
• Nanotechnology challenges
• Summary and conclusions
Reliable design: where do we need it?
• Traditional applications
– Long-life applications (space missions)
– Life-critical, short-term applications  (aircraft engine control, fly-by-wire)
– Defense applications (aircraft, guidance & control)
– Nuclear industry
– Telecommunications
• New computation-critical applications
– Health industry
– Automotive industry
– Industrial control systems and production lines
– Banking, reservations, commerce
The economic perspective
• Availability is a critical business metric for commercial systems and services
– Nearly 100% availability (“five nines+”) is almost mandatory
• Service outages are frequent
– 65% of website managers report outages over a 6-month period
– 25% report three or more outages [Internet Week 2000]
• High cost of downtime of systems providing vital services
– Lost opportunities and revenues, non-compliance penalties, potential loss of lives
– Cost per hour of downtime varies from $89K for cellular services to $6.5M for stock brokerage [Gartner Group 1998]
• Revenue for high availability products in the data/telecom/computer server
market is over $100B (≈ $15B for servers alone)   [IMEX Research 2003]
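To make the "five nines" figure concrete, the following minimal sketch converts an availability fraction into expected downtime per year and into a yearly outage cost. The function names and the 365-day year are assumptions made for illustration; the hourly cost is taken from the range quoted above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly outage cost at a fixed hourly cost of downtime."""
    return downtime_minutes_per_year(availability) / 60.0 * cost_per_hour


if __name__ == "__main__":
    for nines in (3, 4, 5):
        a = 1.0 - 10.0 ** -nines
        print(f"{nines} nines: {downtime_minutes_per_year(a):.2f} min/year")
```

Under these assumptions, five nines leaves roughly five minutes of downtime per year, while three nines leaves almost nine hours.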
The physical perspective
• Malfunctions are more likely to happen
– Manufacturing imperfections
• Design close to scaling limits
• Smaller devices, larger variances
• Hostile environments
– Embedded systems exposed to harsh conditions
– System fragility due to low-voltage operation
• Material aging
– Oxide breakdown, electromigration failures
– More likely with scaled-down lithography
Reliability is a system issue
• Hardware (system network, processing elements, memory, storage system)
– Error correcting codes, M-out-of-N and standby redundancy, voting, watchdog timers, reliable storage (RAID, mirrored disks)
• Reliable communication
– CRC on messages, acknowledgment, watchdogs, heartbeats, consistency protocols
• Operating system
– Memory management and exception handling, detection of process failures, checkpoint and rollback
• Software-implemented fault tolerance (application program interface (API), middleware) and applications
– Checkpointing and rollback, application replication, software voting (fault masking), process pairs, robust data structures, recovery blocks, N-version programming
[Iyer]
Hardware reliability
challenges
• Extremely small device size
– Coping with deep submicron (DSM) technologies
• Spreading of parameters
• Extremely large circuit scale
– System complexity
• Interaction of computing and storage components
• New computing materials
– Higher device and defect density
• How to make the new technologies viable
Roadmap qualitative trends
• Continued gate downscaling
• Increased transistor density and frequency
– Power and thermal management
• Lower supply voltage
– Reduced noise immunity
• Increased spread of physical parameters
– Inaccurate modeling of physical behavior
Design space exploration: worst case analysis
[Figure: delay versus voltage curves for the min, typ, and max parameter corners; Pareto points lie on the worst-case (w.c.) curve]
Adaptive design space: worst case analysis
[Figure: delay versus voltage curves for the min, typ, and max parameter corners, separated by a spread Δ]
• As parameters spread, w.c. design is too pessimistic
Self-calibrating circuits
• Address variability and
robustness
• Design self-calibrating circuits
operating at the edge of failure
• Examples:
– Dynamic voltage scaling of bus
swings [Ienne –EPFL]
– Dynamic voltage scaling in
processors
• Razor [Austin – U Mich]
– Dynamic latency adjustment
• Terror [Stanford]
[Figure: test circuit with 48-bit LFSRs driving 18x18 multipliers in two slow pipelines (A and B, clocked at clk/2) and a fast pipeline (clocked at clk); outputs are compared (!=) and mismatches feed a 40-bit error counter]
[Figure labels: FIFO, encoder, channel (ch), decoder, FIFO, controller, Ack, errors, v_dd, v_ch]
How to calibrate?
• General paradigm
– A circuit may be in a correct or faulty operational state, depending on a parameter (e.g., voltage)
– Computed/transmitted data need checks
• If data is faulty, data is recomputed and/or retransmitted
– Error rate is monitored online
– Feedback loop to control operational state parameter based on error rate
• Circuits can generate errors:
– Errors must be detected and corrected
– Correction rate is used for calibration
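The calibration paradigm can be sketched as a simple feedback loop. In the sketch below, the error model (`error_probability`), the critical voltage `v_crit`, the step size, and the target rate are all hypothetical choices for illustration; the slides do not specify a controller. The loop lowers the supply voltage while the monitored correction rate stays below the target and raises it otherwise.

```python
import math
import random


def error_probability(vdd: float, v_crit: float = 0.8, slope: float = 40.0) -> float:
    """Hypothetical model: errors rise sharply as vdd approaches v_crit."""
    return 1.0 / (1.0 + math.exp(slope * (vdd - v_crit)))


def calibrate(target_rate: float = 0.01, vdd: float = 1.2, step: float = 0.01,
              window: int = 1000, rounds: int = 200, seed: int = 0) -> float:
    """Adjust vdd so that the monitored error rate tracks target_rate."""
    rng = random.Random(seed)
    for _ in range(rounds):
        p = error_probability(vdd)
        # Every detected error is assumed to be corrected by recomputation or
        # retransmission; here we only count corrections in one window.
        errors = sum(rng.random() < p for _ in range(window))
        rate = errors / window
        if rate > target_rate:
            vdd += step  # too many corrections: back away from the edge
        else:
            vdd -= step  # margin available: lower the voltage
    return vdd
```

With these assumed parameters the voltage settles a little above `v_crit`, that is, at the edge of failure rather than at a worst-case margin.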
Dealing with transient malfunctions
• Soft errors
– Data corruption due to external radiation exposure
• Crosstalk
– Data corruption due to internal field exposure
• Both malfunctions manifest themselves as timing errors
– Error containment
Soft errors
• Due to charge injection:
– Charged ion:
• Alpha particle
• Neutron scattering (from cosmic rays)
• Soft error rate increases with:
– Environment (e.g., altitude, latitude)
– Decrease of node critical charge (capacitance × voltage)
• Used to be a problem for DRAMs
– Now important also for SRAM, registers, etc.
[Figure: particle strike on a memory cell; WL = word line, BL = bit line]
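The dependence on critical charge can be illustrated numerically. The slide gives only Q_crit = capacitance × voltage; the exponential sensitivity used below is an assumption of the kind found in empirical soft error rate models, and the charge collection parameter `q_s`, the flux, and the area are hypothetical values.

```python
import math


def critical_charge(capacitance_f: float, voltage_v: float) -> float:
    """Node critical charge in coulombs: Q_crit = C * V."""
    return capacitance_f * voltage_v


def relative_ser(q_crit: float, q_s: float, flux: float = 1.0, area: float = 1.0) -> float:
    """Assumed empirical form: SER proportional to flux * area * exp(-Q_crit / Q_s)."""
    return flux * area * math.exp(-q_crit / q_s)


if __name__ == "__main__":
    q_s = 1e-15  # hypothetical charge collection parameter, in coulombs
    for vdd in (1.2, 1.0, 0.8):
        q = critical_charge(2e-15, vdd)
        print(vdd, relative_ser(q, q_s))
```

Under this assumed model, lowering either the node capacitance or the supply voltage raises the relative soft error rate.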
Propagation of soft error
Radiation-hardened registers
• Protection against soft errors
– Timing errors
• Each latch is duplicated
– Shadow latch has delayed clock
• Comparison between original
and shadow latch detects error
– Error correction is possible
[IROC Technologies]
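The duplicated-latch idea can be sketched behaviorally. The model below is an illustration, not the IROC circuit: a transient is represented as a time interval during which the data bit is inverted, and the names `sample` and `hardened_capture` are hypothetical. The main latch samples at the clock edge and the shadow latch samples `delay` later; a mismatch flags an error.

```python
from typing import Optional, Tuple


def sample(value: int, t: float, glitch: Optional[Tuple[float, float]]) -> int:
    """Bit seen at time t; a glitch (start, end) inverts the bit while active."""
    if glitch is not None and glitch[0] <= t < glitch[1]:
        return value ^ 1
    return value


def hardened_capture(value: int, t_clk: float, delay: float,
                     glitch: Optional[Tuple[float, float]] = None):
    """Return (main, shadow, error_flag) for one capture."""
    main = sample(value, t_clk, glitch)
    shadow = sample(value, t_clk + delay, glitch)  # delayed clock
    return main, shadow, main != shadow
```

In this model a transient shorter than the clock delay cannot corrupt both samples, so it is always detected; a transient longer than the delay can corrupt both and escape detection, which suggests choosing the delay from the expected pulse width.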
Aging of materials
• Failure mechanisms
– Electromigration
– Oxide Breakdown
– Thermo-mechanical stress
• Temperature dependence
– Arrhenius law
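As a worked example of the Arrhenius temperature dependence, the sketch below computes the acceleration factor between a use temperature and a higher stress temperature, AF = exp((Ea / k) · (1/T_use − 1/T_stress)). The activation energy of 0.7 eV and the two temperatures are illustrative assumptions, not values from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev: float, t_use_k: float, t_stress_k: float) -> float:
    """Factor by which a thermally activated failure mechanism speeds up."""
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


if __name__ == "__main__":
    # Assumed example: Ea = 0.7 eV, 55 degrees C in use, 125 degrees C under stress.
    print(arrhenius_acceleration(0.7, 328.15, 398.15))
```

With these assumed numbers the mechanism runs several tens of times faster at the stress temperature, which is why temperature dominates aging-related failure rates.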
Failure rate: the bathtub curve
[Figure: failure rate versus time]
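One common way to reproduce a bathtub shape, offered here as an assumption rather than as the model behind the slide, is to add three Weibull hazard rates: a decreasing one (infant mortality, shape < 1), a constant one (random failures, shape = 1), and an increasing one (wear-out, shape > 1). All shape and scale values below are hypothetical.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def bathtub_hazard(t: float) -> float:
    """Sum of three hypothetical failure populations."""
    infant = weibull_hazard(t, shape=0.5, scale=1000.0)     # decreasing
    random_ = weibull_hazard(t, shape=1.0, scale=5000.0)    # constant
    wearout = weibull_hazard(t, shape=5.0, scale=10000.0)   # increasing
    return infant + random_ + wearout


if __name__ == "__main__":
    for t in (10.0, 2000.0, 15000.0):
        print(t, bathtub_hazard(t))
```

The failure rate is high early on, flat in mid-life, and rises again as wear-out mechanisms take over.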
Component redundancy
• Component redundancy for enhanced reliability
– Energy consumption penalty may be severe
• Power-managed standby components
– Provide for temporary/permanent back-up
– Provide for load and stress sharing
• Power management and reliability are intertwined:
– PM allows reasonable use of redundancy on chip
– Failure rates depend on effect of PM on components
• A programmable and flexible interconnection
means is required
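The reliability gain from redundancy can be quantified with the standard k-out-of-n formula. The sketch below assumes independent, identical components with exponentially distributed lifetimes; these are textbook simplifications that ignore the power management effects mentioned above.

```python
import math
from math import comb


def component_reliability(failure_rate: float, t: float) -> float:
    """Probability that one component survives to time t (exponential model)."""
    return math.exp(-failure_rate * t)


def k_out_of_n(r: float, k: int, n: int) -> float:
    """Probability that at least k of n independent components work."""
    return sum(comb(n, i) * r ** i * (1.0 - r) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    r = 0.9
    print("single:", r)
    print("1-out-of-2 (duplication):", k_out_of_n(r, 1, 2))
    print("2-out-of-3 (majority voting):", k_out_of_n(r, 2, 3))
```

For a component reliability of 0.9, duplication gives 0.99 and 2-out-of-3 voting gives 0.972, at the price of powering the extra components.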
Example
[Figure: multi-core chip with memory; cores labeled standby, standby, faulty, standby]
• When a core operates, its failure rate is higher than that of a standby unit
• When a core fails, it is replaced by a standby core
• System management may alternate which cores run at high frequency and voltage (and hence at a high failure rate) to optimize long-term reliability
Issues
• Analyze system-level reliability
– as a function of a power management policy
• Determine a system management policy
– to maximize reliability (over a time interval) and
minimize energy consumption
• Determine a system management policy and
system topology
– to maximize reliability (over a time interval) and
minimize energy consumption
System-level management
• Reliability and energy management can be
modeled by stochastic processes
– Stochastic optimum control for policy design
– As more accurate models are required, policy
design is harder
• Simulation of system management policies is
useful for assessing effectiveness of
redundancy and energy cost
– Simulation results show dominant effect of
temperature and its cycling on system reliability
• Optimal policy design is also possible
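A minimal Monte Carlo sketch of such a simulation follows. It assumes two cores with exponential lifetimes, one active (rate `rate_active`) and one in standby with a lower rate (`rate_standby`), instantaneous switch-over, and no repair. Temperature effects and thermal cycling are deliberately not modeled. For this simple case the estimate can be checked against the closed form MTTF = 1/(λa + λs) + 1/λa.

```python
import random


def simulate_lifetime(rate_active: float, rate_standby: float, rng: random.Random) -> float:
    """System lifetime: the system fails when no working core remains."""
    t_active = rng.expovariate(rate_active)
    t_standby = rng.expovariate(rate_standby)
    if t_standby < t_active:
        return t_active  # the spare failed first, so nothing can take over
    # The spare takes over and now ages at the active rate (memoryless model).
    return t_active + rng.expovariate(rate_active)


def estimate_mttf(rate_active: float, rate_standby: float,
                  runs: int = 100_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    total = sum(simulate_lifetime(rate_active, rate_standby, rng) for _ in range(runs))
    return total / runs


def analytic_mttf(rate_active: float, rate_standby: float) -> float:
    return 1.0 / (rate_active + rate_standby) + 1.0 / rate_active
```

Richer policies (alternating cores, temperature-dependent rates) no longer have such a simple closed form, which is where simulation becomes useful.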
Effect of dynamic power management (DPM) policy on MTTF
• Power and temperature gap between active and sleep state
• Small gap
– Thermal cycle effects dominate electromigration (EM) and time-dependent dielectric breakdown (TDDB) only in the lower temperature spectrum
– MTTF may decrease or increase as DPM gets more aggressive
• Wider gap
– Thermal cycle effects dominate
– MTTF always decreases as DPM gets more aggressive
[Simunic – UCSD]
Extremely large scale circuits
Component-based design
• SoCs are designed (re)-using large macrocells
– Processors, controllers, memories…
– Plug and play methodology is very desirable
– Components are qualified before use
• Design challenge:
– Provide a functionally-correct, reliable operation of the
interconnected components
• Critical issue:
– Design of the communication fabric
Why on-chip networking ?
• Provide a structured methodology for realizing
on-chip communication schemes
– Modularity
– Flexibility
• Cope with inherent limitations of busses
– Performance and power of busses do not scale up
• Support reliable operation
– Layered approach to error detection and correction
Data-link protocol example: error-resilient coding
[Figure labels: ICACHE, MEM. CTRL., AMBA bus interface, HRDATA, from external memory, H encoder, H decoder; plot axis: MTTF]
• Compare original AMBA bus to extended bus with error detection and correction or retransmission
– SEC (single error correction) coding
– SEC-DED (single error correction, double error detection) coding
– ED (error detection) coding
• Explore energy efficiency [Bertozzi]
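To illustrate what SEC coding means, the sketch below implements the textbook Hamming(7,4) code, which corrects any single bit error in a 7-bit codeword. This is an assumption chosen for brevity: the slides do not say which codes or word widths were used on the bus.

```python
def hamming74_encode(data):
    """Encode 4 data bits as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, position); position is the corrected bit (1..7) or 0."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome points at the faulty bit
    if position:
        c[position - 1] ^= 1         # single error correction
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # inject a single bit error on the link
    print(hamming74_decode(word))
```

An ED-only scheme would instead stop at a non-zero syndrome and request retransmission, trading decoder complexity for extra link traffic.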
Going beyond buses
• Buses:
– Pro: simple, existing standards
– Con: performance, energy, arbitration
• Other network topologies:
– Pro: higher performance, experience with MP
– Con: physical routing, need for network and transport layers
• Challenges:
– Exploit network architecture and corresponding protocols
– Design network and transport protocols with small overhead
• Switching, routing, packetization, …
[Figure labels: processing element (PE), network interface, packets, routes]
Crossbar and partial crossbar cost
[Figure: cost of partial crossbar (PC) versus full crossbar (FC)]
Key issue: crossbar is not scalable!
Partial crossbar is a compromise solution
[Murali –Stanford]
Micro-network synthesis
• xPipes:
– A library of components for
micro-network synthesis
• Network interfaces
• Switches
• Communication links
• xPipesCompiler:
– An assembler to synthesize a micro-network from a structural description
– Output is a SystemC micro-network model
• Provide an integrated flow to support micro-network design
[Figure labels: xPipes library (NI files, switch files, link files), routing tables, core models, application, xPipesCompiler, application-specific NoC, SystemC output, simulation and synthesis]
Video object plane decoder
[Figure: decoder cores (VLD, RLD, iscan, acdc, iQuant, iDCT, upsmp, vopr, vopm, smem, arm, pad) connected by a regular network of 3x3, 4x4, and 5x5 switches and by an ad hoc network of 3x2, 3x3, and 2x3 switches]
VOPD Simulation
Arrows show the time that packet heads take from source NI to destination
NoC Emulation on FPGA
• Emulation on FPGA enables functional and
performance validation of NoC based systems
– Accurate execution model
– Probing for profiling and gathering of statistics
• The emulation can achieve significant speedups compared to cycle-accurate simulation:
– Up to four orders of magnitude faster
– Real inputs with millions of packets can be used
• Two-level network emulation
– Network backbone in configurable hardware
– Network protocol and parameters in software
[Genko – EPFL]
• Platform: Virtex-II Pro FPGA
– 2 PowerPC cores
– 3M programmable gates
Micro-networks and reliability
• Micro-networks are the platform to integrate multi-processor systems
• Micro-networks seamlessly support redundant standby components for achieving high availability
• Micro-networks provide fault-tolerant communication by supporting alternative paths
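The benefit of alternative paths can be sketched with a shortest-path search on a 2D mesh that skips links marked as faulty. The mesh topology, the breadth-first search, and the representation of a faulty link as a `frozenset` of its two endpoints are assumptions for illustration; they do not describe a specific routing protocol from the slides.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the mesh neighbors of node = (x, y)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def route(src, dst, width, height, faulty_links=frozenset()):
    """Shortest path from src to dst avoiding faulty links, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in mesh_neighbors(node, width, height):
            if nxt in parent or frozenset((node, nxt)) in faulty_links:
                continue
            parent[nxt] = node
            queue.append(nxt)
    return None
```

For example, on a 3x3 mesh, `route((0, 0), (2, 0), 3, 3)` uses the direct two-hop path, while passing `faulty_links={frozenset({(0, 0), (1, 0)})}` yields a four-hop detour; a bus offers no such alternative.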
Self-healing
• Bio-wall project [Mange – EPFL]
• Cellular design with redundancy
– Each cell programmed by a string (gene)
– FPGA technology
• Self-healing property:
– Upon cell failure, neighbors reconfigure to take over function
Cellular self-repair
[Figures: a row of cells with coordinates X=1, 2, 3, 4 followed by a spare cell, shown with and without a faulty molecule; label: RG+OG]
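The repair step suggested by the figure labels can be sketched as a coordinate reassignment: logical coordinates X=1, 2, ... are handed out to working cells from left to right, so a faulty cell is skipped and the functions to its right shift toward the spare. This shift-to-spare scheme and the function names are assumptions for illustration, not a description of the Bio-wall implementation.

```python
def assign_coordinates(num_cells: int, faulty: set) -> dict:
    """Map logical X coordinates (1-based) to working physical cell indices."""
    mapping = {}
    x = 1
    for cell in range(num_cells):
        if cell in faulty:
            continue  # a faulty cell is bypassed by its neighbors
        mapping[x] = cell
        x += 1
    return mapping


def repair(num_logical: int, num_cells: int, faulty: set) -> dict:
    """Return the logical-to-physical map, or raise if spares are exhausted."""
    mapping = assign_coordinates(num_cells, faulty)
    if len(mapping) < num_logical:
        raise RuntimeError("not enough spare cells")
    return {x: mapping[x] for x in range(1, num_logical + 1)}
```

With four logical cells and one spare, a single failure is absorbed by the shift; a second failure exhausts the spares in that row.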
Autonomic computing
• Broad R&D project launched by IBM
• Self-healing
– Design computers and software that perform self-diagnostic functions and can fix themselves without human intervention
– Strong analogies to biological systems
• Reduced cost of design and maintenance
Autonomics principles
• An autonomic system:
– must know itself
– reconfigures itself under varying conditions
– optimizes its operations at run time
– must support self-healing
– must defend itself against attacks
– must know the environment
– manages and optimizes internal resources without human
intervention
New computing materials
• When will current semiconductor technologies run out of steam?
• What factor will provide a radical change in technology?
– Performance, power density, cost?
• Several emerging technologies:
– Carbon nanotubes, nanowires, quantum devices, molecular electronics,
biological computing, …
• Are these technologies compatible with silicon?
– What is the transition path?
• What are the common characteristics, from a design technology
standpoint?
Common characteristics of
nano-devices
• Self-assembly used to create structures
– Manufacturing paradigm is bottom-up
• Significant presence of physical defects
– Massively fault-tolerant design style
• Competitive advantage stems from the high density of computing elements
– Two orders of magnitude higher than scaled CMOS
Robust nano-design
• Key ingredients:
– Massive parallelism and redundancy
– Exploit properties of crosspoint architectures
• E.g., Programmable Logic Arrays (PLAs)
– Local and global reconfiguration
– Some nanotechnologies are compatible with CMOS
• Some design technologies for robust DSM CMOS design can be applied to nanotechnology
Summary
• The electronic market is driven by embedded applications where
reliability and robustness are key figures of merit
• Hardware systems are more prone to fail
– Variations in manufacturing
– Hard and soft malfunctions
• Reliability can be enhanced by component and communication
redundancy
– System management is critical for long-lasting operation
– On-chip networks support redundancy
• Self-healing, massive parallelism and redundancy are key to designing highly dependable circuits with nano-technologies
– Sub-45 nm CMOS technologies
– Novel silicon and non-silicon based nano-technologies
